# BYOA Tutorial - Prophet Forecasting in SageMaker

This notebook shows how to integrate your own algorithm into Amazon SageMaker. We build an inference pipeline around the Prophet algorithm for time series forecasting: the algorithm is installed in a Docker container, which is then used both to train the model and to serve inferences from an endpoint.


## Step 1: Assemble the dataset

We work with a public Kaggle dataset called "Avocado Prices: Historical data on avocado prices and sales volume in multiple US markets". Download it from https://www.kaggle.com/neuromusic/avocado-prices/download and upload `avocado.csv` to the directory where this notebook runs. The following code prepares the dataset in the format Prophet expects:

In [1]:
import pandas as pd

# Keep only the date and the average price
df = pd.read_csv('avocado.csv')
df = df[['Date', 'AveragePrice']].dropna()

df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Keep a single record per day with the mean of the average price
daily_df = df.resample('D').mean()
d_df = daily_df.reset_index().dropna()

# We format the column names as Prophet expects them
d_df = d_df[['Date', 'AveragePrice']]
d_df.columns = ['ds', 'y']
d_df.head()

# We save the resulting dataset as avocado_daily.csv
d_df.to_csv("avocado_daily.csv", index=False, columns=['ds', 'y'])
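
Before handing the file to the container, it can help to confirm that it has the layout Prophet expects: a `ds` column with dates and a numeric `y` column. The following is a minimal sketch that uses only the standard library. The helper name `validate_prophet_csv` is hypothetical, and the `YYYY-MM-DD` date format is an assumption about how the dates were written to the file.

```python
import csv
from datetime import datetime


def validate_prophet_csv(path, date_format="%Y-%m-%d"):
    """Check that a CSV file has a 'ds' date column and a numeric 'y' column.

    Returns the number of data rows; raises ValueError on the first problem found.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header != ["ds", "y"]:
            raise ValueError("expected header ['ds', 'y'], got {}".format(header))
        rows = 0
        for line_number, row in enumerate(reader, start=2):
            if len(row) != 2:
                raise ValueError("line {}: expected 2 fields".format(line_number))
            datetime.strptime(row[0], date_format)  # raises ValueError on a bad date
            float(row[1])  # raises ValueError on a non-numeric value
            rows += 1
    return rows
```

Calling `validate_prophet_csv("avocado_daily.csv")` would return the number of daily records, or raise an error that points at the first malformed line.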

# Step 2: Package and Upload the Algorithm for Use with Amazon SageMaker


### An overview of Docker

Docker provides a simple way to package code into an _image_ that is completely self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is the same as running a program on the machine, except that the container creates a completely self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way you configure the program is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it understands your entire operating environment, including startup commands, environment variables, etc.

In some ways, a Docker container is like a virtual machine, but it is much lighter. For example, a program that runs in one container can start in less than a second, and many containers can run on the same physical machine or virtual machine instance.

Docker uses a simple file called `Dockerfile` to specify how the image is assembled.
Amazon SageMaker uses Docker to allow users to train and deploy algorithms.

In Amazon SageMaker, Docker containers are invoked in a certain way for training and in a slightly different way for hosting. The following sections describe how to create containers for the SageMaker environment.


### How Amazon SageMaker runs the Docker container

Because the same image can be used for training or hosting, Amazon SageMaker runs the container with the argument `train` or `serve`. How your container processes this argument depends on the container:

* In this example, we do not define an `ENTRYPOINT` in the Dockerfile, so Docker runs the command `train` at training time and `serve` at serving time. Here they are executable Python scripts, but they could be any program that we want to start in that environment.
* If you specify a program as the `ENTRYPOINT` in the Dockerfile, that program runs at startup and its first argument is either `train` or `serve`. The program can then examine that argument and decide what to do.
* If you are building separate containers for training and hosting (or building for only one of them), you can define a program as the `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument passed.
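
To illustrate the `ENTRYPOINT` variant described in the list above, here is a minimal sketch of a single program that inspects its first argument and dispatches accordingly. It is not the approach used in this tutorial (which relies on separate `train` and `serve` scripts); the function names `dispatch`, `run_training`, and `run_serving` are hypothetical placeholders.

```python
def run_training():
    # Placeholder for the real training routine
    return "training"


def run_serving():
    # Placeholder for the real web server start-up
    return "serving"


def dispatch(argv):
    """Pick the action from the first argument passed to the container."""
    actions = {"train": run_training, "serve": run_serving}
    if len(argv) < 1 or argv[0] not in actions:
        raise SystemExit("usage: entrypoint [train|serve]")
    return actions[argv[0]]()
```

In a real entrypoint script you would call `dispatch(sys.argv[1:])`, so that the argument supplied to the container selects the branch.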

#### Run container during training

When Amazon SageMaker runs training, the `train` script runs like a regular Python program. A number of files are laid out for your use under the `/opt/ml` directory:

    /opt/ml
    ├── input
    │   ├── config
    │   │   ├── hyperparameters.json
    │   │   └── resourceConfig.json
    │   └── data
    │       └── <channel_name>
    │           └── <input data>
    ├── model
    │   └── <model files>
    └── output
        └── failure


##### The input

* `/opt/ml/input/config` contains information to control how the program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values are always strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Because this example trains on a single instance, we ignore it here.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. Channels are created based on the call to `CreateTrainingJob`, but it is generally important that the channels match what the algorithm expects. The files for each channel are copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs you can run, but you must close each pipe before reading the next epoch.
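
Because every value in `hyperparameters.json` arrives as a string, a training script usually converts the values before using them. The following is a minimal sketch of such a conversion. The helper `load_hyperparameters` and the hyperparameter names in the usage note are hypothetical; the training code in this tutorial does not read any hyperparameters.

```python
import json


def load_hyperparameters(path, casts):
    """Read hyperparameters.json and convert the string values to Python types.

    `casts` maps a hyperparameter name to a (converter, default) pair.
    """
    with open(path) as handle:
        raw = json.load(handle)  # every value arrives as a string
    params = {}
    for name, (convert, default) in casts.items():
        # Use the converted value when the user supplied one, else the default
        params[name] = convert(raw[name]) if name in raw else default
    return params
```

For example, `load_hyperparameters('/opt/ml/input/config/hyperparameters.json', {'changepoint_prior_scale': (float, 0.05)})` would return a float that could be passed to the `Prophet` constructor.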
    
##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format you want. It can be a single file or an entire directory tree. SageMaker packages any files in this directory into a compressed tar file. This file is available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a `failure` file that describes why the job failed. The contents of this file are returned in the `FailureReason` field of the `DescribeTrainingJob` result. For successful jobs, there is no reason to write this file because it is ignored.
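
As an illustration of the `failure` file described above, the following sketch writes the error message and traceback of an exception into the output directory. It assumes Python 3, and the helper name `write_failure` is hypothetical.

```python
import os
import traceback


def write_failure(output_dir, error):
    """Write a description of `error` to <output_dir>/failure and return the file path."""
    trace = "".join(traceback.format_exception(type(error), error, error.__traceback__))
    failure_path = os.path.join(output_dir, "failure")
    with open(failure_path, "w") as handle:
        handle.write("Exception during training: {}\n{}".format(error, trace))
    return failure_path
```

A training script would typically call `write_failure('/opt/ml/output', e)` from its exception handler and then exit with a non-zero status.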

#### Running the container during hosting

Hosting has a very different model from training because it must respond to inference requests that arrive over HTTP. In this example, we use a Python serving stack (nginx, gunicorn, and Flask) to serve inference requests in a robust and scalable way.

Amazon SageMaker uses two URLs in the container:

* `/ping` receives `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response depends on the algorithm. If the client supplied `ContentType` and `Accept` headers, these are passed in as well.
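
To make the two routes concrete, here is a minimal sketch of a WSGI application that answers them. It deliberately uses a bare WSGI callable from the standard library conventions rather than the Flask application used in this tutorial, and it echoes the request body instead of running a model.

```python
def app(environ, start_response):
    """Minimal WSGI application exposing the /ping and /invocations routes."""
    method = environ.get("REQUEST_METHOD", "GET")
    path = environ.get("PATH_INFO", "/")

    if path == "/ping" and method == "GET":
        # Health check: an empty 200 response signals that the container is up
        status, body = "200 OK", b""
    elif path == "/invocations" and method == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = environ["wsgi.input"].read(length)
        # A real container would run the model here; this sketch echoes the payload
        status, body = "200 OK", payload
    else:
        status, body = "404 Not Found", b"not found"

    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]
```

Any WSGI server can host this callable, for example `wsgiref.simple_server.make_server` for a quick local check.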

The container will have the model files in the same place where they were written during training:

    /opt/ml
    └── model
        └── <model files>
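
Since the model files sit under `/opt/ml/model` at hosting time, the serving code can load them once and reuse the result for every request. The sketch below shows one way to do this; the class name `ModelCache` is hypothetical, and the default file name `prophet-model.pkl` matches the name used by the training code in this tutorial.

```python
import os
import pickle


class ModelCache:
    """Load a pickled model from the model directory once and reuse it."""

    model = None

    @classmethod
    def get_model(cls, model_dir="/opt/ml/model", filename="prophet-model.pkl"):
        if cls.model is None:
            # Pickle files must be opened in binary mode
            with open(os.path.join(model_dir, filename), "rb") as handle:
                cls.model = pickle.load(handle)
        return cls.model
```

Loading lazily keeps container start-up fast and avoids reading the file again on each inference request.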


### Container Parts

The `container` directory contains all the components you need to package the sample algorithm for Amazon SageMaker:

    .
    ├── Dockerfile
    ├── build_and_push.sh
    └── prophet
        ├── nginx.conf
        ├── predictor.py
        ├── serve
        ├── train
        └── wsgi.py


Let's look at each one:

* __`Dockerfile`__ describes how to build the Docker container image. More details below.
* __`build_and_push.sh`__ is a script that uses the Dockerfile to build the container image and then pushes it to ECR. We will invoke the commands directly later in this notebook, but you can copy and run the script for other algorithms.
* __`prophet`__ is the directory that contains the files to be installed in the container.
* __`local_test`__ is a directory that shows how to test the new container on any machine that can run Docker, including an Amazon SageMaker Notebook Instance. With this method, you can quickly iterate using small data sets to eliminate any structural errors before using the container with Amazon SageMaker.

The files that we are going to put in the container are:

* __`nginx.conf`__ is the configuration file for the nginx front-end. Generally, you should be able to take this file as is.
* __`predictor.py`__ is the program that actually implements the Flask web server and Prophet predictions for this application.
* __`serve`__ is the program started when the hosting container starts. It just launches the gunicorn server that runs multiple instances of the Flask application defined in `predictor.py`. You should be able to take this file as is.
* __`train`__ is the program that is invoked when the container for training is executed.
* __`wsgi.py`__ is a small wrapper used to invoke the Flask application. You should be able to take this file as is.

In summary, the two Prophet-specific code files are `train` and `predictor.py`.

### The Dockerfile

The Dockerfile describes the image we want to build. You can think of it as a description of the complete operating system installation of the system you want to run. A running Docker container is, however, significantly lighter than a full operating system because it relies on Linux on the host machine for basic operations.

For this example, we start from a standard Ubuntu installation and use the normal tools to install what Prophet needs. Finally, we add the code that implements Prophet to the container and set up the environment it needs to run.

The following is the Dockerfile:

In [2]:
!cat container/Dockerfile

# Build an image that can do training and inference in SageMaker
# This is a Python 3 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM ubuntu:16.04

MAINTAINER Amazon AI <sage-learner@amazon.com>

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         curl \
         python-dev \
         build-essential libssl-dev libffi-dev \
         libxml2-dev libxslt1-dev zlib1g-dev \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py && \
    python get-pip.py && \
    rm get-pip.py
 
RUN pip --no-cache-dir install \
        numpy \
        scipy \
        sklearn \
        pandas \
        flask \
        gevent \
        gunicorn \
        pystan 

RUN pip --no-cache-dir install \
        fbprophet 
        
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

# Set up the program in th

### The train file

The `train` file describes how training is performed.
The file `Prophet-Docker/container/prophet/train` contains the Prophet-specific training code.
We must modify the `train()` function as follows:

    def train():
        print('Starting the training.')
        try:
            # Read in any hyperparameters that the user passed with the training job
            with open(param_path, 'r') as tc:
                trainingParams = json.load(tc)
            # Take the set of files and read them all into a single pandas dataframe
            input_files = [ os.path.join(training_path, file) for file in os.listdir(training_path) ]
            if len(input_files) == 0:
                raise ValueError(('There are no files in {}.\n' +
                                  'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                                  'the data specification in S3 was incorrectly specified or the role specified\n' +
                                  'does not have permission to access the data.').format(training_path, channel_name))
            raw_data = [ pd.read_csv(file, error_bad_lines=False ) for file in input_files ]
            train_data = pd.concat(raw_data)
            train_data.columns = ['ds', 'y']

            # Use Prophet to train the model.
            clf = Prophet()
            clf = clf.fit(train_data)

            # Save the model (pickle requires the file to be opened in binary mode)
            with open(os.path.join(model_path, 'prophet-model.pkl'), 'wb') as out:
                pickle.dump(clf, out)
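            print('Training complete.')
        except Exception as e:
            # Minimal handler (an assumption; the full script may do more): log the error
            # and re-raise it so that the process exits with a non-zero code.
            print('Exception during training: ' + str(e))
            raise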


### The predictor.py file

The `predictor.py` file describes how predictions are made.
The file `Prophet-Docker/container/prophet/predictor.py` contains the Prophet-specific prediction code.
We must modify the `predict()` function as follows:

    def predict(cls, input):
        """Forecast the number of future periods given in the input.

        Args:
            input (a pandas dataframe): A single-cell dataframe whose first value is the
                number of periods (days) to forecast."""
        clf = cls.get_model()
        # The request body carries a single integer: the forecast horizon
        periods = int(input.iloc[0, 0])
        future = clf.make_future_dataframe(periods=periods)
        forecast = clf.predict(future)

        # Return only the rows for the newly forecast periods
        return forecast.tail(periods)



And then the `transformation()` function as follows:

    def transformation():
        """Do an inference on a single request. In this sample server, we take the data as CSV, convert
        it to a pandas data frame for internal use, and then return the forecast as CSV (the whole
        forecast data frame is rendered as text in a single CSV cell).
        """
        data = None

        # Convert from CSV to pandas
        if flask.request.content_type == 'text/csv':
            data = flask.request.data.decode('utf-8')
            s = StringIO.StringIO(data)
            data = pd.read_csv(s, header=None)
        else:
            return flask.Response(response='This predictor only supports CSV data', status=415, mimetype='text/plain')

        print('Invoked with {} records'.format(data.shape[0]))

        # Do the prediction
        predictions = ScoringService.predict(data)

        # Convert the forecast back to CSV
        out = StringIO.StringIO()
        pd.DataFrame({'results':[predictions]}, index=[0]).to_csv(out, header=False, index=False)
        result = out.getvalue()

        return flask.Response(response=result, status=200, mimetype='text/csv')
 

In short, so that the whole forecast data frame is returned as a single CSV value, we replace the line:

        pd.DataFrame({'results':predictions}).to_csv(out, header=False, index=False)

with the line:

        pd.DataFrame({'results':[predictions]}, index=[0]).to_csv(out, header=False, index=False)
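
The `predict()` function converts the first value of the request directly into the number of periods to forecast, so a malformed request fails deep inside the prediction code. The following sketch shows one way to validate the request body first. The helper name `parse_horizon` and the upper limit of 365 periods are hypothetical choices, not part of the original code.

```python
def parse_horizon(body, maximum=365):
    """Convert a request body such as "30" into a validated number of periods."""
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    try:
        periods = int(text.strip())
    except ValueError:
        raise ValueError("the request body must be a single integer, got {!r}".format(text))
    if not 1 <= periods <= maximum:
        raise ValueError("periods must be between 1 and {}".format(maximum))
    return periods
```

The server could call such a helper before invoking the model and answer invalid requests with a 4xx status instead of an internal error.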



# Step 3: Using Prophet in Amazon SageMaker
Now that all the files are in place, we can use Prophet in SageMaker.

## Container assembly
We start by building the container image and registering it in Amazon ECR.

In [3]:
%%time
%%sh

# The name of our algorithm
algorithm_name=sagemaker-prophet

cd container

chmod +x prophet/train
chmod +x prophet/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

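# Get the login command from ECR and execute it directly.
# Note: `aws ecr get-login` only exists in AWS CLI v1. With AWS CLI v2, use instead:
#   aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com
$(aws ecr get-login --region ${region} --no-include-email)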

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Login Succeeded
Sending build context to Docker daemon  63.49kB
Step 1/11 : FROM ubuntu:16.04
 ---> c6a43cd4801e
Step 2/11 : MAINTAINER Amazon AI <sage-learner@amazon.com>
 ---> Using cache
 ---> c0ea7ed783e7
Step 3/11 : RUN apt-get -y update && apt-get install -y --no-install-recommends          wget          curl          python-dev          build-essential libssl-dev libffi-dev          libxml2-dev libxslt1-dev zlib1g-dev          nginx          ca-certificates     && rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 17bd5ae1900b
Step 4/11 : RUN curl -fSsL -O https://bootstrap.pypa.io/get-pip.py &&     python get-pip.py &&     rm get-pip.py
 ---> Using cache
 ---> e1f1939e31e1
Step 5/11 : RUN pip --no-cache-dir install         numpy         scipy         sklearn         pandas         flask         gevent         gunicorn         pystan
 ---> Using cache
 ---> 8ff73a969fc2
Step 6/11 : RUN pip --no-cache-dir install         fbprophet
 ---> Using cache
 ---> 815dc3862860
Step 7/11 :

https://docs.docker.com/engine/reference/commandline/login/#credentials-store



CPU times: user 9.27 ms, sys: 594 µs, total: 9.87 ms
Wall time: 2.64 s


## Building the Training Environment
We initialize the SageMaker session and the execution role.

In [4]:
%%time
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

import sagemaker as sage
from time import gmtime, strftime


prefix = 'DEMO-prophet-byo'
role = get_execution_role()
sess = sage.Session()


CPU times: user 408 ms, sys: 40.3 ms, total: 448 ms
Wall time: 503 ms



## Uploading the data to S3

In [5]:
WORK_DIRECTORY = 'data'
data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)
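
Note that the upload call reads from the local `data` directory, whereas Step 1 writes `avocado_daily.csv` next to the notebook. The following sketch copies the prepared file into the upload directory first; the helper name `stage_dataset` is hypothetical.

```python
import os
import shutil


def stage_dataset(source_file, work_directory="data"):
    """Copy the prepared CSV into the directory that will be uploaded to S3."""
    os.makedirs(work_directory, exist_ok=True)
    # shutil.copy returns the path of the newly created file
    return shutil.copy(source_file, work_directory)
```

Running `stage_dataset("avocado_daily.csv")` before the upload would place the file at `data/avocado_daily.csv`.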

## We train the model
Using the data uploaded to S3, we train the model by launching an ml.c4.2xlarge instance.
SageMaker writes the trained model under the `output` prefix of the session's default S3 bucket.

In [6]:
%%time

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-prophet:latest'.format(account, region)

tseries = sage.estimator.Estimator(image,
                                   role,
                                   1,                # number of training instances
                                   'ml.c4.2xlarge',  # training instance type
                                   output_path="s3://{}/output".format(sess.default_bucket()),
                                   sagemaker_session=sess)

tseries.fit(data_location)

2019-12-27 16:00:08 Starting - Starting the training job...
2019-12-27 16:00:09 Starting - Launching requested ML instances......
2019-12-27 16:01:13 Starting - Preparing the instances for training...
2019-12-27 16:01:58 Downloading - Downloading input data
2019-12-27 16:01:58 Training - Downloading the training image...
2019-12-27 16:02:34 Training - Training image download completed. Training in progress..
INFO:matplotlib.font_manager:font search path ['/usr/local/lib/python2.7/dist-packages/matplotlib/mpl-data/fonts/ttf', '/usr/local/lib/python2.7/dist-packages/matplotlib/mpl-data/fonts/afm', '/usr/local/lib/python2.7/dist-packages/matplotlib/mpl-data/fonts/pdfcorefonts']
INFO:matplotlib.font_manager:generated new fontManager
ERROR:fbprophet:Importing matplotlib failed. Plotting will not work.
ERROR:fbprophet:Importing plotly failed. Interactive plots will not work.
Starting the training.
INFO:fbprophet:Disabling weekly seasonality. R


## Endpoint assembly for inference
Using the newly trained model, we create an endpoint for inference hosted on an ml.m4.xlarge instance.

In [7]:
%%time

from sagemaker.predictor import csv_serializer
predictor = tseries.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)

---------------------------------------------------------------------------------------------------------------!
CPU times: user 518 ms, sys: 39.5 ms, total: 557 ms
Wall time: 9min 20s


## Inference test
Finally, we ask the model to predict the average price for the next 30 days.

In [8]:
%%time
p = predictor.predict("30")
print(p)

b'"            ds     trend  trend_lower  ...  yearly_lower  yearly_upper      yhat\n169 2018-03-26  1.473312     1.473312  ...     -0.076117     -0.076117  1.397195\n170 2018-03-27  1.472971     1.472971  ...     -0.072531     -0.072531  1.400440\n171 2018-03-28  1.472631     1.472631  ...     -0.068829     -0.068829  1.403802\n172 2018-03-29  1.472291     1.472291  ...     -0.065070     -0.065070  1.407221\n173 2018-03-30  1.471950     1.471950  ...     -0.061313     -0.061313  1.410637\n174 2018-03-31  1.471610     1.471610  ...     -0.057619     -0.057619  1.413991\n175 2018-04-01  1.471270     1.471270  ...     -0.054048     -0.054048  1.417222\n176 2018-04-02  1.470929     1.470929  ...     -0.050657     -0.050657  1.420273\n177 2018-04-03  1.470589     1.470589  ...     -0.047500     -0.047500  1.423089\n178 2018-04-04  1.470248     1.470241  ...     -0.044627     -0.044627  1.425622\n179 2018-04-05  1.469908     1.469861  ...     -0.042080     -0.042080  1.427828\n180 2018-04-0