## Introduction

This sample Notebook trains a [statsmodels](https://www.statsmodels.org/stable/index.html) Machine Learning (ML) model for use with the `convert-statsmodel-tutorial.ipynb` tutorial.

This example provides the following:

* `train-statsmodel.ipynb`: A sample Jupyter Notebook that trains a sample model. The model forecasts how many bikes will be rented on each of the next 7 days, based on historical bike rentals and regressors such as temperature and wind speed. Additional files to support this example are:
  * `day.csv`: Data used to train the sample `statsmodel` example.
  * `infer.py`: The inference script that is packaged with the `statsmodel` model.
* `convert-statsmodel-tutorial.ipynb`: A sample Jupyter Notebook that demonstrates how to upload, convert, and deploy the `statsmodel` example into a Wallaroo instance. Additional files to support this example are:
  * `bike_day_model.pkl`: A `statsmodel` ML model trained from the `train-statsmodel.ipynb` Notebook.
  * `bike_day_eval.json`: Evaluation data used to test the model's performance.



## Steps

### Import Libraries

Start by importing the libraries we will need to train the model.

In [1]:
import pandas as pd
import datetime

### Load the Data

Load the data from `day.csv` and prepare it for training.

In [2]:
bike_day_frame = pd.read_csv('day.csv', low_memory=False)

bike_day_frame['date'] = pd.to_datetime(bike_day_frame['dteday']).dt.date
assert bike_day_frame['date'][0] == datetime.datetime.fromisoformat('2011-01-01').date()

# keep only the date, the rental count, and the regressor columns
extra_regressors = ["temp", "holiday", "workingday", "windspeed"]
bike_day_frame = bike_day_frame.loc[:, ['date', 'cnt'] + extra_regressors]

# first day of the forecast and the number of days to forecast
startday = datetime.datetime.fromisoformat('2011-03-15').date()
nforecast = 7
# days between each row and the first forecast day (negative means before it)
delta_days = (bike_day_frame['date'] - startday).dt.days

### Split and Train

With the data loaded, we can now train the model.

In [3]:
# training period: every day before the first forecast day (delta_days < 0)
training_frame = bike_day_frame.loc[delta_days < 0, :].reset_index(drop=True)
# evaluation period: the nforecast days starting at the first forecast day
evaluation_frame = bike_day_frame.loc[(delta_days >= 0) & (delta_days < nforecast), :].reset_index(drop=True)
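
The split is easier to follow on a small synthetic frame. The sketch below assumes a made-up ten-day frame and a three-day forecast window (neither comes from `day.csv`), and the `toy_` names are hypothetical. It also keeps `date` as a pandas datetime column rather than Python `date` objects, so the start day is a `pd.Timestamp`.

```python
import pandas as pd

# Synthetic ten-day frame: 2011-03-10 through 2011-03-19 (illustrative only).
toy_frame = pd.DataFrame({
    "date": pd.date_range("2011-03-10", periods=10, freq="D"),
    "cnt": range(100, 110),
})

toy_startday = pd.Timestamp("2011-03-15")  # first day to forecast
toy_nforecast = 3                          # assumed window length for this sketch

# Signed number of days between each row and the first forecast day.
toy_delta = (toy_frame["date"] - toy_startday).dt.days

# Rows strictly before the start day are used for training.
toy_training = toy_frame.loc[toy_delta < 0].reset_index(drop=True)

# Rows in [start day, start day + window) are held out for evaluation.
toy_evaluation = toy_frame.loc[
    (toy_delta >= 0) & (toy_delta < toy_nforecast)
].reset_index(drop=True)

print(len(toy_training), len(toy_evaluation))
```

Rows after the evaluation window (here, the last two days) fall into neither frame.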

In [5]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(training_frame.cnt, training_frame.loc[:, extra_regressors], order=(1,0,1)).fit(disp=False)

### Create Pickle Package

The pickled Python runtime expects a dictionary with two keys, `model` and `script`:

* `model`: the pickled model, which is automatically loaded into the Python runtime under the name `model`.
* `script`: the text of the Python script to run, in a format similar to the existing Python script steps (that is, it defines a `wallaroo_json` function that operates on the data).

In this case, we use `infer.py` as the source for our Python script. It has the following contents:
```python
import json
import pandas as pd


def wallaroo_json(data):
    obj = json.loads(data)
    evaluation_frame = pd.DataFrame.from_dict(obj)
    extra_regressors = ["temp", "holiday", "workingday", "windspeed"]
    forecast = model.forecast(steps=7, exog=evaluation_frame.loc[:, extra_regressors])

    return {"forecast": forecast.tolist()}
```

In [6]:
# create the pickled dictionary

import pickle

# add the model
package = {
    'model': model
}

# add the text of the inference script
with open('infer.py', 'r') as f:
    package['script'] = f.read()

# save the pickled package, closing the file when done
with open("bike_day_model.pkl", "wb") as f:
    pickle.dump(package, f)
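
The notebook does not show how the runtime consumes this package, so the following is only a hypothetical sketch of one way a loader could make `model` visible to the script: unpickle the dictionary, execute the script text in a namespace that already contains `model`, then call `wallaroo_json`. The stand-in model (a plain dict of coefficients), the stand-in script, and the `stub_` names are assumptions, not Wallaroo's implementation.

```python
import json
import pickle

# Stand-in "model": a plain dict of coefficients (hypothetical, not SARIMAX).
stub_model = {"baseline": 100.0, "temp_coef": 10.0}

# Stand-in script text. Like infer.py, it refers to a global named `model`
# that it never defines itself.
stub_script = '''
import json

def wallaroo_json(data):
    obj = json.loads(data)
    temps = obj["temp"].values()
    return {"forecast": [model["baseline"] + model["temp_coef"] * t for t in temps]}
'''

# Build and serialize the two-key package, as in the cell above.
package_bytes = pickle.dumps({"model": stub_model, "script": stub_script})

# Hypothetical loader: unpickle, then run the script with `model` predefined.
loaded = pickle.loads(package_bytes)
namespace = {"model": loaded["model"]}
exec(loaded["script"], namespace)

# Call the entry point with a JSON payload.
payload = json.dumps({"temp": {"0": 0.5, "1": 0.25}})
result = namespace["wallaroo_json"](payload)
print(result)
```

Unpickling executes arbitrary code, so a package like this should only be loaded from a trusted source.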

### Prepare Evaluation Data

To make inference easier, we save the evaluation data to a separate JSON file.

In [7]:
# save off the evaluation frame json, too
with open("bike_day_eval.json", "w") as f:
    f.write(evaluation_frame.loc[:, extra_regressors].to_json())
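
With its default settings, `DataFrame.to_json()` writes one JSON object per column, keyed by the row label as a string. A short sketch on a made-up two-row frame (the values and `toy_` names are assumptions) shows the resulting shape and how `pd.DataFrame.from_dict`, as used in `infer.py`, rebuilds a frame from it:

```python
import json
import pandas as pd

# Made-up two-row frame standing in for the evaluation regressors.
toy_eval = pd.DataFrame({"temp": [0.5, 0.25], "windspeed": [0.1, 0.2]})

# Default orientation: {column: {row label: value}}.
toy_json = toy_eval.to_json()
toy_obj = json.loads(toy_json)

# Rebuild a frame the same way infer.py does.
toy_rebuilt = pd.DataFrame.from_dict(toy_obj)

print(toy_json)
print(toy_rebuilt)
```

The rebuilt frame's row labels are strings (`"0"`, `"1"`) rather than integers, because JSON object keys are always strings.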