# **Stock Price Predictions with Machine Learning**

This notebook contains the modeling phases needed to predict stock prices with a deep learning model.
The following stocks are analyzed:
* IBM
* AAPL (Apple Inc.)
* AMZN (Amazon.com, Inc.)
* GOOGL (Alphabet Inc.)


# Data preparation

Before the DeepAR model can process the data, it must be prepared in three steps:
* Train/test set split
* Save Data locally
* Upload to S3

## Save Data Locally

In [168]:
data_dir = 'stock_deepar'

In [169]:
import os

In [170]:
data_dir_csv = os.path.join(data_dir, 'csv') # The folder we will use for storing data
if not os.path.exists(data_dir_csv): # Make sure that the folder exists
    os.makedirs(data_dir_csv)
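
The "check, then create" pattern above is repeated for every folder in this notebook. As a minimal sketch, the same effect can be obtained with the `exist_ok` flag of `os.makedirs`. The helper name `ensure_dir` and the temporary base folder are assumptions made for this illustration only; they are not part of the project code.

```python
import os
import tempfile


def ensure_dir(*parts):
    """Join the path parts, create the folder if needed, and return the path."""
    path = os.path.join(*parts)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


# A temporary base folder stands in for the notebook's working directory
base = tempfile.mkdtemp()
csv_dir = ensure_dir(base, "stock_deepar", "csv")
csv_dir_again = ensure_dir(base, "stock_deepar", "csv")  # safe to call twice
print(csv_dir)
```

Calling the helper twice is harmless, which makes it convenient in notebooks where cells are often re-run.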

In [171]:
interval = 'D'

In [172]:
# IBM
df_ibm_train.to_csv(os.path.join(data_dir_csv, 'ibm_train.csv'), header=True, index=True)
df_ibm_test.to_csv(os.path.join(data_dir_csv, 'ibm_test.csv'), header=True, index=True)
df_ibm_valid.to_csv(os.path.join(data_dir_csv, 'ibm_valid.csv'), header=True, index=True)

In [173]:
# Apple Inc.
df_aapl_train.to_csv(os.path.join(data_dir_csv, 'aapl_train.csv'), header=True, index=True)
df_aapl_test.to_csv(os.path.join(data_dir_csv, 'aapl_test.csv'), header=True, index=True)
df_aapl_valid.to_csv(os.path.join(data_dir_csv, 'aapl_valid.csv'), header=True, index=True)

In [174]:
# Amazon.com
df_amzn_train.to_csv(os.path.join(data_dir_csv, 'amzn_train.csv'), header=True, index=True)
df_amzn_test.to_csv(os.path.join(data_dir_csv, 'amzn_test.csv'), header=True, index=True)
df_amzn_valid.to_csv(os.path.join(data_dir_csv, 'amzn_valid.csv'), header=True, index=True)

In [175]:
# Alphabet Inc.
df_googl_train.to_csv(os.path.join(data_dir_csv, 'googl_train.csv'), header=True, index=True)
df_googl_test.to_csv(os.path.join(data_dir_csv, 'googl_test.csv'), header=True, index=True)
df_googl_valid.to_csv(os.path.join(data_dir_csv, 'googl_valid.csv'), header=True, index=True)
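
The four cells above follow the same naming scheme, `<ticker>_<split>.csv`. As a sketch only, the same files could be written in one loop over a dictionary of DataFrames. The toy DataFrame, the `frames` mapping and the temporary output folder below are assumptions for illustration; in the notebook the values would be `df_ibm_train`, `df_ibm_test`, and so on.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the real stock DataFrames
idx = pd.date_range("2020-01-01", periods=3, freq="D")
toy = pd.DataFrame({"Adj Close": [1.0, 2.0, 3.0]}, index=idx)

# Hypothetical mapping (ticker, split) -> DataFrame
frames = {("ibm", "train"): toy, ("ibm", "test"): toy}

out_dir = tempfile.mkdtemp()  # stands in for data_dir_csv
written = []
for (ticker, split), df in frames.items():
    path = os.path.join(out_dir, f"{ticker}_{split}.csv")
    df.to_csv(path, header=True, index=True)  # same options as the cells above
    written.append(os.path.basename(path))

print(sorted(written))
```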

### JSON serialization

To feed the DeepAR model, the data must be serialized to JSON files.
I'll prepare two kinds of JSON input:
* one with "dynamic features", to use the DeepAR API terminology: all dataset features except the target column and its related column ('Adj Close', 'Close');
* one without "dynamic features": only the 'Adj Close' column will be fed to the DeepAR model.
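
To make the two kinds of input concrete, here is a small sketch of the JSON objects DeepAR reads: each time series is one object with a `start` timestamp and a `target` list, plus an optional `dynamic_feat` list holding one inner list per feature. The helper `series_to_deepar_dict` and the toy data are assumptions for illustration; this is not the implementation of `ts2json_serialize`, which is not shown in this notebook.

```python
import json

import pandas as pd


def series_to_deepar_dict(df, target_col="Adj Close", dyn_feat=None):
    """Build one DeepAR JSON object from a date-indexed DataFrame (hypothetical helper)."""
    obj = {
        "start": str(df.index[0]),
        "target": df[target_col].tolist(),
    }
    if dyn_feat:
        # one inner list per feature, each as long as the target
        obj["dynamic_feat"] = [df[col].tolist() for col in dyn_feat]
    return obj


idx = pd.date_range("2020-01-01", periods=3, freq="D")
toy = pd.DataFrame({"Open": [1.0, 2.0, 3.0], "Adj Close": [1.5, 2.5, 3.5]}, index=idx)

plain = series_to_deepar_dict(toy)                       # target only
with_feat = series_to_deepar_dict(toy, dyn_feat=["Open"])  # target plus one dynamic feature
print(json.dumps(with_feat))
```

In a training file, each such object is written on its own line (JSON Lines).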

### JSON files

Now I'm going to convert the data to JSON format so that it can be fed to the DeepAR model.

As mentioned above, I will create two kinds of time series: one with a list of dynamic features (`dyn_feat`), and one with only the target column (`Adj Close`).

In [176]:
# initializing train/test dataframe lists to iterate on them
dfs_train = [df_ibm_train, df_aapl_train, df_amzn_train, df_googl_train]
dfs_test = [df_ibm_test, df_aapl_test, df_amzn_test, df_googl_test]

Creating local storage path:

In [177]:
data_dir_json = os.path.join(data_dir, 'json')
if not os.path.exists(data_dir_json): # Make sure that the folder exists
    os.makedirs(data_dir_json)

Serializing data to json files

In [178]:
from source_deepar.deepar_utils import ts2json_serialize

Dataset with the `Adj Close` time series alone:

Training data:

In [179]:
data_dir_json_train = os.path.join(data_dir_json, 'train') # The folder we will use for storing data
if not os.path.exists(data_dir_json_train): # Make sure that the folder exists
    os.makedirs(data_dir_json_train)

In [180]:
for df, m in zip(dfs_train, mnemonics):
    ts2json_serialize(df, data_dir_json_train, m+'.json')

Test data:

In [181]:
data_dir_json_test = os.path.join(data_dir_json, 'test') # The folder we will use for storing data
if not os.path.exists(data_dir_json_test): # Make sure that the folder exists
    os.makedirs(data_dir_json_test)

In [182]:
for df, m in zip(dfs_test, mnemonics):
    ts2json_serialize(df, data_dir_json_test, m+'.json')

Dataset containing dynamic features:

Training data:

In [183]:
data_dir_json_dyn_feat = os.path.join(data_dir_json, 'w_dyn_feat')

In [184]:
data_dir_json_train_dyn_feat = os.path.join(data_dir_json_dyn_feat, 'train') # The folder we will use for storing data
if not os.path.exists(data_dir_json_train_dyn_feat): # Make sure that the folder exists
    os.makedirs(data_dir_json_train_dyn_feat)

In [185]:
for df, m in zip(dfs_train, mnemonics):
    ts2json_serialize(df, data_dir_json_train_dyn_feat, m+'.json', dyn_feat=['Open'])

Test data:

In [186]:
data_dir_json_test_dyn_feat = os.path.join(data_dir_json_dyn_feat, 'test') # The folder we will use for storing data
if not os.path.exists(data_dir_json_test_dyn_feat): # Make sure that the folder exists
    os.makedirs(data_dir_json_test_dyn_feat)

In [187]:
for df, m in zip(dfs_test, mnemonics):
    ts2json_serialize(df, data_dir_json_test_dyn_feat, m+'.json', dyn_feat=['Open'])

Validation data:

In [186]:
data_dir_json_valid_dyn_feat = os.path.join(data_dir_json_dyn_feat, 'valid') # The folder we will use for storing data
if not os.path.exists(data_dir_json_valid_dyn_feat): # Make sure that the folder exists
    os.makedirs(data_dir_json_valid_dyn_feat)

In [187]:
# validation dataframes, assumed to be in the same order as `mnemonics`
dfs_valid = [df_ibm_valid, df_aapl_valid, df_amzn_valid, df_googl_valid]
for df, m in zip(dfs_valid, mnemonics):
    ts2json_serialize(df, data_dir_json_valid_dyn_feat, m+'.json', dyn_feat=['Open'])

## Uploading data to S3

In [188]:
import sagemaker
# Define IAM role and session
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session()

#Define training data location
s3_data_key = 'stock_deepar/train_artifacts'
s3_bucket = sagemaker_session.default_bucket()
interval = 'D' #Use D or H
s3_output_path = "s3://{}/{}/{}/output".format(s3_bucket, s3_data_key, interval)


In [189]:
# *unique* train/test prefixes
train_prefix   = '{}/{}'.format(data_dir_json, 'train')
test_prefix    = '{}/{}'.format(data_dir_json, 'test')
train_prefix_dyn_feat   = '{}/{}'.format(data_dir_json_dyn_feat, 'train')
test_prefix_dyn_feat    = '{}/{}'.format(data_dir_json_dyn_feat, 'test')

In [190]:
input_data_train = sagemaker_session.upload_data(path=data_dir_json_train, bucket=s3_bucket, key_prefix=train_prefix)

In [191]:
input_data_test = sagemaker_session.upload_data(path=data_dir_json_test, bucket=s3_bucket, key_prefix=test_prefix)

In [192]:
input_data_train_dyn_feat = sagemaker_session.upload_data(path=data_dir_json_train_dyn_feat, bucket=s3_bucket, key_prefix=train_prefix_dyn_feat)

In [193]:
input_data_test_dyn_feat = sagemaker_session.upload_data(path=data_dir_json_test_dyn_feat, bucket=s3_bucket, key_prefix=test_prefix_dyn_feat)

### Hyperparameters

DeepAR is the model of choice for this project.
It expects the input data to be already split into training and test sets.
A large part of the model design comes from looking closely at the data.
More specifically, two hyperparameters depend on the data:
* Context length
* Prediction length

### Prediction length

This is the number of future time steps, in days, that the model forecasts. It is set to 5 days (one full trading week), because a shorter horizon would be of little practical significance.
A longer horizon could be interesting from an application point of view, but it can be challenging in terms of model performance.
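
The prediction length also shapes how training and test series relate. The DeepAR documentation recommends training on each series with its last `prediction_length` points removed and testing on the full series, so that the held-out points can be compared with the forecast. The sketch below illustrates that convention on toy data; the helper `holdout_last` is hypothetical, and this is not necessarily how the train/test DataFrames in this notebook were produced.

```python
import pandas as pd


def holdout_last(series, prediction_length):
    """Return (training part, full series); training drops the last prediction_length points."""
    if not 0 < prediction_length < len(series):
        raise ValueError("prediction_length must be between 1 and len(series) - 1")
    return series.iloc[:-prediction_length], series


# 15 business days of toy prices (three trading weeks)
idx = pd.bdate_range("2020-01-06", periods=15)
toy = pd.Series(range(15), index=idx, dtype=float)

train_part, full = holdout_last(toy, prediction_length=5)
print(len(train_part), len(full))
```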

### Context length

Context length can be either:
* derived from patterns or seasonality observed in the data, if any are present;
* chosen as a fixed value. This is my choice: it will equal the moving average window, so that the same reference metric applies to both this model and the benchmark model.

For this second option, we will refer to what we found during the EDA stage.
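
As a rough illustration of why matching the two windows is convenient, the sketch below computes a moving average forecast whose window equals the context length. The value `context_length = 5` and the toy prices are assumptions for this example; the actual window comes from the EDA stage, and the benchmark model itself is defined elsewhere in the project.

```python
import pandas as pd

context_length = 5  # hypothetical value, for illustration only

prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0])

# Mean of the previous `context_length` prices; shift(1) ensures each
# forecast only uses data that was available before that day.
ma_forecast = prices.rolling(window=context_length).mean().shift(1)
print(ma_forecast)
```

Both the benchmark and DeepAR then look at the same number of past points before predicting.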

In [252]:
'''
covariate_columns = list(df_ibm.columns)
covariate_columns.remove('Close')
covariate_columns.remove('Adj Close')
'''

"\ncovariate_columns = list(df_ibm.columns)\ncovariate_columns.remove('Close')\ncovariate_columns.remove('Adj Close')\n"

In [259]:
from source_deepar import deepar_utils

# setting target columns
target_column = 'Adj Close'

# retrieving covariate columns (currently disabled)
# covariate_columns = list(df_ibm.columns)
# covariate_columns.remove('Close')
# covariate_columns.remove('Adj Close')
train_test_split = 0.9
num_test_windows = 4

hyperparameters = {
    "prediction_length": str(prediction_length[0]), #number of time-steps model is trained to predict, always generates forecasts with this length
    "context_length": str(context_length[0]), #number of time-points that the model gets to see before making the prediction, should be about same as the prediction_length
    "time_freq": interval, #granularity of the time series in the dataset
    "epochs": "200", #maximum number of passes over the training data
    "early_stopping_patience": "40", #training stops when no progress is made within the specified number of epochs
    "num_layers": "2", #number of hidden layers in the RNN, typically range from 1 to 4    
    "num_cells": "40", #number of cells to use in each hidden layer of the RNN, typically range from 30 to 100
    "mini_batch_size": "128", #size of mini-batches used during training, typically values range from 32 to 512
    "learning_rate": "1e-3", #learning rate used in training. Typical values range from 1e-4 to 1e-1
    "dropout_rate": "0.1", # dropout rate to use for regularization, typically less than 0.2. 
    "likelihood": "gaussian" #noise model used for uncertainty estimates - gaussian/beta/negative-binomial/student-T/deterministic-L1
}

In [260]:

# Define IAM role and session
role = sagemaker.get_execution_role()
session = sagemaker.Session()

#Obtain container image URI for SageMaker-DeepAR algorithm, based on region
region = session.boto_region_name
# sagemaker>=2 replaces get_image_uri with image_uris.retrieve
image_name = sagemaker.image_uris.retrieve("forecasting-deepar", region)
print("Model will be trained using container image : {}".format(image_name))


Model will be trained using container image : 495149712605.dkr.ecr.eu-central-1.amazonaws.com/forecasting-deepar:1


## Estimator Instantiation

In [261]:
from sagemaker.estimator import Estimator

# instantiate a DeepAR estimator (argument names as used in sagemaker>=2)
estimator = Estimator(image_uri=image_name,
                      sagemaker_session=sagemaker_session,
                      role=role,
                      instance_count=1,
                      instance_type='ml.c4.xlarge',
                      output_path=s3_output_path,
                      hyperparameters=hyperparameters
                      )


## Training Job Creation

Creation of a training job with stand alone time series (no dynamic features provided).

In [262]:
%%time
# train and test channels
data_channels = {
    "train": input_data_train,
    "test": input_data_test
}

# fit the estimator
estimator.fit(inputs=data_channels)

2021-03-06 21:15:30 Starting - Starting the training job...
2021-03-06 21:15:54 Starting - Launching requested ML instancesProfilerReport-1615065330: InProgress
......
2021-03-06 21:16:54 Starting - Preparing the instances for training.........
2021-03-06 21:18:22 Downloading - Downloading input data
2021-03-06 21:18:22 Training - Downloading the training image..
Arguments: train
[03/06/2021 21:18:40 INFO 139741071263552] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'num_dynamic_feat': u'auto', u'dropout_rate': u'0.10', u'mini_batch_size': u'128', u'test_quantiles': u'[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'num_eval_samples': u'100', u'learning_rate': u'0.001', u'num_cells': u'40', u'num_layers': u'2', u'embedding_dimension': u'10', u'_kvstore': u'auto', u'_num_kv_servers': u'auto', u'cardinality': u'auto', u'likelihood': u'student-t', 

# Generating Predictions

According to the [inference format](https://docs.aws.amazon.com/sagemaker/latest/dg/deepar-in-formats.html) for DeepAR, the `predictor` expects to see input data in a JSON format, with the following keys:
* **instances**: A list of JSON-formatted time series that should be forecast by the model.
* **configuration** (optional): A dictionary of configuration information for the type of response desired by the request.

Within configuration the following keys can be configured:
* **num_samples**: An integer specifying the number of samples that the model generates when making a probabilistic prediction.
* **output_types**: A list specifying the type of response. We'll ask for **quantiles**, which look at the list of num_samples generated by the model, and generate [quantile estimates](https://en.wikipedia.org/wiki/Quantile) for each time point based on these values.
* **quantiles**: A list that specifies which quantile estimates are generated and returned in the response.
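
To illustrate how `num_samples` and `quantiles` relate, the sketch below draws simulated sample paths and summarizes them per time step with NumPy. The Gaussian samples are made up for the example, and `np.quantile` only approximates the idea: the exact estimator used inside DeepAR may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

num_samples, prediction_length = 50, 5
# Simulated sample paths: one row per sample, one column per future time step
samples = rng.normal(loc=100.0, scale=2.0, size=(num_samples, prediction_length))

quantile_levels = [0.1, 0.5, 0.9]
# One quantile estimate per time step, computed across the samples
estimates = {str(q): np.quantile(samples, q, axis=0) for q in quantile_levels}

for level, values in estimates.items():
    print(level, np.round(values, 2))
```

The 0.5 quantile acts as a median forecast, while the 0.1 and 0.9 quantiles bound an 80% interval around it.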


Below is an example of what a JSON query to a DeepAR model endpoint might look like.

```
{
 "instances": [
  { "start": "2009-11-01 00:00:00", "target": [4.0, 10.0, 50.0, 100.0, 113.0] },
  { "start": "1999-01-30", "target": [2.0, 1.0] }
 ],
 "configuration": {
  "num_samples": 50,
  "output_types": ["quantiles"],
  "quantiles": ["0.5", "0.9"]
 }
}
```

## JSON Prediction Request

The code below accepts a **list** of time series and some configuration parameters. It converts each series into a JSON instance, then wraps the instances and the output configuration into a correctly formatted JSON request.

In [266]:
import json

def json_predictor_input(input_ts, num_samples=50, quantiles=('0.1', '0.5', '0.9')):
    '''Accepts a list of input time series and produces a formatted input.
       :input_ts: A list of input time series.
       :num_samples: Number of samples to calculate metrics with.
       :quantiles: A list of quantiles to return in the predicted output.
       :return: The JSON-formatted input, encoded as UTF-8 bytes.
       '''
    # request data is made of JSON objects (instances)
    # and an output configuration that details the type of data/quantiles we want

    # get JSON objects for input time series
    # (series_to_json_obj is a helper assumed to be defined elsewhere in the project)
    instances = [series_to_json_obj(ts) for ts in input_ts]

    # specify the output quantiles and samples
    configuration = {"num_samples": num_samples,
                     "output_types": ["quantiles"],
                     "quantiles": list(quantiles)}

    request_data = {"instances": instances,
                    "configuration": configuration}

    # serialize to JSON and encode as bytes for the endpoint
    json_request = json.dumps(request_data).encode('utf-8')

    return json_request
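
The helper `series_to_json_obj` used above is not defined in this part of the notebook. As a guess at what such a helper could look like, and only under the assumption that each input is a date-indexed pandas Series, a minimal version might be:

```python
import json

import pandas as pd


def series_to_json_obj(ts):
    """Hypothetical sketch: turn a date-indexed Series into a DeepAR instance dict."""
    return {"start": str(ts.index[0]), "target": ts.tolist()}


# Toy series for illustration
toy_ts = pd.Series([1.0, 2.0], index=pd.date_range("2020-01-02", periods=2, freq="D"))
toy_obj = series_to_json_obj(toy_ts)
print(json.dumps(toy_obj))
```

The project's own definition takes precedence if it differs from this sketch.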

### Get a Prediction

We can then use this function to get a prediction for a formatted time series!

In the next cell, I take an input time series and its known target, then pass the formatted input to the predictor endpoint to get a prediction.

In [265]:
df_ibm_valid['Adj Close'].head()

Date
2019-12-27    126.834549
2019-12-30    124.527962
2019-12-31    125.681244
2020-01-02    126.975204
2020-01-03    125.962540
Name: Adj Close, dtype: float64

In [None]:
# get all input and target (test) time series
# json_predictor_input expects a list of series, so the single series is wrapped in a list
input_ts = [df_ibm_valid['Adj Close']]
target_ts = time_series

# get formatted input time series
json_input_ts = json_predictor_input(input_ts)

# get the prediction from the predictor
json_prediction = predictor.predict(json_input_ts)

#print(json_prediction)
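
The endpoint's answer is itself JSON: a `predictions` list with one entry per input series, each holding a `quantiles` dictionary keyed by quantile level. The sketch below decodes such a response into one DataFrame per series. The function name `decode_prediction` and the hand-written `fake_response` are assumptions for illustration (no endpoint is called), and the sketch assumes the predictor returns raw bytes.

```python
import json

import pandas as pd


def decode_prediction(prediction, encoding="utf-8"):
    """Turn a DeepAR JSON response (bytes) into one DataFrame of quantiles per series."""
    data = json.loads(prediction.decode(encoding))
    return [pd.DataFrame(item["quantiles"]) for item in data["predictions"]]


# Hand-written response in the documented shape, standing in for json_prediction
fake_response = json.dumps({
    "predictions": [
        {"quantiles": {"0.1": [124.0, 123.5],
                       "0.5": [126.0, 125.8],
                       "0.9": [128.0, 128.1]}}
    ]
}).encode("utf-8")

quantile_frames = decode_prediction(fake_response)
print(quantile_frames[0])
```

Each DataFrame has one row per forecast step and one column per requested quantile, which makes it easy to plot against the known target.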