# Jupyter Notebook for the Stock Prediction Data Analysis (SPDA)

## Data Source : https://www.kaggle.com/qks1lver/amex-nyse-nasdaq-stock-histories

# Development stages

### 1. [Dataset formatting and feature extraction](#dataset_formatting_and_extraction)
### 2. [Manual model development](#manual_model_development)
### 3. [AutoML development](#automl_development)
### 4. [Model Training and Testing](#model_training_and_testing)

---
---

# Dataset formatting and feature extraction<a id='dataset_formatting_and_extraction'></a>

## a. Set up the path to the data file used for model training and testing

In [3]:
# System library for path management
from os import path

# Set the paths for the training and test data files
user_home_dir = str(path.expanduser('~'))
print('Home directory for the current user : ', user_home_dir)

Home directory for the current user :  C:\Users\Alpha


In [6]:
# Add path to the sample data file for training and testing models
# Pass each path component separately so that no backslash escapes are needed
sample_file_path = path.join(user_home_dir, 'Desktop', 'MLH-2018',
                             'amex-nyse-nasdaq-stock-histories',
                             'subset_data', 'AAL.csv')
print('Path to the data file currently being used : ', sample_file_path)

Path to the data file currently being used :  C:\Users\Alpha\Desktop\MLH-2018\amex-nyse-nasdaq-stock-histories\subset_data\AAL.csv


## b. Read the data file into a dataframe to be pre-processed

In [9]:
# Import the system built-in modules needed for feature extraction
import os
import time
from datetime import datetime

In [7]:
# Import the essential data processing libraries
import pandas as pd
import numpy as np

In [10]:
# Import visualization libraries for plotting and visualizing the dataset 
# vectors
import matplotlib.pyplot as plt

In [11]:
%matplotlib inline

In [17]:
# Read in the dataset from the csv file, and convert to a Pandas dataframe
sample_dataframe = pd.read_csv(sample_file_path, engine='python', encoding='utf-8-sig')
display(sample_dataframe)

Unnamed: 0,date,volume,open,close,high,low,adjclose
0,2018-11-16,9832851,37.400002,36.750000,37.529999,36.500000,36.750000
1,2018-11-15,8296700,37.869999,37.820000,38.160000,36.310001,37.820000
2,2018-11-14,7288400,38.000000,38.110001,38.590000,37.450001,38.110001
3,2018-11-13,9694100,37.150002,37.779999,38.419998,37.099998,37.779999
4,2018-11-12,9360800,36.310001,36.860001,37.299999,35.779999,36.860001
5,2018-11-09,6794400,36.700001,36.220001,37.259998,36.029999,36.220001
6,2018-11-08,6884800,36.770000,36.860001,37.049999,35.970001,36.860001
7,2018-11-07,10904000,35.570000,36.970001,37.389999,35.480000,36.970001
8,2018-11-06,11376700,35.599998,35.169998,35.959999,34.840000,35.169998
9,2018-11-05,11305300,36.349998,35.720001,36.520000,35.130001,35.720001


In [18]:
# Convert the date format in the dataframe into POSIX Timestamps

default_timestamps = sample_dataframe['date'].values
show_values = 5

# Initialize the list for storing POSIX timestamps
posix_timestamps = []

# Transform the datetime into POSIX datetime
for i in range(default_timestamps.shape[0]):
    
    # Collect the logged time value
    timestamp_logged = default_timestamps[i]    
    
    # Convert the logged default timestamp to POSIX and add to the list
    posix_timestamps.append(datetime.strptime(timestamp_logged, '%Y-%m-%d'))
    posix_timestamps[i] = time.mktime(posix_timestamps[i].timetuple())

# Add the list to the dataframe
sample_dataframe['Timestamp'] = posix_timestamps

# Set the POSIX timestamp column to be the index of the dataframe
#sample_dataframe.set_index('Timestamp', inplace=True)

# Sort the POSIX timestamp values in the dataframe
sample_dataframe.sort_values(by=['Timestamp'], inplace=True)

# Give a preview of the sorted dataframe
print('Showing the first %d values from the dataframe.' %(show_values))
sample_dataframe.head(show_values)

Showing the first 5 values from the dataframe.


Unnamed: 0,date,volume,open,close,high,low,adjclose,Timestamp
3309,2005-09-27,961200,21.049999,19.299999,21.4,19.1,18.489122,1127797000.0
3308,2005-09-28,5747900,19.299999,20.5,20.530001,19.200001,19.638702,1127884000.0
3307,2005-09-29,1078200,20.4,20.209999,20.58,20.1,19.360884,1127970000.0
3306,2005-09-30,3123300,20.26,21.01,21.049999,20.18,20.127277,1128056000.0
3305,2005-10-03,1057900,20.9,21.5,21.75,20.9,20.596684,1128316000.0
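
The conversion loop above can also be written as a vectorized pandas operation. The following is a minimal sketch, assuming a `date` column formatted as `YYYY-MM-DD`; the helper name `add_posix_timestamps` is hypothetical and not part of the notebook. One difference to keep in mind: `time.mktime` interprets each date in the local time zone, while this version treats it as midnight UTC, so the two results differ by the local UTC offset.

```python
import pandas as pd


# Hypothetical helper, not part of the original notebook
def add_posix_timestamps(frame, date_col="date", out_col="Timestamp"):
    """Return a copy of `frame` with an epoch-seconds column, sorted oldest first."""
    result = frame.copy()
    parsed = pd.to_datetime(result[date_col], format="%Y-%m-%d")
    # Whole seconds elapsed since the Unix epoch, treating each date as midnight UTC
    epoch = pd.Timestamp("1970-01-01")
    result[out_col] = (parsed - epoch) // pd.Timedelta(seconds=1)
    return result.sort_values(by=out_col).reset_index(drop=True)


# Toy rows in the same newest-first order as the CSV file
demo = pd.DataFrame({"date": ["2005-09-28", "2005-09-27"],
                     "close": [20.5, 19.3]})
demo_sorted = add_posix_timestamps(demo)
print(demo_sorted)
```

Because the column is numeric, the frame can be sorted on it directly, as the notebook does with `sort_values`.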


## Now that the dataframe is constructed, we can start the feature extraction and normalization

In [39]:
# Initialize the list to hold the normalized data values
scaled_features = []

# Approach 1 : Using the data from previous 7 days to predict the next day
from sklearn.preprocessing import MinMaxScaler

# Normalize and scale each one of the input features and construct the 
# scaler array for doing inverse scaling later
sample_df_cols = list(sample_dataframe.columns.values)
sample_df_cols.remove('date')

scaler_dict = {}
scaled_features_dict = {}

for col in sample_df_cols:
    
    # Initialize the scaler for the given feature
    # Range : -1...1
    scaler = MinMaxScaler(feature_range=(-1,1), copy=True)
    
    # Extract the features and fit the scaler
    feature_list = sample_dataframe[col].values.reshape(-1,1)
    scaler.fit(feature_list)
    
    # Add the scaler to the dictionary
    scaler_dict[col] = scaler
    
    # Transform the feature dataset
    scaled_feature_list = scaler.transform(feature_list)
    scaled_features.append(scaled_feature_list)
    scaled_features_dict[col] = scaled_feature_list
    
display(scaled_features)



[array([[-0.98804464],
        [-0.91848503],
        [-0.98634442],
        ...,
        [-0.89609871],
        [-0.88144624],
        [-0.85912312]]), array([[-0.36804076],
        [-0.42552146],
        [-0.38939072],
        ...,
        [ 0.18870092],
        [ 0.18443089],
        [ 0.1689933 ]]), array([[-0.42670374],
        [-0.38748162],
        [-0.39696032],
        ...,
        [ 0.18810264],
        [ 0.17862393],
        [ 0.14365091]]), array([[-0.36740694],
        [-0.39581971],
        [-0.39418681],
        ...,
        [ 0.19399085],
        [ 0.17994773],
        [ 0.15937291]]), array([[-0.41701072],
        [-0.41370766],
        [-0.38398017],
        ...,
        [ 0.18909994],
        [ 0.15144513],
        [ 0.15772089]]), array([[-0.4267036 ],
        [-0.38748159],
        [-0.39696036],
        ...,
        [ 0.24273242],
        [ 0.23283801],
        [ 0.19633117]]), array([[-1.        ],
        [-0.99958316],
        [-0.99916633],
        ...,
      

In [42]:
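# Construct the NumPy array to hold the data values.
# Each entry of the list has shape (n_samples, 1); stacking them side by side
# gives one row per time step and one column per feature. A plain reshape of
# the stacked (n_features, n_samples, 1) array would instead place consecutive
# values of a single feature in each row.
scaled_features = np.hstack(scaled_features)

# Determine the dimensions of the input dataset
print('Dimensions of the input dataset : ', scaled_features.shape)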

Dimensions of the input dataset :  (3310, 7)
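
For reference, the linear mapping that a min-max scaler with `feature_range=(-1, 1)` applies to each column, and the inverse mapping needed later to turn predictions back into prices, can be written out in NumPy. This is a minimal sketch of the formula rather than the scikit-learn implementation; the function names are hypothetical, the sample values are rounded closing prices from the preview table, and it assumes the column is not constant.

```python
import numpy as np


# Hypothetical helpers that spell out the min-max formula
def minmax_scale(values, low=-1.0, high=1.0):
    """Map the minimum of `values` to `low` and the maximum to `high`."""
    v_min, v_max = values.min(), values.max()
    # Assumes v_max > v_min, otherwise the division is undefined
    scaled = (values - v_min) / (v_max - v_min) * (high - low) + low
    return scaled, (v_min, v_max)


def minmax_inverse(scaled, bounds, low=-1.0, high=1.0):
    """Undo minmax_scale using the stored (minimum, maximum) pair."""
    v_min, v_max = bounds
    return (scaled - low) / (high - low) * (v_max - v_min) + v_min


close = np.array([36.75, 37.82, 38.11, 37.78, 36.86])
scaled_close, close_bounds = minmax_scale(close)
restored_close = minmax_inverse(scaled_close, close_bounds)
print(scaled_close)
print(restored_close)
```

Keeping one fitted scaler per column, as `scaler_dict` does, is what makes the inverse step possible for any single feature.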


## Split the dataset into training and testing 

In [54]:
def make_dataset_batchable(data_array_in, desired_ratio, batch_size):
    """Trim the dataset to a whole number of batches and return the
    train/test ratio that keeps the training split batch-aligned."""

    # Length of the input dataset and number of whole batches available
    data_length = data_array_in.shape[0]
    num_batches = data_length // batch_size

    # Length of the usable dataset with the given batch size
    data_use_len = num_batches * batch_size

    # Drop the trailing data points that do not fill a whole batch
    actual_data_in = data_array_in[:data_use_len]

    # Sizes of the training and testing sets for the desired ratio
    train_length = int(desired_ratio * data_use_len)
    test_length = int((1 - desired_ratio) * data_use_len)

    # Number of whole training and testing batches initially available
    num_train_batches = train_length // batch_size
    num_test_batches = test_length // batch_size

    # Number of data points not yet assigned to either split
    leftover = data_use_len - (num_train_batches + num_test_batches) * batch_size

    # Whole batches left over after the initial split
    leftover_batches = leftover // batch_size

    # Assign the leftover batches to the training set and compute the ratio
    actual_ratio = (num_train_batches + leftover_batches) * batch_size / data_use_len

    return actual_ratio, actual_data_in

In [55]:
# Make the dataset batchable for the training and testing

batch_size = 7
# Fraction of the usable data assigned to the training set
test_train_ratio = 0.75

ratio_to_use, scaled_features = make_dataset_batchable(scaled_features,
                                                      test_train_ratio,
                                                      batch_size)

In [56]:
# Determine the dimensions of the input dataset
print('Dimensions of the input dataset : ', scaled_features.shape)

Dimensions of the input dataset :  (3304, 7)
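
The arithmetic behind these numbers can be checked with a few lines of plain Python. This is a simplified sketch, not the notebook's function: the hypothetical helper `batch_split_sizes` gives whole batches to the training set and everything that remains to the test set.

```python
# Hypothetical helper illustrating the batch arithmetic
def batch_split_sizes(n_rows, batch_size, train_ratio):
    """Return (usable_rows, train_rows, test_rows), each a multiple of batch_size."""
    n_batches = n_rows // batch_size           # whole batches that fit
    usable_rows = n_batches * batch_size       # rows kept after trimming
    train_batches = int(train_ratio * n_batches)
    train_rows = train_batches * batch_size
    test_rows = usable_rows - train_rows
    return usable_rows, train_rows, test_rows


usable, train_rows, test_rows = batch_split_sizes(3310, 7, 0.75)
print(usable, train_rows, test_rows)
```

With 3310 rows, a batch size of 7 and a ratio of 0.75, six rows are trimmed and the split comes out at 2478 training rows and 826 test rows, which matches the shapes printed in this notebook.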


## Now that the input dataset is guaranteed to be batchable, create the training and testing batches

In [58]:
def split_dataset(data_set_in, train_test_ratio):

    # Determine the length of the training array
    train_size = int(len(data_set_in) * train_test_ratio)

    # Split the dataset: the first rows go to training, the rest to testing
    train, test = data_set_in[:train_size], data_set_in[train_size:]

    return train, test

In [59]:
train_dataset, test_dataset = split_dataset(scaled_features, ratio_to_use)

In [62]:
# Display the training dataset to be used for training the network
print('Displaying the training dataset with size : ', train_dataset.shape)
display(train_dataset)

Displaying the training dataset with size :  (2478, 7)


array([[-0.98804464, -0.91848503, -0.98634442, ..., -0.98663941,
        -0.97630872, -0.98887151],
       [-0.97855534, -0.99106436, -0.99418435, ..., -0.98874653,
        -0.99429189, -0.98911128],
       [-0.99526988, -0.97563008, -0.98070897, ..., -0.98868695,
        -0.99461885, -0.99565788],
       ...,
       [-0.79702564, -0.82382741, -0.77774137, ..., -0.75257392,
        -0.72609899, -0.70256579],
       [-0.728387  , -0.77153126, -0.7986599 , ..., -0.81238761,
        -0.81206076, -0.88658273],
       [-0.84409219, -0.87024024, -0.85553191, ..., -0.89736888,
        -0.91338452, -0.91109655]])

In [63]:
# Display the testing dataset to be used for testing the network
print('Displaying the testing dataset with size : ', test_dataset.shape)
display(test_dataset)

Displaying the testing dataset with size :  (826, 7)


array([[-0.89050498, -0.86599116, -0.86468374, ..., -0.89017813,
        -0.86141527, -0.8558588 ],
       [-0.84016993, -0.81336816, -0.79964044, ..., -0.81892462,
        -0.844419  , -0.83918941],
       [-0.84997546, -0.82382741, -0.81663668, ..., -0.81434872,
        -0.81794414, -0.82088575],
       ...,
       [ 0.96371784,  0.96413467,  0.96455151, ...,  0.96621886,
         0.96663569,  0.96705253],
       [ 0.96746937,  0.9678862 ,  0.96955355, ...,  0.97038722,
         0.97080406,  0.97205457],
       [ 0.97247141,  0.97288824,  0.97330508, ...,  0.97497243,
         0.97538926,  0.9758061 ]])

## Now that the features have been scaled, construct the supervised dataset for generating predictions

## Reshape the input dataset

In [88]:
# Construct the supervised dataset for training the models

# Define the dimensions of the input dataset to be used for training the
# model

num_past_timestamps = 7
num_future_predictions = 1
num_features = len(sample_df_cols)
num_samples = scaled_features.shape[0]

# Method for converting the input dataset to a supervised training dataset
from pandas import DataFrame
from pandas import concat
 
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
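
To see what this framing produces, it helps to run the same shift-and-concatenate idea on a tiny series. The sketch below uses a made-up two-feature frame and two lagged steps; the column names `a` and `b` are placeholders.

```python
import pandas as pd

# Toy series: 5 time steps, 2 features
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

n_in = 2  # number of lagged steps placed before step t
blocks, names = [], []
for lag in range(n_in, 0, -1):
    # shift(lag) moves the values down, so row t holds the values from t-lag
    blocks.append(toy.shift(lag))
    names += [f"{col}(t-{lag})" for col in toy.columns]
blocks.append(toy)
names += [f"{col}(t)" for col in toy.columns]

framed = pd.concat(blocks, axis=1)
framed.columns = names
# The first n_in rows have no complete history and are dropped
framed = framed.dropna()
print(framed)
```

Each remaining row lists all features for the oldest step first, then the next step, and finally step `t`, which is the column order the later reshape depends on.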

In [89]:
training_input_data = series_to_supervised(data=train_dataset,
                                           n_in=num_past_timestamps,
                                           n_out=num_future_predictions)

In [90]:
test_input_data = series_to_supervised(data=test_dataset,
                                       n_in=num_past_timestamps,
                                       n_out=num_future_predictions)

In [91]:
display(training_input_data)

Unnamed: 0,var1(t-7),var2(t-7),var3(t-7),var4(t-7),var5(t-7),var6(t-7),var7(t-7),var1(t-6),var2(t-6),var3(t-6),...,var5(t-1),var6(t-1),var7(t-1),var1(t),var2(t),var3(t),var4(t),var5(t),var6(t),var7(t)
7,-0.988045,-0.918485,-0.986344,-0.956625,-0.986639,-0.976309,-0.988872,-0.978555,-0.991064,-0.994184,...,-0.996610,-0.995784,-0.985673,-0.990524,-0.990621,-0.996437,-0.996059,-0.995887,-0.998718,-0.999711
8,-0.978555,-0.991064,-0.994184,-0.990902,-0.988747,-0.994292,-0.989111,-0.995270,-0.975630,-0.980709,...,-0.995887,-0.998718,-0.999711,-0.997451,-0.978080,-0.996528,-0.993809,-0.991823,-0.998412,-0.994848
9,-0.995270,-0.975630,-0.980709,-0.992699,-0.988687,-0.994619,-0.995658,-0.994119,-0.996909,-0.978545,...,-0.991823,-0.998412,-0.994848,-0.995090,-0.997249,-0.999255,-0.999502,-0.987289,-0.973989,-0.997393
10,-0.994119,-0.996909,-0.978545,-0.991747,-0.989416,-0.960303,-0.988845,-0.988751,-0.994937,-0.981741,...,-0.987289,-0.973989,-0.997393,-0.996938,-0.998804,-0.999233,-0.987854,-0.972820,-0.991343,-0.967845
11,-0.988751,-0.994937,-0.981741,-0.982943,-0.973692,-0.984808,-0.986621,-0.990980,-0.994128,-0.987308,...,-0.972820,-0.991343,-0.967845,-0.976238,-0.989556,-0.990551,-0.994299,-0.992583,-0.989860,-0.997454
12,-0.990980,-0.994128,-0.987308,-0.994314,-0.992766,-0.995882,-0.992257,-1.000000,-0.992852,-0.995784,...,-0.992583,-0.989860,-0.997454,-0.994476,-0.988683,-0.994883,-0.990143,-0.990864,-0.996235,-0.994007
13,-1.000000,-0.992852,-0.995784,-0.991984,-0.996610,-0.995784,-0.985673,-0.990524,-0.990621,-0.996437,...,-0.990864,-0.996235,-0.994007,-0.994895,-0.989374,-0.995569,-0.996146,-0.990009,-0.992297,-0.990274
14,-0.990524,-0.990621,-0.996437,-0.996059,-0.995887,-0.998718,-0.999711,-0.997451,-0.978080,-0.996528,...,-0.990009,-0.992297,-0.990274,-0.998461,-0.997421,-0.990784,-0.992753,-0.988552,-0.992509,-0.994187
15,-0.997451,-0.978080,-0.996528,-0.993809,-0.991823,-0.998412,-0.994848,-0.995090,-0.997249,-0.999255,...,-0.988552,-0.992509,-0.994187,-0.996081,-0.985308,-0.996366,-0.988995,-0.992112,-0.990142,-0.991958
16,-0.995090,-0.997249,-0.999255,-0.999502,-0.987289,-0.973989,-0.997393,-0.996938,-0.998804,-0.999233,...,-0.992112,-0.990142,-0.991958,-0.998003,-0.985682,-0.994771,-0.994181,-0.986259,-0.993362,-0.995809


In [92]:
display(test_input_data)

Unnamed: 0,var1(t-7),var2(t-7),var3(t-7),var4(t-7),var5(t-7),var6(t-7),var7(t-7),var1(t-6),var2(t-6),var3(t-6),...,var5(t-1),var6(t-1),var7(t-1),var1(t),var2(t),var3(t),var4(t),var5(t),var6(t),var7(t)
7,-0.890505,-0.865991,-0.864684,-0.862723,-0.890178,-0.861415,-0.855859,-0.840170,-0.813368,-0.799640,...,-0.870567,-0.868933,-0.878738,-0.886910,-0.878738,-0.880699,-0.892139,-0.907828,-0.902925,-0.902272
8,-0.840170,-0.813368,-0.799640,-0.834613,-0.818925,-0.844419,-0.839189,-0.849975,-0.823827,-0.816637,...,-0.907828,-0.902925,-0.902272,-0.913058,-0.933976,-0.939859,-0.955548,-0.941494,-0.937898,-0.949665
9,-0.849975,-0.823827,-0.816637,-0.816637,-0.814349,-0.817944,-0.820886,-0.807485,-0.814676,-0.815002,...,-0.941494,-0.937898,-0.949665,-0.951299,-0.964373,-0.976794,-0.982350,-0.980062,-0.993136,-0.986926
10,-0.807485,-0.814676,-0.815002,-0.811407,-0.804870,-0.783298,-0.785259,-0.761399,-0.760745,-0.744729,...,-0.980062,-0.993136,-0.986926,-0.991829,-0.979408,-0.972871,-0.966008,-0.954895,-0.966008,-0.958490
11,-0.761399,-0.760745,-0.744729,-0.763360,-0.778395,-0.787220,-0.815983,-0.810753,-0.778722,-0.808139,...,-0.954895,-0.966008,-0.958490,-0.952280,-0.969930,-0.980389,-0.968949,-0.972545,-0.971891,-0.966008
12,-0.810753,-0.778722,-0.808139,-0.811080,-0.807485,-0.801928,-0.805851,-0.838209,-0.818925,-0.846053,...,-0.972545,-0.971891,-0.966008,-0.970257,-0.977774,-0.974832,-0.972545,-0.965027,-0.960451,-0.954895
13,-0.838209,-0.818925,-0.846053,-0.872201,-0.870567,-0.868933,-0.878738,-0.886910,-0.878738,-0.880699,...,-0.965027,-0.960451,-0.954895,-0.958163,-0.956202,-0.944435,-0.939206,-0.947377,-0.936591,-0.932996
14,-0.886910,-0.878738,-0.880699,-0.892139,-0.907828,-0.902925,-0.902272,-0.913058,-0.933976,-0.939859,...,-0.947377,-0.936591,-0.932996,-0.927766,-0.940513,-0.920902,-0.915019,-0.900637,-0.899330,-0.926785
15,-0.913058,-0.933976,-0.939859,-0.955548,-0.941494,-0.937898,-0.949665,-0.951299,-0.964373,-0.976794,...,-0.900637,-0.899330,-0.926785,-0.932015,-0.927112,-0.933649,-0.926459,-0.907828,-0.895735,-0.894100
16,-0.951299,-0.964373,-0.976794,-0.982350,-0.980062,-0.993136,-0.986926,-0.991829,-0.979408,-0.972871,...,-0.907828,-0.895735,-0.894100,-0.927766,-0.936918,-0.934957,-0.946070,-0.960124,-0.954241,-0.958163


In [93]:
# Reshape the supervised dataset into the 3-D format for the LSTM
# Input format : [samples, timesteps, features], where the timesteps are
# the 7 lagged steps plus step t
training_input_data = np.array(training_input_data)
display(training_input_data)
training_input_data = training_input_data.reshape(training_input_data.shape[0],
                                         num_past_timestamps+1,
                                         num_features)

array([[-0.98804464, -0.91848503, -0.98634442, ..., -0.99588749,
        -0.99871829, -0.99971082],
       [-0.97855534, -0.99106436, -0.99418435, ..., -0.99182293,
        -0.99841167, -0.99484846],
       [-0.99526988, -0.97563008, -0.98070897, ..., -0.98728899,
        -0.97398944, -0.99739299],
       ...,
       [-0.76270631, -0.76368691, -0.80781171, ..., -0.75257392,
        -0.72609899, -0.70256579],
       [-0.81761721, -0.82611534, -0.79081547, ..., -0.81238761,
        -0.81206076, -0.88658273],
       [-0.79898674, -0.8065043 , -0.8244811 , ..., -0.89736888,
        -0.91338452, -0.91109655]])

In [127]:
print('Displaying the input training dataset ready for input to RNNs : ')
display(training_input_data)

Displaying the input training dataset ready for input to RNNs : 


array([[[-0.98804464, -0.91848503, -0.98634442, ..., -0.98663941,
         -0.97630872, -0.98887151],
        [-0.97855534, -0.99106436, -0.99418435, ..., -0.98874653,
         -0.99429189, -0.98911128],
        [-0.99526988, -0.97563008, -0.98070897, ..., -0.98868695,
         -0.99461885, -0.99565788],
        ...,
        [-0.99098008, -0.99412768, -0.98730788, ..., -0.99276604,
         -0.99588167, -0.99225743],
        [-1.        , -0.99285178, -0.99578431, ..., -0.99660972,
         -0.99578431, -0.98567305],
        [-0.99052378, -0.99062114, -0.99643679, ..., -0.99588749,
         -0.99871829, -0.99971082]],

       [[-0.97855534, -0.99106436, -0.99418435, ..., -0.98874653,
         -0.99429189, -0.98911128],
        [-0.99526988, -0.97563008, -0.98070897, ..., -0.98868695,
         -0.99461885, -0.99565788],
        [-0.99411896, -0.99690907, -0.97854517, ..., -0.98941645,
         -0.96030334, -0.98884535],
        ...,
        [-1.        , -0.99285178, -0.99578431, ..., -

In [123]:
# Display the number of samples in the input training data
print('Number of training samples :', training_input_data.shape[0])

Number of training samples : 2471
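
The reshape works because each flattened row stores its values step by step: all features of the oldest step, then all features of the next step, and so on. A row-major reshape then recovers one row per time step. A minimal sketch on made-up numbers:

```python
import numpy as np

n_steps, n_features = 3, 2

# Two flattened samples laid out as
# [step 0 features..., step 1 features..., step 2 features...]
flat = np.array([[1, 10, 2, 20, 3, 30],
                 [2, 20, 3, 30, 4, 40]])

# Row-major reshape: every consecutive group of n_features values becomes one time step
cube = flat.reshape(flat.shape[0], n_steps, n_features)
print(cube.shape)
print(cube[0])
```

If the flattened rows were ordered feature by feature instead, the same reshape would silently mix values from different features, so the column order is worth checking before reshaping.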


In [95]:
# Reshape the supervised dataset into the 3-D format for the LSTM
# Input format : [samples, timesteps, features], where the timesteps are
# the 7 lagged steps plus step t
test_input_data = np.array(test_input_data)
display(test_input_data)
test_input_data = test_input_data.reshape(test_input_data.shape[0],
                                         num_past_timestamps+1,
                                         num_features)

array([[-0.89050498, -0.86599116, -0.86468374, ..., -0.90782806,
        -0.9029253 , -0.90227163],
       [-0.84016993, -0.81336816, -0.79964044, ..., -0.94149372,
        -0.93789833, -0.94966498],
       [-0.84997546, -0.82382741, -0.81663668, ..., -0.98006212,
        -0.99313614, -0.98692597],
       ...,
       [ 0.93453927,  0.9349561 ,  0.93537294, ...,  0.96621886,
         0.96663569,  0.96705253],
       [ 0.9382908 ,  0.93870763,  0.93995814, ...,  0.97038722,
         0.97080406,  0.97205457],
       [ 0.942876  ,  0.94329284,  0.94412651, ...,  0.97497243,
         0.97538926,  0.9758061 ]])

In [96]:
print('Displaying the input testing dataset ready for input to RNNs : ')
display(test_input_data)

Displaying the input testing dataset ready for input to RNNs : 


array([[[-0.89050498, -0.86599116, -0.86468374, ..., -0.89017813,
         -0.86141527, -0.8558588 ],
        [-0.84016993, -0.81336816, -0.79964044, ..., -0.81892462,
         -0.844419  , -0.83918941],
        [-0.84997546, -0.82382741, -0.81663668, ..., -0.81434872,
         -0.81794414, -0.82088575],
        ...,
        [-0.81075334, -0.778722  , -0.80813855, ..., -0.80748485,
         -0.80192839, -0.80585063],
        [-0.83820883, -0.81892462, -0.84605327, ..., -0.87056708,
         -0.86893283, -0.87873836],
        [-0.88690961, -0.87873836, -0.88069944, ..., -0.90782806,
         -0.9029253 , -0.90227163]],

       [[-0.84016993, -0.81336816, -0.79964044, ..., -0.81892462,
         -0.844419  , -0.83918941],
        [-0.84997546, -0.82382741, -0.81663668, ..., -0.81434872,
         -0.81794414, -0.82088575],
        [-0.80748485, -0.81467557, -0.81500243, ..., -0.80487006,
         -0.78329788, -0.78525904],
        ...,
        [-0.83820883, -0.81892462, -0.84605327, ..., -

## Now that the input dataset is ready, construct the output dataset for training and testing

In [121]:
# Create the output dataset for training and testing the model
def create_output(data, n_in, n_out, dropnan=True):
    """
    Build the target frame from the steps that follow step t.
    With n_in=0 and n_out=2 the result holds only the t+1 values.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # lagged sequence (t-n, ... t-1); empty when n_in is 0
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t+1, ... t+n_out-1); step t itself is skipped
    for i in range(1, n_out):
        cols.append(df.shift(-i))
        names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

In [130]:
# Create the train and test output dataset

train_output = create_output(data=train_dataset,
                            n_in=0,
                            n_out=2,
                            dropnan=True)


# Drop the first six rows so the number of targets matches the number of
# input windows
train_output.drop(index=[0,1,2,3,4,5], inplace=True)
display(train_output)

Unnamed: 0,var1(t+1),var2(t+1),var3(t+1),var4(t+1),var5(t+1),var6(t+1),var7(t+1)
6,-0.990524,-0.990621,-0.996437,-0.996059,-0.995887,-0.998718,-0.999711
7,-0.997451,-0.978080,-0.996528,-0.993809,-0.991823,-0.998412,-0.994848
8,-0.995090,-0.997249,-0.999255,-0.999502,-0.987289,-0.973989,-0.997393
9,-0.996938,-0.998804,-0.999233,-0.987854,-0.972820,-0.991343,-0.967845
10,-0.976238,-0.989556,-0.990551,-0.994299,-0.992583,-0.989860,-0.997454
11,-0.994476,-0.988683,-0.994883,-0.990143,-0.990864,-0.996235,-0.994007
12,-0.994895,-0.989374,-0.995569,-0.996146,-0.990009,-0.992297,-0.990274
13,-0.998461,-0.997421,-0.990784,-0.992753,-0.988552,-0.992509,-0.994187
14,-0.996081,-0.985308,-0.996366,-0.988995,-0.992112,-0.990142,-0.991958
15,-0.998003,-0.985682,-0.994771,-0.994181,-0.986259,-0.993362,-0.995809


In [131]:
test_output = create_output(data=test_dataset,
                            n_in=0,
                            n_out=2,
                            dropnan=True)


# Drop the first six rows so the number of targets matches the number of
# input windows
test_output.drop(index=[0,1,2,3,4,5], inplace=True)
display(test_output)

Unnamed: 0,var1(t+1),var2(t+1),var3(t+1),var4(t+1),var5(t+1),var6(t+1),var7(t+1)
6,-0.886910,-0.878738,-0.880699,-0.892139,-0.907828,-0.902925,-0.902272
7,-0.913058,-0.933976,-0.939859,-0.955548,-0.941494,-0.937898,-0.949665
8,-0.951299,-0.964373,-0.976794,-0.982350,-0.980062,-0.993136,-0.986926
9,-0.991829,-0.979408,-0.972871,-0.966008,-0.954895,-0.966008,-0.958490
10,-0.952280,-0.969930,-0.980389,-0.968949,-0.972545,-0.971891,-0.966008
11,-0.970257,-0.977774,-0.974832,-0.972545,-0.965027,-0.960451,-0.954895
12,-0.958163,-0.956202,-0.944435,-0.939206,-0.947377,-0.936591,-0.932996
13,-0.927766,-0.940513,-0.920902,-0.915019,-0.900637,-0.899330,-0.926785
14,-0.932015,-0.927112,-0.933649,-0.926459,-0.907828,-0.895735,-0.894100
15,-0.927766,-0.936918,-0.934957,-0.946070,-0.960124,-0.954241,-0.958163


In [132]:
# Extract the data from the dataframes
train_output_np = train_output.values
test_output_np = test_output.values

print(train_output_np)
print(test_output_np)

[[-0.99052378 -0.99062114 -0.99643679 ... -0.99588749 -0.99871829
  -0.99971082]
 [-0.99745111 -0.97808015 -0.99652834 ... -0.99182293 -0.99841167
  -0.99484846]
 [-0.99508969 -0.99724912 -0.99925452 ... -0.98728899 -0.97398944
  -0.99739299]
 ...
 [-0.79702564 -0.82382741 -0.77774137 ... -0.75257392 -0.72609899
  -0.70256579]
 [-0.728387   -0.77153126 -0.7986599  ... -0.81238761 -0.81206076
  -0.88658273]
 [-0.84409219 -0.87024024 -0.85553191 ... -0.89736888 -0.91338452
  -0.91109655]]
[[-0.88690961 -0.87873836 -0.88069944 ... -0.90782806 -0.9029253
  -0.90227163]
 [-0.91305768 -0.93397613 -0.93985946 ... -0.94149372 -0.93789833
  -0.94966498]
 [-0.95129922 -0.96437327 -0.9767936  ... -0.98006212 -0.99313614
  -0.98692597]
 ...
 [ 0.96371784  0.96413467  0.96455151 ...  0.96621886  0.96663569
   0.96705253]
 [ 0.96746937  0.9678862   0.96955355 ...  0.97038722  0.97080406
   0.97205457]
 [ 0.97247141  0.97288824  0.97330508 ...  0.97497243  0.97538926
   0.9758061 ]]
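
Comparing the tables above, the first target row (index 6) holds the same values as the `var(t)` block of the first input row (index 7), so each target also appears as the last step of its own input window. If the intent is to predict the day after the window, one option is to build inputs and targets together so that they cannot overlap. The following is a minimal sketch of that idea, not the notebook's method; the helper name `make_windows` is hypothetical and the data is a toy array.

```python
import numpy as np


# Hypothetical helper, shown only to illustrate non-overlapping inputs and targets
def make_windows(data, n_in):
    """Pair each window of n_in consecutive rows with the row that follows it."""
    inputs, targets = [], []
    for start in range(len(data) - n_in):
        inputs.append(data[start:start + n_in])   # rows start .. start+n_in-1
        targets.append(data[start + n_in])        # the row right after the window
    return np.array(inputs), np.array(targets)


# Toy data: 6 time steps, 2 features
series = np.arange(12).reshape(6, 2)
x_windows, y_next = make_windows(series, n_in=3)
print(x_windows.shape, y_next.shape)
```

The inputs come out directly in the `[samples, timesteps, features]` layout. Note that the number of samples changes with this approach, which matters for a stateful LSTM whose sample count must be a multiple of the batch size.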


# Manual model development<a id='manual_model_development'></a>

## Model 1 : LSTM Neural Network

### Goal - Initialize and implement the manual LSTM model and train using the given dataset

In [134]:
# Layer and Model Initializers from Keras
from keras.models import Model
from keras.layers import Dense, Input, LSTM, Bidirectional, GRU

# Visualizers for the model
from keras.utils import plot_model

# Callbacks and optimizers for training
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam, RMSprop, SGD

Using TensorFlow backend.


In [133]:
"""# fit an LSTM network to training data
def fit_lstm(train_in, train_out, batch_size, nb_epoch, neurons):
    X, y = train_in, train_out
    X = X.reshape(X.shape[0], 1, X.shape[1])
    model = Sequential()
    model.add(LSTM(neurons, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
    model.add(Dense(train_out.shape[1]))
    model.compile(loss='mean_squared_error', optimizer='adam')
    for i in range(nb_epoch):
        model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
        model.reset_states()
    return model"""

In [None]:
"""# Initialize the neural network model
opt = SGD(lr=0.05, nesterov=True)
batch_size = 4
past_timesteps = 1

# Define the network model and layers
input_1 = Input(batch_shape=(batch_size, past_timesteps))
dense_2 = LSTM(units=5, activation='relu')(input_1)
dense_3 = Dense(units=10, activation='relu')(dense_2)
dense_6 = Dense(units=10, activation='sigmoid')(dense_3)
dense_7 = Dense(units=5, activation='relu')(dense_6)
output_1 = Dense(units=1)(dense_7)
        
# Generate and compile the model
predictor = Model(inputs=input_1, outputs=output_1)
predictor.compile(optimizer= opt, loss= 'mae', metrics=['mape', 'mse', 'mae'])
predictor.summary()"""

In [143]:
opt = SGD(lr=0.05, nesterov=True)
batch_size = 7
# Each sample holds the 7 lagged steps plus step t
past_timestamps = num_past_timestamps + 1
num_features = len(sample_df_cols)

# Define the network model and layers
input_1 = Input(batch_shape=(batch_size, past_timestamps, num_features))
lstm_1 = LSTM(units=7, stateful=True, return_sequences=True)(input_1)
lstm_2 = LSTM(units=52, stateful=True)(lstm_1)
dense_1 = Dense(units=10, activation='sigmoid')(lstm_2)
dense_2 = Dense(units=5, activation='relu')(dense_1)
output_1 = Dense(units=num_features)(dense_2)

# Generate and compile the model
predictor = Model(inputs=input_1, outputs=output_1)
predictor.compile(optimizer=opt, loss='mse',
                  metrics=['mape', 'mse', 'mae'])
predictor.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (7, 8, 7)                 0         
_________________________________________________________________
lstm_9 (LSTM)                (7, 8, 7)                 420       
_________________________________________________________________
lstm_10 (LSTM)               (7, 52)                   12480     
_________________________________________________________________
dense_13 (Dense)             (7, 10)                   530       
_________________________________________________________________
dense_14 (Dense)             (7, 5)                    55        
_________________________________________________________________
dense_15 (Dense)             (7, 7)                    42        
Total params: 13,527
Trainable params: 13,527
Non-trainable params: 0
_________________________________________________________________


In [None]:
num_epochs = 100
predictor.fit(x=training_input_data, y=train_output_np, 
              batch_size=batch_size, epochs=num_epochs, verbose=2,
             shuffle=False)

Epoch 1/100
 - 13s - loss: 0.1815 - mean_absolute_percentage_error: 9439.8459 - mean_squared_error: 0.1815 - mean_absolute_error: 0.3388
Epoch 2/100
 - 10s - loss: 0.0461 - mean_absolute_percentage_error: 17903.8261 - mean_squared_error: 0.0461 - mean_absolute_error: 0.1608
Epoch 3/100
 - 10s - loss: 0.0312 - mean_absolute_percentage_error: 18347.4518 - mean_squared_error: 0.0312 - mean_absolute_error: 0.1325
Epoch 4/100
 - 10s - loss: 0.0265 - mean_absolute_percentage_error: 17155.7471 - mean_squared_error: 0.0265 - mean_absolute_error: 0.1221
Epoch 5/100
 - 10s - loss: 0.0230 - mean_absolute_percentage_error: 16638.1316 - mean_squared_error: 0.0230 - mean_absolute_error: 0.1134
Epoch 6/100
 - 12s - loss: 0.0199 - mean_absolute_percentage_error: 14768.5110 - mean_squared_error: 0.0199 - mean_absolute_error: 0.1049
Epoch 7/100
 - 10s - loss: 0.0172 - mean_absolute_percentage_error: 13296.7741 - mean_squared_error: 0.0172 - mean_absolute_error: 0.0968
Epoch 8/100
 - 10s - loss: 0.0148 -
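
The very large `mean_absolute_percentage_error` values may largely reflect the scaling rather than the model: percentage error divides by the magnitude of the target, and targets scaled to the range -1 to 1 can lie close to zero. A minimal sketch in plain Python, assuming the usual MAPE definition with a small epsilon guard in the denominator; the function name and the epsilon value are assumptions made for illustration.

```python
# Hypothetical helper; epsilon is an assumed guard against division by zero
def mape(y_true, y_pred, epsilon=1e-7):
    """Mean absolute percentage error over paired values."""
    ratios = [abs(t - p) / max(abs(t), epsilon) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# The absolute error is 0.05 on every point in both cases
far_from_zero = mape([0.9, -0.8, 0.7], [0.95, -0.75, 0.75])
near_zero = mape([0.9, -0.8, 0.0001], [0.95, -0.75, 0.0501])
print(far_from_zero, near_zero)
```

A single target near zero is enough to inflate the average by several orders of magnitude, so `mean_squared_error` and `mean_absolute_error` are likely the more informative metrics for scaled data like this.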

# AutoML model development<a id='automl_development'></a>

## Model Development using AutoKeras

In [None]:
# Import the AutoML libraries
import autokeras as ak