# ARIMA Modeling

---------


### Index:

- 1) [Importing the Data](#Importing)
- 2) [Shifting the Dates](#Shifting)
- 3) Train and Test Set:
    - 3a. [Manually Splitting the Data](#Splitting) to Predict 2017 & onwards.
- 4) [Normalizing the Data](#Norm)
- 5) Supervised Machine Learning Models:
    - [ARIMA Model](#ARIMA)
    - [Principal Component Analysis](#PCA)
    - [Gridsearch Pipeline](#Gridsearch)

---------

## Importing Libraries:

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

from datetime import datetime

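# Note: statsmodels.tsa.arima_model was removed in statsmodels 0.13;
# newer releases expose ARIMA in statsmodels.tsa.arima.model instead.
from statsmodels.tsa.arima_model import ARIMA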
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import sys

sys.path.append('..')

-----

## Company Name

**Company Options:**

- Apple, Inc. - `Apple`
- Facebook, Inc. - `Facebook`
- Google LLC - `Google`
- JPMorgan Chase & Co. - `JPMorgan`
- The Goldman Sachs Group, Inc. - `GoldmanSachs`
- Moody's Corporation - `Moodys`
- The International Business Machines Corporation (IBM) - `IBM`
- Twitter Inc. - `Twitter`
- BlackRock, Inc. - `BlackRock`
- Microsoft Corporation - `Microsoft`

In [2]:
company_name = 'Apple'

---
<a class="anchor" id="Importing"></a>

## Importing the Data:
The data is imported with the custom helper function `data_importer`.

In [3]:
from lib.helper import data_importer

In [4]:
df = data_importer(company_name)

In [5]:
df.head(3)

| Date | Open | High | Low | Close | Volume | Ex_Dividend | Split_Ratio | Adj_Open | Adj_High | Adj_Low | Adj_Close | Adj_Volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1980-12-12 | 28.75 | 28.87 | 28.75 | 28.75 | 2093900.0 | 0.0 | 1.0 | 0.422706 | 0.42447 | 0.422706 | 0.422706 | 117258400.0 |
| 1980-12-15 | 27.38 | 27.38 | 27.25 | 27.25 | 785200.0 | 0.0 | 1.0 | 0.402563 | 0.402563 | 0.400652 | 0.400652 | 43971200.0 |
| 1980-12-16 | 25.37 | 25.37 | 25.25 | 25.25 | 472000.0 | 0.0 | 1.0 | 0.37301 | 0.37301 | 0.371246 | 0.371246 | 26432000.0 |


-------
<a class="anchor" id="Shifting"></a>

## Importing the Shifted Data Set:
The time-shifted data set is imported with the custom helper function `df_shift_importer`.

In [6]:
from lib.helper import df_shift_importer

In [7]:
df_shift = df_shift_importer(company_name)

In [8]:
df_shift.head()

| Date | Open | High | Low | Close | Volume | Ex_Dividend | Split_Ratio | Adj_Open | Adj_High | Adj_Low | ... | Low_Long_EMA | Close_Long_EMA | Volume_Long_EMA | Ex_Dividend_Long_EMA | Split_Ratio_Long_EMA | Adj_Open_Long_EMA | Adj_High_Long_EMA | Adj_Low_Long_EMA | Adj_Close_Long_EMA | Adj_Volume_Long_EMA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1981-04-15 | 27.88 | 28.0 | 27.88 | 27.88 | 29700.0 | 0.0 | 1.0 | 0.409914 | 0.411679 | 0.409914 | ... | 27.100545 | 27.100545 | 395955.964908 | 0.0 | 1.0 | 0.399428 | 0.40078 | 0.398454 | 0.398454 | 22173530.0 |
| 1981-04-16 | 26.63 | 26.63 | 26.5 | 26.5 | 152000.0 | 0.0 | 1.0 | 0.391536 | 0.391536 | 0.389625 | ... | 27.086579 | 27.086579 | 390282.570375 | 0.0 | 1.0 | 0.399245 | 0.400565 | 0.398249 | 0.398249 | 21855820.0 |
| 1981-04-20 | 25.12 | 25.12 | 25.0 | 25.0 | 106600.0 | 0.0 | 1.0 | 0.369335 | 0.369335 | 0.36757 | ... | 27.038054 | 27.038054 | 383685.301297 | 0.0 | 1.0 | 0.398549 | 0.399838 | 0.397535 | 0.397535 | 21486380.0 |
| 1981-04-21 | 25.75 | 25.87 | 25.75 | 25.75 | 157800.0 | 0.0 | 1.0 | 0.378597 | 0.380362 | 0.378597 | ... | 27.008099 | 27.008099 | 378432.154755 | 0.0 | 1.0 | 0.398085 | 0.399385 | 0.397095 | 0.397095 | 21192200.0 |
| 1981-04-22 | 27.5 | 27.62 | 27.5 | 27.5 | 127400.0 | 0.0 | 1.0 | 0.404327 | 0.406092 | 0.404327 | ... | 27.019539 | 27.019539 | 372594.197668 | 0.0 | 1.0 | 0.398231 | 0.399541 | 0.397263 | 0.397263 | 20865280.0 |


### Taking a Look at the Time-Shifted Data Set:

In [9]:
df_shift.tail(3)

| Date | Open | High | Low | Close | Volume | Ex_Dividend | Split_Ratio | Adj_Open | Adj_High | Adj_Low | ... | Low_Long_EMA | Close_Long_EMA | Volume_Long_EMA | Ex_Dividend_Long_EMA | Split_Ratio_Long_EMA | Adj_Open_Long_EMA | Adj_High_Long_EMA | Adj_Low_Long_EMA | Adj_Close_Long_EMA | Adj_Volume_Long_EMA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-03-23 | 170.0 | 172.68 | 168.6 | 168.845 | 41051076.0 | 0.0 | 1.0 | 170.0 | 172.68 | 168.6 | ... | 169.691723 | 171.097703 | 33582580.0 | 0.000518 | 1.0 | 171.126419 | 172.657153 | 169.671634 | 171.077463 | 33582580.0 |
| 2018-03-26 | 168.39 | 169.92 | 164.94 | 164.94 | 40248954.0 | 0.0 | 1.0 | 168.39 | 169.92 | 164.94 | ... | 169.581218 | 170.9545 | 33737610.0 | 0.000506 | 1.0 | 171.062781 | 172.593499 | 169.561596 | 170.934731 | 33737610.0 |
| 2018-03-27 | 168.07 | 173.1 | 166.44 | 172.77 | 36272617.0 | 0.0 | 1.0 | 168.07 | 173.1 | 166.44 | ... | 169.508167 | 170.996721 | 33796570.0 | 0.000494 | 1.0 | 170.993182 | 172.605278 | 169.489 | 170.977412 | 33796570.0 |


-------
<a class="anchor" id="Splitting"></a>

## Splitting the Data into a Training and Testing Set
The data is split with a custom function so that the test set covers 2017 onwards.
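
The body of `import_split_data` is not shown in this notebook, so the following is only a minimal sketch of what a date-based split could look like. The function name `split_by_date`, the toy frame, and the cutoff `2017-01-01` are assumptions made for illustration.

```python
import pandas as pd

def split_by_date(frame, cutoff):
    """Rows strictly before the cutoff go to train; the cutoff date onwards goes to test."""
    cutoff = pd.Timestamp(cutoff)
    train = frame.loc[frame.index < cutoff]
    test = frame.loc[frame.index >= cutoff]
    return train, test

# Toy frame indexed by business days around the 2016/2017 boundary.
dates = pd.bdate_range("2016-12-28", "2017-01-05")
toy = pd.DataFrame({"Close": range(len(dates))}, index=dates)

toy_train, toy_test = split_by_date(toy, "2017-01-01")
```

Splitting on the date rather than shuffling keeps every test row later than every training row, which is what a forecasting setup needs.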

In [10]:
from lib.helper import import_split_data

### Splitting the Data into a Train and Test Set:

In [11]:
X_train, X_test = import_split_data(company_name)

### Taking a Look at the Train Set:

In [12]:
X_train.head(2)

| Date | Open | High | Low | Close | Volume | Ex_Dividend | Split_Ratio | Adj_Open | Adj_High | Adj_Low | ... | Low_Long_EMA | Close_Long_EMA | Volume_Long_EMA | Ex_Dividend_Long_EMA | Split_Ratio_Long_EMA | Adj_Open_Long_EMA | Adj_High_Long_EMA | Adj_Low_Long_EMA | Adj_Close_Long_EMA | Adj_Volume_Long_EMA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1981-04-15 | 27.88 | 28.0 | 27.88 | 27.88 | 29700.0 | 0.0 | 1.0 | 0.409914 | 0.411679 | 0.409914 | ... | 27.100545 | 27.100545 | 395955.964908 | 0.0 | 1.0 | 0.399428 | 0.40078 | 0.398454 | 0.398454 | 22173530.0 |
| 1981-04-16 | 26.63 | 26.63 | 26.5 | 26.5 | 152000.0 | 0.0 | 1.0 | 0.391536 | 0.391536 | 0.389625 | ... | 27.086579 | 27.086579 | 390282.570375 | 0.0 | 1.0 | 0.399245 | 0.400565 | 0.398249 | 0.398249 | 21855820.0 |


### Taking a Look at the Test Set:

In [13]:
X_test.head(2)

| Date | Open | High | Low | Close | Volume | Ex_Dividend | Split_Ratio | Adj_Open | Adj_High | Adj_Low | ... | Low_Long_EMA | Close_Long_EMA | Volume_Long_EMA | Ex_Dividend_Long_EMA | Split_Ratio_Long_EMA | Adj_Open_Long_EMA | Adj_High_Long_EMA | Adj_Low_Long_EMA | Adj_Close_Long_EMA | Adj_Volume_Long_EMA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-01-03 | 116.65 | 117.2 | 115.43 | 115.82 | 30586265.0 | 0.0 | 1.0 | 115.209202 | 115.752409 | 114.004271 | ... | 110.936682 | 111.809549 | 32622640.0 | 0.006808 | 1.0 | 109.969134 | 110.881942 | 109.289159 | 110.149132 | 32622660.0 |
| 2017-01-04 | 115.8 | 116.33 | 114.76 | 116.15 | 28781865.0 | 0.0 | 1.0 | 114.369701 | 114.893155 | 113.342546 | ... | 111.025596 | 111.91049 | 32533320.0 | 0.00665 | 1.0 | 110.071473 | 110.975226 | 109.383424 | 110.255324 | 32533340.0 |


-----
<a class="anchor" id="Norm"></a>

## Normalizing the Data with a MinMaxScaler
**MinMaxScaler**: transforms features by scaling each one to a given range. Each feature is scaled and translated individually so that it falls within that range on the training set.

For this example, we will use the range (0, 1).

The helper function `mm_scaler` is imported to do the scaling.
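
For reference, the scaling rule is `x_scaled = (x - min) / (max - min)`, with `min` and `max` taken from the training set only. The sketch below illustrates that rule with NumPy on made-up numbers. It is not the body of `mm_scaler`, whose implementation is not shown here, and `fit_min_max` and `apply_min_max` are hypothetical names.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and range observed on the training data."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with statistics learned from the training set."""
    return (data - col_min) / col_range

# Toy values only (two features, e.g. a price and a volume).
train = np.array([[10.0, 100.0], [20.0, 300.0], [30.0, 200.0]])
test = np.array([[40.0, 150.0]])

col_min, col_range = fit_min_max(train)
train_sc = apply_min_max(train, col_min, col_range)
test_sc = apply_min_max(test, col_min, col_range)
```

Because the statistics come from the training set, test values outside the training range land outside (0, 1): here the first test feature scales to 1.5. For a price series that keeps rising after the split date, that is likely to be the normal case rather than an exception.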

In [14]:
from lib.helper import mm_scaler

In [15]:
X_train_sc, X_test_sc, y_train, y_test = mm_scaler(X_train, X_test, df)

-----
<a class="anchor" id="ARIMA"></a>

## Autoregressive Integrated Moving Average (ARIMA) Model

- **ARIMA:** a statistical model used to analyze and forecast time-series data.

**Parameters**

- p: The number of lag observations included in the model, also called the lag order.
- d: The number of times that the raw observations are differenced, also called the degree of differencing.
- q: The size of the moving average window, also called the order of moving average.
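
To make `d` and `p` concrete, here is a rough NumPy sketch: it differences a synthetic trending series once (`d=1`) and then estimates an autoregression with one lag (`p=1`) by ordinary least squares. This is an illustration under simplifying assumptions, not how statsmodels estimates ARIMA, and it leaves out the moving-average part (`q`). The function names and the synthetic data are made up for the example.

```python
import numpy as np

def difference(series, d):
    """Apply first-order differencing d times (the 'I' in ARIMA)."""
    out = np.asarray(series, dtype=float)
    for _ in range(d):
        out = np.diff(out)
    return out

def fit_ar_least_squares(series, p):
    """Estimate an intercept and p lag coefficients by ordinary least squares."""
    y = series[p:]
    # Column k holds the series lagged by k + 1 steps.
    lags = np.column_stack(
        [series[p - k - 1:len(series) - k - 1] for k in range(p)]
    )
    design = np.column_stack([np.ones(len(y)), lags])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef  # [intercept, phi_1, ..., phi_p]

rng = np.random.default_rng(0)
# Synthetic AR(1) increments with phi = 0.6, cumulated into a wandering level series.
n = 2000
noise = rng.normal(size=n)
increments = np.zeros(n)
for t in range(1, n):
    increments[t] = 0.6 * increments[t - 1] + noise[t]
level = np.cumsum(increments)

stationary = difference(level, d=1)
coef = fit_ar_least_squares(stationary, p=1)
```

On this synthetic series the estimated lag coefficient `coef[1]` should come out close to the 0.6 used to generate it. An order of `(5,0,0)` corresponds to five lags, no differencing, and no moving-average terms.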

In [31]:
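# fit model
# ARIMA expects a single (univariate) series; passing the full 84-column frame
# raised "ValueError: could not broadcast input array from shape (84) into shape (1)".
# Adj_Close is assumed here to be the series to forecast.
model = ARIMA(df_shift['Adj_Close'], order=(5,0,0))
model_fit = model.fit(disp=0)
print(model_fit.summary())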

In [29]:
# plot residual errors
# (uses the pd and plt aliases imported at the top of the notebook)
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
plt.show()
residuals.plot(kind='kde')
plt.show()
print(residuals.describe())

### Training a Random Forest Regression Model:

#### Evaluation:

- The results are poor: the model underfits and forecasts the price badly, as the graph below shows. This was expected, because a random forest is built from decision trees and its predictions therefore cluster around a limited set of values. The clusters in the predicted price make the problem visible in the graph.
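
The clustering follows from how a regression tree predicts: every leaf returns the mean of the training targets that fell into it, so no prediction can exceed the largest training target. The sketch below shows this with a hand-written, one-feature regression tree on toy data. It is not scikit-learn's random forest and not the `rf_model` helper; `build_tree` and `predict_one` are hypothetical names.

```python
import numpy as np

def build_tree(x, y, depth):
    """Recursively split on the median of x; each leaf stores the mean of its y values."""
    if depth == 0 or len(x) < 2:
        return {"leaf": float(y.mean())}
    threshold = float(np.median(x))
    left = x <= threshold
    if left.all() or not left.any():
        return {"leaf": float(y.mean())}
    return {
        "threshold": threshold,
        "left": build_tree(x[left], y[left], depth - 1),
        "right": build_tree(x[~left], y[~left], depth - 1),
    }

def predict_one(tree, value):
    """Walk down the tree until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["left"] if value <= tree["threshold"] else tree["right"]
    return tree["leaf"]

# Training targets rise linearly from 0 to 99; the test inputs lie beyond that range.
x_train = np.arange(100, dtype=float)
y_train = x_train.copy()
x_test = np.array([120.0, 150.0, 200.0])

tree = build_tree(x_train, y_train, depth=3)
preds = np.array([predict_one(tree, v) for v in x_test])
```

All three test inputs fall into the same rightmost leaf, so they receive the same prediction, and that prediction stays below the training maximum. Averaging many such trees does not change this, which is why a tree ensemble trained on older, lower prices struggles once the price moves to new highs.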

In [None]:
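# Assumption: rf_model lives in lib.helper alongside the other custom helpers.
from lib.helper import rf_model

rf_model(X_train_sc, y_train, X_test_sc, y_test,
         n_estimators=100, max_depth=15,
         min_samples_leaf=15, bootstrap=False)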

-----
<a class="anchor" id="PCA"></a>

## Decomposing Signal Components with Principal Component Analysis (PCA):

Sometimes centering and scaling the features independently is not enough, since a downstream model may also make assumptions about the linear independence of the features.

To address this issue, **Principal Component Analysis** (PCA) is used to decompose signal components.

- **Principal Component Analysis**: performs linear dimensionality reduction, using Singular Value Decomposition (SVD) of the data to project it onto a lower-dimensional space.

The imported `pca_decomposition` is a custom function, similar to the PCA function provided in the 3rd notebook.
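
As a rough picture of what such a decomposition does, the NumPy sketch below learns the principal axes from the training data with SVD and then projects both sets onto the first two axes. It is an illustration on synthetic data, not the code behind `pca_decomposition`; `fit_pca` and `project` are hypothetical names.

```python
import numpy as np

def fit_pca(train, n_components):
    """Learn the mean and top principal axes from the training data via SVD."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(train - mean, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return mean, vt[:n_components], explained[:n_components]

def project(data, mean, components):
    """Center with the training mean, then project onto the learned axes."""
    return (data - mean) @ components.T

rng = np.random.default_rng(1)
# Toy data: 5 strongly correlated features driven by 2 hidden factors plus small noise.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
train = factors @ mixing + 0.01 * rng.normal(size=(200, 5))
test = rng.normal(size=(20, 2)) @ mixing

mean, components, explained = fit_pca(train, n_components=2)
train_pca = project(train, mean, components)
test_pca = project(test, mean, components)
```

As with scaling, the mean and axes are learned on the training set only and then reused for the test set. Price columns such as Open, High, Low, and Close move together closely, so it is plausible that a couple of components capture most of their variance.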

In [None]:
from lib.helper import pca_decomposition

In [None]:
X_train_pca, X_test_pca = pca_decomposition(X_train_sc, X_test_sc, 2)

### Training a Random Forest Model with PCA Decomposition:

#### Evaluation:

- Again, as expected, the scores are poor. PCA did not help much: the score improved by only 0.05.
- This model suffers from the same problem as the plain random forest regression model: the predicted prices are clustered because of the decision-tree nature of the model.

In [None]:
rf_model(X_train_pca, y_train, X_test_pca, y_test,
         n_estimators=100, max_depth=10,
         min_samples_leaf=8, bootstrap=False)

-----
<a class="anchor" id="Gridsearch"></a>

## Grid Searching a Random Forest Regression Model:

- **GridSearch:** an exhaustive search over specified parameter values for an estimator.

In [None]:
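from sklearn.model_selection import GridSearchCV
# RandomForestRegressor is used in the pipeline below but was never imported
from sklearn.ensemble import RandomForestRegressor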

### Creating a Pipeline

In [None]:
pipe = Pipeline([
    ('rf', RandomForestRegressor())
])

### Setting up the Parameters:

In [None]:
# Number of trees to consider in random forest
n_estimators = [x for x in range(8, 14, 2)]

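# Number of features to consider at every split
# (1.0 means all features, which is what 'auto' selected for regressors;
# 'auto' is no longer accepted by recent scikit-learn releases)
max_features = [1.0, 'log2']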

# Maximum number of levels in tree to consider
max_depth = [x for x in range(1, 3)]
max_depth.append(None)

# Minimum number of samples required to split a node
# (scikit-learn requires an integer value of at least 2; not used in the grid below)
min_samples_split = [x for x in range(2, 4)]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]

# Method of selecting samples for training each tree
bootstrap = [True, False]

In [None]:
params = {'rf__n_estimators': n_estimators,
           'rf__max_features': max_features,
           'rf__max_depth': max_depth,
           'rf__min_samples_leaf': min_samples_leaf,
           'rf__bootstrap': bootstrap}
print(params)
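
"Exhaustive" means every combination of the listed values is tried. The short sketch below enumerates the combinations for a grid of the same shape as the one above, using only the standard library. The fold count is an assumption: 5 is the scikit-learn default from version 0.22 onwards when `cv` is not passed.

```python
from itertools import product

# Same shape as the parameter grid above (1.0 stands for "all features").
param_grid = {
    "rf__n_estimators": list(range(8, 14, 2)),  # [8, 10, 12]
    "rf__max_features": [1.0, "log2"],
    "rf__max_depth": [1, 2, None],
    "rf__min_samples_leaf": [1, 2],
    "rf__bootstrap": [True, False],
}

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)  # 3 * 2 * 3 * 2 * 2 = 72
cv_folds = 5  # assumption: default number of folds when cv is not specified
n_fits = n_candidates * cv_folds  # 360 fits, plus one final refit on all training data
```

The count grows multiplicatively with every added parameter, which is worth keeping in mind before widening any of the ranges.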

### Gridsearching the Parameters:

In [None]:
rf_search = GridSearchCV(pipe, params, n_jobs=3)

### Fitting the Scaled Data with the Model:

In [None]:
rf_search.fit(X_train_sc, y_train)

### Scoring the Training Data:

In [None]:
rf_search.score(X_train_sc, y_train)

### Scoring the Test Data:

In [None]:
rf_search.score(X_test_sc, y_test)

#### Evaluation:

- I wanted to beat the -2.32 score of the random forest with PCA, so I tried grid searching the parameters out of curiosity.
- The result is a slightly better score: this model scored -2.22, which is still poor, but it beats the model I was comparing it to.
- Again, the clusters in the graph below show where the model falls short.
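
The negative scores above make sense once the metric is spelled out. Assuming the default scoring, `score` on a regressor reports R², defined as `1 - SS_res / SS_tot`, and it drops below zero whenever the model does worse than always predicting the mean of the actual values. The sketch below computes it by hand on toy numbers; `r2_score_manual` is a hypothetical name and the prices are made up.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy prices that keep rising, and a model stuck near an older, lower level.
actual = [150.0, 160.0, 170.0, 180.0]
stuck = [140.0, 140.0, 140.0, 140.0]
baseline = [165.0] * 4  # always predicting the mean of the actual values

r2_stuck = r2_score_manual(actual, stuck)        # 1 - 3000 / 500 = -5.0
r2_baseline = r2_score_manual(actual, baseline)  # 1 - 500 / 500 = 0.0
```

A model whose predictions plateau below the actual prices produces exactly this pattern, so a score around -2 is consistent with the clustering seen in the plots.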


### Plotting the Results:

In [None]:
plt.figure(figsize=(14,7))
sns.set_style("darkgrid")
# Pass x and y as keywords (required by seaborn 0.12+); the label gives plt.legend() an entry
sns.regplot(x=y_test, y=rf_search.predict(X_test_sc), label='Predicted vs. actual')
plt.title('Random Forest Regression (Gridsearch): Predicted and Actual Prices', fontsize=18)
plt.xlabel('Actual', fontsize=16)
plt.ylabel('Predicted', fontsize=16)
plt.legend()
plt.tight_layout()