# Random Forest Classification Models

---------

###  Overview: 
- [Importing the Data](#Importing)
- [Data Preprocessing](#DPP)
- [Splitting the Data](#Splitting)
- Classification Models:
    - [Random Forest Classification](#RFC)
    - [Bagging Classification](#Bagging)
    - [GridSearching a Random Forest](#Gridsearch)
    - [GridSearching a Bagging Classifier](#Gridsearch2)

--------


## Importing Libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

from datetime import datetime

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

import sys
sys.path.append('..')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%load_ext autoreload
%autoreload 2

-----

## Company Name

In [2]:
company_name = 'Apple'

-------
<a class="anchor" id="Importing"></a>

# Importing the Data

### Importing the Raw Dataframe:

In [3]:
def file_importer(company_name, file_name):
    """
    Imports a dataframe according to the company and file name.
    
    Parameters
    ------------
    company_name : str
        The company's name, used as the prefix of the CSV file name.
    
    file_name : str
        The file name suffix, without the '.csv' extension.

    Returns
    ------------
    df : pandas.DataFrame
        The dataset indexed by date, sorted in ascending order.
    """
    # Read the dataset from a CSV file named '<company_name>_<file_name>.csv'.
    df = pd.read_csv(f'data/{company_name}_{file_name}.csv')
    # Parse the dates, set them as the index, and sort in ascending order.
    df['Date'] = pd.to_datetime(df.Date)
    df.set_index('Date', inplace=True)
    df.sort_index(inplace=True, ascending=True)
    return df

### Importing the Engineered Dataframe:

In [4]:
df = file_importer(company_name, 'wSEC_Inner')
df.head(3)

Date,Open,High,Low,Close,Volume,Ex_Dividend,Split_Ratio,Adj_Open,Adj_High,Adj_Low,...,PX14A6G,S-3,S-3ASR,S-4,S-8,SC 13D,SC 13G,SC TO-I,SD,UPLOAD
1994-01-26,33.75,34.0,33.25,33.5,1480400.0,0.0,1.0,1.057482,1.065315,1.041815,...,0,0,0,0,0,0,0,0,0,0
1994-01-26,33.75,34.0,33.25,33.5,1480400.0,0.0,1.0,1.057482,1.065315,1.041815,...,0,0,0,0,0,0,0,0,0,0
1994-02-10,36.25,37.5,36.0,36.5,2696700.0,0.0,1.0,1.139548,1.178843,1.131689,...,0,0,0,0,0,0,1,0,0,0


--------
<a class="anchor" id="DPP"></a>

# Data Preprocessing - Preparing the Data for Modeling:


### Shifting the Dates for the Engineered Dataframe:

- Using a custom function to shift the data one business-day into the future.


In [5]:
from lib.helper import date_shifter

In [6]:
df_shifted = date_shifter(df)
df_shifted.head(3)

Date,Open,High,Low,Close,Volume,Ex_Dividend,Split_Ratio,Adj_Open,Adj_High,Adj_Low,...,PX14A6G,S-3,S-3ASR,S-4,S-8,SC 13D,SC 13G,SC TO-I,SD,UPLOAD
1994-01-26,33.75,34.0,33.25,33.5,1480400.0,0,1,1.05748,1.06532,1.04182,...,0,0,0,0,0,0,0,0,0,0
1994-02-10,33.75,34.0,33.25,33.5,1480400.0,0,1,1.05748,1.06532,1.04182,...,0,0,0,0,0,0,0,0,0,0
1994-02-17,36.25,37.5,36.0,36.5,2696700.0,0,1,1.13955,1.17884,1.13169,...,0,0,0,0,0,0,1,0,0,0


### Setting the Label:
- The Adjusted Close Difference will be set as the target to predict, encoded as 1 when the difference is zero or positive and 0 when it is negative. In simple words, the models will predict whether the price will increase or decrease following a document filing.

In [7]:
df_shifted['Target'] = df_shifted.Adj_Close_Diff.apply(lambda x: str(1) if x >= 0 else str(0))

### Dropping the Continuous Data:
- In the following cell, the continuous data will be dropped, keeping all of the Categorical (Binary) data for a classification model.

In [8]:
new_df = df_shifted.loc[:, 'document_type':'Target']

### Converting All Values into Integers:
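- Converting every column that can be parsed as a number (including the string-valued Target) to a numeric dtype. Columns that cannot be parsed, such as document_type, are left unchanged.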

In [9]:
new_df = new_df.apply(pd.to_numeric, errors='ignore')

In [10]:
new_df.tail(3)

Date,document_type,10-K,10-K405,10-Q,424B2,424B3,424B5,8-A12B,8-K,CERTNYS,...,S-3,S-3ASR,S-4,S-8,SC 13D,SC 13G,SC TO-I,SD,UPLOAD,Target
2018-02-12,10-Q,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2018-02-14,SC 13G,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
2018-03-07,8-K,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


--------
<a class="anchor" id="Splitting"></a>

## Importing the Training and Test Set:

### Importing the Training Dataset:

In [11]:
X_train = file_importer(company_name, 'SEC_X_Train')
X_train.drop(columns='document_type', inplace=True)
X_train.head(3)

Date,10-K,10-K405,10-Q,424B2,424B3,424B5,8-A12B,8-K,CERTNYS,CORRESP,...,PX14A6G,S-3,S-3ASR,S-4,S-8,SC 13D,SC 13G,SC TO-I,SD,UPLOAD
1994-01-26,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1994-02-10,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1994-02-17,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


-------
### Importing the Testing Dataset:

In [12]:
X_test = file_importer(company_name, 'SEC_X_Test')
X_test.drop(columns='document_type', inplace=True)
X_test.head(3)

Date,10-K,10-K405,10-Q,424B2,424B3,424B5,8-A12B,8-K,CERTNYS,CORRESP,...,PX14A6G,S-3,S-3ASR,S-4,S-8,SC 13D,SC 13G,SC TO-I,SD,UPLOAD
2017-01-06,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2017-01-06,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2017-01-19,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


---------

### Splitting the Label:
- The label will be split into a training set and a test set, matching the date ranges of `X_train` and `X_test`, to validate our models.

In [13]:
y_train = new_df[X_train.index[0] : X_train.index[-1]].Target.values

In [14]:
y_test = new_df[X_test.index[0] : X_test.index[-1]].Target.values
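
The two cells above rely on label-based slicing of a sorted `DatetimeIndex`, which includes both endpoints and every row that shares a date. The sketch below illustrates that behaviour on a small made-up frame; `toy`, `idx`, and `window` are hypothetical names, not part of the notebook's data.

```python
import pandas as pd

# Toy frame with a sorted DatetimeIndex that contains a duplicate date,
# mirroring days on which more than one filing was made.
idx = pd.to_datetime(['2017-01-06', '2017-01-06', '2017-01-19', '2017-02-01'])
toy = pd.DataFrame({'Target': [1, 0, 1, 0]}, index=idx)

# Slicing by label keeps both endpoints and all rows with a duplicated date.
window = toy.loc[idx[0]:idx[2]]
print(len(window))
```

In the same spirit, comparing `len(y_train)` with `len(X_train)` (and likewise for the test set) is a cheap check that the labels and features still line up after slicing.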

-----
<a class="anchor" id="RFC"></a>


# Random Forest Classification Model
- **Random Forest:** is a meta estimator that trains a number of decision tree classifiers on various sub-samples of the dataset and averages their predictions to improve the predictive accuracy and control over-fitting.

### Setting up the Random Forest (RF) Classification:
- **Parameters:**
    - Number of Estimators is the number of trees in the forest, which has been set to 100.
    - Maximum depth is the maximum number of levels in each decision tree. This parameter was set to 15.
    - Minimum leaf samples is the minimum number of samples required at a leaf node, which has been set to 3.
    - Bootstrapping will use subsamples with replacement, which has been set as True.
        - Bootstrapping has been set as True simply because I read a book on financial modeling that recommended it.
    - Balanced Subsample mode will compute weights based on the bootstrap sample for every tree grown.
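
As a rough illustration of what "balanced" weighting means, the sketch below computes class weights as `n_samples / (n_classes * count_per_class)` on made-up labels and compares them with scikit-learn's `compute_class_weight`. The `balanced_subsample` mode applies the same idea to each tree's bootstrap sample rather than to the full training set. The array `y_toy` is hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: four zeros and six ones.
y_toy = np.array([0] * 4 + [1] * 6)
classes = np.unique(y_toy)

# scikit-learn's "balanced" heuristic.
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

# The same weights computed by hand: rarer classes receive larger weights.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))
print(dict(zip(classes.tolist(), weights.tolist())))
```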

In [23]:
rf = RandomForestClassifier(n_estimators=100, criterion='entropy', 
                            max_depth=15, min_samples_leaf=3, bootstrap=True, 
                            n_jobs=3, random_state=42, class_weight='balanced_subsample')

### Training the Data with the RF Model:

In [24]:
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
            criterion='entropy', max_depth=15, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=3,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=3, oob_score=False, random_state=42,
            verbose=0, warm_start=False)

### Scoring the Train and Test Set:

**Evaluation:**
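- The model performs below baseline: the test accuracy (about 0.47) is lower than the roughly 0.60 that always predicting the majority class would achieve on the test set.
- Overfit: The score on the training dataset (about 0.58) outperforms the testing score.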

In [25]:
rf.score(X_train, y_train)

0.5777351247600768

In [29]:
rf.score(X_test, y_test)

0.46551724137931033

### Inspecting the Average Prediction:

In [30]:
y_test.mean()

0.603448275862069

In [31]:
rf.predict(X_test).mean()

0.5172413793103449
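
The mean of a 0/1 label vector doubles as a baseline: a model that always predicts the majority class scores `max(mean, 1 - mean)`. The sketch below shows this with scikit-learn's `DummyClassifier` on made-up data with roughly 60% positive labels; `X_toy` and `y_toy` are hypothetical and are not the notebook's test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(42)

# Hypothetical binary features and labels (about 60% positives).
X_toy = rng.integers(0, 2, size=(58, 5))
y_toy = np.array([1] * 35 + [0] * 23)

# A classifier that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(baseline_acc, y_toy.mean())
```

Any accuracy reported for a real model is easier to interpret next to this number.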

-------
<a class="anchor" id="Bagging"></a>

# Bagging Classification Model using Random Forest:

- **Bagging Classifier:** is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates the individual predictions, by either voting or averaging, to form a final prediction.
- Bagging is a method often used to reduce variance in a decision tree estimator by introducing randomization into its construction procedure and then making an ensemble out of it.
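
To make the idea concrete, here is a minimal hand-rolled sketch of bagging: each decision tree is trained on a bootstrap sample (rows drawn with replacement) and the final prediction is a majority vote. This is a simplification for illustration, not scikit-learn's `BaggingClassifier` implementation, and the toy data and names (`X_toy`, `y_toy`, `trees`) are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical binary features with a simple deterministic label rule.
X_toy = rng.integers(0, 2, size=(200, 6))
y_toy = (X_toy[:, 0] ^ X_toy[:, 1]).astype(int)

n_estimators = 25
trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw row indices with replacement.
    rows = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    trees.append(tree.fit(X_toy[rows], y_toy[rows]))

# Aggregate by majority vote across the fitted trees.
votes = np.stack([tree.predict(X_toy) for tree in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_acc = (bagged_pred == y_toy).mean()
print(bagged_acc)
```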

**Parameters:**
- The base estimator is the estimator, or model, to be fit (trained) on random subsets of the dataset. A Random Forest estimator will be the base for this model.
- Number of Estimators is the number of base estimators in the ensemble, which has been set to 100. Each one here is itself a Random Forest of 100 trees.
- The Maximum Number of features is the share of features drawn to train each base estimator. This has been set to 1.0, so every base estimator sees all of the features.

In [32]:
# Note: scikit-learn 1.2 and later name this argument `estimator` instead of `base_estimator`.
bc = BaggingClassifier(base_estimator=rf, n_estimators=100, 
                   max_features=1.0, n_jobs=3, random_state=42)

### Fitting the Data with the Bagging Classification:

In [33]:
bc.fit(X_train, y_train)

BaggingClassifier(base_estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
            criterion='entropy', max_depth=15, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=3,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=3, oob_score=False, random_state=42,
            verbose=0, warm_start=False),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=1.0, n_estimators=100, n_jobs=3, oob_score=False,
         random_state=42, verbose=0, warm_start=False)

### Scoring the Train and Test Set:

**Evaluation:**
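- This model scores about 0.43 on the test set, slightly below the single Random Forest above and below the majority-class share of about 0.60.
- Still relatively overfit, as the train score (about 0.58) outperforms the test score.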

In [34]:
bc.score(X_train, y_train)

0.5758157389635317

In [35]:
bc.score(X_test, y_test)

0.43103448275862066

### Inspecting the Average Prediction:

In [36]:
y_test.mean()

0.603448275862069

In [37]:
bc.predict(X_test).mean()

0.3793103448275862

-----
<a class="anchor" id="Gridsearch"></a>


# Grid Searching

- **Pipelines:** chain a sequence of data transformations together, applied in order and ending in a final estimator (the model).

- **GridSearch:** is an exhaustive search over specified parameter values for an estimator, which will allow the model to be tuned to perform better.

### Pipeline for the Model:
- Principal Component Analysis (PCA). 
- Random Forest Estimator.


### Decomposing Signal Components with Principal Component Analysis (PCA):

Sometimes, centering and scaling the features independently is not enough, since a downstream model can further make some assumption on the linear independence of the features.

To address this issue, **Principal Component Analysis** (PCA) is used to decompose signal components.

- **Principal Component Analysis**: performs linear dimensionality reduction using Singular Value Decomposition (SVD) of the data to project it to a lower dimensional space.
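
The SVD-based projection can be sketched directly with NumPy: center the data, take the SVD, and project onto the leading right singular vectors. The example below does this on made-up data and compares the result with scikit-learn's `PCA`. Component signs are arbitrary, so only absolute values are compared. `X_toy` and `k` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 6))  # hypothetical data: 50 samples, 6 features

# Center each feature, then decompose the centered matrix.
X_centered = X_toy - X_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first k principal directions (rows of Vt).
k = 2
projected = X_centered @ Vt[:k].T

# The same projection via scikit-learn.
sk_projected = PCA(n_components=k, svd_solver='full').fit_transform(X_toy)
print(np.allclose(np.abs(projected), np.abs(sk_projected)))
```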


### Establishing the Pipeline:

In [48]:
pipe = Pipeline([
    ('pca', PCA(random_state=42)),
    ('rf', RandomForestClassifier(random_state=42))
])

### Establishing Parameters:

In [49]:
# Number of trees in random forest
n_estimators = list(range(2, 16, 2))

# Maximum number of levels in tree
max_depth = list(range(6, 18, 2))

# Minimum number of samples required to split a node
min_samples_split = list(range(2, 12, 2))

# Minimum number of samples required at each leaf node
min_samples_leaf = list(range(2, 10, 3))

# Method of selecting samples for training each tree
# (defined here but not passed to the grid below, so the default bootstrap=True is used)
bootstrap = False

# PCA: Number of components to be used
n_components = list(range(2, 10, 3))

# Split-quality criterion used when growing each tree
criterion = ['entropy']

### Setting the Parameters for the GridSearch Model:

In [50]:
params = {
    'pca__n_components': n_components,
    'rf__criterion': criterion,
    'rf__n_estimators': n_estimators,
    'rf__max_depth': max_depth,
    'rf__min_samples_split': min_samples_split,
    'rf__min_samples_leaf': min_samples_leaf,
}
print(params)

{'pca__n_components': [2, 5, 8], 'rf__criterion': ['entropy'], 'rf__n_estimators': [2, 4, 6, 8, 10, 12, 14], 'rf__max_depth': [6, 8, 10, 12, 14, 16], 'rf__min_samples_split': [2, 4, 6, 8, 10], 'rf__min_samples_leaf': [2, 5, 8]}
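
Because the search is exhaustive, its cost grows with the product of the list lengths. The sketch below counts the candidate combinations in the grid printed above using scikit-learn's `ParameterGrid`, and multiplies by the 5 time-series splits used for this search. The grid is re-typed here so the block runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# The grid printed above, copied so that this block is self-contained.
params = {
    'pca__n_components': [2, 5, 8],
    'rf__criterion': ['entropy'],
    'rf__n_estimators': [2, 4, 6, 8, 10, 12, 14],
    'rf__max_depth': [6, 8, 10, 12, 14, 16],
    'rf__min_samples_split': [2, 4, 6, 8, 10],
    'rf__min_samples_leaf': [2, 5, 8],
}

n_candidates = len(ParameterGrid(params))  # 3 * 1 * 7 * 6 * 5 * 3
n_splits = 5                               # folds used for this search
n_fits = n_candidates * n_splits           # excludes the final refit
print(n_candidates, n_fits)
```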


### Setting up a Custom Cross Validation for Sequential Data:
Since the dataset is sequential, standard shuffled k-fold cross-validation is not appropriate here: no fold should be trained on observations that come after the period it is validated on.

- **Time Series Split:** provides train/test indices for splitting time series samples that are observed at fixed time intervals. In each split, the test indices must be higher than in the previous split, so shuffling in the cross-validator is inappropriate.

- Number of splits: 5 
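
To see what the splitter produces, the sketch below runs `TimeSeriesSplit(n_splits=5)` on a made-up array of 12 ordered samples and prints each train/test index pair. The training window grows with every split and always ends before the test window begins. `X_toy` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve ordered samples standing in for a date-sorted training set.
X_toy = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=5).split(X_toy))
for train_idx, test_idx in splits:
    # Every training index precedes every test index.
    print(train_idx, test_idx)
```

Note that `.split()` returns a generator, which can be consumed only once. Passing the `TimeSeriesSplit` object itself as `cv` is an alternative that can be reused across searches.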

### Instantiating the Time Series Split:

In [51]:
time_cv = TimeSeriesSplit(n_splits=5).split(X_train)

### GridSearching the Model:

In [52]:
rf_search = GridSearchCV(pipe, params, n_jobs=3, cv=time_cv)

### Training the Model using the GridSearch Parameters:

In [53]:
rf_search.fit(X_train, y_train)

GridSearchCV(cv=<generator object TimeSeriesSplit.split at 0x119f66938>,
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            ...stimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=3,
       param_grid={'pca__n_components': [2, 5, 8], 'rf__criterion': ['entropy'], 'rf__n_estimators': [2, 4, 6, 8, 10, 12, 14], 'rf__max_depth': [6, 8, 10, 12, 14, 16], 'rf__min_samples_split': [2, 4, 6, 8, 10], 'rf__min_samples_leaf': [2, 5, 8]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

### Scoring the Train and Test Set:
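- This model also seems to be overfit, although less so: the test score improved to about 0.52, which is still below the share of positive labels in the test set (about 0.60).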

In [54]:
rf_search.score(X_train, y_train)

0.54510556621881

In [55]:
rf_search.score(X_test, y_test)

0.5172413793103449

### Looking at the Best Parameters:

In [56]:
rf_search.best_estimator_.named_steps['rf'].get_params()

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'entropy',
 'max_depth': 6,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 8,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 2,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

### Taking a Look at the Important Features:

In [57]:
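# The forest is fit on the PCA output, so each importance belongs to a principal
# component (PC1, PC2, ...) rather than to an original filing-type column.
rf_best = rf_search.best_estimator_.named_steps['rf']
feature_importance = pd.DataFrame(rf_best.feature_importances_,
                                  index=[f'PC{i + 1}' for i in range(len(rf_best.feature_importances_))])\
                                  .sort_values(0, ascending=False)
feature_importance

Unnamed: 0,0
PC3,0.239136
PC5,0.180925
PC7,0.177197
PC2,0.123888
PC1,0.119019
PC6,0.110845
PC4,0.04899
PC8,0.0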


---------
<a class="anchor" id="Gridsearch2"></a>

# GridSearching a Bagging Classifier


### Pipeline:
- Principal Component Analysis (PCA).
- Bagging Classifier with its default base estimator (a decision tree), since no base estimator is passed in the pipeline below.

In [62]:
pipe2 = Pipeline([
    ('pca', PCA(random_state=42)),
    ('br', BaggingClassifier(random_state=42))
])

### Establishing Parameters:

In [63]:
# Number of base estimators in the bagging ensemble
n_estimators = list(range(5, 100, 20))

# Fraction of features drawn to train each base estimator
max_features = list(np.linspace(.10, 1, 5))

# PCA: Number of components (defined here but not passed to the grid below)
n_components = list(range(2, 10, 3))

# Split-quality criterion (defined here but not passed to the grid below)
criterion = ['entropy']

### Setting the Parameters for the GridSearch Model:

In [64]:
params2 = {
    'br__n_estimators' : n_estimators,
    'br__max_features' : max_features,   
}
print(params2)

{'br__n_estimators': [5, 25, 45, 65, 85], 'br__max_features': [0.1, 0.325, 0.55, 0.775, 1.0]}


### Setting up a Custom Cross Validation for Sequential Data:

In [65]:
time_cv2 = TimeSeriesSplit(n_splits=4).split(X_train)

### GridSearching the Bagging Classification Model:

In [66]:
grid = GridSearchCV(pipe2, params2, n_jobs=3, cv=time_cv2)

### Fitting the Training Set using GridSearch:

In [67]:
grid.fit(X_train, y_train)

GridSearchCV(cv=<generator object TimeSeriesSplit.split at 0x119d50888>,
       error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=42,
  svd_solver='auto', tol=0.0, whiten=False)), ('br', BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=1, oob_score=False, random_state=42,
         verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=3,
       param_grid={'br__n_estimators': [5, 25, 45, 65, 85], 'br__max_features': [0.1, 0.325, 0.55, 0.775, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

### Scoring on the Training and Test Sets:
**Evaluation:**
- The model is still relatively overfit and performs worse on the test set than the Random Forest model that was GridSearched.

In [68]:
grid.score(X_train, y_train)

0.5834932821497121

In [69]:
grid.score(X_test, y_test)

0.43103448275862066

### Taking a Look at the Best Parameters:

In [70]:
grid.best_estimator_.named_steps['br'].get_params()

{'base_estimator': None,
 'bootstrap': True,
 'bootstrap_features': False,
 'max_features': 0.55,
 'max_samples': 1.0,
 'n_estimators': 45,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}