### Note: Using this notebook
This notebook walks through the available data and introduces the concepts and tools you can use to prepare, propose, and solve data science problems. Each code cell in this notebook can be executed to replicate the results.

For tips/tricks on using Jupyter Notebooks, please see: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Notebook%20Basics.html

## Load Packages
Pandas: for working with data frames (somewhat like Excel for Python)

NumPy: for numerical operations

glob: for finding files by path pattern

In [2]:
import pandas as pd
import numpy as np
from glob import glob 


Let's read in the data prepared in the previous week.

In [3]:
training_data_with_target = pd.read_csv("../Week 5/full_data_with_features.csv")

## Training the Model

We are now ready to use this dataset, with its defined target (i.e. output) and features (i.e. inputs), to create a model. We will use the Random Forest classifier available through the Python scikit-learn library and evaluate its performance.

### Train and Test Split
To evaluate the performance of any model, we need data for which the truth is known but which the model has not seen during training, so we hold out a subset of our prepared data. Otherwise, we would have to wait for new data to arrive before knowing whether our model is any good.

Let's finish processing our prepared data for modeling and create the training and testing splits.

Following convention, we will denote our inputs with X and our outputs with y.

In [5]:
is_na_condition = (training_data_with_target["ratio_up10days"].isna()) | (training_data_with_target["beat"].isna())

In [6]:
#Drop all rows where ratio_up10days or the target (beat) is NA/null
training_data_with_target = training_data_with_target[~is_na_condition]

#Split into X (inputs) and y (output)
X = training_data_with_target.reset_index()[['ratio_updays','ratio_up10days','ratio_up60days',
                                             'upday_50','upday10_50','upday60_50',
                                             'ratio_up_return','ratio_up_volume',]]
y = training_data_with_target.reset_index()["beat"]
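The filtering step above can also be written with `dropna`. The sketch below uses a small hypothetical frame (only the two column names come from the notebook; the values are made up) to show that the boolean mask and `dropna(subset=...)` keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the column names match the notebook, the values do not.
toy = pd.DataFrame({
    "ratio_up10days": [0.38, np.nan, 0.52, 0.61],
    "beat": [1.0, 0.0, np.nan, 1.0],
})

# Same pattern as the notebook: flag rows where either column is missing.
mask = toy["ratio_up10days"].isna() | toy["beat"].isna()
via_mask = toy[~mask]

# Equivalent one-liner: drop rows with a missing value in either column.
via_dropna = toy.dropna(subset=["ratio_up10days", "beat"])

print(via_mask.equals(via_dropna))  # True
print(list(via_mask.index))         # [0, 3]
```

Note that the surviving rows keep their original index labels (0 and 3 here), which is why the notebook calls `reset_index()` before selecting columns. Called without `drop=True`, `reset_index()` keeps the old labels in a column named `index`, which is the extra column visible when the full frame is displayed later.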

In [7]:
X.head(10)

Unnamed: 0,ratio_updays,ratio_up10days,ratio_up60days,upday_50,upday10_50,upday60_50,ratio_up_return,ratio_up_volume
0,0.352941,0.382353,0.382353,0.0,0.0,0.0,0.264706,0.264706
1,0.629032,0.725806,0.387097,1.0,1.0,0.0,0.483871,0.241935
2,0.421875,0.515625,0.90625,0.0,1.0,1.0,0.375,0.265625
3,0.507692,0.615385,0.676923,1.0,1.0,1.0,0.430769,0.230769
4,0.5,0.435484,0.274194,0.0,0.0,0.0,0.387097,0.209677
5,0.442623,0.42623,0.245902,0.0,0.0,0.0,0.311475,0.295082
6,0.4375,0.40625,0.71875,0.0,0.0,1.0,0.359375,0.21875
7,0.47541,0.311475,0.04918,0.0,0.0,0.0,0.327869,0.245902
8,0.516129,0.580645,0.822581,1.0,1.0,1.0,0.403226,0.241935
9,0.47541,0.57377,0.918033,0.0,1.0,1.0,0.327869,0.262295


In [8]:
y.head(10)

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
5    1.0
6    0.0
7    0.0
8    1.0
9    1.0
Name: beat, dtype: float64

In [9]:
#Let's import the train and test split from sklearn
from sklearn.model_selection import train_test_split

We have split the inputs and outputs, and now we can use the train_test_split function to put 90% of the data into X_train and 10% into X_test (with the corresponding values for y as well). We take the additional step of stratifying our sampling across the ticker codes, i.e. we want each company to make up the same relative proportion of rows in the 90% and 10% splits as it does in the overall data. It is important to stratify the dataset when sub-populations are involved.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.9, test_size = 0.1,
                                                    stratify = training_data_with_target.reset_index()["ticker"])

In [11]:
X_train.shape

(2034, 8)

In [12]:
X_test.shape

(226, 8)
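To make the idea of stratifying on ticker concrete, here is a minimal plain-Python sketch. It is not how scikit-learn implements `stratify` (the library handles rounding and edge cases differently); the function name `stratified_split` and the tickers `AAA`, `BBB`, `CCC` are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(groups, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices), taking test_fraction of each group."""
    rng = random.Random(seed)

    # Collect the row positions belonging to each group (e.g. each ticker).
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    train_idx, test_idx = [], []
    for members in by_group.values():
        rng.shuffle(members)                           # random order within the group
        n_test = round(len(members) * test_fraction)   # this group's share of the test set
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical tickers with deliberately unequal row counts.
tickers = ["AAA"] * 60 + ["BBB"] * 30 + ["CCC"] * 10
train_idx, test_idx = stratified_split(tickers, test_fraction=0.1)

print(Counter(tickers[i] for i in train_idx))  # Counter({'AAA': 54, 'BBB': 27, 'CCC': 9})
print(Counter(tickers[i] for i in test_idx))   # Counter({'AAA': 6, 'BBB': 3, 'CCC': 1})
```

Both splits keep the 60/30/10 mix of the full data. A purely random 10% sample could easily leave the smallest group out of the test set altogether.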

#### Model Building
Let's use the scikit-learn RandomForestClassifier to build our model using X_train and y_train. We set a few parameters for our classifier (the number of trees, the maximum tree depth, and the random seed). You should read up on the significance of these parameters and try changing them to see their effect on the accuracy of the model.

In [13]:
#Load in the package
from sklearn.ensemble import RandomForestClassifier

#Initialize a classifier 
clf = RandomForestClassifier(n_estimators=10000, max_depth=5, random_state=0)

In [14]:
# Fit the model to the training data
clf.fit(X_train, y_train) 

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10000, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)
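One way to explore the effect of these parameters is to loop over a few values and compare held-out accuracy. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the notebook's dataset, the grid values are arbitrary, and it scores with `accuracy_score`, which the notebook introduces further down.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the notebook's dataset): 500 rows, 8 features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0, stratify=y_demo
)

# Try a small, arbitrary grid of tree depths and forest sizes.
results = {}
for depth in (2, 5, None):          # None lets each tree grow until its leaves are pure
    for n_trees in (10, 100):
        model = RandomForestClassifier(
            n_estimators=n_trees, max_depth=depth, random_state=0
        )
        model.fit(X_tr, y_tr)
        results[(depth, n_trees)] = accuracy_score(y_te, model.predict(X_te))

for (depth, n_trees), acc in results.items():
    print(f"max_depth={depth}, n_estimators={n_trees}: accuracy={acc:.3f}")
```

With a test set this small, differences between settings can be noise. Cross-validation on the training data would give a steadier comparison.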

#### Model Testing
Now that we have fit the model to the training data, let's see how we can use it to create predictions.

In [15]:
#Use clf.predict to get the predictions
y_pred = clf.predict(X_test)


In [16]:
#Add the predictions to the original data subset 
subset = training_data_with_target.reset_index().loc[y_test.index]
subset["predicted"] = y_pred

In [17]:
#Let's look at our predictions, lined up with the actual data
subset.head(10)

Unnamed: 0,index,ticker,fiscal_year,fiscal_quarter,ratio_updays,ratio_up10days,ratio_up60days,upday_50,upday10_50,upday60_50,ratio_up_return,ratio_up_volume,beat,predicted
673,683,CSCO,2006,2,0.412698,0.349206,0.253968,0.0,0.0,0.0,0.063492,0.285714,1.0,1.0
1184,1200,INTC,2008,3,0.515625,0.46875,0.46875,1.0,0.0,0.0,0.1875,0.171875,0.0,1.0
1258,1275,INTU,2012,2,0.428571,0.634921,0.873016,0.0,1.0,1.0,0.142857,0.174603,1.0,1.0
1938,1969,SYMC,2008,4,0.459016,0.344262,0.377049,0.0,0.0,0.0,0.245902,0.262295,1.0,1.0
458,465,ATVI,2011,4,0.460317,0.47619,0.190476,0.0,0.0,0.0,0.174603,0.222222,1.0,1.0
1073,1087,FTR,2006,2,0.421875,0.484375,0.71875,0.0,0.0,1.0,0.109375,0.21875,1.0,1.0
154,155,ADP,2008,4,0.587302,0.555556,0.809524,1.0,1.0,1.0,0.174603,0.206349,1.0,1.0
1297,1315,KLAC,2007,2,0.587302,0.650794,1.0,1.0,1.0,1.0,0.253968,0.238095,1.0,1.0
1499,1521,MCHP,2013,4,0.568966,0.534483,0.862069,1.0,1.0,1.0,0.137931,0.258621,1.0,1.0
2026,2057,VRSN,2000,4,0.47619,0.301587,0.428571,0.0,0.0,0.0,0.333333,0.31746,1.0,1.0


We can see here that our model primarily predicts that actual EPS will beat consensus. This is not surprising, considering that the very basic strategy of always saying beat gives you 66% accuracy. However, we need to evaluate how our model is actually doing on the entire test set rather than on just these ten rows. For this, we need to consider a few metrics.
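The "always say beat" baseline is simply the share of the most frequent label. A minimal sketch, using only the ten `beat` values printed by `y.head(10)` earlier (so the result will not match the 66% figure, which refers to the full dataset); the helper name `majority_baseline_accuracy` is hypothetical:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# The ten `beat` values shown in the y.head(10) output.
sample_beat = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]

label, baseline_acc = majority_baseline_accuracy(sample_beat)
print(label, baseline_acc)  # 1.0 0.8
```

Any model worth keeping should beat this number on the same data.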

##### Precision
Precision looks at how often our model produces false positives. It is given by the formula:

$$\text{Precision} = \frac{\text{Number of True Positives}}{\text{Total Number of Predicted Positives}}$$

A higher value is better, since it means that very few actual 0’s are falsely classified as 1’s.

##### Recall 
Recall looks at how sensitive our model is. It seeks to answer the question: how many of the actual 1’s did we correctly identify? It is given by the formula:

$$\text{Recall} = \frac{\text{Number of True Positives}}{\text{Total Number of Actual Positives}}$$

Here too, a higher number is better, as it means that fewer actual 1’s are wrongly classified as 0’s.

##### Accuracy 
Accuracy looks to answer the question: how often do we make the right decision? In other words, does the model correctly predict 1 when the actual value is 1, and 0 when the actual value is 0? This is captured in the formula:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Examples}}$$

This metric is, however, sensitive to imbalanced classes (cases where the number of 1’s is much greater than the number of 0’s, as is the case here, since companies more often beat consensus estimates based on the way we calculate a beat).
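The three metrics can be checked by hand on a small example. The sketch below counts true/false positives and negatives in plain Python; the labels are hypothetical and the helper names (`confusion_counts`, `precision_recall_accuracy`) are made up for this illustration, not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct among predicted 1's
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # found among actual 1's
    accuracy = (tp + tn) / (tp + fp + fn + tn)        # correct among all examples
    return precision, recall, accuracy

# Hypothetical labels: 7 actual beats and 3 misses; the model predicts 8 beats.
y_true_demo = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred_demo = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]

precision, recall, accuracy = precision_recall_accuracy(y_true_demo, y_pred_demo)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
# precision=0.750 recall=0.857 accuracy=0.700
```

Here there are 6 true positives, 2 false positives, 1 false negative, and 1 true negative, so precision is 6/8, recall is 6/7, and accuracy is 7/10.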


scikit-learn provides built-in functions for these metrics, which we can use to calculate our model's accuracy, precision, and recall.

In [18]:
from sklearn.metrics import precision_score, recall_score, accuracy_score

In [99]:
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

In [102]:
print("Model Precision: %f"%prec)
print("Model Recall: %f"%rec)
print("Model Accuracy: %f"%acc)


Model Precision: 0.752222
Model Recall: 0.980944
Model Accuracy: 0.752389


## Discussion

We can see here that we improve on the baseline (66%) precision and accuracy by about 9 percentage points (to roughly 75%). Furthermore, we improve on this baseline without sacrificing much in the way of recall.

In human judgement, we trade off heavily between precision and recall. For example, saying every stock is going to beat estimates every quarter gives a recall of 1 (since you get the actual beats right 100% of the time). However, the precision suffers substantially (only 66% of the predictions will be correct). In this model, we trade off about 2 percentage points of recall (100% to 98.1%) for an increase of about 9 percentage points in precision (66% to 75.2%). This is a great start, and the increased accuracy could be leveraged in trading models to create efficient portfolios with high returns.

The model can be improved by trying alternative methods (such as RNNs, CNNs, XGBoost, other ensemble models, etc.) and by creating more features (using hypotheses and data), as we demonstrated. These steps should be taken incrementally, with an eye on the precision and recall tradeoffs.

