
# Feature Engineering 

Finalized List of Features:

1. SecuritiesCode: integer code identifying the stock (stored as float64 in the loaded data)

2. TypeOfCurrentPeriod: object, indicates whether the period is a quarter or a fiscal year

3. ForecastDividendPerShare1stQuarter: float, predicted dividends to be paid in 1st quarter

4. ForecastDividendPerShare2ndQuarter: float, predicted dividends to be paid in 2nd quarter

5. ForecastDividendPerShare3rdQuarter: float, predicted dividends to be paid in 3rd quarter

6. ForecastOperatingProfit: float, predicted operating profit of company

7. ForecastProfit: float, predicted profit of company

8. ForecastEarningsPerShare: float, predicted earnings per share

9. Open: float, open price of stock that day

10. SectorCode: int, code for the sector the company is in

11. Up: Boolean, the target (y) variable: whether the stock went up or down that day




In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline 
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelBinarizer, StandardScaler
from sklearn import config_context
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
# Upload data (Colab only; needs `from google.colab import files` first)
#uploaded = files.upload()

## Exploring the Data

In [None]:
#Loading and prepping data before the pipeline
earnings = pd.read_csv('earnings.csv')

# Replace the full-width dash "－", which the dataset also uses to mean zero
earnings = earnings.replace("－",0)

#cast to floats
earnings = earnings.astype({'ForecastDividendPerShare1stQuarter': 'float64','ForecastDividendPerShare2ndQuarter':'float64','ForecastDividendPerShare3rdQuarter':'float64','ForecastOperatingProfit':'float64','ForecastProfit':'float64','ForecastEarningsPerShare':'float64'})

# turn y values from True, False to 1,0
earnings.Up = earnings.Up.replace(True,1)
earnings.Up = earnings.Up.replace(False,0)

# Turn TypeOfCurrentPeriod into integer category codes (not needed with the pipeline set-up)
earnings['TypeOfCurrentPeriod'] = pd.factorize(earnings['TypeOfCurrentPeriod'])[0] 

earnings = earnings.replace(np.nan,0)

display(earnings.tail(5))
earnings.info()

| | SecuritiesCode | TypeOfCurrentPeriod | ForecastDividendPerShare1stQuarter | ForecastDividendPerShare2ndQuarter | ForecastDividendPerShare3rdQuarter | ForecastOperatingProfit | ForecastProfit | ForecastEarningsPerShare | Open | SectorCode | Up |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20158 | 9997.0 | 1 | 0.0 | 0.0 | 0.0 | 3850000000.0 | 4228000000.0 | 43.49 | 1001.0 | 6100 | 0 |
| 20159 | 9997.0 | 2 | 0.0 | 8.0 | 0.0 | 14000000000.0 | 10500000000.0 | 108.43 | 1024.0 | 6100 | 1 |
| 20160 | 9997.0 | 1 | 0.0 | 0.0 | 0.0 | 3863000000.0 | 1678000000.0 | 17.35 | 630.0 | 6100 | 1 |
| 20161 | 9997.0 | 2 | 0.0 | 8.0 | 0.0 | 7000000000.0 | 5200000000.0 | 53.8 | 643.0 | 6100 | 1 |
| 20162 | 9997.0 | 2 | 0.0 | 9.5 | 0.0 | 17500000000.0 | 12500000000.0 | 129.3 | 1313.0 | 6100 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20163 entries, 0 to 20162
Data columns (total 11 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   SecuritiesCode                      20163 non-null  float64
 1   TypeOfCurrentPeriod                 20163 non-null  int64  
 2   ForecastDividendPerShare1stQuarter  20163 non-null  float64
 3   ForecastDividendPerShare2ndQuarter  20163 non-null  float64
 4   ForecastDividendPerShare3rdQuarter  20163 non-null  float64
 5   ForecastOperatingProfit             20163 non-null  float64
 6   ForecastProfit                      20163 non-null  float64
 7   ForecastEarningsPerShare            20163 non-null  float64
 8   Open                                20163 non-null  float64
 9   SectorCode                          20163 non-null  int64  
 10  Up                                  20163 non-null  int64  
dtypes: float64(8), int64(3)
memory usage: 1.7

## Creating the initial Pipeline w/ logistic regression

In [None]:
#Setting up train/test splits and y-column

class_column = 'Up'
random_seed = 2

# Distinguishing between period types wasn't useful, so drop that column for this model
earnings_regress = earnings.drop(columns=['TypeOfCurrentPeriod'])
print(earnings_regress.head())

X_train, X_test, y_train, y_test = train_test_split(earnings_regress.drop(columns=class_column), earnings_regress[class_column],
                                                   test_size=0.2, random_state=random_seed, stratify=earnings_regress[class_column])


# Impute with a constant (fill_value=0) because the blanks really are zeros (e.g. no dividend that quarter)
cat_pipeline = Pipeline(steps=[('cat_impute', SimpleImputer(missing_values=np.nan, strategy='constant',fill_value= 0)),
                               ('onehot_cat', OneHotEncoder(drop='if_binary'))])
num_pipeline = Pipeline(steps=[('impute_num', SimpleImputer(missing_values=np.nan, strategy='constant',fill_value= 0)),
                               ('scale_num', StandardScaler())])

preproc = ColumnTransformer([('cat_pipe', cat_pipeline, make_column_selector(dtype_include=object)),
                             ('num_pipe', num_pipeline, make_column_selector(dtype_include=np.number))],
                             remainder='passthrough')

#using logistic regression as the one initial model
pipe = Pipeline(steps=[('preproc', preproc),
                       ('mdl', LogisticRegression(penalty='elasticnet', solver='saga', tol=0.01))])

print(pipe)

   SecuritiesCode  ForecastDividendPerShare1stQuarter  \
0          1301.0                                 0.0   
1          1301.0                                 0.0   
2          1301.0                                 0.0   
3          1301.0                                 0.0   
4          1301.0                                 0.0   

   ForecastDividendPerShare2ndQuarter  ForecastDividendPerShare3rdQuarter  \
0                                 0.0                                 0.0   
1                                 0.0                                 0.0   
2                                 0.0                                 0.0   
3                                 0.0                                 0.0   
4                                 0.0                                 0.0   

   ForecastOperatingProfit  ForecastProfit  ForecastEarningsPerShare    Open  \
0             5.000000e+09    3.500000e+09                    325.35  3065.0   
1             0.000000e+00    0.00
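
To see what the preprocessing step does on its own, here is a minimal sketch on a toy frame. The `toy` data and its values are made up for illustration; only the transformer layout mirrors the cell above. Because the frame has no `object` columns (as is the case once `TypeOfCurrentPeriod` is dropped), the categorical branch selects nothing and every column goes through the impute-then-scale branch.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: all numeric, with one missing value.
toy = pd.DataFrame({
    "ForecastProfit": [1.0, 2.0, np.nan, 5.0],
    "Open": [100.0, 200.0, 300.0, 400.0],
})

cat_pipeline = Pipeline(steps=[
    ("cat_impute", SimpleImputer(strategy="constant", fill_value=0)),
    ("onehot_cat", OneHotEncoder(drop="if_binary")),
])
num_pipeline = Pipeline(steps=[
    ("impute_num", SimpleImputer(strategy="constant", fill_value=0)),
    ("scale_num", StandardScaler()),
])
preproc = ColumnTransformer(
    [("cat_pipe", cat_pipeline, make_column_selector(dtype_include=object)),
     ("num_pipe", num_pipeline, make_column_selector(dtype_include=np.number))],
    remainder="passthrough",
)

# The missing value is filled with 0, then each column is standardized.
transformed = preproc.fit_transform(toy)
print(transformed.shape)
print(transformed.mean(axis=0))
```

Each output column should have a mean of roughly zero, which confirms that scaling was applied after imputation.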

## Logistic Regression Results

In [None]:
tuning_grid = {'mdl__l1_ratio' : np.linspace(0,1,5),
               'mdl__C': np.logspace(-1, 6, 3) }
# 5-fold cross-validation over the grid

grid_search = GridSearchCV(pipe, param_grid = tuning_grid, cv = 5, return_train_score=True)

grid_search.fit(X_train, y_train.values.ravel())

grid_search.best_params_

#logistic regression results
logit_result = (classification_report(y_test, grid_search.best_estimator_.predict(X_test)))
print(logit_result)

              precision    recall  f1-score   support

           0       0.50      0.00      0.00      1593
           1       0.61      1.00      0.75      2440

    accuracy                           0.61      4033
   macro avg       0.55      0.50      0.38      4033
weighted avg       0.56      0.61      0.46      4033
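
The report above shows recall of 1.00 for class 1 and 0.00 for class 0, which is the pattern a model produces when it predicts "up" for nearly every row. One way to put the 0.61 accuracy in context is to compare it with a majority-class baseline. The sketch below is an illustration rather than part of the original analysis: it rebuilds a label vector with the same class counts as the report (1593 and 2440) and uses placeholder features, so `y_test_like` and `X_placeholder` are assumptions, not the real test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels with the same class counts as the report above
# (1593 "down" rows, 2440 "up" rows); the features are placeholders.
y_test_like = np.array([0] * 1593 + [1] * 2440)
X_placeholder = np.zeros((len(y_test_like), 1))

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_test_like)

baseline_accuracy = accuracy_score(y_test_like, baseline.predict(X_placeholder))
print(f"majority-class accuracy: {baseline_accuracy:.3f}")
```

With the real split, the same check could be run by fitting the dummy classifier on `X_train, y_train` and scoring it on `X_test, y_test`.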



## Random Forest Results

In [None]:
class_column = 'Up'

X_train, X_test, y_train, y_test = train_test_split(earnings.drop(columns=class_column), earnings[class_column],
                                                   test_size=0.2, random_state=random_seed, stratify=earnings[class_column])

pipe_forest = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=100, oob_score = True)


# forest tuning

forest_grid = { 
    'n_estimators': [50, 100],
    'max_features': ['sqrt']  # 'auto' meant 'sqrt' for classifiers; it was deprecated in scikit-learn 1.1 and later removed
}

grid_search = GridSearchCV(estimator = pipe_forest, cv = 5, param_grid = forest_grid, return_train_score=True)

grid_search.fit(X_train, y_train.values.ravel())

grid_search.best_params_

#random forest results
forest_result = (classification_report(y_test, grid_search.best_estimator_.predict(X_test)))
print(forest_result)

              precision    recall  f1-score   support

           0       0.46      0.31      0.37      1593
           1       0.63      0.76      0.69      2440

    accuracy                           0.58      4033
   macro avg       0.54      0.53      0.53      4033
weighted avg       0.56      0.58      0.56      4033
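
The forest above is created with `oob_score=True`, but the out-of-bag estimate is never read back. As a minimal sketch of how it could be inspected, the example below fits a forest on synthetic data from `make_classification`; the data, sample sizes, and the `forest` name are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 500 rows, 10 features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    oob_score=True,   # score each row using only trees that did not train on it
    random_state=2,   # fixed seed so repeated runs give the same forest
)
forest.fit(X_demo, y_demo)

# Accuracy estimated from the rows left out of each tree's bootstrap sample.
print(f"OOB accuracy: {forest.oob_score_:.3f}")
```

In the notebook's own setup, the corresponding value should be available as `grid_search.best_estimator_.oob_score_` once the search has been fit, since the refit estimator keeps `oob_score=True`.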



## Gradient Boosting Results

In [None]:
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=3, random_state=0)

clf_grid = { 
    'n_estimators': [10,20,50,100],
    'max_depth': [1,3,5]
}

grid_search = GridSearchCV(estimator = clf, cv = 5, param_grid = clf_grid, return_train_score=True)

grid_search.fit(X_train, y_train.values.ravel())

grid_search.best_params_

#gradient boosting results
clf_result = (classification_report(y_test, grid_search.best_estimator_.predict(X_test)))
print(clf_result)


              precision    recall  f1-score   support

           0       0.54      0.15      0.24      1593
           1       0.62      0.92      0.74      2440

    accuracy                           0.61      4033
   macro avg       0.58      0.53      0.49      4033
weighted avg       0.59      0.61      0.54      4033
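
The search above tunes `n_estimators` and `max_depth` while holding `learning_rate` at 1.0. A possible extension, sketched here on synthetic data, is to let the grid also try a smaller learning rate. The data, the reduced grid, and the `demo_*` names are assumptions chosen to keep the example quick; this is not the configuration that produced the report above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 300 rows, 8 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_grid = {
    "n_estimators": [10, 50],
    "max_depth": [1, 3],
    "learning_rate": [0.1, 1.0],  # 0.1 is the scikit-learn default
}

demo_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=demo_grid,
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# Best combination found on the synthetic data.
print(demo_search.best_params_)
```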



## Comparison

The business case defined success as correctly predicting the direction of the price move more than 52% of the time. By that measure all three models succeed, with test accuracies of 0.61 (logistic regression), 0.58 (random forest), and 0.61 (gradient boosting).

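The best model by almost all metrics is the **Gradient Boosting Classifier**.

It is the only model whose precision exceeds 52% for both down moves (0.54) and up moves (0.62), which is especially useful because it would then be expected to perform well in both falling and rising markets.

It also had the highest macro and weighted average precision (0.58 and 0.59), and it matched the best macro recall (0.53, tied with random forest) and weighted recall (0.61, tied with logistic regression). The one metric where it trails is F1, where random forest leads on both the macro average (0.53 vs. 0.49) and the weighted average (0.56 vs. 0.54).

On these measures it is the winner in this comparison.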
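
To make the comparison easier to check, the figures from the three classification reports can be collected into one table. The sketch below copies the reported numbers by hand; the column names and the `summary` frame are illustrative choices, not something the notebook computes.

```python
import pandas as pd

# Values copied from the three classification reports above.
summary = pd.DataFrame(
    {
        "accuracy": [0.61, 0.58, 0.61],
        "precision_down": [0.50, 0.46, 0.54],
        "precision_up": [0.61, 0.63, 0.62],
        "macro_precision": [0.55, 0.54, 0.58],
        "macro_recall": [0.50, 0.53, 0.53],
        "macro_f1": [0.38, 0.53, 0.49],
    },
    index=["logistic_regression", "random_forest", "gradient_boosting"],
)

# Models whose precision clears the 52% target for both classes.
target = 0.52
clears_both = summary[
    (summary["precision_down"] > target) & (summary["precision_up"] > target)
]

print(summary)
print(list(clears_both.index))
```

Laying the numbers side by side also shows where the models trade places, for example on macro F1.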