# Modeling

In this notebook I set up two approaches to building a model that predicts whether a company will switch the scale at which it reports its financial statements up, down, or not at all. The first approach uses the change in total assets and net income as features, while the second drops those features. I use tree-based methods to make these predictions and score their performance on precision. Because there are three labels instead of two, I compute precision with the macro average, which takes the unweighted mean of the precision for each of the three categories. I chose the unweighted mean because a weighted average would treat label 0 as more important, since it occurs far more often in the test set. For the purposes of this project, however, I am more interested in knowing why a company would switch.


# Load Libraries and Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, make_scorer

In [None]:
cur_path = "/content/drive/MyDrive/Colab Notebooks/Capstone/"

In [None]:
reports = pd.read_csv(cur_path+"full_switches.csv")
reports.head()

Unnamed: 0,cik,switch_type,switch,scale,year,qtr,dates,cash equivalents,total assets,total liabilities,total shareholders equity,net income,ce_change,asset_change,liabilities_change,se_change,net_income_change
0,2034,0,0,thousands,2018,1,2018-03-31,62032000.0,831280000.0,627898000.0,203382000.0,-196635000.0,0.0,0.0,0.0,0.0,0.0
1,2098,0,0,thousands,2017,3,2017-09-30,7021000.0,110938000.0,60218000.0,50720000.0,1202000.0,0.0,0.0,0.0,0.0,0.0
2,2098,0,0,thousands,2018,2,2018-06-30,1894000.0,118578000.0,66354000.0,52224000.0,2436000.0,-5127000.0,7640000.0,6136000.0,1504000.0,1234000.0
3,2098,0,0,thousands,2019,3,2019-09-30,5698000.0,114224000.0,59200000.0,55024000.0,1059000.0,3804000.0,-4354000.0,-7154000.0,2800000.0,-1377000.0
4,2186,0,0,thousands,2017,3,2017-09-30,8938000.0,46156000.0,9217000.0,36939000.0,600000.0,0.0,0.0,0.0,0.0,0.0


# Modeling Approach 1

## Data Preprocessing

### Choose features for the model

For this approach I include two additional features, asset_change and net_income_change, which capture the change in total assets and net income from one period to the next.

In [None]:
X = reports[['scale', 'qtr', 'total assets', 'net income', 'asset_change', 'net_income_change', 'switch']]
y = reports['switch_type']

In [None]:
#test/train split
#I use a 50/50 split because I oversample the training set later, so it will end up being bigger than the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42)
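
Because switches are rare, a purely random 50/50 split can leave the two halves with different shares of each switch type. The sketch below shows one way to keep those shares equal, using the `stratify` argument of `train_test_split`. It is an optional variation rather than what the cells in this notebook do, and the toy labels (`-1`, `0`, `1`) and the placeholder feature column are assumptions made purely for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Toy, hypothetical labels: 90 non-switches and a handful of switches each way
y_toy = np.array([0] * 90 + [1] * 6 + [-1] * 4)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy keeps each label's share the same in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.50, random_state=42, stratify=y_toy
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print(train_counts, test_counts)
```

With stratification, both halves of the toy data receive the same number of rows for each label, so the rare classes cannot end up concentrated in one half by chance.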

In [None]:
#We need the switch column for oversampling on the training set but it can be dropped from the test set
#Reassigning (rather than dropping in place) avoids pandas' SettingWithCopyWarning
X_test = X_test.drop('switch', axis=1)


In [None]:
#Temporarily combine our features and label for oversampling
temp=pd.concat([X_train, y_train], axis=1)
temp.head()

Unnamed: 0,scale,qtr,total assets,net income,asset_change,net_income_change,switch,switch_type
669,ones,3,1477682.0,2518.0,0.0,0.0,0,0
5409,thousands,3,7355084000.0,210254000.0,62935000.0,95302000.0,0,0
1910,thousands,2,47386000.0,1102000.0,0.0,0.0,0,0
4074,thousands,2,10238140000.0,40588000.0,0.0,0.0,0,0
1152,thousands,3,453724000.0,4330000.0,0.0,0.0,0,0


### Random Oversampling

I count the non-switch rows in the training data and then randomly sample, with replacement, that same number of rows from the switches. This makes the switch and non-switch categories equal in size.

In [None]:
indices = temp[temp['switch'] != 0].index
sample_size = len(temp[temp['switch'] == 0])
random_indices = np.random.choice(indices, sample_size, replace=True)
sample = temp.loc[random_indices]
no_switch_sample = temp[temp['switch'] == 0]
oversampled_df = pd.concat([sample, no_switch_sample])
len(oversampled_df)

5686
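
The same oversampling idea can be sketched in a reusable, repeatable form. The helper below is hypothetical (the name `oversample_minority`, the `seed` argument, and the toy frame are assumptions, not part of this notebook); it uses `DataFrame.sample` with a fixed `random_state` so that the draw is the same on every run, whereas the `np.random.choice` call above is unseeded.

```python
import pandas as pd


def oversample_minority(df, flag_col="switch", seed=42):
    """Resample the flagged rows with replacement until they match the unflagged count."""
    majority = df[df[flag_col] == 0]
    minority = df[df[flag_col] != 0]
    # Draw as many minority rows (with replacement) as there are majority rows
    resampled = minority.sample(n=len(majority), replace=True, random_state=seed)
    return pd.concat([resampled, majority])


# Toy, hypothetical data: four non-switches and one switch
toy = pd.DataFrame({
    "switch": [0, 0, 0, 0, 1],
    "switch_type": [0, 0, 0, 0, 1],
    "total assets": [10.0, 20.0, 30.0, 40.0, 50.0],
})
balanced = oversample_minority(toy)
print(len(balanced), int((balanced["switch"] != 0).sum()))
```

Note that this balances switches against non-switches as a whole, as the cell above does; the individual switch directions are not balanced against each other.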

In [None]:
#splitting the features from the labels again
X_train = oversampled_df[['scale', 'qtr', 'total assets', 'net income', 'asset_change', 'net_income_change']]
y_train = oversampled_df['switch_type']

In [None]:
#one-hot encoding the categorical variables
X_train = pd.get_dummies(X_train, columns=['scale', 'qtr'])
X_test = pd.get_dummies(X_test, columns=['scale', 'qtr'])
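
Encoding the training and test sets separately only yields matching columns when every category appears in both sets. The sketch below, using toy frames that are assumptions for illustration, shows one way to guard against a mismatch by reindexing the test columns to those of the training set.

```python
import pandas as pd

# Toy, hypothetical frames: the test set lacks some scales and has an unseen quarter
train = pd.DataFrame({"scale": ["ones", "thousands", "millions"], "qtr": [1, 2, 3]})
test = pd.DataFrame({"scale": ["thousands", "thousands"], "qtr": [2, 4]})

train_enc = pd.get_dummies(train, columns=["scale", "qtr"])
test_enc = pd.get_dummies(test, columns=["scale", "qtr"])

# Add dummy columns missing from the test set (filled with 0), drop ones the
# model never saw during training, and put the columns in the training order
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

After reindexing, the test frame has the same columns in the same order as the training frame, which is what a fitted scikit-learn estimator expects at prediction time.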

In [None]:
X_train.head()

Unnamed: 0,total assets,net income,asset_change,net_income_change,scale_millions,scale_ones,scale_thousands,qtr_1,qtr_2,qtr_3,qtr_4
3745,13473360000.0,27147000.0,0.0,0.0,0,0,1,0,1,0,0
1249,401471.0,477685.0,-389482529.0,66685.0,0,1,0,0,1,0,0
520,2384294000.0,2813000.0,0.0,0.0,0,0,1,0,0,1,0
2657,1040521.0,1983.0,77556.0,189.0,0,1,0,0,1,0,0
1098,12206470000.0,27259000.0,0.0,0.0,0,0,1,0,0,1,0


## Modeling

### Decision Tree Classifier

I use a grid search to pick the hyperparameters.

In [None]:
clf = DecisionTreeClassifier()
param_grid = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [1, 2, 3, 4]
}
precision=make_scorer(precision_score, average='macro')
grid_search = GridSearchCV(clf, param_grid, scoring=precision, cv=10, n_jobs=-1)

In [None]:
#fit the different parameters to the data
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=-1,
             param_grid={'max_depth': [5, 10, 15],
                         'mi

In [None]:
#The output for the best parameters is listed below
best_grid = grid_search.best_estimator_
best_grid

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=15, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
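
A fitted `GridSearchCV` object also exposes the winning settings and their mean cross-validated score through `best_params_` and `best_score_`, which is more compact than reading the full estimator repr. The sketch below runs a much smaller search on synthetic data; the dataset, the reduced grid, the fixed `random_state`, and `zero_division=0` (which silences the warning raised when a class is never predicted in a fold) are all assumptions for illustration and not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# Macro-averaged precision, scoring a never-predicted class as 0
macro_precision = make_scorer(precision_score, average="macro", zero_division=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5], "min_samples_leaf": [1, 4]},
    scoring=macro_precision,
    cv=3,
)
search.fit(X_toy, y_toy)

best_params = search.best_params_  # dict of the winning hyperparameters
best_score = search.best_score_    # mean cross-validated macro precision
print(best_params, round(best_score, 3))
```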

### Predictions

In [None]:
pred = best_grid.predict(X_test)
cnf_matrix=confusion_matrix(y_test,pred)

In [None]:
# 3x3 confusion matrix
print(cnf_matrix)

[[   2   45    0]
 [  93 2663   90]
 [   0   33   18]]


In [None]:
#precision score
precision = precision_score(y_test, pred, average='macro') #macro calculates precision for each class and outputs the unweighted avg
print(precision)

0.3864208435475165
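
The macro score can be checked by hand from the confusion matrix printed above. The following sketch, written in plain Python with variable names of my own choosing, divides each diagonal entry by its column total to get the per-class precision, then compares the unweighted mean with a support-weighted mean.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
cnf = [
    [2, 45, 0],
    [93, 2663, 90],
    [0, 33, 18],
]
n = len(cnf)

# Precision for a class = correct predictions / all predictions of that class
predicted_totals = [sum(cnf[r][c] for r in range(n)) for c in range(n)]
per_class = [cnf[c][c] / predicted_totals[c] for c in range(n)]

# Macro: plain mean over classes; weighted: mean weighted by true class counts
macro = sum(per_class) / n
support = [sum(row) for row in cnf]
weighted = sum(p * s for p, s in zip(per_class, support)) / sum(support)

print([round(p, 3) for p in per_class])
print(round(macro, 4), round(weighted, 4))
```

The macro value reproduces the score printed above, while the support-weighted value is far higher (roughly 0.94) because it is dominated by the majority class. That gap is why the macro average is the more informative choice when the rare switch classes are what matter.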


### Random Forest Classifier

In [None]:
rf = RandomForestClassifier()
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [1, 2, 3, 4]
}
precision=make_scorer(precision_score, average='macro')
grid_search = GridSearchCV(rf, param_grid, scoring=precision, cv=10, n_jobs=-1)

In [None]:
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rand

In [None]:
#The output for the best parameters is listed below
best_grid = grid_search.best_estimator_
best_grid

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=15, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

### Predictions

In [None]:
pred = best_grid.predict(X_test)
cnf_matrix=confusion_matrix(y_test,pred)

In [None]:
# 3x3 confusion matrix
print(cnf_matrix)

[[   0   47    0]
 [  11 2777   58]
 [   0   30   21]]


In [None]:
#precision score
precision = precision_score(y_test, pred, average='macro') #macro calculates precision for each class and outputs the unweighted avg
print(precision)

0.41294770238823886


The random forest classifier scored slightly better than the single decision tree classifier using this approach (a macro precision of about 0.413 versus 0.386).

# Modeling Approach 2

This approach does not consider the changes in total assets and net income. 

## Data Preprocessing

In [None]:
#feature selection
X = reports[['scale', 'qtr', 'total assets', 'net income', 'switch']]
y = reports['switch_type']

In [None]:
#test/train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42)

In [None]:
#Reassigning (rather than dropping in place) avoids pandas' SettingWithCopyWarning
X_test = X_test.drop('switch', axis=1)


In [None]:
temp=pd.concat([X_train, y_train], axis=1)
temp.head()

Unnamed: 0,scale,qtr,total assets,net income,switch,switch_type
669,ones,3,1477682.0,2518.0,0,0
5409,thousands,3,7355084000.0,210254000.0,0,0
1910,thousands,2,47386000.0,1102000.0,0,0
4074,thousands,2,10238140000.0,40588000.0,0,0
1152,thousands,3,453724000.0,4330000.0,0,0


### Random Oversampling

I use the same method for oversampling as I did in the previous approach. 

In [None]:
indices = temp[temp['switch'] != 0].index
sample_size = len(temp[temp['switch'] == 0])
random_indices = np.random.choice(indices, sample_size, replace=True)
sample = temp.loc[random_indices]
no_switch_sample = temp[temp['switch'] == 0]
oversampled_df = pd.concat([sample, no_switch_sample])
len(oversampled_df)

5686

In [None]:
#splitting the features from the labels again
X_train = oversampled_df[['scale', 'qtr', 'total assets', 'net income']]
y_train = oversampled_df['switch_type']

In [None]:
#one-hot encoding the categorical variables
X_train = pd.get_dummies(X_train, columns=['scale', 'qtr'])
X_test = pd.get_dummies(X_test, columns=['scale', 'qtr'])

In [None]:
X_train.head()

Unnamed: 0,total assets,net income,scale_millions,scale_ones,scale_thousands,qtr_1,qtr_2,qtr_3,qtr_4
2411,3058178000.0,23970570.0,0,1,0,0,0,1,0
2657,1040521.0,1983.0,0,1,0,0,1,0,0
634,2794456000.0,7595000.0,0,0,1,0,0,1,0
343,1559892000000.0,11645000000.0,0,0,1,1,0,0,0
3384,88150000000.0,1120000000.0,1,0,0,0,0,1,0


## Modeling

### Decision Tree Classifier

In [None]:
clf = DecisionTreeClassifier()
param_grid = {
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [1, 2, 3, 4]
}
precision=make_scorer(precision_score, average='macro')
grid_search = GridSearchCV(clf, param_grid, scoring=precision, cv=10, n_jobs=-1)

In [None]:
#fit the different parameters to the data
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=-1,
             param_grid={'max_depth': [5, 10, 15],
                         'mi

In [None]:
#The output for the best parameters is listed below
best_grid = grid_search.best_estimator_
best_grid

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=15, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=3,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

### Predictions

In [None]:
pred = best_grid.predict(X_test)
cnf_matrix=confusion_matrix(y_test,pred)

In [None]:
# 3x3 confusion matrix
print(cnf_matrix)

[[   1   46    0]
 [ 102 2693   51]
 [   0   36   15]]


In [None]:
#precision score
precision = precision_score(y_test, pred, average='macro') #macro calculates precision for each class and outputs the unweighted avg
print(precision)

0.40247730519575176


The decision tree classifier without the two change variables performed almost as well as the random forest classifier that considered them (a macro precision of about 0.402 versus 0.413).

### Random Forest Classifier

In [None]:
rf = RandomForestClassifier()
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 3, 4, 5],
    'min_samples_leaf': [1, 2, 3, 4]
}
precision=make_scorer(precision_score, average='macro')
grid_search = GridSearchCV(rf, param_grid, scoring=precision, cv=10, n_jobs=-1)

In [None]:
grid_search.fit(X_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rand

In [None]:
#The output for the best parameters is listed below
best_grid = grid_search.best_estimator_
best_grid

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=15, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=4,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

### Predictions

In [None]:
pred = best_grid.predict(X_test)
cnf_matrix=confusion_matrix(y_test,pred)

In [None]:
# 3x3 confusion matrix
print(cnf_matrix)

[[   1   46    0]
 [  29 2780   37]
 [   0   33   18]]


In [None]:
#precision score
precision = precision_score(y_test, pred, average='macro') #macro calculates precision for each class and outputs the unweighted avg
print(precision)

0.44432467381050805


This was the best precision score achieved (about 0.444), which suggests that the period-over-period changes in total assets and net income add little value for predicting when a company will switch.