# Binary Resolution Classifier
In the previous notebook "Data Exploration", the majority of data transformtion was done, and was copied and cut down into this first executable cell. To get more details on how and why transformations were done as below, please refer to the previous notebook.

After executing the cell, the data will be ready to be pushed through various Machine Learning Classifying models. These will include Random Forest Classifier, Logisitic Regression, and Gradient Boosted Decision Tree Classifier. As a more experimentive aproach, SVC and SGDClassifier will also be trained after a SVD dimensionality-reduction. Lastly, Deep Neural Network (DNN) classifier models will be built and trained using a few different architechs.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re

#Defining what dtype to convert each column to
#numberic columns are transformed after reading in
dtype_dict = {'Product':"category",
             'Consumer consent provided?': "category",
             'Submitted via': "category",
             'Company response to consumer': "category",
             'Consumer disputed?': "category"}

#read in .csv file, dates are parsed into datetime objects. 
#The Complaint ID is Unique in every entry, so it can be used as index
df = pd.read_csv('Consumer_Complaints.csv',
                 index_col=['Complaint ID'],
                 parse_dates=["Date received","Date sent to company"],
                 dtype=dtype_dict)

#This will replace ending '-' to 5 (average linespace of 10)
regexReplaceDash = r"(\d+)(-)$"
df['ZIP code'] = df['ZIP code'].str.replace(regexReplaceDash, r'\g<1>5')

#This will change ending XX to 50 (average linespace of 100)
regex_XX = r'(\d{3})(XX)'
df['ZIP code'] = df['ZIP code'].str.replace(regex_XX, r'\g<1>50')

#This will remove all other entries that are still not 5 digits
regexRemove = r'\D+'
df['ZIP code'] = df['ZIP code'].replace(regexRemove, np.nan, regex=True)

#imputes the mean for nan 
imputeMean = df['ZIP code'].astype(np.float).mean()
df['ZIP code'] = df['ZIP code'].astype(np.float).fillna(imputeMean)

#Transforming 2 unique valued col to float boolean
booleanize = {'Yes': 1, 'No': 0}
df['Timely response?'] = pd.Series(df['Timely response?'].map(booleanize), dtype = np.float)

#function to apply to column to convert less common results to 'Other', as well as NaN
def convertToOther(value, keepList):
    if (value == ''):
        return "Other"
    else:
        return value if value in keepList else "Other"
    
#Lists top 23 value counts (allowed to exclude values), turns NaN to '' to others, converts to category dtype
def cleanReduceConvert(df, column, blackList=[]):
    keepList = []
    for category in df[column].value_counts().head(23).index.tolist():
        if (category.lower().split()[0] != "other"):
            keepList.append(category)
    for category in blackList:
        try:
            keepList.remove(category)
        except ValueError:
            pass

    df[column].fillna('', inplace=True)
    return pd.Series(df[column].apply(convertToOther, args=(keepList,)), dtype = 'category')

df['Sub-product'] = cleanReduceConvert(df, 'Sub-product', blackList= ['I do not know'])
df['Issue'] = cleanReduceConvert(df, 'Issue')
df['Sub-issue'] = cleanReduceConvert(df, 'Sub-issue')
df['Company'] = cleanReduceConvert(df, 'Company')

def entryOrNull(strVal):
    return 1.0 if strVal is not np.nan else 0.0

df['Consumer complaint narrative submitted?'] = df['Consumer complaint narrative'].apply(entryOrNull)

def dtToCols(df, dtcolumn):
    df["{} day".format(dtcolumn)] = df[dtcolumn].dt.day
    df["{} month".format(dtcolumn)] = df[dtcolumn].dt.month
    df["{} year".format(dtcolumn)] = df[dtcolumn].dt.year
    
dtToCols(df, "Date received")
dtToCols(df, "Date sent to company")

df["Consumer consent provided?"] = df["Consumer consent provided?"].cat.add_categories("Not recorded").fillna("Not recorded")

df = df.drop(df[df["Company response to consumer"].isna()].index)

dfInProgress = df[df["Company response to consumer"] == "In progress"]
df = df[df["Company response to consumer"] != "In progress"]

dfUntimelyResponse = df[df["Company response to consumer"] == "Untimely response"]
df = df[df["Company response to consumer"] != "Untimely response"]

twoOutputsDict = {"Closed with explanation":"Closed without relief", 
                  "Closed with non-monetary relief":"Closed with relief",
                  "Closed with monetary relief":"Closed with relief",
                  "Closed without relief":"Closed without relief", 
                  "Closed":"Closed without relief",
                  "Closed with relief":"Closed with relief"}
df["Company response to consumer"] = df["Company response to consumer"].map(twoOutputsDict)

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

#data columns not be used for the model
dropList = ["Consumer complaint narrative",
            "Company public response",
            "State",
            "Tags",
            "Consumer disputed?",
            "Date received", 
            "Date sent to company",
            "Company response to consumer"]
X = df.drop(dropList, axis=1)
Y = df["Company response to consumer"]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)

#Columns to be standard scaled/imputed
numeric_features = ['ZIP code',
                    'Date received day',
                    'Date received month',
                    'Date received year',
                    'Date sent to company day',
                    'Date sent to company month',
                    'Date sent to company year']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

#Columns to one hot encoded
categorical_features = ['Product',
           'Sub-product',
           'Issue',
           'Sub-issue',
           'Company',
           'Consumer consent provided?',
           'Submitted via',
           'Timely response?']
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

#building the column transformer with both transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

#fit the preprocessor, then transform trainging and test set, assign sparse matrix to variables
preprocessor.fit(X)
encX_train = preprocessor.transform(X_train)
encX_test = preprocessor.transform(X_test)

Since there are multiple transformers wrapped in the ColumnTransformer, this function below provided by pjgao creates a list of the columns. This is useful for feature importances after running the Model Trainging. Heres a link to where the code was found: https://github.com/scikit-learn/scikit-learn/issues/12525#issuecomment-436217100

In [37]:
def get_column_names_from_ColumnTransformer(column_transformer):    
    col_name = []
    for transformer_in_columns in column_transformer.transformers_[:-1]:#the last transformer is ColumnTransformer's 'remainder'
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1],Pipeline): 
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
        except AttributeError: # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names,np.ndarray): # eg.
            col_name += names.tolist()
        elif isinstance(names,list):
            col_name += names    
        elif isinstance(names,str):
            col_name.append(names)
    return col_name

processedColumns = get_column_names_from_ColumnTransformer(preprocessor)

# RandomForestClassifier
The first model to be run on this will be RandomForestClassifier from sklearn's library. A grid search is used to find the better tuning of the hyper parameters.

In [38]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

#define classifier and parameter grid
rf = RandomForestClassifier()
forest_para = {'criterion':['gini','entropy'],
                'min_samples_split' : range(2,6,2),
                'max_depth': range(3,7,2),
                'n_estimators':range(10,30,10)}


#Grid search to find best hyper parameters
fitmodel = GridSearchCV(rf,forest_para,cv=3, scoring='roc_auc')
fitmodel.fit(encX_train,y_train)

#store the best fitting model, create prediciton from X_test data
bestfitRF = fitmodel.best_estimator_

#fit once more on the best estimator
bestfitRF.fit(encX_train,y_train)
y_pred = bestfitRF.predict(encX_test)
print(bestfitRF.score(encX_test,y_test))

#display the result
print(classification_report(y_test, y_pred))

0.8117695739781232


  'precision', 'predicted', average, warn_for)


                       precision    recall  f1-score   support

   Closed with relief       0.00      0.00      0.00     52313
Closed without relief       0.81      1.00      0.90    225607

            micro avg       0.81      0.81      0.81    277920
            macro avg       0.41      0.50      0.45    277920
         weighted avg       0.66      0.81      0.73    277920



It seems that the random forrest couldn't make any prediction with this dataset, as it went the lazy underfitting route of gussing every label to be the majority class. This is likely due to the One Hot Encoding with too many categories. Since the tree makes a few set of decisions, and the max depth goes to 7, there's not much predictive power in only using 7 columns in a row that are all 0 (for example).

# LogisticRegression
Next, the data will be run through a LogisticRegression Model. Note that 'l1' penalty was used because that penalty usually zeros out less important coefficients, which since this was one hot encoded, is expected to happen a lot.

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

lr = LogisticRegression(n_jobs=-1, solver='saga', penalty='l1')
lr_para = {'C':[1.0,0.1,0.01], 
           'class_weight':[None,'balanced'],
           'max_iter':[50,100,150]}

#Apply grid search with above parameters specified
fitmodel = GridSearchCV(lr, lr_para,cv=5, scoring='roc_auc', n_jobs=-1)
fitmodel.fit(encX_train,y_train)

#store the best fitting LogisiticRegression(), create prediciton from X_test data
bestfitLR = fitmodel.best_estimator_

bestfitLR.fit(encX_train,y_train)
y_pred = bestfitLR.predict(encX_test)
print(bestfitLR.score(encX_test,y_test))

#display the result
print(classification_report(y_test, y_pred))



0.6753598157743236
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     52229
Closed without relief       0.90      0.68      0.77    225691

            micro avg       0.68      0.68      0.68    277920
            macro avg       0.61      0.67      0.60    277920
         weighted avg       0.79      0.68      0.71    277920



Although the score was low for this, looking at precision, recall, and f1, it seems like a improvement from guessing the majority class everytime.

In [41]:
print(bestfitLR)

coef_df = pd.DataFrame([processedColumns,list(bestfitLR.coef_[0])]).transpose().sort_values(1)
coef_df.iloc[np.r_[0:10, -10:0]]

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=-1, penalty='l1', random_state=None,
          solver='saga', tol=0.0001, verbose=0, warm_start=False)


Unnamed: 0,0,1
21,x0_Prepaid card,-1.31532
86,x3_Personal information incorrect,-1.24942
100,x4_Experian Information Solutions Inc.,-1.06433
108,x4_PORTFOLIO RECOVERY ASSOCIATES INC,-0.839405
98,x4_ENCORE CAPITAL GROUP INC.,-0.737273
11,x0_Credit card or prepaid card,-0.719218
91,x4_AES/PHEAA,-0.645815
81,x3_Information belongs to someone else,-0.567322
111,x4_TD BANK US HOLDING COMPANY,-0.547387
25,x1_Checking account,-0.519221


From these results, it seems like a big influence is what company the complaint is being filled out for, the sub-issue, and what type of product the complaint pertains to.

# GradientBoostingClassifier
Next the a gradient boosted tree classifier will be trained.

In [42]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc_para = {'loss':['deviance', 'exponential'],
            'learning_rate': [0.1, 0.05],
            'max_depth': [3,7]}

#Apply grid search with above parameters specified
fitmodel = GridSearchCV(gbc, gbc_para,cv=3, scoring='roc_auc', n_jobs=-1)
fitmodel.fit(encX_train,y_train)

#store the best fitting GradientBoostingClassifier(), create prediciton from X_test data
bestfitGBC = fitmodel.best_estimator_

bestfitGBC.fit(encX_train,y_train)
y_pred = bestfitGBC.predict(encX_test)
print(bestfitGBC.score(encX_test,y_test))

#display the result
print(classification_report(y_test, y_pred))

0.8208945020149684
                       precision    recall  f1-score   support

   Closed with relief       0.64      0.11      0.18     52229
Closed without relief       0.83      0.99      0.90    225691

            micro avg       0.82      0.82      0.82    277920
            macro avg       0.73      0.55      0.54    277920
         weighted avg       0.79      0.82      0.77    277920



In [43]:
print(bestfitGBC)

coef_df = pd.DataFrame([processedColumns,list(bestfitGBC.feature_importances_)]).transpose().sort_values(1, ascending=False)
coef_df.iloc[np.r_[0:10, -10:0]]

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='exponential', max_depth=7,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)


Unnamed: 0,0,1
100,x4_Experian Information Solutions Inc.,0.204651
6,Date sent to company year,0.0825464
10,x0_Credit card,0.0761093
25,x1_Checking account,0.0633851
17,x0_Mortgage,0.0338467
12,x0_Credit reporting,0.0319135
0,ZIP code,0.0314691
11,x0_Credit card or prepaid card,0.0284411
5,Date sent to company month,0.0257309
106,x4_Other,0.0251874


It appears here that the biggest influence in prediction is the type of products. There also seems to be an issue with Experian Information Solutions Inc. being consitent with their response, what ever it may be.

## Resampling
Since there was an imbalance of classes, I've resample by downsampling the majority class to create a 1:1 ratio between the classes, to see if this improves predicting power.

In [38]:
from sklearn.utils import resample

#seperate the target based on classification
df_resample = pd.concat([X_train, y_train],axis=1)
not_relief = df_resample[df_resample['Company response to consumer']=='Closed without relief']
relief = df_resample[df_resample['Company response to consumer']=='Closed with relief']

#resample the majority class with equal frequency as minorty class
not_relief_downsample = resample(not_relief,
                                 replace=False,
                                 n_samples = len(relief))

#concat resampled majority class with minorty class
downsample = pd.concat([not_relief_downsample, relief])

X_train = df.drop(dropList, axis=1)
y_train = df["Company response to consumer"]

#fit the preprocessor, then transform trainging and test set, assign sparse matrix to variables
encX_train = preprocessor.transform(X_train)
encX_test = preprocessor.transform(X_test)

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

lr = LogisticRegression(penalty='l1')
lr_para = {'C':[1.0,0.1,0.01], 
           'solver':['liblinear','saga'],
           'class_weight':[None,'balanced'],
           'max_iter':[50,100,150]}

#Apply grid search with above parameters specified
fitmodel = GridSearchCV(lr, lr_para,cv=5, scoring='roc_auc')
fitmodel.fit(encX_train,y_train)

#store the best fitting LogisticRegression(), create prediciton from X_test data
bestfitLR = fitmodel.best_estimator_

bestfitLR.fit(encX_train,y_train)
y_pred = bestfitLR.predict(encX_test)
print(bestfitLR.score(encX_test,y_test))

#display the result
print(classification_report(y_test, y_pred))



0.6755001439263097
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     52452
Closed without relief       0.90      0.68      0.77    225468

             accuracy                           0.68    277920
            macro avg       0.61      0.67      0.60    277920
         weighted avg       0.79      0.68      0.71    277920



There doesn't seem to be any significant improvement for the logistic regression after resampling.

In [42]:
print(bestfitLR)

coef_df = pd.DataFrame([processedColumns,list(bestfitLR.coef_[0])]).transpose().sort_values(1)
coef_df.iloc[np.r_[0:10, -10:0]]

LogisticRegression(C=0.01, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=50, multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


Unnamed: 0,0,1
21,x0_Prepaid card,-1.18221
86,x3_Personal information incorrect,-1.16044
100,x4_Experian Information Solutions Inc.,-1.07654
108,x4_PORTFOLIO RECOVERY ASSOCIATES INC,-0.812528
98,x4_ENCORE CAPITAL GROUP INC.,-0.716444
11,x0_Credit card or prepaid card,-0.654846
81,x3_Information belongs to someone else,-0.553695
91,x4_AES/PHEAA,-0.553528
111,x4_TD BANK US HOLDING COMPANY,-0.523632
25,x1_Checking account,-0.521711


Above, it is observed that resampling did not significantly change the feature importance of the columns.

In [45]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc_para = {'loss':['deviance', 'exponential'],
            'learning_rate': [0.1, 0.05],
            'max_depth': [3,7]}

#Apply grid search with above parameters specified
fitmodel = GridSearchCV(gbc, gbc_para,cv=3, scoring='roc_auc')
fitmodel.fit(encX_train,y_train)

#store the best fitting GradientBoostingClassifier(), create prediciton from X_test data
bestfitGBC = fitmodel.best_estimator_

bestfitGBC.fit(encX_train,y_train)
y_pred = bestfitGBC.predict(encX_test)
print(bestfitGBC.score(encX_test,y_test))

#display the result
print(classification_report(y_test, y_pred))

0.8149431491076569
                       precision    recall  f1-score   support

   Closed with relief       0.67      0.04      0.07     52452
Closed without relief       0.82      1.00      0.90    225468

             accuracy                           0.81    277920
            macro avg       0.74      0.52      0.48    277920
         weighted avg       0.79      0.81      0.74    277920



It appears that the resampling made Gradient Boosted Classifier preform slightly worse than before.

In [46]:
print(bestfitGBC)

coef_df = pd.DataFrame([processedColumns,list(bestfitGBC.feature_importances_)]).transpose().sort_values(1, ascending=False)
coef_df.iloc[np.r_[0:10, -10:0]]

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.05, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)


Unnamed: 0,0,1
100,x4_Experian Information Solutions Inc.,0.305674
17,x0_Mortgage,0.0891856
10,x0_Credit card,0.0872126
25,x1_Checking account,0.0815731
6,Date sent to company year,0.0620886
106,x4_Other,0.0525954
93,"x4_BANK OF AMERICA, NATIONAL ASSOCIATION",0.0253743
11,x0_Credit card or prepaid card,0.0243908
112,"x4_TRANSUNION INTERMEDIATE HOLDINGS, INC.",0.0238843
81,x3_Information belongs to someone else,0.0232868


Just like the Logistic Regression, resampling did not significantly affect the feature importance.

# Truncated Singular Value Decomposition(SVD)

Below, there was an attempt to run a Support Vector Machine Approach, along with a TruncatedSVD demensionality reduction. However, the training yielded weak results, and the SVC was set to take weeks to train, for questionable returns.

In [37]:
from sklearn.decomposition import TruncatedSVD

#fit the demensionality reducer to 2 dimensions
svd = TruncatedSVD()
svd.fit(preprocessor.transform(X))

#transfrom input
encX_train = svd.transform(preprocessor.transform(X_train))
encX_test = svd.transform(preprocessor.transform(X_test))

## Support Vector Classifier (SVC)

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

svc = SVC(probability=True, kernel='poly')
svc_para = {'C':[1, 0.01],
            'gamma': [0.001,1],
            'class_weight':[None, 'balanced'],
            'degree': [2,3]}


#Apply grid search with above parameters specified
fitmodel = GridSearchCV(svc, svc_para,cv=3, scoring='roc_auc', verbose=100)
fitmodel.fit(encX_train,y_train)

#store the best fitting SVC(), create prediciton from X_test data
bestfitSVC = fitmodel.best_estimator_

bestfitSVC.fit(encX_train,y_train)
y_pred = bestfitSVC.predict(encX_test)
print(bestfitSVC.score(encX_test,y_test))

#display the result
print(classification_report(y_test, y_pred))

Fitting 3 folds for each of 16 candidates, totalling 48 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] C=1, class_weight=None, degree=2, gamma=0.001 ...................
[CV]  C=1, class_weight=None, degree=2, gamma=0.001, score=0.529, total=1133.4min
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 1133.4min remaining:    0.0s
[CV] C=1, class_weight=None, degree=2, gamma=0.001 ...................
[CV]  C=1, class_weight=None, degree=2, gamma=0.001, score=0.568, total=993.3min
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 2126.7min remaining:    0.0s
[CV] C=1, class_weight=None, degree=2, gamma=0.001 ...................
[CV]  C=1, class_weight=None, degree=2, gamma=0.001, score=0.468, total=1096.0min
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 3222.7min remaining:    0.0s
[CV] C=1, class_weight=None, degree=2, gamma=1 .......................


## Support Vector Machine with Stocatic Gradient Descent
(Incomplete investigation due to long training time, and unpromising returns)

In [44]:
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

sgdc = SGDClassifier(penalty='elasticnet')
sgdc_para = {'loss':['hinge','log','squared_hinge'],
            'l1_ratio':[0,.5,1], 
            'alpha':[0.0001, 0.01,1]}


#Apply grid search with above parameters specified
fitmodel = GridSearchCV(sgdc, sgdc_para,cv=3, n_jobs=-1, verbose=10)
fitmodel.fit(encX_train,y_train)

#store the best fitting SGDClassifier(), create prediciton from X_test data
bestfitSGDC = fitmodel.best_estimator_

bestfitSGDC.fit(encX_train,y_train)
y_pred = bestfitSGDC.predict(encX_test)
print(bestfitSGDC.score(encX_test,y_test))

#display the result
print(classification_report(y_test, y_pred))

Fitting 3 folds for each of 27 candidates, totalling 81 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:    9.4s
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:   10.9s
[Parallel(n_jobs=-1)]: Done  40 tasks      | elapsed:   15.1s
[Parallel(n_jobs=-1)]: Done  59 out of  81 | elapsed:   27.7s remaining:   10.3s
[Parallel(n_jobs=-1)]: Done  68 out of  81 | elapsed:   32.8s remaining:    6.2s
[Parallel(n_jobs=-1)]: Done  77 out of  81 | elapsed:  3.1min remaining:    9.5s
[Parallel(n_jobs=-1)]: Done  81 out of  81 | elapsed:  3.2min finished


0.8111327000575705


  'precision', 'predicted', average, warn_for)


                       precision    recall  f1-score   support

   Closed with relief       0.00      0.00      0.00     52490
Closed without relief       0.81      1.00      0.90    225430

             accuracy                           0.81    277920
            macro avg       0.41      0.50      0.45    277920
         weighted avg       0.66      0.81      0.73    277920



# Deep Neural Network (DNN) Classifier

Below, the target is turned into a boolean 0-1 column, so that it can be processed in a deep neural network that will be built on Keras with Tensorflow and a backend. The architecture used was 128 input, to 50 Fully Connected (FC) layer, to a second FC layer, both with relu activation, to a final output layer, with sigmoid activation. An adam optimizer was used with binary crossentropy as the loss function, and accuracy as the metric. The model was trained on 100 epochs. Models after this one only differ by number of layers and layer size.

Note that in the classification reports, "1" is "closed with relief" and thus the recall and f1 we are more interested in.

In [38]:
booly_train = y_train.apply(lambda x: 1 if x=='Closed with relief' else 0)
booly_test = y_test.apply(lambda x: 1 if x=='Closed with relief' else 0)

Please note that the error that appears in the cell occurs after training. Since training took about 12 hours, I found it better to correct and cell and after wards without rerunning it, and posting the resutls in the next cell. The cell should still run correctly if ran again.

## Two Hidden Layers: 128->50->50->1

In [39]:
from keras.models import Sequential
from keras.layers import Dense

#Build model with 2 FC layers each 50 nodes (relu activation), and 1 node output output layer (sigmoid activation)
model = Sequential()
model.add(Dense(50, init='uniform', input_dim=128, activation='relu'))
model.add(Dense(50, init='uniform', activation='relu'))
model.add(Dense(1, init= 'uniform', activation = 'sigmoid'))

#Compiled with adam optimizer, binary crossentropy loss function, accuracy metric for evaluation
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])

#train the model over 100 epochs
model.fit(encX_train, booly_train, epochs=100, batch_size=10)

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))


Using TensorFlow backend.


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


  """
  
  import sys



Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 7

NameError: name 'y' is not defined

In [46]:
from sklearn.metrics import classification_report

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))

#see the classification report
print(classification_report(booly_test, model.predict_classes(encX_test)))

#Save the trained model into an .h5 file
model.save("model.h5")

Accuracy: 81.75
              precision    recall  f1-score   support

           0       0.82      0.99      0.90    225460
           1       0.65      0.07      0.13     52460

    accuracy                           0.82    277920
   macro avg       0.74      0.53      0.51    277920
weighted avg       0.79      0.82      0.75    277920



## Three Hidden Layers: 128->75->50->25->1

In [40]:
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report

#Build model with 2 FC layers each 50 nodes (relu activation), and 1 node output output layer (sigmoid activation)
model = Sequential()
model.add(Dense(75, init='uniform', input_dim=128, activation='relu'))
model.add(Dense(50, init='uniform', activation='relu'))
model.add(Dense(25, init='uniform', activation='relu'))
model.add(Dense(1, init= 'uniform', activation = 'sigmoid'))

#setup earlystop callback
earlystop_callback = EarlyStopping(monitor='val_accuracy', min_delta=0.0001, patience=3)

#Compiled with adam optimizer, binary crossentropy loss function, accuracy metric for evaluation
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])

#train the model over 100 epochs
model.fit(encX_train, booly_train, epochs=25, batch_size=10, callbacks=[earlystop_callback])

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))

#see the classification report
print(classification_report(booly_test, model.predict_classes(encX_test)))

#Save the trained model into an .h5 file
model.save("model2.h5")

  
  if __name__ == '__main__':
  # Remove the CWD from sys.path while we load stuff.
  # This is added back by InteractiveShellApp.init_path()


Epoch 1/25
Epoch 2/25
     90/1111679 [..............................] - ETA: 25:43 - loss: 0.4540 - accuracy: 0.7778  



Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
Accuracy: 81.93
Accuracy: 81.93
              precision    recall  f1-score   support

           0       0.82      0.99      0.90    225430
           1       0.65      0.09      0.17     52490

    accuracy                           0.82    277920
   macro avg       0.74      0.54      0.53    277920
weighted avg       0.79      0.82      0.76    277920



## Three Hidden Layers: 128->50->50->50->1

In [39]:
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping, TensorBoard
import datetime
from sklearn.metrics import classification_report

#Build model with 2 FC layers each 50 nodes (relu activation), and 1 node output output layer (sigmoid activation)
model = Sequential()
model.add(Dense(50, init='uniform', input_dim=128, activation='relu'))
model.add(Dense(50, init='uniform', activation='relu'))
model.add(Dense(50, init='uniform', activation='relu'))
model.add(Dense(1, init= 'uniform', activation = 'sigmoid'))

#Compiled with adam optimizer, binary crossentropy loss function, accuracy metric for evaluation
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])

#Setting up logging for tensorboard
#log_dir ="logs/" #+ datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
#tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

#setup earlystop callback
earlystop_callback = EarlyStopping(monitor='val_accuracy', min_delta=0.0001, patience=3)

#train the model over 100 epochs, with validation data, and tensorbard and early stopping callbacks added
model.fit(encX_train, booly_train, 
          epochs=50, batch_size=5, 
          validation_data=(encX_test, booly_test), 
          callbacks =[earlystop_callback])

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))

#see the classification report
print(classification_report(booly_test, model.predict_classes(encX_test)))

#Save the trained model into an .h5 file
model.save("model50-50-50.h5")

  if __name__ == '__main__':
  # Remove the CWD from sys.path while we load stuff.
  # This is added back by InteractiveShellApp.init_path()
  if sys.path[0] == '':


Train on 1111679 samples, validate on 277920 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Accuracy: 81.82
Accuracy: 81.82
              precision    recall  f1-score   support

           0       0.83      0.98      0.90    225272
           1       0.59      0.13      0.22     52648

    accuracy                           0.82    277920
   macro avg       0.71      0.56      0.56    277920
weighted avg       0.78      0.82      0.77    277920



## Three Hidden Layers: 128->100->100->100->1

In [40]:
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from sklearn.metrics import classification_report

#Build model with 2 FC layers each 50 nodes (relu activation), and 1 node output output layer (sigmoid activation)
model = Sequential()
model.add(Dense(100, init='uniform', input_dim=128, activation='relu'))
model.add(Dense(100, init='uniform', activation='relu'))
model.add(Dense(100, init='uniform', activation='relu'))
model.add(Dense(1, init= 'uniform', activation = 'sigmoid'))

#Compiled with adam optimizer, binary crossentropy loss function, accuracy metric for evaluation
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])

#Setting up logging for tensorboard
#log_dir ="logs/" #+ datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
#tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

#setup earlystop callback
earlystop_callback = EarlyStopping(monitor='val_accuracy', min_delta=0.0001, patience=3)

#train the model over 100 epochs, with validation data, and tensorbard and early stopping callbacks added
model.fit(encX_train, booly_train, 
          epochs=50, batch_size=5, 
          validation_data=(encX_test, booly_test), 
          callbacks =[earlystop_callback])

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))

#see the classification report
print(classification_report(booly_test, model.predict_classes(encX_test)))

#Save the trained model into an .h5 file
model.save("model100-100-100.h5")

  if __name__ == '__main__':
  # Remove the CWD from sys.path while we load stuff.
  # This is added back by InteractiveShellApp.init_path()
  if sys.path[0] == '':


Train on 1111679 samples, validate on 277920 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Accuracy: 81.87
Accuracy: 81.87
              precision    recall  f1-score   support

           0       0.83      0.97      0.90    225272
           1       0.58      0.15      0.24     52648

    accuracy                           0.82    277920
   macro avg       0.71      0.56      0.57    277920
weighted avg       0.78      0.82      0.77    277920



## Two Hidden Layers: 128->100->100->1

In [41]:
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
import datetime
from sklearn.metrics import classification_report

#Build model with 2 FC layers each 50 nodes (relu activation), and 1 node output output layer (sigmoid activation)
model = Sequential()
model.add(Dense(100, init='uniform', input_dim=128, activation='relu'))
model.add(Dense(100, init='uniform', activation='relu'))
model.add(Dense(1, init= 'uniform', activation = 'sigmoid'))

#Compiled with adam optimizer, binary crossentropy loss function, accuracy metric for evaluation
model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])

#Setting up logging for tensorboard
#log_dir ="logs/" #+ datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
#tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

#setup earlystop callback
earlystop_callback = EarlyStopping(monitor='val_accuracy', min_delta=0.0001, patience=3)

#train the model over 100 epochs, with validation data, and tensorbard and early stopping callbacks added
model.fit(encX_train, booly_train, 
          epochs=50, batch_size=5, 
          validation_data=(encX_test, booly_test), 
          callbacks =[earlystop_callback])

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))

#print the results
_, accuracy = model.evaluate(encX_test, booly_test)
print('Accuracy: %.2f' % (accuracy*100))

#see the classification report
print(classification_report(booly_test, model.predict_classes(encX_test)))

#Save the trained model into an .h5 file
model.save("model100-100.h5")

  if __name__ == '__main__':
  # Remove the CWD from sys.path while we load stuff.
  # This is added back by InteractiveShellApp.init_path()


Train on 1111679 samples, validate on 277920 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Accuracy: 81.86
Accuracy: 81.86
              precision    recall  f1-score   support

           0       0.82      0.99      0.90    225272
           1       0.64      0.09      0.16     52648

    accuracy                           0.82    277920
   macro avg       0.73      0.54      0.53    277920
weighted avg       0.79      0.82      0.76    277920



## Conclusion

From the results of the DNN models, it seems like we did not achieve an optimal predictor, however, since training was done locally with my computer (GTX 1080 GPU w/ 8GRAM), we could experiment to far. It seems like if a much bigger network was made, it could get better result, but this would seem impractical because of how long it would take to train, but also it would have a hard time trying to keep up with the results for the Logistic Regression. The logisitic regression by far outpreformed all other models, most models having abysmal testing accuracy. However, it should still be ackknowledged that the Logistic Regression was medicore at best with its score.