# Refining the Logistic Regression Model

This notebook explores a few possible improvements to the Logistic Regression model, which has yielded the best results so far. It tries a few additional hyperparameters and tests whether dropping features improves predictive performance.

Below, the data is loaded in and transformed according to the first notebook.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re

#Defining what dtype to convert each column to
#numeric columns are transformed after reading in
dtype_dict = {'Product':"category",
             'Consumer consent provided?': "category",
             'Submitted via': "category",
             'Company response to consumer': "category",
             'Consumer disputed?': "category"}

#read in .csv file, dates are parsed into datetime objects. 
#The Complaint ID is Unique in every entry, so it can be used as index
df = pd.read_csv('Consumer_Complaints.csv',
                 index_col=['Complaint ID'],
                 parse_dates=["Date received","Date sent to company"],
                 dtype=dtype_dict)

#This will replace an ending '-' with 5 (roughly the average of the 10 possible last digits)
regexReplaceDash = r"(\d+)(-)$"
df['ZIP code'] = df['ZIP code'].str.replace(regexReplaceDash, r'\g<1>5', regex=True)

#This will change an ending XX to 50 (roughly the average of the 100 possible endings)
regex_XX = r'(\d{3})(XX)'
df['ZIP code'] = df['ZIP code'].str.replace(regex_XX, r'\g<1>50', regex=True)

#This will set to NaN all other entries that still contain non-digit characters
regexRemove = r'\D+'
df['ZIP code'] = df['ZIP code'].replace(regexRemove, np.nan, regex=True)

#imputes the mean for nan (np.float was removed from NumPy, so the builtin float is used)
imputeMean = df['ZIP code'].astype(float).mean()
df['ZIP code'] = df['ZIP code'].astype(float).fillna(imputeMean)

#Transforming 2 unique valued col to float boolean
booleanize = {'Yes': 1, 'No': 0}
df['Timely response?'] = pd.Series(df['Timely response?'].map(booleanize), dtype = float)

#function to apply to column to convert less common results to 'Other', as well as NaN
def convertToOther(value, keepList):
    if (value == ''):
        return "Other"
    else:
        return value if value in keepList else "Other"
    
#Keeps the 23 most frequent values (minus blacklisted ones and any starting with "other"),
#maps everything else, including NaN, to 'Other', and converts the column to category dtype
def cleanReduceConvert(df, column, blackList=()):
    keepList = []
    for category in df[column].value_counts().head(23).index.tolist():
        if (category.lower().split()[0] != "other"):
            keepList.append(category)
    for category in blackList:
        try:
            keepList.remove(category)
        except ValueError:
            pass

    #fillna returns a new Series, so the caller's dataframe is not modified here
    filled = df[column].fillna('')
    return pd.Series(filled.apply(convertToOther, args=(keepList,)), dtype = 'category')

df['Sub-product'] = cleanReduceConvert(df, 'Sub-product', blackList= ['I do not know'])
df['Issue'] = cleanReduceConvert(df, 'Issue')
df['Sub-issue'] = cleanReduceConvert(df, 'Sub-issue')
df['Company'] = cleanReduceConvert(df, 'Company')

#Returns 1.0 when a narrative is present and 0.0 when it is missing
def entryOrNull(strVal):
    return 1.0 if pd.notna(strVal) else 0.0

df['Consumer complaint narrative submitted?'] = df['Consumer complaint narrative'].apply(entryOrNull)

def dtToCols(df, dtcolumn):
    df["{} day".format(dtcolumn)] = df[dtcolumn].dt.day
    df["{} month".format(dtcolumn)] = df[dtcolumn].dt.month
    df["{} year".format(dtcolumn)] = df[dtcolumn].dt.year
    
dtToCols(df, "Date received")
dtToCols(df, "Date sent to company")

df["Consumer consent provided?"] = df["Consumer consent provided?"].cat.add_categories("Not recorded").fillna("Not recorded")

df = df.drop(df[df["Company response to consumer"].isna()].index)

dfInProgress = df[df["Company response to consumer"] == "In progress"]
df = df[df["Company response to consumer"] != "In progress"]

dfUntimelyResponse = df[df["Company response to consumer"] == "Untimely response"]
df = df[df["Company response to consumer"] != "Untimely response"]

twoOutputsDict = {"Closed with explanation":"Closed without relief", 
                  "Closed with non-monetary relief":"Closed with relief",
                  "Closed with monetary relief":"Closed with relief",
                  "Closed without relief":"Closed without relief", 
                  "Closed":"Closed without relief",
                  "Closed with relief":"Closed with relief"}
df["Company response to consumer"] = df["Company response to consumer"].map(twoOutputsDict)

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder


#data columns not to be used for the model
dropList = ["Consumer complaint narrative",
            "Company public response",
            "State",
            "Tags",
            "Consumer disputed?",
            "Date received", 
            "Date sent to company",
            "Company response to consumer"]
X = df.drop(dropList, axis=1)
Y = df["Company response to consumer"]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)

#Columns to be standard scaled/imputed
numeric_features = ['ZIP code',
                    'Date received day',
                    'Date received month',
                    'Date received year',
                    'Date sent to company day',
                    'Date sent to company month',
                    'Date sent to company year']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

#Columns to be one-hot encoded
categorical_features = ['Product',
           'Sub-product',
           'Issue',
           'Sub-issue',
           'Company',
           'Consumer consent provided?',
           'Submitted via',
           'Timely response?']
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

#building the column transformer with both transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

#fit the preprocessor, then transform training and test set, assign sparse matrix to variables
preprocessor.fit(X)
encX_train = preprocessor.transform(X_train)
encX_test = preprocessor.transform(X_test)
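
The ZIP code cleanup in the cell above can be checked on a handful of values. The following is a minimal sketch, not the notebook's exact code: the sample strings are hypothetical, the replacements use callables instead of `\g<1>` templates, and the final filter is a stricter variant that keeps only values made of exactly five digits.

```python
import pandas as pd

# Hypothetical sample values, not taken from the complaints data
zips = pd.Series(["1234-", "123XX", "90210", "N/A", None])

# A trailing '-' after digits becomes '5'
zips = zips.str.replace(r"(\d+)-$", lambda m: m.group(1) + "5", regex=True)
# 'XX' after three digits becomes '50'
zips = zips.str.replace(r"(\d{3})XX", lambda m: m.group(1) + "50", regex=True)
# Assumption: anything that is not exactly five digits is treated as missing
zips = zips.where(zips.str.fullmatch(r"\d{5}", na=False))

# Convert to float and fill the gaps with the mean of the valid codes
zips = pd.to_numeric(zips, errors="coerce")
zips = zips.fillna(zips.mean())
print(zips.tolist())
```

The first three entries come out as 12345.0, 12350.0 and 90210.0, and the two unusable entries receive the mean of those three values.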



In [2]:
def get_column_names_from_ColumnTransformer(column_transformer):    
    col_name = []
    for transformer_in_columns in column_transformer.transformers_[:-1]:#the last transformer is ColumnTransformer's 'remainder'
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1],Pipeline): 
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names()
        except AttributeError: # if no 'get_feature_names' function, use raw column name
            names = raw_col_name
        if isinstance(names,np.ndarray): # e.g. OneHotEncoder returns an ndarray
            col_name += names.tolist()
        elif isinstance(names,list):
            col_name += names    
        elif isinstance(names,str):
            col_name.append(names)
    return col_name

processedColumns = get_column_names_from_ColumnTransformer(preprocessor)

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

lr = LogisticRegression(n_jobs=-1, solver='saga', penalty='l1')
lr_para = {'C':[10,1.0,0.1,0.01], 
           'class_weight':[None,'balanced'],
           'max_iter':[50,100,150]}

#Apply grid search with above parameters specified
fitmodel = GridSearchCV(lr, lr_para,cv=4, scoring='roc_auc', n_jobs=-1)
fitmodel.fit(encX_train,y_train)

#store the best fitting LogisticRegression(), create prediction from X_test data
bestfitLR = fitmodel.best_estimator_

bestfitLR.fit(encX_train,y_train)
y_pred = bestfitLR.predict(encX_test)
print(bestfitLR.score(encX_test,y_test))

#display the result
print(classification_report(y_test, y_pred))

print(bestfitLR)



0.6761274227595471
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     78302
Closed without relief       0.90      0.68      0.77    338578

             accuracy                           0.68    416880
            macro avg       0.61      0.67      0.60    416880
         weighted avg       0.79      0.68      0.71    416880

LogisticRegression(C=10, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=-1, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)


After training the model, a dataframe is built from the absolute value of each column's coefficient. Because the dataframe's index matches each column's position in the encoded matrix, it can be used to drop columns from the encoded training and test matrices.

In [4]:
#Creates dataframe with the column position as index, plus the column name and absolute coef value
coef_df = pd.DataFrame([processedColumns,list(abs(bestfitLR.coef_[0]))]).transpose().sort_values(1,ascending=False)
#allows to display full dataframe
pd.set_option('display.max_rows', None)
#print dataframe
coef_df

|  | 0 | 1 |
|---|---|---|
| 24 | x0_Virtual currency | 3.09648 |
| 103 | x4_NATIONSTAR MORTGAGE | 2.40505 |
| 105 | x4_OCWEN FINANCIAL CORPORATION | 1.49491 |
| 101 | x4_HSBC NORTH AMERICA HOLDINGS INC. | 1.42493 |
| 21 | x0_Prepaid card | 1.3289 |
| 86 | x3_Personal information incorrect | 1.19502 |
| 100 | x4_Experian Information Solutions Inc. | 1.17415 |
| 108 | x4_PORTFOLIO RECOVERY ASSOCIATES INC | 0.95341 |
| 70 | x3_Account terms | 0.878602 |
| 98 | x4_ENCORE CAPITAL GROUP INC. | 0.85512 |
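
To make the role of the index concrete, here is a small sketch with hypothetical feature names and coefficients standing in for `processedColumns` and `bestfitLR.coef_[0]`. The column names `feature` and `abs_coef` are illustrative; the notebook itself uses the default labels `0` and `1`.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and coefficients (not from the fitted model)
feature_names = ["x0_a", "x0_b", "x1_c", "x1_d"]
coefs = np.array([-2.5, 0.04, 0.0, 0.7])

# Rank features by the absolute value of their coefficient
coef_table = (
    pd.DataFrame({"feature": feature_names, "abs_coef": np.abs(coefs)})
    .sort_values("abs_coef", ascending=False)
)

# After sorting, the index still holds each feature's column position
cutoff = 0.05
keep_positions = sorted(coef_table.index[coef_table["abs_coef"] > cutoff])
print(coef_table)
print(keep_positions)
```

Sorting reorders the rows but not the index labels, so the surviving labels can be used directly to slice columns out of the encoded matrix.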


As a test, the best model from the first grid search is refit to see whether dropping columns leads to any improvement. The first variable lists the minimum absolute coefficient value a feature must exceed to be kept in each trial model.

In [5]:
#list of minimum coef allowed for each trial
coef_cutoffs = [0, 0.05, 0.01, 0.1, 0.15, 0.2, 0.3]

for coef_cutoff in coef_cutoffs:
    
    #collects indices to drop from dataframe, using coef_cutoff in filter
    drop_indicies = coef_df[coef_df[1] <= coef_cutoff].index
    
    #creates list of columns to keep, and filters the training data to only those columns
    to_keep = list(set(range(encX_train.shape[1]))-set(drop_indicies))    
    new_encX_train = encX_train[:,to_keep]
    
    #applies filtering to test data
    new_encX_test = encX_test[:,to_keep]
    
    
    #tests new column dropped data by fitting then predicting
    bestfitLR.fit(new_encX_train,y_train)
    y_pred = bestfitLR.predict(new_encX_test)
    
    print("Dropping indicies with coef less than an absolute value of ", coef_cutoff, " :\n")
    print(bestfitLR.score(new_encX_test,y_test))

    #display the result
    print(classification_report(y_test, y_pred))

Dropping indicies with coef less than an absolute value of  0  :

0.675626079447323
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     78437
Closed without relief       0.90      0.68      0.77    338443

             accuracy                           0.68    416880
            macro avg       0.61      0.67      0.60    416880
         weighted avg       0.79      0.68      0.71    416880

Dropping indicies with coef less than an absolute value of  0.05  :

0.6754581654193054
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     78437
Closed without relief       0.90      0.68      0.77    338443

             accuracy                           0.68    416880
            macro avg       0.61      0.67      0.60    416880
         weighted avg       0.79      0.68      0.71    416880

Dropping indicies with coef less than an absolute value of  0.01  

The results came back essentially the same each time: keeping or removing the few low-coefficient columns has little influence.

The test is rerun below with a fresh GridSearchCV for each cutoff instead of the earlier best-fit model.

In [5]:
coef_cutoffs = [0, 0.05, 0.01, 0.1, 0.15, 0.2, 0.3]

for coef_cutoff in coef_cutoffs:
    
    drop_indicies = coef_df[coef_df[1] <= coef_cutoff].index

    to_keep = list(set(range(encX_train.shape[1]))-set(drop_indicies))    
    drop_encX_train = encX_train[:,to_keep]

    drop_encX_test = encX_test[:,to_keep]
    
    lr = LogisticRegression(n_jobs=-1, solver='saga', penalty='l1')
    lr_para = {'C':[10,1.0,0.1,0.01], 
               'class_weight':[None,'balanced'],
               'max_iter':[50,100,150]}

    #Apply grid search with above parameters specified
    fitmodel = GridSearchCV(lr, lr_para,cv=4, scoring='roc_auc', n_jobs=-1)
    fitmodel.fit(drop_encX_train,y_train)

    #store the best fitting LogisticRegression(), create prediction from X_test data
    bestfitLR_wdrop = fitmodel.best_estimator_

    bestfitLR_wdrop.fit(drop_encX_train,y_train)
    y_pred = bestfitLR_wdrop.predict(drop_encX_test)
    
    print("Keeping indicies with coef more than an absolute value of ", coef_cutoff, " :\n")
    print(bestfitLR_wdrop.score(drop_encX_test,y_test))

    #display the result
    print(classification_report(y_test, y_pred))





Keeping indicies with coef more than an absolute value of  0  :

0.6761274227595471
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     78302
Closed without relief       0.90      0.68      0.77    338578

             accuracy                           0.68    416880
            macro avg       0.61      0.67      0.60    416880
         weighted avg       0.79      0.68      0.71    416880





Keeping indicies with coef more than an absolute value of  0.05  :

0.6762617539819612
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     78302
Closed without relief       0.90      0.68      0.77    338578

             accuracy                           0.68    416880
            macro avg       0.61      0.67      0.60    416880
         weighted avg       0.79      0.68      0.71    416880





Keeping indicies with coef more than an absolute value of  0.01  :

0.6760698522356553
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     78302
Closed without relief       0.90      0.68      0.77    338578

             accuracy                           0.68    416880
            macro avg       0.61      0.67      0.60    416880
         weighted avg       0.79      0.68      0.71    416880

Keeping indicies with coef more than an absolute value of  0.1  :

0.676563999232393
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     78302
Closed without relief       0.90      0.68      0.77    338578

             accuracy                           0.68    416880
            macro avg       0.61      0.67      0.60    416880
         weighted avg       0.79      0.68      0.71    416880

Keeping indicies with coef more than an absolute value of  0.15  :



Keeping indicies with coef more than an absolute value of  0.3  :

0.674740932642487
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     78302
Closed without relief       0.90      0.68      0.77    338578

             accuracy                           0.67    416880
            macro avg       0.61      0.67      0.60    416880
         weighted avg       0.79      0.67      0.71    416880
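
The manual thresholding used in these cells resembles what scikit-learn's `SelectFromModel` does. The sketch below runs it on small synthetic data (an assumption, not the complaints data) to show the idea. Note one difference: `SelectFromModel` keeps features whose absolute coefficient is greater than or equal to the threshold, whereas the loop above drops features at or below the cutoff.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): only the first two columns carry signal
rng = np.random.default_rng(0)
dense = rng.normal(size=(500, 6))
y = (dense[:, 0] - dense[:, 1] > 0).astype(int)
X = sparse.csr_matrix(dense)

# L1-penalised model like the notebook's; max_iter is raised to help convergence
estimator = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=2000)

# fit() trains the estimator, then features with |coef| >= threshold are kept
selector = SelectFromModel(estimator, threshold=0.3).fit(X, y)
kept = selector.get_support(indices=True)
X_reduced = selector.transform(X)
print(kept, X_reduced.shape)
```

Because the selector is a transformer, it could also sit inside a `Pipeline` ahead of the final classifier, which would keep the column filtering and the model fit together.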



Again, the results barely changed after removing columns. Since larger cutoff values remain untried, the next cell drops many more columns.

In [6]:
coef_cutoffs = [0.4, 0.5, 0.6, 0.8, 1]

for coef_cutoff in coef_cutoffs:
    
    drop_indicies = coef_df[coef_df[1] <= coef_cutoff].index

    to_keep = list(set(range(encX_train.shape[1]))-set(drop_indicies))    
    drop_encX_train = encX_train[:,to_keep]

    drop_encX_test = encX_test[:,to_keep]
    
    lr = LogisticRegression(n_jobs=-1, solver='saga', penalty='l1')
    lr_para = {'C':[10,1.0,0.1,0.01], 
               'class_weight':[None,'balanced'],
               'max_iter':[50,100,150]}

    #Apply grid search with above parameters specified
    fitmodel = GridSearchCV(lr, lr_para,cv=4, scoring='roc_auc', n_jobs=-1)
    fitmodel.fit(drop_encX_train,y_train)

    #store the best fitting LogisticRegression(), create prediction from X_test data
    bestfitLR_wdrop = fitmodel.best_estimator_

    bestfitLR_wdrop.fit(drop_encX_train,y_train)
    y_pred = bestfitLR_wdrop.predict(drop_encX_test)
    
    print("Keeping indicies with coef more than an absolute value of ", coef_cutoff, " :\n")
    print(bestfitLR_wdrop)
    print(bestfitLR_wdrop.score(drop_encX_test,y_test))

    #display the result
    print(classification_report(y_test, y_pred))

Keeping indicies with coef more than an absolute value of  0.4  :

LogisticRegression(C=0.1, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=150, multi_class='warn', n_jobs=-1, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
0.6802748992515832
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.64      0.43     78302
Closed without relief       0.89      0.69      0.78    338578

             accuracy                           0.68    416880
            macro avg       0.61      0.66      0.60    416880
         weighted avg       0.78      0.68      0.71    416880

Keeping indicies with coef more than an absolute value of  0.5  :

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_i



Keeping indicies with coef more than an absolute value of  0.8  :

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=50,
                   multi_class='warn', n_jobs=-1, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
0.8128430243715218
                       precision    recall  f1-score   support

   Closed with relief       0.66      0.01      0.01     78302
Closed without relief       0.81      1.00      0.90    338578

             accuracy                           0.81    416880
            macro avg       0.74      0.50      0.46    416880
         weighted avg       0.79      0.81      0.73    416880





Keeping indicies with coef more than an absolute value of  1  :

LogisticRegression(C=10, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=50, multi_class='warn', n_jobs=-1, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
0.7962291306850893
                       precision    recall  f1-score   support

   Closed with relief       0.41      0.20      0.27     78302
Closed without relief       0.83      0.93      0.88    338578

             accuracy                           0.80    416880
            macro avg       0.62      0.57      0.58    416880
         weighted avg       0.76      0.80      0.77    416880
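
The accuracy of about 0.81 reported for the 0.8 cutoff needs context, because the classes are imbalanced. A short calculation using the support counts from the reports above shows the accuracy a model would reach by always predicting the majority class:

```python
# Support counts copied from the classification reports above
n_relief = 78302
n_no_relief = 338578

# Accuracy of a model that always predicts "Closed without relief"
baseline_accuracy = n_no_relief / (n_relief + n_no_relief)
print(round(baseline_accuracy, 4))  # 0.8122
```

The 0.8 cutoff model scores 0.8128, barely above this baseline, and its recall of 0.01 for "Closed with relief" confirms that it almost never predicts the minority class. The macro-averaged recall (0.50 versus 0.67 for the full model) reflects this more clearly than accuracy does.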



Performance finally changes, but for the worse. Because scores already begin to degrade at the first cutoff tried (0.4), the next cell explores values around it more finely to see where the decline becomes noticeable.

In [7]:
coef_cutoffs = [0.35, 0.4, 0.45]

for coef_cutoff in coef_cutoffs:
    
    drop_indicies = coef_df[coef_df[1] <= coef_cutoff].index

    to_keep = list(set(range(encX_train.shape[1]))-set(drop_indicies))    
    drop_encX_train = encX_train[:,to_keep]

    drop_encX_test = encX_test[:,to_keep]
    
    lr = LogisticRegression(n_jobs=-1, solver='saga', penalty='l1')
    lr_para = {'C':[10,1.0,0.1,0.01], 
               'class_weight':[None,'balanced'],
               'max_iter':[50,100,150]}

    #Apply grid search with above parameters specified
    fitmodel = GridSearchCV(lr, lr_para,cv=4, scoring='roc_auc', n_jobs=-1)
    fitmodel.fit(drop_encX_train,y_train)

    #store the best fitting LogisticRegression(), create prediction from X_test data
    bestfitLR_wdrop = fitmodel.best_estimator_

    bestfitLR_wdrop.fit(drop_encX_train,y_train)
    y_pred = bestfitLR_wdrop.predict(drop_encX_test)
    
    print("Keeping indicies with coef more than an absolute value of ", coef_cutoff, " :\n")
    print(bestfitLR_wdrop)
    print(bestfitLR_wdrop.score(drop_encX_test,y_test))

    #display the result
    print(classification_report(y_test, y_pred))

Keeping indicies with coef more than an absolute value of  0.35  :

LogisticRegression(C=0.1, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=150, multi_class='warn', n_jobs=-1, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
0.6745490308961811
                       precision    recall  f1-score   support

   Closed with relief       0.32      0.65      0.43     78302
Closed without relief       0.89      0.68      0.77    338578

             accuracy                           0.67    416880
            macro avg       0.61      0.66      0.60    416880
         weighted avg       0.79      0.67      0.71    416880

Keeping indicies with coef more than an absolute value of  0.4  :

LogisticRegression(C=0.1, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None



Keeping indicies with coef more than an absolute value of  0.45  :

LogisticRegression(C=10, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=150, multi_class='warn', n_jobs=-1, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
0.651823066589906
                       precision    recall  f1-score   support

   Closed with relief       0.30      0.66      0.41     78302
Closed without relief       0.89      0.65      0.75    338578

             accuracy                           0.65    416880
            macro avg       0.60      0.65      0.58    416880
         weighted avg       0.78      0.65      0.69    416880



The best results with the most columns dropped come from a cutoff coefficient value of 0.3. Before this is included in the production code, the next cell checks whether it significantly reduces the time required to train the model.

In [9]:
import time


start_time = time.time()

bestfitLR.fit(encX_train,y_train)
print('Time to fit with full sparse matrix')
print("--- %s seconds ---" % (time.time() - start_time))

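#drop_encX_train and bestfitLR_wdrop are whatever the most recently run loop left behind
#(cutoff 0.45 if the previous cell ran last); rebuild them for cutoff 0.3 to time that setup
start_time = time.time()

bestfitLR_wdrop.fit(drop_encX_train,y_train)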
print('Time to fit with dropped-columns sparse matrix')
print("--- %s seconds ---" % (time.time() - start_time))

Time to fit with full sparse matrix
--- 89.3507010936737 seconds ---
Time to fit with dropped-columns sparse matrix
--- 78.12662744522095 seconds ---
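
A single timing per model can be noisy. As a rough sketch of a more robust comparison, the hypothetical helper `time_fit` below repeats each fit and reports the fastest run, using `time.perf_counter`. The data is synthetic and much smaller than the real matrices, so the absolute numbers are not comparable with those above.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression


def time_fit(estimator, X, y, repeats=3):
    """Return the fastest of several fit times, in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        estimator.fit(X, y)
        durations.append(time.perf_counter() - start)
    return min(durations)


# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 40))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression(solver="saga", penalty="l1", max_iter=200)
full = time_fit(model, X, y)
reduced = time_fit(model, X[:, :10], y)
print(f"full: {full:.3f}s, reduced: {reduced:.3f}s")
```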


## Conclusion

Removing low-coefficient columns brought little improvement: scores stayed essentially unchanged up to a cutoff of 0.3, and the measured fit time dropped only from about 89 to about 78 seconds. For this reason, the original Logistic Regression model will be used with the original preprocessing pipeline. See the production code folder for the final files of the project.