# Designing the Production Code

This final notebook develops the code that will be pushed to production. That code will be split into one file that trains the model and one that loads it and predicts, but both are developed and tested here.

The first thing to test is code that makes a REST API call to the Consumer Financial Protection Bureau's (CFPB) complaint API to fetch an entry by complaint ID. The requests library is used to make the API request.

In [1]:
import requests
import pandas as pd

resp = requests.get('https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/3398126')
if resp.status_code != 200:
    # Anything other than 200 means the request failed.
    raise RuntimeError('GET complaint returned status {}'.format(resp.status_code))
print(resp.json())

{'took': 3, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'failed': 0}, 'hits': {'total': 1, 'max_score': 1.0, 'hits': [{'_index': 'complaint-public-v2', '_type': 'complaint', '_id': '3398126', '_score': 1.0, '_source': {'tags': None, 'date_indexed_formatted': '01/20/20', ':updated_at': 1579303028, 'date_indexed': '2020-01-20T12:00:00-05:00', 'zip_code': '191XX', 'complaint_id': '3398126', 'issue': 'Unauthorized transactions or other transaction problem', 'date_received': '2019-10-07T12:00:00-05:00', 'state': 'PA', 'date_sent_to_company_formatted': '10/07/19', 'date_received_formatted': '10/07/19', 'consumer_disputed': 'N/A', 'has_narrative': True, 'product': 'Money transfer, virtual currency, or money service', 'company_response': 'Closed with non-monetary relief', 'submitted_via': 'Web', 'company': 'Paypal Holdings, Inc', 'date_sent_to_company': '2019-10-07T12:00:00-05:00', 'company_public_response': None, 'sub_product': 'Mobile or digital wallet', 'timely': 'Yes', 'co

The cell above prints the raw response. After exploring the JSON a bit, the complaint record is found at `['hits']['hits'][0]['_source']`, as shown in the next cell.

In [2]:
resp.json()['hits']['hits'][0]['_source']

{'tags': None,
 'date_indexed_formatted': '01/20/20',
 ':updated_at': 1579303028,
 'date_indexed': '2020-01-20T12:00:00-05:00',
 'zip_code': '191XX',
 'complaint_id': '3398126',
 'issue': 'Unauthorized transactions or other transaction problem',
 'date_received': '2019-10-07T12:00:00-05:00',
 'state': 'PA',
 'date_sent_to_company_formatted': '10/07/19',
 'date_received_formatted': '10/07/19',
 'consumer_disputed': 'N/A',
 'has_narrative': True,
 'product': 'Money transfer, virtual currency, or money service',
 'company_response': 'Closed with non-monetary relief',
 'submitted_via': 'Web',
 'company': 'Paypal Holdings, Inc',
 'date_sent_to_company': '2019-10-07T12:00:00-05:00',
 'company_public_response': None,
 'sub_product': 'Mobile or digital wallet',
 'timely': 'Yes',
 'complaint_what_happened': 'I was using Venmo, a PayPal company, to transfer {$180.00} into my bank account via their instant transfer option on XX/XX/XXXX at XXXX. I have contacted both my bank ( XXXX XXXX XXXX ) and
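
The nested lookup above raises an `IndexError` when a complaint ID matches nothing. A minimal sketch of a more defensive lookup follows. The helper name `extract_complaint` and the two sample payloads are hypothetical; only the `hits` / `hits` / `_source` layout is taken from the response shown above.

```python
def extract_complaint(payload):
    """Return the first complaint record in an API payload, or None if there are no hits."""
    hits = payload.get('hits', {}).get('hits', [])
    if not hits:
        return None
    return hits[0].get('_source')


# Hypothetical payloads that mimic the layout of the response above
found = {'hits': {'total': 1,
                  'hits': [{'_id': '3398126',
                            '_source': {'complaint_id': '3398126', 'state': 'PA'}}]}}
empty = {'hits': {'total': 0, 'hits': []}}

print(extract_complaint(found)['state'])
print(extract_complaint(empty))
```

Returning `None` instead of raising lets the calling script decide whether to exit or to ask for another ID.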

## TrainSaveModel.py

In the next cell, a cleaned-up version of the code that trains the logistic regression is displayed and run. The preprocessor and the best model are then combined into a single pipeline, which is saved with joblib.

In [134]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import joblib



#Defining what dtype to convert each column to
#numeric columns are transformed after reading in
dtype_dict = {'Product':"category",
             'Consumer consent provided?': "category",
             'Submitted via': "category",
             'Company response to consumer': "category",
             'Consumer disputed?': "category"}

#read in .csv file, dates are parsed into datetime objects. 
#The Complaint ID is Unique in every entry, so it can be used as index
df = pd.read_csv('../Consumer_Complaints.csv',
                 index_col=['Complaint ID'],
                 parse_dates=["Date received","Date sent to company"],
                 dtype=dtype_dict)

#Replace a trailing '-' with 5 (the midpoint of the 10 possible digits)
regexReplaceDash = r"(\d+)(-)$"
df['ZIP code'] = df['ZIP code'].str.replace(regexReplaceDash, r'\g<1>5', regex=True)

#Replace a trailing XX with 50 (the midpoint of the 100 possible values)
regex_XX = r'(\d{3})(XX)'
df['ZIP code'] = df['ZIP code'].str.replace(regex_XX, r'\g<1>50', regex=True)

#Any entry that still contains a non-digit character is set to NaN
regexRemove = r'\D+'
df['ZIP code'] = df['ZIP code'].replace(regexRemove, np.nan, regex=True)

#impute the mean for NaN
imputeMean = df['ZIP code'].astype(float).mean()
df['ZIP code'] = df['ZIP code'].astype(float).fillna(imputeMean)

#Transform the two-valued column to a float boolean
booleanize = {'Yes': 1, 'No': 0}
df['Timely response?'] = df['Timely response?'].map(booleanize).astype(float)

#function to apply to a column: values outside keepList (including '') become "Other"
def convertToOther(value, keepList):
    if (value == ''):
        return "Other"
    else:
        return value if value in keepList else "Other"
    
#Keeps the top 23 values by count (minus any in blackList and any "Other ..." labels),
#maps NaN and all remaining values to "Other", and converts to category dtype
def cleanReduceConvert(df, column, blackList=None):
    keepList = []
    for category in df[column].value_counts().head(23).index.tolist():
        if (category.lower().split()[0] != "other"):
            keepList.append(category)
    for category in (blackList or []):
        try:
            keepList.remove(category)
        except ValueError:
            pass

    #fill NaN on a copy of the column rather than in place
    filled = df[column].fillna('')
    return pd.Series(filled.apply(convertToOther, args=(keepList,)), dtype = 'category')

df['Sub-product'] = cleanReduceConvert(df, 'Sub-product', blackList= ['I do not know'])
df['Issue'] = cleanReduceConvert(df, 'Issue')
df['Sub-issue'] = cleanReduceConvert(df, 'Sub-issue')
df['Company'] = cleanReduceConvert(df, 'Company')

def entryOrNull(strVal):
    #1.0 when a narrative is present, 0.0 when it is missing (NaN or None)
    return 1.0 if pd.notna(strVal) else 0.0

df['Consumer complaint narrative submitted?'] = df['Consumer complaint narrative'].apply(entryOrNull)

def dtToCols(df, dtcolumn):
    df["{} day".format(dtcolumn)] = df[dtcolumn].dt.day
    df["{} month".format(dtcolumn)] = df[dtcolumn].dt.month
    df["{} year".format(dtcolumn)] = df[dtcolumn].dt.year
    
dtToCols(df, "Date received")
dtToCols(df, "Date sent to company")

df["Consumer consent provided?"] = df["Consumer consent provided?"].cat.add_categories("Not recorded").fillna("Not recorded")

df = df.drop(df[df["Company response to consumer"].isna()].index)

dfInProgress = df[df["Company response to consumer"] == "In progress"]
df = df[df["Company response to consumer"] != "In progress"]

dfUntimelyResponse = df[df["Company response to consumer"] == "Untimely response"]
df = df[df["Company response to consumer"] != "Untimely response"]

twoOutputsDict = {"Closed with explanation":"Closed without relief", 
                  "Closed with non-monetary relief":"Closed with relief",
                  "Closed with monetary relief":"Closed with relief",
                  "Closed without relief":"Closed without relief", 
                  "Closed":"Closed without relief",
                  "Closed with relief":"Closed with relief"}
df["Company response to consumer"] = df["Company response to consumer"].map(twoOutputsDict)



#data columns not to be used for the model
dropList = ["Consumer complaint narrative",
            "Company public response",
            "State",
            "Tags",
            "Consumer disputed?",
            "Date received", 
            "Date sent to company",
            "Company response to consumer"]
X = df.drop(dropList, axis=1)
Y = df["Company response to consumer"]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)

#Columns to be standard scaled/imputed
numeric_features = ['ZIP code',
                    'Date received day',
                    'Date received month',
                    'Date received year',
                    'Date sent to company day',
                    'Date sent to company month',
                    'Date sent to company year']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

#Columns to one hot encoded
categorical_features = ['Product',
           'Sub-product',
           'Issue',
           'Sub-issue',
           'Company',
           'Consumer consent provided?',
           'Submitted via',
           'Timely response?']
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

#building the column transformer with both transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

#fit the preprocessor on the training set only, then transform the training and test sets into sparse matrices
preprocessor.fit(X_train)
encX_train = preprocessor.transform(X_train)
encX_test = preprocessor.transform(X_test)


lr = LogisticRegression(n_jobs=-1, solver='saga', penalty='l1')
lr_para = {'C':[10,1.0,0.1,0.01], 
           'class_weight':[None,'balanced'],
           'max_iter':[50,100,150]}

#Apply grid search with above parameters specified
fitmodel = GridSearchCV(lr, lr_para,cv=4, scoring='roc_auc', n_jobs=-1)
fitmodel.fit(encX_train,y_train)

#store the best fitting LogisticRegression(), create predictions from the X_test data
bestfitLR = fitmodel.best_estimator_
y_pred = bestfitLR.predict(encX_test)
print('Training Completed with these results: \n')

print(classification_report(y_test, y_pred))

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logregclassifier', bestfitLR)])



import joblib
pipeline_filename = "lrmodelpipeline.save"
joblib.dump(clf, pipeline_filename) 

print('Model has been saved to :', pipeline_filename)



Training Completed with these results: 

                       precision    recall  f1-score   support

   Closed with relief       0.32      0.66      0.43     78895
Closed without relief       0.89      0.68      0.77    337985

             accuracy                           0.68    416880
            macro avg       0.61      0.67      0.60    416880
         weighted avg       0.79      0.68      0.71    416880

Model has been saved to : lrmodelpipeline.save
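
The ZIP code cleaning in the training cell is easier to check on individual strings than on the full column. The sketch below mirrors the same three regex steps with the standard `re` module. The function name `normalize_zip` is hypothetical, it returns `None` where the pandas version produces NaN (which is then filled with the column mean), and treating an empty string as missing is an added assumption.

```python
import re


def normalize_zip(raw):
    """Apply the three ZIP code regex steps from the training cell to one string."""
    if raw is None:
        return None
    # a trailing '-' stands for one unknown digit: use 5, the middle of 0-9
    value = re.sub(r'(\d+)(-)$', r'\g<1>5', raw)
    # a masked 'XX' stands for two unknown digits: use 50, the middle of 00-99
    value = re.sub(r'(\d{3})(XX)', r'\g<1>50', value)
    # anything that is empty or still contains a non-digit is treated as missing
    if value == '' or re.search(r'\D', value):
        return None
    return float(value)


for raw in ['191XX', '1234-', '30303', 'N/A']:
    print(raw, '->', normalize_zip(raw))
```

The complaint fetched earlier has `zip_code` `'191XX'`, which this logic turns into `19150.0`.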


## LoadModelPredict.py

The code below starts by creating a DataFrame from the response JSON and reformats its columns so that it can be passed to the pre-trained model. Since this is done for a single entry, the prediction is returned almost instantly.

In [4]:
import sys, getopt
import requests
import pandas as pd
import numpy as np

def main(argv):
    usage = 'LoadModelPredict.py -i <complaintID>'
    complaintID = None
    try:
        opts, args = getopt.getopt(argv, "hi:", ['complaint='])
    except getopt.GetoptError:
        print(usage)
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print(usage)
            sys.exit()
        elif opt in ("-i", "--complaint"):
            complaintID = int(arg)
    if complaintID is None:
        # no complaint ID was supplied on the command line
        print(usage)
        sys.exit(2)
    print("Complaint #{}'s outcome will be predicted:".format(complaintID))

    apiurl = 'https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/'
    queriedUrl = apiurl + str(complaintID)
    resp = requests.get(queriedUrl)
    if resp.status_code != 200:
        # Anything other than 200 means the request failed, so stop here.
        print("ApiError: response status code", resp.status_code)
        sys.exit(1)
    if resp.json()['hits']['total'] == 0:
        print("Complaint ID yielded 0 results. Check that it was entered correctly.")
        sys.exit()
    ### Paste rest of code here
    
if __name__ == '__main__':
    main(sys.argv[1:])

In [49]:
import sys, getopt
import requests
import pandas as pd
import numpy as np

resp = requests.get('https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/3398126')
if resp.status_code != 200:
    # Anything other than 200 means the request failed.
    raise RuntimeError('GET complaint returned status {}'.format(resp.status_code))

#Gets complaint Id from request
complaint_id = int(resp.json()['hits']['hits'][0]['_source']['complaint_id'])

#Creates DataFrame from REST API response
df1 = pd.DataFrame(resp.json()['hits']['hits'][0]['_source'], index=[complaint_id])

#generate drop list for before preprocessing
droplist1 = ['date_indexed_formatted',
             'complaint_id',
             'date_received_formatted',
             ':updated_at',
             'date_indexed',
             'date_sent_to_company_formatted',
             'has_narrative']
#drop columns
df_drop1 = df1.drop(droplist1,axis=1)

#dict that renames the API fields to the column names the preprocessor expects
corrected_cols_dict = {'tags':'Tags',
                       'zip_code':'ZIP code',
                       'issue':'Issue',
                       'date_received':'Date received',
                       'state':'State',
                       'consumer_disputed':'Consumer disputed?',
                       'product':'Product',
                       'company_response':'Company response to consumer',
                       'submitted_via':'Submitted via',
                       'company':'Company',
                       'date_sent_to_company':'Date sent to company',
                       'company_public_response':'Company public response',
                       'sub_product':'Sub-product',
                       'timely':'Timely response?',
                       'complaint_what_happened':'Consumer complaint narrative',
                       'sub_issue':'Sub-issue',
                       'consumer_consent_provided':'Consumer consent provided?'}

df_drop1 = df_drop1.rename(corrected_cols_dict, axis=1)

#match order of columns. Generated with list(df.columns.values) from other notebooks
reordered_cols= ['Date received',
                 'Product',
                 'Sub-product',
                 'Issue',
                 'Sub-issue',
                 'Consumer complaint narrative',
                 'Company public response',
                 'Company',
                 'State',
                 'ZIP code',
                 'Tags',
                 'Consumer consent provided?',
                 'Submitted via',
                 'Date sent to company',
                 'Company response to consumer',
                 'Timely response?',
                 'Consumer disputed?']
#reorder the columns
df_reordered = df_drop1[reordered_cols]

#set index name to match
df_reordered.index.name='Complaint ID'

#define dictionary to define new dtypes
dtype_dict = {'Product':"category",
             'Consumer consent provided?': "category",
             'Submitted via': "category",
             'Consumer disputed?': "category",
             'Date received':'<M8[ns]',
             'Date sent to company':'<M8[ns]'}

#change dtypes
df = df_reordered.astype(dtype_dict)

#reuse the training code to transform the data

#Replace a trailing '-' with 5 (the midpoint of the 10 possible digits)
regexReplaceDash = r"(\d+)(-)$"
df['ZIP code'] = df['ZIP code'].str.replace(regexReplaceDash, r'\g<1>5', regex=True)

#Replace a trailing XX with 50 (the midpoint of the 100 possible values)
regex_XX = r'(\d{3})(XX)'
df['ZIP code'] = df['ZIP code'].str.replace(regex_XX, r'\g<1>50', regex=True)

#Any entry that still contains a non-digit character is set to NaN
regexRemove = r'\D+'
df['ZIP code'] = df['ZIP code'].replace(regexRemove, np.nan, regex=True)

#impute the mean for NaN
imputeMean = df['ZIP code'].astype(float).mean()
df['ZIP code'] = df['ZIP code'].astype(float).fillna(imputeMean)

#Transform the two-valued column to a float boolean
booleanize = {'Yes': 1, 'No': 0}
df['Timely response?'] = df['Timely response?'].map(booleanize).astype(float)

#function to apply to a column: values outside keepList (including '') become "Other"
def convertToOther(value, keepList):
    if (value == ''):
        return "Other"
    else:
        return value if value in keepList else "Other"
    
#Keeps the top 23 values by count (minus any in blackList and any "Other ..." labels),
#maps NaN and all remaining values to "Other", and converts to category dtype
def cleanReduceConvert(df, column, blackList=None):
    keepList = []
    for category in df[column].value_counts().head(23).index.tolist():
        if (category.lower().split()[0] != "other"):
            keepList.append(category)
    for category in (blackList or []):
        try:
            keepList.remove(category)
        except ValueError:
            pass

    #fill NaN on a copy of the column rather than in place
    filled = df[column].fillna('')
    return pd.Series(filled.apply(convertToOther, args=(keepList,)), dtype = 'category')

df['Sub-product'] = cleanReduceConvert(df, 'Sub-product', blackList= ['I do not know'])
df['Issue'] = cleanReduceConvert(df, 'Issue')
df['Sub-issue'] = cleanReduceConvert(df, 'Sub-issue')
df['Company'] = cleanReduceConvert(df, 'Company')

def entryOrNull(strVal):
    #1.0 when a narrative is present, 0.0 when it is missing (NaN or None)
    return 1.0 if pd.notna(strVal) else 0.0

df['Consumer complaint narrative submitted?'] = df['Consumer complaint narrative'].apply(entryOrNull)

def dtToCols(df, dtcolumn):
    df["{} day".format(dtcolumn)] = df[dtcolumn].dt.day
    df["{} month".format(dtcolumn)] = df[dtcolumn].dt.month
    df["{} year".format(dtcolumn)] = df[dtcolumn].dt.year
    
dtToCols(df, "Date received")
dtToCols(df, "Date sent to company")

df["Consumer consent provided?"] = df["Consumer consent provided?"].cat.add_categories("Not recorded").fillna("Not recorded")

df = df.drop(df[df["Company response to consumer"].isna()].index)

dfInProgress = df[df["Company response to consumer"] == "In progress"]
df = df[df["Company response to consumer"] != "In progress"]

dfUntimelyResponse = df[df["Company response to consumer"] == "Untimely response"]
df = df[df["Company response to consumer"] != "Untimely response"]

twoOutputsDict = {"Closed with explanation":"Closed without relief", 
                  "Closed with non-monetary relief":"Closed with relief",
                  "Closed with monetary relief":"Closed with relief",
                  "Closed without relief":"Closed without relief", 
                  "Closed":"Closed without relief",
                  "Closed with relief":"Closed with relief"}
df["Company response to consumer"] = df["Company response to consumer"].map(twoOutputsDict)



#data columns not to be used for the model
dropList = ["Consumer complaint narrative",
            "Company public response",
            "State",
            "Tags",
            "Consumer disputed?",
            "Date received", 
            "Date sent to company",
            "Company response to consumer"]
X = df.drop(dropList, axis=1)
Y = df["Company response to consumer"]

import joblib

pipeline_filename = "lrmodelpipeline.save"

loaded_clf = joblib.load(pipeline_filename)
prediction = loaded_clf.predict(X)[0]
pred_proba_perc = loaded_clf.predict_proba(X)[0][0] * 100

print("Prediction of Outcome: ", prediction)
print("With a ", round(pred_proba_perc, 2) , "% chance of being Closed with relief")

Prediction of Outcome:  Closed without relief
With a  43.31 % chance of being Closed with relief
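
`predict_proba` returns one column per class, in the order of the classifier's `classes_` attribute, which scikit-learn keeps sorted. The `[0][0]` index above therefore relies on 'Closed with relief' sorting before 'Closed without relief'. The sketch below shows that ordering and a lookup by label rather than by position. The helper `proba_for` and the probability row are hypothetical; the row is chosen to match the output above.

```python
labels = ['Closed without relief', 'Closed with relief']

# Sorted order, the same order scikit-learn would use for classes_
classes = sorted(labels)
print(classes)


def proba_for(label, classes, proba_row):
    """Pick the probability that belongs to `label` from one predict_proba row."""
    return proba_row[list(classes).index(label)]


# Hypothetical probability row for a single complaint
row = [0.4331, 0.5669]
print(round(proba_for('Closed with relief', classes, row) * 100, 2))
```

With the loaded pipeline, `classes` would presumably come from `loaded_clf.classes_` and `row` from `loaded_clf.predict_proba(X)[0]`, so the percentage would stay correct even if the labels were renamed.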




In [55]:
complaintID = int(3398126)
apiurl = 'https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/'
queriedUrl = apiurl + str(complaintID)
resp = requests.get(queriedUrl)
if resp.status_code != 200:
    # Anything other than 200 means the request failed.
    raise RuntimeError('GET {} returned status {}'.format(queriedUrl, resp.status_code))
if resp.json()['hits']['total'] == 0:
    print("Complaint ID yielded 0 results. Check that it was entered correctly.")
    sys.exit()
print(resp.json())

{'took': 3, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'failed': 0}, 'hits': {'total': 1, 'max_score': 1.0, 'hits': [{'_index': 'complaint-public-v2', '_type': 'complaint', '_id': '3398126', '_score': 1.0, '_source': {'tags': None, 'date_indexed_formatted': '01/24/20', ':updated_at': 1579840210, 'date_indexed': '2020-01-24T12:00:00-05:00', 'zip_code': '191XX', 'complaint_id': '3398126', 'issue': 'Unauthorized transactions or other transaction problem', 'date_received': '2019-10-07T12:00:00-05:00', 'state': 'PA', 'date_sent_to_company_formatted': '10/07/19', 'date_received_formatted': '10/07/19', 'consumer_disputed': 'N/A', 'has_narrative': True, 'product': 'Money transfer, virtual currency, or money service', 'company_response': 'Closed with non-monetary relief', 'submitted_via': 'Web', 'company': 'Paypal Holdings, Inc', 'date_sent_to_company': '2019-10-07T12:00:00-05:00', 'company_public_response': None, 'sub_product': 'Mobile or digital wallet', 'timely': 'Yes', 'co

In [1]:
import pandas as pd

dtype_dict = {'Product':"category",
             'Consumer consent provided?': "category",
             'Submitted via': "category",
             'Company response to consumer': "category",
             'Consumer disputed?': "category"}

#read in .csv file, dates are parsed into datetime objects. 
#The Complaint ID is Unique in every entry, so it can be used as index
df = pd.read_csv('../Consumer_Complaints.csv',
                 index_col=['Complaint ID'],
                 parse_dates=["Date received","Date sent to company"],
                 dtype=dtype_dict)



In [6]:
df.head().to_json()

'{"Date received":{"3384392":1569283200000,"3379500":1568851200000,"3417821":1571961600000,"3483640":1577923200000,"3433198":1573171200000},"Product":{"3384392":"Debt collection","3379500":"Credit reporting, credit repair services, or other personal consumer reports","3417821":"Credit reporting, credit repair services, or other personal consumer reports","3483640":"Debt collection","3433198":"Debt collection"},"Sub-product":{"3384392":"I do not know","3379500":"Credit reporting","3417821":"Credit reporting","3483640":"Credit card debt","3433198":"I do not know"},"Issue":{"3384392":"Attempts to collect debt not owed","3379500":"Incorrect information on your report","3417821":"Incorrect information on your report","3483640":"Took or threatened to take negative or legal action","3433198":"Communication tactics"},"Sub-issue":{"3384392":"Debt is not yours","3379500":"Information belongs to someone else","3417821":"Information belongs to someone else","3483640":"Sued you without properly not

In [47]:
df["Date received"].dtype == np.dtype('<M8[ns]')

True

In [2]:
import pandas as pd

dtype_dict = {'Product':"category",
                'Consumer consent provided?': "category",
                'Submitted via': "category",
                'Company response to consumer': "category",
                'Consumer disputed?': "category"}

#read in .csv file, dates are parsed into datetime objects. 
#The Complaint ID is Unique in every entry, so it can be used as index
df = pd.read_csv('./complaints.csv',
                index_col=['Complaint ID'],
                parse_dates=["Date received","Date sent to company"],
                dtype=dtype_dict)



In [3]:
df.columns.to_list()

['Date received',
 'Product',
 'Sub-product',
 'Issue',
 'Sub-issue',
 'Consumer complaint narrative',
 'Company public response',
 'Company',
 'State',
 'ZIP code',
 'Tags',
 'Consumer consent provided?',
 'Submitted via',
 'Date sent to company',
 'Company response to consumer',
 'Timely response?',
 'Consumer disputed?']

In [4]:
li1 = ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 
'Consumer complaint narrative submitted?', 'Date received day', 'Date received month', 'Date received year', 'Date sent to company day', 'Date sent to company month', 'Date sent to company year']

li2 = ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?'] 

list(set(li1)-set(li2))

['Date received year',
 'Date sent to company month',
 'Date received day',
 'Consumer complaint narrative submitted?',
 'Date sent to company day',
 'Date received month',
 'Date sent to company year']
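
Because a `set` has no order, the difference above comes back shuffled. A list comprehension keeps the derived columns in the order they were created. The shortened lists below are stand-ins for `li1` and `li2`, used only to keep the sketch small.

```python
# Stand-ins for li2 (raw columns) and li1 (raw plus derived columns)
before = ['Date received', 'ZIP code', 'Timely response?']
after = before + ['Consumer complaint narrative submitted?',
                  'Date received day',
                  'Date received month',
                  'Date received year']

# set for fast membership tests, list comprehension to keep the original order
known = set(before)
added = [col for col in after if col not in known]
print(added)
```

An ordered result is convenient when the new column names are pasted into a list such as `numeric_features`.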