<h1 style='color:#00868b'>Read, balance and clean dataset<span class="tocSkip"></span></h1>

This notebook reads the data, balances the product classes, cleans the complaint text and saves the result to a CSV file, which a separate notebook can then use to train a model. The balancing and cleaning steps can be tweaked as needed.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

**Read file**

In [2]:
df = pd.read_csv("../../complaints-2020-01-22_08_24.csv", encoding="utf-8")

**Select columns to keep**

Add your columns here.

In [6]:
# Use a list of labels: a tuple can be read as a single (MultiIndex-style) key
df_selected = df.loc[:, ['Product', 'Consumer complaint narrative', 'Issue']]
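
If the source file is large, the same selection can be made while reading it. The sketch below assumes the `usecols` parameter of `pd.read_csv` fits the workflow; the in-memory CSV and its rows are made up for illustration and stand in for the real complaints file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the complaints CSV (made-up rows).
csv_text = (
    "Product,Issue,Consumer complaint narrative,State\n"
    "Mortgage,Late fee,I was charged twice,CA\n"
    "Student loan,Servicer,No reply to my letters,NY\n"
)

columns = ["Product", "Consumer complaint narrative", "Issue"]

# usecols limits what is parsed; indexing with [columns] fixes the column order.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=columns)[columns]
print(df_small.columns.tolist())
```

Reading only the needed columns keeps memory use down, at the cost of having to re-read the file if another column is wanted later.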

**Remove products with the fewest complaints**

In [7]:
df_balanced1 = df_selected

df_balanced1 = df_balanced1[df_balanced1["Product"] != "Virtual currency"]
df_balanced1 = df_balanced1[df_balanced1["Product"] != "Other financial service"]
df_balanced1 = df_balanced1[df_balanced1["Product"] != "Prepaid card"]
df_balanced1 = df_balanced1[df_balanced1["Product"] != "Money transfers"]
df_balanced1 = df_balanced1[df_balanced1["Product"] != "Payday loan"]
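
Instead of naming each product by hand, the rare products could be derived from the counts. This is a minimal sketch, assuming a hypothetical helper `drop_rare_products` and a hypothetical threshold `min_complaints`; the toy frame stands in for `df_selected` and does not reflect the real complaint counts.

```python
import pandas as pd

def drop_rare_products(frame, min_complaints, column="Product"):
    """Keep only rows whose product appears at least `min_complaints` times."""
    counts = frame[column].value_counts()
    keep = counts[counts >= min_complaints].index
    return frame[frame[column].isin(keep)]

# Toy data standing in for df_selected (not the real counts).
toy = pd.DataFrame(
    {"Product": ["Mortgage"] * 5 + ["Student loan"] * 3 + ["Virtual currency"]}
)

filtered = drop_rare_products(toy, min_complaints=3)
print(filtered["Product"].value_counts())
```

A threshold makes the cut explicit and survives a refreshed data export, whereas a hard-coded list documents exactly which products were removed.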

**Make a copy, drop the products with the most complaints, then add back a random fraction of each**

In [8]:
df_balanced2 = df_balanced1.copy()
df_balanced2.shape

(480700, 3)

In [9]:
indexNames = df_balanced2[df_balanced2["Product"] == "Mortgage"].index
df_balanced2.drop(indexNames, inplace=True)
indexNames = df_balanced2[df_balanced2["Product"] == "Debt collection"].index
df_balanced2.drop(indexNames, inplace=True)
indexNames = df_balanced2[df_balanced2["Product"] == "Credit reporting, credit repair services, or other personal consumer reports"].index
df_balanced2.drop(indexNames, inplace=True)

In [10]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df_balanced2 = pd.concat([df_balanced2, df_balanced1.loc[df_balanced1["Product"] == 'Mortgage'].sample(frac=0.5)])
df_balanced2 = pd.concat([df_balanced2, df_balanced1.loc[df_balanced1["Product"] == 'Debt collection'].sample(frac=0.4)])
df_balanced2 = pd.concat([df_balanced2, df_balanced1.loc[df_balanced1["Product"] == 'Credit reporting, credit repair services, or other personal consumer reports'].sample(frac=0.3)])
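
The drop-and-add-back steps amount to downsampling the largest classes. One way to express that in a single pass is sketched below, assuming a hypothetical helper `downsample_products` and a fixed `random_state` so the sample is reproducible (the cells above do not set a seed). The toy frame is made up for illustration.

```python
import pandas as pd

def downsample_products(frame, fractions, column="Product", random_state=42):
    """Keep a random fraction of the rows for each product in `fractions`.

    Products that are not listed are kept in full.
    """
    untouched = frame[~frame[column].isin(list(fractions))]
    parts = [untouched]
    for product, frac in fractions.items():
        subset = frame[frame[column] == product]
        parts.append(subset.sample(frac=frac, random_state=random_state))
    return pd.concat(parts)

# Toy data standing in for df_balanced1.
toy = pd.DataFrame(
    {"Product": ["Mortgage"] * 10 + ["Debt collection"] * 10 + ["Student loan"] * 4}
)

balanced = downsample_products(toy, {"Mortgage": 0.5, "Debt collection": 0.4})
print(balanced["Product"].value_counts())
```

Keeping the fractions in one dictionary makes it easy to tweak the balance and rerun without editing several cells.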

**Data cleaning & lemmatization**

In [11]:
import re
import string

# Remove urls
def clean_url(complaint):
    # to do: more regex url garbage matching
    # Raw strings keep escapes such as \w and \d from being parsed as string escapes
    complaint = re.sub(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', '', complaint)
    complaint = re.sub(r'https? ?: ?// ?(?:[-\w.]|(?:%[\da-fA-F]{2}))+', '', complaint)
    return complaint

# Remove punctuation from complaint
def clean_punctuation(complaint):
    complaint = re.sub('[%s]' % re.escape(string.punctuation), '', complaint)
    return complaint

# Remove nonsensical characters from complaint
# (curly quotes and the ellipsis character, which string.punctuation does not cover)
def clean_nonsense(complaint):
    complaint = re.sub('[‘’“”…]', '', complaint)
    complaint = re.sub('\n', '', complaint)
    return complaint

# Remove censored words from complaint
def clean_censored(complaint):
    # Match runs of two or more X; the character class [XXXX] would strip every capital X
    complaint = re.sub(r'X{2,}', '', complaint)
    return complaint

# Turn everything into lowercase
def clean_lowercase(complaint):
    complaint = complaint.lower()
    return complaint

# Remove numbers from complaint
def clean_numbers(complaint):
    complaint = re.sub(r"\d+", '', complaint)
    return complaint

In [12]:
df_balanced2["Consumer complaint narrative"] = df_balanced2["Consumer complaint narrative"].apply(clean_url)
df_balanced2["Consumer complaint narrative"] = df_balanced2["Consumer complaint narrative"].apply(clean_punctuation)
df_balanced2["Consumer complaint narrative"] = df_balanced2["Consumer complaint narrative"].apply(clean_nonsense)
df_balanced2["Consumer complaint narrative"] = df_balanced2["Consumer complaint narrative"].apply(clean_censored)
df_balanced2["Consumer complaint narrative"] = df_balanced2["Consumer complaint narrative"].apply(clean_lowercase)
df_balanced2["Consumer complaint narrative"] = df_balanced2["Consumer complaint narrative"].apply(clean_numbers)
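
The six passes above can also be chained into a single function that is applied once per narrative. This is a minimal sketch, not the notebook's own code: `STEPS` and `clean_complaint` are hypothetical names, the patterns are simplified stand-ins for the functions defined earlier, and returning an empty string for non-string values such as NaN is an assumption about how missing narratives should be handled.

```python
import re
import string

# Cleaning steps in the order the notebook applies them (simplified patterns).
STEPS = [
    lambda s: re.sub(r"https?://\S+", "", s),                         # URLs
    lambda s: re.sub("[%s]" % re.escape(string.punctuation), "", s),  # punctuation
    lambda s: s.replace("\n", ""),                                    # line breaks
    lambda s: re.sub(r"X{2,}", "", s),                                # censored runs
    str.lower,                                                        # lowercase
    lambda s: re.sub(r"\d+", "", s),                                  # digits
]

def clean_complaint(text):
    """Run every step in order; non-string values (e.g. NaN) become ''."""
    if not isinstance(text, str):
        return ""
    for step in STEPS:
        text = step(text)
    return text

sample = "On 01/02/2019 XXXX charged me $35.00! See https://example.com/fee"
print(clean_complaint(sample).split())
```

With pandas this would be used as `df_balanced2["Consumer complaint narrative"].apply(clean_complaint)`, which walks the column once instead of six times.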

In [13]:
print(df_balanced2.shape)
df_balanced2.head()

(287475, 3)


Unnamed: 0,Product,Consumer complaint narrative,Issue
9,"Payday loan, title loan, or personal loan",they would not let me pay my loan off days be...,Problem with the payoff process at the end of ...
12,"Payday loan, title loan, or personal loan",service finance are liars and are charging me ...,Charged fees or interest you didn't expect
21,Credit card or prepaid card,re amex card ending disputes were done in ti...,Fees or interest
27,Checking or savings account,over draft fees due to fraudulent charges subm...,Problem caused by your funds being low
36,Student loan,i took out a loan to go to school total of in...,Dealing with your lender or servicer


In [14]:
# pickle for later use
df_balanced2.to_pickle("corpus_balanced2_cleaned_first_cut_scarce_elimination.pkl")

**Run the lemmatization.py script in PyCharm to lemmatize the balanced & cleaned dataset.**

This will produce a CSV file, <code>corpus_balanced3_cleaned_lemmatized.csv</code>, that can be read into any training notebook using the code below:

<code>my_df = pd.read_csv("corpus_balanced3_cleaned_lemmatized.csv", encoding="utf-8")</code>

In [32]:
my_df = pd.read_csv("corpus_balanced3_cleaned_lemmatized.csv", encoding="utf-8")

In [33]:
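# Count missing values per column (the output below is a per-column count, not a preview of rows)
my_df.isnull().sum()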

Product                              0
Consumer complaint narrative    126485
dtype: int64

**Train in your separate notebook**

Use <code>RandomForest_run_006.ipynb</code> ([notebook](RandomForest_run_006.ipynb)) as an example for structure. Make a copy, rename it and replace what you need.