# Basic Preprocessing for NLP
> A few simple functions to prepare raw text for NLP
- toc: false
- comments: true

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [0]:
import os
import glob
from io import open
import pandas as pd
import numpy as np
import re

In [3]:
import nltk 
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

Mount your Google Drive using the 'google.colab' library. Once mounted, the contents of the drive are available under the folder '/content/gdrive/My Drive'. We store our CSV data files in the 'data/' folder inside the drive.

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
data_location = '/content/gdrive/My Drive/data/consumer_complaints.csv'

In [0]:
def find_files(path):
  return glob.glob(path)

In [0]:
data_file_list = find_files(data_location)

In [8]:
for file in data_file_list:
  print(file)

/content/gdrive/My Drive/data/consumer_complaints.csv
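
Because `find_files` wraps `glob.glob`, it should also accept wildcard patterns, which helps when the data is split across several files. A minimal sketch, using a temporary folder and made-up file names as stand-ins for the 'data/' folder:

```python
import glob
import os
import tempfile

def find_files(path):
    return glob.glob(path)

# Hypothetical folder contents, created only for this illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(tmp_dir, name), "w").close()

    # '*.csv' matches every CSV file in the folder; glob does not
    # guarantee any ordering, so sort the result explicitly
    csv_files = sorted(find_files(os.path.join(tmp_dir, "*.csv")))
    csv_names = [os.path.basename(f) for f in csv_files]

print(csv_names)
```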


Read the CSV file into a pandas DataFrame object.

In [9]:
df = pd.read_csv(data_location)
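
If only a few columns are needed, `read_csv` can be told which columns to parse and which dtype to use. The sketch below is an illustration, not part of the original notebook: it reads a small in-memory CSV with made-up rows in place of the real file, and keeps every value as text so that entries such as '121XX' and '95993' end up with the same type.

```python
import io
import pandas as pd

# Hypothetical stand-in for consumer_complaints.csv
csv_text = io.StringIO(
    "product,consumer_complaint_narrative,zipcode\n"
    "Mortgage,,95993\n"
    "Debt collection,I already paid this bill,121XX\n"
)

# usecols limits the columns that are parsed; dtype=str keeps the
# values as text. Empty fields are still read as missing values.
small_df = pd.read_csv(
    csv_text,
    usecols=["product", "consumer_complaint_narrative", "zipcode"],
    dtype=str,
)
print(small_df)
```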



Let's get an overview of our data.

In [10]:
df.head()

Unnamed: 0,date_received,product,sub_product,issue,sub_issue,consumer_complaint_narrative,company_public_response,company,state,zipcode,tags,consumer_consent_provided,submitted_via,date_sent_to_company,company_response_to_consumer,timely_response,consumer_disputed?,complaint_id
0,08/30/2013,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,U.S. Bancorp,CA,95993,,,Referral,09/03/2013,Closed with explanation,Yes,Yes,511074
1,08/30/2013,Mortgage,Other mortgage,"Loan servicing, payments, escrow account",,,,Wells Fargo & Company,CA,91104,,,Referral,09/03/2013,Closed with explanation,Yes,Yes,511080
2,08/30/2013,Credit reporting,,Incorrect information on credit report,Account status,,,Wells Fargo & Company,NY,11764,,,Postal mail,09/18/2013,Closed with explanation,Yes,No,510473
3,08/30/2013,Student loan,Non-federal student loan,Repaying your loan,Repaying your loan,,,"Navient Solutions, Inc.",MD,21402,,,Email,08/30/2013,Closed with explanation,Yes,Yes,510326
4,08/30/2013,Debt collection,Credit card,False statements or representation,Attempted to collect wrong amount,,,Resurgent Capital Services L.P.,GA,30106,,,Web,08/30/2013,Closed with explanation,Yes,Yes,511067


We will use the 'consumer_complaint_narrative' column. For our current purpose, we do not need rows that have no text in that column, so drop all the rows where the 'consumer_complaint_narrative' value is missing.

In [0]:
cust_complaint_df = df[df['consumer_complaint_narrative'].notnull()]
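
To see what this filter does, here is a small sketch on a made-up DataFrame (the rows are hypothetical). It also shows that `dropna` with `subset` gives the same result, and that the surviving rows keep their original index labels.

```python
import pandas as pd

# Toy data: only the middle row has a narrative
toy = pd.DataFrame({
    "product": ["Mortgage", "Debt collection", "Student loan"],
    "consumer_complaint_narrative": [None, "I already paid this bill", None],
})

# Boolean mask, as used in the notebook
with_text = toy[toy["consumer_complaint_narrative"].notnull()]

# Equivalent formulation with dropna
same_result = toy.dropna(subset=["consumer_complaint_narrative"])

print(with_text)
print(with_text.equals(same_result))
```

The kept row still carries index label 1, which is why the filtered complaints frame starts at a large index value until the index is reset.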

In [12]:
cust_complaint_df.head()

Unnamed: 0,date_received,product,sub_product,issue,sub_issue,consumer_complaint_narrative,company_public_response,company,state,zipcode,tags,consumer_consent_provided,submitted_via,date_sent_to_company,company_response_to_consumer,timely_response,consumer_disputed?,complaint_id
190126,03/19/2015,Debt collection,"Other (i.e. phone, health club, etc.)",Cont'd attempts collect debt not owed,Debt was paid,XXXX has claimed I owe them {$27.00} for XXXX ...,,"Diversified Consultants, Inc.",NY,121XX,Older American,Consent provided,Web,03/19/2015,Closed with explanation,Yes,No,1290516
190135,03/19/2015,Consumer Loan,Vehicle loan,Managing the loan or lease,,Due to inconsistencies in the amount owed that...,,M&T Bank Corporation,VA,221XX,Servicemember,Consent provided,Web,03/19/2015,Closed with explanation,Yes,No,1290492
190155,03/19/2015,Mortgage,Conventional fixed mortgage,"Loan modification,collection,foreclosure",,In XX/XX/XXXX my wages that I earned at my job...,,Wells Fargo & Company,CA,946XX,,Consent provided,Web,03/19/2015,Closed with explanation,Yes,Yes,1290524
190207,03/19/2015,Mortgage,Conventional fixed mortgage,"Loan servicing, payments, escrow account",,I have an open and current mortgage with Chase...,,JPMorgan Chase & Co.,CA,900XX,Older American,Consent provided,Web,03/19/2015,Closed with explanation,Yes,Yes,1290253
190208,03/19/2015,Mortgage,Conventional fixed mortgage,Credit decision / Underwriting,,XXXX was submitted XX/XX/XXXX. At the time I s...,,Rushmore Loan Management Services LLC,CA,956XX,Older American,Consent provided,Web,03/19/2015,Closed with explanation,Yes,Yes,1292137


Let's view the quality of the text in the 'consumer_complaint_narrative' column of the data frame.

In [13]:
sample = cust_complaint_df[cust_complaint_df.index == 190126][['consumer_complaint_narrative']].values[0]
print(sample)

['XXXX has claimed I owe them {$27.00} for XXXX years despite the PROOF of PAYMENT I sent them : canceled check and their ownPAID INVOICE for {$27.00}! \nThey continue to insist I owe them and collection agencies are after me. \nHow can I stop this harassment for a bill I already paid four years ago? \n']


Now, define your pre-processing function. We will perform the following actions:
1. Convert all upper case letters to lower case
2. Replace the following characters with spaces ['/', '(', ')', '{', '}', '[',']','|', '@', ',', ';']
3. Remove every character that is not a digit ('0-9'), a lower case letter ('a-z'), a space, '#', '+' or '_'
4. Remove the masking character 'X' from the text
5. Remove all stop words
6. Remove all numbers

In [0]:
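from nltk.corpus import stopwords

# Raw strings avoid invalid-escape warnings for patterns like '\['
# Characters that are replaced with a space
CONVERT_TO_SPACE_REGEX = re.compile(r'[/(){}\[\]\|@,;]')
# Negated class: anything that is NOT a digit, lower case letter, space, '#', '+' or '_'
BAD_CHARACTERS_REGEX = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_pre_processor(text):
  text = text.lower()                           # 1. lower case
  text = CONVERT_TO_SPACE_REGEX.sub(' ', text)  # 2. separators become spaces
  text = BAD_CHARACTERS_REGEX.sub('', text)     # 3. drop all other symbols
  # 4. drop the masking character; note this also strips 'x' inside ordinary words
  text = text.replace('x', '')
  # 5. drop stop words
  text = ' '.join(word for word in text.split() if word not in STOPWORDS)

  return text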
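
The steps can be traced one at a time on the start of the sample complaint. This is only a sketch: `TOY_STOPWORDS` is a hypothetical four-word list used in place of NLTK's English stop words so that the example runs without any downloads, and step 6 is applied here with `re.sub` on a single string.

```python
import re

CONVERT_TO_SPACE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_CHARACTERS = re.compile(r'[^0-9a-z #+_]')
# Hypothetical mini stop word list (the notebook uses NLTK's list)
TOY_STOPWORDS = {"has", "i", "them", "for"}

raw = "XXXX has claimed I owe them {$27.00} for XXXX years"

step1 = raw.lower()                     # lower case
step2 = CONVERT_TO_SPACE.sub(' ', step1)  # '{' and '}' become spaces
step3 = BAD_CHARACTERS.sub('', step2)     # '$' and '.' are dropped
step4 = step3.replace('x', '')            # masking character removed
step5 = ' '.join(w for w in step4.split() if w not in TOY_STOPWORDS)
step6 = re.sub(r'\d+', '', step5)         # digits removed

for label, value in [("3", step3), ("5", step5), ("6", step6)]:
    print(label, repr(value))
```

Step 3 turns '$27.00' into '2700', and removing those digits in step 6 leaves a double space behind ('claimed owe  years'). Because step 4 removes every 'x', a word such as 'tax' would come out as 'ta'; a word-boundary pattern such as `\bx+\b` is one possible alternative if that matters for your data.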

In [0]:
cust_complaint_df = cust_complaint_df.reset_index(drop=True)
cust_complaint_df['consumer_complaint_narrative'] = cust_complaint_df['consumer_complaint_narrative'].apply(text_pre_processor)
cust_complaint_df['consumer_complaint_narrative'] = cust_complaint_df['consumer_complaint_narrative'].str.replace(r'\d+', '', regex=True)
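
The digit removal depends on the pattern being treated as a regular expression. In pandas 2.0 and later, `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed. The sketch below uses a made-up Series to show the regex form, plus an optional extra step (not in the original notebook) that collapses the double spaces left behind.

```python
import pandas as pd

# Toy data standing in for the pre-processed narratives
toy = pd.Series(["claimed owe 2700 years", "invoice 27 continue"])

# Remove every run of digits
no_digits = toy.str.replace(r'\d+', '', regex=True)

# Optional: collapse repeated whitespace and trim the ends
tidy = no_digits.str.replace(r'\s+', ' ', regex=True).str.strip()

print(no_digits.tolist())
print(tidy.tolist())
```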

Let's view the quality of text after applying the pre-processing to each text datum in the 'consumer_complaint_narrative' column

In [16]:
cust_complaint_df.head()

Unnamed: 0,date_received,product,sub_product,issue,sub_issue,consumer_complaint_narrative,company_public_response,company,state,zipcode,tags,consumer_consent_provided,submitted_via,date_sent_to_company,company_response_to_consumer,timely_response,consumer_disputed?,complaint_id
0,03/19/2015,Debt collection,"Other (i.e. phone, health club, etc.)",Cont'd attempts collect debt not owed,Debt was paid,claimed owe years despite proof payment sent ...,,"Diversified Consultants, Inc.",NY,121XX,Older American,Consent provided,Web,03/19/2015,Closed with explanation,Yes,No,1290516
1,03/19/2015,Consumer Loan,Vehicle loan,Managing the loan or lease,,due inconsistencies amount owed told bank amou...,,M&T Bank Corporation,VA,221XX,Servicemember,Consent provided,Web,03/19/2015,Closed with explanation,Yes,No,1290492
2,03/19/2015,Mortgage,Conventional fixed mortgage,"Loan modification,collection,foreclosure",,wages earned job decreased almost half knew tr...,,Wells Fargo & Company,CA,946XX,,Consent provided,Web,03/19/2015,Closed with explanation,Yes,Yes,1290524
3,03/19/2015,Mortgage,Conventional fixed mortgage,"Loan servicing, payments, escrow account",,open current mortgage chase bank # chase repor...,,JPMorgan Chase & Co.,CA,900XX,Older American,Consent provided,Web,03/19/2015,Closed with explanation,Yes,Yes,1290253
4,03/19/2015,Mortgage,Conventional fixed mortgage,Credit decision / Underwriting,,submitted time submitted complaint dealt rushm...,,Rushmore Loan Management Services LLC,CA,956XX,Older American,Consent provided,Web,03/19/2015,Closed with explanation,Yes,Yes,1292137


Let's view the same sample complaint after pre-processing

In [17]:
sample = cust_complaint_df[cust_complaint_df.index == 0][['consumer_complaint_narrative']].values[0]
print(sample)

['claimed owe  years despite proof payment sent canceled check ownpaid invoice  continue insist owe collection agencies stop harassment bill already paid four years ago']
