# How to create an NLP Pipeline: Consumer complaints classification - a sample project

### 1. Introduction

Natural Language Processing (NLP) enables computers to process and interpret human language in a way that is both useful and meaningful. In this project I will <b>take you through the steps of creating an NLP pipeline</b> using real-world, free-text consumer complaints data. I will demonstrate how to preprocess, clean and analyse unstructured text data and finally use a machine learning model to classify the text into predefined complaint categories.


By the end of this project, you will have a good understanding of how to build an NLP pipeline, from raw text data to a machine learning-powered classification model, and you will be able to apply these techniques to similar text-based data problems.

### 2. Data Overview
Consumer complaints are a valuable resource for understanding consumer grievances, issues, sentiments and trends. A typical dataset of this kind 
includes complaints raised by consumers in a free-text format. The goal of this project is to clean and preprocess the text and build a model that classifies each complaint, based on its text, into the product it relates to.


In [1]:
import os, sys
import pandas as pd

current_dir = os.getcwd()
path_to_add = os.path.abspath('consumer_complaints/src/lib/')
if path_to_add not in sys.path:
    sys.path.insert(0, path_to_add)

import data_reader as dr
import data_transformers as dt
import data_analyzer as da
import model_builder as mb

#### 2.1 Data 

In [2]:
fh = dr.FileHandler()
df = fh.read_file(file_path="Consumer_Complaints.csv")

df.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,03/21/2017,Credit reporting,,Incorrect information on credit report,Information is not mine,,Company has responded to the consumer and the ...,EXPERIAN DELAWARE GP,TX,77075,Older American,,Phone,03/21/2017,Closed with non-monetary relief,Yes,No,2397100
1,04/19/2017,Debt collection,"Other (i.e. phone, health club, etc.)",Disclosure verification of debt,Not disclosed as an attempt to collect,,,"Security Credit Services, LLC",IL,60643,,,Web,04/20/2017,Closed with explanation,Yes,No,2441777
2,04/19/2017,Credit card,,Other,,,Company has responded to the consumer and the ...,"CITIBANK, N.A.",IL,62025,,,Referral,04/20/2017,Closed with explanation,Yes,No,2441830
3,04/14/2017,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,Company believes it acted appropriately as aut...,"Shellpoint Partners, LLC",CA,90305,,,Referral,04/14/2017,Closed with explanation,Yes,No,2436165
4,04/19/2017,Credit card,,Credit determination,,,Company has responded to the consumer and the ...,U.S. BANCORP,LA,70571,,,Postal mail,04/21/2017,Closed with explanation,Yes,No,2441726


#### 2.2 Columns in the dataset

In [3]:
df.columns

Index(['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Consumer complaint narrative', 'Company public response', 'Company',
       'State', 'ZIP code', 'Tags', 'Consumer consent provided?',
       'Submitted via', 'Date sent to company', 'Company response to consumer',
       'Timely response?', 'Consumer disputed?', 'Complaint ID'],
      dtype='object')

The dataset contains several columns. However, for this analysis we will focus on the **Product** (type of product/service) 
and **Consumer complaint narrative** (free text) columns. Let us also remove rows with no value in the Consumer complaint narrative column. We also see that there are various types of products. For example:

In [4]:
print("Product count\n")
df = df[~df['Consumer complaint narrative'].isna()]
display(pd.DataFrame(df['Product'].value_counts()))

print("\nFiltered product count\n")
# Remove Products with low count - optional
products_to_remove = ['Payday loan', 'Money transfers', 'Prepaid card', 'Other financial service', 'Virtual currency']
# Filter the DataFrame to remove the specified products
df = df[~df['Product'].isin(products_to_remove)]
display(pd.DataFrame(df['Product'].value_counts()))

Product count



Product,count
Debt collection,38741
Mortgage,32000
Credit reporting,30319
Credit card,18276
Bank account or service,14500
Student loan,10176
Consumer Loan,9029
Payday loan,1695
Money transfers,1437
Prepaid card,1404



Filtered product count



Product,count
Debt collection,38741
Mortgage,32000
Credit reporting,30319
Credit card,18276
Bank account or service,14500
Student loan,10176
Consumer Loan,9029


### 3. Text Cleaning and Preprocessing
Before we dive into using ML models, we need to prepare our data, which is a crucial step in any NLP pipeline.

**Steps involved in text cleaning:**

* Remove stopwords (Recommended): Common words such as 'the', 'and', 'a' that don’t add value to the analysis are removed.
* Remove numbers and masked words (Recommended): We remove numerical digits and any masked words (e.g., ‘xxxx’).
* Convert to lowercase (Recommended): Text is converted to lowercase to ensure uniformity.
* Remove extra spaces (Recommended): Extra white spaces are removed to standardize the text format.
* Remove names (Optional): We use Named Entity Recognition (NER) to remove any personal names from the text.
* Remove punctuation and special characters (Optional): Punctuation marks (commas, periods, etc.) and special characters (like @, #) are removed.
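
The recommended steps above can be sketched in plain Python. This is only a minimal illustration and not the implementation of `dt.CleanText`, whose source is not shown here: the function name `clean_text_sketch`, the tiny `STOPWORDS` set and the regular expressions are all assumptions made for the example. Name removal via NER is left out because it needs a trained language model.

```python
import re

# Tiny illustrative stopword list (an assumption); real pipelines use a fuller list.
STOPWORDS = {"the", "a", "an", "and", "of", "on", "with", "was", "were", "by", "to"}

def clean_text_sketch(text):
    """Apply the recommended cleaning steps to a single complaint narrative."""
    text = text.lower()                      # convert to lowercase
    text = re.sub(r"\bx{2,}\b", " ", text)   # drop masked words such as 'xxxx' or 'xx'
    text = re.sub(r"\d+", " ", text)         # drop numbers
    text = re.sub(r"[^a-z\s]", " ", text)    # drop punctuation and special characters
    words = [w for w in text.split() if w not in STOPWORDS]  # drop stopwords
    return " ".join(words)                   # split/join also collapses extra spaces

print(clean_text_sketch("Necessary documents were submitted by XXXX on XX/XX/2016."))
# necessary documents submitted
```

The order matters in this sketch: lowercasing comes first so that the mask pattern and the stopword lookup only need to handle lowercase text.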

#### 3.1 Clean the column names

In [5]:
ct = dt.CleanText(remove_emails= True, remove_mask = True, remove_names= True, remove_uppercase= True, remove_stopwords= True, 
                                remove_numbers= True, remove_punctuation= True, remove_extra_spaces= True, remove_invalid_two_letter_words= True)
columns = ct.make_colnames_consistent(df.columns)
df.columns = columns
print(df.columns)
# Remove empty rows
df = df[~df['consumer_complaint_narrative'].isna()]
df.reset_index(drop = True, inplace = True)

Index(['date_received', 'product', 'sub_product', 'issue', 'sub_issue',
       'consumer_complaint_narrative', 'company_public_response', 'company',
       'state', 'zip_code', 'tags', 'consumer_consent_provided',
       'submitted_via', 'date_sent_to_company', 'company_response_to_consumer',
       'timely_response', 'consumer_disputed', 'complaint_id'],
      dtype='object')
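
The renaming shown above (for example `Consumer consent provided?` becoming `consumer_consent_provided`) can be reproduced with a short helper. The sketch below is an assumption about the behaviour, inferred from the printed column names; `make_colname_consistent_sketch` is a hypothetical name and not the project's `make_colnames_consistent` method.

```python
import re

def make_colname_consistent_sketch(name):
    """Lowercase a column name and join its alphanumeric parts with underscores."""
    parts = re.findall(r"[a-z0-9]+", name.lower())
    return "_".join(parts)

for raw in ["Sub-product", "ZIP code", "Consumer consent provided?"]:
    print(raw, "->", make_colname_consistent_sketch(raw))
```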


#### 3.2 Clean the text column


In [6]:
%%time
df_clean = ct.clean_pipeline(df = df, column_to_clean='consumer_complaint_narrative', method = 'spacy')

Removing named entities using 'spacy' library ...

Progress: 100%|██████████| 157865/157865 [24:52:18<00:00,  1.76it/s]

Converting uppercase letters

100%|██████████| 157865/157865 [00:14<00:00, 10938.31it/s]

Removing stopwords

100%|██████████| 157865/157865 [00:36<00:00, 4270.47it/s]

Removing emails

100%|██████████| 157865/157865 [00:02<00:00, 59183.92it/s]

Removing numbers

100%|██████████| 157865/157865 [00:02<00:00, 73150.64it/s]

Removing punctuation

100%|██████████| 157865/157865 [00:10<00:00, 14412.30it/s]

Removing invalid two letter words

100%|██████████| 157865/157865 [00:02<00:00, 69760.01it/s]

Removing masked words such 'xxxx'

100%|██████████| 157865/157865 [00:01<00:00, 90401.59it/s]

Removing extra spaces in the text

100%|██████████| 157865/157865 [00:04<00:00, 37329.90it/s]

CPU times: user 5min 42s, sys: 56.2 s, total: 6min 38s
Wall time: 1d 53min 38s


In [6]:
df_clean = fh.read_file(file_path="df_clean.csv")
df_clean = df_clean[~df_clean['product'].isin(products_to_remove)]
df_clean['product'].value_counts()

product
Debt collection            38741
Mortgage                   32000
Credit reporting           30319
Credit card                18276
Bank account or service    14500
Student loan               10176
Consumer Loan               9029
Name: count, dtype: int64

In [7]:
df_clean = df_clean[~df_clean['consumer_complaint_narrative'].isna()]
df_clean.reset_index(drop = True, inplace = True)

#### 3.3 Text column before cleaning

In [8]:
df['consumer_complaint_narrative'][0]

'Started the refinance of home mortgage process with cash out option on XX/XX/2016. Necessary documents were submitted by XXXX. After initial review, got good faith estimate with loan amount and closing cost. Based on this estimate, a deposit of {$350.00} was made towards appraisal. Appraisal came with lesser amount by {$5000.00}. Agreed to reduce the loan amount to that extent. However, got a revised estimate which was less by {$30000.00} and with additional closing cost towards points etc. In between got numerous revised estimates with different loan amounts and closing cost. It took more than 2 months to reach any definite closing document. Hence, want to get back the deposit of {$350.00}.'

#### 3.4 Text column after cleaning

In [9]:
df_clean['consumer_complaint_narrative'][0]

'started refinance home mortgage process cash option necessary documents submitted initial review got good faith estimate loan amount closing cost based estimate deposit made towards appraisal appraisal came lesser amount agreed reduce loan amount extent however got revised estimate less additional closing cost towards points etc got numerous revised estimates different loan amounts closing cost took months reach definite closing document hence want get back deposit'

### 4. Tokenization
Tokenization is the process of splitting text into pieces, usually words or phrases. This is a necessary step in an NLP pipeline, as it converts 
unstructured text into a structured format that machine learning algorithms can process. For example:


In [10]:
tokens = da.tokenize("I am facing issues with my credit card payment.")
# this would output
tokens

['I', 'am', 'facing', 'issues', 'with', 'my', 'credit', 'card', 'payment.']
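
Note that the last token above keeps its trailing period (`'payment.'`), which is what plain whitespace splitting produces. The sketch below contrasts that with a simple regular-expression tokenizer that separates words from punctuation. Both function names are hypothetical, and this is not the implementation of `da.tokenize`.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()

def word_tokenize_sketch(text):
    """Keep only runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", text)

sentence = "I am facing issues with my credit card payment."
print(whitespace_tokenize(sentence))   # last token is 'payment.'
print(word_tokenize_sketch(sentence))  # last token is 'payment'
```

On text that has already been through the cleaning step, the two approaches give the same tokens, because punctuation has been removed beforehand.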

### 5. Text analysis and Feature extraction

Now that we have cleaned and tokenized the text, we can analyze it and extract features that can be used to build a 
classification model. We will use two techniques:
* **N-grams.** Sequences of n consecutive words, such as bigrams and trigrams, that capture meaning better than individual words.
* **TF-IDF.** A statistical measure that evaluates how important a word is to a document within a collection of documents.
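
To make the n-gram idea concrete, here is a minimal sketch that builds bigrams and trigrams from tokenized text and counts them. The three example documents and the `ngrams` helper are made up for illustration; the project's `AnalyzeText` class may work differently.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up, already-cleaned example documents
docs = [
    "credit report shows wrong balance",
    "dispute credit report error",
    "credit card payment late",
]

counts = Counter()
for doc in docs:
    tokens = doc.split()
    counts.update(ngrams(tokens, 2))  # bigrams
    counts.update(ngrams(tokens, 3))  # trigrams

print(counts.most_common(1))  # [('credit report', 2)]
```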

#### 5.1 Look at the top 5 bigrams and trigrams in the text column

In [11]:
at = da.AnalyzeText(text = df_clean['consumer_complaint_narrative'])
at.get_top_n_words(percent= True, ngram_range = (2,3), n = 5)

Unnamed: 0,Words,Percent
0,credit report,0.19
1,credit card,0.11
2,wells fargo,0.09
3,bank america,0.06
4,customer service,0.05


In [23]:
# We can also look at the words or phrases of interest in the text column. For example:
words_list = ['chase', 'credit', 'card', 'payment', 'loan']
at.get_count_for_words(word_list= words_list, percent = True)

Unnamed: 0,Words,Percent
0,chase,0.17
1,credit,1.6
2,card,0.5
3,payment,0.84
4,loan,0.89


#### 5.2 Generate TF-IDF scores for the entire population or corpus of the text

In [24]:
scores_df_all, _ = at.get_tfidf_scores(label = 'population', ngram_range= (3,4), max_features=100)
scores_df_all

Unnamed: 0,features,scores_population,std_population
30,credit reporting agencies,0.013975,0.107593
89,social security number,0.013381,0.107290
76,removed credit report,0.011339,0.098035
21,credit card account,0.009656,0.092310
92,victim identity theft,0.009006,0.086338
...,...,...,...
82,required promptly delete information,0.001326,0.021221
84,section fcra required,0.001229,0.020590
49,fcra required promptly,0.001219,0.020209
50,fcra required promptly delete,0.001215,0.020161
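
The table above reports a mean score and a standard deviation per feature across all documents. The sketch below shows how such numbers can be derived, using the textbook definition `tf * log(N / df)` on single words. It is an illustration under stated assumptions: library implementations typically add smoothing and normalization, so the values will not match those produced by `get_tfidf_scores`, and the documents and vocabulary here are made up.

```python
import math
from statistics import mean, pstdev

def tfidf_matrix(docs, vocabulary):
    """Return one {term: tf-idf} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: number of documents that contain each term
    df = {t: sum(t in toks for toks in tokenized) for t in vocabulary}
    rows = []
    for toks in tokenized:
        row = {}
        for t in vocabulary:
            tf = toks.count(t) / (len(toks) or 1)           # relative term frequency
            idf = math.log(n_docs / df[t]) if df[t] else 0.0  # rarer terms weigh more
            row[t] = tf * idf
        rows.append(row)
    return rows

docs = ["credit report error", "credit card fee", "mortgage payment late"]
vocab = ["credit", "mortgage"]
rows = tfidf_matrix(docs, vocab)

# Aggregate per feature across documents: mean score and standard deviation
for term in vocab:
    scores = [row[term] for row in rows]
    print(term, round(mean(scores), 4), round(pstdev(scores), 4))
```

Because most documents do not contain a given n-gram, the per-document scores are mostly zero, which is why the standard deviations in the table are much larger than the means.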


#### 5.3 Compare TF-IDF scores for each product's population to the scores of the entire population

In [25]:
for product in df_clean['product'].unique():
    label_df = df_clean[df_clean['product'] == product]
    scores_df_label, _ = at.get_tfidf_scores(text_column= label_df['consumer_complaint_narrative'], label = product, ngram_range= (3,4), max_features= 100)
    scores_df_merged = pd.merge(scores_df_all, scores_df_label, on = 'features')
    scores_df_merged['p_value'] = at.get_p_value(mean_group= scores_df_merged['scores_' + product], std_group= scores_df_merged['std_' + product], 
                              mean_pop= scores_df_merged['scores_population'], std_pop= scores_df_merged['std_population'] )
    print("==== Product: {} ====".format(product))
    display(scores_df_merged[1:10].sort_values(by = 'p_value', ascending = True))
    #display(scores_df_merged[1:10].style.apply(lambda x: highlight_cells(s = x, threshold= 0.7, operator= '<='),subset = ['p_value']))

==== Product: Mortgage ====


Unnamed: 0,features,scores_population,std_population,scores_Mortgage,std_Mortgage,p_value
1,social security number,0.013381,0.10729,0.00403,0.059951,0.6784
9,never missed payment,0.004411,0.064008,0.008175,0.084448,0.8464
3,wells fargo bank,0.005761,0.072822,0.009367,0.088107,0.8634
8,received letter stating,0.004585,0.064118,0.007523,0.078422,0.8743
7,consumer financial protection,0.005092,0.054681,0.007287,0.062446,0.8853
2,called customer service,0.006409,0.075643,0.00444,0.060651,0.9118
5,customer service representative,0.005423,0.069235,0.004916,0.064048,0.9766
4,days past due,0.005454,0.068935,0.005133,0.066542,0.9854
6,received phone call,0.005369,0.068994,0.005147,0.06554,0.9899


==== Product: Bank account or service ====


Unnamed: 0,features,scores_population,std_population,scores_Bank account or service,std_Bank account or service,p_value
4,wells fargo bank,0.005761,0.072822,0.021195,0.131046,0.6741
3,called customer service,0.006409,0.075643,0.014807,0.102834,0.7879
5,customer service representative,0.005423,0.069235,0.008422,0.079784,0.9075
7,customer service rep,0.004309,0.062441,0.006374,0.072253,0.9295
2,victim identity theft,0.009006,0.086338,0.006907,0.077261,0.9409
1,credit card account,0.009656,0.09231,0.008147,0.083344,0.9604
6,consumer financial protection,0.005092,0.054681,0.005336,0.056474,0.9899
8,financial protection bureau,0.004209,0.047125,0.004027,0.045625,0.9909
9,consumer financial protection bureau,0.004101,0.046095,0.00395,0.04494,0.9923


==== Product: Student loan ====


Unnamed: 0,features,scores_population,std_population,scores_Student loan,std_Student loan,p_value
1,social security number,0.013381,0.10729,0.005459,0.067941,0.7764
2,removed credit report,0.011339,0.098035,0.005482,0.070128,0.8249
3,fair credit reporting,0.007526,0.059535,0.003722,0.05627,0.8326
5,days past due,0.005454,0.068935,0.010406,0.091223,0.8437
9,never missed payment,0.004411,0.064008,0.007808,0.079081,0.8792
4,called customer service,0.006409,0.075643,0.004128,0.057869,0.9132
6,customer service representative,0.005423,0.069235,0.007655,0.07691,0.9217
8,consumer financial protection,0.005092,0.054681,0.003874,0.046459,0.9384
7,received phone call,0.005369,0.068994,0.004134,0.060082,0.951


==== Product: Credit reporting ====


Unnamed: 0,features,scores_population,std_population,scores_Credit reporting,std_Credit reporting,p_value
5,fair credit reporting,0.007526,0.059535,0.020763,0.096212,0.4451
7,fair credit reporting act,0.00703,0.056419,0.019397,0.091094,0.4512
6,credit reporting act,0.007279,0.058218,0.019887,0.093244,0.4541
4,information credit report,0.008124,0.080343,0.026366,0.141829,0.4651
9,credit reporting agency,0.006395,0.072182,0.022181,0.133469,0.497
2,removed credit report,0.011339,0.098035,0.027301,0.15137,0.5632
3,victim identity theft,0.009006,0.086338,0.022661,0.137762,0.5833
8,account credit report,0.006523,0.073887,0.014558,0.110118,0.6921
1,social security number,0.013381,0.10729,0.022524,0.136502,0.7307


==== Product: Credit card ====


Unnamed: 0,features,scores_population,std_population,scores_Credit card,std_Credit card,p_value
5,credit card company,0.008462,0.08812,0.039355,0.173231,0.3956
3,credit card account,0.009656,0.09231,0.038432,0.161177,0.4076
2,removed credit report,0.011339,0.098035,0.00596,0.070133,0.811
6,fair credit reporting,0.007526,0.059535,0.004362,0.048132,0.8246
7,credit reporting act,0.007279,0.058218,0.004267,0.047502,0.8299
8,fair credit reporting act,0.00703,0.056419,0.004116,0.045911,0.83
4,victim identity theft,0.009006,0.086338,0.006592,0.072909,0.9088
1,social security number,0.013381,0.10729,0.010836,0.095011,0.9242
9,account credit report,0.006523,0.073887,0.004882,0.063943,0.9283


==== Product: Debt collection ====


Unnamed: 0,features,scores_population,std_population,scores_Debt collection,std_Debt collection,p_value
3,credit card account,0.009656,0.09231,0.003738,0.057056,0.7247
5,credit card company,0.008462,0.08812,0.004579,0.064346,0.8182
1,social security number,0.013381,0.10729,0.018652,0.125964,0.837
6,information credit report,0.008124,0.080343,0.005852,0.068645,0.8895
4,victim identity theft,0.009006,0.086338,0.011808,0.099597,0.8908
2,removed credit report,0.011339,0.098035,0.014146,0.105194,0.8996
8,credit reporting act,0.007279,0.058218,0.006523,0.052569,0.9503
7,fair credit reporting,0.007526,0.059535,0.006793,0.054325,0.9531
9,fair credit reporting act,0.00703,0.056419,0.006361,0.051493,0.9549


==== Product: Consumer Loan ====


Unnamed: 0,features,scores_population,std_population,scores_Consumer Loan,std_Consumer Loan,p_value
2,removed credit report,0.011339,0.098035,0.006435,0.073015,0.7955
1,social security number,0.013381,0.10729,0.007908,0.085186,0.7964
6,fair credit reporting act,0.00703,0.056419,0.004173,0.045328,0.7987
4,fair credit reporting,0.007526,0.059535,0.004557,0.049084,0.8036
9,credit reporting agency,0.006395,0.072182,0.003345,0.053968,0.827
7,account credit report,0.006523,0.073887,0.003525,0.055694,0.8342
5,credit reporting act,0.007279,0.058218,0.005006,0.052539,0.8514
3,information credit report,0.008124,0.080343,0.005163,0.066357,0.8544
8,called customer service,0.006409,0.075643,0.005144,0.065901,0.9351
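
The `p_value` column indicates how strongly a product's mean score for a feature differs from the population's mean score; smaller values suggest a feature that is more distinctive for that product. The formula behind `at.get_p_value` is not shown in this notebook, so the sketch below is only one plausible way to obtain a p-value from summary statistics: a two-sided z-test on the difference of two means. The sample sizes `n_group` and `n_pop` are assumptions introduced for the example (the project's helper takes only means and standard deviations), so results will generally differ from the tables above.

```python
import math

def two_sided_p_value(mean_group, std_group, n_group, mean_pop, std_pop, n_pop):
    """Two-sided z-test p-value for a difference in means from summary statistics."""
    standard_error = math.sqrt(std_group ** 2 / n_group + std_pop ** 2 / n_pop)
    if standard_error == 0:
        return 1.0 if mean_group == mean_pop else 0.0
    z = (mean_group - mean_pop) / standard_error
    # P(|Z| >= |z|) for a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))

# Made-up means, standard deviations and sample sizes, for illustration only
print(two_sided_p_value(0.020, 0.10, 400, 0.008, 0.09, 4000))
```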


### 6. Train an ML model

In [16]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
'''
from sklearn.base import BaseEstimator
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error
'''
tm = mb.trainModel(model_type= 'xgb', type = 'classifier', params= {'n_estimators': 20})
_,X = at.get_tfidf_scores(text_column=df_clean['consumer_complaint_narrative'], ngram_range = (1,2), max_features = 10000)
y = df_clean['product']

#### 6.1 Grid Search. Hyperparameter tuning

In [17]:
test_size = 0.2
random_state = 10
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = test_size, random_state = random_state)

param_grid = {
        'n_estimators':[200, 300],
        'learning_rate': [0.2, 0.3, 0.4], 
        'max_depth': [3,5]
        }

tm.grid_search(X= X_train, y = y_train, cv = 2, param_grid = param_grid, scoring= 'accuracy')

Fitting 2 folds for each of 12 candidates, totalling 24 fits
[CV 2/2] END learning_rate=0.2, max_depth=3, n_estimators=200;, score=0.854 total time= 7.5min
[CV 2/2] END learning_rate=0.2, max_depth=5, n_estimators=200;, score=0.863 total time=14.7min
[CV 1/2] END learning_rate=0.3, max_depth=3, n_estimators=200;, score=0.861 total time= 7.0min
[CV 1/2] END learning_rate=0.3, max_depth=3, n_estimators=300;, score=0.865 total time=10.2min
[CV 1/2] END learning_rate=0.3, max_depth=5, n_estimators=300;, score=0.869 total time=20.0min
[CV 2/2] END learning_rate=0.4, max_depth=5, n_estimators=200;, score=0.866 total time=13.3min
[CV 1/2] END learning_rate=0.2, max_depth=3, n_estimators=200;, score=0.855 total time= 7.5min
[CV 1/2] END learning_rate=0.2, max_depth=5, n_estimators=200;, score=0.863 total time=14.7min
[CV 2/2] END learning_rate=0.3, max_depth=3, n_estimators=200;, score=0.860 total time= 7.0min
[CV 2/2] END learning_rate=0.3, max_depth=3, n_estimators=300;, score=0.863 total ti

({'learning_rate': 0.4, 'max_depth': 5, 'n_estimators': 300},
 np.float64(0.8690762068120559))
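
The log line "Fitting 2 folds for each of 12 candidates, totalling 24 fits" follows directly from the parameter grid: every combination of values is one candidate, and each candidate is fitted once per fold. The sketch below enumerates the combinations with the standard library; it only mirrors the counting and does not train anything.

```python
from itertools import product

param_grid = {
    "n_estimators": [200, 300],
    "learning_rate": [0.2, 0.3, 0.4],
    "max_depth": [3, 5],
}
cv_folds = 2

# Every combination of one value per parameter is a candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))             # 12 candidates
print(len(candidates) * cv_folds)  # 24 fits
```

Since the fit times above range from about 7 to 20 minutes each, adding one more value to any parameter list increases the total search time noticeably.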

In [37]:
import sklearn
print(sklearn.__version__)


1.5.2


In [18]:
pred = tm.predict(X = X_test)
print(tm.score(X_test, y_test))
from sklearn.metrics import confusion_matrix
confusion_matrix(y_pred=pred, y_true= y_test)

0.8752613695765813


array([[2406,   44,  209,   31,   85,   85,    8],
       [  52, 1221,   73,   90,  245,   64,   22],
       [ 182,   20, 3056,  142,  234,   21,    4],
       [  28,   64,   92, 5299,  422,   68,   21],
       [  47,  119,  168,  380, 6844,  130,   95],
       [  74,   61,   24,   76,  108, 6156,    9],
       [   6,   31,    9,   24,  126,   25, 1808]])
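
The accuracy printed above can be recomputed from the confusion matrix, and the same matrix also yields per-class precision and recall. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels. The sketch below copies the matrix from the output; the class indices follow the order assigned by the `LabelEncoder`, which `target_encoder.classes_` would map back to product names.

```python
confusion = [
    [2406,   44,  209,   31,   85,   85,    8],
    [  52, 1221,   73,   90,  245,   64,   22],
    [ 182,   20, 3056,  142,  234,   21,    4],
    [  28,   64,   92, 5299,  422,   68,   21],
    [  47,  119,  168,  380, 6844,  130,   95],
    [  74,   61,   24,   76,  108, 6156,    9],
    [   6,   31,    9,   24,  126,   25, 1808],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))  # diagonal entries
accuracy = correct / total
print(round(accuracy, 4))  # 0.8753

for i, row in enumerate(confusion):
    recall = row[i] / sum(row)                         # share of true class i that was found
    precision = row[i] / sum(r[i] for r in confusion)  # share of class-i predictions that are right
    print(i, round(precision, 3), round(recall, 3))
```

Per-class figures are worth checking here because the classes are imbalanced, so overall accuracy can hide weaker performance on the smaller classes.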