# Introduction to Natural Language Processing

## Description

Natural Language Processing, or NLP as it is usually called, refers to working with (or processing) **text data**, either for machine learning or for any of the many other use cases that textual data supports. Working with text is very different from working with numerical or categorical data. We have worked extensively with numerical, categorical and boolean data; text, however, is a different paradigm altogether. This tutorial aims to get you acquainted with the basics of working with text and with what it implies for machine learning.

## Overview

- Introduction to the problem statement **Consumer Complaints Database**
- What is NLP (Introduction and usecases)
- Tokenization and Introduction to NLTK
- Vectorization and vector space models **Count Vectorizer**
- Applying our first classification algorithm **Logistic Regression**
- Stopwords
- Basic Stemming 
- TFIDF
- Naive Bayes Classifier
- Linear kernel SVM
- Text Classification (Build a text classifier using NLTK)


## Pre-requisite

- Python (along with NumPy and pandas libraries)
- Basic statistics (knowledge of central tendency)


## Learning Outcomes

- Understanding why working with text data isn't like numerical or categorical data
- What is NLP
- The basic building blocks of text
- Tokenization, stemming and what constitutes a stopword
- Preliminary cleaning of text data 

## Chapter 1: Introduction to text data

### Description: 
Up to this point, all of our problem statements had data in a numerical, categorical or Boolean format. In the real world we will very often encounter text data as well. We will now try to understand how text analytics can be used to solve problems where the data is text.

### 1.1 Introduction to the problem statement: <font color='green'> Categorize complaints into categories</font>

**What is the problem?**
#### The dataset is a consumer complaints database where every complaint needs to be categorized into one of the pre-defined product categories. This is a multi-class classification problem.

Each row in the dataset describes a single complaint. It includes the complaint narrative, the issue, the category of the complaint, the date it was received, the zip code and details of the customer placing the complaint, and the current status of the complaint. The final idea is to build a model that will categorize each customer's complaint into a product category. You can download the dataset below.

We will work with the csv file. 

https://catalog.data.gov/dataset/consumer-complaint-database

However, for the purpose of understanding how text processing works, we will work with only 2 columns of this dataset. Adding more features could well make the model more accurate and robust, but initially we will use just 2 columns: the consumer complaint narrative and the product the complaint has to be categorized into.

- Consumer Complaint Narrative
- Product

The product categories the complaints need to be categorized into are:

['Mortgage', 'Student loan', 'Credit card or prepaid card', 'Credit card', 'Debt collection', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Bank account or service', 'Consumer Loan', 'Money transfers', 'Vehicle loan or lease', 'Money transfer, virtual currency, or money service', 'Checking or savings account', 'Payday loan', 'Payday loan, title loan, or personal loan', 'Other financial service', 'Prepaid card']

**Brief explanation of the dataset & features**

* `Consumer Complaint Narrative`: A paragraph (or text) written by the customer explaining the complaint in detail. It is not numerical or categorical; the data is a string consisting of text in the form of paragraphs.

* `Product`: The category we are to classify each complaint into.

**What do we want as the outcome?**

Using a classification algorithm, classify each complaint into its respective category.

#### Why work with text

***

**Intuition for text**

Let's start with the information we have. The main goal is to build a machine learning model which can predict the category of a complaint based on the customer's written data. The written data here is a paragraph made up of sentences (natural language). How do we convert this text to a form fit for machine learning? The usual ways of working with numerical or categorical data will not work here: the data type is completely different, and the algorithm has to make sense of free-form writing rather than a single numerical or categorical variable.


If you have a look at the Consumer Complaint Narrative columns, the values are paragraphs! Not numbers or categories. This needs to be pre-processed before running an algorithm onto this. 

**Why NLP for this data**

These complaints in the Narratives columns are typical examples of text data. Normal paragraphs, sentences in the form of text. The column Consumer complaint narrative has all rows in the form of either NaNs or text data. How do we make sense of this data?

Do we convert this to categorical by one-hot encoding the text? If yes, how do we do it?

How are we supposed to convert this text data to a numerical format to make sure the Machine Learning Algorithm can be applied to this?

Can this column be used in a multi-class classification model to predict the class of the complaint?

All these questions (and others) can be answered through a particular branch of ML. Enter Natural Language Processing.
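To see why one-hot encoding the raw narratives does not help, here is a minimal sketch using three made-up complaints (the texts are illustrative and not taken from the dataset). Because almost every narrative is a unique string, treating each one as a category yields one column per row, and no two rows share a feature.

```python
# Three made-up complaint narratives (illustrative only)
narratives = [
    "My loan was switched to another servicer without notice.",
    "I was charged a late fee on my credit card by mistake.",
    "The bank closed my account and never told me why.",
]

# One-hot encoding treats every distinct string as its own category
categories = sorted(set(narratives))
one_hot = [[1 if text == cat else 0 for cat in categories] for text in narratives]

print(len(categories))  # as many categories as there are complaints
for row in one_hot:
    print(row)          # each row has a single 1, in a column no other row uses
```

An encoding like this carries no information that a model could generalize from, which is why the text has to be broken into smaller units first.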

### Have a look at the data set 

In this task you will load Consumer_complaints.csv into a dataframe using pandas and explore the column Consumer Complaint Narrative.

We will see at the end of this exercise that **each cell value of the consumer complaint narrative column is a paragraph!**



### Instructions
- Load the csv into a dataframe
- Drop all columns except the Product and the Consumer Complaints Narratives. Make sure to keep a copy of the original dataframe in a different instance. 
- Print out the first 5 instances of our 2 column dataframe. Name it df. 
- Rename the column Consumer complaint narrative to X and Product to y. (We intend to classify the complaints in the narrative (X) into the categories listed in the Product, or y, column.)
- Print out the first non-empty value of the X column.
- Have a look at the various categories that exist in the Product column (renamed to y).

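In [1]:
import os
# Raw string so the backslash in the Windows path is not treated as an escape
os.chdir(r'C:\Learning')

In [27]:
import pandas as pd
# The data is read from a pickle here; the downloaded CSV can be loaded with pd.read_csv instead
df = pd.read_pickle("Consumer_complaints.pkl")
# Keep a copy of the first 1000 rows of the original dataframe
df_copy = df.head(1000).copy()
# Keep only the relevant columns (.copy() avoids modifying a view of df_copy)
df = df_copy[["Consumer complaint narrative", "Product"]].copy()
df.columns = ["X","y"]
df.head()
#Printing out the first non-empty value of the X column. Hence the second value, index is 1
print(df["X"][1]) 
print ("\n")
print (list(df["y"].unique()))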

When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.


['Mortgage', 'Student loan', 'Credit card or prepaid card', 'Credit card', 'Debt collection', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Bank account or service', 'Consumer Loan', 'Money transfers', 'Vehicle loan or lease', 'Money transfer, virtual c

## Chapter 2: What is NLP? (Introduction and Usecases)

### 2.1 Introduction

***

The "Natural Language" part of NLP is called that because it is the language that exists all around us. 
NLP can broadly be defined as cleaning text and getting it into a form fit for machine learning. That's all it really is. Of course, you can derive insights from the text as well, much like an EDA exercise, which also aims to bring the data to a more ML-friendly form. There are other offshoots of natural language work, such as Natural Language Generation (NLG), which aims to generate new text based on prior data, and Natural Language Understanding (NLU), which focuses on recognizing the intent of a conversation and underpins many chatbots. For the sake of brevity, this tutorial will stick to NLP.

#### Why is it difficult to work with text?

***

Comprehending language is hard for computers, for several reasons.

- Different ways of saying the same thing. This is one of the prime reasons why computers have a hard time deciphering the meaning or intent of a statement. "I like the rains of Mumbai" and "Mumbai is beautiful during the monsoons and that's why I like it" express basically the same sentiment. However, since they are 2 completely different sentences syntactically, computers have a hard time figuring out the intent of the user.

- Ambiguity. "The shop is by the road" and "The shop was found to be closed by him" use the word "by" with completely different meanings. In the first case it indicates proximity; in the second it refers to the person who performed the action.

- Context. "Anil is my friend. He likes football." In the second sentence "he" refers to Anil. Computers are not inherently able to store the context of the first sentence and use it to decipher the second.

- Understanding language. Everything is just numbers to a machine, so a statement in human language is just a sequence of numbers to a computer. Rudimentary chatbots work because they detect keywords in your statement. As long as the keywords remain the same, you could use any other words in your statement and the end result would remain the same.

- Every language has its own peculiarities. English text is divided into words, sentences, paragraphs and so on, with spaces and punctuation marking the boundaries. Thai, by contrast, is written without spaces between words. Differences like these are one reason why Google Translate, or any other translator, struggles to convert a piece of text perfectly from one language to another.

- Machines have a hard time adapting to new constructs that humans come up with. Suppose a teenager looking at his Twitter feed comes across a word he has never seen before. He might not understand its meaning instantly, but this does not mean he cannot adapt: after seeing the word in several different tweets he might work out why and in which context it is used. This is much harder for machines. A model typically handles well only the kind of data it has seen before; when something new comes up, it often cannot respond sensibly.
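The first point can be made concrete with a rough sketch. It compares the two Mumbai sentences purely by the words they share, which is roughly what a naive program "sees". The `word_set` helper and the use of Jaccard overlap are illustrative choices, not part of the tutorial's method.

```python
import re

def word_set(sentence):
    """Lowercase a sentence and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

a = word_set("I like the rains of Mumbai")
b = word_set("Mumbai is beautiful during the monsoons and that's why I like it")

shared = a & b
jaccard = len(shared) / len(a | b)  # shared words divided by all distinct words

print(sorted(shared))
print(round(jaccard, 2))
```

Only 4 of the 14 distinct words are common to both sentences, even though a human reads them as saying nearly the same thing.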

#### Usecases of NLP

***

The usecases of NLP encompass almost anything you can do with Language in relation to a problem. 

1) Sentiment Analysis - Finding whether the text expresses a positive or negative sentiment.

(Sentiment analysis is immensely useful in figuring out the overall sentiment towards products (Amazon), movies (Netflix) or food (Zomato) by parsing the reviews and running a sentiment analysis on them.)

2) Text Classification - Categorizing text into various categories

(Some examples of text classification are:

- Understanding audience sentiment from social media,
- Detection of spam and non-spam emails,
- Auto tagging of customer queries, and
- Categorization of news articles into defined topics.
)

3) Summarizing - Summarizing a paragraph into "n" words or sentences

(Example: Inshorts, news in 60 words or less)

4) Parts of Speech Tagging - Figuring out the various nouns, adverbs, verbs etc. in your text

(Chatbots)

5) Language translation

(Google Translate)

6) Grammar correction

(Autocorrect in messaging services)

7) Entity recognition - Finding places, animals and people in the text in question

(Chatbots)

8) Intent recognition - Chatbots usually use this extensively, to figure out what exactly you, or the customer in question, need information or services about. 

We will deal with each of them in detail in the subsequent tutorials. 

## Chapter 3: Tokenization and Introduction to NLTK


### 3.1 Building blocks of text: Motivation for tokenization

***
We can now see that, unlike the machine learning datasets we have worked with previously, this data isn't boolean, numeric or categorical. How do we feed this text data to an ML algorithm?
The first step is understanding what text data consists of.

Usually a text is composed of paragraphs, paragraphs are composed of sentences, and sentences are composed of words. 

#### Words are the basic building blocks of any text. 

Sure, you could go deeper, down to letters, but letters by themselves have no meaning. It's only when they are combined into words that the text starts to make sense. Hence NLP treats words as the basic unit of text. 

Tokenization is exactly what it sounds like: breaking something down into "tokens". Tokens are the basic units of a particular dataset. In this case our data is text, and tokenization means breaking it down into its basic tokens, which are words. 

We could also tokenize a paragraph into sentences, since a paragraph is composed of sentences. 
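The paragraph, sentence and word hierarchy can be sketched with the standard library alone. The two regular expressions below are deliberately simple stand-ins chosen for illustration; they are not how a dedicated tokenizer works and would stumble on abbreviations such as "Dr.".

```python
import re

text = "Anil is my friend. He likes football."

# Split after sentence-ending punctuation that is followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Within each sentence, a token is a run of word characters or a single punctuation mark
words_per_sentence = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

print(sentences)
print(words_per_sentence)
```

Each sentence becomes a list of word tokens, with the full stop kept as a token of its own.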

#### Introduction to NLTK

***
Working with text data, as we have seen, is not as straightforward as working with numeric or other data types, so it is no surprise that this processing is done through specialized libraries. NLTK, the Natural Language Toolkit, is one of the most widely used Python libraries for working with text. Functionality for tasks such as tokenization, stemming, lemmatization and text classification is built in, which makes working with text much simpler. Do not worry if these terms do not make sense to you yet; we will cover them in detail eventually. 

If NLTK is not already present in your Python environment, it needs to be set up separately.

#### Installing NLTK

***
To install NLTK and all its dependencies, go to your terminal (Mac) or command window (Windows) and type `pip install nltk`. This could take a while, depending on your internet speed. Note that NLTK is not the only library that can be used for text processing. There are plenty of others, but NLTK is one of the earliest and is generally a good starting point for a beginner. There are tasks the NLTK library cannot do, which have to be covered by other libraries; for the purposes of this particular tutorial, however, NLTK should more than suffice. 
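Before installing, it can be handy to check whether the package is already available. The snippet below is a small sketch using the standard library; `is_installed` is a helper name made up for this example.

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None

if is_installed("nltk"):
    print("NLTK is available")
else:
    print("NLTK not found; install it with: pip install nltk")
```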

#### Tokenizing with NLTK - The problem intuition 

***

Now that we have defined our dataframe df, we need to find a way to convert the text in the X column to numbers, so that it is in a form to which we can apply an algorithm. Think of how sklearn requires all non-numeric data to be encoded (label or one-hot) before it enters the sklearn pipeline. 

Intuitively, it would make sense to divide each paragraph of text into its basic units (words) and then convert each of those words to a number. We could assign a particular number to each word, in which case a sentence would look like a sequence of numbers, each number representing a particular word. 
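As a rough sketch of that intuition, the snippet below gives every distinct word the next free integer and then rewrites a sentence as a list of those integers. The helper name `build_word_ids` and the numbering scheme (order of first appearance) are assumptions made for illustration.

```python
def build_word_ids(words):
    """Assign each distinct word the next free integer, in order of first appearance."""
    word_ids = {}
    for word in words:
        if word not in word_ids:
            word_ids[word] = len(word_ids)
    return word_ids

words = "the shop is by the road".split()
word_ids = build_word_ids(words)
encoded = [word_ids[w] for w in words]

print(word_ids)
print(encoded)
```

Note how the repeated word "the" maps to the same number both times it appears.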

The first step towards achieving that is to break the text down into words. That's what tokenization aims to do, and NLTK has a built-in function for it. Assuming NLTK is now installed on your machine, let us quickly run through a few sample tokenization exercises before your assignment 3, where you will break down every cell in the X column into its words and create a new column for them. 

### Tokenize the first complaint into words

In this task you will assign a variable to the first row of the consumer complaints narrative column (X column) and break down the text into its constituent words.

### Instructions
- Load the dataframe defined earlier
- Assign a variable **first_complaint** to the paragraph listed in the first non-empty row of the consumer complaint narrative column (or X)
- Break it down into words, first using the split command and then using the nltk.word_tokenize function. 
- Assign this list of words to a list called bag_of_words
- Comparing both lists (the one from split and the one from word_tokenize) shows that word_tokenize is more robust: it splits the paragraph purely into words and separates the punctuation into separate tokens. With split, full stops stay attached to certain words. 

In [28]:
import nltk
from nltk.tokenize import word_tokenize
# Tokenizer models; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')
df["X"].head(3)
df = df.dropna() #dropping nans
first_complaint = df["X"].iloc[0]
print (first_complaint)
print ("\n")
#Using the split command
print ("Using the Split Command")
print ("\n")
bag_of_words = first_complaint.split(" ")
print (bag_of_words)

#Using word_tokenize, which separates punctuation into its own tokens
bag_of_words = word_tokenize(first_complaint)
print ("\n")
print (bag_of_words)

When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.


Using the Split Command


['When', 'my', 'loan', 'was', 'switched', 'over', 'to', 'Navient', 'i', 'was', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balance', 'because', 'with', 'XXXX', 'i', 'did', 'not.', 'When', 'going', 'to', 'purchase', 'a', 'vehicle', 'i', 'discovered', 'my', 'credit', 'sco

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ashwani.saxena\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 3.2 Sent Tokenize

### Tokenize the first complaint into  sentences

One could also tokenize a paragraph into constituent sentences. 

### Assignment 2

#### Tokenize the second non-empty complaint into words, convert all words to lower case and assign them to a list

#### Description:
The importance of converting words to lower case: all words should generally be converted to lowercase while doing NLP. The reason is that "Mumbai" and "mumbai", even though they are the same word, would be treated as 2 separate words when converting the words into numbers, which would distort the counts. To avoid such problems, it is standard practice to convert all text to lower case before beginning NLP. 
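A small sketch of the effect, using a made-up sentence: without lowercasing, the two spellings are counted as different words; after lowercasing they collapse into one.

```python
from collections import Counter

text = "Mumbai rains are heavy and mumbai traffic is slow"

counts_as_is = Counter(text.split())
counts_lower = Counter(text.lower().split())

print(counts_as_is["Mumbai"], counts_as_is["mumbai"])  # counted as two different words
print(counts_lower["mumbai"])                           # merged into one word
```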

### Instructions
- Load the dataframe defined earlier
- Assign a variable **first_complaint** to the paragraph listed in the first non-empty row of the consumer complaint narrative column (or X)
- Break it down into sentences using the sent_tokenize function from nltk
- Assign this list of sentences to a list called list_of_sentences

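In [29]:
from nltk.tokenize import sent_tokenize
#first_complaint is already loaded in the workspace
list_of_sentences = sent_tokenize(first_complaint)
#A bare expression in the middle of a cell is not displayed; use print(list_of_sentences) to inspect the sentences

print ("\n")

#Tokenizing the second non-empty complaint into lowercased words
print (df["X"].iloc[1])
bag_of_words_lower = word_tokenize(df["X"].iloc[1].lower())
print ("\n")
print (bag_of_words_lower)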



I tried to sign up for a spending monitoring program and Capital One will not let me access my account through them


['i', 'tried', 'to', 'sign', 'up', 'for', 'a', 'spending', 'monitoring', 'program', 'and', 'capital', 'one', 'will', 'not', 'let', 'me', 'access', 'my', 'account', 'through', 'them']


## Chapter 4: Vectorization


### 4.1 Converting your text to numbers: The crux of NLP

***
We managed to convert our complaints into a bag of words, or a list of words. But that is of little use until we figure out a way to convert these words to a numeric format, which is necessary to apply any sort of algorithm (machine learning or otherwise). 

**This process of converting text data to numbers is called vectorization**

There are multiple methods to convert words to numbers. We will initially deal with one of them: a count vectorizer. 

### 4.2 The intuition behind vectorization: The count vectorizer

***
Every list of words corresponds to a row in our dataframe, something like this. Where the X column would now be the list of lowercased words and our y column would be the product category. 

<img src="img.png">

The idea now is to convert the X column to numbers. 

**One way to do that would be to represent every word as a key-value pair in a dictionary, where the key is the word and the value is the number of times that word appears in the list.** 

This method of using the counts of the words in the list to convert them to a numeric format is called count vectorization. To understand the intuition behind it, we will first do this manually and then use sklearn to do it automatically. 
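Done by hand, the counting step needs nothing more than a plain dictionary. The sketch below uses a short, hand-picked list of tokens (hypothetical, not the full complaint), and `count_words` is a helper name invented for this example.

```python
def count_words(words):
    """Return a dictionary mapping each word to the number of times it appears."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

tokens = ["i", "was", "told", "that", "i", "had", "a", "balance", "i", "did", "not"]
word_counts = count_words(tokens)
print(word_counts)
```

`collections.Counter` from the standard library produces the same mapping in a single call.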


### Convert the first complaint to numbers using the counts of words in the form of a dictionary

In this task you will implement your own code for count vectorization
### Instructions
- Load the df.
- Take the first complaint in the X column, tokenize it into its words and convert them to lower case. 
- Assign the resulting list of words to **first_complaint**
- Create a dictionary (using any method you prefer) in which the keys are the words themselves and the values are the number of times each word appears in the list **first_complaint**
- Name this dictionary Count_Vectorizer


In [30]:
from collections import Counter
#Tokenizing the lowercased first complaint into words
first_complaint = word_tokenize(df["X"].iloc[0].lower())
#Counter builds the word -> count dictionary in one step
Count_Vectorizer = Counter(first_complaint)
print (Count_Vectorizer)

Counter({'i': 11, 'the': 8, '.': 7, 'was': 5, 'to': 5, 'and': 5, 'my': 4, 'had': 4, 'credit': 4, 'navient': 3, 'told': 3, 'that': 3, 'with': 3, 'xxxx': 3, 'when': 2, 'loan': 2, 'a': 2, 'balance': 2, 'score': 2, 'been': 2, 'have': 2, 'delinquency': 2, 'just': 2, 'bureaus': 2, 'so': 2, 'switched': 1, 'over': 1, 'never': 1, 'deliquint': 1, 'because': 1, 'did': 1, 'not': 1, 'going': 1, 'purchase': 1, 'vehicle': 1, 'discovered': 1, 'dropped': 1, 'from': 1, 'into': 1, 'faithful': 1, 'at': 1, 'paying': 1, 'student': 1, 'company': 1, 'contacted': 1, 'resolve': 1, 'this': 1, 'issue': 1, 'you': 1, 'kept': 1, 'being': 1, 'contact': 1, 'expalin': 1, 'situation': 1, 'maybe': 1, 'they': 1, 'could': 1, 'help': 1, 'me': 1, 'angry': 1, 'hurried': 1, 'paid': 1, 'off': 1, 'then': 1, 'after': 1, 'tried': 1, 'dispute': 1, 'much': 1, 'trouble': 1, 'bringing': 1, 'back': 1, 'up': 1})


### 4.3 Introduction to the sklearn's Count Vectorizer

***
We did manage to convert our list of words to numbers. However, the problem still remains unresolved. 

**How do we apply our algorithm to this?**

**Could we convert every word to a feature (or column), use the count associated with it as its value, and then apply a classification algorithm? Something like below?**

<img src="img2.png">

**This looks very similar to one-hot encoding and is a typical method of applying ML to text data.**

It is similar to one-hot encoding in that, as we add the second row, the third row and subsequent rows, the number of features (columns) grows as more and more words come in. For words which do not appear in, say, the first complaint, the vectorizer automatically assigns 0. **Hence the number of features will be equal to the total number of unique words in all the complaints combined, and the values of those features will be the counts of those words in that particular complaint.**
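The table of counts described here can be sketched by hand for a few tiny documents. The three documents below are made up, and the whitespace tokenization is a simplification chosen for this illustration.

```python
documents = [
    "my loan was late",
    "my card was declined",
    "loan and card",
]

tokenized = [doc.lower().split() for doc in documents]

# One feature per unique word across all documents, in alphabetical order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document; each entry is how often that word occurs in the document
matrix = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length (7, the number of unique words), and a 0 marks a word that does not occur in that document.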

A normal classification algorithm can now be applied, where X is all the features except the Product column and y is the Product column. 

We could just use sklearn's Count Vectorizer to convert all the text into numbers in a single step, instead of breaking the text down into individual words, lowercasing them, and then building a dictionary of counts. 

This is how we would do it for the first row. 

In [31]:
df.head(5)

| | X | y |
|---|---|---|
| 1 | When my loan was switched over to Navient i wa... | Student loan |
| 2 | I tried to sign up for a spending monitoring p... | Credit card or prepaid card |
| 7 | My mortgage is with BB & T Bank, recently I ha... | Mortgage |
| 13 | The entire lending experience with Citizens Ba... | Mortgage |
| 14 | My credit score has gone down XXXX points in t... | Credit reporting |


In [32]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
txt = [df["X"].iloc[0]]
print (txt)
print ("\n")
print ("Applying the count vectorizer")
cv.fit(txt)
vector = cv.transform(txt)
print ("Vector Shape")
print ("\n")
print (vector.shape)#Has 69 unique words
vector_values = vector.toarray()
print ("\n")
print ("vector Values")
print (vector_values)
print ('These are the counts of the 69 unique words in our first complaint. To find which words have these counts, we can execute the command below:cv.vocabulary_')
print ("\n")
print ("Count Vectorizer Vocabulary")
print (cv.vocabulary_) 

['When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.']


Applying the count vectorizer
Vector Shape


(1, 69)


vector Values
[[1 5 1 1 1 2 1 2 1 1 2 1 1 1 1 4 2 1 1 1 1 1 1 1 1 1 4 2 1 1 1 1 2 1 2 1
  1 1 4 3 1 1 1 1 1 1 1 1 2 1 2 1 1 3 8 1 1 1 5 3 1 1 1 1 5 2 3 3 1]]
These are the counts of the 69 unique words in our first complaint. To find which words hav

#### `cv.vocabulary_` does not give the counts of the words. It gives the index of each word in the vector_values list. For example, the index of the word expalin is 22, and the value at index 22 of the vector_values list is 1, which is the count of that word, as specified in the figure above. 
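The relationship between a word-to-index mapping and a count vector can be sketched with toy values. The indices below are hypothetical and only mimic the shape of `cv.vocabulary_`; the three counts are borrowed from the word counts of the first complaint shown earlier.

```python
# Hypothetical word -> column index mapping, shaped like cv.vocabulary_
vocabulary = {"loan": 2, "balance": 0, "credit": 1}
# Count vector for one document; position i holds the count of the word with index i
vector_values = [2, 4, 2]

# Look up a single word: first find its index, then read the count at that index
index = vocabulary["credit"]
print(index, vector_values[index])

# Invert the mapping to list every word next to its count
index_to_word = {i: word for word, i in vocabulary.items()}
word_counts = {index_to_word[i]: count for i, count in enumerate(vector_values)}
print(word_counts)
```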

In [33]:
vector_values = vector_values.tolist()
vector_values = vector_values[0]
print (vector_values)
print ("\n")
print ("count value of the word at index 22")
print (vector_values[22]) #Value of 1
print ("\n")
print ("count value of the word at index 34, the word is 'loan'")
print (vector_values[34]) #Value of 2
print ("\n")
print ("Seeing the cv.vocabulary_ dictionary we see that the word is 'loan' and its value in the vector values list is 2. ")

[1, 5, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 4, 3, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 3, 8, 1, 1, 1, 5, 3, 1, 1, 1, 1, 5, 2, 3, 3, 1]


count value of the word at index 22
1


count value of the word at index 34, the word is 'loan'
2


Seeing the cv.vocabulary_ dictionary we see that the word is 'loan' and its value in the vector values list is 2. 


<img src="img2.png">

### Coding the count vectorizer and getting the data into a form for algorithm application

#### Use the count vectorizer to convert the X column of the dataframe to numbers and make a new dataframe with these features and the product column. To aid understanding, we will first consider just the top 3 rows of the dataframe and then run it over the entire dataframe.


In [34]:
#Importing count vectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer
#Initializing the Count vectorizer
cv = CountVectorizer()
#Initializing a dataframe "all_text" with the first 3 rows of df
all_text = df["X"][:3]
all_text = pd.DataFrame(all_text)
#Renaming the column for that dataframe (has only one column) to "Text"
all_text.columns = ["Text"]
#Converting to lower case
all_text["Text"] = all_text['Text'].str.lower()
#Fitting the Count vectorizer on all the text (this learns the vocabulary)
cv.fit(all_text["Text"])
#Converting the text to numbers - The transform function does this. 
vector = cv.transform(all_text["Text"])
vector_values_array = vector.toarray()
vector_values_list = vector_values_array.tolist()
#Number of rows: 3, because we kept 3 rows of the dataframe
print (len(vector_values_list)) 
print ("\n")
#Length of the first row's vector: the number of unique words
print (len(vector_values_list[0]))

3


214


214 is the number of unique words in all 3 rows combined. This length is the same for every element of vector_values_list, because all the unique words in these rows have been converted to features. The value of each feature in a row is the count of that word in that row, or 0 if the word does not appear in it. 

In [35]:
len(vector_values_list[1])

214

In [36]:
len(vector_values_list[2]) 

#As you can see, every document has been converted to a fixed length vector of 214 values, which
#correspond to the occurrences of those words in the particular document: 0 in case those
#words aren't present in the document

214

The y values for these 3 rows are 'Student loan', 'Credit card or prepaid card' and 'Mortgage'.

We will have to label encode these categories to convert them to numbers. 
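Label encoding itself is a simple idea: sort the distinct categories and number them from 0. The sketch below does this by hand for the three labels of the first three rows; sorting alphabetically mirrors how sklearn's encoder orders its classes, while the variable names are made up for this example.

```python
labels = ["Student loan", "Credit card or prepaid card", "Mortgage"]

# Sort the distinct categories and number them from 0
classes = sorted(set(labels))
class_to_id = {name: i for i, name in enumerate(classes)}

encoded = [class_to_id[name] for name in labels]

print(class_to_id)
print(encoded)
```

The numbers carry no order or magnitude of their own; they are just identifiers for the categories.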

In [37]:
from sklearn.preprocessing import LabelEncoder
labels = pd.DataFrame(df["y"][:3])
#label encoding the y values
labels.columns = ["labels"]
le = LabelEncoder()
labels["labels"] = le.fit_transform(labels["labels"])
print (labels)

   labels
1       2
2       0
7       1


Now our final dataframe for the X is the vector_values_array, which is the vectorized form of the text, and the Y is the labels dataframe. 

In [38]:
vector_values_array[:1] #This is row 1 of the original dataframe df, now converted to numbers

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 5, 1, 0, 0, 0, 1, 1, 2, 0, 0,
        0, 0, 1, 2, 1, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 1, 0, 1, 4, 0, 0, 0, 2, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1,
        1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 4, 2, 1, 0, 0, 0, 1, 0, 0, 0,
        1, 0, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0, 0,
        0, 1, 4, 3, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0,
        0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 3, 8,
        0, 0, 1, 1, 0, 1, 0, 5, 3, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 5, 0,
        0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 3, 0, 0, 1, 0]], dtype=int64)

If we now, include all the 335 rows and vectorize them and label it as the X, and the corresponding y labels, we can train a classification algorithm on the same, and figure out the accuracy. 

In [39]:
#Running the exact same code as the earlier one on all 211 rows
import numpy as np
all_text = df["X"]
all_text = pd.DataFrame(all_text)
all_text.columns = ["Text"]
all_text["Text"] = all_text['Text'].str.lower()
cv = CountVectorizer()
cv.fit(all_text["Text"])
vector = cv.transform(all_text["Text"])
vector_values_array = vector.toarray()
labels = pd.DataFrame(df["y"])
labels.columns = ["labels"]
labels["labels"] = le.fit_transform(labels["labels"])

In [42]:
len(vector_values_array) #Because there are 211 documents in all

211

In [41]:
len(labels) #Because each document or complaint has a label, hence 211 labels in all

211

After splitting this X and y into a train and a test set, we can fit a plain Logistic Regression and calculate its accuracy.

In [43]:
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split as tts
X = vector_values_array
y = labels["labels"]
X_train,X_test,y_train,y_test = tts(X,y,test_size=0.4,random_state=42)

In [44]:
log_reg = LogisticRegression(random_state=42)

In [45]:
log_reg.fit(X_train,y_train)
y_pred = log_reg.predict(X_test)
print (accuracy_score(y_test,y_pred))

0.4235294117647059




That's a 42% accuracy in predicting the product category, which isn't a good number. A look at precision and recall shows that the product categories are heavily imbalanced, so a few categories are never detected by the algorithm. Before we rectify that, let us first look at the distribution of the y column.

In [47]:
labels["labels"].value_counts() 

6     69
7     52
5     17
9     16
4     12
1     11
11    10
3      6
2      5
0      5
8      4
12     2
10     2
Name: labels, dtype: int64

We can see clearly that the data set has heavily imbalanced labels. Labels 10, 8 and 12 are under-represented, so the model is less likely to catch them and the accuracy is going to suffer. A look at the classification report should reinforce this.

In [48]:
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")
print (classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         3
           1       0.50      0.50      0.50         4
           2       0.00      0.00      0.00         3
           3       0.00      0.00      0.00         3
           4       0.20      0.33      0.25         3
           5       0.67      0.25      0.36         8
           6       0.42      0.65      0.51        23
           7       0.46      0.62      0.53        21
           8       0.00      0.00      0.00         2
           9       0.33      0.43      0.38         7
          10       0.00      0.00      0.00         1
          11       0.00      0.00      0.00         5
          12       0.00      0.00      0.00         2

   micro avg       0.42      0.42      0.42        85
   macro avg       0.20      0.21      0.19        85
weighted avg       0.35      0.42      0.37        85



As predicted, the recall for labels 0, 2, 3, 8, 10, 11 and 12 is 0. This is a typical sampling issue. It can be rectified by an oversampling technique such as SMOTE or ROS (random oversampling); however, before we get to that, it's time to understand what the other reasons for the low accuracy could be.
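
As a rough idea of what random oversampling does, the sketch below duplicates randomly chosen minority-class rows until every class is as large as the biggest one. The function name `random_oversample` and the toy data are assumptions for illustration; in practice this would be applied to the training split only, and a library implementation is usually preferable.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate random minority rows until all classes match the largest class."""
    rng = random.Random(seed)
    rows_by_label = {}
    for features, label in zip(X, y):
        rows_by_label.setdefault(label, []).append(features)

    target = max(len(rows) for rows in rows_by_label.values())
    X_out, y_out = [], []
    for label, rows in rows_by_label.items():
        # Sample with replacement to fill the gap up to the target size
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out

# Toy data: label 6 dominates, labels 10 and 12 are rare
X_toy = [[1, 0], [2, 0], [3, 0], [4, 0], [0, 1], [0, 2]]
y_toy = [6, 6, 6, 6, 10, 12]

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(Counter(y_toy))
print(Counter(y_bal))
```

Because the added rows are exact copies, this balances the class counts without adding new information, which is one reason synthetic approaches such as SMOTE exist.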

It is important to realize that a count vectorizer is essentially a ranking algorithm: it gives a higher weight to words which appear more often. In other words, the value in each key-value pair is plainly the count of the word in the document. It does not assign any importance to the order in which the words or sentences appear.
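
The loss of word order is easy to demonstrate. In this small sketch (the two sentences are invented for illustration), the sentences mean different things but produce identical word counts:

```python
from collections import Counter

sentence_a = "the bank charged the customer"
sentence_b = "the customer charged the bank"

counts_a = Counter(sentence_a.split())
counts_b = Counter(sentence_b.split())

# The sentences differ, but their bags of words are the same
print(sentence_a == sentence_b)
print(counts_a == counts_b)
```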

Another point to consider is that words like "a", "an", "the"... appear more often than the rest of the words, as they are common articles. Using a count vectorizer out of the box on a paragraph or a body of text will invariably give the highest counts to these common words, so the words we are actually interested in will rank underneath them. One way to rectify this is to remove these commonly occurring words. NLTK offers this functionality and has rightly defined these particular words as "stopwords", or words we wouldn't include in our bag of words. We will have to remove the punctuation as well: as you can see, our initial bag of words had the commas, full stops and all of the other symbols too. We want to remove them so that they are not vectorized.

To see the list of stopwords NLTK currently includes, we can just run:

In [49]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
print (set(stopwords.words('english')))

{'when', 'against', "shan't", 'be', 'my', 'all', "weren't", 'doing', 'very', 'for', "mustn't", 'their', 'it', 'other', 'can', 'again', 'him', 'been', 'herself', 'both', 'wasn', 'isn', "don't", "you'd", 'few', 'we', 'theirs', 'own', 'his', 'until', 'doesn', 'in', 'weren', 'by', "hasn't", 'during', 'who', 'to', 'being', 'over', 'than', 'should', "wouldn't", 'each', "didn't", 'of', "shouldn't", 'hasn', 'or', 'they', 'how', "hadn't", 'd', 'through', 'her', 'such', 'have', 'shouldn', 'wouldn', 'yourselves', 'only', 'after', 'hadn', 'before', 'up', 'hers', 'its', 'between', 're', "she's", 'once', 'won', 'was', 'ain', 'above', 'there', "doesn't", 'from', 'which', 'aren', 'were', 'themselves', 'into', 'down', 'are', "aren't", 'what', 'under', 'same', 'with', 'and', 'this', 'where', 'y', "mightn't", "couldn't", 'ourselves', 'will', 'out', 'did', 'at', "you're", 'had', "you've", 'ma', "won't", 'couldn', 'you', 'your', 'itself', 'about', 's', 't', 'didn', 'those', 'she', 'yours', "that'll", 'most

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ashwani.saxena\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The punctuation list can be derived as follows.

In [50]:
from string import punctuation

In [51]:
print (list(punctuation))

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [52]:
#The good thing is that we can add our own list of words to remove from our body of text, over and above this list.

custom_set_of_stopwords = set(stopwords.words('english')+list(punctuation)+["Bangalore"])

#This will include the word Bangalore as a stopword and remove it from our body of text before vectorizing it.
print ("Bangalore" in custom_set_of_stopwords)

True
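
One caveat, sketched below with a tiny stand-in stop word list rather than the NLTK one: the membership test is case-sensitive. Because the complaints are lowercased before tokenizing, a capitalized entry such as "Bangalore" would never match the token "bangalore", so custom stop words are best added in lowercase.

```python
from string import punctuation

# Tiny stand-in for stopwords.words('english'), for illustration only
base_stopwords = ["i", "in", "a", "the"]
custom_stopwords = set(base_stopwords + list(punctuation) + ["Bangalore"])

# A lowercased, already tokenized sentence (hypothetical)
tokens = "i live in bangalore .".split()

# "Bangalore" in the set does not match the lowercased token "bangalore"
kept = [t for t in tokens if t not in custom_stopwords]

# Lowercasing the stop word set fixes the mismatch
lowercase_stopwords = {w.lower() for w in custom_stopwords}
kept_fixed = [t for t in tokens if t not in lowercase_stopwords]

print(kept)
print(kept_fixed)
```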


If we redo the prediction exercise after removing the stopwords, the accuracy should increase, because we are no longer giving weight to meaningless words, only to words which actually matter.

In [54]:
all_text = df["X"]
all_text = pd.DataFrame(all_text)
all_text.columns = ["Text"]
all_text["Text"] = all_text['Text'].str.lower()

Taking just the first row

In [55]:
first_complaint = all_text['Text'].iloc[0]

In [56]:
first_complaint

'when my loan was switched over to navient i was never told that i had a deliquint balance because with xxxx i did not. when going to purchase a vehicle i discovered my credit score had been dropped from the xxxx into the xxxx. i have been faithful at paying my student loan. i was told that navient was the company i had delinquency with. i contacted navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. i was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. i have had so much trouble bringing my credit score back up.'

In [57]:
first_complaint_bow = word_tokenize(first_complaint)

In [58]:
print (first_complaint_bow)

['when', 'my', 'loan', 'was', 'switched', 'over', 'to', 'navient', 'i', 'was', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balance', 'because', 'with', 'xxxx', 'i', 'did', 'not', '.', 'when', 'going', 'to', 'purchase', 'a', 'vehicle', 'i', 'discovered', 'my', 'credit', 'score', 'had', 'been', 'dropped', 'from', 'the', 'xxxx', 'into', 'the', 'xxxx', '.', 'i', 'have', 'been', 'faithful', 'at', 'paying', 'my', 'student', 'loan', '.', 'i', 'was', 'told', 'that', 'navient', 'was', 'the', 'company', 'i', 'had', 'delinquency', 'with', '.', 'i', 'contacted', 'navient', 'to', 'resolve', 'this', 'issue', 'you', 'and', 'kept', 'being', 'told', 'to', 'just', 'contact', 'the', 'credit', 'bureaus', 'and', 'expalin', 'the', 'situation', 'and', 'maybe', 'they', 'could', 'help', 'me', '.', 'i', 'was', 'so', 'angry', 'that', 'i', 'just', 'hurried', 'and', 'paid', 'the', 'balance', 'off', 'and', 'then', 'after', 'tried', 'to', 'dispute', 'the', 'delinquency', 'with', 'the', 'credit', 'bureaus

In [59]:
len(first_complaint_bow)

137

In [60]:
first_complaint_bow_stopwords_removed = [x for x in first_complaint_bow if x not in custom_set_of_stopwords]

In [61]:
print (first_complaint_bow_stopwords_removed)

['loan', 'switched', 'navient', 'never', 'told', 'deliquint', 'balance', 'xxxx', 'going', 'purchase', 'vehicle', 'discovered', 'credit', 'score', 'dropped', 'xxxx', 'xxxx', 'faithful', 'paying', 'student', 'loan', 'told', 'navient', 'company', 'delinquency', 'contacted', 'navient', 'resolve', 'issue', 'kept', 'told', 'contact', 'credit', 'bureaus', 'expalin', 'situation', 'maybe', 'could', 'help', 'angry', 'hurried', 'paid', 'balance', 'tried', 'dispute', 'delinquency', 'credit', 'bureaus', 'much', 'trouble', 'bringing', 'credit', 'score', 'back']


In [62]:
len(first_complaint_bow_stopwords_removed)

54

We see that the less important words such as "a", "an", "the" and "where" have been removed, along with the punctuation.

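Redoing this for all the rows of the dataframe, this time using the English stop word list built into CountVectorizer (stop_words="english") rather than the NLTK set.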

In [63]:
all_text = df["X"]
all_text = pd.DataFrame(all_text)
all_text.columns = ["Text"]
all_text["Text"] = all_text['Text'].str.lower()
cv = CountVectorizer(stop_words="english")
cv.fit(all_text["Text"])
vector = cv.transform(all_text["Text"])
vector_values_array = vector.toarray()
labels = pd.DataFrame(df["y"])
labels.columns = ["labels"]
labels["labels"] = le.fit_transform(labels["labels"])

In [64]:
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split as tts
X = vector_values_array
y = labels["labels"]
X_train,X_test,y_train,y_test = tts(X,y,test_size=0.4,random_state=42)

In [65]:
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train,y_train)
y_pred = log_reg.predict(X_test)
print (accuracy_score(y_test,y_pred))

0.47058823529411764


We see an increase of about 5 percentage points in accuracy (from 42% to 47%) just by removing the stopwords. The model is now being trained on the words which actually matter, not on the more commonly occurring stopwords.

### 4.4 Introduction to the TF-IDF Vectorizer

In the last tutorial we saw how text was converted to numerics using a count vectorizer. 

In other words, a count vectorizer counts the occurrences of the words in a document, and all the documents are considered independent of each other, very similar to one-hot encoding or the pandas get_dummies function. Even when multiple documents are involved, a count vectorizer does not assume any interdependence between the documents and treats each document as a separate entity.

It does not rank the words based on their importance in the document, only on how often they occur. This is not a wrong approach, but it intuitively makes more sense to rank words based on their importance in the document, right? In fact, the process of converting text to numbers should essentially be a ranking system for the words, so that each document gets a score based on the words it contains. All words cannot have the same importance or relevance in the document, right?

#### Enter TF-IDF!!

TF-IDF, or Term Frequency and Inverse Document Frequency, is kind of the holy grail of ranking metrics for converting text to numbers. Think of the count vectorizer as a metric which just counts the occurrences of words in a document.

**The ranking system in a count vectorizer is purely occurrence based, on a single document only!**

TF-IDF takes it a step further and ranks the words based not just on their occurrences in one document but across all the documents. Where CV (the count vectorizer) gives more importance to a word because it appears multiple times in the document, TF-IDF ranks it higher if it appears only in that document (it is rare, hence more important) and lower if it appears in all or most documents (it is common, hence less important).

Consider a scenario where there are 5 documents, all talking about football. The word football would appear multiple times in each document. CV is going to rank football consistently high and in fact give it a different value in each of the 5 documents, based on how many times the word appears in that document. In other words, it assumes that the more often a word appears, the more important it is. That is exactly what the TF, or Term Frequency, component of TF-IDF does.

IDF, on the other hand, is the dominating factor in TF-IDF: it finds out in how many of the other 4 documents football also appears, besides the one currently being scored. If football appears in the rest of the documents as well, then although it is important to that one document based on its number of occurrences, it is common rather than rare, so its importance is reduced instead of going up!

**The ranking system is across the entire corpus or all documents.  It is not a single document based metric!**

We have seen how CV is calculated for a word in a document. Let us now see how TF-IDF is calculated.

The tf-idf weight is composed of two terms: the first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, a term may appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
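
The two formulas above translate almost directly into code. The sketch below applies them to the football scenario; the five documents and the function names are invented for illustration, and `idf` assumes the term occurs in at least one document.

```python
import math

# Five hypothetical documents that all mention football
docs = [
    "football match football goal",
    "football transfer news",
    "football club offside rule",
    "football fans stadium",
    "football league table",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    """Occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Natural log of (number of documents / documents containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

first = tokenized[0]
print("football:", tf_idf("football", first, tokenized))
print("goal:", tf_idf("goal", first, tokenized))
```

Although football is the most frequent word in the first document, it appears in all five documents, so its IDF is log(5 / 5) = 0 and its weight drops to zero, while the rarer word goal keeps a positive weight.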

#### Example

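Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, using a base-10 logarithm to keep the arithmetic simple, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm used in the formula above, the idf would be about 9.21 and the weight about 0.28; the choice of base only rescales every weight by the same constant factor.)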
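
The arithmetic of this example can be checked in a few lines. Note that an idf of 4 corresponds to a base-10 logarithm; the natural logarithm gives a larger value, but the ranking of terms is unaffected by the base. The variable names are chosen here for illustration.

```python
import math

tf = 3 / 100                 # "cat" appears 3 times in a 100-word document
n_docs = 10_000_000          # documents in the corpus
docs_with_cat = 1_000        # documents containing "cat"

idf_base10 = math.log10(n_docs / docs_with_cat)
idf_natural = math.log(n_docs / docs_with_cat)

print(round(tf * idf_base10, 2))
print(round(tf * idf_natural, 2))
```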

Let us now take the first 3 complaints and run a TF-IDF vectorizer on them. In this case, the 3 complaints are our 3 documents. Instead of a CV, which considers each document independently and just counts every word in it, the corpus is now the sum total of all 3 documents.

In [66]:
complaint_1 = df["X"].iloc[0]
complaint_2 = df["X"].iloc[1]
complaint_3 = df["X"].iloc[2]

In [67]:
print ("Complaint 1: ", complaint_1)
print ("\n")
print ("Complaint 2: ", complaint_2)
print ("\n")
print ("Complaint 3: ", complaint_3)

Complaint 1:  When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.


Complaint 2:  I tried to sign up for a spending monitoring program and Capital One will not let me access my account through them


Complaint 3:  My mortgage is with BB & T Bank, recently I have been investigating ways to pay down my mortgage faster and I came across Biweekly Mortgage Calculat

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents called sents
sents = [complaint_1, complaint_2, complaint_3]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab

In [69]:
sents

['When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.',
 'I tried to sign up for a spending monitoring program and Capital One will not let me access my account through them',
 'My mortgage is with BB & T Bank, recently I have been investigating ways to pay down my mortgage faster and I came across Biweekly Mortgage Calculator on BB & T \'s website. It\'s a ni

In [70]:
vectorizer.fit(sents)
vector = vectorizer.transform(sents)

In [71]:
vector

<3x214 sparse matrix of type '<class 'numpy.float64'>'
	with 251 stored elements in Compressed Sparse Row format>

In [72]:
vector.shape #214 unique words in all 3 documents combined. 3 documents in total.

(3, 214)

In [73]:
vector_values = vector.toarray().tolist()[0]

vector_values holds the tf-idf score of each of the 214 words for the first document. Printing the first 5 elements of this list:

In [74]:
vector_values[:5]

[0.0, 0.0, 0.0, 0.0, 0.0]

To figure out which words these are, just like with the count vectorizer, we have vectorizer.vocabulary_

In [75]:
print (vectorizer.vocabulary_)

{'when': 202, 'my': 112, 'loan': 101, 'was': 196, 'switched': 170, 'over': 128, 'to': 183, 'navient': 113, 'never': 114, 'told': 184, 'that': 174, 'had': 78, 'deliquint': 54, 'balance': 19, 'because': 24, 'with': 206, 'xxxx': 209, 'did': 55, 'not': 117, 'going': 76, 'purchase': 140, 'vehicle': 194, 'discovered': 57, 'credit': 49, 'score': 152, 'been': 25, 'dropped': 62, 'from': 73, 'the': 175, 'into': 88, 'have': 79, 'faithful': 66, 'at': 17, 'paying': 131, 'student': 169, 'company': 44, 'delinquency': 53, 'contacted': 46, 'resolve': 148, 'this': 181, 'issue': 91, 'you': 212, 'and': 12, 'kept': 94, 'being': 26, 'just': 93, 'contact': 45, 'bureaus': 30, 'expalin': 65, 'situation': 160, 'maybe': 105, 'they': 179, 'could': 48, 'help': 80, 'me': 106, 'so': 161, 'angry': 13, 'hurried': 84, 'paid': 129, 'off': 120, 'then': 178, 'after': 9, 'tried': 185, 'dispute': 58, 'much': 111, 'trouble': 186, 'bringing': 29, 'back': 18, 'up': 191, 'sign': 158, 'for': 72, 'spending': 166, 'monitoring': 10

In [76]:
import operator
sorted_x = sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))
words = [x[0] for x in sorted_x]
d = dict(zip(words,vector_values))

In [77]:
print (d)

{'26': 0.0, '30': 0.0, 'about': 0.0, 'accelerated': 0.0, 'access': 0.0, 'account': 0.0, 'across': 0.0, 'active': 0.0, 'advertising': 0.0, 'after': 0.05253580411952334, 'all': 0.0, 'amount': 0.0, 'and': 0.20399369240069398, 'angry': 0.06907826902804956, 'answer': 0.0, 'applied': 0.0, 'asked': 0.0, 'at': 0.06907826902804956, 'back': 0.05253580411952334, 'balance': 0.13815653805609912, 'bank': 0.0, 'bb': 0.0, 'bbt': 0.0, 'be': 0.0, 'because': 0.06907826902804956, 'been': 0.10507160823904668, 'being': 0.06907826902804956, 'bi': 0.0, 'biweekly': 0.0, 'bringing': 0.06907826902804956, 'bureaus': 0.13815653805609912, 'but': 0.0, 'calculates': 0.0, 'calculator': 0.0, 'call': 0.0, 'called': 0.0, 'calling': 0.0, 'came': 0.0, 'can': 0.0, 'capital': 0.0, 'center': 0.0, 'checking': 0.0, 'collected': 0.0, 'com': 0.0, 'company': 0.06907826902804956, 'contact': 0.06907826902804956, 'contacted': 0.06907826902804956, 'corporate': 0.0, 'could': 0.05253580411952334, 'credit': 0.27631307611219824, 'customer

Sorting this dictionary by value in the descending order to see the ranking. 

In [78]:
print (sorted(d.items(), key=operator.itemgetter(1), reverse = True))

[('the', 0.42028643295618673), ('credit', 0.27631307611219824), ('had', 0.27631307611219824), ('was', 0.2626790205976167), ('navient', 0.20723480708414868), ('and', 0.20399369240069398), ('to', 0.20399369240069398), ('my', 0.1631949539205552), ('that', 0.15760741235857004), ('told', 0.15760741235857004), ('with', 0.15760741235857004), ('xxxx', 0.15760741235857004), ('balance', 0.13815653805609912), ('bureaus', 0.13815653805609912), ('delinquency', 0.13815653805609912), ('just', 0.13815653805609912), ('score', 0.13815653805609912), ('so', 0.13815653805609912), ('been', 0.10507160823904668), ('have', 0.10507160823904668), ('loan', 0.10507160823904668), ('when', 0.10507160823904668), ('angry', 0.06907826902804956), ('at', 0.06907826902804956), ('because', 0.06907826902804956), ('being', 0.06907826902804956), ('bringing', 0.06907826902804956), ('company', 0.06907826902804956), ('contact', 0.06907826902804956), ('contacted', 0.06907826902804956), ('deliquint', 0.06907826902804956), ('did', 

We can see that the model learns to give less importance to words like is, it, in, etc. Unfortunately, it also gives a low importance to important words like financial and mortgage, and a fairly high importance to unwanted words like the and was. (These scores are for the first complaint only, so words which occur only in the other complaints, such as mortgage, score 0 here.) It does give higher importance to words such as credit. That is because TF-IDF works better with larger corpora. Just like a machine learning model, the larger the data, the better the model. With a larger corpus, these issues would be resolved, because a word like the would appear in almost every document while a word like financial would appear in only a few.
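
The scores printed above will not match the textbook formulas exactly. As far as we can tell from scikit-learn's documented defaults (smooth_idf=True, norm="l2"), TfidfVectorizer uses the raw count as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit length. The sketch below re-implements that recipe in plain Python on a made-up three-document corpus; treat it as an approximation for intuition, not the library code.

```python
import math
from collections import Counter

# Made-up corpus, for illustration only
docs = [
    "the loan and the credit score",
    "the card was charged",
    "the mortgage and the bank",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({w for tokens in tokenized for w in tokens})
n_docs = len(tokenized)

# Document frequency: number of documents containing each word
doc_freq = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocabulary}

def smoothed_idf(word):
    # Assumed default: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * smoothed_idf(w) for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 length of the row
    return [v / norm for v in raw]

row = dict(zip(vocabulary, tfidf_row(tokenized[0])))
for word, score in sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(word, round(score, 3))
```

Even with the lowest possible smoothed IDF of 1, "the" still tops the ranking in this toy corpus because it occurs twice in the document. With only three documents, the IDF values of rare and common words are too close together to offset the higher count.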

Rerunning this for about 100 documents, we see that the ranking is completely different.

In [79]:
sents=[]
for x in range(100):
    sents.append(df["X"].iloc[x])

In [80]:
len(sents)

100

In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents called sents
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(sents)
vector = vectorizer.transform(sents)
vector.shape 
vector_values = vector.toarray().tolist()[0]
import operator
sorted_x = sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))
words = [x[0] for x in sorted_x]
d = dict(zip(words,vector_values))
print ((sorted(d.items(), key=operator.itemgetter(1), reverse = True))[:20])

[('navient', 0.35431369455512246), ('the', 0.24041090745878946), ('delinquency', 0.23620912970341496), ('had', 0.20963581996093744), ('told', 0.19801036084460227), ('bureaus', 0.19189624902422267), ('was', 0.18441012909485335), ('score', 0.1637072393027124), ('credit', 0.15907030235158917), ('just', 0.15203704157649714), ('balance', 0.14549112699503114), ('to', 0.14437952896171652), ('and', 0.14013869297167467), ('loan', 0.1344398602659683), ('angry', 0.12870728663065903), ('deliquint', 0.12870728663065903), ('expalin', 0.12870728663065903), ('faithful', 0.12870728663065903), ('hurried', 0.12870728663065903), ('switched', 0.12870728663065903)]


"the" has moved down from 0.42 to 0.24. 'navient' has increased from 0.21 to 0.35, and so has 'bureaus', from 0.14 to 0.19. As we include more and more documents, words which appear frequently across all the documents, such as "the", move down in value, and words like bureaus and navient, which appear in far fewer documents, start moving up. This reiterates the point we made: TF-IDF works better with larger corpora. Just like a machine learning model, the larger the data, the better the model.

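Why corpus size matters can be seen from the IDF term alone. Assuming the smoothed formula ln((1 + n) / (1 + df)) + 1 (believed to be scikit-learn's default; treat it as an assumption), a word found in every document always gets an IDF of 1, while the IDF of a word found in a single document keeps growing with the number of documents n:

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF for a word that appears in doc_freq out of n_docs documents."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

for n in (3, 100, 2000):
    everywhere = smoothed_idf(n, n)   # e.g. a word like "the"
    only_once = smoothed_idf(n, 1)    # e.g. a company name in one complaint
    print(n, round(everywhere, 2), round(only_once, 2))
```

The gap between common and rare words widens as documents are added, which is what lets the IDF term outweigh raw counts in a larger corpus.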

Let's run our initial Logistic Regression model using a tf-idf and see if there is a difference in the accuracies. 

In [82]:
all_text = df["X"]
all_text = pd.DataFrame(all_text)
all_text.columns = ["Text"]
all_text["Text"] = all_text['Text'].str.lower()
tfidf = TfidfVectorizer(stop_words="english")
tfidf.fit(all_text["Text"])
vector = tfidf.transform(all_text["Text"])
vector_values_array = vector.toarray()
labels = pd.DataFrame(df["y"])
labels.columns = ["labels"]
labels["labels"] = le.fit_transform(labels["labels"])
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split as tts
X = vector_values_array
y = labels["labels"]
X_train,X_test,y_train,y_test = tts(X,y,test_size=0.4,random_state=42)
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train,y_train)
y_pred = log_reg.predict(X_test)
print (accuracy_score(y_test,y_pred))

0.3764705882352941


And we can see that the overall accuracy of the model (38%) is **lower** than that of the CV model with stopwords removed (47%), a drop of about 9 percentage points. Based on what we have seen so far, do you think this will increase if we add more data?

In [85]:
import pandas as pd
df = pd.read_pickle("Consumer_complaints.pkl")
print ("reading")
df_copy = df.tail(2000).copy()
df = df_copy[["Consumer complaint narrative", "Product"]] #keeping the relevant columns
df.shape

reading


(2000, 2)

In [86]:
df.columns = ["X","y"]
df = df.dropna()
df = df.iloc[:2000]
print (df.shape)
print ("Building model")
all_text = df["X"]
all_text = pd.DataFrame(all_text)
all_text.columns = ["Text"]
all_text["Text"] = all_text['Text'].str.lower()
tfidf = TfidfVectorizer(stop_words="english")
tfidf.fit(all_text["Text"])
vector = tfidf.transform(all_text["Text"])
vector_values_array = vector.toarray()
labels = pd.DataFrame(df["y"])
labels.columns = ["labels"]
labels["labels"] = le.fit_transform(labels["labels"])
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split as tts
X = vector_values_array
y = labels["labels"]
X_train,X_test,y_train,y_test = tts(X,y,test_size=0.3,random_state=42)
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train,y_train)
y_pred = log_reg.predict(X_test)
print (accuracy_score(y_test,y_pred))

(1998, 2)
Building model
0.5783333333333334


From 38% at 211 rows to 58% at about 2,000 rows, we are starting to see how TF-IDF works better with larger data sets. The last value for the word "the" was around 0.24. Just out of curiosity, let's see the value of the word "the" now at 2000 docs.

In [90]:
### from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents called sents
# create the transform
sents=[]
for x in range(df.shape[0]):
    sents.append(df["X"].iloc[x])
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(sents)
vector = vectorizer.transform(sents)
vector.shape 
vector_values = vector.toarray().tolist()[0]
import operator
sorted_x = sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))
words = [x[0] for x in sorted_x]
d = dict(zip(words,vector_values))
print ((sorted(d.items(), key=operator.itemgetter(1), reverse = True))[:20])

[('xxxx', 0.37152555551310695), ('sell', 0.35476801553845877), ('mothers', 0.3261854926117282), ('wells', 0.20454943984068547), ('slowing', 0.18940866547356533), ('finalize', 0.1796962427013482), ('extract', 0.1674600371150138), ('feet', 0.1630927463058641), ('dragging', 0.1594002545398214), ('sought', 0.1594002545398214), ('to', 0.1576799906365181), ('mom', 0.15338032353364697), ('nc', 0.15338032353364697), ('deceased', 0.15085654071952967), ('help', 0.1475045824229167), ('home', 0.14562561360436718), ('my', 0.14059112518073988), ('totally', 0.13548171271131101), ('efforts', 0.13308433537212017), ('xxxxxxxxxxxx', 0.13308433537212017)]

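
"the" no longer appears among the top 20 terms, so its score in this document is no higher than 0.133, the lowest score shown. Keep in mind that the data was reloaded, so the first row is now a different complaint and the individual words are not directly comparable with the earlier runs. The pattern still holds: distinctive words such as sell, mothers and wells now rank above common words such as to and my, based on their importance **across all documents**, not just within each document alone.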

## Chapter 5: The Naive Bayes Classifier

Naive Bayes classifiers are based on Bayes' theorem. A pure classification algorithm, Naive Bayes predicts the categories or classes of the target based on the fundamental premise that the features are independent of each other. Each feature is assumed to contribute independently to the probability of the target variable belonging to a certain class.
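
A minimal from-scratch sketch of the idea, assuming word counts as features and add-one (Laplace) smoothing. The training sentences, labels and function names are invented for illustration, and this is not how MultinomialNB is implemented internally. Thanks to the independence assumption, the score for a class is simply the log prior plus the sum of the log probabilities of each word given that class.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: (text, label)
train = [
    ("loan payment late", "loan"),
    ("student loan balance", "loan"),
    ("card charged twice", "card"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text) under the naive independence assumption."""
    score = math.log(class_counts[label] / len(train))  # log prior
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen in training
        # Smoothed estimate of P(word | label)
        score += math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(text):
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(predict("late loan payment"))
print(predict("card charged"))
```

The prior term shows why imbalance matters: a class with many training rows starts with a head start that the word evidence has to overcome.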

Let us try running a Naive Bayes classifier using the TF-IDF method, exactly the same way we ran a log-reg model in the earlier section. 

In [92]:
import pandas as pd
df = pd.read_pickle("Consumer_complaints.pkl")
print ("reading")
df_copy = df.copy()
df = df[["Consumer complaint narrative", "Product"]] #keeping the relevant columns
df.shape
df.columns = ["X","y"]
df = df.dropna()
df = df.iloc[:2000]
print (df.shape)
print ("Building model")
all_text = df["X"]
all_text = pd.DataFrame(all_text)
all_text.columns = ["Text"]
all_text["Text"] = all_text['Text'].str.lower()
tfidf = TfidfVectorizer(stop_words="english")
tfidf.fit(all_text["Text"])
vector = tfidf.transform(all_text["Text"])
vector_values_array = vector.toarray()
labels = pd.DataFrame(df["y"])
labels.columns = ["labels"]
labels["labels"] = le.fit_transform(labels["labels"])
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split as tts
X = vector_values_array
y = labels["labels"]
X_train,X_test,y_train,y_test = tts(X,y,test_size=0.3,random_state=42)
# log_reg = LogisticRegression(random_state=42)
nb = MultinomialNB()
nb.fit(X_train,y_train)
y_pred = nb.predict(X_test)
print (accuracy_score(y_test,y_pred))

reading
(2000, 2)
Building model
0.415


One of the fundamental reasons the model isn't giving a good accuracy can be deduced from the classification report, where we see that the recall of multiple categories is 0. That means the data isn't balanced well enough for each category to carry comparable weight. A classification algorithm tends to predict the majority class unless the output categories are more or less equally balanced, so as it classifies more and more data, it wrongly assigns minority-class rows to the majority class, which reduces the overall accuracy.

A recap of the classification report of the NB classifier.

In [93]:
import pandas as pd
df = pd.read_pickle("Consumer_complaints.pkl")
print ("reading")
df_copy = df.copy()
df = df[["Consumer complaint narrative", "Product"]] #keeping the relevant columns
df.shape
df.columns = ["X","y"]
df = df.dropna()
df = df.iloc[:2000]
print (df.shape)
print ("Building model")
all_text = df["X"]
all_text = pd.DataFrame(all_text)
all_text.columns = ["Text"]
all_text["Text"] = all_text['Text'].str.lower()
tfidf = TfidfVectorizer(stop_words="english")
tfidf.fit(all_text["Text"])
vector = tfidf.transform(all_text["Text"])
vector_values_array = vector.toarray()
labels = pd.DataFrame(df["y"])
labels.columns = ["labels"]
labels["labels"] = le.fit_transform(labels["labels"])
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score  # classification_report is used at the end of this cell
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split as tts
X = vector_values_array
y = labels["labels"]
X_train,X_test,y_train,y_test = tts(X,y,test_size=0.3,random_state=42)
# log_reg = LogisticRegression(random_state=42)
nb = MultinomialNB()
nb.fit(X_train,y_train)
y_pred = nb.predict(X_test)
print (accuracy_score(y_test,y_pred))
print (classification_report(y_test,y_pred))

reading
(2000, 2)
Building model
0.415
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        14
           1       0.00      0.00      0.00        21
           2       0.00      0.00      0.00        12
           3       0.00      0.00      0.00        22
           4       0.00      0.00      0.00        42
           5       0.00      0.00      0.00        50
           6       0.36      1.00      0.53       190
           7       0.88      0.41      0.56       140
           8       0.00      0.00      0.00         9
           9       0.00      0.00      0.00         1
          10       1.00      0.02      0.03        58
          11       0.00      0.00      0.00         1
          12       0.00      0.00      0.00         6
          13       0.00      0.00      0.00         5
          14       0.00      0.00      0.00        21
          15       0.00      0.00      0.00         8

   micro avg       0.41      0.41      0.
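
As a rough check on the report above, the sketch below compares the model with a trivial baseline that always predicts the largest class. The per-class supports are copied from the report. The correct-prediction counts for classes 6, 7 and 10 are an assumption: they are back-calculated from the rounded recall values (1.00, 0.41 and 0.02) so that they add up to the reported accuracy of 0.415.

```python
# Test-set support per class, copied from the classification report.
support = {0: 14, 1: 21, 2: 12, 3: 22, 4: 42, 5: 50, 6: 190, 7: 140,
           8: 9, 9: 1, 10: 58, 11: 1, 12: 6, 13: 5, 14: 21, 15: 8}

# Correct predictions per class. ASSUMPTION: back-calculated from the rounded
# recall values in the report (1.00 * 190, ~0.41 * 140, ~0.02 * 58).
correct = {6: 190, 7: 58, 10: 1}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a "model" that always answers with the majority class.
baseline_accuracy = support[majority_class] / total
# Accuracy implied by the report.
model_accuracy = sum(correct.values()) / total
# Classes for which not a single test row was recovered (recall = 0).
zero_recall = [c for c in support if correct.get(c, 0) == 0]

print(f"test rows: {total}")                          # 600
print(f"majority class: {majority_class}")            # 6
print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 0.317
print(f"model accuracy: {model_accuracy:.3f}")        # 0.415
print(f"classes with zero recall: {len(zero_recall)} of {len(support)}")  # 13 of 16
```

Under these assumptions the always-predict-6 baseline already reaches about 0.317, so the model's 0.415 is only a modest improvement, and 13 of the 16 classes are never predicted correctly.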

In [94]:
y.value_counts()

6     669
7     469
10    206
4     128
5     123
14     96
1      80
3      64
0      37
2      33
15     31
8      27
12     21
9       7
13      5
11      4
Name: labels, dtype: int64

### As predicted, the dataset is heavily imbalanced: categories 6 and 7 are over-represented, while every other category accounts for roughly 10% of the rows or less.

We might have to oversample the under-represented categories. We will use the RandomOverSampler from the imbalanced-learn package (imported as imblearn) to do this.
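
Random oversampling duplicates rows of the smaller classes, drawn at random with replacement, until every class is as large as the biggest one. The following is a minimal standard-library sketch of that idea, not the imblearn implementation. The helper name `random_oversample` and the placeholder row ids are made up for illustration; the class counts are those printed by `y.value_counts()` above.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Keep every original row, then draw the extra copies with replacement.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Class counts from y.value_counts() in this section.
counts = {6: 669, 7: 469, 10: 206, 4: 128, 5: 123, 14: 96, 1: 80, 3: 64,
          0: 37, 2: 33, 15: 31, 8: 27, 12: 21, 9: 7, 13: 5, 11: 4}
labels = [label for label, n in counts.items() for _ in range(n)]
rows = list(range(len(labels)))  # placeholder ids standing in for TF-IDF rows

rows_ros, labels_ros = random_oversample(rows, labels)
balanced = Counter(labels_ros)
print(len(labels), "->", len(labels_ros))  # 2000 -> 10704
print(set(balanced.values()))              # {669}
```

With these counts every class ends up with 669 rows, so the 2,000 original rows grow to 16 × 669 = 10,704, many of them exact copies of one another.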

In [95]:
df.columns = ["X","y"]
df = df.dropna()
df = df.iloc[:2000]

In [96]:
!pip install imblearn


Collecting imblearn
  Downloading https://files.pythonhosted.org/packages/81/a7/4179e6ebfd654bd0eac0b9c06125b8b4c96a9d0a8ff9e9507eb2a26d2d7e/imblearn-0.0-py2.py3-none-any.whl
Collecting imbalanced-learn (from imblearn)
  Downloading https://files.pythonhosted.org/packages/e5/4c/7557e1c2e791bd43878f8c82065bddc5798252084f26ef44527c02262af1/imbalanced_learn-0.4.3-py3-none-any.whl (166kB)
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.4.3 imblearn-0.0


In [97]:
X = df["X"]
y = df["y"]

In [98]:
X = pd.DataFrame(X)

In [99]:
tfidf.fit(X["X"])
vector = tfidf.transform(X["X"])
vector_values = vector.toarray()

In [100]:
vector_values = pd.DataFrame(vector_values)

In [101]:
labels = pd.DataFrame(y)
labels.columns = ["labels"]
labels["labels"] = le.fit_transform(labels["labels"])

In [102]:
y = labels["labels"]

In [103]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(vector_values, y)  # fit_resample replaces the deprecated fit_sample

In [104]:
y_ros = pd.Series(y_ros)

In [105]:
X_train,X_test,y_train,y_test = tts(X_ros,y_ros,test_size = 0.3, random_state = 0)
nb = MultinomialNB()
nb.fit(X_train,y_train)
y_pred = nb.predict(X_test)
accuracy_score(y_test,y_pred)

0.9430261519302615

#### We managed a 94% accuracy with multinomial Naive Bayes, TF-IDF and oversampling.

Finally, we use an SVC with a linear kernel as our last algorithm and compare the results.
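
One caveat about this figure: the oversampling was applied before the train/test split, so copies of the same original row can land in both the training and the test set, and the reported accuracy probably overstates performance on unseen complaints. The sketch below illustrates the effect with placeholder row ids and the class counts from this section. It uses only the standard library, and the helper names are hypothetical. In the notebook itself, the analogous change would be to split first and then resample only the training portion.

```python
import random

def oversample(rows, labels, rng):
    """Randomly duplicate rows so that every class matches the largest one."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out = []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out.extend((row, label) for row in members + extra)
    return out

def split(pairs, rng, test_size=0.3):
    """Shuffle (row, label) pairs and split them into train and test lists."""
    pairs = list(pairs)
    rng.shuffle(pairs)
    n_test = int(round(len(pairs) * test_size))
    return pairs[n_test:], pairs[:n_test]

def leaked(train, test):
    """Count test rows whose original row id also occurs in the training set."""
    train_ids = {row for row, _ in train}
    return sum(1 for row, _ in test if row in train_ids)

# Class counts from y.value_counts() in this section.
counts = {6: 669, 7: 469, 10: 206, 4: 128, 5: 123, 14: 96, 1: 80, 3: 64,
          0: 37, 2: 33, 15: 31, 8: 27, 12: 21, 9: 7, 13: 5, 11: 4}
labels = [label for label, n in counts.items() for _ in range(n)]
rows = list(range(len(labels)))  # placeholder ids standing in for TF-IDF rows
rng = random.Random(0)

# Order used in the notebook: oversample first, then split.
train_a, test_a = split(oversample(rows, labels, rng), rng)

# Alternative: split first, then oversample the training part only.
train_b, test_b = split(zip(rows, labels), rng)
train_b = oversample([r for r, _ in train_b], [l for _, l in train_b], rng)

print("leaked test rows, oversample then split:", leaked(train_a, test_a))
print("leaked test rows, split then oversample:", leaked(train_b, test_b))
```

The second count is zero by construction. The first will be positive, because rows of the small classes are duplicated many times over before the split, so the classifier is partly evaluated on rows it has already seen.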

In [106]:
from sklearn.svm import SVC
X_train,X_test,y_train,y_test = tts(X_ros,y_ros,test_size = 0.3, random_state = 0)
svc = SVC(kernel="linear")
svc.fit(X_train,y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test,y_pred)

0.9713574097135741