# Introduction to Natural Language Processing

## Description

Natural Language Processing, usually abbreviated to NLP, refers to working with (or processing) **text data**, whether for machine learning or for any of the many other use cases that text supports. Working with text is very different from working with numerical or categorical data. We have worked extensively with numerical, categorical and boolean data; text, however, is a different paradigm altogether. This tutorial aims to get you acquainted with the basics of working with text and with what that implies for machine learning.

## Overview

- Introduction to the problem statement **Consumer Complaints Database**
- What is NLP (Introduction and usecases)
- Tokenization and Introduction to NLTK
- Vectorization and vector space models **Count Vectorizer**
- Applying our first classification algorithm **Logistic Regression**
- Stopwords
- Basic Stemming 
- TFIDF
- Naive Bayes Classifier
- Linear kernel SVM
- Text Classification (Build a text classifier using NLTK)


## Pre-requisite

- Python (along with NumPy and pandas libraries)
- Basic statistics (knowledge of central tendency)


## Learning Outcomes

- Understanding why working with text data isn't like numerical or categorical data
- What is NLP
- The basic building blocks of text
- Tokenization, stemming and what constitutes a stopword
- Preliminary cleaning of text data 

## Chapter 1: Introduction to text data

### Description: 
Until now, all of our problem statements had data in a numerical, categorical or Boolean format. In the real world, we will very often encounter text data as well. We will now try to understand how text analytics can be used to solve problems that involve text.

### 1.1 Introduction to the problem statement: <font color='green'> Categorize complaints into categories</font>

**What is the problem?**

Let us get started with the introduction to natural language processing by first looking at the problem that we are going to solve. We have a rich dataset of consumer complaints on various financial products and services. Each row in the dataset describes the complaint and the different features associated with it. In this concept, we'll first construct the features and then build a model that predicts the category into which the complaint falls. You can read more about this dataset [here](https://catalog.data.gov/dataset/consumer-complaint-database).

Along with the complaint narrative, the data contains the issue, the category of the complaint, the date it was received, the zip code, details of the customer placing the complaint and the current status of the complaint. The final idea is to build a model that assigns each customer's complaint to a product (17 categories in all, as listed below).

To understand how text processing works, we will work with only 2 columns of this dataset. Adding more features would likely make the model more accurate and more robust, but that is not the focus here.

**Brief explanation of the dataset & features**

- `Consumer Complaint Narrative`: This is a paragraph (or text) written by the customer explaining their complaint in detail. The data is of string type, consisting of text in the form of paragraphs.
- `Product`: This is the category we are to classify each complaint into. The 17 categories the complaints need to be categorized into are: 

'Mortgage', 'Student loan', 'Credit card or prepaid card', 'Credit card', 'Debt collection', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Bank account or service', 'Consumer Loan', 'Money transfers', 'Vehicle loan or lease', 'Money transfer, virtual currency, or money service', 'Checking or savings account', 'Payday loan', 'Payday loan, title loan, or personal loan', 'Other financial service', 'Prepaid card'

 
**What do we want as the outcome?**

We want to classify each complaint into its respective category, so that the complaint can be directed to the right vertical.




### Instructions

In this task you will load Consumer_complaints.csv into a dataframe using pandas and explore the column Consumer Complaint Narrative.

- Load the dataset from `'path'`(given) using the `read_csv()` method from pandas and store it in `'full_data'`. 

- Subset the dataframe  `'full_data'` to only include `"Consumer complaint narrative"` and  `"Product"` and store this dataframe subset in `'data'`

- Rename the column `"Consumer complaint narrative"` to `"X"` and `"Product"` to `"y"` by assigning `["X","y"]` to `data.columns` 

(The reason we have done this is obvious. We intend to classify the complaints in the narrative (X) to each of the categories listed in the Product (or the Y column))

- Print out the first value of the `"X"` column to take a look at it.

- Also print the unique values of column `'y'` to see the various categories that exist in the Product column (renamed to y)

In [1]:
import pandas as pd

path = "../data/new_complaints.csv"

# Load the dataset
full_data = pd.read_csv(path)

# Keep only the relevant columns; .copy() avoids pandas' SettingWithCopyWarning
# when this subset is modified later (for example by dropna(inplace=True))
data = full_data[["Consumer complaint narrative", "Product"]].copy()
data.columns = ["X", "y"]
data.head()

# Printing out the first non-empty value of the X column. Hence the second value, index is 1
print(data["X"][1])
print("\n")
print(list(data["y"].unique()))

When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.


['Mortgage', 'Student loan', 'Credit card or prepaid card', 'Credit card', 'Debt collection', 'Credit reporting', 'Credit reporting, credit repair services, or other personal consumer reports', 'Bank account or service', 'Consumer Loan', 'Money transfers', 'Vehicle loan or lease', 'Money transfer, virtual c

### 1.2 Motivation and Uses of NLP


At the end of the previous exercise we saw that the data contained in the consumer complaint narrative column is a **paragraph**.  

This is a typical example of text data - words combine to form sentences, and sentences combine to form paragraphs. The column `Consumer complaint narrative` has all rows in the form of either NaNs or text data. How do we make sense of this data?
Do we convert this to categorical format through one-hot encoding? If yes, how do we do it?
How do we convert the text data to a numerical format to make sure machine learning algorithms can be applied?
Can this column be used in a multinomial classification model to predict the class of the complaint?

All these questions (and more) can be answered through a branch of Artificial Intelligence. Enter **Natural Language Processing.**


Formally, Natural Language Processing (NLP) is defined as the application of computational techniques to the analysis and synthesis of text. The aim of NLP is to give computers the ability to perform tasks involving human language. In hands-on, engineering terms, it can broadly be described as "cleaning" and "transforming" text into a form fit for machine learning. Of course, you can also derive insights from text, just as with any EDA operation. But the inherent properties of text call for a fundamentally different approach. This leads us to an important question: why is text so hard to deal with that it requires a separate field of study?



<!--There are other off shoots of Natural Langugage such as NLG - Natural language Generation which aims to generate new text data based on prior data and Natural Language Understanding - NLU which is the backbone of all intelligent chatbots out there currently, which focuses on recognizing the intent of a conversation. For the sake of this tutorial and brevity we will stick to NLP. -->



#### Why is it difficult to work with text?

***

Comprehending Language is hard for computers. Some of the unique challenges of working with text are as follows: 

- **Synonymy** - This corresponds to different words having the same meaning. A similar intent can be conveyed in various ways, and this is one of the prime reasons why computers have a hard time deciphering the meaning or intent of statements. "The President of the United States has signed a new decree" and "POTUS has inked in a new law" convey essentially the same meaning. However, as the two sentences share very few words, computers have a hard time figuring out that the intent is the same.


- **Ambiguity** - "The bank deposit rate is quite high" and "He stood near the bank admiring the river". In these statements, the word `bank` has completely different meanings. In the first case it represents a financial institution, and in the second case it refers to land near the river. Disambiguating the meaning in sentences is quite challenging.  


- **Anaphora Resolution** - "George is my friend. He likes football". In the second statement `he` refers to George. It is difficult for computers to discern which person or entity the pronoun `he` refers to.  


- **Language related issues** - Every language has its own peculiarities. English text is made up of words, sentences, paragraphs and so on, with spaces between words and punctuation marking the end of a sentence. Thai, by contrast, is normally written without spaces between words, so even finding word boundaries is a task in itself. Grammar and morphology also differ widely between languages. This is one reason why Google Translate, or any other translation service, struggles to convert a piece of text perfectly from one language to another.


- **Out of Vocabulary problem** - Machines have a hard time adapting to new constructs that humans come up with. When we humans come across a word we haven't seen before, we might not understand its meaning instantly, but we can adapt: after seeing the word in several different sentences, we pick up its context and meaning. A model, in contrast, can typically only handle words that were present in the data it was built from, and does not adapt well to unseen ones.


- **Language generation** - While language understanding is hard, language generation has its own set of challenges. For chatbots to work effectively, they need to communicate in properly constructed, grammatically correct sentences. This is quite a hard problem and a challenge that needs to be overcome. 
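
The synonymy and ambiguity problems above can be made concrete with a minimal sketch that compares sentences purely by the tokens they contain. The helper names `tokens` and `jaccard` are made up for this illustration, and whitespace splitting is a deliberate simplification.

```python
def tokens(sentence):
    """Lowercase a sentence and split it on whitespace into a set of tokens."""
    return set(sentence.lower().split())


def jaccard(a, b):
    """Share of tokens two sentences have in common (0 = none, 1 = identical sets)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


# Synonymy: (almost) the same meaning, but very few shared tokens
s1 = "The President of the United States has signed a new decree"
s2 = "POTUS has inked in a new law"

# Ambiguity: different meanings, yet the token "bank" looks identical in both
b1 = "The bank deposit rate is quite high"
b2 = "He stood near the bank admiring the river"

synonymy_overlap = jaccard(s1, s2)
shared_ambiguous = tokens(b1) & tokens(b2)

print(round(synonymy_overlap, 2))
print(sorted(shared_ambiguous))
```

A purely token-based comparison rates the two synonymous sentences as mostly different, and it cannot tell the two senses of `bank` apart. That gap between surface form and meaning is what makes text hard.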


We now know that working with text is hard. But there are also exciting applications and use cases involved with working on text. We will now take a look at some of the use cases. 

#### Use cases of NLP

***

The use cases of NLP encompass almost anything you can do with language in relation to a problem. 

1) **Sentiment Analysis** - Finding if the text is leaning towards a positive or negative sentiment.

The process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral, is called Sentiment Analysis. The information available on the Internet is constantly growing, resulting in a large number of texts expressing opinions on review sites, forums, blogs and social media. Sentiment analysis is therefore a topic of great interest and development, since it has many practical applications. It is immensely useful for gauging the overall sentiment around products (Amazon), movies (Netflix), food (Yelp), etc. Its applications include Market Research, Social Monitoring, Customer Support and Product Analytics.


2) **Text Classification** - Categorizing text to various categories


Text classifiers can be used to organize, structure, and categorize almost any text data we have. For example, news articles can be organized by topic, chat conversations by language, and support tickets by urgency. Other examples of text classification include:

- Directing customer queries to the right vertical

- Detection of spam and non-spam emails

- Auto tagging of customer queries


3) **Document Summarization** - Compressing a paragraph/document into few words or sentences

Text summarization is the method of compressing a text document in order to create a summary of its major points. The idea of summarization is to find a subset of the data which contains the `information` of the entire set. Its applications include news summaries (Inshorts app), novel summaries, book summaries (Blinkist) etc. With attention spans declining overall, the need to convey information in as few words as possible has risen, and summarization helps solve this problem.

4) **Parts of Speech Tagging** - Figuring out the various nouns, adverbs, verbs etc in the text

Identifying part-of-speech tags is much more complicated than it looks. This is because a single word can take different part-of-speech tags in different sentences, depending on the context. This makes it impossible to have a fixed word-to-tag mapping. A few of its applications include:

- Text to speech conversion

- Word Sense Disambiguation (Teach machine to know the difference of the meaning of word 'bears' in "I saw a couple of bears" and "Hard work always bears fruit")


5) **Machine translation** - Translate text from one language to another

Machine Translation is the task of automatically translating one natural language into another while retaining the meaning of the original text. Translation from one language to another is complex because some of the words in the original language could have multiple meanings and these words could have different forms in the output language. 
Its most popular application is Google Translate and it is employed in devices like Google Home as well. Machine translation allows business transactions between partners in different countries without the need of a human interpreter. 

6) **Named Entity Recognition** - Identify the entities present in text

Named Entity Recognition deals with named entity mentions in text and categorizes these entities into person, organization, datetime reference etc. This is used a lot in the field of bioinformatics, molecular biology and other medical NLP applications. It also plays an important role in the overall field of Information Extraction where we try to extract knowledge from unstructured text. 

7) **Conversational AI** - Chat with a machine in natural language and get queries resolved

Conversational AI deals with creating an interface between machines and humans to converse in natural language. Such interfaces are known as chatbots. A user can interact with the machine in natural language, the same way they would usually communicate with a human. To scale customer support, organizations are increasingly adopting chatbots as the first point of contact for customer query resolution.


So for enabling all the NLP usecases, the first challenge is to convert the text into a form that the machine can understand. For that, we need to arrive at a fundamental component of text known as `tokens`. 

### 1.3 Tokenization


### Motivation for tokenization

We can see that, unlike the machine learning datasets we have worked with previously, this data isn't boolean, numeric or categorical. Usually a text is composed of paragraphs, paragraphs are composed of sentences, and sentences are composed of words. You could go deeper, down to letters, but individual letters carry no meaning on their own. It's only when they are combined into words that the text starts to make sense. Hence, it is better to work at the word level. 

Tokenization is the process of splitting the text into smaller parts called tokens. Tokens are the basic units of a particular dataset. The choice of tokens could be based on the application we are working on. 
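
To illustrate how the choice of token depends on the application, here is a small sketch that cuts the same sentence into characters, words and word pairs (bigrams). It uses only built-in string operations, and the example sentence is made up for this illustration.

```python
text = "Tokens are the basic units of text"

# Character-level tokens: every non-space character is a token
char_tokens = [ch for ch in text if not ch.isspace()]

# Word-level tokens: split on whitespace
word_tokens = text.split()

# Word bigrams: pairs of neighbouring words, useful when word order matters
bigram_tokens = list(zip(word_tokens, word_tokens[1:]))

print(len(char_tokens), char_tokens[:6])
print(len(word_tokens), word_tokens)
print(len(bigram_tokens), bigram_tokens[:2])
```

The finer the granularity, the more tokens there are and the less meaning each one carries on its own, which is why this tutorial works at the word level.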

#### Introduction to NLTK


The Natural Language Toolkit (NLTK) is one of the most widely used Python libraries for working with text. Most common text processing tasks can be done with it. It is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. On top of that, it is completely free and open-source, with a vibrant developer community supporting it. Let us now take the first step towards categorizing the consumer complaints by starting with tokenization.




#### Tokenizing with NLTK - The problem intuition 

***

We first need a way to convert the text to numbers, so that an algorithm can be applied to it. Think of sklearn, which requires all non-numeric data to be encoded (label or one-hot) before it enters the pipeline. 

Intuitively, it would make sense to divide each paragraph of text to its basic form (words) and then convert each of those words to numbers. We could assign a particular number to each word, in which case a sentence could look like a set of numbers to us, each number representing a particular word. 
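
As a rough sketch of that intuition, the snippet below assigns an integer to each distinct word in the order it first appears and then rewrites the sentence as a list of those integers. The sentence and the name `word_to_id` are illustrative choices, not part of the exercise.

```python
sentence = "i was told that i had a balance"
words = sentence.split()

# Give every word the next free integer the first time it is seen
word_to_id = {}
for word in words:
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id)

# The sentence as a sequence of numbers, one per word
encoded = [word_to_id[word] for word in words]

print(word_to_id)
print(encoded)
```

Note that these ids are arbitrary labels: a larger id does not mean a "bigger" word. That is one reason the count-based representation introduced in the vectorization chapter is usually preferred as model input.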

The first step towards achieving that is to break the text down into words. That's what tokenization aims to do. NLTK has built-in functions for tokenization, which we will use for our purpose. 

### Tokenize the first complaint into words

In this task you will assign a variable to the first row of the consumer complaints narrative column (X column) and break down the text into its constituent words.

### Instructions

- Drop nan values from the entire dataframe `data` using `"dropna()"` with `inplace=True`


- Save the first value of column `"X"`(consumer complaint narrative) in a variable called `'first_complaint'` 


- Break down `'first_complaint'` into words using `"split()"` function and store the result in a variable called `'bag_of_words_1'` 


- Break down `'first_complaint'` into tokens using `"word_tokenize()"` method(i.e. `"word_tokenize(first_complaint)"`) and store the result in a variable called `'bag_of_words_2'` 


- Compare both bag of words.


**Observation:**

We can see from both lists (the one using split and the one using word_tokenize) that the word_tokenize function is more robust: it splits the paragraph into words and separates the punctuation into separate tokens. In the case of split, full stops remain attached to certain words. 
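
The difference can be reproduced on a short made-up sentence. The regular expression below is only a rough stand-in for a punctuation-aware tokenizer and is not how NLTK's `word_tokenize` is implemented; it simply keeps runs of word characters together and emits every punctuation mark as its own token.

```python
import re

sample = "I paid the balance off. Was that not enough?"

# Plain whitespace split: punctuation stays glued to the neighbouring word
split_tokens = sample.split()

# Regex approximation: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(split_tokens)
print(regex_tokens)
```

With `split()`, `off.` and `off` would later be counted as two different words. Separating the punctuation avoids that.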

In [2]:
import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # run once if the tokenizer models are not installed

# Dropping nan values from dataframe
data.dropna(inplace=True)

# Storing the first complaint (first row of column "X")
first_complaint = data["X"].iloc[0]


# Printing the first complaint
print("\nFirst Complaint\n")
print(first_complaint)

# Using the split command
print("\nUsing the Split Command\n")
bag_of_words_1 = first_complaint.split(" ")
print(bag_of_words_1)

# Using the tokenize command
print("\nUsing tokenize\n")
bag_of_words_2 = word_tokenize(first_complaint)
print(bag_of_words_2)


First Complaint

When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.

Using the Split Command

['When', 'my', 'loan', 'was', 'switched', 'over', 'to', 'Navient', 'i', 'was', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balance', 'because', 'with', 'XXXX', 'i', 'did', 'not.', 'When', 'going', 'to', 'purchase', 'a', 'vehicle', 'i', 'discovered', 'my'



### 1.4 Sentence Tokenization


One could also tokenize a paragraph into constituent sentences. 

To split text into sentences, we must identify where each sentence ends and the next begins. For sentences ending with **!** or **?**, it is quite clear that these signal the end of a sentence, and consequently that the next word starts a new sentence. But **periods (.)** are ambiguous. This is because periods can be used for:
- Sentence boundary
- Abbreviations (ex: Dr. Mr. etc.)
- Numbers (0.24, -0.43)

So how do we identify the periods which actually signify sentence boundaries? Some of the common ways to address this issue are:
- Use handwritten rules. For example: if a period is followed by a capitalized word such as "An", treat it as a sentence boundary
- A classical way is to use regular expressions 
- Modern methods use machine learning models. For example, sequence modelling is a very popular choice for this use case.
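
A minimal sketch of the rule-based and regular-expression approaches could look like the following. The abbreviation list and the function name `split_sentences` are hypothetical and far from complete; NLTK's own sentence tokenizer is considerably more sophisticated.

```python
import re

# Hypothetical, deliberately tiny abbreviation list for this sketch
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "etc."}


def split_sentences(text):
    """Split at '.', '!' or '?' followed by whitespace and an uppercase letter,
    unless the word ending in '.' is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr." does not end a sentence
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = "Dr. Smith raised the rate to 0.24 percent. Was that fair? I think not!"
for sentence in split_sentences(example):
    print(sentence)
```

The period in `0.24` is skipped because no whitespace follows it, and `Dr.` is skipped because of the abbreviation list. Any abbreviation missing from the list would still cause a wrong split, which is the main weakness of handwritten rules.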

**The importance of converting words to lowercase** - It is common practice to convert all text to lowercase before further processing. The reason is that "Yorkshire" and "yorkshire", although the same word, would be treated as 2 separate words when the words are converted into numbers. Since both represent the same word, this would create redundant features, and many classifiers assume that features are independent of each other. Lowercasing the text up front avoids the issue.
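
The effect of casing on the number of distinct words can be checked with a short sketch; the word list is made up for illustration.

```python
from collections import Counter

words = ["Yorkshire", "yorkshire", "YORKSHIRE", "pudding"]

# Without lowercasing, every casing variant becomes its own entry
counts_raw = Counter(words)

# After lowercasing, the variants collapse into a single entry
counts_lower = Counter(word.lower() for word in words)

print(len(counts_raw), dict(counts_raw))
print(len(counts_lower), dict(counts_lower))
```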




### Tokenize the first complaint into sentences


### Instructions

- Break down `'first_complaint'` into sentences using the `"sent_tokenize()"` method (i.e. `"sent_tokenize(first_complaint)"`) from nltk and store the result in a list called `'list_of_sentences'`


- Convert `first_complaint` into lowercase using `"lower()"` method and store it in a variable called `'first_complaint_lower'`.

- Break down `'first_complaint_lower'` into tokens using `"word_tokenize()"` method and store the result in a variable called `'bag_of_words_lower'` 


In [3]:
# first_complaint is already loaded onto the workspace
from nltk.tokenize import sent_tokenize

# Sentence tokenizing
list_of_sentences = sent_tokenize(first_complaint)

# Print list of sentences
print("List of sentences\n", list_of_sentences)

# Lowering first complaint
first_complaint_lower = first_complaint.lower()

# Tokenizing first complaint lower
bag_of_words_lower = word_tokenize(first_complaint_lower)


print("\n",bag_of_words_lower)

List of sentences
 ['When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not.', 'When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX.', 'I have been faithful at paying my student loan.', 'I was told that Navient was the company i had delinquency with.', 'I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me.', 'I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus.', 'I have had so much trouble bringing my credit score back up.']

 ['when', 'my', 'loan', 'was', 'switched', 'over', 'to', 'navient', 'i', 'was', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balance', 'because', 'with', 'xxxx', 'i', 'did', 'not', '.', 'when', 'going', 'to', 'purchase', 'a', 'vehicle', 'i', 'discovered', '

### 1.5 Stemming and Lemmatization 


For grammatical reasons, documents use different forms of a word, such as discuss, discusses and discussing. In addition, there are families of derivationally related words with similar meanings, such as liberal, liberty, and liberalization.

Stemming and Lemmatization are Text Normalization (or sometimes called Word Normalization) techniques in the field of NLP  that are used to prepare text, words, and documents for further processing.

The goal of both the methods(stemming and lemmatization) is to reduce inflectional forms and derivationally related forms of a word to a common base form. 

For eg:

am, are, is $\Rightarrow$ be 

lion, lions, lion's, lions' $\Rightarrow$ lion

Let's look at them one by one.

**Stemming**

Stemming is the process of reducing each word in a sentence to its non-changing portion, the stem. 
A stem is created by stripping the suffixes or prefixes attached to a word, so stemming may produce strings that are not actual words.

For eg: Likes, liked, likely, unlike $\Rightarrow$ like

Many different algorithms have been defined for this process, each with its own set of rules. The popular ones include:


* Porter Stemmer (implemented in almost all languages)

* Paice Stemmer

* Lovins Stemmer

We won't be getting into their details. Feel free to explore their exact mechanisms.

Let's see its Python implementation:

```python

import nltk

text="Natural Language Processing is really fun and I want to study it more"
print("The words of text:",text,"\nis stemmed in the following way: ")

#Breaking the sentence to words
tokens=text.split()

#Defining Porter Stemmer object
porter = nltk.PorterStemmer()

#Applying the stemming
stem = [porter.stem(i) for i in tokens]
print(stem)
```

**Output:**

```python

The words of text: Natural Language Processing is really fun and I want to study it more 
is stemmed in the following way: 

['natur', 'languag', 'process', 'is', 'realli', 'fun', 'and', 'I', 'want', 'to', 'studi', 'it', 'more']

```
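
To give a feel for what rule-based suffix stripping looks like, here is a toy stemmer. It is a much-simplified, hypothetical rule set written for this illustration and is not the Porter algorithm, which applies several ordered phases of rules with extra conditions.

```python
# Hypothetical, much-simplified rules: (suffix, replacement), tried in order
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),  # protects words like "discuss" from losing an "s"
    ("s", ""),
    ("ing", ""),
    ("ed", ""),
]


def toy_stem(word, min_stem_length=3):
    """Apply the first matching suffix rule, provided enough of the word remains."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["discuss", "discusses", "discussing", "discussed", "studies", "liked", "is"]:
    print(w, "->", toy_stem(w))
```

All forms of `discuss` collapse to the same stem, while `studies` and `liked` turn into strings that are not dictionary words. That is the trade-off stemming makes for speed and simplicity.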

**Lemmatization:**

This method is a more refined way of reducing words, through the use of a vocabulary and morphological analysis of words. The aim is to always return the base form of a word, known as the `lemma`.

Consider the following words:

'Studied', 'Studies', 'Studying'

Stemming them with the Porter stemmer results in `studi`


Lemmatizing them as verb forms results in `study`

As can be seen, lemmatization is more complex than stemming because it requires words to be categorized by part of speech as well as by inflected form. 

In languages other than English, it can become quite complicated.

Let's see its Python implementation:

```python
from nltk.stem import WordNetLemmatizer


text = "Women in  technology are amazing at coding"
print("The words of text:",text,"\nis lemmatized in the following way: ")

tokens=text.lower().split()
lemma = WordNetLemmatizer()
lemma_result = [lemma.lemmatize(i) for i in tokens]
print(lemma_result)

```

**Output:**

```python
The words of text: Women in  technology are amazing at coding 
is lemmatized in the following way: 
['woman', 'in', 'technology', 'are', 'amazing', 'at', 'coding']

```

Compare that with stemming of the same sentence:

```python

"The words of text: Women in  technology are amazing at coding is stemmed in the following way:" 
['women', 'in', 'technolog', 'are', 'amaz', 'at', 'code']

```
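
Why part of speech matters for lemmatization can be sketched with a toy lookup table. The lexicon below is hypothetical and tiny; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults WordNet instead, and its `lemmatize` method takes a `pos` argument that defaults to noun.

```python
# Hypothetical miniature lexicon: (word, part of speech) -> lemma
TOY_LEXICON = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("are", "v"): "be",
    ("is", "v"): "be",
    ("women", "n"): "woman",
    ("coding", "v"): "code",
}


def toy_lemmatize(word, pos="n"):
    """Look up the (word, pos) pair; fall back to the word itself if unknown."""
    word = word.lower()
    return TOY_LEXICON.get((word, pos), word)


print(toy_lemmatize("leaves", pos="n"))
print(toy_lemmatize("leaves", pos="v"))
print(toy_lemmatize("are", pos="n"))
print(toy_lemmatize("are", pos="v"))
```

The same surface form maps to different lemmas depending on its part of speech. Treating every word as a noun is also why `are`, `amazing` and `coding` come back unchanged in the lemmatization output above.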


## Chapter 2: Vectorization

Description: In this chapter, we will talk about how to convert text into a numeric form that can be used as input to a machine learning algorithm.

### 2.1 Basic Vectorization


Till now, we have arrived at the constituent elements of text. Now the next question to be answered is -  How do we convert this text data to a form fit for machine learning? The usual ways of working with numerical or categorical data will not work here, as the data type is completely different, and the algorithm has to make sense of the written data. 


**Bag of words:**

The problem with modeling text is that there are no well-defined, fixed-length inputs.

A bag of words model is a way of extracting features from text for use in modeling. In this approach, we use the tokenized words for each observation and find out the frequency of each token.

Let's take an example to understand it.

Consider the following sentences:

- "Hope is a good thing"
- "Maybe the best thing"
- "No good thing ever dies"

We will treat each sentence as a different document and make a list of all unique words from the three documents. We get:

`"hope", "is", "a", "good", "thing", "maybe", "the", "best", "no", "ever", "dies"`

Next, we try to create vectors from it.


Taking the first document, "Hope is a good thing", we check the frequency of each of the 11 unique words:

- "hope" - 1
- "is" - 1
- "a" - 1
- "good" - 1
- "thing" - 1
- "maybe" - 0
- "the" - 0
- "best" - 0
- "no" - 0
- "ever" - 0
- "dies" - 0

This is how each document looks as a vector:

"Hope is a good thing"  - [1,1,1,1,1,0,0,0,0,0,0]

"Maybe the best thing" - [0,0,0,0,1,1,1,1,0,0,0]

"No good thing ever dies" - [0,0,0,1,1,0,0,0,1,1,1]


**This process of converting text data to numbers is called vectorization**
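
The bag-of-words construction for the three example sentences can be sketched in a few lines of plain Python. Ordering the vocabulary by first appearance and tokenizing by whitespace are choices made for this illustration only.

```python
documents = [
    "Hope is a good thing",
    "Maybe the best thing",
    "No good thing ever dies",
]

tokenized = [doc.lower().split() for doc in documents]

# Vocabulary: every unique word, in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One count per vocabulary word, per document
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for doc, vector in zip(documents, vectors):
    print(doc, "->", vector)
```

Every document ends up as a vector of the same length (one entry per vocabulary word), regardless of how many words the document itself contains.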


There are multiple methods to convert words to numbers. 

We will start by discussing the Count Vectorizer. 

### The count vectorizer

***

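Count Vectorizer works on term frequency: it builds a sparse matrix of documents x tokens, where each entry is the number of times a token occurs in a document.

For example, in the dataset we are working with, each entry of the X column can be reduced to a list of lowercased words, as we did for the first complaint.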

The idea now is to convert the X column to numbers. 

**One way to do that would be to represent every word as a key value pair in the form of a dictionary, where the key would be the word and the value would be the number of times that word has appeared in the list.** 

![](../images/img2.png)

This method of converting a list of words into a numeric format by counting the words is called count vectorization. 

We will initially do this manually, and then exploit sklearn to do this automatically to understand the intuition behind this. 



### Convert the first complaint to numbers using the counts of words in the form of a dictionary

In this task you will implement your own code for count vectorization

### Instructions

- We have already created `'bag_of_words_lower'`.


- Let's try to create a dictionary so that the keys are the words themselves and the values are the number of times the word has appeared in the list `'bag_of_words_lower'`.


- Pass `'bag_of_words_lower'` to the `"Counter()"` method and store the result in a variable called `'count_vectorizer'`


In [4]:
from collections import Counter

# Creating an object of count vectorizer
count_vectorizer = Counter(bag_of_words_lower)

print(count_vectorizer)

Counter({'i': 11, 'the': 8, '.': 7, 'to': 5, 'was': 5, 'and': 5, 'credit': 4, 'had': 4, 'my': 4, 'with': 3, 'that': 3, 'told': 3, 'xxxx': 3, 'navient': 3, 'balance': 2, 'score': 2, 'when': 2, 'delinquency': 2, 'been': 2, 'a': 2, 'bureaus': 2, 'have': 2, 'just': 2, 'loan': 2, 'so': 2, 'kept': 1, 'bringing': 1, 'because': 1, 'never': 1, 'then': 1, 'from': 1, 'company': 1, 'did': 1, 'they': 1, 'student': 1, 'up': 1, 'much': 1, 'angry': 1, 'hurried': 1, 'help': 1, 'paying': 1, 'into': 1, 'me': 1, 'purchase': 1, 'after': 1, 'discovered': 1, 'not': 1, 'going': 1, 'faithful': 1, 'vehicle': 1, 'dispute': 1, 'trouble': 1, 'off': 1, 'expalin': 1, 'contacted': 1, 'dropped': 1, 'you': 1, 'issue': 1, 'being': 1, 'this': 1, 'over': 1, 'could': 1, 'deliquint': 1, 'switched': 1, 'situation': 1, 'back': 1, 'contact': 1, 'resolve': 1, 'maybe': 1, 'paid': 1, 'at': 1, 'tried': 1})


### 2.2 Vectorization using sklearn

***
We did manage to convert our list of words to numbers. However, the problem still remains unresolved. 

**How do we apply our algorithm to this?**

**Could we convert every word to a feature (or column) and the count associated with it to its value, and then apply a classification algorithm to it? Something like below?**

<img src="../images/img2.png">

**This looks very similar to one-hot encoding and is a typical method of applying ML to text data.** 

It is similar to one-hot encoding in that, as we add the second row, the third row and subsequent rows, the number of features (columns) grows as more and more words come in. Some words will not appear in, say, `first_complaint`; the vectorizer automatically assigns 0 to those words for that row. **Hence the number of features will be equal to the total number of unique words in all the complaints combined and the values for those features will be the count of those words in that particular complaint.**

A normal classification algorithm can now be applied where X is all the features except the Product column and y is the Product column. 
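
The following sketch builds such a table by hand for three hypothetical toy complaints (they are not from the dataset), using naive whitespace tokenization. It is only meant to show how the columns come from the combined vocabulary and how missing words become 0.

```python
from collections import Counter

# Hypothetical toy complaints (not from the dataset)
docs = [
    "my loan payment was late",
    "my credit card was declined",
    "loan and credit card fees",
]

# Tokenize naively on whitespace; a real pipeline would use a proper tokenizer
tokenized = [doc.lower().split() for doc in docs]

# One column per unique word across all documents, in sorted order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document; Counter returns 0 for words missing from a document
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length as `vocabulary`, regardless of how many words the individual document contains.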

We could also use sklearn's `CountVectorizer` class and convert all the text into numbers in a single step. 

Let's do it for the first row

```python
from sklearn.feature_extraction.text import CountVectorizer

#Initialising a CountVectorizer object
cv = CountVectorizer()

#Storing the first row in Text
txt = [data["X"].iloc[0]]

#Printing the first row
print ("\nFirst Row:\n",txt)

#Fitting the CountVectorizer object
cv.fit(txt)

#Transforming the first row
vector = cv.transform(txt)


print ("\nVector Shape:\n", vector.shape)

#Storing the values of vector in array format
vector_values = vector.toarray()

print("\nVector Values:\n",vector_values)

print("These are the counts of the 69 unique words in our first complaint.")

print ("\nCount Vectorizer Vocabulary:\n",cv.vocabulary_) 

```

**Output:**
```python

First Row:
['When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.']

Vector Shape:
(1, 69)

Vector Values:
[[1 5 1 1 1 2 1 2 1 1 2 1 1 1 1 4 2 1 1 1 1 1 1 1 1 1 4 2 1 1 1 1 2 1 2 1
  1 1 4 3 1 1 1 1 1 1 1 1 2 1 2 1 1 3 8 1 1 1 5 3 1 1 1 1 5 2 3 3 1]]
These are the counts of the 69 unique words in our first complaint.

Count Vectorizer Vocabulary:
{'when': 65, 'my': 38, 'loan': 34, 'was': 64, 'switched': 52, 'over': 43, 'to': 58, 'navient': 39, 'never': 40, 'told': 59, 'that': 53, 'had': 26, 'deliquint': 17, 'balance': 5, 'because': 6, 'with': 66, 'xxxx': 67, 'did': 18, 'not': 41, 'going': 25, 'purchase': 46, 'vehicle': 63, 'discovered': 19, 'credit': 15, 'score': 48, 'been': 7, 'dropped': 21, 'from': 24, 'the': 54, 'into': 30, 'have': 27, 'faithful': 23, 'at': 3, 'paying': 45, 'student': 51, 'company': 11, 'delinquency': 16, 'contacted': 13, 'resolve': 47, 'this': 57, 'issue': 31, 'you': 68, 'and': 1, 'kept': 33, 'being': 8, 'just': 32, 'contact': 12, 'bureaus': 10, 'expalin': 22, 'situation': 49, 'maybe': 35, 'they': 56, 'could': 14, 'help': 28, 'me': 36, 'so': 50, 'angry': 2, 'hurried': 29, 'paid': 44, 'off': 42, 'then': 55, 'after': 0, 'tried': 60, 'dispute': 20, 'much': 37, 'trouble': 61, 'bringing': 9, 'back': 4, 'up': 62}

```

The vocabulary only specifies the column index of each word, not its count.

Comparing the vector values with the vocabulary lets us identify the count of each word.

For example, the word at index 1 is `and`, and the value at index 1 of the vector is 5. That means the word `and` appears 5 times. 

```python

#Converting the vector values to a list
vector_values = vector_values.tolist()[0]
print (vector_values)


print ("count value of the word at index 22")
print (vector_values[22]) 

print ("count value of the word at index 34, the word is 'loan'")
print (vector_values[34]) 

```

**Output:**
```python
[1, 5, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 4, 3, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 3, 8, 1, 1, 1, 5, 3, 1, 1, 1, 1, 5, 2, 3, 3, 1]

count value of the word at index 22
1

count value of the word at index 34, the word is 'loan'
2
```
***

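Looking at the `cv.vocabulary_` dictionary, we see that the word at index 34 is 'loan', and its value in the vector values list is 2. Likewise, the word at index 22 is 'expalin' (a misspelling in the original complaint), which appears once.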
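
Instead of matching indexes by eye, the lookup can be done in code. This is a minimal sketch on a hypothetical toy sentence; `toy_cv`, `toy_counts` and `word_to_count` are illustrative names, not part of the tutorial's code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentence (not from the dataset)
toy_text = ["my loan and my credit card loan"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_text).toarray()[0]

# vocabulary_ maps word -> column index, so index the count row with it
loan_index = toy_cv.vocabulary_["loan"]
print("Column index of 'loan':", loan_index)
print("Count of 'loan':", toy_counts[loan_index])

# Pair every word with its count in one step
word_to_count = {word: int(toy_counts[idx]) for word, idx in toy_cv.vocabulary_.items()}
print(word_to_count)
```

The same pattern should work on the `cv` and `vector_values` objects above, for example `vector_values[cv.vocabulary_["loan"]]`.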

<img src="../images/img2.png">

### 2.3 Count vectorization on a dataframe

We have just seen how we can use the count vectorizer on a text snippet. Most of the time, however, the data is present in a dataframe. Let's try implementing the count vectorizer on a dataframe. 

We will use the count vectorizer to transform the X column of the dataframe which corresponds to the text paragraph and make a new dataframe with these features and the product column. 

We will consider just the top 3 rows of the entire dataframe to aid better understanding and run it over the entire dataframe later.

```python
#Importing count vectorizer from sklearn
from sklearn.feature_extraction.text import CountVectorizer

#Initializing the Count Vectorizer object
cv = CountVectorizer()


#Creating a dataframe "all text" with the first 3 rows of data
all_text = data["X"][:3]
all_text = pd.DataFrame(all_text)

#Renaming the column for that dataframe (has only one column) to "Text"
all_text.columns = ["Text"]

#Converting to lower case
all_text["Text"] = all_text['Text'].str.lower()

#Fitting the Count vectorizer all text
cv.fit(all_text["Text"])

#Transforming "Text"
vector = cv.transform(all_text["Text"])

#Converting the sparse matrix to a dense array
vector_values_array = vector.toarray()

#Converting the array to a list of lists (one list per row)
vector_values_list = vector_values_array.tolist()

#Length of vector value list
print ("No of rows of vector value list\n", len(vector_values_list)) 


print("\nThe first row of vector_values_list\n",vector_values_list[0])

print ("\nNo of words in first row\n",len(vector_values_list[0]))

```

**Output:**

```python
No of rows of vector value list:
3

The first row of vector_values_list:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 5, 1, 0, 0, 0, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 4, 0, 0, 0, 2, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 4, 2, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 4, 3, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 3, 8, 0, 0, 1, 1, 0, 1, 0, 5, 3, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 3, 0, 0, 1, 0]

No of words in first row:
214
```

214 is the number of unique words in all 3 rows combined. Every element of `vector_values_list` has this same length, because every unique word in the 3 rows has become a feature. The value of each feature in a row is the count of that word in that row, or 0 if the word does not occur in it. 


We can confirm that with the second row:

***
```python
print("\nThe second row of vector_values_list\n",vector_values_list[1])

print ("\nNo of words in second row\n",len(vector_values_list[1]))

```

**Output:**

```python
The second row of vector_values_list
[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

No of words in second row
214
```
***

<!--```python
print("\nThe third row of vector_values_list\n",vector_values_list[2])

print ("\nNo of words in third row\n",len(vector_values_list[2]))

```

**Output:**

```python
The third row of vector_values_list
 [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 3, 5, 0, 1, 1, 1, 0, 2, 0, 1, 4, 1, 1, 0, 1, 0, 3, 1, 0, 0, 3, 1, 2, 1, 1, 1, 1, 2, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 0, 1, 2, 2, 0, 0, 0, 1, 0, 0, 4, 1, 2, 0, 1, 1, 0, 0, 1, 1, 2, 1, 1, 2, 1, 2, 1, 0, 1, 0, 3, 0, 1, 3, 1, 0, 4, 1, 1, 1, 1, 3, 0, 3, 0, 0, 1, 2, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 6, 1, 3, 0, 0, 1, 1, 4, 2, 3, 1, 3, 1, 4, 0, 1, 1, 1, 0, 0, 3, 0, 5, 2, 1, 2, 1, 1, 2, 1, 0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 2, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 4, 6, 3, 0, 1, 7, 1, 8, 0, 9, 1, 0, 0, 2, 1, 2, 2, 2, 1, 1, 0, 1, 6, 1, 3, 3, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 5, 3]

No of words in third row
 214
```-->

As you can see, every document has been converted to a fixed-length vector of 214 values, each corresponding to the number of occurrences of a word in that particular document.

Let's now get the y values for these 3 rows

```python

#Creating a dataframe for target
labels = pd.DataFrame(data["y"][:3])
labels.columns = ["labels"]
print(labels)
```

**Output:**

```python
                        labels
1                 "Student loan"
2  "Credit card or prepaid card"
7                     "Mortgage"


```

We will have to label encode these categories to convert them to numbers. 

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
labels["labels"] = le.fit_transform(labels["labels"])
print (labels)

```
**Output:**

```python
    labels
1       2
2       0
7       1

```
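
To see which integer stands for which category, the fitted encoder can be inspected. The sketch below re-creates the three labels by hand for illustration (`products`, `encoder`, `encoded` and `decoded` are hypothetical names); `LabelEncoder` assigns codes according to the sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

# The three product labels shown above, typed in by hand for illustration
products = ["Student loan", "Credit card or prepaid card", "Mortgage"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(products)

# classes_ is sorted alphabetically; a class's position is its integer code
print(list(encoder.classes_))
print(list(encoded))

# inverse_transform maps integer codes back to the original names
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

This is why "Student loan" became 2, "Credit card or prepaid card" became 0 and "Mortgage" became 1 in the output above.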

# Vectorising Dataset

In this task we will try to implement the vectorisation on all rows and implement a logistic regression model on the vectorised dataframe

### Data Vectorisation

**For X**
- Store the `"X"` column of dataframe `'data'` in a new dataframe `'all_text'` (i.e. instead of `data["X"]` use `data[["X"]]`)

- Convert the values of the `X` column of `all_text` to lowercase using the `"str.lower()"` method.

- Initialise a `"CountVectorizer()"` object and store it in `'cv'`

- Apply the `"fit_transform()"` method of `'cv'` on `X` column and store the result in `'vector'`

- Convert `'vector'` into an array using `"toarray()"` method and store the result in a new variable `'X'`

**For y**

- Store the `"y"` column of dataframe `'data'` in a new dataframe `'labels'` (i.e. instead of `data["y"]` use `data[["y"]]`)

- We need to label encode the values. Therefore initialise a `"LabelEncoder()"` object and store it in `'le'`

- Use the `fit_transform` method of `'le'` on column `"y"` of `'labels'` and store the results back in `y` column

***
If we now include all 335 rows, vectorize them to get X, and take the corresponding labels as y, we can train a classification algorithm on them and measure its accuracy. 

***

### Model building


- Split `'X'` and `'labels["y"]'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.4` and `random_state = 42`


- Initialise a logistic regression model with `LogisticRegression()` having `random_state=42` and save it to a variable called `'log_reg'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method.


- Find out the accuracy score between `X_test` and `'y_test'` using the `'score()'` method and save it in a variable called `'acc'`


```python
#Running the exact same code as the earlier one on 335 rows
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder


#Subsetting 'X' (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
all_text = data[["X"]].copy()

#Converting 'X' to lower case
all_text["X"] = all_text['X'].str.lower()

#Initialising a count vectorizer object
cv = CountVectorizer()

#Creating the count vectorizer of our 'X' column
vector = cv.fit_transform(all_text["X"])

#Converting the count vectoriser to array
X = vector.toarray()

#Subsetting y (again as a copy, for the same reason)
labels = data[["y"]].copy()

#Initialising a label encoder object
le = LabelEncoder()

#Label encoding 'y' column
labels["y"] = le.fit_transform(labels["y"])

#Splitting the dataset into train and test
X_train,X_test,y_train,y_test = train_test_split(X,labels["y"],test_size=0.4,random_state=42)

#Initialising Logistic Regression model
log_reg = LogisticRegression(random_state=42)

#Fitting the model on train data
log_reg.fit(X_train,y_train)

#Finding the accuracy score on test data
acc = log_reg.score(X_test,y_test)
print (acc)
```

**Output:**

```python
0.4925373134328358
```




That's an accuracy of about 49% in predicting the product category, which isn't really a good number. A look at the precision and recall shows that the product categories are heavily imbalanced, so a few categories have not been detected by the algorithm at all. Before we rectify this, it is worth looking at the distribution of the y column.
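
A quick way to inspect a label distribution is to count each class and compare it against a majority-class baseline. The sketch below uses hypothetical toy labels (`toy_labels`), not the real `y` column, so the numbers are purely illustrative.

```python
from collections import Counter

# Hypothetical labels standing in for the dataset's "y" column
toy_labels = ["Mortgage"] * 6 + ["Student loan"] * 3 + ["Credit card or prepaid card"] * 1

label_counts = Counter(toy_labels)
total = len(toy_labels)

# Share of each class, most frequent first
for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / total:.0%})")

# Accuracy of always predicting the most frequent class, a useful baseline
majority_baseline = label_counts.most_common(1)[0][1] / total
print("Majority-class baseline accuracy:", majority_baseline)
```

On the actual dataframe, `data["y"].value_counts()` gives the same kind of summary. If a model's accuracy is close to the majority-class baseline, it may be learning little beyond the class frequencies.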

### 2.4 Removing Stopwords

In the previous task, we have seen 49% accuracy of predicting the product category. Now the question we need to ask is - can we improve the accuracy further? The answer lies in dealing with stopwords. 




**Stop Words**

It is important to realize that a count vectorizer is essentially a ranking scheme, in that it gives a higher weight to words which have appeared more often. In other words, each key-value pair is nothing but a word and its count in the document. It does not assign any importance to the order in which the words have appeared in the sentences. 
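
The loss of word order is easy to demonstrate. In this small sketch, two hypothetical sentences with opposite meanings produce identical counts.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
sentence_a = "the bank charged the customer"
sentence_b = "the customer charged the bank"

counts_a = Counter(sentence_a.split())
counts_b = Counter(sentence_b.split())

print(counts_a)
# The two bags of words are indistinguishable
print(counts_a == counts_b)
```

A classifier trained on counts alone therefore cannot tell these two sentences apart.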

Another point to consider is that words like "a", "an", "the" etc. will appear more often than the rest of the words, as they are common articles. Using a count vectorizer out of the box on a paragraph or a body of text will invariably give the highest counts to these common words, so the words we are actually interested in will be buried underneath them. One way to rectify this is to remove these commonly occurring words. NLTK offers this functionality and refers to these words as "stopwords", i.e. words we wouldn't include in our bag of words. We will have to remove the punctuation as well, since our initial bag of words also contained punctuation such as full stops, and we do not want these symbols to be vectorized. 

To see the list of stopwords NLTK currently includes, we can run 

```python
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
print (set(stopwords.words('english')))

```
**Output:**

```python
{'their', 'yourselves', 'yours', 'ours', "isn't", 'ourselves', 'off', 'herself', 'each', 'hadn', "don't", 'and', 'haven', 'be', 'how', 'won', 'more', 'll', "needn't", 'hers', 'then', 'in', 'the', 'shouldn', 'very', 'doing', "you'll", 'theirs', 'than', 'will', 'under', 'when', "hadn't", 'few', 'isn', 'not', 'was', 'has', 'here', 'any', 'just', "didn't", "shouldn't", 'd', 'ma', "haven't", 'that', "doesn't", 'your', 'what', 'him', 'out', 'being', 'at', 'into', 'some', 'doesn', 'such', 'his', 'he', 'which', 've', 'about', 'up', 'during', 'they', "she's", 'myself', 'having', 'm', "it's", 'aren', 'this', "shan't", 'should', 'have', 'these', 'because', "weren't", "won't", 'mightn', 'down', "wasn't", 'our', 'to', "mustn't", 'but', 'are', 'had', 'y', 'same', 'shan', 'we', 'too', 'couldn', 'other', 'after', 'its', 'me', 'why', 'own', 'whom', 'for', "wouldn't", 'before', "you're", 'do', 'a', "mightn't", 'above', 'wouldn', 'who', 'no', 'of', 'an', 'nor', 'there', 'can', 'between', 'were', 'where', 'now', 'through', 'below', "you've", 'o', 'or', 'again', 'don', 'did', 'her', 'from', "that'll", 're', 'my', 'himself', 'further', 'wasn', 'with', 'until', 't', 'them', 'if', 'it', "hasn't", 'been', 'so', 'while', 'hasn', 'itself', 'is', 'weren', 'those', 'themselves', 'am', "couldn't", 'by', 'on', 'she', 'over', 'most', 'does', 'only', 'i', 'ain', 'as', 'once', 'all', 'mustn', 'you', 's', 'didn', 'needn', "should've", "aren't", 'against', 'both', "you'd", 'yourself'}
```
The punctuation list can be derived as follows. 

```python
from string import punctuation
print (list(punctuation))
```

**Output:**

```python
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']

```
We can also add our own list of stop words we want to remove from our body of text. Different domains can have different stopwords. For example, if we are classifying medical articles into different subdomains like orthopedics and neurology, then the word `medicine` would be a stopword for our case. So we can add `medicine` to the set of stopwords in the following manner: 

```python
custom_set_of_stopwords = set(stopwords.words('english')+list(punctuation)+["medicine"])
                             
```
This will include the word `medicine` as a stop word and remove the same from our body of text before vectorizing it. 

You can verify this as follows:
```python
print ("medicine" in custom_set_of_stopwords)
```

**Output:**
```python
True
```
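
A custom stop list can also be handed directly to sklearn's `CountVectorizer` through its `stop_words` parameter, which accepts a list of words. The sketch below uses hypothetical toy documents and a hand-picked list (`toy_docs`, `toy_stopwords`) rather than the NLTK set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents and a hand-picked stop list for illustration
toy_docs = ["The medicine helped the knee", "Medicine for the nerve"]
toy_stopwords = ["the", "for", "medicine"]

# stop_words accepts a list of words to drop; text is lowercased by default
toy_cv = CountVectorizer(stop_words=toy_stopwords)
toy_matrix = toy_cv.fit_transform(toy_docs).toarray()

# Only the words outside the stop list remain as features
remaining_words = sorted(toy_cv.vocabulary_)
print(remaining_words)
print(toy_matrix)
```

Note that `CountVectorizer`'s default tokenizer already ignores punctuation and single-character tokens, so punctuation entries are not needed in a stop list passed this way.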

If we redo the prediction exercise after removing the stopwords, we would expect the accuracy to improve, since we are no longer giving weight to words that carry little meaning, only to words which actually matter. 

**Python Implementation of Stopwords**


```python
from nltk.tokenize import word_tokenize

#Storing the first complaint
first_complaint = data["X"].iloc[0]

print("\nFirst Complaint:\n",first_complaint)

#Tokenizing the complaint into a bag of words
bag_of_words = word_tokenize(first_complaint)

print ("\nBag of words of first complaint:\n",bag_of_words)
print("\nLen of bag of words:\n",len(bag_of_words))

#Lowercasing each token and removing stopwords and punctuation
bow_stopwords_removed = [x.lower() for x in bag_of_words if x.lower() not in custom_set_of_stopwords]

print ("\nBag of words with stopwords removed:\n",bow_stopwords_removed)

print("Len of bag of words with stopwords removed:\n",len(bow_stopwords_removed))

```
**Output:**

```python

First Complaint:
When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.

Bag of words of first complaint:
['When', 'my', 'loan', 'was', 'switched', 'over', 'to', 'Navient', 'i', 'was', 'never', 'told', 'that', 'i', 'had', 'a', 'deliquint', 'balance', 'because', 'with', 'XXXX', 'i', 'did', 'not', '.', 'When', 'going', 'to', 'purchase', 'a', 'vehicle', 'i', 'discovered', 'my', 'credit', 'score', 'had', 'been', 'dropped', 'from', 'the', 'XXXX', 'into', 'the', 'XXXX', '.', 'I', 'have', 'been', 'faithful', 'at', 'paying', 'my', 'student', 'loan', '.', 'I', 'was', 'told', 'that', 'Navient', 'was', 'the', 'company', 'i', 'had', 'delinquency', 'with', '.', 'I', 'contacted', 'Navient', 'to', 'resolve', 'this', 'issue', 'you', 'and', 'kept', 'being', 'told', 'to', 'just', 'contact', 'the', 'credit', 'bureaus', 'and', 'expalin', 'the', 'situation', 'and', 'maybe', 'they', 'could', 'help', 'me', '.', 'I', 'was', 'so', 'angry', 'that', 'i', 'just', 'hurried', 'and', 'paid', 'the', 'balance', 'off', 'and', 'then', 'after', 'tried', 'to', 'dispute', 'the', 'delinquency', 'with', 'the', 'credit', 'bureaus', '.', 'I', 'have', 'had', 'so', 'much', 'trouble', 'bringing', 'my', 'credit', 'score', 'back', 'up', '.']

Len of bag of words:
137

Bag of words with stopwords removed:
['loan', 'switched', 'navient', 'never', 'told', 'deliquint', 'balance', 'xxxx', 'going', 'purchase', 'vehicle', 'discovered', 'credit', 'score', 'dropped', 'xxxx', 'xxxx', 'faithful', 'paying', 'student', 'loan', 'told', 'navient', 'company', 'delinquency', 'contacted', 'navient', 'resolve', 'issue', 'kept', 'told', 'contact', 'credit', 'bureaus', 'expalin', 'situation', 'maybe', 'could', 'help', 'angry', 'hurried', 'paid', 'balance', 'tried', 'dispute', 'delinquency', 'credit', 'bureaus', 'much', 'trouble', 'bringing', 'credit', 'score', 'back']

Len of bag of words with stopwords removed:
54

```

# Stopword removal

Let's remove the stopwords from our entire dataset

### Stopword handling
- Initialise a `"CountVectorizer()"` object with parameter `stop_words= "english"` and store it in `'cv_stop'`

- Apply the `"fit_transform()"` method of `'cv_stop'` on the `X` column of `all_text` and store the result in `'vector_stop'`

- Convert `'vector_stop'` into an array using `"toarray()"` method and store the result in a new variable `'X_stop'`

### Model building

- Split `'X_stop'` and `'labels["y"]'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.4` and `random_state = 42`


- Initialise a logistic regression model with `LogisticRegression()` having `random_state=42` and save it to a variable called `'log_reg'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method.


- Find out the accuracy score between `X_test` and `'y_test'` using the `'score()'` method and save it in a variable called `'stop_acc'`


```python
#Initialising the count vectorizer with stop words parameter
cv_stop = CountVectorizer(stop_words="english")

#Creating the count vectorizer of our 'X' column
vector_stop = cv_stop.fit_transform(all_text["X"])

#Converting the count vectoriser to array
X_stop = vector_stop.toarray()

#Splitting the data to train and test
X_train,X_test,y_train,y_test = train_test_split(X_stop,labels["y"],test_size=0.4,random_state=42)

#Initialising a logistic regression model
log_reg = LogisticRegression(random_state=42)

#Fitting the model on train
log_reg.fit(X_train,y_train)

#Finding the accuracy score on test data
stop_acc = log_reg.score(X_test,y_test)
print (stop_acc)
```

**Output:**

```python
0.5597014925373134
```




## Chapter 3: Advanced vectorization with TF-IDF 

Description: In this chapter, we will talk about how to convert text into tf-idf vectors for better classification.

### 3.1 Introduction to TF-IDF

In the last chapter we saw how text was converted to numbers using a count vectorizer. 

In other words, a count vectorizer counts the occurrences of the words in a document, and all the documents are considered independent of each other. This is very similar to one-hot encoding or the pandas `get_dummies` function. Even when multiple documents are involved, the count vectorizer does not assume any interdependence between them and treats each document as a separate entity. 

It does not rank the words based on their importance in the document, only on how often they occur. This is not a wrong approach, but it intuitively makes more sense to rank words based on their importance in the document, right? In fact, the process of converting text to numbers should essentially be a ranking system for the words, so that each document gets a score based on the words it contains. Not all words can have the same importance or relevance in the document, right?

There are two ways to approach document similarity:


1. TF-IDF Score

2. Cosine Similarity

Let's look at them one by one.


#### TF-IDF

TF-IDF, or Term Frequency-Inverse Document Frequency, is one of the most widely used weighting schemes for converting text to numbers. Consider the count vectorizer as a metric which just counts the occurrences of words in a document. 

**The ranking system in a count vectorizer is based purely on occurrences within a single document!**

TF-IDF takes it a step further and ranks the words based not just on their occurrences in one document but across all the documents. Where CV, or the count vectorizer, gives more importance to words because they have appeared multiple times in the document, TF-IDF ranks a word higher if it appears in only a few documents (it is rare, hence more important) and lower if it appears in all or most documents (it is common, hence less important). 

Consider a scenario where there are 5 documents and all are talking about football. The word football would have appeared multiple times in each document. CV is going to rank football consistently high, and in fact give the word football a different value in each of the 5 documents based on how many times it has appeared in that document. In other words, it assumes that the more times a word appears, the more important it is. That is exactly what the TF or Term Frequency component of TF-IDF does. 

IDF, on the other hand, looks at how many of the documents in the corpus contain the word football, not just the one currently being scored. If football also appears in the rest of the documents, then even though it looks important to that one document based on its number of occurrences, it is common across the corpus rather than rare, so its importance is reduced instead of increased!

**The ranking system is across the entire corpus or all documents.  It is not a single document based metric!**

We have seen how CV is calculated for a word in a document. Let us now see how TF-IDF is calculated.

The tf-idf weight is composed of two terms: the first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents where the specific term appears.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization: 

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following: 

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
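
The two formulas can be written out directly in Python. This is a minimal sketch on a hypothetical toy corpus in which every document mentions "football"; the function names `tf`, `idf` and `tf_idf` are illustrative, and `idf` assumes the term occurs in at least one document.

```python
import math

# Hypothetical toy corpus: every document mentions "football"
corpus = [
    "football fans love football",
    "the football match was postponed",
    "football transfer rumours about the striker",
]
tokenized = [doc.split() for doc in corpus]

def tf(term, tokens):
    """Occurrences of term divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, documents):
    """Natural log of (number of documents / documents containing term)."""
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / containing)

def tf_idf(term, tokens, documents):
    """Product of the two quantities defined above."""
    return tf(term, tokens) * idf(term, documents)

# "football" is in all 3 documents, so its IDF (and TF-IDF) is 0
print(tf_idf("football", tokenized[0], tokenized))
# "striker" is in only 1 document, so it gets a positive weight
print(tf_idf("striker", tokenized[2], tokenized))
```

Even though "football" makes up half of the first document, its weight collapses to zero because it appears everywhere, while the rarer "striker" keeps a positive weight.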

#### Example

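Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, using a base-10 logarithm to keep the arithmetic simple, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. With the natural logarithm used in the formula above, the idf would be about 9.21 and the weight about 0.28; the choice of base only rescales the weights and does not change how words rank.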
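
The arithmetic of this example can be checked in a few lines. The sketch computes the idf with both a base-10 and a natural logarithm, since the value 4 corresponds to base 10; the variable names are illustrative.

```python
import math

tf_cat = 3 / 100                  # term frequency of "cat"
doc_ratio = 10_000_000 / 1_000    # total documents / documents containing "cat"

idf_base10 = math.log10(doc_ratio)   # base-10 logarithm
idf_natural = math.log(doc_ratio)    # natural logarithm (log_e)

print("tf:", tf_cat)
print("idf (base 10):", idf_base10, "-> tf-idf:", tf_cat * idf_base10)
print("idf (natural):", round(idf_natural, 4), "-> tf-idf:", round(tf_cat * idf_natural, 4))
```

Switching the base multiplies every idf by the same constant, so the relative ranking of words is unaffected.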

**Python Implementation of TF-IDF**

Let us now take the first 3 complaints and run a TF-IDF vectorizer on them. In this case, the 3 complaints are our 3 documents. Instead of a CV, which considers each document independently and just counts every word in it, the TF-IDF vectorizer treats all three documents together as the corpus. 

```python
complaint_1 = data["X"].iloc[0]
complaint_2 = data["X"].iloc[1]
complaint_3 = data["X"].iloc[2]

print ("Complaint 1: ", complaint_1)

print ("\nComplaint 2: ", complaint_2)

print ("\nComplaint 3: ", complaint_3)

```

**Output:**

```python
Complaint 1:  When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.

Complaint 2:  I tried to sign up for a spending monitoring program and Capital One will not let me access my account through them

Complaint 3:  My mortgage is with BB & T Bank, recently I have been investigating ways to pay down my mortgage faster and I came across Biweekly Mortgage Calculator on BB & T 's website. It's a nice, easy to use calculator that you plug in your interest rate, mortgage amount, mortgage term, and payment type and it calculates your accelerated bi-weekly payment for you and shows you how much quicker you can pay down your loan. Ours figured out to pay off a 30 year mortgage in 26.4 years ... quite a savings! 
I called BB & T 's customer service number to inquire how I get set up on this payment plan. I was told they do not offer that type of payment plan, but I could send in my payments bi-weekly but it would not be applied until the full amount was received. ( the money would sit in a " holding account '' until the full payment amount was collected ). I ended up calling back a few days later thinking the rep I was talking to didn't understand what I wanted to do or was not knowledgeable of this program. I got the SAME ANSWER! 
I then asked for the corporate BB & T office number where I could speak to someone that was knowledgeable of this product. After 3 days I received a phone call back from a corporate manager stating they do not offer this product, and they were " checking into why this is on their website ''. She stated they do have a few customers that make bi-weekly payments, but they no longer offer this service. 
I don't understand how they can have this active link on their website under their Financial Planning Center tab to mislead customers when all they say is " I'm sorry, I know you're upset about this '' Sounds like false advertising to me! 
https : //www.bbt.com/XXXX
```

Let's find out the Tf-Idf score 

```python
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents called sents
sents = [complaint_1, complaint_2, complaint_3]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab

vectorizer.fit(sents)

vector = vectorizer.transform(sents)

print("Shape of the vectorized sentence:",vector.shape)

vector_values = vector.toarray().tolist()[0]

print("The tf-idf score of first five elements:",vector_values[:5])


# Converting the tf-idf score with the word into a dictionary
import operator
sorted_x = sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))
words = [x[0] for x in sorted_x]
d = dict(zip(words,vector_values))

print("Dictionary of words with tf-idf score:\n", d)

#Sorting this dictionary by value in the descending order to see the ranking
print("Sorted dictonary:\n")
print (sorted(d.items(), key=operator.itemgetter(1), reverse = True))

```

**Output:**

```python
Shape of the vectorized sentence: (3, 214)

The tf-idf score of first five elements: [0.0, 0.0, 0.0, 0.0, 0.0]

Dictionary of words with tf-idf score:
{'26': 0.0, '30': 0.0, 'about': 0.0, 'accelerated': 0.0, 'access': 0.0, 'account': 0.0, 'across': 0.0, 'active': 0.0, 'advertising': 0.0, 'after': 0.05253580411952334, 'all': 0.0, 'amount': 0.0,
  ..............................................................................................
 }
    
Sorted dictonary:

[('the', 0.42028643295618673), ('credit', 0.27631307611219824), ('had', 0.27631307611219824), ('was', 0.2626790205976167), ('navient', 0.20723480708414868), ('and', 0.20399369240069398), ..................................................................................................
]

```

We can see that the model gives lower importance to words like "is", "it" and "in". Unfortunately, it also gives low importance to informative words like "financial" and "mortgage", and fairly high importance to unwanted words like "the" and "was". It does give higher importance to words such as "credit". Keep in mind that `vector_values` holds the scores of the first complaint only, so a word that does not occur in that complaint scores 0 regardless of how informative it is elsewhere. The remaining issues arise because TF-IDF works better with larger corpora: just like a machine learning model, the more data it sees, the better it performs. With a larger corpus, many more documents will contain "the" than will contain a word like "financial", which lowers the weight of "the" relative to "financial".

Rerunning this for about 100 documents, we see that the ranking is completely different. 

```python
sents=[]
for x in range(100):
    sents.append(data["X"].iloc[x])
    
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents called sents
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(sents)
vector = vectorizer.transform(sents)
vector.shape 
vector_values = vector.toarray().tolist()[0]

sorted_x = sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))
words = [x[0] for x in sorted_x]
d = dict(zip(words,vector_values))
print("Sorted dictionary: \n")
print ((sorted(d.items(), key=operator.itemgetter(1), reverse = True))[:20])

```

**Output:**
```python
Sorted dictionary: 

[('navient', 0.35431369455512246), ('the', 0.24041090745878946), ('delinquency', 0.23620912970341496), ('had', 0.20963581996093744), ('told', 0.19801036084460227), ('bureaus', 0.19189624902422267),
```

Notice that "the" has moved down from 0.42 to 0.24, while "navient" has increased from 0.20 to 0.35 and "bureaus" from 0.13 to 0.19. As we include more and more documents, words that appear frequently across all documents, such as "the", move down in value, and words like "bureaus" and "navient", which appear in far fewer documents, start moving up. This reiterates the earlier point: TF-IDF works better with larger corpora. Just like a machine learning model, the larger the data, the better the model.
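
To see the effect of document frequency in isolation, here is a minimal sketch on a made-up four-sentence corpus (the sentences are assumptions for illustration, not rows from the complaints data). It reads the fitted `idf_` weights of `TfidfVectorizer` and compares a word that occurs in every document with one that occurs in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" occurs in every document, "navient" in only one
docs = [
    "the loan was sold to navient",
    "the bank closed the account",
    "the card was declined",
    "the payment was late",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# vocabulary_ maps each word to its column index; idf_ holds one weight per column
idf = {word: vectorizer.idf_[idx] for word, idx in vectorizer.vocabulary_.items()}

print("idf of 'the':    ", idf["the"])
print("idf of 'navient':", idf["navient"])
```

With scikit-learn's default smoothing, a word present in every document receives the smallest possible IDF weight, which is why "the" sinks in the ranking as the corpus grows.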

### 3.2 Classification with TF-IDF features

In the previous topic, we saw how the ranking of features changes as more data becomes available to the algorithm. Just as we did with count vectorization, we can use the TF-IDF features as input to a classification algorithm.

We use `TfidfVectorizer` and the `fit_transform()` method that we have seen before. We will fit a logistic regression model on the data and get the classification accuracy.

### TF-IDF score

Let's run our initial Logistic Regression model using TF-IDF features and see whether the accuracy changes.

#### TF-IDF scoring
- Initialise a `"TfidfVectorizer()"` object with parameter `stop_words= "english"` and store it in `'tfidf'`

- Apply the `"fit_transform()"` method of `'tfidf'` on `X` column and store the result in `'vector_tfidf'`

- Convert `'vector_tfidf'` into an array using `"toarray()"` method and store the result in a new variable `'X_tfidf'`

#### Model building

- Split `'X_tfidf'` and `'labels["y"]'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.4` and `random_state = 42`


- Initialise a logistic regression model with `LogisticRegression()` having `random_state=42` and save it to a variable called `'log_reg'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method.


- Find out the accuracy score between `X_test` and `'y_test'` using the `'score()'` method and save it in a variable called `'tfidf_acc'`


```python
from sklearn.feature_extraction.text import TfidfVectorizer

#Initialising the tf-idf model
tfidf = TfidfVectorizer(stop_words="english")

#Vectorizing the 'X' column
vector = tfidf.fit_transform(all_text["X"])

#Converting the vector to array
X_tfidf = vector.toarray()

#Splitting the dataset into train and test
X_train,X_test,y_train,y_test = train_test_split(X_tfidf,labels["y"],test_size=0.4,random_state=42)

#Initialising the logistic regression model
log_reg = LogisticRegression(random_state=42)

#Fitting the model with train data
log_reg.fit(X_train,y_train)

#Finding the accuracy score of model on test data
tfidf_acc = log_reg.score(X_test,y_test)
print(tfidf_acc)
```

**Output:**

```python
0.43283582089552236
```


### 3.3 TF-IDF vectorizer with more data

We can see that the overall accuracy of the model is **low** compared to the initial model we built with the count vectorizer. Let us see what happens to the accuracy with additional data!


```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#Reading the data
df = pd.read_pickle("../data/a2.pkl")

#Keeping the relevant columns
df = df[["Consumer complaint narrative", "Product"]]
df.columns = ["X","y"]
df = df.dropna()
df = df.iloc[:2000]
print("No. of rows in data: ",df.shape[0])


#Initialising the tf-idf model
tfidf = TfidfVectorizer(stop_words="english")

#Vectorizing the 'X' column of the larger dataframe 'df' (not the earlier 'all_text')
vector = tfidf.fit_transform(df["X"])

#Converting the vector to array
X_tfidf = vector.toarray()

#Splitting the dataset into train and test
X_train,X_test,y_train,y_test = train_test_split(X_tfidf,df["y"],test_size=0.4,random_state=42)

#Initialising the logistic regression model
log_reg = LogisticRegression(random_state=42)

#Fitting the model with train data
log_reg.fit(X_train,y_train)

#Finding the accuracy score of model on test data
acc = log_reg.score(X_test,y_test)
print("Accuracy Score:", acc)
```

**Output:**

```python
No. of rows in data:  2000
Accuracy Score: 0.5885608856088561
```
From 43% at 359 rows to 59% at 2000 rows, we are starting to see how TF-IDF works better with larger datasets. The last value for the word "the" was around 0.24. Just out of curiosity, let's see the value of the word "the" now, at 2000 documents.

```python
[('navient', 0.36025047479827665), ('delinquency', 0.26853246892091703), ('the', 0.21010966235398107), ('had', 0.19406511916152228), ('deliquint', 0.17884877931376503), ('expalin', 0.17884877931376503), ('faithful', 0.17884877931376503), ('hurried', 0.17884877931376503), ('told', 0.1715249118834407), ('was', 0.16955513647222115), ('score', 0.16237392880244142), ('angry', 0.15785573774281664), ('bureaus', 0.15439765356279928), ('switched', 0.14197511484650308), ('credit', 0.14078592528264697), ('just', 0.1406764594136565), ('bringing', 0.13979147323891608), ('balance', 0.13890249109697195), ('trouble', 0.13596411449573909), ('maybe', 0.1285083957562635)]
```

"the" is at 0.21 at 2000 documents. "and" has dropped out of the top 20 shown above, while rare (and misspelled) words such as "expalin" and "deliquint" have moved up. Other words have been pushed down as well, based on their importance across all documents and not just within each document alone. One thing to note is that Count Vectorizer sometimes works better than the TF-IDF Vectorizer, depending on the distribution of the words.


**Cosine Similarity**

Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them. This is calculated as

![](../images/cos_sim.png)

Here the vectors can be bag-of-words vectors weighted by either TF (term frequency) or TF-IDF (term frequency-inverse document frequency).

Let's understand it better with an example

Consider the two sentences:

- a. Amy likes pie more than Linda likes pie

- b. Linda likes cake more than Amy likes cake


Following will be the unique words:

"Amy", "likes", "pie", "more", "than", "Linda", "cake"

Following will be their respective tf vectors:

a= [1,2,2,1,1,1,0]

b= [1,2,0,1,1,1,2]

Since raw term-frequency counts favor longer sentences, let's normalize each vector by its magnitude.

After L2 normalisation, we get:

a = [1/√12,2/√12,2/√12,1/√12,1/√12,1/√12,0/√12]

b = [1/√12,2/√12,0/√12,1/√12,1/√12,1/√12,2/√12]


Calculating cosine similarity is now simply finding the dot product of the normalised vectors:

Cos similarity
= (1/√12 x 1/√12) + (2/√12 x 2/√12) + (2/√12 x 0/√12) + (1/√12 x 1/√12) + (1/√12 x 1/√12) + (1/√12 x 1/√12) + (0/√12 x 2/√12) = 1/12 + 4/12 + 0 + 1/12 + 1/12 + 1/12 + 0 = 8/12 ≈ 0.67


From the above example, we can see that cosine similarity is good for cases where duplication matters while analyzing text similarity.

Though TF-IDF is the popular method, you can still attempt cosine similarity to validate your findings.
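
The worked example above can be checked with a few lines of NumPy. This is a minimal sketch: the helper name `cosine_sim` is an assumption introduced here for illustration, and the two vectors are the term-frequency vectors of sentences a and b.

```python
import numpy as np

# Term-frequency vectors over ["Amy", "likes", "pie", "more", "than", "Linda", "cake"]
a = np.array([1, 2, 2, 1, 1, 1, 0])
b = np.array([1, 2, 0, 1, 1, 1, 2])

def cosine_sim(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

similarity = cosine_sim(a, b)
print(round(similarity, 4))
```

Dividing by the two norms is equivalent to L2-normalising each vector first and then taking the dot product, so the result matches the 8/12 computed by hand.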

## Chapter 4: More Classifiers for Text

Description: In this chapter, we will look at other classifiers that can be used for classifying text.

### 4.1: Naive Bayes Classifier

Naive Bayes classifier is a linear classifier based on Bayes' theorem. The term naive comes from the assumption that all features in a dataset are mutually independent. The independence assumption is generally violated in real datasets, but the naive Bayes classifier still tends to perform very well.

For a document d and class c, $P(c|d)$ is the probability of document d belonging to class c. In other words, given the content of the document, what is the probability that it belongs to class c?

On applying Bayes rule, we get

$posterior = \frac{prior\ \times\ conditional\ probability}{evidence}$ 

$P(c|d) = \frac{P(d|c) \times P(c)}{P(d)}$

$P(d|c)$ is the probability of observing the document d given that it belongs to class c. If we represent the document by its words $w_1, \dots, w_n$ and assume the words to be independent of each other, then
$P(d|c) = \prod_{i=1}^n P(w_i|c)$

Here $P(w_i|c)$ can be interpreted as the probability of observing the word $w_i$ in a document, given that the document belongs to class c. Since $P(d)$ is the same denominator for every class, it does not affect which class scores highest, and we can drop it.
This results in the following formula

$P(c|d) \propto P(d|c) \times P(c) = \prod_{i=1}^n P(w_i|c) \times P(c)$

Being easy to implement and fast, naive Bayes classifiers are used in many different fields including classification of RNA sequences and spam filtering.

Let's understand it better with an example of spam filtering: 

For a spam classification problem, we have two classes - `ham` and `spam` (ham means not spam). Given any email, we need to find P(ham|email) and P(spam|email), and we assign the email the label with the higher probability. Consider,

$$P(spam|email) \propto P(email|spam) \times P(spam)$$

$$P(spam) = \frac{\text{number of emails belonging to spam class}}{\text{total number of emails in the corpus}}$$
$$P(email|spam)= \prod_{i=1}^n P(w_i|spam)\ where\ w_i\in email$$
Let us take a word `lottery` for example that is present in the text of the email.

$$P(lottery|spam) = \frac{\text{number of times lottery appears in spam email}}{\text{sum of number of times all words of vocabulary appear in spam email}}$$

Intuitively, the word `lottery` would appear more often in spam emails than in ham emails, so its presence increases the probability that the email is spam. For all the words we choose as features, we calculate the above probability and multiply the results together to get the probability of observing the email. If P(spam|email) > P(ham|email), we assign the label `spam` to the email, and `ham` otherwise.
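
The formulas above can be traced by hand on a tiny example. The sketch below uses a made-up five-email corpus (an assumption for illustration only) and hypothetical helper names `word_prob` and `score`; it applies the unsmoothed formulas exactly as written above.

```python
from collections import Counter

# Hypothetical toy corpus of (label, text) pairs, invented for illustration
corpus = [
    ("spam", "win lottery money now"),
    ("spam", "lottery prize win"),
    ("ham", "meeting schedule now"),
    ("ham", "project meeting money report"),
    ("ham", "lunch schedule"),
]

labels = [label for label, _ in corpus]
classes = sorted(set(labels))

# P(c): share of emails that belong to each class
prior = {c: labels.count(c) / len(labels) for c in classes}

# Word counts per class, used to estimate P(w|c)
word_counts = {c: Counter() for c in classes}
for label, text in corpus:
    word_counts[label].update(text.split())

def word_prob(word, c):
    """P(w|c): occurrences of w in class c over all word occurrences in class c."""
    return word_counts[c][word] / sum(word_counts[c].values())

def score(email, c):
    """Unnormalised posterior: P(c) times the product of P(w|c) over the email's words."""
    result = prior[c]
    for word in email.split():
        result *= word_prob(word, c)
    return result

email = "lottery money"
scores = {c: score(email, c) for c in classes}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Because `lottery` never occurs in a ham email here, the ham score collapses to zero. Library implementations usually avoid this with additive smoothing, for example the `alpha` parameter of scikit-learn's `MultinomialNB`.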

Now that we have understood the basic intuition of how Naive Bayes works, let us try running a Naive Bayes classifier. We will be using the TF-IDF method, exactly the same way we ran a logistic regression model in the earlier section. 


### Naive Bayes Classifier

- Load the dataset from `'path'`(given) using the `read_pickle()` method from pandas and store it in `'data'`. 

- Drop nan values from the entire dataframe `data` using `"dropna()"` and save it back to `'data'`

- Subset the dataframe  `'data'` to only include `"Consumer complaint narrative"` and  `"Product"` and store this dataframe subset back in `'data'`

- Subset the dataframe `'data'` to only include first `2000` rows and store this dataframe subset back in `'data'`

- Rename the column `"Consumer complaint narrative"` to `"X"` and `"Product"` to `"y"` by assigning `["X","y"]` to `data.columns` 

### TF-IDF vectoriser
- Initialise a `"TfidfVectorizer()"` object with parameter `stop_words= "english"` and store it in `'tfidf'`

- Apply the `"fit_transform()"` method of `'tfidf'` on `X` column and store the result in `'vector_tfidf'`

- Convert `'vector_tfidf'` into an array using `"toarray()"` method and store the result in a new variable `'X_tfidf'`


**For y**

- Store the `"y"` column of dataframe `'data'`in a new dataframe `'labels'`(i.e. instead of `data["y"]` use `data[["y"]]`)

- We need to label encode the values. Therefore initialise a `"LabelEncoder()"` object and store it in `'le'`

- Use the `fit_transform` method of `'le'` on column `"y"` of `'labels'` and store the results back in `y` column


### Model building


- Split `'X_tfidf'` and `'labels["y"]'` into `X_train,X_test,y_train,y_test` using `train_test_split()` function. Use `test_size = 0.4` and `random_state = 42`


- Initialise a naive bayes model with `MultinomialNB()` (it takes no `random_state` parameter) and save it to a variable called `'nb'`.


- Fit the model on the training data `'X_train'` and `'y_train'` using the `'fit()'` method.


- Find out the accuracy score between `X_test` and `'y_test'` using the `'score()'` method and save it in a variable called `'nb_acc'`



```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

path = "../data/a2.pkl"

# reading the data
data = pd.read_pickle(path)

# keeping the relevant columns
data = data[["Consumer complaint narrative", "Product"]]

# renaming the columns
data.columns = ["X", "y"]

# dropping the nan values
data = data.dropna()

# choosing the first 2000 documents
data = data.iloc[:2000]

# X

# Subsetting 'X' column (copy() avoids pandas' SettingWithCopyWarning)
all_text = data[["X"]].copy()

# Converting the 'X' column to lower case
all_text["X"] = all_text["X"].str.lower()

# Initialising a tfidf vectorizer object with stopwords
tfidf = TfidfVectorizer(stop_words="english")

# Vectorizing the 'X' column
vector = tfidf.fit_transform(all_text["X"])

# Converting vector to array
X_tfidf = vector.toarray()

# y

# Subsetting 'y' column (copy() avoids pandas' SettingWithCopyWarning)
labels = data[["y"]].copy()

# Initialising label encoder object
le = LabelEncoder()

# Label encoding 'y' column
labels["y"] = le.fit_transform(labels["y"])

# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels["y"], test_size=0.4, random_state=42)

# Initialising a naive bayes classifier
nb = MultinomialNB()

# Fitting the model on train data
nb.fit(X_train, y_train)

# Finding the accuracy score of model on test data
nb_acc = nb.score(X_test, y_test)
print(nb_acc)
```

**Output:**

```python
0.42461964038727523
```



### 4.2 Handling imbalanced text data

One of the fundamental reasons the model isn't giving good accuracy can be deduced from the classification report below: the recall of several categories is 0, which indicates that the data isn't balanced well enough. A classification algorithm tends to predict the majority class unless the output categories are more or less equally balanced. As a result, examples from the minority classes are misclassified as the majority class, which reduces the overall accuracy.




```python
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings("ignore")
print (classification_report(y_test,nb.predict(X_test)))
```


```python
               precision    recall  f1-score   support

0                 0.00      0.00      0.00        14
1                 0.00      0.00      0.00        23
2                 0.00      0.00      0.00        15
3                 0.00      0.00      0.00        26
4                 0.00      0.00      0.00        48
5                 0.00      0.00      0.00        52
6                 0.38      1.00      0.55       250
7                 0.86      0.40      0.54       144
8                 0.00      0.00      0.00        16
9                 0.00      0.00      0.00         2
10                1.00      0.01      0.03        78
11                0.00      0.00      0.00         1
12                0.00      0.00      0.00         6
13                0.00      0.00      0.00         5
14                0.00      0.00      0.00        30
15                0.00      0.00      0.00        13

avg / total       0.41      0.42      0.30       723
```

Let's reconcile the classification report with the label wise distribution of data. 


```python
labels["y"].value_counts() 

```

**Output:**
```python
6     628
7     418
10    182
4     112
5     104
14     86
1      77
3      56
0      33
15     27
8      26
2      24
12     18
9       7
13      5
11      3
Name: y, dtype: int64
```

We can see very clearly that the dataset has heavily imbalanced labels. Labels such as 8 and 12 are underrepresented, and even label 10 has far fewer rows than labels 6 and 7, so the model is less likely to predict them and the accuracy suffers. Categories 6 and 7 are overrepresented, while each of the other categories accounts for roughly 10% of the rows or less.

We might have to oversample the under-represented categories. 

**Random Oversampling**

Since we already have little data, the sampling method we should use is oversampling. We have already covered Random Oversampling.

To refresh,

Random oversampling selects minority-class samples with replacement (repeated occurrences), resulting in a higher proportion of minority-class samples.


![ros](../images/ros.png)


We will use `RandomOverSampler()` from the `imblearn` (imbalanced-learn) package to do this.
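
Before relying on the library, it can help to see the idea spelled out. The following is a minimal NumPy sketch of random oversampling, not the library's implementation: the function name `random_oversample` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Resample every class with replacement until it matches the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Keep the original rows and draw the shortfall with replacement
        extra = rng.choice(idx, size=target - count, replace=True)
        picked.append(np.concatenate([idx, extra]))
    picked = np.concatenate(picked)
    return X[picked], y[picked]

# Toy data: class 0 has six rows, class 1 has only two
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))
```

Only the training split should be resampled this way; the test split keeps its original distribution so that the reported accuracy stays honest.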


### Sampling 

- Initialise a `RandomOverSampler()` object with `random_state=0` and save it to a variable called `'ros'`.

- Using the `fit_resample()` method of `'ros'` (called `fit_sample()` in older versions of imbalanced-learn), oversample `'X_train'` and `'y_train'` and store the new samples in variables `'X_ros'` and `'y_ros'`.


- Initialise a naive bayes model with `MultinomialNB()` and save it to a variable called `'nb'`.


- Fit the model on the training data `'X_ros'` and `'y_ros'` using the `'fit()'` method.

- Find out the accuracy score between `X_test` and `'y_test'` using the `'score()'` method and save it in a variable called `'ros_score'`


```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.naive_bayes import MultinomialNB

#Initialising a random over sampler object
ros = RandomOverSampler(random_state=0)

#Sampling the train data
#(fit_resample() replaces the older fit_sample(), which recent imbalanced-learn releases no longer provide)
X_ros, y_ros = ros.fit_resample(X_train, y_train)

#Initialising multinomial naive bayes model
nb = MultinomialNB()

#Fitting the sampled train data
nb.fit(X_ros,y_ros)

#Finding the accuracy score of model on test data
ros_score=nb.score(X_test,y_test)

print(ros_score)
```

**Output:**

```python
0.6016597510373444
```


### 4.3 Linear Kernel SVM

We have already covered Support Vector Machines in detail.

To refresh, 

Support Vector Machines are based on the concept of decision planes that define decision boundaries. 
In other words, given labeled training data (supervised learning), the algorithm outputs an optimal `hyperplane` which can help categorize new examples. 

![](../images/kernel_2.png)



**Why do SVMs work for text classification?**

- High-dimensional input space:

When dealing with text data, we usually need to handle many features (often more than 10,000). Since SVMs (particularly linear SVMs) have built-in protection against overfitting, they can handle such a large feature space.


- Few irrelevant features:

As an extension of the above point, one can't really do rigorous feature selection during text classification. Research has shown that even features ranked low still contain considerable information. An SVM is therefore well suited to this large feature space, in which feature selection or reduction can't be achieved satisfactorily.

- Most text categorisation problems are linearly separable:

Many experiments have led to the conclusion that text categorisation problems are usually linearly separable. Since the idea behind an SVM is to find such linear separators, SVMs often work better than most other models.
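
As a small illustration of these points, the sketch below fits a linear-kernel `SVC` directly on a sparse TF-IDF matrix. The six complaints and their two labels are invented for illustration and are not taken from the dataset; on such a tiny, cleanly separated corpus a linear separator is easy to find.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy complaints, invented for illustration
texts = [
    "mortgage payment was applied late",
    "bank refused to modify my mortgage",
    "mortgage escrow account error",
    "credit card charged an annual fee",
    "credit card interest rate increased",
    "unauthorized charge on my credit card",
]
labels = ["Mortgage", "Mortgage", "Mortgage",
          "Credit card", "Credit card", "Credit card"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(texts)  # sparse matrix: one column per vocabulary term

svc = SVC(kernel="linear", random_state=0)
svc.fit(X, labels)

# Classify an unseen complaint using the same fitted vocabulary
prediction = svc.predict(tfidf.transform(["annual fee on my credit card"]))
print(X.shape, prediction)
```

Note that the sparse matrix is passed to `fit()` without calling `toarray()`; for large vocabularies this keeps memory use manageable.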





### Linear SVC

Let's use a Linear SVC as our last algorithm, to test the results.

- Initialise a support vector model with `SVC()` with `random_state=0` and `kernel="linear"` and save it to a variable called `'svc'`.


- Fit the model on the training data `'X_ros'` and `'y_ros'` using the `'fit()'` method.

- Find out the accuracy score between `X_test` and `'y_test'` using the `'score()'` method and save it in a variable called `'svc_score'`


```python
from sklearn.svm import SVC

#Initialising a support vector model with linear kernel
svc = SVC(kernel="linear", random_state=0)

#Fitting the model on train data
svc.fit(X_ros,y_ros)

#Finding the accuracy score of the model on test data
svc_score=svc.score(X_test,y_test)

print(svc_score)
```

**Output:**

```python
0.648686030428769
```
