## Text Classification tutorial 

### In the last tutorial we saw how text was converted to numerics using a count vectorizer. 

In other words, a count vectorizer counts the occurrences of the words in a document, and all the documents are considered independent of each other. This is very similar to one-hot encoding or the pandas `get_dummies` function. Even when multiple documents are involved, the count vectorizer does not assume any interdependence between the documents and treats each one as a separate entity.

### It does not rank the words based on their importance in the document, but just based on how often they occur.

### This is not a wrong approach, but it intuitively makes more sense to rank words based on their importance in the document, right?

### In fact, the process of converting text to numbers should essentially be a ranking system of the words, so that each document gets a score based on the words it contains. All words cannot have the same importance or relevance in the document, right?



## Enter TF-IDF!!

TF-IDF, or Term Frequency-Inverse Document Frequency, is one of the most widely used weighting schemes for converting text to numbers. Think of the count vectorizer as a metric which just counts the occurrences of words in a document.

### The ranking system here is purely occurrence based, on a single document only!

TF-IDF takes it a step further and ranks the words based not just on their occurrences in one document but across all the documents. Where CV (the count vectorizer) gives more importance to a word simply because it appears many times in a document, TF-IDF ranks a word high if it is frequent in that document but appears in few other documents (it is rare, hence more important), and lower if it appears in all or most documents (it is common, hence less important).

Consider a scenario where there are 5 documents and all are talking about football. The word football would have appeared multiple times in each document. CV is going to rank football consistently high and in fact give the word a different value in each of the 5 documents, based on how many times it appears in that document. In other words, it assumes that the more times a word appears, the more important it is. That is exactly what the TF or Term Frequency component in TF-IDF does.

IDF, on the other hand, looks at how many documents in the corpus contain the word football at all. If football also appears in the other 4 documents, then although it looks important to any one document based on the number of occurrences, it is common across the corpus, so its importance is reduced instead of going up!

### The ranking system is across the entire corpus or all documents.  It is not a single document based metric!

We have seen how CV is calculated for a word in a document. Let us now see how TF IDF is...

The tf-idf weight is composed of two terms: the first is the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in long documents than in short ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

### Example

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, using a base-10 logarithm for readability, the inverse document frequency (i.e., idf) is log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. (With the natural logarithm from the formula above, idf would be ln(10,000) ≈ 9.21 and the weight ≈ 0.28; the base only rescales all weights by a constant factor.)
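The arithmetic of the cat example can be checked with a few lines of plain Python. This is only a sketch of the textbook formulas; the helper names `tf` and `idf` are made up for illustration and are not part of any library.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's words that are this term."""
    return term_count / total_terms

def idf(n_docs, n_docs_with_term, log=math.log):
    """Inverse document frequency; the logarithm base is configurable."""
    return log(n_docs / n_docs_with_term)

cat_tf = tf(3, 100)
cat_idf_log10 = idf(10_000_000, 1_000, math.log10)  # base 10
cat_idf_ln = idf(10_000_000, 1_000)                 # natural log, as in log_e

print("tf:", cat_tf)
print("tf-idf (log10):", round(cat_tf * cat_idf_log10, 4))
print("tf-idf (ln):", round(cat_tf * cat_idf_ln, 4))
```

Whichever base is used, the ordering of words by weight stays the same, because switching the base multiplies every idf by the same constant.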

### Let's take the earlier example where we had to classify sentences and we used count vectorizer and a NB classifier. Let's use a tf idf and see if the results vary. 

In [1]:
train = [('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')]

In [2]:
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]


We will do a hold out cross validation this time. The test set will be untouched. 

In [3]:
sentences = [x[0] for x in train]
labels = [x[1] for x in train]

In [4]:
sentences

['I love this sandwich.',
 'This is an amazing place!',
 'I feel very good about these beers.',
 'This is my best work.',
 'What an awesome view',
 'I do not like this restaurant',
 'I am tired of this stuff.',
 "I can't deal with this",
 'He is my sworn enemy!',
 'My boss is horrible.']

In [5]:
labels

['pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'neg']

In [6]:
train_dict = dict(zip(sentences,labels))

In [7]:
import pandas as pd

In [8]:
pd.DataFrame(list(train_dict.items()), columns=['Sentence', 'label'])

Unnamed: 0,Sentence,label
0,I love this sandwich.,pos
1,This is an amazing place!,pos
2,I feel very good about these beers.,pos
3,This is my best work.,pos
4,What an awesome view,pos
5,I do not like this restaurant,neg
6,I am tired of this stuff.,neg
7,I can't deal with this,neg
8,He is my sworn enemy!,neg
9,My boss is horrible.,neg


Let's see quickly an example of TF-IDF before we add that to our sentences. 

Sentence 1 : The car is driven on the road.

Sentence 2: The truck is driven on the highway.

In this example, each sentence is a separate document.

We will now calculate the TF-IDF for the above two documents, which represent our corpus.

![title](img/ok1.png)

From the above table, we can see that the TF-IDF of common words was zero, which shows they are not significant. On the other hand, the TF-IDF values of “car”, “truck”, “road”, and “highway” are non-zero. These words have more significance.
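For readers who want to recompute such a table by hand, here is a minimal sketch using the textbook formulas from earlier. It assumes lowercased text with the final period dropped and a base-10 logarithm; the exact numbers in the image may differ if a different base was used.

```python
import math

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted(set(w for doc in tokenized for w in doc))

# Document frequency: in how many documents does each word occur?
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

def tfidf(word, doc):
    tf = doc.count(word) / len(doc)        # normalized term frequency
    idf = math.log10(n_docs / df[word])    # zero when the word is in every document
    return tf * idf

scores = [{w: round(tfidf(w, doc), 3) for w in vocab} for doc in tokenized]
for doc, row in zip(docs, scores):
    print(doc)
    print("\t", row)
```

Words shared by both sentences have idf = log10(2/2) = 0, so their weight is zero no matter how often they occur.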

In a different format this could look like this

![title](img/ok2.png)

Which looks very similar to our CV in the last tutorial

![title](img/ok.png)

Common words like "a", "an", "the" and "is" get a weight of zero when they appear in every document. The effect is similar to stopword removal, although no words are actually removed: commonly occurring words are automatically given lower weights in TF-IDF. With CV, this wasn't the case.

TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternately, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.

The same create, fit, and transform process is used as with the CountVectorizer.
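The two routes mentioned above (a single `TfidfVectorizer`, or a `CountVectorizer` followed by a `TfidfTransformer`) should give the same matrix when both use default settings. A small sketch on a made-up two-sentence corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "The car is driven on the road.",
    "The truck is driven on the highway.",
]

# Route 1: tokenize, count and weight in one step.
direct = TfidfVectorizer().fit_transform(docs)

# Route 2: build raw counts first, then re-weight them.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(direct.toarray(), two_step.toarray())
print("identical matrices:", same)
```

The second route is handy when the count matrix already exists, since the text does not need to be tokenized again.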

Below is an example of using the TfidfVectorizer to learn vocabulary and inverse document frequencies across 3 small documents and then encode one of those documents.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
		"The dog.",
		"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]


Each document vector is L2-normalized (the default `norm='l2'`), so the scores fall between 0 and 1, and the encoded document vectors can then be used directly with most machine learning algorithms.
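The printed `idf_` values differ from the textbook formula given earlier because scikit-learn's defaults use a smoothed variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The sketch below recomputes the numbers above in plain Python under that assumption; the text is lowercased and stripped of periods by hand here rather than by the library's tokenizer.

```python
import math

docs = ["the quick brown fox jumped over the lazy dog", "the dog", "the fox"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted(set(w for doc in tokenized for w in doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

# Raw count times IDF for the first document, then L2-normalize the vector.
first = tokenized[0]
raw = [first.count(w) * idf[w] for w in vocab]
length = math.sqrt(sum(v * v for v in raw))
weights = {w: v / length for w, v in zip(vocab, raw)}

print({w: round(v, 8) for w, v in idf.items()})
print({w: round(v, 8) for w, v in weights.items()})
```

Because of the `+ 1`, a word that occurs in every document (here "the") keeps an idf of 1 instead of dropping to 0.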

Let's try this on a couple of sentences to make it more clear. 

s1 = "Bangalore is the capital of Karnataka"

s2 = "Kolkata is the capital of West Bengal"

s3 = "All these states belong to India"

s4 = "The capital of India is New Delhi"

In [10]:
s1 = "Bangalore is the capital of Karnataka"

s2 = "Kolkata is the capital of West Bengal"

s3 = "All these states belong to India"

s4 = "The capital of India is New Delhi"

In [11]:
sents= [s1,s2,s3,s4]

In [12]:
sents

['Bangalore is the capital of Karnataka',
 'Kolkata is the capital of West Bengal',
 'All these states belong to India',
 'The capital of India is New Delhi']

In [13]:
vectorizer = TfidfVectorizer()

In [14]:
vectorizer.fit(sents)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [15]:
vector = vectorizer.transform(sents)

In [16]:
vector

<4x17 sparse matrix of type '<class 'numpy.float64'>'
	with 26 stored elements in Compressed Sparse Row format>

In [17]:
vector.shape

(4, 17)

In [18]:
#4 because 4 documents, 17 because 17 unique words

In [19]:
vector_values = vector.toarray().tolist()[0]  # weights of the first document (s1) only

In [20]:
vectorizer.vocabulary_

{'bangalore': 1,
 'is': 7,
 'the': 13,
 'capital': 4,
 'of': 11,
 'karnataka': 8,
 'kolkata': 9,
 'west': 16,
 'bengal': 3,
 'all': 0,
 'these': 14,
 'states': 12,
 'belong': 2,
 'to': 15,
 'india': 6,
 'new': 10,
 'delhi': 5}

In [21]:
import operator
sorted_x = sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))
words = [x[0] for x in sorted_x]
d = dict(zip(words,vector_values))

In [22]:
d

{'all': 0.0,
 'bangalore': 0.5248898070510398,
 'belong': 0.0,
 'bengal': 0.0,
 'capital': 0.3350303646342538,
 'delhi': 0.0,
 'india': 0.0,
 'is': 0.3350303646342538,
 'karnataka': 0.5248898070510398,
 'kolkata': 0.0,
 'new': 0.0,
 'of': 0.3350303646342538,
 'states': 0.0,
 'the': 0.3350303646342538,
 'these': 0.0,
 'to': 0.0,
 'west': 0.0}

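### Note that `d` holds the weights of the first sentence only (we took row `[0]` of the matrix). Words like kolkata, bengal, delhi and india get a weight of 0 there simply because they do not occur in that sentence, not because TF-IDF considers them unimportant. Within the sentence, the rarer words bangalore and karnataka (0.52) already outrank the common words capital, is, of and the (0.34). As the list of documents grows, the weights of highly common words shrink further, and words like kolkata, india or delhi keep a higher weight in the documents that contain them as long as they remain rare.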
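To see the weights of every sentence rather than only the first one, each row of the matrix can be paired with the vocabulary. A small sketch; the names `index_to_word` and `per_doc` are introduced here for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sents = [
    "Bangalore is the capital of Karnataka",
    "Kolkata is the capital of West Bengal",
    "All these states belong to India",
    "The capital of India is New Delhi",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(sents).toarray()  # one row per sentence

# Column index -> word, recovered from the fitted vocabulary.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

per_doc = []
for row in matrix:
    # Keep only the words that actually occur in this sentence.
    weights = {index_to_word[i]: round(float(v), 3)
               for i, v in enumerate(row) if v > 0}
    per_doc.append(weights)

for sent, weights in zip(sents, per_doc):
    print(sent)
    print("\t", sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
```

In each row the words unique to that sentence come out on top, and a word is zero only in rows where it does not occur.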


### Let's understand that with one more example below. 

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
import operator
 
corpus=["this car got the excellence award",\
         "good car gives good mileage",\
         "this car is very expensive",\
         "the company is growing with very high production",\
         "this company is financially good"]

In [24]:
vocabulary = set()
for doc in corpus:
    vocabulary.update(doc.split())

In [25]:
vocabulary

{'award',
 'car',
 'company',
 'excellence',
 'expensive',
 'financially',
 'gives',
 'good',
 'got',
 'growing',
 'high',
 'is',
 'mileage',
 'production',
 'the',
 'this',
 'very',
 'with'}

In [26]:
vocabulary = list(vocabulary)
word_index = {w: idx for idx, w in enumerate(vocabulary)}

In [27]:
word_index

{'growing': 0,
 'very': 1,
 'production': 2,
 'high': 3,
 'the': 4,
 'award': 5,
 'this': 6,
 'gives': 7,
 'with': 8,
 'financially': 9,
 'car': 10,
 'is': 11,
 'expensive': 12,
 'company': 13,
 'excellence': 14,
 'got': 15,
 'good': 16,
 'mileage': 17}

In [28]:
tfidf = TfidfVectorizer(vocabulary=vocabulary)

In [29]:
tfidf.fit(corpus)
tfidf.transform(corpus)

<5x18 sparse matrix of type '<class 'numpy.float64'>'
	with 28 stored elements in Compressed Sparse Row format>

In [30]:
for doc in corpus:
    score={}
    print (doc)
    # Transform a document into TfIdf coordinates
    X = tfidf.transform([doc])
    for word in doc.split():
        score[word] = X[0, tfidf.vocabulary_[word]]
    sortedscore = sorted(score.items(), key=operator.itemgetter(1), reverse=True)
    print ("\t", sortedscore)

this car got the excellence award
	 [('got', 0.4689132131547637), ('excellence', 0.4689132131547637), ('award', 0.4689132131547637), ('the', 0.3783162278555838), ('this', 0.3140366438234139), ('car', 0.3140366438234139)]
good car gives good mileage
	 [('good', 0.7178821805115433), ('gives', 0.4448982295027494), ('mileage', 0.4448982295027494), ('car', 0.2979535293877717)]
this car is very expensive
	 [('expensive', 0.5776914793752232), ('very', 0.4660778481185906), ('this', 0.38688671647327205), ('car', 0.38688671647327205), ('is', 0.38688671647327205)]
the company is growing with very high production
	 [('growing', 0.3952457425281075), ('with', 0.3952457425281075), ('high', 0.3952457425281075), ('production', 0.3952457425281075), ('the', 0.3188817764021113), ('company', 0.3188817764021113), ('very', 0.3188817764021113), ('is', 0.264700680183337)]
this company is financially good
	 [('financially', 0.5591166343026756), ('company', 0.45109178007079426), ('good', 0.45109178007079426), ('

### We can see that the model learns to give less importance to words like is and this. Unfortunately, it also gives a low importance to meaningful words like car and a fairly high importance to less useful words like gives. With a larger corpus, these issues would be reduced, provided many more documents contain words like gives than contain car.

In [31]:
sents

['Bangalore is the capital of Karnataka',
 'Kolkata is the capital of West Bengal',
 'All these states belong to India',
 'The capital of India is New Delhi']

In [32]:
vocabulary = set()
for doc in sents:
    vocabulary.update(doc.split())

In [33]:
vocabulary

{'All',
 'Bangalore',
 'Bengal',
 'Delhi',
 'India',
 'Karnataka',
 'Kolkata',
 'New',
 'The',
 'West',
 'belong',
 'capital',
 'is',
 'of',
 'states',
 'the',
 'these',
 'to'}

In [34]:
vocabulary = list(vocabulary)
word_index = {w: idx for idx, w in enumerate(vocabulary)}

In [35]:
word_index

{'The': 0,
 'India': 1,
 'of': 2,
 'these': 3,
 'belong': 4,
 'the': 5,
 'Bangalore': 6,
 'capital': 7,
 'Karnataka': 8,
 'states': 9,
 'All': 10,
 'to': 11,
 'Delhi': 12,
 'Kolkata': 13,
 'is': 14,
 'Bengal': 15,
 'West': 16,
 'New': 17}

In [36]:
tfidf = TfidfVectorizer(vocabulary=vocabulary)

In [37]:
tfidf.fit(sents)
tfidf.transform(sents)

<4x18 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [38]:
for doc in sents:
    score={}
    print (doc)
    # Transform a document into TfIdf coordinates
    X = tfidf.transform([doc])
    for word in doc.split():
        score[word] = X[0, tfidf.vocabulary_[word]]
    sortedscore = sorted(score.items(), key=operator.itemgetter(1), reverse=True)
    print ("\t", sortedscore)

Bangalore is the capital of Karnataka
	 [('is', 0.5), ('the', 0.5), ('capital', 0.5), ('of', 0.5), ('Bangalore', 0.0), ('Karnataka', 0.0)]
Kolkata is the capital of West Bengal
	 [('is', 0.5), ('the', 0.5), ('capital', 0.5), ('of', 0.5), ('Kolkata', 0.0), ('West', 0.0), ('Bengal', 0.0)]
All these states belong to India
	 [('these', 0.5), ('states', 0.5), ('belong', 0.5), ('to', 0.5), ('All', 0.0), ('India', 0.0)]
The capital of India is New Delhi
	 [('capital', 0.5), ('of', 0.5), ('is', 0.5), ('The', 0.0), ('India', 0.0), ('New', 0.0), ('Delhi', 0.0)]


In [39]:
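# Here Bangalore, Karnataka, Delhi, Bengal etc. all get a 0. That is not TF-IDF's verdict on these words:
# the vocabulary we passed in contains capitalised entries (it was built with doc.split()), while
# TfidfVectorizer lowercases the text by default, so those entries never match any token.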
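One way to avoid the case mismatch when supplying a custom vocabulary is to lowercase the text before splitting it. A minimal sketch on the same four sentences; `first_doc` is a name introduced here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sents = [
    "Bangalore is the capital of Karnataka",
    "Kolkata is the capital of West Bengal",
    "All these states belong to India",
    "The capital of India is New Delhi",
]

# Lowercase before splitting so the vocabulary matches the vectorizer's tokens.
vocabulary = sorted({word for doc in sents for word in doc.lower().split()})

tfidf = TfidfVectorizer(vocabulary=vocabulary)
X = tfidf.fit_transform(sents)

# Weights of the words in the first sentence.
first_doc = {
    word: round(float(X[0, tfidf.vocabulary_[word]]), 3)
    for word in sents[0].lower().split()
}
print(sorted(first_doc.items(), key=lambda kv: kv[1], reverse=True))
```

With matching case, bangalore and karnataka receive the highest weights in the first sentence, as in the earlier run that let the vectorizer learn its own vocabulary.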

In [40]:
# In general, TF-IDF becomes more useful as the corpus grows; with only a handful of documents,
# a count vectorizer is often good enough.

### Now we will analyse a massive corpus of documents, a consumer complaints database and classify each of these complaints into categories. 

This is a supervised text classification problem, and our goal is to investigate which supervised machine learning methods are best suited to solve it.

When a new complaint comes in, we want to assign it to one of the product categories (18 distinct labels appear in the sample used below). The classifier assumes that each new complaint is assigned to one and only one category. This is a multi-class text classification problem.

In [41]:
import pandas as pd

In [42]:
df = pd.read_csv("Consumer_Complaints.csv")

In [43]:
df.shape #Has over a million rows. We will reduce this to 200000 rows to aid faster computation

(1151385, 18)

In [44]:
df = df.iloc[:200000]

In [45]:
df.head(4)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,03/12/2014,Mortgage,Other mortgage,"Loan modification,collection,foreclosure",,,,M&T BANK CORPORATION,MI,48382.0,,,Referral,03/17/2014,Closed with explanation,Yes,No,759217
1,01/19/2017,Student loan,Federal student loan servicing,Dealing with my lender or servicer,Received bad information about my loan,When my loan was switched over to Navient i wa...,,"Navient Solutions, LLC.",LA,,,Consent provided,Web,01/19/2017,Closed with explanation,Yes,No,2296496
2,04/06/2018,Credit card or prepaid card,General-purpose credit card or charge card,"Other features, terms, or problems",Other problem,I tried to sign up for a spending monitoring p...,,CAPITAL ONE FINANCIAL CORPORATION,VA,,Older American,Consent provided,Web,04/06/2018,Closed with explanation,Yes,,2866101
3,06/08/2014,Credit card,,Bankruptcy,,,,AMERICAN EXPRESS COMPANY,ID,83854.0,Older American,,Web,06/10/2014,Closed with explanation,Yes,Yes,885638


Having a look at the columns, we see that the column "Product" is what we are interested in. In other words, this is our target variable (the various classes), while the "Consumer complaint narrative" is the text we will classify. That is, we will be "classifying" the text in the Consumer complaint narrative column into one of the categories in the Product column.

In [46]:
### Dropping all other columns

Or rather selecting a new df selecting only the 2 columns of relevance.


In [47]:
new_df = df[["Consumer complaint narrative", "Product"]]
new_df_copy= new_df.copy()

In [48]:
new_df.head(5) #looks good

Unnamed: 0,Consumer complaint narrative,Product
0,,Mortgage
1,When my loan was switched over to Navient i wa...,Student loan
2,I tried to sign up for a spending monitoring p...,Credit card or prepaid card
3,,Credit card
4,,Debt collection


### A few cleaning up operations have to be done before we can proceed

For instance, we will start off by removing the null values in the complaint narrative column.

Secondly, we will have to label encode the Product categories, as they are strings and the algorithm needs them in a numerical format.

In [49]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
new_df["Product"]= le.fit_transform(new_df["Product"])

In [50]:
new_df.head(3)

Unnamed: 0,Consumer complaint narrative,Product
0,,10
1,When my loan was switched over to Navient i wa...,15
2,I tried to sign up for a spending monitoring p...,4


In [51]:
#Removing, the null rows


In [52]:
new_df.shape

(200000, 2)

In [53]:
new_df = new_df[pd.notnull(new_df['Consumer complaint narrative'])]

In [54]:
new_df.shape

(34343, 2)

In [55]:
print ((1151385-335238)/1151385)

0.7088393543428132


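### The data has been reduced by over 70%!!

Note that the calculation above uses row counts from the full dataset (1,151,385 rows), not from our 200,000-row sample. For the sample itself, going from 200,000 to 34,343 rows is a reduction of about 83%.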

In [56]:
new_df.head(4)

Unnamed: 0,Consumer complaint narrative,Product
1,When my loan was switched over to Navient i wa...,15
2,I tried to sign up for a spending monitoring p...,4
7,"My mortgage is with BB & T Bank, recently I ha...",10
13,The entire lending experience with Citizens Ba...,10


Let's now run a CV and a TF-IDF on the Consumer complaint narrative and then run a multinomial NB on each to compare their performance.

We will do a train/test split right away and then do a hold-out validation this time.

In [57]:
X = new_df["Consumer complaint narrative"]
y = new_df["Product"]

Before breaking up X into train and test, we will vectorize it into numbers using CV so that the vocabulary remains the same for both. Fitting a separate vectorizer on each set would lead to shape mismatch issues.

### Let's understand why. If we fit one vectorizer on the train set and another on the test set, the vocabulary of the two sets will differ, and so will the number of columns. Hence shape mismatches will occur during the testing phase. Here we avoid this by vectorizing the entire data set first and then splitting it, so the vocabulary stays intact. (A more common practice is to fit the vectorizer on the training set only and call `transform` on the test set; that also keeps the columns aligned and keeps test-set words out of the vocabulary.)
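The shape mismatch only arises when two vectorizers are fitted independently. A sketch of the fit-on-train, transform-on-test pattern is shown here on a tiny made-up corpus (the texts and labels are hypothetical stand-ins, not the complaints data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the complaint narratives.
texts = [
    "my loan payment was lost",
    "the bank closed my account",
    "credit card fee was charged twice",
    "mortgage escrow was miscalculated",
    "debt collector keeps calling me",
    "student loan servicer gave bad advice",
]
labels = ["loan", "bank", "card", "mortgage", "debt", "loan"]

train_txt, test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_txt)  # learns the vocabulary from train only
X_test = vectorizer.transform(test_txt)        # reuses it; unseen words are ignored

print(X_train.shape, X_test.shape)
```

Both matrices have the same number of columns because `transform` maps the test text onto the vocabulary learned from the training text. They also stay sparse, which matters at this scale: scikit-learn estimators such as `MultinomialNB` accept sparse input directly, so the `toarray()` call is optional.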

In [58]:
from sklearn.feature_extraction.text import CountVectorizer

In [59]:
vectorizer = CountVectorizer()

In [60]:
X = pd.DataFrame(X)

In [61]:
vectorizer.fit(X["Consumer complaint narrative"])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [62]:
vector = vectorizer.transform(X["Consumer complaint narrative"])

In [63]:
vector.shape

(34343, 33772)

In [64]:
vector_values = vector.toarray()

In [65]:
y.shape

(34343,)

In [66]:
vector_values.shape

(34343, 33772)

Now we will employ a train test split for the X (or vector values) and y

In [67]:
from sklearn.model_selection import train_test_split as tts

In [68]:
X_train,X_test,y_train,y_test = tts(vector_values,y,test_size = 0.3, random_state = 42, stratify = y)

In [69]:
from sklearn.naive_bayes import MultinomialNB

In [70]:
nb = MultinomialNB()

In [71]:
nb.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [72]:
y_pred = nb.predict(X_test)

In [73]:
from sklearn.metrics import accuracy_score, classification_report

In [74]:
accuracy_score(y_test,y_pred)

0.6751431621857711

In [75]:
print (classification_report(y_test,y_pred))

             precision    recall  f1-score   support

          0       0.64      0.09      0.15       269
          1       0.49      0.68      0.57       400
          2       0.31      0.02      0.04       184
          3       0.59      0.06      0.10       358
          4       0.45      0.78      0.57       688
          5       0.48      0.09      0.15       612
          6       0.69      0.82      0.75      3034
          7       0.76      0.77      0.77      2377
          8       0.85      0.28      0.42       157
          9       0.00      0.00      0.00        27
         10       0.77      0.92      0.84      1244
         11       0.00      0.00      0.00         7
         12       0.00      0.00      0.00        37
         13       0.60      0.04      0.08       140
         14       0.00      0.00      0.00        33
         15       0.70      0.84      0.76       548
         16       0.55      0.26      0.36       187
         17       0.00      0.00      0.00   

### A 67.5% accuracy on multinomial NB using CV. No parameter optimization or dataset cleaning has been done. Let's try this off the bat on a TF-IDF

In [76]:
new_df = df[["Consumer complaint narrative", "Product"]]

In [77]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
new_df = new_df[pd.notnull(new_df['Consumer complaint narrative'])]
new_df["Product"]= le.fit_transform(new_df["Product"])

In [78]:
new_df.shape

(34343, 2)

In [79]:
X = new_df["Consumer complaint narrative"]
y = new_df["Product"]

In [80]:
X = pd.DataFrame(X)

In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [82]:
tfidf = TfidfVectorizer()

In [83]:
tfidf.fit(X["Consumer complaint narrative"])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [84]:
vector = tfidf.transform(X["Consumer complaint narrative"])

In [85]:
vector.shape

(34343, 33772)

In [86]:
vector_values = vector.toarray()

In [87]:
X_train,X_test,y_train,y_test = tts(vector_values,y,test_size = 0.3, random_state = 42, stratify = y)

In [88]:
nb = MultinomialNB()

In [89]:
nb.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [90]:
y_pred = nb.predict(X_test)

In [91]:
accuracy_score(y_test,y_pred)

0.5246044841308357

### TF-IDF will not work well unless the data set is cleaned, which has not been done here. Cleaning the dataset involves removing stop words such as "a", "an" and "the", as these words do not really have any significance in finding the importance of the sentences. Lemmatization or stemming, which is basically reducing a word to its base form (such as closing and closer to close), hasn't been done either. In the last part of this tutorial, let's see if we can improve the accuracy of the TF-IDF

Just by removing stop words, we see that the accuracy rises by about 3.5 percentage points (from 52.5% to 56.0%). Note that the run below also uses a different split (test_size=0.2, no stratification), so the comparison is only approximate.
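The effect of stop-word removal on the vocabulary can be seen on a toy example. This sketch uses scikit-learn's built-in `stop_words="english"` list rather than the NLTK list used below, purely to keep it self-contained; the two sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The bank is closing the account", "A bank closed an account"]

with_stop = CountVectorizer().fit(docs)
without_stop = CountVectorizer(stop_words="english").fit(docs)

vocab_all = sorted(with_stop.vocabulary_)
vocab_kept = sorted(without_stop.vocabulary_)

print("all words:    ", vocab_all)
print("stop words cut:", vocab_kept)
```

Note that "closed" and "closing" still remain separate columns after stop-word removal; merging such variants is what stemming or lemmatization is for.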

In [92]:
from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()

In [93]:
from nltk.corpus import stopwords
from string import punctuation

In [94]:
stop_words = set(stopwords.words('english')+list(punctuation))

In [95]:
tfidf = TfidfVectorizer(stop_words=list(stop_words))  # recent scikit-learn versions expect a list here, not a set

In [96]:
tfidf.fit(X["Consumer complaint narrative"])
vector = tfidf.transform(X["Consumer complaint narrative"])
vector_values = vector.toarray()
X_train,X_test,y_train,y_test = tts(vector_values,y,test_size = 0.2, random_state = 42)
nb = MultinomialNB()
nb.fit(X_train,y_train)
y_pred = nb.predict(X_test)
accuracy_score(y_test,y_pred)


0.559761246178483

In [97]:
corpus = X["Consumer complaint narrative"].tolist()

In [98]:
corpus

['When my loan was switched over to Navient i was never told that i had a deliquint balance because with XXXX i did not. When going to purchase a vehicle i discovered my credit score had been dropped from the XXXX into the XXXX. I have been faithful at paying my student loan. I was told that Navient was the company i had delinquency with. I contacted Navient to resolve this issue you and kept being told to just contact the credit bureaus and expalin the situation and maybe they could help me. I was so angry that i just hurried and paid the balance off and then after tried to dispute the delinquency with the credit bureaus. I have had so much trouble bringing my credit score back up.',
 'I tried to sign up for a spending monitoring program and Capital One will not let me access my account through them',
 'My mortgage is with BB & T Bank, recently I have been investigating ways to pay down my mortgage faster and I came across Biweekly Mortgage Calculator on BB & T \'s website. It\'s a ni

In [99]:
len(corpus)

34343

In [100]:
final_corpus = []
for doc in corpus:
    # lowercase and tokenize, drop stop words and punctuation, then stem each token
    tokens = word_tokenize(doc.lower())
    stems = [lancaster_stemmer.stem(tok) for tok in tokens if tok not in stop_words]
    final_corpus.append(" ".join(stems))

In [101]:
print (len(final_corpus))

34343


In [147]:
X1 = pd.DataFrame(final_corpus, columns = ["Consumer complaint narrative"])
y = new_df["Product"]

In [148]:
y.shape

(34343,)

In [149]:
tfidf.fit(X1["Consumer complaint narrative"])
vector = tfidf.transform(X1["Consumer complaint narrative"])
vector_values = vector.toarray()
X_train,X_test,y_train,y_test = tts(vector_values,y,test_size = 0.2, random_state = 0)
nb = MultinomialNB()
nb.fit(X_train,y_train)
y_pred = nb.predict(X_test)
accuracy_score(y_test,y_pred)

0.5492793710874946

### This brings us to a question: if neither TF-IDF nor CV has given us an accuracy over 70% (TF-IDF at about 56% and CV at about 67.5%), is something wrong with the dataset?


In [150]:
y.value_counts()

6     10112
7      7922
10     4146
4      2292
5      2040
15     1828
1      1334
3      1195
0       896
16      622
2       615
8       525
13      466
12      125
14      109
9        90
11       24
17        2
Name: Product, dtype: int64

### As suspected, the dataset is heavily imbalanced. Categories 6, 7 and 10 together account for about 65% of the rows, while every other category has a share of less than 7%.
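The class shares can be worked out directly from the counts printed above. The sketch below copies those counts by hand and also computes the accuracy of a trivial classifier that always predicts the largest class, which is a useful floor when judging the accuracies seen so far.

```python
# Class counts copied from the y.value_counts() output above.
counts = {6: 10112, 7: 7922, 10: 4146, 4: 2292, 5: 2040, 15: 1828, 1: 1334,
          3: 1195, 0: 896, 16: 622, 2: 615, 8: 525, 13: 466, 12: 125,
          14: 109, 9: 90, 11: 24, 17: 2}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the most common class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"class {label:>2}: {share:6.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

With such skewed classes, overall accuracy says little about the rare categories, which is why the per-class recall in the classification report is worth reading alongside it.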

In [151]:
from imblearn.over_sampling import SMOTE

In [152]:
X = new_df["Consumer complaint narrative"]
y = new_df["Product"]

In [153]:
# We will have to convert these to numbers before we can oversample them.
X1.head(4)

Unnamed: 0,Consumer complaint narrative
0,loan switch navy nev told deliquint bal xxxx g...
1,tri sign spend monit program capit on let acce...
2,mortg bb bank rec investig way pay mortg fast ...
3,entir lend expery cit bank terr lend kept push...


In [154]:
y.head(4)

1     15
2      4
7     10
13    10
Name: Product, dtype: int64

In [155]:
X1 = X1.iloc[:5000]
y = y.iloc[:5000]

In [156]:
X1.shape

(5000, 1)

In [157]:
y.shape

(5000,)

In [158]:
y.value_counts()

6     1491
7     1174
10     592
4      316
5      295
15     272
1      195
3      180
0      119
16      92
2       85
8       75
13      64
14      17
9       16
12      15
11       2
Name: Product, dtype: int64

In [159]:
tfidf.fit(X1["Consumer complaint narrative"])
vector = tfidf.transform(X1["Consumer complaint narrative"])
vector_values = vector.toarray()

In [160]:
#Our X values are now vector_values.

In [164]:
from imblearn.over_sampling import RandomOverSampler

In [165]:
ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(vector_values, y)  # fit_sample was renamed to fit_resample in newer imbalanced-learn releases

In [166]:
X_ros.shape

(25347, 8982)

In [171]:
y_ros = pd.Series(y_ros)

In [172]:

X_train,X_test,y_train,y_test = tts(X_ros,y_ros,test_size = 0.3, random_state = 0)
nb = MultinomialNB()
nb.fit(X_train,y_train)
y_pred = nb.predict(X_test)
accuracy_score(y_test,y_pred)

0.9000657462195923

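### Looks like we got a 90% accuracy with TF-IDF and oversampling!!

A word of caution: the oversampling here was done before the train/test split, so copies of the same complaint can end up in both the training and the test set. That makes the 90% figure optimistic. For a fairer estimate, split first and oversample only the training portion.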
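Oversampling before splitting, as done above, lets duplicated rows land on both sides of the split. The sketch below illustrates this with a hand-rolled random oversampler in plain Python (a stand-in for `RandomOverSampler`, written only for illustration) on hypothetical toy rows, and compares it with splitting first.

```python
import random

random.seed(0)

# Hypothetical toy data: (row id, label) pairs with a 200 : 10 class imbalance.
rows = [(i, "major") for i in range(200)] + [(i, "minor") for i in range(200, 210)]

def oversample(data):
    """Hand-rolled random oversampling: repeat minority rows until classes match."""
    major = [r for r in data if r[1] == "major"]
    minor = [r for r in data if r[1] == "minor"]
    if not minor:
        return list(data)
    extra = random.choices(minor, k=max(0, len(major) - len(minor)))
    return major + minor + extra

def split(data, test_fraction=0.3):
    """Shuffle and cut into (train, test)."""
    shuffled = random.sample(data, len(data))
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]

def shared_ids(train, test):
    """Row ids that occur in both parts."""
    return {r[0] for r in train} & {r[0] for r in test}

# Order used above: oversample everything, then split.
train_a, test_a = split(oversample(rows))
leaked_a = shared_ids(train_a, test_a)

# Alternative: split first, then oversample the training part only.
train_b, test_b = split(rows)
train_b = oversample(train_b)
leaked_b = shared_ids(train_b, test_b)

print("oversample then split:", len(leaked_a), "row ids shared by train and test")
print("split then oversample:", len(leaked_b), "row ids shared by train and test")
```

When the split comes first, no row can appear on both sides, so the test score reflects how the model handles complaints it has not seen.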