## Text Classification tutorial 

### What is text classification?

Put simply, text classification is the task of assigning text to categories.

#### Just like in a typical machine learning classification problem, the idea is to classify data into pre-defined categories, which makes it a supervised learning problem.

The similarities end there. Unlike a typical classification problem, where the data or the X is usually numeric or encoded categories, the data in text classification is text. Classification algorithms cannot work with raw text directly, so the text has to be converted to numbers before an algorithm can learn from it.

#### That's where NLP, or Natural Language Processing, comes in.

In this chapter we start with vectorization, which converts text to a vector of numbers, and with text processing methods such as tokenization and lemmatization. We then tackle a data set of raw text and try to bin incoming test text into categories using the classification algorithms we already know.

### The Problem Statement

The problem is a supervised text classification problem, and our goal is to investigate which supervised machine learning methods are best suited to solve it.

#### We will start off with an extremely simple version of text classification: classifying a couple of sentences based on training data which consists of labelled sentences, using a simple Naive Bayes algorithm. We will then break this down to understand how the sentences were converted to features or numbers, and cover concepts such as count vectorization, TF-IDF and tokenization.

#### In the end we will tackle 2 problems: classifying a set of consumer complaints into 12 categories and classifying a set of paragraphs into different categories.

Say we have a couple of sentences with labels such as below:

In [2]:
train = [('I love this sandwich.', 'pos'),
    ('This is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('This is my best work.', 'pos'),
    ("What an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('He is my sworn enemy!', 'neg'),
    ('My boss is horrible.', 'neg')]

#### The train data consists of 10 tuples inside a list with the first element of each tuple being the sentence itself and the second element being the label: 'pos' or 'neg'

Our test data consists of this:

In [4]:
test = [
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg')
]


#### We import a text analysis library called textblob, which has a built-in Naive Bayes (NB) classifier.

In [6]:
from textblob import TextBlob as tb

In [7]:
from textblob.classifiers import NaiveBayesClassifier as nb

##### We create a new classifier by passing training data into the constructor for a NaiveBayesClassifier.

In [10]:
cl = nb(train)

#### To see if this works let's pass a few sentences to our cl

In [12]:
cl.classify("Their burgers are amazing")

'pos'

In [13]:
cl.classify("I don't like their pizza.")

'neg'

#### Let's check this on the test data

In [15]:
test_1 = [x[0] for x in test]

In [16]:
test_1

['The beer was good.',
 'I do not enjoy my job',
 "I ain't feeling dandy today.",
 'I feel amazing!',
 'Gary is a friend of mine.',
 "I can't believe I'm doing this."]

In [20]:
predicted_results = [cl.classify(x) for x in test_1]

In [21]:
predicted_results

['pos', 'neg', 'neg', 'pos', 'neg', 'neg']

In [23]:
original_results = [x[1] for x in test]

In [24]:
original_results

['pos', 'neg', 'neg', 'pos', 'pos', 'neg']

In [26]:
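# Collect the true labels of the sentences that were misclassified
mis_classified = []
for x, y in zip(original_results, predicted_results):
    if x != y:
        mis_classified.append(x)
# Accuracy = 1 - (fraction of misclassified sentences)
print(1 - (len(mis_classified) / len(original_results)))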

0.8333333333333334


### Looks like our NB classifier has given us an 83% accuracy for the binary classification
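
The accuracy calculation above can also be written as a small reusable helper. This is only a sketch: the function name `accuracy` is our own, and the two label lists are copied from the outputs shown earlier.

```python
# Labels copied from the cells above
original_results = ['pos', 'neg', 'neg', 'pos', 'pos', 'neg']
predicted_results = ['pos', 'neg', 'neg', 'pos', 'neg', 'neg']


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


acc = accuracy(original_results, predicted_results)
print(acc)  # 5 of the 6 predictions match the true labels
```

As far as we know, textblob classifiers also offer an `accuracy(test_set)` method that takes the list of (sentence, label) tuples directly, which should give the same figure for `cl.accuracy(test)`.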

#### Let us now take each of these sentences, convert them to numeric features, and run a Naive Bayes classifier on them ourselves

In [28]:
sentences = [x[0] for x in train]

In [29]:
labels = [x[1] for x in train]

#### Let's create a dataframe where y is the target variable which is either pos or neg and x is the sentence

In [33]:
dictionary = dict(zip(sentences,labels))

In [34]:
dictionary

{'I love this sandwich.': 'pos',
 'This is an amazing place!': 'pos',
 'I feel very good about these beers.': 'pos',
 'This is my best work.': 'pos',
 'What an awesome view': 'pos',
 'I do not like this restaurant': 'neg',
 'I am tired of this stuff.': 'neg',
 "I can't deal with this": 'neg',
 'He is my sworn enemy!': 'neg',
 'My boss is horrible.': 'neg'}

In [36]:
import pandas as pd

In [53]:
df = pd.DataFrame(list(dictionary.items()), columns=['Sentence', 'label'])

In [54]:
df

|   | Sentence | label |
|---|---|---|
| 0 | I love this sandwich. | pos |
| 1 | This is an amazing place! | pos |
| 2 | I feel very good about these beers. | pos |
| 3 | This is my best work. | pos |
| 4 | What an awesome view | pos |
| 5 | I do not like this restaurant | neg |
| 6 | I am tired of this stuff. | neg |
| 7 | I can't deal with this | neg |
| 8 | He is my sworn enemy! | neg |
| 9 | My boss is horrible. | neg |
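
One caveat about building the dataframe through a dictionary: dictionary keys are unique, so a sentence that occurs twice in the data would silently collapse to a single row. The sketch below uses a small made-up list (`pairs` is hypothetical, not part of the tutorial data) to compare that route with passing the list of tuples straight to `pd.DataFrame`.

```python
import pandas as pd

# Hypothetical data with one repeated sentence
pairs = [('I love this sandwich.', 'pos'),
         ('My boss is horrible.', 'neg'),
         ('I love this sandwich.', 'pos')]

# Route used above: the dict keeps only one entry per distinct sentence
via_dict = pd.DataFrame(list(dict(pairs).items()), columns=['Sentence', 'label'])

# Passing the tuples directly keeps every row
direct = pd.DataFrame(pairs, columns=['Sentence', 'label'])

print(len(via_dict), len(direct))
```

Our training sentences are all distinct, so both routes give the same 10 rows here.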


#### That now essentially means that our y variable or target variable is df["label"] and X is df["Sentence"]
#### We have to label encode the y variable

In [56]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [57]:
df["label"] = le.fit_transform(df["label"])
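
To see what the label encoder actually does, here is a minimal sketch on a short list of labels (the list itself is made up for illustration). `LabelEncoder` sorts the distinct labels and numbers them from 0, so 'neg' becomes 0 and 'pos' becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['pos', 'pos', 'neg', 'neg']  # illustrative labels

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(list(le.classes_))   # distinct labels in sorted order
print(encoded.tolist())    # integer code for each input label

# inverse_transform maps the integers back to the original strings
decoded = le.inverse_transform(encoded)
print(list(decoded))
```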

#### The X variable now has to be converted to numbers. We will use a count vectorizer for this. Let's understand the math behind it

In [60]:
sample_text = ["The quick brown fox","jumped over the lazy dog"]

#### If this has to be converted to numbers, we break down these 2 sentences into words, where each word is a feature or a number.

#### So if Document 1 (Doc 1) is the first sentence and Doc 2 is the second, and the two sentences together contain 8 unique words, we can make 2 vectors of length 8 each, where each word is a feature and its value is the number of times that word appears in that document. In other words, a matrix like the one below.

![title](img/ok.png)

#### In Doc 1, all words that are present have been encoded as 1 and the ones that are not present as 0. This is possible only when you consider the entire collection of unique words across all documents combined, which in this case is 8; this is why the vectors are 8 in length. This approach of considering all the words in the entire collection or "corpus" is called a Bag of Words (BoW) approach.
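
The counting described above can be done by hand in a few lines. This is a sketch with a deliberately simple tokenizer (lowercase, split on whitespace), and the alphabetical column order is our own choice, so it may not match the column order in the picture.

```python
from collections import Counter

sample_text = ["The quick brown fox", "jumped over the lazy dog"]

# Lowercase and split on whitespace: a deliberately simple tokenizer
tokenized = [doc.lower().split() for doc in sample_text]

# The vocabulary is the sorted set of unique words across all documents
vocabulary = sorted({word for doc in tokenized for word in doc})

# One row per document, one column per vocabulary word
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

"the" is the only word that appears in both sentences, which is why 9 words in total give only 8 columns.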

### Sklearn already has a count vectorizer, so one doesn't have to count the occurrences of the words and encode them manually.

In [66]:
from sklearn.feature_extraction.text import CountVectorizer

In [70]:
text = ["Bangalore is the capital of the state of Karnataka"]

In [71]:
vectorizer = CountVectorizer()

In [72]:
vectorizer.fit(text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [74]:
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

(1, 7)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 2 1 2]]


In [75]:
# The vector has 7 columns. There are 9 words in the sentence.
# "of" and "the" each appear twice, so the number of unique words is 7.

vector.toarray() gives the words "of" and "the" a value of 2 since they appear twice, and the rest of the words a value of 1. To find out which words correspond to the values in the vector, inspect vectorizer.vocabulary_

In [87]:
vector_values = vector.toarray().tolist()[0]

In [88]:
vector_values

[1, 1, 1, 1, 2, 1, 2]

In [91]:
vectorizer.vocabulary_


{'bangalore': 0,
 'is': 2,
 'the': 6,
 'capital': 1,
 'of': 4,
 'state': 5,
 'karnataka': 3}

In [92]:
import operator
sorted_x = sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))

In [93]:
sorted_x

[('bangalore', 0),
 ('capital', 1),
 ('is', 2),
 ('karnataka', 3),
 ('of', 4),
 ('state', 5),
 ('the', 6)]

In [95]:
words = [x[0] for x in sorted_x]

In [96]:
words

['bangalore', 'capital', 'is', 'karnataka', 'of', 'state', 'the']

In [97]:
d = dict(zip(words,vector_values))

In [98]:
d

{'bangalore': 1,
 'capital': 1,
 'is': 1,
 'karnataka': 1,
 'of': 2,
 'state': 1,
 'the': 2}

#### As we can see, the words "of" and "the" each appear twice in the sentence, hence their values are 2. This is now a numeric representation of the sentence "bangalore is the capital of the state of karnataka". That is how a count vectorizer works. We will now use it to convert all the sentences in our dataframe to numbers
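
The word-to-count mapping built step by step above can be condensed. The sketch below repeats the same idea in one place; `word_counts` is just an illustrative name. `vocabulary_` maps each word to its column index, so sorting the words by that index lines them up with the columns of the count vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Bangalore is the capital of the state of Karnataka"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and encodes the text in one call
counts = vectorizer.fit_transform(text).toarray()[0]

# Sort the words by their column index so they align with the counts
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
word_counts = {word: int(count) for word, count in zip(words, counts)}

print(word_counts)
```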

In [99]:
df

|   | Sentence | label |
|---|---|---|
| 0 | I love this sandwich. | 1 |
| 1 | This is an amazing place! | 1 |
| 2 | I feel very good about these beers. | 1 |
| 3 | This is my best work. | 1 |
| 4 | What an awesome view | 1 |
| 5 | I do not like this restaurant | 0 |
| 6 | I am tired of this stuff. | 0 |
| 7 | I can't deal with this | 0 |
| 8 | He is my sworn enemy! | 0 |
| 9 | My boss is horrible. | 0 |


In [101]:
X = df["Sentence"]

In [103]:
y = df["label"]

In [105]:
X

0                  I love this sandwich.
1              This is an amazing place!
2    I feel very good about these beers.
3                  This is my best work.
4                   What an awesome view
5          I do not like this restaurant
6              I am tired of this stuff.
7                 I can't deal with this
8                  He is my sworn enemy!
9                   My boss is horrible.
Name: Sentence, dtype: object

In [106]:
vectorizer.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [107]:
vectorizer.vocabulary_

{'love': 18,
 'this': 28,
 'sandwich': 24,
 'is': 16,
 'an': 3,
 'amazing': 2,
 'place': 22,
 'feel': 12,
 'very': 30,
 'good': 13,
 'about': 0,
 'these': 27,
 'beers': 5,
 'my': 19,
 'best': 6,
 'work': 34,
 'what': 32,
 'awesome': 4,
 'view': 31,
 'do': 10,
 'not': 20,
 'like': 17,
 'restaurant': 23,
 'am': 1,
 'tired': 29,
 'of': 21,
 'stuff': 25,
 'can': 8,
 'deal': 9,
 'with': 33,
 'he': 14,
 'sworn': 26,
 'enemy': 11,
 'boss': 7,
 'horrible': 15}
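
Notice that "I" is missing from this vocabulary and "can't" shows up only as "can". That follows from the default `token_pattern` printed above, which keeps only runs of two or more word characters after lowercasing. The sketch below applies the same regular expression with Python's `re` module; it imitates the tokenization step only and is not the vectorizer's full pipeline.

```python
import re

# Default CountVectorizer token pattern, as shown in the output above
token_pattern = re.compile(r"(?u)\b\w\w+\b")

sentence = "I can't deal with this"

# CountVectorizer lowercases by default before tokenizing
tokens = token_pattern.findall(sentence.lower())
print(tokens)
```

The single-character tokens "i" and "t" are dropped, and the apostrophe acts as a word boundary.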

In [108]:
vector = vectorizer.transform(X)

In [110]:
X = vector.toarray()

In [111]:
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
        0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0

#### Now all we have to do is create X_train, y_train, X_test, y_test and run a normal algorithm as before

In [112]:
from sklearn.model_selection import train_test_split as tts

In [145]:
#Combining both test and train to a single corpus

In [148]:
data = train+test

In [149]:
data

[('I love this sandwich.', 'pos'),
 ('This is an amazing place!', 'pos'),
 ('I feel very good about these beers.', 'pos'),
 ('This is my best work.', 'pos'),
 ('What an awesome view', 'pos'),
 ('I do not like this restaurant', 'neg'),
 ('I am tired of this stuff.', 'neg'),
 ("I can't deal with this", 'neg'),
 ('He is my sworn enemy!', 'neg'),
 ('My boss is horrible.', 'neg'),
 ('The beer was good.', 'pos'),
 ('I do not enjoy my job', 'neg'),
 ("I ain't feeling dandy today.", 'neg'),
 ('I feel amazing!', 'pos'),
 ('Gary is a friend of mine.', 'pos'),
 ("I can't believe I'm doing this.", 'neg')]

In [151]:
# data is already a list of (sentence, label) tuples, so it can be passed directly
df = pd.DataFrame(data, columns=['Sentence', 'label'])

In [153]:
df["label"] = le.fit_transform(df["label"])

In [155]:
X = df["Sentence"]

In [156]:
y = df["label"]

In [157]:
vectorizer.fit(X)
vector = vectorizer.transform(X)
X = vector.toarray()

In [158]:
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
        0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0

In [159]:
y

0     1
1     1
2     1
3     1
4     1
5     0
6     0
7     0
8     0
9     0
10    1
11    0
12    0
13    1
14    1
15    0
Name: label, dtype: int64

In [161]:
X.shape

(16, 49)

In [164]:
X_train,X_test,y_train,y_test = tts(X,y,test_size = 0.3,random_state = 43)

In [165]:
from sklearn.naive_bayes import GaussianNB

In [166]:
# Named gnb so it does not overwrite the textblob alias nb imported earlier
gnb = GaussianNB()

In [167]:
gnb.fit(X_train,y_train)

GaussianNB(priors=None)

In [168]:
y_pred = gnb.predict(X_test)

In [169]:
y_test

15    0
8     0
12    0
6     0
11    0
Name: label, dtype: int64

In [171]:
from sklearn.metrics import accuracy_score

In [172]:
accuracy_score(y_test,y_pred)

0.8

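#### We get an 80% accuracy, which is close to the Naive Bayes accuracy we got from textblob. Keep in mind that this test split has only 5 sentences, all labelled 0 (neg), so the figure is a rough indication at best.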
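
Two things in the run above are worth revisiting. The vectorizer was fitted on all 16 sentences before the split, so words from the test sentences were already in the vocabulary, and `GaussianNB` models features as continuous values even though ours are word counts. The sketch below is one possible variant, not the approach used above: it splits the raw sentences first, fits the vectorizer on the training part only through a pipeline, and swaps in `MultinomialNB`, which is a common choice for count features. The `stratify` argument is our own addition to keep both classes in each split.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The same 16 labelled sentences used in this chapter
data = [('I love this sandwich.', 'pos'),
        ('This is an amazing place!', 'pos'),
        ('I feel very good about these beers.', 'pos'),
        ('This is my best work.', 'pos'),
        ('What an awesome view', 'pos'),
        ('I do not like this restaurant', 'neg'),
        ('I am tired of this stuff.', 'neg'),
        ("I can't deal with this", 'neg'),
        ('He is my sworn enemy!', 'neg'),
        ('My boss is horrible.', 'neg'),
        ('The beer was good.', 'pos'),
        ('I do not enjoy my job', 'neg'),
        ("I ain't feeling dandy today.", 'neg'),
        ('I feel amazing!', 'pos'),
        ('Gary is a friend of mine.', 'pos'),
        ("I can't believe I'm doing this.", 'neg')]

sentences = [sentence for sentence, _ in data]
labels = [label for _, label in data]

# Split the raw text first so the test sentences stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.3, random_state=43, stratify=labels)

# The pipeline fits CountVectorizer on the training sentences only,
# then reuses that vocabulary when transforming the test sentences
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)
```

With so few sentences the score will swing a lot from one split to another, so treat it as a sanity check rather than a benchmark.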