# Working with Text Data

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

## Text Data

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors of a fixed size rather than raw text documents of variable length.

Vectorization is the general process of turning a collection of text documents into numerical feature vectors. The specific strategy used here (tokenizing each document, counting token occurrences, and optionally normalizing the counts) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences, and the relative position of the words in the document is ignored entirely.
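
To make the tokenizing and counting steps concrete, here is a minimal pure-Python sketch of the idea. It is not how scikit-learn implements `CountVectorizer`; the helper names `tokenize` and `bag_of_words` are hypothetical, and tokenization is simplified to lowercasing and keeping runs of two or more word characters.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplifying assumption: lowercase, keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # Vocabulary: every distinct token, sorted, mapped to a column index.
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    vocabulary = {tok: i for i, tok in enumerate(tokens)}
    # One row per document, one column per vocabulary term.
    matrix = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        matrix.append([counts.get(tok, 0) for tok in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(['the cat sat on the mat', 'the dog sat'])
print(vocabulary)
print(matrix)
```

Word order is lost in the result: each row only records how often each vocabulary term appears in that document.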

We will use `CountVectorizer` to **convert text into a matrix of token counts**.

`Bag of Words`: https://machinelearningmastery.com/gentle-introduction-bag-words-model/

`Code Example`: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/  

**We will work through the following steps to understand the entire process:**  
a. Converting text to numerical vectors with the help of `CountVectorizer`  
b. Understanding `fit` and `transform`  
c. Looking at `vocabulary_`  
d. Converting a sparse matrix to a dense matrix using `toarray()`  
e. Understanding n-grams (`ngram_range`)  

In [2]:
# Bag of Words

from sklearn.feature_extraction.text import CountVectorizer

# Let's create 'lst_text', which will contain four documents
lst_text = ['it was the best of times', 'it was the worst of times',
            'it was the age of wisdom', 'it was the age of foolishness']

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vocab = CountVectorizer()

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
# dtm = vocab.fit_transform(lst_text)

# The same result can be obtained in two separate steps, as shown below
vocab.fit(lst_text)
dtm = vocab.transform(lst_text)

In [3]:
# We can look at unique words by using 'vocabulary_'

vocab.vocabulary_

{'it': 3,
 'was': 7,
 'the': 5,
 'best': 1,
 'of': 4,
 'times': 6,
 'worst': 9,
 'age': 0,
 'wisdom': 8,
 'foolishness': 2}
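
The dictionary maps each term to its column index in the document-term matrix, and the indices here follow the alphabetical order of the terms rather than the order of first appearance. A small sketch, using the printed mapping copied as a literal and the hypothetical name `terms_by_column`, recovers the column order:

```python
# The mapping printed above, copied as a literal.
vocabulary = {'it': 3, 'was': 7, 'the': 5, 'best': 1, 'of': 4,
              'times': 6, 'worst': 9, 'age': 0, 'wisdom': 8, 'foolishness': 2}

# Sort the terms by their column index to recover the column order.
terms_by_column = sorted(vocabulary, key=vocabulary.get)
print(terms_by_column)
```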

In [4]:
# Observe that the type of dtm is sparse

print(type(dtm))

<class 'scipy.sparse.csr.csr_matrix'>


In [5]:
# Let's now print the shape of this dtm

print(dtm.shape)

# Output: (4, 10)
# i.e. 4 documents (rows) and 10 unique words (columns)

(4, 10)


In [6]:
# Lets look at the dtm

print(dtm)

# Remember that dtm is a sparse matrix, i.e. zeros are not stored.
# Each output line has the form (row, column)    count.
# Take the line (0, 6)    1:
# (0, 6) means document 0 and the term with index 6 in vocabulary_
# (both indices start from 0; there are 4 documents and 10 unique words).
# The trailing 1 is the number of occurrences of that term.
# In plain English: 'times' occurs 1 time in document 0.
# Try reading (3, 3)    1 the same way.

  (0, 1)	1
  (0, 3)	1
  (0, 4)	1
  (0, 5)	1
  (0, 6)	1
  (0, 7)	1
  (1, 3)	1
  (1, 4)	1
  (1, 5)	1
  (1, 6)	1
  (1, 7)	1
  (1, 9)	1
  (2, 0)	1
  (2, 3)	1
  (2, 4)	1
  (2, 5)	1
  (2, 7)	1
  (2, 8)	1
  (3, 0)	1
  (3, 2)	1
  (3, 3)	1
  (3, 4)	1
  (3, 5)	1
  (3, 7)	1
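
As a sketch of how to read these entries, the snippet below decodes the six lines for document 0 back into words. The entries and the alphabetically ordered term list are copied by hand from the outputs above; the names `entries` and `terms_by_column` are hypothetical.

```python
# Entries for document 0, copied from the printed output: (row, column, count).
entries = [(0, 1, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1), (0, 6, 1), (0, 7, 1)]

# Vocabulary terms listed in column order (index 0 first).
terms_by_column = ['age', 'best', 'foolishness', 'it', 'of',
                   'the', 'times', 'was', 'wisdom', 'worst']

# Map each stored column index back to its term.
decoded = {terms_by_column[col]: count for _, col, count in entries}
print(decoded)
```

The decoded terms are exactly the words of 'it was the best of times', each with a count of 1.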


In [7]:
# Since the dtm is sparse, let's convert it into a dense NumPy array.

print(dtm.toarray())

[[0 1 0 1 1 1 1 1 0 0]
 [0 0 0 1 1 1 1 1 0 1]
 [1 0 0 1 1 1 0 1 1 0]
 [1 0 1 1 1 1 0 1 0 0]]


In [8]:
# Unigrams and bigrams (1-grams and 2-grams)

vocab = CountVectorizer(ngram_range=(1, 2))

dtm = vocab.fit_transform(lst_text)

In [9]:
print(vocab.vocabulary_)

{'it': 5, 'was': 16, 'the': 11, 'best': 2, 'of': 7, 'times': 15, 'it was': 6, 'was the': 17, 'the best': 13, 'best of': 3, 'of times': 9, 'worst': 19, 'the worst': 14, 'worst of': 20, 'age': 0, 'wisdom': 18, 'the age': 12, 'age of': 1, 'of wisdom': 10, 'foolishness': 4, 'of foolishness': 8}


In [10]:
# Convert the sparse matrix to a NumPy array and print it
print(dtm.toarray())

[[0 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 0]
 [0 0 0 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1]
 [1 1 0 0 0 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0]
 [1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 1 0 0 0]]
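
To see where the 21 columns come from, here is a minimal pure-Python sketch of n-gram generation. The helper `ngrams` is hypothetical and splits on whitespace only, which is enough for these four documents but is simpler than scikit-learn's tokenizer.

```python
def ngrams(tokens, n_min, n_max):
    # All contiguous runs of n tokens, for every n from n_min to n_max.
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

docs = ['it was the best of times', 'it was the worst of times',
        'it was the age of wisdom', 'it was the age of foolishness']

# Unigrams and bigrams of the first document.
terms = ngrams(docs[0].split(), 1, 2)
print(terms)

# Sorted set of all unigrams and bigrams across the four documents.
vocabulary = sorted({g for d in docs for g in ngrams(d.split(), 1, 2)})
print(len(vocabulary))
```

Each six-word document contributes 6 unigrams and 5 bigrams; across the four documents there are 10 distinct unigrams and 11 distinct bigrams.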


**Observations:**

- `vocab.fit(lst_text)` **learns the vocabulary**
- `vocab.transform(lst_text)` **uses the fitted vocabulary** to build a **document-term matrix**
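
One consequence of these two observations is that `transform` only counts terms learned during `fit`. The sketch below, using a made-up document `new_doc`, illustrates that a word absent from the fitted vocabulary ('days') gets no column:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['it was the best of times', 'it was the worst of times',
        'it was the age of wisdom', 'it was the age of foolishness']

vocab = CountVectorizer()
vocab.fit(docs)  # learn the 10-term vocabulary

# 'days' was never seen during fit, so transform has no column for it.
new_doc = ['it was the best of days']
row = vocab.transform(new_doc).toarray()[0].tolist()
print(row)
```

This is why the vectorizer is fitted on the training text only and then reused, unchanged, on the test text.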

## Read the data

In [11]:
df = pd.read_csv('data/text_data.csv')

df.head()

Unnamed: 0,label,clean_text_stem
0,ham,subject enron methanol meter follow note gave ...
1,ham,subject hpl nom januari see attach file hplnol...
2,ham,subject neon retreat ho ho ho around wonder ti...
3,spam,subject photoshop window offic cheap main tren...
4,ham,subject indian spring deal book teco pvr reven...


In [12]:
df['label'].value_counts(normalize=True)

ham     0.710114
spam    0.289886
Name: label, dtype: float64
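
The classes are imbalanced, which gives a useful reference point: a model that always predicts the majority class would already be right about 71% of the time on data with this split. A small sketch with toy labels (illustrative only, not the actual dataset):

```python
from collections import Counter

# Toy labels with roughly the 71/29 split shown above.
labels = ['ham'] * 71 + ['spam'] * 29

# Accuracy of always predicting the most common label.
majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
print(majority_label, baseline_accuracy)
```

Accuracy figures for any classifier on this data are best read against that baseline.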

## Data Preparation

In [13]:
label_encoder = {'ham' : 0, 'spam' : 1}

df['label_num'] = df['label'].map(label_encoder)

In [14]:
# Split the data into train (80%) and test (20%) sets

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=42)
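
The split above is purely random. With imbalanced labels, one option (not used in this notebook) is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both parts. A sketch on toy data, where `samples` and `labels` are made up for illustration:

```python
from sklearn.model_selection import train_test_split

# Toy data: 70 'ham' (0) and 30 'spam' (1) samples.
samples = list(range(100))
labels = [0] * 70 + [1] * 30

# stratify=labels keeps the 70/30 ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    samples, labels, test_size=0.2, random_state=42, stratify=labels)

print(len(y_test), sum(y_test))    # test size and number of spam samples
print(len(y_train), sum(y_train))  # train size and number of spam samples
```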

In [15]:
train_clean_text = train['clean_text_stem'].tolist()

test_clean_text = test['clean_text_stem'].tolist()

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1,2))

train_features = vectorizer.fit_transform(train_clean_text)

test_features = vectorizer.transform(test_clean_text)

In [17]:
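# With ngram_range=(1, 2) the vocabulary holds unigrams and bigrams,
# so this count covers unique terms, not only single words.
print("Total unique words:", len(vectorizer.vocabulary_))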

print("Type of train features:", type(train_features))

print("Shape of input data:", train_features.shape)

Total unique words: 218908
Type of train features: <class 'scipy.sparse.csr.csr_matrix'>
Shape of input data: (4136, 218908)
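
The shape explains why the features stay in sparse form. A rough back-of-the-envelope estimate, assuming 8-byte integer cells, of what a dense array of this shape would need:

```python
# Shape reported above.
n_rows, n_cols = 4136, 218908

# Assumption: each cell is stored as an 8-byte integer (int64).
bytes_per_cell = 8

dense_bytes = n_rows * n_cols * bytes_per_cell
print(f"{dense_bytes / 1e9:.2f} GB")
```

That is roughly 7 GB for a matrix that is almost entirely zeros, so calling `toarray()` on the full feature matrix is best avoided.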


## Logistic Regression

In [18]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(train_features, train['label_num'])

LogisticRegression()

In [19]:
pred = classifier.predict(test_features)

In [20]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
print(accuracy_score(test['label_num'], pred))

0.9758454106280193
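
Because the classes are imbalanced, accuracy alone can hide how well the minority class (spam) is detected. The sketch below uses made-up toy labels, not the notebook's predictions, to show how `confusion_matrix`, `precision_score` and `recall_score` break a result down:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy example: 0 = ham, 1 = spam (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred).tolist()
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of predicted spam, how much is spam
recall = recall_score(y_true, y_pred)        # of actual spam, how much was found

print(cm, accuracy, precision, recall)
```

In this toy case accuracy is about 0.67 while only half of the spam is caught, which is the kind of gap these metrics expose.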


## Decision Tree

In [21]:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(train_features, train['label_num'])

DecisionTreeClassifier()

In [22]:
pred = classifier.predict(test_features)

In [23]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
print(accuracy_score(test['label_num'], pred))

0.9458937198067633


## Random Forest

In [24]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(train_features, train['label_num'])

RandomForestClassifier()

In [25]:
pred = classifier.predict(test_features)

In [26]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
print(accuracy_score(test['label_num'], pred))

0.9758454106280193
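
The three experiments follow the same vectorize, fit, predict pattern, so they could be expressed as one loop over pipelines. The sketch below is one possible arrangement on made-up toy texts, not the notebook's data. Setting `random_state` on the tree-based models is an added assumption: without it, their scores can vary from run to run.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy training data (illustrative only): 1 = spam, 0 = ham.
texts = ['cheap offer click now', 'meeting schedule attach file',
         'win cheap prize now', 'see attach report meeting']
labels = [1, 0, 1, 0]

models = {
    'logistic_regression': LogisticRegression(),
    'decision_tree': DecisionTreeClassifier(random_state=0),
    'random_forest': RandomForestClassifier(random_state=0),
}

predictions = {}
for name, model in models.items():
    # Each pipeline vectorizes the raw text and then fits the classifier.
    pipe = make_pipeline(CountVectorizer(ngram_range=(1, 2)), model)
    pipe.fit(texts, labels)
    predictions[name] = pipe.predict(['cheap prize offer']).tolist()

print(predictions)
```

A pipeline also guarantees that the vectorizer is fitted on the training text only, since `predict` calls `transform` rather than `fit_transform` on new text.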
