# Working with Text Data - Text Preprocessing

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

### Text Data

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors of a fixed size rather than raw text documents of variable length.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. The specific strategy used here combines tokenization (splitting text into tokens), counting (tallying how often each token occurs in each document) and, optionally, normalization. It is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position of the words in the document.
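
To make the idea concrete, here is a minimal hand-rolled sketch of tokenization and counting that uses only the Python standard library. The helper name `bag_of_words` and the plain whitespace tokenizer are assumptions made for illustration; `CountVectorizer` applies its own tokenization rules.

```python
from collections import Counter

def bag_of_words(document):
    """Hypothetical helper: lower-case, split on whitespace, count tokens."""
    tokens = document.lower().split()
    return Counter(tokens)

# Two documents with the same words in a different order
a = bag_of_words("it was the best of times")
b = bag_of_words("times of the best it was")

# Word order is discarded, so both documents get the same representation
print(a == b)

# A Counter returns 0 for words that never occurred
print(a["best"], a["worst"])
```

Because only the counts are kept, any information carried by word order is lost, which is the trade-off the Bag of Words model accepts.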

We will use `CountVectorizer` to **convert text into a matrix of token counts**.

`Bag of Words`: https://machinelearningmastery.com/gentle-introduction-bag-words-model/

`Code Example`: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/  

**We will go through the following steps to understand the entire process:**  
a. Converting text to numerical vectors with the help of `CountVectorizer`  
b. Understanding `fit` and `transform`  
c. Looking at `vocabulary_`  
d. Converting a sparse matrix to a dense matrix using `toarray()`  
e. Understanding n-grams via `ngram_range`  

In [2]:
# Bag of Words

from sklearn.feature_extraction.text import CountVectorizer

# Let's create 'lst_text', which will contain four documents
lst_text = ['it was the best of times', 
            'it was the worst of times',
            'it was the age of wisdom', 
            'it was the age of foolishness']

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.
vocab = CountVectorizer()

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
# dtm = vocab.fit_transform(lst_text)

# The same result can be obtained by calling fit() and transform() separately
vocab.fit(lst_text)
dtm = vocab.transform(lst_text)
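
The following sketch checks, for this small corpus, that the one-step and two-step routes give the same matrix, and shows what `transform` does with a word it never saw during `fit`. The variable names and the extra sentence containing `days` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['it was the best of times',
        'it was the worst of times',
        'it was the age of wisdom',
        'it was the age of foolishness']

# One-step version
one_step = CountVectorizer().fit_transform(docs)

# Two-step version
vectorizer = CountVectorizer()
vectorizer.fit(docs)
two_step = vectorizer.transform(docs)

# Both routes produce the same document-term matrix
same = (one_step.toarray() == two_step.toarray()).all()

# transform() reuses the fitted vocabulary, so the unseen word 'days'
# has no column and does not contribute to the row
new_row = vectorizer.transform(['it was the best of days']).toarray()[0]
print(same, new_row.sum())
```

The new sentence has six words, but only five of them are in the learned vocabulary, so the row sums to 5.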

In [3]:
# We can look at unique words by using 'vocabulary_'

vocab.vocabulary_

{'it': 3,
 'was': 7,
 'the': 5,
 'best': 1,
 'of': 4,
 'times': 6,
 'worst': 9,
 'age': 0,
 'wisdom': 8,
 'foolishness': 2}

In [4]:
# Observe that the type of dtm is sparse

print(type(dtm))

<class 'scipy.sparse.csr.csr_matrix'>


In [5]:
# Let's now print the shape of this dtm

print(dtm.shape)

# Output -> (4, 10)
# i.e. 4 documents and 10 unique words

(4, 10)


In [6]:
# Let's look at the dtm

print(dtm)

# Remember that dtm is a sparse matrix, i.e. zeros are not stored.
# Let's understand one line of the output -> (0, 6)    1
# Here (0, 6) means document 0 (row index) and vocabulary index 6 (column index);
# both indices are 0-based.
# (we have 4 documents and 10 unique words in total)
# The trailing 1 is the number of occurrences of that word in that document.
# Now let's read it in plain English:
# (0, 6)    1 -> 'times' occurs 1 time in document 0.
# Try to read (3, 3)    1 in the same way.

  (0, 1)	1
  (0, 3)	1
  (0, 4)	1
  (0, 5)	1
  (0, 6)	1
  (0, 7)	1
  (1, 3)	1
  (1, 4)	1
  (1, 5)	1
  (1, 6)	1
  (1, 7)	1
  (1, 9)	1
  (2, 0)	1
  (2, 3)	1
  (2, 4)	1
  (2, 5)	1
  (2, 7)	1
  (2, 8)	1
  (3, 0)	1
  (3, 2)	1
  (3, 3)	1
  (3, 4)	1
  (3, 5)	1
  (3, 7)	1
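
One way to read these entries is to invert `vocabulary_` so that a column index maps back to its word. The sketch below copies the vocabulary and a few entries from the outputs above; the names `index_to_word` and `entries` are illustrative.

```python
# Vocabulary copied from the output of vocab.vocabulary_ above
vocabulary = {'it': 3, 'was': 7, 'the': 5, 'best': 1, 'of': 4,
              'times': 6, 'worst': 9, 'age': 0, 'wisdom': 8,
              'foolishness': 2}

# Invert the mapping: column index -> word
index_to_word = {index: word for word, index in vocabulary.items()}

# A few (row, column, count) entries taken from the printed dtm
entries = [(0, 6, 1), (1, 9, 1), (3, 3, 1)]

for doc_id, col, count in entries:
    word = index_to_word[col]
    print(f"'{word}' occurs {count} time(s) in document {doc_id}")
```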


In [7]:
# Since the dtm is sparse, let's convert it into a dense NumPy array.

print(dtm.toarray())

[[0 1 0 1 1 1 1 1 0 0]
 [0 0 0 1 1 1 1 1 0 1]
 [1 0 0 1 1 1 0 1 1 0]
 [1 0 1 1 1 1 0 1 0 0]]
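
Conceptually, the dense form starts as a grid of zeros and places each stored count at its (row, column) position. The sketch below illustrates that idea with plain Python lists for documents 0 and 3; it is not how SciPy implements `toarray()`.

```python
# (row, column) positions of the non-zero entries for documents 0 and 3,
# copied from the sparse output above; every stored count is 1
nonzero = [(0, 1), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7),
           (3, 0), (3, 2), (3, 3), (3, 4), (3, 5), (3, 7)]

n_docs, n_words = 4, 10

# Start with all zeros, then fill in the stored entries
dense = [[0] * n_words for _ in range(n_docs)]
for row, col in nonzero:
    dense[row][col] = 1

print(dense[0])
print(dense[3])
```

The two printed rows match the first and last rows of the array above. Rows 1 and 2 stay at zero here only because their entries were left out of the sketch.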


In [8]:
# Unigrams and bigrams (1-grams and 2-grams)

vocab = CountVectorizer(ngram_range=(1, 2))

vocab.fit(lst_text)

dtm = vocab.transform(lst_text)

In [9]:
print(vocab.vocabulary_)

{'it': 5, 'was': 16, 'the': 11, 'best': 2, 'of': 7, 'times': 15, 'it was': 6, 'was the': 17, 'the best': 13, 'best of': 3, 'of times': 9, 'worst': 19, 'the worst': 14, 'worst of': 20, 'age': 0, 'wisdom': 18, 'the age': 12, 'age of': 1, 'of wisdom': 10, 'foolishness': 4, 'of foolishness': 8}


In [10]:
# convert sparse matrix to numpy array
print(dtm.toarray())

[[0 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 0 0 0]
 [0 0 0 0 0 1 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1]
 [1 1 0 0 0 1 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0]
 [1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 0 1 1 0 0 0]]
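
To see where the 21 columns come from, the sketch below builds unigrams and bigrams by hand with the standard library. The helper `word_ngrams` is hypothetical and splits on whitespace only, which is enough for these four sentences.

```python
def word_ngrams(tokens, min_n, max_n):
    """Hypothetical helper: contiguous word n-grams for min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

docs = ['it was the best of times',
        'it was the worst of times',
        'it was the age of wisdom',
        'it was the age of foolishness']

# Collect every distinct unigram and bigram across the corpus
features = set()
for doc in docs:
    features.update(word_ngrams(doc.split(), 1, 2))

# Number the features in alphabetical order
vocabulary = {gram: i for i, gram in enumerate(sorted(features))}
print(len(vocabulary), vocabulary['it was'], vocabulary['worst of'])
```

For this corpus the hand-built mapping has 10 unigrams plus 11 bigrams, and the indices of `'it was'` and `'worst of'` agree with the `vocabulary_` output printed above.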


**Observations:**

- `vocab.fit(lst_text)` **learns the vocabulary**
- `vocab.transform(lst_text)` **uses the fitted vocabulary** to build a **document-term matrix**

### Data Preprocessing

**Text preprocessing steps:**

a. Removing special characters  
b. Converting sentences to lower case  
c. Removing stop words  
d. Stemming or lemmatization


In [11]:
lst_text = ['it Was the best oF Times $', 
            'It was The worst of times.',
            'IT 9 was tHe age Of wisdom', 
            'it was thE age of foolishness']

df = pd.DataFrame({'text': lst_text})

df.head()

|   | text |
|---|---|
| 0 | it Was the best oF Times $ |
| 1 | It was The worst of times. |
| 2 | IT 9 was tHe age Of wisdom |
| 3 | it was thE age of foolishness |


In [12]:
# !pip install nltk

In [13]:
# import nltk
# nltk.download('stopwords')

In [14]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [15]:
## initialise the inbuilt Stemmer
stemmer = PorterStemmer()
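
As a quick illustration of what the stemmer does, the sketch below stems three words from our sentences. A stemmer strips suffixes by rule, so the result is not always the form a dictionary would list.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Three words taken from the example sentences
words = ['times', 'foolishness', 'wisdom']
stems = [stemmer.stem(word) for word in words]

print(stems)
```

Here `times` and `foolishness` lose their suffixes, while `wisdom` is left unchanged.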

In [16]:
## We can also use Lemmatizer instead of Stemmer
lemmatizer = WordNetLemmatizer()

In [17]:
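def preprocess(raw_text, flag):
    # Replace everything except letters (special characters, digits) with a space
    sentence = re.sub("[^a-zA-Z]", " ", raw_text)

    # Change the sentence to lower case
    sentence = sentence.lower()

    # Tokenize into words by splitting on whitespace
    tokens = sentence.split()

    # Remove stop words; build the set once per call instead of
    # reloading the stop word list for every token
    stop_words = set(stopwords.words("english"))
    clean_tokens = [t for t in tokens if t not in stop_words]

    # Stemming if flag is 'stem', lemmatization for any other value
    if flag == 'stem':
        clean_tokens = [stemmer.stem(word) for word in clean_tokens]
    else:
        clean_tokens = [lemmatizer.lemmatize(word) for word in clean_tokens]

    # Return the cleaned text and its token count as a two-element Series
    return pd.Series([" ".join(clean_tokens), len(clean_tokens)])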
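
The first three steps of the function can be traced on a single row using only the standard library. In the sketch below, the four-word `stop_words` set is a stand-in chosen for this example, not NLTK's English stop word list.

```python
import re

raw_text = 'IT 9 was tHe age Of wisdom'

# Step a: replace every non-letter (here the digit 9) with a space
letters_only = re.sub("[^a-zA-Z]", " ", raw_text)

# Step b and tokenization: lower-case, then split on whitespace
tokens = letters_only.lower().split()

# Step c: drop stop words (assumed mini list, for illustration only)
stop_words = {'it', 'was', 'the', 'of'}
clean_tokens = [t for t in tokens if t not in stop_words]

print(tokens)
print(clean_tokens)
```

Because `split()` collapses runs of whitespace, the extra spaces left behind by the substitution do not produce empty tokens.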

In [18]:
temp_df = df['text'].apply(lambda x : preprocess(x, 'stem'))

temp_df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | best time | 2 |
| 1 | worst time | 2 |
| 2 | age wisdom | 2 |
| 3 | age foolish | 2 |


In [19]:
temp_df.columns = ['clean_text_stem', 'text_length_stem']

temp_df.head()

|   | clean_text_stem | text_length_stem |
|---|---|---|
| 0 | best time | 2 |
| 1 | worst time | 2 |
| 2 | age wisdom | 2 |
| 3 | age foolish | 2 |


In [20]:
df = pd.concat([df, temp_df], axis=1)

df.head()

|   | text | clean_text_stem | text_length_stem |
|---|---|---|---|
| 0 | it Was the best oF Times $ | best time | 2 |
| 1 | It was The worst of times. | worst time | 2 |
| 2 | IT 9 was tHe age Of wisdom | age wisdom | 2 |
| 3 | it was thE age of foolishness | age foolish | 2 |


In [21]:
import nltk

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\Kanav
[nltk_data]     Bansal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [22]:
temp_df = df['text'].apply(lambda x: preprocess(x, 'lemma'))

temp_df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | best time | 2 |
| 1 | worst time | 2 |
| 2 | age wisdom | 2 |
| 3 | age foolishness | 2 |


In [23]:
temp_df.columns = ['clean_text_lemma', 'text_length_lemma']

temp_df.head()

|   | clean_text_lemma | text_length_lemma |
|---|---|---|
| 0 | best time | 2 |
| 1 | worst time | 2 |
| 2 | age wisdom | 2 |
| 3 | age foolishness | 2 |


In [24]:
df = pd.concat([df, temp_df], axis=1)

df.head()

|   | text | clean_text_stem | text_length_stem | clean_text_lemma | text_length_lemma |
|---|---|---|---|---|---|
| 0 | it Was the best oF Times $ | best time | 2 | best time | 2 |
| 1 | It was The worst of times. | worst time | 2 | worst time | 2 |
| 2 | IT 9 was tHe age Of wisdom | age wisdom | 2 | age wisdom | 2 |
| 3 | it was thE age of foolishness | age foolish | 2 | age foolishness | 2 |
