# Text Mining Data Prep

In this notebook, we will prepare the data for text mining. 

Let's start with a small corpus.

In [1]:
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "begin begun beginning begins",
    "is was were being",
    "123 the world is large 32.34"
]
import nltk



## Create a term by document matrix

TfidfVectorizer and CountVectorizer both convert text into numeric vectors, which is necessary because models can only process numerical data.

### Using CountVectorizer

CountVectorizer only counts the number of times each word appears in a document, which biases the representation toward the most frequent words. As a result, rare words that could have helped us distinguish between documents carry little weight.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd

# CountVectorizer converts text to lowercase and, with stop_words='english', removes English stop words.
# The token_pattern decides what counts as a token: [^\W\d_]+ matches runs of letters only
# (not non-word, not digit, not underscore), so punctuation and numbers are dropped -- see: https://regexr.com/
vectorizer = CountVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")
X = vectorizer.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

Unnamed: 0,begin,beginning,begins,begun,document,large,second,world
0,0,0,0,0,1,0,0,0
1,0,0,0,0,2,0,1,0
2,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,0
4,0,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,1
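
To see what the `token_pattern` contributes on its own, the pattern can be tried directly with Python's `re` module. This is only a sketch of the tokenization step: it lowercases by hand and does not apply the stop-word list, so `the` and `is` still appear in the result.

```python
import re

# Same pattern as the one passed to CountVectorizer above: runs of letters only
token_pattern = re.compile(r"[^\W\d_]+")

document = "123 the world is large 32.34"
tokens = token_pattern.findall(document.lower())
print(tokens)  # ['the', 'world', 'is', 'large']
```

The numbers `123` and `32.34` never become tokens, which is why they are missing from the columns of the matrix.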


### Using TfidfVectorizer

To overcome this overemphasis on high-frequency words, we use TfidfVectorizer.

TfidfVectorizer takes into account how widely a word is used across the whole corpus, not just how often it appears in one document. It weights each word count by the inverse of the number of documents that contain the word, so words that appear in many documents are penalized and words that are specific to a few documents stand out.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Like CountVectorizer, TfidfVectorizer converts text to lowercase and, with stop_words='english',
# removes English stop words. The letters-only token_pattern drops punctuation and numbers.
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

X = vectorizer.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

Unnamed: 0,begin,beginning,begins,begun,document,large,second,world
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.8538,0.0,0.520601,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.5,0.5,0.5,0.5,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.707107
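
The weights in row 1 of the table can be reproduced by hand. The sketch below assumes scikit-learn's default settings for TfidfVectorizer (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the counts are read off the CountVectorizer table above.

```python
import math

n_docs = 6                           # documents in the corpus
tf = {"document": 2, "second": 1}    # raw counts in row 1
df = {"document": 2, "second": 1}    # number of documents containing each term

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in tf}

# Term frequency times IDF, then scale the row to unit Euclidean length
raw = {term: tf[term] * idf[term] for term in tf}
norm = math.sqrt(sum(value * value for value in raw.values()))
weights = {term: value / norm for term, value in raw.items()}

print(weights)
```

Although "document" occurs twice in this row, it also occurs in another document, so its IDF is lower than that of "second", which appears only here.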


### Word Stemming

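Notice that we might benefit from reducing words to a common base form. For example, the words "beginning", "begun", and "begins" are all forms of the same concept, "begin". Strictly speaking, stemming strips suffixes by rule, while lemmatization looks up the dictionary form (the lemma) of a word. Here we use NLTK's WordNetLemmatizer, together with part-of-speech tags, to reduce words to their lemmas.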
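
For contrast, a rule-based stemmer can be tried on the same words. The sketch below uses NLTK's `PorterStemmer`, which is not what the rest of this notebook uses; it needs no extra downloads. Because a stemmer only applies suffix rules and has no dictionary, an irregular form such as "begun" is not expected to be mapped to "begin".

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["begin", "begun", "beginning", "begins"]

# Apply the suffix-stripping rules to each word independently
stems = [stemmer.stem(word) for word in words]
print(list(zip(words, stems)))
```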

In [16]:
import nltk
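nltk.download('averaged_perceptron_tagger') # you only need to run this once
# word_tokenize and WordNetLemmatizer also need the 'punkt' and 'wordnet' resources;
# download them the same way if they are missing (resource names can differ between NLTK versions)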
from nltk.stem import WordNetLemmatizer 
from nltk import pos_tag, word_tokenize

# Define the corpus of documents
corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "begin begun beginning begins",
    "is was were being",
    "123 the world is large 32.34",
    'striped striping stripped hanging hanged begin beginning loving love loved'
]

transformed_corpus = []
wnl = WordNetLemmatizer()
for document in corpus:
    transformed_document = ""
    for word, tag in pos_tag(word_tokenize(document)):
        # Map the Penn Treebank tag to a WordNet POS; adjective tags start with 'J', which WordNet calls 'a'
        wntag = {'j': 'a', 'v': 'v', 'n': 'n', 'r': 'r'}.get(tag[0].lower())
        if not wntag:
            lemma = word
        else:
            lemma = wnl.lemmatize(word, wntag)
        transformed_document+= lemma + " "
    transformed_corpus += [transformed_document]

transformed_corpus


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\akash\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


['This be the first document . ',
 'This document be the second document . ',
 'And this be the third one . ',
 'begin begin begin begin ',
 'be be be be ',
 '123 the world be large 32.34 ',
 'strip strip strip hang hang begin begin love love love ']

Now, let's use the TfidfVectorizer to convert our new lemmatized corpus into a matrix of TF-IDF features.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Like CountVectorizer, TfidfVectorizer converts text to lowercase and, with stop_words='english',
# removes English stop words. The letters-only token_pattern drops punctuation and numbers.
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern=r"[^\W\d_]+")

X = vectorizer.fit_transform(transformed_corpus)

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

Unnamed: 0,begin,document,hang,large,love,second,strip,world
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.856605,0.0,0.0,0.0,0.515973,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.707107
6,0.333665,0.0,0.401965,0.0,0.602948,0.0,0.602948,0.0


## Apply SVD for dimension reduction

Let's apply SVD to reduce the dimensionality of our data. 

In [18]:
from sklearn.decomposition import TruncatedSVD

In [19]:
# For Latent Semantic Analysis, the recommended number of components is 100;
# we use 5 here because this corpus is tiny

svd = TruncatedSVD(n_components=5, n_iter=10)

In [20]:
X_svd = svd.fit_transform(X)
X_svd

array([[ 9.63484449e-01,  8.99429736e-18,  2.75130026e-19,
        -4.58113387e-22, -2.67764291e-01],
       [ 9.63484449e-01,  6.01525501e-16,  1.06940301e-17,
        -5.37632149e-17,  2.67764291e-01],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00],
       [-5.25502134e-16,  8.16598276e-01, -6.98974417e-17,
        -5.77206423e-01, -7.37107657e-16],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  0.00000000e+00],
       [-8.59106293e-18, -2.15956513e-16,  1.00000000e+00,
         3.89493417e-17, -7.08790346e-18],
       [-4.07288096e-16,  8.16598276e-01, -7.90531134e-17,
         5.77206423e-01, -5.95394732e-16]])

In [21]:
X_svd.shape[1]

5

In [22]:
df = pd.DataFrame(X_svd, columns=[f"svd{num:04}" for num in range(0,X_svd.shape[1])])
df

Unnamed: 0,svd0000,svd0001,svd0002,svd0003,svd0004
0,0.9634844,8.994297e-18,2.7512999999999998e-19,-4.581134e-22,-0.2677643
1,0.9634844,6.015255e-16,1.0694030000000001e-17,-5.3763210000000005e-17,0.2677643
2,0.0,0.0,0.0,0.0,0.0
3,-5.255021e-16,0.8165983,-6.989744e-17,-0.5772064,-7.371077e-16
4,0.0,0.0,0.0,0.0,0.0
5,-8.591063e-18,-2.159565e-16,1.0,3.8949340000000005e-17,-7.087903e-18
6,-4.072881e-16,0.8165983,-7.905311e-17,0.5772064,-5.953947e-16
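
For intuition about what the reduced columns contain, the sketch below runs NumPy's full SVD on a small made-up matrix (not the TF-IDF matrix above) and keeps the top `k` components. The reduced representation is the original matrix projected onto the first `k` right singular vectors, which equals the first `k` left singular vectors scaled by their singular values. TruncatedSVD uses an approximate solver by default, so its signs and column order are not guaranteed to match this exactly.

```python
import numpy as np

# Made-up 4x3 matrix standing in for a small document-term matrix
X_demo = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.6, 0.8],
])

k = 2  # number of components to keep

# Full SVD: X_demo = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(X_demo, full_matrices=False)

# Project each row onto the top-k right singular vectors
X_reduced = X_demo @ Vt[:k].T

print(X_reduced.shape)
print(np.allclose(X_reduced, U[:, :k] * s[:k]))
```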


### Now we are ready to use this data in a model

Our data is now ready to be used in a model. If we have these documents tagged (for instance, 'good' or 'bad'), we can use this data to train a model. If we don't have the tags, we can use this data to cluster the documents, or go through the documents manually and tag them.
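
As one possible next step when no tags are available, the sketch below clusters reduced vectors with scikit-learn's `KMeans`. The array `X_demo` is made up for illustration and stands in for the SVD output; the choice of two clusters is likewise an assumption, not something derived from the corpus.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up reduced vectors standing in for the SVD output (two obvious groups)
X_demo = np.array([
    [0.96, 0.01],
    [0.95, -0.02],
    [0.02, 0.82],
    [0.01, 0.80],
])

# n_clusters=2 is an assumption for this toy data
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_demo)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which documents end up sharing a label.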