# Text Processing with NLTK
Date: August, 2022.

# Introduction

We are going to explore techniques to clean text and convert text features into numerical features that machine learning algorithms can work with.

# Common text preprocessing

In [None]:
text = "   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  "

Let's first lowercase our text

In [None]:
text = text.lower()
print(text)

   this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  


We can get rid of leading/trailing whitespace with the following:

In [None]:
text = text.strip()
print(text)

this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .


Remove HTML tags/markups

In [None]:
import re
text = re.compile('<.*?>').sub('', text)
print(text)

this is a message to be cleaned. it may involve some things like: , ?, :, ''  adjacent spaces and tabs     .
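The `?` in `<.*?>` makes the match non-greedy, which matters as soon as a text contains more than one tag. The following is a small sketch of the difference; the `sample` string is made up for illustration.

```python
import re

# Hypothetical sample with two separate tags
sample = "price: <b>10</b> usd<br />"

# Non-greedy: each tag is matched on its own, so the text between tags survives
non_greedy = re.sub(r"<.*?>", "", sample)

# Greedy: a single match runs from the first '<' to the last '>'
greedy = re.sub(r"<.*>", "", sample)

print(repr(non_greedy))
print(repr(greedy))
```

Note that the pattern is purely textual: a literal `<` followed later by a `>` in ordinary prose would be removed as well.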


Replace punctuation with space

In [None]:
import re, string
text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
print(text)

this is a message to be cleaned  it may involve some things like              adjacent spaces and tabs      


Remove extra space and tabs

In [None]:
import re
text = re.sub(r'\s+', ' ', text)  # raw string avoids an invalid-escape warning for \s
print(text)

this is a message to be cleaned it may involve some things like adjacent spaces and tabs 


# Lexicon based text processing

Lexicon based methods are usually used to normalize sentences in our dataset.

By **normalization**, here, we mean putting the words in the sentences into a similar format that will enhance similarities (if any) between sentences.

**Stop word removal**: Some words occur very frequently in our sentences and don't contribute much to their overall meaning. We usually keep a list of these words and remove them from each of our sentences. For example: "a", "an", "the", "this", "that", "is"

In [None]:
stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

filtered_sentence = []
words = text.split(" ")
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
text = " ".join(filtered_sentence)

print(text)

message be cleaned may involve some things like adjacent spaces tabs 
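The loop above can also be written more compactly. This is only a sketch under two assumptions: a `set` is used for faster membership tests, and `split()` is called without an argument so that leading, trailing, or repeated spaces do not produce empty tokens. The function name `remove_stop_words` is hypothetical.

```python
# Same stop word list as above, stored as a set for fast membership tests
STOP_WORDS = {"a", "an", "the", "this", "that", "is", "it", "to", "and"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the text with stop words removed (hypothetical helper)."""
    # split() with no argument splits on any whitespace and drops empty strings
    return " ".join(w for w in text.split() if w not in stop_words)

print(remove_stop_words("this is a message to be cleaned "))
```

Unlike the `split(" ")` version, this variant leaves no trailing space in the result.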


**Stemming**: Stemming is a rule-based method that reduces words to a root form (the stem), mainly by removing suffixes. The stem is not always a dictionary word.

This helps us enhance similarities (if any) between sentences.

Example:

* "jumping", "jumped" -> "jump"
* "cars" -> "car"

In [None]:
# We use the NLTK library
import nltk
from nltk.stem import SnowballStemmer

# Initialize the stemmer
snow = SnowballStemmer('english')

stemmed_sentence = []
words = text.split(" ")
for w in words:
    stemmed_sentence.append(snow.stem(w))
text = " ".join(stemmed_sentence)

print(text)

messag be clean may involv some thing like adjac space tab 
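To make "rule-based" concrete, here is a deliberately naive suffix-stripping sketch. It is not the Snowball algorithm, which applies many more rules and conditions; `SUFFIX_RULES`, `toy_stem`, and `min_stem_len` are hypothetical names chosen for illustration.

```python
# Hypothetical, ordered list of suffixes to try (NOT the Snowball rule set)
SUFFIX_RULES = ("ing", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIX_RULES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= min_stem_len:
            return word[:remaining]
    return word  # no rule applied

for w in ["jumping", "jumped", "cars", "thing"]:
    print(w, "->", toy_stem(w))
```

The `min_stem_len` guard is one simple way to avoid cutting short words such as "thing" down to "th"; real stemmers use more careful conditions.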


# Feature extraction

We assume the common and lexicon-based pre-processing steps have already been applied to our text. After that, we convert the text data into numerical data with the Bag of Words (BoW) representation.

**Bag of Words (BoW)**: A modeling technique to convert text information into numerical representation.
Machine learning models expect numerical or categorical values as input and won't work with raw text data.

Steps:
1.   Create vocabulary of known words
2.   Measure presence of the known words in sentences
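The two steps can be sketched in plain Python before turning to a library. This is an illustrative sketch, not sklearn's implementation: the helper names and the toy `docs` are hypothetical, and the tokenization (lowercasing, tokens of two or more word characters) is an assumption chosen to resemble common vectorizer defaults.

```python
import re

def tokenize(sentence):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", sentence.lower())

def build_vocabulary(sentences):
    # Step 1: collect the known words and give each one an index
    words = sorted({tok for s in sentences for tok in tokenize(s)})
    return {word: idx for idx, word in enumerate(words)}

def binary_bow(sentence, vocabulary):
    # Step 2: mark with 1 every known word that is present in the sentence
    vector = [0] * len(vocabulary)
    for tok in tokenize(sentence):
        idx = vocabulary.get(tok)
        if idx is not None:  # words outside the vocabulary are ignored
            vector[idx] = 1
    return vector

docs = ["the cat sat", "the dog sat down"]  # toy data
vocabulary = build_vocabulary(docs)
print(vocabulary)
print([binary_bow(d, vocabulary) for d in docs])
```

Every vector has the same length as the vocabulary, regardless of how long the sentence is.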

We will use the sklearn library's Bag of Words implementation:
* **CountVectorizer:** Sklearn text vectorizer, converts a collection of text documents to a matrix of token counts
* **TfidfVectorizer:** Sklearn text vectorizer, converts a collection of text documents to a matrix of TF-IDF features

In [None]:
# Import sklearn's Bag of Words vectorizers
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document one one one one one one one one?'
]
countVectorizer = CountVectorizer(binary=True)
X = countVectorizer.fit_transform(sentences)

# With use_idf=False, returns the normalized term-frequency matrix
tfVectorizer = TfidfVectorizer(use_idf=False)
tf = tfVectorizer.fit_transform(sentences)

# With use_idf=True, returns the (smoothed) TF-IDF matrix
tfIdfVectorizer = TfidfVectorizer(use_idf=True)
tfIdf = tfIdfVectorizer.fit_transform(sentences)

Let's print the vocabulary below.

* The number next to each word is its index in the vocabulary (0 to 8 here).

* Indices follow alphabetical order: and:0, document:1, first:2, ...

In [None]:
print('Vocabulary: \n',countVectorizer.vocabulary_)

Vocabulary: 
 {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


In [None]:
print('Bag of Words Binary Features: \n',X.toarray())

Bag of Words Binary Features: 
 [[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 1 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 1 0 1 0 1]]


What happens when we encounter a new word during prediction?

* Words that are not in the fitted vocabulary are skipped.
* This typically happens at prediction time. For validation and test text, we call **.transform()** (not **.fit_transform()**) so that the vocabulary learned from the training text is reused.
* This simulates a real-time prediction case where we cannot re-train the model quickly whenever we receive new words.

In [None]:
test_sentences = ["this document has some new words",
                 "this one is new too"]
count_vectors = countVectorizer.transform(test_sentences)
print(count_vectors.toarray())

[[0 1 0 0 0 0 0 0 1]
 [0 0 0 1 1 0 0 0 1]]


Note that these last two vectors have the same length, 9 (same vocabulary), as the ones before.

Let's print the Term Frequency

In [None]:
import pandas as pd

print('Features names: \n',tfVectorizer.get_feature_names_out())
print('Shape: \n',tf.shape)
print('Vocabulary: \n',tfVectorizer.vocabulary_)

df = pd.DataFrame(tf[0].T.todense(), index=tfVectorizer.get_feature_names_out(), columns=["TF"])
df = df.sort_values('TF', ascending=False)
print('TF for Sentence 0: \n',df.head(25))

Features names: 
 ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Shape: 
 (4, 9)
Vocabulary: 
 {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
TF for Sentence 0: 
                 TF
document  0.447214
first     0.447214
is        0.447214
the       0.447214
this      0.447214
and       0.000000
one       0.000000
second    0.000000
third     0.000000


Let's print the TF-IDF

In [None]:
import pandas as pd

print('Features names: \n',tfIdfVectorizer.get_feature_names_out())
print('Shape: \n',tfIdf.shape)
print('Vocabulary: \n',tfIdfVectorizer.vocabulary_)
print('Inverse Document Frequency Vector: \n',tfIdfVectorizer.idf_)

df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print('TF-IDF for Sentence 0: \n',df.head(25))

Features names: 
 ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Shape: 
 (4, 9)
Vocabulary: 
 {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
Inverse Document Frequency Vector: 
 [1.91629073 1.22314355 1.51082562 1.22314355 1.51082562 1.91629073
 1.         1.91629073 1.22314355]
TF-IDF for Sentence 0: 
             TF-IDF
first     0.541977
document  0.438777
is        0.438777
this      0.438777
the       0.358729
and       0.000000
one       0.000000
second    0.000000
third     0.000000
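The numbers printed above are consistent with a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The sketch below recomputes them in plain Python under that assumption; the helper names are hypothetical, and the tokenization (lowercasing, tokens of two or more word characters) is assumed to match the vectorizer's defaults.

```python
import math
import re

sentences = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document one one one one one one one one?'
]

def tokenize(sentence):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", sentence.lower())

docs = [tokenize(s) for s in sentences]
vocab = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of sentences that contain each term
df = {t: sum(t in doc for doc in docs) for t in vocab}

# Smoothed IDF (assumed formula, see the lead-in)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    # Raw term count times IDF, then L2-normalize the row
    raw = {t: tokens.count(t) * idf[t] for t in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: (v / norm if norm else 0.0) for t, v in raw.items()}

row0 = tfidf_row(docs[0])
print({t: round(v, 6) for t, v in idf.items()})
print({t: round(w, 6) for t, w in row0.items() if w > 0})
```

A term that appears in every sentence ("the") gets the minimum IDF of 1, which is why it ends up with the lowest non-zero weight in sentence 0.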


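**Note:**
* With their default settings, sklearn's vectorizers lowercase the text and tokenize it in a way that drops punctuation and single-character tokens, but they do not apply the other pre-processing steps discussed here (for example, HTML tag removal).
* Lexicon-based methods are also not automatically applied; we need to call those methods before feature extraction.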

# Putting it all together

Let's combine all of the steps above into reusable functions.

In [None]:
# Prepare cleaning functions
import re, string
import nltk
from nltk.stem import SnowballStemmer

stop_words = ["a", "an", "the", "this", "that", "is", "it", "to", "and"]

stemmer = SnowballStemmer('english')

def preProcessText(text):
    # lowercase and strip leading/trailing white space
    text = text.lower().strip()

    # remove HTML tags
    text = re.compile('<.*?>').sub('', text)

    # remove punctuation
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)

    # remove extra white space
    text = re.sub(r'\s+', ' ', text)

    return text

def lexiconProcess(text, stop_words, stemmer):
    filtered_sentence = []
    words = text.split()  # split on any whitespace; avoids empty tokens
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(stemmer.stem(w))
    text = " ".join(filtered_sentence)

    return text

def cleanSentence(text, stop_words, stemmer):
    return lexiconProcess(preProcessText(text), stop_words, stemmer)

Prepare vectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Can also limit vocabulary size here with, say, max_features=50
textvectorizer = CountVectorizer(binary=True)

Clean and vectorize a text feature with four samples

In [None]:
text_feature = ["I liked the material, color and overall how it looks.<br /><br />",
             "Worked okay first two times I used it, but third time burned my face.",
             "I am not sure about this product.",
             "I never thought I would pay so much for a hair dryer.",
            ]

print(len(text_feature))

# Clean up the text
text_feature_cleaned = [cleanSentence(item, stop_words, stemmer) for item in text_feature]

# Vectorize the cleaned text
text_feature_vectorized = textvectorizer.fit_transform(text_feature_cleaned)
print('Vocabulary: \n', textvectorizer.vocabulary_)
print('Bag of Words Binary Features: \n', text_feature_vectorized.toarray())
print('Shape: \n',text_feature_vectorized.shape)

4
Vocabulary: 
 {'like': 11, 'materi': 13, 'color': 4, 'overal': 19, 'how': 10, 'look': 12, 'work': 29, 'okay': 18, 'first': 7, 'two': 27, 'time': 26, 'use': 28, 'but': 3, 'third': 24, 'burn': 2, 'my': 15, 'face': 6, 'am': 1, 'not': 17, 'sure': 23, 'about': 0, 'product': 21, 'never': 16, 'thought': 25, 'would': 30, 'pay': 20, 'so': 22, 'much': 14, 'for': 8, 'hair': 9, 'dryer': 5}
Bag of Words Binary Features: 
 [[0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0]
 [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1]]
Shape: 
 (4, 31)


# Using the Natural Language Toolkit (NLTK) library

Import NLTK and download the resources used in this section (here, the "book" collection).

In [None]:
import nltk

nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data]    | Downloading package dependency_treebank to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package dependency_treebank is already up-to-date!
[nltk_data]    | Downloadi

True

In [None]:
import pandas as pd

df = pd.read_json("https://raw.githubusercontent.com/erickedu85/dataset/master/yt_tweets.json")

for text in df[0]:
  print(text)

La Junta Electoral de Yachay Tech, comunica que de acuerdo al cronograma establecido desde el 05 al 18 de agosto, se realizarán las inscripciones de candidaturas. #SomosYachayTech
Yachay Tech recibió la visita de 40 estudiantes de la Universidad Estatal Península de Santa Elena, @UPSE_ec con el objetivo de conocer las instalaciones, laboratorios y equipos que dispone la universidad, además de la oferta académica vigente de posgrados.
Con mucha alegría y entusiasmo el día de hoy, se realizó la socialización de los programas y proyectos de vinculación de la Escuela de Investigaciones Agropecuarias y Agroindustriales, contando con la participación de distintas comunidades. 1/2 #SomosYachayTech
Además se realizó una firma de convenio y carta compromiso con @AgriculturaEc y la Fundacion @HeiferEcuador, con el objetivo de establecer mecanismos de cooperación interinstitucional y desarollo productivo. #SomosYachayTech


Using the stop-words in Spanish

In [None]:
from nltk.corpus import stopwords

stopword_es = stopwords.words('spanish')
print(stopword_es)

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se', 'las', 'por', 'un', 'para', 'con', 'no', 'una', 'su', 'al', 'lo', 'como', 'más', 'pero', 'sus', 'le', 'ya', 'o', 'este', 'sí', 'porque', 'esta', 'entre', 'cuando', 'muy', 'sin', 'sobre', 'también', 'me', 'hasta', 'hay', 'donde', 'quien', 'desde', 'todo', 'nos', 'durante', 'todos', 'uno', 'les', 'ni', 'contra', 'otros', 'ese', 'eso', 'ante', 'ellos', 'e', 'esto', 'mí', 'antes', 'algunos', 'qué', 'unos', 'yo', 'otro', 'otras', 'otra', 'él', 'tanto', 'esa', 'estos', 'mucho', 'quienes', 'nada', 'muchos', 'cual', 'poco', 'ella', 'estar', 'estas', 'algunas', 'algo', 'nosotros', 'mi', 'mis', 'tú', 'te', 'ti', 'tu', 'tus', 'ellas', 'nosotras', 'vosotros', 'vosotras', 'os', 'mío', 'mía', 'míos', 'mías', 'tuyo', 'tuya', 'tuyos', 'tuyas', 'suyo', 'suya', 'suyos', 'suyas', 'nuestro', 'nuestra', 'nuestros', 'nuestras', 'vuestro', 'vuestra', 'vuestros', 'vuestras', 'esos', 'esas', 'estoy', 'estás', 'está', 'estamos', 'estáis', 'están', 'e

* **Word tokenization** breaks a text into individual tokens (words, numbers, and punctuation marks).
* **Stemming** reduces a word to its stem. For example, given ‘lately’, the stemmer cuts ‘ly’ and returns ‘late’. This maps related word forms to the same token, which helps information retrieval and reduces the size of the vocabulary.
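As a rough illustration of what tokenization does, the sketch below splits a text with a single regular expression. It is a simplified stand-in, not the algorithm behind `word_tokenize`; the pattern and the name `simple_tokenize` are assumptions, chosen here so that hashtags and mentions stay in one piece.

```python
import re

# Hypothetical pattern: an optional '#' or '@' glued to a word,
# or any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"[#@]?\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, hashtag/mention, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("la visita, @UPSE_ec #SomosYachayTech"))
```

Whether hashtags and mentions should be kept whole, split, or dropped depends on the task, so treat this pattern as one possible choice.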

In [None]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from string import punctuation

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('spanish')

cleaning_dataset = []

for text in df[0]:
  tokens = word_tokenize(text)
  print('Tokens: \n',tokens)

  cleaned_tokens = [stemmer.stem(token) for token in tokens if token.lower() not in stopword_es and token not in punctuation]
  print('Cleaned tokens and punctuation: \n',cleaned_tokens)

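  # Note: pos_tag uses NLTK's default tagger, which is trained on English,
  # so the tags assigned to these Spanish stems are not reliable
  tagged = pos_tag(cleaned_tokens)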
  print('Tagged tokens: \n',tagged)
  print('\n')

  cleaning_dataset.append(' '.join(cleaned_tokens))

print('Cleaning dataset: \n',cleaning_dataset)

Tokens: 
 ['La', 'Junta', 'Electoral', 'de', 'Yachay', 'Tech', ',', 'comunica', 'que', 'de', 'acuerdo', 'al', 'cronograma', 'establecido', 'desde', 'el', '05', 'al', '18', 'de', 'agosto', ',', 'se', 'realizarán', 'las', 'inscripciones', 'de', 'candidaturas', '.', '#', 'SomosYachayTech']
Cleaned tokens and punctuation: 
 ['la', 'junt', 'electoral', 'yachay', 'tech', 'comun', 'acuerd', 'cronogram', 'establec', '05', '18', 'agost', 'realiz', 'inscripcion', 'candidatur', 'somosyachaytech']
Tagged tokens: 
 [('la', 'NN'), ('junt', 'NN'), ('electoral', 'JJ'), ('yachay', 'NN'), ('tech', 'NN'), ('comun', 'NN'), ('acuerd', 'NN'), ('cronogram', 'NN'), ('establec', 'VBZ'), ('05', 'CD'), ('18', 'CD'), ('agost', 'NN'), ('realiz', 'NN'), ('inscripcion', 'NN'), ('candidatur', 'NN'), ('somosyachaytech', 'NN')]


Tokens: 
 ['Yachay', 'Tech', 'recibió', 'la', 'visita', 'de', '40', 'estudiantes', 'de', 'la', 'Universidad', 'Estatal', 'Península', 'de', 'Santa', 'Elena', ',', '@', 'UPSE_ec', 'con', 'el', 

Compute the TF-IDF of the cleaned tweets

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfIdfVectorizer = TfidfVectorizer(use_idf=True)
tfIdf = tfIdfVectorizer.fit_transform(cleaning_dataset)
df = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print(df.head(25))

                   TF-IDF
05               0.274192
la               0.274192
acuerd           0.274192
junt             0.274192
agost            0.274192
inscripcion      0.274192
18               0.274192
electoral        0.274192
candidatur       0.274192
cronogram        0.274192
comun            0.216176
tech             0.216176
establec         0.216176
yachay           0.216176
somosyachaytech  0.175013
realiz           0.175013
compromis        0.000000
program          0.000000
academ           0.000000
laboratori       0.000000
mecan            0.000000
much             0.000000
objet            0.000000
ofert            0.000000
particip         0.000000
