In [1]:
!pip install textblob



Tokenization


Tokenization is the process of breaking a text into smaller units called tokens, such as sentences, words, or punctuation marks. It is usually the first step in a natural language processing pipeline.
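
To make the idea concrete before turning to the libraries, here is a minimal sketch that tokenizes with regular expressions only. The two patterns are simplifying assumptions chosen for this example; they are not what TextBlob or NLTK use internally and will mis-split abbreviations such as "Dr.".

```python
import re

text = "Hello everyone! Welcome to my blog post on Medium. We are studying Natural Language Processing."

# Sentence tokens: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokens: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)
print(words)
```

On this particular text the sketch yields three sentences and keeps punctuation marks as separate tokens.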

In [2]:
import textblob
from textblob import TextBlob

In [3]:
text = "Hello everyone! Welcome to my blog post on Medium. We are studying Natural Language Processing."

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [5]:
TextBlob(text).words

WordList(['Hello', 'everyone', 'Welcome', 'to', 'my', 'blog', 'post', 'on', 'Medium', 'We', 'are', 'studying', 'Natural', 'Language', 'Processing'])

In [6]:
import nltk
from nltk import sent_tokenize
from nltk import word_tokenize

In [7]:
tokens_sents = nltk.sent_tokenize(text)
print(tokens_sents)

['Hello everyone!', 'Welcome to my blog post on Medium.', 'We are studying Natural Language Processing.']


In [8]:
tokens_words = nltk.word_tokenize(text)
print(tokens_words)

['Hello', 'everyone', '!', 'Welcome', 'to', 'my', 'blog', 'post', 'on', 'Medium', '.', 'We', 'are', 'studying', 'Natural', 'Language', 'Processing', '.']


Stemming


Stemming is the simpler of the two approaches covered here (stemming and lemmatization). With stemming, words are reduced to their word stems by stripping affixes. A word stem need not be the same as a dictionary-based morphological root; it is simply a form of the word that is equal to or shorter than the original.
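
The following toy function illustrates the principle of suffix stripping. It is only a sketch: `naive_stem` and its suffix list are hypothetical names invented for this example, and the real Porter and Snowball algorithms apply many more rules and conditions.

```python
def naive_stem(word, suffixes=("ization", "ations", "ation", "ing", "ed", "es", "s")):
    """Strip the first matching suffix; a toy stand-in, not the Porter algorithm."""
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["civilization", "studies", "running", "cats"]:
    print(w, "->", naive_stem(w))
```

Results such as "studi" and "runn" show why a stem need not be a dictionary word.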

In [9]:
from nltk.stem import PorterStemmer

In [10]:
ps = PorterStemmer()
word = "civilization"
ps.stem(word)

'civil'

In [11]:
from nltk.stem.snowball import SnowballStemmer

In [12]:
stemmer = SnowballStemmer(language = "english")
word = "civilization"
stemmer.stem(word)

'civil'

Lemmatization


Lemmatization is the process of finding the dictionary form (lemma) of a word. Unlike stemming, it relies on a vocabulary lookup rather than on stripping suffixes, so it generally takes longer to compute.
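
As a rough illustration of lookup-based lemmatization, the sketch below uses a tiny hand-written table keyed by word and part of speech. The table `LEMMAS` and the function `toy_lemmatize` are hypothetical; a real lemmatizer consults a full lexical database such as WordNet. The `pos` default of noun mirrors the default of NLTK's `WordNetLemmatizer.lemmatize(word, pos="n")`.

```python
# Toy lemma table keyed by (word, part of speech); hypothetical data for illustration.
LEMMAS = {
    ("feet", "n"): "foot",
    ("bats", "n"): "bat",
    ("are", "v"): "be",
    ("hanging", "v"): "hang",
    ("best", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if the (word, pos) pair is known, else the word itself."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("feet"))          # noun lookup succeeds
print(toy_lemmatize("are"))           # no noun entry, so the word is returned unchanged
print(toy_lemmatize("are", pos="v"))  # verb lookup succeeds
```

Because the lookup depends on the part of speech, a verb looked up as a noun comes back unchanged.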

In [13]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [14]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [15]:
# Lemmatize single word

print(lemmatizer.lemmatize("workers"))
print(lemmatizer.lemmatize("beeches"))

worker
beech


In [16]:
text = "Let’s lemmatize a simple sentence. We first tokenize the sentence into words using nltk.word_tokenize and then we will call lemmatizer.lemmatize() on each word. "
word_list = nltk.word_tokenize(text)
print(word_list)

['Let', '’', 's', 'lemmatize', 'a', 'simple', 'sentence', '.', 'We', 'first', 'tokenize', 'the', 'sentence', 'into', 'words', 'using', 'nltk.word_tokenize', 'and', 'then', 'we', 'will', 'call', 'lemmatizer.lemmatize', '(', ')', 'on', 'each', 'word', '.']


In [17]:
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)

Let ’ s lemmatize a simple sentence . We first tokenize the sentence into word using nltk.word_tokenize and then we will call lemmatizer.lemmatize ( ) on each word .


In [18]:
# pip install textblob

from textblob import TextBlob, Word

In [19]:
word = 'stripes'
w = Word(word)
w.lemmatize()

'stripe'

In [20]:
text = "The striped bats are hanging on their feet for best"
sent = TextBlob(text)
" ".join([w.lemmatize() for w in sent.words])

'The striped bat are hanging on their foot for best'

Part Of Speech Tagging (POS Tagging)


1 - Part of Speech Tagging (POS tagging) is the labeling of the words in a text according to their word types (noun, adjective, adverb, verb, etc.).


2 - It converts a sentence, given as a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.
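
The (word, tag) structure can be sketched without running a tagger. The list below is hard-coded in the shape a tagger returns, using Penn Treebank tags; the variable names are chosen for this example only.

```python
from collections import defaultdict

# (word, tag) pairs in the shape returned by a POS tagger.
tagged = [("The", "DT"), ("striped", "JJ"), ("bats", "NNS"), ("are", "VBP"),
          ("hanging", "VBG"), ("on", "IN"), ("their", "PRP$"), ("feet", "NNS")]

# Group the words by tag.
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

# All noun tags start with "NN", so a prefix test selects the nouns.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
print(dict(words_by_tag))
```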

In [21]:
import nltk
from nltk import word_tokenize

In [22]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [23]:
"""Parts of Speech (nltk.pos_tag list)
CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential "there"
FW: Foreign word
IN: Preposition or subordinating conjunction
JJ: Adjective
JJR, JJS: Comparative and superlative adjective
LS: List item marker
MD: Modal
NN: Singular or mass noun
NNS, NNP, NNPS: Plural noun, singular proper noun, plural proper noun
PDT: Predeterminer
WRB: Wh-adverb
WP$: Possessive wh-pronoun
WP: Wh-pronoun
WDT: Wh-determiner
VBZ: Verb, third-person singular present
VBP, VBN, VBG, VBD, VB: Verb (non-third-person present, past participle, gerund, past tense, base form)
UH: Interjection
TO: The word "to"
RP: Particle
RBS, RB, RBR: Superlative adverb, adverb, comparative adverb
PRP, PRP$: Personal and possessive pronoun
"""


text = "The striped bats are hanging on their feet for best"
tokens = nltk.word_tokenize(text)
print("Parts of Speech: ",nltk.pos_tag(tokens))

Parts of Speech:  [('The', 'DT'), ('striped', 'JJ'), ('bats', 'NNS'), ('are', 'VBP'), ('hanging', 'VBG'), ('on', 'IN'), ('their', 'PRP$'), ('feet', 'NNS'), ('for', 'IN'), ('best', 'JJS')]
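
One common use of these tags is to feed the lemmatizer, since `WordNetLemmatizer.lemmatize` takes a `pos` argument with the values "n", "v", "a" or "r". The helper below is a minimal sketch of that mapping; the name `penn_to_wordnet` is hypothetical, and falling back to noun for every other tag is an assumption made for this example.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun and everything else

tagged = [("The", "DT"), ("striped", "JJ"), ("bats", "NNS"), ("are", "VBP"), ("hanging", "VBG")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With a lemmatizer object available, each word could then be passed as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.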


Stop words -

Words that are generally filtered out before processing natural language text are called stop words. These are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of a few stop words in English are “the”, “a”, “an”, “so”, “what”.



Libraries to remove stop words -

1 - Natural Language Toolkit (NLTK):

NLTK is a widely used Python library for working with natural language, and it ships with stop word lists for several languages.

In [24]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [25]:
import nltk
from nltk.corpus import stopwords
sw_nltk = stopwords.words('english')
print(sw_nltk)

print()
print(len(sw_nltk))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [26]:
text = "When I first met her she was very quiet. She remained quiet during the entire two hour long journey from Stony Brook to New York."
words = [word for word in text.split() if word.lower() not in sw_nltk]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

first met quiet. remained quiet entire two hour long journey Stony Brook New York.
Old length:  129
New length:  82
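
One caveat with `text.split()` is that punctuation stays attached to the words, so a stop word followed by a full stop or comma is not matched. The sketch below strips surrounding punctuation before the comparison. The stop list and the sample sentence are hypothetical and kept small for illustration; a set is used because membership tests on a set are faster than on a list.

```python
import string

# Small hypothetical stop list; a set gives constant-time membership tests.
stop_words = {"i", "her", "she", "was", "so"}
sample = "I met her. She was quiet, so was I."

# Naive filter: "her." and "I." keep their punctuation and slip through.
naive = [t for t in sample.split() if t.lower() not in stop_words]

# Strip leading and trailing punctuation before comparing, but keep the original token.
cleaned = [t for t in sample.split()
           if t.strip(string.punctuation).lower() not in stop_words]

print(naive)
print(cleaned)
```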


2 - spaCy:

spaCy is an open-source software library for advanced NLP. It is popular among NLP practitioners and provides its own English stop word list.

In [27]:
import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
sw_spacy = en.Defaults.stop_words
print(sw_spacy)

print()
print(len(sw_spacy))

{'whenever', '’ve', '‘s', 'anyway', 'within', 'again', 'his', 'n‘t', 'not', 'used', 'cannot', 'almost', 'hereupon', 'amount', 'meanwhile', 'using', 'been', 'among', 'ourselves', 'ever', '‘d', 'for', 'always', 'on', 'may', 'back', 'off', 'should', 'if', 'elsewhere', 'myself', 'down', 'hence', 'many', 'all', 'moreover', 'however', 'herein', 'as', 'they', 'himself', 'whence', 'i', 'thence', 'both', 'can', 'became', 'an', 'did', 'say', 'has', 'hereby', 'does', 'thereupon', 'least', 'three', 'would', 'us', 'amongst', 'besides', 'whether', '’m', 'these', 'where', 'fifty', 'her', 'made', 'herself', 'next', 'wherever', 'you', 'further', 'hundred', 'might', 'same', 'between', 'other', 'part', 'mostly', 'top', 'towards', 'eight', 'out', 'to', 'someone', 'do', 'is', '’d', 'but', 'bottom', 'former', 'are', 'why', 'give', 'alone', 'against', 'anyone', 'the', 'whither', 'somehow', 'own', 'perhaps', 'thru', 'into', 'above', 'few', 'together', 'will', 'sometime', 'hers', 'else', '‘re', 'those', 'eithe

In [28]:
words = [word for word in text.split() if word.lower() not in sw_spacy]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

met quiet. remained quiet entire hour long journey Stony Brook New York.
Old length:  129
New length:  72


3 - Gensim:

Gensim (Generate Similar) is an open-source software library that uses modern statistical machine learning. According to Wikipedia, Gensim is designed to handle large text collections using data streaming and incremental online algorithms, which differentiates it from most other machine learning software packages that target only in-memory processing.

In [29]:
import gensim
from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS
print(STOPWORDS)

print()
print(len(STOPWORDS))

frozenset({'whenever', 'anyway', 'within', 'his', 'again', 'not', 'used', 'cannot', 'almost', 'hereupon', 'amount', 'using', 'meanwhile', 'been', 'among', 'ourselves', 'ever', 'for', 'always', 'on', 'may', 'back', 'thin', 'off', 'should', 'if', 'elsewhere', 'myself', 'down', 'hence', 'many', 'all', 'moreover', 'however', 'bill', 'herein', 'they', 'as', 'himself', 'whence', 'i', 'thence', 'both', 'can', 'interest', 'did', 'became', 'an', 'say', 'has', 'hereby', 'does', 'thereupon', 'least', 'three', 'would', 'sincere', 'found', 'fire', 'us', 'amongst', 'besides', 'whether', 'these', 'where', 'fifty', 'her', 'made', 'next', 'herself', 'wherever', 'you', 'further', 'hundred', 'thick', 'might', 'same', 'between', 'other', 'part', 'mostly', 'top', 'towards', 'eight', 'out', 'to', 'someone', 'do', 'is', 'system', 'but', 'bottom', 'former', 'are', 'why', 'give', 'alone', 'against', 'anyone', 'the', 'whither', 'somehow', 'don', 'own', 'perhaps', 'thru', 'into', 'above', 'few', 'ltd', 'describe

In [30]:
new_text = remove_stopwords(text)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

When I met quiet. She remained quiet entire hour long journey Stony Brook New York.
Old length:  129
New length:  83
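
In the output above, "When", "I" and "She" survive even though their lowercase forms are stop words, which suggests that the comparison is case-sensitive. The sketch below reproduces the difference with plain Python. The function `remove_stop_words` and the small stop set are hypothetical and do not represent Gensim's implementation.

```python
def remove_stop_words(text, stop_words, case_sensitive=False):
    """Drop whitespace-separated tokens that appear in stop_words."""
    kept = []
    for token in text.split():
        key = token if case_sensitive else token.lower()
        if key not in stop_words:
            kept.append(token)
    return " ".join(kept)

stop_words = frozenset({"when", "i", "she", "her", "was", "very", "first"})
sample = "When I first met her she was very quiet."

print(remove_stop_words(sample, stop_words, case_sensitive=True))
print(remove_stop_words(sample, stop_words))
```

Lowercasing each token before the lookup, as the list comprehensions in the other cells do, removes the capitalized stop words as well.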


4 - Scikit-Learn:

Scikit-Learn is a free, open-source machine learning library for Python and one of the most widely used. It also ships a built-in English stop word list.

In [31]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)

print()
print(len(ENGLISH_STOP_WORDS))

frozenset({'whenever', 'anyway', 'within', 'again', 'his', 'not', 'cannot', 'almost', 'hereupon', 'amount', 'meanwhile', 'been', 'among', 'ourselves', 'ever', 'for', 'always', 'on', 'may', 'back', 'thin', 'off', 'should', 'if', 'elsewhere', 'myself', 'down', 'hence', 'many', 'all', 'moreover', 'however', 'bill', 'herein', 'as', 'they', 'himself', 'whence', 'i', 'thence', 'both', 'can', 'interest', 'became', 'an', 'has', 'hereby', 'thereupon', 'least', 'three', 'would', 'sincere', 'found', 'fire', 'us', 'amongst', 'besides', 'whether', 'these', 'where', 'fifty', 'her', 'made', 'herself', 'next', 'wherever', 'you', 'further', 'hundred', 'thick', 'might', 'same', 'between', 'other', 'part', 'mostly', 'top', 'towards', 'eight', 'out', 'to', 'someone', 'do', 'is', 'system', 'but', 'bottom', 'former', 'are', 'why', 'give', 'alone', 'against', 'anyone', 'the', 'whither', 'somehow', 'own', 'perhaps', 'thru', 'into', 'above', 'few', 'ltd', 'describe', 'together', 'will', 'sometime', 'hers', 'el

In [32]:
words = [word for word in text.split() if word.lower() not in ENGLISH_STOP_WORDS]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

met quiet. remained quiet entire hour long journey Stony Brook New York.
Old length:  129
New length:  72


In [33]:
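# Add extra words to the NLTK stop word list.
# Note: 'me' is already in the list, so extend() adds it a second time.
sw_nltk.extend(['first', 'second', 'third', 'me'])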
print(len(sw_nltk))

183


In [34]:
# Remove a word that should be kept in the text, here the negation 'not'.
sw_nltk.remove('not')
print(len(sw_nltk))

182


Custom Stop Words Removal -


If we do not want to use any of these libraries, we can also create our own custom stop words list and use it in our task. This is usually done when we have domain expertise in our field and when we know which words we should avoid while performing our task.

In [35]:
my_stop_words = ['her','me','i','she','it']
words = [word for word in text.split() if word.lower() not in my_stop_words]
new_text = " ".join(words)
print(new_text)
print("Old length: ", len(text))
print("New length: ", len(new_text))

When first met was very quiet. remained quiet during the entire two hour long journey from Stony Brook to New York.
Old length:  129
New length:  115
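
When a custom list is derived from an existing one, set operations avoid duplicate entries and make additions and removals explicit. This is a minimal sketch; `base` is a hypothetical stand-in for a library-provided list.

```python
base = ["i", "me", "my", "not", "the", "a"]   # hypothetical stand-in for a library list
to_add = {"first", "second", "third", "me"}   # 'me' is already present in base
to_drop = {"not"}                             # keep negations in the text

# Union adds new words without duplicating existing ones; difference removes words.
custom = (set(base) | to_add) - to_drop

print(len(custom))
print(sorted(custom))
```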


TF-IDF Vectorizer

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer
train = ('The sky is blue.','The sun is bright.')
test = ('The sun in the sky is bright', 'We can see the shining sun, the bright sun.')
# Instantiate the vectorizer object.
# analyzer='word' builds the vocabulary from word tokens; stop_words='english' removes English stop words.
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
tfidfvectorizer.fit(train)
tfidf_train = tfidfvectorizer.transform(train)
tfidf_term_vectors  = tfidfvectorizer.transform(test)
print("Dense matrix form of test data : \n")
tfidf_term_vectors.todense()

Dense matrix form of test data : 



matrix([[0.        , 0.57735027, 0.57735027, 0.57735027],
        [0.        , 0.4472136 , 0.        , 0.89442719]])
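
The numbers in this matrix can be reproduced by hand. The sketch below assumes scikit-learn's default settings: smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The small `STOP` set is an assumption that covers only the stop words occurring in these four sentences, and `tokenize` and `tfidf_row` are hypothetical helper names.

```python
import math
import re
from collections import Counter

STOP = {"the", "is", "in", "we", "can", "see"}  # small assumed stop list

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters, minus stop words.
    return [t for t in re.findall(r"\b\w\w+\b", doc.lower()) if t not in STOP]

train = ("The sky is blue.", "The sun is bright.")
test = ("The sun in the sky is bright", "We can see the shining sun, the bright sun.")

# The vocabulary and document frequencies come from the training documents only.
vocab = sorted({t for doc in train for t in tokenize(doc)})
n_docs = len(train)
df = Counter(t for doc in train for t in set(tokenize(doc)))
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    counts = Counter(tokenize(doc))          # words outside the vocabulary are ignored
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]

rows = [tfidf_row(doc) for doc in test]
print(vocab)
for row in rows:
    print([round(x, 8) for x in row])
```

The columns correspond to the sorted vocabulary (blue, bright, sky, sun). "shining" is absent from the training vocabulary, so it contributes nothing, and "sun" occurring twice in the second test sentence gives it twice the weight of "bright" before normalization.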