In [1]:
# Load, explore and plot data
import numpy as np
import pandas as pd
import re
import string
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# suppress display of warnings
import warnings
warnings.filterwarnings("ignore")

### Stemming

Stemming is the process of reducing a word to its root form. The root, or stem, is the part of the word to which affixes (like -ed, -ize, etc.) are added. A stemmer produces the stem by stripping the prefix or suffix of a word, so the result of stemming may not be an actual word.

For example:

- Mangoes ---> Mango
- Boys ---> Boy
- going ---> go

In [5]:
!pip install nltk



In [3]:
#importing nltk's porter stemmer 
from nltk.stem.porter import PorterStemmer 
#from nltk.tokenize import word_tokenize 

ps = PorterStemmer()


words = ['plays', 'playing', 'played', 'player', 'pharmacies', 'badly', 'improvement', 'hospitals']

for w in words:
    print(w, " : ", ps.stem(w))

plays  :  play
playing  :  play
played  :  play
player  :  player
pharmacies  :  pharmaci
badly  :  badli
improvement  :  improv
hospitals  :  hospit


In [7]:
from nltk.stem.snowball import SnowballStemmer
#from nltk.tokenize import word_tokenize 

sb = SnowballStemmer(language = 'english')


words = ['plays', 'playing', 'played', 'player', 'pharmacies', 'badly', 'improvement', 'hospitals']

for w in words:
    print(w, " : ", sb.stem(w))

plays  :  play
playing  :  play
played  :  play
player  :  player
pharmacies  :  pharmaci
badly  :  bad
improvement  :  improv
hospitals  :  hospit


### Lemmatization

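Like stemming, lemmatization reduces a word to its root form; the difference is that lemmatization ensures the root word (the lemma) belongs to the language, so we always get valid words. In NLTK (Natural Language Toolkit), we use WordNetLemmatizer to get the lemmas of words. By default, `lemmatize` treats every word as a noun, which is why verb forms such as 'playing' and 'played' are returned unchanged in the output below.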

In [8]:
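import nltk
from nltk.stem import WordNetLemmatizer 
#from nltk.tokenize import word_tokenize 

# the lemmatizer needs the WordNet corpus; download it once per environment
nltk.download('wordnet', quiet=True)

lemma = WordNetLemmatizer()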

words = ['plays', 'playing', 'played', 'player', 'pharmacies', 'badly', 'improvement', 'hospitals']

for w in words:
    print(w, " : ", lemma.lemmatize(w))

plays  :  play
playing  :  playing
played  :  played
player  :  player
pharmacies  :  pharmacy
badly  :  badly
improvement  :  improvement
hospitals  :  hospital


Stemming is faster and quite straightforward to implement. It can introduce some inaccuracies, although these may not matter for particular tasks and operations.

Lemmatization, on the other hand, provides better results by taking the part of speech (POS) of each word into account and returning real words. Lemmatization is harder to implement and somewhat slower than stemming.

In short, lemmatization is the better choice when the quality of the results matters most. On modern hardware, lemmatization rarely has a noticeable impact on performance, but if speed is the priority, stemming is the better option.

### *Bag-of-words using Count Vectorization*

*The bag-of-words model converts text into fixed-length vectors by counting how many times each word appears. Let us illustrate this with an example. Consider that we have the following sentences:*

- Text processing is necessary.
- Text processing is necessary and important.
- Text processing is easy.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Text processing is necessary.', 'Text processing is necessary and important.', 'Text processing is easy.']

vect = CountVectorizer()
X = vect.fit_transform(corpus)

In [12]:
X.toarray()

array([[0, 0, 0, 1, 1, 1, 1],
       [1, 0, 1, 1, 1, 1, 1],
       [0, 1, 0, 1, 0, 1, 1]], dtype=int64)

#### Limitations of Bag-of-Words:

- If we use bag-of-words to generate vectors for large documents, the vectors become very large and consist mostly of zeros, i.e. they are sparse vectors.
- Bag-of-words ignores word order, so it captures little of the meaning of the text. For example, the two sentences “Text processing is easy but tedious.” and “Text processing is tedious but easy.” produce the same bag-of-words vector, even though they have different meanings.
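Both limitations can be checked with a short sketch. It uses the two example sentences from the list above; the variable names (`pair`, `bow`, `dense`, `same`) are chosen here for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

pair = [
    "Text processing is easy but tedious.",
    "Text processing is tedious but easy.",
]

bow = CountVectorizer().fit_transform(pair)

# CountVectorizer returns a SciPy sparse matrix rather than a dense array.
print(issparse(bow))

# Word order is discarded, so both sentences map to the same count vector.
dense = bow.toarray()
same = bool((dense[0] == dense[1]).all())
print(same)
```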

### Term Frequency Inverse Document Frequency (TF-IDF)

*TF-IDF weights a word in proportion to the number of times it appears in a document, counterbalanced by the number of documents in which it is present. Hence, words like ‘this’, ‘are’, etc., that are commonly present in all the documents are not given a very high rank. However, a word that appears many times in only a few of the documents is given a higher rank, as it might be indicative of the context of those documents.*

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(corpus)
print(X.toarray())

[[0.         0.         0.         0.46333427 0.59662724 0.46333427
  0.46333427]
 [0.52523431 0.         0.52523431 0.31021184 0.39945423 0.31021184
  0.31021184]
 [0.         0.69903033 0.         0.41285857 0.         0.41285857
  0.41285857]]
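The numbers above can be reproduced by hand. The sketch below assumes the default settings of `TfidfVectorizer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "Text processing is necessary.",
    "Text processing is necessary and important.",
    "Text processing is easy.",
]

# Raw term counts (term frequency), one row per document.
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed inverse document frequency, as used by default in scikit-learn.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

In the first row, 'necessary' appears in two of the three documents and so gets a larger weight than 'is', 'processing' and 'text', which appear in all three.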


#### Important parameters to know – Sklearn’s CountVectorizer & TFIDF vectorization:

- *max_features: This parameter enables using only the ‘n’ most frequent words as features instead of all the words. An integer can be passed for this parameter.*


- *stop_words: You could remove the extremely common words like ‘this’, ’is’, ’are’ etc by using this parameter as the common words add little value to the model. We can set the parameter to ‘english’ to use a built-in list. We can also set this parameter to a custom list.*




- *ngram_range: An n-gram is a sequence of n consecutive words. For example, in the sentence “Text processing is easy.”, the 2-grams are ‘Text processing’, ‘processing is’ and ‘is easy’. We can set the ngram_range to be (x,y) where x is the minimum and y is the maximum size of the n-grams we want to include in the features. The default ngram_range is (1,1).*


- *min_df, max_df: These refer to the minimum and maximum document frequency that a word/n-gram should have to be used as a feature. A float in the range [0.0, 1.0] is interpreted as a proportion of documents, while an integer is interpreted as an absolute number of documents.*