Text cleaning techniques:
1. Normalizing text - case normalization
2. Tokenization
3. Removing stop words and punctuation
4. Stemming and lemmatization

Other steps include:
1. dealing with numbers
2. spell check
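
As a rough illustration of stop-word removal, punctuation removal, and number handling, here is a minimal sketch in plain Python. The stop-word set, the `NUM` placeholder, and the function name `clean_tokens` are assumptions made for this example; a real pipeline would more likely use a full stop-word list such as the one shipped with NLTK.

```python
import re
import string

# Assumption: a tiny hand-picked stop-word set, for illustration only.
STOP_WORDS = {"is", "the", "a", "to", "and", "be", "may", "even"}

def clean_tokens(text, number_token="NUM"):
    """Lowercase, replace numbers with a placeholder, drop punctuation and stop words."""
    text = text.lower()
    # Replace integers and simple decimals with a placeholder token.
    text = re.sub(r"\d+(?:\.\d+)?", f" {number_token} ", text)
    cleaned = []
    for tok in text.split():
        # Strip punctuation from both ends of each whitespace-separated token.
        tok = tok.strip(string.punctuation)
        if tok and tok not in STOP_WORDS:
            cleaned.append(tok)
    return cleaned

cleaned_example = clean_tokens("Sachin scored 100 runs, and Kohli is close!")
print(cleaned_example)
```

Replacing numbers with a placeholder is only one option; depending on the task, numbers can also be dropped entirely or kept as they are.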

Case Normalization
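
Case normalization usually means converting all text to one case so that "Is" and "is" count as the same word. A minimal sketch using only built-in string methods (the sample strings are made up for illustration):

```python
text = "Sachin is the best player. Is Kohli even close?"

# lower() maps every character to lowercase.
lowered = text.lower()
print(lowered)

# "Is" and "is" collapse into one word, so the vocabulary shrinks.
distinct_before = len(set(text.split()))
distinct_after = len(set(lowered.split()))
print(distinct_before, distinct_after)

# casefold() is a more aggressive variant meant for caseless comparison,
# e.g. it maps the German sharp s to "ss".
same_when_folded = "Straße".casefold() == "STRASSE".casefold()
same_when_lowered = "Straße".lower() == "STRASSE".lower()
print(same_when_folded, same_when_lowered)
```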

In [10]:
import nltk
from nltk.tokenize import word_tokenize

NLTK provides several tokenizers, including:
    1. word_tokenize
    2. wordpunct_tokenize
    3. TweetTokenizer
    4. regexp_tokenize
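
To see how these tokenizers differ, the sketch below runs three of them on the same made-up sentence. It assumes NLTK is installed; these three do not need the `punkt` download, unlike `word_tokenize`.

```python
from nltk.tokenize import TweetTokenizer, regexp_tokenize, wordpunct_tokenize

sample = "Can't wait! Loved the match :) #cricket"

# Splits into runs of word characters and runs of punctuation.
wordpunct_tokens = wordpunct_tokenize(sample)

# Keeps only what the regular expression matches (here: word characters).
regexp_tokens = regexp_tokenize(sample, pattern=r"\w+")

# Designed for social media text: keeps emoticons and hashtags together.
tweet_tokens = TweetTokenizer().tokenize(sample)

print(wordpunct_tokens)
print(regexp_tokens)
print(tweet_tokens)
```

Which tokenizer fits best depends on the text: `TweetTokenizer` is aimed at informal social media text, while `regexp_tokenize` gives full control over what counts as a token.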

In [11]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sumit\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
from nltk.tokenize import sent_tokenize
paragraph = "Sachin is the best player. Is Kohli even close? Dhoni may be miles behind, Gambhir and Yuvraj..!! Sehwag is a delight to watch."
paragraph

'Sachin is the best player. Is Kohli even close? Dhoni may be miles behind, Gambhir and Yuvraj..!! Sehwag is a delight to watch.'

In [13]:
sentencetoken = nltk.sent_tokenize(paragraph)
wordtoken = nltk.word_tokenize(paragraph)

In [14]:
sentencetoken

['Sachin is the best player.',
 'Is Kohli even close?',
 'Dhoni may be miles behind, Gambhir and Yuvraj..!!',
 'Sehwag is a delight to watch.']

In [15]:
print(wordtoken)

['Sachin', 'is', 'the', 'best', 'player', '.', 'Is', 'Kohli', 'even', 'close', '?', 'Dhoni', 'may', 'be', 'miles', 'behind', ',', 'Gambhir', 'and', 'Yuvraj', '..', '!', '!', 'Sehwag', 'is', 'a', 'delight', 'to', 'watch', '.']


# Stemming

- Stemming reduces a word to its root (stem) form
- It is rule based: it chops off characters at the end of the word
- The stemmed word might not be a dictionary word
- Two common types:
    1. Porter stemmer - the oldest one, originally developed in 1979
    2. Snowball stemmer - a more sophisticated stemmer that supports multiple languages and is faster than the Porter stemmer
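
To make the "rule based, chops off the end" idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter or Snowball algorithm; the rule list and the name `toy_stem` are assumptions made for this sketch, and the real stemmers apply many more rules and conditions.

```python
# Assumption: a handful of illustrative (suffix, replacement) rules,
# tried in order; the first rule that applies wins.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ly", ""), ("es", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

words = ["studies", "studying", "cries", "orderly", "his"]
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)
```

Stems such as "studi" and "cri" are not dictionary words, which is the trade-off of chopping suffixes by rule rather than looking words up.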

In [16]:
from nltk.stem import PorterStemmer, SnowballStemmer

In [17]:
stemmer_s = SnowballStemmer("english")

In [20]:
text = "studies studying cries cry his execute orderly university universal"
tokens = word_tokenize(text)
print([ stemmer_s.stem(word) for word in tokens ])

['studi', 'studi', 'cri', 'cri', 'his', 'execut', 'order', 'univers', 'univers']


# Lemmatization

- Like stemming, lemmatization reduces a word to a root form, called the lemma
- It resolves words to their dictionary form
- The lemma of a word is its dictionary (canonical) form
- The lemmatizer in NLTK uses the WordNet data set, which groups words into sets of synonyms
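
Conceptually, lemmatization is a dictionary lookup keyed by the word and its part of speech. The sketch below uses a tiny hand-made table as a stand-in for WordNet; the table contents and the name `toy_lemmatize` are assumptions for illustration and are not how NLTK implements it internally.

```python
# Assumption: a tiny hand-made lemma table standing in for WordNet.
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("studying", "v"): "study",
    ("drives", "n"): "drive",
    ("drives", "v"): "drive",
    ("driving", "v"): "drive",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); words missing from the table are returned unchanged."""
    return LEMMA_TABLE.get((word, pos), word)

sample_words = ["studies", "studying", "xyzzy"]
as_nouns = [toy_lemmatize(w) for w in sample_words]
as_verbs = [toy_lemmatize(w, pos="v") for w in sample_words]
print(as_nouns)
print(as_verbs)
```

The part of speech matters: "studying" is left alone when treated as a noun but maps to "study" when treated as a verb, and a word that is not in the table is never altered.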

In [22]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sumit\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [23]:
from nltk.stem import WordNetLemmatizer

In [24]:
lemm = WordNetLemmatizer()

In [25]:
# Lemmatize the sentences below
txt1 = "he is very methodical and orderly in his execution"
txt2 = "he is driving and drives the down of the drived vehicle"
txt3 = "studies studying cries cry likes his execute"
txt4 = "studies studying cries cry his likes execute orderly university universal"

In [26]:
[lemm.lemmatize(word) for word in word_tokenize(txt1.lower()) ]

['he', 'is', 'very', 'methodical', 'and', 'orderly', 'in', 'his', 'execution']

- Lemmatization is not very aggressive in reducing a word to its root form
- If the word to be lemmatized is not part of the dictionary, it is left as is
- This ensures that the meaning of the sentence is not altered
- In most scenarios the number of distinct words after lemmatization could be the same as before
    - Every step in text cleaning helps in reducing the number of words, but the lemmatizer might not make a difference

In [27]:
print([lemm.lemmatize(word) for word in word_tokenize(txt2.lower()) ])

['he', 'is', 'driving', 'and', 'drive', 'the', 'down', 'of', 'the', 'drived', 'vehicle']


In [28]:
# The lemmatizer by default treats every word as a noun; the code below
# passes pos='v' so that each word is lemmatized as a verb
print([lemm.lemmatize(word, pos='v') for word in word_tokenize(txt2.lower()) ])

['he', 'be', 'drive', 'and', 'drive', 'the', 'down', 'of', 'the', 'drive', 'vehicle']


In [29]:
txt4 = "studies studying cries cry his likes execute orderly ordered university universal"

In [30]:
print([lemm.lemmatize(word) for word in word_tokenize(txt4.lower()) ])

['study', 'studying', 'cry', 'cry', 'his', 'like', 'execute', 'orderly', 'ordered', 'university', 'universal']


In [None]:
print([stemmer_s.stem(word) for word in word_tokenize(txt4.lower()) ])

# Convert Emoticons to Text

In [1]:
!pip install emot
import emot



In [2]:
text = "very bad phone :) :P :D"

In [3]:
emot_obj = emot.emot() 

In [4]:
emot_obj.emoticons(text)

{'value': [':)', ':P', ':D'],
 'location': [[15, 17], [18, 20], [21, 23]],
 'mean': ['Happy face or smiley',
  'Tongue sticking out, cheeky, playful or blowing a raspberry',
  'Laughing, big grin or laugh with glasses'],
 'flag': True}
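
The result above can be used to swap each emoticon for its meaning. The sketch below works directly on a dictionary with the same `value`, `location`, and `mean` keys as the output shown, so it does not call `emot` itself; the helper name `replace_emoticons` is an assumption for this example.

```python
def replace_emoticons(text, result):
    """Replace each [start, end) span in result['location'] with its meaning."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    spans = sorted(zip(result["location"], result["mean"]), reverse=True)
    for (start, end), meaning in spans:
        text = text[:start] + meaning + text[end:]
    return text

text = "very bad phone :) :P :D"
# Copied from the output above.
result = {
    "value": [":)", ":P", ":D"],
    "location": [[15, 17], [18, 20], [21, 23]],
    "mean": [
        "Happy face or smiley",
        "Tongue sticking out, cheeky, playful or blowing a raspberry",
        "Laughing, big grin or laugh with glasses",
    ],
    "flag": True,
}

converted = replace_emoticons(text, result)
print(converted)
```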