After the text data is cleaned of impurities, it is ready for tokenization, lemmatization, and stemming.<br>
Here, however, we will tokenize raw text directly for the sake of illustration.

### Tokenization
Tokenization is the process of splitting paragraphs into sentences, or sentences into individual words.
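
As a rough illustration of these two levels of splitting, the sketch below uses only Python's built-in re module rather than nltk. The function names naive_sentences and naive_words and both regular expressions are assumptions made for this example; they ignore abbreviations, quotations, and other cases that the nltk tokenizers used in the following cells are built to handle.

```python
import re

def naive_sentences(text):
    # Split after ., ! or ? when followed by whitespace (a simplistic rule).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_words(sentence):
    # A run of word characters, or a single punctuation mark, is one token.
    return re.findall(r"\w+|[^\w\s]", sentence)

demo_text = "Dancing is what I do. I came riding on a bike. He eats pasta."
demo_sentences = naive_sentences(demo_text)
demo_words = [naive_words(s) for s in demo_sentences]

print(demo_sentences)
# ['Dancing is what I do.', 'I came riding on a bike.', 'He eats pasta.']
print(demo_words[0])
# ['Dancing', 'is', 'what', 'I', 'do', '.']
```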


#### Sentence Tokenization

In [53]:
import nltk
nltk.download("popular", quiet = True)

True

In [54]:
import os
import pandas as pd

path_to_file_1 = "../data/example_text.txt"
with open(path_to_file_1) as f:
    # read() returns the whole file as a single string;
    # readlines() would instead return a list with one string per line
    raw_text_1 = f.read()

# Read the reviews into a DataFrame, one review per row
def build_data_frame(path):
    rows = []
    index = []
    review_count = 0

    all_files = os.listdir(path)

    for filename in all_files:

        # os.path.join works whether or not `path` ends with a separator
        with open(os.path.join(path, filename)) as f:
            raw_text = f.read()
            rows.append(raw_text)
        index.append(filename)
        review_count += 1

        # For simplicity, use only the first 300 reviews for now.
        if review_count == 300:
            break

    data_frame = pd.DataFrame(rows, index=index)
    return data_frame

        
path_to_files_2 = "../data/imdb/train/unsup/"

reviews = build_data_frame(path_to_files_2)

In [55]:
# tokenize sentences

# Method 1
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(reviews.iloc[0,0])
sentences

['I admit, the great majority of films released before say 1933 are just not for me.',
 'Of the dozen or so "major" silents I have viewed, one I loved (The Crowd), and two were very good (The Last Command and City Lights, that latter Chaplin circa 1931).<br /><br />So I was apprehensive about this one, and humor is often difficult to appreciate (uh, enjoy) decades later.',
 'I did like the lead actors, but thought little of the film.<br /><br />One intriguing sequence.',
 'Early on, the guys are supposed to get "de-loused" and for about three minutes, fully dressed, do some schtick.',
 'In the background, perhaps three dozen men pass by, all naked, white and black (WWI ?',
 '), and for most, their butts, part or full backside, are shown.',
 'Was this an early variation of beefcake courtesy of Howard Hughes?']

In [56]:
# Method 2
import nltk.data
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
sentences = tokenizer.tokenize(reviews.iloc[0,0])
sentences

['I admit, the great majority of films released before say 1933 are just not for me.',
 'Of the dozen or so "major" silents I have viewed, one I loved (The Crowd), and two were very good (The Last Command and City Lights, that latter Chaplin circa 1931).<br /><br />So I was apprehensive about this one, and humor is often difficult to appreciate (uh, enjoy) decades later.',
 'I did like the lead actors, but thought little of the film.<br /><br />One intriguing sequence.',
 'Early on, the guys are supposed to get "de-loused" and for about three minutes, fully dressed, do some schtick.',
 'In the background, perhaps three dozen men pass by, all naked, white and black (WWI ?',
 '), and for most, their butts, part or full backside, are shown.',
 'Was this an early variation of beefcake courtesy of Howard Hughes?']
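
In the output above, the second and third items each still contain two sentences joined by HTML line-break tags (br), because the raw review was tokenized without cleaning; the missing whitespace after the period is likely what hides the sentence boundary. A minimal sketch of the cleaning step mentioned at the start of this section, assuming line-break tags are the only markup to remove, is to replace them with a space before tokenizing. The helper name strip_br is made up for this illustration, and the sample string is one item copied from the output.

```python
import re

def strip_br(text):
    # Replace <br>, <br/> and <br /> (any case) with a single space,
    # then collapse the runs of whitespace this can leave behind.
    no_tags = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", no_tags).strip()

snippet = ("I did like the lead actors, but thought little of the film."
           "<br /><br />One intriguing sequence.")
cleaned = strip_br(snippet)
print(cleaned)
# I did like the lead actors, but thought little of the film. One intriguing sequence.
```

The cleaned string can then be passed to sent_tokenize in place of the raw review text.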

#### Word Tokenization

In [57]:
sample_text = "Dancing is what I do. I came riding on a bike. He eats pasta."

# Method 1
from nltk.tokenize import word_tokenize
words = word_tokenize(sample_text)
words

['Dancing',
 'is',
 'what',
 'I',
 'do',
 '.',
 'I',
 'came',
 'riding',
 'on',
 'a',
 'bike',
 '.',
 'He',
 'eats',
 'pasta',
 '.']

In [58]:
# Method 2
from nltk.tokenize import TreebankWordTokenizer
tokenizer_obj = TreebankWordTokenizer()

words = tokenizer_obj.tokenize(sample_text)
words

['Dancing',
 'is',
 'what',
 'I',
 'do.',
 'I',
 'came',
 'riding',
 'on',
 'a',
 'bike.',
 'He',
 'eats',
 'pasta',
 '.']

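Both methods above show that if punctuation and apostrophes are not handled before the text is passed to an nltk tokenizer, we get undesired results: word_tokenize returns each period as a separate token, and TreebankWordTokenizer leaves "do." and "bike." with the period attached.<br>
nltk offers another tokenizer that uses regular expression matching to avoid such undesired results.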
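
To see what the pattern [\w']+ used in the next cells does with apostrophes, the sketch below applies it to a sentence containing contractions. The sentence is made up for this example, and Python's built-in re.findall is used as a stand-in for the nltk regex tokenizers, on the assumption that they match the pattern in the same way.

```python
import re

contraction_text = "I can't dance. She doesn't ride bikes."

# [\w']+ matches runs of word characters and apostrophes, so contractions
# stay in one piece and the periods are dropped.
contraction_tokens = re.findall(r"[\w']+", contraction_text)
print(contraction_tokens)
# ['I', "can't", 'dance', 'She', "doesn't", 'ride', 'bikes']
```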

In [59]:
# Method 1
from nltk.tokenize import regexp_tokenize
words = regexp_tokenize(sample_text, r"[\w']+")  # raw string, so \w is not treated as a string escape
words

['Dancing',
 'is',
 'what',
 'I',
 'do',
 'I',
 'came',
 'riding',
 'on',
 'a',
 'bike',
 'He',
 'eats',
 'pasta']

In [60]:
# Method 2
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string, so \w is not treated as a string escape
words = tokenizer.tokenize(sample_text)
words

['Dancing',
 'is',
 'what',
 'I',
 'do',
 'I',
 'came',
 'riding',
 'on',
 'a',
 'bike',
 'He',
 'eats',
 'pasta']

### Stemming
In language, the same word can take different forms depending on its usage. For example, "eat" has the forms "ate", "eaten", "eating", and "eats". Because of these different forms, there is redundancy in the corpus: we can mean the same thing using different forms. Compare "I rode from school to my house." with "I came riding from school to my house."; "rode" and "riding" convey the same information. We want to remove this redundancy from the data, and the process of reducing word forms to a common stem is called stemming. Literally, stemming means removing the stems from a fruit or vegetable; in text processing it is the affixes that are removed and the stem that is kept.<br><br>
Typically, a stemmer only truncates a word from the end.<br>
The nltk library has multiple stemmers available: PorterStemmer, LancasterStemmer, RegexpStemmer, and SnowballStemmer.<br>
PorterStemmer is the least aggressive of the nltk stemmers, while LancasterStemmer is very aggressive. RegexpStemmer lets us pass a regular expression that describes what to strip from a word.

In [61]:
from nltk.stem import PorterStemmer

pstemmer = PorterStemmer()
stemmed_words = [pstemmer.stem(word) for word in words]
stemmed_words

['danc',
 'is',
 'what',
 'I',
 'do',
 'I',
 'came',
 'ride',
 'on',
 'a',
 'bike',
 'He',
 'eat',
 'pasta']

In [62]:
from nltk.stem import LancasterStemmer

lstemmer = LancasterStemmer()
stemmed_words = [lstemmer.stem(word) for word in words]
stemmed_words

['dant',
 'is',
 'what',
 'i',
 'do',
 'i',
 'cam',
 'rid',
 'on',
 'a',
 'bik',
 'he',
 'eat',
 'past']

While stemming appears to give us the root word (called the lemma), it can produce wrong lemmas because most of the time it only truncates the word without any linguistic analysis. For example, stemming "caring" with a RegexpStemmer that strips "ing" gives us "car", which has an entirely different meaning from the actual lemma "care". Hence, instead of stemming, we should resort to lemmatization.
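
The "caring" example can be reproduced with a few lines of plain Python. This is a sketch of blind suffix removal using re.sub, not nltk's RegexpStemmer itself; the helper strip_suffix and its min_length parameter are hypothetical names chosen for this illustration.

```python
import re

def strip_suffix(word, suffix="ing", min_length=4):
    # Leave short words alone; otherwise cut the suffix off the end
    # without checking that what remains is a real word.
    if len(word) < min_length:
        return word
    return re.sub(re.escape(suffix) + "$", "", word)

stripped = {w: strip_suffix(w) for w in ["caring", "riding", "meeting", "king"]}
print(stripped)
# {'caring': 'car', 'riding': 'rid', 'meeting': 'meet', 'king': 'k'}
```

Only "meeting" ends up as a sensible word here, which is the kind of distortion that motivates lemmatization.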

### Lemmatization
Lemma: "run" is the lemma for "running"; "ride" is the lemma for "rides" or "riding".<br>
Lemmatization is the process of converting words into their roots/lemmas such that the meaning does not get distorted, as seen above for "caring". Also, stemming might produce words that do not exist in the language, whereas lemmatization is designed to return a meaningful word. For the conversion to be correct, the part of speech (POS) of the word should be known, i.e., noun, verb, adjective, etc.<br>
For example, if we lemmatize the word "meeting" as a noun, we get "meeting" as the lemma. However, if we pass it through a regular stemmer, it returns "meet", which could be a verb or a noun.
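
To make the role of the POS concrete, here is a toy sketch in which the lemma is looked up by the pair (word, POS). The table TOY_LEMMAS and the function toy_lemmatize are invented for this illustration and hold only a handful of hand-written entries; a real lemmatizer such as the WordNetLemmatizer used in the next cell relies on a full lexical database instead.

```python
# Hypothetical hand-written table: (word, POS) -> lemma.
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",  # "a long meeting"
    ("meeting", "v"): "meet",     # "we are meeting today"
    ("caring", "v"): "care",
    ("ate", "v"): "eat",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the pair is not in the table.
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("meeting", "n"))  # meeting
print(toy_lemmatize("meeting", "v"))  # meet
print(toy_lemmatize("ate"))           # ate (no noun entry, so unchanged)
```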

In [63]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ["cacti", "gods", "rains", "ate"]

In [64]:
lemmatized = [lemmatizer.lemmatize(word) for word in words]
lemmatized

['cactus', 'god', 'rain', 'ate']

As mentioned above, a lemmatizer needs to know the POS of each word. By default, WordNetLemmatizer treats every word as a noun. That is why it was unable to change "ate" to "eat": it considered "ate" a noun and found no other lemma for it in WordNet.<br>
To make sure the correct lemmatization happens, we can pass the POS explicitly, as below:

In [65]:
lemmatizer.lemmatize("ate", pos = "v") # where v means verb

'eat'
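
When lemmatizing whole sentences, the POS usually comes from a tagger rather than being typed by hand. nltk.pos_tag returns Penn Treebank tags such as "VBD" or "NNS", whereas WordNetLemmatizer expects codes such as "n", "v", "a", or "r". A minimal sketch of a mapping between the two follows; the function name penn_to_wordnet and the choice to fall back to noun are assumptions of this example, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    # Map the prefix of a Penn Treebank tag to a WordNet POS code.
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# Hand-written (word, tag) pairs standing in for tagger output.
tagged = [("He", "PRP"), ("ate", "VBD"), ("two", "CD"),
          ("cacti", "NNS"), ("quickly", "RB")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)
# [('He', 'n'), ('ate', 'v'), ('two', 'n'), ('cacti', 'n'), ('quickly', 'r')]
```

Each pair can then be passed on as lemmatizer.lemmatize(word, pos=code), so that "ate" is treated as a verb without the POS being written out manually.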