<img style="float: left;" src="pic2.png">

### Sridhar Palle, Ph.D, spalle@emory.edu (Applied ML & DS with Python Program)

# Text Preprocessing

**Import the libraries and dependencies**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from bs4 import BeautifulSoup
import nltk
import contractions
%matplotlib inline

In [2]:
#nltk.download('all', halt_on_error=False) # do this only once if never done before

## 1. Regex operations

* re.match() - matches a pattern only at the beginning of the string
* re.search() - matches a pattern occurring at any position
* re.findall() - returns all non-overlapping matches of a specified regex pattern
* re.sub() - replaces a pattern with another string

In [3]:
sample_text = "Learning is a repetitive process. Best way of learning anything in life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier"
sample_text

'Learning is a repetitive process. Best way of learning anything in life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier'

In [4]:
re.match('Learning', sample_text)

<re.Match object; span=(0, 8), match='Learning'>

In [5]:
re.match('Learning', sample_text).span()

(0, 8)

In [6]:
re.match('Best', sample_text) # match only works for a pattern at the beginning, so this returns None (no output)
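
Because a failed `re.match()` returns `None`, chaining `.span()` onto it raises an `AttributeError`. A minimal sketch of guarding against that, using a shortened, hypothetical `text` in place of `sample_text`:

```python
import re

# Shortened stand-in for sample_text (an assumption for this sketch)
text = "Learning is a repetitive process. Best way of learning anything"

# re.match() only tries the pattern at position 0, so 'Best' does not match
match = re.match('Best', text)

# Check for None before calling methods such as .span()
if match is None:
    result = "no match at the start of the string"
else:
    result = match.span()

print(result)
```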

**re.search()**

In [7]:
sample_text

'Learning is a repetitive process. Best way of learning anything in life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier'

In [8]:
re.search('Best', sample_text) #search works to match pattern at any position

<re.Match object; span=(34, 38), match='Best'>

**re.findall()**

In [9]:
sample_text

'Learning is a repetitive process. Best way of learning anything in life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier'

In [10]:
re.findall('Learning', sample_text, re.I)

['Learning', 'learning']

In [11]:
re.findall('is', sample_text)

['is', 'is', 'is']

In [15]:
re.findall('[^A-Za-z0-9., ]', sample_text) # returns every character other than letters, digits, period, comma and space

['@', '@', '<', '>', '<', '>', '?', '?', '?', '$', '*', '*']

**re.sub()**

In [16]:
sample_text

'Learning is a repetitive process. Best way of learning anything in life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier'

In [17]:
re.sub('Best', 'Super', sample_text) # substitutes a regex pattern in a string with another

'Learning is a repetitive process. Super way of learning anything in life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier'

In [18]:
re.sub('in', '500', sample_text)

'Learn500g is a repetitive process. Best way of learn500g anyth500g 500 life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty 500 understand500g???. Aga500 $ when we take baby steps**,everyth500g becomes easier'
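
The output above shows that a plain `'in'` pattern also matches inside longer words such as "Learning". If only the standalone word should be replaced, one option is the word-boundary token `\b`. A small sketch on a shortened, hypothetical `text`:

```python
import re

# Shortened stand-in for sample_text (an assumption for this sketch)
text = "Best way of learning anything in life"

# Plain 'in' also matches inside longer words
replaced_everywhere = re.sub('in', '500', text)

# \b marks a word boundary, so only the standalone word 'in' is replaced
replaced_whole_word = re.sub(r'\bin\b', '500', text)

print(replaced_everywhere)
print(replaced_whole_word)
```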

In [19]:
sample_text.replace('Best', 'Super') # str.replace() gives the same result for a literal (non-regex) pattern and is typically faster

'Learning is a repetitive process. Super way of learning anything in life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier'

### 1.1 Regex rules

<img style="float: left;" src="reg.png">

**. Period**

In [20]:
sample_text

'Learning is a repetitive process. Best way of learning anything in life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier'

In [21]:
re.findall('.ea.', sample_text) # . matches any single character (except a newline), here one before and one after 'ea'

['Lear', 'lear', ' eas']

In [22]:
re.findall('l..', sample_text)

['lea', 'lif', 'lll', 'ly ', 'lty']

**^**

In [23]:
sample_text

'Learning is a repetitive process. Best way of learning anything in life is to actualllly do it, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier'

In [24]:
re.findall('^L', sample_text, re.I) # ^ for matching the start of the string

['L']

**$**

In [25]:
re.findall('..r$', sample_text) # $ for matching the end of the string

['ier']

**[...]**

In [26]:
re.findall('[@]', sample_text) # [...] matches any one character from the set inside the brackets

['@', '@']

**[^...]**

In [27]:
re.findall('[^A-Za-z., ]', sample_text) # [^...] matches any character that is NOT listed after the ^ inside the brackets

['@', '@', '<', '>', '<', '>', '?', '?', '?', '$', '*', '*']

In [28]:
sample_text2 = "Learning is a repetitive process. Number 1 way of learning anything in life is to actualllly do it 1000 times, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier"
sample_text2

'Learning is a repetitive process. Number 1 way of learning anything in life is to actualllly do it 1000 times, @@data science wisdom <br> <br>.but what about difficulty in understanding???. Again $ when we take baby steps**,everything becomes easier'

**\d**

In [29]:
re.findall(r'\d', sample_text2) # \d for matching decimal digits, same as [0-9]; raw string keeps the backslash intact

['1', '1', '0', '0', '0']
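
Patterns that contain backslashes are usually written as raw strings (`r'...'`), since recent Python versions warn about escape sequences such as `'\d'` in ordinary string literals. The sketch below also adds the `+` quantifier (one or more) to show how digits can be grouped into whole numbers; `text` is a shortened, hypothetical stand-in for `sample_text2`.

```python
import re

# Shortened stand-in for sample_text2 (an assumption for this sketch)
text = "Number 1 way of learning is to do it 1000 times"

# Raw string: the backslash reaches the regex engine unchanged
single_digits = re.findall(r'\d', text)

# + means "one or more", so consecutive digits are returned as one match
whole_numbers = re.findall(r'\d+', text)

print(single_digits)
print(whole_numbers)
```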

**\D**

In [30]:
re.findall(r'\D', sample_text2)[0:5] # \D for matching non-digits

['L', 'e', 'a', 'r', 'n']

**\s**

In [31]:
re.findall(r'\s', sample_text2)[0:5] # \s for matching whitespace characters

[' ', ' ', ' ', ' ', ' ']

**\S**

In [32]:
''.join(re.findall(r'\S', sample_text2)) # \S for matching non-whitespace characters

'Learningisarepetitiveprocess.Number1wayoflearninganythinginlifeistoactuallllydoit1000times,@@datasciencewisdom<br><br>.butwhataboutdifficultyinunderstanding???.Again$whenwetakebabysteps**,everythingbecomeseasier'

**\w**

In [33]:
re.findall(r'\w', sample_text2)[0:5] # \w for matching word characters (letters, digits and underscore), i.e. [a-zA-Z0-9_] for ASCII text

['L', 'e', 'a', 'r', 'n']

**\W**

In [34]:
re.findall(r'\W', sample_text2)[0:9] # \W for matching non-word characters, i.e. [^a-zA-Z0-9_] for ASCII text

[' ', ' ', ' ', ' ', '.', ' ', ' ', ' ', ' ']

**For more info on regular expressions please see https://docs.python.org/3.4/library/re.html**

## 2. Text Preprocessing

**Let's load a bigger IMDB reviews dataset**

In [35]:
imdb_big = pd.read_csv('movie_reviews.csv')

In [36]:
imdb_big.head(3)

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |


In [37]:
imdb_big.shape

(50000, 2)

In [38]:
imdb_big['review'].describe()

count                                                 50000
unique                                                49582
top       Loved today's show!!! It was a variety and not...
freq                                                      5
Name: review, dtype: object
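
The `describe()` output reports 49582 unique reviews out of 50000, so some reviews appear more than once. Whether to drop them depends on the task; the sketch below only shows how they could be counted and removed, using a small hypothetical DataFrame `toy` in place of `imdb_big`.

```python
import pandas as pd

# Toy stand-in for imdb_big (an assumption; the real data comes from movie_reviews.csv)
toy = pd.DataFrame({
    'review': ['great film', 'bad film', 'great film'],
    'sentiment': ['positive', 'negative', 'positive'],
})

# duplicated() flags every repeat of an earlier value
n_duplicates = toy['review'].duplicated().sum()

# drop_duplicates() keeps the first occurrence of each review by default
deduplicated = toy.drop_duplicates(subset='review')

print(n_duplicates, deduplicated.shape)
```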

In [39]:
imdb_big['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

### 2.1 Some basic preprocessing methodologies

**Let's take a sample review and demonstrate different preprocessing methodologies**

In [40]:
# reviews with a lot of special characters are at indices 20867, 26791, 37153, 42947, 48952

In [41]:
sample_review = imdb_big['review'][42947]
sample_review

'In a far away Galaxy is a planet called Ceta. It\'s native people worship cats. But the dog people wage war upon these feline loving people and they have no choice but to go to Earth and grind people up for food. This is one of the stupidest f#@k!ng ideas for a movie I\'ve seen. Leave it to Ted Mikels to make a movie more incompetent than the already low standard he set in previous films. It\'s like he enjoying playing in a celluloid game of Limbo. How low can he go? The only losers in the scenario are US the viewer. Mr. Mikels and his silly little handlebar mustache actually has people who STILL buy this crap.<br /><br />My Grade: F <br /><br />DVD Extras: Commentary by Ted Mikels; the Story behind the Making of (9 and a half minutes); 17 minutes, 15 seconds of Behind the scenes footage; Ted Mikels filmography; and Trailers for "The Worm Eaters" "Girl in Gold Boots", "the Doll Squad", "Ten Violent Women" (featuring nudity), "Blood Orgy of the She Devils", & "the Corpse Grinders"'

**Text normalization (preprocessing) steps**

- Converting to lowercase
- Removing HTML tags
- Expanding contractions
- Removing punctuation and special characters
- Removing stop words
- Stemming or lemmatization
- Correcting words and spellings
- n-grams

**Converting to lowercase**

In [42]:
sample_review

'In a far away Galaxy is a planet called Ceta. It\'s native people worship cats. But the dog people wage war upon these feline loving people and they have no choice but to go to Earth and grind people up for food. This is one of the stupidest f#@k!ng ideas for a movie I\'ve seen. Leave it to Ted Mikels to make a movie more incompetent than the already low standard he set in previous films. It\'s like he enjoying playing in a celluloid game of Limbo. How low can he go? The only losers in the scenario are US the viewer. Mr. Mikels and his silly little handlebar mustache actually has people who STILL buy this crap.<br /><br />My Grade: F <br /><br />DVD Extras: Commentary by Ted Mikels; the Story behind the Making of (9 and a half minutes); 17 minutes, 15 seconds of Behind the scenes footage; Ted Mikels filmography; and Trailers for "The Worm Eaters" "Girl in Gold Boots", "the Doll Squad", "Ten Violent Women" (featuring nudity), "Blood Orgy of the She Devils", & "the Corpse Grinders"'

In [43]:
def lower_case(text):
    return text.lower()

In [44]:
sample_review = lower_case(sample_review)
sample_review

'in a far away galaxy is a planet called ceta. it\'s native people worship cats. but the dog people wage war upon these feline loving people and they have no choice but to go to earth and grind people up for food. this is one of the stupidest f#@k!ng ideas for a movie i\'ve seen. leave it to ted mikels to make a movie more incompetent than the already low standard he set in previous films. it\'s like he enjoying playing in a celluloid game of limbo. how low can he go? the only losers in the scenario are us the viewer. mr. mikels and his silly little handlebar mustache actually has people who still buy this crap.<br /><br />my grade: f <br /><br />dvd extras: commentary by ted mikels; the story behind the making of (9 and a half minutes); 17 minutes, 15 seconds of behind the scenes footage; ted mikels filmography; and trailers for "the worm eaters" "girl in gold boots", "the doll squad", "ten violent women" (featuring nudity), "blood orgy of the she devils", & "the corpse grinders"'

**Removing html tags**

In [45]:
def html_parser(text):
    return BeautifulSoup(text, "html.parser").get_text()

In [46]:
sample_review = html_parser(sample_review)
sample_review

'in a far away galaxy is a planet called ceta. it\'s native people worship cats. but the dog people wage war upon these feline loving people and they have no choice but to go to earth and grind people up for food. this is one of the stupidest f#@k!ng ideas for a movie i\'ve seen. leave it to ted mikels to make a movie more incompetent than the already low standard he set in previous films. it\'s like he enjoying playing in a celluloid game of limbo. how low can he go? the only losers in the scenario are us the viewer. mr. mikels and his silly little handlebar mustache actually has people who still buy this crap.my grade: f dvd extras: commentary by ted mikels; the story behind the making of (9 and a half minutes); 17 minutes, 15 seconds of behind the scenes footage; ted mikels filmography; and trailers for "the worm eaters" "girl in gold boots", "the doll squad", "ten violent women" (featuring nudity), "blood orgy of the she devils", & "the corpse grinders"'
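
Note that stripping the `<br />` tags glues neighbouring words together ("crap.my grade"). One possible way around this is the `separator` argument of `get_text()`, sketched below on a shortened, hypothetical `html` string:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample review (an assumption for this sketch)
html = 'buy this crap.<br /><br />my grade: f <br /><br />dvd extras'

# Default behaviour: text fragments are concatenated with nothing in between
glued = BeautifulSoup(html, 'html.parser').get_text()

# separator=' ' puts a space between fragments; strip=True trims each fragment first
spaced = BeautifulSoup(html, 'html.parser').get_text(separator=' ', strip=True)

print(glued)
print(spaced)
```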

**Expanding contractions**

In [47]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

In [48]:
sample_review = replace_contractions(sample_review)
sample_review

'in a far away galaxy is a planet called ceta. it is native people worship cats. but the dog people wage war upon these feline loving people and they have no choice but to go to earth and grind people up for food. this is one of the stupidest f#@k!ng ideas for a movie i\'ve seen. leave it to ted mikels to make a movie more incompetent than the already low standard he set in previous films. it is like he enjoying playing in a celluloid game of limbo. how low can he go? the only losers in the scenario are us the viewer. mr. mikels and his silly little handlebar mustache actually has people who still buy this crap.my grade: f dvd extras: commentary by ted mikels; the story behind the making of (9 and a half minutes); 17 minutes, 15 seconds of behind the scenes footage; ted mikels filmography; and trailers for "the worm eaters" "girl in gold boots", "the doll squad", "ten violent women" (featuring nudity), "blood orgy of the she devils", & "the corpse grinders"'

**Removing punctuation and special characters**

In [49]:
def remove_special(text):
    return re.sub('[^a-zA-Z0-9]', ' ', text)

In [50]:
sample_review = remove_special(sample_review)
sample_review

'in a far away galaxy is a planet called ceta  it is native people worship cats  but the dog people wage war upon these feline loving people and they have no choice but to go to earth and grind people up for food  this is one of the stupidest f  k ng ideas for a movie i ve seen  leave it to ted mikels to make a movie more incompetent than the already low standard he set in previous films  it is like he enjoying playing in a celluloid game of limbo  how low can he go  the only losers in the scenario are us the viewer  mr  mikels and his silly little handlebar mustache actually has people who still buy this crap my grade  f dvd extras  commentary by ted mikels  the story behind the making of  9 and a half minutes   17 minutes  15 seconds of behind the scenes footage  ted mikels filmography  and trailers for  the worm eaters   girl in gold boots    the doll squad    ten violent women   featuring nudity    blood orgy of the she devils      the corpse grinders '
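
Replacing each special character with a space leaves runs of consecutive spaces in the text. The tokenizer used in the next step ignores them, but if a clean string is needed at this point, the runs can be collapsed with one more `re.sub()`. The helper name `collapse_whitespace` is hypothetical:

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

cleaned = collapse_whitespace('she devils      the corpse grinders ')
print(cleaned)
```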

**Removing stop words**

In [58]:
def remove_stopwords(text):
    """Remove NLTK English stop words from a string of text"""
    stopword_set = set(nltk.corpus.stopwords.words('english'))  # a set gives fast membership tests
    words = nltk.word_tokenize(text)
    words = [word.strip() for word in words]
    filtered_words = [word for word in words if word not in stopword_set]
    return ' '.join(filtered_words)

In [59]:
sample_review = remove_stopwords(sample_review)
sample_review

'far away galaxy planet called ceta native people worship cats dog people wage war upon feline loving people choice go earth grind people food one stupidest f k ng ideas movie seen leave ted mikels make movie incompetent already low standard set previous films like enjoying playing celluloid game limbo low go losers scenario us viewer mr mikels silly little handlebar mustache actually people still buy crap grade f dvd extras commentary ted mikels story behind making 9 half minutes 17 minutes 15 seconds behind scenes footage ted mikels filmography trailers worm eaters girl gold boots doll squad ten violent women featuring nudity blood orgy devils corpse grinders'
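
In the output above, "no choice" has become "choice" because negations such as "no" are part of the stop word list. For sentiment tasks it may be worth keeping them. The sketch below illustrates the idea with a tiny hand-written stop list and `str.split()`; both are assumptions made to keep the example self-contained, not the NLTK list or tokenizer used above.

```python
# Hypothetical, hand-written stop list for illustration only
stop_words = {'they', 'have', 'no', 'but', 'to', 'the', 'and'}

# Negation words we want to keep even though they are stop words
keep = {'no', 'not', 'nor'}

def remove_stopwords_keep_negations(text, stop_words, keep):
    """Drop stop words from text, except those listed in keep."""
    effective = stop_words - keep  # set difference removes the protected words
    return ' '.join(word for word in text.split() if word not in effective)

filtered = remove_stopwords_keep_negations('they have no choice but to go', stop_words, keep)
print(filtered)
```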

**Stemming or Lemmatization**

In [60]:
def word_stem(text, kind='stemming'):
    """Stem (default) or lemmatize (kind='lemmatize') each token in a string of text"""
    from nltk.stem import WordNetLemmatizer
    from nltk.stem import PorterStemmer
    wnl = WordNetLemmatizer()
    ps = PorterStemmer()

    words = nltk.word_tokenize(text)
    words = [word.strip() for word in words]
    # WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed
    filtered_words = [wnl.lemmatize(word) if (kind == 'lemmatize') else ps.stem(word) for word in words]
    return ' '.join(filtered_words)

In [61]:
word_stem(sample_review)

'far away galaxi planet call ceta nativ peopl worship cat dog peopl wage war upon felin love peopl choic go earth grind peopl food one stupidest f k ng idea movi seen leav ted mikel make movi incompet alreadi low standard set previou film like enjoy play celluloid game limbo low go loser scenario us viewer mr mikel silli littl handlebar mustach actual peopl still buy crap grade f dvd extra commentari ted mikel stori behind make 9 half minut 17 minut 15 second behind scene footag ted mikel filmographi trailer worm eater girl gold boot doll squad ten violent women featur nuditi blood orgi devil corps grinder'

In [62]:
word_stem(sample_review, 'lemmatize')

'far away galaxy planet called ceta native people worship cat dog people wage war upon feline loving people choice go earth grind people food one stupidest f k ng idea movie seen leave ted mikels make movie incompetent already low standard set previous film like enjoying playing celluloid game limbo low go loser scenario u viewer mr mikels silly little handlebar mustache actually people still buy crap grade f dvd extra commentary ted mikels story behind making 9 half minute 17 minute 15 second behind scene footage ted mikels filmography trailer worm eater girl gold boot doll squad ten violent woman featuring nudity blood orgy devil corpse grinder'
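
The individual steps can be chained into a single function so that the same sequence is applied to every review. The sketch below shows one possible way to do that. To stay self-contained it uses only standard-library steps: `strip_tags` is a simple regex stand-in for `html_parser`, and `collapse_whitespace` and `preprocess` are hypothetical helpers, not part of the notebook above.

```python
import re

def lower_case(text):
    return text.lower()

def strip_tags(text):
    # Simple regex stand-in for the BeautifulSoup-based html_parser
    return re.sub(r'<[^>]+>', ' ', text)

def remove_special(text):
    return re.sub('[^a-zA-Z0-9]', ' ', text)

def collapse_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def preprocess(text, steps):
    """Apply each preprocessing function in order and return the final text."""
    for step in steps:
        text = step(text)
    return text

steps = [lower_case, strip_tags, remove_special, collapse_whitespace]
cleaned = preprocess('Buy this crap.<br /><br />My Grade: F', steps)
print(cleaned)
```

With pandas, such a function could then be applied to a whole column, for example `imdb_big['review'].apply(lambda t: preprocess(t, steps))`.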