# Supplemental Data Cleaning

__Stemming__: the process of reducing inflected (or sometimes derived) words to their word stem or root.

stemmed/stemming -> stem

electricity/electrical -> electric

__Benefits__:
- Reduces the corpus of words the model is exposed to
- Explicitly correlates words with similar meanings

__Methods__:
- Porter
- Snowball
- Lancaster
- Regex-based
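
As a rough illustration of the regex-based option, the sketch below strips one common English suffix with a regular expression. The suffix list, the `regex_stem` name, and the `min_stem_len` parameter are assumptions made for this example; they are not part of NLTK or of any published stemming algorithm.

```python
import re

# Hypothetical suffix list, for illustration only; real stemmers apply many more rules.
_SUFFIX_PATTERN = re.compile(r"(ing|ed|ly|es|s)$")


def regex_stem(word, min_stem_len=3):
    """Strip one common suffix if the remaining stem keeps at least min_stem_len characters."""
    match = _SUFFIX_PATTERN.search(word)
    if match and match.start() >= min_stem_len:
        return word[:match.start()]
    return word


print(regex_stem("stemming"))  # stemm
print(regex_stem("stems"))     # stem
print(regex_stem("sing"))      # sing (stem would be too short, so the word is kept)
```

Note that `stemming` becomes `stemm` rather than `stem`: a single regex has no rule for doubled consonants, which is one reason rule-based stemmers such as Porter use several passes.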

## Test out Porter Stemmer

In [1]:
import nltk
ps = nltk.PorterStemmer()

In [2]:
dir(ps)

['MARTIN_EXTENSIONS',
 'NLTK_EXTENSIONS',
 'ORIGINAL_ALGORITHM',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_abc_impl',
 '_apply_rule_list',
 '_contains_vowel',
 '_ends_cvc',
 '_ends_double_consonant',
 '_has_positive_measure',
 '_is_consonant',
 '_measure',
 '_replace_suffix',
 '_step1a',
 '_step1b',
 '_step1c',
 '_step2',
 '_step3',
 '_step4',
 '_step5a',
 '_step5b',
 'mode',
 'pool',
 'stem',
 'unicode_repr',
 'vowels']

__NB__: focus on the `stem` method.

In [8]:
print(f"Stems of grows/growing/grown: {ps.stem('grows')}, {ps.stem('growing')}, {ps.stem('grown')}")
print(f"Stems of run/running/runner: {ps.stem('run')}, {ps.stem('running')}, {ps.stem('runner')}")

Stems of grows/growing/grown: grow, grow, grown
Stems of run/running/runner: run, run, runner
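
The same `.stem()` call is available on the other NLTK stemmers listed under Methods. A minimal sketch for comparing them side by side, assuming NLTK is installed; the `compare_stemmers` helper and the `stemmers` dictionary are names made up for this example:

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# All three classes expose the same .stem(word) method.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

words = ["grows", "growing", "grown", "run", "running", "runner"]


def compare_stemmers(words, stemmers):
    """Return {word: {stemmer_name: stem}} so differences are easy to scan."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


for word, stems in compare_stemmers(words, stemmers).items():
    print(word, stems)
```

Running it on a handful of words from the actual corpus is a quick way to judge which stemmer is too aggressive or too conservative for the task.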


## Load the SMS Spam Collection from a TSV file

In [10]:
import pandas as pd
# for regular expressions
import re
import string

pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')

data = pd.read_csv("../dataset/SMSSpamCollection.tsv", sep="\t", header=None)
data.columns = ['label', 'body_text']
data.head()

Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


## Clean the dataset
- Remove punctuation
- Tokenize the sentences (split them into separate words)
- Remove stopwords (very common words that carry little meaning)

In [16]:
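def clean_text(text):
    # Drop punctuation characters, then split on runs of non-word characters
    text = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r"\W+", text)
    # Keep only the tokens that are not English stopwords
    return [word for word in tokens if word not in stopwords]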

In [17]:
data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))
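
To see what each cleaning step does to a single message, the sketch below runs the steps one at a time and keeps the intermediate results. It uses a tiny hand-picked stopword set (`TOY_STOPWORDS`, an assumption for this example) instead of the NLTK list so that it runs without downloading a corpus, and unlike `clean_text` it also drops empty tokens.

```python
import re
import string

# Hypothetical mini stopword set; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {
    "i", "have", "a", "on", "with", "will",
    "was", "in", "for", "that", "so", "any", "other",
}


def clean_text_steps(text, stopwords=TOY_STOPWORDS):
    """Return the intermediate result of each cleaning step for one message."""
    lowered = text.lower()
    # Step 1: remove punctuation character by character
    no_punct = "".join(ch for ch in lowered if ch not in string.punctuation)
    # Step 2: tokenize on runs of non-word characters
    tokens = re.split(r"\W+", no_punct)
    # Step 3: drop stopwords (and empty strings left over from the split)
    kept = [tok for tok in tokens if tok and tok not in stopwords]
    return {"no_punct": no_punct, "tokens": tokens, "kept": kept}


steps = clean_text_steps("I HAVE A DATE ON SUNDAY WITH WILL!!")
print(steps["no_punct"])  # i have a date on sunday with will
print(steps["kept"])      # ['date', 'sunday']

# Deleting punctuation outright glues "So...any" into the single token "soany".
print(clean_text_steps("Pity, * was in mood for that. So...any other suggestions?")["kept"])
# ['pity', 'mood', 'soany', 'suggestions']
```

If glued tokens such as `soany` matter for the task, replacing punctuation with a space instead of deleting it is one possible alternative.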

In [18]:
data

Unnamed: 0,label,body_text,body_text_nostop
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]"
...,...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy...,"[2nd, time, tried, 2, contact, u, u, 750, pound, prize, 2, claim, easy, call, 087187272008, now1..."
5564,ham,Will ü b going to esplanade fr home?,"[ü, b, going, esplanade, fr, home]"
5565,ham,"Pity, * was in mood for that. So...any other suggestions?","[pity, mood, soany, suggestions]"
5566,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week ...,"[guy, bitching, acted, like, id, interested, buying, something, else, next, week, gave, us, free]"


### Stem text

In [19]:
def stemming(tokenized_text):
    # Stem every token with the Porter stemmer created earlier
    return [ps.stem(word) for word in tokenized_text]

data['stemmed'] = data['body_text_nostop'].apply(stemming)

In [20]:
data

Unnamed: 0,label,body_text,body_text_nostop,stemmed
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]"
...,...,...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy...,"[2nd, time, tried, 2, contact, u, u, 750, pound, prize, 2, claim, easy, call, 087187272008, now1...","[2nd, time, tri, 2, contact, u, u, 750, pound, prize, 2, claim, easi, call, 087187272008, now1, ..."
5564,ham,Will ü b going to esplanade fr home?,"[ü, b, going, esplanade, fr, home]","[ü, b, go, esplanad, fr, home]"
5565,ham,"Pity, * was in mood for that. So...any other suggestions?","[pity, mood, soany, suggestions]","[piti, mood, soani, suggest]"
5566,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week ...,"[guy, bitching, acted, like, id, interested, buying, something, else, next, week, gave, us, free]","[guy, bitch, act, like, id, interest, buy, someth, els, next, week, gave, us, free]"


## Lemmatizing

__Definition__: the process of grouping together the inflected forms of a word so they can be analyzed as a single term, identified by the word's lemma.

In other words: lemmatizing *uses vocabulary analysis of words to remove inflectional endings and return the dictionary form of a word.*

__Example__: typed, typing, types -> type

### Comparison: Lemmatizing vs. Stemming

Stemming: simply chops off the end of a word using __heuristics__, without any understanding of the context in which a word is used.

Lemmatizing: typically more accurate, as it uses a more informed analysis to create groups of words with similar meaning based on the context around the word. It is also more computationally expensive.
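
The difference can be sketched with two toy functions: one chops suffixes by rule, the other looks words up in a dictionary. Both `toy_stem` and `toy_lemmatize`, and the hand-written `TOY_LEMMAS` table, are assumptions made for this illustration; a real lemmatizer consults a full lexical database instead.

```python
# Hypothetical, hand-written lemma table used only for this illustration.
TOY_LEMMAS = {"geese": "goose", "typed": "type", "typing": "type", "types": "type"}


def toy_stem(word):
    """Heuristic: chop a common suffix without checking that the result is a real word."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Lookup: return the dictionary form if it is known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ("typed", "typing", "types", "geese"):
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The heuristic maps the three `type` forms to the non-word `typ` and leaves the irregular plural `geese` alone, while the lookup returns `type` and `goose` at the cost of needing the table.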

## Using a Lemmatizer

### Test out WordNet lemmatizer
> __WordNet__ is a lexical database of English nouns, verbs, adjectives, and adverbs grouped into sets of synonyms. [more info](https://wordnet.princeton.edu/)

In [21]:
import nltk

wn = nltk.WordNetLemmatizer()
ps = nltk.PorterStemmer()

In [31]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/chuong/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [24]:
dir(wn)  # mostly use `lemmatize`

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 'lemmatize',
 'unicode_repr']

In [36]:
#First example
print('Incorrectly stemming')
print(f"{ps.stem('meanness')} vs {ps.stem('meaning')}")

print('Correctly lemmatize')
print(f"{wn.lemmatize('meanness')} vs {wn.lemmatize('meaning')}")

print("\n")

#Second example
print('Incorrectly stemming')
print(f"{ps.stem('goose')} vs {ps.stem('geese')}")


print('Correctly lemmatize')
print(f"{wn.lemmatize('goose')} vs {wn.lemmatize('geese')}")

Incorrectly stemming
mean vs mean
Correctly lemmatize
meanness vs meaning


Incorrectly stemming
goos vs gees
Correctly lemmatize
goose vs goose


## Apply to the SMS dataset

In [38]:
import pandas as pd
# for regular expressions
import re
import string

pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')

data = pd.read_csv("../dataset/SMSSpamCollection.tsv", sep="\t", header=None)
data.columns = ['label', 'body_text']

data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))

data['stemmed'] = data['body_text_nostop'].apply(stemming)

In [39]:
data

Unnamed: 0,label,body_text,body_text_nostop,stemmed
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]"
...,...,...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy...,"[2nd, time, tried, 2, contact, u, u, 750, pound, prize, 2, claim, easy, call, 087187272008, now1...","[2nd, time, tri, 2, contact, u, u, 750, pound, prize, 2, claim, easi, call, 087187272008, now1, ..."
5564,ham,Will ü b going to esplanade fr home?,"[ü, b, going, esplanade, fr, home]","[ü, b, go, esplanad, fr, home]"
5565,ham,"Pity, * was in mood for that. So...any other suggestions?","[pity, mood, soany, suggestions]","[piti, mood, soani, suggest]"
5566,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week ...,"[guy, bitching, acted, like, id, interested, buying, something, else, next, week, gave, us, free]","[guy, bitch, act, like, id, interest, buy, someth, els, next, week, gave, us, free]"


### Lemmatize text


In [41]:
def lemmatize(tokenized_text):
    # Lemmatize every token with the WordNet lemmatizer created earlier
    return [wn.lemmatize(word) for word in tokenized_text]

data['lemmatized'] = data['body_text_nostop'].apply(lemmatize)

In [42]:
data

Unnamed: 0,label,body_text,body_text_nostop,stemmed,lemmatized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,"[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom...","[ive, search, right, word, thank, breather, promis, wont, take, help, grant, fulfil, promis, won...","[ive, searching, right, word, thank, breather, promise, wont, take, help, granted, fulfil, promi..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]","[nah, dont, think, go, usf, life, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,"[even, brother, like, speak, treat, like, aids, patent]","[even, brother, like, speak, treat, like, aid, patent]","[even, brother, like, speak, treat, like, aid, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,"[date, sunday]","[date, sunday]","[date, sunday]"
...,...,...,...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy...,"[2nd, time, tried, 2, contact, u, u, 750, pound, prize, 2, claim, easy, call, 087187272008, now1...","[2nd, time, tri, 2, contact, u, u, 750, pound, prize, 2, claim, easi, call, 087187272008, now1, ...","[2nd, time, tried, 2, contact, u, u, 750, pound, prize, 2, claim, easy, call, 087187272008, now1..."
5564,ham,Will ü b going to esplanade fr home?,"[ü, b, going, esplanade, fr, home]","[ü, b, go, esplanad, fr, home]","[ü, b, going, esplanade, fr, home]"
5565,ham,"Pity, * was in mood for that. So...any other suggestions?","[pity, mood, soany, suggestions]","[piti, mood, soani, suggest]","[pity, mood, soany, suggestion]"
5566,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week ...,"[guy, bitching, acted, like, id, interested, buying, something, else, next, week, gave, us, free]","[guy, bitch, act, like, id, interest, buy, someth, els, next, week, gave, us, free]","[guy, bitching, acted, like, id, interested, buying, something, else, next, week, gave, u, free]"
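
In the `lemmatized` column, verb forms such as `searching` and `going` come back unchanged. A likely reason is that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed. The sketch below shows one way to supply it from Penn Treebank tags; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names for this example, not NLTK functions.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs, passing the mapped part of speech along."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

With the notebook's `wn` object, a call could look like `lemmatize_tagged(nltk.pos_tag(tokens), wn)`; note that `nltk.pos_tag` needs its own tagger model downloaded first, and that tagging adds to the computational cost mentioned earlier.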
