# Text Processing: Stemming and Lemmatization

*Original Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset/home*

This exercise uses 100 SMS messages that have been parsed and categorized as "spam" or "ham". The dataframe also contains the original text of each message. We have converted the dataframe into a dictionary for this exercise (execute the first two cells).

The dictionary has 100 entries, keyed from 0 to 99. Each value is itself a dictionary with two string fields, `class` and `text`. `class` contains either "spam" or "ham", based on the category of the SMS, and `text` contains the original text message.

In [2]:
import pandas as pd

df = pd.read_csv("/dsa/data/DSA-8410/spam.csv", encoding='latin1')
mini_df = df[['v1', 'v2']][:100]
mini_df.columns = ['class', 'text']

mini_df.to_csv('messages.csv', index=False)

In [3]:
df = pd.read_csv('messages.csv')
msgs = df.T.to_dict()
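
To make the shape of the dictionary concrete, here is a small sketch on a two-row stand-in DataFrame. The rows are invented for illustration and are not taken from the dataset. Transposing first makes the row index the outer key and the column names the inner keys; `to_dict(orient="index")` should give the same structure without the transpose.

```python
import pandas as pd

# Two made-up rows standing in for messages.csv (illustrative only)
toy_df = pd.DataFrame({"class": ["ham", "spam"],
                       "text": ["see you soon", "WIN a prize now"]})

# Transpose, then convert: outer keys are row indices, inner keys are column names
toy_msgs = toy_df.T.to_dict()
print(toy_msgs[0])

# The same structure without transposing
print(toy_msgs == toy_df.to_dict(orient="index"))
```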

**Task 1.** Create a list of strings from this dictionary with the `text` values, and convert all of the strings into lowercase. Print out the first five (5) items from your list.

In [4]:
# Your code goes here
#---------------------

text_list = [val['text'] for key, val in msgs.items() if 'text' in val]

text_list = [each_string.lower() for each_string in text_list]

print(text_list[0:5])

['go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...', 'ok lar... joking wif u oni...', "free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005. text fa to 87121 to receive entry question(std txt rate)t&c's apply 08452810075over18's", 'u dun say so early hor... u c already then say...', "nah i don't think he goes to usf, he lives around here though"]


**Task 2.** Use the `nltk` package's tokenize functionality on each of the strings in your list. The result should be a list of lists. Print out the first five (5) items from your list.

In [5]:
# Your code goes here
#---------------------

from nltk import word_tokenize

tokens = [word_tokenize(i) for i in text_list]

print(tokens[0:5])

[['go', 'until', 'jurong', 'point', ',', 'crazy..', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'cine', 'there', 'got', 'amore', 'wat', '...'], ['ok', 'lar', '...', 'joking', 'wif', 'u', 'oni', '...'], ['free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'fa', 'cup', 'final', 'tkts', '21st', 'may', '2005.', 'text', 'fa', 'to', '87121', 'to', 'receive', 'entry', 'question', '(', 'std', 'txt', 'rate', ')', 't', '&', 'c', "'s", 'apply', '08452810075over18', "'s"], ['u', 'dun', 'say', 'so', 'early', 'hor', '...', 'u', 'c', 'already', 'then', 'say', '...'], ['nah', 'i', 'do', "n't", 'think', 'he', 'goes', 'to', 'usf', ',', 'he', 'lives', 'around', 'here', 'though']]


**Task 3.** Remove the stopwords, punctuation, and numbers from your list (list of lists). Tokens made of punctuation or numbers can be detected with a string method such as `isalpha()`, called on the token (alternatively, characters can be compared against the constant `string.punctuation`). If the result is false, you can remove that particular string from the list.

In [6]:
# Your code goes here
#---------------------
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = stopwords.words("english")

# Keep only alphabetic tokens that are not stopwords.
# isalpha() seemed like it made more sense than string.punctuation:
# it drops punctuation, numbers, and mixed tokens in one check.
for i in range(len(tokens)):
    tokens[i] = [word for word in tokens[i]
                 if word.isalpha() and word not in stop_words]

print(tokens[0:10])

[['go', 'jurong', 'point', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amore', 'wat'], ['ok', 'lar', 'joking', 'wif', 'u', 'oni'], ['free', 'entry', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', 'may', 'text', 'fa', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'c', 'apply'], ['u', 'dun', 'say', 'early', 'hor', 'u', 'c', 'already', 'say'], ['nah', 'think', 'goes', 'usf', 'lives', 'around', 'though'], ['freemsg', 'hey', 'darling', 'week', 'word', 'back', 'like', 'fun', 'still', 'tb', 'ok', 'xxx', 'std', 'chgs', 'send', 'rcv'], ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aids', 'patent'], ['per', 'request', 'melle', 'oru', 'minnaminunginte', 'nurungu', 'vettam', 'set', 'callertune', 'callers', 'press', 'copy', 'friends', 'callertune'], ['winner', 'valued', 'network', 'customer', 'selected', 'receivea', 'prize', 'reward', 'claim', 'call', 'claim', 'code', 'valid', 'hours'], ['mobile', 'months', 'u', 'r', 'entitled', 'update',

[nltk_data] Downloading package stopwords to /home/dcphw2/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
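
As a side note on the filter used in this task, the sketch below contrasts `str.isalpha()` with a check against the `string.punctuation` constant, using a few tokens from the Task 2 output. `is_punct` is a hypothetical helper introduced only for this illustration.

```python
import string

sample = ["point", ",", "crazy..", "2005.", "87121", "n't", "..."]

def is_punct(token):
    """Return True if every character of the token is a punctuation character."""
    return all(ch in string.punctuation for ch in token)

# Approach 1: keep purely alphabetic tokens
alpha_only = [t for t in sample if t.isalpha()]

# Approach 2: drop only tokens made entirely of punctuation or entirely of digits
no_punct_or_digits = [t for t in sample if not is_punct(t) and not t.isdigit()]

print(alpha_only)
print(no_punct_or_digits)
```

The stricter `isalpha()` filter also discards mixed tokens such as `crazy..` and `n't`, which is consistent with the cleaned output shown for this task.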


**Task 4.** Use the `nltk` package's `PorterStemmer` to stem the cleaned-text list that you got as a result of **Task 3**. Store the stemmed-word list in a new variable and keep the result from **Task 3** intact, as we will use the cleaned-text list from **Task 3** in later tasks.

In [7]:
# Your code goes here
#---------------------
from nltk.stem import PorterStemmer
porter = PorterStemmer()

stems = []

for i in range(len(tokens)):
    stems.append([porter.stem(word) for word in tokens[i]])

print(stems[:10])

[['go', 'jurong', 'point', 'avail', 'bugi', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amor', 'wat'], ['ok', 'lar', 'joke', 'wif', 'u', 'oni'], ['free', 'entri', 'wkli', 'comp', 'win', 'fa', 'cup', 'final', 'tkt', 'may', 'text', 'fa', 'receiv', 'entri', 'question', 'std', 'txt', 'rate', 'c', 'appli'], ['u', 'dun', 'say', 'earli', 'hor', 'u', 'c', 'alreadi', 'say'], ['nah', 'think', 'goe', 'usf', 'live', 'around', 'though'], ['freemsg', 'hey', 'darl', 'week', 'word', 'back', 'like', 'fun', 'still', 'tb', 'ok', 'xxx', 'std', 'chg', 'send', 'rcv'], ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aid', 'patent'], ['per', 'request', 'mell', 'oru', 'minnaminungint', 'nurungu', 'vettam', 'set', 'callertun', 'caller', 'press', 'copi', 'friend', 'callertun'], ['winner', 'valu', 'network', 'custom', 'select', 'receivea', 'prize', 'reward', 'claim', 'call', 'claim', 'code', 'valid', 'hour'], ['mobil', 'month', 'u', 'r', 'entitl', 'updat', 'latest', 'colour', 'mobil', 'came
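
A few of the stems above can be reproduced in isolation. This sketch assumes `nltk` is installed; the Porter stemmer is rule-based, so it needs no corpus download. The input words are taken from the Task 3 output, and `sample_stems` is a name introduced here for illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Words taken from the cleaned tokens of Task 3
words = ["available", "joking", "entry", "goes"]

# Map each word to its Porter stem
sample_stems = {word: porter.stem(word) for word in words}
print(sample_stems)
```

Stems such as `avail` and `entri` are not dictionary words: the stemmer strips and rewrites suffixes by rule rather than looking words up.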

**Task 5.** Use the `nltk` package's `WordNetLemmatizer` to find the lemma (or root word) of each word in the cleaned-text list that you got as a result of **Task 3**. Treat every word as a verb. Store the lemmatized-word list in a new variable and keep the result from **Task 3** intact, as we will use the cleaned-text list from **Task 3** in later tasks. We assume every word is a verb to make the problem easier; alternatively, a POS tagger could be applied to infer each word's part of speech.

In [8]:
# Your code goes here
#---------------------
from nltk.stem import WordNetLemmatizer
wordnet = WordNetLemmatizer()

lemmas = []

for i in range(len(tokens)):
    lemmas.append([wordnet.lemmatize(word, pos="v") for word in tokens[i]])

print(lemmas[:10])

[['go', 'jurong', 'point', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'get', 'amore', 'wat'], ['ok', 'lar', 'joke', 'wif', 'u', 'oni'], ['free', 'entry', 'wkly', 'comp', 'win', 'fa', 'cup', 'final', 'tkts', 'may', 'text', 'fa', 'receive', 'entry', 'question', 'std', 'txt', 'rate', 'c', 'apply'], ['u', 'dun', 'say', 'early', 'hor', 'u', 'c', 'already', 'say'], ['nah', 'think', 'go', 'usf', 'live', 'around', 'though'], ['freemsg', 'hey', 'darling', 'week', 'word', 'back', 'like', 'fun', 'still', 'tb', 'ok', 'xxx', 'std', 'chgs', 'send', 'rcv'], ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aid', 'patent'], ['per', 'request', 'melle', 'oru', 'minnaminunginte', 'nurungu', 'vettam', 'set', 'callertune', 'callers', 'press', 'copy', 'friends', 'callertune'], ['winner', 'value', 'network', 'customer', 'select', 'receivea', 'prize', 'reward', 'claim', 'call', 'claim', 'code', 'valid', 'hours'], ['mobile', 'months', 'u', 'r', 'entitle', 'update', 'latest',
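
If a POS tagger were used instead of treating every word as a verb, the tagger's output would need to be translated into the part-of-speech codes that `WordNetLemmatizer.lemmatize` accepts. The sketch below assumes Penn Treebank style tags (the kind `nltk.pos_tag` produces) and uses a hypothetical helper, `penn_to_wordnet`. It falls back to noun, which is also the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# Hand-written (word, tag) pairs for illustration; not produced by running a tagger
tagged = [("goes", "VBZ"), ("friends", "NNS"), ("great", "JJ"), ("already", "RB")]
pos_codes = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_codes)
```

Each pair could then be passed on as `wordnet.lemmatize(word, pos=code)`.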

**Task 6.** For each lemma that we got from **Task 5**, calculate how many times it occurs across all of the messages. Sort the lemmas in descending order by total number of occurrences, and print out the top ten (10) words and their number of occurrences.

In [9]:
# Your code goes here
#---------------------

lemmas_count = pd.Series(lemmas).explode().value_counts()
lemmas_count.head(10)

u          17
call       14
get        11
go         11
like       10
free        9
sorry       8
ok          8
smile       6
already     6
dtype: int64
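
The same kind of count can be obtained with the standard library alone. This sketch uses `collections.Counter` on a small hand-made list of lists (not the real lemmas) to show the flatten-and-count step.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for the lemma list of lists
toy_lemmas = [["go", "get", "u"], ["ok", "u"], ["free", "go", "u"]]

# Flatten the list of lists, then count every lemma
toy_counts = Counter(chain.from_iterable(toy_lemmas))

# most_common(n) returns (lemma, count) pairs sorted by descending count
print(toy_counts.most_common(2))  # [('u', 3), ('go', 2)]
```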

**Task 7.** From the result we got from **Task 6**, remove all of the words with a length of 1 and select the top hundred (100) most frequent terms from it. We will use this list of words in our next task.

In [13]:
# Your code goes here
#---------------------

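# Note: this keeps lemmas that occur more than once. It does not filter on
# word length, so one-character lemmas such as 'u' remain in the result.
greater_one = lemmas_count[lemmas_count > 1]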

greater_one = greater_one.head(100)

greater_one

u            17
call         14
get          11
go           11
like         10
             ..
something     2
car           2
run           2
name          2
months        2
Length: 100, dtype: int64
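
The result above still contains `u`, because the filter compares counts rather than word lengths. One way to filter on word length, sketched here on a toy Series rather than the real counts, is the `.str.len()` accessor on the index. The names `toy_counts`, `multi_char`, and `top_terms` are introduced only for this example.

```python
import pandas as pd

# Toy counts standing in for lemmas_count (values are illustrative)
toy_counts = pd.Series({"u": 17, "call": 14, "c": 3, "go": 11, "r": 2})

# Keep only lemmas longer than one character
multi_char = toy_counts[toy_counts.index.str.len() > 1]

# Sort by count and keep at most the 100 most frequent terms
top_terms = multi_char.sort_values(ascending=False).head(100)
print(top_terms)
```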

**Task 8.** For each message (use the lemma-list we created for **Task 5**), calculate the number of times each word from **Task 7** (top-100 words) occurs in that message. 
Create a **Data-Matrix** using your calculations. Each row should correspond to a message, and each column should correspond to a word from the list we got in **Task 7**. Each cell should correspond to how many times that particular word (from column) occurs for that specific message (from row).

You can use Pandas-DataFrame to store your **Data-Matrix**. Print the first five rows of the Data-Matrix.

In [37]:
# Your code goes here
#---------------------

# One row per message, one column per word from Task 7;
# each cell is the number of times that word occurs in that message
word_counts = [[message.count(word) for word in greater_one.index]
               for message in lemmas]

df = pd.DataFrame(word_counts, columns=greater_one.index)

df.head()

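| | u | call | get | go | like | free | sorry | ok | smile | already | ... | watch | charge | cut | gram | work | something | car | run | name | months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |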
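
An alternative sketch of the same kind of data matrix counts each message once with `collections.Counter` and then looks up the vocabulary terms, instead of calling `list.count` for every word. The messages and vocabulary below are toy stand-ins for the lemma list and the Task 7 word list.

```python
from collections import Counter

import pandas as pd

# Toy stand-ins for the lemmatized messages and the selected vocabulary
toy_lemmas = [["go", "get", "go"], ["ok", "u"], ["free"]]
vocab = ["go", "ok", "free"]

rows = []
for message in toy_lemmas:
    counts = Counter(message)  # count every lemma in this message once
    # A Counter returns 0 for words that do not occur in the message
    rows.append([counts[word] for word in vocab])

# One row per message, one column per vocabulary word
matrix = pd.DataFrame(rows, columns=vocab)
print(matrix)
```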
