  **NLP- LinkedIn course**
   
   - **Armin Norouzi**
   - Part of [NLP with Python for Machine Learning Essential Training](https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training?trk=course_title&upsellOrderOrigin=default_guest_learning)
   - Compatible with Google Colaboratory- TF version 2.8.0

   
**Objective:** 
- Definition of NLP
- Tokenizing
- Vectorizing
- Recognize the outcomes of lemmatizing
- TF-IDF
- Accuracy in terms of evaluation metrics
- Ensemble methods
  


# NLP Basics and importing the data


**NLP main pipeline:**

1. **Raw text:** the model can't distinguish words
2. **Tokenize:** tell the model what to look at
3. **Clean text:** remove stop words/punctuation, stemming, etc.
4. **Vectorize:** convert to numeric form
5. **Feature Engineering**
6. **Machine Learning algorithm:** fit/train model
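
The first four steps above can be sketched on toy data. This is only a minimal illustration using the standard library; the two sample messages, the tiny `STOPWORDS` set, and the helper name `tokenize_and_clean` are assumptions made for the example and are not part of the course material.

```python
import re
import string
from collections import Counter

# Hypothetical toy inputs (not taken from the SMS dataset).
raw_texts = ["Free entry to win a prize!", "Are you free to talk now?"]
STOPWORDS = {"a", "to", "are", "you"}  # tiny illustrative stopword set

def tokenize_and_clean(text):
    # Lowercase, strip punctuation, split on whitespace, drop stopwords.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return [tok for tok in re.split(r"\s+", text) if tok and tok not in STOPWORDS]

tokens = [tokenize_and_clean(t) for t in raw_texts]

# Vectorize: one count per vocabulary word, per document.
vocab = sorted({tok for doc in tokens for tok in doc})
vectors = [[Counter(doc)[word] for word in vocab] for doc in tokens]

print(vocab)
print(vectors)
```

Each row of `vectors` is the numeric form of one message; feature engineering and model fitting would start from there.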

## Importing Text data

Importing the NLTK library

In [9]:
import nltk

In [11]:
from nltk.corpus import stopwords
nltk.download('stopwords') #we need to download each package in order to use it

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [12]:
stopwords.words('english')[0:500:10]

['i',
 "you've",
 'himself',
 'they',
 'that',
 'been',
 'a',
 'while',
 'through',
 'in',
 'here',
 'few',
 'own',
 'just',
 're',
 'doesn',
 'ma',
 "shouldn't"]

### Importing unstructured data

#### Importing the dataset the hard way

In [25]:
import urllib.request # the request submodule must be imported explicitly
response = urllib.request.urlopen("https://raw.githubusercontent.com/jekyll/classifier-reborn/master/test/data/corpus/SMSSpamCollection.tsv")
rawData = response.read()
rawData=rawData.decode("utf-8") 

In [26]:
rawData[0:500]

"ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\nham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tU dun say so early hor... U c already then say...\nham\tNah I don't think he goes to usf, he lives around here though\nspam\tFreeMsg Hey there darling it's been 3 week's now and no word bac"

Parse the data: replace '\t' with '\n', then split on '\n'.

The output is a flat list in which the labels sit at even indices (0, 2, 4, ...) and the message texts at odd indices.

In [32]:
parsedData = rawData.replace('\t', '\n').split('\n')
parsedData[0:5]

['ham',
 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 'ham',
 'Ok lar... Joking wif u oni...',
 'spam']

Separate the labels from the body text

In [49]:
labelList = parsedData[0::2]
textList = parsedData[1::2]

In [50]:
print(labelList[0:5])
print(textList[0:5])

['ham', 'ham', 'spam', 'ham', 'ham']
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'U dun say so early hor... U c already then say...', "Nah I don't think he goes to usf, he lives around here though"]


In [51]:
print(len(textList))
print(len(labelList))

5574
5575


In [52]:
labelList[-5:]

['ham', 'ham', 'ham', 'ham', '']

In [53]:
textList[-5:]

['This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.',
 'Will ü b going to esplanade fr home?',
 'Pity, * was in mood for that. So...any other suggestions?',
 "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free",
 'Rofl. Its true to its name']

In [54]:
# remove the last label: it is an extra empty string
labelList = labelList[:-1]

In [55]:
print(len(textList))
print(len(labelList))

5574
5574
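
The extra empty label most likely comes from a newline at the end of the file: splitting on '\n' then leaves a trailing empty string. The sketch below reproduces this on a hypothetical three-message sample and shows one alternative that parses line by line. The names `sample`, `flat`, and `rows` are illustrative only.

```python
# Hypothetical miniature of the TSV file, ending with a newline.
sample = "ham\tOk lar...\nspam\tFree entry\nham\tSee you\n"

# The replace-and-split approach leaves a trailing empty string.
flat = sample.replace("\t", "\n").split("\n")
print(flat[-1] == "")

# splitlines() ignores the final newline; split each line once on the tab.
rows = [line.split("\t", 1) for line in sample.splitlines()]
labels = [label for label, _ in rows]
texts = [text for _, text in rows]
print(labels)
print(texts)
```

With this approach the two lists have the same length by construction, so no manual trimming is needed.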


**Create a pandas dataframe based on parsed data**

In [56]:
import pandas as pd

fullCorpus = pd.DataFrame({
    'label' : labelList,
    'body_list' : textList
})

fullCorpus.head()

Unnamed: 0,label,body_list
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


#### Using Pandas to read data 

In [62]:
dataset = pd.read_csv('https://raw.githubusercontent.com/jekyll/classifier-reborn/master/test/data/corpus/SMSSpamCollection.tsv', sep = '\t', header = None) 
# without header = None, it will add first row as header
dataset.columns = ['label', 'body_text']

In [63]:
dataset.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Explore dataset

In [64]:
# Shape of the inputs

print('Input data has {} rows and {} columns'.format(len(dataset), len(dataset.columns)))

Input data has 5572 rows and 2 columns


In [67]:
# How many spam/ham are there?

print('Out of {} rows, {} are spam, {} are ham'.format(len(dataset), 
                                                        len(dataset[dataset['label'] == 'spam']),
                                                        len(dataset[dataset['label'] == 'ham'])))

Out of 5572 rows, 747 are spam, 4825 are ham
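
These counts show that the classes are imbalanced, which matters when accuracy is interpreted later. A small sketch of the arithmetic, using only the counts reported above (the variable names are illustrative):

```python
# Counts taken from the output above.
n_spam, n_ham = 747, 4825
total = n_spam + n_ham

spam_share = n_spam / total
ham_share = n_ham / total
print(f"spam: {spam_share:.1%}, ham: {ham_share:.1%}")
```

A classifier that always predicts ham would already be right for the ham share of the rows, so accuracy alone can look better than the model really is.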


In [70]:
# How much data is missing?

print('Number of null in label: {}'.format(dataset['label'].isnull().sum()))
print('Number of null in body_text: {}'.format(dataset['body_text'].isnull().sum()))

Number of null in label: 0
Number of null in body_text: 0


### Regular Expressions
A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

Use cases:

- Confirming passwords meet criteria
- Searching a URL for some substring
- Searching for files on your computer
- Document scraping
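
As an illustration of the first use case, here is a minimal sketch of a password check. The policy (at least 8 characters with a lowercase letter, an uppercase letter, and a digit) and the names `PASSWORD_RE` and `meets_criteria` are assumptions for the example.

```python
import re

# Assumed policy for illustration: at least 8 characters,
# with at least one lowercase letter, one uppercase letter, and one digit.
PASSWORD_RE = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}")

def meets_criteria(password):
    # fullmatch requires the whole string to satisfy the pattern.
    return PASSWORD_RE.fullmatch(password) is not None

print(meets_criteria("Passw0rdOK"))
print(meets_criteria("password"))
```

Each `(?=...)` is a lookahead that checks one requirement without consuming characters, so the requirements can appear in any order.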

In [86]:
import re # Regular expression operations

In [72]:
re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This      is a made up     string to test 2    different regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2\"\"\"\"\"\"different~regex-methods'

Splitting a sentence into a list of words

In [75]:
re.split(r'\s', re_test) # \s matches a single whitespace character

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [76]:
re.split(r'\s', re_test_messy)

['This',
 '',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 '',
 'string',
 'to',
 'test',
 '2',
 '',
 '',
 '',
 'different',
 'regex',
 'methods']

Because of the extra whitespace, the list contains empty elements.

In [77]:
re.split(r'\s+', re_test_messy) # \s+ matches one or more whitespace characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [78]:
re.split(r'\s+', re_test_messy1)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

This pattern cannot split the string where the words are divided by special characters!

In [79]:
re.split(r'\W+', re_test_messy1) # \W+ splits on one or more non-word characters, e.g., space, slash, quote

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [82]:
re.findall(r'\S+', re_test_messy) # \S+ matches one or more non-whitespace characters

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [83]:
re.findall(r'\w+', re_test_messy1) # \w+ matches one or more word characters, i.e., it finds the tokens

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

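- Regexes are very useful for tokenizing.

'\w' and '\W' --> match word and non-word characters

'\s' and '\S' --> match whitespace and non-whitespace characters

The '\w'/'\W' patterns suit us better because they return separated words even when the text contains punctuation or other special characters.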
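
One difference between splitting on `\W+` and finding `\w+` shows up when the text starts or ends with punctuation. A small sketch on a made-up string (`text`, `split_tokens`, and `found_tokens` are illustrative names):

```python
import re

text = "...Hello, world!"

# Splitting on non-word runs leaves empty strings at the edges
# when the text starts or ends with punctuation.
split_tokens = re.split(r"\W+", text)
print(split_tokens)   # ['', 'Hello', 'world', '']

# Matching word-character runs returns only the tokens.
found_tokens = re.findall(r"\w+", text)
print(found_tokens)   # ['Hello', 'world']
```

If `re.split` is used for tokenizing, the empty strings may need to be filtered out afterwards.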

#### Replacement using regexes

In [91]:
# The goal is to find PEP8, or a similar-looking word, and replace it with PEP8
pep_test1 = 'I try to follow PEP8 guidlines'
pep_test2 = 'I try to follow PEEP8 guidlines'
pep_test3 = 'I try to follow PEP7 guidlines'

In [92]:
re.findall('[a-z]+', pep_test1) # only lowercase a to z

['try', 'to', 'follow', 'guidlines']

In [94]:
re.findall('[A-Z]+', pep_test1) # only uppercase A to Z, without digits

['I', 'PEP']

In [95]:
re.findall('[A-Z]+[0-9]+', pep_test1) # uppercase A to Z followed by digits

['PEP8']

In [98]:
# Now use the same pattern to replace the misspelled PEP8 in the other sentences
print(pep_test2)
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Syleguid', pep_test2) # uppercase A to Z followed by digits

I try to follow PEEP8 guidlines


'I try to follow PEP8 Python Syleguid guidlines'

In [99]:
print(pep_test3)
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Syleguid', pep_test3) # uppercase A to Z followed by digits

I try to follow PEP7 guidlines


'I try to follow PEP8 Python Syleguid guidlines'
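
The pattern `[A-Z]+[0-9]+` is broad: it also matches any other run of capitals followed by digits. A possible tightening, sketched under the assumption that the misspellings only vary in the number of E's and in the digit, is shown below. The pattern, the example sentences, and the names `pep_pattern`, `examples`, and `fixed` are hypothetical.

```python
import re

# Assumed pattern: 'P', one or more 'E', 'P', then a single digit,
# delimited by word boundaries.
pep_pattern = re.compile(r"\bPE+P[0-9]\b")

examples = [
    "I try to follow PEEP8 guidelines",
    "I try to follow PEP7 guidelines",
    "I saved the song as an MP3 file",
]
fixed = [pep_pattern.sub("PEP8", s) for s in examples]
for s in fixed:
    print(s)
```

The third sentence is left untouched here, whereas the broader pattern would have rewritten `MP3` as well.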

# Preprocessing text data
Cleaning up the text data is necessary to highlight the attributes that you want your ML system to pick up on. Cleaning the data typically consists of a number of steps:

1. Remove punctuation
2. Tokenize
3. Remove stopwords
4. Lemmatize/stem (advanced)



In [107]:
pd.set_option('display.max_colwidth',100) # to show more width in pandas dataframe

dataset.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


## Removing punctuation

In [108]:
import string # list of punctuations
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [113]:
# Write a function to remove punctuation

def remove_punct(text):
  # ''.join(...) glues the kept characters back into one string; without it we would get a list of single characters
  text_nopunc = ''.join([char for char in text if char not in string.punctuation])
  return text_nopunc

dataset['body_text_cleaned'] = dataset['body_text'].apply(lambda x: remove_punct(x))
dataset.head()

Unnamed: 0,label,body_text,body_text_cleaned
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
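
An alternative way to strip punctuation is `str.translate` with a deletion table, which avoids the per-character list comprehension. This is a sketch of an equivalent approach, not the course's method; `PUNCT_TABLE` and `remove_punct_translate` are names introduced for the example.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text):
    # Hypothetical alternative to remove_punct; same result for ASCII punctuation.
    return text.translate(PUNCT_TABLE)

print(remove_punct_translate("Ok lar... Joking wif u oni..."))
```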


## Tokenization

In [117]:
import re

def tokonize(text):
  tokens = re.split(r'\W+', text) # split on runs of non-word characters
  return tokens

dataset['body_text_cleaned_tokonized'] = dataset['body_text_cleaned'].apply(lambda x: tokonize(x.lower())) # to make all text lowercase
dataset.head()

Unnamed: 0,label,body_text,body_text_cleaned,body_text_cleaned_tokonized
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"


## Remove Stopwords

In [125]:
import nltk

from nltk.corpus import stopwords
nltk.download('stopwords') #we need to download each package in order to use it

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [128]:
stopwords.words('english')[0:5]

['i', 'me', 'my', 'myself', 'we']

In [129]:
stopwords_eng = stopwords.words('english')

In [130]:
def remove_stopwords(tokonized_text):
  text = [word for word in tokonized_text if word not in stopwords_eng]
  return text
  
dataset['body_text_cleaned_tokonized_nonstop'] = dataset['body_text_cleaned_tokonized'].apply(lambda x: remove_stopwords(x)) # drop stopwords from each token list
dataset.head()

Unnamed: 0,label,body_text,body_text_cleaned,body_text_cleaned_tokonized,body_text_cleaned_tokonized_nonstop
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, ci...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
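
Checking `word not in some_list` scans the list for every token. Converting the stopword list to a set once makes each lookup constant time on average, which can help on larger corpora. A minimal sketch, with a short hypothetical list standing in for `stopwords.words('english')`:

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is longer.
stopword_list = ["i", "he", "to", "so", "then", "here"]
stopword_set = set(stopword_list)  # set membership is O(1) on average

def remove_stopwords_fast(tokens):
    # Keep only the tokens that are not in the stopword set.
    return [word for word in tokens if word not in stopword_set]

tokens = ["nah", "i", "dont", "think", "he", "goes", "to", "usf"]
print(remove_stopwords_fast(tokens))
```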


## Lemmatize/stem

### Stemming

- Process of reducing inflected (or sometimes derived) words to their word stem or root

- Crudely chopping off the end of the word to leave only the base

e.g.: Stemming/stemmed ---> stem


*Stemmers:*
1. **Porter Stemmer:** The Porter Stemmer, or Porter algorithm, was developed by Martin Porter in 1980. The algorithm applies five phases of word reduction, each with its own set of mapping rules. It is one of the oldest stemmers and is known for its simplicity and speed. The resulting stem is often a shorter word with the same root meaning.
2. **Snowball Stemmer:** Also known as the “English Stemmer” or “Porter2 Stemmer”, this algorithm is more accurate and offers a slight improvement over the original Porter Stemmer, both in logic and speed.
3. **Lancaster Stemmer:** The Lancaster Stemmer is simple but tends to over-stem. Over-stemming produces stems that are not linguistic units or that carry no meaning.
4. **Regex-Based Stemmer:** Regex stemmer uses regular expressions to identify morphological affixes. Any substrings that match the regular expressions will be removed.
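
To make the idea of "chopping off the end of the word" concrete, here is a toy regex-based stemmer written with the standard library only. It is independent of NLTK; the suffix list, the length threshold, and the name `toy_stem` are assumptions chosen for illustration.

```python
import re

# Assumed suffix list for illustration; real stemmers use far richer rules.
SUFFIX_RE = re.compile(r"(ing|ed|s)$")

def toy_stem(word):
    # Only strip a suffix from words longer than four characters.
    if len(word) > 4:
        return SUFFIX_RE.sub("", word)
    return word

for w in ["grows", "growing", "stemmed", "goes"]:
    print(w, "->", toy_stem(w))
```

The result for `stemmed` is `stemm`, which shows how crude suffix stripping can yield stems that are not real words.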



In [140]:
ps = nltk.PorterStemmer()

In [141]:
dir(ps)

['MARTIN_EXTENSIONS',
 'NLTK_EXTENSIONS',
 'ORIGINAL_ALGORITHM',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_abc_impl',
 '_apply_rule_list',
 '_contains_vowel',
 '_ends_cvc',
 '_ends_double_consonant',
 '_has_positive_measure',
 '_is_consonant',
 '_measure',
 '_replace_suffix',
 '_step1a',
 '_step1b',
 '_step1c',
 '_step2',
 '_step3',
 '_step4',
 '_step5a',
 '_step5b',
 'mode',
 'pool',
 'stem',
 'unicode_repr',
 'vowels']

In [143]:
print(ps.stem('grows'))
print(ps.stem('growing'))
print(ps.stem('grow'))

grow
grow
grow


In [144]:
print(ps.stem('run'))
print(ps.stem('running'))
print(ps.stem('runner'))

run
run
runner


- Put together all the functions we had before


In [146]:
# Remove punctuation, tokenize, and remove stopwords in one function

def clean_text(text):
  text = ''.join([char for char in text if char not in string.punctuation])
  tokens = re.split(r'\W+', text)
  text = [word for word in tokens if word not in stopwords_eng]
  return text



In [147]:
data = pd.read_csv('https://raw.githubusercontent.com/jekyll/classifier-reborn/master/test/data/corpus/SMSSpamCollection.tsv', sep = '\t', header = None) 
# without header = None, it will add first row as header
data.columns = ['label', 'body_text']


data['body_text_cleaned'] = data['body_text'].apply(lambda x: clean_text(x.lower()))
data.head()

Unnamed: 0,label,body_text,body_text_cleaned
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]"


In [149]:
def stemming(tokonized_text):
  text = [ps.stem(word) for word in tokonized_text]
  return text


data['body_text_stemmed'] = data['body_text_cleaned'].apply(lambda x: stemming(x))
data.head()

Unnamed: 0,label,body_text,body_text_cleaned,body_text_stemmed
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]","[go, jurong, point, crazi, avail, bugi, n, great, world, la, e, buffet, cine, got, amor, wat]"
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"


Stemming helps to reduce the corpus of words that models are exposed to and explicitly correlates words with similar meaning. 

### Lemmatization

Lemmatization (or lemmatisation) in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.



Stemming vs lemmatizing

Stemming is typically faster as it simply chops off the end of the word using heuristics without understanding the context in which a word is used.

Lemmatizing is typically more accurate as it uses more informed analysis to create a group of words with similar meanings based on the context around the word.
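
The contrast can be illustrated with a toy dictionary-based lemmatizer, where the lemma depends on the word and on its part of speech. This is a simplified sketch, not how any particular library works internally; the `LEMMAS` mini-lexicon and `toy_lemmatize` are hypothetical.

```python
# Hypothetical mini-lexicon mapping (word, part of speech) to a lemma.
LEMMAS = {
    ("geese", "n"): "goose",
    ("lives", "n"): "life",
    ("lives", "v"): "live",
    ("meaning", "n"): "meaning",
    ("meaning", "v"): "mean",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the lexicon has no entry.
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("geese"))
print(toy_lemmatize("lives", pos="n"))
print(toy_lemmatize("lives", pos="v"))
```

A suffix-stripping stemmer has no way to map `geese` to `goose` or to tell the noun `lives` from the verb, which is where the dictionary lookup earns its extra cost.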

In [160]:
# WordNet lemmatizer

wn = nltk.WordNetLemmatizer()

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [161]:
dir(wn)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 'lemmatize',
 'unicode_repr']

In [162]:
print(ps.stem('meanness'))
print(ps.stem('meaning'))

mean
mean


In [163]:
print(wn.lemmatize('meanness'))
print(wn.lemmatize('meaning'))

meanness
meaning


Lemmatizing can distinguish "meanness" from "meaning", while stemming chops both down to "mean"!

In [164]:
print(ps.stem('goose'))
print(ps.stem('geese'))

goos
gees


In [165]:
print(wn.lemmatize('goose'))
print(wn.lemmatize('geese'))

goose
goose


Let's apply it to our dataset

In [167]:
def lemmatizing(tokonized_text):
  text = [wn.lemmatize(word) for word in tokonized_text]
  return text


data['body_text_lemmatized'] = data['body_text_cleaned'].apply(lambda x: lemmatizing(x))
data.head()

Unnamed: 0,label,body_text,body_text_cleaned,body_text_stemmed,body_text_lemmatized
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]","[go, jurong, point, crazi, avail, bugi, n, great, world, la, e, buffet, cine, got, amor, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]","[nah, dont, think, go, usf, life, around, though]"


# Vectorizing Raw Data