# Natural Language Processing Basics

The field concerned with a computer's ability to understand, analyse, manipulate and potentially generate human language.

Examples: email filters, auto-complete systems, auto-correct, sentiment analysis, topic modelling, text classification, sentence segmentation and part-of-speech tagging.

## NL toolkit (using nltk)

If you do not have `nltk` installed, you can install it with `!pip install -U nltk`.

In [1]:
import nltk 

You can also list the names available in the `nltk` package by typing `dir(nltk)`.

## Explore `nltk` with a first example: `stopwords`

__NB__: stopwords are words that carry little meaning on their own (of, by, you, I, etc.)


In [3]:
from nltk.corpus import stopwords
try:
    stopwords.words('english')
except LookupError:
    # The stopwords corpus is not installed yet: download it once
    nltk.download('stopwords')

# Show every 25th stopword as a sample
stopwords.words('english')[0:500:25]

## Structured Data vs. Unstructured Data

- Most text data lacks the formal structure of numeric data

> 80% of business-relevant information originates in unstructured form, primarily text

What makes a file unstructured: 

- Binary data
- No delimiters
- No indication of rows

## How does Python see raw text data, and why do we need to clean the text?

### Read in semi-structured text data

#### Two ways of reading `SMSSpamCollection.tsv` files

1. Read raw, then convert to `pandas dataframe`.

In [3]:
# Read in the raw text (the with-block closes the file afterwards)
with open("../dataset/SMSSpamCollection.tsv") as f:
    rawData = f.read()

In [4]:
rawData[0:500]

"ham\tI've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.\nspam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nham\tNah I don't think he goes to usf, he lives around here though\nham\tEven my brother is not like to speak with me. They treat me like aid"

In [5]:
parsedData = rawData.replace('\t', '\n').split('\n')

In [6]:
parsedData[0:20]

['ham',
 "I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.",
 'spam',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'ham',
 "Nah I don't think he goes to usf, he lives around here though",
 'ham',
 'Even my brother is not like to speak with me. They treat me like aids patent.',
 'ham',
 'I HAVE A DATE ON SUNDAY WITH WILL!!',
 'ham',
 "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune",
 'spam',
 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.',
 'spam',
 'Had your mobile 11 months or more? U R entitled to Upda

In [7]:
labelList = parsedData[0::2]
textList = parsedData[1::2]

In [8]:
print(labelList[0:5])
print(textList[0:5])

['ham', 'spam', 'ham', 'ham', 'ham']
["I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.", "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", "Nah I don't think he goes to usf, he lives around here though", 'Even my brother is not like to speak with me. They treat me like aids patent.', 'I HAVE A DATE ON SUNDAY WITH WILL!!']


In [9]:
import numpy as np
import pandas as pd

In [10]:
fullCorpus = pd.DataFrame({
    'label': labelList[:-1],
    'body_list': textList 
})
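
The `[:-1]` slice on `labelList` appears to be needed because the file ends with a newline, so the split leaves one trailing empty string that lands in a label position. A minimal sketch on a made-up two-message sample (the `sample` string and variable names are hypothetical, not taken from the dataset):

```python
# Hypothetical two-row sample in the same label<TAB>text<NEWLINE> layout.
sample = "ham\tsee you soon\nspam\tWIN a prize now\n"

parsed = sample.replace("\t", "\n").split("\n")
# The trailing newline leaves an empty string at the end of the list.
print(parsed)

labels = parsed[0::2]  # even positions: labels, plus the trailing ''
texts = parsed[1::2]   # odd positions: message bodies
print(len(labels), len(texts))

labels = labels[:-1]   # drop the trailing empty label so the lengths match
print(list(zip(labels, texts)))
```

Without the slice the two lists would differ in length, and building a DataFrame from them would fail.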

In [11]:
fullCorpus

Unnamed: 0,label,body_list
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!
...,...,...
5565,spam,This is the 2nd time we have tried 2 contact u...
5566,ham,Will ü b going to esplanade fr home?
5567,ham,"Pity, * was in mood for that. So...any other s..."
5568,ham,The guy did some bitching but I acted like i'd...


2. Read it with `pd.read_csv` once the structure of the dataset is known (tab-separated, no header row).

In [12]:
fullCorpus = pd.read_csv("../dataset/SMSSpamCollection.tsv", sep='\t', header=None)
fullCorpus.columns = ['label', 'body_text']
fullCorpus.head()

Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!
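
For comparison, the same tab-separated layout can also be read with the standard library alone. This is a rough sketch on an in-memory sample, not the notebook's method; the `sample` string is hypothetical, and `csv.QUOTE_NONE` is an assumption that quote characters in message bodies should be treated as ordinary text:

```python
import csv
import io

# Hypothetical in-memory stand-in for the TSV file.
sample = "ham\tsee you soon\nspam\tWIN a prize now\n"

# Each row is split on tabs; quotes are kept as literal characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [{"label": label, "body_text": body} for label, body in reader]
print(rows)
```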


## Explore the dataset


In [13]:
# What is the shape of the dataset?
print(f"Input data has {len(fullCorpus)} rows and {len(fullCorpus.columns)} columns")

# How many spam/ham messages are there?
print("{0} are spam, {1} are ham".format(
    len(fullCorpus[fullCorpus['label'] == 'spam']),
    len(fullCorpus[fullCorpus['label'] == 'ham'])))

# How much missing data is there?
print(f"Number of nulls in label: {fullCorpus['label'].isnull().sum()}")
print(f"Number of nulls in text: {fullCorpus['body_text'].isnull().sum()}")

Input data has 5568 rows and 2 columns
746 are spam, 4822 are ham
Number of nulls in label: 0
Number of nulls in text: 0
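
The raw counts are easier to judge as proportions, since the two classes are far from balanced. A small sketch using the counts printed above (the variable names are illustrative only):

```python
# Counts copied from the output above.
counts = {"spam": 746, "ham": 4822}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

A classifier that always predicts `ham` would already be right for most messages, which is worth keeping in mind when evaluating a model later.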


### Regular Expressions
> Text string for describing a search pattern

Example: `[0-9]`, `[j-q0-9]+`
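
To make the two example patterns concrete, here is a short sketch of what each one matches. The `sample` string is made up for illustration:

```python
import re

sample = "room 42, seat k9, row z"  # hypothetical input

# [0-9] matches exactly one digit at a time.
digits = re.findall(r"[0-9]", sample)

# [j-q0-9]+ matches runs of one or more characters that are
# either a letter from j to q or a digit.
runs = re.findall(r"[j-q0-9]+", sample)

print(digits)
print(runs)
```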

__Why useful?__

- Identify whitespace between words/tokens
- Identify delimiters or end-of-line escape characters
- Remove punctuation or numbers from your text
- Clean HTML tags from text
- Identify some textual patterns you're interested in

__Use cases__

- Confirm passwords meet criteria
- Search URL for some substring
- Search for files on your computer
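
Two of the use cases above can be sketched with the standard `re` module. The password policy here (at least 8 characters, with a lowercase letter, an uppercase letter and a digit) is a made-up assumption, and both function names are hypothetical:

```python
import re

def meets_criteria(password):
    """Check a password against a hypothetical policy."""
    checks = [r".{8,}", r"[a-z]", r"[A-Z]", r"[0-9]"]
    # Every pattern must be found somewhere in the password.
    return all(re.search(pattern, password) for pattern in checks)

def url_contains(url, fragment):
    """Return True if the literal fragment occurs anywhere in the URL."""
    # re.escape keeps characters such as '.' or '?' from acting as regex syntax.
    return re.search(re.escape(fragment), url) is not None

print(meets_criteria("Secret123"))
print(meets_criteria("secret"))
print(url_contains("https://example.com/docs/regex", "/docs/"))
```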

### Using regular expressions in Python: the `re` package

In [14]:
import re

re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This      is a made up     string to test 2    different regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

### Split a sentence into a list of words

In [44]:
# Split on a single whitespace character (raw string avoids invalid-escape warnings)
re.split(r'\s', re_test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [53]:
# Use + to split on runs of one or more whitespace characters
re.split(r'\s+', re_test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [54]:
# \W+ splits on runs of one or more non-word characters
re.split(r'\W+', re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [58]:
# \w+ matches runs of one or more word characters;
# findall returns the words themselves instead of splitting on the gaps
re.findall(r'\w+', re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

### Takeaways

- Useful methods for tokenizing: `findall()`, `split()`
- Useful regexes for tokenizing: `\s`, `\w`, `\W`
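
The two methods are not always interchangeable. When the text ends (or starts) with a non-word character, `re.split` returns an empty string at that edge, while `re.findall` does not. A small sketch with a made-up sentence:

```python
import re

text = "Hello, world!"  # hypothetical input ending in punctuation

split_tokens = re.split(r"\W+", text)   # splits on the gaps between words
found_tokens = re.findall(r"\w+", text)  # collects the words themselves

print(split_tokens)
print(found_tokens)
```

If `split()` is used, filtering out empty strings afterwards is one way to get the same result.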

### Replace a specific string

In [59]:
pep8_test = 'I try to follow PEP8 guidelines.'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

In [64]:
re.findall('[A-Z]+[0-9]+', peep8_test)

['PEEP8']

In [65]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', peep8_test)

'I try to follow PEP8 Python Styleguide guidelines'
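
The same pattern can be checked against all three test strings defined earlier. A short sketch (the `tests` and `results` names are only for illustration):

```python
import re

tests = {
    "pep8_test": "I try to follow PEP8 guidelines.",
    "pep7_test": "I try to follow PEP7 guidelines",
    "peep8_test": "I try to follow PEEP8 guidelines",
}

# One or more capital letters followed by one or more digits.
pattern = r"[A-Z]+[0-9]+"

results = {}
for name, text in tests.items():
    matches = re.findall(pattern, text)
    replaced = re.sub(pattern, "PEP8 Python Styleguide", text)
    results[name] = (matches, replaced)
    print(name, matches, replaced)
```

The lone capital `I` at the start is not matched because no digit follows it.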

# Machine Learning Pipeline

Raw text -> Tokenize -> Clean text -> Vectorize -> Machine Learning algorithm

## Implement a pipeline to clean text
### Pre-processing text data

Cleaning or preprocessing the data usually consists of a number of steps:

1. Remove punctuation
2. Tokenization
3. Remove stopwords
4. Lemmatize/Stem
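
The four steps can be chained into one function. The sketch below uses only the standard library so that it runs without downloads: `STOPWORDS` is a tiny hypothetical list rather than the `nltk` one, and `crude_stem` is a placeholder suffix stripper, not a real stemmer or lemmatizer.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"i", "a", "on", "with", "have", "will", "the", "to"}

def remove_punct(text):
    return "".join(ch for ch in text if ch not in string.punctuation)

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    # Placeholder only: strips a few common suffixes from longer words.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def clean_text(text):
    tokens = remove_stopwords(tokenize(remove_punct(text)))
    return [crude_stem(t) for t in tokens]

cleaned = clean_text("I HAVE A DATE ON SUNDAY WITH WILL!!")
print(cleaned)
```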

In [94]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

data = pd.read_csv("../dataset/SMSSpamCollection.tsv", sep="\t", header=None)
data.columns = ['label', 'body_text']
data.head()

Unnamed: 0,label,body_text
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
2,ham,"Nah I don't think he goes to usf, he lives around here though"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


### Remove punctuation



In [70]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [79]:
# Build a function that drops every character found in string.punctuation
def remove_punct(text):
    return "".join([char for char in text if char not in string.punctuation])

data['body_text_clean'] = data['body_text'].apply(remove_punct)
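
An alternative way to write the same function is with `str.translate`, which deletes the characters in a single pass. This is a sketch of an equivalent approach, not the notebook's version; the name `remove_punct_translate` is hypothetical:

```python
import string

# Translation table that maps every ASCII punctuation character to "delete".
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text):
    return text.translate(_PUNCT_TABLE)

print(remove_punct_translate("I HAVE A DATE ON SUNDAY WITH WILL!!"))
print(remove_punct_translate("Nah I don't think he goes to usf, he lives around here though"))
```

Either version only removes the characters in `string.punctuation`, so non-ASCII symbols such as `£` or `ü` are left in place.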

In [80]:
data

Unnamed: 0,label,body_text,body_text_clean
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL
...,...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy...,This is the 2nd time we have tried 2 contact u U have won the £750 Pound prize 2 claim is easy c...
5564,ham,Will ü b going to esplanade fr home?,Will ü b going to esplanade fr home
5565,ham,"Pity, * was in mood for that. So...any other suggestions?",Pity was in mood for that Soany other suggestions
5566,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week ...,The guy did some bitching but I acted like id be interested in buying something else next week a...


### Tokenization


In [81]:
import re

def tokenize(text):
    # Split on runs of one or more non-word characters
    tokens = re.split(r"\W+", text)
    return tokens

data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))
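
One side effect of deleting punctuation before tokenizing is that words joined only by punctuation get merged: the message containing `So...any` ends up as the single token `soany`. A possible alternative, shown here as a sketch rather than a recommendation, is to tokenize the raw text directly with `\w+`:

```python
import re
import string

text = "Pity, * was in mood for that. So...any other suggestions?"

# As in the notebook: delete punctuation first, then split on non-word characters.
stripped = "".join(ch for ch in text if ch not in string.punctuation)
tokens_delete_first = re.split(r"\W+", stripped.lower())

# Alternative: let the regex treat punctuation as a separator.
tokens_direct = re.findall(r"\w+", text.lower())

print(tokens_delete_first)
print(tokens_direct)
```

The trade-off is that contractions are split as well: `don't` becomes `don` and `t` with the direct approach.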

In [82]:
data

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]"
...,...,...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy...,This is the 2nd time we have tried 2 contact u U have won the £750 Pound prize 2 claim is easy c...,"[this, is, the, 2nd, time, we, have, tried, 2, contact, u, u, have, won, the, 750, pound, prize,..."
5564,ham,Will ü b going to esplanade fr home?,Will ü b going to esplanade fr home,"[will, ü, b, going, to, esplanade, fr, home]"
5565,ham,"Pity, * was in mood for that. So...any other suggestions?",Pity was in mood for that Soany other suggestions,"[pity, was, in, mood, for, that, soany, other, suggestions]"
5566,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week ...,The guy did some bitching but I acted like id be interested in buying something else next week a...,"[the, guy, did, some, bitching, but, i, acted, like, id, be, interested, in, buying, something, ..."


### Remove stopwords

In [85]:
import nltk

nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words("english")

[nltk_data] Downloading package stopwords to /home/chuong/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [86]:
def remove_stopwords(tokenized_list):
    return [word for word in tokenized_list if word not in stopword]

In [89]:
data['body_text_nostop'] = data['body_text_tokenized'].apply(lambda x: remove_stopwords(x))
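
Because `stopword` is a list, each `word not in stopword` check scans the list from the start. Converting it to a set makes each lookup constant time on average, which may matter on larger corpora. A minimal sketch with a small hypothetical stopword set (in the notebook this would be `set(stopword)`); the function name is illustrative:

```python
# Hypothetical subset of stopwords, used here so the example runs without nltk.
STOPWORD_SET = frozenset(["i", "have", "a", "on", "with", "will"])

def remove_stopwords_set(tokenized_list, stopwords=STOPWORD_SET):
    # Set membership is a hash lookup rather than a linear scan.
    return [word for word in tokenized_list if word not in stopwords]

tokens = ["i", "have", "a", "date", "on", "sunday", "with", "will"]
filtered = remove_stopwords_set(tokens)
print(filtered)
```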

In [90]:
data

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenized,body_text_nostop
0,ham,I've been searching for the right words to thank you for this breather. I promise i wont take yo...,Ive been searching for the right words to thank you for this breather I promise i wont take your...,"[ive, been, searching, for, the, right, words, to, thank, you, for, this, breather, i, promise, ...","[ive, searching, right, words, thank, breather, promise, wont, take, help, granted, fulfil, prom..."
1,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to...","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
2,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
3,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]"
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,I HAVE A DATE ON SUNDAY WITH WILL,"[i, have, a, date, on, sunday, with, will]","[date, sunday]"
...,...,...,...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy...,This is the 2nd time we have tried 2 contact u U have won the £750 Pound prize 2 claim is easy c...,"[this, is, the, 2nd, time, we, have, tried, 2, contact, u, u, have, won, the, 750, pound, prize,...","[2nd, time, tried, 2, contact, u, u, 750, pound, prize, 2, claim, easy, call, 087187272008, now1..."
5564,ham,Will ü b going to esplanade fr home?,Will ü b going to esplanade fr home,"[will, ü, b, going, to, esplanade, fr, home]","[ü, b, going, esplanade, fr, home]"
5565,ham,"Pity, * was in mood for that. So...any other suggestions?",Pity was in mood for that Soany other suggestions,"[pity, was, in, mood, for, that, soany, other, suggestions]","[pity, mood, soany, suggestions]"
5566,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week ...,The guy did some bitching but I acted like id be interested in buying something else next week a...,"[the, guy, did, some, bitching, but, i, acted, like, id, be, interested, in, buying, something, ...","[guy, bitching, acted, like, id, interested, buying, something, else, next, week, gave, us, free]"
