## **Natural Language Processing - Basics**

In [1]:
import nltk
import pandas as pd
import re
import string
nltk.download('stopwords')

from google.colab import drive
drive.mount('/content/drive')

print('Setup done.')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Setup done.


### **Explore the data**

In [2]:
data = pd.read_csv('/content/drive/MyDrive/projects/colab-notebooks/datasets/SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'body_text']
data.head()


| | label | body_text |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


Count the rows and columns in the dataset.

In [3]:
rows = len(data)
columns = len(data.columns)
print("Input data: rows = {}, columns = {}".format(rows, columns))

Input data: rows = 5568, columns = 2


Count how many messages carry the spam label and how many carry the ham label.

In [4]:
spam_count = len(data[data['label'] == 'spam'])
ham_count = len(data[data['label'] == 'ham'])

print('Out of {} rows, there are {} spam and {} ham'.format(
  rows, spam_count, ham_count
))

Out of 5568 rows, there are 746 spam and 4822 ham
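
The counts above point to an imbalanced dataset. A minimal sketch of the arithmetic, using only the numbers printed above (the names `spam_share` and `ham_share` are introduced here for illustration; in pandas, `data['label'].value_counts(normalize=True)` should report the same proportions directly):

```python
# Counts copied from the notebook output above
rows = 5568
spam_count = 746
ham_count = 4822

# Sanity check: the two classes account for every row
assert spam_count + ham_count == rows

# Share of each class in the dataset
spam_share = spam_count / rows
ham_share = ham_count / rows
print('spam: {:.1%}, ham: {:.1%}'.format(spam_share, ham_share))
```

Roughly one message in seven is spam, which is worth keeping in mind when a classifier is later evaluated on this data, since plain accuracy can look high on imbalanced classes.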


Check each column for missing values.

In [5]:
null_count_label = data['label'].isnull().sum()
null_count_text = data['body_text'].isnull().sum()

print('Number of null in label: {}'.format(null_count_label))
print('Number of null in body_text: {}'.format(null_count_text))

Number of null in label: 0
Number of null in body_text: 0


### **Remove punctuation**

In [6]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [7]:
def remove_punctuation(text):
  # Keep only the characters that are not listed in string.punctuation
  text_no_punctuation = ''.join([char for char in text if char not in string.punctuation])
  return text_no_punctuation

In [8]:
data['body_text_clean'] = data['body_text'].apply(remove_punctuation)
data.head()

| | label | body_text | body_text_clean |
|---|---|---|---|
| 0 | ham | I've been searching for the right words to tha... | Ive been searching for the right words to than... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... |
| 3 | ham | Even my brother is not like to speak with me. ... | Even my brother is not like to speak with me T... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL |
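
The character-by-character filter above works, but the same result can probably be obtained with a translation table. A minimal sketch, assuming the standard-library `str.maketrans` / `str.translate` pair; the names `PUNCT_TABLE` and `remove_punctuation_fast` are hypothetical and not part of the notebook:

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_fast(text):
  # Hypothetical alternative to remove_punctuation: one pass via str.translate
  return text.translate(PUNCT_TABLE)

print(remove_punctuation_fast("I've been searching, for the right words!!"))
```

Either way, stripping apostrophes merges contractions (`I've` becomes `Ive`, `don't` becomes `dont`), as the `body_text_clean` column shows.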


### **Tokenize**

In [9]:
def tokenize(text):
  # Split on runs of non-word characters; the raw string avoids an invalid-escape warning for \W
  tokens = re.split(r'\W+', text)
  return tokens

In [10]:
data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))
data.head()

| | label | body_text | body_text_clean | body_text_tokenized |
|---|---|---|---|---|
| 0 | ham | I've been searching for the right words to tha... | Ive been searching for the right words to than... | [ive, been, searching, for, the, right, words,... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, in, 2, a, wkly, comp, to, win, f... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... | [nah, i, dont, think, he, goes, to, usf, he, l... |
| 3 | ham | Even my brother is not like to speak with me. ... | Even my brother is not like to speak with me T... | [even, my, brother, is, not, like, to, speak, ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] |
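
One edge case of splitting on `\W+` is that text beginning or ending with a non-word character (a leading or trailing space, for example) yields empty-string tokens. The sketch below illustrates this on a made-up sample string and shows one possible variant, `tokenize_no_empty`, a hypothetical helper that collects word runs with `re.findall` instead:

```python
import re

def tokenize(text):
  # Same approach as the notebook: split on runs of non-word characters
  return re.split(r'\W+', text)

def tokenize_no_empty(text):
  # Hypothetical variant: collect runs of word characters, so no empty tokens appear
  return re.findall(r'\w+', text)

sample = ' hello world '  # made-up input with leading and trailing spaces
print(tokenize(sample))
print(tokenize_no_empty(sample))
```

Whether the empty tokens matter depends on the later steps; they could also simply be filtered out after splitting.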


### **Remove stopwords**

In [11]:
# Store the stopwords in a set so each membership test is a constant-time lookup
stopwords = set(nltk.corpus.stopwords.words('english'))

In [12]:
def remove_stopwords(tokenized_list):
  text = [word for word in tokenized_list if word not in stopwords]
  return text

In [13]:
data['body_text_no_stop'] = data['body_text_tokenized'].apply(remove_stopwords)
data.head()

| | label | body_text | body_text_clean | body_text_tokenized | body_text_no_stop |
|---|---|---|---|---|---|
| 0 | ham | I've been searching for the right words to tha... | Ive been searching for the right words to than... | [ive, been, searching, for, the, right, words,... | [ive, searching, right, words, thank, breather... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, in, 2, a, wkly, comp, to, win, f... | [free, entry, 2, wkly, comp, win, fa, cup, fin... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... | [nah, i, dont, think, he, goes, to, usf, he, l... | [nah, dont, think, goes, usf, lives, around, t... |
| 3 | ham | Even my brother is not like to speak with me. ... | Even my brother is not like to speak with me T... | [even, my, brother, is, not, like, to, speak, ... | [even, brother, like, speak, treat, like, aids... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | I HAVE A DATE ON SUNDAY WITH WILL | [i, have, a, date, on, sunday, with, will] | [date, sunday] |
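
In the output above, `dont` survives in row 2 even though `don't` is a common stopword. A likely cause is that punctuation was stripped from the messages but not from the stopword list, so contractions no longer match. A minimal sketch of one possible fix, assuming a small hand-written stopword sample in place of the full NLTK list (`sample_stopwords`, `normalized_stopwords`, and `remove_stopwords_normalized` are hypothetical names, not part of the notebook):

```python
import string

# Illustrative sample only; in the notebook this would be
# nltk.corpus.stopwords.words('english')
sample_stopwords = ['i', 'he', 'to', "don't", "you're"]

# Strip punctuation from the stopwords too, so they match the cleaned tokens
strip_table = str.maketrans('', '', string.punctuation)
normalized_stopwords = {word.translate(strip_table) for word in sample_stopwords}

def remove_stopwords_normalized(tokenized_list):
  # Drop every token that matches a punctuation-free stopword
  return [word for word in tokenized_list if word not in normalized_stopwords]

tokens = ['nah', 'i', 'dont', 'think', 'he', 'goes', 'to', 'usf']
print(remove_stopwords_normalized(tokens))
```

Row 4 shows the opposite risk: the name `WILL` is lowercased to `will`, matches a stopword, and is dropped, leaving only `[date, sunday]`.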
