# Natural Language Processing

#### Load the packages and import the data
The file is tab-separated rather than comma-separated because the message text contains commas, so it is read with `sep = "\t"`

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv("./Data Files/SMSSpamCollection", 
                   sep = "\t", names = ["label", "message"])
data.head()

,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


#### Clean the text and tokenize the data
Create a function that strips non-letter characters, lowercases the text, removes English stop words, and stems the remaining words

In [2]:
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

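# Build these once; rebuilding the stop-word set for every word is very slow
# (the stop-word list needs nltk.download("stopwords") to have been run once)
stop_words = set(stopwords.words("english"))
ps = PorterStemmer()

def text_process(text):
    text = re.sub('[^a-zA-Z]', ' ', text)   # replace everything except letters with spaces
    words = text.lower().split()            # lowercase, then split on whitespace
    # Drop stop words, then stem what is left
    words = [ps.stem(word) for word in words if word not in stop_words]
    return " ".join(words)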
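
To see what the cleaning steps do on their own, here is a minimal standard-library sketch. It assumes a tiny hand-written stop-word set (`TOY_STOPWORDS`, hypothetical) in place of NLTK's English list and skips stemming entirely, so its output will not match `text_process` exactly.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses NLTK's full English list instead.
TOY_STOPWORDS = {"i", "he", "to", "the", "a", "in", "so", "then"}

def toy_clean(text):
    """Mirror the cleaning steps without stemming (illustrative only)."""
    text = re.sub(r"[^a-zA-Z]", " ", text)   # digits and punctuation become spaces
    tokens = text.lower().split()            # lowercase, split on whitespace
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

print(toy_clean("Free entry in 2 a wkly comp"))
# ['free', 'entry', 'wkly', 'comp']
```

The digit `2` and the stop words `in` and `a` are dropped; a stemmer would additionally shorten the remaining words.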

Apply the text processing to each row

In [3]:
clean_data = data["message"]
clean_data = list(map(text_process, clean_data))

#### Create Bag of Words Model X (Tokenization)
This step might take a while; the output is a sparse matrix

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
bow_model = CountVectorizer(max_features = 1000)  # Set max_features argument to increase speed
X = bow_model.fit_transform(clean_data)
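
As a rough picture of what a bag-of-words count matrix contains, here is a pure-Python sketch. The function `toy_bag_of_words` and the three sample documents are made up for illustration; the real `CountVectorizer` tokenizes differently (for example, by default it ignores one-character tokens) and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

def toy_bag_of_words(docs, max_features=None):
    """Build a vocabulary and a dense count matrix from whitespace-split docs."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    # Rank terms by total frequency (ties broken alphabetically in this sketch),
    # keep the top max_features, then order the columns alphabetically.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    vocab = sorted(ranked[:max_features])
    index = {term: j for j, term in enumerate(vocab)}

    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in index:          # terms outside the vocabulary are ignored
                row[index[tok]] += 1
        matrix.append(row)
    return vocab, matrix

docs = ["free entri win cup", "ok lar joke", "free free win"]
vocab, counts = toy_bag_of_words(docs, max_features=3)
print(vocab)    # ['cup', 'free', 'win']
print(counts)   # [[1, 1, 1], [0, 0, 0], [0, 2, 1]]
```

Each row is one message and each column one vocabulary term; capping the vocabulary with `max_features` keeps the matrix narrow.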

To view the Bag of Words model in matrix form (this step is optional; it is for visualization purposes only)

In [5]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_bow = pd.DataFrame(X.toarray(), columns = bow_model.get_feature_names_out())
X_bow.head()

,abiola,abl,abt,ac,accept,access,account,across,activ,actual,...,yar,ye,yeah,year,yep,yesterday,yet,yo,yr,yup
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Fit a Multinomial Naive Bayes Model on the sparse matrix
Multinomial Naive Bayes tends to perform well on text data

#### Create y

In [6]:
y = data["label"]

#### Split the data into a train_set and test_set

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1111)
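
For intuition, a shuffled split can be sketched in plain Python as below. The helper `toy_train_test_split` is hypothetical, and its shuffling will not reproduce the row order scikit-learn produces for the same `random_state`; it only illustrates that rows and labels are shuffled together and that the test set takes the stated fraction, rounded up here.

```python
import math
import random

def toy_train_test_split(rows, labels, test_size=0.3, seed=1111):
    """Shuffle row indices with a fixed seed, then slice off the test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)         # a fixed seed makes the split repeatable
    n_test = math.ceil(test_size * len(rows))    # test set size, rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

rows = [f"message {i}" for i in range(10)]
labels = ["spam" if i % 5 == 0 else "ham" for i in range(10)]
X_tr, X_te, y_tr, y_te = toy_train_test_split(rows, labels)
print(len(X_tr), len(X_te))
```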

#### Transform X using TF-IDF (Optional)

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)

#### Fit the Multinomial Naive Bayes Model

In [9]:
from sklearn.naive_bayes import MultinomialNB
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [10]:
y_pred = nb_model.predict(X_test)
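
To make the model less of a black box, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-`alpha` smoothing (`alpha = 1.0`, the scikit-learn default, is Laplace smoothing). The functions `fit_toy_nb` and `predict_toy_nb` and the four-row training set are hypothetical, and the sketch uses integer counts, whereas the notebook feeds the model TF-IDF weights.

```python
import math

def fit_toy_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class log term probabilities."""
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(totals) + alpha * n_features   # smoothing keeps every probability > 0
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood

def predict_toy_nb(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy counts; columns stand for the terms "free", "win", "ok"
X_toy = [[2, 1, 0], [1, 2, 0], [0, 0, 2], [0, 1, 1]]
y_toy = ["spam", "spam", "ham", "ham"]

model = fit_toy_nb(X_toy, y_toy)
print(predict_toy_nb(model, [1, 1, 0]))   # spam
print(predict_toy_nb(model, [0, 0, 1]))   # ham
```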

#### Evaluate the Naive Bayes Model

In [11]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print("Naive Bayes Model", "\n")
print(pd.DataFrame(confusion_matrix(y_test, y_pred)), 
      "      Accuracy:", round(accuracy_score(y_test, y_pred), 3), 
      "\n")
print(classification_report(y_test, y_pred))

Naive Bayes Model 

      0    1
0  1453    5
1    28  186       Accuracy: 0.98 

             precision    recall  f1-score   support

        ham       0.98      1.00      0.99      1458
       spam       0.97      0.87      0.92       214

avg / total       0.98      0.98      0.98      1672
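
The figures in the report can be recomputed by hand from the confusion matrix above. In this sketch, rows are taken as the true labels and columns as the predictions, in the order ham, spam, which agrees with the support column (1458 ham, 214 spam).

```python
# Confusion matrix cells, treating "spam" as the positive class
tn, fp = 1453, 5     # true ham:  predicted ham, predicted spam
fn, tp = 28, 186     # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)   # of messages flagged as spam, share that are spam
spam_recall = tp / (tp + fn)      # of actual spam messages, share that were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(round(accuracy, 3))         # 0.98
print(round(spam_precision, 2))   # 0.97
print(round(spam_recall, 2))      # 0.87
print(round(spam_f1, 2))          # 0.92
```

The lower spam recall reflects the 28 spam messages that were classified as ham, while only 5 ham messages were flagged as spam.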

