### Building a Spam Message Classifier
We will build a spam message classifier (an NLP model) and a Flask application that renders an HTML home page and a prediction page. The user enters text on the home page, and the prediction page shows whether the application classifies it as spam or ham. The Flask API will then be deployed to a public cloud.


### Building the Model

First we will build the NLP model, which takes natural-language text from the user and predicts whether it is spam or ham. We will use Jupyter to train the model and save it as a pickle file, and Python with Flask to expose it to the web layer.

In [3]:
import pandas as pd

import pickle  # standard-library serializer (not used directly below)

from sklearn.feature_extraction.text import CountVectorizer  # converts text to token-count vectors

from sklearn.naive_bayes import MultinomialNB  # the classifier

import joblib  # used below to save the model and the vectorizer

from sklearn.model_selection import train_test_split  # train/test split

### Load the data from CSV into a pandas DataFrame and do basic EDA

In [9]:
df = pd.read_csv("spam.csv", encoding="latin-1")
# df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)  # only needed if the CSV contains these extra columns

df.info()
df.shape
df.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   class    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


|   | class | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |


### Split into text data (X) and labels (y)

In [10]:
# Text and Labels

df['label'] = df['class'].map({'ham': 0, 'spam': 1}) 

X = df['message']

y = df['label']
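
One thing worth checking after the mapping: pandas' Series.map with a dict produces NaN for any value that is not one of the dict's keys, so a stray or misspelled class would slip silently into y. A minimal sketch of that check, using a small made-up frame (the rows and the name toy are illustrative, not taken from spam.csv):

```python
import pandas as pd

# Toy stand-in for the real data; the third class value is deliberately malformed.
toy = pd.DataFrame({
    "class": ["ham", "spam", "Spam "],
    "message": ["see you at 5", "win a free prize now", "claim your reward"],
})

toy["label"] = toy["class"].map({"ham": 0, "spam": 1})

# Rows whose class was not a key in the mapping end up with a NaN label.
unmapped = toy[toy["label"].isna()]
print(unmapped["class"].tolist())
```

If the list is empty, every row received a 0 or 1 label and the mapped column is safe to use as y.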

We will vectorize the text (convert it to numerical form) using scikit-learn's CountVectorizer, then split the data into training and test sets. The aim of this exercise is to build and deploy the model, not to maximize accuracy, so we skip further text preprocessing.
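
To make the vectorization step concrete, here is a small sketch of what CountVectorizer produces. The two messages and the names docs, toy_cv and counts are made up for illustration; the real notebook applies the same calls to the message column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, purely for illustration.
docs = ["free prize free entry", "see you at lunch"]

toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs)  # sparse matrix: one row per message, one column per token

# vocabulary_ maps each token to its column index in the matrix.
vocab = toy_cv.vocabulary_
print(sorted(vocab))

# "free" occurs twice in the first message and never in the second.
free_col = vocab["free"]
print(counts[0, free_col], counts[1, free_col])
```

Each message becomes a row of token counts, which is the numerical form the classifier is trained on.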

In [11]:
# Extract features with CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)  # learn the vocabulary and build the document-term matrix

# Split the dataset for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
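
Note that the cell above fits the vectorizer on every message before splitting, so the vocabulary also includes words that occur only in the test set. For this exercise that is harmless, but a common alternative is to split the raw text first and fit the vectorizer on the training part only. A sketch of that variant on a few made-up messages (the data and all names here are illustrative and not part of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up messages and labels (1 = spam, 0 = ham), for illustration only.
texts = [
    "win a free prize now", "free entry to win cash", "claim your free reward",
    "are we still on for lunch", "call me when you get home", "see you at the office",
]
labels = [1, 1, 1, 0, 0, 0]

# Split the raw text first...
text_train, text_test, toy_y_train, toy_y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)

# ...then learn the vocabulary from the training messages only.
split_cv = CountVectorizer()
toy_X_train = split_cv.fit_transform(text_train)
toy_X_test = split_cv.transform(text_test)  # reuses the training vocabulary; unseen words are ignored

print(toy_X_train.shape, toy_X_test.shape)
```

Both matrices share the same columns, because the test messages are only transformed with the vocabulary learned from the training messages.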

Next we fit the model and save it as a pickle file, which will later be loaded in the Flask app.

In [15]:
# Multinomial Naive Bayes classifier

clf = MultinomialNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the test set

# Save the trained model and the fitted vectorizer with joblib

joblib.dump(clf, 'spam-nb-model.pkl')
joblib.dump(cv, 'count_vect.pkl')

0.9793365959760739


['count_vect.pkl']

We have saved both the trained model and the count vectorizer. Both are needed because, when the web app receives input from the user, it must vectorize that text with the same fitted CountVectorizer (the same vocabulary) that the model was trained on.
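
As a rough sketch of how the two saved files fit together at prediction time, the snippet below trains a tiny stand-in model, saves it with joblib, then loads both objects back and classifies new text. The training messages, the temporary directory and the helper predict_label are assumptions made for illustration; only the file names follow the notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set (1 = spam, 0 = ham) so the sketch runs on its own.
texts = [
    "win a free prize now", "free entry to win cash",
    "are we still on for lunch", "call me when you get home",
]
labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_clf = MultinomialNB().fit(toy_cv.fit_transform(texts), labels)

# Save both objects, here into a temporary directory.
tmp_dir = tempfile.mkdtemp()
model_path = os.path.join(tmp_dir, "spam-nb-model.pkl")
vect_path = os.path.join(tmp_dir, "count_vect.pkl")
joblib.dump(toy_clf, model_path)
joblib.dump(toy_cv, vect_path)

# Later, for example inside the web app: load both and reuse the fitted vectorizer.
loaded_clf = joblib.load(model_path)
loaded_cv = joblib.load(vect_path)


def predict_label(message):
    """Hypothetical helper: return 'spam' or 'ham' for a single message."""
    features = loaded_cv.transform([message])  # transform only, never fit again
    return "spam" if loaded_clf.predict(features)[0] == 1 else "ham"


print(predict_label("free prize"))
print(predict_label("lunch at home"))
```

The important detail is that new text goes through transform on the loaded vectorizer, so its columns line up with the features the model saw during training.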
