## SMS Message Filtering
The goal here is to build an ML model that classifies SMS messages as spam or ham (legitimate messages).
Naive Bayes classifiers are a popular statistical technique for e-mail filtering. They typically use bag-of-words features to identify spam e-mail. We'll therefore build a simple message classifier with Naive Bayes, which applies Bayes' theorem under the "naive" assumption that words occur independently of each other given the class.
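
To make the bag-of-words idea concrete, here is a minimal sketch on three made-up messages. The names `toy_messages`, `toy_cv`, `vocabulary` and `count_matrix` are hypothetical and are not part of the notebook's pipeline. Each message becomes a row of word counts, and word order is discarded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, invented for illustration only
toy_messages = [
    "free entry win cash",
    "ok see you later",
    "win free prize now",
]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_messages)  # sparse document-term matrix

# vocabulary_ maps each term to its column index; sort terms by that index
vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
count_matrix = toy_counts.toarray()

print(vocabulary)
print(count_matrix)
```

Each column corresponds to one vocabulary term, and each cell holds how often that term appears in the message. These counts are the features the Naive Bayes classifier is trained on.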

#### 1. Import all the necessary libraries.

In [8]:
# Data analysis, cleaning and preparation
import pandas as pd
import numpy as np

In [9]:
# To convert a collection of text documents to a vector of term/token counts.
from sklearn.feature_extraction.text import CountVectorizer
# For splitting data arrays into training set and testing set
from sklearn.model_selection import train_test_split
# The multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

#### 2. Dataset
The data is a collection of SMS messages tagged as spam or ham, available at https://www.kaggle.com/uciml/sms-spam-collection-dataset.

In [3]:
df = pd.read_csv('spam.csv', encoding="latin-1")
# drop unnecessary columns from dataset 
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [7]:
df['label'] = df['v1'].map({'ham': 0, 'spam': 1})
X = df['v2']     # messages
y = df['label']  # labels/classes

In [7]:
cv = CountVectorizer()
X = cv.fit_transform(X)  # learn the vocabulary and build the document-term count matrix

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
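
Note that the cells above fit `CountVectorizer` on all messages before splitting, so the vocabulary also includes words that appear only in test messages. A minimal sketch of one alternative, assuming scikit-learn's `make_pipeline`, is to split the raw text first and let a pipeline fit the vectorizer on the training part only. The toy `messages` and `labels` below are invented so the block runs on its own. In the notebook they would be `df['v2']` and `df['label']`, and the scores could differ from those reported here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = spam, 0 = ham), for illustration only
messages = [
    "win free cash now", "free prize claim now", "urgent win a free phone",
    "claim your free reward", "cash prize waiting call now", "free entry win tickets",
    "see you at lunch", "ok call me later", "are you coming home",
    "thanks for the help", "meeting moved to friday", "happy birthday my friend",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the raw text first so the test messages stay unseen
msg_train, msg_test, lab_train, lab_test = train_test_split(
    messages, labels, test_size=0.33, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer on the training messages only
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(msg_train, lab_train)

test_accuracy = spam_pipeline.score(msg_test, lab_test)
learned_vocabulary = set(spam_pipeline.named_steps["countvectorizer"].vocabulary_)
print(test_accuracy, len(learned_vocabulary))
```

Words that occur only in the test messages are simply ignored at prediction time, which mirrors what happens when the model later sees new messages.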

In [11]:
# Naive Bayes Classifier
clf = MultinomialNB()
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

0.9793365959760739

In [12]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1587
           1       0.93      0.92      0.92       252

    accuracy                           0.98      1839
   macro avg       0.96      0.95      0.96      1839
weighted avg       0.98      0.98      0.98      1839



From the results above we can see that the Naive Bayes classifier is not only easy to implement but also performs well: about 98% accuracy overall, with precision of 0.93 and recall of 0.92 on the spam class.
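
The per-class precision and recall in such a report come from confusion-matrix counts. As a minimal sketch on made-up labels (`toy_true` and `toy_pred` are hypothetical and are not the notebook's predictions), they can be computed by hand for the spam class like this:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration (1 = spam, 0 = ham)
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a spam filter, low precision means legitimate messages get flagged, while low recall means spam slips through.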

### Persist model in a standard format
After training, it is useful to persist the model so it can be reused later without retraining. To do this, we save the model to a .pkl file with joblib.

In [13]:
import joblib
joblib.dump(clf, 'NB_spam_model.pkl')
# The saved model can later be served by a microservice that exposes endpoints to receive client requests.

['NB_spam_model.pkl']

In [14]:
# joblib.load accepts a file path directly, so no open file handle is left behind
clf = joblib.load('NB_spam_model.pkl')
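
The reloaded classifier expects count vectors built with the same vocabulary it was trained on, so the fitted `CountVectorizer` has to be available wherever the model is used. A minimal sketch of one way to handle this, assuming both objects are stored together in a single joblib file: the bundle name `NB_spam_bundle.pkl`, the dictionary keys and the toy training data are all hypothetical. In the notebook the objects would be `cv` and `clf`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (1 = spam, 0 = ham), for illustration only
train_messages = [
    "win a free prize now",
    "free cash win now",
    "see you at lunch",
    "ok call me later",
]
train_labels = [1, 1, 0, 0]

demo_cv = CountVectorizer()
demo_clf = MultinomialNB().fit(demo_cv.fit_transform(train_messages), train_labels)

# Save vectorizer and model together, then load them back
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "NB_spam_bundle.pkl")
    joblib.dump({"vectorizer": demo_cv, "model": demo_clf}, bundle_path)
    bundle = joblib.load(bundle_path)

# New text must go through the *fitted* vectorizer (transform, not fit_transform)
new_message = ["win free prize now"]
prediction = bundle["model"].predict(bundle["vectorizer"].transform(new_message))
print(prediction)
```

Only load pickle or joblib files from sources you trust, since loading them can execute arbitrary code.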