## NLP for Text Classification 

## **`Spam-Ham Classifier`**



In [38]:
## importing necessary libraries

import pandas as pd
## Data Preprocessing libraries
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords ## for removal of words that are of no use in classification
from nltk.stem.porter import PorterStemmer ## Stem to base words


# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier

# sklearn 
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV,StratifiedKFold,RandomizedSearchCV

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The data has two columns separated by a tab.
The first column holds the target label (ham or spam) and the second holds the original message. Let's start by loading the data and looking at a few messages.

In [2]:
data = pd.read_csv('/content/SMSSpamCollection',sep='\t',names = ['label','msg'])

In [4]:
data.sample(5)

| | label | msg |
|---|---|---|
| 1126 | spam | For taking part in our mobile survey yesterday... |
| 3743 | ham | Hey i'm bored... So i'm thinking of u... So wa... |
| 1905 | ham | Wah... Okie okie... Muz make use of e unlimite... |
| 1999 | ham | Well, I have to leave for my class babe ... Yo... |
| 5040 | ham | Pls clarify back if an open return ticket that... |


In [7]:
## show the full message text instead of truncating it; None replaces the deprecated value -1
pd.set_option("display.max_colwidth", None)


In [10]:
## check how the messages look for both the labels
data[data['label']=='ham']['msg'].values[7]

"I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times."

In [11]:
data[data['label']=='spam']['msg'].values[7]

'England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+'

The messages above show what the data looks like and, more specifically, what kind of text ham and spam messages contain.

**Text Preprocessing**

Next we transform the data into the format our model requires. The loop below performs several steps on each message: cleaning (keeping letters only), converting to lower case, tokenizing, removing stopwords, and reducing words to their base form by stemming. The cleaned text is converted to vectors in the section after that.

In [17]:
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  ## build the stopword set once instead of reloading the list for every word
corpus = []
for msg in data['msg']:
    review = re.sub('[^a-zA-Z]', ' ', msg)   ## keep letters only; every other character becomes a space
    review = review.lower() ## lowercase so that e.g. 'Free' and 'free' count as the same word
    review = review.split() ## tokenizing on whitespace
    
    review = [ps.stem(word) for word in review if word not in stop_words] ## drop stopwords, then stem the remaining words to their base form
    review = ' '.join(review)
    corpus.append(review)

In [18]:
## how does our corpus look after the text preprocessing
corpus[:5]

['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
 'u dun say earli hor u c alreadi say',
 'nah think goe usf live around though']
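
To make the individual steps concrete, here is a minimal sketch that runs the same sequence on a single message and keeps every intermediate result. The message is made up, and `demo_stop_words` is a small hand-picked stand-in for NLTK's English stopword list (an assumption, used only so the snippet runs without downloading anything); the notebook itself uses `stopwords.words('english')`.

```python
import re
from nltk.stem.porter import PorterStemmer

# Hypothetical example message, not taken from the dataset
message = "I am going to the SHOPS, call me at 5pm!!"

# Assumption: tiny hand-picked stopword set standing in for NLTK's English list
demo_stop_words = {"i", "am", "to", "the", "me", "at"}

letters_only = re.sub('[^a-zA-Z]', ' ', message)  # digits and punctuation become spaces
lowered = letters_only.lower()                    # normalise case
tokens = lowered.split()                          # whitespace tokenization
kept = [w for w in tokens if w not in demo_stop_words]  # stopword removal

stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in kept]         # reduce words to their stems
cleaned = ' '.join(stemmed)

for name, value in [("tokens", tokens), ("kept", kept), ("stemmed", stemmed), ("cleaned", cleaned)]:
    print(f"{name:8s} {value}")
```

Note that the regular expression also removes digits, so "5pm" survives only as "pm". That is a property of the `[^a-zA-Z]` pattern used in this notebook, not of stemming.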

**Time to convert to vectors using Bag of Words Model**

We will use `CountVectorizer` to implement the bag-of-words (BoW) model.

In [19]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)


In [20]:
X = cv.fit_transform(corpus).toarray()    

In [22]:
## dimension of our independent data

X.shape

(5572, 2500)
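
As a rough illustration of what the count matrix contains, the sketch below fits `CountVectorizer` on three made-up, already-cleaned messages (the toy corpus and the `toy_`/`small_` variable names are assumptions for this example only). Each row is a message, each column a vocabulary term, and each cell the number of times that term occurs. The second part shows how `max_features` keeps only the most frequent terms across the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus in the same cleaned form as `corpus`
toy_corpus = ["free entri win", "ok lar joke", "free win win"]

toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# Vocabulary terms ordered by their column index in the matrix
toy_terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_terms)
print(toy_X)

# max_features=2 keeps only the two terms with the highest total count
small_cv = CountVectorizer(max_features=2)
small_X = small_cv.fit_transform(toy_corpus).toarray()
small_terms = sorted(small_cv.vocabulary_, key=small_cv.vocabulary_.get)
print(small_terms)
print(small_X)
```

With `max_features=2500`, the notebook applies the same idea to the full corpus, which is why `X` has 2500 columns.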

In [None]:
## words present in our vocab
cv.vocabulary_

In [34]:
## dependent data
y = pd.get_dummies(data['label'])

In [35]:
y[:5]

| | ham | spam |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |


In [36]:
y = y.iloc[:,1].values  ## keep only the 'spam' column, so 1 = spam and 0 = ham
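
The two-step encoding above can also be written as a single comparison against the positive class. The sketch below, on a made-up label series (`toy_labels` is an assumption for illustration), checks that both routes give the same 0/1 target.

```python
import pandas as pd

# Hypothetical labels in the same format as data['label']
toy_labels = pd.Series(['ham', 'spam', 'ham', 'spam'])

# Route used in the notebook: one-hot encode, then keep column 1.
# get_dummies sorts the columns alphabetically, so column 1 is 'spam'.
via_dummies = pd.get_dummies(toy_labels).iloc[:, 1].astype(int).values

# Direct route: mark every 'spam' row with 1
via_compare = (toy_labels == 'spam').astype(int).values

print(via_dummies, via_compare)
```

The direct comparison does not depend on column order, which may make the intent slightly easier to read.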

### Train-Test Split

In [37]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
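
A small sketch of what `test_size` and `random_state` do, using a made-up 10-row array (the `toy_`, `a_`, and `b_` names are assumptions for this example): `test_size = 0.20` holds out 20% of the rows, and a fixed `random_state` makes the shuffle reproducible. On the 5,572 messages here, the same setting should yield 4,457 training rows and 1,115 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Two splits with the same seed
a_train, a_test, ya_train, ya_test = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=0)
b_train, b_test, yb_train, yb_test = train_test_split(
    toy_X, toy_y, test_size=0.20, random_state=0)

print(a_train.shape, a_test.shape)    # 8 training rows, 2 test rows
print(np.array_equal(a_test, b_test)) # same seed, same held-out rows
```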

## Building a Text Classification Model

Now the data is ready to be fed into a classification model. Let's build basic classification models using commonly used algorithms and see how they perform.

In [39]:
# Fitting a simple Logistic Regression on Counts
clf = LogisticRegression(C=1.0)
scores = model_selection.cross_val_score(clf, X_train, y_train, cv=5, scoring="f1")
scores

array([0.92035398, 0.90232558, 0.93273543, 0.90232558, 0.9321267 ])
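
For a classifier and an integer `cv`, `cross_val_score` splits the training data with stratified K-fold, so every fold keeps roughly the same class proportions. The sketch below spells that out with an explicit `StratifiedKFold` object. The data is synthetic (generated with `make_classification`, with an arbitrary class imbalance), so it is an assumption standing in for the count matrix, and its scores say nothing about the spam data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: synthetic, imbalanced data standing in for X_train / y_train
demo_X, demo_y = make_classification(
    n_samples=500, n_features=20, weights=[0.85, 0.15], random_state=0)

# Explicit equivalent of cv=5 for a classifier
folds = StratifiedKFold(n_splits=5)
demo_clf = LogisticRegression(C=1.0, max_iter=1000)

# One F1 score per fold, computed on that fold's held-out part
demo_scores = cross_val_score(demo_clf, demo_X, demo_y, cv=folds, scoring="f1")
print(demo_scores.round(3), "mean:", round(demo_scores.mean(), 3))
```

Reporting the mean alongside the per-fold scores makes it easier to compare the models in the cells that follow.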

In [44]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [45]:
print(metrics.accuracy_score(y_test, y_pred))  ## sklearn metrics expect (y_true, y_pred)

0.9847533632286996
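
The cross-validation above is scored with F1 while the test set is scored with accuracy. The following sketch, on made-up labels (all names and values are assumptions for illustration), shows why the two can tell different stories when one class is rare: a predictor that never flags spam still gets a high accuracy, but its F1 score for the spam class is zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground truth: 8 ham (0) and 2 spam (1)
y_true = [0] * 8 + [1] * 2

always_ham = [0] * 10           # never predicts spam
catches_one = [0] * 8 + [1, 0]  # finds one of the two spam messages

acc_a = accuracy_score(y_true, always_ham)
f1_a = f1_score(y_true, always_ham, zero_division=0)
acc_b = accuracy_score(y_true, catches_one)
f1_b = f1_score(y_true, catches_one)

print(f"always_ham : accuracy={acc_a:.2f}  f1={f1_a:.2f}")
print(f"catches_one: accuracy={acc_b:.2f}  f1={f1_b:.2f}")
```

Here the first predictor reaches 0.80 accuracy with an F1 of 0.00, and the second reaches 0.90 accuracy with an F1 of about 0.67, so it can be worth looking at both metrics on the test set.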


In [41]:
# Fitting a simple Naive Bayes on Counts
clf_NB = MultinomialNB()
scores = model_selection.cross_val_score(clf_NB, X_train, y_train, cv=5, scoring="f1")
scores

array([0.95798319, 0.90295359, 0.93670886, 0.925     , 0.93220339])

In [46]:
clf_NB.fit(X_train, y_train)
y_pred_NB = clf_NB.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_NB))  ## sklearn metrics expect (y_true, y_pred)

0.9856502242152466
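
To see how Naive Bayes uses word counts, here is a toy sketch with four made-up training messages (the messages, labels, and `toy_` names are assumptions, not data from this notebook). Words that appear mostly in spam push a new message towards the spam class, and words that appear mostly in ham push it the other way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages: 1 = spam, 0 = ham
train_msgs = ["free win prize", "win cash free", "meet lunch today", "see you lunch"]
train_labels = [1, 1, 0, 0]

toy_cv = CountVectorizer()
toy_nb = MultinomialNB()
toy_nb.fit(toy_cv.fit_transform(train_msgs), train_labels)

# New messages must go through the already fitted vectorizer (transform, not fit_transform)
new_msgs = ["free cash prize", "lunch today"]
preds = toy_nb.predict(toy_cv.transform(new_msgs))
print(dict(zip(new_msgs, preds.tolist())))
```

In this toy setup the first message is classified as spam and the second as ham.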


In [42]:
# Fitting a XGBoost on Counts
import xgboost as xgb
clf_xgb = xgb.XGBClassifier(max_depth=7, n_estimators=200, colsample_bytree=0.8, 
                        subsample=0.8, nthread=10, learning_rate=0.1)
scores = model_selection.cross_val_score(clf_xgb, X_train, y_train, cv=5, scoring="f1")
scores

array([0.92982456, 0.89719626, 0.91818182, 0.91402715, 0.92857143])

In [47]:
clf_xgb.fit(X_train, y_train)
y_pred_xgb = clf_xgb.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_xgb))  ## sklearn metrics expect (y_true, y_pred)

0.9856502242152466


With these basic models we achieve a test accuracy of more than 98%. In the next part of the notebook we will extend the text classification by applying other bag-of-words approaches, `TfidfVectorizer` and `HashingVectorizer`, to try to increase accuracy further.