# NLP-7-1: Spam Classifier Implementation
#### Using **lemmatization** and **bag of words / TF-IDF**
#### Credit
https://www.youtube.com/watch?v=fA5TSFELkC0&list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm&index=10



#### Import libraries

In [1]:
import nltk
import pandas as pd
import re

#### Download the **NLTK** data packages

In [2]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    |   Package bcp47 is already up-to-dat

True

#### Import the other required **NLTK** classes

In [3]:
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

#### URLs of the raw spam data on **GitHub**

In [4]:
url_new="https://raw.githubusercontent.com/akdubey2k/NLP/main/Spam_Classifier/SpamCollection.csv"
url_org='https://raw.githubusercontent.com/akdubey2k/NLP/main/Spam_Classifier/SMSSpamCollection.csv'

#### Read the data from the **GitHub** repository using **pandas**

In [5]:
df=pd.read_csv(url_new, sep='\t')
df.head()

| | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
|---|---|---|
| 0 | ham | Ok lar... Joking wif u oni... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | U dun say so early hor... U c already then say... |
| 3 | ham | Nah I don't think he goes to usf, he lives aro... |
| 4 | spam | FreeMsg Hey there darling it's been 3 week's n... |


#### Re-read the data into a **pandas** dataframe with explicit column names **label** and **message**. Without `names`, pandas uses the first record as the header row; the integer **index** is added by default.

In [6]:
df=pd.read_csv(url_new, sep='\t', names=['label', 'message'])
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
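The effect of `names` can be checked on a small in-memory sample. The sketch below is illustrative only: the two-row string `raw` is a hypothetical stand-in for the real file, and `no_names` / `with_names` are names introduced here.

```python
import io
import pandas as pd

# Two tab-separated records with no header line (hypothetical stand-in for the real file).
raw = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

# Without names=, the first record is consumed as the header row.
no_names = pd.read_csv(io.StringIO(raw), sep="\t")

# With names=, every record is kept as data and the columns get readable labels.
with_names = pd.read_csv(io.StringIO(raw), sep="\t", names=["label", "message"])

print(list(no_names.columns))
print(len(no_names), len(with_names))
```

The first call loses one message to the header, which is why the notebook re-reads the file with `names=['label', 'message']`.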


In [7]:
df=pd.read_csv(url_org, sep='\t', names=['label', 'message'])
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


#### Create the **stemmer** and **lemmatizer** objects

In [8]:
ps=PorterStemmer()
wnl=WordNetLemmatizer()
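To see why the notebook prefers lemmatization, it helps to look at what the stemmer produces. This is a minimal sketch using only `PorterStemmer`, which is rule-based and needs no downloaded corpus; the word list and the name `ps_demo` are illustrative assumptions. The lemmatizer is not run here because it depends on the WordNet data being available.

```python
from nltk.stem import PorterStemmer

ps_demo = PorterStemmer()  # rule-based suffix stripping, no corpus download required

words = ["running", "studies", "winning"]
stems = [ps_demo.stem(w) for w in words]

print(stems)
```

A stem such as `studi` is not a dictionary word, whereas `WordNetLemmatizer` is expected to return the dictionary form (`study`) for the same input.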

#### Data **Cleaning** and **preprocessing**

In [9]:
stop_words=set(stopwords.words('english'))  # build the stopword set once instead of once per word
corpus=[]
for message in df['message']:
  data=re.sub('[^a-zA-Z]', ' ', message)  # replace every non-letter with a space
  data=data.lower()
  data=data.split()
  #data=[ps.stem(word) for word in data if word not in stop_words]
  data=[wnl.lemmatize(word) for word in data if word not in stop_words]
  data=' '.join(data)
  corpus.append(data)
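The cleaning steps can be traced on a single message. The sketch below covers only the regex, lowercasing, and stopword-filtering steps; lemmatization is left out so that it runs without NLTK data. The helper `clean_message` and the tiny `demo_stop_words` set are hypothetical, not part of the notebook.

```python
import re

# Hypothetical mini stopword list; the notebook uses stopwords.words('english') instead.
demo_stop_words = {"in", "a", "to", "the", "is"}

def clean_message(text, stop_words):
    """Keep letters only, lowercase, split on whitespace, and drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    tokens = letters_only.lower().split()
    return ' '.join(t for t in tokens if t not in stop_words)

cleaned = clean_message("Free entry in 2 a wkly comp to win FA Cup!", demo_stop_words)
print(cleaned)
```

Digits and punctuation disappear along with the stopwords, so only letter tokens reach the vectorizer.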

#### Create the **TF-IDF** (weighted bag of words) feature matrix for spam classification

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(max_features=2500)
X=tfidf.fit_transform(corpus).toarray()
X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
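The weights in this matrix can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, then L2 normalization per row) and checks them against a three-document toy corpus; `docs`, `vec`, and the other names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); every token has at least two letters,
# so the default tokenizer keeps all of them.
docs = ["free entry win cup", "ok lar joking", "free msg win prize"]

vec = TfidfVectorizer()
X_demo = vec.fit_transform(docs)  # sparse matrix, one row per document

# Rebuild the same weights manually, in the vectorizer's column order.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
counts = np.array([[doc.split().count(term) for term in vocab] for doc in docs], dtype=float)
doc_freq = (counts > 0).sum(axis=0)                   # documents containing each term
idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1    # smoothed inverse document frequency
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

print(X_demo.shape)
print(np.allclose(manual, X_demo.toarray()))
```

Most entries are zero because each message uses only a few of the vocabulary terms, which is why the array printed by the notebook looks empty.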

#### Get **dummy** (one-hot) columns from the **pandas** dataframe for the **output/dependent** variable (**label** column)

In [11]:
y=pd.get_dummies(df['label'])
y

| | ham | spam |
|---|---|---|
| 0 | True | False |
| 1 | True | False |
| 2 | False | True |
| 3 | True | False |
| 4 | True | False |
| ... | ... | ... |
| 5567 | False | True |
| 5568 | True | False |
| 5569 | True | False |
| 5570 | True | False |


#### Since there are only two dummy variables, one column is enough. The two columns are perfectly redundant (the **dummy variable trap**), so we keep only the **spam** column.

In [12]:
y=y.iloc[:,1].values # all rows, second column (spam) only; .values returns a NumPy array
y

array([False, False,  True, ..., False, False, False])
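The same target vector can be built without the intermediate dummy columns. This is a minimal sketch on a hypothetical four-row `labels` series; it selects the `spam` column by name rather than by position, which is an assumption made here for readability.

```python
import pandas as pd

labels = pd.Series(["ham", "ham", "spam", "ham"], name="label")  # hypothetical sample

# Route used in the notebook: one-hot encode, then keep the 'spam' column.
y_dummies = pd.get_dummies(labels)["spam"].astype(bool).values

# Equivalent direct comparison, with no redundant 'ham' column to discard.
y_direct = (labels == "spam").values

print(y_direct)
```

Both arrays mark spam as `True`, so the choice between them is a matter of style.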

#### Split the data into **training** and **test** sets using **sklearn**
X (`tfidf.fit_transform(corpus)`) => independent variables (features)

y (the single dummy column) => dependent variable (target)

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.20, random_state=0)
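The size of each split follows from `test_size`. The sketch below uses placeholder arrays and assumes 5,572 rows, the usual size of the SMS Spam Collection; the row count and the `*_demo` names are assumptions, not values read from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 5572                              # assumed number of messages
X_demo = np.zeros((n_rows, 1))             # placeholder features
y_demo = np.zeros(n_rows, dtype=bool)      # placeholder labels

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

print(len(Xtr), len(Xte))
```

The test share is rounded up, so 20% of 5,572 rows gives 1,115 test rows. On an imbalanced label such as spam, passing `stratify=y` is a common way to keep the class ratio the same in both splits.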

#### Train the classification model using the **Multinomial Naive Bayes** classifier


In [14]:
from sklearn.naive_bayes import MultinomialNB
mtnb=MultinomialNB()
mtnb.fit(X_train, y_train) # fit on the training split only, so the test rows stay unseen

In [15]:
mtnb=mtnb.fit(X_train, y_train)
y_pred=mtnb.predict(X_test)
y_pred

array([False,  True, False, ..., False,  True, False])
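In this notebook the TF-IDF vectorizer is fitted on the whole corpus before the split, so the vocabulary and IDF weights also reflect the test messages. One common alternative, sketched here on hypothetical toy data, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that both are fitted on training text only. The texts, labels, and variable names below are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned training messages; True marks spam.
train_texts = [
    "free entry win cup final",
    "win free prize now",
    "ok lar joking wif u",
    "nah think go usf",
]
train_labels = [True, True, False, False]

# The pipeline fits TF-IDF and Naive Bayes together, on the training text only.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

pred = model.predict(["free prize win", "ok think go"])
print(pred)
```

With a pipeline, `model.predict` accepts raw cleaned strings, and unseen messages are transformed with the vocabulary learned from the training split.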

#### Compare the **y_test** and **y_pred** values using a **confusion matrix**

In [16]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred) # 2 x 2 matrix
cm

array([[954,   1],
       [ 22, 138]])
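In scikit-learn's layout the rows are the true classes and the columns the predicted classes, ordered `False` (ham) then `True` (spam). As a sketch, precision and recall for the spam class can be derived from the four cells; `cm_demo` simply copies the values printed above.

```python
import numpy as np

cm_demo = np.array([[954, 1],
                    [22, 138]])  # values copied from the confusion matrix output

# ravel() flattens row by row: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(round(precision, 4), round(recall, 4))
```

Only 1 ham message was flagged as spam, while 22 spam messages slipped through, so the model is more precise than it is sensitive.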

#### Check the accuracy score of the model

In [17]:
from sklearn.metrics import accuracy_score
ac=accuracy_score(y_test, y_pred)
ac

0.979372197309417
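The score can be cross-checked against the confusion matrix counts, and it is worth comparing with a trivial baseline because ham dominates the test set. A minimal sketch, using the four counts reported earlier; the "always predict ham" baseline is an illustration added here, not something the notebook computes.

```python
# Counts taken from the confusion matrix reported in this notebook.
tn, fp, fn, tp = 954, 1, 22, 138
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # correct predictions over all test messages
baseline = (tn + fp) / total   # accuracy of always predicting ham

print(total)
print(round(accuracy, 4), round(baseline, 4))
```

Because always predicting ham would already be right for about 86% of the test messages, accuracy alone overstates how well spam is detected.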