<a href="https://www.kaggle.com/code/aarushi211/gmail-spam-detection-with-nlp?scriptVersionId=119671994" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Natural Language Processing 
Natural Language Processing (NLP) is the study of making computers understand how humans naturally speak, write and communicate.

I will use NLTK (Natural Language Toolkit) for natural language processing of English text. NLTK is a collection of Python libraries and programs for working with natural-language text, including tokenizing, stemming, lemmatizing, and tagging parts of speech.

# Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from sklearn.metrics import accuracy_score

# Loading Dataset

In [2]:
df = pd.read_csv('../input/gmail-spam-detection-dataset/spam1.csv', encoding = 'windows-1252')

In [3]:
df.head()

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
# Add a numeric label column named 'spam':
# 1 if the message is spam, 0 if it is ham.
df['spam'] = df['type'].map({'spam': 1, 'ham': 0}).astype(int)

In [5]:
df.head()

Unnamed: 0,type,text,spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [6]:
df.shape

(5572, 3)

In [7]:
df['spam'].value_counts()

0    4825
1     747
Name: spam, dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    5572 non-null   object
 1   text    5572 non-null   object
 2   spam    5572 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 130.7+ KB


In [9]:
df.isnull().sum()

type    0
text    0
spam    0
dtype: int64

The dataset contains 5572 rows: 4825 ham messages and 747 spam messages.
It also has no null values in any column.
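
Because the two classes are imbalanced, it helps to know what accuracy a trivial classifier would reach before judging any model. The sketch below uses only the counts reported above; the variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the value_counts() output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# A model that always predicts the majority class ("ham") is correct
# whenever the true label is ham.
baseline_accuracy = ham_count / total
spam_share = spam_count / total

print(f"Spam share: {spam_share:.2%}")                             # 13.41%
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")  # 86.59%
```

Any model trained later should be compared against this baseline rather than against 50%.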

# Tokenization 
Tokenization is the process of splitting text into smaller units called tokens. Here, each message is split on whitespace into a list of words.

In [10]:
df['text'][1]

'Ok lar... Joking wif u oni...'

In [11]:
def tokenizer(text):
    return text.split()

In [12]:
df['text'] = df['text'].apply(tokenizer)

In [13]:
df['text'][1]

['Ok', 'lar...', 'Joking', 'wif', 'u', 'oni...']
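
Splitting on whitespace leaves punctuation attached to tokens such as `'lar...'`. As a possible alternative, the sketch below uses a regular expression to keep only runs of letters, digits, and apostrophes. The function name `regex_tokenizer` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def regex_tokenizer(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = regex_tokenizer("Ok lar... Joking wif u oni...")
print(tokens)
# ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```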

# Stemming
Stemming is the process of stripping suffixes to reduce a word to its stem. For example, waits, waiting, and waited all reduce to the stem wait.

NLTK provides several stemmers, such as Snowball, Porter, and Lancaster. I will be using Snowball.

In [14]:
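# Snowball stemmer for English; with ignore_stopwords=False, stop words are stemmed like any other word
stemmer = SnowballStemmer("english", ignore_stopwords=False)

In [15]:
def stem_it(text):
    return [stemmer.stem(word) for word in text]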

In [16]:
df['text'] = df['text'].apply(stem_it)

In [17]:
df['text'][1]

['ok', 'lar...', 'joke', 'wif', 'u', 'oni...']
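
To make the example from the text concrete, here is a minimal sketch that stems the three inflected forms of wait. It also stems copy, whose stem is not a real word. The variable names are illustrative.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Inflected forms of the same verb collapse to one stem.
words = ["waits", "waiting", "waited"]
stems = [stemmer.stem(word) for word in words]
print(stems)  # ['wait', 'wait', 'wait']

# A stem is not guaranteed to be a dictionary word.
print(stemmer.stem("copy"))  # copi
```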

# Lemmatization
Lemmatization is the process of finding the lemma of a word, that is, its base or dictionary form, based on its meaning and part of speech. It aims to remove inflectional endings. For example, is, am, was, and are all map to the lemma be.

The difference between stemming and lemmatization is that stemming can often create non-existent words, whereas lemmas are actual words.

In [18]:
df['text'][153]

['as',
 'per',
 'your',
 'request',
 'mell',
 'mell',
 '(oru',
 'minnaminungint',
 'nurungu',
 'vettam)',
 'has',
 'been',
 'set',
 'as',
 'your',
 'callertun',
 'for',
 'all',
 'callers.',
 'press',
 '*9',
 'to',
 'copi',
 'your',
 'friend',
 'callertun']

In [19]:
lemmatizer = WordNetLemmatizer()

In [20]:
def lemmit_it(text):
    # pos='a' treats every token as an adjective, so most nouns and verbs pass through unchanged
    return [lemmatizer.lemmatize(word, pos='a') for word in text]

In [21]:
df['text'] = df['text'].apply(lemmit_it)

In [22]:
df['text'][153]

['as',
 'per',
 'your',
 'request',
 'mell',
 'mell',
 '(oru',
 'minnaminungint',
 'nurungu',
 'vettam)',
 'has',
 'been',
 'set',
 'as',
 'your',
 'callertun',
 'for',
 'all',
 'callers.',
 'press',
 '*9',
 'to',
 'copi',
 'your',
 'friend',
 'callertun']

# Stop Word Removal
Stop word removal drops very common words such as is, an, and the. These words appear in almost every message and carry little information for telling spam from ham.

In [23]:
stop_words = stopwords.words('english')

In [24]:
def stop_it(text):
    return [word for word in text if word not in stop_words]

In [25]:
df['text'] = df['text'].apply(stop_it)

In [26]:
df.head()

Unnamed: 0,type,text,spam
0,ham,"[go, jurong, point,, crazy.., avail, onli, bug...",0
1,ham,"[ok, lar..., joke, wif, u, oni...]",0
2,spam,"[free, entri, 2, wkli, comp, win, fa, cup, fin...",1
3,ham,"[u, dun, say, earli, hor..., u, c, alreadi, sa...",0
4,ham,"[nah, think, goe, usf,, live, around, though]",0
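
The first row above still contains `onli`, the stem of "only". A likely explanation is that stop words are removed after stemming, so a stemmed token no longer matches its entry in the stop word list. The sketch below illustrates the effect with a tiny hand-written stop word set; `STOP_WORDS` and `remove_stop_words` are hypothetical names, and the real NLTK list is much longer.

```python
# Small illustrative stop-word set, not NLTK's full English list.
STOP_WORDS = {"only", "in", "is", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

raw_tokens = ["available", "only", "in", "town"]
stemmed_tokens = ["avail", "onli", "in", "town"]  # "only" has been stemmed to "onli"

before_stemming = remove_stop_words(raw_tokens)
after_stemming = remove_stop_words(stemmed_tokens)

print(before_stemming)  # ['available', 'town']
print(after_stemming)   # ['avail', 'onli', 'town']
```

Filtering stop words before stemming avoids this mismatch.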


In [27]:
df['text'] = df['text'].apply(' '.join)

In [28]:
df.head()

Unnamed: 0,type,text,spam
0,ham,"go jurong point, crazy.. avail onli bugi n gre...",0
1,ham,ok lar... joke wif u oni...,0
2,spam,free entri 2 wkli comp win fa cup final tkts 2...,1
3,ham,u dun say earli hor... u c alreadi say...,0
4,ham,"nah think goe usf, live around though",0


# Vectorization
Vectorization converts text into a numeric format. Machine learning models operate on numbers rather than raw text, so each message must be turned into a numeric vector.

I will be using TfidfVectorizer for this. TF-IDF stands for Term Frequency-Inverse Document Frequency: a term gets a higher weight when it is frequent in a message but rare across the dataset.
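
To show what the weighting does, here is a small hand-rolled sketch on a toy corpus. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is what scikit-learn documents as the TfidfVectorizer default. The corpus and helper names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus; each string stands for one preprocessed message.
docs = [
    "free entry win prize",
    "ok see you later",
    "win free ringtone",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    term_freq = Counter(tokens)
    weights = {term: count * idf(term) for term, count in term_freq.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
    return {term: weight / norm for term, weight in weights.items()}

vector = tfidf_vector(tokenized[0])
for term, weight in sorted(vector.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "prize" and "entry" occur in only one message, so they receive larger weights than "free" and "win", which occur in two.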

In [29]:
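tfidf = TfidfVectorizer()
# Labels: 1 = spam, 0 = ham
y = df.spam.values
# Learn the vocabulary and IDF weights, then build the sparse TF-IDF matrix
x = tfidf.fit_transform(df['text'])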

In [30]:
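# shuffle=False keeps the original row order: the first 80% of rows are used for training
# and the last 20% for testing (random_state has no effect without shuffling).
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0, test_size = 0.2, shuffle = False)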
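
In this notebook the vectorizer is fitted on all rows before the split, so the vocabulary and IDF weights also reflect the test messages. A common alternative is to split the raw text first and let a pipeline fit TF-IDF on the training part only. The sketch below shows that pattern on toy data; `texts`, `labels`, and `model` are hypothetical names, and `make_pipeline` and `stratify` are not used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for df['text'] and df['spam'].
texts = [
    "free entry win prize now", "win free ringtone today",
    "claim your free prize", "free cash prize waiting",
    "ok see you later", "are you coming home",
    "call me when you can", "see you at lunch",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the spam/ham ratio in both parts.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print(predictions)
print(model.score(test_texts, test_labels))
```

A fitted pipeline also accepts raw strings at prediction time, so the vectorizer and classifier cannot get out of sync.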

In [31]:
df.head()

Unnamed: 0,type,text,spam
0,ham,"go jurong point, crazy.. avail onli bugi n gre...",0
1,ham,ok lar... joke wif u oni...,0
2,spam,free entri 2 wkli comp win fa cup final tkts 2...,1
3,ham,u dun say earli hor... u c alreadi say...,0
4,ham,"nah think goe usf, live around though",0


# Logistic Regression

In [32]:
lr = LogisticRegression()
lr.fit(x_train, y_train)
y_pred  = lr.predict(x_test)

In [33]:
acc_log = accuracy_score(y_test, y_pred)*100
print("Accuracy", acc_log)

Accuracy 96.05381165919282


# LinearSVC

In [34]:
svc = LinearSVC(random_state=0)
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)

In [35]:
acc_svc = accuracy_score(y_test, y_pred)*100
print("Accuracy", acc_svc)

Accuracy 97.66816143497758
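
Since only about 13% of the messages are spam, accuracy alone can hide how much spam is missed. The sketch below shows how precision, recall, and a confusion matrix could be computed alongside accuracy. It uses toy labels, not this notebook's predictions, and the variable names are illustrative.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy labels: 1 = spam, 0 = ham. One of the two spam messages is missed.
y_true = [0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_hat)
precision = precision_score(y_true, y_hat)  # share of predicted spam that is spam
recall = recall_score(y_true, y_hat)        # share of actual spam that was caught
matrix = confusion_matrix(y_true, y_hat)    # rows: true class, columns: predicted class

print(f"Accuracy:  {accuracy:.2f}")   # 0.83
print(f"Precision: {precision:.2f}")  # 1.00
print(f"Recall:    {recall:.2f}")     # 0.50
print(matrix)
```

Here accuracy looks high even though half of the spam slips through, which is the situation recall is meant to expose.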


# Predictive Model
Since the accuracy of LinearSVC (97.67%) is slightly better than that of Logistic Regression (96.05%), I will use LinearSVC for the predictive model.

In [36]:
#input_mail = input("Enter the mail text: ")
input_mail = 'Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify. Get Usher and Britney. FML'
input_mail = [input_mail]
transformed_data = tfidf.transform(input_mail)

prediction = svc.predict(transformed_data)

if prediction[0] == 1:
    print("\nSpam mail")
else:
    print("\nHam mail")


Spam mail
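
The training text was tokenized, stemmed, and filtered for stop words before vectorization, while the message above is passed to the vectorizer as raw text. The sketch below shows one way to apply the same steps to a new message first. The helper `preprocess` and the small stop word set are hypothetical; the stop words are passed in explicitly and the lemmatization step is left out so that the example runs without downloading NLTK corpora.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def preprocess(text, stop_words):
    """Split on whitespace, stem, drop stop words, and rejoin, in the notebook's order."""
    tokens = text.split()
    stems = [stemmer.stem(token) for token in tokens]
    kept = [stem for stem in stems if stem not in stop_words]
    return " ".join(kept)

# Tiny illustrative stop-word set; the notebook uses stopwords.words('english').
example_stop_words = {"your", "is", "to", "be"}

cleaned = preprocess("Your free ringtone is waiting to be collected.", example_stop_words)
print(cleaned)

# With the notebook's fitted objects, the cleaned text would then be classified as:
# prediction = svc.predict(tfidf.transform([cleaned]))
```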
