# Natural Language Processing 
Natural Language Processing (NLP) is the study of making computers understand how humans naturally speak, write and communicate.

I will be using NLTK (Natural Language Toolkit) for natural language processing of English text. NLTK is a collection of Python libraries designed for processing natural-language text such as English, for example identifying and tagging parts of speech.

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd

from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from sklearn.metrics import accuracy_score

# Loading Dataset

In [None]:
df = pd.read_csv('../input/gmail-spam-detection-dataset/spam1.csv', encoding = 'windows-1252')

In [None]:
df.head()

In [None]:
# Add a numeric column named spam.
# It is 1 if the mail is spam and 0 otherwise.
df['spam'] = df['type'].map({'spam': 1, 'ham': 0}).astype(int)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df['spam'].value_counts()

In [None]:
df.info()

In [None]:
df.isnull().sum()

We can see that our dataset contains 5572 rows, of which 4825 are ham mails and 747 are spam mails.
Moreover, our dataset does not contain any null values.

# Tokenization 
Tokenization is the process of splitting text into smaller units called tokens. Here each mail is split on whitespace, which turns it into a list of words.

In [None]:
df['text'][1]

In [None]:
def tokenizer(text):
    return text.split()

In [None]:
df['text'] = df['text'].apply(tokenizer)

In [None]:
df['text'][1]
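
To make the effect of whitespace splitting concrete, here is a minimal sketch on a made-up sentence (not taken from the dataset). `whitespace_tokenize` mirrors the `tokenizer` function above; `simple_word_tokenize` is a hypothetical alternative, shown only to illustrate that `split()` leaves punctuation and capitalization attached to the tokens.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of whitespace, like the notebook's tokenizer."""
    return text.split()

def simple_word_tokenize(text):
    """Hypothetical alternative: lowercase, then keep letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sample = "Free entry!!! Call now, it's FREE."  # made-up example text

ws_tokens = whitespace_tokenize(sample)
re_tokens = simple_word_tokenize(sample)

print(ws_tokens)  # punctuation stays attached: 'entry!!!', 'now,', 'FREE.'
print(re_tokens)  # punctuation stripped, everything lowercased
```

With plain `split()`, `'FREE.'` and `'Free'` are different tokens, which may or may not matter depending on the later steps.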

# Stemming
Stemming is the process of removing suffixes from a word to reduce it to its stem. For example, waits, waiting and waited are all reduced to the stem wait.

NLTK provides several stemmers, such as Snowball, Porter and Lancaster. I will be using Snowball.

In [None]:
# Named stemmer rather than porter, since this is the Snowball stemmer.
stemmer = SnowballStemmer("english", ignore_stopwords=False)

In [None]:
def stem_it(text):
    return [stemmer.stem(word) for word in text]

In [None]:
df['text'] = df['text'].apply(stem_it)

In [None]:
df['text'][1]
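
As a quick check of the example in the text, the sketch below runs the Snowball stemmer on a few words. The word list is my own; "studies" is included to show that a stem does not have to be a real word.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Illustrative words chosen for this sketch, not taken from the dataset.
words = ["waits", "waiting", "waited", "studies"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

The first three words collapse to wait, while studies becomes studi, which is not a dictionary word.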

# Lemmatization
Lemmatization is the process of finding the lemma of a word, that is, its base or dictionary form, based on its meaning and part of speech. It aims to remove inflectional endings. For example, is, am, was and are all map to the lemma be.

The difference between stemming and lemmatization is that stemming can often create non-existent words, whereas lemmas are actual words.

In [None]:
df['text'][153]

In [None]:
lemmitizer = WordNetLemmatizer()

In [None]:
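def lemmit_it(text):
    # pos = 'a' tells WordNet to treat every token as an adjective,
    # so verb forms such as "was" are left unchanged here.
    return [lemmitizer.lemmatize(word, pos = 'a') for word in text]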

In [None]:
df['text'] = df['text'].apply(lemmit_it)

In [None]:
df['text'][153]

# Stopword Removal
Stopword removal drops very common words such as is, an, the, etc. These words appear in almost every mail, so they carry little information for telling spam from ham.

In [None]:
stop_words = stopwords.words('english')

In [None]:
def stop_it(text):
    # Keep only the tokens that are not in the stopword list.
    review = [word for word in text if word not in stop_words]
    return review

In [None]:
df['text'] = df['text'].apply(stop_it)

In [None]:
df.head()

In [None]:
df['text'] = df['text'].apply(' '.join)

In [None]:
df.head()
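
The filter-then-join step can be sketched without any downloads. `STOP_WORDS` below is a tiny hand-written list assumed for illustration only; NLTK's English list is much longer. Storing it as a set makes each membership test fast.

```python
# Hypothetical miniature stop list, for illustration only.
STOP_WORDS = {"is", "an", "the", "a", "to", "of"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["The", "prize", "is", "waiting", "to", "be", "collected"]
kept = remove_stopwords(tokens)

# Join the remaining tokens back into one string, as done before vectorization.
joined = " ".join(kept)
print(kept)
print(joined)
```

Note that this sketch lowercases each token before the lookup, so "The" is removed as well.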

# Vectorization
Vectorization is the process of converting textual data into a numeric format. Machine learning models operate on numbers, so the text has to be converted before training.

I will be using TfidfVectorizer for this, which weights each word by its Term Frequency-Inverse Document Frequency (TF-IDF).
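
To show what TfidfVectorizer produces, here is a small sketch on three made-up documents. With its default settings, scikit-learn computes the inverse document frequency as ln((1 + n) / (1 + df)) + 1 and scales every row to unit length; the corpus and variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: one string per document.
docs = ["free prize call now", "call me later", "free free prize"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocab = sorted(vectorizer.vocabulary_)   # learned terms in column order
row_norms = np.linalg.norm(matrix.toarray(), axis=1)

# "later" occurs in 1 of the 3 documents, so with the default smoothing
# its idf is ln((1 + 3) / (1 + 1)) + 1.
idf_later = vectorizer.idf_[vectorizer.vocabulary_["later"]]

print(vocab)
print(matrix.shape)
print(row_norms)
print(idf_later)
```

Words that occur in fewer documents get a larger idf, so rare words weigh more than words found in every mail.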

In [None]:
tfidf = TfidfVectorizer()
y = df.spam.values
x = tfidf.fit_transform(df['text'])

In [None]:
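# With shuffle = False the last 20% of the rows become the test set,
# and random_state has no effect.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0, test_size = 0.2, shuffle = False)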

In [None]:
df.head()
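
The notebook fits the vectorizer on all rows and then splits without shuffling. A commonly used alternative, sketched here on made-up data, is to split first with `stratify` so both parts keep the same spam ratio, and to fit the vectorizer on the training part only so no test-set vocabulary leaks into training. This is an assumption about good practice, not the method used above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up toy data: 6 ham (0) and 2 spam (1) messages.
texts = [
    "see you at lunch", "are you coming home", "call me later",
    "meeting moved to noon", "happy birthday friend", "thanks for the help",
    "free prize call now", "win free cash today",
]
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Stratified split: each half receives the same share of spam.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0
)

# Fit on the training texts only, then reuse that vocabulary for the test texts.
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(text_train)
x_test = vectorizer.transform(text_test)

print(x_train.shape, x_test.shape)
print(y_train.sum(), y_test.sum())
```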

# Logistic Regression

In [None]:
lr = LogisticRegression()
lr.fit(x_train, y_train)
y_pred  = lr.predict(x_test)

In [None]:
# accuracy_score expects the true labels first, then the predictions.
acc_log = accuracy_score(y_test, y_pred)*100
print("Accuracy", acc_log)

# LinearSVC

In [None]:
svc = LinearSVC(random_state=0)
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)

In [None]:
# accuracy_score expects the true labels first, then the predictions.
acc_svc = accuracy_score(y_test, y_pred)*100
print("Accuracy", acc_svc)
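
Accuracy alone can be misleading on this dataset, because 4825 of the 5572 mails are ham: a model that always predicts ham would already score about 86.6% on the full dataset. The sketch below uses made-up labels (not the notebook's results) to show how precision, recall and a confusion matrix separate two predictors with the same accuracy.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels: 8 ham (0) and 2 spam (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_always_ham = [0] * 10                       # baseline that never flags spam
y_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]      # hypothetical classifier output

baseline_acc = accuracy_score(y_true, y_always_ham)
baseline_recall = recall_score(y_true, y_always_ham)

model_acc = accuracy_score(y_true, y_model)
model_precision = precision_score(y_true, y_model)
model_recall = recall_score(y_true, y_model)
cm = confusion_matrix(y_true, y_model)  # rows: true class, columns: predicted class

print(baseline_acc, baseline_recall)
print(model_acc, model_precision, model_recall)
print(cm)
```

Both predictors reach the same accuracy here, but only the second one catches any spam at all.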

# Predictive Model
Since the accuracy of LinearSVC is slightly better than that of Logistic Regression, I will be using LinearSVC to make the predictive model.

In [None]:
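#input_mail = input("Enter the mail text: ")
input_mail = 'Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify. Get Usher and Britney. FML'
# Apply the same preprocessing that was applied to the training text.
tokens = stop_it(lemmit_it(stem_it(tokenizer(input_mail))))
input_mail = [' '.join(tokens)]
transformed_data = tfidf.transform(input_mail)

prediction = svc.predict(transformed_data)

# predict returns one label per input, so check the first element.
if prediction[0] == 1:
    print("\nSpam mail")
else:
    print("\nHam mail")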
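
Because the vectorizer and the classifier always have to be applied together, one option is to bundle them in a scikit-learn `Pipeline`, so that raw text can be passed straight to `predict`. The sketch below is a simplified variant trained on made-up messages; it skips the stemming, lemmatization and stopword steps and is not the model trained above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data: 1 = spam, 0 = ham.
train_texts = [
    "free prize call now", "win free cash today",
    "see you at lunch", "are you coming home",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then feeds it to the classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(train_texts, train_labels)

new_mail = ["your free prize is waiting, call now"]
pred = model.predict(new_mail)

print("Spam mail" if pred[0] == 1 else "Ham mail")
```

Keeping both steps in one object means a new mail cannot accidentally be vectorized with a different vocabulary than the one used for training.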