# Contributors


### 1. Protyush Kumar Das (pdas2s)
 
### 2. Somesh Devagekar (sdevag2s)

### 3. Gautam Kumar Jain (gjain2s)


# Build a spam classifier using Naive Bayes

## Project Description: 
- There are three datasets for training: TrainDataset1.csv, TrainDataset2.csv and TrainDataset3.txt. Each dataset contains short messages labelled as ham or spam.
- Analyse, clean and visualise these datasets.
- Combine them into one large dataset for training.
- Use this dataset to build a Naive Bayes classifier. (You can either use the existing Naive Bayes implementation from sklearn or build your own.)
- Verify your classifier using new messages (create your own messages or use the messages from the TestDataset.csv dataset).

## Project Duration: 2 weeks
## Project Deliverables:
1. By the end of the first week, complete the data pre-processing:
    - Load the datasets using pandas.
    - Analyse them. For this you will need to process the text, namely remove punctuation and stopwords, and then create a list of clean text words. (Research how to do this.)
    - Visualise the results.
    - Prepare the pre-processed data for use by the Naive Bayes classifier.
2. By the end of the second week:
    - Train the classifier.
    - Validate it, build a confusion matrix and analyse the results.
    - Apply it to new test messages.
    - Try to cheat the classifier by adding "good words" to the end of a test message.

The following link can be used as guidance for the implementation:
https://towardsdatascience.com/spam-filtering-using-naive-bayes-98a341224038
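
The pre-processing step in the deliverables (remove punctuation and stopwords, then build a list of clean words) can be sketched with the standard library alone. This is only an illustration: the `STOPWORDS` set and the `clean_words` helper are hypothetical names that are not part of the notebook, and the stopword list is deliberately tiny.

```python
import string

# Hypothetical, deliberately small stopword list (an assumption for this sketch).
# A real project would likely use a fuller list, e.g. one shipped with an NLP library.
STOPWORDS = {"a", "an", "the", "to", "in", "of", "and", "is", "i", "you", "u"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_words(text):
    """Return the lowercase words of `text` without punctuation or stopwords."""
    no_punct = text.translate(PUNCT_TABLE)
    return [word for word in no_punct.lower().split() if word not in STOPWORDS]


words = clean_words("Free entry in 2 a wkly comp to win FA Cup!")
print(words)
```

Lowercasing before the stopword lookup keeps "The" and "the" from being treated differently. Whether digits and very short tokens should also be dropped is a design choice this sketch leaves open.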

**Loading the data and concatenating it into one dataset**

In [19]:
import pandas as pd
import numpy as np
import string
import matplotlib.pyplot as plt


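# Load the three training files and give them the same column names
D_1 = pd.read_csv('TrainDataset1.csv')
D_2 = pd.read_csv('TrainDataset2.csv')
D_2.rename(columns={'v1': 'type', 'v2': 'text'}, inplace=True)
# TrainDataset3.txt is tab-separated and has no header row
D_3 = pd.read_csv('TrainDataset3.txt', sep=r"\t+", header=None, engine="python")
D_3 = D_3.rename(columns={0: "type", 1: "text"})

# Stack the three datasets into one training set
TrainDataset = pd.concat([D_3, D_2, D_1])
TrainDataset.head()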

|   | type | text |
|---|------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
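
The concatenation step can be tried out without the real files. The sketch below uses three small in-memory frames as stand-ins for the datasets. Their contents are invented, and it is an assumption that the first file already uses the `type` and `text` column names. It also drops repeated messages, which may or may not be wanted for the real data.

```python
import pandas as pd

# Invented stand-ins for the three training files (not the real data).
d1 = pd.DataFrame({"type": ["ham"], "text": ["Ok lar... Joking wif u oni..."]})
d2 = pd.DataFrame({"v1": ["spam"], "v2": ["WINNER!! Claim your prize now"]})
d3 = pd.DataFrame({0: ["ham", "ham"],
                   1: ["Ok lar... Joking wif u oni...", "See you at 5"]})

# Give every frame the same column names before stacking them.
d2 = d2.rename(columns={"v1": "type", "v2": "text"})
d3 = d3.rename(columns={0: "type", 1: "text"})

# ignore_index=True rebuilds a clean 0..n-1 index for the combined frame.
combined = pd.concat([d1, d2, d3], ignore_index=True)[["type", "text"]]

# Optional: remove messages that occur in more than one source file.
combined = combined.drop_duplicates(subset="text").reset_index(drop=True)

print(combined["type"].value_counts())
```

Selecting `[["type", "text"]]` after the concatenation discards any extra columns a source file might carry, so stray columns do not end up in the training set.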


**Cleaning and labelling the combined data**

In [20]:
TrainDataset = TrainDataset.sample(frac=1).reset_index(drop=True)  # shuffle the rows and rebuild the index

# Encode the labels numerically: ham -> 0, anything else (spam) -> 1
type_of_text = list(TrainDataset["type"])
labels = list()
for item in type_of_text:
    if item == "ham":
        labels.append(0)
    else:
        labels.append(1)

TrainDataset["label"] = labels

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text


TrainDataset['text'] = TrainDataset['text'].apply(remove_punctuations)
TrainDataset


|   | type | text | label |
|---|------|------|-------|
| 0 | ham | Kind of Just missed train cos of asthma attack... | 0 |
| 1 | spam | Get 3 Lions England tone reply lionm 4 mono or... | 1 |
| 2 | ham | Im working technical support voice process | 0 |
| 3 | ham | Up to u u wan come then come lor But i din c a... | 0 |
| 4 | spam | WIN We have a winner Mr T Foley won an iPod Mo... | 1 |
| 5 | ham | Okie Ì wan meet at bishan Cos me at bishan now... | 0 |
| 6 | ham | Ü got wat to buy tell us then ü no need to com... | 0 |
| 7 | ham | Stop the story Ive told him ive returned it an... | 0 |
| 8 | ham | wiskey Brandy Rum Gin Beer Vodka Scotch Shampa... | 0 |
| 9 | ham | Im in office now I will call you ltgt min | 0 |
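
The labelling loop and the per-character punctuation loop can also be written with vectorised pandas operations. The following is a sketch on a small invented frame, not a change to the notebook's method. One difference is worth noting: the loop labels every value other than "ham" as spam, whereas `Series.map` leaves values missing from the dictionary as `NaN`, which makes unexpected labels visible.

```python
import string

import pandas as pd

# Invented example rows (not taken from the training files).
df = pd.DataFrame({
    "type": ["ham", "spam", "ham"],
    "text": ["Ok lar...", "WIN a prize!!", "See you at 5?"],
})

# Map the string labels to integers in one step.
df["label"] = df["type"].map({"ham": 0, "spam": 1})

# Delete all ASCII punctuation with a single translation table.
punct_table = str.maketrans("", "", string.punctuation)
df["text"] = df["text"].str.translate(punct_table)

print(df)
```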


## Training the classifier

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

# separate the message texts (features) from the numeric labels (targets)
X = TrainDataset.text
y = TrainDataset.label

X_train , X_test, y_train , y_test = train_test_split(X,y)

#vectorizing the words in the text 
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(X_train.values)

classifier = MultinomialNB()
targets = y_train.values 

classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
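
The project brief also allows building the classifier by hand. As a rough illustration of what `MultinomialNB` with `alpha=1.0` computes, here is a minimal pure-Python multinomial Naive Bayes with Laplace smoothing. The function names `train_nb` and `predict_nb` and the toy corpus are assumptions made for this sketch. It tokenises on whitespace, which is simpler than what `CountVectorizer` does.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on whitespace-tokenised documents."""
    class_docs = Counter(labels)            # number of documents per class
    word_counts = defaultdict(Counter)      # word frequencies per class
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}

    model = {}
    for c in class_docs:
        # Smoothed denominator: class token count plus alpha per vocabulary word.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(class_docs[c] / len(labels)),
            "log_lik": {w: math.log((word_counts[c][w] + alpha) / denom)
                        for w in vocab},
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior for `doc`."""
    scores = {}
    for c, params in model.items():
        score = params["log_prior"]
        for w in doc.lower().split():
            if w in params["log_lik"]:      # out-of-vocabulary words are ignored
                score += params["log_lik"][w]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy corpus: 1 = spam, 0 = ham.
docs = ["win free prize now", "free entry win cash",
        "see you at lunch", "call me after lunch"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, "free cash prize"))
print(predict_nb(model, "see you after lunch"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied. The smoothing term keeps a word that was never seen in one class from forcing that class's probability to zero.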

In [23]:
test_count = vectorizer.transform(X_test)
predictions = classifier.predict(test_count)
predictions

array([0, 0, 0, ..., 0, 0, 0])

## Confusion Matrix

In [15]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test,predictions)

array([[4791,   21],
       [  17,  730]])

In [24]:
from sklearn.metrics import f1_score

f1_score(y_test,predictions, average = 'binary')

0.9586935638808837
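
The F1 score can be recomputed by hand from confusion-matrix counts. The sketch below uses the matrix printed in the Confusion Matrix section, where rows are true classes and columns are predicted classes. It yields an F1 of roughly 0.975 rather than the 0.9587 printed above. The differing cell numbers (`In [15]` and `In [24]`) suggest the two outputs come from different runs, and since neither the shuffle nor the split is seeded, the figures change from run to run. That explanation is an inference from the notebook, not something the authors state.

```python
# Counts copied from the confusion matrix printed above:
# rows = true class (ham, spam), columns = predicted class (ham, spam).
tn, fp = 4791, 21
fn, tp = 17, 730

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

Because ham messages far outnumber spam messages here, accuracy alone looks flattering. Precision and recall for the spam class say more about how the filter behaves.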

## Validation on Test dataset

In [64]:
V_test =pd.read_csv('TestDataset.csv')
# V_test.head()

# clean the validation dataset with remove_punctuations, defined in the cleaning cell above


V_test['v2'] = V_test['v2'].apply(remove_punctuations)

V_examples = V_test.v2

#vectorizing the text in the test dataset
V_count = vectorizer.transform(V_examples)
V_predictions = classifier.predict(V_count)

V_predictions[0:20]

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

## Cheating the classifier by appending good words to a message


In [65]:
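# Append the good words; the leading space keeps them from being glued to the last word
cheat_example = V_examples[3] + ' I appreciate your care'

print(cheat_example)

# transform() expects an iterable of documents, so the message is wrapped in a list.
# The output recorded below comes from the original run, which omitted the space and
# passed the bare string, giving one prediction per character.
cheat_count = vectorizer.transform([cheat_example])
cheat_prediction = classifier.predict(cheat_count)

print('\n')
print(cheat_prediction)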


Congrats 2 mobile 3G Videophones R yours call 09061744553 now videochat wid ur mates play java games Dload polyH music noline rentl bx420 ip4 5we 150pmI appreciate your care


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


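We took the **fourth example** from the test dataset (index 3), which the classifier had predicted as spam, and appended some good words to the end of the text. The array printed above, however, has 173 entries, one per character of the message, because the string was passed to `vectorizer.transform` directly instead of inside a list. Single characters are not tokens for `CountVectorizer`, so every entry falls back to the majority class (ham). The all-zero output therefore does not by itself show that the modified message is now classified as ham. Wrapping the message in a list, `vectorizer.transform([cheat_example])`, yields the single prediction needed to judge whether the cheat worked.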
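
The good-word experiment can be reproduced on a toy corpus. In the sketch below, every message is passed to the vectorizer as a one-element list, so `predict` returns exactly one label per message. The training texts and the appended words are invented for illustration, and the outcome on the real dataset may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: 1 = spam, 0 = ham.
train_texts = ["win free prize now", "free entry win cash",
               "see you at lunch", "call me after lunch"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

spam_msg = "win free prize now"
good_words = " see you at lunch call me after lunch"  # words seen only in ham

results = []
for msg in (spam_msg, spam_msg + good_words):
    counts = vectorizer.transform([msg])              # one document -> one row
    label = classifier.predict(counts)                # array with a single label
    p_spam = classifier.predict_proba(counts)[0, 1]   # probability of class 1 (spam)
    results.append((int(label[0]), label.shape, float(p_spam)))
    print(f"{msg!r}: label={int(label[0])}, P(spam)={p_spam:.3f}")
```

Printing the spam probability alongside the label shows how each appended ham-typical word moves the score, even in cases where the predicted label does not flip.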