# E-mail Spam Classification

*Team PythonPcz (pronounced “Python Packs”): Zhipeng Mei, Chanon Chantaduly, Patrapee Pongtana*

### Project description:

The dataset included for this project is based on a subset of the [SpamAssassin Public Corpus](http://spamassassin.apache.org/old/). The upper image of Figure 1 shows a sample email that contains a URL, an email address (at the end), numbers, and dollar amounts. While many emails contain similar types of entities (e.g., numbers, other URLs, or other email addresses), the specific entities (e.g., the specific URL or specific dollar amount) will be different in almost every email. Therefore, one method often employed in processing emails is to “normalize” these values, so that all URLs are treated the same, all numbers are treated the same, etc. For example, we could replace each URL in the email with the unique string “httpaddr” to indicate that a URL was present. This lets the spam classifier make a classification decision based on whether any URL was present, rather than whether a specific URL was present. This typically improves the performance of a spam classifier, since spammers often randomize the URLs, and thus the odds of seeing any particular URL again in a new piece of spam are very small.
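As a minimal sketch of the URL normalization described above, the snippet below replaces every URL with the token “httpaddr”. The regular expression and the function name `normalize_urls` are assumptions made for illustration; the project's actual preprocessing code is not shown in this document.

```python
import re

# Assumed pattern: "http://" or "https://" followed by any run of non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def normalize_urls(text):
    """Replace every URL in `text` with the placeholder token 'httpaddr'."""
    return URL_PATTERN.sub("httpaddr", text)

# Two spam messages that differ only in a randomized URL (made-up examples).
spam_a = "Buy now at http://cheap-deals.example/x7f3"
spam_b = "Buy now at http://cheap-deals.example/q9z1"

print(normalize_urls(spam_a))
print(normalize_urls(spam_a) == normalize_urls(spam_b))
```

After normalization both messages become identical, so a classifier sees the same feature ("a URL was present") regardless of which URL the spammer generated.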

We have already implemented the following email preprocessing steps: lowercasing; removal of HTML tags; normalization of URLs, email addresses, and numbers. In addition, words are reduced to their stemmed form. For example, “discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”. Finally, we removed all non-words and punctuation. The result of these preprocessing steps is shown in the lower image of Figure 1.
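The preprocessing steps listed above could be chained roughly as follows. This is only a sketch under stated assumptions: the placeholder tokens “emailaddr” and “number” are assumed names (only “httpaddr” appears in the description), and `crude_stem` is a hypothetical toy stand-in for a real stemmer such as the Porter stemmer.

```python
import re

def crude_stem(word):
    # Toy stand-in for a real stemmer: strips a few common English suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def preprocess(email):
    text = email.lower()                              # lowercasing
    text = re.sub(r"<[^<>]+>", " ", text)             # remove HTML tags
    text = re.sub(r"https?://\S+", "httpaddr", text)  # normalize URLs
    text = re.sub(r"\S+@\S+", "emailaddr", text)      # normalize email addresses
    text = re.sub(r"\d+", "number", text)             # normalize numbers
    text = re.sub(r"[^a-z\s]", " ", text)             # drop punctuation and non-words
    return " ".join(crude_stem(w) for w in text.split())

raw = "<b>Discounts</b> of $100! Visit http://example.com/deal or mail sales@example.com"
print(preprocess(raw))
```

The order matters: URLs and email addresses are replaced before punctuation is stripped, since removing characters such as `:`, `/`, and `@` first would make them unrecognizable.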

### Dataset:
We have provided you with two files: spam_train.txt and spam_test.txt. Each row of the data files corresponds to a single email. The data can be downloaded from [this link](https://www.dropbox.com/sh/q7051ab9pef7979/AACoUjSQjLLUWSqdVd54R2Kva?dl=0). The first column gives the label (1 = spam, 0 = not spam).


### Experiments:
* This project involves implementing classification algorithms. Before you can build these models and measure their performance, split your training data (i.e., spam_train.txt) into a training set and a validation set, putting the last 1000 emails into the validation set. Thus, you will have a new training set with 4000 emails and a validation set with 1000 emails. **Explain why measuring the performance of your final classifier would be problematic had you not created this validation set.**

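###### Ans: In a classification workflow, the labeled data is divided into a training set, used to fit the model, and a validation set, used to estimate and tune its performance; the test set is reserved for the final evaluation. Without a validation set, performance would have to be measured either on the emails the model was trained on, which gives an overly optimistic estimate, or on the test set, which would then no longer be unseen data.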
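A minimal sketch of the split described in the task, holding out the last 1000 rows for validation. The helper name `split_train_validation` and the placeholder email list are hypothetical and only illustrate the slicing.

```python
def split_train_validation(rows, n_validation=1000):
    """Return (training rows, validation rows), holding out the last rows."""
    if not 0 < n_validation < len(rows):
        raise ValueError("n_validation must be between 1 and len(rows) - 1")
    return rows[:-n_validation], rows[-n_validation:]

# Placeholder data standing in for the 5000 lines of the training file.
emails = [f"email {i}" for i in range(5000)]
train, validation = split_train_validation(emails)
print(len(train), len(validation))
```

With scikit-learn, `train_test_split(..., test_size=0.2, shuffle=False)` holds out the final 20% of the rows in the same way, whereas passing a `random_state` produces a shuffled split.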

* Transform all of the data into **feature vectors**. Build a vocabulary list using only the 4000 email training set by finding all words that occur across the training set. Note that we assume that the data in the validation and test sets is completely unseen when we train our model, and thus we do not use any information contained in them. Ignore all words that appear in fewer than X = 30 emails of the 4000 email training set. This is both a means of preventing overfitting and of improving scalability. For each email, transform it into a feature vector x where the ith entry, xi, is 1 if the ith word in the vocabulary occurs in the email, and 0 otherwise.

* Train a linear classifier such as Naive Bayes using your training set. **How many mistakes are made before the algorithm terminates? Next, classify the emails in your validation set. What is the validation error? Explain your results.**

###### Ans: The classifier misclassifies about 1.575% of the emails in the training set. In addition, the validation error is about 2.7% on the validation set.
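The feature-vector construction described in the steps above can be sketched in plain Python as follows. The function names are hypothetical, the emails are assumed to already have their leading label removed, and the tiny corpus uses a threshold of 2 instead of X = 30 so that the example stays readable.

```python
from collections import Counter

def build_vocabulary(train_emails, min_emails=30):
    """Keep words that occur in at least `min_emails` training emails."""
    doc_freq = Counter()
    for email in train_emails:
        # set() ensures each word is counted at most once per email
        doc_freq.update(set(email.split()))
    return sorted(word for word, n in doc_freq.items() if n >= min_emails)

def to_feature_vector(email, vocabulary):
    """Entry i is 1 if the i-th vocabulary word occurs in the email, else 0."""
    words = set(email.split())
    return [1 if word in words else 0 for word in vocabulary]

train_emails = ["free money now", "free offer now", "meeting at noon"]
vocabulary = build_vocabulary(train_emails, min_emails=2)
print(vocabulary)
print(to_feature_vector("free lunch at noon", vocabulary))
```

The vocabulary is built from the training emails only, so words that appear only in validation or test emails never become features. In scikit-learn, `CountVectorizer(binary=True, min_df=30)` builds comparable binary features, with a different default tokenization.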

* **Explore some other algorithms to solve the spam filtering problem, and demonstrate your thoughts.**

## Implementation:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

## Multinomial Naive Bayes

def email_spam_classifier(training_file, testing_file):
    
    # ----------------------    
    # --- training phase ---
    # ----------------------
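    # each line of the file is one email; sep='\n' is rejected by newer pandas releases,
    # so read the lines directly instead of going through read_csv
    with open(training_file) as f:
        df = pd.DataFrame({'message': f.read().splitlines()})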
    df['label'] = df['message'].str.split(' ').str[0]
    # print(df.head())

    # 0 is ham, 1 is spam
    # print(df['label'].value_counts())

    df_x = df["message"]  # full text of each email (the line still starts with its label)
    df_y = df["label"]    # spam/ham label as a string, '1' or '0'

    # random 80/20 split: 4000 emails for training, 1000 for validation
    # (pass shuffle=False instead of random_state to hold out the last 1000 emails)
    X_train, X_test, y_train, y_test = train_test_split(df_x, df_y, test_size=.2, random_state=42)
 
    # convert each email into a vector of word counts with CountVectorizer
    cv = CountVectorizer()
    X_traincv = cv.fit_transform(X_train)
    X_testcv = cv.transform(X_test)
    
    #produce an int array from y_train
    y_train = y_train.astype(int)
    
    #call the classifier from sklearn library
    mnb = MultinomialNB()
    #train the model
    mnb.fit(X_traincv, y_train)
    
    # training accuracy: fraction of training emails predicted correctly
    pred_train = mnb.predict(X_traincv)
    y_train = np.array(y_train).astype(int)
    print("The training accuracy is: ", np.mean(pred_train == y_train))
    

    # ------------------------
    # --- Validating phase ---
    # ------------------------
    #validating the model
    # validation accuracy: fraction of held-out emails predicted correctly
    pred = mnb.predict(X_testcv)
    y_test = np.array(y_test).astype(int)
    print("The validation accuracy is: ", np.mean(pred == y_test))


    # ---------------------
    # --- testing phase ---
    # ---------------------
    # read one email per line (sep='\n' is rejected by newer pandas releases)
    with open(testing_file) as f:
        test_raw_data = pd.DataFrame({'message_test': f.read().splitlines()})
    test_raw_data['label_test'] = test_raw_data['message_test'].str.split(' ').str[0]
    test_X = test_raw_data["message_test"] #all email content
    test_y = test_raw_data["label_test"]   #spam/ham label in 1/0 format
    test_X_cv = cv.transform(test_X)
    pred_test = mnb.predict(test_X_cv)
    test_y = np.array(test_y).astype(int)

    # test accuracy: fraction of test emails predicted correctly
    print("The test accuracy is: ", np.mean(pred_test == test_y))

###### Execution Code Below for MNB:

training_file = "spam_train.txt"
testing_file = "spam_test.txt"
email_spam_classifier(training_file, testing_file)

## Gaussian Naive Bayes

## Bernoulli Naive Bayes