## Assignment: Document Classification

### Alice Ding, Shoshana Farber, Christian Uriostegui

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  

For this project, we've chosen to use two files from [spamasssassin](https://spamassassin.apache.org/old/publiccorpus/) as our training data. This is a list of 2,551 ham (non-spam) emails and 1,398 spam ones to see whether the document is spam or not.

### Importing the Data

To start, we have the files downloaded in two different folders within this directory; we'll be using Python to read each file and put it into one full dataframe with messages and labels.

In [1]:
import sys
import os
import re
import nltk
import pandas as pd
from os.path import expanduser
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score


# get the path names of each directory
ham = sys.path[0] + '/easy_ham'
spam = sys.path[0] + '/spam'

# create a function that takes the directory and the label we want appended to those messages to put into a dataframe
def create_df(folder, label):
    file_data = []
    file_labels = []
    for file_name in os.listdir(folder):
        file_path = os.path.join(folder, file_name)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='latin-1') as file:
                content = file.read()
        file_data.append(content)
        file_labels.append(label)
    df = pd.DataFrame({'email': file_data, 'label': file_labels})
    return df

# create a dataframe for the ham messages
ham_df = create_df(ham, 'ham')
# create a dataframe for the spam messages
spam_df = create_df(spam, 'spam')
# combine these dataframes into one full one of emails
full_df = pd.concat([ham_df, spam_df])

full_df.head()

Unnamed: 0,email,label
0,From exmh-workers-admin@redhat.com Thu Aug 22...,ham
1,From Steve_Burt@cursor-system.com Thu Aug 22 ...,ham
2,From timc@2ubh.com Thu Aug 22 13:52:59 2002\n...,ham
3,From irregulars-admin@tb.tf Thu Aug 22 14:23:...,ham
4,From exmh-users-admin@redhat.com Thu Aug 22 1...,ham


Now that the data is imported, let's try to clean it up to only hold relevant information.

### Cleaning the Data

In [2]:
# get stop words from nltk
stop_words = set(stopwords.words('english'))

# create a function to clean a given text
def clean_text(text):
    text = re.sub(r"<.*?>", "", text)              # remove HTML tags
    text = re.sub(r"[0-9]", "", text)               # remove digits
    text = re.sub(r"[^\w\s]", "", text)             # remove non-word and non-space characters
    text = re.sub(r"\n", "", text)                  # remove newlines
    text = text.lower()                            # convert all text to lowercase
    text = " ".join([word for word in text.split() if word not in stop_words])  # remove stop words
    text = nltk.PorterStemmer().stem(text) # gets stem root for words
    return text

full_df['email'] = full_df['email'].apply(clean_text)

full_df.head()

Unnamed: 0,email,label
0,exmhworkersadminredhatcom thu aug returnpath d...,ham
1,steve_burtcursorsystemcom thu aug returnpath d...,ham
2,timcubhcom thu aug returnpath deliveredto zzzz...,ham
3,irregularsadmintbtf thu aug returnpath deliver...,ham
4,exmhusersadminredhatcom thu aug returnpath del...,ham


Things are definitely looking cleaner! Now we're going to transform to further transform the dataset to allow for smoother analysis. There is also the addition of the ham and spam classification, where 0 indicates that it is non-spam and 1 indicates the email is spam.

In [3]:
# Create document-term matrix
vectorizer = CountVectorizer()

# Fit and transform the preprocessed text data
X = vectorizer.fit_transform(full_df["email"])

# Extract the labels
terms = vectorizer.get_feature_names_out()

# Convert the document-term matrix to a dataframe
full_df_matrix = pd.DataFrame(X.toarray(), columns=terms)

# Add the spam ham classification
full_df_matrix["class"] = full_df["label"].map({"ham": 0, "spam": 1}).reset_index(drop=True)

Now that our data is in the proper format, we can now create our model. The model will be split between 70% training data and 30% testing data . We're also going to try the Random Forest Classifier with 300 trees.

### Predicting

In [4]:
# Split the data into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(full_df_matrix.drop("class", axis=1), full_df_matrix["class"], test_size=0.3, random_state=1234)

# Create a Random Forest classifier with 300 trees
classifier = RandomForestClassifier(n_estimators=300)

# Train the classifier
classifier.fit(X_train, y_train)

# Make predictions on the testing set
predicted = classifier.predict(X_test)

# Compute the confusion matrix
confusion = confusion_matrix(y_test, predicted)
print(confusion)

accuracy = accuracy_score(y_test, predicted)
print("Accuracy:", accuracy)

[[776   0]
 [  2 138]]
Accuracy: 0.9978165938864629


Our model generated a 99.78% accuracy - pretty good! Looking at the confusion matrix, it tells us that there are 776 true negatives. These are the instances that the classifier correctly identified "ham" emails. It also tells us that there are 138 true positives - which is when the classifier correctly identified "spam" emails.

Our model did not classify "ham" email as "spam" (0), however we can see that the model did incorrectly label 2 spam "emails" as "ham".

### Conclusion


Using Random Forest, the model was pretty successful; a next step would be to use various other models (Naive Bayes as an example) to compare and see which is most efficient.