## Assignment: Document Classification

### Alice Ding, Shoshana Farber, Christian Uriostegui

It is often useful to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) emails to predict whether a new document is spam.

For this project, we've chosen to use two files from [SpamAssassin](https://spamassassin.apache.org/old/publiccorpus/) as our training data. Together they contain 2,551 ham (non-spam) emails and 1,398 spam emails, which we'll use to predict whether a given email is spam.

### Importing the Data

To start, we have the files downloaded into two folders within this directory (`easy_ham` and `spam_2`). We'll use Python to read each file and combine everything into one dataframe of messages and labels.

In [1]:
import sys
import os
import pandas as pd
from os.path import expanduser
import nltk
import re

# get the path names of each directory
ham = sys.path[0] + '/easy_ham'
spam = sys.path[0] + '/spam_2'

# create a function that takes the directory and the label we want appended to those messages to put into a dataframe
def create_df(folder, label):
    file_data = []
    file_labels = []
    for file_name in os.listdir(folder):
        file_path = os.path.join(folder, file_name)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='latin-1') as file:
                content = file.read()
            # append inside the isfile check so non-file entries are skipped
            file_data.append(content)
            file_labels.append(label)
    df = pd.DataFrame({'email': file_data, 'label': file_labels})
    return df

# create a dataframe for the ham messages
ham_df = create_df(ham, 'ham')
# create a dataframe for the spam messages
spam_df = create_df(spam, 'spam')
# combine these dataframes into one full one of emails
full_df = pd.concat([ham_df, spam_df])

full_df.head()

| | email | label |
|---|---|---|
| 0 | From rssfeeds@jmason.org Mon Sep 30 13:43:46 ... | ham |
| 1 | From fork-admin@xent.com Tue Sep 3 14:24:41 ... | ham |
| 2 | From exmh-users-admin@redhat.com Wed Sep 11 1... | ham |
| 3 | From fork-admin@xent.com Mon Sep 2 16:22:12 ... | ham |
| 4 | From rssfeeds@jmason.org Fri Sep 27 10:40:59 ... | ham |
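
Before cleaning, it can help to confirm that each class loaded with the expected number of messages (2,551 ham and 1,398 spam, per the corpus description). The sketch below is one possible way to do that, not part of the original notebook. The helper name `label_counts` and the toy label list are assumptions made for illustration.

```python
from collections import Counter

def label_counts(labels):
    """Count how many messages carry each label (works on any iterable of labels)."""
    return Counter(labels)

# Toy labels standing in for the real label column (hypothetical data)
toy_labels = ["ham", "ham", "ham", "spam"]

counts = label_counts(toy_labels)
total = sum(counts.values())

# Print each class with its share of the total, largest class first
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```

On the real data, the same helper could presumably be called as `label_counts(full_df['label'])`, since iterating over a pandas column yields its values.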


Now that the data is imported, let's try to clean it up to only hold relevant information.

### Cleaning the Data

In [2]:
from nltk.corpus import stopwords

# get stop words from nltk
stop_words = set(stopwords.words('english'))

# create a function to clean a given text
def clean_text(text):
    text = re.sub(r"<.*?>", "", text)              # remove HTML tags
    text = re.sub(r"[0-9]", "", text)               # remove digits
    text = re.sub(r"[^\w\s]", "", text)             # remove non-word and non-space characters
    text = re.sub(r"\n", " ", text)                 # replace newlines with spaces so words on adjacent lines are not fused
    text = text.lower()                            # convert all text to lowercase
    text = " ".join([word for word in text.split() if word not in stop_words])  # remove stop words
    return text

full_df['email'] = full_df['email'].apply(clean_text)

full_df.head()

| | email | label |
|---|---|---|
| 0 | rssfeedsjmasonorg mon sep returnpath delivered... | ham |
| 1 | forkadminxentcom tue sep returnpath deliveredt... | ham |
| 2 | exmhusersadminredhatcom wed sep returnpath del... | ham |
| 3 | forkadminxentcom mon sep returnpath deliveredt... | ham |
| 4 | rssfeedsjmasonorg fri sep returnpath delivered... | ham |


Things are definitely looking cleaner: the text is lowercased, and digits, punctuation, and stop words are gone.
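
To see how each substitution contributes to that result, here is a small sketch that applies the same steps in the same order to a made-up header-style sample and records the text after every step. The function name `trace_clean`, the sample string, and the tiny stop-word set are assumptions for illustration; the notebook itself uses nltk's full English stop-word list. Newlines are replaced with a space here so that words on adjacent lines stay separate.

```python
import re

# Hypothetical stand-in for nltk's English stop-word list, kept tiny so the
# sketch runs without downloading any corpus.
SAMPLE_STOP_WORDS = {"from", "the", "is", "to", "a"}

def trace_clean(text, stop_words=SAMPLE_STOP_WORDS):
    """Return a list of (step name, text after that step) pairs."""
    trace = []
    text = re.sub(r"<.*?>", "", text)        # drop anything between < and > on one line
    trace.append(("strip <...> spans", text))
    text = re.sub(r"[0-9]", "", text)        # drop digits
    trace.append(("drop digits", text))
    text = re.sub(r"[^\w\s]", "", text)      # drop punctuation such as @ . : - !
    trace.append(("drop punctuation", text))
    text = re.sub(r"\n", " ", text)          # turn newlines into spaces
    trace.append(("newlines to spaces", text))
    text = text.lower()
    trace.append(("lowercase", text))
    text = " ".join(w for w in text.split() if w not in stop_words)
    trace.append(("drop stop words", text))
    return trace

# Made-up sample modeled on the header lines shown earlier
sample = (
    "From rssfeeds@jmason.org Mon Sep 30 13:43:46 2002\n"
    "Return-Path: <rssfeeds@jmason.org>\n"
    "The offer is FREE!"
)

for name, result in trace_clean(sample):
    print(f"{name:>20}: {result!r}")
```

Two effects are worth noticing in the trace. The tag pattern also removes angle-bracketed email addresses, and stripping `@` and `.` fuses an address into a single token, which is why tokens like `rssfeedsjmasonorg` appear in the cleaned dataframe.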