Step 0. Unzip enron1.zip into the current directory.

Step 1. Traverse the dataset and create a Pandas dataframe. This is already done for you and should run without any errors. You should recognize Pandas from task 1.

In [5]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = 'enron1/enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = 'enron1/enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError:
                # Skip files that cannot be decoded with the default text encoding
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([pd.DataFrame.from_records(ham), pd.DataFrame.from_records(spam)])

skipped 2248.2004-09-23.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
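
The four skipped files are most likely emails whose bytes could not be decoded with the default text encoding. This is an assumption, since the loader does not print the underlying error. If dropping them is not acceptable, one option is to read with an explicit encoding and `errors="replace"`. The sketch below is not part of the provided loader: the helper name `read_email_lenient` and the choice of UTF-8 are hypothetical, and it runs on a temporary file rather than the Enron data.

```python
import os
import tempfile

def read_email_lenient(path, encoding="utf-8"):
    """Read a text file, replacing undecodable bytes with U+FFFD instead of raising."""
    with open(path, "r", encoding=encoding, errors="replace") as fp:
        return fp.read()

# Demonstration on a temporary file that contains a byte which is invalid in UTF-8.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "wb") as fp:
        fp.write(b"cheap meds \xff click here")
    content = read_email_lenient(sample_path)

print(content)
```

Because the Step 2 preprocessor turns every non-letter into a space, the replacement character would not survive into the features.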


Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. Write a function `preprocessor` that takes a string, replaces every non-alphabetic character with a space, and lowercases the result.

In [9]:
import re

def preprocessor(e):
    # Replace non-alphabet characters with a space
    processed_text = re.sub('[^a-zA-Z]', ' ', e)

    # Lowercase the result
    processed_text = processed_text.lower()

    return processed_text
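
A quick check of the function on a made-up string (the sample text is illustrative, not taken from the dataset) shows what survives the cleaning. The function is repeated here in compact form so the snippet runs on its own.

```python
import re

def preprocessor(e):
    # Same logic as the cell above: non-letters become spaces, then lowercase.
    return re.sub('[^a-zA-Z]', ' ', e).lower()

sample = "Hello, World! Call 555-0100 NOW"
cleaned = preprocessor(sample)

# Splitting on whitespace shows the tokens that remain.
print(cleaned.split())
```

Digits and punctuation disappear entirely, so a phone number or a price leaves no trace in the features. The cleaned string keeps the same length as the input because each removed character is replaced by exactly one space.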


Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions. It will be handy to keep that tab open.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Instantiate a CountVectorizer with the preprocessor
vectorizer = CountVectorizer(preprocessor=preprocessor)

# Split the dataset into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['category'], test_size=0.2, random_state=42)

# Transform the datasets using the vectorizer
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Instantiate and fit the LogisticRegression model
model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Make predictions on the already-vectorized test set
predictions = model.predict(X_test_vec)

# Calculate accuracy score
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

# Generate confusion matrix
confusion = confusion_matrix(y_test, predictions)
print("Confusion Matrix:")
print(confusion)

# Generate classification report
classification = classification_report(y_test, predictions)
print("Classification Report:")
print(classification)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
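
The warning above means the solver hit its iteration cap (`max_iter` defaults to 100) before it converged. One way to address it, as the message itself suggests, is to raise `max_iter`. The sketch below does that on a tiny made-up corpus, not the Enron data, and the value 1000 is an arbitrary choice for illustration.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented corpus, used only to keep the example self-contained.
texts = [
    "meeting schedule attached",
    "please review the attached deal",
    "thanks for the update",
    "win free money now",
    "free pills click now",
    "click to claim your free prize",
]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

X = CountVectorizer().fit_transform(texts)

# Turn a convergence warning into an error so a failure cannot go unnoticed.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ConvergenceWarning)
    model = LogisticRegression(max_iter=1000).fit(X, labels)

# n_iter_ reports how many iterations the solver actually used.
print(model.n_iter_)
```

If the warning persists on the real data even with a larger cap, scaling the features or trying another solver are the alternatives the message points to.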


Accuracy: 0.9758220502901354
Confusion Matrix:
[[717  12]
 [ 13 292]]
Classification Report:
              precision    recall  f1-score   support

         ham       0.98      0.98      0.98       729
        spam       0.96      0.96      0.96       305

    accuracy                           0.98      1034
   macro avg       0.97      0.97      0.97      1034
weighted avg       0.98      0.98      0.98      1034
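
The figures in the report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below copies the matrix from the output above and assumes the row and column order is [ham, spam], which matches the support counts of 729 and 305.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels, in the order [ham, spam].
confusion = [[717, 12],
             [13, 292]]

(ham_as_ham, ham_as_spam), (spam_as_ham, spam_as_spam) = confusion
total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Accuracy: share of all emails that were labeled correctly.
accuracy = (ham_as_ham + spam_as_spam) / total

# Precision for spam: of the emails predicted as spam, how many really were spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)

# Recall for spam: of the real spam emails, how many were caught.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

print(f"accuracy       = {accuracy:.4f}")
print(f"spam precision = {spam_precision:.4f}")
print(f"spam recall    = {spam_recall:.4f}")
```

In this run, 12 legitimate emails were flagged as spam and 13 spam emails slipped through as ham.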



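Step 4. Inspect the feature names and the model's coefficients to see which words influence the classification most.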

In [11]:
# Get the feature names from the vectorizer
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
print("Feature Names:")
print(feature_names)

# Get the coefficients from the Logistic Regression model
coefficients = model.coef_[0]

# Create a dictionary to map feature names to their corresponding coefficients
feature_coefficients = dict(zip(feature_names, coefficients))

# Sort the features by the absolute value of their coefficients, largest first
sorted_features = sorted(feature_coefficients.items(), key=lambda x: abs(x[1]), reverse=True)

# The first 10 entries have the largest magnitude, regardless of sign
top_features = sorted_features[:10]
print("Top 10 Features by Coefficient Magnitude:")
for feature, coefficient in top_features:
    print(feature, coefficient)

# The last 10 entries have the smallest magnitude, so they barely affect the prediction
weakest_features = sorted_features[-10:]
print("10 Features with the Smallest Coefficient Magnitude:")
for feature, coefficient in weakest_features:
    print(feature, coefficient)


Feature Names:
Top 10 Features by Coefficient Magnitude:
enron -1.4923906157700222
thanks -1.4597516802585244
attached -1.37497097001367
doc -1.3249367599631963
daren -1.3062235249492227
pictures -1.279805067935787
xls -1.2247599218036054
neon -1.1541537217170812
deal -1.144040330983197
hpl -1.0506351379405037
10 Features with the Smallest Coefficient Magnitude:
cnrl 1.7124939711534073e-08
expectted 1.7124939711534073e-08
houstonexp 1.7124939711534073e-08
llipperdt 1.7124939711534073e-08
patb 1.7124939711534073e-08
spinexp 1.7124939711534073e-08
tjones 1.7124939711534073e-08
evey 4.534643393928934e-09
ratnala -3.752204481661492e-09
unidirectional 1.5250489927845775e-09
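
Sorting by absolute value mixes both signs, and the ten largest coefficients printed above all happen to be negative. To list the strongest indicators for each class separately, one option is to sort by the signed coefficient instead. The sketch below assumes `model.classes_` is `['ham', 'spam']` (scikit-learn orders labels alphabetically), so positive coefficients push toward spam and negative ones toward ham. The negative values are rounded from the output above; the two positive entries are invented placeholders, not results from this model.

```python
# Illustrative (feature, coefficient) pairs. The negative values are rounded from
# the notebook output; the "example_spam_word_*" entries are invented placeholders.
feature_coefficients = {
    "enron": -1.49,
    "thanks": -1.46,
    "attached": -1.37,
    "cnrl": 1.7e-08,
    "example_spam_word_a": 1.10,
    "example_spam_word_b": 0.85,
}

# Sort by the signed coefficient: most negative first, most positive last.
ranked = sorted(feature_coefficients.items(), key=lambda item: item[1])

n = 2
strongest_ham = ranked[:n]           # most negative coefficients
strongest_spam = ranked[-n:][::-1]   # most positive coefficients, largest first

print("Strongest ham indicators:", strongest_ham)
print("Strongest spam indicators:", strongest_spam)
```

In the notebook itself, the dictionary would come from zipping the vectorizer's feature names with `model.coef_[0]`, as in the cell above.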


Submission
1. Upload the Jupyter notebook to Forage.

All Done!