Step 0. Unzip enron1.zip into the current directory.
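
If you prefer to unzip from Python rather than from a file manager or shell, the standard-library `zipfile` module can do it. The following is a minimal sketch; the helper name `unzip_dataset` is hypothetical, and it assumes `enron1.zip` sits next to the notebook.

```python
import zipfile
from pathlib import Path

def unzip_dataset(archive="enron1.zip", target_dir="."):
    """Extract every member of `archive` into `target_dir`; return the member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()

# Only extract if the archive is actually present next to the notebook.
if Path("enron1.zip").exists():
    members = unzip_dataset()
    print(f"extracted {len(members)} entries")
```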

Step 1. Traverse the dataset and create a Pandas dataframe. This is already done for you and should run without any errors. You should recognize Pandas from task 1.

In [25]:
# import pandas as pd
# import os

# def read_category(category, directory):
#     emails = []
#     for filename in os.listdir(directory):
#         if not filename.endswith(".txt"):
#             continue
#         with open(os.path.join(directory, filename), 'r') as fp:
#             try:
#                 content = fp.read()
#                 emails.append({'name': filename, 'content': content, 'category': category})
#             except:
#                 print(f'skipped {filename}')
#     return emails

# def read_spam():
#     category = 'spam'
#     directory = './enron1/spam'
#     return read_category(category, directory)

# def read_ham():
#     category = 'ham'
#     directory = './enron1/ham'
#     return read_category(category, directory)


# ham = read_ham()
# spam = read_spam()

# df = pd.DataFrame.from_records(ham)
# df = df.append(pd.DataFrame.from_records(spam))

In [37]:
import pandas as pd
import os

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            try:
                with open(os.path.join(directory, filename), 'r', encoding='utf-8', errors='ignore') as fp:
                    content = fp.read()
                    emails.append({'name': filename, 'content': content, 'category': category})
            except Exception as e:
                print(f"skipped {filename} due to error: {e}")
    return emails

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

ham = read_ham()
spam = read_spam()

# Use different variable names to avoid overwriting
df_ham = pd.DataFrame.from_records(ham)
df_spam = pd.DataFrame.from_records(spam)
# df_spam.head(5)

# Combine DataFrames
df_combined = pd.concat([df_ham, df_spam], ignore_index=True)


Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore 'help' the machine by performing such normalization steps for it. Write a function `preprocessor` that takes in a string, replaces all non-alphabetic characters with a space, and then lowercases the result.

In [39]:
df=df_combined
df

Unnamed: 0,name,content,category
0,0001.1999-12-10.farmer.ham.txt,Subject: christmas tree farm pictures\n,ham
1,0002.1999-12-13.farmer.ham.txt,"Subject: vastar resources , inc .\ngary , prod...",ham
2,0003.1999-12-14.farmer.ham.txt,Subject: calpine daily gas nomination\n- calpi...,ham
3,0004.1999-12-14.farmer.ham.txt,Subject: re : issue\nfyi - see note below - al...,ham
4,0005.1999-12-14.farmer.ham.txt,Subject: meter 7268 nov allocation\nfyi .\n- -...,ham
...,...,...,...
5167,5163.2005-09-06.GP.spam.txt,Subject: our pro - forma invoice attached\ndiv...,spam
5168,5164.2005-09-06.GP.spam.txt,Subject: str _ rndlen ( 2 - 4 ) } { extra _ ti...,spam
5169,5167.2005-09-06.GP.spam.txt,Subject: check me out !\n61 bb\nhey derm\nbbbb...,spam
5170,5170.2005-09-06.GP.spam.txt,Subject: hot jobs\nglobal marketing specialtie...,spam


In [40]:
import re

def preprocessor(e):
    e=re.sub('[^a-zA-Z]', ' ', e)
    e=e.lower()
    return e
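
A quick way to check the preprocessor is to run it on a short string and inspect the result. The sample string below is made up for illustration; the function is repeated so the snippet runs on its own.

```python
import re

def preprocessor(e):
    # Replace every character that is not an ASCII letter with a space, then lowercase.
    return re.sub('[^a-zA-Z]', ' ', e).lower()

sample = "Hello, World! 123"          # hypothetical input
cleaned = preprocessor(sample)

print(repr(cleaned))                  # punctuation and digits become spaces
print(cleaned.split())                # ['hello', 'world']
```

Each removed character becomes one space, so the cleaned string keeps the original length; the extra whitespace is harmless because tokenization splits on it later.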

Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions. It will be handy to keep that tab open.
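
Before vectorizing the real emails, it may help to see what a bag-of-words count matrix looks like. The sketch below uses only the standard library and two made-up documents; it is a rough illustration of the idea, not scikit-learn's implementation (for example, `CountVectorizer` by default only keeps tokens of two or more characters).

```python
import re
from collections import Counter

def preprocessor(text):
    # Same normalization as Step 2: non-letters to spaces, then lowercase.
    return re.sub('[^a-zA-Z]', ' ', text).lower()

docs = ["Free money!!!", "Gas nomination, free of charge"]   # hypothetical emails

# Tokenize each document on whitespace after preprocessing.
tokenized = [preprocessor(d).split() for d in docs]

# The vocabulary is the sorted set of all words; each word becomes one column.
vocabulary = sorted({word for words in tokenized for word in words})

# One row per document, one count per vocabulary word.
counts = [[Counter(words)[word] for word in vocabulary] for words in tokenized]

print(vocabulary)
for row in counts:
    print(row)
```

The model never sees the text itself, only rows like these, which is why the vocabulary used at prediction time has to be the one learned from the training data.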

In [41]:
print(df['category'].unique())


['ham' 'spam']


In [42]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Instantiate a CountVectorizer with the preprocessor
vectorizer = CountVectorizer(preprocessor=preprocessor)

# Use train_test_split to split the dataset into a train dataset and a test dataset
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['category'], test_size=0.2, random_state=42)

# Use the vectorizer to transform the existing dataset into a form in which the model can learn from
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Use the LogisticRegression model to fit to the train dataset
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

# Validate that the model has learned something
X_test_predictions = model.predict(X_test_vectorized)

# Evaluate the performance of the model
accuracy = accuracy_score(y_test, X_test_predictions)
conf_matrix = confusion_matrix(y_test, X_test_predictions)
classification_rep = classification_report(y_test, X_test_predictions)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(classification_rep)


Accuracy: 0.98
Confusion Matrix:
[[732  17]
 [  8 278]]
Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.98      0.98       749
        spam       0.94      0.97      0.96       286

    accuracy                           0.98      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.98      0.98      0.98      1035
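
The numbers in the classification report can be recomputed by hand from the confusion matrix, which is a useful check that you are reading the matrix the right way round. The sketch below copies the four counts from the output above; the variable names are chosen for illustration.

```python
# Confusion matrix copied from the output above.
# scikit-learn puts true labels on rows and predicted labels on columns,
# with classes in sorted order: index 0 = 'ham', index 1 = 'spam'.
conf_matrix = [[732, 17],
               [8, 278]]

(ham_as_ham, ham_as_spam), (spam_as_ham, spam_as_spam) = conf_matrix
total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

accuracy = (ham_as_ham + spam_as_spam) / total

# Precision: of everything predicted as a class, how much really was that class.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
ham_precision = ham_as_ham / (ham_as_ham + spam_as_ham)

# Recall: of everything that really was a class, how much was found.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
ham_recall = ham_as_ham / (ham_as_ham + ham_as_spam)

print(f"accuracy={accuracy:.2f}")
print(f"ham:  precision={ham_precision:.2f} recall={ham_recall:.2f}")
print(f"spam: precision={spam_precision:.2f} recall={spam_recall:.2f}")
```

In this run, 17 legitimate emails were flagged as spam and 8 spam emails were let through as ham.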



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
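
The message above is scikit-learn's convergence warning: the solver stopped at its iteration limit before fully converging. As the warning itself suggests, one option is to raise `max_iter`. The sketch below shows the idea on a tiny made-up corpus; the value 1000 is an assumption, not a tuned setting, and the real dataset may need a different limit or feature scaling.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for the Enron emails.
texts = ["win free money now", "free prize claim now",
         "meeting schedule for gas nomination", "please review the meter allocation"]
labels = ["spam", "spam", "ham", "ham"]

X = CountVectorizer().fit_transform(texts)

# Raise the iteration cap above the default of 100.
model = LogisticRegression(max_iter=1000)

# Turn a convergence warning into an error so it cannot go unnoticed.
with warnings.catch_warnings():
    warnings.simplefilter("error", ConvergenceWarning)
    model.fit(X, labels)

# n_iter_ reports how many iterations the solver actually used.
print(model.n_iter_)
```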


Step 4.

In [45]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

def preprocessor(text):
    # Replace non-alphabetic characters with a space and convert to lowercase
    return ''.join([char if char.isalpha() else ' ' for char in text]).lower()

# Instantiate a CountVectorizer with the preprocessor
vectorizer = CountVectorizer(preprocessor=preprocessor)

# Fit and transform the full dataset (this refits the vocabulary on all emails, not just the training split)
X_train_vectorized = vectorizer.fit_transform(df_combined['content'])

# Get the feature names (words) created by the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Display the first 10 features
print("Top 10 features:")
print(feature_names[:10])

# Access the coefficients from the logistic regression model
coefficients = model.coef_[0]

# Ensure feature_names and coefficients have the same length
if len(feature_names) == len(coefficients):
    # Create a DataFrame to associate feature names with their coefficients
    coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

    # Sort by signed coefficient: positive values push toward 'spam', negative toward 'ham'
    coefficients_df_sorted = coefficients_df.sort_values('Coefficient', ascending=False)

    # Display the 10 most positive and the 10 most negative features
    top_positive_features = coefficients_df_sorted.head(10)
    top_negative_features = coefficients_df_sorted.tail(10)

    print("\nTop 10 Positive Features (Spam):")
    print(top_positive_features)

    print("\nTop 10 Negative Features (Ham):")
    print(top_negative_features)
else:
    print("Error: Feature names and coefficients have different lengths.")


Top 10 features:
['aa' 'aaa' 'aaas' 'aabda' 'aabvmmq' 'aac' 'aachecar' 'aaer' 'aafco'
 'aaiabe']
Error: Feature names and coefficients have different lengths.


Submission
1. Upload the Jupyter notebook to Forage.

All Done!