In [None]:
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


# Task
Build an email spam detector using the uploaded dataset "spam.csv" and machine learning in Python.

## Load the dataset

### Subtask:
Load the uploaded `spam.csv` file into a pandas DataFrame.


**Reasoning**:
Import pandas and load the 'spam.csv' file into a DataFrame, then display the head and info to verify the loading process.



In [None]:
import pandas as pd

df = pd.read_csv('spam.csv', encoding='latin-1')
display(df.head())
df.info()  # info() prints its report directly and returns None, so it is not wrapped in display()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB
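The `info()` report shows that the three `Unnamed` columns hold at most 50 non-null values out of 5572 rows. The notebook keeps them and continues to use `v1` and `v2` directly; a minimal sketch of an optional tidy-up step, on a toy frame with illustrative values and the hypothetical column names `label` and `text`, could look like this:

```python
import pandas as pd

# Toy frame that mimics the layout of spam.csv (values are illustrative only).
raw = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Ok lar... Joking wif u oni...", "Free entry in 2 a wkly comp"],
    "Unnamed: 2": [None, None],
    "Unnamed: 3": [None, None],
    "Unnamed: 4": [None, None],
})

# Keep only the label and text columns and give them descriptive names.
# "label" and "text" are hypothetical names, not used elsewhere in the notebook.
tidy = raw[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})
print(tidy.columns.tolist())
```

Selecting the two columns also discards whatever text sits in the few non-empty `Unnamed` cells, so it is worth inspecting those rows before dropping them.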

## Explore the data

### Subtask:
Understand the structure and content of the dataset. Check for missing values and analyze the distribution of spam and non-spam emails.


**Reasoning**:
Display the column names, the number of unique values per column, the value counts for the 'v1' column, and the first few rows of the DataFrame to understand its structure and content.



In [None]:
print("Column Names:")
print(df.columns)

print("\nNumber of Unique Values per Column:")
print(df.nunique())

print("\nValue Counts for 'v1' column:")
print(df['v1'].value_counts())

print("\nFirst few rows of the DataFrame:")
display(df.head())

Column Names:
Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

Number of Unique Values per Column:
v1               2
v2            5169
Unnamed: 2      43
Unnamed: 3      10
Unnamed: 4       5
dtype: int64

Value Counts for 'v1' column:
v1
ham     4825
spam     747
Name: count, dtype: int64

First few rows of the DataFrame:


| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
| --- | --- | --- | --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
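The subtask also asks for a missing-value check and the class distribution. A small sketch of both, assuming a toy frame in place of the real data, uses `isna().sum()` for per-column missing counts and `value_counts(normalize=True)` for class shares:

```python
import pandas as pd

# Toy stand-in for df; the real notebook would call these methods on df itself.
toy = pd.DataFrame({
    "v1": ["ham", "ham", "spam", "ham"],
    "v2": ["hi", "ok", "win cash", "see you"],
    "Unnamed: 2": [None, "extra", None, None],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fraction of rows belonging to each class, instead of raw counts.
class_share = toy["v1"].value_counts(normalize=True)
print(class_share)
```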


## Preprocess the data

### Subtask:
Clean the text data by removing punctuation, converting to lowercase, and potentially removing stop words.


**Reasoning**:
Clean the text data in the 'v2' column by converting to lowercase, removing punctuation, and removing stop words. Store the result in a new column.



In [None]:
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

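# Build the stop-word set once; calling stopwords.words() for every token is slow.
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase the text, strip punctuation, and drop English stop words."""
    text = text.lower()
    text = ''.join(char for char in text if char not in string.punctuation)
    words = [word for word in text.split() if word not in STOP_WORDS]
    return ' '.join(words)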

df['cleaned_text'] = df['v2'].apply(clean_text)
display(df[['v2', 'cleaned_text']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


| | v2 | cleaned_text |
| --- | --- | --- |
| 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | Nah I don't think he goes to usf, he lives aro... | nah dont think goes usf lives around though |
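The same three cleaning steps can be tried without downloading NLTK data. The sketch below is a standalone variant, not the notebook's function: `clean_text_fast` is a hypothetical name, the stop-word list is a deliberately tiny stand-in for NLTK's English list, and punctuation is removed with `str.translate` rather than a character loop.

```python
import string

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's list).
STOP_WORDS = frozenset({"a", "i", "in", "he", "so", "to", "the", "until"})

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_fast(text: str) -> str:
    """Lowercase, strip punctuation, and drop stop words."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_text_fast("Ok lar... Joking wif u oni..."))
print(clean_text_fast("Nah I don't think"))
```

Because punctuation is stripped before the stop-word filter runs, a contraction such as "don't" becomes "dont" and is kept, which matches the `cleaned_text` column shown above.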


## Split the data

### Subtask:
Divide the dataset into training and testing sets.


**Reasoning**:
Divide the dataset into training and testing sets using the cleaned text as features and the original label as the target.



In [None]:
from sklearn.model_selection import train_test_split

X = df['cleaned_text']
y = df['v1']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (4457,)
Shape of X_test: (1115,)
Shape of y_train: (4457,)
Shape of y_test: (1115,)
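The split above is purely random. With classes as unbalanced as these, one option is to pass `stratify` to `train_test_split` so both subsets keep the same class ratio. This is a sketch on synthetic labels, not the split used for the results reported in this notebook:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic data: 20% spam, 80% ham (illustrative only).
texts = [f"message {i}" for i in range(100)]
labels = ["spam"] * 20 + ["ham"] * 80

# stratify=labels keeps the spam/ham ratio the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```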


## Vectorize the text data

### Subtask:
Convert the text data into numerical features that can be used by a machine learning model. Techniques like TF-IDF or Count Vectorization can be used.


**Reasoning**:
Import TfidfVectorizer, instantiate it, fit it to the training data only, and transform both the training and testing data. Finally, print the shapes of the resulting sparse matrices.



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("Shape of X_train_tfidf:", X_train_tfidf.shape)
print("Shape of X_test_tfidf:", X_test_tfidf.shape)

Shape of X_train_tfidf: (4457, 8270)
Shape of X_test_tfidf: (1115, 8270)
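Both matrices have the same number of columns because the vocabulary is learned from the training data alone; words that appear only in the test set are ignored. A toy illustration with made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration.
train_docs = ["free prize win", "meeting tomorrow morning", "win free entry"]
test_docs = ["free lunch tomorrow"]  # "lunch" never appears in train_docs

vec = TfidfVectorizer()
X_train_toy = vec.fit_transform(train_docs)  # learns the vocabulary
X_test_toy = vec.transform(test_docs)        # reuses it, adds no new columns

print(sorted(vec.vocabulary_))
print(X_train_toy.shape, X_test_toy.shape)
```

Fitting the vectorizer on the training split only also keeps test-set word statistics out of the IDF weights.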


## Choose and train a model

### Subtask:
Select a suitable machine learning model for classification (e.g., Naive Bayes, SVM, Logistic Regression) and train it on the training data.


**Reasoning**:
Import and train a Multinomial Naive Bayes model on the TF-IDF transformed training data.



In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_tfidf, y_train)
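The vectorizer and classifier are kept as two separate objects here. An alternative arrangement, shown only as a sketch on made-up training texts, is to bundle them with scikit-learn's `make_pipeline` so that raw strings can be passed straight to `fit` and `predict`. The `clean_text` step is not part of this sketch and would still need to be applied beforehand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration.
train_texts = [
    "free prize win now",
    "win cash free entry",
    "see you at lunch",
    "meeting moved to tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

# One object that vectorizes and classifies in sequence.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

preds = spam_pipeline.predict(["free prize waiting", "lunch tomorrow"])
print(preds)
```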

Now, let's save the trained model and the TF-IDF vectorizer for offline use.

In [None]:
import joblib
from google.colab import files

# Save the trained Multinomial Naive Bayes model
joblib.dump(model, 'spam_detector_model.joblib')

# Save the fitted TF-IDF vectorizer
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.joblib')

print("Model and vectorizer saved.")

# Download the saved files
files.download('spam_detector_model.joblib')
files.download('tfidf_vectorizer.joblib')

Model and vectorizer saved.
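For offline use, the two files have to be loaded back together, since the model only understands vectors produced by the vectorizer it was trained with. A round-trip sketch, assuming a toy model and a temporary directory instead of the Colab download:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy model standing in for the notebook's trained objects.
texts = ["free prize win", "win cash now", "see you at lunch", "meeting tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "spam_detector_model.joblib")
    vec_path = os.path.join(tmp, "tfidf_vectorizer.joblib")
    joblib.dump(clf, model_path)
    joblib.dump(vec, vec_path)

    # Later, or on another machine: load both artifacts.
    loaded_clf = joblib.load(model_path)
    loaded_vec = joblib.load(vec_path)

sample = ["win a free prize"]
original = clf.predict(vec.transform(sample))
restored = loaded_clf.predict(loaded_vec.transform(sample))
print(original, restored)
```

Files written by `joblib` are generally best loaded with the same scikit-learn version that produced them.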



Let's test the model with some new messages.

In [None]:
# Example new messages to test
new_messages = [
    "Congratulations! You've won a free iPhone!",
    "Hey, how are you doing today?",
    "Urgent: Your account has been compromised. Click here to verify.",
    "Meeting tomorrow at 10 AM.",
    "Free entry to a prize draw! Text WIN to 12345."
]

# Preprocess the new messages using the same clean_text function
cleaned_new_messages = [clean_text(msg) for msg in new_messages]

# Vectorize the cleaned new messages using the fitted TF-IDF vectorizer
new_messages_tfidf = tfidf_vectorizer.transform(cleaned_new_messages)

# Predict the labels for the new messages
predictions = model.predict(new_messages_tfidf)

# Display the original messages and their predicted labels
for message, prediction in zip(new_messages, predictions):
    print(f"Message: {message}")
    print(f"Predicted Label: {prediction}\n")

Message: Congratulations! You've won a free iPhone!
Predicted Label: ham

Message: Hey, how are you doing today?
Predicted Label: ham

Message: Urgent: Your account has been compromised. Click here to verify.
Predicted Label: spam

Message: Meeting tomorrow at 10 AM.
Predicted Label: ham

Message: Free entry to a prize draw! Text WIN to 12345.
Predicted Label: spam
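The first message above is labeled ham even though it reads like spam. One way to look more closely at such borderline cases is `predict_proba`, which returns a probability per class instead of a hard label. The sketch below uses a toy model and made-up messages, so the printed numbers say nothing about the notebook's real model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data for illustration only.
texts = ["free prize win", "win cash now", "see you at lunch", "meeting tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

messages = ["won a free phone", "lunch meeting tomorrow"]
proba = clf.predict_proba(vec.transform(messages))

# Columns of proba follow the order of clf.classes_.
spam_index = list(clf.classes_).index("spam")
for msg, row in zip(messages, proba):
    print(f"P(spam) = {row[spam_index]:.2f}  |  {msg}")
```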



## Evaluate the model

### Subtask:
Assess the performance of the trained model on the testing data using appropriate metrics (e.g., accuracy, precision, recall, F1-score).


**Reasoning**:
Import the necessary metrics, make predictions on the test set, calculate the evaluation metrics, and print the results.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='spam')
recall = recall_score(y_test, y_pred, pos_label='spam')
f1 = f1_score(y_test, y_pred, pos_label='spam')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.9677
Precision: 1.0000
Recall: 0.7600
F1-score: 0.8636
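To see how the four numbers relate, they can be recomputed from confusion-matrix counts. The counts below are not printed anywhere in this notebook; they are one set of values inferred to be consistent with the rounded metrics and the 1115 test samples, and should be treated as an assumption.

```python
# Inferred counts (an assumption, chosen to match the reported metrics).
tp = 114  # spam correctly flagged
fp = 0    # ham wrongly flagged as spam
fn = 36   # spam that slipped through
tn = 965  # ham correctly passed

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Samples:   {total}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```

Under these counts, every error is a missed spam message, which is why precision is perfect while recall is not. The actual counts can be checked with `sklearn.metrics.confusion_matrix(y_test, y_pred)`.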


## Summary:

### Data Analysis Key Findings

*   The dataset contains 5572 entries with columns for the email label ('v1') and the email text ('v2'), along with three largely empty columns ('Unnamed: 2', 'Unnamed: 3', and 'Unnamed: 4').
*   There is a significant class imbalance, with 4825 'ham' (non-spam) emails and 747 'spam' emails.
*   Text data was successfully cleaned by converting to lowercase, removing punctuation, and removing English stop words.
*   The dataset was split into training (4457 samples) and testing (1115 samples) sets.
*   Text data was vectorized using TF-IDF, resulting in a feature matrix with 8270 features.
*   A Multinomial Naive Bayes model was trained on the TF-IDF transformed training data.
*   The trained model achieved an accuracy of 0.9677 on the test set.
*   The model achieved perfect precision (1.0000) on the spam class but lower recall (0.7600): every email it flagged as spam was in fact spam, but it missed about 24% of the actual spam emails.
*   The F1-score for spam detection was 0.8636.
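Because of the class imbalance, accuracy is easier to interpret next to a trivial baseline. The short calculation below uses only the class counts reported above and shows what a classifier that always answers 'ham' would score on the full dataset:

```python
# Class counts from the value_counts() output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total
majority_baseline = ham_count / total  # accuracy of always predicting 'ham'

print(f"Spam share:        {spam_share:.4f}")
print(f"Majority baseline: {majority_baseline:.4f}")
```

The baseline is computed on the whole dataset rather than the test split, so it is only a rough yardstick for the 0.9677 test accuracy.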

### Insights or Next Steps

*   The high precision suggests the model is very good at avoiding false positives (classifying ham as spam), which is desirable for a spam filter. However, the lower recall indicates potential for improvement in capturing all spam emails.
*   Further steps could involve exploring different text vectorization techniques (e.g., Count Vectorization with n-grams) or trying other classification models (e.g., SVM, Logistic Regression) to potentially improve the recall score while maintaining high precision.
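As a starting point for the second suggestion, the sketch below combines Count Vectorization over unigrams and bigrams with Logistic Regression. The training texts are made up, and `class_weight="balanced"` and `max_iter=1000` are illustrative choices rather than tuned values; whether this improves recall on the real data would have to be measured.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration.
train_texts = [
    "free entry prize draw",
    "win free entry now",
    "see you at lunch",
    "meeting moved to tomorrow",
    "call me after the meeting",
]
train_labels = ["spam", "spam", "ham", "ham", "ham"]

candidate = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
candidate.fit(train_texts, train_labels)

# Bigrams such as "free entry" become features of their own.
vocab = candidate.named_steps["countvectorizer"].vocabulary_
print("free entry" in vocab)

candidate_preds = candidate.predict(["free entry today", "lunch after the meeting"])
print(candidate_preds)
```

Any candidate should be compared against the Naive Bayes model on the same test split, using the same precision, recall, and F1 metrics.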
