# Assignment

## Instructions

### Text Classification for Spam Detection

In this assignment, you will build a text classification model using Naive Bayes to classify SMS messages as spam or ham (non-spam). You will implement text preprocessing techniques and use the Vector Space Model (TF-IDF) to represent the text data.

#### Dataset

You will be using the SMS Spam Collection dataset, which contains a set of SMS messages that have been labeled as either spam or ham (legitimate). This dataset is available through several Python libraries or can be downloaded directly.

#### Tasks

1. **Text Preprocessing**:

   - Load the dataset
   - Implement tokenization
   - Apply stemming or lemmatization
   - Remove stopwords

2. **Feature Extraction**:

   - Use TF-IDF vectorization to convert the text data into numerical features
   - Explore the most important features for spam and ham categories

3. **Classification**:

   - Split the data into training and testing sets
   - Train a Multinomial Naive Bayes classifier
   - Evaluate the model using appropriate metrics (accuracy, precision, recall, F1-score)
   - Create a confusion matrix to visualize the results

4. **Analysis**:
   - Analyze false positives and false negatives
   - Identify characteristics of messages that are frequently misclassified
   - Suggest improvements to your model


In [3]:
# Import necessary libraries
import pandas as pd
import numpy as np
import urllib.request

# Load the SMS Spam Collection dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
urllib.request.urlretrieve(url, "sms.tsv")
sms_data = pd.read_csv('sms.tsv', sep='\t',names=['label', 'message'])
print(sms_data.head())

# Check data distribution
print(sms_data['label'].value_counts())


  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
label
ham     4825
spam     747
Name: count, dtype: int64


In [4]:
print(sms_data.columns.tolist())

['label', 'message']


In [5]:
# TODO: Implement text preprocessing
# - Tokenization
# - Stemming/Lemmatization
# - Stopwords removal

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models; newer NLTK releases may need 'punkt_tab' instead

# Build these once rather than on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()

    # 2. Tokenization (break into words)
    tokens = nltk.word_tokenize(text)

    # 3. Keep only alphanumeric tokens (drops punctuation and fragments such as "n't"),
    #    then remove stopwords
    tokens = [word for word in tokens if word.isalnum() and word not in STOP_WORDS]

    # 4. Stemming (reduce words to their root)
    stemmed_tokens = [STEMMER.stem(word) for word in tokens]

    # Join back into a string for vectorization later
    return " ".join(stemmed_tokens)

# Apply to your dataframe
sms_data['transformed_text'] = sms_data['message'].apply(preprocess_text)

[nltk_data] Downloading package stopwords to /home/pocl/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/pocl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [6]:
print(sms_data[['message', 'transformed_text']].head())

                                             message  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   

                                    transformed_text  
0  go jurong point crazi avail bugi n great world...  
1                              ok lar joke wif u oni  
2  free entri 2 wkli comp win fa cup final tkt 21...  
3                u dun say earli hor u c alreadi say  
4               nah think goe usf live around though  


In [7]:
# TODO: Apply TF-IDF vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Initialize the vectorizer
# max_features keeps only the 3000 most frequent terms, which bounds the vocabulary size
tfidf = TfidfVectorizer(max_features=3000)

# 2. Transform the text data into a numerical array
# .toarray() converts the sparse result to a dense array; MultinomialNB also accepts the sparse matrix
X = tfidf.fit_transform(sms_data['transformed_text']).toarray()

# 3. Extract the labels (target variable)
# Note: the labels are still the strings 'ham'/'spam' here; they are encoded to 0/1 in the next cell
y = sms_data['label'].values

print(f"Feature matrix shape: {X.shape}")
# Example output: (5572, 3000) -> 5572 messages, 3000 retained terms


Feature matrix shape: (5572, 3000)


In [8]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
sms_data['label'] = encoder.fit_transform(sms_data['label'])

# ham becomes 0, spam becomes 1
y = sms_data['label'].values

In [9]:
# TODO: Split data into training and testing sets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

# 1. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [None]:
# TODO: Train a Multinomial Naive Bayes classifier
# 2. Initialize and Train the model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

In [None]:
# TODO: Evaluate the model
# 3. Make predictions
y_pred = mnb.predict(X_test)

# 4. Evaluate the performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9721973094170404
Precision: 1.0

Confusion Matrix:
[[957   0]
 [ 31 127]]
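
The task list also asks for recall and F1-score, which the cell above does not print. As a rough check, all four metrics can be derived by hand from the confusion matrix shown above (rows are true classes, columns are predicted classes, with ham = 0 and spam = 1). The variable names below are illustrative.

```python
# Cells of the confusion matrix printed above
tn, fp = 957, 0     # true ham:  predicted ham, predicted spam
fn, tp = 31, 127    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
```

In the notebook itself, `recall_score(y_test, y_pred)` and `f1_score(y_test, y_pred)` from `sklearn.metrics` should give the same values.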


# TODO: Analyze misclassifications
Zero False Positives: Your model did not misclassify a single legitimate message in the test set as spam (the top-right value of the confusion matrix is 0). This matters in spam filtering because blocking a legitimate message is usually more costly than letting a spam message through.
Precision = 100%: Because the top-right value is 0, your Precision TP / (TP + FP) = 127 / (127 + 0) is a perfect 1.0.
Recall Check: You did miss 31 of the 158 spam messages in the test set (False Negatives), so Recall TP / (TP + FN) = 127 / 158 ≈ 0.80. Those messages "leaked" into the inbox. This is often an acceptable trade-off, but the result holds only for this test split; it does not guarantee that no "Ham" is ever blocked.
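
To look at which messages are misclassified, the original text has to stay aligned with the true and predicted labels. A minimal sketch with made-up messages and predictions (all values below are hypothetical, not taken from the dataset's test split):

```python
# Hypothetical stand-ins for the test messages, true labels, and model predictions
messages = [
    "Ok lar... Joking wif u oni...",
    "Free entry in 2 a wkly comp to win FA Cup",
    "Call me when you get home",
    "Your account statement is ready, call now",
]
y_true = [0, 1, 0, 1]  # 0 = ham, 1 = spam
y_pred = [0, 1, 1, 0]  # pretend model output

# Ham predicted as spam
false_positives = [m for m, t, p in zip(messages, y_true, y_pred) if t == 0 and p == 1]
# Spam predicted as ham
false_negatives = [m for m, t, p in zip(messages, y_true, y_pred) if t == 1 and p == 0]

print("False positives:", false_positives)
print("False negatives:", false_negatives)
```

In the notebook, passing `sms_data['message']` as a third array to `train_test_split` would return matching train and test message splits, so the test messages line up with `y_test` and `y_pred`.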


In [12]:
# Initialize with N-grams (1,2)
tfidf = TfidfVectorizer(max_features=3000, ngram_range=(1,2))

# Re-transform and Re-split
X = tfidf.fit_transform(sms_data['transformed_text']).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Re-train
mnb = MultinomialNB(alpha=0.1) # Less smoothing (lower alpha); may raise recall at some cost in precision
mnb.fit(X_train, y_train)

# Evaluate
y_pred = mnb.predict(X_test)
print(f"New Precision: {precision_score(y_test, y_pred)}")
print(f"New Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")

New Precision: 0.9781021897810219
New Confusion Matrix:
[[954   3]
 [ 24 134]]
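
Both runs above fit the vectorizer on the full dataset before splitting, so vocabulary and IDF statistics from the test messages influence training. One possible improvement, sketched here on a small made-up corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, with a stratified split to preserve the ham/spam ratio. The `texts` and `labels` values are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for sms_data['transformed_text'] and the encoded labels (assumed data)
texts = [
    "free entri win prize call",
    "win cash prize free text",
    "urgent claim free prize now",
    "call now win cash award",
    "ok see later home",
    "come home dinner later",
    "meet lunch tomorrow ok",
    "see later tonight dinner",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first; stratify keeps the class ratio equal in both splits
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# The vectorizer is fitted on the training text only when the pipeline is fitted
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    MultinomialNB(alpha=0.1),
)
model.fit(train_texts, y_train)
y_pred = model.predict(test_texts)
print(list(zip(test_texts, y_pred)))
```

A pipeline like this can also be passed to `cross_val_score`, which gives a less split-dependent estimate than a single train/test split.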
