<a href="https://colab.research.google.com/github/alexchilton/cas_nlp_module_1_2/blob/main/nlp_cas_01_spam_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Spam Detection



**Algorithm**: Naive Bayes (and others?)

**Dataset**: [UCI SMS Spam Collection Dataset](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)

**Goal:** teach the model to classify texts as either spam or not spam ("ham")

## 1.1 Installs & Imports

In [1]:
# installs

# tip: output from install commands can be very long. To hide it, add the cell magic
# %%capture as the very first line of this cell (cell magics only work in that position)

!pip install pandas
!pip install scikit-learn



In [2]:
# imports
import pandas as pd # used to read and hold data
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer # aka Bag-of-Words Vectorizer
from sklearn.naive_bayes import MultinomialNB # Naive Bayes classifier
from sklearn.metrics import accuracy_score, classification_report # to evaluate the training

## 1.2 Load & Prepare Data

1. download the spam dataset from [UCI repository](https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip)

2. unzip and upload the file `SMSSpamCollection` to your default directory on Colab

3. *optional, if you're bored: figure out how to skip steps 1 & 2 and load the files directly into Colab*
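
One possible way to do step 3 is sketched below, using only the standard library. It assumes the archive at the URL from step 1 contains a member named `SMSSpamCollection` with one `label<TAB>message` pair per line. The helper names `fetch_sms_zip` and `parse_sms_zip` are made up for this example.

```python
import io
import urllib.request
import zipfile

# URL taken from step 1 above
UCI_ZIP_URL = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"


def fetch_sms_zip(url=UCI_ZIP_URL):
    """Download the archive and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def parse_sms_zip(zip_bytes, member="SMSSpamCollection"):
    """Return a list of (label, message) tuples from the zipped tab-separated file."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.rstrip("\r\n")
                if not line:
                    continue  # skip empty lines
                label, message = line.split("\t", 1)  # split on the first tab only
                rows.append((label, message))
    return rows


# usage in Colab (uncomment to run; this downloads the file):
# rows = parse_sms_zip(fetch_sms_zip())
# df = pd.DataFrame(rows, columns=["label", "message"])
```

Keeping the download and the parsing in separate functions means the parsing step can be tried on any zip bytes without network access.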



In [3]:
# quickly check file
!head SMSSpamCollection

head: cannot open 'SMSSpamCollection' for reading: No such file or directory


In [None]:
# load spam dataset into dataframe
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])  # tab-separated file without a header row

In [None]:
# the training algorithm needs numerical input
# so 'ham' and 'spam' need to be converted to 0 and 1
df['label'] = df.label.map({'ham': 0, 'spam': 1})
print(df.head())

   label                                            message
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...


In [None]:
# split data into training and testing sets
# model will learn from training set, then we'll evaluate the performance on the testing set.
# we do this to detect overfitting (model simply memorizing the training data, can't generalize to new data)

# separate data into features (X, input) and target (y)
# side note: X is capitalized because it represents a matrix (though it's only one column in this example)

X = df['message']
y = df['label']



# split the dataset into 80% for training and 20% for testing (common ratio)
# (theoretically, we could also use a validation set)
# this allows us to evaluate the model on data it has never seen before

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # random_state ensures that the split is the same every time we run the code. use any integer

# show top of table
print(X_train.head())

1978    Reply to win £100 weekly! Where will the 2006 ...
3989    Hello. Sort of out in town already. That . So ...
3935     How come guoyang go n tell her? Then u told her?
4078    Hey sathya till now we dint meet not even a si...
4086    Orange brings you ringtones from all time Char...
Name: message, dtype: object


In [None]:
# check data split
print("--- Data split ---")
print(f"Training set size: {len(X_train)} messages")
print(f"Testing set size: {len(X_test)} messages")

--- Data split ---
Training set size: 4457 messages
Testing set size: 1115 messages
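
The sizes above follow from the 80/20 ratio. As a small worked check, assuming scikit-learn rounds the test share up to a whole number of rows and gives the remainder to the training set:

```python
import math

n_samples = 5572   # total number of messages (4457 + 1115)
test_size = 0.2    # value passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # 1114.4 is rounded up
n_train = n_samples - n_test               # the rest goes to training

print(n_train, n_test)  # 4457 1115
```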


## 1.3 Convert text data to numerical vectors
Statistical machine learning models **cannot** work with raw words, only with numbers, so each message has to be converted into a numerical vector first.

In [None]:
# instantiate the vectorizer
# it will learn a vocabulary from the text and convert each message into a (sparse) vector of word counts
vectorizer = CountVectorizer()

# alternative vectorizers to try
# (TfidfVectorizer needs: from sklearn.feature_extraction.text import TfidfVectorizer)
# vectorizer = TfidfVectorizer()
# vectorizer = CountVectorizer(ngram_range=(1,2))



In [None]:
# fit the vectorizer on the training data to learn the vocabulary and then transform the training data
X_train_vectors = vectorizer.fit_transform(X_train)

# ONLY transform the test data using the vocabulary learned from the training data
# the test set is unseen data --> we cannot learn vocabulary from it
# new words in test set will be ignored
X_test_vectors = vectorizer.transform(X_test)
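
To make the fit/transform distinction concrete, here is a simplified pure-Python sketch of a bag-of-words vectorizer. It is not how `CountVectorizer` is implemented. The tokenizer (lowercase, tokens of two or more word characters) only approximates its defaults, and the names `tokenize`, `fit_vocabulary` and `transform` are made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # lowercase the text and keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(messages):
    # learn a word -> column index mapping from the training messages only
    words = sorted({token for message in messages for token in tokenize(message)})
    return {word: index for index, word in enumerate(words)}


def transform(messages, vocabulary):
    # turn each message into a list of word counts; unknown words are dropped
    vectors = []
    for message in messages:
        counts = Counter(tokenize(message))
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vectors


train = ["win a free prize", "free free entry"]
test = ["free pizza tonight"]

vocabulary = fit_vocabulary(train)
print(vocabulary)                    # {'entry': 0, 'free': 1, 'prize': 2, 'win': 3}
print(transform(train, vocabulary))  # [[0, 1, 1, 1], [1, 2, 0, 0]]
print(transform(test, vocabulary))   # [[0, 1, 0, 0]]
```

"pizza" and "tonight" never appeared in the training messages, so they have no column and are ignored when the test message is transformed.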

## 1.4 Train the NB classifier

In [None]:
# the MultinomialNB() suits feature vectors representing counts (like word counts)
model = MultinomialNB()

In [None]:
# .fit() is how the model trains in scikit-learn
model.fit(X_train_vectors, y_train)
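
For intuition about what `.fit()` and `.predict()` do here, a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing follows. It works on toy count vectors, is not scikit-learn's implementation, and the names `fit_nb` and `predict_nb` are made up for the example.

```python
import math


def fit_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # total count of each word in class c, plus alpha for smoothing
        word_counts = [sum(row[j] for row in rows) + alpha for j in range(n_features)]
        total = sum(word_counts)
        log_likelihood[c] = [math.log(count / total) for count in word_counts]
    return log_prior, log_likelihood


def predict_nb(x, log_prior, log_likelihood):
    """Return the class with the highest log posterior score for one vector."""
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# toy data, columns = counts of ["free", "meeting", "win"]
X = [[2, 0, 1], [1, 0, 2], [0, 2, 0], [0, 1, 0]]
y = [1, 1, 0, 0]  # 1 = spam, 0 = ham

log_prior, log_likelihood = fit_nb(X, y)
print(predict_nb([1, 0, 1], log_prior, log_likelihood))  # 1
print(predict_nb([0, 1, 0], log_prior, log_likelihood))  # 0
```

The smoothing term `alpha` keeps a word that never occurred in one class from forcing that class's probability to zero.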

## 1.5 Evaluate the model

In [None]:
predictions = model.predict(X_test_vectors)
accuracy = accuracy_score(y_test, predictions)
print(classification_report(y_test, predictions, target_names=['Ham', 'Spam']))


              precision    recall  f1-score   support

         Ham       0.99      1.00      1.00       966
        Spam       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115
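
To read the columns of the report, it can help to compute them by hand once. The sketch below uses made-up labels (not the notebook's data) and a hypothetical helper `precision_recall_f1` for the positive class:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    # count true positives, false positives and false negatives
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0  # how many flagged items were spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # how much of the spam was caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, round(f1, 3))  # 1.0 0.75 0.857
```

The Spam row in the report above shows the same pattern: high precision with lower recall means that messages flagged as spam were almost always spam, while some spam messages were still classified as ham.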



## 1.6 Test with new data!


In [None]:
new_messages = [
    "congratulations! You've won a $1,000 gift card. Click here to claim.",
    "hey, are we still on for the meeting tomorrow?",
    "urgent: Your account has been suspended. Please verify your details immediately."
]

new_messages_vectors = vectorizer.transform(new_messages)

new_predictions = model.predict(new_messages_vectors)

for message, prediction in zip(new_messages, new_predictions):
    label = "Spam" if prediction == 1 else "Ham"
    print(f"Message: '{message}'\nPredicted: {label}\n")


Message: 'congratulations! You've won a $1,000 gift card. Click here to claim.'
Predicted: Spam

Message: 'hey, are we still on for the meeting tomorrow?'
Predicted: Ham

Message: 'urgent: Your account has been suspended. Please verify your details immediately.'
Predicted: Spam
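
A possible next step (not part of the notebook above) is to bundle the vectorizer and the classifier with scikit-learn's `make_pipeline` and to look at class probabilities via `predict_proba` instead of only hard labels. The sketch below trains on a tiny made-up corpus so that it runs on its own; in the notebook you would fit on `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# tiny made-up corpus, only to keep the example self-contained
texts = [
    "win a free prize now",
    "free entry claim your prize",
    "are we still meeting tomorrow",
    "see you at lunch tomorrow",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# the pipeline vectorizes and classifies in one object
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(texts, labels)

new_messages = ["claim your free prize", "lunch tomorrow?"]
probabilities = spam_filter.predict_proba(new_messages)  # columns: [P(ham), P(spam)]

for message, (p_ham, p_spam) in zip(new_messages, probabilities):
    print(f"{message!r}: P(spam) = {p_spam:.2f}")
```

With a pipeline, raw strings go straight into `fit` and `predict`, so there is no separate `vectorizer.transform` call to forget when classifying new messages.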

