# 1. Summary

- Demo notebook showcasing steps to get started with Experiment Tracking using [Comet.ml](https://www.comet.ml)
- **NLP Task**: Spam Detection on the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset)


In [20]:
# Install comet-ml if not already installed
%pip install comet-ml --quiet --upgrade

Note: you may need to restart the kernel to use updated packages.


In [21]:
# Verify comet-ml is installed
import comet_ml

print("Comet v:", comet_ml.__version__, "installed")

Comet v: 3.37.1 installed


In [22]:
import os
from pathlib import Path

import pandas as pd
import numpy as np

# This must be imported before any other ML library to enable automatic experiment tracking
from comet_ml import Experiment

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt

In [23]:
# Create a free Comet account to get an API key: https://www.comet.com/signup
experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name="hundzula_sms_spam_classification_demo",
)
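
`os.environ["COMET_API_KEY"]` raises a bare `KeyError` when the variable is not set. A minimal sketch of a friendlier lookup is shown here; `require_env` is a hypothetical helper written for this notebook, not part of the Comet SDK.

```python
import os


def require_env(name: str, environ=None) -> str:
    """Return an environment variable's value or fail with a clear message.

    Hypothetical helper for this notebook, not a Comet API.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before starting Jupyter."
        )
    return value


# Demo with an explicit mapping so the snippet runs without a real key
print(require_env("COMET_API_KEY", {"COMET_API_KEY": "dummy-key"}))
```

In the cell above, the value could then be passed as `api_key=require_env("COMET_API_KEY")`.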

In [24]:
ROOT_DIR = Path.cwd().parent
print("Project root directory:", ROOT_DIR)

Project root directory: /home/endeesa/projects/github/hundzula-2024-reproducible-nlp


# 2. Get Data


- If not already downloaded, download the dataset from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset) and save it in the `data/raw` directory.


In [25]:
# Alternatively, download the archive with the Kaggle CLI (requires Kaggle API credentials)
# !kaggle datasets download -d uciml/sms-spam-collection-dataset

In [26]:
# Unzip the file into the data directory
!unzip -o $ROOT_DIR/data/raw/kaggle-spam-collection-data.zip -d $ROOT_DIR/data/raw

Archive:  /home/endeesa/projects/github/hundzula-2024-reproducible-nlp/data/raw/kaggle-spam-collection-data.zip
  inflating: /home/endeesa/projects/github/hundzula-2024-reproducible-nlp/data/raw/spam.csv  
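
The `!unzip` cell relies on a shell tool being installed. As a portable alternative, the standard-library `zipfile` module can perform the same extraction from Python. This is a minimal sketch; `extract_archive` is a hypothetical helper name, and the demo runs on a throwaway archive rather than the Kaggle file.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path: Path, target_dir: Path) -> list[str]:
    """Extract every member of zip_path into target_dir and return the member names."""
    if not zip_path.exists():
        raise FileNotFoundError(f"Download the dataset first: {zip_path} is missing")
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)  # overwrites existing files, like `unzip -o`
        return archive.namelist()


# Self-contained demo on a temporary archive (stand-in for the Kaggle zip)
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo_zip = tmp_dir / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("spam.csv", "v1,v2\nham,hello\n")
    members = extract_archive(demo_zip, tmp_dir / "raw")
    print(members)
```

In this notebook the call would be `extract_archive(ROOT_DIR / "data/raw/kaggle-spam-collection-data.zip", ROOT_DIR / "data/raw")`.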


In [27]:

df = pd.read_csv(ROOT_DIR / 'data/raw/spam.csv', encoding='latin-1')
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB
None


,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [28]:
df.describe()

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2


In [29]:
df.v1.value_counts(normalize=True)

v1
ham     0.865937
spam    0.134063
Name: proportion, dtype: float64

# 3. Preprocess data


In [30]:
df.shape

(5572, 5)

In [31]:
# start by removing the columns that are not required
df = df[['v1', 'v2']].copy()
df = df.rename(columns={'v1': 'label', 'v2': 'text'})

# drop nulls and duplicates
df = df.dropna()
df = df.drop_duplicates(subset=['text'])

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5169 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   5169 non-null   object
 1   text    5169 non-null   object
dtypes: object(2)
memory usage: 121.1+ KB


### Tokenization


In [33]:
import re
import string
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [34]:
def remove_urls(text):
    return re.sub(r'http\S+', '', text)

In [35]:
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

In [36]:
def word_tokenize(text):
    return text.split()

In [44]:
def preprocess_text(text: str) -> str:
    text = text.lower()  # Convert to lowercase
    text = remove_urls(text)
    text = remove_punctuation(text)

    # add more preprocessing steps if needed

    # Tokenize into words
    words = word_tokenize(text)
    # text is already lowercased, so tokens can be compared directly
    words = [w for w in words if w not in ENGLISH_STOP_WORDS]

    return ' '.join(words)

In [45]:
df['clean_text'] = df['text'].apply(preprocess_text)

In [46]:
df.sample(3)

,label,text,clean_text
1491,spam,Your account has been credited with 500 FREE T...,account credited 500 free text messages activa...
2107,ham,Hmmm ... And imagine after you've come home fr...,hmmm imagine youve come home having rub feet m...
1674,ham,Nah dub but je still buff,nah dub je buff
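
To see what each stage of `preprocess_text` does, the sketch below traces one made-up message through the same steps. To stay dependency-free it uses a tiny stand-in stop-word set (`DEMO_STOP_WORDS`, an assumption for illustration) instead of scikit-learn's `ENGLISH_STOP_WORDS`, so the words removed here may differ from the notebook's output.

```python
import re
import string

# Tiny stand-in for sklearn's ENGLISH_STOP_WORDS (assumption, illustration only)
DEMO_STOP_WORDS = {"you", "have", "a", "at", "to", "the"}

message = "WINNER!! You have won a prize, claim at http://example.com now"

lowered = message.lower()                      # 1. lowercase
no_urls = re.sub(r"http\S+", "", lowered)      # 2. strip URLs
no_punct = no_urls.translate(                  # 3. strip punctuation
    str.maketrans("", "", string.punctuation)
)
tokens = no_punct.split()                      # 4. whitespace tokenization
kept = [t for t in tokens if t not in DEMO_STOP_WORDS]  # 5. drop stop words
clean = " ".join(kept)

print(clean)  # winner won prize claim now
```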


# 4. Feature Extraction


In [47]:
print('Raw class frequencies:')
print(df['label'].value_counts(normalize=True))

Raw class frequencies:
label
ham     0.87367
spam    0.12633
Name: proportion, dtype: float64


In [48]:
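# Split the cleaned text first so the vectorizer is fitted on training messages only
text_train, text_test, y_train, y_test = train_test_split(
    df['clean_text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)

# Learn the vocabulary and IDF weights from the training set, then reuse them on the test set
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)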

In [49]:
# Check relative class frequencies
print('Train class frequencies:')
print(y_train.value_counts(normalize=True))
print('Test class frequencies:')
print(y_test.value_counts(normalize=True))

Train class frequencies:
label
ham     0.873761
spam    0.126239
Name: proportion, dtype: float64
Test class frequencies:
label
ham     0.873308
spam    0.126692
Name: proportion, dtype: float64
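
`TfidfVectorizer` turns each cleaned message into a weighted bag-of-words vector. The sketch below reproduces the weighting by hand on three made-up documents, assuming scikit-learn's default settings (raw term counts, smoothed IDF, L2 normalization). It uses a plain whitespace split, so it does not replicate the vectorizer's own token pattern, and the variable names are illustrative.

```python
import math
from collections import Counter

# Made-up toy corpus (not the SMS data)
docs = ["free prize claim", "call later", "free call"]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: number of documents that contain each term
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty docs
    return [v / norm for v in raw]


for doc, tokens in zip(docs, tokenized):
    row = tfidf_row(tokens)
    print(doc, {w: round(v, 3) for w, v in zip(vocab, row) if v > 0})
```

Terms that appear in fewer documents (such as `prize`) receive a larger IDF than terms shared across documents (such as `free`), which is why rarer words carry more weight in the resulting vectors.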


# 5. Model Training


In [50]:
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# Uncomment to log the model hyperparameters explicitly
# experiment.log_parameters(model.get_params())

# 6. Model Evaluation


**Accuracy**

In [None]:
# Calculate predictions on test set
y_pred = model.predict(X_test)

# Evaluate model accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Log accuracy to Comet
experiment.log_metric('accuracy', accuracy)
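
Because roughly 87% of the deduplicated messages are `ham`, accuracy alone can look strong even for a model that never flags spam. The sketch below computes that majority-class baseline on made-up labels (hypothetical data, not the notebook's test set).

```python
from collections import Counter

# Hypothetical labels: 7 ham, 1 spam (not the notebook's data)
y_true = ["ham"] * 7 + ["spam"]

# A "model" that always predicts the most frequent class
majority_label, majority_count = Counter(y_true).most_common(1)[0]
y_baseline = [majority_label] * len(y_true)

baseline_accuracy = sum(t == p for t, p in zip(y_true, y_baseline)) / len(y_true)
print(majority_label, baseline_accuracy)  # ham 0.875
```

On this notebook's test split the `ham` share is about 0.873, so the reported accuracy is only informative to the extent that it exceeds that value.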



**Other Metrics**

In [None]:
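# Labels are the strings 'ham'/'spam', so the positive class must be named explicitly
precision = metrics.precision_score(y_test, y_pred, pos_label='spam')
recall = metrics.recall_score(y_test, y_pred, pos_label='spam')
f1 = metrics.f1_score(y_test, y_pred, pos_label='spam')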

print(f'Precision: {precision}') 
print(f'Recall: {recall}')
print(f'F1 score: {f1}')

experiment.log_metric('precision', precision)
experiment.log_metric('recall', recall)
experiment.log_metric('f1', f1)
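
For reference, precision, recall and F1 can be derived by hand from the counts of true positives, false positives and false negatives, with `spam` treated as the positive class. The sketch below uses made-up labels; `binary_prf` is a hypothetical helper, not a scikit-learn or Comet function.

```python
def binary_prf(y_true, y_pred, positive="spam"):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: 3 spam messages, of which the model catches 2 and raises no false alarms
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]

precision, recall, f1 = binary_prf(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 1.0 0.667 0.8
```

Here every flagged message really is spam (precision 1.0), but one spam message slips through (recall 2/3), and F1 summarizes the trade-off between the two.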

> **Note**: When using Comet in a notebook, you must manually end the experiment by calling `experiment.end()` at the end of the notebook.


In [None]:
experiment.end()

# 7. Conclusion


- In this notebook, we saw a basic machine learning workflow for an NLP text classification task. We loaded a dataset, preprocessed the text, extracted TF-IDF features, trained a Logistic Regression model and evaluated its performance.

- Integrating Comet ML let us log evaluation metrics to an experiment with a few extra lines, leaving the modelling code itself unchanged.

- Comet can be used in the same way with other machine learning frameworks such as PyTorch, TensorFlow and Keras. Comparing logged experiments side by side makes it easier to select the best model.
