# 1. Summary

* Demo notebook showcasing steps to get started with Experiment Tracking using [Comet.ml](https://www.comet.ml)
* **NLP Task**: Spam Detection on the [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset)


In [None]:
# Install comet-ml if not already installed
%pip install comet-ml --quiet --upgrade

In [2]:
# Verify comet-ml is installed
import comet_ml

print("Comet v:", comet_ml.__version__, "installed")

Comet v: 3.37.1 installed


In [6]:
import os
from pathlib import Path

import pandas as pd
import numpy as np

# This must be imported before any other ML library to enable automatic experiment tracking
from comet_ml import Experiment

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt

In [None]:
# Create a free Comet account and get your API key at https://www.comet.com/signup
experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name="hundzula_sms_spam_classification_demo",
)
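
Reading the key with `os.environ["COMET_API_KEY"]` raises a bare `KeyError` when the variable is missing. A minimal sketch of a friendlier lookup follows; the helper name `get_comet_api_key` is hypothetical and not part of Comet's API.

```python
import os


def get_comet_api_key(env=os.environ, var_name="COMET_API_KEY"):
    """Return the API key from the environment, or raise a readable error.

    Hypothetical helper for illustration; it only wraps a dictionary lookup.
    """
    api_key = env.get(var_name, "").strip()
    if not api_key:
        raise RuntimeError(
            f"{var_name} is not set. Export it in your shell before "
            "starting Jupyter, then restart the kernel."
        )
    return api_key
```

The result could then be passed as `api_key=get_comet_api_key()` when creating the experiment.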

In [8]:
ROOT_DIR = Path.cwd().parent
print("Project root directory:", ROOT_DIR)

Project root directory: /home/endeesa/projects/github/hundzula-2024-reproducible-nlp


# 2. Get Data

* If not already downloaded, download the dataset from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset) and save it in the `data/raw` directory.

In [4]:
# Alternatively, download the dataset with the Kaggle CLI (needs the `kaggle` package and API credentials)
# !kaggle datasets download -d uciml/sms-spam-collection-dataset

In [18]:
# Unzip the file into the data directory
!unzip -o $ROOT_DIR/data/raw/kaggle-spam-collection-data.zip -d $ROOT_DIR/data/raw

Archive:  /home/endeesa/projects/github/hundzula-2024-reproducible-nlp/data/raw/kaggle-spam-collection-data.zip
  inflating: /home/endeesa/projects/github/hundzula-2024-reproducible-nlp/data/raw/spam.csv  
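
The `!unzip` command assumes the `unzip` binary is available on the system. A portable alternative using only the Python standard library might look like the sketch below; the function name `extract_if_needed` and its `expected_name` parameter are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def extract_if_needed(zip_path, dest_dir, expected_name="spam.csv"):
    """Extract zip_path into dest_dir unless expected_name already exists there."""
    dest_dir = Path(dest_dir)
    target = dest_dir / expected_name
    if target.exists():
        return target  # Already extracted; nothing to do
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    if not target.exists():
        raise FileNotFoundError(f"{expected_name} not found in {zip_path}")
    return target
```

In this notebook it could be called with the zip path under `ROOT_DIR / "data/raw"` and that same directory as the destination.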


In [21]:

df = pd.read_csv(ROOT_DIR / 'data/raw/spam.csv', encoding='latin-1')
print(df.info())
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB
None


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [22]:
df.describe()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2


In [24]:
df.v1.value_counts(normalize=True).plot(kind='barh', title='Spam vs Ham Distribution')

<Axes: title={'center': 'Spam vs Ham Distribution'}, ylabel='v2'>


# 3. Preprocess data

In [None]:
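import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Fetch tokenizer models and the stopword list (newer NLTK releases need 'punkt_tab')
for resource in ('punkt', 'punkt_tab', 'stopwords'):
    nltk.download(resource, quiet=True)
english_stopwords = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs

    # Tokenize into words and drop stopwords
    words = word_tokenize(text)
    words = [w for w in words if w not in english_stopwords]

    # Remove punctuation and other non-alphabetic tokens
    words = [word for word in words if word.isalpha()]

    # Join tokens back into one string, which TfidfVectorizer expects
    return ' '.join(words)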

In [None]:
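# The raw CSV names its columns v1 (label) and v2 (message text); rename them for readability
df = df.rename(columns={'v1': 'label', 'v2': 'text'})
df['clean_text'] = df['text'].apply(preprocess)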

In [None]:
df.sample(3)
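
To see the preprocessing steps in isolation, here is a dependency-free sketch that swaps NLTK tokenization for a plain regular expression. It is a simplified stand-in, not the notebook's method: the stopword list is deliberately tiny and made up, and contractions such as "don't" are split differently than NLTK would split them.

```python
import re

# Deliberately tiny, hypothetical stopword list for illustration only
TINY_STOPWORDS = {"a", "an", "the", "to", "in", "of", "and", "is", "you", "i"}
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
WORD_PATTERN = re.compile(r"[a-z]+")


def simple_preprocess(text, stopwords=TINY_STOPWORDS):
    """Lowercase, strip URLs, keep alphabetic tokens, and drop stopwords."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    words = WORD_PATTERN.findall(text)
    return " ".join(w for w in words if w not in stopwords)


print(simple_preprocess("WIN a FREE prize!!! Visit http://example.com now"))
# prints: win free prize visit now
```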

# 4. Feature Extraction  

In [None]:
print('Raw class frequencies:')
print(df['label'].value_counts(normalize=True))

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text']) 
y = df['label']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Check relative class frequencies
print('Train class frequencies:')
print(y_train.value_counts(normalize=True))
print('Test class frequencies:')
print(y_test.value_counts(normalize=True))
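
The cells above fit the vectorizer on the full dataset before splitting, so the TF-IDF vocabulary and weights also reflect the test messages. One possible alternative, sketched here on made-up messages, is to split the raw text first and let a scikit-learn `Pipeline` fit TF-IDF on the training portion only. The `toy_*` names and the sample messages are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up messages so the sketch runs on its own; in the notebook these
# would be df['clean_text'] and df['label'].
toy_texts = [
    "win free prize now", "free cash offer", "claim your free reward",
    "urgent winner call now", "cheap loans win big",
    "see you at lunch", "call me later", "are we still meeting today",
    "thanks for the lift", "running late sorry",
]
toy_labels = ["spam"] * 5 + ["ham"] * 5

# Split the raw text first, keeping the class ratio equal in both subsets
toy_train, toy_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier
toy_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_train, toy_y_train)
toy_predictions = toy_pipeline.predict(toy_test)

print(len(toy_train), len(toy_test))
print(sorted(toy_y_test))
```

With `stratify`, the two held-out messages contain one example of each class, mirroring the balanced toy labels.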

# 5. Model Training

In [5]:
model = LogisticRegression()
model.fit(X_train, y_train)

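
Comet's automatic logging may already capture some of this run, but hyperparameters and metrics can also be logged explicitly with `log_parameters` and `log_metrics`. The sketch below wraps both calls in a hypothetical helper, `log_run`, and uses a stand-in class (`RecordingExperiment`, also hypothetical) so it runs without an API key or network access.

```python
class RecordingExperiment:
    """Hypothetical stand-in that records calls, so the sketch runs offline."""

    def __init__(self):
        self.parameters = {}
        self.metrics = {}

    def log_parameters(self, parameters):
        self.parameters.update(parameters)

    def log_metrics(self, metrics):
        self.metrics.update(metrics)


def log_run(experiment, params, metrics):
    """Send hyperparameters and metrics to an experiment-like object."""
    experiment.log_parameters(params)
    experiment.log_metrics(metrics)


demo = RecordingExperiment()
log_run(
    demo,
    {"model": "LogisticRegression", "test_size": 0.2, "random_state": 42},
    {"dataset_rows": 5572},
)
print(demo.parameters)
print(demo.metrics)
```

In the notebook, the real `experiment` object could be passed instead of `demo`, for example with `model.get_params()` as the parameter dictionary.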

# 6. Model Evaluation

> **Note**: When using Comet in a notebook, you must manually end the experiment by calling `experiment.end()` at the end of the notebook.
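
Before the experiment is ended, the fitted model would normally be scored on the held-out data. The following is a minimal sketch of that step, assuming `spam` is treated as the positive class; the `evaluate` helper and the toy corpus are illustrative assumptions, and the toy model is scored on its own training data purely to keep the example short.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def evaluate(model, X_test, y_test, positive_label="spam"):
    """Return common classification metrics for a fitted model as a dict."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": metrics.accuracy_score(y_test, y_pred),
        "precision": metrics.precision_score(
            y_test, y_pred, pos_label=positive_label, zero_division=0
        ),
        "recall": metrics.recall_score(
            y_test, y_pred, pos_label=positive_label, zero_division=0
        ),
        "f1": metrics.f1_score(
            y_test, y_pred, pos_label=positive_label, zero_division=0
        ),
    }


# Tiny made-up corpus so the sketch runs on its own; in the notebook, pass
# the fitted `model`, `X_test` and `y_test` from the cells above instead.
toy_corpus = ["win free prize now", "free cash win", "see you at lunch", "call me later"]
toy_targets = ["spam", "spam", "ham", "ham"]
toy_features = TfidfVectorizer().fit_transform(toy_corpus)
toy_model = LogisticRegression().fit(toy_features, toy_targets)

results = evaluate(toy_model, toy_features, toy_targets)
print(results)

# With a live Comet experiment, the dict could be logged in one call:
# experiment.log_metrics(results)
```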

In [None]:
experiment.end()

# Conclusion

- In this notebook, we saw a basic machine learning workflow for an NLP text classification task. We loaded a dataset, preprocessed the text, extracted TF-IDF features, trained a Logistic Regression model and evaluated its performance.

- Integrating Comet ML allowed us to easily log metrics, parameters, and figures to better analyze the model training process without any changes to the core code.

- Comet can be used in the same way with other machine learning frameworks such as PyTorch, TensorFlow, or Keras. It is especially useful for comparing experiments to select the best model.