# Member 1: Logistic Regression
## Logistic Regression with Multiple Embeddings

**Team Member:** Member 1
**Model:** Logistic Regression
**Embeddings:** TF-IDF, Skip-gram, CBOW

---

## Objectives
1. Implement Logistic Regression for spam classification
2. Train with at least 3 different embeddings
3. Perform hyperparameter tuning
4. Document all experiments systematically
5. Save results for team comparison


## 1. Environment Setup

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
import time
import sys

# Add src to path
sys.path.append('../src')

# Import custom modules
from preprocessing import TextPreprocessor
from embeddings import TFIDFEmbedding, Word2VecEmbedding, GloVeEmbedding, FastTextEmbedding
from evaluation import ModelEvaluator
from utils import set_seed, print_data_info, save_model

# Set random seed for reproducibility
set_seed(42)

print('✓ Setup complete!')


## 2. Load and Prepare Data

In [None]:
# Load preprocessed data from shared exploration notebook
df = pd.read_csv('../data/processed/spam_cleaned.csv')

# OR load raw and preprocess
# df = pd.read_csv('../data/raw/spam.csv', encoding='latin-1')
# preprocessor = TextPreprocessor()
# df = preprocessor.preprocess_dataframe(df, 'v2', 'cleaned_text')

print(f'Dataset shape: {df.shape}')
df.head()


## 3. Experiment 1: Logistic Regression + TF-IDF

**Rationale:** TF-IDF is a traditional sparse representation that works well with...

**Citation:** [Add relevant paper citation here]


In [None]:
# Prepare TF-IDF embeddings
from embeddings import TFIDFEmbedding

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_text'], df['label'],
    test_size=0.2, random_state=42, stratify=df['label']
)

# Create TF-IDF features
tfidf = TFIDFEmbedding(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train.tolist())
X_test_tfidf = tfidf.transform(X_test.tolist())

print(f'TF-IDF shape: {X_train_tfidf.shape}')


In [None]:
# Train Logistic Regression with TF-IDF
from sklearn.linear_model import LogisticRegression

# max_iter is raised above the default (100) to give the solver room to converge
model = LogisticRegression(max_iter=1000, random_state=42)

# Training: time only the fit call
start_time = time.time()
model.fit(X_train_tfidf, y_train)
training_time = time.time() - start_time

# Predictions
y_pred = model.predict(X_test_tfidf)

print(f'Training time: {training_time:.2f}s')


In [None]:
# Evaluate Model
evaluator = ModelEvaluator(class_names=['ham', 'spam'])

metrics = evaluator.evaluate(
    y_test, y_pred,
    model_name='Logistic Regression',
    embedding_name='TF-IDF',
    training_time=training_time
)

evaluator.print_metrics(metrics)
evaluator.plot_confusion_matrix(
    y_test, y_pred,
    title='Logistic Regression + TF-IDF - Confusion Matrix',
    save_path='../results/figures/member1_tfidf_cm.png'
)


## 4. Experiment 2: Logistic Regression + Skip-gram (Word2Vec)

**Rationale:** Skip-gram embeddings capture semantic relationships...

**Citation:** Mikolov et al. (2013)
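
One common way to feed Word2Vec embeddings to Logistic Regression is to average the word vectors of each message into a single fixed-length document vector. The sketch below shows only that averaging step, using NumPy. `document_vector` and `documents_to_matrix` are hypothetical helper names, not part of the project's `embeddings` module, and the toy vectors stand in for a trained model. The closing comments outline how the lookup might come from gensim's `Word2Vec` with `sg=1`, assuming gensim is installed and `tokenized_train` (a hypothetical name) holds the tokenized training messages.

```python
import numpy as np


def document_vector(tokens, vectors, dim):
    """Average the vectors of the tokens found in `vectors`.

    `vectors` is any mapping that supports `token in vectors` and
    `vectors[token]` (a plain dict here; gensim KeyedVectors also works).
    Messages with no known tokens map to a zero vector.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0).astype(np.float32)


def documents_to_matrix(tokenized_docs, vectors, dim):
    """Stack one averaged vector per document into an (n_docs, dim) array."""
    return np.vstack([document_vector(doc, vectors, dim) for doc in tokenized_docs])


# Toy 3-dimensional vectors, purely illustrative (not trained embeddings).
toy_vectors = {
    "free": np.array([1.0, 0.0, 0.0]),
    "prize": np.array([0.0, 1.0, 0.0]),
    "meeting": np.array([0.0, 0.0, 1.0]),
}
docs = [["free", "prize"], ["meeting", "tomorrow"], ["unknown"]]
X = documents_to_matrix(docs, toy_vectors, dim=3)
print(X.shape)

# With gensim (assumed installed), the mapping could come from a trained model:
# from gensim.models import Word2Vec
# w2v = Word2Vec(sentences=tokenized_train, vector_size=100, window=5,
#                min_count=1, sg=1, seed=42, workers=1)
# X_train_sg = documents_to_matrix(tokenized_train, w2v.wv, dim=100)
```

Averaging discards word order, which is usually acceptable for a linear classifier on short messages. The same helpers apply unchanged to CBOW vectors, since only the Word2Vec training mode differs.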


In [None]:
# Prepare Skip-gram embeddings
# TODO: Implement Skip-gram training
# Tokenize texts
# Train Word2Vec with sg=1
# Transform to document vectors
# Train model
# Evaluate


## 5. Experiment 3: Logistic Regression + CBOW (Word2Vec)

**Rationale:** CBOW is faster to train and works well for...

**Citation:** Mikolov et al. (2013)


In [None]:
# Prepare CBOW embeddings
# TODO: Implement CBOW training (sg=0)


## 6. Results Summary and Comparison
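
The bar chart comparing embeddings could be drawn with a small helper like the one below. This is a minimal sketch: `plot_embedding_comparison` is a hypothetical function, not part of the project's `evaluation` module, and it expects the caller to pass in scores that were actually measured in the experiments above.

```python
import matplotlib.pyplot as plt


def plot_embedding_comparison(scores, metric_name="F1-score", save_path=None):
    """Draw one bar per embedding; `scores` maps embedding name -> metric value."""
    names = list(scores)
    values = [scores[name] for name in names]

    fig, ax = plt.subplots(figsize=(6, 4))
    bars = ax.bar(names, values)
    ax.set_ylim(0, 1)
    ax.set_ylabel(metric_name)
    ax.set_title(f"Logistic Regression: {metric_name} by embedding")
    # Label each bar with its value so the chart can be read without a table.
    for bar, value in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width() / 2, value, f"{value:.3f}",
                ha="center", va="bottom")
    if save_path is not None:
        fig.savefig(save_path, bbox_inches="tight")
    return ax
```

A call might look like `plot_embedding_comparison({'TF-IDF': f1_tfidf, 'Skip-gram': f1_sg, 'CBOW': f1_cbow})`, where the `f1_*` variables are placeholders for the scores produced in Sections 3 to 5.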

In [None]:
# Save all results to CSV
# evaluator.save_results_table(
#     filepath='../results/tables/member1_results.csv'
# )

# Create comparison plots
# Plot bar chart comparing embeddings for this model


## 7. Hyperparameter Tuning (Optional but Recommended)
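
A grid search over the regularization strength `C` is a typical starting point for Logistic Regression. The sketch below uses synthetic data from `make_classification` so that it runs on its own. The grid values, the `f1` scoring choice, and the `X_demo`/`y_demo` names are assumptions for illustration, not tuned recommendations. In this notebook the same search would be fitted on `X_train_tfidf` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real features (assumption: binary labels).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=42
)

# Illustrative grid: C is the inverse regularization strength.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="f1",
)
grid_search.fit(X_demo, y_demo)

print("Best parameters:", grid_search.best_params_)
print(f"Best cross-validated F1: {grid_search.best_score_:.3f}")
```

Scoring on F1 rather than accuracy is a deliberate choice for imbalanced data such as spam detection, where a model that labels everything as ham can still reach high accuracy.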

In [None]:
# Perform grid search or random search for best hyperparameters
# Example (GridSearchCV is already imported in Section 1):
# param_grid = {...}
# grid_search = GridSearchCV(model, param_grid, cv=5)
# Fit on the vectorized features, not the raw text in X_train
# grid_search.fit(X_train_tfidf, y_train)


## 8. Conclusions and Observations

**Key Findings:**
- Best performing embedding: [Fill in]
- Why it performed better: [Discuss]
- Comparison with team: [After seeing other results]

**Limitations:**
- [Discuss limitations]

**Future Work:**
- [Suggestions for improvement]
