# Projects in Machine Learning and AI (RPI Spring 2025)
# Homework 6

## Outline of This Notebook
1. **Part 1: Transformers** (Task 1)
   - Dataset Selection, Split
   - Training Script for BART Summarization
   - Fine-Tuning + Results (BLEU, ROUGE)
   - Analysis of Hyperparameters & Model Choice

2. **Part 2: Reinforcement Learning**
   - **Task 2**: Real-World MDP Formulation
   - **Task 3**: RL Application in Selected Domain + Open-Source Project

3. **Part 3: Recommender Systems** (Task 5)
   - MovieLens 100k Dataset
   - Data Cleaning, EDA, User-Item Matrix
   - Implementation of Two Collaborative Filtering Methods
   - Comparison on Two Evaluation Metrics


# Part 1: Transformers

In **Task 1**, we will use the [Facebook BART model](https://huggingface.co/docs/transformers/model_doc/bart) to perform **text summarization** on the CNN/DailyMail news dataset, which is a common text-summarization benchmark.

Below are the steps we will follow:
1. **Select & Describe Dataset** + **Train/Test Split (90-10)** (though the official dataset is already split into train/validation/test, we'll demonstrate a splitting workflow if needed).
2. **Load BART Model** from Hugging Face Transformers + **Training Script**.
3. **Fine-tune** Pre-trained Model + **Report BLEU and ROUGE**.
4. **Analysis** of Results, Hyperparameters & Choice of LLM.

### Dataset Choice

The **CNN/DailyMail** dataset consists of news articles and corresponding highlights (summaries). We have already downloaded the dataset in our `data/cnn_dailymail` folder, containing three CSV files:

- `train.csv` (~285,000 rows)
- `validation.csv` (~13,300 rows)
- `test.csv` (~11,500 rows)

Each CSV has two main columns (commonly `article` and `highlights` or `summary`) that we will use for summarization. This notebook provides a structure for training and evaluating a summarization model on these files.

## 1. Dataset Description & Potential Splitting

The **CNN/DailyMail** dataset pairs a long news article with a concise summary (often bullet points called *highlights*). For each row:



In many workflows, you could combine all data and manually split 90/10. However, since the dataset already comes partitioned, we can either:
- Use the official `train.csv`, `validation.csv`, `test.csv` splits, **or**
- Combine them if we wish and re-split. 

Below, we will simply load the provided CSV files for clarity.

In [3]:
import pandas as pd

# Load the train, validation, and test sets from the local folder
train_df = pd.read_csv("data/cnn_dailymail/train.csv")
valid_df = pd.read_csv("data/cnn_dailymail/validation.csv")
test_df = pd.read_csv("data/cnn_dailymail/test.csv")

# Print basic stats and a small sample
print(f"Train set size: {len(train_df)}")
print(f"Validation set size: {len(valid_df)}")
print(f"Test set size: {len(test_df)}")

# Inspect the first few rows of training data
print("\nSample rows from the training set:")
display(train_df.head(3))


Train set size: 287113
Validation set size: 13368
Test set size: 11490

Sample rows from the training set:


Unnamed: 0,id,article,highlights
0,0001d1afc246a7964130f43ae940af6bc6c57f01,By . Associated Press . PUBLISHED: . 14:11 EST...,"Bishop John Folda, of North Dakota, is taking ..."
1,0002095e55fcbd3a2f366d9bf92a95433dc305ef,(CNN) -- Ralph Mata was an internal affairs li...,Criminal complaint: Cop used his role to help ...
2,00027e965c8264c35cc1bc55556db388da82b07f,A drunk driver who killed a young woman in a h...,"Craig Eccleston-Todd, 27, had drunk at least t..."


## 2. Load BART Model & Define Training Script

We'll now load the [Hugging Face Transformers library](https://github.com/huggingface/transformers) and initialize the `facebook/bart-large-cnn` model, which is specialized for summarization.

### Outline
1. **Load Pre-trained Tokenizer and Model**:
   ```python
   from transformers import BartTokenizer, BartForConditionalGeneration

   tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
   model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
2. Tokenize & Encode the articles and summaries.

3. Prepare a PyTorch Dataset & DataLoader for training and testing.

4. Train using standard seq2seq objectives (e.g., cross-entropy).

5. Evaluate on the test set.

Below is a training loop for a limited number of epochs (to illustrate code structure). In practice, we would use the Trainer API from transformers or a more robust custom loop with more epochs.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import BartTokenizer, BartForConditionalGeneration

# Really small datasets for demonstration purposes
if len(train_df) > 1000:
    train_df = train_df.sample(n=1000, random_state=42).reset_index(drop=True)
if len(test_df) > 1000:
    test_df = test_df.sample(n=1000, random_state=42).reset_index(drop=True)

train_df = train_df.dropna(subset=["article", "highlights"])
test_df = test_df.dropna(subset=["article", "highlights"])

class SummarizationDataset(Dataset):
    def __init__(self, articles, highlights, tokenizer, max_input_len=512, max_target_len=128):
        self.articles = articles
        self.highlights = highlights
        self.tokenizer = tokenizer
        self.max_input_len = max_input_len
        self.max_target_len = max_target_len

    def __len__(self):
        return len(self.articles)

    def __getitem__(self, idx):
        article = self.articles[idx]
        summary = self.highlights[idx]

        # Tokenize article
        inputs = self.tokenizer(
            article,
            max_length=self.max_input_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )
        # Tokenize highlights
        with self.tokenizer.as_target_tokenizer():
            labels = self.tokenizer(
                summary,
                max_length=self.max_target_len,
                padding="max_length",
                truncation=True,
                return_tensors="pt"
            )

        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'labels': labels['input_ids'].squeeze()
        }
    
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

print("model loaded...")

max_input_length = 512
max_target_length = 128

train_dataset = SummarizationDataset(
    articles=train_df["article"].tolist(),
    highlights=train_df["highlights"].tolist(),
    tokenizer=tokenizer,
    max_input_len=max_input_length,
    max_target_len=max_target_length
)

test_dataset = SummarizationDataset(
    articles=test_df["article"].tolist(),
    highlights=test_df["highlights"].tolist(),
    tokenizer=tokenizer,
    max_input_len=max_input_length,
    max_target_len=max_target_length
)

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)

print("dataset loaded...")

optimizer = AdamW(model.parameters(), lr=1e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

model.train()

print("training started...")

# For demonstration, we will train for just 1 epoch
# Note: In practice, you would want to train for more epochs (e.g., 3-5 or more)
num_epochs = 1
for epoch in range(num_epochs):
    total_loss = 0.0
    num = 1
    batch_count = len(train_loader)
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        print("batch {}/{} complete".format(num, batch_count))

        total_loss += loss.item()
        num += 1

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs} | Training Loss: {avg_loss:.4f}")


model loaded...
dataset loaded...
training started...




batch 1/500 complete
batch 2/500 complete
batch 3/500 complete
batch 4/500 complete
batch 5/500 complete
batch 6/500 complete
batch 7/500 complete
batch 8/500 complete
batch 9/500 complete
batch 10/500 complete
batch 11/500 complete
batch 12/500 complete
batch 13/500 complete
batch 14/500 complete
batch 15/500 complete
batch 16/500 complete
batch 17/500 complete
batch 18/500 complete
batch 19/500 complete
batch 20/500 complete
batch 21/500 complete
batch 22/500 complete
batch 23/500 complete
batch 24/500 complete
batch 25/500 complete
batch 26/500 complete
batch 27/500 complete
batch 28/500 complete
batch 29/500 complete
batch 30/500 complete
batch 31/500 complete
batch 32/500 complete
batch 33/500 complete
batch 34/500 complete
batch 35/500 complete
batch 36/500 complete
batch 37/500 complete
batch 38/500 complete
batch 39/500 complete
batch 40/500 complete
batch 41/500 complete
batch 42/500 complete
batch 43/500 complete
batch 44/500 complete
batch 45/500 complete
batch 46/500 comple

## 3. Fine-tuning & Reporting BLEU/ROUGE

We'll now evaluate on the **test set** and compute BLEU & ROUGE. For brevity, we only have a tiny test set of 1 example here, but the process is the same for larger test sets.

### BLEU & ROUGE Setup
- We can use `sacrebleu` for BLEU.
- We can use `rouge_score` from `rouge-score` or Hugging Face's `evaluate.load("rouge")`.

In real usage, you'd likely average the scores across all test examples.


In [13]:
import sacrebleu
from rouge_score import rouge_scorer

model.eval()
references = []
hypotheses = []

for batch in test_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)

    # Generate summary
    with torch.no_grad():
        summary_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_length=50,
            num_beams=4,
            early_stopping=True
        )

    for i, sid in enumerate(summary_ids):
        # Decode reference and prediction
        ref_text = tokenizer.decode(
            batch['labels'][i], skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        pred_text = tokenizer.decode(
            sid, skip_special_tokens=True, clean_up_tokenization_spaces=True
        )
        references.append(ref_text)
        hypotheses.append(pred_text)

# BLEU
bleu_score = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU score: {bleu_score.score:.2f}")

# ROUGE
rouge_scorer_fn = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
r1, r2, rL = 0, 0, 0

for ref, hyp in zip(references, hypotheses):
    scores = rouge_scorer_fn.score(ref, hyp)
    r1 += scores['rouge1'].fmeasure
    r2 += scores['rouge2'].fmeasure
    rL += scores['rougeL'].fmeasure

n = len(references)
print(f"ROUGE-1: {r1/n:.4f}")
print(f"ROUGE-2: {r2/n:.4f}")
print(f"ROUGE-L: {rL/n:.4f}")



BLEU score: 12.44
ROUGE-1: 0.4047
ROUGE-2: 0.1957
ROUGE-L: 0.2930


### Analysis of Hyperparameters & Model Choice

1. **Learning Rate**:  
   - We used `1e-5`, which often works well for fine-tuning large language models.  
   - A larger LR (e.g., `1e-4`) might speed up convergence but increases the risk of instability and overfitting. 

2. **Max Sequence Length**:  
   - BART supports up to 1024 tokens. Here, we used 512 for the article and 128 for the summary to reduce training time.  
   - If your data contains very long articles, consider increasing these lengths (along with the associated memory/computational costs).

3. **Batch Size**:  
   - We used 2 for demonstration on CPU with a subset of data.  
   - Larger batch sizes (e.g., 8 or 16) can expedite training and stabilize gradients when you have sufficient GPU memory.

4. **Number of Beams**:  
   - For generation, we set `num_beams=4`. Higher beam values (e.g., 8) may yield higher-quality summaries but slow down inference.  
   - For quick experimentation or resource-constrained settings, a smaller beam might be best.

---

**Model Choice**:  
- We opted for `facebook/bart-large-cnn`, which performs well on CNN/DailyMail summarization.  
- Alternatives like `T5`, `Pegasus`, or even smaller DistilBART variants can be faster or yield different performance.  
- Model performance depends on architecture, pre-training data, and hyperparameters.  

**Summary of Current Results**:  
- **BLEU**: 12.44  
- **ROUGE-1**: 0.4047  
- **ROUGE-2**: 0.1957  
- **ROUGE-L**: 0.2930  

These results reflect our chosen hyperparameters, smaller data subset, and a limited number of epochs. In practice, more extensive training, hyperparameter tuning, or using a different model variant could further improve the scores. Nonetheless, the above metrics demonstrate that our fine-tuned BART model captures core elements of the original text reasonably well, given the constraints.  



# Part 2: Reinforcement Learning

## Task 2 (20 points): Formulate a Real-World Application as an MDP

### Example: Autonomous Driving at a Traffic Light

We can formulate controlling an autonomous vehicle at intersections as an MDP.

**State Space**:
- The state could include the vehicle’s current speed, position, distance to intersection, time to traffic light change, positions/speeds of nearby cars, etc.

**Action Space**:
- The vehicle’s possible actions might be: `Accelerate`, `Maintain Speed`, `Decelerate`, or `Stop`.

**Transition Model**:
- If the vehicle chooses to accelerate, the new speed is increased by some amount, the position is updated accordingly, and the time to the next light change is decremented. The environment (traffic light, other vehicles) also evolves in response.
- The transition probabilities could come from a traffic simulator or a real dataset of traffic patterns.

**Reward**:
- Primary goal: maximize safety and efficiency.
- A negative reward for collisions or near-collisions.
- A negative reward for running a red light.
- Possibly a slight penalty for wasted time at red lights.
- A positive reward for passing intersections safely and maintaining good speed.

No explicit formula is required here, but qualitatively, the MDP structure is:
- **States**: (vehicle state, environment state)
- **Actions**: (acceleration / deceleration / etc.)
- **Rewards**: (safety and efficiency metrics)
- **Transition**: (updates from traffic model)


## Task 3 (20 points): RL in Healthcare, Recommender Systems, or Trading

### Pick One Area: Healthcare

**Problem**: Optimizing Treatment Plans for Complex Medical Conditions  
Healthcare often involves sequential decision-making under partial observability (e.g., evolving patient states that aren't fully known). Reinforcement Learning (RL) can help clinicians determine personalized treatment strategies by incorporating medical history, lab results, and other time-varying information. The goal is to learn policies that improve long-term patient outcomes (e.g., survival, reduced adverse events) rather than relying on static, rule-based approaches.

---

**Open-Source Project**: [An Empirical Study of Representation Learning for Reinforcement Learning in Healthcare](https://github.com/MLforHealth/rl_representations)  
This repository explores how **representation learning** can improve RL in healthcare by modeling patient states more effectively under partial observability. It specifically focuses on a **sepsis** cohort from the MIMIC-III database, applying recurrent autoencoder variants to predict subsequent observations and learn informative patient-state embeddings. These learned representations are then used to train RL policies via Batch Constrained Q-learning (BCQ).

---

**Project Explanation**:
- **Data & Environment**: The authors extract a sepsis cohort from the MIMIC-III database, refining code based on the original Sepsis cohort by Komorowski et al. (2018). They compute additional acuity scores (e.g., SOFA, SAPSII, OASIS) to enrich patient trajectories.
- **Representation Learning**: Various sequential autoencoding architectures (e.g., RNN, ODERNN, AIS) learn to encode patient trajectories into a latent state that predicts future observations. This helps capture temporal and partial-observation dynamics in healthcare data.
- **Policy Learning**: Once states are learned, a discrete BCQ approach (Fujimoto et al., 2019) is used to train and evaluate RL policies on the constructed state space. The code includes scripts for training policies, logging results, and performing weighted importance sampling–based evaluations.


Using RL in this manner allows for dynamic policy updates in response to evolving patient states, leveraging learned representations that respect time-series dependencies and partial observability. This goes beyond simpler static rule-based healthcare interventions, potentially improving treatment safety and efficacy.


# Part 3: Recommender Systems
## Task 5 (30 points)

We will use the **MovieLens 100k** dataset from [GroupLens](https://grouplens.org/datasets/movielens/100k/) to build and compare two collaborative filtering (CF) recommendation models.

### Steps:
1. Data Cleaning, EDA
2. Convert to User-Item Matrix
3. Implement **two** collaborative filtering methods (e.g., **SVD** and **NMF** from `Surprise` library, or ALS from some library like `implicit`).
4. Compare performance using **two** evaluation metrics (e.g., **RMSE** and **MAE**, or Precision@k, Recall@k, etc.).


In [23]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math

data_path = "data/ml-100k/u.data"
ratings_df = pd.read_csv(
    data_path,
    sep="\t",
    names=["user", "item", "rating", "timestamp"],
    engine="python"
)

ratings_df.drop_duplicates(inplace=True)
ratings_df.dropna(inplace=True)

train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=42)

all_users = pd.concat([train_df["user"], test_df["user"]]).unique()
all_items = pd.concat([train_df["item"], test_df["item"]]).unique()

user_to_idx = {u: i for i, u in enumerate(np.sort(all_users))}
item_to_idx = {m: i for i, m in enumerate(np.sort(all_items))}

num_users = len(user_to_idx)
num_items = len(item_to_idx)

train_matrix = np.zeros((num_users, num_items), dtype=np.float32)

for row in train_df.itertuples():
    u = user_to_idx[row.user]
    i = item_to_idx[row.item]
    train_matrix[u, i] = row.rating

test_data = []
for row in test_df.itertuples():
    u = user_to_idx[row.user]
    i = item_to_idx[row.item]
    test_data.append((u, i, row.rating))

n_components = 20
nmf_model = NMF(n_components=n_components, init='random', random_state=42, max_iter=100)

user_factors_nmf = nmf_model.fit_transform(train_matrix)
item_factors_nmf = nmf_model.components_.T  # shape: (num_items, n_components)

# Predict on test set
predictions_nmf = []
ground_truth = []
for (u, i, true_rating) in test_data:
    pred_rating = np.dot(user_factors_nmf[u], item_factors_nmf[i])
    predictions_nmf.append(pred_rating)
    ground_truth.append(true_rating)

# Evaluate NMF
mse_nmf = mean_squared_error(ground_truth, predictions_nmf)
rmse_nmf = math.sqrt(mse_nmf)
mae_nmf = mean_absolute_error(ground_truth, predictions_nmf)

# We'll treat zero or missing entries as 0, which is not a perfect SVD scenario,
# but it demonstrates a second matrix factorization approach.

svd_model = TruncatedSVD(n_components=n_components, random_state=42)
user_factors_svd = svd_model.fit_transform(train_matrix)  # shape: (num_users, n_components)

# Approx item factors by pseudo-inverse:
Sigma = np.diag(svd_model.singular_values_)
Vt = svd_model.components_  # shape: (n_components, num_items)

item_factors_svd = (Sigma @ Vt).T  # shape: (num_items, n_components)

# Now let's predict on test set
predictions_svd = []
for (u, i, true_rating) in test_data:
    pred_rating = np.dot(user_factors_svd[u], item_factors_svd[i])
    predictions_svd.append(pred_rating)

# Evaluate TruncatedSVD
mse_svd = mean_squared_error(ground_truth, predictions_svd)
rmse_svd = math.sqrt(mse_svd)
mae_svd = mean_absolute_error(ground_truth, predictions_svd)

print("=== NMF Results ===")
print(f"RMSE: {rmse_nmf:.4f} | MAE: {mae_nmf:.4f}")

print("\n=== TruncatedSVD Results ===")
print(f"RMSE: {rmse_svd:.4f} | MAE: {mae_svd:.4f}")


=== NMF Results ===
RMSE: 2.6232 | MAE: 2.3547

=== TruncatedSVD Results ===
RMSE: 632.7241 | MAE: 489.4774




### Conclusion on NMF vs. TruncatedSVD

The **NMF** approach yields a reasonable **RMSE** (~2.62) and **MAE** (~2.35), suggesting it fits the rating matrix relatively well. In contrast, **TruncatedSVD** shows very high error (RMSE ~632.72, MAE ~489.48), indicating poor reconstruction of user-item ratings under our current matrix setup.

Several factors might explain why TruncatedSVD performs poorly here:

1. **Zero-Filling**: We treat missing ratings as zeros, which can significantly skew an SVD-based method. Matrix factorization approaches like NMF often handle sparse data more gracefully if the missing entries are treated as truly unknown, while SVD sees them as zeros.

2. **Implementation Detail**: With `TruncatedSVD`, we manually compute item embeddings via a pseudo-inverse operation. Minor differences or approximation errors can compound on top of zero-filling and lead to inflated predictions.

3. **Model Constraints**: NMF imposes non-negative constraints on its factors, which can align better with rating data (generally non-negative). SVD does not have this constraint, potentially allowing negative factor entries or large factor norms that drive predictions to unrealistic magnitudes.

#### Recommended Next Steps
- **Adjust Data Handling**: Consider an approach that properly treats missing ratings rather than setting them to zero.  
- **Hyperparameter Tuning**: For SVD or NMF, try varying the number of latent factors (`n_components`), regularization, or maximum iterations.  
- **Alternative Methods**: Explore methods like ALS (Alternating Least Squares) or specialized recommendation frameworks.  
- **Normalization**: Normalizing rows (users) or columns (items) might help SVD produce more stable factor embeddings.

Overall, for this setup, **NMF** is more suitable for the given MovieLens dataset in terms of rating prediction accuracy.
