# Training Different Architectures on the Ice Cream Dataset

This notebook compares several neural network architectures trained on the same small, structured dataset describing traits and contextual factors that influence whether a person might buy ice cream.  
The target is a scalar value between 0 and 1 (a probability estimated by a language model).

---

## 1. Data Representation

Each data example consists of:

- a list of **traits** (e.g. "likes sweets", "health-conscious"),
- a list of **contextual factors** (e.g. "hot summer day", "after lunch"),
- and a scalar **target probability** $y \in [0, 1]$.

All textual tokens are mapped to integer IDs using a vocabulary:
$$
V = \{t_1, t_2, \ldots, t_{|V|}\}
$$

For a given example with tokens $(t_{i_1}, \ldots, t_{i_L})$, the corresponding tensor of token IDs is:
$$
x = [i_1, i_2, \ldots, i_L].
$$

The dataset is stored in JSONL format and loaded into PyTorch via a custom `Dataset` and `DataLoader`.
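
As a minimal sketch of this mapping, the snippet below parses one hand-written JSONL record and converts its tokens to IDs. The three-entry vocabulary and the record are hypothetical; the field names `traits`, `context`, and `probability_LLM` are the ones read by the notebook's loading code, and IDs start at 0 as produced by `enumerate`.

```python
import json

# Hypothetical mini-vocabulary; the notebook's real vocabulary is larger.
vocab = ["likes sweets", "health-conscious", "hot summer day"]
token2id = {tok: i for i, tok in enumerate(vocab)}

# One JSONL record, written by hand for illustration.
line = '{"traits": ["likes sweets"], "context": ["hot summer day"], "probability_LLM": 0.9}'
example = json.loads(line)

# Traits and contextual factors are concatenated into one token sequence.
tokens = example["traits"] + example["context"]
x = [token2id[tok] for tok in tokens]   # token IDs
y = example["probability_LLM"]          # scalar target in [0, 1]
print(x, y)
```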

---

## 2. Token Embedding

Each token ID $i \in \{1, \ldots, |V|\}$ (0-based in the code) is mapped to a dense vector $E_i \in \mathbb{R}^d$, the $i$-th row of an embedding matrix:
$$
E \in \mathbb{R}^{|V| \times d}
$$

The embedded representation of a sequence is:
$$
X = [E_{i_1}, E_{i_2}, \ldots, E_{i_L}] \in \mathbb{R}^{L \times d}.
$$
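
The lookup can be illustrated with `nn.Embedding`, whose weight matrix plays the role of $E$. The sizes and token IDs below are hypothetical, chosen only to keep the shapes small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d = 5, 4                      # hypothetical |V| and d
embedding = nn.Embedding(vocab_size, d)   # weight has shape [|V|, d]

x = torch.tensor([0, 2, 3])               # token IDs of one example, L = 3
X = embedding(x)                          # rows of E selected by ID, shape [L, d]

print(X.shape)
print(torch.equal(X[1], embedding.weight[2]))  # second token is row 2 of E
```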

---

## 3. Model Architectures

### (a) Logistic Regression

This model ignores token order and computes a mean-pooled embedding:
$$
h = \frac{1}{L} \sum_{j=1}^{L} E_{i_j}.
$$

A single linear transformation followed by a sigmoid produces the output probability:
$$
\hat{y} = \sigma(w^\top h + b)
$$

This is equivalent to a logistic regression over the averaged embeddings.
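
A small worked example, using hand-picked (hypothetical) embeddings and weights, shows the two steps and confirms that reordering the tokens leaves the pooled vector unchanged.

```python
import torch

# Hypothetical embedded sequence: L = 2 tokens, d = 3.
X = torch.tensor([[1.0, 0.0, 2.0],
                  [3.0, 2.0, 0.0]])
w = torch.tensor([0.5, -1.0, 0.25])
b = torch.tensor(0.0)

h = X.mean(dim=0)              # mean pooling: [2.0, 1.0, 1.0]
logit = w @ h + b              # 0.5*2 - 1.0*1 + 0.25*1 = 0.25
y_hat = torch.sigmoid(logit)   # probability in (0, 1)

# Reversing the token order gives the same pooled vector.
h_reversed = X.flip(0).mean(dim=0)
print(h, logit.item(), y_hat.item())
```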

---

### (b) Feedforward MLP

The Feedforward MLP extends the logistic regression by passing the pooled vector through multiple nonlinear layers:
$$
h_0 = \frac{1}{L} \sum_{j=1}^{L} E_{i_j}
$$

Then, for $k = 1, \ldots, K$ hidden layers:
$$
h_k = \mathrm{ReLU}(W_k h_{k-1} + b_k)
$$

Finally, the output probability is:
$$
\hat{y} = \sigma(W_{\text{out}} h_K + b_{\text{out}})
$$

The model can have 2–4 layers with decreasing hidden dimensionality.

---

### (c) Encoder-only Transformer

This architecture uses the Transformer encoder to model interactions between tokens.  
Each encoder layer applies **self-attention** and a **feedforward network**:

1. **Self-attention:**
   $$
   \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
   $$
   where $Q = XW_Q$, $K = XW_K$, and $V = XW_V$.

2. **Feedforward sublayer:**
   $$
   \mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)W_2 + b_2
   $$

The encoder processes all tokens in parallel (order is not explicitly encoded since positional embeddings are omitted).  
Mean pooling aggregates token representations into a single vector, followed by a regression head with a sigmoid activation:
$$
\hat{y} = \sigma(W_{\text{out}} \, \frac{1}{L}\sum_j h_j + b_{\text{out}})
$$
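
The attention formula can be sketched for a single head as follows. This is a simplified illustration with hypothetical sizes and random weights; it omits the multi-head split, residual connections, and layer normalization that `nn.TransformerEncoderLayer` applies. The final check illustrates the remark about token order: without positional embeddings, permuting the input tokens leaves the mean-pooled output unchanged up to floating-point error.

```python
import math
import torch

torch.manual_seed(0)
L, d, d_k = 4, 8, 8                        # hypothetical sizes
X = torch.randn(L, d)
W_Q, W_K, W_V = (torch.randn(d, d_k) for _ in range(3))

def attention(X):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / math.sqrt(d_k)          # [L, L]
    weights = torch.softmax(scores, dim=-1)    # each row sums to 1
    return weights @ V, weights                # [L, d_k], [L, L]

out, weights = attention(X)

# Permute the tokens and compare the mean-pooled outputs.
perm = torch.tensor([2, 0, 3, 1])
out_p, _ = attention(X[perm])
same_pooled = torch.allclose(out.mean(dim=0), out_p.mean(dim=0), atol=1e-4)
print(out.shape, same_pooled)
```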

---

### (d) Decoder-only Transformer

This model is similar to GPT-style architectures.  
It uses **masked self-attention**, ensuring that each token can only attend to itself and earlier positions:
$$
M_{ij} =
\begin{cases}
0, & i \ge j \\
-\infty, & i < j
\end{cases}
$$

The causal attention mask enforces autoregressive processing.  
The rest of the computation is the same as in the encoder, followed by mean pooling and a regression head.
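
The mask $M$ can be built with `torch.triu` and added to the attention scores before the softmax. The sketch below uses hypothetical all-zero scores so that the effect of the mask is easy to read off: row $i$ spreads its weight uniformly over positions $0, \ldots, i$. The notebook's code passes the equivalent boolean form, in which `True` marks a blocked position.

```python
import torch

L = 4
# Additive mask M: 0 where i >= j (allowed), -inf where i < j (blocked).
M = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)

scores = torch.zeros(L, L)                    # hypothetical uniform scores
weights = torch.softmax(scores + M, dim=-1)   # blocked positions get weight 0
print(weights)

# Equivalent boolean form: True marks a blocked position.
bool_mask = torch.triu(torch.ones(L, L), diagonal=1).bool()
```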

---

## 4. Training Objective

All models predict a continuous scalar $\hat{y} \in [0,1]$.  
The loss function is the **mean squared error** (MSE):

$$
\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2
$$

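For a probabilistic interpretation, **binary cross-entropy** (BCE) can be used instead; it remains well defined for soft targets $y_n \in [0, 1]$. In this notebook, the Transformer models are trained with MSE and the logistic regression and MLP with BCE: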

$$
\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{n=1}^{N} [y_n \log(\hat{y}_n) + (1 - y_n)\log(1 - \hat{y}_n)]
$$
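
Both losses can be evaluated by hand on a few hypothetical target/prediction pairs. The sketch also illustrates one property of BCE with soft targets: even a perfect prediction $\hat{y}_n = y_n$ does not give a loss of 0, because the minimum of BCE is the entropy of the targets.

```python
import math

y     = [0.9, 0.2, 0.5]   # hypothetical soft targets
y_hat = [0.8, 0.3, 0.5]   # hypothetical predictions
N = len(y)

mse = sum((p - t) ** 2 for p, t in zip(y_hat, y)) / N
bce = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
           for p, t in zip(y_hat, y)) / N

# BCE of a perfect prediction (y_hat == y): the entropy of the targets.
bce_perfect = -sum(t * math.log(t) + (1 - t) * math.log(1 - t) for t in y) / N

print(f"MSE={mse:.4f}  BCE={bce:.4f}  BCE at optimum={bce_perfect:.4f}")
```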

All models are optimized with AdamW, using learning rates between $10^{-4}$ and $10^{-3}$.

---

## 5. Training Procedure

Each model is trained for a fixed number of epochs (e.g., 10) using mini-batch gradient descent:

1. Forward pass: compute $\hat{y}$  
2. Compute loss $\mathcal{L}$  
3. Backward pass: $\nabla_\theta \mathcal{L}$  
4. Update weights with AdamW

The training loss, averaged over the batches of each epoch, is recorded and plotted.
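
The four steps map onto PyTorch calls roughly as follows. This is a sketch on hypothetical stand-ins (a tiny linear model and random data, processed as a single batch), not the notebook's training utility; the one extra call, `optimizer.zero_grad()`, clears gradients left over from the previous step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-ins: a tiny model and random inputs/targets.
model = nn.Sequential(nn.Linear(3, 1), nn.Sigmoid())
inputs = torch.rand(16, 3)
targets = torch.rand(16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(20):
    preds = model(inputs).squeeze(-1)   # 1. forward pass
    loss = loss_fn(preds, targets)      # 2. compute loss
    optimizer.zero_grad()               #    clear old gradients
    loss.backward()                     # 3. backward pass
    optimizer.step()                    # 4. AdamW update
    losses.append(loss.item())

print(f"first={losses[0]:.4f}  last={losses[-1]:.4f}")
```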

---

## 6. Model Saving

Each trained model’s weights are saved for later evaluation.  
The saving function constructs a descriptive filename of the form `{ModelName}_d{Dim}_lr{LR}_epoch{E}.pt`

and stores it in the directory:
```
../models/saved_models/
```

This allows consistent comparison across architectures.
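
For illustration, the filename pattern can be reproduced with a small helper. The function `build_save_path` is hypothetical and only mirrors the naming scheme; it does not write any file.

```python
import os

def build_save_path(model_name, dim, lr, epochs, save_dir="../models/saved_models"):
    """Hypothetical helper: assemble the save path from the run's settings."""
    filename = f"{model_name}_d{dim}_lr{lr}_epoch{epochs}.pt"
    return os.path.join(save_dir, filename)

path = build_save_path("BasicEncoderTransformer", 128, 1e-4, 10)
print(path)
```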

---

## 7. Comparison

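The training loss curves and saved models provide a basis for assessing:

- how simpler models (logistic regression, MLP) compare to Transformers,
- whether attention mechanisms improve predictive performance,
- and how model capacity affects generalization on this small dataset (this requires evaluating the saved models on held-out data, since only the training loss is recorded here).

Because the logistic regression and MLP are trained with BCE and the Transformers with MSE, their raw loss values are not directly comparable.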
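
One simple proxy for model capacity is the number of trainable parameters. The helper `count_parameters` and the two stand-in networks below are hypothetical and serve only to show the calculation.

```python
import torch.nn as nn

def count_parameters(model):
    """Hypothetical helper: number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical stand-ins: a single linear head vs. a two-hidden-layer head.
small = nn.Sequential(nn.Linear(128, 1))
large = nn.Sequential(
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

print(count_parameters(small), count_parameters(large))
```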



In [None]:
# Imports & Setup

import json
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import matplotlib.pyplot as plt
import os

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)


In [None]:
# vocab and tokenization
vocab = [
    "likes sweets", "dislikes sweets", "health-conscious", "lactose intolerant",
    "cheap", "spender", "impulsive buyer",
    "hungry", "on a diet", "ice cream truck nearby", "hot summer day",
    "cold winter day", "ice cream is cheap today (discount)",
    "after a long workout", "after lunch", "after work"
]

token2id = {tok: i for i, tok in enumerate(vocab)}

def to_token_ids(example):
    ids = [token2id[tok] for tok in example["traits"] + example["context"]]
    return torch.tensor(ids, dtype=torch.long), torch.tensor(example["probability_LLM"], dtype=torch.float32)


In [None]:
# Dataset and Dataloader

class IceCreamDataset(Dataset):
    def __init__(self, path):
        self.data = []
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                self.data.append(json.loads(line.strip()))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        input_ids, target = to_token_ids(example)
        return input_ids, target


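def collate_fn(batch):
    # Right-pad every sequence in the batch to the length of the longest one.
    # Caveat: the pad value 0 is also the ID of vocab[0] ("likes sweets"),
    # and the models below mean-pool over padded positions as well.
    input_ids = [b[0] for b in batch]
    targets = torch.stack([b[1] for b in batch])
    max_len = max(len(x) for x in input_ids)
    padded = torch.stack([torch.cat([x, torch.zeros(max_len - len(x), dtype=torch.long)]) for x in input_ids])
    return padded, targets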


train_dataset = IceCreamDataset("../data/train.jsonl")
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn)


In [None]:
# Train Utility

def train_model(model, train_loader, optimizer, loss_fn, epochs=10):
    train_losses = []
    model = model.to(device)

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for input_ids, targets in train_loader:
            input_ids, targets = input_ids.to(device), targets.to(device)
            preds = model(input_ids)
            loss = loss_fn(preds, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        train_losses.append(avg_loss)
        print(f"Epoch {epoch+1}/{epochs}  Loss={avg_loss:.4f}")

    plt.plot(train_losses)
    plt.title(f"Training Loss: {model.__class__.__name__}")
    plt.xlabel("Epoch")
    plt.ylabel(f"{loss_fn.__class__.__name__} (training)")  # label matches the loss actually used (MSE or BCE)
    plt.show()

    return train_losses


In [None]:
# Save Utility
def save_model(model, optimizer, epochs, save_name=None, save_dir="../models/saved_models"):
    os.makedirs(save_dir, exist_ok=True)
    
    # Default filename if none was passed in
    if save_name is None:
        # Use the embedding dimension when the model has an `embedding` layer
        dim = model.embedding.embedding_dim if hasattr(model, "embedding") else "NA"
        save_name = (
            f"{model.__class__.__name__}_"
            f"d{dim}_"
            f"lr{optimizer.param_groups[0]['lr']}_"
            f"epoch{epochs}.pt"
        )

    save_path = os.path.join(save_dir, save_name)
    torch.save(model.state_dict(), save_path)
    print(f"💾 Model saved to: {save_path}")

In [None]:
# Train encoder only model

class BasicEncoderTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2, num_classes=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=256,
            dropout=0.1,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, num_classes),
            nn.Sigmoid()
        )

    def forward(self, input_ids):
        x = self.embedding(input_ids)
        x = self.encoder(x)
        x = x.mean(dim=1)
        return self.fc(x).squeeze(-1)

encoder_model = BasicEncoderTransformer(vocab_size=len(vocab))
optimizer = optim.AdamW(encoder_model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

encoder_losses = train_model(encoder_model, train_loader, optimizer, loss_fn, epochs=10)

save_model(encoder_model, optimizer, epochs=10)

In [None]:
# Train Decoder Only Model

class BasicDecoderTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2, num_classes=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=256,
            dropout=0.1,
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.fc = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, num_classes),
            nn.Sigmoid()
        )

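    def forward(self, input_ids):
        x = self.embedding(input_ids)
        seq_len = input_ids.size(1)
        # Boolean causal mask: True marks positions j > i that token i may not attend to
        mask = torch.triu(torch.ones(seq_len, seq_len, device=input_ids.device), diagonal=1).bool()
        # Note: nn.TransformerDecoder also cross-attends to `memory` (second argument,
        # here the unmasked embeddings), so only the self-attention is causal
        x = self.decoder(x, x, tgt_mask=mask)
        x = x.mean(dim=1)
        return self.fc(x).squeeze(-1)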

decoder_model = BasicDecoderTransformer(vocab_size=len(vocab))
optimizer = optim.AdamW(decoder_model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

decoder_losses = train_model(decoder_model, train_loader, optimizer, loss_fn, epochs=10)

save_model(decoder_model, optimizer, epochs=10)

In [None]:
# Train Binary Logistic Regression with Sigmoid

class LogisticRegressionModel(nn.Module):
    def __init__(self, vocab_size, d_model=128, num_classes=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.fc = nn.Sequential(
            nn.Linear(d_model, num_classes),
            nn.Sigmoid()  # maps the logit to a probability
        )

    def forward(self, input_ids):
        # Simple mean over all tokens (bag-of-words style)
        x = self.embedding(input_ids)   # [B, L, d_model]
        x = x.mean(dim=1)               # [B, d_model]
        out = self.fc(x)
        return out.squeeze(-1)

logreg_model = LogisticRegressionModel(vocab_size=len(vocab))
optimizer = optim.AdamW(logreg_model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

logreg_losses = train_model(logreg_model, train_loader, optimizer, loss_fn, epochs=10)

save_model(logreg_model, optimizer, epochs=10)

In [None]:
# Train Feedforward MLP

class FeedForwardMLP(nn.Module):
    def __init__(self, vocab_size, d_model=128, hidden_dims=(128, 64), num_classes=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

        layers = []
        input_dim = d_model
        for h in hidden_dims:
            layers.append(nn.Linear(input_dim, h))
            layers.append(nn.ReLU())
            input_dim = h

        layers.append(nn.Linear(input_dim, num_classes))
        layers.append(nn.Sigmoid())  

        self.network = nn.Sequential(*layers)

    def forward(self, input_ids):
        # Mean pooling ignores token order
        x = self.embedding(input_ids)  # [B, L, d_model]
        x = x.mean(dim=1)              # [B, d_model]
        out = self.network(x)
        return out.squeeze(-1)

mlp_model = FeedForwardMLP(vocab_size=len(vocab), d_model=128, hidden_dims=[128, 64])
optimizer = optim.AdamW(mlp_model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

mlp_losses = train_model(mlp_model, train_loader, optimizer, loss_fn, epochs=10)

save_model(mlp_model, optimizer, epochs=10)