# Project Milestone Two

**Data Preparation and Model Exploration**
**Due:** Midnight on November 16th, with the usual 2-hour grace period (**worth 100 points**)

**Note: No late assignments will be accepted; we need the time to grade them!**

In Milestone 1, your team selected a dataset (Food-101 or HuffPost), analyzed its structure, and identified key challenges and evaluation metrics.
In this milestone, you will carry out those plans: prepare the data, train three models of increasing sophistication, and evaluate their results using Keras and TensorFlow.
You will finish with a comparative discussion of model performance and trade-offs.


### Submission Guidelines

* Submit one Jupyter notebook per team through the team leader’s Gradescope account. **Include all team members' names at the top of the notebook.**
* Include all code, plots, and answers inline below.
* Ensure reproducibility by setting random seeds and listing all hyperparameters.
* Document any AI tools used, as required by the CDS policy.


## Problem 1 – Data Preparation and Splits (20 pts)

### Goals

Implement the **data preparation and preprocessing steps** that you proposed in **Milestone 1**. You’ll clean, normalize, and split your data so that it’s ready for modeling and reproducible fine-tuning.

### Steps to Follow

1. **Load your chosen dataset**

   * Use `datasets.load_dataset()` from **Hugging Face** to load **Food-101** or **HuffPost**.
   * Display basic information (e.g., number of samples, feature names, example entries).

2. **Apply cleaning and normalization**

   * **Images:**

     * Ensure all images are in RGB format.
     * Resize or crop to a consistent shape (e.g., `224 × 224`).
     * Drop or fix any corrupted files.
   * **Text:**

     * Concatenate headline + summary (for HuffPost).
     * Strip whitespace, convert to lowercase if appropriate, and remove empty samples.
     * Optionally remove duplicates or extremely short entries.

3. **Standardize or tokenize the inputs**

   * **Images:**

     * Normalize pixel values (e.g., divide by 255.0).
     * Define a minimal augmentation pipeline (e.g., random flip, crop, or rotation).
   * **Text:**

     * Create a tokenizer or `TextVectorization` layer.
     * Set a target `max_length` based on your analysis from Milestone 1 (e.g., 95th percentile).
     * Apply padding/truncation and build tensors for input + labels.

4. **Handle dataset-specific challenges**

   * If you identified **class imbalance**, compute label counts and, if needed, create a dictionary of `class_weights`.
   * If you noted **length or size variance**, verify that your truncation or resizing works as intended.
   * If you planned **noise filtering**, include the cleaning step and briefly explain your criteria (e.g., remove items with missing text or unreadable images).

5. **Create reproducible splits**

   * Split your cleaned dataset into **train**, **validation**, and **test** subsets (e.g., 80 / 10 / 10).
   * Use a fixed random seed for reproducibility (`random_seed = 42`).
   * Use **stratified splits** (e.g., with `train_test_split` and `stratify=labels`).
   * Display the size of each subset.

6. **Document your pipeline**

   * Summarize your preprocessing steps clearly in Markdown or code comments.
   * Save or display a few representative examples after preprocessing to confirm the transformations are correct.
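
The 95th-percentile rule for `max_length` mentioned in step 3 can be sketched as follows. This is only an illustration: it assumes whitespace tokenization and a nearest-rank percentile, and `percentile_length` and `sample_texts` are hypothetical names and data, not part of the assignment.

```python
import math

def percentile_length(texts, pct=95):
    """Return the nearest-rank pct-th percentile of whitespace token counts."""
    lengths = sorted(len(t.split()) for t in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Nearest-rank method: the smallest length that covers at least pct% of samples.
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]

# Hypothetical sample; in a notebook this could be the training texts.
sample_texts = ["a b c", "a b", "a b c d e f", "a", "a b c d"]
max_length = percentile_length(sample_texts, pct=95)
print(max_length)
```

A percentile-based cap keeps most samples untruncated while avoiding padding every sequence to the length of the longest outlier.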




In [8]:
# Your code here; add as many cells as you need but make it clear what the structure is. 

# import necessary libraries (Please add any other libraries you may need)

import os
import random
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers
from datasets import load_dataset, DatasetDict, ClassLabel
from sklearn.metrics import accuracy_score

In [9]:
# ============================================
# Global Configuration & Constants
# ============================================

# Reproducibility
random_seed = 42
random.seed(random_seed)
np.random.seed(random_seed)
tf.keras.utils.set_random_seed(random_seed)
os.environ['TF_DETERMINISTIC_OPS'] = '1'

# Dataset and model hyperparameters
VOCAB_SIZE = 20000
MAX_SEQ_LEN = 256
BATCH_SIZE = 64
SEP_TOKEN = "[SEP]"


In [10]:
# --- 1.1 Load Data ---

# This URL points to the raw JSON, bypassing the Hugging Face Hub 
# repository lookup error (DatasetNotFoundError).
URL = "https://huggingface.co/datasets/khalidalt/HuffPost/resolve/main/News_Category_Dataset_v2.json"

print(f"Loading HuffPost dataset from direct JSON URL:\n{URL}")
# Load the dataset using the 'json' loader
raw_ds = load_dataset("json", data_files=URL, split="train")

print("\nLoad complete:")
print(raw_ds)
print("Columns found:", raw_ds.column_names)

# --- 1.2 Preprocessing ---
# This JSON contains 'headline' and 'short_description'; join them into a single 'text' field.

def concatenate_text(example):
    example["text"] = example["headline"] + " " + SEP_TOKEN + " " + example["short_description"]
    return example

raw_ds = raw_ds.map(concatenate_text)

# Encode 'category' as integer class ids, then rename the column to 'label'
raw_ds = raw_ds.class_encode_column("category")
raw_ds = raw_ds.rename_column("category", "label")

# Store class names for later
class_names = raw_ds.features['label'].names
num_classes = len(class_names)
print(f"\nFound {num_classes} classes.")

Loading HuffPost dataset from direct JSON URL:
https://huggingface.co/datasets/khalidalt/HuffPost/resolve/main/News_Category_Dataset_v2.json

Load complete:
Dataset({
    features: ['category', 'headline', 'authors', 'link', 'short_description', 'date'],
    num_rows: 200853
})
Columns found: ['category', 'headline', 'authors', 'link', 'short_description', 'date']

Found 41 classes.


In [11]:

# 1.3 Data Splits

print("Splitting data (80/10/10 stratified) using datasets.train_test_split...")

# Split into Train (80%) and Temp (20%)
temp_ds = raw_ds.train_test_split(
    test_size=0.2, 
    seed=random_seed, 
    stratify_by_column="label"
)

# Split Temp (20%) into Validation (10%) and Test (10%)
val_test_ds = temp_ds['test'].train_test_split(
    test_size=0.5, # 50% of the 20% temp split = 10% of total
    seed=random_seed, 
    stratify_by_column="label"
)

# Combine into a final DatasetDict
ds = DatasetDict({
    'train': temp_ds['train'],
    'validation': val_test_ds['train'],
    'test': val_test_ds['test']
})

print("\nSplits created:")
print(ds)

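# Check distribution (note: .head() after sort_index() shows the first five class ids,
# not the five most frequent classes)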
print("\nChecking class distribution (Top 5 classes)...")
train_dist = pd.Series(ds['train']['label']).value_counts(normalize=True).sort_index()
val_dist = pd.Series(ds['validation']['label']).value_counts(normalize=True).sort_index()
test_dist = pd.Series(ds['test']['label']).value_counts(normalize=True).sort_index()

print("Train (head):\n", train_dist.head())
print("\nValidation (head):\n", val_dist.head())

Splitting data (80/10/10 stratified) using datasets.train_test_split...

Splits created:
DatasetDict({
    train: Dataset({
        features: ['label', 'headline', 'authors', 'link', 'short_description', 'date', 'text'],
        num_rows: 160682
    })
    validation: Dataset({
        features: ['label', 'headline', 'authors', 'link', 'short_description', 'date', 'text'],
        num_rows: 20085
    })
    test: Dataset({
        features: ['label', 'headline', 'authors', 'link', 'short_description', 'date', 'text'],
        num_rows: 20086
    })
})

Checking class distribution (Top 5 classes)...
Train (head):
 0    0.007512
1    0.006665
2    0.022541
3    0.029561
4    0.005694
Name: proportion, dtype: float64

Validation (head):
 0    0.007518
1    0.006672
2    0.022554
3    0.029525
4    0.005726
Name: proportion, dtype: float64


In [12]:
# 1.4 Data Preprocessing (TextVectorization)

print("\nInitializing TextVectorization layer...")
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQ_LEN
)

print("Adapting TextVectorization layer to *training data only*...")
vectorize_layer.adapt(ds["train"]["text"])

print("Adaptation complete.")

# Report the learned vocabulary size (this count includes the padding and OOV entries)
print(f"Vocabulary size: {vectorize_layer.vocabulary_size()}")

# Vectorization Test
print("\n--- Vectorization Test ---")
# We access the first example's 'text' field from the 'ds' object
sample_text = ds["train"][0]["text"] 
print(f"Original Text:\n{sample_text[:100]}...")

vectorized_text = vectorize_layer([sample_text])
print(f"\nVectorized (shape: {vectorized_text.shape}):\n{vectorized_text[0, :20]}...")
print("--------------------------")


Initializing TextVectorization layer...
Adapting TextVectorization layer to *training data only*...
Adaptation complete.
Vocabulary size: 20000

--- Vectorization Test ---
Original Text:
Allowing the Time to Heal [SEP] When you are both patient and healer, remember to be patient and all...

Vectorized (shape: (1, 256)):
[ 3303     2    59     4  2915     3    45    13    17   277  1953     7
 18151   528     4    19  1953     7  1264    59]...
--------------------------


In [13]:
# 1.5 Create tf.data Pipelines

def create_tf_dataset(split, is_training=True):
  
    # Select columns needed for the model
    columns_to_keep = ["text", "label"]
    
    # Convert the HF Dataset to a tf.data.Dataset;
    # .to_tf_dataset handles shuffling and batching.
    tf_ds = split.to_tf_dataset(
        columns=columns_to_keep,
        shuffle=is_training,
        batch_size=BATCH_SIZE,
        label_cols=["label"] # This formats it as (features, label)
    )

    # Map the vectorization layer over each batch.
    # The input 'features' is a dictionary: {'text': ...}
    def apply_vectorization(features, label):
        features['text'] = vectorize_layer(features['text'])
        return features['text'], label # Return (vectorized_text, label)

    tf_ds = tf_ds.map(apply_vectorization, 
                      num_parallel_calls=tf.data.AUTOTUNE)

    # Prefetch
    return tf_ds.prefetch(tf.data.AUTOTUNE)

print("Helper function 'create_tf_dataset' defined.")

print("\nBuilding tf.data pipelines...")

train_ds = create_tf_dataset(ds["train"], is_training=True)
val_ds = create_tf_dataset(ds["validation"], is_training=False)
test_ds = create_tf_dataset(ds["test"], is_training=False)

print("Pipeline creation complete.")
print(f"Train Dataset:\n{train_ds}")

# Optional: Inspect the shape of one batch
print("\n--- Pipeline Test (one batch) ---")
for text_batch, label_batch in train_ds.take(1):
    print(f"Text batch shape: {text_batch.shape}")
    print(f"Label batch shape: {label_batch.shape}")
    print(f"First text vector (first 20 tokens):\n {text_batch[0, :20]}")
    print(f"First label: {label_batch[0]}")
print("-------------------------------")

Helper function 'create_tf_dataset' defined.

Building tf.data pipelines...
Pipeline creation complete.
Train Dataset:
<_PrefetchDataset element_spec=(TensorSpec(shape=(None, 256), dtype=tf.int64, name=None), TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

--- Pipeline Test (one batch) ---
Text batch shape: (64, 256)
Label batch shape: (64,)
First text vector (first 20 tokens):
 [  917  2635   865   909     3    10    18   909   251  5777  3023  3104
     4  1909 11328  3315     7   735     1     7]
First label: 17
-------------------------------




### Graded Questions (5 pts each)

For each question, answer thoroughly but concisely, in a short paragraph, longer or shorter as needed. Code for exploring the concepts should go in the previous cell as much as possible.

1. **Data Loading and Cleaning:**
   Describe how you loaded your dataset and the key cleaning steps you implemented (e.g., handling missing data, normalizing formats, or removing duplicates).



1.1. **Your answer here:**

We used the Hugging Face `datasets` library to load the data directly from a JSON mirror URL of the original dataset.

As our key cleaning step, we concatenated the `headline` and `short_description` columns with a `[SEP]` token (matching our Milestone 1 plan) to create a single `text` field. We also used `class_encode_column` to normalize the string `category` into an integer `label` for modeling.
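
The optional cleaning steps listed in the assignment (stripping whitespace, removing empty samples, removing duplicates) are not shown in the cells above. A minimal sketch of how they could look on plain Python records follows; `clean_records` and the sample `rows` are hypothetical, and on a Hugging Face `Dataset` the same logic could be applied through `Dataset.map` and `Dataset.filter`.

```python
def clean_records(records, min_chars=1):
    """Strip whitespace, drop empty texts, and drop case-insensitive exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Treat missing (None) fields as empty strings before stripping.
        headline = (rec.get("headline") or "").strip()
        description = (rec.get("short_description") or "").strip()
        if len(headline) + len(description) < min_chars:
            continue  # nothing usable in either field
        key = (headline.lower(), description.lower())
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        cleaned.append({**rec, "headline": headline, "short_description": description})
    return cleaned

# Hypothetical rows for illustration only.
rows = [
    {"headline": "  Hello World ", "short_description": "First item.", "category": "A"},
    {"headline": "hello world", "short_description": "first item.", "category": "A"},
    {"headline": "", "short_description": None, "category": "B"},
    {"headline": "Only a headline", "short_description": "", "category": "C"},
]
cleaned_rows = clean_records(rows)
print(len(cleaned_rows))
```

Running the filter before the train/validation/test split keeps duplicated articles from landing in more than one split.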


2. **Preprocessing and Standardization:**
   Summarize your preprocessing pipeline. Include any normalization, tokenization, resizing, or augmentation steps, and explain why each was necessary for your dataset.
  

1.2. **Your answer here:**

Our preprocessing pipeline was built around the `tf.keras.layers.TextVectorization` layer, with all steps designed to run efficiently within the `tf.data` pipeline.

1.  **Text Normalization:** As planned in Milestone 1, we concatenated the `headline` and `short_description` fields using a `[SEP]` token. This creates a single `text` input, allowing the model to understand the full context of the article.
2.  **Standardization:** By default, the `TextVectorization` layer converts text to lowercase and strips punctuation. This reduces vocabulary noise, since "Word" and "word." map to the same token. One side effect is that the brackets in `[SEP]` are stripped as well, so the separator enters the vocabulary as `sep`.
3.  **Tokenization:** We called `.adapt()` **only on the training dataset (`ds["train"]["text"]`)** to build the vocabulary, preventing any data leakage. Texts are then converted into integer indices based on this vocabulary, limited to `max_tokens=20000`.
4.  **Padding & Truncation:** All text sequences are padded or truncated to a fixed length of `MAX_SEQ_LEN=256` tokens. This is necessary because every sequence in a batch must have the same length to be stacked into a single tensor.

This entire transformation process was included in the `tf.data` pipeline's `.map()` operation and optimized with `.prefetch()`, allowing the CPU to prepare the next batch while the GPU is training. (Because this is text data, no resizing or augmentation was applied.)
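
To make the standardization and truncation steps concrete, here is a small pure-Python approximation of what the layer's default `lower_and_strip_punctuation` setting does before tokens are mapped to integer ids. It is a sketch, not the Keras implementation: `standardize` and `to_tokens` are hypothetical helpers, and the punctuation set is taken from `string.punctuation`.

```python
import re
import string

# Character class covering ASCII punctuation (an approximation of the Keras default).
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")

def standardize(text):
    """Lowercase the text and remove punctuation characters."""
    return _PUNCT_RE.sub("", text.lower())

def to_tokens(text, max_len):
    """Split on whitespace and truncate to at most max_len tokens."""
    return standardize(text).split()[:max_len]

example = "Allowing the Time to Heal [SEP] When you are both patient and healer"
tokens = to_tokens(example, max_len=8)
print(tokens)
```

Note that the separator survives only as the plain word `sep`, because the brackets are removed along with other punctuation. A custom `standardize` callable would be needed to keep `[SEP]` as a distinct token.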



3. **Train/Validation/Test Splits:**
   Explain how you divided your data into subsets, including the split ratios, random seed, and any stratification or leakage checks you used to verify correctness.


1.3. **Your answer here:**

We divided the data into **train (80%), validation (10%), and test (10%)** subsets. To maintain consistency with Milestone 1, we used the `datasets` library's `.train_test_split` method and passed `seed=random_seed` (42) for reproducibility.

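The most important detail was the **`stratify_by_column="label"`** option. Given the severe class imbalance we found in Milestone 1, stratification ensures that all three splits share nearly the same class distribution, so rare categories are not lost from the validation or test sets by chance. (Stratification preserves the imbalance; it does not correct it.)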

We performed two key checks for correctness:
1.  **Stratification Check:** We used `value_counts(normalize=True)` on the train and validation sets and confirmed their label proportions were nearly identical.
2.  **Leakage Check:** We prevented data leakage by calling `vectorize_layer.adapt()` **only on the training set (`ds["train"]["text"]`)**, ensuring no validation or test data statistics influenced our vocabulary.
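
Both checks can be expressed as small helper functions. The sketch below is illustrative only: the function names and toy inputs are hypothetical, and the overlap check goes one step beyond the vocabulary-only leakage check described above by looking for texts that appear verbatim in two splits.

```python
from collections import Counter

def label_proportions(labels):
    """Map each label to its share of the split."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def max_proportion_gap(labels_a, labels_b):
    """Largest absolute difference in per-class proportion between two splits."""
    pa, pb = label_proportions(labels_a), label_proportions(labels_b)
    return max(abs(pa.get(k, 0.0) - pb.get(k, 0.0)) for k in set(pa) | set(pb))

def overlapping_texts(texts_a, texts_b):
    """Texts that occur verbatim in both splits (a simple duplicate-leakage check)."""
    return set(texts_a) & set(texts_b)

# Toy inputs; in a notebook these could be the label and text columns of each split.
train_labels = [0] * 80 + [1] * 20
val_labels = [0] * 8 + [1] * 2
gap = max_proportion_gap(train_labels, val_labels)
shared = overlapping_texts(["a", "b", "c"], ["c", "d"])
print(gap, shared)
```

A gap close to zero indicates the stratification worked, and an empty overlap set indicates that no identical text is shared between the two splits.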



4. **Class Distribution and Balance:**
   Report your label counts and describe any class imbalances you observed. If applicable, explain how you addressed them (e.g., weighting, oversampling, or data augmentation).


1.4. **Your answer here:**

As confirmed in **Milestone 1**, the HuffPost dataset exhibits a **severe class imbalance** across its 41 classes. We re-checked this in cell `In [11]` using `value_counts(normalize=True)`: among the first five class ids alone, training-set proportions range from about 0.6% to about 3.0%.

In this **Problem 1 (Data Prep)** stage, we did not apply methods like oversampling or data augmentation to directly modify this distribution.

Instead, we used **`stratify_by_column="label"`** during our splits so that the imbalance is **reflected proportionally** across the train, validation, and test sets. This does not rebalance the data, but it keeps minority classes represented in every split, so validation and test metrics for those classes are computed on a comparable distribution.
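
If weighting is added later, the `class_weights` dictionary mentioned in the assignment could be built with the inverse-frequency "balanced" heuristic, `n_samples / (n_classes * count)`. The sketch below uses hypothetical toy labels; it is one common choice, not the only one, and was not applied in this notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class_id: n_samples / (n_classes * count)} for each observed class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}

# Toy labels: class 0 is frequent, class 2 is rare.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
class_weights = balanced_class_weights(toy_labels)
print(class_weights)
```

The resulting dictionary can be passed to Keras as `model.fit(..., class_weight=class_weights)`, which scales each sample's loss by the weight of its class.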


## Problem 2 – Baseline Model (20 pts)

### Goal

Build and train a **simple, fully functional baseline model** to establish a reference level of performance for your dataset.
This baseline will help you evaluate whether later architectures and fine-tuning steps actually improve results.


### Steps to Follow

1. **Construct a baseline model**

   * **Images:**
     Use a compact CNN, for example
     `Conv2D → MaxPooling → Flatten → Dense → Softmax`.
   * **Text:**
     Use a small embedding-based classifier such as
     `Embedding → GlobalAveragePooling → Dense → Softmax`.
   * Keep the model small enough to train in minutes on Colab.

2. **Compile the model**

   * Optimizer: `Adam` or `AdamW`.
   * Loss: `categorical_crossentropy` for one-hot labels, or `sparse_categorical_crossentropy` for integer-encoded labels (multi-class).
   * Metrics: at least `accuracy`; add `F1` if appropriate.

3. **Train and validate**

   * Use **early stopping** on validation loss with a fixed patience value (e.g., 5 epochs). Note that `patience` defaults to 0 in Keras's `EarlyStopping`, so set it explicitly.
   * Record number of epochs trained and total runtime.

4. **Visualize results**

   * Plot **training vs. validation accuracy and loss**.
   * Carefully observe: does the model underfit, overfit, or generalize reasonably?

5. **Report baseline performance**

   * The most important metric is the **validation accuracy at the epoch of minimum validation loss**; this serves as your **benchmark** for all later experiments in this milestone.
   * Evaluate on the **test set** and record final metrics.
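
The benchmark in step 5 (validation accuracy at the epoch of minimum validation loss) can be read from the dictionary that Keras stores in `History.history`. The sketch below assumes the keys `val_loss` and `val_accuracy`, which is what Keras records when the model is compiled with `metrics=["accuracy"]`; the numbers are made up for illustration and are not results from this notebook.

```python
def benchmark_from_history(history):
    """Return (epoch, val_loss, val_accuracy) at the epoch of minimum validation loss."""
    val_loss = history["val_loss"]
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    # Epochs are reported 1-based to match the Keras training log.
    return best + 1, val_loss[best], history["val_accuracy"][best]

# Made-up values for illustration only.
toy_history = {
    "val_loss": [1.90, 1.42, 1.31, 1.35, 1.40],
    "val_accuracy": [0.45, 0.58, 0.61, 0.62, 0.60],
}
epoch, loss, acc = benchmark_from_history(toy_history)
print(epoch, loss, acc)
```

In this toy example the highest validation accuracy occurs one epoch after the minimum validation loss, which is why the benchmark is defined by the loss rather than by the peak accuracy.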

In [14]:
# Your code here; add as many cells as you need but make it clear what the structure is. 


### Graded Questions (5 pts each)

1. **Model Architecture:**
   Describe your baseline model and justify why this structure suits your dataset.

2.1. **Your answer here:**



2. **Training Behavior:**
   Summarize the model’s training and validation curves. What trends did you observe?

2.2. **Your answer here:**



3. **Baseline Metrics:**
   Report validation and test metrics. What does this performance tell you about dataset difficulty?

2.3. **Your answer here:**



4. **Reflection:**
   What are the main limitations of your baseline? Which specific improvements (depth, regularization, pretraining) would you try next?
  

2.4. **Your answer here:**



## Problem 3 – Custom (Original) Model (20 pts)

### Goal

Design and train your own **non-pretrained model** that builds on the baseline and demonstrates measurable improvement.
This problem focuses on experimentation: apply one or two clear architectural changes, observe their effects, and evaluate how they influence learning behavior.


### Steps to Follow

1. **Modify or extend your baseline architecture**

   * Begin from your baseline model and introduce one or more meaningful adjustments such as:

     * Adding **dropout** or **batch normalization** for regularization.
     * Increasing **depth** (extra convolutional or dense layers).
     * Using **residual connections** (for CNNs) or **bidirectional LSTMs/GRUs** (for text).
     * Trying alternative activations like `ReLU`, `LeakyReLU`, or `GELU`.
   * Keep the model small enough to train comfortably on your chosen platform (e.g., Colab).

2. **Identify the specific limitations you want to address**

   * Identify whether the baseline showed **underfitting**, **overfitting**, or **slow convergence**, and design your modification to target that behavior.
   * Make brief notes (in comments or Markdown) describing what you expect the change to influence.

3. **Train and evaluate under the same conditions**

   * Use the **same data splits**, **random seed**, and **metrics** as in Problem 2.
   * Apply **early stopping** on validation loss.
   * Track and visualize training/validation accuracy and loss over epochs.

4. **Compare outcomes to the baseline**

   * Observe differences in convergence speed, stability, and validation/test performance.
   * Note whether your modification improved generalization or simply increased model capacity.
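
One way to judge whether a modification "simply increased model capacity" is to compare trainable parameter counts before training. The sketch below applies the standard formulas for embedding, dense, and LSTM layers. The vocabulary size (20000) and class count (41) come from the notebook above; the embedding width and LSTM units are assumed values chosen for illustration.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim

def dense_params(in_dim, out_dim):
    return in_dim * out_dim + out_dim  # weights + biases

def lstm_params(in_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (in_dim + units) + units)

VOCAB, CLASSES = 20000, 41   # from the notebook configuration
EMBED, UNITS = 64, 32        # assumed sizes, for illustration only

# Embedding -> GlobalAveragePooling -> Dense(softmax)
baseline = embedding_params(VOCAB, EMBED) + dense_params(EMBED, CLASSES)

# Embedding -> Bidirectional(LSTM) -> Dense(softmax)
bilstm = (embedding_params(VOCAB, EMBED)
          + 2 * lstm_params(EMBED, UNITS)       # forward and backward directions
          + dense_params(2 * UNITS, CLASSES))   # concatenated direction outputs

print(baseline, bilstm)
```

With these sizes the embedding table dominates both totals, so the recurrent layer adds relatively few parameters even though it typically adds considerable training time.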

### Graded Questions (5 pts each)

1. **Model Design:**
   Describe the architectural changes you introduced compared with your baseline model and what motivated them.

3.1. **Your answer here:**



2. **Training Results:**
   Present key validation and test metrics. Did your modifications improve performance?

3.2. **Your answer here:**



3. **Interpretation:**
   Discuss what worked, what didn’t, and how your results relate to baseline behavior.

3.3. **Your answer here:**



4. **Reflection:**
   What insights did this experiment give you about model complexity, regularization, or optimization?

3.4. **Your answer here:**



## Problem 4 – Pretrained Model (Transfer Learning) (20 pts)

### Goal

Apply **transfer learning** to see how pretrained knowledge improves accuracy, convergence speed, and generalization.
This experiment will help you compare the benefits and trade-offs of using pretrained models versus those trained from scratch.


### Steps to Follow

1. **Select a pretrained architecture**

   * **Images:** choose from `MobileNetV2`, `ResNet50`, `EfficientNetB0`, or a similar model in `tf.keras.applications`.
   * **Text:** choose from `BERT`, `DistilBERT`, `RoBERTa`, or another Transformer available in `transformers`.

2. **Adapt the model for your dataset**

   * Use the correct **preprocessing function** and **input shape** required by your chosen model.
   * Replace the top layer with your own **classification head** (e.g., `Dense(num_classes, activation='softmax')`).

3. **Apply transfer learning**

   * Choose an appropriate **training strategy** for your pretrained model. Options include:

     * **Freezing** the pretrained base and training only a new classification head.
     * **Partially fine-tuning** selected upper layers of the base model.
     * **Full fine-tuning** (all layers trainable) with a reduced learning rate.
   * Adjust your learning rate schedule to match your strategy (e.g., smaller LR for fine-tuning).
   * Observe how your chosen approach affects **validation loss**, **training time**, and **model stability**.

4. **Train and evaluate under consistent conditions**

   * Use the same **splits**, **metrics**, and **evaluation protocol** as in earlier problems.
   * Record training duration, validation/test performance, and any resource constraints (GPU memory, runtime).

5. **Compare and analyze**

   * Observe how transfer learning changes both **performance** and **efficiency** relative to your baseline and custom models.
   * Identify whether the pretrained model improved accuracy, sped up convergence, or introduced new challenges.
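
For the learning-rate adjustment in step 3, a schedule often paired with fine-tuning is a short linear warmup followed by linear decay. The function below is a framework-independent sketch of that shape; the peak rate of `2e-5` and the 100 warmup steps are assumed values, and `warmup_linear_decay` is a hypothetical helper, not a Keras or `transformers` API.

```python
def warmup_linear_decay(step, total_steps, peak_lr=2e-5, warmup_steps=100):
    """Learning rate at a given 0-based step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from peak_lr / warmup_steps to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

total = 1000
schedule = [warmup_linear_decay(s, total) for s in range(total + 1)]
print(schedule[0], schedule[99], schedule[550], schedule[1000])
```

A small peak rate limits how far pretrained weights move from their initial values, and the warmup avoids large updates while the new classification head is still randomly initialized.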


### Graded Questions (5 pts each)

1. **Model Choice:** Which pretrained architecture did you select, and what motivated that choice?

4.1. **Your answer here:**



2. **Fine-Tuning Plan:** Describe your fine-tuning strategy and why you chose it. 

4.2. **Your answer here:**



3. **Performance:** Report key metrics and compare them with your baseline and custom models.

4.3. **Your answer here:**



4. **Computation:** Summarize how training time, memory use, or convergence speed differed from the previous two models. 

4.4. **Your answer here:**



## Problem 5 – Comparative Evaluation and Discussion (20 pts)

### Goal

Compare your **baseline**, **custom**, and **pretrained** models to evaluate how design choices affected performance, efficiency, and generalization.
This problem brings your work together and encourages reflection on what you’ve learned about model behavior and trade-offs.

**Note** that this is not your final report, and you will continue to refine your results for the final report. 

### Steps to Follow

1. **Compile key results**

   * Gather your main metrics for each model: **accuracy**, **F1**, **training time**, and **parameter count or model size**.
   * Ensure all numbers come from the same evaluation protocol and test set.

2. **Visualize the comparison**

   * Present results in a **single, well-organized chart or table**.
   * Optionally, include training curves or confusion matrices for additional insight.

3. **Analyze comparative performance**

   * Observe which model performed best by your chosen metric(s).
   * Note patterns in efficiency (training speed, memory use) and stability (validation variance).

4. **Inspect model behavior**

   * Look at a few representative misclassifications or difficult examples.
   * Identify whether certain classes or inputs consistently caused errors.

5. **Plan forward improvements**

   * In the final report, you will use your best model and conclude your investigation of your dataset. Based on your observations, decide on a model and next steps for refining your approach in the final project (e.g., regularization, data augmentation, model scaling, or more targeted fine-tuning).
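
Because the classes are imbalanced, accuracy alone can hide poor performance on rare categories, so macro-averaged F1 is a useful companion metric for the comparison table. The sketch below computes both from label lists in plain Python; the function names and toy predictions are hypothetical, and in practice `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true positives, false positives, or false negatives scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy predictions: the rare class 2 is never predicted correctly.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
print(round(accuracy(y_true, y_pred), 4), round(macro_f1(y_true, y_pred), 4))
```

In the toy example accuracy is noticeably higher than macro-F1 because the missed minority class counts as one third of the macro average but only one sixth of the samples.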

### Graded Questions (4 pts each)

1. **Summary Table and Performance Analysis:** Present a clear quantitative comparison of all three models. Which model achieved the best overall results, and what factors contributed to its success?

5.1. **Your answer here:**



2. **Trade-Offs:** Discuss how complexity, accuracy, and efficiency balanced across your models.

5.2. **Your answer here:**



3. **Error Patterns:** Describe the types of examples or classes that remained challenging for all models.

5.3. **Your answer here:**



4. **Next Steps:** Based on these findings, decide on a model to go forward with and outline your plan for improving that model. 


5.4. **Your answer here:**



### Final Question: Describe what use you made of generative AI tools in preparing this Milestone. 

**AI Question: Your answer here:**