# TP 6: Neural Networks

## Quick Recap: Stochastic Gradient Descent (SGD)

Gradient Descent is an iterative optimization algorithm that finds the minimum of a loss function by taking steps in the direction of the steepest descent (the negative gradient). 

The update rule is:

$$\theta = \theta - \eta \nabla L(\theta)$$

where:
- $\theta$ represents the model parameters
- $\eta$ is the **learning rate** (step size)
- $\nabla L(\theta)$ is the gradient of the loss function
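
As a quick sanity check of the update rule, the sketch below applies it to a toy loss $L(\theta) = (\theta - 3)^2$. This loss, the function names, and the starting point are illustrative assumptions and are not part of the exercise.

```python
def loss_grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)

def gd_step(theta, eta):
    """One gradient descent update: theta <- theta - eta * grad L(theta)."""
    return theta - eta * loss_grad(theta)

theta = 0.0   # arbitrary starting point
eta = 0.1     # learning rate
trajectory = [theta]
for _ in range(50):
    theta = gd_step(theta, eta)
    trajectory.append(theta)

print(f"after 1 step: {trajectory[1]:.3f}, after 50 steps: {trajectory[-1]:.5f}")
```

Each step moves $\theta$ a fraction of the way towards the minimizer $\theta = 3$; the size of that fraction is controlled by $\eta$.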

**Stochastic Gradient Descent** updates the parameters using only a small random subset of the data (a mini-batch) at each iteration, rather than the entire dataset. The gradient estimate is noisier, but this makes training:

- Faster (fewer data points to process)
- More memory-efficient
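
To make the mini-batch idea concrete, here is a minimal sketch on a deliberately simple problem: estimating the mean of a dataset by minimizing $L(\theta) = \frac{1}{2n}\sum_i (\theta - x_i)^2$, whose gradient is $\theta - \bar{x}$. The toy data, the helper name `sgd_mean`, and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=4.0, scale=2.0, size=1000)  # toy dataset

def sgd_mean(data, m, eta, steps, rng):
    """Minimize L(theta) = mean((theta - x_i)**2) / 2 using mini-batches of size m."""
    theta = 0.0
    for _ in range(steps):
        batch = rng.choice(data, size=m, replace=False)  # random mini-batch
        grad = theta - batch.mean()                      # gradient estimated on the batch only
        theta -= eta * grad
    return theta

theta_sgd = sgd_mean(data, m=10, eta=0.05, steps=2000, rng=rng)
print(f"SGD estimate: {theta_sgd:.3f} | exact minimizer (sample mean): {data.mean():.3f}")
```

Each update touches only 10 of the 1000 points, yet the estimate ends up close to the full-data minimizer, with a small residual fluctuation caused by the sampling noise.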

## 📝 Exercise 1: Stochastic Gradient Descent for Linear Regression

In this exercise, we will implement the **Stochastic Gradient Descent (SGD)** algorithm from scratch to solve a Linear Regression problem.

### The Problem Setup:

We want to fit a **linear regression model** to synthetic data: $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

where:
- $\beta_0$ is the intercept
- $\beta_1, \beta_2$ are the slope coefficients
- $\epsilon$ is Gaussian noise

We'll generate training data from known parameters, then use SGD to recover them.

In [None]:
## Import necessary libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
import random

### Part 1: Synthetic Data Generation

In [None]:
## Set the "True" Parameters
n = 60  # Number of samples
b0 = 5  # True intercept (what we want to recover)
b1 = np.array([2, -3])  # True coefficients (what we want to recover)

## Add realistic noise
mue = 0  # Mean of noise
sigmae = 5  # Standard deviation of noise

## Range of input values
xl, xh = 0, 10  # x ranges from 0 to 10

In [None]:
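def genSample(n, b0, b1, sigmae, xLow, xHigh, seedit=199, size=1):
    """Generate n samples of y = b0 + sum_k b1[k]*x[k] + noise, with x[k] ~ U(xLow, xHigh).

    The noise mean `mue` is read from the global scope. Pass a non-integer
    `seedit` (e.g. None) to draw a fresh, non-reproducible sample.
    """
    if type(seedit) == int:
        np.random.seed(seedit)
        Er = np.random.normal(mue, sigmae, n)
        x = []
        for k in range(size):
            np.random.seed(seedit + k)  # one reproducible seed per feature
            x.append(np.random.uniform(xLow, xHigh, n))  # use the arguments, not the globals xl, xh
    else:
        np.random.seed()
        Er = np.random.normal(mue, sigmae, n)
        x = []
        for k in range(size):
            np.random.seed()
            x.append(np.random.uniform(xLow, xHigh, n))
    y = b0 + Er
    for k in range(size):
        y += b1[k] * x[k]

    ## Output
    if size == 1:
        return (x[0], y, Er)
    else:
        return (x, y, Er)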

In [None]:
## Use the provided `genSample()` function to create training data
(x,y,Er) = genSample(n,b0,b1,sigmae,xl,xh,seedit=199,size=2)

In [None]:
## Organize into a DataFrame for easier inspection
data1 = {'x1': x[0], 'x2': x[1], 'error': Er, 'y': y}
df_slr = pd.DataFrame(data=data1)
df_slr

This gives us:
- `x[0]`: First feature, 60 samples
- `x[1]`: Second feature, 60 samples
- `y`: Target variable for each sample
- `Er`: Noise added to each sample

### Part 2: Implementing SGD

In this section, we will evaluate how the performance of Stochastic Gradient Descent (SGD) changes when we adjust two critical hyperparameters: 

- Mini-batch Size (m)
- Learning Rate ($\eta$)

**The Mini-batch Size (m):** This is the number of samples used to compute the gradient for each parameter update. Smaller batches give cheaper but noisier updates.

**The Learning Rate ($\eta$):** It determines how far the parameters move along the negative gradient at each update. If it is too large, the updates can overshoot the minimum or even diverge; if it is too small, convergence takes many more iterations.
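
The effect of the learning rate can be seen on a one-dimensional toy loss $L(\theta) = \theta^2$ (an assumption for illustration, not the regression loss of this exercise). With gradient $2\theta$, every update multiplies $\theta$ by $(1 - 2\eta)$, so the behaviour depends only on that factor.

```python
def run_gd(eta, steps=100, theta0=1.0):
    """Gradient descent on the toy loss L(theta) = theta**2 (gradient 2*theta)."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2.0 * theta
    return theta

# Too large, reasonable, and too small learning rates (illustrative values)
for eta in (1.1, 0.4, 0.001):
    print(f"eta={eta:<5} -> theta after 100 steps = {run_gd(eta):.4g}")
```

With $\eta = 1.1$ the factor has magnitude above 1 and the iterates blow up; with $\eta = 0.4$ they collapse to the minimum almost immediately; with $\eta = 0.001$ the parameter has barely moved after 100 steps.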

First, complete the logic inside the code cell below. Look for the `##TODO` markers to implement the gradient calculation and the parameter update step. Once your code is working, you will run the SGD algorithm multiple times to find the converged values for the parameters $(b_0,b_1,b_2)$. You should test every combination of the following settings:

- **Mini-batch Size (m):** m=1, m=10, m=n (where n is the full data size)
- **Learning Rate ($\eta$):** $\eta$=0.04, $\eta$=0.01, $\eta$=0.001

In [None]:
## The Complete SGD Function
def LinReg_SGD(T, m, eta, printit=True):
    """
    Stochastic Gradient Descent for Linear Regression
    
    Parameters:
    -----------
    T : int
        Number of iterations (one mini-batch update per iteration)
    m : int
        Mini-batch size (m=1: online, m=n: batch gradient descent)
    eta : float
        Learning rate (larger = faster but risky, smaller = slow but stable)
    
    Returns:
    --------
    [b0List, b1List, b2List, b0_final, b1_final, b2_final]
        Lists of parameter values after each iteration, plus final values
    """
    
    ## Initialize unknown linear parameters randomly
    b1_init = np.random.uniform()
    b2_init = np.random.uniform()
    b0_init = np.random.uniform()
    
    ## Track parameter values through training
    b0 = b0_init
    b0List = [b0]
    b1 = b1_init
    b1List = [b1_init]
    b2 = b2_init
    b2List = [b2]
    
    ## Main training loop
    for t in range(T):
        ## Select a random mini-batch
        if m < n:
            indx = np.random.choice(np.arange(n), size=m, replace=False)
        else:
            indx = np.arange(n)  # Full batch: use every sample at each iteration
        
        ## Use x_batch and y_batch to store the batch of data
        y_batch = y[indx]
        x_batch = [[x[0][i], x[1][i]] for i in indx]
        x_batch1 = [row[0] for row in x_batch]  # first feature of each sample in the batch
        x_batch2 = [row[1] for row in x_batch]  # second feature of each sample in the batch
        
        ## TODO #1: Calculate gradients
        ## [Add code to compute grad_b0, grad_b1, grad_b2]
        
        ## TODO #2: Update parameters 
        ## [Add code to update b0, b1, b2]
        
        ## Record values for analysis
        b0List.append(b0)
        b1List.append(b1)
        b2List.append(b2)
        
        ## Print progress
        if printit and t % 100 == 0:
            print(f'Iter {t:4d} | b0={b0:7.3f} | b1={b1:7.3f} | b2={b2:7.3f}')
    
    return [b0List, b1List, b2List, b0, b1, b2]

In [None]:
## Set hyperparameters
T = 1000  # Number of iterations (mini-batch updates)
m = 10  # Mini-batch size (must satisfy m <= n)
eta = 0.01  # Learning rate

In [None]:
## Run SGD to estimate parameters
## The estimates get their own names so the true b0 and b1 defined earlier are not overwritten
results = LinReg_SGD(T, m, eta, printit=True)
b0List, b1List, b2List, b0_hat, b1_hat, b2_hat = results

In [None]:
## Analyze results
print(f"\nTrue parameters:   b0={5.0:.3f}, b1={2.0:.3f}, b2={-3.0:.3f}")
print(f"Learned parameters: b0={b0_hat:.3f}, b1={b1_hat:.3f}, b2={b2_hat:.3f}")

In [None]:
fig, axs = plt.subplots(3, sharex=True)
fig.suptitle('Convergence of unknowns')

## Plot parameter convergence 'b0' (the label follows the current value of m)
axs[0].plot(b0List, 'r', label=f'b0, batch-size {m}')
axs[0].legend(loc=7)
axs[0].grid()

## Plot parameter convergence 'b1'
axs[1].plot(b1List, 'g', label=f'b1, batch-size {m}')
axs[1].legend(loc=7)
axs[1].grid()

## Plot parameter convergence 'b2'
axs[2].plot(b2List, '--k', label=f'b2, batch-size {m}')
axs[2].legend(loc=7)
axs[2].grid()
axs[2].set_xlabel('Iteration')

#### Question:

- Now run the algorithm with **different hyperparameters**, plot the convergence of the unknown parameters, and compare the results.

#### Answer:

In [None]:
## TODO:

#### Question:

- How does mini-batch size affect convergence?

#### Answer:

#### Question: 

- How does learning rate affect convergence?

#### Answer:

## 📝 Exercise 2: Neural Networks for Network Bandwidth Estimation

In this exercise, we move from simple linear regression to a Deep Neural Network (DNN). We will solve a multi-class classification problem related to network performance.

Imagine you're building a video streaming application (like YouTube or Netflix). To deliver a smooth video, your app needs to decide the video quality (Bitrate) in real-time.

- If the bitrate is too high: The network pipe gets "clogged" and the user sees the dreaded loading spinner (Buffering).

- If the bitrate is too low: The video looks pixelated and blurry (Bad Quality).

**Goal:** Estimate the available bandwidth between the server and the user so we can pick the right video quality.

**Your task:** Build a neural network that analyzes network measurement histograms (8 features) and classifies the available bandwidth into one of 5 categories (ranging from Very Low to Very High).

### Dataset Overview:

In this dataset, we aren't looking at raw packet data, but rather at a statistical summary of how the network is behaving: a histogram of the ratio between bits sent and bits received.

- **Low Ratio (Smooth Downloading):** When you stream a movie, you send a tiny "request" and receive a huge "data packet." The ratio is very small (e.g., 0.01).

- **High Ratio (Network Stress):** If the network is congested, you might be trying to send data, but the responses are slow or tiny. The ratio spikes (e.g., 50.0 or 100.0).

**Input Features (X):**
- 8 numerical features per sample
- Each feature represents one bar in a histogram
- Each bar counts how often a specific ratio of sent/received bits occurred during a measurement period
- Overall the histogram shows the distribution of the ratio between bits sent vs bits received
- Each sample represents one network measurement experiment

**Target Labels (y):**
- 5 classes: **12.5, 25, 37.5, 50, 75 Mbps**
- These are the available bandwidth values in the testbed

**Dataset sizes:**
- Training: 1,100 samples
- Testing: 1,000 samples

**The Problem:** 
- This is a multi-class classification task. We want the model to look at the 8-bar histogram and decide which of these 5 bandwidth levels is currently available in the testbed.

### Part 1: Data Loading and Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from imblearn.over_sampling import SMOTE
import category_encoders as ce

In [None]:
# Load the data
X_train = pd.read_csv('X_train.csv', header=None)
y_train = pd.read_csv('label_train.csv', header=None, names=['label'])
X_test = pd.read_csv('X_test.csv', header=None)
y_test = pd.read_csv('label_test.csv', header=None, names=['label'])

# Convert labels to string for clarity
y_train['label'] = y_train['label'].astype(str)
y_test['label'] = y_test['label'].astype(str)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
print("\nClass distribution in training set:")
print(y_train.value_counts().sort_index())

#### Question: 

- Is the dataset imbalanced? How might this affect the model?

#### Answer:

### Part 2: Data Pre-Processing

Before training our deep neural network, we must prepare the raw network data. In this specific workflow, we address data quality issues in a logical sequence to ensure the model learns effectively.

**Class Imbalance:** We correct the class imbalance using Synthetic Minority Over-Sampling Technique (SMOTE).
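
The core idea behind SMOTE is to create new minority-class points by interpolating between an existing sample and one of its nearest neighbours from the same class. The sketch below illustrates that idea only; it is not the `imblearn` implementation, and the toy points, the helper name `smote_like_sample`, and the choice `k=3` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority class: 5 points in 2-D (hypothetical values)
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])

def smote_like_sample(X, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(X))
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # skip position 0, which is the point itself
    j = rng.choice(neighbours)
    lam = rng.uniform()                       # interpolation factor in [0, 1)
    return X[i] + lam * (X[j] - X[i])

synthetic = np.array([smote_like_sample(X_min, k=3, rng=rng) for _ in range(4)])
print(synthetic)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region already occupied by that class.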

In [None]:
smote = SMOTE()
X_train, y_train = smote.fit_resample(X_train, y_train)

y_train.value_counts()

**Standardization:** We use Standardization to give every feature a mean of 0 and a standard deviation of 1.
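
As a rough illustration of what standardization computes, the sketch below standardizes a toy matrix by hand with NumPy. The arrays are made-up values; the point to notice is that the mean and standard deviation come from the training data only and are then reused for the test data.

```python
import numpy as np

X_tr = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])  # toy training data
X_te = np.array([[2.0, 500.0]])                              # toy test data

mu = X_tr.mean(axis=0)     # per-feature mean, from the training data only
sigma = X_tr.std(axis=0)   # per-feature standard deviation (population, ddof=0)

X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma   # reuse the training statistics

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
print(X_te_scaled)
```

After scaling, both training features have mean 0 and standard deviation 1 even though their original ranges differed by two orders of magnitude.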

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print('X_train scaled\n', X_train[0:6, :] )

**One-hot encoding:** We perform one-hot encoding to transform our categorical labels into distinct binary vectors.
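
For intuition, one-hot encoding can be written in a few lines of NumPy. This is a minimal sketch on made-up string labels, independent of the `category_encoders` call used in this notebook; the column order follows the sorted unique labels.

```python
import numpy as np

labels = np.array(['12.5', '25.0', '75.0', '25.0'])   # toy labels (hypothetical)

# classes: sorted unique labels; class_idx: position of each label in `classes`
classes, class_idx = np.unique(labels, return_inverse=True)
one_hot = np.eye(len(classes), dtype=int)[class_idx]  # row i of the identity = class i

print(classes)
print(one_hot)
```

Each row contains a single 1 in the column of its class, so the encoding implies no ordering or distance between classes.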

In [None]:
# Fit the encoder on the training labels, then encode both sets with it
one_hot_encoder = ce.OneHotEncoder(cols=['label'], use_cat_names=True)
one_hot_encoder.fit(y_train)
y_train = one_hot_encoder.transform(y_train)
y_test  = one_hot_encoder.transform(y_test)

print('y_train', y_train)

num_of_classes = y_train.shape[1]
class_names = list(y_train.columns)
print('There are', num_of_classes, 'classes. Their names are:', class_names)

### Part 3: Building the Neural Network

A Neural Network is composed of layers. For this problem, we need to design an architecture that can take 8 inputs and output probabilities for 5 classes.

1. **Input Layer:** Receives the 8 features

2. **Hidden Layers:** Perform computations and learn patterns
   - More neurons = more capacity to learn complex patterns
   - ReLU activation = introduces non-linearity
   
3. **Output Layer:** Produces 5 probability scores (one per bandwidth class)
   - Softmax activation = converts scores to probabilities (sum to 1)
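
To see how data flows through such layers, here is a plain NumPy forward pass with random, untrained weights. It is a sketch for intuition, not the Keras model to be built in this exercise; the layer widths 32 and 16 and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    """Element-wise max(z, 0): the non-linearity of the hidden layers."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Random weights and zero biases for an 8 -> 32 -> 16 -> 5 network
W1, bias1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, bias2 = rng.normal(size=(32, 16)), np.zeros(16)
W3, bias3 = rng.normal(size=(16, 5)), np.zeros(5)

X_demo = rng.normal(size=(4, 8))           # 4 toy samples with 8 features each
h1 = relu(X_demo @ W1 + bias1)             # hidden layer 1
h2 = relu(h1 @ W2 + bias2)                 # hidden layer 2
probs = softmax(h2 @ W3 + bias3)           # output layer: one probability per class

print(probs.shape, probs.sum(axis=1))
```

Training consists of adjusting the weight matrices and bias vectors so that the highest probability lands on the correct class.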

In [None]:
def build_simple_model(input_dim, num_classes):
    """
    Build a simple neural network with 2 hidden layers.
    
    Architecture:
    - Input: input_dim features
    - Hidden Layer 1: 32 neurons, ReLU activation
    - Hidden Layer 2: 16 neurons, ReLU activation
    - Output: num_classes neurons (5 here), Softmax activation → produces a probability for each bandwidth class
    """
    
    model = keras.Sequential([
        ## TODO: Add input layer with correct dimensions
        
        ## TODO: Add first hidden layer
        
        ## TODO: Add second hidden layer
        
        ## TODO: Add output layer for 5-class classification
    ])
    
    return model

### Part 4: Compiling the Model

Before we can start training, we must configure the learning process.

To do this, we define three key components:

1.  **The Optimizer (`adam`):** It decides how to update the model's weights based on the errors the model makes. `adam` is a common default because it adapts the step size for each parameter and usually converges quickly with little tuning.
2.  **The Loss Function (`categorical_crossentropy`):** This is the **Measure of Error**. It quantifies how far the predicted class probabilities are from the true label. Use `categorical_crossentropy` when the labels are one-hot encoded, as they are here.
3.  **The Metrics (`accuracy`):** This is the **Scoreboard**. While the model minimizes the loss, we monitor accuracy: the percentage of network samples classified correctly.
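
The loss and the metric can both be computed by hand on a tiny example. The sketch below assumes one-hot labels and made-up predicted probabilities; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_pred[k])."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-(y_true * np.log(y_pred)).sum(axis=1).mean())

y_true = np.array([[1, 0, 0], [0, 1, 0]])               # one-hot labels
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # hypothetical model outputs

loss = categorical_crossentropy(y_true, y_pred)
accuracy = float((y_pred.argmax(axis=1) == y_true.argmax(axis=1)).mean())
print(f"loss={loss:.4f}, accuracy={accuracy:.2f}")
```

Both samples are classified correctly, so accuracy is perfect, yet the loss is still positive because the model is not fully confident. This is why the loss keeps giving a useful training signal after accuracy saturates.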

In [None]:
## TODO: Build the model


## TODO: Compile the model with the specified settings


## TODO: Print the summary to see the architecture

#### Question:

- After printing the model summary, look at the `Total Trainable Parameters`. What do these parameters represent in terms of "weights" and "biases"? If this number is very high, what is the risk to our model?

#### Answer:

### Part 5: Training and Evaluation

Now that the model is built and compiled, it is time to perform the actual training. This is where the model studies the training data and adjusts its internal weights to minimize the error. Fill in the `.fit()` call below with the following arguments to begin the training process.

- **x:** X_train

- **y:** y_train

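- **validation_data:** (X_val, y_val), a validation set that you must first hold out from the training data (for example with `train_test_split`, imported above)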

- **epochs:** 200 (number of passes through data)

- **batch_size:** 32 (samples per gradient update)

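- **class_weight:** class_weights (to handle imbalance; define this dictionary first, or omit the argument since SMOTE has already balanced the training set)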

- **verbose:** 1 (to see training progress)
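
The names `X_val`, `y_val` and `class_weights` are not defined earlier in this notebook, so they have to be created before calling `.fit()`. One possible way, sketched here with NumPy on toy stand-in arrays, is a random 80/20 hold-out split plus "balanced" class weights $n / (K \cdot n_k)$. The array names, the 20% fraction, and the weighting formula are assumptions; Keras expects `class_weight` as a dictionary mapping class index to weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real arrays (hypothetical shapes and values)
X_all = rng.normal(size=(100, 8))
y_idx = rng.integers(0, 5, size=100)   # integer class index per sample

# Hold out 20% of the training data for validation
perm = rng.permutation(len(X_all))
n_val = int(0.2 * len(X_all))
val_idx, train_idx = perm[:n_val], perm[n_val:]
X_tr, X_val = X_all[train_idx], X_all[val_idx]
y_tr, y_val = y_idx[train_idx], y_idx[val_idx]

# "Balanced" class weights: n_samples / (n_classes * count_k)
counts = np.bincount(y_tr, minlength=5)
class_weights = {k: len(y_tr) / (5 * c) for k, c in enumerate(counts) if c > 0}

print(X_tr.shape, X_val.shape)
print(class_weights)
```

Rare classes receive weights above 1 and frequent classes weights below 1, so each class contributes roughly equally to the loss.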

In [None]:
# Start the training process and save the results in 'history'
history = model.fit(
    ## TODO: Add training data, labels, epochs, batch size, and validation data

)

### Part 6: Final Evaluation & Visualizing Success

Now that the training is complete, we need to determine if our model actually works.

Evaluate the model to see how it performs on the test set. This gives us a single number for Accuracy and Loss.

In [None]:
## TODO: 

Plot the Learning Curves. Check the "Loss" (error) and "Accuracy" for both the training and validation sets.
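
When reading learning curves, two numbers are especially informative: the epoch with the best validation accuracy and the final gap between training and validation accuracy. The sketch below computes both from a dictionary shaped like Keras' `history.history`; the values are invented for illustration and are not results from this dataset.

```python
import numpy as np

# Hypothetical curves in the shape of a Keras `history.history` dictionary
hist = {
    'accuracy':     [0.40, 0.55, 0.68, 0.78, 0.86, 0.92],
    'val_accuracy': [0.38, 0.52, 0.63, 0.66, 0.65, 0.64],
}

val_acc = np.array(hist['val_accuracy'])
best_epoch = int(val_acc.argmax())                      # 0-based epoch index
gap = hist['accuracy'][-1] - hist['val_accuracy'][-1]   # train/validation gap at the end

print(f"best validation accuracy {val_acc.max():.2f} at epoch {best_epoch + 1}")
print(f"final train-validation gap: {gap:.2f}")
```

A training accuracy that keeps rising while validation accuracy stalls or drops, as in these made-up curves, is the typical signature of overfitting.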

In [None]:
## TODO:

#### Question:

- Does the model overfit? 

#### Question:

- At what epoch does validation accuracy stop improving?