# TP 2: Regression and Cross Validation for Network Data Analysis

## 📝 Exercise 1: Linear Regression with Training and Test Data

In this exercise, you will revisit the simple linear regression exercise from last week, this time with a focus on training and evaluating the model using a train/test split.

### Step 1: Import Required Libraries

First, let's import the necessary libraries for data manipulation, visualization and regression modeling.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import train_test_split

### Step 2: Generate the Data

Before training a model, we need data that represents the problem we want to solve. In this lab, we use the predefined function `genSample()` to generate synthetic data that mimics real-world behavior (e.g., network-related variables).

* a) Use the provided `genSample()` function from the previous lab session to generate your dataset.

    - Choose appropriate parameters: number of samples, intercept, slope and noise level
    - Store the output into variables for input features and target labels

#### Answer:

In [1]:
## TODO:

### Step 3: Split the Dataset

When building predictive models, we want to know not only how well the model fits the **data it sees**, but also how well it will perform on **unseen data**. 

To do this, we split the dataset into two parts:
- A **training set** to learn from
- A **test set** to evaluate generalization

This helps prevent **overfitting** and provides a more realistic estimate of how your model will perform in production or deployment.

* b) Split the dataset into a **training set** and a **test set**.

    - Use `train_test_split()` from `sklearn.model_selection`
    - Set a test size of 20% and use a random seed for reproducibility
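As an illustration of the split described above, here is a minimal sketch on a hypothetical toy dataset (the arrays `X_demo` and `y_demo` and the seed `42` are assumptions for the example, not part of the lab data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, one feature (not the lab's dataset)
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 3.0 + 2.0 * X_demo.ravel()

# Hold out 20% of the samples; random_state fixes the shuffle so the
# same split is obtained on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # 8 training samples, 2 test samples
```

Note that scikit-learn expects the feature array to be two-dimensional, hence the `reshape(-1, 1)` for a single feature.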

#### Answer:

In [2]:
## TODO:

### Step 4: Train the Model

Once the data is split, the next step is to fit a linear regression model using only the training set. This means the model will learn the best-fitting line by minimizing the error between the predicted and actual values on the training data. 

* c) Create and train a linear regression model using only the **training data**.

    - Use `LinearRegression()` from `sklearn.linear_model`
    - Fit the model using `.fit(...)` with your training inputs and outputs
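To see what fitting produces, the following sketch trains a model on hypothetical noise-free points lying on the line $y = 1 + 2x$ (the data and the names `X_train_demo`, `y_train_demo`, `model_demo` are assumptions for illustration). On such data the fitted intercept and slope should match the line up to numerical precision.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free training data on the line y = 1 + 2x
X_train_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train_demo = 1.0 + 2.0 * X_train_demo.ravel()

# Least-squares fit on the training data only
model_demo = LinearRegression()
model_demo.fit(X_train_demo, y_train_demo)

# The learned parameters are exposed as attributes after fitting
print("intercept:", model_demo.intercept_)
print("slope:", model_demo.coef_[0])
```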

#### Answer:

In [3]:
## TODO:

### Step 5: Make Predictions

Once the model is trained, you can use it to predict outcomes for the test set, which simulates new, unseen data. This step evaluates how well the model generalizes.

* d) Use the trained model to predict the outputs for your test set.

    - Use the `.predict(...)` method
    - Store the predicted values for comparison with the actual test labels

#### Answer:

In [5]:
## TODO:

### Step 6: Evaluate the Model

To measure how well the model performs, we use quantitative evaluation metrics. Two common ones are:

- **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual values
- **R² Score**: Indicates how much of the variance in the output variable is explained by the model (closer to 1 is better)

A low MSE and high R² generally indicate a good model fit.
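Both metrics can be written out explicitly: $MSE = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The sketch below computes them by hand on hypothetical values (the arrays `y_true_demo` and `y_pred_demo` are made up for illustration) and compares the results with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values (illustration only)
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])

# MSE: average of the squared residuals
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# R^2: one minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true_demo, y_pred_demo))
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```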

* e) Evaluate the accuracy of your predictions using the test data.

    - Compute **Mean Squared Error (MSE)** using `mean_squared_error(...)`
    - Compute **R² Score** using `r2_score(...)`

* f) Comment on the results.
    - Do the MSE and R² scores indicate a good fit on the test set?
    - What might cause a model to perform poorly on test data, even if it performs well on the training data?
    - Is there any indication of overfitting or underfitting?

#### Answer:

In [None]:
## TODO:

### Step 7: Visualize the Results

Visualizing your model's predictions helps you understand how well it fits the data. For a well-fitted model, the regression line should align closely with the true data points in the test set.

* g) Plot the following on a graph:

    - The original test data (as a scatter plot)
    - The model's predicted outputs (regression line)
    - Add axis labels, a title, and a legend.

#### Answer:

In [None]:
## TODO:

## 📝 Exercise 2: Polynomial Regression

In this exercise, you will explore **polynomial regression** as an extension of linear models. Polynomial regression allows us to fit more complex, non-linear relationships by using polynomial features derived from a single input variable.

We are going to generate synthetic data based on a non-linear, degree-3 polynomial function: $y = 4 + 2x + 0.5x^2 - 0.07x^3 + \epsilon$.

* The input $x$ is sampled uniformly from the interval [0, 10].
* The noise $\epsilon$ is sampled from a normal distribution with mean 0 and standard deviation $\sigma = 5$.

This is a cubic polynomial with Gaussian noise added. This equation represents the hidden ground truth that we will later try to discover using regression.

You will then fit regression models of varying polynomial degrees (from 1 to 16) and evaluate how well they approximate the true curve.

The general form of a polynomial regression model with degree $\ell$ is: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x + \hat{\beta}_2x^2 +\ldots + \hat{\beta}_{\ell}x^{\ell}$.

Although you are still working with a single input variable $x$, transforming it into $x^2, x^3, \ldots$ means the model effectively operates in a **higher-dimensional space**.
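The feature transformation can be made concrete with `PolynomialFeatures` from `sklearn.preprocessing`. In this minimal sketch, the input values in `x_demo` are hypothetical; each row of the output holds $x, x^2, x^3$ for one sample.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical single-feature input with three samples
x_demo = np.array([[1.0], [2.0], [3.0]])

# include_bias=False drops the constant column of ones, since
# LinearRegression already fits an intercept
poly_demo = PolynomialFeatures(degree=3, include_bias=False)
X_poly_demo = poly_demo.fit_transform(x_demo)

print(X_poly_demo)
# [[ 1.  1.  1.]
#  [ 2.  4.  8.]
#  [ 3.  9. 27.]]
```

A linear model fitted on `X_poly_demo` is linear in three features, even though all of them derive from the same input $x$.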

Your task is to evaluate how different polynomial degrees affect the model's performance and generalization.

### Step 1: Generate Synthetic Data

Using the following parameters:
- $\sigma = 5$ (standard deviation of the noise)
- $n = 200$ data points
- $x \sim \text{Uniform}(0, 10)$

We generate data for:
  - Feature: $x$
  - Derived features: $x^2, x^3$
  - Noise: $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
  - Target: $y = 4 + 2x + 0.5x^2 - 0.07x^3 + \varepsilon$

In [None]:
# Parameters
n = 200
b0 = 4
b1 = np.array([2,0.5,-0.07])
mue, sigmae = 0, 5
xl, xh = 0, 10

# Set random seed for reproducibility
np.random.seed(199)
Er = np.random.normal(mue, sigmae, n)
np.random.seed(199)

# Generate synthetic x values and stack the features x, x^2, x^3 row-wise (shape: 3 x n)
x0 = np.random.uniform(xl, xh, n)
x = np.vstack([x0, x0**2, x0**3])

# Generate true y values using the polynomial plus noise
y = b0 + b1[0]*x[0] + b1[1]*x[1] + b1[2]*x[2] + Er

Then we store the data in data frames and prepare it for polynomial regression. From this point on you don't know how this data was produced, you just have the data on x and y as a data frame and are asked to do a regression.

In [None]:
dataSynth = {'x': x[0],'y': y}
df = pd.DataFrame(data=dataSynth)
df

### Step 2: Split the Dataset

We first split the data into a training set and a test set.

* a) Split the data into a **training set** and a **test set** using an 80/20 ratio.

    - Use `train_test_split()` from `sklearn.model_selection`
    - Set a random seed for reproducibility

#### Answer:

In [None]:
## TODO:

### Step 3: Explore the Effect of Polynomial Degree

In this part, you will investigate how changing the **degree ℓ of the polynomial** affects the model's ability to fit and generalize.

Polynomial regression is still linear regression but applied to transformed features: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \dots + \hat{\beta}_\ell x^\ell$

* b) Fit a degree-3 polynomial regression model to the training data.
* c) Predict the output on the test set.
* d) Plot the resulting regression curve along with the original data points.
* e) Print the MSE and R² score for the test set.
* f) Does the degree-3 polynomial fit the shape of the true curve well? Comment on the R² score.

#### Answer:

In [None]:
## TODO:

### Step 4: Compare Models with Different Polynomial Degrees

In this part, you will systematically evaluate how the degree of the polynomial affects your model's performance.

* g) Now repeat the entire procedure for the following degrees: $\ell = 1,\ 2,\ 3,\ 6,\ 9,\ 16$

  For each degree ℓ:
  - **Hint**: You can transform the input using `PolynomialFeatures(degree=ℓ)` from `sklearn.preprocessing`
  - Train the model on the training data
  - Predict on both training and test sets
  - Compute and store:
    - **Mean Squared Error (MSE)** for training and test sets
    - **R² Score** for training and test sets
  - Store all these results so you can visualize the trends across different polynomial degrees.
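One possible way to organize such a comparison is to chain the feature transformation and the regression in a pipeline and loop over the degrees. The sketch below is an assumption about structure, not a reference solution: it uses a hypothetical noisy quadratic, a fixed split, and the degrees 1, 2 and 5 instead of the lab's data and degrees.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: a noisy quadratic (not the lab's cubic dataset)
rng = np.random.default_rng(0)
x_demo = rng.uniform(-1.0, 1.0, size=(60, 1))
y_demo = 1.0 + x_demo.ravel() ** 2 + rng.normal(0.0, 0.1, size=60)

# Simple fixed split for illustration: first 45 samples train, last 15 test
x_tr, x_te = x_demo[:45], x_demo[45:]
y_tr, y_te = y_demo[:45], y_demo[45:]

results_demo = {}
for degree in (1, 2, 5):
    # The pipeline expands x into polynomial features, then fits least squares
    pipe = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    pipe.fit(x_tr, y_tr)
    results_demo[degree] = {
        "mse_train": mean_squared_error(y_tr, pipe.predict(x_tr)),
        "mse_test": mean_squared_error(y_te, pipe.predict(x_te)),
    }

for degree, scores in results_demo.items():
    print(degree, scores)
```

Storing the scores in a dictionary keyed by degree keeps them easy to plot later. The training MSE cannot increase with the degree, because each higher-degree model contains the lower-degree ones; the test MSE carries no such guarantee.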

#### Answer:

In [None]:
## TODO:

### Step 5: Visualize and Analyze the Results

Now use the collected results to analyze model performance.

* h) Plot the $MSE$ value vs. $\ell$ for the training dataset. 
* i) Plot the $MSE$ value vs. $\ell$ for the test dataset. 
* j) What do you observe?

#### Answer:

In [None]:
## TODO:

* k) Plot the $R^2$ value vs. $\ell$ for the training dataset. 
* l) Plot the $R^2$ value vs. $\ell$ for the test dataset. 
* m) What do you observe?

#### Answer:

In [None]:
## TODO:

## 📝 Exercise 3: Cross Validation

In this exercise, you will explore different **cross-validation (CV)** strategies for evaluating a model's generalization ability on small datasets.

Cross-validation helps estimate how well a model trained on one subset of data performs on unseen data.

### Step 1: Define a Toy Dataset

We will work with a very small dataset, defined as: $D_6 = \left\{(1,3),\ (2,4),\ (3,8),\ (4,9),\ (5,12),\ (7,14) \right\}$

Each pair $(x_i, y_i)$ represents an input–output example.

* a) Store this dataset as `NumPy` arrays or a `Pandas` DataFrame.

#### Answer:

In [None]:
## TODO:

### Step 2: Cross-Validation Splits

In this part, you will explore how different cross-validation strategies divide the dataset into **training and test sets**.

For each method below, print out all train-test index splits and understand how the dataset is divided.

For each method, clearly print:
- The train indices
- The test indices
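The splitters in `sklearn.model_selection` share a common interface: `.split(X)` yields one pair of index arrays (train, test) per fold. The following sketch shows the pattern on a hypothetical four-sample array `X_demo` with 2-fold and leave-one-out splitting, which are assumptions chosen to keep the output short.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Hypothetical array with four samples (not the lab's dataset)
X_demo = np.array([[10.0], [20.0], [30.0], [40.0]])

splitters_demo = [
    ("2-fold", KFold(n_splits=2)),
    ("leave-one-out", LeaveOneOut()),
]

for name, splitter in splitters_demo:
    print(name)
    # Each iteration yields positional indices, not the data values
    for train_idx, test_idx in splitter.split(X_demo):
        print("  train:", train_idx, "test:", test_idx)
```

Without shuffling, `KFold` assigns consecutive blocks of samples to the test folds, so the first fold here tests on indices 0 and 1.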

* b) Leave-One-Out Cross Validation (LOOCV):
   - Use: `LeaveOneOut()` from `sklearn.model_selection`

#### Answer:

In [None]:
## TODO:

* c) 3-Fold Cross Validation:
   - Use: `KFold(n_splits=3)` from `sklearn.model_selection`

#### Answer:

In [None]:
## TODO:

* d) Bootstrap Method:
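   - Use: `resample(...)` from `sklearn.utils` to draw 5 bootstrap resamples (sampling with replacement); recent scikit-learn versions do not provide a `Bootstrap` class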
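Recent scikit-learn versions do not ship a bootstrap splitter, so one possible approach (an assumption of this sketch, not the only option) is to resample the index array with replacement using `sklearn.utils.resample` and to take the indices that were never drawn, the out-of-bag samples, as the test set. The sample count and the seeds are hypothetical.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical dataset size; indices stand in for the samples
n_samples_demo = 6
indices_demo = np.arange(n_samples_demo)

bootstrap_splits = []
for b in range(5):
    # Draw n indices with replacement; the seed b makes each resample reproducible
    train_idx = resample(
        indices_demo, replace=True, n_samples=n_samples_demo, random_state=b
    )
    # Out-of-bag indices: samples that were not drawn into the training set
    test_idx = np.setdiff1d(indices_demo, train_idx)
    bootstrap_splits.append((train_idx, test_idx))
    print("train:", np.sort(train_idx), "test:", test_idx)
```

Because sampling is done with replacement, a training set usually contains duplicates, and the size of the test set varies from one resample to the next.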

#### Answer:

In [None]:
## TODO:

### Step 3: Apply Cross-Validation for Model Evaluation

Now you will fit a linear regression model and compute the average test MSE using a specific CV strategy.

* e) Use 2-Fold Cross Validation to estimate the model's average test error.
   - Use `cross_validate(...)` or `cross_val_score(...)` from `sklearn.model_selection`
   - Use `KFold(n_splits=2)`

Print the MSE for each fold, and compute the average test error across the two folds.
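A minimal sketch of this evaluation with `cross_val_score` follows, on hypothetical data (`X_demo` and `y_demo` are made up for the example). scikit-learn scorers follow a "higher is better" convention, so the MSE is requested as `"neg_mean_squared_error"` and the sign is flipped afterwards.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical, roughly linear data with eight samples
X_demo = np.arange(8, dtype=float).reshape(-1, 1)
y_demo = np.array([1.1, 2.9, 5.2, 7.1, 8.8, 11.2, 12.9, 15.1])

# One score per fold; each is the negated MSE on that fold's test part
neg_mse_scores = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=2),
    scoring="neg_mean_squared_error",
)

mse_per_fold = -neg_mse_scores
print("MSE per fold:", mse_per_fold)
print("average test MSE:", mse_per_fold.mean())
```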

#### Answer:

In [7]:
## TODO:

### Step 4: Apply the Same Method to the Advertising Dataset (**TV-sales** pair)

The dataset "Advertising.csv" can be downloaded from: http://faculty.marshall.usc.edu/gareth-james/ISL/data.html

If the previous link doesn't work, please follow this backup link: https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/master/data/Advertising.csv

In [None]:
directory = ""  # TODO: set to your data directory, including the trailing path separator
prefix = "Advertising.csv"
filename1 = directory+prefix
dataAd = np.loadtxt(filename1, delimiter=",",skiprows=1,usecols=[1,2,3,4])

In [None]:
# pandas
pdAd = pd.DataFrame(dataAd, columns=["TV","radio","newspaper","sales"])

In [None]:
TV = pdAd.iloc[:,0].values
#radio = pdAd.iloc[:,1].values
#news = pdAd.iloc[:,2].values
sales = pdAd.iloc[:,3].values

#### **Answer:**

In [None]:
## TODO: