# Required assignment 5.1: Applying the training set–validation set–test set approach

In this notebook, you will use training and validation sets to identify which model best fits the data, then use a test set to estimate that model's performance. You will split the data with the `train_test_split` function from scikit-learn.

### **Steps followed in the train-validate-test approach**

1. Split the available data into a training set, a validation set and a test set.

2. Fit each model separately on the training set.

3. Evaluate each model separately on the validation set.

4. Choose the model that performs best on the validation set.

5. Estimate the performance of that model on the test set.

6. Train the selected model again using all data.

In [3]:
#import the necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Generate a synthetic data set from a cubic polynomial. The code below adds Gaussian noise (standard deviation 100) to the function values. The noise-free function is defined as
$$
y = 3x^3 - 2x^2 + x + 5
$$
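
As a quick check of the formula, the noise-free function can be evaluated directly. This is a minimal sketch; the names `coeffs` and `true_function` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

# Coefficients of 3x^3 - 2x^2 + x + 5, highest power first (np.polyval's convention)
coeffs = [3, -2, 1, 5]

def true_function(x):
    """Evaluate the noise-free polynomial at x (scalar or array)."""
    return np.polyval(coeffs, x)

# At x = 2: 3*8 - 2*4 + 2 + 5 = 23
print(true_function(2.0))
```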

In [4]:
# Example: generate synthetic data (replace with your own X, y)
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X.squeeze()**3 - 2 * X.squeeze()**2 + X.squeeze() + 5 + np.random.randn(100) * 100

### Step 1: Split the data into training, validation and test sets.

#### **Question 1:** For the synthetic dataset, use the `train_test_split` function to split the data into 60% training, 20% validation and 20% test sets.

HINT: Since the `train_test_split()` function can only split the dataset into two parts, follow these steps:

- The first split uses 20% of the data for the test dataset.

- The second split uses 25% of the remaining 80% as the validation dataset (0.25 × 0.80 = 0.20 of the original data).

- The final train:validation:test proportions are 60:20:20.
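
The arithmetic behind the second split can be sketched as follows. The helper `second_split_fraction` is a hypothetical name used only for illustration; it is not part of scikit-learn.

```python
def second_split_fraction(val_fraction, test_fraction):
    """Fraction to pass as test_size in the second split so that the
    validation set is val_fraction of the ORIGINAL data."""
    remaining = 1.0 - test_fraction   # share left after the first split
    return val_fraction / remaining

frac = second_split_fraction(val_fraction=0.20, test_fraction=0.20)
print(frac)                          # 0.20 / 0.80
print(round(0.80 * frac, 2))         # validation share of the original data
print(round(0.80 * (1 - frac), 2))   # training share of the original data
```

The same helper gives the second `test_size` for other target proportions, for example 70:15:15.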

In [5]:
# Split the data into training, testing and validation.

###GRADED

# First split: Separate out the 20% test set (remaining 80% for training and validation)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Second split: 25% of the remaining 80% goes to validation (20% of the original data), leaving 60% for training
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)

# Print the shapes to verify the split
print("--- Data Split Shapes ---")
print("X_train shape\n", X_train.shape)   # Should be 60
print("X_val shape\n", X_val.shape)       # Should be 20
print("X_test shape\n", X_test.shape)     # Should be 20
print("y_train shape\n", y_train.shape)
print("y_val shape\n", y_val.shape)
print("y_test shape\n", y_test.shape)

--- Data Split Shapes ---
X_train shape
 (60, 1)
X_val shape
 (20, 1)
X_test shape
 (20, 1)
y_train shape
 (60,)
y_val shape
 (20,)
y_test shape
 (20,)


### Step 2: Fit each model separately on the training dataset.

#### **Question 2:** For the synthetic dataset, fit a `LinearRegression()` model on the polynomial features of each degree from 1 to 6.

HINT: Use the `model.fit()` function to fit the model.

In [6]:
###GRADED CELL
val_errors = {}
models = {}

for degree in range(1, 7): #loop for degrees 1, 2, 3, 4, 5, 6 (7 not included)
    # Create polynomial features for the current degree
    poly = PolynomialFeatures(degree=degree)

    # Transform the training data to include polynomial features
    X_train_poly = poly.fit_transform(X_train)
    # .fit_transform() learns the polynomial features from X_train and then applies the transformation

    # Transform the validation data using the *same* polynomial features learned from training data
    X_val_poly = poly.transform(X_val)
    # .transform() NOT .fit_transform(), to avoid data leakage from the validation set


# YOUR CODE HERE
    # Initialize a Linear Regression model
    model = LinearRegression()

    # Fit the Linear Regression model on the transformed training data
    model.fit(X_train_poly, y_train)

    # Store the fitted model and its corresponding polynomial transformer
    # We store both because we'll need the 'poly' object to transform new data (validation/test) later
    models[degree] = (model, poly)

    print(f"  - Model for degree {degree} fitted and polynomial features transformed.")

  - Model for degree 1 fitted and polynomial features transformed.
  - Model for degree 2 fitted and polynomial features transformed.
  - Model for degree 3 fitted and polynomial features transformed.
  - Model for degree 4 fitted and polynomial features transformed.
  - Model for degree 5 fitted and polynomial features transformed.
  - Model for degree 6 fitted and polynomial features transformed.
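
To see what the linear model is actually fitted on, it may help to inspect the output of `PolynomialFeatures` for a single input column. A minimal sketch with two made-up sample points (`x_demo` and `poly_demo` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x_demo = np.array([[2.0], [3.0]])          # two made-up sample points
poly_demo = PolynomialFeatures(degree=3)   # include_bias=True by default
features = poly_demo.fit_transform(x_demo)

# Columns are 1, x, x^2, x^3: the row for x = 2 is [1, 2, 4, 8]
print(features)
```

Each higher degree adds one more column, so the degree 6 model in the loop above is a linear regression on seven columns.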


### Step 3: Evaluate each model separately on the validation dataset.

#### **Question 3:** Evaluate on the validation set and use the `mean_squared_error()` to compute the error.

In [7]:
### GRADED

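# This cell runs after the Step 2 loop has finished, so `model`, `X_val_poly` and `degree`
# hold the values from the last iteration (degree 6); only that model is scored here.
# Use the fitted model to make predictions on the transformed validation data
y_val_pred = model.predict(X_val_poly)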

# Compute the Mean Squared Error (MSE) between predictions and actual validation labels
mse = mean_squared_error(y_val, y_val_pred)

# YOUR CODE HERE

# Store the MSE for the current degree in the val_errors dictionary
val_errors[degree] = mse

# Print the validation MSE for the current degree
print(f"  - Degree {degree}: Validation MSE = {mse:.2f}")

  - Degree 6: Validation MSE = 8060.76
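
Because the cell above runs once, after the fitting loop has ended, it scores only the last fitted model (degree 6), which is why a single line is printed. A minimal sketch of scoring every stored model follows, assuming `models` maps each degree to a `(model, poly)` tuple as built in Step 2. The data and splits are rebuilt here only so that the sketch runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Rebuild the data and splits so the sketch is self-contained
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X.squeeze()**3 - 2 * X.squeeze()**2 + X.squeeze() + 5 + np.random.randn(100) * 100
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)

# Fit one model per degree, as in Step 2
models = {}
for degree in range(1, 7):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    models[degree] = (model, poly)

# Score every stored model on the validation set
val_errors = {}
for degree, (model, poly) in models.items():
    y_val_pred = model.predict(poly.transform(X_val))
    val_errors[degree] = mean_squared_error(y_val, y_val_pred)
    print(f"  - Degree {degree}: Validation MSE = {val_errors[degree]:.2f}")
```

With all six entries in `val_errors`, the selection in Step 4 compares the degrees against each other, so the selected degree may differ from the one reported in this notebook.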


### Step 4: Choose the model that performs best on the validation set.

#### **Question 4**: Choose the model with the lowest validation error and print its degree.

HINT: Use `key=val_errors.get` to use the dictionary values (the validation errors) for comparison, and not the keys themselves.
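
The difference the `key` argument makes can be seen on a small dictionary. The error values below are made up for illustration only.

```python
toy_errors = {1: 5.0, 2: 1.5, 3: 2.0}   # made-up values: degree -> error

smallest_key = min(toy_errors)                       # compares the keys
best_by_value = min(toy_errors, key=toy_errors.get)  # compares the values

print(smallest_key)    # 1
print(best_by_value)   # 2
```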

In [8]:
# Choose the best model which has the min error

###GRADED
# key=val_errors.get makes min() compare the dictionary values (the validation errors), not the keys
best_degree = min(val_errors, key=val_errors.get)

print(f"Best polynomial degree: {best_degree}")

Best polynomial degree: 6


### Step 5: Estimate the performance of that model on the test set.

#### **Question 5**: Estimate the performance on the test set.

In [9]:
###GRADED

# Retrieve the best model and its corresponding polynomial transformer
best_model, best_poly = models[best_degree]

# Transform the test data using the polynomial features learned by the best model's transformer
X_test_poly = best_poly.transform(X_test)
# It's crucial to use .transform() here, not .fit_transform(), to avoid data leakage

# Use the best fitted model to make predictions on the transformed test data
y_test_pred = best_model.predict(X_test_poly)

# Compute the Mean Squared Error (MSE) between predictions and actual test labels
test_mse = mean_squared_error(y_test, y_test_pred)

print(f"Test MSE for best model (degree {best_degree}): {test_mse:.2f}")

Test MSE for best model (degree 6): 6063.63
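
The MSE is expressed in squared units of `y`, which makes it hard to read on its own. Taking the square root gives the root mean squared error (RMSE) in the same units as `y`, which can be compared with the noise standard deviation of 100 used when the data were generated. A small sketch using the value printed above:

```python
import math

test_mse = 6063.63                 # value printed by the cell above
test_rmse = math.sqrt(test_mse)    # same units as y

print(f"Test RMSE: {test_rmse:.2f}")
```

An RMSE of the same order as the noise level suggests that most of the remaining test error comes from the noise rather than from the model, although a test set of 20 points gives only a rough estimate.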


### Step 6: Retrain the best model on all available data.

In [10]:
best_poly_full = PolynomialFeatures(degree=best_degree)
X_full_poly = best_poly_full.fit_transform(X)
final_model = LinearRegression()
final_model.fit(X_full_poly, y)
print(f"Retrained best polynomial model (degree {best_degree}) on all data.")

Retrained best polynomial model (degree 6) on all data.
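
To predict at new inputs with the retrained model, the new `x` values must first pass through the same transformer (`best_poly_full.transform`). One way to keep the two steps together is `make_pipeline` from `sklearn.pipeline`. The sketch below is an illustration under stated assumptions, not the notebook's method: it uses noise-free data and a fixed `degree=3` so that the prediction can be checked by hand, and the names `X_demo`, `y_demo` and `final_pipeline` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noise-free data from y = 3x^3 - 2x^2 + x + 5 (assumption, for a checkable result)
X_demo = np.linspace(0, 10, 100).reshape(-1, 1)
y_demo = 3 * X_demo.squeeze()**3 - 2 * X_demo.squeeze()**2 + X_demo.squeeze() + 5

# degree=3 is chosen for illustration; in the notebook it would be best_degree
final_pipeline = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
final_pipeline.fit(X_demo, y_demo)

# The pipeline applies the polynomial transform before predicting
x_new = np.array([[2.0]])
prediction = final_pipeline.predict(x_new)
print(prediction)   # close to 23 on noise-free data
```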
