# EXERCISES Hyperparameter tuning


### Task 1 : Import libraries

Import the necessary libraries: pandas, numpy, matplotlib, seaborn, and sklearn.

In [17]:

# Import libraries for data handling and visualization
import pandas as pd  # for dataframes
import numpy as np  # for numerical arrays and computations
import matplotlib.pyplot as plt  # for plots
import seaborn as sns  # for statistical visualization

# Import scikit-learn modules (ML, validation, etc.)
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import validation_curve


### Task 2 : Load the data

We load the data again from 'Concrete_Data.csv', the Concrete Compressive Strength dataset used in the regression notebook.


In [6]:
data = pd.read_csv("../Datasets/Concrete_data.csv", sep=",")


### Task 3 : Polynomial regression function

Create a function 'polynomial_regression' with 2 parameters: degree (default=2) and **kwargs. The function returns a polynomial model, constructed by a pipeline 
of 'PolynomialFeatures' (with degree as parameter and include_bias set to False) and 'LinearRegression' (with **kwargs as parameter). 
What is the goal of the **kwargs parameter and what does the ** operator do?

In [7]:
class GeneralRegression:
    def __init__(self, degree=1, exp=False, log=False, **kwargs):
        self.degree = degree
        self.exp = exp
        self.log = log
        self.kwargs = kwargs
        self.model = None
        self.x_orig = None
        self.y_orig = None
        self.X = None
        self.y = None

    def fit(self, x: np.ndarray, y: np.ndarray):
        self.x_orig = x
        self.y_orig = y
        self.X = x.reshape(-1, 1)

        if self.exp:
            self.y = np.log(y)

        else:
            self.y = y

        if self.log:
            self.X = np.log(self.X)

        self.model = make_pipeline(PolynomialFeatures(degree=self.degree, include_bias=False), LinearRegression(**self.kwargs))
        self.model.fit(self.X, self.y)

    def predict(self, x: np.ndarray):
        X = x.reshape(-1, 1)

        # Apply the same input transform that was used in fit
        if self.log:
            X = np.log(X)

        y_pred = self.model.predict(X)

        # Undo the log transform of the target
        if self.exp:
            return np.exp(y_pred)

        return y_pred
    
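# In a function signature, the ** operator collects all extra keyword arguments into a dictionary (here kwargs).
# In a call such as LinearRegression(**self.kwargs), it unpacks that dictionary back into keyword arguments,
# so options like fit_intercept can be passed through to LinearRegression.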
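The cell above defines a class, whereas the task asks for a function named 'polynomial_regression'. A minimal sketch of such a function, following the task description (degree defaulting to 2, include_bias set to False, and **kwargs forwarded to 'LinearRegression'), could look like this. The step names used at the end are the lowercase class names that 'make_pipeline' assigns automatically.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def polynomial_regression(degree=2, **kwargs):
    """Return a pipeline: polynomial feature expansion followed by linear regression."""
    # **kwargs in the signature collects extra keyword arguments into a dict;
    # **kwargs in the call below unpacks that dict into keyword arguments again.
    return make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(**kwargs),
    )


# Extra keyword arguments such as fit_intercept end up in LinearRegression.
model = polynomial_regression(degree=3, fit_intercept=False)
print(model.named_steps["polynomialfeatures"].degree)
print(model.named_steps["linearregression"].fit_intercept)
```

The goal of **kwargs here is that the function does not need to list every option of 'LinearRegression' itself: whatever the caller passes beyond degree is handed on unchanged.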

### Task 4 : Validation curves

Create the target variable 'csMPa' (y) and the feature variable (X) with 'Cement', 'Water', and 'Age' as features.

In [9]:
# Use the correct column names from the CSV
y = data['csMPa']  # target variable
X = data[['cement', 'water', 'age']]  # features


Calculate the training and the validation R-squared scores for the polynomial regression models of degree 1 to 5. Use a cross-validation fold value of 5. 
Print the average scores over the different cross-validation folds for the different models.

In [13]:
degrees = range(1, 6)  # degrees 1 to 5
cv_folds = 5

train_scores_all = []
val_scores_all = []

Draw the validation curve. Use the median of the scores over the different cross-validation folds. What do you conclude about underfitting and overfitting? Is it useful to use a more complex model?

In [18]:
for degree in degrees:
    model = polynomial_regression(degree=degree)
    
    # Use validation_curve to get the training and validation scores
    train_scores, val_scores = validation_curve(
        model, X, y, 
        param_name='polynomialfeatures__degree',
        param_range=[degree],
        cv=cv_folds,
        scoring='r2'
    )
    
    train_scores_all.append(train_scores.flatten())
    val_scores_all.append(val_scores.flatten())
    
    # Print mean scores
    print(f"Degree {degree}:")
    print(f"  Training R² (mean): {train_scores.mean():.4f} (±{train_scores.std():.4f})")
    print(f"  Validation R² (mean): {val_scores.mean():.4f} (±{val_scores.std():.4f})")
    print()

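NameError: name 'polynomial_regression' is not defined

This error is raised because the Task 3 cell defines the class 'GeneralRegression' and not the 'polynomial_regression' function that the loop calls.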
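As a sketch of how the validation curve could be computed and drawn, the example below passes the whole degree range to 'validation_curve' in a single call and plots the median over the folds. The data is a synthetic stand-in (an assumption, so the block runs without the CSV); with the notebook's data, 'X_demo' and 'y_demo' would be replaced by X and y.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): three features, mildly nonlinear target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 3))
y_demo = (10 * X_demo[:, 0] - 5 * X_demo[:, 1] ** 2 + 3 * X_demo[:, 2]
          + rng.normal(0.0, 0.5, size=200))

degrees = list(range(1, 6))
pipeline = make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression())

# One call evaluates every degree; both results have shape (n_degrees, n_folds).
train_scores, val_scores = validation_curve(
    pipeline, X_demo, y_demo,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
    scoring="r2",
)

# Median over the folds (axis=1) gives one value per degree.
train_median = np.median(train_scores, axis=1)
val_median = np.median(val_scores, axis=1)

fig, ax = plt.subplots()
ax.plot(degrees, train_median, marker="o", label="training score")
ax.plot(degrees, val_median, marker="o", label="validation score")
ax.set_xlabel("polynomial degree")
ax.set_ylabel("R² (median over folds)")
ax.set_xticks(degrees)
ax.legend()
plt.show()
```

When reading such a plot, low training and validation scores together point to underfitting, while a training score that keeps rising as the validation score flattens or drops points to overfitting.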

### Task 5 : Grid Search

Use grid search to find the optimal polynomial model. Use a two-dimensional grid of model features, namely the polynomial degree from 1 to 10 and the flag telling us whether to fit the intercept. Use a cross-validation fold value of 7. Print the best parameters and the best scores (mean over the folds).
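A possible setup for this grid search is sketched below. It assumes a pipeline built with 'make_pipeline', so that the parameter names are the lowercase step names joined to the parameter by a double underscore. The data is again a synthetic stand-in (an assumption); with the notebook's data, 'X_demo' and 'y_demo' would be replaced by X and y.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): three features, mildly nonlinear target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(210, 3))
y_demo = (10 * X_demo[:, 0] - 5 * X_demo[:, 1] ** 2 + 3 * X_demo[:, 2]
          + rng.normal(0.0, 0.5, size=210))

pipeline = make_pipeline(PolynomialFeatures(include_bias=False), LinearRegression())

# Two-dimensional grid: 10 degrees x 2 intercept settings = 20 candidates.
param_grid = {
    "polynomialfeatures__degree": list(range(1, 11)),
    "linearregression__fit_intercept": [True, False],
}

grid = GridSearchCV(pipeline, param_grid, cv=7, scoring="r2")
grid.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best candidate.
print("Best parameters:", grid.best_params_)
print("Best mean R² over the folds:", grid.best_score_)
```

After fitting, 'grid.best_estimator_' holds the pipeline refitted on all the data with the best parameters, and 'grid.cv_results_' contains the per-candidate scores.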