# Task 1a

### Description

To compare the different regularization strengths (lambdas) via cross-validation, we used GridSearchCV and Ridge from scikit-learn. GridSearchCV lets us specify the candidate hyperparameter values through the param_grid parameter, the evaluation metric (RMSE) through the scoring parameter, and the number of folds through the cv parameter. We used pandas DataFrames to handle the data.
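
To make the role of these three parameters concrete, the following sketch repeats by hand what the grid search does: for each lambda it fits Ridge on every training split, computes the RMSE on the held-out fold, and averages over the folds. The data here is synthetic (an assumption standing in for train.csv), and the sketch assumes that an integer cv is equivalent to an unshuffled KFold for a regressor.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for train.csv (assumption: 150 rows, 13 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=150)

lambdas = (0.1, 1, 10, 100, 200)
folds = 10

# Manual k-fold cross-validation: one mean RMSE per lambda
manual_rmse = []
for lam in lambdas:
    fold_rmse = []
    for train_idx, test_idx in KFold(n_splits=folds).split(X):
        ridge = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        pred = ridge.predict(X[test_idx])
        fold_rmse.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    manual_rmse.append(np.mean(fold_rmse))
manual_rmse = np.array(manual_rmse)

# The same comparison delegated to GridSearchCV
grid = GridSearchCV(Ridge(),
                    param_grid={'alpha': lambdas},
                    scoring='neg_root_mean_squared_error',
                    cv=folds)
grid.fit(X, y)
grid_rmse = -grid.cv_results_['mean_test_score']

print(manual_rmse)
print(grid_rmse)
```

Under these assumptions both arrays should agree, which shows that the reported score is the average of the per-fold RMSE values rather than the RMSE over all pooled predictions.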

### Main

In [1]:
# Import packages
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

In [2]:
# Specify path to train.csv
path = '/Users/gian-andreagottini/Documents/Coding/Introduction to Machine Learning/Task 1a/train.csv'

In [3]:
# Define features X_train and target y_train
train = pd.read_csv(path)

# Tuples: round brackets, immutable (fixed values)
# Lists: square brackets, mutable (changing values)
# Dictionaries: curly brackets, key-value pairs

# Capital letter = matrix = X_train
X_train = train[['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13']]

# Small letter = vector = y_train
y_train = train['y']

In [4]:
# Define number of folds (=k) for k-fold cross-validation 
folds = 10

# Define candidate regularization strengths (a larger value damps the coefficients more strongly)
lambdas = (0.1, 1, 10, 100, 200)

# Hyperparameter grid that will be passed to the GridSearchCV function
# The dictionary keys must match parameter names of the estimator (if another name is used, the code throws an error later)
# In sklearn syntax, Ridge calls the regularization strength 'alpha', which corresponds to lambda
parameters = {'alpha': lambdas}
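
The naming rule above can be checked before running the search. The sketch below uses a hypothetical helper, unknown_keys (not part of scikit-learn), that compares the grid keys against the names returned by get_params, and then shows that set_params rejects a name the estimator does not know.

```python
from sklearn.linear_model import Ridge


def unknown_keys(param_grid, estimator):
    """Hypothetical helper: return the grid keys the estimator does not accept."""
    return sorted(set(param_grid) - set(estimator.get_params()))


lambdas = (0.1, 1, 10, 100, 200)

# 'alpha' is a Ridge parameter, 'lambda' is not
print(unknown_keys({'alpha': lambdas}, Ridge()))
print(unknown_keys({'lambda': lambdas}, Ridge()))

# set_params raises ValueError for a name that Ridge does not define
try:
    Ridge().set_params(**{'lambda': 1.0})
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```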

In [5]:
# Define model
# GridSearchCV makes it possible to compare multiple hyperparameter values
model = GridSearchCV(Ridge(), # Estimator
                     # GridSearchCV evaluates every combination of values in param_grid
                     param_grid = parameters, # Hyperparameters
                     scoring = 'neg_root_mean_squared_error', # Evaluation metric (sklearn maximizes scores, so RMSE is provided negated)
                     cv = folds, # Number of folds
                     return_train_score = True)

In [6]:
# List the parameter names that Ridge accepts (these are the valid keys for param_grid)
Ridge().get_params().keys()

dict_keys(['alpha', 'copy_X', 'fit_intercept', 'max_iter', 'normalize', 'random_state', 'solver', 'tol'])

In [7]:
# Fit model
model.fit(X_train, y_train)

GridSearchCV(cv=10, estimator=Ridge(),
             param_grid={'alpha': (0.1, 1, 10, 100, 200)},
             return_train_score=True, scoring='neg_root_mean_squared_error')

In [8]:
# Extract the cross-validated root-mean-square error (RMSE)
# GridSearchCV calculates multiple values (e.g. mean_fit_time) and stores them in cv_results_
# We are only interested in mean_test_score (the score averaged over the folds), which is why we only store this value in RMSE
# The minus sign cancels out the negation from neg_root_mean_squared_error
RMSE = -model.cv_results_['mean_test_score']

# Display array
RMSE

array([5.50180945, 5.49983874, 5.48363149, 5.63664214, 5.72123372])
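
The entries of this array follow the order of the grid, so in the output above the smallest RMSE (5.48363149) belongs to lambda = 10. As a minimal sketch of how to read this off programmatically, the example below pairs each lambda with its RMSE and queries best_params_ and best_score_. It runs on synthetic data (an assumption, not the original train.csv), so its numbers will differ from those above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train.csv (assumption, not the original data)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(150, 13)),
                 columns=[f'x{i}' for i in range(1, 14)])
y = X.to_numpy() @ rng.normal(size=13) + rng.normal(scale=0.5, size=150)

lambdas = (0.1, 1, 10, 100, 200)
model = GridSearchCV(Ridge(),
                     param_grid={'alpha': lambdas},
                     scoring='neg_root_mean_squared_error',
                     cv=10)
model.fit(X, y)

RMSE = -model.cv_results_['mean_test_score']

# One row per candidate lambda, in the same order as the grid
summary = pd.DataFrame({'lambda': lambdas, 'RMSE': RMSE})
print(summary)

# best_score_ is the negated RMSE, so flip the sign again
best_lambda = model.best_params_['alpha']
best_rmse = -model.best_score_
print(best_lambda, best_rmse)
```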

In [9]:
# Create pandas DataFrame (one row per lambda)
solution = pd.DataFrame(RMSE)

# Export results (one RMSE per line, without index or header)
solution.to_csv('solution.csv', index = False, header = False)
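
To confirm the layout of the exported file, the following sketch writes the RMSE values shown above to a temporary directory (an assumption made so that the real solution.csv is not overwritten) and reads them back. With index = False and header = False the file should contain exactly one value per line.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# RMSE values copied from the output above
RMSE = np.array([5.50180945, 5.49983874, 5.48363149, 5.63664214, 5.72123372])

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'solution.csv')
    pd.DataFrame(RMSE).to_csv(out_path, index=False, header=False)

    # Raw lines of the file: one per lambda, no header row
    with open(out_path) as f:
        lines = f.read().splitlines()

    # header=None tells pandas that the first line is data
    loaded = pd.read_csv(out_path, header=None)

print(len(lines))
print(loaded.shape)
```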