# __Velon Murugathas__ Lab 5 Assignment

## Cross-Validation for Model Selection

In [14]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import pandas as pd

diabetes = load_diabetes()                                      # Loading the Diabetes dataset
X = diabetes.data
y = diabetes.target

## Perform cross-validation on nine polynomial models, ranging from degree 0 to 8

In [15]:
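degrees = list(range(9))                                        # Polynomial degrees 0 through 8
kf = KFold(n_splits=5)                                          # Same unshuffled 5-fold split for every model
cv_results = []

for degree in degrees:
    # Expand the 10 input features into all polynomial terms up to `degree`
    poly_features = PolynomialFeatures(degree=degree)
    X_poly = poly_features.fit_transform(X)

    model = LinearRegression()

    # scikit-learn reports error metrics negated ("neg_..."), so flip the sign to get positive errors
    r2_scores = cross_val_score(model, X_poly, y, cv=kf, scoring='r2')
    mae_scores = -cross_val_score(model, X_poly, y, cv=kf, scoring='neg_mean_absolute_error')
    mape_scores = -cross_val_score(model, X_poly, y, cv=kf, scoring='neg_mean_absolute_percentage_error')

    # Keep the per-fold scores; means and standard deviations are computed later
    cv_results.append((degree, r2_scores, mae_scores, mape_scores))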


## Include the R-Squared, Mean Absolute Error (MAE) and MAPE metrics for each model.

In [16]:
results_df = pd.DataFrame(cv_results, columns=['Degree', 'R-Squared', 'MAE', 'MAPE'])
results_df['Mean R-Squared'] = results_df['R-Squared'].apply(lambda x: x.mean())
results_df['Std Dev R-Squared'] = results_df['R-Squared'].apply(lambda x: x.std())
results_df['Mean MAE'] = results_df['MAE'].apply(lambda x: x.mean())
results_df['Std Dev MAE'] = results_df['MAE'].apply(lambda x: x.std())
results_df['Mean MAPE'] = results_df['MAPE'].apply(lambda x: x.mean())
results_df['Std Dev MAPE'] = results_df['MAPE'].apply(lambda x: x.std())

print(results_df)

   Degree                                          R-Squared   
0       0  [-0.098447684719976, -0.02786737771119996, -0....  \
1       1  [0.4295561538258379, 0.5225993866099365, 0.482...   
2       2  [0.3665137244906693, 0.5023038092363092, 0.446...   
3       3  [-11.357075742146783, -25.84895514123222, -47....   
4       4  [-36.251835231035464, -175.92042330006768, -45...   
5       5  [-35.33817240971148, -168.08323830267776, -45....   
6       6  [-35.314273049163795, -168.46434534699304, -45...   
7       7  [-35.3134145749255, -168.46994840756065, -45.3...   
8       8  [-35.30263680072528, -168.47316642280705, -45....   

                                                 MAE   
0  [60.952509787694574, 68.32170480949804, 67.135...  \
1  [43.02616605962198, 44.80048010224325, 48.1557...   
2  [45.01657162448429, 43.79149726937093, 48.4584...   
3  [152.5505617977528, 220.1011235955056, 349.465...   
4  [231.19620255360107, 350.3423632810807, 305.45...   
5  [227.96345865885243,

## Choosing the best model

In [17]:
# idxmax/idxmin return row labels, so look up the matching 'Degree' value explicitly
best_r2_row = results_df['Mean R-Squared'].idxmax()
best_mae_row = results_df['Mean MAE'].idxmin()
best_mape_row = results_df['Mean MAPE'].idxmin()
best_r2_degree = results_df.at[best_r2_row, 'Degree']

# The sentence below names a single degree, so confirm all three metrics agree on it
assert best_r2_row == best_mae_row == best_mape_row

explanation = "The best model is chosen based on the highest R-squared value and the lowest MAE and MAPE. The model with degree {} exhibits the highest R-squared of {:.4f}, the lowest MAE of {:.4f}, and the lowest MAPE of {:.4f}.".format(
    best_r2_degree, results_df.at[best_r2_row, 'Mean R-Squared'], results_df.at[best_mae_row, 'Mean MAE'], results_df.at[best_mape_row, 'Mean MAPE'])
print(explanation)


The best model is chosen based on the highest R-squared value and the lowest MAE and MAPE. The model with degree 1 exhibits the highest R-squared of 0.4823, the lowest MAE of 44.2765, and the lowest MAPE of 0.3949.


### Identification of the Best Model: Identify the model that exhibits the highest performance based on the R-Squared, MAE and MAPE metrics. Provide an explanation for choosing this specific model


The best model is the degree-1 (plain linear) model: across the five folds it has the highest mean R-squared (0.4823) and the lowest mean MAE (44.2765) and MAPE (0.3949). An R-squared of 0.4823 means the model explains a little under half of the variance in the target, so the fit is moderate rather than strong, and a MAPE of 0.3949 means predictions are off by about 39% on average. It is still the top choice among the nine models because it leads on all three metrics at once, so there is no trade-off to resolve between explaining variance and minimizing prediction error.
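
One way to see why the higher-degree models fall behind is to count the columns each polynomial expansion produces and compare that with the rows available for training. The sketch below assumes the default `PolynomialFeatures` settings (bias column included, all interaction terms kept), for which the number of output columns is the binomial coefficient C(n_features + degree, degree). The helper name `n_polynomial_terms` is hypothetical, and `train_size` is an approximation of the rows in each 5-fold training split.

```python
from math import comb

n_features = 10                     # the diabetes dataset has 10 input features
n_samples = 442                     # and 442 rows
train_size = n_samples * 4 // 5     # approximate rows per training split in 5-fold CV


def n_polynomial_terms(n_features, degree):
    """Columns produced by PolynomialFeatures(degree) with default settings."""
    return comb(n_features + degree, degree)


term_counts = {degree: n_polynomial_terms(n_features, degree) for degree in range(9)}
for degree, count in term_counts.items():
    note = "  <- more terms than training rows" if count > train_size else ""
    print(f"degree {degree}: {count:>6} terms{note}")
```

Under these assumptions the term count passes the number of training rows from degree 4 onward and is already close to it at degree 3, so an unregularized linear regression has enough freedom to fit the training folds almost exactly, which is consistent with the strongly negative validation R-squared values reported for those degrees.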

### Additional analysis and interpretation of the models' performances. You may explore further insights beyond the required metrics. The analysis should provide at least one relevant insight about the choice of the best model, or about characteristics of the chosen one

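The degree-1 model leads on the standard metrics, and the cross-validation output also shows how the alternatives fail. The degree-0 model has R-squared values slightly below zero, as expected for a model that predicts only a constant. The degree-2 model comes close to degree 1 (R-squared of roughly 0.37 to 0.50 in the folds shown), but from degree 3 upward the fold R-squared values are strongly negative and the MAE rises above 150. That pattern points to severe overfitting: the number of polynomial terms grows much faster with the degree than the number of training samples available to fit them.

Before relying on the degree-1 model in practice, it is still worth comparing its training and validation scores to confirm that it does not overfit, and checking how sensitive it is to unusual data points. A linear model is also easy to interpret and fast at prediction time, which counts in its favor. Combining it with other models might improve performance, and monitoring its errors over time would help confirm that it keeps working on new data.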
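
The overfitting check mentioned above can be sketched by comparing training and validation R-squared for the degree-1 model. This is a minimal illustration that reuses the same unshuffled 5-fold split as the lab; the variable names (`train_r2`, `test_r2`, `gap`) are assumptions, and how large a gap counts as worrying is a judgment call rather than a fixed rule.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = load_diabetes(return_X_y=True)

# Degree-1 model, evaluated on the same unshuffled 5-fold split used in the lab
model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())
cv = KFold(n_splits=5)

# return_train_score=True also scores each fitted model on its own training fold
result = cross_validate(model, X, y, cv=cv, scoring="r2", return_train_score=True)

train_r2 = result["train_score"].mean()
test_r2 = result["test_score"].mean()
gap = train_r2 - test_r2    # a small gap suggests little overfitting

print(f"mean train R-squared: {train_r2:.4f}")
print(f"mean validation R-squared: {test_r2:.4f}")
print(f"gap: {gap:.4f}")
```

Changing `degree=1` to a higher value in the same sketch would show how the gap between training and validation scores changes as the model becomes more flexible.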