In [7]:
# Loading the required libraries
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score
import numpy as np
import pandas as pd

# Loading the diabetes dataset from sklearn.datasets
diabetes_df = load_diabetes()
X, y = diabetes_df.data, diabetes_df.target

# Splitting the data into training and testing sets i.e. test_size as 20% and train_size as 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluating nine polynomial models, degree 0 to 8 (range(9) yields 0..8)
# Note: each model is fitted once and scored on the single hold-out split above;
# this is not k-fold cross-validation, so there is one score per degree
degrees = range(9)
results = []

for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    r2 = r2_score(y_test, y_pred)  # R-squared on the test set
    mae = mean_absolute_error(y_test, y_pred)  # Mean absolute error on the test set
    mape = mean_absolute_percentage_error(y_test, y_pred)  # Mean absolute percentage error (as a fraction)
    
    results.append((degree, r2, mae, mape))

# Constructing a DataFrame to store the results
results_df = pd.DataFrame(results, columns=['Degree', 'R-Squared', 'MAE', 'MAPE'])

# Calculating mean and standard deviation per degree
# (with only one row per degree, the mean is that single value and std is NaN)
mean_std_df = results_df.groupby('Degree').agg({
    'R-Squared': ['mean', 'std'],
    'MAE': ['mean', 'std'],
    'MAPE': ['mean', 'std']
})

# Displaying the summary table
print(mean_std_df)


        R-Squared             MAE          MAPE    
             mean std        mean std      mean std
Degree                                             
0       -0.011963 NaN   64.006461 NaN  0.627918 NaN
1        0.452603 NaN   42.794095 NaN  0.374998 NaN
2        0.415640 NaN   43.581693 NaN  0.382857 NaN
3      -15.501646 NaN  178.067416 NaN  1.614539 NaN
4      -26.728083 NaN  261.667144 NaN  2.300991 NaN
5      -25.992920 NaN  255.968358 NaN  2.270202 NaN
6      -25.975743 NaN  255.908618 NaN  2.269658 NaN
7      -25.975478 NaN  255.906822 NaN  2.269649 NaN
8      -25.975555 NaN  255.907099 NaN  2.269653 NaN


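The standard deviation (std) is NaN for every degree because each degree has exactly one row in results_df. The loop fits each model once and scores it on a single train/test split, and pandas computes the sample standard deviation (ddof=1), which is undefined for a single value. The "mean" columns are therefore just the single test-set scores.

This is a consequence of the evaluation procedure, not of the dataset or of data sparsity. The code never performs cross-validation; to obtain a meaningful mean and standard deviation, each degree would have to be scored on several folds, for example with k-fold cross-validation.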

3. Identification of the Best Model: Identify the model that exhibits the highest performance based on the R-Squared, MAE and MAPE metrics. Provide an explanation for choosing this specific model.


* Based on the R-Squared, MAE, and MAPE values on the test set, the model with degree 1 (a linear model, i.e. ordinary linear regression) has the highest R-Squared (0.45) and the lowest MAE (42.79) and MAPE (0.37). The linear model therefore performs best among the tested degrees, with degree 2 (quadratic) a close second.

Explanation for Choosing the Specific Model (Degree 1):

R-Squared: The model with degree 1 has the highest R-Squared value (0.45), meaning it explains roughly 45% of the variance in the test targets. This is a moderate fit rather than a strong one, but it is the best of the nine models.

MAE and MAPE: The model with degree 1 also has the lowest Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). Its predictions are off by about 43 units of the target on average (MAE 42.79), or about 37% of the actual value (MAPE 0.37), which is the smallest error of any tested degree.

4. Additional analysis and interpretation of the models' performances. You may explore further insights beyond the required metrics. The analysis should provide at least one relevant insight about the choice of the best model, or about characteristics of the chosen one (for example, an analysis of the instances in which it fails).

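For the higher-degree models (degrees 3 to 8), the R-Squared values are strongly negative (about -15.5 to -26.7). A negative R-Squared means the model predicts worse on the test set than a constant model that always predicts the mean. This points to severe overfitting: the number of polynomial terms grows rapidly with the degree while the training set has only about 350 samples, so these models fit noise in the training data and generalize very poorly. The degree 0 model, which predicts a constant, scores close to zero (-0.01), as expected.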
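
To make the growth in model complexity concrete, the sketch below counts the columns that `PolynomialFeatures(degree)` generates for the 10 input features of the diabetes dataset. It relies on the standard combinatorial count C(n + d, d) for all monomials of total degree at most d in n variables, including the bias column; the variable names are illustrative.

```python
from math import comb

n_features = 10  # the diabetes dataset has 10 input features

# With the default bias column, PolynomialFeatures(degree) produces
# C(n_features + degree, degree) output columns.
n_terms = {degree: comb(n_features + degree, degree) for degree in range(9)}

for degree, count in n_terms.items():
    print(f"degree {degree}: {count} columns")
```

Degree 2 gives 66 columns, degree 3 gives 286, and degree 4 gives 1001. From degree 4 onward the column count exceeds the number of training samples, so unregularized least squares can generally fit the training set almost exactly while performing badly on unseen data.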

Based on this, the linear model (degree 1) is the recommended choice for making predictions, as it strikes the best balance between model complexity and predictive performance among the tested degrees. Other factors such as interpretability and computational cost also favor the simpler model. Further analysis, such as proper k-fold cross-validation and regularized regression, could also be beneficial.
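
As an illustration of the regularized-regression suggestion, here is a minimal sketch that swaps `LinearRegression` for `Ridge` in a degree 2 pipeline, using the same 80/20 split as above. The choice of degree 2, the `StandardScaler` step, and `alpha=1.0` are assumptions made for the example and are not tuned; no particular score is implied.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

ridge_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # Ridge fits its own intercept
    StandardScaler(),   # put all polynomial terms on a comparable scale
    Ridge(alpha=1.0),   # assumed penalty strength, not tuned
)
ridge_model.fit(X_train, y_train)

ridge_r2 = r2_score(y_test, ridge_model.predict(X_test))
print(f"Ridge, degree 2: R-Squared = {ridge_r2:.3f}")
```

In practice the penalty strength would be selected by cross-validation on the training data (for example with `RidgeCV`), and the result compared against the degree 1 baseline before drawing conclusions.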