We will use the diabetes dataset that ships with scikit-learn.

In [1]:
# Import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error


In [2]:
# Load the diabetes dataset
X, y = load_diabetes(return_X_y=True, as_frame=True, scaled=True)

In [3]:
# Split the data into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

Here, we call train_test_split twice: first to split the data into 80% train and 20% test, and then to split that 80% further into 75% train and 25% validation. As a result, the training set holds 60% of the data, and the validation and test sets hold 20% each.
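
The 60/20/20 proportions can be checked directly. The sketch below is a minimal check, assuming the dataset has 442 rows (the size of the scikit-learn diabetes data) and using a stand-in index array instead of the real features. The names `idx`, `rest_idx`, `train_idx`, `val_idx`, and `test_idx` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 442 rows of the diabetes dataset (assumption: only the
# row count matters when checking the split sizes).
idx = np.arange(442)

# Same two-step split as above: 80/20, then 75/25 of the remaining 80%.
rest_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=42)
train_idx, val_idx = train_test_split(rest_idx, test_size=0.25, random_state=42)

sizes = {"train": len(train_idx), "val": len(val_idx), "test": len(test_idx)}
fractions = {name: round(n / len(idx), 3) for name, n in sizes.items()}
print(sizes)      # {'train': 264, 'val': 89, 'test': 89}
print(fractions)  # {'train': 0.597, 'val': 0.201, 'test': 0.201}
```

The fractions are not exactly 0.6 and 0.2 because the test share of each split is rounded up to a whole number of rows.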

In [4]:
# Next, we run a multivariate linear regression on all the variables in the dataset using the LinearRegression class from sklearn.
# We fit the model on the training set and evaluate it on the validation set using MAE, MAPE, and R-squared.

# Running a multivariate linear regression on all variables
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predicting on the validation set
y_pred_lin = lin_reg.predict(X_val)

# Evaluating the model using R-squared, MAPE and MAE
r2_lin = r2_score(y_val, y_pred_lin)
mape_lin = mean_absolute_percentage_error(y_val, y_pred_lin)
mae_lin = mean_absolute_error(y_val, y_pred_lin)

# Printing the results
print("Multivariate linear regression results:")
print(f"R-squared: {r2_lin:.3f}")
print(f"MAPE: {mape_lin:.3f}")
print(f"MAE: {mae_lin:.3f}")


Multivariate linear regression results:
R-squared: 0.522
MAPE: 0.307
MAE: 42.939


In [5]:
# Next, we run a 2nd-degree polynomial regression on the BMI feature alone using the PolynomialFeatures class from sklearn.
# The BMI feature is expanded into a quadratic polynomial, and a linear regression is fitted on the result.
# Running a polynomial regression of the 2nd degree on the BMI feature alone
poly_bmi = PolynomialFeatures(degree=2)
X_train_bmi = poly_bmi.fit_transform(X_train[["bmi"]])
X_val_bmi = poly_bmi.transform(X_val[["bmi"]])

poly_reg_bmi = LinearRegression()
poly_reg_bmi.fit(X_train_bmi, y_train)

# Predicting on the validation set
y_pred_poly_bmi = poly_reg_bmi.predict(X_val_bmi)

# Evaluating the model using R-squared, MAPE and MAE
r2_poly_bmi = r2_score(y_val, y_pred_poly_bmi)
mape_poly_bmi = mean_absolute_percentage_error(y_val, y_pred_poly_bmi)
mae_poly_bmi = mean_absolute_error(y_val, y_pred_poly_bmi)

# Printing the results
print("Polynomial regression on BMI results:")
print(f"R-squared: {r2_poly_bmi:.3f}")
print(f"MAPE: {mape_poly_bmi:.3f}")
print(f"MAE: {mae_poly_bmi:.3f}")

Polynomial regression on BMI results:
R-squared: 0.357
MAPE: 0.383
MAE: 50.452


In [6]:
# Lastly, we run a 2nd-degree multivariate polynomial regression on all variables using the PolynomialFeatures class from sklearn.
# All features are expanded into quadratic polynomial terms, and a linear regression is fitted on them.

# Running a multivariate polynomial regression of the 2nd degree on all variables
poly_all = PolynomialFeatures(degree=2, include_bias=False)
X_train_all = poly_all.fit_transform(X_train)
X_val_all = poly_all.transform(X_val)

poly_reg_all = LinearRegression()
poly_reg_all.fit(X_train_all, y_train)

# Predicting on the validation set
y_pred_poly_all = poly_reg_all.predict(X_val_all)

# Evaluating the model using R-squared, MAPE and MAE
r2_poly_all = r2_score(y_val, y_pred_poly_all)
mape_poly_all = mean_absolute_percentage_error(y_val, y_pred_poly_all)
mae_poly_all = mean_absolute_error(y_val, y_pred_poly_all)

# Printing the results
print("Multivariate polynomial regression results:")
print(f"R-squared: {r2_poly_all:.3f}")
print(f"MAPE: {mape_poly_all:.3f}")
print(f"MAE: {mae_poly_all:.3f}")

Multivariate polynomial regression results:
R-squared: 0.360
MAPE: 0.326
MAE: 46.646


In [None]:
# Comparison of the three models on the validation set by R-squared, MAPE, and MAE.
# This cell is for display only; do not run it.

Model                                  R-squared    MAPE     MAE

Multivariate linear regression         0.522        0.307    42.939

Polynomial regression on BMI           0.357        0.383    50.452

Multivariate polynomial regression     0.360        0.326    46.646


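R-squared measures how much of the variance in the target the model explains. A perfect fit scores 1, and higher is better; on held-out data it can fall below 0 when the model does worse than always predicting the mean. MAE and MAPE measure how far the predictions are from the true values, with lower values indicating better accuracy. MAE is in the same units as the target variable, whereas MAPE is a relative error: scikit-learn reports it as a fraction, so 0.307 means about 30.7%.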
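
As a sketch of how these three metrics are computed, here are hand-written NumPy versions applied to a few invented numbers. The arrays and the helper names `mae`, `mape`, and `r2` are illustrative; the `mape` helper is simplified and assumes no target value is zero, whereas scikit-learn's version guards against division by zero.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (assumes no zero targets)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Invented example values, not taken from the diabetes dataset.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

print(f"MAE:  {mae(y_true, y_pred):.3f}")   # MAE:  22.500
print(f"MAPE: {mape(y_true, y_pred):.4f}")  # MAPE: 0.0875
print(f"R2:   {r2(y_true, y_pred):.3f}")    # R2:   0.946

# Rescaling the target by 10 scales MAE by 10 but leaves MAPE unchanged.
print(f"{mae(10 * y_true, 10 * y_pred):.3f}")   # 225.000
print(f"{mape(10 * y_true, 10 * y_pred):.4f}")  # 0.0875

# Predictions worse than the mean give a negative R-squared.
y_bad = y_true[::-1]
print(f"{r2(y_true, y_bad):.3f}")  # -3.000
```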

For a non-expert, I would explain the values as follows:

The multivariate linear regression model explains about 52% of the variation in the target variable on the validation set. Its predictions are off by about 43 units on average, or about 31% of the true value.

The polynomial regression on BMI explains about 36% of the variation in the target variable. Its predictions are off by about 50 units on average, or about 38% of the true value.

The multivariate polynomial regression model explains about 36% of the variation in the target variable. Its predictions are off by about 47 units on average, or about 33% of the true value.

My interpretation of each model's results:
* The multivariate linear regression model performs best on all three metrics, which suggests that the relationship between the features and the target is reasonably well approximated by a linear function.
* The polynomial regression on BMI has the worst MAPE and MAE and the lowest R-squared, which indicates that BMI alone, even with a quadratic term, is not enough to predict the target well.
* The multivariate polynomial regression model matches the BMI-only model on R-squared (0.360 versus 0.357) and improves on its MAPE and MAE, but it is worse than the multivariate linear regression on all three metrics. Adding the quadratic and interaction terms therefore does not help on the validation set; with 65 features fitted on about 264 training samples, the model is probably overfitting.

## How many parameters are we fitting for each of the three models? Explain these values.

* For the multivariate linear regression model, we fit 11 parameters: one intercept and one coefficient for each of the 10 features.
* For the polynomial regression on BMI, we fit 3 effective parameters: one intercept and two coefficients, one for BMI and one for BMI squared. Because poly_bmi was created with the default include_bias=True, the transformed data also contains a constant column, but it is redundant with the intercept that LinearRegression already fits and adds no extra degree of freedom.
* For the multivariate polynomial regression model, we fit 66 parameters: one intercept and 65 coefficients. Calling poly_all.get_feature_names_out() lists the 65 terms: 10 linear terms, 10 squared terms, and 45 pairwise interaction terms.
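
The counts above can be reproduced with a little combinatorics. The sketch below uses only the standard library; the helper `n_poly_features` is a hypothetical name introduced here, not a scikit-learn function.

```python
from itertools import combinations_with_replacement
from math import comb


def n_poly_features(n_features: int, degree: int = 2, include_bias: bool = False) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)  # includes the constant term
    return total if include_bias else total - 1


n = 10  # number of features in the diabetes dataset

linear_terms = n
squared_terms = n
interaction_terms = comb(n, 2)  # unordered pairs of distinct features

# Cross-check by enumerating every monomial of degree 1 or 2 explicitly.
enumerated = sum(
    len(list(combinations_with_replacement(range(n), d))) for d in (1, 2)
)

print(linear_terms, squared_terms, interaction_terms)  # 10 10 45
print(n_poly_features(n), enumerated)                  # 65 65
print(n_poly_features(n) + 1)                          # 66 (65 coefficients + intercept)
print(n_poly_features(1, include_bias=True))           # 3 (columns 1, bmi, bmi^2)
```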

## Which model would you choose for deployment, and why?

I would choose the multivariate linear regression model for deployment, because it has the highest R-squared value and the lowest MAPE and MAE values among the three models on the validation set. It is also the simplest of the two models that use all features, with 11 parameters instead of 66, which makes it easier to interpret and less prone to overfitting. Before making a final decision, I would confirm its performance on the held-out test set, which has not been used so far, and weigh other factors such as complexity, interpretability, and generalization.
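
Since a test set was split off at the start but has not been used, a final check of the chosen model could look like the sketch below. It recreates the same splits, fits on the training set only, and reports the three metrics on the test set. The names `final_model`, `y_pred_test`, and `test_scores` are introduced here for illustration, and the scores are not listed because they were not part of the original run. `scaled=True` is omitted because it is the default.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)
from sklearn.model_selection import train_test_split

# Recreate the same splits used throughout the notebook.
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

# Fit the chosen model on the training set and predict the unseen test set.
final_model = LinearRegression().fit(X_train, y_train)
y_pred_test = final_model.predict(X_test)

test_scores = {
    "R-squared": r2_score(y_test, y_pred_test),
    "MAPE": mean_absolute_percentage_error(y_test, y_pred_test),
    "MAE": mean_absolute_error(y_test, y_pred_test),
}
for name, value in test_scores.items():
    print(f"{name}: {value:.3f}")
```

One design choice to consider is refitting on the training and validation sets combined before this final check, since the validation set is no longer needed for model selection at that point.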