### 1.Get the data, and run a train-validation-test split. A description of each column can be found in the sklearn documentation. See the documentation for the load_diabetes method to learn what the as_frame and scaled arguments are for.

In [18]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.preprocessing import PolynomialFeatures

In [19]:
# return_X_y=True returns a (features, target) pair; as_frame=True gives pandas objects
X, y = load_diabetes(return_X_y=True, as_frame=True, scaled=False)

In [20]:
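# First hold out 20% of the rows, then split that part in half: roughly 80% train, 10% validation, 10% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)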
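
As a quick sanity check on the two-stage split above, the arithmetic below works out the expected subset sizes. It is a minimal sketch in plain Python: `split_sizes` is a hypothetical helper (not part of sklearn), and it assumes that `train_test_split` rounds a fractional test size up and gives the remainder to the first part.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second (held-out) part is rounded up and the first
    part receives the remaining rows.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 442  # number of rows in the diabetes dataset
n_train, n_temp = split_sizes(n_total, 0.2)   # 80% train, 20% held out
n_valid, n_test = split_sizes(n_temp, 0.5)    # held-out part split in half

print(n_train, n_valid, n_test)
```

Comparing these numbers with `len(X_train)`, `len(X_valid)` and `len(X_test)` in the notebook is a cheap way to confirm the split did what was intended.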

In [21]:
# Multivariate linear regression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred = linear_model.predict(X_valid)

# Evaluating multivariate linear regression model
r2_linear = r2_score(y_valid, y_pred)
mae_linear = mean_absolute_error(y_valid, y_pred)
mape_linear = mean_absolute_percentage_error(y_valid, y_pred)

In [23]:
# Polynomial regression
degree = 2  
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
X_valid_poly = poly.transform(X_valid)

poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
y_pred_poly = poly_model.predict(X_valid_poly)

# Evaluating the polynomial regression model
r2_poly = r2_score(y_valid, y_pred_poly)
mae_poly = mean_absolute_error(y_valid, y_pred_poly)
mape_poly = mean_absolute_percentage_error(y_valid, y_pred_poly)

In [24]:
# Create and fit a multivariate linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = linear_model.predict(X_test)

# Evaluate the model using R-squared, Mean Absolute Error, and Mean Absolute Percentage Error
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)

In [26]:
# Print evaluation metrics for both models
print("Multivariate Linear Regression:")
print(f"R-squared: {r2_linear:.2f}")
print(f"Mean Absolute Error: {mae_linear:.2f}")
print(f"Mean Absolute Percentage Error: {mape_linear:.2f}")

print(f"\nPolynomial Regression (Degree {degree}):")
print(f"R-squared: {r2_poly:.2f}")
print(f"Mean Absolute Error: {mae_poly:.2f}")
print(f"Mean Absolute Percentage Error: {mape_poly:.2f}")

Multivariate Linear Regression:
R-squared: 0.43
Mean Absolute Error: 41.13
Mean Absolute Percentage Error: 0.38

Polynomial Regression (Degree 2):
R-squared: 0.38
Mean Absolute Error: 41.17
Mean Absolute Percentage Error: 0.36
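
To make the three printed metrics less opaque, here is a minimal pure-Python sketch of what each one computes. The function names and the toy numbers are illustrative assumptions, not values from the diabetes data, and the `mape` shown here is simplified: sklearn's version additionally guards against division by zero.

```python
def r2(y_true, y_pred):
    """Share of the target's variance explained by the model (1.0 is perfect)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Average absolute gap between prediction and truth, in target units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Average absolute gap as a fraction of the true value (0.38 means 38%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 180.0, 330.0]

print(f"R-squared: {r2(y_true, y_pred):.2f}")
print(f"MAE: {mae(y_true, y_pred):.2f}")
print(f"MAPE: {mape(y_true, y_pred):.2f}")
```

Note that MAPE is reported as a fraction, so the 0.36 to 0.38 values printed above correspond to errors of roughly 36% to 38% of the true target value.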


### 2.Run a multivariate linear regression on all variables

In [30]:
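# Load the Diabetes dataset with the defaults (scaled=True, as_frame=False),
# so X and y are NumPy arrays and the features are already mean-centered and scaled
data = load_diabetes()
X = data.data
y = data.target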

In [31]:
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [32]:
# Create and fit a multivariate linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

In [33]:
# Make predictions on the test set
y_pred = linear_model.predict(X_test)

In [34]:
# Evaluating the model using R-squared, Mean Absolute Error, and Mean Absolute Percentage Error
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)

In [35]:
print("Multivariate Linear Regression:")
print(f"R-squared: {r2:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Mean Absolute Percentage Error: {mape:.2f}")

Multivariate Linear Regression:
R-squared: 0.45
Mean Absolute Error: 42.79
Mean Absolute Percentage Error: 0.37


### 3.Run a polynomial regression of the 2nd degree on the BMI feature alone

In [36]:
# Extract the BMI feature (column index 2); the 2:3 slice keeps X_bmi two-dimensional
X_bmi = X[:, 2:3]

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_bmi, y, test_size=0.2, random_state=42)

In [37]:
# To create polynomial features of degree 2
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

In [38]:
# Create and fit a polynomial regression model
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)

In [39]:
# Make predictions on the test set
y_pred = poly_model.predict(X_test_poly)

In [40]:
# Evaluate the model using R-squared, Mean Absolute Error, and Mean Absolute Percentage Error
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)

In [41]:
# Print the evaluation metrics
print("Polynomial Regression (2nd Degree) on BMI:")
print(f"R-squared: {r2:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Mean Absolute Percentage Error: {mape:.2f}")

Polynomial Regression (2nd Degree) on BMI:
R-squared: 0.23
Mean Absolute Error: 52.38
Mean Absolute Percentage Error: 0.46
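
For a single input column, a degree-2 polynomial expansion simply turns each value x into the columns 1, x and x^2, and the linear regression is then fitted on those columns. The sketch below mirrors that idea in plain Python; `poly2_single` is a hypothetical helper and the BMI values are made up for illustration.

```python
def poly2_single(x, include_bias=True):
    """Degree-2 expansion of one feature value: [1, x, x^2] (or [x, x^2])."""
    terms = [x, x ** 2]
    return [1.0] + terms if include_bias else terms

# Hypothetical scaled BMI values, not taken from the dataset
bmi_values = [-0.05, 0.0, 0.06]
expanded = [poly2_single(v) for v in bmi_values]

for row in expanded:
    print(row)
```

The model stays linear in its coefficients; only the inputs are transformed, which is why `LinearRegression` can still be used for the fit.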


### 4.Run a multivariate polynomial regression of the 2nd degree on all variables (Hint: set include_bias=False in PolynomialFeatures)

In [45]:
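# Re-split the full feature matrix: X_train and X_test were overwritten with BMI-only data in section 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create polynomial features of degree 2 with include_bias=False
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)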

In [None]:
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)

In [None]:
# Make predictions on the test set
y_pred = poly_model.predict(X_test_poly)

# Evaluate the model using R-squared, Mean Absolute Error, and Mean Absolute Percentage Error
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)

In [44]:
# Print the evaluation metrics
print("Multivariate Polynomial Regression (2nd Degree) with include_bias=False:")
print(f"R-squared: {r2:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Mean Absolute Percentage Error: {mape:.2f}")

Multivariate Polynomial Regression (2nd Degree) with include_bias=False:
R-squared: 0.42
Mean Absolute Error: 43.58
Mean Absolute Percentage Error: 0.38
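
To see what a degree-2 expansion of all ten features contains, the sketch below enumerates the terms by hand using only the standard library. It assumes the dataset's column names (age, sex, bmi, bp, s1 to s6); `degree2_terms` is a hypothetical helper, and the naming of the terms is chosen for readability rather than guaranteed to match `poly.get_feature_names_out()` exactly.

```python
from itertools import combinations_with_replacement

feature_names = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]

def degree2_terms(names):
    """List the linear terms, then all squares and pairwise products."""
    linear = list(names)
    quadratic = [
        f"{a}^2" if a == b else f"{a} {b}"
        for a, b in combinations_with_replacement(names, 2)
    ]
    return linear + quadratic

terms = degree2_terms(feature_names)
n_squares = sum(t.endswith("^2") for t in terms)
n_interactions = len(terms) - len(feature_names) - n_squares

print(len(terms), n_squares, n_interactions)
print(terms[:12])
```

With `include_bias=False` there is no constant column in this list; the intercept is fitted separately by `LinearRegression`.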


### 5.Compare the three models by looking at R-squared, MAPE and MAE. Explain what the values mean for a non-expert and add your insight about the values of each model.

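R-squared is the share of the variation in the target that the model explains (1 is a perfect fit). MAE is the average size of the prediction error in the target's own units. MAPE is the average error as a fraction of the true value, so 0.37 means predictions are off by about 37% on average.
The Multivariate Linear Regression model has the best test scores on all three metrics (R-squared 0.45, MAE 42.79, MAPE 0.37).
The Polynomial Regression on BMI has the weakest fit and the largest prediction errors (R-squared 0.23, MAE 52.38, MAPE 0.46). The most likely reason is that it uses a single feature and so misses the information carried by the other nine variables; the polynomial degree is probably not the main issue.
The Multivariate Polynomial Regression with include_bias=False performs similarly to the multivariate linear regression but is slightly worse on every metric (R-squared 0.42, MAE 43.58, MAPE 0.38), which suggests that the extra squared and interaction terms add little on this dataset.
In summary, the Multivariate Linear Regression model is the best-performing of the three, with the highest R-squared and the lowest MAE and MAPE. Even so, it explains less than half of the variation in the target, so further analysis and feature engineering may be needed to improve performance.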
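
The comparison can also be laid out side by side. The sketch below hard-codes the test-set metrics printed in sections 2, 3 and 4 and picks the best model per metric; the dictionary layout and variable names are illustrative choices, not part of the original notebook.

```python
# Test-set metrics as printed in sections 2, 3 and 4
results = {
    "Multivariate linear": {"r2": 0.45, "mae": 42.79, "mape": 0.37},
    "Polynomial (BMI only)": {"r2": 0.23, "mae": 52.38, "mape": 0.46},
    "Multivariate polynomial": {"r2": 0.42, "mae": 43.58, "mape": 0.38},
}

print(f"{'Model':<26}{'R2':>6}{'MAE':>8}{'MAPE':>7}")
for name, m in results.items():
    print(f"{name:<26}{m['r2']:>6.2f}{m['mae']:>8.2f}{m['mape']:>7.0%}")

# Higher is better for R-squared; lower is better for MAE and MAPE
best_r2 = max(results, key=lambda k: results[k]["r2"])
best_mae = min(results, key=lambda k: results[k]["mae"])
best_mape = min(results, key=lambda k: results[k]["mape"])

print(best_r2, best_mae, best_mape, sep=" | ")
```

Because all three metrics come from the same 80/20 split with `random_state=42`, they are directly comparable, although a single split gives only a rough estimate of how the models would rank on new data.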

### 6.Please answer the following questions:
How many parameters are we fitting for each of the three models? Explain these values. Hint: for explaining the parameters of the polynomial regression, you can use poly.get_feature_names_out() 
Which model would you choose for deployment, and why?

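##### 1.Number of Parameters

   ##### -Multivariate Linear Regression: 11 parameters (10 feature coefficients and 1 intercept).

   ##### -Polynomial Regression (2nd Degree) on BMI: 3 parameters (coefficients for x and x^2, plus the intercept). With the default include_bias=True, PolynomialFeatures also adds a constant column, but it is redundant with the intercept that LinearRegression fits.

   ##### -Multivariate Polynomial Regression (2nd Degree) with include_bias=False: 66 parameters (10 linear terms, 10 squared terms, 45 pairwise interaction terms, and 1 intercept). poly.get_feature_names_out() lists the 65 feature terms.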
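
The parameter counts can be checked with a short calculation. A full polynomial expansion of n features up to degree d has C(n + d, d) columns including the constant term. The sketch below applies that formula; `n_poly_features` is a hypothetical helper, and each model's count adds one intercept fitted by `LinearRegression`.

```python
from math import comb

def n_poly_features(n_features, degree, include_bias):
    """Number of columns produced by a full polynomial expansion."""
    total = comb(n_features + degree, degree)  # count includes the constant term
    return total if include_bias else total - 1

# One coefficient per column, plus the intercept fitted by the regression
linear_params = 10 + 1
bmi_params = n_poly_features(1, 2, include_bias=False) + 1      # x, x^2, intercept
full_poly_params = n_poly_features(10, 2, include_bias=False) + 1

print(linear_params, bmi_params, full_poly_params)
```

For the BMI model the constant column is left out of the count on purpose, since it duplicates the intercept rather than adding a separately identifiable parameter.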

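##### 2.Choosing a Model for Deployment

   ##### Multivariate Linear Regression is the preferred choice here: it has the best test metrics of the three models and is the simpler of the two all-feature models, which makes it easier to interpret and less prone to overfitting.

   ##### Multivariate Polynomial Regression can capture nonlinear relationships, but on this data its extra squared and interaction terms did not improve the test metrics, so the added complexity is not justified.

   ##### The final choice should still be aligned with our specific goals and resources for deployment.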