#### Data Understanding and Exploration
- Dataset: Load the Wine Quality dataset (available on UCI Machine Learning Repository or other sources).
- Explore the Data:
    - Load the dataset and convert it into a Pandas DataFrame.
    - Examine the first few rows, check for missing values, and get a summary of the dataset.

- Questions:
    - What are the input features and target variables?
    - Are there any missing values or outliers that might need attention?

In [3]:
pip install ucimlrepo 

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Note: you may need to restart the kernel to use updated packages.


In [2]:
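from ucimlrepo import fetch_ucirepo

# Fetch the Wine Quality dataset (UCI id 186).
wine_quality = fetch_ucirepo(id=186)

# Note: df is the same object as wine_quality.data.features, so adding a
# column here also changes that frame and triggers the warning shown below.
df = wine_quality.data.features
df['targets'] = wine_quality.data.targets

print(df.head())  # Examine the first few rows.
print(df.isnull().sum())  # Check for missing values.
print(df.describe())  # Summarise the dataset.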

   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free_sulfur_dioxide  total_sulfur_dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  targets  
0      9.4        5  
1      9.8        5  
2      9.8        5 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['targets'] = wine_quality.data.targets
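
The warning above most likely appears because `df` is the same object as `wine_quality.data.features`, which is itself a slice of a larger frame. A minimal sketch of one way to avoid it is to take an explicit copy before adding the column. The small frames here are made-up stand-ins for the real download, and `df_demo` is a hypothetical name:

```python
import pandas as pd

# Stand-ins for wine_quality.data.features / .targets (values are made up).
full = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.1],
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 6],
})
features = full[["alcohol", "pH"]]  # a slice of a larger frame
targets = full[["quality"]]

# Copy first, so the new column is added to an independent frame.
df_demo = features.copy()
df_demo["targets"] = targets["quality"]

print(df_demo.columns.tolist())   # feature columns plus 'targets'
print(features.columns.tolist())  # the original slice is left untouched
```

Without the copy, the added column also lands in `features`, which is probably why a later cell slices `features.columns[0:-1]` to list the inputs.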


In [5]:
features = wine_quality.data.features
targets = wine_quality.data.targets

# features also holds the 'targets' column added in the earlier cell, so drop the last column.
print(f"The dataset's features are: {features.columns[0:-1]}")
print(f"The dataset's targets are: {targets.columns}")

print('There are no missing values that need attention.')
print("Two features (residual sugar and free sulfur dioxide) have a maximum of roughly 13x and 9x their mean, respectively, so they may contain outliers. The question didn't ask for further exploration, but if it did, I would draw a box plot of these two features to check.")

The dataset's features are: Index(['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
       'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')
The dataset's targets are: Index(['quality'], dtype='object')
There are no missing values that need attention.
Two features (residual sugar and free sulfur dioxide) have a maximum of roughly 13x and 9x their mean, respectively, so they may contain outliers. The question didn't ask for further exploration, but if it did, I would draw a box plot of these two features to check.
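
The box-plot check mentioned above can also be done numerically. The sketch below counts values outside the usual 1.5 × IQR whiskers that a box plot draws. The helper `iqr_outlier_counts` and the synthetic columns are illustrative assumptions, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

# Made-up data: one well-behaved column and one with a few extreme values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "well_behaved": rng.normal(5.0, 1.0, size=200),
    "long_tailed": np.concatenate(
        [rng.normal(5.0, 1.0, size=197), [60.0, 65.0, 70.0]]
    ),
})
counts = iqr_outlier_counts(demo)
print(counts)
```

On the real data this could be called as `iqr_outlier_counts(df[["residual_sugar", "free_sulfur_dioxide"]])`, with `df.boxplot(column=["residual_sugar", "free_sulfur_dioxide"])` for the visual version.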


#### Data Preparation
- Preprocess the Data:
    - Split the dataset into features (X) and target (y).
    - Perform an 80-20 split for training and testing.
    - Scale the features using StandardScaler or another appropriate method.

- Questions:
    - Why is it necessary to split your data into training and testing sets?
    - Why is scaling features important before applying regularized models such as Ridge or Lasso?


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

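X_unscaled = df.drop(columns='targets')
y = df['targets']

# Split before scaling so the scaler's statistics come from the training rows only.
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X_unscaled, y, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_unscaled)
X_test = scaler.transform(X_test_unscaled)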

print(f'Shape of train set: {X_train.shape}')
print(f'Shape of test set: {X_test.shape}')

Shape of train set: (5197, 11)
Shape of test set: (1300, 11)


In [10]:
print("It's necessary to split the data so you can test the model on data it hasn't seen. This mirrors a real-life scenario where the inputs are known but the output is not, so performance on the test set indicates how well the model is likely to do when deployed.")
print("Scaling matters for Ridge and Lasso because their penalty acts on the size of each coefficient, and a coefficient's size depends on the units of its feature. Without scaling, features measured on small scales need large coefficients and are penalised more heavily than features measured on large scales. Putting every feature on the same scale lets the penalty treat them equally, so features are down-weighted because they are less informative and not because of their units.")

It's necessary to split the data so you can test the model on data it hasn't seen. This mirrors a real-life scenario where the inputs are known but the output is not, so performance on the test set indicates how well the model is likely to do when deployed.
Scaling matters for Ridge and Lasso because their penalty acts on the size of each coefficient, and a coefficient's size depends on the units of its feature. Without scaling, features measured on small scales need large coefficients and are penalised more heavily than features measured on large scales. Putting every feature on the same scale lets the penalty treat them equally, so features are down-weighted because they are less informative and not because of their units.
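
One way to keep the split and the scaling in the right order is to wrap both the scaler and the model in a scikit-learn pipeline, so the scaler is only ever fitted on the rows passed to `fit`. This is a minimal sketch on synthetic data (two features on very different scales), not the wine dataset itself:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data: two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(0, 1, 400), rng.normal(0, 100, 400)])
y_demo = 0.5 * X_demo[:, 0] + 0.005 * X_demo[:, 1] + rng.normal(0, 0.5, 400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training rows only, then reuses
# those means and standard deviations when scoring the test rows.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_tr, y_tr)

scaler_step = model.named_steps["standardscaler"]
print("means learned from training rows:", scaler_step.mean_)
print("test R^2:", model.score(X_te, y_te))
```

A pipeline like this can also be passed straight to `GridSearchCV`, in which case the scaler is refitted inside every cross-validation fold.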


#### Modeling and Evaluation
- Train and Compare Models:
    - Train Ridge Regression, Lasso Regression, and Linear Regression models using the training set.
    - Evaluate the models using the Mean Squared Error (MSE) and R² score on the test set.
- Questions:
    - How do the models perform compared to each other?
    - What insights can you derive about the differences between Ridge, Lasso, and Linear Regression from the results?

In [16]:
import pandas as pd
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

linear = LinearRegression()
ridge = Ridge()
lasso = Lasso()

linear.fit(X_train, y_train)
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

linear_pred = linear.predict(X_test)
ridge_pred = ridge.predict(X_test)
lasso_pred = lasso.predict(X_test)

linear_msq = mean_squared_error(y_test, linear_pred)
ridge_msq = mean_squared_error(y_test, ridge_pred)
lasso_msq = mean_squared_error(y_test, lasso_pred)

linear_r2 = r2_score(y_test, linear_pred)
ridge_r2 = r2_score(y_test, ridge_pred)
lasso_r2 = r2_score(y_test, lasso_pred)

msq_performance = pd.DataFrame({
    'linear_msq': [linear_msq],
    'ridge_msq': [ridge_msq],
    'lasso_msq': [lasso_msq]
})

r2_performance = pd.DataFrame({
    'linear_r2': [linear_r2],
    'ridge_r2': [ridge_r2],
    'lasso_r2': [lasso_r2]
})

print(msq_performance)
print(r2_performance)

   linear_msq  ridge_msq  lasso_msq
0     0.54187   0.541873   0.783099
   linear_r2  ridge_r2  lasso_r2
0   0.307616  0.307613 -0.000619


In [18]:
print("The Linear and Ridge models perform equally poorly, each explaining only about 31% of the variance according to the R² score. The Lasso model performed worst, with an R² of roughly zero, which suggests that the default alpha of 1.0 penalises too heavily and shrinks the coefficients to (or close to) zero.")

The Linear and Ridge models perform equally poorly, each explaining only about 31% of the variance according to the R² score. The Lasso model performed worst, with an R² of roughly zero, which suggests that the default alpha of 1.0 penalises too heavily and shrinks the coefficients to (or close to) zero.
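
An R² near zero is what a model that always predicts the mean would score. The sketch below illustrates, on synthetic data with a weak signal, how a Lasso penalty of `alpha=1.0` can zero out every coefficient while a much smaller alpha does not. The data and the name `alpha_max` are assumptions for illustration. The threshold formula follows from scikit-learn's Lasso objective, (1 / (2n)) · ||y − Xw||² + alpha · ||w||₁:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500
X_demo = rng.normal(size=(n, 3))  # roughly standardized features
y_demo = 5.0 + 0.3 * X_demo[:, 0] + rng.normal(0, 0.8, n)

# Smallest alpha at which every Lasso coefficient is exactly zero
# (computed on centered data, since the intercept is fitted separately).
X_c = X_demo - X_demo.mean(axis=0)
y_c = y_demo - y_demo.mean()
alpha_max = np.abs(X_c.T @ y_c).max() / n

strong = Lasso(alpha=1.0).fit(X_demo, y_demo)
mild = Lasso(alpha=0.001).fit(X_demo, y_demo)

print("alpha_max:", alpha_max)
print("alpha=1.0 coefficients:", strong.coef_)
print("alpha=0.001 coefficients:", mild.coef_)
```

If the same check on the wine training data gives an `alpha_max` below 1.0, the default Lasso reduces to predicting the mean quality, which would match the score above.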


#### Hyperparameter Tuning
- Hyperparameter Tuning:
    - Use GridSearchCV to find the best alpha for both Ridge and Lasso models. Use a range of alpha values (e.g., [0.001, 0.01, 0.1, 1, 10]).
- Evaluate the tuned models' performance.

- Questions:
    - What are the optimal alpha values for Ridge and Lasso, and how did they affect the model performance?
    - Did tuning the hyperparameters improve the models significantly?

In [19]:
from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.001, 0.01, 0.1, 1, 10]}

ridge_grid = GridSearchCV(ridge, params)
lasso_grid = GridSearchCV(lasso, params)

ridge_grid.fit(X_train, y_train)
lasso_grid.fit(X_train, y_train)

ridge_grid_pred = ridge_grid.predict(X_test)
lasso_grid_pred = lasso_grid.predict(X_test)

ridge_grid_msq = mean_squared_error(y_test, ridge_grid_pred)
lasso_grid_msq = mean_squared_error(y_test, lasso_grid_pred)

ridge_grid_r2 = r2_score(y_test, ridge_grid_pred)
lasso_grid_r2 = r2_score(y_test, lasso_grid_pred)

msq_grid_performance = pd.DataFrame({
    'ridge_grid_msq': [ridge_grid_msq],
    'lasso_grid_msq': [lasso_grid_msq]
})

r2_grid_performance = pd.DataFrame({
    'ridge_grid_r2': [ridge_grid_r2],
    'lasso_grid_r2': [lasso_grid_r2]
})


print(msq_grid_performance)
print(r2_grid_performance)
print(f'Optimal Alpha for Ridge Model: {ridge_grid.best_params_}')
print(f'Optimal Alpha for Lasso Model: {lasso_grid.best_params_}')

   ridge_grid_msq  lasso_grid_msq
0        0.541892        0.542073
   ridge_grid_r2  lasso_grid_r2
0       0.307587       0.307357
Optimal Alpha for Ridge Model: {'alpha': 10}
Optimal Alpha for Lasso Model: {'alpha': 0.001}


In [20]:
print("Tuning the hyperparameters only improved the Lasso model, which is unsurprising because it performed worst before tuning. With alpha = 0.001 it now matches the Linear model, with an R² of about 0.31. The Ridge model was essentially unchanged at alpha = 10. This is still unsatisfactory for releasing an accurate model.")

Tuning the hyperparameters only improved the Lasso model, which is unsurprising because it performed worst before tuning. With alpha = 0.001 it now matches the Linear model, with an R² of about 0.31. The Ridge model was essentially unchanged at alpha = 10. This is still unsatisfactory for releasing an accurate model.
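
To see how much each alpha mattered, and not only which one won, the cross-validated scores stored on the fitted search can be tabulated. This is a sketch on synthetic data using the same alpha grid. `cv_table` is a hypothetical name, and `cv=5` with `scoring="r2"` are written out explicitly here (they match the defaults for a regressor):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 0.4 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(0, 0.8, 300)

alphas = [0.001, 0.01, 0.1, 1, 10]
grid = GridSearchCV(Lasso(), {"alpha": alphas}, cv=5, scoring="r2")
grid.fit(X_demo, y_demo)

# One row per alpha: mean and spread of the cross-validated R^2.
cv_table = pd.DataFrame({
    "alpha": alphas,
    "mean_r2": grid.cv_results_["mean_test_score"],
    "std_r2": grid.cv_results_["std_test_score"],
})
print(cv_table)
print("best alpha:", grid.best_params_["alpha"])
```

A flat `mean_r2` column across several alphas would indicate that the penalty strength barely matters in that range, which is one way to read the near-identical Ridge results.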


#### Feature Importance
- Feature Importance:
    - Extract and interpret the coefficients from the best Ridge and Lasso models. Rank the features by their importance.
- Questions:
    - According to Ridge and Lasso, which features are the most important for predicting wine quality?
    - Do Ridge and Lasso highlight the same features, or are there differences in feature importance between the models?

In [22]:
best_ridge_coefficients = ridge_grid.best_estimator_.coef_
best_lasso_coefficients = lasso_grid.best_estimator_.coef_

feature_names = X_unscaled.columns
ridge_coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Ridge Coefficient': best_ridge_coefficients
}).sort_values(by='Ridge Coefficient', key=abs, ascending=False)

lasso_coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Lasso Coefficient': best_lasso_coefficients
}).sort_values(by='Lasso Coefficient', key=abs, ascending=False)

print(ridge_coef_df)
print(lasso_coef_df)

                 Feature  Ridge Coefficient
10               alcohol           0.322912
1       volatile_acidity          -0.226760
3         residual_sugar           0.187861
6   total_sulfur_dioxide          -0.147605
7                density          -0.138657
5    free_sulfur_dioxide           0.124491
9              sulphates           0.104443
0          fixed_acidity           0.078809
8                     pH           0.061820
2            citric_acid          -0.022985
4              chlorides          -0.020987
                 Feature  Lasso Coefficient
10               alcohol           0.334957
1       volatile_acidity          -0.228196
3         residual_sugar           0.170751
6   total_sulfur_dioxide          -0.143346
5    free_sulfur_dioxide           0.122135
7                density          -0.111848
9              sulphates           0.100600
0          fixed_acidity           0.064556
8                     pH           0.053470
4              chlorides        

In [23]:
print("Although the coefficient values differ slightly, the Ridge and Lasso models agree on the top three features for predicting wine quality, in the same order: alcohol, volatile acidity, and residual sugar. The biggest difference between the two models is density, which has a larger absolute coefficient under Ridge than under Lasso, enough to swap its rank with free sulfur dioxide. However, the absolute difference is only about 0.03, so this isn't a meaningful disagreement.")

Although the coefficient values differ slightly, the Ridge and Lasso models agree on the top three features for predicting wine quality, in the same order: alcohol, volatile acidity, and residual sugar. The biggest difference between the two models is density, which has a larger absolute coefficient under Ridge than under Lasso, enough to swap its rank with free sulfur dioxide. However, the absolute difference is only about 0.03, so this isn't a meaningful disagreement.
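
The comparison above can be made explicit by putting both sets of coefficients in one table and ranking them by absolute value. The sketch below uses the coefficients copied from the printed tables, restricted to the nine features whose values are visible in both outputs. The names `comparison` and `rank_changes` are illustrative:

```python
import pandas as pd

# Coefficients copied from the printed tables, restricted to the nine
# features whose values are visible in both outputs.
ridge_coefs = pd.Series({
    "alcohol": 0.322912, "volatile_acidity": -0.226760,
    "residual_sugar": 0.187861, "total_sulfur_dioxide": -0.147605,
    "density": -0.138657, "free_sulfur_dioxide": 0.124491,
    "sulphates": 0.104443, "fixed_acidity": 0.078809, "pH": 0.061820,
})
lasso_coefs = pd.Series({
    "alcohol": 0.334957, "volatile_acidity": -0.228196,
    "residual_sugar": 0.170751, "total_sulfur_dioxide": -0.143346,
    "density": -0.111848, "free_sulfur_dioxide": 0.122135,
    "sulphates": 0.100600, "fixed_acidity": 0.064556, "pH": 0.053470,
})

comparison = pd.DataFrame({"ridge": ridge_coefs, "lasso": lasso_coefs})
# Rank 1 is the largest absolute coefficient within each model.
comparison["ridge_rank"] = comparison["ridge"].abs().rank(ascending=False).astype(int)
comparison["lasso_rank"] = comparison["lasso"].abs().rank(ascending=False).astype(int)
comparison["abs_diff"] = (comparison["ridge"].abs() - comparison["lasso"].abs()).abs()

rank_changes = comparison[comparison["ridge_rank"] != comparison["lasso_rank"]]
print(comparison.sort_values("ridge_rank"))
print("features whose rank differs:", rank_changes.index.tolist())
```

In the notebook itself the two Series could be built directly from `ridge_grid.best_estimator_.coef_` and `lasso_grid.best_estimator_.coef_`, indexed by `feature_names`.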


#### Evaluation of Model Selection
- Model Deployment Decision:
    - Based on the performance of Ridge, Lasso, and Linear Regression, select the best model to deploy.
- Questions:
    - Which model performed the best overall? Justify your choice based on MSE, R², and the feature importance results.
    - If you were to improve the model further, what steps would you take next?

In [24]:
print("All three models performed unsatisfactorily, each explaining only about 31% of the variance, so I wouldn't choose any of them to deploy. If I had to pick one, I would choose the Linear model, as it takes fewer computational resources yet delivers similar performance. Going forward, I would test other models such as Decision Trees and Random Forests to see whether they perform well enough to be of value. For this problem I would want an R² above 0.7.")

All three models performed unsatisfactorily, each explaining only about 31% of the variance, so I wouldn't choose any of them to deploy. If I had to pick one, I would choose the Linear model, as it takes fewer computational resources yet delivers similar performance. Going forward, I would test other models such as Decision Trees and Random Forests to see whether they perform well enough to be of value. For this problem I would want an R² above 0.7.
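
The next step described above could start with a cross-validated comparison between a linear baseline and a Random Forest. This sketch runs on synthetic data with a deliberately non-linear target, so it only shows the mechanics. It says nothing about how a Random Forest would score on the wine data, and the hyperparameters are arbitrary choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic, deliberately non-linear target (not the wine data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(400, 3))
y_demo = X_demo[:, 0] ** 2 + np.sin(2 * X_demo[:, 1]) + rng.normal(0, 0.3, 400)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated R^2 for each candidate model.
cv_r2 = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}
for name, score in cv_r2.items():
    print(f"{name}: mean CV R^2 = {score:.3f}")
```

Running the same loop with `X_train` and `y_train` from the notebook would show whether a non-linear model gets closer to the R² target of 0.7.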
