**Regression on wine quality**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load the dataset
data_url = 'https://archive.ics.uci.edu/static/public/186/data.csv'
df = pd.read_csv(data_url)

print(df)


      fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  \
0               7.4              0.70         0.00             1.9      0.076   
1               7.8              0.88         0.00             2.6      0.098   
2               7.8              0.76         0.04             2.3      0.092   
3              11.2              0.28         0.56             1.9      0.075   
4               7.4              0.70         0.00             1.9      0.076   
...             ...               ...          ...             ...        ...   
6492            6.2              0.21         0.29             1.6      0.039   
6493            6.6              0.32         0.36             8.0      0.047   
6494            6.5              0.24         0.19             1.2      0.041   
6495            5.5              0.29         0.30             1.1      0.022   
6496            6.0              0.21         0.38             0.8      0.020   

      free_sulfur_dioxide  

In [None]:
# Encode the 'color' column to numerical values
df['color'] = df['color'].map({'red': 0, 'white': 1})

In [None]:
print(df)

      fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  \
0               7.4              0.70         0.00             1.9      0.076   
1               7.8              0.88         0.00             2.6      0.098   
2               7.8              0.76         0.04             2.3      0.092   
3              11.2              0.28         0.56             1.9      0.075   
4               7.4              0.70         0.00             1.9      0.076   
...             ...               ...          ...             ...        ...   
6492            6.2              0.21         0.29             1.6      0.039   
6493            6.6              0.32         0.36             8.0      0.047   
6494            6.5              0.24         0.19             1.2      0.041   
6495            5.5              0.29         0.30             1.1      0.022   
6496            6.0              0.21         0.38             0.8      0.020   

      free_sulfur_dioxide  

In [None]:
# Define features and target
X = df.drop('quality', axis=1)
y = df['quality']

In [None]:
print(X)

      fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  \
0               7.4              0.70         0.00             1.9      0.076   
1               7.8              0.88         0.00             2.6      0.098   
2               7.8              0.76         0.04             2.3      0.092   
3              11.2              0.28         0.56             1.9      0.075   
4               7.4              0.70         0.00             1.9      0.076   
...             ...               ...          ...             ...        ...   
6492            6.2              0.21         0.29             1.6      0.039   
6493            6.6              0.32         0.36             8.0      0.047   
6494            6.5              0.24         0.19             1.2      0.041   
6495            5.5              0.29         0.30             1.1      0.022   
6496            6.0              0.21         0.38             0.8      0.020   

      free_sulfur_dioxide  

In [None]:
print(y)

0       5
1       5
2       5
3       6
4       5
       ..
6492    6
6493    5
6494    6
6495    7
6496    6
Name: quality, Length: 6497, dtype: int64


In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create pipelines for SVR, Linear Regression, Ridge, and Random Forest
pipelines = {
    'svr': Pipeline([
        ('scaler', StandardScaler()),
        ('svr', SVR())
    ]),
    'linear_regression': Pipeline([
        ('scaler', StandardScaler()),
        ('linear_regression', LinearRegression())
    ]),
    'ridge': Pipeline([
        ('scaler', StandardScaler()),
        ('ridge', Ridge())
    ]),
    'random_forest': Pipeline([
        ('scaler', StandardScaler()),
        ('random_forest', RandomForestRegressor(random_state=42))
    ])
}

# Define the parameter grid for hyper-parameter tuning
param_grids = {
    'svr': {
        'svr__C': [0.1, 1, 10, 100],
        'svr__gamma': [1, 0.1, 0.01, 0.001],
        'svr__kernel': ['rbf', 'linear']
    },
    'linear_regression': {},
    'ridge': {
        'ridge__alpha': [0.1, 1.0, 10.0, 100.0]
    },
    'random_forest': {
        'random_forest__n_estimators': [100, 200, 300],
        'random_forest__max_depth': [None, 10, 20, 30],
        'random_forest__min_samples_split': [2, 5, 10],
        'random_forest__min_samples_leaf': [1, 2, 4]
    }
}

In [None]:
# Inspect the scaled data (for display only; each pipeline fits its own StandardScaler during training)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nScaled Training Data:")
print(X_train_scaled)

print("\nScaled Testing Data:")
print(X_test_scaled)


Scaled Training Data:
[[ 2.09749415e+00  9.44541345e-01 -6.19125220e-01 ... -6.07582024e-01
  -9.09141702e-01  5.63889186e-01]
 [ 3.88772797e-01 -3.53027559e-01  2.06940277e-01 ... -1.99847169e-01
  -7.41998353e-01  5.63889186e-01]
 [ 3.41786974e+00  8.51857852e-01  5.51134235e-01 ...  8.19489967e-01
  -3.79854432e-01 -1.77339808e+00]
 ...
 [-6.20926184e-01  2.03073400e-01 -8.25641595e-01 ... -6.75537833e-01
  -8.25570027e-01  5.63889186e-01]
 [-5.43257032e-01 -4.76605550e-01  1.23952215e+00 ... -4.03714597e-01
  -8.25570027e-01  5.63889186e-01]
 [ 4.27035147e-04  1.74779829e+00 -1.78938468e+00 ...  7.19760670e-02
  -8.25570027e-01 -1.77339808e+00]]

Scaled Testing Data:
[[-0.15491127 -1.03270651  2.89165314 ... -1.01531688  1.43086519
   0.56388919]
 [ 0.3887728   1.87137628 -0.7568028  ... -0.53962622 -0.49128333
  -1.77339808]
 [-0.31024957  0.32665139  0.13810149 ... -0.60758202  1.26372184
   0.56388919]
 ...
 [-0.62092618 -0.16766057 -0.27493126 ... -1.01531688  1.84872356
   0.

In [None]:
# Perform GridSearchCV to find the best hyper-parameters for each model
best_models = {}
for model_name in pipelines:
    grid_search = GridSearchCV(
        pipelines[model_name],
        param_grids[model_name],
        cv=5,
        scoring='neg_mean_squared_error',  # scores are negated MSE, so values closer to 0 are better
        n_jobs=-1,
    )
    grid_search.fit(X_train, y_train)
    best_models[model_name] = grid_search.best_estimator_
    print(f'Best parameters for {model_name}: {grid_search.best_params_}')

    # Print all hyper-parameter tuning results
    print(f'All hyper-parameter tuning results for {model_name}:')
    for mean_score, params in zip(grid_search.cv_results_['mean_test_score'], grid_search.cv_results_['params']):
        print(f'Mean Test Score: {mean_score}, Parameters: {params}')

Best parameters for svr: {'svr__C': 1, 'svr__gamma': 1, 'svr__kernel': 'rbf'}
All hyper-parameter tuning results for svr:
Mean Test Score: -0.6041562885513352, Parameters: {'svr__C': 0.1, 'svr__gamma': 1, 'svr__kernel': 'rbf'}
Mean Test Score: -0.5484679762249433, Parameters: {'svr__C': 0.1, 'svr__gamma': 1, 'svr__kernel': 'linear'}
Mean Test Score: -0.4989642162671898, Parameters: {'svr__C': 0.1, 'svr__gamma': 0.1, 'svr__kernel': 'rbf'}
Mean Test Score: -0.5484679762249433, Parameters: {'svr__C': 0.1, 'svr__gamma': 0.1, 'svr__kernel': 'linear'}
Mean Test Score: -0.5434540140174041, Parameters: {'svr__C': 0.1, 'svr__gamma': 0.01, 'svr__kernel': 'rbf'}
Mean Test Score: -0.5484679762249433, Parameters: {'svr__C': 0.1, 'svr__gamma': 0.01, 'svr__kernel': 'linear'}
Mean Test Score: -0.6381595048226223, Parameters: {'svr__C': 0.1, 'svr__gamma': 0.001, 'svr__kernel': 'rbf'}
Mean Test Score: -0.5484679762249433, Parameters: {'svr__C': 0.1, 'svr__gamma': 0.001, 'svr__kernel': 'linear'}
Mean Tes
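
The linear-kernel rows above repeat the same score for every `svr__gamma` value because `SVR` only uses `gamma` with the `rbf`, `poly`, and `sigmoid` kernels. The following is a minimal sketch, assuming the same value lists as `param_grids['svr']`, of how a list of dictionaries could avoid those redundant candidates. The names `flat_grid` and `conditional_grid` are illustrative and not part of the notebook.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10, 100]
gamma_values = [1, 0.1, 0.01, 0.001]

# Grid as written in the notebook: every kernel is crossed with every gamma.
flat_grid = {
    'svr__C': C_values,
    'svr__gamma': gamma_values,
    'svr__kernel': ['rbf', 'linear'],
}

# Hypothetical alternative: gamma is only searched for the RBF kernel.
conditional_grid = [
    {'svr__kernel': ['rbf'], 'svr__C': C_values, 'svr__gamma': gamma_values},
    {'svr__kernel': ['linear'], 'svr__C': C_values},
]

# ParameterGrid enumerates the candidates without fitting anything.
n_flat = len(ParameterGrid(flat_grid))
n_conditional = len(ParameterGrid(conditional_grid))
print(n_flat, n_conditional)  # 32 20
```

`GridSearchCV` accepts a list of dictionaries as its parameter grid, so `conditional_grid` could be passed in place of `param_grids['svr']`. With `cv=5` that would mean 100 cross-validation fits instead of 160, not counting the final refit.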

In [None]:
# Evaluate each model on the test set
for model_name, model in best_models.items():
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mad = mean_absolute_error(y_test, y_pred)  # mean absolute error (MAE), reported under the label MAD
    print(f'\nResults for {model_name}:')
    print(f'Mean Squared Error (MSE): {mse}')
    print(f'R-squared (R2): {r2}')
    print(f'Mean Absolute Deviation (MAD): {mad}')


Results for svr:
Mean Squared Error (MSE): 0.42903900015125707
R-squared (R2): 0.4119853412434167
Mean Absolute Deviation (MAD): 0.4700591875420126

Results for linear_regression:
Mean Squared Error (MSE): 0.5276383278265474
R-squared (R2): 0.2768511226847906
Mean Absolute Deviation (MAD): 0.5620633496708347

Results for ridge:
Mean Squared Error (MSE): 0.5274570068752918
R-squared (R2): 0.27709963011008376
Mean Absolute Deviation (MAD): 0.5620497210177962

Results for random_forest:
Mean Squared Error (MSE): 0.3621349013029909
R-squared (R2): 0.5036800143146538
Mean Absolute Deviation (MAD): 0.43520186385207005


**Analysis of the Results**

The results above compare four regression models on the held-out test set (30% of the data): Support Vector Regression (SVR), Linear Regression, Ridge Regression, and Random Forest Regression. Three metrics are reported: Mean Squared Error (MSE), R-squared (R²), and the mean absolute error, which the output labels Mean Absolute Deviation (MAD).

1. Mean Squared Error (MSE):
Lower MSE values indicate better performance. Random Forest has the lowest test MSE (0.3621), so its predictions are the most accurate of the four models by this measure. SVR follows with 0.4290, while Linear Regression and Ridge Regression are nearly identical and higher, at 0.5276 and 0.5275.

2. R-squared (R²):
R² measures the proportion of variance in the target (wine quality) that the model explains from the features. Higher values indicate better performance. Random Forest has the highest R² (0.5037), meaning it explains about half of the variance in quality on the test set. SVR is next at 0.4120. Linear Regression and Ridge Regression are both near 0.277, so they explain only a little over a quarter of the variance.

3. Mean Absolute Deviation (MAD):
The value labeled MAD is computed with `mean_absolute_error`, so it is the mean absolute error (MAE): the average absolute difference between predicted and actual quality scores. Lower values indicate better performance. Random Forest again performs best (0.4352), followed by SVR (0.4701). Linear Regression and Ridge Regression are nearly identical and higher, at about 0.562.
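
The three metrics above can be reproduced by hand to make their definitions concrete. This is a small sketch on made-up numbers (the arrays `y_true` and `y_hat` are illustrative and are not taken from the wine data) that computes each metric with NumPy and checks it against the corresponding scikit-learn function.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Illustrative values only; these are not taken from the wine dataset.
y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_hat = np.array([5.5, 5.5, 5.0, 6.0, 6.5])

residuals = y_true - y_hat

# MSE: mean of the squared residuals.
mse_manual = np.mean(residuals ** 2)

# MAE (labeled MAD in the notebook output): mean of the absolute residuals.
mae_manual = np.mean(np.abs(residuals))

# R²: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Each manual value should agree with scikit-learn.
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Because R² is measured relative to always predicting the mean of `y_true`, a model that does no better than that baseline scores 0.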

**Overall Analysis**

Best Model: Random Forest Regression outperforms the other models on all three metrics (MSE, R², and MAD), making it the best of the four.

Second Best: SVR ranks second on every metric, ahead of Linear and Ridge Regression but behind Random Forest.

Similar Performance: Linear Regression and Ridge Regression score almost identically on every metric (for example, MSE of 0.5276 versus 0.5275), so the regularization in Ridge Regression did not meaningfully change test performance relative to standard Linear Regression in this case.
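
One plausible reason for such a small gap, offered here as an assumption rather than something the notebook verifies, is that with standardized features and many more samples than features, a modest `alpha` is small compared with the data term and barely moves the coefficients. The sketch below illustrates that effect on synthetic data from `make_regression`. It does not use the wine dataset, and the names `X_demo`, `y_demo`, `ols`, and `ridge` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the wine dataset).
X_demo, y_demo = make_regression(
    n_samples=1000, n_features=5, n_informative=5, noise=10.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit ordinary least squares and Ridge on the same standardized features.
ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Compare the coefficient vectors of the two fits.
max_coef_gap = np.max(np.abs(ols.coef_ - ridge.coef_))
largest_coef = np.max(np.abs(ols.coef_))
print(f'largest OLS coefficient: {largest_coef:.3f}')
print(f'largest OLS-vs-Ridge coefficient gap: {max_coef_gap:.3f}')
```

Comparing the fitted `coef_` arrays of the notebook's own linear and ridge pipelines in the same way would show whether this explanation holds for the wine data.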

**Conclusion**

On this 70/30 train/test split, Random Forest Regression is the most effective model for this dataset, followed by SVR. Linear Regression and Ridge Regression perform almost identically and trail the other two models on every metric.