<a href="https://colab.research.google.com/github/andrecamara2004/andrecamara2004.github.io/blob/main/Supervised%20Learning/SVM_Hyperparameter_Tuning_WineQuality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter Tuning Practice: Wine Quality Dataset (SVM)
In this notebook, you'll explore hyperparameter tuning using Support Vector Machines (SVM) on the Wine Quality dataset. You'll learn how to tune key hyperparameters, visualize the results, and interpret their impact.

## Objectives:
1. Load and preprocess the Wine Quality dataset.
2. Split the data into training and test sets.
3. Perform hyperparameter tuning for SVM.
4. Visualize training vs validation performance.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Enable inline plots
%matplotlib inline

## Step 1: Load the Wine Quality Dataset
We'll use the Wine Quality dataset, which contains information about different wines and their quality ratings. You can download it, and find more information about it, on Kaggle: [Wine Quality Dataset](https://www.kaggle.com/datasets/yasserh/wine-quality-dataset). Alternatively, load it directly from GitHub as in the cell below.


In [2]:
# Load the dataset directly from GitHub (WineQT.csv); no manual download is needed
url = 'https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/Supervised%20Learning/Datasets/WineQT.csv'
data = pd.read_csv(url)

# Display the first few rows
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |


### Step 1.1: Free EDA
Here you are free to explore the data in whatever way is useful to you, for example:
* `.describe()`
* `.info()`
* plot correlation matrices
* up to you
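
As one possible starting point for the exploration above, the sketch below checks the class balance of the target and ranks features by their correlation with it. To stay runnable offline it uses a small synthetic stand-in frame named `demo` (hypothetical values, only three columns); with the real data you would apply the same calls to `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine table (hypothetical values, three columns only)
rng = np.random.default_rng(0)
alcohol = rng.normal(10.4, 1.0, 200)
volatile_acidity = rng.normal(0.53, 0.18, 200)
raw_quality = (5.6 + 0.4 * (alcohol - 10.4)
               - 1.5 * (volatile_acidity - 0.53)
               + rng.normal(0, 0.6, 200))
quality = np.clip(np.round(raw_quality), 3, 8).astype(int)
demo = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "quality": quality,
})

# Class balance of the target: how many wines per quality score
class_counts = demo["quality"].value_counts().sort_index()
print(class_counts)

# Correlation of every feature with the target, strongest first (by absolute value)
target_corr = demo.corr()["quality"].drop("quality")
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top, where they are as easy to spot as strongly positive ones.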


In [3]:
data.describe()
data.info()
data.corr()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1143 non-null   float64
 1   volatile acidity      1143 non-null   float64
 2   citric acid           1143 non-null   float64
 3   residual sugar        1143 non-null   float64
 4   chlorides             1143 non-null   float64
 5   free sulfur dioxide   1143 non-null   float64
 6   total sulfur dioxide  1143 non-null   float64
 7   density               1143 non-null   float64
 8   pH                    1143 non-null   float64
 9   sulphates             1143 non-null   float64
 10  alcohol               1143 non-null   float64
 11  quality               1143 non-null   int64  
 12  Id                    1143 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 116.2 KB


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed acidity | 1.0 | -0.250728 | 0.673157 | 0.171831 | 0.107889 | -0.164831 | -0.110628 | 0.681501 | -0.685163 | 0.174592 | -0.075055 | 0.12197 | -0.275826 |
| volatile acidity | -0.250728 | 1.0 | -0.544187 | -0.005751 | 0.056336 | -0.001962 | 0.077748 | 0.016512 | 0.221492 | -0.276079 | -0.203909 | -0.407394 | -0.007892 |
| citric acid | 0.673157 | -0.544187 | 1.0 | 0.175815 | 0.245312 | -0.057589 | 0.036871 | 0.375243 | -0.546339 | 0.331232 | 0.10625 | 0.240821 | -0.139011 |
| residual sugar | 0.171831 | -0.005751 | 0.175815 | 1.0 | 0.070863 | 0.165339 | 0.19079 | 0.380147 | -0.116959 | 0.017475 | 0.058421 | 0.022002 | -0.046344 |
| chlorides | 0.107889 | 0.056336 | 0.245312 | 0.070863 | 1.0 | 0.01528 | 0.048163 | 0.208901 | -0.277759 | 0.374784 | -0.229917 | -0.124085 | -0.088099 |
| free sulfur dioxide | -0.164831 | -0.001962 | -0.057589 | 0.165339 | 0.01528 | 1.0 | 0.661093 | -0.05415 | 0.072804 | 0.034445 | -0.047095 | -0.06326 | 0.095268 |
| total sulfur dioxide | -0.110628 | 0.077748 | 0.036871 | 0.19079 | 0.048163 | 0.661093 | 1.0 | 0.050175 | -0.059126 | 0.026894 | -0.188165 | -0.183339 | -0.107389 |
| density | 0.681501 | 0.016512 | 0.375243 | 0.380147 | 0.208901 | -0.05415 | 0.050175 | 1.0 | -0.352775 | 0.143139 | -0.494727 | -0.175208 | -0.363926 |
| pH | -0.685163 | 0.221492 | -0.546339 | -0.116959 | -0.277759 | 0.072804 | -0.059126 | -0.352775 | 1.0 | -0.185499 | 0.225322 | -0.052453 | 0.132904 |
| sulphates | 0.174592 | -0.276079 | 0.331232 | 0.017475 | 0.374784 | 0.034445 | 0.026894 | 0.143139 | -0.185499 | 1.0 | 0.094421 | 0.25771 | -0.103954 |


## Step 2: Data Preprocessing
- Separate features and target variable.
- Split the dataset into training and testing sets.
- Optionally perform some feature engineering and assess its impact.

In [None]:
# Separate features and target
X = #TODO (hint: see DataFrame.drop; 'Id' looks like a row identifier, so consider dropping it along with the target)
y = #TODO
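
One possible way to approach the cell above, shown on a tiny stand-in frame rather than the full dataset. The names `demo`, `X_demo` and `y_demo` are illustrative, and dropping `Id` is an assumption based on it looking like a row identifier rather than a property of the wine.

```python
import pandas as pd

# Tiny stand-in frame using a few columns and values from the data.head() output
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 5, 6],
    "Id": [0, 1, 2, 3],
})

# Features: everything except the target and the row identifier
X_demo = demo.drop(columns=["quality", "Id"])

# Target: the quality score
y_demo = demo["quality"]

print(X_demo.columns.tolist())
print(y_demo.tolist())
```

`drop` returns a new frame and leaves `demo` unchanged, so the original table stays available for further exploration.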

In [None]:


# Bin the target into 'low', 'medium' and 'high' quality for classification -> https://en.wikipedia.org/wiki/Data_binning
y = #TODO turn this into a simpler classification problem (hint: look up pd.cut)
# Note: this is a multi-class problem rather than a binary one, and class imbalance may cause additional problems here.
# More on that in the next class, but think about it.

# Split into training and testing sets
#TODO

print(f"Training set: {X_train.shape}, Testing set: {X_test.shape}")
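
A minimal sketch of binning with `pd.cut` followed by a stratified split, run on synthetic stand-in data. The bin edges (scores up to 4 are low, 5 to 6 medium, 7 and above high), the 80/20 split and the `random_state` are assumptions chosen for illustration, not values prescribed by the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 300 wines, two features, integer quality scores 3..8
X_demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 300),
    "pH": rng.normal(3.3, 0.15, 300),
})
quality = pd.Series(rng.choice([3, 4, 5, 6, 7, 8], size=300,
                               p=[0.02, 0.08, 0.40, 0.35, 0.12, 0.03]))

# Hypothetical thresholds: (0, 4] -> low, (4, 6] -> medium, (6, 10] -> high
y_demo = pd.cut(quality, bins=[0, 4, 6, 10], labels=["low", "medium", "high"])
print(y_demo.value_counts())

# stratify keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
print(f"Training set: {X_tr.shape}, Testing set: {X_te.shape}")
```

Printing `value_counts()` right after binning makes any class imbalance visible before a model is trained.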

## Step 3: SVM Hyperparameter Tuning
Experiment with different values for `C` (regularization parameter) and kernel types to see how they affect training and validation accuracy.

In [None]:
# Define parameter grid for SVM
# Note: the more cross-validation folds (cv) and the larger the search space, the slower this gets
param_grid = {
    'C': [0.1],  # Starting point; add more values (e.g. 1, 10) to compare them in Step 4
    'kernel': ['linear', 'rbf'],
    'gamma': ['auto']  # Only relevant for non-linear kernels like 'rbf'
}

# Perform Grid Search
svm = SVC()
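# return_train_score=True also stores training accuracy in cv_results_ (off by default)
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy', return_train_score=True)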
grid_search.fit(X_train, y_train)

# Best parameters and scores
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)
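
To compare training and validation accuracy, as Step 3 describes, `GridSearchCV` can record training scores with `return_train_score=True`. The sketch below runs on synthetic data from `make_classification` and wraps the SVM in a pipeline with `StandardScaler`. The scaling step, the wider grid and all `demo` names are assumptions added for illustration; scaling is included because SVMs are generally sensitive to feature scale.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the binned wine data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only
pipe = make_pipeline(StandardScaler(), SVC())
demo_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

demo_search = GridSearchCV(pipe, demo_grid, cv=5, scoring="accuracy",
                           return_train_score=True)
demo_search.fit(X_demo, y_demo)

# Compare mean training and validation accuracy per parameter combination
cols = ["param_svc__kernel", "param_svc__C", "mean_train_score", "mean_test_score"]
summary = pd.DataFrame(demo_search.cv_results_)[cols]
print(summary.sort_values("mean_test_score", ascending=False))
```

A large gap between `mean_train_score` and `mean_test_score` for a given combination is a common sign of overfitting.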

In [None]:
random_search = #TODO (see https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
#TODO fit the search on the training data

print("Best Parameters:", random_search.best_params_)
print("Best Cross-Validation Accuracy:", random_search.best_score_)
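
A minimal sketch of a randomized search, again on synthetic stand-in data. Unlike a grid, `RandomizedSearchCV` accepts distributions to sample from. The log-uniform ranges, `n_iter=10` and the `demo` names are illustrative assumptions, not recommended settings for the wine data.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the binned wine data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)

# Distributions instead of fixed lists: C and gamma are sampled on a log scale
demo_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-3, 1e0),  # ignored by the linear kernel
    "kernel": ["linear", "rbf"],
}

# n_iter sets how many parameter combinations are sampled and evaluated
demo_random_search = RandomizedSearchCV(
    SVC(), demo_distributions, n_iter=10, cv=5,
    scoring="accuracy", random_state=42)
demo_random_search.fit(X_demo, y_demo)

print("Best Parameters:", demo_random_search.best_params_)
print("Best Cross-Validation Accuracy:", demo_random_search.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of the grid, which makes it easier to cover wide ranges of `C` and `gamma` within a fixed budget.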

## Step 4: Visualize the Results
Plot the cross-validation accuracy for different values of `C` to understand its impact.

In [None]:
# Extract results from grid search
results = pd.DataFrame(grid_search.cv_results_)

# Filter for linear and rbf kernels separately
linear_results = results[results['param_kernel'] == 'linear']
rbf_results = results[results['param_kernel'] == 'rbf']

# Plot results
plt.figure(figsize=(10, 6))
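# cv_results_ may store parameter values as objects, so cast C to float before plotting
plt.plot(linear_results['param_C'].astype(float), linear_results['mean_test_score'], label='Linear Kernel', marker='o')
plt.plot(rbf_results['param_C'].astype(float), rbf_results['mean_test_score'], label='RBF Kernel', marker='o')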
plt.xscale('log')
plt.title('SVM: Effect of C on Accuracy')
plt.xlabel('C (Regularization Parameter)')
plt.ylabel('Cross-Validation Accuracy')
plt.legend()
plt.grid()
plt.show()

## Step 5: Evaluation

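Examine precision and recall, plot a confusion matrix, and evaluate the model in more detail.

Note that once the best hyperparameters are found, we typically retrain the model on the entire *training* set. `GridSearchCV` does this by default (`refit=True`) and exposes the refitted model as `best_estimator_`.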
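
A minimal evaluation sketch on synthetic stand-in data. The hyperparameters of `final_model` are hypothetical placeholders for whatever the search selects; with a fitted search object you would typically use its `best_estimator_` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the binned wine data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Stand-in for the tuned model: hypothetical hyperparameters,
# fitted on the whole training split
final_model = SVC(C=1.0, kernel="rbf").fit(X_tr, y_tr)
y_pred = final_model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Per-class precision, recall and F1 score
print(classification_report(y_te, y_pred, zero_division=0))
```

Per-class precision and recall are more informative than overall accuracy when the classes are imbalanced, because a model can score well on accuracy while rarely predicting the minority classes.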

In [None]:
#;)