<a href="https://colab.research.google.com/github/samsung-ai-course/6-7-edition/blob/main/Supervised%20Learning/SVM_Hyperparameter_Tuning_WineQuality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter Tuning Practice: Wine Quality Dataset (SVM)
In this notebook, you'll explore hyperparameter tuning using Support Vector Machines (SVM) on the Wine Quality dataset. You'll learn how to tune key hyperparameters, visualize results, and interpret the impact of these hyperparameters.

## Objectives:
1. Load and preprocess the Wine Quality dataset.
2. Split the data into training and test sets.
3. Perform hyperparameter tuning for SVM.
4. Visualize training vs validation performance.


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Enable inline plots
%matplotlib inline

## Step 1: Load the Wine Quality Dataset
We'll use the Wine Quality dataset, which contains information about different wines and their quality ratings. You can download it and read more about it at [Wine Quality Dataset](https://www.kaggle.com/datasets/yasserh/wine-quality-dataset), or load it directly from GitHub.


In [None]:
# Load the dataset from the URL below (WineQT.csv), or from a local copy if you downloaded it
url = 'https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/Supervised%20Learning/Datasets/WineQT.csv'
data = #TODO:

# Display the first few rows
data.head()

### Step 1.1: Free EDA
Here you are free to explore the data in whatever way is useful to you.
* .describe()
* .info()
* plot correlation matrices
* up to you
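
As a starting point, the checks listed above might look like the sketch below. It uses a small synthetic DataFrame so that it runs on its own; the column names (`alcohol`, `pH`, `quality`) and all values are assumptions for illustration only, so substitute `data` once the real dataset is loaded.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (hypothetical columns and values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, 200),
    "pH": rng.normal(3.3, 0.15, 200),
})
# Make the toy target depend on alcohol so a correlation is visible
noisy_score = demo["alcohol"] - 5 + rng.normal(0, 0.7, 200)
demo["quality"] = np.clip(np.round(noisy_score), 3, 8).astype(int)

print(demo.describe())       # summary statistics per column
demo.info()                  # dtypes and non-null counts
print(demo.isna().sum())     # missing values per column
print(demo["quality"].value_counts().sort_index())  # class balance of the target

corr = demo.corr()           # pairwise Pearson correlations
print(corr["quality"].sort_values(ascending=False))
```

Sorting the `quality` column of the correlation matrix is a quick way to see which features move with the target before plotting anything.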


In [None]:
# Up to you
# Whatever you do, what conclusions (if any) can you draw from it?

## Step 2: Data Preprocessing
- Separate features and target variable.
- Split the dataset into training and testing sets.
- Optionally perform some feature engineering and assess its impact.
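
The steps above can be sketched on toy data as follows. This is only one possible approach: the column names, the bin edges `[2, 4, 6, 8]`, the 80/20 split, and the use of `stratify` are all assumptions chosen for illustration, not the notebook's required solution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (hypothetical columns and values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, 300),
    "pH": rng.normal(3.3, 0.15, 300),
    "quality": rng.integers(3, 9, 300),  # integer scores from 3 to 8
})

# Separate features and target
X_demo = demo.drop(columns=["quality"])
y_raw = demo["quality"]

# Assumed bin edges: (2, 4] -> low, (4, 6] -> medium, (6, 8] -> high
y_demo = pd.cut(y_raw, bins=[2, 4, 6, 8], labels=["low", "medium", "high"])

# stratify keeps the class proportions similar in both splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(Xd_train.shape, Xd_test.shape)
print(yd_train.value_counts(normalize=True))
```

Printing the normalized class counts after the split makes any class imbalance visible early, which matters for a multi-class target like this one.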

In [None]:
# Separate features and target
X = #TODO p.s check .drop
y = #TODO

In [None]:


# Bin the target variable into 'low', 'medium', and 'high' quality for classification -> https://en.wikipedia.org/wiki/Data_binning
y = #TODO let's turn this into a simpler classification problem (hint: look up pd.cut)
# Note: this is still a multi-class problem rather than a simple binary one. Class imbalance might cause more problems here. More on that in the next class, but think about it.
# After you finish this notebook, try it again without binning.

# Split into training and testing sets
#TODO

print(f"Training set: {X_train.shape}, Testing set: {X_test.shape}")

## Step 3: SVM Hyperparameter Tuning
Experiment with different values for `C` (regularization parameter) and kernel types to see how they affect training and validation accuracy.
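
For comparison with an exhaustive grid, a randomized search samples a fixed number of configurations from distributions. The sketch below is a hedged illustration on synthetic data from `make_classification`; the log-uniform ranges, `n_iter=10`, `cv=3`, and the `StandardScaler` step are assumptions, not values prescribed by this notebook. Scaling is included because SVMs are sensitive to feature scale, but whether it helps on the wine data is something to verify yourself.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the binned wine data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Scaling inside the pipeline is re-fit on each CV training fold
pipe = make_pipeline(StandardScaler(), SVC())

# make_pipeline names the SVC step 'svc', hence the 'svc__' prefix
param_distributions = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-3, 1e0),  # ignored by the linear kernel
    "svc__kernel": ["linear", "rbf"],
}

demo_search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=10, cv=3, scoring="accuracy", random_state=0
)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_, demo_search.best_score_)
```

Sampling `C` and `gamma` on a log scale covers several orders of magnitude with few iterations, which is usually more efficient than listing evenly spaced values.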

In [None]:
# Define parameter grid for SVM
# Note: the more cross validations (cv) and the bigger the search space, the slower this can be
param_grid = {
    'C': [0.1],  # extend this list (e.g. 1, 10) to compare several values of C in Step 4
    'kernel': ['linear', 'rbf'],
    'gamma': ['auto']  # Only relevant for non-linear kernels like 'rbf'
}

# Perform Grid Search
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters and scores
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

In [None]:
random_search = #TODO -> https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
#TODO

print("Best Parameters:", random_search.best_params_)
print("Best Cross-Validation Accuracy:", random_search.best_score_)

## Step 4: Visualize the Results
Plot the cross-validation accuracy for different values of `C` to understand its impact.
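
To compare training and validation accuracy, the grid search has to be asked to record training scores, since `return_train_score` is `False` by default. The following is a minimal sketch on synthetic data; the grid values, `cv=3`, and the `demo_` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the binned wine data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

demo_grid = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=3,
    scoring="accuracy",
    return_train_score=True,  # needed for the mean_train_score column
)
demo_grid.fit(X_demo, y_demo)

demo_results = pd.DataFrame(demo_grid.cv_results_)
demo_results["param_C"] = demo_results["param_C"].astype(float)

# One row per C, one column per (score type, kernel) pair
summary = demo_results.pivot(
    index="param_C",
    columns="param_kernel",
    values=["mean_train_score", "mean_test_score"],
)
print(summary)
```

A training score that keeps rising with `C` while the validation score flattens or drops is a typical sign of overfitting.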

In [None]:
# Extract results from grid search
results = pd.DataFrame(grid_search.cv_results_)

# Filter for linear and rbf kernels separately
linear_results = results[results['param_kernel'] == 'linear']
rbf_results = results[results['param_kernel'] == 'rbf']

# Plot results
plt.figure(figsize=(10, 6))
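# cv_results_ can store parameters as objects, so cast C to float for the log axis
plt.plot(linear_results['param_C'].astype(float), linear_results['mean_test_score'], label='Linear Kernel', marker='o')
plt.plot(rbf_results['param_C'].astype(float), rbf_results['mean_test_score'], label='RBF Kernel', marker='o')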
plt.xscale('log')
plt.title('SVM: Effect of C on Accuracy')
plt.xlabel('C (Regularization Parameter)')
plt.ylabel('Cross-Validation Accuracy')
plt.legend()
plt.grid()
plt.show()

## Step 5: Evaluation

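Understand precision and recall, plot a confusion matrix, and evaluate the model in more detail.

Note that after the hyperparameters are found, we typically retrain the model on the entire *training* set. `GridSearchCV` does this automatically when `refit=True` (the default) and exposes the refitted model as `best_estimator_`.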
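
One way to carry out this evaluation is sketched below on synthetic data. The grid, the split, and the `demo_`/`yd_` names are assumptions for illustration; on the wine data you would reuse your own search object and test split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the binned wine data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# refit=True (the default) retrains the best configuration on all of Xd_train
demo_grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf"]}, cv=3)
demo_grid.fit(Xd_train, yd_train)

best_model = demo_grid.best_estimator_
yd_pred = best_model.predict(Xd_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yd_test, yd_pred)
print(cm)

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(yd_test, yd_pred, zero_division=0))
```

The test set is used only once here, after tuning, so the reported metrics are not biased by the hyperparameter search.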

In [None]:
#;)