# Wine

The Wine dataset is a small benchmark for classification, often used to test machine learning algorithms. It consists of 178 wine samples, each described by 13 features of its chemical composition. The objective is to assign each wine to one of three classes (types of wine).

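This notebook classifies the samples with a Support Vector Classifier (SVC). The features are scaled with `MinMaxScaler` inside a `Pipeline`, and the kernel and its hyperparameters are selected by a grid search with stratified 5-fold cross-validation.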

In [25]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split as split
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score as accuracy
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

In [26]:
# Load the dataset
wine = datasets.load_wine()
X, y = wine.data, wine.target
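
For a quick tabular look at the data, the same dataset can also be loaded as a pandas DataFrame. This is a minimal sketch, assuming scikit-learn 0.23 or newer (for `as_frame=True`); the names `wine_df` and `class_counts` are introduced here for illustration and are not used elsewhere in the notebook.

```python
from sklearn import datasets

# as_frame=True returns pandas objects instead of NumPy arrays
# (assumption: scikit-learn >= 0.23).
wine_df = datasets.load_wine(as_frame=True).frame

# 13 feature columns plus the "target" column
print(wine_df.shape)

# Number of samples in each of the three classes
class_counts = wine_df["target"].value_counts().sort_index()
print(class_counts)

# Per-feature summary statistics
summary = wine_df.drop(columns="target").describe().T
print(summary[["min", "max", "mean", "std"]])
```

The classes are not perfectly balanced, which is one reason to stratify the train/test split and the cross-validation folds.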

In [27]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Fl

In [28]:
# Split into training and test sets (70/30, stratified by class)
X_train, X_test, y_train, y_test = split(X, y, test_size=0.3, shuffle=True, random_state=0, stratify=y)
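
Because the split uses `stratify=y`, each subset should keep roughly the class proportions of the full dataset. The sketch below checks this; it repeats the loading and splitting steps so it can run on its own, and the `*_frac` names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

wine = datasets.load_wine()
X, y = wine.data, wine.target

# Same split settings as in the cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0, stratify=y
)

# Fraction of samples per class in the full, training and test sets
full_frac = np.bincount(y) / len(y)
train_frac = np.bincount(y_train) / len(y_train)
test_frac = np.bincount(y_test) / len(y_test)

print(f"Training samples: {len(y_train)}, test samples: {len(y_test)}")
print("Full set :", np.round(full_frac, 3))
print("Training :", np.round(train_frac, 3))
print("Test     :", np.round(test_frac, 3))
```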

In [29]:
# Create the pipeline containing the scaler and the model
pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("svc", SVC())])

# Hyperparameter grids to search, one per kernel
parameters = [{"svc__kernel": ["linear"], "svc__C": [0.01, 0.1, 1, 10, 100]},
              {"svc__kernel": ["rbf"], "svc__C": [0.01, 0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1, 10, 100]},
              {"svc__kernel": ["poly"], "svc__C": [0.01, 0.1, 1, 10, 100], "svc__degree": np.arange(1,5,1)}]

# Set the number of folds (subsets) used for cross-validation
crossval = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

In [30]:
clf = GridSearchCV(pipe, param_grid=parameters, cv=crossval, n_jobs=-1)
clf.fit(X_train, y_train)
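
After fitting, `GridSearchCV` stores the score of every candidate in `cv_results_`, which helps to judge how close the runners-up are to the selected model. The following is a minimal, self-contained sketch that repeats the setup above (without `n_jobs=-1`, for simplicity) and lists the five best candidates; the names `results` and `top` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, shuffle=True,
    random_state=0, stratify=wine.target
)

pipe = Pipeline([("scaler", MinMaxScaler()), ("svc", SVC())])
C_values = [0.01, 0.1, 1, 10, 100]
parameters = [
    {"svc__kernel": ["linear"], "svc__C": C_values},
    {"svc__kernel": ["rbf"], "svc__C": C_values,
     "svc__gamma": [0.01, 0.1, 1, 10, 100]},
    {"svc__kernel": ["poly"], "svc__C": C_values,
     "svc__degree": np.arange(1, 5, 1)},
]
crossval = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

clf = GridSearchCV(pipe, param_grid=parameters, cv=crossval)
clf.fit(X_train, y_train)

# One row per hyperparameter combination, ranked by mean validation accuracy
results = pd.DataFrame(clf.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(5)
print(f"Candidates evaluated: {len(results)}")
print(top.to_string(index=False))
```

If several candidates tie or fall within one standard deviation of the best score, the choice between them is not strongly supported by the data.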

In [31]:
pred_train = clf.best_estimator_.predict(X_train)
pred_test = clf.best_estimator_.predict(X_test)
print(f"Best parameters are: {clf.best_params_}, with a score of {round(clf.best_score_,3)}")
print(f"Accuracy on training set is: {round(accuracy(y_train, pred_train), 3)}")
print(f"Accuracy on test set is : {round(accuracy(y_test, pred_test), 3)}")

Best parameters are: {'svc__C': 0.01, 'svc__degree': 4, 'svc__kernel': 'poly'}, with a score of 0.976
Accuracy on training set is: 1.0
Accuracy on test set is : 0.981
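
Overall accuracy can hide differences between classes, so per-class precision and recall are a useful complement. The sketch below is an illustration under one assumption: instead of rerunning the grid search, it rebuilds the pipeline with the best parameters printed above (`poly` kernel, `C=0.01`, `degree=4`). The names `best_pipe` and `report` are hypothetical.

```python
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, shuffle=True,
    random_state=0, stratify=wine.target
)

# Assumption: hyperparameters copied from the grid search output above
best_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("svc", SVC(kernel="poly", C=0.01, degree=4)),
])
best_pipe.fit(X_train, y_train)
pred_test = best_pipe.predict(X_test)

# Human-readable table of precision, recall and F1 per class
print(classification_report(y_test, pred_test, target_names=wine.target_names))

# The same numbers as a dictionary, for programmatic use
report = classification_report(
    y_test, pred_test, target_names=wine.target_names, output_dict=True
)
```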


In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, pred_test)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cnf_matrix, annot=True, cmap="YlGnBu", fmt='g')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix')
plt.show()
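
Since the classes have different sizes, a row-normalized confusion matrix (each row divided by the number of true samples in that class) can be easier to read than raw counts. This is a minimal sketch assuming scikit-learn 0.22 or newer (for the `normalize` argument) and, as an assumption, the best parameters reported above rather than a fresh grid search; `best_pipe` and `cnf_norm` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, shuffle=True,
    random_state=0, stratify=wine.target
)

# Assumption: hyperparameters copied from the grid search output above
best_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("svc", SVC(kernel="poly", C=0.01, degree=4)),
])
best_pipe.fit(X_train, y_train)
pred_test = best_pipe.predict(X_test)

# normalize="true" divides each row by the number of true samples in that
# class, so the diagonal holds the per-class recall.
cnf_norm = confusion_matrix(y_test, pred_test, normalize="true")
print(np.round(cnf_norm, 2))
```

The resulting array can be passed to `sns.heatmap` in the same way as the count matrix, for example with `fmt=".2f"`.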