# Wine recognition dataset

The data is the results of a chemical analysis of wines grown in the sameregion in Italy by three different cultivators. 
There are thirteen different measurements taken for different constituents found in the three types of wine.

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2



In [1]:
import pandas as pd
import numpy as np

## Loading Datasets

In [3]:
from sklearn.datasets import load_wine

In [6]:
W = load_wine()

In [9]:
W.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

In [10]:
df = pd.DataFrame(data = W['data'], columns = W['feature_names'])

In [11]:
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [14]:
W['target_names']

array(['class_0', 'class_1', 'class_2'], dtype='<U7')

Here the datasets is numeric in nature. Hence we are not performing any type of preprocessing techniques.

## Spliting DataStets

In [17]:
from sklearn.model_selection import train_test_split

In [15]:
X = df

In [16]:
y = W['target']

In [18]:
X_train , X_test , y_train , y_test = train_test_split(X , y , test_size = 0.30 , random_state = 50)

## loading all the  model 

Now process every  model by applying loop to find out the highest score and select best model for this classification problem.

In [19]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC , LinearSVC ,NuSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis , QuadraticDiscriminantAnalysis

In [53]:
Model = [KNeighborsClassifier(3),
         SVC(kernel= "rbf", C = 0.025, probability = True),
         NuSVC(probability= True),
         DecisionTreeClassifier(),
         RandomForestClassifier(),
         AdaBoostClassifier(),
         GradientBoostingClassifier()
        ]

In [57]:
for classifier in Model:
    M =classifier.fit(X_train,y_train)
    print(classifier)
    print("/n")
    print(M.score(X_test, y_test))

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')
/n
0.6111111111111112
SVC(C=0.025, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=True, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
/n
0.42592592592592593
NuSVC(cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
   kernel='rbf', max_iter=-1, nu=0.5, probability=True, random_state=None,
   shrinking=True, tol=0.001, verbose=False)
/n
0.5
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, r



AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
/n
0.9629629629629629
GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)
/n
0.9259259259259259


As per the score we found out that AdaBoostingClassifier has the best score.
Now we select AdaBoostingClassifier for Hyperparameter optimization.

To increase the performance of selected algorithim we need to optimize the parameters. 
This can be done by a GridSearchCV. This function comes under scikit-learn.
We need to pass function a grid in the form of python dictionary containing the parameter names and corresponding list of parameters. 

In this model,I have choosen 3 parameters to tune our model.

##### n_estimator 
##### random_state
##### learing_rate


In [76]:
from sklearn.model_selection import GridSearchCV 


In [75]:

learning_rate = [1.0,2.0,3.0,4.0,5.0]
random_state = [2,7,6,12,15]
n_estimators = [100, 300, 500, 800, 1200]
param_grid = dict(n_estimators = n_estimators,learning_rate = learning_rate , random_state=random_state)

ADB = AdaBoostClassifier()

grid_Search = GridSearchCV(estimator= ADB , param_grid=param_grid , cv= 1 )
best_model = grid_search.fit(X_train , y_train)
gs_pred = best_model.predict(X_test) 
print(round(best_model.score(X_test , y_test),2))
print(best_model.best_params_)
print("---------------------------------------------------------------")
print(classification_report(y_test , gs_pred))
print(confusion_matrix(y_test,gs_pred))



0.98
{'base_estimator': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=2, max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'), 'base_estimator__max_depth': 1}
---------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      0.94      0.97        16
           1       0.96      1.00      0.98        23
           2       1.00      1.00      1.00        15

   micro avg       0.98      0.98      0.98        54
   macro avg       0.99      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54

[[15  1  0]
 [ 0 23  0]
 [ 0  0 15]]


After tuning our model , we just find out the best model to predict labels on the test set.
The overall performance is improve drastically.

### Thank you