### 1. Have you come across Grid Search Cross Validation? Fit any two models covered in previous classes and optimize them using Grid search CV.
Cross-validation is a resampling technique for evaluating and selecting machine learning models on a limited dataset. In k-fold cross-validation, the training data is split into k folds; in each round, (k-1) folds are used for training and the remaining fold is used for validation, so that every fold serves as the validation set exactly once.
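
As a minimal sketch of the k-fold split (using a toy array of 10 samples and k = 5, both chosen only for illustration), scikit-learn's `KFold` yields the training and validation indices for each round:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples, 1 feature (illustrative only, not the wine data)
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=False)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    # k-1 folds train the model; the one remaining fold validates it
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```

Each sample appears in the validation fold of exactly one round and in the training folds of the other four.
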
Grid search cross-validation is a technique for selecting the best configuration of a machine learning model from a grid of hyperparameter values. The scikit-learn library provides an implementation in `GridSearchCV`.
`GridSearchCV` tries every combination of values in the parameter grid and returns the combination with the best cross-validated score. This exhaustive search is also its main disadvantage: the model is trained once per combination per fold, so the running time grows multiplicatively with the number of values tried for each hyperparameter.
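
To make the cost concrete, the sketch below enumerates a small grid with scikit-learn's `ParameterGrid` and counts the resulting fits. The grid values and the fold count are hypothetical and serve only to illustrate the arithmetic:

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid, chosen only to illustrate the counting
toy_grid = {"max_depth": [2, 4, 6], "criterion": ["gini", "entropy"]}

candidates = list(ParameterGrid(toy_grid))
n_folds = 5
total_fits = len(candidates) * n_folds  # one fit per candidate per fold

for params in candidates:
    print(params)
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one further fit of the best candidate on the full training set after the search finishes.
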


In [53]:
# Import the required libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix 

In [4]:
# Load the dataset
wine = pd.read_csv("wine.csv")

In [5]:
# Know the data
wine.head() #related to red variants of the Portuguese "Vinho Verde" wine

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [6]:
wine.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1393,8.0,0.52,0.25,2.0,0.078,19.0,59.0,0.99612,3.3,0.48,10.2,5
488,11.6,0.32,0.55,2.8,0.081,35.0,67.0,1.0002,3.32,0.92,10.8,7
1289,7.0,0.6,0.3,4.5,0.068,20.0,110.0,0.99914,3.3,1.17,10.2,5
1492,6.2,0.65,0.06,1.6,0.05,6.0,18.0,0.99348,3.57,0.54,11.95,5
166,6.8,0.64,0.1,2.1,0.085,18.0,101.0,0.9956,3.34,0.52,10.2,5


In [7]:
wine.info()   #prints a concise summary of the data set

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [8]:
wine['citric acid'].value_counts()   # value_counts() returns a Series with the count of each unique value,
                                     # sorted in descending order so that the first element is the most
                                     # frequent value. NA values are excluded by default.
                                    

0.00    132
0.49     68
0.24     51
0.02     50
0.26     38
       ... 
0.75      1
0.78      1
1.00      1
0.62      1
0.72      1
Name: citric acid, Length: 80, dtype: int64

In [9]:
wine['pH'].value_counts()

3.30    57
3.36    56
3.26    53
3.38    48
3.39    48
        ..
2.95     1
3.74     1
2.87     1
2.90     1
3.70     1
Name: pH, Length: 89, dtype: int64

In [10]:
wine['sulphates'].value_counts()

0.60    69
0.58    68
0.54    68
0.62    61
0.56    60
        ..
1.20     1
0.33     1
1.26     1
1.11     1
1.61     1
Name: sulphates, Length: 96, dtype: int64

In [11]:
wine.describe()   #returns description of the data in the data set

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [12]:
# Check for missing values
wine.isna().sum()  #returns the number of missing values in each column

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [14]:
X = wine.drop(['quality'], axis=1)

y = wine['quality']

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [16]:
X_train.shape, X_test.shape

((1071, 11), (528, 11))
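
The split above is purely random. When some classes are much rarer than others, one option (not used in this notebook) is to pass `stratify` to `train_test_split` so that both subsets keep the same class proportions. The sketch below uses synthetic labels, which are an assumption made only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (illustrative only): 90 samples of class 0, 10 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 90 + [1] * 10)

# An integer test_size is an absolute number of test samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=20, random_state=42, stratify=y_demo
)

# With stratify, the 90/10 class ratio is preserved in both subsets
print(np.bincount(y_tr), np.bincount(y_te))
```

Note that stratification requires at least two samples of every class, which may not hold for the rarest labels in a real dataset.
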

In [17]:
X_train.dtypes

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
dtype: object

In [18]:
X_train.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
548,12.4,0.35,0.49,2.6,0.079,27.0,69.0,0.9994,3.12,0.75,10.4
355,6.7,0.75,0.01,2.4,0.078,17.0,32.0,0.9955,3.55,0.61,12.8
1296,6.6,0.63,0.0,4.3,0.093,51.0,77.5,0.99558,3.2,0.45,9.5
209,11.0,0.3,0.58,2.1,0.054,7.0,19.0,0.998,3.31,0.88,10.5
140,8.4,0.745,0.11,1.9,0.09,16.0,63.0,0.9965,3.19,0.82,9.6


In [19]:
X_test.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
803,7.7,0.56,0.08,2.5,0.114,14.0,46.0,0.9971,3.24,0.66,9.6
124,7.8,0.5,0.17,1.6,0.082,21.0,102.0,0.996,3.39,0.48,9.5
350,10.7,0.67,0.22,2.7,0.107,17.0,34.0,1.0004,3.28,0.98,9.9
682,8.5,0.46,0.31,2.25,0.078,32.0,58.0,0.998,3.33,0.54,9.8
1326,6.7,0.46,0.24,1.7,0.077,18.0,34.0,0.9948,3.39,0.6,10.6


In [20]:
from sklearn.tree import DecisionTreeClassifier

In [23]:
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,accuracy_score,f1_score
from sklearn.model_selection import GridSearchCV

In [24]:
c = RandomForestClassifier(n_estimators = 100, random_state = 18).fit(X_train, y_train)

In [25]:
preds = c.predict(X_test)

In [26]:
confusion_matrix(y_test, preds)

array([[  0,   0,   2,   0,   0,   0],
       [  0,   0,   9,  10,   0,   0],
       [  0,   0, 165,  50,   2,   0],
       [  0,   0,  46, 149,  18,   0],
       [  0,   0,   2,  40,  27,   1],
       [  0,   0,   0,   0,   6,   1]], dtype=int64)

In [28]:
accuracy_score(y_test, preds)

0.6477272727272727

In [29]:
f1_score(y_test,preds, average="micro")

0.6477272727272727
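
The identical values for accuracy and micro-averaged F1 are expected: in a single-label multiclass problem, micro-averaging pools the true positives, false positives, and false negatives over all classes, which reduces to accuracy. A small sketch with made-up labels (not the wine predictions) shows this, together with the macro average, which weights every class equally:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration only
y_true = [5, 5, 5, 5, 6, 6, 6, 7]
y_pred = [5, 5, 5, 6, 6, 6, 5, 5]

acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
# zero_division=0 silences the warning for the class that is never predicted
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.3f} micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

For imbalanced targets, the macro average is often the more informative of the two because a class that is never predicted correctly pulls it down.
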

In [33]:
grid = { 
    'n_estimators': [200,300,400,500],
    'max_features': ['sqrt', 'log2'],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy'],
    'random_state' : [18]
}

In [34]:
rf_cv = GridSearchCV(estimator=RandomForestClassifier(), param_grid=grid, cv= 5)
rf_cv.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [4, 5, 6, 7, 8],
                         'max_features': ['sqrt', 'log2'],
                         'n_estimators': [200, 300, 400, 500],
                         'random_state': [18]})

In [35]:
rf_cv.best_params_

{'criterion': 'gini',
 'max_depth': 8,
 'max_features': 'sqrt',
 'n_estimators': 400,
 'random_state': 18}
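
Rather than retyping the winning values by hand, the fitted search object can supply them. The sketch below uses a synthetic dataset from `make_classification` and a deliberately small grid (both are assumptions so that the example runs quickly); `best_estimator_`, `best_params_`, and `best_score_` are standard `GridSearchCV` attributes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the wine dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=18), small_grid, cv=3)
search.fit(X_tr, y_tr)

# With refit=True (the default), best_estimator_ is already retrained on the
# full training set using best_params_, so no manual re-instantiation is needed
tuned = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
print("test accuracy:", tuned.score(X_te, y_te))

# Equivalent manual construction that cannot drift from the search result
manual = RandomForestClassifier(random_state=18, **search.best_params_)
```

In this notebook, the corresponding calls would be `rf_cv.best_estimator_.predict(X_test)` or `RandomForestClassifier(**rf_cv.best_params_)`.
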

In [36]:
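# Note: these values differ from rf_cv.best_params_ (n_estimators=400, max_depth=8); rf_cv.best_estimator_ holds the tuned model.
clf = RandomForestClassifier(n_estimators = 200, max_depth=4, max_features='sqrt' , random_state = 18).fit(X_train, y_train)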

In [37]:
prediction = clf.predict(X_test)

In [38]:
confusion_matrix(y_test, prediction)

array([[  0,   0,   2,   0,   0,   0],
       [  0,   0,   8,  11,   0,   0],
       [  0,   0, 167,  49,   1,   0],
       [  0,   0,  80, 129,   4,   0],
       [  0,   0,   0,  59,  11,   0],
       [  0,   0,   0,   5,   2,   0]], dtype=int64)

In [39]:
accuracy_score(y_test, prediction)

0.5814393939393939

In [40]:
f1_score(y_test,prediction, average="micro")

0.5814393939393939

In [41]:
clf = DecisionTreeClassifier(random_state=0)

In [42]:
clf.fit(X_train,y_train)

DecisionTreeClassifier(random_state=0)

In [43]:
prediction = clf.predict(X_test)

In [44]:
confusion_matrix(y_test, prediction)

array([[  0,   1,   1,   0,   0,   0],
       [  0,   1,  10,   6,   2,   0],
       [  0,  11, 140,  56,   9,   1],
       [  0,   5,  49, 119,  37,   3],
       [  0,   2,   5,  25,  37,   1],
       [  0,   0,   0,   2,   3,   2]], dtype=int64)

In [45]:
accuracy_score(y_test, prediction)

0.5662878787878788

In [46]:
f1_score(y_test,prediction, average="micro")

0.5662878787878788
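
The decision tree above is fitted with default settings only. A minimal sketch of how the same grid-search pattern could be applied to it follows; the synthetic data and the particular grid values (`max_depth`, `min_samples_leaf`, `criterion`) are illustrative assumptions, not tuned choices for the wine data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the wine dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical grid: 3 * 3 * 2 = 18 candidates
tree_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

tree_cv = GridSearchCV(DecisionTreeClassifier(random_state=0), tree_grid, cv=5)
tree_cv.fit(X_demo, y_demo)

print(tree_cv.best_params_)
print(round(tree_cv.best_score_, 3))
n_candidates = len(tree_cv.cv_results_["params"])
```

On this notebook's data, the fit call would be `tree_cv.fit(X_train, y_train)`, followed by `tree_cv.predict(X_test)` for evaluation.
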

In [54]:
# print prediction results 
predictions = clf.predict(X_test) 
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           3       0.00      0.00      0.00         2
           4       0.05      0.05      0.05        19
           5       0.68      0.65      0.66       217
           6       0.57      0.56      0.57       213
           7       0.42      0.53      0.47        70
           8       0.29      0.29      0.29         7

    accuracy                           0.57       528
   macro avg       0.34      0.35      0.34       528
weighted avg       0.57      0.57      0.57       528





In [55]:
# Grid Search CV implementation
# Note: 'gamma' is ignored by the linear kernel, so only 'C' affects these fits.
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': ['scale', 'auto'],
              'kernel': ['linear']}

grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3, n_jobs=-1)

# fitting the model for grid search
grid.fit(X_train, y_train)

# print the best hyperparameters found
print(grid.best_params_)
grid_predictions = grid.predict(X_test)

# print the true labels alongside the predictions
# (classification_report(y_test, grid_predictions) would print per-class metrics instead)
print((y_test, grid_predictions))

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:   15.4s
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  8.9min finished


{'C': 100, 'gamma': 'scale', 'kernel': 'linear'}
(803     6
124     5
350     6
682     5
1326    6
       ..
813     4
377     7
898     7
126     5
819     5
Name: quality, Length: 528, dtype: int64, array([5, 5, 6, 5, 6, 5, 5, 5, 6, 6, 6, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5,
       6, 6, 5, 5, 6, 5, 5, 6, 5, 5, 6, 5, 6, 5, 6, 6, 5, 6, 5, 5, 6, 5,
       6, 6, 6, 5, 5, 6, 5, 5, 6, 6, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 6, 5,
       6, 5, 6, 5, 6, 5, 6, 6, 6, 5, 6, 6, 6, 6, 5, 6, 5, 6, 6, 6, 5, 6,
       6, 5, 6, 5, 6, 6, 5, 5, 5, 6, 5, 6, 5, 5, 6, 6, 6, 6, 5, 5, 6, 5,
       6, 5, 6, 5, 6, 6, 6, 6, 5, 6, 6, 5, 6, 5, 5, 5, 6, 6, 6, 6, 6, 5,
       5, 6, 6, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 5, 6, 5, 6, 5, 6, 6, 5, 6,
       6, 6, 5, 6, 5, 6, 6, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 5, 6, 5, 5, 6,
       6, 5, 5, 5, 5, 6, 5, 6, 5, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 5, 5, 5,
       5, 6, 5, 5, 5, 5, 6, 6, 5, 5, 5, 6, 6, 5, 6, 6, 6, 6, 5, 5, 6, 5,
       5, 6, 6, 6, 5, 5, 5, 6, 5, 5, 5, 5, 6, 6, 6, 6, 5, 6, 5, 5, 5