### Problem


You're a data scientist working for a car company. Your job is to predict whether a customer will purchase a new SUV given their age and estimated salary. The end goal is to show an ad to the customers whose prediction is 1.

### Grid search

A technique for finding the best values of a model's hyperparameters, the parameters that are set before training rather than learned from the data.
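As a rough illustration of the idea, the sketch below tries every combination from a small grid and keeps the best-scoring one. The `validation_score` function and the grid values are made up for this example; the function is only a stand-in for "train the model and measure it on held-out data" and is not part of scikit-learn or this notebook.

```python
from itertools import product

# Hypothetical grid: every value of C is paired with every value of gamma.
grid = {"C": [0.25, 0.5, 0.75, 1], "gamma": [0.1, 0.5, 0.9]}


def validation_score(C, gamma):
    """Toy stand-in for 'fit a model and score it on held-out data'.

    Assumption for illustration only: the score peaks at C=0.75, gamma=0.5.
    """
    return 1.0 - abs(C - 0.75) - abs(gamma - 0.5)


def exhaustive_search(grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = exhaustive_search(grid, validation_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths (here 4 x 3 = 12 evaluations), which is why grids are usually kept small.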

### Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Dataset

In [2]:
dataset_path = "../../../../datasets/ml_az_course/006_social_network_ads.csv"

df = pd.read_csv(dataset_path)
df

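| | Age | EstimatedSalary | Purchased |
|---|---|---|---|
| 0 | 19 | 19000 | 0 |
| 1 | 35 | 20000 | 0 |
| 2 | 26 | 43000 | 0 |
| 3 | 27 | 57000 | 0 |
| 4 | 19 | 76000 | 0 |
| ... | ... | ... | ... |
| 395 | 46 | 41000 | 1 |
| 396 | 51 | 23000 | 1 |
| 397 | 50 | 20000 | 1 |
| 398 | 36 | 33000 | 0 |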


In [3]:
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

x[0]

array([   19, 19000])

In [4]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

### Feature Scaling

In [5]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_train = sc.fit_transform(X=x_train)
x_test = sc.transform(X=x_test)
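To make the scaling step concrete, here is a minimal pure-Python sketch of the arithmetic that standardization performs: subtract the mean and divide by the population standard deviation, both learned from the training data only. The split into `train_ages` and `test_ages` is an assumption for illustration; the values are a few ages taken from the table above.

```python
from statistics import fmean, pstdev

# First five ages from the table above, used here as a tiny "training set".
train_ages = [19, 35, 26, 27, 19]
# Two ages from the tail of the table, used here as "test" data.
test_ages = [46, 51]

# "fit": learn the mean and population standard deviation from training data only.
mean = fmean(train_ages)
std = pstdev(train_ages)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std with already-learned statistics."""
    return [(v - mean) / std for v in values]


train_scaled = standardize(train_ages, mean, std)  # analogous to fit_transform
test_scaled = standardize(test_ages, mean, std)    # analogous to transform (no refit)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test data mirrors the `fit_transform` / `transform` pair above: the test set never influences the scaling.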

### Training

In [6]:
from sklearn.svm import SVC

# Kernel SVC (Support Vector Classification)
# With the Radial Basis Function (RBF) kernel we are building a non-linear model
model = SVC(kernel="rbf", random_state=42)

model.fit(X=x_train, y=y_train)
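For intuition about what the RBF kernel measures, the sketch below evaluates the standard formula K(a, b) = exp(-gamma * ||a - b||^2) by hand. The helper `rbf_kernel` and the two points are made up for illustration; this is not how scikit-learn computes it internally, only the same formula written out.

```python
import math


def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2) for two equal-length vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


# Two made-up points in scaled (age, salary) space, for illustration only.
p = [0.0, 0.0]
q = [1.0, 1.0]

print(rbf_kernel(p, p, gamma=0.5))  # identical points give the maximum similarity
print(rbf_kernel(p, q, gamma=0.1))  # small gamma: similarity decays slowly
print(rbf_kernel(p, q, gamma=0.9))  # large gamma: similarity decays quickly
```

A larger `gamma` makes each training point's influence more local, which tends to give a more flexible decision boundary.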

### Confusion Matrix

In [7]:
from sklearn.metrics import confusion_matrix, accuracy_score


preds = model.predict(X=x_test)
confusion_matrix(y_true=y_test, y_pred=preds)



array([[47,  5],
       [ 1, 27]])

In [8]:
accuracy_score(y_true=y_test, y_pred=preds)

0.925
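The accuracy can be checked by hand from the confusion matrix. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the diagonal holds the correct predictions. The sketch below redoes that arithmetic and also derives precision and recall, which the notebook does not report.

```python
# Confusion matrix from the cell above; rows are true labels, columns are predictions.
cm = [[47, 5],
      [1, 27]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)  # of the customers predicted to buy, the share who did
recall = tp / (tp + fn)     # of the customers who bought, the share that was found

print(accuracy, precision, recall)
```

For an ad campaign, precision and recall may matter more than accuracy, since a missed buyer and a wasted ad do not cost the same.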

### Applying k-fold Cross Validation

In [9]:
from sklearn.model_selection import cross_val_score

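"""The basic idea is to split the dataset into k folds (here cv=8), train on k-1
folds and evaluate the accuracy on the remaining fold, repeating until each fold
has been used once for evaluation. The mean of the k accuracies is a more reliable
estimate than a single train/test split.

Note: this call uses the unscaled x and y rather than the scaled x_train and
y_train, which may explain why the score is lower than the test accuracy above.
"""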

accuracies = cross_val_score(
    estimator=model, X=x, y=y, cv=8
)

print("accuracy: {:.2f} %".format(accuracies.mean() * 100))
print("std {:.2f}".format(accuracies.std()))  # std is a fraction here, not a percentage

accuracy: 77.25 %
std 0.08
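The fold mechanics can be sketched without any library. The helper `k_fold_indices` below is hypothetical and uses plain contiguous folds; for a classifier with an integer `cv`, scikit-learn uses stratified folds that also preserve the class ratio, which this sketch ignores. The 400-row size is an assumption based on the table and the 80-row test set above.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Assumed 400 rows and 8 folds, matching the cv value used above.
folds = list(k_fold_indices(n_samples=400, k=8))
for train, validation in folds:
    print(len(train), len(validation))
```

Each row lands in exactly one validation fold, so every sample is used for evaluation once and for training k-1 times.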


### Applying Grid Search to find the best model and the best parameters

In [17]:
from sklearn.model_selection import GridSearchCV

"""
C: Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.

kernel: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable.
Specifies the kernel type to be used in the algorithm. If none is given, ‘rbf’ will be used.

gamma: {‘scale’, ‘auto’} or float, default=’scale’
Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. - Only necessary when using supported kernels.
"""

hyperparameters_set = [
    {"C": [0.25, 0.5, 0.75, 1], "kernel": ["linear"]},
    {"C": [0.25, 0.5, 0.75, 1], "kernel": ["rbf"], "gamma": np.linspace(0.1, 0.9, num=3).tolist()}
]

# this can be slow on a local machine; n_jobs=-1 uses all available CPU cores
grid_search = GridSearchCV(
    estimator=model, param_grid=hyperparameters_set, scoring="accuracy", cv=10, verbose=2, n_jobs=-1
)

grid_search.fit(X=x_train, y=y_train)

Fitting 10 folds for each of 16 candidates, totalling 160 fits
[CV] END ..............................C=0.25, kernel=linear; total time=   0.0s
[CV] END ..............................C=0.25, kernel=linear; total time=   0.0s
[CV] END ..............................C=0.25, kernel=linear; total time=   0.0s
[CV] END ..............................C=0.25, kernel=linear; total time=   0.0s
[CV] END ..............................C=0.25, kernel=linear; total time=   0.0s
[CV] END ..............................C=0.25, kernel=linear; total time=   0.0s
[CV] END ..............................C=0.25, kernel=linear; total time=   0.0s
[CV] END ..............................C=0.25, kernel=linear; total time=   0.0s
[CV] END ..............................C=0.25, kernel=linear; total time=   0.0s
[CV] END ...............................C=0.5, kernel=linear; total time=   0.0s
[CV] END ...............................C=0.5, kernel=linear; total time=   0.0s
[CV] END ..............................C=0.25,
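The "16 candidates, totalling 160 fits" line in the log follows directly from the grid. The sketch below redoes the count in plain Python, with the `gamma` values written out as the three evenly spaced points that `np.linspace(0.1, 0.9, num=3)` produces; `count_candidates` is a hypothetical helper, not a scikit-learn function.

```python
# The same two sub-grids as in the cell above, with gamma written out explicitly.
hyperparameters_set = [
    {"C": [0.25, 0.5, 0.75, 1], "kernel": ["linear"]},
    {"C": [0.25, 0.5, 0.75, 1], "kernel": ["rbf"], "gamma": [0.1, 0.5, 0.9]},
]
cv_folds = 10


def count_candidates(sub_grid):
    """Number of combinations in one sub-grid: the product of the list lengths."""
    count = 1
    for values in sub_grid.values():
        count *= len(values)
    return count


# Sub-grids are searched independently, so their candidate counts add up.
candidates = sum(count_candidates(g) for g in hyperparameters_set)
total_fits = candidates * cv_folds
print(candidates, total_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit afterwards, training the best candidate on the whole training set.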

In [18]:
best_accuracy = grid_search.best_score_
print("best accuracy", best_accuracy)

best_hyperparameters = grid_search.best_params_

print("best hyperparameters", best_hyperparameters)

best accuracy 0.909375
best hyperparameters {'C': 0.5, 'gamma': 0.9, 'kernel': 'rbf'}
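One possible refinement, sketched below under stated assumptions: in the cells above the scaler was fit on the whole training set before the grid search, so each validation fold was scaled with statistics that included its own rows. Wrapping the scaler and the classifier in a `Pipeline` lets the scaler be refit inside every fold. The data here is synthetic (`make_classification`) and only stands in for the ads dataset, so the printed numbers say nothing about the original results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-feature data standing in for the ads dataset (assumption).
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is refit inside every fold, so validation rows never influence it.
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC(random_state=42))])

# Step-name prefixes ("svc__") route each parameter to the right pipeline step.
param_grid = [
    {"svc__C": [0.25, 0.5, 0.75, 1], "svc__kernel": ["linear"]},
    {"svc__C": [0.25, 0.5, 0.75, 1], "svc__kernel": ["rbf"], "svc__gamma": [0.1, 0.5, 0.9]},
]

search = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# best_estimator_ is the pipeline refit on all of X_train with the best parameters.
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(search.best_params_)
print(search.best_score_, test_accuracy)
```

Scoring `best_estimator_` on the held-out test set gives an estimate that was not used to pick the hyperparameters, unlike `best_score_`.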
