In this exercise, we will use scikit-learn to train a k-nearest neighbors (KNN) classifier on the German credit approval dataset and try out different values for the `n_neighbors` and `p` hyperparameters to find the best-performing combination. Because KNN relies on distances between data points, we will need to scale the data before fitting the model.

1.- Import the `pandas` package as `pd`

In [1]:
import pandas as pd

2.- Create a new variable called `file_url`, which will contain the URL to the raw dataset.

  > **Note**  
  > This exercise is a follow-up to Exercise 3.02, Applying Label Encoding to Transform Categorical Variables into Numerical.

In [2]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Applied-Artificial-Intelligence-Workshop/master/Datasets/german_prepared.csv'

3.- Load the data using the `pd.read_csv()` method and show the first five rows.

In [3]:
df = pd.read_csv(file_url)
df.head()

4.- Import `preprocessing` from `sklearn`

In [4]:
from sklearn import preprocessing

5.- Instantiate `MinMaxScaler` with `feature_range=(0,1)` and save it to a variable called `scaler`

In [5]:
scaler = preprocessing.MinMaxScaler(feature_range=(0,1))

6.- Fit the scaler and apply the corresponding transformation to the DataFrame using `.fit_transform()` and save the results to a variable called `scaled_credit`

In [6]:
scaled_credit = scaler.fit_transform(df)
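
To make the scaling step more concrete, here is a minimal pure-Python sketch of the min-max formula applied to a single column. The function name `min_max_scale` and the sample values are hypothetical and only illustrate the idea; this is not scikit-learn's implementation, which works on whole arrays.

```python
def min_max_scale(column, feature_range=(0.0, 1.0)):
    """Rescale a list of numbers linearly into feature_range."""
    lo, hi = feature_range
    col_min, col_max = min(column), max(column)
    span = col_max - col_min
    if span == 0:
        # A constant column has no spread; map every value to the lower bound.
        return [lo for _ in column]
    # Shift so the minimum becomes 0, divide by the range, then stretch to [lo, hi].
    return [lo + (x - col_min) / span * (hi - lo) for x in column]


# Hypothetical values for one numeric column (not taken from the dataset).
durations = [6, 12, 24, 48]
scaled = min_max_scale(durations)
print(scaled)
```

After scaling, the smallest value in the column maps to 0 and the largest to 1, so no single feature dominates the distance computation just because of its units.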

7.- Extract the `response` variable (the first column) to a new variable called `label`

In [7]:
label = scaled_credit[:, 0]

8.- Extract the features (all the columns except for the first one) to a new variable called `features`

In [8]:
features = scaled_credit[:, 1:]

9.- Import `train_test_split` from `sklearn.model_selection`

In [9]:
from sklearn.model_selection import train_test_split

10.- Split the scaled dataset into training and testing sets with `test_size=0.2` and `random_state=7` using `train_test_split`

In [10]:
features_train, features_test, label_train, label_test = train_test_split(features, label, test_size=0.2, random_state=7)
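
The following sketch shows, under simplifying assumptions, what a seeded 80/20 split does: shuffle the row indices with a fixed seed, then hold out a fraction for testing. The helper `simple_train_test_split` is hypothetical and will not reproduce the exact row order that scikit-learn produces for `random_state=7`; it only illustrates why fixing the seed makes the split reproducible.

```python
import random


def simple_train_test_split(rows, labels, test_size=0.2, seed=7):
    """Shuffle indices with a fixed seed and hold out test_size of the rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle every run
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [rows[i] for i in train_idx],
        [rows[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


# Hypothetical toy data: 10 rows with one feature each.
toy_rows = [[i / 10] for i in range(10)]
toy_labels = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = simple_train_test_split(toy_rows, toy_labels)
print(len(x_train), len(x_test))
```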

11.- Import `neighbors` from `sklearn`

In [11]:
from sklearn import neighbors

12.- Instantiate `KNeighborsClassifier` and save it to a variable called `classifier`

In [12]:
classifier = neighbors.KNeighborsClassifier()

13.- Fit the k-nearest neighbors classifier on the training set. Since we did not specify `n_neighbors` (the value of `k`), the default value of $5$ is used.

In [13]:
classifier.fit(features_train, label_train)

KNeighborsClassifier()
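
As a rough illustration of what the fitted classifier does at prediction time, the sketch below finds the `k` closest training points by Euclidean distance and returns the majority label. The function `knn_predict` and the toy points are hypothetical; scikit-learn uses optimized neighbor searches rather than this brute-force loop.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of query by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-scaled 2D points (not taken from the dataset).
toy_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
              (0.8, 0.9), (0.9, 0.8), (0.85, 0.85), (0.5, 0.5)]
toy_labels = [0, 0, 0, 1, 1, 1, 1]

low_corner = knn_predict(toy_points, toy_labels, (0.2, 0.2))
high_corner = knn_predict(toy_points, toy_labels, (0.8, 0.8))
print(low_corner, high_corner)
```

Note that "fitting" a KNN model mostly amounts to storing the training data; the real work happens when a prediction is requested.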

14.- Print the accuracy score for the training set with `.score()`

In [14]:
acc_train = classifier.score(features_train, label_train)
acc_train

0.78625

With this, we've achieved an accuracy score of $0.78625$ on the training set with the default hyperparameter values: `n_neighbors=5` and the Euclidean distance (`p=2`).
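
For a classifier, `.score()` reports accuracy, that is, the fraction of samples whose predicted label matches the true label. The sketch below computes this by hand; the `accuracy` helper and the label lists are hypothetical. For scale, if the training split holds 800 rows (an assumption about the dataset size), a score of 0.78625 would correspond to 629 correct predictions.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels: four of the five predictions are correct.
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]
toy_score = accuracy(toy_true, toy_pred)
print(toy_score)
```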

15.- Print the accuracy score for the testing set with `.score()`

In [15]:
acc_test = classifier.score(features_test, label_test)
acc_test

0.75

The accuracy score dropped to 0.75 on the testing set. This gap suggests that the model is overfitting slightly and does not generalize as well to unseen data.
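
Trying out different values for `n_neighbors` and `p` can be sketched as a simple loop over candidate combinations. The example below is a self-contained pure-Python illustration on hypothetical toy data, with hand-written helpers (`minkowski`, `knn_predict`, `knn_accuracy`) standing in for the scikit-learn classifier. Here `p=1` gives the Manhattan distance and `p=2` the Euclidean distance.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_points, train_labels, query, k, p):
    """Majority vote among the k training points closest to query."""
    ranked = sorted(
        (minkowski(point, query, p), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_points, train_labels, test_points, test_labels, k, p):
    """Fraction of test points classified correctly for a given k and p."""
    hits = sum(
        knn_predict(train_points, train_labels, q, k, p) == y
        for q, y in zip(test_points, test_labels)
    )
    return hits / len(test_labels)


# Hypothetical, already-scaled toy data (not taken from the dataset).
train_points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3),
                (0.8, 0.9), (0.9, 0.8), (0.7, 0.85)]
train_labels = [0, 0, 0, 1, 1, 1]
test_points = [(0.2, 0.2), (0.85, 0.85)]
test_labels = [0, 1]

# Evaluate every (k, p) combination and record its accuracy.
results = {}
for k in (1, 3, 5):
    for p in (1, 2):
        results[(k, p)] = knn_accuracy(
            train_points, train_labels, test_points, test_labels, k, p
        )

for (k, p), score in sorted(results.items()):
    print(f"n_neighbors={k}, p={p}: accuracy={score}")
```

With real data, comparing many combinations on the testing set tends to give an optimistic estimate, so a separate validation set or cross-validation is usually preferable for choosing hyperparameters.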