# Exercises

1. Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the KNeighborsClassifier works quite well for this task; you just need to find good hyperparameter values (try a grid search on the weights and n_neighbors hyperparameters).  


2. Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel. Then, for each image in the training set, create four shifted copies (one per direction) and add them to the training set. Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called data augmentation or training set expansion.  


3. Tackle the Titanic dataset. A great place to start is on Kaggle. Alternatively, you can download the data from https://homl.info/titanic.tgz and unzip this tarball like you did for the housing data in Chapter 2. This will give you two CSV files, train.csv and test.csv, which you can load using pandas.read_csv(). The goal is to train a classifier that can predict the Survived column based on the other columns.


4. Build a spam classifier (a more challenging exercise):

    - Download examples of spam and ham from Apache SpamAssassin’s public datasets.
    - Unzip the datasets and familiarize yourself with the data format.
    - Split the data into a training set and a test set.
    - Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector that indicates the presence or absence of each possible word. For example, if all emails only ever contain four words, “Hello”, “how”, “are”, “you”, then the email “Hello you Hello Hello you” would be converted into a vector [1, 0, 0, 1] (meaning [“Hello” is present, “how” is absent, “are” is absent, “you” is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.

    You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with “URL”, replace all numbers with “NUMBER”, or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).  


    - Finally, try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.
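
The word-presence encoding described in exercise 4 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four-word vocabulary is the one from the example, `email_to_vector` and its `binary` flag are hypothetical names, emails are lowercased and split on whitespace, and the result is a dense list rather than the sparse vector a real pipeline would produce (scikit-learn's `CountVectorizer` is one option for that).

```python
from collections import Counter

# Hypothetical fixed vocabulary, taken from the four-word example in the exercise
VOCABULARY = ["hello", "how", "are", "you"]


def email_to_vector(text, vocabulary=VOCABULARY, binary=True):
    """Map an email body to one entry per vocabulary word.

    With binary=True each entry is 1 if the word is present, else 0.
    With binary=False each entry is the number of occurrences.
    """
    # Lowercase and split on whitespace (an assumption; no punctuation handling)
    counts = Counter(text.lower().split())
    if binary:
        return [int(counts[word] > 0) for word in vocabulary]
    return [counts[word] for word in vocabulary]


print(email_to_vector("Hello you Hello Hello you"))                # [1, 0, 0, 1]
print(email_to_vector("Hello you Hello Hello you", binary=False))  # [3, 0, 0, 2]
```

The `binary` flag plays the role of one of the pipeline hyperparameters mentioned in the exercise: it switches between presence/absence and occurrence counts.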

In [39]:
import pandas as pd

import sklearn
from packaging import version

sklearn.__version__

'1.2.0'

In [18]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

In [4]:
mnist = fetch_openml('mnist_784', as_frame=False)

In [5]:
X, y = mnist.data, mnist.target

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [35]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

In [32]:
knn_cls = KNeighborsClassifier(
    n_neighbors=5,
    weights='uniform',
    algorithm='auto',
    leaf_size=30,
    p=2,
    metric='minkowski',
    metric_params=None,
    n_jobs=-1)

In [36]:
# Baseline performance with the default hyperparameters
knn_cls.fit(X_train, y_train)
y_pred = knn_cls.predict(X_test)
accuracy_score(y_test, y_pred)  # signature is accuracy_score(y_true, y_pred)

0.9698285714285714

In [46]:
param_grid = [
    {'n_neighbors': [3,4,5,6],
     'weights': ['uniform', 'distance']}
]

grid_search = GridSearchCV(knn_cls, param_grid, cv=3, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train, y_train)

In [50]:
grid_search.best_params_

{'n_neighbors': 4, 'weights': 'distance'}

In [51]:
grid_search.best_score_

0.9696000000000001

In [48]:
cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_neighbors | param_weights | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.595016 | 0.042553 | 69.141923 | 0.124722 | 3 | uniform | {'n_neighbors': 3, 'weights': 'uniform'} | 0.966457 | 0.966114 | 0.968514 | 0.967029 | 0.00106 | 5 |
| 1 | 0.599268 | 0.040178 | 68.231714 | 0.012436 | 3 | distance | {'n_neighbors': 3, 'weights': 'distance'} | 0.967829 | 0.968 | 0.969314 | 0.968381 | 0.000664 | 2 |
| 2 | 0.585236 | 0.032346 | 69.109518 | 0.06287 | 4 | uniform | {'n_neighbors': 4, 'weights': 'uniform'} | 0.964971 | 0.964686 | 0.967257 | 0.965638 | 0.001151 | 7 |
| 3 | 0.580744 | 0.009133 | 68.306253 | 0.218161 | 4 | distance | {'n_neighbors': 4, 'weights': 'distance'} | 0.968914 | 0.968629 | 0.971257 | 0.9696 | 0.001178 | 1 |
| 4 | 0.469227 | 0.002858 | 67.775931 | 0.046721 | 5 | uniform | {'n_neighbors': 5, 'weights': 'uniform'} | 0.965029 | 0.966343 | 0.967143 | 0.966171 | 0.000872 | 6 |


In [49]:
knn_best_cls = grid_search.best_estimator_
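
For exercise 2, the tuned model would next be trained on an augmented training set. The following is a minimal sketch of the one-pixel shift and the four-copy expansion, using NumPy slicing only. It assumes each image is a flattened square array (28×28 for MNIST) and that vacated pixels are filled with zeros; `shift_image`, `augment`, and the `size` parameter are hypothetical names introduced for illustration, not part of the notebook above.

```python
import numpy as np


def shift_image(image, dx, dy, size=28):
    """Shift a flattened size x size image by dx columns and dy rows.

    Positive dx shifts right, positive dy shifts down.
    Pixels shifted in from outside the image are set to 0 (an assumption).
    """
    img = image.reshape(size, size)
    shifted = np.zeros_like(img)
    # Matching source and destination windows for rows (dy) and columns (dx)
    src_rows = slice(max(0, -dy), size - max(0, dy))
    dst_rows = slice(max(0, dy), size - max(0, -dy))
    src_cols = slice(max(0, -dx), size - max(0, dx))
    dst_cols = slice(max(0, dx), size - max(0, -dx))
    shifted[dst_rows, dst_cols] = img[src_rows, src_cols]
    return shifted.reshape(-1)


def augment(X, y, size=28):
    """Return X and y extended with four one-pixel-shifted copies of every image."""
    X_parts, y_parts = [X], [y]
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # right, left, down, up
        X_parts.append(np.array([shift_image(img, dx, dy, size) for img in X]))
        y_parts.append(y)  # shifting does not change the label
    return np.concatenate(X_parts), np.concatenate(y_parts)
```

With the variables defined earlier, `X_train_aug, y_train_aug = augment(X_train, y_train)` would give a training set five times the original size, on which `knn_best_cls` can be refit before scoring it on `X_test`.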