# Q1
Build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set.
Hint: the KNeighborsClassifier works quite well for this task; you
just need to find good hyperparameter values (try a grid search on the weights and n_neighbors hyperparameters).

### 0. Import the required packages

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Is this notebook running on Colab or Kaggle?
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [2]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

### 1. Load the dataset

In [3]:
X, y = mnist["data"], mnist["target"]
X.shape

(70000, 784)

In [4]:
### Split the dataset into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)

In [5]:
print(X_train.shape)
print(y_train.shape)

(56000, 784)
(56000,)


### 2. Model selection

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [7]:
# help(KNeighborsClassifier)
# Available parameters

# n_neighbors : int, default=5
#       Number of neighbors to use by default for :meth:`kneighbors` queries.

# weights : {'uniform', 'distance'} or callable, default='uniform'
#       Weight function used in prediction. Possible values:
#       'uniform'  : all points in each neighborhood are weighted equally.
#       'distance' : points are weighted by the inverse of their distance,
#                    so closer neighbors have more influence.

# algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
#       Algorithm used to compute the nearest neighbors.

# leaf_size : int, default=30
#       Leaf size passed to BallTree or KDTree.  This can affect the
#       speed of the construction and query, as well as the memory
#       required to store the tree.  The optimal value depends on the
#       nature of the problem.

# p : int, default=2
#       Power parameter for the Minkowski metric. When p = 1, this is
#       equivalent to using manhattan_distance (l1), and euclidean_distance
#       (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
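
The p parameter described above selects the order of the Minkowski metric. As a small illustration of what p = 1 and p = 2 mean, here is a plain NumPy sketch. The helper name `minkowski_distance` is made up for this example and is not how scikit-learn computes distances internally.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two 1-D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = [0, 0]
b = [3, 4]

d_manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
d_euclidean = minkowski_distance(a, b, p=2)  # sqrt(3**2 + 4**2)
print(d_manhattan, d_euclidean)
```

With p = 1 the per-coordinate differences are simply added up, while p = 2 gives the usual straight-line distance, so the two settings can rank the same neighbors differently.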

  
    
    

In [9]:
# Define the hyperparameter values that GridSearchCV passes to the model during cross-validation.
# Check the names below against the KNeighborsClassifier hyperparameter list above!
param_grid = [
#     {'n_neighbors': [3, 5, 7 , 9], 'weights': ['uniform', 'distance'], 'p': [1, 2]},             
    {'n_neighbors': [5, 7], 'weights': ['uniform', 'distance'], 'p': [1, 2]},            
  ]

knn = KNeighborsClassifier()

grid_search = GridSearchCV(knn, param_grid,
                           cv=2,
                           scoring='accuracy',
                           return_train_score=True, verbose=2)
grid_search.fit(X_train, y_train)

Fitting 2 folds for each of 8 candidates, totalling 16 fits
[CV] n_neighbors=5, p=1, weights=uniform .............................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] .............. n_neighbors=5, p=1, weights=uniform, total=10.9min
[CV] n_neighbors=5, p=1, weights=uniform .............................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 21.8min remaining:    0.0s


[CV] .............. n_neighbors=5, p=1, weights=uniform, total=10.9min
[CV] n_neighbors=5, p=1, weights=distance ............................
[CV] ............. n_neighbors=5, p=1, weights=distance, total=80.2min
[CV] n_neighbors=5, p=1, weights=distance ............................
[CV] ............. n_neighbors=5, p=1, weights=distance, total=11.0min
[CV] n_neighbors=5, p=2, weights=uniform .............................
[CV] .............. n_neighbors=5, p=2, weights=uniform, total=16.0min
[CV] n_neighbors=5, p=2, weights=uniform .............................
[CV] .............. n_neighbors=5, p=2, weights=uniform, total=42.6min
[CV] n_neighbors=5, p=2, weights=distance ............................
[CV] ............. n_neighbors=5, p=2, weights=distance, total=12.6min
[CV] n_neighbors=5, p=2, weights=distance ............................
[CV] ............. n_neighbors=5, p=2, weights=distance, total=63.6min
[CV] n_neighbors=7, p=1, weights=uniform .............................
[CV] .

[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed: 715.4min finished


GridSearchCV(cv=2, estimator=KNeighborsClassifier(),
             param_grid=[{'n_neighbors': [5, 7], 'p': [1, 2],
                          'weights': ['uniform', 'distance']}],
             return_train_score=True, scoring='accuracy', verbose=2)
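
The notebook stops once the search has been fitted, so the best hyperparameters and the test-set accuracy that Q1 asks for are never shown. The sketch below shows one way to read them off a fitted GridSearchCV. To keep it runnable in seconds it uses the small 8x8 digits dataset bundled with scikit-learn as a stand-in for MNIST (an assumption made only for this example), and the names `X_small`, `search`, and `best_knn` are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the 8x8 digits set, NOT the MNIST data used above.
X_small, y_small = load_digits(return_X_y=True)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_small, y_small, test_size=0.2, random_state=42)

small_grid = [{'n_neighbors': [3, 5], 'weights': ['uniform', 'distance']}]
search = GridSearchCV(KNeighborsClassifier(), small_grid,
                      cv=3, scoring='accuracy')
search.fit(Xs_train, ys_train)

# Best combination found and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# With the default refit=True, best_estimator_ is already retrained
# on the whole training set using the best parameters.
best_knn = search.best_estimator_
test_accuracy = accuracy_score(ys_test, best_knn.predict(Xs_test))
print(test_accuracy)
```

In this notebook the same attributes should be available on `grid_search`, and the test accuracy would come from `grid_search.best_estimator_.predict(X_test)` compared against `y_test`. Since the search above ran with a single worker, passing `n_jobs=-1` to GridSearchCV may shorten the run considerably on a multi-core machine.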

# Q2
Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel. 
Then, for each image in the training set, create four shifted copies (one per direction) 
and add them to the training set. 
Finally, train your best model on this expanded training set and measure its accuracy on the test set. 
You should observe that your model performs even better now! 
This technique of artificially growing the training set is called data augmentation or training set expansion.
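
A minimal sketch of the shifting and augmentation steps, using only NumPy. It assumes each image is a flattened 28x28 array, as in the MNIST data loaded earlier, and fills the vacated pixels with 0. The function names `shift_image` and `augment_with_shifts` are made up for this example.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a flattened 28x28 image by dx columns and dy rows.

    Positive dx moves right, positive dy moves down.
    Pixels shifted past the border are dropped; vacated pixels become 0.
    """
    img = np.asarray(image).reshape(28, 28)
    h, w = img.shape
    shifted = np.zeros_like(img)
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = img[src_rows, src_cols]
    return shifted.reshape(-1)

def augment_with_shifts(X, y):
    """Return X and y extended with four one-pixel-shifted copies of every image."""
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, down, up
    X_parts, y_parts = [X], [y]
    for dx, dy in shifts:
        X_parts.append(np.array([shift_image(img, dx, dy) for img in X]))
        y_parts.append(y)
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Tiny synthetic demo: two blank images with one bright pixel each
demo = np.zeros((2, 784))
demo[0, 14 * 28 + 14] = 255   # pixel at row 14, column 14
demo[1, 0] = 255              # pixel in the top-left corner
demo_labels = np.array(['3', '7'])

X_aug, y_aug = augment_with_shifts(demo, demo_labels)
print(X_aug.shape, y_aug.shape)
```

Applied to the training split from Q1, `augment_with_shifts(X_train, y_train)` would return five times as many samples. A KNeighborsClassifier built with the best parameters from the grid search can then be fitted on the expanded set and scored on the untouched `X_test`. Only the training set is augmented, so the test accuracy stays comparable with the Q1 result.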