For this assignment, we will use another digit data set, since it consists only of numerical features. Your job is to apply the dimension reduction techniques we learned and combine them with the classification methods you learned in DS861 to build a classifier.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [2]:
# Import the data
from sklearn.datasets import load_digits
X_digits, y_digits = load_digits(return_X_y = True)

In [3]:
# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state = 862)

Here you will build your classifier with an appropriate dimension reduction technique and classification model. Recall that we learned the following classifiers in DS861: Logistic Regression, Decision Tree, Random Forest, Boosting, and KNN. This is a multi-class data set (ten digit classes), so you should think about which models are appropriate and which are not. To save time, you may pick the two estimators that you think are the most appropriate and compare them.

Some general rules you should follow:

1. Tune your dimension reduction technique
2. Tune your model
3. Select your hyperparameters based on a hold-out set (either via CV or train/validate/test split)
4. Report the accuracy on the test set

You may ignore randomized PCA and incremental PCA, since they add no value here. We will not use MDS or t-SNE either, since they cannot be used to transform new observations. If you want to write a wrapper that contains multiple classifiers, there are good examples [here](https://stackoverflow.com/questions/50285973/pipeline-multiple-classifiers) and [here](https://stackoverflow.com/questions/38555650/try-multiple-estimator-in-one-grid-search).
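
As a rough illustration of the wrapper idea, the sketch below swaps the classifier step itself inside a single `GridSearchCV` by passing a list of parameter grids. It is a minimal example under stated assumptions, not the approach used later in this notebook: plain `PCA`, the step names `reduce` and `clf`, the small grids, and `cv=3` are all chosen here only to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=862)

# The "clf" step is a placeholder; each grid below replaces it.
pipe = Pipeline([
    ("reduce", PCA(random_state=862)),  # assumption: plain PCA, for speed
    ("clf", KNeighborsClassifier()),
])

# One dict per classifier; the "clf" key swaps the estimator itself,
# and the "clf__..." keys tune that estimator's own hyperparameters.
param_grid = [
    {
        "reduce__n_components": [10, 20],
        "clf": [KNeighborsClassifier()],
        "clf__n_neighbors": [5, 10],
    },
    {
        "reduce__n_components": [10, 20],
        "clf": [RandomForestClassifier(random_state=862)],
        "clf__n_estimators": [50],
    },
]

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_            # used to select hyperparameters
test_accuracy = search.score(X_te, y_te)    # reported once, after selection
best_name = type(search.best_estimator_.named_steps["clf"]).__name__

print(best_name, search.best_params_["reduce__n_components"])
print(f"CV accuracy: {cv_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

Keeping the cross-validated score and the test score separate follows rules 3 and 4 above: the search only ever sees the training folds, and the test set is scored once at the end.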

#### Setting up the classification models 

In [4]:
#Import statements:

#GridSearchCV and Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

#Classification models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

#Accuracy score
from sklearn.metrics import accuracy_score

In [5]:
#Creating a list of classification models:
classification_models = []
classification_models.append(KNeighborsClassifier())
classification_models.append(RandomForestClassifier())

In [6]:
#Creating a list of names of those classification models:
classification_model_names = [
         "K Nearest Neighbours",
         "Random Forest"
          ]

#### Kernel PCA

In [7]:
#Kernel PCA
from sklearn.decomposition import KernelPCA

In [8]:
#Assigning parameters for GridSearchCV
parameters = [
    {
        #KNN
        "kpca__n_components": range(2, 20),
        "kpca__kernel": ["linear","rbf", "sigmoid","poly"],
        "clf__n_neighbors": range(5,16)
    },
    {
        #RandomForest
        "kpca__n_components": range(2, 20),
        "kpca__kernel": ["linear","rbf", "sigmoid","poly"],
        "clf__n_estimators": [50,100,150]
    }
]

In [9]:
#Finding accuracies for classification algorithms with Kernel PCA Dimensionality Reduction
print("CLASSIFICATION ALGORITHMS PERFORMANCE WITH KERNEL PCA DIMENSIONALITY REDUCTION: ")
print("------------------------------------------------------------------------------------")  
print ("{:<22} |{:<18}|\t{:<20}".format('MODEL NAME','ACCURACY SCORE','PARAMETERS'))
print("------------------------------------------------------------------------------------")
for name, classifier, params in zip(classification_model_names, classification_models, parameters):
    
    clf_pipe = Pipeline([
        ("kpca", KernelPCA(random_state = 862)),
        ('clf', classifier)
    ])
    
    gs_clf = GridSearchCV(clf_pipe, param_grid=params, n_jobs=-1)
    clf = gs_clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    best_params = gs_clf.best_params_
    
    #Getting the best accuracy score of the classification model
    print ("{:<22} |{:<18}|{}".format(name, score, best_params))    

CLASSIFICATION ALGORITHMS PERFORMANCE WITH KERNEL PCA DIMENSIONALITY REDUCTION: 
------------------------------------------------------------------------------------
MODEL NAME             |ACCURACY SCORE    |	PARAMETERS          
------------------------------------------------------------------------------------
K Nearest Neighbours   |0.98              |{'clf__n_neighbors': 5, 'kpca__kernel': 'linear', 'kpca__n_components': 19}
Random Forest          |0.9666666666666667|{'clf__n_estimators': 50, 'kpca__kernel': 'linear', 'kpca__n_components': 19}
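
Both searches above selected the linear kernel. With a linear kernel, Kernel PCA yields the same embedding as ordinary PCA, up to the sign of each component, so the best "Kernel PCA" models here are effectively standard PCA models. The sketch below is a small check of that equivalence; the 500-sample subset, the choice of 5 components, and the explicit solver arguments are assumptions made only to keep the comparison quick and deterministic.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA

X, _ = load_digits(return_X_y=True)
X = X[:500]  # a subset keeps the 500 x 500 kernel matrix small

n_components = 5
Z_pca = PCA(n_components=n_components, svd_solver="full").fit_transform(X)
Z_kpca = KernelPCA(
    n_components=n_components, kernel="linear", eigen_solver="dense"
).fit_transform(X)

# Each component is only defined up to sign, so compare absolute values.
max_gap = np.max(np.abs(np.abs(Z_pca) - np.abs(Z_kpca)))
print(f"largest absolute difference: {max_gap:.2e}")
```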


#### LLE

In [10]:
#Import statements:

#LLE
from sklearn.manifold import LocallyLinearEmbedding

In [11]:
#Assigning parameters for GridSearchCV
parameters = [
    {
        #KNN
        "lle__n_components": range(2,20),
        "lle__n_neighbors": [5,10],
        "clf__n_neighbors": range(5,16)
    },
    {
        #Random Forest
        "lle__n_components": range(2,20),
        "lle__n_neighbors": [5,10],
        "clf__n_estimators": [50,100,150]
    }
]

In [12]:
#Finding accuracies for classification algorithms with LLE Dimensionality Reduction
print("CLASSIFICATION ALGORITHMS PERFORMANCE WITH LLE DIMENSIONALITY REDUCTION: ")
print("------------------------------------------------------------------------------------")  
print ("{:<25} |{:<20}|\t{:<20}".format('MODEL NAME','ACCURACY SCORE','PARAMETERS'))
print("------------------------------------------------------------------------------------")  
for name, classifier, params in zip(classification_model_names, classification_models, parameters):
    
    clf_pipe = Pipeline([
         ("lle", LocallyLinearEmbedding(random_state = 862)),
         ('clf', classifier)
    ])
    
    gs_clf = GridSearchCV(clf_pipe, param_grid=params, n_jobs=-1)
    clf = gs_clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    best_params = gs_clf.best_params_
    
    #Getting the best accuracy score of the classification model
    print("{:<25} |{:<20}|{}".format(name, score, best_params))

CLASSIFICATION ALGORITHMS PERFORMANCE WITH LLE DIMENSIONALITY REDUCTION: 
------------------------------------------------------------------------------------
MODEL NAME                |ACCURACY SCORE      |	PARAMETERS          
------------------------------------------------------------------------------------
K Nearest Neighbours      |0.9666666666666667  |{'clf__n_neighbors': 15, 'lle__n_components': 15, 'lle__n_neighbors': 5}
Random Forest             |0.9777777777777777  |{'clf__n_estimators': 150, 'lle__n_components': 19, 'lle__n_neighbors': 5}


#### Isomap

In [13]:
#Import statements:

#Isomap
from sklearn.manifold import Isomap

In [14]:
#Assigning parameters for GridSearchCV
parameters = [       
    {
        #KNN
        "iso__n_components": range(2,20),
        "iso__n_neighbors":[5,10],
        "clf__n_neighbors": range(5,16)
    },
    {
        #Random Forest
        "iso__n_components": range(2,20),
        "iso__n_neighbors":[5,10],
        "clf__n_estimators": [50,100,150]
    }
]

In [16]:
#Finding accuracies for classification algorithms with Isomap Dimensionality Reduction
print("CLASSIFICATION ALGORITHMS PERFORMANCE WITH ISOMAP DIMENSIONALITY REDUCTION: ")
print("------------------------------------------------------------------------------------")  
print ("{:<22} |{:<18} |\t{:<20}".format('MODEL NAME','ACCURACY SCORE','PARAMETERS'))
print("------------------------------------------------------------------------------------")
for name, classifier, params in zip(classification_model_names, classification_models, parameters):
    
    clf_pipe = Pipeline([
        ("iso", Isomap()),
        ('clf', classifier)
    ])
    
    gs_clf = GridSearchCV(clf_pipe, param_grid=params, n_jobs=-1)
    clf = gs_clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    best_params = gs_clf.best_params_
    #Getting the best accuracy score of the classification model
    print("{:<22} |{:<18} |{}".format(name, score, best_params))

CLASSIFICATION ALGORITHMS PERFORMANCE WITH ISOMAP DIMENSIONALITY REDUCTION: 
------------------------------------------------------------------------------------
MODEL NAME             |ACCURACY SCORE     |	PARAMETERS          
------------------------------------------------------------------------------------
K Nearest Neighbours   |0.9755555555555555 |{'clf__n_neighbors': 5, 'iso__n_components': 15, 'iso__n_neighbors': 5}
Random Forest          |0.9688888888888889 |{'clf__n_estimators': 100, 'iso__n_components': 18, 'iso__n_neighbors': 5}


#### What is the best combination according to your accuracy score on the test set?

#### Answer: Kernel PCA with KNN. Accuracy = 0.98


Now, using the original data set and the two classifiers you chose, run the procedure again, but this time without any dimension reduction. Make sure you tune your classifiers. Which result is better: using the original data set or the reduced data set?

In [17]:
#Assigning parameters for GridSearchCV
parameters = [
    {
        #KNN
        "clf__n_neighbors": range(5,16)
    },
    {
        #Random Forest
       'clf__n_estimators': [50,100,150,200,250,300]
    }
]

In [18]:
#Finding accuracy of various classification algorithms without Dimensionality Reduction 
print("CLASSIFICATION ALGORITHMS PERFORMANCE WITHOUT DIMENSIONALITY REDUCTION: ")
print("------------------------------------------------------------------------------------")  
print ("{:<25} |\t{:<20}\t|\t{:<10}".format('MODEL NAME','ACCURACY SCORE','PARAMETERS'))
print("------------------------------------------------------------------------------------")  
for name, classifier, params in zip(classification_model_names, classification_models, parameters):
    
    clf_pipe = Pipeline([
        ('clf', classifier)
    ])
    
    gs_clf = GridSearchCV(clf_pipe, param_grid=params, n_jobs=-1)
    clf = gs_clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    best_params = gs_clf.best_params_
    #Getting the best accuracy score of the classification model
    print("{:<25} |\t{:<20}\t|\t{}".format(name, score, best_params))


CLASSIFICATION ALGORITHMS PERFORMANCE WITHOUT DIMENSIONALITY REDUCTION: 
------------------------------------------------------------------------------------
MODEL NAME                |	ACCURACY SCORE      	|	PARAMETERS
------------------------------------------------------------------------------------
K Nearest Neighbours      |	0.98                	|	{'clf__n_neighbors': 5}
Random Forest             |	0.9711111111111111  	|	{'clf__n_estimators': 250}


### Which result is better? Using the original data set or the reduced data set?
#### Using the original data set, we get an accuracy of 0.98 with KNN.
#### Using dimensionality reduction, we get an accuracy of 0.98 with KNN on linear Kernel PCA and 0.978 with Random Forest on LLE.
#### Even though there is not much difference in accuracy, I will choose LLE dimensionality reduction with Random Forest classification, which gives an accuracy of 0.978.
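
To compare all eight runs side by side, the test accuracies printed above can be collected into one table. The sketch below copies the reported numbers by hand rather than recomputing them. The `n_test = 450` figure is an assumption based on the default 25% test split of the 1,797 digit images; under that assumption, 0.98 versus 0.978 is a difference of a single test image.

```python
import pandas as pd

# Test accuracies copied from the outputs printed earlier in this notebook.
results = pd.DataFrame(
    [
        ("Kernel PCA", "K Nearest Neighbours", 0.98),
        ("Kernel PCA", "Random Forest", 0.9666666666666667),
        ("LLE", "K Nearest Neighbours", 0.9666666666666667),
        ("LLE", "Random Forest", 0.9777777777777777),
        ("Isomap", "K Nearest Neighbours", 0.9755555555555555),
        ("Isomap", "Random Forest", 0.9688888888888889),
        ("None", "K Nearest Neighbours", 0.98),
        ("None", "Random Forest", 0.9711111111111111),
    ],
    columns=["reduction", "model", "test_accuracy"],
)

n_test = 450  # assumption: default test_size=0.25 applied to 1,797 samples
results["n_correct"] = (results["test_accuracy"] * n_test).round().astype(int)

# A stable sort keeps tied rows in their original order.
ranked = results.sort_values(
    "test_accuracy", ascending=False, kind="stable"
).reset_index(drop=True)
print(ranked.round(4).to_string(index=False))
```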