# Deep Learning Project 1 (due Friday 1/28)

## Multi-class classification via unsupervised methods

In this project, we are going to experiment with unsupervised machine learning methods for solving a multi-class classification problem. Unsupervised methods are computational methods that classify or cluster data by discovering hidden patterns or structure within the data itself, without the help of *labeled* training data. Unsupervised methods are widely used in data science applications in a variety of fields (medicine, finance, e-commerce, etc.) where labeled training data might be hard to obtain.

Here, we are going to apply a well-known unsupervised technique called hierarchical clustering to the MNIST multi-class classification problem. All the code and experimentation will be done in this Jupyter notebook. We will rely on the implementation of hierarchical clustering provided in the Scikit-learn Python library. You will also pick a second unsupervised technique of your choice and apply it to the MNIST dataset. This exercise will help us gain practical experience in applying unsupervised learning.


### Learning outcomes
After completing project 1, you will be able to:
* Describe the main ideas behind the method of hierarchical clustering 
* Apply hierarchical clustering to image datasets and tune hierarchical clustering models 
* Analyze results provided by hierarchical clustering
* Describe the main ideas behind a second unsupervised method of your choice, apply it to image datasets, tune its performance, and analyze results obtained

### Multi-class classification: the MNIST dataset
The MNIST dataset consists of 70,000 gray-scale images (samples) of hand-written digits 0 through 9. The multi-class classification problem consists of classifying each sample accurately as belonging to one of ten classes. For the purpose of this assignment, we will use the labels provided for each sample in the dataset only to determine the classification accuracy of our results, never as input to the clustering itself.
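
One possible way to get the data into the notebook is sketched below. It assumes the OpenML copy of MNIST (named `mnist_784`, with each 28x28 image flattened to 784 pixel values) fetched through Scikit-learn's `fetch_openml`; the helper names `load_mnist` and `scale_pixels` are made up for this illustration, and other loaders would work just as well.

```python
import numpy as np
from sklearn.datasets import fetch_openml


def load_mnist():
    """Download MNIST from OpenML (needs network access on the first call)."""
    # as_frame=False asks for plain NumPy arrays instead of a DataFrame.
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    return X, y.astype(int)  # labels arrive as strings such as "5"


def scale_pixels(X):
    """Map raw pixel intensities from [0, 255] to floats in [0, 1]."""
    return np.asarray(X, dtype=np.float64) / 255.0
```

Calling `X, y = load_mnist()` followed by `X = scale_pixels(X)` should give an array with one row per image. Keep `y` aside and use it only when scoring the clusters.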

### Part 1: Hierarchical clustering
Hierarchical clustering is a well-known technique used heavily in data science applications. A cluster is a subset of the data whose members are deemed to be similar (in some sense). Clustering methods divide a dataset into clusters in such a way that the members of the same cluster are more similar to each other than to members of other clusters. Hierarchical clustering can be either *top-down* or *bottom-up*. The top-down (or divisive) approach starts with all the data in a single cluster and then successively divides clusters into smaller ones as it goes down the hierarchy. In contrast, the bottom-up (or agglomerative) approach starts with each sample in its own cluster and then merges pairs of clusters as it goes up the hierarchy.
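
To make the bottom-up idea concrete, here is a deliberately naive sketch written with NumPy only. It assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members); the function name `naive_agglomerative` is hypothetical, and this is not how Scikit-learn implements the method.

```python
import numpy as np


def naive_agglomerative(X, n_clusters):
    """Bottom-up clustering with single linkage; for illustration only."""
    # Start with every sample in its own cluster.
    clusters = [[i] for i in range(len(X))]
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        # Find the two clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a: one step up the hierarchy.
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = np.empty(len(X), dtype=int)
    for cluster_id, members in enumerate(clusters):
        labels[members] = cluster_id
    return labels


points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = naive_agglomerative(points, n_clusters=2)
print(labels)  # [0 0 0 1 1 1]
```

The nested search over cluster pairs makes this far too slow for real data, but it shows the two ingredients every agglomerative method needs: a distance between samples and a rule (the linkage) for turning it into a distance between clusters.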


For project 1, you will use the agglomerative hierarchical clustering method provided in the Scikit-learn Python library to cluster the MNIST image dataset. The agglomerative method is fairly easy to understand. This [site](https://www.saedsayad.com/clustering_hierarchical.htm) provides a simple explanation in the form of pseudo-code. Descriptions of the parameters for the **sklearn.cluster.AgglomerativeClustering** Scikit-learn implementation are found [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html?highlight=agglomerative#sklearn.cluster.AgglomerativeClustering).
In particular, identifying a good distance metric as well as a good linkage criterion is key for hierarchical clustering. [Wikipedia](https://en.wikipedia.org/wiki/Hierarchical_clustering) gives a list of commonly used distance metrics and linkage criteria for hierarchical clustering.
Scikit-learn also provides a good [user's guide](https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering) for hierarchical clustering as well as [sample code](https://scikit-learn.org/stable/auto_examples/cluster/plot_digits_linkage.html#sphx-glr-auto-examples-cluster-plot-digits-linkage-py).
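
As a first experiment, the sketch below fits `AgglomerativeClustering` with each of its four linkage options. It assumes Scikit-learn's small bundled `load_digits` set (8x8 images) as a quick stand-in for MNIST, and the subset size of 500 is an arbitrary choice made only to keep the run short.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits

# Small 8x8 digit images bundled with Scikit-learn; labels are ignored here.
X, _ = load_digits(return_X_y=True)
X = X[:500]  # small subset so that every fit is quick

results = {}
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=10, linkage=linkage)
    results[linkage] = model.fit_predict(X)  # one cluster index per sample

for linkage, cluster_ids in results.items():
    sizes = sorted(np.bincount(cluster_ids).tolist(), reverse=True)
    print(linkage, "cluster sizes:", sizes)
```

Comparing the cluster sizes is a cheap sanity check: a linkage that puts nearly all samples into one cluster is unlikely to separate ten digit classes well.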

#### Process
1. Read some of the existing literature online about hierarchical clustering. There are many tutorials on the web. A list of some such resources can be found below under **Other Resources**.

2. Explore how sklearn.cluster.AgglomerativeClustering works by running sample code. Again, I provide a list of resources below.

3. Apply sklearn.cluster.AgglomerativeClustering to the MNIST dataset. Here are a few of the things that you will need to take into consideration:
    * How much data can the algorithm handle while still running relatively quickly on your laptop or in Google Colab (from a few minutes to a couple of hours at most)?
    * If not all the data can be processed at once, how should a smaller batch be selected?
    * What parameters should be selected for the agglomerative clustering algorithm? 
    * Should you try other parameters and compare results?
    
  Since we have labels for all the data samples in the MNIST dataset, we can actually determine the accuracy of classification via hierarchical clustering. There are built-in functions in Scikit-learn that allow you to calculate accuracy (see Other Resources below).
    
4. Analyze the results you obtained and explain these results, based on the parameters used and the data for this problem. **Analyzing and explaining results is a crucial step in any data science and machine learning application**. It's not enough to just find a result.
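
The questions in step 3 about data size can be explored empirically. The sketch below is one possible approach: it draws a uniform random subset (without looking at labels) and times a fit at a few sizes. The helper `random_subset`, the synthetic stand-in data, and the sizes tried are all assumptions made for illustration; swap in the real MNIST array and larger sizes as your hardware allows.

```python
import time

import numpy as np
from sklearn.cluster import AgglomerativeClustering


def random_subset(X, n_samples, seed=0):
    """Draw n_samples rows uniformly at random, without using labels."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=n_samples, replace=False)
    return X[idx], idx  # keep idx to look up the true labels when scoring


# Synthetic stand-in with MNIST's 784 features; replace with the real images.
X_all = np.random.default_rng(0).random((2000, 784))

timings = {}
for n in (250, 500, 1000):
    X_sub, _ = random_subset(X_all, n)
    start = time.perf_counter()
    AgglomerativeClustering(n_clusters=10, linkage="ward").fit(X_sub)
    timings[n] = time.perf_counter() - start
    print(f"{n:>5} samples: {timings[n]:.2f} s")
```

Expect the cost to grow faster than linearly with the number of samples, so doubling the subset size a few times should give a reasonable estimate of how far you can go before a run takes hours.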


### Part 2: Exploring an unsupervised method of your choice
For part 2, you will apply the same process as in part 1 but for a different unsupervised method. There are several other options implemented in the Scikit-learn library. Feel free to choose another library; any method is permitted as long as it is unsupervised (i.e., it does not use the labels during the classification process).
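
As one example of what a second method might look like, here is a minimal K-means sketch. K-means is only one of many valid choices, the bundled `load_digits` set is again assumed as a small stand-in for MNIST, and the parameter values shown are starting points rather than recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
X = X / 16.0  # load_digits pixel values range from 0 to 16

# n_init=10 reruns the algorithm from 10 random starts and keeps the best one.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)  # labels are never passed to the model

print(kmeans.cluster_centers_.shape)  # one 64-pixel centroid per cluster
print(round(kmeans.inertia_, 1))      # within-cluster sum of squared distances
```

Reshaping each row of `cluster_centers_` to 8x8 and plotting it is a quick way to see which digit, if any, each cluster has latched onto.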

### Other Resources
Here is a list of various online resources that may be helpful to you for this project.
* This is a pretty good YouTube tutorial on [Hierarchical Clustering - Dendrograms Using Scipy and Scikit-learn in Python](https://youtu.be/JcfIeaGzF8A)
* Another YouTube tutorial on [K-Means Clustering - Methods using Scikit-learn in Python](https://youtu.be/ikt0sny_ImY)
* The mtcars.csv dataset used in several of these tutorials is found [here](https://gist.github.com/seankross/a412dfbd88b3db70b74b)
* A [towardsdatascience.com tutorial](https://towardsdatascience.com/kmeans-hyper-parameters-explained-with-examples-c93505820cd3) on selecting hyper-parameters for K-means. 
* How to create a [confusion matrix with Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
* Five important clustering algorithms for data science in [towardsdatascience.com](https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68)

### What to turn in for Project 1
Your submission for Project 1 should contain the following parts:

1. The code used in your project (including the output produced by each cell).
2. A description of your work written in English, including results produced by the code.

You can turn in either: two separate files (i.e. a .ipynb file and a Word/PDF/GoogleDoc file), or just one .ipynb file with the written part in a Markdown cell.

The written description of your work in English should be structured as follows (no more than 4 pages in total):

1. Background (1 page max)
    * Description *in your own words* of how hierarchical clustering works
    * Description *in your own words* of your selected second unsupervised method
        
       Here, I'm looking for evidence of understanding the key ideas behind clustering and how the algorithms do what they do, how they may be similar and how they may be different, etc.
       
2. Methodology (1 page max)
    * Specify which implementations were used (so that your work can be reproduced) and any related information you might find interesting
    * Describe the process you used to systematically evaluate the clustering methods on the MNIST dataset, i.e., how data was selected as input for the clustering methods and why, how the performance of each method was tuned relative to the method's parameters, etc.
        
        Here, I'm looking for evidence of logical thinking in the process of applying unsupervised learning techniques.
        
3. Results (2 pages max)
    * Show the results of applying your methodology to both techniques
    * Show accuracy results for both techniques using at least the accuracy_score() and confusion_matrix() functions of sklearn
        
        Here, I'm looking for results to be presented systematically and briefly, focused on the accuracy of clustering on the MNIST dataset.
        
4. Analysis and discussion
    * Describe/summarize the accuracy results you obtained 
    * Highlight and discuss any interesting trends you found in your results
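
Because cluster indices are arbitrary, they have to be mapped to digit labels before `accuracy_score()` and `confusion_matrix()` mean anything. The sketch below uses one common convention, a majority vote within each cluster; the helper `clusters_to_labels` is hypothetical, and other mappings (for example, an optimal one-to-one assignment) are equally defensible as long as you state which one you used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix


def clusters_to_labels(cluster_ids, y_true):
    """Relabel each cluster with the most common true digit among its members."""
    y_pred = np.empty_like(y_true)
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        y_pred[members] = np.bincount(y_true[members]).argmax()
    return y_pred


# Toy example: three clusters whose indices do not match the true labels.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_ids = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

y_pred = clusters_to_labels(cluster_ids, y_true)
acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
print(acc)  # 8 of 9 samples correct
print(cm)
```

Note that the labels enter only at this scoring stage, after clustering is finished, which keeps the method itself unsupervised.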
    

### Rubric
The following comments are meant to help you understand how your work will be graded in this assignment. 

1. **Fail**
    * Assignment was not turned in.
    * Assignment was turned in but it is minimal. There are virtually no results and no analysis/discussion of results.
    
2. **Below average range**
    * Assignment was turned in, both the .ipynb file and the written description.
    * The .ipynb file includes code for hierarchical clustering only and runs with some minor errors.
    * Results, including accuracy calculations, are obtained for only one choice of algorithm parameters.
    * There is no evidence showing why the selected parameters were chosen.
    * There is no analysis/discussion of results obtained.
    * There is no code, results, or analysis for a second unsupervised method.
    
3. **Average range**
    * Assignment was turned in, both the .ipynb file and the written description.
    * The .ipynb file includes code for two unsupervised methods and runs without errors.
    * Results, including accuracy calculations, are presented for both unsupervised methods.
    * There is evidence showing that various parameter combinations were tested.
    * There is some analysis/discussion of results obtained.
    
4. **Very good range**
    * Assignment was turned in, both the .ipynb file and the written description.
    * The .ipynb file includes code for two unsupervised methods and runs without errors.
    * Results, including accuracy calculations, are presented for both unsupervised methods.
    * There is a systematic process for testing and evaluating results of different parameter combinations.
    * There is good analysis/discussion of the results obtained that demonstrates understanding of how hierarchical clustering and the second method work.
    * All the samples in the dataset were included in some way.
    
5. **Excellent range** 
    * Assignment was turned in, both the .ipynb file and the written description.
    * The .ipynb file includes code for two unsupervised methods and runs without errors.
    * Results, including accuracy calculations, are presented for both unsupervised methods.
    * There is a systematic process for testing and evaluating results of different parameter combinations.
    * There is good analysis/discussion of the results obtained that demonstrates understanding of how hierarchical clustering and the second method work.
    * All the samples in the dataset were included in some way.  
    * The analysis includes discussion comparing and contrasting hierarchical clustering with the second unsupervised approach.
    * Different metrics were used and discussed for evaluating clustering performance.
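
For the last rubric item, Scikit-learn offers several clustering metrics that compare cluster assignments with true labels directly, with no cluster-to-digit mapping required. The sketch below shows three of them on toy data; which metrics you report is your choice, and these are only examples.

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             normalized_mutual_info_score)

# Toy data: cluster indices are arbitrary and need not match the labels.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_ids = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

scores = {
    "adjusted Rand index": adjusted_rand_score(y_true, cluster_ids),
    "normalized mutual information": normalized_mutual_info_score(y_true, cluster_ids),
    "homogeneity": homogeneity_score(y_true, cluster_ids),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

All three reach 1.0 when the clusters match the classes exactly, regardless of how the clusters are numbered.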