*This Jupyter notebook is part of Arizona State University's course CAS 523 (Methods for Complex Systems Science: Statistics and Dimensionality Reduction) and was written by Bryan Daniels.  It was last updated September 7, 2022.*

*This assignment uses data, available [here](https://archive.ics.uci.edu/ml/datasets/wine), from the UCI Machine Learning Repository.  Data citation: Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.*

# Comparing techniques for finding structure in high-dimensional data

In this exercise, we will perform dimensionality reduction on data with a relatively simple, known structure.  The hope is to gain some intuition for how these techniques work so that you can use them in more complicated examples in the future.

## Load relevant packages

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.rcParams.update({'font.size': 18}) # increases font size on plots
from pathlib import Path # to handle file paths across all operating systems

In [None]:
from sklearn import cluster
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

## Load data

We will work here with data taken from a chemical analysis of 178 Italian wines.  These wines were derived from 3 different types (cultivars) of grapes.  Each wine was measured using 13 different quantitative tests.  

Here we load the data from text files.  (Note that we split off the "Class identifier" column from the rest of the data, pretending that we don't know beforehand which wine belongs to which class.)

In [None]:
wineAttributes = pd.read_csv(Path('data/Wines/wine-attributes.csv'))
wineData = pd.read_csv(Path('data/Wines/wine.data'),names=wineAttributes.columns)
# split off the class identifier column and transform into a useful color syntax
wineClasses = wineData.pop('Class identifier').apply(lambda c: 'C{}'.format(c))
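
The "useful color syntax" mentioned in the comment above is matplotlib's "CN" notation, in which the string `'C1'` refers to entry number 1 of the current color cycle, `'C2'` to entry number 2, and so on.  As a minimal sketch, using made-up class identifiers (`toy_classes` is a hypothetical stand-in, not the wine data):

```python
import pandas as pd
from matplotlib import colors

# Hypothetical class identifiers standing in for the "Class identifier" column
toy_classes = pd.Series([1, 2, 3, 1])

# Same transformation as above: 1 -> 'C1', 2 -> 'C2', 3 -> 'C3'
toy_colors = toy_classes.apply(lambda c: 'C{}'.format(c))
print(toy_colors.tolist())

# Each string is a valid matplotlib color specification
print(all(colors.is_color_like(c) for c in toy_colors))

# The actual colors depend on the active color cycle (rcParams)
print([colors.to_hex(c) for c in toy_colors.unique()])
```

Because every entry is already a valid color, a series like this can be handed directly to a plotting function that accepts one color per point.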

Let's take a look at what is there:

In [None]:
wineData

Our goal will be to find structure in the data that could allow us to reduce the number of attributes we need to keep track of for describing each wine.  We'll try two tactics:
1) **Clustering** will map individual wines into a few classes.  Then a good description of any wine is just to say which class it is in, ignoring the other details.  
2) **Manifold learning** will map individual wines into a 2-dimensional space.  Then a good description of any wine consists of its two coordinates in this space.

## Preliminary analysis and processing

Before we get to the fancier tools for dimensionality reduction, we will do some preliminary analysis to look for any obvious patterns.  *It is always a good idea to start simple in any data analysis project!*

An easy way to get some basic statistics from the data is to use the `pandas` function `describe`:

In [None]:
wineData.describe()

I also like to look at things visually to begin.  It's possible that a single one of these attributes already has values that group into a few clusters.  We wouldn't see this from the above statistics, but we would if we plotted some histograms.  The following code plots a separate histogram for each attribute:

In [None]:
for attribute in wineData.columns:
    plt.figure()
    plt.hist(wineData[attribute],bins=15)
    plt.xlabel(attribute)
    plt.ylabel('Number of wines')

❓ **Do any patterns or groups stand out to you from this basic analysis?** *Hint: There's no wrong answer here.  The goal is to get in the habit of looking carefully at your data before doing any complicated analysis.*

✳️ **Answer:** 

Next, we will work with our data a bit to get it in a more useful form.

Notice that the 13 attributes vary in their typical size: typical values of "Nonflavanoid phenols" are less than 1, but "Proline" values are typically larger than 500 and vary by 100s.  If we naively use these data in clustering algorithms that rely on distance measures, then differences in Proline will be overemphasized and differences in Nonflavanoid phenols will be ignored.  To put the different attributes on a similar scale, we will take the common step of normalizing (or "standardizing") the data: subtract off the mean of each attribute and divide by its standard deviation:

In [None]:
wineDataNormed = (wineData - wineData.mean())/wineData.std()

Now each attribute has a similar scale:

In [None]:
wineDataNormed.describe()
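
To see why this step matters for distance-based methods, here is a small sketch on made-up numbers (the `toy` table and the `distance` helper are hypothetical, not part of the wine data).  Before scaling, the distance between two rows is almost entirely determined by the large-valued attribute; after scaling, both attributes contribute.

```python
import numpy as np
import pandas as pd

# Made-up measurements on very different scales (not the real wine values)
toy = pd.DataFrame({
    'small_attribute': [0.2, 0.3, 0.5, 0.6],
    'large_attribute': [500.0, 900.0, 700.0, 1300.0],
})

def distance(df, i, j):
    """Euclidean distance between rows i and j of a DataFrame."""
    return float(np.linalg.norm(df.iloc[i].to_numpy() - df.iloc[j].to_numpy()))

# Before scaling: essentially just the difference in the large attribute
print(distance(toy, 0, 1))

# Same standardization as above; note that pandas .std() uses ddof=1
toy_normed = (toy - toy.mean()) / toy.std()

# After scaling: each column has mean 0 and standard deviation 1
print(toy_normed.mean().round(12).tolist())
print(toy_normed.std().tolist())
print(distance(toy_normed, 0, 1))
```

One detail to keep in mind: `pandas` computes the sample standard deviation (dividing by n-1), so results can differ slightly from tools that divide by n.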

## Run clustering algorithms

Now we will run some standard clustering routines to try to find relevant groups of wines.

Two standard algorithms that we discussed in lecture are "k-means" and "agglomerative clustering" (aka hierarchical clustering).  The following code runs these algorithms on the normalized data.  *Note that we are forcing the code here to produce exactly 3 clusters, and that each wine must belong to exactly one cluster (so-called "hard" clustering).*

In [None]:
kmeans_results = cluster.KMeans(n_clusters=3).fit(wineDataNormed)

ac_results = cluster.AgglomerativeClustering(n_clusters=3).fit(wineDataNormed)
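
As a sanity check on what these two calls do, the following sketch runs both algorithms on synthetic data where the right answer is obvious: three well-separated blobs.  The data and the variable names (`points`, `toy_kmeans`, `toy_ac`) are made up for illustration.  The `random_state` and `n_init` arguments are optional extras, not used in the cell above, that should make the k-means result repeatable.

```python
import numpy as np
from sklearn import cluster

# Synthetic stand-in for the wine data: three well-separated blobs in 2-D
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# k-means begins from randomly chosen initial centroids; random_state fixes that
# choice, and n_init sets how many different initializations are tried
toy_kmeans = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Agglomerative clustering (Ward linkage by default) has no random component
toy_ac = cluster.AgglomerativeClustering(n_clusters=3).fit(points)

# Hard clustering: one integer label per point, three distinct labels in total
print(toy_kmeans.labels_.shape, np.unique(toy_kmeans.labels_))
print(toy_ac.labels_.shape, np.unique(toy_ac.labels_))
```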

Each "results" object contains `labels_` that lists the group assigned to each wine:

In [None]:
kmeans_results.labels_

In [None]:
ac_results.labels_

Though the label for each group may not be the same using the two algorithms, you may notice some similarities in the groupings.  Let's try to visualize this a bit more intuitively using PCA.
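
Because the label numbers themselves are arbitrary, comparing two label arrays element by element can be misleading.  One possible approach, sketched here on made-up labels (`labels_a` and `labels_b` are hypothetical), is to build a contingency table and to compute a score that ignores label names, such as the adjusted Rand index:

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Made-up labels from two hypothetical clusterings of the same nine items
labels_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
labels_b = [1, 1, 1, 2, 2, 0, 0, 0, 0]  # same groups renamed, except one item

# Contingency table: rows are clusters in A, columns are clusters in B
table = pd.crosstab(pd.Series(labels_a, name='A'), pd.Series(labels_b, name='B'))
print(table)

# Rough agreement: match each A cluster to the B cluster it overlaps most
agreement = table.max(axis=1).sum() / len(labels_a)
print(agreement)

# The adjusted Rand index ignores label names; it equals 1 for identical groupings
print(adjusted_rand_score(labels_a, labels_a))
print(adjusted_rand_score(labels_a, labels_b))
```

The row-maximum estimate is only a rough guide, since it does not force a one-to-one matching between the two sets of clusters.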

## Visualize using PCA

The `fit_transform` method of `sklearn.decomposition.PCA` will perform PCA and project the data along the principal components:

In [None]:
pca_projections = PCA(n_components=2).fit_transform(wineDataNormed)
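
To make the projection step less of a black box, here is a sketch on synthetic data (the matrix `X` is made up) that reproduces the output of `fit_transform` by hand using a singular value decomposition.  It also prints `explained_variance_ratio_`, which reports the fraction of the total variance captured by each retained component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, correlated data standing in for the normalized wine attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

# sklearn: fit PCA and project onto the first two principal components
pca = PCA(n_components=2)
proj_sklearn = pca.fit_transform(X)

# By hand: center the data, take the SVD, and project onto the
# top two right singular vectors
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
proj_svd = X_centered @ Vt[:2].T

# The two agree up to an arbitrary sign flip of each component
print(np.allclose(np.abs(proj_sklearn), np.abs(proj_svd)))

# Fraction of the total variance captured by each kept component
print(pca.explained_variance_ratio_)
```

The sign ambiguity is harmless for visualization: flipping an axis mirrors the scatter plot without changing which points are near each other.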

Plotting the data along the first two principal components:

In [None]:
plt.scatter(pca_projections[:,0],
            pca_projections[:,1])
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2');

Then we can use the `c` argument of `plt.scatter` to set the colors of the points according to the labels from a clustering algorithm:

In [None]:
plt.scatter(pca_projections[:,0],
            pca_projections[:,1],
            c=kmeans_results.labels_)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.title('K-means clusters');

❓ **Use similar code to plot the clustering results of agglomerative clustering.  Are the results similar to k-means?  Roughly what percentage of the wines are clustered differently by the two algorithms?**

✳️ **Answer:** 

For these wines, we can also compare our clusters to the three cultivars of grapes.

❓ **Make a similar scatter plot with colors corresponding to the grape cultivar.  How well do our found clusters correspond to the known cultivars?**  *Hint: The cultivar of grape is contained in `wineClasses`, which I set up to be easily passed as the argument to `c` in `plt.scatter`.*

✳️ **Answer:** 

## Visualize using t-SNE

Finally, let's practice using a nonlinear manifold learning method on the same data.

Here we'll use the "t-distributed stochastic neighbor embedding" (t-SNE) algorithm that we encountered in lecture.  The following code runs t-SNE using default parameters, which outputs a two-dimensional vector describing each wine: *(Note: You may get some warning messages that you can ignore. Also, the "stochastic" aspect of t-SNE means you will get somewhat different results each time you run the algorithm, so you may want to experiment with running it a few times.)*

In [None]:
wineTSNE = TSNE().fit_transform(wineDataNormed)
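
If you would rather get the same embedding on every run, one option is to pass a fixed `random_state`.  The sketch below does this on synthetic data (the matrix `X` is made up) and also sets `perplexity` explicitly; the value 15 is an arbitrary choice for illustration, and in recent versions of scikit-learn the perplexity must be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the normalized wine data: 60 points in 5 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))

# random_state fixes the stochastic parts of the algorithm;
# perplexity (default 30) roughly sets the size of the neighborhoods considered
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)

# One two-dimensional coordinate pair per input point
print(embedding.shape)
```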

❓ **Analogously to the PCA plots above, make a scatter plot using the output of t-SNE, first with all points the same color, and then with colors corresponding to one of the clustering algorithms.** *Hint: Don't forget to update your axis labels.*

In [None]:
# ✳️ **Answer:** 

❓ **How do the dimensionality reduction results compare between PCA and t-SNE?  Are clusters more visible with t-SNE?  What does (nonlinear) t-SNE do with the "curved arc" of points identified by (linear) PCA?**

✳️ **Answer:** 

## Interpret the results

❓ **Using all the above evidence, briefly interpret the results in terms that a wine connoisseur might understand.  For instance: Are there clear differences between the three classes of wines?  Are some classes more distinguishable than others?  Do the data seem to show continuous variation among wines, or are there distinct separations between classes?**

✳️ **Answer:** 

⚛️ **Bonus question (for nothing but bragging rights): Experiment with other clustering methods (there are many available in `sklearn.cluster`), with varying the number of assumed clusters, and with varying parameters for t-SNE (particularly the number of dimensions and the "perplexity" parameter).  Can you gain any more insight into structure in the data?** 

✴️ **Answer:**