# Clustering, classification, data mining

We'll be using [scikit-learn](http://scikit-learn.org)  ([documentation](http://scikit-learn.org/stable/documentation.html)) and [AstroML](http://www.astroml.org/)  ([documentation](http://www.astroml.org/user_guide/index.html)) for this pre-class assignment.

You have been provided two datasets - one real and one fake.  The real dataset is a color-magnitude diagram from the [COMBO-17](https://arxiv.org/abs/astro-ph/0208345) survey ([survey website](https://www.mpia.de/COMBO/combo_index.html)), and the fake dataset is a set of Gaussian blobs of known grouping.  We are going to use clustering algorithms to try to find patterns in this data!

The COMBO-17 data should show a "red sequence" and a "blue sequence" of galaxies, which are roughly visible to the eye in the data.  Can you get the scikit-learn [k-means clustering](http://scikit-learn.org/stable/modules/clustering.html) to find the correct clusters?

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

## Fake data first!

The function below generates a user-specified number of clusters, each with a user-specified number of points and width; the cells that follow then use k-means clustering to try to recover them.  Try varying the cluster properties and the number of expected clusters and see what happens.  How well does k-means clustering find the correct clusters, particularly when they are near each other?

In [None]:
def fake_clusters(n_clusters=4,npart=30,fwhm=0.05):
    '''
    Generates fake clusters built out of Gaussian blob samples.
    
    inputs: number of clusters, number of particles per cluster, and FWHM of each 
    cluster distribution (note: domain is assumed to be a 2D square and 0-1 in each 
    dimension)
    
    returns: x position, y position, group number (as lists)
    '''
    # np.random.normal expects a standard deviation, not a FWHM, so convert:
    # FWHM = 2*sqrt(2*ln(2))*sigma, i.e. roughly 2.355*sigma
    sigma = fwhm/(2.0*np.sqrt(2.0*np.log(2.0)))

    x = []
    y = []
    group = []
    
    for i in range(n_clusters):
        
        # group center, drawn uniformly from the unit square
        xcenter = np.random.rand()
        ycenter = np.random.rand()

        # add particles to groups with normal distribution 
        # around group center
        for j in range(npart):
            x.append(xcenter+np.random.normal(0.0,sigma))
            y.append(ycenter+np.random.normal(0.0,sigma))
            group.append(i)
    
    return x,y,group

# set the random seed to get reproducible results - try setting this to different values!
np.random.seed(5998821)

# generate some clusters - CHANGE PARAMETERS HERE!
n_clusters = 4
fwhm = 0.05
x,y,g=fake_clusters(n_clusters=n_clusters,fwhm=fwhm)

# plot it out!
plt.scatter(x,y,c=g,cmap='viridis')
plt.xlim(-4*fwhm,1+4*fwhm)
plt.ylim(-4*fwhm,1+4*fwhm)

In [None]:
import sklearn.cluster as skcluster

In [None]:
# we need to stack the data together in an N x 2 numpy array (one row per 
# point, one column per coordinate) in order to feed it into the KMeans clustering tool
combined_data = np.column_stack((x,y))

# here's where the magic happens - note that we have to specify the number of clusters
# we expect, which does not have to match the number we generated above.
# KMeans has a bunch of other parameters - see what they are!
clusters=skcluster.KMeans(n_clusters=4).fit_predict(combined_data)

plt.scatter(x,y,c=clusters,cmap='viridis')
plt.xlim(-4*fwhm,1+4*fwhm)
plt.ylim(-4*fwhm,1+4*fwhm)
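
Comparing the two scatter plots by eye is awkward because k-means numbers its clusters arbitrarily, so the colors rarely line up with the original groups. One way to put a number on the agreement is the adjusted Rand index from `sklearn.metrics`. The sketch below is only an illustration: it uses `make_blobs` with made-up centers as a stand-in for `fake_clusters`, and `cluster_std` there is a standard deviation rather than a FWHM.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Stand-in data: four tight, well-separated blobs (hypothetical centers;
# in the notebook you would use x, y and g from fake_clusters instead).
centers = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]
points, true_groups = make_blobs(n_samples=120, centers=centers,
                                 cluster_std=0.03, random_state=0)

found = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(points)

# The adjusted Rand index is 1.0 for a perfect match and close to 0.0 for
# random labels; it ignores how the cluster labels happen to be numbered.
score = adjusted_rand_score(true_groups, found)
print(f"adjusted Rand index: {score:.3f}")
```

With the notebook's own data, the equivalent call would be `adjusted_rand_score(g, clusters)`.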

Now, vary the random seed, the number of clusters, and the FWHM of the blobs that you generate.  How well does k-means clustering do in each scenario?  Furthermore, scikit-learn's k-means implementation asks you to guess how many clusters there are.  What happens when you give it a number that is too large or too small?
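
One possible way to explore the "too large or too small" question is to scan over several guesses and look at the `inertia_` attribute of the fitted model (the sum of squared distances from each point to its assigned cluster center). This is a sketch on stand-in data from `make_blobs` with made-up centers, not the notebook's own blobs; the range of guesses is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data with four true clusters (hypothetical centers).
centers = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]
points, _ = make_blobs(n_samples=120, centers=centers,
                       cluster_std=0.03, random_state=0)

# Fit k-means for a range of guesses and record the inertia of each fit.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_
    print(f"k = {k}: inertia = {model.inertia_:.4f}")
```

The inertia keeps shrinking as the guess grows, so it cannot pick the number of clusters on its own; what to look for is where the improvement levels off.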

**Answers here!**

### Now...

Read through the [scikit-learn page on clustering](http://scikit-learn.org/stable/modules/clustering.html), as well as their [demonstration of the effects of k-means assumptions](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html#sphx-glr-auto-examples-cluster-plot-kmeans-assumptions-py).  Experiment with a few of the different clustering algorithms described in the first link, and record your observations about the relative properties.  **Repeat the same experiment as above regarding the FWHM of the blobs,** and record your observations below.
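
As a starting point for that comparison, here is a minimal sketch that runs three of the `sklearn.cluster` algorithms on the same blobs and scores each against the known grouping. The stand-in data, the choice of algorithms, and the DBSCAN settings (`eps`, `min_samples`) are illustrative guesses rather than recommended values, and `spread` is a standard deviation rather than a FWHM.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster centers for the stand-in data.
centers = [(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)]

def compare_algorithms(spread):
    """Return {algorithm name: adjusted Rand index} for blobs of a given spread."""
    points, truth = make_blobs(n_samples=120, centers=centers,
                               cluster_std=spread, random_state=0)
    models = {
        "k-means": KMeans(n_clusters=4, n_init=10, random_state=0),
        "agglomerative": AgglomerativeClustering(n_clusters=4),
        # DBSCAN is not told the number of clusters; eps and min_samples
        # are untuned guesses and strongly affect the result.
        "DBSCAN": DBSCAN(eps=0.05, min_samples=5),
    }
    return {name: adjusted_rand_score(truth, model.fit_predict(points))
            for name, model in models.items()}

# Compare tight blobs with broad, overlapping ones.
for spread in (0.03, 0.15):
    print(spread, compare_algorithms(spread))
```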

If you have time, try varying the shapes of the clusters and their distributions in the function above (e.g., non-circular, non-Gaussian) to see how the various algorithms behave.  Record your observations below!
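
One simple way to make a non-circular cluster is to draw a circular Gaussian blob, stretch it along one axis, and rotate it. The helper below, `stretched_cluster`, is a hypothetical function written for this sketch (it is not part of the notebook or of any library), and its parameter names and defaults are arbitrary.

```python
import numpy as np

def stretched_cluster(center, npart=30, sigma=0.05, stretch=4.0,
                      angle=0.5, rng=None):
    """Draw npart points from an elongated 2D Gaussian around center.

    sigma is the standard deviation along the short axis, stretch is the
    ratio of long to short axis, and angle (radians) rotates the long axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    pts = rng.normal(0.0, sigma, size=(npart, 2))
    pts[:, 0] *= stretch                      # elongate along x
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    # rotate each point, then shift the cluster to its center
    return pts @ rotation.T + np.asarray(center)

rng = np.random.default_rng(42)
points = stretched_cluster((0.5, 0.5), npart=500, rng=rng)
print(points.shape)
```

Several such clusters can be stacked with `np.vstack` and passed to the clustering algorithms in the same way as `combined_data`.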

In [None]:
# code here



**notes here!**

## COMBO-17

Now, try k-means clustering on the observational dataset described above.  Do you get similar results?  Can you reproduce the galaxy red and blue sequences?
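
Because k-means works with Euclidean distances, an axis that spans a much larger numerical range can dominate the result. If the magnitude and color axes turn out to have quite different ranges, one option worth trying is to standardize both before clustering. The sketch below assumes that approach; `cluster_cmd` is a hypothetical helper, and the arrays it is run on are synthetic stand-ins, not COMBO-17 values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_cmd(magnitude, color, n_clusters=2, seed=0):
    """Standardize a color-magnitude diagram, then label it with k-means."""
    data = np.column_stack((magnitude, color))        # shape (N, 2)
    scaled = StandardScaler().fit_transform(data)     # zero mean, unit variance
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(scaled)

# Synthetic stand-in data (NOT real survey values): a wide range of
# magnitudes and two groups of colors.
rng = np.random.default_rng(1)
mag = rng.uniform(-20.0, -10.0, size=200)
col = np.concatenate([rng.normal(0.0, 0.1, 100), rng.normal(1.5, 0.1, 100)])
labels = cluster_cmd(mag, col)
print(np.bincount(labels))
```

With the data loaded in the next cell, the call would be `cluster_cmd(V1, V2)`; comparing it with unscaled k-means shows how much the choice matters.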

In [None]:
V1, V2 = np.loadtxt("COMBO17_lowz.dat",skiprows=1,unpack=True)

In [None]:
# plot color-magnitude diagram.  Higher in the y direction means galaxy is 
# redder (M_280 - M_B is larger when the galaxy is fainter in the UV); lower 
# in the y direction means bluer.  Due to the insanity of the magnitude 
# system, more negative numbers on the x-axis are actually brighter.

plt.plot(V1,V2,'b.')
plt.xlabel(r'M$_B$ (mag)')
plt.ylabel(r'M$_{280}$-M$_B$ (mag)')

In [None]:
# put your code here!

