## 1. KMeans vs GMM 

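In this first example, we generate a Gaussian dataset and try to cluster it, to see whether the resulting clusters match the dataset's original labels.

We can use sklearn's [make_blobs](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) function to create a dataset of Gaussian blobs: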

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import cluster, datasets, mixture

%matplotlib inline

n_samples = 1000

varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[5, 1, 0.5],
                             random_state=3)
X, y = varied[0], varied[1]

plt.figure( figsize=(16,12))
plt.scatter(X[:,0], X[:,1], c=y, edgecolor='black', lw=1.5, s=100, cmap=plt.get_cmap('viridis'))
plt.show()

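Now, when we hand this dataset to a clustering algorithm, we obviously do not pass in the labels. So let's start with k-means and see how it handles this dataset. Will it produce clusters that match the original labels?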

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
pred = kmeans.fit_predict(X)

In [None]:
plt.figure( figsize=(16,12))
plt.scatter(X[:,0], X[:,1], c=pred, edgecolor='black', lw=1.5, s=100, cmap=plt.get_cmap('viridis'))
plt.show()

How did k-means do? Was it able to find clusters that match or resemble the original labels?

Now let's try clustering with [GaussianMixture](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html):

In [None]:
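# Import GaussianMixture
from sklearn.mixture import GaussianMixture

# Create an instance of Gaussian Mixture with 3 components
gmm = GaussianMixture(n_components=3)

# Fit the dataset (fit returns the fitted estimator itself)
gmm = gmm.fit(X)

# Predict the clustering labels for the dataset
pred_gmm = gmm.predict(X)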

In [None]:
# Plot the clusters
plt.figure( figsize=(16,12))
plt.scatter(X[:,0], X[:,1], c=pred_gmm, edgecolor='black', lw=1.5, s=100, cmap=plt.get_cmap('viridis'))
plt.show()

Comparing the k-means and GMM results visually, which one better matches the original labels?
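To reason about any difference you see, it can help to look at what the mixture model actually learned. With its default `covariance_type='full'`, `GaussianMixture` estimates a separate covariance matrix for each component, whereas k-means simply assigns each point to its nearest centroid. The sketch below regenerates the same blobs and prints the fitted weight, mean, and per-axis variance of each component. The `random_state=0` passed to the model is an assumption added here for reproducibility and is not part of the notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Same blob settings as the notebook cell in this section.
X, y = make_blobs(n_samples=1000, cluster_std=[5, 1, 0.5], random_state=3)

# random_state=0 is an arbitrary choice made here for reproducibility.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Each component has its own weight, mean, and full covariance matrix.
for k in range(gmm.n_components):
    variances = gmm.covariances_[k].diagonal()
    print(f"component {k}: weight={gmm.weights_[k]:.2f}, "
          f"mean={gmm.means_[k].round(2)}, variances={variances.round(2)}")
```

If the fit went well, the printed variances should differ markedly between components, reflecting the different `cluster_std` values used to generate the data.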

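## 2. KMeans vs GMM - Iris dataset

For the second example, we use a dataset with more than two features. The Iris dataset is a good fit because it is reasonable to assume that its data is Gaussian-distributed.

The Iris dataset is a labeled dataset with four features: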


In [None]:
import seaborn as sns

iris = sns.load_dataset("iris")

iris.head()

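There are several ways to visualize a dataset with four dimensions (for example [PairGrid](https://seaborn.pydata.org/generated/seaborn.PairGrid.html), [t-SNE](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), or [projecting into a lower number of dimensions with PCA](http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html#sphx-glr-auto-examples-decomposition-plot-pca-iris-py)). Let's try visualizing with PairGrid because it does not distort the dataset; it simply plots each pair of features against each other in a subplot: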

In [None]:
g = sns.PairGrid(iris, hue="species", palette=sns.color_palette("cubehelix", 3), vars=['sepal_length','sepal_width','petal_length','petal_width'])
g.map(plt.scatter)
plt.show()
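For comparison, the PCA option mentioned above can be sketched as follows. This is a minimal illustration and not part of the original exercise: it loads Iris through `sklearn.datasets.load_iris` rather than seaborn (an assumption made here so the snippet runs without a download) and projects the four features onto two principal components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load the four Iris features as a (150, 4) array.
iris_data = load_iris()
X_iris = iris_data.data

# Project the four features onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_iris)

print(X_2d.shape)
# Fraction of the total variance captured by each retained component.
print(pca.explained_variance_ratio_)
```

The two columns of `X_2d` can be passed to `plt.scatter` just like the blobs earlier. Unlike PairGrid, this view discards whatever variance the two retained components do not capture.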

If we cluster the Iris dataset using KMeans, how closely would the resulting clusters match the original labels?

In [None]:
kmeans_iris = KMeans(n_clusters=3)
pred_kmeans_iris = kmeans_iris.fit_predict(iris[['sepal_length','sepal_width','petal_length','petal_width']])

In [None]:
iris['kmeans_pred'] = pred_kmeans_iris

g = sns.PairGrid(iris, hue="kmeans_pred", palette=sns.color_palette("cubehelix", 3), vars=['sepal_length','sepal_width','petal_length','petal_width'])
g.map(plt.scatter)
plt.show()

How do these clusters match the original labels?

You can clearly see that visual inspection is no longer useful if we're working with multiple dimensions like this. So how can we evaluate the clustering result versus the original labels? 

You guessed it. We can use an external cluster validation index such as the [adjusted Rand score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html), which yields a score of at most 1: an exact match scores 1, a random labeling scores close to 0, and the score can be negative for especially discordant clusterings.
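A useful property of the adjusted Rand score is that it compares partitions, not label values, so it does not matter that a clustering algorithm numbers its clusters differently from the ground truth. The toy labels below are made up purely for illustration:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground-truth labels for six points.
true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping of points, but with the cluster ids renamed.
renamed_labels = [2, 2, 0, 0, 1, 1]

# A grouping that puts one point in the wrong cluster.
one_mistake = [0, 0, 1, 2, 2, 2]

score_renamed = adjusted_rand_score(true_labels, renamed_labels)
score_mistake = adjusted_rand_score(true_labels, one_mistake)

print(score_renamed)
print(score_mistake)
```

The renamed labeling still counts as an exact match, while the labeling with one misplaced point scores lower.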

In [None]:
# Import adjusted rand score
from sklearn.metrics import adjusted_rand_score

# Calculate adjusted rand score passing in the original labels and the kmeans predicted labels
iris_kmeans_score = adjusted_rand_score(iris['species'], iris['kmeans_pred'])

# Print the score
iris_kmeans_score

What if we cluster using Gaussian Mixture models? Would it earn a better ARI score?

In [None]:
gmm_iris = GaussianMixture(n_components=3).fit(iris[['sepal_length','sepal_width','petal_length','petal_width']])
pred_gmm_iris = gmm_iris.predict(iris[['sepal_length','sepal_width','petal_length','petal_width']])

In [None]:
iris['gmm_pred'] = pred_gmm_iris

# Calculate adjusted rand score passing in the original labels
# (iris['species']) and the GMM predicted labels
iris_gmm_score = adjusted_rand_score(iris['species'], iris['gmm_pred'])

# Print the score
iris_gmm_score

Thanks to the ARI scores, we have a clear indicator of which clustering result better matches the original labels.