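## Lab Overview

### 1. Objectives

This lab covers:
* Understanding the concept of clustering and the common clustering algorithms.
* Clustering the Iris dataset with a hierarchical clustering algorithm.

### 2. Lab Environment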

* python 3.6.7
* numpy 1.13.3
* pandas 0.23.4   

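### 3. Data

* The data is stored in the file Iris.csv. The dataset contains 150 records in 3 classes, with 50 records per class. Each record has 4 features: sepal length, sepal width, petal length, and petal width. These 4 features can be used to predict which of the three species (iris-setosa, iris-versicolor, iris-virginica) an iris flower belongs to.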
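As a quick sanity check on the numbers above, the sketch below counts records, features, and class sizes. It assumes that scikit-learn's bundled copy of the Iris data can stand in for Iris.csv; the two are expected to agree on shape and class balance, although the column names differ. The names `X_check` and `y_check` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris copy stands in for Iris.csv here.
iris = load_iris()
X_check, y_check = iris.data, iris.target

n_records, n_features = X_check.shape
class_counts = np.bincount(y_check)   # number of records per class id

print(n_records, n_features)
print(list(iris.target_names))
print(class_counts)
```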

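### 4. Preparation

Click the data-download module at the upper right of the screen and download Iris.tgz to a directory of your choice. Then choose File->Open->Upload from the menu at the top to upload the downloaded archive, and extract it with the following command: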

In [1]:
!tar -zxvf ./work/iris.tgz -C ./dataset/

x ./iris.csv
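If the `!tar` shell escape is not available, Python's standard `tarfile` module can probably do the same job. The sketch below is self-contained: it builds a small throwaway archive (`demo.tgz`, a hypothetical name) in a temporary directory and extracts it. For this lab, the archive path and destination would instead be the ones used in the command above. `extract_tgz` is an illustrative helper, not part of the lab materials.

```python
import os
import tarfile
import tempfile

def extract_tgz(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir; return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)   # only use on archives you trust
    return names

# Build a throwaway archive so the example runs anywhere (hypothetical contents).
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "demo.csv")
with open(csv_path, "w") as f:
    f.write("Sepal.Length,Species\n5.1,setosa\n")

archive_path = os.path.join(workdir, "demo.tgz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(csv_path, arcname="demo.csv")

dest_dir = os.path.join(workdir, "dataset")
extracted = extract_tgz(archive_path, dest_dir)
print(extracted)
```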


## Experiment

### 1. Import the required packages

In [1]:
import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder
from scipy.spatial.distance import pdist, squareform

### 2. Load the data

In [2]:
iris_df = pd.read_csv('./dataset/iris.csv')
iris_df.head()

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [3]:
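# Create a LabelEncoder object
le = LabelEncoder()

# Encode the last column (Species) as integers; assigning by column name
# replaces the column, so its dtype becomes integer
iris_df['Species'] = le.fit_transform(iris_df['Species'])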

iris_df.head()

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
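The integer codes are arbitrary identifiers rather than an ordering of the species. `LabelEncoder` assigns them following the sorted order of the labels, and the fitted encoder can map them back. A minimal sketch on a hypothetical list of species names (`species`, `le_demo`, and `mapping` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, standing in for the Species column
species = ["setosa", "versicolor", "virginica", "setosa"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(species)

# classes_ holds the original labels; the label at position k is encoded as k
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
decoded = le_demo.inverse_transform(codes)

print(mapping)
print(list(decoded))
```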


In [4]:
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal.Length  150 non-null    float64
 1   Sepal.Width   150 non-null    float64
 2   Petal.Length  150 non-null    float64
 3   Petal.Width   150 non-null    float64
 4   Species       150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 5.4 KB


In [5]:
X = iris_df.iloc[:, :-1].values.astype(float)
y = iris_df.iloc[:, -1].values

In [17]:
############################################################
# Plot the dataset before clustering (first two features only)
############################################################

def show_dataset(dataset):
    for item in dataset:
        plt.plot(item[0], item[1], 'ob')
    plt.title("Dataset")
    plt.show()



############################################################
# Compute and return the Euclidean distance between two points
############################################################

def elu_distance(a, b):
    dist = np.sqrt(np.sum(np.square(np.array(a) - np.array(b))))
    return dist



############################################################
# Compute and return the minimum distance between clusters Ci and Cj (single linkage)
############################################################

def dist_min(ci, cj):
    return min(elu_distance(i, j) for i in ci for j in cj)



############################################################
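# Compute and return the maximum distance between clusters Ci and Cj (complete linkage)
############################################################

def dist_max(ci, cj):
    # Largest distance over all pairs (one point from Ci, one from Cj)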
    return max(elu_distance(i, j) for i in ci for j in cj)



############################################################
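# Compute and return the average distance between clusters Ci and Cj (average linkage)
############################################################

def dist_avg(ci, cj):
    # Mean distance over all pairs (one point from Ci, one from Cj)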
    return sum(elu_distance(i, j) for i in ci for j in cj) / (len(ci) * len(cj))



############################################################
# Find the two closest clusters; return their indices and the distance between them
############################################################

def find_index(m):
    min_dist = float('inf')
    x = y = 0
    for i in range(len(m)):
        for j in range(len(m[i])):
            if i != j and m[i][j] < min_dist:
                min_dist, x, y = m[i][j], i, j
    return x, y, min_dist
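The helper functions above loop over every pair of points in pure Python, which is easy to read but slow on larger clusters. As a rough sketch, the same quantities can be computed with vectorized calls. `cluster_distances` and `closest_pair` are hypothetical names introduced here, and the toy clusters are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distances(ci, cj):
    """Return (min, max, mean) of all pairwise Euclidean distances between two clusters."""
    pairwise = cdist(np.asarray(ci, dtype=float), np.asarray(cj, dtype=float))
    return pairwise.min(), pairwise.max(), pairwise.mean()

def closest_pair(m):
    """Return (i, j, distance) for the smallest off-diagonal entry of a square matrix."""
    m = np.array(m, dtype=float)   # copy, so the caller's matrix is left untouched
    np.fill_diagonal(m, np.inf)    # ignore the distance from a cluster to itself
    i, j = np.unravel_index(np.argmin(m), m.shape)
    return int(i), int(j), float(m[i, j])

# Hypothetical toy clusters
ci = [[0.0, 0.0], [0.0, 1.0]]
cj = [[3.0, 0.0], [3.0, 4.0]]
d_min, d_max, d_avg = cluster_distances(ci, cj)
print(d_min, d_max, d_avg)

# Hypothetical 3x3 cluster distance matrix
pair = closest_pair([[0, 2, 5], [2, 0, 1], [5, 1, 0]])
print(pair)
```

The three returned values correspond to `dist_min`, `dist_max`, and `dist_avg`, and `closest_pair` plays the role of `find_index`.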


### 3. Agglomerative clustering

In [10]:
from scipy.cluster.hierarchy import linkage  # this import was missing, so fit() raised a NameError

class AGNES:
    def __init__(self, n_clusters=3, linkage='single'):
        if linkage not in ('single', 'complete'):
            raise ValueError("linkage must be 'single' or 'complete'")
        self.n_clusters = n_clusters
        self.linkage = linkage

    def fit(self, X):
        # Every sample starts as its own cluster; label k matches row/column k of D
        self.labels_ = np.arange(X.shape[0])
        n_clusters = X.shape[0]
        condensed = pdist(X)
        D = squareform(condensed)
        np.fill_diagonal(D, np.inf)
        # SciPy's linkage expects the condensed distance vector; the result is kept for the dendrogram
        self.linkage_matrix_ = linkage(condensed, method=self.linkage)

        while n_clusters > self.n_clusters:
            # Closest pair of clusters; D is symmetric, so argmin returns i < j
            i, j = np.unravel_index(np.argmin(D), D.shape)
            if self.linkage == 'single':
                D[i, :] = np.minimum(D[i, :], D[j, :])
                D[:, i] = np.minimum(D[:, i], D[:, j])
            elif self.linkage == 'complete':
                D[i, :] = np.maximum(D[i, :], D[j, :])
                D[:, i] = np.maximum(D[:, i], D[:, j])
            D[i, i] = np.inf  # a cluster must never be merged with itself
            D = np.delete(D, j, axis=0)
            D = np.delete(D, j, axis=1)

            # Fold cluster j into cluster i, then shift the higher labels down by one
            self.labels_[self.labels_ == j] = i
            self.labels_[self.labels_ > j] -= 1

            n_clusters -= 1
        return self

    def fit_predict(self, X):
        self.fit(X)
        return self.labels_

In [12]:
# Run the clustering
agnes = AGNES(n_clusters=3, linkage='single')
labels = agnes.fit_predict(X)

In [8]:
# Plot the dendrogram
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

In [9]:
linkage_matrix = agnes.linkage_matrix_
fig, ax = plt.subplots(figsize=(10, 5))
dendrogram(linkage_matrix, ax=ax)
ax.set_xlabel('Sample index')
ax.set_ylabel('Distance')
ax.set_title('AGNES Dendrogram')
plt.show()
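One way to sanity-check a hand-written AGNES is to cut SciPy's own linkage matrix at the same number of clusters and compare the resulting partitions. The sketch below does this on hypothetical toy data (`points`); for the lab, the feature matrix `X` would take its place. Cluster ids are arbitrary and may differ between implementations, so it is the grouping that should be compared (for example with `sklearn.metrics.adjusted_rand_score`), not the raw label values.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: three tight, well-separated groups of 2-D points
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [10.0, 0.0], [10.1, 0.0], [10.0, 0.1],
])

# Build the single-linkage merge tree from the condensed distance vector
Z = linkage(pdist(points), method="single")

# criterion="maxclust" cuts the tree so that at most t flat clusters remain
flat_labels = fcluster(Z, t=3, criterion="maxclust")

print(flat_labels)
```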