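## Example: A Handwriting Recognition System
This section builds a handwriting recognition system based on a k-nearest neighbors (kNN) classifier. To keep things simple, the system only recognizes the digits 0 to 9. The digits to be recognized have already been processed with image-editing software so that they share the same color and size: black-and-white images of 32 × 32 pixels. Although **storing images as text does not use memory efficiently**, the images are converted to text format here to make them easier to understand.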
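To put a rough number on the memory remark, the sketch below compares the size of one 32 × 32 image stored as text with the same pixels packed eight to a byte. The image is synthetic and the variable names are illustrative; none of this comes from the dataset itself.

```python
import numpy as np

# Synthetic 32 x 32 binary image: a filled square in the middle (illustrative data).
image = np.zeros((32, 32), dtype=np.uint8)
image[8:24, 8:24] = 1

# Text form: 32 lines, each with 32 characters plus a newline.
text = ''.join(''.join(str(v) for v in row) + '\n' for row in image)
text_bytes = len(text.encode('ascii'))

# Packed form: 8 pixels per byte.
packed = np.packbits(image.ravel())
packed_bytes = packed.nbytes

print(text_bytes, packed_bytes)  # 1056 128

# Packing is lossless: unpacking recovers the original pixels.
restored = np.unpackbits(packed).reshape(32, 32)
```

The text form is about eight times larger, but it can be opened and inspected in any editor, which is the reason given above for using it.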

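Steps:
- Collect the data: text files are provided;
- Prepare the data: write a function img2vector that converts the image format into the vector format used by the classifier;
- Analyze the data: inspect the data to make sure it meets the requirements;
- Test the algorithm: write a function that uses part of the provided dataset as test samples. The test samples have known labels and are classified against the remaining (non-test) samples; whenever the predicted class differs from the actual class, the prediction is counted as an error.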

In [8]:
# Import the required packages
import numpy as np
import operator
import os

# kNN implementation
def classify(inX, dataset, labels, k):
    dataset_size = dataset.shape[0]
    # Euclidean distance from inX to every row of the training set
    diff_mat = np.tile(inX, (dataset_size, 1)) - dataset
    sq_diff_mat = diff_mat**2
    sq_distance = sq_diff_mat.sum(axis=1)
    distances = sq_distance**0.5
    # Indices of the training rows, ordered from nearest to farthest
    sorted_dist_indicies = distances.argsort()
    # Count the labels of the k nearest neighbors
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_dist_indicies[i]]
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    # Return the label with the most votes
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]
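As a quick sanity check of the voting logic, here is a minimal, self-contained sketch of the same idea on a toy two-dimensional dataset. The function name `knn_classify` and the toy points are assumptions made for illustration. It relies on NumPy broadcasting instead of `np.tile` and on `collections.Counter` for the vote, so it is a variant of the function above rather than a copy of it.

```python
import numpy as np
from collections import Counter

def knn_classify(in_x, dataset, labels, k):
    """Return the majority label among the k training rows nearest to in_x."""
    # Broadcasting subtracts in_x from every row, so np.tile is not needed.
    distances = np.sqrt(((dataset - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two points near (1, 1) labelled 'A', two near (0, 0) labelled 'B'.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
group_labels = ['A', 'A', 'B', 'B']

result_near_origin = knn_classify(np.array([0.1, 0.1]), group, group_labels, 3)
result_near_one = knn_classify(np.array([0.9, 1.0]), group, group_labels, 3)
print(result_near_origin, result_near_one)  # B A
```

With k = 3, each query point picks up two neighbors from its own cluster and one from the other, so the majority vote returns the label of the closer cluster.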

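### Preparing the data: converting images into test vectors
The actual images are stored in two subdirectories:
- Directory trainingDigits: contains about 2000 examples, roughly 200 samples per digit;
- Directory testDigits: contains about 900 test samples.

We use the data in trainingDigits to train the classifier and the data in testDigits to test how well it performs. The two sets do not overlap.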

To use the classifier written earlier, each image must first be formatted as a vector. We convert a 32 × 32 binary image matrix into a 1 × 1024 vector.

In [2]:
def img2vector(filename):
    return_vect = np.zeros((1, 1024))
    with open(filename) as file:
        for i in range(32):
            line_str = file.readline()
            for j in range(32):
                return_vect[0, 32*i + j] = int(line_str[j])
    return return_vect

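The function creates a 1 × 1024 NumPy array, opens the given file, reads its first 32 lines in a loop, stores the first 32 characters of each line in the array (pixel j of line i goes to index 32*i + j), and finally returns the array.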
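The dataset files may not be at hand, so the following sketch writes a synthetic 32 × 32 text image to a temporary file and converts it with a list-comprehension variant of the same conversion. The name `img2vector_fast` and the synthetic image are assumptions for illustration. The row-major layout (pixel j of line i lands at index 32*i + j) matches the function above.

```python
import os
import tempfile
import numpy as np

def img2vector_fast(filename):
    """Read a 32-line, 32-character 0/1 text image into a (1, 1024) float array."""
    with open(filename) as f:
        rows = [[int(ch) for ch in line[:32]] for line in f.readlines()[:32]]
    return np.array(rows, dtype=float).reshape(1, 1024)

# Synthetic image: all zeros except a vertical bar in columns 14 to 17.
image = np.zeros((32, 32), dtype=int)
image[:, 14:18] = 1

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'synthetic_digit.txt')
    with open(path, 'w') as f:
        for row in image:
            f.write(''.join(str(v) for v in row) + '\n')
    vec = img2vector_fast(path)

print(vec.shape)       # (1, 1024)
print(int(vec.sum()))  # 128, i.e. 32 rows x 4 set pixels
```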

In [3]:
test_vector = img2vector('testDigits/0_13.txt')
test_vector[0, 0:31]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [4]:
test_vector[0, 32:63]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

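### Testing the algorithm: recognizing handwritten digits with kNN
The data is now in a form the classifier can work with. Next we feed it to the classifier and check how well the classifier performs.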

In [15]:
def handwriting_class_test():
    hw_labels = []
    # Load the training set: one row of 1024 values per image file
    training_file_list = os.listdir('trainingDigits')
    m = len(training_file_list)
    training_mat = np.zeros((m, 1024))
    for i in range(m):
        file_name_str = training_file_list[i]
        # File names look like '9_45.txt': the part before '_' is the digit
        file_str = file_name_str.split('.')[0]
        class_num_str = int(file_str.split('_')[0])
        hw_labels.append(class_num_str)
        training_mat[i, :] = img2vector('trainingDigits/%s' % file_name_str)
    # Classify every test file and count the misclassifications
    test_file_list = os.listdir('testDigits')
    error_count = 0.0
    m_test = len(test_file_list)
    for i in range(m_test):
        file_name_str = test_file_list[i]
        file_str = file_name_str.split('.')[0]
        class_num_str = int(file_str.split('_')[0])
        vector_under_test = img2vector('testDigits/%s' % file_name_str)
        classifier_result = classify(vector_under_test, training_mat, hw_labels, 3)
        print('The classifier came back with: %d, the real answer is: %d' % (classifier_result, class_num_str))
        if classifier_result != class_num_str:
            error_count += 1.0
    print('\n the total number of errors is: %d' % error_count)
    print('\n the total error rate is: %f' % (error_count / float(m_test)))

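Code notes:
1. The names of the files in the trainingDigits directory are stored in a list; the number of files in the directory is then stored in the variable m;
2. Next, the code creates a training matrix with m rows and 1024 columns, where each row holds one image. The class digit can be parsed from the file name. For example, the file 9_45.txt belongs to class 9 and is the 45th instance of the digit 9;
3. The class label is then appended to the hw_labels vector, and the image is loaded with the img2vector function;
4. The files in the testDigits directory are handled similarly, except that they are not loaded into a matrix; instead, each file is classified with the classify() function. Because the values in the files are already between 0 and 1, the auto_norm() function is not needed.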
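The file-name parsing described in note 2 can be tried in isolation. The helper below is a hypothetical sketch (the name `label_from_filename` does not appear in the original code) and assumes names of the form `<digit>_<instance>.txt`.

```python
import os

def label_from_filename(file_name):
    """Split a name such as '9_45.txt' into (class digit, instance number)."""
    stem = os.path.splitext(file_name)[0]       # '9_45'
    class_str, instance_str = stem.split('_')   # '9', '45'
    return int(class_str), int(instance_str)

digit, instance = label_from_filename('9_45.txt')
print(digit, instance)  # 9 45
```

Note that `os.listdir` returns every entry in a directory, so a stray file that does not follow this naming pattern would raise a `ValueError` here, just as it would in the `int(...)` call of the original loop.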

In [16]:
handwriting_class_test()

The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
...

The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
The classifier came back with: 1, the real answer is: 1
...

The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
The classifier came back with: 3, the real answer is: 3
...

The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
The classifier came back with: 4, the real answer is: 4
...

The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
The classifier came back with: 6, the real answer is: 6
...

The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
...

The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
...

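Changing the value of k, modifying handwriting_class_test to select the training samples at random, and changing the number of training samples all affect the error rate of the kNN algorithm.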
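One way to explore these effects without the digit files is to run the same kind of experiment on synthetic data. The sketch below is an assumption-laden illustration, not the notebook's method: it invents ten random 64-bit "prototype" patterns, generates noisy copies as samples, and measures the error rate for several values of k and two training-set sizes. All names, sizes, and the noise level are made up for the example, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np

def knn_predict(x, train_x, train_y, k):
    """Predict the label of x by majority vote among its k nearest rows."""
    distances = ((train_x - x) ** 2).sum(axis=1)  # squared distance keeps the ranking
    nearest = distances.argsort()[:k]
    return np.bincount(train_y[nearest]).argmax()  # ties go to the smaller label

def error_rate(train_x, train_y, test_x, test_y, k):
    wrong = sum(knn_predict(x, train_x, train_y, k) != y
                for x, y in zip(test_x, test_y))
    return wrong / len(test_y)

rng = np.random.default_rng(0)
prototypes = rng.integers(0, 2, size=(10, 64))    # one random bit pattern per class

def make_samples(n_per_class, flip_prob=0.3):
    """Noisy copies of each prototype: every bit is flipped with flip_prob."""
    labels = np.repeat(np.arange(10), n_per_class)
    flips = rng.random((labels.size, 64)) < flip_prob
    data = np.where(flips, 1 - prototypes[labels], prototypes[labels])
    return data.astype(float), labels

test_x, test_y = make_samples(20)
results = {}
for n_train in (5, 50):                           # training samples per class
    train_x, train_y = make_samples(n_train)
    for k in (1, 3, 5, 7):
        results[(n_train, k)] = error_rate(train_x, train_y, test_x, test_y, k)

for (n_train, k), rate in sorted(results.items()):
    print('train/class=%d  k=%d  error rate=%.3f' % (n_train, k, rate))
```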

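In practice, this algorithm is not efficient. For every test vector it performs about 2000 distance calculations, each involving floating-point operations over 1024 dimensions, and this is repeated about 900 times, once per test vector. In addition, about 2 MB of storage is needed for the test vectors. Is there an algorithm that reduces the storage and computation costs? The k-d tree is an optimized variant of kNN that can save a large amount of computation.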
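The cost figures above can be checked with some back-of-the-envelope arithmetic. The sketch below uses the approximate sizes quoted in the text (2000 training samples, 900 test samples, 1024 dimensions). The storage comparison is an added observation: `np.zeros` creates float64 arrays by default, whereas one byte per pixel would give roughly 2 MB for a 2000 × 1024 matrix. Whether that is how the 2 MB figure was obtained is an assumption.

```python
import numpy as np

n_train, n_test, n_dims = 2000, 900, 1024   # approximate sizes quoted in the text

distance_calcs = n_train * n_test           # one distance per (test, train) pair
element_ops = distance_calcs * n_dims       # one subtract-and-square per dimension

# Storage for a 2000 x 1024 matrix under two element types.
bytes_float64 = n_train * n_dims * np.dtype(np.float64).itemsize
bytes_uint8 = n_train * n_dims * np.dtype(np.uint8).itemsize

print(distance_calcs)   # 1800000
print(element_ops)      # 1843200000
print(bytes_float64)    # 16384000
print(bytes_uint8)      # 2048000
```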