## KNN算法的原理
KNN是最简单的分类算法，它的核心思想是谁离我最近，我就跟谁是一类
### 算法的执行步骤
- 输入训练数据（此时相当于已经搭建好了模型）
- 输入测试的数据和超参数k
- 将输入数据分别与所有的训练数据逐条计算距离（可以是算欧氏距离，也可以是其他距离算法）
- 对距离排序，找出前k个离该条数据最近的训练数据
- 前k个数据中，出现类别最多的就是结果，返回该分类结果，输出预测

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

### 载入数据

In [2]:
iris = load_iris()
x = iris.data
y = iris.target

In [3]:
x.shape

(150, 4)

In [4]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

### 分割训练集和测试集

In [5]:
#划分数据集为百分之八十的训练集，剩余百分之二十是测试集
x_train , x_test ,y_train , y_test = train_test_split(x, y, test_size=0.2, random_state=666666)

In [6]:
X = x_test[0] #这是我们待会输入进去的数据

In [7]:
X

array([7.2, 3.2, 6. , 1.8])

In [8]:
y_test[0]

2

### 导入我们需要的numpy

In [9]:
import numpy as np

### 计算距离

In [10]:
distances = []
for x_data in x_train:
    distance = np.sqrt(np.sum(x_data - X)**2)
    distances.append(distance)

In [11]:
distances

[8.8,
 2.500000000000001,
 2.6000000000000005,
 6.6000000000000005,
 9.3,
 2.800000000000001,
 2.8000000000000007,
 4.800000000000001,
 7.500000000000002,
 5.1000000000000005,
 8.5,
 2.220446049250313e-16,
 8.6,
 8.9,
 3.5999999999999996,
 8.7,
 0.4000000000000006,
 1.3000000000000005,
 0.9000000000000001,
 0.8000000000000005,
 3.8,
 1.7999999999999996,
 8.6,
 5.400000000000001,
 2.4000000000000004,
 9.8,
 1.0000000000000002,
 8.200000000000001,
 8.0,
 2.700000000000001,
 6.9,
 7.5,
 7.800000000000001,
 9.1,
 2.2,
 6.6000000000000005,
 7.700000000000001,
 1.3,
 3.0,
 3.5000000000000004,
 7.9,
 3.9000000000000004,
 5.3999999999999995,
 2.6000000000000005,
 0.40000000000000013,
 7.5,
 7.6000000000000005,
 6.500000000000001,
 3.3000000000000007,
 0.0999999999999992,
 8.5,
 4.6000000000000005,
 4.9,
 4.6,
 8.8,
 7.800000000000001,
 1.4000000000000004,
 0.6999999999999995,
 0.10000000000000075,
 4.300000000000001,
 2.5999999999999996,
 2.9000000000000004,
 1.9000000000000001,
 7.5,
 7.4,
 1

### 对距离排序
numpy中提供了非常好的排序算法，帮我们减轻了负担

In [12]:
np.argsort(distances) #以下所显示的是离我们最近的样本的索引

array([110,  11,  94,  49,  99, 101,  58,  44,  16,  72, 107,  57,  19,
        18,  26,  66,  69,  37,  17,  56, 114,  70,  81,  21,  65,  62,
        89,  34,  96,  80,  24,  91,   1, 100,  60, 113,   2,  43,  86,
        29,  79,   6,   5,  61,  38, 109,  95,  48, 105, 115,  39,  14,
        83,  73,  20, 102,  41,  67, 119,  59,  84,  53, 116, 118,  51,
         7,  52,  90,   9,  42,  23, 112,  78,  47,  35,   3, 111, 106,
        30, 117,  98,  82,  64,  45,  63,  93,  31,  97,   8,  46,  36,
        55,  32,  87,  40,  75,  28,  71, 104,  68,  88,  27,  77,  74,
        50,  10,  76,  12,  22, 108,  15,  54,   0,  85, 103,  13,  33,
         4,  92,  25], dtype=int64)

### 指定超参数K
我们假设K = 3

In [25]:
k = 3
nearest = np.argsort(distances)[:k] #取前k个

In [27]:
top_k_y = [y_train[index] for index in nearest]
top_k_y #说实话我没想到都是2，我事先试过了，试到了40基本上都还是2，看来这个数据集非常适合KNN

[2, 2, 2]

### 选出类别数量最多的做为预测结果
这里离样本最近的都是2这个类别的鸢尾花，其实不用看都知道答案了，但是为了算法具有泛化能力我们还是得把具体过程写出来

In [29]:
d = {}
for cls in top_k_y:
    d[cls] = d.get(cls,0) + 1
d

{2: 3}

In [30]:
d_list = list(d.items())
d_list.sort(key=lambda x:x[1],reverse=True)
d_list[0][0] #这就是最终预测结果

2