### Feature Scaling Example

Now that you have seen that feature scaling can change the clusters the kmeans algorithm finds, let's get some practice!
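
As a quick reminder of why scale matters for distance-based methods, here is a minimal sketch. The three points and the per-feature spreads (4 minutes, 100 points) are hypothetical values chosen only for illustration; they are not taken from the notebook's dataset.

```python
import numpy as np

# Hypothetical runners: (5k time in minutes, raw test score).
a = np.array([20.0, 500.0])
b = np.array([30.0, 505.0])   # much slower than a, similar score
c = np.array([20.5, 560.0])   # similar time to a, higher score


def nearest_to_a(scale):
    """Return 'b' or 'c', whichever is closer to a after dividing each feature by scale."""
    dist_b = np.linalg.norm((a - b) / scale)
    dist_c = np.linalg.norm((a - c) / scale)
    return 'b' if dist_b < dist_c else 'c'


# Raw units: the test score axis dominates the Euclidean distance.
raw_nearest = nearest_to_a(np.array([1.0, 1.0]))

# Assumed per-feature spreads, used here as illustrative scaling factors.
scaled_nearest = nearest_to_a(np.array([4.0, 100.0]))

print(raw_nearest, scaled_nearest)
```

The nearest neighbor of `a` changes once both features are put on comparable scales, and kmeans relies on exactly this kind of distance comparison when it assigns points to centers.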

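First, let's get some data to work with. The first cell reads in the necessary libraries, generates the data, and plots it. You will use this data throughout the rest of the notebook.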

The dataset you will use throughout this notebook is stored in the variable **data**.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from IPython.display import Image
from sklearn.datasets import make_blobs
import tests2 as t


%matplotlib inline

# DSND colors: UBlue, Salmon, Gold, Slate
plot_colors = ['#02b3e4', '#ee2e76', '#ffb613', '#2e3d49']

# Light colors: Blue light, Salmon light
plot_lcolors = ['#88d0f3', '#ed8ca1', '#fdd270']

# Gray/bg colors: Slate Dark, Gray, Silver
plot_grays = ['#1c262f', '#aebfd1', '#fafbfc']


def create_data():
    n_points = 120
    X = np.random.RandomState(3200000).uniform(-3, 3, [n_points, 2])
    X_abs = np.absolute(X)

    inner_ring_flag = np.logical_and(X_abs[:,0] < 1.2, X_abs[:,1] < 1.2)
    outer_ring_flag = X_abs.sum(axis = 1) > 5.3
    keep = np.logical_not(np.logical_or(inner_ring_flag, outer_ring_flag))

    X = X[keep]
    X = X[:60] # keep only the first 60 points
    X1 = np.matmul(X, np.array([[2.5, 0], [0, 100]])) + np.array([22.5, 500])
    
    
    plt.figure(figsize = [15,15])

    plt.scatter(X1[:,0], X1[:,1], s = 64, c = plot_colors[-1])

    plt.xlabel('5k Completion Time (min)', size = 30)
    plt.xticks(np.arange(15, 30+5, 5), fontsize = 30)
    plt.ylabel('Test Score (raw)', size = 30)
    plt.yticks(np.arange(200, 800+200, 200), fontsize = 30)

    ax = plt.gca()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    for side in ax.spines.values():
        side.set_linewidth(2)
    ax.tick_params(width = 2)
    plt.savefig('C18_FeatScalingEx_01.png', transparent = True)
    
    
    data = pd.DataFrame(X1)
    data.columns = ['5k_Time', 'Raw_Test_Score']
    
    return data

data = create_data()

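`1.` Take a look at the dataset. Are there any missing values? What is the average completion time? What is the average raw test score? Use the cells below to find the answers to these questions, then use the dictionary to match each value to the corresponding statement and check your answers against our solution.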
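
The kinds of pandas calls that answer questions like these can be sketched on a small made-up dataframe. The name `toy` and its columns are hypothetical and deliberately different from the notebook's `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NOT the notebook's `data`.
toy = pd.DataFrame({
    'time_min': [20.0, 25.0, np.nan, 30.0],
    'score': [400.0, 500.0, 600.0, 700.0],
})

n_rows = toy.shape[0]                  # number of rows (individuals)
n_missing = toy.isnull().sum().sum()   # total missing values across all columns
col_means = toy.mean()                 # per-column means; NaN is skipped by default

print(n_rows, n_missing)
print(col_means)
```

`toy.describe()` reports the count, mean, and spread of every numeric column in one call, which is often enough for a first look.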

In [None]:
# cell for work


In [None]:
# another cell for work


In [None]:
# Use the dictionary to match the values to the corresponding statements
a = 0
b = 60
c = 22.9
d = 4.53
e = 511.7

q1_dict = {
'number of missing values': # letter here,
'the mean 5k time in minutes': # letter here,    
'the mean test score as a raw value': # letter here,
'number of individuals in the dataset': # letter here
}

# check your answer against ours here
t.check_q1(q1_dict)

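`2.` Now, instantiate a kmeans `model` with two cluster centers. Use your model to `fit` and `predict` the cluster of each point in your dataset. Store the predictions in the variable `preds`. If you created the model and predictions correctly, you should see a top (blue) cluster and a bottom (pink) cluster when you run the following cell.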
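
For reference, the general fit-and-predict pattern with scikit-learn's `KMeans` looks roughly like the sketch below. The array `toy_X` and the names `toy_model` and `toy_preds` are hypothetical, and `n_init=10` and `random_state=0` are choices made here for reproducibility rather than anything the exercise requires.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy points: two well-separated groups.
toy_X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

# Two centers; n_init and random_state are illustrative settings.
toy_model = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict fits the centers and returns one integer label per row.
toy_preds = toy_model.fit_predict(toy_X)

print(toy_preds)
```

The integer labels themselves are arbitrary: which group is called 0 and which is called 1 can differ between runs or settings, so only the grouping is meaningful.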

In [None]:
model = # instantiate a model with two centers
preds = # fit and predict

In [None]:
# Run this to see your results

def plot_clusters(data, preds, n_clusters):
    plt.figure(figsize = [15,15])

    for k, col in zip(range(n_clusters), plot_colors[:n_clusters]):
        my_members = (preds == k)
        plt.scatter(data['5k_Time'][my_members], data['Raw_Test_Score'][my_members], s = 64, c = col)

    plt.xlabel('5k Completion Time (min)', size = 30)
    plt.xticks(np.arange(15, 30+5, 5), fontsize = 30)
    plt.ylabel('Test Score (raw)', size = 30)
    plt.yticks(np.arange(200, 800+200, 200), fontsize = 30)

    ax = plt.gca()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    for side in ax.spines.values():
        side.set_linewidth(2)
    ax.tick_params(width = 2)
    
plot_clusters(data, preds, 2)

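`3.` Now add two new columns to your `data` dataframe. The first is `test_scaled`, which you can create by subtracting the mean test score and dividing by the standard deviation of the test scores. The second column to create is `5k_time_sec`, which should convert the times from minutes to seconds.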
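
Both transformations are simple column arithmetic in pandas. Here is a minimal sketch on made-up values, where the dataframe `toy` and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical toy data; NOT the notebook's `data`.
toy = pd.DataFrame({
    'score': [400.0, 500.0, 600.0],
    'time_min': [20.0, 22.5, 30.0],
})

# Standardize: subtract the mean, then divide by the standard deviation.
toy['score_scaled'] = (toy['score'] - toy['score'].mean()) / toy['score'].std()

# Unit change: minutes to seconds.
toy['time_sec'] = toy['time_min'] * 60

print(toy)
```

Note that pandas `Series.std()` uses the sample standard deviation (`ddof=1`) by default, while `np.std` and scikit-learn's `StandardScaler` use the population version (`ddof=0`), so the scaled values differ slightly between the two approaches.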

In [None]:
# your work here
data['test_scaled'] = # standardized test scores
data['5k_time_sec'] = # times in seconds

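`4.` Now, as in question 2, instantiate a kmeans `model` with two cluster centers, this time working with the two new columns from question 3. Use your model to `fit` and `predict` the cluster of each point in your dataset. Store the predictions in the variable `preds`. If you created the model and predictions correctly, you should see a right (blue) cluster and a left (pink) cluster.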

In [None]:
model = # instantiate a model with two centers
preds = # fit and predict

In [None]:
# Run this to see your results
plot_clusters(data, preds, 2)

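`5.` Match the variable that best describes how you should think about feature scaling for algorithms that use distance-based metrics or regularization.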

In [None]:
# options
a = 'We should always use normalizing'
b = 'We should always scale our variables between 0 and 1.'
c = 'Variable scale will frequently influence your results, so it is important to standardize for all of these algorithms.'
d = 'Scaling will not change the results of your output.'

best_option = # best answer variable here


# check your answer against ours here
t.check_q5(best_option)

### If you need the solution, click the orange Jupyter icon and find the Feature Scaling Example - Solution file.