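### 1. Project Recap
In the previous model we used a single feature to make predictions. In practice, a single feature cannot capture what drives prices in the market.
There are two ways to improve accuracy:
+ Increase the number of attributes used to compute similarity
+ Increase the number of nearest neighbors (k)
In this mission we focus on increasing the number of attributes used to compute similarity.
Before computing distances, we need to remove the columns that do not work in a distance calculation:
+ Non-numerical columns (e.g. city, region)
+ Columns with missing values
+ Non-ordinal values (e.g. latitude, longitude)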
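
As a rough illustration of the first two checks, the sketch below flags non-numerical columns and columns with missing values. The `toy` DataFrame and its values are made up for the example and only stand in for the real listings table.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the listings table; the values are made up.
toy = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "bedrooms": [1.0, np.nan, 1.0],
    "city": ["Washington", "Washington", "Washington"],
    "price": [125.0, 85.0, 50.0],
})

# Columns whose dtype is not numeric cannot enter a distance calculation.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Columns that contain at least one missing value.
has_missing = [col for col in toy.columns if toy[col].isnull().any()]

print(non_numeric)
print(has_missing)
```

Non-ordinal columns such as latitude and longitude are numeric, so a dtype check like this does not catch them. They have to be identified by what they mean.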

In [2]:
import pandas as pd
import numpy as np
np.random.seed(1)

dc_listings = pd.read_csv('dc_airbnb.csv')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
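# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
dc_listings['price'] = dc_listings['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)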

In [3]:
dc_listings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 19 columns):
host_response_rate      3289 non-null object
host_acceptance_rate    3109 non-null object
host_listings_count     3723 non-null int64
accommodates            3723 non-null int64
room_type               3723 non-null object
bedrooms                3702 non-null float64
bathrooms               3696 non-null float64
beds                    3712 non-null float64
price                   3723 non-null float64
cleaning_fee            2335 non-null object
security_deposit        1426 non-null object
minimum_nights          3723 non-null int64
maximum_nights          3723 non-null int64
number_of_reviews       3723 non-null int64
latitude                3723 non-null float64
longitude               3723 non-null float64
city                    3723 non-null object
zipcode                 3714 non-null object
state                   3723 non-null object
dtypes: float64(6), int64(5), objec

### 2. Removing Features
Non-numerical columns:
+ room_type
+ city
+ state

Non-ordinal values:

+ latitude
+ longitude
+ zipcode

Columns unrelated to pricing (they describe the host, not the listing):

+ host_response_rate
+ host_acceptance_rate
+ host_listings_count


In [4]:
drop_list = ['room_type', 'city', 'state', 'latitude', 'longitude', 'zipcode', 'host_response_rate', 'host_acceptance_rate', 'host_listings_count']

dc_listings = dc_listings.drop(drop_list, axis='columns')

In [5]:
dc_listings.isnull().sum()

accommodates            0
bedrooms               21
bathrooms              27
beds                   11
price                   0
cleaning_fee         1388
security_deposit     2297
minimum_nights          0
maximum_nights          0
number_of_reviews       0
dtype: int64

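### 3. Handling Missing Values
bedrooms, bathrooms, and beds are each missing in fewer than 1% of rows, so dropping those rows has little impact.

cleaning_fee and security_deposit, on the other hand, are missing in 37.3% and 61.7% of rows (1388 and 2297 of 3723). Too much is missing to repair, so it is best to drop both columns.

After that, drop every remaining row that contains a missing value.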
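
The missing-value percentages can be checked from the counts printed earlier. The sketch below is a small arithmetic check that uses only the numbers shown in the `info()` and `isnull().sum()` outputs. The variable names are chosen for this example.

```python
import pandas as pd

total_rows = 3723  # row count reported by dc_listings.info()

# Missing-value counts copied from the isnull().sum() output above.
missing_counts = pd.Series({
    "bedrooms": 21,
    "bathrooms": 27,
    "beds": 11,
    "cleaning_fee": 1388,
    "security_deposit": 2297,
})

# Share of rows with a missing value, as a percentage.
missing_pct = (missing_counts / total_rows * 100).round(1)
print(missing_pct)
```

On the actual DataFrame, `dc_listings.isnull().mean() * 100` should give the same percentages directly.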


In [6]:
drop_missings = ['cleaning_fee', 'security_deposit']
dc_listings = dc_listings.drop(drop_missings, axis='columns')

In [7]:
dc_listings.isnull().sum()

accommodates          0
bedrooms             21
bathrooms            27
beds                 11
price                 0
minimum_nights        0
maximum_nights        0
number_of_reviews     0
dtype: int64

In [8]:
dc_listings = dc_listings.dropna(axis='index')


In [9]:
dc_listings.isnull().sum()

accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64

### 4. Normalization
`normalized_listings = (dc_listings - dc_listings.mean()) / (dc_listings.std())`


In [10]:
# Normalize every column: subtract the mean, divide by the standard deviation
normalized_listings = (dc_listings - dc_listings.mean()) / (dc_listings.std())

In [11]:
normalized_listings.head()

| |accommodates|bedrooms|bathrooms|beds|price|minimum_nights|maximum_nights|number_of_reviews|
|---|---|---|---|---|---|---|---|---|
|574|-0.596544|-0.249467|-0.439151|-0.546858|-0.173345|-0.341375|-0.016604|4.57965|
|1593|-0.596544|-0.249467|0.412923|-0.546858|-0.464148|-0.341375|-0.016603|1.159275|
|3091|-1.095499|-0.249467|-1.291226|-0.546858|-0.718601|-0.341375|-0.016573|-0.482505|
|420|-0.596544|-0.249467|-0.439151|-0.546858|0.437342|0.487635|-0.016584|-0.448301|
|808|4.393004|4.507903|1.264998|2.829956|0.480962|-0.065038|-0.016553|0.646219|


In [12]:
# Put the original (unnormalized) price back into the normalized DataFrame
normalized_listings['price'] = dc_listings['price']

normalized_listings.mean()
normalized_listings.head()

| |accommodates|bedrooms|bathrooms|beds|price|minimum_nights|maximum_nights|number_of_reviews|
|---|---|---|---|---|---|---|---|---|
|574|-0.596544|-0.249467|-0.439151|-0.546858|125.0|-0.341375|-0.016604|4.57965|
|1593|-0.596544|-0.249467|0.412923|-0.546858|85.0|-0.341375|-0.016603|1.159275|
|3091|-1.095499|-0.249467|-1.291226|-0.546858|50.0|-0.341375|-0.016573|-0.482505|
|420|-0.596544|-0.249467|-0.439151|-0.546858|209.0|0.487635|-0.016584|-0.448301|
|808|4.393004|4.507903|1.264998|2.829956|215.0|-0.065038|-0.016553|0.646219|


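### Why Normalize
Columns such as accommodates, bedrooms, and beds fall within a small range (for example 0 to 12), while maximum_nights spans a much larger one. If they enter the distance calculation on their raw scales, the large-range column dominates the result.
We therefore normalize every column. For each column, compute the mean $\mu$ and the standard deviation $\sigma$, then transform every value $x$ in that column:

$$z = \frac{x - \mu}{\sigma}$$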
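
A minimal sketch of this transformation on made-up data follows. The `toy` DataFrame is hypothetical. It only illustrates that two columns on very different scales end up with mean 0 and standard deviation 1.

```python
import pandas as pd

# Hypothetical columns on very different scales; the values are made up.
toy = pd.DataFrame({
    "accommodates": [1, 2, 4, 6],
    "maximum_nights": [30, 365, 1125, 2000],
})

# z-score: subtract each column's mean, divide by its standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
standardized = (toy - toy.mean()) / toy.std()

print(standardized.mean().round(6))  # each column is centred on 0
print(standardized.std().round(6))   # each column has unit spread
```

Strictly speaking this is standardization (z-scoring). After it, a difference of 1 means "one standard deviation" in every column, so no column outweighs the others merely because of its units.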

### 5. Euclidean Distance with Multiple Features
```python
from scipy.spatial import distance
first_listing = [-0.596544, -0.439151]
second_listing = [-0.596544, 0.412923]
dist = distance.euclidean(first_listing, second_listing)

```

In [13]:
from scipy.spatial import distance 

first_listing = normalized_listings.iloc[0][['accommodates', 'bathrooms']]
fifth_listing = normalized_listings.iloc[4][['accommodates', 'bathrooms']]

first_fifth_distance = distance.euclidean(first_listing, fifth_listing)

first_fifth_distance

5.272543124668404
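
To see what `distance.euclidean` computes, the same distance can be reproduced by hand. The sketch below copies the normalized accommodates and bathrooms values of the first and fifth rows from the table printed earlier. The variable names are chosen for this example.

```python
import numpy as np

# Normalized (accommodates, bathrooms) values of the first and fifth rows,
# copied from the normalized_listings.head() output above.
first = np.array([-0.596544, -0.439151])
fifth = np.array([4.393004, 1.264998])

# Euclidean distance: square root of the summed squared differences.
manual = np.sqrt(np.sum((first - fifth) ** 2))

print(round(float(manual), 4))
```

The result agrees with the SciPy value above to about six decimal places. The small remaining difference comes from the printed table values being rounded.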

### 6. Introduction to Scikit-Learn
Scikit-learn implements the major machine learning algorithms. Using it follows one consistent workflow.
The workflow has 4 main steps:
+ 1. Instantiate the specific machine learning model you want to use
+ 2. Fit the model to the training data
+ 3. Use the model to make predictions
+ 4. Evaluate the accuracy of the predictions

In scikit-learn every model is implemented as a class, so the first step is to pick the class for the algorithm we want, here the KNeighborsRegressor class.
A model that predicts a continuous numerical value, such as a price, is called a regression model. A model that predicts a discrete label is called a classification model.
Docs: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor
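
The four steps can be sketched end to end on a tiny made-up dataset. The arrays below are hypothetical and unrelated to the Airbnb data. They are small enough that the prediction can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: one feature, target is 10x the feature.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 10.0, 20.0, 30.0])
X_test = np.array([[1.4]])
y_test = np.array([14.0])

# 1. Instantiate the model.
knn = KNeighborsRegressor(n_neighbors=2, algorithm="brute")
# 2. Fit it to the training data.
knn.fit(X_train, y_train)
# 3. Predict: the mean target of the 2 nearest training points (x=1 and x=2).
predictions = knn.predict(X_test)
# 4. Evaluate the error of the predictions.
mse = mean_squared_error(y_test, predictions)

print(predictions, mse)
```

With 2 neighbors the prediction for x = 1.4 is the average of the targets at x = 1 and x = 2, that is 15, so the squared error against the true value 14 is 1.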

In [14]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
# Default parameters, for reference:
# KNeighborsRegressor(
#     n_neighbors=5,
#     weights='uniform',
#     algorithm='auto',
#     leaf_size=30,
#     p=2,
#     metric='minkowski',
#     metric_params=None,
#     n_jobs=None,
#     **kwargs,
# )

In [15]:
# 1. Instantiate the model
knn = KNeighborsRegressor(algorithm='brute')

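#### 2. Fitting the Model to the Training Data
For almost every model, the fit method takes 2 required arguments:
+ a matrix-like object containing the feature columns of the training set
+ a list-like object containing the correct target values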


In [22]:
from sklearn.neighbors import KNeighborsRegressor
# 1. Pick the algorithm and instantiate the model object
knn = KNeighborsRegressor(algorithm='brute')
# 2. Split the data into a training set and a test set
train_df = normalized_listings.iloc[:2792]
test_df = normalized_listings.iloc[2792:]
# 3. Choose the feature columns and the target column
feature_columns = ['accommodates', 'bathrooms']
target_column = ['price']
# 4. Fit the model to the training data
knn.fit(train_df[feature_columns], train_df[target_column])
# 5. Make predictions on the test set
predictions = knn.predict(test_df[feature_columns])
predictions[:10]

array([[ 80.8],
       [251.2],
       [ 89.4],
       [ 80.8],
       [ 80.8],
       [ 80.8],
       [189.8],
       [167.8],
       [167.8],
       [199. ]])
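
To make the prediction step less of a black box, here is a simplified sketch of what a brute-force k-nearest-neighbors regressor with uniform weights does for one query point. `knn_predict` is a hypothetical helper written for this example and is not part of scikit-learn. The training values are made up.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=5):
    """Average the targets of the k training rows closest to `query`."""
    # Euclidean distance from the query to every training row (brute force).
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    return train_y[nearest].mean()

# Hypothetical normalized features and prices; the values are made up.
train_X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [3.0, 3.0], [3.1, 2.9]])
train_y = np.array([80.0, 90.0, 100.0, 300.0, 320.0])

estimate = knn_predict(train_X, train_y, np.array([0.05, 0.05]), k=3)
print(estimate)
```

The three closest rows are the first three, so the estimate is the mean of 80, 90, and 100. This sketch does not try to match how scikit-learn breaks ties between equally distant neighbors.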

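### 8. Computing MSE (Mean Squared Error) and RMSE (Root Mean Squared Error)
MSE and RMSE can be computed by hand with pandas, or with the `mean_squared_error` function from `sklearn.metrics`.

mean_squared_error takes 2 required arguments:
+ Argument 1: a list-like object containing the true values
+ Argument 2: a list-like object containing the predicted values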
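
For reference, both metrics are short enough to compute by hand. The sketch below uses made-up prices. `actual` and `predicted` are hypothetical arrays, not values from the listings data.

```python
import numpy as np

# Hypothetical actual and predicted prices; the values are made up.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# MSE: mean of the squared differences. RMSE: its square root,
# which is expressed in the same units as the target (dollars here).
mse = np.mean((actual - predicted) ** 2)
rmse = np.sqrt(mse)

print(round(float(mse), 2), round(float(rmse), 2))
```

Both arrays here are one-dimensional. When doing this by hand with NumPy, subtracting a column vector of shape (n, 1) from a flat array of shape (n,) broadcasts to an n-by-n matrix, so the shapes should be made to match first.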


In [24]:
from sklearn.metrics import mean_squared_error
import numpy as np
mse = mean_squared_error(test_df['price'], predictions)
rmse = np.sqrt(mse)
# Built-in round() works whether mse is a NumPy scalar or a plain Python float
print(round(mse, 2), round(rmse, 2))

15600.51 124.9


### 9. Using More Features
The errors obtained with each feature set so far:

|feature(s)|MSE|RMSE|
|---|---|---|
|accommodates|18646.5|136.6|
|bathrooms|17333.4|131.7|
|accommodates, bathrooms|15660.4|125.1|

Using 2 features together gives a lower error than using either one alone.
Next we train with 4 features:
+ accommodates
+ bedrooms
+ bathrooms
+ number_of_reviews

In [30]:
from sklearn.neighbors import KNeighborsRegressor
# 1. Pick the algorithm and instantiate the model object
knn = KNeighborsRegressor(algorithm='brute', n_neighbors=5)
# 2. Split the data into a training set and a test set
train_df = normalized_listings.iloc[:2792]
test_df = normalized_listings.iloc[2792:]
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
target = ['price']
knn.fit(train_df[features], train_df[target])
predictions = knn.predict(test_df[features])
mse = mean_squared_error(test_df[target], predictions)
rmse = np.sqrt(mse)
print(mse, rmse)

13322.432400455064 115.42284176217056



|feature(s)|MSE|RMSE|
|---|---|---|
|accommodates|18646.5|136.6|
|bathrooms|17333.4|131.7|
|accommodates, bathrooms|15660.4|125.1|
|accommodates, bathrooms, bedrooms, number_of_reviews|13322.4|115.4|

### 10. Using All Features
|feature(s)|MSE|RMSE|
|---|---|---|
|accommodates|18646.5|136.6|
|bathrooms|17333.4|131.7|
|accommodates, bathrooms|15660.4|125.1|
|accommodates, bathrooms, bedrooms, number_of_reviews|13322.4|115.4|
|all features|15455.3|124.3|

With all features the error goes back up (RMSE 124.3 versus 115.4 with 4 features), which shows how much choosing the right features matters.

The process of selecting which features to use in a model is known as feature selection.

In [31]:
from sklearn.neighbors import KNeighborsRegressor
# 1. Pick the algorithm and instantiate the model object
knn = KNeighborsRegressor(algorithm='brute', n_neighbors=5)
# 2. Split the data into a training set and a test set
train_df = normalized_listings.iloc[:2792]
test_df = normalized_listings.iloc[2792:]
features = train_df.columns.tolist()
features.remove('price')
target = ['price']
knn.fit(train_df[features], train_df[target])
predictions = knn.predict(test_df[features])
mse = mean_squared_error(test_df[target], predictions)
rmse = np.sqrt(mse)
print(mse, rmse)

15455.275631399316 124.31924883701363
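
The feature-set comparison in the tables above could be automated with a loop. The sketch below is one possible way to do it, run on synthetic data. The `data` DataFrame, its `noise` column, and the price formula are all made up for illustration and only stand in for `normalized_listings`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-in for normalized_listings; the values are random.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(200, 3)),
                    columns=["accommodates", "bathrooms", "noise"])
# Price depends on two of the columns; "noise" carries no signal.
data["price"] = 100 + 40 * data["accommodates"] + 20 * data["bathrooms"]

train_df, test_df = data.iloc[:150], data.iloc[150:]

feature_sets = [
    ["accommodates"],
    ["accommodates", "bathrooms"],
    ["accommodates", "bathrooms", "noise"],
]

results = {}
for features in feature_sets:
    # A fresh model per feature set, with the same settings as in the cells above.
    knn = KNeighborsRegressor(n_neighbors=5, algorithm="brute")
    knn.fit(train_df[features], train_df["price"])
    predictions = knn.predict(test_df[features])
    mse = mean_squared_error(test_df["price"], predictions)
    results[", ".join(features)] = (mse, np.sqrt(mse))

for name, (mse, rmse) in results.items():
    print(f"{name}: MSE={mse:.1f}, RMSE={rmse:.1f}")
```

Passing the target as a Series (`train_df["price"]`) rather than a one-column DataFrame makes `predict` return a flat array. Swapping in `normalized_listings` and the real column names would reproduce the kind of comparison shown in the tables, although the exact numbers depend on the data and the split.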
