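## Requirements
Facebook created an artificial world covering a 10 km × 10 km square (`x: 0~10`, `y: 0~10`). The task is to predict which place a person would like to check in to. The source file train.csv has more than 29 million rows, which is a lot of data, so only the rows with `x: 2~2.5`, `y: 2~2.5` (more than 70,000) are extracted into train_simple.csv.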
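
The extraction step itself is not shown in this notebook. A minimal sketch of how such a subset could be cut out with pandas follows; the helper name `extract_region`, the inclusive bounds, and the tiny demo frame are assumptions made for illustration, not the original extraction script.

```python
import pandas as pd


def extract_region(frame, x_min=2.0, x_max=2.5, y_min=2.0, y_max=2.5):
    """Keep only check-ins whose coordinates fall inside the given box.

    Hypothetical helper; bounds are treated as inclusive (an assumption).
    """
    mask = frame["x"].between(x_min, x_max) & frame["y"].between(y_min, y_max)
    return frame[mask]


# Tiny hand-made frame standing in for train.csv
demo = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "x": [2.1, 5.0, 2.4, 2.6],
    "y": [2.3, 2.2, 1.9, 2.4],
})
subset = extract_region(demo)
print(subset)
```

For the full 29-million-row file, the same helper could be applied chunk by chunk (`pd.read_csv(..., chunksize=...)`) so the whole file never has to sit in memory at once.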

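### Fields

- row_id: row number, primary key
- x: horizontal coordinate (longitude)
- y: vertical coordinate (latitude)
- accuracy: accuracy of the location reading
- time: timestamp
- place_id: ID of the place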

### Features
Target: place_id

- x, y (both range over 0~10)
- accuracy: accuracy of the location data
- year
- month
- day
- hour


In [1]:
import os
import pandas as pd
%matplotlib
%matplotlib inline

STATIC_PATH = '../statics'

Using matplotlib backend: MacOSX


## Loading the data

In [2]:
df = pd.read_csv(os.path.join(STATIC_PATH, 'train_simple.csv'))

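## Data processing and dataset split
### Data processing
Split the timestamp into year, month, week, day, and hour.

### Dataset split
Training set : test set = 8 : 2. Note that `train_test_split` holds out 25% of the rows by default, so `test_size=0.2` has to be passed explicitly to get this ratio.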


In [3]:
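# Data processing: derive calendar features from the `time` column.
# NOTE: pd.DatetimeIndex treats plain integers as nanoseconds since the Unix
# epoch, so these small values all map to 1970-01-01 00:00 (which is why the
# year/week/day/hour columns in the output further down are constant).
ts = pd.DatetimeIndex(df['time'])
df['year'] = ts.year
df['month'] = ts.month
df['week'] = ts.week  # removed in pandas 2.0; use ts.isocalendar().week.to_numpy() there
df['day'] = ts.day
df['hour'] = ts.hour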
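
Because the integers in `time` are read as nanoseconds, the derived columns carry no information. One possible fix is to state the unit explicitly. The sketch below assumes the timestamps are in seconds, which the source does not confirm; `add_time_features` and the demo frame are hypothetical names used only for illustration.

```python
import pandas as pd


def add_time_features(frame, unit="s"):
    """Return a copy of `frame` with calendar columns derived from `time`.

    `unit` is an assumption: the dataset does not document the time unit.
    """
    out = frame.copy()
    ts = pd.to_datetime(out["time"], unit=unit)
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    # isocalendar() replaces the removed `.week` attribute (pandas >= 1.1)
    out["week"] = ts.dt.isocalendar().week.astype("int64")
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    return out


# Three `time` values taken from the table shown in this notebook
demo = pd.DataFrame({"time": [669737, 234719, 502343]})
features = add_time_features(demo)
print(features)
```

With an explicit unit the `day` and `hour` columns start to vary across rows, so they can actually contribute to the distance computation in KNN.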

In [4]:
# Count check-ins per place so that unpopular places can be filtered out
place_count = df.groupby('place_id').count()['row_id']
place_count

place_id
1006234733      1
1008823061      4
1012580558      3
1025585791     21
1026507711    220
             ... 
9986101718      1
9993141712      1
9995108787     23
9998968845     99
9999851158      3
Name: row_id, Length: 2524, dtype: int64

In [5]:
# Boolean filter: keep only places with more than 5 check-ins
place_bool = df['place_id'].isin(place_count[place_count>5].index.values)
df = df[place_bool]
df = df[['x', 'y', 'accuracy', 'time', 'place_id', 'year', 'week', 'day', 'hour']]
df

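| | x | y | accuracy | time | place_id | year | week | day | hour |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.1663 | 2.3755 | 84 | 669737 | 3869813743 | 1970 | 1 | 1 | 0 |
| 1 | 2.3695 | 2.2034 | 3 | 234719 | 2636621520 | 1970 | 1 | 1 | 0 |
| 2 | 2.3236 | 2.1768 | 66 | 502343 | 7877745055 | 1970 | 1 | 1 | 0 |
| 3 | 2.2613 | 2.3392 | 73 | 319822 | 9775192577 | 1970 | 1 | 1 | 0 |
| 4 | 2.3331 | 2.0011 | 66 | 595084 | 6731326909 | 1970 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71659 | 2.0804 | 2.0657 | 168 | 217886 | 1247398579 | 1970 | 1 | 1 | 0 |
| 71660 | 2.4309 | 2.4646 | 33 | 314360 | 1951613663 | 1970 | 1 | 1 | 0 |
| 71661 | 2.1797 | 2.1707 | 89 | 74954 | 4724115005 | 1970 | 1 | 1 | 0 |
| 71662 | 2.3924 | 2.2704 | 62 | 206257 | 2819110495 | 1970 | 1 | 1 | 0 |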
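
The count-then-filter steps in cells 4 and 5 can also be folded into a single operation. The following is a sketch of one way to do it with `groupby(...).transform("size")`; the helper name `drop_rare_places` and the toy data are assumptions, and the threshold mirrors the `> 5` used above.

```python
import pandas as pd


def drop_rare_places(frame, min_checkins=5):
    """Keep rows whose place_id occurs more than `min_checkins` times."""
    # transform("size") broadcasts each group's row count back onto its rows
    sizes = frame.groupby("place_id")["place_id"].transform("size")
    return frame[sizes > min_checkins]


# Toy data: place 1 has 6 check-ins, place 2 has 5, place 3 has 1
demo = pd.DataFrame({
    "place_id": [1] * 6 + [2] * 5 + [3] * 1,
    "x": range(12),
})
kept = drop_rare_places(demo)
print(kept["place_id"].value_counts())
```

Only place 1 survives here, since the comparison is strict: a place with exactly 5 check-ins is dropped, as in the original filter.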


In [6]:
# Dataset split
from sklearn.model_selection import train_test_split


x = df[['x', 'y', 'accuracy', 'time', 'year', 'week', 'day', 'hour']]
y = df['place_id']
# No test_size is passed here, so the default 75 : 25 split is used
x_train, x_test, y_train, y_test = train_test_split(x, y)
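
To get the 8 : 2 ratio stated earlier, `test_size` has to be set explicitly. A minimal sketch on synthetic data follows; the array names and the `random_state` value are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the place_id labels
rng = np.random.default_rng(0)
features = rng.random((100, 2))
labels = rng.integers(0, 3, size=100)

# test_size=0.2 gives the 8 : 2 split; random_state makes it reproducible
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
print(x_tr.shape, x_te.shape)
```

Fixing `random_state` is optional, but without it the scores reported later change from run to run.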

## Feature engineering

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV


transform = StandardScaler()
x_train = transform.fit_transform(x_train)
x_test = transform.transform(x_test)

x_train

array([[ 1.40483666,  1.06974136,  0.55710852, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.06953376, -1.82071584, -0.54367495, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.36957663, -0.80613617, -0.56101013, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.98969974, -1.23824492, -0.46566668, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.46013406,  0.41354918, -0.43966392, ...,  0.        ,
         0.        ,  0.        ],
       [-0.27034724, -0.22074559, -0.18830391, ...,  0.        ,
         0.        ,  0.        ]])
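
The trailing zeros in the array above are what `StandardScaler` produces for columns that hold a single constant value, which is the case for the year/week/day/hour columns here. A small sketch of the transformation, using made-up numbers, shows both the usual `(x - mean) / std` behaviour and the constant-column case.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First column varies, second column is constant (like `year` == 1970)
demo = np.array([
    [2.1, 1970.0],
    [2.3, 1970.0],
    [2.5, 1970.0],
])

scaled = StandardScaler().fit_transform(demo)

# Manual z-score for the first column (population std, ddof=0)
manual = (demo[:, 0] - demo[:, 0].mean()) / demo[:, 0].std()

print(scaled)
print(manual)
```

A constant column has zero variance, so the scaler leaves it at zero after centering. Such columns add nothing to the distances KNN computes.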

## KNN estimator workflow

In [8]:
estimator = KNeighborsClassifier()
param_dict = {'n_neighbors': [1, 3, 5, 7]}
estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
estimator.fit(x_train, y_train)



GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None, param_grid={'n_neighbors': [1, 3, 5, 7]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
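
In the cells above the scaler is fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling statistics. One common alternative is to wrap the scaler and the classifier in a `Pipeline` so that scaling is refitted inside every fold. The sketch below uses synthetic two-cluster data; the step names `scale` and `knn` are arbitrary choices, and this is not the notebook's own setup.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two well-separated synthetic clusters standing in for real check-ins
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 2)),
    rng.normal(4.0, 1.0, size=(60, 2)),
])
labels = np.array([0] * 60 + [1] * 60)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [1, 3, 5, 7]}, cv=3)
search.fit(features, labels)

print(search.best_params_)
print(search.best_score_)
```

The fitted `search` object can be used exactly like the `estimator` in this notebook (`score`, `best_params_`, `cv_results_`), and it applies the scaler automatically at prediction time.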

## Model selection and tuning

To be added.

## Model evaluation

In [9]:
print('Train score: \n', estimator.score(x_train, y_train))
print('Test score: \n', estimator.score(x_test, y_test))

# Best parameters
print('Best parameters: \n', estimator.best_params_)
# Best cross-validation score
print('Best score: \n', estimator.best_score_)
# Best estimator
print('Best estimator: \n', estimator.best_estimator_)
# Cross-validation results
print('Cross-validation results: \n', estimator.cv_results_)

Train score: 
 0.5704447297770515
Test score: 
 0.4590288315629742
Best parameters: 
 {'n_neighbors': 7}
Best score: 
 0.44315396288082176
Best estimator: 
 KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                     weights='uniform')
Cross-validation results: 
 {'mean_fit_time': array([0.12838642, 0.16337442, 0.31869467, 0.17586374]), 'std_fit_time': array([0.07997169, 0.11918781, 0.09625784, 0.08802441]), 'mean_score_time': array([ 3.27969201, 10.83174483, 12.32311583,  6.78521125]), 'std_score_time': array([0.69300366, 4.666079  , 3.65715265, 2.29339767]), 'param_n_neighbors': masked_array(data=[1, 3, 5, 7],
             mask=[False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'n_neighbors': 1}, {'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 7}], 'split0_test_score': array([0.37836439, 0.4021164 , 0.42995169, 0.44174143]), 'split1_test_score': array([0.37292092, 0.397082  , 0.4