### facebook-v-predicting-check-ins
train.csv, test.csv  
- row_id: 签到事件 id
- xy: 坐标
- accuracy: 定位准确度
- time: 时间戳
- place_id: 业务 id, 需要预测的值


In [1]:
import pandas as pd


In [2]:
# read data
test = pd.read_csv("../datasets/facebook-v-predicting-check-ins/test.csv")
train = pd.read_csv("../datasets/facebook-v-predicting-check-ins/train.csv")


In [3]:
test

Unnamed: 0,row_id,x,y,accuracy,time
0,0,0.1675,1.3608,107,930883
1,1,7.3909,2.5301,35,893017
2,2,8.0978,2.3473,62,976933
3,3,0.9990,1.0591,62,907285
4,4,0.6670,9.7254,40,914399
...,...,...,...,...,...
8890734,8830682,1.6438,4.1304,261,916230
8890735,8830683,6.5411,9.1068,247,870576
8890736,8830684,1.3400,2.0333,152,891652
8890737,8830685,0.0654,4.1824,148,996347


In [4]:
train

Unnamed: 0,row_id,x,y,accuracy,time,place_id
0,0,0.7941,9.0809,54,470702,8523065625
1,1,5.9567,4.7968,13,186555,1757726713
2,2,8.3078,7.0407,74,322648,1137537235
3,3,7.3665,2.5165,65,704587,6567393236
4,4,4.0961,1.1307,31,472130,7440663949
...,...,...,...,...,...,...
29118016,29118016,6.5133,1.1435,67,399740,8671361106
29118017,29118017,5.9186,4.4134,67,125480,9077887898
29118018,29118018,2.9993,6.3680,67,737758,2838334300
29118019,29118019,4.0637,8.0061,70,764975,1007355847


In [5]:
# data process
# target: 
#        1. 缩小数据范围
#           2 < x < 2.5
#           1 < y < 1.5
#        2. time -> 年月日时分秒
#        3. 过滤签到次数少的地点


# 1. 缩小数据范围
# smallTrain = train[(train["x"]>2)&(train["x"]<2.5)&(train["y"]>1)&(train["y"]<1.5)]
smallTrain = train.query("x < 2.5 & x > 2 & y < 1.5 & y > 1")

# 不缩小范围，试了下，数据量太大了
# smallTrain = train


In [6]:
# 2. time -> 年月日时分秒
time_val = pd.to_datetime(smallTrain['time'],unit='s')
date_val = pd.DatetimeIndex(time_val)
smallTrain["day"] = date_val.day
smallTrain["hour"] = date_val.hour
smallTrain["weekday"] = date_val.weekday

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [7]:
# 3. 过滤签到次数少的地点

# 签到次数，只保留 row_id 和次数
place_count = smallTrain.groupby("place_id").count()["row_id"]

# 获取签到 >3 的地点的row_id
place = place_count[place_count > 3].index.values


# 在 train 中按签到 >3 的 row_id 过滤
smallTrain = smallTrain[smallTrain["place_id"].isin(place)]
train_final = smallTrain
train_final.head()

Unnamed: 0,row_id,x,y,accuracy,time,place_id,day,hour,weekday
112,112,2.236,1.3655,66,623174,7663031065,8,5,3
367,367,2.4108,1.3213,74,579667,6644108708,7,17,2
874,874,2.0822,1.1973,320,143566,3229876087,2,15,4
1022,1022,2.016,1.1659,65,207993,3244363975,3,9,5
1045,1045,2.3859,1.166,498,503378,6438240873,6,19,1


In [8]:
# 筛选特征值和目标值
x = train_final[["x","y","accuracy","day","weekday","hour"]]
y = train_final["place_id"]


In [9]:
x.head()

Unnamed: 0,x,y,accuracy,day,weekday,hour
112,2.236,1.3655,66,8,3,5
367,2.4108,1.3213,74,7,2,17
874,2.0822,1.1973,320,2,4,15
1022,2.016,1.1659,65,3,5,9
1045,2.3859,1.166,498,6,1,19


In [10]:
y.head()

112     7663031065
367     6644108708
874     3229876087
1022    3244363975
1045    6438240873
Name: place_id, dtype: int64

In [11]:
# split dataset
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y)

In [12]:
x_train.head()

Unnamed: 0,x,y,accuracy,day,weekday,hour
29026370,2.1141,1.4762,76,7,2,11
7060797,2.4752,1.4534,79,8,3,8
6502862,2.0742,1.0719,66,7,2,1
23377377,2.187,1.1408,302,5,0,10
2640207,2.393,1.1743,109,7,2,12


In [13]:
# 特征工程：标准化

from sklearn.preprocessing import StandardScaler

std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.fit_transform(x_test)


In [14]:
x_train

array([[-0.86482157,  1.67132282, -0.06318071,  0.65039287, -0.63870053,
        -0.07492163],
       [ 1.64351418,  1.50449517, -0.03665816,  1.02544527, -0.06104757,
        -0.51021937],
       [-1.1419819 , -1.28694115, -0.15158921,  0.65039287, -0.63870053,
        -1.52591408],
       ...,
       [ 1.50875452,  0.32792123, -0.51406404, -0.09971195, -1.79400647,
         0.07017761],
       [ 0.73284451, -0.97670026,  0.12247714,  1.40049768,  0.5166054 ,
         1.23097157],
       [-1.26979519, -0.99865126,  3.19909284,  1.40049768,  0.5166054 ,
        -0.65531861]])

In [15]:
# KNN
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

In [16]:
# 模型选择和调优

from sklearn.model_selection import GridSearchCV
param_dict = {"n_neighbors":range(1,30)}
estimator = GridSearchCV(knn,param_dict,cv=7)
estimator.fit(x_train, y_train)


# 不用网格搜索
# knn.fit(x_train, y_train)





GridSearchCV(cv=7, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': range(1, 30)})

In [18]:
# 模型评估
print("score:\n",estimator.score(x_test,y_test))



score:
 0.3652857425350999
