# Random Forests: Exploring MAC Address Identification

## Background

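The collected MAC addresses have not yet been classified as belonging to mobile phones or not. The goal is to find patterns in the existing data that make it possible to identify whether a collected MAC address belongs to a mobile phone.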

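## Interpreting the raw data
The user MAC address captured by the system. It is the object of this identification study and is denoted MAC.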

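Entry time. The raw data uses the format "yyyy-mm-dd hh:mm:ss". The date is irrelevant to MAC identification, since visitor MACs are collected every day, but the time of day can serve as a feature. The raw value is converted to "hh.mm" format and denoted starttime.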

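Exit time. It has the same format as the entry time and is processed the same way: only hours and minutes are kept, the value is converted to "hh.mm" format, and it is denoted endtime.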
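The timestamp conversion described above can be sketched as follows. This is a minimal illustration rather than the preprocessing actually used: the function name `to_hh_mm` is hypothetical, and it assumes the minutes are written directly after the decimal point (so 13:59 becomes 13.59), which matches the sample values in the data.

```python
from datetime import datetime


def to_hh_mm(timestamp: str) -> float:
    """Convert 'yyyy-mm-dd hh:mm:ss' to a float in 'hh.mm' form (hypothetical helper)."""
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # Drop the date and the seconds; minutes become the two decimal digits.
    return round(t.hour + t.minute / 100, 2)


print(to_hh_mm("2017-03-01 13:59:20"))
print(to_hh_mm("2017-03-01 09:02:00"))
```

Note that this encoding is not a linear time scale: 9.59 and 10.00 are one minute apart but differ by 0.41. Decision trees split on thresholds and depend only on the ordering of values, so this matters less for a random forest than it would for a linear model.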

Dwell time (minutes). It can serve as a feature in its raw form and is denoted station.

Probe MAC. The identifier of the probe that captured the user's MAC address. It is irrelevant to identifying the user MAC.

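Signal strength. A device may report several different signal strengths within the time window. The values are sorted by strength and separated by semicolons. Storing multiple values in a single attribute is awkward for modeling, so the raw signal strength is split into four fields: strongest signal, weakest signal, number of fluctuations, and fluctuation range, denoted strongsignal, weaksignal, times, and diff respectively.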
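A minimal sketch of splitting the signal-strength string into the four fields is shown below. The function name `split_signal` is hypothetical, and two definitions are assumptions: diff is taken as strongsignal minus weaksignal (which agrees with the sample rows in this notebook), and times is taken as the number of readings minus one, since the source does not say how fluctuations are counted.

```python
def split_signal(raw: str) -> dict:
    """Split a semicolon-separated list of signal readings into four fields.

    Hypothetical helper: the definitions of 'times' and 'diff' are assumptions.
    """
    # Accept both ASCII and full-width semicolons; skip empty fragments.
    parts = raw.replace("；", ";").split(";")
    readings = [int(p) for p in parts if p.strip()]
    strong = max(readings)  # values are negative, so the maximum is the strongest
    weak = min(readings)
    return {
        "strongsignal": strong,
        "weaksignal": weak,
        "times": len(readings) - 1,  # assumed: changes between successive readings
        "diff": strong - weak,       # assumed: range between strongest and weakest
    }


print(split_signal("-45;-60;-84"))
```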


In [97]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = pd.read_csv('mac2.csv')
# The MAC address itself is not used as a feature
X.drop(["MAC"], axis=1, inplace=True)
y = X.pop('mobiledevice')

In [110]:
X.describe()

| | starttime | endtime | station | strongsignal | weaksignal | times | diff |
|---|---|---|---|---|---|---|---|
| count | 11448.0 | 11448.0 | 11448.0 | 11448.0 | 11448.0 | 11448.0 | 11448.0 |
| mean | 11.399858 | 13.837407 | 152.670772 | -46.360412 | -65.071541 | 7.31927 | 27.407844 |
| std | 2.64056 | 2.663539 | 157.85166 | 8.564449 | 32.565599 | 7.766342 | 17.149981 |
| min | 9.0 | 9.0 | 5.0 | -59.0 | -98.0 | -71.0 | -61.0 |
| 25% | 9.02 | 11.37 | 25.0 | -52.0 | -86.0 | 1.0 | 17.0 |
| 50% | 10.275 | 14.11 | 89.0 | -49.0 | -80.0 | 4.0 | 30.0 |
| 75% | 13.44 | 16.33 | 240.0 | -42.0 | -66.0 | 13.0 | 39.0 |
| max | 18.0 | 18.0 | 540.0 | -10.0 | 32.0 | 56.0 | 85.0 |
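The summary contains a few values that are hard to reconcile with the field definitions: times has a minimum of -71, diff has a minimum of -61, and weaksignal has a maximum of 32, which is above the maximum of strongsignal (-10). These may point to parsing problems in some rows. The sketch below shows one way to flag such rows. The sample rows are hypothetical (only the column names come from the data), and the rule that diff equals strongsignal minus weaksignal is an assumption.

```python
import pandas as pd

# Hypothetical rows for illustration; the last one mimics the suspicious extremes.
sample = pd.DataFrame({
    "strongsignal": [-45, -30, -59],
    "weaksignal": [-84, -75, 32],
    "times": [20, 19, -71],
    "diff": [39, 45, -61],
})

# A row is suspect if any assumed invariant is violated.
suspect = (
    (sample["times"] < 0)
    | (sample["weaksignal"] > sample["strongsignal"])
    | (sample["diff"] != sample["strongsignal"] - sample["weaksignal"])
)

print(sample[suspect])
```

Applying the same mask to the full feature table would show how many rows are affected before deciding whether to drop or repair them.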


In [99]:
# Takes every remaining column; all of them are numeric here, so no dtype filter is applied
numeric_variables = list(X.dtypes.index)

The classification model uses all variables that remain after the MAC column is removed, i.e.
mobiledevice ~ starttime + endtime + station + strongsignal + weaksignal + times + diff

In [100]:
X[numeric_variables].head()

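| | starttime | endtime | station | strongsignal | weaksignal | times | diff |
|---|---|---|---|---|---|---|---|
| 0 | 9.0 | 10.0 | 58 | -45 | -84 | 20 | 39 |
| 1 | 9.0 | 10.0 | 58 | -30 | -75 | 19 | 45 |
| 2 | 9.0 | 10.0 | 58 | -30 | -85 | 16 | 55 |
| 3 | 9.1 | 10.0 | 49 | -49 | -85 | 4 | 36 |
| 4 | 9.1 | 10.0 | 49 | -42 | -85 | 1 | 43 |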


In [101]:
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)

In [102]:
model.fit(X[numeric_variables], y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=True, random_state=42,
            verbose=0, warm_start=False)

In [103]:
model.oob_score_

0.94671558350803631
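The out-of-bag (OOB) score is the accuracy measured on training samples using only the trees that did not see them in their bootstrap sample, so the value above is an estimate of generalization accuracy rather than training accuracy. As a rough illustration, the sketch below compares the OOB score with accuracy on a held-out split. It uses synthetic data from `make_classification`, not the MAC data, so the numbers it prints say nothing about the model above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with seven features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=1000, n_features=7, random_state=42)
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(X_train, y_train)

# The two estimates are usually close; a large gap would be worth investigating.
print("OOB accuracy:     ", forest.oob_score_)
print("Held-out accuracy:", forest.score(X_hold, y_hold))
```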

In [104]:
data = {'starttime': [13.59],
        'endtime': [15.05],
        'station': [66],
        'strongsignal': [-42],
        'weaksignal': [-76],
        'times': [3],
        'diff': [34]}

In [105]:
x_test = pd.DataFrame(data)

In [106]:
print(x_test)

   diff  endtime  starttime  station  strongsignal  times  weaksignal
0    34    15.05      13.59       66           -42      3         -76


In [107]:
predicted = model.predict(x_test)

In [108]:
print(predicted)

['Y']
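One caveat about the prediction: the printed x_test lists its columns alphabetically (diff, endtime, starttime, ...), while the model was fit on the order starttime, endtime, station, strongsignal, weaksignal, times, diff. Older scikit-learn releases, such as the one that produced the estimator printout above, convert a DataFrame to a plain array and match features by position rather than by name, so the result may have been computed from misaligned features. A minimal sketch of building the test frame in the training order follows; the list name `feature_order` is introduced here for illustration.

```python
import pandas as pd

# Column order used when fitting the model (taken from the head() output).
feature_order = ["starttime", "endtime", "station",
                 "strongsignal", "weaksignal", "times", "diff"]

data = {'starttime': [13.59],
        'endtime': [15.05],
        'station': [66],
        'strongsignal': [-42],
        'weaksignal': [-76],
        'times': [3],
        'diff': [34]}

# Passing columns= fixes the order regardless of how the dict is iterated.
x_test = pd.DataFrame(data, columns=feature_order)
print(list(x_test.columns))
```

With the frame built this way, `model.predict(x_test)` receives each value in the position the model expects. In the notebook itself, `x_test[numeric_variables]` would achieve the same alignment.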
