## Introduction
__Predict whether a user redeems a coupon within 15 days of receiving it in July 2016__
- Input:

| Table1 | Table2 | Table3 | 
| --- | --- | --- | 
| Users' offline purchases and coupon receipts | Users' online clicks/purchases and coupon receipts | Samples for predicting offline O2O coupon usage |

- Output: Table 4 and sample_submission.csv
***
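'Discount_rate' and 'Date_received' are never null independently: either both are null or neither is. When both are null, no coupon was received, so the label can be set to 0.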
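
The claim above can be checked directly. A minimal sketch on a toy frame (the values are illustrative only, not taken from the real file); on the actual data the same comparison would be run on `train_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training table (illustrative values only)
toy = pd.DataFrame({
    "Discount_rate": [np.nan, "150:20", "20:1", np.nan],
    "Date_received": [np.nan, 20160528.0, 20160217.0, np.nan],
    "Date": [20160217.0, np.nan, np.nan, 20160516.0],
})

# True when the two columns are null on exactly the same rows
same_null_pattern = bool(
    (toy["Discount_rate"].isnull() == toy["Date_received"].isnull()).all()
)

# Rows without any coupon; these can be labeled 0 directly
no_coupon_rows = toy[toy["Date_received"].isnull()]
print(same_null_pattern, len(no_coupon_rows))
```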


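- If train_df_new.csv already exists, only run the sections __1 Introduction, 2 Import, 4 Removing null values__
- The cleaned dataset has 6 columns (including the label)
- Later steps mainly work on the new dataset __train_df_new__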

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.tree import DecisionTreeClassifier                                  # decision tree
from sklearn.model_selection import train_test_split                             # train/test split
from datetime import datetime, timedelta                                         # date handling
from chinese_calendar import is_holiday, get_holidays                            # Chinese public holidays
import os                                                                        # check whether a (model) file exists, to avoid overwriting it
import pickle                                                                    # save and load models

%matplotlib inline

## Import training and test data

In [2]:
%%time
train_org = pd.read_csv("../Data/ccf_offline_stage1_train.csv")                    # read_csv returns a DataFrame, not an ndarray
train_df_org = pd.DataFrame(train_org)                                             # redundant wrap; train_org is already a DataFrame
train_df_org.head()

Wall time: 4.42 s


In [3]:
train_df = train_df_org.iloc[:,3:]
train_df.head()

Unnamed: 0,Discount_rate,Distance,Date_received,Date
0,,0.0,,20160217.0
1,150:20,1.0,20160528.0,
2,20:1,0.0,20160217.0,
3,20:1,0.0,20160319.0,
4,20:1,0.0,20160613.0,


In [4]:
test_org = pd.read_csv("../Data/ccf_offline_stage1_test_revised.csv")         # read_csv returns a DataFrame, not an ndarray
test_df_org = pd.DataFrame(test_org)                                             # redundant wrap; test_org is already a DataFrame
test_df_org.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received
0,4129537,450,9983,30:5,1.0,20160712
1,6949378,1300,3429,30:5,,20160706
2,2166529,7113,6928,200:20,5.0,20160727
3,2166529,7113,1808,100:10,5.0,20160727
4,6172162,7605,6500,30:1,2.0,20160708


In [5]:
test_df = test_df_org.iloc[:,3:]
test_df.head()

Unnamed: 0,Discount_rate,Distance,Date_received
0,30:5,1.0,20160712
1,30:5,,20160706
2,200:20,5.0,20160727
3,100:10,5.0,20160727
4,30:1,2.0,20160708


### Data types

In [6]:
train_df['Discount_rate'].unique()

array([nan, '150:20', '20:1', '200:20', '30:5', '50:10', '10:5', '100:10',
       '200:30', '20:5', '30:10', '50:5', '150:10', '100:30', '200:50',
       '100:50', '300:30', '50:20', '0.9', '10:1', '30:1', '0.95',
       '100:5', '5:1', '100:20', '0.8', '50:1', '200:10', '300:20',
       '100:1', '150:30', '300:50', '20:10', '0.85', '0.6', '150:50',
       '0.75', '0.5', '200:5', '0.7', '30:20', '300:10', '0.2', '50:30',
       '200:100', '150:5'], dtype=object)

In [7]:
train_df['Distance'].unique()

array([ 0.,  1., nan,  2., 10.,  4.,  7.,  9.,  3.,  5.,  6.,  8.])

In [8]:
low_len = train_df.shape[0]
low_len

1754884

In [17]:
train_df.isnull().sum()

Discount_rate    701602
Distance         106003
Date_received    701602
Date             977900
dtype: int64

In [20]:
train_df[train_df['Date_received'].isnull() & train_df['Date'].notnull()]

Unnamed: 0,Discount_rate,Distance,Date_received,Date
0,,0.0,,20160217.0
5,,0.0,,20160516.0
11,,0.0,,20160626.0
12,,0.0,,20160519.0
14,,1.0,,20160606.0
21,,0.0,,20160327.0
26,,10.0,,20160115.0
28,,0.0,,20160519.0
29,,0.0,,20160516.0
30,,2.0,,20160114.0


## Data preprocessing

In [9]:
# 'label': Date is within 15 days of Date_received
train_df_new = pd.DataFrame(columns=['discount', 'threshold', 'distance', 'has_holidays', 'is_holiday', 'label'])

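### Processing Discount_rate -> 'discount', 'threshold'

| Discount_rate | Meaning | discount | threshold |
| --- | --- | --- | --- |
| nan | no coupon | 1.0 | 0 |
| 'A:B' | spend A, save B | (A - B) / A | A |
| '0.n' | direct discount rate | float('0.n') | 0 |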

In [10]:
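def process_discount(discountSeries_i):
    """Map one raw Discount_rate value to (discount, threshold)."""
    s = discountSeries_i
    # Missing value: no coupon, so no discount (rate 1.0) and no threshold
    if not isinstance(s, str) and np.isnan(s):
        return float(1), 0
    if ':' in s:
        # 'A:B' means "spend A, save B": the effective rate is (A - B) / A
        l = s.split(':')
        threshold, minus = int(l[0]), int(l[1])
        return (threshold - minus) / threshold, threshold
    elif '.' in s:
        # '0.n' is a direct discount rate without a spending threshold
        return float(s), 0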

- Wall time: 1min 24s

In [11]:
%%time
l_discount, l_threshold = [], []
for i in range(low_len):
    result = process_discount(train_df['Discount_rate'][i])
    l_discount.append(result[0])
    l_threshold.append(result[1])
print(l_discount[:5], l_threshold[:5])

[1.0, 0.8666666666666667, 0.95, 0.95, 0.95] [0, 150, 20, 20, 20]
Wall time: 1min 21s
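
The row-by-row loop above takes over a minute. A vectorized variant might look like the sketch below; `process_discount_vectorized` is a hypothetical name, not part of the original notebook, and it assumes every non-null value is either 'A:B' or a plain decimal string. Unlike `process_discount`, a string with neither ':' nor '.' is parsed as a number here instead of yielding `None`.

```python
import numpy as np
import pandas as pd


def process_discount_vectorized(rates):
    """Hypothetical vectorized counterpart of process_discount."""
    # Capture A and B from 'A:B'; rows that do not match become NaN in both columns
    full = rates.str.extract(r"^(\d+):(\d+)$").astype(float)
    threshold, minus = full[0], full[1]
    is_full = threshold.notna()

    # Plain rates such as '0.9'; everything else is coerced to NaN
    direct = pd.to_numeric(rates.where(~is_full), errors="coerce")

    # (A - B) / A for 'A:B', the plain rate otherwise, 1.0 when no coupon exists
    discount = ((threshold - minus) / threshold).where(is_full, direct).fillna(1.0)
    return pd.DataFrame({
        "discount": discount,
        "threshold": threshold.fillna(0).astype(int),
    })


sample = pd.Series([np.nan, "150:20", "20:1", "0.9"])
converted = process_discount_vectorized(sample)
print(converted)
```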


In [12]:
train_df_new['discount'] = l_discount
train_df_new['threshold'] = l_threshold
train_df_new.head(5)

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday,label
0,1.0,0,,,,
1,0.866667,150,,,,
2,0.95,20,,,,
3,0.95,20,,,,
4,0.95,20,,,,


### Processing distance

In [13]:
train_df_new['distance'] = train_df['Distance']
train_df_new.head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday,label
0,1.0,0,0.0,,,
1,0.866667,150,1.0,,,
2,0.95,20,0.0,,,
3,0.95,20,0.0,,,
4,0.95,20,0.0,,,


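### Processing Date_received
- is_holiday: whether the receipt date itself is a public holiday or a weekend
- has_holidays: whether any public holiday (weekends excluded) falls within the 15 days after the receipt date
***
The label depends on whether the gap between Date_received and Date is less than 15 days.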

In [14]:
def process_date(dateSeries_i):
    s = dateSeries_i
    if np.isnan(s):
        return  False, False
    else:
        d = datetime.strptime(str(int(s)), '%Y%m%d')                                     # date
        d_e = d + timedelta(days = 15)
        return get_holidays(d, d_e,include_weekends=False)!=[], is_holiday(d)

- Wall time: 3min 51s

In [15]:
%%time
# process Date_received
l_has, l_is = [], []
for i in range(low_len):
    result = process_date(train_df['Date_received'][i])
    l_has.append(result[0])
    l_is.append(result[1])
print(l_has[:5], l_is[:5])

[False, True, False, False, False] [False, True, False, True, False]
Wall time: 3min 51s
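
The loop above calls the calendar functions once per row, although the receipt dates very likely repeat heavily. A minimal sketch of evaluating a date feature once per distinct date and mapping it back to all rows; `per_unique_date` and `weekend_only` are hypothetical names, and `weekend_only` is only a stand-in so the example runs without `chinese_calendar` (it ignores public holidays and make-up workdays).

```python
from datetime import datetime

import numpy as np
import pandas as pd


def weekend_only(d):
    """Stand-in for chinese_calendar.is_holiday: True on Saturday/Sunday only."""
    return d.weekday() >= 5


def per_unique_date(received, fn, default=False):
    """Evaluate fn once per distinct yyyymmdd value, then map the results back."""
    lookup = {
        value: fn(datetime.strptime(str(int(value)), "%Y%m%d"))
        for value in received.dropna().unique()
    }
    # Missing dates are absent from the lookup, so they fall back to the default
    return [lookup.get(value, default) for value in received]


received = pd.Series([np.nan, 20160528.0, 20160217.0, 20160528.0])
is_weekend = per_unique_date(received, weekend_only)
print(is_weekend)
```

Passing a function that returns the same pair as `process_date`, with `default=(False, False)`, should give both features in one pass.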


In [16]:
train_df_new['has_holidays'] = l_has
train_df_new['is_holiday'] = l_is
train_df_new.head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday,label
0,1.0,0,0.0,False,False,
1,0.866667,150,1.0,True,True,
2,0.95,20,0.0,False,False,
3,0.95,20,0.0,False,True,
4,0.95,20,0.0,False,False,


### Processing Date (label)

In [17]:
def process_label(received_i, date_i):
    s, e = received_i, date_i
    if np.isnan(s) or np.isnan(date_i):
        return  False
    else:
        d_s = datetime.strptime(str(int(s)), '%Y%m%d')
        d_e = datetime.strptime(str(int(e)), '%Y%m%d')
        minus = (d_e - d_s).days
        return minus<15

- Wall time: 2min 47s

In [18]:
%%time
# process Date
l_isbuy = []
for i in range(low_len):
    result = process_label(train_df['Date_received'][i], train_df['Date'][i])
    l_isbuy.append(result)
print(l_isbuy[:5])

[False, False, False, False, False]
Wall time: 2min 47s
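
The label can also be derived with column-wise date arithmetic. The sketch below is one possible formulation under the same rule as `process_label` (missing dates give False, and the gap must be under 15 days); `to_dates` and `label_within` are hypothetical helper names.

```python
import numpy as np
import pandas as pd


def to_dates(col):
    """Convert yyyymmdd floats (with NaN) to datetimes; missing values become NaT."""
    text = col.map(lambda v: str(int(v)) if pd.notna(v) else None)
    return pd.to_datetime(text, format="%Y%m%d", errors="coerce")


def label_within(received, used, max_days=15):
    """True where the coupon was used fewer than max_days days after receipt."""
    gap = (to_dates(used) - to_dates(received)).dt.days
    # Comparisons against NaN are False, matching process_label for missing dates
    return gap < max_days


received = pd.Series([np.nan, 20160528.0, 20160217.0, 20160613.0])
used = pd.Series([20160217.0, np.nan, 20160220.0, 20160701.0])
labels = label_within(received, used)
print(labels.tolist())
```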


In [19]:
train_df_new['label'] = l_isbuy
train_df_new[train_df_new['label'].notnull()].head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday,label
0,1.0,0,0.0,False,False,False
1,0.866667,150,1.0,True,True,False
2,0.95,20,0.0,False,False,False
3,0.95,20,0.0,False,True,False
4,0.95,20,0.0,False,False,False


In [20]:
### save train_df_new
train_df_new.to_csv('./train_df_new.csv',sep=',',index=False) 
train_df_new.head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday,label
0,1.0,0,0.0,False,False,False
1,0.866667,150,1.0,True,True,False
2,0.95,20,0.0,False,False,False
3,0.95,20,0.0,False,True,False
4,0.95,20,0.0,False,False,False


## Removing null values (decision tree)

Of the 1754884 rows, the null counts per column are:

| discount | threshold | distance | has_holidays | is_holiday | label |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 106003 | 0 | 0 | 0 |

- ["discount", "threshold", "distance", "has_holidays", "is_holiday", "label"]

In [15]:
### load train_df_new
train_df_new = pd.read_csv('./train_df_new.csv')
train_df_new.head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday,label
0,1.0,0,0.0,False,False,False
1,0.866667,150,1.0,True,True,False
2,0.95,20,0.0,False,False,False
3,0.95,20,0.0,False,True,False
4,0.95,20,0.0,False,False,False


In [16]:
def count_nan_rate(series):
    return series.isnull().sum()/low_len

In [11]:
# count null values per column
train_df_new['discount'].isnull().sum(), train_df_new['threshold'].isnull().sum(), \
train_df_new['distance'].isnull().sum(), train_df_new['has_holidays'].isnull().sum(), \
train_df_new['is_holiday'].isnull().sum(), train_df_new['label'].isnull().sum()

(0, 0, 106003, 0, 0, 0)

In [12]:
# null ratio per column
count_nan_rate(train_df_new['discount']), count_nan_rate(train_df_new['threshold']),    \
count_nan_rate(train_df_new['distance']), count_nan_rate(train_df_new['has_holidays']), \
count_nan_rate(train_df_new['is_holiday']), count_nan_rate(train_df_new['label'])

(0.0, 0.0, 0.060404562352839274, 0.0, 0.0, 0.0)
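
The per-column counts and ratios above can also be obtained for all columns at once. A small sketch on toy data (the values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "discount": [1.0, 0.95, 0.9, 0.8],
    "distance": [0.0, np.nan, 2.0, np.nan],
})

null_counts = toy.isnull().sum()   # number of nulls per column
null_rates = toy.isnull().mean()   # the same counts divided by the number of rows
print(null_counts.to_dict(), null_rates.to_dict())
```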

### Helper functions

In [13]:
# Train a model for one column with missing values and return it (not_null_df is a global variable)
def dt_fit_train(predictName):
    # choose features and target according to predictName
    if predictName=='distance':
        exe_l = ['discount','threshold', 'has_holidays', 'is_holiday', 'label', 'distance']
        X, Y = not_null_df[exe_l[:5]], not_null_df[exe_l[5]]
    elif predictName in ['discount','threshold']:
        exe_l = ['has_holidays', 'is_holiday', 'distance', 'discount','threshold']                    # no row has discount null while the other columns are non-null
        X, Y = not_null_df[exe_l[:3]], not_null_df[exe_l[3:]]
    elif predictName in ['has_holidays', 'is_holiday']:
        exe_l = ['label', 'distance', 'discount','threshold', 'has_holidays', 'is_holiday']
        X, Y = not_null_df[exe_l[:4]], not_null_df[exe_l[4:]]
    elif predictName=='label':
        exe_l = ['discount','threshold', 'has_holidays', 'is_holiday', 'distance', 'label']
        X, Y = not_null_df[exe_l[:5]], not_null_df[exe_l[5]]
        
    # if Y has two columns, keep only the one being predicted
    print(X.shape,Y.shape)
    if len(Y.shape)>1:
        Y = Y[predictName]
#     print(X.head(),Y.head())
    
    # split into training and test sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)
    print(X_train.head(),Y_train.head())
    
    # train
    predict_clr = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    predict_clr.fit(X_train, Y_train.astype('int'))
    
    
    # evaluate on the held-out split
    score = predict_clr.score(X_test, Y_test.astype('int'))     
    
    return predict_clr, score

In [14]:
# Export an indexs file that stores the row indexes matched by a specific query
def exe_indexs(predictName):
    if predictName=='distance':
        indexs_0 = train_df_new[train_df_new['discount'].notnull()&train_df_new['threshold'].notnull()&train_df_new['has_holidays'].notnull()\
&train_df_new['is_holiday'].notnull()&train_df_new['label'].notnull()&train_df_new['distance'].isnull()].index
    elif predictName in ['discount','threshold']:
        indexs_0 = train_df_new[train_df_new['distance'].notnull()&train_df_new['has_holidays'].notnull()&train_df_new['is_holiday'].notnull()\
&train_df_new['discount'].isnull()].index                                                   # no row has discount null while the other columns are non-null; label is not used
    elif predictName in ['has_holidays', 'is_holiday']:
        indexs_0 = train_df_new[train_df_new['discount'].notnull()&train_df_new['threshold'].notnull()&train_df_new['distance'].notnull()\
&train_df_new['label'].notnull()&train_df_new['has_holidays'].isnull()].index
    elif predictName=='label':
        indexs_0 = train_df_new[train_df_new['discount'].notnull()&train_df_new['threshold'].notnull()&train_df_new['has_holidays'].notnull()\
&train_df_new['is_holiday'].notnull()&train_df_new['distance'].notnull()&train_df_new['label'].isnull()].index
    
    print(indexs_0[:20])
    indexs = pd.DataFrame(list(indexs_0))
    print(indexs.head(20))
    indexs.to_csv('./indexs_%s.csv'%predictName, sep=',', index=False)

In [15]:
# Predict missing values with the model and write them into train_df_new in place
def dt_pre(predictName, dtClassifier):
    indexs = list(pd.read_csv('./indexs_%s.csv'%predictName).iloc[:,0])
    
    # select the feature columns
    columns = ["discount","threshold","distance","has_holidays","is_holiday","label"]
    columns_sep = [s for s in columns if s!=predictName]
    X_test = train_df_new.reindex(indexs)[columns_sep]
#     print('X_test', X_test.head())

    # predict
    Y_pred = list(dtClassifier.predict(X_test))
    print(Counter(Y_pred))
    
    # write the predictions back
    train_df_new.loc[indexs, predictName]  = Y_pred

    print("数据成功更改")                          # message: "data updated successfully"
#     print(train_df_new[predictName])

### Predicting distance
- 106003 rows are missing distance, about 6% of the data

In [16]:
# rows without nulls, used as the training set
not_null_df = train_df_new[train_df_new['discount'].notnull() & train_df_new['threshold'].notnull() \
                          & train_df_new['distance'].notnull() & train_df_new['has_holidays'].notnull()                \
                          & train_df_new['is_holiday'].notnull()  & train_df_new['label'].notnull()]

print(len(not_null_df))
not_null_df.head()

1648881


Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday,label
0,1.0,0,0.0,False,False,False
1,0.866667,150,1.0,True,True,False
2,0.95,20,0.0,False,False,False
3,0.95,20,0.0,False,True,False
4,0.95,20,0.0,False,False,False


In [17]:
predictName = 'distance'
# train
model, score = dt_fit_train(predictName)
best_score = 0.5221210314050924

print('Accuracy: ',score)

(1648881, 5) (1648881,)
         discount  threshold  has_holidays  is_holiday  label
1572627  1.000000          0         False       False  False
1740577  0.933333        150          True       False  False
32959    0.900000        100          True        True  False
38492    1.000000          0         False       False  False
175547   0.750000         20          True       False  False 1572627    1.0
1740577    3.0
32959      5.0
38492      1.0
175547     0.0
Name: distance, dtype: float64
Accuracy:  0.524508505756421


In [18]:
# export the model
mes = datetime.now().strftime('%y%m%d_%H')                            # timestamp used in the model file name
## export only if no file with this name exists yet
if not os.path.isfile('../Model/model_dis_%s.pkl'%mes): 
    if score>best_score:
        with open('../Model/model_dis_%s.pkl'%mes, 'wb') as f:
            pickle.dump(model, f)                                           # export the model
            print("Model exported")
            
# load the model
# with open('../Model/model_dis_%s.pkl'%mes, 'rb') as f:
#     model = pickle.load(f)                                          
#     print("Model loaded")

Model exported
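
The export logic (timestamped file name, never overwriting an existing file) can be wrapped in a helper. A minimal sketch, assuming hypothetical names `export_if_new` and `load_model`; a plain dict stands in for a fitted model so the example has no extra dependencies.

```python
import os
import pickle
import tempfile
from datetime import datetime


def export_if_new(obj, directory, prefix, now=None):
    """Pickle obj to <directory>/<prefix>_<yymmdd_HH>.pkl unless that file exists."""
    stamp = (now or datetime.now()).strftime('%y%m%d_%H')
    path = os.path.join(directory, '%s_%s.pkl' % (prefix, stamp))
    if os.path.isfile(path):
        return path, False              # leave the earlier export untouched
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return path, True


def load_model(path):
    """Load a previously pickled object."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Usage with a dict standing in for a fitted model
with tempfile.TemporaryDirectory() as tmp:
    fixed_time = datetime(2019, 5, 21, 15)
    path, written = export_if_new({'depth': 3}, tmp, 'model_dis', now=fixed_time)
    _, written_again = export_if_new({'depth': 9}, tmp, 'model_dis', now=fixed_time)
    restored = load_model(path)
    file_name = os.path.basename(path)
```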


In [19]:
exe_indexs(predictName)

Int64Index([ 13,  17,  24,  52,  58, 112, 130, 131, 132, 133, 134, 135, 136,
            139, 140, 154, 209, 217, 280, 380],
           dtype='int64')
      0
0    13
1    17
2    24
3    52
4    58
5   112
6   130
7   131
8   132
9   133
10  134
11  135
12  136
13  139
14  140
15  154
16  209
17  217
18  280
19  380


In [20]:
dt_pre(predictName, model)
train_df_new[predictName].isnull().sum(), 'filled:', 106003-train_df_new[predictName].isnull().sum()

Counter({0: 80971, 10: 24918, 1: 94, 2: 16, 6: 3, 5: 1})
数据成功更改


(0, 'filled:', 106003)
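
The fit, index and predict steps used for distance can be condensed into a single function. The sketch below is a simplified variant, not the notebook's exact procedure: it skips the train/test split and the intermediate CSV of indexes, and `impute_with_tree` is a hypothetical name. The toy data is constructed so that the tree can separate the two distance values perfectly.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def impute_with_tree(df, target, features):
    """Fill nulls in df[target] with predictions from a tree fit on the non-null rows."""
    known = df[df[target].notnull()]
    missing = df[df[target].isnull()]

    clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
    clf.fit(known[features], known[target].astype(int))

    filled = df.copy()                  # leave the input frame unchanged
    if len(missing):
        filled.loc[missing.index, target] = clf.predict(missing[features])
    return filled, clf


toy = pd.DataFrame({
    "discount":  [0.9, 0.9, 0.9, 0.5, 0.5, 0.5, 0.9, 0.5],
    "threshold": [100, 100, 100, 20, 20, 20, 100, 20],
    "distance":  [1.0, 1.0, 1.0, 10.0, 10.0, 10.0, np.nan, np.nan],
})
filled, clf = impute_with_tree(toy, "distance", ["discount", "threshold"])
print(filled["distance"].tolist())
```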

In [21]:
# rows without nulls, used as the training set
not_null_df = train_df_new[train_df_new['discount'].notnull() & train_df_new['threshold'].notnull() \
                          & train_df_new['distance'].notnull() & train_df_new['has_holidays'].notnull()                \
                          & train_df_new['is_holiday'].notnull()  & train_df_new['label'].notnull()]

print(len(not_null_df))
not_null_df.head()

1754884


Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday,label
0,1.0,0,0.0,False,False,False
1,0.866667,150,1.0,True,True,False
2,0.95,20,0.0,False,False,False
3,0.95,20,0.0,False,True,False
4,0.95,20,0.0,False,False,False


### Null counts

In [22]:
# count null values per column
train_df_new['discount'].isnull().sum(), train_df_new['threshold'].isnull().sum(), \
train_df_new['distance'].isnull().sum(), train_df_new['has_holidays'].isnull().sum(), \
train_df_new['is_holiday'].isnull().sum(), train_df_new['label'].isnull().sum()

(0, 0, 0, 0, 0, 0)

## Decision tree training

In [23]:
train_df_new.head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday,label
0,1.0,0,0.0,False,False,False
1,0.866667,150,1.0,True,True,False
2,0.95,20,0.0,False,False,False
3,0.95,20,0.0,False,True,False
4,0.95,20,0.0,False,False,False


In [24]:
# train_df_new
# build the feature matrix and the target
X, Y = train_df_new[["discount","threshold","distance","has_holidays","is_holiday"]], train_df_new[['label']]
# print(X.head())
# print(Y.head())

# split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)
# print(X_train.head(),Y_train.head())

# train
predict_clr = DecisionTreeClassifier(criterion = 'entropy', random_state = 1)
predict_clr.fit(X_train, Y_train.astype('int'))


# evaluate on the held-out split
score = predict_clr.score(X_test, Y_test.astype('int'))     
best_score = 0.9643072867003757
print('Accuracy',score)
Y_pred = predict_clr.predict_proba(X_test)
print(X_test.head())
print(Y_test.head())
print(Y_pred[:10])

Accuracy 0.9643072867003757
         discount  threshold  distance  has_holidays  is_holiday
26931        1.00          0       1.0         False       False
1179025      0.90         50      10.0          True       False
925797       0.85        200       0.0         False       False
1240582      1.00          0       6.0         False       False
1602068      1.00          0       0.0         False       False
         label
26931    False
1179025  False
925797   False
1240582  False
1602068  False
[[1.         0.        ]
 [0.99342105 0.00657895]
 [0.98656716 0.01343284]
 [1.         0.        ]
 [1.         0.        ]
 [0.90991135 0.09008865]
 [0.99373041 0.00626959]
 [1.         0.        ]
 [1.         0.        ]
 [1.         0.        ]]


In [25]:
# export the model
mes = datetime.now().strftime('%y%m%d_%H')                            # timestamp used in the model file name
## export only if no file with this name exists yet
if not os.path.isfile('../Model/model_train_%s.pkl'%mes): 
#     if score>best_score:
        with open('../Model/model_train_%s.pkl'%mes, 'wb') as f:
            pickle.dump(predict_clr, f)                                           # export the model
            print("Model exported")

Model exported


### Transforming the test set

In [26]:
# original test set
test_df = pd.read_csv("../Data/ccf_offline_stage1_test_revised.csv")
test_df.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received
0,4129537,450,9983,30:5,1.0,20160712
1,6949378,1300,3429,30:5,,20160706
2,2166529,7113,6928,200:20,5.0,20160727
3,2166529,7113,1808,100:10,5.0,20160727
4,6172162,7605,6500,30:1,2.0,20160708


In [27]:
# count null values per column
test_df['Discount_rate'].isnull().sum(), test_df['Distance'].isnull().sum(), \
test_df['Date_received'].isnull().sum()

(0, 12064, 0)

In [28]:
# separate the id columns from the feature columns
test_df_id, test_df_tochange = test_df[["User_id","Coupon_id","Date_received"]], test_df[["Discount_rate","Distance","Date_received"]]
test_df_id.head(),test_df_tochange.head()

(   User_id  Coupon_id  Date_received
 0  4129537       9983       20160712
 1  6949378       3429       20160706
 2  2166529       6928       20160727
 3  2166529       1808       20160727
 4  6172162       6500       20160708,   Discount_rate  Distance  Date_received
 0          30:5       1.0       20160712
 1          30:5       NaN       20160706
 2        200:20       5.0       20160727
 3        100:10       5.0       20160727
 4          30:1       2.0       20160708)

In [29]:
# number of rows in the dataset
test_len = test_df_tochange.shape[0]
test_len

113640

In [30]:
# empty frame that will hold the transformed test set
test_df_changed = pd.DataFrame(columns=['discount', 'threshold', 'distance', 'has_holidays', 'is_holiday'])
test_df_changed

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday


#### Transforming Discount_rate

In [31]:
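def process_discount(discountSeries_i):
    """Map one raw Discount_rate value to (discount, threshold)."""
    s = discountSeries_i
    # Missing value: no coupon, so no discount (rate 1.0) and no threshold
    if not isinstance(s, str) and np.isnan(s):
        return float(1), 0
    if ':' in s:
        # 'A:B' means "spend A, save B": the effective rate is (A - B) / A
        l = s.split(':')
        threshold, minus = int(l[0]), int(l[1])
        return (threshold - minus) / threshold, threshold
    elif '.' in s:
        # '0.n' is a direct discount rate without a spending threshold
        return float(s), 0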

- Wall time: 2.69 s

In [32]:
%%time
l_discount, l_threshold = [], []
for i in range(test_len):
    result = process_discount(test_df_tochange['Discount_rate'][i])
    l_discount.append(result[0])
    l_threshold.append(result[1])
print(l_discount[:5], l_threshold[:5])

[0.8333333333333334, 0.8333333333333334, 0.9, 0.9, 0.9666666666666667] [30, 30, 200, 100, 30]
Wall time: 2.59 s


In [33]:
test_df_changed['discount'] = l_discount
test_df_changed['threshold'] = l_threshold
test_df_changed.head(5)

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday
0,0.833333,30,,,
1,0.833333,30,,,
2,0.9,200,,,
3,0.9,100,,,
4,0.966667,30,,,


#### Transforming Distance

In [34]:
test_df_changed['distance'] = test_df_tochange['Distance']
test_df_changed.head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday
0,0.833333,30,1.0,,
1,0.833333,30,,,
2,0.9,200,5.0,,
3,0.9,100,5.0,,
4,0.966667,30,2.0,,


In [35]:
# handle nulls in distance
## count the nulls
org_null = test_df['Distance'].isnull().sum()
print(org_null)

12064


#### Transforming Date_received

In [36]:
df_dr = test_df_tochange['Date_received'].unique()
date_received = sorted(df_dr)

print('Coupon receipt dates range from',date_received[0],'to',date_received[-1])

Coupon receipt dates range from 20160701 to 20160731


In [37]:
def process_date(dateSeries_i):
    s = dateSeries_i
    if np.isnan(s):
        return  False, False
    else:
        d = datetime.strptime(str(int(s)), '%Y%m%d')                                     # date
        d_e = d + timedelta(days = 15)
        return get_holidays(d, d_e, include_weekends=False)!=[], is_holiday(d)

- Wall time: 1min 7s

In [38]:
%%time
# process Date_received
l_has, l_is = [], []
for i in range(test_len):
    result = process_date(test_df_tochange['Date_received'][i])
    l_has.append(result[0])
    l_is.append(result[1])
print(l_has[:5], l_is[:5])

[False, False, False, False, False] [False, False, False, False, False]
Wall time: 19.2 s


In [39]:
test_df_changed['has_holidays'] = l_has
test_df_changed['is_holiday'] = l_is
test_df_changed.head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday
0,0.833333,30,1.0,False,False
1,0.833333,30,,False,False
2,0.9,200,5.0,False,False
3,0.9,100,5.0,False,False
4,0.966667,30,2.0,False,False


#### Handling nulls in distance

In [40]:
# rows without nulls, used as the training set
not_null_df = test_df_changed[test_df_changed['discount'].notnull() & test_df_changed['threshold'].notnull() \
                          & test_df_changed['distance'].notnull() & test_df_changed['has_holidays'].notnull()                \
                          & test_df_changed['is_holiday'].notnull()]

print(len(not_null_df))
not_null_df.head()

101576


Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday
0,0.833333,30,1.0,False,False
2,0.9,200,5.0,False,False
3,0.9,100,5.0,False,False
4,0.966667,30,2.0,False,False
5,0.833333,30,0.0,False,False


In [41]:
# rows with a null distance, to be predicted
test_distance_df = test_df_changed[test_df_changed['discount'].notnull() & test_df_changed['threshold'].notnull() \
                          & test_df_changed['is_holiday'].notnull() & test_df_changed['has_holidays'].notnull()                \
                          & test_df_changed['distance'].isnull()]
test_distance_df.head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday
1,0.833333,30,,False,False
14,0.833333,30,,False,False
34,0.833333,30,,False,True
35,0.833333,30,,False,False
36,0.75,20,,False,False


In [42]:
# split into features X and target Y
X, Y = not_null_df[["discount","threshold","has_holidays","is_holiday"]], not_null_df[['distance']]

# split into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)
print(X_train.head(),Y_train.head())

# train
predict_clr = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
predict_clr.fit(X_train, Y_train)


# evaluate on the held-out split
score = predict_clr.score(X_test, Y_test)   
best_score = 0.5524041438573432
print(score)

        discount  threshold  has_holidays  is_holiday
35105   0.966667         30         False       False
102351  0.950000         20         False        True
50028   0.833333         30         False       False
21425   0.833333         30         False       False
64865   0.950000         20         False        True         distance
35105        0.0
102351       0.0
50028       10.0
21425        0.0
64865        2.0
0.4328750041019919


In [43]:
# export the model
mes = datetime.now().strftime('%y%m%d_%H')                            # timestamp used in the model file name
## export only if no file with this name exists yet
if not os.path.isfile('../Model/model_test_dis_%s.pkl'%mes): 
    if score>best_score:
        with open('../Model/model_test_dis_%s.pkl'%mes, 'wb') as f:
            pickle.dump(predict_clr, f)                                           # export the model
            print("Model exported")

In [44]:
# select the feature columns
X_feature = test_distance_df[["discount","threshold","has_holidays","is_holiday"]]

# predict distance
Y_pred_dis = list(predict_clr.predict(X_feature))

# write the predictions back
test_df_changed.loc[list(test_distance_df.index), 'distance']  = Y_pred_dis

test_df_changed.head()

Unnamed: 0,discount,threshold,distance,has_holidays,is_holiday
0,0.833333,30,1.0,False,False
1,0.833333,30,0.0,False,False
2,0.9,200,5.0,False,False
3,0.9,100,5.0,False,False
4,0.966667,30,2.0,False,False


In [45]:
# count null values per column
test_df_changed['discount'].isnull().sum(), test_df_changed['threshold'].isnull().sum(), \
test_df_changed['distance'].isnull().sum(), test_df_changed['has_holidays'].isnull().sum(),\
test_df_changed['is_holiday'].isnull().sum()

(0, 0, 0, 0, 0)

In [46]:
test_df_changed.to_csv('./test_df_changed.csv', sep=',', index=False)

### Model results

In [47]:
# the id columns were separated from the features earlier
# test_df_id, test_df_tochange

# predict
test_df_changed = pd.read_csv('./test_df_changed.csv')
test_df_changed = test_df_changed[["discount","threshold","distance","has_holidays","is_holiday"]]
print(test_df_changed.head())
with open('../Model/model_train_190521_15.pkl', 'rb') as f:
    model = pickle.load(f)
# result = model.predict(test_df_changed)
result = model.predict_proba(test_df_changed)
print(result[:10][:,1])

# probability of the positive class
pro = result[:,1]

   discount  threshold  distance  has_holidays  is_holiday
0  0.833333         30       1.0         False       False
1  0.833333         30       0.0         False       False
2  0.900000        200       5.0         False       False
3  0.900000        100       5.0         False       False
4  0.966667         30       2.0         False       False
[0.06469394 0.13179513 0.01764706 0.03690037 0.05782313 0.13179513
 0.18283469 0.06469394 0.02228826 0.08721735]


In [48]:
# combine the ids with the predicted probabilities
result_df = test_df_id.copy()          # copy first to avoid pandas' SettingWithCopyWarning
result_df['Probability'] = pro
result_df.head()

Unnamed: 0,User_id,Coupon_id,Date_received,Probability
0,4129537,9983,20160712,0.064694
1,6949378,3429,20160706,0.131795
2,2166529,6928,20160727,0.017647
3,2166529,1808,20160727,0.0369
4,6172162,6500,20160708,0.057823


In [50]:
result_mes = datetime.now().strftime('%y%m%d_%H')
result_df.to_csv("../AUC/result_%s.csv"%result_mes,sep=',',index=False,header=None)

### AUC

In [14]:
# ["User_id","Coupon_id","Date_received","Probability"]
result_df = pd.read_csv("../AUC/result_190521_16.csv",sep=',',header=None)
result_df.columns = ["User_id","Coupon_id","Date_received","Probability"]
result_df.head()

Unnamed: 0,User_id,Coupon_id,Date_received,Probability
0,4129537,9983,20160712,0.064694
1,6949378,3429,20160706,0.131795
2,2166529,6928,20160727,0.017647
3,2166529,1808,20160727,0.0369
4,6172162,6500,20160708,0.057823


In [None]:
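# avgAUC calculation: mean of the per-coupon AUCs
# Requires a ground-truth 'label' column in result_df. The unlabeled test set does not
# provide one, so merge labels from a labeled validation split before running this cell.
from sklearn.metrics import roc_curve, auc

vg = result_df.groupby(['Coupon_id'])
aucs = []
for i in vg:
    tmpdf = i[1] 
    # AUC is undefined when a coupon group contains only one class
    if len(tmpdf['label'].unique()) != 2:
        continue
    fpr, tpr, thresholds = roc_curve(tmpdf['label'], tmpdf['Probability'], pos_label=1)
    aucs.append(auc(fpr, tpr))
print(np.average(aucs))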
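
To illustrate the per-coupon average AUC, here is a minimal sketch on toy data with known labels. `average_auc_per_coupon` is a hypothetical helper, the labels and probabilities are made up, and `roc_auc_score` is used in place of `roc_curve` plus `auc`, which yields the same area.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def average_auc_per_coupon(df, label_col="label", prob_col="Probability"):
    """Mean AUC over coupons whose rows contain both classes."""
    aucs = []
    for _, group in df.groupby("Coupon_id"):
        if group[label_col].nunique() != 2:
            continue                    # AUC is undefined for a single class
        aucs.append(roc_auc_score(group[label_col], group[prob_col]))
    return float(np.mean(aucs)) if aucs else float("nan")


toy = pd.DataFrame({
    "Coupon_id":   [1, 1, 1, 1, 2, 2, 3, 3],
    "label":       [1, 0, 1, 0, 0, 0, 1, 0],
    "Probability": [0.9, 0.1, 0.8, 0.3, 0.2, 0.4, 0.2, 0.7],
})
avg_auc = average_auc_per_coupon(toy)
print(avg_auc)
```

Coupon 1 is ranked perfectly, coupon 2 is skipped because it has a single class, and coupon 3 is ranked in reverse, so the average is taken over two groups.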