# Overview
隨著移動設備的完善和普及，零售與電子商務進入了高速發展階段，這其中以 O2O（Online to Offline）消費最為熱絡。據統計，O2O 行業估值上億的創業公司至少有 10 家，也不乏百億巨頭的身影。 O2O 行業每日自然流量有數億消費者，各類 APP 每天記錄了超過百億條用戶行為和位置記錄，因而成為大數據科研和商業化運營的最佳結合點之一。

以優惠券活化老用戶或吸引新客戶消費是 O2O 的一種重要營銷方式。然而，隨機投放的優惠券對多數用戶造成無意義的干擾。對商家而言，濫發的優惠券可能降低品牌聲譽，同時難以估算營銷成本。個性化投放是提高優惠券核銷率的重要技術，它可以讓具有一定偏好的消費者得到真正的實惠，同時賦予商家更強的營銷能力。本次練習數據擷取自電商之部分數據，希望各位通過分析建模，預測用戶是否會在規定時間內使用相應優惠券。

# Data
本賽題提供用戶在2016年1月1日至2016年5月31日之間真實線下消費行為，預測用戶在2016年6月領取優惠券後15天以內的使用情況。

# Evaluation
本賽題目標是預測投放的優惠券是否在規定時間內核銷。針對此任務及一些相關背景知識，以該用戶使用於某日取得之優惠券核銷預測 AUC（ROC 曲線下面積）作為評價標準。即對將 User_id - Date_received - Coupon_id 為一組計算核銷預測的AUC值，若某使用者於同一日取得多張相同優惠券，則任一張核銷皆為有效核銷。

## [作業重點]
請至以下Kaggle競賽平台期中考連結參加考試，請按join competition加入測驗 

並將任一次提交完成畫面截圖，上傳到github並回到官網完成提交。(提交畫面截圖範例)

https://www.kaggle.com/t/2eaf9a8ed13847cd9c26f05b2abfe966


同學提交檔案前可參考Baseline Kernel作法

https://www.kaggle.com/ml100marathon/baseline-example

![d51_1](../img/d51_1.png)

User_id：用戶 ID 


Merchant_id：商家 ID 


Coupon_id：優惠券 ID (null 代表無優惠券消費) 


Discount_rate：優惠券折價：[0,1] 代表折扣率；x:y 代表滿 x 減 y 元 


Distance：用戶經常活動地點離商家最近距離 (x * 500 公尺), 0 表示低於 500 公尺, 10 表示大於 5 公里。 


Date_received：優惠券取得時間。 


Date：購買商品時間 (如果 Date is null & Coupon_id is not null, 則該紀錄為有優惠券但未使用; 

若為 Date is not null & Coupon_id is null, 則為普通消費日期; 

若 Date is not null & Coupon_id is not null, 則表示優惠券消費日期)


# 載入模組、資料

In [1]:
import os
import numpy as np
import pandas as pd
from datetime import date

from sklearn.model_selection import KFold, train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import log_loss, roc_auc_score, auc, roc_curve
from sklearn.metrics import roc_auc_score, accuracy_score

from xgboost import XGBClassifier

DATA_ROOT = "../data/Cupoy"

In [2]:
dfoff = pd.read_csv(os.path.join(DATA_ROOT, 'train_offline.csv'))
dftest = pd.read_csv(os.path.join(DATA_ROOT, 'test_offline.csv'))
dftest = dftest[~dftest.Coupon_id.isna()]
dftest.reset_index(drop=True, inplace=True)
print(dfoff.shape)
print(dftest.shape)
dfoff.head(20)

(1160742, 7)
(306313, 6)


Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date
0,1439408,2632,,,0.0,,20160217.0
1,1439408,2632,8591.0,20:1,0.0,20160217.0,
2,1439408,2632,1078.0,20:1,0.0,20160319.0,
3,1832624,3381,7610.0,200:20,0.0,20160429.0,
4,2029232,3381,11951.0,200:20,1.0,20160129.0,
5,2223968,3381,9776.0,10:5,2.0,20160129.0,
6,73611,2099,12034.0,100:10,,20160207.0,
7,163606,1569,5054.0,200:30,10.0,20160421.0,
8,3273056,4833,7802.0,200:20,10.0,20160130.0,
9,94107,3381,7610.0,200:20,2.0,20160412.0,


In [3]:
dftest.head(20)

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received
0,1439408,4663,11002.0,150:20,1.0,20160528.0
1,1439408,2632,8591.0,20:1,0.0,20160613.0
2,1439408,2632,8591.0,20:1,0.0,20160516.0
3,2029232,450,1532.0,30:5,0.0,20160530.0
4,2029232,6459,12737.0,20:1,0.0,20160519.0
5,2747744,6901,1097.0,50:10,,20160606.0
6,196342,1579,10698.0,20:1,1.0,20160606.0
7,253750,6901,2366.0,30:5,0.0,20160518.0
8,343660,4663,11002.0,150:20,,20160528.0
9,1113008,3621,2705.0,20:5,0.0,20160524.0


# 定義 Target Label

In [4]:
## Creat target label 
"""
預測用戶在2016年6月領取優惠券後15天以內的使用情況
According to the definition, 
1) buy with coupon within (include) 15 days ==> 1
2) buy with coupon but out of 15 days ==> 0
3) buy without coupon ==> -1 (we don't care)
"""
def label(row):
    if np.isnan(row['Date_received']): # 無優惠券取得時間
        return -1
    if not np.isnan(row['Date']): # 有優惠券取得時間 及 優惠券消費日期 且 15天內使用
        td = pd.to_datetime(row['Date'], format='%Y%m%d') - pd.to_datetime(row['Date_received'], format='%Y%m%d')
        if td <= pd.Timedelta(15, 'D'):
            return 1
    return 0 # 有優惠券取得時間 但 無 優惠券消費日期 或 超過 15天

dfoff["label"] = dfoff.apply(label, axis=1)
dfoff["label"].value_counts()


 0    710665
-1    413773
 1     36304
Name: label, dtype: int64

# 產生 Features

In [5]:
# Generate features - weekday acquired coupon
def getWeekday(row):
    if (np.isnan(row)) or (row==-1):
        return row
    else:
        weekday = pd.to_datetime(row, format = "%Y%m%d").dayofweek+1 # add one to make it from 0~6 -> 1~7
        return int(weekday)

dfoff['weekday'] = dfoff['Date_received'].apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].apply(getWeekday)

In [6]:
# weekday_type (weekend = 1)
#dfoff['weekday_type'] = dfoff['weekday'].astype('str').apply(lambda x : 1 if x in [6,7] else 0 ) # apply to trainset
dfoff['weekday_type'] = dfoff['weekday'].apply(lambda x : 1 if x in [6,7] else 0 )
dftest['weekday_type'] = dftest['weekday'].apply(lambda x : 1 if x in [6,7] else 0 ) # apply to testset

dfoff.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,label,weekday,weekday_type
0,1439408,2632,,,0.0,,20160217.0,-1,,0
1,1439408,2632,8591.0,20:1,0.0,20160217.0,,0,3.0,0
2,1439408,2632,1078.0,20:1,0.0,20160319.0,,0,6.0,1
3,1832624,3381,7610.0,200:20,0.0,20160429.0,,0,5.0,0
4,2029232,3381,11951.0,200:20,1.0,20160129.0,,0,5.0,0


In [7]:
dfoff["weekday_type"].value_counts()

0    901368
1    259374
Name: weekday_type, dtype: int64

In [8]:
# Generate features - day of month acquired coupon
def getday(row):
    if (np.isnan(row)) or (row==-1):
        return -1
    else:
        day = pd.to_datetime(row, format = "%Y%m%d").day
        return int(day)
    
dfoff['dayofmonth'] = dfoff['Date_received'].apply(getday)
#dftest['dayofmonth'] = dftest['Date_received'].apply(getday)

In [9]:
dftest['dayofmonth'] = dftest['Date_received'].apply(getday)

In [10]:
dfoff.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,label,weekday,weekday_type,dayofmonth
0,1439408,2632,,,0.0,,20160217.0,-1,,0,-1
1,1439408,2632,8591.0,20:1,0.0,20160217.0,,0,3.0,0,17
2,1439408,2632,1078.0,20:1,0.0,20160319.0,,0,6.0,1,19
3,1832624,3381,7610.0,200:20,0.0,20160429.0,,0,5.0,0,29
4,2029232,3381,11951.0,200:20,1.0,20160129.0,,0,5.0,0,29


In [11]:
# 幾天後使用優惠券
def diffDays(row):
    if np.isnan(row['Date_received']): # 無優惠券取得時間
        return -1
    if not np.isnan(row['Date']): # 有優惠券取得時間 及 優惠券消費日期  
        td = pd.to_datetime(row['Date'], format='%Y%m%d') - pd.to_datetime(row['Date_received'], format='%Y%m%d')
        return td.days
    return -1 # 有優惠券取得時間 但 無 優惠券消費日期  

In [12]:
weekdaycols = ['weekday_' + str(i) for i in range(1,8)]
print(weekdaycols)

tmpdf = pd.get_dummies(dfoff['weekday'].replace(-1, np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest['weekday'].replace(-1, np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf
dfoff.head()

['weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7']


Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,label,weekday,weekday_type,dayofmonth,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7
0,1439408,2632,,,0.0,,20160217.0,-1,,0,-1,0,0,0,0,0,0,0
1,1439408,2632,8591.0,20:1,0.0,20160217.0,,0,3.0,0,17,0,0,1,0,0,0,0
2,1439408,2632,1078.0,20:1,0.0,20160319.0,,0,6.0,1,19,0,0,0,0,0,1,0
3,1832624,3381,7610.0,200:20,0.0,20160429.0,,0,5.0,0,29,0,0,0,0,1,0,0
4,2029232,3381,11951.0,200:20,1.0,20160129.0,,0,5.0,0,29,0,0,0,0,1,0,0


In [13]:
# Generate features - coupon discount and distance
def getDiscountType(row):
    if row == 'null' or row == 'nan':
        return -1
    elif ':' in row:
        return 1
    else:
        return 0

def convertRate(row):
    """Convert discount to rate"""
    if row == 'null':
        return 1.0
    elif ':' in row:
        rows = row.split(':')
        return 1.0 - float(rows[1])/float(rows[0])
    else:
        return float(row)

def getDiscountMan(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[0])
    else:
        return 0

def getDiscountJian(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[1])
    else:
        return 0

def processData(df):
    
    # convert discunt_rate
    df['discount_rate'] = df['Discount_rate'].astype('str').apply(convertRate)
    df['discount_man'] = df['Discount_rate'].astype('str').apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].astype('str').apply(getDiscountJian)
    df['discount_type'] = df['Discount_rate'].astype('str').apply(getDiscountType)
    
    # convert distance
    df.loc[df.Distance.isna(), "Distance"] = 99
    return df

dfoff = processData(dfoff)
dftest = processData(dftest)

dfoff.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,label,weekday,weekday_type,...,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6,weekday_7,discount_rate,discount_man,discount_jian,discount_type
0,1439408,2632,,,0.0,,20160217.0,-1,,0,...,0,0,0,0,0,0,,0,0,-1
1,1439408,2632,8591.0,20:1,0.0,20160217.0,,0,3.0,0,...,0,1,0,0,0,0,0.95,20,1,1
2,1439408,2632,1078.0,20:1,0.0,20160319.0,,0,6.0,1,...,0,0,0,0,1,0,0.95,20,1,1
3,1832624,3381,7610.0,200:20,0.0,20160429.0,,0,5.0,0,...,0,0,0,1,0,0,0.9,200,20,1
4,2029232,3381,11951.0,200:20,1.0,20160129.0,,0,5.0,0,...,0,0,0,1,0,0,0.9,200,20,1


In [14]:
dfoff["Distance"].value_counts()

0.0     501931
10.0    166877
1.0     143551
2.0      82186
99.0     69826
3.0      54214
4.0      39495
5.0      30317
6.0      23976
7.0      19029
8.0      15982
9.0      13358
Name: Distance, dtype: int64

In [15]:

dfoff['user_used_all_coupon_count'] = 1
used_all_count = dfoff[['User_id', 'user_used_all_coupon_count']]
used_all_count = used_all_count.groupby('User_id').agg('sum').reset_index()
dfoff = pd.merge(dfoff, used_all_count,on='User_id' ,how='left')

dftest['user_used_all_coupon_count'] = 1
t_used_all_count = dftest[['User_id', 'user_used_all_coupon_count']]
t_used_all_count = t_used_all_count.groupby('User_id').agg('sum').reset_index()
dftest = pd.merge(dftest, t_used_all_count,on='User_id' ,how='left')


In [16]:
dfoff.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,label,weekday,weekday_type,...,weekday_4,weekday_5,weekday_6,weekday_7,discount_rate,discount_man,discount_jian,discount_type,user_used_all_coupon_count_x,user_used_all_coupon_count_y
0,1439408,2632,,,0.0,,20160217.0,-1,,0,...,0,0,0,0,,0,0,-1,1,3
1,1439408,2632,8591.0,20:1,0.0,20160217.0,,0,3.0,0,...,0,0,0,0,0.95,20,1,1,1,3
2,1439408,2632,1078.0,20:1,0.0,20160319.0,,0,6.0,1,...,0,0,1,0,0.95,20,1,1,1,3
3,1832624,3381,7610.0,200:20,0.0,20160429.0,,0,5.0,0,...,0,1,0,0,0.9,200,20,1,1,1
4,2029232,3381,11951.0,200:20,1.0,20160129.0,,0,5.0,0,...,0,1,0,0,0.9,200,20,1,1,1


In [17]:

dfoff['merchant_id_count'] = 1
m_all_count = dfoff[['Merchant_id', 'merchant_id_count']]
m_all_count = m_all_count.groupby('Merchant_id').agg('sum').reset_index()
dfoff = pd.merge(dfoff, m_all_count,on='Merchant_id' ,how='left')

dftest['merchant_id_count'] = 1
t_m_all_count = dftest[['Merchant_id', 'merchant_id_count']]
t_m_all_count = t_m_all_count.groupby('Merchant_id').agg('sum').reset_index()
dftest = pd.merge(dftest, t_m_all_count,on='Merchant_id' ,how='left')
dfoff.head()

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,label,weekday,weekday_type,...,weekday_6,weekday_7,discount_rate,discount_man,discount_jian,discount_type,user_used_all_coupon_count_x,user_used_all_coupon_count_y,merchant_id_count_x,merchant_id_count_y
0,1439408,2632,,,0.0,,20160217.0,-1,,0,...,0,0,,0,0,-1,1,3,1,45
1,1439408,2632,8591.0,20:1,0.0,20160217.0,,0,3.0,0,...,0,0,0.95,20,1,1,1,3,1,45
2,1439408,2632,1078.0,20:1,0.0,20160319.0,,0,6.0,1,...,1,0,0.95,20,1,1,1,3,1,45
3,1832624,3381,7610.0,200:20,0.0,20160429.0,,0,5.0,0,...,0,0,0.9,200,20,1,1,1,1,137054
4,2029232,3381,11951.0,200:20,1.0,20160129.0,,0,5.0,0,...,0,0,0.9,200,20,1,1,1,1,137054


In [18]:
dfoff[dfoff['Merchant_id'] == 1]

Unnamed: 0,User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date,label,weekday,weekday_type,...,weekday_6,weekday_7,discount_rate,discount_man,discount_jian,discount_type,user_used_all_coupon_count_x,user_used_all_coupon_count_y,merchant_id_count_x,merchant_id_count_y
271826,2774613,1,,,0.0,,20160115.0,-1,,0,...,0,0,,0,0,-1,1,3,1,4
271827,2774613,1,,,0.0,,20160301.0,-1,,0,...,0,0,,0,0,-1,1,3,1,4
271828,2774613,1,,,0.0,,20160227.0,-1,,0,...,0,0,,0,0,-1,1,3,1,4
663175,2447473,1,,,0.0,,20160227.0,-1,,0,...,0,0,,0,0,-1,1,1,1,4


# 設定分割點，將資料分割成 train 和 valid

In [19]:
## Naive model
def split_train_valid(row, date_cut="20160416"):
    is_train = True if pd.to_datetime(row, format="%Y%m%d") < pd.to_datetime(date_cut, format="%Y%m%d") else False
    return is_train
    
df = dfoff[dfoff['label'] != -1].copy()
df["is_train"] = df["Date_received"].apply(split_train_valid)

train = df[df["is_train"]]
valid = df[~df["is_train"]]

train.reset_index(drop=True, inplace=True)
valid.reset_index(drop=True, inplace=True)

print("Train size: {}, #positive: {}".format(len(train), train["label"].sum()))
print("Valid size: {}, #positive: {}".format(len(valid), valid["label"].sum()))

Train size: 667753, #positive: 32472
Valid size: 79216, #positive: 3832


# 選擇 Feature

In [47]:
original_feature = [#'User_id', 
                    #'Merchant_id',
                    'discount_rate',
                    'discount_type',
                    'discount_man', 
                    'discount_jian',
                    'Distance',
                    'dayofmonth',
                    'user_used_all_coupon_count_y',
                    'merchant_id_count_y',
                    'weekday', 
                    'weekday_type'] + weekdaycols
print(len(original_feature),original_feature)

17 ['discount_rate', 'discount_type', 'discount_man', 'discount_jian', 'Distance', 'dayofmonth', 'user_used_all_coupon_count_y', 'merchant_id_count_y', 'weekday', 'weekday_type', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7']


# 匯出預測結果

In [35]:
def exportSubmission(filename, model):
    targetset = dftest.copy()
    print(targetset.shape)
    
    targetset = targetset[~targetset.Coupon_id.isna()]
    targetset.reset_index(drop=True, inplace=True)
    testset = targetset[predictors].copy() 
    y_test_pred = model.predict_proba(testset[predictors])
    test1 = testset.copy()
    test1['pred_prob'] = y_test_pred[:, 1]
    print(test1.shape)
    
    output = pd.concat((targetset[["User_id", "Coupon_id", "Date_received"]], test1["pred_prob"]), axis=1)
    print(output.shape)

    output.loc[:, "User_id"] = output["User_id"].apply(lambda x:str(int(x)))
    output.loc[:, "Coupon_id"] = output["Coupon_id"].apply(lambda x:str(int(x)))
    output.loc[:, "Date_received"] = output["Date_received"].apply(lambda x:str(int(x)))
    output["uid"] = output[["User_id", "Coupon_id", "Date_received"]].apply(lambda x: '_'.join(x.values), axis=1)
    output.reset_index(drop=True, inplace=True)
    
    ### NOTE: YOUR SUBMITION FILE SHOULD HAVE COLUMN NAME: uid, label
    out = output.groupby("uid", as_index=False).mean()
    out = out[["uid", "pred_prob"]]
    out.columns = ["uid", "label"]
    out.to_csv(filename, header=["uid", "label"], index=False) # submission format
    print(out.head())

In [36]:
def exportSubmit_blending(filename, models, percentages):

    targetset = dftest.copy()
    print(targetset.shape)

    targetset = targetset[~targetset.Coupon_id.isna()]
    targetset.reset_index(drop=True, inplace=True)
    testset = targetset[predictors].copy()

    pred = 0
    for i in range(0, len(models)-1):
        y_test_pred = models[i].predict_proba(testset[predictors])
        pred = pred + y_test_pred * percentages[i]
        
    test1 = testset.copy()
    test1['pred_prob'] = pred[:, 1]
    print(test1.shape)
    
    output = pd.concat((targetset[["User_id", "Coupon_id", "Date_received"]], test1["pred_prob"]), axis=1)
    print(output.shape)

    output.loc[:, "User_id"] = output["User_id"].apply(lambda x:str(int(x)))
    output.loc[:, "Coupon_id"] = output["Coupon_id"].apply(lambda x:str(int(x)))
    output.loc[:, "Date_received"] = output["Date_received"].apply(lambda x:str(int(x)))
    output["uid"] = output[["User_id", "Coupon_id", "Date_received"]].apply(lambda x: '_'.join(x.values), axis=1)
    output.reset_index(drop=True, inplace=True)
    
    ### NOTE: YOUR SUBMITION FILE SHOULD HAVE COLUMN NAME: uid, label
    out = output.groupby("uid", as_index=False).mean()
    out = out[["uid", "pred_prob"]]
    out.columns = ["uid", "label"]
    out.to_csv(filename, header=["uid", "label"], index=False) # submission format
    out.head()

# XGBClassifier

In [51]:
predictors = ['discount_man', 'discount_jian', 'Distance', 'user_used_all_coupon_count_y']

print(predictors)


['discount_man', 'discount_jian', 'Distance', 'user_used_all_coupon_count_y']


In [52]:
params={'booster':'gbtree',
        'objective': 'rank:pairwise',
        'eval_metric':'auc',
        'gamma':0.1,
        'min_child_weight':1.1,
        'max_depth':5,
        'lambda':10,
        'subsample':0.7,
        'colsample_bytree':0.7,
        'colsample_bylevel':0.7,
        'eta': 0.01,
        'tree_method':'exact',
        'seed':0,
        'nthread':12
        }

In [53]:
xgbc = XGBClassifier()
xgbc.set_params(**params)
xgbc.fit(train[predictors], train['label'])


xgbc_pred = xgbc.predict(valid[predictors])

In [54]:
xgbc.feature_importances_

array([0.23759857, 0.24020258, 0.32139298, 0.20080587], dtype=float32)

In [55]:
auc_score = roc_auc_score(y_true=valid.label, y_score=xgbc_pred)
acc = accuracy_score(y_true=valid.label, y_pred=xgbc_pred)
print("Validation AUC: {:.3f}, Accuracy: {:.3f}".format(auc_score, acc))
#1 Validation AUC: 0.839, Accuracy: 0.175

#2 Validation AUC: 0.753, Accuracy: 0.730

#3 original_feature
#Validation AUC: 0.757, Accuracy: 0.739

#4 ['User_id', 'Merchant_id', 'discount_rate', 'discount_type', 'discount_man', 'discount_jian', 'Distance', 'user_used_all_coupon_count_y']
#Validation AUC: 0.754, Accuracy: 0.748

#5 ['Merchant_id', 'discount_rate', 'discount_type', 'discount_man', 'discount_jian', 'Distance', 'user_used_all_coupon_count_y']
#Validation AUC: 0.756, Accuracy: 0.746

#6 0.72402['discount_man', 'discount_jian', 'Distance', 'user_used_all_coupon_count_y']
#Validation AUC: 0.751, Accuracy: 0.722


Validation AUC: 0.751, Accuracy: 0.722


In [23]:
exportSubmission("xgbc.csv", xgbc)

(306313, 21)
(306313, 5)
(306313, 4)
                     uid     label
0  1000020_2705_20160519  1.498043
1  1000020_8192_20160513  1.498043
2  1000065_1455_20160527 -0.047177
3  1000085_8067_20160513  0.259360
4  1000086_2418_20160613  0.259360


# GradientBoostingClassifier

In [56]:
predictors = original_feature#['Merchant_id', 'discount_man', 'Distance', 'user_used_all_coupon_count_y']
#discount_rate
print(predictors)

['discount_rate', 'discount_type', 'discount_man', 'discount_jian', 'Distance', 'dayofmonth', 'user_used_all_coupon_count_y', 'merchant_id_count_y', 'weekday', 'weekday_type', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7']


In [57]:
# 建立模型
gdbt = GradientBoostingClassifier(tol=100, subsample=0.75, n_estimators=250,
                                  max_features=len(predictors), max_depth=6, learning_rate=0.03)

# 訓練模型
gdbt.fit(train[predictors], train['label'])

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.03, loss='deviance', max_depth=6,
              max_features=17, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=250,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=0.75, tol=100, validation_fraction=0.1, verbose=0,
              warm_start=False)

In [58]:

# 預測測試集
pred = gdbt.predict(valid[predictors])

# 評分
auc_score = roc_auc_score(y_true=valid.label, y_score=pred)

acc = accuracy_score(y_true=valid.label, y_pred=pred)

print("Validation AUC: {:.3f}, Accuracy: {:.3f}".format(auc_score, acc))

pred_gdbt = pred
#['User_id', 'Merchant_id', 'discount_rate', 'discount_man', 'discount_jian', 'Distance']
#Validation AUC: 0.505, Accuracy: 0.952
#['Merchant_id', 'discount_man', 'Distance', 'user_used_all_coupon_count_y']
#0.71143 Validation AUC: 0.519, Accuracy: 0.953


Validation AUC: 0.511, Accuracy: 0.952


In [59]:
gdbt.feature_importances_

array([0.08782243, 0.00125023, 0.11879599, 0.01733185, 0.20584477,
       0.06165409, 0.25728362, 0.22449781, 0.0159864 , 0.00138897,
       0.00093034, 0.00067002, 0.00134826, 0.00102534, 0.00101335,
       0.00159911, 0.00155743])

In [42]:
exportSubmission("gdbt.csv", gdbt)

(306313, 25)
(306313, 5)
(306313, 4)
                     uid     label
0  1000020_2705_20160519  0.119809
1  1000020_8192_20160513  0.043392
2  1000065_1455_20160527  0.045751
3  1000085_8067_20160513  0.022263
4  1000086_2418_20160613  0.043106


# RandomForestClassifier

In [81]:
predictors = original_feature#['User_id', 'Merchant_id', 'discount_rate', 'discount_man', 'discount_jian', 'Distance']

print(predictors)

['User_id', 'Merchant_id', 'discount_rate', 'discount_type', 'discount_man', 'discount_jian', 'Distance', 'user_used_all_coupon_count_y', 'weekday', 'weekday_type', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7']


In [82]:
# 建立模型 (使用 20 顆樹，每棵樹的最大深度為 4)
model_rfc = RandomForestClassifier(n_estimators=20, max_depth=4)

# 訓練模型
model_rfc.fit(train[predictors], train['label'])

# 預測測試集
pred = model_rfc.predict(valid[predictors])

# 評分
auc_score = roc_auc_score(y_true=valid.label
                          , y_score=pred)

acc = accuracy_score(y_true=valid.label
                     , y_pred=pred)

print("Validation AUC: {:.3f}, Accuracy: {:.3f}".format(auc_score, acc))

pred_rfc = pred
#1 ['User_id', 'Merchant_id', 'discount_rate', 'discount_man', 'discount_jian', 'Distance']
# Validation AUC: 0.500, Accuracy: 0.952
#2 original_feature
# Validation AUC: 0.500, Accuracy: 0.952

Validation AUC: 0.500, Accuracy: 0.952


In [83]:
exportSubmission("rfc.csv", model_rfc)

(306313, 21)
(306313, 18)
(306313, 4)
                     uid     label
0  1000020_2705_20160519  0.150959
1  1000020_8192_20160513  0.148902
2  1000065_1455_20160527  0.087296
3  1000085_8067_20160513  0.105288
4  1000086_2418_20160613  0.109475


# SGDClassifier

In [36]:
predictors = original_feature
print(predictors)

def check_model(data, predictors):
    
    classifier = lambda: SGDClassifier(
        loss='log', 
        penalty='elasticnet', 
        fit_intercept=True, 
        max_iter=100, 
        shuffle=True, 
        n_jobs=1,
        class_weight=None)

    model = Pipeline(steps=[
        ('ss', StandardScaler()),
        ('en', classifier())
    ])

    parameters = {
        'en__alpha': [ 0.001, 0.01, 0.1],
        'en__l1_ratio': [ 0.001, 0.01, 0.1]
    }

    folder = StratifiedKFold(n_splits=3, shuffle=True)
    
    grid_search = GridSearchCV(
        model, 
        parameters, 
        cv=folder, 
        n_jobs=-1, 
        verbose=1)
    grid_search = grid_search.fit(data[predictors], 
                                  data['label'])
    
    return grid_search

['User_id', 'Merchant_id', 'discount_rate', 'discount_type', 'discount_man', 'discount_jian', 'Distance', 'user_used_all_coupon_count_y', 'weekday', 'weekday_type', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7']


In [37]:
model = check_model(train, predictors)

Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:  4.4min finished
  return self.partial_fit(X, y)
  return self.fit(X, y, **fit_params).transform(X)


In [38]:
y_valid_pred = model.predict_proba(valid[predictors])
valid1 = valid.copy()
valid1['pred_prob'] = y_valid_pred[:, 1]

  Xt = transform.transform(Xt)


In [39]:
auc_score = roc_auc_score(y_true=valid.label, y_score=y_valid_pred[:,1])
acc = accuracy_score(y_true=valid.label, y_pred=y_valid_pred.argmax(axis=1))
print("Validation AUC: {:.3f}, Accuracy: {:.3f}".format(auc_score, acc))

Validation AUC: 0.796, Accuracy: 0.952


In [None]:
exportSubmission("sgd.csv", model)

# Blending

In [71]:
exportSubmit_blending("gdbt_xgb_blending.csv", [gdbt, xgbc], [0.65, 0.35])

(306313, 21)
(306313, 18)
(306313, 4)
