### ?? After copying the whole notebook the run came out somewhat different, but it still doesn't score as well as the 2nd-place solution?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import notebook
import datetime

# Prevent broken Korean characters in plots
plt.rc('font',family = 'Malgun Gothic')
plt.rcParams['axes.unicode_minus'] = False

In [2]:
path = './data/'
train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path + 'test.csv')
submission  = pd.read_csv(path + 'submission.csv')

In [3]:
# Dropped all columns except 'store_id', 'date', 'amount'
#
# The main reason is that the data gives no hint about the type of each store (clothing, food, alcohol, etc.)
# or whether stores are of similar types. Hence, these columns were removed from both train & test.
train = train.drop(columns=['time', 'installments', 'days_of_week', 'card_id', 'holyday'])
test = test.drop(columns=['time', 'installments', 'days_of_week', 'card_id', 'holyday'])


# Aggregated data into the sum of amount per each day, per store_id.
train = train.groupby(['date', 'store_id']).agg({'amount':'sum'}).reset_index()
test = test.groupby(['date', 'store_id']).agg({'amount':'sum'}).reset_index()


# 'date' column was converted into datetime format for further uses
train['date'] = pd.to_datetime(train['date'],infer_datetime_format=True)
test['date'] = pd.to_datetime(test['date'],infer_datetime_format=True)


# a duplicate column of 'date' was created for further uses
train['temp_date'] = train['date']
test['temp_date'] = test['date']


# the 'date' column was set to index
train.set_index("date",inplace=True)
test.set_index("date",inplace=True)



# Then, the number of rows per store_id in the train set was counted.
# Any store_id with fewer than 160 rows (= 160 days of data) was removed from the train set.
#
# Since the goal of the 1st Competition was to predict sales for the next 100 days,
# 160 days are needed to split into 60 days of training input (X) & 100 days of target (y).
# The limit of 160 days gave the best results across multiple training trials.
counter = 0
limit = 160

print("Before removing stores (due to limit): ", train.shape)

for x in range(train['store_id'].max()+1): # iterating through each store_id
    if train[train['store_id']==x]['store_id'].count() >= limit:
        counter += 1
    else:
        # drop all rows of a store that has fewer than 'limit' rows
        train = train[train.store_id != x]

print("Total # of stores that exceeds {} is {}".format(limit, counter))
print("After removing stores (due to limit): ", train.shape)



# Next, stores that appear to have gone out of business were removed from the train set.
# If a store had no data within 5 days of 2018-07-31, it was 'assumed' to be closed.
# (2018-07-31 is the last date that stores in the train data are expected to have.)
from datetime import datetime

def keep_alive_store(df):
    
    store_id_list = df.store_id.unique() # list of train store_id
    yes, no = 0, 0 # yes: store has data within 5 days from 2018-07-31 / no: it doesn't

    t2 = datetime.strptime('2018-07-31 00:00:00', "%Y-%m-%d %H:%M:%S")

    for s in store_id_list:
        if str(df[df.store_id == s].iloc[-1]['temp_date']) == '2018-07-31 00:00:00':
            yes += 1
        else:
            # t1 is the last date of data the corresponding store_id has
            t1 = datetime.strptime(str(df[df.store_id == s].iloc[-1]['temp_date']), "%Y-%m-%d %H:%M:%S")
            difference = t2 - t1
            if difference.days <= 5:            
                yes +=1
            else:
                no +=1
                df = df[df.store_id != s] # remove stores that are 'assumed' closed
    print("# of train store open/out of business: ", yes, no)
    return df
    
train = keep_alive_store(train)



# The same goes for the test data, but in a slightly different way.
# If the store had no data within 7 days from 2018-03-31, it was 'assumed' to be closed.
# (2018-03-31 was the last date stores in test data was supposed to have)
store_id_list = test.store_id.unique() # list of test store_id
yes, no = 0, 0 # yes: store has data within 7 days from 2018-03-31 / no: it doesn't
closed_test_store = []

for s in store_id_list:
    t2 = datetime.strptime('2018-03-31 00:00:00', "%Y-%m-%d %H:%M:%S")
    t1 = datetime.strptime(str(test[test.store_id == s].iloc[-1]['temp_date']), "%Y-%m-%d %H:%M:%S")
    difference = t2 - t1
    
    if difference.days <= 7:
        yes+=1
    else:
        no+=1
        print(test[test.store_id == s].iloc[-1]['temp_date'])
        closed_test_store.append(s)
print("# of test store open/out of business: ", yes, no)
# Note that this time the closed stores were not dropped (they still need a prediction);
# their store_ids were saved in the 'closed_test_store' list instead.



# Finally, the train & test data were each passed into 'reform_data(df, isTrain)'.
# Each 'store_id' is pulled into its own dataframe, and any missing dates are filled with an amount of 0.
# Then a set of features is computed to prepare for training/predicting:
# moving averages, the y value (sum over the final 100 days), mean, median, sum, etc.
# They are returned as a list of rows, then converted into a dataframe later.

def reform_data(df, isTrain):
    store_id_list = df.store_id.unique() # list of store_id
    x_array = [] # array to return

    for s in store_id_list: # iterate through each store_id
        store = df[df.store_id == s]
        
        # Filling missing dates with value of 0
        store = store.asfreq('D', fill_value=0)
        store['temp_date'] = store.index
        store['store_id'] = s
        
        # Moving Average columns were added
        store['MA7'] = store['amount'].rolling('7D').mean()
        store['MA15'] = store['amount'].rolling('15D').mean()
        store['MA30'] = store['amount'].rolling('30D').mean()

        # For the train dataframe, the last 100 days were cut off and the sum was stored as the 'y' value.
        # And the remaining data was stored into the dataframe 'store_x' to become the training data.
        if isTrain:
            store_y = store.last('100D') # last 100 days of store data
            y = store_y.amount.sum()
            store_x = store[store.temp_date < store_y.iloc[0].temp_date] # data except last 100 days
        # For the test dataframe, all data was kept as 'store_x'.
        # Of course the 'y' is 0 here because that's to be predicted later on.
        else:
            y = 0
            store_x = store[:]
            
        new_data = [] # array for each store's new data
        new_data.append(s) # store_id
        new_data.append(y) # total sum of last 100 days (answer)

        new_data.append(store_x.amount.mean()) # mean of amount
        new_data.append(store_x.amount.median()) # median of amount
        
        new_data.append(store_x.last('7D').amount.mean()) # mean of amount (last 7 days)
        new_data.append(store_x.last('15D').amount.mean()) # mean of amount (last 15 days)
        new_data.append(store_x.last('30D').amount.mean()) # mean of amount (last 30 days)
        
        new_data.append(store_x.last('7D').amount.median()) # median of amount (last 7 days)
        new_data.append(store_x.last('15D').amount.median()) # median of amount (last 15 days)
        new_data.append(store_x.last('30D').amount.median()) # median of amount (last 30 days)
        
        new_data.append(store_x.last('7D').amount.sum()) # sum of amount (last 7 days)
        new_data.append(store_x.last('15D').amount.sum()) # sum of amount (last 15 days)
        new_data.append(store_x.last('30D').amount.sum()) # sum of amount (last 30 days)
        
        new_data.append(store_x.last('7D').MA7.mean()) # mean of Moving Average of 7D (last 7 days)
        new_data.append(store_x.last('15D').MA7.mean()) # mean of Moving Average of 7D (last 15 days)
        new_data.append(store_x.last('30D').MA7.mean()) # mean of Moving Average of 7D (last 30 days)
        new_data.append(store_x.last('7D').MA15.mean()) # mean of Moving Average of 15D (last 7 days)
        new_data.append(store_x.last('15D').MA15.mean()) # mean of Moving Average of 15D (last 15 days)
        new_data.append(store_x.last('30D').MA15.mean()) # mean of Moving Average of 15D (last 30 days)
        new_data.append(store_x.last('7D').MA30.mean()) # mean of Moving Average of 30D (last 7 days)
        new_data.append(store_x.last('15D').MA30.mean()) # mean of Moving Average of 30D (last 15 days)
        new_data.append(store_x.last('30D').MA30.mean()) # mean of Moving Average of 30D (last 30 days)
        
        x_array.append(new_data) # Append the 'new_data' array in to 'x_array'
        
    return x_array

reformed_train = reform_data(train, True) # train data with new values
reformed_test = reform_data(test, False) # test data with new values


# The returned lists are now converted back into dataframes with column names.
# t => train / r_test => test
t = pd.DataFrame(reformed_train, columns=['store_id', 'y', 'mean', 'median', '7mean', '15mean', '30mean', 
                                         '7median', '15median', '30median',  '7sum', '15sum', '30sum', 
                                         '7ma7mean', '15ma7mean', '30ma7mean',  '7ma15mean', '15ma15mean',
                                         '30ma15mean',  '7ma30mean', '15ma30mean', '30ma30mean'])
r_test = pd.DataFrame(reformed_test, columns=['store_id', 'y', 'mean', 'median', '7mean', '15mean', '30mean', 
                                         '7median', '15median', '30median',  '7sum', '15sum', '30sum', 
                                         '7ma7mean', '15ma7mean', '30ma7mean',  '7ma15mean', '15ma15mean',
                                         '30ma15mean',  '7ma30mean', '15ma30mean', '30ma30mean'])
                                         

# xgboost was used to train the model from train data.
# 'train_test_split' from sklearn was used to split the train data (t)
# into train/test with test_size as 0.1
import xgboost as xgb
from sklearn.model_selection import train_test_split

col = [i for i in t.columns if i not in ['store_id','y']]
y = 'y'

train_x, train_cv, y, y_cv = train_test_split(t[col],t[y], test_size=0.1, random_state=2018)


# The model was trained with xgboost parameters as shown below.
# Parameters were chosen after several trials of optimization.
#
# ('num_rounds' & 'early_stopping_rounds' were given relatively big numbers
# since the training doesn't take much computing power.)
def XGB_regressor(train_X, train_y, test_X, test_y, feature_names=None, seed_val=2018, num_rounds=3000):
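    param = {}
    param['objective'] = 'reg:linear'  # deprecated alias; newer XGBoost releases call this 'reg:squarederror'
    param['eta'] = 0.05
    param['max_depth'] = 10
    param['silent'] = 1  # not recognized by newer XGBoost (see the warning in the output); 'verbosity' replaces it
    param['eval_metric'] = 'mae'
    param['min_child_weight'] = 1
    param['subsample'] = 0.7
    param['colsample_bytree'] = 0.7
    param['seed'] = seed_val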

    plst = list(param.items())

    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]
        model = xgb.train(plst, xgtrain, num_rounds, watchlist, early_stopping_rounds=300)
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)
        
    return model

# We now have the trained model!
model = XGB_regressor(train_X = train_x, train_y = y, test_X = train_cv, test_y = y_cv)


# (Optional) Feature Importance can be checked to see which column affected the model more.
# This was used for a quick check when optimizing columns.
'''
from matplotlib import pylab as plt

fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
'''

# The r_test (test dataset) has to be sorted since they are mixed at the moment.
r_test = r_test.sort_values(by='store_id')

# And then passed into the model to predict the answer for the competition as 'y_test'
# This is the answer array to be submitted... after a few adjustments.
y_test = model.predict(xgb.DMatrix(r_test[col]), ntree_limit = model.best_ntree_limit)


# As mentioned above, the data may have missing dates in between.
# To reflect this in 'y_test', the days with sales in the last two months (February, March)
# of test data are counted per store_id, averaged over the two months, and divided by 31
# to give the fraction of days each store was active.
# This is needed because some stores tend to be closed often throughout the month.
store_id_list = test.store_id.unique() # list of test store_id
store_id_list.sort() # sort the list into order

feb_march = [] # saved here

for s in store_id_list:
    mini = test[test.store_id == s]
    march = mini[mini.temp_date >= '2018-03-01']['temp_date'].count()
    feb = mini[mini.temp_date >= '2018-02-01']['temp_date'].count() - march
    feb_march.append((march+feb)/2/31)

    
# Before applying 'feb_march', the 'closed_test_store' list from above is used.
# Stores 'assumed' to be closed are given 0 for the future 100-day prediction.
# Note: 'y_test[c]' indexes by position, so this assumes the test store_ids are exactly 0..199.
for c in closed_test_store:
    y_test[c] = 0


# At last, the 'y_test' predicted by the trained model is adjusted with 'feb_march'.
#
# The final factor 0.72 is a tuned constant that scales predictions down so that they do not exceed the answer.
# This is needed because such over-predictions are penalized.
for x in range(200):
    y_test[x] = y_test[x] * feb_march[x] * 0.72
    

# For submission, the 'y_test' is saved into the 'total_sales' column of 'submission.csv'.
submission['total_sales'] = y_test
submission.to_csv('submission.csv', index=False)
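
The post-processing at the end of the cell indexes `y_test` by position (`y_test[c]`, `range(200)`), which is only correct if the test store IDs are exactly 0..199 in sorted order; the notebook does not confirm this. The following is a minimal sketch, on toy values, of the same three steps (zero out closed stores, scale by the activity ratio, scale by 0.72) aligned by `store_id` label instead. The helper name `adjust_predictions` and the toy numbers are assumptions for illustration, not part of the original solution.

```python
import numpy as np
import pandas as pd

def adjust_predictions(store_ids, raw_pred, activity_ratio, closed_ids, scale=0.72):
    """Zero out closed stores, then scale by activity ratio and a constant factor."""
    index = pd.Index(store_ids, name="store_id")
    pred = pd.Series(np.asarray(raw_pred, dtype=float), index=index)
    ratio = pd.Series(np.asarray(activity_ratio, dtype=float), index=index)
    # Label-based assignment: works even when store_ids are not 0..n-1
    pred.loc[list(closed_ids)] = 0.0
    return pred * ratio * scale

# Toy data (hypothetical): three stores with non-contiguous IDs
store_ids = [3, 7, 12]
raw_pred = [1000.0, 2000.0, 3000.0]
activity_ratio = [1.0, 0.5, 0.25]

adjusted = adjust_predictions(store_ids, raw_pred, activity_ratio, closed_ids=[7])
print(adjusted)
```

Keeping the predictions in a Series indexed by `store_id` means a missing or reordered ID raises a `KeyError` rather than silently adjusting the wrong store.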

Before removing stores (due to limit):  (480160, 3)
Total # of stores that exceeds 160 is 997
After removing stores (due to limit):  (432671, 3)
# of train store open/out of business:  906 91
2018-03-21 00:00:00
2018-01-17 00:00:00
2018-03-23 00:00:00
2018-03-23 00:00:00
# of test store open/out of business:  196 4
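
The open/closed counts above come from comparing each store's last recorded date against a cutoff date. A minimal vectorized sketch of that check on toy data follows, assuming a frame with `store_id` and `temp_date` columns as in the cell above. `find_closed_stores` is a hypothetical helper name, and it takes the latest date per store with `max()` rather than the notebook's `iloc[-1]`, which gives the same result when rows are sorted by date.

```python
import pandas as pd

def find_closed_stores(df, end_date, max_gap_days):
    """Return store_ids whose last recorded date is more than max_gap_days before end_date."""
    last_seen = df.groupby("store_id")["temp_date"].max()
    gap = pd.Timestamp(end_date) - last_seen  # Series of timedeltas, one per store
    return sorted(gap[gap.dt.days > max_gap_days].index.tolist())

# Toy data (hypothetical): store 0 is active to the end, stores 1 and 2 stop early
toy = pd.DataFrame({
    "store_id": [0, 0, 1, 1, 2],
    "temp_date": pd.to_datetime(
        ["2018-03-30", "2018-03-31", "2018-01-10", "2018-01-17", "2018-03-23"]
    ),
})

closed = find_closed_stores(toy, "2018-03-31", max_gap_days=7)
print(closed)
```

With a 7-day threshold, a store last seen on 2018-03-23 (8 days before the cutoff) counts as closed, which matches the dates printed in the output above.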




Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0]	train-mae:70772.46875	test-mae:70710.32031
Multiple eval metrics have been passed: 'test-mae' will be used for early stopping.

Will train until test-mae hasn't improved in 300 rounds.
[1]	train-mae:67505.38281	test-mae:67590.89844
[2]	train-mae:64409.16016	test-mae:64492.65234
[3]	train-mae:61383.18359	test-mae:61572.80078
[4]	train-mae:58580.32031	test-mae:58923.32422
[5]	train-mae:55928.76172	test-mae:56306.30859
[6]	train-mae:53254.17578	test-mae:54006.64453
[7]	train-mae:50770.47266	test-mae:51756.55859
[8]	train-mae:48372.60938	test-mae:49591.07812
[9]	train-mae:46174.16797	test-mae:47588.50000
[10]	train-mae:44117.31641	test-mae:45709.23438
[11]	train-mae:42122.52344	test-mae:43883.47656
[12]	train

[158]	train-mae:2338.22534	test-mae:13720.31738
[159]	train-mae:2327.63916	test-mae:13723.21582
[160]	train-mae:2303.59937	test-mae:13739.49023
[161]	train-mae:2290.41260	test-mae:13739.31348
[162]	train-mae:2264.68115	test-mae:13744.42285
[163]	train-mae:2239.85181	test-mae:13746.12891
[164]	train-mae:2223.26929	test-mae:13738.67871
[165]	train-mae:2197.31470	test-mae:13738.88281
[166]	train-mae:2184.39600	test-mae:13732.58496
[167]	train-mae:2166.76880	test-mae:13735.75098
[168]	train-mae:2145.35205	test-mae:13739.58203
[169]	train-mae:2125.53271	test-mae:13737.83398
[170]	train-mae:2110.86157	test-mae:13725.13281
[171]	train-mae:2100.62622	test-mae:13727.79394
[172]	train-mae:2073.72803	test-mae:13734.20019
[173]	train-mae:2067.89673	test-mae:13731.41309
[174]	train-mae:2060.59521	test-mae:13730.99707
[175]	train-mae:2041.10339	test-mae:13726.97266
[176]	train-mae:2025.53748	test-mae:13726.30176
[177]	train-mae:2009.53638	test-mae:13734.97363
[178]	train-mae:1998.82727	test-mae:1373

[331]	train-mae:576.59338	test-mae:13795.63574
[332]	train-mae:571.29376	test-mae:13794.85156
[333]	train-mae:567.17871	test-mae:13795.20019
[334]	train-mae:564.83148	test-mae:13793.68359
[335]	train-mae:555.88379	test-mae:13796.45508
[336]	train-mae:552.34686	test-mae:13799.36719
[337]	train-mae:548.89453	test-mae:13796.58008
[338]	train-mae:544.92334	test-mae:13797.33496
[339]	train-mae:540.00275	test-mae:13795.29492
[340]	train-mae:537.07465	test-mae:13796.23731
[341]	train-mae:532.38055	test-mae:13797.95703
[342]	train-mae:529.72357	test-mae:13799.98047
[343]	train-mae:527.54999	test-mae:13800.12891
[344]	train-mae:522.29773	test-mae:13800.09863
[345]	train-mae:517.48852	test-mae:13802.62598
[346]	train-mae:515.21155	test-mae:13802.38965
[347]	train-mae:512.46277	test-mae:13804.32129
[348]	train-mae:506.19540	test-mae:13804.60547
[349]	train-mae:501.63260	test-mae:13805.74609
[350]	train-mae:499.36523	test-mae:13806.98340
[351]	train-mae:493.76959	test-mae:13805.30664
[352]	train-m