<h1>LightGBM</h1>
<h2>Import</h2>

In [1]:
import glob
import lightgbm as lgb
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

<h2>SMAPE calculation</h2><br>
@param<br>
y_true = array of actual values<br>
y_pred = array of predicted values<br>

In [2]:
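def smape_fast(y_true, y_pred):
    """Return the symmetric mean absolute percentage error (SMAPE) in percent."""
    out = 0
    # shape[0] = number of rows (data)
    for i in range(y_true.shape[0]):
        a = y_true[i]
        b = y_pred[i]
        # Use absolute values so a negative prediction cannot shrink or flip the denominator
        c = math.fabs(a) + math.fabs(b)
        # Skip pairs where both values are 0 to avoid dividing by zero
        if c == 0:
            continue
        out += math.fabs(a - b) / c
    out *= (200.0 / y_true.shape[0])
    return out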
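
As a quick sanity check on the definition, the sketch below recomputes SMAPE on a tiny hand-made example using only the standard library. The function name `smape_reference` and the sample values are illustrative assumptions, not part of the notebook. The denominator uses absolute values, which gives the same result as `a + b` whenever both values are non-negative.

```python
import math

def smape_reference(y_true, y_pred):
    """SMAPE in percent: (200 / n) * sum(|a - b| / (|a| + |b|)), skipping 0/0 pairs."""
    total = 0.0
    for a, b in zip(y_true, y_pred):
        denom = math.fabs(a) + math.fabs(b)
        if denom == 0:
            continue  # both values are zero: the pair contributes no error
        total += math.fabs(a - b) / denom
    return total * 200.0 / len(y_true)

# Hypothetical sample: one perfect prediction, one zero pair, one miss
actual = [2.0, 0.0, 1.0]
predicted = [2.0, 0.0, 3.0]
print(smape_reference(actual, predicted))
```

Skipped zero pairs still count in `n`, so they pull the average down rather than being excluded from it. The result always lies between 0 (perfect) and 200 (every prediction maximally wrong).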

<h2>Evaluation printing format 1</h2><br>
@param<br>
array = target array to append<br>
type_eval = string of evaluation type<br>
eval_values = array of evaluation values<br><br>
Each evaluation array holds the values for MA2, MA3 and MA4, in that order.<br>
Index 0 is always 0 because each array is initialised with np.zeros(1).<br>
Indices 1 - 3 are MA2 DFma_1: without CD, with CD, and % improved respectively.<br>
Indices 4 - 6 are MA2 DFma_2.<br>
...<br>
Indices 16 - 18 are MA2 DFma_6.<br>
Indices 19 - 21 are MA3 DFma_1.<br>
...<br><br>
Each DFma_X has 3 values per MA for each type of evaluation.<br>
Each MA has 18 values in total (6 DFma_X times 3 values).
<h3>Thus,</h3>the formula for calculating the index of the correct value is:<br>
<h4>array[3 * (which DFma_X in range [0, 5]) + 18 * (which MA_X in range [0, 2]) + 1]</h4>
<h4>array[3 * (which DFma_X in range [0, 5]) + 18 * (which MA_X in range [0, 2]) + 2]</h4>
<h4>array[3 * (which DFma_X in range [0, 5]) + 18 * (which MA_X in range [0, 2]) + 3]</h4>
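
The index formula can be checked with a few lines of plain Python. The helper name `eval_index` and its argument names are assumptions made for this illustration only.

```python
def eval_index(df_idx, ma_idx, kind):
    """Flat index into an evaluation array (format 1).

    df_idx: 0..5 for DFma_1..DFma_6
    ma_idx: 0..2 for MA2..MA4
    kind:   1 = without CD, 2 = with CD, 3 = % improved
    """
    return 3 * df_idx + 18 * ma_idx + kind

# MA2 DFma_1 occupies indices 1-3, MA2 DFma_6 occupies 16-18, MA3 DFma_1 starts at 19
print([eval_index(0, 0, k) for k in (1, 2, 3)])  # [1, 2, 3]
print([eval_index(5, 0, k) for k in (1, 2, 3)])  # [16, 17, 18]
print(eval_index(0, 1, 1))                       # 19
```

The largest index is `eval_index(5, 2, 3)`, which is 54, so a fully populated array has 55 entries including the placeholder at index 0.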

In [3]:
def evaluation_print(array, type_eval, eval_values):
    # One output row per DFma_X: the label followed by 3 values for each of MA2, MA3 and MA4
    for i in range(6):
        eval_arr = np.asarray([type_eval + ' DFma_' + str(i + 1)])
        for j in range(3):
            eval_arr = np.append(eval_arr, [eval_values[3 * i + 18 * j + 1],
                                            eval_values[3 * i + 18 * j + 2],
                                            eval_values[3 * i + 18 * j + 3]])
        array = np.append(array, [eval_arr], axis = 0)
    return array

<h2>Evaluation printing format 2</h2><br>
@param<br>
array = target array to append<br>
type_eval = string of evaluation type<br>
eval_values = array of evaluation values<br><br>
Instead of multiple MAs, there is only one dataset for each DF.<br>
Index 0 is always 0 because each array is initialised with np.zeros(1).<br>
Indices 1 - 3 are DF_1: without CD, with CD, and % improved respectively.<br>
Indices 4 - 6 are DF_2.<br>
...<br>
Indices 16 - 18 are DF_6.<br><br>
Each DF has 3 values in total.
<h3>Thus,</h3>the formula for calculating the index of the correct value is:<br>
<h4>array[3 * (which DF_X in range [0, 5]) + 1]</h4>
<h4>array[3 * (which DF_X in range [0, 5]) + 2]</h4>
<h4>array[3 * (which DF_X in range [0, 5]) + 3]</h4>
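
To make the layout concrete, the sketch below slices a flat list into labelled rows of three values, skipping the placeholder at index 0. The helper name `rows_from_flat` and the numbers are hypothetical and chosen only for illustration.

```python
def rows_from_flat(values, labels):
    """Split a flat list [0, v1, v2, v3, v4, ...] into labelled rows of 3 values."""
    rows = []
    for i, label in enumerate(labels):
        start = 3 * i + 1  # skip the placeholder at index 0
        rows.append([label] + list(values[start:start + 3]))
    return rows

# Hypothetical values: placeholder 0 followed by two DF blocks
# (without CD, with CD, % improved stored as a fraction)
flat = [0, 0.30, 0.24, 0.20, 0.40, 0.36, 0.10]
for row in rows_from_flat(flat, ['RMSE DF_1', 'RMSE DF_2']):
    print(row)
```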

In [4]:
def evaluation_print_original(array, type_eval, eval_values):
    # One output row per DF_X: the label followed by without CD, with CD and % improved
    for i in range(6):
        eval_arr = np.asarray([type_eval + ' DF_' + str(i + 1)])
        eval_arr = np.append(eval_arr, [eval_values[3 * i + 1],
                                        eval_values[3 * i + 2],
                                        eval_values[3 * i + 3]])
        array = np.append(array, [eval_arr], axis = 0)
    return array

<h2>Evaluation printing format 3</h2><br>
@param<br>
array = target array to append<br>
type_eval = string of evaluation type<br>
eval_values = array of evaluation values<br><br>
This format is for the modified lags, where the response variable is always DFma_1 for the MA datasets or DF_1 for the dataset without smoothing.<br>
Index 0 is always 0 because each array is initialised with np.zeros(1).<br>
Indices 1 - 3 are the 1-week-ahead horizon: without CD, with CD, and % improved respectively.<br>
Indices 4 - 6 are the 2-weeks-ahead horizon.<br>
...<br>
Indices 16 - 18 are the 6-weeks-ahead horizon.<br><br>
Each time horizon has 3 values per MA, so 9 values in total across MA2, MA3 and MA4.
<h3>Thus,</h3>the formula for calculating the index of the correct value is:<br>
<h4>array[3 * (which time horizon in range [0, 5]) + 18 * (which MA_X in range [0, 2]) + 1]</h4>
<h4>array[3 * (which time horizon in range [0, 5]) + 18 * (which MA_X in range [0, 2]) + 2]</h4>
<h4>array[3 * (which time horizon in range [0, 5]) + 18 * (which MA_X in range [0, 2]) + 3]</h4>
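
Going the other way can help when debugging a stored array: the sketch below decodes a flat index back into its MA window, time horizon and value type. The function name `decode_index` is an assumption for this illustration. It simply inverts the formula above with two `divmod` calls.

```python
def decode_index(idx):
    """Map a flat index (1..54) back to (MA label, weeks ahead, value type)."""
    ma_idx, rest = divmod(idx - 1, 18)       # 18 values per MA
    horizon_idx, kind_idx = divmod(rest, 3)  # 3 values per time horizon
    kinds = ('without CD', 'with CD', '% improved')
    return 'MA' + str(ma_idx + 2), horizon_idx + 1, kinds[kind_idx]

print(decode_index(1))   # ('MA2', 1, 'without CD')
print(decode_index(19))  # ('MA3', 1, 'without CD')
print(decode_index(54))  # ('MA4', 6, '% improved')
```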

In [5]:
def evaluation_print_modified_lag(array, type_eval, eval_values):
    # One output row per time horizon: the label followed by 3 values for each of MA2, MA3 and MA4
    for i in range(6):
        eval_arr = np.asarray([type_eval + ' ' + str(i + 1) + '-week ahead'])
        for j in range(3):
            eval_arr = np.append(eval_arr, [eval_values[3 * i + 18 * j + 1],
                                            eval_values[3 * i + 18 * j + 2],
                                            eval_values[3 * i + 18 * j + 3]])
        array = np.append(array, [eval_arr], axis = 0)
    return array

In [6]:
def evaluation_print_modified_lag_original(array, type_eval, eval_values):
    # Same as above for the dataset without smoothing: one row per time horizon with 3 values
    for i in range(6):
        eval_arr = np.asarray([type_eval + ' ' + str(i + 1) + '-week ahead'])
        eval_arr = np.append(eval_arr, [eval_values[3 * i + 1],
                                        eval_values[3 * i + 2],
                                        eval_values[3 * i + 3]])
        array = np.append(array, [eval_arr], axis = 0)
    return array

<h2>Variables that you need to change before running the code</h2>
<b>province</b> = 'Bangkok', 'NST' or 'Krabi'<br>
<b>number of leaves</b> = 25, 31 or 70<br>
<b>data set destination</b> = '...bangkok...', '...nakhon...' or '...krabi...'
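
The province settings feed directly into the file paths used in the cells that follow. As a rough sketch, the input paths for one moving-average window could be assembled as shown here. The helper name `input_paths` and its parameter names are hypothetical; the path pattern is copied from the training cells in this notebook.

```python
def input_paths(province_dir, province_tag, ma_window):
    """Return (train, test) CSV paths for one moving-average window."""
    base = 'Data/' + province_dir + '/Normal Lags/'
    suffix = '_' + province_tag + '_dist_total_mavg' + str(ma_window) + '.csv'
    return base + 'train' + suffix, base + 'test' + suffix

train_path, test_path = input_paths('Bangkok', 'bangkok', 2)
print(train_path)  # Data/Bangkok/Normal Lags/train_bangkok_dist_total_mavg2.csv
print(test_path)   # Data/Bangkok/Normal Lags/test_bangkok_dist_total_mavg2.csv
```

Building both paths in one place makes it harder for the train and test files to drift apart when the province or window changes.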

In [109]:
num_leaves = 70
province1 = 'Bangkok'
province2 = 'bangkok'

<h1>Normal Lags</h1>

- Predict DFma_1 to DFma_6<br>
- Predict DF_1 to DF_6

<h2>District level</h2>
For MAs (adjusted CD)
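
The cell below reports a "% improved" value for every metric as (without CD - with CD) / without CD. The sketch here shows one possible way to make the sign convention explicit. The function name `relative_improvement` and the `higher_is_better` flag are assumptions for illustration and do not appear in the notebook, which applies the same formula to all four metrics. Dividing by the absolute baseline is also a choice of this sketch, and it matches the notebook whenever the baseline is positive.

```python
def relative_improvement(without_cd, with_cd, higher_is_better=False):
    """Fractional improvement of the with-CD score over the without-CD score.

    Returns None when the baseline is 0, because the ratio is undefined.
    A positive result always means that adding CD helped.
    """
    if without_cd == 0:
        return None
    change = (without_cd - with_cd) / abs(without_cd)
    return -change if higher_is_better else change

# Error metric (lower is better): the error drops from 4.0 to 3.0
print(relative_improvement(4.0, 3.0))                          # 0.25
# R-squared (higher is better): the score rises from 0.5 to 0.75
print(relative_improvement(0.5, 0.75, higher_is_better=True))  # 0.5
```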

In [110]:
# Arrays of all evaluation values
# row: head,
# RMSE (DFma_1 - DFma_6), 
# MAE (DFma_1 - DFma_6), 
# SMAPE (DFma_1 - DFma_6), 
# R-squared (DFma_1 - DFma_6)

# col: head,
# MA2 (without CD, with CD, % improved),
# MA3 (without CD, with CD, % improved),
# MA4 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'MA2 without CD', 'MA2 with CD', 'MA2 % improved', 
                         'MA3 without CD', 'MA3 with CD', 'MA3 % improved', 
                         'MA4 without CD', 'MA4 with CD', 'MA4 % improved']])
rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

# Starting from MA2 to MA4
for i in range(2, 5):
    # Get the input variables from CSV file
    # Change files directory here
    train_file_dir = 'Data/' + province1 + '/Normal Lags/train_' + province2 + '_dist_total_mavg' + str(i) + '.csv'
    test_file_dir = 'Data/' + province1 + '/Normal Lags/test_' + province2 + '_dist_total_mavg' + str(i) + '.csv'
    
    df_train_dist =  pd.read_csv(train_file_dir, header = 0, skiprows = 0)
    df_test_dist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

    # Continue on DFma_1 to DFma_6
    for j in range(6):
        # Allocate the column of addrcode, week, year and actual values first
        df_test_addrcode_week_year_dist = df_test_dist.iloc[:, [1, 2, 3, 19 - j]]
        
        ## Without CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 20],
        # DFma_wm1 [col 21],
        # DFma_wm2 [col 22],
        # DFma_wm3 [col 23],
        # RF_wm6 [col 24],
        # and LST_wm4 [col 25]
        
        ## With CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 20],
        # DFma_wm1 [col 21],
        # DFma_wm2 [col 22],
        # DFma_wm3 [col 23],
        # RF_wm6 [col 24],
        # LST_wm4 [col 25],
        # bin [col 26],
        # bowl [col 27],
        # bucket [col 28],
        # misc_short [col 29],
        # jar [col 30],
        # pottedplant [col 31],
        # tire [col 32],
        # misc_tall [col 33],
        # and total [col 34]
        
        x_train_withoutCD = df_train_dist.iloc[:, [20, 21, 22, 23, 24, 25]]
        x_train_withCD = df_train_dist.iloc[:, [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
        
        x_test_withoutCD = df_test_dist.iloc[:, [20, 21, 22, 23, 24, 25]]
        x_test_withCD = df_test_dist.iloc[:, [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
        
        # y: response (target) variable DFma_1 to DFma_6 [col 19 -> 14]
        y_train = df_train_dist.iloc[:, [19 - j]]
        y_test = df_test_dist.iloc[:, [19 - j]]
        
        # Pass the response values to the array for evaluation calculation
        y_train_true = np.array(df_train_dist['DFma_' + str(j + 1)])
        y_test_true = np.array(df_test_dist['DFma_' + str(j + 1)])
        
        # Pass the dataset of both independent and response variables to Light GBM
        lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
        lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
        lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
        lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

        params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2', 'l1'},
        'num_leaves': num_leaves,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
        }

        # Train the model
        print('Starting training...')
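        # Early stopping is set through a callback: LightGBM 4.0 removed the early_stopping_rounds argument of train()
        gbm_withoutCD = lgb.train(params,
                    lgb_train_withoutCD,
                    num_boost_round = 20,
                    valid_sets = [lgb_eval_withoutCD],
                    callbacks = [lgb.early_stopping(stopping_rounds = 6)])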
        #print('Saving model...')
        # Save model to file
        #gbm_withoutCD.save_model('model.txt')
        
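        # Early stopping is set through a callback: LightGBM 4.0 removed the early_stopping_rounds argument of train()
        gbm_withCD = lgb.train(params,
                    lgb_train_withCD,
                    num_boost_round = 20,
                    valid_sets = [lgb_eval_withCD],
                    callbacks = [lgb.early_stopping(stopping_rounds = 6)])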
        #print('Saving model...')
        # Save model to file
        #gbm_withCD.save_model('model.txt')

        # Predict out by using test data
        print('Starting predicting...')
        y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
        y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

        df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
        df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
        # Store all of the predicted values to the CSV files
        df_compare_addrcode_dist_withoutCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withoutCD], axis = 1)
        df_compare_addrcode_dist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_dist_withoutCD.to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' 
                                                  + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' 
                                                  + str(i) + '_DFma_' + str(j + 1) + '_withoutCD_' + str(num_leaves) 
                                                  + '.csv', encoding = 'utf-8')

        df_compare_addrcode_dist_withCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withCD], axis = 1)
        df_compare_addrcode_dist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_dist_withCD.to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' 
                                               + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' 
                                               + str(i) + '_DFma_' + str(j + 1) + '_withCD_' + str(num_leaves) + '.csv', 
                                               encoding = 'utf-8')
        
        # Evaluation
        rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
        mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
        r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
        smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
        #print('RMSE of the prediction without CD is:', rmse_withoutCD)
        #print('MAE of the prediction without CD is:', mae_withoutCD)
        #print('R-squared of the prediction without CD is:', r2_withoutCD)
        #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
        rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
        mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
        r2_withCD = r2_score(y_test_true, y_pred_withCD)
        smape_withCD = smape_fast(y_test_true, y_pred_withCD)
        #print('RMSE of the prediction with CD is:', rmse_withCD)
        #print('MAE of the prediction with CD is:', mae_withCD)
        #print('R-squared of the prediction with CD is:', r2_withCD)
        #print('SMAPE of the prediction with CD is:', smape_withCD)
        
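        # Relative improvement from adding CD, stored as a fraction (0.05 = 5 %)
        # For the error metrics (RMSE, MAE, SMAPE) a positive value means CD reduced the error
        # R-squared is better when higher, so with this formula a positive value means it decreased
        rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
        mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
        smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
        r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD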
        #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        #print(eval_array)
        
        rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
        smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
        r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
        df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                   + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' + str(i) + '_DFma_' + str(j + 1) 
                                   + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
        df_withCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' + str(i) + '_DFma_' + str(j + 1) 
                                + '_withCD_' + str(num_leaves) + '.csv', header = 0)
        
        dist_code = df_train_dist['addrcode'].unique()
        
        # For each district
        for k in dist_code:
            
            # Get the subset of actual and predicted values according to the district code
            subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == k]
            subset_withCD = df_withCD.loc[df_withCD['addrcode'] == k]
            
            # Pass the response values to the array for evaluation calculation
            array_true = np.array(subset_withoutCD['actual'])
            array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
            array_pred_withCD = np.array(subset_withCD['predicted'])
            
            # Calculate the evaluation values
            rmse_withoutCD_dist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
            mae_withoutCD_dist = mean_absolute_error(array_true, array_pred_withoutCD)
            smape_withoutCD_dist = smape_fast(array_true, array_pred_withoutCD)
            r2_withoutCD_dist = r2_score(array_true, array_pred_withoutCD)
            
            rmse_withCD_dist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
            mae_withCD_dist = mean_absolute_error(array_true, array_pred_withCD)
            smape_withCD_dist = smape_fast(array_true, array_pred_withCD)
            r2_withCD_dist = r2_score(array_true, array_pred_withCD)
            
            rmse_percent_improved_dist = (rmse_withoutCD_dist - rmse_withCD_dist) / rmse_withoutCD_dist
            mae_percent_improved_dist = (mae_withoutCD_dist - mae_withCD_dist) / mae_withoutCD_dist
            smape_percent_improved_dist = (smape_withoutCD_dist - smape_withCD_dist) / smape_withoutCD_dist
            r2_percent_improved_dist = (r2_withoutCD_dist - r2_withCD_dist) / r2_withoutCD_dist
            
            # Append
            dist_array = np.append(dist_array, [[k, rmse_withoutCD_dist, rmse_withCD_dist, rmse_percent_improved_dist,
                                                mae_withoutCD_dist, mae_withCD_dist, mae_percent_improved_dist,
                                                smape_withoutCD_dist, smape_withCD_dist, smape_percent_improved_dist,
                                                r2_withoutCD_dist, r2_withCD_dist, r2_percent_improved_dist]], axis = 0)

        #print(dist_array)
        pd.DataFrame(dist_array).to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                        + '/MA' + str(i) + '/LGBM_' + province2 + '_ByDistrict_MA' + str(i) + '_DFma_' 
                                        + str(j + 1) + '_eval_' + str(num_leaves) + '.csv', header = False, 
                                        encoding = 'utf-8')
        
        # Clear the old memory to store a new one
        dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DFma_1 to R squared DFma_6
eval_array = evaluation_print(eval_array, 'RMSE', rmse)
eval_array = evaluation_print(eval_array, 'MAE', mae)
eval_array = evaluation_print(eval_array, 'SMAPE', smape)
eval_array = evaluation_print(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/LGBM_' + province2 + '_dist_eval_' + str(num_leaves) + '.csv', header = False, 
                                encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.294086	valid_0's l2: 0.116316
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.28157	valid_0's l2: 0.106822
[3]	valid_0's l1: 0.26954	valid_0's l2: 0.0980169
[4]	valid_0's l1: 0.258517	valid_0's l2: 0.0903174
[5]	valid_0's l1: 0.247967	valid_0's l2: 0.0832162
[6]	valid_0's l1: 0.238166	valid_0's l2: 0.0768718
[7]	valid_0's l1: 0.230261	valid_0's l2: 0.0721112
[8]	valid_0's l1: 0.221453	valid_0's l2: 0.0668732
[9]	valid_0's l1: 0.213312	valid_0's l2: 0.0622892
[10]	valid_0's l1: 0.205632	valid_0's l2: 0.0580527
[11]	valid_0's l1: 0.198283	valid_0's l2: 0.0541466
[12]	valid_0's l1: 0.191429	valid_0's l2: 0.0506908
[13]	valid_0's l1: 0.185257	valid_0's l2: 0.0477157
[14]	valid_0's l1: 0.179455	valid_0's l2: 0.0449754
[15]	valid_0's l1: 0.174003	valid_0's l2: 0.0425092
[16]	valid_0's l1: 0.168961	valid_0's l2: 0.040294
[17]	valid_0's l1: 0.165328	valid_0's l2: 0.0388748
[18]	valid_0's l1: 0.161159	valid_0's l2: 0.0372

Starting training...
[1]	valid_0's l1: 0.28241	valid_0's l2: 0.112094
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.27411	valid_0's l2: 0.105892
[3]	valid_0's l1: 0.267041	valid_0's l2: 0.100624
[4]	valid_0's l1: 0.261373	valid_0's l2: 0.0966179
[5]	valid_0's l1: 0.256436	valid_0's l2: 0.0934745
[6]	valid_0's l1: 0.252469	valid_0's l2: 0.0910145
[7]	valid_0's l1: 0.249984	valid_0's l2: 0.0896444
[8]	valid_0's l1: 0.24768	valid_0's l2: 0.088581
[9]	valid_0's l1: 0.245823	valid_0's l2: 0.0880672
[10]	valid_0's l1: 0.243943	valid_0's l2: 0.0878685
[11]	valid_0's l1: 0.242087	valid_0's l2: 0.0877056
[12]	valid_0's l1: 0.240234	valid_0's l2: 0.0877367
[13]	valid_0's l1: 0.239256	valid_0's l2: 0.0883811
[14]	valid_0's l1: 0.238541	valid_0's l2: 0.0894472
[15]	valid_0's l1: 0.238036	valid_0's l2: 0.0907084
[16]	valid_0's l1: 0.23793	valid_0's l2: 0.0925221
[17]	valid_0's l1: 0.238338	valid_0's l2: 0.0947761
Early stopping, best iteration is:
[11]	valid_0's l

Starting training...
[1]	valid_0's l1: 0.276051	valid_0's l2: 0.104113
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.265441	valid_0's l2: 0.09648
[3]	valid_0's l1: 0.255547	valid_0's l2: 0.0895576
[4]	valid_0's l1: 0.246969	valid_0's l2: 0.0838434
[5]	valid_0's l1: 0.239014	valid_0's l2: 0.078639
[6]	valid_0's l1: 0.231922	valid_0's l2: 0.0741559
[7]	valid_0's l1: 0.226127	valid_0's l2: 0.0706941
[8]	valid_0's l1: 0.220363	valid_0's l2: 0.0673972
[9]	valid_0's l1: 0.215419	valid_0's l2: 0.0645717
[10]	valid_0's l1: 0.210945	valid_0's l2: 0.0621735
[11]	valid_0's l1: 0.206506	valid_0's l2: 0.0599668
[12]	valid_0's l1: 0.202572	valid_0's l2: 0.0581569
[13]	valid_0's l1: 0.199189	valid_0's l2: 0.0566948
[14]	valid_0's l1: 0.195986	valid_0's l2: 0.0553803
[15]	valid_0's l1: 0.193091	valid_0's l2: 0.0543385
[16]	valid_0's l1: 0.190639	valid_0's l2: 0.0535618
[17]	valid_0's l1: 0.188778	valid_0's l2: 0.0532112
[18]	valid_0's l1: 0.187093	valid_0's l2: 0.052

Starting training...
[1]	valid_0's l1: 0.273515	valid_0's l2: 0.100029
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.260147	valid_0's l2: 0.0904881
[3]	valid_0's l1: 0.247561	valid_0's l2: 0.0819568
[4]	valid_0's l1: 0.235636	valid_0's l2: 0.074271
[5]	valid_0's l1: 0.224397	valid_0's l2: 0.0673593
[6]	valid_0's l1: 0.213699	valid_0's l2: 0.0611097
[7]	valid_0's l1: 0.203991	valid_0's l2: 0.0557182
[8]	valid_0's l1: 0.194553	valid_0's l2: 0.0507126
[9]	valid_0's l1: 0.185557	valid_0's l2: 0.046144
[10]	valid_0's l1: 0.177198	valid_0's l2: 0.042106
[11]	valid_0's l1: 0.169249	valid_0's l2: 0.0384473
[12]	valid_0's l1: 0.161792	valid_0's l2: 0.0351665
[13]	valid_0's l1: 0.154689	valid_0's l2: 0.032202
[14]	valid_0's l1: 0.148097	valid_0's l2: 0.0295824
[15]	valid_0's l1: 0.141925	valid_0's l2: 0.0272467
[16]	valid_0's l1: 0.136178	valid_0's l2: 0.0251524
[17]	valid_0's l1: 0.130928	valid_0's l2: 0.0233483
[18]	valid_0's l1: 0.125831	valid_0's l2: 0.0216

Starting training...
[1]	valid_0's l1: 0.259057	valid_0's l2: 0.0935911
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.249615	valid_0's l2: 0.0872757
[3]	valid_0's l1: 0.241707	valid_0's l2: 0.0821899
[4]	valid_0's l1: 0.235676	valid_0's l2: 0.0781348
[5]	valid_0's l1: 0.231446	valid_0's l2: 0.0751815
[6]	valid_0's l1: 0.22808	valid_0's l2: 0.0729629
[7]	valid_0's l1: 0.225705	valid_0's l2: 0.071619
[8]	valid_0's l1: 0.223064	valid_0's l2: 0.0704577
[9]	valid_0's l1: 0.221237	valid_0's l2: 0.0700622
[10]	valid_0's l1: 0.219277	valid_0's l2: 0.0697912
[11]	valid_0's l1: 0.217282	valid_0's l2: 0.0694661
[12]	valid_0's l1: 0.215627	valid_0's l2: 0.0695672
[13]	valid_0's l1: 0.214658	valid_0's l2: 0.0702863
[14]	valid_0's l1: 0.21372	valid_0's l2: 0.0709396
[15]	valid_0's l1: 0.21304	valid_0's l2: 0.0719316
[16]	valid_0's l1: 0.212083	valid_0's l2: 0.0725901
[17]	valid_0's l1: 0.212196	valid_0's l2: 0.0741386
Early stopping, best iteration is:
[11]	valid_0

For original DF_0 (without smoothing, adjusted CD)
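
Besides the overall scores, the cell below also scores each district (`addrcode`) separately by filtering the stored predictions one district at a time. The same grouping idea can be sketched with the standard library alone. The function name `rmse_by_district`, the district codes and the values are hypothetical.

```python
from collections import defaultdict
import math

def rmse_by_district(rows):
    """rows: iterable of (addrcode, actual, predicted). Returns {addrcode: RMSE}."""
    squared_errors = defaultdict(list)
    for addrcode, actual, predicted in rows:
        squared_errors[addrcode].append((actual - predicted) ** 2)
    return {code: math.sqrt(sum(errs) / len(errs))
            for code, errs in squared_errors.items()}

# Hypothetical district codes and values
sample = [(1001, 1.0, 1.0), (1001, 3.0, 1.0), (1002, 2.0, 5.0)]
print(rmse_by_district(sample))
```

Grouping in a single pass only produces entries for districts that actually appear in the predictions, so a district present in the training data but absent from the test data cannot yield an empty subset.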

In [111]:
# Arrays of all evaluation values
# row: head,
# RMSE (DF_1 - DF_6), 
# MAE (DF_1 - DF_6), 
# SMAPE (DF_1 - DF_6), 
# R-squared (DF_1 - DF_6)

# col: head,
# DF_0 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'Without CD', 'With CD', '% improved']])

rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

# Get the input variables from CSV file
# Change files directory here
train_file_dir = 'Data/' + province1 + '/Normal Lags/train_' + province2 + '_dist_total_mavg2.csv'
test_file_dir = 'Data/' + province1 + '/Normal Lags/test_' + province2 + '_dist_total_mavg2.csv'

df_train_dist =  pd.read_csv(train_file_dir, header = 0, skiprows = 0)
df_test_dist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

# From DF_1 to DF_6
for i in range(6):
    # Allocate the column of addrcode, week, year and actual values first
    # (the actual values are the response column DF_1 to DF_6, i.e. col 9 - i, not DF_0 at col 10)
    df_test_addrcode_week_year_dist = df_test_dist.iloc[:, [1, 2, 3, 9 - i]]
    
    ## Without CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 10],
    # DF_wm1 [col 11], 
    # DF_wm2 [col 12],
    # DF_wm3 [col 13],
    # RF_wm6 [col 24],
    # and LST_wm4 [col 25]
        
    ## With CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 10],
    # DF_wm1 [col 11], 
    # DF_wm2 [col 12],
    # DF_wm3 [col 13],
    # RF_wm6 [col 24],
    # LST_wm4 [col 25],
    # bin_pop9s [col 26],
    # bowl_pop9s [col 27],
    # bucket_pop9s [col 28],
    # misc_short_pop9s [col 29],
    # jar_pop9s [col 30],
    # pottedplant_pop9s [col 31],
    # tire_pop9s [col 32],
    # misc_tall_pop9s [col 33],
    # and total_pop9s [col 34]
        
    x_train_withoutCD = df_train_dist.iloc[:, [10, 11, 12, 13, 24, 25]]
    x_train_withCD = df_train_dist.iloc[:, [10, 11, 12, 13, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
    
    x_test_withoutCD = df_test_dist.iloc[:, [10, 11, 12, 13, 24, 25]]
    x_test_withCD = df_test_dist.iloc[:, [10, 11, 12, 13, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
    
    # y: response (target) variable from DF_1 to DF_6 (col 9 -> col 4)
    y_train = df_train_dist.iloc[:, [9 - i]]
    y_test = df_test_dist.iloc[:, [9 - i]]
    
    # Pass the response values to the array for evaluation calculation
    y_train_true = np.array(df_train_dist['DF_' + str(i + 1)])
    y_test_true = np.array(df_test_dist['DF_' + str(i + 1)])
    
    # Pass the dataset of both independent and response variables to Light GBM
    lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
    lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
    lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
    lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

    params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': num_leaves,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
    }

    # Train the model
    print('Starting training...')
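    # Early stopping is set through a callback: LightGBM 4.0 removed the early_stopping_rounds argument of train()
    gbm_withoutCD = lgb.train(params,
                lgb_train_withoutCD,
                num_boost_round = 20,
                valid_sets = [lgb_eval_withoutCD],
                callbacks = [lgb.early_stopping(stopping_rounds = 6)])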
    #print('Saving model...')
    # Save model to file
    #gbm_withoutCD.save_model('model.txt')
    
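    # Early stopping is set through a callback: LightGBM 4.0 removed the early_stopping_rounds argument of train()
    gbm_withCD = lgb.train(params,
                lgb_train_withCD,
                num_boost_round = 20,
                valid_sets = [lgb_eval_withCD],
                callbacks = [lgb.early_stopping(stopping_rounds = 6)])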
    #print('Saving model...')
    # Save model to file
    #gbm_withCD.save_model('model.txt')
    
    # Predict out by using test data
    print('Starting predicting...')
    y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
    y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

    df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
    df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
    # Store all of the predicted values to the CSV files
    df_compare_addrcode_dist_withoutCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withoutCD], axis = 1)
    df_compare_addrcode_dist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_dist_withoutCD.to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' 
                                              + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' 
                                              + str(i + 1) + '_withoutCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')

    df_compare_addrcode_dist_withCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withCD], axis = 1)
    df_compare_addrcode_dist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_dist_withCD.to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' 
                                           + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' 
                                           + str(i + 1) + '_withCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')
    
    # Evaluation
    rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
    mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
    r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
    smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
    #print('RMSE of the prediction without CD is:', rmse_withoutCD)
    #print('MAE of the prediction without CD is:', mae_withoutCD)
    #print('R-squared of the prediction without CD is:', r2_withoutCD)
    #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
    rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
    mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
    r2_withCD = r2_score(y_test_true, y_pred_withCD)
    smape_withCD = smape_fast(y_test_true, y_pred_withCD)
    #print('RMSE of the prediction with CD is:', rmse_withCD)
    #print('MAE of the prediction with CD is:', mae_withCD)
    #print('R-squared of the prediction with CD is:', r2_withCD)
    #print('SMAPE of the prediction with CD is:', smape_withCD)
        
    rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
    mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
    smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
    r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD
    #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    #print(eval_array)
        
    rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
    smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
    r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
    #df_withoutCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withoutCD.csv', header = 0)
    #df_withCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withCD.csv', header = 0)
    
    df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                               + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) + '_withoutCD_' 
                               + str(num_leaves) + '.csv', header = 0)
    df_withCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                            + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) + '_withCD_' 
                            + str(num_leaves) + '.csv', header = 0)
        
    dist_code = df_train_dist['addrcode'].unique()
    
    # For each district
    for j in dist_code:
            
        # Get the subset of actual and predicted values according to the district code
        subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == j]
        subset_withCD = df_withCD.loc[df_withCD['addrcode'] == j]
            
        # Pass the response values to the array for evaluation calculation
        array_true = np.array(subset_withoutCD['actual'])
        array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
        array_pred_withCD = np.array(subset_withCD['predicted'])
            
        # Calculate the evaluation values
        rmse_withoutCD_dist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
        mae_withoutCD_dist = mean_absolute_error(array_true, array_pred_withoutCD)
        smape_withoutCD_dist = smape_fast(array_true, array_pred_withoutCD)
        r2_withoutCD_dist = r2_score(array_true, array_pred_withoutCD)
            
        rmse_withCD_dist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
        mae_withCD_dist = mean_absolute_error(array_true, array_pred_withCD)
        smape_withCD_dist = smape_fast(array_true, array_pred_withCD)
        r2_withCD_dist = r2_score(array_true, array_pred_withCD)
            
        rmse_percent_improved_dist = (rmse_withoutCD_dist - rmse_withCD_dist) / rmse_withoutCD_dist
        mae_percent_improved_dist = (mae_withoutCD_dist - mae_withCD_dist) / mae_withoutCD_dist
        smape_percent_improved_dist = (smape_withoutCD_dist - smape_withCD_dist) / smape_withoutCD_dist
        # R-squared is higher-is-better, so the sign is flipped relative to the error metrics
        r2_percent_improved_dist = (r2_withCD_dist - r2_withoutCD_dist) / abs(r2_withoutCD_dist)
            
        # Append
        dist_array = np.append(dist_array, [[j, rmse_withoutCD_dist, rmse_withCD_dist, rmse_percent_improved_dist,
                                            mae_withoutCD_dist, mae_withCD_dist, mae_percent_improved_dist,
                                            smape_withoutCD_dist, smape_withCD_dist, smape_percent_improved_dist,
                                            r2_withoutCD_dist, r2_withCD_dist, r2_percent_improved_dist]], axis = 0)

    #print(dist_array)
    pd.DataFrame(dist_array).to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                    + '/Original DF_0/LGBM_' + province2 + '_ByDistrict_Original_DF_' + str(i + 1) 
                                    + '_eval_' + str(num_leaves) + '.csv', header = False, encoding = 'utf-8')
        
    # Reset dist_array to its header row before evaluating the next target
    dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                              'MAE without CD', 'MAE with CD', '% improved MAE', 
                              'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                              'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DF_1 to R squared DF_6
eval_array = evaluation_print_original(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_original(eval_array, 'MAE', mae)
eval_array = evaluation_print_original(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_original(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/Original DF_0/LGBM_' + province2 + '_dist_original_eval_' + str(num_leaves) 
                                + '.csv', header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.320476	valid_0's l2: 0.144048
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.310479	valid_0's l2: 0.136154
[3]	valid_0's l1: 0.3009	valid_0's l2: 0.128727
[4]	valid_0's l1: 0.292464	valid_0's l2: 0.122536
[5]	valid_0's l1: 0.284402	valid_0's l2: 0.116608
[6]	valid_0's l1: 0.277052	valid_0's l2: 0.111426
[7]	valid_0's l1: 0.270324	valid_0's l2: 0.106816
[8]	valid_0's l1: 0.263979	valid_0's l2: 0.102709
[9]	valid_0's l1: 0.258178	valid_0's l2: 0.0991075
[10]	valid_0's l1: 0.252509	valid_0's l2: 0.0956931
[11]	valid_0's l1: 0.24751	valid_0's l2: 0.0929538
[12]	valid_0's l1: 0.242876	valid_0's l2: 0.0904785
[13]	valid_0's l1: 0.23887	valid_0's l2: 0.0884331
[14]	valid_0's l1: 0.234894	valid_0's l2: 0.0865239
[15]	valid_0's l1: 0.231166	valid_0's l2: 0.0846408
[16]	valid_0's l1: 0.227919	valid_0's l2: 0.083149
[17]	valid_0's l1: 0.225068	valid_0's l2: 0.0819192
[18]	valid_0's l1: 0.222278	valid_0's l2: 0.0807971
[19]

Starting training...
[1]	valid_0's l1: 0.311406	valid_0's l2: 0.140187
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.30414	valid_0's l2: 0.134518
[3]	valid_0's l1: 0.298473	valid_0's l2: 0.13
[4]	valid_0's l1: 0.293904	valid_0's l2: 0.126715
[5]	valid_0's l1: 0.289932	valid_0's l2: 0.124174
[6]	valid_0's l1: 0.286324	valid_0's l2: 0.122163
[7]	valid_0's l1: 0.283461	valid_0's l2: 0.120599
[8]	valid_0's l1: 0.280807	valid_0's l2: 0.119602
[9]	valid_0's l1: 0.278894	valid_0's l2: 0.11924
[10]	valid_0's l1: 0.277175	valid_0's l2: 0.119357
[11]	valid_0's l1: 0.275837	valid_0's l2: 0.119451
[12]	valid_0's l1: 0.274886	valid_0's l2: 0.120214
[13]	valid_0's l1: 0.274198	valid_0's l2: 0.120947
[14]	valid_0's l1: 0.273656	valid_0's l2: 0.121989
[15]	valid_0's l1: 0.273194	valid_0's l2: 0.123295
Early stopping, best iteration is:
[9]	valid_0's l1: 0.278894	valid_0's l2: 0.11924
[1]	valid_0's l1: 0.311349	valid_0's l2: 0.140113
Training until validation scores d
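
The `*_percent_improved` values computed in the cell above are fractions of the without-CD baseline. RMSE, MAE and SMAPE are lower-is-better, while R-squared is higher-is-better, so one way to keep "positive means CD helped" consistent across all four metrics is a small helper. This is only a sketch: `relative_improvement` and `higher_is_better` are hypothetical names that do not appear in the notebook.

```python
def relative_improvement(baseline, candidate, higher_is_better=False):
    """Fractional improvement of candidate over baseline.

    A positive result always means the candidate (e.g. with CD) is better.
    """
    if baseline == 0:
        return float('nan')  # undefined when the baseline is zero
    if higher_is_better:
        delta = candidate - baseline   # e.g. R-squared
    else:
        delta = baseline - candidate   # e.g. RMSE, MAE, SMAPE
    return delta / abs(baseline)


# Error metric (lower is better): baseline 0.30, candidate 0.27
print(relative_improvement(0.30, 0.27))
# R-squared (higher is better): baseline 0.50, candidate 0.60
print(relative_improvement(0.50, 0.60, higher_is_better=True))
```

Dividing by `abs(baseline)` keeps the sign meaningful even when the baseline R-squared is negative.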

For MAs (Normal CD)

In [112]:
# Arrays of all evaluation values
# row: head,
# RMSE (DFma_1 - DFma_6), 
# MAE (DFma_1 - DFma_6), 
# SMAPE (DFma_1 - DFma_6), 
# R-squared (DFma_1 - DFma_6)

# col: head,
# MA2 (without CD, with CD, % improved),
# MA3 (without CD, with CD, % improved),
# MA4 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'MA2 without CD', 'MA2 with CD', 'MA2 % improved', 
                         'MA3 without CD', 'MA3 with CD', 'MA3 % improved', 
                         'MA4 without CD', 'MA4 with CD', 'MA4 % improved']])
rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

# Starting from MA2 to MA4
for i in range(2, 5):
    # Get the input variables from CSV file
    # Change files directory here
    train_file_dir = 'Data/' + province1 + '/Normal Lags/train_' + province2 + '_dist_cd_mavg' + str(i) + '.csv'
    test_file_dir = 'Data/' + province1 + '/Normal Lags/test_' + province2 + '_dist_cd_mavg' + str(i) + '.csv'
    
    df_train_dist =  pd.read_csv(train_file_dir, header=0, skiprows=0)
    df_test_dist = pd.read_csv(test_file_dir, header=0, skiprows=0)

    # Continue on DFma_1 to DFma_6
    for j in range(6):
        # Allocate the column of addrcode, week, year and actual values first
        df_test_addrcode_week_year_dist = df_test_dist.iloc[:, [1, 2, 3, 19 - j]]
        
        ## Without CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 20],
        # DFma_wm1 [col 21],
        # DFma_wm2 [col 22],
        # DFma_wm3 [col 23],
        # RF_wm6 [col 24],
        # and LST_wm4 [col 25]
        
        ## With CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 20],
        # DFma_wm1 [col 21],
        # DFma_wm2 [col 22],
        # DFma_wm3 [col 23],
        # RF_wm6 [col 24],
        # LST_wm4 [col 25],
        # bin_pop9s [col 26],
        # bowl_pop9s [col 27],
        # bucket_pop9s [col 28],
        # misc_short_pop9s [col 29],
        # jar_pop9s [col 30],
        # pottedplant_pop9s [col 31],
        # tire_pop9s [col 32],
        # misc_tall_pop9s [col 33],
        # and total_pop9s [col 34]
        
        x_train_withoutCD = df_train_dist.iloc[:, [20, 21, 22, 23, 24, 25]]
        x_train_withCD = df_train_dist.iloc[:, [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
        
        x_test_withoutCD = df_test_dist.iloc[:, [20, 21, 22, 23, 24, 25]]
        x_test_withCD = df_test_dist.iloc[:, [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
        
        # y: response (target) variable from DFma_1 to DFma_6 (col 19 -> col 14)
        y_train = df_train_dist.iloc[:, [19 - j]]
        y_test = df_test_dist.iloc[:, [19 - j]]
        
        # Pass the response values to the array for evaluation calculation
        y_train_true = np.array(df_train_dist['DFma_' + str(j + 1)])
        y_test_true = np.array(df_test_dist['DFma_' + str(j + 1)])
        
        # Pass the dataset of both independent and response variables to Light GBM
        lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
        lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
        lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
        lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

        params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2', 'l1'},
        'num_leaves': num_leaves,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
        }

        # Train the model
        print('Starting training...')
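        # early_stopping_rounds is no longer accepted by lgb.train() in LightGBM 4.0+;
        # the early_stopping callback also works in older releases
        gbm_withoutCD = lgb.train(params,
                    lgb_train_withoutCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withoutCD,
                    callbacks = [lgb.early_stopping(stopping_rounds = 6)])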
        #print('Saving model...')
        # Save model to file
        #gbm_withoutCD.save_model('model.txt')
        
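        # early_stopping_rounds is no longer accepted by lgb.train() in LightGBM 4.0+;
        # the early_stopping callback also works in older releases
        gbm_withCD = lgb.train(params,
                    lgb_train_withCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withCD,
                    callbacks = [lgb.early_stopping(stopping_rounds = 6)])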
        #print('Saving model...')
        # Save model to file
        #gbm_withCD.save_model('model.txt')

        # Predict on the test data using the best iteration found by early stopping
        print('Starting predicting...')
        y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
        y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

        df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
        df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
        # Store all of the predicted values to the CSV files
        df_compare_addrcode_dist_withoutCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withoutCD], axis = 1)
        # A flat list keeps a plain column Index; a nested list would create a MultiIndex
        df_compare_addrcode_dist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_dist_withoutCD.to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' 
                                                  + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' 
                                                  + str(i) + '_DFma_' + str(j + 1) + '_withoutCD_' + str(num_leaves) 
                                                  + '.csv', encoding = 'utf-8')

        df_compare_addrcode_dist_withCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withCD], axis = 1)
        # A flat list keeps a plain column Index; a nested list would create a MultiIndex
        df_compare_addrcode_dist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_dist_withCD.to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' 
                                               + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' 
                                               + str(i) + '_DFma_' + str(j + 1) + '_withCD_' + str(num_leaves) 
                                               + '.csv', encoding = 'utf-8')
        
        # Evaluation
        rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
        mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
        r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
        smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
        #print('RMSE of the prediction without CD is:', rmse_withoutCD)
        #print('MAE of the prediction without CD is:', mae_withoutCD)
        #print('R-squared of the prediction without CD is:', r2_withoutCD)
        #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
        rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
        mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
        r2_withCD = r2_score(y_test_true, y_pred_withCD)
        smape_withCD = smape_fast(y_test_true, y_pred_withCD)
        #print('RMSE of the prediction with CD is:', rmse_withCD)
        #print('MAE of the prediction with CD is:', mae_withCD)
        #print('R-squared of the prediction with CD is:', r2_withCD)
        #print('SMAPE of the prediction with CD is:', smape_withCD)
        
        rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
        mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
        smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
        # R-squared is higher-is-better, so the sign is flipped relative to the error metrics
        r2_percent_improved = (r2_withCD - r2_withoutCD) / abs(r2_withoutCD)
        #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        #print(eval_array)
        
        rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
        smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
        r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
        df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                   + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' + str(i) + '_DFma_' + str(j + 1) 
                                   + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
        df_withCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' + str(i) + '_DFma_' + str(j + 1) 
                                + '_withCD_' + str(num_leaves) + '.csv', header = 0)
        
        dist_code = df_train_dist['addrcode'].unique()
        
        # For each district
        for k in dist_code:
            
            # Get the subset of actual and predicted values according to the district code
            subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == k]
            subset_withCD = df_withCD.loc[df_withCD['addrcode'] == k]
            
            # Pass the response values to the array for evaluation calculation
            array_true = np.array(subset_withoutCD['actual'])
            array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
            array_pred_withCD = np.array(subset_withCD['predicted'])
            
            # Calculate the evaluation values
            rmse_withoutCD_dist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
            mae_withoutCD_dist = mean_absolute_error(array_true, array_pred_withoutCD)
            smape_withoutCD_dist = smape_fast(array_true, array_pred_withoutCD)
            r2_withoutCD_dist = r2_score(array_true, array_pred_withoutCD)
            
            rmse_withCD_dist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
            mae_withCD_dist = mean_absolute_error(array_true, array_pred_withCD)
            smape_withCD_dist = smape_fast(array_true, array_pred_withCD)
            r2_withCD_dist = r2_score(array_true, array_pred_withCD)
            
            rmse_percent_improved_dist = (rmse_withoutCD_dist - rmse_withCD_dist) / rmse_withoutCD_dist
            mae_percent_improved_dist = (mae_withoutCD_dist - mae_withCD_dist) / mae_withoutCD_dist
            smape_percent_improved_dist = (smape_withoutCD_dist - smape_withCD_dist) / smape_withoutCD_dist
            # R-squared is higher-is-better, so the sign is flipped relative to the error metrics
            r2_percent_improved_dist = (r2_withCD_dist - r2_withoutCD_dist) / abs(r2_withoutCD_dist)
            
            # Append
            dist_array = np.append(dist_array, [[k, rmse_withoutCD_dist, rmse_withCD_dist, rmse_percent_improved_dist,
                                                mae_withoutCD_dist, mae_withCD_dist, mae_percent_improved_dist,
                                                smape_withoutCD_dist, smape_withCD_dist, smape_percent_improved_dist,
                                                r2_withoutCD_dist, r2_withCD_dist, r2_percent_improved_dist]], axis = 0)

        #print(dist_array)
        pd.DataFrame(dist_array).to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                        + '/MA' + str(i) + '/LGBM_' + province2 + '_ByDistrict_MA' + str(i) + '_DFma_' 
                                        + str(j + 1) + '_eval_' + str(num_leaves) + '.csv', header = False, 
                                        encoding = 'utf-8')
        
        # Reset dist_array to its header row before evaluating the next target
        dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                                  'MAE without CD', 'MAE with CD', '% improved MAE', 
                                  'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                                  'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DFma_1 to R squared DFma_6
eval_array = evaluation_print(eval_array, 'RMSE', rmse)
eval_array = evaluation_print(eval_array, 'MAE', mae)
eval_array = evaluation_print(eval_array, 'SMAPE', smape)
eval_array = evaluation_print(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/LGBM_' + province2 + '_dist_eval_' + str(num_leaves) + '.csv', 
                                header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.294086	valid_0's l2: 0.116316
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.28157	valid_0's l2: 0.106822
[3]	valid_0's l1: 0.26954	valid_0's l2: 0.0980169
[4]	valid_0's l1: 0.258517	valid_0's l2: 0.0903174
[5]	valid_0's l1: 0.247967	valid_0's l2: 0.0832162
[6]	valid_0's l1: 0.238166	valid_0's l2: 0.0768718
[7]	valid_0's l1: 0.230261	valid_0's l2: 0.0721112
[8]	valid_0's l1: 0.221453	valid_0's l2: 0.0668732
[9]	valid_0's l1: 0.213312	valid_0's l2: 0.0622892
[10]	valid_0's l1: 0.205632	valid_0's l2: 0.0580527
[11]	valid_0's l1: 0.198283	valid_0's l2: 0.0541466
[12]	valid_0's l1: 0.191429	valid_0's l2: 0.0506908
[13]	valid_0's l1: 0.185257	valid_0's l2: 0.0477157
[14]	valid_0's l1: 0.179455	valid_0's l2: 0.0449754
[15]	valid_0's l1: 0.174003	valid_0's l2: 0.0425092
[16]	valid_0's l1: 0.168961	valid_0's l2: 0.040294
[17]	valid_0's l1: 0.165328	valid_0's l2: 0.0388748
[18]	valid_0's l1: 0.161159	valid_0's l2: 0.0372

Starting training...
[1]	valid_0's l1: 0.28241	valid_0's l2: 0.112094
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.27411	valid_0's l2: 0.105892
[3]	valid_0's l1: 0.267041	valid_0's l2: 0.100624
[4]	valid_0's l1: 0.261373	valid_0's l2: 0.0966179
[5]	valid_0's l1: 0.256436	valid_0's l2: 0.0934745
[6]	valid_0's l1: 0.252469	valid_0's l2: 0.0910145
[7]	valid_0's l1: 0.249984	valid_0's l2: 0.0896444
[8]	valid_0's l1: 0.24768	valid_0's l2: 0.088581
[9]	valid_0's l1: 0.245823	valid_0's l2: 0.0880672
[10]	valid_0's l1: 0.243943	valid_0's l2: 0.0878685
[11]	valid_0's l1: 0.242087	valid_0's l2: 0.0877056
[12]	valid_0's l1: 0.240234	valid_0's l2: 0.0877367
[13]	valid_0's l1: 0.239256	valid_0's l2: 0.0883811
[14]	valid_0's l1: 0.238541	valid_0's l2: 0.0894472
[15]	valid_0's l1: 0.238036	valid_0's l2: 0.0907084
[16]	valid_0's l1: 0.23793	valid_0's l2: 0.0925221
[17]	valid_0's l1: 0.238338	valid_0's l2: 0.0947761
Early stopping, best iteration is:
[11]	valid_0's l

Starting training...
[1]	valid_0's l1: 0.276051	valid_0's l2: 0.104113
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.265441	valid_0's l2: 0.09648
[3]	valid_0's l1: 0.255547	valid_0's l2: 0.0895576
[4]	valid_0's l1: 0.246969	valid_0's l2: 0.0838434
[5]	valid_0's l1: 0.239014	valid_0's l2: 0.078639
[6]	valid_0's l1: 0.231922	valid_0's l2: 0.0741559
[7]	valid_0's l1: 0.226127	valid_0's l2: 0.0706941
[8]	valid_0's l1: 0.220363	valid_0's l2: 0.0673972
[9]	valid_0's l1: 0.215419	valid_0's l2: 0.0645717
[10]	valid_0's l1: 0.210945	valid_0's l2: 0.0621735
[11]	valid_0's l1: 0.206506	valid_0's l2: 0.0599668
[12]	valid_0's l1: 0.202572	valid_0's l2: 0.0581569
[13]	valid_0's l1: 0.199189	valid_0's l2: 0.0566948
[14]	valid_0's l1: 0.195986	valid_0's l2: 0.0553803
[15]	valid_0's l1: 0.193091	valid_0's l2: 0.0543385
[16]	valid_0's l1: 0.190639	valid_0's l2: 0.0535618
[17]	valid_0's l1: 0.188778	valid_0's l2: 0.0532112
[18]	valid_0's l1: 0.187093	valid_0's l2: 0.052

Starting training...
[1]	valid_0's l1: 0.273515	valid_0's l2: 0.100029
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.260147	valid_0's l2: 0.0904881
[3]	valid_0's l1: 0.247561	valid_0's l2: 0.0819568
[4]	valid_0's l1: 0.235636	valid_0's l2: 0.074271
[5]	valid_0's l1: 0.224397	valid_0's l2: 0.0673593
[6]	valid_0's l1: 0.213699	valid_0's l2: 0.0611097
[7]	valid_0's l1: 0.203991	valid_0's l2: 0.0557182
[8]	valid_0's l1: 0.194553	valid_0's l2: 0.0507126
[9]	valid_0's l1: 0.185557	valid_0's l2: 0.046144
[10]	valid_0's l1: 0.177198	valid_0's l2: 0.042106
[11]	valid_0's l1: 0.169249	valid_0's l2: 0.0384473
[12]	valid_0's l1: 0.161792	valid_0's l2: 0.0351665
[13]	valid_0's l1: 0.154689	valid_0's l2: 0.032202
[14]	valid_0's l1: 0.148097	valid_0's l2: 0.0295824
[15]	valid_0's l1: 0.141925	valid_0's l2: 0.0272467
[16]	valid_0's l1: 0.136178	valid_0's l2: 0.0251524
[17]	valid_0's l1: 0.130928	valid_0's l2: 0.0233483
[18]	valid_0's l1: 0.125831	valid_0's l2: 0.0216

Starting training...
[1]	valid_0's l1: 0.259057	valid_0's l2: 0.0935911
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.249615	valid_0's l2: 0.0872757
[3]	valid_0's l1: 0.241707	valid_0's l2: 0.0821899
[4]	valid_0's l1: 0.235676	valid_0's l2: 0.0781348
[5]	valid_0's l1: 0.231446	valid_0's l2: 0.0751815
[6]	valid_0's l1: 0.22808	valid_0's l2: 0.0729629
[7]	valid_0's l1: 0.225705	valid_0's l2: 0.071619
[8]	valid_0's l1: 0.223064	valid_0's l2: 0.0704577
[9]	valid_0's l1: 0.221237	valid_0's l2: 0.0700622
[10]	valid_0's l1: 0.219277	valid_0's l2: 0.0697912
[11]	valid_0's l1: 0.217282	valid_0's l2: 0.0694661
[12]	valid_0's l1: 0.215627	valid_0's l2: 0.0695672
[13]	valid_0's l1: 0.214658	valid_0's l2: 0.0702863
[14]	valid_0's l1: 0.21372	valid_0's l2: 0.0709396
[15]	valid_0's l1: 0.21304	valid_0's l2: 0.0719316
[16]	valid_0's l1: 0.212083	valid_0's l2: 0.0725901
[17]	valid_0's l1: 0.212196	valid_0's l2: 0.0741386
Early stopping, best iteration is:
[11]	valid_0
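
The per-district loop in the cell above filters the saved predictions one `addrcode` at a time and grows `dist_array` with `np.append`. As a sketch under assumptions, the same RMSE and MAE could be computed in one pass with a pandas `groupby`. The function name `district_metrics` and the toy values are hypothetical; only the column names `addrcode`, `actual` and `predicted` come from the notebook, and SMAPE and R-squared are left out for brevity.

```python
import numpy as np
import pandas as pd


def district_metrics(df_pred):
    """Per-district RMSE and MAE from 'addrcode', 'actual', 'predicted' columns."""
    err = df_pred['predicted'] - df_pred['actual']
    grouped = pd.DataFrame({
        'addrcode': df_pred['addrcode'],
        'sq_err': err ** 2,
        'abs_err': err.abs(),
    }).groupby('addrcode')
    # One row per district, indexed by addrcode
    return pd.DataFrame({
        'RMSE': np.sqrt(grouped['sq_err'].mean()),
        'MAE': grouped['abs_err'].mean(),
    })


# Toy data, made up for illustration only
toy = pd.DataFrame({
    'addrcode': [8001, 8001, 8002, 8002],
    'actual': [1.0, 3.0, 2.0, 4.0],
    'predicted': [1.0, 1.0, 3.0, 3.0],
})
print(district_metrics(toy))
```

Calling the function once on the with-CD predictions and once on the without-CD predictions, then joining the two results on `addrcode`, would give the same table shape as `dist_array` without reading the CSV files back in.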

For original DF_0 (without smoothing, normal CD)

In [113]:
# Arrays of all evaluation values
# row: head,
# RMSE (DF_1 - DF_6), 
# MAE (DF_1 - DF_6), 
# SMAPE (DF_1 - DF_6), 
# R-squared (DF_1 - DF_6)

# col: head,
# DF_0 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'Without CD', 'With CD', '% improved']])

rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

train_file_dir = 'Data/' + province1 + '/Normal Lags/train_' + province2 + '_dist_cd_mavg2.csv'
test_file_dir = 'Data/' + province1 + '/Normal Lags/test_' + province2 + '_dist_cd_mavg2.csv'

df_train_dist =  pd.read_csv(train_file_dir, header = 0, skiprows = 0)
df_test_dist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

# From DF_1 to DF_6
for i in range(6):
    # Allocate the column of addrcode, week, year and actual values first
    # The 'actual' column must match the target DF_(i + 1), which sits at col 9 - i
    df_test_addrcode_week_year_dist = df_test_dist.iloc[:, [1, 2, 3, 9 - i]]
    
    ## Without CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 10],
    # DF_wm1 [col 11], 
    # DF_wm2 [col 12],
    # DF_wm3 [col 13],
    # RF_wm6 [col 24],
    # and LST_wm4 [col 25]
        
    ## With CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 10],
    # DF_wm1 [col 11], 
    # DF_wm2 [col 12],
    # DF_wm3 [col 13],
    # RF_wm6 [col 24],
    # LST_wm4 [col 25],
    # bin [col 26],
    # bowl [col 27],
    # bucket [col 28],
    # misc_short [col 29],
    # jars [col 30],
    # pottedplant [col 31],
    # tire [col 32],
    # misc_tall [col 33],
    # and total [col 34]
        
    x_train_withoutCD = df_train_dist.iloc[:, [10, 11, 12, 13, 24, 25]]
    x_train_withCD = df_train_dist.iloc[:, [10, 11, 12, 13, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
    
    x_test_withoutCD = df_test_dist.iloc[:, [10, 11, 12, 13, 24, 25]]
    x_test_withCD = df_test_dist.iloc[:, [10, 11, 12, 13, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
    
    # y: response (target) variable from DF_1 to DF_6 (col 9 -> col 4)
    y_train = df_train_dist.iloc[:, [9 - i]]
    y_test = df_test_dist.iloc[:, [9 - i]]
    
    # Pass the response values to the array for evaluation calculation
    y_train_true = np.array(df_train_dist['DF_' + str(i + 1)])
    y_test_true = np.array(df_test_dist['DF_' + str(i + 1)])
    
    # Pass the dataset of both independent and response variables to Light GBM
    lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
    lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
    lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
    lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

    params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': num_leaves,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
    }

    # Train the model
    print('Starting training...')
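    # early_stopping_rounds is no longer accepted by lgb.train() in LightGBM 4.0+;
    # the early_stopping callback also works in older releases
    gbm_withoutCD = lgb.train(params,
                lgb_train_withoutCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withoutCD,
                callbacks = [lgb.early_stopping(stopping_rounds = 6)])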
    #print('Saving model...')
    # Save model to file
    #gbm_withoutCD.save_model('model.txt')
    
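    # early_stopping_rounds is no longer accepted by lgb.train() in LightGBM 4.0+;
    # the early_stopping callback also works in older releases
    gbm_withCD = lgb.train(params,
                lgb_train_withCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withCD,
                callbacks = [lgb.early_stopping(stopping_rounds = 6)])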
    #print('Saving model...')
    # Save model to file
    #gbm_withCD.save_model('model.txt')
    
    # Predict on the test data using the best iteration found by early stopping
    print('Starting predicting...')
    y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
    y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

    df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
    df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
    # Store all of the predicted values to the CSV files
    df_compare_addrcode_dist_withoutCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withoutCD], axis = 1)
    # A flat list keeps a plain column Index; a nested list would create a MultiIndex
    df_compare_addrcode_dist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_dist_withoutCD.to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                              + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) 
                                              + '_withoutCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')

    df_compare_addrcode_dist_withCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withCD], axis = 1)
    # A flat list keeps a plain column Index; a nested list would create a MultiIndex
    df_compare_addrcode_dist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_dist_withCD.to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                           + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) 
                                           + '_withCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')
    
    # Evaluation
    rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
    mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
    r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
    smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
    #print('RMSE of the prediction without CD is:', rmse_withoutCD)
    #print('MAE of the prediction without CD is:', mae_withoutCD)
    #print('R-squared of the prediction without CD is:', r2_withoutCD)
    #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
    rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
    mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
    r2_withCD = r2_score(y_test_true, y_pred_withCD)
    smape_withCD = smape_fast(y_test_true, y_pred_withCD)
    #print('RMSE of the prediction with CD is:', rmse_withCD)
    #print('MAE of the prediction with CD is:', mae_withCD)
    #print('R-squared of the prediction with CD is:', r2_withCD)
    #print('SMAPE of the prediction with CD is:', smape_withCD)
        
    rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
    mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
    smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
    # R-squared is higher-is-better, so the sign is flipped relative to the error metrics
    r2_percent_improved = (r2_withCD - r2_withoutCD) / abs(r2_withoutCD)
    #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    #print(eval_array)
        
    rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
    smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
    r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
    #df_withoutCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withoutCD.csv', header = 0)
    #df_withCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withCD.csv', header = 0)
    
    df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                               + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) + '_withoutCD_' 
                               + str(num_leaves) + '.csv', header = 0)
    df_withCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                            + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) + '_withCD_' 
                            + str(num_leaves) + '.csv', header = 0)
    
    dist_code = df_train_dist['addrcode'].unique()
        
    # For each district
    for j in dist_code:
            
        # Get the subset of actual and predicted values according to the district code
        subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == j]
        subset_withCD = df_withCD.loc[df_withCD['addrcode'] == j]
            
        # Pass the response values to the array for evaluation calculation
        array_true = np.array(subset_withoutCD['actual'])
        array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
        array_pred_withCD = np.array(subset_withCD['predicted'])
            
        # Calculate the evaluation values
        rmse_withoutCD_dist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
        mae_withoutCD_dist = mean_absolute_error(array_true, array_pred_withoutCD)
        smape_withoutCD_dist = smape_fast(array_true, array_pred_withoutCD)
        r2_withoutCD_dist = r2_score(array_true, array_pred_withoutCD)
            
        rmse_withCD_dist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
        mae_withCD_dist = mean_absolute_error(array_true, array_pred_withCD)
        smape_withCD_dist = smape_fast(array_true, array_pred_withCD)
        r2_withCD_dist = r2_score(array_true, array_pred_withCD)
            
        rmse_percent_improved_dist = (rmse_withoutCD_dist - rmse_withCD_dist) / rmse_withoutCD_dist
        mae_percent_improved_dist = (mae_withoutCD_dist - mae_withCD_dist) / mae_withoutCD_dist
        smape_percent_improved_dist = (smape_withoutCD_dist - smape_withCD_dist) / smape_withoutCD_dist
        r2_percent_improved_dist = (r2_withoutCD_dist - r2_withCD_dist) / r2_withoutCD_dist
            
        # Append
        dist_array = np.append(dist_array, [[j, rmse_withoutCD_dist, rmse_withCD_dist, rmse_percent_improved_dist,
                                            mae_withoutCD_dist, mae_withCD_dist, mae_percent_improved_dist,
                                            smape_withoutCD_dist, smape_withCD_dist, smape_percent_improved_dist,
                                            r2_withoutCD_dist, r2_withCD_dist, r2_percent_improved_dist]], axis = 0)

    #print(dist_array)
    pd.DataFrame(dist_array).to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                    + '/Original DF_0/LGBM_' + province2 + '_ByDistrict_Original_DF_' + str(i + 1) 
                                    + '_eval_' + str(num_leaves) + '.csv', header = False, encoding = 'utf-8')
        
    # Reset the per-district array to its header row before the next target
    dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                              'MAE without CD', 'MAE with CD', '% improved MAE', 
                              'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                              'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DF_1 to R squared DF_6
eval_array = evaluation_print_original(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_original(eval_array, 'MAE', mae)
eval_array = evaluation_print_original(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_original(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/Original DF_0/LGBM_' + province2 + '_dist_original_eval_' + str(num_leaves) + '.csv', 
                                header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.320476	valid_0's l2: 0.144048
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.310479	valid_0's l2: 0.136154
[3]	valid_0's l1: 0.3009	valid_0's l2: 0.128727
[4]	valid_0's l1: 0.292464	valid_0's l2: 0.122536
[5]	valid_0's l1: 0.284402	valid_0's l2: 0.116608
[6]	valid_0's l1: 0.277052	valid_0's l2: 0.111426
[7]	valid_0's l1: 0.270324	valid_0's l2: 0.106816
[8]	valid_0's l1: 0.263979	valid_0's l2: 0.102709
[9]	valid_0's l1: 0.258178	valid_0's l2: 0.0991075
[10]	valid_0's l1: 0.252509	valid_0's l2: 0.0956931
[11]	valid_0's l1: 0.24751	valid_0's l2: 0.0929538
[12]	valid_0's l1: 0.242876	valid_0's l2: 0.0904785
[13]	valid_0's l1: 0.23887	valid_0's l2: 0.0884331
[14]	valid_0's l1: 0.234894	valid_0's l2: 0.0865239
[15]	valid_0's l1: 0.231166	valid_0's l2: 0.0846408
[16]	valid_0's l1: 0.227919	valid_0's l2: 0.083149
[17]	valid_0's l1: 0.225068	valid_0's l2: 0.0819192
[18]	valid_0's l1: 0.222278	valid_0's l2: 0.0807971
[19]

Starting training...
[1]	valid_0's l1: 0.311406	valid_0's l2: 0.140187
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.30414	valid_0's l2: 0.134518
[3]	valid_0's l1: 0.298473	valid_0's l2: 0.13
[4]	valid_0's l1: 0.293904	valid_0's l2: 0.126715
[5]	valid_0's l1: 0.289932	valid_0's l2: 0.124174
[6]	valid_0's l1: 0.286324	valid_0's l2: 0.122163
[7]	valid_0's l1: 0.283461	valid_0's l2: 0.120599
[8]	valid_0's l1: 0.280807	valid_0's l2: 0.119602
[9]	valid_0's l1: 0.278894	valid_0's l2: 0.11924
[10]	valid_0's l1: 0.277175	valid_0's l2: 0.119357
[11]	valid_0's l1: 0.275837	valid_0's l2: 0.119451
[12]	valid_0's l1: 0.274886	valid_0's l2: 0.120214
[13]	valid_0's l1: 0.274198	valid_0's l2: 0.120947
[14]	valid_0's l1: 0.273656	valid_0's l2: 0.121989
[15]	valid_0's l1: 0.273194	valid_0's l2: 0.123295
Early stopping, best iteration is:
[9]	valid_0's l1: 0.278894	valid_0's l2: 0.11924
[1]	valid_0's l1: 0.311631	valid_0's l2: 0.140189
Training until validation scores d
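
The second log above stops after round 15 and reports round 9 as the best iteration. In that log the l2 column reaches its minimum at round 9 while l1 keeps decreasing, so the stop appears to be driven by l2. The following is a simplified sketch of a patience rule on a single metric, not LightGBM's own implementation (which tracks several metrics and validation sets). The function name `early_stop` is hypothetical; the l2 values are copied from the log.

```python
def early_stop(scores, patience):
    """Return (best_iteration, stop_iteration), both 1-based.

    Training stops once `patience` rounds have passed since the last new
    minimum; if that never happens, stop_iteration is the final round.
    """
    best_iter, best_score = 0, float('inf')
    for it, score in enumerate(scores, start=1):
        if score < best_score:
            best_iter, best_score = it, score
        elif it - best_iter >= patience:
            return best_iter, it
    return best_iter, len(scores)


# l2 values copied from the second log above (rounds 1 to 15)
l2_log = [0.140187, 0.134518, 0.13, 0.126715, 0.124174, 0.122163, 0.120599,
          0.119602, 0.11924, 0.119357, 0.119451, 0.120214, 0.120947,
          0.121989, 0.123295]

best_iter, stop_iter = early_stop(l2_log, patience=6)
print(best_iter, stop_iter)
```

With a patience of 6, the rule picks round 9 as best and stops at round 15, which matches the log.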

<h1>Sub-district level</h1>
For MAs (adjusted CD)
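
The metric arrays in this section (`rmse`, `mae`, `smape`, `r2`) are flat: a leading 0 from `np.zeros(1)`, followed by three values per target, appended in loop order (MA first, then DFma). As a rough illustration of that layout, the sketch below reshapes a made-up flat array into a (MA, DFma, value) cube and checks it against the index formula given in the notes for `evaluation_print`. The names `flat` and `cube` and the dummy values are assumptions for the example only.

```python
import numpy as np

# Made-up flat array laid out like `rmse`: one placeholder 0, then
# 3 MAs x 6 targets x 3 values (without CD, with CD, % improved).
flat = np.concatenate([np.zeros(1), np.arange(54, dtype=float)])

# Drop the placeholder and expose the loop structure as axes.
cube = flat[1:].reshape(3, 6, 3)  # (MA index, DFma index, value index)

ma, df = 1, 4                     # MA3 and DFma_5, both zero-based
start = 3 * df + 18 * ma + 1      # index formula from the notes
assert np.array_equal(cube[ma, df], flat[start:start + 3])
```

Indexing `cube[ma, df]` returns the same three values as the formula, without manual offset arithmetic.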

In [114]:
# Arrays of all evaluation values
# row: head,
# RMSE (DFma_1 - DFma_6), 
# MAE (DFma_1 - DFma_6), 
# SMAPE (DFma_1 - DFma_6), 
# R-squared (DFma_1 - DFma_6)

# col: head,
# MA2 (without CD, with CD, % improved),
# MA3 (without CD, with CD, % improved),
# MA4 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'MA2 without CD', 'MA2 with CD', 'MA2 % improved', 
                         'MA3 without CD', 'MA3 with CD', 'MA3 % improved', 
                         'MA4 without CD', 'MA4 with CD', 'MA4 % improved']])
rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                             'MAE without CD', 'MAE with CD', '% improved MAE', 
                             'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                             'R squared without CD', 'R squared with CD', '% improved R squared']])

# Starting from MA2 to MA4
for i in range(2, 5):
    # Get the input variables from CSV file
    # Change files directory here
    train_file_dir = 'Data/' + province1 + '/Normal Lags/train_' + province2 + '_subdist_total_mavg' + str(i) + '.csv'
    test_file_dir = 'Data/' + province1 + '/Normal Lags/test_' + province2 + '_subdist_total_mavg' + str(i) + '.csv'
    
    df_train_subdist = pd.read_csv(train_file_dir, header = 0, skiprows = 0)
    df_test_subdist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

    # Continue on DFma_1 to DFma_6
    for j in range(6):
        # Select the addrcode, week, year and actual-value columns first
        df_test_addrcode_week_year_subdist = df_test_subdist.iloc[:, [1, 2, 3, 19 - j]]
        
        ## Without CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 20],
        # DFma_wm1 [col 21],
        # DFma_wm2 [col 22],
        # DFma_wm3 [col 23],
        # RF_wm6 [col 24],
        # and LST_wm4 [col 25]
        
        ## With CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 20],
        # DFma_wm1 [col 21], 
        # DFma_wm2 [col 22],
        # DFma_wm3 [col 23],
        # RF_wm6 [col 24],
        # LST_wm4 [col 25],
        # bin_pop9s [col 26],
        # bowl_pop9s [col 27],
        # bucket_pop9s [col 28],
        # misc_short_pop9s [col 29],
        # jar_pop9s [col 30],
        # pottedplant_pop9s [col 31],
        # tire_pop9s [col 32],
        # misc_tall_pop9s [col 33],
        # and total_pop9s [col 34]
        
        x_train_withoutCD = df_train_subdist.iloc[:, [20, 21, 22, 23, 24, 25]]
        x_train_withCD = df_train_subdist.iloc[:, [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
        
        x_test_withoutCD = df_test_subdist.iloc[:, [20, 21, 22, 23, 24, 25]]
        x_test_withCD = df_test_subdist.iloc[:, [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
        
        # y: response (target) variable from DFma_1 to DFma_6 (col 19 -> col 14)
        y_train = df_train_subdist.iloc[:, [19 - j]]
        y_test = df_test_subdist.iloc[:, [19 - j]]
        
        # Pass the response values to the array for evaluation calculation
        y_train_true = np.array(df_train_subdist['DFma_' + str(j + 1)])
        y_test_true = np.array(df_test_subdist['DFma_' + str(j + 1)])
        
        # Pass the dataset of both independent and response variables to Light GBM
        lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
        lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
        lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
        lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': {'l2', 'l1'},
            'num_leaves': num_leaves,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0
        }

        # Train the model
        print('Starting training...')
        gbm_withoutCD = lgb.train(params,
                    lgb_train_withoutCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withoutCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withoutCD.save_model('model.txt')
        
        gbm_withCD = lgb.train(params,
                    lgb_train_withCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withCD.save_model('model.txt')

        # Predict on the test data
        print('Starting predicting...')
        y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
        y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

        df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
        df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
        # Store all of the predicted values to the CSV files
        df_compare_addrcode_subdist_withoutCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withoutCD], axis = 1)
        df_compare_addrcode_subdist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_subdist_withoutCD.to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' 
                                                     + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 
                                                     + '_subdist_MA' + str(i) + '_DFma_' + str(j + 1) + '_withoutCD_' 
                                                     + str(num_leaves) + '.csv', encoding = 'utf-8')

        df_compare_addrcode_subdist_withCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withCD], axis = 1)
        df_compare_addrcode_subdist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_subdist_withCD.to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' 
                                                  + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 
                                                  + '_subdist_MA' + str(i) + '_DFma_' + str(j + 1) + '_withCD_' 
                                                  + str(num_leaves) + '.csv', encoding = 'utf-8')
        
        # Evaluation
        rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
        mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
        r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
        smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
        #print('RMSE of the prediction without CD is:', rmse_withoutCD)
        #print('MAE of the prediction without CD is:', mae_withoutCD)
        #print('R-squared of the prediction without CD is:', r2_withoutCD)
        #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
        rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
        mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
        r2_withCD = r2_score(y_test_true, y_pred_withCD)
        smape_withCD = smape_fast(y_test_true, y_pred_withCD)
        #print('RMSE of the prediction with CD is:', rmse_withCD)
        #print('MAE of the prediction with CD is:', mae_withCD)
        #print('R-squared of the prediction with CD is:', r2_withCD)
        #print('SMAPE of the prediction with CD is:', smape_withCD)
        
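        # Relative change as a fraction of the without-CD value (not multiplied by 100).
        # Positive means the with-CD value is lower, which is a decrease for R squared.
        rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
        mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
        smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
        r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD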
        #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        #print(eval_array)
        
        rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
        smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
        r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
        df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) + '/MA' 
                                   + str(i) + '/LGBM_' + province2 + '_subdist_MA' + str(i) + '_DFma_' + str(j + 1) 
                                   + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
        df_withCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) + '/MA' 
                                + str(i) + '/LGBM_' + province2 + '_subdist_MA' + str(i) + '_DFma_' + str(j + 1) 
                                + '_withCD_' + str(num_leaves) + '.csv', header = 0)
        
        subdist_code = df_train_subdist['addrcode'].unique()
        
        # For each sub-district
        for k in subdist_code:
            
            # Get the subset of actual and predicted values according to the sub-district code
            subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == k]
            subset_withCD = df_withCD.loc[df_withCD['addrcode'] == k]
            
            # Pass the response values to the array for evaluation calculation
            array_true = np.array(subset_withoutCD['actual'])
            array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
            array_pred_withCD = np.array(subset_withCD['predicted'])
            
            # Calculate the evaluation values
            rmse_withoutCD_subdist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
            mae_withoutCD_subdist = mean_absolute_error(array_true, array_pred_withoutCD)
            smape_withoutCD_subdist = smape_fast(array_true, array_pred_withoutCD)
            r2_withoutCD_subdist = r2_score(array_true, array_pred_withoutCD)
            
            rmse_withCD_subdist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
            mae_withCD_subdist = mean_absolute_error(array_true, array_pred_withCD)
            smape_withCD_subdist = smape_fast(array_true, array_pred_withCD)
            r2_withCD_subdist = r2_score(array_true, array_pred_withCD)
            
            rmse_percent_improved_subdist = (rmse_withoutCD_subdist - rmse_withCD_subdist) / rmse_withoutCD_subdist
            mae_percent_improved_subdist = (mae_withoutCD_subdist - mae_withCD_subdist) / mae_withoutCD_subdist
            smape_percent_improved_subdist = (smape_withoutCD_subdist - smape_withCD_subdist) / smape_withoutCD_subdist
            r2_percent_improved_subdist = (r2_withoutCD_subdist - r2_withCD_subdist) / r2_withoutCD_subdist
            #print(k)
            #print('MA' + str(i) + 'DFma_' + str(j + 1) + ' R2 without ' + str(r2_withoutCD_subdist))
            #print('MA' + str(i) + 'DFma_' + str(j + 1) + ' R2 with ' + str(r2_withCD_subdist))
            #print('MA' + str(i) + 'DFma_' + str(j + 1) + ' R2 % ' + str(r2_percent_improved_subdist))
            # Append
            subdist_array = np.append(subdist_array, [[k, rmse_withoutCD_subdist, rmse_withCD_subdist, rmse_percent_improved_subdist,
                                                       mae_withoutCD_subdist, mae_withCD_subdist, mae_percent_improved_subdist,
                                                       smape_withoutCD_subdist, smape_withCD_subdist, smape_percent_improved_subdist,
                                                       r2_withoutCD_subdist, r2_withCD_subdist, r2_percent_improved_subdist]], axis = 0)
            

        #print(subdist_array)
        pd.DataFrame(subdist_array).to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                           + '/MA' + str(i) + '/LGBM_' + province2 + '_BySubdistrict_MA' + str(i) 
                                           + '_DFma_' + str(j + 1) + '_eval_' + str(num_leaves) + '.csv', 
                                           header = False, encoding = 'utf-8')
        
        # Reset the per-sub-district array to its header row before the next target
        subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                                 'MAE without CD', 'MAE with CD', '% improved MAE', 
                                 'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                                 'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DFma_1 to R squared DFma_6
eval_array = evaluation_print(eval_array, 'RMSE', rmse)
eval_array = evaluation_print(eval_array, 'MAE', mae)
eval_array = evaluation_print(eval_array, 'SMAPE', smape)
eval_array = evaluation_print(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/LGBM_' + province2 + '_subdist_eval_' + str(num_leaves) + '.csv', 
                                header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.384872	valid_0's l2: 0.24629
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.3726	valid_0's l2: 0.232502
[3]	valid_0's l1: 0.360453	valid_0's l2: 0.218978
[4]	valid_0's l1: 0.349779	valid_0's l2: 0.207874
[5]	valid_0's l1: 0.338148	valid_0's l2: 0.195106
[6]	valid_0's l1: 0.32797	valid_0's l2: 0.185136
[7]	valid_0's l1: 0.322048	valid_0's l2: 0.180548
[8]	valid_0's l1: 0.31222	valid_0's l2: 0.170802
[9]	valid_0's l1: 0.304214	valid_0's l2: 0.163652
[10]	valid_0's l1: 0.295606	valid_0's l2: 0.15565
[11]	valid_0's l1: 0.287971	valid_0's l2: 0.149105
[12]	valid_0's l1: 0.280955	valid_0's l2: 0.143072
[13]	valid_0's l1: 0.27518	valid_0's l2: 0.138624
[14]	valid_0's l1: 0.269013	valid_0's l2: 0.13344
[15]	valid_0's l1: 0.262986	valid_0's l2: 0.128386
[16]	valid_0's l1: 0.257241	valid_0's l2: 0.123842
[17]	valid_0's l1: 0.254262	valid_0's l2: 0.122222
[18]	valid_0's l1: 0.249621	valid_0's l2: 0.119171
[19]	valid_0's l1



Starting training...
[1]	valid_0's l1: 0.389882	valid_0's l2: 0.255768
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.382221	valid_0's l2: 0.248047
[3]	valid_0's l1: 0.37515	valid_0's l2: 0.241481
[4]	valid_0's l1: 0.368258	valid_0's l2: 0.234954
[5]	valid_0's l1: 0.361476	valid_0's l2: 0.227893
[6]	valid_0's l1: 0.355351	valid_0's l2: 0.222785
[7]	valid_0's l1: 0.349945	valid_0's l2: 0.218613
[8]	valid_0's l1: 0.344591	valid_0's l2: 0.213618
[9]	valid_0's l1: 0.339724	valid_0's l2: 0.210093
[10]	valid_0's l1: 0.335199	valid_0's l2: 0.206173
[11]	valid_0's l1: 0.331423	valid_0's l2: 0.203555
[12]	valid_0's l1: 0.327911	valid_0's l2: 0.201279
[13]	valid_0's l1: 0.324467	valid_0's l2: 0.199048
[14]	valid_0's l1: 0.32155	valid_0's l2: 0.197285
[15]	valid_0's l1: 0.318758	valid_0's l2: 0.195186
[16]	valid_0's l1: 0.315968	valid_0's l2: 0.193138
[17]	valid_0's l1: 0.313292	valid_0's l2: 0.191625
[18]	valid_0's l1: 0.310923	valid_0's l2: 0.190509
[19]	valid_



Starting training...
[1]	valid_0's l1: 0.390094	valid_0's l2: 0.258094
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.382637	valid_0's l2: 0.25082
[3]	valid_0's l1: 0.375881	valid_0's l2: 0.244576
[4]	valid_0's l1: 0.369783	valid_0's l2: 0.239091
[5]	valid_0's l1: 0.36363	valid_0's l2: 0.232221
[6]	valid_0's l1: 0.358096	valid_0's l2: 0.227649
[7]	valid_0's l1: 0.353369	valid_0's l2: 0.224066
[8]	valid_0's l1: 0.348587	valid_0's l2: 0.219518
[9]	valid_0's l1: 0.344418	valid_0's l2: 0.216475
[10]	valid_0's l1: 0.340216	valid_0's l2: 0.212875
[11]	valid_0's l1: 0.336724	valid_0's l2: 0.21068
[12]	valid_0's l1: 0.333486	valid_0's l2: 0.208688
[13]	valid_0's l1: 0.33045	valid_0's l2: 0.207011
[14]	valid_0's l1: 0.328154	valid_0's l2: 0.205725
[15]	valid_0's l1: 0.325423	valid_0's l2: 0.203823
[16]	valid_0's l1: 0.322947	valid_0's l2: 0.202078
[17]	valid_0's l1: 0.320975	valid_0's l2: 0.201515
[18]	valid_0's l1: 0.319009	valid_0's l2: 0.200927
[19]	valid_0'



Starting training...
[1]	valid_0's l1: 0.388784	valid_0's l2: 0.258282
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.381209	valid_0's l2: 0.249744
[3]	valid_0's l1: 0.37486	valid_0's l2: 0.243162
[4]	valid_0's l1: 0.36891	valid_0's l2: 0.237423
[5]	valid_0's l1: 0.363275	valid_0's l2: 0.23132
[6]	valid_0's l1: 0.358596	valid_0's l2: 0.227164
[7]	valid_0's l1: 0.354054	valid_0's l2: 0.223662
[8]	valid_0's l1: 0.350014	valid_0's l2: 0.219826
[9]	valid_0's l1: 0.34689	valid_0's l2: 0.217486
[10]	valid_0's l1: 0.343629	valid_0's l2: 0.214931
[11]	valid_0's l1: 0.340685	valid_0's l2: 0.213123
[12]	valid_0's l1: 0.338	valid_0's l2: 0.211537
[13]	valid_0's l1: 0.3358	valid_0's l2: 0.210469
[14]	valid_0's l1: 0.333674	valid_0's l2: 0.209489
[15]	valid_0's l1: 0.331846	valid_0's l2: 0.208449
[16]	valid_0's l1: 0.330061	valid_0's l2: 0.207656
[17]	valid_0's l1: 0.328432	valid_0's l2: 0.207704
[18]	valid_0's l1: 0.32702	valid_0's l2: 0.20778
[19]	valid_0's l1: 0



Starting training...
[1]	valid_0's l1: 0.356803	valid_0's l2: 0.20837
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.345637	valid_0's l2: 0.197005
[3]	valid_0's l1: 0.33511	valid_0's l2: 0.186665
[4]	valid_0's l1: 0.325495	valid_0's l2: 0.177551
[5]	valid_0's l1: 0.315647	valid_0's l2: 0.167518
[6]	valid_0's l1: 0.306902	valid_0's l2: 0.159537
[7]	valid_0's l1: 0.30102	valid_0's l2: 0.154976
[8]	valid_0's l1: 0.29287	valid_0's l2: 0.147543
[9]	valid_0's l1: 0.286397	valid_0's l2: 0.142243
[10]	valid_0's l1: 0.279448	valid_0's l2: 0.136242
[11]	valid_0's l1: 0.273308	valid_0's l2: 0.131539
[12]	valid_0's l1: 0.267682	valid_0's l2: 0.127372
[13]	valid_0's l1: 0.263025	valid_0's l2: 0.123983
[14]	valid_0's l1: 0.258273	valid_0's l2: 0.120588
[15]	valid_0's l1: 0.253453	valid_0's l2: 0.117004
[16]	valid_0's l1: 0.249036	valid_0's l2: 0.113998
[17]	valid_0's l1: 0.246418	valid_0's l2: 0.112605
[18]	valid_0's l1: 0.242597	valid_0's l2: 0.110379
[19]	valid_0'



Starting training...
[1]	valid_0's l1: 0.357747	valid_0's l2: 0.211692
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.349197	valid_0's l2: 0.203093
[3]	valid_0's l1: 0.34136	valid_0's l2: 0.195574
[4]	valid_0's l1: 0.334423	valid_0's l2: 0.189191
[5]	valid_0's l1: 0.32727	valid_0's l2: 0.182175
[6]	valid_0's l1: 0.321765	valid_0's l2: 0.177789
[7]	valid_0's l1: 0.316686	valid_0's l2: 0.173723
[8]	valid_0's l1: 0.31136	valid_0's l2: 0.169121
[9]	valid_0's l1: 0.3072	valid_0's l2: 0.166205
[10]	valid_0's l1: 0.302839	valid_0's l2: 0.162645
[11]	valid_0's l1: 0.299101	valid_0's l2: 0.160142
[12]	valid_0's l1: 0.295768	valid_0's l2: 0.158093
[13]	valid_0's l1: 0.292835	valid_0's l2: 0.156455
[14]	valid_0's l1: 0.290171	valid_0's l2: 0.155027
[15]	valid_0's l1: 0.287766	valid_0's l2: 0.153318
[16]	valid_0's l1: 0.285253	valid_0's l2: 0.15181
[17]	valid_0's l1: 0.28307	valid_0's l2: 0.151008
[18]	valid_0's l1: 0.280945	valid_0's l2: 0.150046
[19]	valid_0's l

Starting training...
[1]	valid_0's l1: 0.337211	valid_0's l2: 0.181185
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.322276	valid_0's l2: 0.165659
[3]	valid_0's l1: 0.308239	valid_0's l2: 0.151718
[4]	valid_0's l1: 0.295036	valid_0's l2: 0.139213
[5]	valid_0's l1: 0.282451	valid_0's l2: 0.127732
[6]	valid_0's l1: 0.270714	valid_0's l2: 0.117674
[7]	valid_0's l1: 0.261256	valid_0's l2: 0.110215
[8]	valid_0's l1: 0.250759	valid_0's l2: 0.101771
[9]	valid_0's l1: 0.240941	valid_0's l2: 0.0943374
[10]	valid_0's l1: 0.231613	valid_0's l2: 0.0875101
[11]	valid_0's l1: 0.222925	valid_0's l2: 0.0815133
[12]	valid_0's l1: 0.214752	valid_0's l2: 0.076106
[13]	valid_0's l1: 0.207074	valid_0's l2: 0.0711497
[14]	valid_0's l1: 0.199982	valid_0's l2: 0.0668079
[15]	valid_0's l1: 0.193208	valid_0's l2: 0.0627339
[16]	valid_0's l1: 0.186923	valid_0's l2: 0.0591467
[17]	valid_0's l1: 0.181949	valid_0's l2: 0.0564448
[18]	valid_0's l1: 0.17637	valid_0's l2: 0.0534924
[



Starting training...
[1]	valid_0's l1: 0.337225	valid_0's l2: 0.183255
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.324347	valid_0's l2: 0.169919
[3]	valid_0's l1: 0.312324	valid_0's l2: 0.157881
[4]	valid_0's l1: 0.301008	valid_0's l2: 0.14706
[5]	valid_0's l1: 0.290262	valid_0's l2: 0.136903
[6]	valid_0's l1: 0.280583	valid_0's l2: 0.128281
[7]	valid_0's l1: 0.273015	valid_0's l2: 0.122222
[8]	valid_0's l1: 0.264097	valid_0's l2: 0.114638
[9]	valid_0's l1: 0.255969	valid_0's l2: 0.108329
[10]	valid_0's l1: 0.248321	valid_0's l2: 0.102447
[11]	valid_0's l1: 0.241299	valid_0's l2: 0.0973424
[12]	valid_0's l1: 0.234676	valid_0's l2: 0.0927346
[13]	valid_0's l1: 0.228641	valid_0's l2: 0.088676
[14]	valid_0's l1: 0.223046	valid_0's l2: 0.0849944
[15]	valid_0's l1: 0.217677	valid_0's l2: 0.0815073
[16]	valid_0's l1: 0.212728	valid_0's l2: 0.0783941
[17]	valid_0's l1: 0.209101	valid_0's l2: 0.07639
[18]	valid_0's l1: 0.205085	valid_0's l2: 0.0741039
[19]	

Starting training...
[1]	valid_0's l1: 0.333352	valid_0's l2: 0.186106
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.326645	valid_0's l2: 0.179467
[3]	valid_0's l1: 0.320624	valid_0's l2: 0.173911
[4]	valid_0's l1: 0.315428	valid_0's l2: 0.16923
[5]	valid_0's l1: 0.311453	valid_0's l2: 0.166025
[6]	valid_0's l1: 0.307947	valid_0's l2: 0.163291
[7]	valid_0's l1: 0.305317	valid_0's l2: 0.161785
[8]	valid_0's l1: 0.302931	valid_0's l2: 0.160636
[9]	valid_0's l1: 0.300847	valid_0's l2: 0.159835
[10]	valid_0's l1: 0.299348	valid_0's l2: 0.159863
[11]	valid_0's l1: 0.297217	valid_0's l2: 0.159105
[12]	valid_0's l1: 0.295408	valid_0's l2: 0.158809
[13]	valid_0's l1: 0.294669	valid_0's l2: 0.159885
[14]	valid_0's l1: 0.293928	valid_0's l2: 0.160856
[15]	valid_0's l1: 0.293387	valid_0's l2: 0.162079
[16]	valid_0's l1: 0.293	valid_0's l2: 0.16341
[17]	valid_0's l1: 0.292966	valid_0's l2: 0.165139
[18]	valid_0's l1: 0.292982	valid_0's l2: 0.167113
Early stopping
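
The per-sub-district loop in the cell above filters the prediction files one `addrcode` at a time and grows `subdist_array` with `np.append`. A possible alternative, sketched here under the assumption that the files have `addrcode`, `actual` and `predicted` columns as written by that cell, is to compute the errors once and aggregate with `groupby`. The function name `group_metrics` and the demo rows are made up, and only RMSE and MAE are shown.

```python
import numpy as np
import pandas as pd


def group_metrics(df):
    """RMSE and MAE of `predicted` against `actual` for each addrcode."""
    err = df['predicted'] - df['actual']
    means = pd.DataFrame({
        'addrcode': df['addrcode'],
        'sq_err': err ** 2,
        'abs_err': err.abs(),
    }).groupby('addrcode').mean()
    return pd.DataFrame({'RMSE': np.sqrt(means['sq_err']),
                         'MAE': means['abs_err']})


# Made-up rows standing in for one prediction CSV file
demo = pd.DataFrame({
    'addrcode': [101, 101, 102, 102],
    'actual': [1.0, 3.0, 2.0, 2.0],
    'predicted': [2.0, 1.0, 2.0, 4.0],
})
per_code = group_metrics(demo)
print(per_code)
```

The result is indexed by `addrcode`, so it can be written out with `to_csv` directly instead of wrapping a NumPy string array in a DataFrame.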

For original DF_0 (without smoothing, adjusted CD)
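
Like the cells before it, the next cell computes every "% improved" value as (without CD - with CD) / without CD. For RMSE, MAE and SMAPE a positive result means the error went down, but for R squared the same formula is positive when the score went down. The helper below is a hypothetical sketch of a direction-aware version; the name `relative_improvement`, the `higher_is_better` flag, the use of the absolute baseline, and the NaN guard are all assumptions, not part of the notebook.

```python
def relative_improvement(without_cd, with_cd, higher_is_better=False):
    """Signed relative change, positive when the with-CD model is better.

    Returns a fraction (0.25 means 25 %), or NaN when the baseline is 0.
    """
    if without_cd == 0:
        return float('nan')
    change = (without_cd - with_cd) / abs(without_cd)
    return -change if higher_is_better else change


# Error metric: lower is better
err_gain = relative_improvement(0.40, 0.30)
# R squared: higher is better
r2_gain = relative_improvement(0.50, 0.60, higher_is_better=True)
print(err_gain, r2_gain)
```

Dividing by the absolute baseline keeps the sign meaningful when R squared is negative, which differs from the formula used in the cells.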

In [115]:
# Arrays of all evaluation values
# row: head,
# RMSE (DF_1 - DF_6), 
# MAE (DF_1 - DF_6), 
# SMAPE (DF_1 - DF_6), 
# R-squared (DF_1 - DF_6)

# col: head,
# DF_0 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'Without CD', 'With CD', '% improved']])

rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                             'MAE without CD', 'MAE with CD', '% improved MAE', 
                             'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                             'R squared without CD', 'R squared with CD', '% improved R squared']])

train_file_dir = 'Data/' + province1 + '/Normal Lags/train_' + province2 + '_subdist_total_mavg2.csv'
test_file_dir = 'Data/' + province1 + '/Normal Lags/test_' + province2 + '_subdist_total_mavg2.csv'

df_train_subdist = pd.read_csv(train_file_dir, header = 0, skiprows = 0)
df_test_subdist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

# From DF_1 to DF_6
for i in range(6):
    # Select the addrcode, week, year and actual-value columns first
    # The actual values are the target DF_1 to DF_6 (col 9 -> col 4), matching y_test below
    df_test_addrcode_week_year_subdist = df_test_subdist.iloc[:, [1, 2, 3, 9 - i]]
    
    ## Without CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 10],
    # DF_wm1 [col 11], 
    # DF_wm2 [col 12],
    # DF_wm3 [col 13],
    # RF_wm6 [col 24],
    # and LST_wm4 [col 25]
        
    ## With CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 10],
    # DF_wm1 [col 11], 
    # DF_wm2 [col 12],
    # DF_wm3 [col 13],
    # RF_wm6 [col 24],
    # LST_wm4 [col 25],
    # bin [col 26],
    # bowl [col 27],
    # bucket [col 28],
    # misc_short [col 29],
    # jars [col 30],
    # pottedplant [col 31],
    # tire [col 32],
    # misc_tall [col 33],
    # and total [col 34]
        
    x_train_withoutCD = df_train_subdist.iloc[:, [10, 11, 12, 13, 24, 25]]
    x_train_withCD = df_train_subdist.iloc[:, [10, 11, 12, 13, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
    
    x_test_withoutCD = df_test_subdist.iloc[:, [10, 11, 12, 13, 24, 25]]
    x_test_withCD = df_test_subdist.iloc[:, [10, 11, 12, 13, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
    
    # y: response (target) variable from DF_1 to DF_6 (col 9 -> col 4)
    y_train = df_train_subdist.iloc[:, [9 - i]]
    y_test = df_test_subdist.iloc[:, [9 - i]]
    
    # Pass the response values to the array for evaluation calculation
    y_train_true = np.array(df_train_subdist['DF_' + str(i + 1)])
    y_test_true = np.array(df_test_subdist['DF_' + str(i + 1)])
    
    # Pass the dataset of both independent and response variables to Light GBM
    lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
    lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
    lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
    lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2', 'l1'},
        'num_leaves': num_leaves,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }

    # Train the model
    print('Starting training...')
    gbm_withoutCD = lgb.train(params,
                lgb_train_withoutCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withoutCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withoutCD.save_model('model.txt')
    
    gbm_withCD = lgb.train(params,
                lgb_train_withCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withCD.save_model('model.txt')
    
    # Predict on the test data using the best iteration found by early stopping
    print('Starting predicting...')
    y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
    y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

    df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
    df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
    # Store all of the predicted values to the CSV files
    df_compare_addrcode_subdist_withoutCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withoutCD], axis = 1)
    df_compare_addrcode_subdist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_subdist_withoutCD.to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' 
                                                 + str(num_leaves) + '/Original DF_0/LGBM_' + province2 
                                                 + '_subdist_original_DF_' + str(i + 1) + '_withoutCD_' 
                                                 + str(num_leaves) + '.csv', encoding = 'utf-8')

    df_compare_addrcode_subdist_withCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withCD], axis = 1)
    df_compare_addrcode_subdist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_subdist_withCD.to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' 
                                              + str(num_leaves) + '/Original DF_0/LGBM_' + province2 
                                              + '_subdist_original_DF_' + str(i + 1) + '_withCD_' + str(num_leaves) 
                                              + '.csv', encoding = 'utf-8')
    
    # Evaluation
    rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
    mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
    r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
    smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
    #print('RMSE of the prediction without CD is:', rmse_withoutCD)
    #print('MAE of the prediction without CD is:', mae_withoutCD)
    #print('R-squared of the prediction without CD is:', r2_withoutCD)
    #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
    rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
    mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
    r2_withCD = r2_score(y_test_true, y_pred_withCD)
    smape_withCD = smape_fast(y_test_true, y_pred_withCD)
    #print('RMSE of the prediction with CD is:', rmse_withCD)
    #print('MAE of the prediction with CD is:', mae_withCD)
    #print('R-squared of the prediction with CD is:', r2_withCD)
    #print('SMAPE of the prediction with CD is:', smape_withCD)
        
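    # Relative change stored as a fraction (0.05 = 5 %), not multiplied by 100.
    # RMSE, MAE and SMAPE are lower-is-better, so a positive value means CD helped.
    # R squared is higher-is-better, so with this formula a negative value means
    # CD helped (assuming r2_withoutCD is positive).
    rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
    mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
    smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
    r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD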
    #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    #print(eval_array)
        
    rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
    smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
    r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
    #df_withoutCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withoutCD.csv', header = 0)
    #df_withCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withCD.csv', header = 0)
    
    df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                               + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' + str(i + 1) 
                               + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
    df_withCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                            + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' + str(i + 1) 
                            + '_withCD_' + str(num_leaves) + '.csv', header = 0)
    
    subdist_code = df_train_subdist['addrcode'].unique()
    
    # For each subdistrict
    for j in subdist_code:
            
        # Get the subset of actual and predicted values for this subdistrict code
        subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == j]
        subset_withCD = df_withCD.loc[df_withCD['addrcode'] == j]
            
        # Pass the response values to the array for evaluation calculation
        array_true = np.array(subset_withoutCD['actual'])
        array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
        array_pred_withCD = np.array(subset_withCD['predicted'])
            
        # Calculate the evaluation values
        rmse_withoutCD_subdist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
        mae_withoutCD_subdist = mean_absolute_error(array_true, array_pred_withoutCD)
        smape_withoutCD_subdist = smape_fast(array_true, array_pred_withoutCD)
        r2_withoutCD_subdist = r2_score(array_true, array_pred_withoutCD)
            
        rmse_withCD_subdist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
        mae_withCD_subdist = mean_absolute_error(array_true, array_pred_withCD)
        smape_withCD_subdist = smape_fast(array_true, array_pred_withCD)
        r2_withCD_subdist = r2_score(array_true, array_pred_withCD)
            
        rmse_percent_improved_subdist = (rmse_withoutCD_subdist - rmse_withCD_subdist) / rmse_withoutCD_subdist
        mae_percent_improved_subdist = (mae_withoutCD_subdist - mae_withCD_subdist) / mae_withoutCD_subdist
        smape_percent_improved_subdist = (smape_withoutCD_subdist - smape_withCD_subdist) / smape_withoutCD_subdist
        r2_percent_improved_subdist = (r2_withoutCD_subdist - r2_withCD_subdist) / r2_withoutCD_subdist
            
        # Append
        subdist_array = np.append(subdist_array, [[j, rmse_withoutCD_subdist, rmse_withCD_subdist, rmse_percent_improved_subdist,
                                                   mae_withoutCD_subdist, mae_withCD_subdist, mae_percent_improved_subdist,
                                                   smape_withoutCD_subdist, smape_withCD_subdist, smape_percent_improved_subdist,
                                                   r2_withoutCD_subdist, r2_withCD_subdist, r2_percent_improved_subdist]], axis = 0)

    #print(subdist_array)
    pd.DataFrame(subdist_array).to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                       + '/Original DF_0/LGBM_' + province2 + '_BySubDistrict_Original_DF_' + str(i + 1) 
                                       + '_eval_' + str(num_leaves) + '.csv', header = False, encoding = 'utf-8')
        
    # Reset the array to its header row before processing the next target
    subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                              'MAE without CD', 'MAE with CD', '% improved MAE', 
                              'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                              'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DF_1 to R squared DF_6
eval_array = evaluation_print_original(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_original(eval_array, 'MAE', mae)
eval_array = evaluation_print_original(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_original(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Normal Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/Original DF_0/LGBM_' + province2 + '_subdist_original_eval_' + str(num_leaves) + '.csv', 
                                header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.455561	valid_0's l2: 0.37591
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.448209	valid_0's l2: 0.367978
[3]	valid_0's l1: 0.44102	valid_0's l2: 0.359705
[4]	valid_0's l1: 0.434646	valid_0's l2: 0.353175
[5]	valid_0's l1: 0.428013	valid_0's l2: 0.345992
[6]	valid_0's l1: 0.422287	valid_0's l2: 0.340866
[7]	valid_0's l1: 0.417234	valid_0's l2: 0.335992
[8]	valid_0's l1: 0.411945	valid_0's l2: 0.331023
[9]	valid_0's l1: 0.407359	valid_0's l2: 0.327279
[10]	valid_0's l1: 0.403033	valid_0's l2: 0.323422
[11]	valid_0's l1: 0.399084	valid_0's l2: 0.320253
[12]	valid_0's l1: 0.395205	valid_0's l2: 0.317306
[13]	valid_0's l1: 0.392027	valid_0's l2: 0.315088
[14]	valid_0's l1: 0.389053	valid_0's l2: 0.31273
[15]	valid_0's l1: 0.385967	valid_0's l2: 0.310184
[16]	valid_0's l1: 0.38295	valid_0's l2: 0.308012
[17]	valid_0's l1: 0.380255	valid_0's l2: 0.306147
[18]	valid_0's l1: 0.377841	valid_0's l2: 0.30468
[19]	valid_0's



Starting training...
[1]	valid_0's l1: 0.457587	valid_0's l2: 0.381829
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.451228	valid_0's l2: 0.375396
[3]	valid_0's l1: 0.4452	valid_0's l2: 0.369167
[4]	valid_0's l1: 0.439931	valid_0's l2: 0.364248
[5]	valid_0's l1: 0.434137	valid_0's l2: 0.357004
[6]	valid_0's l1: 0.429076	valid_0's l2: 0.352307
[7]	valid_0's l1: 0.424329	valid_0's l2: 0.348194
[8]	valid_0's l1: 0.419914	valid_0's l2: 0.343335
[9]	valid_0's l1: 0.416036	valid_0's l2: 0.339986
[10]	valid_0's l1: 0.412319	valid_0's l2: 0.336244
[11]	valid_0's l1: 0.408819	valid_0's l2: 0.333749
[12]	valid_0's l1: 0.405482	valid_0's l2: 0.331163
[13]	valid_0's l1: 0.402738	valid_0's l2: 0.329462
[14]	valid_0's l1: 0.4	valid_0's l2: 0.32787
[15]	valid_0's l1: 0.39726	valid_0's l2: 0.325744
[16]	valid_0's l1: 0.394624	valid_0's l2: 0.323955
[17]	valid_0's l1: 0.392064	valid_0's l2: 0.322715
[18]	valid_0's l1: 0.389667	valid_0's l2: 0.321875
[19]	valid_0's l1:



Starting training...
[1]	valid_0's l1: 0.45814	valid_0's l2: 0.385352
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.452342	valid_0's l2: 0.379011
[3]	valid_0's l1: 0.44655	valid_0's l2: 0.37289
[4]	valid_0's l1: 0.44135	valid_0's l2: 0.367706
[5]	valid_0's l1: 0.435894	valid_0's l2: 0.360385
[6]	valid_0's l1: 0.431229	valid_0's l2: 0.355901
[7]	valid_0's l1: 0.426994	valid_0's l2: 0.352404
[8]	valid_0's l1: 0.42265	valid_0's l2: 0.347653
[9]	valid_0's l1: 0.418828	valid_0's l2: 0.344602
[10]	valid_0's l1: 0.41524	valid_0's l2: 0.341087
[11]	valid_0's l1: 0.4121	valid_0's l2: 0.339039
[12]	valid_0's l1: 0.409322	valid_0's l2: 0.337246
[13]	valid_0's l1: 0.406411	valid_0's l2: 0.335524
[14]	valid_0's l1: 0.403948	valid_0's l2: 0.334269
[15]	valid_0's l1: 0.401771	valid_0's l2: 0.332704
[16]	valid_0's l1: 0.399648	valid_0's l2: 0.331379
[17]	valid_0's l1: 0.397377	valid_0's l2: 0.330449
[18]	valid_0's l1: 0.395494	valid_0's l2: 0.329998
[19]	valid_0's l1



Starting training...
[1]	valid_0's l1: 0.460589	valid_0's l2: 0.391921
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.45512	valid_0's l2: 0.385081
[3]	valid_0's l1: 0.449872	valid_0's l2: 0.378617
[4]	valid_0's l1: 0.445458	valid_0's l2: 0.373569
[5]	valid_0's l1: 0.440442	valid_0's l2: 0.366965
[6]	valid_0's l1: 0.436478	valid_0's l2: 0.362585
[7]	valid_0's l1: 0.432692	valid_0's l2: 0.359329
[8]	valid_0's l1: 0.429002	valid_0's l2: 0.355484
[9]	valid_0's l1: 0.425998	valid_0's l2: 0.353268
[10]	valid_0's l1: 0.423035	valid_0's l2: 0.350502
[11]	valid_0's l1: 0.42045	valid_0's l2: 0.348488
[12]	valid_0's l1: 0.417906	valid_0's l2: 0.347118
[13]	valid_0's l1: 0.41559	valid_0's l2: 0.345965
[14]	valid_0's l1: 0.413817	valid_0's l2: 0.34526
[15]	valid_0's l1: 0.411955	valid_0's l2: 0.343835
[16]	valid_0's l1: 0.409914	valid_0's l2: 0.342598
[17]	valid_0's l1: 0.407941	valid_0's l2: 0.341955
[18]	valid_0's l1: 0.406308	valid_0's l2: 0.341603
[19]	valid_0'



Starting training...
[1]	valid_0's l1: 0.46092	valid_0's l2: 0.393605
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.455875	valid_0's l2: 0.387431
[3]	valid_0's l1: 0.451199	valid_0's l2: 0.382666
[4]	valid_0's l1: 0.447329	valid_0's l2: 0.378546
[5]	valid_0's l1: 0.443374	valid_0's l2: 0.374407
[6]	valid_0's l1: 0.4404	valid_0's l2: 0.371908
[7]	valid_0's l1: 0.438018	valid_0's l2: 0.369916
[8]	valid_0's l1: 0.435173	valid_0's l2: 0.367822
[9]	valid_0's l1: 0.43318	valid_0's l2: 0.366669
[10]	valid_0's l1: 0.431138	valid_0's l2: 0.365963
[11]	valid_0's l1: 0.42927	valid_0's l2: 0.365228
[12]	valid_0's l1: 0.427472	valid_0's l2: 0.364578
[13]	valid_0's l1: 0.426077	valid_0's l2: 0.364837
[14]	valid_0's l1: 0.424815	valid_0's l2: 0.365321
[15]	valid_0's l1: 0.423875	valid_0's l2: 0.366458
[16]	valid_0's l1: 0.422825	valid_0's l2: 0.367089
[17]	valid_0's l1: 0.422166	valid_0's l2: 0.368083
[18]	valid_0's l1: 0.421389	valid_0's l2: 0.368596
Early stopping
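
The last log above stops at round 18 instead of running all 20 rounds. A minimal sketch of the patience rule behind the message "Training until validation scores don't improve for 6 rounds" is given below. It is a simplified illustration, not LightGBM's implementation: the function name `early_stopping_round` is hypothetical, and only the l2 values copied from that log are checked, since l1 was still decreasing when training stopped.

```python
def early_stopping_round(scores, patience):
    """Return (stop_round, best_round), both 1-based, for a lower-is-better metric.

    stop_round is None when the metric never goes `patience` rounds
    in a row without improving on its best value.
    """
    best_score = float("inf")
    best_round = 0
    for round_number, score in enumerate(scores, start=1):
        if score < best_score:
            # New best validation score: remember where it happened
            best_score = score
            best_round = round_number
        elif round_number - best_round >= patience:
            # `patience` rounds have passed since the best score
            return round_number, best_round
    return None, best_round


# l2 validation values copied from the last training log above
l2_log = [0.393605, 0.387431, 0.382666, 0.378546, 0.374407, 0.371908,
          0.369916, 0.367822, 0.366669, 0.365963, 0.365228, 0.364578,
          0.364837, 0.365321, 0.366458, 0.367089, 0.368083, 0.368596]

stop_round, best_round = early_stopping_round(l2_log, patience=6)
print(stop_round, best_round)  # prints: 18 12
```

Under this rule the best l2 value is reached at round 12, which is why the notebook passes `best_iteration` to `predict` rather than using every tree that was built.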

For MAs (normal CD)
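
The flat `rmse`, `mae`, `smape` and `r2` arrays filled in the next cell follow the index layout described for `evaluation_print`: one placeholder at index 0, then three values per DFma target and eighteen per MA window. The sketch below rebuilds that append order with placeholder labels to check the formula. The helper `eval_index` and the label strings are hypothetical and exist only for illustration.

```python
def eval_index(ma_window, dfma_horizon, offset):
    """Index into a flat metric array.

    ma_window:    2, 3 or 4 (MA2 to MA4)
    dfma_horizon: 1 to 6 (DFma_1 to DFma_6)
    offset:       1 = without CD, 2 = with CD, 3 = % improved
    """
    return 18 * (ma_window - 2) + 3 * (dfma_horizon - 1) + offset


# Rebuild the append order of the training loop with labels instead of numbers
flat = ["init"]  # index 0 comes from the np.zeros(1) initialisation
for ma in range(2, 5):          # outer loop: MA2 to MA4
    for dfma in range(1, 7):    # inner loop: DFma_1 to DFma_6
        prefix = f"MA{ma} DFma_{dfma}"
        flat += [prefix + " without CD", prefix + " with CD", prefix + " % improved"]

print(eval_index(3, 1, 1))        # prints: 19
print(flat[eval_index(3, 1, 1)])  # prints: MA3 DFma_1 without CD
```

The layout only holds because the MA loop is the outer loop and the DFma loop is the inner one; swapping them would require swapping the multipliers 18 and 3 as well.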

In [116]:
# Arrays of all evaluation values
# row: head,
# RMSE (DFma_1 - DFma_6), 
# MAE (DFma_1 - DFma_6), 
# SMAPE (DFma_1 - DFma_6), 
# R-squared (DFma_1 - DFma_6)

# col: head,
# MA2 (without CD, with CD, % improved),
# MA3 (without CD, with CD, % improved),
# MA4 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'MA2 without CD', 'MA2 with CD', 'MA2 % improved', 
                         'MA3 without CD', 'MA3 with CD', 'MA3 % improved', 
                         'MA4 without CD', 'MA4 with CD', 'MA4 % improved']])
rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

# Starting from MA2 to MA4
for i in range(2, 5):
    # Read the training and test sets from CSV files
    # Change the file paths here
    train_file_dir = 'Data/' + province1 + '/Normal Lags/train_' + province2 + '_subdist_cd_mavg' + str(i) + '.csv'
    test_file_dir = 'Data/' + province1 + '/Normal Lags/test_' + province2 + '_subdist_cd_mavg' + str(i) + '.csv'
    
    df_train_subdist = pd.read_csv(train_file_dir, header=0, skiprows=0)
    df_test_subdist = pd.read_csv(test_file_dir, header=0, skiprows=0)

    # Continue on DFma_1 to DFma_6
    for j in range(6):
        # Select the addrcode, week, year and actual-value columns first
        df_test_addrcode_week_year_subdist = df_test_subdist.iloc[:, [1, 2, 3, 19 - j]]
        
        ## Without CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 20],
        # DFma_wm1 [col 21],
        # DFma_wm2 [col 22],
        # DFma_wm3 [col 23],
        # RF_wm6 [col 24],
        # and LST_wm4 [col 25]
        
        ## With CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 20],
        # DFma_wm1 [col 21],
        # DFma_wm2 [col 22],
        # DFma_wm3 [col 23],
        # RF_wm6 [col 24],
        # LST_wm4 [col 25],
        # bin_pop9s [col 26],
        # bowl_pop9s [col 27],
        # bucket_pop9s [col 28],
        # misc_short_pop9s [col 29],
        # jar_pop9s [col 30],
        # pottedplant_pop9s [col 31],
        # tire_pop9s [col 32],
        # misc_tall_pop9s [col 33],
        # and total_pop9s [col 34]
        
        x_train_withoutCD = df_train_subdist.iloc[:, [20, 21, 22, 23, 24, 25]]
        x_train_withCD = df_train_subdist.iloc[:, [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
        
        x_test_withoutCD = df_test_subdist.iloc[:, [20, 21, 22, 23, 24, 25]]
        x_test_withCD = df_test_subdist.iloc[:, [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
        
        # y: response (target) variable from DFma_1 to DFma_6 (col 19 -> col 14)
        y_train = df_train_subdist.iloc[:, [19 - j]]
        y_test = df_test_subdist.iloc[:, [19 - j]]
        
        # Pass the response values to the array for evaluation calculation
        y_train_true = np.array(df_train_subdist['DFma_' + str(j + 1)])
        y_test_true = np.array(df_test_subdist['DFma_' + str(j + 1)])
        
        # Build the LightGBM datasets; the test set also serves as the validation set for early stopping
        lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
        lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
        lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
        lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': {'l2', 'l1'},
            'num_leaves': num_leaves,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0
        }

        # Train the model
        print('Starting training...')
        gbm_withoutCD = lgb.train(params,
                    lgb_train_withoutCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withoutCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withoutCD.save_model('model.txt')
        
        gbm_withCD = lgb.train(params,
                    lgb_train_withCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withCD.save_model('model.txt')

        # Predict on the test data using the best iteration found by early stopping
        print('Starting predicting...')
        y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
        y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

        df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
        df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
        # Store all of the predicted values to the CSV files
        df_compare_addrcode_subdist_withoutCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withoutCD], axis = 1)
        df_compare_addrcode_subdist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_subdist_withoutCD.to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' 
                                                     + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 
                                                     + '_subdist_MA' + str(i) + '_DFma_' + str(j + 1) + '_withoutCD_' 
                                                     + str(num_leaves) + '.csv', encoding = 'utf-8')

        df_compare_addrcode_subdist_withCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withCD], axis = 1)
        df_compare_addrcode_subdist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_subdist_withCD.to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' 
                                                  + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 
                                                  + '_subdist_MA' + str(i) + '_DFma_' + str(j + 1) + '_withCD_' 
                                                  + str(num_leaves) + '.csv', encoding = 'utf-8')
        
        # Evaluation
        rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
        mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
        r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
        smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
        #print('RMSE of the prediction without CD is:', rmse_withoutCD)
        #print('MAE of the prediction without CD is:', mae_withoutCD)
        #print('R-squared of the prediction without CD is:', r2_withoutCD)
        #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
        rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
        mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
        r2_withCD = r2_score(y_test_true, y_pred_withCD)
        smape_withCD = smape_fast(y_test_true, y_pred_withCD)
        #print('RMSE of the prediction with CD is:', rmse_withCD)
        #print('MAE of the prediction with CD is:', mae_withCD)
        #print('R-squared of the prediction with CD is:', r2_withCD)
        #print('SMAPE of the prediction with CD is:', smape_withCD)
        
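        # Relative change stored as a fraction (0.05 = 5 %), not multiplied by 100.
        # RMSE, MAE and SMAPE are lower-is-better, so a positive value means CD helped.
        # R squared is higher-is-better, so with this formula a negative value means
        # CD helped (assuming r2_withoutCD is positive).
        rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
        mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
        smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
        r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD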
        #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        #print(eval_array)
        
        rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
        smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
        r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
        df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) + '/MA' 
                                   + str(i) + '/LGBM_' + province2 + '_subdist_MA' + str(i) + '_DFma_' + str(j + 1) 
                                   + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
        df_withCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) + '/MA' 
                                + str(i) + '/LGBM_' + province2 + '_subdist_MA' + str(i) + '_DFma_' + str(j + 1) 
                                + '_withCD_' + str(num_leaves) + '.csv', header = 0)
        
        subdist_code = df_train_subdist['addrcode'].unique()
        
        # For each subdistrict
        for k in subdist_code:
            
            # Get the subset of actual and predicted values for this subdistrict code
            subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == k]
            subset_withCD = df_withCD.loc[df_withCD['addrcode'] == k]
            
            # Pass the response values to the array for evaluation calculation
            array_true = np.array(subset_withoutCD['actual'])
            array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
            array_pred_withCD = np.array(subset_withCD['predicted'])
            
            # Calculate the evaluation values
            rmse_withoutCD_subdist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
            mae_withoutCD_subdist = mean_absolute_error(array_true, array_pred_withoutCD)
            smape_withoutCD_subdist = smape_fast(array_true, array_pred_withoutCD)
            r2_withoutCD_subdist = r2_score(array_true, array_pred_withoutCD)
            
            rmse_withCD_subdist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
            mae_withCD_subdist = mean_absolute_error(array_true, array_pred_withCD)
            smape_withCD_subdist = smape_fast(array_true, array_pred_withCD)
            r2_withCD_subdist = r2_score(array_true, array_pred_withCD)
            
            rmse_percent_improved_subdist = (rmse_withoutCD_subdist - rmse_withCD_subdist) / rmse_withoutCD_subdist
            mae_percent_improved_subdist = (mae_withoutCD_subdist - mae_withCD_subdist) / mae_withoutCD_subdist
            smape_percent_improved_subdist = (smape_withoutCD_subdist - smape_withCD_subdist) / smape_withoutCD_subdist
            r2_percent_improved_subdist = (r2_withoutCD_subdist - r2_withCD_subdist) / r2_withoutCD_subdist
            
            # Append
            subdist_array = np.append(subdist_array, [[k, rmse_withoutCD_subdist, rmse_withCD_subdist, rmse_percent_improved_subdist,
                                                mae_withoutCD_subdist, mae_withCD_subdist, mae_percent_improved_subdist,
                                                smape_withoutCD_subdist, smape_withCD_subdist, smape_percent_improved_subdist,
                                                r2_withoutCD_subdist, r2_withCD_subdist, r2_percent_improved_subdist]], axis = 0)

        #print(subdist_array)
        pd.DataFrame(subdist_array).to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                           + '/MA' + str(i) + '/LGBM_' + province2 + '_BySubDistrict_MA' + str(i) 
                                           + '_DFma_' + str(j + 1) + '_eval_' + str(num_leaves) + '.csv', 
                                           header = False, encoding = 'utf-8')
        
        # Reset the array to its header row before processing the next target
        subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DFma_1 to R squared DFma_6
eval_array = evaluation_print(eval_array, 'RMSE', rmse)
eval_array = evaluation_print(eval_array, 'MAE', mae)
eval_array = evaluation_print(eval_array, 'SMAPE', smape)
eval_array = evaluation_print(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) + 
                                '/LGBM_' + province2 + '_subdist_eval_' + str(num_leaves) + '.csv', 
                                header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.384872	valid_0's l2: 0.24629
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.3726	valid_0's l2: 0.232502
[3]	valid_0's l1: 0.360453	valid_0's l2: 0.218978
[4]	valid_0's l1: 0.349779	valid_0's l2: 0.207874
[5]	valid_0's l1: 0.338148	valid_0's l2: 0.195106
[6]	valid_0's l1: 0.32797	valid_0's l2: 0.185136
[7]	valid_0's l1: 0.322048	valid_0's l2: 0.180548
[8]	valid_0's l1: 0.31222	valid_0's l2: 0.170802
[9]	valid_0's l1: 0.304214	valid_0's l2: 0.163652
[10]	valid_0's l1: 0.295606	valid_0's l2: 0.15565
[11]	valid_0's l1: 0.287971	valid_0's l2: 0.149105
[12]	valid_0's l1: 0.280955	valid_0's l2: 0.143072
[13]	valid_0's l1: 0.27518	valid_0's l2: 0.138624
[14]	valid_0's l1: 0.269013	valid_0's l2: 0.13344
[15]	valid_0's l1: 0.262986	valid_0's l2: 0.128386
[16]	valid_0's l1: 0.257241	valid_0's l2: 0.123842
[17]	valid_0's l1: 0.254262	valid_0's l2: 0.122222
[18]	valid_0's l1: 0.249621	valid_0's l2: 0.119171
[19]	valid_0's l1



Starting training...
[1]	valid_0's l1: 0.389882	valid_0's l2: 0.255768
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.382221	valid_0's l2: 0.248047
[3]	valid_0's l1: 0.37515	valid_0's l2: 0.241481
[4]	valid_0's l1: 0.368258	valid_0's l2: 0.234954
[5]	valid_0's l1: 0.361476	valid_0's l2: 0.227893
[6]	valid_0's l1: 0.355351	valid_0's l2: 0.222785
[7]	valid_0's l1: 0.349945	valid_0's l2: 0.218613
[8]	valid_0's l1: 0.344591	valid_0's l2: 0.213618
[9]	valid_0's l1: 0.339724	valid_0's l2: 0.210093
[10]	valid_0's l1: 0.335199	valid_0's l2: 0.206173
[11]	valid_0's l1: 0.331423	valid_0's l2: 0.203555
[12]	valid_0's l1: 0.327911	valid_0's l2: 0.201279
[13]	valid_0's l1: 0.324467	valid_0's l2: 0.199048
[14]	valid_0's l1: 0.32155	valid_0's l2: 0.197285
[15]	valid_0's l1: 0.318758	valid_0's l2: 0.195186
[16]	valid_0's l1: 0.315968	valid_0's l2: 0.193138
[17]	valid_0's l1: 0.313292	valid_0's l2: 0.191625
[18]	valid_0's l1: 0.310923	valid_0's l2: 0.190509
[19]	valid_



Starting training...
[1]	valid_0's l1: 0.390094	valid_0's l2: 0.258094
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.382637	valid_0's l2: 0.25082
[3]	valid_0's l1: 0.375881	valid_0's l2: 0.244576
[4]	valid_0's l1: 0.369783	valid_0's l2: 0.239091
[5]	valid_0's l1: 0.36363	valid_0's l2: 0.232221
[6]	valid_0's l1: 0.358096	valid_0's l2: 0.227649
[7]	valid_0's l1: 0.353369	valid_0's l2: 0.224066
[8]	valid_0's l1: 0.348587	valid_0's l2: 0.219518
[9]	valid_0's l1: 0.344418	valid_0's l2: 0.216475
[10]	valid_0's l1: 0.340216	valid_0's l2: 0.212875
[11]	valid_0's l1: 0.336724	valid_0's l2: 0.21068
[12]	valid_0's l1: 0.333486	valid_0's l2: 0.208688
[13]	valid_0's l1: 0.33045	valid_0's l2: 0.207011
[14]	valid_0's l1: 0.328154	valid_0's l2: 0.205725
[15]	valid_0's l1: 0.325423	valid_0's l2: 0.203823
[16]	valid_0's l1: 0.322947	valid_0's l2: 0.202078
[17]	valid_0's l1: 0.320975	valid_0's l2: 0.201515
[18]	valid_0's l1: 0.319009	valid_0's l2: 0.200927
[19]	valid_0'



Starting training...
[1]	valid_0's l1: 0.388784	valid_0's l2: 0.258282
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.381209	valid_0's l2: 0.249744
[3]	valid_0's l1: 0.37486	valid_0's l2: 0.243162
[4]	valid_0's l1: 0.36891	valid_0's l2: 0.237423
[5]	valid_0's l1: 0.363275	valid_0's l2: 0.23132
[6]	valid_0's l1: 0.358596	valid_0's l2: 0.227164
[7]	valid_0's l1: 0.354054	valid_0's l2: 0.223662
[8]	valid_0's l1: 0.350014	valid_0's l2: 0.219826
[9]	valid_0's l1: 0.34689	valid_0's l2: 0.217486
[10]	valid_0's l1: 0.343629	valid_0's l2: 0.214931
[11]	valid_0's l1: 0.340685	valid_0's l2: 0.213123
[12]	valid_0's l1: 0.338	valid_0's l2: 0.211537
[13]	valid_0's l1: 0.3358	valid_0's l2: 0.210469
[14]	valid_0's l1: 0.333674	valid_0's l2: 0.209489
[15]	valid_0's l1: 0.331846	valid_0's l2: 0.208449
[16]	valid_0's l1: 0.330061	valid_0's l2: 0.207656
[17]	valid_0's l1: 0.328432	valid_0's l2: 0.207704
[18]	valid_0's l1: 0.32702	valid_0's l2: 0.20778
[19]	valid_0's l1: 0



Starting training...
[1]	valid_0's l1: 0.356803	valid_0's l2: 0.20837
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.345637	valid_0's l2: 0.197005
[3]	valid_0's l1: 0.33511	valid_0's l2: 0.186665
[4]	valid_0's l1: 0.325495	valid_0's l2: 0.177551
[5]	valid_0's l1: 0.315647	valid_0's l2: 0.167518
[6]	valid_0's l1: 0.306902	valid_0's l2: 0.159537
[7]	valid_0's l1: 0.30102	valid_0's l2: 0.154976
[8]	valid_0's l1: 0.29287	valid_0's l2: 0.147543
[9]	valid_0's l1: 0.286397	valid_0's l2: 0.142243
[10]	valid_0's l1: 0.279448	valid_0's l2: 0.136242
[11]	valid_0's l1: 0.273308	valid_0's l2: 0.131539
[12]	valid_0's l1: 0.267682	valid_0's l2: 0.127372
[13]	valid_0's l1: 0.263025	valid_0's l2: 0.123983
[14]	valid_0's l1: 0.258273	valid_0's l2: 0.120588
[15]	valid_0's l1: 0.253453	valid_0's l2: 0.117004
[16]	valid_0's l1: 0.249036	valid_0's l2: 0.113998
[17]	valid_0's l1: 0.246418	valid_0's l2: 0.112605
[18]	valid_0's l1: 0.242597	valid_0's l2: 0.110379
[19]	valid_0'



Starting training...
[1]	valid_0's l1: 0.357747	valid_0's l2: 0.211692
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.349197	valid_0's l2: 0.203093
[3]	valid_0's l1: 0.34136	valid_0's l2: 0.195574
[4]	valid_0's l1: 0.334423	valid_0's l2: 0.189191
[5]	valid_0's l1: 0.32727	valid_0's l2: 0.182175
[6]	valid_0's l1: 0.321765	valid_0's l2: 0.177789
[7]	valid_0's l1: 0.316686	valid_0's l2: 0.173723
[8]	valid_0's l1: 0.31136	valid_0's l2: 0.169121
[9]	valid_0's l1: 0.3072	valid_0's l2: 0.166205
[10]	valid_0's l1: 0.302839	valid_0's l2: 0.162645
[11]	valid_0's l1: 0.299101	valid_0's l2: 0.160142
[12]	valid_0's l1: 0.295768	valid_0's l2: 0.158093
[13]	valid_0's l1: 0.292835	valid_0's l2: 0.156455
[14]	valid_0's l1: 0.290171	valid_0's l2: 0.155027
[15]	valid_0's l1: 0.287766	valid_0's l2: 0.153318
[16]	valid_0's l1: 0.285253	valid_0's l2: 0.15181
[17]	valid_0's l1: 0.28307	valid_0's l2: 0.151008
[18]	valid_0's l1: 0.280945	valid_0's l2: 0.150046
[19]	valid_0's l

Starting training...
[1]	valid_0's l1: 0.337211	valid_0's l2: 0.181185
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.322276	valid_0's l2: 0.165659
[3]	valid_0's l1: 0.308239	valid_0's l2: 0.151718
[4]	valid_0's l1: 0.295036	valid_0's l2: 0.139213
[5]	valid_0's l1: 0.282451	valid_0's l2: 0.127732
[6]	valid_0's l1: 0.270714	valid_0's l2: 0.117674
[7]	valid_0's l1: 0.261256	valid_0's l2: 0.110215
[8]	valid_0's l1: 0.250759	valid_0's l2: 0.101771
[9]	valid_0's l1: 0.240941	valid_0's l2: 0.0943374
[10]	valid_0's l1: 0.231613	valid_0's l2: 0.0875101
[11]	valid_0's l1: 0.222925	valid_0's l2: 0.0815133
[12]	valid_0's l1: 0.214752	valid_0's l2: 0.076106
[13]	valid_0's l1: 0.207074	valid_0's l2: 0.0711497
[14]	valid_0's l1: 0.199982	valid_0's l2: 0.0668079
[15]	valid_0's l1: 0.193208	valid_0's l2: 0.0627339
[16]	valid_0's l1: 0.186923	valid_0's l2: 0.0591467
[17]	valid_0's l1: 0.181949	valid_0's l2: 0.0564448
[18]	valid_0's l1: 0.17637	valid_0's l2: 0.0534924
[



Starting training...
[1]	valid_0's l1: 0.337225	valid_0's l2: 0.183255
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.324347	valid_0's l2: 0.169919
[3]	valid_0's l1: 0.312324	valid_0's l2: 0.157881
[4]	valid_0's l1: 0.301008	valid_0's l2: 0.14706
[5]	valid_0's l1: 0.290262	valid_0's l2: 0.136903
[6]	valid_0's l1: 0.280583	valid_0's l2: 0.128281
[7]	valid_0's l1: 0.273015	valid_0's l2: 0.122222
[8]	valid_0's l1: 0.264097	valid_0's l2: 0.114638
[9]	valid_0's l1: 0.255969	valid_0's l2: 0.108329
[10]	valid_0's l1: 0.248321	valid_0's l2: 0.102447
[11]	valid_0's l1: 0.241299	valid_0's l2: 0.0973424
[12]	valid_0's l1: 0.234676	valid_0's l2: 0.0927346
[13]	valid_0's l1: 0.228641	valid_0's l2: 0.088676
[14]	valid_0's l1: 0.223046	valid_0's l2: 0.0849944
[15]	valid_0's l1: 0.217677	valid_0's l2: 0.0815073
[16]	valid_0's l1: 0.212728	valid_0's l2: 0.0783941
[17]	valid_0's l1: 0.209101	valid_0's l2: 0.07639
[18]	valid_0's l1: 0.205085	valid_0's l2: 0.0741039
[19]	

Starting training...
[1]	valid_0's l1: 0.333352	valid_0's l2: 0.186106
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.326645	valid_0's l2: 0.179467
[3]	valid_0's l1: 0.320624	valid_0's l2: 0.173911
[4]	valid_0's l1: 0.315428	valid_0's l2: 0.16923
[5]	valid_0's l1: 0.311453	valid_0's l2: 0.166025
[6]	valid_0's l1: 0.307947	valid_0's l2: 0.163291
[7]	valid_0's l1: 0.305317	valid_0's l2: 0.161785
[8]	valid_0's l1: 0.302931	valid_0's l2: 0.160636
[9]	valid_0's l1: 0.300847	valid_0's l2: 0.159835
[10]	valid_0's l1: 0.299348	valid_0's l2: 0.159863
[11]	valid_0's l1: 0.297217	valid_0's l2: 0.159105
[12]	valid_0's l1: 0.295408	valid_0's l2: 0.158809
[13]	valid_0's l1: 0.294669	valid_0's l2: 0.159885
[14]	valid_0's l1: 0.293928	valid_0's l2: 0.160856
[15]	valid_0's l1: 0.293387	valid_0's l2: 0.162079
[16]	valid_0's l1: 0.293	valid_0's l2: 0.16341
[17]	valid_0's l1: 0.292966	valid_0's l2: 0.165139
[18]	valid_0's l1: 0.292982	valid_0's l2: 0.167113
Early stopping
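
The log above stops at round 18 even though `num_boost_round = 20`. A minimal sketch of the patience rule follows, assuming the stop is triggered by the `l2` metric alone. The helper `replay_early_stopping` is hypothetical and is not LightGBM's implementation; it only replays the `l2` values printed above.

```python
def replay_early_stopping(scores, patience):
    """Return (stop_round, best_round), 1-based, for a lower-is-better metric.

    stop_round is None if training would run through all logged rounds.
    """
    best_score = float("inf")
    best_round = 0
    for round_no, score in enumerate(scores, start=1):
        if score < best_score:
            # New best validation score: remember where it happened
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            # No improvement for `patience` consecutive rounds
            return round_no, best_round
    return None, best_round


# valid_0's l2 values copied from rounds 1-18 of the log above
l2_log = [0.186106, 0.179467, 0.173911, 0.16923, 0.166025, 0.163291,
          0.161785, 0.160636, 0.159835, 0.159863, 0.159105, 0.158809,
          0.159885, 0.160856, 0.162079, 0.16341, 0.165139, 0.167113]

stop_round, best_round = replay_early_stopping(l2_log, patience=6)
print(stop_round, best_round)  # prints: 18 12
```

In this replay the best `l2` value is at round 12 and six rounds pass without improvement, which matches the stop at round 18. The notebook then predicts with `num_iteration = best_iteration`, so the later rounds are not used.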

For original DF_0 (without smoothing, normal CD)

In [117]:
# Arrays of all evaluation values
# row: head,
# RMSE (DF_1 - DF_6), 
# MAE (DF_1 - DF_6), 
# SMAPE (DF_1 - DF_6), 
# R-squared (DF_1 - DF_6)

# col: head,
# DF_0 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'Without CD', 'With CD', '% improved']])

rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                             'MAE without CD', 'MAE with CD', '% improved MAE', 
                             'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                             'R squared without CD', 'R squared with CD', '% improved R squared']])

train_file_dir = 'Data/' + province1 + '/Normal Lags/train_' + province2 + '_subdist_cd_mavg2.csv'
test_file_dir = 'Data/' + province1 + '/Normal Lags/test_' + province2 + '_subdist_cd_mavg2.csv'

df_train_subdist =  pd.read_csv(train_file_dir, header = 0, skiprows = 0)
df_test_subdist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

# From DF_1 to DF_6
for i in range(6):
    # Select the addrcode, week, year and actual-value columns first.
    # The actual values are the target DF_(i + 1) at col 9 - i (col 10 is the input DF_0).
    df_test_addrcode_week_year_subdist = df_test_subdist.iloc[:, [1, 2, 3, 9 - i]]
    
    ## Without CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 10],
    # DF_wm1 [col 11], 
    # DF_wm2 [col 12],
    # DF_wm3 [col 13],
    # RF_wm6 [col 24],
    # and LST_wm4 [col 25]
        
    ## With CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 10],
    # DF_wm1 [col 11], 
    # DF_wm2 [col 12],
    # DF_wm3 [col 13],
    # RF_wm6 [col 24],
    # LST_wm4 [col 25],
    # bin [col 26],
    # bowl [col 27],
    # bucket [col 28],
    # misc_short [col 29],
    # jars [col 30],
    # pottedplant [col 31],
    # tire [col 32],
    # misc_tall [col 33],
    # and total [col 34]
        
    x_train_withoutCD = df_train_subdist.iloc[:, [10, 11, 12, 13, 24, 25]]
    x_train_withCD = df_train_subdist.iloc[:, [10, 11, 12, 13, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
    
    x_test_withoutCD = df_test_subdist.iloc[:, [10, 11, 12, 13, 24, 25]]
    x_test_withCD = df_test_subdist.iloc[:, [10, 11, 12, 13, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]]
    
    # y: response (target) variable from DF_1 to DF_6 (col 9 -> col 4)
    y_train = df_train_subdist.iloc[:, [9 - i]]
    y_test = df_test_subdist.iloc[:, [9 - i]]
    
    # Pass the response values to the array for evaluation calculation
    y_train_true = np.array(df_train_subdist['DF_' + str(i + 1)])
    y_test_true = np.array(df_test_subdist['DF_' + str(i + 1)])
    
    # Pass the dataset of both independent and response variables to Light GBM
    lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
    lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
    lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
    lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

    params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': num_leaves,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
    }

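    # Train the model.
    # Note: LightGBM 4.0 and later no longer accept early_stopping_rounds in lgb.train();
    # there, pass callbacks = [lgb.early_stopping(6)] instead.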
    print('Starting training...')
    gbm_withoutCD = lgb.train(params,
                lgb_train_withoutCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withoutCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withoutCD.save_model('model.txt')
    
    gbm_withCD = lgb.train(params,
                lgb_train_withCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withCD.save_model('model.txt')
    
    # Predict on the test data
    print('Starting predicting...')
    y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
    y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

    df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
    df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
    # Store all of the predicted values to the CSV files
    df_compare_addrcode_subdist_withoutCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withoutCD], axis = 1)
    df_compare_addrcode_subdist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_subdist_withoutCD.to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' 
                                                 + str(num_leaves) + '/Original DF_0/LGBM_' + province2 
                                                 + '_subdist_original_DF_' + str(i + 1) + '_withoutCD_' 
                                                 + str(num_leaves) + '.csv', encoding = 'utf-8')

    df_compare_addrcode_subdist_withCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withCD], axis = 1)
    df_compare_addrcode_subdist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_subdist_withCD.to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' 
                                              + str(num_leaves) + '/Original DF_0/LGBM_' + province2 
                                              + '_subdist_original_DF_' + str(i + 1) + '_withCD_' 
                                              + str(num_leaves) + '.csv', encoding = 'utf-8')
    
    # Evaluation
    rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
    mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
    r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
    smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
    #print('RMSE of the prediction without CD is:', rmse_withoutCD)
    #print('MAE of the prediction without CD is:', mae_withoutCD)
    #print('R-squared of the prediction without CD is:', r2_withoutCD)
    #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
    rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
    mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
    r2_withCD = r2_score(y_test_true, y_pred_withCD)
    smape_withCD = smape_fast(y_test_true, y_pred_withCD)
    #print('RMSE of the prediction with CD is:', rmse_withCD)
    #print('MAE of the prediction with CD is:', mae_withCD)
    #print('R-squared of the prediction with CD is:', r2_withCD)
    #print('SMAPE of the prediction with CD is:', smape_withCD)
        
    rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
    mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
    smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
    # R-squared improves when it increases, so the sign is reversed relative to the error metrics
    r2_percent_improved = (r2_withCD - r2_withoutCD) / abs(r2_withoutCD)
    #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    #print(eval_array)
        
    rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
    smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
    r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
    #df_withoutCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withoutCD.csv', header = 0)
    #df_withCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withCD.csv', header = 0)
    
    df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                               + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' + str(i + 1) 
                               + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
    df_withCD = pd.read_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                            + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' + str(i + 1) 
                            + '_withCD_' + str(num_leaves) + '.csv', header = 0)
    
    subdist_code = df_train_subdist['addrcode'].unique()
    
    # For each subdistrict
    for j in subdist_code:
            
        # Get the subset of actual and predicted values for this subdistrict code
        subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == j]
        subset_withCD = df_withCD.loc[df_withCD['addrcode'] == j]
            
        # Pass the response values to the array for evaluation calculation
        array_true = np.array(subset_withoutCD['actual'])
        array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
        array_pred_withCD = np.array(subset_withCD['predicted'])
            
        # Calculate the evaluation values
        rmse_withoutCD_subdist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
        mae_withoutCD_subdist = mean_absolute_error(array_true, array_pred_withoutCD)
        smape_withoutCD_subdist = smape_fast(array_true, array_pred_withoutCD)
        r2_withoutCD_subdist = r2_score(array_true, array_pred_withoutCD)
            
        rmse_withCD_subdist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
        mae_withCD_subdist = mean_absolute_error(array_true, array_pred_withCD)
        smape_withCD_subdist = smape_fast(array_true, array_pred_withCD)
        r2_withCD_subdist = r2_score(array_true, array_pred_withCD)
            
        rmse_percent_improved_subdist = (rmse_withoutCD_subdist - rmse_withCD_subdist) / rmse_withoutCD_subdist
        mae_percent_improved_subdist = (mae_withoutCD_subdist - mae_withCD_subdist) / mae_withoutCD_subdist
        smape_percent_improved_subdist = (smape_withoutCD_subdist - smape_withCD_subdist) / smape_withoutCD_subdist
        # R-squared improves when it increases, so the sign is reversed relative to the error metrics
        r2_percent_improved_subdist = (r2_withCD_subdist - r2_withoutCD_subdist) / abs(r2_withoutCD_subdist)
            
        # Append
        subdist_array = np.append(subdist_array, [[j, rmse_withoutCD_subdist, rmse_withCD_subdist, rmse_percent_improved_subdist,
                                                   mae_withoutCD_subdist, mae_withCD_subdist, mae_percent_improved_subdist,
                                                   smape_withoutCD_subdist, smape_withCD_subdist, smape_percent_improved_subdist,
                                                   r2_withoutCD_subdist, r2_withCD_subdist, r2_percent_improved_subdist]], axis = 0)

    #print(subdist_array)
    pd.DataFrame(subdist_array).to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                       + '/Original DF_0/LGBM_' + province2 + '_BySubDistrict_Original_DF_' 
                                       + str(i + 1) + '_eval_' + str(num_leaves) + '.csv', 
                                       header = False, encoding = 'utf-8')
        
    # Reset the per-subdistrict array to its header row before the next target
    subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                              'MAE without CD', 'MAE with CD', '% improved MAE', 
                              'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                              'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DF_1 to R squared DF_6
eval_array = evaluation_print_original(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_original(eval_array, 'MAE', mae)
eval_array = evaluation_print_original(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_original(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Normal Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/Original DF_0/LGBM_' + province2 + '_subdist_original_eval_' + str(num_leaves) 
                                + '.csv', header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.455561	valid_0's l2: 0.37591
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.448209	valid_0's l2: 0.367978
[3]	valid_0's l1: 0.44102	valid_0's l2: 0.359705
[4]	valid_0's l1: 0.434646	valid_0's l2: 0.353175
[5]	valid_0's l1: 0.428013	valid_0's l2: 0.345992
[6]	valid_0's l1: 0.422287	valid_0's l2: 0.340866
[7]	valid_0's l1: 0.417234	valid_0's l2: 0.335992
[8]	valid_0's l1: 0.411945	valid_0's l2: 0.331023
[9]	valid_0's l1: 0.407359	valid_0's l2: 0.327279
[10]	valid_0's l1: 0.403033	valid_0's l2: 0.323422
[11]	valid_0's l1: 0.399084	valid_0's l2: 0.320253
[12]	valid_0's l1: 0.395205	valid_0's l2: 0.317306
[13]	valid_0's l1: 0.392027	valid_0's l2: 0.315088
[14]	valid_0's l1: 0.389053	valid_0's l2: 0.31273
[15]	valid_0's l1: 0.385967	valid_0's l2: 0.310184
[16]	valid_0's l1: 0.38295	valid_0's l2: 0.308012
[17]	valid_0's l1: 0.380255	valid_0's l2: 0.306147
[18]	valid_0's l1: 0.377841	valid_0's l2: 0.30468
[19]	valid_0's



Starting training...
[1]	valid_0's l1: 0.457587	valid_0's l2: 0.381829
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.451228	valid_0's l2: 0.375396
[3]	valid_0's l1: 0.4452	valid_0's l2: 0.369167
[4]	valid_0's l1: 0.439931	valid_0's l2: 0.364248
[5]	valid_0's l1: 0.434137	valid_0's l2: 0.357004
[6]	valid_0's l1: 0.429076	valid_0's l2: 0.352307
[7]	valid_0's l1: 0.424329	valid_0's l2: 0.348194
[8]	valid_0's l1: 0.419914	valid_0's l2: 0.343335
[9]	valid_0's l1: 0.416036	valid_0's l2: 0.339986
[10]	valid_0's l1: 0.412319	valid_0's l2: 0.336244
[11]	valid_0's l1: 0.408819	valid_0's l2: 0.333749
[12]	valid_0's l1: 0.405482	valid_0's l2: 0.331163
[13]	valid_0's l1: 0.402738	valid_0's l2: 0.329462
[14]	valid_0's l1: 0.4	valid_0's l2: 0.32787
[15]	valid_0's l1: 0.39726	valid_0's l2: 0.325744
[16]	valid_0's l1: 0.394624	valid_0's l2: 0.323955
[17]	valid_0's l1: 0.392064	valid_0's l2: 0.322715
[18]	valid_0's l1: 0.389667	valid_0's l2: 0.321875
[19]	valid_0's l1:



Starting training...
[1]	valid_0's l1: 0.45814	valid_0's l2: 0.385352
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.452342	valid_0's l2: 0.379011
[3]	valid_0's l1: 0.44655	valid_0's l2: 0.37289
[4]	valid_0's l1: 0.44135	valid_0's l2: 0.367706
[5]	valid_0's l1: 0.435894	valid_0's l2: 0.360385
[6]	valid_0's l1: 0.431229	valid_0's l2: 0.355901
[7]	valid_0's l1: 0.426994	valid_0's l2: 0.352404
[8]	valid_0's l1: 0.42265	valid_0's l2: 0.347653
[9]	valid_0's l1: 0.418828	valid_0's l2: 0.344602
[10]	valid_0's l1: 0.41524	valid_0's l2: 0.341087
[11]	valid_0's l1: 0.4121	valid_0's l2: 0.339039
[12]	valid_0's l1: 0.409322	valid_0's l2: 0.337246
[13]	valid_0's l1: 0.406411	valid_0's l2: 0.335524
[14]	valid_0's l1: 0.403948	valid_0's l2: 0.334269
[15]	valid_0's l1: 0.401771	valid_0's l2: 0.332704
[16]	valid_0's l1: 0.399648	valid_0's l2: 0.331379
[17]	valid_0's l1: 0.397377	valid_0's l2: 0.330449
[18]	valid_0's l1: 0.395494	valid_0's l2: 0.329998
[19]	valid_0's l1



Starting training...
[1]	valid_0's l1: 0.460589	valid_0's l2: 0.391921
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.45512	valid_0's l2: 0.385081
[3]	valid_0's l1: 0.449872	valid_0's l2: 0.378617
[4]	valid_0's l1: 0.445458	valid_0's l2: 0.373569
[5]	valid_0's l1: 0.440442	valid_0's l2: 0.366965
[6]	valid_0's l1: 0.436478	valid_0's l2: 0.362585
[7]	valid_0's l1: 0.432692	valid_0's l2: 0.359329
[8]	valid_0's l1: 0.429002	valid_0's l2: 0.355484
[9]	valid_0's l1: 0.425998	valid_0's l2: 0.353268
[10]	valid_0's l1: 0.423035	valid_0's l2: 0.350502
[11]	valid_0's l1: 0.42045	valid_0's l2: 0.348488
[12]	valid_0's l1: 0.417906	valid_0's l2: 0.347118
[13]	valid_0's l1: 0.41559	valid_0's l2: 0.345965
[14]	valid_0's l1: 0.413817	valid_0's l2: 0.34526
[15]	valid_0's l1: 0.411955	valid_0's l2: 0.343835
[16]	valid_0's l1: 0.409914	valid_0's l2: 0.342598
[17]	valid_0's l1: 0.407941	valid_0's l2: 0.341955
[18]	valid_0's l1: 0.406308	valid_0's l2: 0.341603
[19]	valid_0'



Starting training...
[1]	valid_0's l1: 0.46092	valid_0's l2: 0.393605
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.455875	valid_0's l2: 0.387431
[3]	valid_0's l1: 0.451199	valid_0's l2: 0.382666
[4]	valid_0's l1: 0.447329	valid_0's l2: 0.378546
[5]	valid_0's l1: 0.443374	valid_0's l2: 0.374407
[6]	valid_0's l1: 0.4404	valid_0's l2: 0.371908
[7]	valid_0's l1: 0.438018	valid_0's l2: 0.369916
[8]	valid_0's l1: 0.435173	valid_0's l2: 0.367822
[9]	valid_0's l1: 0.43318	valid_0's l2: 0.366669
[10]	valid_0's l1: 0.431138	valid_0's l2: 0.365963
[11]	valid_0's l1: 0.42927	valid_0's l2: 0.365228
[12]	valid_0's l1: 0.427472	valid_0's l2: 0.364578
[13]	valid_0's l1: 0.426077	valid_0's l2: 0.364837
[14]	valid_0's l1: 0.424815	valid_0's l2: 0.365321
[15]	valid_0's l1: 0.423875	valid_0's l2: 0.366458
[16]	valid_0's l1: 0.422825	valid_0's l2: 0.367089
[17]	valid_0's l1: 0.422166	valid_0's l2: 0.368083
[18]	valid_0's l1: 0.421389	valid_0's l2: 0.368596
Early stopping
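
The "% improved" columns written by the cell above compare each metric with and without the CD features. A minimal sketch of a direction-aware version follows, assuming RMSE, MAE and SMAPE are lower-is-better and R squared is higher-is-better. The helper `relative_improvement` is hypothetical and is not part of the notebook.

```python
def relative_improvement(without_cd, with_cd, higher_is_better=False):
    """Fractional improvement of `with_cd` over `without_cd`.

    Positive means the CD features helped. Returns NaN when the baseline
    is zero, where the ratio is undefined.
    """
    if without_cd == 0:
        return float("nan")
    if higher_is_better:
        change = with_cd - without_cd
    else:
        change = without_cd - with_cd
    # abs() keeps the sign meaningful when the baseline is negative (possible for R squared)
    return change / abs(without_cd)


# Illustrative numbers only, not results from the notebook
print(relative_improvement(0.50, 0.40))                         # error metric drops
print(relative_improvement(0.60, 0.75, higher_is_better=True))  # R squared rises
```

The result is a fraction, as in the notebook's own columns; multiply by 100 to report a percentage.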

<h1>Modified Lags</h1>

- Predict DFma_1 as the target<br>
- Predict DF_1 as the target<br>
The independent variables are adjusted <b>according to the time horizon</b>: each extra week ahead drops the most recent remaining lag.<br>
 - 1 week ahead: independent variables are DFma_0, DFma_wm1, DFma_wm2, ..., and DFma_wm6<br>
 - 2 weeks ahead: independent variables are DFma_wm1, DFma_wm2, DFma_wm3, ..., and DFma_wm6<br>
 - 3 weeks ahead: independent variables are DFma_wm2, DFma_wm3, DFma_wm4, ..., and DFma_wm6<br>
The maximum time horizon is 6 weeks ahead.
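
The horizon rule above can be sketched as a column selection. This is a sketch under assumptions: the lag names follow the list above, `RF_wm6` and `LST_wm4` are taken from the column comments in the district-level cell and are assumed to be kept at every horizon, and `features_for_horizon` is a hypothetical helper.

```python
# Lag columns ordered from most recent to oldest
LAG_COLUMNS = ["DFma_0", "DFma_wm1", "DFma_wm2", "DFma_wm3",
               "DFma_wm4", "DFma_wm5", "DFma_wm6"]
# Weather columns, assumed to be kept at every horizon
WEATHER_COLUMNS = ["RF_wm6", "LST_wm4"]


def features_for_horizon(horizon):
    """Return the independent variables for a 1- to 6-week-ahead forecast."""
    if not 1 <= horizon <= 6:
        raise ValueError("horizon must be between 1 and 6 weeks")
    # Each extra week ahead drops the most recent remaining lag
    return LAG_COLUMNS[horizon - 1:] + WEATHER_COLUMNS


for horizon in (1, 2, 6):
    print(horizon, features_for_horizon(horizon))
```

With `j = horizon - 1`, this name-based selection corresponds to the positional slice `iloc[:, (13 + j): 22]` used for the without-CD features.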

<h2>District level</h2>
For MAs (adjusted CD)

In [118]:
# Arrays of all evaluation values
# row: head,
# RMSE (DFma_1), 
# MAE (DFma_1), 
# SMAPE (DFma_1), 
# R-squared (DFma_1)

# col: head,
# MA2 (without CD, with CD, % improved),
# MA3 (without CD, with CD, % improved),
# MA4 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'MA2 without CD', 'MA2 with CD', 'MA2 % improved', 
                         'MA3 without CD', 'MA3 with CD', 'MA3 % improved', 
                         'MA4 without CD', 'MA4 with CD', 'MA4 % improved']])
rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

# Starting from MA2 to MA4
for i in range(2, 5):
    # Get the input variables from CSV file
    # Change files directory here
    train_file_dir = 'Data/' + province1 + '/Modified Lags/train_' + province2 + '_dist_total_mavg' + str(i) + '.csv'
    test_file_dir = 'Data/' + province1 + '/Modified Lags/test_' + province2 + '_dist_total_mavg' + str(i) + '.csv'
    
    df_train_dist =  pd.read_csv(train_file_dir, header = 0, skiprows = 0)
    df_test_dist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

    # For each time horizon, from 1 to 6 weeks ahead (the target stays DFma_1)
    for j in range(6):
        # Select the addrcode, week, year and actual-value columns first
        df_test_addrcode_week_year_dist = df_test_dist.iloc[:, [1, 2, 3, 12]]
        
        ## Without CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 13],
        # DFma_wm1 [col 14],
        # DFma_wm2 [col 15],
        # DFma_wm3 [col 16],
        # DFma_wm4 [col 17],
        # DFma_wm5 [col 18],
        # DFma_wm6 [col 19],
        # RF_wm6 [col 20],
        # and LST_wm4 [col 21]
        
        ## With CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 13],
        # DFma_wm1 [col 14],
        # DFma_wm2 [col 15],
        # DFma_wm3 [col 16],
        # DFma_wm4 [col 17],
        # DFma_wm5 [col 18],
        # DFma_wm6 [col 19],
        # RF_wm6 [col 20],
        # LST_wm4 [col 21],
        # bin [col 22],
        # bowl [col 23],
        # bucket [col 24],
        # misc_short [col 25],
        # jar [col 26],
        # pottedplant [col 27],
        # tire [col 28],
        # misc_tall [col 29],
        # and total [col 30]
        
        x_train_withoutCD = df_train_dist.iloc[:, (13 + j): 22]
        x_train_withCD = df_train_dist.iloc[:, (13 + j): 31]
        
        x_test_withoutCD = df_test_dist.iloc[:, (13 + j): 22]
        x_test_withCD = df_test_dist.iloc[:, (13 + j): 31]
        
        # y: response (target) variable DFma_1 [col 12]
        y_train = df_train_dist.iloc[:, [12]]
        y_test = df_test_dist.iloc[:, [12]]
        
        # Pass the response values to the array for evaluation calculation
        y_train_true = np.array(df_train_dist['DFma_1'])
        y_test_true = np.array(df_test_dist['DFma_1'])
        
        # Pass the dataset of both independent and response variables to Light GBM
        lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
        lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
        lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
        lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

        params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2', 'l1'},
        'num_leaves': num_leaves,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
        }

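        # Train the model.
        # Note: LightGBM 4.0 and later no longer accept early_stopping_rounds in lgb.train();
        # there, pass callbacks = [lgb.early_stopping(6)] instead.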
        print('Starting training...')
        gbm_withoutCD = lgb.train(params,
                    lgb_train_withoutCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withoutCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withoutCD.save_model('model.txt')
        
        gbm_withCD = lgb.train(params,
                    lgb_train_withCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withCD.save_model('model.txt')

        # Predict on the test data
        print('Starting predicting...')
        y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
        y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

        df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
        df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
        # Store all of the predicted values to the CSV files
        df_compare_addrcode_dist_withoutCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withoutCD], axis = 1)
        df_compare_addrcode_dist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_dist_withoutCD.to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' 
                                                  + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' 
                                                  + str(i) + '_horizon_' + str(j + 1) + '_withoutCD_' + str(num_leaves) 
                                                  + '.csv', encoding = 'utf-8')

        df_compare_addrcode_dist_withCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withCD], axis = 1)
        df_compare_addrcode_dist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_dist_withCD.to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' 
                                               + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' 
                                               + str(i) + '_horizon_' + str(j + 1) + '_withCD_' + str(num_leaves) + '.csv', 
                                               encoding = 'utf-8')
        
        # Evaluation
        rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
        mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
        r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
        smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
        #print('RMSE of the prediction without CD is:', rmse_withoutCD)
        #print('MAE of the prediction without CD is:', mae_withoutCD)
        #print('R-squared of the prediction without CD is:', r2_withoutCD)
        #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
        rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
        mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
        r2_withCD = r2_score(y_test_true, y_pred_withCD)
        smape_withCD = smape_fast(y_test_true, y_pred_withCD)
        #print('RMSE of the prediction with CD is:', rmse_withCD)
        #print('MAE of the prediction with CD is:', mae_withCD)
        #print('R-squared of the prediction with CD is:', r2_withCD)
        #print('SMAPE of the prediction with CD is:', smape_withCD)
        
        rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
        mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
        smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
        # R-squared improves when it increases, so the sign is reversed relative to the error metrics
        r2_percent_improved = (r2_withCD - r2_withoutCD) / abs(r2_withoutCD)
        #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        #print(eval_array)
        
        rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
        smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
        r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
        df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                   + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' + str(i) + '_horizon_' + str(j + 1) 
                                   + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
        df_withCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' + str(i) + '_horizon_' + str(j + 1) 
                                + '_withCD_' + str(num_leaves) + '.csv', header = 0)
        
        dist_code = df_train_dist['addrcode'].unique()
        
        # For each district
        for k in dist_code:
            
            # Get the subset of actual and predicted values according to the district code
            subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == k]
            subset_withCD = df_withCD.loc[df_withCD['addrcode'] == k]
            
            # Pass the response values to the array for evaluation calculation
            array_true = np.array(subset_withoutCD['actual'])
            array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
            array_pred_withCD = np.array(subset_withCD['predicted'])
            
            # Calculate the evaluation values
            rmse_withoutCD_dist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
            mae_withoutCD_dist = mean_absolute_error(array_true, array_pred_withoutCD)
            smape_withoutCD_dist = smape_fast(array_true, array_pred_withoutCD)
            r2_withoutCD_dist = r2_score(array_true, array_pred_withoutCD)
            
            rmse_withCD_dist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
            mae_withCD_dist = mean_absolute_error(array_true, array_pred_withCD)
            smape_withCD_dist = smape_fast(array_true, array_pred_withCD)
            r2_withCD_dist = r2_score(array_true, array_pred_withCD)
            
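            # Relative change per district, as a fraction of the value without CD.
            # Positive means the metric fell with CD: better for RMSE, MAE and SMAPE, worse for R squared.
            rmse_percent_improved_dist = (rmse_withoutCD_dist - rmse_withCD_dist) / rmse_withoutCD_dist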
            mae_percent_improved_dist = (mae_withoutCD_dist - mae_withCD_dist) / mae_withoutCD_dist
            smape_percent_improved_dist = (smape_withoutCD_dist - smape_withCD_dist) / smape_withoutCD_dist
            r2_percent_improved_dist = (r2_withoutCD_dist - r2_withCD_dist) / r2_withoutCD_dist
            
            # Append
            dist_array = np.append(dist_array, [[k, rmse_withoutCD_dist, rmse_withCD_dist, rmse_percent_improved_dist,
                                                mae_withoutCD_dist, mae_withCD_dist, mae_percent_improved_dist,
                                                smape_withoutCD_dist, smape_withCD_dist, smape_percent_improved_dist,
                                                r2_withoutCD_dist, r2_withCD_dist, r2_percent_improved_dist]], axis = 0)

        #print(dist_array)
        pd.DataFrame(dist_array).to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                        + '/MA' + str(i) + '/LGBM_' + province2 + '_ByDistrict_MA' + str(i) + '_horizon_' 
                                        + str(j + 1) + '_eval_' + str(num_leaves) + '.csv', header = False, 
                                        encoding = 'utf-8')
        
        # Reset dist_array to its header row before the next iteration
        dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DFma_1 to R squared DFma_6
eval_array = evaluation_print_modified_lag(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_modified_lag(eval_array, 'MAE', mae)
eval_array = evaluation_print_modified_lag(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_modified_lag(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/LGBM_' + province2 + '_dist_eval_' + str(num_leaves) + '.csv', header = False, 
                                encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.283641	valid_0's l2: 0.110376
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.271572	valid_0's l2: 0.101492
[3]	valid_0's l1: 0.26041	valid_0's l2: 0.0935903
[4]	valid_0's l1: 0.249939	valid_0's l2: 0.086458
[5]	valid_0's l1: 0.240266	valid_0's l2: 0.0800588
[6]	valid_0's l1: 0.230778	valid_0's l2: 0.0739703
[7]	valid_0's l1: 0.222297	valid_0's l2: 0.0687202
[8]	valid_0's l1: 0.214238	valid_0's l2: 0.0639262
[9]	valid_0's l1: 0.206957	valid_0's l2: 0.0597754
[10]	valid_0's l1: 0.19999	valid_0's l2: 0.0559453
[11]	valid_0's l1: 0.193503	valid_0's l2: 0.0526016
[12]	valid_0's l1: 0.187519	valid_0's l2: 0.0496168
[13]	valid_0's l1: 0.181837	valid_0's l2: 0.0468264
[14]	valid_0's l1: 0.176483	valid_0's l2: 0.044302
[15]	valid_0's l1: 0.171885	valid_0's l2: 0.0422899
[16]	valid_0's l1: 0.167348	valid_0's l2: 0.0403468
[17]	valid_0's l1: 0.163089	valid_0's l2: 0.0385858
[18]	valid_0's l1: 0.159218	valid_0's l2: 0.03703

Starting training...
[1]	valid_0's l1: 0.286456	valid_0's l2: 0.112864
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.27821	valid_0's l2: 0.106555
[3]	valid_0's l1: 0.271292	valid_0's l2: 0.101491
[4]	valid_0's l1: 0.265564	valid_0's l2: 0.0974065
[5]	valid_0's l1: 0.260955	valid_0's l2: 0.0942011
[6]	valid_0's l1: 0.256499	valid_0's l2: 0.0913524
[7]	valid_0's l1: 0.253219	valid_0's l2: 0.0894192
[8]	valid_0's l1: 0.250239	valid_0's l2: 0.087907
[9]	valid_0's l1: 0.247597	valid_0's l2: 0.0868757
[10]	valid_0's l1: 0.245143	valid_0's l2: 0.0862237
[11]	valid_0's l1: 0.242977	valid_0's l2: 0.0857874
[12]	valid_0's l1: 0.240693	valid_0's l2: 0.085384
[13]	valid_0's l1: 0.239166	valid_0's l2: 0.0857084
[14]	valid_0's l1: 0.237841	valid_0's l2: 0.0863646
[15]	valid_0's l1: 0.236823	valid_0's l2: 0.0872521
[16]	valid_0's l1: 0.236068	valid_0's l2: 0.0884058
[17]	valid_0's l1: 0.235002	valid_0's l2: 0.0892528
[18]	valid_0's l1: 0.234272	valid_0's l2: 0.09033

Starting training...
[1]	valid_0's l1: 0.273941	valid_0's l2: 0.101831
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.263662	valid_0's l2: 0.0945538
[3]	valid_0's l1: 0.254726	valid_0's l2: 0.088533
[4]	valid_0's l1: 0.246928	valid_0's l2: 0.0835239
[5]	valid_0's l1: 0.240052	valid_0's l2: 0.0792437
[6]	valid_0's l1: 0.233552	valid_0's l2: 0.0752471
[7]	valid_0's l1: 0.227505	valid_0's l2: 0.0716614
[8]	valid_0's l1: 0.223143	valid_0's l2: 0.0691763
[9]	valid_0's l1: 0.218188	valid_0's l2: 0.0665136
[10]	valid_0's l1: 0.213805	valid_0's l2: 0.0642877
[11]	valid_0's l1: 0.209855	valid_0's l2: 0.0624407
[12]	valid_0's l1: 0.206305	valid_0's l2: 0.0609559
[13]	valid_0's l1: 0.203076	valid_0's l2: 0.0596966
[14]	valid_0's l1: 0.200039	valid_0's l2: 0.0586393
[15]	valid_0's l1: 0.19755	valid_0's l2: 0.0578463
[16]	valid_0's l1: 0.195369	valid_0's l2: 0.0572276
[17]	valid_0's l1: 0.193324	valid_0's l2: 0.0567586
[18]	valid_0's l1: 0.19158	valid_0's l2: 0.056

Starting training...
[1]	valid_0's l1: 0.263514	valid_0's l2: 0.0938624
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.250934	valid_0's l2: 0.0851496
[3]	valid_0's l1: 0.239062	valid_0's l2: 0.0772854
[4]	valid_0's l1: 0.227776	valid_0's l2: 0.0701867
[5]	valid_0's l1: 0.217209	valid_0's l2: 0.0638335
[6]	valid_0's l1: 0.207223	valid_0's l2: 0.0580765
[7]	valid_0's l1: 0.197795	valid_0's l2: 0.0528953
[8]	valid_0's l1: 0.18881	valid_0's l2: 0.0482259
[9]	valid_0's l1: 0.180389	valid_0's l2: 0.0440176
[10]	valid_0's l1: 0.172586	valid_0's l2: 0.0402947
[11]	valid_0's l1: 0.165197	valid_0's l2: 0.0369492
[12]	valid_0's l1: 0.158277	valid_0's l2: 0.0339428
[13]	valid_0's l1: 0.151732	valid_0's l2: 0.0312259
[14]	valid_0's l1: 0.145538	valid_0's l2: 0.0287703
[15]	valid_0's l1: 0.139602	valid_0's l2: 0.0265104
[16]	valid_0's l1: 0.13409	valid_0's l2: 0.0245161
[17]	valid_0's l1: 0.128784	valid_0's l2: 0.0226932
[18]	valid_0's l1: 0.123931	valid_0's l2: 0.0

Starting training...
[1]	valid_0's l1: 0.266376	valid_0's l2: 0.0960921
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.257052	valid_0's l2: 0.0896502
[3]	valid_0's l1: 0.249402	valid_0's l2: 0.0844811
[4]	valid_0's l1: 0.243069	valid_0's l2: 0.0802362
[5]	valid_0's l1: 0.237738	valid_0's l2: 0.076786
[6]	valid_0's l1: 0.232679	valid_0's l2: 0.0736859
[7]	valid_0's l1: 0.228958	valid_0's l2: 0.071568
[8]	valid_0's l1: 0.225615	valid_0's l2: 0.0698964
[9]	valid_0's l1: 0.22316	valid_0's l2: 0.069034
[10]	valid_0's l1: 0.22068	valid_0's l2: 0.0682072
[11]	valid_0's l1: 0.218561	valid_0's l2: 0.0677982
[12]	valid_0's l1: 0.216288	valid_0's l2: 0.0672888
[13]	valid_0's l1: 0.214888	valid_0's l2: 0.067462
[14]	valid_0's l1: 0.213533	valid_0's l2: 0.0678073
[15]	valid_0's l1: 0.212383	valid_0's l2: 0.0683252
[16]	valid_0's l1: 0.211686	valid_0's l2: 0.0691729
[17]	valid_0's l1: 0.210426	valid_0's l2: 0.069549
[18]	valid_0's l1: 0.209409	valid_0's l2: 0.070156
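
The logs above report "Training until validation scores don't improve for 6 rounds". As a rough illustration of that patience rule, the sketch below replays the l2 values from the last log on a single metric. The function name `replay_early_stopping` is hypothetical, and this is not LightGBM's own callback, which tracks every validation metric; it only shows why, under this simplified rule, a run whose l2 bottoms out at round 12 halts at round 18.

```python
def replay_early_stopping(scores, patience):
    """Replay a patience-based stopping rule on one lower-is-better metric.

    Returns (best_round, stop_round), both 1-based. stop_round is None when
    the rule never triggers within the given scores.
    """
    best_score = float('inf')
    best_round = 0
    for round_no, score in enumerate(scores, start=1):
        if score < best_score:
            # New best value: remember where it happened
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here
            return best_round, round_no
    return best_round, None


# l2 values copied from the last training log above (rounds 1 to 18)
l2_log = [0.0960921, 0.0896502, 0.0844811, 0.0802362, 0.076786, 0.0736859,
          0.071568, 0.0698964, 0.069034, 0.0682072, 0.0677982, 0.0672888,
          0.067462, 0.0678073, 0.0683252, 0.0691729, 0.069549, 0.070156]

best_round, stop_round = replay_early_stopping(l2_log, patience=6)
print(best_round, stop_round)
```

Because the notebook predicts with `num_iteration = gbm.best_iteration`, the rounds trained after the best one should not affect the saved predictions.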

For original DF_0 (without smoothing, adjusted CD)

In [119]:
# Arrays of all evaluation values
# row: head,
# RMSE (DF_1 - DF_6), 
# MAE (DF_1 - DF_6), 
# SMAPE (DF_1 - DF_6), 
# R-squared (DF_1 - DF_6)

# col: head,
# DF_0 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'Without CD', 'With CD', '% improved']])

rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

train_file_dir = 'Data/' + province1 + '/Modified Lags/train_' + province2 + '_dist_total_mavg2.csv'
test_file_dir = 'Data/' + province1 + '/Modified Lags/test_' + province2 + '_dist_total_mavg2.csv'

df_train_dist = pd.read_csv(train_file_dir, header = 0, skiprows = 0)
df_test_dist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

# From DF_1 to DF_6
for i in range(6):
    # Select the addrcode, week, year and actual-value columns first
    df_test_addrcode_week_year_dist = df_test_dist.iloc[:,[1, 2, 3, 4]]
    
    ## Without CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 5],
    # DF_wm1 [col 6], 
    # DF_wm2 [col 7],
    # DF_wm3 [col 8],
    # DF_wm4 [col 9], 
    # DF_wm5 [col 10],
    # DF_wm6 [col 11],
    # RF_wm6 [col 20],
    # and LST_wm4 [col 21]
        
    ## With CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 5],
    # DF_wm1 [col 6], 
    # DF_wm2 [col 7],
    # DF_wm3 [col 8],
    # DF_wm4 [col 9], 
    # DF_wm5 [col 10],
    # DF_wm6 [col 11],
    # RF_wm6 [col 20],
    # LST_wm4 [col 21],
    # bin_pop9s [col 22],
    # bowl_pop9s [col 23],
    # bucket_pop9s [col 24],
    # misc_short_pop9s [col 25],
    # jar_pop9s [col 26],
    # pottedplant_pop9s [col 27],
    # tire_pop9s [col 28],
    # misc_tall_pop9s [col 29],
    # and total_pop9s [col 30]
    
    df_train_dist_DFinfo = df_train_dist.iloc[:, (5 + i):12]
    df_train_dist_withoutCD = df_train_dist.iloc[:, [20, 21]]
    df_train_dist_withCD = df_train_dist.iloc[:, 20: 31]
    
    df_test_dist_DFinfo = df_test_dist.iloc[:, (5 + i):12]
    df_test_dist_withoutCD = df_test_dist.iloc[:, [20, 21]]
    df_test_dist_withCD = df_test_dist.iloc[:, 20: 31]
        
    x_train_withoutCD = pd.concat([df_train_dist_DFinfo, df_train_dist_withoutCD], axis = 1)
    x_train_withCD = pd.concat([df_train_dist_DFinfo, df_train_dist_withCD], axis = 1)
    
    x_test_withoutCD = pd.concat([df_test_dist_DFinfo, df_test_dist_withoutCD], axis = 1)
    x_test_withCD = pd.concat([df_test_dist_DFinfo, df_test_dist_withCD], axis = 1)
    
    # y: response (target) variable DF_1 (col 4)
    y_train = df_train_dist.iloc[:, [4]]
    y_test = df_test_dist.iloc[:, [4]]
    
    # Pass the response values to the array for evaluation calculation
    y_train_true = np.array(df_train_dist['DF_1'])
    y_test_true = np.array(df_test_dist['DF_1'])
    
    # Pass the dataset of both independent and response variables to Light GBM
    lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
    lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
    lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
    lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2', 'l1'},
        'num_leaves': num_leaves,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }

    # Train the model
    print('Starting training...')
    gbm_withoutCD = lgb.train(params,
                lgb_train_withoutCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withoutCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withoutCD.save_model('model.txt')
    
    gbm_withCD = lgb.train(params,
                lgb_train_withCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withCD.save_model('model.txt')
    
    # Predict on the test data
    print('Starting predicting...')
    y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
    y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

    df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
    df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
    # Store all of the predicted values to the CSV files
    df_compare_addrcode_dist_withoutCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withoutCD], axis = 1)
    df_compare_addrcode_dist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_dist_withoutCD.to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' 
                                              + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' 
                                              + str(i + 1) + '_withoutCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')

    df_compare_addrcode_dist_withCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withCD], axis = 1)
    df_compare_addrcode_dist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_dist_withCD.to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' 
                                           + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' 
                                           + str(i + 1) + '_withCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')
    
    # Evaluation
    rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
    mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
    r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
    smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
    #print('RMSE of the prediction without CD is:', rmse_withoutCD)
    #print('MAE of the prediction without CD is:', mae_withoutCD)
    #print('R-squared of the prediction without CD is:', r2_withoutCD)
    #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
    rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
    mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
    r2_withCD = r2_score(y_test_true, y_pred_withCD)
    smape_withCD = smape_fast(y_test_true, y_pred_withCD)
    #print('RMSE of the prediction with CD is:', rmse_withCD)
    #print('MAE of the prediction with CD is:', mae_withCD)
    #print('R-squared of the prediction with CD is:', r2_withCD)
    #print('SMAPE of the prediction with CD is:', smape_withCD)
        
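    # Relative change as a fraction of the value without CD (not multiplied by 100).
    # Positive means the metric fell with CD: better for RMSE, MAE and SMAPE, worse for R squared.
    rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD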
    mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
    smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
    r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD
    #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    #print(eval_array)
        
    rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
    smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
    r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
    #df_withoutCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withoutCD.csv', header = 0)
    #df_withCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withCD.csv', header = 0)
    
    df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                               + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) + '_withoutCD_' 
                               + str(num_leaves) + '.csv', header = 0)
    df_withCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                            + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) + '_withCD_' 
                            + str(num_leaves) + '.csv', header = 0)
        
    dist_code = df_train_dist['addrcode'].unique()
    
    # For each district
    for j in dist_code:
            
        # Get the subset of actual and predicted values according to the district code
        subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == j]
        subset_withCD = df_withCD.loc[df_withCD['addrcode'] == j]
            
        # Pass the response values to the array for evaluation calculation
        array_true = np.array(subset_withoutCD['actual'])
        array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
        array_pred_withCD = np.array(subset_withCD['predicted'])
            
        # Calculate the evaluation values
        rmse_withoutCD_dist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
        mae_withoutCD_dist = mean_absolute_error(array_true, array_pred_withoutCD)
        smape_withoutCD_dist = smape_fast(array_true, array_pred_withoutCD)
        r2_withoutCD_dist = r2_score(array_true, array_pred_withoutCD)
            
        rmse_withCD_dist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
        mae_withCD_dist = mean_absolute_error(array_true, array_pred_withCD)
        smape_withCD_dist = smape_fast(array_true, array_pred_withCD)
        r2_withCD_dist = r2_score(array_true, array_pred_withCD)
            
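        # Relative change per district, as a fraction of the value without CD.
        # Positive means the metric fell with CD: better for RMSE, MAE and SMAPE, worse for R squared.
        rmse_percent_improved_dist = (rmse_withoutCD_dist - rmse_withCD_dist) / rmse_withoutCD_dist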
        mae_percent_improved_dist = (mae_withoutCD_dist - mae_withCD_dist) / mae_withoutCD_dist
        smape_percent_improved_dist = (smape_withoutCD_dist - smape_withCD_dist) / smape_withoutCD_dist
        r2_percent_improved_dist = (r2_withoutCD_dist - r2_withCD_dist) / r2_withoutCD_dist
            
        # Append
        dist_array = np.append(dist_array, [[j, rmse_withoutCD_dist, rmse_withCD_dist, rmse_percent_improved_dist,
                                            mae_withoutCD_dist, mae_withCD_dist, mae_percent_improved_dist,
                                            smape_withoutCD_dist, smape_withCD_dist, smape_percent_improved_dist,
                                            r2_withoutCD_dist, r2_withCD_dist, r2_percent_improved_dist]], axis = 0)

    #print(dist_array)
    pd.DataFrame(dist_array).to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                    + '/Original DF_0/LGBM_' + province2 + '_ByDistrict_Original_DF_' + str(i + 1) 
                                    + '_eval_' + str(num_leaves) + '.csv', header = False, encoding = 'utf-8')
        
    # Reset dist_array to its header row before the next iteration
    dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                              'MAE without CD', 'MAE with CD', '% improved MAE', 
                              'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                              'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DF_1 to R squared DF_6
eval_array = evaluation_print_modified_lag_original(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_modified_lag_original(eval_array, 'MAE', mae)
eval_array = evaluation_print_modified_lag_original(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_modified_lag_original(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/Original DF_0/LGBM_' + province2 + '_dist_original_eval_' + str(num_leaves) 
                                + '.csv', header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.31428	valid_0's l2: 0.139405
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.305192	valid_0's l2: 0.132154
[3]	valid_0's l1: 0.296936	valid_0's l2: 0.125953
[4]	valid_0's l1: 0.289451	valid_0's l2: 0.120464
[5]	valid_0's l1: 0.28261	valid_0's l2: 0.115405
[6]	valid_0's l1: 0.27608	valid_0's l2: 0.11089
[7]	valid_0's l1: 0.270015	valid_0's l2: 0.106922
[8]	valid_0's l1: 0.264622	valid_0's l2: 0.10356
[9]	valid_0's l1: 0.259604	valid_0's l2: 0.100467
[10]	valid_0's l1: 0.255125	valid_0's l2: 0.0978501
[11]	valid_0's l1: 0.250901	valid_0's l2: 0.0954956
[12]	valid_0's l1: 0.247003	valid_0's l2: 0.0933007
[13]	valid_0's l1: 0.243492	valid_0's l2: 0.0914733
[14]	valid_0's l1: 0.240292	valid_0's l2: 0.0900979
[15]	valid_0's l1: 0.23739	valid_0's l2: 0.088925
[16]	valid_0's l1: 0.234363	valid_0's l2: 0.0876695
[17]	valid_0's l1: 0.231615	valid_0's l2: 0.0866116
[18]	valid_0's l1: 0.22914	valid_0's l2: 0.0857226
[19]	val

Starting training...
[1]	valid_0's l1: 0.315479	valid_0's l2: 0.140282
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.308749	valid_0's l2: 0.134484
[3]	valid_0's l1: 0.302685	valid_0's l2: 0.129557
[4]	valid_0's l1: 0.297787	valid_0's l2: 0.125835
[5]	valid_0's l1: 0.293558	valid_0's l2: 0.12276
[6]	valid_0's l1: 0.289579	valid_0's l2: 0.119919
[7]	valid_0's l1: 0.286596	valid_0's l2: 0.118055
[8]	valid_0's l1: 0.284009	valid_0's l2: 0.116676
[9]	valid_0's l1: 0.281736	valid_0's l2: 0.115773
[10]	valid_0's l1: 0.27965	valid_0's l2: 0.115259
[11]	valid_0's l1: 0.278097	valid_0's l2: 0.114976
[12]	valid_0's l1: 0.276321	valid_0's l2: 0.114741
[13]	valid_0's l1: 0.275249	valid_0's l2: 0.115117
[14]	valid_0's l1: 0.274344	valid_0's l2: 0.11579
[15]	valid_0's l1: 0.273604	valid_0's l2: 0.116738
[16]	valid_0's l1: 0.273224	valid_0's l2: 0.118049
[17]	valid_0's l1: 0.272328	valid_0's l2: 0.119011
[18]	valid_0's l1: 0.271665	valid_0's l2: 0.120289
Early stoppi
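
The cells above store `(without - with) / without` for every metric. That reads as an improvement for the error metrics (RMSE, MAE, SMAPE), but R squared is higher-is-better, so the same expression is positive when R squared gets worse. A minimal sketch of a direction-aware helper follows; `relative_improvement` and its `higher_is_better` flag are hypothetical names, not part of the notebook.

```python
import math


def relative_improvement(baseline, candidate, higher_is_better=False):
    """Return the improvement of candidate over baseline as a fraction.

    A positive result always means the candidate is better, whichever
    direction the metric runs in. Returns NaN when the baseline is zero.
    """
    if baseline == 0:
        return math.nan
    if higher_is_better:
        change = candidate - baseline   # e.g. R squared
    else:
        change = baseline - candidate   # e.g. RMSE, MAE, SMAPE
    # abs() keeps the sign meaningful when the baseline is negative,
    # which can happen for R squared
    return change / abs(baseline)


# Illustrative numbers only, not results from this notebook
print(relative_improvement(0.5, 0.4))                          # error metric got smaller
print(relative_improvement(0.8, 0.9, higher_is_better=True))   # R squared got larger
```

Multiplying the result by 100 would give a true percentage for the "% improved" columns.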

For MAs (normal CD)

In [120]:
# Arrays of all evaluation values
# row: head,
# RMSE (DFma_1), 
# MAE (DFma_1), 
# SMAPE (DFma_1), 
# R-squared (DFma_1)

# col: head,
# MA2 (without CD, with CD, % improved),
# MA3 (without CD, with CD, % improved),
# MA4 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'MA2 without CD', 'MA2 with CD', 'MA2 % improved', 
                         'MA3 without CD', 'MA3 with CD', 'MA3 % improved', 
                         'MA4 without CD', 'MA4 with CD', 'MA4 % improved']])
rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

# Starting from MA2 to MA4
for i in range(2, 5):
    # Get the input variables from CSV file
    # Change files directory here
    train_file_dir = 'Data/' + province1 + '/Modified Lags/train_' + province2 + '_dist_cd_mavg' + str(i) + '.csv'
    test_file_dir = 'Data/' + province1 + '/Modified Lags/test_' + province2 + '_dist_cd_mavg' + str(i) + '.csv'
    
    df_train_dist = pd.read_csv(train_file_dir, header = 0, skiprows = 0)
    df_test_dist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

    # Continue on DFma_1 to DFma_6
    for j in range(6):
        # Select the addrcode, week, year and actual-value columns first
        df_test_addrcode_week_year_dist = df_test_dist.iloc[:, [1, 2, 3, 12]]
        
        ## Without CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 13],
        # DFma_wm1 [col 14],
        # DFma_wm2 [col 15],
        # DFma_wm3 [col 16],
        # DFma_wm4 [col 17],
        # DFma_wm5 [col 18],
        # DFma_wm6 [col 19],
        # RF_wm6 [col 20],
        # and LST_wm4 [col 21]
        
        ## With CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 13],
        # DFma_wm1 [col 14],
        # DFma_wm2 [col 15],
        # DFma_wm3 [col 16],
        # DFma_wm4 [col 17],
        # DFma_wm5 [col 18],
        # DFma_wm6 [col 19],
        # RF_wm6 [col 20],
        # LST_wm4 [col 21]
        # bin [col 22],
        # bowl [col 23],
        # bucket [col 24],
        # misc_short [col 25],
        # jar [col 26],
        # pottedplant [col 27],
        # tire [col 28],
        # misc_tall [col 29],
        # and total [col 30]
        
        x_train_withoutCD = df_train_dist.iloc[:, (13 + j): 22]
        x_train_withCD = df_train_dist.iloc[:, (13 + j): 31]
        
        x_test_withoutCD = df_test_dist.iloc[:, (13 + j): 22]
        x_test_withCD = df_test_dist.iloc[:, (13 + j): 31]
        
        # y: response (target) variable DFma_1 [col 12]
        y_train = df_train_dist.iloc[:, [12]]
        y_test = df_test_dist.iloc[:, [12]]
        
        # Pass the response values to the array for evaluation calculation
        y_train_true = np.array(df_train_dist['DFma_1'])
        y_test_true = np.array(df_test_dist['DFma_1'])
        
        # Pass the dataset of both independent and response variables to Light GBM
        lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
        lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
        lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
        lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': {'l2', 'l1'},
            'num_leaves': num_leaves,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0
        }

        # Train the model
        print('Starting training...')
        gbm_withoutCD = lgb.train(params,
                    lgb_train_withoutCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withoutCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withoutCD.save_model('model.txt')
        
        gbm_withCD = lgb.train(params,
                    lgb_train_withCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withCD.save_model('model.txt')

        # Predict on the test data
        print('Starting predicting...')
        y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
        y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

        df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
        df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
        # Store all of the predicted values to the CSV files
        df_compare_addrcode_dist_withoutCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withoutCD], axis = 1)
        df_compare_addrcode_dist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_dist_withoutCD.to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' 
                                                  + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' 
                                                  + str(i) + '_horizon_' + str(j + 1) + '_withoutCD_' + str(num_leaves) 
                                                  + '.csv', encoding = 'utf-8')

        df_compare_addrcode_dist_withCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withCD], axis = 1)
        df_compare_addrcode_dist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_dist_withCD.to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' 
                                               + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' 
                                               + str(i) + '_horizon_' + str(j + 1) + '_withCD_' + str(num_leaves) + '.csv', 
                                               encoding = 'utf-8')
        
        # Evaluation
        rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
        mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
        r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
        smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
        #print('RMSE of the prediction without CD is:', rmse_withoutCD)
        #print('MAE of the prediction without CD is:', mae_withoutCD)
        #print('R-squared of the prediction without CD is:', r2_withoutCD)
        #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
        rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
        mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
        r2_withCD = r2_score(y_test_true, y_pred_withCD)
        smape_withCD = smape_fast(y_test_true, y_pred_withCD)
        #print('RMSE of the prediction with CD is:', rmse_withCD)
        #print('MAE of the prediction with CD is:', mae_withCD)
        #print('R-squared of the prediction with CD is:', r2_withCD)
        #print('SMAPE of the prediction with CD is:', smape_withCD)
        
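        # Relative change as a fraction of the value without CD (not multiplied by 100).
        # Positive means the metric fell with CD: better for RMSE, MAE and SMAPE, worse for R squared.
        rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD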
        mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
        smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
        r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD
        #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        #print(eval_array)
        
        rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
        smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
        r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
        df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                   + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' + str(i) + '_horizon_' + str(j + 1) 
                                   + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
        df_withCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/MA' + str(i) + '/LGBM_' + province2 + '_dist_MA' + str(i) + '_horizon_' + str(j + 1) 
                                + '_withCD_' + str(num_leaves) + '.csv', header = 0)
        
        dist_code = df_train_dist['addrcode'].unique()
        
        # For each district
        for k in dist_code:
            
            # Get the subset of actual and predicted values according to the district code
            subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == k]
            subset_withCD = df_withCD.loc[df_withCD['addrcode'] == k]
            
            # Pass the response values to the array for evaluation calculation
            array_true = np.array(subset_withoutCD['actual'])
            array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
            array_pred_withCD = np.array(subset_withCD['predicted'])
            
            # Calculate the evaluation values
            rmse_withoutCD_dist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
            mae_withoutCD_dist = mean_absolute_error(array_true, array_pred_withoutCD)
            smape_withoutCD_dist = smape_fast(array_true, array_pred_withoutCD)
            r2_withoutCD_dist = r2_score(array_true, array_pred_withoutCD)
            
            rmse_withCD_dist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
            mae_withCD_dist = mean_absolute_error(array_true, array_pred_withCD)
            smape_withCD_dist = smape_fast(array_true, array_pred_withCD)
            r2_withCD_dist = r2_score(array_true, array_pred_withCD)
            
            rmse_percent_improved_dist = (rmse_withoutCD_dist - rmse_withCD_dist) / rmse_withoutCD_dist
            mae_percent_improved_dist = (mae_withoutCD_dist - mae_withCD_dist) / mae_withoutCD_dist
            smape_percent_improved_dist = (smape_withoutCD_dist - smape_withCD_dist) / smape_withoutCD_dist
            r2_percent_improved_dist = (r2_withoutCD_dist - r2_withCD_dist) / r2_withoutCD_dist
            
            # Append
            dist_array = np.append(dist_array, [[k, rmse_withoutCD_dist, rmse_withCD_dist, rmse_percent_improved_dist,
                                                mae_withoutCD_dist, mae_withCD_dist, mae_percent_improved_dist,
                                                smape_withoutCD_dist, smape_withCD_dist, smape_percent_improved_dist,
                                                r2_withoutCD_dist, r2_withCD_dist, r2_percent_improved_dist]], axis = 0)

        #print(dist_array)
        pd.DataFrame(dist_array).to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                        + '/MA' + str(i) + '/LGBM_' + province2 + '_ByDistrict_MA' + str(i) + '_horizon_' 
                                        + str(j + 1) + '_eval_' + str(num_leaves) + '.csv', header = False, 
                                        encoding = 'utf-8')
        
        # Reset dist_array to its header row before the next iteration
        dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DFma_1 to R squared DFma_6
eval_array = evaluation_print_modified_lag(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_modified_lag(eval_array, 'MAE', mae)
eval_array = evaluation_print_modified_lag(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_modified_lag(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/LGBM_' + province2 + '_dist_eval_' + str(num_leaves) + '.csv', header = False, 
                                encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.283641	valid_0's l2: 0.110376
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.271572	valid_0's l2: 0.101492
[3]	valid_0's l1: 0.26041	valid_0's l2: 0.0935903
[4]	valid_0's l1: 0.249939	valid_0's l2: 0.086458
[5]	valid_0's l1: 0.240266	valid_0's l2: 0.0800588
[6]	valid_0's l1: 0.230778	valid_0's l2: 0.0739703
[7]	valid_0's l1: 0.222297	valid_0's l2: 0.0687202
[8]	valid_0's l1: 0.214238	valid_0's l2: 0.0639262
[9]	valid_0's l1: 0.206957	valid_0's l2: 0.0597754
[10]	valid_0's l1: 0.19999	valid_0's l2: 0.0559453
[11]	valid_0's l1: 0.193503	valid_0's l2: 0.0526016
[12]	valid_0's l1: 0.187519	valid_0's l2: 0.0496168
[13]	valid_0's l1: 0.181837	valid_0's l2: 0.0468264
[14]	valid_0's l1: 0.176483	valid_0's l2: 0.044302
[15]	valid_0's l1: 0.171885	valid_0's l2: 0.0422899
[16]	valid_0's l1: 0.167348	valid_0's l2: 0.0403468
[17]	valid_0's l1: 0.163089	valid_0's l2: 0.0385858
[18]	valid_0's l1: 0.159218	valid_0's l2: 0.03703

Starting training...
[1]	valid_0's l1: 0.286456	valid_0's l2: 0.112864
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.27821	valid_0's l2: 0.106555
[3]	valid_0's l1: 0.271292	valid_0's l2: 0.101491
[4]	valid_0's l1: 0.265564	valid_0's l2: 0.0974065
[5]	valid_0's l1: 0.260955	valid_0's l2: 0.0942011
[6]	valid_0's l1: 0.256499	valid_0's l2: 0.0913524
[7]	valid_0's l1: 0.253219	valid_0's l2: 0.0894192
[8]	valid_0's l1: 0.250239	valid_0's l2: 0.087907
[9]	valid_0's l1: 0.247597	valid_0's l2: 0.0868757
[10]	valid_0's l1: 0.245143	valid_0's l2: 0.0862237
[11]	valid_0's l1: 0.242977	valid_0's l2: 0.0857874
[12]	valid_0's l1: 0.240693	valid_0's l2: 0.085384
[13]	valid_0's l1: 0.239166	valid_0's l2: 0.0857084
[14]	valid_0's l1: 0.237841	valid_0's l2: 0.0863646
[15]	valid_0's l1: 0.236823	valid_0's l2: 0.0872521
[16]	valid_0's l1: 0.236068	valid_0's l2: 0.0884058
[17]	valid_0's l1: 0.235002	valid_0's l2: 0.0892528
[18]	valid_0's l1: 0.234272	valid_0's l2: 0.09033

Starting training...
[1]	valid_0's l1: 0.273941	valid_0's l2: 0.101831
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.263662	valid_0's l2: 0.0945538
[3]	valid_0's l1: 0.254726	valid_0's l2: 0.088533
[4]	valid_0's l1: 0.246928	valid_0's l2: 0.0835239
[5]	valid_0's l1: 0.240052	valid_0's l2: 0.0792437
[6]	valid_0's l1: 0.233552	valid_0's l2: 0.0752471
[7]	valid_0's l1: 0.227505	valid_0's l2: 0.0716614
[8]	valid_0's l1: 0.223143	valid_0's l2: 0.0691763
[9]	valid_0's l1: 0.218188	valid_0's l2: 0.0665136
[10]	valid_0's l1: 0.213805	valid_0's l2: 0.0642877
[11]	valid_0's l1: 0.209855	valid_0's l2: 0.0624407
[12]	valid_0's l1: 0.206305	valid_0's l2: 0.0609559
[13]	valid_0's l1: 0.203076	valid_0's l2: 0.0596966
[14]	valid_0's l1: 0.200039	valid_0's l2: 0.0586393
[15]	valid_0's l1: 0.19755	valid_0's l2: 0.0578463
[16]	valid_0's l1: 0.195369	valid_0's l2: 0.0572276
[17]	valid_0's l1: 0.193324	valid_0's l2: 0.0567586
[18]	valid_0's l1: 0.19158	valid_0's l2: 0.056

Starting training...
[1]	valid_0's l1: 0.263514	valid_0's l2: 0.0938624
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.250934	valid_0's l2: 0.0851496
[3]	valid_0's l1: 0.239062	valid_0's l2: 0.0772854
[4]	valid_0's l1: 0.227776	valid_0's l2: 0.0701867
[5]	valid_0's l1: 0.217209	valid_0's l2: 0.0638335
[6]	valid_0's l1: 0.207223	valid_0's l2: 0.0580765
[7]	valid_0's l1: 0.197795	valid_0's l2: 0.0528953
[8]	valid_0's l1: 0.18881	valid_0's l2: 0.0482259
[9]	valid_0's l1: 0.180389	valid_0's l2: 0.0440176
[10]	valid_0's l1: 0.172586	valid_0's l2: 0.0402947
[11]	valid_0's l1: 0.165197	valid_0's l2: 0.0369492
[12]	valid_0's l1: 0.158277	valid_0's l2: 0.0339428
[13]	valid_0's l1: 0.151732	valid_0's l2: 0.0312259
[14]	valid_0's l1: 0.145538	valid_0's l2: 0.0287703
[15]	valid_0's l1: 0.139602	valid_0's l2: 0.0265104
[16]	valid_0's l1: 0.13409	valid_0's l2: 0.0245161
[17]	valid_0's l1: 0.128784	valid_0's l2: 0.0226932
[18]	valid_0's l1: 0.123931	valid_0's l2: 0.0

Starting training...
[1]	valid_0's l1: 0.266376	valid_0's l2: 0.0960921
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.257052	valid_0's l2: 0.0896502
[3]	valid_0's l1: 0.249402	valid_0's l2: 0.0844811
[4]	valid_0's l1: 0.243069	valid_0's l2: 0.0802362
[5]	valid_0's l1: 0.237738	valid_0's l2: 0.076786
[6]	valid_0's l1: 0.232679	valid_0's l2: 0.0736859
[7]	valid_0's l1: 0.228958	valid_0's l2: 0.071568
[8]	valid_0's l1: 0.225615	valid_0's l2: 0.0698964
[9]	valid_0's l1: 0.22316	valid_0's l2: 0.069034
[10]	valid_0's l1: 0.22068	valid_0's l2: 0.0682072
[11]	valid_0's l1: 0.218561	valid_0's l2: 0.0677982
[12]	valid_0's l1: 0.216288	valid_0's l2: 0.0672888
[13]	valid_0's l1: 0.214888	valid_0's l2: 0.067462
[14]	valid_0's l1: 0.213533	valid_0's l2: 0.0678073
[15]	valid_0's l1: 0.212383	valid_0's l2: 0.0683252
[16]	valid_0's l1: 0.211686	valid_0's l2: 0.0691729
[17]	valid_0's l1: 0.210426	valid_0's l2: 0.069549
[18]	valid_0's l1: 0.209409	valid_0's l2: 0.070156
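
The per-district loop in the cell above filters the prediction file once per `addrcode`. A minimal sketch of the same idea with a single `groupby` pass follows. It assumes a frame with the `addrcode`, `actual` and `predicted` columns used above; the helper name `per_group_errors` and the demo values are made up for illustration, and only RMSE and MAE are shown.

```python
import numpy as np
import pandas as pd


def per_group_errors(df, key='addrcode'):
    """Return RMSE and MAE of 'predicted' against 'actual' for each value of `key`."""
    err = df['predicted'] - df['actual']
    # Keep the squared and absolute errors next to the grouping key.
    grouped = pd.DataFrame({key: df[key], 'sq': err ** 2, 'abs': err.abs()}).groupby(key)
    return pd.DataFrame({
        'rmse': np.sqrt(grouped['sq'].mean()),
        'mae': grouped['abs'].mean(),
    })


# Made-up district codes and values, for illustration only.
demo = pd.DataFrame({
    'addrcode': [8001, 8001, 8002, 8002],
    'actual': [1.0, 3.0, 2.0, 2.0],
    'predicted': [2.0, 3.0, 2.0, 4.0],
})
result = per_group_errors(demo)
print(result)
```

The result is indexed by `addrcode`, so it could be passed to `to_csv` directly instead of growing `dist_array` one row at a time.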

For original DF_0 (without smoothing, normal CD)
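
The cell below always predicts `DF_1` and varies the horizon only through the lag window `iloc[:, (5 + i):12]`: for loop index `i`, the `i` most recent lag columns are dropped. The sketch below lists which lag columns remain for each `i`. The position-to-name mapping is taken from the comments in the cell; `LAG_COLUMNS` and `lag_window` are hypothetical names used only here.

```python
# Column positions and names as listed in the comments of the cell below.
LAG_COLUMNS = {5: 'DF_0', 6: 'DF_wm1', 7: 'DF_wm2', 8: 'DF_wm3',
               9: 'DF_wm4', 10: 'DF_wm5', 11: 'DF_wm6'}


def lag_window(i):
    """Names of the lag columns selected by iloc[:, (5 + i):12] for loop index i."""
    return [LAG_COLUMNS[pos] for pos in range(5 + i, 12)]


for i in range(6):
    # The output files for this index are labelled DF_(i + 1).
    print('DF_' + str(i + 1), lag_window(i))
```

`RF_wm6` and `LST_wm4` (columns 20 and 21) are appended to every window, and the CD columns 22 to 30 are added only in the "with CD" variant.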

In [121]:
# Arrays of all evaluation values
# row: head,
# RMSE (DF_1 - DF_6), 
# MAE (DF_1 - DF_6), 
# SMAPE (DF_1 - DF_6), 
# R-squared (DF_1 - DF_6)

# col: head,
# DF_0 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'Without CD', 'With CD', '% improved']])

rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

train_file_dir = 'Data/' + province1 + '/Modified Lags/train_' + province2 + '_dist_cd_mavg2.csv'
test_file_dir = 'Data/' + province1 + '/Modified Lags/test_' + province2 + '_dist_cd_mavg2.csv'

df_train_dist =  pd.read_csv(train_file_dir, header = 0, skiprows = 0)
df_test_dist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

# From DF_1 to DF_6
for i in range(6):
    # Allocate the column of addrcode, week, year and actual values first
    df_test_addrcode_week_year_dist = df_test_dist.iloc[:,[1, 2, 3, 4]]
    
    ## Without CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 5],
    # DF_wm1 [col 6], 
    # DF_wm2 [col 7],
    # DF_wm3 [col 8],
    # DF_wm4 [col 9], 
    # DF_wm5 [col 10],
    # DF_wm6 [col 11],
    # RF_wm6 [col 20],
    # and LST_wm4 [col 21]
        
    ## With CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 5],
    # DF_wm1 [col 6], 
    # DF_wm2 [col 7],
    # DF_wm3 [col 8],
    # DF_wm4 [col 9], 
    # DF_wm5 [col 10],
    # DF_wm6 [col 11],
    # RF_wm6 [col 20],
    # LST_wm4 [col 21],
    # bin_pop9s [col 22],
    # bowl_pop9s [col 23],
    # bucket_pop9s [col 24],
    # misc_short_pop9s [col 25],
    # jar_pop9s [col 26],
    # pottedplant_pop9s [col 27],
    # tire_pop9s [col 28],
    # misc_tall_pop9s [col 29],
    # and total_pop9s [col 30]
    
    df_train_dist_DFinfo = df_train_dist.iloc[:, (5 + i):12]
    df_train_dist_withoutCD = df_train_dist.iloc[:, [20, 21]]
    df_train_dist_withCD = df_train_dist.iloc[:, 20: 31]
    
    df_test_dist_DFinfo = df_test_dist.iloc[:, (5 + i):12]
    df_test_dist_withoutCD = df_test_dist.iloc[:, [20, 21]]
    df_test_dist_withCD = df_test_dist.iloc[:, 20: 31]
        
    x_train_withoutCD = pd.concat([df_train_dist_DFinfo, df_train_dist_withoutCD], axis = 1)
    x_train_withCD = pd.concat([df_train_dist_DFinfo, df_train_dist_withCD], axis = 1)
    
    x_test_withoutCD = pd.concat([df_test_dist_DFinfo, df_test_dist_withoutCD], axis = 1)
    x_test_withCD = pd.concat([df_test_dist_DFinfo, df_test_dist_withCD], axis = 1)
    
    # y: response (target) variable DF_1 (col 4)
    y_train = df_train_dist.iloc[:, [4]]
    y_test = df_test_dist.iloc[:, [4]]
    
    # Pass the response values to the array for evaluation calculation
    y_train_true = np.array(df_train_dist['DF_1'])
    y_test_true = np.array(df_test_dist['DF_1'])
    
    # Pass the dataset of both independent and response variables to Light GBM
    lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
    lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
    lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
    lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

    params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2', 'l1'},
        'num_leaves': num_leaves,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
    }

    # Train the model
    print('Starting training...')
    gbm_withoutCD = lgb.train(params,
                lgb_train_withoutCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withoutCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withoutCD.save_model('model.txt')
    
    gbm_withCD = lgb.train(params,
                lgb_train_withCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withCD.save_model('model.txt')
    
    # Predict on the test data, using the best iteration found during training
    print('Starting predicting...')
    y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
    y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

    df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
    df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
    # Store all of the predicted values to the CSV files
    df_compare_addrcode_dist_withoutCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withoutCD], axis = 1)
    df_compare_addrcode_dist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_dist_withoutCD.to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' 
                                              + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' 
                                              + str(i + 1) + '_withoutCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')

    df_compare_addrcode_dist_withCD = pd.concat([df_test_addrcode_week_year_dist, df_y_pred_withCD], axis = 1)
    df_compare_addrcode_dist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_dist_withCD.to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' 
                                           + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' 
                                           + str(i + 1) + '_withCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')
    
    # Evaluation
    rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
    mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
    r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
    smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
    #print('RMSE of the prediction without CD is:', rmse_withoutCD)
    #print('MAE of the prediction without CD is:', mae_withoutCD)
    #print('R-squared of the prediction without CD is:', r2_withoutCD)
    #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
    rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
    mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
    r2_withCD = r2_score(y_test_true, y_pred_withCD)
    smape_withCD = smape_fast(y_test_true, y_pred_withCD)
    #print('RMSE of the prediction with CD is:', rmse_withCD)
    #print('MAE of the prediction with CD is:', mae_withCD)
    #print('R-squared of the prediction with CD is:', r2_withCD)
    #print('SMAPE of the prediction with CD is:', smape_withCD)
        
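    # Relative change (without CD - with CD) / without CD, stored as a fraction rather than a percentage.
    # For RMSE, MAE and SMAPE a positive value means CD reduced the error. R-squared is higher-is-better,
    # so a positive r2 value here means R-squared dropped when CD was added.
    rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
    mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
    smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
    r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD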
    #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    #print(eval_array)
        
    rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
    smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
    r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
    #df_withoutCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withoutCD.csv', header = 0)
    #df_withCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withCD.csv', header = 0)
    
    df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                               + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) + '_withoutCD_' 
                               + str(num_leaves) + '.csv', header = 0)
    df_withCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                            + '/Original DF_0/LGBM_' + province2 + '_dist_original_DF_' + str(i + 1) + '_withCD_' 
                            + str(num_leaves) + '.csv', header = 0)
        
    dist_code = df_train_dist['addrcode'].unique()
    
    # For each district
    for j in dist_code:
            
        # Get the subset of actual and predicted values according to the district code
        subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == j]
        subset_withCD = df_withCD.loc[df_withCD['addrcode'] == j]
            
        # Pass the response values to the array for evaluation calculation
        array_true = np.array(subset_withoutCD['actual'])
        array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
        array_pred_withCD = np.array(subset_withCD['predicted'])
            
        # Calculate the evaluation values
        rmse_withoutCD_dist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
        mae_withoutCD_dist = mean_absolute_error(array_true, array_pred_withoutCD)
        smape_withoutCD_dist = smape_fast(array_true, array_pred_withoutCD)
        r2_withoutCD_dist = r2_score(array_true, array_pred_withoutCD)
            
        rmse_withCD_dist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
        mae_withCD_dist = mean_absolute_error(array_true, array_pred_withCD)
        smape_withCD_dist = smape_fast(array_true, array_pred_withCD)
        r2_withCD_dist = r2_score(array_true, array_pred_withCD)
            
        rmse_percent_improved_dist = (rmse_withoutCD_dist - rmse_withCD_dist) / rmse_withoutCD_dist
        mae_percent_improved_dist = (mae_withoutCD_dist - mae_withCD_dist) / mae_withoutCD_dist
        smape_percent_improved_dist = (smape_withoutCD_dist - smape_withCD_dist) / smape_withoutCD_dist
        r2_percent_improved_dist = (r2_withoutCD_dist - r2_withCD_dist) / r2_withoutCD_dist
            
        # Append
        dist_array = np.append(dist_array, [[j, rmse_withoutCD_dist, rmse_withCD_dist, rmse_percent_improved_dist,
                                            mae_withoutCD_dist, mae_withCD_dist, mae_percent_improved_dist,
                                            smape_withoutCD_dist, smape_withCD_dist, smape_percent_improved_dist,
                                            r2_withoutCD_dist, r2_withCD_dist, r2_percent_improved_dist]], axis = 0)

    #print(dist_array)
    pd.DataFrame(dist_array).to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                    + '/Original DF_0/LGBM_' + province2 + '_ByDistrict_Original_DF_' + str(i + 1) 
                                    + '_eval_' + str(num_leaves) + '.csv', header = False, encoding = 'utf-8')
        
    # Reset dist_array to its header row before the next iteration
    dist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                              'MAE without CD', 'MAE with CD', '% improved MAE', 
                              'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                              'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DF_1 to R squared DF_6
eval_array = evaluation_print_modified_lag_original(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_modified_lag_original(eval_array, 'MAE', mae)
eval_array = evaluation_print_modified_lag_original(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_modified_lag_original(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/Original DF_0/LGBM_' + province2 + '_dist_original_eval_' + str(num_leaves) 
                                + '.csv', header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.31428	valid_0's l2: 0.139405
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.305192	valid_0's l2: 0.132154
[3]	valid_0's l1: 0.296936	valid_0's l2: 0.125953
[4]	valid_0's l1: 0.289451	valid_0's l2: 0.120464
[5]	valid_0's l1: 0.28261	valid_0's l2: 0.115405
[6]	valid_0's l1: 0.27608	valid_0's l2: 0.11089
[7]	valid_0's l1: 0.270015	valid_0's l2: 0.106922
[8]	valid_0's l1: 0.264622	valid_0's l2: 0.10356
[9]	valid_0's l1: 0.259604	valid_0's l2: 0.100467
[10]	valid_0's l1: 0.255125	valid_0's l2: 0.0978501
[11]	valid_0's l1: 0.250901	valid_0's l2: 0.0954956
[12]	valid_0's l1: 0.247003	valid_0's l2: 0.0933007
[13]	valid_0's l1: 0.243492	valid_0's l2: 0.0914733
[14]	valid_0's l1: 0.240292	valid_0's l2: 0.0900979
[15]	valid_0's l1: 0.23739	valid_0's l2: 0.088925
[16]	valid_0's l1: 0.234363	valid_0's l2: 0.0876695
[17]	valid_0's l1: 0.231615	valid_0's l2: 0.0866116
[18]	valid_0's l1: 0.22914	valid_0's l2: 0.0857226
[19]	val

Starting training...
[1]	valid_0's l1: 0.315479	valid_0's l2: 0.140282
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.308749	valid_0's l2: 0.134484
[3]	valid_0's l1: 0.302685	valid_0's l2: 0.129557
[4]	valid_0's l1: 0.297787	valid_0's l2: 0.125835
[5]	valid_0's l1: 0.293558	valid_0's l2: 0.12276
[6]	valid_0's l1: 0.289579	valid_0's l2: 0.119919
[7]	valid_0's l1: 0.286596	valid_0's l2: 0.118055
[8]	valid_0's l1: 0.284009	valid_0's l2: 0.116676
[9]	valid_0's l1: 0.281736	valid_0's l2: 0.115773
[10]	valid_0's l1: 0.27965	valid_0's l2: 0.115259
[11]	valid_0's l1: 0.278097	valid_0's l2: 0.114976
[12]	valid_0's l1: 0.276321	valid_0's l2: 0.114741
[13]	valid_0's l1: 0.275249	valid_0's l2: 0.115117
[14]	valid_0's l1: 0.274344	valid_0's l2: 0.11579
[15]	valid_0's l1: 0.273604	valid_0's l2: 0.116738
[16]	valid_0's l1: 0.273224	valid_0's l2: 0.118049
[17]	valid_0's l1: 0.272328	valid_0's l2: 0.119011
[18]	valid_0's l1: 0.271665	valid_0's l2: 0.120289
Early stoppi
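
In the log above, `l2` reaches its minimum at round 12 and rises for the next six rounds, after which the truncated early-stopping message appears, even though `l1` is still falling. The sketch below replays a patience rule on the `l2` values copied from that log. It is a simplified single-metric model written for illustration (the function name `early_stop_round` is hypothetical), not LightGBM's own implementation. As far as I understand, when several metrics are tracked LightGBM stops as soon as any one of them stalls unless `first_metric_only` is set, which would be consistent with this log.

```python
def early_stop_round(scores, patience):
    """Replay a patience rule on lower-is-better scores.

    Returns (stop_round, best_round), both 1-based. stop_round is None
    when the scores run out before the patience is exhausted.
    """
    best_round, best_score = 0, float('inf')
    for round_no, score in enumerate(scores, start=1):
        if score < best_score:
            best_round, best_score = round_no, score
        elif round_no - best_round >= patience:
            return round_no, best_round
    return None, best_round


# l2 values for rounds 1 to 18, copied from the log above.
l2_log = [0.140282, 0.134484, 0.129557, 0.125835, 0.12276, 0.119919,
          0.118055, 0.116676, 0.115773, 0.115259, 0.114976, 0.114741,
          0.115117, 0.11579, 0.116738, 0.118049, 0.119011, 0.120289]

stop_round, best_round = early_stop_round(l2_log, patience=6)
print(stop_round, best_round)
```

With `early_stopping_rounds = 6` this gives a stop at round 18 and a best round of 12, matching where the log ends.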

<h1>Sub-district level</h1>
For MAs (adjusted CD)
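
The cells in this notebook compute every "% improved" value as `(without CD - with CD) / without CD`. That reads naturally for error metrics, but for R-squared a larger value is better, so the sign has to be interpreted the other way round. A minimal sketch of a direction-aware variant follows. The helper `relative_improvement`, its `higher_is_better` flag, the NaN return for a zero baseline and the absolute value in the denominator are all assumptions made for illustration, not what the notebook does.

```python
def relative_improvement(without_cd, with_cd, higher_is_better=False):
    """Fractional improvement of with_cd over without_cd.

    Positive always means "better with CD". Returns NaN when the
    baseline is zero, where the ratio is undefined.
    """
    if without_cd == 0:
        return float('nan')
    change = (without_cd - with_cd) / abs(without_cd)
    return -change if higher_is_better else change


# Made-up numbers: an error metric that falls and an R-squared that rises.
rmse_gain = relative_improvement(0.5, 0.4)
r2_gain = relative_improvement(0.5, 0.6, higher_is_better=True)
print(rmse_gain, r2_gain)
```

Both demo values come out at roughly 0.2, so a positive number means "better with CD" for either kind of metric.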

In [122]:
# Arrays of all evaluation values
# row: head,
# RMSE (DFma_1), 
# MAE (DFma_1), 
# SMAPE (DFma_1), 
# R-squared (DFma_1)

# col: head,
# MA2 (without CD, with CD, % improved),
# MA3 (without CD, with CD, % improved),
# MA4 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'MA2 without CD', 'MA2 with CD', 'MA2 % improved', 
                         'MA3 without CD', 'MA3 with CD', 'MA3 % improved', 
                         'MA4 without CD', 'MA4 with CD', 'MA4 % improved']])
rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

# Starting from MA2 to MA4
for i in range(2, 5):
    # Get the input variables from CSV file
    # Change files directory here
    train_file_dir = 'Data/' + province1 + '/Modified Lags/train_' + province2 + '_subdist_total_mavg' + str(i) + '.csv'
    test_file_dir = 'Data/' + province1 + '/Modified Lags/test_' + province2 + '_subdist_total_mavg' + str(i) + '.csv'
    
    df_train_subdist =  pd.read_csv(train_file_dir, header = 0, skiprows = 0)
    df_test_subdist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

    # Continue on DFma_1 to DFma_6
    for j in range(6):
        # Allocate the column of addrcode, week, year and actual values first
        df_test_addrcode_week_year_subdist = df_test_subdist.iloc[:, [1, 2, 3, 12]]
        
        ## Without CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 13],
        # DFma_wm1 [col 14],
        # DFma_wm2 [col 15],
        # DFma_wm3 [col 16],
        # DFma_wm4 [col 17],
        # DFma_wm5 [col 18],
        # DFma_wm6 [col 19],
        # RF_wm6 [col 20],
        # and LST_wm4 [col 21]
        
        ## With CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 13],
        # DFma_wm1 [col 14],
        # DFma_wm2 [col 15],
        # DFma_wm3 [col 16],
        # DFma_wm4 [col 17],
        # DFma_wm5 [col 18],
        # DFma_wm6 [col 19],
        # RF_wm6 [col 20],
        # LST_wm4 [col 21]
        # bin [col 22],
        # bowl [col 23],
        # bucket [col 24],
        # misc_short [col 25],
        # jar [col 26],
        # pottedplant [col 27],
        # tire [col 28],
        # misc_tall [col 29],
        # and total [col 30]
        
        x_train_withoutCD = df_train_subdist.iloc[:, (13 + j): 22]
        x_train_withCD = df_train_subdist.iloc[:, (13 + j): 31]
        
        x_test_withoutCD = df_test_subdist.iloc[:, (13 + j): 22]
        x_test_withCD = df_test_subdist.iloc[:, (13 + j): 31]
        
        # y: response (target) variable DFma_1 [col 12]
        y_train = df_train_subdist.iloc[:, [12]]
        y_test = df_test_subdist.iloc[:, [12]]
        
        # Pass the response values to the array for evaluation calculation
        y_train_true = np.array(df_train_subdist['DFma_1'])
        y_test_true = np.array(df_test_subdist['DFma_1'])
        
        # Pass the dataset of both independent and response variables to Light GBM
        lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
        lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
        lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
        lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': {'l2', 'l1'},
            'num_leaves': num_leaves,
            'learning_rate': 0.05,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.8,
            'bagging_freq': 5,
            'verbose': 0
        }

        # Train the model
        print('Starting training...')
        gbm_withoutCD = lgb.train(params,
                    lgb_train_withoutCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withoutCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withoutCD.save_model('model.txt')
        
        gbm_withCD = lgb.train(params,
                    lgb_train_withCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withCD.save_model('model.txt')

        # Predict on the test data, using the best iteration found during training
        print('Starting predicting...')
        y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
        y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

        df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
        df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
        # Store all of the predicted values to the CSV files
        df_compare_addrcode_subdist_withoutCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withoutCD], axis = 1)
        df_compare_addrcode_subdist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_subdist_withoutCD.to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' 
                                                  + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_subdist_MA' 
                                                  + str(i) + '_horizon_' + str(j + 1) + '_withoutCD_' + str(num_leaves) 
                                                  + '.csv', encoding = 'utf-8')

        df_compare_addrcode_subdist_withCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withCD], axis = 1)
        df_compare_addrcode_subdist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_subdist_withCD.to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' 
                                               + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_subdist_MA' 
                                               + str(i) + '_horizon_' + str(j + 1) + '_withCD_' + str(num_leaves) + '.csv', 
                                               encoding = 'utf-8')
        
        # Evaluation
        rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
        mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
        r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
        smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
        #print('RMSE of the prediction without CD is:', rmse_withoutCD)
        #print('MAE of the prediction without CD is:', mae_withoutCD)
        #print('R-squared of the prediction without CD is:', r2_withoutCD)
        #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
        rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
        mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
        r2_withCD = r2_score(y_test_true, y_pred_withCD)
        smape_withCD = smape_fast(y_test_true, y_pred_withCD)
        #print('RMSE of the prediction with CD is:', rmse_withCD)
        #print('MAE of the prediction with CD is:', mae_withCD)
        #print('R-squared of the prediction with CD is:', r2_withCD)
        #print('SMAPE of the prediction with CD is:', smape_withCD)
        
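        # Relative change (without CD - with CD) / without CD, stored as a fraction rather than a percentage.
        # For RMSE, MAE and SMAPE a positive value means CD reduced the error. R-squared is higher-is-better,
        # so a positive r2 value here means R-squared dropped when CD was added.
        rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
        mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
        smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
        r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD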
        #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        #print(eval_array)
        
        rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
        smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
        r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
        df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                   + '/MA' + str(i) + '/LGBM_' + province2 + '_subdist_MA' + str(i) + '_horizon_' + str(j + 1) 
                                   + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
        df_withCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/MA' + str(i) + '/LGBM_' + province2 + '_subdist_MA' + str(i) + '_horizon_' + str(j + 1) 
                                + '_withCD_' + str(num_leaves) + '.csv', header = 0)
        
        subdist_code = df_train_subdist['addrcode'].unique()
        
        # For each subdistrict
        for k in subdist_code:
            
            # Get the actual and predicted values for this subdistrict code (addrcode)
            subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == k]
            subset_withCD = df_withCD.loc[df_withCD['addrcode'] == k]
            
            # Extract the actual and predicted values as NumPy arrays for the metrics
            array_true = np.array(subset_withoutCD['actual'])
            array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
            array_pred_withCD = np.array(subset_withCD['predicted'])
            
            # Calculate the evaluation values
            rmse_withoutCD_subdist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
            mae_withoutCD_subdist = mean_absolute_error(array_true, array_pred_withoutCD)
            smape_withoutCD_subdist = smape_fast(array_true, array_pred_withoutCD)
            r2_withoutCD_subdist = r2_score(array_true, array_pred_withoutCD)
            
            rmse_withCD_subdist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
            mae_withCD_subdist = mean_absolute_error(array_true, array_pred_withCD)
            smape_withCD_subdist = smape_fast(array_true, array_pred_withCD)
            r2_withCD_subdist = r2_score(array_true, array_pred_withCD)
            
            rmse_percent_improved_subdist = (rmse_withoutCD_subdist - rmse_withCD_subdist) / rmse_withoutCD_subdist
            mae_percent_improved_subdist = (mae_withoutCD_subdist - mae_withCD_subdist) / mae_withoutCD_subdist
            smape_percent_improved_subdist = (smape_withoutCD_subdist - smape_withCD_subdist) / smape_withoutCD_subdist
            r2_percent_improved_subdist = (r2_withoutCD_subdist - r2_withCD_subdist) / r2_withoutCD_subdist
            
            # Append
            subdist_array = np.append(subdist_array, [[k, rmse_withoutCD_subdist, rmse_withCD_subdist, rmse_percent_improved_subdist,
                                                mae_withoutCD_subdist, mae_withCD_subdist, mae_percent_improved_subdist,
                                                smape_withoutCD_subdist, smape_withCD_subdist, smape_percent_improved_subdist,
                                                r2_withoutCD_subdist, r2_withCD_subdist, r2_percent_improved_subdist]], axis = 0)

        #print(subdist_array)
        pd.DataFrame(subdist_array).to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                        + '/MA' + str(i) + '/LGBM_' + province2 + '_BySubDistrict_MA' + str(i) + '_horizon_' 
                                        + str(j + 1) + '_eval_' + str(num_leaves) + '.csv', header = False, 
                                        encoding = 'utf-8')
        
        # Reset subdist_array to just its header row before the next horizon
        subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DFma_1 to R squared DFma_6
eval_array = evaluation_print_modified_lag(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_modified_lag(eval_array, 'MAE', mae)
eval_array = evaluation_print_modified_lag(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_modified_lag(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/LGBM_' + province2 + '_subdist_eval_' + str(num_leaves) + '.csv', header = False, 
                                encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.37839	valid_0's l2: 0.240277
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.365864	valid_0's l2: 0.225832
[3]	valid_0's l1: 0.354928	valid_0's l2: 0.21403
[4]	valid_0's l1: 0.344881	valid_0's l2: 0.203476
[5]	valid_0's l1: 0.33397	valid_0's l2: 0.192215
[6]	valid_0's l1: 0.323954	valid_0's l2: 0.182453
[7]	valid_0's l1: 0.31425	valid_0's l2: 0.1726
[8]	valid_0's l1: 0.305548	valid_0's l2: 0.164498
[9]	valid_0's l1: 0.298172	valid_0's l2: 0.15829
[10]	valid_0's l1: 0.290282	valid_0's l2: 0.151115
[11]	valid_0's l1: 0.282991	valid_0's l2: 0.14505
[12]	valid_0's l1: 0.276325	valid_0's l2: 0.139541
[13]	valid_0's l1: 0.269913	valid_0's l2: 0.134164
[14]	valid_0's l1: 0.264186	valid_0's l2: 0.129643
[15]	valid_0's l1: 0.259404	valid_0's l2: 0.126366
[16]	valid_0's l1: 0.254246	valid_0's l2: 0.122412
[17]	valid_0's l1: 0.249552	valid_0's l2: 0.119163
[18]	valid_0's l1: 0.245002	valid_0's l2: 0.116025
[19]	valid_0's l1

Starting training...
[1]	valid_0's l1: 0.383376	valid_0's l2: 0.247554
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.376883	valid_0's l2: 0.241456
[3]	valid_0's l1: 0.370628	valid_0's l2: 0.235684
[4]	valid_0's l1: 0.365095	valid_0's l2: 0.23097
[5]	valid_0's l1: 0.360115	valid_0's l2: 0.226845
[6]	valid_0's l1: 0.355393	valid_0's l2: 0.22218
[7]	valid_0's l1: 0.351144	valid_0's l2: 0.218737
[8]	valid_0's l1: 0.347536	valid_0's l2: 0.216266
[9]	valid_0's l1: 0.344163	valid_0's l2: 0.213998
[10]	valid_0's l1: 0.340966	valid_0's l2: 0.211918
[11]	valid_0's l1: 0.337973	valid_0's l2: 0.209741
[12]	valid_0's l1: 0.335263	valid_0's l2: 0.207375
[13]	valid_0's l1: 0.332799	valid_0's l2: 0.205851
[14]	valid_0's l1: 0.330517	valid_0's l2: 0.204575
[15]	valid_0's l1: 0.328691	valid_0's l2: 0.203688
[16]	valid_0's l1: 0.326835	valid_0's l2: 0.203041
[17]	valid_0's l1: 0.325279	valid_0's l2: 0.202421
[18]	valid_0's l1: 0.323872	valid_0's l2: 0.201908
[19]	valid_

Starting training...
[1]	valid_0's l1: 0.351494	valid_0's l2: 0.203499
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.343667	valid_0's l2: 0.196299
[3]	valid_0's l1: 0.336486	valid_0's l2: 0.189713
[4]	valid_0's l1: 0.330067	valid_0's l2: 0.183995
[5]	valid_0's l1: 0.32407	valid_0's l2: 0.179171
[6]	valid_0's l1: 0.318435	valid_0's l2: 0.174659
[7]	valid_0's l1: 0.312542	valid_0's l2: 0.169346
[8]	valid_0's l1: 0.307799	valid_0's l2: 0.166035
[9]	valid_0's l1: 0.302903	valid_0's l2: 0.161702
[10]	valid_0's l1: 0.298741	valid_0's l2: 0.1588
[11]	valid_0's l1: 0.294989	valid_0's l2: 0.156255
[12]	valid_0's l1: 0.291688	valid_0's l2: 0.154116
[13]	valid_0's l1: 0.288487	valid_0's l2: 0.152218
[14]	valid_0's l1: 0.285636	valid_0's l2: 0.150651
[15]	valid_0's l1: 0.282922	valid_0's l2: 0.149113
[16]	valid_0's l1: 0.280311	valid_0's l2: 0.14766
[17]	valid_0's l1: 0.277884	valid_0's l2: 0.146323
[18]	valid_0's l1: 0.27561	valid_0's l2: 0.145219
[19]	valid_0's

Starting training...
[1]	valid_0's l1: 0.327153	valid_0's l2: 0.172603
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.312918	valid_0's l2: 0.15817
[3]	valid_0's l1: 0.29938	valid_0's l2: 0.14506
[4]	valid_0's l1: 0.286764	valid_0's l2: 0.133307
[5]	valid_0's l1: 0.274892	valid_0's l2: 0.122802
[6]	valid_0's l1: 0.263557	valid_0's l2: 0.113228
[7]	valid_0's l1: 0.252829	valid_0's l2: 0.104484
[8]	valid_0's l1: 0.243006	valid_0's l2: 0.096959
[9]	valid_0's l1: 0.23363	valid_0's l2: 0.0900656
[10]	valid_0's l1: 0.224702	valid_0's l2: 0.083571
[11]	valid_0's l1: 0.21644	valid_0's l2: 0.0779209
[12]	valid_0's l1: 0.208626	valid_0's l2: 0.0726958
[13]	valid_0's l1: 0.201245	valid_0's l2: 0.067979
[14]	valid_0's l1: 0.194496	valid_0's l2: 0.0638774
[15]	valid_0's l1: 0.188056	valid_0's l2: 0.0601118
[16]	valid_0's l1: 0.181939	valid_0's l2: 0.0566932
[17]	valid_0's l1: 0.176368	valid_0's l2: 0.0537329
[18]	valid_0's l1: 0.171232	valid_0's l2: 0.0511234
[19]	v

Starting training...
[1]	valid_0's l1: 0.334155	valid_0's l2: 0.181215
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.326583	valid_0's l2: 0.17426
[3]	valid_0's l1: 0.319686	valid_0's l2: 0.168424
[4]	valid_0's l1: 0.313272	valid_0's l2: 0.163224
[5]	valid_0's l1: 0.307278	valid_0's l2: 0.158577
[6]	valid_0's l1: 0.301653	valid_0's l2: 0.153875
[7]	valid_0's l1: 0.296909	valid_0's l2: 0.150536
[8]	valid_0's l1: 0.292521	valid_0's l2: 0.147661
[9]	valid_0's l1: 0.288715	valid_0's l2: 0.145501
[10]	valid_0's l1: 0.285247	valid_0's l2: 0.143461
[11]	valid_0's l1: 0.282055	valid_0's l2: 0.141512
[12]	valid_0's l1: 0.279035	valid_0's l2: 0.139583
[13]	valid_0's l1: 0.276466	valid_0's l2: 0.138272
[14]	valid_0's l1: 0.274074	valid_0's l2: 0.137225
[15]	valid_0's l1: 0.271953	valid_0's l2: 0.136185
[16]	valid_0's l1: 0.269958	valid_0's l2: 0.135321
[17]	valid_0's l1: 0.268157	valid_0's l2: 0.134457
[18]	valid_0's l1: 0.266638	valid_0's l2: 0.133816
[19]	valid
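
The cell above computes every "% improved" value as (without CD - with CD) / without CD, which reads naturally for error metrics but inverts the meaning for R-squared. A minimal sketch of a sign-aware variant follows. The helper name `relative_improvement`, its `higher_is_better` flag, the use of the absolute baseline in the denominator, and the NaN return for a zero baseline are all assumptions for illustration, not part of the notebook.

```python
import math


def relative_improvement(without_cd, with_cd, higher_is_better=False):
    """Fractional change from the without-CD score to the with-CD score.

    A positive result means the with-CD model scored better.
    Returns NaN when the baseline is zero, where the ratio is undefined.
    """
    if without_cd == 0:
        return math.nan
    change = (without_cd - with_cd) / abs(without_cd)
    # For error metrics (RMSE, MAE, SMAPE) a drop is an improvement;
    # for R-squared a rise is, so the sign is flipped.
    return -change if higher_is_better else change


# Made-up scores, only to show the sign convention
print(relative_improvement(0.50, 0.40))        # error fell: positive result
print(relative_improvement(0.60, 0.75, True))  # R-squared rose: positive result
```

With this convention a positive number always means that adding CD helped, so the four metric columns can be compared directly.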

For original DF_0 (without smoothing, adjusted CD)
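
In the cell that follows, the forecast horizon is controlled only by which lag columns are kept: for loop index `i`, the slice `(5 + i):12` drops the `i` most recent DF lag columns, and the weather and CD columns are appended unchanged. A minimal sketch of that selection is shown here. The helper `feature_columns` and its parameter names are hypothetical; the column positions are the ones listed in the cell's comments.

```python
def feature_columns(i, with_cd, first_lag=5, lag_end=12, weather=(20, 21), cd_end=31):
    """Return the positional column indices used as features.

    i: loop index 0..5; the i most recent lag columns are dropped.
    with_cd: when True, also include the CD columns 22..30.
    """
    lags = list(range(first_lag + i, lag_end))
    extra = list(range(weather[0], cd_end)) if with_cd else list(weather)
    return lags + extra


print(feature_columns(0, with_cd=False))
print(feature_columns(5, with_cd=True))
```

With these indices, `df_train_subdist.iloc[:, feature_columns(i, with_cd=True)]` should select the same columns as the `pd.concat` of the two slices used in the cell.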

In [123]:
# Arrays of all evaluation values
# row: head,
# RMSE (DF_1 - DF_6), 
# MAE (DF_1 - DF_6), 
# SMAPE (DF_1 - DF_6), 
# R-squared (DF_1 - DF_6)

# col: head,
# DF_0 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'Without CD', 'With CD', '% improved']])

rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

train_file_dir = 'Data/' + province1 + '/Modified Lags/train_' + province2 + '_subdist_total_mavg2.csv'
test_file_dir = 'Data/' + province1 + '/Modified Lags/test_' + province2 + '_subdist_total_mavg2.csv'

df_train_subdist = pd.read_csv(train_file_dir, header = 0, skiprows = 0)
df_test_subdist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

# From DF_1 to DF_6
for i in range(6):
    # Keep the addrcode, week, year and actual-value columns for the output table
    df_test_addrcode_week_year_subdist = df_test_subdist.iloc[:,[1, 2, 3, 4]]
    
    ## Without CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 5],
    # DF_wm1 [col 6], 
    # DF_wm2 [col 7],
    # DF_wm3 [col 8],
    # DF_wm4 [col 9], 
    # DF_wm5 [col 10],
    # DF_wm6 [col 11],
    # RF_wm6 [col 20],
    # and LST_wm4 [col 21]
        
    ## With CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 5],
    # DF_wm1 [col 6], 
    # DF_wm2 [col 7],
    # DF_wm3 [col 8],
    # DF_wm4 [col 9], 
    # DF_wm5 [col 10],
    # DF_wm6 [col 11],
    # RF_wm6 [col 20],
    # LST_wm4 [col 21],
    # bin_pop9s [col 22],
    # bowl_pop9s [col 23],
    # bucket_pop9s [col 24],
    # misc_short_pop9s [col 25],
    # jar_pop9s [col 26],
    # pottedplant_pop9s [col 27],
    # tire_pop9s [col 28],
    # misc_tall_pop9s [col 29],
    # and total_pop9s [col 30]
    
    df_train_subdist_DFinfo = df_train_subdist.iloc[:, (5 + i):12]
    df_train_subdist_withoutCD = df_train_subdist.iloc[:, [20, 21]]
    df_train_subdist_withCD = df_train_subdist.iloc[:, 20: 31]
    
    df_test_subdist_DFinfo = df_test_subdist.iloc[:, (5 + i):12]
    df_test_subdist_withoutCD = df_test_subdist.iloc[:, [20, 21]]
    df_test_subdist_withCD = df_test_subdist.iloc[:, 20: 31]
        
    x_train_withoutCD = pd.concat([df_train_subdist_DFinfo, df_train_subdist_withoutCD], axis = 1)
    x_train_withCD = pd.concat([df_train_subdist_DFinfo, df_train_subdist_withCD], axis = 1)
    
    x_test_withoutCD = pd.concat([df_test_subdist_DFinfo, df_test_subdist_withoutCD], axis = 1)
    x_test_withCD = pd.concat([df_test_subdist_DFinfo, df_test_subdist_withCD], axis = 1)
    
    # y: response (target) variable DF_1 (col 4)
    y_train = df_train_subdist.iloc[:, [4]]
    y_test = df_test_subdist.iloc[:, [4]]
    
    # Keep the true target values as NumPy arrays for the evaluation metrics
    y_train_true = np.array(df_train_subdist['DF_1'])
    y_test_true = np.array(df_test_subdist['DF_1'])
    
    # Wrap the features and the target in LightGBM Dataset objects.
    # Note: the test split is also used as the validation set for early stopping.
    lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
    lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
    lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
    lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

    params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': num_leaves,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
    }

    # Train the model
    print('Starting training...')
    gbm_withoutCD = lgb.train(params,
                lgb_train_withoutCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withoutCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withoutCD.save_model('model.txt')
    
    gbm_withCD = lgb.train(params,
                lgb_train_withCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withCD.save_model('model.txt')
    
    # Predict on the test data
    print('Starting predicting...')
    y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
    y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

    df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
    df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
    # Store all of the predicted values to the CSV files
    df_compare_addrcode_subdist_withoutCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withoutCD], axis = 1)
    df_compare_addrcode_subdist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_subdist_withoutCD.to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' 
                                              + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' 
                                              + str(i + 1) + '_withoutCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')

    df_compare_addrcode_subdist_withCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withCD], axis = 1)
    df_compare_addrcode_subdist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_subdist_withCD.to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' 
                                           + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' 
                                           + str(i + 1) + '_withCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')
    
    # Evaluation
    rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
    mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
    r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
    smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
    #print('RMSE of the prediction without CD is:', rmse_withoutCD)
    #print('MAE of the prediction without CD is:', mae_withoutCD)
    #print('R-squared of the prediction without CD is:', r2_withoutCD)
    #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
    rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
    mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
    r2_withCD = r2_score(y_test_true, y_pred_withCD)
    smape_withCD = smape_fast(y_test_true, y_pred_withCD)
    #print('RMSE of the prediction with CD is:', rmse_withCD)
    #print('MAE of the prediction with CD is:', mae_withCD)
    #print('R-squared of the prediction with CD is:', r2_withCD)
    #print('SMAPE of the prediction with CD is:', smape_withCD)
        
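    # Relative change (a fraction, not a percentage) from without CD to with CD.
    # For RMSE, MAE and SMAPE a positive value means CD helped. R-squared is
    # higher-is-better, so (for a positive baseline) a positive value means it dropped.
    rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
    mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
    smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
    r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD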
    #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    #print(eval_array)
        
    rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
    smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
    r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
    #df_withoutCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withoutCD.csv', header = 0)
    #df_withCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withCD.csv', header = 0)
    
    df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                               + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' + str(i + 1) + '_withoutCD_' 
                               + str(num_leaves) + '.csv', header = 0)
    df_withCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                            + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' + str(i + 1) + '_withCD_' 
                            + str(num_leaves) + '.csv', header = 0)
        
    subdist_code = df_train_subdist['addrcode'].unique()
    
    # For each subdistrict
    for j in subdist_code:
            
        # Get the actual and predicted values for this subdistrict code (addrcode)
        subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == j]
        subset_withCD = df_withCD.loc[df_withCD['addrcode'] == j]
            
        # Extract the actual and predicted values as NumPy arrays for the metrics
        array_true = np.array(subset_withoutCD['actual'])
        array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
        array_pred_withCD = np.array(subset_withCD['predicted'])
            
        # Calculate the evaluation values
        rmse_withoutCD_subdist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
        mae_withoutCD_subdist = mean_absolute_error(array_true, array_pred_withoutCD)
        smape_withoutCD_subdist = smape_fast(array_true, array_pred_withoutCD)
        r2_withoutCD_subdist = r2_score(array_true, array_pred_withoutCD)
            
        rmse_withCD_subdist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
        mae_withCD_subdist = mean_absolute_error(array_true, array_pred_withCD)
        smape_withCD_subdist = smape_fast(array_true, array_pred_withCD)
        r2_withCD_subdist = r2_score(array_true, array_pred_withCD)
            
        rmse_percent_improved_subdist = (rmse_withoutCD_subdist - rmse_withCD_subdist) / rmse_withoutCD_subdist
        mae_percent_improved_subdist = (mae_withoutCD_subdist - mae_withCD_subdist) / mae_withoutCD_subdist
        smape_percent_improved_subdist = (smape_withoutCD_subdist - smape_withCD_subdist) / smape_withoutCD_subdist
        r2_percent_improved_subdist = (r2_withoutCD_subdist - r2_withCD_subdist) / r2_withoutCD_subdist
            
        # Append
        subdist_array = np.append(subdist_array, [[j, rmse_withoutCD_subdist, rmse_withCD_subdist, rmse_percent_improved_subdist,
                                            mae_withoutCD_subdist, mae_withCD_subdist, mae_percent_improved_subdist,
                                            smape_withoutCD_subdist, smape_withCD_subdist, smape_percent_improved_subdist,
                                            r2_withoutCD_subdist, r2_withCD_subdist, r2_percent_improved_subdist]], axis = 0)

    #print(subdist_array)
    pd.DataFrame(subdist_array).to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                    + '/Original DF_0/LGBM_' + province2 + '_BySubDistrict_Original_DF_' + str(i + 1) 
                                    + '_eval_' + str(num_leaves) + '.csv', header = False, encoding = 'utf-8')
        
    # Reset subdist_array to just its header row before the next horizon
    subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                              'MAE without CD', 'MAE with CD', '% improved MAE', 
                              'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                              'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DF_1 to R squared DF_6
eval_array = evaluation_print_modified_lag_original(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_modified_lag_original(eval_array, 'MAE', mae)
eval_array = evaluation_print_modified_lag_original(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_modified_lag_original(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Modified Lags/Adjusted CD/num_leaves = ' + str(num_leaves) 
                                + '/Original DF_0/LGBM_' + province2 + '_subdist_original_eval_' + str(num_leaves) 
                                + '.csv', header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.449625	valid_0's l2: 0.367072
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.442645	valid_0's l2: 0.359194
[3]	valid_0's l1: 0.436086	valid_0's l2: 0.352044
[4]	valid_0's l1: 0.430176	valid_0's l2: 0.345791
[5]	valid_0's l1: 0.424605	valid_0's l2: 0.34018
[6]	valid_0's l1: 0.419093	valid_0's l2: 0.335408
[7]	valid_0's l1: 0.41378	valid_0's l2: 0.329977
[8]	valid_0's l1: 0.408962	valid_0's l2: 0.325856
[9]	valid_0's l1: 0.404733	valid_0's l2: 0.3226
[10]	valid_0's l1: 0.400382	valid_0's l2: 0.318656
[11]	valid_0's l1: 0.396557	valid_0's l2: 0.315551
[12]	valid_0's l1: 0.392996	valid_0's l2: 0.313066
[13]	valid_0's l1: 0.38963	valid_0's l2: 0.310503
[14]	valid_0's l1: 0.386476	valid_0's l2: 0.308327
[15]	valid_0's l1: 0.383571	valid_0's l2: 0.306385
[16]	valid_0's l1: 0.380883	valid_0's l2: 0.304371
[17]	valid_0's l1: 0.378166	valid_0's l2: 0.302887
[18]	valid_0's l1: 0.375897	valid_0's l2: 0.30183
[19]	valid_0's 

Starting training...
[1]	valid_0's l1: 0.45101	valid_0's l2: 0.369932
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.445455	valid_0's l2: 0.364613
[3]	valid_0's l1: 0.440183	valid_0's l2: 0.360025
[4]	valid_0's l1: 0.435265	valid_0's l2: 0.35576
[5]	valid_0's l1: 0.430862	valid_0's l2: 0.352298
[6]	valid_0's l1: 0.426884	valid_0's l2: 0.347885
[7]	valid_0's l1: 0.422948	valid_0's l2: 0.344929
[8]	valid_0's l1: 0.419197	valid_0's l2: 0.342197
[9]	valid_0's l1: 0.416078	valid_0's l2: 0.340121
[10]	valid_0's l1: 0.412905	valid_0's l2: 0.338285
[11]	valid_0's l1: 0.410211	valid_0's l2: 0.336813
[12]	valid_0's l1: 0.407508	valid_0's l2: 0.334721
[13]	valid_0's l1: 0.40511	valid_0's l2: 0.333409
[14]	valid_0's l1: 0.402822	valid_0's l2: 0.332436
[15]	valid_0's l1: 0.400737	valid_0's l2: 0.331206
[16]	valid_0's l1: 0.398657	valid_0's l2: 0.330421
[17]	valid_0's l1: 0.397095	valid_0's l2: 0.329786
[18]	valid_0's l1: 0.395739	valid_0's l2: 0.329351
[19]	valid_0
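
The per-subdistrict loop in the cell above filters the prediction table once per `addrcode` and grows `subdist_array` with `np.append`. A hedged alternative sketch, assuming the same `addrcode`, `actual` and `predicted` columns, computes RMSE and MAE per subdistrict with a single `groupby`. The helper name `per_subdistrict_errors` is hypothetical and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd


def per_subdistrict_errors(df):
    """Return one row per addrcode with the RMSE and MAE of predicted vs actual."""
    err = df['predicted'] - df['actual']
    grouped = pd.DataFrame({
        'addrcode': df['addrcode'],
        'sq_err': err ** 2,
        'abs_err': err.abs(),
    }).groupby('addrcode')
    return pd.DataFrame({
        'RMSE': np.sqrt(grouped['sq_err'].mean()),
        'MAE': grouped['abs_err'].mean(),
    })


# Toy data (invented, not taken from the notebook's files)
toy = pd.DataFrame({
    'addrcode': [100101, 100101, 100102, 100102],
    'actual': [1.0, 3.0, 2.0, 2.0],
    'predicted': [2.0, 3.0, 2.0, 4.0],
})
print(per_subdistrict_errors(toy))
```

The result is indexed by `addrcode`, so the without-CD and with-CD tables can be joined on that index before writing one CSV per horizon.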

For MAs (normal CD)
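
The cell that follows builds every output path by string concatenation, and the only difference from the adjusted-CD run is one directory name. A small sketch of a path helper is given here, assuming the directory layout used in the notebook. The function `prediction_path` and the example argument values are hypothetical.

```python
def prediction_path(province1, province2, cd_mode, num_leaves, ma, horizon, variant):
    """Build the CSV path for one MA window, horizon and model variant.

    cd_mode: 'Adjusted CD' or 'Normal CD'
    variant: 'withoutCD' or 'withCD'
    """
    folder = f'LGBM/{province1}/Modified Lags/{cd_mode}/num_leaves = {num_leaves}/MA{ma}'
    name = f'LGBM_{province2}_subdist_MA{ma}_horizon_{horizon}_{variant}_{num_leaves}.csv'
    return f'{folder}/{name}'


# Placeholder values; the notebook defines province1, province2 and num_leaves elsewhere.
print(prediction_path('ProvinceA', 'provinceA', 'Normal CD', 31, ma=2, horizon=1, variant='withCD'))
```

Using one helper for both the `to_csv` and the later `read_csv` call keeps the written and re-read paths from drifting apart.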

In [124]:
# Arrays of all evaluation values
# row: head,
# RMSE (DFma_1), 
# MAE (DFma_1), 
# SMAPE (DFma_1), 
# R-squared (DFma_1)

# col: head,
# MA2 (without CD, with CD, % improved),
# MA3 (without CD, with CD, % improved),
# MA4 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'MA2 without CD', 'MA2 with CD', 'MA2 % improved', 
                         'MA3 without CD', 'MA3 with CD', 'MA3 % improved', 
                         'MA4 without CD', 'MA4 with CD', 'MA4 % improved']])
rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

# Starting from MA2 to MA4
for i in range(2, 5):
    # Get the input variables from CSV file
    # Change files directory here
    train_file_dir = 'Data/' + province1 + '/Modified Lags/train_' + province2 + '_subdist_cd_mavg' + str(i) + '.csv'
    test_file_dir = 'Data/' + province1 + '/Modified Lags/test_' + province2 + '_subdist_cd_mavg' + str(i) + '.csv'
    
    df_train_subdist = pd.read_csv(train_file_dir, header = 0, skiprows = 0)
    df_test_subdist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

    # Continue on DFma_1 to DFma_6
    for j in range(6):
        # Keep the addrcode, week, year and actual-value columns for the output table
        df_test_addrcode_week_year_subdist = df_test_subdist.iloc[:, [1, 2, 3, 12]]
        
        ## Without CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 13],
        # DFma_wm1 [col 14],
        # DFma_wm2 [col 15],
        # DFma_wm3 [col 16],
        # DFma_wm4 [col 17],
        # DFma_wm5 [col 18],
        # DFma_wm6 [col 19],
        # RF_wm6 [col 20],
        # and LST_wm4 [col 21]
        
        ## With CD ##
    
        # Import the dataset
        # x: independent variables
        # DFma_0 [col 13],
        # DFma_wm1 [col 14],
        # DFma_wm2 [col 15],
        # DFma_wm3 [col 16],
        # DFma_wm4 [col 17],
        # DFma_wm5 [col 18],
        # DFma_wm6 [col 19],
        # RF_wm6 [col 20],
        # LST_wm4 [col 21]
        # bin [col 22],
        # bowl [col 23],
        # bucket [col 24],
        # misc_short [col 25],
        # jar [col 26],
        # pottedplant [col 27],
        # tire [col 28],
        # misc_tall [col 29],
        # and total [col 30]
        
        x_train_withoutCD = df_train_subdist.iloc[:, (13 + j): 22]
        x_train_withCD = df_train_subdist.iloc[:, (13 + j): 31]
        
        x_test_withoutCD = df_test_subdist.iloc[:, (13 + j): 22]
        x_test_withCD = df_test_subdist.iloc[:, (13 + j): 31]
        
        # y: response (target) variable DFma_1 [col 12]
        y_train = df_train_subdist.iloc[:, [12]]
        y_test = df_test_subdist.iloc[:, [12]]
        
        # Keep the true target values as NumPy arrays for the evaluation metrics
        y_train_true = np.array(df_train_subdist['DFma_1'])
        y_test_true = np.array(df_test_subdist['DFma_1'])
        
        # Wrap the features and the target in LightGBM Dataset objects.
        # Note: the test split is also used as the validation set for early stopping.
        lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
        lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
        lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
        lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

        params = {
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2', 'l1'},
        'num_leaves': num_leaves,
        'learning_rate': 0.05,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': 0
        }

        # Train the model
        print('Starting training...')
        gbm_withoutCD = lgb.train(params,
                    lgb_train_withoutCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withoutCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withoutCD.save_model('model.txt')
        
        gbm_withCD = lgb.train(params,
                    lgb_train_withCD,
                    num_boost_round = 20,
                    valid_sets = lgb_eval_withCD,
                    early_stopping_rounds = 6)
        #print('Saving model...')
        # Save model to file
        #gbm_withCD.save_model('model.txt')

        # Predict on the test data
        print('Starting predicting...')
        y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
        y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

        df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
        df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
        # Store all of the predicted values to the CSV files
        df_compare_addrcode_subdist_withoutCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withoutCD], axis = 1)
        df_compare_addrcode_subdist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_subdist_withoutCD.to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' 
                                                  + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_subdist_MA' 
                                                  + str(i) + '_horizon_' + str(j + 1) + '_withoutCD_' + str(num_leaves) 
                                                  + '.csv', encoding = 'utf-8')

        df_compare_addrcode_subdist_withCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withCD], axis = 1)
        df_compare_addrcode_subdist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
        df_compare_addrcode_subdist_withCD.to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' 
                                               + str(num_leaves) + '/MA' + str(i) + '/LGBM_' + province2 + '_subdist_MA' 
                                               + str(i) + '_horizon_' + str(j + 1) + '_withCD_' + str(num_leaves) + '.csv', 
                                               encoding = 'utf-8')
        
        # Evaluation
        rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
        mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
        r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
        smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
        #print('RMSE of the prediction without CD is:', rmse_withoutCD)
        #print('MAE of the prediction without CD is:', mae_withoutCD)
        #print('R-squared of the prediction without CD is:', r2_withoutCD)
        #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
        rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
        mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
        r2_withCD = r2_score(y_test_true, y_pred_withCD)
        smape_withCD = smape_fast(y_test_true, y_pred_withCD)
        #print('RMSE of the prediction with CD is:', rmse_withCD)
        #print('MAE of the prediction with CD is:', mae_withCD)
        #print('R-squared of the prediction with CD is:', r2_withCD)
        #print('SMAPE of the prediction with CD is:', smape_withCD)
        
        rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
        mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
        smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
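        # Note: R-squared is higher-is-better, so with a positive baseline a positive value here means R-squared dropped with CD
        r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD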
        #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        #print(eval_array)
        
        rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
        mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
        smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
        r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
        df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                   + '/MA' + str(i) + '/LGBM_' + province2 + '_subdist_MA' + str(i) + '_horizon_' + str(j + 1) 
                                   + '_withoutCD_' + str(num_leaves) + '.csv', header = 0)
        df_withCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/MA' + str(i) + '/LGBM_' + province2 + '_subdist_MA' + str(i) + '_horizon_' + str(j + 1) 
                                + '_withCD_' + str(num_leaves) + '.csv', header = 0)
        
        subdist_code = df_train_subdist['addrcode'].unique()
        
        # For each sub-district
        for k in subdist_code:
            
            # Get the subset of actual and predicted values according to the sub-district code
            subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == k]
            subset_withCD = df_withCD.loc[df_withCD['addrcode'] == k]
            
            # Pass the response values to the array for evaluation calculation
            array_true = np.array(subset_withoutCD['actual'])
            array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
            array_pred_withCD = np.array(subset_withCD['predicted'])
            
            # Calculate the evaluation values
            rmse_withoutCD_subdist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
            mae_withoutCD_subdist = mean_absolute_error(array_true, array_pred_withoutCD)
            smape_withoutCD_subdist = smape_fast(array_true, array_pred_withoutCD)
            r2_withoutCD_subdist = r2_score(array_true, array_pred_withoutCD)
            
            rmse_withCD_subdist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
            mae_withCD_subdist = mean_absolute_error(array_true, array_pred_withCD)
            smape_withCD_subdist = smape_fast(array_true, array_pred_withCD)
            r2_withCD_subdist = r2_score(array_true, array_pred_withCD)
            
            rmse_percent_improved_subdist = (rmse_withoutCD_subdist - rmse_withCD_subdist) / rmse_withoutCD_subdist
            mae_percent_improved_subdist = (mae_withoutCD_subdist - mae_withCD_subdist) / mae_withoutCD_subdist
            smape_percent_improved_subdist = (smape_withoutCD_subdist - smape_withCD_subdist) / smape_withoutCD_subdist
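            # Note: R-squared is higher-is-better, so with a positive baseline a positive value here means R-squared dropped with CD
            r2_percent_improved_subdist = (r2_withoutCD_subdist - r2_withCD_subdist) / r2_withoutCD_subdist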
            
            # Append
            subdist_array = np.append(subdist_array, [[k, rmse_withoutCD_subdist, rmse_withCD_subdist, rmse_percent_improved_subdist,
                                                mae_withoutCD_subdist, mae_withCD_subdist, mae_percent_improved_subdist,
                                                smape_withoutCD_subdist, smape_withCD_subdist, smape_percent_improved_subdist,
                                                r2_withoutCD_subdist, r2_withCD_subdist, r2_percent_improved_subdist]], axis = 0)

        #print(subdist_array)
        pd.DataFrame(subdist_array).to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                        + '/MA' + str(i) + '/LGBM_' + province2 + '_BySubDistrict_MA' + str(i) + '_horizon_' 
                                        + str(j + 1) + '_eval_' + str(num_leaves) + '.csv', header = False, 
                                        encoding = 'utf-8')
        
        # Reset the per-sub-district table to its header row before the next iteration
        subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DFma_1 to R squared DFma_6
eval_array = evaluation_print_modified_lag(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_modified_lag(eval_array, 'MAE', mae)
eval_array = evaluation_print_modified_lag(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_modified_lag(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/LGBM_' + province2 + '_subdist_eval_' + str(num_leaves) + '.csv', header = False, 
                                encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.37839	valid_0's l2: 0.240277
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.365864	valid_0's l2: 0.225832
[3]	valid_0's l1: 0.354928	valid_0's l2: 0.21403
[4]	valid_0's l1: 0.344881	valid_0's l2: 0.203476
[5]	valid_0's l1: 0.33397	valid_0's l2: 0.192215
[6]	valid_0's l1: 0.323954	valid_0's l2: 0.182453
[7]	valid_0's l1: 0.31425	valid_0's l2: 0.1726
[8]	valid_0's l1: 0.305548	valid_0's l2: 0.164498
[9]	valid_0's l1: 0.298172	valid_0's l2: 0.15829
[10]	valid_0's l1: 0.290282	valid_0's l2: 0.151115
[11]	valid_0's l1: 0.282991	valid_0's l2: 0.14505
[12]	valid_0's l1: 0.276325	valid_0's l2: 0.139541
[13]	valid_0's l1: 0.269913	valid_0's l2: 0.134164
[14]	valid_0's l1: 0.264186	valid_0's l2: 0.129643
[15]	valid_0's l1: 0.259404	valid_0's l2: 0.126366
[16]	valid_0's l1: 0.254246	valid_0's l2: 0.122412
[17]	valid_0's l1: 0.249552	valid_0's l2: 0.119163
[18]	valid_0's l1: 0.245002	valid_0's l2: 0.116025
[19]	valid_0's l1

Starting training...
[1]	valid_0's l1: 0.383376	valid_0's l2: 0.247554
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.376883	valid_0's l2: 0.241456
[3]	valid_0's l1: 0.370628	valid_0's l2: 0.235684
[4]	valid_0's l1: 0.365095	valid_0's l2: 0.23097
[5]	valid_0's l1: 0.360115	valid_0's l2: 0.226845
[6]	valid_0's l1: 0.355393	valid_0's l2: 0.22218
[7]	valid_0's l1: 0.351144	valid_0's l2: 0.218737
[8]	valid_0's l1: 0.347536	valid_0's l2: 0.216266
[9]	valid_0's l1: 0.344163	valid_0's l2: 0.213998
[10]	valid_0's l1: 0.340966	valid_0's l2: 0.211918
[11]	valid_0's l1: 0.337973	valid_0's l2: 0.209741
[12]	valid_0's l1: 0.335263	valid_0's l2: 0.207375
[13]	valid_0's l1: 0.332799	valid_0's l2: 0.205851
[14]	valid_0's l1: 0.330517	valid_0's l2: 0.204575
[15]	valid_0's l1: 0.328691	valid_0's l2: 0.203688
[16]	valid_0's l1: 0.326835	valid_0's l2: 0.203041
[17]	valid_0's l1: 0.325279	valid_0's l2: 0.202421
[18]	valid_0's l1: 0.323872	valid_0's l2: 0.201908
[19]	valid_

Starting training...
[1]	valid_0's l1: 0.351494	valid_0's l2: 0.203499
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.343667	valid_0's l2: 0.196299
[3]	valid_0's l1: 0.336486	valid_0's l2: 0.189713
[4]	valid_0's l1: 0.330067	valid_0's l2: 0.183995
[5]	valid_0's l1: 0.32407	valid_0's l2: 0.179171
[6]	valid_0's l1: 0.318435	valid_0's l2: 0.174659
[7]	valid_0's l1: 0.312542	valid_0's l2: 0.169346
[8]	valid_0's l1: 0.307799	valid_0's l2: 0.166035
[9]	valid_0's l1: 0.302903	valid_0's l2: 0.161702
[10]	valid_0's l1: 0.298741	valid_0's l2: 0.1588
[11]	valid_0's l1: 0.294989	valid_0's l2: 0.156255
[12]	valid_0's l1: 0.291688	valid_0's l2: 0.154116
[13]	valid_0's l1: 0.288487	valid_0's l2: 0.152218
[14]	valid_0's l1: 0.285636	valid_0's l2: 0.150651
[15]	valid_0's l1: 0.282922	valid_0's l2: 0.149113
[16]	valid_0's l1: 0.280311	valid_0's l2: 0.14766
[17]	valid_0's l1: 0.277884	valid_0's l2: 0.146323
[18]	valid_0's l1: 0.27561	valid_0's l2: 0.145219
[19]	valid_0's

Starting training...
[1]	valid_0's l1: 0.327153	valid_0's l2: 0.172603
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.312918	valid_0's l2: 0.15817
[3]	valid_0's l1: 0.29938	valid_0's l2: 0.14506
[4]	valid_0's l1: 0.286764	valid_0's l2: 0.133307
[5]	valid_0's l1: 0.274892	valid_0's l2: 0.122802
[6]	valid_0's l1: 0.263557	valid_0's l2: 0.113228
[7]	valid_0's l1: 0.252829	valid_0's l2: 0.104484
[8]	valid_0's l1: 0.243006	valid_0's l2: 0.096959
[9]	valid_0's l1: 0.23363	valid_0's l2: 0.0900656
[10]	valid_0's l1: 0.224702	valid_0's l2: 0.083571
[11]	valid_0's l1: 0.21644	valid_0's l2: 0.0779209
[12]	valid_0's l1: 0.208626	valid_0's l2: 0.0726958
[13]	valid_0's l1: 0.201245	valid_0's l2: 0.067979
[14]	valid_0's l1: 0.194496	valid_0's l2: 0.0638774
[15]	valid_0's l1: 0.188056	valid_0's l2: 0.0601118
[16]	valid_0's l1: 0.181939	valid_0's l2: 0.0566932
[17]	valid_0's l1: 0.176368	valid_0's l2: 0.0537329
[18]	valid_0's l1: 0.171232	valid_0's l2: 0.0511234
[19]	v

Starting training...
[1]	valid_0's l1: 0.334155	valid_0's l2: 0.181215
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.326583	valid_0's l2: 0.17426
[3]	valid_0's l1: 0.319686	valid_0's l2: 0.168424
[4]	valid_0's l1: 0.313272	valid_0's l2: 0.163224
[5]	valid_0's l1: 0.307278	valid_0's l2: 0.158577
[6]	valid_0's l1: 0.301653	valid_0's l2: 0.153875
[7]	valid_0's l1: 0.296909	valid_0's l2: 0.150536
[8]	valid_0's l1: 0.292521	valid_0's l2: 0.147661
[9]	valid_0's l1: 0.288715	valid_0's l2: 0.145501
[10]	valid_0's l1: 0.285247	valid_0's l2: 0.143461
[11]	valid_0's l1: 0.282055	valid_0's l2: 0.141512
[12]	valid_0's l1: 0.279035	valid_0's l2: 0.139583
[13]	valid_0's l1: 0.276466	valid_0's l2: 0.138272
[14]	valid_0's l1: 0.274074	valid_0's l2: 0.137225
[15]	valid_0's l1: 0.271953	valid_0's l2: 0.136185
[16]	valid_0's l1: 0.269958	valid_0's l2: 0.135321
[17]	valid_0's l1: 0.268157	valid_0's l2: 0.134457
[18]	valid_0's l1: 0.266638	valid_0's l2: 0.133816
[19]	valid
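
The per-sub-district loop in the cell above filters the prediction files one `addrcode` at a time and recomputes each metric by hand. As a rough sketch only, and not the notebook's own method, the same RMSE and MAE figures could be obtained with a pandas `groupby`. The helper names `rmse_mae_by_group` and `relative_improvement` and the toy data are hypothetical. The `higher_is_better` flag is an assumption added so that a positive value always means the with-CD model did better, including for R-squared.

```python
import numpy as np
import pandas as pd


def relative_improvement(without_cd, with_cd, higher_is_better=False):
    """Fractional change from the without-CD score to the with-CD score.

    A positive result means the with-CD model scored better.
    Returns NaN when the baseline is zero, because the ratio is undefined.
    """
    if without_cd == 0:
        return float('nan')
    if higher_is_better:
        change = with_cd - without_cd
    else:
        change = without_cd - with_cd
    return change / abs(without_cd)


def rmse_mae_by_group(df, key='addrcode'):
    """Per-group RMSE and MAE for a frame with `key`, 'actual' and 'predicted' columns."""
    err = df['predicted'] - df['actual']
    return pd.DataFrame({
        'rmse': (err ** 2).groupby(df[key]).mean() ** 0.5,
        'mae': err.abs().groupby(df[key]).mean(),
    })


# Toy data standing in for one of the prediction CSV files (hypothetical values)
demo = pd.DataFrame({
    'addrcode': [1, 1, 2, 2],
    'actual': [1.0, 3.0, 2.0, 2.0],
    'predicted': [2.0, 3.0, 2.0, 4.0],
})
per_subdist = rmse_mae_by_group(demo)
print(per_subdist)
```

Running `rmse_mae_by_group` once on the without-CD frame and once on the with-CD frame, then combining the two results with `relative_improvement`, would give the same kind of table as `subdist_array` while keeping the values numeric instead of strings.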

For original DF_0 (without smoothing, normal CD)
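
The cell that follows builds its feature matrices by positional slicing: the DF lag columns start at column `5 + i`, so each later iteration drops the most recent lag, while the weather and CD columns stay fixed. The sketch below illustrates that selection on a one-row dummy frame. The function name `select_features` and the placeholder column names `c0` to `c30` are hypothetical; only the column positions are taken from the comments in the cell.

```python
import pandas as pd


def select_features(df, horizon_index, with_cd):
    """Pick the DF lag columns for one iteration plus the fixed extra columns.

    Lag columns run from position 5 + horizon_index up to 11.
    Columns 20-21 hold RF_wm6 and LST_wm4; columns 22-30 hold the CD features.
    """
    lag_cols = df.iloc[:, (5 + horizon_index):12]
    extra_end = 31 if with_cd else 22
    extra_cols = df.iloc[:, 20:extra_end]
    return pd.concat([lag_cols, extra_cols], axis=1)


# Dummy frame with 31 columns so the positions match the layout described above
frame = pd.DataFrame([list(range(31))], columns=['c' + str(n) for n in range(31)])

x_first = select_features(frame, 0, with_cd=False)  # 7 lag columns + 2 weather columns
x_last = select_features(frame, 5, with_cd=True)    # 2 lag columns + 11 weather and CD columns
print(list(x_first.columns))
print(list(x_last.columns))
```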

In [125]:
# Arrays of all evaluation values
# Rows: a header row, then
# RMSE (DF_1 - DF_6),
# MAE (DF_1 - DF_6),
# SMAPE (DF_1 - DF_6),
# R-squared (DF_1 - DF_6)

# Columns: a label column, then
# DF_0 (without CD, with CD, % improved)

eval_array = np.asarray([['Evaluation', 'Without CD', 'With CD', '% improved']])

rmse = np.zeros(1)
mae = np.zeros(1)
smape = np.zeros(1)
r2 = np.zeros(1)

subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                         'MAE without CD', 'MAE with CD', '% improved MAE', 
                         'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                         'R squared without CD', 'R squared with CD', '% improved R squared']])

train_file_dir = 'Data/' + province1 + '/Modified Lags/train_' + province2 + '_subdist_cd_mavg2.csv'
test_file_dir = 'Data/' + province1 + '/Modified Lags/test_' + province2 + '_subdist_cd_mavg2.csv'

df_train_subdist = pd.read_csv(train_file_dir, header = 0, skiprows = 0)
df_test_subdist = pd.read_csv(test_file_dir, header = 0, skiprows = 0)

# From DF_1 to DF_6
for i in range(6):
    # Allocate the column of addrcode, week, year and actual values first
    df_test_addrcode_week_year_subdist = df_test_subdist.iloc[:,[1, 2, 3, 4]]
    
    ## Without CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 5],
    # DF_wm1 [col 6], 
    # DF_wm2 [col 7],
    # DF_wm3 [col 8],
    # DF_wm4 [col 9], 
    # DF_wm5 [col 10],
    # DF_wm6 [col 11],
    # RF_wm6 [col 20],
    # and LST_wm4 [col 21]
        
    ## With CD ##
    
    # Import the dataset
    # x: independent variables
    # DF_0 [col 5],
    # DF_wm1 [col 6], 
    # DF_wm2 [col 7],
    # DF_wm3 [col 8],
    # DF_wm4 [col 9], 
    # DF_wm5 [col 10],
    # DF_wm6 [col 11],
    # RF_wm6 [col 20],
    # LST_wm4 [col 21],
    # bin_pop9s [col 22],
    # bowl_pop9s [col 23],
    # bucket_pop9s [col 24],
    # misc_short_pop9s [col 25],
    # jar_pop9s [col 26],
    # pottedplant_pop9s [col 27],
    # tire_pop9s [col 28],
    # misc_tall_pop9s [col 29],
    # and total_pop9s [col 30]
    
    # DF lag columns start at col 5 + i, so each later iteration drops the most recent lag
    df_train_subdist_DFinfo = df_train_subdist.iloc[:, (5 + i):12]
    df_train_subdist_withoutCD = df_train_subdist.iloc[:, [20, 21]]
    df_train_subdist_withCD = df_train_subdist.iloc[:, 20: 31]
    
    df_test_subdist_DFinfo = df_test_subdist.iloc[:, (5 + i):12]
    df_test_subdist_withoutCD = df_test_subdist.iloc[:, [20, 21]]
    df_test_subdist_withCD = df_test_subdist.iloc[:, 20: 31]
        
    x_train_withoutCD = pd.concat([df_train_subdist_DFinfo, df_train_subdist_withoutCD], axis = 1)
    x_train_withCD = pd.concat([df_train_subdist_DFinfo, df_train_subdist_withCD], axis = 1)
    
    x_test_withoutCD = pd.concat([df_test_subdist_DFinfo, df_test_subdist_withoutCD], axis = 1)
    x_test_withCD = pd.concat([df_test_subdist_DFinfo, df_test_subdist_withCD], axis = 1)
    
    # y: response (target) variable DF_1 (col 4)
    y_train = df_train_subdist.iloc[:, [4]]
    y_test = df_test_subdist.iloc[:, [4]]
    
    # Pass the response values to the array for evaluation calculation
    y_train_true = np.array(df_train_subdist['DF_1'])
    y_test_true = np.array(df_test_subdist['DF_1'])
    
    # Pass the dataset of both independent and response variables to Light GBM
    lgb_train_withoutCD = lgb.Dataset(x_train_withoutCD, y_train)
    lgb_eval_withoutCD = lgb.Dataset(x_test_withoutCD, y_test, reference = lgb_train_withoutCD)
        
    lgb_train_withCD = lgb.Dataset(x_train_withCD, y_train)
    lgb_eval_withCD = lgb.Dataset(x_test_withCD, y_test, reference = lgb_train_withCD)

    params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': num_leaves,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
    }

    # Train the model
    print('Starting training...')
    gbm_withoutCD = lgb.train(params,
                lgb_train_withoutCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withoutCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withoutCD.save_model('model.txt')
    
    gbm_withCD = lgb.train(params,
                lgb_train_withCD,
                num_boost_round = 20,
                valid_sets = lgb_eval_withCD,
                early_stopping_rounds = 6)
    #print('Saving model...')
    # Save model to file
    #gbm_withCD.save_model('model.txt')
    
    # Predict on the test data
    print('Starting predicting...')
    y_pred_withoutCD = gbm_withoutCD.predict(x_test_withoutCD, num_iteration = gbm_withoutCD.best_iteration)
    y_pred_withCD = gbm_withCD.predict(x_test_withCD, num_iteration = gbm_withCD.best_iteration)

    df_y_pred_withoutCD = pd.DataFrame(y_pred_withoutCD, columns = ['predicted'])
    df_y_pred_withCD = pd.DataFrame(y_pred_withCD, columns = ['predicted'])
        
    # Store all of the predicted values to the CSV files
    df_compare_addrcode_subdist_withoutCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withoutCD], axis = 1)
    df_compare_addrcode_subdist_withoutCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_subdist_withoutCD.to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' 
                                              + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' 
                                              + str(i + 1) + '_withoutCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')

    df_compare_addrcode_subdist_withCD = pd.concat([df_test_addrcode_week_year_subdist, df_y_pred_withCD], axis = 1)
    df_compare_addrcode_subdist_withCD.columns = ['addrcode', 'Week', 'Year', 'actual', 'predicted']
    df_compare_addrcode_subdist_withCD.to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' 
                                           + str(num_leaves) + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' 
                                           + str(i + 1) + '_withCD_' + str(num_leaves) + '.csv', encoding = 'utf-8')
    
    # Evaluation
    rmse_withoutCD = mean_squared_error(y_test_true, y_pred_withoutCD) ** 0.5
    mae_withoutCD = mean_absolute_error(y_test_true, y_pred_withoutCD)
    r2_withoutCD = r2_score(y_test_true, y_pred_withoutCD)
    smape_withoutCD = smape_fast(y_test_true, y_pred_withoutCD)
    #print('RMSE of the prediction without CD is:', rmse_withoutCD)
    #print('MAE of the prediction without CD is:', mae_withoutCD)
    #print('R-squared of the prediction without CD is:', r2_withoutCD)
    #print('SMAPE of the prediction without CD is:', smape_withoutCD)
        
    rmse_withCD = mean_squared_error(y_test_true, y_pred_withCD) ** 0.5
    mae_withCD = mean_absolute_error(y_test_true, y_pred_withCD)
    r2_withCD = r2_score(y_test_true, y_pred_withCD)
    smape_withCD = smape_fast(y_test_true, y_pred_withCD)
    #print('RMSE of the prediction with CD is:', rmse_withCD)
    #print('MAE of the prediction with CD is:', mae_withCD)
    #print('R-squared of the prediction with CD is:', r2_withCD)
    #print('SMAPE of the prediction with CD is:', smape_withCD)
        
    rmse_percent_improved = (rmse_withoutCD - rmse_withCD) / rmse_withoutCD
    mae_percent_improved = (mae_withoutCD - mae_withCD) / mae_withoutCD
    smape_percent_improved = (smape_withoutCD - smape_withCD) / smape_withoutCD
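    # Note: R-squared is higher-is-better, so with a positive baseline a positive value here means R-squared dropped with CD
    r2_percent_improved = (r2_withoutCD - r2_withCD) / r2_withoutCD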
    #eval_array = np.append(eval_array, ['RMSE', rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    #print(eval_array)
        
    rmse = np.append(rmse, [rmse_withoutCD, rmse_withCD, rmse_percent_improved])
    mae = np.append(mae, [mae_withoutCD, mae_withCD, mae_percent_improved])
    smape = np.append(smape, [smape_withoutCD, smape_withCD, smape_percent_improved])
    r2 = np.append(r2, [r2_withoutCD, r2_withCD, r2_percent_improved])
        
    #df_withoutCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withoutCD.csv', header = 0)
    #df_withCD = pd.read_csv('LGBM/Original/LGBM_dist_DF_' + str(j + 1) + '_withCD.csv', header = 0)
    
    df_withoutCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                               + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' + str(i + 1) + '_withoutCD_' 
                               + str(num_leaves) + '.csv', header = 0)
    df_withCD = pd.read_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                            + '/Original DF_0/LGBM_' + province2 + '_subdist_original_DF_' + str(i + 1) + '_withCD_' 
                            + str(num_leaves) + '.csv', header = 0)
        
    subdist_code = df_train_subdist['addrcode'].unique()
    
    # For each sub-district
    for j in subdist_code:
            
        # Get the subset of actual and predicted values according to the sub-district code
        subset_withoutCD = df_withoutCD.loc[df_withoutCD['addrcode'] == j]
        subset_withCD = df_withCD.loc[df_withCD['addrcode'] == j]
            
        # Pass the response values to the array for evaluation calculation
        array_true = np.array(subset_withoutCD['actual'])
        array_pred_withoutCD = np.array(subset_withoutCD['predicted'])
        array_pred_withCD = np.array(subset_withCD['predicted'])
            
        # Calculate the evaluation values
        rmse_withoutCD_subdist = mean_squared_error(array_true, array_pred_withoutCD) ** 0.5
        mae_withoutCD_subdist = mean_absolute_error(array_true, array_pred_withoutCD)
        smape_withoutCD_subdist = smape_fast(array_true, array_pred_withoutCD)
        r2_withoutCD_subdist = r2_score(array_true, array_pred_withoutCD)
            
        rmse_withCD_subdist = mean_squared_error(array_true, array_pred_withCD) ** 0.5
        mae_withCD_subdist = mean_absolute_error(array_true, array_pred_withCD)
        smape_withCD_subdist = smape_fast(array_true, array_pred_withCD)
        r2_withCD_subdist = r2_score(array_true, array_pred_withCD)
            
        rmse_percent_improved_subdist = (rmse_withoutCD_subdist - rmse_withCD_subdist) / rmse_withoutCD_subdist
        mae_percent_improved_subdist = (mae_withoutCD_subdist - mae_withCD_subdist) / mae_withoutCD_subdist
        smape_percent_improved_subdist = (smape_withoutCD_subdist - smape_withCD_subdist) / smape_withoutCD_subdist
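        # Note: R-squared is higher-is-better, so with a positive baseline a positive value here means R-squared dropped with CD
        r2_percent_improved_subdist = (r2_withoutCD_subdist - r2_withCD_subdist) / r2_withoutCD_subdist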
            
        # Append
        subdist_array = np.append(subdist_array, [[j, rmse_withoutCD_subdist, rmse_withCD_subdist, rmse_percent_improved_subdist,
                                            mae_withoutCD_subdist, mae_withCD_subdist, mae_percent_improved_subdist,
                                            smape_withoutCD_subdist, smape_withCD_subdist, smape_percent_improved_subdist,
                                            r2_withoutCD_subdist, r2_withCD_subdist, r2_percent_improved_subdist]], axis = 0)

    #print(subdist_array)
    pd.DataFrame(subdist_array).to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                    + '/Original DF_0/LGBM_' + province2 + '_BySubDistrict_Original_DF_' + str(i + 1) 
                                    + '_eval_' + str(num_leaves) + '.csv', header = False, encoding = 'utf-8')
        
    # Reset the per-sub-district table to its header row before the next iteration
    subdist_array = np.asarray([['addrcode', 'RMSE without CD', 'RMSE with CD', '% improved RMSE', 
                              'MAE without CD', 'MAE with CD', '% improved MAE', 
                              'SMAPE without CD', 'SMAPE with CD', '% improved SMAPE', 
                              'R squared without CD', 'R squared with CD', '% improved R squared']])
    
# Evaluation file storing
# From RMSE DF_1 to R squared DF_6
eval_array = evaluation_print_modified_lag_original(eval_array, 'RMSE', rmse)
eval_array = evaluation_print_modified_lag_original(eval_array, 'MAE', mae)
eval_array = evaluation_print_modified_lag_original(eval_array, 'SMAPE', smape)
eval_array = evaluation_print_modified_lag_original(eval_array, 'R squared', r2)

#print(eval_array)

# Store all of the evaluation values into a CSV file
pd.DataFrame(eval_array).to_csv('LGBM/' + province1 + '/Modified Lags/Normal CD/num_leaves = ' + str(num_leaves) 
                                + '/Original DF_0/LGBM_' + province2 + '_subdist_original_eval_' + str(num_leaves) 
                                + '.csv', header = False, encoding = 'utf-8')

Starting training...
[1]	valid_0's l1: 0.449625	valid_0's l2: 0.367072
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.442645	valid_0's l2: 0.359194
[3]	valid_0's l1: 0.436086	valid_0's l2: 0.352044
[4]	valid_0's l1: 0.430176	valid_0's l2: 0.345791
[5]	valid_0's l1: 0.424605	valid_0's l2: 0.34018
[6]	valid_0's l1: 0.419093	valid_0's l2: 0.335408
[7]	valid_0's l1: 0.41378	valid_0's l2: 0.329977
[8]	valid_0's l1: 0.408962	valid_0's l2: 0.325856
[9]	valid_0's l1: 0.404733	valid_0's l2: 0.3226
[10]	valid_0's l1: 0.400382	valid_0's l2: 0.318656
[11]	valid_0's l1: 0.396557	valid_0's l2: 0.315551
[12]	valid_0's l1: 0.392996	valid_0's l2: 0.313066
[13]	valid_0's l1: 0.38963	valid_0's l2: 0.310503
[14]	valid_0's l1: 0.386476	valid_0's l2: 0.308327
[15]	valid_0's l1: 0.383571	valid_0's l2: 0.306385
[16]	valid_0's l1: 0.380883	valid_0's l2: 0.304371
[17]	valid_0's l1: 0.378166	valid_0's l2: 0.302887
[18]	valid_0's l1: 0.375897	valid_0's l2: 0.30183
[19]	valid_0's 

[19]	valid_0's l1: 0.389917	valid_0's l2: 0.322409
[20]	valid_0's l1: 0.388662	valid_0's l2: 0.321862
Did not meet early stopping. Best iteration is:
[20]	valid_0's l1: 0.388662	valid_0's l2: 0.321862
Starting predicting...
Starting training...
[1]	valid_0's l1: 0.45101	valid_0's l2: 0.369932
Training until validation scores don't improve for 6 rounds.
[2]	valid_0's l1: 0.445455	valid_0's l2: 0.364613
[3]	valid_0's l1: 0.440183	valid_0's l2: 0.360025
[4]	valid_0's l1: 0.435265	valid_0's l2: 0.35576
[5]	valid_0's l1: 0.430862	valid_0's l2: 0.352298
[6]	valid_0's l1: 0.426884	valid_0's l2: 0.347885
[7]	valid_0's l1: 0.422948	valid_0's l2: 0.344929
[8]	valid_0's l1: 0.419197	valid_0's l2: 0.342197
[9]	valid_0's l1: 0.416078	valid_0's l2: 0.340121
[10]	valid_0's l1: 0.412905	valid_0's l2: 0.338285
[11]	valid_0's l1: 0.410211	valid_0's l2: 0.336813
[12]	valid_0's l1: 0.407508	valid_0's l2: 0.334721
[13]	valid_0's l1: 0.40511	valid_0's l2: 0.333409
[14]	valid_0's l1: 0.402822	valid_0's l2: 0