This notebook is downloaded from: https://www.kaggle.com/tommy1028/lightgbm-starter-with-feature-engineering-idea

# LightGBM starter with feature engineering idea
***I am creating this note mostly for myself to re-organize my ideas. Please leave your comments and/or advice if you find the room I can improve my coding/model builing to go further.***


We will predict **the realized volatility of the next ten-minutes time window** with two data sets of the last ten minutes (600 seconds).One dataset contains ask and bid prices of almost each second, which allows us to calculate the realized volatility of the last ten minutes.The other dataset contains the actual record of stock trading, which is more sparse.

Please look at this notebook for the detailed explanation: https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data

As for EDA, you may find this notebook useful: https://www.kaggle.com/chumajin/optiver-realized-eda-for-starter-english-version

Thank you!

### Contribution from the commmunity
**The codes for showing Feature Importance is kindly prepared by this expert: https://www.kaggle.com/something4kag
and extracted by this notebook: https://www.kaggle.com/something4kag/lightgbm-starter-with-fe-and-importance**

## My approach(work in progress)
### Feature Engineering

Here are my thoughts on feature engieering with my background knowledge on financial market. 

 - price_spread: the difference between ask price and bid price. Wide spread means low liquidity, leading to high volatility.
 - volume: the sum of the ask/bid size. Low volume means low liquidity, leading to high volatility
 - volume_imbalance: the difference between ask size and bid size. Large imbalance means low liquidity for one side, leading to high volatility
 
Also, I created features only using last XX seconds to capture the dynamics of volatility further.


### Model Building
- optimize the weight for RMSPE: see this discussion https://www.kaggle.com/c/optiver-realized-volatility-prediction/discussion/250324
- one model for all stocks: model by stock_id does not work well. I am afraid of overfitting as well. stock_id is used as categorical and for target mean encoding.

## Preparation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('max_rows', 300)
pd.set_option('max_columns', 300)

import os
import glob

In [2]:
# data directory
data_dir = 'data/'

## Functions for preprocess

In [3]:
def calc_wap(df):
    wap = (df['bid_price1'] * df['ask_size1'] + df['ask_price1'] * df['bid_size1'])/(df['bid_size1'] + df['ask_size1'])
    return wap
def calc_wap2(df):
    wap = (df['bid_price2'] * df['ask_size2'] + df['ask_price2'] * df['bid_size2'])/(df['bid_size2'] + df['ask_size2'])
    return wap

In [4]:
def log_return(list_stock_prices):
    return np.log(list_stock_prices).diff() 

In [5]:
def realized_volatility(series):
    return np.sqrt(np.sum(series**2))

In [6]:
def count_unique(series):
    return len(np.unique(series))

## Main function for preprocessing book data

In [7]:
def preprocessor_book(file_path):
    df = pd.read_parquet(file_path)
    #calculate return etc
    df['wap'] = calc_wap(df)
    df['log_return'] = df.groupby('time_id')['wap'].apply(log_return)
    
    df['wap2'] = calc_wap2(df)
    df['log_return2'] = df.groupby('time_id')['wap2'].apply(log_return)
    
    df['wap_balance'] = abs(df['wap'] - df['wap2'])
    
    df['price_spread'] = (df['ask_price1'] - df['bid_price1']) / ((df['ask_price1'] + df['bid_price1'])/2)
    df['bid_spread'] = df['bid_price1'] - df['bid_price2']
    df['ask_spread'] = df['ask_price1'] - df['ask_price2']
    df['total_volume'] = (df['ask_size1'] + df['ask_size2']) + (df['bid_size1'] + df['bid_size2'])
    df['volume_imbalance'] = abs((df['ask_size1'] + df['ask_size2']) - (df['bid_size1'] + df['bid_size2']))

    # dict for aggregate
    create_feature_dict = {
        'log_return':[realized_volatility],
        'log_return2':[realized_volatility],
        'wap_balance':[np.mean],
        'price_spread':[np.mean],
        'bid_spread':[np.mean],
        'ask_spread':[np.mean],
        'volume_imbalance':[np.mean],
        'total_volume':[np.mean],
        'wap':[np.mean],
            }

    ##### groupby / all seconds
    df_feature = pd.DataFrame(df.groupby(['time_id']).agg(create_feature_dict)).reset_index()
    
    df_feature.columns = ['_'.join(col) for col in df_feature.columns] #time_id is changed to time_id_
        
    ###### groupby / last XX seconds
    last_seconds = [150, 300, 450]
    
    for second in last_seconds:
        second = 600 - second 
    
        df_feature_sec = pd.DataFrame(df.query(f'seconds_in_bucket >= {second}').groupby(['time_id']).agg(create_feature_dict)).reset_index()

        df_feature_sec.columns = ['_'.join(col) for col in df_feature_sec.columns] # time_id is changed to time_id_
     
        df_feature_sec = df_feature_sec.add_suffix('_' + str(second))

        df_feature = pd.merge(df_feature,df_feature_sec,how='left',left_on='time_id_',right_on=f'time_id__{second}')
        df_feature = df_feature.drop([f'time_id__{second}'],axis=1)
    
    # create row_id
    stock_id = file_path.split('=')[1]
    df_feature['row_id'] = df_feature['time_id_'].apply(lambda x:f'{stock_id}-{x}')
    df_feature = df_feature.drop(['time_id_'],axis=1)
    
    return df_feature

In [8]:
%%time
file_path = data_dir + "book_train.parquet/stock_id=0"
preprocessor_book(file_path)

Wall time: 12 s


Unnamed: 0,log_return_realized_volatility,log_return2_realized_volatility,wap_balance_mean,price_spread_mean,bid_spread_mean,ask_spread_mean,volume_imbalance_mean,total_volume_mean,wap_mean,log_return_realized_volatility_400,log_return2_realized_volatility_400,wap_balance_mean_400,price_spread_mean_400,bid_spread_mean_400,ask_spread_mean_400,volume_imbalance_mean_400,total_volume_mean_400,wap_mean_400,log_return_realized_volatility_300,log_return2_realized_volatility_300,wap_balance_mean_300,price_spread_mean_300,bid_spread_mean_300,ask_spread_mean_300,volume_imbalance_mean_300,total_volume_mean_300,wap_mean_300,log_return_realized_volatility_200,log_return2_realized_volatility_200,wap_balance_mean_200,price_spread_mean_200,bid_spread_mean_200,ask_spread_mean_200,volume_imbalance_mean_200,total_volume_mean_200,wap_mean_200,row_id
0,0.004499,0.006999,0.000388,0.000852,0.000176,-0.000151,134.894040,323.496689,1.003725,0.002300,0.004589,0.000390,0.000783,0.000214,-0.000191,124.326531,262.489796,1.003633,0.002953,0.004864,0.000372,0.000822,0.000223,-0.000162,137.158273,294.928058,1.003753,0.003402,0.005803,0.000379,0.000865,0.000205,-0.000155,134.772021,321.455959,1.003836,0-5
1,0.001204,0.002476,0.000212,0.000394,0.000142,-0.000135,142.050000,411.450000,1.000239,0.000934,0.001907,0.000261,0.000367,0.000186,-0.000133,96.136986,480.000000,1.000480,0.000981,0.002009,0.000239,0.000353,0.000164,-0.000123,135.513043,484.521739,1.000397,0.001014,0.002105,0.000210,0.000348,0.000139,-0.000123,151.407895,438.921053,1.000332,0-11
2,0.002369,0.004801,0.000331,0.000725,0.000197,-0.000198,141.414894,416.351064,0.999542,0.001179,0.003034,0.000411,0.000625,0.000167,-0.000204,152.509804,454.000000,0.998356,0.001295,0.003196,0.000431,0.000689,0.000141,-0.000249,144.147059,455.235294,0.998685,0.001940,0.003900,0.000396,0.000683,0.000146,-0.000263,143.514851,440.544554,0.998944,0-16
3,0.002574,0.003637,0.000380,0.000860,0.000190,-0.000108,146.216667,435.266667,0.998832,0.001003,0.001513,0.000350,0.001050,0.000155,-0.000048,153.826087,498.956522,0.998079,0.001776,0.002713,0.000331,0.000833,0.000158,-0.000095,144.698113,418.169811,0.998436,0.001855,0.002881,0.000339,0.000856,0.000133,-0.000114,137.217391,424.782609,0.998472,0-31
4,0.001894,0.003257,0.000254,0.000397,0.000191,-0.000109,123.846591,343.221591,0.999619,0.001434,0.001516,0.000298,0.000469,0.000209,-0.000126,85.618182,339.945455,0.999473,0.001520,0.002188,0.000252,0.000425,0.000191,-0.000120,99.449438,407.584270,0.999488,0.001571,0.002461,0.000230,0.000414,0.000174,-0.000108,116.956140,388.394737,0.999575,0-62
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3825,0.002579,0.003821,0.000212,0.000552,0.000083,-0.000182,197.144781,374.235690,0.997938,0.001411,0.002306,0.000167,0.000447,0.000068,-0.000197,220.857143,317.296703,0.996927,0.001673,0.002573,0.000193,0.000509,0.000062,-0.000169,233.946667,350.560000,0.997519,0.002247,0.003328,0.000223,0.000558,0.000067,-0.000176,207.519608,341.882353,0.997725,0-32751
3826,0.002206,0.002847,0.000267,0.000542,0.000092,-0.000172,233.781553,621.131068,1.000310,0.001288,0.001866,0.000263,0.000555,0.000083,-0.000192,250.672414,566.017241,1.000753,0.001487,0.002255,0.000300,0.000588,0.000074,-0.000177,257.920000,668.640000,1.000682,0.001496,0.002315,0.000293,0.000570,0.000082,-0.000180,262.807407,694.985185,1.000543,0-32753
3827,0.002913,0.003266,0.000237,0.000525,0.000202,-0.000083,115.829787,343.734043,0.999552,0.001511,0.002365,0.000226,0.000470,0.000177,-0.000082,109.246753,335.272727,1.000220,0.001928,0.002646,0.000216,0.000446,0.000191,-0.000075,105.432692,326.759615,1.000111,0.001963,0.002842,0.000223,0.000458,0.000213,-0.000072,111.068966,318.155172,1.000001,0-32758
3828,0.003046,0.005105,0.000245,0.000480,0.000113,-0.000166,132.074919,385.429967,1.002357,0.001617,0.002919,0.000242,0.000503,0.000104,-0.000169,115.728155,435.165049,1.002305,0.002137,0.003934,0.000269,0.000516,0.000096,-0.000175,123.423313,394.588957,1.002277,0.002550,0.004430,0.000269,0.000523,0.000093,-0.000153,118.563107,387.029126,1.002256,0-32763


In [9]:
trade_train = pd.read_parquet(data_dir + "trade_train.parquet/stock_id=0")
trade_train.head(15)

Unnamed: 0,time_id,seconds_in_bucket,price,size,order_count
0,5,21,1.002301,326,12
1,5,46,1.002778,128,4
2,5,50,1.002818,55,1
3,5,57,1.003155,121,5
4,5,68,1.003646,4,1
5,5,78,1.003762,134,5
6,5,122,1.004207,102,3
7,5,127,1.004577,1,1
8,5,144,1.00437,6,1
9,5,147,1.003964,233,4


## Main function for preprocessing trade data

In [10]:
def preprocessor_trade(file_path):
    df = pd.read_parquet(file_path)
    df['log_return'] = df.groupby('time_id')['price'].apply(log_return)
    
    
    aggregate_dictionary = {
        'log_return':[realized_volatility],
        'seconds_in_bucket':[count_unique],
        'size':[np.sum],
        'order_count':[np.mean],
    }
    
    df_feature = df.groupby('time_id').agg(aggregate_dictionary)
    
    df_feature = df_feature.reset_index()
    df_feature.columns = ['_'.join(col) for col in df_feature.columns]

    
    ######groupby / last XX seconds
    last_seconds = [150, 300, 450]
    
    for second in last_seconds:
        second = 600 - second
    
        df_feature_sec = df.query(f'seconds_in_bucket >= {second}').groupby('time_id').agg(aggregate_dictionary)
        df_feature_sec = df_feature_sec.reset_index()
        
        df_feature_sec.columns = ['_'.join(col) for col in df_feature_sec.columns]
        df_feature_sec = df_feature_sec.add_suffix('_' + str(second))
        
        df_feature = pd.merge(df_feature,df_feature_sec,how='left',left_on='time_id_',right_on=f'time_id__{second}')
        df_feature = df_feature.drop([f'time_id__{second}'],axis=1)
    
    df_feature = df_feature.add_prefix('trade_')
    stock_id = file_path.split('=')[1]
    df_feature['row_id'] = df_feature['trade_time_id_'].apply(lambda x:f'{stock_id}-{x}')
    df_feature = df_feature.drop(['trade_time_id_'],axis=1)
    
    return df_feature

In [11]:
%%time
file_path = data_dir + "trade_train.parquet/stock_id=0"
preprocessor_trade(file_path)

Wall time: 5.79 s


Unnamed: 0,trade_log_return_realized_volatility,trade_seconds_in_bucket_count_unique,trade_size_sum,trade_order_count_mean,trade_log_return_realized_volatility_400,trade_seconds_in_bucket_count_unique_400,trade_size_sum_400,trade_order_count_mean_400,trade_log_return_realized_volatility_300,trade_seconds_in_bucket_count_unique_300,trade_size_sum_300,trade_order_count_mean_300,trade_log_return_realized_volatility_200,trade_seconds_in_bucket_count_unique_200,trade_size_sum_200,trade_order_count_mean_200,row_id
0,0.002006,40,3179,2.750000,0.001121,16.0,1045.0,2.437500,0.001308,21.0,1587.0,2.571429,0.001666,27.0,1901.0,2.555556,0-5
1,0.000901,30,1289,1.900000,0.000510,11.0,829.0,2.090909,0.000587,16.0,900.0,2.250000,0.000802,22.0,1124.0,2.045455,0-11
2,0.001961,25,2161,2.720000,0.001048,10.0,1087.0,3.400000,0.001137,12.0,1189.0,3.166667,0.001575,18.0,1691.0,2.833333,0-16
3,0.001561,15,1962,3.933333,0.000802,3.0,514.0,3.666667,0.001089,9.0,1556.0,5.111111,0.001090,10.0,1561.0,4.700000,0-31
4,0.000871,22,1791,4.045455,0.000395,6.0,162.0,3.666667,0.000453,11.0,1219.0,4.909091,0.000498,14.0,1458.0,4.428571,0-62
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3825,0.001519,52,3450,3.057692,0.000911,28.0,1856.0,3.142857,0.001162,35.0,2365.0,3.257143,0.001257,39.0,2407.0,3.128205,0-32751
3826,0.001411,28,4547,3.892857,0.000765,6.0,1401.0,5.166667,0.001066,12.0,2161.0,4.250000,0.001235,18.0,2493.0,3.555556,0-32753
3827,0.001521,36,4250,3.500000,0.000875,13.0,1149.0,2.692308,0.001242,22.0,2294.0,3.727273,0.001243,23.0,2295.0,3.608696,0-32758
3828,0.001794,53,3217,2.150943,0.001070,16.0,1463.0,2.312500,0.001404,25.0,1627.0,1.920000,0.001622,33.0,2171.0,2.030303,0-32763


## Combined preprocessor function

In [12]:
def preprocessor(list_stock_ids, is_train = True):
    from joblib import Parallel, delayed # parallel computing to save time
    df = pd.DataFrame()
    
    def for_joblib(stock_id):
        if is_train:
            file_path_book = data_dir + "book_train.parquet/stock_id=" + str(stock_id)
            file_path_trade = data_dir + "trade_train.parquet/stock_id=" + str(stock_id)
        else:
            file_path_book = data_dir + "book_test.parquet/stock_id=" + str(stock_id)
            file_path_trade = data_dir + "trade_test.parquet/stock_id=" + str(stock_id)
            
        df_tmp = pd.merge(preprocessor_book(file_path_book),preprocessor_trade(file_path_trade),on='row_id',how='left')
     
        return pd.concat([df,df_tmp])
    
    df = Parallel(n_jobs=-1, verbose=1)(
        delayed(for_joblib)(stock_id) for stock_id in list_stock_ids
        )

    df =  pd.concat(df,ignore_index = True)
    return df


In [13]:
list_stock_ids = [0,1]
preprocessor(list_stock_ids, is_train = True)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   23.0s finished


Unnamed: 0,log_return_realized_volatility,log_return2_realized_volatility,wap_balance_mean,price_spread_mean,bid_spread_mean,ask_spread_mean,volume_imbalance_mean,total_volume_mean,wap_mean,log_return_realized_volatility_400,log_return2_realized_volatility_400,wap_balance_mean_400,price_spread_mean_400,bid_spread_mean_400,ask_spread_mean_400,volume_imbalance_mean_400,total_volume_mean_400,wap_mean_400,log_return_realized_volatility_300,log_return2_realized_volatility_300,wap_balance_mean_300,price_spread_mean_300,bid_spread_mean_300,ask_spread_mean_300,volume_imbalance_mean_300,total_volume_mean_300,wap_mean_300,log_return_realized_volatility_200,log_return2_realized_volatility_200,wap_balance_mean_200,price_spread_mean_200,bid_spread_mean_200,ask_spread_mean_200,volume_imbalance_mean_200,total_volume_mean_200,wap_mean_200,row_id,trade_log_return_realized_volatility,trade_seconds_in_bucket_count_unique,trade_size_sum,trade_order_count_mean,trade_log_return_realized_volatility_400,trade_seconds_in_bucket_count_unique_400,trade_size_sum_400,trade_order_count_mean_400,trade_log_return_realized_volatility_300,trade_seconds_in_bucket_count_unique_300,trade_size_sum_300,trade_order_count_mean_300,trade_log_return_realized_volatility_200,trade_seconds_in_bucket_count_unique_200,trade_size_sum_200,trade_order_count_mean_200
0,0.004499,0.006999,0.000388,0.000852,0.000176,-0.000151,134.894040,323.496689,1.003725,0.002300,0.004589,0.000390,0.000783,0.000214,-0.000191,124.326531,262.489796,1.003633,0.002953,0.004864,0.000372,0.000822,0.000223,-0.000162,137.158273,294.928058,1.003753,0.003402,0.005803,0.000379,0.000865,0.000205,-0.000155,134.772021,321.455959,1.003836,0-5,0.002006,40,3179,2.750000,0.001121,16.0,1045.0,2.437500,0.001308,21.0,1587.0,2.571429,0.001666,27.0,1901.0,2.555556
1,0.001204,0.002476,0.000212,0.000394,0.000142,-0.000135,142.050000,411.450000,1.000239,0.000934,0.001907,0.000261,0.000367,0.000186,-0.000133,96.136986,480.000000,1.000480,0.000981,0.002009,0.000239,0.000353,0.000164,-0.000123,135.513043,484.521739,1.000397,0.001014,0.002105,0.000210,0.000348,0.000139,-0.000123,151.407895,438.921053,1.000332,0-11,0.000901,30,1289,1.900000,0.000510,11.0,829.0,2.090909,0.000587,16.0,900.0,2.250000,0.000802,22.0,1124.0,2.045455
2,0.002369,0.004801,0.000331,0.000725,0.000197,-0.000198,141.414894,416.351064,0.999542,0.001179,0.003034,0.000411,0.000625,0.000167,-0.000204,152.509804,454.000000,0.998356,0.001295,0.003196,0.000431,0.000689,0.000141,-0.000249,144.147059,455.235294,0.998685,0.001940,0.003900,0.000396,0.000683,0.000146,-0.000263,143.514851,440.544554,0.998944,0-16,0.001961,25,2161,2.720000,0.001048,10.0,1087.0,3.400000,0.001137,12.0,1189.0,3.166667,0.001575,18.0,1691.0,2.833333
3,0.002574,0.003637,0.000380,0.000860,0.000190,-0.000108,146.216667,435.266667,0.998832,0.001003,0.001513,0.000350,0.001050,0.000155,-0.000048,153.826087,498.956522,0.998079,0.001776,0.002713,0.000331,0.000833,0.000158,-0.000095,144.698113,418.169811,0.998436,0.001855,0.002881,0.000339,0.000856,0.000133,-0.000114,137.217391,424.782609,0.998472,0-31,0.001561,15,1962,3.933333,0.000802,3.0,514.0,3.666667,0.001089,9.0,1556.0,5.111111,0.001090,10.0,1561.0,4.700000
4,0.001894,0.003257,0.000254,0.000397,0.000191,-0.000109,123.846591,343.221591,0.999619,0.001434,0.001516,0.000298,0.000469,0.000209,-0.000126,85.618182,339.945455,0.999473,0.001520,0.002188,0.000252,0.000425,0.000191,-0.000120,99.449438,407.584270,0.999488,0.001571,0.002461,0.000230,0.000414,0.000174,-0.000108,116.956140,388.394737,0.999575,0-62,0.000871,22,1791,4.045455,0.000395,6.0,162.0,3.666667,0.000453,11.0,1219.0,4.909091,0.000498,14.0,1458.0,4.428571
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7655,0.003723,0.004996,0.000330,0.000597,0.000157,-0.000118,125.013029,296.185668,1.000142,0.001584,0.001957,0.000268,0.000605,0.000169,-0.000100,132.111111,359.691358,1.000157,0.002212,0.002954,0.000283,0.000617,0.000165,-0.000104,120.469231,322.284615,1.000130,0.002973,0.003814,0.000277,0.000608,0.000152,-0.000121,129.730964,291.416244,1.000000,1-32751,0.001776,49,3249,2.775510,0.001028,16.0,1532.0,3.937500,0.001280,23.0,1889.0,3.608696,0.001462,31.0,2290.0,3.129032
7656,0.010829,0.012168,0.000403,0.000922,0.000159,-0.000125,254.006073,567.840081,1.007503,0.006736,0.007670,0.000511,0.001110,0.000222,-0.000113,235.212500,527.625000,1.011841,0.008499,0.009971,0.000483,0.001082,0.000196,-0.000129,217.410788,485.195021,1.012343,0.009946,0.011071,0.000468,0.001037,0.000177,-0.000131,229.679128,513.903427,1.011773,1-32753,0.008492,183,75903,7.874317,0.003885,51.0,19713.0,6.862745,0.006310,88.0,30858.0,8.136364,0.007641,126.0,49265.0,8.293651
7657,0.003135,0.004268,0.000243,0.000648,0.000141,-0.000132,163.645367,426.603834,1.000854,0.001737,0.002177,0.000279,0.000634,0.000132,-0.000146,131.526882,433.182796,1.001216,0.002108,0.003184,0.000261,0.000625,0.000146,-0.000146,142.402685,426.939597,1.001250,0.002363,0.003635,0.000258,0.000620,0.000143,-0.000138,154.732719,404.843318,1.001174,1-32758,0.001927,26,2239,2.615385,0.001355,7.0,654.0,2.714286,0.001567,11.0,980.0,2.727273,0.001692,20.0,1734.0,2.700000
7658,0.003750,0.005773,0.000199,0.000421,0.000190,-0.000231,138.235023,526.317972,1.003032,0.002287,0.003912,0.000209,0.000394,0.000166,-0.000235,155.471831,526.260563,1.004099,0.002728,0.004435,0.000195,0.000410,0.000165,-0.000240,148.273543,519.044843,1.004296,0.003087,0.005227,0.000200,0.000439,0.000180,-0.000239,130.053872,524.074074,1.004028,1-32763,0.002856,109,16648,2.935780,0.001684,37.0,5279.0,2.837838,0.001919,57.0,8274.0,2.701754,0.002478,76.0,10960.0,2.710526


## Training set

In [14]:
train = pd.read_csv(data_dir + 'train.csv')

In [15]:
train_ids = train.stock_id.unique()

In [16]:
%%time
df_train = preprocessor(list_stock_ids= train_ids, is_train = True)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 112 out of 112 | elapsed:  6.1min finished


Wall time: 6min 6s


In [17]:
train['row_id'] = train['stock_id'].astype(str) + '-' + train['time_id'].astype(str)
train = train[['row_id','target']]
df_train = train.merge(df_train, on = ['row_id'], how = 'left')

In [27]:
df_train.to_csv('train_data.csv')

## Test set

In [19]:
test = pd.read_csv(data_dir + 'test.csv')

In [20]:
test_ids = test.stock_id.unique()

In [21]:
%%time
df_test = preprocessor(list_stock_ids= test_ids, is_train = False)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Wall time: 363 ms


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.3s finished


In [22]:
df_test = test.merge(df_test, on = ['row_id'], how = 'left')

In [28]:
df_test.to_csv('data/test_data.csv')

## Target encoding by stock_id

In [34]:
df_train = pd.read_csv('data/train_data.csv', index_col=0)
df_test = pd.read_csv('data/test_data.csv', index_col=0)

In [35]:
from sklearn.model_selection import KFold
#stock_id target encoding
df_train['stock_id'] = df_train['row_id'].apply(lambda x:x.split('-')[0])
df_test['stock_id'] = df_test['row_id'].apply(lambda x:x.split('-')[0])

stock_id_target_mean = df_train.groupby('stock_id')['target'].mean() 
df_test['stock_id_target_enc'] = df_test['stock_id'].map(stock_id_target_mean) # test_set

#training
tmp = np.repeat(np.nan, df_train.shape[0])
kf = KFold(n_splits = 10, shuffle=True,random_state = 19911109)
for idx_1, idx_2 in kf.split(df_train):
    target_mean = df_train.iloc[idx_1].groupby('stock_id')['target'].mean()

    tmp[idx_2] = df_train['stock_id'].iloc[idx_2].map(target_mean)
df_train['stock_id_target_enc'] = tmp

## Model Building

In [36]:
df_train.head()

Unnamed: 0,row_id,target,log_return_realized_volatility,log_return2_realized_volatility,wap_balance_mean,price_spread_mean,bid_spread_mean,ask_spread_mean,volume_imbalance_mean,total_volume_mean,wap_mean,log_return_realized_volatility_400,log_return2_realized_volatility_400,wap_balance_mean_400,price_spread_mean_400,bid_spread_mean_400,ask_spread_mean_400,volume_imbalance_mean_400,total_volume_mean_400,wap_mean_400,log_return_realized_volatility_300,log_return2_realized_volatility_300,wap_balance_mean_300,price_spread_mean_300,bid_spread_mean_300,ask_spread_mean_300,volume_imbalance_mean_300,total_volume_mean_300,wap_mean_300,log_return_realized_volatility_200,log_return2_realized_volatility_200,wap_balance_mean_200,price_spread_mean_200,bid_spread_mean_200,ask_spread_mean_200,volume_imbalance_mean_200,total_volume_mean_200,wap_mean_200,trade_log_return_realized_volatility,trade_seconds_in_bucket_count_unique,trade_size_sum,trade_order_count_mean,trade_log_return_realized_volatility_400,trade_seconds_in_bucket_count_unique_400,trade_size_sum_400,trade_order_count_mean_400,trade_log_return_realized_volatility_300,trade_seconds_in_bucket_count_unique_300,trade_size_sum_300,trade_order_count_mean_300,trade_log_return_realized_volatility_200,trade_seconds_in_bucket_count_unique_200,trade_size_sum_200,trade_order_count_mean_200,stock_id,stock_id_target_enc
0,0-5,0.004136,0.004499,0.006999,0.000388,0.000852,0.000176,-0.000151,134.89404,323.496689,1.003725,0.0023,0.004589,0.00039,0.000783,0.000214,-0.000191,124.326531,262.489796,1.003633,0.002953,0.004864,0.000372,0.000822,0.000223,-0.000162,137.158273,294.928058,1.003753,0.003402,0.005803,0.000379,0.000865,0.000205,-0.000155,134.772021,321.455959,1.003836,0.002006,40.0,3179.0,2.75,0.001121,16.0,1045.0,2.4375,0.001308,21.0,1587.0,2.571429,0.001666,27.0,1901.0,2.555556,0,0.004041
1,0-11,0.001445,0.001204,0.002476,0.000212,0.000394,0.000142,-0.000135,142.05,411.45,1.000239,0.000934,0.001907,0.000261,0.000367,0.000186,-0.000133,96.136986,480.0,1.00048,0.000981,0.002009,0.000239,0.000353,0.000164,-0.000123,135.513043,484.521739,1.000397,0.001014,0.002105,0.00021,0.000348,0.000139,-0.000123,151.407895,438.921053,1.000332,0.000901,30.0,1289.0,1.9,0.00051,11.0,829.0,2.090909,0.000587,16.0,900.0,2.25,0.000802,22.0,1124.0,2.045455,0,0.004039
2,0-16,0.002168,0.002369,0.004801,0.000331,0.000725,0.000197,-0.000198,141.414894,416.351064,0.999542,0.001179,0.003034,0.000411,0.000625,0.000167,-0.000204,152.509804,454.0,0.998356,0.001295,0.003196,0.000431,0.000689,0.000141,-0.000249,144.147059,455.235294,0.998685,0.00194,0.0039,0.000396,0.000683,0.000146,-0.000263,143.514851,440.544554,0.998943,0.001961,25.0,2161.0,2.72,0.001048,10.0,1087.0,3.4,0.001137,12.0,1189.0,3.166667,0.001575,18.0,1691.0,2.833333,0,0.004041
3,0-31,0.002195,0.002574,0.003637,0.00038,0.00086,0.00019,-0.000108,146.216667,435.266667,0.998831,0.001003,0.001513,0.00035,0.00105,0.000155,-4.8e-05,153.826087,498.956522,0.998079,0.001776,0.002713,0.000331,0.000833,0.000158,-9.5e-05,144.698113,418.169811,0.998436,0.001855,0.002881,0.000339,0.000856,0.000133,-0.000114,137.217391,424.782609,0.998472,0.001561,15.0,1962.0,3.933333,0.000802,3.0,514.0,3.666667,0.001089,9.0,1556.0,5.111111,0.00109,10.0,1561.0,4.7,0,0.004013
4,0-62,0.001747,0.001894,0.003257,0.000254,0.000397,0.000191,-0.000109,123.846591,343.221591,0.999619,0.001434,0.001516,0.000298,0.000469,0.000209,-0.000126,85.618182,339.945455,0.999473,0.00152,0.002188,0.000252,0.000425,0.000191,-0.00012,99.449438,407.58427,0.999488,0.001571,0.002461,0.00023,0.000414,0.000174,-0.000108,116.95614,388.394737,0.999575,0.000871,22.0,1791.0,4.045455,0.000395,6.0,162.0,3.666667,0.000453,11.0,1219.0,4.909091,0.000498,14.0,1458.0,4.428571,0,0.004032


In [37]:
df_test.head()

Unnamed: 0,stock_id,time_id,row_id,log_return_realized_volatility,log_return2_realized_volatility,wap_balance_mean,price_spread_mean,bid_spread_mean,ask_spread_mean,volume_imbalance_mean,total_volume_mean,wap_mean,log_return_realized_volatility_400,log_return2_realized_volatility_400,wap_balance_mean_400,price_spread_mean_400,bid_spread_mean_400,ask_spread_mean_400,volume_imbalance_mean_400,total_volume_mean_400,wap_mean_400,log_return_realized_volatility_300,log_return2_realized_volatility_300,wap_balance_mean_300,price_spread_mean_300,bid_spread_mean_300,ask_spread_mean_300,volume_imbalance_mean_300,total_volume_mean_300,wap_mean_300,log_return_realized_volatility_200,log_return2_realized_volatility_200,wap_balance_mean_200,price_spread_mean_200,bid_spread_mean_200,ask_spread_mean_200,volume_imbalance_mean_200,total_volume_mean_200,wap_mean_200,trade_log_return_realized_volatility,trade_seconds_in_bucket_count_unique,trade_size_sum,trade_order_count_mean,trade_log_return_realized_volatility_400,trade_seconds_in_bucket_count_unique_400,trade_size_sum_400,trade_order_count_mean_400,trade_log_return_realized_volatility_300,trade_seconds_in_bucket_count_unique_300,trade_size_sum_300,trade_order_count_mean_300,trade_log_return_realized_volatility_200,trade_seconds_in_bucket_count_unique_200,trade_size_sum_200,trade_order_count_mean_200,stock_id_target_enc
0,0,4,0-4,0.000294,0.000252,0.000145,0.000557,0.000393,-0.000115,164.666667,350.666667,1.000405,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.000295,3.0,201.0,3.666667,,,,,,,,,,,,,0.004028
1,0,32,0-32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.004028
2,0,34,0-34,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.004028


## LightGBM

In [38]:
import lightgbm as lgbm
from bayes_opt import BayesianOptimization

In [39]:
df_train['stock_id'] = df_train['stock_id'].astype(int)
df_test['stock_id'] = df_test['stock_id'].astype(int)

In [40]:
X = df_train.drop(['row_id','target'],axis=1)
y = df_train['target']

In [168]:
def rmspe(y_true, y_pred):
    return  (np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))))

def feval_RMSPE(preds, lgbm_train):
    labels = lgbm_train.get_label()
    return 'RMSPE', round(rmspe(y_true = labels, y_pred = preds),5), False

In [176]:
def bayes_parameter_opt_lgb(X, y, init_round=15, opt_round=25, n_folds=3, random_seed=6,n_estimators=10000, output_process=False):
    def lgb_eval(learning_rate,num_leaves, feature_fraction, bagging_fraction, max_depth, max_bin, min_data_in_leaf,min_sum_hessian_in_leaf,subsample):
        params = {'application':'regression', 'metric':'RMSPE'}
        params['learning_rate'] = max(min(learning_rate, 1), 0)
        params["num_leaves"] = int(round(num_leaves))
        params['feature_fraction'] = max(min(feature_fraction, 1), 0)
        params['bagging_fraction'] = max(min(bagging_fraction, 1), 0)
        params['max_depth'] = int(round(max_depth))
        params['max_bin'] = int(round(max_depth))
        params['min_data_in_leaf'] = int(round(min_data_in_leaf))
        params['min_sum_hessian_in_leaf'] = min_sum_hessian_in_leaf
        params['subsample'] = max(min(subsample, 1), 0)
        
        scores = 0.0
        
        for fold, (trn_idx, val_idx) in enumerate(kf.split(X, y)):
            X_train, y_train = X.loc[trn_idx], y[trn_idx]
            X_valid, y_valid = X.loc[val_idx], y[val_idx]
            
            weights = 1/np.square(y_train)
            lgbm_train = lgbm.Dataset(X_train,y_train,weight = weights)

            weights = 1/np.square(y_valid)
            lgbm_valid = lgbm.Dataset(X_valid,y_valid,reference = lgbm_train,weight = weights)
            
            model = lgbm.train(params=params,
                      train_set=lgbm_train,
                      valid_sets=[lgbm_train, lgbm_valid],
                      num_boost_round=5000,         
                      feval=feval_RMSPE,
                      verbose_eval=100,
                      categorical_feature = ['stock_id']                
                     )
            y_pred = model.predict(X_valid, num_iteration=model.best_iteration)

            RMSPE = round(rmspe(y_true = y_valid, y_pred = y_pred),3)
#             print(f'Performance of the　prediction: , RMSPE: {RMSPE}')

            #keep scores and models
            scores += RMSPE / 5
        return scores
     
    lgbBO = BayesianOptimization(lgb_eval, {'learning_rate': (0.01, 1.0),
                                            'num_leaves': (24, 80),
                                            'feature_fraction': (0.1, 0.9),
                                            'bagging_fraction': (0.2, 1),
                                            'max_depth': (5, 500),
                                            'max_bin':(20,200),
                                            'min_data_in_leaf': (20, 200),
                                            'min_sum_hessian_in_leaf':(0,100),
                                           'subsample': (0.01, 1.0)}, random_state=200)

    lgbBO.maximize(init_points=init_round, n_iter=opt_round)
    
    model_auc=[]
    for model in range(len(lgbBO.res)):
        model_auc.append(lgbBO.res[model]['target'])
    
    # return best parameters
    return lgbBO.res[pd.Series(model_auc).idxmax()]['target'],lgbBO.res[pd.Series(model_auc).idxmax()]['params']

In [None]:
opt_params = bayes_parameter_opt_lgb(X, y, init_round=5, opt_round=10, n_folds=3, random_seed=6,n_estimators=100)

|   iter    |  target   | baggin... | featur... | learni... |  max_bin  | max_depth | min_da... | min_su... | num_le... | subsample |
-------------------------------------------------------------------------------------------------------------------------------------


New categorical_feature is ['stock_id']


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 20145
[LightGBM] [Info] Number of data points in the train set: 343145, number of used features: 54




[LightGBM] [Info] Start training from score 0.001800
[100]	training's RMSPE: 0.20456	valid_1's RMSPE: 0.24535
[200]	training's RMSPE: 0.18984	valid_1's RMSPE: 0.25101
[300]	training's RMSPE: 0.17747	valid_1's RMSPE: 0.2561
[400]	training's RMSPE: 0.16746	valid_1's RMSPE: 0.26078
[500]	training's RMSPE: 0.15831	valid_1's RMSPE: 0.26412
[600]	training's RMSPE: 0.15015	valid_1's RMSPE: 0.26754
[700]	training's RMSPE: 0.14276	valid_1's RMSPE: 0.27
[800]	training's RMSPE: 0.13601	valid_1's RMSPE: 0.27222
[900]	training's RMSPE: 0.12963	valid_1's RMSPE: 0.27429
[1000]	training's RMSPE: 0.1238	valid_1's RMSPE: 0.27658
[1100]	training's RMSPE: 0.11802	valid_1's RMSPE: 0.27854
[1200]	training's RMSPE: 0.11315	valid_1's RMSPE: 0.28003
[1300]	training's RMSPE: 0.1083	valid_1's RMSPE: 0.28199
[1400]	training's RMSPE: 0.10391	valid_1's RMSPE: 0.28279
[1500]	training's RMSPE: 0.09978	valid_1's RMSPE: 0.28417
[1600]	training's RMSPE: 0.09563	valid_1's RMSPE: 0.28537
[1700]	training's RMSPE: 0.09179	v

New categorical_feature is ['stock_id']


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 20144
[LightGBM] [Info] Number of data points in the train set: 343145, number of used features: 54




[LightGBM] [Info] Start training from score 0.001795
[100]	training's RMSPE: 0.20443	valid_1's RMSPE: 0.24032
[200]	training's RMSPE: 0.18958	valid_1's RMSPE: 0.24641
[300]	training's RMSPE: 0.17774	valid_1's RMSPE: 0.25132
[400]	training's RMSPE: 0.16751	valid_1's RMSPE: 0.25572
[500]	training's RMSPE: 0.15836	valid_1's RMSPE: 0.25924
[600]	training's RMSPE: 0.1502	valid_1's RMSPE: 0.2624
[700]	training's RMSPE: 0.14279	valid_1's RMSPE: 0.26521
[800]	training's RMSPE: 0.13608	valid_1's RMSPE: 0.26739
[900]	training's RMSPE: 0.12987	valid_1's RMSPE: 0.26923
[1000]	training's RMSPE: 0.12401	valid_1's RMSPE: 0.2715
[1100]	training's RMSPE: 0.11846	valid_1's RMSPE: 0.27304
[1200]	training's RMSPE: 0.11323	valid_1's RMSPE: 0.27469
[1300]	training's RMSPE: 0.10827	valid_1's RMSPE: 0.27626
[1400]	training's RMSPE: 0.10365	valid_1's RMSPE: 0.27784
[1500]	training's RMSPE: 0.09937	valid_1's RMSPE: 0.27918
[1600]	training's RMSPE: 0.09532	valid_1's RMSPE: 0.28058
[1700]	training's RMSPE: 0.0914

New categorical_feature is ['stock_id']


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 20143
[LightGBM] [Info] Number of data points in the train set: 343146, number of used features: 54




[LightGBM] [Info] Start training from score 0.001800
[100]	training's RMSPE: 0.20422	valid_1's RMSPE: 0.25187
[200]	training's RMSPE: 0.18926	valid_1's RMSPE: 0.25436
[300]	training's RMSPE: 0.17736	valid_1's RMSPE: 0.25899
[400]	training's RMSPE: 0.1671	valid_1's RMSPE: 0.26271
[500]	training's RMSPE: 0.15822	valid_1's RMSPE: 0.26598
[600]	training's RMSPE: 0.15006	valid_1's RMSPE: 0.27094
[700]	training's RMSPE: 0.14239	valid_1's RMSPE: 0.27359
[800]	training's RMSPE: 0.13556	valid_1's RMSPE: 0.27623
[900]	training's RMSPE: 0.12916	valid_1's RMSPE: 0.27955


### Hyperparameter Tunning using Hyperopt

In [78]:
# from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
# from hyperopt.pyll import scope

In [113]:
# lgb_reg_params = {
#     'learning_rate':    hp.uniform('learning_rate',0.01,1),
#     'max_depth':        hp.choice('max_depth', np.arange(2, 100, 1, dtype=int)),
#     'min_child_weight': hp.choice('min_child_weight', np.arange(1, 50, 1, dtype=int)),
#     'colsample_bytree': hp.uniform('colsample_bytree',0.1,1),
#     'subsample':        hp.uniform('subsample', 0.1, 1),
#     'num_leaves':       hp.choice('num_leaves', np.arange(1, 200, 1, dtype=int)),
#     'min_split_gain':   hp.uniform('min_split_gain', 0, 1),
#     'boosting_type':    hp.choice('boosting_type', ['dart', 'goss', 'gbdt']),
#     'reg_alpha':        hp.uniform('reg_alpha',0,1),
#     'reg_lambda':       hp.uniform('reg_lambda',0,1),
#     'n_estimators':     hp.choice('n_estimators', np.arange(5, 500, 5, dtype=int)),
#     'feature_fraction': hp.uniform('feature_fraction', 0.0, 1.0),
#     'bagging_fraction': hp.uniform('bagging_fraction', 0.0, 1.0)
# }

In [None]:
# param_hyperopt= {
#     'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(1)),
#     'max_depth': scope.int(hp.quniform('max_depth', 5, 40, 1)),
#     'n_estimators': scope.int(hp.quniform('n_estimators', 10, 500, 10)),
#     'num_leaves': scope.int(hp.quniform('num_leaves', 5, 50, 1)),
#     'boosting_type': hp.choice('boosting_type', ['dart', 'goss', 'gbdt']),
#     'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0)
#     'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
#     'colsample_bytree': hp.uniform('colsample_bytree', 0.0, 1.0)
# #     'feature_fraction': hp.uniform('feature_fraction', 0.0, 1.0),
# #     'bagging_fraction': hp.uniform('bagging_fraction', 0.0, 1.0),
# }

### Cross Validation

In [144]:
# from sklearn.model_selection import KFold
# kf = KFold(n_splits=5, random_state=19901028, shuffle=True)

In [148]:
# scores = 0.0 
# models = []

In [149]:
# def f(params):
#     for fold, (trn_idx, val_idx) in enumerate(kf.split(X, y)):

#         print("Fold :", fold+1)
    
#         # create dataset
#         X_train, y_train = X.loc[trn_idx], y[trn_idx]
#         X_valid, y_valid = X.loc[val_idx], y[val_idx]

#         #RMSPE weight
#         weights = 1/np.square(y_train)
#         lgbm_train = lgbm.Dataset(X_train,y_train,weight = weights)

#         weights = 1/np.square(y_valid)
#         lgbm_valid = lgbm.Dataset(X_valid,y_valid,reference = lgbm_train,weight = weights)

#         # model 
#         model = lgbm.train(params=params,
#                            train_set=lgbm_train,
#                            valid_sets=[lgbm_train, lgbm_valid],
#                            num_boost_round=5000,
#                            feval=feval_RMSPE,
#                            verbose_eval=100,
#                            categorical_feature = ['stock_id']
#                          )

#         # validation 
#         y_pred = model.predict(X_valid, num_iteration=model.best_iteration)

#         RMSPE = round(rmspe(y_true = y_valid, y_pred = y_pred),3)
#         print(f'Performance of the　prediction: , RMSPE: {RMSPE}')

#         #keep scores and models
#         scores += RMSPE / 5
#         models.append(model)
#         print("*" * 100)
#     return scores

In [150]:
# trials = Trials()
# result = fmin(
#     fn=f,                           # objective function
#     space=lgb_reg_params,           # parameter space
#     algo=tpe.suggest,               # surrogate algorithm
#     max_evals=50,                   # no. of evaluations
#     trials=trials                   # trials object that keeps track of the sample results (optional)
# )
# print(result)

Fold :                                                
1                                                     
  0%|          | 0/50 [00:00<?, ?trial/s, best loss=?]


New categorical_feature is ['stock_id']



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 13573                    
[LightGBM] [Info] Number of data points in the train set: 343145, number of used features: 54
  0%|          | 0/50 [00:01<?, ?trial/s, best loss=?]





[LightGBM] [Info] Start training from score 0.001800  
[100]	training's RMSPE: 0.22565	valid_1's RMSPE: 0.23432


[200]	training's RMSPE: 0.22541	valid_1's RMSPE: 0.23416
  0%|          | 0/50 [00:18<?, ?trial/s, best loss=?]


KeyboardInterrupt: 

In [None]:
if result['boosting_type']==0:
    result['boosting_type'] = 'dart'
elif result['boosting_type']==1:
    result['boosting_type'] = 'goss'
else:
    result['boosting_type'] = 'gbdt'

In [None]:
f(result)

In [None]:
model

# Test set

In [125]:
y_pred = df_test[['row_id']]
X_test = df_test.drop(['time_id', 'row_id'], axis = 1)

In [126]:
X_test

Unnamed: 0,stock_id,log_return_realized_volatility,log_return2_realized_volatility,wap_balance_mean,price_spread_mean,bid_spread_mean,ask_spread_mean,volume_imbalance_mean,total_volume_mean,wap_mean,log_return_realized_volatility_400,log_return2_realized_volatility_400,wap_balance_mean_400,price_spread_mean_400,bid_spread_mean_400,ask_spread_mean_400,volume_imbalance_mean_400,total_volume_mean_400,wap_mean_400,log_return_realized_volatility_300,log_return2_realized_volatility_300,wap_balance_mean_300,price_spread_mean_300,bid_spread_mean_300,ask_spread_mean_300,volume_imbalance_mean_300,total_volume_mean_300,wap_mean_300,log_return_realized_volatility_200,log_return2_realized_volatility_200,wap_balance_mean_200,price_spread_mean_200,bid_spread_mean_200,ask_spread_mean_200,volume_imbalance_mean_200,total_volume_mean_200,wap_mean_200,trade_log_return_realized_volatility,trade_seconds_in_bucket_count_unique,trade_size_sum,trade_order_count_mean,trade_log_return_realized_volatility_400,trade_seconds_in_bucket_count_unique_400,trade_size_sum_400,trade_order_count_mean_400,trade_log_return_realized_volatility_300,trade_seconds_in_bucket_count_unique_300,trade_size_sum_300,trade_order_count_mean_300,trade_log_return_realized_volatility_200,trade_seconds_in_bucket_count_unique_200,trade_size_sum_200,trade_order_count_mean_200,stock_id_target_enc
0,0,0.000294,0.000252,0.000145,0.000557,0.000393,-0.000115,164.666667,350.666667,1.000405,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.000295,3.0,201.0,3.666667,,,,,,,,,,,,,0.004028
1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.004028
2,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.004028


In [129]:
model.fit(X,y)





LGBMRegressor(bagging_fraction=0.5786962235763202, boosting_type='goss',
              colsample_bytree=0.3676292462136665, early_stopping_rounds=None,
              feature_fraction=0.6539607982625899,
              learning_rate=0.737114518760958, max_depth=78,
              min_child_weight=46, min_split_gain=0.0016060416701769668,
              n_estimators=47, num_leaves=72, reg_alpha=0.3966088097347819,
              reg_lambda=0.14110438421198862, subsample=0.4024465891223342)

In [130]:
pred = model.predict(X_test[X.columns])

In [133]:
y_pred = y_pred.assign(target = pred)

In [134]:
y_pred

Unnamed: 0,row_id,target
0,0-4,0.001643
1,0-32,0.001643
2,0-34,0.001643


In [135]:
y_pred.to_csv('submission.csv',index = False)

In [136]:
X_test

Unnamed: 0,stock_id,log_return_realized_volatility,log_return2_realized_volatility,wap_balance_mean,price_spread_mean,bid_spread_mean,ask_spread_mean,volume_imbalance_mean,total_volume_mean,wap_mean,log_return_realized_volatility_400,log_return2_realized_volatility_400,wap_balance_mean_400,price_spread_mean_400,bid_spread_mean_400,ask_spread_mean_400,volume_imbalance_mean_400,total_volume_mean_400,wap_mean_400,log_return_realized_volatility_300,log_return2_realized_volatility_300,wap_balance_mean_300,price_spread_mean_300,bid_spread_mean_300,ask_spread_mean_300,volume_imbalance_mean_300,total_volume_mean_300,wap_mean_300,log_return_realized_volatility_200,log_return2_realized_volatility_200,wap_balance_mean_200,price_spread_mean_200,bid_spread_mean_200,ask_spread_mean_200,volume_imbalance_mean_200,total_volume_mean_200,wap_mean_200,trade_log_return_realized_volatility,trade_seconds_in_bucket_count_unique,trade_size_sum,trade_order_count_mean,trade_log_return_realized_volatility_400,trade_seconds_in_bucket_count_unique_400,trade_size_sum_400,trade_order_count_mean_400,trade_log_return_realized_volatility_300,trade_seconds_in_bucket_count_unique_300,trade_size_sum_300,trade_order_count_mean_300,trade_log_return_realized_volatility_200,trade_seconds_in_bucket_count_unique_200,trade_size_sum_200,trade_order_count_mean_200,stock_id_target_enc
0,0,0.000294,0.000252,0.000145,0.000557,0.000393,-0.000115,164.666667,350.666667,1.000405,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.000295,3.0,201.0,3.666667,,,,,,,,,,,,,0.004028
1,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.004028
2,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.004028
