This dataset contains information on historic trades for several cryptoassets, such as Bitcoin and Ethereum. Your challenge is to predict their future returns.

As historic cryptocurrency prices are not confidential, this will be a forecasting competition using the time series API. Furthermore, the public leaderboard targets are publicly available and are provided as part of the competition dataset. Expect to see many people submitting perfect submissions for fun. Accordingly, THE PUBLIC LEADERBOARD FOR THIS COMPETITION IS NOT MEANINGFUL and is only provided as a convenience for anyone who wants to test their code. The final private leaderboard will be determined using real market data gathered after the submission period closes.

* train.csv - The training set
    * timestamp - A timestamp for the minute covered by the row.
    * Asset_ID - An ID code for the cryptoasset.
    * Count - The number of trades that took place this minute.
    * Open - The USD price at the beginning of the minute.
    * High - The highest USD price during the minute.
    * Low - The lowest USD price during the minute.
    * Close - The USD price at the end of the minute.
    * Volume - The number of cryptoasset units traded during the minute.
    * VWAP - The volume weighted average price for the minute.
    * Target - 15 minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
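
The `timestamp` column can be turned into readable datetimes with pandas. The sketch below assumes the values are Unix epoch seconds, which is consistent with the 60-second steps used later in this notebook; the two sample values are illustrative and not read from `train.csv`.

```python
import pandas as pd

# Two illustrative minute timestamps, assumed to be Unix epoch seconds.
ts = pd.Series([1514764860, 1514764920])

# unit="s" interprets the integers as seconds; utc=True makes the timezone explicit.
as_dt = pd.to_datetime(ts, unit="s", utc=True)
print(as_dt)
```

Keeping the timezone explicit avoids results that depend on the local settings of the machine running the notebook.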

This forecasting competition aims to predict near-future returns for prices $P_a$, for each asset $a$. For each row in the dataset, we include the target for prediction, Target. Target is derived from log returns ($R_a$) over 15 minutes.

> $R_a(t) = \log\left(P_a(t+16) / P_a(t+1)\right)$
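
The log return above can be sketched with a shifted price series. This is a minimal illustration, assuming the price is the `Close` column and that the series has exactly one row per minute with no gaps; the helper name `log_return_15m` is hypothetical.

```python
import numpy as np
import pandas as pd

def log_return_15m(close: pd.Series) -> pd.Series:
    """R(t) = log(P(t+16) / P(t+1)) for a gap-free, minute-by-minute price series."""
    # shift(-k) aligns the price k minutes ahead with the current row.
    return np.log(close.shift(-16) / close.shift(-1))

# Toy prices growing 1% per minute, so each 15-minute log return is 15 * log(1.01).
prices = pd.Series(100.0 * 1.01 ** np.arange(20))
returns = log_return_15m(prices)
print(returns.head())
```

The last 16 rows have no price 16 minutes ahead, so their return is NaN.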
 
Crypto asset returns are highly correlated, following to a large extent the overall crypto market. As we want to test your ability to predict returns for individual assets, we perform a linear residualization, removing the market signal from individual asset returns when creating the target. In more detail, if $M(t)$ is the weighted average market return, the target is:

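> $M(t) = \dfrac{\sum_a w_a R_a(t)}{\sum_a w_a}$
>
> $\beta_a = \dfrac{\langle M \cdot R_a \rangle}{\langle M^2 \rangle}$
>
> $\text{Target}_a(t) = R_a(t) - \beta_a M(t)$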

where the brackets $\langle \cdot \rangle$ denote a rolling average over time (3750-minute windows), and the asset weights $w_a$ are the same ones used for the evaluation metric.
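
The residualization can be sketched in pandas as below. This is an illustration under stated assumptions rather than the organisers' exact code: `returns` is assumed to be a gap-free table with one row per minute and one column per asset, and the function name `residualize` is hypothetical.

```python
import numpy as np
import pandas as pd

def residualize(returns: pd.DataFrame, weights: pd.Series, window: int = 3750) -> pd.DataFrame:
    """Remove the weighted market return from each asset's return."""
    w = weights.reindex(returns.columns)
    market = returns.mul(w, axis=1).sum(axis=1) / w.sum()      # M(t)
    num = returns.mul(market, axis=0).rolling(window).mean()   # <M * R_a>
    den = (market ** 2).rolling(window).mean()                 # <M^2>
    beta = num.div(den, axis=0)                                # beta_a
    return returns - beta.mul(market, axis=0)                  # Target_a(t)

# Toy check: asset "b" is exactly twice asset "a", so both are pure market moves
# and their residualized targets should be numerically zero.
rng = np.random.default_rng(0)
base = pd.Series(rng.normal(0.0, 1e-3, 50))
toy = pd.DataFrame({"a": base, "b": 2 * base})
target = residualize(toy, pd.Series({"a": 1.0, "b": 1.0}), window=5)
print(target.tail())
```

The first `window - 1` rows are NaN because the rolling averages are not yet defined there.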

In [None]:
import os
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px

from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

import seaborn as sns
cmap = sns.color_palette()

import warnings
warnings.simplefilter('ignore')

In [None]:
#Required Data
ASSET_DATA ='../input/g-research-crypto-forecasting/asset_details.csv'
TRAINING_DATA = '../input/g-research-crypto-forecasting/train.csv'
TEST_DATA = '../input/g-research-crypto-forecasting/example_test.csv'

In [None]:
df_asset = pd.read_csv(ASSET_DATA)
df_asset.sort_values('Asset_ID')

In [None]:
df_asset.sort_values('Weight',ascending= False)

The competition covers the top cryptocurrencies, including Bitcoin, Ethereum and Binance Coin. The Weight column holds the per-asset weights used in the evaluation metric, a weighted Pearson correlation coefficient.
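
A weighted Pearson correlation can be sketched as follows. This is an illustrative version, not the official scoring code; the function name `weighted_correlation` is hypothetical, and I assume each row is weighted by the Weight of its asset.

```python
import numpy as np

def weighted_correlation(a, b, weights):
    """Pearson correlation of a and b where each observation has its own weight."""
    a, b, w = (np.asarray(v, dtype=float).ravel() for v in (a, b, weights))
    w_sum = w.sum()
    mean_a = np.sum(a * w) / w_sum
    mean_b = np.sum(b * w) / w_sum
    cov = np.sum(w * (a - mean_a) * (b - mean_b)) / w_sum
    var_a = np.sum(w * (a - mean_a) ** 2) / w_sum
    var_b = np.sum(w * (b - mean_b) ** 2) / w_sum
    return cov / np.sqrt(var_a * var_b)

# With equal weights the result matches the ordinary Pearson correlation.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0]
equal = weighted_correlation(x, y, [1, 1, 1, 1])
plain = np.corrcoef(x, y)[0, 1]
print(equal, plain)
```

Rows from heavily weighted assets pull the score more strongly than rows from lightly weighted ones.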

In [None]:
print('Sum of Weights ::-',df_asset['Weight'].sum())
print('Average of Weights ::-',df_asset['Weight'].mean())
print('Min of Weights ::-',df_asset['Weight'].min())
print('Max of Weights ::-',df_asset['Weight'].max())

In [None]:
# Our Training Data
data_train = pd.read_csv(TRAINING_DATA)
data_train.head()

In [None]:
print('Shape of our crypto data',data_train.shape)

In [None]:
data_train.info()

In [None]:
# missing values?
data_train.isna().sum()

In [None]:
btc = data_train[data_train['Asset_ID']==1].set_index('timestamp')
# %H is a 24-hour clock; %I without %p would make AM and PM indistinguishable.
beg_btc = datetime.fromtimestamp(btc.index[0]).strftime("%A, %B %d, %Y %H:%M:%S")
end_btc = datetime.fromtimestamp(btc.index[-1]).strftime("%A, %B %d, %Y %H:%M:%S")
print('Bitcoin data date range:', beg_btc,'to', end_btc)

# *Similarly I will check for other cryptos*

In [None]:
eth = data_train[data_train['Asset_ID']==6].set_index('timestamp')
beg_eth = datetime.fromtimestamp(eth.index[0]).strftime("%A, %B %d, %Y %H:%M:%S")
end_eth = datetime.fromtimestamp(eth.index[-1]).strftime("%A, %B %d, %Y %H:%M:%S")
print('Ethereum data date range:', beg_eth,'to', end_eth)

In [None]:
# Binance Coin is Asset_ID 0 in asset_details.csv (6 is Ethereum).
bic = data_train[data_train['Asset_ID']==0].set_index('timestamp')
beg_bic = datetime.fromtimestamp(bic.index[0]).strftime("%A, %B %d, %Y %H:%M:%S")
end_bic = datetime.fromtimestamp(bic.index[-1]).strftime("%A, %B %d, %Y %H:%M:%S")
print('Binance Coin data date range:', beg_bic,'to', end_bic)

In [None]:
cdo = data_train[data_train['Asset_ID']==3].set_index('timestamp')
beg_cdo = datetime.fromtimestamp(cdo.index[0]).strftime("%A, %B %d, %Y %H:%M:%S")
end_cdo = datetime.fromtimestamp(cdo.index[-1]).strftime("%A, %B %d, %Y %H:%M:%S")
print('Cardano data date range:', beg_cdo,'to', end_cdo)

In [None]:
btc = btc.reindex(range(btc.index[0],btc.index[-1]+60,60),method='pad')
eth = eth.reindex(range(eth.index[0],eth.index[-1]+60,60),method='pad')
bic = bic.reindex(range(bic.index[0],bic.index[-1]+60,60),method='pad')
cdo = cdo.reindex(range(cdo.index[0],cdo.index[-1]+60,60),method='pad')

In [None]:
f = plt.figure(figsize=(15,4))
ax = f.add_subplot(121)
ax.plot(btc['Close'], color='Red', label='BTC')
plt.legend()
plt.xlabel('Time (timestamp)')
plt.ylabel('Bitcoin')

ax2 = f.add_subplot(122)
ax2.plot(eth['Close'], color='Green', label='ETH')
plt.legend()
plt.xlabel('Time (timestamp)')
plt.ylabel('Ethereum')

plt.tight_layout()
plt.show()

In [None]:
f = plt.figure(figsize=(15,4))

ax = f.add_subplot(121)
ax.plot(bic['Close'], color='Red', label='BIC')
plt.legend()
plt.xlabel('Time (timestamp)')
plt.ylabel('Binance')

ax = f.add_subplot(122)
ax.plot(cdo['Close'], color='Green', label='CDO')
plt.legend()
plt.xlabel('Time (timestamp)')
plt.ylabel('Cardano')

plt.tight_layout()
plt.show()

In [None]:
# Thanks to https://www.kaggle.com/odins0n/g-research-plots-eda
def candelstick_chart(data,title):
    candlestick = go.Figure(data = [go.Candlestick(x =data.index, 
                                               open = data[('Open')], 
                                               high = data[('High')], 
                                               low = data[('Low')], 
                                               close = data[('Close')])])
    candlestick.update_xaxes(title_text = 'Time',
                             rangeslider_visible = True)

    candlestick.update_layout(
    title = {
        'text': '{:} Candlestick Chart'.format(title),
        'y':0.90,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'} , 
    template="plotly_white")

    candlestick.update_yaxes(title_text = 'Price in USD', ticksuffix = '$')
    return candlestick

In [None]:
btc_plot = candelstick_chart(btc[-500:],title = "Bitcoin (BTC)")
btc_plot.show()

In [None]:
ETH_plot = candelstick_chart(eth[-500:],title = "Ethereum (ETH)")
ETH_plot.show()

# Feature Engineering

In [None]:
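def hlco_ratio(df):
    # Candle range relative to its body. When Close == Open the result is inf or NaN;
    # the training code replaces inf with NaN and drops those rows.
    return (df['High']-df['Low'])/(df['Close']-df['Open'])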

In [None]:
def upper_shadow(df):
    return df['High']-np.maximum(df['Close'],df['Open'])

In [None]:
def lower_shadow(df):
    return np.minimum(df['Close'],df['Open'])-df['Low']

In [None]:
def get_features(df):
    df_feat =df[['Count','Open','High','Low','Close','Volume','VWAP']].copy()
    df_feat['Upper_Shadow'] = upper_shadow(df_feat)
    df_feat['hlco_ratio'] = hlco_ratio(df_feat)
    df_feat['Lower_Shadow'] = lower_shadow(df_feat)
    return df_feat

In [None]:
train_df = data_train


def get_Xy_and_model_for_asset(df_train, asset_id):
    df = df_train[df_train["Asset_ID"] == asset_id]
    
    df = df.sample(frac=0.2)
    df_proc = get_features(df)
    df_proc['y'] = df['Target']
    df_proc.replace([np.inf, -np.inf], np.nan, inplace=True)
    df_proc = df_proc.dropna(how="any")
    
    # Split into features X and target y
    X = df_proc.drop("y",axis = 1)
    y = df_proc['y']
    
    #LGBM MODEL
    model  = LGBMRegressor()
    model.fit(X,y)
    return X,y,model 

In [None]:
Xs ={}
ys = {}
models = {}

for asset_id,asset_name in zip(df_asset['Asset_ID'],df_asset['Asset_Name']):
    print(f'Training model for {asset_name:<16}(ID = {asset_id:<2})')
    print('---'*15)
    try:
        # Train inside the try block so a failure for one asset does not stop the loop.
        Xs[asset_id], ys[asset_id], models[asset_id] = get_Xy_and_model_for_asset(train_df,asset_id)
    except Exception:
        Xs[asset_id], ys[asset_id], models[asset_id] = None, None, None

In [None]:
params = {
    'num_leaves': range(21,161,10),
    'learning_rate':[0.1,0.01,0.05]
}
new_models = {}
for asset_id,asset_name in zip(df_asset['Asset_ID'],df_asset['Asset_Name']):
    if Xs[asset_id] is None:
        # No training data was stored for this asset, so there is nothing to tune.
        new_models[asset_id] = None
        continue
    print('GridSearchCV FOR : '+ asset_name)
    grid_search = GridSearchCV(
        # A fresh estimator is enough: GridSearchCV clones and refits it for every candidate,
        # so retraining a model here just to pass it in would be wasted work.
        estimator = LGBMRegressor(),
        param_grid = params,
        n_jobs = -1,
        cv = 5,
        verbose = True
    )

    grid_search.fit(Xs[asset_id],ys[asset_id])
    new_models[asset_id]=grid_search.best_estimator_

    print('---'*20)

In [None]:
for asset_id ,asset_name in zip(df_asset['Asset_ID'],df_asset['Asset_Name']):
    print(f'Tuned model for {asset_name:<1}(ID ={asset_id:})')
    print(new_models[asset_id])

In [None]:
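# The next cell needs env and iter_test, so the environment must be created here.
import gresearch_crypto

env = gresearch_crypto.make_env()
iter_test = env.iter_test()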

In [None]:
import traceback

# df_test avoids shadowing the TEST_DATA path constant defined earlier.
for i, (df_test, df_pred) in enumerate(iter_test):
    for j, row in df_test.iterrows():
        model = new_models.get(int(row['Asset_ID']))
        y_pred = 0  # fallback when no model is available or prediction fails
        if model is not None:
            try:
                x_test = get_features(row)
                y_pred = model.predict(pd.DataFrame([x_test]))[0]
            except Exception:
                traceback.print_exc()
        df_pred.loc[df_pred['row_id'] == row['row_id'], 'Target'] = y_pred

    env.predict(df_pred)