## Crypto Forecasting Model For Market Data 2018 - 2021

In this data science project, we analyze the data and build a Light Gradient Boosting Machine (LightGBM) model to forecast short-term returns for 14 popular cryptocurrencies, using millions of rows of high-frequency market data from 2018 to 2021.
<br>
#### Project Description

The dataset contains historical trade information for several crypto assets, such as Bitcoin and Ethereum. Our challenge is to predict their future returns.
<br>
***train.csv:***

   **timestamp:** All timestamps are Unix timestamps in seconds (the number of seconds elapsed since 1970-01-01 00:00:00.000 UTC). Timestamps in this dataset are multiples of 60, indicating minute-by-minute data.
   
   **Asset_ID:** The asset ID corresponding to one of the cryptocurrencies (e.g. Asset_ID = 1 for Bitcoin). The mapping from Asset_ID to crypto asset is contained in asset_details.csv.
   
   **Count:** Total number of trades in the time interval (last minute).
   
   **Open:** Opening price of the time interval (in USD).
   
   **High:** Highest price reached during the time interval (in USD).
   
   **Low:** Lowest price reached during the time interval (in USD).
   
   **Close:** Closing price of the time interval (in USD).
   
   **Volume:** Quantity of asset bought or sold, displayed in base currency USD.
   
   **VWAP:** The average price of the asset over the time interval, weighted by volume. VWAP is an aggregated form of trade data.
   
   **Target:** Residual log-returns for the asset over a 15-minute horizon.
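To make the `VWAP` and `Target` fields more concrete, here is a minimal sketch on made-up numbers. It assumes a plain forward log return, log(Close[t+15] / Close[t]); the residual step that removes the market-wide component is not shown, and the helper names `vwap` and `forward_log_return` are hypothetical.

```python
import numpy as np
import pandas as pd


def vwap(prices, volumes):
    """Volume-weighted average price: sum(price * volume) / sum(volume)."""
    prices = np.asarray(prices, dtype=float)
    volumes = np.asarray(volumes, dtype=float)
    return float((prices * volumes).sum() / volumes.sum())


def forward_log_return(close, horizon=15):
    """Log return from minute t to minute t + horizon (NaN where the future is unknown)."""
    close = pd.Series(close, dtype=float)
    return np.log(close.shift(-horizon) / close)


# Three hypothetical trades inside one minute
print(vwap([100.0, 101.0, 102.0], [1.0, 2.0, 1.0]))  # 101.0

# Twenty minutes of made-up closing prices that rise by 1% per minute
close = 100.0 * 1.01 ** np.arange(20)
ret = forward_log_return(close, horizon=15)
print(ret.iloc[0])       # equals 15 * log(1.01)
print(ret.isna().sum())  # the last 15 rows have no future price yet
```

Rows near the end of a series have no price 15 minutes ahead, which is one reason a forward-looking target can be missing.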
<br>
***asset_details.csv:***

This file provides the real name of the crypto asset for each Asset_ID and the weight each crypto asset receives in the metric. Weights are determined by the logarithm of each asset's market cap (in USD) at a fixed point in time. Weights were assigned to give more relevance to cryptocurrencies with higher market volumes, so that smaller cryptocurrencies do not disproportionately impact the models.
<br>
#### Technologies used:

**Language -** Python

**Algorithms -** LightGBM

### Module 1: Project Setup and Installation

This module consists of guided videos for setting up and installing the tools and libraries needed for the project. By following these videos, you can set up your development environment. The project uses several Python packages, such as NumPy, Pandas, and scikit-learn (sklearn). We will use the Visual Studio Code editor for development.

#### Task 1: Setup and Installation

In this task, you’ll learn how to get started with VS Code, Python, and Jupyter Notebook with the help of a guided video, and how to create a simple Jupyter notebook.

#### Task 2: Installing Packages Using Pip

**Packages:**

**Data preprocessing:** NumPy and Pandas

**Data visualization:** Seaborn and Matplotlib
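The packages can be installed with a command such as `pip install numpy pandas matplotlib seaborn scikit-learn lightgbm`. As a quick check afterwards, the following sketch reports which of them cannot be found in the current environment. The helper name `missing_packages` and the list `REQUIRED` are hypothetical.

```python
import importlib.util

# Import names of the packages used in this project
# (note: scikit-learn is imported as "sklearn")
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn", "lightgbm"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```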

### Module 2: Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but EDA is primarily for seeing what the data can tell us beyond formal modeling, and in that respect it contrasts with traditional hypothesis testing. We will explore the data and its statistics in detail. We will also define a helper function that converts a datetime into a Unix timestamp for use in modeling.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
warnings.filterwarnings("ignore")

#### Task 1: Knowing The Dataset

In this task, you’ll explore the dataset and learn what each field means.

In [12]:
path = "data/train/"
# Keep only the CSV files so stray files in the folder are never read
list_of_files = [f for f in os.listdir(path) if f.endswith(".csv")]
frames = []
for ctr, file_name in enumerate(list_of_files, start=1):
    frames.append(pd.read_csv(os.path.join(path, file_name)))
    print("Number of files imported : ", ctr)
# Concatenate once at the end instead of growing the DataFrame inside the loop
train = pd.concat(frames, axis=0, ignore_index=True)

Number of files imported :  1
Number of files imported :  2
Number of files imported :  3
Number of files imported :  4
Number of files imported :  5
Number of files imported :  6
Number of files imported :  7
Number of files imported :  8
Number of files imported :  9
Number of files imported :  10
Number of files imported :  11
Number of files imported :  12
Number of files imported :  13
Number of files imported :  14
Number of files imported :  15
Number of files imported :  16
Number of files imported :  17
Number of files imported :  18
Number of files imported :  19
Number of files imported :  20
Number of files imported :  21
Number of files imported :  22
Number of files imported :  23
Number of files imported :  24
Number of files imported :  25
Number of files imported :  26
Number of files imported :  27
Number of files imported :  28
Number of files imported :  29
Number of files imported :  30
Number of files imported :  31


In [14]:
asset_details = pd.read_csv("data/asset_details.csv")
asset_details.sort_values(by="Asset_ID",ignore_index=True,inplace=True)
asset_details

|  | Asset_ID | Weight | Asset_Name |
|---|---|---|---|
| 0 | 0 | 4.304065 | Binance Coin |
| 1 | 1 | 6.779922 | Bitcoin |
| 2 | 2 | 2.397895 | Bitcoin Cash |
| 3 | 3 | 4.406719 | Cardano |
| 4 | 4 | 3.555348 | Dogecoin |
| 5 | 5 | 1.386294 | EOS.IO |
| 6 | 6 | 5.894403 | Ethereum |
| 7 | 7 | 2.079442 | Ethereum Classic |
| 8 | 8 | 1.098612 | IOTA |
| 9 | 9 | 2.397895 | Litecoin |


In [15]:
train.head()

|  | timestamp | Asset_ID | Count | Open | High | Low | Close | Volume | VWAP | Target |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1530202800 | 12 | 10.0 | 0.190815 | 0.19132 | 0.19015 | 0.19074 | 32450.372 | 0.1908 | 0.001392 |
| 1 | 1530202860 | 3 | 19.0 | 0.125492 | 0.125594 | 0.12539 | 0.125431 | 57357.2205 | 0.125477 | 0.002543 |
| 2 | 1530202860 | 2 | 12.0 | 697.195 | 698.14 | 696.5 | 697.22 | 1.4614 | 697.222803 | -0.001323 |
| 3 | 1530202860 | 0 | 8.0 | 14.6499 | 14.6499 | 14.63 | 14.6395 | 515.1 | 14.637003 | -0.002171 |
| 4 | 1530202860 | 1 | 131.0 | 6106.43689 | 6114.84 | 6100.0 | 6106.00289 | 8.465268 | 6106.365069 | -0.000313 |


In [17]:
test = pd.read_csv("data/example_test.csv")
test.head()

|  | timestamp | Asset_ID | Count | Open | High | Low | Close | Volume | VWAP | group_num | row_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1623542400 | 3 | 1201.0 | 1.478556 | 1.48603 | 1.478 | 1.483681 | 654799.561103 | 1.481439 | 0 | 0 |
| 1 | 1623542400 | 2 | 1020.0 | 580.306667 | 583.89 | 579.91 | 582.276667 | 1227.988328 | 581.697038 | 0 | 1 |
| 2 | 1623542400 | 0 | 626.0 | 343.7895 | 345.108 | 343.64 | 344.598 | 1718.832569 | 344.441729 | 0 | 2 |
| 3 | 1623542400 | 1 | 2888.0 | 35554.289632 | 35652.46465 | 35502.67 | 35602.004286 | 163.811537 | 35583.469303 | 0 | 3 |
| 4 | 1623542400 | 4 | 433.0 | 0.312167 | 0.3126 | 0.31192 | 0.312208 | 585577.410442 | 0.312154 | 0 | 4 |


#### Task 2: EDA - Dataset

In this task, you’ll see how we perform EDA to understand the data well.

In [18]:
train.isna().sum()

timestamp         0
Asset_ID          0
Count             0
Open              0
High              0
Low               0
Close             0
Volume            0
VWAP              9
Target       750338
dtype: int64

In [19]:
test.isna().sum()

timestamp    0
Asset_ID     0
Count        0
Open         0
High         0
Low          0
Close        0
Volume       0
VWAP         0
group_num    0
row_id       0
dtype: int64

In [21]:
asset_details.isna().sum()

Asset_ID      0
Weight        0
Asset_Name    0
dtype: int64

In [23]:
train.describe()

|  | timestamp | Asset_ID | Count | Open | High | Low | Close | Volume | VWAP | Target |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 24236810.0 | 24236810.0 | 24236810.0 | 24236810.0 | 24236810.0 | 24236810.0 | 24236810.0 | 24236810.0 | 24236800.0 | 23486470.0 |
| mean | 1577120000.0 | 6.292544 | 286.4593 | 1432.64 | 1436.35 | 1429.568 | 1432.64 | 286853.0 | NaN | 7.121752e-06 |
| std | 33233500.0 | 4.091861 | 867.3982 | 6029.605 | 6039.482 | 6020.261 | 6029.611 | 2433935.0 | NaN | 0.005679042 |
| min | 1514765000.0 | 0.0 | 1.0 | 0.0011704 | 0.001195 | 0.0002 | 0.0011714 | -0.3662812 | -inf | -0.5093509 |
| 25% | 1549011000.0 | 3.0 | 19.0 | 0.26765 | 0.26816 | 0.2669 | 0.2676483 | 141.0725 | 0.2676368 | -0.001694354 |
| 50% | 1578372000.0 | 6.0 | 64.0 | 14.2886 | 14.3125 | 14.263 | 14.2892 | 1295.415 | 14.28769 | -4.289844e-05 |
| 75% | 1606198000.0 | 9.0 | 221.0 | 228.8743 | 229.3 | 228.42 | 228.8729 | 27297.64 | 228.8728 | 0.00160152 |
| max | 1632182000.0 | 13.0 | 165016.0 | 64805.94 | 64900.0 | 64670.53 | 64808.54 | 759755400.0 | inf | 0.9641699 |
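The missing-value counts and the summary statistics above show two data-quality issues in `train`: `VWAP` contains infinite values (its minimum is -inf and its maximum is inf), and both `VWAP` and `Target` have missing entries. One possible way to handle this before modeling is sketched below on a tiny made-up frame. Dropping the affected rows is an assumption made for illustration, not necessarily the treatment used in this project, and `clean_market_data` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def clean_market_data(df):
    """Replace infinite values with NaN, then drop rows missing VWAP or Target."""
    cleaned = df.replace([np.inf, -np.inf], np.nan)
    return cleaned.dropna(subset=["VWAP", "Target"]).reset_index(drop=True)


# Tiny made-up frame with the same kinds of problems seen in train
demo = pd.DataFrame({
    "Close": [10.0, 10.1, 10.2, 10.3],
    "VWAP": [10.0, np.inf, np.nan, 10.3],
    "Target": [0.001, 0.002, 0.003, np.nan],
})
cleaned_demo = clean_market_data(demo)
print(cleaned_demo)
```

Only the first row survives here, because every other row has an infinite or missing `VWAP` or a missing `Target`.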


In [24]:
test.describe()

|  | timestamp | Asset_ID | Count | Open | High | Low | Close | Volume | VWAP | group_num | row_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 56.0 | 56.0 | 56.0 | 56.0 | 56.0 | 56.0 | 56.0 | 56.0 | 56.0 | 56.0 | 56.0 |
| mean | 1623542000.0 | 6.5 | 739.535714 | 3017.680422 | 3021.572698 | 3010.191113 | 3015.398331 | 444025.2 | 3015.895414 | 1.5 | 27.5 |
| std | 67.68913 | 4.06761 | 806.311917 | 9148.328623 | 9159.452237 | 9125.151477 | 9141.188374 | 1027897.0 | 9142.30054 | 1.128152 | 16.309506 |
| min | 1623542000.0 | 0.0 | 34.0 | 0.068015 | 0.068055 | 0.067866 | 0.067936 | 1.187095 | 0.067958 | 0.0 | 0.0 |
| 25% | 1623542000.0 | 3.0 | 258.0 | 1.0024 | 1.0188 | 0.9835 | 1.000538 | 691.1536 | 1.000546 | 0.75 | 13.75 |
| 50% | 1623542000.0 | 6.5 | 448.5 | 108.591617 | 108.712 | 108.226 | 108.41484 | 2494.379 | 108.447083 | 1.5 | 27.5 |
| 75% | 1623543000.0 | 10.0 | 891.75 | 580.8175 | 582.975 | 579.6925 | 580.974167 | 296542.7 | 581.061773 | 2.25 | 41.25 |
| max | 1623543000.0 | 13.0 | 3531.0 | 35596.771429 | 35652.46465 | 35533.38 | 35602.004286 | 4981365.0 | 35584.861196 | 3.0 | 55.0 |


In [25]:
asset_details.describe()

|  | Asset_ID | Weight |
|---|---|---|
| count | 14.0 | 14.0 |
| mean | 6.5 | 2.919989 |
| std | 4.1833 | 1.801957 |
| min | 0.0 | 1.098612 |
| 25% | 3.25 | 1.655018 |
| 50% | 6.5 | 2.238668 |
| 75% | 9.75 | 4.116886 |
| max | 13.0 | 6.779922 |


#### Task 3: Timestamps

In this task, you’ll see how we convert datetimes to timestamps.

In [26]:
# Convert a datetime object to a Unix timestamp (seconds since 1970-01-01 UTC)
datetime_to_timestamp = lambda x: x.timestamp()
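As a usage sketch, the helper can be applied to a timezone-aware `datetime`. Passing `tzinfo=timezone.utc` is a choice made here so that the result does not depend on the machine's local timezone; the example date is arbitrary.

```python
from datetime import datetime, timezone

datetime_to_timestamp = lambda x: x.timestamp()

# A timezone-aware datetime avoids depending on the machine's local timezone
start = datetime(2018, 1, 1, tzinfo=timezone.utc)
ts = int(datetime_to_timestamp(start))
print(ts)  # 1514764800

# Round-trip back to a datetime to check the conversion
print(datetime.fromtimestamp(ts, tz=timezone.utc))  # 2018-01-01 00:00:00+00:00
```

The result is a multiple of 60, so it can be compared directly with the `timestamp` column.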

### Module 3: Plotting - Data Visualization

Data visualization is the representation of data through common graphics such as charts, plots, infographics, and even animations. In our project, we use heatmaps, correlation matrices, and other plots to get a detailed view of coin prices.

**Advantages:**

* Share information easily
* Explore opportunities interactively
* Visualize patterns and relationships

#### Task 1: Plotting BTC and ETH

In this task, you’ll see how we create plots for BTC and ETH and compare how their prices change relative to each other.

In [41]:
def retrieve_crypto_id(asset_details,asset):
    # Look up the Asset_ID for a given asset name (e.g. "Bitcoin")
    asset_id = asset_details[asset_details["Asset_Name"]==asset]["Asset_ID"].values[0]
    return asset_id

def retrieve_crypto_name(asset_details,asset_id):
    # Look up the asset name for a given Asset_ID
    asset = asset_details[asset_details["Asset_ID"]==asset_id]["Asset_Name"].values[0]
    return asset

In [48]:
btc_asset_id = retrieve_crypto_id(asset_details,"Bitcoin")
eth_asset_id = retrieve_crypto_id(asset_details,"Ethereum")

btc = train[train["Asset_ID"]==btc_asset_id]
eth = train[train["Asset_ID"]==eth_asset_id]

btc

|  | timestamp | Asset_ID | Count | Open | High | Low | Close | Volume | VWAP | Target |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1530202860 | 1 | 131.0 | 6106.436890 | 6114.840000 | 6100.00 | 6106.002890 | 8.465268 | 6106.365069 | -0.000313 |
| 15 | 1530202920 | 1 | 158.0 | 6106.448890 | 6118.200000 | 6099.99 | 6105.962000 | 20.668026 | 6107.661887 | -0.000347 |
| 26 | 1530202980 | 1 | 199.0 | 6106.128333 | 6116.980000 | 6099.99 | 6105.531667 | 29.715174 | 6105.825348 | -0.000655 |
| 37 | 1530203040 | 1 | 201.0 | 6106.210000 | 6114.130000 | 6099.99 | 6106.256000 | 23.940284 | 6106.207722 | -0.000597 |
| 48 | 1530203100 | 1 | 175.0 | 6105.706667 | 6116.580000 | 6099.99 | 6106.108333 | 13.856197 | 6106.027932 | 0.000234 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24236747 | 1555088100 | 1 | 312.0 | 5073.663498 | 5111.324485 | 5050.39 | 5074.040000 | 82.212045 | 5074.114509 | 0.000856 |
| 24236760 | 1555088160 | 1 | 331.0 | 5074.042857 | 5111.300000 | 5050.82 | 5073.958571 | 42.160399 | 5073.933994 | 0.000039 |
| 24236773 | 1555088220 | 1 | 245.0 | 5074.222857 | 5111.300000 | 5051.71 | 5074.697143 | 53.895518 | 5074.658289 | 0.000741 |
| 24236786 | 1555088280 | 1 | 308.0 | 5073.286667 | 5111.800000 | 5053.87 | 5074.163333 | 44.278444 | 5073.714956 | 0.000046 |
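One way to compare the two coins on a single chart is to rescale each closing-price series so that it starts at 1. The sketch below does this with Matplotlib on synthetic stand-ins for the `btc` and `eth` frames. The prices are made up, and `normalized_close`, `btc_demo`, and `eth_demo` are hypothetical names.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def normalized_close(df):
    """Index Close by datetime and rescale it so the first value equals 1."""
    series = df.set_index(pd.to_datetime(df["timestamp"], unit="s"))["Close"]
    return series / series.iloc[0]


# Synthetic stand-ins for the btc and eth frames (made-up prices, one row per minute)
rng = np.random.default_rng(0)
timestamps = 1530202860 + 60 * np.arange(500)
btc_demo = pd.DataFrame({"timestamp": timestamps,
                         "Close": 6000 * np.exp(np.cumsum(rng.normal(0, 1e-3, 500)))})
eth_demo = pd.DataFrame({"timestamp": timestamps,
                         "Close": 450 * np.exp(np.cumsum(rng.normal(0, 1e-3, 500)))})

# Draw both rescaled series on the same axes
fig, ax = plt.subplots(figsize=(10, 4))
normalized_close(btc_demo).plot(ax=ax, label="BTC")
normalized_close(eth_demo).plot(ax=ax, label="ETH")
ax.set_xlabel("Time (UTC)")
ax.set_ylabel("Close relative to first value")
ax.legend()
```

With the real data, `btc` and `eth` from the cell above can be passed to `normalized_close` in place of the synthetic frames.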


#### Task 2: Plotting Coin Correlation

In this task, you’ll see how we plot the correlation between the returns of all 14 coins.
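A minimal sketch of how such a correlation matrix could be computed is shown below. It assumes per-minute log returns of `Close` and uses made-up data for three assets instead of the real 14. The helper name `return_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd


def return_correlation(df):
    """Correlation matrix of per-minute log returns, one row/column per Asset_ID."""
    # Wide table: one row per timestamp, one column per asset
    close = df.pivot_table(index="timestamp", columns="Asset_ID", values="Close")
    log_returns = np.log(close).diff().dropna()
    return log_returns.corr()


# Made-up data: assets 0 and 1 share a common driver, asset 2 moves independently
rng = np.random.default_rng(42)
n = 1000
common = rng.normal(0, 1e-3, n)
steps = {
    0: common + rng.normal(0, 2e-4, n),
    1: common + rng.normal(0, 2e-4, n),
    2: rng.normal(0, 1e-3, n),
}
demo = pd.concat([
    pd.DataFrame({"timestamp": 1530202860 + 60 * np.arange(n),
                  "Asset_ID": asset_id,
                  "Close": 100 * np.exp(np.cumsum(r))})
    for asset_id, r in steps.items()
])
corr = return_correlation(demo)
print(corr.round(2))
```

The resulting matrix can be drawn as a heatmap, for example with `sns.heatmap(corr, annot=True)`.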

### Module 4: Prediction

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

* Faster training speed and higher efficiency (reported speedups over XGBoost vary with the dataset and hardware)
* Lower memory usage
* Better accuracy
* Support of parallel, distributed, and GPU learning
* Capable of handling large-scale data

The main situation where LightGBM is not advised is small datasets, because it is prone to overfitting on them. We will use this model to make predictions on a large dataset containing about four years' worth of cryptocurrency data.

#### Task 1: Feature Engineering

In this task, you’ll see how we create new features from the fields in the data.
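As an illustration of the kind of features that can be derived from the OHLC fields, here is a minimal sketch. The specific features (candle shadows, high-low spread, within-minute log return) are assumptions chosen for the example and may differ from the ones used in this project. `add_candle_features` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def add_candle_features(df):
    """Return a copy of df with a few features derived from the OHLC columns."""
    out = df.copy()
    body_top = out[["Open", "Close"]].max(axis=1)
    body_bottom = out[["Open", "Close"]].min(axis=1)
    out["upper_shadow"] = out["High"] - body_top       # wick above the candle body
    out["lower_shadow"] = body_bottom - out["Low"]     # wick below the candle body
    out["high_low_spread"] = out["High"] - out["Low"]  # price range within the minute
    out["log_close_open"] = np.log(out["Close"] / out["Open"])  # return within the minute
    return out


# First two rows of train.head() shown earlier (OHLC columns only)
demo = pd.DataFrame({
    "Open": [0.190815, 0.125492],
    "High": [0.19132, 0.125594],
    "Low": [0.19015, 0.12539],
    "Close": [0.19074, 0.125431],
})
features = add_candle_features(demo)
print(features)
```

The function returns a copy, so the input frame is left unchanged.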

#### Task 2: Building LGBM Model

In this task, you’ll see how we build the model.

#### Task 3: Hyperparameter Tuning

In this task, you’ll see how we tune the hyperparameters of the model.

#### Task 4: Evaluation And Prediction

In this task, you’ll see how we use the model to make predictions on the test data and evaluate its performance.
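The asset weights in `asset_details.csv` are described as the weight each asset receives in the metric, but the metric formula itself is not given here. The sketch below assumes a weighted Pearson correlation between targets and predictions, where each row carries the weight of its asset. The function name `weighted_correlation` and the example values are hypothetical.

```python
import numpy as np


def weighted_correlation(y_true, y_pred, weights):
    """Pearson correlation in which each row contributes according to its weight."""
    y_true, y_pred, w = (np.asarray(a, dtype=float) for a in (y_true, y_pred, weights))
    w = w / w.sum()
    mean_true = np.sum(w * y_true)
    mean_pred = np.sum(w * y_pred)
    cov = np.sum(w * (y_true - mean_true) * (y_pred - mean_pred))
    var_true = np.sum(w * (y_true - mean_true) ** 2)
    var_pred = np.sum(w * (y_pred - mean_pred) ** 2)
    return float(cov / np.sqrt(var_true * var_pred))


# Made-up targets and predictions; the weights are the Bitcoin and IOTA
# values from asset_details (two rows each)
y_true = np.array([0.002, -0.001, 0.003, -0.002])
y_pred = np.array([0.001, -0.002, 0.002, 0.001])
weights = np.array([6.779922, 6.779922, 1.098612, 1.098612])
print(weighted_correlation(y_true, y_pred, weights))
```

With equal weights, this reduces to the ordinary Pearson correlation.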