# Binance Open Data lab

Download [Binance Open Data](https://github.com/binance/binance-public-data/#klines) and analyze it.

### Step 1: Download data

Download __1-minute candles__ for `BTC/USDT` and `BTC/USDC` for 2022-06-21 with shell commands run from the notebook:

In [None]:
# create dir for data (no error if it already exists)
!mkdir -p ../data

# download data using GET request
!wget -N -P ../data https://data.binance.vision/data/spot/daily/klines/BTCUSDT/1m/BTCUSDT-1m-2022-06-21.zip
!wget -N -P ../data https://data.binance.vision/data/spot/daily/klines/BTCUSDC/1m/BTCUSDC-1m-2022-06-21.zip

# unzip
!unzip -o -d ../data ../data/BTCUSDT-1m-2022-06-21.zip 
!unzip -o -d ../data ../data/BTCUSDC-1m-2022-06-21.zip
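The same download can be sketched in pure Python for environments where `wget` and `unzip` are not available. This is a minimal sketch using only the standard library: the helper names `kline_archive_url` and `download_klines` are hypothetical, the URL layout is copied from the commands above, and the returned CSV path assumes the archive holds one CSV with the same base name.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL prefix copied from the wget commands in this step
BASE_URL = "https://data.binance.vision/data/spot/daily/klines"


def kline_archive_url(pair: str, interval: str, day: str) -> str:
    """Build the URL of one daily klines archive (hypothetical helper)."""
    return f"{BASE_URL}/{pair}/{interval}/{pair}-{interval}-{day}.zip"


def download_klines(pair: str, interval: str, day: str, data_dir: str = "../data") -> Path:
    """Download one daily archive into data_dir and extract it (hypothetical helper)."""
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    archive_path = target_dir / f"{pair}-{interval}-{day}.zip"
    urllib.request.urlretrieve(kline_archive_url(pair, interval, day), archive_path)

    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)

    # Assumption: the archive contains a CSV named like the archive itself
    return archive_path.with_suffix(".csv")


# Usage (needs network access):
# for pair in ("BTCUSDT", "BTCUSDC"):
#     download_klines(pair, "1m", "2022-06-21")
```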

### Step 2: Import data into a DataFrame

Import packages for data analysis:

In [None]:
import numpy as np
import pandas as pd

import httpx

from datetime import datetime

Load the data from the CSV file into a pandas DataFrame:

In [None]:
def get_data(pair: str) -> pd.DataFrame:
    # the file is read without a header row, so columns are numbered 0..11
    return pd.read_csv(f'../data/{pair}-1m-2022-06-21.csv', header=None)

btcusdt_df = get_data('BTCUSDT')
btcusdt_df.head()

Name the columns:

In [None]:
def set_column_names(df: pd.DataFrame) -> pd.DataFrame:
    column_names_mapping = {
        0: 'Open_time',
        1: 'Open',
        2: 'High',
        3: 'Low',
        4: 'Close',
        5: 'Volume',
        6: 'Close_time',
        7: 'Quote_asset_volume',
        8: 'Number_of_trades',
        9: 'Taker_buy_base_asset_volume',
        10: 'Taker_buy_quote_asset_volume',
        11: 'Ignore'
        }
    return df.rename(columns=column_names_mapping)

btcusdt_df = set_column_names(btcusdt_df)
btcusdt_df.head()
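Loading and naming can also be done in one call, since `pd.read_csv` accepts a `names` argument. The sketch below is only an illustration: `KLINE_COLUMNS` is a hypothetical constant holding the same twelve names as the mapping above, and the single CSV row is made up rather than taken from the real file.

```python
import io

import pandas as pd

# Same twelve names as in the mapping above (hypothetical constant)
KLINE_COLUMNS = [
    'Open_time', 'Open', 'High', 'Low', 'Close', 'Volume', 'Close_time',
    'Quote_asset_volume', 'Number_of_trades',
    'Taker_buy_base_asset_volume', 'Taker_buy_quote_asset_volume', 'Ignore',
]

# One made-up row in the 12-column layout (illustrative values, not real market data)
sample_csv = io.StringIO(
    "1655769600000,100.0,110.0,90.0,105.0,12.5,1655769659999,1300.0,42,6.0,630.0,0\n"
)

# header=None: the file has no header row; names=...: assign column names while reading
sample_df = pd.read_csv(sample_csv, header=None, names=KLINE_COLUMNS)
print(sample_df.dtypes)
```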

Convert timestamp to human-readable date and time format:

In [None]:
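# timestamps are milliseconds since the Unix epoch; unit='ms' yields UTC-based
# datetimes that do not depend on the machine's local time zone
btcusdt_df['Open_time'] = pd.to_datetime(btcusdt_df['Open_time'], unit='ms')
btcusdt_df['Close_time'] = pd.to_datetime(btcusdt_df['Close_time'], unit='ms')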

btcusdt_df.head()
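To make the conversion concrete, here is a small worked example on two hand-picked values. They are illustrative: the first is midnight UTC on 2022-06-21 expressed in milliseconds since the Unix epoch, and the second is one minute later minus one millisecond, which is how a 1-minute candle's close time would look under that assumption.

```python
import pandas as pd

open_ms = 1655769600000   # milliseconds since the Unix epoch
close_ms = 1655769659999  # 59.999 seconds later

# unit='ms' tells pandas the integers count milliseconds, not seconds or nanoseconds
open_time = pd.to_datetime(open_ms, unit='ms')
close_time = pd.to_datetime(close_ms, unit='ms')

print(open_time)  # 2022-06-21 00:00:00
print(close_time - open_time)
```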

Let's take a look at _Descriptive statistics_ (min, mean, max, standard deviation):

In [None]:
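# pandas >= 2.0 summarizes datetime columns numerically by default;
# on pandas < 2.0, pass datetime_is_numeric=True to include them
btcusdt_df.describe()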

### Step 3: Transform data

Calculate __1-hour OHLCV__ candles:

In [None]:
def calculate_ohclv(df: pd.DataFrame) -> pd.DataFrame:
    # group by the hour of Close_time without adding a helper column to the caller's DataFrame;
    # 'first' and 'last' rely on the rows being in chronological order
    hour = df['Close_time'].dt.hour.rename('hour')

    return (
        df
            .groupby(hour)
            .agg(
                {
                    'Open': 'first',
                    'High': 'max',
                    'Low': 'min',
                    'Close': 'last',
                    'Volume': 'sum',
                    'Close_time': 'max'
                }
            )
            .reset_index(drop=True)
        )

btcusdt_1h_df = calculate_ohclv(btcusdt_df)

btcusdt_1h_df
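Grouping by hour of day works here because the file covers a single day. As a hedged alternative, the same aggregation can be sketched with `DataFrame.resample`, which buckets rows by full timestamp and therefore also separates the same hour on different days. The data below is synthetic (made-up prices over two hours), and `toy_df` and `toy_1h_df` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute candles covering two hours (made-up prices, for illustration only)
minutes = pd.date_range('2022-06-21 00:00', periods=120, freq='min')
base = np.arange(120, dtype=float)
toy_df = pd.DataFrame({
    'Open_time': minutes,
    'Open': base,
    'High': base + 1.0,
    'Low': base - 1.0,
    'Close': base + 0.5,
    'Volume': 1.0,
})

# Bucket rows by the hour that contains Open_time, then aggregate each bucket
toy_1h_df = (
    toy_df
        .set_index('Open_time')
        .resample('1h')
        .agg({'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last', 'Volume': 'sum'})
)
print(toy_1h_df)
```

The result is indexed by the start of each hour, so the bucket boundaries stay visible in the output.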

Validate the result with domain-driven asserts: one candle per hour, no missing values, and no negative prices or volumes:

In [None]:
assert isinstance(btcusdt_1h_df, pd.DataFrame)
assert btcusdt_1h_df.shape == (24, 6)  # 24 hourly candles, 6 columns
assert not btcusdt_1h_df.isnull().any().any()  # no missing values
assert btcusdt_1h_df.iloc[:, 0:5].ge(0).all().all()  # prices and volume are non-negative
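Domain-driven checks can go one step further and test relationships between columns. The sketch below is an optional extension, not part of the original notebook: `check_candle_invariants` is a hypothetical helper that verifies the usual OHLC ordering (the high is the largest price of a candle, the low the smallest), demonstrated on two made-up candles.

```python
import pandas as pd


def check_candle_invariants(df: pd.DataFrame) -> None:
    """Raise AssertionError if any candle violates basic OHLC ordering (hypothetical helper)."""
    assert (df['Low'] <= df['High']).all(), 'Low above High'
    assert (df['High'] >= df[['Open', 'Close']].max(axis=1)).all(), 'High below Open/Close'
    assert (df['Low'] <= df[['Open', 'Close']].min(axis=1)).all(), 'Low above Open/Close'


# Made-up candles for illustration: the first is consistent, the second has High < Close
good = pd.DataFrame({'Open': [10.0], 'High': [12.0], 'Low': [9.0], 'Close': [11.0]})
bad = pd.DataFrame({'Open': [10.0], 'High': [10.5], 'Low': [9.0], 'Close': [11.0]})

check_candle_invariants(good)  # passes silently
try:
    check_candle_invariants(bad)
except AssertionError as error:
    print('invalid candle:', error)
```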

### Step 4: Expand the dataset with information about `BTC/USDC`

Load the `BTC/USDC` 1-minute candles and transform them into 1-hour candles:

In [None]:
btcusdc_df = get_data('BTCUSDC')  # load data from CSV
btcusdc_df = set_column_names(btcusdc_df)  # set column names
btcusdc_df['Close_time'] = pd.to_datetime(btcusdc_df['Close_time'], unit='ms')  # convert ms timestamp to date+time

btcusdc_1h_df = calculate_ohclv(btcusdc_df)  # calculate 1h OHLCV candles
btcusdc_1h_df

Combine both datasets into one:

In [None]:
btcusdt_1h_df['pair'] = 'BTC-USDT'
btcusdc_1h_df['pair'] = 'BTC-USDC'

# Join datasets
candles_1h_df = pd.concat([btcusdt_1h_df, btcusdc_1h_df])

# Validate result
assert(
    isinstance(candles_1h_df, pd.DataFrame)
    and candles_1h_df.shape == (48, 7)
    and (candles_1h_df['pair'].unique() == ['BTC-USDT', 'BTC-USDC']).all()
)

# Sort output by Close_time
candles_1h_df.sort_values('Close_time')
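One detail of `pd.concat` is worth seeing on a small scale: each input keeps its own row labels, so the combined index contains duplicates unless it is rebuilt. The sketch below uses made-up hourly closes (not real market data) and illustrative names `left`, `right`, `stacked`, and `tidy`.

```python
import pandas as pd

# Made-up hourly closes for two pairs (illustrative values only)
close_times = pd.to_datetime(['2022-06-21 00:59:59', '2022-06-21 01:59:59'])
left = pd.DataFrame({'Close': [100.0, 101.0], 'Close_time': close_times, 'pair': 'BTC-USDT'})
right = pd.DataFrame({'Close': [100.2, 100.9], 'Close_time': close_times, 'pair': 'BTC-USDC'})

stacked = pd.concat([left, right])
print(stacked.index.tolist())  # [0, 1, 0, 1]: each input keeps its own row labels

# Sort by time, then pair, and rebuild a unique 0..n-1 index
tidy = stacked.sort_values(['Close_time', 'pair'], ignore_index=True)
print(tidy)
```

With a unique index, label-based lookups such as `tidy.loc[0]` return exactly one row.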