<p float="middle">
    <img src="image/BTCwPath.jpg" />
</p>


## Data from Cryptocurrency "Exchanges"


<br>
<center> <h3>ML Lab Extra</h3> </center>
<br>
<center> <h3>Dr Richard Diamond, 2017, 2020 </h3> </center>
<br>


#### Examples of data direct from "exchanges", and Quandl

Crypto token exchanges make close to real-time data readily available, which invites modelling for the purposes of arbitrage in two ways:

* Between exchanges, on the same crypto token. This is the **algotrading approach**.

* Between crypto tokens: either between an altcoin and BTC itself, or between altcoins, particularly where they compete for the same application, e.g. ETC and LTC, or XRP. This is the **statistical or systematic arbitrage approach**.


**Quick Cointegration Test** is used here speculatively, but the concept of cointegration proves immediately useful: statistical arbitrage can be designed if two prices are cointegrated.
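
To make "unit root in residuals" concrete, here is a minimal sketch of the two Engle-Granger steps on synthetic data, using only NumPy. The series `x` and `y`, the seed and the helper `df_tstat` are illustrative assumptions, not part of the notebook. The statistic has no lag augmentation, so it is a simplification of what a library routine such as `ts.coint` computes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# x is a random walk; y tracks x up to stationary noise, so the pair is cointegrated
x = np.cumsum(rng.normal(size=n))
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=n)

# Step 1: OLS of y on a constant and x gives the candidate long-run relationship
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

def df_tstat(e):
    """Dickey-Fuller t-statistic (no constant, no lags) for a unit root in e."""
    de, lag = np.diff(e), e[:-1]
    rho = (lag @ de) / (lag @ lag)            # slope of  de_t = rho * e_{t-1} + u_t
    u = de - rho * lag
    se = np.sqrt((u @ u) / (len(de) - 1) / (lag @ lag))
    return rho / se

# Step 2: a strongly negative statistic rejects the unit root in the residuals
t_resid = df_tstat(resid)
t_walk = df_tstat(x)      # for comparison: the random walk itself
print(t_resid, t_walk)
```

The residual statistic should be compared with Engle-Granger (MacKinnon) critical values, which are stricter than the plain Dickey-Fuller ones because the residuals come from an estimated regression.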


COINBASE is not "an exchange" for cryptocurrency trading but rather a retail service with a convenient interface, and it therefore charges a higher price. The same company offers an exchange-like service, GDAX.

<br>

#### Correlation in cryptocurrency space

**Plotly** provides interactive charts (built on D3.js): visual controls make the data easy to explore. These are the kinds of plots you would embed in web pages.

A Pandas dataframe readily computes correlation (linear and rank), which can be fed into Plotly to produce attractive **heatmaps**.


Source: the code below is adapted from [here](https://blog.patricktriest.com/analyzing-cryptocurrencies-python/).

---------


In [241]:
import os
import numpy as np
import pandas as pd
import pickle
import quandl

from datetime import datetime

def get_quandl_data(quandl_id):
    '''Download and cache a Quandl data series, return as a dataframe.'''
    cache_path = '{}.pkl'.format(quandl_id).replace('/', '-')
    try:
        # Reuse the pickled copy if it was downloaded before
        with open(cache_path, 'rb') as f:
            df = pickle.load(f)
        print('Loaded {} from cache'.format(quandl_id))
    except (OSError, IOError):
        print('Downloading {} from Quandl'.format(quandl_id))
        df = quandl.get(quandl_id, returns="pandas")
        df.to_pickle(cache_path)
        print('Cached {} at {}'.format(quandl_id, cache_path))
    return df

In [242]:
import plotly.offline as py
import plotly.graph_objs as go
import plotly.figure_factory as ff

py.init_notebook_mode(connected=True)

In [244]:
# [QUANDL_API_KEY] string is generated by Quandl and can be found in your Account Settings

quandl.ApiConfig.api_key = 'QUANDL_API_KEY' # There was API key for Richard Diamond. Please use your own
#quandl.ApiConfig.api_key = os.environ['QUANDL_API_KEY']
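
The commented-out line above reads the key from an environment variable, which keeps it out of the notebook. A minimal sketch of that pattern follows; the helper name `load_api_key` and the placeholder value are assumptions for illustration only.

```python
import os

def load_api_key(var_name='QUANDL_API_KEY'):
    """Return the key stored in an environment variable, or raise a readable error."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError('Set the {} environment variable first'.format(var_name))
    return key

# Demonstration with a placeholder value; a real key would be set in the shell
os.environ['QUANDL_API_KEY'] = 'my-placeholder-key'
print(load_api_key())
```

The result could then be assigned with `quandl.ApiConfig.api_key = load_api_key()`.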

In [245]:
# BTCUSD price data per exchange, keyed by exchange name (Kraken is popular but relatively small)
exchange_data = {}

# BTCUSD price data from BTC "exchanges"
exchanges = ['KRAKEN', 'COINBASE', 'BITSTAMP', 'ITBIT']

for exchange in exchanges:
    exchange_code = 'BCHARTS/{}USD'.format(exchange)
    btc_exchange_df = get_quandl_data(exchange_code)
    exchange_data[exchange] = btc_exchange_df

Downloading BCHARTS/KRAKENUSD from Quandl
Cached BCHARTS/KRAKENUSD at BCHARTS-KRAKENUSD.pkl
Downloading BCHARTS/COINBASEUSD from Quandl
Cached BCHARTS/COINBASEUSD at BCHARTS-COINBASEUSD.pkl
Downloading BCHARTS/BITSTAMPUSD from Quandl
Cached BCHARTS/BITSTAMPUSD at BCHARTS-BITSTAMPUSD.pkl
Downloading BCHARTS/ITBITUSD from Quandl
Cached BCHARTS/ITBITUSD at BCHARTS-ITBITUSD.pkl


In [246]:
exchange_data['KRAKEN'].tail()

Date,Open,High,Low,Close,Volume (BTC),Volume (Currency),Weighted Price
2019-04-27,5154.5,5216.5,5116.7,5168.5,2868.07008,14818680.0,5166.77827
2019-04-28,5169.8,5215.4,5100.2,5156.1,2657.172979,13721170.0,5163.821921
2019-04-29,5156.2,5194.2,5061.1,5149.0,4372.351913,22505630.0,5147.260593
2019-04-30,5152.8,5297.0,5131.1,5272.2,4856.839579,25379910.0,5225.602652
2019-05-01,5272.2,5359.3,5270.0,5300.3,3296.556899,17502780.0,5309.411884


In [247]:
# Chart the BTC pricing data using plotly

btc_trace = go.Scatter(x=exchange_data['KRAKEN'].index, y=exchange_data['KRAKEN']['Weighted Price'])
py.iplot([btc_trace])

In [248]:
# Merge 'Weighted price' column from each dataframe ("exchange") into a combined dataframe

def merge_dfs_on_column(dataframes, labels, col):
    '''Take column `col` from each dataframe and combine them under the given labels.'''
    series_dict = {label: df[col] for label, df in zip(labels, dataframes)}
    return pd.DataFrame(series_dict)

In [249]:
btc_usd_datasets = merge_dfs_on_column(list(exchange_data.values()), list(exchange_data.keys()), 'Weighted Price')
btc_usd_datasets.tail()

Date,KRAKEN,COINBASE,BITSTAMP,ITBIT
2019-04-27,5166.77827,,5161.635332,
2019-04-28,5163.821921,,5164.048289,
2019-04-29,5147.260593,,5148.786912,
2019-04-30,5225.602652,,5211.322348,
2019-05-01,5309.411884,,5303.888295,
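
The empty cells in the merged table come from index alignment: building a dataframe from several series takes the union of their dates and leaves NaN wherever a series has no row. The sketch below shows this on toy data, together with the zero-to-NaN replacement that the notebook applies before plotting. The series `other` and all numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Two toy "exchanges": the second one has no row for 2019-04-29
kraken = pd.Series([5166.8, 5163.8, 5147.3],
                   index=pd.to_datetime(['2019-04-27', '2019-04-28', '2019-04-29']))
other = pd.Series([5160.0, 0.0],
                  index=pd.to_datetime(['2019-04-27', '2019-04-28']))

merged = pd.DataFrame({'KRAKEN': kraken, 'OTHER': other})  # rows = union of both date indexes
merged = merged.replace(0, np.nan)   # treat a zero price as "no data"
print(merged)
print(merged.dropna())               # only dates quoted on both exchanges survive
```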


In [250]:
def df_scatter(df, title, seperate_y_axis=False, y_axis_label='', scale='linear', initial_hide=False):
    '''Generate a scatter plot of the entire dataframe'''
    label_arr = list(df)
    series_arr = list(map(lambda col: df[col], label_arr))
    
    layout = go.Layout(
        title=title,
        legend=dict(orientation="h"),
        xaxis=dict(type='date'),
        yaxis=dict(
            title=y_axis_label,
            showticklabels= not seperate_y_axis,
            type=scale
        )
    )
    
    y_axis_config = dict(
        overlaying='y',
        showticklabels=False,
        type=scale )
    
    visibility = 'visible'
    if initial_hide:
        visibility = 'legendonly'
        
    # Form Trace For Each Series
    trace_arr = []
    for index, series in enumerate(series_arr):
        trace = go.Scatter(
            x=series.index, 
            y=series, 
            name=label_arr[index],
            visible=visibility
        )
        
        # Add a separate axis for the series
        if seperate_y_axis:
            trace['yaxis'] = 'y{}'.format(index + 1)
            layout['yaxis{}'.format(index + 1)] = y_axis_config    
        trace_arr.append(trace)

    fig = go.Figure(data=trace_arr, layout=layout)
    py.iplot(fig)

In [251]:

btc_usd_datasets.replace(0, np.nan, inplace=True)

df_scatter(btc_usd_datasets[btc_usd_datasets.index.year == 2017], 'Bitcoin Price (USD) by Exchange')

In [252]:
# Engle-Granger procedure pairwise cointegration -- statsmodels implementation is raw
# http://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.coint.html

# VECM model exists in beta-testing -- download the dev version of statsmodels using git command

import statsmodels.tsa.stattools as ts 


In [253]:
# ts.coint cannot handle NaN values, and btc_usd_datasets['KRAKEN'] contains some, so drop incomplete rows first
complete_rows = btc_usd_datasets.dropna()

coint_result1 = ts.coint(complete_rows['BITSTAMP'], complete_rows['KRAKEN'])

# Returns t-statistic, p-value, set of critical values at 1%, 5% and 10%
print(coint_result1)

(-5.213371513890104, 6.546885129257095e-05, array([-3.90609821, -3.34150916, -3.04818227]))


## Between Exchanges

### Quick Cointegration Test (unit root in residuals)

#### BTC price from BITSTAMP vs.  KRAKEN online exchanges

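With a t-statistic of $-5.2134$, below the 1% critical value of $-3.9061$ (p-value $\approx 6.5 \times 10^{-5}$), we reject $H_0$ of a unit root in the residuals.

The array at the end of the output holds the critical values for the test statistic at the 1%, 5% and 10% levels.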

Those appear to correspond to MacKinnon, 2010 updated CV tables for Dickey-Fuller Distribution.
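
Reading the output tuple can be automated. The small helper below is a sketch: the name `interpret_coint` is an assumption, and the numbers are copied from the BITSTAMP vs. KRAKEN output above. It lists the significance levels at which the statistic falls below the critical value.

```python
def interpret_coint(result, levels=('1%', '5%', '10%')):
    """List the significance levels at which the null of no cointegration is rejected."""
    t_stat, _p_value, crit_values = result
    return [lvl for lvl, cv in zip(levels, crit_values) if t_stat < cv]

# Values copied from the BITSTAMP vs. KRAKEN output above
bitstamp_kraken = (-5.213371513890104, 6.546885129257095e-05,
                   [-3.90609821, -3.34150916, -3.04818227])
print(interpret_coint(bitstamp_kraken))   # ['1%', '5%', '10%']
```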


Preserved for the record: an earlier version of _ts.coint_ returned the output below, which was mistaken. This was fixed in subsequent updates to the _statsmodels_ library, and it shows that Python is a developing ecosystem.

```
(0, 0.98590025802596426, array([-3.90124569, -3.33880883, -3.04630907]))
```

**TAKE AWAY** We have a solid dataset of time series for the price of Bitcoin. Cointegration testing shows that the prices on different exchanges follow each other tightly. Engle-Granger and other functionality in _statsmodels.tsa.stattools_ would recognise these price series as "perfectly collinear".

------

# Between Cryptocurrencies (Altcoins)

For retrieving data on a wider range of cryptocurrencies (tokens) it is convenient to use the [Poloniex API](https://poloniex.com/support/api/). We define two helper functions to download and cache JSON data from this API.

Note that altcoin prices are quoted as exchange rates to Bitcoin. Since we already have the Bitcoin/USD historical price series, we can directly calculate the USD price series for each altcoin.
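
The conversion is a single multiplication, and pandas aligns the two series on their date index. A minimal sketch with numbers rounded from the tables in this notebook follows; the variable names are illustrative assumptions.

```python
import pandas as pd

dates = pd.to_datetime(['2019-04-28', '2019-04-29', '2019-04-30'])

# Illustrative numbers rounded from the tables in this notebook
eth_btc = pd.Series([0.029919, 0.029624, 0.030088], index=dates)   # ETH priced in BTC
btc_usd = pd.Series([5163.82, 5147.26], index=dates[:2])           # BTC priced in USD

# Multiplication aligns on the date index; dates missing from either side give NaN
eth_usd = eth_btc * btc_usd
print(eth_usd)
```

Any date without a BTC/USD quote therefore yields a missing USD price rather than an error, which is worth checking before running statistical tests.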


In [254]:
def get_json_data(json_url, cache_path):
    '''Download and cache JSON data, return as a dataframe.'''
    try:
        # Reuse the pickled copy if it was downloaded before
        with open(cache_path, 'rb') as f:
            df = pickle.load(f)
        print('Loaded {} from cache'.format(json_url))
    except (OSError, IOError):
        print('Downloading {}'.format(json_url))
        df = pd.read_json(json_url)
        df.to_pickle(cache_path)
        print('Cached {} at {}'.format(json_url, cache_path))
    return df


base_polo_url = 'https://poloniex.com/public?command=returnChartData&currencyPair={}&start={}&end={}&period={}'
start_date = datetime.strptime('2017-01-01', '%Y-%m-%d') # get data from the start of 2017
end_date = datetime.now() # up until today
period = 86400 # pull daily data (86,400 seconds per day)

def get_crypto_data(poloniex_pair):
    '''Retrieve cryptocurrency data from the Poloniex "exchange".'''
    json_url = base_polo_url.format(poloniex_pair, start_date.timestamp(), end_date.timestamp(), period)
    data_df = get_json_data(json_url, poloniex_pair)
    data_df = data_df.set_index('date')
    return data_df
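
The `start` and `end` parameters are Unix timestamps. A naive `datetime` is interpreted in the local time zone when `.timestamp()` is called, so the same cell can request slightly different windows on different machines. The sketch below builds the request URL with timezone-aware dates instead. The pair `BTC_ETH` and the end date are illustrative assumptions, and the URL template is the one from the cell above, which may no longer be served.

```python
from datetime import datetime, timezone

# Same URL template as in the notebook cell above
url_template = ('https://poloniex.com/public?command=returnChartData'
                '&currencyPair={}&start={}&end={}&period={}')

# Timezone-aware datetimes give the same Unix timestamp on every machine
start = datetime(2017, 1, 1, tzinfo=timezone.utc)
end = datetime(2017, 12, 31, tzinfo=timezone.utc)

url = url_template.format('BTC_ETH', int(start.timestamp()), int(end.timestamp()), 86400)
print(url)
```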

In [None]:
altcoins = ['ETH','LTC','XRP','ETC','STR','DASH','SC','XMR','XEM']

altcoin_data = {}
for altcoin in altcoins:
    coinpair = 'BTC_{}'.format(altcoin)
    crypto_price_df = get_crypto_data(coinpair)
    altcoin_data[altcoin] = crypto_price_df

In [256]:
altcoin_data['ETH'].tail()

date,close,high,low,open,quoteVolume,volume,weightedAverage
2019-04-28,0.029712,0.03014,0.029706,0.03008,9493.834955,284.047059,0.029919
2019-04-29,0.029615,0.029885,0.0293,0.029757,4902.241815,145.222405,0.029624
2019-04-30,0.03041,0.030466,0.029569,0.02961,12090.638312,363.780633,0.030088
2019-05-01,0.029738,0.03068,0.0295,0.03041,19300.239739,575.484849,0.029817
2019-05-02,0.029235,0.029947,0.029118,0.029738,6165.757311,181.681077,0.029466


In [257]:
# Convert into USD by multiplying by BTCUSD -- using Kraken BTC prices

for altcoin in altcoin_data.keys():
    altcoin_data[altcoin]['price_usd'] =  altcoin_data[altcoin]['weightedAverage'] * btc_usd_datasets['KRAKEN']

# Reuse the merge_dfs_on_column function defined earlier to create a combined dataframe
combined_df = merge_dfs_on_column(list(altcoin_data.values()), list(altcoin_data.keys()), 'price_usd')

combined_df['BTC'] = btc_usd_datasets['KRAKEN']

In [258]:
df_scatter(combined_df, 'Cryptocurrency Prices (USD)', seperate_y_axis=False, y_axis_label='Token Value (USD)', scale='log')

In [266]:
coint_result2 =ts.coint(combined_df['ETH'].dropna(), combined_df['BTC'].dropna())
print(coint_result2)
# Returns t-statistic, p-value, set of critical values at 1%, 5% and 10%

(-4.584083291418297, 0.0008879382863217248, array([-3.90938628, -3.34333629, -3.0494493 ]))


In [267]:
coint_result3 =ts.coint(combined_df['LTC'].dropna(), combined_df['BTC'].dropna())
print(coint_result3)
# Returns t-statistic, p-value, set of critical values at 1%, 5% and 10%

(-2.4510827201774092, 0.30105826458839696, array([-3.90938628, -3.34333629, -3.0494493 ]))


### Quick Cointegration Test (unit root on residuals)

#### Among the various pairs: ETH and BTC, BTC and LTC (Litecoin)

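**TAKE AWAY** A quick and rough analysis of the revised computation (above) indicates cointegration for ETH and BTC (t-statistic $-4.58$, p-value $\approx 0.0009$). For LTC (Litecoin) and BTC the null of no cointegration is not rejected even at the 10% level (t-statistic $-2.45$ against a critical value of $-3.05$, p-value $\approx 0.30$), so that pair is at best near-cointegrated. Daily prices are noisy, which limits what such a quick test can show.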
<br><br>

_ts.coint_ should be checked to confirm that it is properly validated; its documentation is confusing about what rejecting the null hypothesis means. See also _ts.adfuller_.

```
#PAST coint_result2 from 2017-Dec, 
(-1.4429846731488123, 0.7824000257854892, array([-3.92846632, -3.35389776, -3.05676619]))
```

```
#PAST coint_result3
(-1.9440496626675863, 0.55751073489984992, array([-3.92846632, -3.35389776, -3.05676619]))
```

**SINCE 2017-Dec** the _ts.coint()_ implementation in _statsmodels_ has changed (the MacKinnon critical values were wrong!). As part of the changes, _ts.coint()_ no longer tolerates empty rows and NaN values, so check the dataframe first, in particular that _.dropna()_ drops the same rows for ETH and BTC.
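
One way to guarantee that the same rows are dropped is to call _.dropna()_ on the two-column frame rather than on each column. The sketch below uses a toy dataframe `prices` as a stand-in for `combined_df`; all values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2017-01-01', periods=5, freq='D')
prices = pd.DataFrame({'ETH': [8.0, np.nan, 10.0, 10.5, 11.0],
                       'BTC': [1000.0, 1010.0, 1020.0, np.nan, 1030.0]}, index=idx)

# Dropping NaNs column by column keeps different dates in each series
eth_only = prices['ETH'].dropna()
btc_only = prices['BTC'].dropna()
print(eth_only.index.equals(btc_only.index))   # False

# Dropping NaNs on the two-column frame keeps only dates present in both
pair = prices[['ETH', 'BTC']].dropna()
print(len(pair))                               # 3
```

The aligned columns `pair['ETH']` and `pair['BTC']` can then be passed to the cointegration test.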

For the Johansen Procedure in Python please see [an attempt](https://searchcode.com/codesearch/view/88477497/).

**NOTE** The price plot above uses a _logarithmic scale_.

------


### Correlation and Heat Map


In [261]:
# Compute Pearson (linear) correlation of daily returns for altcoins in 2017
combined_df_2017 = combined_df[combined_df.index.year == 2017]
combined_df_2017.pct_change().corr(method='pearson')

# Supported `method` parameters:
# pearson  - linear correlation
# spearman - Spearman rho, a rank correlation: the linear correlation formula applied to ranks (pseudo-samples U)
# kendall  - Kendall tau, a rank correlation based on the signs of pairwise differences
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html

,ETH,LTC,XRP,ETC,STR,DASH,SC,XMR,XEM,BTC
ETH,1.0,0.440472,0.212775,0.599813,0.2549,0.506254,0.376888,0.554616,0.399358,0.409185
LTC,0.440472,1.0,0.326748,0.485281,0.305358,0.345011,0.343938,0.44178,0.383568,0.42965
XRP,0.212775,0.326748,1.0,0.116352,0.508776,0.094335,0.246611,0.228918,0.271698,0.139592
ETC,0.599813,0.485281,0.116352,1.0,0.20669,0.38768,0.301888,0.447963,0.322429,0.417786
STR,0.2549,0.305358,0.508776,0.20669,1.0,0.180176,0.402584,0.323847,0.336503,0.22601
DASH,0.506254,0.345011,0.094335,0.38768,0.180176,1.0,0.296268,0.500809,0.328874,0.314252
SC,0.376888,0.343938,0.246611,0.301888,0.402584,0.296268,1.0,0.382498,0.332861,0.333731
XMR,0.554616,0.44178,0.228918,0.447963,0.323847,0.500809,0.382498,1.0,0.339145,0.414735
XEM,0.399358,0.383568,0.271698,0.322429,0.336503,0.328874,0.332861,0.339145,1.0,0.336178
BTC,0.409185,0.42965,0.139592,0.417786,0.22601,0.314252,0.333731,0.414735,0.336178,1.0
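
The correlations above are computed on `pct_change()` returns rather than on price levels. The sketch below illustrates why, using two simulated series that are independent except for a common upward drift. The drift, volatility and seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Two independent log-random-walks that share nothing but an upward drift
log_a = np.cumsum(0.01 + 0.01 * rng.normal(size=n))
log_b = np.cumsum(0.01 + 0.01 * rng.normal(size=n))
prices = pd.DataFrame({'A': np.exp(log_a), 'B': np.exp(log_b)})

level_corr = prices.corr().loc['A', 'B']                  # inflated by the common trend
return_corr = prices.pct_change().corr().loc['A', 'B']    # close to zero
print(level_corr, return_corr)
```

Trending price levels look highly correlated even when the day-to-day moves are unrelated, so returns are the safer input for a correlation matrix.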


In [262]:
combined_df_2017.pct_change().corr(method='kendall')


,ETH,LTC,XRP,ETC,STR,DASH,SC,XMR,XEM,BTC
ETH,1.0,0.378198,0.282263,0.526352,0.303666,0.410014,0.306391,0.442194,0.328187,0.270336
LTC,0.378198,1.0,0.318469,0.424697,0.349438,0.311022,0.325341,0.400842,0.27854,0.346744
XRP,0.282263,0.318469,1.0,0.224442,0.43835,0.172222,0.293464,0.273242,0.281718,0.156964
ETC,0.526352,0.424697,0.224442,1.0,0.29519,0.345715,0.30851,0.411255,0.308752,0.278933
STR,0.303666,0.349438,0.43835,0.29519,1.0,0.231223,0.402779,0.341023,0.323162,0.215058
DASH,0.410014,0.311022,0.172222,0.345715,0.231223,1.0,0.254473,0.422244,0.279599,0.196319
SC,0.306391,0.325341,0.293464,0.30851,0.402779,0.254473,1.0,0.34514,0.339388,0.234008
XMR,0.442194,0.400842,0.273242,0.411255,0.341023,0.422244,0.34514,1.0,0.317228,0.293404
XEM,0.328187,0.27854,0.281718,0.308752,0.323162,0.279599,0.339388,0.317228,1.0,0.256531
BTC,0.270336,0.346744,0.156964,0.278933,0.215058,0.196319,0.234008,0.293404,0.256531,1.0


In [263]:
def correlation_heatmap(df, title, corr_model, absolute_bounds=True):
    '''Plot a correlation heatmap for the entire dataframe'''
    heatmap = go.Heatmap(
        z=df.corr(method=corr_model).values,  # .values replaces the deprecated .as_matrix()
        x=df.columns,
        y=df.columns,
        colorbar=dict(title='{} coefficient'.format(corr_model.capitalize())),
    )
    
    layout = go.Layout(title=title)
    
    if absolute_bounds:
        heatmap['zmax'] = 1.0
        heatmap['zmin'] = -1.0
        
    fig = go.Figure(data=[heatmap], layout=layout)
    py.iplot(fig)

In [269]:
correlation_heatmap(combined_df_2017.pct_change(), "2017 Correlations (Pearson)", 'pearson')





In [272]:
correlation_heatmap(combined_df_2017.pct_change(), "2017 Correlations (Kendall tau)", 'kendall')





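The rank correlation measure (Kendall tau) gives lower values than Pearson for most pairs, though not all: for ETH and BTC it is 0.27 against 0.41. A rank measure is less sensitive to extreme moves, which is valuable for data as noisy as daily market movements, especially in cryptocurrencies in 2017.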
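
To see how a single extreme day can dominate the linear measure, the sketch below computes both coefficients by hand on ten toy returns. The data are constructed to exaggerate the effect and are not taken from the notebook. `kendall_tau` is the tau-a variant, which coincides with the tau-b used by pandas when there are no ties.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Linear correlation: covariance scaled by both standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / all pairs; assumes no ties."""
    pairs = list(combinations(zip(x, y), 2))
    score = sum(1 if (a1 - a2) * (b1 - b2) > 0 else -1
                for (a1, b1), (a2, b2) in pairs)
    return score / len(pairs)

# Nine days on which the two tokens move in opposite directions, then one joint crash
a = [0.010, -0.020, 0.030, -0.015, 0.025, -0.030, 0.005, -0.025, 0.020, -0.500]
b = [-0.012, 0.018, -0.028, 0.011, -0.022, 0.031, -0.004, 0.027, -0.019, -0.600]

print(pearson(a, b))       # strongly positive, driven by the crash day
print(kendall_tau(a, b))   # negative, reflecting the nine ordinary days
```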

**END OF DEMONSTRATION**