## Context
Bitcoin is the longest-running and best-known cryptocurrency, first released as open source in 2009 by the anonymous Satoshi Nakamoto. Bitcoin serves as a decentralized medium of digital exchange, with transactions verified and recorded in a public distributed ledger (the blockchain) without the need for a trusted record-keeping authority or central intermediary. Each transaction block contains a SHA-256 cryptographic hash of the previous block, so the blocks are "chained" together and serve as an immutable record of all transactions that have ever occurred. As with any currency or commodity on the market, bitcoin trading and financial instruments soon followed public adoption of bitcoin and continue to grow. Included here is historical bitcoin market data at 1-min intervals for select bitcoin exchanges where trading takes place. Happy (data) mining!

## Content
* `coinbaseUSD_1-min_data_2014-12-01_to_2019-01-09.csv`

* `bitstampUSD_1-min_data_2012-01-01_to_2020-04-22.csv`

CSV files for select bitcoin exchanges covering January 2012 to April 2020, with minute-by-minute OHLC (Open, High, Low, Close) values, volume in BTC and in the indicated currency, and the weighted bitcoin price. Timestamps are in Unix time. Timestamps without any trades or activity have their data fields filled with NaNs. If a timestamp is missing, or if there are jumps, this may be because the exchange (or its API) was down, the exchange (or its API) did not exist, or some other unforeseen technical error occurred in data reporting or gathering. Every effort has been made to deduplicate entries and verify that the contents are correct and complete to the best of my ability, but trust at your own risk.
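
A minimal sketch of how the NaN rows and timestamp jumps described above could be inspected. The small frame is made-up data that only mimics the file layout (Unix-second `Timestamp`, NaN fields for minutes without trades, one minute missing entirely); the variable names are illustrative and nothing here is taken from the dataset itself.

```python
import numpy as np
import pandas as pd

# Made-up sample: one-minute slots, with the minute after 1325318040 missing
sample = pd.DataFrame({
    "Timestamp": [1325317920, 1325317980, 1325318040, 1325318160, 1325318220],
    "Close": [4.39, np.nan, np.nan, 4.50, 4.58],
})

# Rows that exist but carry no trades (price field is NaN)
no_trade_rows = int(sample["Close"].isna().sum())

# Absent timestamps: consecutive rows that are more than 60 seconds apart
step = sample["Timestamp"].diff()
jumps = sample.loc[step > 60, "Timestamp"]
missing_minutes = int((step[step > 60] // 60 - 1).sum())

print("rows without trades:", no_trade_rows)
print("jumps end at timestamps:", jumps.tolist())
print("missing one-minute slots:", missing_minutes)
```

The same two checks can be run on the full frame after loading it, since they only rely on the `Timestamp` column being in whole seconds.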

## Acknowledgements and Inspiration
Bitcoin charts for the data. The various exchange APIs, for making it difficult or unintuitive enough to get OHLC and volume data at 1-min intervals that I set out on this data scraping project. Satoshi Nakamoto and the novel core concept of the blockchain, as well as its first execution via the bitcoin protocol. I'd also like to thank viewers like you! Can't wait to see what code or insights you all have to share.

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
btcdf = pd.read_csv('bitcoin-historical-data/bitstampUSD_1-min_data_2012-01-01_to_2020-04-22.csv')

In [3]:
btcdf.head()

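| | Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 1325317920 | 4.39 | 4.39 | 4.39 | 4.39 | 0.455581 | 2.0 | 4.39 |
| 1 | 1325317980 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1325318040 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1325318100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1325318160 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |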


In [4]:
btcdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4363457 entries, 0 to 4363456
Data columns (total 8 columns):
Timestamp            int64
Open                 float64
High                 float64
Low                  float64
Close                float64
Volume_(BTC)         float64
Volume_(Currency)    float64
Weighted_Price       float64
dtypes: float64(7), int64(1)
memory usage: 266.3 MB


In [5]:
btcdf.describe()

| | Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
|---|---|---|---|---|---|---|---|---|
| count | 4363457.0 | 3126480.0 | 3126480.0 | 3126480.0 | 3126480.0 | 3126480.0 | 3126480.0 | 3126480.0 |
| mean | 1456469000.0 | 3674.656 | 3677.366 | 3671.73 | 3674.595 | 9.85504 | 28844.59 | 3674.57 |
| std | 75732960.0 | 3935.578 | 3939.077 | 3931.713 | 3935.49 | 32.29272 | 101027.7 | 3935.458 |
| min | 1325318000.0 | 3.8 | 3.8 | 1.5 | 1.5 | 0.0 | 0.0 | 3.8 |
| 25% | 1390770000.0 | 410.0 | 410.24 | 409.83 | 410.0 | 0.398812 | 350.3759 | 409.9998 |
| 50% | 1456610000.0 | 1175.14 | 1175.77 | 1174.825 | 1175.14 | 1.99 | 2620.491 | 1175.2 |
| 75% | 1522062000.0 | 6931.175 | 6935.78 | 6926.79 | 6931.225 | 7.639098 | 17600.57 | 6931.18 |
| max | 1587514000.0 | 19665.76 | 19666.0 | 19649.96 | 19665.75 | 5853.852 | 7569437.0 | 19663.3 |


In [6]:
btcdf.isnull().any()

Timestamp            False
Open                  True
High                  True
Low                   True
Close                 True
Volume_(BTC)          True
Volume_(Currency)     True
Weighted_Price        True
dtype: bool

Currently the data has one timestamp for every minute, so let's resample it to coarser frequencies, starting with one data point per month.

In [7]:
# Convert Unix-epoch seconds to datetimes and use them as the index
btcdf.Timestamp = pd.to_datetime(btcdf.Timestamp, unit = 's')
print(btcdf.Timestamp)

btcdf.index = btcdf.Timestamp

# Keep the 1-minute data; every resample below starts from this copy
btcdf_copy = btcdf.copy()

# Monthly frequency
btcdf = btcdf_copy.resample('M').mean()

# Daily frequency (from the minute data, not from the monthly means)
btcdfday = btcdf_copy.resample('D').mean()

# Annual frequency
btcyear = btcdf_copy.resample('A-DEC').mean()

# Quarterly frequency
btcQut = btcdf_copy.resample('Q-DEC').mean()

0         2011-12-31 07:52:00
1         2011-12-31 07:53:00
2         2011-12-31 07:54:00
3         2011-12-31 07:55:00
4         2011-12-31 07:56:00
                  ...        
4363452   2020-04-21 23:56:00
4363453   2020-04-21 23:57:00
4363454   2020-04-21 23:58:00
4363455   2020-04-21 23:59:00
4363456   2020-04-22 00:00:00
Name: Timestamp, Length: 4363457, dtype: datetime64[ns]
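
To see on a small scale what the resampling step does, here is a sketch on made-up minute-level prices that straddle a month boundary. It uses the month-start alias `MS` rather than the notebook's `'M'`, purely as an assumption so that the snippet runs without alias warnings on recent pandas releases; the only difference is that each group is labelled by the first day of the month instead of the last.

```python
import numpy as np
import pandas as pd

# Made-up prices: two minutes at the end of January, two at the start of February
idx = pd.date_range("2020-01-31 23:58", periods=4, freq="min")
toy = pd.DataFrame({"Weighted_Price": [10.0, np.nan, 20.0, 30.0]}, index=idx)

# One row per calendar month; mean() skips the NaN (no-trade) minute
monthly = toy.resample("MS").mean()
print(monthly)
```

Because NaN rows are skipped rather than counted as zero, minutes without trades do not drag the monthly average down.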



In [None]:
plt.figure(figsize = (15,10))

# One panel per sampling frequency
plt.subplot(221)
sns.lineplot(data = btcdfday.Weighted_Price, label='By Day', color = 'green')

plt.subplot(222)
sns.lineplot(data = btcdf.Weighted_Price, label='By Month', color = 'green')

plt.subplot(223)
sns.lineplot(data = btcQut.Weighted_Price, label='By Quarter', color = 'green')

plt.subplot(224)
sns.lineplot(data = btcyear.Weighted_Price, label='By Year', color = 'green')

In [None]:
btcdf.head()

In [None]:
btcdf.describe()

In [None]:
btcdf.isnull().any()

In [None]:
plt.figure(figsize = (15,5))
sns.lineplot(x = btcdf.index, y = btcdf.Weighted_Price)


Next, time series analysis is applied to extract some meaning from the monthly series.

<b>Here I try to fit an ARIMA model to the time series. ARIMA needs a stationary series, so stationarity is checked first using the techniques below:</b>
<ol>
    <li>Seasonal-trend decomposition</li>
    <li>Dickey-Fuller test</li>
</ol>

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
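# Legacy ARIMA class, used below via fit(disp=...); newer statsmodels releases replace it with statsmodels.tsa.arima.model.ARIMA
from statsmodels.tsa.arima_model import ARIMA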
from scipy.stats import boxcox

#### Tests on the original price series

In [None]:
# Helper functions to avoid repeating the test code.

def sea_decompose(Seriesdata):
    # Plot the observed, trend, seasonal and residual components
    sea_data = seasonal_decompose(Seriesdata)
    fig = sea_data.plot()
    fig.set_size_inches(15, 7)
    plt.show()

def Dicky_Fuller_Test(Seriesdata):
    # adfuller returns a tuple; element 1 is the p-value
    testDF = adfuller(Seriesdata)
    print("Dickey-Fuller test p-value : %.16f" % testDF[1])

def dataPlot(Seriesdata):
    # Observed series with its 12-period rolling mean and standard deviation
    plt.figure(figsize = (10,6))
    sns.lineplot(data = Seriesdata, color = 'red', label = 'Observed line plot')
    sns.lineplot(data = Seriesdata.rolling(window = 12).mean(), color = 'green', label = 'Rolling mean, window = 12')
    sns.lineplot(data = Seriesdata.rolling(window = 12).std(), color = 'blue', label = 'Rolling standard deviation, window = 12')
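
The rolling mean and standard deviation drawn by `dataPlot` act as an informal, visual stationarity check. A small numeric sketch of the same idea, on a made-up trending series rather than the bitcoin prices:

```python
import numpy as np
import pandas as pd

# Made-up monthly series with a steady upward trend
idx = pd.date_range("2015-01-01", periods=36, freq="MS")
trend = pd.Series(np.arange(36, dtype=float), index=idx)

roll_mean = trend.rolling(window=12).mean()
roll_std = trend.rolling(window=12).std()

# The first 11 entries are NaN: a full 12-point window is not available yet
print("leading NaNs:", int(roll_mean.isna().sum()))

# The rolling mean keeps climbing, which is what a non-stationary series looks like
print(roll_mean.dropna().iloc[[0, -1]])
```

For a stationary series both rolling curves would stay roughly flat over time.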

In [None]:
price = btcdf.Weighted_Price
print("Dickey-Fuller Test")
Dicky_Fuller_Test(price)
print("\nseasonal decompose")
sea_decompose(price)

dataPlot(price)

Judging by these plots and a Dickey-Fuller p-value above 0.05, the series is not stationary, so an ARIMA model can't be applied to it directly.

By implementing a few transformations, we can make our series suitable for ARIMA modelling.

### Transformations

#### Log Transformation

In [None]:
pricelog = np.log(price)

print("Dickey-Fuller Test")
Dicky_Fuller_Test(pricelog)
print("\nseasonal decompose")
sea_decompose(pricelog)

dataPlot(pricelog)

The log-transformed series is still not stationary.

#### Box Cox power transform

In [None]:
price_boxcox, lambdaData = boxcox(price)

# seasonal_decompose requires a pandas object with a timestamp index.
prices_box_cox = pd.Series(data = price_boxcox, index = btcdf.index)

print("Dickey-Fuller Test")
Dicky_Fuller_Test(prices_box_cox)

print('lambda value:', lambdaData)

print("\nseasonal decompose")
sea_decompose(prices_box_cox)

dataPlot(prices_box_cox)


The Box-Cox-transformed series is still not stationary.
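
One practical point about the Box-Cox transform: anything modelled on the transformed scale has to be mapped back to prices with the same lambda. The sketch below shows that round trip using `scipy.special.inv_boxcox`; the input values are made up, and lambda = 0 would correspond to the plain log transform used earlier.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Made-up positive values (Box-Cox requires strictly positive data)
prices = np.array([4.4, 13.5, 120.0, 650.0, 9000.0])

# With no lambda supplied, boxcox estimates it by maximum likelihood
transformed, lam = boxcox(prices)

# Undo the transform with the same lambda
restored = inv_boxcox(transformed, lam)

print("lambda:", lam)
print("round trip ok:", np.allclose(restored, prices))
```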

#### Regular differencing
applied to the log-transformed price

In [None]:
regularl_log_price = pricelog - pricelog.shift(1)
regularl_log_price.dropna(inplace = True)

print("Dickey-Fuller Test")
Dicky_Fuller_Test(regularl_log_price)
print("\nseasonal decompose")
sea_decompose(regularl_log_price)

dataPlot(regularl_log_price)

#### Regular differencing
applied to the Box-Cox-transformed price

In [None]:
price_boxcox_reg = prices_box_cox - prices_box_cox.shift(1)
price_boxcox_reg.dropna(inplace = True)

print("Dickey-Fuller Test")
Dicky_Fuller_Test(price_boxcox_reg)
print("\nseasonal decompose")
sea_decompose(price_boxcox_reg)

dataPlot(price_boxcox_reg)

After regular differencing, the series looks satisfactory for ARIMA modelling.

Plotting the autocorrelation (ACF) and partial autocorrelation (PACF) gives an idea of which parameters to use in ARIMA.

In [None]:
plt.figure(figsize = (15,8))
plot_acf = acf(regularl_log_price)
plot_pacf = pacf(regularl_log_price)

# Left panel: autocorrelation
plt.subplot(121)
sns.lineplot(data = plot_acf, color = 'green')
plt.axhline(y=0, linestyle='--', color='gray')

# Right panel: partial autocorrelation
plt.subplot(122)
sns.lineplot(data = plot_pacf, color = 'gold')
plt.axhline(y=0, linestyle='--', color='gray')

plt.tight_layout()

The plots suggest that both the ACF and the PACF fall close to zero as the lag approaches one.
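
For readers who want to see what the ACF values mean, here is a hand-rolled sample autocorrelation on a synthetic AR(1) series, together with the usual approximate 95% band of 1.96/sqrt(n) for white noise. `sample_acf` is a hypothetical helper written for this sketch, not a statsmodels function, and the AR(1) coefficient of 0.7 is an arbitrary choice.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (illustrative helper)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

# Synthetic AR(1) series: each value is 0.7 times the previous one plus noise
rng = np.random.default_rng(1)
n, phi = 2000, 0.7
e = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + e[t]

rho = sample_acf(ar1, nlags=5)
band = 1.96 / np.sqrt(n)   # approximate 95% band for a white-noise series

print(np.round(rho, 2))
print("band: +/-%.3f" % band)
```

Lags whose autocorrelation stays inside the band are usually treated as insignificant, which is the informal rule behind reading p and q off the plots.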

Based on the plots, let us try different values of p and q, with d = 1.

In [None]:
from itertools import product
from numpy.linalg import LinAlgError

In [None]:
# Candidate (p, d, q) orders: p and q in 1..3, d fixed at 1
a = [[1,2,3], [1],[1,2,3]]
params = list(product(*a))

results = []
min_aic = float('inf')
best_param = None
best_model = None

# Fit each candidate order and keep the one with the lowest AIC
for param in params:
    try:
        model = ARIMA(pricelog, order = param).fit(disp = -1)
    except (LinAlgError, ValueError):
        # Some orders fail to estimate; skip them
        print('Rejected Parameters:', param)
        continue
    if model.aic < min_aic:
        min_aic = model.aic
        best_param = param
        best_model = model

    results.append([param, model.aic])

print(best_param, min_aic)

# Compare the differenced log price with the best model's fitted values
plt.figure(figsize=(16,8))
sns.lineplot(data = regularl_log_price, color = 'blue')
sns.lineplot(data = best_model.fittedvalues, color = 'brown')

In [None]:
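# The fitted values are on the differenced log scale; a running sum undoes the differencing
fitted_values = best_model.fittedvalues
fitted_values = fitted_values.cumsum()

# Add back the first log price to restore the level
fitted_values = fitted_values + pricelog.iloc[0]

# Exponentiate to return to the price scale
final_values = np.exp(fitted_values)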

d = {'prices' : price, 'pricelog' : pricelog, 'regularl_log_price' : regularl_log_price, 'fitted_values' : fitted_values, 'final_values' : final_values}
summaryDF = pd.DataFrame(data = d)
plt.figure(figsize=(15,6))
sns.lineplot(data = summaryDF['prices'], color = 'blue')
sns.lineplot(data = summaryDF['final_values'], color = 'brown')
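
The back-transformation used above (cumulative sum, add the first log price, exponentiate) can be checked on a toy series where the answer is known. The prices below are made up; the point is only that the three steps exactly invert "take logs, then difference".

```python
import numpy as np
import pandas as pd

# Made-up monthly prices
idx = pd.date_range("2019-01-01", periods=5, freq="MS")
price = pd.Series([100.0, 110.0, 99.0, 120.0, 150.0], index=idx)

log_price = np.log(price)
log_diff = log_price.diff().dropna()   # same as log_price - log_price.shift(1)

# Undo the differencing: running sum of the changes plus the first log price,
# then exponentiate to get back to the price scale
rebuilt = np.exp(log_diff.cumsum() + log_price.iloc[0])

print(np.allclose(rebuilt, price.iloc[1:]))
```

With model output in place of `log_diff`, the rebuilt curve accumulates the model's errors from month to month, so it can drift away from the recorded prices even when each individual fitted change is close.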

### Final Price Prediction

In [None]:
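# Index range for predict(); an end index past the last observation yields out-of-sample forecasts
startMonth=1
endMonth=112
# Same back-transformation as above: cumulative sum, add the first log price, exponentiate
predicted_values = np.exp((best_model.predict(start = startMonth, end = endMonth).cumsum()) + pricelog.iloc[0])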
plt.figure(figsize=(15,6))
sns.lineplot(data = price, label  = 'Recorded')
sns.lineplot(data = predicted_values, ls='--', label = 'Predicted')
