# Exploratory Data Analysis - BTC-USD 
Here we explore the price of the BTC-USD pair. Bitcoin is the reference coin of the crypto market, and the prices of most other cryptocurrencies tend to move with it.

This market is extremely volatile. Reliable predictions are not realistic here, even if we manage to build a model that tracks the last year reasonably well. Still, we want to train a model to make predictions, find the best training period, and study the behaviour of this particular pair.

Let's start the analysis by importing the libraries we need:

In [1]:
import pandas as pd
from fbprophet import Prophet
from fbprophet.plot import plot_plotly, plot_components_plotly
import plotly.express as px
import os, sys
path = os.getcwd()
path = os.path.dirname(path)
sys.path.append(path)
from train import train, save_model
import datetime as dt
from datetime import timedelta
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

First look at BTC-USD data:

In [2]:
# Loading the data
btc = pd.read_csv('../data/bitcoin.csv')
btc

Unnamed: 0,ds,y
0,2014-09-17,457.334015
1,2014-09-18,424.440002
2,2014-09-19,394.795990
3,2014-09-20,408.903992
4,2014-09-21,398.821014
...,...,...
2619,2021-11-22,56289.289062
2620,2021-11-23,57569.074219
2621,2021-11-24,56280.425781
2622,2021-11-25,58927.890625


Our available data runs from September 2014 to late November 2021, and the evolution of the pair looks like this:

In [3]:
px.line(btc, x='ds', y='y')

BTC has always been known for its high volatility, but 2021 was extreme. The price rose from about $20K to $60K, fell to $30K, and then climbed again to almost $70K. The price of Bitcoin is essentially unpredictable, but here we try to track it through 2021.

As in the previous notebooks, we make predictions with training periods of different lengths until we find the closest fit to the real price. We train with all the data, the last 4 years, the last 2 years, and the last year, and compare the results:

In [4]:
# Year 2021
btc['ds'] = pd.to_datetime(btc['ds'])
X_test = btc[btc['ds'].dt.year == 2021][['ds']]
X_test

Unnamed: 0,ds
2294,2021-01-01
2295,2021-01-02
2296,2021-01-03
2297,2021-01-04
2298,2021-01-05
...,...
2619,2021-11-22
2620,2021-11-23
2621,2021-11-24
2622,2021-11-25
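
The training windows compared in the cases below differ only in their first year, so they can all be cut with one helper. This is a minimal sketch on a synthetic, gap-free frame; `training_window` and `toy` are illustrative names and are not part of the notebook or of the `train` module.

```python
import pandas as pd

def training_window(df, first_year, last_year=2020):
    """Return the rows of df whose 'ds' year lies in [first_year, last_year]."""
    years = df['ds'].dt.year
    return df[(years >= first_year) & (years <= last_year)]

# Synthetic stand-in for the BTC frame: one row per day, no gaps
toy = pd.DataFrame({'ds': pd.date_range('2014-09-17', '2021-11-25', freq='D')})
toy['y'] = range(len(toy))

windows = {
    'All the data': training_window(toy, 2014),
    'Last 4 years': training_window(toy, 2017),
    'Last 2 years': training_window(toy, 2019),
    'Last year': training_window(toy, 2020),
}
for name, frame in windows.items():
    print(name, frame['ds'].min().date(), frame['ds'].max().date(), len(frame))
```

Every window stops at the end of 2020, so none of the 2021 test rows can leak into training.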


## Case 1. Training the model with all the Data (2014-2020)

In [5]:
# Full Data
X_train_full_data = btc[btc['ds'].dt.year != 2021]

In [6]:
# Predictions
model = Prophet()
model.fit(X_train_full_data)
forecast = model.predict(X_test)

INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


In [7]:
fig = plot_plotly(model, forecast, xlabel='Date', ylabel='Price')
fig.update_layout(title='BTC-USD stock 2021 Predictions - Model trained with Full Data')
fig.show()

In [8]:
# Validating predictions
val = forecast.merge(btc, on='ds', how='right')
val = val[['ds', 'yhat', 'y']]
val.columns = ['Date', 'Predicted Price', 'True Price']
val = val[val.Date.dt.year == 2021]
fig = px.scatter(val, x=val.Date, y=val.columns[1:],
                title='BTC-USD stock 2021 Predictions - Validation')
fig.update_traces(marker_size=5)
fig.show()

In [9]:
# Forecast Components
plot_components_plotly(model, forecast)

The model here does not follow the latest price growth: because it is trained with all the data, it keeps following the long-term upward trend and ignores the most recent price behaviour.

In [10]:
# Scores
def scores(y_true, y_pred):
    print('MAE:', mean_absolute_error(y_true, y_pred))
    print('RMSE', np.sqrt(mean_squared_error(y_true, y_pred)))

y_true = btc[btc.ds.dt.year == 2021]['y']
y_pred =  forecast['yhat']
scores(y_true, y_pred)

MAE: 31009.65161318647
RMSE 32596.708939335425


In [11]:
print('Mean BTC Price in 2021: $', round(val['True Price'].mean(), 2))
print(f'Score: ', 1 - 31009 / val['True Price'].mean())

Mean BTC Price in 2021: $ 47155.6
Score:  0.34241106345300254


A score of 0.34 is very poor, so let's try to improve our model by training it with less data.
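
The score printed above is 1 minus the MAE divided by the mean true price, so 1.0 would mean a perfect forecast and a value near 0 means the average error is about as large as the price itself. A minimal sketch of that calculation follows; `relative_score` is an illustrative name and the toy numbers are made up.

```python
import numpy as np

def relative_score(y_true, y_pred):
    """Return 1 - MAE / mean(y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    return 1 - mae / y_true.mean()

# Toy values, not BTC prices
toy_true = [100.0, 200.0, 300.0]
toy_pred = [110.0, 190.0, 330.0]
toy_score = relative_score(toy_true, toy_pred)
print(round(toy_score, 4))
```

Computing the score from the arrays, rather than from a hand-typed MAE, avoids rounding the error before dividing. Note that the score becomes negative whenever the MAE exceeds the mean price.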

## Case 2. Training with the last 4 years of data (2017-2020)

In [12]:
# Training data - 2017-20
X_train_last_four = btc[(btc.ds.dt.year >= 2017) & (btc.ds.dt.year <=2020)]
X_train_last_four

Unnamed: 0,ds,y
837,2017-01-01,998.325012
838,2017-01-02,1021.750000
839,2017-01-03,1043.839966
840,2017-01-04,1154.729980
841,2017-01-05,1013.380005
...,...,...
2289,2020-12-27,26272.294922
2290,2020-12-28,27084.808594
2291,2020-12-29,27362.437500
2292,2020-12-30,28840.953125


In [13]:
# Predictions
model = Prophet()
model.fit(X_train_last_four)
forecast2 = model.predict(X_test)
fig = plot_plotly(model, forecast2, xlabel='Date', ylabel='Price')
fig.update_layout(title='BTC-USD 2021 Predictions - Model trained with last 4 years of Data')
fig.show()

INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


In [15]:
# Validating predictions
val2 = forecast2.merge(btc, on='ds', how='right')
val2 = val2[['ds', 'yhat', 'y']]
val2.columns = ['Date', 'Predicted Price', 'Real Price']
val2 = val2[val2.Date.dt.year == 2021]
fig = px.scatter(val2, x=val2.Date, y=val2.columns[1:],
                title='BTC-USD stock 2021 Predictions - Validation')
fig.update_traces(marker_size=5)
fig.show()

In [16]:
y_true = btc[btc.ds.dt.year == 2021]['y']
y_pred =  forecast2['yhat']
scores(y_true, y_pred)

MAE: 24400.155706591257
RMSE 26231.5894340771


In [17]:
print('Mean BTC Price in 2021: $', round(val['True Price'].mean(), 2))
print(f'Score: ', 1 - 24400 / val['True Price'].mean())

Mean BTC Price in 2021: $ 47155.6
Score:  0.4825640926264395


Again, our score is poor, but this time it improved by around 40% relative to Case 1 (from 0.34 to 0.48). Let's train with even less data.
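
The 40% figure is the relative change in the score; the MAE itself falls by a smaller fraction. A short check of both numbers, using the rounded values reported in Cases 1 and 2 (the variable names are illustrative):

```python
# Figures reported in Cases 1 and 2 above, rounded
score_full, score_four = 0.3424, 0.4826
mae_full, mae_four = 31009.65, 24400.16

score_gain = (score_four - score_full) / score_full  # relative rise in score
mae_drop = (mae_full - mae_four) / mae_full          # relative fall in MAE

print(f'Score gain: {score_gain:.1%}')
print(f'MAE reduction: {mae_drop:.1%}')
```

The score rises by roughly 41%, while the MAE falls by roughly 21%.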

## Case 3. Training with 2 years of data (2019-2020)

In [18]:
# Training data - 2019-20
X_train_last_two = btc[(btc.ds.dt.year >= 2019) & (btc.ds.dt.year <=2020)]
X_train_last_two

Unnamed: 0,ds,y
1567,2019-01-01,3843.520020
1568,2019-01-02,3943.409424
1569,2019-01-03,3836.741211
1570,2019-01-04,3857.717529
1571,2019-01-05,3845.194580
...,...,...
2289,2020-12-27,26272.294922
2290,2020-12-28,27084.808594
2291,2020-12-29,27362.437500
2292,2020-12-30,28840.953125


In [30]:
# Predictions
model = Prophet()
model.fit(X_train_last_two)
forecast3 = model.predict(X_test)
fig = plot_plotly(model, forecast3, xlabel='Date', ylabel='Price')
fig.update_layout(title='BTC-USD 2021 Predictions - Model trained with last 2 years of Data')
fig.show()

INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


In [21]:
# Validating predictions
val3 = forecast3.merge(btc, on='ds', how='right')
val3 = val3[['ds', 'yhat', 'y']]
val3.columns = ['Date', 'Predicted Price', 'Real Price']
val3 = val3[val3.Date.dt.year == 2021]
fig = px.scatter(val3, x=val3.Date, y=val3.columns[1:],
                title='BTC-USD stock 2021 Predictions - Validation')
fig.update_traces(marker_size=5)
fig.show()

In [23]:
plot_components_plotly(model, forecast3)

In [24]:
y_true = btc[btc.ds.dt.year == 2021]['y']
y_pred =  forecast3['yhat']
scores(y_true, y_pred)

MAE: 10436.471910302243
RMSE 12609.538262332011


In [25]:
print('Mean BTC Price in 2021: $', round(val['True Price'].mean(), 2))
print(f'Score: ', 1 - 10436 / val['True Price'].mean())

Mean BTC Price in 2021: $ 47155.6
Score:  0.7786901176495706


We improved our score to 0.78, which is not bad considering how volatile the price is.

If we look closely at the "blind" predictions plot, we can see a large *uncertainty area* above and below the predictions, which indicates that the model's predictions are imprecise and may vary a lot. This is caused by the high volatility of the pair in the training data.
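
Prophet reports that uncertainty area as the `yhat_lower` and `yhat_upper` columns of the forecast frame (an 80% interval by default). One way to put a number on it is the width of the interval relative to the prediction. This is a minimal sketch on a hand-made frame standing in for a real forecast; `interval_width_ratio` is an illustrative name and the values are invented.

```python
import pandas as pd

def interval_width_ratio(forecast):
    """Mean of (yhat_upper - yhat_lower) / |yhat| over the forecast rows."""
    width = forecast['yhat_upper'] - forecast['yhat_lower']
    return (width / forecast['yhat'].abs()).mean()

# Invented values shaped like a Prophet forecast frame
toy_forecast = pd.DataFrame({
    'ds': pd.date_range('2021-01-01', periods=3, freq='D'),
    'yhat': [30000.0, 40000.0, 50000.0],
    'yhat_lower': [24000.0, 30000.0, 35000.0],
    'yhat_upper': [36000.0, 50000.0, 65000.0],
})
toy_ratio = interval_width_ratio(toy_forecast)
print(round(toy_ratio, 3))
```

Applied to `forecast3`, a value near or above 1 would mean the band is as wide as the predicted price itself.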

Let's see whether training with only the last year improves the result further.

## Case 4. Training with 1 year of data (2020)

In [26]:
# Training data - 2020
X_train_last_year = btc[btc.ds.dt.year ==2020]
X_train_last_year

Unnamed: 0,ds,y
1932,2020-01-01,7200.174316
1933,2020-01-02,6985.470215
1934,2020-01-03,7344.884277
1935,2020-01-04,7410.656738
1936,2020-01-05,7411.317383
...,...,...
2289,2020-12-27,26272.294922
2290,2020-12-28,27084.808594
2291,2020-12-29,27362.437500
2292,2020-12-30,28840.953125


In [32]:
# Predictions
model = Prophet(daily_seasonality=True)
model.fit(X_train_last_year)
forecast4 = model.predict(X_test)
fig = plot_plotly(model, forecast4, xlabel='Date', ylabel='Price')
fig.update_layout(title='BTC-USD 2021 Predictions - Model trained with last year of Data')
fig.show()

INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.


In [33]:
# Validating predictions
val4 = forecast4.merge(btc, on='ds', how='right')
val4 = val4[['ds', 'yhat', 'y']]
val4.columns = ['Date', 'Predicted Price', 'Real Price']
val4 = val4[val4.Date.dt.year == 2021]
fig = px.scatter(val4, x=val4.Date, y=val4.columns[1:],
                title='BTC-USD stock 2021 Predictions - Validation')
fig.update_traces(marker_size=5)
fig.show()

In [34]:
y_true = btc[btc.ds.dt.year == 2021]['y']
y_pred =  forecast4['yhat']
scores(y_true, y_pred)

MAE: 16837.505850396155
RMSE 18082.51563514129


In [35]:
print('Mean BTC Price in 2021: $', round(val['True Price'].mean(), 2))
print(f'Score: ', 1 - 16837 / val['True Price'].mean())

Mean BTC Price in 2021: $ 47155.6
Score:  0.6429480175225968


In this case, the model predicted what is essentially a straight line, with a lower score than before. These predictions are little more than a linear extrapolation of the previous year's trend, so we can discard this model.
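
To see what a plain linear extrapolation amounts to, a straight line can be fitted to one training year and extended over the following dates with `numpy.polyfit`. This sketch uses synthetic prices, not the notebook's data, and is a stand-in for Prophet's trend component rather than a reproduction of it.

```python
import numpy as np

# Synthetic "training year": a noisy upward drift over 366 days
rng = np.random.default_rng(0)
days_train = np.arange(366)
prices_train = 7000 + 60 * days_train + rng.normal(0, 500, size=366)

# Fit y = slope * day + intercept on the training days
slope, intercept = np.polyfit(days_train, prices_train, deg=1)

# Extend the same line over the next 330 days
days_test = np.arange(366, 366 + 330)
line_forecast = slope * days_test + intercept

# A straight line has constant day-to-day steps equal to the slope
steps = np.diff(line_forecast)
print(round(slope, 1), np.allclose(steps, slope))
```

A forecast like this cannot follow the rises and falls seen during 2021, which is why its error stays high.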

## Results
The best model was the one trained with 2 years of data. Let's plot the results:

In [36]:
results = pd.DataFrame(
    {'MAE': [31009, 24400, 10436, 16837], 
    'RMSE': [32596, 26231, 12609, 18082], 
    'Train Data': ['All the data', 'Last 4 years', 'Last 2 years', 'Last Year']})

In [37]:
px.bar(results, x='Train Data', y=['MAE', 'RMSE'], barmode='group',
        title='2021 MAE and RMSE: All Data vs Last 4 Years, Last 2 Years and Last Year (Lower is Better)')

In [39]:
val['Last 4 Years'] = val2['Predicted Price']
val['Last 2 Years'] = val3['Predicted Price']
val['Last Year'] = val4['Predicted Price']
val = val.rename(columns={'Predicted Price': 'All the Data',  'y': 'True Price', 'ds': 'Date'})
px.line(val, x='Date', y=val.columns[1:], title='BTC-USD Stock Predictions: Train with all the Data vs Train with Last 4, 2 and 1 years')

## Training a model to make predictions over BTC-USD 
We will train a model with two years of data and make predictions for the next year (2022). The final step will be to save our model.

In [40]:
model = train('bitcoin', '../data/bitcoin.csv', False, True, len(X_train_last_two))

INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.


In [41]:
# Making future predictions with the model: one year

# 1. Creating the forecast Dates
X_test_future = []
end = dt.datetime.strptime('2022-12-31', '%Y-%m-%d').date()
start = dt.datetime.strptime('2021-11-20', '%Y-%m-%d').date()

for i in range((end-start).days):
    X_test_future += [(start+timedelta(i)).strftime('%Y-%m-%d')]

X_test_future = pd.DataFrame(X_test_future)
X_test_future.columns = ['ds']
X_test_future

Unnamed: 0,ds
0,2021-11-20
1,2021-11-21
2,2021-11-22
3,2021-11-23
4,2021-11-24
...,...
401,2022-12-26
402,2022-12-27
403,2022-12-28
404,2022-12-29
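
The date loop in the cell above can also be written with `pd.date_range`. A minimal sketch that builds the same list of dates; `X_test_future_alt` is an illustrative name, and dropping the last element mirrors the loop, which stops one day before `end`.

```python
import pandas as pd

# Daily dates from 2021-11-20 up to, but not including, 2022-12-31,
# matching range((end - start).days) in the loop
dates = pd.date_range('2021-11-20', '2022-12-31', freq='D')[:-1]
X_test_future_alt = pd.DataFrame({'ds': dates.strftime('%Y-%m-%d')})

print(len(X_test_future_alt),
      X_test_future_alt['ds'].iloc[0],
      X_test_future_alt['ds'].iloc[-1])
```

Keeping the dates as strings matches the original frame; Prophet also accepts a datetime column, so the `strftime` call is optional.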


In [42]:
# 2. Making predictions: 1 year
forecast = model.predict(X_test_future)
fig = plot_plotly(model, forecast, xlabel='Date', ylabel='Price')
fig.update_layout(title='BTC-USD Stocks - 2022 Forecasting')
fig.show()

## Final Step: Save the model
We save the model in our *models* folder as the final step.

In [43]:
# Saving the model
save_model('../models', model, 'bitcoin')

Model Succesfully Saved in: 
../models/bitcoin.json


## Conclusions
Our model performs reasonably well with two years of training data. We can't rely on this study, because we have little data and crypto market volatility is very high, sometimes extreme. However, **we reached a score of 0.78 with our final model**, we made predictions for 2022, and we knew before we started that **predicting the bitcoin price would be a complicated task**.

Also, we can see that the ***uncertainty area*** around the predictions is large, which indicates that **we can't really know where the price will be in the future, again because of the extreme volatility of the BTC-USD pair**.