# Cryptocurrency Price Prediction Using an RNN

## Abstract
An investment in bitcoin is equivalent to a bet on the increased adoption of this asset. Unlike the US dollar, which used to be backed by gold and is currently backed by trust in the US government, bitcoin has no intrinsic value, so its market price is determined by supply and demand. Furthermore, the number of bitcoins that can ever be produced is capped at 21 million. This fixed supply is why bitcoin is often described as a deflationary asset: if demand grows while new issuance shrinks, each coin should command a greater economic value.
This project uses machine learning to test whether bitcoin's market price can be predicted from features of the Bitcoin blockchain and market, such as past market price, hash rate, difficulty adjustment, and transaction volume. A Recurrent Neural Network (RNN) is trained on a time-series dataset to make the prediction.

## Goal
- Use an RNN to make predictions on a time-series dataset of four cryptocurrencies: BTC, LTC, ETH, and BCH.
- The neural network should not simply memorize the training data. It should generalize, learning the underlying patterns well enough to perform on data it has not seen.



## Input data
- The data we'll be using is Open, High, Low, Close, Volume data for Bitcoin, Ethereum, Litecoin and Bitcoin Cash.
- Since there are different ways to measure the price (opening price, closing price, high price, low price...) we specify that the model uses the `Close` and `Volume` columns from the input dataset
- We're going to be tracking the `Close` and `Volume` every minute for Bitcoin, Litecoin, Ethereum, and Bitcoin Cash.

### Install Libraries

In [1]:
!pip install scikit-learn tensorflow keras pandas numpy matplotlib


### Import Libraries

In [2]:
# import libraries
import random
from collections import deque
from pathlib import Path
import time


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from sklearn import preprocessing

# Keras libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint

### Define constant variables

##### Key concepts:
- **Neural Network**: ![A basic neural network](https://nickmccullum.com/images/python-deep-learning/intuition-recurrent-neural-networks/basic-neural-network.png)
- **Batches**: A batch is a group of training samples that is passed through the neural network together before the weights are updated. The batch size is the number of samples in each group.
- **Epoch**: One epoch is one full forward and backward pass of the training dataset through the RNN. More epochs generally give the model more opportunity to fit the data, but too many can lead to overfitting.
- **Overfitting**: The model memorizes the training data instead of learning general patterns, so it performs well on the training data but poorly on the test or validation data.
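
To make the relationship between batch size and epochs concrete, the sketch below counts how many weight updates a training run performs. The sample count is a made-up figure chosen for illustration, not a value taken from this notebook, and `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches (weight updates) needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


# Hypothetical dataset size, chosen only for illustration.
n_samples = 1000
batch_size = 64
epochs = 10

per_epoch = steps_per_epoch(n_samples, batch_size)
total_updates = per_epoch * epochs
print(per_epoch, total_updates)
```

The last batch of an epoch is usually smaller than the others, which is why the division is rounded up.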

In [3]:
PRECEDING_PRICES = 60  # number of preceding prices (one per minute) used as input
FUTURE_PRICES_PREDICT = 3  # how many steps (minutes) into the future to predict
EPOCHS = 10  # number of passes over the training data
BATCH_SIZE = 64  # how many samples per batch
TO_PREDICT = "BTC-USD"
NAME = f"{TO_PREDICT}-{PRECEDING_PRICES}-SEQ-{FUTURE_PRICES_PREDICT}-PRED-{int(time.time())}" 

### The process of building a RNN
#### **Input Data**: Combine price and volume for each coin into a single feature 

In [4]:
main_df = pd.DataFrame() 

data_directory: Path = Path("./crypto-data")

for file in data_directory.glob("*.csv"):
    file_name = file.stem  # e.g. "BTC-USD" for "BTC-USD.csv"
    df = pd.read_csv(file, names=["time", "low", "high", "open", "close", "volume"], encoding="ISO-8859-1")
    df.rename(columns={"close": f"{file_name}_close", "volume": f"{file_name}_volume"}, inplace=True)
    df.set_index("time", inplace=True)
    df = df[[f"{file_name}_close", f"{file_name}_volume"]]
    df.dropna(inplace=True)
    main_df = df if len(main_df) == 0 else main_df.join(df)

main_df.ffill(inplace=True)  # forward-fill gaps with the last known value
main_df.dropna(inplace=True)
print(main_df.head())

            ETH-USD_close  ETH-USD_volume  BTC-USD_close  BTC-USD_volume  \
time                                                                       
1528968720      486.01001       26.019083    6487.379883        7.706374   
1528968780      486.00000        8.449400    6479.410156        3.088252   
1528968840      485.75000       26.994646    6479.410156        1.404100   
1528968900      486.00000       77.355759    6479.979980        0.753000   
1528968960      486.00000        7.503300    6480.000000        1.490900   

            BCH-USD_close  BCH-USD_volume  LTC-USD_close  LTC-USD_volume  
time                                                                      
1528968720     870.859985       26.856577      96.660004      314.387024  
1528968780     870.099976        1.124300      96.570000       77.129799  
1528968840     870.789978        1.749862      96.500000        7.216067  
1528968900     870.000000        1.680500      96.389999      524.539978  
1528968960     86

#### Define logic for making a decision (Buy/Sell)
- If the future price is higher than the current price, we buy (label `1`).
- Otherwise, we sell (label `0`).

In [5]:
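def make_order_decision(current_price, future_price):
    # 1 = buy: the future price is higher than the current price
    if float(current_price) < float(future_price):
        return 1
    # 0 = sell (do not buy): the future price is flat or lower
    return 0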

#### Speculate future price based on closing prices from the past
##### Choose target:
- Price we are trying to predict
- How far out in the future we are trying to predict


In [6]:
main_df["future_price_to_predict"] = main_df[f"{TO_PREDICT}_close"].shift(-FUTURE_PRICES_PREDICT)  # negative shift moves future prices up to the current row
main_df["order_decision"] = list(map(make_order_decision, main_df[f"{TO_PREDICT}_close"], main_df["future_price_to_predict"]))
main_df.dropna(inplace=True)
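
As a rough illustration of what the shift and labelling above produce, here is a dependency-free sketch on a handful of made-up prices. `label_future_moves` is a hypothetical helper written for this example. It mirrors the idea of `shift(-n)` followed by `dropna()`, not the exact pandas code.

```python
def label_future_moves(closes, horizon):
    """Pair each close with the close `horizon` steps ahead and label it.

    Returns (current, future, label) tuples. The label is 1 when the future
    close is strictly higher (buy) and 0 otherwise. The last `horizon`
    closes have no future value and are dropped.
    """
    rows = []
    for i in range(len(closes) - horizon):
        current, future = closes[i], closes[i + horizon]
        rows.append((current, future, 1 if current < future else 0))
    return rows


toy_closes = [100.0, 101.0, 99.0, 102.0, 98.0, 103.0]  # made-up prices
labels = label_future_moves(toy_closes, horizon=3)
for row in labels:
    print(row)
```

With six prices and a horizon of three, only the first three rows have a known future price, so three labelled rows remain.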

### Prediction: 
#### Will price rise or fall? 
#### Should I buy or should I sell?
##### Apply feature scaling to the dataset

There are 2 alternatives:
- Standardization
- Normalization
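
The difference between the two can be sketched on a small made-up list. The helper names below are hypothetical. Standardization as written here uses the population standard deviation, which is also what scikit-learn's `preprocessing.scale` (used later in this notebook) does.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


toy = [2.0, 4.0, 6.0, 8.0]  # made-up values
print(standardize(toy))
print(min_max_normalize(toy))
```

Standardized values are centred on zero and can be negative, while min-max normalized values always stay between 0 and 1.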

Since the output is a prediction of whether the price of the target cryptocurrency will rise or fall, the steps are:
1. Take the price of all assets.
2. Pick a target and take its future price.
3. Choose a time frame step: how many past observations the RNN considers when it makes a prediction for the current observation.
4. Decide how to frame the prediction:
   - a) As a regression problem, using a linear activation in the output layer
   - b) As a **binary classification** problem (the option chosen here)

##### 1. **Balance** the buys and sells: make sure that there are the same number of buy examples and sell examples in the training data
- a) Feed class weights to the model so that errors on the rarer class count for more
- b) Trim the larger class so that both classes have the same number of examples (the option used here)
- Without balancing, the model could learn to always predict whichever class is most common instead of learning the patterns in the data
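
Option a) is not used in this notebook, but a minimal sketch of it could look as follows. The weighting rule is the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The helper name and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


toy_labels = [1, 1, 1, 0]  # made-up and deliberately imbalanced
weights = balanced_class_weights(toy_labels)
print(weights)
```

A dictionary of this shape, mapping class index to weight, is what the `class_weight` argument of Keras's `model.fit` expects. The rarer class receives the larger weight.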
##### 2. Scale and normalize the data
- The `order_decision` column holds the target label, so it is not scaled

#### Data validation, and sequence normalization
1. Split the training and test datasets:
- Since the data is inherently sequential, the test set should come from the future relative to the training set. Sampling test sequences at random from the whole time range is likely a mistake.
2. Validate the data:
- Sequences that are, for example, 1 minute apart are nearly identical and are likely to carry the same Buy or Sell signal. If the test split were sampled at random, the model would see near-copies of the test sequences during training and its test score would be inflated. To avoid this, the validation data is sliced off while the data is still in chronological order.
3. Create the sequences and shuffle them before balancing
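
The chronological split in step 1 can be sketched without any dependencies. `chronological_split` is a hypothetical helper and the timestamps are made up. The point is only that every held-out timestamp is later than every training timestamp.

```python
def chronological_split(timestamps, holdout_fraction):
    """Split timestamps so the holdout is strictly the most recent part."""
    ordered = sorted(timestamps)
    n_holdout = int(len(ordered) * holdout_fraction)
    if n_holdout == 0:
        raise ValueError("holdout_fraction is too small for this many rows")
    cutoff = ordered[-n_holdout]  # first timestamp that belongs to the holdout
    train = [t for t in ordered if t < cutoff]
    holdout = [t for t in ordered if t >= cutoff]
    return train, holdout


# Made-up timestamps, one per minute.
toy_times = [1_000_000 + 60 * i for i in range(100)]
train, holdout = chronological_split(toy_times, holdout_fraction=0.05)
print(len(train), len(holdout))
```

Because the cut is made before any shuffling, no held-out window can overlap in time with a training window.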


In [7]:
def normalize_and_scale_df(df):
    # The raw future price is only needed to build the label, so drop it
    df = df.drop("future_price_to_predict", axis=1)
    for column in df.columns:
        if column != "order_decision":  # the label column is left untouched
            # Convert to percent change so coins at different price levels are comparable
            df[column] = df[column].pct_change()
            df.dropna(inplace=True)
            # Standardize to zero mean and unit variance
            df[column] = preprocessing.scale(df[column].values)
    df.dropna(inplace=True)

    # Build sliding windows of the last PRECEDING_PRICES rows of features
    predictions_sequence = []
    previous_days_sequence = deque(maxlen=PRECEDING_PRICES)
    for value in df.values:
        previous_days_sequence.append(list(value[:-1]))  # every column except the label
        if len(previous_days_sequence) == PRECEDING_PRICES:
            predictions_sequence.append([np.array(previous_days_sequence), value[-1]])
    random.shuffle(predictions_sequence)

    # Separate the sequences by label
    buy_orders = []
    not_buy_orders = []
    for sequence, order_decision in predictions_sequence:
        if order_decision == 1:
            buy_orders.append([sequence, order_decision])
        else:
            not_buy_orders.append([sequence, order_decision])
    random.shuffle(buy_orders)
    random.shuffle(not_buy_orders)

    # Balance the classes by trimming the larger one to the size of the smaller one
    shorter_sequence = min(len(buy_orders), len(not_buy_orders))
    buy_orders = buy_orders[:shorter_sequence]
    not_buy_orders = not_buy_orders[:shorter_sequence]
    predictions_sequence = buy_orders + not_buy_orders
    random.shuffle(predictions_sequence)  # avoid all buys followed by all sells

    # Split into model inputs (x) and labels (y)
    x_data = []
    y_data = []
    for sequence, order_decision in predictions_sequence:
        x_data.append(sequence)
        y_data.append(order_decision)
    return np.array(x_data), y_data
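
The windowing step inside the function relies on `deque(maxlen=...)` discarding its oldest entry automatically. The sketch below isolates that idea on made-up rows with a single feature and a window of three. `sliding_windows` is a hypothetical helper, not part of the notebook.

```python
from collections import deque


def sliding_windows(rows, window):
    """Return (features_window, label) pairs.

    Each row is [feature_1, ..., feature_n, label]. A window holds the
    features of the `window` most recent rows, and the label is taken
    from the newest row in the window.
    """
    recent = deque(maxlen=window)  # the oldest entry is dropped automatically
    out = []
    for row in rows:
        recent.append(list(row[:-1]))
        if len(recent) == window:
            out.append((list(recent), row[-1]))
    return out


toy_rows = [[0.1, 1], [0.2, 0], [0.3, 1], [0.4, 0]]  # made-up values
pairs = sliding_windows(toy_rows, window=3)
for features, label in pairs:
    print(features, label)
```

Four rows with a window of three yield two overlapping sequences, which is why neighbouring sequences in the real data are so similar to each other.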

In [8]:
sorted_dates = sorted(main_df.index.values)
last_5_percent = sorted_dates[-int(len(sorted_dates) * 0.05)]  # timestamp where the last 5% of the data begins
test_df = main_df[(main_df.index >= last_5_percent)]
print(f"Shape of test data: {test_df.shape}")
main_df = main_df[(main_df.index < last_5_percent)]
print(f"Shape of training data: {main_df.shape}")
# x_train is a numpy array of sequences
# y_train is a list of buy or not to buy (1 or 0)
x_train, y_train = normalize_and_scale_df(main_df)
x_test, y_test = normalize_and_scale_df(test_df)
print(test_df.head())

print(f"train data: {len(x_train)} test data: {len(x_test)}")
print(f"Train don't buys: {y_train.count(0)}, buys: {y_train.count(1)}")
print(f"Test don't buys: {y_test.count(0)}, buys: {y_test.count(1)}")

Shape of test data: (5141, 10)
Shape of training data: (97687, 10)
            ETH-USD_close  ETH-USD_volume  BTC-USD_close  BTC-USD_volume  \
time                                                                       
1534904940     291.140015        6.290395    6705.200195        8.163854   
1534905000     291.570007      489.621918    6700.000000        9.827062   
1534905060     292.299988      173.859314    6700.000000       13.631424   
1534905120     292.399994       77.035606    6702.359863       12.383007   
1534905180     292.600006       89.692940    6711.560059       15.645502   

            BCH-USD_close  BCH-USD_volume  LTC-USD_close  LTC-USD_volume  \
time                                                                       
1534904940     559.989990        5.615524      58.220001      135.698441   
1534905000     558.239990       35.015507      57.930000      864.464905   
1534905060     560.640015       84.553719      58.250000      563.237671   
1534905120     560.2

#### Build and train the model

In [9]:
model = Sequential()
model.add(LSTM(128, input_shape=(x_train.shape[1:]), return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization()) 

model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.1))
model.add(BatchNormalization())

model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(2, activation='softmax'))



##### Model Compilation Settings: Adam Optimizer

In [10]:
opt = Adam(learning_rate=0.001, decay=1e-6)

# Compile model
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy']
)
tensorboard = TensorBoard(log_dir="logs/{}".format(NAME))

##### Model Checkpoint

In [11]:
model_checkpoint = "RNN_Final-{epoch:02d}-{val_accuracy:.3f}"  
# Keep only the checkpoints that improve validation accuracy
checkpoint = ModelCheckpoint("models/{}.model".format(model_checkpoint), monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
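
The placeholders in the checkpoint file name are ordinary Python format fields that Keras fills in with the epoch number and the logged metrics. The sketch below fills them by hand with made-up values to show the resulting path.

```python
template = "RNN_Final-{epoch:02d}-{val_accuracy:.3f}"

# Hypothetical values for one epoch. During training Keras supplies the real ones.
name = template.format(epoch=1, val_accuracy=0.5312)
path = "models/{}.model".format(name)
print(path)
```

`{epoch:02d}` pads the epoch to two digits and `{val_accuracy:.3f}` keeps three decimals, so checkpoints sort by epoch and show their score in the file name.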

##### Training phase

In [12]:
history = model.fit(
    x_train, np.array(y_train),
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(x_test, np.array(y_test)),
    callbacks=[tensorboard, checkpoint],
)

Epoch 1/10

INFO:tensorflow:Assets written to: models/RNN_Final-01-0.531.model/assets



Epoch 2/10
 178/1293 [===>..........................] - ETA: 3:18 - loss: 0.6568 - accuracy: 0.5721

#### Save Model Score

In [None]:
model_score = model.evaluate(x_test, np.array(y_test), verbose=0)
print('Test loss:', model_score[0])
print('Test accuracy:', model_score[1])
model.save("models/{}".format(NAME))

Test loss: 0.6935529708862305
Test accuracy: 0.48899704217910767
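
One way to read these numbers, offered as a rough sanity check and not as part of the original analysis: a two-class model that always outputs 50/50 has a cross-entropy loss of ln 2. The sketch compares that baseline with the test loss printed above. `binary_cross_entropy` is a hypothetical helper.

```python
import math


def binary_cross_entropy(p_true_class: float) -> float:
    """Cross-entropy loss for one sample, given the probability assigned to its true class."""
    return -math.log(p_true_class)


chance_loss = binary_cross_entropy(0.5)  # a model that always outputs 50/50
reported_loss = 0.6935529708862305       # test loss printed above
print(round(chance_loss, 4), round(reported_loss - chance_loss, 4))
```

A test loss this close to ln 2, together with an accuracy near 0.5, suggests that this particular run did little better than chance on the held-out data.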


INFO:tensorflow:Assets written to: models/BTC-USD-60-SEQ-3-PRED-1637099034/assets



In [None]:
!tensorboard --logdir=logs


Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.7.0 at http://localhost:6006/ (Press CTRL+C to quit)
