# Cryptocurrency Price Prediction Using an RNN

## Abstract
An investment in bitcoin is equivalent to betting on the increased adoption of this asset. Unlike the US dollar, which used to be backed by gold and is currently backed by trust in the US government, bitcoin has no intrinsic value. This means that bitcoin's market price is determined by supply and demand. Furthermore, bitcoin's supply is capped at 21 million coins. This makes bitcoin a deflationary asset: as the issuance of new coins slows, its economic value is expected to rise, provided that demand holds.
This project uses machine learning to test whether bitcoin's market price can be predicted from features of the Bitcoin blockchain and market, such as past market price, hash rate, difficulty adjustment, and transaction volume. An RNN (Recurrent Neural Network) is used to make predictions on a time-series dataset.

## Goal
- Use an RNN to make predictions on a time-series dataset of 4 cryptocurrencies: BTC, LTC, ETH, BCH
- The goal is for the neural network to generalize, learning the actual problem and the patterns associated with it, rather than just memorizing our data.



## Input data
- The data we'll be using is Open, High, Low, Close, Volume data for Bitcoin, Ethereum, Litecoin and Bitcoin Cash.
- Since there are different ways to measure the price (opening, closing, high, low), we specify that the model uses the `Close` and `Volume` columns from the input dataset.
- We're going to be tracking the `Close` and `Volume` every minute for Bitcoin, Litecoin, Ethereum, and Bitcoin Cash.

## Exploratory data analysis of the dataset
1. Manage input data:
	1.1 Combine price and volume for each coin into a single feature 
	1.2 Choose a target: price we are trying to predict and how far out in the future we are considering
2. **Feed the neural network**: Take that feature set and combine it into sequences of 60 observations, which we will use as input to the model
3. Since our output predicts whether the price of the target cryptocurrency will fall or rise:
	3.1 Take the price of all assets
	3.2 Take the future price of the target asset
	3.3 Determine whether the future price is higher or lower than the current price
4. **Balance** the buys and sells: make sure that there are the same number of examples for each class (the same number of buy and sell sequences). There are two options:
	- a) Feed class weights to the model so that errors are weighted accordingly
	- b) Trim down the larger class so that both classes have approximately the same number of examples
	- Without balancing, our model could learn to predict only one class, whichever is the most common, and memorize it
5. Scale and normalize the data
6. Before creating a target we need to know: 
	6.1 Which price we are trying to predict
	6.2 Time frame: how far out we are trying to predict
	- Choose a time frame step: how many observations should be considered when the RNN makes a prediction about the current observation. 
	- A longer sequence gives the model more context, but it does not guarantee a more accurate prediction and it increases training time
	6.3 Decide how to frame the prediction (how we are going to predict the price):
	a) Treat it as a regression problem, using a linear activation in the output layer
	b) Treat it as a binary classification problem (the price rises or it does not)
7. In this case, we follow the binary classification strategy
8. Use the classification function to make a target
9. Build the sequences, then validate and normalize the data
	9.1 Split training and test datasets: since the data is inherently sequential, the held-out sequences should come from a later period than the training sequences. Sampling them at random from the whole series is likely a mistake.
	9.2 Validate the data:
	- Sequences that are, for example, 1 minute apart are likely to be nearly identical and to carry the same Buy or Sell signal
	- If such near-duplicates end up on both sides of a random split, the model can look good on the held-out data simply because it has overfit to almost the same sequences
	- To remedy this, the validation data is sliced off while the data is still in chronological order
	9.3 Build the sequences and shuffle them before balancing
10. Apply feature scaling to the dataset:
	- Min-max normalization subtracts the minimum value of the dataset and then divides by the range of the dataset. ![Feature scaling normalization equation](https://nickmccullum.com/images/python-deep-learning/recurrent-neural-networks/normalization.jpg) Note that the code in this notebook calls `preprocessing.scale`, which standardizes each column to zero mean and unit variance instead.
	- Scale and normalize every column except the target column itself.
	- We use percentage change in price to normalize the prices of all cryptocurrencies. The reason is that each cryptocurrency trades at a very different price level, so we care about relative movements rather than absolute prices.
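
To make the scaling options in step 10 concrete, here is a minimal sketch on made-up prices. The column names and numbers are illustrative assumptions, not values from the dataset. It compares min-max normalization, x' = (x - min) / (max - min), with the percentage-change approach.

```python
import pandas as pd

# Made-up closing prices for two coins at very different price levels (illustrative only)
prices = pd.DataFrame({
    "coin_a_close": [100.0, 110.0, 99.0, 120.0],
    "coin_b_close": [1.0, 1.1, 0.99, 1.2],
})

# Min-max normalization: subtract the column minimum, divide by the column range
min_max = (prices - prices.min()) / (prices.max() - prices.min())

# Percentage change: relative move from the previous row (the first row has no predecessor)
pct = prices.pct_change().dropna()

print(min_max)
print(pct)
```

Both coins move by the same relative amounts here, so their transformed columns match even though the raw prices differ by a factor of 100. Percentage change also loses one row, because the first observation has nothing to be compared against.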



### Install Libraries

In [30]:
!pip install scikit-learn tensorflow pandas numpy matplotlib requests



### Import Libraries

In [31]:
# import libraries
import random
from collections import deque
from pathlib import Path
import time


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from sklearn import preprocessing

# Keras libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint

In [32]:
PRECEDING_PRICES = 60  # number of preceding observations (one per minute) in each input sequence
FUTURE_PRICES_PREDICT = 3  # how far into the future to predict, in observations (minutes)
TO_PREDICT = "BTC-USD"
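
To make the role of `PRECEDING_PRICES` concrete, the sketch below builds fixed-length windows from a short made-up series with a `deque`. The window length of 3 and the toy rows are assumptions chosen for readability; the notebook itself uses 60.

```python
from collections import deque

import numpy as np

WINDOW = 3  # stand-in for PRECEDING_PRICES (assumed small for readability)

# Toy rows of [feature, target]; the last element of each row is the label
rows = [[0.1, 0], [0.2, 1], [0.3, 1], [0.4, 0], [0.5, 1]]

window = deque(maxlen=WINDOW)  # the oldest row drops out automatically when full
sequences = []

for row in rows:
    window.append(row[:-1])  # keep features only, never the target
    if len(window) == WINDOW:
        # pair the last WINDOW feature rows with the label of the newest row
        sequences.append((np.array(window), row[-1]))

print(len(sequences))         # 3 windows from 5 rows
print(sequences[0][0].shape)  # (3, 1): WINDOW time steps, 1 feature
```

Each window overlaps the previous one in all but one row, which is why neighbouring sequences are so similar.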

### The process of building an RNN

#### Read input datasets

In [33]:
main_df = pd.DataFrame()

data_directory: Path = Path("./crypto-data")

for file in data_directory.glob("*.csv"):
    # file name without the extension, e.g. "BTC-USD"
    file_name = file.stem
    df = pd.read_csv(file, names=["time", "low", "high", "open", "close", "volume"], encoding="ISO-8859-1")
    # rename to distinguish the cryptocurrency we are working with
    df.rename(columns={"close": f"{file_name}_close", "volume": f"{file_name}_volume"}, inplace=True)
    # set time as index (row labels) so we can join the dataframes
    df.set_index("time", inplace=True)
    # keep only the columns we are interested in and drop rows with missing values
    df = df[[f"{file_name}_close", f"{file_name}_volume"]].dropna()
    main_df = df if len(main_df) == 0 else main_df.join(df)

# use the previous valid value if there are gaps in the data
main_df.ffill(inplace=True)
# drop leading rows that still have no value to carry forward
main_df.dropna(inplace=True)
print(main_df.head())

            ETH-USD_close  ETH-USD_volume  BTC-USD_close  BTC-USD_volume  \
time                                                                       
1528968720      486.01001       26.019083    6487.379883        7.706374   
1528968780      486.00000        8.449400    6479.410156        3.088252   
1528968840      485.75000       26.994646    6479.410156        1.404100   
1528968900      486.00000       77.355759    6479.979980        0.753000   
1528968960      486.00000        7.503300    6480.000000        1.490900   

            BCH-USD_close  BCH-USD_volume  LTC-USD_close  LTC-USD_volume  
time                                                                      
1528968720     870.859985       26.856577      96.660004      314.387024  
1528968780     870.099976        1.124300      96.570000       77.129799  
1528968840     870.789978        1.749862      96.500000        7.216067  
1528968900     870.000000        1.680500      96.389999      524.539978  
1528968960     86
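
The join on the `time` index can be illustrated on two tiny made-up frames (the timestamps, column names, and values below are invented for the example). A minute that is missing for one coin becomes `NaN` after the join and is then filled with the previous valid value.

```python
import pandas as pd

# Two made-up per-coin frames indexed by Unix timestamp; coin B misses the middle minute
coin_a = pd.DataFrame({"A_close": [10.0, 11.0, 12.0]}, index=[60, 120, 180])
coin_b = pd.DataFrame({"B_close": [5.0, 7.0]}, index=[60, 180])
coin_a.index.name = coin_b.index.name = "time"

# join() defaults to a left join, so every timestamp of the first frame is kept
joined = coin_a.join(coin_b)

# Forward-fill carries the last known value into the gap
filled = joined.ffill()

print(joined)
print(filled)
```

Because the join is a left join, timestamps that appear only in later frames are dropped, so the first CSV that is read determines which rows survive.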

#### Define logic for making a decision (Buy/Sell)
- If the future price is higher than the current price, we buy (`1`).
- Otherwise, we do not buy, which we treat as a sell (`0`).

In [34]:
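def make_order_decision(current_price, future_price):
	# 1 = buy (the price rises), 0 = sell / do not buy
	if float(current_price) < float(future_price):
		return 1 # buy order
	else:
		return 0 # sell order (also returned when future_price is NaN)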
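
For larger frames, the same rule can be written as a vectorized comparison. This is an optional sketch rather than part of the original notebook, and the function name `make_order_decisions` is hypothetical.

```python
import numpy as np

def make_order_decisions(current_prices, future_prices):
    """Return 1 where the future price is higher than the current price, else 0."""
    current = np.asarray(current_prices, dtype=float)
    future = np.asarray(future_prices, dtype=float)
    # Comparisons with NaN are False, so a missing future price yields 0
    return (future > current).astype(int)

decisions = make_order_decisions([100.0, 105.0, 103.0], [105.0, 103.0, float("nan")])
print(decisions)
```

The result is an integer array with one label per row, which can be assigned directly to a DataFrame column.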

##### Speculate future price based on closing prices from the past

In [35]:
# create a column with the target future value
# a negative shift moves the column up, so each row holds the close price FUTURE_PRICES_PREDICT rows ahead
main_df["future_price_to_predict"] = main_df[f"{TO_PREDICT}_close"].shift(-FUTURE_PRICES_PREDICT)
main_df["order_decision"] = list(map(make_order_decision, main_df[f"{TO_PREDICT}_close"], main_df["future_price_to_predict"]))
# the last FUTURE_PRICES_PREDICT rows have no future price, so drop them
main_df.dropna(inplace=True)
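
The effect of the negative shift is easiest to see on a short made-up series (the values and the horizon of 2 are illustrative assumptions). Each row is paired with the close two rows later, and the last two rows have no future value.

```python
import pandas as pd

HORIZON = 2  # stand-in for FUTURE_PRICES_PREDICT (assumed value for the example)

df = pd.DataFrame({"close": [10.0, 11.0, 9.0, 12.0, 13.0]})

# shift(-HORIZON) moves values up, aligning each row with the close HORIZON rows ahead
df["future"] = df["close"].shift(-HORIZON)

# Rows without a future value cannot be labeled, so they are dropped here before labeling
# (the notebook labels first and drops afterwards; the surviving rows are the same)
df = df.dropna()
df["target"] = (df["future"] > df["close"]).astype(int)

print(df)
```

Three of the five rows survive, which mirrors how the notebook loses the last `FUTURE_PRICES_PREDICT` rows of the real data.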

In [36]:
# normalize data except for the order_decision column
def normalize_and_scale_df(df):
	# no longer need future_price_to_predict column
	df = df.drop("future_price_to_predict", axis=1)

	for column in df.columns:
		# order_decision is the target, so it is left untouched
		if column != "order_decision":
			# normalize data based on percentage change
			df[column] = df[column].pct_change()
			df.dropna(inplace=True)
			# standardize to zero mean and unit variance
			df[column] = preprocessing.scale(df[column].values)
			# alternative: scale values between 0 and 1
			# min_max_scaler = preprocessing.MinMaxScaler()
			# data_scaled = min_max_scaler.fit_transform(df[column].values.reshape(-1, 1))
			# df[column] = data_scaled

	df.dropna(inplace=True)

	predictions_sequence = []
	previous_days_sequence = deque(maxlen=PRECEDING_PRICES)	

	for value in df.values:
		# do not include the target in the sequence
		previous_days_sequence.append([i for i in value[:-1]])
		# only keep last PRECEDING_PRICES observations
		if len(previous_days_sequence) == PRECEDING_PRICES:
			predictions_sequence.append([np.array(previous_days_sequence), value[-1]])
	# shuffle sequential data for good measure
	random.shuffle(predictions_sequence)

	buy_orders = []
	not_buy_orders = []

	for sequence, order_decision in predictions_sequence:
		if order_decision == 1:
			buy_orders.append([sequence, order_decision])
		else:
			not_buy_orders.append([sequence, order_decision])
	
	random.shuffle(buy_orders)
	random.shuffle(not_buy_orders)

	# ensure both buy and not buy orders are the same length
	shorter_sequence = min(len(buy_orders), len(not_buy_orders))
	buy_orders = buy_orders[:shorter_sequence]
	not_buy_orders = not_buy_orders[:shorter_sequence]

	# combine buy and not buy orders into predictions_sequence
	predictions_sequence = buy_orders + not_buy_orders
	random.shuffle(predictions_sequence)

	x_data = []
	y_data = []

	for sequence, order_decision in predictions_sequence:
		x_data.append(sequence) # sequence is the input
		y_data.append(order_decision) # buy or not to buy
	 
	return np.array(x_data), y_data
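
The balancing step inside the function can be looked at in isolation. This is a minimal sketch on made-up labeled items; the helper name `balance_classes` and the fixed seed are assumptions added for the example.

```python
import random

def balance_classes(samples, seed=0):
    """Downsample the larger class so both labels appear equally often.

    `samples` is a list of (sequence, label) pairs with labels 0 or 1.
    """
    rng = random.Random(seed)  # local generator so the example is reproducible
    buys = [s for s in samples if s[1] == 1]
    not_buys = [s for s in samples if s[1] == 0]
    rng.shuffle(buys)
    rng.shuffle(not_buys)

    # Trim both lists to the size of the smaller class
    n = min(len(buys), len(not_buys))
    balanced = buys[:n] + not_buys[:n]

    # Shuffle again so the classes are interleaved
    rng.shuffle(balanced)
    return balanced

# Made-up data: 5 buys and 2 not-buys
samples = [(f"seq{i}", 1) for i in range(5)] + [(f"seq{i}", 0) for i in range(5, 7)]
balanced = balance_classes(samples)
labels = [label for _, label in balanced]
print(labels.count(1), labels.count(0))
```

Downsampling discards data from the larger class; passing class weights to the model is the alternative mentioned in step 4 when that loss is not acceptable.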

In [37]:
# the split is chronological rather than random because the data is sequential
# sequences 1 minute apart are nearly identical and usually share the same order decision,
# so a random split would leak near-duplicates of the training data into the test set
sorted_dates = sorted(main_df.index.values)
# timestamp that marks the start of the last 5% of the data
last_5_percent = sorted_dates[-int(len(sorted_dates) * 0.05)]

test_df = main_df[(main_df.index >= last_5_percent)]
print(f"Shape of test data: {test_df.shape}")
main_df = main_df[(main_df.index < last_5_percent)]
print(f"Shape of training data: {main_df.shape}")

# train test split
# normalize and scale training data
# x_train is a numpy array of sequences
# y_train is a list of buy or not to buy (1 or 0)
x_train, y_train = normalize_and_scale_df(main_df)
# normalize and scale test data
x_test, y_test = normalize_and_scale_df(test_df)
print(test_df)

print(f"train data: {len(x_train)} test data: {len(x_test)}")
print(f"Train Don't buys: {y_train.count(0)}, buys: {y_train.count(1)}")
print(f"Test Don't buys: {y_test.count(0)}, buys: {y_test.count(1)}")

Shape of test data: (5141, 10)
Shape of training data: (97687, 10)
            ETH-USD_close  ETH-USD_volume  BTC-USD_close  BTC-USD_volume  \
time                                                                       
1534904940     291.140015        6.290395    6705.200195        8.163854   
1534905000     291.570007      489.621918    6700.000000        9.827062   
1534905060     292.299988      173.859314    6700.000000       13.631424   
1534905120     292.399994       77.035606    6702.359863       12.383007   
1534905180     292.600006       89.692940    6711.560059       15.645502   
...                   ...             ...            ...             ...   
1535214840     279.290009        4.150405    6710.089844        1.293573   
1535214900     279.299988        5.566861    6712.990234        2.330975   
1535214960     279.359985       11.280577    6713.140137        0.769891   
1535215020     279.359985        8.790519    6714.520020        1.002652   
1535215080     279.36