# Cryptocurrency Price Prediction

This notebook predicts the price movement of a target cryptocurrency from a few features about it. We use time-series data to predict whether the price will be higher a few steps into the future. Because time-series data has an inherent order, a recurrent architecture is a natural fit; this notebook uses LSTMs.

Let's start with some imports...

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import os

from sklearn import preprocessing




SEQLEN = 60
FUTURE_PREDICT = 3
RATIO = "LTC-USD"



The cell above also defines a few constants...

1. **SEQLEN** : the length (in rows) of each input sequence that is generated and fed to the model.

2. **FUTURE_PREDICT** : how many rows ahead of the current row the price to predict lies.

3. **RATIO** : the ratio (cryptocurrency pair) whose price is predicted.

These values are used throughout the notebook, so each is defined once as a named constant.


The next step is to form a dataframe from the CSVs...


In [2]:

# Create an empty dataframe...
main_df = pd.DataFrame()


# We will be reading all the CSVs present...so we create a list of all cryptocurrencies present...
ratios=["BTC-USD","LTC-USD","ETH-USD","BCH-USD"]

# and then pass them into reading one-by-one...
for ratio in ratios:
    
    #Read the file...
    dataset=f"D:\\PROJECTS\\Datasets\\crypto_data\\{ratio}.csv"
    df = pd.read_csv(dataset,names=["time","low","high","open","close","volume"])
    
    
    #rename the columns to put all cryptos in one df...
    df.rename(columns={"close": f"{ratio}_close", "volume": f"vol_{ratio}"},inplace=True)
    
    #set the index of the dataframe as the timestamp...
    df.set_index("time",inplace=True)
    
    #finally, keep only the closing price (our target variable)
    # and the traded volume, which is the quantity of that currency being traded...
    df = df[[f"{ratio}_close", f"vol_{ratio}"]]
    
    #if we are reading the first file, assign it to main_df
    # otherwise, join the df to main_df on the timestamp index...
    if len(main_df)==0:
        main_df = df
    else:
        main_df=main_df.join(df)

# Let's preview our dataframe...
main_df.head()

| time | BTC-USD_close | vol_BTC-USD | LTC-USD_close | vol_LTC-USD | ETH-USD_close | vol_ETH-USD | BCH-USD_close | vol_BCH-USD |
|---|---|---|---|---|---|---|---|---|
| 1528968660 | 6489.549805 | 0.5871 | 96.580002 | 9.6472 | NaN | NaN | 871.719971 | 5.675361 |
| 1528968720 | 6487.379883 | 7.706374 | 96.660004 | 314.387024 | 486.01001 | 26.019083 | 870.859985 | 26.856577 |
| 1528968780 | 6479.410156 | 3.088252 | 96.57 | 77.129799 | 486.0 | 8.4494 | 870.099976 | 1.1243 |
| 1528968840 | 6479.410156 | 1.4041 | 96.5 | 7.216067 | 485.75 | 26.994646 | 870.789978 | 1.749862 |
| 1528968900 | 6479.97998 | 0.753 | 96.389999 | 524.539978 | 486.0 | 77.355759 | 870.0 | 1.6805 |


## A little background on the data

As we can see, we have the closing prices and volumes of 4 cryptocurrencies. We will be predicting Litecoin's price, traded as LTC-USD. Here, LTC-USD is a ratio: how many US dollars one unit of LTC buys.
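
The missing ETH-USD values in the first row of the preview come from how the frames are combined: `DataFrame.join` performs a left join on the index by default, so any timestamp absent from the joined frame is filled with NaN. A minimal sketch with made-up frames (the values and the single-column layout are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical toy data: ETH has no row for the first timestamp.
btc = pd.DataFrame(
    {"BTC-USD_close": [6489.5, 6487.4, 6479.4]},
    index=pd.Index([1528968660, 1528968720, 1528968780], name="time"),
)
eth = pd.DataFrame(
    {"ETH-USD_close": [486.01, 486.00]},
    index=pd.Index([1528968720, 1528968780], name="time"),
)

# join() defaults to a left join on the index: every row of `btc` is kept
# and timestamps missing from `eth` become NaN.
joined = btc.join(eth)
print(joined)
```

These NaNs are why the preprocessing step later calls `dropna()`.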

In [3]:
# This function classifies if the movement was upwards or downwards 
# (bullish or bearish for the commerce folks out there!)
# Here, 1 specifies bullish movement
# and 0 specifies bearish movement...

def classify(currPrice,futPrice):
    if float(futPrice) > float(currPrice):
        return 1
    else:
        return 0
    

## Creating a target value for each data row

Now that we have our dataframe loaded, we need to define our target variable. In a setting like this, the target is simply a later value of an input column: the future price attached to time "t" is the closing price observed at time "t + FUTURE_PREDICT". We therefore create a target value for each data row...
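
To see what shifting a column does, here is a small sketch on made-up prices. The frame `toy` and the variable `n` (standing in for FUTURE_PREDICT) are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical toy prices; `n` plays the role of FUTURE_PREDICT.
n = 3
toy = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0, 14.0]})

# shift(-n) moves every value n rows up, so row t holds the price at t + n.
# The last n rows have no future value and become NaN.
toy["future"] = toy["close"].shift(-n)
print(toy)
```

The trailing NaN rows have no known future price, so they cannot contribute a meaningful target.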

In [4]:
# Create columns with the name "future" that will have the future price...
# for a price at time "t", and leap of "n"
# df[t,'future'] has value referring to price at time (t+n)


main_df['future'] = main_df[f"{RATIO}_close"].shift(-FUTURE_PREDICT)
print(main_df[[f"{RATIO}_close","future"]].head())

            LTC-USD_close     future
time                                
1528968660      96.580002  96.500000
1528968720      96.660004  96.389999
1528968780      96.570000  96.519997
1528968840      96.500000  96.440002
1528968900      96.389999  96.470001


In [5]:
# Now, we map all our input and outputs to the classification function 
# to find if the movement was upwards or downwards...

main_df["target"] = list(map(classify,main_df[f"{RATIO}_close"],main_df["future"]))

In [6]:
print(main_df[[f"{RATIO}_close","future","target"]].head(10))

            LTC-USD_close     future  target
time                                        
1528968660      96.580002  96.500000       0
1528968720      96.660004  96.389999       0
1528968780      96.570000  96.519997       0
1528968840      96.500000  96.440002       0
1528968900      96.389999  96.470001       1
1528968960      96.519997  96.400002       0
1528969020      96.440002  96.400002       0
1528969080      96.470001  96.400002       0
1528969140      96.400002  96.400002       0
1528969200      96.400002  96.400002       0
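
As an alternative to mapping `classify` over two columns, the same labels can be computed with one vectorized comparison. This is a sketch on a made-up frame whose column names (`close`, `future`) are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column roles as main_df.
toy = pd.DataFrame({
    "close":  [96.58, 96.66, 96.39, 96.40],
    "future": [96.50, 96.70, 96.47, np.nan],
})

# 1 where the future price is strictly higher, 0 otherwise.
# Comparisons with NaN evaluate to False, so rows without a future price
# get 0, which is also what classify() returns for them.
toy["target"] = (toy["future"] > toy["close"]).astype(int)
print(toy)
```

Note that an unchanged price counts as 0 (bearish) under both versions, as the rows with equal prices in the output above show.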


In [7]:
times = sorted(main_df.index.values)

In [8]:
# Defining a split index to split the data into training and validation sets...
last_5pct = times[-int(0.05*len(times))]
print(last_5pct)

1534922100


In [9]:
# Creating the training and validation sets...

val_main_df = main_df[(main_df.index >= last_5pct)]
main_df = main_df[(main_df.index < last_5pct)]
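
The split above is chronological: the most recent 5% of timestamps form the validation set, so the model is never validated on data older than what it trained on. A minimal sketch of the same idea on a made-up frame (100 rows at 60-second intervals, both assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame indexed by 100 increasing timestamps.
toy = pd.DataFrame(
    {"close": np.arange(100.0)},
    index=pd.Index(np.arange(0, 6000, 60), name="time"),
)

times = sorted(toy.index.values)
cutoff = times[-int(0.05 * len(times))]  # first timestamp of the last 5%

val = toy[toy.index >= cutoff]
train = toy[toy.index < cutoff]

print(len(train), len(val))
```

Every training timestamp is strictly earlier than every validation timestamp, which a random split would not guarantee.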

In [10]:

# we will be using a double ended queue
# Find more here : https://docs.python.org/2/library/collections.html

from collections import deque
import random


def preprocessDf(df):
    # drop the column "future"...
    df = df.drop(columns="future")
    
    # calculate the percent change in input variables
    # Remove any NaNs generated in the process...
    # use scikit-learn to standardize the values (zero mean, unit variance)...
    
    for col in df.columns:
        if col != "target":
            df[col] = df[col].pct_change()
            df.dropna(inplace=True)
            df[col] = preprocessing.scale(df[col].values)
    
    
    #Again, just to be sure...
    df.dropna(inplace=True)
    
    # Create a sequence to be passed to the model...
    # A sequence is a list of inputs in a defined order...
    seq_data=[]
    prev_days = deque(maxlen=SEQLEN)
    
    
    # df.values returns all data in numpy array without the headers...
    # [TODO] : Shift it to df.to_numpy()
    
    # We iterate over it and build a list of the feature rows for the previous SEQLEN time steps.
    # Once a "sequence" of such data is ready, we put it on the seq_data.
    # So, seq_data is a list of sequences, each of which is in order.
    # the inner sequence is thereby preserved.
    # We do shuffle the list of seq_data...
    
    for i in df.values:
        prev_days.append([n for n in i[:-1]])
        if len(prev_days) == SEQLEN:
            seq_data.append([np.array(prev_days),i[-1]])
    random.shuffle(seq_data)
    
    
    
    # Should we buy or should we sell?
    buys=[]
    sells=[]
    
    
    # find out buys and sells through our target variable...
    for seq,target in seq_data:
        if target==0:
            sells.append([seq,target])
        elif target==1:
            buys.append([seq,target])
     
    #shuffle everything again...
    random.shuffle(buys)
    random.shuffle(sells)
    
    # Create equal sets of data... 
    lwr = min(len(buys),len(sells))
    buys = buys[:lwr]
    sells = sells[:lwr]
    
    
    seq_data = buys + sells
    random.shuffle(seq_data)
    

    X=[]
    y=[]
    
    # Create X (inputs) and y (outputs) for the given data...
    for seq,target in seq_data:
        X.append(seq)
        y.append(target)
    
    # finally, return numpy arrays of X and y...
    return np.array(X),np.array(y)


# Pass the dataframes to obtain the data...
xTrain,yTrain = preprocessDf(main_df)
xTest,yTest = preprocessDf(val_main_df)
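
The normalization inside `preprocessDf` has two steps: convert each column to relative changes, then standardize it. `preprocessing.scale` standardizes to zero mean and unit variance rather than mapping values into a fixed range. The sketch below reproduces that arithmetic with pandas on a made-up price series (the values are assumptions):

```python
import pandas as pd

# Hypothetical toy price column.
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0])

# Relative change between consecutive rows; the first row has no
# predecessor, so it is NaN and gets dropped.
returns = prices.pct_change().dropna()

# Standardize: subtract the mean and divide by the population standard
# deviation (ddof=0).
scaled = (returns - returns.mean()) / returns.std(ddof=0)
print(scaled.values)
```

Working with percentage changes puts coins with very different price levels on a comparable footing before scaling.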

In [11]:
print(xTrain.shape)
print(xTest.shape)

(69188, 60, 8)
(3062, 60, 8)
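
The shapes above read as (number of sequences, SEQLEN, number of features). The following sketch shows how a `deque` with `maxlen` produces such overlapping windows. It uses a made-up array and a much shorter window length, `SEQLEN_TOY`, both of which are assumptions for illustration:

```python
from collections import deque

import numpy as np

SEQLEN_TOY = 3  # hypothetical, much shorter than the notebook's SEQLEN

# Hypothetical rows: two feature columns followed by a target column.
rows = np.array([
    [0.1, 1.0, 0],
    [0.2, 2.0, 1],
    [0.3, 3.0, 0],
    [0.4, 4.0, 1],
    [0.5, 5.0, 1],
])

window = deque(maxlen=SEQLEN_TOY)  # the oldest row drops out automatically
sequences = []
for row in rows:
    window.append(row[:-1])  # features only
    if len(window) == SEQLEN_TOY:
        # Pair the full window with the target of its most recent row.
        sequences.append((np.array(window), int(row[-1])))

X = np.array([seq for seq, _ in sequences])
y = np.array([target for _, target in sequences])
print(X.shape, y)
```

Five rows with a window of three give three sequences, each labelled with the target of its last row.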


## Creating a model

We will now create a model that takes our data as input and outputs either a buy or a sell action. To do this, we take advantage of the inherent order in our data (time) to extract more information. Recurrent networks such as RNNs and LSTMs are suited to this because they carry a hidden state from one time step to the next, so earlier inputs influence later outputs...
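
To make the idea of a carried hidden state concrete, here is a sketch of a plain (non-LSTM) recurrent step in NumPy. It is a simplification: an LSTM adds gates and a cell state on top of this recurrence. The weights are random and the sizes are assumptions chosen to resemble one input window:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 8, 4  # hypothetical sizes

# Randomly initialised weights; a real model would learn these.
W_x = rng.normal(size=(n_hidden, n_features)) * 0.1
W_h = rng.normal(size=(n_hidden, n_hidden)) * 0.1
b = np.zeros(n_hidden)


def run_rnn(sequence):
    """Return the final hidden state after reading the sequence in order."""
    h = np.zeros(n_hidden)
    for x_t in sequence:
        # The new state depends on the current input and the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h


sequence = rng.normal(size=(60, n_features))  # one window of 60 time steps
h_forward = run_rnn(sequence)
h_reversed = run_rnn(sequence[::-1])
print(h_forward.shape, np.allclose(h_forward, h_reversed))
```

Feeding the same rows in reverse order produces a different final state, which is why the order inside each sequence must be preserved.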

In [12]:
#used to define filenames and save models...
import uuid
import time

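#used to define the actual model...
# Note: CuDNNLSTM is available in TensorFlow 1.x; in TensorFlow 2.x use LSTM instead,
# which selects the cuDNN kernel automatically on a GPU when its default arguments are kept.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dropout,CuDNNLSTM,Dense,BatchNormalization
from tensorflow.keras.callbacks import TensorBoard,ModelCheckpoint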


EPOCHS=10
BATCH=64
NAME=f"{SEQLEN}-Day-{RATIO}-Predictor-{int(time.time())}"



model = Sequential()

#Layer 1 : LSTM-dropout-batchnorm
model.add(CuDNNLSTM(128,input_shape=(xTrain.shape[1:]),return_sequences=True))
model.add(Dropout(0.1))
model.add(BatchNormalization())

#Layer 2 : LSTM-dropout-batchnorm
# (input_shape is only needed on the first layer; later layers infer it)
model.add(CuDNNLSTM(128,return_sequences=True))
model.add(Dropout(0.1))
model.add(BatchNormalization())

#Layer 3 : LSTM-dropout-batchnorm
model.add(CuDNNLSTM(128,return_sequences=True))
model.add(Dropout(0.1))
model.add(BatchNormalization())

#Layer 4 : LSTM-dropout-batchnorm
# (no return_sequences here: only the final state is passed to the dense layers)
model.add(CuDNNLSTM(128))
model.add(Dropout(0.1))
model.add(BatchNormalization())

#Layer 5 : Fully connected layer with dropout
model.add(Dense(32,activation="relu"))
model.add(Dropout(0.2))

#Layer 6 : Final outer layer to classify as buy/sell...
model.add(Dense(2,activation="softmax"))



# Use the Adam optimiser with sparse categorical cross-entropy loss
# (binary cross-entropy would also work if the output layer were a single sigmoid unit)
# and report the accuracy of the model...
model.compile(optimizer=tf.keras.optimizers.Adam(lr=1e-3,decay=1e-6),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

#defining a few callbacks to monitor the model...
tensorboard = TensorBoard(log_dir=f"logs/{NAME}")

#checkpointing the model to save the weights with the best validation accuracy...
# (the metric is logged as "val_acc" in TensorFlow 1.x; newer versions log it as "val_accuracy")
FILEPATH = "RNN_Final-{epoch:02d}-{val_acc:.3f}"
chkpt = ModelCheckpoint("models/{}.model".format(FILEPATH),monitor="val_acc",verbose=1,save_best_only=True,mode="max")


#train the model...
history = model.fit(xTrain,yTrain,batch_size = BATCH, epochs=EPOCHS,
                    validation_data=(xTest,yTest),
                    callbacks=[tensorboard,chkpt])


Train on 69188 samples, validate on 3062 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
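
The checkpoint filename contains named placeholders that Keras fills in from the epoch number and the logged metrics each time it saves. The sketch below uses plain string formatting to show what such a name expands to. The epoch and accuracy values are made up for illustration:

```python
FILEPATH = "RNN_Final-{epoch:02d}-{val_acc:.3f}"

# str.format only fills the placeholder in the outer template; braces inside
# the substituted value are left untouched for later.
template = "models/{}.model".format(FILEPATH)
print(template)

# Hypothetical values, formatted the way a checkpoint callback would fill them.
print(template.format(epoch=3, val_acc=0.5612))
```

The placeholder name must match a metric that is actually logged during training, otherwise formatting the filename fails with a `KeyError`.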
