# Cryptocurrency prediction with Recurrent Neural Networks
Source:
1. https://pythonprogramming.net/crypto-rnn-model-deep-learning-python-tensorflow-keras/?completed=/balancing-rnn-data-deep-learning-python-tensorflow-keras/

### Data Preprocessing
The data has 6 columns: 'time', 'low', 'high', 'open', 'close' and 'volume'. We decided that only the closing price and the volume are important features for predicting the future price. Thus, we combine the closing price and the volume of the 4 different cryptocurrencies into a single data frame.

Tips
- To rename column headers, one can use df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}).
- To set a column as the new index, one can use df.set_index.
- To fill gaps in the data with the previously known value, one can use df.fillna(method="ffill", inplace=True). Newer pandas versions deprecate the method argument in favour of df.ffill().
- Some background on f-strings: https://realpython.com/python-f-strings/
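
The tips above can be tried on a toy frame before touching the real files. This is a minimal sketch: the frame `toy` and its values are made up for illustration, and `ffill()` is used as the equivalent of `fillna(method="ffill")`.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the 'close' column (values are made up for illustration)
toy = pd.DataFrame({
    "time": [1, 2, 3, 4],
    "close": [10.0, np.nan, 12.0, np.nan],
    "volume": [5.0, 6.0, 7.0, 8.0],
})

# rename returns a new frame unless inplace=True is passed
toy = toy.rename(columns={"close": "LTC-USD_close", "volume": "LTC-USD_volume"})

# use the 'time' column as the index
toy = toy.set_index("time")

# forward-fill: each gap takes the last known value above it
toy = toy.ffill()

print(toy)
```

Note that a gap in the very first row would stay empty after a forward fill, because there is no earlier value to copy. This is why the notebook calls dropna after filling.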

In [1]:
import pandas as pd

df = pd.read_csv("input/LTC-USD.csv", names=['time', 'low', 'high', 'open', 'close', 'volume'])

print(df.head())

         time        low       high       open      close      volume
0  1528968660  96.580002  96.589996  96.589996  96.580002    9.647200
1  1528968720  96.449997  96.669998  96.589996  96.660004  314.387024
2  1528968780  96.470001  96.570000  96.570000  96.570000   77.129799
3  1528968840  96.449997  96.570000  96.570000  96.500000    7.216067
4  1528968900  96.279999  96.540001  96.500000  96.389999  524.539978


In [2]:
main_df = pd.DataFrame() # begin empty

ratios = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"]  # the 4 ratios we want to consider
for ratio in ratios:  # begin iteration
    dataset = f'input/{ratio}.csv'  # get the full path to the file.
    df = pd.read_csv(dataset, names=['time', 'low', 'high', 'open', 'close', 'volume'])  # read in specific file

    # rename volume and close to include the ticker 
    df.rename(columns={"close": f"{ratio}_close", "volume": f"{ratio}_volume"}, inplace=True)
    df.set_index("time", inplace=True)  # set time as index so we can join them on this shared time
    df = df[[f"{ratio}_close", f"{ratio}_volume"]]  # ignore the other columns besides price and volume

    if len(main_df)==0:  # if the dataframe is empty
        main_df = df  # then it's just the current df
    else:  # otherwise, join this data to the main one
        main_df = main_df.join(df)

main_df.ffill(inplace=True)  # fill gaps with the last known value
main_df.dropna(inplace=True)
print(main_df.head())

            BTC-USD_close  BTC-USD_volume  LTC-USD_close  LTC-USD_volume  \
time                                                                       
1528968720    6487.379883        7.706374      96.660004      314.387024   
1528968780    6479.410156        3.088252      96.570000       77.129799   
1528968840    6479.410156        1.404100      96.500000        7.216067   
1528968900    6479.979980        0.753000      96.389999      524.539978   
1528968960    6480.000000        1.490900      96.519997       16.991997   

            BCH-USD_close  BCH-USD_volume  ETH-USD_close  ETH-USD_volume  
time                                                                      
1528968720     870.859985       26.856577      486.01001       26.019083  
1528968780     870.099976        1.124300      486.00000        8.449400  
1528968840     870.789978        1.749862      485.75000       26.994646  
1528968900     870.000000        1.680500      486.00000       77.355759  
1528968960     86

#### Create target
Our objective is to predict the Litecoin price. We use a sequence length of 60 and a prediction horizon of 3, which means we use the past 60 minutes to predict something 3 minutes into the future. However, we should be clearer about what we would like to predict: if the price goes up in 3 minutes, it's a buy (target = 1); if it goes down or stays the same, it's a not buy/sell (target = 0).

Tips
- map() applies the given function to each item of one or more iterables (list, tuple etc.). In Python 3 it returns a lazy iterator, so we wrap it in list() to get a list of the results.
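
A small sketch of how map() behaves with two iterables. The helper `went_up` and the price lists are hypothetical stand-ins chosen for illustration, not part of the notebook.

```python
def went_up(current, future):
    # 1 if the future price is strictly higher than the current one, else 0
    return 1 if float(future) > float(current) else 0

current_prices = [96.66, 96.57, 96.50]  # made-up values
future_prices = [96.39, 96.60, 96.50]   # made-up values

lazy = map(went_up, current_prices, future_prices)  # an iterator, nothing computed yet
targets = list(lazy)                                # materialize the results

print(targets)  # [0, 1, 0]
```

With two iterables, map() passes one item from each to the function on every step, which is how a current price gets paired with its future price.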

In [3]:
def classify(current, future):
    if float(future) > float(current):
        return 1
    else:
        return 0

In [4]:
SEQ_LEN = 60  # length of the preceding sequence to collect for the RNN
FUTURE_PERIOD_PREDICT = 3  # how far into the future are we trying to predict?
RATIO_TO_PREDICT = "LTC-USD"

In [5]:
main_df['future'] = main_df[f'{RATIO_TO_PREDICT}_close'].shift(-FUTURE_PERIOD_PREDICT)
main_df['target'] = list(map(classify, main_df[f'{RATIO_TO_PREDICT}_close'], main_df['future']))

In [6]:
print(main_df.head(3))

            BTC-USD_close  BTC-USD_volume  LTC-USD_close  LTC-USD_volume  \
time                                                                       
1528968720    6487.379883        7.706374      96.660004      314.387024   
1528968780    6479.410156        3.088252      96.570000       77.129799   
1528968840    6479.410156        1.404100      96.500000        7.216067   

            BCH-USD_close  BCH-USD_volume  ETH-USD_close  ETH-USD_volume  \
time                                                                       
1528968720     870.859985       26.856577      486.01001       26.019083   
1528968780     870.099976        1.124300      486.00000        8.449400   
1528968840     870.789978        1.749862      485.75000       26.994646   

               future  target  
time                           
1528968720  96.389999       0  
1528968780  96.519997       0  
1528968840  96.440002       0  


By now we have created a new data frame that contains only the information we need: the closing prices and the volumes of the 4 currencies, plus a) the "future" price and b) the "target" (direction) of the currency we are predicting. In this case we only care about the direction, because we treat this as a classification problem, so we will drop the future price column later on. One could also treat this as a regression problem; in that case, one would keep the future price and drop the direction column (which is our target right now).

### Normalization and Data Balancing
- One could normalize the data using sklearn's preprocessing.scale. Note that one should not normalize the target.
- A deque is preferred over a list when we need fast append and pop operations at both ends of the container. With maxlen set, it also ensures that our sequence contains only 60 data points (minutes). This works like a moving window that advances one step at a time and always holds 60 points.
- Data balancing improves the learning efficiency of the model. We first count the sequences with target = 1 and with target = 0. Next, we select the same number from each category: if we pick 1000 sequences with target = 1, we also pick 1000 sequences with target = 0. To keep as many sequences as possible, we take all sequences of the less frequent target and the same number of sequences from the more frequent target.
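
The moving window and the balancing step can be sketched on toy data. Everything here is an illustrative assumption: `WINDOW` stands in for SEQ_LEN, and `rows` is a made-up list of (feature, target) pairs.

```python
import random
from collections import deque

random.seed(0)  # fixed seed so the sketch is repeatable

WINDOW = 3  # stands in for SEQ_LEN = 60
rows = [(i, i % 3 == 0) for i in range(10)]  # made-up (feature, target) pairs

# Moving window: the deque drops its oldest item once it holds WINDOW items
window = deque(maxlen=WINDOW)
sequences = []
for feature, target in rows:
    window.append(feature)
    if len(window) == WINDOW:  # only emit full windows
        sequences.append((list(window), int(target)))

# Balancing: keep the same number of sequences from each class
buys = [s for s in sequences if s[1] == 1]
sells = [s for s in sequences if s[1] == 0]
random.shuffle(buys)
random.shuffle(sells)
lower = min(len(buys), len(sells))
balanced = buys[:lower] + sells[:lower]
random.shuffle(balanced)  # avoid one class following the other in a block

print(len(sequences), len(buys), len(sells), len(balanced))
```

Ten rows give eight full windows, because the first two rows do not yet fill a window of three. Balancing then discards the surplus of the more frequent class.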


In [7]:
# # a small demo of deque
# a_list = deque(maxlen=3)
# a_list.append('a')
# print(a_list)
# a_list.append('b')
# print(a_list)
# a_list.append('c')
# print(a_list)
# a_list.append('d')
# print(a_list)

In [8]:
from sklearn import preprocessing  
from collections import deque
import numpy as np
import random   

In [9]:
def preprocess_df(df):
    df = df.drop(columns="future")  # don't need this anymore.

    for col in df.columns:  # go through all of the columns
        if col != "target":  # normalize all ... except for the target itself!
            df[col] = df[col].pct_change()  # pct change "normalizes" the different currencies (each crypto coin has vastly diff values, we're really more interested in the other coin's movements)
            df.dropna(inplace=True)  # remove the nas created by pct_change
            df[col] = preprocessing.scale(df[col].values)  # standardize to zero mean and unit variance.

    df.dropna(inplace=True)  # cleanup again... jic.


    sequential_data = []  # this is a list that will CONTAIN the sequences
    prev_days = deque(maxlen=SEQ_LEN)  # These will be our actual sequences. They are made with deque, which keeps the maximum length by popping out older values as new ones come in

    for i in df.values:  # i is one row of the data frame (one minute of data)
        prev_days.append([n for n in i[:-1]])  # store all values in the row except the target
        if len(prev_days) == SEQ_LEN:  # make sure the window holds 60 time steps!
            sequential_data.append([np.array(prev_days), i[-1]])  # append the window and its target

    random.shuffle(sequential_data)  # shuffle for good measure.
    
    # Balancing the data

    buys = []  # list that will store our buy sequences and targets
    sells = []  # list that will store our sell sequences and targets

    for seq, target in sequential_data:  # iterate over the sequential data
        if target == 0:  # if it's a "not buy"
            sells.append([seq, target])  # append to sells list
        elif target == 1:  # otherwise if the target is a 1...
            buys.append([seq, target])  # it's a buy!

    random.shuffle(buys)  # shuffle the buys
    random.shuffle(sells)  # shuffle the sells!

    lower = min(len(buys), len(sells))  # what's the shorter length?

    buys = buys[:lower]  # make sure both lists are only up to the shortest length.
    sells = sells[:lower]  # make sure both lists are only up to the shortest length.

    sequential_data = buys+sells  # add them together
    random.shuffle(sequential_data)  # another shuffle, so the model doesn't get confused with all 1 class then the other.

    X = []
    y = []

    for seq, target in sequential_data:  # going over our new sequential data
        X.append(seq)  # X is the sequences
        y.append(target)  # y is the targets/labels (buys vs sell/notbuy)

    return np.array(X), y  # return X and y...and make X a numpy array!
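
The two normalization steps inside preprocess_df can be checked on a toy column. This is a sketch under stated assumptions: the prices are made up, and the standardization is written out with pandas instead of calling preprocessing.scale, which also standardizes each column to zero mean and unit variance.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 108.9])  # made-up closing prices

# Step 1: percentage change puts coins with very different price levels on a common footing
returns = prices.pct_change().dropna()  # the first row has no predecessor, so it is dropped

# Step 2: standardize to zero mean and unit variance (population std, ddof=0)
scaled = (returns - returns.mean()) / returns.std(ddof=0)

print(returns.tolist())
print(scaled.tolist())
```

The scaled values are not bounded between 0 and 1. They are centred on 0, and large moves can fall well outside the range of -1 to 1.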

### Train-Test split

In [10]:
times = sorted(main_df.index.values)  # get the times
last_5pct = times[-int(0.05*len(times))]  # the timestamp at which the last 5% of the data begins

validation_main_df = main_df[(main_df.index >= last_5pct)]  # make the validation data where the index is in the last 5%
main_df = main_df[(main_df.index < last_5pct)]  # now the main_df is all the data up to the last 5%


In [11]:
train_x, train_y = preprocess_df(main_df) 
validation_x, validation_y = preprocess_df(validation_main_df)

In [12]:
print(f"train data: {len(train_x)} validation: {len(validation_x)}")
print(f"Dont buys: {train_y.count(0)}, buys: {train_y.count(1)}")
print(f"VALIDATION Dont buys: {validation_y.count(0)}, buys: {validation_y.count(1)}")

train data: 77922 validation: 3860
Dont buys: 38961, buys: 38961
VALIDATION Dont buys: 1930, buys: 1930


There are 77922 sequences with 60 timesteps in a single sequence; each timestep contains 8 features.

In [13]:
train_x.shape

(77922, 60, 8)

### Training Phase

Source:

Input of LSTM
1. https://medium.com/@shivajbd/understanding-input-and-output-shape-in-lstm-keras-c501ee95c65e
2. https://stackoverflow.com/questions/37901047/what-is-num-units-in-tensorflow-basiclstmcell
3. https://datascience.stackexchange.com/questions/12964/what-is-the-meaning-of-the-number-of-units-in-the-lstm-cell
4. https://datascience.stackexchange.com/questions/20413/clarification-on-the-keras-recurrent-unit-cell

Crossentropy loss
1. https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html

In [14]:
import time

EPOCHS = 10  # how many passes through our data
BATCH_SIZE = 64  # samples per batch. Try a smaller batch if you're getting OOM (out of memory) errors.
NAME = f"{SEQ_LEN}-SEQ-{FUTURE_PERIOD_PREDICT}-PRED-{int(time.time())}"  # a unique name for the model


In [15]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, CuDNNLSTM, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.callbacks import ModelCheckpoint 

# the model checkpoint saves the model whenever the monitored metric improves
# source
# https://machinelearningmastery.com/check-point-deep-learning-models-keras/



In [16]:
import tensorflow as tf; print(tf.__version__)

1.11.0


In [17]:
model = Sequential()
model.add(CuDNNLSTM(128, input_shape=(train_x.shape[1:]), return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())  #normalizes activation outputs, same reason you want to normalize your input data.

model.add(CuDNNLSTM(128, return_sequences=True))
model.add(Dropout(0.1))
model.add(BatchNormalization())

model.add(CuDNNLSTM(128))
model.add(Dropout(0.2))
model.add(BatchNormalization())

model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(2, activation='softmax'))

In [18]:
opt = tf.keras.optimizers.Adam(lr=0.001, decay=1e-6)

# Compile model
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy']
)
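
To make the loss concrete, here is a minimal NumPy sketch of what sparse categorical crossentropy computes: the mean negative log of the probability assigned to the true class, with integer labels instead of one-hot vectors. The probabilities and labels are made up for illustration, and this is not the Keras implementation itself.

```python
import numpy as np

# Softmax outputs for 3 samples over the 2 classes (made-up numbers)
probs = np.array([[0.9, 0.1],
                  [0.4, 0.6],
                  [0.5, 0.5]])
labels = np.array([0, 1, 1])  # integer class labels, not one-hot

# pick the probability the model assigned to the true class of each sample
true_class_probs = probs[np.arange(len(labels)), labels]

# average negative log-likelihood over the samples
loss = -np.mean(np.log(true_class_probs))
print(loss)
```

For two classes, a model that always outputs 0.5/0.5 has a loss of ln 2 ≈ 0.693. This is a useful baseline when reading the loss reported after training.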

In [19]:
tensorboard = TensorBoard(log_dir="logs/{}".format(NAME))

In [20]:
filepath = "RNN_Final-{epoch:02d}-{val_acc:.3f}"  # unique file name that will include the epoch and the validation acc for that epoch
# the keyword arguments belong to ModelCheckpoint, not to str.format
checkpoint = ModelCheckpoint("models/{}.model".format(filepath), monitor='val_acc', verbose=1, save_best_only=True, mode='max')  # saves only the best ones

In [21]:
# Train model
history = model.fit(
    train_x, train_y,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(validation_x, validation_y),
    callbacks=[tensorboard, checkpoint],
)

Train on 77922 samples, validate on 3860 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [22]:
# Score model
score = model.evaluate(validation_x, validation_y, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
# Save model
model.save("models/{}".format(NAME))

Test loss: 0.6744038399637055
Test accuracy: 0.586010362756067


## How to improve the accuracy?

### 1. Improve Performance With Data
1. Get More Data.
2. Invent More Data: data augmentation, especially useful for image classification.
3. Rescale Your Data: rescale your data to the bounds of your activation functions. This helps the model learn faster within the same number of epochs.
4. Transform Your Data: a) change the data distribution (skewed Gaussian -> Box-Cox transform; exponential distribution -> log transform); b) pre-process the data with a projection method like PCA.
5. Feature Selection: PCA, Univariate Selection, Recursive Feature Elimination.
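
As an illustration of the log transform mentioned in item 4, the sketch below applies it to a synthetic, exponentially distributed series. The data and the helper `skewness` are assumptions made for this example, and real trading volumes may behave differently.

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.exponential(scale=50.0, size=10_000)  # synthetic, right-skewed data

def skewness(x):
    # third standardized moment: 0 for a symmetric distribution
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

log_volume = np.log1p(volume)  # log(1 + x), which is safe when x is 0

print(skewness(volume), skewness(log_volume))
```

On this synthetic sample, the magnitude of the skewness is smaller after the transform, which is the effect item 4 is after.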

### 2. Improve Performance With Algorithms
This approach is comparatively limited: one can search the literature for methods suited to the specific problem and, if several exist, try each of them. One could also change the resampling method used in the model, i.e. how the data is split and which part is used for training.




### 3. Improve Performance With Algorithm Tuning (or Hyperparameter tuning)

1. Diagnostics: the general approach. Is the model overfitting or underfitting? Cross validation can help answer this. (For many applications, such as time series prediction, standard cross validation cannot be used.)

2. Weight Initialization.

3. Learning Rate: the learning rate is a hyperparameter in almost all ML models; here it is set in the optimizer, i.e. tf.keras.optimizers.Adam(lr=0.001, decay=1e-6).

4. Activation Functions (not recommended).

5. Network Topology: for neural nets, try tuning the number of hidden layers and hidden units.

6. Batches and Epochs: try a grid search of different mini-batch sizes (8, 16, 32, …). Try training for a few epochs and for a very large number of epochs.

7. Regularization: grid search different dropout percentages, weight decay, and penalties such as L1 and L2.

8. Optimization and Loss: we chose the Adam optimizer (tf.keras.optimizers.Adam(lr=0.001, decay=1e-6)). One could choose among optimizers such as Adam, SGD, etc. For the loss function, we chose sparse_categorical_crossentropy (loss='sparse_categorical_crossentropy'). There are various choices of loss function, depending on the application.
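
A grid search over two of the hyperparameters listed above can be sketched as follows. The function `train_and_score` is a hypothetical placeholder: a real version would build, fit and evaluate the model and return its validation accuracy, whereas the formula used here is made up so the sketch runs on its own.

```python
from itertools import product

def train_and_score(batch_size, dropout):
    """Placeholder for a real training run; the returned score is made up."""
    return 0.55 + 0.01 * (batch_size == 32) + 0.02 * (dropout == 0.2)

grid = {
    "batch_size": [16, 32, 64],
    "dropout": [0.1, 0.2, 0.3],
}

results = []
for batch_size, dropout in product(grid["batch_size"], grid["dropout"]):
    score = train_and_score(batch_size, dropout)  # one full training run per combination
    results.append((score, batch_size, dropout))

best_score, best_batch, best_dropout = max(results)  # highest score wins
print(best_batch, best_dropout)
```

The number of training runs grows multiplicatively with every added hyperparameter, so in practice the grid is usually kept small.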


source: 
- https://www.dlology.com/blog/quick-notes-on-how-to-choose-optimizer-in-keras/
- https://machinelearningmastery.com/improve-deep-learning-performance/

#### 3.1 Why is cross validation not used in RNNs?

Standard cross validation takes the data sample, leaves a part out and trains the model on the rest, then trains the model again on a different subset with a different part left out, repeating until every part of the dataset has been left out once.

Most RNN applications are time related, and such datasets are auto-correlated most of the time, i.e. they depend on the order of events. Randomly leaving out parts of the data breaks this order and lets the model train on data that comes after the data it is evaluated on.

For this reason, resampling techniques such as k-fold cross validation do not work in these cases.
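
One alternative that respects time order is walk-forward (expanding window) validation, where every test block lies strictly after its training block. The generator below is a minimal sketch; the name `walk_forward_splits` and its fold sizing are assumptions made for illustration. scikit-learn offers TimeSeriesSplit for the same purpose.

```python
def walk_forward_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices); each test block follows its training block."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # the last test block absorbs any leftover samples
        test_end = fold * (k + 1) if k < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(walk_forward_splits(10, 3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Unlike k-fold, the model is never evaluated on data that precedes its training data, at the cost of the early folds having little training data.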

### 4. Improve Performance With Ensembles

Source:
- https://machinelearningmastery.com/ensemble-methods-for-deep-learning-neural-networks/
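
One simple form of ensembling is to average the class probabilities of several independently trained models and take the most probable class. The sketch below assumes three models whose softmax outputs are made up for illustration. Other schemes, such as voting or stacking, are equally possible.

```python
import numpy as np

# Made-up softmax outputs of three models for the same 4 samples
# (columns: not buy, buy)
preds = np.array([
    [[0.6, 0.4], [0.3, 0.7], [0.55, 0.45], [0.2, 0.8]],
    [[0.7, 0.3], [0.6, 0.4], [0.40, 0.60], [0.3, 0.7]],
    [[0.4, 0.6], [0.2, 0.8], [0.45, 0.55], [0.1, 0.9]],
])

avg = preds.mean(axis=0)               # average over the model axis
ensemble_labels = avg.argmax(axis=1)   # most probable class per sample

print(ensemble_labels)
```

The models disagree on the first three samples, and the averaged probabilities settle each disagreement.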

## How to select features in time series prediction


1. Lag Features, Rolling Window Statistics, Expanding Window Statistics
Source:
https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/
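
The three feature types can be produced with pandas' shift, rolling and expanding methods. This is a minimal sketch on a made-up price series, and the column names are chosen for illustration.

```python
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="close")  # made-up prices

features = pd.DataFrame({
    "close": close,
    "lag_1": close.shift(1),                           # value one step earlier
    "rolling_mean_3": close.rolling(window=3).mean(),  # mean of the last 3 values
    "expanding_max": close.expanding().max(),          # maximum seen so far
})

print(features)
```

The rolling and expanding statistics include the current row. When they are used to predict the current value itself, they are typically computed on the shifted series so that no information leaks in from the value being predicted.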

In [23]:
print(tf.__version__)

1.11.0
