## Sentdex RNN with Keras
### This is an ipynb notebook that attempts to speed up and simplify the creation of sequence data for the RNN that Harrison created in his tutorial:
#### See (https://pythonprogramming.net/crypto-rnn-model-deep-learning-python-tensorflow-keras/)

### Imports

In [None]:
import pandas as pd
import numpy as np
import os,sys
from collections import deque
from sklearn import preprocessing
import datetime,time
import tensorflow as tf
from tensorflow.keras.models import Sequential
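# CuDNNLSTM is not imported: the notebook never uses it, and the import fails on TensorFlow versions that no longer ship it
from tensorflow.keras.layers import RNN, Dense, Dropout, LSTM, BatchNormalization, Flatten, SimpleRNN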
# from tensorflow.keras.callbacks import TensorBoard
# from tensorflow.keras.callbacks import ModelCheckpoint

### Description of Data
* You will create a pandas DataFrame that holds price and volume data for one-minute currency bars.
* You must transform the rows of the DataFrame into 4 numpy arrays:
 1. train_x: shape = (num_training_rows-SEQ_LEN,SEQ_LEN,NUM_FEATURES)
 2. train_y: shape = (num_training_rows-SEQ_LEN,SEQ_LEN,1)  _We assume that your target is a scalar number._
 3. validation_x: shape = (num_validation_rows-SEQ_LEN,SEQ_LEN,NUM_FEATURES)
 4. validation_y: shape = (num_validation_rows-SEQ_LEN,SEQ_LEN,1) 
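
The shapes above come from sliding a window of SEQ_LEN rows over the feature rows. The following is a minimal sketch of that idea on a toy array; the array contents and the small SEQ_LEN are assumptions chosen for illustration, not the notebook's data.

```python
import numpy as np

SEQ_LEN = 4        # window length (the notebook uses 60)
NUM_FEATURES = 2   # e.g. one close column and one volume column

# Toy feature matrix: 10 rows, 2 features (hypothetical values)
rows = np.arange(10 * NUM_FEATURES, dtype=float).reshape(10, NUM_FEATURES)

# One window per valid starting row; each window holds SEQ_LEN consecutive rows
windows = np.stack([rows[i:i + SEQ_LEN] for i in range(len(rows) - SEQ_LEN + 1)])

print(windows.shape)  # (7, 4, 2)
```

This sketch yields `len(rows) - SEQ_LEN + 1` windows. In the notebook, the final row count also depends on the per-day grouping done in (3.4).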
___
### IMPORTANT
* **The currency data for the project is in the folder ./data/crypto_data .** 
* **_I HAVE ADDED HEADERS to each of the original csv files._** 
* **The original csv files without the headers are in a folder called ./data/crypto_data_original .**
* **The DataFrame that combines all of the currency price and volume data gets assembled in (2.1).**
___
### ALSO IMPORTANT
* **Run the cells in this notebook several times.  I have seen instances where, on the very first run of the notebook, the accuracy numbers are much lower than in subsequent runs.**

___
### (1.1) Define parameters:
___
1. SCALING_TO_USE: one of ['actual', 'pct', or 'minmax']
 * 'actual' : use the completely unscaled price and volume data for each currency pair
 * 'pct' : use percent change data for price and volume for each currency pair
 * 'minmax' : use percent change data that is further scaled to be between 0 and 1
2. SEQ_LEN: You divide your original feature data into an array of shape (-1,SEQ_LEN,num_features)
3. FUTURE_PERIOD_PREDICT: the number of rows beyond the last row of each sequence that you use to create your target variable.  If your first sequence of training data comes from rows
4. RATIO_TO_PREDICT: the currency that you will use as the target.  You will try to predict whether this currency's price goes up or down after the number of periods in FUTURE_PERIOD_PREDICT.
5. EPOCHS: Number of epochs to input into the keras model
6. BATCH_SIZE: number of sequences in each batch, where each sequence has shape (SEQ_LEN,NUM_FEATURES) for x data and (SEQ_LEN,1) for y data (labels).
___
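As a rough illustration of how FUTURE_PERIOD_PREDICT turns close prices into an up/down target, here is a minimal sketch. The prices and the column names `close`, `future` and `target` are hypothetical; the notebook builds the equivalent columns for the RATIO_TO_PREDICT currency in (2.1).

```python
import pandas as pd

FUTURE_PERIOD_PREDICT = 3

# Hypothetical close prices for six consecutive one-minute bars
df = pd.DataFrame({"close": [10.0, 11.0, 9.0, 12.0, 8.0, 13.0]})

# Close price FUTURE_PERIOD_PREDICT rows ahead; the last 3 rows have no future value (NaN)
df["future"] = df["close"].shift(-FUTURE_PERIOD_PREDICT)

# Target is True when the future close is at or above the current close
df["target"] = df["future"] >= df["close"]

print(df)
```

Note that a comparison against NaN evaluates to False, so in this sketch the last FUTURE_PERIOD_PREDICT rows are labeled False even though their future price is unknown. You may prefer to drop those rows.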

In [None]:
# change these values
# Choose a value for SCALING_TO_USE from ['actual', 'pct', 'minmax']
SCALING_TO_USE = 'minmax' 
SEQ_LEN = 60  # how long of a preceding sequence to collect for RNN
FUTURE_PERIOD_PREDICT = 3  # how far into the future are we trying to predict?
RATIO_TO_PREDICT = "LTCUSD" # change this to use different currencies as the target
EPOCHS = 3  # how many passes through our data
BATCH_SIZE = 64  # how many sequences per batch? Try a smaller batch size if you're getting OOM (out of memory) errors.
USE_LSTM = False # If True, use keras.layers.LSTM as first layer.  Otherwise, use keras.layers.SimpleRNN

# DO NOT CHANGE THE VALUES BELOW
DATA_FOLDER = './data/crypto_data'
CURRENCY_FILE_NAMES = os.listdir(DATA_FOLDER) 
CURRENCY_LIST = [c.replace('-','').replace('.csv','') for c in CURRENCY_FILE_NAMES]
if RATIO_TO_PREDICT not in CURRENCY_LIST:
    raise ValueError(f'the currency {RATIO_TO_PREDICT} is not in the valid currency list {CURRENCY_LIST}')
print(f'possible currencies to use below are: {CURRENCY_LIST}')


NAME = f"{SEQ_LEN}-SEQ-{FUTURE_PERIOD_PREDICT}-PRED-{int(time.time())}"

print(f'using {SCALING_TO_USE} as scaling')
print(f'using {RATIO_TO_PREDICT} currency as a target')

___
### (2.1) Create DataFrame of close and volume for each currency
___

In [None]:
days_in_future = FUTURE_PERIOD_PREDICT
# define variable to hold final DataFrame with combined currency prices and volumes, and target values
df_curr = None
# loop through each currency csv file
for i in range(len(CURRENCY_LIST)):
    cfn = CURRENCY_FILE_NAMES[i]
    c  = CURRENCY_LIST[i]
    df_temp = pd.read_csv(f'{DATA_FOLDER}/{cfn}')
    df_temp = df_temp.rename(columns={'close':f'close_{c}','volume':f'volume_{c}'})
    df_temp.index = df_temp.timestamp
    df_temp = df_temp[[f'close_{c}',f'volume_{c}']]
    if i ==0:
        df_curr = df_temp.copy()
    else:
        df_curr = pd.concat([df_curr,df_temp],axis=1)
close_string = f'close_{RATIO_TO_PREDICT}'
future_close_string = f'close_{RATIO_TO_PREDICT}_future'
df_curr[future_close_string] = df_curr[close_string].shift(-days_in_future)
df_curr['target'] = df_curr[future_close_string] >= df_curr[close_string]
df_curr['hour'] = [datetime.datetime.fromtimestamp(d).hour for d  in df_curr.index]
df_curr['minute'] = [datetime.datetime.fromtimestamp(d).minute for d  in df_curr.index]
df_curr['year'] = [datetime.datetime.fromtimestamp(d).year for d  in df_curr.index]
df_curr['month'] = [datetime.datetime.fromtimestamp(d).month for d  in df_curr.index]
df_curr['day'] = [datetime.datetime.fromtimestamp(d).day for d  in df_curr.index]

# get all columns except future_close_string
cols = list(filter(lambda co: co!=future_close_string,df_curr.columns.values))


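# keep only rows between 08:00 and 17:59; copy so that the column assignment below does not write to a slice
df_curr = df_curr[(df_curr.hour>=8) & (df_curr.hour<=17)].copy()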
df_curr['yyyymmdd'] = df_curr.year*100*100 + df_curr.month*100 + df_curr.day
df_curr = df_curr[['yyyymmdd','hour','minute']+cols]

len(df_curr),df_curr.columns.values


___
### (3.1) Create separate training and validation DataFrames
Create 2 DataFrames:
1. df_train
2. df_test
___
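The split is made by whole days rather than by rows, so that no day contributes to both sets. A minimal sketch of a chronological day-based split follows; the toy frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 10 distinct days, 3 rows per day
days = np.repeat(np.arange(20180801, 20180811), 3)
df = pd.DataFrame({"yyyymmdd": days, "close": np.arange(len(days), dtype=float)})

train_perc = 0.9
all_days = np.unique(df["yyyymmdd"].to_numpy())   # unique days, sorted ascending
last_train_index = int(len(all_days) * train_perc)

# Earlier days go to training, the remaining days to testing
df_train = df[df["yyyymmdd"].isin(all_days[:last_train_index])]
df_test = df[df["yyyymmdd"].isin(all_days[last_train_index:])]

print(len(df_train), len(df_test))  # 27 3
```

Sorting the days before slicing keeps every test day later than every training day, which is usually what you want for time series data.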

In [None]:
# get all yyyymmdd's in df_curr.yyyymmdd, sorted so that the training days precede the test days
all_days = np.unique(df_curr.yyyymmdd.values)
# establish a training size percent
train_perc = .9
# get the index into the array all_days, which represents the last training day
last_train_index = int(len(all_days)*train_perc)
# get training days
train_days = all_days[:last_train_index]
# get df_train
df_train = df_curr[df_curr.yyyymmdd.isin(train_days)]
# get test days
test_days = all_days[last_train_index:]
# get df_test
df_test = df_curr[df_curr.yyyymmdd.isin(test_days)]
# print various lengths to make sure the df_train and df_test DataFrames sum to df_curr
print(f'df_curr:{len(df_curr)}, df_train:{len(df_train)}, df_test:{len(df_test)}, should be 0: {len(df_curr)-(len(df_train)+len(df_test))}')


___
### (3.2) normalize data
For both training and validation data, return a dictionary with 3 DataFrames:
1. A DataFrame of actual prices, and the actual target values
2. A DataFrame with percentage prices, and the actual target values
3. A DataFrame with min/max normalized percentage of the prices, and the actual (not normalized) target values.  
___
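To make the three scalings concrete, here is a minimal sketch on a single hypothetical price series. The values are made up; the notebook applies the same steps to every close and volume column.

```python
import pandas as pd

# Hypothetical close prices
close = pd.Series([100.0, 110.0, 99.0, 108.9, 108.9])

actual = close                                         # 'actual': raw values
pct = close.pct_change().dropna()                      # 'pct': change from the previous row; first row is NaN and dropped
minmax = (pct - pct.min()) / (pct.max() - pct.min())   # 'minmax': pct rescaled to the range [0, 1]

print(pct.round(4).tolist())     # approximately [0.1, -0.1, 0.1, 0.0]
print(minmax.round(4).tolist())  # approximately [1.0, 0.0, 1.0, 0.5]
```

Because the notebook calls `normalize_df` separately on the training and test frames, each frame is scaled with its own minimum and maximum.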

In [None]:
def normalize_df(df):
    # price and volume columns are the only ones that get scaled
    num_cols = list(filter(lambda c:'close' in c or 'volume' in c,df.columns.values))
    dft = df.copy()
    dft.target = dft.target.astype(int)
    # drop rows with missing values
    dft = dft[~dft.isnull().any(axis=1)]
    dft_actual = dft.copy()
    # percent change from the previous row; the first row becomes NaN and is dropped
    dft[num_cols] = dft[num_cols].pct_change()
    dft = dft[~dft.isnull().any(axis=1)]
    dft_pct = dft.copy()
    # rescale each percent-change column to the range [0, 1]
    dft[num_cols] = (dft[num_cols] - dft[num_cols].min()) / (dft[num_cols].max() - dft[num_cols].min())
    dft_minmax = dft.copy()
    return {'actual':dft_actual,'pct':dft_pct,'minmax':dft_minmax}

training = normalize_df(df_train)
testing = normalize_df(df_test)

#### (3.3) Show the head of the actual, pct and minmax training data.

In [None]:
training['actual'].head()

In [None]:
training['pct'].head()

In [None]:
training['minmax'].head()

___
###  (3.4) Create sequences from the minmax DataFrames
From the input DataFrame df, create a new DataFrame with:
1. an 'x' column where each cell is an array of "time sequences" (shape of array in cell = (SEQ_LEN,num_x_cols))
2. a 'y' column where each cell is an array of "time sequences" (shape of array in cell = (SEQ_LEN,1))
___
_**These methods assume that the input DataFrame has a Unix timestamp in the index of each DataFrame row.**_
___
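The sketch below illustrates why the windows are built one day at a time: no window ever spans two different days. It is a simplified stand-in under stated assumptions (a toy frame, one feature column, and only the last target of each window), not the notebook's implementation.

```python
import numpy as np
import pandas as pd

SEQ_LEN = 3  # short window for illustration

# Hypothetical frame: two days with 4 rows each
df = pd.DataFrame({
    "yyyymmdd": [20180801] * 4 + [20180802] * 4,
    "close": np.arange(8, dtype=float),
    "target": [0, 1, 0, 1, 1, 0, 1, 0],
})

x_windows, y_last = [], []
for _, day in df.groupby("yyyymmdd", sort=True):
    x = day[["close"]].to_numpy()
    y = day["target"].to_numpy()
    for i in range(len(day) - SEQ_LEN + 1):
        x_windows.append(x[i:i + SEQ_LEN])   # SEQ_LEN consecutive rows from a single day
        y_last.append(y[i + SEQ_LEN - 1])    # target of the last row in the window

x_arr = np.stack(x_windows)
y_arr = np.array(y_last)
print(x_arr.shape, y_arr.shape)  # (4, 3, 1) (4,)
```

Each 4-row day contributes 2 windows, so the 8 rows produce 4 windows rather than the 6 that a single pass over all rows would give.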

In [None]:

def create_sequences(df,return_full_y_sequences):
    '''
    Build sliding windows of SEQ_LEN consecutive rows from df.
    Return a DataFrame with an 'x' column (feature windows) and a 'y' column
       (full target windows, or only the target of the last row of each window).
    '''
    # copy the input DataFrame
    df2 = df.copy()
    # Move the timestamps from the DataFrame's index into a 'ts' column
    # (note that 'ts' stays among the x feature columns below)
    df2['ts'] = df2.index
    df2.index = list(range(len(df2)))
    dfy = df2[['target']]
    dfx = df2.drop(['target'],axis=1)
    # .values replaces as_matrix(), which newer pandas versions no longer provide
    dx = dfx.values
    dy = dfy.values
    # use the configured sequence length instead of a hard-coded 60
    sl = SEQ_LEN
    # one window per valid starting row
    x_sequences = [dx[i:i+sl] for i in range(len(dx)-(sl-1))]
    y_sequences = [dy[i:i+sl] for i in range(len(dy)-(sl-1))]
    x_seq_dicts = [{'x':x_seq} for x_seq in x_sequences]
    if return_full_y_sequences:
        y_seq_dicts = [{'y':y_seq} for y_seq in y_sequences]
    else:
        y_seq_dicts = [{'y':y_seq[sl-1]} for y_seq in y_sequences]
    dfx = pd.DataFrame(x_seq_dicts)
    dfy = pd.DataFrame(y_seq_dicts)
    df_seq = dfx.join(dfy)
    return df_seq


from tqdm import tqdm
def create_daily_sequences(df,return_full_y_sequences):
    '''
    Given an input DataFrame df, run create_sequences on subsets of df 
       where each subset contains data for only a specific day.
    Combine the results of all the per-day calls to create_sequences into a single DataFrame
       with an 'x' column and a 'y' column.
    '''
    cols_to_return = list(filter(lambda c: any([d in c for d in ['close','volume','target']]),df.columns.values))
    df_seqs = None
    # unique days, sorted so that the sequences stay in chronological order
    all_days = np.unique(df.yyyymmdd.values)
    for yyyymmdd in tqdm(all_days):
        df_this_day = df[df.yyyymmdd==yyyymmdd][cols_to_return]
        df_this_seqs = create_sequences(df_this_day,return_full_y_sequences)
        if df_seqs is None:
            df_seqs = df_this_seqs.copy()
        else:
            if len(df_this_seqs)>0:
                # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
                df_seqs = pd.concat([df_seqs,df_this_seqs])
    return df_seqs


### (3.5) Run create_daily_sequences to get final input data for the RNN
Produce 4 arrays:
1. train_x
2. train_y
3. validation_x
4. validation_y
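
The cell below also trims each array to a multiple of BATCH_SIZE. A stateful recurrent layer with a fixed `batch_input_shape` expects every batch to hold exactly BATCH_SIZE samples, so a partial final batch has to be dropped. A minimal sketch of that trimming, with hypothetical array sizes:

```python
import numpy as np

BATCH_SIZE = 64

# Hypothetical sequence arrays: 200 samples, 60 steps, 5 features
x = np.zeros((200, 60, 5))
y = np.zeros((200, 1))

# Largest multiple of BATCH_SIZE that does not exceed the number of samples
n_even = x.shape[0] // BATCH_SIZE * BATCH_SIZE
x_even, y_even = x[:n_even], y[:n_even]

print(x_even.shape, y_even.shape)  # (192, 60, 5) (192, 1)
```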

In [None]:
use_full_seqs =  USE_LSTM
df_train_seqs = create_daily_sequences(training[SCALING_TO_USE],use_full_seqs)
train_x = np.array([df_train_seqs['x'].values[i] for i in range(len(df_train_seqs))])
train_y = np.array([df_train_seqs['y'].values[i] for i in range(len(df_train_seqs))])

df_test_seqs = create_daily_sequences(testing[SCALING_TO_USE],use_full_seqs)
validation_x = np.array([df_test_seqs['x'].values[i] for i in range(len(df_test_seqs))])
validation_y = np.array([df_test_seqs['y'].values[i] for i in range(len(df_test_seqs))])

txl = train_x.shape[0]//BATCH_SIZE * BATCH_SIZE
train_x_even = train_x[:txl]
train_y_even = train_y[:txl]

vxl = validation_x.shape[0]//BATCH_SIZE * BATCH_SIZE
validation_x_even =  validation_x[:vxl]
validation_y_even =  validation_y[:vxl]

train_x_even.shape,train_y_even.shape,validation_x_even.shape,validation_y_even.shape


___
# TO DO, implement this
### Balance data
Create sequences whose corresponding target values have an equal number of 0's and 1's
___
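One possible way to balance the classes directly on numpy arrays is sketched below. The helper `balance_arrays` is hypothetical and is not part of the notebook; it assumes one scalar 0/1 target per sequence and simply keeps the first n examples of each class, where n is the size of the smaller class.

```python
import numpy as np

def balance_arrays(x, y):
    """Hypothetical helper: keep an equal number of 0 and 1 targets."""
    y_flat = y.reshape(-1)
    idx0 = np.flatnonzero(y_flat == 0)   # positions of the 0 targets
    idx1 = np.flatnonzero(y_flat == 1)   # positions of the 1 targets
    n = min(len(idx0), len(idx1))        # size of the smaller class
    # keep n examples of each class, restored to their original order
    keep = np.sort(np.concatenate([idx0[:n], idx1[:n]]))
    return x[keep], y[keep]

# Toy data: 6 sequences of shape (2, 1), with four 1's and two 0's
x = np.arange(6 * 2 * 1, dtype=float).reshape(6, 2, 1)
y = np.array([1, 1, 0, 1, 0, 1])
bx, by = balance_arrays(x, y)
print(bx.shape, by.tolist())  # (4, 2, 1) [1, 1, 0, 0]
```

Unlike the `balance_data` draft in the next cell, this sketch sorts the kept indices so that the sequences stay in their original order.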

In [None]:
def balance_data(tseqs):
    # get the target (y) values from the tseqs['y'] array, and put them into a DataFrame for indexing
    dfy  = pd.DataFrame({'y':tseqs['y'].reshape(-1)})
    # split dfy into 2 DataFrames: dfy0 contains 0's, and dfy1 contains 1's
    dfy0 = dfy[dfy.y==0]
    dfy1 = dfy[dfy.y==1]
    # find the length of the smaller DataFrame
    dfy_len = min(len(dfy0),len(dfy1))
    # create a new DataFrame which contains an equal number of targets with 1's and 0's
    # (pd.concat replaces DataFrame.append, which newer pandas versions no longer provide)
    dfy01 = pd.concat([dfy0.iloc[:dfy_len],dfy1.iloc[:dfy_len]])
    # get the index values of this new DataFrame
    dfy_indices= dfy01.index
    # use those index values to locate the x sequences that correspond to the targets in dfy01
    tsy = tseqs['y'][dfy_indices]
    tsx = tseqs['x'][dfy_indices]
    # return 2 arrays: 1 for the x sequences and another for the corresponding targets
    return tsx,tsy



#### Run balance_data

___
## (4.1) Now create the Keras model and run it
___

In [None]:

model = Sequential()
# model.add(LSTM(128, input_shape=(train_x.shape[1:]), return_sequences=True))
if USE_LSTM:
    model.add(LSTM(128, return_sequences=True,stateful=True,
              batch_input_shape=(BATCH_SIZE,train_x.shape[1],train_x.shape[2])))
    model.add(Dropout(0.2))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.2))
else:
    model.add(SimpleRNN(128))

model.add(Dense(1, activation='sigmoid'))


In [None]:
opt = tf.keras.optimizers.Adam(lr=0.00001, decay=1e-6)

# Compile model
model.compile(
    loss='binary_crossentropy',
#     loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy']
)

# Train model
history = model.fit(
    train_x_even, train_y_even,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    shuffle=True,
    validation_data=(validation_x_even, validation_y_even))


### End