# Final Project: Overview

# Objective

The objective of this project is for you to demonstrate your mastery of the Machine Learning process
**using Neural Networks**.



# Submission requirements

The guidelines will be similar to the Midterm
- you will write a procedure that takes raw data and produces predictions

You will submit a *single* model for evaluation.

**Demonstrate that all cells in your notebook work**

The final cell in your notebook should print the message "Done"
- `print("Done")`
- If we run your notebook and this last cell does not execute your submission will be inadequate

## Testing

*You must perform out of sample testing*.

If you want to perform cross-validation in training, that is fine, but you
must *also* test out of sample to show that you are not over-fitting.

It is up to you to create the out of sample data that you feel best evaluates your model.

We will create holdout data (that we will not show you) for grading.

The procedure you write to make predictions should be able to work on the unseen holdout data
(similar to how it should work for your test set but the holdout set has *no targets*)

    

# The data

Data will be provided to you 
- as multiple files in a directory which we refer to as a *data directory*

The reason for this is that the different files may convey different information.

You will be responsible for deciding
- which files to use
- which fields within the files to use

We will give you a data directory for training.

# Submission guidelines

Here are the basics, a code template that you must complete is in the following cells
- you will be required to store  your model in a file
- you will be required to write a procedure `MyModel` that takes two arguments
    - `test_dir`
        - this is a *relative path* to the holdout data directory
    - `model_path`
        - this is a *relative path* to the file containing your model
- the holdout data directory is similar in structure to the training data directory
    - but without target labels !  It is your job to predict these.
- your procedure must produce predictions given this holdout data directory

This means that your procedure must
- prepare the files in the holdout data directory similar to the way that they were prepared in the training data directory

We will provide you with a sample data directory that will resemble the holdout -- this is so that you
may test the procedure you write for submission.



## Detailed submission guidelines


In **addition to your notebook that trains/evaluates your model**, 
- please also submit an **archive file of the directory** whose name is stored in `model_path`, which 
contains your trained model.
    - use `saveModel` to put your final, trained model in this directory
- We will **not** train your model; we will only use the method `MyModel`
    - which **you** will implement
    - and which uses `loadModel` and the directory whose name is stored in `model_path`
    - this will create the model that we will evaluate


Here is a code template for you to complete
- it will save your model (assuming it is in variable `my_model`)
- it provides the specification for procedure `MyModel`, which *you must complete*


In [87]:
import os
from tensorflow.keras.models import load_model

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid") 

import sklearn
from sklearn.impute import SimpleImputer



from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn import tree
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model
from tensorflow.keras import regularizers

from numpy.random import seed
seed(1)
tf.random.set_seed(2)

cross_validation_k = 5

modelName = "final_model"
model_path = os.path.join(".", modelName)

def saveModel(model, model_path): 
    try:
        os.makedirs(model_path)
    except OSError:
        print("Directory {dir:s} already exists, files will be over-written.".format(dir=model_path))
        
    # Save JSON config to disk
    json_config = model.to_json()
    with open(os.path.join(model_path, 'config.json'), 'w') as json_file:
        json_file.write(json_config)
    # Save weights to disk
    model.save_weights(os.path.join(model_path, 'weights.h5'))
    
    print("Model saved in directory {dir:s}; create an archive of this directory and submit with your assignment.".format(dir=model_path))
    
def loadModel(model_path):
    # Reload the model from the 2 files we saved
    with open(os.path.join(model_path, 'config.json')) as json_file:
        json_config = json_file.read()
    model = tf.keras.models.model_from_json(json_config)
    model.load_weights(os.path.join(model_path, 'weights.h5'))
    
    return model

def MyModel(test_dir, model_path):
    # YOU MAY NOT change model after this statement !
    model = loadModel(model_path)
    
    # It should run model to create an array of predictions; we initialize it to the empty array for convenience
    predictions = []
    
    # We need to match your array of predictions with the examples you are predicting
    # The array below (ids) should have a one-to-one correspondence and identify the example your are predicting
    # For Bankruptcy: the Id column
    # For Stock prediction: the date on which you are making a prediction
    ids = []
    
    # YOUR CODE GOES HERE
    trade_df, all_df = load_data(holdout_dir)
    X_train_all, X_test, y_train_all, y_test = train_test_split(all_df)
    X_train_all, X_test = imputation(X_train_all, X_test)
    useful_tickers = ["AAPL","XLK","SPY","XLY","XLB","XLI","GOOG","IBM","MA"]
    X_train_all,X_test = feature_engineering(X_train_all, X_test, useful_tickers)
    
    scaler = StandardScaler()
    X_train_all_scale = pd.DataFrame(scaler.fit_transform(X_train_all), columns=X_train_all.columns)
    X_test_scale = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

    X_train_all_scale = pd.DataFrame(scaler.fit_transform(X_train_all), columns=X_train_all.columns)
    X_test_scale = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
    
    return_feature_columns = ['Return_'+ticker for ticker in useful_tickers]
    vol_feature_columns = ([f'Vol_{window}_{ticker}' for ticker in useful_tickers for window in [20, 60, 250]] 
                                + [f'Range_{window}_{ticker}' for ticker in useful_tickers for window in [20, 60, 250]])
    volume_feature_columns = ['log(Volume)_'+ticker for ticker in useful_tickers]
    used_features = return_feature_columns + vol_feature_columns + volume_feature_columns

    model_test, y_test_predict = test_final_model(X_train_all_scale, X_test_scale, y_train_all, y_test, model, 'final model', used_features)
    prediction = y_test_predict
    ids = list(y_test.index)
    
    print(y_test_predict)
    return predictions, ids

def load_data(data_dir):
    
    all_tickers = []
    filenames = os.listdir(data_dir)
    for filename in filenames:
        all_tickers.append(filename.split('.')[0])
    
    trade_ticker = 'AAPL'
    
    trade_df = pd.read_csv(os.path.join(data_dir, f'{trade_ticker}.csv')).set_index('Dt')
    trade_df.index = pd.to_datetime(trade_df.index)
    trade_df['Return'] = trade_df['Close'].pct_change(periods=1)
    trade_df['Adj Return'] = trade_df['Adj Close'].pct_change(periods=1)
    

    all_df = trade_df.copy()
    
    
    # merge DataFrame
    for ticker in all_tickers:
        if ticker == trade_ticker:
            continue
        new_df = pd.read_csv(data_dir + ticker + '.csv').set_index("Dt")
        new_df.index = pd.to_datetime(new_df.index)
        new_df['Return'] = new_df['Close'].pct_change(periods=1)
        #all_df = all_df.join(new_df, how='left', rsuffix=f'_{ticker}') 
        new_df["Adj Return"] =new_df["Adj Close"].pct_change(1)
        all_df = all_df.join(new_df, how = "left", rsuffix = "_" + str(ticker))
           
    all_df.columns = [col+'_AAPL' for col in all_df.columns[:11]] + list(all_df.columns[11:])
    all_df['Forward_AAPL'] = all_df['Adj Close_AAPL'].pct_change(periods=1).shift(-1)
    return trade_df, all_df
def train_test_split(all_df):

    X, y = all_df.iloc[:,:-1].astype('float64'), all_df['Forward_AAPL'].astype('float64')

    X_train_all, X_test = X.iloc[: -200, :], X.iloc[-200:-1, :]
    y_train_all, y_test = y.iloc[: -200], y.iloc[-200:-1]
    return X_train_all, X_test, y_train_all, y_test

def imputation(X_train_all, X_test):
    # imputation mean values
    mean_univariate_columns = [col for col in X_train_all.columns if col.split('_')[0] in ['Volume']]
    zero_univariate_columns = [col for col in X_train_all.columns if col.split('_')[0] in ['Return', 'Div']]
    one_univariate_columns = [col for col in X_train_all.columns if col.split('_')[0] in ['Factor']]
    first_univariate_columns = [col for col in X_train_all.columns if col.split('_')[0].split(' ')[0] in ['Adj Close', 'Open', 'High', 'Low', 'Close']]


    # imputation with mean
    imputer_uni_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    X_train_all[mean_univariate_columns] = imputer_uni_mean.fit_transform(X_train_all[mean_univariate_columns])
    X_test[mean_univariate_columns] = imputer_uni_mean.transform(X_test[mean_univariate_columns])

    # imputation with zero
    imputer_uni_zero = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
    X_train_all[zero_univariate_columns] = imputer_uni_zero.fit_transform(X_train_all[zero_univariate_columns])
    X_test[zero_univariate_columns] = imputer_uni_zero.transform(X_test[zero_univariate_columns])

    # imputation with one
    imputer_uni_one = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=1)
    X_train_all[one_univariate_columns] = imputer_uni_one.fit_transform(X_train_all[one_univariate_columns])
    X_test[one_univariate_columns] = imputer_uni_one.transform(X_test[one_univariate_columns])

    # imputation with first value
    X_train_all[first_univariate_columns] = X_train_all[first_univariate_columns].fillna(method='bfill')

    return X_train_all, X_test

def feature_engineering(X_train_all, X_test, useful_tickers):
    X_all = pd.concat([X_train_all, X_test])
    
    # technical
    for ticker in useful_tickers:
        for window in [7, 14, 28]:
            X_all[f'RSI_{window}_{ticker}'] = RSI(X_all['Adj Close_'+ticker], window)
            X_all[f'OBV_{window}_{ticker}'] = OBV(X_all['Adj Close_'+ticker], X_all['Volume_'+ticker], window)
    
    # volume
    for ticker in useful_tickers:
        X_all['log(Volume)_'+ticker] = np.log(X_all['Volume_'+ticker])
    
    
    # div/split
    div_event = X_all['Div_AAPL']>0
    split_event = X_all['Factor_AAPL']>1

    X_all['is_div'] = X_all['Div_AAPL']>0
    X_all['is_split'] = X_all['Factor_AAPL']>1
    
    # volatility
    for ticker in useful_tickers:
        for window in [20, 60, 250]:
            X_all[f'Vol_{window}_{ticker}'] = X_all['Return_'+ticker].rolling(window).std().fillna(0)
            X_all[f'Range_{window}_{ticker}'] = ((X_all['High_'+ticker] - X_all['Low_'+ticker])/X_all['Close_'+ticker]).rolling(window).mean().fillna(0)
    
    X_train_all, X_test = X_all.iloc[:-200, :], X_all.iloc[-200:, :]
    
    return X_train_all, X_test

def RSI(price, window):
    change = price.diff(1).fillna(0)
    is_gain, is_loss = change > 0, change < 0
    gain, loss = change, -change
    gain[is_loss] = 0
    loss[is_gain] = 0
    rs = gain.ewm(span=window).mean().fillna(0) / loss.ewm(span=window).mean().fillna(0)
    rsi = (0.5 - (1 / (1 + rs))).fillna(0)
    return rsi

def OBV(price, volume, window):
    change = price.diff(1).fillna(0)
    return (np.sign(change) * volume).ewm(span=window).mean().fillna(0)

def test_final_model(X_train_all_scale, X_test_scale, y_train_all, y_test, model_test, model_name_test, used_features, loss='mean_squared_error'): 
    callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=100)    
    X_train_all_rnn, y_train_all_rnn, X_test_rnn, y_test_rnn, X_train_rnn, y_train_rnn, X_valid_rnn, y_valid_rnn = transform_dataset(X_train_all_scale[used_features], X_test_scale[used_features], y_train_all, y_test, 100, 1, verbose=False)
    
    optimizer = keras.optimizers.Adam(lr=0.005)
    model_test.compile(loss=loss, optimizer=optimizer, metrics='mean_squared_error')


    score_test = model_test.evaluate(X_test_rnn, y_test_rnn, verbose=0)
    y_test_predict = model_test.predict(X_test_rnn)
    r_square = r2_score(y_test_rnn, y_test_predict)
    corr = np.corrcoef(y_test_rnn.T, y_test_predict.T)[0,1]
    
    num_parameters = model_test.count_params()

    print('{n:s}: Test Loss: {l:3.4%}, MSE: {m:3.4%}, R^2: {r:3.4%}, Corr: {c:3.4%}'.format(n=model_name_test, l=score_test[0], m=score_test[1], r=r_square, c=corr))
    print('Parameters number in model:', num_parameters)
    return model_test, y_test_predict

def transform_dataset(train_set, test_set, y_train, y_test, n_input, n_output, verbose=True):
    all_data = np.vstack((train_set, test_set)).astype('float64')
    y_set = y_train.append(y_test).astype('float64')
    X = []
    y = []
    
    for i in range(all_data.shape[0] - n_input - n_output + 2):
        X_sample = all_data[i:i + n_input, :]
        y_sample = y_set[i + n_input - 1:i + n_input - 1 + n_output]
        X.append(X_sample)
        y.append(y_sample)   
    X = np.array(X)
    y = np.array(y)    
    train_all_X = X[:train_set.shape[0] - n_input + 1, :, :]
    train_all_y = y[:train_set.shape[0] - n_input + 1, :]
    test_X = X[train_set.shape[0] - n_input + 1:, :, :]
    test_y = y[train_set.shape[0] - n_input + 1:, :]
    
    num_train = int(train_all_X.shape[0] * 0.9)
    
    train_X, valid_X = train_all_X[:num_train], train_all_X[num_train:]
    train_y, valid_y = train_all_y[:num_train], train_all_y[num_train:]
    
    if verbose:
        print(f'X_train_all_rnn shape: {train_all_X.shape}')
        print(f'X_train_rnn shape: {train_X.shape}')
        print(f'X_valid_rnn shape: {valid_X.shape}')
        print(f'X_test_rnn shape: {test_X.shape}')

        print(f'y_train_all_rnn shape: {train_all_y.shape}')
        print(f'y_train_rnn shape: {train_y.shape}')
        print(f'y_valid_rnn shape: {valid_y.shape}')
        print(f'y_test_rnn shape: {test_y.shape}')
    
    return train_all_X, train_all_y, test_X, test_y, train_X, train_y, valid_X, valid_y

# Assign to variable my_model the model that is your final model (the one  you will be evaluated on)
my_model = "Final Model"

#saveModel(my_model, model_path)

## Evaluate your model on the holdout data directory

**You must run the following cell** from the directory that contains your model file

Here is how we will evaluate your submission
- we will create a directory whose only content is
    - sub-directory `Data`
- we will copy your model file to this directory with the name stored in `model_path`
- we will run the cell in your notebook that should be a copy of the one below
    - it calls procedure `MyModel` with the arguments given below
    - your implementation of `MyModel`
        - must successfully load your model file, *given where **we** have place it as described above*
        - must successfully return one prediction for each example in the holdout directory *given where **we** have placed the holdout directory*

In [88]:
holdout_dir = os.path.join(".", "Data", "holdout/")
predicts = MyModel(holdout_dir, model_path)

final model: Test Loss: 0.0135%, MSE: 0.0135%, R^2: -1.9709%, Corr: -9.5251%
Parameters number in model: 3051
[[9.5037068e-04]
 [9.5661380e-04]
 [9.1180031e-04]
 [6.9428346e-04]
 [4.1659654e-04]
 [1.6480929e-04]
 [1.7510564e-04]
 [2.4387107e-04]
 [4.4860493e-04]
 [8.1398472e-04]
 [1.1344305e-03]
 [1.3050594e-03]
 [1.2980780e-03]
 [1.2412587e-03]
 [1.1838038e-03]
 [1.4835986e-03]
 [1.6692167e-03]
 [1.7501276e-03]
 [1.9981484e-03]
 [2.4539151e-03]
 [2.3831320e-03]
 [2.4313454e-03]
 [2.1785665e-03]
 [1.9734283e-03]
 [1.9685074e-03]
 [1.7359196e-03]
 [1.6249027e-03]
 [1.5334247e-03]
 [1.6193867e-03]
 [1.6661806e-03]
 [1.5943178e-03]
 [1.3590676e-03]
 [1.1512742e-03]
 [1.0577904e-03]
 [9.7055751e-04]
 [7.8959658e-04]
 [5.9961784e-04]
 [4.0577247e-04]
 [3.5612797e-04]
 [4.7699921e-04]
 [5.6086486e-04]
 [4.5325712e-04]
 [3.6191160e-04]
 [2.2595143e-04]
 [1.4900789e-04]
 [4.3107779e-05]
 [6.4730295e-05]
 [7.5951335e-05]
 [8.0452301e-05]]


In [89]:
print("Done")

Done
