# Project Implementation

## Install and import the required libraries

<br>
We start the implementation by importing the required libraries. We will work mainly with yfinance for data collection, pandas and NumPy for data processing, and TensorFlow for machine learning.

<br>
Other relevant libraries are keras_tuner for hyperparameter optimization, scikit-learn for data scaling and model evaluation, pandas-ta for calculating technical indicators based on the data from yfinance, and matplotlib for visualization.

In [1]:
# install Dependencies and import libraries
# !pip install yfinance pandas numpy tensorflow keras-tuner scikit-learn pandas-ta matplotlib

In [2]:
# https://pypi.org/project/yfinance/ (""" it's an open-source tool that uses Yahoo's publicly available APIs, and is intended for research and educational purposes. """)
# import yfinance, our data source
import yfinance as yf

# import pandas and numpy
import pandas as pd 
import numpy as np

# import from tensorflow
from tensorflow.keras.models import Sequential, load_model, Model
from tensorflow.keras.layers import SimpleRNN, Dense, LSTM, Input, GRU, SeparableConv1D, BatchNormalization, MaxPooling1D, add, Layer, concatenate
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam, RMSprop, SGD
from tensorflow.keras.saving import register_keras_serializable

# import from keras_tuner
from keras_tuner import HyperModel, Hyperband, Tuner, Oracle

# import from scikit-learn
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, ConfusionMatrixDisplay

# https://pypi.org/project/pandas-ta/ ("""An easy to use Python 3 Pandas Extension with 130+ Technical Analysis Indicators. Can be called from a Pandas DataFrame or standalone""")
# import pandas-ta
import pandas_ta as ta

# import matplotlib for data visualisation
import matplotlib.pyplot as plt

# datetime lets us measure how long a process takes
from datetime import datetime
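
As a minimal sketch of how `datetime` can be used for timing (the workload and the variable names are illustrative and not taken from the project):

```python
from datetime import datetime

# record the wall-clock time before the work starts
start = datetime.now()

# placeholder workload standing in for a long-running step such as model training
total = sum(i * i for i in range(100_000))

# subtracting two datetimes gives a timedelta
elapsed = datetime.now() - start
print(f"Finished in {elapsed.total_seconds():.4f} seconds")
```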

## Load Data


<br>
In this implementation, we will work with 5 stocks from the S&P 500 (1) list. They were chosen by their position in the list when it is ranked from most valuable to least valuable: each stock sits relatively far from the others in that ranking and belongs to a different industry. This gives us a more diverse sample, which should help our model evaluation results generalize and reduce the risk of bias and overfitting.

Check out our stock list for this project (2).

<br>
The yfinance API lets us request a company's stock data for a given period and interval. We will set the period to 10 years (or the maximum available), which is sufficient for all of our experiments. The interval determines the frequency of the data rows, and we will experiment with several values to see whether our approach generalizes better at specific intervals. Different intervals are relevant to different groups of financial analysts and traders in the real world, so we aim to build the best model for each of these groups.
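
To make the role of the interval concrete, the sketch below aggregates synthetic daily bars into weekly bars with pandas. It only illustrates what a coarser interval means for the rows. The data is made up, and it is an assumption (not verified here) that the weekly bars returned by yfinance follow the same aggregation rules.

```python
import numpy as np
import pandas as pd

# ten synthetic business days (Mon 2024-01-01 to Fri 2024-01-12)
days = pd.bdate_range('2024-01-01', periods=10)
base = np.arange(10, 20, dtype=float)
daily = pd.DataFrame({
    'open': base,
    'high': base + 1.0,
    'low': base - 1.0,
    'close': base + 0.5,
    'volume': np.full(10, 100),
}, index=days)

# one row per calendar week: first open, highest high, lowest low, last close, summed volume
weekly = daily.resample('W').agg({
    'open': 'first',
    'high': 'max',
    'low': 'min',
    'close': 'last',
    'volume': 'sum',
})
print(weekly)
```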

That's why we will define a function that downloads the data for any number of stocks at any period and interval, saves it as a CSV file to local storage, loads it from storage, splits it into one dataframe per stock, and organizes the dataframes in a dictionary so they are easy to work with for the rest of the project.

Check out the loadData function (3).
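
The reshaping steps described above can be sketched on a small synthetic frame, without any network access. The column layout used here (a `Price` level and a `Ticker` level) is an assumption about what a multi-ticker download looks like, and an in-memory buffer stands in for the CSV file on disk.

```python
import io

import numpy as np
import pandas as pd

# synthetic stand-in for a multi-ticker download: columns are (Price, Ticker) pairs
columns = pd.MultiIndex.from_product(
    [['Close', 'Volume'], ['AAA', 'BBB']], names=['Price', 'Ticker']
)
index = pd.Index(['2024-01-01', '2024-01-08'], name='Date')
wide = pd.DataFrame(np.arange(8, dtype=float).reshape(2, 4), index=index, columns=columns)

# move the Ticker level from the columns into the row index
long = wide.stack(level='Ticker')
long.columns = long.columns.str.lower()

# round-trip through CSV, then rebuild the (Date, Ticker) index
buffer = io.StringIO()
long.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer).set_index(['Date', 'Ticker'])

# one dataframe per ticker, keyed by symbol
per_ticker = {t: restored.xs(t, level='Ticker') for t in ['AAA', 'BBB']}
print(per_ticker['AAA'])
```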

In [3]:
# insert the stock symbols into a list
symbols_list = ['PFE', 'ROP', 'XYL', 'CPAY', 'INCY']

In [4]:
# define a function that loads the data from local storage if available, otherwise downloads it from the yfinance API and saves it as a CSV
def loadData(symbols=symbols_list, period='10y', interval='1wk'):

    # the file name encodes the period and interval, so each combination is cached separately
    file_name = f'{period}_{interval}_stocks_data.csv'

    try:
        # load the dataframe from the CSV file if it already exists
        df = pd.read_csv(file_name).set_index(['Date', 'Ticker'])

        print("Data loaded from directory")

    except FileNotFoundError:
        # the data has not been saved yet, so it needs to be downloaded from yfinance
        print(f"There is no {file_name}. Data will be downloaded from yfinance.")

        # download the data from the source and store it in stocks_data as a pandas dataframe
        stocks_data = yf.download(symbols, period=period, interval=interval)

        # move the ticker level from the columns to the index to get a multi-level index dataframe
        stocks_data = stocks_data.stack()

        # source: https://www.statology.org/pandas-change-column-names-to-lowercase/
        # convert column names to lowercase
        stocks_data.columns = stocks_data.columns.str.lower()

        # save the dataframe to a CSV file so we don't make unnecessary requests to the API every time we reload the notebook
        stocks_data.to_csv(file_name, index=True)

        # load the dataframe back from the CSV file
        df = pd.read_csv(file_name).set_index(['Date', 'Ticker'])

    # this part runs after the try/except (not in a finally block, where a return would hide unexpected errors)
    # create a dict to store the dataframe of each unique symbol where keys are symbols, values are dataframes
    df_dict = {}

    # iterate over the symbols
    for symbol in symbols:

        # source of inspiration https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.xs.html [11]
        # extract the specific stock data at the 'Ticker' level of this multi-index dataframe and save it as a dataframe
        symbol_df = df.xs(symbol, axis=0, level='Ticker', drop_level=True)

        # store the dataframe in the df_dict
        df_dict[symbol] = symbol_df

    # return the dictionary
    return df_dict

Load the data of the 5 selected stocks for the last 10 years at weekly intervals.

In [43]:
# load the stock data for the 5 companies into a dictionary
dfs = loadData(symbols=symbols_list, period='10y', interval='1wk')

Data loaded from directory


## Perform simple exploratory data analysis

<br> 
Now that we have a dictionary of dataframes, we can analyze the data and make some observations.

1. We can get the shape of the data for any stock

In [10]:
# the data shape
for symbol in dfs.keys():
    print(f"Symbol: {symbol}, Shape: {dfs[symbol].shape} ")

Symbol: PFE, Shape: (523, 6) 
Symbol: ROP, Shape: (523, 6) 
Symbol: XYL, Shape: (523, 6) 
Symbol: CPAY, Shape: (523, 6) 
Symbol: INCY, Shape: (523, 6) 


2. We can get the basic stats for any stock

In [15]:
# data basic stats
dfs["PFE"].describe()

| | adj close | close | high | low | open | volume |
|---|---|---|---|---|---|---|
| count | 523.0 | 523.0 | 523.0 | 523.0 | 523.0 | 523.0 |
| mean | 29.870774 | 36.26916 | 37.066481 | 35.423588 | 36.250718 | 137332300.0 |
| std | 7.594988 | 6.862873 | 7.114429 | 6.536369 | 6.851178 | 62650500.0 |
| min | 17.923565 | 25.4 | 26.17 | 25.200001 | 25.58 | 39227250.0 |
| 25% | 23.773307 | 31.555978 | 32.129982 | 30.858634 | 31.555977 | 97143200.0 |
| 50% | 28.344492 | 34.478176 | 34.914612 | 33.78558 | 34.487667 | 121504600.0 |
| 75% | 33.0938 | 40.028976 | 40.682581 | 39.165085 | 40.033976 | 158385600.0 |
| max | 52.74073 | 59.48 | 61.709999 | 57.16 | 60.599998 | 633399700.0 |


3. We can check how many missing values each column has for any stock dataframe

In [20]:
# how many null values in each column
dfs['PFE'].isnull().sum()

adj close    0
close        0
high         0
low          0
open         0
volume       0
dtype: int64
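
The three checks above can also be collected into one table with a row per symbol. The helper below, `summarize_frames`, is a hypothetical convenience function that is not part of the original notebook, and it is demonstrated on a small synthetic dictionary rather than on the real `dfs`.

```python
import numpy as np
import pandas as pd

def summarize_frames(frames):
    """Return one row per symbol with row count, column count and total missing values."""
    rows = {
        symbol: {
            'rows': frame.shape[0],
            'columns': frame.shape[1],
            'missing_values': int(frame.isnull().sum().sum()),
        }
        for symbol, frame in frames.items()
    }
    return pd.DataFrame.from_dict(rows, orient='index')

# synthetic frames; 'BBB' contains one missing closing price
demo_frames = {
    'AAA': pd.DataFrame({'close': [1.0, 2.0, 3.0], 'volume': [10, 20, 30]}),
    'BBB': pd.DataFrame({'close': [1.0, np.nan, 3.0], 'volume': [10, 20, 30]}),
}
summary = summarize_frames(demo_frames)
print(summary)
```

With the real data, the same call would be `summarize_frames(dfs)`.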

#### Columns breakdown

<br>
Date: the index; the date of the period that the values in the other columns describe.

<br>
Adjusted close: the closing price after adjustments for all applicable splits and dividend distributions, which makes prices comparable over time.

<br>
Close: the historical (unadjusted) closing price of the stock.

<br>
High: the highest price the stock reached during the period.

<br>
Low: the lowest price the stock reached during the period.

<br>
Open: the opening price of the stock for the period.

<br>
Volume: the number of shares traded during the period.

<br>
Usually, for this type of model we would keep only one of adjusted close or close. We are going to keep the adjusted close for now, but it is worth mentioning that most of the technical indicators we use are computed from the non-adjusted close.

## Adding Targets

<br>
To predict stock trends based on past data, we'll create two new columns:

- 'next_close': Represents the next closing price, serving as the target for the regression model.

- 'trend': Indicates whether the next close is higher (1) or lower (0) than the current close, serving as the target for the classification model.

In other words, for any given timestep the model is trained on the data available up to the current closing price and predicts the next closing price or trend.

<br>
To do that we define the add_targets(4) function which takes a data frame as input, adds the 'next_close' and 'trend' columns to it, and returns it as an output.
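
A toy example makes the targets easier to follow. The sketch below applies the same rule as `assign_trend`, but in vectorized form with `np.where`; the four prices are made up. Note that a row whose next close equals the current close gets no label and is dropped together with the last row, which has no next close.

```python
import numpy as np
import pandas as pd

# four made-up closing prices: up, flat, down, then the end of the series
toy = pd.DataFrame({'close': [10.0, 11.0, 11.0, 9.0]})

# the next closing price is the close column shifted backward by one row
toy['next_close'] = toy['close'].shift(-1)

# 1 if the next close is higher, 0 if it is lower, NaN if it is equal or missing
toy['trend'] = np.where(
    toy['next_close'] > toy['close'], 1.0,
    np.where(toy['next_close'] < toy['close'], 0.0, np.nan),
)

# drop unlabeled rows and store the label as an integer
toy = toy.dropna().astype({'trend': int})
print(toy)
```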

In [22]:
# create a function that takes a dataframe and create 'next_close' column based on its 'close' column
def get_next_close(_df):
    
    # create the 'next_close' column to be equal to the next closing price
    # this can be accomplished easily by shifting the close column backward by 1
    return _df['close'].shift(-1)

# create a function that returns 1 if the next closing price is higher than the current closing price and 0 if it is lower.
def assign_trend(row):
    if row['next_close'] > row['close']:
        return 1
    elif row['next_close'] < row['close']:
        return 0
    else: # if the next value is missing or equal to the current close, return NaN so the row is dropped later
        return np.nan

# create a function that add the target columns to the dataframe
def add_targets(_df):
    
    # add the next_close column to the dataframe
    _df['next_close'] = get_next_close(_df)
    
    # add the trend column to the dataframe
    _df['trend'] = _df.apply(assign_trend, axis=1)
    
    # drop the NaN values
    _df.dropna(inplace=True)
    
    # fix the 'trend' data type to be int
    _df = _df.astype({'trend': int})
    
    return _df

## Feature Engineering

<br>
Adding technical indicators to the dataframe gives the model more information than the raw prices alone, which can improve its performance and accuracy.

<br>
We can either manually calculate technical indicators, which is time-consuming and prone to errors, or we can utilize an existing library designed for this purpose. pandas-ta is a library that includes a wide set of technical indicators and is designed to work seamlessly with pandas dataframes.
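
For a sense of what a manual calculation involves, the sketch below computes a simple and an exponential moving average with plain pandas. The window length of 3 and the toy prices are arbitrary choices for illustration. The EMA uses the recursive pandas form (`adjust=False`), which may seed its first values differently from the pandas-ta implementation.

```python
import pandas as pd

# made-up closing prices
close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name='close')

# simple moving average: equally weighted mean of the last 3 values (NaN until 3 values exist)
sma_3 = close.rolling(window=3).mean()

# exponential moving average: span=3 gives alpha = 2 / (3 + 1) = 0.5
ema_3 = close.ewm(span=3, adjust=False).mean()

print(pd.DataFrame({'close': close, 'sma_3': sma_3, 'ema_3': ema_3}))
```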

To explore the available indicators in pandas-ta, you can use the following:

In [24]:
# list available technical indicators
dfs['PFE'].ta.indicators()

For this project, we added a total of 66 indicator-derived features. Each one was selected based on the definition and description of its technical indicator.

Check the full list of the selected indicators and the implementation of the add_technical_indicators function, which takes a dataframe as input and adds these indicators to it (5).

We can also get detailed information on specific indicators:

In [None]:
# examine the MACD indicator
help(ta.macd)
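
As a rough sketch of what the three MACD columns represent, the snippet below rebuilds them with pandas `ewm`. The 12/26/9 spans are the conventional MACD settings and are assumed here rather than read from pandas-ta. The function name `macd_sketch` is hypothetical, and its numbers need not match `ta.macd` exactly, since the two may initialise their averages differently.

```python
import numpy as np
import pandas as pd

def macd_sketch(close, fast=12, slow=26, signal=9):
    """Illustrative MACD: difference of two EMAs, its own EMA, and their gap."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    # the signal line is an EMA of the MACD line
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    # the histogram is the difference between the MACD line and the signal line
    histogram = macd_line - signal_line
    return pd.DataFrame({
        'macd': macd_line,
        'macd_histogram': histogram,
        'macd_signal': signal_line,
    })

# a steadily rising synthetic price series
prices = pd.Series(np.linspace(100.0, 130.0, 60))
result = macd_sketch(prices)
print(result.tail(3))
```

On a steadily rising series the fast average stays above the slow one, so the MACD line is positive.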

Then, we’ll group the features into four categories:

1. Base Features: Original features from yfinance (6 features).
2. Technical Indicators based on Closing Price: (30 features).
3. Technical Indicators based on Highs and Lows: (31 features).
4. Technical Indicators based on Volume: (5 features).

This grouping will enable us to create more sophisticated models, such as multi-output or inception models, which we will explore later.
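
One way to work with these groups is to slice the columns by position. The sketch below assumes the column positions given in the index-range comments of `add_technical_indicators` (0:6, 6:36, 36:67, 67:72). The names `FEATURE_GROUPS` and `split_feature_groups` are hypothetical, and a synthetic frame of the right width stands in for a real stock dataframe.

```python
import numpy as np
import pandas as pd

# column positions of each feature group (the last two columns hold the targets)
FEATURE_GROUPS = {
    'base': slice(0, 6),
    'close_indicators': slice(6, 36),
    'high_low_indicators': slice(36, 67),
    'volume_indicators': slice(67, 72),
}

def split_feature_groups(frame):
    """Return a dict that maps each group name to the matching column slice."""
    return {name: frame.iloc[:, cols] for name, cols in FEATURE_GROUPS.items()}

# synthetic frame with 72 feature columns plus 2 target columns
demo = pd.DataFrame(np.zeros((3, 74)), columns=[f'col_{i}' for i in range(74)])
groups = split_feature_groups(demo)
print({name: part.shape[1] for name, part in groups.items()})
```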

In [25]:
# for the time being let's create a function that add all the technical indicators we want to a df
def add_technical_indicators(_df):
    
    ##### indicators based on the closing price ##### index range: 6:36
    # apply macd on the close column in a df and add it to the dataframe    
    macd = ta.macd(_df['close'])
    # The MACD (Moving Average Convergence/Divergence) is a popular indicator that is used to identify a trend
    _df.insert(6, 'macd', macd.iloc[:,0])
    # Histogram is the difference of MACD and Signal
    _df.insert(7, 'macd_histogram', macd.iloc[:,1])
    # Signal is an EMA (exponential moving average) of MACD
    _df.insert(8, 'macd_signal', macd.iloc[:,2])
    
    # apply RSI on the Close column in a df and add it to the dataframe    
    # RSI (Relative Strength Index) is a popular momentum oscillator. It measures the velocity and magnitude of a trend
    rsi = ta.rsi(_df['close'])
    _df.insert(9, 'rsi', rsi)

    # apply SMA on the Close column in a df and add it to the dataframe    
    # SMA (Simple Moving Average) is the classic moving average that is the equally weighted average over n periods.
    sma = ta.sma(_df['close'])
    _df.insert(10, 'sma', sma)

    # apply EMA on the Close column in a df and add it to the dataframe    
    # EMA (Exponential Moving Average). The weights are determined by alpha, which depends on its length (commonly alpha = 2 / (length + 1)).
    ema = ta.ema(_df['close'])
    _df.insert(11, 'ema', ema)
    
    ######## repeat the same process for all the technical indicators we want to include ##########
    # bbands: A popular volatility indicator by John Bollinger.
    bbands = ta.bbands(_df['close'])
    _df.insert(12, 'bbands_lower', bbands.iloc[:,0])
    _df.insert(13, 'bbands_mid', bbands.iloc[:,1])
    _df.insert(14, 'bbands_upper', bbands.iloc[:,2])
    _df.insert(15, 'bbands_bandwidth', bbands.iloc[:,3])
    _df.insert(16, 'bbands_percent', bbands.iloc[:,4])
    
    # dema: The Double Exponential Moving Average attempts to provide a smoother average with less lag than the normal Exponential Moving Average (EMA).
    dema = ta.dema(_df['close'])
    _df.insert(17, 'dema', dema)
    
    # tema: A less laggy Exponential Moving Average.
    tema = ta.tema(_df['close'])
    _df.insert(18, 'tema', tema)

    # roc: Rate of Change is an indicator that is also referred to as Momentum. It is a pure momentum oscillator that measures the percent change in price relative to the price 'n' (or length) periods ago.
    roc = ta.roc(_df['close'])
    _df.insert(19, 'roc', roc)
    
    # mom: Momentum is an indicator used to measure a security's speed (or strength) of movement.  Or simply the change in price.
    mom = ta.mom(_df['close'])
    _df.insert(20, 'mom', mom)
    
    # kama: Developed by Perry Kaufman, Kaufman's Adaptive Moving Average (KAMA) is a moving average designed to account for market noise or volatility. KAMA will closely follow prices when the price swings are relatively small and the noise is low. KAMA will adjust when the price swings widen and follow prices from a greater distance. This trend-following indicator can be used to identify the overall trend, time turning points and filter price movements.
    kama = ta.kama(_df['close'])
    _df.insert(21, 'kama', kama)
                       
    # trix: is a momentum oscillator to identify divergences.
    trix = ta.trix(_df['close'])
    _df.insert(22, 'trix', trix.iloc[:,0])
    _df.insert(23, 'trixs', trix.iloc[:,1])
    
    # hma: The Hull Exponential Moving Average attempts to reduce or remove lag in moving averages.
    hma = ta.hma(_df['close'])
    _df.insert(24, 'hma', hma)
    
    # alma: The ALMA moving average uses the curve of the Normal (Gauss) distribution, which can be shifted from 0 to 1. This allows regulating the smoothness and high sensitivity of the indicator. Sigma is another parameter that is responsible for the shape of the curve coefficients. This moving average reduces lag of the data in conjunction with smoothing to reduce noise.
    alma = ta.alma(_df['close'])
    _df.insert(25, 'alma', alma)
    
    # apo: The Absolute Price Oscillator is an indicator used to measure a security's momentum.  It is simply the difference of two Exponential Moving Averages (EMA) of two different periods. Note: APO and MACD lines are equivalent.
    apo = ta.apo(_df['close'])
    _df.insert(26, 'apo', apo)
    
    # cfo: The Forecast Oscillator calculates the percentage difference between the actual price and the Time Series Forecast (the endpoint of a linear regression line).
    cfo = ta.cfo(_df['close'])
    _df.insert(27, 'cfo', cfo)
    
    # cg: The Center of Gravity Indicator by John Ehlers attempts to identify turning points while exhibiting zero lag and smoothing.
    cg = ta.cg(_df['close'])
    _df.insert(28, 'cg', cg)
    
    # cmo: Attempts to capture the momentum of an asset with overbought at 50 and oversold at -50.
    cmo = ta.cmo(_df['close'])
    _df.insert(29, 'cmo', cmo)
    
    # coppock: Coppock Curve (originally called the "Trendex Model") is a momentum indicator designed for use on a monthly time scale. Although designed for monthly use, a daily calculation over the same period can be made, converting the periods to 294-day and 231-day rate of changes, and a 210-day weighted moving average.
    coppock = ta.coppock(_df['close'])
    _df.insert(30, 'coppock', coppock)
    
    # cti: The Correlation Trend Indicator is an oscillator created by John Ehlers in 2020. It assigns a value depending on how close prices in that range are to following a positively- or negatively-sloping straight line. Values range from -1 to 1. This is a wrapper for ta.linreg(close, r=True).
    cti = ta.cti(_df['close'])
    _df.insert(31, 'cti', cti)
    
    # decay: Creates a decay moving forward from prior signals like crosses. The default is "linear". Exponential is optional as "exponential" or "exp".
    decay = ta.decay(_df['close'])
    _df.insert(32, 'decay', decay)
    
    # decreasing: Returns True if the series is decreasing over a period, False otherwise. If the kwarg 'strict' is True, it returns True if it is continuously decreasing over the period. When using the kwarg 'asint', then it returns 1 for True or 0 for False.
    decreasing = ta.decreasing(_df['close'])
    _df.insert(33, 'decreasing', decreasing)
    
    # ebsw: This indicator measures market cycles and uses a low pass filter to remove noise. Its output is a signal bounded between -1 and 1, and the maximum length of a detected trend is limited by its length input.
    ebsw = ta.ebsw(_df['close'])
    _df.insert(34, 'ebsw', ebsw)
    
    # entropy: Introduced by Claude Shannon in 1948, entropy measures the unpredictability of the data, or equivalently, of its average information. A die has higher entropy (p=1/6) versus a coin (p=1/2).
    entropy = ta.entropy(_df['close'])
    _df.insert(35, 'entropy', entropy)
    
    
    ##### indicators based on the high and lows of the price ##### range= 36:67
    
    # aberration: A volatility indicator
    aberration = ta.aberration(_df['high'], _df['low'], _df['close'])
    _df.insert(36, 'aberration_zg', aberration.iloc[:,0])
    _df.insert(37, 'aberration_sg', aberration.iloc[:,1])
    _df.insert(38, 'aberration_xg', aberration.iloc[:,2])
    _df.insert(39, 'aberration_atr', aberration.iloc[:,3])
    
    # adx:  Average Directional Movement is meant to quantify trend strength by measuring the amount of movement in a single direction.    
    adx = ta.adx(_df['high'], _df['low'], _df['close'])
    _df.insert(40, 'adx_adx', adx.iloc[:,0])
    _df.insert(41, 'adx_dmp', adx.iloc[:,1])
    _df.insert(42, 'adx_dmn', adx.iloc[:,2])

    # atr: Average True Range is used to measure volatility, especially volatility caused by gaps or limit moves.
    atr = ta.atr(_df['high'], _df['low'], _df['close'])
    _df.insert(43, 'atr', atr)
    
    # stoch: The Stochastic Oscillator (STOCH) was developed by George Lane in the 1950's. He believed this indicator was a good way to measure momentum because changes in momentum precede changes in price.
    stoch = ta.stoch(_df['high'], _df['low'], _df['close'])
    _df.insert(44, 'stoch_k', stoch.iloc[:,0])
    _df.insert(45, 'stoch_d', stoch.iloc[:,1])
    
    # Supertrend: is an overlap indicator. It is used to help identify trend direction, setting stop loss, identify support and resistance, and/or generate buy & sell signals.
    supertrend = ta.supertrend(_df['high'], _df['low'], _df['close'])
    _df.insert(46, 'supertrend_trend', supertrend.iloc[:,0])
    _df.insert(47, 'supertrend_direction', supertrend.iloc[:,1])
    
    # cci: Commodity Channel Index is a momentum oscillator used to primarily identify overbought and oversold levels relative to a mean.
    cci = ta.cci(_df['high'], _df['low'], _df['close'])
    _df.insert(48, 'cci', cci)
    
    # aroon: attempts to identify if a security is trending and how strong.
    aroon = ta.aroon(_df['high'], _df['low'])
    _df.insert(49, 'aroon_up', aroon.iloc[:,0])
    _df.insert(50, 'aroon_down', aroon.iloc[:,1])
    _df.insert(51, 'aroon_osc', aroon.iloc[:,2])
    
    # natr: Normalized Average True Range attempts to normalize the average true range.
    natr = ta.natr(_df['high'], _df['low'], _df['close'])
    _df.insert(52, 'natr', natr)
    
    # William's Percent R is a momentum oscillator similar to the RSI that attempts to identify overbought and oversold conditions.
    willr = ta.willr(_df['high'], _df['low'], _df['close'])
    _df.insert(53, 'willr', willr)
    
    # vortex: Two oscillators that capture positive and negative trend movement.
    vortex = ta.vortex(_df['high'], _df['low'], _df['close'])
    _df.insert(54, 'vortex_vip', vortex.iloc[:,0])
    _df.insert(55, 'vortex_vim', vortex.iloc[:,1])
    
    # hlc3: the average of high, low, and close prices
    hlc3 = ta.hlc3(_df['high'], _df['low'], _df['close'])
    _df.insert(56, 'hlc3', hlc3)
    
    # ohlc4: the average of open, high, low, and close prices
    ohlc4 = ta.ohlc4(_df['open'], _df['high'], _df['low'], _df['close'])
    _df.insert(57, 'ohlc4', ohlc4)
    
    # accbands: Acceleration Bands created by Price Headley plots upper and lower envelope bands around a simple moving average.
    accbands = ta.accbands(_df['high'], _df['low'], _df['close'])
    _df.insert(58, 'accbands_lower', accbands.iloc[:,0])
    _df.insert(59, 'accbands_mid', accbands.iloc[:,1])
    _df.insert(60, 'accbands_upper', accbands.iloc[:,2])

    # chop: The Choppiness Index was created by Australian commodity trader E.W. Dreiss and is designed to determine if the market is choppy (trading sideways) or not choppy (trading within a trend in either direction). Values closer to 100 imply the underlying is choppier, whereas values closer to 0 imply the underlying is trending.
    chop = ta.chop(_df['high'], _df['low'], _df['close'])
    _df.insert(61, 'chop', chop)
    
    # dm: The Directional Movement was developed by J. Welles Wilder in 1978 and attempts to determine which direction the price of an asset is moving. It compares prior highs and lows to yield two series, +DM and -DM.
    dm = ta.dm(_df['high'], _df['low'])
    _df.insert(62, 'dm_positive', dm.iloc[:,0])
    _df.insert(63, 'dm_negative', dm.iloc[:,1])

    # donchian: Donchian Channels are used to measure volatility, similar to Bollinger Bands and Keltner Channels.
    donchian = ta.donchian(_df['high'], _df['low'])
    _df.insert(64, 'donchian_lower', donchian.iloc[:,0])
    _df.insert(65, 'donchian_mid', donchian.iloc[:,1])
    _df.insert(66, 'donchian_upper', donchian.iloc[:,2])
    
    
    ##### indicators based on the volume of the price ##### range= 67:72
    
    # obv: On Balance Volume is a cumulative indicator to measure buying and selling pressure.
    obv = ta.obv(_df['close'], _df['volume'])
    _df.insert(67, 'obv', obv)
    
    # vwma: Volume Weighted Moving Average.
    vwma = ta.vwma(_df['close'], _df['volume'])
    _df.insert(68, 'vwma', vwma)
    
    # adosc: Accumulation/Distribution Oscillator indicator utilizes Accumulation/Distribution and treats it similarly to MACD or APO.
    adosc = ta.adosc(_df['high'], _df['low'], _df['close'], _df['volume'])
    _df.insert(69, 'adosc', adosc)
    
    # cmf: Chaikin Money Flow measures the amount of money flow volume over a specific period in conjunction with Accumulation/Distribution.
    cmf = ta.cmf(_df['high'], _df['low'], _df['close'], _df['volume'])
    _df.insert(70, 'cmf', cmf)
    
    # efi: Elder's Force Index measures the power behind a price movement using price and volume as well as potential reversals and price corrections.
    efi = ta.efi(_df['close'], _df['volume'])
    _df.insert(71, 'efi', efi)


    #### we can add more technical indicators if we want using the same process ####
    
    # remove the NaN values and return the new dataframe
    _df.dropna(inplace=True)
    
    return _df

Finally, we will create add_targets_and_indicators (6), a helper function that adds the targets and indicators to all dataframes in a dictionary.

In [33]:
# create a function that takes a dictionary of dataframes as input and add the targets and features to them 
def add_targets_and_indicators(_dfs):
    
    # iterate over the dataframes in the dictionary
    for symbol in _dfs.keys():
        
        # copy the dataframe
        _df = _dfs[symbol].copy(deep=True)
        
        # add target columns to the copied dataframe
        _df = add_targets(_df)
        
        # add technical indicators to the copied dataframe
        _df = add_technical_indicators(_df)
        
        # replace the original dataframe with the new dataframe
        _dfs[symbol] = _df
    
    # return the new dataframes dictionary
    return _dfs

In [46]:
# add the targets and technical indicators to each dataframe in the dictionary
full_dfs = add_targets_and_indicators(dfs.copy())
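
Note that `dfs.copy()` makes a shallow copy of the dictionary. That is enough here, because the helper deep-copies each dataframe and rebinds the key instead of modifying the stored dataframe. The sketch below illustrates the pattern on a toy dictionary; `add_double` is a hypothetical stand-in for the real helper.

```python
import pandas as pd

def add_double(frames):
    """Toy helper that follows the same copy-then-replace pattern."""
    for symbol in frames.keys():
        # work on a deep copy so the caller's dataframe is never modified
        frame = frames[symbol].copy(deep=True)
        frame['double'] = frame['close'] * 2
        # rebind the key in the (shallow-copied) dictionary
        frames[symbol] = frame
    return frames

original = {'AAA': pd.DataFrame({'close': [1.0, 2.0]})}
processed = add_double(original.copy())

print(list(original['AAA'].columns))
print(list(processed['AAA'].columns))
```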

In [50]:
# check the shape of the new dataframes
full_dfs['PFE'].shape

(479, 74)
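
The drop from 523 to 479 rows is consistent with the two `dropna` calls: `add_targets` removes rows without a usable target, and `add_technical_indicators` removes the rows where an indicator does not yet have enough history. The sketch below reproduces the second effect on synthetic data, using two rolling means as stand-ins for real indicators (the window lengths are arbitrary).

```python
import numpy as np
import pandas as pd

# 20 synthetic closing prices
frame = pd.DataFrame({'close': np.arange(1.0, 21.0)})

# each rolling window leaves NaN in its first (window - 1) rows
frame['mean_5'] = frame['close'].rolling(5).mean()
frame['mean_8'] = frame['close'].rolling(8).mean()

rows_before = len(frame)
frame = frame.dropna()
rows_after = len(frame)

# the longest window decides how many leading rows are lost
print(rows_before, rows_after)
```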