# Project Implementation

## Install and import the required libraries

<br>
In this section, we start by importing the libraries the project needs. We rely mainly on yfinance for data collection, pandas and NumPy for data processing, and TensorFlow for machine learning.

<br>
Other relevant libraries are keras_tuner for hyperparameter optimization, scikit-learn for data scaling and model evaluation, pandas-ta for calculating technical indicators based on the data from yfinance, and matplotlib for visualization.

In [1]:
# install dependencies and import libraries
# !pip install yfinance pandas numpy tensorflow keras-tuner scikit-learn pandas-ta matplotlib

In [2]:
# https://pypi.org/project/yfinance/ ("it's an open-source tool that uses Yahoo's publicly available APIs, and is intended for research and educational purposes.")
# import yfinance, our data source
import yfinance as yf

# import pandas and numpy
import pandas as pd 
import numpy as np

# import from tensorflow
from tensorflow.keras.models import Sequential, load_model, Model
from tensorflow.keras.layers import SimpleRNN, Dense, LSTM, Input, GRU, SeparableConv1D, BatchNormalization, MaxPooling1D, add, Layer, concatenate
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam, RMSprop, SGD
from tensorflow.keras.saving import register_keras_serializable

# import from keras_tuner
from keras_tuner import HyperModel, Hyperband, Tuner, Oracle

# import from scikit-learn
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, ConfusionMatrixDisplay

# https://pypi.org/project/pandas-ta/ ("An easy to use Python 3 Pandas Extension with 130+ Technical Analysis Indicators. Can be called from a Pandas DataFrame or standalone")
# import pandas-ta
import pandas_ta as ta

# import matplotlib for data visualisation
import matplotlib.pyplot as plt

# datetime lets us measure how long a process takes
from datetime import datetime
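
As a minimal sketch of how `datetime` can be used for timing, the snippet below wraps a placeholder task between two timestamps. The `time.sleep` call is only a stand-in (an assumption for illustration) for real work such as downloading data or training a model.

```python
import time
from datetime import datetime

# record the start time
start = datetime.now()

# placeholder for the real work (hypothetical; replace with the process to be timed)
time.sleep(0.1)

# the difference between two datetimes is a timedelta
elapsed = datetime.now() - start
print(f"Elapsed time: {elapsed.total_seconds():.2f} seconds")
```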

## Load Data


<br>
In this implementation, we work with five stocks from the S&P 500 list (1). The five stocks were chosen by their position in this list, ranked from most valuable to least valuable: they are spaced relatively far apart in the ranking, and each belongs to a different industry. This gives us a more diverse sample, which should help our model evaluation results generalize and reduce the risk of bias and overfitting.

Check out our stock list for this project (2).

<br>
The yfinance API lets us request a company's stock data for a given period and interval. We set the period to 10 years (or the max value), which is sufficient for all of our experiments. The interval, which determines the frequency of the data rows, is something we will vary. Different intervals are relevant to different groups of financial analysts and traders in the real world, so we want to see whether our approach generalizes better at specific intervals and to build the best model for each of these groups.

That's why we define a function that downloads the data for any number of stocks at any period and interval, saves it as a CSV file in local storage, loads it back from storage, splits it into one data frame per stock, and organizes the data frames in a dictionary so they are easy to work with for the rest of the project.

Check out the loadData function (3).

In [3]:
# insert the stock symbols into a list
symbols_list = ['PFE', 'ROP', 'XYL', 'CPAY', 'INCY']

In [4]:
# define a function to load the data from source (yfinance API), and save it as a csv to local storage
def loadData(symbols=symbols_list, period='10y', interval='1wk'):
    
    try:
        # load the dataframe from the csv file if it already exists
        df = pd.read_csv(f'{period}_{interval}_stocks_data.csv').set_index(['Date', 'Ticker'])

        print("Data loaded from directory")

    except FileNotFoundError:
        # print a message stating that the data does not exist yet and needs to be downloaded from yfinance
        print(f"There is no {period}_{interval}_stocks_data.csv. Data will be downloaded from yfinance.")

        # download the data from source and store it in the stocks_data variable as a pandas dataframe
        stocks_data = yf.download(symbols, period=period, interval=interval)

        # reshape the dataframe as a multi-level index dataframe
        stocks_data = stocks_data.stack()

        # source: https://www.statology.org/pandas-change-column-names-to-lowercase/
        # convert column names to lowercase
        stocks_data.columns = stocks_data.columns.str.lower()

        # save the dataframe to a csv file so we don't make unnecessary requests to the API every time we reload the notebook
        stocks_data.to_csv(f'{period}_{interval}_stocks_data.csv', index=True)

        # load the dataframe from the csv file
        df = pd.read_csv(f'{period}_{interval}_stocks_data.csv').set_index(['Date', 'Ticker'])

    # build the result after the try/except rather than in a `finally` block:
    # a `return` inside `finally` would silently swallow any other exception (e.g. a failed download)
    # create a dict to store the dataframe of each unique symbol where keys are symbols, values are dataframes
    df_dict = {}

    # iterate over the symbols
    for symbol in symbols:

        # source of inspiration https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.xs.html [11]
        # extract the specific stock data at the 'Ticker' level of this multi index dataframe and save it as a dataframe
        symbol_df = df.xs(symbol, axis=0, level='Ticker', drop_level=True)

        # store the dataframe in df_dict
        df_dict[symbol] = symbol_df

    # return the dictionary
    return df_dict
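
To make the reshaping inside `loadData` easier to follow, here is a small sketch on synthetic data. It assumes the downloaded frame has two column levels (price field, then ticker), which is what `stack()` and `xs()` rely on in the function. The tickers `AAA` and `BBB`, the level name `Price`, and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a multi-ticker download: columns are (price field, ticker)
dates = pd.DatetimeIndex(['2024-01-01', '2024-01-08', '2024-01-15'], name='Date')
columns = pd.MultiIndex.from_product(
    [['Close', 'Volume'], ['AAA', 'BBB']], names=['Price', 'Ticker']
)
wide = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4), index=dates, columns=columns)

# move the ticker level from the columns into the row index -> index is (Date, Ticker)
long = wide.stack(level='Ticker')

# convert column names to lowercase, as loadData does
long.columns = long.columns.str.lower()

# select the rows of one ticker and drop the 'Ticker' level from the index
aaa = long.xs('AAA', axis=0, level='Ticker', drop_level=True)

print(long)
print(aaa)
```

After `stack`, each row is one (date, ticker) pair, which is why the CSV can be re-indexed on `['Date', 'Ticker']` and then split per symbol with `xs`.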

In [5]:
# load the stock data for the 5 companies into a dictionary
dfs = loadData(symbols=symbols_list, period='10y', interval='1wk')

There is no 10y_1wk_stocks_data.csv. Data will be downloaded from yfinance.


[*********************100%%**********************]  5 of 5 completed
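
Because the cache file is named only after the period and interval, it does not record which symbols it contains. If `symbols_list` changes later, `df.xs` would raise a `KeyError` for any ticker missing from the cached file. A minimal sketch of a check for this follows. The helper name `missing_symbols` is hypothetical and not part of the project, and the demo frame is synthetic.

```python
import pandas as pd

def missing_symbols(df, symbols, level='Ticker'):
    """Return the requested symbols that are not present in the cached frame."""
    cached = set(df.index.get_level_values(level))
    return [symbol for symbol in symbols if symbol not in cached]

# synthetic stand-in for a frame loaded from the cached CSV
index = pd.MultiIndex.from_product(
    [['2024-01-01', '2024-01-08'], ['AAA', 'BBB']], names=['Date', 'Ticker']
)
cached_df = pd.DataFrame({'close': [1.0, 2.0, 3.0, 4.0]}, index=index)

# 'CCC' was never downloaded, so it is reported as missing
print(missing_symbols(cached_df, ['AAA', 'CCC']))  # ['CCC']
```

If the returned list is not empty, one option is to delete the CSV file so the data is downloaded again for the new symbol list.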


## Perform simple exploratory data analysis

<br> 
Now that we have a dictionary of dataframes, we can analyze the data and make some observations.

1. We can get the shape of the data for any stock

In [10]:
# the data shape
for symbol in dfs.keys():
    print(f"Symbol: {symbol}, Shape: {dfs[symbol].shape} ")

Symbol: PFE, Shape: (523, 6) 
Symbol: ROP, Shape: (523, 6) 
Symbol: XYL, Shape: (523, 6) 
Symbol: CPAY, Shape: (523, 6) 
Symbol: INCY, Shape: (523, 6) 


2. We can get the basic stats for any stock

In [15]:
# data basic stats
dfs["PFE"].describe()

| | adj close | close | high | low | open | volume |
|---|---|---|---|---|---|---|
| count | 523.0 | 523.0 | 523.0 | 523.0 | 523.0 | 523.0 |
| mean | 29.870774 | 36.26916 | 37.066481 | 35.423588 | 36.250718 | 137332300.0 |
| std | 7.594988 | 6.862873 | 7.114429 | 6.536369 | 6.851178 | 62650500.0 |
| min | 17.923565 | 25.4 | 26.17 | 25.200001 | 25.58 | 39227250.0 |
| 25% | 23.773307 | 31.555978 | 32.129982 | 30.858634 | 31.555977 | 97143200.0 |
| 50% | 28.344492 | 34.478176 | 34.914612 | 33.78558 | 34.487667 | 121504600.0 |
| 75% | 33.0938 | 40.028976 | 40.682581 | 39.165085 | 40.033976 | 158385600.0 |
| max | 52.74073 | 59.48 | 61.709999 | 57.16 | 60.599998 | 633399700.0 |


3. We can check how many missing values each column has for any stock dataframe

In [20]:
# how many null values in each column
dfs['PFE'].isnull().sum()

adj close    0
close        0
high         0
low          0
open         0
volume       0
dtype: int64
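
The same check can be run for every stock at once. The sketch below assumes a dictionary of dataframes shaped like `dfs`. The helper name `missing_summary` is hypothetical, and the demo data is synthetic so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd

def missing_summary(frames):
    """Return one row per symbol with the number of missing values in each column."""
    return pd.DataFrame({symbol: df.isnull().sum() for symbol, df in frames.items()}).T

# synthetic stand-in for the `dfs` dictionary
demo_dfs = {
    'AAA': pd.DataFrame({'close': [1.0, 2.0, np.nan], 'volume': [10, 20, 30]}),
    'BBB': pd.DataFrame({'close': [1.0, 2.0, 3.0], 'volume': [10, 20, 30]}),
}

summary = missing_summary(demo_dfs)
print(summary)
```

Calling `missing_summary(dfs)` on the real dictionary would give a single table covering all five stocks instead of inspecting them one by one.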

## Add Targets