# Data Extraction and Merging process

**The purpose of this Jupyter notebook is to centralize the code used to extract, prepare, and transform the data obtained from the sources into the final dataframe used to train and test the machine learning algorithms in the project.**


The variables in the dataset come from two sources:
1. Yahoo Finance
2. CSV files provided by Albert Wong

# Data Source for demographics


The data source for the demographic data used in this project is Yahoo Finance.

We use the yfinance library to retrieve the necessary demographics for each company from Yahoo Finance.

In [1]:
import yfinance as yf
import pandas as pd
import numpy as np

We create a list of the tickers of the stocks we want to query through the API. This simplifies retrieving the values.

In [2]:
tickers = ['aapl', 'amd', 'amzn', 'ba', 'baba', 'bac', 'c', 'csco', 'cvx', 'dis', 'f', 'ge',
           'googl', 'ibm', 'intc', 'jnj', 'jpm', 'ko', 'mcd', 'meta', 'msft', 'nflx', 'nvda',
           'pfe', 'pltr', 't', 'tsla', 'vz', 'wmt', 'xom']

Since .info returns all the information available from yfinance, we need to filter it down. We do this with a helper function.

In [3]:
def extract_demographics(x, to_extract):
    # Given a dictionary and a list of keys to extract, return a dictionary with only those keys.
    # If a value does not exist on the Yahoo Finance site, it is filled with np.nan.
    result_dict = {key: x.get(key, np.nan) for key in to_extract}
    return result_dict
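
As a quick illustration of how the helper behaves, the sketch below applies it to a small hand-written dictionary. The values are made up and are not real Yahoo Finance output. The function is repeated so that the snippet runs on its own.

```python
import numpy as np

def extract_demographics(x, to_extract):
    # Same helper as above: keep only the requested keys, np.nan when a key is absent
    return {key: x.get(key, np.nan) for key in to_extract}

# Hypothetical stand-in for the dictionary returned by .info
sample_info = {'symbol': 'XYZ', 'sector': 'Technology', 'trailingPE': 20.5, 'website': 'n/a'}

sample_row = extract_demographics(sample_info, ['symbol', 'trailingPE', 'pegRatio'])
print(sample_row)  # 'pegRatio' is missing from sample_info, so it comes back as nan
```

Keys that were not requested, such as 'website' here, are simply left out of the result.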

Define the ratios and demographics we wish to extract from the dictionary returned by yfinance's .info

In [4]:
ratios_to_extract = ['symbol', 'sector', 'trailingPE' , 'priceToBook', 'debtToEquity', 'freeCashflow',
                     'pegRatio' , 'returnOnEquity']

Prepare the results dataframe onto which each result will be concatenated

In [5]:
demographic_df = pd.DataFrame()

We create a for loop now to complete the task of retrieving all the ratios and demographics for the companies we wish to know about:

- Step 1: Retrieve the yf.Ticker info for the stock (This will retrieve all available information)
- Step 2: Apply the extract_demographics function to keep only the ratios and demographics we are interested in
- Step 3: Convert the dictionary of values we wish to keep into a dataframe
- Step 4: Concatenate the results of this onto our final demographic_df dataframe

In [9]:
for ticker in tickers:
    temp_ticker = yf.Ticker(ticker).info                                      # Step 1
    temp_row = extract_demographics(temp_ticker, ratios_to_extract)           # Step 2
    temp_df = pd.DataFrame([temp_row])                                        # Step 3
    demographic_df = pd.concat([demographic_df, temp_df], ignore_index=True)  # Step 4
# As of 4-11-2023, the demographic scraping is unavailable due to an error with the Yahoo Finance API. No solution has been found yet, but updates may be found on the following page: https://github.com/ranaroussi/yfinance/issues/1729
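
Since the Yahoo Finance endpoint can fail, as noted above, one possible variation is to collect the rows in a list, skip tickers whose request raises, and build the dataframe once at the end. This is only a sketch: `build_demographics` and its `fetch_info` argument are hypothetical names introduced here, and the stub below replaces the real network call so the example runs offline.

```python
import numpy as np
import pandas as pd

def build_demographics(tickers, to_extract, fetch_info):
    """Build one row per ticker; fetch_info is any callable returning a dict (assumption)."""
    rows = []
    for ticker in tickers:
        try:
            info = fetch_info(ticker)
        except Exception as exc:  # e.g. an HTTP error raised by the data provider
            print(f"Skipping {ticker}: {exc}")
            continue
        rows.append({key: info.get(key, np.nan) for key in to_extract})
    # columns= keeps the expected layout even if every request failed
    return pd.DataFrame(rows, columns=to_extract)

# Offline stub standing in for `lambda t: yf.Ticker(t).info`
def fake_fetch(ticker):
    if ticker == 'bad':
        raise RuntimeError("request failed")
    return {'symbol': ticker.upper(), 'sector': 'Technology'}

demo_df = build_demographics(['aapl', 'bad', 'msft'], ['symbol', 'sector', 'trailingPE'], fake_fetch)
print(demo_df)
```

Building the dataframe once from a list of dictionaries also avoids growing it with a `pd.concat` call on every iteration.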

In [10]:
demographic_df

| | symbol | sector | trailingPE | priceToBook | debtToEquity | freeCashflow | pegRatio | returnOnEquity |
|---|---|---|---|---|---|---|---|---|
| 0 | AAPL | Technology | 32.609642 | 43.349182 | 140.968 | 84726870000.0 | 3.57 | 1.4725 |
| 1 | AMD | Technology | 231.11429 | 4.657818 | 5.335 | 2385500000.0 | 1.53 | 0.02013 |
| 2 | AMZN | Consumer Cyclical | 52.275284 | 8.935516 | 74.107 | 57269750000.0 | 1.53 | 0.20305 |
| 3 | BA | Industrials | NaN | NaN | NaN | 3065250000.0 | -0.51 | NaN |
| 4 | BABA | Consumer Cyclical | 16.897726 | 0.183409 | 18.481 | 122730900000.0 | 54.2 | 0.06382 |
| 5 | BAC | Financial Services | 13.789656 | 1.186471 | NaN | NaN | 1.48 | 0.08724 |
| 6 | C | Financial Services | 17.67347 | 0.611841 | NaN | NaN | 1.05 | 0.03909 |
| 7 | CSCO | Technology | 15.783784 | 4.11485 | 72.57 | 14412380000.0 | 8.59 | 0.27517 |
| 8 | CVX | Energy | 14.410304 | 1.801184 | 13.507 | 14773130000.0 | 2.57 | 0.12641 |
| 9 | DIS | Communication Services | 110.69565 | 1.873609 | 44.62 | 8282375000.0 | 1.36 | 0.02716 |


These values are a snapshot of the current values on yfinance and not a historical capture of the metrics.

# Data Source for commodities

The code for extracting the commodity data is currently unavailable. However, the CSV files were provided directly by Albert Wong and can be found at the following link: https://www.dropbox.com/scl/fo/n638k3vvnic2pss80zyvq/h?rlkey=3ilmw5fiehotnlqbeneiu249t&dl=0

In [8]:
import pandas as pd
import numpy as np
import holidays
from functools import reduce

## Merging the stocks csv files into one table

In [None]:
# Read the stock data
# The CSV files can be found at the following link: https://www.dropbox.com/scl/fo/n638k3vvnic2pss80zyvq/h?rlkey=3ilmw5fiehotnlqbeneiu249t&dl=0
aapl = pd.read_csv('stocks/AAPL.csv', delimiter=";")
amd = pd.read_csv('stocks/AMD.csv', delimiter=";")
amzn = pd.read_csv('stocks/AMZN.csv', delimiter=";")
ba = pd.read_csv('stocks/BA.csv', delimiter=";")
baba = pd.read_csv('stocks/BABA.csv', delimiter=";")
bac = pd.read_csv('stocks/BAC.csv', delimiter=";")
c = pd.read_csv('stocks/C.csv', delimiter=";")
csco = pd.read_csv('stocks/CSCO.csv', delimiter=";")
cvx = pd.read_csv('stocks/CVX.csv', delimiter=";")
dis = pd.read_csv('stocks/DIS.csv', delimiter=";")
f = pd.read_csv('stocks/F.csv', delimiter=";")
ge = pd.read_csv('stocks/GE.csv', delimiter=";")
googl = pd.read_csv('stocks/GOOGL.csv', delimiter=";")
ibm = pd.read_csv('stocks/IBM.csv', delimiter=";")
intc = pd.read_csv('stocks/INTC.csv', delimiter=";")
jnj = pd.read_csv('stocks/JNJ.csv', delimiter=";")
jpm = pd.read_csv('stocks/JPM.csv', delimiter=";")
ko = pd.read_csv('stocks/KO.csv', delimiter=";")
mcd = pd.read_csv('stocks/MCD.csv', delimiter=";")
meta = pd.read_csv('stocks/META.csv', delimiter=";")
msft = pd.read_csv('stocks/MSFT.csv', delimiter=";")
nflx = pd.read_csv('stocks/NFLX.csv', delimiter=";")
nvda = pd.read_csv('stocks/NVDA.csv', delimiter=";")
pfe = pd.read_csv('stocks/PFE.csv', delimiter=";")
pltr = pd.read_csv('stocks/PLTR.csv', delimiter=";")
t = pd.read_csv('stocks/T.csv', delimiter=";")
tsla = pd.read_csv('stocks/TSLA.csv', delimiter=";")
vz = pd.read_csv('stocks/VZ.csv', delimiter=";")
wmt = pd.read_csv('stocks/WMT.csv', delimiter=";")
xom = pd.read_csv('stocks/XOM.csv', delimiter=";")
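
The thirty near-identical `read_csv` calls could also be written as a loop that stores each dataframe in a dictionary keyed by ticker. The sketch below is an assumption about how that might look, not the notebook's method. `load_stock_csvs` is a hypothetical helper, and it expects files named `<TICKER>.csv` inside the given folder.

```python
from pathlib import Path
import pandas as pd

def load_stock_csvs(folder, tickers, delimiter=";"):
    """Read <folder>/<TICKER>.csv for each ticker and return {ticker: DataFrame}."""
    frames = {}
    for ticker in tickers:
        path = Path(folder) / f"{ticker}.csv"
        frames[ticker] = pd.read_csv(path, delimiter=delimiter)
    return frames

# Usage (assumes the 'stocks' folder from the Dropbox link is in the working directory):
# stock_frames = load_stock_csvs('stocks', ['AAPL', 'AMD'])
# aapl = stock_frames['AAPL']
```

A dictionary keeps each table paired with its ticker, so a separate list of ticker names does not need to be kept in the same order as the list of dataframes.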

In [None]:
# list of stocks to allow for easier cleaning
stocks = [aapl, amd, amzn, ba, baba, bac, c, csco, cvx, dis, f, ge, googl, ibm, intc, jnj, jpm, ko, mcd, meta, msft, nflx,
          nvda, pfe, pltr, t, tsla, vz, wmt, xom]
tickers = ["AAPL", "AMD", "AMZN", "BA", "BABA", "BAC", "C", "CSCO", "CVX", "DIS", "F", "GE", "GOOGL", "IBM", "INTC", "JNJ",
           "JPM", "KO", "MCD", "META", "MSFT", "NFLX", "NVDA", "PFE", "PLTR", "T", "TSLA", "VZ", "WMT", "XOM"]

In [None]:
# Rename the 'Date' column of each stock DataFrame to 'Time' and add a 'stock_ID'
# column holding the ticker symbol. Each DataFrame is modified in place.
for stock, ticker in zip(stocks, tickers):
    stock.rename(columns={'Date': 'Time'}, inplace=True)
    stock["stock_ID"] = ticker

In [None]:
# Parse the 'Time' strings into a datetime column
for stock in stocks:
    stock['DATETIME'] = pd.to_datetime(stock['Time'], format='%Y-%m-%d %H:%M:%S')

In [None]:
aapl.head()

In [None]:
# Keep only the columns needed downstream, producing one reduced DataFrame per stock
stocks_reduced = [stock.filter(["stock_ID", "Volume", "Close", "DATETIME"]) for stock in stocks]

In [None]:
# Lag Volume by one row so that each timestamp carries the volume of the preceding row
for stock in stocks_reduced:
    stock["Volume"] = stock["Volume"].shift(1)
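
To make the effect of `shift(1)` concrete, here is a toy example with made-up timestamps and volumes. After the shift, each row holds the volume of the row before it. The source does not state the motivation for the lag, so none is assumed here.

```python
import pandas as pd

toy_volume = pd.DataFrame({
    'DATETIME': pd.to_datetime(['2021-01-04 09:30:00', '2021-01-04 09:45:00', '2021-01-04 10:00:00']),
    'Volume': [100, 250, 175],
})
toy_volume['Volume'] = toy_volume['Volume'].shift(1)  # row i now holds the volume of row i-1
print(toy_volume)
# The first row has no predecessor, so its Volume becomes NaN
```

Because of that leading NaN, the first row of every stock has a missing volume after this step.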

In [None]:
stocks_reduced[0].head()

## Clean Commodities and Bonds

In [None]:
# The CSV files can be found at the following link: https://www.dropbox.com/scl/fo/n638k3vvnic2pss80zyvq/h?rlkey=3ilmw5fiehotnlqbeneiu249t&dl=0
us2y = pd.read_csv('commodities/2YTBond.csv', delimiter=";")
us5y = pd.read_csv('commodities/5YTBond.csv', delimiter=";")
us10y = pd.read_csv('commodities/10YTBond.csv', delimiter=";")
dowjones = pd.read_csv('commodities/DowJones.csv', delimiter=";")
nasdaq = pd.read_csv('commodities/NASDAQ.csv', delimiter=";")
sp500 = pd.read_csv('commodities/S&P.csv', delimiter=";")
gold = pd.read_csv('commodities/Gold.csv', delimiter=";")
oil = pd.read_csv('commodities/Oil.csv', delimiter=";")

In [None]:
commodities = [us2y, us5y, us10y, dowjones, nasdaq, sp500, gold, oil]

In [None]:
#rename column Date to Time
for commodity in commodities:
    commodity.rename(columns={'Date': 'Time'}, inplace=True)

In [None]:
# change data type 
for commodity in commodities:
    commodity['DATETIME'] = pd.to_datetime(commodity['Time'], format ='%Y-%m-%d %H:%M:%S')

In [None]:
us2y.head()

In [None]:
#move time period back by 15 minutes
for commodity in commodities:
    commodity['DATETIME'] = commodity['DATETIME'] - pd.Timedelta(minutes=15)

In [None]:
us2y.head()

In [None]:
# Keep the close price and timestamp of each series, renaming 'Close' to '<name>_PP'
Tickers = ["US2Y", "US5Y", "US10Y", "DJ", "NQ", "SP", "Gold", "Oil"]
commodities_reduced = [
    commodity.filter(['Close', 'DATETIME']).rename(columns={'Close': f'{name}_PP'})
    for commodity, name in zip(commodities, Tickers)
]

In [None]:
# Use reduce to merge the list of dataframes into a single dataframe on DATETIME
merge_commodities = reduce(lambda df1,df2: pd.merge(df1,df2,on='DATETIME'), commodities_reduced)
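
`pd.merge` performs an inner join by default, so the chained merge keeps only the timestamps that are present in every table. The sketch below shows this on three small made-up frames. The `*_toy` names are illustrative and not part of the notebook.

```python
from functools import reduce
import pandas as pd

ts = pd.to_datetime(['2021-01-04 09:30:00', '2021-01-04 09:45:00', '2021-01-04 10:00:00'])
gold_toy = pd.DataFrame({'DATETIME': ts, 'Gold_PP': [1.0, 2.0, 3.0]})
oil_toy = pd.DataFrame({'DATETIME': ts[:2], 'Oil_PP': [10.0, 20.0]})  # last timestamp missing
sp_toy = pd.DataFrame({'DATETIME': ts[1:], 'SP_PP': [100.0, 200.0]})  # first timestamp missing

merged_toy = reduce(lambda left, right: pd.merge(left, right, on='DATETIME'),
                    [gold_toy, oil_toy, sp_toy])
print(merged_toy)
# Only 09:45 appears in all three frames, so a single row remains
```

If one series has gaps, the corresponding timestamps are dropped for all series, which is worth checking by comparing row counts before and after the merge.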

In [None]:
merge_commodities

In [None]:
merged_df = []
for stock in stocks_reduced:
    merged_df.append(pd.merge(stock, merge_commodities, on="DATETIME"))

In [None]:
merged_df[0].head()

In [None]:
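# Stack the per-stock tables. ignore_index=True gives every row a unique index label,
# so later label-based selections do not match rows from several stocks at once.
df = pd.concat(merged_df, ignore_index=True)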

In [None]:
# Preview the rows in chronological order (the result is not assigned, so df itself is unchanged)
df.sort_values(by='DATETIME')

In [None]:
df['Sector'] = df['stock_ID'].map(lambda x: 'Technology' if x in ("AAPL", "AMD", "CSCO", "IBM", "INTC", "MSFT", "NVDA", "PLTR")
                                  else "Consumer Cyclical" if x in ("AMZN", "BABA", "F", "MCD", "TSLA")
                                  else "Industrial" if x in ("BA", "GE")
                                  else "Financial Services" if x in ("BAC", "C", "JPM")
                                  else 'Energy' if x in ("CVX", 'XOM')
                                  else "Communication Services" if x in ("DIS", "GOOGL", "META", "NFLX", "T", "VZ")
                                  else "Healthcare" if x in ("JNJ", "PFE")
                                  else "Consumer Defensive" if x in ("KO", "WMT")
                                  else '')
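
The chained conditional expression can also be written as a dictionary lookup, which may be easier to extend. The sketch below is an alternative under that assumption, not the notebook's code. `sector_by_ticker` is a hypothetical name and only a subset of the tickers is listed.

```python
import pandas as pd

# Ticker-to-sector lookup copied from the conditional expression above (subset shown)
sector_by_ticker = {
    'AAPL': 'Technology', 'AMZN': 'Consumer Cyclical', 'BA': 'Industrial',
    'JPM': 'Financial Services', 'XOM': 'Energy', 'KO': 'Consumer Defensive',
}

toy_sector = pd.DataFrame({'stock_ID': ['AAPL', 'XOM', 'ZZZ']})
# .map returns NaN for unknown tickers; fillna('') mirrors the original fallback
toy_sector['Sector'] = toy_sector['stock_ID'].map(sector_by_ticker).fillna('')
print(toy_sector)
```

Unknown tickers such as 'ZZZ' end up with an empty string, matching the final `else ''` branch.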

In [None]:
df

In [None]:
df['Sector'].value_counts()

In [None]:
#create columns for each categorical datetime variable of interest
df['DATE']        = df.DATETIME.dt.date
df['MONTH']       = df.DATETIME.dt.month
df['DAY']         = df.DATETIME.dt.day
df['HOUR']        = df.DATETIME.dt.hour
df['MINUTE']      = df.DATETIME.dt.minute
df['WEEK_DAY']    = df.DATETIME.dt.dayofweek

In [None]:
# Create holiday flag (US holidays for New York State, 2020-2022)
# Note: recent releases of the holidays package name this argument subdiv instead of state
holiday_days = [str(date) for date in holidays.US(state='NY', years=[2020, 2021, 2022])]
df['HOLIDAY'] = [1 if str(value) in holiday_days else 0 for value in df['DATE']]

In [None]:
# Create pre- and post-holiday flags
df['POSTHOLIDAY_MORNING'], df['PREHOLIDAY_AFTERNON'] = 0, 0

# Dates of the days immediately after and before each holiday present in the data
holiday_dates = pd.to_datetime(df.loc[df.HOLIDAY == 1, 'DATE'].unique())
POSTHOLIDAY = list((holiday_dates + pd.Timedelta(days=1)).date)
PREHOLIDAY  = list((holiday_dates - pd.Timedelta(days=1)).date)

# Compare as datetime.date objects, the same type stored in df.DATE
df.loc[df.DATE.isin(PREHOLIDAY)  & (df.HOUR > 12) , 'PREHOLIDAY_AFTERNON'] = 1
df.loc[df.DATE.isin(POSTHOLIDAY) & (df.HOUR <= 12), 'POSTHOLIDAY_MORNING'] = 1
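
To see which rows the two flags mark, the sketch below rebuilds them on a five-row toy frame around Christmas Day 2020. It is a vectorised illustration on made-up timestamps chosen only to exercise each branch, not the notebook's exact code, and the HOLIDAY column is set by hand rather than through the holidays package.

```python
import pandas as pd

toy_hol = pd.DataFrame({'DATETIME': pd.to_datetime([
    '2020-12-24 10:00:00',  # day before the holiday, morning
    '2020-12-24 14:00:00',  # day before the holiday, afternoon
    '2020-12-25 10:00:00',  # the holiday itself (Christmas Day)
    '2020-12-26 10:00:00',  # day after the holiday, morning
    '2020-12-26 14:00:00',  # day after the holiday, afternoon
])})
toy_hol['DATE'] = toy_hol['DATETIME'].dt.date
toy_hol['HOUR'] = toy_hol['DATETIME'].dt.hour
toy_hol['HOLIDAY'] = (toy_hol['DATE'].astype(str) == '2020-12-25').astype(int)

# Neighbouring days of every holiday, kept as datetime.date to match the DATE column
holiday_dates = pd.to_datetime(toy_hol.loc[toy_hol['HOLIDAY'] == 1, 'DATE'].unique())
pre_days = list((holiday_dates - pd.Timedelta(days=1)).date)
post_days = list((holiday_dates + pd.Timedelta(days=1)).date)

toy_hol['PREHOLIDAY_AFTERNON'] = (toy_hol['DATE'].isin(pre_days) & (toy_hol['HOUR'] > 12)).astype(int)
toy_hol['POSTHOLIDAY_MORNING'] = (toy_hol['DATE'].isin(post_days) & (toy_hol['HOUR'] <= 12)).astype(int)
print(toy_hol[['DATETIME', 'HOLIDAY', 'PREHOLIDAY_AFTERNON', 'POSTHOLIDAY_MORNING']])
```

Only the 14:00 row on the day before and the 10:00 row on the day after are flagged. The flags follow calendar days, so a holiday on a Friday marks the Saturday as its post-holiday day even though no trading data would exist for it.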

In [None]:
# Create monday morning flag
df['MONDAY_MORNING']=0
df.loc[(df.WEEK_DAY == 0) & (df.HOUR <= 12), 'MONDAY_MORNING'] =1

In [None]:
# Create friday afternoon flag
df['FRIDAY_AFTERNOON']=0
df.loc[(df.WEEK_DAY == 4) & (df.HOUR > 12), 'FRIDAY_AFTERNOON'] =1
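
The weekday flags rely on `dt.dayofweek`, where Monday is 0 and Friday is 4. A small check on made-up timestamps (4 January 2021 was a Monday and 8 January 2021 a Friday) shows which rows are marked. Note that `HOUR <= 12` counts 12:00 to 12:59 as morning.

```python
import pandas as pd

toy_week = pd.DataFrame({'DATETIME': pd.to_datetime([
    '2021-01-04 09:30:00',  # Monday morning
    '2021-01-04 15:00:00',  # Monday afternoon
    '2021-01-08 09:30:00',  # Friday morning
    '2021-01-08 15:00:00',  # Friday afternoon
])})
toy_week['HOUR'] = toy_week['DATETIME'].dt.hour
toy_week['WEEK_DAY'] = toy_week['DATETIME'].dt.dayofweek  # Monday=0 ... Sunday=6

toy_week['MONDAY_MORNING'] = ((toy_week['WEEK_DAY'] == 0) & (toy_week['HOUR'] <= 12)).astype(int)
toy_week['FRIDAY_AFTERNOON'] = ((toy_week['WEEK_DAY'] == 4) & (toy_week['HOUR'] > 12)).astype(int)
print(toy_week)
```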

In [None]:
# Delete holiday days
# Keep only the non-holiday rows (a boolean mask stays correct even if index labels repeat),
# then drop the helper column
df = df.loc[df['HOLIDAY'] != 1].drop(columns=['HOLIDAY'])

## One-hot encoding

In [None]:
def one_hot(og_df, feature_to_encode):
    """Take a dataframe and a feature to one-hot encode; return the dataframe with that feature encoded."""
    dummies = pd.get_dummies(og_df[feature_to_encode], prefix=feature_to_encode, prefix_sep="_")
    df = pd.concat([og_df, dummies], axis=1)
    df = df.drop([feature_to_encode], axis=1)
    return df
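
A short usage example on a made-up frame shows what the encoder produces. The function body is repeated in compact form so the snippet runs on its own, and `toy_enc` is an illustrative name.

```python
import pandas as pd

def one_hot(og_df, feature_to_encode):
    # Same logic as the function above, repeated so the snippet runs on its own
    dummies = pd.get_dummies(og_df[feature_to_encode], prefix=feature_to_encode, prefix_sep="_")
    return pd.concat([og_df, dummies], axis=1).drop([feature_to_encode], axis=1)

toy_enc = pd.DataFrame({'stock_ID': ['AAPL', 'XOM', 'MSFT'],
                        'Sector': ['Technology', 'Energy', 'Technology']})
encoded_toy = one_hot(toy_enc, 'Sector')
print(encoded_toy.columns.tolist())
# ['stock_ID', 'Sector_Energy', 'Sector_Technology']
```

The original 'Sector' column is replaced by one indicator column per category, named with the feature as prefix.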

In [None]:
to_encode = ['Sector', 'MONTH', 'DAY']

In [None]:
# One-hot encode the sector and the time features of interest
encoded_df = df
for encode in to_encode:
    encoded_df = one_hot(encoded_df, encode)

In [None]:
encoded_df.rename(columns={"Volume":"Volume_PP"}, inplace=True)

In [None]:
encoded_df

## Final result

In [None]:
encoded_df.to_csv('df.csv', index=False)
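
When the exported file is read back, the DATETIME column is stored as text unless it is parsed explicitly. The round trip below illustrates this on a one-row made-up frame written to a temporary folder. The file location and the `toy_out` name are assumptions for the example only.

```python
import os
import tempfile
import pandas as pd

toy_out = pd.DataFrame({'DATETIME': pd.to_datetime(['2021-01-04 09:30:00']), 'Close': [10.5]})
path = os.path.join(tempfile.mkdtemp(), 'df.csv')
toy_out.to_csv(path, index=False)

plain = pd.read_csv(path)                              # DATETIME comes back as text
parsed = pd.read_csv(path, parse_dates=['DATETIME'])   # DATETIME restored as datetime64
print(plain['DATETIME'].dtype, parsed['DATETIME'].dtype)
```

Passing `parse_dates` when loading the file in the modelling notebooks avoids having to convert the column again by hand.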