# Predicting daily stock market movement with machine learning
Week 2 - Obtain and prepare data
- Historical stock price data will be obtained from the Yahoo Finance API, via yfinance
- Documentation links for yfinance:

> https://pypi.org/project/yfinance/ ; https://github.com/ranaroussi/yfinance

- Market indicator data will be obtained from Nasdaq Data Link:
> https://data.nasdaq.com/

In [2]:
# import libraries
import pandas as pd
import numpy as np
import yfinance as yf

import warnings
warnings.filterwarnings('ignore')

## Stock market data
### Obtaining data
- Tesla stock will be used for this project

In [3]:
day_df = yf.download("TSLA",           # obtaining data for Tesla stock       
                     period='1y',      # most recent year of data, ending on the current day
                     interval='1d')    # daily statistics

hr_df = yf.download("TSLA", 
                    period='1y', 
                    interval='1h',     # hour interval 
                    prepost=True)      # include pre-market & post-market data

min_df = yf.download("TSLA", 
                    period='60d',      # max period for this interval is 60 days
                    interval='2m',     # 2 minute intervals 
                    prepost=True)      

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
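
The column layout returned by `yf.download` can vary with the installed yfinance version: some releases return two-level columns (field, ticker), and `Adj Close` may be absent when prices are already adjusted. The sketch below is an assumption about that behavior rather than part of the original workflow. The helper name `flatten_price_columns` is hypothetical, and it is demonstrated on a synthetic frame so it runs without a network call.

```python
import pandas as pd

def flatten_price_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with single-level columns and an 'Adj Close' column."""
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # keep the field level ('Open', 'Close', ...) and drop the ticker level
        out.columns = out.columns.get_level_values(0)
    if 'Adj Close' not in out.columns:
        # assumption: when prices are already adjusted, 'Close' stands in for 'Adj Close'
        out['Adj Close'] = out['Close']
    return out

# synthetic frame mimicking a (field, ticker) column layout
cols = pd.MultiIndex.from_product([['Open', 'Close'], ['TSLA']])
demo = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=cols)
flat = flatten_price_columns(demo)
```

If the downloaded frames already have the six flat columns shown in this notebook, the helper returns an unchanged copy.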


### Inspecting TSLA stock data

In [4]:
# Daily data (1 yr)

day_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 251 entries, 2022-04-04 to 2023-04-03
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       251 non-null    float64
 1   High       251 non-null    float64
 2   Low        251 non-null    float64
 3   Close      251 non-null    float64
 4   Adj Close  251 non-null    float64
 5   Volume     251 non-null    int64  
dtypes: float64(5), int64(1)
memory usage: 13.7 KB


- No missing values to deal with

In [5]:
day_df.head()

Date,Open,High,Low,Close,Adj Close,Volume
2022-04-04,363.126678,383.303345,357.51001,381.816681,381.816681,82035900
2022-04-05,378.766663,384.290009,362.433319,363.753326,363.753326,80075100
2022-04-06,357.823334,359.666656,342.566681,348.58667,348.58667,89348400
2022-04-07,350.796661,358.863342,340.513336,352.420013,352.420013,79447200
2022-04-08,347.736664,349.480011,340.813324,341.829987,341.829987,55013700


In [6]:
day_df.tail()

# final row reflects current day data, although there is a delay

Date,Open,High,Low,Close,Adj Close,Volume
2023-03-28,192.0,192.350006,185.429993,189.190002,189.190002,98654600
2023-03-29,193.130005,195.289993,189.440002,193.880005,193.880005,123660000
2023-03-30,195.580002,197.330002,194.419998,195.279999,195.279999,110252200
2023-03-31,197.529999,207.789993,197.199997,207.460007,207.460007,169638500
2023-04-03,199.910004,202.689697,193.839996,195.779907,195.779907,98979662


- The '1d' interval has returned the following features:
    - date (index)
    - opening price
    - daily high
    - daily low
    - closing price
    - adjusted closing price
    - volume of shares traded

In [9]:
# Hourly data (1y)

hr_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4171 entries, 2022-04-04 04:00:00-04:00 to 2023-04-03 11:30:00-04:00
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       4171 non-null   float64
 1   High       4171 non-null   float64
 2   Low        4171 non-null   float64
 3   Close      4171 non-null   float64
 4   Adj Close  4171 non-null   float64
 5   Volume     4171 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 228.1 KB


- No missing values to deal with

In [8]:
hr_df.head()

Datetime,Open,High,Low,Close,Adj Close,Volume
2022-04-01 04:00:00-04:00,360.03333,361.0167,358.04333,358.04333,358.04333,0
2022-04-01 05:00:00-04:00,358.07333,358.71667,355.67,357.86667,357.86667,0
2022-04-01 06:00:00-04:00,357.90665,360.91666,357.0,360.52332,360.52332,0
2022-04-01 07:00:00-04:00,360.62,361.74332,359.83334,361.14334,361.14334,0
2022-04-01 08:00:00-04:00,361.16666,363.15332,359.19998,362.13333,362.13333,0


- The same columns have been returned, but the index now includes the time, and volume is reported only during regular (intraday) trading hours

In [10]:
# 2-minute interval data (60 days)

min_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 14443 entries, 2023-02-16 19:00:00-05:00 to 2023-04-03 12:16:00-04:00
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       14443 non-null  float64
 1   High       14443 non-null  float64
 2   Low        14443 non-null  float64
 3   Close      14443 non-null  float64
 4   Adj Close  14443 non-null  float64
 5   Volume     14443 non-null  int64  
dtypes: float64(5), int64(1)
memory usage: 789.9 KB


- No missing values to deal with

In [11]:
min_df.head()

Datetime,Open,High,Low,Close,Adj Close,Volume
2023-02-16 19:00:00-05:00,197.4,197.45,197.17,197.18,197.18,0
2023-02-16 19:02:00-05:00,197.18,197.4,197.18,197.335,197.335,0
2023-02-16 19:04:00-05:00,197.35,197.72,197.35,197.57,197.57,0
2023-02-16 19:06:00-05:00,197.51,197.56,197.38,197.46,197.46,0
2023-02-16 19:08:00-05:00,197.45,197.7,197.43,197.64,197.64,0


### Preparing TSLA stock data

With no missing values to deal with, I'll move straight into feature engineering:
- The focus will be on the adjusted closing price (Adj Close) and its % change between intervals
- Simple percent change will be calculated for EDA & visualization, while natural log returns will be calculated for analysis & machine learning purposes
- User-defined functions (UDFs) will be created to avoid redundancy

In [12]:
def add_returns(df):
    '''
    Adds column with the simple return (percentage change in adjusted close between intervals);
    also adds column with the log return (natural log of the price ratio)
    '''
    
    df['% change'] = df['Adj Close'] / df['Adj Close'].shift(1) - 1
    df['% change (ln)'] = np.log(df['Adj Close'] / df['Adj Close'].shift(1))
    
    # The first row has no prior price, so its returns are NaN and the row is dropped
    df.dropna(inplace=True) 

In [13]:
# apply to each dataframe

df_list = [day_df, hr_df, min_df]

for df in df_list:
    add_returns(df)
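
The reason for keeping both return columns can be shown on a toy price series: log returns add up across intervals, while simple returns compound. This is a minimal sketch with made-up prices, not TSLA data.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 108.9])   # made-up prices

simple = prices / prices.shift(1) - 1              # same formula as '% change'
log_ret = np.log(prices / prices.shift(1))         # same formula as '% change (ln)'

# log returns sum to the total log return; simple returns must be compounded
total_log = log_ret.sum()
total_simple = (1 + simple.dropna()).prod() - 1

# the two totals are linked by log(1 + r)
print(np.isclose(total_log, np.log1p(total_simple)))
```

The additive property is what makes log returns convenient for modeling, while simple returns are easier to read on a chart.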

In [14]:
def add_sessions(df):
    '''
    Creates new date & time columns for readability;
    new column that indicates whether the bar falls within regular (intraday) trading hours;
    new column that indicates whether the bar is before the open, intraday, or after the close
    '''

    df['time'] = df.index.strftime('%H:%M')
    df['date'] = df.index.strftime('%Y-%m-%d')

    # regular session runs 09:30-16:00 exchange time; zero-padded HH:MM strings compare correctly
    before_open = df['time'] < '09:30'
    after_close = df['time'] >= '16:00'

    # vectorized labels avoid row-by-row chained assignment
    df['intraday'] = np.where(before_open | after_close, 'no', 'yes')
    df['after/before'] = np.select([before_open, after_close],
                                   ['before open', 'after close'],
                                   default='intraday')

In [15]:
# no sessions for day_df

df_list = [hr_df, min_df]

for df in df_list:
    add_sessions(df)
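
The session boundaries matter most for the 2-minute data, where bars between 15:32 and 15:58 still belong to the regular session. The sketch below labels sessions from minutes since midnight on a small synthetic index. It assumes a 09:30-16:00 regular session, timestamps already in exchange local time, and no early closes; the `session` column name is illustrative.

```python
import numpy as np
import pandas as pd

# synthetic 2-minute bars around the open and close (assumed exchange local time)
idx = pd.DatetimeIndex(['2023-03-31 09:28', '2023-03-31 09:30',
                        '2023-03-31 15:58', '2023-03-31 16:00'])
bars = pd.DataFrame({'Adj Close': [1.0, 2.0, 3.0, 4.0]}, index=idx)

minutes = bars.index.hour * 60 + bars.index.minute   # minutes since midnight
open_min, close_min = 9 * 60 + 30, 16 * 60           # assumed regular session bounds

bars['session'] = np.select(
    [minutes < open_min, minutes >= close_min],
    ['before open', 'after close'],
    default='intraday')
```

Working with integer minutes avoids relying on string ordering and makes the boundaries easy to change.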

### Reviewing TSLA stock data

In [29]:
day_df.head()

Date,Open,High,Low,Close,Adj Close,Volume,% change,% change (ln)
2022-04-05,378.766663,384.290009,362.433319,363.753326,363.753326,80075100,-0.047309,-0.048465
2022-04-06,357.823334,359.666656,342.566681,348.58667,348.58667,89348400,-0.041695,-0.042589
2022-04-07,350.796661,358.863342,340.513336,352.420013,352.420013,79447200,0.010997,0.010937
2022-04-08,347.736664,349.480011,340.813324,341.829987,341.829987,55013700,-0.030049,-0.03051
2022-04-11,326.799988,336.156677,324.880005,325.309998,325.309998,59357100,-0.048328,-0.049535


In [30]:
hr_df.head()

Datetime,Open,High,Low,Close,Adj Close,Volume,% change,% change (ln),time,date,intraday,after/before
2022-04-04 05:00:00-04:00,364.6,365.56665,363.78333,365.55667,365.55667,0,0.002798,0.002794,05:00,2022-04-04,no,before open
2022-04-04 06:00:00-04:00,365.51334,366.29666,365.0167,365.93335,365.93335,0,0.00103,0.00103,06:00,2022-04-04,no,before open
2022-04-04 07:00:00-04:00,365.87,365.91666,363.5,363.5,363.5,0,-0.00665,-0.006672,07:00,2022-04-04,no,before open
2022-04-04 08:00:00-04:00,363.51334,365.38333,362.33334,364.24332,364.24332,0,0.002045,0.002043,08:00,2022-04-04,no,before open
2022-04-04 09:00:00-04:00,364.24332,364.53333,362.29333,363.11,363.11,0,-0.003111,-0.003116,09:00,2022-04-04,no,before open


In [31]:
min_df.head()

Datetime,Open,High,Low,Close,Adj Close,Volume,% change,% change (ln),time,date,intraday,after/before
2023-02-16 19:02:00-05:00,197.18,197.4,197.18,197.335,197.335,0,0.000786,0.000786,19:02,2023-02-16,no,after close
2023-02-16 19:04:00-05:00,197.35,197.72,197.35,197.57,197.57,0,0.001191,0.00119,19:04,2023-02-16,no,after close
2023-02-16 19:06:00-05:00,197.51,197.56,197.38,197.46,197.46,0,-0.000557,-0.000557,19:06,2023-02-16,no,after close
2023-02-16 19:08:00-05:00,197.45,197.7,197.43,197.64,197.64,0,0.000912,0.000911,19:08,2023-02-16,no,after close
2023-02-16 19:10:00-05:00,197.655,197.84,197.6,197.81,197.81,0,0.00086,0.00086,19:10,2023-02-16,no,after close


- The returns and sessions are now present

### Adding TSLA stock price lags
- New dataframes will now be created with the 'Adj Close' price, the log return, and 'lag' prices, which reflect the stock prices from the 1-5 prior intervals (days, hours, or 2-minute bars)

In [32]:
def add_lags(df):
    
    '''Add price lags based on "Adj Close";
       Specify number of lags in first line'''
    
    lags = 5
    
    # copy so the renamed and added columns do not touch the original dataframe
    df = df[['Adj Close']].copy()
    df.rename(columns={'Adj Close': 'price'}, inplace=True)
    df['return'] = np.log(df['price'] / df['price'].shift(1))
    
    for lag in range(1, lags + 1):
        df[f'lag_{lag}'] = df['price'].shift(lag)
    
    df.dropna(inplace=True)
    
    return df

In [33]:
lag_day_df = add_lags(day_df)
lag_day_df.head()

Date,price,return,lag_1,lag_2,lag_3,lag_4,lag_5
2022-04-12,328.983337,0.011229,325.309998,341.829987,352.420013,348.58667,363.753326
2022-04-13,340.790009,0.035259,328.983337,325.309998,341.829987,352.420013,348.58667
2022-04-14,328.333344,-0.037237,340.790009,328.983337,325.309998,341.829987,352.420013
2022-04-18,334.763336,0.019394,328.333344,340.790009,328.983337,325.309998,341.829987
2022-04-19,342.716675,0.02348,334.763336,328.333344,340.790009,328.983337,325.309998


In [34]:
lag_hr_df = add_lags(hr_df)
lag_hr_df.head()

Datetime,price,return,lag_1,lag_2,lag_3,lag_4,lag_5
2022-04-04 09:30:00-04:00,369.333344,0.016994,363.11,364.24332,363.5,365.93335,365.55667
2022-04-04 10:30:00-04:00,377.793335,0.022648,369.333344,363.11,364.24332,363.5,365.93335
2022-04-04 11:30:00-04:00,377.484619,-0.000817,377.793335,369.333344,363.11,364.24332,363.5
2022-04-04 12:30:00-04:00,381.39743,0.010312,377.484619,377.793335,369.333344,363.11,364.24332
2022-04-04 13:30:00-04:00,382.065826,0.001751,381.39743,377.484619,377.793335,369.333344,363.11


In [35]:
lag_min_df = add_lags(min_df)
lag_min_df.head()

Datetime,price,return,lag_1,lag_2,lag_3,lag_4,lag_5
2023-02-16 19:12:00-05:00,198.2,0.00197,197.81,197.64,197.46,197.57,197.335
2023-02-16 19:14:00-05:00,199.1,0.004531,198.2,197.81,197.64,197.46,197.57
2023-02-16 19:16:00-05:00,198.7,-0.002011,199.1,198.2,197.81,197.64,197.46
2023-02-16 19:18:00-05:00,198.71,5e-05,198.7,199.1,198.2,197.81,197.64
2023-02-16 19:20:00-05:00,198.7,-5e-05,198.71,198.7,199.1,198.2,197.81
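
In the lag tables, `lag_k` in a given row holds the price from k intervals earlier, so every lag feature predates that row's own price. The sketch below checks this alignment on a toy series. `make_lags` is a hypothetical variant of `add_lags` that takes the number of lags as an argument, and the prices are made up.

```python
import numpy as np
import pandas as pd

def make_lags(prices: pd.Series, lags: int = 5) -> pd.DataFrame:
    """Hypothetical variant of add_lags with a configurable number of lags."""
    out = pd.DataFrame({'price': prices})
    out['return'] = np.log(out['price'] / out['price'].shift(1))
    for lag in range(1, lags + 1):
        # shift(lag) moves values down, so row t receives the price from row t - lag
        out[f'lag_{lag}'] = out['price'].shift(lag)
    # the first `lags` rows have incomplete history and are dropped
    return out.dropna()

demo = make_lags(pd.Series([10.0, 11.0, 12.0, 13.0, 14.0]), lags=2)
```

With two lags, the first two rows are dropped and the first surviving row pairs a price of 12 with `lag_1` of 11 and `lag_2` of 10.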


## Market indicator data
In addition to standard stock trading data, possible market indicators were obtained with the Nasdaq Data Link API, from sources including:
- US Treasury
- Financial Industry Regulatory Authority (FINRA)

Additionally, indicator signals within the data will be calculated, including:
- Moving average
- Volatility
- Relative strength index 

In [39]:
import credentials
import nasdaqdatalink

nasdaqdatalink.ApiConfig.api_key = credentials.nasdaq_key

### US Treasury data

**Daily Treasury Real Long-Term Rates**
- Long Term Real Rate Average: The Long-Term Real Rate Average is the unweighted average of bid real yields on all outstanding TIPS with remaining maturities of more than 10 years and is intended as a proxy for long-term real rates.

In [64]:
ltrt_df = nasdaqdatalink.get(dataset="USTREASURY/REALLONGTERM", start_date="2022-04-01", end_date="2023-03-31")

In [65]:
ltrt_df.tail()

Date,LT Real Average (>10Yrs)
2023-03-27,1.62
2023-03-28,1.57
2023-03-29,1.57
2023-03-30,1.54
2023-03-31,1.51


**Daily Treasury Par Real Yield Curve Rates**
- These par real yields are calculated from indicative secondary market quotations obtained by the Federal Reserve Bank of New York. The par real yield values are read from the par real yield curve at fixed maturities, currently 5, 7, 10, 20, and 30 years. 

In [46]:
ryc_df = (nasdaqdatalink.get(dataset="USTREASURY/REALYIELD", start_date="2021-01-01", end_date="2023-03-31"))

In [47]:
ryc_df.tail()

Date,5 YR,7 YR,10 YR,20 YR,30 YR
2023-03-27,1.32,1.3,1.29,1.43,1.55
2023-03-28,1.29,1.26,1.24,1.38,1.5
2023-03-29,1.29,1.26,1.24,1.38,1.5
2023-03-30,1.26,1.23,1.21,1.35,1.48
2023-03-31,1.2,1.17,1.16,1.31,1.44


**Daily Treasury Bill Rates**
- These rates are the daily secondary market quotations on the most recently auctioned Treasury Bills for each maturity tranche (4-week, 8-week, 13-week, 17-week, 26-week, and 52-week) for which Treasury currently issues new bills. Market quotations are obtained at approximately 3:30 PM each business day by the Federal Reserve Bank of New York. The Bank Discount rate is the rate at which a bill is quoted in the secondary market and is based on the par value, amount of the discount and a 360-day year. The Coupon Equivalent, also called the Bond Equivalent, or the Investment Yield, is the bill's yield based on the purchase price, discount, and a 365- or 366-day year. The Coupon Equivalent can be used to compare the yield on a discount bill to the yield on a nominal coupon security that pays semiannual interest with the same maturity date.

In [48]:
tb_df = (nasdaqdatalink.get(dataset="USTREASURY/BILLRATES", start_date="2021-01-01", end_date="2023-03-31"))

In [49]:
tb_df.tail()

Date,4 Wk Bank Discount Rate,4 Wk Coupon Equiv,8 Wk Bank Discount Rate,8 Wk Coupon Equiv,13 Wk Bank Discount Rate,13 Wk Coupon Equiv,26 Wk Bank Discount Rate,26 Wk Coupon Equiv,52 Wk Bank Discount Rate,52 Wk Coupon Equiv
2023-03-27,4.07,4.15,4.29,4.39,4.75,4.89,4.65,4.84,4.3,4.52
2023-03-28,4.1,4.18,4.2,4.3,4.64,4.77,4.68,4.87,4.34,4.56
2023-03-29,4.2,4.28,4.29,4.39,4.64,4.77,4.7,4.89,4.38,4.6
2023-03-30,4.6,4.69,4.64,4.75,4.8,4.94,4.7,4.89,4.41,4.64
2023-03-31,4.6,4.69,4.65,4.76,4.68,4.81,4.72,4.91,4.43,4.66
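
The relationship between the Bank Discount rate and the Coupon Equivalent described above can be sketched for short-dated bills. This is a simplified illustration: it assumes the simple-interest form that applies to bills maturing within about half a year (longer bills need a different formula, not shown), and the 28 days to maturity is an assumed input, not a value from the dataset.

```python
def bill_price(discount_rate: float, days: int, face: float = 100.0) -> float:
    """Price implied by a bank discount rate, which is quoted on par value and a 360-day year."""
    return face * (1 - discount_rate * days / 360)

def coupon_equivalent(discount_rate: float, days: int, year_days: int = 365) -> float:
    """Simple-interest yield on the purchase price, on a 365- or 366-day year."""
    face = 100.0
    price = bill_price(discount_rate, days, face)
    return (face - price) / price * year_days / days

# illustrative inputs only: a 4.07% discount rate with an assumed 28 days to maturity
d, t = 0.0407, 28
ce = coupon_equivalent(d, t)

# equivalent closed form: year_days * d / (360 - d * t)
closed_form = 365 * d / (360 - d * t)
```

Because the coupon equivalent is measured against the lower purchase price and a longer year, it always exceeds the discount rate, which matches the pattern in the table.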


**Daily Treasury Par Yield Curve Rates**
- These rates are commonly referred to as "Constant Maturity Treasury" rates, or CMTs. Yields are interpolated by the Treasury from the daily yield curve.

In [51]:
py_df = (nasdaqdatalink.get(dataset="USTREASURY/YIELD", start_date="2021-01-01", end_date="2023-03-31"))

In [52]:
py_df.tail()

Date,1 MO,2 MO,3 MO,6 MO,1 YR,2 YR,3 YR,5 YR,7 YR,10 YR,20 YR,30 YR
2023-03-27,4.22,4.47,4.91,4.86,4.51,3.94,3.79,3.59,3.57,3.53,3.9,3.77
2023-03-28,4.24,4.39,4.8,4.9,4.55,4.02,3.84,3.63,3.6,3.55,3.9,3.77
2023-03-29,4.34,4.5,4.8,4.92,4.59,4.08,3.87,3.67,3.62,3.57,3.91,3.78
2023-03-30,4.74,4.77,4.97,4.92,4.63,4.1,3.87,3.66,3.61,3.55,3.88,3.74
2023-03-31,4.74,4.79,4.85,4.94,4.64,4.06,3.81,3.6,3.55,3.48,3.81,3.67


**High Quality Market Corporate Bond Yield Curve Spot Rates**
- The HQM (High Quality Market) yield curve represents the high quality corporate bond market

In [53]:
hqm_df = (nasdaqdatalink.get(dataset="USTREASURY/HQMYC", start_date="2021-01-01", end_date="2023-03-31"))

In [56]:
hqm_df.tail()

Month,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,...,95.5,96.0,96.5,97.0,97.5,98.0,98.5,99.0,99.5,100.0
2022-10-31,4.53,4.79,5.0,5.14,5.22,5.25,5.26,5.26,5.28,5.3,...,5.53,5.53,5.53,5.53,5.53,5.53,5.53,5.53,5.53,5.53
2022-11-30,4.94,5.04,5.11,5.14,5.15,5.13,5.11,5.09,5.09,5.1,...,5.26,5.26,5.26,5.26,5.25,5.25,5.25,5.25,5.25,5.25
2022-12-31,4.96,4.94,4.92,4.89,4.85,4.81,4.77,4.75,4.75,4.75,...,4.67,4.67,4.67,4.67,4.67,4.67,4.67,4.67,4.67,4.67
2023-01-31,4.93,4.89,4.86,4.81,4.75,4.7,4.65,4.62,4.6,4.6,...,4.73,4.73,4.73,4.73,4.73,4.73,4.72,4.72,4.72,4.72
2023-02-28,5.18,5.15,5.12,5.07,5.01,4.95,4.9,4.86,4.83,4.82,...,4.84,4.84,4.84,4.84,4.84,4.84,4.84,4.84,4.84,4.83


### Financial Industry Regulatory Authority (FINRA) data
**Short volume**
- Daily short sale volume for TSLA (reported trading volume, which is distinct from short interest)

In [57]:
si_df = (nasdaqdatalink.get(dataset="FINRA/FNYX_TSLA", start_date="2021-01-01", end_date="2023-03-31"))

In [58]:
si_df.tail()

Date,ShortVolume,ShortExemptVolume,TotalVolume
2023-03-27,31729834.0,35323.0,47585849.0
2023-03-28,24348648.0,27973.0,36664308.0
2023-03-29,30771016.0,56533.0,46480664.0
2023-03-30,24907112.0,41247.0,41395000.0
2023-03-31,36420054.0,125930.0,60956383.0
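
One way these columns could be turned into a single indicator is the share of reported volume that was sold short. The sketch below uses two rows copied from the table above. The `short_ratio` column name is an assumption, and whether this feature is useful is left to the EDA stage.

```python
import pandas as pd

# two rows copied from the si_df output above
sample = pd.DataFrame(
    {'ShortVolume': [31729834.0, 24348648.0],
     'TotalVolume': [47585849.0, 36664308.0]},
    index=pd.to_datetime(['2023-03-27', '2023-03-28']))

# fraction of the reported volume that was short sales
sample['short_ratio'] = sample['ShortVolume'] / sample['TotalVolume']
```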


### Calculated indicators

**Simple moving averages**
- Pandas has a built-in `rolling()` method that can be used to calculate simple moving averages (SMAs), and there are also libraries with SMA calculators, such as cufflinks
- To save time, I adapted the calculation code from this Medium article:
> link: https://guese-justin.medium.com/enhancing-stock-data-for-your-python-algorithmic-trading-model-b40c668e4087

In [70]:
sma_df = day_df[['Adj Close']].copy()
sma_df.head()

Date,Adj Close
2022-04-05,363.753326
2022-04-06,348.58667
2022-04-07,352.420013
2022-04-08,341.829987
2022-04-11,325.309998


In [73]:
# specify rolling window sizes (in trading days)
averages = [5, 15, 30, 50]

for average in averages:
    sma_df[f'SMA_{average}'] = sma_df["Adj Close"].rolling(window=average).mean()
    
# rows without a full 50-day window are null; dropping them is fine as I am most interested in recent data
sma_df.dropna(inplace=True)

In [74]:
sma_df.head()

Date,Adj Close,SMA_5,SMA_15,SMA_30,SMA_50
2022-06-15,233.0,228.312665,237.447334,245.195444,277.316601
2022-06-16,213.100006,222.991333,237.014,241.71411,274.303535
2022-06-17,216.759995,219.897333,235.737333,239.236333,271.667001
2022-06-21,237.036667,224.157333,234.659111,237.519222,269.359334
2022-06-22,236.08667,227.196667,233.548,236.643111,267.244468
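
The table starts in mid-June because a rolling window returns NaN until it has a full set of observations, so the 50-day SMA is undefined for the first 49 rows and `dropna()` removes them. A small sketch on made-up prices shows the default behavior next to the `min_periods` option, which keeps early rows by averaging whatever is available.

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])   # made-up prices

# default: NaN until 3 observations exist
sma_3 = prices.rolling(window=3).mean()

# min_periods=1: partial windows are averaged instead of returning NaN
sma_3_partial = prices.rolling(window=3, min_periods=1).mean()
```

Partial windows preserve more rows, but the early values are averages over fewer points and are not comparable with full-window SMAs.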


**Volatility**
- Volatility can be estimated as the rolling standard deviation of log returns, using the Pandas `rolling()` method as in the previous calculation
- Volatility based on windows of 10 and 50 days will be determined

In [87]:
volt_df = day_df[['% change (ln)']].copy()
volt_df.head()

Date,% change (ln)
2022-04-05,-0.048465
2022-04-06,-0.042589
2022-04-07,0.010937
2022-04-08,-0.03051
2022-04-11,-0.049535


In [88]:
# specify rolling window sizes (in trading days)
averages = [10, 50]

for average in averages:
    volt_df[f'VOLT_{average}'] = volt_df["% change (ln)"].rolling(window=average).std()
    
# dropping nulls as I am most interested in recent data
volt_df.dropna(inplace=True)

In [89]:
volt_df.head()

Date,% change (ln),VOLT_10,VOLT_50
2022-06-15,0.053374,0.048815,0.048215
2022-06-16,-0.089277,0.051588,0.049216
2022-06-17,0.017029,0.044686,0.04915
2022-06-21,0.089424,0.054091,0.051033
2022-06-22,-0.004016,0.054089,0.050932
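
The `VOLT_` columns are per-day figures. If an annualized number is wanted for comparison with quoted volatilities, a common convention is to scale by the square root of the number of trading days in a year. The sketch below applies that to synthetic returns; the 252-day factor is an assumption, not something used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
log_returns = pd.Series(rng.normal(0.0, 0.02, size=300))   # synthetic daily log returns

window = 10
# rolling sample standard deviation (pandas default ddof=1), as in VOLT_10
daily_vol = log_returns.rolling(window=window).std()

# assumes roughly 252 trading days per year
annual_vol = daily_vol * np.sqrt(252)
```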


**Relative strength index**
- The relative strength index (RSI) measures the strength of a stock's price action. It can be used to identify overbought or oversold conditions
- The Technical Analysis Library (TA-Lib) will be used to calculate the RSI

In [90]:
import talib

In [91]:
rsi_df = day_df[['Adj Close']].copy()
rsi_df.head()

Date,Adj Close
2022-04-05,363.753326
2022-04-06,348.58667
2022-04-07,352.420013
2022-04-08,341.829987
2022-04-11,325.309998


In [96]:
rsi_df['rsi'] = talib.RSI(rsi_df['Adj Close'])    # default lookback of 14 periods

# only interested in recent data
rsi_df.dropna(inplace=True)
rsi_df.head()

Date,Adj Close,rsi
2022-05-16,241.456665,32.522582
2022-05-17,253.869995,38.218616
2022-05-18,236.603333,33.928373
2022-05-19,236.473328,33.897519
2022-05-20,221.300003,30.420407
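
To show what the RSI measures, here is a pandas-only sketch: average gains and average losses are smoothed over the lookback, and the index is 100 - 100 / (1 + RS), where RS is their ratio. The function name `rsi_wilder` is hypothetical, and the exponential smoothing with alpha = 1/period is an approximation, so early values may differ from `talib.RSI`, which may seed its first average differently.

```python
import numpy as np
import pandas as pd

def rsi_wilder(prices: pd.Series, period: int = 14) -> pd.Series:
    """RSI sketch using Wilder-style exponential smoothing (alpha = 1/period)."""
    delta = prices.diff()
    gain = delta.clip(lower=0)        # upward moves, zero otherwise
    loss = (-delta).clip(lower=0)     # downward moves as positive numbers
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss          # infinite when there are no losses
    return 100 - 100 / (1 + rs)

rising = rsi_wilder(pd.Series(np.arange(1.0, 31.0)))       # only gains
falling = rsi_wilder(pd.Series(np.arange(30.0, 0.0, -1)))  # only losses
```

A series that only rises sits at the upper bound of 100 and one that only falls sits at 0, which is the intuition behind reading high values as overbought and low values as oversold.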


## Week 2 summary
For week 2, the data collection and preparation stage, I obtained historical data for Tesla stock from yfinance. Data was collected for 3 separate time intervals (daily, hourly, 2-minute).
- Upon inspection, each dataframe contained a datetime index, opening price, high price, low price, closing price, adjusted closing price, and volume of shares traded. Fortunately, no missing values were present

Market indicators from the US Treasury and Financial Industry Regulatory Authority (FINRA) were obtained from Nasdaq Data Link
- Whether or not these will be utilized will be determined in the EDA/visualization stage

The data preparation stage focused on feature engineering:
- Simple returns and log returns based on the change in adjusted closing price were created
- Columns indicating intraday status & pre-market/after-hours sessions were created (hourly and 2-minute data only)
- Price lags were added to TSLA stock data

## Next steps
- Exploratory data analysis (EDA)
- Visualization
- Feature selection