# Imports
---

In [52]:
import os
import time
import pandas as pd
import yfinance as yf
import talib
from talib import MA_Type
import numpy as np
from tqdm.notebook import tqdm

# Problem Statement and Background
---

### Research Question
Using a regression-based model such as XGBoost, can technical indicators such as moving averages, Bollinger Bands, the Relative Strength Index, etc. predict the following week's percent stock return with an RMSE of at most 2%?

### Additional Details
This project will compare the performance of several models such as XGBoost, linear regression, random forest, and neural networks, along with a couple of time-series-specific methods such as ARIMA and LSTM. A benchmark model that uses the previous week's return to predict the current week's return will also be used to measure performance. To adapt to the time-series nature of the data, the dataset will be sorted by time, and the train-test split will simply use the first 60% of the data to train and the final 40% to test. The dataset contains stock data for 2000 different companies. The model will be trained and tested on all the companies at once instead of company by company.
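
The chronological split described above could look roughly like the following. This is a minimal sketch, assuming a long-format frame with a `Date` column; the helper name `chronological_split` and the toy data are illustrative and not part of the project code. Cutting on unique dates rather than on rows keeps every company's observations for a given week on the same side of the split.

```python
import pandas as pd

def chronological_split(df, date_col="Date", train_frac=0.6):
    """Split a multi-company frame by time: earliest dates train, latest dates test."""
    dates = df[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    n_train = max(1, int(round(len(dates) * train_frac)))
    cutoff = dates.iloc[n_train - 1]  # last date that belongs to the training set
    train = df[df[date_col] <= cutoff]
    test = df[df[date_col] > cutoff]
    return train, test

# Toy data: two hypothetical tickers observed over the same five weeks
weeks = pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22", "2024-01-29"])
toy = pd.DataFrame({
    "Date": list(weeks) * 2,
    "Ticker": ["AAA"] * 5 + ["BBB"] * 5,
    "Close": [1.0, 1.1, 1.2, 1.3, 1.4, 5.0, 5.1, 5.2, 5.3, 5.4],
})

train, test = chronological_split(toy)
print(len(train), len(test))
```

With five weekly dates and `train_frac=0.6`, the first three weeks of both tickers go to training and the last two to testing, so no test week precedes a training week.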

### Background
Stock prediction is an extremely common yet incredibly difficult task. The two most common techniques for this task are [fundamental and technical analysis](https://www.investopedia.com/articles/active-trading/101713/technical-vs-fundamental-investing-friends-or-foes.asp#:~:text=While%20technical%20analysis%20focuses%20on,better%20serves%20longer%20investment%20horizons.). Fundamental analysis focuses on the 'bigger picture,' looking at factors like company balance sheets, business strengths, the state of the industry, and various macroeconomic factors. Technical analysis, on the other hand, focuses on uncovering patterns in historical price and volume data to predict future prices. Overall, fundamental analysis is usually more useful for long-term investors due to its believed 'self-correcting' nature, whereas technical analysis is preferred by short-term investors looking to capitalize on historical patterns with quick entries and exits. This project will apply technical analysis in an attempt to predict stock prices.




# Dataset
---

The dataset was sourced using the [Yahoo Finance API](https://pypi.org/project/yfinance/). As there is no fixed dataset, I created a Python script to pull data from the API, store it in a dataframe, and finally output it to a CSV for future use. If the data ever needs to be updated, running the script once should update the entire CSV. Potential future updates can include different interval lengths, a longer or shorter overall time period, or a different set of companies. An initial time period of 10 years with an interval of 1 week was chosen, and stock data from 2000 companies was gathered. The final shape of the raw dataset was (1,014,887, 11). The features in this raw dataset will be used to construct various technical indicators such as several moving averages, Bollinger Bands, and RSI, among others. This updated dataset will also be saved as a CSV for reusability and efficiency.
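
The download script itself is not shown here. A minimal sketch of what such a script might look like is given below, assuming the `yfinance` call `yf.Ticker(symbol).history(period=..., interval=...)`. The helper names `fetch_weekly_history` and `build_master_frame` are hypothetical, and the `fetch` argument exists only so the assembly logic can be tried without network access.

```python
import pandas as pd

def fetch_weekly_history(ticker, period="10y", interval="1wk"):
    """Download one ticker's price history with yfinance (requires network access)."""
    import yfinance as yf  # imported lazily so the rest of the sketch runs offline
    return yf.Ticker(ticker).history(period=period, interval=interval)

def build_master_frame(tickers, fetch=fetch_weekly_history):
    """Stack per-ticker histories into one long frame with a 'Ticker' column."""
    frames = []
    for ticker in tickers:
        hist = fetch(ticker)
        if hist is None or hist.empty:
            continue  # skip symbols that return no data
        hist = hist.reset_index()  # move the date index into a regular column
        hist["Ticker"] = ticker
        frames.append(hist)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Hypothetical usage (ticker list chosen for illustration only):
# build_master_frame(["NVDA", "DIS"]).to_csv("master_stock_dataset.csv", index=False)
```

Re-running a script of this shape rebuilds the whole CSV, which matches the "run once to update everything" workflow described above.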

In [None]:
master_df = pd.read_csv('master_stock_dataset.csv')
master_df.sort_values(by='Date', ascending=True).head()

# Data Cleaning & Preprocessing
---

In [54]:
master_df.drop(['Capital Gains', 'Adj Close', 'Stock Splits', 'Dividends'], axis=1, inplace=True)

master_df['Date'] = pd.to_datetime(master_df['Date'], utc=True)
master_df['Date'] = master_df['Date'].dt.date

master_df.set_index('Date', inplace=True)
master_df.sort_index(inplace=True)
master_df.head(20)

headers = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Ticker']

## Additional Technical Indicators


### Log Returns

In [55]:
try:
    os.remove('modified_dataset.csv')
except FileNotFoundError:
    # Nothing to delete on the first run
    pass

for ticker in tqdm(master_df['Ticker'].unique(), desc='Calculating Log Return'):
    subdf = master_df[master_df['Ticker'] == ticker].copy()

    # Calculate the log return for each entry (the first occurrence has no prior close, so it is NaN)
    subdf['Log Return'] = np.log(subdf['Close'] / subdf['Close'].shift(1))
    subdf.to_csv('modified_dataset.csv', mode='a', header=False, index=True)


invalid value encountered in log


invalid value encountered in log
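
The `invalid value encountered in log` warnings above are what NumPy emits when `np.log` receives a negative argument, which suggests that at least one ratio of consecutive closes in the dataset was negative. A possible alternative to the per-ticker loop is sketched below, assuming a long-format frame with `Ticker` and `Close` columns that is already sorted by date within each ticker. It computes the same log return in one pass and masks non-positive prices first. The function name `add_log_returns` and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

def add_log_returns(df, price_col="Close", group_col="Ticker"):
    """Add a per-ticker log return column, treating non-positive prices as missing."""
    out = df.copy()
    price = out[price_col].where(out[price_col] > 0)   # non-positive prices become NaN
    prev = price.groupby(out[group_col]).shift(1)       # previous close within each ticker
    out["Log Return"] = np.log(price / prev)            # NaN in, NaN out, no warning
    return out

# Toy data with one bad (negative) price for the hypothetical ticker BBB
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Close": [100.0, 110.0, 121.0, 50.0, -1.0, 55.0],
})
result = add_log_returns(toy)
print(result)
```

Rows that touch a bad price end up as `NaN` and can be dropped or inspected later, instead of silently producing invalid values.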



In [56]:
mod_df = pd.read_csv('modified_dataset.csv')
headers.append('Log Return')
mod_df.columns = headers
mod_df.set_index('Date', inplace=True)
mod_df.sort_index(inplace=True)
mod_df[mod_df['Ticker'] == 'NVDA'].head(3000)

Date,Open,High,Low,Close,Volume,Ticker,Log Return
2015-02-09,0.486744,0.540026,0.483624,0.535466,1.807968e+09,NVDA,0.089500
2015-02-16,0.534506,0.539786,0.528026,0.536186,7.014720e+08,NVDA,0.001344
2015-02-23,0.535946,0.535946,0.527066,0.529466,7.085400e+08,NVDA,-0.012613
2015-03-02,0.530059,0.551743,0.529336,0.543310,1.269144e+09,NVDA,0.025812
2015-03-09,0.543551,0.561381,0.535841,0.546924,1.777980e+09,NVDA,0.006630
...,...,...,...,...,...,...,...
2025-01-06,148.589996,153.130005,134.220001,135.910004,1.052112e+09,NVDA,-0.061079
2025-01-13,129.990005,138.750000,129.509995,137.710007,9.960411e+08,NVDA,0.013157
2025-01-20,139.160004,148.970001,137.089996,142.619995,8.259735e+08,NVDA,0.035034
2025-01-27,124.800003,129.000000,116.250000,120.070000,2.648916e+09,NVDA,-0.172109


### Simple Moving Average

In [57]:
try:
    os.remove('modified_dataset.csv')
except FileNotFoundError:
    pass

for ticker in tqdm(mod_df['Ticker'].unique(), desc="Calculating Simple Moving Average"):
    subdf = mod_df[mod_df['Ticker'] == ticker].copy()

    # A timeperiod of 10 refers to a 10 week moving average
    subdf['SMA'] = talib.SMA(subdf['Close'], timeperiod=10)
    
    subdf.to_csv('modified_dataset.csv', mode='a', header=False, index=True)

Calculating Simple Moving Average:   0%|          | 0/1986 [00:00<?, ?it/s]

In [58]:
mod_df = pd.read_csv('modified_dataset.csv')
headers.append('SMA')
mod_df.columns = headers
mod_df.set_index('Date', inplace=True)
mod_df.sort_index(inplace=True)
mod_df[mod_df['Ticker'] == 'NVDA'].head(3000)

Date,Open,High,Low,Close,Volume,Ticker,Log Return,SMA
2015-02-09,0.486744,0.540026,0.483624,0.535466,1.807968e+09,NVDA,0.089500,
2015-02-16,0.534506,0.539786,0.528026,0.536186,7.014720e+08,NVDA,0.001344,
2015-02-23,0.535946,0.535946,0.527066,0.529466,7.085400e+08,NVDA,-0.012613,
2015-03-02,0.530059,0.551743,0.529336,0.543310,1.269144e+09,NVDA,0.025812,
2015-03-09,0.543551,0.561381,0.535841,0.546924,1.777980e+09,NVDA,0.006630,
...,...,...,...,...,...,...,...,...
2025-01-06,148.589996,153.130005,134.220001,135.910004,1.052112e+09,NVDA,-0.061079,139.854092
2025-01-13,129.990005,138.750000,129.509995,137.710007,9.960411e+08,NVDA,0.013157,138.863110
2025-01-20,139.160004,148.970001,137.089996,142.619995,8.259735e+08,NVDA,0.035034,138.928088
2025-01-27,124.800003,129.000000,116.250000,120.070000,2.648916e+09,NVDA,-0.172109,136.741067


### Weighted Moving Average

In [59]:
try:
    os.remove('modified_dataset.csv')
except FileNotFoundError:
    pass

for ticker in tqdm(mod_df['Ticker'].unique(), desc="Calculating Weighted Moving Average"):
    subdf = mod_df[mod_df['Ticker'] == ticker].copy()

    # A timeperiod of 10 refers to a 10 week moving average
    subdf['WMA'] = talib.WMA(subdf['Close'], timeperiod=10)
    
    subdf.to_csv('modified_dataset.csv', mode='a', header=False, index=True)

Calculating Weighted Moving Average:   0%|          | 0/1986 [00:00<?, ?it/s]

In [60]:
mod_df = pd.read_csv('modified_dataset.csv')
headers.append('WMA')
mod_df.columns = headers
mod_df.set_index('Date', inplace=True)
mod_df.sort_index(inplace=True)
mod_df[mod_df['Ticker'] == 'NVDA'].head(3000)

Date,Open,High,Low,Close,Volume,Ticker,Log Return,SMA,WMA
2015-02-09,0.486744,0.540026,0.483624,0.535466,1.807968e+09,NVDA,0.089500,,
2015-02-16,0.534506,0.539786,0.528026,0.536186,7.014720e+08,NVDA,0.001344,,
2015-02-23,0.535946,0.535946,0.527066,0.529466,7.085400e+08,NVDA,-0.012613,,
2015-03-02,0.530059,0.551743,0.529336,0.543310,1.269144e+09,NVDA,0.025812,,
2015-03-09,0.543551,0.561381,0.535841,0.546924,1.777980e+09,NVDA,0.006630,,
...,...,...,...,...,...,...,...,...,...
2025-01-06,148.589996,153.130005,134.220001,135.910004,1.052112e+09,NVDA,-0.061079,139.854092,138.660068
2025-01-13,129.990005,138.750000,129.509995,137.710007,9.960411e+08,NVDA,0.013157,138.863110,138.270235
2025-01-20,139.160004,148.970001,137.089996,142.619995,8.259735e+08,NVDA,0.035034,138.928088,138.953304
2025-01-27,124.800003,129.000000,116.250000,120.070000,2.648916e+09,NVDA,-0.172109,136.741067,135.524561


### Bollinger Bands

In [61]:
try:
    os.remove('modified_dataset.csv')
except FileNotFoundError:
    pass

for ticker in tqdm(mod_df['Ticker'].unique(), desc="Calculating Bollinger Bands"):
    subdf = mod_df[mod_df['Ticker'] == ticker].copy()

    upper_band, sma, lower_band = talib.BBANDS(subdf['Close'], matype=MA_Type.SMA, timeperiod=10)

    subdf['BB_Upper'] = upper_band
    subdf['BB_Lower'] = lower_band
    
    subdf.to_csv('modified_dataset.csv', mode='a', header=False, index=True)

Calculating Bollinger Bands:   0%|          | 0/1986 [00:00<?, ?it/s]

In [62]:
mod_df = pd.read_csv('modified_dataset.csv')
headers.append('BB_Upper')
headers.append('BB_Lower')
mod_df.columns = headers
mod_df.set_index('Date', inplace=True)
mod_df.sort_index(inplace=True)
mod_df[mod_df['Ticker'] == 'NVDA'].head(3000)

Date,Open,High,Low,Close,Volume,Ticker,Log Return,SMA,WMA,BB_Upper,BB_Lower
2015-02-09,0.486744,0.540026,0.483624,0.535466,1.807968e+09,NVDA,0.089500,,,,
2015-02-16,0.534506,0.539786,0.528026,0.536186,7.014720e+08,NVDA,0.001344,,,,
2015-02-23,0.535946,0.535946,0.527066,0.529466,7.085400e+08,NVDA,-0.012613,,,,
2015-03-02,0.530059,0.551743,0.529336,0.543310,1.269144e+09,NVDA,0.025812,,,,
2015-03-09,0.543551,0.561381,0.535841,0.546924,1.777980e+09,NVDA,0.006630,,,,
...,...,...,...,...,...,...,...,...,...,...,...
2025-01-06,148.589996,153.130005,134.220001,135.910004,1.052112e+09,NVDA,-0.061079,139.854092,138.660068,148.369924,131.338261
2025-01-13,129.990005,138.750000,129.509995,137.710007,9.960411e+08,NVDA,0.013157,138.863110,138.270235,145.668063,132.058157
2025-01-20,139.160004,148.970001,137.089996,142.619995,8.259735e+08,NVDA,0.035034,138.928088,138.953304,145.861668,131.994508
2025-01-27,124.800003,129.000000,116.250000,120.070000,2.648916e+09,NVDA,-0.172109,136.741067,135.524561,149.685723,123.796411


### Money Flow Index

In [63]:
try:
    os.remove('modified_dataset.csv')
except FileNotFoundError:
    pass

for ticker in tqdm(mod_df['Ticker'].unique(), desc="Calculating Money Flow Index"):
    subdf = mod_df[mod_df['Ticker'] == ticker].copy()

    subdf['MFI'] = talib.MFI(subdf['High'], subdf['Low'], subdf['Close'], subdf['Volume'], timeperiod=10)

    subdf.to_csv('modified_dataset.csv', mode='a', header=False, index=True)

Calculating Money Flow Index:   0%|          | 0/1986 [00:00<?, ?it/s]

In [64]:
mod_df = pd.read_csv('modified_dataset.csv')
headers.append('MFI')
mod_df.columns = headers
mod_df.set_index('Date', inplace=True)
mod_df.sort_index(inplace=True)
mod_df[mod_df['Ticker'] == 'NVDA'].head(15)

Date,Open,High,Low,Close,Volume,Ticker,Log Return,SMA,WMA,BB_Upper,BB_Lower,MFI
2015-02-09,0.486744,0.540026,0.483624,0.535466,1807968000.0,NVDA,0.0895,,,,,
2015-02-16,0.534506,0.539786,0.528026,0.536186,701472000.0,NVDA,0.001344,,,,,
2015-02-23,0.535946,0.535946,0.527066,0.529466,708540000.0,NVDA,-0.012613,,,,,
2015-03-02,0.530059,0.551743,0.529336,0.54331,1269144000.0,NVDA,0.025812,,,,,
2015-03-09,0.543551,0.561381,0.535841,0.546924,1777980000.0,NVDA,0.00663,,,,,
2015-03-16,0.551261,0.56885,0.545961,0.565476,1723540000.0,NVDA,0.033358,,,,,
2015-03-23,0.548611,0.555839,0.492955,0.515121,1929252000.0,NVDA,-0.093267,,,,,
2015-03-30,0.51753,0.519939,0.497532,0.507411,1133500000.0,NVDA,-0.015081,,,,,
2015-04-06,0.505242,0.548611,0.499942,0.54837,1934420000.0,NVDA,0.077629,,,,,
2015-04-13,0.547165,0.550297,0.5303,0.534878,1149480000.0,NVDA,-0.024912,0.536261,0.535385,0.56783,0.504692,


### Aroon Indicators

In [65]:
try:
    os.remove('modified_dataset.csv')
except FileNotFoundError:
    pass

for ticker in tqdm(mod_df['Ticker'].unique(), desc="Calculating AROON Indicators"):
    subdf = mod_df[mod_df['Ticker'] == ticker].copy()

    a_down, a_up = talib.AROON(subdf['High'], subdf['Low'], timeperiod=10)
    a_real = talib.AROONOSC(subdf['High'], subdf['Low'], timeperiod=10)
    
    subdf['Aroon Up'] = a_up
    subdf['Aroon Down'] = a_down
    subdf['Aroon Real'] = a_real

    subdf.to_csv('modified_dataset.csv', mode='a', header=False, index=True)

Calculating AROON Indicators:   0%|          | 0/1986 [00:00<?, ?it/s]

In [66]:
mod_df = pd.read_csv('modified_dataset.csv')
headers.append('Aroon Up')
headers.append('Aroon Down')
headers.append('Aroon Real')
mod_df.columns = headers
mod_df.set_index('Date', inplace=True)
mod_df.sort_index(inplace=True)
mod_df[mod_df['Ticker'] == 'NVDA'].head(15)
mod_df.to_csv('modified_dataset.csv', index=True)

# Basic Data Characteristics
---

With a dataset containing nearly 2000 companies, each with a diverse set of financials, it is not practical to explore all of them. Instead, five hand-picked stocks of established companies from *different* industries (to explore different trends) will be analyzed at first, in conjunction with the technical indicators calculated above. The stocks that will be analyzed are Nvidia (NVDA), Disney (DIS), ExxonMobil (XOM), Bank of America (BAC), and Walmart (WMT).

### Time Series Chart

In [67]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

In [68]:
subdf = mod_df[mod_df['Ticker'].isin(['NVDA', 'DIS', 'XOM', 'BAC', 'WMT'])]

figure1 = px.line(subdf, x=subdf.index, y='Close', color='Ticker', title='Stock Price from 2015 to 2025')
figure1.update_xaxes(rangeslider_visible=True,
                    rangeselector=dict(
                        buttons=list([
                            dict(count=1, label="1m", step="month", stepmode="backward"),
                            dict(count=6, label="6m", step="month", stepmode="backward"),
                            dict(count=1, label="YTD", step="year", stepmode="todate"),
                            dict(count=1, label="1y", step="year", stepmode="backward"),
                            dict(step="all")
                        ])
                    ))
figure1.show()

In [69]:
nvdaDF = subdf[subdf['Ticker']=='NVDA']
wmtDF = subdf[subdf['Ticker']=='WMT']
xomDF = subdf[subdf['Ticker']=='XOM']
bacDF = subdf[subdf['Ticker']=='BAC']
disDF = subdf[subdf['Ticker']=='DIS']

### Scatter Plots

In [70]:
figure2 = px.scatter(subdf, x='Volume', y='Close', color='Ticker')
figure2.update_layout(title='Volume vs. Close')
figure2.show()

In [71]:
figure3 = px.scatter(subdf, x='Volume', y='Log Return', color='Ticker')
figure3.update_layout(title='Volume vs. Log Return')
figure3.show()

In [72]:
figure4 = px.scatter(subdf, x=subdf.index, y='Log Return', color='Ticker')
figure4.update_layout(title='Log Return vs. Date')
figure4.show()

### Technical Indicators

In [73]:
figure5 = make_subplots(1, 1, shared_xaxes=True, x_title='Date', y_title='USD')
figure5.add_trace(
    go.Scatter(x=nvdaDF.index, y=nvdaDF['BB_Upper'], mode='lines', name='Upper Band')
)
figure5.add_trace(
    go.Scatter(x=nvdaDF.index, y=nvdaDF['BB_Lower'], mode='lines', name='Lower Band')
)
figure5.add_trace(
    go.Scatter(x=nvdaDF.index, y=nvdaDF['SMA'], mode='lines', name='Moving Average')
)
figure5.update_layout(title='NVIDIA Stock Bollinger Bands')

figure5.show()

In [74]:
figure6 = make_subplots(2, 1, shared_xaxes=True, x_title='Date', subplot_titles=['Disney Stock Price and Aroon Indicators'])

figure6.add_trace(
    go.Scatter(x=disDF.index, y=disDF['Close'], mode='lines', name='DISNEY'), 1, 1
)
figure6.add_trace(
    go.Scatter(x=disDF.index, y=disDF['Aroon Down'], mode='lines', name='Aroon Down'), 2, 1
)
figure6.add_trace(
    go.Scatter(x=disDF.index, y=disDF['Aroon Up'], mode='lines', name='Aroon Up'), 2, 1
)
figure6.update_yaxes(title_text="Stock Price ($USD)", row=1, col=1)

figure6.show()

In [75]:
figure7 = make_subplots(1, 1, shared_xaxes=True, x_title='Date', subplot_titles=['ExxonMobil Price with Simple and Weighted Moving Averages'])

figure7.add_trace(
    go.Scatter(x=xomDF.index, y=xomDF['Close'], mode='lines', name='Close Price'), 1, 1
)

figure7.add_trace(
    go.Scatter(x=xomDF.index, y=xomDF['SMA'], mode='lines', name='Simple Moving Average'), 1, 1
)

figure7.add_trace(
    go.Scatter(x=xomDF.index, y=xomDF['WMA'], mode='lines', name='Weighted Moving Average'), 1, 1
)

figure7.show()

In [76]:
figure8 = make_subplots(2, 1, shared_xaxes=True, x_title='Date', subplot_titles=['BofA Stock Price and MFI'])

figure8.add_trace(
    go.Scatter(x=bacDF.index, y=bacDF['Close'], mode='lines', name='BofA'), 1, 1
)

figure8.add_trace(
    go.Scatter(x=bacDF.index, y=bacDF['MFI'], mode='lines', name='Money Flow Index'), 2, 1
)

figure8.add_trace(
    go.Scatter(x=bacDF.index, y=np.full(len(bacDF), 90), mode='lines', name='Top Threshold'), 2, 1
)

figure8.add_trace(
    go.Scatter(x=bacDF.index, y=np.full(len(bacDF), 10) , mode='lines', name='Bottom Threshold'), 2, 1
)

figure8.update_yaxes(title_text="Stock Price ($USD)", row=1, col=1)

figure8.show()

These charts give some insight into a few of the many stocks in the dataset. The first time series chart simply shows how the price has changed from the beginning to the end of the dataset. I used the patterns of some stocks (such as Disney's large drop) to decide which graphs to create later. One interesting observation was that certain stocks, such as Disney and ExxonMobil, had a pretty significant dip during the COVID-19 pandemic, but the others did not. It would be interesting to investigate other stocks for this trend as well. This particular one can be explained by the drop in travel (Disney) and automobile usage (gas, hence ExxonMobil) during the pandemic, while banks, tech, and retail remained relatively stable.

I created a couple of scatter plots to see if I could find any surface-level correlations between variables. Unsurprisingly, no real visual correlation was found. It's important to note that in the chart of Log Return vs. Date, the data appears almost evenly distributed over time. This shows that there is no real trend in stock returns (at least for the companies I selected) over time, which hints that return prediction is an extremely difficult task. One other thing I found in the scatter plots was how much more NVIDIA volume was traded compared to the other stocks I selected. Disney had the highest stock price, but a relatively low amount of volume traded.

The next section plotted a few of the technical indicators for individual companies. Bollinger Bands help show volatility: the wider the bands are, the more volatile the stock. Next are the Aroon indicators, which measure the time since a high price (Aroon Up) and the time since a low price (Aroon Down). They can help identify signals and entry points. When the two indicators cross over, it indicates a change in the trend of the stock. An Aroon Up above 50 and an Aroon Down below 50 indicate a bullish trend, and the opposite indicates a bearish trend. Next up are some moving averages. The simple moving average just averages the price over a given time period. The weighted moving average does the same, but gives recent prices more emphasis. Both can be used to identify trends in the market by smoothing out prices. Finally, the Money Flow Index (MFI) shows when a stock is overbought (above 80-90) or oversold (below 10-20). It can help indicate market entry points.
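
To make the moving averages and Bollinger Bands concrete, the sketch below recomputes them with plain pandas for the first ten weekly NVDA closes shown in the Money Flow Index output. It assumes a band width of two standard deviations and a population standard deviation (`ddof=0`), which is the combination that reproduces the TA-Lib values in that table. It is an illustration of the arithmetic, not a replacement for the TA-Lib calls used in the notebook.

```python
import numpy as np
import pandas as pd

# First ten weekly NVDA closes (2015-02-09 through 2015-04-13) from the output above
close = pd.Series([0.535466, 0.536186, 0.529466, 0.543310, 0.546924,
                   0.565476, 0.515121, 0.507411, 0.548370, 0.534878])
period = 10

# Simple moving average: equal weight on each of the last `period` closes
sma = close.rolling(period).mean()

# Weighted moving average: linear weights 1..period, newest close weighted most
weights = np.arange(1, period + 1)
wma = close.rolling(period).apply(lambda w: np.dot(w, weights) / weights.sum(), raw=True)

# Bollinger Bands: SMA plus/minus two population standard deviations
std = close.rolling(period).std(ddof=0)
bb_upper = sma + 2 * std
bb_lower = sma - 2 * std

print(round(sma.iloc[-1], 6), round(wma.iloc[-1], 6))
print(round(bb_upper.iloc[-1], 6), round(bb_lower.iloc[-1], 6))
```

For this window the results agree, to within rounding, with the 2015-04-13 row of the NVDA output (SMA 0.536261, WMA 0.535385, bands 0.56783 and 0.504692). The first nine rows are `NaN` because a full ten-week window is not yet available.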

# Basic Machine Learning Analysis
---

This section will examine some basic machine learning techniques such as Random Forest and Linear Regression. Based on the results of this section, the dataset may be further modified to improve performance (i.e., changing the Y-variable from log returns to a binary classification of positive or negative return). The model will 

## Data Preparation

In [86]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

In [94]:
# Encode each ticker as an integer ID (label encoding) so company identity is kept as a feature
def add_company_embeddings(dataset):
    X = dataset.copy()
    
    le = LabelEncoder()
    X['Company_ID'] = le.fit_transform(X['Ticker'])
    
    X = X.drop('Ticker', axis=1)
    
    return X

In [124]:
# Split the dataset into training and testing sets using TimeSeriesSplit
def create_time_series_splits(dataset, n_splits=5):
    dataset.sort_values(by='Date', ascending=True, inplace=True)
    dataset = add_company_embeddings(dataset)
    dataset.drop(['Date'], axis=1, inplace=True)

    tscv = TimeSeriesSplit(n_splits=n_splits)

    splits = []
    for train_idx, test_idx in tscv.split(dataset):
        train_data = dataset.iloc[train_idx]
        test_data = dataset.iloc[test_idx]
        splits.append((train_data, test_data))

    return splits

In [126]:
dataset = pd.read_csv('modified_dataset.csv')
dataset.dropna(inplace=True)

dataset.head() 

splits = create_time_series_splits(dataset)

X = dataset.drop(['Log Return', 'Date'], axis=1)
y = dataset['Log Return']

print(splits)

[(             Open       High        Low      Close     Volume  Log Return  \
19046   25.424533  25.983572  25.101705  25.322172   173100.0   -0.004653   
20325    4.364500   4.395000   3.800000   3.975000   164108.0   -0.109503   
20324    3.919735   3.923176   3.468913   3.534300  1593187.0    0.000974   
20323   12.644451  12.680960  12.425393  12.522752  9151000.0   -0.010634   
20322   93.804106  95.428384  91.943426  92.126442  5760200.0   -0.029203   
...           ...        ...        ...        ...        ...         ...   
183848  35.453049  36.869850  35.354202  36.375618  1709130.0    0.026619   
183789  46.777038  47.269428  46.777038  47.269428     3300.0    0.010471   
183787  40.420385  43.508983  40.017523  42.927074  1223800.0    0.067951   
183756   1.570000   1.690000   1.550000   1.570000    83000.0   -0.012658   
183755   4.700000   5.100000   4.550000   5.050000   887700.0    0.071826   

              SMA        WMA   BB_Upper   BB_Lower        MFI  Aroon Up  

In [133]:
def train_evaluate_model(model, splits):
    """Fit `model` on each time-series fold and collect out-of-sample metrics."""
    results = []

    for fold, (train_data, test_data) in enumerate(splits):
        # Separate the features from the target for this fold
        X_train = train_data.drop(['Log Return'], axis=1)
        y_train = train_data['Log Return']
        X_test = test_data.drop(['Log Return'], axis=1)
        y_test = test_data['Log Return']

        # Fit the scaler on the training fold only, then apply it to the test fold
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)

        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        # Keep only the metrics so the results stay small and printable
        results.append({
            'fold': fold,
            'mse': mse,
            'r2': r2
        })

    return results

In [128]:
linear_model = LinearRegression()
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)

# train_evaluate_model takes only the model and the precomputed splits
linear_results = train_evaluate_model(linear_model, splits)
random_forest_results = train_evaluate_model(random_forest_model, splits)

In [131]:
print(linear_results)
avg_mse = sum([result['mse'] for result in linear_results]) / len(linear_results)
avg_r2 = sum([result['r2'] for result in linear_results]) / len(linear_results)
print(f"Average Linear Regression MSE: {avg_mse}")
print(f"Average Linear Regression R2: {avg_r2}")

[{'fold': 0, 'mse': 0.012579180500415617, 'r2': -0.01116467439655322}, {'fold': 1, 'mse': 0.01215065944419776, 'r2': 0.012643252661893145}, {'fold': 2, 'mse': 0.013860193842440685, 'r2': 0.010609898654502126}, {'fold': 3, 'mse': 0.008939176628481075, 'r2': 0.009798591939795598}, {'fold': 4, 'mse': 0.013448188396385423, 'r2': 0.012519889441254928}]
Average Linear Regression MSE: 0.012195479762384113
Average Linear Regression R2: 0.006881391660178515


In [132]:
print(random_forest_results)
avg_mse = sum([result['mse'] for result in random_forest_results]) / len(random_forest_results)
avg_r2 = sum([result['r2'] for result in random_forest_results]) / len(random_forest_results)
print(f"Average Random Forest MSE: {avg_mse}")
print(f"Average Random Forest R2: {avg_r2}")


[{'fold': 0, 'mse': 0.00815039373772071, 'r2': 0.3448388605494793}, {'fold': 1, 'mse': 0.0071814533024290896, 'r2': 0.4164385557499223}, {'fold': 2, 'mse': 0.007285122334435854, 'r2': 0.47996196830154936}, {'fold': 3, 'mse': 0.004103743504889685, 'r2': 0.5454242861794587}, {'fold': 4, 'mse': 0.006747354176119099, 'r2': 0.5045519997619952}]
Average Random Forest MSE: 0.006693613411118887
Average Random Forest R2: 0.458243134108481


# Surprises
---

Based on this very preliminary analysis, the Random Forest model above performed considerably better than I expected, while the Linear Regression model performed quite poorly. Given that the prediction task involved log returns, which are numerically small in this dataset with weekly intervals, an MSE of <= 0.01 would be a good indicator of model performance. The Random Forest model finished with an average MSE of about 0.0067 compared to about 0.0122 for Linear Regression. Looking at R-squared, about 0.7% of the variance in the data was explained by the Linear Regression model, whereas about 46% of the variance was explained by the Random Forest model. Given the volatility and difficulty of predicting stock data, I believe the Random Forest model performed very well for a simple model.
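
Because the research question is phrased in terms of RMSE while the results above report MSE, a quick conversion puts the two on the same scale. The sketch below takes the square root of the average MSE values printed earlier. Note that this is not identical to averaging per-fold RMSEs, and the 2% figure is simply the goal stated in the research question.

```python
import math

# Average MSE values copied from the printed results above
avg_mse = {
    "Linear Regression": 0.012195479762384113,
    "Random Forest": 0.006693613411118887,
}
target_rmse = 0.02  # the 2% goal from the research question

for name, mse in avg_mse.items():
    rmse = math.sqrt(mse)
    print(f"{name}: RMSE ~ {rmse:.4f} ({rmse / target_rmse:.1f}x the target)")

# MSE that would correspond to the 2% RMSE target
target_mse = target_rmse ** 2
print(f"Target MSE: {target_mse:.4f}")
```

On this scale the Random Forest error is roughly 0.08 in log-return units and the Linear Regression error roughly 0.11, against a target of 0.02 (an MSE of 0.0004).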

One of my greatest concerns for this analysis is the risk of overfitting the model, such that it performs well on the given data, but poorly in a real world environment. This will be something to look out for as additional models are evaluated.
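
One check that bears on this concern: the research question targets the *following* week's return, while each row's `Log Return` is computed from that same row's `Close`, which is also among the features passed to the models. A possible way to pair each week's features with a forward-looking target is sketched below, assuming rows are sorted by date within each ticker. The column name `Next Log Return` and the helper `add_next_week_target` are hypothetical.

```python
import pandas as pd

def add_next_week_target(df, group_col="Ticker", return_col="Log Return"):
    """Attach the following week's log return to each row, per ticker."""
    out = df.copy()
    # shift(-1) within each ticker pulls next week's return onto the current row
    out["Next Log Return"] = out.groupby(group_col)[return_col].shift(-1)
    # The last week of each ticker has no following week, so it is dropped
    return out.dropna(subset=["Next Log Return"])

# Toy data for two hypothetical tickers
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Log Return": [0.01, 0.02, -0.03, 0.05, -0.01],
})
shifted = add_next_week_target(toy)
print(shifted)
```

With a target built this way, the current week's `Log Return` becomes a legitimate feature, and the model is evaluated on information that would actually be available at prediction time.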

# Next Steps
---

Time was one of the biggest constraints in this analysis. I underestimated the amount of time it would take to finish training and evaluating the two simple models above on my laptop (the final iteration had a runtime of ~99 minutes). There are a few immediate next steps in this analysis, along with long-term goals. 

The initial next steps are:
1. Create a baseline model (random walk?) to compare results against
2. Create visualizations of the model predictions versus actual log returns

Long term goals include:
1. Adding additional features such as 'Month' or 'Quarter' to extract potential seasonal patterns in the data the current models may have missed
2. Adding additional models into the analysis to examine their performance (including advanced methods like ARIMA)
3. Exploring different Y-variables, such as converting the problem into a classification task and using 'Positive' and 'Negative' labels to predict whether the stock price went up or down relative to the previous week
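
For the first of the initial next steps, the benchmark described in the problem statement (last week's return used as the forecast for this week) could be sketched as follows. It assumes a long-format frame sorted by date within each ticker; the function name `persistence_baseline_rmse` and the toy returns are illustrative only.

```python
import numpy as np
import pandas as pd

def persistence_baseline_rmse(df, group_col="Ticker", return_col="Log Return"):
    """RMSE of predicting each week's return with the previous week's return."""
    prev = df.groupby(group_col)[return_col].shift(1)   # last week's return per ticker
    valid = prev.notna() & df[return_col].notna()       # first week of each ticker has no forecast
    errors = df.loc[valid, return_col] - prev[valid]
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy weekly log returns for two hypothetical tickers
toy = pd.DataFrame({
    "Ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "Log Return": [0.01, 0.03, -0.02, 0.00, 0.02, 0.02, -0.01, 0.01],
})
baseline_rmse = persistence_baseline_rmse(toy)
print(f"Persistence baseline RMSE: {baseline_rmse:.4f}")
```

Running the same function on each test fold would give a reference error that any trained model should beat before its results are considered meaningful.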