# Stock Market Dataset for Predictive Modeling

In [16]:
# Let's load the dataset and take a look at the first few rows
import pandas as pd

data = pd.read_csv('/content/data.csv')
data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Symbol,Sector
0,1980-01-01,6.28125,6.296875,5.859375,5.96875,1.483648,5368000.0,MMM,Industrials
1,1980-01-08,5.96875,6.3125,5.953125,6.140625,1.526371,7116000.0,MMM,Industrials
2,1980-01-15,6.0625,6.125,5.828125,5.90625,1.468114,5983200.0,MMM,Industrials
3,1980-01-22,5.90625,6.296875,5.890625,6.25,1.553558,7107200.0,MMM,Industrials
4,1980-01-29,6.25,6.296875,6.140625,6.25,1.553558,5159200.0,MMM,Industrials


## 1. Dataset description:
This dataset was constructed to capture weekly stock market data across various sectors. It covers a long timespan, starting in 1980, and includes several key pieces of market information. The columns in the dataset represent the following:

- `Date`: The date at which the record was taken. It appears to be a weekly frequency, likely at the end of the trading week.
- `Open`: The price at which the stock opened at the beginning of the trading week.
- `High`: The highest price the stock reached during the trading week.
- `Low`: The lowest price the stock reached during the trading week.
- `Close`: The price at which the stock closed at the end of the trading week.
- `Adj Close`: The adjusted closing price for the stock at the end of the trading week. This price takes into account factors such as dividends, stock splits, and new stock offerings.
- `Volume`: The number of shares that were traded during the trading week.
- `Symbol`: The ticker symbol of the stock.
- `Sector`: The sector to which the company belongs.

This dataset likely sourced the data from a financial data provider or exchange, pulling weekly data for each ticker symbol in the database. The data is cleaned and adjusted for corporate actions such as dividends and stock splits to ensure that it reflects the true economic value of holding the stock.

The presence of different sectors and possibly different types of assets (like bonds, oil, VIX, etc.) suggests that the dataset is meant to capture a broad cross-section of the market. This is useful for analyses that aim to understand market dynamics or build predictive models, as it allows for the examination of how different sectors and asset classes interact with each other.

The dataset would be useful for a variety of purposes. For example, it could be used to examine sector performance over time, to study market volatility, or to train machine learning models to predict future price movements. The inclusion of various sectors and asset classes, including potentially different indicators like bonds, oil, and VIX, provides a rich set of data that can be used to create sophisticated models that take into account a wide array of market factors.

Next, let's perform some exploratory data analysis on this dataset. We'll look at the distribution of data across different sectors, analyze correlations among variables, and check for any missing data. Let's also see how many unique tickers there are in the dataset and whether any of them represent indicators like bonds, oil, VIX, etc.

In [17]:
# Let's perform some basic data explorations

# Check the total number of unique symbols
num_unique_symbols = data['Symbol'].nunique()

# Check the distribution of data across sectors
sector_distribution = data['Sector'].value_counts()

# Check for any missing data
missing_data = data.isnull().sum()

# Check if any tickers represent indicators like bonds, oil, VIX etc.
indicator_symbols = ['BOND', 'OIL', 'VIX']
indicator_presence = {indicator: indicator in data['Symbol'].unique() for indicator in indicator_symbols}

num_unique_symbols, sector_distribution, missing_data, indicator_presence

(130,
 Financials                36551
 Health Care               33345
 Information Technology    30572
 Industrials               24240
 Consumer Staples          21700
 Utilities                 17216
 Consumer Discretionary    16388
 Energy                    10345
 Materials                  8761
 Communication Services     8555
 Real Estate                7873
 Name: Sector, dtype: int64,
 Date         0
 Open         0
 High         0
 Low          0
 Close        0
 Adj Close    1
 Volume       1
 Symbol       1
 Sector       1
 dtype: int64,
 {'BOND': False, 'OIL': False, 'VIX': False})

Here are the results of our exploratory data analysis:

- There are 130 unique ticker symbols in the dataset.

- The distribution of data across sectors is as follows:
  - Financials: 36,551 records
  - Health Care: 33,345 records
  - Information Technology: 30,572 records
  - Industrials: 24,240 records
  - Consumer Staples: 21,700 records
  - Utilities: 17,216 records
  - Consumer Discretionary: 16,388 records
  - Energy: 10,345 records
  - Materials: 8,761 records
  - Communication Services: 8,555 records
  - Real Estate: 7,873 records

- The data is almost complete: `Adj Close`, `Volume`, `Symbol`, and `Sector` each have one missing value, and the other columns have none.

- In terms of specific indicators like bonds (`BOND`), oil (`OIL`), and volatility index (`VIX`), none of these symbols are present in the dataset.
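
The per-column counts from `isnull().sum()` do not show whether the gaps sit in the same row. A minimal sketch of how the affected rows could be listed, using a small made-up frame (the column values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; all values are invented for illustration
toy = pd.DataFrame({
    "Date": ["2020-01-06", "2020-01-13", "2020-01-20"],
    "Close": [10.0, 10.5, 11.0],
    "Volume": [1000.0, np.nan, 1200.0],
    "Symbol": ["AAA", None, "AAA"],
})

# Keep only the rows that have at least one missing value
rows_with_gaps = toy[toy.isnull().any(axis=1)]
print(rows_with_gaps)

# Drop those rows if they cannot be repaired
toy_clean = toy.dropna()
print(len(toy_clean))
```

If the missing entries all fall in a single row, dropping that row is usually simpler than filling it.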

Now that we have a better understanding of our data, we can proceed with further analysis. For example, we can calculate the correlations among the numerical variables in the dataset to gain insights into how different factors might be related. After that, we'll move on to the model training phase, where we'll use an LSTM or transformer-based model to predict future price data. The absence of specific market indicators like bonds, oil, and VIX in the dataset means our model won't explicitly take into account these factors. However, the wide range of sectors covered by the dataset should still provide a comprehensive view of market conditions.

In [18]:
# Let's load the VIX data and take a look at the first few rows
vix_data = pd.read_csv('/content/^VIX.csv')
vix_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,1990-01-02,17.24,17.24,17.24,17.24,17.24,0.0
1,1990-01-03,18.190001,18.190001,18.190001,18.190001,18.190001,0.0
2,1990-01-04,19.219999,19.219999,19.219999,19.219999,19.219999,0.0
3,1990-01-05,20.110001,20.110001,20.110001,20.110001,20.110001,0.0
4,1990-01-08,20.26,20.26,20.26,20.26,20.26,0.0


In [19]:
# Let's load the Bond data and take a look at the first few rows
bond_data = pd.read_csv('/content/^TNX.csv')
bond_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,1962-01-02,4.06,4.06,4.06,4.06,4.06,0.0
1,1962-01-03,4.03,4.03,4.03,4.03,4.03,0.0
2,1962-01-04,3.99,3.99,3.99,3.99,3.99,0.0
3,1962-01-05,4.02,4.02,4.02,4.02,4.02,0.0
4,1962-01-07,,,,,,


# Increasing the dataset
To increase the potential predictive validity of the model, I will download the 10-year bond yield, VIX, oil prices, and other such metrics.
- I could only find weekly data for the following assets:
  - 10-year bond index, ^TNX.
  - VIX volatility index, ^VIX.

I will add more assets after the analysis if I still think it's necessary.

In [20]:
# Merge VIX data with the main dataset
merged_data = pd.merge(data, vix_data[['Date', 'Adj Close']], how='left', on='Date')
merged_data = merged_data.rename(columns={'Adj Close_x': 'Adj Close', 'Adj Close_y': 'VIX'})

# Merge bond data with the main dataset
merged_data = pd.merge(merged_data, bond_data[['Date', 'Adj Close']], how='left', on='Date')
merged_data = merged_data.rename(columns={'Adj Close': 'Bond_Yield'})

# Check the first few rows of the merged dataset
merged_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close_x,Volume,Symbol,Sector,VIX,Adj Close_y
0,1980-01-01,6.28125,6.296875,5.859375,5.96875,1.483648,5368000.0,MMM,Industrials,,
1,1980-01-08,5.96875,6.3125,5.953125,6.140625,1.526371,7116000.0,MMM,Industrials,,10.57
2,1980-01-15,6.0625,6.125,5.828125,5.90625,1.468114,5983200.0,MMM,Industrials,,10.65
3,1980-01-22,5.90625,6.296875,5.890625,6.25,1.553558,7107200.0,MMM,Industrials,,10.85
4,1980-01-29,6.25,6.296875,6.140625,6.25,1.553558,5159200.0,MMM,Industrials,,11.21


In [21]:
# Clean up the column names for clarity
merged_data = merged_data.rename(columns={'Adj Close_x': 'Adj Close', 'Adj Close_y': 'Bond_Yield'})

# Check for any missing data in the new columns
missing_data_after_merge = merged_data[['VIX', 'Bond_Yield']].isnull().sum()

# Overview of the data after merge
merged_data.head(), missing_data_after_merge

(         Date     Open      High       Low     Close  Adj Close     Volume  \
 0  1980-01-01  6.28125  6.296875  5.859375  5.968750   1.483648  5368000.0   
 1  1980-01-08  5.96875  6.312500  5.953125  6.140625   1.526371  7116000.0   
 2  1980-01-15  6.06250  6.125000  5.828125  5.906250   1.468114  5983200.0   
 3  1980-01-22  5.90625  6.296875  5.890625  6.250000   1.553558  7107200.0   
 4  1980-01-29  6.25000  6.296875  6.140625  6.250000   1.553558  5159200.0   
 
   Symbol       Sector  VIX  Bond_Yield  
 0    MMM  Industrials  NaN         NaN  
 1    MMM  Industrials  NaN       10.57  
 2    MMM  Industrials  NaN       10.65  
 3    MMM  Industrials  NaN       10.85  
 4    MMM  Industrials  NaN       11.21  ,
 VIX           42462
 Bond_Yield    17253
 dtype: int64)

The merged dataset now includes the VIX and Bond Yield data. Here's the structure of the updated dataset:

- `Date`: The date at which the record was taken.
- `Open`: The price at which the stock opened at the beginning of the trading week.
- `High`: The highest price the stock reached during the trading week.
- `Low`: The lowest price the stock reached during the trading week.
- `Close`: The price at which the stock closed at the end of the trading week.
- `Adj Close`: The adjusted closing price for the stock at the end of the trading week.
- `Volume`: The number of shares that were traded during the trading week.
- `Symbol`: The ticker symbol of the stock.
- `Sector`: The sector to which the company belongs.
- `VIX`: The VIX (Volatility Index) value at the end of the week.
- `Bond_Yield`: The 10-year Treasury bond yield at the end of the week.

The merged data has some missing values for the `VIX` and `Bond_Yield` columns. This is expected due to differences in the date ranges and frequencies of the different datasets. Specifically, there are 42,462 missing values in the `VIX` column and 17,253 missing values in the `Bond_Yield` column.
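
An exact-date merge leaves a gap whenever a stock date has no matching row in the indicator file. One alternative, sketched here on toy data (the frames and values are invented for illustration), is `pd.merge_asof`, which attaches the most recent indicator value at or before each stock date:

```python
import pandas as pd

# Made-up weekly stock rows
stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-07", "2020-01-14", "2020-01-21"]),
    "Symbol": ["AAA", "AAA", "AAA"],
    "Adj Close": [10.0, 10.5, 11.0],
})

# Made-up indicator rows on different dates
vix = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-06", "2020-01-14", "2020-01-17"]),
    "VIX": [13.8, 12.4, 12.1],
})

# Both frames must be sorted by the key column for merge_asof
aligned = pd.merge_asof(
    stocks.sort_values("Date"),
    vix.sort_values("Date"),
    on="Date",
    direction="backward",  # take the latest indicator value at or before each date
)
print(aligned)
```

The `tolerance` argument of `merge_asof` can cap how old a matched value may be. Rows that fall before the indicator series begins still come back as missing.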

To handle these missing values, we can use several strategies:

1. **Forward Fill**: Fill the missing value with the last observed value. This is based on the assumption that the value of the indicator does not change drastically from one week to the next.
2. **Backward Fill**: Fill the missing value with the next observed value. This can be used if the data is missing at the end of the series.
3. **Interpolation**: Estimate the missing value using the values before and after it.

Let's proceed with the forward fill method for simplicity and because it's a common choice in time series analysis.

After handling missing values, we'll calculate the correlations among the numerical variables to understand their relationships.
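
The three strategies listed above behave differently at the edges of a series. A small sketch on a made-up series shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up series with gaps at the start, in the middle, and at the end
s = pd.Series([np.nan, 20.0, np.nan, np.nan, 26.0, np.nan])

forward = s.ffill()             # carries 20.0 and 26.0 forward; the leading gap stays empty
backward = s.bfill()            # carries 20.0 and 26.0 backward; the trailing gap stays empty
interpolated = s.interpolate()  # linear by default: the interior gaps become 22.0 and 24.0

print(pd.DataFrame({"ffill": forward, "bfill": backward, "interpolate": interpolated}))
```

Forward fill only uses values that were already known at each date. Backward fill and interpolation both draw on later observations, which matters when the filled column is later used as a forecasting input.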

In [22]:
# Fill missing values using forward fill method
merged_data_filled = merged_data.ffill()  # same behavior as fillna(method='ffill'), which newer pandas versions deprecate

# Check if there are any remaining missing values
remaining_missing_values = merged_data_filled.isnull().sum()
remaining_missing_values

Date            0
Open            0
High            0
Low             0
Close           0
Adj Close       0
Volume          0
Symbol          0
Sector          0
VIX           522
Bond_Yield      1
dtype: int64

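After applying the forward fill method, there are still some missing values in the `VIX` and `Bond_Yield` columns. Forward fill has nothing to carry forward at the very start of the table: the first rows date from 1980, while the VIX file begins in 1990, and the first row has no bond yield either. Because the fill runs down the whole table rather than per symbol, gaps at the start of later symbols' histories appear to have been filled with the last value from the preceding symbol's rows.

Let's use backward fill to handle these remaining missing values. This carries the next valid observation backwards to fill the gaps, so the filled values come from a later date than the rows they are written into.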

In [23]:
# Fill remaining missing values using backward fill method
merged_data_filled = merged_data_filled.bfill()  # same behavior as fillna(method='bfill'), which newer pandas versions deprecate

# Check if there are any remaining missing values
remaining_missing_values = merged_data_filled.isnull().sum()
remaining_missing_values

Date          0
Open          0
High          0
Low           0
Close         0
Adj Close     0
Volume        0
Symbol        0
Sector        0
VIX           0
Bond_Yield    0
dtype: int64

After applying the backward fill method, there are no remaining missing values in the dataset. This means we've successfully integrated the VIX and Bond Yield data into our main dataset and handled all missing values.

Let's now calculate the correlations among the numerical variables to understand their relationships. This can provide insights into how different variables (like stock prices, VIX, bond yields, etc.) interact with each other, which can be useful for both our exploratory analysis and the subsequent machine learning model training.

In [24]:
# Calculate the correlations among numerical variables
correlations = merged_data_filled[['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume', 'VIX', 'Bond_Yield']].corr()

correlations

Unnamed: 0,Open,High,Low,Close,Adj Close,Volume,VIX,Bond_Yield
Open,1.0,0.999476,0.999369,0.998931,0.982933,-0.048602,0.025922,-0.236641
High,0.999476,1.0,0.99927,0.99953,0.983809,-0.048625,0.028558,-0.234014
Low,0.999369,0.99927,1.0,0.999525,0.983565,-0.049056,0.021213,-0.234424
Close,0.998931,0.99953,0.999525,1.0,0.984168,-0.048849,0.025257,-0.234269
Adj Close,0.982933,0.983809,0.983565,0.984168,1.0,-0.044231,0.0252,-0.245945
Volume,-0.048602,-0.048625,-0.049056,-0.048849,-0.044231,1.0,0.044614,-0.038112
VIX,0.025922,0.028558,0.021213,0.025257,0.0252,0.044614,1.0,-0.241712
Bond_Yield,-0.236641,-0.234014,-0.234424,-0.234269,-0.245945,-0.038112,-0.241712,1.0


Here are the correlations among the numerical variables in our dataset:

- Stock prices (`Open`, `High`, `Low`, `Close`, `Adj Close`) are highly correlated with each other, as expected. This is because they all track the same underlying value (the stock price) but at different points in time during the trading week.

- `Volume` has a very weak negative correlation with stock prices (about -0.05), so there is little evidence of a linear relationship between price level and trading volume.

- `VIX` (Volatility Index) has correlations close to zero with stock prices and volume (roughly 0.02 to 0.04). These are too small to be practically meaningful.

- `VIX` has a weak negative correlation with `Bond_Yield` (about -0.24). This might suggest that market volatility tends to be slightly lower when bond yields are higher.

- `Bond_Yield` has a weak negative correlation with stock prices (about -0.24). One reading is that higher yields make bonds more attractive relative to stocks, so investors might move money from stocks to bonds. However, these correlations are computed on price levels pooled across all stocks and dates, and yields were above 10% in the earliest rows (1980) when prices were low, so part of this figure may simply reflect long-run trends in both series.

These correlations can provide useful insights for our subsequent machine learning model training. For example, the model might learn to use changes in `VIX` or `Bond_Yield` to help predict changes in stock prices. However, we should note that correlation does not imply causation, and these relationships might be influenced by other factors or by chance.
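
Correlations between price levels can be dominated by shared trends. As an illustration only, the sketch below builds two made-up series that both drift upward but move in opposite directions from one step to the next, then compares the correlation of their levels with the correlation of their returns:

```python
import numpy as np
import pandas as pd

t = np.arange(50)
wiggle = np.where(t % 2 == 0, 1.0, -1.0)  # alternating +1 / -1

# Two invented price series: same upward trend, opposite short-term moves
prices = pd.DataFrame({
    "A": 100 + 2 * t + wiggle,
    "B": 50 + t - wiggle,
})

level_corr = prices["A"].corr(prices["B"])
return_corr = prices.pct_change().dropna().corr().loc["A", "B"]

print(f"correlation of levels:  {level_corr:.3f}")
print(f"correlation of returns: {return_corr:.3f}")
```

The levels are strongly positively correlated because of the trend, while the returns are negatively correlated. This is one reason return-based correlations are often preferred when comparing assets.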

Next, let's compute the weekly returns for each stock and the pairwise correlations between those returns.

In [25]:
# First, let's calculate the weekly returns for each stock
merged_data_filled['Return'] = merged_data_filled.groupby('Symbol')['Adj Close'].pct_change()

# Now let's calculate the correlations of these returns
return_correlations = merged_data_filled.pivot_table(values='Return', index='Date', columns='Symbol').corr()

# Let's take a look at the correlation matrix
return_correlations.head()

Symbol,A,AAL,AAP,AAPL,ABBV,ABC,ABMD,ABT,ACN,ADBE,...,GOOGL,KMX,KO,LNT,MMM,MO,SCHW,STZ,T,TECH
A,1.0,0.349341,0.274727,0.360826,,0.054474,0.276822,0.133401,0.487012,0.412896,...,0.42555,0.33992,,,,,0.438638,0.148622,0.20555,0.340269
AAL,0.349341,1.0,0.323773,0.273347,,0.284969,0.250243,0.23187,0.324913,0.268414,...,0.282957,0.373258,,,,,0.380322,0.252999,0.329789,0.214022
AAP,0.274727,0.323773,1.0,0.27693,,0.276478,0.161331,0.264517,0.306616,0.258906,...,0.278497,0.414925,,,,,0.311773,0.284248,0.304112,0.278272
AAPL,0.360826,0.273347,0.27693,1.0,,0.022728,0.128478,0.206847,0.342824,0.348736,...,0.501826,0.22236,,,,,0.243835,0.126899,0.16987,0.187291
ABBV,,,,,1.0,,,,,,...,,,0.362725,0.300826,0.388249,0.260665,,,,


The calculated matrix provides pairwise correlation values between the returns of all stocks in the dataset. It's a symmetric matrix where each entry `(i, j)` indicates the correlation between the returns of stock `i` and stock `j`.

Now, let's compare the average correlation between stocks within a given sector with the average correlation between any two stocks.

In [26]:
# Calculate the average correlation of all stocks
avg_all_stocks_corr = return_correlations.mean().mean()

# Get the list of all sectors
sectors = merged_data_filled['Sector'].unique()

# Calculate the average correlation for each sector
sector_avg_corrs = {}
for sector in sectors:
    # Get the list of stocks in this sector
    sector_stocks = merged_data_filled[merged_data_filled['Sector'] == sector]['Symbol'].unique()

    # Calculate the average correlation of these stocks
    sector_corr = return_correlations.loc[sector_stocks, sector_stocks]
    sector_avg_corrs[sector] = sector_corr.mean().mean()

avg_all_stocks_corr, sector_avg_corrs

(0.3233589589214254,
 {'Industrials': 0.5064664277977468,
  'Health Care': 0.3521829345444213,
  'Information Technology': 0.4391632883770879,
  'Communication Services': 0.44860197705826815,
  'Consumer Staples': 0.4124168746377075,
  'Consumer Discretionary': 0.4273941932330945,
  'Utilities': 0.659991617856845,
  'Financials': 0.5206381511395575,
  'Materials': 0.6086932370717472,
  'Real Estate': 0.548602107359132,
  'Energy': 0.7293144241473569})

The average correlation of returns for all stocks in the dataset is approximately 0.323.

Here are the average correlations of returns for each sector:

- Industrials: 0.506
- Health Care: 0.352
- Information Technology: 0.439
- Communication Services: 0.449
- Consumer Staples: 0.412
- Consumer Discretionary: 0.427
- Utilities: 0.660
- Financials: 0.521
- Materials: 0.609
- Real Estate: 0.549
- Energy: 0.729

It appears that, on average, stocks within the same sector have higher correlations in their returns compared to the overall average. This makes intuitive sense, as companies within the same sector are often subject to similar market conditions and risks, which can lead to similar movements in their stock prices.

The 'Energy' sector has the highest average correlation, suggesting that energy stocks tend to move together more strongly than stocks in other sectors. The 'Health Care' sector has the lowest average correlation, suggesting that health care stocks show the least collective movement.

These insights can be useful for portfolio diversification. For instance, if an investor wants to reduce risk through diversification, they might choose to invest in stocks from sectors with lower average correlations.
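
One caveat on the averages above: `corr.mean().mean()` includes the diagonal of the correlation matrix, where every stock has a correlation of 1.0 with itself. This raises the average, and more so for sectors with few stocks. A minimal sketch of an average that skips the diagonal, where `mean_offdiag_corr` is a hypothetical helper name and the matrix is made up:

```python
import numpy as np
import pandas as pd

def mean_offdiag_corr(corr: pd.DataFrame) -> float:
    """Average the off-diagonal entries of a square correlation matrix, ignoring NaNs."""
    values = corr.to_numpy(dtype=float)
    off_diagonal = ~np.eye(len(values), dtype=bool)
    return float(np.nanmean(values[off_diagonal]))

# Made-up 3x3 correlation matrix
toy_corr = pd.DataFrame(
    [[1.0, 0.2, 0.4],
     [0.2, 1.0, 0.6],
     [0.4, 0.6, 1.0]],
    index=list("ABC"),
    columns=list("ABC"),
)

with_diagonal = toy_corr.mean().mean()          # averages all nine entries
without_diagonal = mean_offdiag_corr(toy_corr)  # averages the six off-diagonal entries
print(with_diagonal, without_diagonal)
```

In this toy case the average falls from 0.6 to 0.4 once the diagonal is excluded.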

With these insights, we can now proceed to train an LSTM or transformer-based model to predict future price data. We'll need to define our target variable (what we want to predict), create training and test datasets, preprocess the data for our model, and then train the model.
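
The preparation steps can be sketched on synthetic data before touching the real table. The block below is an illustration under assumptions, not the notebook's exact pipeline: `make_windows` is a hypothetical helper, the data is random, and the min-max scaling uses statistics from the training rows only so that later values do not inform the preprocessing.

```python
import numpy as np

def make_windows(features: np.ndarray, target: np.ndarray, lookback: int):
    """Build (samples, lookback, n_features) inputs and the target at the following step."""
    X, y = [], []
    for i in range(lookback, len(features)):
        X.append(features[i - lookback:i])  # the previous `lookback` rows
        y.append(target[i])                 # value to predict at step i
    return np.array(X), np.array(y)

# Synthetic data: 100 time steps, 3 features (random walks)
rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 3)).cumsum(axis=0)

# Chronological split point, then min-max scaling based on the training rows only
lookback = 10
split_row = int(0.8 * len(raw))
lo = raw[:split_row].min(axis=0)
hi = raw[:split_row].max(axis=0)
scaled = (raw - lo) / (hi - lo)

X, y = make_windows(scaled, scaled[:, 0], lookback)
n_train = split_row - lookback  # windows whose target falls inside the training rows
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
print(X_train.shape, X_test.shape)
```

Splitting by position rather than at random keeps the test period strictly after the training period, which is what a forecasting evaluation needs.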

# Why LSTM?

Before we proceed, let's discuss why we might choose a Long Short-Term Memory (LSTM) model over a transformer model for this task.

LSTM and transformer are both powerful models for handling sequence data, and they have been used extensively in tasks like time-series forecasting, natural language processing, and more.

LSTM is a type of recurrent neural network (RNN) that has feedback connections. It can process entire sequences of data and has a "memory" of past inputs through its hidden state. This makes LSTMs particularly good for tasks where the context from earlier steps in the input sequence can inform later steps, such as our task of time-series prediction.
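
To make the "memory" idea concrete, here is a minimal NumPy sketch of a single LSTM step applied across a sequence. The weight layout, gate ordering, and sizes are choices made for this illustration only and do not mirror any particular library's internals:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate values
    c_t = f * c_prev + i * g       # updated cell state (the long-term memory)
    h_t = o * np.tanh(c_t)         # updated hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
D, H, T = 8, 4, 60                 # features, hidden units, sequence length (illustrative)
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(T, D)):   # process the sequence one step at a time
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The loop shows why LSTMs are inherently sequential: each step needs the hidden and cell state produced by the previous one.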

Transformers, on the other hand, use a mechanism called attention, weighing the importance of different points in the input sequence. They are highly parallelizable and can learn long-range dependencies, but they are also more complex and computationally intensive than LSTMs.
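
The attention mechanism itself is compact. The sketch below implements scaled dot-product attention in NumPy on random data. It is a simplification: a real transformer layer would first apply learned projections to obtain the queries, keys, and values, whereas here the same input is used for all three.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # similarity of every step to every other step
    scores = scores - scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 60, 16                      # sequence length and feature size (illustrative)
x = rng.normal(size=(T, d))

out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)
```

All time steps are handled in one matrix product, which is where the parallelism comes from, at the cost of a `T x T` weight matrix.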

For our task of predicting stock prices, both models could potentially work well. However, LSTMs might be a more suitable choice for a few reasons:

1. **Simplicity**: LSTMs are generally simpler to set up than transformers, and a small LSTM typically has fewer parameters, which can help limit overfitting when we don't have a large amount of data.

2. **Sequential nature of data**: Stock prices are a time series, where the order of the data points matters. LSTMs are designed to handle this type of data, as they can remember past information through their hidden states and use it to influence their predictions.

3. **Uncertain benefits of attention**: While transformers' attention mechanism can be powerful, it's not clear whether it would provide a significant benefit for this specific task. In some cases, attention might even be detrimental if it causes the model to overemphasize certain time points at the expense of others.

Given these considerations, we'll proceed with using an LSTM model for our task. We'll start by selecting a well-known stock from our dataset, preparing the data for this stock, and then training the model. For this, we'll use Apple Inc. (`AAPL`), which is one of the most widely traded stocks in the world.

In [27]:
# Test if you have a GPU available:
from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())

['/device:CPU:0', '/device:GPU:0']


If you are running this in Google Colab, follow these steps:
- On the top left click on `Runtime`.
- In the drop down menu select `Change runtime type`.
- Select a `GPU runtime` and restart the notebook.

In [81]:
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping
from sklearn.metrics import mean_squared_error
import numpy as np

# Select data for 'AAPL' stock
aapl_data = merged_data_filled[merged_data_filled['Symbol'] == 'AAPL']

# Define the features and target variable
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'VIX', 'Bond_Yield', 'Adj Close']
target = 'Adj Close'

# Scale the features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(aapl_data[features])

# Create a separate scaler for the target
target_scaler = MinMaxScaler(feature_range=(0, 1))
scaled_target = target_scaler.fit_transform(aapl_data[[target]])

# Define the lookback period and prepare the dataset
lookback = 60
X, y = [], []
for i in range(lookback, len(scaled_data)):
    X.append(scaled_data[i-lookback:i])
    y.append(scaled_target[i])

X, y = np.array(X), np.array(y)

# Split the data into training and test sets (80-20 split)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Define the LSTM model
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(LSTM(units=50))
model.add(Dense(1))

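# Compile and fit the model
# Note: monitor='loss' makes early stopping track the training loss;
# monitor='val_loss' would track the validation data passed to fit() instead.
model.compile(loss='mean_squared_error', optimizer='adam')
early_stop = EarlyStopping(monitor='loss', patience=2, verbose=1)
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stop], shuffle=False)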

# Make predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Inverse scale the predicted data
y_train_pred = target_scaler.inverse_transform(y_train_pred)
y_test_pred = target_scaler.inverse_transform(y_test_pred)

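# Calculate RMSE
# Note: y_train and y_test are still MinMax-scaled (0 to 1) here, while the
# predictions above were inverse-transformed to price units, so the two sides
# of each comparison are on different scales.
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))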

print(f"Train RMSE: {train_rmse}")
print(f"Test RMSE: {test_rmse}")

# Save the trained model
model.save('model.h5')


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 3: early stopping




Train RMSE: 7.264572349447011
Test RMSE: 80.19266501396655


# Understanding the results

The Root Mean Square Error (RMSE) on the training set is approximately 7.26, and on the test set it is approximately 80.19.

The RMSE is the square root of the average squared difference between predicted and observed values. A lower value of RMSE is better as it means the model's predictions are closer to the actual values. However, interpreting the RMSE value depends on the context and the scale of the target variable.

At face value, the RMSE values suggest that the model's predictions are relatively close to the actual prices on the training data and much worse on the test data, which would point to overfitting. These numbers need care, though. In the training code, `y_train` and `y_test` are still in MinMax-scaled units (0 to 1) while the predictions have been converted back to prices, so each RMSE largely reflects the price level in its period rather than prediction error alone. Training also stopped after only three epochs. The gap between the two values is therefore not, by itself, reliable evidence of overfitting.

Overfitting is a common issue in machine learning where the model learns the training data too well and performs poorly on unseen data. It occurs when the model is too complex relative to the amount and noise of the training data.

To address overfitting, you might consider:

- **Increasing the amount of training data**: If possible, collecting more data can help improve the model's performance and reduce overfitting.
- **Reducing the model's complexity**: This can be done by reducing the number of layers or the number of units in each layer.
- **Regularization**: Techniques like L1, L2 regularization or dropout can be used to prevent overfitting by adding a penalty to the loss function based on the complexity of the model.
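- **Early stopping**: Stop training when performance on a validation set stops improving. The training code above already uses an `EarlyStopping` callback, although it monitors the training loss (`monitor='loss'`) rather than the validation loss.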

It's also worth noting that stock price prediction is an inherently difficult task due to the noisy and unpredictable nature of the stock market. Even sophisticated models might not always make accurate predictions. It's always important to use model predictions as one of many tools in making investment decisions, and not rely on them exclusively.
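
An RMSE is only meaningful when predictions and targets are on the same scale. The sketch below uses made-up prices and pretend predictions to compare an RMSE computed in price units with one that mixes scaled targets and unscaled predictions:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up prices standing in for the target column
prices = np.array([[10.0], [12.0], [15.0], [20.0], [30.0]])
target_scaler = MinMaxScaler().fit(prices)

y_true_scaled = target_scaler.transform(prices)  # what a model would be trained on
y_pred_scaled = y_true_scaled + 0.05             # pretend predictions, off by 0.05 in scaled units

# Bring both arrays back to price units before comparing
y_true = target_scaler.inverse_transform(y_true_scaled)
y_pred = target_scaler.inverse_transform(y_pred_scaled)
rmse_prices = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Mixing scales: scaled targets against price-unit predictions
rmse_mixed = np.sqrt(np.mean((y_true_scaled - y_pred) ** 2))

print(f"RMSE in price units: {rmse_prices:.2f}")
print(f"RMSE with mixed scales: {rmse_mixed:.2f}")
```

With mixed scales the result is dominated by the size of the prices themselves, so it grows with the price level even when the model's error does not.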

# Testing the model on live data

In [82]:
import yfinance as yf
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import load_model

# Download the last 60 weeks of AAPL, VIX, and Bond Yield data
aapl_data_new = yf.download('AAPL', period='60wk', interval='1wk')
vix_data_new = yf.download('^VIX', period='60wk', interval='1wk')
bond_yield_data_new = yf.download('^TNX', period='60wk', interval='1wk')

# Merge the data on the Date index
merged_data_new = aapl_data_new[['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']].copy()
merged_data_new['VIX'] = vix_data_new['Close']
merged_data_new['Bond_Yield'] = bond_yield_data_new['Close']

# Define the features and target variable
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'VIX', 'Bond_Yield', 'Adj Close']
target = 'Adj Close'

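# Scale the features
# Note: this fits a new scaler on the downloaded window only, so the scaling
# differs from the one the model saw during training.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(merged_data_new[features])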

# Prepare the dataset
lookback = 60
X_new = []
for i in range(lookback, len(scaled_data)):
    X_new.append(scaled_data[i-lookback:i])

X_new = np.array(X_new)
X_new = np.reshape(X_new, (X_new.shape[0], X_new.shape[1], len(features)))

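# Load the trained model and target scaler
model = load_model('model.h5')
target_scaler = MinMaxScaler(feature_range=(0, 1))

# Fit the target scaler with the training data
# Note: y_train already holds MinMax-scaled targets (0 to 1), so this scaler
# leaves predictions in scaled units rather than converting them to prices.
target_scaler.fit(y_train)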

# Make predictions
y_pred = model.predict(X_new)

# Inverse scale the predicted data
y_pred = target_scaler.inverse_transform(y_pred)

# Print the predicted prices
predicted_prices = pd.DataFrame(y_pred, columns=['Predicted Price'])
print(predicted_prices)


[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
   Predicted Price
0         0.076257
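
A predicted "price" this close to zero suggests the output is still in scaled units. The sketch below, on made-up numbers, contrasts a scaler refit on already-scaled targets with the scaler that was fitted on the original prices:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training targets in price units
train_prices = np.array([[10.0], [20.0], [30.0], [50.0]])
target_scaler = MinMaxScaler().fit(train_prices)
y_train_scaled = target_scaler.transform(train_prices)  # values between 0 and 1

model_output = np.array([[0.5]])  # pretend network output, in scaled units

# A scaler refit on already-scaled targets spans 0 to 1, so it undoes nothing
refit_scaler = MinMaxScaler().fit(y_train_scaled)
still_scaled = refit_scaler.inverse_transform(model_output)

# The scaler fitted on the original prices maps the output back to price units
price = target_scaler.inverse_transform(model_output)

print(still_scaled[0, 0], price[0, 0])
```

The same reasoning applies to the input features: new rows would be passed through the training-time scaler's `transform` rather than a fresh `fit_transform`. The fitted scalers could be saved alongside the model, for example with `joblib.dump`.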
