## LSTM Model
This notebook trains an LSTM model to predict the next day's closing price for every stock listed in `ticker.csv`, and then runs the trained model on new stock price and sentiment data.

### 1. Installing `scikit-learn` and `tensorflow`

In [96]:
%pip install scikit-learn tensorflow


Note: you may need to restart the kernel to use updated packages.


### 2. Importing libraries
The following libraries are imported into the notebook:
*   `numpy` (imported as `np`):
    NumPy is used here for data pre-processing and for building the train and test arrays. It is widely used in data analysis, numerical computing, and machine learning.

*   `pandas` (imported as `pd`):
    pandas is used to read and write data in various file formats and to clean, aggregate, filter, and transform it.

*   `sklearn.preprocessing.MinMaxScaler`:
    MinMaxScaler is a data preprocessing class from the scikit-learn library (sklearn). It is used for feature scaling, specifically normalization. It transforms the data so that it lies within a specific range, typically [0, 1], by subtracting the minimum value and dividing by the range (maximum value - minimum value).

*   `sklearn.metrics.mean_absolute_error`:
    It calculates the mean absolute error between the true target values and predicted values. Mean absolute error is a measure of the average absolute difference between the predicted and actual values, and it provides an indication of how close the predictions are to the true values.

*   `sklearn.metrics.mean_absolute_percentage_error`:
    It calculates the mean absolute percentage error between the true target values and predicted values. Mean absolute percentage error is a measure of the average percentage difference between the predicted and actual values and provides insight into the accuracy of predictions relative to the true values.

*   `tensorflow` (imported as `tf`):
    TensorFlow provides a flexible framework for building, training, and deploying various machine learning models, especially deep learning models like Long Short-Term Memory (LSTM) networks. The code imports TensorFlow for building and training an LSTM model.
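
The scaling and error formulas described in the list above can be checked by hand on a few numbers. The sketch below is plain Python with made-up prices (not data from this notebook) and hypothetical helper names; it is meant to mirror what `MinMaxScaler`, `mean_absolute_error`, and `mean_absolute_percentage_error` compute for a single column with default settings.

```python
def min_max_scale(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example values, for illustration only
prices = [100.0, 110.0, 120.0, 150.0]
predicted = [102.0, 108.0, 123.0, 147.0]

scaled = min_max_scale(prices)
print(scaled)                    # [0.0, 0.2, 0.4, 1.0]
print(mae(prices, predicted))    # (2 + 2 + 3 + 3) / 4 = 2.5
print(mape(prices, predicted))   # a fraction; multiply by 100 for a percentage
```

Note that the percentage error is taken relative to the true values, so the order of the two arguments matters for MAPE but not for MAE.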

In [97]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Import libraries for data preprocessing and evaluation
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error

# Import TensorFlow for building and training the LSTM model
import tensorflow as tf

In [98]:
# Define hyperparameters for the model
split = 0.85  # Split ratio for train-test data
sequence_length = 10  # Length of input sequence for LSTM model
epochs = 100  # Number of training epochs
learning_rate = 0.02  # Learning rate for the optimizer

# Load stock price and news data from CSV files
stock_data = pd.read_csv("../fin-bert/finbert_stocks_output.csv")

In [99]:
# Prepare data for training and testing
stock_column = ['Close']
news_column = ['score']

len_stock_data = stock_data.shape[0]
print("len_stock_data: {}".format(stock_data.shape))

# Split data into training and testing sets

ticker_df = pd.read_csv("../../ticker.csv")
ticker_list = ticker_df['ticker'].tolist()

train_data = []
train_sentiment_data = []
test_data = []
test_sentiment_data = []

test_df = pd.DataFrame()

for ticker in ticker_list:
    filtered_stock = stock_data[stock_data['related'] == ticker]
    print(ticker)
    train_examples = int(filtered_stock.shape[0] * split)
    train_temp = filtered_stock[stock_column].values[:train_examples]
    train_sentiment_temp = filtered_stock[news_column].values[:train_examples]
    test_temp = filtered_stock[stock_column].values[train_examples:]
    test_sentiment_temp = filtered_stock[news_column].values[train_examples:]

    # To store the dates
    test_df = filtered_stock[['related', 'datetime_norm']].copy()

    train_data.extend(train_temp)
    train_sentiment_data.extend(train_sentiment_temp)
    test_data.extend(test_temp)
    test_sentiment_data.extend(test_sentiment_temp)

# Convert Python lists to NumPy arrays
train = np.array(train_data)
train_sentiment = np.array(train_sentiment_data)
test = np.array(test_data)
test_sentiment = np.array(test_sentiment_data)

len_train = train.shape[0]
len_test = test.shape[0]
len_train_sentiment = train_sentiment.shape[0]
len_test_sentiment = test_sentiment.shape[0]

# Reshape train and test arrays to 2D
train = train.reshape(-1, 1)
test = test.reshape(-1, 1)

print("Train & Test shape")
print(train.shape)
print(test.shape)

len_stock_data: (638, 16)
AMZN
(109, 1)
META
(116, 1)
AAPL
(108, 1)
MSFT
(105, 1)
TSLA
(102, 1)
Train & Test shape
(540, 1)
(98, 1)


In [100]:
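# Normalize data using MinMaxScaler.
# Note: fit_transform is called on both arrays, so the scaler is re-fit on the test set
# and the inverse_transform calls in the cells below use the test set's min/max.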
scaler = MinMaxScaler()
train, test = scaler.fit_transform(train), scaler.fit_transform(test)
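
Because `fit_transform` is called twice in the cell above, the test set is scaled with its own minimum and maximum rather than the training set's. A common alternative is to fit the scaler on the training data only and reuse those statistics for the test data. A minimal sketch in plain Python, with made-up numbers and hypothetical helper names (with scikit-learn this would correspond to `fit_transform` on the training array and `transform` on the test array):

```python
def fit_min_max(train_values):
    """Learn the scaling statistics from the training data only."""
    return min(train_values), max(train_values)


def transform(values, lo, hi):
    """Apply previously learned statistics; results may fall outside [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def inverse_transform(scaled, lo, hi):
    """Map scaled values back to the original price scale."""
    return [s * (hi - lo) + lo for s in scaled]


# Made-up example values, for illustration only
train_prices = [100.0, 120.0, 140.0]
test_prices = [130.0, 150.0]

lo, hi = fit_min_max(train_prices)
train_scaled = transform(train_prices, lo, hi)   # [0.0, 0.5, 1.0]
test_scaled = transform(test_prices, lo, hi)     # [0.75, 1.25]
restored = inverse_transform(test_scaled, lo, hi)

print(train_scaled, test_scaled, restored)
```

A test price above the training maximum scales to a value greater than 1, which is expected when the statistics come from the training data alone.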

In [101]:
# Prepare input features (X) and target values (y) for training data
X_train = []
for i in range(len_train - sequence_length):
    X_train.append(train[i: i + sequence_length])
len_X_train = len(X_train)
y_train = np.array(train[sequence_length:]).astype(float)
# print(X_train)
# print(y_train)

In [102]:
# Prepare input features (X) and target values (y) for testing data
X_test = []
for i in range(len_test - sequence_length):
    X_test.append(test[i: i + sequence_length])
len_X_test = len(X_test)
y_test = np.array(test[sequence_length:]).astype(float)

In [103]:
# Add news sentiment to the input features (X) for both training and testing data
for i in range(len_X_train):
    X_train[i] = X_train[i].tolist()
    X_train[i].append(train_sentiment[sequence_length + i].tolist())
    if i == 0:
        print(X_train[i])
X_train = np.array(X_train).astype(float)

for i in range(len_X_test):
    X_test[i] = X_test[i].tolist()
    X_test[i].append(test_sentiment[sequence_length + i].tolist())
    if i == 0:
        print(X_test[i])
X_test = np.array(X_test).astype(float)

[[0.010189436614223446], [0.007623204081348867], [0.0], [0.011170650792412695], [0.016001199373743558], [0.025473620327899293], [0.04517319586257518], [0.04585249355634613], [0.05660804517310952], [0.04879613609091393], [-0.97]]
[[0.009904007917537538], [0.006294569941466488], [0.005634260208446595], [0.005018011423942759], [0.0], [0.010828414677333309], [0.010916469408285279], [0.002024822232375545], [0.008275364808365993], [0.003873568585887055], [0.0]]
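
The three cells above build each sample from `sequence_length` consecutive scaled prices, append one sentiment score as an extra final step, and use the price that follows the window as the target. The same pairing can be sketched on a toy series in plain Python; the function name and the numbers are made up for illustration.

```python
def make_samples(prices, sentiment, sequence_length):
    """Pair each window of prices (plus one sentiment value) with the next price."""
    samples, targets = [], []
    for i in range(len(prices) - sequence_length):
        window = list(prices[i:i + sequence_length])
        # The sentiment of the target day is appended as one extra step
        window.append(sentiment[i + sequence_length])
        samples.append(window)
        targets.append(prices[i + sequence_length])
    return samples, targets


# Made-up example values, for illustration only
prices = [0.1, 0.2, 0.3, 0.4, 0.5]
sentiment = [0.0, -0.5, 0.9, 0.0, 0.7]

X, y = make_samples(prices, sentiment, sequence_length=3)
print(X)  # [[0.1, 0.2, 0.3, 0.0], [0.2, 0.3, 0.4, 0.7]]
print(y)  # [0.4, 0.5]
```

A series of length `n` yields `n - sequence_length` samples of length `sequence_length + 1`. With 540 training rows and `sequence_length = 10` that is 530 samples of length 11, which matches the `X_train` shape printed by the next cell. Since the sentiment score sits on the same axis as the prices, the LSTM receives it as an eleventh time step rather than as a second feature.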


In [104]:
print("X_train: {}".format(X_train.shape))
print("X_test: {}".format(X_test.shape))
print("y_train: {}".format(y_train.shape))
print("y_test: {}".format(y_test.shape))

print(X_test)

X_train: (530, 11, 1)
X_test: (88, 11, 1)
y_train: (530, 1)
y_test: (88, 1)
[[[ 0.00990401]
  [ 0.00629457]
  [ 0.00563426]
  [ 0.00501801]
  [ 0.        ]
  [ 0.01082841]
  [ 0.01091647]
  [ 0.00202482]
  [ 0.00827536]
  [ 0.00387357]
  [ 0.        ]]

 [[ 0.00629457]
  [ 0.00563426]
  [ 0.00501801]
  [ 0.        ]
  [ 0.01082841]
  [ 0.01091647]
  [ 0.00202482]
  [ 0.00827536]
  [ 0.00387357]
  [ 0.01276522]
  [-0.964     ]]

 [[ 0.00563426]
  [ 0.00501801]
  [ 0.        ]
  [ 0.01082841]
  [ 0.01091647]
  [ 0.00202482]
  [ 0.00827536]
  [ 0.00387357]
  [ 0.01276522]
  [ 0.0281715 ]
  [ 0.        ]]

 [[ 0.00501801]
  [ 0.        ]
  [ 0.01082841]
  [ 0.01091647]
  [ 0.00202482]
  [ 0.00827536]
  [ 0.00387357]
  [ 0.01276522]
  [ 0.0281715 ]
  [ 0.02984414]
  [ 0.        ]]

 [[ 0.        ]
  [ 0.01082841]
  [ 0.01091647]
  [ 0.00202482]
  [ 0.00827536]
  [ 0.00387357]
  [ 0.01276522]
  [ 0.0281715 ]
  [ 0.02984414]
  [ 0.02170086]
  [ 0.        ]]

 [[ 0.01082841]
  [ 0.01091647]
  

In [105]:
# Define the LSTM model architecture using TensorFlow
def model_create():
    tf.random.set_seed(1234)
    model = tf.keras.models.Sequential(
        [
            tf.keras.Input(shape=(X_train.shape[1], 1)),
            tf.keras.layers.LSTM(units=70, activation="tanh", return_sequences=True),
            tf.keras.layers.LSTM(units=30, activation="tanh", return_sequences=True),
            tf.keras.layers.LSTM(units=10, activation="tanh", return_sequences=False),
            tf.keras.layers.Dense(units=1, activation="linear")
        ]
    )

    model.compile(
        loss=tf.keras.losses.mean_squared_error,
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate)
    )

    model.fit(
        X_train, y_train,
        epochs=epochs
    )
    return model
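
To get a feel for the size of the network defined above, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard Keras LSTM layout (four gates, each with an input kernel, a recurrent kernel, and a bias) and a `Dense` layer with a bias; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units


# Layer sizes taken from model_create(): each layer's input is the previous layer's output
layers = [
    ("LSTM(70)", lstm_params(1, 70)),
    ("LSTM(30)", lstm_params(70, 30)),
    ("LSTM(10)", lstm_params(30, 10)),
    ("Dense(1)", dense_params(10, 1)),
]
total = sum(count for _, count in layers)

for name, count in layers:
    print(f"{name}: {count}")
print(f"total: {total}")  # 20160 + 12120 + 1640 + 11 = 33931
```

Calling `model.summary()` on the trained model is a quick way to compare these figures with what Keras reports.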

In [106]:
# Invert the normalization on the test target values (y_test)
y_test = scaler.inverse_transform(y_test)

In [107]:
# Use the trained model to predict stock prices for the test data
def predict(model, test):
    predictions = model.predict(test)
    predictions = scaler.inverse_transform(predictions.reshape(-1, 1)).reshape(-1, 1)
    return predictions

In [108]:
# Evaluate the model's performance on the test data
def evaluate(predictions):
    # scikit-learn metrics take (y_true, y_pred); MAPE divides by y_true, so the order matters
    mae = mean_absolute_error(y_test, predictions)
    mape = mean_absolute_percentage_error(y_test, predictions)
    return mae, mape, (1 - mape)
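
MAPE divides each error by the true value, so swapping its two arguments changes the result, and the `1 - mape` figure returned above inherits that. A small plain-Python check with made-up numbers (the `mape` helper is a hypothetical stand-in for the scikit-learn function, which likewise returns a fraction):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction, relative to y_true."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example values, for illustration only
y_true = [100.0, 200.0]
y_pred = [50.0, 250.0]

correct_order = mape(y_true, y_pred)   # (50/100 + 50/200) / 2 = 0.375
swapped_order = mape(y_pred, y_true)   # (50/50 + 50/250) / 2 = 0.6

print(correct_order, 1 - correct_order)
print(swapped_order, 1 - swapped_order)
```

The absolute errors are identical in both calls; only the denominators differ, which is enough to move the reported accuracy from 0.625 to 0.4 in this toy case.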

In [109]:
# Perform trial runs of the model and get average evaluation results
def run_model(n):
    total_mae = total_mape = total_acc = 0
    for i in range(n):
        model = model_create()
        predictions = predict(model, X_test)
        mae, mape, acc = evaluate(predictions)
        total_mae += mae
        total_mape += mape
        total_acc += acc
    return (total_mae / n), (total_mape / n), (total_acc / n)

In [110]:
# Perform a single trial run of the model
mae, mape, acc = run_model(1)

# Print the evaluation results
print(f"Mean Absolute Error = {mae}")
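# scikit-learn returns MAPE as a fraction, so scale it by 100 before printing a percentage
print(f"Mean Absolute Percentage Error = {mape * 100:.2f}%")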
print(f"Accuracy = {acc}")

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

### Saving the model for the frontend

In [111]:
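# Save the model to a file named 'lstm_model.h5' in the current directory.
# run_model() does not return its model, so train one at the top level before saving.
model = model_create()
model.save('lstm_model.h5')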



In [121]:
# Load the existing LSTM model from the file
from tensorflow.keras.models import load_model
model = load_model('lstm_model.h5')

# Load the historical stock prices from stocks_hist.csv
data = pd.read_csv("../../stocks_hist.csv")

# Prepare data for prediction
stock_column = ['Close']

# Normalize data using MinMaxScaler
scaler = MinMaxScaler()
data['MinMax_Close'] = scaler.fit_transform(data[stock_column])

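# Prepare input features (X) and target values (y) for the new data.
# Note: the windows slide over the whole frame, so they can span two tickers, and each
# sample is labelled with the Date/ticker of its window's first row, not of its target row.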
sequence_length = 10
customX_test = []
customy_test = []
related_stocks = []
related_dates = []
for i in range(len(data) - sequence_length):
    related_date = data['Date'].values[i]
    related_dates.append(related_date)
    related_stock = data['ticker'].values[i]
    related_stocks.append(related_stock)
    customX_test.append(data['MinMax_Close'].values[i: i + sequence_length])
    customy_test.append(data['MinMax_Close'].values[i + sequence_length])

for i in range(len(customX_test)):
    customX_test[i] = customX_test[i].tolist()
    customX_test[i].append(0.0)  # use 0.0 as the sentiment value for every window
    if i == 0:
        print(customX_test[i])
customX_test = np.array(customX_test).astype(float)

# customX_test = np.array(customX_test)
customy_test = np.array(customy_test)

print("customX_test.shape: {}".format(customX_test.shape))
predictions = predict(model, customX_test)

# Create a DataFrame to store the results
print("customy_test len: {}".format(len(customy_test)))
print("predictions len: {}".format(len(predictions)))

result_df = pd.DataFrame({
    'Date': related_dates,
    'Stock': related_stocks,
    'Actual_Close': customy_test.flatten(),
    'Predicted_Close': predictions.flatten()
})


[0.026584288288354796, 0.028226847422264445, 0.023385509413632155, 0.02022999917288304, 0.031166278326279495, 0.007824037546620866, 0.007996914207664907, 0.0028097569186611437, 0.0042362367158367276, 0.0, 0.0]
customX_test.shape: (45, 11)
customy_test len: 45
predictions len: 45


In [122]:
# Scaled target values, reshaped to 2D as required by inverse_transform
transformed_column = result_df['Actual_Close'].values.reshape(-1, 1)
original_column = scaler.inverse_transform(transformed_column)

result_df['Actual_Close'] = original_column
# Aligned by row index, so this is the close on each row's Date (the window start), not on the target day
result_df['Real_Close'] = data['Close']

print(result_df)

result_df.to_csv("lol.csv")

          Date Stock  Actual_Close  Predicted_Close  Real_Close
0   2023-07-13  AMZN    128.250000       130.127472  134.300003
1   2023-07-14  AMZN    313.410004       130.350464  134.679993
2   2023-07-17  AMZN    308.869995       185.019608  133.559998
3   2023-07-18  AMZN    310.619995       241.455811  132.830002
4   2023-07-19  AMZN    312.049988       270.443207  135.360001
5   2023-07-20  AMZN    316.010010       291.860748  129.960007
6   2023-07-21  AMZN    302.519989       306.462738  130.000000
7   2023-07-24  AMZN    294.260010       296.026093  128.800003
8   2023-07-25  AMZN    291.609985       282.161530  129.130005
9   2023-07-26  AMZN    294.470001       279.026886  128.149994
10  2023-07-27  AMZN    298.570007       285.159363  128.250000
11  2023-07-13  META    311.709991       293.933258  313.410004
12  2023-07-14  META    190.539993       303.170135  308.869995
13  2023-07-17  META    190.690002       167.113998  310.619995
14  2023-07-18  META    193.990005      
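
In the table above, `Actual_Close` for the second AMZN row (313.41) equals the `Real_Close` of the first META row, which suggests that the sliding windows run across ticker boundaries and that each row carries the date of its window's first day. One possible way to avoid both effects is to build the windows per ticker and label each sample with the date of its target row. The sketch below uses plain Python and made-up rows; the function name and the tuple layout are assumptions for illustration, not the notebook's code.

```python
from collections import defaultdict


def windows_per_ticker(rows, sequence_length):
    """rows: iterable of (ticker, date, close), in date order within each ticker."""
    by_ticker = defaultdict(list)
    for ticker, date, close in rows:
        by_ticker[ticker].append((date, close))

    samples = []
    for ticker, series in by_ticker.items():
        closes = [close for _, close in series]
        for i in range(len(series) - sequence_length):
            # Label the sample with the day being predicted, not the window start
            target_date, target_close = series[i + sequence_length]
            samples.append({
                "ticker": ticker,
                "date": target_date,
                "window": closes[i:i + sequence_length],
                "target": target_close,
            })
    return samples


# Made-up example rows, for illustration only
rows = [
    ("AAA", "2023-01-02", 10.0), ("AAA", "2023-01-03", 11.0),
    ("AAA", "2023-01-04", 12.0), ("AAA", "2023-01-05", 13.0),
    ("BBB", "2023-01-02", 50.0), ("BBB", "2023-01-03", 51.0),
    ("BBB", "2023-01-04", 52.0),
]

samples = windows_per_ticker(rows, sequence_length=2)
for sample in samples:
    print(sample)
```

With this layout a ticker needs more than `sequence_length` rows to contribute any sample, and no window mixes prices from two different stocks.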

In [124]:
from sklearn.metrics import mean_absolute_error

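# Note: this redefines (shadows) the scikit-learn function imported earlier and,
# unlike it, returns a percentage rather than a fraction.
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Share of predictions whose relative error is at most `threshold`, as a percentage
def accuracy_within_threshold(y_true, y_pred, threshold):
    return np.mean(np.abs((y_true - y_pred) / y_true) <= threshold) * 100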

# Assuming 'result_df' contains the actual and predicted stock prices as columns 'Actual_Close' and 'Predicted_Close', respectively.
actual_stock = result_df['Actual_Close'].values
predicted_stock = result_df['Predicted_Close'].values

# Calculate MAE
mae = mean_absolute_error(actual_stock, predicted_stock)

# Calculate MAPE
mape = mean_absolute_percentage_error(actual_stock, predicted_stock)

# Calculate Accuracy within a 5% threshold
threshold = 0.05  # 5% threshold (you can adjust this value as needed)
accuracy = accuracy_within_threshold(actual_stock, predicted_stock, threshold)

# Print the evaluation results
print(f"Mean Absolute Error (MAE) = {mae}")
print(f"Mean Absolute Percentage Error (MAPE) = {mape:.2f}%")
print(f"Accuracy within {threshold*100:.0f}% threshold = {accuracy:.2f}%")


Mean Absolute Error (MAE) = 30.86593288845486
Mean Absolute Percentage Error (MAPE) = 10.61%
Accuracy within 5% threshold = 46.67%
