# **Stock Price and Volume Prediction with Long Short-Term Memory (LSTM) Model**


## **Introduction**
---

This project develops and evaluates an LSTM (Long Short-Term Memory) model for predicting the stock prices and trading volume of **TLKM** (an Indonesian telecommunications company) from historical stock market data. The primary goal is to forecast the key price features (Open, High, Low, Close) and trading volume accurately, to support better decision-making by traders and investors.

---
### **Objectives**:
1. Predict stock price movements (Open, High, Low, Close) with high accuracy.
2. Estimate trading volume trends to provide comprehensive market insights.
3. Evaluate model performance using metrics such as:
   - *MAE (Mean Absolute Error)*
   - *RMSE (Root Mean Squared Error)*
   - *R² Score*
   - *MAPE (Mean Absolute Percentage Error)*
---
### **Methodology**:
- *Data Preprocessing*: Historical stock price data is normalized and structured into sequences to train the LSTM model.
- *Model Architecture*: An LSTM-based neural network with multiple layers is used to capture temporal dependencies in the data.
- *Evaluation*: The model's predictions are compared to actual values, with visualizations and metrics used to analyze performance.
---
### **Significance**:
Stock price prediction is a challenging task due to market volatility and noise. This project demonstrates how deep learning, specifically LSTM models, can capture temporal dependencies and trends in financial time series data, providing valuable insights for stock market analysis.

## **Preprocessing**
---

### **Get The Dataset**
---
The stock dataset comes from the `yfinance` library, which provides the following fields commonly used in technical analysis:

1. **Open**: The opening price of the stock at the beginning of the trading session.
2. **High**: The highest price reached during the trading session.
3. **Low**: The lowest price reached during the trading session.
4. **Close**: The closing price of the stock at the end of the trading session.
5. **Volume**: The number of shares traded during the trading session.

This data is used for technical analysis, such as creating price charts or technical indicators.

In [28]:
import pandas as pd
import numpy as np
import yfinance as yf

ticker = "TLKM" # Stock ticker
period = "5y"   # Stock period
data = pd.read_csv(f"./data/{ticker}/{ticker}_stock_{period}_data.csv")
data = data[["Date", "Open", "High", "Low", "Close", "Volume"]] # Keep the date and the five features used by the model
data.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2020-01-10 00:00:00+07:00,3145.500977,3153.404245,3121.79117,3145.500977,48099000
1,2020-01-13 00:00:00+07:00,3161.307285,3185.01709,3153.404017,3185.01709,61913800
2,2020-01-14 00:00:00+07:00,3145.500821,3161.307358,3113.887747,3121.791016,95058600
3,2020-01-15 00:00:00+07:00,3129.694661,3129.694661,3058.565237,3066.468506,147583300
4,2020-01-16 00:00:00+07:00,3074.371621,3074.371621,3026.952007,3042.758545,118903100


### **Change "Date" Column Data Type to Date**

In [29]:
from functions.str_to_date import str_to_date
data["Date"] = data["Date"].apply(str_to_date)
data["Date"]

0      2020-01-10
1      2020-01-13
2      2020-01-14
3      2020-01-15
4      2020-01-16
          ...    
1207   2025-01-06
1208   2025-01-07
1209   2025-01-08
1210   2025-01-09
1211   2025-01-10
Name: Date, Length: 1212, dtype: datetime64[ns]

### **Change Indexing Dataset to Date**

In [30]:
data.index = data.pop("Date")
columns = data.columns
dataset_date = data.index
data.head()

Date,Open,High,Low,Close,Volume
2020-01-10,3145.500977,3153.404245,3121.79117,3145.500977,48099000
2020-01-13,3161.307285,3185.01709,3153.404017,3185.01709,61913800
2020-01-14,3145.500821,3161.307358,3113.887747,3121.791016,95058600
2020-01-15,3129.694661,3129.694661,3058.565237,3066.468506,147583300
2020-01-16,3074.371621,3074.371621,3026.952007,3042.758545,118903100


### **Visualize The Dataset**
---
The dataset visualization provides three charts:

1. **Candlestick Chart**

    Candlestick chart is a type of price chart often used in technical analysis to represent price movements of stocks or other assets. Each candlestick represents a specific time period (e.g., 1 day) and consists of:
    - **Body**:
        - *Bullish Candle (Green)*: The closing price is higher than the opening price.
        - *Bearish Candle (Red)*: The closing price is lower than the opening price.
    - **Shadow (Wick)**: 
        - *Upper Shadow*: Shows the highest price during the period.
        - *Lower Shadow*: Shows the lowest price during the period.
    - **Open and Close**:
        - The opening price is the top of the body for a bearish candle and the bottom of the body for a bullish candle.
        - The closing price is the bottom of the body for a bearish candle and the top of the body for a bullish candle.

    To read a candlestick chart, look at individual candlestick patterns and the overall trend:
    - **Candlestick Pattern**:    
        - *Bullish Engulfing*: A bullish candle that completely covers the previous bearish candle, indicating a potential upward reversal.
        - *Bearish Engulfing*: A bearish candle that completely covers the previous bullish candle, indicating a potential downward reversal.
        - *Doji*: The opening and closing prices are almost the same, showing market indecision.
    - **Trend**:
        - Long-bodied candlesticks indicate strong momentum.
        - Long shadows indicate high volatility without a clear trend direction.
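
The candle definitions above can be turned into a few boolean rules. The following is a minimal sketch, assuming a DataFrame with `Open`, `High`, `Low`, and `Close` columns; the function name `label_candles` and the `doji_ratio` threshold (body no larger than 10% of the high-low range) are illustrative assumptions, not part of this project's code.

```python
import pandas as pd

def label_candles(df: pd.DataFrame, doji_ratio: float = 0.1) -> pd.DataFrame:
    """Label each candle and flag simple engulfing patterns (illustrative only)."""
    body = (df["Close"] - df["Open"]).abs()
    span = df["High"] - df["Low"]

    out = pd.DataFrame(index=df.index)
    out["bullish"] = df["Close"] > df["Open"]
    out["bearish"] = df["Close"] < df["Open"]
    # Doji: the body is tiny relative to the full range (threshold is an assumption)
    out["doji"] = body <= doji_ratio * span

    # Previous candle's open/close; the first row has no predecessor (NaN -> False)
    prev_open, prev_close = df["Open"].shift(1), df["Close"].shift(1)
    prev_bearish = prev_close < prev_open
    prev_bullish = prev_close > prev_open

    # Engulfing: the current body fully covers the previous candle's body
    out["bullish_engulfing"] = (
        out["bullish"] & prev_bearish
        & (df["Open"] <= prev_close) & (df["Close"] >= prev_open)
    )
    out["bearish_engulfing"] = (
        out["bearish"] & prev_bullish
        & (df["Open"] >= prev_close) & (df["Close"] <= prev_open)
    )
    return out

# Small made-up example (not TLKM data)
candles = pd.DataFrame({
    "Open":  [100.0, 98.0, 103.0, 101.0],
    "High":  [101.0, 104.0, 104.0, 103.0],
    "Low":   [97.0, 97.0, 100.0, 99.0],
    "Close": [98.0, 103.0, 101.0, 101.1],
})
labels = label_candles(candles)
print(labels)
```

In this toy example the second candle is flagged as a bullish engulfing pattern and the last one as a doji. Real pattern detection usually adds trend context, which this sketch leaves out.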
---
2. **Volume Chart**
    
    A volume chart is a visual representation of the trading volume of an asset (such as a stock, in this case) over a specific time period. It shows how many units of the asset were traded during each time interval, often displayed as bars at the bottom of a price chart.
---
3. **Relative Strength Index (RSI)**
    
    The RSI (Relative Strength Index) is a technical indicator used to measure the relative strength of stock prices over a certain period. RSI values range from 0 to 100, with the main interpretations as follows:
    - **Overbought** (RSI > 70): The asset may be overbought (price is too high) and susceptible to a downward correction.
    - **Oversold** (RSI < 30): The asset may be oversold (price is too low) and susceptible to an upward reversal.
    - **Divergence**: A mismatch between price action and RSI, which suggests a potential reversal in either direction. There are two types:
        - *Bullish Divergence*: The price forms a lower low, but the RSI forms a higher low, indicating a potential upward reversal.
        - *Bearish Divergence*: The price forms a higher high, but the RSI forms a lower high, indicating a potential downward reversal.
    
    To interpret the RSI chart, look at three things:
    - **Levels 70 and 30**:
        - If RSI is above 70, be cautious of a potential downward reversal.
        - If RSI is below 30, look for opportunities for an upward reversal.
    - **RSI Trends**:
        - RSI above 50 often indicates an upward trend.
        - RSI below 50 often indicates a downward trend.
    - **Combining with Other Analysis**: Combine RSI with candlestick patterns or other indicators to strengthen your analysis. For instance, confirm candlestick patterns with RSI conditions to increase the accuracy of your decisions.
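
For reference, RSI can be computed from closing prices in a few lines of pandas. This is a minimal sketch using Wilder's smoothing; the 14-period default is a common convention and an assumption here, since the period used by this project's `PlotGraph` helper is not shown. The function name `rsi` is illustrative.

```python
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index with Wilder's smoothing (period is an assumption)."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # upward moves, zero otherwise
    loss = (-delta).clip(lower=0)    # downward moves as positive numbers

    # Wilder's smoothing is an exponential average with alpha = 1 / period
    avg_gain = gain.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()

    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Sanity checks on synthetic series (not TLKM data)
rising = pd.Series(range(1, 41), dtype=float)
falling = rising[::-1].reset_index(drop=True)
print(rsi(rising).iloc[-1], rsi(falling).iloc[-1])
```

A series that only rises ends at the top of the scale and one that only falls ends at the bottom, which matches the overbought and oversold readings described above.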


In [31]:
from functions.plot_graph import PlotGraph
window = 20
std = 2.0
dataset_graph = PlotGraph(data, ticker, window, std)
dataset_graph.plot_main_candle_graph()

#### **TLKM Stock Analysis**
---
1. **Candlestick Chart**
- **Price Trend**: 
  - Uptrend from mid-2020 to mid-2022, followed by consolidation (2023–2024).
  - Peaks in mid-2022, then sideways movement with lower highs and lows.

- **Moving Averages (SMA-20 & EMA-20)**:
  - *SMA (Simple Moving Average)*: Average price over 20 periods, smoothing price fluctuations evenly.
  - *EMA (Exponential Moving Average)*: Weighted more towards recent prices, reacts faster to changes.
  - Price stayed above MA lines during the uptrend and below them during the downtrend, useful for identifying trend direction.

- **Bollinger Bands (BB)**:
  - Upper (BBU), Middle (BBM), and Lower (BBL) bands represent volatility.
  - Bands expand during high volatility (e.g., price rally in 2022) and contract during low volatility (e.g., 2023–2024).
  - Price touching or exceeding bands signals potential trend continuation or reversal.
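
The indicators described above can be reproduced with pandas. This is a minimal sketch, assuming the same `window = 20` and `std = 2.0` settings passed to `PlotGraph`; the function name and the use of the population standard deviation (`ddof=0`) are assumptions, because the project's plotting helper is not shown.

```python
import pandas as pd

def moving_averages_and_bands(close: pd.Series, window: int = 20,
                              num_std: float = 2.0) -> pd.DataFrame:
    """SMA, EMA, and Bollinger Bands for a closing-price series."""
    sma = close.rolling(window).mean()                 # equal weights over the window
    ema = close.ewm(span=window, adjust=False).mean()  # more weight on recent prices
    # ddof=0 (population std) is an assumption; some libraries use ddof=1
    rolling_std = close.rolling(window).std(ddof=0)
    return pd.DataFrame({
        "SMA": sma,
        "EMA": ema,
        "BBL": sma - num_std * rolling_std,  # lower band
        "BBM": sma,                          # middle band is the SMA
        "BBU": sma + num_std * rolling_std,  # upper band
    })

# Tiny example with a 3-period window so the numbers are easy to check by hand
toy_close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
bands = moving_averages_and_bands(toy_close, window=3, num_std=2.0)
print(bands)
```

The band width is proportional to the rolling standard deviation, which is why the bands widen during volatile periods and narrow during quiet ones.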

---

2. **Volume Chart**
- **Volume Spikes**:
  - High volume during mid-2022 (price rally) and late 2024 (key movements).
- **Volume Confirmation**:
  - Rising volume supported the uptrend in 2020–2022.
  - Low volume during sideways movement in 2023–2024 suggests market indecision.

---

3. **RSI Chart**
- **RSI Range**:
  - Overbought (>70) in mid-2022, signaling possible pullback.
  - Neutral range (30–70) during consolidation in 2023–2024.

- **Momentum**:
  - RSI > 50 during the uptrend indicates bullish momentum.
  - No clear bullish or bearish divergence observed.

---

**Key Insights**
1. *Uptrend (2020–2022)*: Strong rally confirmed by volume and RSI.
2. *Consolidation (2023–2024)*: Sideways trend with low volatility and reduced volume.
3. *Actionable*:
   - Watch for a breakout with high volume to confirm a new trend.
   - Monitor RSI nearing 70 (overbought) or 30 (oversold) for entry/exit points.


### **Normalize the data**
---
To normalize the data, I use `MinMaxScaler`, a preprocessing tool provided by the `sklearn.preprocessing` module.
- It scales and normalizes the data by transforming it into a specific range, typically `[0, 1]`.
- The formula used is:

    $$X_{\text{scaled}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}$$
    
  where:
    - $X$: The original value.
    - $X_{\text{min}}$: The minimum value of that feature (column) in the data.
    - $X_{\text{max}}$: The maximum value of that feature (column) in the data.

**Why use MinMaxScaler?**
  - Helps normalize the dataset for machine learning models, improving model convergence and performance.
  - Maintains the relationships and proportions between values while scaling them.
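
To make the formula concrete, here is a minimal NumPy sketch that applies it column by column, which is how `MinMaxScaler` treats each feature by default. The helper names `min_max_scale` and `min_max_inverse` are illustrative and not part of the project; the sample values are three Close/Volume rows taken from the table shown earlier.

```python
import numpy as np

def min_max_scale(x: np.ndarray):
    """Scale each column to [0, 1]; also return the per-column min and max."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min), x_min, x_max

def min_max_inverse(scaled: np.ndarray, x_min: np.ndarray, x_max: np.ndarray):
    """Undo the scaling, as inverse_transform does for predictions."""
    return scaled * (x_max - x_min) + x_min

# Columns: Close, Volume (three rows from the dataset preview)
sample = np.array([
    [3145.500977, 48099000.0],
    [3185.017090, 61913800.0],
    [3121.791016, 95058600.0],
])
scaled, col_min, col_max = min_max_scale(sample)
restored = min_max_inverse(scaled, col_min, col_max)
print(scaled)
```

Because the minimum and maximum are taken per column, prices in the thousands and volumes in the tens of millions both end up in the same `[0, 1]` range.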

In [32]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
scaled_data.shape

(1212, 5)

### **Create The Sequences**
---
The `sequences` function creates sequences of historical stock prices, which are used as input (`X`) to predict the target value (`y`), representing the stock price at the next time step.

- The `time_step` parameter defines how many past data points are used to predict the next value.
- The generated sequences are then ready for training an LSTM model for time series forecasting.
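
The project's `sequences` helper is imported from `functions.sequences` and is not shown here. The following is a minimal sketch of the usual sliding-window approach, with the hypothetical name `make_sequences`. Note that this version yields `len(values) - time_step` samples, whereas the project's helper produces 1191 samples from 1212 rows with `time_step = 20` (one fewer), so its boundary handling evidently differs slightly.

```python
import numpy as np

def make_sequences(values: np.ndarray, time_step: int):
    """Build (X, y) pairs: time_step consecutive rows predict the next row."""
    X, y = [], []
    for start in range(len(values) - time_step):
        X.append(values[start:start + time_step])  # look-back window
        y.append(values[start + time_step])        # row right after the window
    return np.array(X), np.array(y)

# Toy data: 10 time steps with 5 features each
toy_values = np.arange(50, dtype=float).reshape(10, 5)
X_toy, y_toy = make_sequences(toy_values, time_step=3)
print(X_toy.shape, y_toy.shape)
```

Each sample in `X_toy` is a `(time_step, features)` window, and the matching row in `y_toy` holds all five feature values for the following time step.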


In [33]:
from functions.sequences import sequences

time_step = window
X, y = sequences(scaled_data, time_step=time_step)
X.shape, y.shape

((1191, 20, 5), (1191, 5))

### **Splitting The Dataset**
---
This program splits a time series dataset into training, validation, and testing subsets, maintaining the temporal order of the data. 

1. **Define Split Ratios**:
   - *Train size*: 70% of the dataset.
   - *Validation size*: 15% of the dataset.
   - *Test size*: 15% of the dataset.
---
2. **Set Time Ranges**:
   - `train_date`: Dates corresponding to the training subset.
   - `val_date`: Dates corresponding to the validation subset.
   - `test_date`: Dates corresponding to the testing subset.
---
3. **Split Data**:
   - `X_train`, `X_val`, `X_test`: Features for training, validation, and testing.
   - `y_train`, `y_val`, `y_test`: Targets for training, validation, and testing.

By using this approach, the temporal integrity of the dataset is preserved, ensuring meaningful evaluation of the model on unseen data.
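
As a quick check of the split sizes, here is a minimal sketch of a chronological 70/15/15 split. The function name `chronological_split` is illustrative; the logic mirrors the slicing used in this notebook, where the test set takes whatever remains after the integer-truncated train and validation sizes.

```python
import numpy as np

def chronological_split(X, y, train_ratio=0.70, val_ratio=0.15):
    """Split arrays in time order: earliest rows train, latest rows test."""
    n = len(X)
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)
    return (
        (X[:train_end], y[:train_end]),
        (X[train_end:val_end], y[train_end:val_end]),
        (X[val_end:], y[val_end:]),  # remainder, so it can be slightly larger
    )

# Stand-in arrays with the same number of samples as this project (1191)
X_demo = np.arange(1191)
y_demo = np.arange(1191)
(train_X, _), (val_X, _), (test_X, _) = chronological_split(X_demo, y_demo)
print(len(train_X), len(val_X), len(test_X))
```

No shuffling is involved, so every validation sample comes after every training sample, and every test sample comes after both.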

In [34]:
train_size = int(len(X) * 0.7)
validation_size = int(len(X) * 0.15)

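# Dates are offset by time_step + 1 (as in the original test slice) so that
# each date range has the same length as its X/y subset.
offset = time_step + 1
train_date = dataset_date[offset:offset+train_size]
val_date = dataset_date[offset+train_size:offset+train_size+validation_size]
test_date = dataset_date[offset+train_size+validation_size:]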

X_train, X_val, X_test = X[:train_size], X[train_size:train_size+validation_size], X[train_size+validation_size:]
y_train, y_val, y_test = y[:train_size], y[train_size:train_size+validation_size], y[train_size+validation_size:]

### **Reshaping data for LSTM input**
---
Why Are Features 3D and Targets 2D?
1. **Features (`X`) - 3D**:
   - Sequential models like RNNs, LSTMs, and GRUs require inputs with three dimensions: 
     - `(samples, timesteps, features)`.
   - This structure captures temporal dependencies by:
     - `samples`: Number of data points (e.g., batches of sequences).
     - `timesteps`: Number of steps in each sequence (e.g., look-back window size).
     - `features`: Number of features at each timestep.
   - The 3D structure enables the model to learn patterns across time steps for multivariate or time-dependent data.
---
2. **Targets (`y`) - 2D**:
   - Targets typically represent the output values for each sample:
     - `(samples, output_dimensions)`.
   - For most regression or classification tasks, each sample has a single target or a fixed set of targets (e.g., future value prediction or classification labels).
   - Keeping `y` 2D simplifies model outputs and loss computation.

This distinction ensures the model receives the correct input shapes for learning from sequential data while predicting the desired outputs.


In [35]:
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X.shape[2])
X_val = X_val.reshape(X_val.shape[0], X_val.shape[1], X.shape[2])
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X.shape[2])

print("Shape of:")
print(f"X_train : {X_train.shape}, y_train : {y_train.shape}")
print(f"X_val : {X_val.shape}, y_val : {y_val.shape}")
print(f"X_test : {X_test.shape}, y_test : {y_test.shape}")

Shape of:
X_train : (833, 20, 5), y_train : (833, 5)
X_val : (178, 20, 5), y_val : (178, 5)
X_test : (180, 20, 5), y_test : (180, 5)


## **Training the Model**
---

An LSTM (Long Short-Term Memory) model is a type of recurrent neural network (RNN) designed to process and predict sequential data by capturing long-term dependencies. 

Key Features:
- **Memory Cells**: Store information over long sequences.
- **Gates (Input, Forget, Output)**: Control what information to add, retain, or discard.

Purpose:
LSTMs are ideal for time-series forecasting, natural language processing, and any task involving sequential patterns.

---

**Specification of this LSTM model project:**

1. **LSTM Layers**:
   - Capture temporal dependencies in sequential data.
   - First LSTM (`128 units, return_sequences=True`): Outputs the full sequence for further processing.
   - Second LSTM (`64 units, return_sequences=False`): Outputs the final hidden state for dense layers.

2. **Dense Layer**:
   - `Dense(32, activation="relu")`: Learns complex patterns from LSTM outputs.

3. **Dropout Layer**:
   - `Dropout(0.2)`: Reduces overfitting by randomly deactivating 20% of neurons during training.

4. **Output Layer**:
   - `Dense(5)`: Produces the final prediction with 5 outputs, one per feature (Open, High, Low, Close, Volume), making this a multi-output regression model.
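
To get a feel for the size of this architecture, the trainable parameter count can be worked out by hand. The sketch below assumes Keras defaults (bias terms enabled, standard LSTM cell) and an input of 5 features per time step; the helper names are illustrative. The result should agree with `model.summary()` under those assumptions.

```python
def lstm_params(units: int, input_dim: int) -> int:
    """Four weight blocks (input, forget, output gates and cell candidate),
    each with an input kernel, a recurrent kernel, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units: int, input_dim: int) -> int:
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

layer_params = {
    "LSTM(128)": lstm_params(128, input_dim=5),   # 5 input features
    "LSTM(64)": lstm_params(64, input_dim=128),   # fed by the first LSTM
    "Dense(32)": dense_params(32, input_dim=64),
    "Dense(5)": dense_params(5, input_dim=32),    # Dropout adds no parameters
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:10s} {count:>7,d}")
print(f"{'Total':10s} {total_params:>7,d}")
```

Most of the roughly 120 thousand parameters sit in the two LSTM layers, which is typical for stacked recurrent models.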


In [36]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential([
    LSTM(128, input_shape=(X_train.shape[1], X_train.shape[2]),
         return_sequences=True),
    LSTM(64, return_sequences=False),
    Dense(32, activation="relu"),
    Dropout(0.2),
    Dense(5)
])

model.save('./models/lstm_model.keras') # Save the (still untrained) model in the native .keras format
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["mae"])
model.summary()


Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.



In [37]:
epochs = 100
batch_size = 32
model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_val, y_val))
model.save(f'../RestAPI/app/assets/ML/lstm_model.keras')
print("Successfully saved the LSTM model")

Epoch 1/100
27/27 ━━━━━━━━━━━━━━━━━━━━ 6s 44ms/step - loss: 0.1366 - mae: 0.2787 - val_loss: 0.0121 - val_mae: 0.0979
Epoch 2/100
27/27 ━━━━━━━━━━━━━━━━━━━━ 1s 35ms/step - loss: 0.0336 - mae: 0.1321 - val_loss: 0.0045 - val_mae: 0.0488
Epoch 3/100
27/27 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - loss: 0.0240 - mae: 0.1146 - val_loss: 0.0032 - val_mae: 0.0453
Epoch 4/100
27/27 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - loss: 0.0237 - mae: 0.1115 - val_loss: 0.0161 - val_mae: 0.1161
Epoch 5/100
27/27 ━━━━━━━━━━━━━━━━━━━━ 1s 27ms/step - loss: 0.0238 - mae: 0.1123 - val_loss: 0.0069 - val_mae: 0.0722
Epoch 6/100
27/27 ━━━━━━━━━━━━━━━━━━━━ 1s 27ms/step - loss: 0.0177 - mae: 0.0963 - val_loss: 0.0059 - val_mae: 0.0666
Epoch 7/100
27/27 ━━━━━━━━━━━━━━━━━━━━ 1s 26ms/step - loss: 0.0181

## **Evaluate The Model**
---


In [38]:
y_pred = model.predict(X_test)

# Inverse transform the predictions and actual values
y_test_original = scaler.inverse_transform(y_test)
y_pred_original = scaler.inverse_transform(y_pred)

6/6 ━━━━━━━━━━━━━━━━━━━━ 1s 109ms/step


### **Evaluating the LSTM Model with Metrics**
---
1. **MAE (Mean Absolute Error)**:
   - Measures average absolute differences between predicted $\hat{y}_i$ and actual $y_i$ values.
   - **Equation**:  
     $$
      \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
     $$
   - **Threshold**:
     - Lower MAE indicates better predictions.
     - A good MAE depends on the scale of the data (e.g., <10% of the average target value is often acceptable).
---
2. **RMSE (Root Mean Squared Error)**:
   - Measures the square root of the average squared differences, emphasizing larger errors.
   - **Equation**:  
     $$
      \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
     $$
   - **Threshold**:
     - Lower RMSE is better.
     - Like MAE, it should ideally be a small fraction of the average target value.
---
3. **R² Score (Coefficient of Determination)**:
   - Measures the proportion of variance explained by the model.
   - **Equation**:  
     $$
      R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
     $$
     where $\bar{y}$ is the mean of actual values.
   - **Threshold**:
     - $R^2 = 1$: Perfect predictions.
     - $R^2 > 0.8$: Generally good for many applications.
     - $R^2 < 0.5$: Indicates poor fit.
---
4. **MAPE (Mean Absolute Percentage Error)**:
   - Measures error as a percentage of actual values.
   - **Equation**:  
     $$
      \text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
     $$
   - **Threshold**:
     - $\text{MAPE} < 10\%$: Excellent.
     - $10\%-20\%$: Good.
     - $>20\%$: Needs improvement.
---
#### **Summary**
- **Lower MAE, RMSE, and MAPE** indicate better performance.
- **Higher $R^2$** values closer to 1 reflect better explanatory power.
- The acceptability of these metrics depends on the specific problem domain and data scale.
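
The four formulas above translate directly into NumPy. This is a minimal sketch for a single target column; the function name `regression_metrics` is illustrative, and the project's own `plot_evaluate_table` implementation is not shown. In this project it would be applied to each of the five columns separately, on the inverse-transformed values.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """MAE, RMSE, R² and MAPE for one target column."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    ss_res = np.sum(err ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "R2": 1 - ss_res / ss_tot,
        # MAPE is undefined when an actual value is zero
        "MAPE": 100 * np.mean(np.abs(err / y_true)),
    }

# Made-up values for illustration (not model output)
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(metrics)
```

Here every prediction is off by exactly 10, so MAE and RMSE coincide, while MAPE weights the same absolute error more heavily where the actual value is small.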


In [39]:
from functions.plot_graph import PlotGraph
actual_data = pd.DataFrame(y_test_original, index=test_date, columns=columns)
predicted_data = pd.DataFrame(y_pred_original, index=test_date, columns=columns)
actual_graph = PlotGraph(actual_data, ticker, window, std)
actual_graph.plot_evaluate_table(predicted_data)

### **LSTM Model with Metrics Evaluation Results**
---
The table summarizes the evaluation metrics for predicting stock-related features: Open, High, Low, Close, and Volume.

1. **MAE (Mean Absolute Error)**:
   - Measures the average absolute prediction error.
   - Results:
     - *Open, High, Low, Close*: Errors range from ~55 to ~66, which is small relative to the price level (about 2%, in line with the MAPE reported for these features).
     - *Volume*: Exceptionally high error (~46M), indicating poor prediction performance for this feature.
---
2. **RMSE (Root Mean Squared Error)**:
   - Highlights larger errors by squaring differences before averaging.
   - Results:
     - Similar trends as MAE; errors are higher for Volume (~663M), showing greater deviation in predictions.
---
3. **R² Score**:
   - Measures the proportion of variance explained by the model.
   - Results:
     - *Open, High, Low*: Scores between 0.75 and 0.81, indicating a reasonably good fit, close to the 0.8 guideline given earlier.
     - *Close*: Moderate fit with 0.73.
     - *Volume*: Poor fit with 0.226, suggesting the model struggles to capture patterns for this feature.
---
4. **MAPE (Mean Absolute Percentage Error)**:
   - Measures error as a percentage of actual values.
   - Results:
     - *Open, High, Low, Close*: Low percentages (~2%), indicating good scale-independent performance.
     - *Volume*: Very high percentage (~39%), reinforcing poor predictive performance.


### **Visualization of LSTM Model Evaluation for Stock Price and Volume Prediction**

In [40]:
actual_graph.plot_evaluate_model(predicted_data)

#### **Explanation**
---
1. **Open, High, Low, and Close Price Charts**:
   - The *predicted prices* (green lines) closely follow the *actual prices* (blue lines), showing good model performance for these features.
   - Minor deviations are present, particularly during sharp price fluctuations.
---
2. **Candlestick Chart**:
   - Visualizes the actual vs. predicted stock prices as candlesticks.
   - The overlap between actual and predicted candlesticks indicates accurate forecasting of price trends.
---
3. **Volume Chart**:
   - The predicted volume (green bars) shows significant deviation from actual volume (blue bars).
   - Highlights the model's poor performance in predicting stock volumes.
---
4. **RSI (Relative Strength Index) Chart**:
   - Compares RSI calculated from actual vs. predicted close prices.
   - Both RSI lines generally align, reflecting the model's ability to predict close prices well.
   - RSI thresholds (70 and 30) show the overbought and oversold levels for evaluation.

## **References**
---
- Building LSTM Model: 
    - M. A. Istiake Sunny, M. M. S. Maswood, and A. G. Alharbi, "Deep Learning-Based Stock Price Prediction Using LSTM and Bi-Directional LSTM Model," *2020 2nd Novel Intelligent and Leading Emerging Sciences Conference (NILES)*, Giza, Egypt, 2020, pp. 87-92. doi: [10.1109/NILES50944.2020.9257950](https://doi.org/10.1109/NILES50944.2020.9257950)

    - Nicholas Renotte, "Stock Price Prediction using an LSTM Model in Python," YouTube, May 3, 2022. [Online]. Available: [https://www.youtube.com/watch?v=CbTU92pbDKw](https://www.youtube.com/watch?v=CbTU92pbDKw). [Accessed: Jan. 9, 2025].

- Evaluation Metrics: 
    - P. Datta and S. A. Faroughi, "A multihead LSTM technique for prognostic prediction of soil moisture," *Geoderma*, vol. 433, 2023, p. 116452. doi: [10.1016/j.geoderma.2023.116452](https://doi.org/10.1016/j.geoderma.2023.116452). Available: [https://www.sciencedirect.com/science/article/pii/S0016706123001295](https://www.sciencedirect.com/science/article/pii/S0016706123001295).

- Source Dataset: 
    - Yahoo Finance, "TLKM.JK – Telkom Indonesia (Persero) Tbk Stock Price & News," Yahoo Finance. [Online]. Available: [https://finance.yahoo.com/quote/TLKM.JK/](https://finance.yahoo.com/quote/TLKM.JK/). [Accessed: Jan. 9, 2025].