# How to Use Recurrent Neural Networks

**Authors:** Carlos Alfredo Hernández Alvarez, Anabel Abreu Llanes

**Github:** [carloshdez522](https://github.com/carloshdez522/), [bels-03](https://github.com/bels-03/)

**ORCID:** [0009-0006-6749-1686](https://orcid.org/0009-0006-6749-1686), [0009-0003-7264-3785](https://orcid.org/0009-0003-7264-3785)

<br>

Recurrent Neural Networks (RNNs) are a type of neural network specifically designed to process sequences of data. Unlike traditional neural networks that assume that all inputs and outputs are independent of each other, RNNs have an internal memory that allows them to maintain information about previous inputs in the sequence. This feature makes them particularly suitable for tasks where context and data order are crucial, such as in time series prediction, natural language processing, and speech recognition.

## Key Characteristics of RNNs:
- **Temporal Memory:** RNNs can maintain information over time using loops in their internal structure.
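- **Parameter Sharing:** The same weights are applied at each time step, so the number of parameters does not grow with sequence length and what the network learns at one position carries over to others.
- **Ability to Model Sequences:** They can process sequences of variable length. Plain RNNs struggle to retain information across many steps; gated variants such as LSTM were designed to capture these long-term dependencies.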
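
The characteristics listed above can be illustrated with a minimal NumPy sketch of a plain RNN forward pass. It is not the model used in this notebook; the function name `rnn_forward`, the weight names, and the layer sizes are assumptions chosen for illustration only.

```python
import numpy as np

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a plain RNN over a sequence and return every hidden state."""
    hidden = np.zeros(w_hh.shape[0])  # the internal memory starts at zero
    states = []
    for x_t in inputs:  # one step per element of the sequence
        # The same w_xh, w_hh and b_h are reused at every step (parameter sharing)
        hidden = np.tanh(w_xh @ x_t + w_hh @ hidden + b_h)
        states.append(hidden)
    return np.array(states)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 1  # illustrative sizes
w_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
w_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

short_seq = rng.normal(size=(3, input_size))
long_seq = rng.normal(size=(10, input_size))

# The same weights process sequences of different lengths
short_states = rnn_forward(short_seq, w_xh, w_hh, b_h)
long_states = rnn_forward(long_seq, w_xh, w_hh, b_h)
print(short_states.shape, long_states.shape)
```

Each hidden state depends on the current input and on the previous hidden state, which is how information about earlier inputs is carried forward through the sequence.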

## Objective of the notebook

The main objective is to show how RNNs can be used, specifically through the LSTM (Long Short-Term Memory) architecture, to predict company stock prices from historical data. This deep learning technique is especially useful in the field of bioinformatics, where data sequences, such as DNA sequences or biomedical data time series, are common and require advanced methods for analysis and prediction.

## Recommendations
- **Use of Adequate Computational Resources:**
The Recurrent Neural Networks (RNN) model with LSTM architecture presented in this notebook was trained using Google's Tensor Processing Units (TPU) with 300 GB of RAM, freely available in the Google Colab environment. This high level of computational resources allows handling large volumes of data and training complex models efficiently.

- **Memory Requirements:**
It is not recommended to run this notebook in an environment with less than 50 GB of RAM due to the high computational requirements of the model and the volume of data processed. An environment with limited resources can result in long run times and possible crashes due to lack of memory.

- **Use of the Trained Model:**
Readers who want to use the trained model without retraining it can download the model files and their weights from the Google Colab environment. The notebook shows how to save and reload the trained model with the `save_models` and `load_models` functions, so predictions can be made without extensive computational resources.

## Loading the data

The `yfinance` API was used to download the last 20 years of historical stock price data. This example uses only Microsoft (`MSFT`) because of the high computational requirements of these models, but the code is structured to handle more than one company at a time.


In [None]:
!pip install -q yfinance
import yfinance as yf
import pandas as pd

In [None]:
from datetime import datetime, timedelta

# Define the start and end dates for downloading historical data
end = datetime.now() - timedelta(days=1)
end = end.strftime('%Y-%m-%d')

start = datetime.now() - timedelta(days=365.25 * 20 + 1)
start = start.strftime('%Y-%m-%d')

print(f'Start date: {start} - End date: {end}')

Start date: 2004-07-21 - End date: 2024-07-21


In [None]:
# Companies from which data will be obtained (in this case, only Microsoft)
companies = ['MSFT']#, 'AAPL', 'AMZN', 'GOGL', 'MT', 'TSLA', 'WMT', 'V', 'JNJ', 'NVDA']
df = pd.DataFrame()

# Download historical company data
for company in companies:
  df[company] = yf.download(company, start=start, end=end)[['High']]

[*********************100%%**********************]  1 of 1 completed


In [None]:
companies_data = df.reset_index()
companies_data

| | Date | MSFT |
|---|---|---|
| 0 | 2004-07-21 | 29.889999 |
| 1 | 2004-07-22 | 29.299999 |
| 2 | 2004-07-23 | 28.400000 |
| 3 | 2004-07-26 | 28.709999 |
| 4 | 2004-07-27 | 28.760000 |
| ... | ... | ... |
| 5029 | 2024-07-15 | 457.260010 |
| 5030 | 2024-07-16 | 454.299988 |
| 5031 | 2024-07-17 | 444.850006 |
| 5032 | 2024-07-18 | 444.649994 |


## Visualization of historical company data

In [None]:
import plotly.express as px

fig = px.line(companies_data, x=companies_data['Date'], y=companies_data.columns[1:], title='Companies stock price')
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Value',
    legend_title='Company'
)

fig.show()

## Normalize data

Data are normalized to the [0, 1] range with `MinMaxScaler` to improve model performance.

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data_scaled = pd.DataFrame()

data_scaled[companies] = scaler.fit_transform(companies_data[companies])
data_scaled['Date'] = companies_data['Date']
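
As a rough sketch of what the scaling step does, min-max normalization maps each value to `(value - min) / (max - min)` and can be undone exactly, which is what later allows predictions to be converted back to prices. The helper names below are hypothetical and the prices are illustrative values, not the notebook's data.

```python
import numpy as np

def min_max_fit(values):
    """Return the minimum and maximum used for scaling."""
    values = np.asarray(values, dtype=float)
    return values.min(), values.max()

def min_max_transform(values, lo, hi):
    """Map values linearly so that lo becomes 0 and hi becomes 1."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)

def min_max_inverse(scaled, lo, hi):
    """Undo min_max_transform and recover the original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

prices = np.array([28.4, 29.3, 29.9, 444.6, 457.3])  # illustrative values only
lo, hi = min_max_fit(prices)
scaled = min_max_transform(prices, lo, hi)
restored = min_max_inverse(scaled, lo, hi)
print(scaled)
print(restored)
```

One design choice worth considering, although this notebook does not do it, is to compute the minimum and maximum from the training period only, so that the scaling does not use information from the test period.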

## Split into training and test

In [None]:
# The first 18 years are used for training; the remaining period is the test set
train_span = pd.DateOffset(years=18)
start = pd.to_datetime(start)
cutoff = start + train_span

train = companies_data[companies_data['Date'] <= cutoff]
test = companies_data[companies_data['Date'] > cutoff]
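
The split above is chronological: everything up to a cutoff date is used for training and everything after it for testing, so the model is never evaluated on data that precedes its training data. A small standard-library sketch of the same idea, using hypothetical daily records and a two-year cutoff instead of the 18 years used here:

```python
from datetime import date, timedelta

# Hypothetical daily records: (date, value) pairs covering three calendar years
first_day = date(2020, 1, 1)
records = [(first_day + timedelta(days=i), float(i)) for i in range(1096)]

# Cutoff two years after the first date
cutoff_date = first_day.replace(year=first_day.year + 2)

# Chronological split: no shuffling, the test set lies strictly after the cutoff
train_rows = [row for row in records if row[0] <= cutoff_date]
test_rows = [row for row in records if row[0] > cutoff_date]

print(len(train_rows), len(test_rows))
print(train_rows[-1][0], test_rows[0][0])
```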

In [None]:
import plotly.graph_objects as go

fig = go.Figure()
for company in companies:
  fig.add_trace(go.Scatter(x=train['Date'], y=train[company], mode='lines', name=f'{company} training set'))
  fig.add_trace(go.Scatter(x=test['Date'], y=test[company], mode='lines', name=f'{company} test set'))
fig.update_layout(title='Companies stock price', xaxis_title='Date', yaxis_title='Stock Price', legend_title=f'Before and after ({cutoff.year}-{cutoff.month})')

fig.show()

The `prepare_data` function arranges the data into sequences suitable as input to the RNN. It takes the data to prepare and the number of time steps (`time_step`) in each sequence. It returns the training features and labels (`x_train`, `y_train`), the test features and labels (`x_test`, `y_test`), and the last data sequence (`x_last`), which will be used for future predictions.

In [None]:
import numpy as np

# Prepare data in sequences suitable for entry into the RNN
def prepare_data(data, time_step=365):
  x, y = [], []
  data = data.to_numpy()

  for i in range(time_step, data.shape[0]):
    x.append(data[i-time_step:i])
    y.append(data[i])

  x, y = np.array(x), np.array(y)

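  # Number of training windows: rows up to the cutoff date, minus the first
  # `time_step` rows, which have no complete window before them
  n = companies_data[companies_data['Date'] <= cutoff].shape[0] - time_step
  x_train = x[:n]
  y_train = y[:n]
  x_test = x[n:]
  y_test = y[n:]
  # Most recent `time_step` values: the window that precedes the first unseen day
  x_last = data[-time_step:]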

  return x_train, y_train, x_test, y_test, x_last
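
To make the windowing concrete, here is a small sketch of the same sliding-window idea on a toy series with a window of 3 instead of 365. The function name `make_windows` is hypothetical and the sketch omits the train/test split.

```python
import numpy as np

def make_windows(series, time_step):
    """Pair each run of `time_step` values with the value that follows it."""
    series = np.asarray(series, dtype=float)
    x, y = [], []
    for i in range(time_step, len(series)):
        x.append(series[i - time_step:i])  # inputs: the previous `time_step` values
        y.append(series[i])                # target: the value right after the window
    return np.array(x), np.array(y)

toy_series = [10, 20, 30, 40, 50, 60]
x_toy, y_toy = make_windows(toy_series, time_step=3)
print(x_toy)
print(y_toy)
```

A series of 6 values with a window of 3 yields 3 input windows, each paired with the single value that follows it.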

The `for_company` function organizes the training and test data for each company in `companies`. It takes no input and returns a dictionary containing the training and test data sets for each company.

In [None]:
# Prepare data from all companies
def for_company():
  data_companies = dict()

  for company in companies:
    x_train, y_train, x_test, y_test, _ = prepare_data(data_scaled[company])

    x_y = {'x_train': x_train, 'x_test': x_test, 'y_train': y_train, 'y_test': y_test}
    data_companies.update({f'{company}': x_y})

  return data_companies

all_companies = for_company()

The `train_test` function extracts the training and test data sets for a specific company from the dictionary generated by `for_company`. It returns `x_train`, `y_train`, `x_test` and `y_test` for the specified company.

In [None]:
def train_test(company):
  x_train = all_companies[company]['x_train']
  y_train = all_companies[company]['y_train']
  x_test = all_companies[company]['x_test']
  y_test = all_companies[company]['y_test']

  return x_train, y_train, x_test, y_test

## Create the model

The `create_model` function defines and compiles the LSTM model for time series prediction. It takes as input the length of the sequences (`input_shape`) and builds an LSTM architecture of four stacked recurrent layers, each followed by a dropout layer to reduce overfitting. It returns the compiled model ready to be trained.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Creation of the LSTM model for the prediction of the time series
def create_model(input_shape=365):
  model = Sequential()

  model.add(LSTM(units=50, return_sequences=True, input_shape=(input_shape, 1)))
  model.add(Dropout(0.2))
  model.add(LSTM(units=50, return_sequences=True))
  model.add(Dropout(0.2))
  model.add(LSTM(units=50, return_sequences=True))
  model.add(Dropout(0.2))
  model.add(LSTM(units=50))
  model.add(Dropout(0.2))
  model.add(Dense(units=1))

  model.compile(optimizer='rmsprop', loss='mean_squared_error')
  return model
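
As a rough check on the size of this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM formulation with one bias vector per gate; the helper names are hypothetical. Dropout layers add no parameters.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of a standard LSTM layer with biases."""
    # Four weight sets (three gates plus the candidate state), each with
    # input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Trainable parameters of a fully connected layer with biases."""
    return input_dim * units + units

layer_counts = [
    lstm_params(input_dim=1, units=50),   # first LSTM layer, one feature per step
    lstm_params(input_dim=50, units=50),  # second LSTM layer
    lstm_params(input_dim=50, units=50),  # third LSTM layer
    lstm_params(input_dim=50, units=50),  # fourth LSTM layer
    dense_params(input_dim=50, units=1),  # output layer
]
total_params = sum(layer_counts)
print(layer_counts, total_params)
```

The sequence length of 365 does not appear in the calculation, because the same weights are reused at every time step.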

## Train the model

In [None]:
y_predicted = pd.DataFrame()
models = {}
history = {}

In [None]:
for company in companies:
  x_train, y_train, x_test, y_test = train_test(company)

  print(f'\n{company} model:')

  model = create_model(x_train.shape[1])
  hist = model.fit(x_train, y_train, epochs=100, batch_size=10000, validation_data=(x_test, y_test))

  history.update({f'{company}': hist})
  models[f'{company}_model'] = model


MSFT model:
Epoch 1/100
Epoch 2/100
...

The `save_models` function saves each trained model's architecture to a JSON file and its weights to an H5 file, so the models can be stored and reused later.

In [None]:
# Saving trained models
def save_models():
  for company in companies:
    model = models[f'{company}_model']
    model_json = model.to_json()

    with open(f'{company}_model.json', 'w') as json_file:
      json_file.write(model_json)

    model.save_weights(f'{company}_weight.h5')

save_models()

To load the saved models from JSON and H5 files, `load_models` is used. It returns a list of loaded models ready for prediction.

In [None]:
from tensorflow.keras.models import model_from_json

# Load the trained models
def load_models():
  models_loaded = list()

  for company in companies:
    json_path = f'{company}_model.json'
    weights_path = f'{company}_weight.h5'

    # Rebuild the architecture from its JSON description
    with open(json_path, 'r') as json_file:
      model_loaded = model_from_json(json_file.read())

    # Restore the trained weights into the rebuilt model
    model_loaded.load_weights(weights_path)
    models_loaded.append(model_loaded)

  return models_loaded

models_loaded = load_models()

## Comparison of actual vs. predicted values

In [None]:
fig = go.Figure()
for company in companies:
  x_train, y_train, x_test, y_test = train_test(company)
  predicted = models[f'{company}_model'].predict(x_test)

  y_real = scaler.inverse_transform(y_test.reshape(-1, 1)).reshape(-1)
  y_predicted = scaler.inverse_transform(predicted).reshape(-1)

  fig.add_trace(go.Scatter(x=test['Date'], y=y_real, mode='lines', name=f'{company} real value'))
  fig.add_trace(go.Scatter(x=test['Date'], y=y_predicted, mode='lines', name=f'{company} predicted value'))
fig.update_layout(title='Companies stock price', xaxis_title='Date', yaxis_title='Stock Price', legend_title=f'Before and after ({cutoff.year}-{cutoff.month})')

fig.show()
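
The comparison above is visual. A numerical summary can complement it; the following is a minimal sketch of two common error measures, mean absolute error (MAE) and root mean squared error (RMSE). The function name and the sample values are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def regression_errors(y_true, y_pred):
    """Return MAE and RMSE between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    errors = y_pred - y_true
    return {
        'mae': float(np.mean(np.abs(errors))),
        'rmse': float(np.sqrt(np.mean(errors ** 2))),
    }

# Illustrative values only
actual = [100.0, 102.0, 104.0, 103.0]
predicted = [101.0, 101.0, 106.0, 103.0]
metrics = regression_errors(actual, predicted)
print(metrics)
```

In the notebook, the same function could be applied to `y_real` and `y_predicted` after the inverse transform, so that the errors are expressed in price units.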



## Make future predictions

The `predict_future` function makes future predictions using the trained model. It takes as inputs the model, the last data sequence (`x_last`), the number of days to predict (`n_days`), and the number of time steps (`steps`). It returns a series of predictions for the specified future days.

In [None]:
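# Make future predictions using the trained model
def predict_future(model, x_last, n_days, steps=365):
    # Start from the most recent `steps` observations, flattened to 1-D
    window = np.asarray(x_last, dtype=float).reshape(-1)[-steps:]
    predictions = []

    for _ in range(n_days):
        # Predict the next value from the current window, shaped (batch, steps, features)
        prediction = model.predict(window.reshape(1, -1, 1), verbose=0)
        next_value = float(np.ravel(prediction)[0])
        predictions.append(next_value)
        # Slide the window: drop the oldest value and append the new prediction
        window = np.append(window[1:], next_value)

    return np.array(predictions)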
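
The idea behind this kind of forecast is recursive: each prediction is appended to the input window and used to predict the following day. The sketch below shows that loop in isolation, with a stand-in for the trained model that simply returns the mean of the window. The names `recursive_forecast` and `mean_of_window` are hypothetical.

```python
import numpy as np

def recursive_forecast(predict_next, history, n_days, steps):
    """Forecast n_days values, feeding each prediction back into the window."""
    window = np.asarray(history, dtype=float).reshape(-1)[-steps:]
    forecasts = []
    for _ in range(n_days):
        next_value = float(predict_next(window))
        forecasts.append(next_value)
        # Drop the oldest value and append the new prediction
        window = np.append(window[1:], next_value)
    return np.array(forecasts)

def mean_of_window(window):
    """Stand-in for a trained model: predicts the mean of the window."""
    return window.mean()

history = [1.0, 2.0, 3.0, 4.0]
forecast = recursive_forecast(mean_of_window, history, n_days=3, steps=2)
print(forecast)
```

Because every step after the first relies on earlier predictions rather than on observed values, errors can accumulate as the forecast horizon grows.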

The `make_prediction` function makes future predictions for all companies using the trained models and organizes the results in a DataFrame. It returns a DataFrame with the dates and the predicted stock prices.

In [None]:
def make_prediction():
  time_step = 365
  n_days = 365 * 2
  future_values = pd.DataFrame(columns=['Date'] + companies)
  future_values['Date'] = pd.date_range(start=data_scaled['Date'][data_scaled.shape[0] - 1], periods=n_days + 1)[1:]

  for company in companies:
    # Seed the forecast with the most recent `time_step` scaled values of this company
    input_data = data_scaled[company].to_numpy()[-time_step:].reshape(1, -1)
    future_values[company] = predict_future(models[f'{company}_model'], input_data, int(n_days))

  future_values[companies] = scaler.inverse_transform(future_values[companies])
  return future_values.copy()

In [None]:
import io
from contextlib import redirect_stdout

with io.StringIO() as f, redirect_stdout(f):
  all_data = pd.concat([companies_data, make_prediction()], axis=0).reset_index().drop(columns=['index'])

Representation of actual values and future predictions

In [None]:
fig = px.line(all_data, x=all_data['Date'], y=all_data.columns[1:], title='Companies stock price')
fig.update_layout(
    xaxis_title='Date',
    yaxis_title='Value',
    legend_title='Company'
)

fig.show()

## Conclusions
1. **Application of RNNs in Biological Sequence Prediction:** Recurrent Neural Networks (RNNs) and, in particular, LSTM architectures, have proven to be powerful tools for sequence analysis and prediction. In the field of bioinformatics, these techniques can be used to predict protein secondary structures from amino acid sequences, identify functional regions in DNA sequences, and predict gene expression based on time series data. The ability of LSTM-based RNNs to handle sequences and capture long-term dependencies is especially valuable, as the structure and function of biomolecules are intrinsically linked to their sequence and temporal context.
2. **Improved Prediction Accuracy:** The LSTM model in this notebook shows how deep learning techniques can be applied to stock price time series prediction. The same approach can be transferred to bioinformatics to help improve the prediction of complex biological phenomena, where accuracy is crucial for the advancement of scientific knowledge and biomedical research.
3. **New Possibilities for Scientific Discovery:** The use of RNNs in bioinformatics not only allows for improved predictions in specific contexts, but also opens up new possibilities for scientific discovery. By modeling and predicting complex behaviors from sequential data, these techniques enable a better understanding of underlying biological processes and facilitate the development of new hypotheses and experiments. This can accelerate the pace of discovery and the development of new therapies and technologies in biomedicine.