# Data processing for XLSTM-CEEMDAN model

## Python requirements

```
!pip install plotly
!pip install cufflinks
!pip install chart_studio
!pip install ipywidgets
!pip install yfinance
!pip install EMD-signal
!pip install scikit-learn
!pip install keras
!pip install tensorflow
```

## Model

The **XLSTM-CEEMDAN** model works with daily equity data: open, high, low, close, and volume. From here on we call these "features".

### Data processing steps:

1. Data gathering

We use the yfinance package to get the daily history for each ticker. The result is a dataframe per ticker, with one time series for each feature.

Here we consider the 10 most liquid tickers on the Brazilian B3 exchange for the period from **2019-01-01 to 2020-05-01**:

**PETR4, VALE3, BOVA11, ITUB4, BBDC4, B3SA3, BBAS3, ABEV3, MGLU3, VVAR3**

2. Decomposition

We decompose the data with the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) algorithm, using the [PyEMD public package from Dawid Laszuk](https://github.com/laszukdawid/PyEMD).

As a result, we obtain a set of intrinsic mode function (IMF) time series for each feature of each ticker. The number of IMFs produced by a decomposition may vary. It depends on CEEMDAN hyperparameters such as the scale of the added noise, on the characteristics of the specific series, and mainly on the **series length**.
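
As a reminder of what the decomposition yields: in the usual EMD-family formulation the result is additive, so the original series is recovered by summing its IMFs and the final residue. The symbols $K$ (number of IMFs) and $r(t)$ (residue) are notation introduced here for illustration and are not taken from the notebook.

$$x(t) = \sum_{k=1}^{K} \mathrm{IMF}_k(t) + r(t)$$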

3. Data transformation

For each IMF series obtained, we transform the data with the following equation so that all values lie between 0 and 1.

$$x' = \frac{x-x_{min}}{x_{max}-x_{min}}$$

Here $x$ is any element of the series to be transformed, $x'$ is the corresponding transformed element, and $x_{min}$ and $x_{max}$ are the minimum and maximum values of the series.
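
The scaling above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's implementation (the notebook relies on scikit-learn's `MinMaxScaler`). The function names `min_max_scale` and `min_max_invert` and the sample values are hypothetical. The inverse is included on the assumption that predictions will eventually need to be mapped back to the original scale.

```python
import numpy as np


def min_max_scale(series):
    """Scale a 1-D series to [0, 1] and return the scaled values and (min, max)."""
    series = np.asarray(series, dtype=float)
    x_min, x_max = series.min(), series.max()
    if x_max == x_min:
        # The equation divides by (x_max - x_min), so a constant series is undefined.
        raise ValueError("constant series cannot be min-max scaled")
    return (series - x_min) / (x_max - x_min), (x_min, x_max)


def min_max_invert(scaled, bounds):
    """Map scaled values back to the original range given (min, max)."""
    x_min, x_max = bounds
    return np.asarray(scaled, dtype=float) * (x_max - x_min) + x_min


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
prices = [10.0, 12.5, 15.0, 20.0]
scaled, bounds = min_max_scale(prices)
restored = min_max_invert(scaled, bounds)
```

Keeping the `(min, max)` pair next to the scaled series plays the same role as storing the fitted scaler per ticker and feature.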

4. Windowing

Besides the mathematical transformation, each series undergoes a vectorial transformation that splits the data into overlapping windows.

The process is exemplified below:

```
original single dimensional series = [1,2,3,4,5,6,7,8]
Splitting into windows of length 4.
Windowed data = [[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7],[5,6,7,8]]
```
This way the resulting series will have $n-w+1$ elements, where $n$ is the length of the original series and $w$ is the window length.
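
The windowing step can be reproduced with plain Python slicing. The sketch below is only an illustration of the idea, under the assumption of a stride of one element between consecutive windows. The helper name `sliding_windows` is hypothetical and does not appear in the notebook.

```python
def sliding_windows(series, window_length):
    """Return every contiguous window of `window_length` elements, in order."""
    if window_length < 1 or window_length > len(series):
        raise ValueError("window_length must be between 1 and len(series)")
    # There are n - w + 1 valid start positions: 0, 1, ..., n - w.
    return [
        list(series[start:start + window_length])
        for start in range(len(series) - window_length + 1)
    ]


series = [1, 2, 3, 4, 5, 6, 7, 8]
windows = sliding_windows(series, 4)
```

Note that Keras' `TimeseriesGenerator`, which the notebook uses for this step, pairs each window with the value that follows it as the prediction target, so it yields $n-w$ samples rather than $n-w+1$.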

### Processed data availability

After processing, the final data is available in a dataframe format.

In [1]:
from datetime import timedelta, datetime
import pandas as pd
import numpy as np

import cufflinks as cf
import chart_studio.plotly as plotly
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=True, world_readable=False)

from sklearn.preprocessing import MinMaxScaler
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
import yfinance as yf
from PyEMD import CEEMDAN

Using TensorFlow backend.


In [2]:
tickers = ["PETR4", "VALE3", "BOVA11", "ITUB4", "BBDC4", "B3SA3", "BBAS3", "ABEV3", "MGLU3", "VVAR3"]

tickers = [f"{ticker}.SA" for ticker in tickers]

start_datetime = datetime(year=2018, month=5, day=1)
end_datetime = datetime(year=2020, month=5, day=1)

history_data = {ticker.split('.')[0]:yf.download(ticker, start=start_datetime, end=end_datetime).drop(['Close'], axis=1).dropna() for ticker in tickers}

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed


In [3]:
ceemdan = CEEMDAN()
decomposed_data = {}
decomposed_ticker_features_series = {}
scalers = {}
for ticker in history_data:
    # iterating every ticker
    print(f'[{ticker}]')
    ticker_dataframe = history_data[ticker]
    decomposed_ticker_features_series[ticker] = {}
    scalers[ticker] = {}
    for column in ticker_dataframe.columns:
        # iterating every feature
        scaler = MinMaxScaler()
        decomposed_ticker_features_series[ticker][column] = {}
        series = ticker_dataframe[column].values.reshape(-1,1)
        scaler.fit(series)
        scalers[ticker][column] = scaler
        ticker_feature_time_series = scaler.transform(series).flatten()
        ticker_feature_time_series_imfs = ceemdan(ticker_feature_time_series, max_imf=10)
        last_index = len(ticker_feature_time_series_imfs) - 1
        for i, imf_series in enumerate(ticker_feature_time_series_imfs):
            # iterating every IMF; the last component returned is stored as the residue
            if i < last_index:
                decomposed_ticker_features_series[ticker][column][f'IMF{i+1}'] = imf_series
            else:
                decomposed_ticker_features_series[ticker][column]['Rsd'] = imf_series

[PETR4]
[VALE3]
[BOVA11]
[ITUB4]
[BBDC4]
[B3SA3]
[BBAS3]
[ABEV3]
[MGLU3]
[VVAR3]


In [73]:
# Data organisation
features_in_order = ['Open', 'High', 'Low', 'Adj Close', 'Volume']

windows_sizes_for_imf_level = {
    'IMF1': 2,
    'IMF2': 2,
    'IMF3': 3,
    'IMF4': 3,
    'IMF5': 4,
    'IMF6': 4,
    'IMF7': 5,
    'IMF8': 5,
    'Rsd': 6,
    'DEFAULT': 4
}

# Coupling together the IMFs of the same level for different features to create exogenous input
# The number of imfs for each feature decomposition may differ, thus some of the last imfs may not match in number of features
series = {}
for ticker in decomposed_ticker_features_series:
    series[ticker] = {}
    for feature in decomposed_ticker_features_series[ticker]:
        imfs = pd.DataFrame.from_dict(decomposed_ticker_features_series[ticker][feature])
        for imf in imfs:
            if imf not in series[ticker]:
                series[ticker][imf] = []
            _series = imfs[imf].values
            _series = _series.reshape((len(_series),1)) # reshaping to get into column format
            series[ticker][imf] += [_series]

dataset = {}
# # horizontal stack
for ticker in series:
    dataset[ticker] = {}
    for imf_level in series[ticker]:
        dataset[ticker][imf_level] = np.hstack(tuple(series[ticker][imf_level]))

# # data set split rates
train = 0.7
validation = 0.2
test = 0.1

train_dataset = {}
validation_dataset = {}
test_dataset = {}

train_generators = {}
validation_generators = {}
test_generators = {}

for ticker in dataset:

    train_dataset[ticker] = {}
    validation_dataset[ticker] = {}
    test_dataset[ticker] = {}

    train_generators[ticker] = {}
    validation_generators[ticker] = {}
    test_generators[ticker] = {}

    for imf_level in dataset[ticker]:
        # splitting data sets according to rates (chronological order is preserved)
        n_rows = dataset[ticker][imf_level].shape[0]
        train_end = round(train*n_rows)
        validation_end = round((train+validation)*n_rows)
        train_dataset[ticker][imf_level] = dataset[ticker][imf_level][:train_end,:]
        validation_dataset[ticker][imf_level] = dataset[ticker][imf_level][train_end:validation_end,:]
        test_dataset[ticker][imf_level] = dataset[ticker][imf_level][validation_end:,:]

        if imf_level in windows_sizes_for_imf_level:
            window_size = windows_sizes_for_imf_level[imf_level]
        else: 
            window_size = windows_sizes_for_imf_level['DEFAULT']
        # windowing
        train_generators[ticker][imf_level] = TimeseriesGenerator(train_dataset[ticker][imf_level], train_dataset[ticker][imf_level], length=window_size, batch_size=1)
        validation_generators[ticker][imf_level] = TimeseriesGenerator(validation_dataset[ticker][imf_level], validation_dataset[ticker][imf_level], length=window_size, batch_size=1)
        test_generators[ticker][imf_level] = TimeseriesGenerator(test_dataset[ticker][imf_level], test_dataset[ticker][imf_level], length=window_size, batch_size=1)


In [39]:
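# select one ticker / IMF level to model
# (assumption: PETR4 and IMF1 are picked here only for illustration)
ticker, imf_level = 'PETR4', 'IMF1'
train_generator = train_generators[ticker][imf_level]
validation_generator = validation_generators[ticker][imf_level]
window_size = windows_sizes_for_imf_level.get(imf_level, windows_sizes_for_imf_level['DEFAULT'])
n_features = dataset[ticker][imf_level].shape[1]

# define model
model = Sequential()
model.add(LSTM(100, activation='relu', input_shape=(window_size, n_features)))
model.add(Dense(n_features))  # one output per feature, matching the generator targets
model.compile(optimizer='adam', loss='mse')

# fit model
model.fit_generator(train_generator, steps_per_epoch=1, epochs=1000, verbose=0)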

<keras.callbacks.callbacks.History at 0x145c11bd0>

In [40]:

results = {}
for feature in features_in_order:
    results[feature] = {
        'real': [],
        'predicted': []
    }

for i in range(len(validation_generator)):
    x, y = validation_generator[i]
    yhat = model.predict(x, verbose=0)

    for j in range(yhat.shape[1]):
        results[features_in_order[j]]['real'] += [y[:,j][0]]
        results[features_in_order[j]]['predicted'] += [yhat[:,j][0]]


In [41]:
for feature in features_in_order:
    pd.DataFrame.from_dict(results[feature]).iplot(title=feature)