# Data processing for XLSTM-CEEMDAN model

## Python requirements

```
!pip install plotly
!pip install cufflinks
!pip install chart_studio
!pip install ipywidgets
!pip install yfinance
!pip install EMD-signal
!pip install scikit-learn
!pip install keras
!pip install tensorflow
```

## Model

The **XLSTM-CEEMDAN** model works with daily equity data: open, high, low, close, and volume. Hereinafter we call these "features".

### Data processing steps:

1. Data gathering

We use the yfinance package to download the daily history of each ticker. The result is a dataframe with one time series per feature for each ticker we work with.

Here we consider the 10 most liquid tickers on the Brazilian B3 exchange over the period from **2019-05-01 to 2020-05-01**:

**PETR4, VALE3, BOVA11, ITUB4, BBDC4, B3SA3, BBAS3, ABEV3, MGLU3, VVAR3**

2. Decomposition

We decompose the data with the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) algorithm, using the [PyEMD public package from Dawid Laszuk](https://github.com/laszukdawid/PyEMD).

As a result, we obtain a set of intrinsic mode function (IMF) time series for each feature of each ticker. The number of IMFs produced by a decomposition may vary; it depends on CEEMDAN hyperparameters such as the scale of the added noise, on the characteristics of the specific series, and mainly on the **series length**.

3. Data transformation

For each IMF series obtained, we transform the data according to the following equation so that all values lie between 0 and 1.

$$x' = \frac{x-x_{min}}{x_{max}-x_{min}}$$

Here $x$ is any element of the series to be transformed, $x'$ is the corresponding transformed element, and $x_{min}$ and $x_{max}$ are the minimum and maximum values of the series.
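The transformation above, together with its inverse (needed to map scaled values back to the original scale), can be sketched in plain NumPy. This is only an illustrative sketch: the helper names `min_max_scale` and `min_max_invert` and the toy series are assumptions, not part of the notebook, which relies on scikit-learn's `MinMaxScaler` instead.

```python
import numpy as np


def min_max_scale(series):
    """Scale a 1-D series to [0, 1] and return the (min, max) used."""
    series = np.asarray(series, dtype=float)
    x_min, x_max = series.min(), series.max()
    if x_max == x_min:
        # The formula divides by (x_max - x_min), so a constant series is undefined
        raise ValueError("a constant series cannot be min-max scaled")
    return (series - x_min) / (x_max - x_min), (x_min, x_max)


def min_max_invert(scaled, bounds):
    """Undo min_max_scale: x = x' * (x_max - x_min) + x_min."""
    x_min, x_max = bounds
    return np.asarray(scaled, dtype=float) * (x_max - x_min) + x_min


# Toy series standing in for one IMF (hypothetical values)
imf = np.array([-2.0, 0.0, 1.0, 2.0])
scaled, bounds = min_max_scale(imf)      # values 0, 0.5, 0.75, 1
restored = min_max_invert(scaled, bounds)  # recovers the original values
```

Keeping the `(min, max)` pair of each series is what makes the inverse possible, so it has to be stored per IMF if values are later mapped back.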

4. Windowing

Besides the mathematical transformation, each series undergoes a vector transformation that splits the data into overlapping windows.

The process is exemplified below:

```
original single dimensional series = [1,2,3,4,5,6,7,8]
Splitting into windows of length 4.
Windowed data = [[1,2,3,4],[2,3,4,5],[3,4,5,6],[4,5,6,7],[5,6,7,8]]
```
This way, the resulting series has $n-w+1$ elements, where $n$ is the length of the original series and $w$ is the window length.
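The windowing step can be reproduced with a few lines of NumPy. The function name `sliding_windows` is an assumption made for illustration; the notebook itself uses Keras' `TimeseriesGenerator` for this step.

```python
import numpy as np


def sliding_windows(series, window_length):
    """Split a 1-D series into overlapping windows with a stride of 1."""
    series = np.asarray(series)
    n = len(series)
    if not 1 <= window_length <= n:
        raise ValueError("window_length must be between 1 and len(series)")
    # One window starts at every index from 0 to n - w, giving n - w + 1 windows
    return np.array([series[i:i + window_length]
                     for i in range(n - window_length + 1)])


windows = sliding_windows([1, 2, 3, 4, 5, 6, 7, 8], window_length=4)
print(windows.shape)  # (5, 4): n - w + 1 = 8 - 4 + 1 = 5 windows
```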

### Processed data availability

After processing, the final data is available in a dataframe format.
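One possible way to hold the processed series in a single dataframe is sketched below, assuming the decomposition results are stored as a nested dictionary of the form `{ticker: {feature: {component: array}}}` and that all series have the same length. The variable names `decomposed`, `flat`, and `frame`, as well as the random values, are hypothetical and only stand in for real IMFs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the nested decomposition results
rng = np.random.default_rng(0)
decomposed = {
    "PETR4": {"Open": {"IMF0": rng.normal(size=5), "Rsd": rng.normal(size=5)}},
    "VALE3": {"Open": {"IMF0": rng.normal(size=5), "Rsd": rng.normal(size=5)}},
}

# Flatten to {(ticker, feature, component): array}; tuple keys become
# a three-level column MultiIndex when passed to the DataFrame constructor
flat = {
    (ticker, feature, component): values
    for ticker, features in decomposed.items()
    for feature, components in features.items()
    for component, values in components.items()
}
frame = pd.DataFrame(flat)
frame.columns.names = ["ticker", "feature", "component"]

# Select one component across all tickers and features
first_imfs = frame.xs("IMF0", axis=1, level="component")
```

With this layout, a single series is addressed as `frame[("PETR4", "Open", "IMF0")]`. Series of different lengths would need to be wrapped in `pd.Series` first so that pandas can align them and pad with `NaN`.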

In [52]:
from datetime import timedelta, datetime
import pandas as pd
import numpy as np

import cufflinks as cf
import chart_studio.plotly as plotly
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=True, world_readable=False)

from sklearn.preprocessing import MinMaxScaler
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
import yfinance as yf
from PyEMD import CEEMDAN

In [25]:
tickers = ["PETR4", "VALE3", "BOVA11", "ITUB4", "BBDC4", "B3SA3", "BBAS3", "ABEV3", "MGLU3", "VVAR3"]

tickers = [f"{ticker}.SA" for ticker in tickers]

start_datetime = datetime(year=2019, month=5, day=1)
end_datetime = datetime(year=2020, month=5, day=1)

history_data = {ticker.split('.')[0]:yf.download(ticker, start=start_datetime, end=end_datetime).drop(['Close'], axis=1).dropna() for ticker in tickers}

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed


In [26]:
ceemdan = CEEMDAN()
scaler = MinMaxScaler()
decomposed_data = {}
decomposed_ticker_features_series = {}
for ticker in history_data:
    print(f'[{ticker}]')
    ticker_dataframe = history_data[ticker]
    decomposed_ticker_features_series[ticker] = {}
    for column in ticker_dataframe.columns:
        decomposed_ticker_features_series[ticker][column] = {}
        # MinMaxScaler expects a 2-D column vector
        series = ticker_dataframe[column].values.reshape(-1, 1)
        # Scale to [0, 1] and flatten back to the 1-D array CEEMDAN expects
        ticker_feature_time_series = scaler.fit_transform(series).ravel()
        ticker_feature_time_series_imfs = ceemdan(ticker_feature_time_series, max_imf=10)
        last_index = len(ticker_feature_time_series_imfs) - 1
        for i, imf_series in enumerate(ticker_feature_time_series_imfs):
            # Zero-based names (IMF0, IMF1, ...); the last component is stored as the residue
            if i < last_index:
                decomposed_ticker_features_series[ticker][column][f'IMF{i}'] = imf_series
            else:
                decomposed_ticker_features_series[ticker][column]['Rsd'] = imf_series

[PETR4]
--[Open]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
--[High]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
--[Low]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
--[Adj Close]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
--[Volume]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
----IMF6
[VALE3]
--[Open]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
--[High]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
--[Low]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
--[Adj Close]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
--[Volume]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
[BOVA11]
--[Open]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
--[High]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
--[Low]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
--[Adj Close]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
--[Volume]
----IMF0
----IMF1
----IMF2
----IMF3
----IMF4
----IMF5
----IMF6
[ITUB4]
--[Open]
----IMF0
----IMF1
----

In [45]:
# Windowing example
series = pd.DataFrame.from_dict(decomposed_ticker_features_series['BBAS3']['Open'])['IMF0']
window_size = 2
generator = TimeseriesGenerator(series, series, length=window_size, batch_size=1)
print(f'Samples: {len(generator)}')
for i in range(len(generator)):
    x, y = generator[i]
    print('%s => %s' % (x, y))

Samples: 246
[[0.00176907 0.02474279]] => [0.00194419]
[[0.02474279 0.00194419]] => [-0.025798]
[[ 0.00194419 -0.025798  ]] => [-0.02697583]
[[-0.025798   -0.02697583]] => [0.01180694]
[[-0.02697583  0.01180694]] => [0.02738]
[[0.01180694 0.02738   ]] => [-0.02203177]
[[ 0.02738    -0.02203177]] => [0.01448237]
[[-0.02203177  0.01448237]] => [-0.00125919]
[[ 0.01448237 -0.00125919]] => [0.00866302]
[[-0.00125919  0.00866302]] => [-0.01263383]
[[ 0.00866302 -0.01263383]] => [-0.02524081]
[[-0.01263383 -0.02524081]] => [-0.00255738]
[[-0.02524081 -0.00255738]] => [0.02988484]
[[-0.00255738  0.02988484]] => [-0.0256155]
[[ 0.02988484 -0.0256155 ]] => [-0.01651905]
[[-0.0256155  -0.01651905]] => [-0.00862152]
[[-0.01651905 -0.00862152]] => [0.00468734]
[[-0.00862152  0.00468734]] => [0.00393823]
[[0.00468734 0.00393823]] => [-0.00201417]
[[ 0.00393823 -0.00201417]] => [0.0021347]
[[-0.00201417  0.0021347 ]] => [0.00260384]
[[0.0021347  0.00260384]] => [-0.00638923]
[[ 0.00260384 -0.0063892

In [48]:
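# model: one hidden layer mapping a window of lagged values to the next value
n_input = window_size  # number of lagged values per sample
model = Sequential()
model.add(Dense(100, activation='relu', input_dim=n_input))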
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

In [49]:
# fit model (Model.fit accepts generators; fit_generator is deprecated in recent Keras releases)
model.fit(generator, steps_per_epoch=1, epochs=200, verbose=0)

<keras.callbacks.callbacks.History at 0x14ad28c10>

In [51]:
# make a one-step prediction beyond the last observed window
x_input = np.asarray(series)[-window_size:].reshape((1, window_size))
yhat = model.predict(x_input, verbose=0)