<img src="http://eikon.tpq.io/refinitiv_logo.png" width="28%" align="left" style="vertical-align: top; padding-top: 23px;">
<img src="http://hilpisch.com/tpq_logo_long.png" width="36%" align="right" style="vertical-align: top;">

# Eikon Data API

**Financial Time Series Prediction &mdash; Recognizing Intraday Patterns**

Dr. Yves J. Hilpisch | The Python Quants GmbH

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:training@tpq.io">training@tpq.io</a>

<img src="http://hilpisch.com/images/tr_eikon_02.png" width=350px align=left>

## The Agenda

This tutorial shows

* how to retrieve historical intraday data via the Eikon Data API,
* how to work with such data using `pandas`, `Plotly` and `Cufflinks`, and
* how to apply machine learning (ML) techniques for time series prediction.

## Importing Required Packages

In [1]:
import eikon as ek  # the Eikon Python wrapper package
import numpy as np  # NumPy
import pandas as pd  # pandas
import cufflinks as cf  # Cufflinks
from sklearn.svm import SVC  # scikit-learn
import warnings; warnings.simplefilter('ignore')
from statsmodels.tsa.stattools import adfuller
import configparser as cp

The following **Python and package versions** are used.

In [2]:
import sys
print(sys.version)

3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]


In [3]:
ek.__version__

'1.1.16'

In [4]:
np.__version__

'1.21.5'

In [5]:
pd.__version__

'1.4.4'

In [6]:
cf.__version__

'0.17.3'

## Connecting to Eikon Data API

The code below sets the app key that is used to connect to the **Eikon Data API Proxy**, which needs to be running locally.

In [7]:
cfg = cp.ConfigParser()
cfg.read('eikon.cfg')

[]

In [8]:
# ek.set_app_key(cfg['eikon']['app_id']) #set_app_id function being deprecated
ek.set_app_key('92bffb6063bf4087a7c422252d16476aa5fe962a')
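Keeping the app key out of the notebook itself makes it harder to leak. The following is a minimal sketch, assuming a config file with an `[eikon]` section and an `app_id` option (the layout the commented-out line above implies). The helper name `load_app_key` and the environment variable `EIKON_APP_KEY` are hypothetical and not part of the `eikon` package.

```python
import configparser
import os


def load_app_key(path='eikon.cfg', env_var='EIKON_APP_KEY'):
    """Return the app key from a config file, else from the environment.

    The helper name and the environment variable are illustrative
    assumptions, not part of the eikon package.
    """
    cfg = configparser.ConfigParser()
    found = cfg.read(path)  # list of the files that were actually parsed
    if found and cfg.has_option('eikon', 'app_id'):
        return cfg.get('eikon', 'app_id')
    key = os.environ.get(env_var)
    if key is None:
        raise RuntimeError(
            'no app key found in {!r} or in ${}'.format(path, env_var))
    return key
```

With such a helper, the connection call would read `ek.set_app_key(load_app_key())`. An empty list returned by `cfg.read`, as in the output above, indicates that the config file was not found.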

## Retrieving Intraday Data

We first define a **small universe of `RICs`** for which to retrieve data.

In [9]:
rics = ['JPM', 'BAC', 'MS', 'C']

Second, **intraday data** is retrieved.

In [10]:
data = pd.DataFrame()
for ric in rics:
    data[ric] = ek.get_timeseries(ric,  # the RIC
                     fields='CLOSE',  # the required field
                     start_date='2023-03-22 10:30:00',  # start date
                     end_date='2023-03-22 16:00:00',  # end date
                     interval='minute')['CLOSE']  # bar length

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 279 entries, 2023-03-22 10:32:00 to 2023-03-22 16:00:00
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   JPM     279 non-null    Float64
 1   BAC     277 non-null    Float64
 2   MS      234 non-null    Float64
 3   C       259 non-null    Float64
dtypes: Float64(4)
memory usage: 12.0 KB


In [12]:
data.head()  # first five rows

Date,JPM,BAC,MS,C
2023-03-22 10:32:00,131.0,28.77,,
2023-03-22 10:34:00,131.15,28.81,89.3,45.36
2023-03-22 10:36:00,131.0,28.77,,
2023-03-22 10:37:00,131.15,28.8,89.39,45.29
2023-03-22 10:38:00,131.0,28.8,,


In [13]:
data.tail()  # final five rows

Date,JPM,BAC,MS,C
2023-03-22 15:56:00,129.13,28.195,88.745,44.5852
2023-03-22 15:57:00,129.1382,28.17,88.705,44.595
2023-03-22 15:58:00,129.226,28.17,88.76,44.6184
2023-03-22 15:59:00,129.21,28.155,88.7825,44.625
2023-03-22 16:00:00,129.19,28.145,88.7245,44.605


In [14]:
data.dropna(inplace=True)
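Not every RIC has a bar for every minute, so the combined table contains `NaN` values, and `dropna` keeps only the timestamps at which all RICs have a price. A minimal sketch with made-up prices (the tickers `AAA` and `BBB` and all values are illustrative assumptions). Note that `pd.concat` with `axis=1` takes the union of the timestamps, which differs slightly from the column-by-column assignment above.

```python
import pandas as pd

idx_a = pd.to_datetime(['2023-03-22 10:32', '2023-03-22 10:34',
                        '2023-03-22 10:36', '2023-03-22 10:37'])
idx_b = pd.to_datetime(['2023-03-22 10:34', '2023-03-22 10:37'])

a = pd.Series([131.00, 131.15, 131.00, 131.15], index=idx_a, name='AAA')
b = pd.Series([89.30, 89.39], index=idx_b, name='BBB')

raw = pd.concat([a, b], axis=1)  # union of timestamps, NaN for missing bars
clean = raw.dropna()             # keeps rows where every column has a value

print(raw['BBB'].isna().sum())   # number of missing BBB bars
print(len(clean))                # number of rows that survive
```

After dropping incomplete rows, consecutive rows are no longer exactly one minute apart, so the returns calculated from them span irregular intervals.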

## Calculating the Log Returns

We next calculate the **log returns** in vectorized fashion.

In [15]:
rets = np.log(data / data.shift(1)).dropna()  # log returns in vectorized fashion

In [16]:
rets.head()

Date,JPM,BAC,MS,C
2023-03-22 10:37:00,0.0,-0.000347,0.001007,-0.001544
2023-03-22 10:57:00,0.002437,0.001735,0.000112,0.002866
2023-03-22 11:01:00,-0.000609,-0.001387,0.0,-0.00044
2023-03-22 11:04:00,0.0,-0.000347,0.002569,-0.003088
2023-03-22 11:13:00,0.000761,0.002774,0.002229,0.001766
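Log returns are additive over time, which is why cumulative performance can be obtained with a cumulative sum followed by `np.exp`. A short check of this property on made-up prices (all values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.0, 99.5, 102.0], name='price')  # made-up values

log_rets = np.log(prices / prices.shift(1)).dropna()

# the sum of the log returns equals the log of the total gross return
total = np.exp(log_rets.sum())
assert np.isclose(total, prices.iloc[-1] / prices.iloc[0])

# cumulative sum plus exp recovers the price path rebased to 1
rebased = np.exp(log_rets.cumsum())
assert np.allclose(rebased.values, (prices / prices.iloc[0]).values[1:])
```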


While **financial time series data** in general is not stationary ...

In [17]:
adfuller(data['JPM'])  # test for stationarity of time series

(-1.0956945301549466,
 0.7167858190850459,
 1,
 227,
 {'1%': -3.4594900381360034,
  '5%': -2.8743581895178485,
  '10%': -2.573601605503697},
 -260.096813250128)

... the **log returns time series data** in general is stationary: here the test statistic lies far below all critical values.

In [18]:
adfuller(rets['JPM'])  # test for stationarity of time series

(-22.130950286531828,
 0.0,
 0,
 227,
 {'1%': -3.4594900381360034,
  '5%': -2.8743581895178485,
  '10%': -2.573601605503697},
 -2326.6683290965093)
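The tuples returned by `adfuller` start with the test statistic and the p-value, followed by the number of lags used, the number of observations, a dictionary of critical values and the information criterion value. A small helper (the name `rejects_unit_root` is hypothetical) applies the usual decision rule to the two results shown above:

```python
def rejects_unit_root(result, level='5%'):
    """True if the ADF statistic lies below the critical value at `level`."""
    statistic, critical_values = result[0], result[4]
    return statistic < critical_values[level]


# values copied from the two adfuller outputs above
crit = {'1%': -3.4594900381360034,
        '5%': -2.8743581895178485,
        '10%': -2.573601605503697}
prices_result = (-1.0956945301549466, 0.7167858190850459, 1, 227,
                 crit, -260.096813250128)
returns_result = (-22.130950286531828, 0.0, 0, 227,
                  crit, -2326.6683290965093)

print(rejects_unit_root(prices_result))   # False: price levels look non-stationary
print(rejects_unit_root(returns_result))  # True: log returns look stationary
```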

## Plotting the Data

Using `Cufflinks`, we can plot the normalized financial time series as **line plots** for comparison.

In [19]:
data.normalize().iplot(kind='lines')
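Normalization puts series with very different price levels on a common scale. A pandas-only sketch of the idea, assuming that normalization means rebasing each series to 100 at its first observation (the tickers and prices are made up):

```python
import pandas as pd

prices = pd.DataFrame({'AAA': [50.0, 51.0, 49.0],
                       'BBB': [200.0, 198.0, 204.0]})  # made-up values

rebased = prices / prices.iloc[0] * 100  # every column starts at 100
print(rebased)
```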

The frequency distributions, i.e. the **histograms**, of the log returns per `RIC` follow.

In [20]:
rets.iplot(kind='histogram', subplots=True)

## Preparing Lagged Data

### Basic Idea

To create predictions for the financial time series analyzed, we work with **five lags**. The basic idea is that the historical (return) **values from the previous five periods** are used to predict the value of the current period. The simple example below uses business days, while the application later on uses intraday bars. Consider the following simple data set.

In [21]:
n = 15
df = pd.DataFrame(np.arange(n), index=pd.date_range('2023-1-1', periods=n, freq='B'),
                 columns=['data'])
df

Unnamed: 0,data
2023-01-02,0
2023-01-03,1
2023-01-04,2
2023-01-05,3
2023-01-06,4
2023-01-09,5
2023-01-10,6
2023-01-11,7
2023-01-12,8
2023-01-13,9


The code below creates five additional columns with lagged data (one day back, two days back, ...).

In [22]:
lags = 5
for lag in range(1, lags + 1):
    df['lag_{}'.format(lag)] = df['data'].shift(lag)

In [23]:
df.dropna().astype(int)

Unnamed: 0,data,lag_1,lag_2,lag_3,lag_4,lag_5
2023-01-09,5,4,3,2,1,0
2023-01-10,6,5,4,3,2,1
2023-01-11,7,6,5,4,3,2
2023-01-12,8,7,6,5,4,3
2023-01-13,9,8,7,6,5,4
2023-01-16,10,9,8,7,6,5
2023-01-17,11,10,9,8,7,6
2023-01-18,12,11,10,9,8,7
2023-01-19,13,12,11,10,9,8
2023-01-20,14,13,12,11,10,9


### Application

The code that follows derives the **lagged data** for every single `RIC`. First, a function that adds columns with lagged data to a `DataFrame` object.

In [24]:
def add_lags(data, ric, lags):
    cols = []
    df = pd.DataFrame(data[ric])  # uses the data passed in, not a global
    for lag in range(1, lags + 1):
        col = 'lag_{}'.format(lag)  # defines the column name
        # creates the lagged data column (values shifted by `lag` rows)
        df[col] = df[ric].shift(lag)
        cols.append(col)  # stores the column name
    df.dropna(inplace=True)  # gets rid of incomplete data rows
    return df, cols

Second, the iteration over all `RICs`, applying the `add_lags` function to the log returns and storing the resulting `DataFrame` objects in a dictionary.

In [25]:
dfs = {}
for ric in rics:
    df, cols = add_lags(rets, ric, lags)  # lags of the log returns
    dfs[ric] = df

In [26]:
cols  # the column names for the lags

['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5']

In [27]:
dfs.keys()  # the keys of the dictionary

dict_keys(['JPM', 'BAC', 'MS', 'C'])

In [28]:
dfs['JPM'].head(7)

Date,JPM,lag_1,lag_2,lag_3,lag_4,lag_5
2023-03-22 11:16:00,-7.6e-05,0.000761,0.0,-0.000609,0.002437,0.0
2023-03-22 11:17:00,-0.000228,-7.6e-05,0.000761,0.0,-0.000609,0.002437
2023-03-22 11:21:00,0.001292,-0.000228,-7.6e-05,0.000761,0.0,-0.000609
2023-03-22 11:22:00,0.000152,0.001292,-0.000228,-7.6e-05,0.000761,0.0
2023-03-22 11:28:00,-0.000608,0.000152,0.001292,-0.000228,-7.6e-05,0.000761
2023-03-22 11:45:00,-0.003426,-0.000608,0.000152,0.001292,-0.000228,-7.6e-05
2023-03-22 11:48:00,7.6e-05,-0.003426,-0.000608,0.000152,0.001292,-0.000228


In [29]:
np.sign(dfs['JPM'].head(7))

Date,JPM,lag_1,lag_2,lag_3,lag_4,lag_5
2023-03-22 11:16:00,-1.0,1.0,0.0,-1.0,1.0,0.0
2023-03-22 11:17:00,-1.0,-1.0,1.0,0.0,-1.0,1.0
2023-03-22 11:21:00,1.0,-1.0,-1.0,1.0,0.0,-1.0
2023-03-22 11:22:00,1.0,1.0,-1.0,-1.0,1.0,0.0
2023-03-22 11:28:00,-1.0,1.0,1.0,-1.0,-1.0,1.0
2023-03-22 11:45:00,-1.0,-1.0,1.0,1.0,-1.0,-1.0
2023-03-22 11:48:00,1.0,-1.0,-1.0,1.0,1.0,-1.0


In [30]:
2 ** lags  # number of up/down patterns (zero returns not counted)

32
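The count of 32 assumes that every lag is either an up or a down move. Since `np.sign` also returns `0` for an unchanged price, as several rows in the table above show, up to 3 ** 5 = 243 patterns can occur. A quick enumeration using only the standard library:

```python
from itertools import product

lags = 5
up_down = list(product([-1, 1], repeat=lags))       # only up and down moves
with_flat = list(product([-1, 0, 1], repeat=lags))  # unchanged bars included

print(len(up_down))    # 32
print(len(with_flat))  # 243
print(up_down[0])      # (-1, -1, -1, -1, -1)
```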

## Implementing ML Algorithm

The matrix consisting of the signs of the lagged data columns is used to "predict" the direction of movement of the `RIC` over the next bar via the **support vector machine (SVM)** algorithm. This is a **classification algorithm** that is able to **learn from historical patterns** (5 lags) to predict whether an upwards movement is more likely or a downwards movement.

In [31]:
from sklearn.preprocessing import LabelEncoder
for ric in rics:
    model = SVC(C=100)  # the ML model
    df = dfs[ric].copy()  # getting data for the RIC
    x = np.sign(df[cols].astype(float))  # directions of the lagged returns
    y = np.sign(df[ric].astype(float))  # direction to be predicted (-1, 0, +1)
    label_encoder = LabelEncoder()
    y_encoded = label_encoder.fit_transform(y)  # integer class labels
    model.fit(x, y_encoded)  # model fitting
    pred = model.predict(x)  # in-sample prediction (encoded)
    # maps the encoded classes back to -1, 0, +1
    dfs[ric]['position'] = label_encoder.inverse_transform(pred)

The prediction value is `+1` for an upwards movement or `-1` for a downwards movement (`0` can occur for bars with a zero return). When using these values as signals for a trading strategy, one **would go long for `+1` and go short for `-1`**.

In [32]:
for ric in rics:
    print('{:10} | {}'.format(ric, dfs[ric]['position'].values[:12]))
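To make "learning from historical patterns" concrete, here is a deliberately simple stand-in: a lookup table that predicts, for each pattern of lagged signs, the direction that followed it most often in the training data. This is not the SVM used above, only an illustration of the idea. The function names and the toy series are assumptions.

```python
from collections import Counter, defaultdict


def fit_pattern_table(signs, lags):
    """Map each tuple of `lags` previous signs to the most frequent next sign."""
    counts = defaultdict(Counter)
    for t in range(lags, len(signs)):
        # most recent value first, mirroring lag_1, lag_2, ...
        pattern = tuple(signs[t - lag] for lag in range(1, lags + 1))
        counts[pattern][signs[t]] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}


def predict(table, pattern, default=0):
    """Return the learned direction, or `default` for an unseen pattern."""
    return table.get(tuple(pattern), default)


signs = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]  # toy alternating series
table = fit_pattern_table(signs, lags=2)

print(predict(table, (-1, 1)))  # 1: after a down move preceded by an up move
print(predict(table, (1, 1)))   # 0: pattern never observed, default is used
```

Unlike such a table, a classifier like the SVM can also produce a prediction for patterns that never appeared in the training data.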


## Vectorized Backtesting

Let's backtest the performance of the ML-based trading strategies. Here, vectorization is used for convenience and speed. First, the **strategy returns** are derived by multiplying the prediction or position values by the log returns of the respective `RIC`.

In [33]:
for ric in rics:
    dfs[ric]['strategy'] = dfs[ric]['position'] * dfs[ric][ric]

Second, the visualization of the **cumulative performance**.

In [34]:
for ric in rics:
    dfs[ric][[ric, 'strategy']].cumsum().applymap(np.exp).iplot()
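The two steps above can be traced by hand on a few made-up numbers. This sketch uses plain NumPy arrays; the returns and positions are illustrative assumptions, not model output.

```python
import numpy as np

log_rets = np.array([0.01, -0.02, 0.015, -0.005])  # made-up log returns
position = np.array([1, -1, 1, 1])                 # +1 long, -1 short

strategy = position * log_rets  # element-wise product, no loop needed

passive = np.exp(log_rets.cumsum())  # gross performance of holding the asset
active = np.exp(strategy.cumsum())   # gross performance of the strategy

print(passive[-1], active[-1])
```

A short position turns the negative return of the second bar into a gain, which is why the strategy ends above the passive investment in this toy case. Since the model above is evaluated on the same data it was fitted on, the resulting performance figures are in-sample.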

## Out-of-Sample Testing

Next, to get a more realistic picture of the trading performance to be expected, a **train-test split** is used to implement **out-of-sample backtesting**.

In [35]:
split = int(len(data) / 2)

In [36]:
vspan = [{'x0': data.index[0], 'x1': data.index[split], 'color': 'green', 'fill': True, 'opacity': .2},
        {'x0': data.index[split], 'x1': data.index[-1], 'color': 'red', 'fill': True, 'opacity': .2}]

Roughly speaking, the **green part is taken for training**, the **red part for testing**.

In [37]:
data.normalize().iplot(vspan=vspan)

In [39]:
res = {}
for ric in rics:
    model = SVC(C=100)  # the ML model
    df = dfs[ric].copy()  # getting data for the RIC
    split = int(len(df) / 2)  # first half for training, second for testing
    train_x = np.sign(df[cols].astype(float)).iloc[:split]
    train_y = np.sign(df[ric].astype(float)).iloc[:split]
    label_encoder = LabelEncoder()
    train_y_encoded = label_encoder.fit_transform(train_y)  # integer class labels

    test_x = np.sign(df[cols].astype(float)).iloc[split:]
    test_y = df[ric].astype(float).iloc[split:]
    model.fit(train_x, train_y_encoded)  # model fitting on training data only
    # prediction, mapped back from class labels to -1, 0, +1
    pred = label_encoder.inverse_transform(model.predict(test_x))
    strat = pred * test_y  # strategy log returns
    res[ric] = pd.DataFrame({ric: test_y,
                             'pred': pred,
                             'strategy': strat})
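Besides the cumulative performance, the share of correctly predicted directions is a simple way to judge an out-of-sample test. A minimal sketch on made-up numbers; the helper names `chrono_split` and `hit_ratio` as well as the predictions are hypothetical.

```python
import numpy as np


def chrono_split(values, frac=0.5):
    """Split a sequence into an earlier and a later part without shuffling."""
    split = int(len(values) * frac)
    return values[:split], values[split:]


def hit_ratio(pred, realized):
    """Share of observations where predicted and realized signs agree."""
    return float(np.mean(np.sign(pred) == np.sign(realized)))


sample = np.array([0.01, -0.02, 0.015, -0.005, 0.002, -0.001])  # made-up returns
train, test = chrono_split(sample)

test_pred = np.array([1, 1, -1])  # hypothetical predictions for the test part
print(hit_ratio(test_pred, test))
```

The split is chronological on purpose: shuffling the rows before splitting would let information from later bars leak into the training data.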

In [None]:
res['JPM'].head()

In [None]:
for ric in rics:
    res[ric][[ric, 'strategy']].cumsum().applymap(np.exp).iplot()

## Conclusions

Based on this tutorial, we can conclude that

* it is easy to retrieve **historical intraday data (one minute bars)** via the Eikon Data API,
* `Plotly` and `Cufflinks` make **financial data visualization** convenient,
* **machine learning (ML) techniques** such as SVM for classification are easily applied with Python, and
* such techniques might be helpful in **predicting the direction of market movements** using a lag and pattern-based approach.

## Eikon Data API Developer Resources

* [Overview](https://developers.thomsonreuters.com/eikon-data-apis)
* [Quick Start](https://developers.thomsonreuters.com/eikon-data-apis/quick-start)
* [Documentation](https://developers.thomsonreuters.com/eikon-data-apis/docs)
* [Downloads](https://developers.thomsonreuters.com/eikon-data-apis/downloads)
* [Tutorials](https://developers.thomsonreuters.com/eikon-data-apis/learning)
* [Q&A Forums](https://developers.thomsonreuters.com/eikon-data-apis/qa)

Data Item Browser Application: Type `DIB` into Eikon Search Bar.
