<img src="http://eikon.tpq.io/refinitiv_logo.png" width="28%" align="left" style="vertical-align: top; padding-top: 23px;">
<img src="http://hilpisch.com/tpq_logo_long.png" width="36%" align="right" style="vertical-align: top;">

# Eikon Data API

**Cross-Asset Financial Analytics &mdash; The Random Walk Hypothesis Revisited**

Dr. Yves J. Hilpisch | The Python Quants GmbH

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:training@tpq.io">training@tpq.io</a>

<img src="http://hilpisch.com/images/tr_eikon_02.png" width=350px align=left>

## The Agenda

This tutorial shows

* how to retrieve historical data across asset classes via the Eikon Data API,
* how to work with such data using `pandas`, `Plotly` and `Cufflinks` and
* how to derive support for the Random Walk Hypothesis from financial time series data.

## Random Walk Hypothesis

Eugene F. Fama (1965): “Random Walks in Stock Market Prices”:

> “For many years, economists, statisticians, and teachers of finance have been interested in developing and testing models of stock price behavior. One important model that has evolved from this research is the theory of random walks. This theory casts serious doubt on many other methods for describing and predicting stock price behavior—methods that have considerable popularity outside the academic world. For example, we shall see later that, if the random-walk theory is an accurate description of reality, then the various “technical” or “chartist” procedures for predicting stock prices are completely without value.”

Michael Jensen (1978): “Some Anomalous Evidence Regarding Market Efficiency”:

> “A market is efficient with respect to an information set S if it is impossible to make economic profits by trading on the basis of information set S.”

If a stock price follows a (simple) random walk (no drift & normally distributed returns), then it rises and falls with the same probability of 50% (“toss of a coin”).

**In such a case, the best predictor of tomorrow’s stock price — in a least-squares sense — is today’s stock price.**
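
The claim can be checked on simulated data. The following is a minimal sketch, assuming a synthetic random walk without drift and with standard normal increments (the seed, the start value of 100 and the sample size are arbitrary choices, not taken from the tutorial). It estimates the share of up-moves and the least-squares weight of yesterday's price.

```python
import numpy as np

rng = np.random.default_rng(100)  # fixed seed for reproducibility

# simple random walk without drift: p_t = p_{t-1} + e_t, e_t ~ N(0, 1)
steps = rng.standard_normal(5000)
prices = 100 + np.cumsum(steps)

# share of up-moves; should be close to 50%
up_share = (steps > 0).mean()

# least-squares slope (no intercept) of p_t on p_{t-1}
x, y = prices[:-1], prices[1:]
beta = (x @ y) / (x @ x)

print(f'share of up-moves: {up_share:.3f}')
print(f'lag-1 coefficient: {beta:.4f}')
```

Under these assumptions, the share of up-moves should be close to 0.5 and the coefficient close to 1, i.e. yesterday's price carries essentially all the weight.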

## Importing Required Packages

In [1]:
import eikon as ek  # the Eikon Python wrapper package
import numpy as np  # NumPy
import pandas as pd  # pandas
import cufflinks as cf  # Cufflinks
import configparser as cp

The following **Python and package versions** are used.

In [2]:
import sys
print(sys.version)

3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]


In [3]:
ek.__version__

'1.1.16'

In [4]:
np.__version__

'1.21.5'

In [5]:
pd.__version__

'1.4.4'

In [6]:
cf.__version__

'0.17.3'

## Connecting to Eikon Data API

This code sets the app key (formerly `app_id`) used to connect to the **Eikon Data API Proxy**, which needs to be running locally.

In [7]:
cfg = cp.ConfigParser()
cfg.read('eikon.cfg')

[]

In [8]:
# ek.set_app_key(cfg['eikon']['app_id'])  # preferred: read the key from eikon.cfg (set_app_id is deprecated)
ek.set_app_key('YOUR_APP_KEY')  # placeholder; insert your own app key and avoid publishing real keys
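
Note that `ConfigParser.read()` returns the list of files it could parse, so an empty list means that no configuration file was found. The following is a minimal sketch of the file layout that the `cfg['eikon']['app_id']` lookup expects. The file is written to a temporary directory and `YOUR_APP_KEY` is a placeholder value. Both are assumptions made for illustration only.

```python
import configparser
import os
import tempfile

# hypothetical file content: section [eikon] with option app_id
content = '[eikon]\napp_id = YOUR_APP_KEY\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'eikon.cfg')
    with open(path, 'w') as f:
        f.write(content)

    cfg = configparser.ConfigParser()
    found = cfg.read(path)  # list of successfully parsed files
    if not found:
        raise FileNotFoundError('configuration file not found')
    app_key = cfg['eikon']['app_id']

print(app_key)
```

Checking the return value of `read()` before accessing the section gives a clear error message instead of a `KeyError` later on.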

## Retrieving Cross-Asset Data

We first define a **small universe of `RICs`** for which to retrieve data.

Second, **end-of-day (EOD) data** is retrieved.

In [9]:
rics = ['JPM', 'BAC', 'WFC', 'MS', 'GS', 'C', 'BCS']

In [10]:
data = ek.get_timeseries(rics, 
                         fields='CLOSE',
                         start_date='2017-01-01',
                         end_date='2017-12-31')

In [11]:
data.head()  # first five rows

| Date | JPM | BAC | WFC | MS | GS | C | BCS |
|---|---|---|---|---|---|---|---|
| 2017-01-03 | 87.23 | 22.53 | 56.0 | 43.05 | 241.57 | 60.59 | 11.37 |
| 2017-01-04 | 86.91 | 22.95 | 56.05 | 43.62 | 243.13 | 61.41 | 11.6 |
| 2017-01-05 | 86.11 | 22.68 | 55.18 | 43.22 | 241.32 | 60.34 | 11.54 |
| 2017-01-06 | 86.12 | 22.68 | 55.04 | 43.85 | 244.9 | 60.55 | 11.54 |
| 2017-01-09 | 86.18 | 22.55 | 54.24 | 42.71 | 242.89 | 60.22 | 11.33 |


In [12]:
data.tail()  # final five rows

| Date | JPM | BAC | WFC | MS | GS | C | BCS |
|---|---|---|---|---|---|---|---|
| 2017-12-22 | 107.45 | 29.88 | 61.55 | 52.72 | 258.97 | 75.49 | 10.85 |
| 2017-12-26 | 107.02 | 29.78 | 61.13 | 52.47 | 257.72 | 74.78 | 10.81 |
| 2017-12-27 | 107.22 | 29.73 | 60.95 | 52.57 | 255.95 | 74.89 | 10.82 |
| 2017-12-28 | 107.79 | 29.8 | 61.3 | 52.65 | 256.5 | 75.08 | 10.93 |
| 2017-12-29 | 106.94 | 29.52 | 60.67 | 52.47 | 254.76 | 74.41 | 10.9 |


Only complete data rows are selected.

In [13]:
data.dropna(inplace=True)  # deletes rows with NaN values

In [14]:
data.info()  # DataFrame meta information

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 251 entries, 2017-01-03 to 2017-12-29
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   JPM     251 non-null    Float64
 1   BAC     251 non-null    Float64
 2   WFC     251 non-null    Float64
 3   MS      251 non-null    Float64
 4   GS      251 non-null    Float64
 5   C       251 non-null    Float64
 6   BCS     251 non-null    Float64
dtypes: Float64(7)
memory usage: 17.4 KB


## Calculating the Log Returns

We next calculate the **log returns** in vectorized fashion.

In [15]:
rets = np.log(data / data.shift(1))  # log returns in vectorized fashion

In [16]:
rets.head()

| Date | JPM | BAC | WFC | MS | GS | C | BCS |
|---|---|---|---|---|---|---|---|
| 2017-01-03 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2017-01-04 | -0.003675 | 0.01847 | 0.000892 | 0.013154 | 0.006437 | 0.013443 | 0.020027 |
| 2017-01-05 | -0.009248 | -0.011834 | -0.015644 | -0.009212 | -0.007472 | -0.017577 | -0.005186 |
| 2017-01-06 | 0.000116 | 0.0 | -0.00254 | 0.014471 | 0.014726 | 0.003474 | 0.0 |
| 2017-01-09 | 0.000696 | -0.005748 | -0.014642 | -0.026342 | -0.008241 | -0.005465 | -0.018365 |
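
To make the calculation concrete, here is a small worked example using the first three JPM closing prices shown by `data.head()`. It is a sketch that also illustrates two standard properties of log returns: they equal the differences of the log prices, and they add up over time. The variable names `alt` and `total` are introduced for illustration only.

```python
import numpy as np
import pandas as pd

# first three JPM closing prices from data.head()
prices = pd.Series([87.23, 86.91, 86.11], name='JPM')

rets = np.log(prices / prices.shift(1))  # same pattern as in the tutorial
alt = np.log(prices).diff()  # equivalent: difference of log prices

# log returns are additive: their sum equals the log return over the full period
total = np.log(prices.iloc[-1] / prices.iloc[0])

print(rets.round(6).tolist())
print(np.isclose(rets.sum(), total))
```

The first row is `NaN` because no previous price exists for it. This is also why the first row of `rets` is empty.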


`pandas` allows us to derive the **correlation matrix** of the closing prices with a single method call.

In [17]:
data.corr()  # correlation matrix by column

| | JPM | BAC | WFC | MS | GS | C | BCS |
|---|---|---|---|---|---|---|---|
| JPM | 1.0 | 0.975605 | 0.501027 | 0.968018 | 0.653395 | 0.891003 | -0.431025 |
| BAC | 0.975605 | 1.0 | 0.588259 | 0.944151 | 0.705522 | 0.824035 | -0.357441 |
| WFC | 0.501027 | 0.588259 | 1.0 | 0.47612 | 0.840336 | 0.126474 | 0.422948 |
| MS | 0.968018 | 0.944151 | 0.47612 | 1.0 | 0.659818 | 0.895104 | -0.417848 |
| GS | 0.653395 | 0.705522 | 0.840336 | 0.659818 | 1.0 | 0.343072 | 0.171761 |
| C | 0.891003 | 0.824035 | 0.126474 | 0.895104 | 0.343072 | 1.0 | -0.680739 |
| BCS | -0.431025 | -0.357441 | 0.422948 | -0.417848 | 0.171761 | -0.680739 | 1.0 |
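
Note that `data.corr()` works on price levels. Trending series can show sizable level correlations even when their day-to-day moves are unrelated, so it may be worth comparing with the correlation of the log returns. The sketch below uses two independent synthetic random walks (hypothetical data, not market prices; seed and volatility are arbitrary) to contrast the two measures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# two independent synthetic random walks in log prices (hypothetical data)
shocks = rng.standard_normal((1000, 2)) * 0.01
prices = pd.DataFrame(100 * np.exp(shocks.cumsum(axis=0)), columns=['A', 'B'])
rets = np.log(prices / prices.shift(1)).dropna()

level_corr = prices.corr().loc['A', 'B']  # correlation of price levels
ret_corr = rets.corr().loc['A', 'B']  # correlation of log returns

print(f'levels:  {level_corr:.3f}')
print(f'returns: {ret_corr:.3f}')
```

For independent series, the return correlation should be close to zero, while the level correlation can take almost any value. In the tutorial's setting, the analogous call would be `rets.corr()`.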


## Plotting the Data

Using `Cufflinks`, we can plot the normalized financial time series as **line plots** for comparison.

In [18]:
cf.set_config_file(offline=True)  # set the plotting mode to offline

In [19]:
data.normalize().iplot(kind='lines')
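
Normalization makes series with very different price levels comparable in one chart. The following pandas-only sketch rebases each column to a starting value of 100. It is assumed here that this corresponds to what `normalize()` does; the exact behavior and default multiplier should be checked against the Cufflinks version in use.

```python
import pandas as pd

# first three closing prices for two RICs, taken from data.head()
prices = pd.DataFrame({'JPM': [87.23, 86.91, 86.11],
                       'BCS': [11.37, 11.60, 11.54]})

# rebase every column so that it starts at 100
normalized = prices / prices.iloc[0] * 100

print(normalized.round(2))
```

After rebasing, each value can be read as the price in percent of the first observation.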

The frequency distributions, i.e. the **histograms**, of the log returns are plotted per `RIC`.

In [20]:
rets.iplot(kind='histogram', subplots=True)

The **heatmap** below visualizes the correlations between the financial time series.

In [21]:
data.corr().iplot(kind='heatmap', colorscale='blues')

## Preparing Lagged Data

To gain insights into whether the random walk hypothesis holds true, we work with **five lags**. The code that follows derives the **lagged data** for every single `RIC`. First, we define a function that adds columns with lagged data to a `DataFrame` object.

In [22]:
def add_lags(data, ric, lags):
    cols = []
    df = pd.DataFrame(data[ric])
    for lag in range(1, lags + 1):
        col = 'lag_{}'.format(lag)  # defines the column name
        df[col] = df[ric].shift(lag)  # creates the lagged data column
        cols.append(col)  # stores the column name
    df.dropna(inplace=True)  # gets rid of incomplete data rows
    return df, cols
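
To see what the function returns, the following sketch applies it to a tiny toy `DataFrame`. The column name `XYZ` and the values are hypothetical. The function body is repeated so that the example runs on its own.

```python
import pandas as pd


def add_lags(data, ric, lags):
    cols = []
    df = pd.DataFrame(data[ric])
    for lag in range(1, lags + 1):
        col = 'lag_{}'.format(lag)  # defines the column name
        df[col] = df[ric].shift(lag)  # creates the lagged data column
        cols.append(col)  # stores the column name
    df.dropna(inplace=True)  # gets rid of incomplete data rows
    return df, cols


toy = pd.DataFrame({'XYZ': [10.0, 11.0, 12.0, 13.0, 14.0]})  # hypothetical RIC
df, cols = add_lags(toy, 'XYZ', 2)

print(cols)
print(df)
```

With two lags, the first two rows are dropped because their lagged values are missing. Each remaining row holds the current value next to the values from one and two periods earlier.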

Second, we iterate over all `RICs`, apply the `add_lags` function and store the resulting `DataFrame` objects in a dictionary.

In [23]:
lags = 5  # five historical lags

In [24]:
dfs = {}
for ric in rics:
    df, cols = add_lags(data, ric, lags)
    dfs[ric] = df

In [25]:
cols  # the column names for the lags

['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5']

In [26]:
dfs.keys()  # the keys of the dictionary

dict_keys(['JPM', 'BAC', 'WFC', 'MS', 'GS', 'C', 'BCS'])

In [27]:
dfs['JPM'].head(7)

| Date | JPM | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 |
|---|---|---|---|---|---|---|
| 2017-01-10 | 86.43 | 86.18 | 86.12 | 86.11 | 86.91 | 87.23 |
| 2017-01-11 | 87.08 | 86.43 | 86.18 | 86.12 | 86.11 | 86.91 |
| 2017-01-12 | 86.24 | 87.08 | 86.43 | 86.18 | 86.12 | 86.11 |
| 2017-01-13 | 86.7 | 86.24 | 87.08 | 86.43 | 86.18 | 86.12 |
| 2017-01-17 | 83.55 | 86.7 | 86.24 | 87.08 | 86.43 | 86.18 |
| 2017-01-18 | 83.94 | 83.55 | 86.7 | 86.24 | 87.08 | 86.43 |
| 2017-01-19 | 83.3 | 83.94 | 83.55 | 86.7 | 86.24 | 87.08 |


## Implementing OLS Regression

The matrix consisting of the lagged data columns is used to "predict" the next day's value of the `RIC` via **linear OLS regression**.

In [28]:
regs = {}
for ric in rics:
    df = dfs[ric]  # getting data for the RIC
    df.dropna(inplace=True)
    reg = np.linalg.lstsq(df[cols].astype('float64'), df[ric].astype('float64'), rcond=-1)[0]  # the OLS regression
    regs[ric] = reg  # storing the results

In [29]:
for ric in rics:
    print('{:10} | {}'.format(ric, regs[ric]))

JPM        | [ 1.01921487  0.04640413 -0.13249244  0.06492666  0.00283701]
BAC        | [ 1.05205605 -0.04358253 -0.04135156  0.03892146 -0.0049737 ]
WFC        | [ 1.01496864 -0.03666907  0.04284017  0.05001041 -0.07075771]
MS         | [ 0.95341171  0.07994752 -0.12674928  0.1082215  -0.01394681]
GS         | [ 0.9453596   0.11760367 -0.12582698 -0.03101206  0.09407309]
C          | [ 1.08017855 -0.11059699 -0.02724082  0.04463796  0.01387277]
BCS        | [ 0.94438099  0.05941509 -0.09017798  0.00341724  0.08265393]
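
The mechanics of `np.linalg.lstsq` can be illustrated on synthetic data with known weights. This is a sketch only: the data, the seed and the weights in `true_w` are made up. The function returns a tuple of which the first element holds the coefficients, and no intercept is estimated unless a column of ones is added. The sketch passes `rcond=None`, which selects NumPy's current default cutoff for small singular values, whereas `rcond=-1` selects the legacy one.

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic regressors and a target with known weights (hypothetical data)
X = rng.standard_normal((200, 3))
true_w = np.array([1.0, 0.0, -0.5])
y = X @ true_w + 0.01 * rng.standard_normal(200)

# lstsq returns (solution, residuals, rank, singular values)
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

# the same solution via the normal equations (X'X) w = X'y
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

print(w.round(3))
print(np.allclose(w, w_ne))
```

Both approaches should agree here. `lstsq` is generally the more robust choice when the regressors are highly correlated, as is the case for lagged prices.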


## Taking a Closer Look

Let's pick one `RIC` and compare the original time series with the OLS predicted one.

In [30]:
ric = 'JPM'

In [31]:
res = pd.DataFrame(dfs[ric][ric])  # picks the original time series

In [32]:
res['PRED'] = np.dot(dfs[ric][cols], regs[ric])  # creates the "prediction" values

The **predicted prices** are almost exactly the prices from the day before.

In [33]:
res.iloc[-50:].iplot()

In [34]:
res.head()

| Date | JPM | PRED |
|---|---|---|
| 2017-01-10 | 86.43 | 86.313586 |
| 2017-01-11 | 87.08 | 86.517 |
| 2017-01-12 | 86.24 | 87.18152 |
| 2017-01-13 | 86.7 | 86.326343 |
| 2017-01-17 | 83.55 | 86.686485 |
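
The closeness of the "prediction" to the previous day's price can be quantified directly. The following sketch uses the first five JPM rows, with the `lag_1` values copied from the lagged data shown earlier and the `PRED` values from the output above. The helper column names `PRED_minus_lag_1`, `err_ols` and `err_naive` are introduced for illustration only.

```python
import pandas as pd

# first five rows for JPM, copied from the outputs in this tutorial
res = pd.DataFrame({
    'JPM': [86.43, 87.08, 86.24, 86.70, 83.55],
    'lag_1': [86.18, 86.43, 87.08, 86.24, 86.70],
    'PRED': [86.313586, 86.517, 87.18152, 86.326343, 86.686485],
}, index=pd.to_datetime(['2017-01-10', '2017-01-11', '2017-01-12',
                         '2017-01-13', '2017-01-17']))

# distance of the OLS "prediction" from yesterday's price
res['PRED_minus_lag_1'] = res['PRED'] - res['lag_1']
# errors of the OLS "prediction" and of the naive forecast (yesterday's price)
res['err_ols'] = res['JPM'] - res['PRED']
res['err_naive'] = res['JPM'] - res['lag_1']

print(res.round(3))
```

For these rows, the "prediction" deviates from the previous day's price by less than 0.15, while both forecasts miss the drop on 2017-01-17 by more than 3.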


## Analyzing the Results

We now analyze the **regression results** a bit more formally.

In [35]:
rega = np.stack(list(regs.values()))  # combines the regression results (a list avoids NumPy's deprecation warning)
rega

array([[ 1.01921487,  0.04640413, -0.13249244,  0.06492666,  0.00283701],
       [ 1.05205605, -0.04358253, -0.04135156,  0.03892146, -0.0049737 ],
       [ 1.01496864, -0.03666907,  0.04284017,  0.05001041, -0.07075771],
       [ 0.95341171,  0.07994752, -0.12674928,  0.1082215 , -0.01394681],
       [ 0.9453596 ,  0.11760367, -0.12582698, -0.03101206,  0.09407309],
       [ 1.08017855, -0.11059699, -0.02724082,  0.04463796,  0.01387277],
       [ 0.94438099,  0.05941509, -0.09017798,  0.00341724,  0.08265393]])

Almost all the weight lies on the most recent price (`lag_1`).

In [36]:
rega.mean(axis=0)  # mean values by column

array([ 1.0013672 ,  0.01607454, -0.07157127,  0.03987474,  0.01482265])

In [37]:
regd = pd.DataFrame(rega, columns=cols, index=rics)  # converting the results to DataFrame

In [38]:
regd

| | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 |
|---|---|---|---|---|---|
| JPM | 1.019215 | 0.046404 | -0.132492 | 0.064927 | 0.002837 |
| BAC | 1.052056 | -0.043583 | -0.041352 | 0.038921 | -0.004974 |
| WFC | 1.014969 | -0.036669 | 0.04284 | 0.05001 | -0.070758 |
| MS | 0.953412 | 0.079948 | -0.126749 | 0.108222 | -0.013947 |
| GS | 0.94536 | 0.117604 | -0.125827 | -0.031012 | 0.094073 |
| C | 1.080179 | -0.110597 | -0.027241 | 0.044638 | 0.013873 |
| BCS | 0.944381 | 0.059415 | -0.090178 | 0.003417 | 0.082654 |


In [39]:
regd.describe()  # summary statistics

| | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 |
|---|---|---|---|---|---|
| count | 7.0 | 7.0 | 7.0 | 7.0 | 7.0 |
| mean | 1.001367 | 0.016075 | -0.071571 | 0.039875 | 0.014823 |
| std | 0.054724 | 0.081212 | 0.065846 | 0.044308 | 0.057111 |
| min | 0.944381 | -0.110597 | -0.132492 | -0.031012 | -0.070758 |
| 25% | 0.949386 | -0.040126 | -0.126288 | 0.021169 | -0.00946 |
| 50% | 1.014969 | 0.046404 | -0.090178 | 0.044638 | 0.002837 |
| 75% | 1.035635 | 0.069681 | -0.034296 | 0.057469 | 0.048263 |
| max | 1.080179 | 0.117604 | 0.04284 | 0.108222 | 0.094073 |


## Visualizing the Results

The following bar chart illustrates that the results are qualitatively similar for all `RICs` analyzed &mdash; "_today's price is the best predictor, in a least-squares sense, for tomorrow's price_".

In [40]:
regd.iplot(kind='bar')

The next chart shows the **mean values** of the individual optimal regression parameters across all `RICs`.

In [41]:
regd.mean().iplot(kind='bar')

## Analyzing Intraday Data

Let us quickly check whether the results are similar on an **intraday basis**.

In [46]:
data = ek.get_timeseries(rics,  # RICs
              fields='CLOSE',  # fields to be retrieved
              start_date='2023-07-17 14:00:00',  # start time
              end_date='2023-07-17 18:00:00',  # end time
              interval='minute')  # bar length

In [47]:
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 241 entries, 2023-07-17 14:00:00 to 2023-07-17 18:00:00
Freq: T
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   JPM     241 non-null    Float64
 1   BAC     241 non-null    Float64
 2   WFC     241 non-null    Float64
 3   MS      241 non-null    Float64
 4   GS      241 non-null    Float64
 5   C       241 non-null    Float64
 6   BCS     239 non-null    Float64
dtypes: Float64(7)
memory usage: 16.7 KB


In [48]:
data.tail()

| Date | JPM | BAC | WFC | MS | GS | C | BCS |
|---|---|---|---|---|---|---|---|
| 2023-07-17 17:56:00 | 153.535 | 29.4881 | 45.025 | 86.48 | 325.65 | 46.3002 | 8.205 |
| 2023-07-17 17:57:00 | 153.44 | 29.4708 | 45.015 | 86.4787 | 325.3 | 46.3179 | 8.205 |
| 2023-07-17 17:58:00 | 153.45 | 29.4729 | 45.015 | 86.43 | 325.29 | 46.325 | 8.205 |
| 2023-07-17 17:59:00 | 153.45 | 29.4736 | 45.0326 | 86.45 | 325.515 | 46.295 | 8.205 |
| 2023-07-17 18:00:00 | 153.42 | 29.4789 | 45.035 | 86.4599 | 325.33 | 46.265 | 8.2001 |


In [49]:
dfs = {}
for ric in rics:
    df, cols = add_lags(data, ric, lags)
    dfs[ric] = df

In [51]:
regs = {}
for ric in rics:
    df = dfs[ric]
    reg = np.linalg.lstsq(df[cols].astype('float64'), df[ric].astype('float64'), rcond=-1)[0]
    regs[ric] = reg

In [52]:
rega = np.stack(list(regs.values()))  # a list avoids NumPy's deprecation warning

In [53]:
regd = pd.DataFrame(rega, columns=cols, index=rics)

**Intraday**, the optimal regression parameters show more variation.

In [54]:
regd.iplot(kind='bar')

In [55]:
regd.mean().iplot(kind='bar')

## Conclusions

Based on this tutorial, we can conclude that

* it is easy to retrieve **historical end-of-day and intraday data across asset classes** via the Eikon Data API,
* `Plotly` and `Cufflinks` make **financial data visualization** convenient and
* there is **support for the Random Walk Hypothesis** based on the OLS regression analysis (for daily data and, somewhat less clearly, for intraday data).

## Eikon Data API Developer Resources

* [Overview](https://developers.thomsonreuters.com/eikon-data-apis) 
* [Quick Start](https://developers.thomsonreuters.com/eikon-data-apis/quick-start)
* [Documentation](https://developers.thomsonreuters.com/eikon-data-apis/docs)
* [Downloads](https://developers.thomsonreuters.com/eikon-data-apis/downloads)
* [Tutorials](https://developers.thomsonreuters.com/eikon-data-apis/learning)
* [Q&A Forums](https://developers.thomsonreuters.com/eikon-data-apis/qa) 

Data Item Browser Application: Type `DIB` into Eikon Search Bar.
