# Ch.2 Financial Data Structures

---

## 2.1 Motivation

We will learn how to work with unstructured financial data, and from that to derive a structured dataset amenable to ML algorithms.

## 2.2 Essential Types of Financial Data




In [7]:
import numpy as np
import pandas as pd

# The last two columns have only four entries; they are padded with '' so all columns have equal length
pd.DataFrame({'Fundamental Data': ['Assets', 'Liabilities', 'Sales', 'Costs/earnings', 'Macro variables'],
              'Market Data': ['Price/Yield/Implied volatility', 'Volume', 'Dividend/coupons', 'Open interest', 'Quotes/cancellations'],
              'Analytics': ['Analyst recommendations', 'Credit ratings', 'Earnings expectation', 'News sentiment', ''],
              'Alternative Data': ['Satellite/CCTV images', 'Google searches', 'Twitter/chats', 'Metadata', '']})

| | Fundamental Data | Market Data | Analytics | Alternative Data |
|---|---|---|---|---|
| 0 | Assets | Price/Yield/Implied volatility | Analyst recommendations | Satellite/CCTV images |
| 1 | Liabilities | Volume | Credit ratings | Google searches |
| 2 | Sales | Dividend/coupons | Earnings expectation | Twitter/chats |
| 3 | Costs/earnings | Open interest | News sentiment | Metadata |
| 4 | Macro variables | Quotes/cancellations | | |


### 2.2.1 Fundamental Data

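Information that can be found in regulatory filings and business analytics. Because it is reported with a lapse, you must confirm exactly when each data point was released. Once you align the data correctly, a substantial number of findings in papers that assumed the data was available at the end of the reporting period cannot be reproduced.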
- Mostly accounting data 
- Reported with a lapse
- Bloomberg

A second aspect of fundamental data is that it is often backfilled or reinstated.
- Backfilling: Missing data is assigned a value, even if those values were unknown at that time.
- Reinstated Value: A corrected value that amends an incorrect initial release.
- The corrected values were not known on that first release date.

Fundamental data is extremely regularized and low frequency. Being so accessible to the marketplace, it is rather unlikely that there is much value left to be exploited.

### 2.2.2 Market Data

Market data includes all trading activity that takes place in an exchange or trading venue. Every market participant leaves a characteristic footprint in the trading records, and with enough patience, you will find a way to anticipate a competitor's next move.

### 2.2.3 Analytics
What characterizes analytics is not the content of the information, but that it is not readily available from an original source, and that it has been processed for you in a particular way. Investment banks and research firms sell valuable information that results from in-depth analysis of companies' business models, activities, competition, outlook, etc.

### 2.2.4 Alternative Data

Alternative data is primary information that has not yet made it to the other sources.

## 2.3 Bars

---

To apply ML algorithms to unstructured data, we need to:

1. Parse it
2. Extract valuable information from it
3. Store those extractions in a regularized format


Most ML algorithms assume a table representation of the extracted data.
- Finance practitioners often refer to those tables' rows as "bars".
- Bars fall into two categories: standard bars and information-driven bars.

### 2.3.1 Standard Bars

Standard bar methods transform a series of observations that arrive at irregular frequency (often referred to as an "inhomogeneous series") into a homogeneous series derived from regular sampling.

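#### 2.3.1.1 Time Bars (by time)

Time bars are obtained by sampling information at fixed time intervals. The information collected usually includes:
- Timestamp
- Volume-weighted average price (VWAP)
- Open, high, low, and close prices, plus volume traded (OHLCV)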

**Weaknesses**
1. Markets do not process information at a constant time interval.
   - Time bars oversample information during low-activity periods and undersample information during high-activity periods.

2. Time-sampled series often exhibit poor statistical properties, like serial correlation, heteroscedasticity, and non-normality of returns.
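
As a minimal sketch of time-bar construction, the following aggregates a small tick table into one-minute bars with pandas. The column names `price` and `volume`, the helper name `time_bars`, and the toy data are assumptions made for illustration; they do not come from the text.

```python
import pandas as pd

def time_bars(ticks: pd.DataFrame, freq: str = "1min") -> pd.DataFrame:
    """Aggregate ticks (DatetimeIndex, columns 'price' and 'volume') into time bars."""
    grouped = ticks.resample(freq)
    bars = grouped["price"].ohlc()                 # open, high, low, close
    bars["volume"] = grouped["volume"].sum()
    dollar = (ticks["price"] * ticks["volume"]).resample(freq).sum()
    bars["vwap"] = dollar / bars["volume"]         # volume-weighted average price
    return bars.dropna(subset=["open"])            # drop intervals with no trades

# Hypothetical ticks: two in the first minute, one in the second, none in the third
idx = pd.to_datetime([
    "2024-01-02 09:30:05", "2024-01-02 09:30:40",
    "2024-01-02 09:31:10", "2024-01-02 09:33:00",
])
ticks = pd.DataFrame({"price": [100.0, 101.0, 102.0, 99.0],
                      "volume": [10, 30, 20, 5]}, index=idx)
bars = time_bars(ticks)
print(bars)
```

Note how the empty 09:32 interval has to be dropped explicitly: the clock, not market activity, decides when a bar is formed.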

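#### 2.3.1.2 Tick Bars (by number of transactions)

Tick bars sample an observation every time a pre-defined number of transactions takes place. Sampling as a function of the number of transactions exhibits desirable statistical properties:
- Returns are closer to a Gaussian distribution
- Outliers remain a concern when constructing tick bars (for example, an exchange auction may be reported as a single large tick)
- Returns are closer to IID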
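
A minimal sketch of tick bars, assuming a tick table with hypothetical `price` and `volume` columns: every `ticks_per_bar` consecutive transactions are grouped into one bar, regardless of how much time elapsed.

```python
import numpy as np
import pandas as pd

def tick_bars(ticks: pd.DataFrame, ticks_per_bar: int) -> pd.DataFrame:
    """Group every `ticks_per_bar` consecutive ticks into one OHLC bar."""
    bar_id = np.arange(len(ticks)) // ticks_per_bar   # 0,0,...,1,1,...
    grouped = ticks.groupby(bar_id)
    bars = grouped["price"].ohlc()
    bars["volume"] = grouped["volume"].sum()
    return bars

# Hypothetical ticks; the last bar is incomplete (one tick only)
tick_table = pd.DataFrame({"price": [10.0, 11.0, 12.0, 11.0, 13.0],
                           "volume": [1, 2, 3, 4, 5]})
tbars = tick_bars(tick_table, ticks_per_bar=2)
print(tbars)
```

Whether to keep or discard the trailing incomplete bar is a design choice left open here.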

#### 2.3.1.3 Volume Bars (by traded volume)

One problem with tick bars is that order fragmentation introduces some arbitrariness in the number of ticks.
- Volume bars circumvent that problem by sampling every time a pre-defined amount of the security's units (shares, futures contracts, etc.) has been exchanged.
- Sampling returns by volume achieves even better statistical properties (i.e., closer to an IID Gaussian distribution) than sampling by tick bars.

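#### 2.3.1.4 Dollar Bars (by traded value)

Dollar bars sample an observation every time a pre-defined market value is exchanged.
- The number of shares traded is a function of the actual value exchanged, so sampling by value is more stable than sampling by volume when prices fluctuate significantly.
- The number of outstanding shares often changes multiple times over the course of a security's life, as a result of corporate actions. Dollar bars tend to be robust to those changes.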
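
Volume bars and dollar bars differ only in the quantity being accumulated, so both can be sketched with one helper. The function name `threshold_bars`, the `by` switch, and the column names are assumptions for illustration; a bar closes on the tick at which the running total reaches the threshold.

```python
import numpy as np
import pandas as pd

def threshold_bars(ticks: pd.DataFrame, threshold: float, by: str = "dollar") -> pd.DataFrame:
    """Close a bar each time cumulative volume or dollar value reaches `threshold`."""
    if by == "volume":
        amount = ticks["volume"].to_numpy(dtype=float)
    elif by == "dollar":
        amount = (ticks["price"] * ticks["volume"]).to_numpy(dtype=float)
    else:
        raise ValueError("by must be 'volume' or 'dollar'")
    bar_id = np.empty(len(ticks), dtype=int)
    current, running = 0, 0.0
    for k, a in enumerate(amount):
        bar_id[k] = current
        running += a
        if running >= threshold:      # threshold reached: next tick starts a new bar
            current += 1
            running = 0.0
    grouped = ticks.groupby(bar_id)
    bars = grouped["price"].ohlc()
    bars["volume"] = grouped["volume"].sum()
    return bars

# Hypothetical ticks: the price doubles mid-sample, volume per tick is constant
sample = pd.DataFrame({"price": [10.0, 10.0, 20.0, 20.0, 10.0],
                       "volume": [5, 5, 5, 5, 5]})
volume_bars = threshold_bars(sample, threshold=10, by="volume")
dollar_bars = threshold_bars(sample, threshold=100, by="dollar")
print(len(volume_bars), len(dollar_bars))
```

With the price doubled, dollar bars close after half as many shares, which is the sense in which the number of shares traded depends on the value exchanged.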

### 2.3.2 Information-Driven Bars

The purpose of information-driven bars is to sample more frequently when new information arrives to the market. By synchronizing sampling with the arrival of informed traders, we may be able to make decisions before prices reach a new equilibrium level.

#### 2.3.2.1 Tick Imbalance Bars

Consider a sequence of ticks $\{(p_t, v_t)\}_{t=1,\dots,T}$, where $p_t$ is the price associated with tick $t$ and $v_t$ is the volume associated with tick $t$.
The tick rule defines a sequence $\{b_t\}_{t=1,\dots,T}$ where

$$b_t = \begin{cases}
b_{t-1}, & \text{if }\Delta p_t = 0 \\
\frac{|\Delta p_t|} {\Delta p_t}, & \text{if } \Delta p_t \neq 0
\end{cases}$$ 

with $b_t \in \{-1, 1\}$, and the boundary condition that $b_0$ matches the terminal value $b_T$ of the immediately preceding bar. The idea behind tick imbalance bars (TIBs) is to sample bars whenever tick imbalances exceed our expectations.
- We wish to determine the tick index, $T$, such that the accumulation of signed ticks (signed according to the tick rule) exceeds a given threshold.
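
The tick rule can be sketched in a few lines. The function name `tick_rule` and the default `b0=1` are assumptions for illustration; the first tick has no previous price, so it simply inherits `b0`.

```python
import numpy as np

def tick_rule(prices, b0=1):
    """Sign each tick: +1 on an uptick, -1 on a downtick, previous sign if unchanged."""
    prices = np.asarray(prices, dtype=float)
    b = np.empty(len(prices), dtype=int)
    prev = b0                                  # sign carried over from the preceding bar
    for t in range(len(prices)):
        dp = prices[t] - prices[t - 1] if t > 0 else 0.0
        if dp != 0:
            prev = 1 if dp > 0 else -1
        b[t] = prev
    return b

signs = tick_rule([10, 10, 11, 11, 10, 10, 12])
print(signs)
```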

1. Define the tick imbalance at time $T$ as

$$\theta_T = \sum_{t=1}^T b_t$$

2. Compute the expected value of $\theta_T$ at the beginning of the bar, 

   $$E_0[\theta_T] = E_0[T](P[b_t=1]-P[b_t=-1])$$ 
   
   where $E_0[T]$ is the expected size of the tick bar, $P[b_t=1]$ is the unconditional probability that a tick is classified as a buy, $P[b_t=-1]$ is the unconditional probability that a tick is classified as a sell.
   - $E_0[\theta_T]=E_0[T](2P[b_t=1]-1)$  


3. Define a TIB as a $T^*$-contiguous subset of ticks such that the following condition is met:

 $$T^*= \underset{T}{\operatorname{arg\,min}} \{|\theta_T| \ge E_0[T]\,|2P[b_t=1]-1|\}$$

 where the size of the expected imbalance is implied by $|2P[b_t=1]-1|$.
 - When $\theta_T$ is more imbalanced than expected, a low $T$ will satisfy this condition, so TIBs are produced more frequently in the presence of informed trading.
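
The three steps above can be sketched as follows. This is a simplified illustration, not a full implementation: it assumes $E_0[T]$ and $P[b_t=1]$ are supplied as fixed inputs (`expected_T`, `p_buy`), whereas in practice they would be re-estimated as bars are formed. The function name is hypothetical.

```python
def tick_imbalance_bar_ends(b, expected_T, p_buy):
    """Return the tick positions at which a tick imbalance bar closes.

    b          : sequence of tick signs in {-1, 1}
    expected_T : assumed E_0[T], the expected number of ticks per bar
    p_buy      : assumed P[b_t = 1]
    """
    threshold = expected_T * abs(2 * p_buy - 1)   # E_0[T] |2P[b_t=1] - 1|
    ends, theta = [], 0
    for t, bt in enumerate(b):
        theta += bt                               # running imbalance theta_T
        if abs(theta) >= threshold:
            ends.append(t)                        # bar closes on this tick
            theta = 0                             # start a new bar
    return ends

b_seq = [1, -1, 1, 1, 1, -1, -1, -1, -1, 1]
tib_ends = tick_imbalance_bar_ends(b_seq, expected_T=6, p_buy=0.75)
print(tib_ends)
```

Replacing `theta += bt` with `theta += bt * v_t` and the threshold with its volume-weighted counterpart gives the same skeleton for volume or dollar imbalance bars.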

#### 2.3.2.2 Volume/Dollar Imbalance Bars

The idea behind volume imbalance bars (VIBs) and dollar imbalance bars (DIBs) is to extend the concept of TIBs.
- Sample bars when volume or dollar imbalances diverge from our expectations.

1. Define the imbalance at time $T$ as

   $$\theta_T = \sum_{t=1}^T b_t v_t$$

   where $v_t$ is the number of securities traded (VIB) or the dollar amount exchanged (DIB).
   


2. Compute the expected value of $\theta_T$ at the beginning of the bar,

   $$E_0[\theta_T]=E_0\left[\sum_{t|b_t=1}^T v_t\right]-E_0\left[\sum_{t|b_t=-1}^T v_t\right]=E_0[T]\left(P[b_t=1] E_0[v_t|b_t=1]-P[b_t=-1] E_0[v_t|b_t=-1]\right)$$

   Let $v^+=P[b_t=1]E_0[v_t|b_t=1]$ and $v^-=P[b_t=-1]E_0[v_t|b_t=-1]$, so that $v^+ + v^-=E_0[T]^{-1} E_0\left[\sum_t v_t\right]=E_0[v_t]$. Then
   - $E_0[\theta_T]=E_0[T](v^+ - v^-)=E_0[T](2v^+ - E_0[v_t])$



3. Define a VIB or DIB as a $T^*$-contiguous subset of ticks such that the following condition is met:

   $$T^*= \underset{T}{\operatorname{arg\,min}} \{|\theta_T| \ge E_0[T]\,|2v^+ - E_0[v_t]|\}$$

   - When $\theta_T$ is more imbalanced than expected, a low $T$ will satisfy this condition.
   - Like volume and dollar bars, VIBs and DIBs address the concerns about tick fragmentation and outliers.
   - They also address the issue of corporate actions, because the bar size is not held constant but adjusted dynamically.

#### 2.3.2.3 Tick Runs Bars

Large traders will sweep the order book, use iceberg orders, or slice a parent order into multiple children, all of which leave a trace of runs in the $\{b_t\}_{t=1,\dots,T}$ sequence.
- It can be useful to monitor the sequence of buys in the overall volume, and take samples when that sequence diverges from our expectations.

1. Define the length of the current run as

   $$\theta_T=\max\left\{\sum_{t|b_t=1}^T b_t, -\sum_{t|b_t=-1}^T b_t \right\}$$

2. Compute the expected value of $\theta_T$ at the beginning of the bar,

   $$E_0[\theta_T]=E_0[T]\max\{P[b_t=1], 1-P[b_t=1] \}$$

3. Define a tick runs bar (TRB) as a $T^*$-contiguous subset of ticks such that the following condition is met:

   $$T^*= \underset{T}{\operatorname{arg\,min}} \{\theta_T \ge E_0[T]\max\{P[b_t=1], 1-P[b_t=1] \}\}$$

   where the expected count of ticks from runs is implied by $\max\{P[b_t=1], 1-P[b_t=1] \}$.
   - When $\theta_T$ exhibits more runs than expected, a low $T$ will satisfy this condition.
   - Sequence breaks: Instead of measuring the length of the longest sequence, we count the number of ticks on each side, without offsetting them (no imbalance).
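
A minimal sketch of the tick runs bar definition, again assuming fixed inputs for $E_0[T]$ and $P[b_t=1]$ (hypothetical parameters `expected_T` and `p_buy`). Buys and sells are counted separately within the bar and are never netted against each other, which is what allows sequence breaks.

```python
def tick_runs_bar_ends(b, expected_T, p_buy):
    """Return the tick positions at which a tick runs bar closes."""
    threshold = expected_T * max(p_buy, 1 - p_buy)   # E_0[T] max{P, 1-P}
    ends, buys, sells = [], 0, 0
    for t, bt in enumerate(b):
        if bt == 1:
            buys += 1
        else:
            sells += 1
        if max(buys, sells) >= threshold:            # theta_T reached the threshold
            ends.append(t)
            buys = sells = 0                         # start a new bar
    return ends

runs_seq = [1, -1, 1, 1, -1, -1, -1, 1, -1]
trb_ends = tick_runs_bar_ends(runs_seq, expected_T=4, p_buy=0.75)
print(trb_ends)
```

In the first bar, the single sell at position 1 does not reduce the buy count: the bar closes at the third buy.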

#### 2.3.2.4 Volume / Dollar Runs Bars



The intuition is that we wish to sample bars whenever the volumes or dollars traded by one side exceed our expectation for a bar. 

1. Define the volumes or dollars associated with a run as

   $$\theta_T=\max\left\{\sum_{t|b_t=1}^T b_t v_t, -\sum_{t|b_t=-1}^T b_t v_t \right\}$$

   where $v_t$ is the number of securities traded (VRB) or the dollar amount exchanged (DRB).

2. Compute the expected value of $\theta_T$ at the beginning of the bar,

   $$E_0 [\theta_T] = E_0 [T] \max\{P[b_t=1] E_0 [v_t|b_t =1], (1-P[b_t=1]) E_0[v_t|b_t = -1] \}$$

3. Define a volume runs bar (VRB) or dollar runs bar (DRB) as a $T^*$-contiguous subset of ticks such that the following condition is met:

   $$T^*= \underset{T}{\operatorname{arg\,min}} \{\theta_T \ge E_0 [T] \max\{P[b_t=1] E_0[v_t|b_t =1], (1-P[b_t=1]) E_0 [v_t|b_t =-1] \}\}$$
   

## 2.4 Dealing with Multi-Product Series

Sometimes we are interested in modelling a time series of instruments, where the weights need to be dynamically adjusted over time.
- Events that alter the nature of the series under study must be treated properly; otherwise we will inadvertently introduce a structural break that will mislead our research efforts.
- "ETF trick": The goal is to transform any complex multi-product dataset into a single dataset that resembles a total-return ETF.
- This is useful because your code can always assume that you only trade cash-like products (non-expiring cash instruments), regardless of the complexity and composition of the underlying series.

### 2.4.1 The ETF Trick

Suppose we wish to develop a strategy that trades a spread of futures. Dealing with a spread rather than an outright instrument raises several issues:

1. The spread is characterized by a vector of weights that changes over time.
   - The spread itself may converge even if prices do not change.
   - A model will be misled to believe that PnL has resulted from that weight-induced convergence. 
   
   
2. Spreads can acquire negative values, because they do not represent a price.


3. Trading times will not align exactly for all constituents, so the spread is not always tradeable at the last levels published, or with zero latency risk. 


4. Execution costs must be considered, like crossing the bid-ask spread. 


One way to avoid these issues is to produce a time series that reflects the value of 1 dollar invested in a spread.
Suppose that we are given a history of bars, containing the following columns:

- $o_{i,t}$: The raw open price of instrument $i=1, \dots, I$ at bar $t=1, \dots, T$
- $p_{i,t}$: The raw close price of instrument $i=1, \dots, I$ at bar $t=1, \dots, T$
- $\psi_{i,t}$: The USD value of one point of instrument $i=1, \dots, I$ at bar $t=1, \dots, T$ (includes the foreign exchange rate)
- $v_{i,t}$: The volume of instrument $i=1, \dots, I$ at bar $t=1, \dots, T$
- $d_{i,t}$: The carry, dividend, or coupon paid by instrument $i=1, \dots, I$ at bar $t=1, \dots, T$ (it can also be used to charge margin costs or costs of funding)

For a basket of futures characterized by an allocations vector $w_t$ rebalanced (or rolled) on bars $B \subseteq \{1,\dots,T\}$, the 1 dollar investment value $\{K_t\}$ is derived as

$$h_{i,t} = \begin{cases}
\frac{w_{i,t} K_t}{o_{i,t+1} \psi_{i,t} \sum_{i=1}^I |w_{i,t}|}, & \text{if }t \in B \\
h_{i,t-1}, & \text{otherwise} \end{cases} $$

$$\delta_{i,t} = \begin{cases}
p_{i,t}-o_{i,t}, & \text{if }t-1 \in B \\
\Delta p_{i,t}, & \text{otherwise} \end{cases} $$

$$K_t = K_{t-1} + \sum_{i=1}^I h_{i,t-1} \psi_{i,t}(\delta_{i,t} + d_{i,t})$$

with $K_0=1$, where
- $h_{i,t}$: The holdings (number of securities or contracts) of instrument $i$ at time $t$.
- $\delta_{i,t}$: The change in market value between $t-1$ and $t$ for instrument $i$.

Some remarks:
- PnL is reinvested whenever $t \in B$, which prevents negative values.
- The payments $d_{i,t}$ are already embedded in $K_t$, so the strategy does not need to account for them.
- The purpose of $w_{i,t} \left(\sum_{i=1}^I |w_{i,t}|\right)^{-1}$ is to de-lever the allocations.
- We may not know $p_{i,t}$ of the new contract at a roll time $t$, so we use $o_{i,t+1}$ as the closest in time.

Let $\tau_i$ be the transaction cost associated with trading 1 dollar of instrument $i$.

1. Rebalance costs $\{c_t\}$: the variable cost associated with the allocation rebalance.
   - $c_t = \sum_{i=1}^I (|h_{i,t-1}| p_{i,t} + |h_{i,t}| o_{i,t+1}) \psi_{i,t} \tau_{i}, \forall t \in B$

2. Bid-ask spread $\{\tilde{c}_t\}$: the cost of buying or selling one unit of this virtual ETF.
   - $\tilde{c}_t = \sum_{i=1}^I |h_{i,t-1}| p_{i,t} \psi_{i,t} \tau_{i}$

3. Volume $\{v_t\}$: the volume traded is determined by the least active member of the basket.
   - $v_t = \underset{i}{\operatorname{min}} \left\{ \frac{v_{i,t}}{|h_{i,t-1}|} \right\}$

Transaction cost functions are not necessarily linear. Thanks to the ETF trick, we can model a basket of futures (or a single futures contract) as if it were a single non-expiring cash product.
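
The recursion for $K_t$ can be sketched with NumPy arrays of shape `(bars, instruments)`. This is an illustrative sketch under stated assumptions: bars are zero-indexed, the first bar is treated as the initial allocation with $K=1$, `rebalance` is a boolean array marking the bars in $B$, and transaction costs are ignored. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def etf_trick(o, p, psi, d, w, rebalance):
    """Value of 1 dollar invested in the basket (no transaction costs).

    o, p, psi, d, w : arrays of shape (n_bars, n_instruments)
    rebalance       : boolean array of length n_bars marking rebalance/roll bars
    """
    o, p, psi, d, w = (np.asarray(x, dtype=float) for x in (o, p, psi, d, w))
    rebalance = np.asarray(rebalance, dtype=bool)
    if not rebalance[0]:
        raise ValueError("the first bar must be a rebalance bar")
    n_bars = p.shape[0]
    K = np.ones(n_bars)
    h = np.zeros(p.shape[1])                      # holdings h_{i,t-1}
    for t in range(n_bars):
        if t > 0:
            # delta: close-to-open move right after a rebalance, close-to-close otherwise
            delta = p[t] - o[t] if rebalance[t - 1] else p[t] - p[t - 1]
            K[t] = K[t - 1] + np.sum(h * psi[t] * (delta + d[t]))
        if rebalance[t] and t + 1 < n_bars:
            # new holdings, de-levered by the sum of absolute weights
            h = w[t] * K[t] / (o[t + 1] * psi[t] * np.abs(w[t]).sum())
    return K

# Hypothetical single instrument rising 10% per bar, rebalanced on bars 0 and 2
o = [[100.0], [100.0], [110.0], [121.0]]
p = [[100.0], [110.0], [121.0], [133.1]]
ones, zeros = np.ones((4, 1)), np.zeros((4, 1))
K = etf_trick(o, p, psi=ones, d=zeros, w=ones, rebalance=[True, False, True, False])
print(K)
```

With a single fully invested instrument and no carry, `K` simply tracks the price normalized to 1, which is a quick sanity check on the recursion.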

### 2.4.2 PCA Weights (Principal Component Analysis Weights)

PCA is one way to derive the vector of weights $\{w_t\}$ used in the previous section.

Consider an IID multivariate Gaussian stochastic process characterized by a vector of means $\mu$, of size $N \times 1$, and a covariance matrix $V$, of size $N \times N$. We would like to compute the vector of allocations $\omega$ that conforms to a particular distribution of risks across $V$'s principal components.

1. We perform a spectral decomposition, $VW = W\Lambda$, where the columns of $W$ are the eigenvectors of $V$ and $\Lambda$ is the diagonal matrix of eigenvalues.


2. Given a vector of allocations $\omega$, we can compute the portfolio's risk as


   $$\sigma ^2 = \omega'V\omega = \omega'W\Lambda W'\omega = \beta'\Lambda\beta = (\Lambda^{1/2}\beta)'(\Lambda^{1/2}\beta)$$
   - where $\beta = W'\omega$ is the projection of $\omega$ on the orthogonal basis.


3. Since $\Lambda$ is a diagonal matrix, $\sigma^2 = \sum_{n=1}^N\beta_n^2\Lambda_{n,n}$, and the risk attributed to the $n$th component is


   $$R_n = \beta_n^2\Lambda_{n,n}\sigma^{-2} = [W'\omega]_n^2\Lambda_{n,n}\sigma^{-2}$$
   - with $R'1_N=1$, where $1_N$ is a vector of $N$ ones.
   - We can interpret $\{R_n\}_{n=1,\dots,N}$ as the distribution of risks across the orthogonal components.

4. We would like to compute the vector $\omega$ that delivers a user-defined risk distribution $R$.
   - $\beta = (\sigma\sqrt{R_n/\Lambda_{n,n}})_{n=1,\dots,N}$: The allocation in the new (orthogonal) basis.

5. The allocation in the old basis is given by $\omega = W\beta$. Re-scaling $\omega$ merely re-scales $\sigma$, which keeps the risk distribution constant.
   - Figure 2.2 illustrates the contribution to risk per principal component for an inverse variance allocation.
   - For the PCA portfolio, only the component with lowest variance contributes risk.

![nn](Image/Figure2.2.jpg)

---

### Snippet 2.1 PCA Weights from a Risk Distribution R

In [1]:
def pcaWeights(cov, riskDist=None, riskTarget=1.):
    # Following the riskAlloc distribution, match riskTarget
    eVal, eVec=np.linalg.eigh(cov) # must be Hermitian
    indices=eVal.argsort()[::-1] # arguments for sorting eVal desc
    eVal, eVec=eVal[indices], eVec[:,indices]
    if riskDist is None:
        riskDist=np.zeros(cov.shape[0])
        riskDist[-1]=1
    loads=riskTarget*(riskDist/eVal)**.5
    wghts=np.dot(eVec,np.reshape(loads,(-1,1)))
    #ctr=(loads/riskTarget)**2*eVal # verify riskDist
    return wghts
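
As a sanity check on the derivation, the sketch below restates the same computation in a self-contained form (so it runs on its own) and verifies that the resulting weights reproduce the requested risk distribution and risk target. The names `pca_weights` and `risk_distribution` and the 2x2 covariance matrix are assumptions made for illustration.

```python
import numpy as np

def pca_weights(cov, risk_dist=None, risk_target=1.0):
    """Weights whose risk is spread across principal components as `risk_dist`."""
    e_val, e_vec = np.linalg.eigh(cov)             # cov must be symmetric
    order = e_val.argsort()[::-1]                  # sort eigenvalues descending
    e_val, e_vec = e_val[order], e_vec[:, order]
    if risk_dist is None:                          # default: all risk on the smallest component
        risk_dist = np.zeros(cov.shape[0])
        risk_dist[-1] = 1.0
    loads = risk_target * np.sqrt(risk_dist / e_val)   # beta in the orthogonal basis
    return e_vec @ loads, e_val, e_vec

def risk_distribution(weights, e_val, e_vec):
    """R_n = beta_n^2 * Lambda_nn / sigma^2 for the given weights."""
    beta = e_vec.T @ weights
    contrib = beta ** 2 * e_val
    return contrib / contrib.sum()

cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])                     # hypothetical covariance matrix
target = np.array([0.7, 0.3])
w_pca, e_val, e_vec = pca_weights(cov, risk_dist=target, risk_target=0.1)
R = risk_distribution(w_pca, e_val, e_vec)
sigma = np.sqrt(w_pca @ cov @ w_pca)
print(R, sigma)
```

The recovered `R` matches `target` and the portfolio volatility matches `risk_target`, as the algebra in steps 3 to 5 implies.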


### 2.4.3 Single Future Roll

When dealing with a single futures contract, an equivalent and more direct approach is to form a time series of cumulative roll gaps, and subtract that gap series from the price series. The Bloomberg fields used in Snippet 2.2 are:
- FUT_CUR_GEN_TICKER: It identifies the contract associated with that price. Its value changes with every roll.
- PX_OPEN: The open price associated with that bar.
- PX_LAST: The close price associated with that bar.
- VWAP: The volume-weighted average price associated with that bar.

------

### Snippet 2.2 Form a Gap Series, Detract it from Prices

In [2]:
def getRolledSeries(pathIn, key):
    # Load the bars stored under `key` (e.g. 'bars/ES_10k') in the HDF5 file
    series=pd.read_hdf(pathIn, key=key)
    series['Time']=pd.to_datetime(series['Time'],format='%Y%m%d%H%M%S%f')
    series=series.set_index('Time')
    gaps=rollGaps(series)
    # Subtract the cumulative roll gaps from the price fields
    for fld in ['Close', 'VWAP']:series[fld]-=gaps
    return series

In [3]:
def rollGaps(series,dictio={'Instrument':'FUT_CUR_GEN_TICKER','Open':'PX_OPEN', 'Close':'PX_LAST'},matchEnd=True):
    # Compute gaps at each roll, between previous close and next open
    rollDates=series[dictio['Instrument']].drop_duplicates(keep='first').index
    gaps=series[dictio['Close']]*0
    iloc=list(series.index)
    iloc=[iloc.index(i)-1 for i in rollDates] # index of days prior to roll
    gaps.loc[rollDates[1:]]=series[dictio['Open']].loc[rollDates[1:]]-series[dictio['Close']].iloc[iloc[1:]].values
    gaps=gaps.cumsum()
    if matchEnd:gaps-=gaps.iloc[-1] # roll backward
    return gaps

Rolled prices are used for simulating PnL and portfolio mark-to-market values.
- However, raw prices should still be used to size positions and determine capital consumption.

##### Non-Negative Rolled Series

1. Compute a time series of rolled futures prices.
2. Compute the return ($r$) as the rolled price change divided by the previous raw price.
3. Form a price series using those returns, i.e., `(1+r).cumprod()`.

-----

### Snippet 2.3 Non-Negative Rolled Price Series

In [8]:
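# filePath is a placeholder: set it to the CSV file of raw futures bars before running
raw=pd.read_csv(filePath,index_col=0,parse_dates=True)
gaps=rollGaps(raw,dictio={'Instrument':'Symbol','Open':'Open','Close':'Close'})
rolled=raw.copy(deep=True)
for fld in ['Open', 'Close']:rolled[fld]-=gaps
# Returns: rolled price change divided by the previous raw price
rolled['Returns']=rolled['Close'].diff()/raw['Close'].shift(1)
# Non-negative price series built from those returns
rolled['rPrices']=(1+rolled['Returns']).cumprod()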

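
Since the snippet above depends on a data file, here is a self-contained sketch of the same two ideas (cumulative roll gaps and the non-negative rolled series) on synthetic bars. The helper `roll_gaps` is a simplified restatement written for this example: it detects a roll wherever the ticker changes from one bar to the next, and the column names and prices are made up.

```python
import pandas as pd

def roll_gaps(bars, instrument="Symbol", open_="Open", close="Close", match_end=True):
    """Cumulative gaps between the previous close and the open at each roll."""
    roll_mask = bars[instrument] != bars[instrument].shift(1)
    roll_mask.iloc[0] = False                      # the first bar is not a roll
    gaps = pd.Series(0.0, index=bars.index)
    prev_close = bars[close].shift(1)
    gaps.loc[roll_mask] = bars.loc[roll_mask, open_] - prev_close[roll_mask]
    gaps = gaps.cumsum()
    if match_end:
        gaps -= gaps.iloc[-1]                      # roll backward: last prices stay unchanged
    return gaps

# Hypothetical bars with one roll (ESH -> ESM) and a 3-point gap at the roll
idx = pd.date_range("2024-03-11", periods=5, freq="D")
raw = pd.DataFrame({"Symbol": ["ESH", "ESH", "ESM", "ESM", "ESM"],
                    "Open":  [100.0, 101.0, 105.0, 106.0, 107.0],
                    "Close": [101.0, 102.0, 106.0, 107.0, 108.0]}, index=idx)
gaps = roll_gaps(raw)
rolled_close = raw["Close"] - gaps

# Non-negative series: rolled price change over the previous raw price, then compound
returns = rolled_close.diff() / raw["Close"].shift(1)
r_prices = (1 + returns).cumprod()
print(rolled_close.tolist())
```

The rolled close series no longer jumps at the roll, while the compounded series stays positive by construction as long as each return exceeds -100%.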

## 2.5 Sampling Features

##### Reasons for Sampling Features

1. Several ML algorithms do not scale well with sample size (e.g., SVMs).
2. ML algorithms achieve highest accuracy when they attempt to learn from relevant examples.
   - We discuss ways of sampling bars to produce a features matrix with relevant training examples.

### 2.5.1 Sampling for Reduction

One reason for sampling features from a structured dataset is to reduce the amount of data used to fit the ML algorithm.
- Downsampling 
  - Linspace sampling: Sequential sampling at a constant size
  - Uniform sampling: Sampling randomly using a uniform distribution
- The sample does not necessarily contain the subset of most relevant observations in terms of their predictive power or informational content.
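
The two downsampling schemes can be sketched on any bar table. The helper names, the seed, and the toy DataFrame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def linspace_sample(bars: pd.DataFrame, step: int) -> pd.DataFrame:
    """Sequential sampling: keep every `step`-th bar."""
    return bars.iloc[::step]

def uniform_sample(bars: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Random sampling: draw `n` bars uniformly without replacement, kept in time order."""
    rng = np.random.default_rng(seed)
    positions = np.sort(rng.choice(len(bars), size=n, replace=False))
    return bars.iloc[positions]

bar_table = pd.DataFrame({"close": np.arange(100, dtype=float)})   # hypothetical bars
lin = linspace_sample(bar_table, step=10)
uni = uniform_sample(bar_table, n=10)
print(lin.index.tolist())
```

Neither scheme looks at the content of the bars, which is why the retained subset is not guaranteed to be the most informative one.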

### 2.5.2 Event-Based Sampling 

Portfolio managers typically place a bet after some event takes place (a structural break, an extracted signal, or microstructural phenomena). We can characterize such an event as significant, and let the ML algorithm learn whether there is an accurate prediction function under those circumstances.

#### 2.5.2.1 The CUSUM Filter

The CUSUM filter is a quality-control method, designed to detect a shift in the mean value of a measured quantity away from a target value. Consider IID observations $\{y_t\}_{t=1,\dots,T}$ arising from a locally stationary process.

1. Define the cumulative sums $S_t = \operatorname{max}\{0, S_{t-1}+y_t-E_{t-1}[y_t]\}$ with boundary condition $S_0=0$.
   - Recommend an action at the first $t$ satisfying $S_t \ge h$, for some threshold $h$ (the filter size).


2. Note that $S_t=0$ whenever $y_t \le E_{t-1}[y_t]-S_{t-1}$. This zero floor means that we will skip some downward deviations that otherwise would make $S_t$ negative.
   - The filter is set up to identify a sequence of upside divergences from any reset level zero.
   - The threshold is activated when $$S_t \ge h \Leftrightarrow \exists \tau \in [1,t] \,\Big|\, \sum_{i=\tau}^t(y_i-E_{i-1}[y_i]) \ge h$$

3. Extending run-ups to include run-downs gives the symmetric CUSUM filter:
$$S_t^+ = \operatorname{max}\{0, S_{t-1}^+ + y_t - E_{t-1}[y_t]\}, \quad S_0^+=0$$
$$S_t^- = \operatorname{min}\{0, S_{t-1}^- + y_t - E_{t-1}[y_t]\}, \quad S_0^-=0$$
$$S_t = \operatorname{max}\{S_t^+,-S_t^-\}$$

   - We will sample a bar $t$ if and only if $S_t \ge h$, at which point $S_t$ is reset.

---

### Snippet 2.4 The Symmetric CUSUM Filter

In [9]:
def getTEvents(gRaw,h):
    tEvents, sPos, sNeg=[], 0, 0
    diff=gRaw.diff()
    for i in diff.index[1:]:
        sPos, sNeg=max(0, sPos+diff.loc[i]), min(0, sNeg+diff.loc[i])
        if sNeg < -h:
            sNeg=0; tEvents.append(i)
        elif sPos > h:
            sPos=0; tEvents.append(i)
    return pd.DatetimeIndex(tEvents)

One practical aspect that makes CUSUM filters appealing is that multiple events are not triggered by `gRaw` hovering around a threshold level, a flaw suffered by popular market signals such as Bollinger bands. The variable $S_t$ could be based on any of the features, like structural break statistics, entropy, or market microstructure measurements. Once we have obtained this subset of event-driven bars, we will let the ML algorithm determine whether the occurrence of such events constitutes actionable intelligence.
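
To see the filter in action, the sketch below restates the symmetric CUSUM logic in a self-contained function (so the example runs on its own) and applies it to a short synthetic price series. As in the snippet above, the expectation $E_{t-1}[y_t]$ is taken to be the previous observation, so the filter accumulates first differences. The function name, prices, and threshold are made up for illustration.

```python
import pandas as pd

def cusum_events(series: pd.Series, h: float) -> pd.DatetimeIndex:
    """Timestamps at which the symmetric CUSUM filter triggers."""
    events, s_pos, s_neg = [], 0.0, 0.0
    diff = series.diff()
    for t in diff.index[1:]:
        s_pos = max(0.0, s_pos + diff.loc[t])      # cumulative run-up, floored at 0
        s_neg = min(0.0, s_neg + diff.loc[t])      # cumulative run-down, capped at 0
        if s_neg < -h:
            s_neg = 0.0
            events.append(t)
        elif s_pos > h:
            s_pos = 0.0
            events.append(t)
    return pd.DatetimeIndex(events)

idx = pd.date_range("2024-01-01", periods=8, freq="D")
prices = pd.Series([100, 101, 102, 103, 102, 100, 98, 99], index=idx, dtype=float)
events = cusum_events(prices, h=2.5)
print(events)
```

The run-up of three points triggers one event on 2024-01-04, and the subsequent decline triggers one event on 2024-01-06; the further drop on 2024-01-07 does not, because the accumulator was reset after the previous trigger.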