## Dynamic Mode Decomposition (DMD)

> DMD seeks to approximate the dynamics of a system via a linear transformation. DMD's results can be used for predicting the next step in a dynamic system. A few applications of DMD include fluid dynamics, video processing, and financial modeling. In each case, a system is evolving with time and is assumed to contain some underlying structures that motivate its progression. For video processing, each frame in a video represents a single sample in time. In finance, the price of a certain stock observed at some timeframe could represent the data we wish to model.

The introduction and overview that follows borrows heavily from lecture notes and material derived from __[1]__ below: "Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control" by Brunton and Kutz. For a more in-depth overview, I highly recommend you give it a read.

If we let $\vec{x}$ represent our state variable (the quantity we wish to measure, like features of a stock or the pixels in a single frame of a video), then the evolution of the behavior of $\vec{x}$ through time can be most generally modeled as follows:

$$
\frac{d\vec{x}}{dt} = f(\vec{x}, t, \vec{\mu})
$$

with $f$ being a non-linear function of the state, time, and possibly some parameters $\vec{\mu}$. The left-hand side is the derivative of the state vector and represents precisely how the state changes at a given time $t$. The discovery and study of $f$ is heavily dependent on the context and represents entire fields of study. One crucial simplification DMD makes is that $f$ is linear. The hope is this yields some reasonable approximations for the system. This notebook will attempt to apply DMD to stock market data, in which it is reasonable to assume that the specific time the data is measured will have no bearing on the price itself. With these two simplifications (linearity and no explicit dependence on time), the dynamics are linear with respect to the components of the system and can be described as follows:

$$
\frac{d\vec{x}}{dt} = A\vec{x}
$$

The state vector $\vec{x}$ in our context will represent whichever features we choose to use for the value of the stock over some interval of time (which will represent one observation at a single point in time). We will use several such features, and hence assume $\vec{x} \in \mathbb{R}^n$ with $n \gg 1$. So, $A$ has shape $n \times n$.

This simplified system is a linear system of ordinary differential equations with constant coefficients. So, assuming $A$ is diagonalizable, the solution can be expressed in terms of the eigenvalues and eigenvectors of $A$. In general, the solution to this system can be written as:

$$
\vec{x}(t) = e^{At}\vec{x}(0)
$$

where $\vec{x}(0)$ represents the initial condition (starting state at time $t = 0$). By writing $A = V\Lambda V^{-1}$ with $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$ containing the eigenvalues and the columns of $V$ containing the eigenvectors, we can rewrite this as follows:

$$
\vec{x}(t) = V
\begin{pmatrix}
e^{\lambda_1 t} & 0 & \cdots & 0 \\
0 & e^{\lambda_2 t} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & e^{\lambda_n t}
\end{pmatrix}
V^{-1} \vec{x}(0)
$$

Setting $V = [\vec{v}_1 \; \vec{v}_2 \; \dots \vec{v}_n]$, we have:

$$
\vec{x}(t) = \sum_{i = 1} ^ {n} b_i \vec{v}_i e^{\lambda_i t}
$$

with the weights $b_i$ calculated from the product $\vec{b} = V^{-1} \vec{x}(0)$. This is equivalent to solving the equation $\vec{x}(0) = V \vec{b}$, where $\vec{b}$ holds the coefficients of the linearly independent solutions, obtained by applying the initial conditions as in any other ODE problem.
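As a quick numerical check of the eigen-expansion above, here is a minimal sketch in NumPy. The function name `solve_linear_ode` and the example matrix are illustrative assumptions, not part of the original material, and the sketch assumes $A$ is diagonalizable.

```python
import numpy as np

def solve_linear_ode(A, x0, t):
    """Evaluate x(t) = sum_i b_i v_i exp(lambda_i t) for dx/dt = A x."""
    eigvals, V = np.linalg.eig(A)                 # columns of V are the eigenvectors
    b = np.linalg.solve(V, x0.astype(complex))    # weights from x(0) = V b
    x_t = V @ (b * np.exp(eigvals * t))           # sum of weighted, time-evolved modes
    return x_t.real                               # imaginary parts cancel for real A, x0

# Hypothetical example: a pure rotation, dx1/dt = x2, dx2/dt = -x1
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
x0 = np.array([1.0, 0.0])
x_quarter = solve_linear_ode(A, x0, np.pi / 2)    # exact solution is (cos t, -sin t)
```

For this rotation matrix the eigenvalues are $\pm i$, so the modes oscillate without growing or decaying; eigenvalues with nonzero real part would instead produce exponential growth or decay.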

The above process is possible because the system is linear. In this scenario, we have our solution because $A$ is given to us. In reality, the stock market, like many other processes, is not predetermined in this way. So, we don't have this $A$ explicitly given and we must approximate it in a data-driven manner. From a stock market point of view, if $A$ were known, the efficient market hypothesis would probably nullify its usefulness fairly quickly.

From the above arguments, we've reviewed how the linear dynamics of a system are related to the _transition matrix's_ eigenvectors and eigenvalues. We now depart from the setup described above and use $A$ in the related context of the following state transition: 

$$
x_{t+1} = Ax_t
$$

Here, $A$ is responsible for progressing the state one step forward in time. We will assume, as described above, that our step forward in time is constant and uniform for all data points, though this constraint can be somewhat relaxed. In matrix notation, we can write:

$$
X' \approx AX
$$

where $X = [\vec{x}_1 \; \vec{x}_2 \; \dots \vec{x}_{m-1}]$ and $X' = [\vec{x}_2 \; \vec{x}_3 \; \dots \vec{x}_{m}]$, with each $x_i$ being $\Delta t$ units of time from $x_{i-1}$.

This brings up a couple relevant questions:

1. What should $m$ be?
2. What should $\Delta t$ be?

The answer to each question is that it depends on context and requires domain knowledge; it even depends on the stock you're viewing and the season you're viewing it in. Past performance of a stock is often no indicator of future performance, which suggests we don't want $m$ too large. However, stocks often exhibit trending behavior and certain patterns we'd like to pick up on, so we don't want it too small either.

$A$ will be calculated using the Moore-Penrose pseudoinverse of $X$, denoted as $X^{\dagger}$:

$$
A = X' X^{\dagger}
$$

Mathematically, $A$ is the least-squares solution that best fits the transition from $x_i$ to $x_{i+1}$ across all $i$; that is, it minimizes $\|X' - AX\|_F$. This best-fit operator is the basis of what is known as _exact DMD_. The formula doesn't depend on the timestep size, but you should choose something sensible for your application. The size of $A$ is $n \times n$, where $n$ is the number of features at each timestep. When $n$ is large (e.g. pixels in an image), $A$ has $n^2$ entries and becomes intractable to compute and store directly, no matter how small $m$ is. This is typical in applications of DMD, so we take a roundabout approach to find what we need to approximate the state vector.
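For small $n$ the pseudoinverse formula can be applied directly. The sketch below builds $X$ and $X'$ from a snapshot matrix and recovers the transition matrix; the names `snapshots`, `A_true`, and `A_fit` and the toy dynamics are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 20                                   # n features, m snapshots (toy sizes)

# Hypothetical "true" one-step dynamics used only to generate noise-free data
A_true = np.array([[0.9, 0.1, 0.0],
                   [-0.1, 0.9, 0.0],
                   [0.0, 0.0, 0.5]])

snapshots = np.empty((n, m))                   # column k holds the state at step k
snapshots[:, 0] = rng.standard_normal(n)
for k in range(1, m):
    snapshots[:, k] = A_true @ snapshots[:, k - 1]

X = snapshots[:, :-1]                          # x_1 ... x_{m-1}
X_prime = snapshots[:, 1:]                     # x_2 ... x_m
A_fit = X_prime @ np.linalg.pinv(X)            # least-squares fit of X' ~ A X
```

Because this toy data is noise-free and $X$ has full row rank, `A_fit` matches `A_true` up to floating-point error. With real, noisy data the fit is only the best linear approximation in the least-squares sense.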

We make the important assumption that our high-dimensional data exhibits low-rank structure. That is to say, we can reduce the dimension and still capture most of the variation of the data. See my [notebook on PCA](../PCA_linear/pca_linear_oscillation_system.ipynb) for an overview of this concept. An outline of the process to estimate the system is as follows:
1. We will first approximate the data matrix $X$ using a rank-$r$ truncation derived from the Singular Value Decomposition (SVD).
2. Using the results from Step 1, an $r \times r$ substitute of $A$ will be calculated, denoted $\tilde{A}$.
3. The $r$-dimensional eigenvectors and eigenvalues of $\tilde{A}$ are computed.
4. The eigenvectors and eigenvalues are mapped back to the original space to yield the approximation we desire.

<u>__Step 1__</u>: Compute the SVD and find $r$.

The choice of $r$ has been the subject of much research. The simplest way is to view the singular values (on a log plot if necessary) and decide where the cutoff should be based on where they drop close to zero. A more sophisticated approach can be found in __[2]__ referenced below.

We write:

$$
X \approx U_r \Sigma_r V^{*}_r
$$

The columns of $U_r$ store the $r$ principal directions and will be used as the key for transforming to and from the $r$-dimensional space approximating our state space.
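A minimal sketch of Step 1 follows. It picks $r$ with a simple cumulative-energy cutoff, which is an assumption made here for illustration and is not the optimal hard threshold of __[2]__; the function name `truncated_svd` and the `energy` parameter are likewise hypothetical.

```python
import numpy as np

def truncated_svd(X, energy=0.99):
    """Return U_r, s_r, Vh_r and r, keeping enough modes to reach `energy`."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)           # fraction of energy captured
    r = int(np.searchsorted(cumulative, energy)) + 1      # first rank reaching the cutoff
    r = min(r, len(s))                                    # guard against round-off
    return U[:, :r], s[:r], Vh[:r, :], r

# Toy matrix with known singular values 10, 5 and 0.01 (the last is "noise")
rng = np.random.default_rng(1)
U0, _ = np.linalg.qr(rng.standard_normal((20, 3)))        # orthonormal columns
V0, _ = np.linalg.qr(rng.standard_normal((15, 3)))
X_toy = U0 @ np.diag([10.0, 5.0, 0.01]) @ V0.T

U_r, s_r, Vh_r, r = truncated_svd(X_toy, energy=0.99)
```

Here the first two singular values carry essentially all of the energy, so the cutoff selects $r = 2$. Plotting `s` on a log scale, as suggested above, is a useful sanity check on whatever rule is used.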

<u>__Step 2__</u>: Find $\tilde{A}$.

$\tilde{A}$ is defined as the projection of $A$ onto the leading $r$ left singular vectors, $\tilde{A} = U^{*}_r A U_r$, which can be simplified using the SVD:

$$
\tilde{A} = U^{*}_r X' V_r \Sigma^{-1}_r
$$
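To spell out the simplification, the truncated SVD gives an approximate pseudoinverse of $X$, which can be substituted into $A = X' X^{\dagger}$. This is a sketch of the intermediate steps using only the symbols already defined:

$$
X^{\dagger} \approx V_r \Sigma^{-1}_r U^{*}_r
\quad \Rightarrow \quad
A \approx X' V_r \Sigma^{-1}_r U^{*}_r
$$

$$
\tilde{A} = U^{*}_r A U_r \approx U^{*}_r X' V_r \Sigma^{-1}_r \left( U^{*}_r U_r \right) = U^{*}_r X' V_r \Sigma^{-1}_r
$$

The last equality uses $U^{*}_r U_r = I$, since the columns of $U_r$ are orthonormal. Note that $\tilde{A}$ is computed without ever forming the $n \times n$ matrix $A$.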

<u>__Step 3__</u>: Compute the eigenvectors and eigenvalues of $\tilde{A}$.

We compute the columns of $W$ (eigenvectors) and diagonal entries of $\Lambda$ such that:

$$
\tilde{A}W = W \Lambda
$$

<u>__Step 4__</u>: Map our results back to the original space.

To obtain the necessary DMD modes, we take the eigenvectors stored in $W$ from Step 3 and compute the following product:

$$
\Phi = A U_r W = X' V_r \Sigma^{-1}_r W
$$

using the fact $U^{*}_r U_r = I$. This corresponds to mapping the eigenvectors back to $n$-dimensional space and then applying $A$. Multiplying by $A$ ensures that the resulting modes are eigenvectors of the original matrix $A$, and this can be shown mathematically to be the case. The columns of $\Phi$ are referred to as the _DMD modes_. Further, it can be shown the entries in $\Lambda$ are the corresponding eigenvalues of $A$. So, we have all we need to represent repeated multiplication on the initial data point $x_1$ by $A = \Phi \Lambda \Phi^{-1}$:

$$
x_{k} = A x_{k - 1} = \Phi \Lambda^{k - 1} \Phi^{-1} x_1
$$

We define $\vec{b} = \Phi^{-1} x_1 $, which is calculated in practice as $\vec{b} = \Phi^{\dagger} x_1 $. So we have:

$$
\vec{x}_k = \sum_{i = 1} ^ {r} \phi_i \lambda^{k - 1}_i b_i
$$

Note that the initial condition above is based on the first timestep $x_{1}$. This is not necessary; you can start predicting future states from more recent states, or even the most recent state. In terms of future prediction, we can fix the initial condition, use the formula above, and vary $k$, or we can iteratively predict each step using the last (predicted) step as the new starting point. Since we only care about the next value, we will use the most recent state to predict the next state.
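Steps 1 through 4 and the one-step prediction can be sketched together in NumPy. The function names `dmd` and `dmd_predict`, the toy low-rank data, and the choice $r = 3$ are assumptions for illustration; this is a minimal sketch rather than a tuned implementation.

```python
import numpy as np

def dmd(X, X_prime, r):
    """Rank-r DMD. Returns the modes Phi (n x r) and the eigenvalues (r,)."""
    # Step 1: rank-r truncated SVD of X
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U_r, s_r, V_r = U[:, :r], s[:r], Vh[:r, :].conj().T
    # X' V_r Sigma_r^{-1}; dividing column j by s_r[j] applies Sigma_r^{-1}
    XVS = (X_prime @ V_r) / s_r
    # Step 2: reduced r x r operator
    A_tilde = U_r.conj().T @ XVS
    # Step 3: eigendecomposition of the reduced operator
    eigvals, W = np.linalg.eig(A_tilde)
    # Step 4: map back to the original space to get the DMD modes
    Phi = XVS @ W
    return Phi, eigvals

def dmd_predict(Phi, eigvals, x_start, steps=1):
    """Advance x_start by `steps` timesteps using the DMD modes."""
    b = np.linalg.pinv(Phi) @ x_start           # mode amplitudes for the start state
    return (Phi @ (eigvals**steps * b)).real    # real data: imaginary parts cancel

# Toy data: 3-dimensional linear dynamics embedded in 10 dimensions
rng = np.random.default_rng(0)
A_low = np.array([[0.9, 0.1, 0.0], [-0.1, 0.9, 0.0], [0.0, 0.0, 0.5]])
Q, _ = np.linalg.qr(rng.standard_normal((10, 3)))   # orthonormal embedding
A_full = Q @ A_low @ Q.T
snapshots = np.empty((10, 30))
snapshots[:, 0] = Q @ rng.standard_normal(3)
for k in range(1, 30):
    snapshots[:, k] = A_full @ snapshots[:, k - 1]

Phi, eigvals = dmd(snapshots[:, :-1], snapshots[:, 1:], r=3)
x_next = dmd_predict(Phi, eigvals, snapshots[:, -1], steps=1)
```

Here the most recent snapshot is used as the starting state, matching the choice described above. On this noise-free toy data the prediction agrees with `A_full @ snapshots[:, -1]`; on market data no such agreement should be expected.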

<u>Experiment Outline</u>

Firstly, this notebook's goal is not to obtain miraculous results on the market, but instead to be instructive in the use of DMD.

We make the hopeful assumption that there's some rhyme or reason to the market's behavior at least some of the time. In truth, the market is enormously complex and largely unpredictable. The weather is more predictable, to be honest. No model can possibly be correct even close to 100% of the time, as there are a significant number of variables at play that simply can't be measured. In fact, if you beat 50% accuracy, that's not bad for a simple model. A stock's price can show a trend but then immediately reverse for no reason available to us, so some luck is involved. These outliers occur frequently. The obvious problem here is that if this happens enough, there will be no true patterns to recognize, and these contradicting data points will act as confusers and ruin the training of any model. One thing working for us is the 'dynamicness' of DMD. At every timestep, the calculation can be repeated on the fly, so this online form of learning might help some.

One of the most crucial aspects in training a model is the features you choose as input. The old saying 'garbage in, garbage out' applies, so we must be careful to choose features that have a good chance of indicating the future direction of a stock's price. We are not attempting to make a regression model to predict future returns, which is a common approach in computational finance. Further, we are not using price data directly as features. Hence, scale is not as major a concern. Instead, our state vector will consist of a very popular momentum indicator known as __Moving Average Convergence/Divergence__, or MACD for short. Very briefly, it calculates two moving averages, one with a longer lag time than the other, and subtracts the slower from the faster to form the MACD value. Additionally, this value is smoothed with another moving average, and the difference between the MACD value and its smoothed version forms a new value for each point in time called the histogram bar. The movement of these bars is much like a wave: upward motion indicates a rising stock, downward motion a falling stock, and flat or small motion a stagnant stock. The values depend on the price itself, so it will be important to normalize them.
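The MACD histogram described above can be sketched with pandas exponential moving averages. The function name `macd_histogram` is hypothetical, and the default periods of 12, 26, and 9 and the use of `adjust=False` are common conventions assumed here rather than requirements.

```python
import numpy as np
import pandas as pd

def macd_histogram(close, fast=12, slow=26, signal=9):
    """MACD histogram: (fast EMA - slow EMA) minus its own smoothed version."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow                              # the MACD value
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return macd_line - signal_line                               # histogram bars

# Illustrative inputs: a flat price and a steadily rising price
flat = pd.Series([50.0] * 40)
rising = pd.Series(np.arange(40.0) + 50.0)
hist_flat = macd_histogram(flat)
hist_rising = macd_histogram(rising)
```

A flat price gives a histogram of zeros, while a price that starts rising pushes the first bars above zero, which matches the wave-like reading of the indicator given above.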

For each point in time, we calculate these MACD histogram values for a fixed set of MACD parameters: the calculation will be based on the close of the candle, and the standard periods of 12, 26, and 9 will be used for the fast, slow, and smoothing exponential moving averages, respectively. To widen our gaze and obtain a big-picture view, we repeat this step for higher timeframes as well. There are many different kinds of values we can base the MACD calculation on. We opt for two. First are Heikin-Ashi candles, which offer a smoothed-out set of candles. These better capture trends with less noise, at the cost of the candles not necessarily representing true prices. The second will be the __Relative Strength Index__, or RSI. This is a momentum indicator that signals when a stock's price is relatively too high or too low.
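For reference, here is a minimal sketch of the usual Heikin-Ashi construction. The function name `heikin_ashi` and the lowercase column names `open`, `high`, `low`, `close` are assumptions about how the price data is stored, and seeding the first open with the average of the first bar's open and close is one common convention.

```python
import numpy as np
import pandas as pd

def heikin_ashi(ohlc):
    """Convert standard OHLC candles to Heikin-Ashi candles."""
    # HA close: average of the four prices of the current bar
    ha_close = (ohlc["open"] + ohlc["high"] + ohlc["low"] + ohlc["close"]) / 4.0
    close_vals = ha_close.to_numpy()
    # HA open: midpoint of the previous HA bar's open and close
    open_vals = np.empty(len(ohlc))
    open_vals[0] = (ohlc["open"].iloc[0] + ohlc["close"].iloc[0]) / 2.0
    for i in range(1, len(ohlc)):
        open_vals[i] = (open_vals[i - 1] + close_vals[i - 1]) / 2.0
    ha_open = pd.Series(open_vals, index=ohlc.index)
    # HA high / low: extremes of the true high / low and the HA open and close
    ha_high = np.maximum(ohlc["high"], np.maximum(ha_open, ha_close))
    ha_low = np.minimum(ohlc["low"], np.minimum(ha_open, ha_close))
    return pd.DataFrame({"open": ha_open, "high": ha_high,
                         "low": ha_low, "close": ha_close})

# Two illustrative bars
bars = pd.DataFrame({"open": [10.0, 11.0], "high": [12.0, 13.0],
                     "low": [9.0, 10.0], "close": [11.0, 12.0]})
ha = heikin_ashi(bars)
```

Because each Heikin-Ashi open depends on the previous Heikin-Ashi bar, the candles are a running average of sorts, which is where the smoothing comes from.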

Two questions we must answer are how many steps should we look back for the purposes of DMD ($m$) and what should $\Delta t$ be, which is essentially the bar's timeframe. We fix $m$ and use a variety of $\Delta t$ values to form our data matrix. Henceforth, $\Delta t$ will be referred to as _bar size_. Varying bar sizes are a way of generating multiple features for each timestep. The features themselves are the _returns_ of the histogram values (percent increase or decrease) from the MACD calculation, providing an intuitive and normalized approach to representing our data. In addition to the two MACD features, we will add the percent change from the closing prices of the two most recent bars (Heikin-Ashi and standard).

Applying DMD to this data provides a raw momentum prediction for each bar size at future points in time. We need to interpret the prediction to make our best guess on the direction of the stock and we need a way to decide if we're correct. I propose we append the DMD-generated features to additional features (like other stock indicators) and use this as input to train a model against our labels. This will provide a confidence score we can use to gauge if we should take a position in the market.

The labels will be straightforward, but require some configuration. Given a timeframe into the future and a required percent increase/decrease, a label of __1__ will be assigned if the closing price lies above the opening price by at least that threshold. Similarly, if the closing price ends up below the opening price by at least this threshold, a label of __2__ is assigned. If neither occurs, then the stock had insufficient momentum in either direction and a label of __0__ is assigned.
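One way to express this labeling rule is sketched below. The function name `make_labels` is hypothetical, and two details are assumptions: the horizon is measured in bars, and the entry price is the open of the current bar compared against the close `horizon` bars later.

```python
import pandas as pd

def make_labels(open_, close, horizon, threshold):
    """Label each bar 1 (up), 2 (down) or 0 (flat) from the future close."""
    future_close = close.shift(-horizon)            # close `horizon` bars ahead
    change = (future_close - open_) / open_         # fractional move from entry open
    labels = pd.Series(0, index=close.index, dtype=int)
    labels[change >= threshold] = 1                 # rose by at least the threshold
    labels[change <= -threshold] = 2                # fell by at least the threshold
    return labels[future_close.notna()]             # drop bars with no future close

# Illustrative prices with a 2% threshold and a one-bar horizon
opens = pd.Series([100.0, 100.0, 100.0, 100.0])
closes = pd.Series([100.0, 103.0, 97.0, 100.5])
labels = make_labels(opens, closes, horizon=1, threshold=0.02)
```

The final bars have no future close to compare against, so they are dropped rather than labeled.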

The following must be decided:

1. The fixed value of $m$.
2. The specific bar sizes to use. Say we use $d$ different sizes. Then we will have $4 \cdot d$ features for each timestamp.
3. The length of time in days after entering the market for label calculation.
4. The percent threshold for determining if the future price movement was sufficient to be labeled a __1__ or a __2__.

Our data matrix for DMD will be $(4 \cdot d) \times m$. Application of DMD yields a vector of shape $(4 \cdot d) \times 1$ that represents a prediction for future price movement. Additional stock indicator features are appended onto this to yield the input to our machine learning model, paired with the appropriate labels. We will use a Random Forest model for training because 1) it is simple and easy to implement, 2) it tends to handle noisy data (like the stock market) well, and 3) it allows us to maintain interpretability of the decision function, as we have interpretable features.

```python
# Imports
import numpy as np
import pandas as pd
```

<u>__References__</u>

Content inspired by lecture notes and material in __[1]__.

__[1]__ Steven L. Brunton; J. Nathan Kutz (2019). "Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control". Section 3.7.

__[2]__ M. Gavish and D. L. Donoho, "The Optimal Hard Threshold for Singular Values is $4 / \sqrt{3}$" in IEEE Transactions on Information Theory, vol. 60, no. 8, pp. 5040-5053, Aug. 2014, doi: 10.1109/TIT.2014.2323359.