# Feature Selection Methods for Time Series Data

Feature selection (FS) has been widely studied in the machine learning literature, particularly for **univariate time series** (a single variable observed over time) and **multivariate non-time-series data** (many features observed independently of time). Classical feature selection techniques (filter, wrapper, and embedded methods) have demonstrated strong performance in these domains. Examples include:

- **Filter methods:** Mutual Information, Correlation, ANOVA
- **Wrapper methods:** Recursive Feature Elimination (RFE)
- **Embedded methods:** Lasso, tree-based importance

However, relatively **limited attention has been given to feature selection for multivariate time series (MTS)** data, where features not only have relevance at each time point but also exhibit _temporal dynamics_ across time. The key challenge for time series is that temporal structure must be preserved when features are evaluated:

- Directly applying classical methods (e.g., RFE) typically requires **vectorizing the time series**, which destroys temporal relationships
- Many state-of-the-art selection algorithms assume i.i.d. data, an assumption that the dependence structure inherent in time series violates
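
The effect of vectorizing can be sketched with a toy array. The shapes and values below are hypothetical and chosen only for illustration: once each sample is flattened, a classical selector sees a plain column index and no longer knows which columns belong to the same feature or to adjacent time steps.

```python
import numpy as np

# Hypothetical MTS dataset: 4 samples, 5 time steps, 3 features.
n_samples, n_steps, n_features = 4, 5, 3
mts = np.arange(n_samples * n_steps * n_features, dtype=float).reshape(
    n_samples, n_steps, n_features
)

# Vectorizing flattens each sample into one row of n_steps * n_features columns.
flat = mts.reshape(n_samples, n_steps * n_features)

# Column k of the flat matrix maps back to (time step, feature) via divmod,
# but a selector working on `flat` only ever sees the integer k.
k = 7
step, feature = divmod(k, n_features)
print(flat.shape)                            # (4, 15)
print(step, feature)                         # 2 1
print(flat[0, k] == mts[0, step, feature])   # True
```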

### Mutual Information at the Heart of Time Series FS

Across the literature, **Mutual Information (MI)** has emerged as a **central building block** for feature selection in time series:

- MI quantifies **statistical dependency** between variables, including nonlinear dependency
- For time series, MI is used to measure **how much knowing a feature (or its lags) reduces uncertainty in the target**
- A high MI indicates that a feature carries useful information about the target behavior over time

MI has been favored because:

- It captures **both linear and nonlinear relationships**
- It does not rely on Gaussian assumptions
- It can be adapted to multivariate settings with lagged representations

### Evolution of MI Estimation in the Literature

**Early Approaches: Discretization**

In earlier research (pre-2010), MI was often estimated by:

- **Discretizing continuous variables**
- Building **contingency tables**
- Using histograms or fixed bins to approximate densities

Pros:

- Conceptually simple
- Easy to compute by hand or with basic tools

Cons:

- Loss of precision
- Sensitive to bin size
- Inefficient for high-dimensional continuous data

As a result, discretization is **no longer the preferred method** for MI estimation in modern time series feature selection.

**Modern MI: kNN Estimators**

Today’s research mainly uses **k-Nearest Neighbors (kNN) based MI estimators**, such as:

- Kraskov estimator (also known as the Kraskov-Stögbauer-Grassberger, or KSG, estimator)
- kNN entropy estimators adapted for MI

Advantages:

- Works directly on **continuous variables**
- No discretization needed
- Captures **local density structure**
- Performs well with small to moderate sample sizes
- Distances to nearest neighbors, rather than bin counts, define local densities

This approach is widely used in recent time series FS papers because it maintains continuous information and avoids arbitrary binning.

### Lagged Variables and Multivariate Time Series

A common strategy in time series feature selection is to **include lagged versions of variables**:

- Instead of treating each feature as a static column, researchers create:
  - $X(t)$, $X(t-1)$, $X(t-2)$, … up to some maximum lag
- Lagged features allow models to capture **temporal dependencies**
- Feature selection then evaluates not only **which features** are important, but **which lags** are important

Given a feature $X$, with maximum lag $L$, the feature set becomes:

$$
\{ X(t), X(t-1), X(t-2), \ldots, X(t-L) \}
$$
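
As a minimal sketch of this lagged representation, assuming a single feature stored as a 1-D array, the helper below builds one column per lag with NumPy. The name `make_lagged` is hypothetical and not taken from any library.

```python
import numpy as np

def make_lagged(x, max_lag):
    """Return a matrix whose column j holds x(t - j) for t = max_lag .. len(x) - 1."""
    x = np.asarray(x, dtype=float)
    n_rows = len(x) - max_lag  # the first max_lag time steps lack a full history
    columns = [x[max_lag - j : max_lag - j + n_rows] for j in range(max_lag + 1)]
    return np.column_stack(columns)

x = np.arange(10.0)            # toy series 0, 1, ..., 9
lagged = make_lagged(x, max_lag=2)
print(lagged.shape)            # (8, 3)
print(lagged[0])               # [2. 1. 0.]  -> X(t), X(t-1), X(t-2) at t = 2
```

Each column can then be scored against the target separately, which is what allows a selector to rank individual lags and not only whole features.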

In this setting, three different elimination strategies are commonly discussed:

1. **Eliminate only the candidate lag**

   - Remove a specific lag that is uninformative
   - Keep other lags of the same feature
   - > _The most popular and widely used_

2. **Eliminate up to the candidate lag**

   - Remove all lags up to a particular lag level
   - E.g., remove $X(t-1)$ and all earlier lags

3. **Eliminate all lags of a feature**
   - If no lag of a feature is informative,
   - Remove the entire feature from the model
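
The three strategies can be sketched as operations on a set of `(feature, lag)` pairs. This is an illustrative sketch only: the function name, the strategy labels, and the feature names are hypothetical, and strategy 2 is interpreted here as removing lags 0 through the candidate lag, which is an assumption about the intended direction.

```python
def eliminate(candidates, feature, lag, strategy):
    """Return the (feature, lag) pairs that survive one elimination step.

    candidates: set of (feature_name, lag) pairs still under consideration.
    strategy:   "lag_only", "up_to_lag", or "all_lags" (hypothetical labels).
    """
    if strategy == "lag_only":
        drop = {(feature, lag)}
    elif strategy == "up_to_lag":
        # Assumption: "up to the candidate lag" means lags 0..lag of this feature.
        drop = {(f, k) for (f, k) in candidates if f == feature and k <= lag}
    elif strategy == "all_lags":
        drop = {(f, k) for (f, k) in candidates if f == feature}
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return candidates - drop

# Two hypothetical features, each with lags 0..3 (8 candidates in total).
candidates = {(name, k) for name in ("X1", "X2") for k in range(4)}
after_lag_only = eliminate(candidates, "X1", 2, "lag_only")
after_up_to = eliminate(candidates, "X1", 2, "up_to_lag")
after_all = eliminate(candidates, "X1", 2, "all_lags")
print(len(after_lag_only), len(after_up_to), len(after_all))  # 7 5 4
```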


# Example: Mutual Information via Discretization

### Problem setup

Suppose we study whether blood pressure (BP) is informative about a binary health outcome.

- Feature X: Blood Pressure (continuous)
- Target Y: Disease status (binary)

We discretize BP into categories.

### Step 1: Raw data

| Sample | BP (mmHg) | Disease Y |
| ------ | --------- | --------- |
| 1      | 110       | 0         |
| 2      | 115       | 0         |
| 3      | 120       | 0         |
| 4      | 130       | 1         |
| 5      | 135       | 1         |
| 6      | 140       | 1         |
| 7      | 145       | 1         |
| 8      | 150       | 1         |

Total samples: N = 8

### Step 2: Discretize the feature

We discretize BP into two bins:

- Low BP (L): $BP < 130$
- High BP (H): $BP \geq 130$

### Step 3: Construct the contingency table

| X (BP category) | Y=0 | Y=1 | Total |
| --------------- | --- | --- | ----- |
| L               | 3   | 0   | 3     |
| H               | 0   | 5   | 5     |
| Total           | 3   | 5   | 8     |

### Step 4: Convert counts to probabilities

Marginal probabilities:

$$
P(X=L) = 3/8, \quad P(X=H) = 5/8
$$

$$
P(Y=0) = 3/8, \quad P(Y=1) = 5/8
$$

Joint probabilities:

$$
P(L,0) = 3/8, \quad P(L,1) = 0
$$

$$
P(H,0) = 0, \quad P(H,1) = 5/8
$$

### Step 5: Mutual Information definition

$$
I(X;Y) = \sum_x \sum_y P(x,y) \log_2 \left(\frac{P(x,y)}{P(x) P(y)}\right)
$$
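
With the probabilities from Step 4, only the two cells with non-zero joint probability contribute, since terms with $P(x,y) = 0$ are taken to be zero by convention. Written out, the sum reduces to:

$$
I(X;Y) = \frac{3}{8} \log_2 \left(\frac{3/8}{(3/8)(3/8)}\right) + \frac{5}{8} \log_2 \left(\frac{5/8}{(5/8)(5/8)}\right) = \frac{3}{8} \log_2 \frac{8}{3} + \frac{5}{8} \log_2 \frac{8}{5}
$$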

After computing each **non-zero term**:

$$
I(X;Y) \approx 0.531 + 0.424 = 0.955 \text{ bits}
$$
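
The hand calculation can be reproduced with a short script. This is a sketch under the assumptions of this example (two bins split at a single threshold); the function name `discretized_mi` is hypothetical. Running it with a second threshold also illustrates the bin-size sensitivity listed among the drawbacks of discretization.

```python
from collections import Counter
from math import log2

bp = [110, 115, 120, 130, 135, 140, 145, 150]
disease = [0, 0, 0, 1, 1, 1, 1, 1]

def discretized_mi(x, y, threshold):
    """MI in bits between a two-bin discretization of x and the labels y."""
    n = len(x)
    bins = ["H" if value >= threshold else "L" for value in x]
    joint = Counter(zip(bins, y))   # contingency table (non-zero cells only)
    count_x = Counter(bins)         # row totals
    count_y = Counter(y)            # column totals
    mi = 0.0
    for (b, label), c in joint.items():
        # P(x,y) * log2( P(x,y) / (P(x) P(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[b] * count_y[label]))
    return mi

mi_130 = discretized_mi(bp, disease, threshold=130)
mi_140 = discretized_mi(bp, disease, threshold=140)
print(f"{mi_130:.2f} {mi_140:.2f}")  # 0.95 0.35
```

Moving the threshold from 130 to 140 mixes the two classes inside the low bin, and the estimate drops sharply even though the data are unchanged.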

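> $I(X;Y) \approx 0.955$ bits indicates strong dependency: it equals the entropy $H(Y)$ of the target, so the binned BP value fully determines the disease status in this sample.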


# Example: Mutual Information via kNN (k = 2)

We use the same dataset as before but replace discretization with a kNN estimator.

### Step 1: Data

We have one continuous feature X (Blood Pressure)
and one discrete target Y (Disease).

| i   | X (BP) | Y   |
| --- | ------ | --- |
| 1   | 110    | 0   |
| 2   | 115    | 0   |
| 3   | 120    | 0   |
| 4   | 130    | 1   |
| 5   | 135    | 1   |
| 6   | 140    | 1   |
| 7   | 145    | 1   |
| 8   | 150    | 1   |

Total samples:
N = 8

k = 2

### Step 2: MI estimator (KSG-style, mixed continuous–discrete)

We use the commonly cited kNN MI estimator:

$$
I(X;Y) = ψ(k) + ψ(N) - ⟨ ψ(n_x + 1) + ψ(n_y + 1) ⟩
$$

Where:

- $ψ(\cdot)$ is the digamma function
- $⟨ \cdot ⟩$ denotes the average over all $N$ points
- $n_x$ = number of neighbors within $ε$ in X-space
- $n_y$ = number of neighbors with the same Y-label within $ε$
- $ε$ = distance to the k-th nearest neighbor in joint space

### Step 3: Distance in joint space

Joint space distance:

- $X$ uses absolute distance
- $Y$ must be an exact match (distance = 0 if same class, infinite otherwise)

Thus:
Only points with the same $Y$ are considered neighbors.

### Step 4: Compute ε for each point (k = 2)

For each point, $ε$ is the distance to its 2nd nearest neighbor with the same $Y$.

##### For Y = 0 (points 1–3)

X values: 110, 115, 120

Distances:

- Point 110:
  neighbors: 115 (5), 120 (10)
  ε = 10

- Point 115:
  neighbors: 110 (5), 120 (5)
  ε = 5

- Point 120:
  neighbors: 115 (5), 110 (10)
  ε = 10

##### For Y = 1 (points 4–8)

X values: 130, 135, 140, 145, 150

- Point 130:
  neighbors: 135 (5), 140 (10)
  ε = 10

- Point 135:
  neighbors: 130 (5), 140 (5)
  ε = 5

- Point 140:
  neighbors: 135 (5), 145 (5)
  ε = 5

- Point 145:
  neighbors: 140 (5), 150 (5)
  ε = 5

- Point 150:
  neighbors: 145 (5), 140 (10)
  ε = 10
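
The distances above can be checked with a few lines of Python. This is a sketch of Steps 3 and 4 only, and the helper names are hypothetical: the joint distance is the absolute difference in BP when the labels match and infinite otherwise, so the k-th nearest neighbor always comes from the same class.

```python
import math

bp = [110, 115, 120, 130, 135, 140, 145, 150]
disease = [0, 0, 0, 1, 1, 1, 1, 1]

def joint_distance(x_i, y_i, x_j, y_j):
    """Absolute distance in X if the labels match, infinite otherwise."""
    return abs(x_i - x_j) if y_i == y_j else math.inf

def kth_neighbor_distance(i, xs, ys, k):
    """Distance from point i to its k-th nearest neighbor in joint space."""
    distances = sorted(
        joint_distance(xs[i], ys[i], xs[j], ys[j])
        for j in range(len(xs))
        if j != i
    )
    return distances[k - 1]

eps = [kth_neighbor_distance(i, bp, disease, k=2) for i in range(len(bp))]
print(eps)  # [10, 5, 10, 10, 5, 5, 5, 10]
```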

### Step 5: Count $n_x$ and $n_y$

For each point:

- $n_x$ = number of points within distance $ε$ in $X$ (excluding the point itself)
- $n_y$ = number of points with the same $Y$ within distance $ε$

Because this walkthrough only counts neighbors that share the point's $Y$ label, $n_x = n_y$ for all points.

##### Counts for Y = 0

| X   | ε   | Neighbors with $\lvert X_i - X_j \rvert \le ε$ | $n_x = n_y$ |
| --- | --- | ---------------------------------------------- | ----------- |
| 110 | 10  | 115, 120                                       | 2           |
| 115 | 5   | 110, 120                                       | 2           |
| 120 | 10  | 115, 110                                       | 2           |

##### Counts for Y = 1

| X   | ε   | Neighbors with $\lvert X_i - X_j \rvert \le ε$ | $n_x = n_y$ |
| --- | --- | ---------------------------------------------- | ----------- |
| 130 | 10  | 135, 140                                       | 2           |
| 135 | 5   | 130, 140                                       | 2           |
| 140 | 5   | 135, 145                                       | 2           |
| 145 | 5   | 140, 150                                       | 2           |
| 150 | 10  | 145, 140                                       | 2           |

### Step 6: Plug into MI formula

We now compute expectations.

For ALL points:

- $n_x = 2$
- $n_y = 2$

Thus:

- $ψ(n_x + 1) = ψ(3)$
- $ψ(n_y + 1) = ψ(3)$

### Step 7: Digamma values (approximate)

- $ψ(2) ≈ 0.4228$
- $ψ(3) ≈ 0.9228$
- $ψ(8) ≈ 2.0156$

### Step 8: Compute MI

$$
I(X;Y) = ψ(2) + ψ(8) - [ ψ(3) + ψ(3) ]
$$

$$
I(X;Y) ≈ 0.4228 + 2.0156 - (0.9228 + 0.9228)
$$

$$
I(X;Y) ≈ 2.4384 - 1.8456 = 0.5928 \text{ nats}
$$

The digamma-based estimate is expressed in nats, so convert to bits:

$$
I_{\text{bits}} = 0.5928 / \ln 2 ≈ 0.855 \text{ bits}
$$

- $MI ≈ 0.855$ bits → strong dependence
- Result is close to discretized MI (≈ 0.955)
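
The whole walkthrough can be reproduced in plain Python. This sketch follows the simplified same-class counting used in the steps above and is not a general-purpose KSG implementation. The function names are hypothetical, and the digamma function is evaluated only at positive integers via harmonic numbers, which is all this example requires.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{m=1}^{n-1} 1/m."""
    return -EULER_GAMMA + sum(1.0 / m for m in range(1, n))

def knn_mi_walkthrough(xs, ys, k):
    """MI in nats, using the same-class neighbor counting of the walkthrough."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Distances to the other points that share the label of point i.
        same = sorted(
            abs(xs[i] - xs[j]) for j in range(n) if j != i and ys[j] == ys[i]
        )
        eps = same[k - 1]                        # k-th nearest same-class neighbor
        n_x = sum(1 for d in same if d <= eps)   # neighbors within eps, ties included
        n_y = n_x                                # equal by construction in this walkthrough
        total += digamma_int(n_x + 1) + digamma_int(n_y + 1)
    return digamma_int(k) + digamma_int(n) - total / n

bp = [110, 115, 120, 130, 135, 140, 145, 150]
disease = [0, 0, 0, 1, 1, 1, 1, 1]

mi_nats = knn_mi_walkthrough(bp, disease, k=2)
mi_bits = mi_nats / math.log(2)
print(f"{mi_nats:.2f} nats, {mi_bits:.3f} bits")  # 0.59 nats, 0.855 bits
```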
