# Time Series Similarity Search with aeon

<img src="img/sim_search.png" width="600" alt="time series similarity search">

The `similarity_search` module in aeon provides estimators with a `fit`/`predict` interface for finding nearest neighbors in time series data. All estimators follow a standard interface:

- **fit(X)**: Takes a 3D collection of shape `(n_cases, n_channels, n_timepoints)`
- **predict(X)**: Takes a 2D query series of shape `(n_channels, n_timepoints)`

The module is organized by search type:

- `subsequence` estimators find nearest neighbors among subsequences of time series
- `whole_series` estimators find nearest neighbors among complete time series

### Other similarity search notebooks

This notebook gives an overview of the similarity search module and its available estimators. The following notebooks cover specific topics in more depth:

- [The theory and math behind the similarity search estimators in aeon](distance_profiles.ipynb)
- [Analysis of the performance of the estimators provided by the similarity search module](code_speed.ipynb)

## 1. Setup and Data

First, let's import the estimators and create some example data. We'll create a small 3D collection and a query series.

In [9]:
# Imports
from aeon.similarity_search.subsequence import MASS
from aeon.similarity_search.subsequence import BruteForce as SubseqBruteForce
from aeon.similarity_search.whole_series import BruteForce as WholeBruteForce
from aeon.testing.data_generation import make_example_3d_numpy

# Create a sample collection: 4 cases, 1 channel, 50 timepoints each
X, _ = make_example_3d_numpy(n_cases=4, n_channels=1, n_timepoints=50)
print("Collection shape (fit input):", X.shape)

# Create a query series: 1 channel, 10 timepoints
q = X[0, :, :10]  # Extract a short query from first series
print("Query shape (predict input):", q.shape)

Collection shape (fit input): (4, 1, 50)
Query shape (predict input): (1, 10)


## 2. Subsequence Search

Subsequence search estimators find the closest matching subsequences within a collection of time series. The `predict` method returns the match indices together with their distances. The indices form a 2D array of shape `(n_matches, 2)`, where each row is a `(case_index, timestamp_index)` pair giving the case and the starting timepoint of a match in the collection.
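
To make the `(case_index, timestamp_index)` layout concrete, the sketch below slices the matched windows back out of a collection. It uses only NumPy; `X_demo` and `matches_demo` are hypothetical stand-ins for a fitted collection and a `predict` result, not values produced by aeon.

```python
import numpy as np

# Stand-in collection with the shapes used in this notebook (random values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 1, 50))  # (n_cases, n_channels, n_timepoints)
length = 10  # subsequence length, equal to the query length

# Hypothetical result laid out as (case_index, timestamp_index) rows.
matches_demo = np.array([[0, 0], [0, 16], [0, 22]])

# Each row selects one case and the starting timepoint of a window in it.
windows = np.stack(
    [X_demo[case, :, start : start + length] for case, start in matches_demo]
)
print(windows.shape)  # one (n_channels, length) window per match
```

Each recovered window has the same shape as the query, so it can be plotted or compared against the query directly.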

### 2.1 Brute force

The simplest method available is brute force search. It computes the Euclidean distance between the query and every subsequence of the same length in the input collection.

In [None]:
# Fit BruteForce subsequence search
subseq_brute = SubseqBruteForce(length=10, normalize=False)
subseq_brute.fit(X)  # Fit on 3D collection

# Predict with k=3 to find top 3 closest subsequences
# Returns (indices, distances) where indices has shape (n_matches, 2)
# with (case_idx, timestamp)
matches, distances = subseq_brute.predict(q, k=3)
print("BruteForce matches shape:", matches.shape)
print("Matches (case_idx, timestamp):")
print(matches)
print("\nDistances:")
print(distances)

BruteForce matches shape: (3, 2)
Matches (case_idx, timestamp):
[[ 0  0]
 [ 0 16]
 [ 0 22]]

Distances:
[0.         2.53352368 2.65271885]
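
For intuition, the computation described above can be sketched in plain NumPy. This is a minimal illustration and not aeon's implementation: the function name `brute_force_subsequence_search` is made up for this example, there is no normalization, and overlapping neighbouring windows are not filtered out, so its results may differ from the estimator's.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def brute_force_subsequence_search(X, q, k=1):
    """Return the k closest windows as (case, start) rows, plus distances.

    X has shape (n_cases, n_channels, n_timepoints), q has shape
    (n_channels, length). Hypothetical helper, not part of aeon.
    """
    length = q.shape[1]
    # All windows: shape (n_cases, n_channels, n_windows, length)
    windows = sliding_window_view(X, length, axis=2)
    # Euclidean distance, summed over channels and time
    diff = windows - q[None, :, None, :]
    dist = np.sqrt((diff**2).sum(axis=(1, 3)))  # (n_cases, n_windows)
    # Take the k smallest entries of the flattened distance matrix
    flat_order = np.argsort(dist, axis=None)[:k]
    cases, starts = np.unravel_index(flat_order, dist.shape)
    return np.column_stack([cases, starts]), dist[cases, starts]


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 1, 50))
q_demo = X_demo[0, :, :10]  # query cut from the start of case 0
idx, d = brute_force_subsequence_search(X_demo, q_demo, k=3)
print(idx)  # first row is the query's own location
print(d)  # first distance is zero
```

The cost grows with the number of windows times the query length, which is what motivates the FFT-based approach in the next section.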



### 2.2 MASS (FFT-based)

`MASS` is an efficient FFT-based algorithm for finding similar subsequences. In this example it returns the same matches and distances as the brute force search.

In [None]:
# MASS is an FFT-based algorithm for fast subsequence search
# length=10 matches the query length
mass = MASS(length=10, normalize=False)
mass.fit(X)  # Fit on 3D collection

# Predict with k=3 to find top 3 closest subsequences
# Returns (indices, distances) where indices has shape (n_matches, 2)
# with (case_idx, timestamp)
matches, distances = mass.predict(q, k=3)
print("MASS matches shape:", matches.shape)
print("Matches (case_idx, timestamp):")
print(matches)
print("\nDistances:")
print(distances)

MASS matches shape: (3, 2)
Matches (case_idx, timestamp):
[[ 0  0]
 [ 0 16]
 [ 0 22]]

Distances:
[0.         2.53352368 2.65271885]
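
The FFT idea can be sketched as follows: the dot product between the query and every window is obtained from one convolution, and the squared Euclidean distance follows from the identity `||t_i - q||^2 = ||t_i||^2 + ||q||^2 - 2 * dot(t_i, q)`. The code below is a single-channel, unnormalized illustration under those assumptions; `fft_distance_profile` is a hypothetical name, and aeon's `MASS` may differ in its details.

```python
import numpy as np


def fft_distance_profile(t, q):
    """Euclidean distance from q to every window of t (both 1D arrays)."""
    n, m = t.shape[0], q.shape[0]
    size = n + m - 1  # full linear convolution length, so no wrap-around
    # Convolving t with the reversed query yields the sliding dot products
    prod = np.fft.irfft(np.fft.rfft(t, size) * np.fft.rfft(q[::-1], size), size)
    qt = prod[m - 1 : n]  # dot(q, t[i:i+m]) for i = 0 .. n-m
    # Rolling sum of squares of t from cumulative sums
    csum = np.concatenate(([0.0], np.cumsum(t**2)))
    t_sq = csum[m:] - csum[:-m]
    d2 = t_sq + np.sum(q**2) - 2.0 * qt
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding


rng = np.random.default_rng(0)
t_demo = rng.normal(size=50)
q_demo = t_demo[:10].copy()
profile = fft_distance_profile(t_demo, q_demo)

# Cross-check against the direct computation
direct = np.array([np.linalg.norm(t_demo[i : i + 10] - q_demo) for i in range(41)])
print(np.allclose(profile, direct, atol=1e-6))
```

The FFT version costs roughly `O(n log n)` per series regardless of query length, whereas the direct loop costs `O(n * m)`.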


## 3. Whole Series Search

Whole series search estimators find the most similar complete time series within a collection. Unlike subsequence search, the query must have the same length as the series in the collection. The `predict` method returns the match indices together with their distances. The indices form a 1D array of shape `(n_matches,)` containing the indices of the most similar cases.

### 3.1 BruteForce (Exact)

`BruteForce` computes exact nearest neighbors using a specified distance function.

In [13]:
# Whole series search - query must be same length as fitted series
# Create a new collection with same-length series
X_whole, _ = make_example_3d_numpy(n_cases=10, n_channels=1, n_timepoints=50)
print("Collection shape:", X_whole.shape)

# Query is a single 2D series (same length as collection series)
q_whole = X_whole[3]  # Use one series from collection as query
print("Query shape:", q_whole.shape)

# Fit and predict with BruteForce whole series search
bf_whole = WholeBruteForce()
bf_whole.fit(X_whole)
matches_whole, distances_whole = bf_whole.predict(q_whole, k=3)
print("\nWhole series matches (case indices):", matches_whole)
print("Distances:", distances_whole)

Collection shape: (10, 1, 50)
Query shape: (1, 50)

Whole series matches (case indices): [3 7 0]
Distances: [ 0.         29.3043315  29.68740466]
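
As a rough picture of exact whole series search with a pluggable distance, the sketch below scores every case against the query and keeps the `k` smallest. The names `whole_series_knn`, `euclidean` and `manhattan` are illustrative and not aeon API; how the estimator selects its distance function is not shown here.

```python
import numpy as np


def euclidean(a, b):
    """Euclidean distance between two equally shaped series."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan(a, b):
    """Sum of absolute differences, as an example of an alternative."""
    return float(np.sum(np.abs(a - b)))


def whole_series_knn(X, q, k=1, distance=euclidean):
    """Return indices and distances of the k cases in X closest to q."""
    dists = np.array([distance(x, q) for x in X])
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 1, 50))
q_demo = X_demo[3]  # query taken from the collection, so case 3 matches exactly

idx_l2, d_l2 = whole_series_knn(X_demo, q_demo, k=3)
idx_l1, d_l1 = whole_series_knn(X_demo, q_demo, k=3, distance=manhattan)
print(idx_l2, d_l2)
print(idx_l1, d_l1)
```

Swapping the distance can change the ranking after the exact match, which is why the choice of distance matters for whole series search.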



### 3.2 LSHIndex (Approximate)

`LSHIndex` uses locality-sensitive hashing for fast approximate nearest neighbor search. This is useful when the dataset is large and exact search is too slow.

In [15]:
from aeon.similarity_search.whole_series import LSHIndex

lsh = LSHIndex()
lsh.fit(X_whole)
matches_lsh, distances_lsh = lsh.predict(q_whole, k=3)
print("LSH approximate matches (case indices):", matches_lsh)
print("Distances (Hamming):", distances_lsh)

LSH approximate matches (case indices): [3 8 0]
Distances (Hamming): [ 0. 50. 54.]
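
One common way to build such an index is sign random projection: each series is reduced to a bit signature, and candidates are ranked by the Hamming distance between signatures. The sketch below illustrates that general technique only. It is an assumption that it resembles what `LSHIndex` does internally, and `sign_hash`, the 64 hyperplanes, and the linear scan over signatures are all choices made for this example.

```python
import numpy as np


def sign_hash(X, planes):
    """Map each series to a boolean signature via random projections."""
    flat = X.reshape(X.shape[0], -1)  # (n_cases, n_channels * n_timepoints)
    return flat @ planes.T > 0  # (n_cases, n_bits)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 1, 50))
planes = rng.normal(size=(64, 50))  # 64 random hyperplanes (assumed count)

signatures = sign_hash(X_demo, planes)
q_sig = sign_hash(X_demo[3:4], planes)[0]  # query is case 3

# Hamming distance: number of signature bits that differ from the query
hamming = np.count_nonzero(signatures != q_sig, axis=1)
order = np.argsort(hamming, kind="stable")[:3]
print(order, hamming[order])
```

Because the ranking uses signatures rather than the raw series, the neighbours after the exact match can differ from those of an exact search, as the `[3 8 0]` versus `[3 7 0]` outputs above show for the aeon estimators.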
