# Expected Signatures: The "Fingerprints" of Stochastic Processes


In the previous notebook, we explored the signature of a single, deterministic path. Now, we'll extend this concept to the realm of **stochastic processes**—processes that have inherent randomness.

A single realization of a random process (like a stock price chart for one day) has a signature. But if we run the process again, we'll get a slightly different path and a different signature. The **expected signature** is the *average signature* over many, many possible realizations of the process.

This expected signature acts as a "fingerprint" of the process. Different types of stochastic processes (e.g., random walks, trending markets, mean-reverting systems) generally have different expected signatures. That makes it a useful tool for time series classification: it provides a fixed-length feature vector that summarizes the dynamics of a variable-length stream of data. In this notebook we truncate the signature at level 2, so the fingerprint is a coarse summary of each process rather than a complete description.

In this notebook, we will:
1. Generate sample paths from several common stochastic processes.
2. Calculate the signature for each path.
3. Average these signatures to compute the expected signature.
4. Visualize and compare these "fingerprints" to see how they uniquely identify each process.
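
For reference, the quantities computed in steps 2 and 3 can be written as follows. This assumes the common convention that level-2 terms are iterated integrals of the path measured relative to its starting point. The index notation \(S^{(i,j)}\) and the number of realizations \(N\) are introduced here for illustration.

\[
S^{(i)}(X) = X^i_T - X^i_0, \qquad
S^{(i,j)}(X) = \int_0^T \left(X^i_t - X^i_0\right) dX^j_t, \qquad i, j \in \{x, y\}
\]

\[
\widehat{\mathbb{E}}\left[S^{(i,j)}\right] = \frac{1}{N} \sum_{n=1}^{N} S^{(i,j)}\left(X^{(n)}\right)
\]

Here \(X^{(n)}\) is the \(n\)-th simulated path, so the estimate carries Monte Carlo error that shrinks like \(1/\sqrt{N}\).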


In [None]:

import numpy as np
import matplotlib.pyplot as plt

# Set a random seed for reproducibility
np.random.seed(42)

# --- Signature Calculation Helper ---
# Same helper as in the previous notebook, except that the path is first
# shifted to start at the origin (the signature depends only on increments)
def compute_signature_level_2(path):
    """
    Computes the signature of a 2D path up to level 2.

    Level-2 terms are ordered [S_xx, S_yx, S_xy, S_yy]. The path is treated
    as piecewise linear, for which the trapezoidal sums below are exact.
    """
    path = np.asarray(path, dtype=float)
    if len(path) < 2:
        return { "level_1": np.zeros(2), "level_2": np.zeros(4) }
    # Shift so the path starts at (0, 0); the result is then translation-invariant
    path = path - path[0]
    increments = np.diff(path, axis=0)
    level_1 = np.sum(increments, axis=0)
    level_2 = np.zeros(4)
    x_vals, y_vals = path[:, 0], path[:, 1]
    dx, dy = increments[:, 0], increments[:, 1]
    level_2[0] = 0.5 * (x_vals[-1]**2 - x_vals[0]**2) # integral(x dx)
    level_2[3] = 0.5 * (y_vals[-1]**2 - y_vals[0]**2) # integral(y dy)
    level_2[1] = np.sum(0.5 * (y_vals[:-1] + y_vals[1:]) * dx) # integral(y dx)
    level_2[2] = np.sum(0.5 * (x_vals[:-1] + x_vals[1:]) * dy) # integral(x dy)
    return { "level_1": level_1, "level_2": level_2 }

# --- Visualization Helper ---
def plot_process_and_expected_signature(paths, expected_sig_l2, title):
    """
    Generates a 3-panel plot for a stochastic process:
    1. A few sample paths.
    2. The expected signature (level 2) as a bar chart.
    3. The expected signature (level 2) as a heatmap.
    """
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    fig.suptitle(title, fontsize=16)
    
    # Plot sample paths
    ax1 = axes[0]
    ax1.set_title("Sample Realizations")
    for i in range(min(20, len(paths))):
        ax1.plot(paths[i, :, 0], paths[i, :, 1], alpha=0.5)
    ax1.set_xlabel("X")
    ax1.set_ylabel("Y")
    ax1.grid(True)
    
    # Plot expected signature as bar chart
    ax2 = axes[1]
    ax2.set_title("Expected Signature (Level 2)")
    labels = ['S₂_xx', 'S₂_yx', 'S₂_xy', 'S₂_yy']
    ax2.bar(labels, expected_sig_l2, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'])
    ax2.grid(axis='y')
    plt.setp(ax2.get_xticklabels(), rotation=30, ha="right")
    
    # Plot expected signature as heatmap
    ax3 = axes[2]
    ax3.set_title("Signature Heatmap")
    sig_matrix = expected_sig_l2.reshape(2, 2)
    im = ax3.imshow(sig_matrix, cmap='viridis')
    # Rows correspond to the integrator (dx, dy), columns to the integrand (x, y)
    ax3.set_xticks([0, 1])
    ax3.set_yticks([0, 1])
    ax3.set_xticklabels(['x', 'y'])
    ax3.set_yticklabels(['dx', 'dy'])
    for i in range(2):
        for j in range(2):
            ax3.text(j, i, f"{sig_matrix[i, j]:.2f}", ha="center", va="center", color="w")
    fig.colorbar(im, ax=ax3)
    
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()
    
print("Helper functions defined.")



## Cell 1: Brownian Motion (Random Walk)

Our first example is **Brownian motion**, the continuous-time limit of a random walk. We simulate it by summing independent Gaussian increments with variance `dt` in each coordinate. It has no memory and no preferred direction (zero drift), which makes it the baseline for many stochastic models.

We'll generate 100 different random walks, calculate the signature for each one, and then average them to find the expected signature.

**What to look for:**
- The sample paths should look like classic, jagged random walks.
- Because the two coordinates are independent and symmetric about zero, the cross-terms of the signature (\(\int y dx\) and \(\int x dy\)) are zero in expectation: area swept clockwise in one realization is cancelled, on average, by area swept counter-clockwise in another. With only 100 realizations, the sample average will be near zero but not exactly zero.
- The diagonal terms (\(\int x dx\) and \(\int y dy\)) equal half the squared displacement in each coordinate. Their expectation is \(T/2\) for a path of total duration \(T\), so they are non-zero and reflect the variance of the process.
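
The two expectations above can be checked with a small standalone simulation. This is a minimal sketch, not the notebook's own pipeline: the path count, step count, and the helper `mean_and_stderr` are assumptions chosen for illustration, and the total time is `T = 1` so that the theoretical diagonal value is 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

n_paths, n_steps, dt = 5000, 100, 0.01
T = n_steps * dt  # total time covered by the increments

# Independent Gaussian increments for x and y, shape (n_paths, n_steps)
dx = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
dy = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)

# Paths start at the origin; prepend the starting point to the cumulative sums
x = np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dx, axis=1)], axis=1)

# Level-2 terms per path (trapezoidal rule, exact for piecewise-linear paths)
s_xx = 0.5 * x[:, -1] ** 2                                # integral(x dx)
s_xy = np.sum(0.5 * (x[:, :-1] + x[:, 1:]) * dy, axis=1)  # integral(x dy)

def mean_and_stderr(samples):
    """Sample mean and its Monte Carlo standard error."""
    return samples.mean(), samples.std(ddof=1) / np.sqrt(len(samples))

mean_xx, se_xx = mean_and_stderr(s_xx)
mean_xy, se_xy = mean_and_stderr(s_xy)
print(f"S_xx: {mean_xx:.3f} +/- {se_xx:.3f} (theory: {T / 2:.3f})")
print(f"S_xy: {mean_xy:.3f} +/- {se_xy:.3f} (theory: 0)")
```

The standard error is the useful number here: a cross-term within a few standard errors of zero is consistent with the symmetry argument, even if the bar in the plot is visibly non-zero.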


In [None]:

# --- 1. Brownian Motion ---

# Parameters
num_realizations = 100
num_steps = 1000
dt = 0.01  # Time step

# Generate paths
paths_bm = np.zeros((num_realizations, num_steps, 2))
for i in range(num_realizations):
    # Generate random increments (steps)
    increments = np.random.randn(num_steps - 1, 2) * np.sqrt(dt)
    # Cumulatively sum to get the path
    paths_bm[i, 1:, :] = np.cumsum(increments, axis=0)

# Calculate signatures for all paths
signatures_bm = [compute_signature_level_2(p) for p in paths_bm]

# Calculate expected signature (average of level 2 terms)
expected_sig_bm_l2 = np.mean([s['level_2'] for s in signatures_bm], axis=0)

# Visualize
plot_process_and_expected_signature(paths_bm, expected_sig_bm_l2, "Process 1: 2D Brownian Motion")



## Cell 2: Geometric Brownian Motion (Trending Process)

Next, we'll look at **Geometric Brownian Motion (GBM)**. This process is often used to model stock prices because it has a "drift" component, which gives it a general trend, and a random component that creates fluctuations. Unlike standard Brownian motion, the size of the random fluctuations in GBM is proportional to the current value, which is more realistic for financial assets.

We will set different drift rates for the x and y components to create an anisotropic (directionally-dependent) process.

**What to look for:**
- The sample paths should show a general trend (drifting upwards and to the right, on average).
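- Because of the drift, the process is no longer symmetric about its starting point, so we expect the cross-terms of the signature to be non-zero. The x and y components are still driven by independent noise; the cross-terms arise because both coordinates drift upwards on average, not because the drift correlates their fluctuations.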
- The heatmap of the signature should look asymmetric, reflecting the underlying trend in the process.
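
A rough check on the size of these cross-terms is possible in closed form. The sketch below assumes the signature is taken relative to the starting point and that both drifts are non-zero. Because the two coordinates are independent, the expected cross-term reduces to an ordinary integral of the mean displacement of one coordinate against the mean drift of the other. The function name `expected_cross_term_gbm` and the horizon `T` are introduced here for illustration.

```python
import numpy as np

def expected_cross_term_gbm(mu_a, mu_b, T, a0=1.0, b0=1.0):
    """E[ integral_0^T (A_t - a0) dB_t ] for independent GBMs A and B.

    Uses E[A_t - a0] = a0 * (exp(mu_a * t) - 1) and
    E[dB_t] = mu_b * b0 * exp(mu_b * t) dt. Assumes mu_b != 0 and mu_a + mu_b != 0.
    """
    rate = mu_a + mu_b
    return a0 * b0 * mu_b * (np.expm1(rate * T) / rate - np.expm1(mu_b * T) / mu_b)

# Drifts used in this notebook; T = (num_steps - 1) * dt = 9.99
mu_x, mu_y, T = 0.05, 0.02, 9.99

closed_xy = expected_cross_term_gbm(mu_x, mu_y, T)  # E[integral (x - x0) dy]
closed_yx = expected_cross_term_gbm(mu_y, mu_x, T)  # E[integral (y - y0) dx]

# Cross-check the closed form against a numerical trapezoidal integral
t = np.linspace(0.0, T, 20001)
integrand = (np.exp(mu_x * t) - 1.0) * mu_y * np.exp(mu_y * t)
numeric_xy = np.sum(0.5 * (integrand[:-1] + integrand[1:]) * np.diff(t))

print(f"E[S_xy] closed form: {closed_xy:.4f}, numerical: {numeric_xy:.4f}")
print(f"E[S_yx] closed form: {closed_yx:.4f}")
```

With drifts this small the expected cross-terms are positive but well below 1, so with 100 realizations they may be hard to distinguish from Monte Carlo noise.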


In [None]:

# --- 2. Geometric Brownian Motion ---

# Parameters
mu_x = 0.05  # Drift for X
mu_y = 0.02  # Drift for Y
sigma = 0.2   # Volatility
S0 = [1, 1]   # Initial values

# Generate paths
paths_gbm = np.zeros((num_realizations, num_steps, 2))
paths_gbm[:, 0, :] = S0
for i in range(num_realizations):
    for t in range(1, num_steps):
        # Generate random component
        Z = np.random.randn(2)
        # Update X
        paths_gbm[i, t, 0] = paths_gbm[i, t-1, 0] * np.exp((mu_x - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z[0])
        # Update Y
        paths_gbm[i, t, 1] = paths_gbm[i, t-1, 1] * np.exp((mu_y - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z[1])

# Calculate signatures for all paths
signatures_gbm = [compute_signature_level_2(p) for p in paths_gbm]

# Calculate expected signature
expected_sig_gbm_l2 = np.mean([s['level_2'] for s in signatures_gbm], axis=0)

# Visualize
plot_process_and_expected_signature(paths_gbm, expected_sig_gbm_l2, "Process 2: Geometric Brownian Motion")



## Cell 3: Ornstein-Uhlenbeck (Mean-Reverting Process)

The **Ornstein-Uhlenbeck (OU)** process is a model for a value that tends to drift back towards a long-term average. Think of it like a spring: the further you pull it from its equilibrium point, the stronger it pulls back. This "mean-reverting" behavior is common in many physical and financial systems, such as interest rates or the velocity of a particle in a fluid.

We will set the process to revert to the origin (0,0).

**What to look for:**
- The sample paths should wander randomly but always be pulled back towards the center. They won't stray as far as the Brownian motion paths.
- The mean reversion introduces a clear structure: when x is above the mean it tends to decrease, and when it is below the mean it tends to increase. This bounds the displacement, so the diagonal terms should be much smaller in magnitude than for Brownian motion.
- The x and y components are simulated independently, with no coupling or rotation between them, so the cross-terms should again average out close to zero. What separates this fingerprint from Brownian motion is mainly the scale of the diagonal terms.
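
To see how tightly mean reversion confines the paths, one can compare the simulated variance with the stationary variance \(\sigma^2 / (2\theta)\) of an OU process. This is a minimal one-dimensional sketch using the exact one-step transition rather than the Euler update used in the cell below. The parameter values match that cell, while the path and step counts are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

theta, sigma, dt = 0.5, 0.3, 0.01  # same values as this notebook's OU parameters
n_paths, n_steps = 4000, 2000      # T = 20, long enough to forget the start

# Exact one-step transition of an OU process reverting to 0:
# X_{t+dt} = exp(-theta*dt) * X_t + sigma * sqrt((1 - exp(-2*theta*dt)) / (2*theta)) * Z
decay = np.exp(-theta * dt)
step_std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * theta))

x = np.zeros(n_paths)
for _ in range(n_steps):
    x = decay * x + step_std * rng.standard_normal(n_paths)

stationary_var = sigma**2 / (2.0 * theta)
sample_var = x.var()
brownian_var = n_steps * dt  # variance of standard Brownian motion at the same time

print(f"OU sample variance:  {sample_var:.4f}")
print(f"OU stationary value: {stationary_var:.4f}")
print(f"BM variance at T:    {brownian_var:.1f}")
```

The OU variance levels off at a constant while the Brownian variance keeps growing with time, which is why the diagonal signature terms differ so much in scale.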


In [None]:

# --- 3. Ornstein-Uhlenbeck Process ---

# Parameters
theta = 0.5    # Speed of reversion
mu = [0, 0]    # Mean to revert to
sigma_ou = 0.3 # Volatility

# Generate paths
paths_ou = np.zeros((num_realizations, num_steps, 2))
# Start paths from a random point to see the reversion clearly
paths_ou[:, 0, :] = np.random.randn(num_realizations, 2) * 0.5 

for i in range(num_realizations):
    for t in range(1, num_steps):
        # The change is the pull-back term + a random shock
        dXt = theta * (mu - paths_ou[i, t-1, :]) * dt + sigma_ou * np.sqrt(dt) * np.random.randn(2)
        paths_ou[i, t, :] = paths_ou[i, t-1, :] + dXt

# Calculate signatures for all paths
signatures_ou = [compute_signature_level_2(p) for p in paths_ou]

# Calculate expected signature
expected_sig_ou_l2 = np.mean([s['level_2'] for s in signatures_ou], axis=0)

# Visualize
plot_process_and_expected_signature(paths_ou, expected_sig_ou_l2, "Process 3: Ornstein-Uhlenbeck (Mean-Reverting)")



## Cell 4: Trending + Seasonal Process

So far, we've looked at processes that are primarily random. What if the path has a strong, predictable, deterministic component? Here, we'll create a process that has both a linear trend and a sinusoidal (seasonal) pattern, with a small amount of random noise added on top.

This is a very different kind of process, and its signature should be unique.

**What to look for:**
- The sample paths should be very regular, tracing counter-clockwise loops that drift steadily to the right.
- Because the underlying pattern is deterministic and not symmetric, the signature will be highly structured and asymmetric.
- The cross-terms will be large and of opposite sign (\(\int x dy > 0\) and \(\int y dx < 0\)), reflecting the consistent counter-clockwise rotation of the underlying cosine and sine components. Half their difference is the signed area swept out by the path, which is how the signature detects the strong seasonal relationship between x and y.
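
The sign pattern can be verified on the noise-free path. This sketch assumes the same trend and seasonal coefficients as the cell below and measures the path relative to its starting point. The variable names `s_xy`, `s_yx`, and `signed_area` are introduced here for illustration.

```python
import numpy as np

# Noise-free version of the trending + seasonal path
t = np.linspace(0, 10, 1000)
x = 0.3 * t + 0.5 * np.cos(2 * np.pi * t)
y = 0.5 * np.sin(2 * np.pi * t)

# Measure the path relative to its starting point
x = x - x[0]
y = y - y[0]
dx, dy = np.diff(x), np.diff(y)

# Cross-terms via the trapezoidal rule (exact for the piecewise-linear path)
s_xy = np.sum(0.5 * (x[:-1] + x[1:]) * dy)  # integral(x dy)
s_yx = np.sum(0.5 * (y[:-1] + y[1:]) * dx)  # integral(y dx)

# Half the difference of the cross-terms is the signed (Levy) area
signed_area = 0.5 * (s_xy - s_yx)

# Ten counter-clockwise loops of radius 0.5, each enclosing pi * r^2
expected_area = 10 * np.pi * 0.5**2

print(f"S_xy = {s_xy:.3f}, S_yx = {s_yx:.3f}")
print(f"signed area = {signed_area:.3f}, ten loops = {expected_area:.3f}")
```

The linear trend contributes nothing to the signed area here because the path ends at the same height at which it started, so the area comes entirely from the ten loops.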


In [None]:

# --- 4. Trending + Seasonal Process ---

# Parameters
t_space = np.linspace(0, 10, num_steps)
noise_level = 0.1

# Generate paths
paths_trend = np.zeros((num_realizations, num_steps, 2))
# Deterministic part: linear trend in x plus one counter-clockwise loop per unit time.
# It is identical for every realization, so compute it once outside the loop.
trend_x = 0.3 * t_space
trend_y = 0.0 * t_space
seasonal_x = 0.5 * np.cos(2 * np.pi * t_space)
seasonal_y = 0.5 * np.sin(2 * np.pi * t_space)
deterministic = np.c_[trend_x + seasonal_x, trend_y + seasonal_y]

for i in range(num_realizations):
    # Random noise, accumulated into a slow random drift and scaled by dt
    noise = np.random.randn(num_steps, 2) * noise_level
    paths_trend[i, :, :] = deterministic + np.cumsum(noise, axis=0) * dt

# Calculate signatures for all paths
signatures_trend = [compute_signature_level_2(p) for p in paths_trend]

# Calculate expected signature
expected_sig_trend_l2 = np.mean([s['level_2'] for s in signatures_trend], axis=0)

# Visualize
plot_process_and_expected_signature(paths_trend, expected_sig_trend_l2, "Process 4: Trending + Seasonal Process")



## Cell 5: Comparison Dashboard - The "Fingerprints"

Now for the key insight. Let's plot the Level 2 expected signature heatmaps for all four processes side-by-side. This allows us to directly compare their "fingerprints."

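Compare the heatmaps: processes with different dynamics should show different patterns of sign and magnitude. This is the appeal of the expected signature: it turns a complex, variable-length time series into a small, fixed-size matrix (or vector) that summarizes the underlying process. Because all four panels share one color scale, processes with small signature terms can look alike here, so the printed values are the more reliable guide.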
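
A visual comparison can be backed by a simple significance check. The sketch below computes, for each signature term, a z-score for the difference between two sample means, given the per-path signatures of two processes. The function `mean_difference_z` is a hypothetical helper, and the arrays are synthetic stand-ins rather than output from this notebook.

```python
import numpy as np

def mean_difference_z(sigs_a, sigs_b):
    """Per-term z-score for the difference between two expected signatures.

    sigs_a, sigs_b: arrays of shape (n_paths, n_terms), one row per realization.
    """
    sigs_a = np.asarray(sigs_a, dtype=float)
    sigs_b = np.asarray(sigs_b, dtype=float)
    diff = sigs_a.mean(axis=0) - sigs_b.mean(axis=0)
    # Standard error of the difference of two independent sample means
    stderr = np.sqrt(
        sigs_a.var(axis=0, ddof=1) / len(sigs_a)
        + sigs_b.var(axis=0, ddof=1) / len(sigs_b)
    )
    return diff / stderr

# Synthetic stand-in data (NOT notebook output): only the first term differs
rng = np.random.default_rng(2)
sigs_a = rng.standard_normal((500, 4))
sigs_b = rng.standard_normal((500, 4)) + np.array([3.0, 0.0, 0.0, 0.0])

z = mean_difference_z(sigs_a, sigs_b)
print(np.round(z, 1))
```

In this notebook the inputs would be arrays such as `np.array([s['level_2'] for s in signatures_bm])`. Terms with a large absolute z-score are the ones that actually separate two fingerprints.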


In [None]:

# --- 5. Comparison Dashboard ---

fig, axes = plt.subplots(2, 2, figsize=(10, 10))
fig.suptitle("Comparison of Expected Signature (Level 2) Heatmaps", fontsize=16)

all_sigs = {
    "Brownian Motion": expected_sig_bm_l2,
    "Geometric Brownian Motion": expected_sig_gbm_l2,
    "Ornstein-Uhlenbeck": expected_sig_ou_l2,
    "Trending + Seasonal": expected_sig_trend_l2
}

# Find common color scale limits
min_val = min(sig.min() for sig in all_sigs.values())
max_val = max(sig.max() for sig in all_sigs.values())

for ax, (title, sig) in zip(axes.flatten(), all_sigs.items()):
    sig_matrix = sig.reshape(2, 2)
    im = ax.imshow(sig_matrix, cmap='viridis', vmin=min_val, vmax=max_val)
    ax.set_title(title)
    # Rows correspond to the integrator (dx, dy), columns to the integrand (x, y)
    ax.set_xticks([0, 1])
    ax.set_yticks([0, 1])
    ax.set_xticklabels(['x', 'y'])
    ax.set_yticklabels(['dx', 'dy'])
    for i in range(2):
        for j in range(2):
            ax.text(j, i, f"{sig_matrix[i, j]:.2f}", ha="center", va="center", color="w")

fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
fig.colorbar(im, cax=cbar_ax)

plt.show()



## Cell 6: Classification Demo

The ultimate test of these "fingerprints" is whether a machine learning model can use them to tell the different processes apart. If the signatures are truly characteristic of the process, then they should be effective features for a classification task.

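Here, we'll take the individual signatures from every realization we generated (100 for each of the 4 processes). Since we have no separate test set, we use K-Means, an unsupervised clustering algorithm, rather than a trained classifier. K-Means is fit on all four Level 2 signature terms, while the plot shows only two of them, the cross-terms \(\int y dx\) and \(\int x dy\), so that the result fits in 2D.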

We will create a 2D scatter plot where each point represents a single time series, positioned according to its two signature features and colored by its true process type. If the signatures are good features, we should see distinct, well-separated clusters of colors.
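
To judge the clustering by a number rather than by eye, one option is cluster purity: the fraction of points whose cluster's majority label matches their own. This is a minimal NumPy sketch. `cluster_purity` is a hypothetical helper, and the label arrays are toy examples, not results from this notebook.

```python
import numpy as np

def cluster_purity(true_labels, cluster_ids):
    """Fraction of points that belong to the majority true label of their cluster."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    matched = 0
    for c in np.unique(cluster_ids):
        members = true_labels[cluster_ids == c]
        # Count how many members carry the most common true label in this cluster
        _, counts = np.unique(members, return_counts=True)
        matched += counts.max()
    return matched / len(true_labels)

# Toy examples: cluster ids are arbitrary, so a relabelled perfect split still scores 1.0
perfect = cluster_purity([0, 0, 1, 1], [1, 1, 0, 0])
mixed = cluster_purity([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, mixed)
```

Once K-Means has been fit in the cell below, `cluster_purity(labels, predicted_labels)` would give the corresponding score for the four processes. scikit-learn's `adjusted_rand_score` is a chance-corrected alternative.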


In [None]:

# --- 6. Classification Demo ---
from sklearn.cluster import KMeans

# Combine all the level 2 signatures we've calculated
all_level_2_sigs = np.vstack([
    np.array([s['level_2'] for s in signatures_bm]),
    np.array([s['level_2'] for s in signatures_gbm]),
    np.array([s['level_2'] for s in signatures_ou]),
    np.array([s['level_2'] for s in signatures_trend])
])

# Create corresponding labels
labels = np.array(
    [0] * num_realizations +  # Brownian Motion
    [1] * num_realizations +  # Geometric Brownian Motion
    [2] * num_realizations +  # Ornstein-Uhlenbeck
    [3] * num_realizations    # Trending + Seasonal
)
process_names = ["BM", "GBM", "OU", "Trend"]
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']

# We will use the two cross-terms as our features for the 2D plot
# These are the 2nd and 3rd columns of our signature matrix (indices 1 and 2)
feature_1 = all_level_2_sigs[:, 1]  # S_yx = integral(y dx)
feature_2 = all_level_2_sigs[:, 2]  # S_xy = integral(x dy)

# --- Perform K-Means clustering ---
# Let's see if a simple clustering algorithm can find these groups on its own
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
predicted_labels = kmeans.fit_predict(all_level_2_sigs)
centers = kmeans.cluster_centers_

# --- Plotting ---
fig, ax = plt.subplots(figsize=(10, 8))
ax.set_title("Classification of Processes using Signature Features", fontsize=16)
# Raw strings keep the LaTeX backslashes from being parsed as Python escape sequences
ax.set_xlabel(r"Feature 1: $\int y(t) dx(t)$ (Area Signature Term)")
ax.set_ylabel(r"Feature 2: $\int x(t) dy(t)$ (Area Signature Term)")

# Scatter plot of the true labels
for i, (name, color) in enumerate(zip(process_names, colors)):
    mask = labels == i
    ax.scatter(feature_1[mask], feature_2[mask], c=color, label=f"True: {name}", alpha=0.6)

# Plot the centers found by K-Means
ax.scatter(centers[:, 1], centers[:, 2], s=200, marker='X', c='black', label='K-Means Centers')

ax.legend()
ax.grid(True)
plt.show()

print("Key takeaway: processes with different dynamics occupy different regions of the signature feature space.")
print("Compare the K-Means centers with the true-label colors to judge how well a simple, unsupervised")
print("algorithm recovers the groups from Level 2 signature features alone.")



## Summary & Next Steps

This notebook demonstrated the power of the **expected signature** as a feature engineering tool for time series data. We saw that four visually and mathematically distinct stochastic processes produce four equally distinct signature "fingerprints."

The key takeaways are:
- **Signatures as Features**: The signature transforms a complex, variable-length path into a fixed-length vector of features.
- **Capturing Dynamics**: These features are not arbitrary; they are derived from the path's geometry and capture its essential dynamic properties (trends, rotations, volatility).
- **Classification Power**: As explored in the final demo, these features can be passed directly to standard machine learning models to classify or cluster different types of time series.

This approach provides a robust, model-free way to represent sequential data, paving the way for more advanced analysis and machine learning tasks.


## Dependencies

To run this notebook, you will need the following Python libraries:
```
numpy
matplotlib
scikit-learn
jupyter
```
