# Pattern Mining Phase: Matrix Profile Analysis

To validate the approximate motifs detected by SAX, a second, higher-precision pattern mining stage was conducted using the **Matrix Profile** method, implemented via the `stumpy` Python library. Unlike SAX, which relies on discretization, the Matrix Profile computes the exact z-normalized Euclidean distance from every subsequence to its nearest non-trivial neighbor. Its only required parameter is the window length $m$, which makes it a nearly parameter-free way to identify motifs (repeating patterns) and discords (anomalies).

### Methodology
The Matrix Profile was computed for the Earthquake Time Series (E, N, and V channels) to locate the most conserved waveform shapes.

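* **Algorithm:** We used `stumpy`'s exact, parallelized matrix profile routines: `stumpy.stump` handles a single channel, and its multidimensional counterpart `stumpy.mstump` is what the extraction code in this notebook calls on the three-channel traces.
* **Window Size ($m$):** A window size of $m = 1000$ (approx. 10 seconds) was selected, consistent with the stable motif duration identified in the SAX analysis. Because the processed traces loaded in this notebook are themselves 1000 samples long, the extraction code also sweeps the shorter windows $m = 100, 200, 500$; a window must be shorter than the trace for a second, non-overlapping occurrence to exist.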
* **Metric:** **Z-normalized Euclidean Distance**. This normalization is critical for seismic analysis as it focuses on waveform *shape* rather than absolute *amplitude*, allowing the detection of repeating scattering patterns even as the earthquake signal attenuates over time.
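
The effect of z-normalization can be illustrated with a small NumPy sketch. The helper names `znorm` and `znorm_dist` and the synthetic sine "waveform" are assumptions made for this illustration and are not part of the analysis pipeline; the point is only that an attenuated, offset copy of a waveform is far away in raw Euclidean distance but essentially identical after z-normalization.

```python
import numpy as np

def znorm(x):
    """Scale x to zero mean and unit variance (constant input maps to zeros)."""
    x = np.asarray(x, dtype=np.float64)
    std = x.std()
    if std == 0:
        return np.zeros_like(x)
    return (x - x.mean()) / std

def znorm_dist(a, b):
    """Euclidean distance between the z-normalized versions of a and b."""
    return float(np.linalg.norm(znorm(a) - znorm(b)))

# Synthetic stand-in for a waveform and an attenuated, offset repeat of it
t = np.linspace(0.0, 1.0, 1000)
early = np.sin(2 * np.pi * 5 * t)
late = 0.1 * early + 3.0

raw = float(np.linalg.norm(early - late))  # dominated by amplitude and offset
zn = znorm_dist(early, late)               # compares shape only

print(f"raw distance:          {raw:.3f}")
print(f"z-normalized distance: {zn:.3e}")
```

The raw distance is large while the z-normalized distance is zero up to floating-point error, which is the property that lets repeating patterns be matched as the signal decays.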



### Analytical Objectives
The Matrix Profile vector $P$ was analyzed to extract two key physical features:

1.  **Motif Discovery (Global Minima):**
    The location of the global minimum of $P$, together with its nearest neighbor, identifies the **Top-1 Motif**: the pair of subsequences with the highest similarity. This pair is the "signature" waveform of the site's crustal response.

2.  **Regime Change Detection (Semantic Segmentation):**
    By analyzing transitions in the Matrix Profile values, we identified boundaries between different physical regimes (e.g., the transition from the chaotic P-onset to the rhythmic, repeating Coda phase).
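
To make the motif objective concrete, the following is a brute-force sketch of a single-channel matrix profile in plain NumPy. It is not how `stumpy` computes the profile (the library uses a much faster algorithm); the function name `naive_matrix_profile`, the synthetic series, and the exclusion zone of $\lceil m/4 \rceil$ samples are assumptions chosen for illustration.

```python
import numpy as np

def naive_matrix_profile(ts, m):
    """Brute-force z-normalized matrix profile, O(n^2 * m) time. Illustration only.

    Returns (profile, nn) where profile[i] is the distance from subsequence i
    to its nearest non-trivial neighbor and nn[i] is that neighbor's index.
    """
    ts = np.asarray(ts, dtype=np.float64)
    n_sub = ts.shape[0] - m + 1
    windows = np.lib.stride_tricks.sliding_window_view(ts, m)  # (n_sub, m)

    # z-normalize every window along time
    mu = windows.mean(axis=1, keepdims=True)
    sigma = windows.std(axis=1, keepdims=True)
    sigma[sigma == 0] = 1.0  # avoid division by zero on flat windows
    z = (windows - mu) / sigma

    # All pairwise Euclidean distances between z-normalized windows
    sq_norm = (z ** 2).sum(axis=1)
    d2 = sq_norm[:, None] + sq_norm[None, :] - 2.0 * (z @ z.T)
    dist = np.sqrt(np.maximum(d2, 0.0))

    # Exclude trivial matches (a window matching itself or a close overlap)
    excl = int(np.ceil(m / 4))  # assumed exclusion zone
    idx = np.arange(n_sub)
    dist[np.abs(idx[:, None] - idx[None, :]) <= excl] = np.inf

    nn = dist.argmin(axis=1)
    profile = dist[idx, nn]
    return profile, nn

# Synthetic example: noise with the same pattern planted at samples 100 and 400
rng = np.random.default_rng(0)
m = 50
ts = rng.normal(size=600)
pattern = 5.0 * np.sin(np.linspace(0.0, 4.0 * np.pi, m))
ts[100:150] = pattern
ts[400:450] = pattern

profile, nn = naive_matrix_profile(ts, m)
i = int(profile.argmin())  # one member of the Top-1 motif pair
j = int(nn[i])             # its nearest neighbor, the other member
print("Top-1 motif pair starts at samples:", sorted((i, j)))
```

The global minimum of the profile lands on one planted occurrence and its nearest-neighbor index points to the other, which is the "pair of subsequences" referred to in the first objective.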

### Comparison with SAX Results
This stage serves as a verification step. While SAX provides a global statistical view of symbol distribution, `stumpy` provides exact localization.
* **Expectation:** If the SAX "P-coda motif" is real, the Matrix Profile should show a distinct "valley" (low distance values) in the region between P and S arrivals, indicating high self-similarity.
* **Discords:** High values in the Matrix Profile will highlight non-repeating transients, expected to correspond to the unique impulsive onsets of the P and S phases.
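
As a sketch of how discords could be read off a profile, the snippet below picks the $k$ largest profile values while keeping the picks at least an exclusion zone apart, so that one transient does not occupy every slot. The function name `top_k_discords`, the exclusion zone of $\lceil m/4 \rceil$, and the hand-made profile are assumptions for illustration; on real traces the returned indices could then be compared with `p_arrival_sample` and `s_arrival_sample` from the metadata.

```python
import numpy as np

def top_k_discords(profile, m, k=2):
    """Return start indices of the k largest finite profile values,
    suppressing neighbors within an assumed exclusion zone of ceil(m / 4)."""
    p = np.asarray(profile, dtype=np.float64).copy()
    p[~np.isfinite(p)] = -np.inf  # ignore undefined entries
    excl = int(np.ceil(m / 4))
    picks = []
    for _ in range(k):
        idx = int(np.argmax(p))
        if p[idx] == -np.inf:     # nothing left to pick
            break
        picks.append(idx)
        lo, hi = max(0, idx - excl), min(len(p), idx + excl + 1)
        p[lo:hi] = -np.inf        # mask the neighborhood of this pick
    return picks

# Hand-made profile: two separate peaks plus a near-duplicate of the first
demo_profile = np.full(100, 1.0)
demo_profile[20] = 5.0
demo_profile[22] = 4.5   # overlaps the peak at 20, should be suppressed
demo_profile[70] = 4.0

discords = top_k_discords(demo_profile, m=20, k=2)
print(discords)
```

With this hand-made input the second-highest value at index 22 is skipped because it overlaps the first pick, and the second discord is taken from the separate peak at index 70.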



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

signal_data = np.load("signal_data/processed_seismic_data.npy")

print(signal_data.shape)

(29871, 3, 1000)


In [2]:
metadata = pd.read_csv("metadata/processed_metadata.csv")

print(metadata.shape)

metadata.head()

(29871, 28)


| | trace_name | network_code | receiver_code | receiver_type | source_origin_time | trace_start_time | receiver_latitude | receiver_longitude | receiver_elevation_m | p_arrival_sample | ... | source_magnitude | source_magnitude_type | source_distance_km | back_azimuth_deg | coda_end_sample | id | norm_s_arrival_sample | snr_db_E | snr_db_N | snr_db_V |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109C.TA_20061103155652_EV | TA | 109C | BH | 2006-11-03 15:56:42.73 | 2006-11-03 15:56:53.610000 | 32.8889 | -117.1051 | 150.0 | 600.0 | ... | 4.3 | mb | 101.34 | 281.7 | 5508 | 235427 | 236 | 65.0 | 65.5 | 61.400002 |
| 1 | 109C.TA_20061129211102_EV | TA | 109C | BH | 2006-11-29 21:10:55.02 | 2006-11-29 21:11:03.890000 | 32.8889 | -117.1051 | 150.0 | 900.0 | ... | 4.1 | ml | 108.03 | 273.8 | 3199 | 235432 | 558 | 55.0 | 56.099998 | 43.200001 |
| 2 | 109C.TA_20061129221547_EV | TA | 109C | BH | 2006-11-29 22:15:38.65 | 2006-11-29 22:15:48.630000 | 32.8889 | -117.1051 | 150.0 | 800.0 | ... | 3.9 | ml | 106.69 | 273.7 | 5252 | 235434 | 283 | 49.0 | 48.0 | 39.200001 |
| 3 | 109C.TA_20070209033349_EV | TA | 109C | BH | 2007-02-09 03:33:42.80 | 2007-02-09 03:33:50.600000 | 32.8889 | -117.1051 | 150.0 | 900.0 | ... | 4.2 | ml | 98.93 | 246.8 | 2866 | 235437 | 580 | 65.0 | 68.199997 | 58.700001 |
| 4 | 109C.TA_20070415225732_EV | TA | 109C | BH | 2007-04-15 22:57:25.78 | 2007-04-15 22:57:33.940000 | 32.8889 | -117.1051 | 150.0 | 900.0 | ... | 4.3 | ml | 99.46 | 280.3 | 5848 | 235441 | 240 | 60.099998 | 64.800003 | 53.400002 |


In [3]:
print(metadata.columns)

Index(['trace_name', 'network_code', 'receiver_code', 'receiver_type',
       'source_origin_time', 'trace_start_time', 'receiver_latitude',
       'receiver_longitude', 'receiver_elevation_m', 'p_arrival_sample',
       'p_status', 'p_travel_sec', 's_arrival_sample', 's_status', 'source_id',
       'source_latitude', 'source_longitude', 'source_depth_km',
       'source_magnitude', 'source_magnitude_type', 'source_distance_km',
       'back_azimuth_deg', 'coda_end_sample', 'id', 'norm_s_arrival_sample',
       'snr_db_E', 'snr_db_N', 'snr_db_V'],
      dtype='object')


In [None]:
import stumpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tqdm import tqdm
import os

# ================================================================
# 0. Ensure output directory exists
# ================================================================
os.makedirs("motif_results", exist_ok=True)

# ================================================================
# 1. PARAMETERS
# ================================================================
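# NOTE: each trace has T samples (T = 1000 for the array loaded above).
# With m == T there is only one subsequence per trace, so no second
# occurrence exists to match it against; results for m=1000 should be
# read as "the whole trace", not as a repeating motif.
MOTIF_LENGTHS = [100, 200, 500, 1000]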
K_RANGE = range(2, 10)  # Elbow curve range
N_CLUSTERS = 4          # Default cluster count
N_EXAMPLES = 10         # Motifs to plot per cluster

# signal_data shape must be (N, 3, T)
signal_data = signal_data.astype(np.float64)
N, C, T = signal_data.shape


# ================================================================
# 2. LOOP OVER MOTIF LENGTHS
# ================================================================
for m in MOTIF_LENGTHS:
    print("\n===============================")
    print(f"   PROCESSING MOTIF LENGTH m={m}")
    print("===============================\n")

    save_dir = f"motif_results/m_{m}"
    os.makedirs(save_dir, exist_ok=True)

    # ================================================================
    # 2A. Extract motifs for all signals using mstump
    # ================================================================
    motif_shapes = []      # will be (N, 3, m)
    motif_dists = []

    for i in tqdm(range(N), desc=f"Motifs m={m}"):
        signal = signal_data[i]

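        # Compute the multidimensional matrix profile. mstump returns (P, I),
        # each of shape (C, T - m + 1); row k is the (k + 1)-dimensional
        # profile, so mp[0] is the best single-channel profile and mp[-1]
        # would require all C channels to match.
        mp, _ = stumpy.mstump(signal, m)

        # Start index of the subsequence with the smallest 1-D profile value
        motif_idx = np.argmin(mp[0])
        start, end = motif_idx, motif_idx + m

        motif_segment = signal[:, start:end]   # (3, m), raw (not z-normalized) amplitudes
        motif_shapes.append(motif_segment)
        motif_dists.append(mp[0, motif_idx])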

    motif_shapes = np.array(motif_shapes)
    motifs_flat = motif_shapes.reshape(N, C * m)

    # Save motifs in case needed later
    np.save(f"{save_dir}/motifs.npy", motif_shapes)
    np.save(f"{save_dir}/motif_distances.npy", motif_dists)


    # ================================================================
    # 2B. K-Means ELBOW CURVE
    # ================================================================
    sse = []
    for k in tqdm(K_RANGE, desc=f"Elbow m={m}"):
        # n_init is set explicitly because its default differs across scikit-learn versions
        km = KMeans(n_clusters=k, n_init=10, random_state=42)
        km.fit(motifs_flat)
        sse.append(km.inertia_)

    plt.figure(figsize=(7, 5))
    plt.plot(K_RANGE, sse, marker="o")
    plt.xlabel("Number of Clusters (k)")
    plt.ylabel("SSE (Inertia)")
    plt.title(f"Elbow Curve for m={m}")
    plt.grid()
    plt.tight_layout()
    plt.savefig(f"{save_dir}/elbow_m{m}.png")
    plt.close()


    # ================================================================
    # 2C. Cluster with N_CLUSTERS and plot centroids
    # ================================================================
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=42)
    labels = kmeans.fit_predict(motifs_flat)

    np.save(f"{save_dir}/labels.npy", labels)

    # squeeze=False keeps `axes` 2-D even when C == 1 or N_CLUSTERS == 1
    fig, axes = plt.subplots(
        C, N_CLUSTERS, figsize=(5*N_CLUSTERS, 10), squeeze=False
    )

    # Seeded generator so the plotted examples are reproducible
    rng = np.random.default_rng(42)

    for cluster_id in range(N_CLUSTERS):
        centroid = kmeans.cluster_centers_[cluster_id].reshape(C, m)
        cluster_idxs = np.where(labels == cluster_id)[0]

        # Draw the examples once per cluster so every channel shows the same traces
        sample_idxs = (
            rng.choice(cluster_idxs, N_EXAMPLES, replace=False)
            if len(cluster_idxs) > N_EXAMPLES else cluster_idxs
        )

        for ch in range(C):
            ax = axes[ch, cluster_id]
            ax.plot(centroid[ch], linewidth=3, label="Centroid")

            # plot examples
            for idx in sample_idxs:
                ax.plot(motif_shapes[idx][ch], alpha=0.4)

            ax.set_title(f"Cluster {cluster_id}, Channel {ch}")
            ax.set_xlabel("Time")
            ax.set_ylabel("Amp")
            ax.legend()

    plt.tight_layout()
    plt.savefig(f"{save_dir}/clusters_m{m}.png")
    plt.close()

    print(f"Saved: {save_dir}/clusters_m{m}.png")
    print(f"Saved: {save_dir}/elbow_m{m}.png")
    print(f"Motif extraction for m={m} complete.\n")


print("\nALL MOTIF LENGTHS FINISHED SUCCESSFULLY ✔")
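
The extraction cell above clusters the raw motif segments, so K-Means can end up grouping traces by amplitude rather than by the shape that the z-normalized Matrix Profile selected them for. One possible adjustment, sketched here under that assumption and not part of the original run, is to z-normalize each channel of each motif before flattening. The helper name `znorm_per_channel` and the tiny synthetic array are hypothetical.

```python
import numpy as np

def znorm_per_channel(motifs, eps=1e-8):
    """z-normalize each channel of each motif along time.

    motifs: array of shape (N, C, m). Channels with (near-)zero variance
    are divided by eps instead of zero.
    """
    motifs = np.asarray(motifs, dtype=np.float64)
    mu = motifs.mean(axis=-1, keepdims=True)
    sigma = motifs.std(axis=-1, keepdims=True)
    return (motifs - mu) / np.maximum(sigma, eps)

# Two synthetic "motifs" with the same shape but very different amplitude
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 3, 100))
demo = np.concatenate([base, 50.0 * base + 7.0])

# Same flattening as in the extraction cell: (N, C, m) -> (N, C * m)
flat = znorm_per_channel(demo).reshape(len(demo), -1)
print(np.allclose(flat[0], flat[1]))
```

After normalization the two rows coincide, so a distance-based clusterer would treat them as the same shape. In the extraction cell this would correspond to building `motifs_flat` from the normalized array instead of from `motif_shapes` directly.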
