# Market Microstructure Evolution: A Decade of Trading Pattern Analysis
### Analyzing Stock vs ETF Trading Behavior (2012-2024)

**Author:** Emma Nagy  
**Last Updated:** December 2024

---

## Executive Summary

This project examines how equity market microstructure has evolved over the past decade using SEC MIDAS (Market Information Data Analytics System) data spanning 2012Q1 to 2024Q1. The analysis compares individual stocks and ETFs on odd-lot trading, hidden liquidity, order cancellation, and overall trading activity.

**Key Findings:**

- **Odd-lot trading increased substantially** in both stocks and ETFs, with stocks showing a more pronounced rise, suggesting changes in retail participation and algorithmic order-splitting strategies.

- **Hidden liquidity declined as a percentage of total volume**, indicating markets became more transparent over the analysis period despite concerns about dark pool proliferation.

- **Trading efficiency improved significantly**, with median cancel-to-trade ratios declining by 47% for stocks and 63% for ETFs, reflecting more sophisticated order placement and better market quality.

- **Market activity expanded dramatically**, with trade counts growing over 800% for stocks and over 1,000% for ETFs, though stocks maintain substantially higher absolute activity levels.

- **Order flow patterns differ fundamentally** between stocks and ETFs, with ETFs showing block-oriented institutional flows while stocks exhibit more fragmented trading consistent with price discovery functions.

This analysis provides insights relevant to market structure research, algorithmic trading strategy development, and regulatory policy evaluation.

## Project Motivation

Understanding market microstructure is fundamental to modern finance, yet the landscape has undergone dramatic transformation in the past decade. The proliferation of algorithmic trading, regulatory changes like Reg NMS amendments, and the explosive growth of ETFs have fundamentally altered how securities trade.

This project was motivated by several questions:
1. How have trading patterns evolved differently for stocks versus ETFs?
2. What role has algorithmic trading played in changing market microstructure?
3. Have markets become more or less transparent over time?

By leveraging the comprehensive SEC MIDAS dataset—which captures granular order and trade-level metrics across more than 5,000 securities—this analysis provides empirical evidence of these structural changes.

## Data Overview

**Data Source:** SEC MIDAS (Market Information Data Analytics System)

**Time Period:** 2012Q1 – 2024Q1 (49 quarters)

**Coverage:** 5,000+ securities including stocks and ETPs/ETFs

**Metrics Analyzed:**
- **Trade counts:** Total trades, lit trades, odd-lot trades, hidden trades
- **Volume metrics:** Order volume, trade volume, lit volume, odd-lot volume, hidden volume
- **Derived ratios:** Cancel-to-trade ratio, trade-to-order-volume ratio, hidden rate, odd-lot rate

**Methodology Highlights:**
- Fixed cohort sampling (100 stocks and 100 ETFs from 2012Q1) to control for composition effects
- Winsorization at 1%/99% within each quarter to limit the influence of outliers while retaining every observation
- Descriptive statistics computed at p25, p50, p75, mean, and standard deviation
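
The winsorization step listed above can be illustrated on a toy series. This is a minimal sketch with made-up numbers, not the notebook's own data: values beyond the 1st and 99th percentiles are capped rather than dropped, so the sample size is unchanged.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 198 well-behaved values plus two extreme outliers.
rng = np.random.default_rng(0)
values = pd.Series(np.concatenate([rng.normal(100, 10, 198), [5_000.0, -3_000.0]]))

# Winsorize: cap values at the 1st and 99th percentiles instead of removing them.
lo, hi = values.quantile(0.01), values.quantile(0.99)
winsorized = values.clip(lower=lo, upper=hi)

print(len(values), len(winsorized))     # sample size is unchanged
print(values.max(), winsorized.max())   # the extreme maximum is pulled in to the cap
```

The notebook applies the same idea separately within each quarter, so the caps move with the cross-section rather than being fixed over time.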

## Environment Setup

In [26]:
import sys
import os
import glob
import math
import random
import re
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

print(f'Environment initialized — pandas {pd.__version__} | numpy {np.__version__}')

Environment initialized — pandas 2.2.3 | numpy 2.1.3


## Project Configuration

In [27]:
# Directory paths
DATA_DIR = "./data/midas_by_security"  # Path to MIDAS quarterly CSV files
OUT_FIG_DIR = "./outputs/figures"      # Output directory for visualizations
OUT_TAB_DIR = "./outputs/tables"       # Output directory for summary tables

os.makedirs(OUT_FIG_DIR, exist_ok=True)
os.makedirs(OUT_TAB_DIR, exist_ok=True)

# Reproducibility
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# Variables to analyze
RAW_VARS = [
    "Cancels", "Trades", "LitTrades", "OddLots", "Hidden", "TradesForHidden",
    "OrderVol", "TradeVol", "LitVol", "OddLotVol", "HiddenVol", "TradeVolForHidden"
]

## Core Functions

This section defines utility functions for data loading, cleaning, and statistical computation.

In [28]:
def sdiv(num, den):
    """
    Safe division that returns NaN when denominator is zero.
    Prevents division-by-zero warnings and handles undefined ratios appropriately.
    """
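    # Broadcast first so a missing input (None -> NaN scalar) yields an all-NaN result
    num, den = np.broadcast_arrays(
        np.asarray(num, dtype=float), np.asarray(den, dtype=float)
    )
    # Start from NaN so positions with a zero denominator stay undefined
    out = np.full(num.shape, np.nan, dtype=float)
    np.divide(num, den, out=out, where=den != 0)
    return out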


def quarter_from_filename(path: str):
    """
    Extract quarter label from filename.
    Handles various formats: '2016Q1', '2016_Q1', 'Q1_2016', 'q1_2016_all', etc.
    """
    name = os.path.basename(path)
    s = name.replace("-", "_").replace(" ", "_")
    
    # Try multiple regex patterns
    m = (re.search(r'(?i)(19|20)\d{2}\s*_?\s*Q\s*([1-4])', s) or
         re.search(r'(?i)Q\s*([1-4])\s*_?\s*(19|20)\d{2}', s))
    
    if m:
        y = re.search(r'(19|20)\d{2}', m.group(0)).group(0)
        q = re.search(r'Q\s*([1-4])', m.group(0), re.I).group(1)
        return f"{y}Q{q}"
    return None


def load_all_quarters(data_dir):
    """
    Load and combine all SEC MIDAS quarterly files from 2012Q1 to 2024Q1.
    
    Steps:
    1. Locate all CSV files in the specified directory
    2. Read each file, clean column names (remove spaces/hyphens), standardize ticker/security fields
    3. Assign quarter labels from filenames
    4. Concatenate into single DataFrame and remove duplicates
    
    Returns:
        pd.DataFrame: Combined dataset with standardized columns
    """
    files = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    if not files:
        raise FileNotFoundError(f"No CSV files found in {data_dir}")
    
    print(f"Found {len(files)} quarterly files")
    frames = []
    
    for fpath in files:
        qtr = quarter_from_filename(fpath)
        if not qtr:
            print(f"Warning: Could not parse quarter from {os.path.basename(fpath)}, skipping")
            continue
        
        df = pd.read_csv(fpath, low_memory=False)
        
        # Normalize column names: strip whitespace, remove spaces and hyphens
        df.columns = [c.strip().replace(" ", "").replace("-", "") for c in df.columns]
        
        # Standardize ticker/symbol column
        tick_col = next((c for c in df.columns if c.lower() in ["ticker", "symbol"]), None)
        if tick_col:
            df.rename(columns={tick_col: "Ticker"}, inplace=True)
        
        # Standardize security type column
        sec_col = next((c for c in df.columns if c.lower() in ["security", "sectype", "assettype", "securitytype"]), None)
        if sec_col:
            df.rename(columns={sec_col: "Security"}, inplace=True)
        else:
            df["Security"] = "Unknown"
        
        # Set Date column from quarter
        df["Date"] = qtr
        
        frames.append(df)
    
    combined = pd.concat(frames, ignore_index=True, sort=False)
    
    # Remove duplicates
    if "Ticker" in combined.columns and "Date" in combined.columns:
        combined.drop_duplicates(subset=["Ticker", "Date"], keep="last", inplace=True)
    
    print(f"Loaded {len(combined):,} total records")
    return combined


def classify_asset_type(security_text):
    """
    Classify securities as 'Stock' or 'ETF' based on Security column.
    """
    s = str(security_text).strip().upper()
    if any(k in s for k in ["ETF", "ETP", "EXCHANGE TRADED", "TRUST", "FUND"]):
        return "ETF"
    return "Stock"


def compute_derived_metrics(df):
    """
    Calculate derived microstructure metrics from raw variables.
    
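    Metrics computed (ratios as implemented below):
    - Cancel_to_Trade: Cancels / Trades
    - Trade_to_OrderVol: TradeVol / OrderVol
    - HiddenRate: TradesForHidden / Trades, in percent
    - HiddenVolumePct: TradeVolForHidden / TradeVol, in percent
    - OddlotRate: OddLots / Trades, in percent
    - OddlotVolumePct: OddLotVol / TradeVol, in percent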
    """
    df = df.copy()
    
    # Convert numeric columns
    for c in RAW_VARS:
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors="coerce")
    
    # Compute derived metrics using .get() for safety
    df["Cancel_to_Trade"] = sdiv(df.get("Cancels"), df.get("Trades"))
    df["Trade_to_OrderVol"] = sdiv(df.get("TradeVol"), df.get("OrderVol"))
    df["HiddenRate"] = sdiv(df.get("TradesForHidden"), df.get("Trades")) * 100.0
    df["HiddenVolumePct"] = sdiv(df.get("TradeVolForHidden"), df.get("TradeVol")) * 100.0
    df["OddlotRate"] = sdiv(df.get("OddLots"), df.get("Trades")) * 100.0
    df["OddlotVolumePct"] = sdiv(df.get("OddLotVol"), df.get("TradeVol")) * 100.0
    
    return df


def winsorize_by_group(df, cols, lower=0.01, upper=0.99, groupby="Date"):
    """
    Winsorize specified columns by group to handle outliers.
    
    Winsorization caps extreme values at specified percentiles rather than
    removing them, preserving sample size while reducing outlier impact.
    """
    df = df.copy()
    
    for col in cols:
        if col not in df.columns:
            continue
        
        def wins(x):
            lo = x.quantile(lower)
            hi = x.quantile(upper)
            return x.clip(lower=lo, upper=hi)
        
        df[col] = df.groupby(groupby)[col].transform(wins)
    
    return df


def compute_descriptive_stats(df, metric_col, groupby_cols):
    """
    Compute descriptive statistics (mean, p25, p50, p75, std) for a metric.
    """
    stats = df.groupby(groupby_cols)[metric_col].agg([
        ('mean', 'mean'),
        ('p25', lambda x: x.quantile(0.25)),
        ('p50', lambda x: x.quantile(0.50)),
        ('p75', lambda x: x.quantile(0.75)),
        ('std', 'std')
    ]).reset_index()
    
    return stats

print("Core functions loaded successfully")

Core functions loaded successfully
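
For readers who prefer to stay within pandas, the safe-division idea can also be sketched as below. `safe_ratio` is a hypothetical helper introduced for illustration, not part of the notebook; unlike a NumPy-array result it keeps the Series index, which can help when frames are later re-sorted or filtered.

```python
import pandas as pd

def safe_ratio(num: pd.Series, den: pd.Series) -> pd.Series:
    """Divide two aligned Series, yielding NaN wherever the denominator is zero."""
    den = den.astype(float)
    # Series.where keeps values where the condition holds and inserts NaN elsewhere,
    # so zero denominators become NaN before the division happens.
    return num.astype(float) / den.where(den != 0)

# Toy frame with hypothetical tickers and counts.
toy = pd.DataFrame(
    {"Cancels": [120, 50, 0], "Trades": [10, 0, 0]},
    index=["AAA", "BBB", "CCC"],
)
toy["Cancel_to_Trade"] = safe_ratio(toy["Cancels"], toy["Trades"])
print(toy)
```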


## Data Loading and Preparation

In [29]:
# Load all quarterly data
df_all = load_all_quarters(DATA_DIR)

# Rename columns with ('000) suffix
rename_map = {
    "OrderVol('000)": "OrderVol",
    "LitVol('000)": "LitVol",
    "HiddenVol('000)": "HiddenVol",
    "TradeVolForHidden('000)": "TradeVolForHidden",
    "OddLotVol('000)": "OddLotVol",
    "TradeVolForOddLots('000)": "TradeVolForOddLots"
}
df_all.rename(columns=rename_map, inplace=True)

# Create TradeVol from component volumes
df_all["TradeVol"] = df_all["LitVol"] + df_all["HiddenVol"] + df_all["OddLotVol"]

# Create Trades from count components  
df_all["Trades"] = df_all["LitTrades"] + df_all["OddLots"] + df_all["TradesForOddLots"]

# Classify asset types
df_all["AssetType"] = df_all["Security"].map(classify_asset_type)

# Compute derived metrics
df_all = compute_derived_metrics(df_all)

# Apply winsorization to handle outliers
all_metrics = RAW_VARS + ["Cancel_to_Trade", "Trade_to_OrderVol", "HiddenRate", 
                          "HiddenVolumePct", "OddlotRate", "OddlotVolumePct"]
# Only winsorize columns that exist
cols_to_winsorize = [c for c in all_metrics if c in df_all.columns]
df_all = winsorize_by_group(df_all, cols_to_winsorize, lower=0.01, upper=0.99, groupby="Date")

print(f"\nData prepared: {len(df_all):,} records")
print("Asset type breakdown:")
print(df_all["AssetType"].value_counts())
print(f"\nDate range: {df_all['Date'].min()} to {df_all['Date'].max()}")

Found 49 quarterly files
Loaded 289,170 total records

Data prepared: 289,170 records
Asset type breakdown:
AssetType
Stock    191394
ETF       97776
Name: count, dtype: int64

Date range: 2012Q1 to 2024Q1
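
The minimum and maximum dates alone do not show whether any quarter in between is missing. A small check along these lines could confirm full coverage; `missing_quarters` is a hypothetical helper, and the toy input below deliberately omits one quarter.

```python
import pandas as pd

def missing_quarters(labels, start="2012Q1", end="2024Q1"):
    """Return the expected quarter labels that are absent from `labels`."""
    expected = pd.period_range(start=start, end=end, freq="Q").astype(str)
    return sorted(set(expected) - set(labels))

# Toy input: the full run of quarter labels with 2016Q3 removed.
all_quarters = list(pd.period_range("2012Q1", "2024Q1", freq="Q").astype(str))
loaded = [q for q in all_quarters if q != "2016Q3"]

print(len(all_quarters))           # 49
print(missing_quarters(loaded))    # ['2016Q3']
```

On the real data, the call would presumably be `missing_quarters(df_all["Date"].unique())`, with an empty list indicating that all 49 quarters are present.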


## Cohort Sampling Methodology

To control for composition effects (IPOs, delistings, sector rotations), this analysis uses **fixed cohorts** sampled from 2012Q1:

- **100 randomly selected stocks** from 2012Q1
- **100 randomly selected ETFs** from 2012Q1

These same securities are tracked across all subsequent quarters. This approach ensures that observed trends reflect changes in *trading behavior* rather than changes in *which securities are being analyzed*.

**Trade-off:** This introduces survivorship bias (securities that delist or are acquired disappear from later quarters), but it provides cleaner signal for behavioral trends.

In [30]:
# Sample cohorts from 2012Q1
df_2012q1 = df_all[df_all["Date"] == "2012Q1"].copy()

# Sample 100 stocks
stocks_2012q1 = df_2012q1[df_2012q1["AssetType"] == "Stock"]["Ticker"].unique()
if len(stocks_2012q1) >= 100:
    sampled_stocks = np.random.choice(stocks_2012q1, size=100, replace=False)
else:
    print(f"Warning: Only {len(stocks_2012q1)} stocks available in 2012Q1")
    sampled_stocks = stocks_2012q1

# Sample 100 ETFs
etfs_2012q1 = df_2012q1[df_2012q1["AssetType"] == "ETF"]["Ticker"].unique()
if len(etfs_2012q1) >= 100:
    sampled_etfs = np.random.choice(etfs_2012q1, size=100, replace=False)
else:
    print(f"Warning: Only {len(etfs_2012q1)} ETFs available in 2012Q1")
    sampled_etfs = etfs_2012q1

print("\nCohort created:")
print(f"  Stocks: {len(sampled_stocks)}")
print(f"  ETFs: {len(sampled_etfs)}")

# Filter full dataset to cohort members only
cohort_tickers = list(sampled_stocks) + list(sampled_etfs)
df_cohort = df_all[df_all["Ticker"].isin(cohort_tickers)].copy()

print(f"\nCohort dataset: {len(df_cohort):,} records")


Cohort created:
  Stocks: 100
  ETFs: 100

Cohort dataset: 6,994 records
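
A fully surviving cohort would contribute 200 × 49 = 9,800 records, so the 6,994 observed imply attrition over time. One way to quantify the survivorship trade-off noted above is to count distinct cohort tickers per quarter. The sketch below uses a toy panel with hypothetical tickers; the column names follow the notebook.

```python
import pandas as pd

# Toy cohort panel (hypothetical tickers): 'CCC' drops out after 2012Q2.
panel = pd.DataFrame({
    "Date":      ["2012Q1", "2012Q1", "2012Q1", "2012Q2", "2012Q2", "2012Q2", "2012Q3", "2012Q3"],
    "Ticker":    ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC", "AAA", "BBB"],
    "AssetType": ["Stock", "ETF", "Stock", "Stock", "ETF", "Stock", "Stock", "ETF"],
})

# Count distinct surviving tickers per quarter and asset type.
survivors = (
    panel.groupby(["Date", "AssetType"])["Ticker"]
         .nunique()
         .unstack("AssetType")
)

# Express each quarter relative to the first quarter's cohort size.
retention = survivors / survivors.iloc[0]
print(retention)
```

Applied to `df_cohort`, a retention table like this would show how many of the original 100 stocks and 100 ETFs still contribute to each quarter's statistics.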


## Computing Descriptive Statistics

For each metric, quarter, and asset type, we calculate:
- Mean
- 25th percentile (p25)
- Median (p50)
- 75th percentile (p75)
- Standard deviation

In [31]:
# Prepare data in long format for statistical analysis
metrics_to_analyze = RAW_VARS + ["Cancel_to_Trade", "Trade_to_OrderVol", "HiddenRate",
                                  "HiddenVolumePct", "OddlotRate", "OddlotVolumePct"]

long_list = []
for metric in metrics_to_analyze:
    if metric not in df_cohort.columns:
        continue
    temp = df_cohort[["Date", "AssetType", metric]].copy()
    temp = temp.dropna(subset=[metric])
    temp["metric"] = metric
    temp.rename(columns={metric: "value"}, inplace=True)
    long_list.append(temp)

df_long = pd.concat(long_list, ignore_index=True)

# Compute descriptive statistics
desc_cohort = compute_descriptive_stats(df_long, "value", ["Date", "AssetType", "metric"])

# Save results
output_path = os.path.join(OUT_TAB_DIR, "cohort_descriptive_stats.csv")
desc_cohort.to_csv(output_path, index=False)
print(f"\nDescriptive statistics saved to: {output_path}")

# Preview
print("\nSample of descriptive statistics:")
desc_cohort.head(10)


Descriptive statistics saved to: ./outputs/tables\cohort_descriptive_stats.csv

Sample of descriptive statistics:


| | Date | AssetType | metric | mean | p25 | p50 | p75 | std |
|---|---|---|---|---|---|---|---|---|
| 0 | 2012Q1 | ETF | Cancel_to_Trade | 814.332724 | 71.181623 | 260.181046 | 749.0625 | 1412.753665 |
| 1 | 2012Q1 | ETF | Cancels | 99389.63 | 5437.25 | 13053.0 | 61649.25 | 192677.99911 |
| 2 | 2012Q1 | ETF | Hidden | 106.41 | 0.0 | 2.5 | 43.75 | 375.594696 |
| 3 | 2012Q1 | ETF | HiddenRate | 48.337921 | 45.629673 | 47.961621 | 50.0 | 7.475793 |
| 4 | 2012Q1 | ETF | HiddenVol | 34.676137 | 0.0 | 0.4 | 8.20075 | 121.28975 |
| 5 | 2012Q1 | ETF | HiddenVolumePct | 91.851636 | 93.491102 | 96.287471 | 98.863256 | 13.533901 |
| 6 | 2012Q1 | ETF | LitTrades | 1205.3 | 2.0 | 16.0 | 262.75 | 4045.888125 |
| 7 | 2012Q1 | ETF | LitVol | 392.38015 | 0.2435 | 3.715 | 59.73 | 1300.690135 |
| 8 | 2012Q1 | ETF | OddLotVol | 6.46036 | 0.00175 | 0.1285 | 2.545 | 20.218699 |
| 9 | 2012Q1 | ETF | OddLots | 150.15 | 1.0 | 4.0 | 53.5 | 454.220708 |
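
The wide-to-long reshape and the five summary statistics can also be written with `DataFrame.melt` and named aggregation. The following is a sketch on a toy frame with hypothetical values, offered as an alternative formulation rather than the notebook's own code; `q25` and `q75` are helper names introduced here.

```python
import pandas as pd

def q25(x):
    """25th percentile of a Series."""
    return x.quantile(0.25)

def q75(x):
    """75th percentile of a Series."""
    return x.quantile(0.75)

# Toy wide frame with two metrics (hypothetical values, one missing).
wide = pd.DataFrame({
    "Date":      ["2012Q1"] * 4,
    "AssetType": ["Stock", "Stock", "ETF", "ETF"],
    "Trades":    [100.0, 300.0, 10.0, 30.0],
    "Cancels":   [1000.0, None, 500.0, 700.0],
})

# melt stacks the metric columns into (metric, value) pairs; dropna removes missing values.
long_df = (
    wide.melt(id_vars=["Date", "AssetType"], var_name="metric", value_name="value")
        .dropna(subset=["value"])
)

# Named aggregation gives each statistic an explicit output column.
stats = (
    long_df.groupby(["Date", "AssetType", "metric"])["value"]
           .agg(mean="mean", p25=q25, p50="median", p75=q75, std="std")
           .reset_index()
)
print(stats)
```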


## Visualization Functions

Functions to create interactive time series plots showing the evolution of trading metrics.

In [32]:
def create_comparison_plot(desc_df, metric_name, title=None, save_path=None):
    """
    Plot Stock and ETF series for a given metric on shared axes.
    Shows median (p50) with shaded IQR (p25-p75) region.
    """
    subset = desc_df[desc_df["metric"] == metric_name].copy()
    
    if subset.empty:
        print(f"No data available for metric: {metric_name}")
        return None
    
    # Add sortable dates
    subset["DateSort"] = pd.PeriodIndex(subset["Date"], freq="Q").to_timestamp()
    subset = subset.sort_values("DateSort")
    
    fig = go.Figure()
    
    colors = {"Stock": "#1f77b4", "ETF": "#ff7f0e"}
    
    for asset_type in ["Stock", "ETF"]:
        data = subset[subset["AssetType"] == asset_type].sort_values("DateSort")
        
        if data.empty:
            continue
        
        color = colors[asset_type]
        
        # Add IQR shading
        fig.add_trace(go.Scatter(
            x=data["DateSort"],
            y=data["p75"],
            mode='lines',
            line=dict(width=0),
            showlegend=False,
            hoverinfo='skip'
        ))
        
        fig.add_trace(go.Scatter(
            x=data["DateSort"],
            y=data["p25"],
            mode='lines',
            line=dict(width=0),
            # colors are "#rrggbb" hex strings; convert to an rgba() fill at 15% opacity
            fillcolor=f'rgba({int(color[1:3], 16)}, {int(color[3:5], 16)}, {int(color[5:7], 16)}, 0.15)',
            fill='tonexty',
            name=f"{asset_type} IQR",
            showlegend=False,
            hoverinfo='skip'
        ))
        
        # Add median line
        fig.add_trace(go.Scatter(
            x=data["DateSort"],
            y=data["p50"],
            mode='lines+markers',
            name=f"{asset_type}",
            line=dict(color=color, width=2),
            marker=dict(size=4)
        ))
    
    plot_title = title or f"{metric_name}: Stocks vs ETFs Over Time"
    
    fig.update_layout(
        title=dict(text=plot_title, font=dict(size=16)),
        xaxis_title="Quarter",
        yaxis_title=metric_name,
        hovermode="x unified",
        template="plotly_white",
        height=500,
        legend=dict(x=0.02, y=0.98, bgcolor="rgba(255,255,255,0.8)")
    )
    
    if save_path:
        fig.write_html(save_path)
        print(f"Saved: {save_path}")
    
    return fig


def create_multi_metric_plot(desc_df, metrics, asset_type, title=None, save_path=None):
    """
    Create a plot showing multiple metrics for a single asset type.
    Useful for comparing related metrics like different volume measures.
    """
    fig = go.Figure()
    
    for metric in metrics:
        subset = desc_df[(desc_df["metric"] == metric) & (desc_df["AssetType"] == asset_type)].copy()
        
        if subset.empty:
            continue
        
        subset["DateSort"] = pd.PeriodIndex(subset["Date"], freq="Q").to_timestamp()
        subset = subset.sort_values("DateSort")
        
        fig.add_trace(go.Scatter(
            x=subset["DateSort"],
            y=subset["p50"],
            mode='lines+markers',
            name=metric,
            marker=dict(size=4)
        ))
    
    plot_title = title or f"Multiple Metrics Comparison - {asset_type}"
    
    fig.update_layout(
        title=dict(text=plot_title, font=dict(size=16)),
        xaxis_title="Quarter",
        yaxis_title="Median Value",
        hovermode="x unified",
        template="plotly_white",
        height=500,
        legend=dict(x=0.02, y=0.98, bgcolor="rgba(255,255,255,0.8)")
    )
    
    if save_path:
        fig.write_html(save_path)
        print(f"Saved: {save_path}")
    
    return fig



def line_band_plot(desc_long, title, outfile, out_dir=None):
    """
    Create time series plots for ALL metrics in the dataset.
    
    For each metric and each asset type, creates a separate plot showing:
    - Median (p50) line
    - Shaded IQR region (p25-p75)
    - Mean with ±1 std error bars
    
    This is the COMPREHENSIVE plotting function that generates individual plots
    for every single metric, allowing detailed examination of each trend.
    """
    out_dir = out_dir or OUT_FIG_DIR
    os.makedirs(out_dir, exist_ok=True)

    # Ensure sortable dates
    desc_long = desc_long.copy()
    desc_long["DateSort"] = pd.PeriodIndex(desc_long["Date"], freq="Q").to_timestamp()

    for metric_name, dm in desc_long.groupby("metric"):
        wide = (
            dm.pivot_table(index=["Date","AssetType"], values=["mean","p25","p50","p75","std"])
              .reset_index()
        )
        wide["DateSort"] = pd.PeriodIndex(wide["Date"], freq="Q").to_timestamp()

        for atype, d in wide.groupby("AssetType"):
            d = d.sort_values("DateSort")

            fig = go.Figure()

            # IQR shaded band
            fig.add_trace(go.Scatter(
                x=d["DateSort"], y=d["p75"], mode="lines",
                line=dict(width=0), showlegend=False, hoverinfo="skip"
            ))
            fig.add_trace(go.Scatter(
                x=d["DateSort"], y=d["p25"], mode="lines", fill="tonexty",
                line=dict(width=0), fillcolor="rgba(0,0,0,0.10)",
                showlegend=False, hoverinfo="skip"
            ))

            # IQR edges (hoverable)
            fig.add_trace(go.Scatter(
                x=d["DateSort"], y=d["p75"], mode="lines",
                line=dict(width=1, dash="dot"), name="p75", showlegend=True,
                hovertemplate="p75: %{y:.2f}<extra></extra>"
            ))
            fig.add_trace(go.Scatter(
                x=d["DateSort"], y=d["p25"], mode="lines",
                line=dict(width=1, dash="dot"), name="p25", showlegend=True,
                hovertemplate="p25: %{y:.2f}<extra></extra>"
            ))

            # Median line
            fig.add_trace(go.Scatter(
                x=d["DateSort"], y=d["p50"], mode="lines", name="median"
            ))

            # Mean with ±1 std error bars
            fig.add_trace(go.Scatter(
                x=d["DateSort"], y=d["mean"],
                mode="markers", name="mean (±1 std)",
                error_y=dict(type="data", array=d["std"].fillna(0), visible=True)
            ))

            fig.update_layout(
                title=f"{title}: {metric_name} — {atype}",
                xaxis_title="Quarter",
                yaxis_title=metric_name,
                hovermode="x unified",
                template="plotly_white",
                height=500
            )

            html_path = os.path.join(out_dir, f"{outfile}_{metric_name}_{atype}.html")
            fig.write_html(html_path)

    print(f"Saved HTML figures to {out_dir}")

print("Visualization functions loaded")

Visualization functions loaded


## Comprehensive Analysis: All Metrics

This section generates the complete set of time series plots that form the foundation of our analysis. For each of the 18 metrics (12 raw + 6 derived), we create separate plots for Stocks and ETFs, showing:

- **Median (p50)** - the central tendency
- **IQR band (p25-p75)** - the middle 50% of the distribution  
- **Mean with ±1 std** - showing average and variability

**This creates 36 individual HTML files** that allow detailed examination of each metric's evolution over time. These plots were used to observe trends and formulate the key findings presented later.

This is the *exploratory analysis* - looking at all the data to identify patterns.

In [33]:
# Generate Comprehensive Plot Set (36 plots)
# Create subdirectory for comprehensive plots
comprehensive_dir = os.path.join(OUT_FIG_DIR, "comprehensive")
os.makedirs(comprehensive_dir, exist_ok=True)

line_band_plot(
    desc_cohort,
    title="Cohort Descriptive Stats (Winsorized 1–99%)",
    outfile="cohort_desc",
    out_dir=comprehensive_dir
)

print("\n✅ All 36 comprehensive plots saved!")
print(f"   Location: {comprehensive_dir}")

Saved HTML figures to ./outputs/figures\comprehensive

✅ All 36 comprehensive plots saved!
   Location: ./outputs/figures\comprehensive


---

## Key Visualizations

Based on comprehensive analysis of 18 metrics across 49 quarters, the following five visualizations present the most notable patterns observed in the data:

- **Odd-Lot Volume Percentage** - Shows odd-lot volume as a percentage of total trade volume over time for stocks and ETFs
- **Hidden Volume Percentage** - Displays hidden (non-displayed) volume as a percentage of total trade volume over time
- **Cancel-to-Trade Ratio** - Presents the ratio of cancelled orders to executed trades over time
- **Market Activity Metrics** - Multi-panel comparison of trade counts, trade volume, and order volume trends
- **Lit Trades** - Shows the number of trades executed on displayed (lit) venues over time

These visualizations present Stock vs ETF comparisons using median values with interquartile ranges to show the evolution of trading patterns from 2012Q1 to 2024Q1.

In [34]:
# Visualization 1: Odd-Lot Volume as Percentage of Total Trade Volume
fig1 = create_comparison_plot(
    desc_cohort,
    "OddlotVolumePct",
    title="Odd-Lot Volume as Percentage of Total Trade Volume",
    save_path=os.path.join(OUT_FIG_DIR, "key_viz_1_oddlot_volume_pct.html")
)

if fig1:
    fig1.show()

Saved: ./outputs/figures\key_viz_1_oddlot_volume_pct.html


In [35]:
# Visualization 2: Hidden Volume as Percentage of Total Trade Volume
fig2 = create_comparison_plot(
    desc_cohort,
    "HiddenVolumePct",
    title="Hidden Volume as Percentage of Total Trade Volume",
    save_path=os.path.join(OUT_FIG_DIR, "key_viz_2_hidden_volume_pct.html")
)

if fig2:
    fig2.show()

Saved: ./outputs/figures\key_viz_2_hidden_volume_pct.html


In [36]:
# Visualization 3: Cancel-to-Trade Ratio Over Time
fig3 = create_comparison_plot(
    desc_cohort,
    "Cancel_to_Trade",
    title="Cancel-to-Trade Ratio",
    save_path=os.path.join(OUT_FIG_DIR, "key_viz_3_cancel_to_trade.html")
)

if fig3:
    fig3.show()

Saved: ./outputs/figures\key_viz_3_cancel_to_trade.html


In [37]:
# Visualization 4: Trade Counts, Trade Volume, and Order Volume Comparison

metrics_to_compare = ['Trades', 'TradeVol', 'OrderVol']
fig4 = make_subplots(
    rows=3, cols=1,
    subplot_titles=[
        'Trade Counts (Median per Security)',
        'Trade Volume (Median per Security)', 
        'Order Volume (Median per Security)'
    ],
    vertical_spacing=0.12,
    specs=[[{"secondary_y": False}], [{"secondary_y": False}], [{"secondary_y": False}]]
)

colors = {"Stock": "#1f77b4", "ETF": "#ff7f0e"}

for idx, metric in enumerate(metrics_to_compare, 1):
    subset = desc_cohort[desc_cohort["metric"] == metric].copy()
    subset["DateSort"] = pd.PeriodIndex(subset["Date"], freq="Q").to_timestamp()
    subset = subset.sort_values("DateSort")
    
    for asset_type in ["Stock", "ETF"]:
        data = subset[subset["AssetType"] == asset_type].sort_values("DateSort")
        
        if not data.empty:
            fig4.add_trace(
                go.Scatter(
                    x=data["DateSort"],
                    y=data["p50"],
                    mode='lines+markers',
                    name=f"{asset_type}",
                    line=dict(color=colors[asset_type], width=2),
                    marker=dict(size=3),
                    legendgroup=asset_type,
                    showlegend=(idx == 1)
                ),
                row=idx, col=1
            )

fig4.update_xaxes(title_text="Quarter", row=3, col=1)
fig4.update_yaxes(title_text="Trade Count", row=1, col=1)
fig4.update_yaxes(title_text="Trade Volume", row=2, col=1)
fig4.update_yaxes(title_text="Order Volume", row=3, col=1)

fig4.update_layout(
    title=dict(text="Market Activity Metrics: Trade Counts, Volume, and Order Flow", font=dict(size=16)),
    height=900,
    hovermode="x unified",
    template="plotly_white",
    legend=dict(x=0.02, y=0.98, bgcolor="rgba(255,255,255,0.8)")
)

fig4.write_html(os.path.join(OUT_FIG_DIR, "key_viz_4_market_activity.html"))
fig4.show()



In [38]:
# Visualization 5: Lit Trades (Displayed Venue Activity)
fig5 = create_comparison_plot(
    desc_cohort,
    "LitTrades",
    title="Lit Trades: Activity on Displayed Venues",
    save_path=os.path.join(OUT_FIG_DIR, "key_viz_5_lit_trades.html")
)

if fig5:
    fig5.show()

Saved: ./outputs/figures\key_viz_5_lit_trades.html


### Visualization Summary

The five key visualizations reveal important patterns in market microstructure evolution. This section explains what each pattern shows, why it's significant, and what it means in practical terms.

---

#### 1. Odd-Lot Volume Percentage: The Retail Trading Revolution

**What the data shows:** Odd-lot volume (trades <100 shares) grew from 6% to 29% of total trade volume for stocks and from 2% to 10% for ETFs over the 12-year period. The stock-ETF gap widened dramatically, reaching 19 percentage points by 2024.

**Why this pattern emerged:** This reflects two concurrent phenomena that transformed equity markets:
- **Retail democratization**: The spread of commission-free trading (Robinhood, founded in 2013; major brokers from 2019) eliminated the economic penalty for small trades, enabling mass retail participation
- **Algorithmic sophistication**: Institutional traders increasingly fragment large orders into sub-100 share lots to minimize market impact and avoid detection

**What it means in practice:** 
- Market makers must now accommodate both genuine retail flow and sophisticated algorithmic order-splitting
- The traditional "round lot" (100 shares) is no longer the dominant unit of trading
- Price discovery increasingly happens through aggregation of many small trades rather than fewer large ones
- The stock-ETF divergence suggests retail traders prefer individual equities (stock picking) while ETF odd-lots primarily reflect algorithmic arbitrage strategies

**Why we selected this visualization:** Odd-lot trading showed the most dramatic behavioral shift (+381% for stocks), representing a fundamental change in who participates in markets and how they trade.

---

#### 2. Hidden Volume Percentage: The Transparency Renaissance

**What the data shows:** Hidden volume (non-displayed orders) declined from 87% to 71% of total trade volume for stocks and from 96% to 90% for ETFs. Both asset types show sustained downward trends across the entire analysis period.

**Why this pattern is counterintuitive:** Popular narratives about "dark pools taking over" and increasing market opacity dominated financial media throughout this period. The data tells the opposite story—markets actually became more transparent.

**What it means in practice:**
- More trading now occurs on displayed venues where prices are visible to all market participants
- Regulatory pressure (SEC focus on transparency, best execution requirements) appears to have succeeded
- Competition among trading venues drove improvements in displayed market quality
- The decline accelerated post-2018, coinciding with regulatory scrutiny of dark pool practices

**Important nuance:** While the *percentage* of hidden volume declined, *absolute* hidden volume still grew because overall market activity expanded dramatically. However, the relative shift toward transparency is significant—if hidden trading had maintained its 2012 market share, it would be far more dominant today.
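
The nuance above can be made explicit with a short derivation. The symbols are introduced here for illustration and do not appear in the notebook: let $s_t$ be the hidden share of trade volume and $V_t$ the total trade volume in period $t$, so that absolute hidden volume is $H_t = s_t V_t$.

$$
\frac{H_1}{H_0} = \frac{s_1}{s_0}\cdot\frac{V_1}{V_0} > 1
\quad\Longleftrightarrow\quad
\frac{V_1}{V_0} > \frac{s_0}{s_1}
$$

Treating the stock medians quoted above as if they were aggregate shares (a simplification), $s_0 = 0.87$ and $s_1 = 0.71$ give a threshold of $0.87 / 0.71 \approx 1.23$. Absolute hidden volume therefore rises whenever total trade volume grows by more than about 23%.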

**Why we selected this visualization:** This is the most surprising finding in the analysis. It challenges conventional wisdom and demonstrates the importance of empirical analysis over anecdotal narratives.

---

#### 3. Cancel-to-Trade Ratio: The Efficiency Revolution

**What the data shows:** The ratio of cancelled orders to executed trades fell from 12.2 to 6.4 for stocks (-47%) and from 260 to 96 for ETFs (-63%). ETFs started with dramatically higher ratios but showed steeper improvement.

**What this metric measures:** Cancel-to-trade ratio reflects how precisely market participants can predict order execution. High ratios can indicate any of the following:
- Aggressive quote updating (high-frequency market makers constantly repricing)
- Poor execution quality (orders cancel because they miss favorable prices)
- Strategic order placement (testing liquidity with orders intended to cancel)

**Why this pattern emerged:**
- **Better algorithms**: Execution algorithms became more sophisticated at predicting when orders would fill
- **Improved market quality**: Tighter spreads and more stable prices reduced the need for aggressive cancellations
- **Technology improvements**: Faster data feeds and lower latency enabled more precise order placement
- **Regulatory pressure**: Proposals to discourage excessive cancellations (e.g., transaction taxes, which were discussed but not implemented) may have influenced behavior

**What it means in practice:**
- Markets became more efficient at converting expressed interest (orders) into actual trades
- Less "noise" from cancelled orders improves price discovery
- High-frequency market makers refined strategies to avoid excessive quote updates
- Transaction costs likely decreased as market quality improved

**ETF context:** The initially high ETF ratio (260 vs 12 for stocks) reflects arbitrage market-making strategies that constantly update quotes based on underlying basket prices. The 63% improvement shows even these high-frequency strategies became more refined.

**Why we selected this visualization:** This pattern demonstrates measurable improvement in market quality and participant sophistication—markets didn't just grow, they got smarter.

---

#### 4. Market Activity Metrics: Divergent Market Structures

**What the data shows:** The three-panel comparison reveals fundamentally different patterns:
- **Trade counts**: Stocks increased from 1,230 to 11,370 trades per security; ETFs from 39 to 436 trades
- **Trade volume**: Both grew substantially, but stocks maintain higher trade volume per security
- **Order volume**: ETFs show 3x larger order volumes ($27,531 vs $9,255) despite fewer trades

**Why these patterns differ:**

*Stocks = Price Discovery Engines:*
- Many participants with diverse information and motivations
- Fragmented order flow as traders independently arrive at prices
- Smaller average trade sizes reflect individual decision-making
- Activity driven by fundamental analysis, news, earnings, sector rotations

*ETFs = Block Trading Instruments:*
- Dominated by arbitrageurs, market makers, and institutional portfolio flows
- Larger order sizes reflect creation/redemption mechanisms and portfolio rebalancing
- Fewer trades but each represents significant economic activity
- Activity driven by portfolio strategy, hedging, and arbitrage opportunities

**What it means in practice:**
- If you're trading a stock, you're competing with a diverse ecosystem of retail, institutional, and algorithmic participants—prices reflect aggregated information from many sources
- If you're trading an ETF, you're more likely trading with sophisticated arbitrageurs or large institutions executing portfolio strategies
- ETF liquidity is "wholesale"—large blocks trade efficiently but with less continuous price discovery
- Stock liquidity is "retail"—many small trades provide continuous price updates

**Why ETFs grew faster but remain smaller:** ETFs evolved from niche products (2012) to mainstream portfolio instruments (2024). Trade count growth of more than 1,000% reflects this adoption, but an absolute level that is still roughly 1/26th that of stocks shows they serve a different market function.
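
As a rough check on the "grew faster but remain smaller" comparison, the sketch below recomputes growth and the stock-to-ETF level ratio from the rounded medians quoted in this section, and adds an implied compound quarterly rate as an illustrative extra (that rate is not a figure from the report). With rounded endpoints the ETF growth comes out near 1,018%, close to but not identical to the +1,032% in the Key Findings; the gap presumably reflects rounding of the medians, which is an assumption rather than something the report states.

```python
# Median trades per security, as quoted in this section (rounded values).
STOCK_START, STOCK_END = 1_230, 11_370
ETF_START, ETF_END = 39, 436
N_INTERVALS = 48  # 49 quarters (2012Q1 to 2024Q1) give 48 quarter-to-quarter steps

def growth_pct(start, end):
    """Total growth from start to end, in percent."""
    return (end / start - 1.0) * 100.0

def compound_quarterly_rate(start, end, intervals):
    """Constant per-quarter growth rate that would link start to end."""
    return (end / start) ** (1.0 / intervals) - 1.0

stock_growth = growth_pct(STOCK_START, STOCK_END)
etf_growth = growth_pct(ETF_START, ETF_END)
level_ratio = STOCK_END / ETF_END  # how many times higher the stock median is

stock_q = compound_quarterly_rate(STOCK_START, STOCK_END, N_INTERVALS)
etf_q = compound_quarterly_rate(ETF_START, ETF_END, N_INTERVALS)

print(f"stock growth {stock_growth:.0f}%, ETF growth {etf_growth:.0f}%")
print(f"stock/ETF level ratio {level_ratio:.1f}x")
print(f"implied quarterly rate: stocks {stock_q:.2%}, ETFs {etf_q:.2%}")
```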

**Why we selected this visualization:** This multi-metric view shows that different asset types serve fundamentally different roles in the financial ecosystem. Understanding these differences is crucial for trading strategy, execution quality, and market structure policy.

---

#### 5. Lit Trades: The Displayed Market Comeback

**What the data shows:** Trades on displayed (lit) venues grew from 480 to 2,868 for stocks (+498%) and from 16 to 161 for ETFs (+906%). Both asset types show strong, sustained growth in lit market activity.

**Why this matters:** Combined with Finding 2 (declining hidden percentage), this shows displayed markets shared fully in the growth of trading activity. While total market activity exploded, displayed volume grew faster than hidden volume.

**What drove this trend:**
- **Technology leveling**: Displayed venues invested heavily in low-latency infrastructure, reducing the speed advantage of dark venues
- **Regulatory incentives**: Trade-at proposals (tested in the SEC's Tick Size Pilot) and other regulatory scrutiny made dark trading less attractive
- **Transparency premium**: Institutional investors increasingly demanded best execution on transparent venues
- **Competition**: Multiple lit venues (exchanges, ECNs) competed aggressively, improving quality and reducing costs

**What it means in practice:**
- More pre-trade transparency: Traders can see available liquidity before committing
- Better price discovery: Displayed markets aggregate more information into prices
- Reduced information asymmetry: All participants see the same order book
- Lower adverse selection: Informed traders can't hide as easily in dark pools

**Why ETFs grew faster:** ETFs were less established in 2012, so lit venue growth for ETFs partly reflects overall ETF adoption. The 906% growth also reflects arbitrage activity moving to displayed venues where pricing is more transparent.

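**Apparent paradox resolved:** How did lit trades grow 498-906% while the hidden share of volume declined only 7-19% in relative terms (6.5 and 16.2 percentage points for ETFs and stocks, respectively)? The two figures measure different things: lit trade counts scale with total activity, whereas the hidden percentage is a share of volume. Total market activity grew so much that even a declining hidden share still meant more absolute hidden volume, while displayed volume grew faster still.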

**Why we selected this visualization:** This pattern provides the positive counterpoint to Finding 2—not only did hidden trading decline as a percentage, but displayed markets actively won the competition for order flow. This challenges the narrative that markets became fragmented and opaque.


## Methodological Considerations and Key Insights

### Fixed Cohort Approach
**Rationale:** Sampling 100 stocks and 100 ETFs from 2012Q1 and tracking them over time controls for composition effects. This ensures observed trends reflect behavioral changes rather than shifts in which securities are being measured.

**Trade-off:** Introduces survivorship bias (securities that delist disappear from later quarters) but provides a cleaner signal of behavioral evolution.
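
A minimal sketch of the fixed-cohort idea, assuming the data can be represented as rows with `quarter` and `ticker` fields. The function names, the seed, the universe size, and the tickers are all hypothetical and are not taken from the project code.

```python
import random

def draw_fixed_cohort(universe, size, seed):
    """Sample `size` identifiers once, reproducibly, from the base-quarter universe."""
    rng = random.Random(seed)
    # Sort first so the draw does not depend on set iteration order.
    return sorted(rng.sample(sorted(universe), size))

def track_cohort(cohort, rows):
    """Keep only rows for cohort members; later listings are never added."""
    members = set(cohort)
    return [row for row in rows if row["ticker"] in members]

# HYPOTHETICAL universe and rows, for illustration only.
universe_2012q1 = {f"STK{i:03d}" for i in range(500)}
cohort = draw_fixed_cohort(universe_2012q1, size=100, seed=42)

rows = [
    {"quarter": "2012Q1", "ticker": cohort[0], "trades": 1200},
    {"quarter": "2024Q1", "ticker": cohort[0], "trades": 11000},
    {"quarter": "2024Q1", "ticker": "NEWIPO", "trades": 9000},  # listed after 2012Q1
]
tracked = track_cohort(cohort, rows)
print(len(cohort), len(tracked))
```

Because membership is fixed at the base quarter, a cohort member that delists simply stops contributing rows. That is the survivorship effect described above.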

### Winsorization Strategy
**Approach:** Capping values at 1st and 99th percentiles rather than removing outliers.

**Rationale:** Preserves sample size while reducing the influence of extreme values (flash events, data errors). The 1%/99% cutoff is conservative: only the most extreme 2% of observations are adjusted, and none are dropped.
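
The capping step can be sketched with the standard library alone. This is an illustrative version, not the project's implementation: the percentile interpolation rule (`method="inclusive"`) is an assumption, and the sample values are made up.

```python
import statistics

def winsorize(values, lower_pct=1, upper_pct=99):
    """Clamp values to the [lower_pct, upper_pct] percentile range."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# HYPOTHETICAL data: 100 ordinary values plus one extreme observation.
raw = list(range(1, 101)) + [10_000]
capped = winsorize(raw)

print(len(raw), len(capped))   # sample size is unchanged
print(max(raw), max(capped))   # the extreme value is pulled in to the cap
```

Extreme values are replaced by the percentile bounds rather than deleted, so every security-quarter stays in the sample.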

### Handling Undefined Ratios
**Decision:** When denominators equal zero, set ratios to NaN rather than zero.

**Rationale:** A zero denominator makes the ratio mathematically undefined—not truly zero. Imputing zero would bias medians downward and fabricate information.
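
The effect of this choice on a median can be illustrated with a small sketch. The helper names and the cancel and trade counts are hypothetical.

```python
import math
import statistics

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return math.nan if denominator == 0 else numerator / denominator

def nan_median(values):
    """Median over defined values only; NaN if nothing is defined."""
    valid = [v for v in values if not math.isnan(v)]
    return statistics.median(valid) if valid else math.nan

# HYPOTHETICAL (cancels, trades) pairs; the last security had no trades.
pairs = [(120, 10), (90, 10), (150, 10), (40, 0)]
ratios = [safe_ratio(cancels, trades) for cancels, trades in pairs]

median_with_nan = nan_median(ratios)
median_with_zero = statistics.median(
    [0.0 if math.isnan(r) else r for r in ratios]
)
print(median_with_nan, median_with_zero)
```

In this toy case, imputing zero for the undefined ratio pulls the median down from 12.0 to 10.5, which is the downward bias the rationale describes.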

### Statistical Approach
**Descriptive Statistics:** Analysis focuses on median (p50) as the primary measure of central tendency, supplemented by interquartile ranges (p25-p75) to show dispersion. Medians are robust to outliers and appropriate for skewed distributions common in trading data.
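
A short sketch of why the median is preferred for skewed data, using made-up trade counts. The `summarize` helper and the interpolation method are assumptions made for illustration.

```python
import statistics

def summarize(values):
    """Return p25, p50, p75 and the mean for one metric in one quarter."""
    p25, p50, p75 = statistics.quantiles(values, n=4, method="inclusive")
    return {"p25": p25, "p50": p50, "p75": p75, "mean": statistics.fmean(values)}

# HYPOTHETICAL right-skewed trade counts: one very active security.
sample = [100, 200, 300, 400, 10_000]
summary = summarize(sample)
print(summary)
```

Here the single large value drags the mean to 2,200 while the median stays at 300, inside the interquartile range of 200 to 400.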

**Time Series Analysis:** All metrics examined across 49 consecutive quarters to identify trends, inflection points, and structural changes in market behavior.
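
The quarter count can be verified, and gaps in the panel detected, with a small helper. This is a sketch; the `YYYYQn` label format is an assumption about how quarters are encoded.

```python
def quarter_range(start_year, start_q, end_year, end_q):
    """List quarter labels such as '2012Q1' from start to end, inclusive."""
    labels = []
    year, quarter = start_year, start_q
    while (year, quarter) <= (end_year, end_q):
        labels.append(f"{year}Q{quarter}")
        quarter += 1
        if quarter > 4:
            year, quarter = year + 1, 1
    return labels

def missing_quarters(observed, expected):
    """Quarters in the expected range that do not appear in the observed labels."""
    return sorted(set(expected) - set(observed))

quarters = quarter_range(2012, 1, 2024, 1)
print(len(quarters), quarters[0], quarters[-1])

# HYPOTHETICAL panel with one quarter dropped, to show the gap check.
observed = [q for q in quarters if q != "2020Q2"]
print(missing_quarters(observed, quarters))
```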

---

## Key Findings

### 1. Odd-Lot Trading Growth

**Observation:** Odd-lot trading volume increased substantially across both asset types, with stocks showing more dramatic growth.

**Quantitative Evidence:**
- Stock odd-lot volume: 6.0% → 28.8% of total trade volume (+381%)
- ETF odd-lot volume: 2.2% → 10.2% of total trade volume (+369%)
- Stock-ETF divergence widened from 3.8 to 18.6 percentage points

**Interpretation:** Two concurrent phenomena likely drive this pattern. First, retail democratization through commission-free trading platforms eliminated economic barriers to small-lot trading, enabling mass retail market participation. Second, institutional algorithms increasingly fragment large orders into sub-100 share lots to minimize market impact and execution costs.

**Note:** The stock-ETF divergence suggests different mechanisms—retail traders appear to favor individual stocks for stock-picking strategies, while ETF odd-lots more likely reflect arbitrage and market-making algorithms.

### 2. Market Transparency Improvement

**Observation:** Hidden (non-displayed) volume declined as a percentage of total trading activity, contrary to popular narratives about increasing market opacity.

**Quantitative Evidence:**
- Stock hidden volume: 87.4% → 71.2% of total trade volume (-16.2 pp)
- ETF hidden volume: 96.3% → 89.8% of total trade volume (-6.5 pp)
- Trend sustained across entire 12-year period
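
Percentage-point changes and relative changes are easy to conflate, so the sketch below computes both from the shares listed above. The relative declines it produces (about 18.5% for stocks and 6.7% for ETFs) correspond to the "7-19%" range cited in the visualization discussion.

```python
def pp_change(start_pct, end_pct):
    """Change in percentage points between two shares."""
    return end_pct - start_pct

def relative_change(start_pct, end_pct):
    """Change relative to the starting share, in percent."""
    return (end_pct - start_pct) / start_pct * 100.0

# Hidden share of trade volume (%), as listed in the evidence above.
shares = {"stocks": (87.4, 71.2), "etfs": (96.3, 89.8)}

for name, (start, end) in shares.items():
    print(
        f"{name}: {pp_change(start, end):+.1f} pp, "
        f"{relative_change(start, end):+.1f}% relative"
    )
```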

**Interpretation:** This pattern reflects regulatory success (SEC transparency initiatives), competitive pressure among venues (displayed markets improved quality), and institutional demand for verifiable best execution. The decline accelerated post-2018 during increased regulatory scrutiny.

**Important Nuance:** While hidden volume percentage declined, absolute hidden volume grew due to overall market expansion. However, displayed volume grew faster than hidden volume, indicating a genuine shift toward transparency.

### 3. Trading Efficiency Gains

**Observation:** Cancel-to-trade ratios declined substantially, indicating more efficient order submission and execution.

**Quantitative Evidence:**
- Stock cancel-to-trade ratio: 12.2 → 6.4 (-47%)
- ETF cancel-to-trade ratio: 260.2 → 95.9 (-63%)
- ETFs showed steeper improvement despite higher baseline

**Interpretation:** Market participants became more sophisticated at predicting order execution. Improved algorithms, tighter spreads, more stable prices, and better technology infrastructure all contributed. The dramatic ETF improvement reflects refinement of high-frequency market-making strategies that initially relied on aggressive quote updating.

### 4. Divergent Market Structures

**Observation:** Stocks and ETFs exhibit fundamentally different trading patterns that persisted and intensified over time.

**Quantitative Evidence:**
- Trade counts: Stocks 26x higher than ETFs (11,370 vs 436 median trades)
- Order volumes: ETFs 3x higher than stocks ($27,531 vs $9,255 median)
- Growth rates: ETF trades +1,032%, Stock trades +824%

**Interpretation:** Stocks function as price-discovery engines with fragmented, diverse order flow from retail, institutional, and algorithmic participants. ETFs function as block-trading instruments dominated by arbitrage, market-making, and institutional portfolio flows. These distinct roles solidified rather than converged over the analysis period.

### 5. Displayed Market Dominance

**Observation:** Trading activity on displayed (lit) venues grew substantially, gaining share of volume as overall market activity expanded.

**Quantitative Evidence:**
- Stock lit trades: 480 → 2,868 (+498%)
- ETF lit trades: 16 → 161 (+906%)
- Growth exceeded hidden venue expansion

**Interpretation:** Combined with declining hidden percentages (Finding 2), this demonstrates displayed markets won the competition for order flow. Technology improvements, regulatory incentives, and transparency demands from institutional investors all favored lit venues. ETFs showed faster growth partly reflecting overall ETF adoption and arbitrage activity preferring transparent pricing.

---

## Synthesis

These five findings collectively demonstrate that equity markets evolved toward greater transparency, efficiency, and sophistication between 2012 and 2024, while accommodating massive growth in retail participation. The patterns challenge popular narratives about increasing opacity and fragmentation and point instead to successful regulatory outcomes and competitive market improvements.

Importantly, stocks and ETFs solidified distinct market roles—stocks as continuous price-discovery mechanisms with diverse participants, ETFs as efficient portfolio instruments for institutional flows—suggesting different optimal trading strategies and market structure policies for each asset type.

---

## Limitations and Caveats

1. **Attribution Challenge:** Cannot definitively separate retail from algorithmic activity in odd-lot growth without trader-level identification data

2. **Survivorship Bias:** Fixed cohort methodology means delisted securities drop out of later periods; results reflect surviving securities only

3. **Sample Size:** 100 securities per asset type may not capture full heterogeneity; stratified sampling by market cap/liquidity could improve representativeness

4. **Causality:** This is descriptive analysis documenting patterns, not causal inference establishing mechanisms

5. **Hidden Order Definition:** "Hidden" volume includes non-displayed orders on lit exchanges, not exclusively dark pool activity

6. **Generalizability:** Results reflect large-cap, liquid securities that survived from 2012Q1; patterns may differ for small-cap or recently listed securities

---

## Alternative Approaches to Consider

**Stratified Sampling:** Sample by market cap/liquidity quintiles to ensure representation across activity levels and test whether patterns differ by size
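
One way this could look, as a sketch only: rank the base-quarter universe by a size measure, cut it into equal-count quintiles, and draw the same number of securities from each. The `mcap` field, the universe, and the seed are hypothetical.

```python
import random

def stratified_sample(securities, key, n_strata=5, per_stratum=20, seed=0):
    """Rank by `key`, cut into equal-count strata, and sample evenly from each."""
    rng = random.Random(seed)
    ranked = sorted(securities, key=key)
    size = len(ranked)
    sample = []
    for s in range(n_strata):
        stratum = ranked[s * size // n_strata:(s + 1) * size // n_strata]
        sample.extend(rng.sample(stratum, min(per_stratum, len(stratum))))
    return sample

# HYPOTHETICAL universe: 500 securities with made-up market caps.
universe = [{"ticker": f"STK{i:03d}", "mcap": 10.0 * (i + 1)} for i in range(500)]
stratified_cohort = stratified_sample(universe, key=lambda s: s["mcap"])

smallest_quintile = [s for s in stratified_cohort if s["mcap"] <= 1000.0]
print(len(stratified_cohort), len(smallest_quintile))
```

With five strata and 20 draws each, the cohort keeps the 100-security size used in this analysis while guaranteeing coverage of every size bucket.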

**Robustness Checks:** Run analysis with multiple random seeds to assess sampling sensitivity and verify findings are not artifacts of specific cohort selection
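
A seed-sensitivity check might be sketched as follows. The population values are synthetic and the function name is hypothetical; in practice the statistic would be one of the report's medians computed on each redrawn cohort.

```python
import random
import statistics

def median_across_seeds(population, sample_size, seeds):
    """Draw one cohort per seed and record the cohort median of the metric."""
    results = {}
    for seed in seeds:
        rng = random.Random(seed)
        results[seed] = statistics.median(rng.sample(population, sample_size))
    return results

# HYPOTHETICAL metric values for a 1,000-security universe.
population = [float(i) for i in range(1, 1001)]
seed_results = median_across_seeds(population, sample_size=100, seeds=range(10))

spread = max(seed_results.values()) - min(seed_results.values())
print(f"median range across seeds: {spread:.1f}")
```

A spread that is small relative to the reported effect sizes would suggest the findings are not artifacts of one particular draw.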

**Volume-Weighting:** Compare equal-weighted vs volume-weighted statistics to determine whether patterns are economically significant or driven by low-volume outliers
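
The comparison can be illustrated with made-up numbers. This sketch uses means for simplicity; a weighted median would be the closer analogue to the medians used elsewhere in this report.

```python
def equal_weighted_mean(values):
    """Every security counts the same."""
    return sum(values) / len(values)

def volume_weighted_mean(values, volumes):
    """Each security counts in proportion to its trade volume."""
    total = sum(volumes)
    return sum(v * w for v, w in zip(values, volumes)) / total

# HYPOTHETICAL odd-lot shares (%) and trade volumes for four securities.
odd_lot_share = [40.0, 35.0, 30.0, 5.0]
trade_volume = [1_000, 2_000, 3_000, 94_000]

ew = equal_weighted_mean(odd_lot_share)
vw = volume_weighted_mean(odd_lot_share, trade_volume)
print(f"equal-weighted {ew:.1f}%, volume-weighted {vw:.1f}%")
```

In this toy case three thinly traded securities dominate the equal-weighted figure, while the volume-weighted figure is driven by the single heavily traded one.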

**Event Studies:** Test for structural breaks around specific regulatory changes (MiFID II, Tick Size Pilot, maker-taker reforms) using formal changepoint detection
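
As a very simple stand-in for formal changepoint detection, the sketch below finds the single split that minimizes within-segment squared error in a series. It is a mean-shift heuristic with no significance test, and the series is synthetic; a trending metric would need detrending first, and a dedicated changepoint method would be more appropriate for real inference.

```python
def best_single_break(series, min_segment=4):
    """Return (index, cost) of the split minimizing within-segment squared error."""
    def sse(segment):
        mean = sum(segment) / len(segment)
        return sum((x - mean) ** 2 for x in segment)

    best_index, best_cost = None, float("inf")
    for k in range(min_segment, len(series) - min_segment + 1):
        cost = sse(series[:k]) + sse(series[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index, best_cost

# HYPOTHETICAL quarterly hidden-share series with a level shift after 12 quarters.
series = [87.0, 87.5, 86.8, 87.2] * 3 + [79.8, 80.3, 80.1, 79.9] * 3
break_index, break_cost = best_single_break(series)
print(break_index)
```

The returned index is the first observation of the second regime, which could then be compared against the dates of the regulatory events listed above.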

**Cross-Sectional Analysis:** Examine whether patterns vary by sector, volatility regime, or market conditions to identify conditional relationships

**Survivor-Adjusted Analysis:** Compare results using fixed cohorts vs rolling cohorts to quantify survivorship bias impact
