# Market Data Analysis

Comprehensive market data exploration and visualization to support high-frequency trading research.

## Objectives
- Perform exploratory data analysis (EDA) on market datasets
- Understand microstructure signals (prices, volumes, spreads, liquidity)
- Identify patterns, anomalies, and data quality issues
- Build clear visualizations for research and reporting

## Requirements
- Python 3.9+
- pandas, numpy, matplotlib, seaborn, plotly
- One or more CSV datasets with market data (trades/quotes/bars)

## References
- [Market Microstructure - Wikipedia](https://en.wikipedia.org/wiki/Market_microstructure)
- [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)
- [Plotly Financial Charts](https://plotly.com/python/financial-charts/)

## Exploratory Analysis Concepts
- Price levels and returns (close-to-close, log returns)
- Volume and liquidity metrics (ADV, turnover)
- Bid-ask spread and effective spread (if quotes available)
- Outlier detection and missing data handling
- Time alignment and resampling

## Visualization
- Time series line plots for prices/returns
- Histograms and KDE for return distributions
- Candlestick charts for OHLC data
- Scatter plots for relationships (e.g., spread vs. volume)


In [None]:
# Environment check
import sys
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly

print('Python version:', sys.version)
print('pandas:', pd.__version__)
print('numpy:', np.__version__)
print('matplotlib:', matplotlib.__version__)
print('seaborn:', sns.__version__)
print('plotly:', plotly.__version__)

# Plot styling
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')


## Load Dataset
Provide a path to a CSV file with at least a timestamp column and a price column (`close` for bars, `price` for trades).
Adjust `parse_dates` and add a `dtype` mapping as needed.
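
One way to pin column types at load time is sketched below, assuming the trades schema (`timestamp`, `price`, `size`). The in-memory CSV text and the chosen dtypes are illustrative assumptions, not taken from a real dataset.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a trades file.
csv_text = """timestamp,price,size
2024-01-02 09:30:01,100.0,10
2024-01-02 09:30:00,99.9,5
"""

trades = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["timestamp"],                   # parse the timestamp column as datetime64
    dtype={"price": "float64", "size": "int64"},  # explicit numeric types
)

# Sort by time, since the rows above are deliberately out of order.
trades = trades.sort_values("timestamp").reset_index(drop=True)
print(trades.dtypes)
```

Explicit dtypes make a schema mismatch fail loudly at load time rather than surfacing later as an `object` column.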


In [None]:
# Example: load market data from CSV
from pathlib import Path

data_path = Path('../data/market_sample.csv')  # <-- replace with your file

# Common schemas:
# - Bars: timestamp, open, high, low, close, volume
# - Trades: timestamp, price, size
# - Quotes: timestamp, bid, ask, bid_size, ask_size

df = pd.read_csv(
    data_path,
    parse_dates=['timestamp'],  # infer_datetime_format is deprecated since pandas 2.0 and has no effect
)

# Ensure sorted by time
df = df.sort_values('timestamp').reset_index(drop=True)
print('Loaded rows:', len(df))
df.head()


## Basic Statistical Analysis
Quick checks for schema, missing values, and descriptive statistics.


In [None]:
# Schema and quick look
print('Columns:', df.columns.tolist())
print('DTypes:\n', df.dtypes)
print('NA counts:\n', df.isna().sum())
display(df.head())

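# Compute simple returns from close (bars) or price (trades).
# fill_method=None keeps gaps as NaN instead of forward-filling prices across them.
if 'close' in df.columns:
    df['return'] = df['close'].pct_change(fill_method=None)
elif 'price' in df.columns:
    df['return'] = df['price'].pct_change(fill_method=None)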

display(df.describe(include='all'))

# Memory footprint
print('Memory usage (MB):', round(df.memory_usage(deep=True).sum() / 1e6, 3))


## Quick Visualizations
Static Matplotlib and seaborn plots for a first look; swap in interactive Plotly charts as needed.


In [None]:
# Line plot of close (bars) or price (trades); skipped if neither column exists
price_col = next((c for c in ('close', 'price') if c in df.columns), None)
if price_col is not None:
    plt.figure(figsize=(10, 4))
    plt.plot(df['timestamp'], df[price_col], label=price_col.capitalize())
    plt.title('Price over Time')
    plt.xlabel('Time')
    plt.ylabel('Price')
    plt.legend()
    plt.tight_layout()
    plt.show()
else:
    print("No 'close' or 'price' column found; skipping price plot.")

# Histogram of returns if computed
if 'return' in df.columns:
    plt.figure(figsize=(6,4))
    sns.histplot(df['return'].dropna(), bins=50, kde=True)
    plt.title('Return Distribution')
    plt.xlabel('Return')
    plt.tight_layout()
    plt.show()


## Next Steps
- Add resampling (e.g., 1min, 5min) and compute OHLCV aggregates
- Calculate microstructure metrics (spread, mid, imbalance) if quotes exist
- Introduce rolling windows for volatility and liquidity
- Validate timezones and trading session boundaries
- Document assumptions and data sources
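
The first three steps could be sketched as follows, under stated assumptions: trades are resampled into 1-minute OHLCV bars, a rolling standard deviation of log returns serves as a simple volatility proxy, and order-book imbalance is taken as `(bid_size - ask_size) / (bid_size + ask_size)`. The toy data, the 2-bar window, and the frame names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy trades indexed by timestamp.
trades = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2024-01-02 09:30:05",
                "2024-01-02 09:30:40",
                "2024-01-02 09:31:10",
                "2024-01-02 09:31:50",
                "2024-01-02 09:32:20",
            ]
        ),
        "price": [100.0, 100.5, 100.2, 100.8, 100.6],
        "size": [10, 20, 5, 15, 30],
    }
).set_index("timestamp")

# Resample to 1-minute OHLCV bars.
bars = trades["price"].resample("1min").ohlc()
bars["volume"] = trades["size"].resample("1min").sum()

# Rolling volatility proxy: std of log returns over a 2-bar window (window is arbitrary here).
bars["log_return"] = np.log(bars["close"]).diff()
bars["rolling_vol"] = bars["log_return"].rolling(window=2).std()

# Hypothetical toy quotes for a size-based imbalance measure in [-1, 1].
quotes = pd.DataFrame({"bid_size": [300, 100], "ask_size": [100, 300]})
quotes["imbalance"] = (quotes["bid_size"] - quotes["ask_size"]) / (
    quotes["bid_size"] + quotes["ask_size"]
)

print(bars)
print(quotes)
```

With real data, minutes without trades produce bars with NaN prices and zero volume, so decide explicitly whether to drop or forward-fill them before computing returns.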
