# ThetaData Synchronization Manager - Examples

**Version:** 1.0.9  
**Last Updated:** December 2025

This notebook demonstrates practical usage of the tdSynchManager library for downloading, validating, and persisting market data from the ThetaData API.

## Table of Contents

1. [Setup & Imports](#setup)
2. [Example 1: Basic EOD Download to CSV](#ex1)
3. [Example 2: Multi-Symbol EOD Download](#ex2)
4. [Example 3: Intraday Data to Parquet](#ex3)
5. [Example 4: InfluxDB Integration](#ex4)
6. [Example 5: Coherence Check & Recovery](#ex5)
7. [Example 6: Custom Discovery Policy](#ex6)
8. [Example 7: Querying InfluxDB Data](#ex7)
9. [Example 8: Verify Data Completeness](#ex8)

<a id='setup'></a>
## 1. Setup & Imports

Import required libraries and verify installation.

In [None]:
import asyncio
import sys
import pandas as pd
from pathlib import Path

# Import tdSynchManager components
from tdSynchManager import ManagerConfig, ThetaSyncManager, ThetaDataV3Client
from tdSynchManager.config import Task, DiscoverPolicy

print("✅ Imports successful!")
print(f"Python version: {sys.version}")
print(f"Pandas version: {pd.__version__}")

**⚠️ Important**: Ensure your ThetaData API token is set as an environment variable:

```bash
# Windows (PowerShell)
$env:THETADATA_API_TOKEN="your_token_here"

# Windows (CMD)
set THETADATA_API_TOKEN=your_token_here

# Linux/macOS
export THETADATA_API_TOKEN="your_token_here"
```

Or use a `.env` file:

```ini
# .env
THETADATA_API_TOKEN=your_token_here
```
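
If the `.env` file is not picked up automatically in your environment, a small loader can copy its entries into the process environment before the client is created. The sketch below is a dependency-free illustration; `load_env_file` is a hypothetical helper, not part of tdSynchManager, and it handles only simple `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Hypothetical helper for illustration. Variables that are already
    set in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)
        loaded[key] = value
    return loaded


load_env_file()
if not os.environ.get("THETADATA_API_TOKEN"):
    print("THETADATA_API_TOKEN is not set; downloads will likely fail.")
```

Using `setdefault` means a token exported in the shell takes precedence over the one in the file.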

<a id='ex1'></a>
## Example 1: Basic EOD Download to CSV

Download daily (EOD) data for a single symbol to CSV format.

In [None]:
# Configuration
config = ManagerConfig(
    root_dir="./data",        # Output directory
    max_concurrency=5         # Parallel downloads
)

# Task definition
tasks = [
    Task(
        asset="stock",
        symbols=["AAPL"],
        intervals=["1d"],
        sink="csv",
        first_date_override="20240101",
        end_date_override="20241231",
        discover_policy=DiscoverPolicy(mode="skip")  # Don't auto-discover
    )
]

# Run synchronization
async def run_example1():
    async with ThetaDataV3Client() as client:
        manager = ThetaSyncManager(config, client=client)
        await manager.run(tasks)

# Execute
await run_example1()
print("\n✅ Example 1 completed!")

In [None]:
# Verify downloaded data
import glob

csv_files = glob.glob("./data/stock/AAPL/1d/csv/*.csv")
if csv_files:
    df = pd.read_csv(csv_files[0], dtype=str)
    print(f"📊 Downloaded {len(df)} rows")
    print(f"\nColumns: {df.columns.tolist()}")
    print(f"\nFirst 5 rows:\n{df.head()}")
else:
    print("❌ No CSV files found")

<a id='ex2'></a>
## Example 2: Multi-Symbol EOD Download

Download EOD data for multiple symbols simultaneously.

In [None]:
# Configuration
config = ManagerConfig(
    root_dir="./data",
    max_concurrency=10  # Higher concurrency for multiple symbols
)

# Multi-symbol task
tasks = [
    Task(
        asset="stock",
        symbols=["AAPL", "MSFT", "GOOGL", "AMZN", "TSLA"],
        intervals=["1d"],
        sink="csv",
        first_date_override="20240101",
        end_date_override="20241231",
        discover_policy=DiscoverPolicy(mode="skip")
    )
]

async def run_example2():
    async with ThetaDataV3Client() as client:
        manager = ThetaSyncManager(config, client=client)
        await manager.run(tasks)

await run_example2()
print("\n✅ Example 2 completed!")
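
To confirm that every symbol actually produced output, the per-symbol CSV files can be counted after the run. This is a minimal sketch that assumes the `root_dir/asset/symbol/interval/csv/*.csv` layout used in the Example 1 verification cell; `count_rows_per_symbol` is a hypothetical helper, not a library function.

```python
import csv
import glob
import os


def count_rows_per_symbol(symbols, root_dir="./data", asset="stock", interval="1d"):
    """Return {symbol: number of data rows} across all CSV parts per symbol."""
    counts = {}
    for symbol in symbols:
        # Assumed layout: <root_dir>/<asset>/<symbol>/<interval>/csv/*.csv
        pattern = os.path.join(root_dir, asset, symbol, interval, "csv", "*.csv")
        total = 0
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="", encoding="utf-8") as handle:
                reader = csv.reader(handle)
                next(reader, None)  # skip the header row
                total += sum(1 for _ in reader)
        counts[symbol] = total
    return counts


symbols = ["AAPL", "MSFT", "GOOGL", "AMZN", "TSLA"]
for symbol, rows in count_rows_per_symbol(symbols).items():
    print(f"{symbol}: {rows} rows")
```

A symbol that reports 0 rows either failed to download or was written somewhere other than the assumed path.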

<a id='ex3'></a>
## Example 3: Intraday Data to Parquet

Download 5-minute intraday bars and save to Parquet format (compressed, faster queries).

In [None]:
# Configuration for Parquet
config = ManagerConfig(
    root_dir="./data",
    max_concurrency=3
)

# Intraday task (5-minute bars)
tasks = [
    Task(
        asset="stock",
        symbols=["SPY"],
        intervals=["5min"],
        sink="parquet",
        first_date_override="20241201",
        end_date_override="20241215",
        discover_policy=DiscoverPolicy(mode="skip")
    )
]

async def run_example3():
    async with ThetaDataV3Client() as client:
        manager = ThetaSyncManager(config, client=client)
        await manager.run(tasks)

await run_example3()
print("\n✅ Example 3 completed!")

In [None]:
# Read Parquet file
parquet_files = glob.glob("./data/stock/SPY/5min/parquet/*.parquet")
if parquet_files:
    df = pd.read_parquet(parquet_files[0])
    print(f"📊 Parquet file contains {len(df)} rows")
    print(f"\nFirst 5 rows:\n{df.head()}")

    # Show the file size on disk
    import os
    parquet_size = os.path.getsize(parquet_files[0])
    print(f"\n💾 File size: {parquet_size / 1024:.2f} KB")
else:
    print("❌ No Parquet files found")
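
A quick sanity check on the row count is to compare it with the number of bars a trading session should contain. The sketch below assumes a regular US equity session from 09:30 to 16:00 and bars stamped at their opening time; both are assumptions, so the count can differ if the feed includes extended hours or a closing bar. `expected_bars_per_session` is a hypothetical helper.

```python
from datetime import date, datetime, time


def expected_bars_per_session(interval_minutes,
                              session_open=time(9, 30),
                              session_close=time(16, 0)):
    """Number of whole bars of the given length that fit in one session."""
    day = date(2000, 1, 3)  # arbitrary date, only used to subtract times
    open_dt = datetime.combine(day, session_open)
    close_dt = datetime.combine(day, session_close)
    session_minutes = int((close_dt - open_dt).total_seconds() // 60)
    return session_minutes // interval_minutes


bars = expected_bars_per_session(5)
print(f"Expected 5-minute bars per regular session: {bars}")  # 390 / 5 = 78
```

Multiplying this figure by the number of trading days in the requested range gives a rough target for `len(df)`.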

<a id='ex4'></a>
## Example 4: InfluxDB Integration

Download data and write directly to InfluxDB for time-series analysis.

**Prerequisites:**
- InfluxDB 3.x running at http://localhost:8086
- Valid InfluxDB token
- Existing bucket (database)

In [None]:
# InfluxDB configuration
influx_config = ManagerConfig(
    root_dir="./data",
    max_concurrency=3,
    influx_url="http://localhost:8086",
    influx_bucket="ThetaData",
    influx_token="your_influx_token_here",  # Replace with your token
    influx_measure_prefix="",
    influx_write_batch=5000
)

# Task for InfluxDB
tasks = [
    Task(
        asset="stock",
        symbols=["SPY"],
        intervals=["1d"],
        sink="influxdb",
        first_date_override="20240101",
        end_date_override="20241231",
        discover_policy=DiscoverPolicy(mode="skip")
    )
]

async def run_example4():
    async with ThetaDataV3Client() as client:
        manager = ThetaSyncManager(influx_config, client=client)
        await manager.run(tasks)

# Uncomment to run (requires InfluxDB setup)
# await run_example4()
# print("\n✅ Example 4 completed!")

print("⚠️ Example 4 requires InfluxDB setup. See MANUAL.md Chapter 4.")
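
The `influx_write_batch=5000` setting suggests that rows are sent to InfluxDB in chunks rather than one at a time. How the library does this internally is not shown here; the following is only a generic illustration of batching, with `iter_batches` as a hypothetical helper.

```python
from itertools import islice


def iter_batches(rows, batch_size=5000):
    """Yield consecutive lists of at most batch_size items from rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 12,000 rows split into batches of 5,000
sizes = [len(batch) for batch in iter_batches(range(12000))]
print(sizes)  # [5000, 5000, 2000]
```

Larger batches mean fewer write requests but more memory per request, which is the usual trade-off when tuning a value like this.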

<a id='ex5'></a>
## Example 5: Coherence Check & Recovery

Enable automatic gap detection and recovery for existing data.

In [None]:
# Configuration with coherence checking enabled
config = ManagerConfig(
    root_dir="./data",
    max_concurrency=5,
    coherence_mode="full",  # Options: "off", "light", "full"
    coherence_tolerance=0.05  # Allow 5% missing data before triggering recovery
)

# Task with coherence
tasks = [
    Task(
        asset="stock",
        symbols=["AAPL"],
        intervals=["1d"],
        sink="csv",
        first_date_override="20240101",
        end_date_override="20241231",
        discover_policy=DiscoverPolicy(mode="skip")
    )
]

async def run_example5():
    async with ThetaDataV3Client() as client:
        manager = ThetaSyncManager(config, client=client)
        await manager.run(tasks)

await run_example5()
print("\n✅ Example 5 completed!")
print("Check logs above for [COHERENCE] messages indicating gap detection/recovery.")
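
To make the 5% tolerance concrete, the decision can be thought of as comparing the fraction of missing records with the threshold. This is a sketch of that idea only; the library's actual rule (for example, how it counts expected records, or whether the comparison is strict) may differ. `needs_recovery` is a hypothetical name.

```python
def needs_recovery(expected_count, actual_count, tolerance=0.05):
    """Return True when the missing fraction exceeds the tolerance."""
    if expected_count <= 0:
        return False  # nothing expected, nothing to recover
    missing = max(expected_count - actual_count, 0)
    return missing / expected_count > tolerance


# 252 expected daily bars in a year
print(needs_recovery(252, 245))  # 7 missing, about 2.8%: False
print(needs_recovery(252, 230))  # 22 missing, about 8.7%: True
```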

<a id='ex6'></a>
## Example 6: Custom Discovery Policy

Use different discovery policies to control symbol and date range behavior.

In [None]:
config = ManagerConfig(
    root_dir="./data",
    max_concurrency=5
)

# Example A: skip - No discovery, only specified symbols/dates
task_skip = Task(
    asset="stock",
    symbols=["AAPL"],
    intervals=["1d"],
    sink="csv",
    first_date_override="20240101",
    end_date_override="20240131",
    discover_policy=DiscoverPolicy(mode="skip")
)

# Example B: mild_skip - Discover new symbols, keep existing dates
task_mild = Task(
    asset="stock",
    symbols=["AAPL", "MSFT"],
    intervals=["1d"],
    sink="csv",
    first_date_override="20240101",
    end_date_override="20240131",
    discover_policy=DiscoverPolicy(mode="mild_skip")
)

# Example C: wild - Discover symbols AND extend dates to present
task_wild = Task(
    asset="stock",
    symbols=["AAPL"],
    intervals=["1d"],
    sink="csv",
    discover_policy=DiscoverPolicy(mode="wild")  # Will extend to current date
)

print("Discovery Policy Examples:")
print(f"  - skip:      {task_skip.discover_policy.mode}")
print(f"  - mild_skip: {task_mild.discover_policy.mode}")
print(f"  - wild:      {task_wild.discover_policy.mode}")
print("\nPass the chosen task to manager.run([...]) as in Example 1.")

<a id='ex7'></a>
## Example 7: Querying InfluxDB Data

Query data written to InfluxDB.

In [None]:
# Requires the influxdb3-python package (imported as influxdb_client_3)
try:
    from influxdb_client_3 import InfluxDBClient3
    
    # Configuration
    client = InfluxDBClient3(
        host="http://localhost:8086",
        token="your_influx_token_here",
        database="ThetaData"
    )
    
    # Query example: Get SPY data for December 2024
    query = """
    SELECT time, open, high, low, close, volume
    FROM SPY_stock_1d
    WHERE time >= '2024-12-01T00:00:00Z'
      AND time < '2025-01-01T00:00:00Z'
    ORDER BY time ASC
    """
    
    # Uncomment to run (requires InfluxDB with data)
    # table = client.query(query)
    # df = table.to_pandas()
    # print(f"📊 Query returned {len(df)} rows\n")
    # print(df.head())
    # client.close()

    print("⚠️ Example 7 requires InfluxDB with data. See Example 4 first.")

except ImportError:
    print("❌ influxdb3-python not installed. Run: pip install influxdb3-python")
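
The query above hard-codes the measurement name and date range. A small builder makes it reusable for other symbols and periods. This sketch assumes measurements follow the `SYMBOL_asset_interval` pattern seen in `SPY_stock_1d` with an empty `influx_measure_prefix`; `build_range_query` is a hypothetical helper. It double-quotes the measurement name so that its mixed case is preserved by the SQL engine.

```python
import re
from datetime import date


def build_range_query(symbol, asset, interval, start, end):
    """Build a SQL query for [start, end) on an assumed measurement name."""
    measurement = f"{symbol}_{asset}_{interval}"
    # Reject anything that could break out of the quoted identifier
    if not re.fullmatch(r"[A-Za-z0-9_.]+", measurement):
        raise ValueError(f"Unexpected measurement name: {measurement!r}")
    return (
        "SELECT time, open, high, low, close, volume\n"
        f'FROM "{measurement}"\n'
        f"WHERE time >= '{start.isoformat()}T00:00:00Z'\n"
        f"  AND time < '{end.isoformat()}T00:00:00Z'\n"
        "ORDER BY time ASC"
    )


print(build_range_query("SPY", "stock", "1d", date(2024, 12, 1), date(2025, 1, 1)))
```

The half-open range (inclusive start, exclusive end) matches the original query and avoids counting the boundary day twice when querying consecutive months.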

<a id='ex8'></a>
## Example 8: Verify Data Completeness

Check for missing dates in downloaded CSV data.

In [None]:
import pandas as pd
from datetime import datetime, timedelta

# Read downloaded CSV
csv_file = "./data/stock/AAPL/1d/csv/2024-01-01T00-00-00Z-AAPL-stock-1d_part01.csv"

try:
    df = pd.read_csv(csv_file, dtype=str)
    
    # Parse timestamps
    df['timestamp'] = pd.to_datetime(df['last_trade'], utc=True, errors='coerce')
    df['date'] = df['timestamp'].dt.date
    
    # Get date range
    start_date = df['date'].min()
    end_date = df['date'].max()
    
    # Generate expected weekdays; freq='B' skips weekends but not exchange holidays,
    # so market holidays inside the range will be reported as missing
    date_range = pd.date_range(start=start_date, end=end_date, freq='B')  # B = business days
    expected_dates = set(date_range.date)
    actual_dates = set(df['date'].unique())
    
    # Find missing dates
    missing_dates = expected_dates - actual_dates
    
    print(f"📅 Date Range: {start_date} to {end_date}")
    print(f"📊 Total rows: {len(df)}")
    print(f"✅ Expected trading days: {len(expected_dates)}")
    print(f"✅ Actual trading days: {len(actual_dates)}")

    if missing_dates:
        print(f"\n⚠️ Missing {len(missing_dates)} dates:")
        for date in sorted(missing_dates)[:10]:  # Show first 10
            print(f"  - {date}")
    else:
        print("\n✅ No missing dates! Data is complete.")

except FileNotFoundError:
    print(f"❌ File not found: {csv_file}")
    print("Run Example 1 first to download data.")
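
Because a business-day range only excludes weekends, exchange holidays show up as false gaps in the check above. One way to reduce that noise is to subtract a known holiday list when building the expected calendar. The sketch below uses only the standard library; `expected_trading_days` is a hypothetical helper, and the holiday set is a partial, hand-entered example for January 2024 rather than a complete exchange calendar.

```python
from datetime import date, timedelta


def expected_trading_days(start, end, holidays=()):
    """Weekdays between start and end (inclusive), minus the given holidays."""
    closed = set(holidays)
    days = []
    current = start
    while current <= end:
        # weekday() is 0-4 for Monday-Friday
        if current.weekday() < 5 and current not in closed:
            days.append(current)
        current += timedelta(days=1)
    return days


# Illustrative only: New Year's Day and Martin Luther King Jr. Day 2024
holidays_jan_2024 = {date(2024, 1, 1), date(2024, 1, 15)}
trading_days = expected_trading_days(date(2024, 1, 1), date(2024, 1, 31), holidays_jan_2024)
print(len(trading_days))  # 23 weekdays minus 2 holidays = 21
```

The resulting list can replace `date_range.date` when computing `expected_dates`, provided the holiday set covers the full period being checked.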

## Summary

This notebook demonstrated:

1. ✅ **Basic EOD download** - Single symbol to CSV
2. ✅ **Multi-symbol download** - Parallel processing
3. ✅ **Intraday data** - 5-minute bars to Parquet
4. ✅ **InfluxDB integration** - Time-series database
5. ✅ **Coherence checking** - Automatic gap detection
6. ✅ **Discovery policies** - Control symbol/date behavior
7. ✅ **Query InfluxDB** - Retrieve stored data
8. ✅ **Data validation** - Verify completeness

**Next Steps:**

- Read the full manual: [MANUAL.md](../MANUAL.md)
- Explore API reference: [Chapter 18](../MANUAL.md#18-api-reference)
- See batch automation: [start_environment.bat](../start_environment.bat)

---

**Documentation:** https://github.com/fede72bari/tdSynchManager  
**Issues:** https://github.com/fede72bari/tdSynchManager/issues