# Kepler DR25 Downloader Demo

This notebook demonstrates how to use the Kepler DR25 Downloader toolkit for downloading and filtering Kepler space telescope data from NASA's MAST archive.

## Overview

The toolkit provides two main scripts:

1. **`get-kepler-dr25.py`** - Main downloader with DVT filtering
2. **`filter-get-kepler-dr25.py`** - Universal filter with mode detection and conversion

### Key Features
- Fast parallel downloading (15-20 KICs/minute)
- ExoMiner format support (default) for ML frameworks
- Standard MAST format option
- DVT (Data Validation) file filtering
- Redis-based buffering for reliability
- Comprehensive health reports

## Prerequisites Check

In [None]:
# Check Python version
import sys

print(f"Python version: {sys.version}")
assert sys.version_info >= (3, 7), "Python 3.7+ is required"

In [None]:
# Check required packages
import importlib

required_packages = [
    'pandas',
    'numpy',
    'astroquery',
    'redis',
    'requests',
    'tqdm'
]

for package in required_packages:
    try:
        importlib.import_module(package)
        print(f"✓ {package} is installed")
    except ImportError:
        print(f"✗ {package} is not installed. Please run: pip install {package}")

In [None]:
# Check Redis connection
import redis

try:
    r = redis.Redis(host='localhost', port=6379, db=0)
    r.ping()
    print("✓ Redis is running and accessible")
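except redis.RedisError as exc:
    # Catch only Redis errors so unrelated bugs are not reported as "not running"
    print(f"✗ Could not reach Redis at localhost:6379: {exc}")
    print("  Please start Redis for optimal performance")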
    print("  On macOS: brew services start redis")
    print("  On Linux: sudo systemctl start redis")

## 1. Basic Download Example

In [None]:
# Create a sample CSV with a few KIC IDs
import os

import pandas as pd

# Create input directory if it doesn't exist
os.makedirs('input', exist_ok=True)

# Create a small test CSV with 5 KIC IDs
test_kics = pd.DataFrame({
    'kepid': [757450, 892772, 1161345, 1432214, 1725016]
})

test_csv_path = 'input/test_kics.csv'
test_kics.to_csv(test_csv_path, index=False)
print(f"Created test CSV with {len(test_kics)} KIC IDs: {test_csv_path}")
print(test_kics)
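
Before handing a larger CSV to the downloader, it can help to check the ID column first. The sketch below is a dependency-free illustration, not part of the toolkit: `load_kic_ids` is a hypothetical helper, and it assumes the input uses a `kepid` column as in the test file above and that lines starting with `#` are comments.

```python
import csv


def load_kic_ids(csv_path, column="kepid"):
    """Return unique, positive integer KIC IDs from a CSV, preserving order.

    Hypothetical helper for illustration; the column name is an assumption
    based on the test CSV created in this notebook.
    """
    seen = set()
    kic_ids = []
    with open(csv_path, newline="") as f:
        # Skip comment lines such as the '#' headers in archive exports
        reader = csv.DictReader(line for line in f if not line.startswith("#"))
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise ValueError(f"Column '{column}' not found in {csv_path}")
        for row in reader:
            value = (row[column] or "").strip()
            if not value.isdigit():
                continue  # ignore blank or non-integer entries
            kic = int(value)
            if kic > 0 and kic not in seen:
                seen.add(kic)
                kic_ids.append(kic)
    return kic_ids
```

Calling `load_kic_ids('input/test_kics.csv')` on the file created above should return the five IDs in their original order.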

In [None]:
# Run the downloader with ExoMiner format (default)
!python get-kepler-dr25.py input/test_kics.csv --workers 2 --batch-size 2

In [None]:
# Check the output structure
import glob
from pathlib import Path

# Find the latest job directory
# (assumes job directory names sort chronologically, e.g. timestamped names)
job_dirs = sorted(glob.glob('kepler_downloads/job-*'))
if job_dirs:
    latest_job = job_dirs[-1]
    print(f"Latest job: {latest_job}")

    # Show directory structure
    for path in Path(latest_job).rglob('*.fits'):
        print(f"  {path.relative_to(latest_job)}")
else:
    print("No job directories found")
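
For jobs with many targets, a per-directory count is easier to read than a full file listing. The following is a minimal sketch that makes no assumption about how the downloader lays out a job beyond it containing `.fits` files; `summarize_fits` is a hypothetical helper name.

```python
from collections import Counter
from pathlib import Path


def summarize_fits(job_dir):
    """Count .fits files grouped by parent directory (relative to job_dir)."""
    job_path = Path(job_dir)
    counts = Counter(
        str(path.parent.relative_to(job_path))
        for path in job_path.rglob("*.fits")
    )
    # Sort by directory name for stable, readable output
    return dict(sorted(counts.items()))
```

For example, `summarize_fits(latest_job)` returns a dictionary mapping each subdirectory to its number of FITS files; files directly inside the job directory are counted under `'.'`.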

## 2. Download with Standard MAST Format

In [None]:
# Run the downloader with Standard MAST format
!python get-kepler-dr25.py input/test_kics.csv --no-exominer --workers 2 --batch-size 2

## 3. Filter and Convert Between Formats

In [None]:
# Create a subset CSV for filtering
subset_kics = pd.DataFrame({
    'kepid': [757450, 892772]  # Just 2 KICs from our original 5
})

subset_csv_path = 'input/subset_kics.csv'
subset_kics.to_csv(subset_csv_path, index=False)
print(f"Created subset CSV with {len(subset_kics)} KIC IDs")

In [None]:
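# Filter an existing job using the subset
# Note: job_dirs was collected in section 1, so job_dirs[-1] is the job that was latest at that point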
if job_dirs:
    source_job = job_dirs[-1]
    print(f"Filtering from: {source_job}")
    !python filter-get-kepler-dr25.py --input-csv input/subset_kics.csv --source-job {source_job}

## 4. Health Report Analysis

In [None]:
# Read and display health report
import os

if job_dirs:
    health_report_path = os.path.join(job_dirs[-1], 'health_check_report.txt')
    if os.path.exists(health_report_path):
        with open(health_report_path) as f:
            print(f.read())
    else:
        print("Health report not found")

## 5. Database Analysis

In [None]:
# Analyze the download database
import sqlite3

import pandas as pd

if job_dirs:
    db_path = os.path.join(job_dirs[-1], 'download_records.db')
    if os.path.exists(db_path):
        conn = sqlite3.connect(db_path)
        try:
            # Get download statistics
            df = pd.read_sql_query(
                "SELECT kic, success, files_downloaded, has_dvt FROM download_records",
                conn
            )
        finally:
            # Close the connection even if the query fails
            conn.close()

        print("Download Statistics:")
        print(f"Total KICs: {len(df)}")
        print(f"Successful: {df['success'].sum()}")
        print(f"With DVT: {df['has_dvt'].sum()}")
        print(f"Total files: {df['files_downloaded'].sum()}")
        print("\nDetailed records:")
        print(df)
    else:
        print("Database not found")
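
If the query above fails because a table or column is named differently in your version of the toolkit, listing the schema first is a quick way to find out. This sketch uses only the standard `sqlite3` module; `describe_database` is a hypothetical helper, and nothing about the real schema is assumed.

```python
import sqlite3
from contextlib import closing


def describe_database(db_path):
    """Map each table name in a SQLite file to its list of column names."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist,
    # so check os.path.exists(db_path) first, as the cell above does.
    with closing(sqlite3.connect(db_path)) as conn:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        return {
            table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }
```

Running `describe_database(db_path)` shows which tables and columns are actually present, so the `SELECT` statement can be adjusted if needed.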

## 6. Working with Sample Datasets

In [None]:
# Load and examine the sample KOI dataset
koi_df = pd.read_csv('input_samples/cumulative_koi_2025.09.06_13.27.56.csv', comment='#')
print(f"KOI dataset: {len(koi_df)} entries")
print(f"Columns: {list(koi_df.columns)[:5]}...")  # Show first 5 columns
print("\nFirst 5 KIC IDs:")
print(koi_df['kepid'].head())

In [None]:
# Load and examine the sample TCE dataset
tce_df = pd.read_csv('input_samples/q1_q17_dr25_tce_2025.09.06_13.29.19.csv', comment='#')
print(f"TCE dataset: {len(tce_df)} entries")
print(f"Unique KICs: {tce_df['kepid'].nunique()}")
print("\nFirst 5 KIC IDs:")
print(tce_df['kepid'].head())
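
Catalogs like these can list the same `kepid` several times, so a small, deduplicated pilot file is a reasonable first input for the downloader. The sketch below is one way to build it, not the toolkit's own method: `write_pilot_csv`, its parameters, and the output path in the usage note are hypothetical.

```python
import csv
import random


def write_pilot_csv(kic_ids, out_path, n=5, seed=0, column="kepid"):
    """Write a reproducible random sample of unique KIC IDs to a one-column CSV."""
    unique_ids = sorted({int(k) for k in kic_ids})
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = sorted(rng.sample(unique_ids, min(n, len(unique_ids))))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([column])
        writer.writerows([kic] for kic in chosen)
    return chosen
```

For example, `write_pilot_csv(tce_df['kepid'], 'input/pilot_kics.csv', n=5)` writes five distinct KIC IDs drawn from the TCE table, in the same single-column layout as `input/test_kics.csv`.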

## Tips and Best Practices

1. **Start Small**: Test with a few KICs before downloading large datasets
2. **Use Redis**: Ensures data integrity and allows recovery from interruptions
3. **Monitor Progress**: Check the console output and health reports
4. **ExoMiner vs Standard**: 
   - Use ExoMiner (default) for ML frameworks
   - Use Standard (`--no-exominer`) for general analysis
5. **Optimize Performance**:
   - Increase workers for faster downloads: `--workers 8`
   - Adjust batch size: `--batch-size 100`
6. **Filter Smartly**: Use the filter script to extract subsets without re-downloading
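
As a rough planning aid for the first and fifth tips, the throughput quoted in the overview (15-20 KICs/minute) can be turned into a wall-clock estimate. This is simple arithmetic under the assumption that the quoted rate holds for your network and worker settings; `estimate_download_minutes` is a hypothetical helper.

```python
def estimate_download_minutes(n_kics, rate_low=15.0, rate_high=20.0):
    """Return (best_case, worst_case) minutes to download n_kics targets.

    rate_low and rate_high are KICs per minute; the defaults are the
    range quoted in the overview and are an assumption, not a guarantee.
    """
    if n_kics < 0:
        raise ValueError("n_kics must be non-negative")
    return n_kics / rate_high, n_kics / rate_low


best, worst = estimate_download_minutes(3000)
print(f"3000 KICs: roughly {best:.0f}-{worst:.0f} minutes")
```

At the quoted rate, 3000 KICs would take roughly 150 to 200 minutes, which is a good reason to validate the pipeline on a handful of targets first.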

## Cleanup

In [None]:
# Optional: Clean up test files
# Uncomment to remove test data
# import os
# if os.path.exists('input/test_kics.csv'):
#     os.remove('input/test_kics.csv')
# if os.path.exists('input/subset_kics.csv'):
#     os.remove('input/subset_kics.csv')
# print("Test files cleaned up")