# Exoplanet Light Curve Download - Full Scale (Colab Version)

**Goal**: Download 11,979 TESS light curve samples  
**Estimated time**: 6-7 hours  
**Estimated success rate**: ~57% (~6,800 samples)  
**Storage format**: HDF5

---

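## ⚠️ Notes on using Colab

1. **Run time**: This notebook needs 6-7 hours of continuous execution
2. **Colab limits**: 12-hour session limit on the free tier, 24 hours on Pro
3. **Resuming after interruption**: A checkpoint system lets the download continue from where it stopped
4. **Storage**: Mounting Google Drive is recommended (about 5-10 GB needed)
5. **Network**: Make sure the network connection is stable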

---

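## 📋 Steps

1. **Cell 1**: Mount Google Drive (optional)
2. **Cell 2**: Install dependencies
3. **Cell 3**: Clone the GitHub repository
4. **Cell 4**: Set configuration parameters
5. **Cell 5**: Run the download (main program)
6. **Cell 6**: Review statistics and verify files
7. **Cell 7**: Download the results locally or push them to GitHub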

## 1️⃣ Mount Google Drive (optional but recommended)

Store the downloaded data on Google Drive so it is not lost when the Colab runtime restarts.

In [None]:
# Option A: use Google Drive
USE_GDRIVE = True  # Set to False to store data on the Colab local disk instead

if USE_GDRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Set the working directory
    WORK_DIR = '/content/drive/MyDrive/exoplanet-lightcurves'
    !mkdir -p {WORK_DIR}
    %cd {WORK_DIR}
else:
    WORK_DIR = '/content/exoplanet-starter'
    !mkdir -p {WORK_DIR}
    %cd {WORK_DIR}

print(f"Working directory: {WORK_DIR}")
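
The main reason to mount Drive is the 5-10 GB storage estimate given in the notes above. A minimal sketch for checking free space before starting a long run, using only the standard library. The helper name `has_enough_space` and the 10 GB threshold are illustrative assumptions, not part of the repository, and the value reported for a mounted Drive may not reflect your actual quota.

```python
import shutil

def has_enough_space(path, required_gb=10.0):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb

# "." is the current working directory; in the notebook, pass WORK_DIR instead
print(has_enough_space(".", required_gb=10.0))
```

If this prints `False`, lowering `test_samples` for a first run is a cheap way to measure the real per-file size before committing to the full download.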

## 2️⃣ Install dependencies

In [None]:
%%capture
!pip install -q lightkurve h5py pandas numpy tqdm pyarrow

## 3️⃣ Clone the GitHub repository and get the dataset

In [None]:
import os
from pathlib import Path

# Clone the repository (if it does not exist yet)
if not Path('exoplanet-starter').exists():
    !git clone https://github.com/exoplanet-spaceapps/exoplanet-starter.git
else:
    print("Repository already exists, pulling latest changes...")
    !cd exoplanet-starter && git pull

# Set the project root directory
PROJECT_ROOT = Path('exoplanet-starter').resolve()
os.chdir(PROJECT_ROOT)
print(f"Project root: {PROJECT_ROOT}")

# Confirm that the dataset exists
dataset_path = PROJECT_ROOT / 'data' / 'supervised_dataset.csv'
if dataset_path.exists():
    print(f"✅ Dataset found: {dataset_path}")
else:
    print(f"❌ Dataset not found: {dataset_path}")
    print("Please check the repository structure.")
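
The existence check above does not confirm that the CSV has the columns the download cell relies on: `label`, plus either `tic_id` or `tid`. A minimal sketch of a header check using the standard `csv` module. The function name `check_columns` is hypothetical, and the demo uses an in-memory CSV rather than the real dataset.

```python
import csv
import io

REQUIRED = {"label"}            # used by the download cell for class counts
ID_COLUMNS = {"tic_id", "tid"}  # at least one of these must be present

def check_columns(csv_file):
    """Read only the header row and return a list of missing columns."""
    header = set(next(csv.reader(csv_file)))
    missing = sorted(REQUIRED - header)
    if not (ID_COLUMNS & header):
        missing.append("tic_id or tid")
    return missing

# Demo on an in-memory CSV; in the notebook use open(dataset_path, newline="")
demo = io.StringIO("tid,label\n12345,1\n")
print(check_columns(demo))  # → []
```

An empty list means the header has what the later cells expect; anything else is worth fixing before a multi-hour run.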

## 4️⃣ Configuration

In [None]:
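# Configuration parameters
CONFIG = {
    'max_workers': 4,          # Parallel downloads (2-4 recommended on Colab)
    'max_retries': 3,          # Retries per failed download
    'timeout': 60,             # Timeout in seconds (defined here but not used by the download cell)
    'save_interval': 20,       # Save a checkpoint every N samples
    'test_samples': 11979,     # Full download (use a smaller number to test)
}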

print("Configuration:")
for key, val in CONFIG.items():
    print(f"  {key}: {val}")

# Estimate the run time (assumes about 5 seconds per sample per worker)
estimated_hours = (CONFIG['test_samples'] * 5) / 3600 / CONFIG['max_workers']
print(f"\nEstimated time: {estimated_hours:.1f} hours")
print(f"Expected success: ~{int(CONFIG['test_samples'] * 0.57)} samples (57% rate)")
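
The estimate in the cell above is plain arithmetic: samples × seconds per sample ÷ workers ÷ 3600. A small worked sketch makes the sensitivity visible. The 5 s constant comes from the cell above; the 8 s value is a hypothetical figure chosen only to show what per-sample cost would correspond to the 6-7 hour headline estimate, and `estimate_hours` is an illustrative name.

```python
def estimate_hours(n_samples, workers, sec_per_sample=5.0):
    """Wall-clock hours, assuming work divides evenly across workers."""
    return n_samples * sec_per_sample / workers / 3600

print(f"{estimate_hours(11979, 4):.1f} h at 5 s/sample, 4 workers")       # → 4.2 h ...
print(f"{estimate_hours(11979, 4, 8.0):.1f} h at 8 s/sample, 4 workers")  # → 6.7 h ...
print(f"{estimate_hours(11979, 2):.1f} h at 5 s/sample, 2 workers")       # → 8.3 h ...
```

The formula assumes perfect parallelism, so real runs with retries and throttling will likely land above these numbers.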

## 5️⃣ Main download program

**Note**: This cell runs for 6-7 hours. Make sure the Colab connection stays stable.

In [None]:
import sys
import time
import warnings
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor, as_completed

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm  # Use the notebook version in Colab
import h5py
import lightkurve as lk

warnings.filterwarnings('ignore')

print("="*70)
print("Exoplanet Detection - Full Download (Colab Version)")
print("="*70)

# Set up paths
DATA_DIR = PROJECT_ROOT / 'data'
LIGHTCURVE_DIR = DATA_DIR / 'lightcurves'
CHECKPOINT_DIR = PROJECT_ROOT / 'checkpoints'

LIGHTCURVE_DIR.mkdir(parents=True, exist_ok=True)
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

print(f"\n[1/5] Paths configured")
print(f"  Lightcurves: {LIGHTCURVE_DIR}")
print(f"  Checkpoints: {CHECKPOINT_DIR}")

# Load the dataset
print(f"\n[2/5] Loading dataset...")
samples_df = pd.read_csv(dataset_path)
samples_df = samples_df.head(CONFIG['test_samples']).copy()

if 'sample_id' not in samples_df.columns:
    samples_df['sample_id'] = [f"SAMPLE_{i:06d}" for i in range(len(samples_df))]
if 'tic_id' not in samples_df.columns:
    # Fall back to the 'tid' column when 'tic_id' is absent
    if 'tid' in samples_df.columns:
        samples_df['tic_id'] = samples_df['tid']
    else:
        raise KeyError("Dataset needs a 'tic_id' or 'tid' column")

print(f"  Total samples: {len(samples_df)}")
print(f"  Positive: {samples_df['label'].sum()}, Negative: {(~samples_df['label'].astype(bool)).sum()}")

# Download function
def download_lightcurve(row, retries=3):
    """Download TESS light curves for one sample and save them as HDF5."""
    sample_id = row['sample_id']
    tic_id = int(float(row['tic_id']))

    result = {
        'sample_id': sample_id,
        'tic_id': tic_id,
        'status': 'failed',
        'file_path': None,
        'n_sectors': 0,
        'error': None
    }

    file_path = LIGHTCURVE_DIR / f"{sample_id}_TIC{tic_id}.h5"
    # Write under a temporary name first so an interrupted write
    # is never mistaken for a finished (cached) file
    tmp_path = file_path.with_suffix('.tmp')

    # Skip the download if the file already exists
    if file_path.exists():
        result['status'] = 'cached'
        result['file_path'] = str(file_path)
        try:
            with h5py.File(file_path, 'r') as f:
                result['n_sectors'] = int(f.attrs.get('n_sectors', 0))
        except Exception:
            pass  # Unreadable metadata: leave n_sectors at 0
        return result

    # Try to download, retrying on errors
    for attempt in range(retries):
        try:
            search_result = lk.search_lightcurve(f"TIC {tic_id}", author='SPOC')
            if search_result is None or len(search_result) == 0:
                result['error'] = 'no_data'
                return result

            lc_collection = search_result.download_all()
            if lc_collection is None or len(lc_collection) == 0:
                result['error'] = 'download_failed'
                return result

            # Save as HDF5: one group per downloaded light curve
            with h5py.File(tmp_path, 'w') as f:
                f.attrs['sample_id'] = sample_id
                f.attrs['tic_id'] = tic_id
                f.attrs['n_sectors'] = len(lc_collection)
                f.attrs['download_time'] = datetime.now().isoformat()

                for i, lc in enumerate(lc_collection):
                    grp = f.create_group(f'sector_{i}')
                    grp.create_dataset('time', data=np.array(lc.time.value))
                    grp.create_dataset('flux', data=np.array(lc.flux.value))
                    grp.create_dataset('flux_err', data=np.array(lc.flux_err.value))
                    grp.attrs['sector'] = lc.meta.get('SECTOR', '?')
                    grp.attrs['mission'] = str(lc.meta.get('MISSION', 'TESS'))

            # Move the finished file into place
            tmp_path.replace(file_path)

            result['status'] = 'success'
            result['file_path'] = str(file_path)
            result['n_sectors'] = len(lc_collection)
            return result

        except Exception as e:
            result['error'] = str(e)[:50]
            tmp_path.unlink(missing_ok=True)  # Remove any partial file
            if attempt < retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1 s, 2 s, ...

    return result

# Checkpoint management
def load_checkpoint():
    cp = CHECKPOINT_DIR / 'download_progress.parquet'
    if cp.exists():
        df = pd.read_parquet(cp)
        print(f"  ✅ Loaded checkpoint: {len(df)} records")
        return df
    return pd.DataFrame()

def save_checkpoint(df):
    cp = CHECKPOINT_DIR / 'download_progress.parquet'
    df.to_parquet(cp, index=False)

# Main download flow
print(f"\n[3/5] Starting download...")
progress_df = load_checkpoint()

if len(progress_df) > 0:
    completed = set(progress_df[progress_df['status'].isin(['success', 'cached'])]['sample_id'])
    remaining = samples_df[~samples_df['sample_id'].isin(completed)]
else:
    remaining = samples_df.copy()

print(f"  Total: {len(samples_df)}")
print(f"  Completed: {len(samples_df)-len(remaining)}")
print(f"  Remaining: {len(remaining)}")

if len(remaining) == 0:
    print("  ✅ All samples already downloaded!")
else:
    print(f"  ⏱️ Estimated time: {len(remaining)*5/60/CONFIG['max_workers']:.1f} minutes")
    print(f"\n⚠️ This will take approximately {len(remaining)*5/3600/CONFIG['max_workers']:.1f} hours")
    print("  Keep this notebook running...\n")
    
    start_time = time.time()
    results = []
    
    with ThreadPoolExecutor(max_workers=CONFIG['max_workers']) as executor:
        futures = {executor.submit(download_lightcurve, row, CONFIG['max_retries']): row
                   for _, row in remaining.iterrows()}
        
        with tqdm(total=len(remaining), desc="Downloading") as pbar:
            for future in as_completed(futures):
                result = future.result()
                results.append(result)
                pbar.update(1)
                
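                # Save a checkpoint periodically
                if len(results) % CONFIG['save_interval'] == 0:
                    temp_df = pd.concat([progress_df, pd.DataFrame(results)], ignore_index=True)
                    # Keep only the latest record per sample so retried failures are not counted twice
                    temp_df = temp_df.drop_duplicates('sample_id', keep='last')
                    save_checkpoint(temp_df)
                    print(f"  💾 Checkpoint saved: {len(results)} new downloads")
    
    # Final save
    if len(results) > 0:
        progress_df = pd.concat([progress_df, pd.DataFrame(results)], ignore_index=True)
        progress_df = progress_df.drop_duplicates('sample_id', keep='last')
        save_checkpoint(progress_df)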
    
    elapsed = time.time() - start_time
    print(f"\n✅ Download complete!")
    print(f"  Time: {elapsed/60:.1f} min ({elapsed/3600:.1f} hours)")
    print(f"  Average: {elapsed/len(results):.1f} sec/sample")

print("\n" + "="*70)
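
The download function treats any existing `.h5` file as already downloaded, so a file left half-written by an interrupted session would be skipped on resume. One common safeguard is to write to a temporary file and rename it into place only once the write has finished. The sketch below shows the pattern with plain bytes standing in for HDF5 content. `atomic_write_bytes` is a hypothetical helper, and the rename is only expected to be atomic on a local POSIX filesystem; behavior on a mounted Google Drive is an open assumption.

```python
import os
import tempfile
from pathlib import Path

def atomic_write_bytes(final_path, payload):
    """Write payload to a temp file in the same directory, then rename it into place."""
    final_path = Path(final_path)
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp_name, final_path)  # Rename only after a complete write
    except BaseException:
        # Remove the partial temp file; final_path is left untouched
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise

with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "SAMPLE_000000_TIC1.h5"
    atomic_write_bytes(target, b"demo")
    print(target.read_bytes())                        # → b'demo'
    print(sorted(p.name for p in Path(d).iterdir()))  # → ['SAMPLE_000000_TIC1.h5']
```

With this pattern, a file that matches `*.h5` is always a completed write, which keeps the cache check and the later file count trustworthy.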

## 6️⃣ Statistics and verification

In [None]:
# Reload the latest checkpoint
progress_df = load_checkpoint()

print("="*70)
print("Download Statistics")
print("="*70)

# Status breakdown
if progress_df.empty:
    print("\nNo checkpoint records found. Run the download cell first.")
else:
    status_counts = progress_df['status'].value_counts()
    print("\nStatus breakdown:")
    for status, count in status_counts.items():
        print(f"  {status}: {count}")

    success_count = status_counts.get('success', 0) + status_counts.get('cached', 0)
    success_rate = success_count / len(progress_df) * 100
    print(f"\nSuccess rate: {success_rate:.1f}% ({success_count}/{len(progress_df)})")

# File verification
h5_files = list(LIGHTCURVE_DIR.glob('*.h5'))
print(f"\nFiles on disk: {len(h5_files)}")

if len(h5_files) > 0:
    total_size = sum(f.stat().st_size for f in h5_files) / 1024 / 1024 / 1024
    print(f"Total size: {total_size:.2f} GB")
    print(f"Average: {total_size/len(h5_files)*1024:.1f} MB/file")
    
    # Spot-check 3 random files
    print("\nSample verification:")
    samples = np.random.choice(h5_files, min(3, len(h5_files)), replace=False)
    for h5_file in samples:
        try:
            with h5py.File(h5_file, 'r') as f:
                tic_id = f.attrs['tic_id']
                n_sectors = f.attrs['n_sectors']
                total_points = sum(len(f[f'sector_{i}']['time']) for i in range(n_sectors))
                print(f"  TIC{tic_id}: {n_sectors} sectors, {total_points:,} data points")
        except Exception as e:
            print(f"  ❌ {h5_file.name}: {e}")

print("\n" + "="*70)
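
Failed samples are retried when the download cell is re-run, so a checkpoint written across several sessions can end up holding more than one record for the same `sample_id` unless duplicates are dropped. A minimal sketch of computing the success rate on the latest record per sample, using plain dictionaries in place of the DataFrame. The function name `summarize` and the sample records are illustrative only.

```python
def summarize(records):
    """Keep the last record per sample_id, then compute the success rate."""
    latest = {}
    for rec in records:  # Later records overwrite earlier ones
        latest[rec["sample_id"]] = rec
    ok = sum(r["status"] in ("success", "cached") for r in latest.values())
    rate = 100 * ok / len(latest) if latest else 0.0
    return len(latest), ok, rate

records = [
    {"sample_id": "SAMPLE_000000", "status": "failed"},
    {"sample_id": "SAMPLE_000001", "status": "success"},
    {"sample_id": "SAMPLE_000000", "status": "success"},  # Retried in a later run
    {"sample_id": "SAMPLE_000002", "status": "failed"},
]
n, ok, rate = summarize(records)
print(f"{n} samples, {ok} ok, {rate:.1f}%")  # → 3 samples, 2 ok, 66.7%
```

Counting all four rows instead would report 2 of 4 (50%), understating the result for the three distinct samples.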

## 7️⃣ Save results

Choose one of the following ways to save the results.

### Option A: Download the checkpoint to your local machine

In [None]:
from google.colab import files

# Download the checkpoint file
checkpoint_file = CHECKPOINT_DIR / 'download_progress.parquet'
if checkpoint_file.exists():
    files.download(str(checkpoint_file))
    print("✅ Checkpoint downloaded!")
else:
    print("❌ Checkpoint file not found")

### Option B: Push to GitHub (requires a token)

In [None]:
# Set the GitHub token (do not hard-code it in the notebook!)
# Use Colab Secrets or an environment variable

# from google.colab import userdata
# GITHUB_TOKEN = userdata.get('GITHUB_TOKEN')

# %cd {PROJECT_ROOT}
# !git config user.name "Your Name"
# !git config user.email "your.email@example.com"
# !git add checkpoints/download_progress.parquet
# !git commit -m "feat: complete full-scale download from Colab"
# !git push https://{GITHUB_TOKEN}@github.com/exoplanet-spaceapps/exoplanet-starter.git main

print("⚠️ Uncomment and configure the above code to push to GitHub")

### Option C: Compress and download part of the data

In [None]:
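from google.colab import files  # Imported here so this cell also runs without Option A

# Compress the first 100 downloaded files (example)
!cd {LIGHTCURVE_DIR} && tar -czf sample_lightcurves_100.tar.gz $(ls *.h5 | head -100)

# Download the archive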
sample_archive = LIGHTCURVE_DIR / 'sample_lightcurves_100.tar.gz'
if sample_archive.exists():
    files.download(str(sample_archive))
    print("✅ Sample archive downloaded!")
else:
    print("❌ Archive creation failed")
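
The shell one-liner above depends on `ls` output and does not behave well when no `.h5` files match. As an alternative sketch, the same archive can be built with the standard `tarfile` module. `archive_first_n` is a hypothetical helper, and the demo uses placeholder files rather than real light curves.

```python
import tarfile
import tempfile
from pathlib import Path

def archive_first_n(src_dir, archive_path, n=100, pattern="*.h5"):
    """Add the first n matching files (sorted by name) to a gzip tar archive."""
    selected = sorted(Path(src_dir).glob(pattern))[:n]
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in selected:
            tar.add(path, arcname=path.name)  # Store without the directory prefix
    return [p.name for p in selected]

# Demo with placeholder files standing in for downloaded light curves
with tempfile.TemporaryDirectory() as d:
    for i in range(5):
        (Path(d) / f"SAMPLE_{i:06d}_TIC{i}.h5").write_bytes(b"x")
    names = archive_first_n(d, Path(d) / "sample.tar.gz", n=3)
    print(names)
    # → ['SAMPLE_000000_TIC0.h5', 'SAMPLE_000001_TIC1.h5', 'SAMPLE_000002_TIC2.h5']
```

In the notebook, `src_dir` would be `LIGHTCURVE_DIR`; an empty directory simply yields an empty archive and an empty list.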

---

## 📝 Next steps

Once the download has finished, run feature extraction:

```python
!python scripts/test_features.py
```

Alternatively, use the corresponding Colab notebook for feature extraction.

---

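## 🔧 Troubleshooting

**Problem 1: Colab disconnects**
- Fix: Re-run Cell 5. The checkpoint system resumes from where the download stopped.

**Problem 2: Not enough storage**
- Fix: Use Google Drive, or periodically download the files and delete the local copies.

**Problem 3: Downloads are too slow**
- Fix: Reduce `max_workers` or check the network connection.

**Problem 4: MAST cache errors**
- Fix: The retry mechanism handles these automatically. Some failures are normal (the expected success rate is about 57%).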
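
The retry mechanism mentioned under Problem 4 waits `2 ** attempt` seconds between attempts. A standalone sketch of that exponential backoff pattern follows. The `retry_with_backoff` helper, its `sleep` parameter, and the `flaky` function are hypothetical names introduced for the demo, which records delays instead of actually sleeping.

```python
import time

def retry_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**attempt, then try again."""
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < retries - 1:  # No wait after the final attempt
                sleep(base_delay * 2 ** attempt)
    raise last_error

# Demo: fails twice, then succeeds; delays are recorded rather than slept
delays = []
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(retry_with_backoff(flaky, sleep=delays.append), delays)  # → ok [1.0, 2.0]
```

Raising `max_retries` in `CONFIG` adds attempts at doubling delays, which trades a longer run for a chance of recovering transient errors.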