# Gaia Synthetic Light Curve Classification Tutorial

Welcome to the Gaia synthetic light curve classification tutorial! This notebook will guide you through the full workflow for preparing, transforming, and classifying synthetic Gaia light curve data. You will learn how to:

- Download and organize synthetic light curve data
- Sample and split data for training and validation
- Transform light curves into polar hexbin images
- Prepare datasets for both system type and spot detection tasks
- Train and evaluate deep learning models (ResNet, ViT) using PyTorch

Follow the cells below for step-by-step instructions and examples. All code is modular and importable for easy reuse.

The first step is to create the training and validation datasets. The light curves are stored in our cloud storage and need to be downloaded to a local machine.

**Download Link:** [Gaia synthetic light curve data](https://u.pcloud.link/publink/show?code=kZMm285Zoy7Q3IAQOakIshhv4jTeH8OAtS4y#folder=25637440779)

Download the following zip file into the `data` folder:

*   `synthetic_gaia.zip`

Unzip it inside the `data` folder.

In the new `synthetic_gaia` folder there are:
*   `detached_nospot_gaia.csv`
*   `detached_spot_gaia.csv`
*   `overcontact_nospot_gaia.csv`
*   `overcontact_spot_gaia.csv`

These are synthetic data, from which we will create the training and validation samples.
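
If you prefer to unzip the archive from Python rather than by hand, the following minimal sketch uses only the standard library. The helper name `extract_archive` is hypothetical, and the example path assumes the archive was saved as `../data/synthetic_gaia.zip` relative to the notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the CSV paths found."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Search recursively, because the archive may contain a top-level folder
    return sorted(target.rglob("*.csv"))


# Example call (assumed location of the downloaded archive):
# csv_paths = extract_archive("../data/synthetic_gaia.zip", "../data")
```

The function returns the extracted CSV paths, so you can check that all four files listed above are present.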

### 1. Create directories for train and validation datasets

In [None]:
import os

output_root = "../data/gaia_dataset"
for split in ["train", "val"]:
    split_dir = os.path.join(output_root, split)
    os.makedirs(split_dir, exist_ok=True)
    print(f"Created directory: {split_dir}")
# Now you have ../data/gaia_dataset/train and ../data/gaia_dataset/val

Created directory: ../data/gaia_dataset/train
Created directory: ../data/gaia_dataset/val


### 2. Select train and validation samples for each class
You do not have to unzip the files; we will work with them as they are.


In [None]:
import zipfile
import pandas as pd
import numpy as np

# Map each CSV file to its (system type, spot type) class
csv_files = {
    "detached_nospot_gaia.csv": ("detached", "nospot"),
    "detached_spot_gaia.csv": ("detached", "spot"),
    "overcontact_nospot_gaia.csv": ("overcontact", "nospot"),
    "overcontact_spot_gaia.csv": ("overcontact", "spot"),
}
data_dir = "../data/gaia_synthetic_data"
sample_size = 1000  # Set the number of samples per group (adjust as needed)
indexes = {}
val_fraction = 0.1  # 10% for validation

for csv_name, (system_type, spot_type) in csv_files.items():
    csv_path = os.path.join(data_dir, csv_name)
    df = pd.read_csv(csv_path)
    n_val = int(sample_size * val_fraction)
    n_train = sample_size
    # Randomly sample without replacement from all available indices
    all_idx = np.random.choice(df.index, n_train + n_val, replace=False)
    val_idx = np.random.choice(all_idx, n_val, replace=False)
    train_idx = np.setdiff1d(all_idx, val_idx)
    indexes[(system_type, spot_type)] = {
        'train': train_idx.tolist(),
        'val': val_idx.tolist()
    }
    print(f"Selected {len(train_idx)} train and {len(val_idx)} val samples for {system_type} {spot_type}.")
# Now 'indexes' contains all the train/val indices for each class, ready for further use


Selected 1000 train and 100 val samples for detached nospot.
Selected 1000 train and 100 val samples for detached spot.
Selected 1000 train and 100 val samples for overcontact nospot.
Selected 1000 train and 100 val samples for overcontact spot.
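
The sampling above is not seeded, so every run selects different rows. If you need a reproducible split, a minimal sketch with a seeded NumPy generator could look like this. The function name `split_indices` and the seed value are illustrative, and the sketch assumes the DataFrame has a default integer index running from 0 to `n_rows - 1`.

```python
import numpy as np


def split_indices(n_rows, n_train, n_val, seed=42):
    """Draw disjoint train/val row positions from range(n_rows), reproducibly."""
    rng = np.random.default_rng(seed)
    # One draw without replacement guarantees that train and val cannot overlap
    chosen = rng.choice(n_rows, size=n_train + n_val, replace=False)
    return np.sort(chosen[:n_train]), np.sort(chosen[n_train:])


train_idx, val_idx = split_indices(n_rows=5000, n_train=1000, n_val=100)
print(len(train_idx), len(val_idx), len(np.intersect1d(train_idx, val_idx)))
# 1000 100 0
```

Calling the function again with the same seed returns the same rows, which makes later image generation and model comparisons repeatable.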


### 3. Generate Polar Hexbin Images from Light Curve DataFrames
This section imports the functions that generate polar hexbin images from light curve DataFrames. The noise, outlier, and random point removal settings depend on the passband (`'gaia'`, `'tess'`, or `'ogle'`).

In [None]:
import sys
sys.path.append('../scripts')  # Add the scripts directory to the Python path
from make_polar_hexbin_images import create_polar_hexbin, create_images_from_dataframe
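
To give an intuition for the transformation, here is a minimal sketch of the coordinate mapping behind a polar plot of a phased light curve. It is not the code in `make_polar_hexbin_images`: the function `to_polar_xy`, the choice of phase as angle and min-max normalised flux as radius, and the toy light curve are all assumptions made for illustration.

```python
import numpy as np


def to_polar_xy(phase, flux):
    """Map (phase, flux) pairs to Cartesian x, y via polar coordinates.

    Assumed convention: phase in [0, 1) becomes the angle, and flux,
    min-max normalised to [0, 1], becomes the radius.
    """
    phase = np.asarray(phase, dtype=float)
    flux = np.asarray(flux, dtype=float)
    theta = 2.0 * np.pi * (phase % 1.0)
    span = flux.max() - flux.min()
    radius = (flux - flux.min()) / span if span > 0 else np.zeros_like(flux)
    return radius * np.cos(theta), radius * np.sin(theta)


# Toy light curve with a single eclipse-like dip at phase 0.5
phase = np.linspace(0.0, 1.0, 200, endpoint=False)
flux = 1.0 - 0.3 * np.exp(-((phase - 0.5) ** 2) / 0.002)
x, y = to_polar_xy(phase, flux)
```

The resulting `x` and `y` arrays could then be passed to a hexagonal binning routine such as `matplotlib.pyplot.hexbin` to obtain an image.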

### 4. Generate and save polar hexbin images for all training and validation samples 
This cell will use the previously defined sample indexes and the imported functions to generate images for each class (detached, overcontact) and split, saving them in the correct output folders.

In [None]:
# Generate and save images for all splits and classes, combining spot/nospot per system type
data_dir = "../data/gaia_synthetic_data"
output_root = "../data/gaia_dataset"
passband = "gaia"  # Change if using other passbands

for system_type in ["detached", "overcontact"]:
    # Load the spot and nospot tables for this system type once
    dfs = {}
    for spot_type in ["nospot", "spot"]:
        zip_name = f"{system_type}_{spot_type}_gaia.zip"
        zip_path = os.path.join(data_dir, zip_name)
        with zipfile.ZipFile(zip_path, "r") as z:
            csv_name = [f for f in z.namelist() if f.endswith(".csv")][0]
            with z.open(csv_name) as f:
                dfs[spot_type] = pd.read_csv(f)
    for split in ["train", "val"]:
        # The sampled indexes refer to each table's own row labels, so select
        # rows per table first and only then concatenate spot and nospot.
        # (Concatenating first with ignore_index=True would renumber the rows
        # and make the spot indexes point at the wrong light curves.)
        parts = [
            dfs[spot_type].loc[indexes[(system_type, spot_type)][split]]
            for spot_type in ["nospot", "spot"]
        ]
        df_split = pd.concat(parts, ignore_index=True)
        out_dir = os.path.join(output_root, split, system_type)
        create_images_from_dataframe(df_split, out_dir, n_start=0, passband=passband)
        print(f"Saved {len(df_split)} images to {out_dir}")
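
Before training, it can be worth checking that every class folder holds the expected number of images. The sketch below counts files per split and class using only the standard library. The helper name `count_files_per_class` is hypothetical, and it counts every file regardless of extension because the image format written by `create_images_from_dataframe` is not stated here.

```python
from pathlib import Path


def count_files_per_class(dataset_root):
    """Return {(split, class_name): number_of_files} for a split/class folder layout."""
    counts = {}
    for split_dir in sorted(Path(dataset_root).iterdir()):
        if not split_dir.is_dir():
            continue
        for class_dir in sorted(split_dir.iterdir()):
            if class_dir.is_dir():
                counts[(split_dir.name, class_dir.name)] = sum(
                    1 for p in class_dir.iterdir() if p.is_file()
                )
    return counts


# Example call on the dataset created above:
# print(count_files_per_class("../data/gaia_dataset"))
```

With `sample_size = 1000` and two source files per system type, each class should contain 2000 training and 200 validation images.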

## 5. Train ResNet50 Model

Now that the data are prepared, we can start training models.

In [None]:
import os
import sys

import torch

sys.path.append('../scripts')  # Make the training scripts importable
import model_pytorch_ResNet

data_dir = "../data/gaia_dataset"
print(os.path.abspath(data_dir))

# Build the train/val dataloaders from the image folders
dataloaders = model_pytorch_ResNet.prepare_training(data_dir, batch_size=32)

# Train with the model, loss, optimizer, scheduler, and epoch count defined in the module
trained_model = model_pytorch_ResNet.train_model(
    model_pytorch_ResNet.resnet,
    dataloaders,
    model_pytorch_ResNet.criterion,
    model_pytorch_ResNet.optimizer,
    model_pytorch_ResNet.scheduler,
    model_pytorch_ResNet.num_epochs
)

# Save the trained weights, then evaluate on the validation set
torch.save(trained_model.state_dict(), "../models/model_ResNet.pth")
model_pytorch_ResNet.evaluate_model(trained_model, dataloaders['val'])

cpu




/Users/wera/Max_astro/Slovakia/EBML/data/gaia_dataset
Resolved data_dir: /Users/wera/Max_astro/Slovakia/EBML/data/gaia_dataset
Checking for folder: /Users/wera/Max_astro/Slovakia/EBML/data/gaia_dataset/train (exists: True)
Checking for folder: /Users/wera/Max_astro/Slovakia/EBML/data/gaia_dataset/val (exists: True)
Subfolders in /Users/wera/Max_astro/Slovakia/EBML/data/gaia_dataset/train: ['overcontact', 'detached']
Subfolders in /Users/wera/Max_astro/Slovakia/EBML/data/gaia_dataset/val: ['overcontact', 'detached']
Epoch 1/10
----------


100%|██████████| 125/125 [14:36<00:00,  7.01s/it]


train Loss: 0.1533 Acc: 0.9407


100%|██████████| 13/13 [00:51<00:00,  3.97s/it]


val Loss: 0.1210 Acc: 0.9425
Epoch 2/10
----------


100%|██████████| 125/125 [15:17<00:00,  7.34s/it]


train Loss: 0.0977 Acc: 0.9653


100%|██████████| 13/13 [00:46<00:00,  3.55s/it]


val Loss: 0.0637 Acc: 0.9750
Epoch 3/10
----------


100%|██████████| 125/125 [12:42<00:00,  6.10s/it]


train Loss: 0.1079 Acc: 0.9627


100%|██████████| 13/13 [00:44<00:00,  3.39s/it]


val Loss: 0.3611 Acc: 0.8925
Epoch 4/10
----------


100%|██████████| 125/125 [14:23<00:00,  6.91s/it]


train Loss: 0.0854 Acc: 0.9712


100%|██████████| 13/13 [00:45<00:00,  3.51s/it]


val Loss: 0.0589 Acc: 0.9775
Epoch 5/10
----------


100%|██████████| 125/125 [17:01<00:00,  8.17s/it]


train Loss: 0.0810 Acc: 0.9710


100%|██████████| 13/13 [00:59<00:00,  4.54s/it]


val Loss: 0.1109 Acc: 0.9650
Epoch 6/10
----------


100%|██████████| 125/125 [20:20<00:00,  9.77s/it]


train Loss: 0.0808 Acc: 0.9705


100%|██████████| 13/13 [00:59<00:00,  4.58s/it]


val Loss: 0.0651 Acc: 0.9800
Epoch 7/10
----------


100%|██████████| 125/125 [19:35<00:00,  9.40s/it]


train Loss: 0.0775 Acc: 0.9718


100%|██████████| 13/13 [00:45<00:00,  3.52s/it]


val Loss: 0.0788 Acc: 0.9700
Epoch 8/10
----------


100%|██████████| 125/125 [12:38<00:00,  6.07s/it]


train Loss: 0.0525 Acc: 0.9800


100%|██████████| 13/13 [00:46<00:00,  3.54s/it]


val Loss: 0.0693 Acc: 0.9750
Epoch 9/10
----------


100%|██████████| 125/125 [14:23<00:00,  6.91s/it]


train Loss: 0.0367 Acc: 0.9882


100%|██████████| 13/13 [00:52<00:00,  4.07s/it]


val Loss: 0.0523 Acc: 0.9775
Epoch 10/10
----------


100%|██████████| 125/125 [16:19<00:00,  7.83s/it]


train Loss: 0.0351 Acc: 0.9850


100%|██████████| 13/13 [00:52<00:00,  4.01s/it]



val Loss: 0.0645 Acc: 0.9750
Validation Accuracy: 0.9725
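
The validation metrics fluctuate between epochs (for example, accuracy drops to 0.8925 in epoch 3), so the last epoch is not necessarily the best one. The sketch below parses a log in the format printed above and reports the epoch with the lowest validation loss. The function `best_val_epoch` is a hypothetical helper, and whether `train_model` already keeps the best weights internally is not shown in this notebook.

```python
import re

# Excerpt of the training log above (epochs 2 to 4)
LOG = """
Epoch 2/10
train Loss: 0.0977 Acc: 0.9653
val Loss: 0.0637 Acc: 0.9750
Epoch 3/10
train Loss: 0.1079 Acc: 0.9627
val Loss: 0.3611 Acc: 0.8925
Epoch 4/10
train Loss: 0.0854 Acc: 0.9712
val Loss: 0.0589 Acc: 0.9775
"""


def best_val_epoch(log_text):
    """Return (epoch, val_loss, val_acc) for the epoch with the lowest validation loss."""
    epoch, best = None, None
    for raw in log_text.splitlines():
        line = raw.strip()
        header = re.match(r"Epoch (\d+)/\d+", line)
        if header:
            epoch = int(header.group(1))
            continue
        val = re.match(r"val Loss: ([\d.]+) Acc: ([\d.]+)", line)
        if val and epoch is not None:
            loss, acc = float(val.group(1)), float(val.group(2))
            if best is None or loss < best[1]:
                best = (epoch, loss, acc)
    return best


print(best_val_epoch(LOG))  # (4, 0.0589, 0.9775)
```

Applied to the full ten-epoch log above, the same rule would select epoch 9, which has the lowest validation loss (0.0523).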


## 6. Train ViT Model 
You can use the next code cell to train a ViT model.

In [None]:
import os
import sys

import torch

sys.path.append('../scripts')  # Make the training scripts importable
import model_pytorch_ViT

data_dir = "../data/gaia_dataset"
print(os.path.abspath(data_dir))

# Build the train/val dataloaders from the image folders
dataloaders = model_pytorch_ViT.prepare_training(data_dir, batch_size=32)

# Create a two-class ViT model together with its loss, optimizer, and scheduler
model_vit = model_pytorch_ViT.create_vit_model(num_classes=2)
criterion, optimizer, scheduler = model_pytorch_ViT.get_loss_optimizer_scheduler(model_vit)

trained_model = model_pytorch_ViT.train_model(
    model_vit,
    dataloaders,
    criterion,
    optimizer,
    scheduler,
    num_epochs=5
)

# Save the trained weights, then evaluate on the validation set
torch.save(trained_model.state_dict(), "../models/model_ViT.pth")
model_pytorch_ViT.evaluate_model(trained_model, dataloaders['val'])



/Users/wera/Max_astro/Slovakia/EBML/data/gaia_dataset
Using device: cpu
Loading images from: ../data/gaia_dataset/train
Loaded 4000 images for train
Loading images from: ../data/gaia_dataset/val
Loaded 400 images for val
Epoch 1/5
----------


train phase: 100%|██████████| 125/125 [21:29<00:00, 10.32s/it]



train Loss: 0.5653 Acc: 0.7113


val phase: 100%|██████████| 13/13 [00:55<00:00,  4.28s/it]


val Loss: 0.1674 Acc: 0.9450
Epoch 2/5
----------


train phase: 100%|██████████| 125/125 [18:58<00:00,  9.11s/it]


train Loss: 0.1563 Acc: 0.9425


val phase: 100%|██████████| 13/13 [00:53<00:00,  4.14s/it]


val Loss: 0.1122 Acc: 0.9525
Epoch 3/5
----------


train phase: 100%|██████████| 125/125 [24:24<00:00, 11.72s/it]


train Loss: 0.0933 Acc: 0.9655


val phase: 100%|██████████| 13/13 [01:24<00:00,  6.47s/it]


val Loss: 0.0523 Acc: 0.9850
Epoch 4/5
----------


train phase: 100%|██████████| 125/125 [26:09<00:00, 12.56s/it]


train Loss: 0.0689 Acc: 0.9720


val phase: 100%|██████████| 13/13 [01:29<00:00,  6.91s/it]


val Loss: 0.0860 Acc: 0.9600
Epoch 5/5
----------


train phase: 100%|██████████| 125/125 [24:58<00:00, 11.99s/it]


train Loss: 0.0533 Acc: 0.9808


val phase: 100%|██████████| 13/13 [00:54<00:00,  4.16s/it]



val Loss: 0.0674 Acc: 0.9800
Validation Accuracy: 0.9800
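
The progress bars above show 125 training steps and 13 validation steps per epoch. These follow from the dataset sizes and the batch size, as this small worked example shows. It assumes the dataloaders keep the final, partially filled batch, which is the default behaviour of PyTorch's `DataLoader` (`drop_last=False`). The helper name `batches_per_epoch` is illustrative.

```python
import math


def batches_per_epoch(n_images, batch_size, drop_last=False):
    """Number of batches needed to iterate once over n_images."""
    if drop_last:
        return n_images // batch_size
    return math.ceil(n_images / batch_size)


print(batches_per_epoch(4000, 32))  # 125 training batches
print(batches_per_epoch(400, 32))   # 13 validation batches (the last one holds 16 images)
```

Multiplying the step count by the seconds per iteration shown in the progress bar gives a rough estimate of the time per epoch on your machine.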




## 7. Spot Detection Dataset Preparation (Detached Systems)

In this section, we will prepare a dataset specifically for spot detection in detached binary systems. We will:
- Use the previously sampled indexes for detached systems with and without spots.
- Generate polar hexbin images for each sample using the provided transformation function.
- Save the images into the new directory structure: `../data/gaia_spot_dataset/train/{spot,nospot}` and `../data/gaia_spot_dataset/val/{spot,nospot}`.

This dataset will allow you to train a classifier to distinguish between detached systems with and without spots.

### 8. Create directories for spot detection dataset (detached systems)

In [None]:
import os

spot_dataset_root = "../data/gaia_spot_dataset"
for split in ["train", "val"]:
    for spot_class in ["spot", "nospot"]:
        split_dir = os.path.join(spot_dataset_root, split, spot_class)
        os.makedirs(split_dir, exist_ok=True)
        print(f"Created directory: {split_dir}")
# Now you have ../data/gaia_spot_dataset/train/{spot,nospot} and ../data/gaia_spot_dataset/val/{spot,nospot}

Created directory: ../data/gaia_spot_dataset/train/spot
Created directory: ../data/gaia_spot_dataset/train/nospot
Created directory: ../data/gaia_spot_dataset/val/spot
Created directory: ../data/gaia_spot_dataset/val/nospot


### 9. Generate and save images for spot detection (detached systems only)

In [None]:
from collections import defaultdict
import zipfile
import pandas as pd
import os

# Use the same indexes dictionary as before
spot_dataset_root = "../data/gaia_spot_dataset"
data_dir = "../data/gaia_synthetic_data"
passband = "gaia"

# Load detached spot and nospot dataframes
dfs = {}
for spot_type in ["nospot", "spot"]:
    zip_name = f"detached_{spot_type}_gaia.zip"
    zip_path = os.path.join(data_dir, zip_name)
    with zipfile.ZipFile(zip_path, "r") as z:
        csv_name = [f for f in z.namelist() if f.endswith(".csv")][0]
        with z.open(csv_name) as f:
            df = pd.read_csv(f)
        dfs[spot_type] = df

# For each split, generate images for spot/nospot
for split in ["train", "val"]:
    for spot_type in ["nospot", "spot"]:
        idxs = indexes[("detached", spot_type)][split]
        df_split = dfs[spot_type].loc[idxs]
        out_dir = os.path.join(spot_dataset_root, split, spot_type)
        create_images_from_dataframe(df_split, out_dir, n_start=0, passband=passband)
        print(f"Saved {len(df_split)} images to {out_dir}")

Saved 1000 images to ../data/gaia_spot_dataset/train/nospot
Saved 1000 images to ../data/gaia_spot_dataset/train/spot
Saved 100 images to ../data/gaia_spot_dataset/val/nospot
Saved 100 images to ../data/gaia_spot_dataset/val/spot


You can now reuse the Train ResNet or Train ViT code cells for this dataset.
Just change `data_dir` and the name of the `.pth` file in which you want to save your model.
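
One way to keep these two settings together is a small configuration table, sketched below. The `RunConfig` class, the `RUNS` dictionary, and the checkpoint name `model_ResNet_spot.pth` are hypothetical and only illustrate the idea; the data directories are the ones created in this tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    data_dir: str         # root folder containing train/ and val/
    checkpoint_path: str  # where the trained weights are saved


RUNS = {
    "system_type": RunConfig("../data/gaia_dataset", "../models/model_ResNet.pth"),
    # Hypothetical checkpoint name for the spot detection model
    "spot_detection": RunConfig("../data/gaia_spot_dataset", "../models/model_ResNet_spot.pth"),
}

cfg = RUNS["spot_detection"]
print(cfg.data_dir)         # ../data/gaia_spot_dataset
print(cfg.checkpoint_path)  # ../models/model_ResNet_spot.pth
```

In the training cell, `cfg.data_dir` would replace `data_dir`, and `cfg.checkpoint_path` would replace the path passed to `torch.save`.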

By preparing the datasets as shown in this tutorial, you can train models for the other passbands available in our cloud dataset.