# Gaia Fundamentals: What Gaia Measures 

## What You Will Learn in This Notebook

By the end of this notebook, you will be able to:

- Explain what Gaia measures and what it does *not* measure
- Interpret the most important Gaia columns (ra/dec, parallax, proper motion, photometry)
- Understand measurement uncertainty and signal-to-noise
- Apply *basic* quality cuts without throwing away half the Galaxy
- Build a clean dataframe that is ready for an HR diagram or cluster membership work later


## What Gaia Is

Gaia is a space mission that measures:

- Positions on the sky (RA, Dec)
- Parallax (a geometric distance indicator)
- Proper motion (apparent motion across the sky)
- Brightness in multiple bands (G, BP, RP)
- For some stars: radial velocity and astrophysical parameters

Gaia does not directly measure "distance."
It measures *parallax*, and distance must be inferred carefully.

## Imports

We will work with data as a Pandas DataFrame.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Loading Data (Online OR Cached)

At home, you can query Gaia directly.
At school (blocked internet), you should load a cached CSV file.

This notebook supports both.

In [None]:
from astroquery.gaia import Gaia
import os

query = """
SELECT TOP 5000
     source_id,
     ra, dec,
     parallax, parallax_error,
     pmra, pmra_error,
     pmdec, pmdec_error,
     phot_g_mean_mag,
     phot_bp_mean_mag,
     phot_rp_mean_mag,
     ruwe
FROM gaiadr3.gaia_source
WHERE parallax IS NOT NULL
"""

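cache_path = "gaia_sample_cache.csv"

if os.path.exists(cache_path):
    # Offline (or repeat run): reuse the cached copy instead of querying again
    df = pd.read_csv(cache_path)
else:
    # Online: query the Gaia archive, then cache the result for later use
    job = Gaia.launch_job_async(query)
    results = job.get_results()
    df = results.to_pandas()
    df.to_csv(cache_path, index=False)
df.head()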

In [None]:
# If you already saved a CSV from Notebook 01 at home, load it here.
# Example file name: "gaia_query_results.csv"

#csv_path = "gaia_query_results.csv"  # change if needed
#df = pd.read_csv(csv_path)
#df.head()

## The Core Gaia Columns We Will Use

- ra, dec: sky position (degrees)
- parallax, parallax_error: parallax in milliarcseconds (mas) and its uncertainty
- pmra, pmdec (and errors): proper motion (mas/yr)
- phot_g_mean_mag: Gaia G-band magnitude
- phot_bp_mean_mag, phot_rp_mean_mag: BP and RP magnitudes (for color)
- ruwe: a fit-quality indicator (useful, but not a magic "good/bad" label)

In [None]:
print(df.columns)

## Units (Know These or Your Poster Is Toast)

- Parallax: milliarcseconds (mas)
- Proper motion: mas/year
- Magnitudes: unitless (log brightness scale), smaller = brighter
- RA/Dec: degrees

A DataFrame does not carry units, so we will keep track of them ourselves.
You must still label every plot axis with its units.

## Parallax vs Distance (The #1 Student Mistake)

A rough distance estimate is:

distance (pc) ≈ 1000 / parallax(mas)

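This is only trustworthy when the relative parallax uncertainty (parallax_error / parallax) is small.
Inverting a noisy parallax stretches the error unevenly, so the estimate becomes biased and unreliable, and it is meaningless for zero or negative parallaxes (which Gaia does report).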
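
To make the asymmetry concrete, here is a minimal sketch that inverts parallax - error, parallax, and parallax + error for a made-up star. The helper name `distance_interval_pc` and the numbers are illustrative assumptions, not Gaia values, and this is not a proper distance inference.

```python
def distance_interval_pc(parallax_mas, parallax_error_mas):
    """Naive 1000/parallax distances (pc) for parallax +error, 0, -error.

    Only valid when parallax_mas > parallax_error_mas > 0.
    """
    near = 1000.0 / (parallax_mas + parallax_error_mas)
    middle = 1000.0 / parallax_mas
    far = 1000.0 / (parallax_mas - parallax_error_mas)
    return near, middle, far


# Hypothetical star with parallax = 2 mas and two different uncertainties
for parallax_error_mas in (0.1, 1.0):
    near, middle, far = distance_interval_pc(2.0, parallax_error_mas)
    snr = 2.0 / parallax_error_mas
    print(f"S/N = {snr:4.1f}: {near:6.1f} pc < {middle:6.1f} pc < {far:6.1f} pc")
```

At S/N = 20 the interval is narrow and nearly symmetric. At S/N = 2 the far side runs out to twice the central estimate, while the near side only drops by a third.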

In [None]:

# Work on a copy so the original query result stays untouched
df = df.copy()

# Parallax signal-to-noise
df["parallax_over_error"] = df["parallax"] / df["parallax_error"]

# Distance proxy (pc) - only meaningful for decent parallaxes.
# Zero or negative parallaxes become NaN instead of infinite or negative distances.
df["dist_pc_proxy"] = 1000.0 / df["parallax"].where(df["parallax"] > 0)

df[["parallax", "parallax_error", "parallax_over_error", "dist_pc_proxy"]].head()

## When NOT to Trust Distance

If parallax_over_error is small, distance becomes unreliable.

A common (intro-level) rule:
- parallax_over_error > 10 → usually decent
- 5 to 10 → usable for some plots, be cautious
- < 5 → risky; expect weirdness
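
One way to apply this rule to a whole column is `pd.cut`. The sketch below uses made-up S/N values, and the tier labels are hypothetical names chosen for illustration. With the default `right=True`, a value of exactly 5 lands in the lowest tier and exactly 10 in the middle tier.

```python
import numpy as np
import pandas as pd

# Made-up S/N values, for illustration only
snr = pd.Series([25.0, 7.5, 3.0, -1.2, 12.0], name="parallax_over_error")

# Bin edges follow the intro-level rule; the labels are hypothetical names
tiers = pd.cut(
    snr,
    bins=[-np.inf, 5, 10, np.inf],
    labels=["risky", "cautious", "decent"],
)

# Count how many stars fall in each tier
print(tiers.value_counts())
```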

In [None]:
plt.figure()
df["parallax_over_error"].replace([np.inf, -np.inf], np.nan).dropna().hist(bins=50)
plt.xlabel("parallax_over_error")
plt.ylabel("count")
plt.title("Parallax Signal-to-Noise (S/N)")
plt.show()

## Proper Motion (pmra, pmdec)

Proper motion is how fast a star appears to move across the sky.

For cluster work later, proper motion is critical because cluster members share similar proper motion vectors.
In Gaia, pmra already includes the cos(dec) factor, so pmra and pmdec can be combined directly.
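
As a rough preview of that idea, the sketch below flags stars whose proper motion lies within a small radius of an assumed cluster motion. The data, the centre `pm_center`, and the radius `pm_radius` are all made up for illustration; real membership work needs more care than a single circular cut.

```python
import numpy as np
import pandas as pd

# Made-up proper motions (mas/yr): three stars moving together plus two field stars
pm = pd.DataFrame({
    "pmra":  [-3.1, -2.9, -3.0, 8.5, -12.0],
    "pmdec": [5.0, 5.2, 4.9, -1.0, 0.3],
})

# Hypothetical cluster proper motion and selection radius (mas/yr)
pm_center = (-3.0, 5.0)
pm_radius = 0.5

# Distance of each star from the assumed cluster motion in (pmra, pmdec) space
offset = np.hypot(pm["pmra"] - pm_center[0], pm["pmdec"] - pm_center[1])
pm["pm_candidate"] = offset < pm_radius
print(pm)
```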

In [None]:
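# Total proper motion (mas/yr)
df["pm_total"] = np.sqrt(df["pmra"]**2 + df["pmdec"]**2)
# print() is needed because only the last expression in a cell is displayed
print(df[["pmra", "pmdec", "pm_total"]].head())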

plt.figure()
plt.scatter(df["pmra"], df["pmdec"], s=5)
plt.xlabel("pmra (mas/yr)")
plt.ylabel("pmdec (mas/yr)")
plt.title("Proper Motion Space (pmra vs pmdec)")
plt.show()

## Photometry and Color (BP - RP)

Gaia provides:
- G magnitude (brightness)
- BP and RP magnitudes (blue and red photometers)

A simple color index:
BP-RP = phot_bp_mean_mag - phot_rp_mean_mag

Color helps separate hot/blue stars from cool/red stars.

In [None]:
df["bp_rp"] = df["phot_bp_mean_mag"] - df["phot_rp_mean_mag"]
df[["phot_g_mean_mag", "bp_rp"]].head()

## HR Diagram Preview (Observed, Not Absolute)

For now, we plot an observed color–magnitude diagram:
- x-axis: BP - RP (color)
- y-axis: G magnitude

Later, we will build absolute magnitude versions.
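
As a preview, the standard distance modulus combined with distance = 1000 / parallax gives M_G = G + 5 log10(parallax in mas) - 10. The sketch below applies it to made-up stars; it ignores extinction, and the column name `abs_g_proxy` is a hypothetical choice for this example.

```python
import numpy as np
import pandas as pd

# Made-up stars; the last one has a negative parallax on purpose
stars = pd.DataFrame({
    "phot_g_mean_mag": [10.0, 15.0, 12.0],
    "parallax": [10.0, 1.0, -0.5],  # mas
})

# log10 is undefined for parallax <= 0, so mask those rows to NaN first
good_parallax = stars["parallax"].where(stars["parallax"] > 0)

# Absolute G magnitude proxy (no extinction correction)
stars["abs_g_proxy"] = stars["phot_g_mean_mag"] + 5.0 * np.log10(good_parallax) - 10.0
print(stars)
```

The first two stars have very different apparent magnitudes but the same absolute magnitude proxy, because the fainter one is ten times farther away.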

In [None]:
plt.figure()
plt.scatter(df["bp_rp"], df["phot_g_mean_mag"], s=5)
plt.gca().invert_yaxis()  # magnitudes: smaller is brighter
plt.xlabel("BP - RP (mag)")
plt.ylabel("G (mag)")
plt.title("Observed Color–Magnitude Diagram (CMD)")
plt.show()

## Data Quality: RUWE (Use Carefully)

RUWE (Renormalised Unit Weight Error) measures how well Gaia's single-star astrometric model fits the observations; values near 1.0 are expected for well-behaved single stars.

Typical classroom guideline:
- RUWE < 1.4 → often okay
- RUWE > 1.4 → may indicate issues (binaries, crowding, bad fit)

RUWE is not "good vs bad."
It's a warning flag.
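
Treating RUWE as a flag rather than a filter can be as simple as adding a boolean column and counting it. The values below are made up, and `ruwe_threshold` is just the classroom guideline stored in a variable.

```python
import pandas as pd

# Made-up RUWE values, for illustration only
ruwe = pd.Series([0.95, 1.02, 1.38, 1.9, 3.4], name="ruwe")

# Classroom guideline, not a universal rule
ruwe_threshold = 1.4

# Flag the sources instead of deleting them, so they can be inspected later
flagged = ruwe > ruwe_threshold
print(f"{flagged.sum()} of {len(ruwe)} sources flagged ({flagged.mean():.0%})")
```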

In [None]:
plt.figure()
df["ruwe"].replace([np.inf, -np.inf], np.nan).dropna().hist(bins=50)
plt.xlabel("RUWE")
plt.ylabel("count")
plt.title("RUWE Distribution")
plt.show()

## Creating a Basic Clean Sample

We will create a conservative sample for plots:

- finite BP/RP/G values
- parallax_over_error > 5 (moderate threshold)
- RUWE < 1.4 (basic astrometric sanity cut)

These are not universal rules.
They are a starting point for student research.

In [None]:
# Drop missing or non-finite values needed for CMD and quality
needed = ["phot_g_mean_mag", "phot_bp_mean_mag", "phot_rp_mean_mag",
          "parallax", "parallax_error", "ruwe", "pmra", "pmdec"]
finite_rows = np.isfinite(df[needed]).all(axis=1)
clean = df[finite_rows].copy()  # copy so adding columns does not trigger pandas warnings

# Recompute derived columns after cleaning
clean["bp_rp"] = clean["phot_bp_mean_mag"] - clean["phot_rp_mean_mag"]
clean["parallax_over_error"] = clean["parallax"] / clean["parallax_error"]

# Apply simple thresholds
clean = clean[(clean["parallax_over_error"] > 5) & (clean["ruwe"] < 1.4)]

# Show how many rows survived, then preview them
print(clean.shape)
clean.head()
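
To check that the cuts are not throwing away most of the sample, it can help to count the survivors after each step. This is a minimal sketch on a made-up table; the dictionary `cuts` and its labels are illustrative names, not part of any library.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value
sample = pd.DataFrame({
    "parallax_over_error": [20.0, 8.0, 3.0, 15.0, np.nan],
    "ruwe": [1.0, 1.1, 1.0, 2.5, 1.2],
})

# Each cut is a boolean mask over the full sample
cuts = {
    "finite values": sample.notna().all(axis=1),
    "parallax_over_error > 5": sample["parallax_over_error"] > 5,
    "ruwe < 1.4": sample["ruwe"] < 1.4,
}

# Apply the cuts cumulatively and report the surviving row count each time
keep = pd.Series(True, index=sample.index)
print(f"start: {len(sample)} rows")
for name, mask in cuts.items():
    keep &= mask
    print(f"after {name}: {int(keep.sum())} rows")
```

If one cut removes far more rows than the others, that is the threshold to revisit first.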

## CMD Before vs After Cleaning

A good sanity check is to compare:
- the raw CMD (messy)
- the cleaned CMD (tighter, more interpretable)

In [None]:
plt.figure()
plt.scatter(df["bp_rp"], df["phot_g_mean_mag"], s=5)
plt.gca().invert_yaxis()
plt.xlabel("BP - RP (mag)")
plt.ylabel("G (mag)")
plt.title("CMD (Raw)")
plt.show()

plt.figure()
plt.scatter(clean["bp_rp"], clean["phot_g_mean_mag"], s=5)
plt.gca().invert_yaxis()
plt.xlabel("BP - RP (mag)")
plt.ylabel("G (mag)")
plt.title("CMD (Cleaned: parallax_over_error > 5 and RUWE < 1.4)")
plt.show()

## What You Can Claim (and What You Cannot)

You CAN claim:
- "We selected a sample with quality cuts and built a CMD."
- "The CMD shows a main sequence-like trend."
- "Our selection criteria reduce scatter, suggesting improved data quality."

You CANNOT claim:
- "We found the true distance to every star."
- "We proved the Galaxy structure."
- "RUWE < 1.4 guarantees correctness."

Poster rule: if you cannot explain it in a sentence, don't claim it.

## Exercises

1. Change the parallax_over_error threshold from 5 to 10. What changes in the CMD?
2. Remove RUWE filtering. What happens to the scatter?
3. Pick one star from the cleaned set and report:
   - source_id
   - parallax ± parallax_error
   - BP-RP color
   - G magnitude
4. In 3 sentences: explain why “distance = 1000/parallax” is risky for noisy parallaxes.

## What Comes Next

Next notebook: HR Diagram Project

You will:
- Query a targeted region (or a known cluster)
- Build better CMD/HR diagrams
- Begin selecting candidate members using kinematics and parallax

## Exit Ticket

Answer briefly:

- What does parallax_over_error represent?
- What is RUWE trying to tell you?
- Why do we invert the magnitude axis on CMD plots?