# Notebook 3: AI-Empowered Data Browsing

**Session 2: The AI-Empowered Coder**  
*Generative AI for Scholarship — Harvard HDSI & FAS*

© 2026 President and Fellows of Harvard College. Licensed under CC BY-NC 4.0.

---

## What This Notebook Covers

This notebook shows you how to load a **real scientific dataset** from
Google Drive and use **Gemini AI** to explore it interactively.

We'll work with photometry from the **Sloan Digital Sky Survey (SDSS)** —
a catalog of brightness measurements for thousands of galaxies observed
through five optical filters (u, g, r, i, z).

The notebook has five steps:

1. **Mount** Google Drive and load the CSV file into a Pandas DataFrame
2. **Inspect** the data: columns, types, summary statistics
3. **Visualize** the data with plots
4. **Use AI** (the magic wand) to ask questions about the data and generate analysis code
5. **Use Gemini chat** to write a complete analysis cell

### Before You Start

Make sure you have uploaded **`sdss_photometry.csv`** to the **Colab Notebooks**
folder in your Google Drive.

---

## Step 1: Mount Google Drive and Load the Data

We need to mount Google Drive so Colab can access the CSV file,
then read it into a Pandas DataFrame.

In [None]:
# ============================================================
# Mount Google Drive.
# A pop-up will ask you to authorize access — click Allow.
# ============================================================

from google.colab import drive

drive.mount('/content/drive')

print("Google Drive is now mounted.")

In [None]:
# ============================================================
# Load the SDSS photometry data from Google Drive.
#
# The file should be in your "Colab Notebooks" folder,
# alongside the notebook files.
# ============================================================

import pandas as pd
import numpy as np
import os

drive_folder = '/content/drive/MyDrive/Colab Notebooks'
data_path = os.path.join(drive_folder, 'sdss_photometry.csv')

if os.path.exists(data_path):
    df = pd.read_csv(data_path)
    print(f"Loaded {len(df)} rows and {len(df.columns)} columns.")
else:
    # Stop here with a clear message. Otherwise `df` is never created and
    # later cells fail with a confusing "NameError: name 'df' is not defined".
    raise FileNotFoundError(
        f"NOT FOUND: {data_path}\n"
        "Please upload sdss_photometry.csv to your 'Colab Notebooks' folder in Google Drive.\n"
        "Go to drive.google.com, open 'Colab Notebooks', and upload the file there."
    )

---

## Step 2: Inspect the Data

Before doing any analysis, we need to understand what's in the dataset.
Let's look at the first few rows, the column names, and data types.

In [None]:
# ============================================================
# Display the first 10 rows of the DataFrame.
# ============================================================

df.head(10)

In [None]:
# ============================================================
# Column names, data types, and basic info.
# ============================================================

print("Column names and data types:")
print("=" * 40)
print(df.dtypes)
print(f"\nShape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

### What Do the Columns Mean?

| Column | Description |
|--------|-------------|
| `ra` | Right Ascension (degrees) — position on the sky (like longitude) |
| `dec` | Declination (degrees) — position on the sky (like latitude) |
| `dered_u` | Dereddened magnitude in the u band (ultraviolet, 354 nm) |
| `dered_g` | Dereddened magnitude in the g band (green, 477 nm) |
| `dered_r` | Dereddened magnitude in the r band (red, 623 nm) |
| `dered_i` | Dereddened magnitude in the i band (near-infrared, 763 nm) |
| `dered_z` | Dereddened magnitude in the z band (near-infrared, 913 nm) |
| `z` | Spectroscopic redshift, a proxy for how far away the galaxy is (not the z filter band, which is `dered_z`) |
| `plate` | SDSS spectrograph plate number |
| `fiberID` | Fiber number on the plate |
| `mjd` | Modified Julian Date of the observation |

**Note on magnitudes:** In astronomy, magnitudes are a logarithmic brightness
scale where **smaller numbers = brighter objects**. A magnitude difference of
5 corresponds to a factor of 100 in brightness. "Dereddened" means the
magnitudes have been corrected for dust absorption in our own galaxy.
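
The factor-of-100 rule can be checked numerically. The sketch below assumes the standard magnitude relation, in which the flux ratio between two objects is 10 raised to 0.4 times their magnitude difference. The function name `flux_ratio` is ours for illustration and is not part of any library.

```python
def flux_ratio(mag_faint, mag_bright):
    """Return how many times brighter the second object is than the first.

    Assumes the standard relation
    F_bright / F_faint = 10 ** (0.4 * (mag_faint - mag_bright)).
    """
    return 10 ** (0.4 * (mag_faint - mag_bright))


# A 5-magnitude difference is a factor of 100 in brightness.
print(flux_ratio(20.0, 15.0))

# A 1-magnitude difference is a factor of roughly 2.5.
print(flux_ratio(16.0, 15.0))
```

Because the scale is logarithmic, subtracting two magnitudes (as in the g–r color used later) is equivalent to taking a ratio of fluxes.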

In [None]:
# ============================================================
# Summary statistics for all columns.
#
# .describe() gives count, mean, std, min, quartiles, and max.
# .T transposes the table so columns become rows — easier to
# read when there are many columns.
# ============================================================

df.describe().T
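
A suspicious `min` or `count` in the summary table often points to missing or placeholder values. The sketch below shows one way to count them, using a small made-up table in place of `df`. It assumes that bad measurements might appear either as NaN or as a sentinel such as -9999, a placeholder used in some SDSS tables. Whether this particular file contains any is not something the notebook states, so treat the sentinel value as an assumption to verify.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "dered_g": [18.2, 17.9, -9999.0, 19.1],
    "dered_r": [17.5, np.nan, 17.0, 18.4],
})

SENTINEL = -9999.0  # assumed placeholder value; check your own file

# Count true missing values (NaN) and sentinel placeholders per column.
n_missing = toy.isna().sum()
n_sentinel = (toy == SENTINEL).sum()

report = pd.DataFrame({"NaN": n_missing, "sentinel": n_sentinel})
print(report)
```

To run the same check on the real data, replace `toy` with `df`.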

---

## Step 3: Visualize the Data

We'll make three plots: a color–magnitude diagram, a redshift histogram,
and a map of sky positions.

In [None]:
# ============================================================
# Color–magnitude diagram: g–r color vs. r-band magnitude.
#
# This is a standard plot in observational astronomy.
# The "color" (g–r) tells us about a galaxy's stellar
# population: red galaxies have older stars, blue galaxies
# have younger, actively star-forming populations.
# ============================================================

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

color_gr = df['dered_g'] - df['dered_r']

ax.scatter(color_gr, df['dered_r'], s=2, alpha=0.3, color='#A51C30')
ax.set_xlabel('g – r color (mag)')
ax.set_ylabel('r magnitude (brighter →)')
ax.set_title('Color–Magnitude Diagram (SDSS Galaxies)')
ax.invert_yaxis()   # Brighter objects (smaller magnitudes) at top
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# ============================================================
# Redshift distribution.
#
# Redshift is a measure of distance — higher redshift means
# the galaxy is farther away. This histogram shows how the
# galaxies in our sample are distributed in distance.
# ============================================================

fig, ax = plt.subplots(figsize=(10, 4))

ax.hist(df['z'], bins=50, color='#A51C30', alpha=0.7, edgecolor='white')
ax.set_xlabel('Redshift (z)')
ax.set_ylabel('Number of galaxies')
ax.set_title('Redshift Distribution')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
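
For small redshifts, the link between redshift and distance can be made concrete with Hubble's law, d ≈ c·z / H0. The sketch below is a low-redshift approximation only (z well below 1) and assumes a Hubble constant of 70 km/s/Mpc. A proper distance requires a full cosmological model. The name `hubble_distance_mpc` is ours for illustration.

```python
C_KM_S = 299_792.458   # speed of light in km/s
H0 = 70.0              # assumed Hubble constant in km/s/Mpc


def hubble_distance_mpc(z, h0=H0):
    """Approximate distance in megaparsecs for a small redshift z."""
    return C_KM_S * z / h0


for z in (0.01, 0.05, 0.1):
    print(f"z = {z:<5} ->  about {hubble_distance_mpc(z):.0f} Mpc")
```

This is only a rough guide for reading the histogram's x-axis in terms of distance.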

In [None]:
# ============================================================
# Sky positions of the galaxies.
#
# RA (right ascension) and Dec (declination) are coordinates
# on the sky, analogous to longitude and latitude on Earth.
# This shows which parts of the sky SDSS observed.
# ============================================================

fig, ax = plt.subplots(figsize=(10, 6))

scatter = ax.scatter(df['ra'], df['dec'], s=2, alpha=0.3,
                     c=df['z'], cmap='viridis')
ax.set_xlabel('Right Ascension (degrees)')
ax.set_ylabel('Declination (degrees)')
ax.set_title('Sky Positions of SDSS Galaxies (colored by redshift)')
plt.colorbar(scatter, ax=ax, label='Redshift (z)')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Step 4: Use AI to Explore the Data

Now it's your turn. Use the **magic wand** on the empty code cells below
to ask Gemini to generate analysis code. Here are some prompts to try:

### Prompt Ideas

- **"Make a scatter plot of u–g color vs. g–r color, colored by redshift"**
- **"Find the 10 brightest galaxies in the r band and display them as a table"**
- **"Calculate the mean magnitude in each band and plot them as a spectrum"**
- **"Make a histogram of g–r color and fit a Gaussian to each peak"**
- **"Check for duplicate observations by looking at plate, fiberID, and mjd"**
- **"Plot r-band magnitude vs. redshift. Is there a selection effect?"**

Remember: the magic wand **can see the code** in the cell above it and
knows the DataFrame is called `df`. Be specific in your prompts.
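
When you review what Gemini produces, it helps to know roughly what a reasonable answer looks like. As one hedged illustration for the duplicate-check prompt, the sketch below uses a small made-up table in place of `df`. The code Gemini actually generates may look different.

```python
import pandas as pd

# Made-up rows standing in for the real data; only the ID columns matter here.
toy = pd.DataFrame({
    "plate":   [266, 266, 267, 266],
    "fiberID": [1, 2, 1, 1],
    "mjd":     [51602, 51602, 51608, 51602],
    "dered_r": [17.5, 18.1, 16.9, 17.5],
})

id_cols = ["plate", "fiberID", "mjd"]

# keep=False marks every member of a duplicated group, not just the repeats.
is_dup = toy.duplicated(subset=id_cols, keep=False)

print(f"{is_dup.sum()} rows share a (plate, fiberID, mjd) combination")
print(toy[is_dup].sort_values(id_cols))
```

If Gemini's version reports a different count on the real data, check which `keep` setting it used before concluding that either result is wrong.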

---

## Step 5: Let Gemini Write a Complete Analysis Cell

For this exercise, use the **Gemini chat** — not the magic wand.
Click the **blue Gemini star icon** at the bottom of the Colab window
to open the chat panel.

Copy and paste this prompt into the Gemini chat:

```
Give me a cell that generates a 3d scatter plot of locations of objects. Use redshift as radius and RA and DEC as angles on the sphere. I want an interactive plot that lets me rotate it around.
```

Gemini will generate a code cell. Review the code, run it, and see
if you get an interactive 3D visualization of the galaxy positions.
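
To review the generated code, it helps to know the coordinate conversion such a plot depends on. The sketch below assumes the usual spherical-to-Cartesian convention, with redshift as the radius, RA as the azimuthal angle, and Dec as the elevation above the equatorial plane. The function name `sky_to_cartesian` is ours for illustration, and Gemini's code may use a different axis convention or plotting library.

```python
import math


def sky_to_cartesian(ra_deg, dec_deg, radius):
    """Convert RA/Dec in degrees plus a radius into (x, y, z) coordinates."""
    ra = math.radians(ra_deg)
    dec = math.radians(dec_deg)
    x = radius * math.cos(dec) * math.cos(ra)
    y = radius * math.cos(dec) * math.sin(ra)
    z = radius * math.sin(dec)
    return x, y, z


# An object at RA = 90 deg, Dec = 0 deg, redshift 0.1 lies on the +y axis.
print(sky_to_cartesian(90.0, 0.0, 0.1))

# An object at Dec = 90 deg lies on the +z axis regardless of RA.
print(sky_to_cartesian(45.0, 90.0, 0.2))
```

Two things are worth checking in the generated cell: that angles are converted from degrees to radians before calling trigonometric functions, and that Dec is treated as an elevation rather than as an angle measured from the pole.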