# 🐍 Welcome to the Jungle

You’ve had it easy so far.

Beautiful fake planets that orbit on schedule. Clean light curves. RV curves that begged to be fitted.  
Everything made sense. Life was good. The universe was predictable.

Well — not anymore.

Today, you're getting **real exoplanet data**. The entire confirmed catalogue from the NASA Exoplanet Archive.  
Detection methods? All of them. Units? Inconsistent. Nulls? Everywhere.  
Does the CSV load cleanly? Hell no — **they’ve embedded 15 lines of metadata at the top of the file.**

No more “just read it in with `pd.read_csv`.”  
No more “ah yes, column names that actually make sense.”

You want to plot mass vs semi-major axis? Great.  
Good luck figuring out which column is the mass, what unit it’s in, and which planets even *have* mass.  
Want to analyse detection biases? Figure out what methods are in the file.  
Want to separate transiting from RV from imaging? Learn to clean your damn data.

### Your Task:

**Do science.**

Download the data yourself from:  
🔗 [https://exoplanetarchive.ipac.caltech.edu](https://exoplanetarchive.ipac.caltech.edu)

- Find the button that says "Download Table"
- Accept the chaos
- Load it into Python (if you can)
- Start asking questions. Plot some shit. Fight your own battles.

---

### Some ideas (you will suffer):

- Plot mass vs semi-major axis  
- Compare detection methods  
- Try to separate transiting from RV planets  
- Look at radius vs temperature  
- Look at discovery year vs number of planets  
- Decide if anything you've plotted is real or just artefacts of pain  

---

Welcome to real research.  

Let the suffering begin.

## 🧟‍♂️ LEVEL 1: Load the Bloody File

Righto you soft little data wranglers, time to leave the kiddie pool.

Go to the [Exoplanet Archive](https://exoplanetarchive.ipac.caltech.edu), hit “Confirmed Planets”, and download the big juicy CSV.  
Looks innocent, doesn’t it? Bet you’re already thinking “I’ll just chuck this in pandas and crack on.”

Go on. Try it.

```python
# Load the .csv file using pandas
```


## 🪓 Didn’t work, did it?

Thought so. You sweet summer child.

That file’s got more commentary than an Ashes post-mortem.  
pandas tried to read it and said, “Nah mate, this isn’t a table, it’s a manifesto.”

So now what? You fix it.  
Open the file in a text editor. **Look** at it. Smell the chaos.  
Count how many lines of junk are at the top. Try `skiprows=...`. Maybe `comment="#"`. Maybe both.  
Fight it. Wrestle it. Bleed for it.
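If you want to see what those two options actually do before pointing them at the real thing, here's a minimal sketch on a made-up miniature file. The commentary lines and the column names (`planet`, `mass`) are invented for illustration. They are not the Archive's, and the number of junk lines in your download will be different.

```python
import io
import pandas as pd

# Toy stand-in for a CSV with commentary on top. Everything in it is invented.
raw = """# This file was produced by somebody
# COLUMN planet: Planet name
# COLUMN mass: Planet mass
planet,mass
Alpha b,1.5
Beta c,0.3
"""

# Option 1: treat every line that starts with '#' as a comment
df_comment = pd.read_csv(io.StringIO(raw), comment="#")

# Option 2: skip a fixed number of junk lines (3 in this toy file, count your own)
df_skip = pd.read_csv(io.StringIO(raw), skiprows=3)

print(df_comment.shape)
print(df_skip.columns.tolist())
```

One thing to watch: `comment="#"` also chops off the rest of any data line that happens to contain a `#`, so check your row and column counts afterwards rather than trusting it blindly.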

### Don’t ask me what went wrong - I don’t usually have a bloke on Zoom telling me what happened.

🪙 *Here’s a gold coin for when you finally LOAD the data. We're not even close to the science.*

## 🧠 LEVEL 2: What Even Is This?

Alright, you’ve loaded the file. You’ve stopped crying. Now try looking at it.

```python
df['pl_name'].value_counts()
```

Looks like there are a few entries per planet, eh?
Bit odd, innit? You thought there were around 6,000 confirmed planets — so why the bloody hell does your DataFrame have **38,000 rows**?
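If you'd like a number for how bad it is, comparing the row count with the count of distinct names is a quick check. This is only a sketch on an invented toy table: `pl_name` is the column used above, but the planets and the `mass` column are made up, and it tells you *that* there are repeats, not *why*.

```python
import pandas as pd

# Toy table with invented planets, some of them repeated
df_toy = pd.DataFrame({
    "pl_name": ["Alpha b", "Alpha b", "Alpha b", "Beta c", "Beta c", "Gamma d"],
    "mass": [1.5, 1.7, None, 0.3, 0.4, 12.0],
})

n_rows = len(df_toy)
n_planets = df_toy["pl_name"].nunique()
print(f"{n_rows} rows, {n_planets} unique planets")

# Pull out every row for one repeated name and look at them side by side
print(df_toy[df_toy["pl_name"] == "Alpha b"])
```

Staring at the rows for a single planet in your real DataFrame is probably the fastest way to spot what differs between them.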

### Try plotting something.

Mass vs semi-major axis?  
Radius vs temperature?  
Stellar mass vs planet mass?

**Go on. Do it.**

What’s that?  
- Why is every planet showing up six times?  
- Why is your hot Jupiter swarm suddenly a *hot Jupiter infestation*?  
- Why do some planets have *multiple masses*?  
- Why does your scatter plot look like it’s got chickenpox?

---

Something’s not right, is it?

Is it a bug?  
Is it a pandas thing?  
Is the universe just this messy?

You tell me.

I could tell you why it’s happening.  
I *could*.  
But I won’t.

Not yet.

🪙 *2 XP if you find the secret. Extra XP if you scream in Slack when you do.*

```python
# Try plotting planet mass vs semi-major axis
# It'll be cooked, but try it. Maybe the stars will align for you
# Spoiler: They won't
```


### Is this supposed to look like that?

Is the plot cooked? Thought so.

Don’t worry, it’s not your fault. Probably. Blame Ben Stokes.  
Or maybe you just picked the wrong column.  
Or maybe pandas betrayed you.  
Or maybe… just maybe…

There’s a column that *means* something. Maybe one that says which row is the "right" one.  
Maybe the **table on the Archive** has some answers.  
Maybe it explains what the columns mean.  
Maybe it doesn’t.  
Maybe it's all lies.

Go ahead. Take a walk. Look at the Archive again.  
Or don’t.  
Live in denial. Plot the same planet six times. It’s your journey.

🪙 *Still 2 XP if you make it out with one clean scatter plot. Bonus point if you lost all hope and then found it again.*

```python
# Did you figure it out yet?
# I hope not because I'm having a LOT of fun embracing sadism
```

## 📈 LEVEL 3: Plot Like It’s 2010

You’ve finally got one row per planet.  
You’ve slain the multi-row hydra.  
Now you get to do what every undergrad dreams of:

**Make a scatter plot.**

Pick two columns. Any two. Plot them.

Suggested chaos pairings:
- Planet Mass (what units are those again, I don't know) vs Semi-major axis (Surely the units here can't be km)
- Planet Radius vs Eq Temperature (Does the table have equilibrium temperature? Who knows)
- Year of discovery vs Stellar Mass
- Distance to system vs *literally anything*

---

### Questions to ask yourself while you plot:
- Why do some planets have no masses?
- Why are there random vertical or horizontal lines?
- Why are most points in weird clumps?
- Why does my plot look like it was sneezed on?

---

Plot it. Label it. Swear at it.
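For the mechanics only, here's a rough sketch of a labelled scatter plot with matplotlib. The numbers are random fakes, not planets, and the log axes are an assumption on my part: they tend to help when values span several orders of magnitude, but whether your columns do is for you to check.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Fake values spread log-uniformly over a few orders of magnitude. Not real planets.
separation = 10 ** rng.uniform(-2, 1, 200)
mass = 10 ** rng.uniform(-1, 3.5, 200)

fig, ax = plt.subplots()
ax.scatter(separation, mass, s=10, alpha=0.6)

# Log axes stop the small values piling up in one corner
ax.set_xscale("log")
ax.set_yscale("log")

# Human-readable labels, with the units you have actually verified
ax.set_xlabel("Semi-major axis (units: go and check)")
ax.set_ylabel("Planet mass (units: go and check)")
# In a notebook the figure renders at the end of the cell
```

Swap in your own two columns once you know what they are and what units they're in.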

🪙 *3 XP if you produce a plot that tells a halfway interesting story. 4 XP if you realise the story is mostly selection effects and bias.*

```python
# Go on then, pick two columns and make a scatter plot
# It’s what you came here to do, right?
# Bonus XP if your axis labels aren’t "pl_something" and "st_whatever"
```

## 🧹 LEVEL 4: Clean Something Horrific

Right, you’ve made a scatter plot. Good for you.  
Now look closer.

Why are there 300 missing points?  
Why are some masses negative?  
Why does this column have “—” in it?  
Why are *half* the rows NaN?

Welcome to **data cleaning**, the most thankless and essential job in astronomy.

---

### Pick a victim.
Choose one column that’s clearly cooked. Examples:
- Planet mass (but also maybe a guess?)
- Equilibrium temperature (when it feels like showing up)
- Star mass (except when it’s missing)
- Semi-major axis (or lies)
- Distance to the system (maybe)

Oh, what’s that?  
You don’t know which column is which?  
Yeah. I *could’ve* told you the column names.  
But I didn’t.

Moving on.


Now:
- Use `.isnull().sum()` to count the carnage  
- Drop rows that are useless (`dropna`)  
- OR replace them with something (`fillna`)  
- OR filter for the good stuff (`df[df['col'] > 0]`)  
- OR convert string dashes and garbage to proper NaNs
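The first four options in that list can be sketched on a toy table like this. The column names (`planet`, `mass`, `temp`) and the values are invented, and filling with the median is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Invented toy table with gaps and one impossible value
df_toy = pd.DataFrame({
    "planet": ["Alpha b", "Beta c", "Gamma d", "Delta e"],
    "mass": [1.5, np.nan, -1.0, 12.0],
    "temp": [np.nan, 300.0, 1500.0, np.nan],
})

# Count the missing values in each column
missing = df_toy.isnull().sum()
print(missing)

# Drop rows that lack a value in the column you care about
has_mass = df_toy.dropna(subset=["mass"])

# Or fill the gaps with something (the median here, which is a choice, not a truth)
filled = df_toy.fillna({"temp": df_toy["temp"].median()})

# Or keep only the physically sensible values
positive = df_toy[df_toy["mass"] > 0]

print(len(df_toy), len(has_mass), len(positive))
```

Note that the `> 0` filter also throws away the NaN rows, because a comparison with NaN is never true. Each option leaves you with a different sample, so say which one you used.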

---

🧨 Bonus horror:
Some “missing” values are strings like `"--"` or `"NaN"` - **not actual NaNs**.  
pandas won’t necessarily treat them as null. You’ll just silently plot crap.  
Handle that. Fix that. Find that.
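One way to hunt these down, sketched on an invented column of strings: `pd.to_numeric` with `errors="coerce"` turns anything it can't parse as a number into a real NaN. The column name and values here are made up.

```python
import pandas as pd

# A column that arrived as text, with placeholder strings mixed in (all invented)
df_toy = pd.DataFrame({"mass": ["1.5", "--", "NaN", "0.3"]})

# As strings, none of these count as null
nulls_before = df_toy["mass"].isnull().sum()

# Coerce to numbers: unparseable entries become proper NaNs
df_toy["mass_clean"] = pd.to_numeric(df_toy["mass"], errors="coerce")
nulls_after = df_toy["mass_clean"].isnull().sum()

print(nulls_before, nulls_after)
print(df_toy.dtypes)
```

A column with dtype `object` when you expected numbers is usually the giveaway. You can also catch placeholders at load time with the `na_values` argument of `pd.read_csv`, assuming you know which strings to list.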

---

🪙 *4 XP if you emerge with a DataFrame that won’t explode the next time you plot it.*  
🪙 *Extra XP if you realise your cleaned sample is now 1/3 the original size.*

## 🧠 LEVEL 5: Invent a Question, Regret Everything

You’ve cleaned the data. You’ve stared into the abyss. Now what?

You’re in possession of the most cursed object in science:  
**A spreadsheet full of exoplanets and no idea what to do with it.**

Your task is simple — and cruel:

**Ask a question.**  
Then **try to answer it** using this data.

---

### Need inspiration?

- Are big planets found farther from their stars?
- Does star temperature affect planet radius?
- Are hot Jupiters real or just bad sampling?
- Have we gotten better at finding small planets in recent years?
- Do certain detection methods favour certain types of planets?
- Are we biased towards detecting planets around certain stars?
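Several of these boil down to "split the sample by a category and compare". A minimal sketch of that pattern with `groupby`, on an invented table where `method` and `mass` are hypothetical column names and the values are made up:

```python
import pandas as pd

# Invented toy sample: three pretend detection methods with pretend masses
df_toy = pd.DataFrame({
    "method": ["A", "A", "A", "B", "B", "C"],
    "mass": [0.5, 0.8, 1.1, 300.0, 450.0, 3000.0],
})

# How many planets per method, and what is the typical mass in each group?
summary = df_toy.groupby("method")["mass"].agg(["count", "median"])
print(summary)
```

Keep an eye on the `count` column. A median from one planet is not a trend.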

---

You’ll find:
- Most trends are real *and* artefacts at the same time.
- Detection biases lie in wait like traps in a jungle.
- The moment you say “all planets”, someone will ask: “Detected *how*?”
- The sample size drops off a cliff when you filter by more than two things.

That’s the price of doing “science.”
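The shrinking sample is easy to watch happen. Here's a sketch on a random fake table where each column is missing for a chunk of the rows. The column names and the missing fractions are invented, so the exact numbers mean nothing, only the direction does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Invented table: each column is blanked out at random for some fraction of rows
df_toy = pd.DataFrame({
    "mass": np.where(rng.random(n) < 0.6, np.nan, rng.uniform(0.1, 10, n)),
    "radius": np.where(rng.random(n) < 0.4, np.nan, rng.uniform(0.5, 2, n)),
    "temp": np.where(rng.random(n) < 0.5, np.nan, rng.uniform(200, 2000, n)),
})

# Require one more column at each step and record how many rows survive
required = []
counts = []
for col in ["mass", "radius", "temp"]:
    required.append(col)
    survivors = len(df_toy.dropna(subset=required))
    counts.append(survivors)
    print(f"need {required}: {survivors} of {n} rows left")
```

Running the same loop on your real columns before you commit to a question will tell you how many planets you'll actually have to work with.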

---

### Deliverables:

1. State a question.
2. Choose your columns.
3. Plot something. Anything.
4. Try to explain what you’re seeing.
5. Immediately question your own conclusions.

🪙 *5 XP if you realise your question has no clean answer.*  
🪙 *Extra XP if you start ranting about detection bias and survey limitations.*

## 🧑‍⚖️ LEVEL 6: Face the Bastard

You've had fun, haven't you?  
Asked a question. Made a cute little plot. Maybe slapped a linear fit on it. Felt proud.

Well now it’s time to justify yourself.

You're no longer talking to your friendly neighbourhood Aussie bloke.  
You're talking to his **alter ego**: a grizzled, underfunded, over-caffeinated reviewer who hasn't had a decent night’s sleep since Kepler launched.

And he wants answers.

---

### Your task:

**Pitch your plot to the Bastard.**

Give a short spiel answering:
- What question did you ask?
- What did you find?
- Why might it be true?
- Why might it be *absolute bollocks*?

Be honest. Be critical. Pretend you're trying to stop this plot from being posted on Twitter with "the science is settled" underneath.

🪙 *5 XP if the Bastard grudgingly nods.*  
🪙 *Bonus XP if he says “fair enough, mate” and doesn’t immediately call you a clown.*