# Kaggle Notebook Basics & Workflow (Studio)
**DS2002 — Data Science Systems • Spring 2026 • Jan 14, 2026**  
Instructor: Jason Williamson

---

## Studio agenda (50 minutes)

1. Notebook mental model (state, cells, execution order) — 10 min  
2. Kaggle workspace basics (inputs, outputs, saving) — 10 min  
3. Markdown that doesn’t look like a ransom note — 10 min  
4. Writing clean, reusable code in notebooks — 10 min  
5. Mini build + self-checks (so you know it worked) — 10 min

---

## What you’ll be able to do by the end of this session

- Navigate the Kaggle notebook environment confidently (input vs working output).
- Explain why “Restart & Run All” is not optional if you want reproducible results.
- Write readable Markdown that documents *what* you did and *why*.
- Use a simple, repeatable notebook structure for labs and projects in this course.
- Export your work to GitHub without mystery steps.

---

## Non-negotiables for DS2002 notebooks

1. **A reader should be able to run your notebook from top to bottom** without you standing next to them explaining which cells to skip.
2. **Every notebook must tell a story**: intent → method → result → takeaway.
3. **Code is not a diary**. Name things like you expect another human to read them.
4. **Your outputs belong in `/kaggle/working`**, not in your imagination.

In [None]:
# Setup cell: run this first, every time.

import os
import sys
import platform
from pathlib import Path

print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("Working directory:", os.getcwd())

KAGGLE_INPUT = Path("/kaggle/input")
KAGGLE_WORKING = Path("/kaggle/working")

print("\nKaggle input exists:", KAGGLE_INPUT.exists())
print("Kaggle working exists:", KAGGLE_WORKING.exists())

## 1) The notebook mental model (aka: why things “randomly” break)

A notebook is **not** a script. A script runs top-to-bottom once. A notebook is more like a kitchen:

- You can cook steps out of order.
- You can leave ingredients on the counter (variables in memory).
- You can forget what you already chopped.
- You can convince yourself you followed the recipe, while doing something completely different.

That’s why **“Restart & Run All”** is your best friend. It reveals whether your notebook is:
- actually reproducible, or
- just *lucky*.

### Quick demo: state and execution order
Run the next two cells **out of order** on purpose.

In [None]:
# Cell A
message = "If you can see this, you ran Cell A."
print(message)

In [None]:
# Cell B
# Run this BEFORE Cell A once. Then run Cell A. Then run Cell B again.
print("message is:", message)

### What just happened?

- If you ran Cell B first, Python raised a `NameError` because `message` didn’t exist yet.
- If you ran Cell A, then Cell B, everything worked.
- If you run cells out of order in real work, you can create subtle bugs: a cell may silently use a stale value that an earlier run left in memory.

**Rule:** When you’re done working, do a **Restart & Run All** before submitting.
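
To see the same failure without clicking around, here is a minimal sketch that simulates a kernel with a plain dictionary. The names `CELL_A`, `CELL_B`, and `run_cells` are made up for this illustration; a real kernel does much more, but the shared-namespace idea is the same.

```python
# Two "cells" stored as source strings (hypothetical stand-ins for Cell A and Cell B).
CELL_A = 'message = "If you can see this, you ran Cell A."'
CELL_B = 'result = "message is: " + message'

def run_cells(cells):
    """Run cell sources in order inside a fresh namespace (a simulated restart)."""
    namespace = {}  # empty namespace = freshly restarted kernel
    for source in cells:
        try:
            exec(source, namespace)
        except NameError as err:
            return f"NameError: {err}"
    return namespace.get("result", "no result produced")

print(run_cells([CELL_B, CELL_A]))  # B before A: fails with a NameError
print(run_cells([CELL_A, CELL_B]))  # A before B: works
```

Each call starts from an empty namespace, which is roughly what a restart does for you.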

## 2) Kaggle notebook basics (inputs vs outputs)

Kaggle gives you two important places:

### `/kaggle/input`
- Read-only
- Where datasets you attach to the notebook appear
- Think: “source data”

### `/kaggle/working`
- Writable
- Where you should write outputs (cleaned files, charts, models, etc.)
- Think: “stuff I produced”

Let’s see what you have in your environment right now.

In [None]:
from pathlib import Path

def list_dir(path: Path, max_items: int = 50) -> None:
    if not path.exists():
        print(f"{path} does not exist.")
        return
    items = sorted(path.iterdir(), key=lambda p: p.name.lower())
    print(f"Listing {path} ({len(items)} items):")
    for p in items[:max_items]:
        tag = "dir " if p.is_dir() else "file"
        print(f" - [{tag}] {p.name}")
    if len(items) > max_items:
        print(f"... and {len(items) - max_items} more.")

list_dir(Path("/kaggle/input"))
print()
list_dir(Path("/kaggle/working"))

### Attaching a dataset (what you do in Kaggle)

In Kaggle, use the right panel to **Add data** to your notebook. Once attached, it shows up under `/kaggle/input/<dataset-name>/`.

You’ll often start a notebook with a small “data discovery” cell like this:

- list what’s in `/kaggle/input`
- set a variable pointing at the dataset folder
- read files relative to that folder

This makes your code portable and less brittle: switching datasets means editing one line instead of hunting for hard-coded paths.

In [None]:
# A portable pattern for selecting a dataset folder

from pathlib import Path

KAGGLE_INPUT = Path("/kaggle/input")

# If you attached a dataset, pick the folder name from the printed list above.
# Example: DATASET_DIR = KAGGLE_INPUT / "my-course-dataset"
DATASET_DIR = None  # <-- change this later when you have a dataset

print("DATASET_DIR:", DATASET_DIR)
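
The three discovery steps above (list, point a variable at the folder, read relative to it) could be sketched as below. `pick_dataset_dir` and `find_data_files` are hypothetical helper names, and the `*.csv` default is an assumption about what your dataset contains.

```python
from pathlib import Path
from typing import List, Optional

def pick_dataset_dir(input_root: Path, name: Optional[str] = None) -> Optional[Path]:
    """Return the named dataset folder, or the only attached one if there is exactly one."""
    if not input_root.exists():
        return None
    if name is not None:
        candidate = input_root / name
        return candidate if candidate.is_dir() else None
    folders = sorted(p for p in input_root.iterdir() if p.is_dir())
    return folders[0] if len(folders) == 1 else None

def find_data_files(dataset_dir: Path, pattern: str = "*.csv") -> List[Path]:
    """List matching files anywhere under the dataset folder, sorted for stable output."""
    return sorted(dataset_dir.rglob(pattern))

KAGGLE_INPUT = Path("/kaggle/input")
DATASET_DIR = pick_dataset_dir(KAGGLE_INPUT)

if DATASET_DIR is None:
    print("No single dataset folder found; set DATASET_DIR by hand.")
else:
    for path in find_data_files(DATASET_DIR):
        print(path.relative_to(DATASET_DIR))
```

If more than one dataset is attached, the helper deliberately returns `None` rather than guessing, so you still choose the folder explicitly.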

## 3) Markdown that communicates

Markdown is how you explain your thinking. It should be:

- short
- structured
- readable

A good notebook reads like a lab report, not a stream of consciousness.

### A simple template that works

Use this structure in most DS2002 notebooks:

1. **Problem / Goal**
2. **Data**
3. **Method**
4. **Results**
5. **Takeaways**
6. **Next steps / limitations**

Below is a Markdown “starter” you can steal.

### Problem / Goal
Write 2–3 sentences. Answer: *What are we trying to learn or build?*

### Data
Where did the data come from? What files/fields matter? Any assumptions?

### Method
What steps did you take? Keep it readable—use a numbered list.

### Results
Show the output. Include a short explanation.

### Takeaways
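What would you tell a manager who doesn’t care about Python?

### Next steps / limitations
What would you try next, and what should a reader be careful not to conclude?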

### Markdown pro tips (the ones that save your future self)

- Use headers (`#`, `##`, `###`) so your notebook has a table-of-contents feel.
- Use bullet lists for assumptions.
- Use short paragraphs.
- Put “decision points” in bold.
- Link things with `[text](url)` if needed.

Your notebook is both your work *and* your documentation. You don’t get extra credit for making it cryptic.

## 4) Writing clean, reusable code in a notebook

A notebook can be exploratory, but the final version should still be organized.

### Pattern: constants at the top
Put parameters where a reader can find them. Avoid magic numbers scattered around.

In [None]:
# Parameters / constants live near the top (so we don't play hide-and-seek later)

RANDOM_SEED = 42
MAX_ROWS_TO_DISPLAY = 10

print("RANDOM_SEED =", RANDOM_SEED)
print("MAX_ROWS_TO_DISPLAY =", MAX_ROWS_TO_DISPLAY)
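
Constants only pay off if the rest of the notebook actually uses them. Here is a small sketch of what that might look like; `sample_rows` and `preview` are hypothetical helpers, and the rows are just integers for illustration.

```python
import random
from typing import List, Sequence

RANDOM_SEED = 42
MAX_ROWS_TO_DISPLAY = 10

def sample_rows(rows: Sequence[int], k: int, seed: int = RANDOM_SEED) -> List[int]:
    """Draw k rows reproducibly: the same seed always gives the same sample."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(list(rows), k)

def preview(rows: Sequence[int], limit: int = MAX_ROWS_TO_DISPLAY) -> List[int]:
    """Return at most `limit` rows for display."""
    return list(rows[:limit])

rows = list(range(100))
first = sample_rows(rows, k=5)
second = sample_rows(rows, k=5)
print("Same sample both times:", first == second)
print("Rows shown:", len(preview(rows)))
```

Changing the seed or the display limit is now a one-line edit at the top of the notebook.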

### Pattern: functions for repeated logic

If you do the same thing twice, it probably belongs in a function.

We’ll use a tiny “transaction” example (inspired by real operational data): items scanned at checkout.
No external datasets needed for this demo.

In [None]:
from typing import List, Dict, Tuple
from collections import Counter

transactions: List[Dict[str, str]] = [
    {"store_id": "FL-239", "sku": "PT-12", "product": "Pop-Tarts Strawberry"},
    {"store_id": "FL-239", "sku": "SKU#459812", "product": "POPTART-STRAWBERRY"},
    {"store_id": "FL-105", "sku": "BATT-100", "product": "AA Batteries (4-pack)"},
    {"store_id": "FL-239", "sku": "PT-12", "product": "Pop-Tarts Strawberry"},
    {"store_id": "FL-330", "sku": "POPTART-STRAWBERRY", "product": "Pop-Tarts Strawberry"},
]

def normalize_product_name(name: str) -> str:
    """Normalize product names so the same thing doesn't show up 12 different ways."""
    name = name.strip()
    name = name.replace("-", " ")
    name = " ".join(name.split())  # collapse extra spaces
    return name.title()

def count_products(rows: List[Dict[str, str]]) -> Counter:
    """Count products after normalization."""
    normalized = [normalize_product_name(r["product"]) for r in rows]
    return Counter(normalized)

counts = count_products(transactions)
counts

### Why this matters

Operational data is messy. Even in this tiny sample:

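- Pop-Tarts appear under multiple SKUs and naming patterns.
- If you don’t normalize, you get misleading counts.
- Light normalization is not a cure-all: `POPTART-STRAWBERRY` becomes `Poptart Strawberry`, which is still counted separately from `Pop Tarts Strawberry`, and `.title()` turns `AA` into `Aa`. Always inspect the normalized values.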

In DS2002, you’re not just learning tools—you’re learning how to produce results that are *trustworthy*.
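
String cleanup alone maps `POPTART-STRAWBERRY` to `Poptart Strawberry`, which does not match `Pop Tarts Strawberry`. One possible way to merge such variants is an explicit alias table, sketched below. `PRODUCT_ALIASES` and `canonical_product_name` are hypothetical names, and in a real project the mapping would come from someone who knows the catalog, not from guesswork.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical alias table: normalized variant -> the name we want to report.
PRODUCT_ALIASES: Dict[str, str] = {
    "Poptart Strawberry": "Pop Tarts Strawberry",
}

def normalize_product_name(name: str) -> str:
    """Same light cleanup as the cell above: trim, de-hyphenate, collapse spaces, title-case."""
    name = name.strip().replace("-", " ")
    return " ".join(name.split()).title()

def canonical_product_name(name: str) -> str:
    """Normalize first, then map known variants onto one canonical name."""
    normalized = normalize_product_name(name)
    return PRODUCT_ALIASES.get(normalized, normalized)

products: List[str] = [
    "Pop-Tarts Strawberry",
    "POPTART-STRAWBERRY",
    "AA Batteries (4-pack)",
    "Pop-Tarts Strawberry",
    "Pop-Tarts Strawberry",
]

before = Counter(normalize_product_name(p) for p in products)
after = Counter(canonical_product_name(p) for p in products)
print("Before aliases:", dict(before))
print("After aliases: ", dict(after))
```

The alias table is data, not logic, so it can be reviewed and extended without touching the functions.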

## 5) Mini build (guided): generate a small “stocking report”

Goal: produce a simple report of the top products and write it to `/kaggle/working`.

We’ll do this in steps:
1. Normalize names
2. Count items
3. Produce a sorted report
4. Save to a file

Then we’ll do a self-check.

In [None]:
from pathlib import Path

def build_stocking_report(rows: List[Dict[str, str]], top_n: int = 5) -> List[Tuple[str, int]]:
    """Return a list of (product_name, count) sorted from most to least common."""
    c = count_products(rows)
    return sorted(c.items(), key=lambda x: x[1], reverse=True)[:top_n]

report = build_stocking_report(transactions, top_n=10)
report
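
As an alternative sketch, `collections.Counter` already has a `most_common` method that returns `(element, count)` pairs from most to least common, so the sort-and-slice step can be a single call. `top_products` and `demo_counts` are hypothetical names used only here.

```python
from collections import Counter
from typing import List, Tuple

def top_products(counts: Counter, top_n: int = 5) -> List[Tuple[str, int]]:
    """Return up to top_n (product_name, count) pairs, most common first."""
    return counts.most_common(top_n)

demo_counts = Counter({
    "Pop Tarts Strawberry": 3,
    "Poptart Strawberry": 1,
    "Aa Batteries (4 Pack)": 1,
})
print(top_products(demo_counts, top_n=2))
```

Either version is fine; the built-in method simply leaves less code to read and test.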

In [None]:
# Write the report to /kaggle/working

output_path = Path("/kaggle/working/stocking_report.txt")

lines = ["Stocking Report (demo)", "-" * 24]
for product, n in report:
    lines.append(f"{product}: {n}")

output_path.write_text("\n".join(lines), encoding="utf-8")
print("Wrote:", output_path)
print("\nPreview:\n")
print(output_path.read_text(encoding="utf-8"))
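
If you want the report to open cleanly in a spreadsheet, a CSV version is a small step further. This is a sketch under two assumptions: the `./working` fallback folder for runs outside Kaggle is my invention, and `resolve_output_dir` and `write_report_csv` are hypothetical helper names.

```python
import csv
from pathlib import Path
from typing import List, Tuple

def resolve_output_dir() -> Path:
    """Use /kaggle/working on Kaggle; otherwise fall back to ./working (an assumption)."""
    kaggle_working = Path("/kaggle/working")
    if kaggle_working.exists():
        return kaggle_working
    fallback = Path.cwd() / "working"
    fallback.mkdir(parents=True, exist_ok=True)
    return fallback

def write_report_csv(report: List[Tuple[str, int]], output_dir: Path) -> Path:
    """Write (product, count) rows to stocking_report.csv with a header row."""
    output_path = output_dir / "stocking_report.csv"
    with output_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["product", "count"])
        writer.writerows(report)
    return output_path
```

In a notebook you would call `write_report_csv(report, resolve_output_dir())` and print the returned path, the same way the text version does.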

### Self-checks (the boring part that prevents embarrassment)

These are tiny tests you can add to confirm your notebook is producing what you think it is.

In projects, you can (and should) add checks for:
- expected columns
- expected row counts
- no missing keys after joins
- reasonable ranges for important values

In [None]:
# Self-checks
report_path = Path("/kaggle/working/stocking_report.txt")
assert report_path.exists(), "Report file was not created."

# Basic content check
txt = report_path.read_text(encoding="utf-8")
assert "Pop Tarts Strawberry" in txt, "Expected normalized product name missing."

print("Self-checks passed.")
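
The project-level checks listed above (expected columns, expected row counts) might look like this for rows stored as a list of dictionaries. `check_rows` is a hypothetical helper, and treating blank strings as missing values is an assumption you may not want in every project.

```python
from typing import Dict, List, Set

def check_rows(rows: List[Dict[str, str]], required: Set[str], min_rows: int = 1) -> None:
    """Raise AssertionError with a readable message if the rows look wrong."""
    assert len(rows) >= min_rows, f"Expected at least {min_rows} rows, got {len(rows)}."
    for i, row in enumerate(rows):
        missing = required - set(row)
        assert not missing, f"Row {i} is missing keys: {sorted(missing)}"
        empty = [key for key in required if not str(row[key]).strip()]
        assert not empty, f"Row {i} has empty values for: {sorted(empty)}"

sample = [
    {"store_id": "FL-239", "sku": "PT-12", "product": "Pop-Tarts Strawberry"},
    {"store_id": "FL-105", "sku": "BATT-100", "product": "AA Batteries (4-pack)"},
]

check_rows(sample, required={"store_id", "sku", "product"}, min_rows=2)
print("Row checks passed.")
```

A failing check stops "Run All" at the cell where the problem appears, which is far easier to debug than a wrong number three sections later.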

## Exporting your notebook to GitHub (Kaggle-only workflow)

For DS2002, you *author* in Kaggle, but your *portfolio* and *submission artifact* live on GitHub.

### Recommended workflow

1. In Kaggle, click **File → Download Notebook** (downloads `.ipynb`).
2. Put it in the correct folder in your GitHub repo.
3. Commit with a real message:
   - Good: `Add stocking report mini-build`
   - Bad: `update`
4. Push to GitHub.
5. Submit the GitHub link in Canvas (and the Kaggle link if requested).

### The “Run All” rule
Before you download:
- **Restart session**
- **Run all cells**
- Confirm outputs look right

If your notebook only works in the exact order you happened to click things, it doesn’t work.
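
If you want a quick sanity check on a downloaded file, one possible approach is to read the `.ipynb` JSON and confirm the code cells were executed in order. This sketch assumes the standard notebook layout (a top-level `cells` list whose code cells carry an `execution_count`) and assumes that a fresh restart numbers executions from 1; `load_notebook`, `code_cell_counts`, and `ran_top_to_bottom` are hypothetical names.

```python
import json
from pathlib import Path
from typing import List, Optional

def load_notebook(path: Path) -> dict:
    """Read a downloaded .ipynb file as a dictionary."""
    return json.loads(path.read_text(encoding="utf-8"))

def code_cell_counts(notebook: dict) -> List[Optional[int]]:
    """Collect execution_count for each non-empty code cell, in document order."""
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        text = "".join(source) if isinstance(source, list) else source
        if text.strip():  # skip empty code cells, which may never get a count
            counts.append(cell.get("execution_count"))
    return counts

def ran_top_to_bottom(notebook: dict) -> bool:
    """True if every code cell ran exactly once, in order, starting from 1."""
    counts = code_cell_counts(notebook)
    return counts == list(range(1, len(counts) + 1))

clean = {"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 1},
    {"cell_type": "code", "source": ["print(x)"], "execution_count": 2},
]}
messy = {"cells": [
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 7},
    {"cell_type": "code", "source": ["print(x)"], "execution_count": 3},
]}

print("clean:", ran_top_to_bottom(clean))
print("messy:", ran_top_to_bottom(messy))
```

This does not replace actually restarting and running everything; it only flags files that clearly were not saved from a clean run.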

## In-class checkpoint (5 minutes)

Do these before you leave:

1. Add a short Markdown section near the top titled **“What I learned today”** (3–5 bullets).
2. Add one self-check `assert` that verifies a file exists in `/kaggle/working`.
3. Restart & Run All. Confirm there are no errors.

If you can do those three things, you’re ready for the rest of this course.

---

## Appendix: a tiny “clean notebook” checklist

- Title + course/date at the top
- Clear section headers
- Parameters near the top
- Repeated logic in functions
- Self-checks for key outputs
- Runs top-to-bottom after restart

That’s it. No heroics required.