# Lab 01: Environment Setup and Basic I/O Benchmarking

**Course:** Big Data

---

## 👤 Student Information

**Name:** `[Your Full Name Here]`

**Date:** `[Date of Submission]`

---

**Goal:** Set up your environment and verify that it works, then compare CSV and Parquet read time and file size.

## Instructions

1. **Fill in your information above** before starting the lab
2. Read each cell carefully before running it
3. Implement the **TODO functions** when you see them
4. Run cells **from top to bottom** (Shift+Enter)
5. Check that output makes sense after each cell

---


## 📚 Libraries Used in This Lab

This lab uses several essential Python libraries for data science and big data:

### Core Libraries

- **`pandas`** - Data manipulation and analysis
  - Used for: Reading/writing CSV and Parquet files, creating DataFrames
  - [Documentation](https://pandas.pydata.org/docs/)

- **`numpy`** - Numerical computing
  - Used for: Generating random data, calculating statistics (median)
  - [Documentation](https://numpy.org/doc/)

- **`pathlib.Path`** - Modern file path handling
  - Used for: Cross-platform file paths, directory creation
  - [Documentation](https://docs.python.org/3/library/pathlib.html)

- **`time`** - Time measurement
  - Used for: Precise performance benchmarking
  - [Documentation](https://docs.python.org/3/library/time.html)

- **`json`** - JSON serialization
  - Used for: Saving results in a structured format
  - [Documentation](https://docs.python.org/3/library/json.html)

### Why These Libraries?

- **pandas**: Industry standard for data manipulation in Python
- **numpy**: Foundation for numerical computing, used by pandas internally
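- **Parquet**: Not a library but a columnar storage format, read and written here through pandas (which needs the `pyarrow` or `fastparquet` package installed); usually smaller on disk and faster to read than CSV for analytical workloads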
- **pathlib**: Modern, object-oriented approach to file paths (better than `os.path`)

---

## 💡 Quick Tips

- **Read the docstrings carefully** - They tell you exactly what each function should do
- **Run test cells immediately** - Verify each function works before moving on
- **Use print statements** - Debug by printing intermediate values
- **Check the error messages** - They often tell you exactly what's wrong

**Need help?** Check out:
- [Tips & Guidance](../docs/labs/lab01_tips.md) - Detailed hints for each TODO
- [Quick Reference](../docs/labs/lab01_quick_reference.md) - Cheat sheet with essential functions


## 1. Imports and Version Check

Let's verify all required libraries are installed.

In [1]:
import json
import time
from pathlib import Path
import pandas as pd
import numpy as np

print("✓ All imports successful!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

✓ All imports successful!
Pandas version: 2.3.3
NumPy version: 2.4.0


## 2. Define Paths

We'll use **relative paths** so the notebook works on any machine.
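
As a minimal sketch of what `pathlib` gives you here (the file names mirror this lab's layout, and nothing touches the disk): the `/` operator joins path segments, and `as_posix()` prints them with forward slashes on any operating system. Relative paths are resolved against the current working directory, so the paths in this lab assume the notebook is run from the `notebooks/` folder.

```python
from pathlib import Path

# Join segments with "/" instead of concatenating strings by hand.
data_raw = Path("../data/raw")
csv_path = data_raw / "synthetic.csv"

print(csv_path.as_posix())         # forward slashes on every OS
print(csv_path.name)               # final path component
print(csv_path.suffix)             # file extension, including the dot
print(csv_path.parent.as_posix())  # containing directory

# Derive a sibling file name without string slicing.
parquet_name = csv_path.with_suffix(".parquet").name
print(parquet_name)
```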

In [2]:
# Base directories
DATA_RAW = Path("../data/raw")
DATA_PROCESSED = Path("../data/processed")
RESULTS_DIR = Path("../results")

# File paths
CSV_PATH = DATA_RAW / "synthetic.csv"
PARQUET_PATH = DATA_PROCESSED / "synthetic.parquet"
METRICS_PATH = RESULTS_DIR / "lab01_metrics.json"

print("Paths defined:")
print(f"  CSV: {CSV_PATH}")
print(f"  Parquet: {PARQUET_PATH}")
print(f"  Metrics: {METRICS_PATH}")

Paths defined:
  CSV: ../data/raw/synthetic.csv
  Parquet: ../data/processed/synthetic.parquet
  Metrics: ../results/lab01_metrics.json


---

## 3. TODO Functions

**Your task:** Implement these 7 functions. Read the docstring carefully!

After implementing each function, run the test cell below it to verify it works.

### TODO 1: `ensure_dir()`

Create a directory if it doesn't exist.

**💡 Hint:** Use `path.mkdir()` with `parents=True` (create parent directories) and `exist_ok=True` (don't error if exists).

**Need more help?** See [TODO 1 detailed tips](../docs/labs/lab01_tips.md#todo-1-ensure_dir)
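
To see what the two arguments change, here is a small experiment in a throwaway temporary directory (the `data/raw` folder inside it is only an illustration, not the lab's real folder). It is a sketch of the behaviour of `Path.mkdir()`, not the function you are asked to write.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "data" / "raw"  # hypothetical nested folder

    # parents=True also creates the missing "data" folder.
    nested.mkdir(parents=True, exist_ok=True)
    created = nested.is_dir()

    # exist_ok=True turns a second call into a no-op.
    nested.mkdir(parents=True, exist_ok=True)

    # Without exist_ok, the same call raises FileExistsError.
    try:
        nested.mkdir(parents=True)
        raised = False
    except FileExistsError:
        raised = True

print(f"created: {created}, second call without exist_ok raised: {raised}")
```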


In [None]:
def ensure_dir(path: Path) -> None:
    """
    Create a directory if it doesn't exist.
    
    Args:
        path: Path to the directory
    
    Example:
        ensure_dir(Path("data/raw"))
    """
    # TODO: Implement this function
    # Hint: Use path.mkdir() with appropriate arguments
    pass

In [None]:
# TEST: ensure_dir()
ensure_dir(DATA_RAW)
ensure_dir(DATA_PROCESSED)
ensure_dir(RESULTS_DIR)

assert DATA_RAW.exists(), "data/raw should exist"
assert DATA_PROCESSED.exists(), "data/processed should exist"
assert RESULTS_DIR.exists(), "results should exist"
print("✓ ensure_dir() works correctly!")

### TODO 2: `write_synthetic_csv()`

Generate a simple synthetic dataset and save it as CSV.

**💡 Key Steps:**
1. Set random seed with `np.random.seed(seed)`
2. Generate timestamps with `pd.date_range()`
3. Generate random integers with `np.random.randint()`
4. Generate random floats with `np.random.uniform()`
5. Generate random categories with `np.random.choice()`
6. Create DataFrame with `pd.DataFrame({...})`
7. Save with `df.to_csv(csv_path, index=False)`
8. Get file size with `csv_path.stat().st_size`

**Need more help?** See [TODO 2 detailed tips](../docs/labs/lab01_tips.md#todo-2-write_synthetic_csv)
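
Two details in these steps are easy to get wrong, so here is a toy illustration on 5 rows (the helper name `draw_user_ids` and the two-column frame are made up for this sketch; your function needs all four columns). It shows that reseeding reproduces the same draws, that the upper bound of `np.random.randint()` is exclusive, and that `index=False` keeps the row index out of the CSV.

```python
import io

import numpy as np
import pandas as pd

n_rows = 5  # toy size, not the 200_000 rows used in the lab


def draw_user_ids(seed: int) -> np.ndarray:
    """Hypothetical helper: reseed, then draw integers in [1, 10000]."""
    np.random.seed(seed)
    # The upper bound is exclusive, so 10_001 is needed to allow 10000.
    return np.random.randint(1, 10_001, size=n_rows)


first = draw_user_ids(42)
second = draw_user_ids(42)  # same seed, same numbers

timestamps = pd.date_range("2024-01-01", periods=n_rows, freq="min")
df = pd.DataFrame({"timestamp": timestamps, "user_id": first})

# Write to an in-memory buffer instead of a file for this demo.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()
print(lines[0])  # header row only contains the two column names
```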


In [None]:
def write_synthetic_csv(csv_path: Path, n_rows: int = 200_000, seed: int = 0) -> dict:
    """
    Generate a synthetic dataset and save as CSV.
    
    Args:
        csv_path: Where to save the CSV
        n_rows: Number of rows to generate
        seed: Random seed for reproducibility
    
    Returns:
        Dictionary with metadata: {"rows": int, "cols": int, "size_bytes": int}
    
    The dataset should have these columns:
        - timestamp: datetime strings (e.g., "2024-01-01 12:00:00")
        - user_id: random integers from 1 to 10000
        - value: random floats from 0 to 100
        - category: random choice from ["A", "B", "C", "D", "E"]
    """
    # TODO: Implement this function
    # Hints:
    # 1. Use np.random.seed(seed) for reproducibility
    # 2. Create a DataFrame with the 4 columns described above
    # 3. Save with df.to_csv(csv_path, index=False)
    # 4. Get file size with csv_path.stat().st_size
    # 5. Return a dict with rows, cols, and size_bytes
    pass

In [None]:
# TEST: write_synthetic_csv()
if not CSV_PATH.exists():
    metadata = write_synthetic_csv(CSV_PATH, n_rows=200_000, seed=42)
    print(f"Generated CSV: {metadata}")
    assert metadata["rows"] == 200_000, "Should have 200k rows"
    assert metadata["cols"] == 4, "Should have 4 columns"
    assert metadata["size_bytes"] > 0, "File size should be positive"
    print("✓ write_synthetic_csv() works correctly!")
else:
    print("CSV already exists, skipping generation")

### TODO 3: `time_it()`

Measure how long a function takes to run (repeat multiple times).

**💡 Key Steps:**
1. Create empty list: `times = []`
2. Loop `repeats` times
3. Inside loop: record start time, call `fn()`, record end time
4. Append elapsed time to list
5. Calculate median with `np.median(times)`
6. Return dict with `"runs_sec"` and `"median_sec"`

**Why median?** It's less affected by outliers than mean - gives you the "typical" performance.

**Need more help?** See [TODO 3 detailed tips](../docs/labs/lab01_tips.md#todo-3-time_it)
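
The median-versus-mean point can be checked with a few made-up timings (the numbers below are hypothetical, not measured). One slow run out of three leaves the median at 0.102 s but pulls the mean up to about 0.383 s.

```python
import numpy as np

# Hypothetical timings in seconds; the last run hit a slow moment.
runs_sec = [0.102, 0.098, 0.950]

median_sec = float(np.median(runs_sec))
mean_sec = float(np.mean(runs_sec))

print(f"median: {median_sec:.3f} s")
print(f"mean:   {mean_sec:.3f} s")
```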


In [None]:
def time_it(fn, repeats: int = 3) -> dict:
    """
    Time a function by running it multiple times.
    
    Args:
        fn: A callable (function with no arguments)
        repeats: How many times to run the function
    
    Returns:
        Dictionary with:
            - "runs_sec": list of times (in seconds) for each run
            - "median_sec": median time
    
    Example:
        result = time_it(lambda: pd.read_csv("data.csv"), repeats=3)
        print(result["median_sec"])
    """
    # TODO: Implement this function
    # Hints:
    # 1. Create an empty list to store times
    # 2. Loop 'repeats' times:
    #    - Record start time with time.perf_counter()
    #    - Call fn()
    #    - Record end time
    #    - Append (end - start) to the list
    # 3. Calculate median using np.median()
    # 4. Return a dict with "runs_sec" and "median_sec"
    pass

In [None]:
# TEST: time_it()
result = time_it(lambda: time.sleep(0.01), repeats=3)
assert len(result["runs_sec"]) == 3, "Should have 3 runs"
assert result["median_sec"] > 0, "Median should be positive"
print(f"Test result: {result}")
print("✓ time_it() works correctly!")

### TODO 4: `read_csv_once()`

Read a CSV file and return its shape.

**💡 Hint:** Use `pd.read_csv(csv_path)` to read the file, then return `df.shape` (which is a tuple of `(rows, cols)`).

**Need more help?** See [TODO 4 detailed tips](../docs/labs/lab01_tips.md#todo-4-read_csv_once)
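
A quick way to get a feel for `df.shape` is to read a tiny CSV from memory (the three-row text below is invented for this sketch). Note that the header line is not counted as a row.

```python
import io

import pandas as pd

# Toy CSV: one header line plus three data lines.
csv_text = "user_id,value\n1,0.5\n2,1.5\n3,2.5\n"

df = pd.read_csv(io.StringIO(csv_text))
n_rows, n_cols = df.shape  # shape unpacks like any two-item tuple

print(f"{n_rows} rows, {n_cols} cols")
```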


In [None]:
def read_csv_once(csv_path: Path) -> tuple[int, int]:
    """
    Read a CSV file and return its shape.
    
    Args:
        csv_path: Path to the CSV file
    
    Returns:
        Tuple of (n_rows, n_cols)
    """
    # TODO: Implement this function
    # Hints:
    # 1. Use pd.read_csv(csv_path)
    # 2. Get the shape with df.shape
    # 3. Return the shape as a tuple
    pass

In [None]:
# TEST: read_csv_once()
rows, cols = read_csv_once(CSV_PATH)
assert rows == 200_000, "Should have 200k rows"
assert cols == 4, "Should have 4 columns"
print(f"CSV shape: {rows} rows × {cols} cols")
print("✓ read_csv_once() works correctly!")

### TODO 5: `write_parquet()`

Convert a CSV file to Parquet format.

**💡 Key Steps:**
1. Read CSV with `pd.read_csv(csv_path)`
2. Write Parquet with `df.to_parquet(parquet_path, index=False)`
3. Get file size with `parquet_path.stat().st_size`
4. Return dict with metadata

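**What is Parquet?** A columnar, binary storage format that stores column types and supports compression. It is typically smaller on disk and faster to read than CSV, which has to be parsed as text. You'll measure the difference in the benchmark!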

**Need more help?** See [TODO 5 detailed tips](../docs/labs/lab01_tips.md#todo-5-write_parquet)
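
"Columnar" is easier to picture with plain Python containers. The toy below is only an analogy with made-up records, not how Parquet is implemented: in a row layout each record carries every field, while in a column layout all values of one field sit together, so reading one column does not touch the others.

```python
# Row-oriented layout: one dict per record (similar in spirit to CSV lines).
rows = [
    {"user_id": 1, "value": 10.0, "category": "A"},
    {"user_id": 2, "value": 20.0, "category": "B"},
    {"user_id": 3, "value": 30.0, "category": "A"},
]

# Summing one field still walks through every full record.
row_total = sum(record["value"] for record in rows)

# Column-oriented layout: one list per field.
columns = {name: [record[name] for record in rows] for name in rows[0]}

# Summing one field now reads a single contiguous list.
col_total = sum(columns["value"])

print(row_total, col_total)
print(columns["category"])
```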


In [None]:
def write_parquet(csv_path: Path, parquet_path: Path) -> dict:
    """
    Read a CSV and write it as Parquet.
    
    Args:
        csv_path: Input CSV file
        parquet_path: Output Parquet file
    
    Returns:
        Dictionary with: {"parquet_size_bytes": int, "rows": int, "cols": int}
    """
    # TODO: Implement this function
    # Hints:
    # 1. Read the CSV with pd.read_csv()
    # 2. Write to Parquet with df.to_parquet(parquet_path, index=False)
    # 3. Get file size with parquet_path.stat().st_size
    # 4. Return metadata dict
    pass

In [None]:
# TEST: write_parquet()
if not PARQUET_PATH.exists():
    metadata = write_parquet(CSV_PATH, PARQUET_PATH)
    print(f"Generated Parquet: {metadata}")
    assert metadata["rows"] == 200_000, "Should have 200k rows"
    assert metadata["cols"] == 4, "Should have 4 columns"
    assert metadata["parquet_size_bytes"] > 0, "File size should be positive"
    print("✓ write_parquet() works correctly!")
else:
    print("Parquet already exists, skipping conversion")

### TODO 6: `read_parquet_once()`

Read a Parquet file and return its shape.

**💡 Hint:** Same as `read_csv_once()`, but use `pd.read_parquet(parquet_path)` instead.

**Need more help?** See [TODO 6 detailed tips](../docs/labs/lab01_tips.md#todo-6-read_parquet_once)


In [None]:
def read_parquet_once(parquet_path: Path) -> tuple[int, int]:
    """
    Read a Parquet file and return its shape.
    
    Args:
        parquet_path: Path to the Parquet file
    
    Returns:
        Tuple of (n_rows, n_cols)
    """
    # TODO: Implement this function
    # Hints:
    # 1. Use pd.read_parquet(parquet_path)
    # 2. Get the shape with df.shape
    # 3. Return the shape as a tuple
    pass

In [None]:
# TEST: read_parquet_once()
rows, cols = read_parquet_once(PARQUET_PATH)
assert rows == 200_000, "Should have 200k rows"
assert cols == 4, "Should have 4 columns"
print(f"Parquet shape: {rows} rows × {cols} cols")
print("✓ read_parquet_once() works correctly!")

### TODO 7: `save_json()`

Save a Python dictionary as a pretty-printed JSON file.

**💡 Key Steps:**
1. Open file with `with open(path, "w") as f:`
2. Write JSON with `json.dump(obj, f, indent=2)`

**Why `indent=2`?** Makes the JSON human-readable with nice formatting.

**Need more help?** See [TODO 7 detailed tips](../docs/labs/lab01_tips.md#todo-7-save_json)
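
To see what `indent=2` changes, compare the two strings below. This sketch uses `json.dumps()`, which returns a string formatted the same way `json.dump()` writes to a file; the dictionary is a toy example.

```python
import json

metrics = {"lab": "01_setup_io", "rows": 200000}  # toy dictionary

compact = json.dumps(metrics)           # everything on one line
pretty = json.dumps(metrics, indent=2)  # one key per line, 2-space indent

print(compact)
print(pretty)

# Both strings decode to the same data; only the layout differs.
same_data = json.loads(compact) == json.loads(pretty)
```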


In [None]:
def save_json(obj: dict, path: Path) -> None:
    """
    Save a dictionary as JSON.
    
    Args:
        obj: Dictionary to save
        path: Output JSON file path
    """
    # TODO: Implement this function
    # Hints:
    # 1. Open the file with open(path, "w") as f
    # 2. Use json.dump(obj, f, indent=2) to write pretty JSON
    pass

In [None]:
# TEST: save_json()
test_path = RESULTS_DIR / "test.json"
save_json({"test": "value"}, test_path)
assert test_path.exists(), "JSON file should exist"
with open(test_path) as f:
    loaded = json.load(f)
assert loaded["test"] == "value", "JSON should be saved correctly"
test_path.unlink()  # Clean up
print("✓ save_json() works correctly!")

---

## 4. Benchmark CSV Read Performance

Now let's measure how long it takes to read the CSV file.

In [None]:
print("Benchmarking CSV read (3 repeats)...")
csv_result = time_it(lambda: read_csv_once(CSV_PATH), repeats=3)

print(f"Run times: {csv_result['runs_sec']}")
print(f"Median: {csv_result['median_sec']:.4f} seconds")

## 5. Benchmark Parquet Read Performance

Same thing, but for the Parquet file.

In [None]:
print("Benchmarking Parquet read (3 repeats)...")
parquet_result = time_it(lambda: read_parquet_once(PARQUET_PATH), repeats=3)

print(f"Run times: {parquet_result['runs_sec']}")
print(f"Median: {parquet_result['median_sec']:.4f} seconds")

## 6. Compare Results

Let's calculate the speedup and compare file sizes.

In [None]:
csv_size = CSV_PATH.stat().st_size
parquet_size = PARQUET_PATH.stat().st_size

speedup = csv_result["median_sec"] / parquet_result["median_sec"]
size_ratio = csv_size / parquet_size

print("\n" + "="*50)
print("RESULTS SUMMARY")
print("="*50)
print(f"CSV file size:     {csv_size / 1_000_000:.2f} MB")
print(f"Parquet file size: {parquet_size / 1_000_000:.2f} MB")
print(f"Size ratio:        {size_ratio:.2f}x (CSV is {size_ratio:.2f}x larger)")
print()
print(f"CSV median read time:     {csv_result['median_sec']:.4f} sec")
print(f"Parquet median read time: {parquet_result['median_sec']:.4f} sec")
print(f"Speedup: {speedup:.2f}x (Parquet is {speedup:.2f}x faster)")
print("="*50)

## 7. Reflection

**Your task:** Write a short reflection (3 lines) answering these questions:

1. What surprised you about the performance difference?
2. Why do you think Parquet is faster/smaller?

Edit the cell below:

In [None]:
# TODO: Write your reflection here (replace the placeholder text)
reflection = """
Replace this text with your 3-line reflection.
Think about what you learned from this benchmark.
What will you remember about CSV vs Parquet?
""".strip()

print("Your reflection:")
print(reflection)

## 8. Save Results to JSON

Finally, let's save everything to `results/lab01_metrics.json`.

In [None]:
results = {
    "lab": "01_setup_io",
    "timestamp": pd.Timestamp.now().isoformat(),
    "dataset": {
        "rows": 200_000,
        "cols": 4,
    },
    "csv": {
        "size_bytes": csv_size,
        "size_mb": round(csv_size / 1_000_000, 2),
        "read_times_sec": csv_result["runs_sec"],
        "median_read_sec": csv_result["median_sec"],
    },
    "parquet": {
        "size_bytes": parquet_size,
        "size_mb": round(parquet_size / 1_000_000, 2),
        "read_times_sec": parquet_result["runs_sec"],
        "median_read_sec": parquet_result["median_sec"],
    },
    "comparison": {
        "size_ratio": round(size_ratio, 2),
        "speedup": round(speedup, 2),
    },
    "reflection": reflection,
}

save_json(results, METRICS_PATH)
print(f"✓ Results saved to: {METRICS_PATH}")
print("\nPreview:")
print(json.dumps(results, indent=2))

---

## 🎉 Congratulations!

You've completed Lab 01. Make sure you have:

- ✅ All TODO functions implemented
- ✅ All cells executed without errors
- ✅ `results/lab01_metrics.json` created
- ✅ Your reflection written

**Submit these files:**
1. `notebooks/lab01_setup_io.ipynb` (this file)
2. `results/lab01_metrics.json`

See you in Lab 02! 🚀

---

## 📚 What You Learned

### Technical Skills ✅
- Setting up a Python environment with `uv`
- Generating synthetic data with `numpy`
- Working with `pandas` DataFrames
- Reading/writing CSV and Parquet files
- Measuring performance with `time.perf_counter()`
- Saving structured data as JSON

### Concepts ✅
- Why Parquet is faster and smaller than CSV
- Columnar vs row-based storage
- Using median vs mean for benchmarking
- Importance of reproducibility (random seeds)
- Cross-platform path handling with `pathlib`

### Best Practices ✅
- Writing clear docstrings
- Testing code incrementally
- Using type hints
- Debugging with print statements
- Reading error messages carefully

---

## 🚀 Next Steps

1. ✅ Verify all test cells passed
2. ✅ Check `results/lab01_metrics.json` exists and looks correct
3. ✅ Review your reflection - is it thoughtful?
4. ✅ Submit both files (notebook + JSON)

**Great job completing Lab 01!** You're ready for Lab 02! 🎉
