# Lab 03: Data Types

**Course:** Big Data

---

## 👤 Student Information

**Name:** `Your Name Here`

**Date:** `DD/MM/YYYY`

---

**Goal:** Master data type optimization to achieve significant memory and performance improvements.

## Learning Objectives

By the end of this lab, you will be able to:

1. **Measure Memory Usage**: Accurately measure DataFrame memory consumption
2. **Analyze Data Ranges**: Identify value ranges to determine optimal types
3. **Optimize Data Types**: Reduce memory usage 5-10x through smart dtype selection
4. **Measure Performance Impact**: Benchmark the speed improvements from type optimization

## Instructions

1. **Fill in your information above** before starting the lab
2. Read each cell carefully before running it
3. Implement the **TODO functions** when you see them
4. Run cells **from top to bottom** (Shift+Enter)
5. Check that output makes sense after each cell

---

## 📚 Libraries Used in This Lab

### Core Libraries

- **`pandas`** - DataFrame operations and I/O
- **`numpy`** - Random data generation
- **`time`** - Performance measurement

### Why Focus on Data Types?

**Real-world example**: A 100M row sales dataset

| Approach | RAM Usage | Groupby Time |
|----------|-----------|-------------|
| Naive (default types) | 80 GB | 45 sec |
| Optimized (proper types) | 8 GB | 5 sec |

**That's 10x less memory and 9x faster!**

---

## Imports and Setup

In [None]:
import json
import time
from pathlib import Path

import pandas as pd
import numpy as np

print("✓ All imports successful!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Define Paths

In [None]:
# Base directories
DATA_RAW = Path("../data/raw")
RESULTS_DIR = Path("../results")

# File paths for this lab
ECOMMERCE_CSV = DATA_RAW / "ecommerce_5m.csv"
METRICS_PATH = RESULTS_DIR / "lab03_metrics.json"

# Ensure directories exist
DATA_RAW.mkdir(parents=True, exist_ok=True)
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

print("Paths defined:")
print(f"  Source CSV: {ECOMMERCE_CSV}")
print(f"  Metrics: {METRICS_PATH}")

---

## Section A. Dataset Generation (15 min)

First, we generate a synthetic e-commerce dataset with 5 million rows.

**Columns:**
- `order_id`: Unique order identifier (0 to 4,999,999)
- `product_id`: Product ID (1-50,000)
- `category`: Product category (15 unique values)
- `price`: Product price (0.01-999.99)
- `quantity`: Quantity ordered (1-100)
- `country`: Customer country (30 unique values)
- `timestamp`: Order timestamp

### TODO 1: `generate_ecommerce_data()`

Generate a synthetic e-commerce dataset.

**💡 Hints:**
- Use `np.random.seed(seed)` for reproducibility
- Use `np.arange(n_rows)` for order_id
- Use `np.random.randint()` for integer columns
- Use `np.random.choice()` for category/country columns
- Use `np.random.uniform()` for price
- Use `pd.date_range()` for timestamps
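
The calls named in the hints can be tried on a tiny sample first. This is only a sketch, not the full solution: the three category names, the start date, and the one-minute frequency are placeholder assumptions, and the lab function still needs the full column set, the CSV export, and the metadata dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical tiny sample; the lab itself uses 5 million rows.
n_rows = 5
np.random.seed(42)  # same seed -> same random values on every run

demo_categories = ["Electronics", "Clothing", "Books"]  # placeholder names

sample = pd.DataFrame({
    "order_id": np.arange(n_rows),                            # 0, 1, ..., n_rows - 1
    "product_id": np.random.randint(1, 50_001, size=n_rows),  # upper bound is exclusive
    "category": np.random.choice(demo_categories, size=n_rows),
    "price": np.random.uniform(0.01, 999.99, size=n_rows).round(2),
    "timestamp": pd.date_range("2024-01-01", periods=n_rows, freq="min"),
})
print(sample)
```

Note that `np.random.randint()` excludes its upper bound, so an inclusive range of 1 to 50,000 needs `50_001` as the second argument.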

In [None]:
def generate_ecommerce_data(path: Path, n_rows: int = 5_000_000, seed: int = 42) -> dict:
    """
    Generate a synthetic e-commerce dataset.
    
    Args:
        path: Where to save the CSV
        n_rows: Number of rows (default 5 million)
        seed: Random seed for reproducibility
    
    Returns:
        Dictionary with: {"rows": int, "cols": int, "size_mb": float}
    """
    # TODO: Implement this function
    # Step 1: Set random seed
    # Step 2: Define categories list (15 items)
    # Step 3: Define countries list (30 items)
    # Step 4: Generate each column using numpy
    # Step 5: Create DataFrame
    # Step 6: Save to CSV (index=False)
    # Step 7: Return metadata dict
    pass

In [None]:
# Generate the dataset (only if it doesn't exist)
if not ECOMMERCE_CSV.exists():
    print("Generating 5 million row e-commerce dataset...")
    print("(This may take 1-2 minutes)\n")
    
    start = time.perf_counter()
    metadata = generate_ecommerce_data(ECOMMERCE_CSV, n_rows=5_000_000)
    elapsed = time.perf_counter() - start
    
    print(f"Generated in {elapsed:.1f} seconds")
    print(f"Rows: {metadata['rows']:,}")
    print(f"Size: {metadata['size_mb']:.1f} MB")
else:
    size_mb = ECOMMERCE_CSV.stat().st_size / 1e6
    print(f"Dataset already exists: {size_mb:.1f} MB")

---

## Section B. Baseline Measurement (15 min)

Let's see how much memory pandas uses with default dtypes.

### Part B1: Load with Default Types

In [None]:
# Load with default dtypes
print("Loading CSV with default dtypes...")
start = time.perf_counter()
df_baseline = pd.read_csv(ECOMMERCE_CSV)
load_time_baseline = time.perf_counter() - start

print(f"Load time: {load_time_baseline:.2f} seconds")
print(f"\nDataFrame shape: {df_baseline.shape}")
print(f"\nColumn dtypes:")
print(df_baseline.dtypes)

### TODO 2: `measure_memory()`

Measure memory usage of a DataFrame.

**💡 Hints:**
- Use `df.memory_usage(deep=True)` for accurate measurement
- The `deep=True` flag is essential for object (string) columns
- Convert bytes to MB by dividing by 1e6
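
To see why `deep=True` matters, compare both modes on a small frame. This is a sketch with made-up data; the exact byte counts for the string column depend on the pandas version and on how strings are stored.

```python
import pandas as pd

# Small illustrative frame: one numeric column, one string column.
demo = pd.DataFrame({
    "quantity": [1, 2, 3, 4],
    "country": ["Spain", "France", "Spain", "Germany"],
})

shallow = demo.memory_usage(deep=False)  # for object columns, counts only the references
deep = demo.memory_usage(deep=True)      # also counts the string objects themselves

print(shallow)
print(deep)
print(f"Total (deep): {deep.sum() / 1e6:.6f} MB")
```

The result is a Series indexed by column name (plus an `Index` entry), so per-column values can be read with `deep["country"]` and the total with `deep.sum()`.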

In [None]:
def measure_memory(df: pd.DataFrame) -> dict:
    """
    Measure memory usage of a DataFrame.
    
    Args:
        df: DataFrame to measure
    
    Returns:
        Dictionary with:
        - 'total_mb': total memory in MB
        - 'columns': dict with per-column info (dtype, memory_mb, nunique)
    """
    # TODO: Implement this function
    # 1. Use df.memory_usage(deep=True) to get memory per column
    # 2. Calculate total memory in MB
    # 3. For each column, record dtype, memory, and nunique count
    pass

In [None]:
# Measure baseline memory
baseline_memory = measure_memory(df_baseline)
print(f"\nTotal memory: {baseline_memory['total_mb']:.2f} MB")
print("\nPer-column breakdown:")
for col, info in baseline_memory['columns'].items():
    print(f"  {col}: {info['dtype']} - {info['memory_mb']:.2f} MB ({info['nunique']:,} unique)")

### 💡 Key Insight: Default Types Are Wasteful

Notice how pandas uses:
- `int64` (8 bytes) for ALL integers, even small ones
- `float64` (8 bytes) for ALL floats
- `object` (~50+ bytes per value) for strings

This is **extremely wasteful** when your data has limited ranges!
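
The per-value sizes above can be checked directly. The sketch below uses NumPy's `itemsize` for numeric types and `sys.getsizeof` for a single Python string; the string size is specific to the Python implementation and version, so treat it as indicative only.

```python
import sys
import numpy as np

# Bytes needed to store one value of each fixed-width numeric type.
sizes = {
    name: np.dtype(name).itemsize
    for name in ["int8", "int16", "int32", "int64", "float32", "float64"]
}
for name, n_bytes in sizes.items():
    print(f"{name}: {n_bytes} bytes per value")

# A Python str carries object overhead on top of its characters.
str_bytes = sys.getsizeof("Spain")
print(f"'Spain' as a Python str: {str_bytes} bytes")
```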

---

## Section C. Type Analysis & Optimization (30 min)

### Part C1: Analyze Value Ranges

To choose optimal types, we need to understand our data's actual value ranges.

### TODO 3: `analyze_column_ranges()`

Analyze each column to determine the optimal type.

**💡 Hints:**
- For numeric columns: use `.min()`, `.max()`, `.nunique()`
- For string columns: use `.nunique()`, `.str.len().max()`
- Compare ranges against the Integer Type Reference Table below
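
As a rough illustration of the calls in the hints, here they are applied to two columns of a made-up frame. `pd.api.types.is_numeric_dtype` is one way to tell numeric columns from string columns; how you loop over all columns and shape the result dictionary is left to your implementation.

```python
import pandas as pd

# Made-up data, only to show what each call returns.
demo = pd.DataFrame({
    "quantity": [1, 50, 100, 50],
    "country": ["Spain", "France", "Spain", "Germany"],
})

# Numeric column: the range decides which integer or float type fits.
assert pd.api.types.is_numeric_dtype(demo["quantity"])
numeric_info = {
    "min": demo["quantity"].min(),
    "max": demo["quantity"].max(),
    "nunique": demo["quantity"].nunique(),
}

# String column: few unique values suggests the category dtype.
string_info = {
    "nunique": demo["country"].nunique(),
    "max_len": demo["country"].str.len().max(),
}

print(numeric_info)
print(string_info)
```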

In [None]:
def analyze_column_ranges(df: pd.DataFrame) -> dict:
    """
    Analyze value ranges for each column.
    
    Args:
        df: DataFrame to analyze
    
    Returns:
        Dictionary with analysis for each column:
        - For numeric: {'min': x, 'max': y, 'nunique': n}
        - For string: {'nunique': n, 'max_len': l, 'sample': [...]}
    """
    # TODO: Implement this function
    # For each column, determine if it's numeric or string
    # Numeric: min, max, nunique
    # String: nunique, max string length, sample values
    pass

In [None]:
# Analyze value ranges
ranges = analyze_column_ranges(df_baseline)

print("Column Analysis:")
print("=" * 60)
for col, info in ranges.items():
    if 'min' in info:
        print(f"{col}: {info['min']} to {info['max']} ({info['nunique']:,} unique)")
    else:
        print(f"{col}: {info['nunique']} unique, max length {info['max_len']}")

### 📊 Integer Type Reference Table

| Type | Min | Max | Bytes |
|------|-----|-----|-------|
| int8 | -128 | 127 | 1 |
| **uint8** | 0 | **255** | **1** |
| int16 | -32,768 | 32,767 | 2 |
| **uint16** | 0 | **65,535** | **2** |
| int32 | -2.1B | 2.1B | 4 |
| **uint32** | 0 | **4.3B** | **4** |
| int64 | -9.2Q | 9.2Q | 8 |

**Rule**: Use the **smallest** type that fits your data range!
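
The rule can be automated with `np.iinfo`, which reports the limits listed in the table. The helper below is hypothetical (it is not part of the lab's required functions) and the example ranges are deliberately not the lab's columns.

```python
import numpy as np

def smallest_int_dtype(min_value: int, max_value: int) -> str:
    """Hypothetical helper: first integer dtype whose range covers [min_value, max_value]."""
    # Ordered by size; unsigned comes first so non-negative data gets the larger positive range.
    candidates = ["uint8", "int8", "uint16", "int16", "uint32", "int32", "uint64", "int64"]
    for name in candidates:
        info = np.iinfo(name)
        if info.min <= min_value and max_value <= info.max:
            return name
    raise ValueError("range does not fit in a 64-bit integer")

print(smallest_int_dtype(0, 200))      # uint8
print(smallest_int_dtype(-5, 200))     # int16
print(smallest_int_dtype(0, 70_000))   # uint32
```

Leave some headroom if the data is expected to grow: a column that fits `uint16` today overflows once values exceed 65,535.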

### Part C2: Determine Optimal Types

Based on your analysis, fill in the optimal types:

### TODO 4: `get_optimal_dtypes()`

Return a dictionary mapping column names to optimal dtype strings.

In [None]:
def get_optimal_dtypes() -> dict:
    """
    Return the optimal dtypes for the ecommerce dataset.
    
    Returns:
        Dictionary mapping column names to dtype strings
    """
    # TODO: Fill in the optimal types based on your analysis
    return {
        'order_id': '???',      # 0 to 5M - which int type?
        'product_id': '???',    # 1 to 50000 - which int type?
        'category': '???',      # 15 unique strings - category?
        'price': '???',         # 0.01 to 999.99 - float32 or float64?
        'quantity': '???',      # 1 to 100 - which int type?
        'country': '???',       # 30 unique strings - category?
    }

In [None]:
optimal_dtypes = get_optimal_dtypes()
print("Optimal dtypes:")
for col, dtype in optimal_dtypes.items():
    print(f"  {col}: {dtype}")

### Part C3: Load with Optimized Types

### TODO 5: `load_with_optimized_dtypes()`

Load the CSV with optimized dtypes.

**💡 Hints:**
- Pass `dtype=` parameter to `pd.read_csv()`
- Use `parse_dates=['timestamp']` for the timestamp column
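
Both parameters can be tested without touching the large file by reading from an in-memory string. The column names and dtypes in this sketch are invented for illustration and are not the answers for the e-commerce dataset.

```python
import io
import pandas as pd

# Invented three-row CSV, held in memory instead of on disk.
csv_text = """store_id,city,opened_at
7,Madrid,2024-01-01 00:00:00
12,Paris,2024-01-01 00:01:00
7,Madrid,2024-01-01 00:02:00
"""

demo = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"store_id": "uint8", "city": "category"},  # applied while parsing
    parse_dates=["opened_at"],                         # converted to datetime64
)
print(demo.dtypes)
```

Passing `dtype=` at read time avoids first materializing the wide default types and converting afterwards.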

In [None]:
def load_with_optimized_dtypes(path: Path, dtypes: dict) -> pd.DataFrame:
    """
    Load CSV with optimized dtypes.
    
    Args:
        path: Path to CSV file
        dtypes: Dictionary mapping column names to dtype strings
    
    Returns:
        DataFrame with optimized dtypes
    """
    # TODO: Implement this function
    # Use pd.read_csv with dtype and parse_dates parameters
    pass

In [None]:
# Load with optimized dtypes
print("Loading CSV with optimized dtypes...")
start = time.perf_counter()
df_optimized = load_with_optimized_dtypes(ECOMMERCE_CSV, optimal_dtypes)
load_time_optimized = time.perf_counter() - start

print(f"Load time: {load_time_optimized:.2f} seconds")
print(f"\nColumn dtypes:")
print(df_optimized.dtypes)

In [None]:
# Measure optimized memory
optimized_memory = measure_memory(df_optimized)

print(f"Baseline memory: {baseline_memory['total_mb']:.2f} MB")
print(f"Optimized memory: {optimized_memory['total_mb']:.2f} MB")
print(f"\nReduction: {baseline_memory['total_mb'] / optimized_memory['total_mb']:.1f}x")

print("\nPer-column comparison:")
for col in baseline_memory['columns']:
    before = baseline_memory['columns'][col]['memory_mb']
    after = optimized_memory['columns'][col]['memory_mb']
    reduction = before / after if after > 0 else 0
    print(f"  {col}: {before:.1f} MB → {after:.1f} MB ({reduction:.1f}x)")

### 💡 Key Insight: Category dtype

The `category` dtype is especially powerful for repeated strings:

```python
# Internally stored as:
# Dictionary: {0: 'Electronics', 1: 'Clothing', ...}
# Codes: [0, 1, 0, 2, 1, ...]  (small integers!)
```

**Benefits:**
- Each distinct string is stored **once**; every row holds only a small integer code
- Groupby operates on integers, not strings
- String comparisons use integer codes
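
The two internal pieces can be inspected through the `.cat` accessor. A small sketch with made-up labels follows; note that when converting with `astype("category")`, pandas sorts the categories, so the actual codes differ from the simplified numbering in the comment above.

```python
import pandas as pd

labels = pd.Series(["Electronics", "Clothing", "Electronics", "Books", "Clothing"])
as_category = labels.astype("category")

# Unique labels, stored once (sorted: Books, Clothing, Electronics).
print(as_category.cat.categories)

# One small integer per row, pointing into the categories: 2, 1, 2, 0, 1.
print(as_category.cat.codes)
```

With this few categories the codes are stored as `int8`, which is one byte per row.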

---

## Section D. Performance Impact (15 min)

Smaller types aren't just about memory; they are often **faster** too!

### TODO 6: `benchmark_operation()`

Benchmark an operation on both baseline and optimized DataFrames.
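
One possible timing pattern is sketched below on synthetic data. `time_call` is a hypothetical helper, not a required part of the lab; it repeats the call and keeps the best time to reduce noise. The measured speedup depends on the machine and the pandas version, so no particular result is implied.

```python
import time
import numpy as np
import pandas as pd

def time_call(func, repeats: int = 3) -> float:
    """Hypothetical helper: best wall-clock time of func() over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Synthetic data: 100,000 values with a 3-value string key.
rng = np.random.default_rng(0)
labels = pd.Series(rng.choice(["A", "B", "C"], size=100_000))
labels_cat = labels.astype("category")  # converted once, outside the timed code
values = pd.Series(rng.random(100_000))

string_sec = time_call(lambda: values.groupby(labels).sum())
category_sec = time_call(lambda: values.groupby(labels_cat, observed=True).sum())
speedup = string_sec / category_sec

print(f"string keys:   {string_sec:.4f} sec")
print(f"category keys: {category_sec:.4f} sec")
print(f"speedup:       {speedup:.2f}x")
```

Keeping the dtype conversion outside the timed lambda ensures only the operation itself is measured.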

In [None]:
def benchmark_operation(df_baseline: pd.DataFrame, df_optimized: pd.DataFrame,
                        operation: str) -> dict:
    """
    Benchmark an operation on baseline vs optimized DataFrames.
    
    Args:
        df_baseline: DataFrame with default dtypes
        df_optimized: DataFrame with optimized dtypes
        operation: One of 'groupby_sum', 'filter', 'sort'
    
    Returns:
        Dictionary with baseline_sec, optimized_sec, speedup
    """
    # TODO: Implement this function
    # For 'groupby_sum': df.groupby('category')['price'].sum()
    # For 'filter': df[df['country'] == 'Spain']
    # For 'sort': df.sort_values('price')
    # Time each operation on both DataFrames and calculate speedup
    pass

In [None]:
# Benchmark groupby operation
print("Benchmarking operations...\n")

groupby_results = benchmark_operation(df_baseline, df_optimized, 'groupby_sum')
print(f"Groupby Sum:")
print(f"  Baseline: {groupby_results['baseline_sec']:.4f} sec")
print(f"  Optimized: {groupby_results['optimized_sec']:.4f} sec")
print(f"  Speedup: {groupby_results['speedup']:.2f}x")

In [None]:
# Benchmark filter operation
filter_results = benchmark_operation(df_baseline, df_optimized, 'filter')
print(f"Filter (country == 'Spain'):")
print(f"  Baseline: {filter_results['baseline_sec']:.4f} sec")
print(f"  Optimized: {filter_results['optimized_sec']:.4f} sec")
print(f"  Speedup: {filter_results['speedup']:.2f}x")

In [None]:
# Benchmark sort operation
sort_results = benchmark_operation(df_baseline, df_optimized, 'sort')
print(f"Sort by price:")
print(f"  Baseline: {sort_results['baseline_sec']:.4f} sec")
print(f"  Optimized: {sort_results['optimized_sec']:.4f} sec")
print(f"  Speedup: {sort_results['speedup']:.2f}x")

### TODO 7: `calculate_savings()`

Calculate the total memory and performance savings.
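
The arithmetic behind the three summary values, worked through with made-up numbers (not results from this lab):

```python
# Made-up inputs, for illustration only.
baseline_mb = 1200.0
optimized_mb = 150.0
speedups = [4.0, 2.5, 1.0]  # one entry per benchmarked operation

memory_saved_mb = baseline_mb - optimized_mb          # absolute saving
memory_reduction_factor = baseline_mb / optimized_mb  # "how many times smaller"
avg_speedup = sum(speedups) / len(speedups)           # arithmetic mean

print(memory_saved_mb, memory_reduction_factor, avg_speedup)
```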

In [None]:
def calculate_savings(baseline_memory: dict, optimized_memory: dict,
                      benchmark_results: list) -> dict:
    """
    Calculate total savings from optimization.
    
    Args:
        baseline_memory: Memory info for baseline DataFrame
        optimized_memory: Memory info for optimized DataFrame
        benchmark_results: List of benchmark result dicts
    
    Returns:
        Dictionary with:
        - memory_saved_mb: MB of memory saved
        - memory_reduction_factor: how many times smaller
        - avg_speedup: average speedup across all operations
    """
    # TODO: Implement this function
    # Calculate memory saved and average speedup
    pass

In [None]:
# Calculate total savings
savings = calculate_savings(
    baseline_memory, 
    optimized_memory,
    [groupby_results, filter_results, sort_results]
)

print("\n" + "=" * 50)
print("TOTAL SAVINGS SUMMARY")
print("=" * 50)
print(f"Memory saved: {savings['memory_saved_mb']:.1f} MB")
print(f"Memory reduction: {savings['memory_reduction_factor']:.1f}x smaller")
print(f"Average speedup: {savings['avg_speedup']:.2f}x faster")

---

## Section E. Reflection & Save Results (15 min)

### Reflection

**Your task:** Write a short reflection (3-5 sentences) answering:

1. What was the biggest memory reduction you achieved on a single column?
2. Which dtype change had the most impact: integer downcasting or using `category`?
3. What will you do differently when working with large datasets in the future?

In [None]:
# TODO: Write your reflection here
reflection = """
Replace this text with your reflection.
Think about what you learned about data types.
What will you do differently in your future projects?
""".strip()

print("Your reflection:")
print(reflection)

### Save Results

In [None]:
# Compile all results
results = {
    "lab": "03_data_types",
    "timestamp": pd.Timestamp.now().isoformat(),
    "dataset": {
        "rows": len(df_optimized),
        "columns": len(df_optimized.columns),
    },
    "memory": {
        "baseline_mb": baseline_memory['total_mb'],
        "optimized_mb": optimized_memory['total_mb'],
        "reduction_factor": round(baseline_memory['total_mb'] / optimized_memory['total_mb'], 2),
        "saved_mb": round(baseline_memory['total_mb'] - optimized_memory['total_mb'], 2),
    },
    "performance": {
        "groupby_speedup": groupby_results['speedup'],
        "filter_speedup": filter_results['speedup'],
        "sort_speedup": sort_results['speedup'],
        "avg_speedup": savings['avg_speedup'],
    },
    "dtypes_used": optimal_dtypes,
    "reflection": reflection,
}

# Save to JSON
with open(METRICS_PATH, "w") as f:
    json.dump(results, f, indent=2, default=str)

print(f"✓ Results saved to: {METRICS_PATH}")

---

## 🎉 Lab Complete!

### What You Learned

1. **Default types are wasteful**: Pandas uses int64/float64/object by default
2. **Analyze before optimizing**: Check min/max/nunique to choose the right type
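3. **Integer sizing matters**: Use uint8/uint16/uint32 for non-negative data (or the signed equivalents when negatives occur), based on actual ranges
4. **Category is powerful**: Perfect for repeated strings (<50% unique values)
5. **Smaller is often faster**: Less memory to move usually means faster operations, but measure to confirm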

### Optimization Checklist

- ✅ Use `df.memory_usage(deep=True)` to measure accurately
- ✅ Check `.min()` and `.max()` for numeric columns
- ✅ Check `.nunique()` for potential `category` columns
- ✅ Use smallest int type that fits your data
- ✅ Use `category` for strings with <50% unique values
- ✅ Use `float32` when about 7 significant digits are enough; keep `float64` otherwise
- ✅ Specify dtypes when reading CSV with `dtype=`
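
Before switching a column to `float32`, it helps to see what precision is given up. A brief sketch with example values chosen for illustration:

```python
import numpy as np

# A price with two decimals is stored only approximately in float32.
price = 999.99
price32 = float(np.float32(price))
rounding_error = abs(price32 - price)
print(f"float32 stores {price} as {price32!r} (error {rounding_error:.2e})")

# float32 has a 24-bit significand, so integers above 2**24 start to collide.
collides = np.float32(16_777_217) == np.float32(16_777_216)
print(f"16,777,217 and 16,777,216 are equal in float32: {collides}")
```

Errors this small rarely matter for a single value, but they can accumulate when summing millions of rows, which is one reason to keep `float64` for monetary totals.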

### Files to Submit

1. `notebooks/lab03_data_types.ipynb` (this notebook)
2. `results/lab03_metrics.json`

---

**Next Lab**: We'll explore efficient storage formats (Parquet, Feather) and partitioning strategies!