# Module 2: Data Types & Structures (Python)

This module is about mastering Python’s **built-in data types** and choosing the right **data structure** for the job.

## What you’ll be able to do after this module
- Read messy inputs and normalize them into clean Python objects.
- Use the **Big Six** confidently: **int**, **str**, **list**, **tuple**, **set**, **dict**.
- Understand **mutability**, **references**, and the most common copy bugs.
- Reason about performance with practical **Big‑O** intuition.
- Use `collections` containers when built-ins aren’t ideal.
- Add basic **typing** to document structure (hints are not enforced at runtime, so you’ll also see where validation tools fit).

## How to use this notebook
- Run cells top-to-bottom once.
- Each section has: **theory → examples → mini-exercises**.

## Table of Contents
1. Strings & Numbers: parsing, formatting, money-safe math
2. Logical Python: `if/for/while/match` with data structures
3. Lists: dynamic arrays, slicing, comprehensions, copy pitfalls
4. Tuples: immutability, unpacking, dictionary keys
5. Sets: uniqueness, membership, set algebra, `frozenset`
6. Dictionaries: hash maps, iteration patterns, grouping
7. Mutability & Memory: references, shallow vs deep copy, defaults
8. `collections`: `deque`, `Counter`, `defaultdict`, `namedtuple`
9. Typing & Data Integrity: hints, `TypedDict`, (optional) Pydantic

---

### Dataset theme (used throughout)
We’ll use “market data” examples because they naturally involve parsing strings, grouping records, enforcing uniqueness, and doing lookups.

## 0. Setup (helpers + sample input)

We’ll reuse the same tiny “raw feed” across sections.

**Key idea:** most real programs start with strings and end with structured objects.

In the next cell we’ll define:
- a few sample records (`RAW_FEEDS`)
- a small printing helper (`banner`)
- two record types: `TickDict` (a typed dict) and `Tick` (a frozen dataclass)

In [None]:
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import TypedDict


def banner(title: str) -> None:
    print(f"\n{'=' * 10} {title} {'=' * 10}")


RAW_FEEDS = [
    "  $AAPL : 150.2544 : 2023-10-01 : tech,hardware  ",
    " $MSFT : 310.5 : 2023-10-01 : tech,software ",
    " $GOOG : 2800.00 : 2023-10-02 : tech,search ",
    " $TSLA : 700.1 : 2023-10-02 : auto,ev ",
]


class TickDict(TypedDict):
    symbol: str
    price: Decimal
    day: date
    tags: set[str]


@dataclass(frozen=True, slots=True)
class Tick:
    symbol: str
    price: Decimal
    day: date
    tags: frozenset[str]


banner("Setup")
print(f"Loaded RAW_FEEDS: {len(RAW_FEEDS)} records")



In [None]:
# A tiny reminder about types
banner("Types")

raw = "14"
print(raw, type(raw))
print(int(raw), type(int(raw)))

# Strings have many inspection helpers
print("14".isdigit(), "AAPL".isalpha(), "AAPL".isalnum())


## 1. Strings & Numbers (parsing is the real world)

### Theory
- **Strings (`str`) are immutable**: every “modification” creates a new string.
- Most pipelines look like: **raw string → normalize → split → convert types → validate**.
- For money-like values, avoid `float` when precision matters; prefer `Decimal`.

### Common tools
- **Cleaning**: `strip()`, `lower()/upper()`, `replace()`, `removeprefix()`
- **Parsing**: `split()`, `partition()`, `rsplit()`
- **Validation**: `startswith()`, `isalnum()`, `in`, `find()`
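
The tools listed above can be tried on a single line before building a full parser. The snippet below is a minimal sketch; the sample `line` is a hypothetical value shaped like the feed records in this module, not part of the dataset.

```python
# Hypothetical sample line, shaped like the feed records in this module.
line = "  $aapl : 150.2544 : 2023-10-01  "

# Cleaning: each call returns a NEW string; `line` itself never changes.
clean = line.strip().removeprefix("$")  # removeprefix needs Python 3.9+
print(repr(clean))

# partition splits at the FIRST separator and always returns three parts.
symbol, sep, rest = clean.partition(":")
print(symbol.strip().upper(), "|", rest.strip())

# rsplit with maxsplit=1 splits once, starting from the RIGHT.
head, day = clean.rsplit(":", 1)
print(day.strip())

# find returns -1 when the substring is missing (index() would raise instead).
print(clean.find("2023"), clean.find("XYZ"))
```

`partition` is handy when only the first field matters, while `rsplit(..., 1)` peels off the last field without touching separators earlier in the string.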

In the next cell we’ll parse a messy feed string into a clean object.

In [17]:
banner("Parse a raw feed")


def parse_feed(line: str) -> TickDict:
    # Example input: "  $AAPL : 150.2544 : 2023-10-01 : tech,hardware  "
    clean = line.strip()
    clean = clean.removeprefix("$")

    parts = [p.strip() for p in clean.split(":")]
    if len(parts) != 4:
        raise ValueError(f"Bad record (expected 4 fields): {line!r}")

    symbol_raw, price_raw, day_raw, tags_raw = parts

    symbol = symbol_raw.upper()
    price = Decimal(price_raw)  # exact decimal representation

    yyyy, mm, dd = (int(x) for x in day_raw.split("-"))
    day = date(yyyy, mm, dd)

    tags = {t.strip().lower() for t in tags_raw.split(",") if t.strip()}

    return {"symbol": symbol, "price": price, "day": day, "tags": tags}


parsed: list[TickDict] = [parse_feed(line) for line in RAW_FEEDS]
print(parsed[0])

# Convert dict → immutable record (nice for caching / keys / safety)
ticks: list[Tick] = [
    Tick(
        symbol=t["symbol"],
        price=t["price"],
        day=t["day"],
        tags=frozenset(t["tags"]),
    )
    for t in parsed
]

print(ticks[0])



## 1.1 Money math: `float` vs `Decimal`

### Theory
- `float` is **binary floating point**. Many decimal fractions (such as `0.1`) can’t be represented exactly.
- `Decimal` is **base‑10 decimal arithmetic** (better for prices, currency, accounting).

Rule of thumb:
- **Use `float`** for scientific/approximate values.
- **Use `Decimal`** for money-like values where rounding must be predictable.

In [None]:
banner("Float vs Decimal")

# Classic float surprise
print(0.1 + 0.2)

# Decimal: predictable base-10 math
print(Decimal("0.1") + Decimal("0.2"))

# Formatting: f-strings work with both
p = Decimal("150.2544")
print(f"Two decimals: {p:.2f}")

# If you need strict rounding rules, quantize is explicit
from decimal import ROUND_HALF_UP

usd = p.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print("USD rounded:", usd)
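
Two pitfalls are worth seeing next to the rule of thumb above: building a `Decimal` from a `float`, and leaving the rounding mode implicit. The snippet below is a small sketch with made-up values, not data from the feed.

```python
from decimal import ROUND_HALF_EVEN, ROUND_HALF_UP, Decimal

# Build Decimals from strings, not floats: a float has already lost exactness.
from_float = Decimal(0.1)
from_text = Decimal("0.1")
print(from_float == from_text)

# Float error accumulates across repeated addition; the Decimal sum stays exact.
float_total = sum([0.1] * 10)
decimal_total = sum([Decimal("0.1")] * 10)
print(float_total == 1.0, decimal_total == Decimal("1"))

# The rounding mode matters on exact ties.
tie = Decimal("2.665")
cent = Decimal("0.01")
half_up = tie.quantize(cent, rounding=ROUND_HALF_UP)      # ties go away from zero
half_even = tie.quantize(cent, rounding=ROUND_HALF_EVEN)  # ties go to the even digit
print(half_up, half_even)
```

Passing `rounding=` explicitly keeps the result independent of whatever the active decimal context happens to be.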

## 2. Logical Python (control flow with data)

Data structures shine when you combine them with control flow.

### Theory
- **`if/elif/else`**: choose a path based on conditions.
- **`for` loops**: iterate over sequences, sets, dicts.
- **`while` loops**: repeat until a condition changes.
- **`match/case` (Python 3.10+)**: structural pattern matching. It can be used like a “switch-case”, and it can also match on the shape of data (tuples, dicts, classes).

In the next cell we’ll filter and categorize ticks.

In [None]:
banner("Control flow")

# if / elif / else
for t in ticks:
    if t.price >= Decimal("1000"):
        tier = "HIGH"
    elif t.price >= Decimal("200"):
        tier = "MID"
    else:
        tier = "LOW"
    print(t.symbol, t.price, tier)

# for loops over dict-like structures (here: we build a dict)
prices_by_symbol: dict[str, list[Decimal]] = {}
for t in ticks:
    prices_by_symbol.setdefault(t.symbol, []).append(t.price)

print("prices_by_symbol:", prices_by_symbol)

# while loops (example: pop from the end of a list until it is empty)
pending = ["A", "B", "C"]
while pending:
    print("processing:", pending.pop())

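# match/case ("switch")
# Categorize on one tag. Sets are unordered, so next(iter(t.tags)) could
# return any tag; pick the first match from a fixed list of known tags instead.
KNOWN_TAGS = ("tech", "auto", "ev")
for t in ticks:
    tag = next((k for k in KNOWN_TAGS if k in t.tags), "unknown")
    match tag:
        case "tech":
            group = "TECH"
        case "auto" | "ev":
            group = "AUTO"
        case _:
            group = "OTHER"
    print(t.symbol, sorted(t.tags), "->", group)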

## 3. Lists (dynamic arrays)

### Theory
- A `list` is a **dynamic array**: fast indexing, fast append at the end.
- Inserting/removing near the front/middle shifts elements → can be slower.
- Lists store **references** to objects (important for copying).

### Performance intuition
- `lst.append(x)`: typically **O(1)** (amortized)
- `lst[i]`: **O(1)**
- `x in lst`: **O(n)** (linear scan)
- `lst.insert(0, x)` / `lst.pop(0)`: **O(n)**

Next cell: slicing, comprehensions, and safe copying.

In [None]:
banner("Lists")

prices: list[Decimal] = [t.price for t in ticks]
print("prices:", prices)

# Slicing: [start:stop:step]
print("first 2:", prices[:2])
print("last 2:", prices[-2:])
print("every 2nd:", prices[::2])
print("reversed:", prices[::-1])

# Comprehensions: transform + filter
high_prices = [p for p in prices if p >= Decimal("500")]
print("high_prices:", high_prices)

# Copying: understand aliasing
original = [1, 2, 3]
alias = original          # points to same list
copy1 = original.copy()   # shallow copy
copy2 = original[:]       # also shallow copy

original.append(99)
print("original:", original)
print("alias:", alias)
print("copy1:", copy1)

# Mini-exercise
# 1) Create a list of symbols in uppercase from ticks
# 2) Create a list of (symbol, price) pairs for prices >= 500
symbols_upper = [t.symbol.upper() for t in ticks]
expensive_pairs = [(t.symbol, t.price) for t in ticks if t.price >= Decimal("500")]
print("symbols_upper:", symbols_upper)
print("expensive_pairs:", expensive_pairs)
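
Because lists store references, a shallow copy only protects the outer list. The sketch below uses made-up nested price lists to show where a shallow copy still shares data and how `copy.deepcopy` avoids that.

```python
import copy

# A nested list: the outer list holds references to the inner lists.
grid = [[100, 105], [102, 108]]

shallow = grid.copy()        # new outer list, SAME inner lists
deep = copy.deepcopy(grid)   # new outer list AND new inner lists

grid[0].append(999)          # mutate an inner list through the original

print("grid:   ", grid)
print("shallow:", shallow)   # also shows 999, because the inner list is shared
print("deep:   ", deep)      # unchanged

# The same aliasing hides in list multiplication.
rows_bad = [[0] * 2] * 3                # three references to ONE row
rows_ok = [[0] * 2 for _ in range(3)]   # three independent rows
rows_bad[0][0] = 1
rows_ok[0][0] = 1
print(rows_bad, rows_ok)
```

For flat lists of immutable values (numbers, strings, `Decimal`), a shallow copy is usually all that is needed; the difference only shows up once the elements themselves are mutable.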

## 4. Tuples (immutability + structure)

### Theory
- A `tuple` is an **immutable sequence**.
- Immutability makes tuples:
  - safe to share (harder to accidentally modify)
  - usable as **dictionary keys** (if all elements are hashable)
  - often slightly more memory efficient than lists

### When to use tuples
- A fixed “record-like” shape: `(symbol, day, price)`
- Multi-return from functions
- Keys for caches / lookup tables

Next cell: unpacking patterns and using tuples as keys.

In [None]:
banner("Tuples")

row = ("AAPL", date(2023, 10, 1), Decimal("150.2544"))

# Unpacking
symbol, day_, price_ = row
print(symbol, day_, price_)

# Star-unpacking
head, *rest = (1, 2, 3, 4, 5)
print("head:", head, "rest:", rest)

# Tuples as dict keys (composite keys)
by_symbol_and_day: dict[tuple[str, date], Decimal] = {}
for t in ticks:
    by_symbol_and_day[(t.symbol, t.day)] = t.price

print("AAPL 2023-10-01:", by_symbol_and_day[("AAPL", date(2023, 10, 1))])

# Mini-exercise
# Build a set of unique (symbol, day) pairs
unique_pairs = {(t.symbol, t.day) for t in ticks}
print("unique_pairs:", unique_pairs)
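
Two points from the theory above are easy to check directly: tuples as multi-value returns, and the "if all elements are hashable" condition for dictionary keys. This is a minimal sketch; `price_range` and the sample values are hypothetical.

```python
from decimal import Decimal


def price_range(prices: list[Decimal]) -> tuple[Decimal, Decimal]:
    """Return (lowest, highest); a hypothetical helper for illustration."""
    return min(prices), max(prices)


low, high = price_range([Decimal("150.25"), Decimal("310.5"), Decimal("700.1")])
print(low, high)

# A tuple is hashable only if every element is hashable.
ok_key = ("AAPL", 2023)
bad_key = ("AAPL", ["tech", "hardware"])  # contains a list

cache = {ok_key: "cached"}
try:
    cache[bad_key] = "boom"
except TypeError as exc:
    print("TypeError:", exc)

# Immutability is shallow: the tuple's slots are fixed, but a list inside it can change.
bad_key[1].append("us")
print(bad_key)
```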

## 5. Sets (uniqueness + fast membership)

### Theory
- A `set` is an **unordered** collection of **unique** elements.
- Membership is fast: `x in my_set` is typically **O(1)**.
- Sets support math-like operations: union, intersection, difference.

### When to use sets
- Deduplicate values
- Fast membership checks
- Comparing groups (A vs B)

Next cell: real examples using tags/symbols.

In [None]:
banner("Sets")

# 1) Deduplicate symbols
symbols = [t.symbol for t in ticks] + ["AAPL"]
print("symbols list:", symbols)
print("unique symbols:", set(symbols))

# 2) Fast membership checks
symbol_set = {t.symbol for t in ticks}
print("Is GOOG present?", "GOOG" in symbol_set)
print("Is NFLX present?", "NFLX" in symbol_set)

# 3) Set algebra (useful for comparing portfolios)
portfolio_a = {"AAPL", "GOOG", "MSFT"}
portfolio_b = {"GOOG", "AMZN", "NFLX"}

print("union:", portfolio_a | portfolio_b)
print("intersection:", portfolio_a & portfolio_b)
print("difference (A-B):", portfolio_a - portfolio_b)
print("symmetric diff:", portfolio_a ^ portfolio_b)

# 4) Immutable set: frozenset (hashable)
immutable_tags = frozenset({"tech", "hardware"})
print("immutable_tags:", immutable_tags)

# Mini-exercise: compute all unique tags across ticks
all_tags = {tag for t in ticks for tag in t.tags}
print("all_tags:", all_tags)
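
A few more set operations come up often when comparing groups: subset tests, disjointness, and using a `frozenset` as a dictionary key. The sketch below uses hypothetical tag data in the same shape as the ticks above.

```python
# Hypothetical tag data in the same shape as the ticks above.
tags_by_symbol = {
    "AAPL": frozenset({"tech", "hardware"}),
    "MSFT": frozenset({"tech", "software"}),
    "TSLA": frozenset({"auto", "ev"}),
}

# frozenset is hashable, so it can be a dict key; a plain set cannot.
symbols_by_tags: dict[frozenset[str], list[str]] = {}
for symbol, tags in tags_by_symbol.items():
    symbols_by_tags.setdefault(tags, []).append(symbol)
print(symbols_by_tags[frozenset({"hardware", "tech"})])  # element order is irrelevant

# Subset test: which symbols carry every required tag?
required = {"tech"}
tech_symbols = sorted(s for s, tags in tags_by_symbol.items() if required <= tags)
print(tech_symbols)

# Disjointness: True when the two sets share no element.
print(tags_by_symbol["TSLA"].isdisjoint(tags_by_symbol["AAPL"]))

# Sets are unordered: sort when you need stable output.
print(sorted(tags_by_symbol["AAPL"]))
```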

## 6. Dictionaries (hash maps: Python’s superpower)

### Theory
- A `dict` maps **keys → values**.
- Lookup/update by key is typically **O(1)**.
- Keys must be **hashable**: in practice, immutable built-ins such as `str`, `int`, a `tuple` of hashables, or a `frozenset`. A `list`, `set`, or `dict` cannot be a key.

### When to use dicts
- Fast lookup: “given a symbol, get the latest price”
- Grouping: “collect ticks by symbol”
- Counting: “frequency of tags / orders” (also see `Counter`)

Next cell: core dict patterns you’ll use constantly.

In [None]:
banner("Dictionaries")

latest_price: dict[str, Decimal] = {}
for t in ticks:
    latest_price[t.symbol] = t.price

print("latest_price:", latest_price)

# Access patterns
print("AAPL price:", latest_price["AAPL"])
print("NFLX price (safe):", latest_price.get("NFLX"))
print("NFLX price (default):", latest_price.get("NFLX", Decimal("0")))

# Iteration
for symbol, price in latest_price.items():
    print(symbol, "->", price)

# Dict comprehension
upper_map = {k.upper(): v for k, v in latest_price.items()}
print("upper_map:", upper_map)

# Merge (Python 3.9+)
updates = {"AAPL": Decimal("155.00"), "NVDA": Decimal("400.00")}
merged = latest_price | updates
print("merged:", merged)

# Mini-exercise: invert a dict (only safe if values are unique)
inverted = {v: k for k, v in latest_price.items()}
print("inverted:", inverted)
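
The "only safe if values are unique" caveat is easy to miss because a naive inversion fails silently. The sketch below uses hypothetical prices with a deliberate duplicate and shows one possible safer variant that maps each value to a list of keys.

```python
from decimal import Decimal

# Hypothetical prices where two symbols share the same value.
price_by_symbol = {
    "AAA": Decimal("10.00"),
    "BBB": Decimal("20.00"),
    "CCC": Decimal("10.00"),
}

# Naive inversion: the later key silently overwrites the earlier one.
naive = {price: symbol for symbol, price in price_by_symbol.items()}
print(len(price_by_symbol), "->", len(naive))

# Safer inversion: map each value to a LIST of keys.
symbols_by_price: dict[Decimal, list[str]] = {}
for symbol, price in price_by_symbol.items():
    symbols_by_price.setdefault(price, []).append(symbol)
print(symbols_by_price)
```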

### 6.1 Grouping patterns (very common)

You’ll constantly need: “group records by some key”.

Common approaches:
- `dict.setdefault(key, [])`
- `collections.defaultdict(list)`

Next cell shows both (and why `defaultdict` can be cleaner).

In [None]:
from collections import defaultdict

banner("Grouping")

# 1) setdefault
by_symbol_1: dict[str, list[Tick]] = {}
for t in ticks:
    by_symbol_1.setdefault(t.symbol, []).append(t)

print("setdefault grouping:", {k: len(v) for k, v in by_symbol_1.items()})

# 2) defaultdict
by_symbol_2: defaultdict[str, list[Tick]] = defaultdict(list)
for t in ticks:
    by_symbol_2[t.symbol].append(t)

print("defaultdict grouping:", {k: len(v) for k, v in by_symbol_2.items()})

# Mini-exercise: group by day instead of symbol
by_day: defaultdict[date, list[Tick]] = defaultdict(list)
for t in ticks:
    by_day[t.day].append(t)

print("by_day:", {k.isoformat(): len(v) for k, v in by_day.items()})
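
One trade-off of `defaultdict` is worth knowing before relying on it: reading a missing key with `[]` creates that key. The snippet below is a small sketch with made-up values; converting to a plain `dict` at the end is one possible way to stop the mapping from growing by accident.

```python
from collections import defaultdict

groups: defaultdict[str, list[int]] = defaultdict(list)
groups["AAPL"].append(150)

# Reading a missing key with [] CREATES it (with an empty list).
_ = groups["NFLX"]
print(dict(groups))

# Membership tests and .get() do not create keys.
print("GOOG" in groups, groups.get("GOOG"))
print(sorted(groups))

# Convert to a plain dict once grouping is done, so a later typo
# raises KeyError instead of quietly adding an empty group.
plain = dict(groups)
try:
    plain["TYPO"]
except KeyError as exc:
    print("KeyError:", exc)
```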