# Module 2: Data Types & Structures

### The Scenario

You receive messy data feeds from various sources: CSV files with inconsistent formatting, API responses with nested JSON, and user input with stray whitespace. Your job is to clean, parse, and organize this data efficiently.

### The Goal

By the end of this module, you will:
- Master the **Big Six** data types: `int`, `str`, `list`, `tuple`, `set`, `dict`
- Understand **mutability** and avoid common copy bugs
- Choose the right data structure with **Big-O** intuition
- Use `collections` for specialized containers

---

## Lesson 1: Strings & Numbers

### The Problem

You receive data like `"  $AAPL : 150.2544 : 2023-10-01  "`. You need to extract the symbol, price, and date into proper Python types.

### The "Aha!" Moment

Most data pipelines follow this pattern:

**raw string → clean → split → convert types → validate**

### Key Concepts

| Concept | Description |
|---------|-------------|
| Strings are **immutable** | Every "modification" creates a new string |
| Use `Decimal` for money | `float` has precision issues (0.1 + 0.2 ≠ 0.3) |
| String methods return new strings | `s.strip()` doesn't modify `s` |
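
A minimal sketch of what immutability means in practice; the variable names are illustrative and not part of the module. A string method hands back a new string, so the result has to be bound to a name, and item assignment on a string is rejected.

```python
s = "  hi  "
s.strip()                 # the stripped string is created and then discarded
unchanged = s             # s still has its surrounding spaces

t = s.strip()             # bind the new string to keep it

try:
    s[0] = "X"            # strings do not support item assignment
    mutated = True
except TypeError:
    mutated = False

print(repr(unchanged), repr(t), mutated)   # '  hi  ' 'hi' False
```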

### String Methods Quick Reference

| Method | Purpose | Example |
|--------|---------|--------|
| `strip()` | Remove leading/trailing whitespace | `"  hi  ".strip()` → `"hi"` |
| `split(sep)` | Split into list | `"a:b".split(":")` → `["a","b"]` |
| `sep.join(iterable)` | Join strings with `sep` | `":".join(["a","b"])` → `"a:b"` |
| `replace(a,b)` | Replace substring | `"hi".replace("i","o")` → `"ho"` |
| `upper()`/`lower()` | Change case | `"Hi".upper()` → `"HI"` |
| `startswith(s)` | Check prefix | `"abc".startswith("a")` → `True` |
| `isdigit()` | Check if digits | `"123".isdigit()` → `True` |

In [2]:
# Parsing a messy string
raw = "  $AAPL : 150.2544 : 2023-10-01 : tech,hardware  "

# Step 1: Clean (str.removeprefix requires Python 3.9+)
clean = raw.strip().removeprefix("$")
print(f"After clean: '{clean}'")

# Step 2: Split
parts = [p.strip() for p in clean.split(":")]
print(f"After split: {parts}")

# Step 3: Convert types
symbol = parts[0].upper()
price = float(parts[1])  # or Decimal for precision
date_str = parts[2]
tags = set(parts[3].split(","))

print(f"\nParsed: symbol={symbol}, price={price}, tags={tags}")

After clean: 'AAPL : 150.2544 : 2023-10-01 : tech,hardware'
After split: ['AAPL', '150.2544', '2023-10-01', 'tech,hardware']

Parsed: symbol=AAPL, price=150.2544, tags={'hardware', 'tech'}
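
The cell above stops at type conversion. The last stage of the pipeline, validation, might look like the sketch below. The function name `parse_record`, the four-field layout, and the specific checks are assumptions made for illustration; a real feed will need its own rules. Splitting on `:` only works here because the date contains no colons.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


def parse_record(raw: str) -> tuple[str, Decimal, date, set[str]]:
    """Parse 'SYMBOL : PRICE : YYYY-MM-DD : tag,tag' and validate each field."""
    # clean -> split
    parts = [p.strip() for p in raw.strip().removeprefix("$").split(":")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")

    # convert + validate the symbol
    symbol = parts[0].upper()
    if not symbol.isalpha():
        raise ValueError(f"bad symbol: {symbol!r}")

    # convert + validate the price
    try:
        price = Decimal(parts[1])
    except InvalidOperation:
        raise ValueError(f"bad price: {parts[1]!r}") from None
    if not price.is_finite() or price <= 0:
        raise ValueError(f"price must be a positive number: {price}")

    # strptime raises ValueError itself on a malformed date
    day = datetime.strptime(parts[2], "%Y-%m-%d").date()

    # drop empty tags such as the one produced by a trailing comma
    tags = {t.strip() for t in parts[3].split(",") if t.strip()}
    return symbol, price, day, tags


record = parse_record("  $AAPL : 150.2544 : 2023-10-01 : tech,hardware  ")
print(record)
```

Raising `ValueError` for every bad field gives callers a single exception type to catch when they want to skip or log a bad row.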


In [9]:
# Method chaining: one compact expression
#   Con: a failure is harder to attribute to a single step
clean = raw.strip().removeprefix("$")

# Sequential statements: one step per line
#   Pro: easy to inspect or log each intermediate value
#   Both forms create the same intermediate strings, so memory use is the same
_c = raw.strip()
_c = _c.removeprefix("$")

# Both give the same result
print(f"{'clean:':10}{clean}")
print(f"{'_c:':10}{_c}")

clean:    AAPL : 150.2544 : 2023-10-01 : tech,hardware
_c:       AAPL : 150.2544 : 2023-10-01 : tech,hardware


In [11]:
# Float precision problem
# Not exactly 0.3: binary floating point cannot represent 0.1 or 0.2 exactly
print(f"0.1 + 0.2 = {0.1 + 0.2}")

# Solution: Use Decimal for money
from decimal import Decimal, ROUND_HALF_UP

price = Decimal("150.2594")  # build from a string, not a float
print(f"Decimal: {price}")

# Round to cents
usd = price.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(f"Rounded: ${usd}")

0.1 + 0.2 = 0.30000000000000004
Decimal: 150.2594
Rounded: $150.26
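
One detail worth seeing directly: `Decimal` is only exact when it is built from a string (or an integer). Built from a float, it faithfully copies the float's binary approximation. A short sketch, with illustrative variable names:

```python
from decimal import Decimal

# Built from strings, the values are exactly what was written
exact_sum = Decimal("0.1") + Decimal("0.2")
print(exact_sum == Decimal("0.3"))       # True

# Built from a float, the Decimal inherits the float's rounding error
from_float = Decimal(0.1)
print(from_float == Decimal("0.1"))      # False
print(from_float)                        # a long expansion slightly above 0.1
```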


---

## Lesson 2: Lists (Dynamic Arrays)

### The Problem

You need to collect prices as they arrive, filter them, and transform them. You also notice weird bugs where modifying one list affects another.

### The "Aha!" Moment

Lists store **references** to objects, not the objects themselves. Assignment creates an alias, not a copy!

### Performance Intuition

| Operation | Time | Why |
|-----------|------|-----|
| `lst[i]` | O(1) | Direct index access |
| `lst.append(x)` | O(1) | Amortized (over-allocation) |
| `x in lst` | O(n) | May scan the entire list (worst case) |
| `lst.insert(0, x)` | O(n) | Shifts every existing element right |
| `lst.pop(0)` | O(n) | Shifts every remaining element left |

In [14]:
# Slicing syntax: [start:stop:step]
prices = [100, 105, 102, 108, 110, 115]

print(f"Original:    {prices}")
print(f"First 3:     {prices[:3]}")
print(f"Last 3:      {prices[-3:]}")
print(f"Every 2nd:   {prices[::2]}")
print(f"Reversed:    {prices[::-1]}")

Original:    [100, 105, 102, 108, 110, 115]
First 3:     [100, 105, 102]
Last 3:      [108, 110, 115]
Every 2nd:   [100, 102, 110]
Reversed:    [115, 110, 108, 102, 105, 100]


In [15]:
# List comprehensions
prices = [100, 105, 102, 108, 110, 115]

# Transform
doubled = [p * 2 for p in prices]
print(f"Doubled: {doubled}")

# Filter
high = [p for p in prices if p > 105]
print(f"High (>105): {high}")

# Transform + Filter
discounted_high = [p * 0.9 for p in prices if p > 105]
print(f"High prices with 10% off: {discounted_high}")

Doubled: [200, 210, 204, 216, 220, 230]
High (>105): [108, 110, 115]
High prices with 10% off: [97.2, 99.0, 103.5]


In [16]:
# THE ALIASING BUG
original = [1, 2, 3]
alias = original           # Same object!
copy1 = original.copy()    # New object (shallow)
copy2 = original[:]        # Also shallow copy

original.append(99)

print(f"original: {original}")
print(f"alias:    {alias}")    # Also has 99!
print(f"copy1:    {copy1}")    # Unchanged
print(f"copy2:    {copy2}")    # Unchanged

original: [1, 2, 3, 99]
alias:    [1, 2, 3, 99]
copy1:    [1, 2, 3]
copy2:    [1, 2, 3]
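
Both copies above are shallow: they duplicate the outer list but still share any nested objects. The sketch below uses a hypothetical list of `[symbol, price]` pairs to show where a shallow copy stops protecting you and `copy.deepcopy` is needed.

```python
import copy

book = [["AAPL", 150.25], ["GOOG", 2800.00]]   # list of lists (made-up data)
shallow = book.copy()         # new outer list, same inner lists
deep = copy.deepcopy(book)    # new outer list and new inner lists

book[0][1] = 999.0                 # mutate a nested list
book.append(["MSFT", 310.50])      # mutate the outer list

print(shallow)  # [['AAPL', 999.0], ['GOOG', 2800.0]]  nested change leaks through
print(deep)     # [['AAPL', 150.25], ['GOOG', 2800.0]] fully independent
```

The appended `MSFT` row shows up in neither copy, because both have their own outer list; only the nested edit distinguishes them.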


---

## Lesson 3: Tuples (Immutable Records)

### The Problem

You need to use a combination of values (like symbol + date) as a dictionary key, but lists can't be keys because they're mutable.

### The "Aha!" Moment

Tuples are **immutable**, which means they're:
- **Hashable** (as long as every element is hashable) → can be dictionary keys or set members
- **Safe to share** → no accidental modifications
- **Memory efficient** → no over-allocation
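
A small sketch of the hashability rule, using made-up keys. A list is rejected as a dictionary key, and a tuple is only hashable when everything inside it is hashable too.

```python
prices = {}
prices[("AAPL", "2023-10-01")] = 150.25        # tuple key: accepted

try:
    prices[["AAPL", "2023-10-01"]] = 150.25    # list key: unhashable
    list_key_ok = True
except TypeError:
    list_key_ok = False

try:
    hash(("AAPL", ["tech", "hardware"]))       # tuple that contains a list
    nested_ok = True
except TypeError:
    nested_ok = False

print(list_key_ok, nested_ok)   # False False
```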

### When to Use Tuples vs Lists

| Use Tuple | Use List |
|-----------|----------|
| Fixed structure (record) | Growing collection |
| Dictionary keys | Mutable sequence |
| Return multiple values | Need to sort/filter |
| Heterogeneous data | Homogeneous data |

In [18]:
from datetime import date

# Tuples as composite dictionary keys
prices = {
    ("AAPL", date(2023, 10, 1)): 150.25,
    ("AAPL", date(2023, 10, 2)): 152.00,
    ("GOOG", date(2023, 10, 1)): 2800.00,
}

# Lookup
key = ("AAPL", date(2023, 10, 1))
print(f"Price for {key}: ${prices[key]}")

Price for ('AAPL', datetime.date(2023, 10, 1)): $150.25


In [19]:
# Tuple unpacking
record = ("AAPL", 150.25, "tech")

# Basic unpacking
symbol, price, sector = record
print(f"Symbol: {symbol}, Price: {price}")

# Star unpacking
first, *rest = (1, 2, 3, 4, 5)
print(f"First: {first}, Rest: {rest}")

# Swap values (no temp variable needed)
a, b = 1, 2
a, b = b, a
print(f"Swapped: a={a}, b={b}")

Symbol: AAPL, Price: 150.25
First: 1, Rest: [2, 3, 4, 5]
Swapped: a=2, b=1


---

## Lesson 4: Sets (Uniqueness & Fast Membership)

### The Problem

You need to check if a symbol is in your portfolio, deduplicate a list, or find symbols that are in portfolio A but not B.

### The "Aha!" Moment

Sets use hash tables, making membership checks O(1) on average instead of the O(n) scan a list requires.
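
To get a feel for the difference, the rough timing sketch below looks up the last element of a 100,000-item collection. The size and repeat count are arbitrary choices, and the absolute numbers depend on the machine; only the gap between the two is the point.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1              # last element: the worst case for a list scan

list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```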

### Set Operations

| Operation | Symbol | Result |
|-----------|--------|--------|
| Union | `a \| b` | All elements from both |
| Intersection | `a & b` | Elements in both |
| Difference | `a - b` | Elements in a, not in b |
| Symmetric Diff | `a ^ b` | Elements in one but not both |

In [20]:
# Fast membership check
portfolio = {"AAPL", "GOOG", "MSFT", "TSLA"}

print(f"'GOOG' in portfolio: {'GOOG' in portfolio}")  # O(1)
print(f"'NFLX' in portfolio: {'NFLX' in portfolio}")  # O(1)

'GOOG' in portfolio: True
'NFLX' in portfolio: False


In [21]:
# Set algebra
portfolio_a = {"AAPL", "GOOG", "MSFT"}
portfolio_b = {"GOOG", "AMZN", "NFLX"}

print(f"Portfolio A: {portfolio_a}")
print(f"Portfolio B: {portfolio_b}")
print()
print(f"Union (all):        {portfolio_a | portfolio_b}")
print(f"Intersection (both): {portfolio_a & portfolio_b}")
print(f"Only in A:          {portfolio_a - portfolio_b}")
print(f"Only in B:          {portfolio_b - portfolio_a}")

Portfolio A: {'MSFT', 'AAPL', 'GOOG'}
Portfolio B: {'AMZN', 'NFLX', 'GOOG'}

Union (all):        {'AMZN', 'GOOG', 'NFLX', 'MSFT', 'AAPL'}
Intersection (both): {'GOOG'}
Only in A:          {'AAPL', 'MSFT'}
Only in B:          {'AMZN', 'NFLX'}
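
The cell above does not exercise symmetric difference from the table. A short sketch of it follows, together with two related checks (`<=` for subset and `isdisjoint`) that go beyond the table but are standard `set` operations.

```python
portfolio_a = {"AAPL", "GOOG", "MSFT"}
portfolio_b = {"GOOG", "AMZN", "NFLX"}

# Symbols held in exactly one of the two portfolios
one_side_only = portfolio_a ^ portfolio_b
print(sorted(one_side_only))   # ['AAPL', 'AMZN', 'MSFT', 'NFLX']

is_subset = {"GOOG"} <= portfolio_a            # every element also in A?
no_overlap = portfolio_a.isdisjoint({"TSLA"})  # no shared elements?
print(is_subset, no_overlap)   # True True
```

`sorted()` is used only to get a stable display order, since sets themselves are unordered.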


In [24]:
# Deduplicate a list
symbols = ["AAPL", "GOOG", "AAPL", "MSFT", "GOOG"]
unique = list(set(symbols))
print(f"Original: {symbols}")
print(f"Unique:   {unique}")

# Note: set() doesn't preserve order
# To preserve order, use dict.fromkeys()
unique_ordered = list(dict.fromkeys(symbols))
print(f"Unique (ordered): {unique_ordered}")

Original: ['AAPL', 'GOOG', 'AAPL', 'MSFT', 'GOOG']
Unique:   ['MSFT', 'AAPL', 'GOOG']
Unique (ordered): ['AAPL', 'GOOG', 'MSFT']


---

## Lesson 5: Dictionaries (Hash Maps)

### The Problem

You need to look up prices by symbol, group trades by date, and count tag frequencies.

### The "Aha!" Moment

Dictionaries are Python's built-in hash maps. Average O(1) lookup makes them the natural choice for:
- Fast key → value lookup
- Grouping records
- Counting frequencies
- Caching computed results
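
Caching is the one item in this list that the cells below do not demonstrate, so here is a minimal sketch. `slow_square` is a stand-in for any expensive computation, and `call_log` exists only to show how often it actually runs; both names are hypothetical.

```python
cache = {}
call_log = []          # records each real computation, for demonstration only


def slow_square(n):
    """Stand-in for an expensive computation."""
    call_log.append(n)
    return n * n


def cached_square(n):
    # Compute once per distinct input, then reuse the stored result
    if n not in cache:
        cache[n] = slow_square(n)
    return cache[n]


results = [cached_square(n) for n in [4, 4, 5, 4]]
print(results, len(call_log))   # [16, 16, 25, 16] 2
```

For functions with hashable arguments, the standard library's `functools.lru_cache` decorator packages the same idea.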

### Dict Methods Quick Reference

| Method | Purpose | Returns |
|--------|---------|--------|
| `d[key]` | Get value | Value, or raises `KeyError` |
| `d.get(key)` | Safe get | Value or None |
| `d.get(key, default)` | Safe get with default | Value or default |
| `d.setdefault(key, val)` | Get or set default | Existing or new value |
| `d.items()` | Key-value pairs | View object (iterable) |
| `d1 \| d2` | Merge (3.9+) | New dict |
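
A short sketch of how `get`, `setdefault`, and `|` differ, using a made-up settings dictionary. The key point is that `get` never modifies the dictionary, `setdefault` does, and `|` builds a new dictionary in which the right-hand side wins on conflicts.

```python
defaults = {"currency": "USD", "precision": 2}
overrides = {"precision": 4}

merged = defaults | overrides          # Python 3.9+; right-hand side wins
print(merged)                          # {'currency': 'USD', 'precision': 4}

missing = defaults.get("exchange")                 # None, dict unchanged
fallback = defaults.get("exchange", "NASDAQ")      # 'NASDAQ', dict unchanged
stored = defaults.setdefault("exchange", "NYSE")   # inserts 'NYSE' and returns it

print(missing, fallback, stored)       # None NASDAQ NYSE
print(defaults)  # {'currency': 'USD', 'precision': 2, 'exchange': 'NYSE'}
```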

In [25]:
# Basic dictionary operations
prices = {"AAPL": 150.25, "GOOG": 2800.00, "MSFT": 310.50}

# Access
print(f"AAPL price: ${prices['AAPL']}")

# Safe access
print(f"NFLX price: {prices.get('NFLX', 'Not found')}")

# Iterate
for symbol, price in prices.items():
    print(f"  {symbol}: ${price}")

AAPL price: $150.25
NFLX price: Not found
  AAPL: $150.25
  GOOG: $2800.0
  MSFT: $310.5


In [26]:
# Dict comprehension
prices = {"AAPL": 150.25, "GOOG": 2800.00, "MSFT": 310.50}

# Transform values
discounted = {k: v * 0.9 for k, v in prices.items()}
print(f"10% off: {discounted}")

# Filter
expensive = {k: v for k, v in prices.items() if v > 200}
print(f"Expensive: {expensive}")

# Invert (swap keys and values); only safe when the values are unique and hashable
price_to_symbol = {v: k for k, v in prices.items()}
print(f"Inverted: {price_to_symbol}")

10% off: {'AAPL': 135.225, 'GOOG': 2520.0, 'MSFT': 279.45}
Expensive: {'GOOG': 2800.0, 'MSFT': 310.5}
Inverted: {150.25: 'AAPL', 2800.0: 'GOOG', 310.5: 'MSFT'}
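
Inversion works cleanly above because every price is distinct. When values repeat, later keys silently overwrite earlier ones. The sketch below uses a hypothetical symbol-to-sector mapping to show the loss, along with one way to keep everything by grouping instead.

```python
sectors = {"AAPL": "tech", "MSFT": "tech", "JPM": "finance"}

# Naive inversion: 'tech' can only keep one symbol, and the last one wins
naive = {v: k for k, v in sectors.items()}
print(naive)     # {'tech': 'MSFT', 'finance': 'JPM'}

# Grouping inversion: collect every key that shares a value
grouped = {}
for symbol, sector in sectors.items():
    grouped.setdefault(sector, []).append(symbol)
print(grouped)   # {'tech': ['AAPL', 'MSFT'], 'finance': ['JPM']}
```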


In [27]:
# Grouping pattern
trades = [
    {"symbol": "AAPL", "qty": 10},
    {"symbol": "GOOG", "qty": 5},
    {"symbol": "AAPL", "qty": 15},
]

# Group by symbol using setdefault
by_symbol = {}
for trade in trades:
    by_symbol.setdefault(trade["symbol"], []).append(trade)

print("Grouped by symbol:")
for symbol, group in by_symbol.items():
    print(f"  {symbol}: {group}")

Grouped by symbol:
  AAPL: [{'symbol': 'AAPL', 'qty': 10}, {'symbol': 'AAPL', 'qty': 15}]
  GOOG: [{'symbol': 'GOOG', 'qty': 5}]
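
The lesson's problem statement also mentions counting tag frequencies. With a plain dict, that is usually written with `get` and a default of `0`; the sketch below uses a made-up tag list.

```python
tags = ["tech", "finance", "tech", "ai", "tech", "finance"]

counts = {}
for tag in tags:
    # Missing tags start from 0, so no KeyError on first sight
    counts[tag] = counts.get(tag, 0) + 1

print(counts)   # {'tech': 3, 'finance': 2, 'ai': 1}
```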


---

## Lesson 6: Collections Module

### The Problem

Standard dicts/lists work, but the code is verbose. You want cleaner patterns for grouping, counting, and queue operations.

### The "Aha!" Moment

The `collections` module provides specialized containers for common patterns.

### Collections Quick Reference

| Container | Use Case | Advantage |
|-----------|----------|----------|
| `defaultdict` | Grouping, counting | Auto-creates missing keys |
| `Counter` | Frequency counting | Built-in counting methods |
| `deque` | Queues, sliding windows | O(1) appends/pops at both ends |
| `namedtuple` | Lightweight records | Named access, immutable |

In [28]:
from collections import defaultdict, Counter, deque

# defaultdict - cleaner grouping
trades = [
    {"symbol": "AAPL", "qty": 10},
    {"symbol": "GOOG", "qty": 5},
    {"symbol": "AAPL", "qty": 15},
]

by_symbol = defaultdict(list)
for trade in trades:
    by_symbol[trade["symbol"]].append(trade)  # No setdefault needed!

print(f"Grouped: {dict(by_symbol)}")

Grouped: {'AAPL': [{'symbol': 'AAPL', 'qty': 10}, {'symbol': 'AAPL', 'qty': 15}], 'GOOG': [{'symbol': 'GOOG', 'qty': 5}]}


In [29]:
# Counter - frequency counting
tags = ["tech", "finance", "tech", "ai", "tech", "finance"]

counts = Counter(tags)
print(f"Counts: {counts}")
print(f"Most common: {counts.most_common(2)}")
print(f"'tech' count: {counts['tech']}")
print(f"'unknown' count: {counts['unknown']}")  # Returns 0, not KeyError!

Counts: Counter({'tech': 3, 'finance': 2, 'ai': 1})
Most common: [('tech', 3), ('finance', 2)]
'tech' count: 3
'unknown' count: 0


In [30]:
# deque - O(1) operations on both ends
# Good fit for: queues, sliding windows, bounded history buffers

recent = deque(maxlen=3)  # Fixed-size buffer
for price in [100, 105, 110, 115, 120]:
    recent.append(price)
    print(f"Added {price}: {list(recent)}")

Added 100: [100]
Added 105: [100, 105]
Added 110: [100, 105, 110]
Added 115: [105, 110, 115]
Added 120: [110, 115, 120]
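
`namedtuple` appears in the quick reference but has no cell of its own. Here is a minimal sketch; the `Trade` record and its fields are invented for illustration.

```python
from collections import namedtuple

# A lightweight record type with named fields
Trade = namedtuple("Trade", ["symbol", "qty", "price"])

t = Trade("AAPL", 10, 150.25)
print(t.symbol, t.qty)          # AAPL 10  (access by name)

symbol, qty, price = t          # still unpacks like a plain tuple

bigger = t._replace(qty=20)     # returns a new record; t is untouched

try:
    t.qty = 99                  # fields cannot be reassigned
    reassigned = True
except AttributeError:
    reassigned = False

print(bigger, reassigned)       # Trade(symbol='AAPL', qty=20, price=150.25) False
```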


---

## Summary

### The Big Six

| Type | Mutable | Ordered | Unique | Use Case |
|------|---------|---------|--------|----------|
| `int` | No | - | - | Counting, arithmetic |
| `str` | No | Yes | - | Text processing |
| `list` | Yes | Yes | No | Dynamic sequences |
| `tuple` | No | Yes | No | Records, dict keys |
| `set` | Yes | No | Yes | Membership, dedup |
| `dict` | Yes | Yes* | Keys | Lookup, grouping |

*Dicts maintain insertion order since Python 3.7

### Big-O Cheat Sheet

| Operation | List | Set | Dict |
|-----------|------|-----|------|
| Access by index | O(1) | - | - |
| Access by key | - | - | O(1) |
| Search | O(n) | O(1) | O(1) |
| Insert/append | O(1)* | O(1) | O(1) |
| Delete | O(n) | O(1) | O(1) |

*O(1) amortized for `list.append()`; inserting anywhere else in a list is O(n). Set and dict costs are average-case.

### Copy Rules

| Code | Result |
|------|--------|
| `b = a` | Alias (same object) |
| `b = a.copy()` | Shallow copy |
| `b = a[:]` | Shallow copy (list) |
| `b = copy.deepcopy(a)` | Deep copy (requires `import copy`) |

---

**Next Module:** Python Libraries - The `os` and `sys` modules