
# Data Problems I — Transactions (Search & Sort Applications)

This lab applies the techniques from the *Algorithms Lab: Searching & Sorting* to a simple **transactions dataset**.

### Dataset
Each row is a transaction with the following columns:
- `month` — integer (e.g., 1–12)
- `hh_id_encoded` — encoded household/customer ID (integer)
- `item` — item identifier or name (string)

> Path used below: `/mnt/data/transactions_data.csv`

### Learning goals
- Apply **linear/binary search** patterns on real data
- Implement **stable sorting** and custom key sorting
- Use **lower/upper bounds** and **range queries** on sorted data
- Combine sorting + counting for **top-k** style questions
- Practice data wrangling **without** relying on Python's built-in `sorted` for core logic


## 0) Setup: Load & Preview

In [None]:

import pandas as pd
from dataclasses import dataclass
from typing import List, Tuple, Any, Optional, Callable, Dict
import math, random

# Load dataset
CSV_PATH = "/mnt/data/transactions_data.csv"
df = pd.read_csv(CSV_PATH)

# Show a small preview
try:
    from caas_jupyter_tools import display_dataframe_to_user
    display_dataframe_to_user("Transactions preview", df.head(20))
except Exception:
    # Helper unavailable (e.g., outside the hosted environment): fall back to the notebook's display().
    display(df.head(20))

print("Rows:", len(df), "| Columns:", list(df.columns))
assert set(df.columns) >= {"month","hh_id_encoded","item"}, "Missing expected columns."


In [None]:

@dataclass(frozen=True)
class Transaction:
    month: int
    hh_id: int
    item: str

def as_transactions(frame) -> List[Transaction]:
    return [Transaction(int(r.month), int(r.hh_id_encoded), str(r.item)) for r in frame.itertuples(index=False)]

TX = as_transactions(df)
len(TX), TX[:5]



## 1) Linear Search: First purchase lookup

**Task:** Implement a **linear search** to return the index of the **first** transaction where `hh_id == target_hh` **and** `month == target_month`.  
Return `-1` if not found.

- Function: `linear_search_first_tx(txs, target_hh, target_month)`
- Complexity target: **O(n)**
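
The scan pattern can be illustrated on plain `(hh_id, month)` tuples. This is a minimal sketch, not the required solution: the name `first_match_index` and the tuple layout are assumptions made for illustration, and your function must work on `Transaction` objects instead.

```python
from typing import List, Tuple

def first_match_index(rows: List[Tuple[int, int]], target_hh: int, target_month: int) -> int:
    """Return the index of the first (hh_id, month) pair equal to the targets, or -1."""
    for i, (hh_id, month) in enumerate(rows):
        if hh_id == target_hh and month == target_month:
            return i  # stop at the first hit; later duplicates are never visited
    return -1  # scanned every row without a match

rows = [(10, 1), (10, 2), (11, 2), (10, 2)]
print(first_match_index(rows, 10, 2))  # 1
print(first_match_index(rows, 12, 1))  # -1
```

Each row is inspected at most once, so the worst case (no match) costs one pass over all `n` rows.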


In [None]:
# TODO: implement functions in this exercise cell
raise NotImplementedError("TODO: complete this exercise")


In [None]:

# Quick tests for Problem 1 (toy)
toy = [
    Transaction(1, 10, "A"), Transaction(2, 10, "B"),
    Transaction(2, 11, "C"), Transaction(3, 10, "D"),
]
assert linear_search_first_tx(toy, 10, 2) == 1
assert linear_search_first_tx(toy, 11, 2) == 2
assert linear_search_first_tx(toy, 10, 5) == -1
print("[ok] Problem 1 toy tests passed.")



## 2) Stable Sort by Key + Binary Search Range

We want to **sort** transactions by `(hh_id, month)` **stably**, then support range queries using **lower/upper bounds**.

**Tasks:**
1. Implement a **stable merge sort** that accepts a `key` function.  
   - `stable_merge_sort_by_key(txs, key) -> List[Transaction]`  
2. Implement **lower_bound** and **upper_bound** for a given `key` on the sorted list.  
   - `lower_bound_key(txs, key, target) -> int` returns first index `i` where `key(txs[i]) >= target`  
   - `upper_bound_key(txs, key, target) -> int` returns first index `i` where `key(txs[i]) > target`  
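3. Using the above, implement `range_for_key(sorted_txs, key, lo_target, hi_target)`, which returns the half-open index range `[L, R)` of all items whose `key(x)` lies in the closed interval `[lo_target, hi_target]`: `L` is the lower bound of `lo_target` and `R` is the upper bound of `hi_target`. The input must already be sorted by the same `key`.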
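
One possible shape for these three pieces, sketched on plain tuples rather than `Transaction` objects. The names `merge_sort_by`, `lower_bound`, `upper_bound`, and `closed_key_range` are hypothetical stand-ins for the required functions; treat this as an illustration of the technique under those assumptions, not as the reference solution.

```python
from typing import Any, Callable, List, Tuple

def merge_sort_by(items: List[Any], key: Callable[[Any], Any]) -> List[Any]:
    """Top-down merge sort; returns a new list and keeps equal keys in input order."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort_by(items[:mid], key)
    right = merge_sort_by(items[mid:], key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        # "<=" takes from the left half on ties, which is what makes the sort stable.
        if key(left[i]) <= key(right[j]):
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

def lower_bound(items: List[Any], key: Callable[[Any], Any], target: Any) -> int:
    """First index i with key(items[i]) >= target, or len(items) if there is none."""
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(items[mid]) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def upper_bound(items: List[Any], key: Callable[[Any], Any], target: Any) -> int:
    """First index i with key(items[i]) > target, or len(items) if there is none."""
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if key(items[mid]) <= target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def closed_key_range(items, key, lo_target, hi_target) -> Tuple[int, int]:
    """Half-open index range [L, R) of items whose key lies in [lo_target, hi_target]."""
    return lower_bound(items, key, lo_target), upper_bound(items, key, hi_target)

# Rows are (hh_id, month, item); "B" and "E" share the key (10, 2).
rows = [(10, 2, "B"), (10, 1, "A"), (10, 3, "D"), (11, 2, "C"), (10, 2, "E")]
by_hh_month = lambda r: (r[0], r[1])
s = merge_sort_by(rows, by_hh_month)
print([r[2] for r in s])                                   # ['A', 'B', 'E', 'D', 'C']
print(closed_key_range(s, by_hh_month, (10, 2), (10, 2)))  # (1, 3)
```

Note that `"B"` stays ahead of `"E"` in the output because both share the key `(10, 2)` and `"B"` came first in the input. The two bound functions differ only in whether the comparison against `target` is strict.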


In [None]:
# TODO: implement functions in this exercise cell
raise NotImplementedError("TODO: complete this exercise")


In [None]:

toy = [
    Transaction(2, 10, "B"),
    Transaction(1, 10, "A"),
    Transaction(3, 10, "D"),
    Transaction(2, 11, "C"),
]
sorted_toy = stable_merge_sort_by_key(toy, key=lambda t: (t.hh_id, t.month))
keys = [ (t.hh_id, t.month) for t in sorted_toy ]
assert keys == [(10,1),(10,2),(10,3),(11,2)]

L = lower_bound_key(sorted_toy, key=lambda t:(t.hh_id,t.month), target=(10,2))
U = upper_bound_key(sorted_toy, key=lambda t:(t.hh_id,t.month), target=(10,2))
assert L == 1 and U == 2

L,R = range_for_key(sorted_toy, key=lambda t:(t.hh_id,t.month), lo_target=(10,1), hi_target=(10,3))
assert (L,R) == (0,3)
print("[ok] Problem 2 toy tests passed.")



## 3) Top-k Items in a Given Month (Counting + Sorting)

**Task:** For a given `month` and `k`, return the **k most frequent items** purchased in that month.
Do **not** use Python's built-in `sorted` for the ranking; reuse your own sort from Problem 2.

Steps:
1. Filter transactions for the target month.
2. Build a frequency list `[(item, count), ...]`.
3. Sort this list by `count` descending using your **stable merge sort** with an appropriate key.
4. Return the top `k` entries as `(item, count)` pairs (items with equal counts may appear in any order).

- Function: `top_k_items_by_month(txs, month, k)`
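
The count-then-rank steps can be sketched on a plain list of item names that is assumed to be already filtered to one month. To keep the snippet self-contained it ranks with a small stable insertion sort, which is a stand-in only: the exercise asks for your merge sort from Problem 2. The names `count_items`, `stable_insertion_sort_by`, and `top_k` are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

def count_items(items: List[str]) -> List[Tuple[str, int]]:
    """Build [(item, count), ...] in first-seen order."""
    counts: Dict[str, int] = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return list(counts.items())

def stable_insertion_sort_by(pairs: List[Any], key: Callable[[Any], Any]) -> List[Any]:
    """Stand-in stable sort, O(n^2); swap in the merge sort from Problem 2."""
    out = list(pairs)
    for i in range(1, len(out)):
        cur = out[i]
        j = i - 1
        # Strict ">" never moves an element past an equal key, so ties keep their order.
        while j >= 0 and key(out[j]) > key(cur):
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = cur
    return out

def top_k(items: List[str], k: int) -> List[Tuple[str, int]]:
    freq = count_items(items)
    # Negating the count turns an ascending sort into a descending ranking.
    ranked = stable_insertion_sort_by(freq, key=lambda pair: -pair[1])
    return ranked[:k]

purchases = ["A", "B", "A", "C", "A", "C"]
print(top_k(purchases, 2))  # [('A', 3), ('C', 2)]
```

Sorting by the negated count is one way to get descending order from an ascending sort without a separate reverse pass, which would also reverse the order of ties.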


In [None]:
# TODO: implement functions in this exercise cell
raise NotImplementedError("TODO: complete this exercise")


In [None]:

toy = [
    Transaction(1, 1, "A"), Transaction(1, 2, "B"),
    Transaction(1, 3, "A"), Transaction(2, 2, "C"),
    Transaction(1, 1, "C"), Transaction(1, 4, "A"),
]
res = top_k_items_by_month(toy, 1, 2)
items = [x[0] for x in res]
assert items[0] == "A" and set(items[1:]) <= {"B","C"}
print("[ok] Problem 3 toy tests passed.")



## 4) Unique Households (Sort + Sweep)

**Task:** Return the **sorted unique list of `hh_id`** that appear in the dataset, without using Python's built-in `set` or `sorted` for the core logic.

Steps:
1. Extract a list of `hh_id` from transactions.
2. Sort with your `stable_merge_sort_by_key` using `key=lambda x: x`.
3. Sweep once to build a list of unique `hh_id`.

- Function: `unique_households_sorted(txs)`
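
The sweep in step 3 relies on duplicates being adjacent once the list is sorted. A minimal sketch on an already-sorted list of integers (the helper name `unique_from_sorted` is an assumption for illustration):

```python
from typing import List

def unique_from_sorted(values: List[int]) -> List[int]:
    """Drop adjacent duplicates from a sorted list in one pass."""
    out: List[int] = []
    for v in values:
        # In sorted input, a value is new exactly when it differs from the last one kept.
        if not out or out[-1] != v:
            out.append(v)
    return out

print(unique_from_sorted([2, 3, 5, 5]))  # [2, 3, 5]
```

The sweep itself is a single O(n) pass, so the sort in step 2 dominates the total cost.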


In [None]:
# TODO: implement functions in this exercise cell
raise NotImplementedError("TODO: complete this exercise")


In [None]:

toy = [Transaction(1, 5, "A"), Transaction(1, 2, "B"), Transaction(2, 5, "C"), Transaction(2, 3, "D")]
assert unique_households_sorted(toy) == [2,3,5]
print("[ok] Problem 4 toy tests passed.")



## 5) Merge Two Monthly Streams (Classic Merge Step)

**Task:** Imagine you already have two **individually sorted** lists of transactions by `(hh_id, item)` for two different months.
Implement the classic **merge step** to combine them into a single sorted list by `(hh_id, item)`.

- Function: `merge_two_sorted_streams(a, b, key)`
- Do not call your full merge sort; implement just the **linear-time merge**.

Then demonstrate by splitting one month's transactions into two halves, sorting each half, and merging them back.
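
As a rough illustration of the merge step, the sketch below combines two pre-sorted lists of `(hh_id, item, source)` tuples. The helper name `merge_sorted` and the sample rows are made up for the example; the third field only shows which input each row came from.

```python
from typing import Any, Callable, List

def merge_sorted(a: List[Any], b: List[Any], key: Callable[[Any], Any]) -> List[Any]:
    """Merge two lists that are each sorted by `key` into one sorted list."""
    out: List[Any] = []
    i = j = 0
    while i < len(a) and j < len(b):
        # On equal keys take from `a` first, so rows from `a` stay ahead of rows from `b`.
        if key(a[i]) <= key(b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # At most one of these tails is non-empty.
    out.extend(a[i:])
    out.extend(b[j:])
    return out

first = [(1, "apple", "a"), (3, "milk", "a"), (3, "tea", "a")]
second = [(2, "bread", "b"), (3, "milk", "b"), (4, "eggs", "b")]
merged = merge_sorted(first, second, key=lambda r: (r[0], r[1]))
print([r[2] for r in merged])  # ['a', 'b', 'a', 'b', 'a', 'b']
```

Every comparison advances one of the two indices, so the loop runs at most `len(a) + len(b)` times.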


In [None]:
# TODO: implement functions in this exercise cell
raise NotImplementedError("TODO: complete this exercise")



## ✅ Dataset-Driven Tests (Real Data)

The following tests use the **actual dataset** at `/mnt/data/transactions_data.csv`.  
They check your implementations against pandas results or simple reference checks.

To stay lightweight, most tests run on only the first 500 rows.


In [None]:

import pandas as pd

CSV_PATH = "/mnt/data/transactions_data.csv"
df_all = pd.read_csv(CSV_PATH)
assert set(df_all.columns) >= {"month","hh_id_encoded","item"}, "Dataset must contain month, hh_id_encoded, item."

# head(500) already returns the whole frame when it has 500 rows or fewer.
df_small = df_all.head(500).copy()
TX_ALL = as_transactions(df_all)
TX_S = as_transactions(df_small)

# --- Problem 1 ---
if len(df_small) >= 1 and 'linear_search_first_tx' in globals():
    r0 = df_small.iloc[0]
    hh0, m0 = int(r0.hh_id_encoded), int(r0.month)
    gt_idx = -1
    for i,t in enumerate(TX_S):
        if t.hh_id == hh0 and t.month == m0:
            gt_idx = i
            break
    idx = linear_search_first_tx(TX_S, hh0, m0)
    assert idx == gt_idx, f"Problem 1 failed: expected first index {gt_idx}, got {idx}."

# --- Problem 2 ---
if len(TX_S) > 0 and all(name in globals() for name in ['stable_merge_sort_by_key','lower_bound_key','upper_bound_key','range_for_key']):
    sorted_s = stable_merge_sort_by_key(TX_S, key=lambda t:(t.hh_id, t.month))
    hh_mode = int(df_small['hh_id_encoded'].mode().iloc[0])
    lo_m, hi_m = int(df_small['month'].min()), int(df_small['month'].max())
    L, R = range_for_key(sorted_s, key=lambda t:(t.hh_id, t.month),
                         lo_target=(hh_mode, lo_m), hi_target=(hh_mode, hi_m))
    for t in sorted_s[L:R]:
        assert t.hh_id == hh_mode, "Problem 2 failed: range slice contains different hh_id."

# --- Problem 3 ---
if all(name in globals() for name in ['top_k_items_by_month','stable_merge_sort_by_key']) and len(TX_ALL) > 0:
    month_pick = int(df_all['month'].mode().iloc[0])
    k = 5
    algo_top = top_k_items_by_month(TX_ALL, month_pick, k)
    algo_items = [x[0] for x in algo_top]
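    vc = df_all.loc[df_all['month']==month_pick, 'item'].astype(str).value_counts()
    # Compare counts, not item names, so a tie at the k-th place cannot cause a false failure.
    pd_top_counts = [int(c) for c in vc.head(k)]
    algo_counts = sorted((int(vc.get(i, 0)) for i in algo_items), reverse=True)
    assert algo_counts == pd_top_counts, f"Problem 3 failed: expected top counts {pd_top_counts}, got {algo_counts} for items {algo_items}"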

# --- Problem 4 ---
if 'unique_households_sorted' in globals():
    ours = unique_households_sorted(TX_S)
    truth = sorted(map(int, df_small['hh_id_encoded'].unique().tolist()))
    assert ours == truth, "Problem 4 failed: unique households mismatch."

# --- Problem 5 ---
if all(name in globals() for name in ['merge_two_sorted_streams','stable_merge_sort_by_key']) and len(TX_S) > 1:
    month_for_merge = int(df_small['month'].mode().iloc[0])
    subset = [t for t in TX_S if t.month == month_for_merge][:60]
    if len(subset) >= 2:
        s = stable_merge_sort_by_key(subset, key=lambda t:(t.hh_id, t.item))
        a, b = s[:len(s)//2], s[len(s)//2:]
        merged = merge_two_sorted_streams(a, b, key=lambda t:(t.hh_id, t.item))
        assert [(t.hh_id,t.item) for t in merged] == [(t.hh_id,t.item) for t in s], "Problem 5 failed: merged order mismatch."

print("[ok] Dataset-driven tests passed (where applicable).")



---

### Submission Checklist
- [ ] All tests pass for Problems 1–5
- [ ] Code is clean and documented
- [ ] Include short comments on complexity and design choices (e.g., why a stable sort was required)
- [ ] Commit and open PR: `Data Problems I`
