# Metadata

**L1 Taxonomy** - Networking

**L2 Taxonomy** - GraphQL

**Subtopic** - Optimising GraphQL Queries in Python with DataLoader

**Use Case** - Design an advanced caching strategy using Python's DataLoader to efficiently manage and invalidate cache entries for GraphQL queries in a dynamic data environment.

**Programming Language** - Python

**Target Model** - GPT-4o

# Setup

```requirements.txt
```


# Prompt

**Problem Description**
You run a multi-tenant GraphQL gateway that logs every field access in real time. To diagnose performance hotspots, you must report the top-k most frequently requested fields within a sliding window of the last W seconds at each incoming request. Because the traffic volume is massive and memory is limited, your solution must use a space-efficient, approximate heavy-hitters algorithm (e.g. Count–Min Sketch combined with a stream summary) and support a sliding window eviction mechanism.

**Input Format and Constraints**

```
N W k
t_1 field_1
t_2 field_2
…
t_N field_N
```

* `N` (1 ≤ N ≤ 10^6): number of logged events
* `W` (1 ≤ W ≤ 10^5): size of the sliding window in seconds
* `k` (1 ≤ k ≤ 100): number of top fields to report
* Each of the next N lines has:

  * `t_i` (integer, non-decreasing): the timestamp in seconds
  * `field_i` (string of length ≤ 32): the GraphQL field name accessed

**Expected Output Format**
After processing each event i, output a line:

```
t_i [f_1:c_1, f_2:c_2, …, f_k:c_k]
```

where `f_1 … f_k` are the k fields with the highest *estimated* counts within the window `[t_i − W + 1, t_i]`, sorted by descending count (ties broken by lexicographic order of field name), and `c_j` is the estimated count for `f_j`.
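
The ranking and formatting rule can be sketched as follows. This is a minimal illustration, assuming the estimated counts for the current window are already available in a plain dictionary; `format_line` is a hypothetical helper and not part of the required interface.

```python
from typing import Dict


def format_line(t: int, counts: Dict[str, int], k: int) -> str:
    """Render one output line from estimated counts (hypothetical helper)."""
    # Sort by descending count, then by field name to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    body = ", ".join(f"{name}:{count}" for name, count in ranked[:k])
    return f"{t} [{body}]"


# "comment" and "post" tie on count 1, so "comment" wins the second slot.
print(format_line(7, {"user": 2, "comment": 1, "post": 1}, 2))
# prints: 7 [user:2, comment:1]
```

Padding to k entries when fewer fields are available is deliberately left out here; it is covered by the edge-case rules.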

**Example**

```
Input:
6 5 2
1 user
2 post
3 user
6 user
7 comment
10 post

Output:
1 [user:1, comment:0]
2 [user:1, post:1]
3 [user:2, post:1]
6 [user:2, post:1]
7 [user:2, comment:1]
10 [comment:1, post:1]
```
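
For cross-checking an approximate implementation on small inputs, an exact baseline is handy. The sketch below assumes W ≥ 1 and keeps every in-window event in memory, so it ignores the space constraint on purpose; `exact_window_counts` is a hypothetical name, and padding to k entries is not modelled.

```python
from collections import Counter, deque
from typing import Deque, List, Tuple


def exact_window_counts(
    events: List[Tuple[int, str]], W: int
) -> List[List[Tuple[str, int]]]:
    """Exact per-event counts over [t - W + 1, t] (hypothetical baseline)."""
    window: Deque[Tuple[int, str]] = deque()
    counts: Counter = Counter()
    snapshots = []
    for t, field in events:
        window.append((t, field))
        counts[field] += 1
        # Drop events whose timestamp is at or before t - W.
        while window and window[0][0] <= t - W:
            _, old = window.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        # Rank by descending count, ties broken by field name.
        snapshots.append(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))
    return snapshots


events = [(1, "user"), (2, "post"), (3, "user"), (6, "user"), (7, "comment")]
for (t, _), ranked in zip(events, exact_window_counts(events, W=5)):
    print(t, ranked)
```

Comparing these exact rankings against the sketch-based output on random small streams is one way to spot eviction bugs early.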






# Requirements

1. **Explicit & Implicit Points**

   * Use a Count–Min Sketch (CMS) or similar sub-linear sketch to track approximate frequencies.
   * Maintain exact timestamps in a sliding‐window buffer to evict contributions older than W seconds.
   * Combine the CMS with a “stream summary” (e.g. a fixed‐size min-heap or Misra–Gries) to extract the top-k candidates.
   * On each new timestamped event, ingest the field into both the sketch and the summary, then evict any expired events from the window structures.
   * When evicting, decrement the sketch counts or adjust summary counters to maintain approximate correctness.
   * Break ties in estimated counts by lexicographical order of field names.

2. **Solution Expectations**

   * Process each of the N events in amortized O(log k) time (for summary updates) plus the CMS update cost O(d), where d = number of hash functions.
   * Use O(k + d·m) space, where m is sketch width; do *not* store all N events explicitly.
   * Provide approximate counts: errors within the standard CMS error bounds.
   * Ensure that evictions correctly age‐out only those contributions older than W seconds.
   * After each ingestion+eviction step, output exactly one line as specified.

3. **Function Signatures**

   ```python
   def process_stream(
       events: List[Tuple[int, str]],
       W: int,
       k: int,
       sketch_depth: int,
       sketch_width: int
   ) -> List[str]:
       """
       events: list of (timestamp, field_name) in non-decreasing order
       W: sliding window size in seconds
       k: number of heavy hitters to report
       sketch_depth, sketch_width: CMS parameters
       Returns: list of N output strings, one per event
       """
   ```

4. **Relevant Classes / Data Structures**

   ```python
   @dataclass
   class CountMinSketch:
       depth: int
       width: int
       # internal tables, hash functions, etc.
       def update(self, key: str, delta: int = 1) -> None: ...
       def estimate(self, key: str) -> int: ...

   @dataclass
   class StreamSummary:
       capacity: int
       # tracks top-k candidates with approximate counts
       def add(self, key: str, count: int) -> None: ...
       def topk(self) -> List[Tuple[str, int]]: ...
   ```

5. **Edge Case Behavior**

   * If fewer than k distinct fields have appeared so far, fill missing slots with the lexicographically smallest fields not yet seen (with count 0).
   * If W = 0, only count the current event (window of size one).
   * Identical timestamps: treat as distinct events; eviction removes events with timestamp ≤ t_i − W.
   * Very high event rates (N >> W): ensure your data structures never grow unbounded.

6. **Constraints**

   * Time complexity: O(N·(d + log k)), where d = CMS depth.
   * Space complexity: O(k + d·width + W), where the W term covers the sliding buffer.
   * Use only the Python standard library (`collections`, `heapq`, `hashlib`, `time`, etc.); no external streaming or sketching libraries.
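
The "standard CMS error bounds" are commonly stated in terms of two parameters: with width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉, a point estimate overshoots the true count by more than ε times the total count held in the sketch with probability at most δ. The sketch below shows one way to derive `sketch_depth` and `sketch_width` from those parameters. `cms_dimensions` is a hypothetical helper, and the guarantee assumes pairwise-independent hash functions and true counts that never go negative, which a salted cryptographic hash only approximates heuristically.

```python
import math
from typing import Tuple


def cms_dimensions(epsilon: float, delta: float) -> Tuple[int, int]:
    """Return (depth, width) for a target error epsilon and failure rate delta."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    width = math.ceil(math.e / epsilon)      # columns per row
    depth = math.ceil(math.log(1 / delta))   # number of hash rows
    return depth, width


depth, width = cms_dimensions(epsilon=0.01, delta=0.01)
print(depth, width)  # prints: 5 272
```

Tighter ε widens each row and smaller δ adds rows, so the d·width term in the space bound grows accordingly.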


In [None]:
# code
import sys
import collections
import hashlib
from typing import List, Tuple, Dict, Deque

class CountMinSketch:
    def __init__(self, depth: int, width: int):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        self.salts = list(range(depth))

    def _idx(self, key: str, salt: int) -> Tuple[int, int]:
        data = f"{salt}-{key}".encode("utf-8")
        digest = hashlib.sha256(data).digest()
        val = int.from_bytes(digest, "big")
        return salt, val % self.width

    def update(self, key: str, delta: int = 1) -> None:
        for salt in self.salts:
            row, col = self._idx(key, salt)
            self.table[row][col] += delta

    def estimate(self, key: str) -> int:
        min_val = float("inf")
        for salt in self.salts:
            row, col = self._idx(key, salt)
            cnt = self.table[row][col]
            if cnt < min_val:
                min_val = cnt
        return int(min_val) if min_val != float("inf") else 0

class StreamSummary:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.counters: Dict[str, int] = {}

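    def add(self, key: str, count: int) -> None:
        # Misra-Gries-style update: bump a tracked key, or claim a free slot.
        if key in self.counters:
            self.counters[key] += count
        elif len(self.counters) < self.capacity:
            self.counters[key] = count
        else:
            # Summary is full: decrement every tracked key and drop those
            # that fall to zero or below.
            to_remove = []
            for k in list(self.counters):
                self.counters[k] -= count
                if self.counters[k] <= 0:
                    to_remove.append(k)
            for k in to_remove:
                del self.counters[k]
            # Variant: if the decrement freed a slot, admit the new key.
            if len(self.counters) < self.capacity:
                self.counters[key] = count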

    def decrement(self, key: str, count: int) -> None:
        if key in self.counters:
            self.counters[key] -= count
            if self.counters[key] <= 0:
                del self.counters[key]

    def topk(self) -> List[Tuple[str, int]]:
        items = list(self.counters.items())
        items.sort(key=lambda x: (-x[1], x[0]))
        return items[:self.capacity]


def process_stream(
    events: List[Tuple[int, str]],
    W: int,
    k: int,
    sketch_depth: int,
    sketch_width: int
) -> List[str]:
    # Precompute all distinct fields for padding (global list)
    distinct_fields = sorted({f for _, f in events})

    cms = CountMinSketch(sketch_depth, sketch_width)
    summary = StreamSummary(k)
    buffer: Deque[Tuple[int, str]] = collections.deque()
    results: List[str] = []

    for t, field in events:
        # ingest
        cms.update(field, 1)
        summary.add(field, 1)
        buffer.append((t, field))

        # evict old events
        if W == 0:
            while buffer and buffer[0][0] < t:
                old_t, old_f = buffer.popleft()
                cms.update(old_f, -1)
                summary.decrement(old_f, 1)
        else:
            cutoff = t - W
            while buffer and buffer[0][0] <= cutoff:
                old_t, old_f = buffer.popleft()
                cms.update(old_f, -1)
                summary.decrement(old_f, 1)

        # get top-k candidates and re-rank by CMS estimates
        candidates = [f for f, _ in summary.topk()]
        est_list = [(f, cms.estimate(f)) for f in candidates]
        est_list.sort(key=lambda x: (-x[1], x[0]))
        top = est_list[:k]

        # pad with lexicographically smallest distinct fields not in top
        present = {f for f, _ in top}
        for f in distinct_fields:
            if len(top) >= k:
                break
            if f not in present:
                top.append((f, 0))
                present.add(f)

        # pad with empty strings if still under k
        while len(top) < k:
            top.append(("", 0))

        # format output with comma+space
        line = f"{t} [" + ", ".join(f"{f}:{c}" for f, c in top) + "]"
        results.append(line)

    return results

if __name__ == "__main__":
    data = sys.stdin
    header = next(data).split()
    N, W, k = map(int, header)
    events: List[Tuple[int, str]] = []
    for _ in range(N):
        parts = next(data).strip().split()
        if not parts:
            continue
        t, field = int(parts[0]), parts[1]
        events.append((t, field))

    sketch_depth = 5
    sketch_width = max(1000, k * 10)
    for line in process_stream(events, W, k, sketch_depth, sketch_width):
        print(line)



In [None]:
# tests

import unittest
import re
from main import CountMinSketch, StreamSummary, process_stream

class TestCountMinSketch(unittest.TestCase):
    def test_initial_estimate_zero(self):
        cms = CountMinSketch(depth=3, width=100)
        self.assertEqual(cms.estimate("foo"), 0)
        self.assertEqual(cms.estimate("bar"), 0)

    def test_update_and_estimate(self):
        cms = CountMinSketch(depth=4, width=50)
        cms.update("alpha", 5)
        self.assertEqual(cms.estimate("alpha"), 5)
        cms.update("alpha", 3)
        self.assertEqual(cms.estimate("alpha"), 8)

    def test_decrement_below_zero(self):
        cms = CountMinSketch(depth=2, width=20)
        cms.update("x", 4)
        self.assertEqual(cms.estimate("x"), 4)
        cms.update("x", -1)
        self.assertEqual(cms.estimate("x"), 3)
        cms.update("x", -5)
        self.assertEqual(cms.estimate("x"), -2)

    def test_hash_positions_vary(self):
        cms = CountMinSketch(depth=2, width=10)
        posns = {cms._idx(f"key{i}", salt)
                  for i in range(20) for salt in range(cms.depth)}
        self.assertTrue(len(posns) > 10)

class TestStreamSummary(unittest.TestCase):
    def test_add_within_capacity(self):
        ss = StreamSummary(capacity=2)
        ss.add("a", 1)
        ss.add("b", 1)
        top = ss.topk()
        self.assertEqual(len(top), 2)
        self.assertIn(("a", 1), top)
        self.assertIn(("b", 1), top)

    def test_add_eviction(self):
        ss = StreamSummary(capacity=2)
        ss.add("a", 1)
        ss.add("b", 1)
        ss.add("c", 1)
        top = dict(ss.topk())
        self.assertIn("c", top)

    def test_topk_tiebreak_lex(self):
        ss = StreamSummary(capacity=3)
        ss.add("b", 2)
        ss.add("a", 2)
        ss.add("c", 2)
        top = ss.topk()
        self.assertEqual([f for f,_ in top], ["a","b","c"] )

class TestProcessStream(unittest.TestCase):
    def setUp(self):
        self.depth = 3
        self.width = 50

    def test_empty_events(self):
        out = process_stream([], W=10, k=3,
                             sketch_depth=self.depth,
                             sketch_width=self.width)
        self.assertEqual(out, [])

    def test_single_event_format_and_count(self):
        logs = [(100, "fieldA")]
        out = process_stream(logs, W=5, k=1,
                              sketch_depth=self.depth,
                              sketch_width=self.width)
        self.assertEqual(len(out), 1)
        # strict format with spaces after commas
        self.assertRegex(out[0], r"^100 \[[^,]+:[0-9]+(?:, [^,]+:[0-9]+)*\]$")
        inner = re.match(r"^\d+ \[(.+)\]$", out[0]).group(1)
        self.assertEqual(inner, "fieldA:1")

    def test_padding_when_less_than_k(self):
        logs = [(1, "x"), (2, "y")]
        out = process_stream(logs, W=10, k=4,
                              sketch_depth=self.depth,
                              sketch_width=self.width)
        for idx, (t, _) in enumerate(logs):
            parts = re.match(r"^\d+ \[(.*)\]$", out[idx]).group(1).split(', ')
            self.assertEqual(len(parts), 4)
            seen_names = {p.split(":")[0] for p in parts}
            self.assertTrue("x" in seen_names)
            self.assertTrue("y" in seen_names)
            # at least one slot should be zero-count padding
            unseen = [p for p in parts if p.split(':')[1]=='0']
            self.assertTrue(len(unseen) >= 1)

    def test_empty_string_padding(self):
        logs = [(1, "a")]
        out = process_stream(logs, W=1, k=3,
                              sketch_depth=self.depth,
                              sketch_width=self.width)
        parts = out[0].split('[')[1].rstrip(']').split(', ')
        names = [p.split(':')[0] for p in parts]
        # only 'a' seen, so we expect two empty-string slots to reach k=3
        self.assertEqual(names.count(''), 2)

    def test_sliding_window_eviction(self):
        logs = [(1, "x"), (2, "y"), (15, "x")]
        # W=10 → window is [t-9, t], so event at t=2 should be evicted at t=15
        out = process_stream(logs, W=10, k=2,
                              sketch_depth=self.depth,
                              sketch_width=self.width)
        inner = out[-1].split('[')[1].rstrip(']')
        parts = inner.split(', ')
        counts = {name: int(cnt) for name,cnt in (p.split(':') for p in parts)}
        self.assertEqual(counts["x"], 1)
        self.assertEqual(counts["y"], 0)

    def test_identical_timestamps(self):
        logs = [(5,"a"), (5,"b"), (5,"c")]
        out = process_stream(logs, W=1, k=2,
                              sketch_depth=self.depth,
                              sketch_width=self.width)
        inner = re.match(r"^\d+ \[(.*)\]$", out[0]).group(1).split(', ')
        self.assertEqual(inner[0].split(':')[0], "a")
        self.assertEqual(inner[1].split(':')[0], "b")

    def test_zero_window_size(self):
        logs = [(10,"f1"), (20,"f1")]
        out = process_stream(logs, W=0, k=1,
                              sketch_depth=self.depth,
                              sketch_width=self.width)
        self.assertEqual(out[0], "10 [f1:1]")
        self.assertEqual(out[1], "20 [f1:1]")

if __name__ == "__main__":
    unittest.main()


# Model Breaking Proof

#### Model Breaking Task URL: <Add the URL here>

#### Model code:

```python
import time
import collections
import heapq
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CountMinSketch:
    depth: int
    width: int
    tables: List[List[int]]
    seeds: List[int]

    def __init__(self, depth: int, width: int):
        self.depth = depth
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]
        self.seeds = list(range(depth))

    def _hash(self, item: str, seed: int) -> int:
        h = hashlib.md5(f"{seed}-{item}".encode('utf-8')).hexdigest()
        return int(h, 16) % self.width

    def update(self, key: str, delta: int = 1) -> None:
        for i in range(self.depth):
            hash_value = self._hash(key, self.seeds[i])
            self.tables[i][hash_value] += delta

    def estimate(self, key: str) -> int:
        return min(self.tables[i][self._hash(key, self.seeds[i])] for i in range(self.depth))


@dataclass
class StreamSummary:
    capacity: int
    items: dict
    heap: List[Tuple[int, str]]

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = {}
        self.heap = []

    def add(self, key: str, count: int) -> None:
        if key in self.items:
            self.items[key] = count
        else:
            if len(self.items) < self.capacity:
                self.items[key] = count
                heapq.heappush(self.heap, (count, key))
            elif count > self.heap[0][0]:
                _, min_key = heapq.heappop(self.heap)
                del self.items[min_key]
                self.items[key] = count
                heapq.heappush(self.heap, (count, key))  # <Issue> never evicts keys when their *true* counts drop on window eviction

    def topk(self) -> List[Tuple[str, int]]:
        return sorted(self.items.items(), key=lambda x: (-x[1], x[0]))[:self.capacity]


def process_stream(
    events: List[Tuple[int, str]],
    W: int,
    k: int,
    sketch_depth: int,
    sketch_width: int
) -> List[str]:
    cms = CountMinSketch(sketch_depth, sketch_width)
    stream_summary = StreamSummary(k)
    time_buffer = collections.deque()
    output = []

    for timestamp, field_name in events:
        # Add new event
        cms.update(field_name, 1)
        estimate = cms.estimate(field_name)
        stream_summary.add(field_name, estimate)
        time_buffer.append((timestamp, field_name))

        # Evict old events
        while time_buffer and time_buffer[0][0] <= timestamp - W:
            old_timestamp, old_field_name = time_buffer.popleft()
            cms.update(old_field_name, -1)    # <Issue> StreamSummary isn’t decremented when entries leave the window

        # Collect top-k for output
        top_fields = stream_summary.topk()

        # Pad to k entries (incorrect per spec)
        seen_fields = {name for _, name in top_fields}
        missing_count = max(0, k - len(seen_fields))
        not_seen = sorted(set(name for _, name in time_buffer) - seen_fields)[:missing_count]

        topk_output = [f"{name}:{count}" for name, count in top_fields] + [f"{name}:0" for name in not_seen]
        topk_output.sort(key=lambda x: (int(x.split(':')[1]), x.split(':')[0]))
        output.append(" ".join(topk_output))   # <Issue> output missing timestamp, brackets, and commas per spec

    return output

```
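
The formatting problem flagged in the last `<Issue>` comment can be caught mechanically. Below is a minimal sketch, assuming the line shape given under Expected Output Format; `LINE_RE` and `check_line` are hypothetical names, and the check only covers shape, entry count, and non-increasing counts, not the correctness of the estimates.

```python
import re

# Timestamp, a space, then a bracketed list of entries.
LINE_RE = re.compile(r"^(\d+) \[(.*)\]$")


def check_line(line: str, k: int) -> bool:
    """Return True if `line` looks like 't [f_1:c_1, ..., f_k:c_k]'."""
    match = LINE_RE.match(line)
    if match is None:
        return False
    entries = match.group(2).split(", ")
    if len(entries) != k:
        return False
    counts = []
    for entry in entries:
        # Split on the last colon so the count is always the final part.
        _name, sep, count = entry.rpartition(":")
        if not sep or not count.isdigit():
            return False
        counts.append(int(count))
    # Counts must be sorted in non-increasing order.
    return all(a >= b for a, b in zip(counts, counts[1:]))


print(check_line("3 [user:2, post:1]", 2))  # prints: True
print(check_line("post:1 user:2", 2))       # prints: False
```

The second call mirrors the space-joined shape produced by the model code above, which has no timestamp or brackets and therefore fails the check.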