# Metadata

**L1 Taxonomy** - Backend Integration

**L2 Taxonomy** - Webhooks

**Subtopic** - Handling duplicate webhook deliveries idempotently (processing each event once)

**Use Case** - Implement a Python script that simulates the receipt of webhook events. The script should handle duplicate webhook deliveries idempotently by storing a unique identifier for each event in a local file and checking this file before processing each event to ensure it is only processed once.

**Programming Language** - Python

**Target Model** - GPT-4o

# Setup

```requirements.txt
```


# Prompt
Problem Statement:
- Implement a Python script that simulates the receipt of webhook events.
- The script should handle duplicate webhook deliveries idempotently by storing a unique identifier for each event in a local file and checking this file before processing each event to ensure it is only processed once.
- Only events with unique IDs (not already stored) should be processed.

Input Format:
- A list of webhook event objects, each represented as a Python dictionary.
- Each event must contain the following fields:
  - `event_id`: A unique string identifier for the event (used to detect duplicates).
  - `event_type`: A string describing the type of event (e.g., "user.created").
  - `timestamp`: A string in ISO 8601 format indicating when the event occurred.
  - `payload`: A dictionary containing additional event-specific data.

Input Constraints:
- The `event_id` must be a non-empty string with a maximum length of 100 characters.
- The `event_type` must be a non-empty string.
- The `timestamp` must be a valid ISO 8601 formatted string (e.g., "2025-07-22T08:00:00Z").
- The `payload` must be a JSON object and may contain arbitrary key-value pairs.
- A maximum of 10,000 events may be processed in one run.
- If an event is missing required fields, it is considered malformed and may be skipped or raise an error depending on implementation preference.

Output Format:
- For each event:
  - If the `event_id` has not been seen before:
    - Process the event and log: "Processed event: <event_id>"
  - If the `event_id` is a duplicate:
    - Skip processing and log: "Duplicate event skipped: <event_id>"

- All processed `event_id`s are stored in a local file (e.g., "processed_events.txt") to ensure idempotency across runs.

Function Signature:
```python
def process_webhook_events(events: list, storage_file: str = "processed_events.txt") -> None:
```

Example:
```python
Input:
[
    {
        "event_id": "evt_001",
        "event_type": "user.created",
        "timestamp": "2025-07-22T08:00:00Z",
        "payload": {"user_id": 1, "name": "Alice"}
    },
    {
        "event_id": "evt_002",
        "event_type": "user.updated",
        "timestamp": "2025-07-22T08:05:00Z",
        "payload": {"user_id": 1, "name": "Alice Smith"}
    },
    {
        "event_id": "evt_001",
        "event_type": "user.created",
        "timestamp": "2025-07-22T08:10:00Z",
        "payload": {"user_id": 1, "name": "Duplicate Alice"}
    }
]

Output:
Processed event: evt_001
Processed event: evt_002
Duplicate event skipped: evt_001
```

# Requirements
Explicit Requirements:
- Each event must be uniquely identified using its `event_id`.
- Duplicate events (same `event_id`) must not be processed more than once.
- All processed `event_id`s must be persisted in a local file for future deduplication.
- The system must log whether an event is processed or skipped due to duplication.

Implicit Requirements:
- The solution must be idempotent across multiple runs by persisting state (via a file).
- The solution should gracefully handle cases where the storage file does not exist initially.
- The system should efficiently handle large input (up to 10,000 events) without reprocessing.
- Event ordering should be preserved in output (processed/skipped logs appear in input order).
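
A minimal sketch of how the persisted state might be loaded, assuming the storage file holds one `event_id` per line in UTF-8. The helper name `load_processed_ids` is hypothetical and not part of the required signature. A missing file yields an empty set, and a set keeps membership checks cheap for batches of up to 10,000 events.

```python
from pathlib import Path


def load_processed_ids(storage_file: str) -> set:
    """Return the set of event IDs already recorded in storage_file.

    Hypothetical helper: assumes one ID per line, UTF-8 encoded.
    """
    path = Path(storage_file)
    if not path.exists():
        # First run: nothing has been processed yet.
        return set()
    # errors="replace" keeps undecodable bytes from aborting the load.
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        # Blank lines are ignored; surrounding whitespace is stripped.
        return {line.strip() for line in fh if line.strip()}
```

Replacing undecodable bytes is one possible reading of "gracefully handle"; an implementation could equally discard a corrupted file and start from an empty set.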

Solution Expectations:
- Read and parse the input list of webhook event dictionaries.
- Check if each `event_id` has already been processed by referencing a local file.
- If the `event_id` is new:
  - Process the event (e.g., print or simulate processing).
  - Append the `event_id` to the storage file to mark it as processed.
  - Print: "Processed event: <event_id>"
- If the `event_id` already exists in the file:
  - Skip processing.
  - Print: "Duplicate event skipped: <event_id>"
- Ensure file read/write operations are safe and efficient.
- Maintain a consistent and readable output format.
- Perform input validation on all inputs according to the Input Constraints.
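
The validation step could be sketched as a single function that returns a description of the first problem found, or `None` for a well-formed event. The name `validation_error` and the `REQUIRED_FIELDS` tuple are illustrative, and the timestamp check assumes only the UTC `Z` form shown in the prompt's example is accepted; other ISO 8601 variants (offsets, fractional seconds) would be rejected by this sketch.

```python
from datetime import datetime
from typing import Optional

REQUIRED_FIELDS = ("event_id", "event_type", "timestamp", "payload")


def validation_error(event: object) -> Optional[str]:
    """Return a description of the first problem found, or None if valid."""
    if not isinstance(event, dict):
        return "event is not a dictionary"
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        return "missing required fields: " + ", ".join(missing)
    event_id = event["event_id"]
    if not isinstance(event_id, str) or not 1 <= len(event_id) <= 100:
        return "'event_id' must be a non-empty string of at most 100 characters"
    event_type = event["event_type"]
    if not isinstance(event_type, str) or not event_type:
        return "'event_type' must be a non-empty string"
    timestamp = event["timestamp"]
    if not isinstance(timestamp, str):
        return "'timestamp' must be a string"
    try:
        # Assumption: only the UTC "Z" form from the prompt is accepted.
        datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return "'timestamp' is not in YYYY-MM-DDTHH:MM:SSZ form"
    if not isinstance(event["payload"], dict):
        return "'payload' must be a dictionary"
    return None
```

Unlike a pure regular-expression check, `datetime.strptime` also rejects out-of-range values such as month 13. Returning a message instead of raising leaves the skip-or-raise decision to the caller.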

Edge Cases and Behavior:
- Duplicate Events in the Same Input:
  - Behavior: Only the first occurrence is processed; subsequent duplicates are skipped.

- Duplicate Events Across Multiple Runs:
  - Behavior: Events processed in previous runs (stored in the file) are not reprocessed.

- Missing or Corrupted Storage File:
  - Behavior: If the storage file does not exist, it is created. If it is empty or unreadable, processing continues from scratch.

- Empty Input List:
  - Behavior: No output is printed, and no changes are made to the storage file.

- Malformed Events (e.g., missing `event_id`):
  - Behavior: Such events should be skipped or raise a descriptive error depending on implementation preference.

- Large Volume of Events (up to 10,000):
  - Behavior: Must be handled efficiently without memory or performance issues. Deduplication should remain accurate.

- Non-ASCII or Special Characters in `event_id`:
  - Behavior: Handled correctly as long as the storage file is read and written with UTF-8 encoding.

- Reordered Events:
  - Behavior: Deduplication is based on `event_id`, so order doesn’t affect correctness but may affect output sequence.
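
The two duplicate cases above (within one input and across runs) reduce to the same membership check against a set of seen IDs. The sketch below illustrates this with a hypothetical helper, `classify_events`, that works on bare IDs and an in-memory set; in the real function the set would be loaded from the storage file at the start of each run.

```python
def classify_events(event_ids, seen):
    """Return one log line per ID, in input order, updating `seen` in place."""
    lines = []
    for event_id in event_ids:
        if event_id in seen:
            lines.append(f"Duplicate event skipped: {event_id}")
        else:
            # Record the ID immediately so a later duplicate in the
            # same batch is caught as well.
            seen.add(event_id)
            lines.append(f"Processed event: {event_id}")
    return lines


# Simulate two runs that share the same persisted state.
seen_ids = set()
first_run = classify_events(["evt_001", "evt_002", "evt_001"], seen_ids)
second_run = classify_events(["evt_002", "evt_004"], seen_ids)
```

Here `first_run` ends with a skipped duplicate of `evt_001`, and `second_run` skips `evt_002` because it was recorded during the first call.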

Solution Constraints:
- Must use only standard Python libraries (no external dependencies).
- Must support persistent deduplication across multiple script executions via a local file.
- Must handle up to 10,000 events per run efficiently.
- File I/O operations must be safe, using append or atomic write techniques to avoid corruption.
- Must not modify the original input data.
- Logging/output must preserve the input event order.
- If the storage file is locked or cannot be written to, the system should fail gracefully with an informative error message.
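
One way the append-and-fail-gracefully constraints might be met is sketched below. The helper `append_event_id` is hypothetical; it opens the file in append mode, flushes, asks the operating system to sync the data to disk, and reports failure through its return value instead of raising.

```python
import os


def append_event_id(storage_file: str, event_id: str) -> bool:
    """Append event_id to storage_file; return False if the write fails."""
    try:
        with open(storage_file, "a", encoding="utf-8") as fh:
            fh.write(event_id + "\n")
            fh.flush()
            # Ask the OS to push the data to disk before reporting success.
            os.fsync(fh.fileno())
    except OSError as exc:
        # Covers permission errors, locked files, and invalid paths.
        print(f"Error writing to storage file '{storage_file}': {exc}")
        return False
    return True
```

Reopening the file for every event is simple but slow for large batches; keeping a single append handle open for the whole run is a reasonable alternative when throughput matters.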

In [None]:
# code
"""Process webhook events with deduplication and basic validation."""

import os
import re


def process_webhook_events(
    events: list,
    storage_file: str = "processed_events.txt"
) -> None:
    """
    Process webhook events and store unique event IDs to avoid duplicates.

    Args:
        events: A list of webhook event objects (dictionaries).
        storage_file: The name of the local file to store processed event IDs.
    """
    processed_event_ids = set()

    # Load already processed event IDs from the storage file
    if os.path.exists(storage_file):
        try:
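            # Decode as UTF-8 and replace undecodable bytes so that a
            # corrupted file does not abort processing.
            with open(storage_file, 'r', encoding='utf-8',
                      errors='replace') as f: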
                for line in f:
                    event_id = line.strip()
                    if event_id:
                        processed_event_ids.add(event_id)
        except IOError as e:
            print(f"Error reading storage file '{storage_file}': {e}")
            return
        except Exception as e:
            print(
                f"An unexpected error occurred"
                f" while reading '{storage_file}': {e}"
            )
            return

    # Open the storage file in append mode for writing new event IDs
    try:
        with open(storage_file, 'a', encoding='utf-8') as f_append:
            for i, event in enumerate(events):
                event_id = event.get("event_id")
                event_type = event.get("event_type")
                timestamp = event.get("timestamp")
                payload = event.get("payload")

                if not all([
                    event_id, event_type, timestamp, payload is not None
                ]):
                    print(
                        f"Skipping malformed event at index {i}: "
                        f"Missing required fields."
                    )
                    continue

                if not isinstance(event_id, str
                                  ) or not (1 <= len(event_id) <= 100):
                    print(
                        f"Skipping malformed event at index {i}: "
                        f"'event_id' is invalid."
                    )
                    continue

                if not isinstance(event_type, str) or not event_type:
                    print(
                        f"Skipping malformed event at index {i}: "
                        f"'event_type' is invalid."
                    )
                    continue

                # Basic ISO 8601 format check
                if (
                    not isinstance(timestamp, str) or
                    not re.fullmatch(
                        r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z',
                        timestamp
                    )
                ):
                    print(
                        f"Skipping malformed event at index {i}: "
                        f"'timestamp' is not in valid ISO 8601 format "
                        f"(YYYY-MM-DDTHH:MM:SSZ)."
                    )
                    continue

                if not isinstance(payload, dict):
                    print(
                        f"Skipping malformed event at index {i}: "
                        f"'payload' is not a dictionary."
                    )
                    continue

                if event_id in processed_event_ids:
                    print(f"Duplicate event skipped: {event_id}")
                else:
                    print(f"Processed event: {event_id}")
                    processed_event_ids.add(event_id)
                    f_append.write(f"{event_id}\n")
                    f_append.flush()

    except IOError as e:
        print(f"Error writing to storage file '{storage_file}': {e}")
    except Exception as e:
        print(
            f"An unexpected error occurred during event processing: {e}"
        )


if __name__ == "__main__":
    if os.path.exists("processed_events.txt"):
        os.remove("processed_events.txt")
        print("Cleaned up 'processed_events.txt' from previous run.")
    print("-" * 30)

    print("--- First Run ---")
    events_run1 = [
        {
            "event_id": "evt_001",
            "event_type": "user.created",
            "timestamp": "2025-07-22T08:00:00Z",
            "payload": {"user_id": 1, "name": "Alice"}
        },
        {
            "event_id": "evt_002",
            "event_type": "user.updated",
            "timestamp": "2025-07-22T08:05:00Z",
            "payload": {"user_id": 1, "name": "Alice Smith"}
        },
        {
            "event_id": "evt_001",
            "event_type": "user.created",
            "timestamp": "2025-07-22T08:10:00Z",
            "payload": {"user_id": 1, "name": "Duplicate Alice"}
        },
        {
            "event_id": "evt_003",
            "event_type": "product.added",
            "timestamp": "2025-07-22T09:00:00Z",
            "payload": {"product_id": 101, "name": "Laptop"}
        }
    ]
    process_webhook_events(events_run1)
    print("-" * 30)

    print("--- Second Run ---")
    events_run2 = [
        {
            "event_id": "evt_002",
            "event_type": "user.deleted",
            "timestamp": "2025-07-22T10:00:00Z",
            "payload": {"user_id": 1}
        },
        {
            "event_id": "evt_004",
            "event_type": "order.placed",
            "timestamp": "2025-07-22T10:15:00Z",
            "payload": {"order_id": 5001, "amount": 100.00}
        },
        {
            "event_id": "evt_001",
            "event_type": "user.login",
            "timestamp": "2025-07-22T10:20:00Z",
            "payload": {"user_id": 1}
        }
    ]
    process_webhook_events(events_run2)
    print("-" * 30)

    print("--- Malformed Events Test ---")
    malformed_events = [
        {
            "event_type": "missing.id",
            "timestamp": "2025-07-22T11:00:00Z",
            "payload": {}
        },
        {
            "event_id": "evt_005",
            "event_type": "",
            "timestamp": "2025-07-22T11:05:00Z",
            "payload": {}
        },
        {
            "event_id": "evt_006",
            "event_type": "invalid.timestamp",
            "timestamp": "2025/07/22 11:10:00",
            "payload": {}
        },
        {
            "event_id": "evt_007",
            "event_type": "invalid.payload",
            "timestamp": "2025-07-22T11:15:00Z",
            "payload": "not_a_dict"
        },
        {
            "event_id": "evt_008" * 15,
            "event_type": "long.id.test",
            "timestamp": "2025-07-22T11:20:00Z",
            "payload": {}
        }
    ]
    process_webhook_events(malformed_events)
    print("-" * 30)

    print("--- Empty Input Test ---")
    process_webhook_events([])
    print("-" * 30)

    print("\nContent of processed_events.txt:")
    if os.path.exists("processed_events.txt"):
        with open("processed_events.txt", 'r') as f:
            print(f.read().strip())
    else:
        print("File 'processed_events.txt' does not exist.")


In [None]:
# tests

import io
import os
import tempfile
import unittest
from contextlib import redirect_stdout
from pathlib import Path

from main import process_webhook_events


def _capture_output(func, *args, **kwargs):
    """Run *func* while capturing stdout; return list(stripped lines)."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        func(*args, **kwargs)
    return [line.strip() for line in buf.getvalue().splitlines()]


class TestProcessWebhookEvents(unittest.TestCase):
    """Unit-tests for `process_webhook_events` as specified in the prompt."""

    def setUp(self):
        self.tmpdir = tempfile.TemporaryDirectory()
        self.store_path = os.path.join(self.tmpdir.name, "events.txt")

    def tearDown(self):
        self.tmpdir.cleanup()

    def test_unique_events_first_run(self):
        events = [
            {"event_id": "evt_1", "event_type": "alpha",
                "timestamp": "2025-07-22T08:00:00Z", "payload": {}},
            {"event_id": "evt_2", "event_type": "beta",
                "timestamp": "2025-07-22T08:01:00Z", "payload": {}},
        ]
        out = _capture_output(process_webhook_events, events, self.store_path)
        self.assertEqual(
            out,
            ["Processed event: evt_1", "Processed event: evt_2"],
            "Both unique events should be processed in order.",
        )
        # verify IDs persisted
        with open(self.store_path, encoding="utf-8") as fh:
            self.assertEqual(fh.read().splitlines(), ["evt_1", "evt_2"])

    def test_duplicates_within_single_batch(self):
        events = [
            {"event_id": "dup", "event_type": "x",
                "timestamp": "2025-07-22T08:00:00Z", "payload": {}},
            {"event_id": "dup", "event_type": "x",
                "timestamp": "2025-07-22T08:02:00Z", "payload": {}},
        ]
        out = _capture_output(process_webhook_events, events, self.store_path)
        self.assertEqual(
            out,
            ["Processed event: dup", "Duplicate event skipped: dup"],
            "Second occurrence in same run should be skipped.",
        )

    def test_idempotency_across_runs(self):
        ev = {"event_id": "evt_z", "event_type": "z",
              "timestamp": "2025-07-22T08:00:00Z", "payload": {}}
        _capture_output(process_webhook_events, [ev], self.store_path)
        out = _capture_output(process_webhook_events, [ev], self.store_path)
        self.assertEqual(out, ["Duplicate event skipped: evt_z"])

    def test_storage_file_created_when_missing(self):
        self.assertFalse(Path(self.store_path).exists(),
                         "Pre-condition: storage file absent.")
        _capture_output(
            process_webhook_events,
            [{"event_id": "a", "event_type": "t",
                "timestamp": "2025-07-22T08:00:00Z", "payload": {}}],
            self.store_path,
        )
        self.assertTrue(Path(self.store_path).exists(),
                        "File should be created on first write.")

    def test_empty_input_no_processing(self):
        out = _capture_output(process_webhook_events, [], self.store_path)
        self.assertEqual(out, [], "No events means no output.")
        if Path(self.store_path).exists():
            self.assertEqual(
                Path(self.store_path).read_text(encoding="utf-8"), "")

    def test_malformed_event_skipped(self):
        """Only the well-formed event is processed; malformed one is NOT."""
        events = [
            {"event_id": "good", "event_type": "x",
             "timestamp": "2025-07-22T08:00:00Z", "payload": {}},
            {"event_type": "missing_id"},  # malformed
        ]
        out = _capture_output(process_webhook_events, events, self.store_path)

        processed = [ln for ln in out if ln.startswith("Processed event:")]
        self.assertEqual(
            processed,
            ["Processed event: good"],
            "Malformed event must not be processed (no extra "
            "'Processed event: …' lines are allowed).",
        )

    def test_special_characters_event_id(self):
        eid = "%@#$42"
        events = [{"event_id": eid, "event_type": "x",
                   "timestamp": "2025-07-22T08:00:00Z", "payload": {}}]
        out = _capture_output(process_webhook_events, events, self.store_path)
        self.assertEqual(out, [f"Processed event: {eid}"])

    def test_max_length_event_id(self):
        eid = "x" * 100
        events = [{"event_id": eid, "event_type": "t",
                   "timestamp": "2025-07-22T08:00:00Z", "payload": {}}]
        out = _capture_output(process_webhook_events, events, self.store_path)
        self.assertEqual(out, [f"Processed event: {eid}"])

    def test_corrupted_storage_file_recovery(self):
        # pre-corrupt with random bytes & blank lines
        with open(self.store_path, "wb") as fh:
            fh.write(b"bad\xffline\n\n")
        events = [{"event_id": "fresh", "event_type": "a",
                   "timestamp": "2025-07-22T08:00:00Z", "payload": {}}]
        out = _capture_output(process_webhook_events, events, self.store_path)
        self.assertIn("Processed event: fresh", out[0])

    def test_large_batch_performance_integrity(self):
        events = [
            {"event_id": f"id_{i}", "event_type": "bulk",
                "timestamp": "2025-07-22T08:00:00Z", "payload": {}}
            for i in range(1_000)
        ]
        out = _capture_output(process_webhook_events, events, self.store_path)
        # first and last lines sanity-check
        self.assertEqual(out[0], "Processed event: id_0")
        self.assertEqual(out[-1], "Processed event: id_999")
        self.assertEqual(
            len(out), 1_000, "Every unique event should be processed once.")


if __name__ == "__main__":
    unittest.main(verbosity=2)


# Model Breaking Proof

#### Model Breaking Task URL: https://rlhf-v3.turing.com/prompt/48b13eb1-45b7-4a3d-96a4-0f95d34e1b12?origin=https%3A%2F%2Flabeling-ms.turing.com&redirect_url=https%3A%2F%2Flabeling-ms.turing.com%2Fconversations%2F82885%2Fview

#### Model code:

```python
# code generated by the model
import os
import json

def process_webhook_events(events: list, storage_file: str = "processed_events.txt") -> None:
    # Check if the storage file exists, if not create one.
    if not os.path.exists(storage_file):
        open(storage_file, 'w').close()
        # <Issue>: No try-except when creating the file. Could fail if file permissions are restricted.
    
    # Read already processed event_ids from the storage file
    with open(storage_file, 'r') as f:
        # Reading line by line to avoid loading everything in memory at once for large files
        processed_ids = set(line.strip() for line in f.readlines())
        # <Issue>: `.readlines()` loads the entire file into memory, contrary to the comment above.
        # <Issue>: No try-except around this read operation. Could raise IOError if file is corrupted or inaccessible.
    
    with open(storage_file, 'a') as f:
        # <Issue>: No try-except for file writing. Could raise IOError if file is locked or read-only.
        for event in events:
            # Basic validation of the event
            try:
                event_id = event['event_id']
                event_type = event['event_type']
                timestamp = event['timestamp']
                payload = event['payload']
                
                # Checking constraints
                if not isinstance(event_id, str) or not event_id:
                    raise ValueError("Invalid event_id")
                # <Issue>: Missing check for max length 100 for `event_id`.
                if not isinstance(event_type, str) or not event_type:
                    raise ValueError("Invalid event_type")
                # <Issue>: No validation for ISO 8601 format of `timestamp`.
                # <Issue>: No validation that `payload` is a dictionary.
                # Add other validation checks here if necessary

            except (KeyError, ValueError) as e:  # Catch missing keys or validation issues
                print(f"Malformed event skipped: {e}")
                continue

            # Process the event only if it hasn't been processed before
            if event_id not in processed_ids:
                # Simulate processing of the event
                print(f"Processed event: {event_id}")
                # Mark the event as processed by storing its id
                f.write(event_id + '\n')
                # Add the processed_id to the current set to avoid processing it within the same run
                processed_ids.add(event_id)
            else:
                # Duplicate event handling
                print(f"Duplicate event skipped: {event_id}")

# Example events for demonstration
example_events = [
    {
        "event_id": "evt_001",
        "event_type": "user.created",
        "timestamp": "2025-07-22T08:00:00Z",
        "payload": {"user_id": 1, "name": "Alice"}
    },
    {
        "event_id": "evt_002",
        "event_type": "user.updated",
        "timestamp": "2025-07-22T08:05:00Z",
        "payload": {"user_id": 1, "name": "Alice Smith"}
    },
    {
        "event_id": "evt_001",
        "event_type": "user.created",
        "timestamp": "2025-07-22T08:10:00Z",
        "payload": {"user_id": 1, "name": "Duplicate Alice"}
    }
]

# Invoking the function
process_webhook_events(example_events)
```