# Wikipedia Event Stream Processing

## Overview

This notebook implements a real-time stream processing job that tracks Wikipedia edit events for five entities from the IMDB dataset. We use the Wikimedia EventStreams API, which provides a continuous stream of events about changes happening across all Wikimedia projects (including Wikipedia).

### Entities Tracked

We track the following five IMDB-related entities:

1. **Christopher Nolan** - Famous director (Inception, The Dark Knight, Interstellar)
2. **The Godfather** - Classic movie, frequently referenced and edited
3. **Quentin Tarantino** - Influential director/writer
4. **Science Fiction** - Genre page with frequent activity
5. **Academy Awards** - Major film awards, high edit activity

### Metrics Collected

For each tracked entity, we collect:
- Total number of edits
- Edit timestamps
- User who made the edit
- Whether the edit was made by a bot
- Size change of the edit (bytes added/removed)

### Alert System

An alert is triggered when:
- **Large edits** occur (> 500 bytes changed) - these could indicate vandalism or major content changes
- **Bot edits** are detected - automated changes that might need review
- **Rapid successive edits** occur - three or more edits to the same entity within 60 seconds (potential edit war)
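
The rapid-edit rule amounts to a sliding time window per entity. The following is a minimal sketch of that idea, not the notebook's own implementation: `make_rapid_edit_detector` and `observe` are hypothetical names, timestamps are plain seconds, and the defaults (3 edits in 60 seconds) mirror the values configured later in the notebook.

```python
from collections import deque


def make_rapid_edit_detector(window_seconds: float = 60, min_edits: int = 3):
    """Return a function that flags an edit when the window holds enough edits."""
    recent = deque()  # timestamps (in seconds) of edits still inside the window

    def observe(ts: float) -> bool:
        recent.append(ts)
        # Drop timestamps that are no longer strictly inside the window
        while recent and recent[0] <= ts - window_seconds:
            recent.popleft()
        return len(recent) >= min_edits

    return observe


observe = make_rapid_edit_detector()
flags = [observe(t) for t in (0, 10, 20, 100)]
print(flags)  # [False, False, True, False]
```

A deque keeps eviction cheap because old timestamps are only ever removed from the left. One detector per tracked entity would be needed to reproduce per-entity behavior.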

### Output Structure

The system outputs to two JSON files:
1. `wiki_events.json` - All tracked events with metrics
2. `wiki_alerts.json` - Alert events requiring attention

Each event record contains:
```json
{
    "entity": "string - the tracked entity name",
    "timestamp": "ISO 8601 timestamp",
    "user": "string - username who made the edit",
    "is_bot": "boolean - whether the user is a bot",
    "title": "string - Wikipedia page title",
    "comment": "string - edit comment/summary",
    "size_change": "integer - bytes added/removed",
    "wiki": "string - which wiki (e.g., enwiki)",
    "revision_id": "integer - revision ID"
}
```

Alert records additionally contain:
```json
{
    "alert_type": "string - LARGE_EDIT, BOT_EDIT, or RAPID_EDIT",
    "alert_reason": "string - detailed explanation"
}
```
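
As an illustration of the two record shapes, the sketch below checks a record against the fields listed above before it is written. `EVENT_SCHEMA`, `ALERT_SCHEMA`, and `schema_errors` are hypothetical names introduced for this example; the sample values are made up.

```python
# Field names and types taken from the record descriptions above
EVENT_SCHEMA = {
    "entity": str, "timestamp": str, "user": str, "is_bot": bool,
    "title": str, "comment": str, "size_change": int, "wiki": str,
    "revision_id": int,
}
# An alert record is an event record plus two extra fields
ALERT_SCHEMA = {**EVENT_SCHEMA, "alert_type": str, "alert_reason": str}


def schema_errors(record: dict, schema: dict) -> list:
    """Return a list of human-readable problems; empty means the record fits."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it for integer fields
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


sample_event = {
    "entity": "The Godfather", "timestamp": "2024-01-01T00:00:00Z",
    "user": "ExampleUser", "is_bot": False, "title": "The Godfather",
    "comment": "copyedit", "size_change": -12, "wiki": "enwiki",
    "revision_id": 123,
}
print(schema_errors(sample_event, EVENT_SCHEMA))  # []
print(schema_errors(sample_event, ALERT_SCHEMA))
```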

## 1. Setup and Dependencies

First, we install and import the necessary libraries for stream processing.

In [1]:
# Install required packages
%pip install sseclient-py requests

Collecting sseclient-py
  Downloading sseclient_py-1.8.0-py2.py3-none-any.whl.metadata (2.0 kB)
Downloading sseclient_py-1.8.0-py2.py3-none-any.whl (8.8 kB)
Installing collected packages: sseclient-py
Successfully installed sseclient-py-1.8.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
import json
import os
from datetime import datetime, timedelta
from collections import defaultdict
import requests
from sseclient import SSEClient
import threading
import time

print("Dependencies loaded successfully!")

Dependencies loaded successfully!


## 2. Define Tracked Entities

We define five entities from the IMDB dataset that have corresponding Wikipedia pages. These entities were chosen because:
- They have active Wikipedia pages with regular edits
- They represent different categories (directors, movies, genres, awards)
- They are popular enough to generate sufficient streaming events

In [3]:
# Define the entities we want to track
# Each entity maps to its Wikipedia page title variations
TRACKED_ENTITIES = {
    "Christopher Nolan": [
        "Christopher Nolan",
        "Christopher_Nolan",
        "Nolan, Christopher"
    ],
    "The Godfather": [
        "The Godfather",
        "The_Godfather",
        "Godfather (film)",
        "The Godfather (film)"
    ],
    "Quentin Tarantino": [
        "Quentin Tarantino",
        "Quentin_Tarantino",
        "Tarantino"
    ],
    "Science Fiction": [
        "Science fiction",
        "Science_fiction",
        "Science fiction film",
        "Sci-fi",
        "Science fiction genre"
    ],
    "Academy Awards": [
        "Academy Awards",
        "Academy_Awards",
        "Oscar",
        "Oscars",
        "Academy Award"
    ]
}

# Create a reverse lookup for quick matching
TITLE_TO_ENTITY = {}
for entity, titles in TRACKED_ENTITIES.items():
    for title in titles:
        TITLE_TO_ENTITY[title.lower()] = entity

print(f"Tracking {len(TRACKED_ENTITIES)} entities:")
for entity in TRACKED_ENTITIES:
    print(f"  - {entity}")

Tracking 5 entities:
  - Christopher Nolan
  - The Godfather
  - Quentin Tarantino
  - Science Fiction
  - Academy Awards
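
The underscore variants in the mapping above exist only to absorb formatting differences in titles. One alternative, sketched here under the assumption that titles differ only in underscores, spacing, and case, is to normalize both sides before lookup. `normalize_title` and the small `variants` mapping are hypothetical and not part of the notebook.

```python
def normalize_title(title: str) -> str:
    """Map underscore and spacing variants of a page title to one lookup key."""
    # Underscores become spaces, runs of whitespace collapse, case is folded
    return " ".join(title.replace("_", " ").split()).casefold()


# A reduced mapping for illustration: no underscore duplicates are needed
variants = {
    "Christopher Nolan": ["Christopher Nolan"],
    "Science Fiction": ["Science fiction", "Science fiction film"],
}
lookup = {
    normalize_title(v): entity for entity, vs in variants.items() for v in vs
}

print(lookup.get(normalize_title("Christopher_Nolan")))
print(lookup.get(normalize_title("  science   FICTION ")))
print(lookup.get(normalize_title("Oscar Wilde")))
```

With this approach an unrelated title such as "Oscar Wilde" simply misses the lookup instead of being caught by a substring check.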


## 3. Output File Configuration

Configure the output files for storing events and alerts.

In [4]:
# Output file paths
OUTPUT_DIR = "streaming_output"
EVENTS_FILE = os.path.join(OUTPUT_DIR, "wiki_events.json")
ALERTS_FILE = os.path.join(OUTPUT_DIR, "wiki_alerts.json")

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize files with empty arrays if they don't exist
for filepath in [EVENTS_FILE, ALERTS_FILE]:
    if not os.path.exists(filepath):
        with open(filepath, 'w') as f:
            json.dump([], f)

print(f"Output directory: {OUTPUT_DIR}")
print(f"Events file: {EVENTS_FILE}")
print(f"Alerts file: {ALERTS_FILE}")

Output directory: streaming_output
Events file: streaming_output/wiki_events.json
Alerts file: streaming_output/wiki_alerts.json


## 4. Alert Configuration

Define thresholds and rules for generating alerts.

In [5]:
# Alert configuration
ALERT_CONFIG = {
    "large_edit_threshold": 500,  # Bytes changed to trigger large edit alert
    "rapid_edit_window": 60,      # Seconds window for rapid edit detection
    "rapid_edit_count": 3,        # Number of edits in window to trigger alert
    "alert_on_bot": True          # Whether to alert on bot edits
}

# Track recent edits for rapid edit detection
recent_edits = defaultdict(list)  # entity -> list of timestamps

print("Alert Configuration:")
print(f"  - Large edit threshold: {ALERT_CONFIG['large_edit_threshold']} bytes")
print(f"  - Rapid edit window: {ALERT_CONFIG['rapid_edit_window']} seconds")
print(f"  - Rapid edit count: {ALERT_CONFIG['rapid_edit_count']} edits")
print(f"  - Alert on bot edits: {ALERT_CONFIG['alert_on_bot']}")

Alert Configuration:
  - Large edit threshold: 500 bytes
  - Rapid edit window: 60 seconds
  - Rapid edit count: 3 edits
  - Alert on bot edits: True


## 5. Event Processing Functions

Core functions for processing events, detecting alerts, and saving data.

In [6]:
def save_event(event_data: dict, filepath: str) -> None:
    """
    Append an event to a JSON file.
    
    Args:
        event_data: Dictionary containing event information
        filepath: Path to the JSON file
    """
    try:
        # Read existing data
        with open(filepath, 'r') as f:
            events = json.load(f)
        
        # Append new event
        events.append(event_data)
        
        # Write back
        with open(filepath, 'w') as f:
            json.dump(events, f, indent=2, default=str)
            
    except Exception as e:
        print(f"Error saving event: {e}")


def check_for_alerts(event_data: dict, entity: str) -> list:
    """
    Check if an event should trigger any alerts.
    
    Args:
        event_data: Dictionary containing event information
        entity: The tracked entity name
        
    Returns:
        List of alert dictionaries
    """
    alerts = []
    
    # Check for large edit
    size_change = abs(event_data.get('size_change', 0))
    if size_change > ALERT_CONFIG['large_edit_threshold']:
        alerts.append({
            **event_data,
            "alert_type": "LARGE_EDIT",
            "alert_reason": f"Edit changed {size_change} bytes (threshold: {ALERT_CONFIG['large_edit_threshold']})"
        })
    
    # Check for bot edit
    if ALERT_CONFIG['alert_on_bot'] and event_data.get('is_bot', False):
        alerts.append({
            **event_data,
            "alert_type": "BOT_EDIT",
            "alert_reason": f"Automated edit by bot user: {event_data.get('user', 'unknown')}"
        })
    
    # Check for rapid edits
    current_time = datetime.now()
    recent_edits[entity].append(current_time)
    
    # Clean old entries
    cutoff_time = current_time - timedelta(seconds=ALERT_CONFIG['rapid_edit_window'])
    recent_edits[entity] = [t for t in recent_edits[entity] if t > cutoff_time]
    
    # Check if the rapid edit threshold has been reached (>= rapid_edit_count)
    if len(recent_edits[entity]) >= ALERT_CONFIG['rapid_edit_count']:
        alerts.append({
            **event_data,
            "alert_type": "RAPID_EDIT",
            "alert_reason": f"{len(recent_edits[entity])} edits in {ALERT_CONFIG['rapid_edit_window']} seconds"
        })
    
    return alerts


def match_entity(title: str) -> str | None:
    """
    Check if a Wikipedia page title matches any tracked entity.
    
    Args:
        title: Wikipedia page title
        
    Returns:
        Entity name if matched, None otherwise
    """
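    title_lower = title.lower()

    # An empty title is a substring of every variant, so reject it up front
    if not title_lower:
        return None

    # Direct match
    if title_lower in TITLE_TO_ENTITY:
        return TITLE_TO_ENTITY[title_lower]

    # Partial match in either direction: the page title contains a tracked
    # variant, or a tracked variant contains the page title. This is a loose
    # check and can produce false positives for short titles.
    for entity_title, entity in TITLE_TO_ENTITY.items():
        if entity_title in title_lower or title_lower in entity_title:
            return entity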
    
    return None


print("Event processing functions defined.")

Event processing functions defined.
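
`save_event` above reads and rewrites the whole JSON array for every event, so each write grows more expensive as the file grows. A possible alternative, shown here only as a sketch, is an append-only JSON Lines file with one record per line. `append_jsonl` and `read_jsonl` are hypothetical helpers, and the `.jsonl` file name is an assumption; the notebook itself stores JSON arrays.

```python
import json
import os
import tempfile


def append_jsonl(record: dict, filepath: str) -> None:
    """Append one record as a single JSON line; existing lines are untouched."""
    with open(filepath, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, default=str) + "\n")


def read_jsonl(filepath: str) -> list:
    """Load every non-empty line of a JSON Lines file into a list of dicts."""
    with open(filepath, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demonstration in a temporary directory so no real output file is modified
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wiki_events.jsonl")
    append_jsonl({"entity": "The Godfather", "size_change": 42}, path)
    append_jsonl({"entity": "Academy Awards", "size_change": -7}, path)
    loaded = read_jsonl(path)

print(len(loaded))  # 2
```

The trade-off is that the file is no longer a single JSON document, so readers must parse it line by line.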


## 6. Metrics Tracking

Track aggregate metrics for each entity.

In [7]:
# Initialize metrics storage
metrics = {
    entity: {
        "total_edits": 0,
        "bot_edits": 0,
        "human_edits": 0,
        "total_bytes_changed": 0,
        "unique_users": set(),
        "alert_count": 0,
        "last_edit": None
    }
    for entity in TRACKED_ENTITIES
}


def update_metrics(entity: str, event_data: dict, alerts_generated: int) -> None:
    """
    Update aggregate metrics for an entity.
    
    Args:
        entity: The tracked entity name
        event_data: Dictionary containing event information
        alerts_generated: Number of alerts generated for this event
    """
    m = metrics[entity]
    m["total_edits"] += 1
    m["total_bytes_changed"] += abs(event_data.get('size_change', 0))
    m["alert_count"] += alerts_generated
    m["last_edit"] = event_data.get('timestamp')
    
    if event_data.get('is_bot', False):
        m["bot_edits"] += 1
    else:
        m["human_edits"] += 1
    
    user = event_data.get('user')
    if user:
        m["unique_users"].add(user)


def print_metrics() -> None:
    """
    Print current metrics for all tracked entities.
    """
    print("\n" + "=" * 60)
    print("CURRENT METRICS")
    print("=" * 60)
    
    for entity, m in metrics.items():
        print(f"\n{entity}:")
        print(f"  Total edits: {m['total_edits']}")
        print(f"  Human edits: {m['human_edits']}")
        print(f"  Bot edits: {m['bot_edits']}")
        print(f"  Total bytes changed: {m['total_bytes_changed']}")
        print(f"  Unique users: {len(m['unique_users'])}")
        print(f"  Alerts generated: {m['alert_count']}")
        print(f"  Last edit: {m['last_edit'] or 'None yet'}")


print("Metrics tracking initialized.")

Metrics tracking initialized.


## 7. Stream Processing Class

Main class that connects to Wikimedia EventStreams and processes events.

In [19]:
class WikiEventStreamProcessor:
    """
    Processes Wikipedia Recent Changes event stream and tracks specified entities.
    
    The Wikimedia EventStreams API provides real-time events for all changes
    across Wikimedia projects. We read the 'recentchange' stream and keep only
    events whose type is 'edit' and whose title matches a tracked entity.
    
    API Documentation: https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams_HTTP_Service
    """
    
    STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"
    
    # Wikimedia asks clients to identify themselves with a descriptive
    # User-Agent; the remaining headers request an uncached SSE response
    HEADERS = {
        "User-Agent": "IMDBStreamingProject/1.0 (Educational Project; Python/requests)",
        "Accept": "text/event-stream",
        "Cache-Control": "no-cache",
    }
    
    def __init__(self, duration_seconds: int = 120):
        """
        Initialize the stream processor.
        
        Args:
            duration_seconds: How long to run the stream (default: 2 minutes)
        """
        self.duration_seconds = duration_seconds
        self.running = False
        self.events_processed = 0
        self.events_matched = 0
        
    def process_event(self, event: dict) -> None:
        """
        Process a single event from the stream.
        
        Args:
            event: Raw event dictionary from EventStreams
        """
        self.events_processed += 1
        
        # Print progress every 1000 events
        if self.events_processed % 1000 == 0:
            print(f"  ... {self.events_processed} events processed, {self.events_matched} matched")
        
        # Skip non-edit events
        if event.get('type') != 'edit':
            return
        
        # Get page title and check if it matches any tracked entity
        title = event.get('title', '')
        entity = match_entity(title)
        
        if entity is None:
            return
        
        self.events_matched += 1
        
        # Extract relevant data
        event_data = {
            "entity": entity,
            "timestamp": event.get('meta', {}).get('dt', datetime.now().isoformat()),
            "user": event.get('user', 'unknown'),
            "is_bot": event.get('bot', False),
            "title": title,
            "comment": event.get('comment', ''),
            "size_change": event.get('length', {}).get('new', 0) - event.get('length', {}).get('old', 0),
            "wiki": event.get('wiki', ''),
            "revision_id": event.get('revision', {}).get('new', 0)
        }
        
        # Save event
        save_event(event_data, EVENTS_FILE)
        
        # Check for alerts
        alerts = check_for_alerts(event_data, entity)
        for alert in alerts:
            save_event(alert, ALERTS_FILE)
            print(f"\n🚨 ALERT: {alert['alert_type']} for {entity}")
            print(f"   Reason: {alert['alert_reason']}")
        
        # Update metrics
        update_metrics(entity, event_data, len(alerts))
        
        # Print event info
        print(f"\n✓ Event captured for: {entity}")
        print(f"  Title: {title}")
        print(f"  User: {event_data['user']} {'(bot)' if event_data['is_bot'] else ''}")
        print(f"  Change: {event_data['size_change']:+d} bytes")
    
    def run(self) -> None:
        """
        Start processing the event stream: open a streaming HTTP request,
        parse the SSE "data:" lines as JSON, and stop once the duration elapses.
        """
        print(f"Starting stream processing for {self.duration_seconds} seconds...")
        print(f"Tracking entities: {', '.join(TRACKED_ENTITIES.keys())}")
        print("\nConnecting to Wikimedia EventStreams...")
        
        self.running = True
        start_time = time.time()
        
        try:
            # Use requests with stream=True, proper headers, and iter_lines
            with requests.get(self.STREAM_URL, stream=True, headers=self.HEADERS, timeout=30) as response:
                response.raise_for_status()
                print("Connected! Listening for events...\n")
                
                for line in response.iter_lines(decode_unicode=True):
                    # Check if duration exceeded
                    elapsed = time.time() - start_time
                    if elapsed > self.duration_seconds:
                        print("\n\nDuration limit reached.")
                        break
                    
                    if not self.running:
                        break
                    
                    if line:
                        # SSE format: lines starting with "data:" contain the JSON
                        if line.startswith("data:"):
                            json_str = line[5:].strip()  # Remove "data:" prefix
                            if json_str:
                                try:
                                    data = json.loads(json_str)
                                    self.process_event(data)
                                except json.JSONDecodeError:
                                    continue
                        
        except requests.exceptions.Timeout:
            print("\nConnection timeout - retrying might help")
        except requests.exceptions.ConnectionError as e:
            print(f"\nConnection error: {e}")
        except KeyboardInterrupt:
            print("\n\nStream interrupted by user.")
        except Exception as e:
            print(f"\nStream error: {e}")
            import traceback
            traceback.print_exc()
        finally:
            self.running = False
            elapsed = time.time() - start_time
            
            print("\n" + "=" * 60)
            print("STREAM PROCESSING COMPLETE")
            print("=" * 60)
            print(f"Duration: {elapsed:.1f} seconds")
            print(f"Total events processed: {self.events_processed}")
            print(f"Events matched to tracked entities: {self.events_matched}")
    
    def stop(self) -> None:
        """
        Stop the stream processor.
        """
        self.running = False


print("WikiEventStreamProcessor class defined.")

WikiEventStreamProcessor class defined.
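
The `run` method treats each `data:` line as a complete JSON payload. In the server-sent events format, an event may span several `data:` lines and ends at a blank line, and lines starting with `:` are comments. The sketch below shows that framing on an in-memory list of lines; `iter_sse_json` is a hypothetical helper and the sample lines are invented, not captured from the stream.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per complete SSE event."""
    data_parts = []
    for line in lines:
        if line == "":
            # A blank line terminates the current event
            if data_parts:
                payload = "\n".join(data_parts)
                data_parts = []
                try:
                    obj = json.loads(payload)
                except json.JSONDecodeError:
                    continue  # skip events whose payload is not valid JSON
                yield obj
        elif line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[5:]
            # The format allows one optional space after the colon
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch


sample = [
    ": keep-alive",
    "event: message",
    'data: {"type": "edit", "title": "The Godfather"}',
    "",
    "data: not json",
    "",
    'data: {"type": "log"}',
    "",
]
events = list(iter_sse_json(sample))
print(events)
```

An event that is still incomplete when the input ends is dropped, which matches how the format treats an unterminated event.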


## 8. Run the Stream Processor

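Execute the stream processing job. The processor runs for the duration passed as `duration_seconds`: the class default is 120 seconds (2 minutes), and the cell below sets it to 3600 seconds (1 hour). During that time it captures events for our tracked entities.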

**Note:** Wikipedia receives thousands of edits per minute globally, but specific page edits are less frequent. You may need to run for a longer duration to capture events for all tracked entities.

In [None]:
# Create and run the stream processor
processor = WikiEventStreamProcessor(duration_seconds=3600)
processor.run()

Starting stream processing for 3600 seconds...
Tracking entities: Christopher Nolan, The Godfather, Quentin Tarantino, Science Fiction, Academy Awards

Connecting to Wikimedia EventStreams...
Connected! Listening for events...

  ... 1000 events processed, 0 matched
  ... 2000 events processed, 0 matched
  ... 3000 events processed, 0 matched
  ... 4000 events processed, 0 matched
  ... 5000 events processed, 0 matched
  ... 6000 events processed, 0 matched
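
Because the stream carries changes from every Wikimedia project, most events can be discarded before any title matching. The sketch below shows one possible pre-filter. `is_article_edit` is a hypothetical helper; the `type` and `wiki` fields are the ones the processor already reads, while the use of a `namespace` field with `0` meaning the article namespace is an assumption about the event payload that should be checked against the stream's schema.

```python
def is_article_edit(event: dict, wiki: str = "enwiki") -> bool:
    """Keep only edit events from one wiki's article namespace."""
    return (
        event.get("type") == "edit"
        and event.get("wiki") == wiki
        # Assumption: namespace 0 is the main (article) namespace;
        # events without the field are not filtered out on this criterion
        and event.get("namespace", 0) == 0
    )


# Invented sample events for illustration
sample_events = [
    {"type": "edit", "wiki": "enwiki", "namespace": 0, "title": "The Godfather"},
    {"type": "edit", "wiki": "dewiki", "namespace": 0, "title": "Der Pate"},
    {"type": "log", "wiki": "enwiki", "namespace": 0, "title": "The Godfather"},
    {"type": "edit", "wiki": "enwiki", "namespace": 1, "title": "Talk:The Godfather"},
]
kept = [e["title"] for e in sample_events if is_article_edit(e)]
print(kept)  # ['The Godfather']
```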


## 9. View Final Metrics

Display the aggregate metrics collected during the stream processing session.

In [10]:
# Print final metrics
print_metrics()


CURRENT METRICS

Christopher Nolan:
  Total edits: 0
  Human edits: 0
  Bot edits: 0
  Total bytes changed: 0
  Unique users: 0
  Alerts generated: 0
  Last edit: None yet

The Godfather:
  Total edits: 0
  Human edits: 0
  Bot edits: 0
  Total bytes changed: 0
  Unique users: 0
  Alerts generated: 0
  Last edit: None yet

Quentin Tarantino:
  Total edits: 0
  Human edits: 0
  Bot edits: 0
  Total bytes changed: 0
  Unique users: 0
  Alerts generated: 0
  Last edit: None yet

Science Fiction:
  Total edits: 0
  Human edits: 0
  Bot edits: 0
  Total bytes changed: 0
  Unique users: 0
  Alerts generated: 0
  Last edit: None yet

Academy Awards:
  Total edits: 0
  Human edits: 0
  Bot edits: 0
  Total bytes changed: 0
  Unique users: 0
  Alerts generated: 0
  Last edit: None yet


## 10. Examine Stored Data

Load and display the events and alerts stored in the output files.

In [11]:
# Load and display stored events
print("=" * 60)
print("STORED EVENTS")
print("=" * 60)

with open(EVENTS_FILE, 'r') as f:
    events = json.load(f)

print(f"\nTotal events stored: {len(events)}")

if events:
    print("\nSample events:")
    for event in events[:5]:
        print(f"\n  Entity: {event['entity']}")
        print(f"  Title: {event['title']}")
        print(f"  User: {event['user']}")
        print(f"  Size change: {event['size_change']:+d} bytes")
        print(f"  Timestamp: {event['timestamp']}")

STORED EVENTS

Total events stored: 0


In [12]:
# Load and display alerts
print("=" * 60)
print("STORED ALERTS")
print("=" * 60)

with open(ALERTS_FILE, 'r') as f:
    alerts = json.load(f)

print(f"\nTotal alerts stored: {len(alerts)}")

if alerts:
    print("\nAlert details:")
    for alert in alerts:
        print(f"\n  🚨 {alert['alert_type']}")
        print(f"  Entity: {alert['entity']}")
        print(f"  Reason: {alert['alert_reason']}")
        print(f"  User: {alert['user']}")
        print(f"  Timestamp: {alert['timestamp']}")
else:
    print("\nNo alerts were generated during this session.")
    print("This could mean:")
    print("  - No large edits (>500 bytes) occurred")
    print("  - No bot edits were detected")
    print("  - No rapid successive edits happened")

STORED ALERTS

Total alerts stored: 0

No alerts were generated during this session.
This could mean:
  - No large edits (>500 bytes) occurred
  - No bot edits were detected
  - No rapid successive edits happened


## 11. Visualize Event Distribution

Create a simple text-based bar chart of the edits captured per entity.

In [13]:
# Simple text-based visualization of events per entity
print("=" * 60)
print("EVENT DISTRIBUTION")
print("=" * 60)

max_bar_length = 40

# Get counts per entity
entity_counts = {entity: m['total_edits'] for entity, m in metrics.items()}
max_count = max(entity_counts.values()) if any(entity_counts.values()) else 1

print("\nEdits per entity:")
for entity, count in entity_counts.items():
    bar_length = int((count / max_count) * max_bar_length) if max_count > 0 else 0
    bar = "█" * bar_length
    print(f"\n{entity}:")
    print(f"  {bar} ({count} edits)")

EVENT DISTRIBUTION

Edits per entity:

Christopher Nolan:
   (0 edits)

The Godfather:
   (0 edits)

Quentin Tarantino:
   (0 edits)

Science Fiction:
   (0 edits)

Academy Awards:
   (0 edits)


## 12. Export Metrics Summary

Save a summary of the metrics to a JSON file.

In [14]:
# Export metrics summary
METRICS_FILE = os.path.join(OUTPUT_DIR, "metrics_summary.json")

# Convert sets to lists for JSON serialization
metrics_export = {}
for entity, m in metrics.items():
    metrics_export[entity] = {
        "total_edits": m['total_edits'],
        "bot_edits": m['bot_edits'],
        "human_edits": m['human_edits'],
        "total_bytes_changed": m['total_bytes_changed'],
        "unique_users": list(m['unique_users']),
        "unique_user_count": len(m['unique_users']),
        "alert_count": m['alert_count'],
        "last_edit": m['last_edit']
    }

with open(METRICS_FILE, 'w') as f:
    json.dump(metrics_export, f, indent=2, default=str)

print(f"Metrics summary saved to: {METRICS_FILE}")
print("\nSummary:")
print(json.dumps(metrics_export, indent=2, default=str))

Metrics summary saved to: streaming_output/metrics_summary.json

Summary:
{
  "Christopher Nolan": {
    "total_edits": 0,
    "bot_edits": 0,
    "human_edits": 0,
    "total_bytes_changed": 0,
    "unique_users": [],
    "unique_user_count": 0,
    "alert_count": 0,
    "last_edit": null
  },
  "The Godfather": {
    "total_edits": 0,
    "bot_edits": 0,
    "human_edits": 0,
    "total_bytes_changed": 0,
    "unique_users": [],
    "unique_user_count": 0,
    "alert_count": 0,
    "last_edit": null
  },
  "Quentin Tarantino": {
    "total_edits": 0,
    "bot_edits": 0,
    "human_edits": 0,
    "total_bytes_changed": 0,
    "unique_users": [],
    "unique_user_count": 0,
    "alert_count": 0,
    "last_edit": null
  },
  "Science Fiction": {
    "total_edits": 0,
    "bot_edits": 0,
    "human_edits": 0,
    "total_bytes_changed": 0,
    "unique_users": [],
    "unique_user_count": 0,
    "alert_count": 0,
    "last_edit": null
  },
  "Academy Awards": {
    "total_edits": 0,
    "b

## Summary

This notebook implements a complete stream processing pipeline that:

1. **Connects** to the Wikimedia EventStreams API to receive real-time Wikipedia edit events
2. **Filters** events for five IMDB-related entities (Christopher Nolan, The Godfather, Quentin Tarantino, Science Fiction, Academy Awards)
3. **Tracks metrics** including edit counts, bytes changed, unique users, and bot vs. human edits
4. **Generates alerts** for:
   - Large edits (>500 bytes changed)
   - Bot edits
   - Rapid successive edits (potential edit wars)
5. **Stores data** in two separate JSON files:
   - `wiki_events.json` - All captured events
   - `wiki_alerts.json` - Alert events requiring attention

### Output File Structure

```
streaming_output/
├── wiki_events.json      # All tracked events
├── wiki_alerts.json      # Alert events only
└── metrics_summary.json  # Aggregate metrics per entity
```

### Extending This Pipeline

This implementation can be extended to:
- Store data in a database (SQLite, PostgreSQL, MongoDB)
- Send real alerts via email, Slack, or other notification services
- Track additional entities or metrics
- Implement more sophisticated anomaly detection algorithms
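
As a rough illustration of the database option, the sketch below stores event records in SQLite using Python's standard `sqlite3` module. The table name `wiki_events`, the helpers `open_event_store`, `insert_event`, and `edits_per_entity`, and the sample records are all hypothetical; the columns follow the event record fields described in the Overview.

```python
import sqlite3

COLUMNS = ("entity", "timestamp", "user", "is_bot", "title",
           "comment", "size_change", "wiki", "revision_id")


def open_event_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a SQLite database with one table for event records."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS wiki_events (
            entity TEXT NOT NULL,
            timestamp TEXT,
            user TEXT,
            is_bot INTEGER,
            title TEXT,
            comment TEXT,
            size_change INTEGER,
            wiki TEXT,
            revision_id INTEGER
        )
        """
    )
    return conn


def insert_event(conn: sqlite3.Connection, event: dict) -> None:
    """Insert one event record; named placeholders read values from the dict."""
    placeholders = ", ".join(f":{c}" for c in COLUMNS)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            f"INSERT INTO wiki_events ({', '.join(COLUMNS)}) VALUES ({placeholders})",
            event,
        )


def edits_per_entity(conn: sqlite3.Connection) -> list:
    """Return (entity, edit count, total absolute bytes changed) per entity."""
    return conn.execute(
        "SELECT entity, COUNT(*), SUM(ABS(size_change)) "
        "FROM wiki_events GROUP BY entity ORDER BY entity"
    ).fetchall()


# Invented sample records for illustration
base = {"timestamp": "2024-01-01T00:00:00Z", "user": "ExampleUser",
        "is_bot": False, "comment": "", "wiki": "enwiki"}
conn = open_event_store()
insert_event(conn, {**base, "entity": "The Godfather", "title": "The Godfather",
                    "size_change": 120, "revision_id": 1})
insert_event(conn, {**base, "entity": "The Godfather", "title": "The Godfather",
                    "size_change": -30, "revision_id": 2})
insert_event(conn, {**base, "entity": "Academy Awards", "title": "Academy Awards",
                    "size_change": 5, "revision_id": 3})
summary = edits_per_entity(conn)
print(summary)
```

Unlike the JSON files, a table like this lets aggregate metrics be recomputed with a query instead of being tracked in memory during the run.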