<div style="display: flex; align-items: center; gap: 18px; margin-bottom: 15px;">
  <img src="https://files.codebasics.io/v3/images/sticky-logo.svg" alt="Codebasics Logo" style="display: inline-block;" width="130">
  <h1 style="font-size: 34px; color: #1f4e79; margin: 0; display: inline-block;">Codebasics Practice Room - Data Engineering Bootcamp </h1>
</div>


#### ⏱️ Sliding Window Metrics with Structured Streaming

This notebook demonstrates how to compute a **rolling “last 30 minutes error count”**
from application logs using **Spark Structured Streaming**.

We simulate real-time ingestion by converting an existing large CSV file
into multiple small files and reading them as a **stream**.


## 📂 Dataset

**Dataset Name:** `app_logs_large.csv`  


### Columns:
- `event_time`
- `level`
- `service`
- `message`

> ⚠️ In production systems, logs typically arrive continuously through a message broker such as **Kafka**.  
> For learning purposes, we simulate streaming using files.
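
The file-based simulation can be illustrated in plain Python. This is a minimal sketch of the idea only: `split_csv`, the `part-00000.csv` naming, and `rows_per_file` are hypothetical, and on Databricks the notebook would more likely write the small files with Spark itself.

```python
import csv
from pathlib import Path


def _write_part(out_dir, index, header, rows):
    """Write one small CSV file, repeating the header so each file stands alone."""
    path = out_dir / f"part-{index:05d}.csv"
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def split_csv(src_path, out_dir, rows_per_file=1000):
    """Split one large CSV into numbered part files inside out_dir.

    Each part file plays the role of one batch of newly arrived log records.
    Returns the list of paths written, in order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # first row holds the column names
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                written.append(_write_part(out, len(written), header, chunk))
                chunk = []
        if chunk:  # leftover rows that did not fill a whole file
            written.append(_write_part(out, len(written), header, chunk))
    return written
```

Calling `split_csv("app_logs_large.csv", "stream_in", rows_per_file=5000)` would fill a `stream_in` directory (an assumed name) that a streaming reader can then watch.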


## 🗂️ Scenario

The business wants a **real-time operational metric**:

> **“How many ERROR logs occurred in the last 30 minutes?”**

Requirements:
- Metric should update continuously
- Must be based on **event time**
- Must handle **late-arriving events**
- Should scale as log volume grows

You are asked to implement this using **Spark Structured Streaming**.

---

## 🎯 Task

1. Read historical log data as batch
2. Convert it into a streaming-friendly format
3. Read logs as a streaming DataFrame
4. Filter ERROR logs
5. Apply a **30-minute sliding window** (slide every 1 minute)
6. Add a watermark for late data
7. Output rolling error counts
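
To see what a 30-minute window sliding every 1 minute means for a single event, the plain-Python sketch below lists every window that contains a given timestamp. It assumes half-open `[start, end)` windows aligned to the Unix epoch, which is how Spark's `window` function behaves by default; `sliding_windows` is a hypothetical helper for illustration, not part of Spark.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def sliding_windows(event_time,
                    window=timedelta(minutes=30),
                    slide=timedelta(minutes=1)):
    """Return the [start, end) windows containing event_time, oldest first.

    event_time must be timezone-aware so it can be compared with EPOCH.
    """
    # Latest window start at or before the event, aligned to the slide interval.
    start = event_time - (event_time - EPOCH) % slide
    windows = []
    # Step backwards one slide at a time while the window still covers the event.
    while start + window > event_time:
        windows.append((start, start + window))
        start -= slide
    windows.reverse()
    return windows


event = datetime(2024, 1, 1, 10, 7, 30, tzinfo=timezone.utc)
windows = sliding_windows(event)
print(len(windows))   # 30
print(windows[0])     # window 09:38:00 to 10:08:00
print(windows[-1])    # window 10:07:00 to 10:37:00
```

Each ERROR event therefore contributes to 30 overlapping windows, which is why sliding-window state grows faster than tumbling-window state.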

---

## 🧩 Assumptions

- Each log record carries its own event timestamp (`event_time`)
- Data may arrive late
- Databricks **Serverless compute** is used
- Unity Catalog storage is available

---

## 📦 Deliverables

- Streaming aggregation computing rolling error counts
- Output grouped by:
  - window
  - service

### Expected Columns

| window.start | window.end | service | error_count |
|--------------|------------|---------|-------------|

---

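## 🧠 Notes
- A file-based stream reads a **directory**, not a single file
- Each new file in that directory is treated as new streaming data
- Watermarks bound how long Spark waits for late data before finalizing a window
- Sliding windows produce overlapping windows, which is what makes a rolling metric possible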
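
The watermark rule can be modelled in a few lines of plain Python. This is a simplified sketch: Spark only advances the watermark between micro-batches, and the 10-minute delay is an assumed value, since this notebook does not state one. Both function names are hypothetical.

```python
from datetime import datetime, timedelta


def watermark(max_event_time_seen, delay):
    """Watermark = latest event time seen so far minus the allowed lateness."""
    return max_event_time_seen - delay


def window_still_open(window_end, max_event_time_seen, delay):
    """A window accepts late events until the watermark reaches its end."""
    return window_end > watermark(max_event_time_seen, delay)


delay = timedelta(minutes=10)            # assumed lateness threshold
latest = datetime(2024, 1, 1, 10, 40)    # newest event_time seen so far

# The watermark is 10:30, so a window ending at 10:30 is finalized,
# while a window ending at 10:31 can still absorb late events.
print(window_still_open(datetime(2024, 1, 1, 10, 30), latest, delay))  # False
print(window_still_open(datetime(2024, 1, 1, 10, 31), latest, delay))  # True
```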



## 🧠 Solution Strategy (High-Level)

1. Read existing logs as a batch DataFrame
2. Split batch data into multiple small files
3. Read those files using `readStream`
4. Filter ERROR events
5. Apply sliding window aggregation
6. Write streaming results using a serverless-safe trigger

Spark handles:
- incremental processing
- stateful window aggregation
- fault tolerance using checkpoints
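
Steps 3 to 6 could be sketched in PySpark roughly as follows. Several details are assumptions rather than facts from this notebook: the column types in `LOG_SCHEMA`, the 10-minute watermark delay, `availableNow` as the serverless-safe trigger, and a table sink. The function names and all paths are hypothetical.

```python
# Assumed column types; the notebook only names the columns.
LOG_SCHEMA = "event_time TIMESTAMP, level STRING, service STRING, message STRING"
WINDOW_DURATION = "30 minutes"
SLIDE_DURATION = "1 minute"
WATERMARK_DELAY = "10 minutes"  # assumed lateness threshold


def rolling_error_counts(spark, input_dir):
    """Build the streaming aggregation: ERROR count per sliding window and service."""
    # Imported here so that defining this function does not itself require pyspark.
    from pyspark.sql import functions as F

    logs = (
        spark.readStream
        .schema(LOG_SCHEMA)          # file streams need an explicit schema
        .option("header", "true")
        .csv(input_dir)              # a directory of small CSV files
    )
    return (
        logs
        .filter(F.col("level") == "ERROR")
        .withWatermark("event_time", WATERMARK_DELAY)
        .groupBy(
            F.window("event_time", WINDOW_DURATION, SLIDE_DURATION),
            "service",
        )
        .agg(F.count("*").alias("error_count"))
    )


def start_query(counts, checkpoint_dir, table_name):
    """Write finalized windows to a table, processing all available files once."""
    return (
        counts.writeStream
        .outputMode("append")                         # emit a window once the watermark passes its end
        .option("checkpointLocation", checkpoint_dir)
        .trigger(availableNow=True)                   # process the backlog, then stop
        .toTable(table_name)
    )


# Hypothetical usage inside the notebook, where `spark` already exists:
# counts = rolling_error_counts(spark, "/Volumes/<catalog>/<schema>/<volume>/stream_in")
# query = start_query(counts, "/Volumes/<catalog>/<schema>/<volume>/_checkpoint", "error_counts")
```

The resulting `window` column is a struct, so `window.start` and `window.end` are available as fields. In append mode the most recent windows are not emitted until later events push the watermark past their end.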
