# Real-time Streaming Playbook

Practical tips to keep a Kafka → Spark → Cassandra pipeline stable on a laptop.

Use this as an operational checklist when your message rate increases.


## 1) Control the input rate (Producer / Kafka)
- If the producer sends too fast, queues grow and everything downstream gets overloaded.
- Options:
  - **Throttle the producer** (sleep / rate-limit)
  - **Batch events** before sending to Kafka
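
As a rough illustration of both options, here is a minimal sketch that groups events into batches and pauses between sends. The `send` callable is a hypothetical stand-in for whatever wraps your Kafka producer; the function names and the fixed-pause strategy are assumptions, not part of the original pipeline.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List


def batched(events: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Group events into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    batch: List[Dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def send_throttled(
    batches: Iterable[List[Dict]],
    send: Callable[[List[Dict]], None],
    max_batches_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` once per batch, pausing after each call to cap the rate."""
    if max_batches_per_second <= 0:
        raise ValueError("max_batches_per_second must be > 0")
    pause = 1.0 / max_batches_per_second
    sent = 0
    for batch in batches:
        send(batch)   # hypothetical: hand the batch to your Kafka producer
        sent += 1
        sleep(pause)  # fixed pause keeps the rate at or below the cap
    return sent
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.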


## 2) Control the Spark streaming rate
Spark runs in micro-batches.

Useful knobs:
- `trigger(processingTime="10 seconds")` to slow down batch frequency
- `maxOffsetsPerTrigger` to limit how many Kafka messages Spark reads per batch


## 3) Use TTL / upserts in Cassandra
- If you do not need to keep all records forever, use **TTL** to prevent unlimited disk usage.
- If you need deduplication, use a stable primary key and write in an idempotent way.
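
A minimal sketch of both ideas, assuming a natural key such as an email address is available. The namespace, the column names `id` and `email`, and the helper names are hypothetical. The `?` placeholders assume the statement is later prepared by a Cassandra driver.

```python
import uuid
from typing import List

# Hypothetical namespace for this pipeline; any fixed UUID works.
USER_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "users.example.com")


def stable_user_id(natural_key: str) -> uuid.UUID:
    """Derive the same UUID every time from the same natural key."""
    return uuid.uuid5(USER_NAMESPACE, natural_key.strip().lower())


def insert_with_ttl_cql(table: str, columns: List[str], ttl_seconds: int) -> str:
    """Build a parameterised CQL INSERT whose rows expire after `ttl_seconds`."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be > 0")
    cols = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return (
        f"INSERT INTO {table} ({cols}) "
        f"VALUES ({placeholders}) USING TTL {ttl_seconds}"
    )
```

Because a Cassandra `INSERT` on an existing primary key overwrites the row, replaying the same event with the same derived key does not create a second row.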


## 4) Checkpointing & restart safety
Always set a `checkpointLocation`.

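- Lets Spark skip batches it has already committed, which limits duplicate processing on restart (writes should still be idempotent, see section 3)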
- Allows Spark to resume from the last committed offsets


```python
# Cassandra connection (write stream)
streaming_query = (selection_df.writeStream
                   .format("org.apache.spark.sql.cassandra")
                   .option("checkpointLocation", "/tmp/checkpoint")
                   .option("keyspace", "spark_streams")
                   .option("table", "created_users")
                   .option("spark.cassandra.connection.host", "cassandra")
                   .option("spark.cassandra.connection.port", "9042")
                   .option("spark.cassandra.connection.local_dc", "datacenter1")
                   .trigger(processingTime="10 seconds")  # one micro-batch every 10 seconds
                   .start())
```

Note: `selection_df` and `spark_conn` refer to the DataFrames/session created in the main ETL notebook. This playbook keeps the code as a reference snippet.


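- `processingTime="10 seconds"` → Spark creates a micro-batch every 10 seconds.

- Each batch collects the Kafka messages that arrived since the previous batch.

- This reduces write requests per second → Cassandra / the database is not overloaded.

##### 2️⃣ Backpressure / Rate limiting

- Limit the number of offsets read per batch (optional; caps the batch size).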

```python
# Read streaming data from Kafka
spark_df = (spark_conn.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:29092")
            .option("subscribe", "users_created")
            .option("startingOffsets", "earliest")
            .option("maxOffsetsPerTrigger", 1000)  # read at most 1000 records per micro-batch
            .load())
```


- Spark reads at most 1000 records from Kafka per micro-batch.
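
Together, the trigger interval and `maxOffsetsPerTrigger` put a ceiling on throughput. The arithmetic can be sketched as below. The function names are hypothetical, and the sketch assumes every batch finishes within its trigger interval.

```python
def max_records_per_second(max_offsets_per_trigger: int, trigger_seconds: float) -> float:
    """Upper bound on the read rate when one capped batch runs per trigger."""
    if trigger_seconds <= 0:
        raise ValueError("trigger_seconds must be > 0")
    return max_offsets_per_trigger / trigger_seconds


def seconds_to_drain(
    backlog: int,
    incoming_per_second: float,
    max_offsets_per_trigger: int,
    trigger_seconds: float,
) -> float:
    """Time needed to clear a backlog, or infinity if input meets or exceeds the cap."""
    capacity = max_records_per_second(max_offsets_per_trigger, trigger_seconds)
    spare = capacity - incoming_per_second
    if spare <= 0:
        return float("inf")  # the backlog never shrinks
    return backlog / spare
```

With the values used in this playbook (1000 offsets, 10-second trigger) the ceiling is 100 records per second. If the producer is faster than that, Kafka lag keeps growing and one of the knobs has to change.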

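##### 3️⃣ Upsert / TTL in Cassandra

- Records that do not need to be kept forever → set a TTL (Time-to-Live).

- Write as upserts rather than blind inserts, so key conflicts overwrite the existing row.

- This prevents unbounded data accumulation.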

##### 4️⃣ Checkpoint & Restart Safety

- `checkpointLocation` lets Spark record which batches it has already processed.

- This matters when streaming in real time into a DB.

- Spark can resume after a crash or restart.

##### 5️⃣ Partitioning / Parallelism

- Cassandra distributes writes by partition.

- Align Spark partitions with the Cassandra partition key → avoids hotspots.

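##### 💡 Summary:

- Do not stream "every record" as fast as it arrives.

- Use the trigger interval + `maxOffsetsPerTrigger` to control the rate.

- Use checkpointing + TTL in Cassandra.

- Design the schema and partitioning for high throughput.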

### **Data Quality Checks (DQC) / Data Validation**

- Real-time ETL focuses on transforms; it does not do full EDA + data quality analysis the way batch does.

##### 🔹 Batch ETL + EDA

- EDA covers missing values, distributions, outliers, and correlations.

- It works on the whole dataset.

##### 🔹 Real-time ETL + Transform

- No full EDA on the data stream.

- Use Data Quality Checks (DQC) / Data Validation in real time instead.
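
A real-time check usually means validating each record as it passes through, and routing failures aside instead of stopping the stream. The sketch below shows one way to do that in plain Python. The field names (`id`, `email`, `first_name`) and the rules are hypothetical examples, not the schema of the main ETL notebook.

```python
import re
from typing import Dict, List, Tuple

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
REQUIRED_FIELDS = ("id", "email", "first_name")  # hypothetical schema


def validate_record(record: Dict[str, object]) -> List[str]:
    """Return a list of rule violations; an empty list means the record is valid."""
    errors: List[str] = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            errors.append(f"missing:{field}")
    email = record.get("email")
    if isinstance(email, str) and email.strip() and not EMAIL_RE.match(email.strip()):
        errors.append("invalid:email")
    return errors


def split_valid_invalid(records: List[Dict[str, object]]) -> Tuple[list, list]:
    """Separate clean records from rejected ones, keeping the reasons."""
    valid, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, rejected
```

The rejected list can be written to a separate table or topic for later inspection, which keeps bad rows out of the main sink without losing them.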

### **API Streaming**

#### 🔹 API Streaming: 3 types

**1. Long-lived HTTP connection (HTTP Streaming / SSE, Server-Sent Events)**

- One request → the connection stays open → the server pushes data continuously.

- Examples: Twitter Streaming API, stock market feeds, IoT sensor data.

- In Python/Java, open the connection and loop over it, reading record by record.
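
The "loop over the connection" step can be sketched as a small parser over the text lines of an SSE response. This is a simplified reading of the SSE wire format that only extracts `data:` fields. The function name is hypothetical, and the line source is left abstract so the sketch does not depend on any HTTP library.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in an SSE line stream."""
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # drop the single optional space after the colon
        if field == "data":
            data_parts.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.
```

Each yielded payload can then be sent to Kafka as one record. An event that is still open when the stream ends is dropped, because it never received its terminating blank line.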

**2. WebSocket API**

- Two-way connection → both client and server can send messages.

- Used for real-time chat, crypto price feeds, game servers.

- Libraries: `websockets` (Python), `socket.io`.

**3. Polling API (simulate streaming)**

- Not true streaming → the client calls a REST API repeatedly (e.g. every 1 or 10 seconds).

- The "stream" is really a series of micro-batches.

- Can be scheduled with Airflow or cron jobs.
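
Polling tends to return records that were already seen in an earlier call, so a polling loop normally needs de-duplication. A minimal sketch, assuming each record carries a unique `id` field and that `fetch` is a hypothetical function wrapping the REST call:

```python
import time
from typing import Callable, Dict, Iterator, List


def poll_new_records(
    fetch: Callable[[], List[Dict]],
    interval_seconds: float,
    max_polls: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[Dict]]:
    """Call `fetch` repeatedly and yield only records not seen before."""
    seen_ids = set()
    for poll_number in range(max_polls):
        fresh = []
        for record in fetch():
            if record["id"] not in seen_ids:
                seen_ids.add(record["id"])
                fresh.append(record)
        if fresh:
            yield fresh  # one micro-batch of new records
        if poll_number < max_polls - 1:
            sleep(interval_seconds)  # wait before the next poll
```

The `seen_ids` set grows without bound in this sketch. A long-running job would need to expire old ids or rely on an idempotent sink instead.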

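#### 🔹 REST API vs Streaming API

- **REST API (request/response)** → the client has to ask for data each time
  - → Not real-time; needs polling (repeated calls), e.g. every 1 or 5 minutes
  - → Fits Airflow, because Airflow is a scheduler for batch/micro-batch jobs

- **Streaming API (e.g. WebSocket, gRPC stream, SSE, MQTT)** → connect once and the server keeps pushing data, like a sensor feed
  - → Not a good fit for Airflow, which is not meant to run an endless loop the way a Kafka consumer does
  - → Use Spark Structured Streaming, Flink, a Kafka consumer, or a custom daemon script instead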

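##### ✅ Common patterns

**1. REST API + Airflow**

- An Airflow operator calls the API on a fixed schedule (e.g. every 1 minute).

- Results are written to storage (e.g. Cassandra, BigQuery, S3, Postgres).

- This is effectively micro-batch.

**2. Streaming API + Spark/Kafka**

- Spark Structured Streaming or a WebSocket client ingests the events → pushes them to Kafka → a consumer writes them to the DB.

- This is truly real-time.

**3. Hybrid**

- If the API does not support streaming → use polling (Airflow, or Spark Structured Streaming + `foreachBatch()`).

- If the API supports streaming → use Spark/Kafka, and schedule the downstream jobs with Airflow.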

## 5) Monitoring (what to look at)
- **Kafka Control Center:** topic throughput, lag, consumer group health
- **Airflow:** DAG runs are green and stable
- **Cassandra:** row counts and disk usage (TTL should keep data bounded)
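
Consumer lag is the gap between the newest offset in each partition and the offset the consumer has reached. A minimal sketch of that calculation follows. The function name and the plain-dict inputs are assumptions; in practice the numbers would come from your monitoring tool or Kafka client.

```python
from typing import Dict


def consumer_lag(
    end_offsets: Dict[int, int],
    committed_offsets: Dict[int, int],
) -> Dict[int, int]:
    """Per-partition lag: messages written to Kafka but not yet consumed."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }
```

A total that keeps rising across checks suggests the input rate is above what the Spark job can read per trigger.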

Quick Cassandra check (from terminal):
```bash
docker compose exec cassandra_db cqlsh
```
```sql
USE spark_streams;
SELECT COUNT(*) FROM created_users;
```
