# NB1.5 — Streaming vs Batch (WebSockets & Polling)

In this exercise we implement **streaming** with WebSockets and **frequent batch** ingestion with HTTP polling. Everything is saved as **JSON Lines** under `../../data/raw/`, and we then compare latency and continuity.

**Objectives**
- Understand the key differences between **batch** and **streaming**.
- Run a **WebSocket** client for continuous ingestion.
- Run **HTTP polling** as a micro-batch.
- Compare the results in pandas.


## 1) Dependencies
We install the libraries required by the scripts in `src/streaming/`.

In [11]:
!pip install -q websockets requests pandas

## 2) Streaming (WebSocket)

**Output:** `../../data/raw/stream_ws_YYYY-MM-DD.jsonl`

In [12]:
# Cap the duration and number of events for the class session
!WS_MAX_EVENTS=150 WS_MAX_SECONDS=120 python ../../src/streaming/stream_dual_ws.py

  from websockets.exceptions import ConnectionClosed, InvalidStatusCode
[DONE] Eventos: 150
/Users/didiergamboa/Documents/Tecnológico de Software/Cuatrimestres/2025 Q3 Septiembre-Diciembre/Fundamentos de Big Data/Proyecto Integrador - Fundamentos de Ingeniería de Datos/fundamentos-ingenieria-datos/data/raw/stream_ws_2025-09-24.jsonl
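
The run above stops after 150 events or 120 seconds, whichever comes first. The source of `stream_dual_ws.py` is not shown here, so the following is only a minimal sketch of how such caps might be applied while appending to a JSON Lines file. The `consume` function, its default values, and the `fake_trades` generator are hypothetical stand-ins; only the `WS_MAX_EVENTS` and `WS_MAX_SECONDS` variable names come from the command above.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def consume(events, out_path, max_events=150, max_seconds=120.0, clock=time.monotonic):
    """Append events to a JSON Lines file until either cap is reached.

    The time cap is only checked when an event arrives, so a silent feed
    would need an extra receive timeout in a real client.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    started = clock()
    written = 0
    with out_path.open("a", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")
            fh.flush()  # keep the file readable while the stream is still live
            written += 1
            if written >= max_events or clock() - started >= max_seconds:
                break
    return written


def fake_trades():
    """Hypothetical stand-in for a WebSocket feed: yields dummy events forever."""
    seq = 0
    while True:
        seq += 1
        yield {"seq": seq, "source": "demo"}


# Read the caps from the environment, as the notebook command does (defaults assumed).
env_max_events = int(os.environ.get("WS_MAX_EVENTS", "150"))
env_max_seconds = float(os.environ.get("WS_MAX_SECONDS", "120"))

# Small demo with an explicit cap so it finishes immediately.
demo_file = Path(tempfile.mkdtemp()) / "stream_demo.jsonl"
count = consume(fake_trades(), demo_file, max_events=5, max_seconds=10.0)
print("[DONE] Events:", count)
```

Passing the clock in as a parameter keeps the loop easy to test; in a real client the `events` iterable would be replaced by messages received from the socket.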


## 3) Polling (HTTP)

**Output:** `../../data/raw/poll_binance_BTCUSDT_YYYY-MM-DD.jsonl`

In [13]:
!python ../../src/streaming/poll_coincap_http.py

[POLL] 1/20 -> {'ts': '2025-09-24T23:18:46.063799+00:00', 'source': 'binance', 'instrument': 'BTCUSDT', 'price_usd': 113295.79}
[POLL] 2/20 -> {'ts': '2025-09-24T23:18:51.335223+00:00', 'source': 'binance', 'instrument': 'BTCUSDT', 'price_usd': 113295.79}
[POLL] 3/20 -> {'ts': '2025-09-24T23:18:56.606342+00:00', 'source': 'binance', 'instrument': 'BTCUSDT', 'price_usd': 113295.79}
[POLL] 4/20 -> {'ts': '2025-09-24T23:19:01.858261+00:00', 'source': 'binance', 'instrument': 'BTCUSDT', 'price_usd': 113295.8}
[POLL] 5/20 -> {'ts': '2025-09-24T23:19:07.123940+00:00', 'source': 'binance', 'instrument': 'BTCUSDT', 'price_usd': 113295.79}
[POLL] 6/20 -> {'ts': '2025-09-24T23:19:12.398440+00:00', 'source': 'binance', 'instrument': 'BTCUSDT', 'price_usd': 113288.0}
[POLL] 7/20 -> {'ts': '2025-09-24T23:19:17.669882+00:00', 'source': 'binance', 'instrument': 'BTCUSDT', 'price_usd': 113282.35}
[POLL] 8/20 -> {'ts': '2025-09-24T23:19:22.946203+00:00', 'source': 'binance', 'instrument': 'BTCUSDT', 'p
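
The log above shows 20 polls spaced a little over 5 seconds apart, which is consistent with a fixed sleep plus the time each request takes. The source of `poll_coincap_http.py` is not shown here, so this is only a sketch of that micro-batch shape under those assumptions. The `poll` function, its parameters, and `fake_fetch` are hypothetical; a real version would call an HTTP endpoint inside `fetch`.

```python
import json
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path


def poll(fetch, out_path, n_polls=20, interval_s=5.0, sleep=time.sleep):
    """Call fetch() n_polls times, interval_s apart, appending one JSON line per call."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    rows = []
    with out_path.open("a", encoding="utf-8") as fh:
        for i in range(1, n_polls + 1):
            # Timestamp each sample at collection time, in UTC.
            row = {"ts": datetime.now(timezone.utc).isoformat(), **fetch()}
            fh.write(json.dumps(row) + "\n")
            fh.flush()
            rows.append(row)
            print(f"[POLL] {i}/{n_polls} -> {row}")
            if i < n_polls:
                sleep(interval_s)  # no need to wait after the last sample
    return rows


def fake_fetch():
    """Hypothetical stand-in for the HTTP request; returns a dummy quote."""
    return {"source": "demo", "instrument": "DEMO", "price_usd": 1.0}


# Demo: record the requested sleeps instead of actually waiting.
naps = []
poll_demo_file = Path(tempfile.mkdtemp()) / "poll_demo.jsonl"
demo_rows = poll(fake_fetch, poll_demo_file, n_polls=3, interval_s=5.0, sleep=naps.append)
```

Because the loop sleeps for a fixed interval after each request, the effective spacing is the interval plus the request time, and any price change between two samples is never observed.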

## 4) Load and compare in pandas
We read both `.jsonl` files and look at the (approximate) differences in granularity and latency.

In [17]:
from pathlib import Path
import pandas as pd

data_dir = Path("../../data/raw")

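# STREAM: file names end in YYYY-MM-DD, so the last name in sorted order is the newest
stream_file = sorted(data_dir.glob("stream_ws_*.jsonl"))[-1]
df_stream = pd.read_json(stream_file, lines=True)  # one JSON object per line
print("STREAM:", stream_file)
display(df_stream.head())

# POLL: two prefixes are accepted, so pick the newest file by modification time
# (sorting by name would always rank poll_coincap_* after poll_binance_*)
poll_files = list(data_dir.glob("poll_binance_*.jsonl")) + list(data_dir.glob("poll_coincap_*.jsonl"))
poll_file = max(poll_files, key=lambda p: p.stat().st_mtime)
df_poll = pd.read_json(poll_file, lines=True)
print("POLL:", poll_file)
display(df_poll.head())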


STREAM: ../../data/raw/stream_ws_2025-09-24.jsonl


| | _probe | ts | source | instrument | price | currency | qty | trade_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2025-09-24T22:53:38.112231+00:00 | | | | | | |
| 1 | | 2025-09-24T22:53:39.233277+00:00 | binance | BTCUSDT | 113294.46 | USDT | 0.0001 | 5254424000.0 |
| 2 | | 2025-09-24T22:53:40.057167+00:00 | binance | BTCUSDT | 113294.45 | USDT | 0.01 | 5254424000.0 |
| 3 | | 2025-09-24T22:53:40.636322+00:00 | binance | BTCUSDT | 113294.45 | USDT | 0.00778 | 5254424000.0 |
| 4 | | 2025-09-24T22:53:40.679651+00:00 | binance | BTCUSDT | 113294.45 | USDT | 0.00066 | 5254424000.0 |


POLL: ../../data/raw/poll_binance_BTCUSDT_2025-09-24.jsonl


| | ts | source | instrument | price_usd |
|---|---|---|---|---|
| 0 | 2025-09-24T23:18:46.063799+00:00 | binance | BTCUSDT | 113295.79 |
| 1 | 2025-09-24T23:18:51.335223+00:00 | binance | BTCUSDT | 113295.79 |
| 2 | 2025-09-24T23:18:56.606342+00:00 | binance | BTCUSDT | 113295.79 |
| 3 | 2025-09-24T23:19:01.858261+00:00 | binance | BTCUSDT | 113295.8 |
| 4 | 2025-09-24T23:19:07.123940+00:00 | binance | BTCUSDT | 113295.79 |
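
One way to put a number on the difference in granularity is to measure the gap between consecutive timestamps in each file. The sketch below uses only the rows printed above as a small inline sample (the stream probe row is left out); `gap_summary` is a hypothetical helper, and on the full data it could be applied to `df_stream` and `df_poll` instead.

```python
import pandas as pd


def gap_summary(df, ts_col="ts"):
    """Summarize the spacing between consecutive events, in seconds."""
    ts = pd.to_datetime(df[ts_col], utc=True).sort_values()
    gaps = ts.diff().dt.total_seconds().dropna()
    return {
        "events": int(len(ts)),
        "median_gap_s": float(gaps.median()),
        "max_gap_s": float(gaps.max()),
    }


# Trade rows from the stream preview above (probe row omitted).
stream_sample = pd.DataFrame({"ts": [
    "2025-09-24T22:53:39.233277+00:00",
    "2025-09-24T22:53:40.057167+00:00",
    "2025-09-24T22:53:40.636322+00:00",
    "2025-09-24T22:53:40.679651+00:00",
]})

# Rows from the poll preview above.
poll_sample = pd.DataFrame({"ts": [
    "2025-09-24T23:18:46.063799+00:00",
    "2025-09-24T23:18:51.335223+00:00",
    "2025-09-24T23:18:56.606342+00:00",
    "2025-09-24T23:19:01.858261+00:00",
    "2025-09-24T23:19:07.123940+00:00",
]})

stream_stats = gap_summary(stream_sample)
poll_stats = gap_summary(poll_sample)
print("STREAM:", stream_stats)
print("POLL:  ", poll_stats)
```

These gaps describe how often each method records a sample, not end-to-end latency against the exchange, so they should be read as an approximation.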


## 5) Reflection (short answers)
**Which approach has the lower latency?**

**What happens if the stream drops?**

**Which one produces more duplicates or time gaps?**
