# Windows example 1
First, we create the session as in the previous examples:


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

# Connect to the standalone cluster and enable event logging to HDFS
spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("StructuredWordCount") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "hdfs:///spark/logs/history") \
    .config("spark.history.fs.logDirectory", "hdfs:///spark/logs/history") \
    .getOrCreate()




Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In this example we will again read the data from a socket. Before anything else, we start the socket with the following command:
- nc -lk 9999

Next, we start the read *stream*, this time enabling the *includeTimestamp* option.

In [2]:
df_lineas = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", "9999") \
    .option("includeTimestamp", "true") \
    .load()

df_lineas.printSchema()

25/05/05 14:16:35 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.


root
 |-- value: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)



As in the original socket example, we are going to do a word count; the difference is that here we also include the *timestamp*.

In [3]:
from pyspark.sql.functions import explode, split
df_palabras = df_lineas.select(
    explode(split(df_lineas.value, ' ')).alias('palabra'),
    df_lineas.timestamp)
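
To make the effect of `split` followed by `explode` concrete, the following is a minimal pure-Python sketch of the same transformation. It is not Spark code: the function name `explode_words` and the sample row (including its timestamp) are hypothetical and only serve as an illustration.

```python
from datetime import datetime

def explode_words(rows):
    """Mimic explode(split(value, ' ')): one output row per word, keeping the timestamp."""
    result = []
    for value, timestamp in rows:
        for palabra in value.split(' '):
            result.append((palabra, timestamp))
    return result

# Hypothetical input: (value, timestamp) pairs, shaped like the rows of df_lineas
rows = [("hola mundo", datetime(2025, 5, 5, 14, 19, 24))]
palabras = explode_words(rows)
print(palabras)
```

Each word becomes its own row and inherits the timestamp of the line it came from. Note that splitting on a single space yields empty strings when a line contains consecutive spaces, so such lines would produce an empty "word" in the count.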

We create a fixed (tumbling) window of two minutes. This groups the data, word and count, into fixed periods of time.

In [5]:
from pyspark.sql.functions import window
windowed_counts = df_palabras.groupBy(
    window(df_palabras.timestamp, "2 minutes"), df_palabras.palabra
).count().orderBy('window')

windowed_counts.printSchema()

root
 |-- window: struct (nullable = false)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- palabra: string (nullable = false)
 |-- count: long (nullable = false)
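
How a timestamp ends up in a given two-minute window can be sketched in plain Python. The sketch assumes that windows are aligned to 1970-01-01 00:00:00 (the default when `window` is called without a `startTime` offset) and ignores time zones; `tumbling_window` is a hypothetical helper name, and the sample timestamp is only illustrative.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def tumbling_window(ts, size):
    """Return (start, end) of the fixed window of length `size` that contains `ts`."""
    offset = (ts - EPOCH) % size   # time elapsed since the window began
    start = ts - offset
    return start, start + size

start, end = tumbling_window(datetime(2025, 5, 5, 14, 19, 24), timedelta(minutes=2))
print(start, end)  # 2025-05-05 14:18:00 2025-05-05 14:20:00
```

The start of the window is inclusive and the end is exclusive, so in this sketch a timestamp of exactly 14:20:00 belongs to the window that starts at 14:20:00.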



We run the query with the console as the *sink*. If the notebook does not refresh the output correctly, the results can be viewed with the following command:
- docker logs -f jupyter-notebook

In [6]:
query = windowed_counts \
          .writeStream \
          .outputMode("complete") \
          .format("console") \
          .queryName("consulta1") \
          .option("truncate","false") \
          .start()

25/05/05 14:19:24 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-7b82dda8-716d-46e6-b69f-abb807a6b5c8. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/05/05 14:19:24 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------+-----+
|window|palabra|count|
+------+-------+-----+
+------+-------+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-------+-----+
|window                                    |palabra|count|
+------------------------------------------+-------+-----+
|{2025-05-05 14:18:00, 2025-05-05 14:20:00}|hola   |1    |
|{2025-05-05 14:18:00, 2025-05-05 14:20:00}|mundo  |1    |
+------------------------------------------+-------+-----+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+--------+-----+
|window                                    |palabra |count|
+------------------------------------------+--------+-----+
|{2025-05-05 14:18:00, 2025-05-05 14:20:00}|hola    |2    |
|{2025-05-05 14:18:00, 2025-05-05 14:20:00}|mundo   |1    |
|{2025-05-05 14:18:00, 2025-05-05 14:20:00}|caracola|1    |
+------------------------------------------+--------+-----+



                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+------------------------------------------+---------+-----+
|window                                    |palabra  |count|
+------------------------------------------+---------+-----+
|{2025-05-05 14:18:00, 2025-05-05 14:20:00}|hola     |2    |
|{2025-05-05 14:18:00, 2025-05-05 14:20:00}|mundo    |1    |
|{2025-05-05 14:18:00, 2025-05-05 14:20:00}|caracola |1    |
|{2025-05-05 14:20:00, 2025-05-05 14:22:00}|o        |1    |
|{2025-05-05 14:20:00, 2025-05-05 14:22:00}|hola     |1    |
|{2025-05-05 14:20:00, 2025-05-05 14:22:00}|dende    |1    |
|{2025-05-05 14:20:00, 2025-05-05 14:22:00}|instituto|1    |
+------------------------------------------+---------+-----+
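
The way the batches above grow can be reproduced with a small pure-Python simulation of the *complete* output mode: after every micro-batch the whole result table is emitted again, not just the rows that changed. This is only a sketch under assumptions: the input lines and their arrival times are hypothetical (the output above shows only the resulting words), and `window_start` and `run_complete_mode` are illustrative names, not Spark APIs.

```python
from collections import Counter
from datetime import datetime, timedelta

SIZE = timedelta(minutes=2)
EPOCH = datetime(1970, 1, 1)

def window_start(ts):
    """Start of the two-minute window containing ts (windows aligned to the epoch)."""
    return ts - (ts - EPOCH) % SIZE

def run_complete_mode(batches):
    """Return the full result table after each batch, sorted by window."""
    counts = Counter()          # state kept across batches: (window start, word) -> count
    tables = []
    for batch in batches:
        for line, ts in batch:
            for palabra in line.split(' '):
                counts[(window_start(ts), palabra)] += 1
        # Complete mode: emit every row seen so far, ordered by window
        tables.append(sorted(counts.items(), key=lambda kv: kv[0][0]))
    return tables

# Hypothetical micro-batches: one line typed into nc per batch
batches = [
    [("hola mundo", datetime(2025, 5, 5, 14, 19, 30))],
    [("hola caracola", datetime(2025, 5, 5, 14, 19, 50))],
    [("hola dende o instituto", datetime(2025, 5, 5, 14, 20, 10))],
]
tables = run_complete_mode(batches)
for (start, palabra), count in tables[-1]:
    print(start, start + SIZE, palabra, count)
```

In the last table, `hola` appears twice: with a count of 2 in the 14:18 window and a count of 1 in the 14:20 window, because counts are kept per window and per word. The order of rows inside a window is not significant. In Spark, sorting a streaming aggregation with `orderBy` is only supported in the *complete* output mode, which is why the query uses it.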

