# SPARK STREAMING

Apache Spark Streaming is a component of the Apache Spark ecosystem designed for real-time data processing and the analysis of continuous data streams. It lets developers and data analysts process streaming data in a scalable, fault-tolerant, high-throughput way using the familiar Apache Spark programming model. This demo uses the Structured Streaming API (`spark.readStream` and `writeStream`).

To run this demo, follow these steps:

1. Have the Spark image running and the container attached to VS Code.
2. Open a terminal and run `nc -l -k 12345`.
3. Go to the notebook and run all the Spark cells.
4. Return to the terminal and send messages.
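
If `nc` is not available in the container, step 2 can be approximated with a few lines of Python. The sketch below is a stand-in under stated assumptions, not part of the original demo: `open_listener` and `send_lines` are hypothetical helper names, and it serves a single client and then closes the connection, whereas `nc -l -k` keeps listening. Spark's socket source connects to the port as a client, so the listener has to be open before the streaming query starts.

```python
import socket


def open_listener(host="localhost", port=12345):
    """Open a TCP listening socket, like the listening side of `nc -l`."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts on the same port between runs of the demo.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Accept one client (e.g. the Spark socket source) and send it the lines.

    Each line is newline-terminated and UTF-8 encoded, which is the framing
    the socket source reads. The connection is closed afterwards.
    """
    conn, _addr = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
```

For example, `send_lines(open_listener(), ["hello world", "hello spark"])` blocks until a client connects, sends the two lines, and then closes the connection.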

## CONFIGURE SPARK

In [15]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession \
  .builder \
  .appName("unal streaming") \
  .master("local[*]") \
  .getOrCreate()

spark

## CONFIGURE THE SOURCE

In [16]:
streaming_df = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", "12345") \
    .load()

## CONFIGURE THE TRANSFORMATION FUNCTION

In [None]:
# function that applies the transformations
def process_word_count(streaming_df):
    # split each incoming line on spaces and emit one row per word
    words_df = streaming_df.selectExpr("explode(split(value, ' ')) as word")

    # aggregate: count the occurrences of each word
    agg_words_df = words_df \
        .groupBy("word") \
        .agg(count("word").alias("count"))

    # print the schema of the result
    agg_words_df.printSchema()
    return agg_words_df
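
To see what `process_word_count` computes without starting a stream, the same logic can be mirrored in plain Python. This is only an illustrative sketch: `word_counts` is a hypothetical helper, and it assumes each input string plays the role of one `value` row read from the socket source.

```python
from collections import Counter


def word_counts(lines):
    """Plain-Python mirror of explode(split(value, ' ')) + groupBy + count."""
    counts = Counter()
    for value in lines:
        # split(value, ' ') followed by explode: one entry per word
        for word in value.split(" "):
            # groupBy("word").agg(count("word")): tally each word
            counts[word] += 1
    return dict(counts)
```

For instance, `word_counts(["hello world", "hello spark"])` returns `{"hello": 2, "world": 1, "spark": 1}`.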

### STORAGE

To see how to configure other sinks, see:

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
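
As one example of another sink, the sketch below writes the aggregated counts to Spark's in-memory table sink so they can be queried with SQL from the notebook. `start_memory_sink` and the table name `word_counts` are hypothetical names chosen for illustration. The sketch assumes `agg_words_df` is the streaming DataFrame returned by `process_word_count`, and it uses the `complete` output mode, which the memory sink accepts for aggregations.

```python
def start_memory_sink(agg_words_df, table_name="word_counts"):
    """Start a streaming query that keeps its result in an in-memory table.

    `agg_words_df` is assumed to be a streaming aggregation DataFrame.
    Returns the StreamingQuery handle so the caller can stop it later.
    """
    query = (
        agg_words_df.writeStream
        .format("memory")          # in-memory table sink, intended for debugging
        .queryName(table_name)     # name of the table to query with SQL
        .outputMode("complete")    # rewrite the whole result on every trigger
        .start()
    )
    return query
```

After `query = start_memory_sink(agg_words_df)`, the current counts can be inspected with `spark.sql("SELECT * FROM word_counts").show()`, and `query.stop()` ends the stream.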

In [None]:
# read and transform
agg_words_df = process_word_count(streaming_df)

# write to the console; start() returns a StreamingQuery handle
writing_df = agg_words_df.writeStream \
    .format("console") \
    .outputMode("update") \
    .start()

# block here so the query keeps running until it is stopped or fails
writing_df.awaitTermination()
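
The `update` output mode used above prints, on each trigger, only the words whose count changed since the previous trigger. The plain-Python sketch below imitates that behaviour for intuition only. `UpdateModeCounter` and `process_batch` are hypothetical names, and each call stands in for one micro-batch of lines received from the socket.

```python
from collections import Counter


class UpdateModeCounter:
    """Keeps running word counts and reports only the rows that changed."""

    def __init__(self):
        self.totals = Counter()  # running state across micro-batches

    def process_batch(self, lines):
        """Fold one micro-batch into the totals; return only the updated rows."""
        changed = set()
        for value in lines:
            for word in value.split(" "):
                self.totals[word] += 1
                changed.add(word)
        # Update mode: emit the new total for each word touched in this batch.
        return {word: self.totals[word] for word in changed}
```

Feeding `["hello world"]` and then `["hello spark"]` yields `{"hello": 1, "world": 1}` followed by `{"hello": 2, "spark": 1}`: `world` is absent from the second result because its count did not change.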