# File example
1. Create the SparkSession.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split
import string

# Build (or reuse) a session against the standalone cluster master.
# Event logging is enabled so the run shows up in the history server.
spark = SparkSession.builder \
  .master("spark://spark-master:7077") \
  .appName("arquivo-example-1") \
  .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
  .config("spark.eventLog.enabled", "true") \
  .config("spark.eventLog.dir", "hdfs:///spark/logs/history") \
  .config("spark.history.fs.logDirectory", "hdfs:///spark/logs/history") \
  .getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


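2. Obtain the schema and define the data source for stream processing, in this case a folder of *JSON* files. Streaming file sources need an explicit schema by default, so we infer it once with a static read of the same folder. Setting *maxFilesPerTrigger* to 1 makes each micro-batch read a single file.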

In [2]:
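path = "/user/jovyan/data/flight-data/json"
# Static read, used only to infer the schema.
static = spark.read.json(path)
dataSchema = static.schema
# Streaming read of the same folder, at most one file per micro-batch.
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1) \
  .json(path)
streaming.printSchema()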

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)
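
To make the effect of *maxFilesPerTrigger* concrete, the following pure-Python sketch groups a list of file names into micro-batches in the way the option caps each trigger. It does not use Spark: *plan_micro_batches* and the file names are hypothetical, and the real file source decides the processing order by its own rules.

```python
from typing import Iterator, List


def plan_micro_batches(files: List[str], max_files_per_trigger: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `max_files_per_trigger` files."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


# Hypothetical file names, for illustration only.
files = ["a.json", "b.json", "c.json"]
batches = list(plan_micro_batches(files, 1))
print(batches)
```

With a limit of 1, every file becomes its own micro-batch, which is why the results in this example grow file by file.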


3. Build the output DataFrame by transforming the input data. Here we group by *DEST_COUNTRY_NAME* and count the rows in each group.

In [3]:
counts = streaming.groupBy("DEST_COUNTRY_NAME").count()
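
Note that counting after a group-by counts rows per group; it does not add up the existing *count* column. The plain-Python sketch below shows the difference on a few made-up rows (the values are hypothetical and not taken from the dataset).

```python
from collections import Counter

# Hypothetical rows with the same three fields as the schema above.
rows = [
    {"DEST_COUNTRY_NAME": "A", "ORIGIN_COUNTRY_NAME": "X", "count": 10},
    {"DEST_COUNTRY_NAME": "A", "ORIGIN_COUNTRY_NAME": "Y", "count": 5},
    {"DEST_COUNTRY_NAME": "B", "ORIGIN_COUNTRY_NAME": "X", "count": 7},
]

# Equivalent of grouping by destination and counting: number of rows per key.
rows_per_dest = Counter(row["DEST_COUNTRY_NAME"] for row in rows)

# A different aggregation: summing the "count" column per key.
flights_per_dest = Counter()
for row in rows:
    flights_per_dest[row["DEST_COUNTRY_NAME"]] += row["count"]

print(dict(rows_per_dest))
print(dict(flights_per_dest))
```

If the sum is what you want, the streaming DataFrame could be aggregated with `streaming.groupBy("DEST_COUNTRY_NAME").sum("count")` instead.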

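4. Start the streaming query, writing the results to an in-memory table. The *complete* output mode rewrites the whole result table, named *counts*, on every trigger.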

In [4]:
query = counts.writeStream.queryName("counts").format("memory").outputMode("complete").start()

25/04/09 17:53:13 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-55ae2be2-5023-41c4-a993-19a59ddf863d. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/09 17:53:13 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
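
The *complete* output mode can be pictured as a running aggregate whose full contents replace the sink table after every micro-batch. Below is a minimal pure-Python sketch of that idea. The function name *run_complete_mode* and the batch contents are hypothetical, and this models the observable behaviour only, not how Spark implements it.

```python
from collections import Counter
from typing import Dict, Iterable, List


def run_complete_mode(batches: Iterable[List[str]]) -> List[Dict[str, int]]:
    """Return the full result table as it would look after each micro-batch."""
    state: Counter = Counter()
    snapshots = []
    for batch in batches:
        state.update(batch)            # merge the new rows into the running counts
        snapshots.append(dict(state))  # complete mode: emit the entire table again
    return snapshots


# Hypothetical destination values, one list per micro-batch.
batches = [["Paraguay", "Russia"], ["Paraguay", "Russia", "Yemen"]]
snapshots = run_complete_mode(batches)
for table in snapshots:
    print(table)
```

Each snapshot contains every key seen so far, including keys that did not change in the latest batch.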

5. Query the in-memory table once per second to watch the counts change as files are processed:

In [5]:
from time import sleep
# Poll the in-memory table five times, one second apart.
for _ in range(5):
    spark.sql("SELECT * FROM counts").show()
    sleep(1)

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
+-----------------+-----+

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
+-----------------+-----+

+-----------------+-----+
|DEST_COUNTRY_NAME|count|
+-----------------+-----+
+-----------------+-----+

+--------------------+-----+
|   DEST_COUNTRY_NAME|count|
+--------------------+-----+
|            Paraguay|    1|
|            Anguilla|    1|
|              Russia|    1|
|             Senegal|    1|
|              Sweden|    1|
|            Kiribati|    1|
|              Guyana|    1|
|         Philippines|    1|
|            Malaysia|    1|
|           Singapore|    1|
|                Fiji|    1|
|              Turkey|    1|
|             Germany|    1|
|         Afghanistan|    1|
|              Jordan|    1|
|               Palau|    1|
|              France|    1|
|Turks and Caicos ...|    1|
|              Greece|    1|
|            Dominica|    1|
+--------------------+-----+
only showing top 20 rows

+--------------------+-----+
|   DEST_COUNTRY_NAME|count|
+--------------------+-----+
|            Paraguay|    2|
|            Anguilla|    2|
|              Russia|    2|
|               Yemen|    1|
|             Senegal|    2|
|              Sweden|    2|
|            Kiribati|    2|
|              Guyana|    2|
|         Philippines|    2|
|            Malaysia|    2|
|           Singapore|    2|
|                Fiji|    2|
|              Turkey|    2|
|             Germany|    2|
|         Afghanistan|    2|
|              Jordan|    2|
|               Palau|    2|
|              France|    2|
|Turks and Caicos ...|    2|
|              Greece|    2|
+--------------------+-----+
only showing top 20 rows
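
The first tables are empty, most likely because the query had not yet completed its first micro-batch when the table was read. Instead of looping a fixed number of times, one option is to poll until a condition holds. The helper below is a generic sketch: *poll_until* and its parameters are hypothetical names, not part of PySpark, and the data source is faked so the example runs without a cluster.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(fetch: Callable[[], T],
               ready: Callable[[T], bool],
               timeout: float = 30.0,
               interval: float = 1.0) -> Optional[T]:
    """Call `fetch` every `interval` seconds until `ready(result)` is true.

    Returns the first ready result, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if ready(result):
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for the in-memory table: empty twice, then populated.
fake_results = iter([[], [], [("Paraguay", 1)]])
rows = poll_until(lambda: next(fake_results), lambda r: len(r) > 0,
                  timeout=5.0, interval=0.01)
print(rows)
```

In the notebook, `fetch` could be `lambda: spark.sql("SELECT * FROM counts").collect()`. Once you are done inspecting the results, `query.stop()` ends the streaming query.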
