<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# Kafka Plain Consumer 

**Technical Accomplishments:**
- Introduce `Spark Structured Streaming`
- Consume data from a textual Kafka topic in both a static (batch) and a streaming fashion

## Getting Started

Let's start by importing the libraries and creating some useful variables.

In [None]:
%load_ext autotime

In [None]:
import os
import qcutils
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
import io
from avro.io import DatumReader, BinaryDecoder
from pyspark.sql.functions import *
import time
import json
import avro.schema
import struct
import requests 

# Connector JARs (Kafka, Avro, S3). Set this before the SparkSession is created,
# because the packages are resolved when the JVM starts.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.5,org.apache.spark:spark-streaming-kafka-0-10_2.11:2.4.5,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,org.apache.kafka:kafka-clients:2.4.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate())
qcutils.init_spark_session(spark)

spark

In [None]:
topic = ''

assert len(topic) > 0, "In order to avoid conflicts during write operation, please name the topic as <surname>-topic"

# Kafka bootstrap servers in host:port form
servers = qcutils.read_config_value("kafka.server") + ":" + str(qcutils.read_config_value("kafka.port"))

### Static results

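Spark can create a static DataFrame from a Kafka topic.

In this case the options `startingOffsets` and `endingOffsets` delimit the range of messages to read. For batch reads they default to `earliest` and `latest`, so setting them explicitly, as done here, mainly documents the intent.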
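
Besides the keywords `earliest` and `latest`, both options also accept a JSON string with explicit offsets per partition, where `-2` stands for earliest and `-1` for latest. The following is a minimal sketch of how such a configuration could be built. The helper names `offsets_json` and `bounded_kafka_options`, as well as the partition numbers and offsets in the usage note, are assumptions made for illustration and are not part of Spark or `qcutils`.

```python
import json

EARLIEST = -2  # in a JSON offset spec, -2 refers to the earliest offset
LATEST = -1    # in a JSON offset spec, -1 refers to the latest offset


def offsets_json(topic_name, partition_offsets):
    """Build the JSON string accepted by startingOffsets / endingOffsets.

    partition_offsets maps a partition number to an offset (hypothetical values).
    """
    per_partition = {str(p): int(o) for p, o in partition_offsets.items()}
    return json.dumps({topic_name: per_partition})


def bounded_kafka_options(bootstrap_servers, topic_name, start, end):
    """Collect the options for a batch read over an explicit offset range."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic_name,
        "startingOffsets": offsets_json(topic_name, start),
        "endingOffsets": offsets_json(topic_name, end),
    }
```

Assuming a topic with two partitions, the options could be passed to the reader with `spark.read.format("kafka").options(**bounded_kafka_options(servers, topic, {0: 5, 1: EARLIEST}, {0: 10, 1: LATEST})).load()`. Spark appears to expect the JSON to list every partition of the topic. Also note that `-1` (latest) is not a valid starting offset for a batch read and `-2` (earliest) is not a valid ending offset.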

In [None]:
ns_df = (spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("subscribe", topic)
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load())
         
# Kafka delivers key and value as binary: cast them to strings and print the rows
ns_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()
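
Besides `key` and `value`, every row read from the Kafka source also carries the `topic`, `partition`, `offset`, `timestamp` and `timestampType` columns. The sketch below, which assumes the messages are plain text, casts the binary columns to strings and keeps some of the metadata columns next to them. `text_expressions` and `decode_text` are hypothetical helper names, not part of Spark or `qcutils`.

```python
# Columns of the Kafka source that describe where each record comes from.
METADATA_COLUMNS = ("topic", "partition", "offset", "timestamp")


def text_expressions(metadata=METADATA_COLUMNS):
    """SQL expressions that decode key/value as strings and keep the metadata."""
    casts = ["CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value"]
    return casts + list(metadata)


def decode_text(df):
    """Return df with key/value as strings (works on static and streaming DataFrames)."""
    return df.selectExpr(*text_expressions())
```

For example, `decode_text(ns_df).show(truncate=False)` would print the decoded messages together with the partition and offset they were read from.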

### Streaming results

Using Spark Structured Streaming, we can create a streaming DataFrame from a Kafka topic.

This way, new messages added to the Kafka topic are dynamically appended to the DataFrame.

In [None]:
s_df = (spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", servers)
  .option("startingOffsets", "latest")
  .option("subscribe", topic)
  .load())

dfw = (s_df
    .writeStream
    .outputMode("append")
    .format("parquet") 
    .option("path", "/home/jovyan/data/pyspark/str_df_plain.parquet")
    .option("checkpointLocation", "/home/jovyan/data/pyspark/checkpoint/str_df_plain")
    .start())

dfw.awaitTermination()
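
Called without arguments, `awaitTermination()` blocks the cell until the query is stopped or fails. A minimal sketch of a time-boxed alternative follows. It relies on `awaitTermination(timeout)`, which takes a timeout in seconds and returns whether the query terminated within it, and on `stop()`. The name `run_for` is a hypothetical helper introduced here for illustration.

```python
def run_for(query, seconds):
    """Let a streaming query run for at most `seconds`, then stop it.

    Returns True if the query had already terminated by itself within the
    timeout, False if it was still running and had to be stopped here.
    """
    terminated = query.awaitTermination(seconds)
    if not terminated:
        # Still running after the timeout: shut the query down explicitly.
        query.stop()
    return terminated
```

Used in place of the blocking call, `run_for(dfw, 60)` would let the query collect messages for about one minute and then release the notebook. The 60 seconds are an arbitrary example value.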

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.