### Crypto Streaming

If you set up the upstream part of our streaming pipeline, you should have near real-time trading data for several cryptocurrencies arriving on separate Kafka topics. In this notebook, we will read the trading value of those cryptocurrencies (in USD) and do some fun stuff with it!

#### Getting Started (Imports & Setting Variables)

First of all, to connect to Kafka from PySpark, we need the Kafka integration package. It is not bundled with Spark, but luckily, with a neat trick we can request it from within our notebook: we set the PYSPARK_SUBMIT_ARGS environment variable before the Spark session is created. More details: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

In [1]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 pyspark-shell'
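
The package coordinate has to match the local Spark installation: the suffix after spark-sql-kafka-0-10_ is the Scala version Spark was built with, and the last field is the Spark version. A minimal sketch of building the string from those two values is shown here. The helper name kafka_submit_args is hypothetical, and the versions passed in are simply the ones used in the cell above.

```python
import os

def kafka_submit_args(scala_version: str, spark_version: str) -> str:
    """Build PYSPARK_SUBMIT_ARGS for the Kafka source (hypothetical helper)."""
    package = f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"
    return f"--packages {package} pyspark-shell"

# Must be set before the first SparkSession is created in this process.
os.environ["PYSPARK_SUBMIT_ARGS"] = kafka_submit_args("2.11", "2.4.4")
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

If the Scala or Spark version does not match the installed Spark build, the Kafka source will most likely fail to load, so it is worth checking both before starting the session.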

In [2]:
# Spark and Structured Streaming related imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.types import TimestampType, StringType, IntegerType, StructType, StructField
from pyspark.sql.functions import from_json, col

In [3]:
# get a spark session
spark = SparkSession.builder.appName("CryptoStreaming").getOrCreate()

#### Start reading a stream
Spark's Structured Streaming lets us stream the data straight into a dataframe! To do that, we first use readStream to subscribe to a Kafka topic, as shown below.

In [5]:
# read stream and subscribe to bitcoin topic
df = spark.readStream \
          .format("kafka") \
          .option("kafka.bootstrap.servers", "10.128.0.16:19092") \
          .option("startingOffsets", "earliest") \
          .option("subscribe", "BTC") \
          .load()
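
Since each cryptocurrency is published to its own topic, the same reader could follow several of them at once: the Kafka source's subscribe option accepts a comma-separated list of topics. A rough sketch follows, where the helper kafka_options and the extra topic name ETH are assumptions for illustration, not part of the pipeline described here.

```python
def kafka_options(servers, topics, starting_offsets="earliest"):
    """Return the option dict for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": servers,
        "subscribe": ",".join(topics),  # comma-separated topic list
        "startingOffsets": starting_offsets,
    }

opts = kafka_options("10.128.0.16:19092", ["BTC", "ETH"])  # "ETH" is an assumed topic
print(opts["subscribe"])

# With an active SparkSession, the dict could be applied in one call:
# df = spark.readStream.format("kafka").options(**opts).load()
```

Each row's topic column would then identify which currency it belongs to.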

Keep in mind that when we read from Kafka, we get not only the value but also a lot of metadata that is internal to Kafka (topic, partition, offset, timestamp). You can take a look at these by using printSchema.

In [6]:
df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



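You can also take a look at the raw content of the data received from Kafka. To do that, we first start a streaming query that writes to the memory sink, which keeps the results in an in-memory table named after the query. Selecting from that table takes a snapshot of the stream so far, ready for follow-up SQL operations. (A streaming query can also be written to disk instead of memory.)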

In [7]:
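# start() returns a StreamingQuery handle (not a dataframe);
# the results land in an in-memory table called "rawdata"
raw_query = df \
            .writeStream \
            .queryName("rawdata") \
            .format("memory") \
            .start()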

In [9]:
raw = spark.sql("select * from rawdata")
raw.show()

+----+--------------------+-----+---------+------+--------------------+-------------+
| key|               value|topic|partition|offset|           timestamp|timestampType|
+----+--------------------+-----+---------+------+--------------------+-------------+
|null|[7B 22 74 69 6D 6...|  BTC|        0|     0|2019-10-17 19:24:...|            0|
|null|[7B 22 74 69 6D 6...|  BTC|        0|     1|2019-10-17 19:24:...|            0|
|null|[7B 22 74 69 6D 6...|  BTC|        0|     2|2019-10-17 19:25:...|            0|
|null|[7B 22 74 69 6D 6...|  BTC|        0|     3|2019-10-17 19:26:...|            0|
|null|[7B 22 74 69 6D 6...|  BTC|        0|     4|2019-10-17 19:26:...|            0|
|null|[7B 22 74 69 6D 6...|  BTC|        0|     5|2019-10-17 19:27:...|            0|
|null|[7B 22 74 69 6D 6...|  BTC|        0|     6|2019-10-17 19:28:...|            0|
|null|[7B 22 74 69 6D 6...|  BTC|        0|     7|2019-10-17 19:28:...|            0|
|null|[7B 22 74 69 6D 6...|  BTC|        0|     8|2019
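
The value column is printed as hex because Kafka delivers it as raw bytes. As a quick sanity check outside Spark, the leading bytes shown in the table above can be decoded by hand. This sketch uses only the five bytes that are fully visible in the output.

```python
# First five complete bytes from the value column in the table above
visible = bytes.fromhex("7B 22 74 69 6D")

# Decoding as UTF-8 reveals the start of a JSON document
print(visible.decode("utf-8"))  # prints {"tim
```

The opening brace and quote suggest the payload is a JSON string, which is why the value is cast to a string before parsing.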

#### Structuring The Value & Parsing JSON to Dataframe
The value column arrives as binary, so we first use selectExpr to select it and cast it to a string, and then use the from_json function to parse the JSON data.

In [10]:
# select only the value column
raw_value_df = df.selectExpr("CAST(value AS STRING)")

In [11]:
# write stream to memory
raw_value_query = raw_value_df.writeStream \
                              .queryName("raw_value")\
                              .format("memory")\
                              .start()

In [13]:
# use the select statement to take a snapshot of the query
# (a separate name keeps the StreamingQuery handle in raw_value_query intact)
raw_value_snapshot = spark.sql("select * from raw_value")
# print 20 values, False is so we can see the full value in the table
raw_value_snapshot.show(20, False)

+------------------------------------------------------------+
|value                                                       |
+------------------------------------------------------------+
|{"timestamp": "17-10-2019 19:24:08", "usd_value": "8014.91"}|
|{"timestamp": "17-10-2019 19:24:44", "usd_value": "8014.91"}|
|{"timestamp": "17-10-2019 19:25:21", "usd_value": "8014.91"}|
|{"timestamp": "17-10-2019 19:26:04", "usd_value": "8018.71"}|
|{"timestamp": "17-10-2019 19:26:45", "usd_value": "8018.71"}|
|{"timestamp": "17-10-2019 19:27:26", "usd_value": "8018.71"}|
|{"timestamp": "17-10-2019 19:28:08", "usd_value": "8017.44"}|
|{"timestamp": "17-10-2019 19:28:50", "usd_value": "8018.16"}|
|{"timestamp": "17-10-2019 19:29:31", "usd_value": "8018.16"}|
|{"timestamp": "17-10-2019 19:30:12", "usd_value": "8018.07"}|
|{"timestamp": "17-10-2019 19:30:54", "usd_value": "8016.51"}|
|{"timestamp": "17-10-2019 19:31:35", "usd_value": "8016.03"}|
|{"timestamp": "17-10-2019 19:32:16", "usd_value": "801

In [14]:
# we need to define the schema for parsing json value
schema = StructType([StructField("timestamp", StringType(), True),
                     StructField("usd_value", StringType(), True)])

In [15]:
# parse json value and get bitcoin dataframe
json_value_df = raw_value_df.selectExpr("cast (value as STRING) json_data")\
                            .select(from_json("json_data", schema).alias("bitcoin"))\
                            .select("bitcoin.*")
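
Because the schema declares both fields as strings, timestamp and usd_value still need converting before any time-based or numeric work. The plain-Python sketch below shows that conversion for one record copied from the output above; the day-first format string is inferred from that sample. In Spark, the analogous step would presumably use to_timestamp with the pattern dd-MM-yyyy HH:mm:ss and a cast of usd_value to double.

```python
import json
from datetime import datetime

record = '{"timestamp": "17-10-2019 19:24:08", "usd_value": "8014.91"}'

parsed = json.loads(record)
# Day-first format, inferred from the sample record
ts = datetime.strptime(parsed["timestamp"], "%d-%m-%Y %H:%M:%S")
usd = float(parsed["usd_value"])

print(ts.isoformat(), usd)  # prints 2019-10-17T19:24:08 8014.91
```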

In [16]:
# write to memory, take a snapshot, and show off our well-structured dataframe
bitcoin_query = json_value_df.writeStream.format("memory").queryName("bitcoin_value").start()
# the table fills as micro-batches complete, so rerun show() if it comes back empty
bitcoin_df = spark.sql("select * from bitcoin_value")
bitcoin_df.show()