### Crypto Streaming

If you have set up the upstream part of our streaming pipeline, you should have near-real-time trading data for several cryptocurrencies being sent to separate Kafka topics. In this notebook, we will read the trading value of a cryptocurrency (in USD) and do some fun stuff with it!

#### Getting Started (Imports & Setting Variables)

First of all, to connect to Kafka from PySpark, we need the Kafka connector package. It is not bundled with Spark, but luckily we can request it from within the notebook by setting PYSPARK_SUBMIT_ARGS before the Spark session is created. More details: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

In [1]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 pyspark-shell'
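
The package coordinate above encodes two versions: the Scala version Spark was built with (2.11) and the Spark version itself (2.4.4). Both need to match your installation, otherwise the connector may fail to load. As a small sketch, the string can be assembled from those two values. The helper name kafka_submit_args is hypothetical and not part of PySpark.

```python
import os


def kafka_submit_args(scala_version, spark_version):
    """Build PYSPARK_SUBMIT_ARGS for the Kafka connector (hypothetical helper)."""
    package = "org.apache.spark:spark-sql-kafka-0-10_{0}:{1}".format(
        scala_version, spark_version
    )
    return "--packages {0} pyspark-shell".format(package)


# must run before the SparkSession (and its JVM) is created
os.environ["PYSPARK_SUBMIT_ARGS"] = kafka_submit_args("2.11", "2.4.4")
```

Keeping the two versions as arguments makes it harder to forget one of them when upgrading Spark.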

In [31]:
# Spark and Structured Streaming related imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.types import StringType, FloatType, StructType, StructField
from pyspark.sql.functions import from_json, col, to_timestamp

In [3]:
# get a spark session
spark = SparkSession.builder.appName("CryptoStreaming").getOrCreate()

#### Start reading a stream
Spark's Structured Streaming lets us stream the data straight into a DataFrame! To do that, we first use readStream to subscribe to a Kafka topic, as shown below.

In [4]:
# read stream and subscribe to bitcoin topic
df = spark.readStream \
          .format("kafka") \
          .option("kafka.bootstrap.servers", "10.128.0.16:19092") \
          .option("startingOffsets", "earliest") \
          .option("subscribe", "BTC") \
          .load()

Keep in mind that when we read from Kafka, we receive not only the message value but also metadata that is internal to Kafka, such as the topic, partition, and offset. You can take a look at these columns by using printSchema.

In [5]:
df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



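You can also take a look at the raw content of the data received from Kafka. To do that, we first start a streaming query that writes the stream to an in-memory table, and then query that table with SQL. Each SQL query takes a snapshot of whatever the stream has delivered so far; the stream could equally be written to disk for follow-up SQL operations. If the table is still empty when you query it, as in the output below, the first micro-batch has probably not finished yet, so rerun the cell after a few seconds.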

In [6]:
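# start a streaming query that writes the raw stream to an in-memory table
# (writeStream...start() returns a StreamingQuery handle, not a DataFrame)
raw_query = df \
            .writeStream \
            .queryName("rawdata") \
            .format("memory") \
            .start()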

In [7]:
raw = spark.sql("select * from rawdata")
raw.show()

+---+-----+-----+---------+------+---------+-------------+
|key|value|topic|partition|offset|timestamp|timestampType|
+---+-----+-----+---------+------+---------+-------------+
+---+-----+-----+---------+------+---------+-------------+



#### Structuring The Value & Parsing JSON to Dataframe
We can use selectExpr to pick out the value column and cast it from binary to a string, and then use the from_json function to parse the JSON data.

In [8]:
# select only the value column
raw_value_df = df.selectExpr("CAST(value AS STRING)")

In [9]:
# write stream to memory
raw_value_query = raw_value_df.writeStream \
                              .queryName("raw_value")\
                              .format("memory")\
                              .start()

In [10]:
# use the select statement to take a snapshot of the query
# (a separate name keeps the StreamingQuery handle from the previous cell intact)
raw_value_snapshot = spark.sql("select * from raw_value")
# print 20 values, False is so we can see the full value in the table
raw_value_snapshot.show(20, False)

+-----+
|value|
+-----+
+-----+



In [11]:
# we need to define the schema for parsing json value
schema = StructType([StructField("timestamp", StringType(), True),
                     StructField("usd_value", StringType(), True)])

In [12]:
# parse json value and get bitcoin dataframe
json_value_df = raw_value_df.selectExpr("cast (value as STRING) json_data")\
                            .select(from_json("json_data", schema).alias("bitcoin"))\
                            .select("bitcoin.*")
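
To make the parsing step concrete, here is a plain-Python sketch of what from_json with this schema does to a single message. The sample payload is an assumption inferred from the columns in the output; the real producer may send additional fields, which this sketch simply ignores. Input that is not valid JSON yields empty columns rather than an error, which is represented here by None. The function name parse_message is hypothetical.

```python
import json

# field names taken from the schema defined above
SCHEMA_FIELDS = ("timestamp", "usd_value")


def parse_message(raw_value):
    """Mimic from_json + select("bitcoin.*") for one Kafka message value."""
    try:
        record = json.loads(raw_value)
    except ValueError:
        record = None
    if not isinstance(record, dict):
        # unparseable input: every column comes back empty
        return {field: None for field in SCHEMA_FIELDS}
    # keep only the schema fields, as strings; missing fields become None
    return {
        field: None if record.get(field) is None else str(record[field])
        for field in SCHEMA_FIELDS
    }


# assumed payload shape, not taken from the real producer
sample = '{"timestamp": "17-10-2019 19:24:08", "usd_value": "8014.91"}'
print(parse_message(sample))
print(parse_message("not json"))
```

Both fields stay strings at this stage; converting them to a timestamp and a number is a separate step.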

In [13]:
# write to memory, take a snapshot, and show off our well-structured dataframe
bitcoin_query = json_value_df.writeStream.format("memory").queryName("bitcoin_value").start()

In [18]:
bitcoin_df = spark.sql("select * from bitcoin_value")
bitcoin_df.show()

+-------------------+---------+
|          timestamp|usd_value|
+-------------------+---------+
|17-10-2019 19:24:08|  8014.91|
|17-10-2019 19:24:44|  8014.91|
|17-10-2019 19:25:21|  8014.91|
|17-10-2019 19:26:04|  8018.71|
|17-10-2019 19:26:45|  8018.71|
|17-10-2019 19:27:26|  8018.71|
|17-10-2019 19:28:08|  8017.44|
|17-10-2019 19:28:50|  8018.16|
|17-10-2019 19:29:31|  8018.16|
|17-10-2019 19:30:12|  8018.07|
|17-10-2019 19:30:54|  8016.51|
|17-10-2019 19:31:35|  8016.03|
|17-10-2019 19:32:16|  8017.77|
|17-10-2019 19:32:58|  8015.45|
|17-10-2019 19:33:39|  8015.53|
|17-10-2019 19:34:20|  8016.71|
|17-10-2019 19:35:02|  8016.55|
|17-10-2019 19:35:43|  8016.09|
|17-10-2019 19:36:24|  8014.42|
|17-10-2019 19:37:06|  8015.75|
+-------------------+---------+
only showing top 20 rows



Although we could have parsed the values into their final types when defining the schema, it is often good practice not to. By reading them as strings first and converting them to the right types here, we make our code a little more robust.
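
The idea can be sketched in plain Python, assuming the conversion should turn a malformed value into a missing one instead of failing. The helper names are hypothetical, and the strptime pattern %d-%m-%Y %H:%M:%S is the Python counterpart of the dd-MM-yyyy HH:mm:ss pattern used with to_timestamp.

```python
from datetime import datetime

# Python equivalent of the Spark pattern 'dd-MM-yyyy HH:mm:ss'
TIMESTAMP_FORMAT = "%d-%m-%Y %H:%M:%S"


def to_timestamp_or_none(text):
    """Parse a timestamp string; return None if it is missing or malformed."""
    try:
        return datetime.strptime(text, TIMESTAMP_FORMAT)
    except (TypeError, ValueError):
        return None


def to_float_or_none(text):
    """Parse a numeric string; return None if it is missing or malformed."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


print(to_timestamp_or_none("17-10-2019 19:24:08"))
print(to_float_or_none("8014.91"))
print(to_float_or_none("n/a"))
```

Because the raw strings are kept until this point, a bad value in one field does not prevent the other field of the same message from being used.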

In [45]:
# convert to timestamp and double
bitcoin_df = bitcoin_df.withColumn('timestamp',to_timestamp(bitcoin_df.timestamp, 'dd-MM-yyyy HH:mm:ss'))\
                       .withColumn('usd_value', bitcoin_df.usd_value.cast("double"))

In [46]:
# print out the schema
bitcoin_df.printSchema()

root
 |-- timestamp: timestamp (nullable = true)
 |-- usd_value: double (nullable = true)



#### How much did the Bitcoin price fluctuate in the last ten minutes?

Now that we have our dataframe in the right format, let's write some interesting queries. We will start off by answering a simple question: how much did the value of Bitcoin (in USD) fluctuate in the last ten minutes?

In [80]:
from datetime import datetime, timedelta

In [81]:
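# note: datetime.now() is the driver's local time, so this assumes the stream's timestamps use the same time zone
ten_minutes_ago_dt = datetime.now() - timedelta(minutes=10)
ten_mins_bitcoin_df = bitcoin_df.filter(bitcoin_df.timestamp > ten_minutes_ago_dt)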

In [82]:
ten_min_count = ten_mins_bitcoin_df.count()
ten_min_max = ten_mins_bitcoin_df.agg({"usd_value": "max"}).collect()[0][0]
ten_min_min = ten_mins_bitcoin_df.agg({"usd_value": "min"}).collect()[0][0]

In [83]:
print('In the last ten minutes, we received {0} updates and the price fluctuated {1:.2f} USD'.format(ten_min_count, ten_min_max - ten_min_min))

In the last ten minutes, we received 15 updates and the price fluctuated 5.32 USD
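
The count, max, and min above each trigger a separate Spark job over a table that keeps growing, so they may see slightly different snapshots. One option is to compute all three in a single pass. The sketch below shows that logic in plain Python on (timestamp, usd_value) pairs copied from the output shown earlier; the function name and the explicit now argument are assumptions for illustration. In Spark, the same idea would be a single agg call that combines count, min, and max.

```python
from datetime import datetime, timedelta


def fluctuation(rows, now, window=timedelta(minutes=10)):
    """Return (count, max - min) for rows newer than now - window, in one pass."""
    cutoff = now - window
    count, low, high = 0, None, None
    for timestamp, usd_value in rows:
        if timestamp <= cutoff:
            continue  # outside the window
        count += 1
        low = usd_value if low is None else min(low, usd_value)
        high = usd_value if high is None else max(high, usd_value)
    if count == 0:
        # no updates in the window, so there is no spread to report
        return 0, None
    return count, high - low


rows = [
    (datetime(2019, 10, 17, 19, 24, 8), 8014.91),
    (datetime(2019, 10, 17, 19, 30, 12), 8018.07),
    (datetime(2019, 10, 17, 19, 36, 24), 8014.42),
    (datetime(2019, 10, 17, 19, 37, 6), 8015.75),
]
count, spread = fluctuation(rows, now=datetime(2019, 10, 17, 19, 38, 0))
print("{0} updates, price fluctuated {1:.2f} USD".format(count, spread))
```

Returning None for an empty window avoids the TypeError that subtracting two missing values would raise when no update arrived in the last ten minutes.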
