## Import packages

In [1]:
# json parsing
import json
import pandas as pd
from pyspark.sql.functions import explode, split
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType
import warnings
warnings.filterwarnings('ignore')

## Read the raw messages from the topic userItems

In [6]:
raw_events = (
    spark.read.format("kafka").option("kafka.bootstrap.servers", "kafka:29092")
    .option("subscribe", "userItems").option("startingOffsets", "earliest").option("endingOffsets", "latest").load()
)

## Make sure everything is running well: Cache data

In [7]:
raw_events.cache()

DataFrame[key: binary, value: binary, topic: string, partition: int, offset: bigint, timestamp: timestamp, timestampType: int]

In [8]:
raw_events.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



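Both cells ran without errors, and the schema matches the standard layout of a Kafka source. Note that `cache()` is lazy: the DataFrame is only marked for caching here, and the data is actually stored the first time an action runs on it. Still, this is a good sanity check before starting.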

## Explore Events

In [9]:
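# Kafka delivers the message payload as binary, so cast it to a string first
events = raw_events.select(raw_events.value.cast('string'))
# Parse each JSON string into a dict and let Spark infer the columns
extracted_events = events.rdd.map(lambda x: json.loads(x.value)).toDF()
extracted_events.show()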

+------+--------------+----------------+--------------+-----------+--------------------+
|Accept|Content-Length|    Content-Type|          Host| User-Agent|          event_type|
+------+--------------+----------------+--------------+-----------+--------------------+
|   */*|            52|application/json|localhost:5000|curl/7.47.0|purchase_a_sword:...|
|   */*|            52|application/json|localhost:5000|curl/7.47.0|purchase_a_sword:...|
|   */*|            17|application/json|localhost:5000|curl/7.47.0|get_coins {"coins...|
|   */*|            31|application/json|localhost:5000|curl/7.47.0|join_guild {"colo...|
|   */*|          null|            null|localhost:5000|curl/7.47.0|             default|
|   */*|          null|            null|localhost:5000|curl/7.47.0|             default|
|   */*|          null|            null|localhost:5000|curl/7.47.0|    purchase_a_sword|
+------+--------------+----------------+--------------+-----------+--------------------+



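The difference between a POST and a GET is quite obvious here. A POST carries a proper JSON dictionary as its body, so the event includes `Content-Length` and `Content-Type` along with a payload in `event_type`, which we can (and will) then read through Kafka into Redis to keep the state. A GET has no body, so those two columns are null and `event_type` holds only the event name, i.e. just a 'standard' response from the web app.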
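The same distinction can be sketched in plain Python, without Spark, to make the logic explicit. This is only an illustration under assumptions: the three messages in `raw_values` are hypothetical samples modeled on the table above (the JSON payloads are truncated in the output, so they are left out), the header values are assumed to arrive as strings, and `parse_event` and `has_body` are helper names invented for this sketch.

```python
import json

# Hypothetical raw Kafka values (bytes), modeled on the table above.
# The payload that follows the event name is truncated in the output,
# so it is deliberately omitted here.
raw_values = [
    b'{"Accept": "*/*", "Content-Length": "17", "Content-Type": "application/json", '
    b'"Host": "localhost:5000", "User-Agent": "curl/7.47.0", "event_type": "get_coins"}',
    b'{"Accept": "*/*", "Host": "localhost:5000", "User-Agent": "curl/7.47.0", '
    b'"event_type": "default"}',
    b'{"Accept": "*/*", "Host": "localhost:5000", "User-Agent": "curl/7.47.0", '
    b'"event_type": "purchase_a_sword"}',
]


def parse_event(value):
    """Decode the binary Kafka value and parse it as JSON."""
    return json.loads(value.decode("utf-8"))


def has_body(event):
    """Treat an event as body-carrying (POST-like) if Content-Length is present."""
    return event.get("Content-Length") is not None


events = [parse_event(v) for v in raw_values]
with_body = [e["event_type"] for e in events if has_body(e)]
without_body = [e["event_type"] for e in events if not has_body(e)]

print("with body:   ", with_body)
print("without body:", without_body)
```

Messages that lack the `Content-Length` key are the ones that show up as null in the DataFrame, because Spark fills in null for any column a given row does not provide.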