## Import the JSON Packages

In [1]:
# json parsing
import json
import pandas as pd

## Read the messages from the topic eduAssessments

In [2]:
messages = spark.read.format("kafka")\
    .option("kafka.bootstrap.servers", "kafka:29092")\
    .option("subscribe", "eduAssessments")\
    .option("startingOffsets", "earliest")\
    .option("endingOffsets", "latest")\
    .load()

## Visualize the Schema

In [3]:
messages.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



## See the top 20 messages as a DataFrame (binary so far)

In [4]:
messages.show()

+----+--------------------+--------------+---------+------+--------------------+-------------+
| key|               value|         topic|partition|offset|           timestamp|timestampType|
+----+--------------------+--------------+---------+------+--------------------+-------------+
|null|[7B 22 6B 65 65 6...|eduAssessments|        0|     0|1969-12-31 23:59:...|            0|
|null|[7B 22 6B 65 65 6...|eduAssessments|        0|     1|1969-12-31 23:59:...|            0|
|null|[7B 22 6B 65 65 6...|eduAssessments|        0|     2|1969-12-31 23:59:...|            0|
|null|[7B 22 6B 65 65 6...|eduAssessments|        0|     3|1969-12-31 23:59:...|            0|
|null|[7B 22 6B 65 65 6...|eduAssessments|        0|     4|1969-12-31 23:59:...|            0|
|null|[7B 22 6B 65 65 6...|eduAssessments|        0|     5|1969-12-31 23:59:...|            0|
|null|[7B 22 6B 65 65 6...|eduAssessments|        0|     6|1969-12-31 23:59:...|            0|
|null|[7B 22 6B 65 65 6...|eduAssessments|        

We need neither the topic nor the partition. The offset and the timestamp are not useful for the streaming analysis itself, but they make the code auditable, so we should keep them at least in the messages DataFrame.
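As a quick sanity check before casting, the hex bytes printed in the `value` column can be decoded by hand. This is a minimal sketch that uses only the first five complete bytes visible in the `show()` output and assumes the payload is UTF-8 encoded text.

```python
# Leading bytes of the `value` column as printed by messages.show().
hex_prefix = "7B 22 6B 65 65"

# bytes.fromhex skips the spaces between the byte pairs.
raw = bytes.fromhex(hex_prefix)

# Assumption: the producer wrote UTF-8 text, so decoding recovers the start of the payload.
decoded_prefix = raw.decode("utf-8")
print(decoded_prefix)
```

The decoded prefix starts with an opening brace and a quoted key, which suggests that each value is a JSON document and that a string cast is the right next step.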

## Cast the binary values as Strings and show them

In [5]:
messages_as_strings=messages.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
messages_as_strings.printSchema()
messages_as_strings.show()
messages_as_strings.count()

root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

+----+--------------------+
| key|               value|
+----+--------------------+
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
|null|{"keen_timestamp"...|
+----+--------------------+
only showing top 20 rows



3280

## Let's now parse the first message as JSON and show it

In [6]:
first_message=json.loads(messages_as_strings.select('value').take(1)[0].value)
first_message

{'base_exam_id': '37f0a30a-7464-11e6-aa92-a8667f27e5dc',
 'certification': 'false',
 'exam_name': 'Normal Forms and All That Jazz Master Class',
 'keen_created_at': '1516717442.735266',
 'keen_id': '5a6745820eb8ab00016be1f1',
 'keen_timestamp': '1516717442.735266',
 'max_attempts': '1.0',
 'sequences': {'attempt': 1,
  'counts': {'all_correct': False,
   'correct': 2,
   'incomplete': 1,
   'incorrect': 1,
   'submitted': 4,
   'total': 4,
   'unanswered': 0},
  'id': '5b28a462-7a3b-42e0-b508-09f3906d1703',
  'questions': [{'id': '7a2ed6d3-f492-49b3-b8aa-d080a8aad986',
    'options': [{'at': '2018-01-23T14:23:24.670Z',
      'checked': True,
      'correct': True,
      'id': '49c574b4-5c82-4ffd-9bd1-c3358faf850d',
      'submitted': 1},
     {'at': '2018-01-23T14:23:25.914Z',
      'checked': True,
      'correct': True,
      'id': 'f2528210-35c3-4320-acf3-9056567ea19f',
      'submitted': 1},
     {'checked': False,
      'correct': True,
      'id': 'd1bf026f-554f-4543-bdd2-54dcf10

In [7]:
print(first_message['exam_name'])

Normal Forms and All That Jazz Master Class


At first glance, the first JSON message suggests that each data point is an assessment of homework in our application. For our analysis, I would keep the following information:

- Name of the exam: This is important for the professors' statistics on a given subject
- Sequences: Especially the part with the counts of correct, incorrect, incomplete, and total questions. These will be useful for grading in the UI and for statistics on the toughness of a test
- Certification: This will help us divide the clients into Certification vs. Free Users (Free Users usually don't receive a definitive grade or have a different pool of questions)
- User_Exam_Id: To keep track of each person through their ID, so we can provide them with targeted marketing on possible next courses
- Timestamps: The different timestamps can help us deploy some kind of reminders, but that is a less important subject of analysis
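The fields listed above could be pulled out of a parsed message roughly as follows. This is only a sketch: `summarize_assessment` is a hypothetical helper, the sample dictionary is a trimmed copy of the values printed for the first message, and the key name `user_exam_id` is assumed from the list above because it does not appear in the truncated output.

```python
from datetime import datetime, timezone

# Trimmed copy of the first message, using only values printed earlier.
sample_message = {
    "exam_name": "Normal Forms and All That Jazz Master Class",
    "certification": "false",
    "keen_timestamp": "1516717442.735266",
    "sequences": {
        "attempt": 1,
        "counts": {
            "all_correct": False,
            "correct": 2,
            "incomplete": 1,
            "incorrect": 1,
            "submitted": 4,
            "total": 4,
            "unanswered": 0,
        },
    },
}


def summarize_assessment(message):
    """Flatten one parsed message into the fields of interest (hypothetical helper)."""
    counts = message.get("sequences", {}).get("counts", {})
    return {
        "exam_name": message.get("exam_name"),
        # Assumed key name; .get returns None when the key is absent.
        "user_exam_id": message.get("user_exam_id"),
        # The sample stores this flag as the string 'false', so compare as text
        # (assumption: certified exams carry the string 'true').
        "certification": message.get("certification") == "true",
        "correct": counts.get("correct"),
        "incorrect": counts.get("incorrect"),
        "incomplete": counts.get("incomplete"),
        "total": counts.get("total"),
        # keen_timestamp is epoch seconds stored as a string; convert it to UTC.
        "taken_at": datetime.fromtimestamp(
            float(message["keen_timestamp"]), tz=timezone.utc
        ),
    }


summary = summarize_assessment(sample_message)
print(summary)
```

Using `.get` with defaults keeps the helper from raising on messages that lack a `sequences` block or a user ID, which seems prudent until the full schema has been inspected.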

## Bonus: Most taken Course or Exam

In [8]:
# Collect every message value to the driver instead of hard-coding the row count
messages_in_console = messages_as_strings.select('value').collect()
courses = [json.loads(row.value)['exam_name'] for row in messages_in_console]
# Count how many times each exam name appears
freq = {}
for item in courses:
    freq[item] = freq.get(item, 0) + 1
print('The distribution of the courses taken is:')
freq_pd = pd.DataFrame(list(freq.values()), index=list(freq.keys()), columns=['Freq'])
freq_pd.sort_values(by=['Freq'], ascending=False)

The distribution of the courses taken is:


Unnamed: 0,Freq
Learning Git,394
Introduction to Python,162
Introduction to Java 8,158
Intermediate Python Programming,158
Learning to Program with R,128
Introduction to Machine Learning,119
Software Architecture Fundamentals Understanding the Basics,109
Beginning C# Programming,95
Learning Eclipse,85
Learning Apache Maven,80
