# Project 2

Derek Topper

# Pipeline

This file shows which commands were used to create the data pipeline to ultimately run queries on the assessment data.

### Docker

##### Navigate to folder I want to work in

`cd w205/project-2-derektopper`

##### Getting assignment data

`curl -L -o assessment-attempts-20180128-121051-nested.json https://goo.gl/ME6hjp`

##### Get Docker Compose File

`cp ~/w205/course-content/08-Querying-Data/docker-compose.yml .`

##### Spin up detached docker container

`docker-compose up -d`

##### Look at kafka logs

`docker-compose logs -f kafka`

##### Creating a kafka topic called assessments

`docker-compose exec kafka kafka-topics --create --topic assessments --partitions 1 --replication-factor 1 --if-not-exists --zookeeper zookeeper:32181`

##### View description of kafka topic

`docker-compose exec kafka kafka-topics --describe --topic assessments --zookeeper zookeeper:32181`

##### Using jq to examine information in the json file (bash)

`docker-compose exec mids bash -c "cat /w205/project-2-derektopper/assessment-attempts-20180128-121051-nested.json"`

`docker-compose exec mids bash -c "cat /w205/project-2-derektopper/assessment-attempts-20180128-121051-nested.json | jq '.'"`


`docker-compose exec mids bash -c "cat /w205/project-2-derektopper/assessment-attempts-20180128-121051-nested.json | jq '.[]' -c"`

`docker-compose exec mids bash -c "cat /w205/project-2-derektopper/assessment-attempts-20180128-121051-nested.json | jq '.[]'"`
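
As a rough illustration of what `jq '.[]' -c` does to the file, here is a minimal plain-Python sketch: it unpacks a top-level JSON array and emits one compact JSON object per line, which is the shape each Kafka message needs. The `to_json_lines` helper and the two-record `sample` are hypothetical stand-ins, not part of the pipeline.

```python
import json

def to_json_lines(json_array_text):
    """Return one compact JSON string per element of a top-level JSON array."""
    records = json.loads(json_array_text)
    # Tight separators drop the optional whitespace, similar to jq's -c flag
    return [
        json.dumps(record, separators=(",", ":"), ensure_ascii=False)
        for record in records
    ]

# Hypothetical two-record input; the real file holds the assessment attempts
sample = '[{"exam_name": "Learning Git"}, {"exam_name": "Introduction to Python"}]'
for line in to_json_lines(sample):
    print(line)
```

Each printed line corresponds to one message once the output is piped into kafkacat.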


##### Use kafkacat in producer mode to publish messages (with printed message to confirm no errors). Then consume the messages and print the line count.

`docker-compose exec mids bash -c "cat /w205/project-2-derektopper/assessment-attempts-20180128-121051-nested.json | jq '.[]' -c | kafkacat -P -b kafka:29092 -t assessments && echo 'Produced messages.'"`

`docker-compose exec mids bash -c "kafkacat -C -b kafka:29092 -t assessments -o beginning -e"`

`docker-compose exec mids bash -c "kafkacat -C -b kafka:29092 -t assessments -o beginning -e" | wc -l`

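Output: 3281 (one more than the 3,280 messages counted in Spark below; the extra line is likely kafkacat's end-of-topic notice rather than a message)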

### PySpark

##### Run spark using spark container 

`docker-compose exec spark pyspark`

##### read stuff from kafka

`messages = spark.read.format("kafka").option("kafka.bootstrap.servers", "kafka:29092").option("subscribe","assessments").option("startingOffsets", "earliest").option("endingOffsets", "latest").load()`

##### see the schema

`messages.printSchema()`

output:

root
* |-- key: binary (nullable = true)
* |-- value: binary (nullable = true)
* |-- topic: string (nullable = true)
* |-- partition: integer (nullable = true)
* |-- offset: long (nullable = true)
* |-- timestamp: timestamp (nullable = true)
* |-- timestampType: integer (nullable = true)


##### see the messages

`messages.show()`

##### cache messages

`messages.cache()`

##### Cast messages

`messages_as_strings=messages.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")`
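
The cast is needed because Kafka hands Spark the `key` and `value` columns as raw bytes. A minimal plain-Python sketch of the same idea follows, assuming the payload is UTF-8 encoded JSON and the key is empty; `raw_message` and `cast_to_string` are hypothetical names used only for illustration.

```python
# Hypothetical raw record: no key, and a value holding UTF-8 encoded JSON bytes
raw_message = {"key": None, "value": b'{"exam_name": "Learning Git"}'}

def cast_to_string(raw):
    """Decode a bytes payload as UTF-8, passing None through unchanged."""
    return None if raw is None else raw.decode("utf-8")

message_as_strings = {name: cast_to_string(raw) for name, raw in raw_message.items()}
print(message_as_strings["value"])
```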

##### take a look at stringed messages

`messages_as_strings.show()`

##### take a look at stringed schema

`messages_as_strings.printSchema()`

##### take a look at count

`messages_as_strings.count()`
* 3280


##### Unroll data and view first entry

`messages_as_strings.select('value').take(1)`

##### View entry and extract the value

`messages_as_strings.select('value').take(1)[0].value`

##### Now work with json

`import json`

##### Unroll data and view first entry

`assessment=json.loads(messages_as_strings.select('value').take(1)[0].value)`

`assessment`

##### print an item from this assessment

`print(assessment['exam_name'])`
* Normal Forms and All That Jazz Master Class

##### write stringed assessments data to hdfs

`messages_as_strings.write.parquet("/tmp/messages_as_strings")`

##### Check out results from another window


`docker-compose exec cloudera hadoop fs -ls /tmp/`

`docker-compose exec cloudera hadoop fs -ls /tmp/messages_as_strings/`

##### view data back in spark

`messages_as_strings.show()`

##### fix unicode data

`import sys`

`sys.stdout = open(sys.stdout.fileno(), mode='w', encoding='utf8', buffering=1)`

##### load json, using RDD

`import json`

`messages_as_strings.rdd.map(lambda x: json.loads(x.value)).toDF().show()`

##### unroll and save extracted data with json

`extracted = messages_as_strings.rdd.map(lambda x: json.loads(x.value)).toDF()`
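
For readers less familiar with RDDs, the map step above can be sketched in plain Python: each string payload is parsed with `json.loads`, and the top-level keys act as column names. This is only an analogue of the Spark call, not how Spark builds the DataFrame internally, and the two `values` below are hypothetical records.

```python
import json

# Hypothetical string payloads standing in for the `value` column
values = [
    '{"exam_name": "Learning Git", "keen_id": "a1"}',
    '{"exam_name": "Introduction to Python", "keen_id": "b2"}',
]

# Same per-record transformation as the lambda passed to rdd.map
parsed = [json.loads(value) for value in values]

# Top-level keys play the role of column names in the resulting table
columns = sorted({key for record in parsed for key in record})
rows = [[record.get(column) for column in columns] for record in parsed]

print(columns)
for row in rows:
    print(row)
```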


##### view data

`extracted.show()`

##### view schema

`extracted.printSchema()`

##### Save as a parquet

`extracted.write.parquet("/tmp/extracted")`

##### check out extracted file (in other window)

`docker-compose exec cloudera hadoop fs -ls /tmp/extracted/`

##### cache the extracted data and register it as a temp table for SQL queries

`extracted.cache()`

`extracted.registerTempTable('extracted')`

# Report

In the commands above, I processed this data so it could be queried using the code below.

To accomplish this, I used the following:
* Docker, a platform that provides an isolated environment to work in using containers
* Kafka, a streaming platform that organizes messages into topics. I published the data to a topic named assessments, because it contains information about various assessments, so that it could feed a streaming pipeline.
* Spark, a big data processing tool that allows us to read the Kafka messages, parse them as JSON, and work with them
* Hadoop, a big data platform whose file system (HDFS) stores our data in a common and usable format (Parquet) for other data scientists.

While working with this data, I parsed each message as JSON, which showed that the records contain nested dictionaries.
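
To make the nesting concrete, here is a small sketch of how a nested JSON message turns into nested Python dictionaries and lists. The record is hypothetical: only `keen_id` and `exam_name` are field names taken from this report, while the `sequences` and `questions` structure is assumed for illustration.

```python
import json

# Hypothetical message; the nested "sequences" structure is an assumption
raw_value = (
    '{"keen_id": "abc123", "exam_name": "Learning Git", '
    '"sequences": {"questions": [{"id": "q1"}, {"id": "q2"}]}}'
)

assessment = json.loads(raw_value)

# Nested JSON objects become nested dicts, and JSON arrays become lists
question_ids = [question["id"] for question in assessment["sequences"]["questions"]]
print(assessment["exam_name"], question_ids)
```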



## Queries 

Now that the data is in a usable format, I can write queries to answer some of the questions posed in the assignment.

I am making a couple of key assumptions worth noting. First, I am assuming that each row represents a unique assessment and that the data has no duplicated values. Since there is no defined data dictionary, I cannot determine whether there are duplicated attempts. For example, I noticed that some keen_id and user_exam_id values appear in multiple rows. Without a data dictionary, however, either ID could be a row ID or a test-taker ID, so I cannot say that certain entries should be removed, and I believe that question is outside the scope of this project.
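
One way such repeated IDs could be spotted is sketched below in plain Python. The rows are hypothetical and `repeated_values` is an illustrative helper, not a command that was run in the pipeline; it only reports which values occur more than once and does not decide whether they are true duplicates.

```python
from collections import Counter

# Hypothetical rows; keen_id and user_exam_id are the fields discussed above
rows = [
    {"keen_id": "k1", "user_exam_id": "u1"},
    {"keen_id": "k2", "user_exam_id": "u2"},
    {"keen_id": "k1", "user_exam_id": "u1"},
]

def repeated_values(rows, field):
    """Return {value: occurrences} for values of `field` seen more than once."""
    counts = Counter(row[field] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

print(repeated_values(rows, "keen_id"))
print(repeated_values(rows, "user_exam_id"))
```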

Additionally, as a result of the lack of a data dictionary, I will mention what I interpret each variable to represent below.

In trying to determine some of the business questions that might be important, I tried to consider what types of things someone looking at this data would want to know. 


First, I wanted to know how many assessments are in the dataset. This can be done simply by counting the rows, as we are assuming there are no duplicated entries. The result is 3,280 assessments that we can examine.

`spark.sql('select count(keen_id) from extracted limit 10').show()`

`+--------------+
|count(keen_id)|
+--------------+
|          3280|
+--------------+`

I then wanted to look at what the most common and least common courses taken were. This is an important value as it could help an analyst see what the breakdown of assessments looks like and where resources were allocated in the data. It can allow us to see that one course is much more popular than another course. 

Notably, I chose to use the exam_name variable, as this appeared to be the name of the specific course that an assessment came from. I am also assuming that each assessment corresponds to one course taken, so the number of assessments and the number of courses taken are the same.

Thus looking at the data below, we can see that the most popular courses are:
* Learning Git (394 Students)
* Introduction to Python (162 Students)
* Introduction to Java 8 (158 Students)

Additionally, the least common courses are:
* Nulls, Three-valued Logic and Missing Information (1 Student)
* Native Web Apps for Android (1 Student)
* Learning to Visualize Data with D3.js (1 Student)
* Operating Red Hat Enterprise Linux Servers (1 Student)

`Most Common`

`spark.sql("select exam_name, count(exam_name)  from extracted group by exam_name order by count(exam_name) desc").show(3)`

`+--------------------+----------------+ 
|           exam_name|count(exam_name)|
+--------------------+----------------+
|        Learning Git|             394|
|Introduction to P...|             162|
|Introduction to J...|             158|
+--------------------+----------------+`

`Least Common`

`spark.sql("select exam_name, count(exam_name)  from extracted group by exam_name order by count(exam_name) ").show(5)`

`+--------------------+----------------+
|           exam_name|count(exam_name)|
+--------------------+----------------+
|Native Web Apps f...|               1|
|Learning to Visua...|               1|
|Nulls, Three-valu...|               1|
|Operating Red Hat...|               1|
|Learning Spring P...|               2|
+--------------------+----------------+`
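
The grouping logic in the two queries above can be tried without a Spark session. The sketch below swaps in Python's built-in `sqlite3` module for Spark SQL and uses made-up counts, so it only illustrates the `group by` and `order by` pattern; `toy_counts` is hypothetical, and `limit` stands in for the row cap that `.show(n)` applies.

```python
import sqlite3

# Hypothetical toy counts (not the real data), chosen so there are no ties
toy_counts = {
    "Learning Git": 4,
    "Introduction to Python": 3,
    "Introduction to Java 8": 2,
    "Native Web Apps for Android": 1,
}

conn = sqlite3.connect(":memory:")
conn.execute("create table extracted (exam_name text)")
for name, n in toy_counts.items():
    conn.executemany("insert into extracted values (?)", [(name,)] * n)

grouped = (
    "select exam_name, count(exam_name) from extracted "
    "group by exam_name order by count(exam_name)"
)

# Descending order gives the most common courses, ascending the least common
most_common = conn.execute(grouped + " desc limit 3").fetchall()
least_common = conn.execute(grouped + " limit 1").fetchall()
print(most_common)
print(least_common)
```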

To answer a question such as whether more people took Introduction to Machine Learning or Advanced Machine Learning, we can use the query below. This can be useful if we want to compare two classes. In this case, we can assume that Introduction to Machine Learning would be taken before Advanced Machine Learning, which could help explain why the introductory, easier course had 119 students take it while the advanced, harder course had 67.

`spark.sql("select exam_name as course, count(*) as num_takers from extracted where (exam_name like 'Introduction to Machine Learning' or exam_name like 'Advanced Machine Learning') group by exam_name").show() `

`+--------------------+----------+
|              course|num_takers|
+--------------------+----------+
|Introduction to M...|       119|
|Advanced Machine ...|        67|
+--------------------+----------+`

Additionally, to find the average number of students who took each test, we can divide the overall number of test takers by the number of unique exam names. I chose to use exam names, rather than base_exam_ids, as there was no way to know whether base_exam_ids were unique to each course.

From this methodology, I was able to calculate that the average exam had 31.8 students take it.

`spark.sql("select count(*)/count(distinct exam_name) as MeanTakers from extracted").show() `

`+------------------+
|        MeanTakers|
+------------------+
|31.844660194174757|
+------------------+`
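
As a quick sanity check on that figure, the arithmetic can be run backwards: dividing the total count by the reported mean should give the number of distinct exam names. The sketch below uses only the two numbers reported above; the implied exam-name count is derived from them, not queried from the data.

```python
total_attempts = 3280              # count reported by the first query
mean_takers = 31.844660194174757   # MeanTakers reported by Spark

# Rearranging mean = total / distinct gives the implied number of distinct exam names
implied_distinct_exams = total_attempts / mean_takers
print(round(implied_distinct_exams))

# Recomputing the mean from the rounded exam count should match the reported value
print(round(total_attempts / round(implied_distinct_exams), 1))
```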

Ultimately, from this analysis, we were able to find a few key pieces of information.

* We found that there are 3,280 assessments in the dataset.
* We found that the most popular courses are:
    * Learning Git (394 Students)
    * Introduction to Python (162 Students)
    * Introduction to Java 8 (158 Students)
* We found that there were four courses that only one student took. The least common courses are:
    * Nulls, Three-valued Logic and Missing Information (1 Student)
    * Native Web Apps for Android (1 Student)
    * Learning to Visualize Data with D3.js (1 Student)
    * Operating Red Hat Enterprise Linux Servers (1 Student)
* The introductory machine learning course appeared to have more students take it than the advanced machine learning course did.
    * If both classes were combined, it would be the second most popular course, indicating that further analysis could be done on the course subject matter.
* The average course had 31.8 students take it.
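
The claim about the combined machine learning courses can be checked with a little arithmetic on the counts reported above. In this sketch the merged "Machine Learning (combined)" entry is hypothetical, and only the three most popular courses are included because every other course has a lower count than those three.

```python
# Counts taken from the query outputs in this report
course_counts = {
    "Learning Git": 394,
    "Introduction to Python": 162,
    "Introduction to Java 8": 158,
}
intro_ml, advanced_ml = 119, 67

# Hypothetical merged entry combining the two machine learning courses
course_counts["Machine Learning (combined)"] = intro_ml + advanced_ml

ranked = sorted(course_counts.items(), key=lambda item: item[1], reverse=True)
combined_rank = [name for name, _ in ranked].index("Machine Learning (combined)") + 1
print(ranked)
print(combined_rank)
```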

##### exit

`exit()`