# Freckle Data Engineer Challenge
*22 Oct 2017  
Farooq Qaiser* 

## Admin  

Initialize PySpark.

In [2]:
import findspark

findspark.init()

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Freckle_challenge") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Load some of the basic libraries (we'll load others as we need them). 

In [5]:
from pyspark.sql import functions as func

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Set some display options. 

In [6]:
# show plots inline
%matplotlib inline

Set some parameters.  

In [12]:
seed = 1

input_path = "/home/fqaiser94/Data Engineer Challenge/location-data-sample/*.gz"

Read in the data as a DataFrame.

In [15]:
df = spark.read.json(input_path)
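By default, `spark.read.json` treats each line of the input as one JSON record, and it decompresses `.gz` files transparently. A quick way to sanity-check that format outside Spark is to peek at the first record or two with the standard library. The sketch below assumes the sample files are newline-delimited JSON; the helper `peek_json_lines` and the demo records are hypothetical and only there for illustration.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def peek_json_lines(path, n=1):
    """Return up to the first n records of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # islice stops after n lines, so large files are not read in full.
        return [json.loads(line) for line in islice(handle, n) if line.strip()]


# Demo on a throwaway file; these records are made up, not taken from the sample.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write(json.dumps({"idfa": "abc", "lat": 1.5, "lng": -2.5}) + "\n")
    handle.write(json.dumps({"idfa": "def", "lat": 3.0, "lng": 4.0}) + "\n")

first_records = peek_json_lines(demo_path, n=1)
print(sorted(first_records[0].keys()))
```

To use it on the real data, point `peek_json_lines` at one concrete file rather than the `*.gz` glob, since `gzip.open` does not expand wildcards.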

## RDD vs Dataframe

**RE: The expectation for this exercise is that you use Spark 2.x with Scala, Python, or Java. You can use the RDD or Dataframe APIs as you see fit, but please be ready to explain your choices.**  


I chose to use the DataFrame API over the RDD API because:  
1. The DataFrame API exposes expressions and data fields to the query planner, so it can take advantage of Spark's Catalyst optimizer.
2. The DataFrame API is faster in most cases (see [here](http://www.adsquare.com/comparing-performance-of-spark-dataframes-api-to-spark-rdd/)).
3. I find DataFrames an easier construct to work with.  

## EDA

It's always helpful to do some EDA to understand the data before diving in.  
Let's take a look at the DataFrame's schema.  

In [17]:
df.printSchema()

root
 |-- action: string (nullable = true)
 |-- api_key: string (nullable = true)
 |-- app_id: string (nullable = true)
 |-- beacon_major: long (nullable = true)
 |-- beacon_minor: long (nullable = true)
 |-- beacon_uuid: string (nullable = true)
 |-- city: string (nullable = true)
 |-- code: string (nullable = true)
 |-- community: string (nullable = true)
 |-- community_code: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- county: string (nullable = true)
 |-- county_code: string (nullable = true)
 |-- event_time: long (nullable = true)
 |-- geohash: string (nullable = true)
 |-- horizontal_accuracy: double (nullable = true)
 |-- idfa: string (nullable = true)
 |-- idfa_hash_alg: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- lng: double (nullable = true)
 |-- place: string (nullable = true)
 |-- platform: string (nullable = true)
 |-- state: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- user_ip: string (nullable = true)

Hmm, that's a lot of columns. Let's take a peek at our data. 

In [18]:
df.limit(5).show()

+--------------------+--------------------+--------------------+------------+------------+-----------+---------+-----+---------+--------------+------------+---------+-----------+----------+------------+-------------------+--------------------+-------------+-----------------+------------------+---------+--------+-------------+----------+--------------+
|              action|             api_key|              app_id|beacon_major|beacon_minor|beacon_uuid|     city| code|community|community_code|country_code|   county|county_code|event_time|     geohash|horizontal_accuracy|                idfa|idfa_hash_alg|              lat|               lng|    place|platform|        state|state_code|       user_ip|
+--------------------+--------------------+--------------------+------------+------------+-----------+---------+-----+---------+--------------+------------+---------+-----------+----------+------------+-------------------+--------------------+-------------+-----------------+-----------------