# Exploring the dataset

An initial look at the raw unzipped data shows that the files are in JSON format. Since JSON records are semi-structured and Spark can infer a schema from them, and since DataFrames are optimized to work with data that has a schema, I have used DataFrames to analyze this data.

In [1]:
# initialize spark shell
import os
execfile(os.path.join(os.environ["SPARK_HOME"], 'python/pyspark/shell.py'))

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.9 (default, Dec 15 2014 10:37:34)
SparkSession available as 'spark'.


In [2]:
# read in the data and print the schema

df = spark.read.json('/Users/akaur/Desktop/DataEngChallenge/location-data-sample')
df.printSchema()

root
 |-- action: string (nullable = true)
 |-- api_key: string (nullable = true)
 |-- app_id: string (nullable = true)
 |-- beacon_major: long (nullable = true)
 |-- beacon_minor: long (nullable = true)
 |-- beacon_uuid: string (nullable = true)
 |-- city: string (nullable = true)
 |-- code: string (nullable = true)
 |-- community: string (nullable = true)
 |-- community_code: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- county: string (nullable = true)
 |-- county_code: string (nullable = true)
 |-- event_time: long (nullable = true)
 |-- geohash: string (nullable = true)
 |-- horizontal_accuracy: double (nullable = true)
 |-- idfa: string (nullable = true)
 |-- idfa_hash_alg: string (nullable = true)
 |-- lat: double (nullable = true)
 |-- lng: double (nullable = true)
 |-- place: string (nullable = true)
 |-- platform: string (nullable = true)
 |-- state: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- user_ip: string (nullable =

# Metrics for location events per IDFA

To get the max, min, average and standard deviation of the number of location events per IDFA, we first count the records per IDFA and store the result in a new DataFrame, then compute the required metrics with the very handy 'describe' method that Spark DataFrames provide.

In [4]:
# number of location events (records) per IDFA
idfa_count = df.groupBy(df.idfa).count()
idfa_count.show(10)

+--------------------+-----+
|                idfa|count|
+--------------------+-----+
|b5b237fe-4ab2-4f0...|   28|
|0894896b-1b58-4b8...|   58|
|0446d012-6d80-4b2...|   36|
|564fa141-580a-445...|   72|
|4bf5568f-4369-421...|   31|
|b2a03e10-3b45-479...|   94|
|f4503b93-f2ec-418...|   48|
|fe64cf85-bd56-4d1...|  245|
|71f57c4d-78fa-448...|   33|
|ef2d34f9-07fd-4cf...|   89|
+--------------------+-----+
only showing top 10 rows



In [5]:
# get the aggregation metrics for the count column
idfa_count.describe('count').show()

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|            238211|
|   mean| 36.79234376246269|
| stddev|118.69280626757613|
|    min|                 1|
|    max|             15999|
+-------+------------------+



With a mean of 36.79 and a standard deviation of 118.7, the IDFA with the max value (15999) sits roughly 134 standard deviations above the mean and definitely looks like an outlier.
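
As a quick sanity check on the outlier claim, the distance of the maximum from the mean can be expressed in standard deviations. This is a minimal sketch in plain Python using the numbers printed by `describe` above; the helper name `z_score` is introduced here for illustration only.

```python
# Summary statistics copied from the describe() output above.
mean_events = 36.79234376246269
std_events = 118.69280626757613
max_events = 15999


def z_score(value, mean, std):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / std


z_max = z_score(max_events, mean_events, std_events)
print(round(z_max, 1))
```

A value this far from the mean supports treating that IDFA separately, although with such a skewed distribution a z-score is only a rough guide.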

# Geohashes of all coordinates
To get the geohashes for all coordinates, we select the lat, lng and geohash columns and then take the distinct rows of the resulting DataFrame.

In [6]:
# distinct (lat, lng, geohash) combinations
geohash_df = df.select(df.lat, df.lng, df.geohash)\
    .distinct()
    
geohash_df.show(5)

+----------+-----------+------------+
|       lat|        lng|     geohash|
+----------+-----------+------------+
|43.8799046|-79.7387178|dpz39s04cpnd|
|43.0179897|-81.2123496|dpwhxznvzne0|
| 45.268008| -75.306878|f243w2g1q83c|
|39.8381067| -85.996203|dp4feh73z1r7|
| 45.429277| -73.312884|f25f7wphtzp9|
+----------+-----------+------------+
only showing top 5 rows
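
For readers unfamiliar with how a coordinate pair maps to a geohash, the following is a minimal sketch of the standard encoding (alternating longitude/latitude bisection, 5 bits per base-32 character). It is not how the dataset's `geohash` column was produced, which is unknown; the function name `geohash_encode` is an assumption made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lng, precision=12):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1  # upper half of the interval
            rng[0] = mid
        else:
            bits = bits << 1        # lower half of the interval
            rng[1] = mid
        use_lng = not use_lng
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


# First row of the table above, truncated to a 5-character prefix.
print(geohash_encode(43.8799046, -79.7387178, precision=5))
```

Running this on the first row of the table should reproduce the leading characters of its `geohash` value, which is a useful check that the column follows the standard scheme.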



In [8]:
# save the dataframe locally for later lookup if required
geohash_df.write.parquet('/Users/akaur/PycharmProjects/DataEngChallenge-Amanjot/geohash_coordinates.parquet')

# Geohash-based clusters

Locations whose geohashes share a prefix fall in the same cell and are therefore close to each other. Using this, I look at geohash prefixes of different lengths to see how many distinct IDFAs fall in each cell and whether these cells can be classified as clusters.

In [9]:
import pyspark.sql.functions as func
df.groupBy(df.geohash)\
    .agg(func.countDistinct(df.idfa).alias('vol'))\
    .sort('vol', ascending=False).show()

+------------+----+
|     geohash| vol|
+------------+----+
|s00000000000|4427|
|djfq0rzn7m70|  75|
|dq21mmek4q6q|  73|
|9vkh7wddguw5|  63|
|dpz8336uu2eq|  61|
|f244mdxpncbp|  58|
|dpm5wpyg42f9|  52|
|djfmbs7xs1j8|  48|
|f241b833vv6j|  47|
|djgzq3q23u2p|  46|
|djkvw9r4j8vp|  45|
|dn6m9tgey6mq|  44|
|djt54wb39fhy|  44|
|djdxvzvm9wvu|  43|
|dnkkg7cw8k1b|  39|
|dpherfur8ezf|  38|
|dnq1zws4u9te|  38|
|dpscv16bk3zf|  38|
|c2b2mbftz52c|  37|
|f2418x4h86s2|  37|
+------------+----+
only showing top 20 rows



Here we see that geohash 's00000000000' has a very high volume of IDFAs compared to any other geohash. This looks like an outlier, possibly caused by artificial traffic or some other inaccuracy in the data. Let's check the cities/coordinates for this geohash.

In [10]:
df.select(df.city)\
    .filter(df.geohash == 's00000000000')\
    .distinct()\
    .show()

+---------+
|     city|
+---------+
|Barrigada|
+---------+



Barrigada, the city responsible for this unusually high amount of traffic, is a small village with a population of < 9K, so it is reasonable to assume that this is some kind of artificial/fraud/bot traffic. Going forward, we filter out this geohash in our analysis to avoid any skew in the data because of it.

In [11]:
geohash_4 = df.filter(df.geohash != 's00000000000')\
    .groupBy(func.substring(df.geohash, 1, 4).alias('geohash_4'))\
    .agg(func.countDistinct('idfa').alias('count'))\
    .sort('count', ascending=False)
    
    
geohash_4.show(100)

+---------+-----+
|geohash_4|count|
+---------+-----+
|     dpz8| 4315|
|     dpz2| 3676|
|     dpz9| 2535|
|     f25d| 1938|
|     c3nf| 1825|
|     dpxr| 1792|
|     f244| 1749|
|     9vk1| 1743|
|     9vg5| 1689|
|     9vk0| 1636|
|     dr4e| 1612|
|     dn5b| 1595|
|     c3x2| 1581|
|     dpsc| 1574|
|     9vg4| 1569|
|     dqcx| 1569|
|     dphg| 1558|
|     dqcr| 1535|
|     djgz| 1514|
|     dnh0| 1510|
|     9vk4| 1415|
|     f25e| 1414|
|     9vff| 1399|
|     c2b2| 1384|
|     dpsb| 1384|
|     dpj5| 1356|
|     djup| 1347|
|     dpz3| 1321|
|     c2b8| 1262|
|     dr5r| 1262|
|     f241| 1246|
|     c28x| 1141|
|     dn6m| 1116|
|     dpzc| 1086|
|     dngy| 1059|
|     cbfg| 1053|
|     9z7d| 1051|
|     dq25|  991|
|     dnh1|  982|
|     9yzg|  981|
|     dpxn|  980|
|     dp3w|  972|
|     9vfc|  971|
|     dpmg|  971|
|     dr72|  962|
|     9yzu|  944|
|     dngz|  936|
|     9vgh|  934|
|     9vg1|  925|
|     dqcj|  924|
|     dnkk|  918|
|     djfq|  898|
|     dnn3

In [12]:
# save these cluster volumes to a local parquet file
geohash_4.write.parquet('/Users/akaur/PycharmProjects/DataEngChallenge-Amanjot/geohash4_distribution.parquet')

In [13]:
df.groupBy(df.city)\
    .agg(func.countDistinct('idfa').alias('count'))\
    .orderBy(['count'], ascending=False)\
    .show(100)

+-------------+-----+
|         city|count|
+-------------+-----+
|      Toronto| 6307|
|    Barrigada| 4446|
|      Houston| 3885|
|     Columbus| 3018|
|       Ottawa| 2635|
|      Atlanta| 2604|
|  Mississauga| 2439|
|       Dallas| 2407|
|      Calgary| 2352|
|     Edmonton| 2021|
|     Montréal| 1958|
|     Richmond| 1908|
|    Cleveland| 1858|
|    Baltimore| 1847|
|    Vancouver| 1662|
| Indianapolis| 1564|
|   Cincinnati| 1547|
| Philadelphia| 1540|
|     Hamilton| 1492|
|    Knoxville| 1466|
|  Saint Louis| 1456|
|      Detroit| 1406|
|    Charlotte| 1376|
|     Columbia| 1362|
|   Fort Worth| 1336|
|   Birmingham| 1308|
|      Chicago| 1255|
|     Brampton| 1205|
|    Nashville| 1201|
|    Arlington| 1194|
|     Winnipeg| 1185|
| Jacksonville| 1168|
|Oklahoma City| 1153|
|      Raleigh| 1144|
|   Burlington| 1110|
|  Kansas City| 1106|
|        Omaha| 1091|
|  Minneapolis| 1056|
|      Orlando| 1049|
|      Vaughan| 1041|
|  Springfield| 1030|
|       London| 1009|
|    Las V

After looking at the IDFA volumes for different lengths of geohash prefixes, we see that the volume distribution over geohashes, if we consider the first 4 characters, is similar to the volume distribution over cities. Thus we can consider these as clusters. The clusters contain up to ~4K people, and based on the Wikipedia page for geohashes, the error at this level of precision is +/- 20 km, which is close to the usual size of a metropolitan area.
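
The error figure quoted from Wikipedia can be approximated from the bit layout of a geohash. The sketch below assumes the standard scheme (5 bits per character, longitude receiving the extra bit when the total is odd) and an approximate 111.32 km per degree at the equator; the function name and the constant are assumptions for illustration.

```python
import math

KM_PER_DEGREE = 111.32  # assumed approximate length of one degree at the equator


def geohash_error_km(precision):
    """Approximate worst-case +/- error in km for a geohash of given length."""
    total_bits = 5 * precision
    lng_bits = math.ceil(total_bits / 2)   # longitude gets the extra bit
    lat_bits = total_bits // 2
    lat_err_deg = 90.0 / (2 ** lat_bits)   # half the cell height in degrees
    lng_err_deg = 180.0 / (2 ** lng_bits)  # half the cell width in degrees
    return max(lat_err_deg, lng_err_deg) * KM_PER_DEGREE


for n in (4, 5):
    print(n, round(geohash_error_km(n), 1))
```

For 4 characters this gives a little under 20 km, in line with the table's rounded value. Away from the equator the east-west extent of a cell shrinks, so the figure is an upper bound rather than an exact size.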

Let's see if we can find more granular clusters:

In [14]:
geohash_5 = df.filter(df.geohash != 's00000000000')\
    .groupBy(func.substring(df.geohash, 1, 5).alias('geohash_5'))\
    .agg(func.countDistinct('idfa').alias('count'))\
    .sort('count', ascending=False)
    
geohash_5.show(100)

+---------+-----+
|geohash_5|count|
+---------+-----+
|    dpz83| 1921|
|    dpz82|  879|
|    f244m|  730|
|    c2b2q|  716|
|    f25dv|  695|
|    dpz8b|  673|
|    dpz8f|  660|
|    dpz89|  642|
|    dpz86|  637|
|    dpz8d|  606|
|    dqcx8|  579|
|    dpz8c|  577|
|    dpz2j|  561|
|    dpz94|  558|
|    dpz88|  553|
|    dpz2r|  497|
|    dpz2n|  497|
|    dpxrg|  472|
|    dpz2w|  471|
|    c3nfk|  466|
|    dpz2x|  465|
|    dpz25|  444|
|    f244h|  442|
|    dpz80|  441|
|    c2b80|  431|
|    f25dt|  420|
|    dpz95|  420|
|    dpz90|  411|
|    dpz9h|  410|
|    f25dy|  409|
|    dpxru|  408|
|    dpz2m|  405|
|    dpz96|  403|
|    dpz2t|  401|
|    dpz2p|  397|
|    c3x29|  396|
|    dpz8g|  396|
|    dpz2z|  395|
|    dn5bp|  394|
|    dpz93|  394|
|    f25ds|  385|
|    c28ry|  377|
|    dpxrd|  377|
|    c2b2n|  374|
|    c3nfh|  372|
|    f25du|  371|
|    f244q|  366|
|    dpxre|  365|
|    dpz8e|  361|
|    dqcrx|  361|
|    c3nf7|  359|
|    dqcx9|  358|
|    djgzq

In [15]:
# save to local 
geohash_5.write.parquet('/Users/akaur/PycharmProjects/DataEngChallenge-Amanjot/geohash5_distribution.parquet')

At this level of precision, the error is +/- 2.4 km, which corresponds roughly to a community/neighbourhood. These can also be considered valid clusters, with the number of people averaging ~400.
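
The prefix-based grouping used for both cluster sizes can be sketched without Spark. The example below counts distinct IDFAs per geohash prefix in plain Python; the sample records and the function name `idfas_per_prefix` are made up for illustration and do not come from the dataset.

```python
from collections import defaultdict


def idfas_per_prefix(records, prefix_len):
    """Count distinct IDFAs per geohash prefix of length `prefix_len`."""
    seen = defaultdict(set)
    for idfa, geohash in records:
        seen[geohash[:prefix_len]].add(idfa)
    counts = {prefix: len(idfas) for prefix, idfas in seen.items()}
    # largest clusters first, ties broken by prefix for a stable order
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up (idfa, geohash) pairs, for illustration only.
sample = [
    ("idfa-a", "dpz83bcd0000"),
    ("idfa-a", "dpz83bce0000"),
    ("idfa-b", "dpz82xyz0000"),
    ("idfa-c", "f244mdxpncbp"),
]
print(idfas_per_prefix(sample, 4))
print(idfas_per_prefix(sample, 5))
```

Shortening the prefix merges neighbouring cells, so counts can only stay the same or grow as the prefix gets shorter.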

# Analysis of behaviour of IDFAs

To get more insights into the behaviour of IDFAs, I would like to look at the following variables:
- Top countries by volume
- Top cities by volume
- Distribution over platforms
- Time of day activity
- IDFAs linked with multiple locations

In [16]:
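import pyspark.sql.functions as func
from iso3166 import countries

def toAlpha3(code):
    # map a 2-letter country code to its 3-letter ISO 3166 code;
    # return None for missing or unrecognised codes instead of raising
    try:
        return countries.get(code).alpha3
    except (KeyError, AttributeError):
        return None

udfToAlpha3 = func.udf(toAlpha3)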

countries_df = df.groupBy(df.country_code)\
    .agg(func.countDistinct('idfa').alias('count'))\
    .orderBy(['count'], ascending=False)
    
# show top 10
countries_df.show(10)

+------------+------+
|country_code| count|
+------------+------+
|          US|189577|
|          CA| 42627|
|          GU|  4486|
|          JP|   706|
|          MX|   626|
|          GB|   399|
|          BR|   324|
|          TR|   259|
|          DE|   222|
|          AU|   213|
+------------+------+
only showing top 10 rows



In [17]:
# save the distribution to local
countries_df.write.parquet('/Users/akaur/PycharmProjects/DataEngChallenge-Amanjot/countries_distribution.parquet')

In [20]:
# for plotting, need to convert the country code to its 3-letter code
pandas_countries = countries_df.withColumn('alpha3', udfToAlpha3(countries_df.country_code))\
    .toPandas()

In [34]:
# the following code is taken from plotly examples and modified to represent this data

import plotly.plotly as py
import pandas as pd

data = [ dict(
        type = 'choropleth',
        locations = pandas_countries['alpha3'],
        z = pandas_countries['count'],
        text = pandas_countries['country_code'],
        colorscale = [[0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
            [0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"]],
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = True,
            tickprefix = '',
            title = 'Number of IDFAs'),
      ) ]

layout = dict(
    title = 'Distribution of IDFAs over the world',
    geo = dict(
        showframe = False,
        showcoastlines = False,
        projection = dict(
            type = 'Mercator'
        )
    )
)

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='world_map' )

### Distribution by top cities
Top cities by volume of IDFAs:

In [22]:
cities_df = df.groupBy(df.city)\
    .agg(func.countDistinct('idfa').alias('count'))\
    .orderBy(['count'], ascending=False)
    
# get top 50    
cities_df.show(50)

+-------------+-----+
|         city|count|
+-------------+-----+
|      Toronto| 6307|
|    Barrigada| 4446|
|      Houston| 3885|
|     Columbus| 3018|
|       Ottawa| 2635|
|      Atlanta| 2604|
|  Mississauga| 2439|
|       Dallas| 2407|
|      Calgary| 2352|
|     Edmonton| 2021|
|     Montréal| 1958|
|     Richmond| 1908|
|    Cleveland| 1858|
|    Baltimore| 1847|
|    Vancouver| 1662|
| Indianapolis| 1564|
|   Cincinnati| 1547|
| Philadelphia| 1540|
|     Hamilton| 1492|
|    Knoxville| 1466|
|  Saint Louis| 1456|
|      Detroit| 1406|
|    Charlotte| 1376|
|     Columbia| 1362|
|   Fort Worth| 1336|
|   Birmingham| 1308|
|      Chicago| 1255|
|     Brampton| 1205|
|    Nashville| 1201|
|    Arlington| 1194|
|     Winnipeg| 1185|
| Jacksonville| 1168|
|Oklahoma City| 1153|
|      Raleigh| 1144|
|   Burlington| 1110|
|  Kansas City| 1106|
|        Omaha| 1091|
|  Minneapolis| 1056|
|      Orlando| 1049|
|      Vaughan| 1041|
|  Springfield| 1030|
|       London| 1009|
|    Las V

In [23]:
# save the distribution to local
cities_df.write.parquet('/Users/akaur/PycharmProjects/DataEngChallenge-Amanjot/ct_distribution.parquet')

Most of the top 50 cities are big cities in the US and Canada, which is expected. Barrigada is an exception, as discussed earlier. Toronto has the largest share of traffic and, leaving Barrigada aside, is followed by Houston. This suggests that the data may have been collected for clients that are local to these areas, or from apps that are more popular there. It is interesting that Houston has the second-largest share of traffic rather than a city with a bigger population.

### Distribution by platform
Next, let's take a look at the top platforms:

In [24]:
plat_df = df.groupBy(df.platform)\
    .agg(func.countDistinct('idfa').alias('count'))\
    .orderBy(['count'], ascending=False)
    
plat_df.show()

+--------+------+
|platform| count|
+--------+------+
| android|204602|
|     ios| 33609|
+--------+------+



This is interesting. I would have expected the majority of the population to be on iOS, since iPhones seem to be more popular in North America, but the volume for Android is about 6x that of iOS (204602 vs. 33609). One possible explanation is that more iOS users have opted out of interest-based ads and turned off tracking. Another is that the apps this data comes from are less popular with iPhone users than with Android users.
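
The platform split can be quantified directly from the counts in the table above. This is a small plain-Python sketch; the variable names are chosen here for illustration.

```python
# Distinct-IDFA counts copied from the platform table above.
platform_counts = {"android": 204602, "ios": 33609}

total = sum(platform_counts.values())
shares = {name: count / total for name, count in platform_counts.items()}
ratio = platform_counts["android"] / platform_counts["ios"]

for name, share in shares.items():
    print(name, round(share * 100, 1))
print(round(ratio, 1))
```

The two counts add up to 238211, the same as the number of distinct IDFAs reported earlier, which suggests that no IDFA appears under both platforms in this sample.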

### Time of day analysis

Next, we check whether the overall volume of IDFAs follows a pattern over the time of day. Before that, we check the time range covered by the sample data.

In [26]:
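# note: from_unixtime formats in the Spark session time zone, so 'utc_time' is UTC only if that zone is UTC
df.select(func.from_unixtime(df.event_time).alias('utc_time'))\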
    .agg(func.min('utc_time'), func.max('utc_time'))\
    .show()

+-------------------+-------------------+
|      min(utc_time)|      max(utc_time)|
+-------------------+-------------------+
|2017-03-31 19:57:38|2017-04-01 20:01:36|
+-------------------+-------------------+
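
Because `from_unixtime` renders timestamps in the session time zone, it can help to cross-check a few epoch values against an explicitly UTC conversion. The sketch below does this in plain Python; the epoch value used is a hypothetical example and not a value taken from the dataset.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


ts = 1491004800  # hypothetical event_time value (seconds since epoch)
utc = epoch_to_utc(ts)
print(utc.isoformat(), utc.hour)
```

If the hour printed here differs from what Spark shows for the same `event_time`, the Spark session is not running in UTC and the hour buckets below should be read as local hours of that session.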



It looks like the data covers roughly 24 hours. It is not clear how much we can conclude about time-of-day activity from just one day of data, but let's try.

In [27]:
overall_tod = df.groupBy(func.hour(func.from_unixtime(df.event_time)).alias('utc_hour'))\
    .agg(func.countDistinct('idfa').alias('idfa_vol'))\
    .orderBy('utc_hour')\
    .toPandas()

In [35]:
import plotly.graph_objs as go
data_overall = [go.Scatter(x=overall_tod['utc_hour'], y=overall_tod['idfa_vol'])]
layout = dict(title = 'Distribution of IDFAs over the day',
              xaxis = dict(title = 'UTC hour'),
              yaxis = dict(title = 'Volume of IDFAs'),
              )
fig = dict(data=data_overall, layout=layout)
py.iplot(fig, filename='time_of_day_overall')

The line chart above shows that activity peaks from 11 AM to 8 PM UTC, which is 6 AM to 3 PM EST and 3 AM to 12 noon PST (these are standard-time offsets; daylight saving time was in effect on these dates, so local clock times are one hour later). This doesn't seem to make much intuitive sense, but it might be because the IDFAs are spread over multiple time zones. Let's take a closer look at two cities with higher volumes, one in the Eastern time zone (Toronto) and one in the Pacific time zone (Vancouver).

In [37]:
# 'PST' is an abbreviation; a region ID such as 'America/Vancouver' would be less ambiguous
van_tod = df.filter(df.city == 'Vancouver')\
    .select(func.from_unixtime(df.event_time).alias('utc_time'), df.idfa)\
    .withColumn('pst_time', func.from_utc_timestamp('utc_time', 'PST'))\
    .groupBy(func.hour('utc_time').alias('utc_hour'), func.hour('pst_time').alias('pst_hour'))\
    .agg(func.countDistinct('idfa').alias('idfa_vol'))\
    .orderBy('pst_hour')\
    .toPandas()
    

data_van = [go.Scatter(x=van_tod['pst_hour'], y=van_tod['idfa_vol'])]
layout = dict(title = 'Distribution of IDFAs over the day',
              xaxis = dict(title = 'PST hour'),
              yaxis = dict(title = 'Volume of IDFAs'),
              )
fig = dict(data=data_van, layout=layout)
py.iplot(fig, filename='time_of_day_van')

In [38]:
# 'EST' is an abbreviation; a region ID such as 'America/Toronto' would be less ambiguous
tor_tod = df.filter(df.city == 'Toronto')\
    .select(func.from_unixtime(df.event_time).alias('utc_time'), df.idfa)\
    .withColumn('est_time', func.from_utc_timestamp('utc_time', 'EST'))\
    .groupBy(func.hour('utc_time').alias('utc_hour'), func.hour('est_time').alias('est_hour'))\
    .agg(func.countDistinct('idfa').alias('idfa_vol'))\
    .orderBy('est_hour')\
    .toPandas()


data_tor = [go.Scatter( x=tor_tod['est_hour'], y=tor_tod['idfa_vol'])]
py.iplot(data_tor, filename='time_of_day_tor')
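
The UTC-to-local conversions quoted in the discussion of the overall chart can be double-checked with a few lines of plain Python. This sketch uses fixed standard-time offsets, as the text does, and deliberately ignores daylight saving time; the helper name `shift_hour` is introduced here for illustration.

```python
def shift_hour(utc_hour, offset_hours):
    """Shift an hour of day by a fixed UTC offset, wrapping at 24."""
    return (utc_hour + offset_hours) % 24


EST_OFFSET = -5  # fixed standard-time offsets, as assumed in the text;
PST_OFFSET = -8  # daylight saving time is deliberately ignored here

peak_utc = (11, 20)  # start and end of the overall peak, in UTC hours
peak_est = tuple(shift_hour(h, EST_OFFSET) for h in peak_utc)
peak_pst = tuple(shift_hour(h, PST_OFFSET) for h in peak_utc)
print(peak_est, peak_pst)
```

The modulo keeps the result in the 0 to 23 range, so early-morning UTC hours map to the previous evening locally instead of going negative.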

Looking at the two line graphs above, we see a clearer trend for individual cities, with traffic peaking between 5 AM and 3 PM. This makes a little more intuitive sense, since it coincides more closely with a workday. That said, since the data covers just one day, the sample is not large enough to draw conclusions about time-of-day trends for a typical day of the week.

### Locations associated with a single IDFA

Let's see if there are any users who are associated with multiple cities during this sample time period.

In [51]:
loc_df = df.groupBy(df.idfa)\
    .agg(func.countDistinct('city').alias('num_cities'))\
    .orderBy(['num_cities'], ascending=False)
       
loc_df.show()

+--------------------+----------+
|                idfa|num_cities|
+--------------------+----------+
|00000000-0000-000...|      2079|
|74eac213-504f-43d...|        93|
|da0e90fb-19e0-487...|        91|
|f979ed6f-0461-46c...|        83|
|653b5c7a-64d2-4b7...|        76|
|907e2e13-17ad-4c0...|        73|
|843600f6-5878-4b7...|        70|
|1ffedf96-efaf-408...|        69|
|9fcb9870-29c2-4c4...|        63|
|3c22ff88-e193-4bf...|        62|
|1da43983-c834-4fb...|        61|
|1e588609-11f0-48b...|        61|
|362eb796-25b8-483...|        61|
|7bacefce-e646-483...|        60|
|fd33ab05-f744-4b1...|        60|
|a663f67e-35b3-4aa...|        60|
|a19d0dca-af40-4c9...|        60|
|6b150901-7aeb-46c...|        59|
|45606497-5fab-484...|        58|
|3d12507f-cbe2-487...|        58|
+--------------------+----------+
only showing top 20 rows



It seems odd that there are users who are associated with > 40 cities in a single day. This might point to something being off with how the location fields (such as 'geohash') are collected. Let's take a look at the stats for the num_cities column.

In [46]:
loc_df.describe('num_cities').show()

+-------+------------------+
|summary|        num_cities|
+-------+------------------+
|  count|            238211|
|   mean| 2.145371120561183|
| stddev|4.9325567454032475|
|    min|                 1|
|    max|              2079|
+-------+------------------+
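
One way to gauge how much the single most extreme IDFA moves the mean in the summary above is to remove it from the total and recompute. The sketch below does this from the printed summary values alone; the helper name `mean_without` is an assumption for this example.

```python
# Values copied from the describe() output above.
n = 238211
mean_all = 2.145371120561183
max_cities = 2079


def mean_without(total_count, mean, removed_value):
    """Mean of the remaining values after dropping one known value."""
    total = mean * total_count
    return (total - removed_value) / (total_count - 1)


mean_trimmed = mean_without(n, mean_all, max_cities)
print(round(mean_trimmed, 3))
```

The mean only drops to about 2.137, which suggests that the value above 2 is driven by many IDFAs seen in several cities rather than by this one extreme record.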



It is interesting that the mean number of cities associated with an IDFA is > 2. Noisy/test users may contribute to this, e.g. the IDFA starting with '00000000-0000-000...', which has as many as 2079 cities associated with it (if this is the all-zero IDFA, it is the value iOS reports when ad tracking is limited, so it likely aggregates many devices). Let's take a look at the distribution of the number of cities versus the number of IDFAs linked to them.

In [47]:
loc_df.groupBy(loc_df.num_cities)\
    .agg(func.countDistinct('idfa').alias('num_idfa'))\
    .orderBy(['num_idfa'], ascending=False)\
    .show(100)

+----------+--------+
|num_cities|num_idfa|
+----------+--------+
|         1|  136919|
|         2|   45238|
|         3|   22724|
|         4|   12534|
|         5|    7301|
|         6|    4201|
|         7|    2667|
|         8|    1700|
|         9|    1144|
|        10|     856|
|        11|     533|
|        12|     467|
|        13|     296|
|        14|     277|
|        15|     186|
|        16|     185|
|        17|     122|
|        18|     115|
|        20|      89|
|        19|      84|
|        21|      59|
|        23|      46|
|        22|      42|
|        24|      33|
|        26|      31|
|        25|      29|
|        28|      28|
|        27|      28|
|        29|      24|
|        30|      22|
|        31|      20|
|        35|      19|
|        32|      18|
|        37|      16|
|        36|      12|
|        33|      12|
|        34|      12|
|        41|      10|
|        39|      10|
|        42|       9|
|        40|       9|
|        38|       8|
|        4

This seems much more intuitive: about 57% of the IDFAs have a single city associated with them, and about 91% have <= 4 cities. The IDFAs with > 20 cities associated with them are few in number and can be considered noise, but it is still odd that there are IDFAs which show up in as many as 20 cities.
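
The shares behind this reading can be computed from the first rows of the distribution table. This is a minimal plain-Python sketch; only the first four rows are copied in, and the function name `cumulative_share` is introduced here for illustration.

```python
# Total distinct IDFAs and the first four rows of the distribution above.
total_idfas = 238211
idfas_by_num_cities = {1: 136919, 2: 45238, 3: 22724, 4: 12534}


def cumulative_share(counts, total, up_to):
    """Share of IDFAs associated with at most `up_to` cities."""
    covered = sum(c for k, c in counts.items() if k <= up_to)
    return covered / total


share_one = cumulative_share(idfas_by_num_cities, total_idfas, 1)
share_four = cumulative_share(idfas_by_num_cities, total_idfas, 4)
print(round(share_one * 100, 1), round(share_four * 100, 1))
```

Extending the dictionary with more rows from the table would give the full cumulative curve, which makes it easier to pick a cut-off for what counts as noise.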