# Escale Challenge

This <i>notebook</i> is a step-by-step study of the provided dataset. The challenge is solved using Spark in Python.

First, the Spark libraries are installed and imported.

In [15]:
try:
    !pip install pyspark=="2.4.5" --quiet
    !pip install pandas=="1.0.4" --quiet
except:
    print("Running through py file.")

In [16]:
from pyspark import SparkContext as sc
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window
from pyspark import SparkFiles
from pyspark.sql.types import StringType
import pyspark
import json
import pandas as pd

Create a Spark session.

In [17]:
spark = SparkSession\
        .builder\
        .appName("Desafio Data Engineer Escale - Fabio Kfouri")\
        .getOrCreate()
spark

To speed up the work, the datasets were downloaded locally. The logic below detects whether this notebook is running on the author's machine: if so, it uses the local dataset; otherwise, it uses the dataset in the cloud.

In [18]:
import os

dataPath = 'https://d3l36jjwr70u5l.cloudfront.net/data-engineer-test/'

if 'E:\\' in os.getcwd() and 'dataEngineerTest_Escale' in os.getcwd():
    dataPath = os.getcwd() + "/data/"

print(dataPath)


E:\Projetos\Jobs\dataEngineerTest_Escale/data/


## Reading the first dataset
The following code is only meant to explore the dataset:
- view a few records
- count the records
- inspect the dataset schema

In [19]:
df0 = spark.read.json(dataPath + 'part-0000' + str(0) +'.json.gz')
df0.show(3)

+--------------------+--------------+----------------+---------------------+--------+---+---------+--------+-------+
|        anonymous_id|browser_family|   device_family|device_sent_timestamp|   event|  n|os_family|platform|version|
+--------------------+--------------+----------------+---------------------+--------+---+---------+--------+-------+
|46074eab-28f7-483...| Chrome Mobile|Samsung SM-A105M|        1592616733778|pageview| 82|  Android|     web|    1.0|
|24c1f50c-c192-4f8...| Chrome Mobile|Samsung SM-A105M|        1592615846912|pageview| 82|  Android|     web|    1.0|
|2c7ed100-3e30-4c1...| Chrome Mobile|Samsung SM-A105M|        1592624340522|pageview| 82|  Android|     web|    1.0|
+--------------------+--------------+----------------+---------------------+--------+---+---------+--------+-------+
only showing top 3 rows



In [20]:
df0.count()

10235120

In [21]:
df0.printSchema()

root
 |-- anonymous_id: string (nullable = true)
 |-- browser_family: string (nullable = true)
 |-- device_family: string (nullable = true)
 |-- device_sent_timestamp: long (nullable = true)
 |-- event: string (nullable = true)
 |-- n: long (nullable = true)
 |-- os_family: string (nullable = true)
 |-- platform: string (nullable = true)
 |-- version: string (nullable = true)



## Challenge 1

Calculate the number of unique sessions.

The following code identifies the users who used the site the most. This survey is important for building the <b>sessionization</b>.

In [22]:
df = df0.groupBy('anonymous_id').count().sort(F.col("count").desc())
df.show(10,False)

+------------------------------------+-----+
|anonymous_id                        |count|
+------------------------------------+-----+
|ae59b395-31ac-40e3-b7c0-3d97d00ac6cc|293  |
|7d0f437b-23e1-407e-8a08-6fb4434bc1f7|3    |
|e0ad55f0-29de-4c01-8da1-d37eb2f3ad4c|3    |
|88222d44-f5da-4710-98ef-89c39b635a9d|2    |
|6f38a9bf-fcb8-4e08-bbcd-7f7b4fc5a258|2    |
|9bf50c34-836a-4f03-b9da-bd90d84c8cd3|2    |
|34d49ea1-7c32-4d6c-a3df-40455e509ec7|2    |
|b70140df-d9c2-49eb-9eb1-6ae9ab77fb48|2    |
|880d31a8-8181-4e33-b65c-99f26a448f3e|2    |
|68f73aec-f9c3-4d39-9cba-0454a4972bdd|2    |
+------------------------------------+-----+
only showing top 10 rows



An <b>anonymous_id</b> with heavy site usage was identified; it will be used to model the sessionization.

In [23]:
df = df0.filter('anonymous_id = "ae59b395-31ac-40e3-b7c0-3d97d00ac6cc"').sort(F.col("device_sent_timestamp"))
df.show()

+--------------------+--------------+-------------+---------------------+--------+---+---------+--------+-------+
|        anonymous_id|browser_family|device_family|device_sent_timestamp|   event|  n|os_family|platform|version|
+--------------------+--------------+-------------+---------------------+--------+---+---------+--------+-------+
|ae59b395-31ac-40e...|     Googlebot|       Spider|        1591321597456|pageview| 95|    Other|     web|    1.0|
|ae59b395-31ac-40e...|     Googlebot|       Spider|        1591414551868|pageview| 69|    Other|     web|    1.0|
|ae59b395-31ac-40e...|     Googlebot|       Spider|        1591673040878|pageview| 73|    Other|     web|    1.0|
|ae59b395-31ac-40e...| AdsBot-Google|       Spider|        1591733995529|pageview| 64|    Other|     web|    1.0|
|ae59b395-31ac-40e...|     Googlebot|       Spider|        1591756611249|pageview| 43|    Other|     web|    1.0|
|ae59b395-31ac-40e...|     Googlebot|       Spider|        1592014689622|pageview| 45|  

To compute the session time, the <b>LAG()</b> function is used; it brings the timestamp of the previous record into the current record.

As an undeclared premise for defining a session, the author assumed that, in addition to the 30-minute timeout since the last activity, a session must also take <b>device_family</b> and <b>os_family</b> into account.

In other words, even if the 30-minute limit has not been exceeded, a change of device_family or os_family is treated as a new session.
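The session rule described above can be sketched in plain Python (not Spark) as a minimal illustration. It assumes events are `(anonymous_id, device_family, os_family, timestamp_ms)` tuples; the helper name `count_session_starts` and the sample data are hypothetical. Within each (user, device, OS) group, an event starts a new session when it has no predecessor or when the gap to its predecessor reaches 30 minutes.

```python
from collections import defaultdict

TIME_LIMIT_MS = 30 * 60 * 1000  # 30 minutes, in milliseconds


def count_session_starts(events):
    """Count sessions per (anonymous_id, device_family, os_family) group.

    `events` is an iterable of (anonymous_id, device_family, os_family, ts_ms).
    """
    groups = defaultdict(list)
    for anonymous_id, device_family, os_family, ts_ms in events:
        groups[(anonymous_id, device_family, os_family)].append(ts_ms)

    sessions = {}
    for key, timestamps in groups.items():
        timestamps.sort()
        count = 0
        previous = None  # plays the role of LAG(): the prior timestamp
        for ts in timestamps:
            if previous is None or ts - previous >= TIME_LIMIT_MS:
                count += 1  # no predecessor, or gap too long: new session
            previous = ts
        sessions[key] = count
    return sessions


# Hypothetical sample: one user, two devices.
sample = [
    ("u1", "Spider", "Other", 0),
    ("u1", "Spider", "Other", 10 * 60 * 1000),  # 10 min later: same session
    ("u1", "Spider", "Other", 50 * 60 * 1000),  # 40 min gap: new session
    ("u1", "iPhone", "iOS", 55 * 60 * 1000),    # different device: new session
]
result = count_session_starts(sample)
print(result)
```

Grouping by the three keys before comparing timestamps mirrors a window partitioned by `anonymous_id`, `device_family`, and `os_family` and ordered by timestamp.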

In [24]:
overCategory = Window.partitionBy('anonymous_id','device_family','os_family').orderBy('device_sent_timestamp')

In [25]:
dftemp = df.withColumn("lag", F.lag('device_sent_timestamp', 1).over(overCategory))
dftemp.show(5)

+--------------------+--------------+-------------+---------------------+--------+---+---------+--------+-------+-------------+
|        anonymous_id|browser_family|device_family|device_sent_timestamp|   event|  n|os_family|platform|version|          lag|
+--------------------+--------------+-------------+---------------------+--------+---+---------+--------+-------+-------------+
|ae59b395-31ac-40e...|     Googlebot|       Spider|        1591321597456|pageview| 95|    Other|     web|    1.0|         null|
|ae59b395-31ac-40e...|     Googlebot|       Spider|        1591414551868|pageview| 69|    Other|     web|    1.0|1591321597456|
|ae59b395-31ac-40e...|     Googlebot|       Spider|        1591673040878|pageview| 73|    Other|     web|    1.0|1591414551868|
|ae59b395-31ac-40e...| AdsBot-Google|       Spider|        1591733995529|pageview| 64|    Other|     web|    1.0|1591673040878|
|ae59b395-31ac-40e...|     Googlebot|       Spider|        1591756611249|pageview| 43|    Other|     web

The timestamp turned out to be in <b>epoch</b> format (milliseconds).

The author's first instinct was to convert it to a date format, but it soon became clear that the numeric value itself could be used to check the 30-minute tolerance.

To that end, the 30 minutes were converted to milliseconds to simplify identifying the session.

In [26]:
time_limit = (30*60)*1000
print(time_limit)

1800000
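As a quick check of this arithmetic, the sketch below uses the first two timestamps of the sample user shown earlier. It is plain Python rather than Spark, and the variable names other than `time_limit` are illustrative assumptions.

```python
from datetime import datetime, timezone

time_limit = (30 * 60) * 1000  # 30 minutes in milliseconds

first_ts = 1591321597456   # epoch milliseconds, first event of the sample user
second_ts = 1591414551868  # epoch milliseconds, second event

delta = second_ts - first_ts        # gap between the events, in milliseconds
same_session = delta < time_limit   # True only if the gap is under 30 minutes

# Conversion to a date is only needed for display; epoch values count from UTC.
first_dt = datetime.fromtimestamp(first_ts / 1000, tz=timezone.utc)

print(delta, same_session, first_dt.isoformat())
```

Because both values are in the same unit, the subtraction can be compared with `time_limit` directly, with no date parsing involved.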


In [28]:
dftemp = dftemp.withColumn('delta', F.col('device_sent_timestamp') - F.col('lag'))\
        .withColumn('same_section', (F.col('device_sent_timestamp') - F.col('lag')) < time_limit) \
        .withColumn('event_time', (F.col('device_sent_timestamp')/1000).cast('timestamp'))

dftemp.select('device_family','device_sent_timestamp','delta','same_section','event_time').show(20, False)

+-------------+---------------------+---------+------------+-----------------------+
|device_family|device_sent_timestamp|delta    |same_section|event_time             |
+-------------+---------------------+---------+------------+-----------------------+
|Spider       |1591321597456        |null     |null        |2020-06-04 22:46:37.456|
|Spider       |1591414551868        |92954412 |false       |2020-06-06 00:35:51.868|
|Spider       |1591673040878        |258489010|false       |2020-06-09 00:24:00.878|
|Spider       |1591733995529        |60954651 |false       |2020-06-09 17:19:55.529|
|Spider       |1591756611249        |22615720 |false       |2020-06-09 23:36:51.249|
|Spider       |1592014689622        |258078373|false       |2020-06-12 23:18:09.622|
|Spider       |1592273900262        |259210640|false       |2020-06-15 23:18:20.262|
|Spider       |1592527400266        |253500004|false       |2020-06-18 21:43:20.266|
|Spider       |1592537432075        |10031809 |false       |2020-

Because processing was slow on the author's machine, a smaller dataset was generated for analysis.

In [30]:
dftemp.coalesce(1).write.format("json").save("analise.json")

In [31]:
dftemp = spark.read.json(os.getcwd() + '/analise.json')

Create a view named clickstream.

In [32]:
dftemp.createOrReplaceTempView("clickstream")

As a first step, the open sessions were inspected as a kind of <i>smoke test</i>.

In [37]:
df_question_1 = spark.sql("""

      SELECT anonymous_id, browser_family, delta, device_family, device_sent_timestamp, event, event_time, n, os_family, --
             platform, nvl(same_section,false) same_section, version
        FROM clickstream t --

""")
df_question_1.select('device_family','device_sent_timestamp','delta','same_section','event_time').show(30, False)

+-------------+---------------------+---------+------------+-----------------------------+
|device_family|device_sent_timestamp|delta    |same_section|event_time                   |
+-------------+---------------------+---------+------------+-----------------------------+
|Spider       |1591321597456        |null     |false       |2020-06-04T22:46:37.456-03:00|
|Spider       |1591414551868        |92954412 |false       |2020-06-06T00:35:51.868-03:00|
|Spider       |1591673040878        |258489010|false       |2020-06-09T00:24:00.878-03:00|
|Spider       |1591733995529        |60954651 |false       |2020-06-09T17:19:55.529-03:00|
|Spider       |1591756611249        |22615720 |false       |2020-06-09T23:36:51.249-03:00|
|Spider       |1592014689622        |258078373|false       |2020-06-12T23:18:09.622-03:00|
|Spider       |1592273900262        |259210640|false       |2020-06-15T23:18:20.262-03:00|
|Spider       |1592527400266        |253500004|false       |2020-06-18T21:43:20.266-03:00|

In the second step, a filter was applied to disregard the open sessions, and a session_id was created.

In [39]:
df_question_1 = spark.sql("""
with temp as (--
      SELECT anonymous_id, browser_family, delta, device_family, device_sent_timestamp, event, event_time, n, os_family, --
             platform, nvl(same_section,false) same_section, version
        FROM clickstream t --
        where nvl(same_section, false) = false -- removes the open sessions
)
select temp.*, 'session_' || ROW_NUMBER() OVER (PARTITION BY anonymous_id ORDER BY device_sent_timestamp )  || '_' ||  anonymous_id as session_id
  from temp
""")
df_question_1.select('device_family','device_sent_timestamp','same_section','event_time','session_id').show(30, False)

+-------------+---------------------+------------+-----------------------------+-----------------------------------------------+
|device_family|device_sent_timestamp|same_section|event_time                   |session_id                                     |
+-------------+---------------------+------------+-----------------------------+-----------------------------------------------+
|Spider       |1591321597456        |false       |2020-06-04T22:46:37.456-03:00|session_1_ae59b395-31ac-40e3-b7c0-3d97d00ac6cc |
|Spider       |1591414551868        |false       |2020-06-06T00:35:51.868-03:00|session_2_ae59b395-31ac-40e3-b7c0-3d97d00ac6cc |
|Spider       |1591673040878        |false       |2020-06-09T00:24:00.878-03:00|session_3_ae59b395-31ac-40e3-b7c0-3d97d00ac6cc |
|Spider       |1591733995529        |false       |2020-06-09T17:19:55.529-03:00|session_4_ae59b395-31ac-40e3-b7c0-3d97d00ac6cc |
|Spider       |1591756611249        |false       |2020-06-09T23:36:51.249-03:00|session_5_ae59b39

In the third step, the number of sessions opened was calculated, in this case for a single user.

In [42]:
df_question_1 = spark.sql("""
with table_temp as (--
    SELECT anonymous_id, browser_family, device_family, device_sent_timestamp, event, event_time, n, os_family, --
           platform, NVL(same_section,false) same_section, version
    FROM clickstream t --
    WHERE NVL(same_section, false) = false -- removes the open sessions
), table_session_id as (--
    SELECT t.*, 'session_' || ROW_NUMBER() OVER (PARTITION BY anonymous_id ORDER BY device_sent_timestamp )  || '_' ||  anonymous_id as session_id
    FROM table_temp t --
)
SELECT anonymous_id, COUNT(session_id)  qtd_session
FROM table_session_id
group by anonymous_id
""")
df_question_1.show()

+--------------------+-----------+
|        anonymous_id|qtd_session|
+--------------------+-----------+
|ae59b395-31ac-40e...|         71|
+--------------------+-----------+



## Challenge 2
Calculate the number of unique sessions that occurred in each browser, operating system, and device across the entire dataset.

The first step was to identify the count per <b>browser_family</b>, <b>os_family</b>, and <b>device_family</b>.
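The per-dimension counts can be illustrated in plain Python (not Spark). This is a sketch under assumptions: each session is represented by a hypothetical dict holding the three attributes of its first event, and `count_by_dimension` is an illustrative helper, not part of the notebook.

```python
from collections import Counter


def count_by_dimension(sessions, dimensions):
    """Return {dimension: Counter(value -> number of sessions)}."""
    return {dim: Counter(s[dim] for s in sessions) for dim in dimensions}


# Hypothetical sessions (one dict per session start).
sessions = [
    {"browser_family": "Googlebot", "os_family": "Other", "device_family": "Spider"},
    {"browser_family": "Googlebot", "os_family": "Android", "device_family": "Spider"},
    {"browser_family": "AdsBot-Google", "os_family": "Other", "device_family": "Spider"},
]
counts = count_by_dimension(
    sessions, ["browser_family", "os_family", "device_family"]
)
print(counts["browser_family"])
```

Each dimension is counted independently, so the totals per dimension all add up to the same number of sessions.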

In [91]:
df_question_2 = spark.sql("""
with table_temp as (--
    SELECT anonymous_id, browser_family,  device_family, device_sent_timestamp, event, event_time, n, os_family, --
           platform, NVL(same_section,false) same_section, version
    FROM clickstream t --
    WHERE NVL(same_section, false) = false -- removes the open sessions
), 
table_session_id as (--
    SELECT t.*, 'session_' || ROW_NUMBER() OVER (PARTITION BY anonymous_id ORDER BY device_sent_timestamp )  || '_' ||  anonymous_id as session_id
    FROM table_temp t --
), 
table_browser_family as (--
SELECT browser_family, COUNT(session_id)  qtd_session
FROM table_session_id
group by browser_family --
)
select * from table_browser_family
""")
df_question_2.show()

+--------------+-----------+
|browser_family|qtd_session|
+--------------+-----------+
| AdsBot-Google|          1|
|     Googlebot|         70|
+--------------+-----------+



In [92]:
df_question_2 = spark.sql("""
with table_temp as (--
    SELECT anonymous_id, browser_family,  device_family, device_sent_timestamp, event, event_time, n, os_family, --
           platform, NVL(same_section,false) same_section, version
    FROM clickstream t --
    WHERE NVL(same_section, false) = false -- removes the open sessions
), 
table_session_id as (--
    SELECT t.*, 'session_' || ROW_NUMBER() OVER (PARTITION BY anonymous_id ORDER BY device_sent_timestamp )  || '_' ||  anonymous_id as session_id
    FROM table_temp t --
)
SELECT os_family, COUNT(session_id)  qtd_session
FROM table_session_id
group by os_family
""")
df_question_2.show()

+---------+-----------+
|os_family|qtd_session|
+---------+-----------+
|    Other|         39|
|  Android|         32|
+---------+-----------+



In [93]:
df_question_2 = spark.sql("""
with table_temp as (--
    SELECT anonymous_id, browser_family,  device_family, device_sent_timestamp, event, event_time, n, os_family, --
           platform, NVL(same_section,false) same_section, version
    FROM clickstream t --
    WHERE NVL(same_section, false) = false -- removes the open sessions
), 
table_session_id as (--
    SELECT t.*, 'session_' || ROW_NUMBER() OVER (PARTITION BY anonymous_id ORDER BY device_sent_timestamp )  || '_' ||  anonymous_id as session_id
    FROM table_temp t --
)
SELECT 'device_family' what, device_family ref, COUNT(session_id)  qtd_session
FROM table_session_id
group by device_family
""")
df_question_2.show()

+-------------+------+-----------+
|         what|   ref|qtd_session|
+-------------+------+-----------+
|device_family|Spider|         71|
+-------------+------+-----------+



The second step was to combine all of these results into a single dataframe.

In [112]:
df_question_2 = spark.sql("""
with table_temp as (--
    SELECT anonymous_id, browser_family,  device_family, device_sent_timestamp, event, event_time, n, os_family, --
           platform, NVL(same_section,false) same_section, version
    FROM clickstream t --
    WHERE NVL(same_section, false) = false -- removes the open sessions
), 
table_session_id as (--
    SELECT t.*, 'session_' || ROW_NUMBER() OVER (PARTITION BY anonymous_id ORDER BY device_sent_timestamp )  || '_' ||  anonymous_id as session_id
    FROM table_temp t --
)
SELECT 'device_family' what, device_family ref, COUNT(session_id)  qtd_session
  FROM table_session_id
 group by device_family
union
SELECT 'os_family' what, os_family ref, COUNT(session_id)  qtd_session
  FROM table_session_id
group by os_family
union
SELECT 'browser_family' what, browser_family ref, COUNT(session_id)  qtd_session
  FROM table_session_id
 group by browser_family
""")
df_question_2.show()

+--------------+-------------+-----------+
|          what|          ref|qtd_session|
+--------------+-------------+-----------+
|browser_family|    Googlebot|         70|
|browser_family|AdsBot-Google|          1|
|     os_family|        Other|         39|
|     os_family|      Android|         32|
| device_family|       Spider|         71|
+--------------+-------------+-----------+



The next step concatenates the 'ref' column with 'qtd_session' in a JSON-like format and groups the items by collection.

In [180]:
d2 = df_question_2.orderBy('what','ref')\
        .withColumn('item', F.concat(F.lit('"'), 'ref',F.lit('":'),'qtd_session'))\
        .groupBy('what').agg(F.array_join(F.collect_list('item'), delimiter=',').alias('collection'))
d2.show(10, False)

+--------------+--------------------------------+
|what          |collection                      |
+--------------+--------------------------------+
|device_family |"Spider":71                     |
|os_family     |"Android":32,"Other":39         |
|browser_family|"AdsBot-Google":1,"Googlebot":70|
+--------------+--------------------------------+



Finally, build the JSON output of the challenge.

In [181]:
question2 = {}
for row in d2.collect():
    items = str(row["collection"]).replace('"','').split(',')
    obj = {}
    for item in items:
        element = str(item).split(':')
        obj[element[0]] = int(element[1])       
    
    question2[row["what"]] = obj

print(question2)

{'device_family': {'Spider': 71}, 'os_family': {'Android': 32, 'Other': 39}, 'browser_family': {'AdsBot-Google': 1, 'Googlebot': 70}}


In [182]:
print(json.dumps(question2))

{"device_family": {"Spider": 71}, "os_family": {"Android": 32, "Other": 39}, "browser_family": {"AdsBot-Google": 1, "Googlebot": 70}}
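The string round-trip above (concatenating `"ref":qtd_session` and splitting it again on `,` and `:`) would break if a name ever contained a comma or a colon. A minimal alternative sketch, assuming the rows of `df_question_2` are available as `(what, ref, qtd_session)` tuples (for example via `collect()`), builds the nested dictionary directly. The `rows` list below is a stand-in copied from the table above, and `build_nested` is a hypothetical helper name.

```python
import json


def build_nested(rows):
    """Group (what, ref, qtd_session) rows into {what: {ref: qtd_session}}."""
    nested = {}
    for what, ref, qtd_session in rows:
        nested.setdefault(what, {})[ref] = int(qtd_session)
    return nested


# Stand-in for df_question_2.collect(); values copied from the table above.
rows = [
    ("browser_family", "Googlebot", 70),
    ("browser_family", "AdsBot-Google", 1),
    ("os_family", "Other", 39),
    ("os_family", "Android", 32),
    ("device_family", "Spider", 71),
]
question2 = build_nested(rows)
print(json.dumps(question2, sort_keys=True))
```

Keeping the counts as numbers from start to finish also avoids the `int()` parsing of text fragments.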


## Challenge 3

Calculate the median duration (in seconds) across all unique sessions for each segment.
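One possible reading of this task, sketched in plain Python (not Spark) and not necessarily the approach taken in this notebook: a session's duration is the time between its first and last event, and the median is taken over the sessions of each segment value. The input layout, the helper name `median_duration_seconds`, and the sample values are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def median_duration_seconds(sessions):
    """Median session duration in seconds per segment value.

    `sessions` is an iterable of (segment_value, start_ms, end_ms).
    """
    durations = defaultdict(list)
    for segment_value, start_ms, end_ms in sessions:
        durations[segment_value].append((end_ms - start_ms) / 1000)
    return {value: median(secs) for value, secs in durations.items()}


# Hypothetical sessions for the browser_family segment.
sample = [
    ("Googlebot", 0, 60_000),      # 60 s
    ("Googlebot", 0, 120_000),     # 120 s
    ("Googlebot", 0, 600_000),     # 600 s
    ("AdsBot-Google", 0, 30_000),  # 30 s
]
medians = median_duration_seconds(sample)
print(medians)
```

The median is less sensitive than the mean to a few very long sessions, such as the 600-second one in the sample.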

In [194]:
df_question_1 = spark.sql("""

      SELECT anonymous_id, browser_family, delta, device_family, device_sent_timestamp, event, event_time, n, os_family, --
             platform, nvl(same_section,false) same_section, version,
             case when same_section = false then
             'session_' || ROW_NUMBER() OVER (PARTITION BY anonymous_id ORDER BY device_sent_timestamp )
             else
             'old'
             end
             as session_id
        FROM clickstream t --
       -- where nvl(same_section, false) = false -- removes the open sessions

""")
df_question_1.select('device_family','device_sent_timestamp','same_section','event_time','session_id').show(30, False)

+-------------+---------------------+------------+-----------------------------+----------+
|device_family|device_sent_timestamp|same_section|event_time                   |session_id|
+-------------+---------------------+------------+-----------------------------+----------+
|Spider       |1591321597456        |false       |2020-06-04T22:46:37.456-03:00|old       |
|Spider       |1591414551868        |false       |2020-06-06T00:35:51.868-03:00|session_2 |
|Spider       |1591673040878        |false       |2020-06-09T00:24:00.878-03:00|session_3 |
|Spider       |1591733995529        |false       |2020-06-09T17:19:55.529-03:00|session_4 |
|Spider       |1591756611249        |false       |2020-06-09T23:36:51.249-03:00|session_5 |
|Spider       |1592014689622        |false       |2020-06-12T23:18:09.622-03:00|session_6 |
|Spider       |1592102553331        |false       |2020-06-13T23:42:33.331-03:00|old       |
|Spider       |1592273900262        |false       |2020-06-15T23:18:20.262-03:00|