
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>



# Activity by Traffic Lab
Process streaming data to display total active users by traffic source.

##### Objectives
1. Read data stream
2. Get active users by traffic source
3. Execute query with display() and plot results
4. Execute the same streaming query with DataStreamWriter
5. View results being updated in the query table
6. List and stop all active streams

##### Classes
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.html" target="_blank">DataStreamReader</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamWriter.html" target="_blank">DataStreamWriter</a>
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.StreamingQuery.html" target="_blank">StreamingQuery</a>




### Setup
Run the cells below to generate data and create the **`schema`** string needed for this lab.

In [0]:
%run ../Includes/Classroom-Setup-5.1c

Python interpreter will be restarted.


Resetting the learning environment:
| removing the working directory "dbfs:/mnt/dbacademy-users/anacadriano20@gmail.com/apache-spark-programming-with-databricks"...(0 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03"

Validating the locally installed datasets:
| listing local files...(4 seconds)
| validation completed...(4 seconds total)

Creating & using the schema "anacadriano20_6ryf_da_asp" in the catalog "spark_catalog"...(2 seconds)

Predefined tables in "anacadriano20_6ryf_da_asp":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/anacadriano20@gmail.com/apache-spark-programming-with-databricks
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/anacadriano20@gmail.com/apache-spark-programming-with-databricks/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/apache-spark-programming-with-databricks/v03
| DA.paths.checkpoints: dbfs:/mnt/dbacademy-users/an



### 1. Read data stream
- Set to process 1 file per trigger
- Read from Delta with filepath stored in **`DA.paths.events`**

Assign the resulting streaming DataFrame to **`df`**.
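To make the effect of the one-file-per-trigger setting concrete, here is a plain-Python sketch (it does not call Spark) of how a per-trigger file cap slices a backlog of files into micro-batches. The function name `plan_micro_batches` and the file names are hypothetical, and a real Spark source decides batch contents internally.

```python
from typing import Iterator, List


def plan_micro_batches(files: List[str], max_files_per_trigger: int) -> Iterator[List[str]]:
    """Yield successive micro-batches holding at most `max_files_per_trigger` files.

    Illustration only: this mimics the effect of a per-trigger file cap,
    not Spark's actual scheduling logic.
    """
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    for start in range(0, len(files), max_files_per_trigger):
        yield files[start:start + max_files_per_trigger]


# Hypothetical file names standing in for files in the events directory.
backlog = ["part-0000", "part-0001", "part-0002"]
batches = list(plan_micro_batches(backlog, max_files_per_trigger=1))
for batch_id, batch in enumerate(batches):
    print(f"trigger {batch_id}: {batch}")
```

With a cap of 1, three files take three triggers to process, which is why the aggregated counts later in this lab grow gradually instead of appearing all at once.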




In [0]:
# Read the Delta events directory as a stream, one file per micro-batch
df = (spark.readStream
      .format("delta")
      .option("maxFilesPerTrigger", 1)
      .load(DA.paths.events))




**1.1: CHECK YOUR WORK**

In [0]:
DA.tests.validate_1_1(df)

Points,Test,Result
1,The query is streaming,
1,DataFrame contains all 10 columns,




### 2. Get active users by traffic source
- Set default shuffle partitions to number of cores on your cluster (not required, but runs faster)
- Group by **`traffic_source`**
  - Aggregate the approximate count of distinct users and alias with "active_users"
- Sort by **`traffic_source`**
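As a reference for what this aggregation computes, the sketch below counts distinct users per traffic source exactly, in plain Python. The sample events and the function name `distinct_users_by_source` are made up for illustration. Spark's `approx_count_distinct` returns an estimate of the same quantity, so on large data its result may differ slightly from an exact count.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def distinct_users_by_source(events: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Return {traffic_source: exact number of distinct user_ids}, sorted by source."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for traffic_source, user_id in events:
        users[traffic_source].add(user_id)
    return {source: len(users[source]) for source in sorted(users)}


# Made-up (traffic_source, user_id) pairs; u1 appears twice under "google".
sample_events = [
    ("google", "u1"),
    ("google", "u1"),
    ("google", "u2"),
    ("email", "u3"),
    ("direct", "u1"),
]
active_users = distinct_users_by_source(sample_events)
print(active_users)
```

The repeated `("google", "u1")` event is counted once, which is the difference between a distinct count and a plain row count.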



In [0]:
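# Only approx_count_distinct is needed for this aggregation
from pyspark.sql.functions import approx_count_distinct

# Match shuffle partitions to the cluster's core count (optional, but runs faster)
spark.conf.set("spark.sql.shuffle.partitions", spark.sparkContext.defaultParallelism)

# Approximate number of distinct users per traffic source, sorted by source
traffic_df = (df
              .groupBy("traffic_source")
              .agg(approx_count_distinct("user_id").alias("active_users"))
              .sort("traffic_source")
             )
display(traffic_df)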

traffic_source,active_users
direct,438886
email,281525
facebook,956769
google,1781961
instagram,530050
youtube,253321





**2.1: CHECK YOUR WORK**

In [0]:
DA.tests.validate_2_1(traffic_df.schema)

Points,Test,Result
1,Schema is of type StructType,
1,Schema contains two fields,
1,"Schema contains ""traffic_source"" of type StringType",
1,"Schema contains ""active_users"" of type LongType",




### 3. Execute query with display() and plot results
- Execute results for **`traffic_df`** using display()
- Plot the streaming query results as a bar graph


In [0]:
# TODO
display(traffic_df)

traffic_source,active_users
direct,438886
email,281525
facebook,956769
google,1781961
instagram,530050
youtube,253321


Databricks visualization. Run in Databricks to view.




**3.1: CHECK YOUR WORK**
- Your bar chart should plot **`traffic_source`** on the x-axis and **`active_users`** on the y-axis.
- The top three traffic sources in descending order should be **`google`**, **`facebook`**, and **`instagram`**.
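The expected ordering can be double-checked with a few lines of plain Python. The counts are the final values reported in this lab; the variable names are illustrative.

```python
# Final active-user counts per traffic source, as reported in this lab.
final_counts = {
    "direct": 438886,
    "email": 281525,
    "facebook": 956769,
    "google": 1781961,
    "instagram": 530050,
    "youtube": 253321,
}

# Sort sources by count, largest first, and keep the top three.
top_three = sorted(final_counts, key=final_counts.get, reverse=True)[:3]
print(top_three)
```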





### 4. Execute the same streaming query with DataStreamWriter
- Name the query "active_users_by_traffic"
- Set to "memory" format and "complete" output mode
- Set a trigger interval of 1 second
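The plain-Python sketch below illustrates what "complete" output mode means for an aggregation: after every micro-batch the sink receives the entire updated result table, not only the rows that changed. It is a simulation with made-up batches and a hypothetical function name (`run_complete_mode`), not Spark's implementation, and exact sets stand in for the approximate distinct count.

```python
from typing import Dict, List, Set, Tuple

Batch = List[Tuple[str, str]]  # (traffic_source, user_id) pairs


def run_complete_mode(batches: List[Batch]) -> List[Dict[str, int]]:
    """Return the full result table as it would look after each micro-batch."""
    state: Dict[str, Set[str]] = {}  # running aggregation state across batches
    snapshots: List[Dict[str, int]] = []
    for batch in batches:
        for traffic_source, user_id in batch:
            state.setdefault(traffic_source, set()).add(user_id)
        # Complete mode: emit every group, including groups this batch did not touch.
        snapshots.append({source: len(ids) for source, ids in sorted(state.items())})
    return snapshots


# Made-up micro-batches.
snapshots = run_complete_mode([
    [("google", "u1"), ("email", "u2")],
    [("google", "u3")],
])
for batch_id, table in enumerate(snapshots):
    print(f"after batch {batch_id}: {table}")
```

The second snapshot still contains the `email` row even though the second batch held no `email` events, which is why a table fed in complete mode always shows every traffic source seen so far.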


In [0]:
# Write the aggregation to an in-memory table named after the query
traffic_query = (traffic_df
                 .writeStream
                 .queryName("active_users_by_traffic")
                 .format("memory")
                 .outputMode("complete")
                 .trigger(processingTime="1 second")
                 .start())

# DA.block_until_stream_is_ready("active_users_by_traffic")




**4.1: CHECK YOUR WORK**

In [0]:
DA.tests.validate_4_1(traffic_query)

Points,Test,Result
1,The query is active,
1,"The query name is ""active_users_by_traffic"".",
1,"The format is ""MemorySink"".",




### 5. View results being updated in the query table
Run a query in a SQL cell to display the results from the **`active_users_by_traffic`** table.

In [0]:
%sql
SELECT * FROM active_users_by_traffic
ORDER BY active_users DESC ;

traffic_source,active_users
google,1781961
facebook,956769
instagram,530050
direct,438886
email,281525
youtube,253321





**5.1: CHECK YOUR WORK**
Your query should eventually result in the following values.

|traffic_source|active_users|
|---|---|
|direct|438886|
|email|281525|
|facebook|956769|
|google|1781961|
|instagram|530050|
|youtube|253321|
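Because the table keeps changing while the stream runs, it can help to read it from Python and compare it with the values above. The helper below is a minimal sketch: `read_counts` is a hypothetical name, and `spark` is assumed to be the active SparkSession that a Databricks notebook provides.

```python
from typing import Dict


def read_counts(spark, table_name: str = "active_users_by_traffic") -> Dict[str, int]:
    """Read the in-memory results table into a {traffic_source: active_users} dict.

    `spark` is assumed to be an active SparkSession; `table_name` is the
    query name given to the memory sink.
    """
    rows = spark.sql(
        f"SELECT traffic_source, active_users FROM {table_name}"
    ).collect()
    return {row["traffic_source"]: row["active_users"] for row in rows}
```

In a notebook cell, calling `read_counts(spark)` repeatedly should show the counts rising until they match the table above.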



### 6. List and stop all active streams
- Use the SparkSession to get a list of all active streams
- Iterate over the list and stop each query
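These two steps can be wrapped in a small reusable helper. This is a sketch, not part of the course material: `stop_all_streams` is a hypothetical name, and `spark` is assumed to be an active SparkSession.

```python
from typing import List


def stop_all_streams(spark) -> List[str]:
    """Stop every active streaming query and return a label for each one stopped.

    `spark` is assumed to be an active SparkSession.
    """
    stopped: List[str] = []
    for query in spark.streams.active:
        # `name` is None unless queryName(...) was set, so fall back to the id.
        label = query.name or str(query.id)
        query.stop()
        stopped.append(label)
    return stopped
```

Returning labels instead of printing the query objects makes it easier to see which streams were stopped. Streams started by `display()` are likely to appear in the list alongside the named query.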



In [0]:
# Stop every streaming query that is still active in this SparkSession
consultas_ativas = spark.streams.active

for s in consultas_ativas:
  print(f'Stopping active query {s}')
  s.stop()

Stopping active query <pyspark.sql.streaming.query.StreamingQuery object at 0x7fc415dc35e0>
Stopping active query <pyspark.sql.streaming.query.StreamingQuery object at 0x7fc415dc3970>
Stopping active query <pyspark.sql.streaming.query.StreamingQuery object at 0x7fc415dc35b0>





**6.1: CHECK YOUR WORK**

In [0]:
DA.tests.validate_6_1(traffic_query)

Points,Test,Result
1,The query has been stopped,




### Classroom Cleanup
Run the cell below to clean up resources.

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the schema "anacadriano20_6ryf_da_asp"...(1 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/anacadriano20@gmail.com/apache-spark-programming-with-databricks"...(0 seconds)

Validating the locally installed datasets:
| listing local files...(3 seconds)
| validation completed...(3 seconds total)


&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>