# Configure Apache Spark

## Specifying a database in which to work

In [3]:
username = "afonsolenzi"
dbutils.widgets.text("username", username)
spark.sql(f"CREATE DATABASE IF NOT EXISTS dbacademy_{username}")
spark.sql(f"USE dbacademy_{username}")
health_tracker = f"/dbacademy/{username}/DLRS/healthtracker/"
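Because the username is interpolated directly into a SQL statement and a storage path, it can help to check that it is a plain identifier first. The following is a minimal sketch under that assumption; `safe_username` is a hypothetical helper and not part of the original notebook, and the Spark calls are left out so the snippet runs anywhere.

```python
import re


def safe_username(username: str) -> str:
    """Return the username unchanged if it is safe to embed in a database name.

    Hypothetical helper (an assumption, not from the notebook): it accepts
    only letters, digits, and underscores.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", username):
        raise ValueError(f"unsafe username for a database name: {username!r}")
    return username


username = safe_username("afonsolenzi")

# The same names the cell above builds with f-strings.
database_name = f"dbacademy_{username}"
health_tracker = f"/dbacademy/{username}/DLRS/healthtracker/"

print(database_name)
print(health_tracker)
```

In the notebook, `database_name` would then be passed to the same `CREATE DATABASE IF NOT EXISTS` and `USE` statements shown above.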

## Configuring the number of shuffle partitions to use

In [5]:
spark.conf.set("spark.sql.shuffle.partitions", 8)

## Step 1: Download the data to the driver

In [7]:
%sh

wget https://hadoop-and-big-data.s3-us-west-2.amazonaws.com/fitness-tracker/health_tracker_data_2020_1.json
wget https://hadoop-and-big-data.s3-us-west-2.amazonaws.com/fitness-tracker/health_tracker_data_2020_2.json
wget https://hadoop-and-big-data.s3-us-west-2.amazonaws.com/fitness-tracker/health_tracker_data_2020_2_late.json
wget https://hadoop-and-big-data.s3-us-west-2.amazonaws.com/fitness-tracker/health_tracker_data_2020_3.json

## Step 2: Verify the downloads

In [9]:
%sh ls
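Reading the `ls` output by eye works for four files, but the check can also be done programmatically. The sketch below is one possible approach, not the notebook's method; `EXPECTED_FILES` and `missing_files` are hypothetical names introduced for illustration.

```python
import os

# File names taken from the wget commands in Step 1.
EXPECTED_FILES = [
    "health_tracker_data_2020_1.json",
    "health_tracker_data_2020_2.json",
    "health_tracker_data_2020_2_late.json",
    "health_tracker_data_2020_3.json",
]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from `directory`."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(directory, name))
    ]


# Check the current working directory; an empty list means every file is present.
print(missing_files(os.getcwd()))
```

On Databricks the directory to check would be the driver's working directory, which the next step addresses as `file:/databricks/driver/`.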

## Step 3: Move the data to the raw directory

In [11]:
dbutils.fs.mv("file:/databricks/driver/health_tracker_data_2020_1.json", 
              health_tracker + "raw/health_tracker_data_2020_1.json")
dbutils.fs.mv("file:/databricks/driver/health_tracker_data_2020_2.json", 
              health_tracker + "raw/health_tracker_data_2020_2.json")
dbutils.fs.mv("file:/databricks/driver/health_tracker_data_2020_2_late.json", 
              health_tracker + "raw/health_tracker_data_2020_2_late.json")
dbutils.fs.mv("file:/databricks/driver/health_tracker_data_2020_3.json", 
              health_tracker + "raw/health_tracker_data_2020_3.json")
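The four `dbutils.fs.mv` calls differ only in the file name, so they could be generated from a list. This is a sketch of that idea rather than the notebook's own code; `move_plan`, `DRIVER_DIR`, and `FILE_NAMES` are hypothetical names, and the `dbutils` call is left as a comment because `dbutils` exists only inside a Databricks notebook.

```python
DRIVER_DIR = "file:/databricks/driver/"

FILE_NAMES = [
    "health_tracker_data_2020_1.json",
    "health_tracker_data_2020_2.json",
    "health_tracker_data_2020_2_late.json",
    "health_tracker_data_2020_3.json",
]


def move_plan(health_tracker, names=FILE_NAMES):
    """Build (source, destination) pairs equivalent to the four mv calls."""
    return [
        (DRIVER_DIR + name, health_tracker + "raw/" + name)
        for name in names
    ]


health_tracker = "/dbacademy/afonsolenzi/DLRS/healthtracker/"

for source, destination in move_plan(health_tracker):
    print(source, "->", destination)
    # In a Databricks notebook, where dbutils is predefined:
    # dbutils.fs.mv(source, destination)
```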

## Step 4: Load the data

In [13]:
file_path = health_tracker + "raw/health_tracker_data_2020_1.json"
 
health_tracker_data_2020_1_df = (
  spark.read
  .format("json")
  .load(file_path)
)
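The read above relies on schema inference. As an optional variation, the schema can be supplied explicitly as a DDL string, which `DataFrameReader.schema` accepts. The column types below are assumptions inferred from the sample rows displayed in the next section, and `load_health_tracker` is a hypothetical wrapper, not part of the original notebook.

```python
# Assumed column types, inferred from the displayed sample rows.
HEALTH_TRACKER_SCHEMA = (
    "device_id BIGINT, heartrate DOUBLE, name STRING, time DOUBLE"
)


def load_health_tracker(spark, file_path):
    """Read one raw JSON file with an explicit schema instead of inference.

    `spark` is the SparkSession that a Databricks notebook provides.
    """
    return (
        spark.read
        .format("json")
        .schema(HEALTH_TRACKER_SCHEMA)
        .load(file_path)
    )


# Usage inside the notebook, with file_path as defined in the cell above:
# health_tracker_data_2020_1_df = load_health_tracker(spark, file_path)
```

An explicit schema avoids the extra pass over the file that inference needs, at the cost of having to keep the type list in sync with the data.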

# Visualize data

## Step 1: Display the data

In [16]:
display(health_tracker_data_2020_1_df)

| device_id | heartrate | name | time |
|---|---|---|---|
| 0 | 52.8139067501 | Deborah Powell | 1577836800.0 |
| 0 | 53.9078900098 | Deborah Powell | 1577840400.0 |
| 0 | 52.7129593616 | Deborah Powell | 1577844000.0 |
| 0 | 52.2880422685 | Deborah Powell | 1577847600.0 |
| 0 | 52.5156095386 | Deborah Powell | 1577851200.0 |
| 0 | 53.6280743846 | Deborah Powell | 1577854800.0 |
| 0 | 52.1760037066 | Deborah Powell | 1577858400.0 |
| 0 | 90.0456721836 | Deborah Powell | 1577862000.0 |
| 0 | 89.4695644522 | Deborah Powell | 1577865600.0 |
| 0 | 88.1490304138 | Deborah Powell | 1577869200.0 |
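The `time` values look like Unix epoch timestamps in seconds. Assuming that interpretation (the notebook does not state it), the sketch below converts the first few displayed values to UTC datetimes, which shows that the readings are one hour apart. `to_utc` is a hypothetical helper.

```python
from datetime import datetime, timezone

# First three `time` values from the displayed rows.
sample_times = [1577836800.0, 1577840400.0, 1577844000.0]


def to_utc(epoch_seconds):
    """Interpret a value as Unix epoch seconds and return a UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


for t in sample_times:
    print(to_utc(t).isoformat())
```

Under this assumption the first reading falls at midnight UTC on 1 January 2020.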


## Step 2: Configure the visualization

The sensor data is plotted over time with a Databricks visualization, configured with the following plot options:

Keys: time

Series groupings: device_id

Values: heartrate

Aggregation: SUM

Display Type: Bar Chart
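To make the plot options concrete, the sketch below approximates what they compute: rows are grouped by the key (`time`) and the series (`device_id`), and `heartrate` is summed within each group. This is an illustration in plain Python, not how Databricks implements its charts; `sum_by_key_and_series` is a hypothetical name.

```python
from collections import defaultdict

# First three displayed rows as (device_id, heartrate, time).
rows = [
    (0, 52.8139067501, 1577836800.0),
    (0, 53.9078900098, 1577840400.0),
    (0, 52.7129593616, 1577844000.0),
]


def sum_by_key_and_series(rows):
    """Sum heartrate for each (time, device_id) pair."""
    totals = defaultdict(float)
    for device_id, heartrate, time in rows:
        totals[(time, device_id)] += heartrate
    return dict(totals)


totals = sum_by_key_and_series(rows)
for (time, device_id), total in sorted(totals.items()):
    print(time, device_id, total)
```

In these sample rows each device has a single reading per timestamp, so the SUM aggregation returns the reading itself and each bar shows one raw heart rate value.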

>### *Afonso Orgino Lenzi*
>>>#### <code>BI and Analytics Consultant with SAP Analytics Cloud</code>
>>>#### <code>Specialist in Data Science with Big Data - PUC Minas</code>


#### Contact:
#### E-mail: <code>afonso.lenzi@gmail.com</code>
#### Mobile/WhatsApp: <code>(47) 9 9605-9672</code>
#### LinkedIn: <https://www.linkedin.com/in/afonsolenzi>

*Sources and copyright credits:* [Databricks Documentation] | [Machine Learning] | [MLflow] | [Libraries]

[Databricks Documentation]: https://docs.databricks.com/
[Machine Learning]: https://docs.databricks.com/applications/machine-learning/index.html
[MLflow]: https://docs.databricks.com/applications/mlflow/index.html
[Libraries]: https://docs.databricks.com/libraries.html