### **Case Bradesco - Processo seletivo Gabriela Malaspina**

#### Etapa 01

"Carregue as bases de dados como DataFrames do PySpark. Especifique os formatos dos dados manualmente, não use a opção inferSchema = True."


##### **1. Configuração do ambiente**

In [51]:
# Importação das bibliotecas e recursos

!pip install pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType, TimestampType
from pyspark.sql.functions import to_timestamp

from datetime import datetime
import os

##### **2. Configuração de diretórios, download e extração**

In [8]:
# Criação da SparkSession

spark = SparkSession.builder.appName('activity_recognition_exp').getOrCreate()

In [3]:
# Criação de subpastas dentro da pasta "content" para simular as camadas "raw" e "bronze"

raw = '/content/raw'
os.mkdir(raw)

bronze = '/content/bronze'
os.mkdir(bronze)

In [None]:
# Download e extração do arquivo .zip na camada "raw" (dados brutos)

# Link para download
url='https://archive.ics.uci.edu/ml/machine-learning-databases/00344/Activity%20recognition%20exp.zip'

# Mudança para o diretório de destino (camada "raw")
#os.chdir(raw)

# Download do .zip com o nome "activity_rec_exp.zip" na camada "raw"
!wget -O activity_recognition_exp.zip "$url" -d "$raw"

# Extração do conteúdo do .zip na camada "raw"
!unzip -j -d activity_recognition_exp /content/raw/activity_recognition_exp.zip

##### **3. Leitura dos arquivos .csv e configuração dos dataframes**

Os 4 arquivos possuem a mesma estrutura de atributos:
- Index > identificador da amostra (id): INT
- Arrival_Time e Creation_Time > momento de chegada e criação das amostras: DATETIME / TIMESTAMP
- x, y e z > representam leituras de acelerômetro/giroscópio: FLOAT
- User > usuário: STRING
- Model e Device > modelo e nome do aparelho: STRING
- gt > "groundtruth" - referência de posição: STRING

Dessa forma, pode-se utilizar a mesma tipagem para o schema em todos os datasets, assim como a formatação para as colunas de datas.


In [59]:
# Definição do Schema a ser utilizado em todos os datasets

schema = StructType([
    StructField('Index', IntegerType(), True),
    StructField('Arrival_Time', StringType(), True), # Será feita transformação para formato de data na sequência
    StructField('Creation_Time', StringType(), True), # Será feita transformação para formato de data na sequência
    StructField('x', FloatType(), True),
    StructField('y', FloatType(), True),
    StructField('z', FloatType(), True),
    StructField('User', StringType(), True),
    StructField('Model', StringType(), True),
    StructField('gt', StringType(), True)
])

###### Arquivo 01- Acelerômetro de celulares

In [95]:
caminho_01 = '/content/raw/activity_recognition_exp/Phones_accelerometer.csv'

# Dataframe inicial
phones_acc = spark.read.csv(caminho_01, header=True).show(5)

+-----+-------------+-------------------+------------------+------------------+--------+----+------+--------+-----+
|Index| Arrival_Time|      Creation_Time|                 x|                 y|       z|User| Model|  Device|   gt|
+-----+-------------+-------------------+------------------+------------------+--------+----+------+--------+-----+
|    0|1424696633908|1424696631913248572|         -5.958191|         0.6880646|8.135345|   a|nexus4|nexus4_1|stand|
|    1|1424696633909|1424696631918283972|          -5.95224|         0.6702118|8.136536|   a|nexus4|nexus4_1|stand|
|    2|1424696633918|1424696631923288855|        -5.9950867|0.6535491999999999|8.204376|   a|nexus4|nexus4_1|stand|
|    3|1424696633919|1424696631928385290|        -5.9427185|0.6761626999999999|8.128204|   a|nexus4|nexus4_1|stand|
|    4|1424696633929|1424696631933420691|-5.991516000000001|        0.64164734|8.135345|   a|nexus4|nexus4_1|stand|
+-----+-------------+-------------------+------------------+------------

In [112]:
# Aplicação do schema e transformações das datas

phones_acc = spark.read.csv(caminho_01, schema=schema, header=True)

# Conversão de Arrival_Time (em milissegundos)
phones_acc = phones_acc.withColumn("Arrival_Time", (phones_acc["Arrival_Time"] / 1000).cast("timestamp"))

# Conversão de Creation_Time (em nanossegundos)
phones_acc = phones_acc.withColumn("Creation_Time", (phones_acc["Creation_Time"] / 1e9).cast("timestamp"))

In [94]:
# Dataframe formatado
phones_acc.show(5)

+-----+--------------------+--------------------+----------+----------+--------+----+------+--------+
|Index|        Arrival_Time|       Creation_Time|         x|         y|       z|User| Model|      gt|
+-----+--------------------+--------------------+----------+----------+--------+----+------+--------+
|    0|2015-02-23 13:03:...|2015-02-23 13:03:...| -5.958191| 0.6880646|8.135345|   a|nexus4|nexus4_1|
|    1|2015-02-23 13:03:...|2015-02-23 13:03:...|  -5.95224| 0.6702118|8.136536|   a|nexus4|nexus4_1|
|    2|2015-02-23 13:03:...|2015-02-23 13:03:...|-5.9950867| 0.6535492|8.204376|   a|nexus4|nexus4_1|
|    3|2015-02-23 13:03:...|2015-02-23 13:03:...|-5.9427185| 0.6761627|8.128204|   a|nexus4|nexus4_1|
|    4|2015-02-23 13:03:...|2015-02-23 13:03:...| -5.991516|0.64164734|8.135345|   a|nexus4|nexus4_1|
+-----+--------------------+--------------------+----------+----------+--------+----+------+--------+
only showing top 5 rows



In [91]:
# Verificação do Schema estruturado
phones_acc.printSchema()

root
 |-- Index: integer (nullable = true)
 |-- Arrival_Time: timestamp (nullable = true)
 |-- Creation_Time: timestamp (nullable = true)
 |-- x: float (nullable = true)
 |-- y: float (nullable = true)
 |-- z: float (nullable = true)
 |-- User: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- gt: string (nullable = true)



###### Arquivo 02- Giroscópio de celulares

In [96]:
caminho_02 = '/content/raw/activity_recognition_exp/Phones_gyroscope.csv'

# Dataframe inicial
phones_gyr = spark.read.csv(caminho_02, header=True).show(5)

+-----+-------------+-------------------+--------------------+--------------------+------------+----+------+--------+-----+
|Index| Arrival_Time|      Creation_Time|                   x|                   y|           z|User| Model|  Device|   gt|
+-----+-------------+-------------------+--------------------+--------------------+------------+----+------+--------+-----+
|    0|1424696633909|1424696631914042029|         0.013748169|-0.00062561035000...|-0.023376465|   a|nexus4|nexus4_1|stand|
|    1|1424696633909|1424696631919046912|0.014816283999999999|       -0.0016937256| -0.02230835|   a|nexus4|nexus4_1|stand|
|    2|1424696633918|1424696631924051794|           0.0158844|       -0.0016937256|-0.021240234|   a|nexus4|nexus4_1|stand|
|    3|1424696633919|1424696631929117712|         0.016952515|        -0.003829956| -0.02017212|   a|nexus4|nexus4_1|stand|
|    4|1424696633928|1424696631934214148|           0.0158844|-0.00703430180000...| -0.02017212|   a|nexus4|nexus4_1|stand|
+-----+-

In [99]:
# Aplicação do schema e transformações previamente definidos e validados na etapa anterior

phones_gyr = spark.read.csv(caminho_02, schema=schema, header=True)

# Conversão de Arrival_Time (em milissegundos)
phones_gyr = phones_gyr.withColumn("Arrival_Time", (phones_gyr["Arrival_Time"] / 1000).cast("timestamp"))

# Conversão de Creation_Time (em nanossegundos)
phones_gyr = phones_gyr.withColumn("Creation_Time", (phones_gyr["Creation_Time"] / 1e9).cast("timestamp"))

In [100]:
# Dataframe formatado
phones_gyr.show(5)

+-----+--------------------+--------------------+-----------+-------------+------------+----+------+--------+
|Index|        Arrival_Time|       Creation_Time|          x|            y|           z|User| Model|      gt|
+-----+--------------------+--------------------+-----------+-------------+------------+----+------+--------+
|    0|2015-02-23 13:03:...|2015-02-23 13:03:...|0.013748169|-6.2561035E-4|-0.023376465|   a|nexus4|nexus4_1|
|    1|2015-02-23 13:03:...|2015-02-23 13:03:...|0.014816284|-0.0016937256| -0.02230835|   a|nexus4|nexus4_1|
|    2|2015-02-23 13:03:...|2015-02-23 13:03:...|  0.0158844|-0.0016937256|-0.021240234|   a|nexus4|nexus4_1|
|    3|2015-02-23 13:03:...|2015-02-23 13:03:...|0.016952515| -0.003829956| -0.02017212|   a|nexus4|nexus4_1|
|    4|2015-02-23 13:03:...|2015-02-23 13:03:...|  0.0158844|-0.0070343018| -0.02017212|   a|nexus4|nexus4_1|
+-----+--------------------+--------------------+-----------+-------------+------------+----+------+--------+
only showi

In [102]:
# Verificação do Schema estruturado
phones_gyr.printSchema()

root
 |-- Index: integer (nullable = true)
 |-- Arrival_Time: timestamp (nullable = true)
 |-- Creation_Time: timestamp (nullable = true)
 |-- x: float (nullable = true)
 |-- y: float (nullable = true)
 |-- z: float (nullable = true)
 |-- User: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- gt: string (nullable = true)



###### Arquivo 03- Acelerômetro de relógios

In [103]:
caminho_03 = '/content/raw/activity_recognition_exp/Watch_accelerometer.csv'

# Dataframe inicial
watch_acc = spark.read.csv(caminho_03, header=True).show(5)

+-----+-------------+--------------+-----------+----------+-----------+----+-----+------+-----+
|Index| Arrival_Time| Creation_Time|          x|         y|          z|User|Model|Device|   gt|
+-----+-------------+--------------+-----------+----------+-----------+----+-----+------+-----+
|    0|1424696638740|27920678471000| -0.5650316| -9.572019|-0.61411273|   a| gear|gear_1|stand|
|    1|1424696638740|27920681910000|-0.83258367| -9.713276|-0.60693014|   a| gear|gear_1|stand|
|    2|1424696638740|27920692014000| -1.0181342| -9.935339|-0.54408234|   a| gear|gear_1|stand|
|    3|1424696638741|27920701983000| -1.2228385|-10.142437| -0.5662287|   a| gear|gear_1|stand|
|    4|1424696638741|27920711906000| -1.5771804|-10.480618|-0.40282443|   a| gear|gear_1|stand|
+-----+-------------+--------------+-----------+----------+-----------+----+-----+------+-----+
only showing top 5 rows



In [104]:
# Aplicação do schema e transformações previamente definidos e validados nas etapas anteriores

watch_acc = spark.read.csv(caminho_03, schema=schema, header=True)

# Conversão de Arrival_Time (em milissegundos)
watch_acc = watch_acc.withColumn("Arrival_Time", (watch_acc["Arrival_Time"] / 1000).cast("timestamp"))

# Conversão de Creation_Time (em nanossegundos)
watch_acc = watch_acc.withColumn("Creation_Time", (watch_acc["Creation_Time"] / 1e9).cast("timestamp"))

In [105]:
# Dataframe formatado
watch_acc.show(5)

+-----+--------------------+--------------------+-----------+----------+-----------+----+-----+------+
|Index|        Arrival_Time|       Creation_Time|          x|         y|          z|User|Model|    gt|
+-----+--------------------+--------------------+-----------+----------+-----------+----+-----+------+
|    0|2015-02-23 13:03:...|1970-01-01 07:45:...| -0.5650316| -9.572019|-0.61411273|   a| gear|gear_1|
|    1|2015-02-23 13:03:...|1970-01-01 07:45:...|-0.83258367| -9.713276|-0.60693014|   a| gear|gear_1|
|    2|2015-02-23 13:03:...|1970-01-01 07:45:...| -1.0181342| -9.935339|-0.54408234|   a| gear|gear_1|
|    3|2015-02-23 13:03:...|1970-01-01 07:45:...| -1.2228385|-10.142437| -0.5662287|   a| gear|gear_1|
|    4|2015-02-23 13:03:...|1970-01-01 07:45:...| -1.5771804|-10.480618|-0.40282443|   a| gear|gear_1|
+-----+--------------------+--------------------+-----------+----------+-----------+----+-----+------+
only showing top 5 rows



In [106]:
# Verificação do Schema estruturado
watch_acc.printSchema()

root
 |-- Index: integer (nullable = true)
 |-- Arrival_Time: timestamp (nullable = true)
 |-- Creation_Time: timestamp (nullable = true)
 |-- x: float (nullable = true)
 |-- y: float (nullable = true)
 |-- z: float (nullable = true)
 |-- User: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- gt: string (nullable = true)



###### Arquivo 04- Giroscópio de relógios

In [107]:
caminho_04 = '/content/raw/activity_recognition_exp/Watch_gyroscope.csv'

# Dataframe inicial
watch_gyr = spark.read.csv(caminho_04, header=True).show(5)

+-----+-------------+--------------+-----------+------------+------------+----+-----+------+-----+
|Index| Arrival_Time| Creation_Time|          x|           y|           z|User|Model|Device|   gt|
+-----+-------------+--------------+-----------+------------+------------+----+-----+------+-----+
|    0|1424696638743|27920678496000|-0.16218652|-0.022104237|  0.05965481|   a| gear|gear_1|stand|
|    1|1424696638743|27920681926000|-0.18322548| -0.06178534| 0.012516857|   a| gear|gear_1|stand|
|    2|1424696638743|27920692031000|-0.18082865| -0.10865697|-0.036485307|   a| gear|gear_1|stand|
|    3|1424696638743|27920701997000|-0.14780544| -0.15792546| -0.09853696|   a| gear|gear_1|stand|
|    7|1424696638744|27920743068000| 0.18216023| -0.32357407| -0.27723506|   a| gear|gear_1|stand|
+-----+-------------+--------------+-----------+------------+------------+----+-----+------+-----+
only showing top 5 rows



In [108]:
# Aplicação do schema e transformações previamente definidos e validados nas etapas anteriores

watch_gyr = spark.read.csv(caminho_02, schema=schema, header=True)

# Conversão de Arrival_Time (em milissegundos)
watch_gyr = watch_gyr.withColumn("Arrival_Time", (watch_gyr["Arrival_Time"] / 1000).cast("timestamp"))

# Conversão de Creation_Time (em nanossegundos)
watch_gyr = watch_gyr.withColumn("Creation_Time", (watch_gyr["Creation_Time"] / 1e9).cast("timestamp"))

In [109]:
# Dataframe formatado
watch_gyr.show(5)

+-----+--------------------+--------------------+-----------+-------------+------------+----+------+--------+
|Index|        Arrival_Time|       Creation_Time|          x|            y|           z|User| Model|      gt|
+-----+--------------------+--------------------+-----------+-------------+------------+----+------+--------+
|    0|2015-02-23 13:03:...|2015-02-23 13:03:...|0.013748169|-6.2561035E-4|-0.023376465|   a|nexus4|nexus4_1|
|    1|2015-02-23 13:03:...|2015-02-23 13:03:...|0.014816284|-0.0016937256| -0.02230835|   a|nexus4|nexus4_1|
|    2|2015-02-23 13:03:...|2015-02-23 13:03:...|  0.0158844|-0.0016937256|-0.021240234|   a|nexus4|nexus4_1|
|    3|2015-02-23 13:03:...|2015-02-23 13:03:...|0.016952515| -0.003829956| -0.02017212|   a|nexus4|nexus4_1|
|    4|2015-02-23 13:03:...|2015-02-23 13:03:...|  0.0158844|-0.0070343018| -0.02017212|   a|nexus4|nexus4_1|
+-----+--------------------+--------------------+-----------+-------------+------------+----+------+--------+
only showi

In [110]:
# Verificação do Schema estruturado
watch_gyr.printSchema()

root
 |-- Index: integer (nullable = true)
 |-- Arrival_Time: timestamp (nullable = true)
 |-- Creation_Time: timestamp (nullable = true)
 |-- x: float (nullable = true)
 |-- y: float (nullable = true)
 |-- z: float (nullable = true)
 |-- User: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- gt: string (nullable = true)



##### **Resultado final etapa 01**

Dataframes PySpark com schemas formatados manualmente, com base na documentação.

In [113]:
# Acelerômetro de celulares
phones_acc.show(5)

+-----+--------------------+--------------------+----------+----------+--------+----+------+--------+
|Index|        Arrival_Time|       Creation_Time|         x|         y|       z|User| Model|      gt|
+-----+--------------------+--------------------+----------+----------+--------+----+------+--------+
|    0|2015-02-23 13:03:...|2015-02-23 13:03:...| -5.958191| 0.6880646|8.135345|   a|nexus4|nexus4_1|
|    1|2015-02-23 13:03:...|2015-02-23 13:03:...|  -5.95224| 0.6702118|8.136536|   a|nexus4|nexus4_1|
|    2|2015-02-23 13:03:...|2015-02-23 13:03:...|-5.9950867| 0.6535492|8.204376|   a|nexus4|nexus4_1|
|    3|2015-02-23 13:03:...|2015-02-23 13:03:...|-5.9427185| 0.6761627|8.128204|   a|nexus4|nexus4_1|
|    4|2015-02-23 13:03:...|2015-02-23 13:03:...| -5.991516|0.64164734|8.135345|   a|nexus4|nexus4_1|
+-----+--------------------+--------------------+----------+----------+--------+----+------+--------+
only showing top 5 rows



In [114]:
# Giroscópio de celulares
phones_gyr.show(5)

+-----+--------------------+--------------------+-----------+-------------+------------+----+------+--------+
|Index|        Arrival_Time|       Creation_Time|          x|            y|           z|User| Model|      gt|
+-----+--------------------+--------------------+-----------+-------------+------------+----+------+--------+
|    0|2015-02-23 13:03:...|2015-02-23 13:03:...|0.013748169|-6.2561035E-4|-0.023376465|   a|nexus4|nexus4_1|
|    1|2015-02-23 13:03:...|2015-02-23 13:03:...|0.014816284|-0.0016937256| -0.02230835|   a|nexus4|nexus4_1|
|    2|2015-02-23 13:03:...|2015-02-23 13:03:...|  0.0158844|-0.0016937256|-0.021240234|   a|nexus4|nexus4_1|
|    3|2015-02-23 13:03:...|2015-02-23 13:03:...|0.016952515| -0.003829956| -0.02017212|   a|nexus4|nexus4_1|
|    4|2015-02-23 13:03:...|2015-02-23 13:03:...|  0.0158844|-0.0070343018| -0.02017212|   a|nexus4|nexus4_1|
+-----+--------------------+--------------------+-----------+-------------+------------+----+------+--------+
only showi

In [115]:
# Acelerômetro de relógios
watch_acc.show(5)

+-----+--------------------+--------------------+-----------+----------+-----------+----+-----+------+
|Index|        Arrival_Time|       Creation_Time|          x|         y|          z|User|Model|    gt|
+-----+--------------------+--------------------+-----------+----------+-----------+----+-----+------+
|    0|2015-02-23 13:03:...|1970-01-01 07:45:...| -0.5650316| -9.572019|-0.61411273|   a| gear|gear_1|
|    1|2015-02-23 13:03:...|1970-01-01 07:45:...|-0.83258367| -9.713276|-0.60693014|   a| gear|gear_1|
|    2|2015-02-23 13:03:...|1970-01-01 07:45:...| -1.0181342| -9.935339|-0.54408234|   a| gear|gear_1|
|    3|2015-02-23 13:03:...|1970-01-01 07:45:...| -1.2228385|-10.142437| -0.5662287|   a| gear|gear_1|
|    4|2015-02-23 13:03:...|1970-01-01 07:45:...| -1.5771804|-10.480618|-0.40282443|   a| gear|gear_1|
+-----+--------------------+--------------------+-----------+----------+-----------+----+-----+------+
only showing top 5 rows



In [None]:
# Giroscópio de relógios
watch_gyr.show(5)

#### Etapa 02
"Faça uma análise inicial dos dados: quais problemas você encontrou? Como você trataria
tais problemas?"

#### Etapa 03

"Crie uma tabela com informações sumarizadas de cada usuário, como: quantos registros
tal usuário possui, quais modelos de aparelho cada usuário operou, etc."