<a href="https://colab.research.google.com/github/carsofferrei/04_data_processing/blob/main/spark/challenges/challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CHALLENGE 2
## Implement CLEANSING process
- Set up path in the "lake"
  - !mkdir -p /content/lake/silver

- Read data from BRONZE layer as PARQUET:
    - vehicles - path: /content/lake/bronze/vehicles
    - lines - path: /content/lake/bronze/lines
    - municipalities - path: /content/lake/bronze/municipalities

- Transformations
  - vehicles
    - rename "lat" and "lon" to "latitude" and "longitude" respectively
    - remove possible duplicates
    - remove rows when the column CURRENT_STATUS is null
    - remove any corrupted record
  - lines
    - remove duplicates
    - remove rows when the column LONG_NAME is null
    - remove any corrupted record
  - municipalities
    - remove duplicates
    - remove rows when the columns NAME or DISTRICT_NAME are null
    - remove any corrupted record

- Write data as PARQUET into the SILVER layer (/content/lake/silver)
  - Partition "vehicles" by "date" (created in the ingestion)
  - Paths:
    - vehicles - path: /content/lake/silver/vehicles
    - lines - path: /content/lake/silver/lines
    - municipalities - path: /content/lake/silver/municipalities
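
The requirements above can be written down as data before any Spark code is run. The sketch below is one possible way to do that, not part of the original challenge: `DatasetSpec`, `SPECS`, `LAKE_ROOT`, and the property names are hypothetical helpers introduced for illustration, while the paths, column names, and partition column are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

LAKE_ROOT = "/content/lake"  # root of the bronze and silver paths in the challenge


@dataclass
class DatasetSpec:
    """Cleansing rules for one dataset (hypothetical helper, not a PySpark class)."""
    name: str
    # Rows with a null in any of these columns are removed
    required_columns: Tuple[str, ...]
    # Old column name -> new column name
    renames: Dict[str, str] = field(default_factory=dict)
    # Partition column for the silver write, if any
    partition_by: Optional[str] = None

    @property
    def bronze_path(self) -> str:
        return f"{LAKE_ROOT}/bronze/{self.name}"

    @property
    def silver_path(self) -> str:
        return f"{LAKE_ROOT}/silver/{self.name}"


SPECS = {
    "vehicles": DatasetSpec(
        name="vehicles",
        required_columns=("CURRENT_STATUS",),
        renames={"lat": "latitude", "lon": "longitude"},
        partition_by="date",
    ),
    "lines": DatasetSpec(name="lines", required_columns=("LONG_NAME",)),
    "municipalities": DatasetSpec(
        name="municipalities",
        required_columns=("NAME", "DISTRICT_NAME"),
    ),
}
```

Keeping the rules in one place makes it easy to check them against the challenge text, and to loop over the three datasets instead of repeating near-identical cells.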

# Setting up PySpark

In [1]:
%pip install pyspark



In [2]:
# Import SparkSession

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('Carla_Ferreira_API_challenge').config('spark.ui.port', '4050').getOrCreate()
sc = spark.sparkContext

Read the data from the Lake

In [None]:
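# Assign the DataFrames first: .show() returns None, so it must not be chained into the assignment
vehicles = spark.read.parquet("/content/lake/bronze/vehicles")
lines = spark.read.parquet("/content/lake/bronze/lines")
municipalities = spark.read.parquet("/content/lake/bronze/municipalities")
for df in (vehicles, lines, municipalities):
    df.show()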

**Transformations**

In [None]:
# Vehicles
# rename "lat" and "lon" to "latitude" and "longitude" respectively
vehicles = vehicles.withColumnRenamed("lat", "latitude") \
                   .withColumnRenamed("lon", "longitude")

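# remove possible duplicates (DataFrames are immutable, so the result must be reassigned)
vehicles = vehicles.drop_duplicates()
# remove rows when the column CURRENT_STATUS is null
vehicles = vehicles.filter(vehicles["CURRENT_STATUS"].isNotNull())
# remove any corrupted record
# Assumption: corrupted rows are flagged in a "_corrupt_record" column, if the ingestion kept one
if "_corrupt_record" in vehicles.columns:
    vehicles = vehicles.filter(vehicles["_corrupt_record"].isNull()).drop("_corrupt_record")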
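
The cell above covers only `vehicles`. The same steps for `lines` and `municipalities` could look like the sketch below. `clean_dataset` is a hypothetical helper, not something from the original notebook. It reads "corrupted record" as a row flagged in a `_corrupt_record` column, which is an assumption: that column exists only if the ingestion step kept it.

```python
from typing import Sequence


def clean_dataset(df, required_columns: Sequence[str]):
    """Deduplicate, drop rows with nulls in required columns, drop flagged corrupt rows.

    `df` is expected to be a pyspark.sql.DataFrame. Only DataFrame methods are
    called, so this helper (hypothetical, for illustration) needs no pyspark import.
    """
    # Remove exact duplicate rows
    df = df.dropDuplicates()
    # how="any": drop the row if ANY of the required columns is null
    df = df.dropna(how="any", subset=list(required_columns))
    # Assumption: corrupted rows are those with a non-null "_corrupt_record"
    if "_corrupt_record" in df.columns:
        df = df.filter("_corrupt_record IS NULL").drop("_corrupt_record")
    return df
```

With the DataFrames read earlier, this would be called as `lines = clean_dataset(lines, ["LONG_NAME"])` and `municipalities = clean_dataset(municipalities, ["NAME", "DISTRICT_NAME"])`.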

**Create the silver folder** (the "Set up path" step of the process)

In [None]:
!mkdir -p /content/lake/silver
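
With the folder in place, the remaining step from the challenge description is to write each cleansed DataFrame as Parquet into the silver layer. The following is a minimal sketch of that step. `write_silver` and `SILVER_ROOT` are hypothetical names, and the `overwrite` save mode is an assumption, since the challenge does not say how reruns should behave.

```python
from typing import Optional

SILVER_ROOT = "/content/lake/silver"


def write_silver(df, name: str, partition_by: Optional[str] = None) -> str:
    """Write `df` (expected to be a pyspark.sql.DataFrame) as Parquet under the silver layer.

    Hypothetical helper for illustration; returns the path that was written.
    """
    path = f"{SILVER_ROOT}/{name}"
    # Assumption: a rerun should replace the output of an earlier run
    writer = df.write.mode("overwrite")
    if partition_by is not None:
        # Creates one sub-folder per distinct value, e.g. .../vehicles/date=<value>/
        writer = writer.partitionBy(partition_by)
    writer.parquet(path)
    return path
```

For this challenge the calls would be `write_silver(vehicles, "vehicles", partition_by="date")`, `write_silver(lines, "lines")`, and `write_silver(municipalities, "municipalities")`.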