# Bronze Layer ETL
This script loads the raw data from the original .csv file into the previously created (empty) `bronze` schema.

In [0]:
from pyspark.sql.functions import col, sum, isnan
from pyspark.sql.types import IntegerType 

In [0]:
# Saving the .csv file as a delta table
file_path = "/Workspace/Users/bru401@gmail.com/Main/Data_Engineering/ml_project1_data.csv"
#file_path = "dbfs:/Workspace/Users/bru401@gmail.com/Main/Data_Engineering/ml_project1_data.csv"

df = spark.read.csv(file_path, header=True, inferSchema=True)
df.write.format("delta").mode("overwrite").saveAsTable("bronze.raw_data")

### Alternatively: loading from a table ingested through the UI

In [0]:
# Checking the schemas for conflicts.
# printSchema() prints the schema itself and returns None, so it is not wrapped in print().
spark.table("workspace.bronze.ml_project_2_data").printSchema()
spark.table("bronze.raw_data").printSchema()
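
Comparing two printed schemas by eye is easy to get wrong, so the check can also be done programmatically. The following is a minimal sketch, not part of the original notebook: `schema_diff` is a hypothetical helper, and the toy column names and types only stand in for what `DataFrame.dtypes` would return for the two tables.

```python
def schema_diff(left_dtypes, right_dtypes):
    """Compare two lists of (column, dtype) pairs, as returned by DataFrame.dtypes."""
    left, right = dict(left_dtypes), dict(right_dtypes)
    return {
        # Columns present in only one of the two schemas
        "only_in_left": sorted(set(left) - set(right)),
        "only_in_right": sorted(set(right) - set(left)),
        # Columns present in both schemas but with different types
        "type_mismatch": {
            name: (left[name], right[name])
            for name in sorted(set(left) & set(right))
            if left[name] != right[name]
        },
    }


# Toy input standing in for spark.table(...).dtypes; names and types are illustrative.
ui_table = [("ID", "bigint"), ("Income", "double"), ("Education", "string")]
csv_table = [("ID", "int"), ("Income", "int"), ("Education", "string")]
print(schema_diff(ui_table, csv_table))
```

In the notebook itself, the two arguments would be `spark.table("workspace.bronze.ml_project_2_data").dtypes` and `spark.table("bronze.raw_data").dtypes`. An empty result in all three fields means the schemas agree.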

In [0]:
# Use this block if the .csv file was ingested manually as a table through the Databricks UI.
df = spark.read.table("workspace.bronze.ml_project_2_data")
df.write.format("delta").mode("overwrite").saveAsTable("bronze.raw_data")

# Testing

In [0]:
display(df.limit(10))
print(f"The total number of lines is {df.count()}!")

The total number of lines is 2240!

In [0]:
df.describe().display()

Since the table has 2240 rows, the count of only 2216 for 'Income' indicates 24 missing values.
It is the only column with this problem.
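
The reasoning above is a subtraction of each column's non-null count from the total row count. A minimal sketch of that arithmetic follows; `missing_counts` is a hypothetical helper, and the only figures used are the ones quoted in this notebook. Note that `describe()` reports only numeric and string columns, so columns of other types would need a separate null check.

```python
def missing_counts(total_rows, non_null_counts):
    """Return {column: missing} for columns whose non-null count is below total_rows."""
    missing = {}
    for column, count in non_null_counts.items():
        if column == "summary":
            continue  # label column of the describe() output, not a data column
        gap = total_rows - int(count)  # describe() reports counts as strings
        if gap > 0:
            missing[column] = gap
    return missing


# Figures quoted in this notebook: 2240 rows, 2216 non-null 'Income' values.
print(missing_counts(2240, {"summary": "count", "Income": "2216"}))
```

In the notebook, the second argument could be taken from `df.describe().filter("summary = 'count'").first().asDict()`, with `df.count()` as the first.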

In [0]:
# Checking the non-integer columns
display(
    df.groupBy("Education").count()
)

display(
    df.groupBy("Marital_Status").count()
)

Some of these categories seem incorrect and will be addressed when we clean the data in the silver layer.
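
One way to make "seem incorrect" concrete is to compare the observed categories against an agreed list of valid values. The sketch below assumes such a list exists: `unexpected_categories` is a hypothetical helper, and the counts and allowed values are invented for illustration, not taken from the dataset.

```python
def unexpected_categories(category_counts, expected):
    """Return {category: count} for values that are not in the expected set."""
    return {
        category: count
        for category, count in category_counts.items()
        if category not in expected
    }


# Hypothetical counts and allowed values, for illustration only.
marital_counts = {"Married": 5, "Single": 3, "N/A?": 1}
expected_marital = {"Married", "Single", "Divorced"}
print(unexpected_categories(marital_counts, expected_marital))
```

In the notebook, the counts dictionary could be built with `{row["Marital_Status"]: row["count"] for row in df.groupBy("Marital_Status").count().collect()}`. The flagged values would then be candidates for remapping in the silver layer.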

# (WIP) Descriptive Analysis

In [0]:
display(df.dtypes)