# Batch Processing - Bronze Layer

When building long-term storage for analytical use cases, the first step in our data processing journey is to **ingest data** from the **source**, keeping its shape as close as possible to the original.

We will:
* Ingest the raw data in a single pull
* Convert the data to parquet format


## Set up this Notebook
Before we get started, we need to quickly set up this notebook by installing some helpers, cleaning up your unique working directory (so as not to clash with others working in the same space), and setting some variables.

In [0]:
%pip uninstall -y databricks_helpers exercise_ev_databricks_unit_tests
%pip install git+https://github.com/data-derp/databricks_helpers#egg=databricks_helpers git+https://github.com/data-derp/exercise_ev_databricks_unit_tests#egg=exercise_ev_databricks_unit_tests

In [0]:
from databricks_helpers.databricks_helpers import DataDerpDatabricksHelpers
exercise_name = "batch_processing_bronze_ingest"
helpers = DataDerpDatabricksHelpers(dbutils, exercise_name)

current_user = helpers.current_user()
working_directory = helpers.working_directory()
print(f"Your current working directory is: {working_directory}")

## This function CLEARS your current working directory. Only run this if you want a fresh start or if it is the first time you're doing this exercise.
helpers.clean_working_directory()

## Read OCPP Data
We've done this a couple of times before! Run the following cells to download the data to local storage and create a DataFrame from it.

In [0]:
url = "https://raw.githubusercontent.com/kelseymok/charge-point-simulator-v1.6/main/out/1680355141.csv.gz"
filepath = helpers.download_to_local_dir(url)

In [0]:
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

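def create_dataframe(filepath: str) -> DataFrame:
    # Explicit schema: every column except message_type is read as a plain string,
    # so the timestamp and body stay unparsed at this stage.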
    custom_schema = StructType([
        StructField("message_id", StringType(), True),
        StructField("message_type", IntegerType(), True),
        StructField("charge_point_id", StringType(), True),
        StructField("action", StringType(), True),
        StructField("write_timestamp", StringType(), True),
        StructField("body", StringType(), True),
    ])
    
    df = spark.read.format("csv") \
        .option("header", True) \
        .option("delimiter", ",") \
        .option("escape", "\\") \
        .schema(custom_schema) \
        .load(filepath)
    return df
    
df = create_dataframe(filepath)
display(df)
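
The `escape` option above tells Spark's CSV reader that a backslash escapes the next character inside a quoted field. That matters if a column such as `body` holds a quoted string that itself contains double quotes. As a rough analogy only, the same idea can be sketched with Python's standard `csv` module. This is not Spark's parser, and the sample row below is made up for illustration (it assumes `body` holds JSON):

```python
import csv
import io
import json

# Hypothetical sample: a header and one row whose body is JSON with
# backslash-escaped double quotes. Not taken from the real dataset.
sample = (
    "message_id,action,body\n"
    r'abc-123,Heartbeat,"{\"status\": \"ok\"}"' "\n"
)


def parse_rows(text: str) -> list:
    """Parse CSV text, treating a backslash as the escape character."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter=",",
        quotechar='"',
        escapechar="\\",    # same role as .option("escape", "\\") above
        doublequote=False,  # quotes are escaped with a backslash, not doubled
    )
    return list(reader)


rows = parse_rows(sample)
body = json.loads(rows[0]["body"])  # the unescaped body is valid JSON again
print(body)
```

Without the escape character, the first inner quote would end the field early and the body would be split or mangled.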


## EXERCISE: Write to Parquet
Now that we have our ingested data represented in a DataFrame, let's use the [`parquet writer`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.parquet.html) along with [`mode="overwrite"`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.parquet.html) to formally write our data to the specified `out_dir`.

In [0]:
def write(input_df: DataFrame):
    out_dir = f"{working_directory}/output/"
    
    ### Put your code here.
    mode_name = "overwrite"
    
    # Replace any existing output so the cell can be re-run safely.
    input_df.write \
        .mode(mode_name) \
        .parquet(out_dir)
    
    
write(df)
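
One way to check the write is to read the Parquet output back and compare it with the DataFrame that was written. The sketch below does that with a hypothetical helper: `verify_roundtrip` is not part of `databricks_helpers` or the exercise's unit tests, and it assumes the notebook's `spark` session is passed in.

```python
def verify_roundtrip(spark, input_df, out_dir: str):
    """Read Parquet back from out_dir and compare it with input_df.

    Hypothetical helper for illustration. `spark` is the active SparkSession
    and `input_df` is the DataFrame that was written to `out_dir`.
    """
    written_df = spark.read.parquet(out_dir)

    # Row counts should match exactly after an overwrite.
    expected_rows = input_df.count()
    actual_rows = written_df.count()
    if actual_rows != expected_rows:
        raise ValueError(
            f"Row count mismatch: wrote {expected_rows}, read back {actual_rows}"
        )

    # Column names and their order should survive the round trip.
    if written_df.columns != input_df.columns:
        raise ValueError(
            f"Column mismatch: wrote {input_df.columns}, read back {written_df.columns}"
        )

    return written_df
```

In this notebook it could be called as `verify_roundtrip(spark, df, f"{working_directory}/output/")`. Note that each `count()` triggers a Spark job, so on large data a cheaper check may be preferable.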

Let's inspect what we've created.

In [0]:
dbutils.fs.ls(f"{working_directory}/output/")
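
The listing typically contains Spark's Parquet part files alongside marker files such as `_SUCCESS`. A small sketch for summarising it follows. `summarize_listing` is a hypothetical helper, and it assumes each entry exposes `name` and `size` attributes, as the `FileInfo` objects returned by `dbutils.fs.ls` do.

```python
def summarize_listing(entries):
    """Return (number of Parquet part files, their total size in bytes).

    Hypothetical helper: `entries` is any iterable of objects with
    `name` and `size` attributes, such as the result of dbutils.fs.ls.
    """
    # Keep only data files; marker files like _SUCCESS are skipped.
    parquet_files = [e for e in entries if e.name.endswith(".parquet")]
    total_bytes = sum(e.size for e in parquet_files)
    return len(parquet_files), total_bytes
```

In this notebook it could be called as `summarize_listing(dbutils.fs.ls(f"{working_directory}/output/"))`.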

A bit of clean up before we move on...

In [0]:
helpers.clean_working_directory()

&copy; 2025 Thoughtworks. All rights reserved.<br/>