## Creating a Delta database
First we need a place to store our data. Databricks Unity Catalog organizes data in a three-level namespace:
1. Catalog
1. Database (aka Schema)
1. Table

This hierarchy gives Databricks Lakehouse AI and Unity Catalog governance a consistent structure to work with. It also gives architects a clear organizing scheme without locking in too many design decisions.
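
To make the three-level namespace concrete, here is a minimal sketch in plain Python of how the three parts combine into a table name and a volume path. The helper names `table_name` and `volume_path` and the table name `bookings` are made up for illustration; only the `/Volumes/<catalog>/<database>/<volume>` layout comes from this notebook.

```python
# Hypothetical helpers: compose Unity Catalog names from their three parts.
def table_name(catalog: str, database: str, table: str) -> str:
    """Return the fully qualified catalog.database.table name."""
    return f"{catalog}.{database}.{table}"


def volume_path(catalog: str, database: str, volume: str, *parts: str) -> str:
    """Return the /Volumes/... path for a file or folder inside a volume."""
    return "/".join(["/Volumes", catalog, database, volume, *parts])


catalog, database = "ademianczuk", "ncr"
# "bookings" is a made-up table name, used only for illustration
print(table_name(catalog, database, "bookings"))
print(volume_path(catalog, database, "data", "ncr_ride_bookings.csv"))
```

Keeping the catalog and database in variables like this means only two values need to change when you switch to your own names.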

In [0]:
%sql
--TODO: Update the following with your own catalog and database names
--SQL is annoying - we'll have to update these manually for now.

--Update the catalog name with your own
CREATE CATALOG IF NOT EXISTS ademianczuk;

--Update the catalog & database name with your own
CREATE DATABASE IF NOT EXISTS ademianczuk.ncr;

--Update the catalog & database name with your own (leave `data` alone for this notebook. It's just the name of your volume)
CREATE VOLUME IF NOT EXISTS ademianczuk.ncr.data;

In [0]:
#TODO: Update the following with your own catalog and database names
catalog = "ademianczuk"
database = "ncr"

## Reading in our source data
I've included the original source data in the root of this project repository. We'll also pull it from a public URL to show that we can source data from pretty much anywhere. The examples cover both reading from a Databricks volume (which may be mapped to a cloud storage container) and fetching the same data from a public URL, to give you an idea of the different ways to ingest source data.
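
As a rough illustration of reading the same CSV from either kind of source, here is a small standard-library sketch (no Spark involved). The function name `read_csv_rows` is hypothetical, and it assumes the file is UTF-8 encoded. It fetches the text from a URL or a file path and then parses it the same way in both cases.

```python
import csv
import io
from urllib.parse import urlparse
from urllib.request import urlopen


def read_csv_rows(source: str) -> list[dict]:
    """Read a CSV from a URL (http, https, file) or from a plain file path."""
    if urlparse(source).scheme in ("http", "https", "file"):
        # Remote (or file://) source: fetch the bytes, then decode
        with urlopen(source) as resp:
            text = resp.read().decode("utf-8")
    else:
        # Local or mounted path, e.g. a path under /Volumes
        with open(source, encoding="utf-8", newline="") as fh:
            text = fh.read()
    # The csv module handles quoted fields, doubled quotes and embedded newlines
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Example calls (not run here):
# rows = read_csv_rows("/Volumes/ademianczuk/ncr/data/ncr_ride_bookings.csv")
# rows = read_csv_rows("https://raw.githubusercontent.com/andrijdemianczuk/uber_analytics/refs/heads/main/ncr_ride_bookings.csv")
```

Once the text is in hand the parsing step is identical, which is the point: where the data comes from is a separate concern from how it is read.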

In [0]:
%sh

#TODO: Update the following with your own catalog and database names
CATALOG=ademianczuk
DATABASE=ncr

#Create a temp storage location for our downloaded file
rm -rf /tmp/ncr || true
mkdir -p /tmp/ncr
cd /tmp/ncr

#Download the ride bookings CSV
curl -L https://raw.githubusercontent.com/andrijdemianczuk/uber_analytics/refs/heads/main/ncr_ride_bookings.csv -o ncr_ride_bookings.csv

#Copy the dataset into our volume. Because we're writing to the volume's root directory, we can't remove or recreate it with normal sh commands; the commented-out lines below are kept for reference only. Downloading the same file again will simply overwrite the old one.

# rm -rf /Volumes/$CATALOG/$DATABASE/data/ || true
# mkdir -p /Volumes/$CATALOG/$DATABASE/data/
cp -f ncr_ride_bookings.csv /Volumes/$CATALOG/$DATABASE/data/

rm -rf /Volumes/$CATALOG/$DATABASE/data/csv || true
mkdir -p /Volumes/$CATALOG/$DATABASE/data/csv

In [0]:
#Import only the functions this notebook uses. Note that min and max here shadow Python's built-ins.
from pyspark.sql.functions import col, dayofmonth, dayofweek, max, min, month, year
from pyspark.sql import DataFrame

In [0]:
df = (
    spark.read.option("header", True)
    .option("multiline", True)
    .option("quote", '"')
    .option("escape", '"')
    .option("inferSchema", True)
    .csv(f"/Volumes/{catalog}/{database}/data/ncr_ride_bookings.csv")
)

In [0]:
display(
    df.filter((df["Date"] >= "2024-01-01") & (df["Date"] <= "2024-01-31")).orderBy(
        df["Date"].asc(), df["Time"].asc()
    )
)

In [0]:
df.count()

In [0]:
display(df.agg(min("Date"), max("Date")))

## Organizing our data
Let's add a few calendar columns to the dataframe so we can break it up by month (or by week) and write out the Parquet files accordingly. For the lab, we'll separate our data into a number of files so that we can 'drop' them into our system to simulate real data arriving in cloud storage (e.g., ADLS, S3 or GCS buckets). We'll be using Databricks Volumes for this lab. Databricks recommends mapping these Volumes to your cloud storage container whenever possible; however, it's not required, and you can connect to your storage externally if you prefer.
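
To see what the derived calendar columns hold for a single date, here is a plain Python sketch that mirrors them. The function name `date_parts` is made up for illustration. The one detail worth calling out is that Spark's `dayofweek` counts Sunday as 1 and Saturday as 7, which differs from Python's own weekday numbering.

```python
from datetime import date


def date_parts(d: date) -> dict:
    """Mirror the derived calendar columns for a single date, in plain Python."""
    return {
        # Spark's dayofweek counts 1 = Sunday ... 7 = Saturday
        "dayOfWeek": d.isoweekday() % 7 + 1,
        "dayOfMonth": d.day,
        "month": d.month,
        "year": d.year,
    }


# 2024-01-01 was a Monday, so dayOfWeek is 2 under Spark's convention
print(date_parts(date(2024, 1, 1)))
```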

In [0]:


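#Derive calendar columns from Date. year and month are used as partition keys later on.
#Note: dayofweek returns 1 for Sunday through 7 for Saturday.
df = df.withColumn('dayOfWeek', dayofweek(col('Date')))
df = df.withColumn('dayOfMonth', dayofmonth(col('Date')))
df = df.withColumn('month', month(col('Date')))
df = df.withColumn('year', year(col('Date')))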

In [0]:
display(df)

## Why write to parquet?
Parquet is a file format that is efficient in both storage and read performance. Because it is columnar, values from the same column are stored together, which compresses well and is why it's considered a 'dense' storage format. Parquet datasets can also be partitioned, which makes filtering on read fast: partitions are stored as directories, so files in directories that don't match the filter can be skipped entirely. This concept is important to understanding how Delta works as well. Delta stores its data as Parquet files but also keeps metadata at the parent directory level (a transaction log) about how the data is stored and organized.
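
The directory-based pruning described above can be sketched without Spark. The snippet below builds an empty `year=<y>/month=<m>` directory layout in a temp folder and then selects only the directories that match a filter. The `prune` function is a made-up illustration of the idea, not how Spark implements it.

```python
import tempfile
from pathlib import Path

# Build an empty Hive-style layout: <root>/year=<y>/month=<m>/
root = Path(tempfile.mkdtemp()) / "ncr_ride_bookings"
for y, m in [(2024, 1), (2024, 2), (2024, 3)]:
    (root / f"year={y}" / f"month={m}").mkdir(parents=True)


def prune(root: Path, **wanted) -> list[Path]:
    """Return only the leaf directories whose key=value names match the filter."""
    matches = []
    for leaf in sorted(root.glob("year=*/month=*")):
        # Turn ("year=2024", "month=1") into {"year": "2024", "month": "1"}
        values = dict(part.split("=", 1) for part in leaf.relative_to(root).parts)
        if all(values.get(k) == str(v) for k, v in wanted.items()):
            matches.append(leaf)
    return matches


# A filter on year and month only has to open one directory
print([p.relative_to(root).as_posix() for p in prune(root, year=2024, month=1)])
```

The filter is answered from directory names alone, so no data file in the other partitions is ever opened.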

Although we don't need to write to Parquet here, it's a good opportunity to show how it works. In practice, we'd just read the raw data into a dataframe or some type of materialization and work with it from there. Parquet also lets us 'pick up' our workload from this point in the notebook if we terminate our cluster. Think of it as a 'save state' for our work. Delta works even better for this, so we'll be showing both examples below.

In [0]:
df.write.partitionBy("year", "month").mode("overwrite").format("parquet").save(f"/Volumes/{catalog}/{database}/data/ncr_ride_bookings")

In [0]:
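#Read a single partition directly. Because we point at the partition directory itself,
#the year and month partition columns are not part of the resulting dataframe.
df = spark.read.parquet(f"/Volumes/{catalog}/{database}/data/ncr_ride_bookings/year=2024/month=1")
display(df)
df.count()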

In [0]:
df = spark.read.parquet(f"/Volumes/{catalog}/{database}/data/ncr_ride_bookings")
df.count()

## Creating a number of files to simulate incremental ingestion
Now that we have everything organized, let's split our dataframe into CSV files, each holding one month's worth of data. We'll incrementally load each file in the pipeline later on, which will give us an idea of how merging works alongside in-flight ETL.
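
Before the Spark version, here is the same idea as a small standard-library sketch: group rows by month and write one flat `month_<MM>.csv` per group. The function name `write_monthly_csvs` and the sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_monthly_csvs(rows: list[dict], out_dir: Path) -> list[Path]:
    """Group rows by their 'month' value and write one month_<MM>.csv per group."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[int(row["month"])].append(row)

    written = []
    for month_val in sorted(by_month):
        # Zero-pad the month so the files sort in calendar order
        path = out_dir / f"month_{month_val:02d}.csv"
        with open(path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(by_month[month_val][0]))
            writer.writeheader()
            writer.writerows(by_month[month_val])
        written.append(path)
    return written


# Made-up sample rows, only for illustration
rows = [
    {"Date": "2024-01-05", "month": 1},
    {"Date": "2024-02-11", "month": 2},
    {"Date": "2024-01-20", "month": 1},
]
files = write_monthly_csvs(rows, Path(tempfile.mkdtemp()))
print([f.name for f in files])
```

The Spark cell that follows does the same grouping with a filter per month, plus an extra rename step because Spark writes a directory of part files rather than a single named file.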

In [0]:
base_dir = f"/Volumes/{catalog}/{database}/data/csv"

#Get distinct months (assumes 'month' is 1..12; if it's yyyy-MM use that string directly)
months = [r.month for r in df.select("month").distinct().collect()]

def write_month_csv(src_df: DataFrame, month_val):
    """Write the rows for one month to a single flat file, {base_dir}/month_<MM>.csv."""
    #zero-pad to 2 digits if month is numeric; adjust to your format
    m_str = f"{int(month_val):02d}" if isinstance(month_val, int) else str(month_val)
    tmp_dir = f"{base_dir}/_tmp_month_{m_str}"

    #1. write to a temp directory with a single part file
    (src_df
        .filter(col("month") == month_val)
        .coalesce(1)
        .write
        .mode("overwrite")
        .option("header", "true")
        .option("mapreduce.fileoutputcommitter.marksuccessfuljobs","false")
        .csv(tmp_dir))

    #2. find the single part file
    part_files = [f.path for f in dbutils.fs.ls(tmp_dir) if f.name.startswith("part-") and f.name.endswith(".csv")]
    if not part_files:
        raise RuntimeError(f"No CSV part file found for month {m_str} in {tmp_dir}")
    part_path = part_files[0]

    #3. move/rename to final destination (flat structure)
    final_path = f"{base_dir}/month_{m_str}.csv"
    dbutils.fs.mv(part_path, final_path, True)

    #4. clean up the temp directory
    dbutils.fs.rm(tmp_dir, True)

for m in months:
    write_month_csv(df, m)


## Next Steps
Now that we have our data all ready to go, we'll create a small application that simulates dropping each file into another Databricks volume for ingestion. This volume will simulate being attached to an external storage container such as ADLS, GCS or S3.