# Bodo DataFrames

This notebook demonstrates some of the core functionality of Bodo DataFrames for data engineering, data science, AI, and ML applications.
It has three main sections: 
1. Simple AI inference workflows
2. Scalable dataset management with Iceberg
3. Accelerating Pandas with a one-line change


## 0. Environment Setup

The dataset used in sections 2 and 3 is from the [New York City Taxi and Limousine Commission](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and contains trips from taxi and rideshare apps from 2019-2023. 
This notebook uses a subset of the 2019 data (~1 GiB, Parquet format). In addition to the Taxi dataset, 
section 3 also uses a small dataset of [weather observations from Central Park](https://github.com/toddwschneider/nyc-taxi-data/blob/c65ad8332a44f49770644b11576c0529b40bbc76/data/central_park_weather.csv) (~0.5 MiB CSV file). 
The data is hosted in a public S3 bucket; downloading it first removes network speed as a source of variability in the timings below.

In [2]:
import os
import boto3
from botocore import UNSIGNED
from botocore.config import Config

bucket_name = "bodo-example-data"

def download_data_s3(path_to_s3: str, local_data_dir: str = "data") -> str:
    """Download a file from the public S3 bucket, skipping the download if it already exists locally."""
    file_name = os.path.basename(path_to_s3)
    local_path = os.path.join(local_data_dir, file_name)

    if os.path.exists(local_path):
        return local_path

    print("Downloading dataset from S3...")

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # Create the target directory (and any missing parent directories) if needed
    os.makedirs(local_data_dir, exist_ok=True)

    s3.download_file(bucket_name, path_to_s3, local_path)
    return local_path

# Download the weather data (CSV)
download_data_s3("nyc-taxi/central_park_weather.csv")

# Download Taxi data (parquet files)
pq_files = [
    "nyc-taxi/fhvhv_tripdata/fhvhv_tripdata_2019-02.parquet",
    "nyc-taxi/fhvhv_tripdata/fhvhv_tripdata_2019-03.parquet",
]

for file in pq_files:
    download_data_s3(file, local_data_dir="data/fhvhv_tripdata")

In [3]:
# Filter warnings on workers
# These are performance warnings due to the small number of data files.
# Skip this cell to see the raw output.

import warnings
import bodo.spawn.spawner as spawner

spawner.submit_func_to_workers(lambda: warnings.filterwarnings("ignore"), [])

# Filter Frontend warnings
warnings.filterwarnings("ignore")

## 1. Simple AI inference workflows

Bodo DataFrames provides an extended version of the Pandas API for simplifying and scaling common AI workflows.
In this example, we show how you can create and query vector embeddings using Bodo DataFrames and S3 Vectors.
Bodo DataFrames parallelizes calls to the S3 Vectors APIs when putting vectors,
which makes it well suited to loading large batches of data at a time.
Before running the cells below, create an S3 vector bucket and vector index 
(follow the first two steps of [this tutorial](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-vectors-getting-started.html)).

In [7]:
import os

import bodo.pandas as bd

openai_api_key = os.environ.get("OPENAI_API_KEY")

In [8]:
# Create embeddings using OpenAI API

texts = [
   "Star Wars: A farm boy joins rebels to fight an evil empire in space",
   "Jurassic Park: Scientists create dinosaurs in a theme park that goes wrong",
   "Finding Nemo: A father fish searches the ocean to find his lost son"
]
keys = ["Star Wars", "Jurassic Park", "Finding Nemo"]
genres = ["scifi", "scifi", "family"]


df = bd.DataFrame({"key": keys, "text": texts, "genre": genres})
df["data"] = df.text.ai.embed(model="text-embedding-3-small", api_key=openai_api_key)

df["data"]

0    [-0.05209277  0.07324854 -0.00430418 ...  0.00...
1    [-0.00482084  0.06854609  0.00452549 ... -0.00...
2    [ 0.01458554  0.04918431 -0.01516628 ...  0.00...
Name: data, dtype: list<item: double>[pyarrow]

In [5]:
# Write embeddings into vector index with metadata.

# TODO: fill in with your vector bucket/index and region
vector_bucket_name = "your-vector-bucket"
index_name = "your-vector-index"
region = "your-region"

df["metadata"] = df.apply(lambda row: {"source_text": row.text, "genre": row.genre}, axis=1)
df.to_s3_vectors(
   vector_bucket_name=vector_bucket_name,
   index_name=index_name,
   region=region
)

In [6]:
# Query the vector index (with filtering)

input_text = "adventures in space"
df = bd.DataFrame({"text": [input_text]})
df["data"] = df.text.ai.embed(model="text-embedding-3-small", api_key=openai_api_key)
out = df.data.ai.query_s3_vectors(
   vector_bucket_name=vector_bucket_name,
   index_name=index_name,
   region=region,
   topk=3,
   filter={"genre": "scifi"},
   return_distance=True,
   return_metadata=True,
)

out

Unnamed: 0,keys,distances,metadata
0,['Star Wars' 'Jurassic Park'],[0.62180173 0.73625743],"[""{'source_text': 'Star Wars: A farm boy joins..."
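
To make the query parameters above concrete, the following is a minimal pure-Python sketch of what a filtered top-k similarity search does conceptually: keep only the records whose metadata matches the filter, rank them by distance to the query vector, and return the closest `topk`. It is not how S3 Vectors or Bodo DataFrames implement the query. The helper names (`cosine_distance`, `query_topk`), the toy 3-dimensional vectors, and the use of cosine distance are assumptions made for illustration; the actual metric depends on how the vector index was created.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def query_topk(records, query, topk, metadata_filter):
    """Hypothetical helper: brute-force filtered nearest-neighbor search."""
    # Keep only records whose metadata matches every key in the filter.
    candidates = [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in metadata_filter.items())
    ]
    # Rank the remaining records by distance to the query (smallest first).
    ranked = sorted(candidates, key=lambda r: cosine_distance(r["data"], query))
    return [(r["key"], cosine_distance(r["data"], query)) for r in ranked[:topk]]

# Toy 3-dimensional "embeddings" (made up for illustration only).
records = [
    {"key": "Star Wars", "data": [0.9, 0.1, 0.0], "metadata": {"genre": "scifi"}},
    {"key": "Jurassic Park", "data": [0.5, 0.5, 0.0], "metadata": {"genre": "scifi"}},
    {"key": "Finding Nemo", "data": [0.8, 0.2, 0.1], "metadata": {"genre": "family"}},
]

results = query_topk(records, query=[1.0, 0.0, 0.0], topk=3, metadata_filter={"genre": "scifi"})
for key, distance in results:
    print(key, round(distance, 4))
```

As in the notebook output, asking for three results returns only two here, because the filter removes the one record whose genre is not `scifi` before ranking.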


## 2. Scalable dataset management with Iceberg

Bodo DataFrames provides simple and scalable APIs for reading and writing data to Iceberg tables. Iceberg is an open-source table format that provides an extra layer of scalable dataset management on top of raw files.
One benefit of using Iceberg is the Time Travel feature, which lets you inspect the state of a table at a previous point in time, so you can track your data as it changes.

In [7]:
import bodo.pandas as pd
import pyiceberg
from bodo.io.iceberg.catalog.dir import DirCatalog

warehouse_loc = "./iceberg_warehouse"
table_name = "fhvhv_tripdata"

In [11]:
# Load a portion of the NYC Taxi Data into Iceberg

df = pd.read_parquet("data/fhvhv_tripdata/fhvhv_tripdata_2019-02.parquet")
df = df[['hvfhs_license_num', 'PULocationID', 'DOLocationID', 'trip_miles', 'dropoff_datetime', 'pickup_datetime']].head(5)

df.to_iceberg(table_name, location=warehouse_loc)

out_df = pd.read_iceberg(table_name, location=warehouse_loc)
out_df

Unnamed: 0,hvfhs_license_num,PULocationID,DOLocationID,trip_miles,dropoff_datetime,pickup_datetime
0,HV0003,245,251,2.45,2019-02-01 00:14:57,2019-02-01 00:05:18
1,HV0003,216,197,1.71,2019-02-01 00:49:39,2019-02-01 00:41:29
2,HV0005,261,234,5.01,2019-02-01 01:28:29,2019-02-01 00:51:34
3,HV0005,87,87,0.34,2019-02-01 00:07:16,2019-02-01 00:03:51
4,HV0005,87,198,6.84,2019-02-01 00:39:56,2019-02-01 00:09:44


In [12]:
# Add more rows to the table

df = pd.read_parquet("data/fhvhv_tripdata/fhvhv_tripdata_2019-03.parquet")
df = df[['hvfhs_license_num', 'PULocationID', 'DOLocationID', 'trip_miles', 'dropoff_datetime', 'pickup_datetime']].head(5)

df.to_iceberg(table_name, location="./iceberg_warehouse", append=True)

out_df = pd.read_iceberg(table_name, location="./iceberg_warehouse")
out_df

Unnamed: 0,hvfhs_license_num,PULocationID,DOLocationID,trip_miles,dropoff_datetime,pickup_datetime
0,HV0003,245,251,2.45,2019-02-01 00:14:57,2019-02-01 00:05:18
1,HV0003,216,197,1.71,2019-02-01 00:49:39,2019-02-01 00:41:29
2,HV0005,261,234,5.01,2019-02-01 01:28:29,2019-02-01 00:51:34
3,HV0005,87,87,0.34,2019-02-01 00:07:16,2019-02-01 00:03:51
4,HV0005,87,198,6.84,2019-02-01 00:39:56,2019-02-01 00:09:44
5,HV0004,36,80,2.18,2019-03-01 00:28:51,2019-03-01 00:13:55
6,HV0004,37,232,3.66,2019-03-01 00:43:03,2019-03-01 00:23:58
7,HV0005,25,62,2.53,2019-03-01 00:15:09,2019-03-01 00:03:37
8,HV0003,65,262,9.34,2019-03-01 00:50:43,2019-03-01 00:29:46
9,HV0003,140,196,7.88,2019-03-01 01:20:47,2019-03-01 00:58:56


In [13]:
# Use PyIceberg to get a history of the dataset over time.

catalog = DirCatalog(
    None,
    **{
        pyiceberg.catalog.WAREHOUSE_LOCATION: warehouse_loc,
    },
)

table = catalog.load_table(f".{table_name}")
history = table.history()
prev_snapshot = history[0].snapshot_id

history

[SnapshotLogEntry(snapshot_id=4162266328373461502, timestamp_ms=1755540265399),
 SnapshotLogEntry(snapshot_id=6911853098964568922, timestamp_ms=1755540267586)]
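
Each history entry carries a `timestamp_ms` field, so a snapshot can also be chosen by time rather than by its position in the list. The sketch below picks the most recent snapshot committed at or before a cutoff. `LogEntry` is a stand-in dataclass with the same two fields shown in the output above, so the example runs without an Iceberg table, and `snapshot_as_of` is a hypothetical helper, not part of PyIceberg or Bodo DataFrames.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class LogEntry:
    """Stand-in for a history entry (fields copied from the output above)."""
    snapshot_id: int
    timestamp_ms: int

def snapshot_as_of(history: Sequence[LogEntry], cutoff_ms: int) -> Optional[int]:
    """Return the id of the latest snapshot committed at or before cutoff_ms."""
    eligible = [e for e in history if e.timestamp_ms <= cutoff_ms]
    if not eligible:
        return None  # No snapshot had been committed yet at the cutoff.
    return max(eligible, key=lambda e: e.timestamp_ms).snapshot_id

example_history = [
    LogEntry(snapshot_id=4162266328373461502, timestamp_ms=1755540265399),
    LogEntry(snapshot_id=6911853098964568922, timestamp_ms=1755540267586),
]

# One second after the first write, only the first snapshot exists.
print(snapshot_as_of(example_history, 1755540265399 + 1000))
```

The returned id can then be passed as `snapshot_id` to `pd.read_iceberg`, in the same way the Time Travel cell in this section passes `prev_snapshot`.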

In [14]:
# Use Time Travel to query both versions of the dataset

df = pd.read_iceberg(table_name, location="./iceberg_warehouse", snapshot_id=prev_snapshot)
display(df)

current_snapshot = history[1].snapshot_id

df = pd.read_iceberg(table_name, location="./iceberg_warehouse", snapshot_id=current_snapshot)
display(df)

Unnamed: 0,hvfhs_license_num,PULocationID,DOLocationID,trip_miles,dropoff_datetime,pickup_datetime
0,HV0003,245,251,2.45,2019-02-01 00:14:57,2019-02-01 00:05:18
1,HV0003,216,197,1.71,2019-02-01 00:49:39,2019-02-01 00:41:29
2,HV0005,261,234,5.01,2019-02-01 01:28:29,2019-02-01 00:51:34
3,HV0005,87,87,0.34,2019-02-01 00:07:16,2019-02-01 00:03:51
4,HV0005,87,198,6.84,2019-02-01 00:39:56,2019-02-01 00:09:44


Unnamed: 0,hvfhs_license_num,PULocationID,DOLocationID,trip_miles,dropoff_datetime,pickup_datetime
0,HV0003,245,251,2.45,2019-02-01 00:14:57,2019-02-01 00:05:18
1,HV0003,216,197,1.71,2019-02-01 00:49:39,2019-02-01 00:41:29
2,HV0005,261,234,5.01,2019-02-01 01:28:29,2019-02-01 00:51:34
3,HV0005,87,87,0.34,2019-02-01 00:07:16,2019-02-01 00:03:51
4,HV0005,87,198,6.84,2019-02-01 00:39:56,2019-02-01 00:09:44
5,HV0004,36,80,2.18,2019-03-01 00:28:51,2019-03-01 00:13:55
6,HV0004,37,232,3.66,2019-03-01 00:43:03,2019-03-01 00:23:58
7,HV0005,25,62,2.53,2019-03-01 00:15:09,2019-03-01 00:03:37
8,HV0003,65,262,9.34,2019-03-01 00:50:43,2019-03-01 00:29:46
9,HV0003,140,196,7.88,2019-03-01 01:20:47,2019-03-01 00:58:56


In [15]:
# Optional cleanup

import shutil

shutil.rmtree("./iceberg_warehouse")

## 3. Accelerating Pandas code with a one-line change

Bodo DataFrames automatically parallelizes and optimizes workloads written in Pandas.
This example uses a representative data engineering workload for creating trip summaries using Central Park weather observations (CSV) and NYC Taxi/Rideshare data (Parquet),
highlighting Bodo DataFrames performance on key transformations such as reading Parquet, datetime manipulation, merge, groupby, and sort.
We first measure the performance of Bodo DataFrames, then run the same workload in Pandas for comparison.
The gap becomes larger as we scale to more data and add more cores.

In [4]:
from time import perf_counter

weather_dataset = "data/central_park_weather.csv"
hvfhv_dataset = "data/fhvhv_tripdata/"

def get_monthly_travels_weather(weather_dataset: str, hvfhv_dataset: str, out_file: str, pd):
    """Run the full workload and write results to Parquet.

    `pd` is the DataFrame module to use (`pandas` or `bodo.pandas`).
    """
    start = perf_counter()
    central_park_weather_observations = pd.read_csv(
        weather_dataset,
        parse_dates=["DATE"],
    )
    central_park_weather_observations = central_park_weather_observations.rename(
        columns={"DATE": "date", "PRCP": "precipitation"}, copy=False
    )
    fhvhv_tripdata = pd.read_parquet(hvfhv_dataset)

    central_park_weather_observations["date"] = central_park_weather_observations[
        "date"
    ].dt.date
    fhvhv_tripdata["date"] = fhvhv_tripdata["pickup_datetime"].dt.date
    fhvhv_tripdata["month"] = fhvhv_tripdata["pickup_datetime"].dt.month
    fhvhv_tripdata["hour"] = fhvhv_tripdata["pickup_datetime"].dt.hour
    fhvhv_tripdata["weekday"] = fhvhv_tripdata["pickup_datetime"].dt.dayofweek.isin(
        [0, 1, 2, 3, 4]
    )

    monthly_trips_weather = fhvhv_tripdata.merge(
        central_park_weather_observations, on="date", how="inner"
    )
    monthly_trips_weather["date_with_precipitation"] = (
        monthly_trips_weather["precipitation"] > 0.1
    )

    def get_time_bucket(t):
        bucket = "other"
        if t in (8, 9, 10):
            bucket = "morning"
        elif t in (11, 12, 13, 14, 15):
            bucket = "midday"
        elif t in (16, 17, 18):
            bucket = "afternoon"
        elif t in (19, 20, 21):
            bucket = "evening"
        return bucket

    monthly_trips_weather["time_bucket"] = monthly_trips_weather.hour.map(
        get_time_bucket
    )
    monthly_trips_weather = monthly_trips_weather.groupby(
        [
            "PULocationID",
            "DOLocationID",
            "month",
            "weekday",
            "date_with_precipitation",
            "time_bucket",
        ],
        as_index=False,
    ).agg({"hvfhs_license_num": "count", "trip_miles": "mean"})
    monthly_trips_weather = monthly_trips_weather.sort_values(
        by=[
            "PULocationID",
            "DOLocationID",
            "month",
            "weekday",
            "date_with_precipitation",
            "time_bucket",
        ]
    )
    monthly_trips_weather = monthly_trips_weather.rename(
        columns={
            "hvfhs_license_num": "trips",
            "trip_miles": "avg_distance",
        },
        copy=False,
    )

    monthly_trips_weather.to_parquet(out_file)
    end = perf_counter()
    print("Total E2E time:", (end - start))

In [5]:
import bodo.pandas

get_monthly_travels_weather(weather_dataset, hvfhv_dataset, out_file="bodo_monthly_trips_weather_pandas.pq", pd=bodo.pandas)

Total E2E time: 10.84962783300216


### Comparing to Pandas

Run the cell below to see the comparison to Pandas.
Without any code changes, Bodo accelerates the workflow by ~6-10x on a 2023 10-core MacBook Pro.

In [6]:
import pandas

get_monthly_travels_weather(weather_dataset, hvfhv_dataset, out_file="pandas_monthly_trips_weather.pq", pd=pandas)

Total E2E time: 75.9208817079998
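
As a quick check, the speedup for this particular run can be computed from the two timings printed above. The numbers are copied from the cell outputs; they will differ between machines and runs.

```python
bodo_seconds = 10.84962783300216    # "Total E2E time" from the Bodo DataFrames run
pandas_seconds = 75.9208817079998   # "Total E2E time" from the Pandas run

speedup = pandas_seconds / bodo_seconds
print(f"Speedup: {speedup:.1f}x")
```

For these two runs the ratio is roughly 7x, which falls within the ~6-10x range quoted above.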
