# Framing DataFrames

### Pyowa - 2025-05-27
### Adam Best

# About Me

- 6 years of experience as a data engineer - 10 years with data
- John Deere Engine Works - quality engineering
- General Dynamics IT - healthcare data warehouse
- Dwolla - data platform/warehouse + AWS
- John Deere Financial - Database Administration
- Tractor Zoom - data platform

# Goals
- Understanding of when and how to use a DataFrame
- What options are out there, and what tools might be best for your use case

### Will not cover
- Exact syntax

# What is a DataFrame?

### Programmable Excel
- 2D table of data
- Rows and columns
- Cells contain data
- Data is homogeneous within each column
- Data is aligned
- Data is indexed
- Data is mutable
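
A minimal sketch of these properties in pandas. The row labels, the `bonus` series, and the `is_senior` column are made up for illustration only.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the properties listed above.
df = pd.DataFrame(
    {"name": ["John", "Jane", "Jim"], "age": [20, 21, 22]},
    index=["a", "b", "c"],  # explicit row labels (the index)
)

# Homogeneous: each column carries a single dtype.
age_dtype = df["age"].dtype

# Indexed: rows can be looked up by label.
jane_age = df.loc["b", "age"]

# Aligned: arithmetic matches rows by index label, not by position.
bonus = pd.Series({"c": 1, "a": 10})
aligned = df["age"] + bonus  # "b" has no match in bonus, so it becomes NaN

# Mutable: cells and columns can be changed in place.
df.loc["a", "age"] = 25
df["is_senior"] = df["age"] >= 22
```

Note that the alignment step pairs `"a"` with `"a"` and `"c"` with `"c"` even though `bonus` lists them in a different order.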

# Common data tasks

- Selecting and filtering
- Aggregations - sum, count
- Joining or merging
- Grouping

In [1]:
# Example DataFrame
from pandas import DataFrame

df = DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Jill'],
    'age': [20, 21, 22, 23],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})
df.head()

   name  age         city
0  John   20     New York
1  Jane   21  Los Angeles
2   Jim   22      Chicago
3  Jill   23      Houston
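
The four common data tasks, sketched against the same example data. The `states` lookup table is an assumption added here only so there is something to join against.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["John", "Jane", "Jim", "Jill"],
    "age": [20, 21, 22, 23],
    "city": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# Hypothetical lookup table, added only to demonstrate a join.
states = pd.DataFrame({
    "city": ["New York", "Los Angeles", "Chicago", "Houston"],
    "state": ["NY", "CA", "IL", "TX"],
})

# Selecting and filtering: pick columns, keep rows matching a condition.
over_21 = people.loc[people["age"] > 21, ["name", "age"]]

# Aggregations: reduce a column to a single value.
total_age = people["age"].sum()
row_count = people["age"].count()

# Joining or merging: combine two tables on a shared key.
with_state = people.merge(states, on="city", how="left")

# Grouping: split rows by key, then aggregate each group.
avg_age_by_state = with_state.groupby("state")["age"].mean()
```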


# What isn't a DataFrame?

In [2]:
# List of Dicts
data = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
# No indexing or vectorized operations on fields
# e.g. fetch all records with age > 25
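
For comparison, a small sketch of that "age > 25" fetch done both ways. The variable names are illustrative.

```python
import pandas as pd

data = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# Plain Python: loop over every record and test the field by hand.
older_records = [row for row in data if row["age"] > 25]

# DataFrame: the comparison runs over the whole column at once and
# yields a boolean mask that selects the matching rows.
df = pd.DataFrame(data)
older_df = df[df["age"] > 25]
```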

In [3]:
# Dict of List
data = {"name": ["Alice", "Bob"], "age": [30, 25]}
# Column based, but lacks type and shape enforcement
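
A sketch of what that missing enforcement looks like in practice, assuming pandas as the DataFrame library. The `ragged` and `mixed` inputs are contrived examples. pandas rejects columns of unequal length outright; mixed types are allowed but show up in the column's dtype.

```python
import pandas as pd

# A dict happily stores "columns" of different lengths.
ragged = {"name": ["Alice", "Bob"], "age": [30]}

try:
    pd.DataFrame(ragged)
    shape_error = None
except ValueError as exc:
    # pandas refuses to build a table whose columns differ in length.
    shape_error = str(exc)

# Mixed types are not rejected, but the column falls back to the generic
# "object" dtype, which makes the mismatch visible.
mixed = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, "twenty-five"]})
mixed_dtype = mixed["age"].dtype
```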

In [4]:
# Named Tuple + Dataclasses (closer)
from collections import namedtuple

Person = namedtuple("Person", ["name", "age"])
people = [Person("Alice", 30), Person("Bob", 25)]
# again, no vectorized operations or indexing
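
The dataclass variant looks much the same. This sketch assumes a hypothetical `Person` dataclass mirroring the named tuple above, and shows that pandas can build a DataFrame directly from a list of such records.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Person:
    name: str
    age: int  # annotation only; not checked at runtime


people = [Person("Alice", 30), Person("Bob", 25)]

# Filtering still means looping record by record.
older = [p for p in people if p.age > 25]

# pandas accepts a list of dataclass instances, one row per instance,
# with column names taken from the field names.
df = pd.DataFrame(people)
mean_age = df["age"].mean()
```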

# SQL Tables
```sql
CREATE TABLE people
(id int,
 name varchar(30),
 age int);

SELECT COUNT(*) FROM people WHERE age > 30;
```
- Very similar functionality!
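
To make the similarity concrete, here is a sketch that runs the same count in SQL and in pandas. It assumes an in-memory SQLite database from the Python standard library and three made-up rows.

```python
import sqlite3

import pandas as pd

# Hypothetical rows, used only for this comparison.
rows = [(1, "Alice", 30), (2, "Bob", 25), (3, "Carol", 41)]

# SQL: in-memory SQLite database from the standard library.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id int, name varchar(30), age int)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)
sql_count = conn.execute(
    "SELECT COUNT(*) FROM people WHERE age > 30"
).fetchone()[0]

# DataFrame: the WHERE clause becomes a boolean mask, COUNT(*) a sum.
df = pd.DataFrame(rows, columns=["id", "name", "age"])
df_count = int((df["age"] > 30).sum())

# The two can also be bridged: load a query result into a DataFrame.
from_sql = pd.read_sql_query("SELECT name, age FROM people", conn)
conn.close()
```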

# Summary
Each of these represents structured or semi-structured data, but only DataFrames combine labeling, tabular structure, and vectorized operations in one unified tool.

# So you want to use a DataFrame
# 🤔

# The options (some of them)
- Pandas ~2008
- PySpark ~2014
- Dask ~2015
- Polars ~2021
- Daft ~2023

# Pandas
[Docs](https://pandas.pydata.org/docs/reference/index.html)
- Huge user base
- Mature and flexible
- Single-threaded, memory-constrained
- Cython/C based

In [9]:
dummy_data = {
    "name": ["John", "Jane", "Jim", "Jill"],
    "age": [20, 21, 22, 23],
    "city": ["New York", "Los Angeles", "Chicago", "Houston"],
}

In [10]:
import pandas as pd

pd_df = pd.DataFrame(dummy_data)

avg_age_by_city_pandas = pd_df.groupby("city")["age"].mean()
avg_age_by_city_pandas.head()

city
Chicago        22.0
Houston        23.0
Los Angeles    21.0
New York       20.0
Name: age, dtype: float64
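
pandas normally loads a whole dataset into memory. One common workaround, sketched here under the assumption that the data arrives as CSV, is to read it in chunks and combine partial aggregates. The inline `csv_text` stands in for a large file, and the chunk size of 2 is only for demonstration.

```python
import io

import pandas as pd

# Stand-in for a large file on disk; a real case would pass a file path.
csv_text = "city,age\nChicago,22\nHouston,23\nChicago,30\nHouston,25\n"

partials = []
# chunksize makes read_csv yield DataFrames of at most N rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Keep only per-chunk sums and counts, not the raw rows.
    partials.append(chunk.groupby("city")["age"].agg(["sum", "count"]))

# Combine the partial results, then derive the mean from sum / count.
totals = pd.concat(partials).groupby(level=0).sum()
avg_age_by_city = totals["sum"] / totals["count"]
```

Averaging the per-chunk means directly would be wrong when chunks hold different numbers of rows per city, which is why the sketch carries sums and counts instead.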

# Polars
[Docs](https://docs.pola.rs/user-guide/getting-started/)
- Fast - parallel and lazy evaluation
- Less flexible for edge cases
- Rust based

In [7]:
import polars as pl

pl_df = pl.DataFrame(dummy_data)

avg_age_by_city_polars = pl_df.group_by("city").agg(pl.col("age").mean())
avg_age_by_city_polars.head()

city,age
str,f64
"Houston",23.0
"Los Angeles",21.0
"New York",20.0
"Chicago",22.0


# PySpark
[Docs](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html)
- Massively horizontally scalable
- SQL syntax support
- High overhead for small datasets
- JVM based (written in Scala)

In [19]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("city", StringType())
])

spk_df = spark.createDataFrame(pd_df,  # Note: Pandas DataFrame source
                               schema=schema)

avg_age_by_city_spark = spk_df.groupBy("city").mean("age")
avg_age_by_city_spark.show()


+-----------+--------+
|       city|avg(age)|
+-----------+--------+
|   New York|    20.0|
|Los Angeles|    21.0|
|    Chicago|    22.0|
|    Houston|    23.0|
+-----------+--------+



# Dask
[Docs](https://docs.dask.org/en/latest/dataframe-api.html)
- Parallel and lazy evaluation
- Similar to Pandas API
- Handles larger than memory datasets
- Python based
- Delegates to other libraries for execution

In [17]:
import dask.dataframe as dd

dask_df = dd.from_dict(dummy_data, npartitions=2)
avg_age_by_city_dask = dask_df.groupby("city")["age"].mean()
# Note: compute() triggers the lazy computation and returns a pandas Series
avg_age_by_city_dask_result = avg_age_by_city_dask.compute()
avg_age_by_city_dask_result.head()

city
Los Angeles    21.0
New York       20.0
Chicago        22.0
Houston        23.0
Name: age, dtype: float64

# Daft
[Docs](https://www.getdaft.io/projects/docs/en/stable/quickstart/)
- Python + Rust + SQL
- Scales Vertically and Horizontally

In [21]:
import daft

daft_df = daft.from_pydict(dummy_data)
avg_age_by_city_daft = daft_df.groupby("city").agg(daft.col("age").mean())

avg_age_by_city_daft.show()

city Utf8,age Float64
Chicago,22
New York,20
Houston,23
Los Angeles,21


In [None]:
# Side by side
pd_df.groupby("city")["age"].mean()  # Pandas
pl_df.group_by("city").agg(pl.col("age").mean())  # Polars
spk_df.groupBy("city").mean("age")  # PySpark
dask_df.groupby("city")["age"].mean()  # Dask
daft_df.groupby("city").agg(daft.col("age").mean())  # Daft

![Standards](./resources/standards.png)

# Enter Ibis (and others)
[Docs](https://ibis-project.org/tutorials/basics)
> Ibis defines a Python dataframe API that executes on any query engine – the frontend for any backend data platform, with nearly 20 backends today. This allows Ibis to have excellent performance – as good as the backend it is connected to – with a consistent user experience.

# Modin
[Docs](https://modin.readthedocs.io/en/latest/getting_started/why_modin/modin_vs_dask_vs_koalas.html)
> Libraries such as Dask DataFrame (DaskDF for short) and Koalas aim to support the pandas API on top of distributed computing frameworks, Dask and Spark respectively. Instead, Modin aims to preserve the pandas API and behavior as is, while abstracting away the details of the distributed computing framework underneath. Thus, the aims of these libraries are fundamentally different.

# More Modin
> Specifically, Modin enables pandas-like
> - row and column-parallel operations, unlike DaskDF and Koalas that only support row-parallel operations
> - indexing & ordering semantics, unlike DaskDF and Koalas that deviate from these semantics
> - eager execution, unlike DaskDF and Koalas that provide lazy execution

# Comparing

| Feature/Need             | pandas | Polars | Dask | PySpark | Daft  |
|--------------------------|--------|--------|------|---------|-------|
| Easy local work          | ✅     | ✅     | ⚠️   | ⚠️      | ✅    |
| Huge files (10M+ rows)   | ⚠️     | ✅     | ✅   | ✅      | ✅    |
| Multi-core               | ❌     | ✅     | ✅   | ✅      | ✅    |
| Clustered / distributed  | ❌     | ⚠️     | ✅   | ✅      | ✅    |
| SQL-like syntax          | ⚠️     | ⚠️     | ⚠️   | ✅      | ⚠️    |
| Learning curve           | Easy   | Medium | Medium | Steep  | Medium |

### Quick Benchmarks
![Benchmarks](./resources/benchmarks.png)