# Framing DataFrames

### Pyowa - 2025-05-27
### Adam Best

# About Me

- 6 years of experience as a data engineer - 10 years with data
- John Deere Engine Works - quality engineering
- General Dynamics IT - healthcare data warehouse
- Dwolla - data platform/warehouse + AWS
- John Deere Financial - Database Administration
- Tractor Zoom - data platform

# Goals
- Understanding of when and how to use a DataFrame
- What options are out there, and what tools might be best for your use case

### Will not cover
- Exact syntax

# What is a DataFrame?

### Programmable Excel
- 2D table of data
- Rows and columns
- Cells contain data
- Data is homogeneous within each column
- Data is aligned
- Data is indexed
- Data is mutable

# Common data tasks

- Selecting and filtering
- Aggregations - sum, count
- Joining or merging
- Grouping

In [1]:
# Example DataFrame
from pandas import DataFrame

df = DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Jill'],
    'age': [20, 21, 22, 23],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})
df.head()

|   | name | age | city |
|---|------|-----|------|
| 0 | John | 20 | New York |
| 1 | Jane | 21 | Los Angeles |
| 2 | Jim | 22 | Chicago |
| 3 | Jill | 23 | Houston |
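
The common data tasks listed earlier can be sketched against this same frame. This is a minimal pandas illustration, not a syntax reference; the `states` lookup table is hypothetical and exists only so there is something to join against.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["John", "Jane", "Jim", "Jill"],
    "age": [20, 21, 22, 23],
    "city": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# Hypothetical lookup table, invented here only to illustrate a join
states = pd.DataFrame({
    "city": ["New York", "Los Angeles", "Chicago", "Houston"],
    "state": ["NY", "CA", "IL", "TX"],
})

# Selecting and filtering: keep two columns for rows where age > 21
over_21 = people.loc[people["age"] > 21, ["name", "age"]]

# Aggregations: a sum and a count over one column
total_age = people["age"].sum()
n_people = people["age"].count()

# Joining or merging: attach the state for each person's city
with_state = people.merge(states, on="city", how="left")

# Grouping: number of people per state
people_per_state = with_state.groupby("state")["name"].count()
```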


# What isn't a DataFrame?

In [None]:
# List of Dicts
data = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
# No indexing or vectorized operations on fields
# e.g. fetch all records with age > 25
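
To make the "age > 25" comment concrete, here is a small sketch of the same filter done twice: once with a plain Python loop over the list of dicts, and once with a pandas boolean mask. The variable names are illustrative.

```python
import pandas as pd

data = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# Plain Python: an explicit pass over every record
older_records = [row for row in data if row["age"] > 25]

# DataFrame: one boolean mask applied to the whole column at once
df = pd.DataFrame(data)
older_rows = df[df["age"] > 25]
```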

In [None]:
# Dict of Lists
data = {"name": ["Alice", "Bob"], "age": [30, 25]}
# Column based, but lacks type and shape enforcement
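
A short sketch of what "lacks type and shape enforcement" can mean in practice: a plain dict accepts columns of different lengths, while pandas rejects them when the frame is built. The `ragged` name is just for illustration.

```python
import pandas as pd

# A plain dict happily holds columns of different lengths
ragged = {"name": ["Alice", "Bob"], "age": [30]}

try:
    pd.DataFrame(ragged)
    shape_error = None
except ValueError as exc:
    # pandas refuses columns whose lengths do not match
    shape_error = exc

# With consistent lengths, each column is stored with a single dtype
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})
age_dtype = df["age"].dtype
```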

In [None]:
# Named Tuple + Dataclasses (closer)
from collections import namedtuple

Person = namedtuple("Person", ["name", "age"])
people = [Person("Alice", 30), Person("Bob", 25)]
# again, no vectorized operations or indexing

# SQL Tables
```sql
CREATE TABLE people
(id int,
 name varchar(30),
 age int);

SELECT COUNT(*) FROM people WHERE age > 30;
```
- Very similar functionality!
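
As a rough illustration of how close the two are, the sketch below runs the same count in SQL and in pandas. It assumes an in-memory SQLite database from the standard library, and the three rows are made up for the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database; the rows are invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INT, name VARCHAR(30), age INT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "Alice", 30), (2, "Bob", 25), (3, "Carol", 41)],
)

# SQL: count the rows with age > 30
sql_count = conn.execute(
    "SELECT COUNT(*) FROM people WHERE age > 30"
).fetchone()[0]

# DataFrame: load the same table, then count with a boolean mask
people = pd.read_sql_query("SELECT * FROM people", conn)
df_count = int((people["age"] > 30).sum())

conn.close()
```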

# Summary
Each of these represents structured or semi-structured data, but only DataFrames combine labeling, tabular structure, and vectorized operations in one unified tool.

# So you want to use a DataFrame
# 🤔

# The options (some of them)
- Pandas
- Polars
- PySpark
- Dask

# Pandas
[Docs](https://pandas.pydata.org/docs/reference/index.html)
- huge userbase
- mature and flexible
- mostly single-threaded, and the data generally has to fit in memory

In [6]:
import pandas as pd

pandas_df = pd.DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Jill'],
    'age': [20, 21, 22, 23],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

avg_age_by_city_pandas = pandas_df.groupby("city")["age"].mean()
avg_age_by_city_pandas.head()

city
Chicago        22.0
Houston        23.0
Los Angeles    21.0
New York       20.0
Name: age, dtype: float64
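
One common way to work within the memory limit mentioned above is to process a file in chunks rather than loading it all at once. The sketch below assumes a tiny inline CSV as a stand-in for a large file and uses the `chunksize` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Stand-in for a large CSV file; a real file path would go here instead
csv_text = "name,age\nJohn,20\nJane,21\nJim,22\nJill,23\n"

total_age = 0
row_count = 0

# chunksize makes read_csv yield DataFrames of at most 2 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_age += int(chunk["age"].sum())
    row_count += len(chunk)

mean_age = total_age / row_count
```

Only one chunk is held in memory at a time, at the cost of writing the aggregation by hand.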

# Polars
[Docs](https://docs.pola.rs/user-guide/getting-started/)
- Fast - parallel and lazy evaluation
- Less flexible for edge cases

In [7]:
import polars as pl

polars_df = pl.DataFrame({
    'name': ['John', 'Jane', 'Jim', 'Jill'],
    'age': [20, 21, 22, 23],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

avg_age_by_city_polars = polars_df.group_by("city").agg(pl.col("age").mean())
avg_age_by_city_polars.head()

| city | age |
|------|-----|
| str | f64 |
| "Houston" | 23.0 |
| "Los Angeles" | 21.0 |
| "New York" | 20.0 |
| "Chicago" | 22.0 |


In [None]:
# Side by side
avg_age_by_city_pandas = pandas_df.groupby("city")["age"].mean()
avg_age_by_city_polars = polars_df.group_by("city").agg(pl.col("age").mean())

# (insert xkcd standards comic)


# PySpark
- Year: 
- 