Polars Gotchas

#1 Sorting rows independent of the other columns

The question I needed answered was why the following two commands leave the “salary” column looking so different. It is an easy point of confusion:

employees.with_columns(pl.col("salary").sort())
employees.with_columns(pl.col("salary").sort(descending=True))

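Neither command is what you want if the goal is to rank employees by salary. In both cases, pl.col("salary").sort() is an expression that sorts the salary column on its own, and with_columns() writes the sorted values back without reordering any other column. The salaries are sorted, but only within their own column, so they no longer line up with the people they belong to. The descending version only looks right at first glance because the first row (the CEO, 250000) happens to hold the highest salary. Compare Michael Fletcher: his salary is 96540 in the original data but 199503 in the output below. To sort whole rows by salary, sort the data frame itself: employees.sort("salary", descending=True).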

In [5]:
import polars as pl

employees = pl.read_csv("employees.csv", try_parse_dates=True)
employees.head(3)

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",96540,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",126489,10,2015-03-01


In [4]:
employees.with_columns(pl.col("salary").sort())
employees.with_columns(pl.col("salary").sort(descending=True))

name,department,email,salary,years_at_company,start_date
str,str,str,i64,i64,date
"""Nicholas Maldonado""","""CEO""","""nicholas.maldonado@polars.io""",250000,9,2016-07-14
"""Michael Fletcher""","""Operations""","""michael.fletcher@polars.io""",199503,9,2016-02-13
"""Jeffrey Tanner""",,"""jeffrey.tanner@polars.io""",199381,10,2015-03-01
"""Diana Weaver""","""HR""","""diana.weaver@polars.io""",199260,5,2019-11-25
"""Sierra Ross""",,"""sierra.ross@polars.io""",199257,7,2018-02-14
…,…,…,…,…,…
"""James Bryant""",,"""james.bryant@polars.io""",55304,9,2016-05-09
"""Patricia Vazquez""","""Operations""","""patricia.vazquez@polars.io""",55242,6,2019-02-20
"""Katie Clay""",,"""katie.clay@polars.io""",55078,0,2025-02-12
"""Monique Swanson""","""Finance""","""monique.swanson@polars.io""",55012,4,2020-11-07


#1 Eager vs Lazy Evaluation

https://docs.pola.rs/user-guide/concepts/lazy-api/

df.lazy().filter(...)  # Returns LazyFrame, doesn't execute
df.lazy().filter(...).collect()  # Actually executes

In [1]:
import polars as pl

# Sample data
df = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "David"],
        "age": [25, 30, 35, 40],
        "city": ["NYC", "LA", "NYC", "Chicago"],
    }
)

# ❌ GOTCHA: This doesn't actually filter anything!
lazy_result = df.lazy().filter(pl.col("age") > 30)
print("Lazy result (no execution):")
print(lazy_result)  # Prints the query plan, not the filtered rows
print()

Lazy result (no execution):
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

FILTER [(col("age")) > (30)]
FROM
  DF ["name", "age", "city"]; PROJECT */3 COLUMNS



In [2]:
# ✅ CORRECT: You must call .collect() to execute
actual_result = df.lazy().filter(pl.col("age") > 30).collect()
print("After .collect():")
print(actual_result)
print()

After .collect():
shape: (2, 3)
┌─────────┬─────┬─────────┐
│ name    ┆ age ┆ city    │
│ ---     ┆ --- ┆ ---     │
│ str     ┆ i64 ┆ str     │
╞═════════╪═════╪═════════╡
│ Charlie ┆ 35  ┆ NYC     │
│ David   ┆ 40  ┆ Chicago │
└─────────┴─────┴─────────┘



In [3]:
# Another common mistake: chaining operations
# ❌ This builds a query plan but doesn't run it
query = df.lazy().filter(pl.col("city") == "NYC").select(["name", "age"]).sort("age")
print("Query plan (not executed):")
print(query.explain())  # Shows what WOULD be executed
print()

# ✅ Execute with .collect()
result = query.collect()
print("Executed result:")
print(result)
print()

# Pro tip: Eager operations work immediately
eager_result = df.filter(pl.col("age") > 30)  # No .lazy(), executes instantly
print("Eager evaluation (immediate):")
print(eager_result)

Query plan (not executed):
SORT BY [col("age")]
  simple π 2/2 ["name", "age"]
    FILTER [(col("city")) == ("NYC")]
    FROM
      DF ["name", "age", "city"]; PROJECT["name", "age", "city"] 3/3 COLUMNS

Executed result:
shape: (2, 2)
┌─────────┬─────┐
│ name    ┆ age │
│ ---     ┆ --- │
│ str     ┆ i64 │
╞═════════╪═════╡
│ Alice   ┆ 25  │
│ Charlie ┆ 35  │
└─────────┴─────┘

Eager evaluation (immediate):
shape: (2, 3)
┌─────────┬─────┬─────────┐
│ name    ┆ age ┆ city    │
│ ---     ┆ --- ┆ ---     │
│ str     ┆ i64 ┆ str     │
╞═════════╪═════╪═════════╡
│ Charlie ┆ 35  ┆ NYC     │
│ David   ┆ 40  ┆ Chicago │
└─────────┴─────┴─────────┘


Eager mode (df.filter(...)) executes each step immediately, so Polars never sees the whole chain and misses optimization opportunities.

In [5]:
import polars as pl

# Large dataset simulation
df = pl.DataFrame(
    {
        "user_id": range(1_000_000),
        "age": [20 + (i % 50) for i in range(1_000_000)],
        "city": [["NYC", "LA", "Chicago"][i % 3] for i in range(1_000_000)],
        "salary": [50000 + (i % 100000) for i in range(1_000_000)],
        "department": [
            ["Sales", "Engineering", "HR", "Marketing"][i % 4] for i in range(1_000_000)
        ],
    }
)

# ❌ EAGER MODE: Each operation executes immediately, one at a time
print("EAGER MODE (less efficient):")
result_eager = (
    df.filter(pl.col("age") > 30)  # Scans entire df → creates intermediate result
    .filter(pl.col("city") == "NYC")  # Scans that intermediate result
    .select(["user_id", "salary"])  # Then selects columns
    .head(100)  # Finally takes first 100
)
print(result_eager)

print("\n" + "=" * 60 + "\n")

# ✅ LAZY MODE: Polars optimizes the entire query plan BEFORE executing
print("LAZY MODE (optimized):")
result_lazy = (
    df.lazy()
    .filter(pl.col("age") > 30)
    .filter(pl.col("city") == "NYC")
    .select(["user_id", "salary"])
    .head(100)
    .collect()
)
print(result_lazy)

# What Polars can optimize:
# 1. Predicate pushdown: Combines both filters into one pass
# 2. Projection pushdown: Only reads user_id, age, city, salary (not department)
# 3. Early limit: head(100) is part of the plan, so Polars may avoid materializing every match
# 4. Parallelization: Can process chunks in parallel

print("\n" + "=" * 60 + "\n")
print("Query plan (see the optimizations):")
print(
    df.lazy()
    .filter(pl.col("age") > 30)
    .filter(pl.col("city") == "NYC")
    .select(["user_id", "salary"])
    .head(100)
    .explain()
)

EAGER MODE (less efficient):
shape: (100, 2)
┌─────────┬────────┐
│ user_id ┆ salary │
│ ---     ┆ ---    │
│ i64     ┆ i64    │
╞═════════╪════════╡
│ 12      ┆ 50012  │
│ 15      ┆ 50015  │
│ 18      ┆ 50018  │
│ 21      ┆ 50021  │
│ 24      ┆ 50024  │
│ …       ┆ …      │
│ 375     ┆ 50375  │
│ 378     ┆ 50378  │
│ 381     ┆ 50381  │
│ 384     ┆ 50384  │
│ 387     ┆ 50387  │
└─────────┴────────┘


LAZY MODE (optimized):
shape: (100, 2)
┌─────────┬────────┐
│ user_id ┆ salary │
│ ---     ┆ ---    │
│ i64     ┆ i64    │
╞═════════╪════════╡
│ 12      ┆ 50012  │
│ 15      ┆ 50015  │
│ 18      ┆ 50018  │
│ 21      ┆ 50021  │
│ 24      ┆ 50024  │
│ …       ┆ …      │
│ 375     ┆ 50375  │
│ 378     ┆ 50378  │
│ 381     ┆ 50381  │
│ 384     ┆ 50384  │
│ 387     ┆ 50387  │
└─────────┴────────┘


Query plan (see the optimizations):
simple π 2/2 ["user_id", "salary"]
  SLICE[offset: 0, len: 100]
    FILTER [([(col("age")) > (30)]) & ([(col("city")) == ("NYC")])]
    FROM
      DF ["user_id", 

Key Optimizations in Lazy Mode:

Predicate Pushdown: Combines multiple filters into one scan

   # Eager: scan → filter age → scan the intermediate result → filter city
   # Lazy:  scan once with BOTH filters combined

Projection Pushdown: Only reads columns you actually need

   # Eager: carries all 5 columns through both filters, then drops 3
   # Lazy:  only reads the 4 columns needed for filtering + final output

Early Termination: Can stop when it has enough data

   # Eager: filters all 1M rows, then takes the first 100
   # Lazy:  knows about head(100) up front, so the engine may be able to stop early

Real Impact:

   Eager mode: 1M rows × 5 columns × 2 filter passes = lots of wasted work
   Lazy mode:  4 columns × 1 combined filter pass, only 100 rows kept = much less work

How much is actually saved depends on the engine and the data source, so treat these as rough orders of magnitude rather than measurements.

When does it matter?

Large datasets (100k+ rows)
Multiple operations chained together
Many columns, of which only a few are needed
Filtering before selecting/aggregating

When is eager fine?

Small datasets
Single operations
Interactive exploration where you want immediate feedback


#2 select() vs with_columns()

df.select(pl.col("a"))  # Returns ONLY column "a"
df.with_columns(pl.col("a"))  # Returns ALL columns plus modified "a"



In [7]:
# A simple example of the select() vs with_columns() gotcha
import polars as pl

# Create a sample DataFrame
df = pl.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "city": ["NYC", "LA", "Chicago"],
    }
)

print("Original DataFrame:")
print(df)
print("\n" + "=" * 50 + "\n")

Original DataFrame:
shape: (3, 3)
┌─────────┬─────┬─────────┐
│ name    ┆ age ┆ city    │
│ ---     ┆ --- ┆ ---     │
│ str     ┆ i64 ┆ str     │
╞═════════╪═════╪═════════╡
│ Alice   ┆ 25  ┆ NYC     │
│ Bob     ┆ 30  ┆ LA      │
│ Charlie ┆ 35  ┆ Chicago │
└─────────┴─────┴─────────┘




In [8]:
# Using select() - Returns ONLY the selected column
result_select = df.select(pl.col("age") * 2)
print("Using select(pl.col('age') * 2):")
print(result_select)
print(f"Columns: {result_select.columns}")
print("\n" + "=" * 50 + "\n")

Using select(pl.col('age') * 2):
shape: (3, 1)
┌─────┐
│ age │
│ --- │
│ i64 │
╞═════╡
│ 50  │
│ 60  │
│ 70  │
└─────┘
Columns: ['age']




In [9]:
# Using with_columns() - Returns ALL columns plus the modified one
result_with_columns = df.with_columns((pl.col("age") * 2).alias("age"))
print("Using with_columns((pl.col('age') * 2).alias('age')):")
print(result_with_columns)
print(f"Columns: {result_with_columns.columns}")

Using with_columns((pl.col('age') * 2).alias('age')):
shape: (3, 3)
┌─────────┬─────┬─────────┐
│ name    ┆ age ┆ city    │
│ ---     ┆ --- ┆ ---     │
│ str     ┆ i64 ┆ str     │
╞═════════╪═════╪═════════╡
│ Alice   ┆ 50  ┆ NYC     │
│ Bob     ┆ 60  ┆ LA      │
│ Charlie ┆ 70  ┆ Chicago │
└─────────┴─────┴─────────┘
Columns: ['name', 'age', 'city']


#3 Expression Context Matters
pl.col("a").sum()  # Sums column "a" from top to bottom
pl.sum("a")  # Same thing: shorthand for pl.col("a").sum()

Takeaway:

pl.col("a").sum() and pl.sum("a") both aggregate vertically, down a column, and pl.sum("a", "b", "c") returns one sum per column. Neither adds values across columns. For a row-wise (horizontal) sum, use pl.sum_horizontal("a", "b", "c").

In [13]:
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

In [11]:
# Expression context - vertical aggregation
df.select(pl.col("a").sum())
# Output: 6 (sum down column a: 1+2+3)

a
i64
6


In [12]:
# Function shorthand with several names - still vertical, one sum per column
df.select(pl.sum("a", "b", "c"))
# Output: a=6, b=15, c=24 (column sums, not row-wise sums)

a,b,c
i64,i64,i64
6,15,24


# 4