# Complete Guide to Polars for Pandas/PySpark Users

This notebook introduces Polars to readers who already know pandas or PySpark, working from basic data structures up to performance optimization and advanced features.

## Table of Contents
1. [Introduction & Setup](#1-introduction--setup)
2. [Basic Data Structures](#2-basic-data-structures)
3. [Creating DataFrames](#3-creating-dataframes)
4. [Reading & Writing Data](#4-reading--writing-data)
5. [Data Selection & Filtering](#5-data-selection--filtering)
6. [Expressions - The Heart of Polars](#6-expressions---the-heart-of-polars)
7. [Transformations & Column Operations](#7-transformations--column-operations)
8. [Aggregations & GroupBy](#8-aggregations--groupby)
9. [Joins & Concatenations](#9-joins--concatenations)
10. [Lazy vs Eager Evaluation](#10-lazy-vs-eager-evaluation)
11. [Time Series Operations](#11-time-series-operations)
12. [String Operations](#12-string-operations)
13. [Window Functions](#13-window-functions)
14. [Performance Optimization](#14-performance-optimization)
15. [Advanced Features](#15-advanced-features)

## 1. Introduction & Setup

### What is Polars?
- **Fast**: Written in Rust, optimized for performance
- **Efficient**: Uses Apache Arrow columnar format
- **Expressive**: Rich expression API
- **Lazy**: Built-in query optimization

### Key Differences from Pandas/PySpark
| Feature | Pandas | PySpark | Polars |
|---------|--------|---------|--------|
| Speed | Moderate | Fast (distributed) | Very Fast (single node) |
| Memory | Copies data often | Distributed | Zero-copy views |
| API Style | Method chaining | SQL-like | Expression-based |
| Lazy Evaluation | No | Yes | Yes |
| Parallelization | Limited | Distributed | Multi-threaded |

In [1]:
# Install Polars (run this if not already installed)
# !pip install polars

import polars as pl
import numpy as np
from datetime import datetime, timedelta

# Check version
print(f"Polars version: {pl.__version__}")

# Set display options
pl.Config.set_tbl_rows(10)

Polars version: 1.34.0


polars.config.Config

## 2. Basic Data Structures

Polars has two main data structures:
- **Series**: 1D array (like pandas Series)
- **DataFrame**: 2D table (like pandas DataFrame)

In [2]:
# Creating a Series
s = pl.Series("numbers", [1, 2, 3, 4, 5, 0])
print("Series:")
print(s)
print(f"\nDtype: {s.dtype}")
print(f"Length: {len(s)}")

Series:
shape: (6,)
Series: 'numbers' [i64]
[
	1
	2
	3
	4
	5
	0
]

Dtype: Int64
Length: 6


In [3]:
s.abs()

shape: (6,)
Series: 'numbers' [i64]
[
	1
	2
	3
	4
	5
	0
]


In [4]:
s.filter(s > 0)

shape: (5,)
Series: 'numbers' [i64]
[
	1
	2
	3
	4
	5
]


In [5]:
# Series with different dtypes
int_series = pl.Series("integers", [1, 2, 3], dtype=pl.Int64)
float_series = pl.Series("floats", [1.0, 2.5, 3.7], dtype=pl.Float64)
str_series = pl.Series("strings", ["a", "b", "c"], dtype=pl.String)  # pl.Utf8 is an older alias of pl.String
bool_series = pl.Series("booleans", [True, False, True], dtype=pl.Boolean)

print("Int Series:", int_series.to_list())
print("Float Series:", float_series.to_list())
print("String Series:", str_series.to_list())
print("Boolean Series:", bool_series.to_list())

Int Series: [1, 2, 3]
Float Series: [1.0, 2.5, 3.7]
String Series: ['a', 'b', 'c']
Boolean Series: [True, False, True]


## 3. Creating DataFrames

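Polars offers several ways to create a DataFrame. The cells below build one from a dictionary, from a list of dictionaries, and from a NumPy array.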

In [6]:
# Method 1: From dictionary
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 40, 28],
    "city": ["New York", "London", "Paris", "Tokyo", "Berlin"],
    "salary": [70000, 80000, 90000, 95000, 75000]
})

print("DataFrame from dictionary:")
print(df)

DataFrame from dictionary:
shape: (5, 4)
┌─────────┬─────┬──────────┬────────┐
│ name    ┆ age ┆ city     ┆ salary │
│ ---     ┆ --- ┆ ---      ┆ ---    │
│ str     ┆ i64 ┆ str      ┆ i64    │
╞═════════╪═════╪══════════╪════════╡
│ Alice   ┆ 25  ┆ New York ┆ 70000  │
│ Bob     ┆ 30  ┆ London   ┆ 80000  │
│ Charlie ┆ 35  ┆ Paris    ┆ 90000  │
│ David   ┆ 40  ┆ Tokyo    ┆ 95000  │
│ Eve     ┆ 28  ┆ Berlin   ┆ 75000  │
└─────────┴─────┴──────────┴────────┘


In [7]:
# Method 2: From list of dictionaries (row-oriented)
data = [
    {"product": "A", "quantity": 10, "price": 100},
    {"product": "B", "quantity": 20, "price": 200},
    {"product": "C", "quantity": 15, "price": 150},
]

df2 = pl.DataFrame(data)
print("DataFrame from list of dicts:")
print(df2)

DataFrame from list of dicts:
shape: (3, 3)
┌─────────┬──────────┬───────┐
│ product ┆ quantity ┆ price │
│ ---     ┆ ---      ┆ ---   │
│ str     ┆ i64      ┆ i64   │
╞═════════╪══════════╪═══════╡
│ A       ┆ 10       ┆ 100   │
│ B       ┆ 20       ┆ 200   │
│ C       ┆ 15       ┆ 150   │
└─────────┴──────────┴───────┘


In [8]:
# Common Polars dtypes (pl.Utf8 is an alias of pl.String, so it prints as String)
all_types = [
    pl.Int8, pl.Int16, pl.Int32, pl.Int64, pl.Int128,
    pl.UInt8, pl.UInt16, pl.UInt32, pl.UInt64, pl.UInt128,
    pl.Float32, pl.Float64, 
    pl.Boolean,
    pl.String, pl.Utf8,
    pl.Null,
    pl.Unknown
]

for dtype in all_types:
    print(dtype)

Int8
Int16
Int32
Int64
Int128
UInt8
UInt16
UInt32
UInt64
UInt128
Float32
Float64
Boolean
String
String
Null
Unknown


In [9]:
# Method 3: From NumPy array
arr = np.random.randn(5, 3)
df3 = pl.DataFrame(arr, schema=["col1", "col2", "col3"])
print("DataFrame from NumPy:")
print(df3)

DataFrame from NumPy:
shape: (5, 3)
┌───────────┬───────────┬───────────┐
│ col1      ┆ col2      ┆ col3      │
│ ---       ┆ ---       ┆ ---       │
│ f64       ┆ f64       ┆ f64       │
╞═══════════╪═══════════╪═══════════╡
│ 0.68228   ┆ 0.513249  ┆ 0.317999  │
│ 1.652426  ┆ 0.110475  ┆ 1.679442  │
│ 0.250377  ┆ -1.387383 ┆ 0.028438  │
│ 0.449553  ┆ -2.280736 ┆ -1.031757 │
│ -1.406318 ┆ -1.727953 ┆ -1.208249 │
└───────────┴───────────┴───────────┘


In [10]:
# Basic DataFrame info (similar to pandas)
print("Shape:", df.shape)
print("\nColumn names:", df.columns)
print("\nDtypes:", df.dtypes)
print("\nSchema:")
print(df.schema)

Shape: (5, 4)

Column names: ['name', 'age', 'city', 'salary']

Dtypes: [String, Int64, String, Int64]

Schema:
Schema({'name': String, 'age': Int64, 'city': String, 'salary': Int64})


In [11]:
# Quick statistics
print("Describe:")
print(df.describe())

Describe:
shape: (9, 5)
┌────────────┬───────┬─────────┬────────┬──────────────┐
│ statistic  ┆ name  ┆ age     ┆ city   ┆ salary       │
│ ---        ┆ ---   ┆ ---     ┆ ---    ┆ ---          │
│ str        ┆ str   ┆ f64     ┆ str    ┆ f64          │
╞════════════╪═══════╪═════════╪════════╪══════════════╡
│ count      ┆ 5     ┆ 5.0     ┆ 5      ┆ 5.0          │
│ null_count ┆ 0     ┆ 0.0     ┆ 0      ┆ 0.0          │
│ mean       ┆ null  ┆ 31.6    ┆ null   ┆ 82000.0      │
│ std        ┆ null  ┆ 5.94138 ┆ null   ┆ 10368.220677 │
│ min        ┆ Alice ┆ 25.0    ┆ Berlin ┆ 70000.0      │
│ 25%        ┆ null  ┆ 28.0    ┆ null   ┆ 75000.0      │
│ 50%        ┆ null  ┆ 30.0    ┆ null   ┆ 80000.0      │
│ 75%        ┆ null  ┆ 35.0    ┆ null   ┆ 90000.0      │
│ max        ┆ Eve   ┆ 40.0    ┆ Tokyo  ┆ 95000.0      │
└────────────┴───────┴─────────┴────────┴──────────────┘


## 4. Reading & Writing Data

Polars reads and writes CSV, Parquet, and JSON, among other formats. The cells below round-trip the same sample data through each of them.

In [12]:
# Create sample data for I/O examples
sample_df = pl.DataFrame({
    "id": range(1, 1001),
    "name": [f"User_{i}" for i in range(1, 1001)],
    "score": np.random.randint(0, 100, 1000),
    "timestamp": [datetime.now() - timedelta(days=i) for i in range(1000)]
})

print(sample_df.head())

shape: (5, 4)
┌─────┬────────┬───────┬────────────────────────────┐
│ id  ┆ name   ┆ score ┆ timestamp                  │
│ --- ┆ ---    ┆ ---   ┆ ---                        │
│ i64 ┆ str    ┆ i64   ┆ datetime[μs]               │
╞═════╪════════╪═══════╪════════════════════════════╡
│ 1   ┆ User_1 ┆ 4     ┆ 2025-11-01 17:58:18.429212 │
│ 2   ┆ User_2 ┆ 67    ┆ 2025-10-31 17:58:18.429218 │
│ 3   ┆ User_3 ┆ 13    ┆ 2025-10-30 17:58:18.429219 │
│ 4   ┆ User_4 ┆ 8     ┆ 2025-10-29 17:58:18.429220 │
│ 5   ┆ User_5 ┆ 93    ┆ 2025-10-28 17:58:18.429220 │
└─────┴────────┴───────┴────────────────────────────┘


In [13]:
# Writing to CSV
sample_df.write_csv("data.csv")
print("Written to CSV")

# Reading from CSV (CSV stores no type information, so timestamp comes back as str)
df_csv = pl.read_csv("data.csv")
print("\nRead from CSV:")
print(df_csv.head())

Written to CSV

Read from CSV:
shape: (5, 4)
┌─────┬────────┬───────┬────────────────────────────┐
│ id  ┆ name   ┆ score ┆ timestamp                  │
│ --- ┆ ---    ┆ ---   ┆ ---                        │
│ i64 ┆ str    ┆ i64   ┆ str                        │
╞═════╪════════╪═══════╪════════════════════════════╡
│ 1   ┆ User_1 ┆ 4     ┆ 2025-11-01T17:58:18.429212 │
│ 2   ┆ User_2 ┆ 67    ┆ 2025-10-31T17:58:18.429218 │
│ 3   ┆ User_3 ┆ 13    ┆ 2025-10-30T17:58:18.429219 │
│ 4   ┆ User_4 ┆ 8     ┆ 2025-10-29T17:58:18.429220 │
│ 5   ┆ User_5 ┆ 93    ┆ 2025-10-28T17:58:18.429220 │
└─────┴────────┴───────┴────────────────────────────┘


In [14]:
# Parquet (recommended for performance)
sample_df.write_parquet("data.parquet")
df_parquet = pl.read_parquet("data.parquet")
print("Read from Parquet:")
print(df_parquet.head())

Read from Parquet:
shape: (5, 4)
┌─────┬────────┬───────┬────────────────────────────┐
│ id  ┆ name   ┆ score ┆ timestamp                  │
│ --- ┆ ---    ┆ ---   ┆ ---                        │
│ i64 ┆ str    ┆ i64   ┆ datetime[μs]               │
╞═════╪════════╪═══════╪════════════════════════════╡
│ 1   ┆ User_1 ┆ 4     ┆ 2025-11-01 17:58:18.429212 │
│ 2   ┆ User_2 ┆ 67    ┆ 2025-10-31 17:58:18.429218 │
│ 3   ┆ User_3 ┆ 13    ┆ 2025-10-30 17:58:18.429219 │
│ 4   ┆ User_4 ┆ 8     ┆ 2025-10-29 17:58:18.429220 │
│ 5   ┆ User_5 ┆ 93    ┆ 2025-10-28 17:58:18.429220 │
└─────┴────────┴───────┴────────────────────────────┘


In [15]:
# JSON (as with CSV, the timestamp column is read back as str; see the output below)
sample_df.head(5).write_json("data.json")
df_json = pl.read_json("data.json")
print("Read from JSON:")
print(df_json)

Read from JSON:
shape: (5, 4)
┌─────┬────────┬───────┬────────────────────────────┐
│ id  ┆ name   ┆ score ┆ timestamp                  │
│ --- ┆ ---    ┆ ---   ┆ ---                        │
│ i64 ┆ str    ┆ i64   ┆ str                        │
╞═════╪════════╪═══════╪════════════════════════════╡
│ 1   ┆ User_1 ┆ 4     ┆ 2025-11-01 17:58:18.429212 │
│ 2   ┆ User_2 ┆ 67    ┆ 2025-10-31 17:58:18.429218 │
│ 3   ┆ User_3 ┆ 13    ┆ 2025-10-30 17:58:18.429219 │
│ 4   ┆ User_4 ┆ 8     ┆ 2025-10-29 17:58:18.429220 │
│ 5   ┆ User_5 ┆ 93    ┆ 2025-10-28 17:58:18.429220 │
└─────┴────────┴───────┴────────────────────────────┘


In [17]:
# Lazy reading (for large files) - reads only when needed
lazy_df = pl.scan_csv("data.csv")
print("Lazy DataFrame (not yet loaded):")
print(lazy_df)

# Collect to execute
result = lazy_df.head(3).collect()
print("\nCollected result:")
print(result)

Lazy DataFrame (not yet loaded):
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

Csv SCAN [data.csv]
PROJECT */4 COLUMNS

Collected result:
shape: (3, 4)
┌─────┬────────┬───────┬────────────────────────────┐
│ id  ┆ name   ┆ score ┆ timestamp                  │
│ --- ┆ ---    ┆ ---   ┆ ---                        │
│ i64 ┆ str    ┆ i64   ┆ str                        │
╞═════╪════════╪═══════╪════════════════════════════╡
│ 1   ┆ User_1 ┆ 4     ┆ 2025-11-01T17:58:18.429212 │
│ 2   ┆ User_2 ┆ 67    ┆ 2025-10-31T17:58:18.429218 │
│ 3   ┆ User_3 ┆ 13    ┆ 2025-10-30T17:58:18.429219 │
└─────┴────────┴───────┴────────────────────────────┘


## 5. Data Selection & Filtering

Polars uses expressions for powerful and efficient data selection

In [18]:
# Create sample data
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 28, 45],
    "city": ["New York", "London", "Paris", "Tokyo", "Berlin", "Sydney"],
    "salary": [70000, 80000, 90000, 95000, 75000, 100000],
    "department": ["IT", "HR", "IT", "Finance", "HR", "IT"]
})

print("Sample DataFrame:")
print(df)

Sample DataFrame:
shape: (6, 5)
┌─────────┬─────┬──────────┬────────┬────────────┐
│ name    ┆ age ┆ city     ┆ salary ┆ department │
│ ---     ┆ --- ┆ ---      ┆ ---    ┆ ---        │
│ str     ┆ i64 ┆ str      ┆ i64    ┆ str        │
╞═════════╪═════╪══════════╪════════╪════════════╡
│ Alice   ┆ 25  ┆ New York ┆ 70000  ┆ IT         │
│ Bob     ┆ 30  ┆ London   ┆ 80000  ┆ HR         │
│ Charlie ┆ 35  ┆ Paris    ┆ 90000  ┆ IT         │
│ David   ┆ 40  ┆ Tokyo    ┆ 95000  ┆ Finance    │
│ Eve     ┆ 28  ┆ Berlin   ┆ 75000  ┆ HR         │
│ Frank   ┆ 45  ┆ Sydney   ┆ 100000 ┆ IT         │
└─────────┴─────┴──────────┴────────┴────────────┘


In [19]:
# Select columns
print("Select single column:")
print(df.select("name"))

print("\nSelect multiple columns:")
print(df.select(["name", "age", "salary"]))

Select single column:
shape: (6, 1)
┌─────────┐
│ name    │
│ ---     │
│ str     │
╞═════════╡
│ Alice   │
│ Bob     │
│ Charlie │
│ David   │
│ Eve     │
│ Frank   │
└─────────┘

Select multiple columns:
shape: (6, 3)
┌─────────┬─────┬────────┐
│ name    ┆ age ┆ salary │
│ ---     ┆ --- ┆ ---    │
│ str     ┆ i64 ┆ i64    │
╞═════════╪═════╪════════╡
│ Alice   ┆ 25  ┆ 70000  │
│ Bob     ┆ 30  ┆ 80000  │
│ Charlie ┆ 35  ┆ 90000  │
│ David   ┆ 40  ┆ 95000  │
│ Eve     ┆ 28  ┆ 75000  │
│ Frank   ┆ 45  ┆ 100000 │
└─────────┴─────┴────────┘


In [21]:
# Select using expressions (pl.col)
print("Select with expressions:")
print(df.select([
    pl.col("name"),
    pl.col("age"),
    pl.col("salary")
]))

Select with expressions:
shape: (6, 3)
┌─────────┬─────┬────────┐
│ name    ┆ age ┆ salary │
│ ---     ┆ --- ┆ ---    │
│ str     ┆ i64 ┆ i64    │
╞═════════╪═════╪════════╡
│ Alice   ┆ 25  ┆ 70000  │
│ Bob     ┆ 30  ┆ 80000  │
│ Charlie ┆ 35  ┆ 90000  │
│ David   ┆ 40  ┆ 95000  │
│ Eve     ┆ 28  ┆ 75000  │
│ Frank   ┆ 45  ┆ 100000 │
└─────────┴─────┴────────┘


In [23]:
# Select by dtype (pl.Int64 matches only 64-bit integer columns, not every numeric type)
print("Select numeric columns:")
print(df.select(pl.col(pl.Int64)))

print("\nSelect string columns:")
print(df.select(pl.col(pl.String)))

Select numeric columns:
shape: (6, 2)
┌─────┬────────┐
│ age ┆ salary │
│ --- ┆ ---    │
│ i64 ┆ i64    │
╞═════╪════════╡
│ 25  ┆ 70000  │
│ 30  ┆ 80000  │
│ 35  ┆ 90000  │
│ 40  ┆ 95000  │
│ 28  ┆ 75000  │
│ 45  ┆ 100000 │
└─────┴────────┘

Select string columns:
shape: (6, 3)
┌─────────┬──────────┬────────────┐
│ name    ┆ city     ┆ department │
│ ---     ┆ ---      ┆ ---        │
│ str     ┆ str      ┆ str        │
╞═════════╪══════════╪════════════╡
│ Alice   ┆ New York ┆ IT         │
│ Bob     ┆ London   ┆ HR         │
│ Charlie ┆ Paris    ┆ IT         │
│ David   ┆ Tokyo    ┆ Finance    │
│ Eve     ┆ Berlin   ┆ HR         │
│ Frank   ┆ Sydney   ┆ IT         │
└─────────┴──────────┴────────────┘


In [24]:
# Filter rows (similar to pandas query or SQL WHERE)
print("Filter age > 30:")
print(df.filter(pl.col("age") > 30))

Filter age > 30:
shape: (3, 5)
┌─────────┬─────┬────────┬────────┬────────────┐
│ name    ┆ age ┆ city   ┆ salary ┆ department │
│ ---     ┆ --- ┆ ---    ┆ ---    ┆ ---        │
│ str     ┆ i64 ┆ str    ┆ i64    ┆ str        │
╞═════════╪═════╪════════╪════════╪════════════╡
│ Charlie ┆ 35  ┆ Paris  ┆ 90000  ┆ IT         │
│ David   ┆ 40  ┆ Tokyo  ┆ 95000  ┆ Finance    │
│ Frank   ┆ 45  ┆ Sydney ┆ 100000 ┆ IT         │
└─────────┴─────┴────────┴────────┴────────────┘


In [25]:
# Multiple conditions: combine with & (and) or | (or); wrap each condition in parentheses
print("Filter with multiple conditions (age > 30 AND salary > 80000):")
print(df.filter(
    (pl.col("age") > 30) & (pl.col("salary") > 80000)
))

Filter with multiple conditions (age > 30 AND salary > 80000):
shape: (3, 5)
┌─────────┬─────┬────────┬────────┬────────────┐
│ name    ┆ age ┆ city   ┆ salary ┆ department │
│ ---     ┆ --- ┆ ---    ┆ ---    ┆ ---        │
│ str     ┆ i64 ┆ str    ┆ i64    ┆ str        │
╞═════════╪═════╪════════╪════════╪════════════╡
│ Charlie ┆ 35  ┆ Paris  ┆ 90000  ┆ IT         │
│ David   ┆ 40  ┆ Tokyo  ┆ 95000  ┆ Finance    │
│ Frank   ┆ 45  ┆ Sydney ┆ 100000 ┆ IT         │
└─────────┴─────┴────────┴────────┴────────────┘


In [27]:
# String filtering
print("Filter department == 'IT':")
print(df.filter(pl.col("department") == "IT"))

print("\nFilter city contains 'o':")
print(df.filter(pl.col("city").str.contains("o")))

Filter department == 'IT':
shape: (3, 5)
┌─────────┬─────┬──────────┬────────┬────────────┐
│ name    ┆ age ┆ city     ┆ salary ┆ department │
│ ---     ┆ --- ┆ ---      ┆ ---    ┆ ---        │
│ str     ┆ i64 ┆ str      ┆ i64    ┆ str        │
╞═════════╪═════╪══════════╪════════╪════════════╡
│ Alice   ┆ 25  ┆ New York ┆ 70000  ┆ IT         │
│ Charlie ┆ 35  ┆ Paris    ┆ 90000  ┆ IT         │
│ Frank   ┆ 45  ┆ Sydney   ┆ 100000 ┆ IT         │
└─────────┴─────┴──────────┴────────┴────────────┘

Filter city contains 'o':
shape: (3, 5)
┌───────┬─────┬──────────┬────────┬────────────┐
│ name  ┆ age ┆ city     ┆ salary ┆ department │
│ ---   ┆ --- ┆ ---      ┆ ---    ┆ ---        │
│ str   ┆ i64 ┆ str      ┆ i64    ┆ str        │
╞═══════╪═════╪══════════╪════════╪════════════╡
│ Alice ┆ 25  ┆ New York ┆ 70000  ┆ IT         │
│ Bob   ┆ 30  ┆ London   ┆ 80000  ┆ HR         │
│ David ┆ 40  ┆ Tokyo    ┆ 95000  ┆ Finance    │
└───────┴─────┴──────────┴────────┴────────────┘


In [28]:
# is_in (the Polars counterpart of pandas isin)
print("Filter names in list:")
print(df.filter(pl.col("name").is_in(["Alice", "Bob", "Charlie"])))

Filter names in list:
shape: (3, 5)
┌─────────┬─────┬──────────┬────────┬────────────┐
│ name    ┆ age ┆ city     ┆ salary ┆ department │
│ ---     ┆ --- ┆ ---      ┆ ---    ┆ ---        │
│ str     ┆ i64 ┆ str      ┆ i64    ┆ str        │
╞═════════╪═════╪══════════╪════════╪════════════╡
│ Alice   ┆ 25  ┆ New York ┆ 70000  ┆ IT         │
│ Bob     ┆ 30  ┆ London   ┆ 80000  ┆ HR         │
│ Charlie ┆ 35  ┆ Paris    ┆ 90000  ┆ IT         │
└─────────┴─────┴──────────┴────────┴────────────┘


In [29]:
# head, tail, sample
print("First 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

print("\nRandom sample (2 rows):")
print(df.sample(n=2, seed=42))

First 3 rows:
shape: (3, 5)
┌─────────┬─────┬──────────┬────────┬────────────┐
│ name    ┆ age ┆ city     ┆ salary ┆ department │
│ ---     ┆ --- ┆ ---      ┆ ---    ┆ ---        │
│ str     ┆ i64 ┆ str      ┆ i64    ┆ str        │
╞═════════╪═════╪══════════╪════════╪════════════╡
│ Alice   ┆ 25  ┆ New York ┆ 70000  ┆ IT         │
│ Bob     ┆ 30  ┆ London   ┆ 80000  ┆ HR         │
│ Charlie ┆ 35  ┆ Paris    ┆ 90000  ┆ IT         │
└─────────┴─────┴──────────┴────────┴────────────┘

Last 2 rows:
shape: (2, 5)
┌───────┬─────┬────────┬────────┬────────────┐
│ name  ┆ age ┆ city   ┆ salary ┆ department │
│ ---   ┆ --- ┆ ---    ┆ ---    ┆ ---        │
│ str   ┆ i64 ┆ str    ┆ i64    ┆ str        │
╞═══════╪═════╪════════╪════════╪════════════╡
│ Eve   ┆ 28  ┆ Berlin ┆ 75000  ┆ HR         │
│ Frank ┆ 45  ┆ Sydney ┆ 100000 ┆ IT         │
└───────┴─────┴────────┴────────┴────────────┘

Random sample (2 rows):
shape: (2, 5)
┌──────┬─────┬────────┬────────┬────────────┐
│ name ┆ age ┆ city   ┆ 

## 6. Expressions - The Heart of Polars

Expressions are what make Polars powerful and fast. They are:
- **Composable**: Can be chained together
- **Parallelizable**: Automatically run in parallel
- **Optimizable**: Query optimizer improves performance

In [30]:
# Basic expression
print("Double the salary:")
print(df.select([
    pl.col("name"),
    (pl.col("salary") * 2).alias("doubled_salary")
]))

Double the salary:
shape: (6, 2)
┌─────────┬────────────────┐
│ name    ┆ doubled_salary │
│ ---     ┆ ---            │
│ str     ┆ i64            │
╞═════════╪════════════════╡
│ Alice   ┆ 140000         │
│ Bob     ┆ 160000         │
│ Charlie ┆ 180000         │
│ David   ┆ 190000         │
│ Eve     ┆ 150000         │
│ Frank   ┆ 200000         │
└─────────┴────────────────┘


In [31]:
# Multiple operations in one select
print("Multiple expressions:")
print(df.select([
    pl.col("name"),
    pl.col("age"),
    (pl.col("salary") / 1000).alias("salary_k"),
    (pl.col("age") > 30).alias("is_senior")
]))

Multiple expressions:
shape: (6, 4)
┌─────────┬─────┬──────────┬───────────┐
│ name    ┆ age ┆ salary_k ┆ is_senior │
│ ---     ┆ --- ┆ ---      ┆ ---       │
│ str     ┆ i64 ┆ f64      ┆ bool      │
╞═════════╪═════╪══════════╪═══════════╡
│ Alice   ┆ 25  ┆ 70.0     ┆ false     │
│ Bob     ┆ 30  ┆ 80.0     ┆ false     │
│ Charlie ┆ 35  ┆ 90.0     ┆ true      │
│ David   ┆ 40  ┆ 95.0     ┆ true      │
│ Eve     ┆ 28  ┆ 75.0     ┆ false     │
│ Frank   ┆ 45  ┆ 100.0    ┆ true      │
└─────────┴─────┴──────────┴───────────┘


In [33]:
# with_columns: add or overwrite columns while keeping all existing ones
print("Add new columns:")
result = df.with_columns([
    (pl.col("salary") * 1.1).alias("salary_after_raise"),
    (pl.col("age") + 1).alias("age_next_year")
])
print(result)

Add new columns:
shape: (6, 7)
┌─────────┬─────┬──────────┬────────┬────────────┬────────────────────┬───────────────┐
│ name    ┆ age ┆ city     ┆ salary ┆ department ┆ salary_after_raise ┆ age_next_year │
│ ---     ┆ --- ┆ ---      ┆ ---    ┆ ---        ┆ ---                ┆ ---           │
│ str     ┆ i64 ┆ str      ┆ i64    ┆ str        ┆ f64                ┆ i64           │
╞═════════╪═════╪══════════╪════════╪════════════╪════════════════════╪═══════════════╡
│ Alice   ┆ 25  ┆ New York ┆ 70000  ┆ IT         ┆ 77000.0            ┆ 26            │
│ Bob     ┆ 30  ┆ London   ┆ 80000  ┆ HR         ┆ 88000.0            ┆ 31            │
│ Charlie ┆ 35  ┆ Paris    ┆ 90000  ┆ IT         ┆ 99000.0            ┆ 36            │
│ David   ┆ 40  ┆ Tokyo    ┆ 95000  ┆ Finance    ┆ 104500.0           ┆ 41            │
│ Eve     ┆ 28  ┆ Berlin   ┆ 75000  ┆ HR         ┆ 82500.0            ┆ 29            │
│ Frank   ┆ 45  ┆ Sydney   ┆ 100000 ┆ IT         ┆ 110000.0           ┆ 46            │
└─────────┴─────┴──────────┴────────┴────────────┴────────────────────┴───────────────┘

In [35]:
# Conditional expressions (when-then-otherwise)
print("Conditional column:")
result = df.with_columns([
    pl.when(pl.col("age") < 30)
      .then(pl.lit("Young"))
      .when(pl.col("age") < 40)
      .then(pl.lit("Middle"))
      .otherwise(pl.lit("Senior"))
      .alias("age_group")
])
print(result)

Conditional column:
shape: (6, 6)
┌─────────┬─────┬──────────┬────────┬────────────┬───────────┐
│ name    ┆ age ┆ city     ┆ salary ┆ department ┆ age_group │
│ ---     ┆ --- ┆ ---      ┆ ---    ┆ ---        ┆ ---       │
│ str     ┆ i64 ┆ str      ┆ i64    ┆ str        ┆ str       │
╞═════════╪═════╪══════════╪════════╪════════════╪═══════════╡
│ Alice   ┆ 25  ┆ New York ┆ 70000  ┆ IT         ┆ Young     │
│ Bob     ┆ 30  ┆ London   ┆ 80000  ┆ HR         ┆ Middle    │
│ Charlie ┆ 35  ┆ Paris    ┆ 90000  ┆ IT         ┆ Middle    │
│ David   ┆ 40  ┆ Tokyo    ┆ 95000  ┆ Finance    ┆ Senior    │
│ Eve     ┆ 28  ┆ Berlin   ┆ 75000  ┆ HR         ┆ Young     │
│ Frank   ┆ 45  ┆ Sydney   ┆ 100000 ┆ IT         ┆ Senior    │
└─────────┴─────┴──────────┴────────┴────────────┴───────────┘


In [38]:
# Expression aliases and chaining
print("Chained expressions:")
result = df.select([
    pl.col("name").str.to_uppercase().alias("name_upper"),
    pl.col("salary").log10().round(2).alias("log_salary")
])
print(result)

Chained expressions:
shape: (6, 2)
┌────────────┬────────────┐
│ name_upper ┆ log_salary │
│ ---        ┆ ---        │
│ str        ┆ f64        │
╞════════════╪════════════╡
│ ALICE      ┆ 4.85       │
│ BOB        ┆ 4.9        │
│ CHARLIE    ┆ 4.95       │
│ DAVID      ┆ 4.98       │
│ EVE        ┆ 4.88       │
│ FRANK      ┆ 5.0        │
└────────────┴────────────┘


## 7. Transformations & Column Operations

Common data transformation operations

In [39]:
# Sorting
print("Sort by age (descending):")
print(df.sort("age", descending=True))

print("\nSort by multiple columns:")
print(df.sort(["department", "salary"], descending=[False, True]))

Sort by age (descending):
shape: (6, 5)
┌─────────┬─────┬──────────┬────────┬────────────┐
│ name    ┆ age ┆ city     ┆ salary ┆ department │
│ ---     ┆ --- ┆ ---      ┆ ---    ┆ ---        │
│ str     ┆ i64 ┆ str      ┆ i64    ┆ str        │
╞═════════╪═════╪══════════╪════════╪════════════╡
│ Frank   ┆ 45  ┆ Sydney   ┆ 100000 ┆ IT         │
│ David   ┆ 40  ┆ Tokyo    ┆ 95000  ┆ Finance    │
│ Charlie ┆ 35  ┆ Paris    ┆ 90000  ┆ IT         │
│ Bob     ┆ 30  ┆ London   ┆ 80000  ┆ HR         │
│ Eve     ┆ 28  ┆ Berlin   ┆ 75000  ┆ HR         │
│ Alice   ┆ 25  ┆ New York ┆ 70000  ┆ IT         │
└─────────┴─────┴──────────┴────────┴────────────┘

Sort by multiple columns:
shape: (6, 5)
┌─────────┬─────┬──────────┬────────┬────────────┐
│ name    ┆ age ┆ city     ┆ salary ┆ department │
│ ---     ┆ --- ┆ ---      ┆ ---    ┆ ---        │
│ str     ┆ i64 ┆ str      ┆ i64    ┆ str        │
╞═════════╪═════╪══════════╪════════╪════════════╡
│ David   ┆ 40  ┆ Tokyo    ┆ 95000  ┆ Finance    │
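│ Bob     ┆ 30  ┆ London   ┆ 80000  ┆ HR         │
│ Eve     ┆ 28  ┆ Berlin   ┆ 75000  ┆ HR         │
│ Frank   ┆ 45  ┆ Sydney   ┆ 100000 ┆ IT         │
│ Charlie ┆ 35  ┆ Paris    ┆ 90000  ┆ IT         │
│ Alice   ┆ 25  ┆ New York ┆ 70000  ┆ IT         │
└─────────┴─────┴──────────┴────────┴────────────┘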

In [40]:
# Rename columns
print("Rename columns:")
renamed = df.rename({"name": "employee_name", "salary": "annual_salary"})
print(renamed.columns)

Rename columns:
['employee_name', 'age', 'city', 'annual_salary', 'department']


In [41]:
# Drop columns
print("Drop columns:")
print(df.drop(["city", "department"]).columns)

Drop columns:
['name', 'age', 'salary']


In [44]:
# Cast dtypes
print("Cast age to float:")
result = df.with_columns(pl.col("age").cast(pl.Float64))
print(result.dtypes)

Cast age to float:
[String, Float64, String, Int64, String]


In [46]:
result.head(2)

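shape: (2, 5)
┌───────┬──────┬──────────┬────────┬────────────┐
│ name  ┆ age  ┆ city     ┆ salary ┆ department │
│ ---   ┆ ---  ┆ ---      ┆ ---    ┆ ---        │
│ str   ┆ f64  ┆ str      ┆ i64    ┆ str        │
╞═══════╪══════╪══════════╪════════╪════════════╡
│ Alice ┆ 25.0 ┆ New York ┆ 70000  ┆ IT         │
│ Bob   ┆ 30.0 ┆ London   ┆ 80000  ┆ HR         │
└───────┴──────┴──────────┴────────┴────────────┘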


In [48]:
# Null handling
df_with_nulls = pl.DataFrame({
    "a": [1, 2, None, 4, None],
    "b": ["x", None, "y", "z", None]
})

print("DataFrame with nulls:")
print(df_with_nulls)

print("\nFill nulls:")
print(df_with_nulls.fill_null(strategy="forward"))

print("\nFill with specific value:")
print(df_with_nulls.fill_null(0))

print("\nDrop nulls:")
print(df_with_nulls.drop_nulls())

DataFrame with nulls:
shape: (5, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ null │
│ null ┆ y    │
│ 4    ┆ z    │
│ null ┆ null │
└──────┴──────┘

Fill nulls:
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ x   │
│ 2   ┆ x   │
│ 2   ┆ y   │
│ 4   ┆ z   │
│ 4   ┆ z   │
└─────┴─────┘

Fill with specific value:
shape: (5, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ x    │
│ 2   ┆ null │
│ 0   ┆ y    │
│ 4   ┆ z    │
│ 0   ┆ null │
└─────┴──────┘

Drop nulls:
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ x   │
│ 4   ┆ z   │
└─────┴─────┘


In [49]:
# Unique and duplicates
print("Unique values in department:")
print(df.select(pl.col("department").unique()))

print("\nCount unique values:")
print(df.select(pl.col("department").n_unique()))

Unique values in department:
shape: (3, 1)
┌────────────┐
│ department │
│ ---        │
│ str        │
╞════════════╡
│ HR         │
│ IT         │
│ Finance    │
└────────────┘

Count unique values:
shape: (1, 1)
┌────────────┐
│ department │
│ ---        │
│ u32        │
╞════════════╡
│ 3          │
└────────────┘


## 8. Aggregations & GroupBy

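Aggregations use the same expression API as `select`. `group_by(...).agg(...)` plays the role of pandas `groupby`, and one `agg` call can combine several aggregations across columns.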

In [50]:
# Basic aggregations
print("Mean salary:")
print(df.select(pl.col("salary").mean()))

print("\nMultiple aggregations:")
print(df.select([
    pl.col("salary").mean().alias("mean_salary"),
    pl.col("salary").median().alias("median_salary"),
    pl.col("salary").std().alias("std_salary"),
    pl.col("age").min().alias("min_age"),
    pl.col("age").max().alias("max_age")
]))

Mean salary:
shape: (1, 1)
┌─────────┐
│ salary  │
│ ---     │
│ f64     │
╞═════════╡
│ 85000.0 │
└─────────┘

Multiple aggregations:
shape: (1, 5)
┌─────────────┬───────────────┬──────────────┬─────────┬─────────┐
│ mean_salary ┆ median_salary ┆ std_salary   ┆ min_age ┆ max_age │
│ ---         ┆ ---           ┆ ---          ┆ ---     ┆ ---     │
│ f64         ┆ f64           ┆ f64          ┆ i64     ┆ i64     │
╞═════════════╪═══════════════╪══════════════╪═════════╪═════════╡
│ 85000.0     ┆ 85000.0       ┆ 11832.159566 ┆ 25      ┆ 45      │
└─────────────┴───────────────┴──────────────┴─────────┴─────────┘


In [51]:
# GroupBy - basic
print("Group by department:")
print(df.group_by("department").agg([
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("age").mean().alias("avg_age"),
    pl.count().alias("count")
]).sort("department"))

Group by department:
shape: (3, 4)
┌────────────┬──────────────┬─────────┬───────┐
│ department ┆ avg_salary   ┆ avg_age ┆ count │
│ ---        ┆ ---          ┆ ---     ┆ ---   │
│ str        ┆ f64          ┆ f64     ┆ u32   │
╞════════════╪══════════════╪═════════╪═══════╡
│ Finance    ┆ 95000.0      ┆ 40.0    ┆ 1     │
│ HR         ┆ 77500.0      ┆ 29.0    ┆ 2     │
│ IT         ┆ 86666.666667 ┆ 35.0    ┆ 3     │
└────────────┴──────────────┴─────────┴───────┘


Note: this cell emits a deprecation warning for `pl.count()` (deprecated in version 0.20.5). The next cell uses `pl.len()` instead.


In [52]:
# GroupBy - basic, using pl.len() in place of the deprecated pl.count()
print("Group by department:")
print(df.group_by("department").agg([
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("count")
]).sort("department"))

Group by department:
shape: (3, 4)
┌────────────┬──────────────┬─────────┬───────┐
│ department ┆ avg_salary   ┆ avg_age ┆ count │
│ ---        ┆ ---          ┆ ---     ┆ ---   │
│ str        ┆ f64          ┆ f64     ┆ u32   │
╞════════════╪══════════════╪═════════╪═══════╡
│ Finance    ┆ 95000.0      ┆ 40.0    ┆ 1     │
│ HR         ┆ 77500.0      ┆ 29.0    ┆ 2     │
│ IT         ┆ 86666.666667 ┆ 35.0    ┆ 3     │
└────────────┴──────────────┴─────────┴───────┘


In [55]:
# GroupBy - multiple aggregations per column
print("Multiple aggregations:")
print(df.group_by("department").agg([
    pl.col("salary").min().alias("min_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("name").count().alias("employee_count")
]).sort("max_salary"))

Multiple aggregations:
shape: (3, 5)
┌────────────┬────────────┬────────────┬──────────────┬────────────────┐
│ department ┆ min_salary ┆ max_salary ┆ avg_salary   ┆ employee_count │
│ ---        ┆ ---        ┆ ---        ┆ ---          ┆ ---            │
│ str        ┆ i64        ┆ i64        ┆ f64          ┆ u32            │
╞════════════╪════════════╪════════════╪══════════════╪════════════════╡
│ HR         ┆ 75000      ┆ 80000      ┆ 77500.0      ┆ 2              │
│ Finance    ┆ 95000      ┆ 95000      ┆ 95000.0      ┆ 1              │
│ IT         ┆ 70000      ┆ 100000     ┆ 86666.666667 ┆ 3              │
└────────────┴────────────┴────────────┴──────────────┴────────────────┘


In [58]:
# GroupBy with multiple keys
df_extended = df.with_columns(
    pl.when(pl.col("age") < 35)
      .then(pl.lit("Young"))
      .otherwise(pl.lit("Senior"))
      .alias("age_category")
)

print("Group by multiple columns:")
print(df_extended.group_by(["department", "age_category"]).agg([
    pl.col("salary").mean().alias("avg_salary"),
    pl.len().alias("count")
]).sort(["department", "age_category"]))

Group by multiple columns:
shape: (4, 4)
┌────────────┬──────────────┬────────────┬───────┐
│ department ┆ age_category ┆ avg_salary ┆ count │
│ ---        ┆ ---          ┆ ---        ┆ ---   │
│ str        ┆ str          ┆ f64        ┆ u32   │
╞════════════╪══════════════╪════════════╪═══════╡
│ Finance    ┆ Senior       ┆ 95000.0    ┆ 1     │
│ HR         ┆ Young        ┆ 77500.0    ┆ 2     │
│ IT         ┆ Senior       ┆ 95000.0    ┆ 2     │
│ IT         ┆ Young        ┆ 70000.0    ┆ 1     │
└────────────┴──────────────┴────────────┴───────┘


In [62]:
# Advanced aggregations
print("List aggregation (collect names per department):")
print(df.group_by("department").agg([
    pl.col("name").alias("employees"),
    pl.col("salary").sum().alias("total_salary")
]).sort("department"))

List aggregation (collect names per department):
shape: (3, 3)
┌────────────┬───────────────────────────────┬──────────────┐
│ department ┆ employees                     ┆ total_salary │
│ ---        ┆ ---                           ┆ ---          │
│ str        ┆ list[str]                     ┆ i64          │
╞════════════╪═══════════════════════════════╪══════════════╡
│ Finance    ┆ ["David"]                     ┆ 95000        │
│ HR         ┆ ["Bob", "Eve"]                ┆ 155000       │
│ IT         ┆ ["Alice", "Charlie", "Frank"] ┆ 260000       │
└────────────┴───────────────────────────────┴──────────────┘


In [63]:
# Quantiles and percentiles
print("Salary percentiles:")
print(df.select([
    pl.col("salary").quantile(0.25).alias("p25"),
    pl.col("salary").quantile(0.50).alias("p50"),
    pl.col("salary").quantile(0.75).alias("p75"),
    pl.col("salary").quantile(0.90).alias("p90")
]))

Salary percentiles:
shape: (1, 4)
┌─────────┬─────────┬─────────┬──────────┐
│ p25     ┆ p50     ┆ p75     ┆ p90      │
│ ---     ┆ ---     ┆ ---     ┆ ---      │
│ f64     ┆ f64     ┆ f64     ┆ f64      │
╞═════════╪═════════╪═════════╪══════════╡
│ 75000.0 ┆ 90000.0 ┆ 95000.0 ┆ 100000.0 │
└─────────┴─────────┴─────────┴──────────┘


## 9. Joins & Concatenations

Combining DataFrames - similar to SQL joins and pandas merge/concat

In [64]:
# Create sample DataFrames for joining
employees = pl.DataFrame({
    "emp_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "dept_id": [10, 20, 10, 30, 20]
})

departments = pl.DataFrame({
    "dept_id": [10, 20, 30, 40],
    "dept_name": ["IT", "HR", "Finance", "Marketing"]
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

Employees:
shape: (5, 3)
┌────────┬─────────┬─────────┐
│ emp_id ┆ name    ┆ dept_id │
│ ---    ┆ ---     ┆ ---     │
│ i64    ┆ str     ┆ i64     │
╞════════╪═════════╪═════════╡
│ 1      ┆ Alice   ┆ 10      │
│ 2      ┆ Bob     ┆ 20      │
│ 3      ┆ Charlie ┆ 10      │
│ 4      ┆ David   ┆ 30      │
│ 5      ┆ Eve     ┆ 20      │
└────────┴─────────┴─────────┘

Departments:
shape: (4, 2)
┌─────────┬───────────┐
│ dept_id ┆ dept_name │
│ ---     ┆ ---       │
│ i64     ┆ str       │
╞═════════╪═══════════╡
│ 10      ┆ IT        │
│ 20      ┆ HR        │
│ 30      ┆ Finance   │
│ 40      ┆ Marketing │
└─────────┴───────────┘


In [65]:
# Inner join
print("Inner join:")
print(employees.join(departments, on="dept_id", how="inner"))

Inner join:
shape: (5, 4)
┌────────┬─────────┬─────────┬───────────┐
│ emp_id ┆ name    ┆ dept_id ┆ dept_name │
│ ---    ┆ ---     ┆ ---     ┆ ---       │
│ i64    ┆ str     ┆ i64     ┆ str       │
╞════════╪═════════╪═════════╪═══════════╡
│ 1      ┆ Alice   ┆ 10      ┆ IT        │
│ 2      ┆ Bob     ┆ 20      ┆ HR        │
│ 3      ┆ Charlie ┆ 10      ┆ IT        │
│ 4      ┆ David   ┆ 30      ┆ Finance   │
│ 5      ┆ Eve     ┆ 20      ┆ HR        │
└────────┴─────────┴─────────┴───────────┘


In [66]:
# Left join
print("Left join:")
print(employees.join(departments, on="dept_id", how="left"))

Left join:
shape: (5, 4)
┌────────┬─────────┬─────────┬───────────┐
│ emp_id ┆ name    ┆ dept_id ┆ dept_name │
│ ---    ┆ ---     ┆ ---     ┆ ---       │
│ i64    ┆ str     ┆ i64     ┆ str       │
╞════════╪═════════╪═════════╪═══════════╡
│ 1      ┆ Alice   ┆ 10      ┆ IT        │
│ 2      ┆ Bob     ┆ 20      ┆ HR        │
│ 3      ┆ Charlie ┆ 10      ┆ IT        │
│ 4      ┆ David   ┆ 30      ┆ Finance   │
│ 5      ┆ Eve     ┆ 20      ┆ HR        │
└────────┴─────────┴─────────┴───────────┘


In [67]:
# Full (outer) join: how="outer" was deprecated in Polars 0.20.29 in favor of how="full"
print("Full join:")
print(employees.join(departments, on="dept_id", how="full"))

Full join:
shape: (6, 5)
┌────────┬─────────┬─────────┬───────────────┬───────────┐
│ emp_id ┆ name    ┆ dept_id ┆ dept_id_right ┆ dept_name │
│ ---    ┆ ---     ┆ ---     ┆ ---           ┆ ---       │
│ i64    ┆ str     ┆ i64     ┆ i64           ┆ str       │
╞════════╪═════════╪═════════╪═══════════════╪═══════════╡
│ 1      ┆ Alice   ┆ 10      ┆ 10            ┆ IT        │
│ 2      ┆ Bob     ┆ 20      ┆ 20            ┆ HR        │
│ 3      ┆ Charlie ┆ 10      ┆ 10            ┆ IT        │
│ 4      ┆ David   ┆ 30      ┆ 30            ┆ Finance   │
│ 5      ┆ Eve     ┆ 20      ┆ 20            ┆ HR        │
│ null   ┆ null    ┆ null    ┆ 40            ┆ Marketing │
└────────┴─────────┴─────────┴───────────────┴───────────┘


In [68]:
# Join with different column names
salaries = pl.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "salary": [70000, 80000, 90000, 95000, 75000]
})

print("Join on different column names:")
print(employees.join(salaries, left_on="emp_id", right_on="employee_id"))

Join on different column names:
shape: (5, 4)
┌────────┬─────────┬─────────┬────────┐
│ emp_id ┆ name    ┆ dept_id ┆ salary │
│ ---    ┆ ---     ┆ ---     ┆ ---    │
│ i64    ┆ str     ┆ i64     ┆ i64    │
╞════════╪═════════╪═════════╪════════╡
│ 1      ┆ Alice   ┆ 10      ┆ 70000  │
│ 2      ┆ Bob     ┆ 20      ┆ 80000  │
│ 3      ┆ Charlie ┆ 10      ┆ 90000  │
│ 4      ┆ David   ┆ 30      ┆ 95000  │
│ 5      ┆ Eve     ┆ 20      ┆ 75000  │
└────────┴─────────┴─────────┴────────┘


In [69]:
# Concatenation - vertical (like SQL UNION ALL or pandas concat axis=0; duplicates are kept)
df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "b": [7, 8]})

print("Vertical concatenation:")
print(pl.concat([df1, df2]))

Vertical concatenation:
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
│ 2   ┆ 4   │
│ 5   ┆ 7   │
│ 6   ┆ 8   │
└─────┴─────┘


In [72]:
df1.with_row_index("idx")

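shape: (2, 3)
┌─────┬─────┬─────┐
│ idx ┆ a   ┆ b   │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0   ┆ 1   ┆ 3   │
│ 1   ┆ 2   ┆ 4   │
└─────┴─────┴─────┘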


In [73]:
# Concatenation - horizontal (like pandas concat axis=1)
df3 = pl.DataFrame({"c": [9, 10]})

print("Horizontal concatenation:")
print(pl.concat([df1, df3], how="horizontal"))

Horizontal concatenation:
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 9   │
│ 2   ┆ 4   ┆ 10  │
└─────┴─────┴─────┘


## 10. Lazy vs Eager Evaluation

Lazy evaluation is one of Polars' most powerful features: building a query only records a plan, which the optimizer can rewrite before `collect()` executes it

In [74]:
# Eager execution (default)
print("Eager execution:")
result_eager = (
    df.filter(pl.col("age") > 30)
      .select(["name", "salary"])
      .sort("salary", descending=True)
)
print(result_eager)

Eager execution:
shape: (3, 2)
┌─────────┬────────┐
│ name    ┆ salary │
│ ---     ┆ ---    │
│ str     ┆ i64    │
╞═════════╪════════╡
│ Frank   ┆ 100000 │
│ David   ┆ 95000  │
│ Charlie ┆ 90000  │
└─────────┴────────┘


In [75]:
# Lazy execution - convert to lazy
print("Lazy execution (not yet computed):")
lazy_query = (
    df.lazy()
      .filter(pl.col("age") > 30)
      .select(["name", "salary"])
      .sort("salary", descending=True)
)
print(lazy_query)
print("\nQuery plan:")
print(lazy_query.explain())

Lazy execution (not yet computed):
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)

SORT BY [descending: [true]] [col("salary")]
  SELECT [col("name"), col("salary")]
    FILTER [(col("age")) > (30)]
    FROM
      DF ["name", "age", "city", "salary", ...]; PROJECT */5 COLUMNS

Query plan:
SORT BY [descending: [true]] [col("salary")]
  simple π 2/2 ["name", "salary"]
    FILTER [(col("age")) > (30)]
    FROM
      DF ["name", "age", "city", "salary", ...]; PROJECT["name", "salary", "age"] 3/5 COLUMNS


In [76]:
# Execute lazy query with collect()
print("Collected result:")
result_lazy = lazy_query.collect()
print(result_lazy)

Collected result:
shape: (3, 2)
┌─────────┬────────┐
│ name    ┆ salary │
│ ---     ┆ ---    │
│ str     ┆ i64    │
╞═════════╪════════╡
│ Frank   ┆ 100000 │
│ David   ┆ 95000  │
│ Charlie ┆ 90000  │
└─────────┴────────┘


In [77]:
# Example showing optimization benefits
# Polars will optimize this to only read necessary columns
lazy_optimized = (
    pl.scan_csv("data.csv")
      .select(["name", "score"])  # Only these columns will be read from CSV
      .filter(pl.col("score") > 50)
      .head(10)
)

print("Optimized query plan:")
print(lazy_optimized.explain())

# Execute
print("\nResult:")
print(lazy_optimized.collect())

Optimized query plan:
SLICE[offset: 0, len: 10]
  Csv SCAN [data.csv]
  PROJECT 2/4 COLUMNS
  SELECTION: [(col("score")) > (50)]

Result:
shape: (10, 2)
┌─────────┬───────┐
│ name    ┆ score │
│ ---     ┆ ---   │
│ str     ┆ i64   │
╞═════════╪═══════╡
│ User_2  ┆ 67    │
│ User_5  ┆ 93    │
│ User_8  ┆ 53    │
│ User_12 ┆ 74    │
│ User_13 ┆ 82    │
│ User_14 ┆ 91    │
│ User_17 ┆ 59    │
│ User_19 ┆ 97    │
│ User_20 ┆ 82    │
│ User_23 ┆ 70    │
└─────────┴───────┘


## 11. Time Series Operations

Working with dates and times in Polars

In [78]:
# Create time series data
from datetime import date

ts_df = pl.DataFrame({
    "date": pl.date_range(
        date(2024, 1, 1),
        date(2024, 12, 31),
        interval="1d",
        eager=True
    ),
    "value": np.random.randn(366).cumsum()
})

print("Time series data:")
print(ts_df.head(10))

Time series data:
shape: (10, 2)
┌────────────┬───────────┐
│ date       ┆ value     │
│ ---        ┆ ---       │
│ date       ┆ f64       │
╞════════════╪═══════════╡
│ 2024-01-01 ┆ -0.699253 │
│ 2024-01-02 ┆ -1.789907 │
│ 2024-01-03 ┆ -2.940818 │
│ 2024-01-04 ┆ -4.756508 │
│ 2024-01-05 ┆ -5.301946 │
│ 2024-01-06 ┆ -5.64176  │
│ 2024-01-07 ┆ -5.26977  │
│ 2024-01-08 ┆ -6.63814  │
│ 2024-01-09 ┆ -6.554948 │
│ 2024-01-10 ┆ -7.978885 │
└────────────┴───────────┘


In [79]:
# Extract date components
print("Extract date components:")
result = ts_df.with_columns([
    pl.col("date").dt.year().alias("year"),
    pl.col("date").dt.month().alias("month"),
    pl.col("date").dt.day().alias("day"),
    pl.col("date").dt.weekday().alias("weekday"),
    pl.col("date").dt.quarter().alias("quarter")
]).head(10)
print(result)

Extract date components:
shape: (10, 7)
┌────────────┬───────────┬──────┬───────┬─────┬─────────┬─────────┐
│ date       ┆ value     ┆ year ┆ month ┆ day ┆ weekday ┆ quarter │
│ ---        ┆ ---       ┆ ---  ┆ ---   ┆ --- ┆ ---     ┆ ---     │
│ date       ┆ f64       ┆ i32  ┆ i8    ┆ i8  ┆ i8      ┆ i8      │
╞════════════╪═══════════╪══════╪═══════╪═════╪═════════╪═════════╡
│ 2024-01-01 ┆ -0.699253 ┆ 2024 ┆ 1     ┆ 1   ┆ 1       ┆ 1       │
│ 2024-01-02 ┆ -1.789907 ┆ 2024 ┆ 1     ┆ 2   ┆ 2       ┆ 1       │
│ 2024-01-03 ┆ -2.940818 ┆ 2024 ┆ 1     ┆ 3   ┆ 3       ┆ 1       │
│ 2024-01-04 ┆ -4.756508 ┆ 2024 ┆ 1     ┆ 4   ┆ 4       ┆ 1       │
│ 2024-01-05 ┆ -5.301946 ┆ 2024 ┆ 1     ┆ 5   ┆ 5       ┆ 1       │
│ 2024-01-06 ┆ -5.64176  ┆ 2024 ┆ 1     ┆ 6   ┆ 6       ┆ 1       │
│ 2024-01-07 ┆ -5.26977  ┆ 2024 ┆ 1     ┆ 7   ┆ 7       ┆ 1       │
│ 2024-01-08 ┆ -6.63814  ┆ 2024 ┆ 1     ┆ 8   ┆ 1       ┆ 1       │
│ 2024-01-09 ┆ -6.554948 ┆ 2024 ┆ 1     ┆ 9   ┆ 2       ┆ 1       │
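│ 2024-01-10 ┆ -7.978885 ┆ 2024 ┆ 1     ┆ 10  ┆ 3       ┆ 1       │
└────────────┴───────────┴──────┴───────┴─────┴─────────┴─────────┘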

In [80]:
# Datetime arithmetic
print("Add days to date:")
result = ts_df.with_columns(
    (pl.col("date") + pl.duration(days=7)).alias("date_plus_week")
).head(5)
print(result)

Add days to date:
shape: (5, 3)
┌────────────┬───────────┬────────────────┐
│ date       ┆ value     ┆ date_plus_week │
│ ---        ┆ ---       ┆ ---            │
│ date       ┆ f64       ┆ date           │
╞════════════╪═══════════╪════════════════╡
│ 2024-01-01 ┆ -0.699253 ┆ 2024-01-08     │
│ 2024-01-02 ┆ -1.789907 ┆ 2024-01-09     │
│ 2024-01-03 ┆ -2.940818 ┆ 2024-01-10     │
│ 2024-01-04 ┆ -4.756508 ┆ 2024-01-11     │
│ 2024-01-05 ┆ -5.301946 ┆ 2024-01-12     │
└────────────┴───────────┴────────────────┘


In [81]:
# Resample and aggregate (like pandas resample)
print("Monthly aggregation:")
monthly = (
    ts_df.group_by_dynamic("date", every="1mo")
         .agg([
             pl.col("value").mean().alias("avg_value"),
             pl.col("value").min().alias("min_value"),
             pl.col("value").max().alias("max_value")
         ])
)
print(monthly)

Monthly aggregation:
shape: (12, 4)
┌────────────┬────────────┬────────────┬────────────┐
│ date       ┆ avg_value  ┆ min_value  ┆ max_value  │
│ ---        ┆ ---        ┆ ---        ┆ ---        │
│ date       ┆ f64        ┆ f64        ┆ f64        │
╞════════════╪════════════╪════════════╪════════════╡
│ 2024-01-01 ┆ -6.350379  ┆ -10.022018 ┆ -0.699253  │
│ 2024-02-01 ┆ -1.11954   ┆ -5.24      ┆ 2.215408   │
│ 2024-03-01 ┆ -2.040125  ┆ -6.737816  ┆ 2.438064   │
│ 2024-04-01 ┆ -4.735409  ┆ -9.777551  ┆ -1.911747  │
│ 2024-05-01 ┆ -2.500507  ┆ -4.727688  ┆ -0.857876  │
│ …          ┆ …          ┆ …          ┆ …          │
│ 2024-08-01 ┆ -21.442485 ┆ -24.797178 ┆ -16.060661 │
│ 2024-09-01 ┆ -26.900073 ┆ -32.66983  ┆ -19.268094 │
│ 2024-10-01 ┆ -35.113639 ┆ -38.371231 ┆ -30.392645 │
│ 2024-11-01 ┆ -32.572655 ┆ -39.252343 ┆ -29.949984 │
│ 2024-12-01 ┆ -30.502186 ┆ -33.390259 ┆ -26.927342 │
└────────────┴────────────┴────────────┴────────────┘


In [82]:
# Rolling window operations
print("7-day rolling average:")
result = ts_df.with_columns(
    pl.col("value").rolling_mean(window_size=7).alias("rolling_avg_7d")
).head(20)
print(result)

7-day rolling average:
shape: (20, 3)
┌────────────┬───────────┬────────────────┐
│ date       ┆ value     ┆ rolling_avg_7d │
│ ---        ┆ ---       ┆ ---            │
│ date       ┆ f64       ┆ f64            │
╞════════════╪═══════════╪════════════════╡
│ 2024-01-01 ┆ -0.699253 ┆ null           │
│ 2024-01-02 ┆ -1.789907 ┆ null           │
│ 2024-01-03 ┆ -2.940818 ┆ null           │
│ 2024-01-04 ┆ -4.756508 ┆ null           │
│ 2024-01-05 ┆ -5.301946 ┆ null           │
│ …          ┆ …         ┆ …              │
│ 2024-01-16 ┆ -8.814478 ┆ -9.354376      │
│ 2024-01-17 ┆ -7.908645 ┆ -9.344342      │
│ 2024-01-18 ┆ -8.074913 ┆ -9.081206      │
│ 2024-01-19 ┆ -7.392987 ┆ -8.736533      │
│ 2024-01-20 ┆ -8.291755 ┆ -8.533982      │
└────────────┴───────────┴────────────────┘


## 12. String Operations

String manipulation in Polars

In [83]:
# Create string data
str_df = pl.DataFrame({
    "text": [
        "hello world",
        "POLARS is FAST",
        "  pandas  ",
        "data-science-2024",
        "user@example.com"
    ]
})

print("String data:")
print(str_df)

String data:
shape: (5, 1)
┌───────────────────┐
│ text              │
│ ---               │
│ str               │
╞═══════════════════╡
│ hello world       │
│ POLARS is FAST    │
│   pandas          │
│ data-science-2024 │
│ user@example.com  │
└───────────────────┘


In [84]:
# String methods
print("String transformations:")
result = str_df.with_columns([
    pl.col("text").str.to_uppercase().alias("upper"),
    pl.col("text").str.to_lowercase().alias("lower"),
    pl.col("text").str.strip_chars().alias("stripped"),
    pl.col("text").str.len_chars().alias("length")
])
print(result)

String transformations:
shape: (5, 5)
┌───────────────────┬───────────────────┬───────────────────┬───────────────────┬────────┐
│ text              ┆ upper             ┆ lower             ┆ stripped          ┆ length │
│ ---               ┆ ---               ┆ ---               ┆ ---               ┆ ---    │
│ str               ┆ str               ┆ str               ┆ str               ┆ u32    │
╞═══════════════════╪═══════════════════╪═══════════════════╪═══════════════════╪════════╡
│ hello world       ┆ HELLO WORLD       ┆ hello world       ┆ hello world       ┆ 11     │
│ POLARS is FAST    ┆ POLARS IS FAST    ┆ polars is fast    ┆ POLARS is FAST    ┆ 14     │
│   pandas          ┆   PANDAS          ┆   pandas          ┆ pandas            ┆ 10     │
│ data-science-2024 ┆ DATA-SCIENCE-2024 ┆ data-science-2024 ┆ data-science-2024 ┆ 17     │
│ user@example.com  ┆ USER@EXAMPLE.COM  ┆ user@example.com  ┆ user@example.com  ┆ 16     │
└───────────────────┴───────────────────┴───────────────────┴───────────────────┴────────┘

In [86]:
# String contains, starts_with, ends_with
print("String matching:")
result = str_df.with_columns([
    pl.col("text").str.contains("a").alias("contains_a"),
    pl.col("text").str.starts_with("h").alias("starts_h"),
    pl.col("text").str.ends_with("m").alias("ends_m")
])
print(result)

String matching:
shape: (5, 4)
┌───────────────────┬────────────┬──────────┬────────┐
│ text              ┆ contains_a ┆ starts_h ┆ ends_m │
│ ---               ┆ ---        ┆ ---      ┆ ---    │
│ str               ┆ bool       ┆ bool     ┆ bool   │
╞═══════════════════╪════════════╪══════════╪════════╡
│ hello world       ┆ false      ┆ true     ┆ false  │
│ POLARS is FAST    ┆ false      ┆ false    ┆ false  │
│   pandas          ┆ true       ┆ false    ┆ false  │
│ data-science-2024 ┆ true       ┆ false    ┆ false  │
│ user@example.com  ┆ true       ┆ false    ┆ true   │
└───────────────────┴────────────┴──────────┴────────┘


In [87]:
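# String replace and split
# Note: str.replace changes only the first match (see data_science-2024 below);
# str.replace_all changes every match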
print("Replace:")
print(str_df.with_columns(
    pl.col("text").str.replace("-", "_").alias("replaced")
))

print("\nSplit:")
print(str_df.with_columns(
    pl.col("text").str.split("-").alias("split")
))

Replace:
shape: (5, 2)
┌───────────────────┬───────────────────┐
│ text              ┆ replaced          │
│ ---               ┆ ---               │
│ str               ┆ str               │
╞═══════════════════╪═══════════════════╡
│ hello world       ┆ hello world       │
│ POLARS is FAST    ┆ POLARS is FAST    │
│   pandas          ┆   pandas          │
│ data-science-2024 ┆ data_science-2024 │
│ user@example.com  ┆ user@example.com  │
└───────────────────┴───────────────────┘

Split:
shape: (5, 2)
┌───────────────────┬─────────────────────────────┐
│ text              ┆ split                       │
│ ---               ┆ ---                         │
│ str               ┆ list[str]                   │
╞═══════════════════╪═════════════════════════════╡
│ hello world       ┆ ["hello world"]             │
│ POLARS is FAST    ┆ ["POLARS is FAST"]          │
│   pandas          ┆ ["  pandas  "]              │
│ data-science-2024 ┆ ["data", "science", "2024"] │
│ user@example.com  ┆ ["user@example.com"]        │
└───────────────────┴─────────────────────────────┘

In [88]:
# Extract with regex
print("Extract email domain:")
email_df = pl.DataFrame({"email": ["user@example.com", "test@domain.org"]})
result = email_df.with_columns(
    pl.col("email").str.extract(r"@(.+)", group_index=1).alias("domain")
)
print(result)

Extract email domain:
shape: (2, 2)
┌──────────────────┬─────────────┐
│ email            ┆ domain      │
│ ---              ┆ ---         │
│ str              ┆ str         │
╞══════════════════╪═════════════╡
│ user@example.com ┆ example.com │
│ test@domain.org  ┆ domain.org  │
└──────────────────┴─────────────┘


## 13. Window Functions

Powerful window operations (like SQL window functions)

In [89]:
# Sample data for window functions
sales_df = pl.DataFrame({
    "date": pl.date_range(date(2024, 1, 1), date(2024, 1, 10), interval="1d", eager=True),
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "sales": [100, 150, 120, 160, 110, 140, 130, 170, 115, 155]
})

print("Sales data:")
print(sales_df)

Sales data:
shape: (10, 3)
┌────────────┬─────────┬───────┐
│ date       ┆ product ┆ sales │
│ ---        ┆ ---     ┆ ---   │
│ date       ┆ str     ┆ i64   │
╞════════════╪═════════╪═══════╡
│ 2024-01-01 ┆ A       ┆ 100   │
│ 2024-01-02 ┆ B       ┆ 150   │
│ 2024-01-03 ┆ A       ┆ 120   │
│ 2024-01-04 ┆ B       ┆ 160   │
│ 2024-01-05 ┆ A       ┆ 110   │
│ 2024-01-06 ┆ B       ┆ 140   │
│ 2024-01-07 ┆ A       ┆ 130   │
│ 2024-01-08 ┆ B       ┆ 170   │
│ 2024-01-09 ┆ A       ┆ 115   │
│ 2024-01-10 ┆ B       ┆ 155   │
└────────────┴─────────┴───────┘


In [90]:
# Window aggregation with over()
print("Average sales per product (window function):")
result = sales_df.with_columns([
    pl.col("sales").mean().over("product").alias("avg_sales_per_product"),
    pl.col("sales").sum().over("product").alias("total_sales_per_product")
])
print(result)

Average sales per product (window function):
shape: (10, 5)
┌────────────┬─────────┬───────┬───────────────────────┬─────────────────────────┐
│ date       ┆ product ┆ sales ┆ avg_sales_per_product ┆ total_sales_per_product │
│ ---        ┆ ---     ┆ ---   ┆ ---                   ┆ ---                     │
│ date       ┆ str     ┆ i64   ┆ f64                   ┆ i64                     │
╞════════════╪═════════╪═══════╪═══════════════════════╪═════════════════════════╡
│ 2024-01-01 ┆ A       ┆ 100   ┆ 115.0                 ┆ 575                     │
│ 2024-01-02 ┆ B       ┆ 150   ┆ 155.0                 ┆ 775                     │
│ 2024-01-03 ┆ A       ┆ 120   ┆ 115.0                 ┆ 575                     │
│ 2024-01-04 ┆ B       ┆ 160   ┆ 155.0                 ┆ 775                     │
│ 2024-01-05 ┆ A       ┆ 110   ┆ 115.0                 ┆ 575                     │
│ 2024-01-06 ┆ B       ┆ 140   ┆ 155.0                 ┆ 775                     │
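│ 2024-01-07 ┆ A       ┆ 130   ┆ 115.0                 ┆ 575                     │
│ 2024-01-08 ┆ B       ┆ 170   ┆ 155.0                 ┆ 775                     │
│ 2024-01-09 ┆ A       ┆ 115   ┆ 115.0                 ┆ 575                     │
│ 2024-01-10 ┆ B       ┆ 155   ┆ 155.0                 ┆ 775                     │
└────────────┴─────────┴───────┴───────────────────────┴─────────────────────────┘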

In [91]:
# Ranking within groups
print("Rank sales within each product:")
result = sales_df.with_columns([
    pl.col("sales").rank(method="ordinal").over("product").alias("rank")
]).sort(["product", "rank"])
print(result)

Rank sales within each product:
shape: (10, 4)
┌────────────┬─────────┬───────┬──────┐
│ date       ┆ product ┆ sales ┆ rank │
│ ---        ┆ ---     ┆ ---   ┆ ---  │
│ date       ┆ str     ┆ i64   ┆ u32  │
╞════════════╪═════════╪═══════╪══════╡
│ 2024-01-01 ┆ A       ┆ 100   ┆ 1    │
│ 2024-01-05 ┆ A       ┆ 110   ┆ 2    │
│ 2024-01-09 ┆ A       ┆ 115   ┆ 3    │
│ 2024-01-03 ┆ A       ┆ 120   ┆ 4    │
│ 2024-01-07 ┆ A       ┆ 130   ┆ 5    │
│ 2024-01-06 ┆ B       ┆ 140   ┆ 1    │
│ 2024-01-02 ┆ B       ┆ 150   ┆ 2    │
│ 2024-01-10 ┆ B       ┆ 155   ┆ 3    │
│ 2024-01-04 ┆ B       ┆ 160   ┆ 4    │
│ 2024-01-08 ┆ B       ┆ 170   ┆ 5    │
└────────────┴─────────┴───────┴──────┘


In [None]:
# Cumulative sum within groups
print("Cumulative sales per product:")
result = sales_df.with_columns(
    pl.col("sales").cum_sum().over("product").alias("cumulative_sales")
).sort(["product", "date"])
print(result)

In [None]:
# Shift/lag operations (sales_df is in date order, so shift(1) gives the previous sale of the same product)
print("Previous sale per product (lag):")
result = sales_df.with_columns([
    pl.col("sales").shift(1).over("product").alias("prev_sales"),
    (pl.col("sales") - pl.col("sales").shift(1).over("product")).alias("sales_change")
]).sort(["product", "date"])
print(result)

## 14. Performance Optimization

Tips and tricks for maximizing Polars performance

In [None]:
# 1. Use lazy evaluation for large datasets
print("Use scan_* methods for lazy reading:")
lazy_query = (
    pl.scan_csv("data.csv")
      .filter(pl.col("score") > 50)
      .select(["name", "score"])
      .head(5)
)
print(lazy_query.collect())

In [None]:
# 2. Use appropriate data types (smaller dtypes use less memory)
print("Downcast to smaller dtypes when possible:")
df_optimized = pl.DataFrame({
    "id": pl.Series([1, 2, 3], dtype=pl.UInt32),  # Instead of Int64
    "value": pl.Series([1.0, 2.0, 3.0], dtype=pl.Float32)  # Instead of Float64
})
print(df_optimized)

In [None]:
# 3. Prefer Parquet over CSV for I/O
import time

# Write
start = time.time()
sample_df.write_parquet("test.parquet")
parquet_write_time = time.time() - start

start = time.time()
sample_df.write_csv("test.csv")
csv_write_time = time.time() - start

print(f"Parquet write time: {parquet_write_time:.4f}s")
print(f"CSV write time: {csv_write_time:.4f}s")

# Read
start = time.time()
_ = pl.read_parquet("test.parquet")
parquet_read_time = time.time() - start

start = time.time()
_ = pl.read_csv("test.csv")
csv_read_time = time.time() - start

print(f"Parquet read time: {parquet_read_time:.4f}s")
print(f"CSV read time: {csv_read_time:.4f}s")

In [None]:
# 4. Use expression chaining instead of multiple operations
print("Chain operations efficiently:")

# Less efficient: multiple passes
result1 = df.with_columns((pl.col("salary") * 1.1).alias("new_salary"))
result1 = result1.with_columns((pl.col("age") + 1).alias("new_age"))

# More efficient: single pass
result2 = df.with_columns([
    (pl.col("salary") * 1.1).alias("new_salary"),
    (pl.col("age") + 1).alias("new_age")
])

print(result2)

In [None]:
# 5. Use streaming for very large datasets
print("Streaming execution for large data:")
lazy_query = (
    pl.scan_csv("data.csv")
      .filter(pl.col("score") > 50)
      .group_by("name")
      .agg(pl.col("score").mean())
)

# Collect with the streaming engine (processes data in batches)
# Note: collect(streaming=True) is deprecated in recent Polars releases
result = lazy_query.collect(engine="streaming")
print(result.head())

## 15. Advanced Features

Advanced Polars capabilities

In [None]:
# 1. Explode (unnest lists)
df_lists = pl.DataFrame({
    "name": ["Alice", "Bob"],
    "scores": [[85, 90, 88], [92, 87, 95]]
})

print("Original:")
print(df_lists)

print("\nExploded:")
print(df_lists.explode("scores"))

In [None]:
# 2. Pivot (wide format)
pivot_df = pl.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "product": ["A", "B", "A", "B"],
    "sales": [100, 150, 120, 160]
})

print("Original:")
print(pivot_df)

print("\nPivoted:")
print(pivot_df.pivot(on="product", index="date", values="sales"))

In [None]:
# 3. Unpivot (long format; `melt` is the deprecated name)
wide_df = pl.DataFrame({
    "name": ["Alice", "Bob"],
    "math": [85, 90],
    "science": [88, 92],
    "english": [90, 87]
})

print("Wide format:")
print(wide_df)

print("\nUnpivoted (long format):")
print(wide_df.unpivot(index="name", variable_name="subject", value_name="score"))

In [None]:
# 4. Apply custom functions with map_elements (use sparingly - slower than expressions)
def custom_function(x):
    return x * 2 + 10

print("Apply custom function:")
result = df.select([
    pl.col("name"),
    pl.col("age").map_elements(custom_function, return_dtype=pl.Int64).alias("custom")
])
print(result)

In [None]:
# 5. SQL interface
# Register DataFrame in SQL context
ctx = pl.SQLContext()
ctx.register("employees", df)

print("Query with SQL:")
result = ctx.execute("""
    SELECT name, salary, department
    FROM employees
    WHERE salary > 80000
    ORDER BY salary DESC
""").collect()
print(result)

In [None]:
# 6. Categorical data for memory efficiency
cat_df = pl.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A", "C"] * 1000
})

print("String dtype memory:")
print(f"{cat_df.estimated_size('mb'):.4f} MB")

# Convert to categorical
cat_df_opt = cat_df.with_columns(
    pl.col("category").cast(pl.Categorical)
)

print("\nCategorical dtype memory:")
print(f"{cat_df_opt.estimated_size('mb'):.4f} MB")

In [None]:
# 7. Struct columns (nested data)
struct_df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "address": [
        {"city": "NYC", "zip": "10001"},
        {"city": "LA", "zip": "90001"},
        {"city": "SF", "zip": "94101"}
    ]
})

print("Struct column:")
print(struct_df)

print("\nAccess struct fields:")
print(struct_df.with_columns([
    pl.col("address").struct.field("city").alias("city"),
    pl.col("address").struct.field("zip").alias("zip")
]))

## Summary & Key Takeaways

### When to Use Polars vs Pandas/PySpark:

**Use Polars when:**
- You need maximum performance on a single machine
- Your data fits in memory (or can be streamed)
- You want better memory efficiency
- You need lazy evaluation and query optimization

**Use Pandas when:**
- You need maximum ecosystem compatibility
- Your data is small and performance isn't critical
- You're working with legacy code

**Use PySpark when:**
- Your data is too large for a single machine
- You need distributed computing
- You already have a Spark cluster

### Key Polars Concepts:
1. **Expressions**: The core abstraction for data manipulation
2. **Lazy Evaluation**: Use `.lazy()` and `scan_*` for query optimization
3. **Arrow Backend**: Zero-copy operations for speed
4. **Parallelization**: Automatic multi-threading
5. **Type System**: Strong typing helps catch errors early

### Performance Tips:
- Use Parquet for storage
- Combine related expressions in a single `with_columns` or `select` call
- Use lazy evaluation for large datasets
- Prefer expressions over custom functions
- Use appropriate dtypes (smaller when possible)
- Use streaming for very large data

### Migration from Pandas:
- `df[df['col'] > 5]` → `df.filter(pl.col('col') > 5)`
- `df['new'] = df['old'] * 2` → `df.with_columns((pl.col('old') * 2).alias('new'))`
- `df.groupby('col').agg({'x': 'mean'})` → `df.group_by('col').agg(pl.col('x').mean())`
- `df.merge(other, on='key')` → `df.join(other, on='key')`

Happy data wrangling with Polars!

## Practice Exercises

Try these exercises to reinforce your learning:

1. Load the data.csv file and find all records where score > 75
2. Calculate the average score per day of the week
3. Create a new column that categorizes scores: Low (0-33), Medium (34-66), High (67-100)
4. Find the top 10 users by score using window functions
5. Write a lazy query that filters, groups, and aggregates the data, then optimize it