# Complete Guide to Polars for Pandas/PySpark Users

This notebook introduces Polars for readers who already know pandas or PySpark. It moves from the core data structures through expressions, aggregations, and joins to lazy evaluation and performance tuning.

## Table of Contents
1. [Introduction & Setup](#1-introduction--setup)
2. [Basic Data Structures](#2-basic-data-structures)
3. [Creating DataFrames](#3-creating-dataframes)
4. [Reading & Writing Data](#4-reading--writing-data)
5. [Data Selection & Filtering](#5-data-selection--filtering)
6. [Expressions - The Heart of Polars](#6-expressions---the-heart-of-polars)
7. [Transformations & Column Operations](#7-transformations--column-operations)
8. [Aggregations & GroupBy](#8-aggregations--groupby)
9. [Joins & Concatenations](#9-joins--concatenations)
10. [Lazy vs Eager Evaluation](#10-lazy-vs-eager-evaluation)
11. [Time Series Operations](#11-time-series-operations)
12. [String Operations](#12-string-operations)
13. [Window Functions](#13-window-functions)
14. [Performance Optimization](#14-performance-optimization)
15. [Advanced Features](#15-advanced-features)

## 1. Introduction & Setup

### What is Polars?
- **Fast**: Written in Rust, optimized for performance
- **Efficient**: Uses Apache Arrow columnar format
- **Expressive**: Rich expression API
- **Lazy**: Built-in query optimization

### Key Differences from Pandas/PySpark
| Feature | Pandas | PySpark | Polars |
|---------|--------|---------|--------|
| Speed | Moderate | Fast (distributed) | Very Fast (single node) |
| Memory | Copies data often | Distributed | Zero-copy views |
| API Style | Method chaining | SQL-like | Expression-based |
| Lazy Evaluation | No | Yes | Yes |
| Parallelization | Limited | Distributed | Multi-threaded |

In [None]:
# Install Polars (run this if not already installed)
# !pip install polars

import polars as pl
import numpy as np
from datetime import datetime, timedelta

# Check version
print(f"Polars version: {pl.__version__}")

# Set display options
pl.Config.set_tbl_rows(10)

## 2. Basic Data Structures

Polars has two main data structures:
- **Series**: 1D array (like pandas Series)
- **DataFrame**: 2D table (like pandas DataFrame)

In [None]:
# Creating a Series
s = pl.Series("numbers", [1, 2, 3, 4, 5])
print("Series:")
print(s)
print(f"\nDtype: {s.dtype}")
print(f"Length: {len(s)}")

In [None]:
# Series with different dtypes
int_series = pl.Series("integers", [1, 2, 3], dtype=pl.Int64)
float_series = pl.Series("floats", [1.0, 2.5, 3.7], dtype=pl.Float64)
str_series = pl.Series("strings", ["a", "b", "c"], dtype=pl.Utf8)
bool_series = pl.Series("booleans", [True, False, True], dtype=pl.Boolean)

print("Int Series:", int_series.to_list())
print("Float Series:", float_series.to_list())
print("String Series:", str_series.to_list())
print("Boolean Series:", bool_series.to_list())

## 3. Creating DataFrames

There are several ways to create a DataFrame in Polars: from a dictionary of columns, from a list of row dictionaries, or from a NumPy array.

In [None]:
# Method 1: From dictionary
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 40, 28],
    "city": ["New York", "London", "Paris", "Tokyo", "Berlin"],
    "salary": [70000, 80000, 90000, 95000, 75000]
})

print("DataFrame from dictionary:")
print(df)

In [None]:
# Method 2: From list of dictionaries (row-oriented)
data = [
    {"product": "A", "quantity": 10, "price": 100},
    {"product": "B", "quantity": 20, "price": 200},
    {"product": "C", "quantity": 15, "price": 150},
]

df2 = pl.DataFrame(data)
print("DataFrame from list of dicts:")
print(df2)

In [None]:
# Method 3: From NumPy array
arr = np.random.randn(5, 3)
df3 = pl.DataFrame(arr, schema=["col1", "col2", "col3"])
print("DataFrame from NumPy:")
print(df3)

In [None]:
# Basic DataFrame info (similar to pandas)
print("Shape:", df.shape)
print("\nColumn names:", df.columns)
print("\nDtypes:", df.dtypes)
print("\nSchema:")
print(df.schema)

In [None]:
# Quick statistics
print("Describe:")
print(df.describe())

## 4. Reading & Writing Data

Polars reads and writes several file formats. The cells below cover CSV, Parquet, and JSON, followed by lazy scanning for large files.

In [None]:
# Create sample data for I/O examples
sample_df = pl.DataFrame({
    "id": range(1, 1001),
    "name": [f"User_{i}" for i in range(1, 1001)],
    "score": np.random.randint(0, 100, 1000),
    "timestamp": [datetime.now() - timedelta(days=i) for i in range(1000)]
})

print(sample_df.head())

In [None]:
# Writing to CSV
sample_df.write_csv("data.csv")
print("Written to CSV")

# Reading from CSV
df_csv = pl.read_csv("data.csv")
print("\nRead from CSV:")
print(df_csv.head())

In [None]:
# Parquet (recommended for performance)
sample_df.write_parquet("data.parquet")
df_parquet = pl.read_parquet("data.parquet")
print("Read from Parquet:")
print(df_parquet.head())

In [None]:
# JSON
sample_df.head(5).write_json("data.json")
df_json = pl.read_json("data.json")
print("Read from JSON:")
print(df_json)

In [None]:
# Lazy reading (for large files) - reads only when needed
lazy_df = pl.scan_csv("data.csv")
print("Lazy DataFrame (not yet loaded):")
print(lazy_df)

# Collect to execute
result = lazy_df.head(3).collect()
print("\nCollected result:")
print(result)

## 5. Data Selection & Filtering

Polars uses expressions for powerful and efficient data selection

In [None]:
# Create sample data
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "age": [25, 30, 35, 40, 28, 45],
    "city": ["New York", "London", "Paris", "Tokyo", "Berlin", "Sydney"],
    "salary": [70000, 80000, 90000, 95000, 75000, 100000],
    "department": ["IT", "HR", "IT", "Finance", "HR", "IT"]
})

print("Sample DataFrame:")
print(df)

In [None]:
# Select columns
print("Select single column:")
print(df.select("name"))

print("\nSelect multiple columns:")
print(df.select(["name", "age", "salary"]))

In [None]:
# Select using expressions (pl.col)
print("Select with expressions:")
print(df.select([
    pl.col("name"),
    pl.col("age"),
    pl.col("salary")
]))

In [None]:
# Select by dtype (this matches Int64 columns only, not every numeric dtype)
print("Select Int64 columns:")
print(df.select(pl.col(pl.Int64)))

print("\nSelect string columns:")
print(df.select(pl.col(pl.Utf8)))

In [None]:
# Filter rows (similar to pandas query or SQL WHERE)
print("Filter age > 30:")
print(df.filter(pl.col("age") > 30))

In [None]:
# Multiple conditions with & (and) | (or)
print("Filter with multiple conditions (age > 30 AND salary > 80000):")
print(df.filter(
    (pl.col("age") > 30) & (pl.col("salary") > 80000)
))

In [None]:
# String filtering
print("Filter department == 'IT':")
print(df.filter(pl.col("department") == "IT"))

print("\nFilter city contains 'o':")
print(df.filter(pl.col("city").str.contains("o")))

In [None]:
# is_in (similar to pandas isin)
print("Filter names in list:")
print(df.filter(pl.col("name").is_in(["Alice", "Bob", "Charlie"])))

In [None]:
# head, tail, sample
print("First 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

print("\nRandom sample (2 rows):")
print(df.sample(n=2, seed=42))

## 6. Expressions - The Heart of Polars

Expressions are what make Polars powerful and fast. They are:
- **Composable**: Can be chained together
- **Parallelizable**: Automatically run in parallel
- **Optimizable**: Query optimizer improves performance

In [None]:
# Basic expression
print("Double the salary:")
print(df.select([
    pl.col("name"),
    (pl.col("salary") * 2).alias("doubled_salary")
]))

In [None]:
# Multiple operations in one select
print("Multiple expressions:")
print(df.select([
    pl.col("name"),
    pl.col("age"),
    (pl.col("salary") / 1000).alias("salary_k"),
    (pl.col("age") > 30).alias("is_senior")
]))

In [None]:
# with_columns (add/modify columns without selecting)
print("Add new columns:")
result = df.with_columns([
    (pl.col("salary") * 1.1).alias("salary_after_raise"),
    (pl.col("age") + 1).alias("age_next_year")
])
print(result)

In [None]:
# Conditional expressions (when-then-otherwise)
print("Conditional column:")
result = df.with_columns([
    pl.when(pl.col("age") < 30)
      .then(pl.lit("Young"))
      .when(pl.col("age") < 40)
      .then(pl.lit("Middle"))
      .otherwise(pl.lit("Senior"))
      .alias("age_group")
])
print(result)

In [None]:
# Expression aliases and chaining
print("Chained expressions:")
result = df.select([
    pl.col("name").str.to_uppercase().alias("name_upper"),
    pl.col("salary").log10().round(2).alias("log_salary")
])
print(result)

## 7. Transformations & Column Operations

Common data transformation operations

In [None]:
# Sorting
print("Sort by age (descending):")
print(df.sort("age", descending=True))

print("\nSort by multiple columns:")
print(df.sort(["department", "salary"], descending=[False, True]))

In [None]:
# Rename columns
print("Rename columns:")
renamed = df.rename({"name": "employee_name", "salary": "annual_salary"})
print(renamed.columns)

In [None]:
# Drop columns
print("Drop columns:")
print(df.drop(["city", "department"]).columns)

In [None]:
# Cast dtypes
print("Cast age to float:")
result = df.with_columns(pl.col("age").cast(pl.Float64))
print(result.dtypes)

In [None]:
# Null handling
df_with_nulls = pl.DataFrame({
    "a": [1, 2, None, 4, None],
    "b": ["x", None, "y", "z", None]
})

print("DataFrame with nulls:")
print(df_with_nulls)

print("\nForward-fill nulls:")
print(df_with_nulls.fill_null(strategy="forward"))

# A numeric fill value is applied to numeric columns only; column "b" keeps its nulls
print("\nFill with specific value:")
print(df_with_nulls.fill_null(0))

print("\nDrop nulls:")
print(df_with_nulls.drop_nulls())

In [None]:
# Unique and duplicates
print("Unique values in department:")
print(df.select(pl.col("department").unique()))

print("\nCount unique values:")
print(df.select(pl.col("department").n_unique()))

## 8. Aggregations & GroupBy

Powerful aggregation capabilities, similar to pandas groupby but more expressive

In [None]:
# Basic aggregations
print("Mean salary:")
print(df.select(pl.col("salary").mean()))

print("\nMultiple aggregations:")
print(df.select([
    pl.col("salary").mean().alias("mean_salary"),
    pl.col("salary").median().alias("median_salary"),
    pl.col("salary").std().alias("std_salary"),
    pl.col("age").min().alias("min_age"),
    pl.col("age").max().alias("max_age")
]))

In [None]:
# GroupBy - basic
print("Group by department:")
print(df.group_by("department").agg([
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("age").mean().alias("avg_age"),
    pl.count().alias("count")  # newer Polars releases prefer pl.len()
]).sort("department"))

In [None]:
# GroupBy - multiple aggregations per column
print("Multiple aggregations:")
print(df.group_by("department").agg([
    pl.col("salary").min().alias("min_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("name").count().alias("employee_count")
]).sort("department"))

In [None]:
# GroupBy with multiple keys
df_extended = df.with_columns(
    pl.when(pl.col("age") < 35)
      .then(pl.lit("Young"))
      .otherwise(pl.lit("Senior"))
      .alias("age_category")
)

print("Group by multiple columns:")
print(df_extended.group_by(["department", "age_category"]).agg([
    pl.col("salary").mean().alias("avg_salary"),
    pl.count().alias("count")
]).sort(["department", "age_category"]))

In [None]:
# Advanced aggregations
print("List aggregation (collect names per department):")
print(df.group_by("department").agg([
    pl.col("name").alias("employees"),
    pl.col("salary").sum().alias("total_salary")
]).sort("department"))

In [None]:
# Quantiles and percentiles
print("Salary percentiles:")
print(df.select([
    pl.col("salary").quantile(0.25).alias("p25"),
    pl.col("salary").quantile(0.50).alias("p50"),
    pl.col("salary").quantile(0.75).alias("p75"),
    pl.col("salary").quantile(0.90).alias("p90")
]))

## 9. Joins & Concatenations

Combining DataFrames - similar to SQL joins and pandas merge/concat

In [None]:
# Create sample DataFrames for joining
employees = pl.DataFrame({
    "emp_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "dept_id": [10, 20, 10, 30, 20]
})

departments = pl.DataFrame({
    "dept_id": [10, 20, 30, 40],
    "dept_name": ["IT", "HR", "Finance", "Marketing"]
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)

In [None]:
# Inner join
print("Inner join:")
print(employees.join(departments, on="dept_id", how="inner"))

In [None]:
# Left join
print("Left join:")
print(employees.join(departments, on="dept_id", how="left"))

In [None]:
# Outer join (newer Polars releases name this how="full")
print("Outer join:")
print(employees.join(departments, on="dept_id", how="outer"))

In [None]:
# Join with different column names
salaries = pl.DataFrame({
    "employee_id": [1, 2, 3, 4, 5],
    "salary": [70000, 80000, 90000, 95000, 75000]
})

print("Join on different column names:")
print(employees.join(salaries, left_on="emp_id", right_on="employee_id"))

In [None]:
# Concatenation - vertical (like SQL UNION or pandas concat axis=0)
df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pl.DataFrame({"a": [5, 6], "b": [7, 8]})

print("Vertical concatenation:")
print(pl.concat([df1, df2]))

In [None]:
# Concatenation - horizontal (like pandas concat axis=1)
df3 = pl.DataFrame({"c": [9, 10]})

print("Horizontal concatenation:")
print(pl.concat([df1, df3], how="horizontal"))

## 10. Lazy vs Eager Evaluation

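Lazy evaluation is one of Polars' most powerful features. A lazy query is only described, not run, until `collect()` is called, which lets the optimizer rewrite it first (for example, by pushing filters and column selections down to the data source).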

In [None]:
# Eager execution (default)
print("Eager execution:")
result_eager = (
    df.filter(pl.col("age") > 30)
      .select(["name", "salary"])
      .sort("salary", descending=True)
)
print(result_eager)

In [None]:
# Lazy execution - convert to lazy
print("Lazy execution (not yet computed):")
lazy_query = (
    df.lazy()
      .filter(pl.col("age") > 30)
      .select(["name", "salary"])
      .sort("salary", descending=True)
)
print(lazy_query)
print("\nQuery plan:")
print(lazy_query.explain())

In [None]:
# Execute lazy query with collect()
print("Collected result:")
result_lazy = lazy_query.collect()
print(result_lazy)

In [None]:
# Example showing optimization benefits
# Polars will optimize this to only read necessary columns
lazy_optimized = (
    pl.scan_csv("data.csv")
      .select(["name", "score"])  # Only these columns will be read from CSV
      .filter(pl.col("score") > 50)
      .head(10)
)

print("Optimized query plan:")
print(lazy_optimized.explain())

# Execute
print("\nResult:")
print(lazy_optimized.collect())

## 11. Time Series Operations

Working with dates and times in Polars

In [None]:
# Create time series data
from datetime import date

ts_df = pl.DataFrame({
    "date": pl.date_range(
        date(2024, 1, 1),
        date(2024, 12, 31),
        interval="1d",
        eager=True
    ),
    "value": np.random.randn(366).cumsum()
})

print("Time series data:")
print(ts_df.head(10))

In [None]:
# Extract date components
print("Extract date components:")
result = ts_df.with_columns([
    pl.col("date").dt.year().alias("year"),
    pl.col("date").dt.month().alias("month"),
    pl.col("date").dt.day().alias("day"),
    pl.col("date").dt.weekday().alias("weekday"),
    pl.col("date").dt.quarter().alias("quarter")
]).head(10)
print(result)

In [None]:
# Datetime arithmetic
print("Add days to date:")
result = ts_df.with_columns(
    (pl.col("date") + pl.duration(days=7)).alias("date_plus_week")
).head(5)
print(result)

In [None]:
# Resample and aggregate (like pandas resample)
print("Monthly aggregation:")
monthly = (
    ts_df.group_by_dynamic("date", every="1mo")
         .agg([
             pl.col("value").mean().alias("avg_value"),
             pl.col("value").min().alias("min_value"),
             pl.col("value").max().alias("max_value")
         ])
)
print(monthly)

In [None]:
# Rolling window operations
print("7-day rolling average:")
result = ts_df.with_columns(
    pl.col("value").rolling_mean(window_size=7).alias("rolling_avg_7d")
).head(20)
print(result)

## 12. String Operations

String manipulation in Polars

In [None]:
# Create string data
str_df = pl.DataFrame({
    "text": [
        "hello world",
        "POLARS is FAST",
        "  pandas  ",
        "data-science-2024",
        "user@example.com"
    ]
})

print("String data:")
print(str_df)

In [None]:
# String methods
print("String transformations:")
result = str_df.with_columns([
    pl.col("text").str.to_uppercase().alias("upper"),
    pl.col("text").str.to_lowercase().alias("lower"),
    pl.col("text").str.strip_chars().alias("stripped"),
    pl.col("text").str.len_chars().alias("length")
])
print(result)

In [None]:
# String contains, starts_with, ends_with
print("String matching:")
result = str_df.with_columns([
    pl.col("text").str.contains("a").alias("contains_a"),
    pl.col("text").str.starts_with("h").alias("starts_h"),
    pl.col("text").str.ends_with("m").alias("ends_m")
])
print(result)

In [None]:
# String replace and split
print("Replace:")
print(str_df.with_columns(
    pl.col("text").str.replace("-", "_").alias("replaced")
))

print("\nSplit:")
print(str_df.with_columns(
    pl.col("text").str.split("-").alias("split")
))

In [None]:
# Extract with regex
print("Extract email domain:")
email_df = pl.DataFrame({"email": ["user@example.com", "test@domain.org"]})
result = email_df.with_columns(
    pl.col("email").str.extract(r"@(.+)", group_index=1).alias("domain")
)
print(result)

## 13. Window Functions

Powerful window operations (like SQL window functions)

In [None]:
# Sample data for window functions
sales_df = pl.DataFrame({
    "date": pl.date_range(date(2024, 1, 1), date(2024, 1, 10), interval="1d", eager=True),
    "product": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
    "sales": [100, 150, 120, 160, 110, 140, 130, 170, 115, 155]
})

print("Sales data:")
print(sales_df)

In [None]:
# Window aggregation with over()
print("Average sales per product (window function):")
result = sales_df.with_columns([
    pl.col("sales").mean().over("product").alias("avg_sales_per_product"),
    pl.col("sales").sum().over("product").alias("total_sales_per_product")
])
print(result)

In [None]:
# Ranking within groups
print("Rank sales within each product:")
result = sales_df.with_columns([
    pl.col("sales").rank(method="ordinal").over("product").alias("rank")
]).sort(["product", "rank"])
print(result)

In [None]:
# Cumulative sum within groups
print("Cumulative sales per product:")
result = sales_df.with_columns(
    pl.col("sales").cum_sum().over("product").alias("cumulative_sales")
).sort(["product", "date"])
print(result)

In [None]:
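# Shift/lag operations
# Rows alternate between products, so the lag is the previous sale of the same product
print("Previous sale of the same product (lag):")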
result = sales_df.with_columns([
    pl.col("sales").shift(1).over("product").alias("prev_sales"),
    (pl.col("sales") - pl.col("sales").shift(1).over("product")).alias("sales_change")
]).sort(["product", "date"])
print(result)

## 14. Performance Optimization

Tips and tricks for maximizing Polars performance

In [None]:
# 1. Use lazy evaluation for large datasets
print("Use scan_* methods for lazy reading:")
lazy_query = (
    pl.scan_csv("data.csv")
      .filter(pl.col("score") > 50)
      .select(["name", "score"])
      .head(5)
)
print(lazy_query.collect())

In [None]:
# 2. Use appropriate data types (smaller dtypes use less memory and are often faster)
print("Downcast to smaller dtypes when possible:")
df_optimized = pl.DataFrame({
    "id": pl.Series([1, 2, 3], dtype=pl.UInt32),  # Instead of Int64
    "value": pl.Series([1.0, 2.0, 3.0], dtype=pl.Float32)  # Instead of Float64
})
print(df_optimized)

In [None]:
# 3. Prefer Parquet over CSV for I/O
# Note: sample_df has only 1,000 rows, so these timings are illustrative, not a benchmark
import time

# Write (perf_counter is a monotonic, high-resolution clock suited to timing)
start = time.perf_counter()
sample_df.write_parquet("test.parquet")
parquet_write_time = time.perf_counter() - start

start = time.perf_counter()
sample_df.write_csv("test.csv")
csv_write_time = time.perf_counter() - start

print(f"Parquet write time: {parquet_write_time:.4f}s")
print(f"CSV write time: {csv_write_time:.4f}s")

# Read
start = time.perf_counter()
_ = pl.read_parquet("test.parquet")
parquet_read_time = time.perf_counter() - start

start = time.perf_counter()
_ = pl.read_csv("test.csv")
csv_read_time = time.perf_counter() - start

print(f"Parquet read time: {parquet_read_time:.4f}s")
print(f"CSV read time: {csv_read_time:.4f}s")

In [None]:
# 4. Use expression chaining instead of multiple operations
print("Chain operations efficiently:")

# Less efficient: multiple passes
result1 = df.with_columns((pl.col("salary") * 1.1).alias("new_salary"))
result1 = result1.with_columns((pl.col("age") + 1).alias("new_age"))

# More efficient: single pass
result2 = df.with_columns([
    (pl.col("salary") * 1.1).alias("new_salary"),
    (pl.col("age") + 1).alias("new_age")
])

print(result2)

In [None]:
# 5. Use streaming for very large datasets
print("Streaming execution for large data:")
lazy_query = (
    pl.scan_csv("data.csv")
      .filter(pl.col("score") > 50)
      .group_by("name")
      .agg(pl.col("score").mean())
)

# Collect with streaming (processes data in chunks)
# Newer Polars releases spell this collect(engine="streaming")
result = lazy_query.collect(streaming=True)
print(result.head())

## 15. Advanced Features

Advanced Polars capabilities

In [None]:
# 1. Explode (unnest lists)
df_lists = pl.DataFrame({
    "name": ["Alice", "Bob"],
    "scores": [[85, 90, 88], [92, 87, 95]]
})

print("Original:")
print(df_lists)

print("\nExploded:")
print(df_lists.explode("scores"))

In [None]:
# 2. Pivot (wide format)
pivot_df = pl.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "product": ["A", "B", "A", "B"],
    "sales": [100, 150, 120, 160]
})

print("Original:")
print(pivot_df)

print("\nPivoted:")
# Note: Polars 1.0 renamed the columns= argument to on=
print(pivot_df.pivot(values="sales", index="date", columns="product"))

In [None]:
# 3. Melt (long format)
wide_df = pl.DataFrame({
    "name": ["Alice", "Bob"],
    "math": [85, 90],
    "science": [88, 92],
    "english": [90, 87]
})

print("Wide format:")
print(wide_df)

print("\nMelted (long format):")
# Note: Polars 1.0 renamed melt to unpivot (id_vars becomes index)
print(wide_df.melt(id_vars="name", variable_name="subject", value_name="score"))

In [None]:
# 4. Apply custom functions with map_elements (use sparingly - slower than expressions)
def custom_function(x):
    return x * 2 + 10

print("Apply custom function:")
result = df.select([
    pl.col("name"),
    pl.col("age").map_elements(custom_function, return_dtype=pl.Int64).alias("custom")
])
print(result)

In [None]:
# 5. SQL interface
# Register DataFrame in SQL context
ctx = pl.SQLContext()
ctx.register("employees", df)

print("Query with SQL:")
result = ctx.execute("""
    SELECT name, salary, department
    FROM employees
    WHERE salary > 80000
    ORDER BY salary DESC
""").collect()
print(result)

In [None]:
# 6. Categorical data for memory efficiency
cat_df = pl.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A", "C"] * 1000
})

print("String dtype memory:")
print(f"{cat_df.estimated_size('mb'):.4f} MB")

# Convert to categorical
cat_df_opt = cat_df.with_columns(
    pl.col("category").cast(pl.Categorical)
)

print("\nCategorical dtype memory:")
print(f"{cat_df_opt.estimated_size('mb'):.4f} MB")

In [None]:
# 7. Struct columns (nested data)
struct_df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "address": [
        {"city": "NYC", "zip": "10001"},
        {"city": "LA", "zip": "90001"},
        {"city": "SF", "zip": "94101"}
    ]
})

print("Struct column:")
print(struct_df)

print("\nAccess struct fields:")
print(struct_df.with_columns([
    pl.col("address").struct.field("city").alias("city"),
    pl.col("address").struct.field("zip").alias("zip")
]))

## Summary & Key Takeaways

### When to Use Polars vs Pandas/PySpark:

**Use Polars when:**
- You need maximum performance on a single machine
- Your data fits in memory (or can be streamed)
- You want better memory efficiency
- You need lazy evaluation and query optimization

**Use Pandas when:**
- You need maximum ecosystem compatibility
- Your data is small and performance isn't critical
- You're working with legacy code

**Use PySpark when:**
- Your data is too large for a single machine
- You need distributed computing
- You already have a Spark cluster

### Key Polars Concepts:
1. **Expressions**: The core abstraction for data manipulation
2. **Lazy Evaluation**: Use `.lazy()` and `scan_*` for query optimization
3. **Arrow Backend**: Columnar memory format that allows zero-copy data sharing
4. **Parallelization**: Automatic multi-threading
5. **Type System**: Strong typing helps catch errors early

### Performance Tips:
- Use Parquet for storage
- Combine column operations into a single `select` or `with_columns` call
- Use lazy evaluation for large datasets
- Prefer expressions over custom functions
- Use appropriate dtypes (smaller when possible)
- Use streaming for very large data

### Migration from Pandas:
- `df[df['col'] > 5]` → `df.filter(pl.col('col') > 5)`
- `df['new'] = df['old'] * 2` → `df.with_columns((pl.col('old') * 2).alias('new'))`
- `df.groupby('col').agg({'x': 'mean'})` → `df.group_by('col').agg(pl.col('x').mean())`
- `df.merge(other, on='key')` → `df.join(other, on='key')` (Polars requires the join key to be given explicitly)

Happy data wrangling with Polars!

## Practice Exercises

Try these exercises to reinforce your learning:

1. Load the data.csv file and find all records where score > 75
2. Calculate the average score per day of the week
3. Create a new column that categorizes scores: Low (0-33), Medium (34-66), High (67-100)
4. Find the top 10 users by score using window functions
5. Write a lazy query that filters, groups, and aggregates the data, then optimize it