# Polars Expressions & Selectors - Comprehensive Guide

Expressions are the heart of Polars. They provide a powerful, flexible way to manipulate data.

## What You'll Learn:
- Expression basics and chaining
- Column selection patterns
- Selectors for dtype-based operations
- Conditional expressions
- Expression contexts (select, with_columns, filter, etc.)
- Advanced expression patterns

In [None]:
import polars as pl
import polars.selectors as cs
import numpy as np

## Part 1: Expression Basics

In [None]:
# Sample data
df = pl.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 34, 28, 42, 31],
    'income': [50000, 75000, 60000, 95000, 68000],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'Boston'],
    'purchases': [5, 12, 3, 18, 7],
    'total_spent': [250.5, 680.2, 120.0, 950.8, 340.5]
})

print(df)

### What is an Expression?

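An expression describes a computation on one or more columns. It is lazy: nothing runs when the expression is created. It produces a Series only when evaluated inside a context such as `select()` or `with_columns()`.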

In [None]:
# Basic expression: select a column
expr = pl.col('age')
print(f"Expression type: {type(expr)}")
print(f"Expression: {expr}")

# Execute expression in select
result = df.select(expr)
print("\nResult:")
print(result)

### Expression Chaining

In [None]:
# Build new expressions from a column and name each result with .alias()
result = df.select([
    pl.col('age'),
    (pl.col('age') * 2).alias('age_doubled'),
    (pl.col('age') + 10).alias('age_plus_10'),
    (pl.col('income') / 1000).alias('income_k')
])

print(result)

## Part 2: Column Selection Patterns

### pl.col() - Select specific columns

In [None]:
# Single column
print("Single column:")
print(df.select(pl.col('name')))

# Multiple columns (as list)
print("\nMultiple columns (list):")
print(df.select(['name', 'age', 'city']))

# Multiple columns (as separate args)
print("\nMultiple columns with pl.col():")
print(df.select(pl.col('name'), pl.col('age'), pl.col('city')))

### pl.all() - Select all columns

In [None]:
# Select every column
result = df.select(pl.all())
print("All columns:")
print(result)

# Apply a transformation to several named columns at once
result = df.select([
    pl.col('customer_id', 'name', 'city'),  # Keep these as-is
    (pl.col('age', 'income', 'purchases', 'total_spent') / 2)  # Transform these
])
print("\nTransform multiple columns:")
print(result)

### pl.exclude() - Select all except specified

In [None]:
# Select all except certain columns
result = df.select(pl.exclude('customer_id', 'total_spent'))
print("All except customer_id and total_spent:")
print(result)

# Double every column that is not excluded by name
# (pl.exclude works on names, not dtypes; the remaining columns here are all numeric)
result = df.select([
    pl.col('name', 'city'),
    pl.exclude('name', 'city', 'customer_id') * 2  # age, income, purchases, total_spent
])
])
print("\nDouble all numeric except customer_id:")
print(result)

### Regex patterns for column selection

In [None]:
# Select columns matching regex
result = df.select(pl.col('^.*_id$'))  # Columns ending with '_id'
print("Columns ending with '_id':")
print(result)

# Alternation: match any of several names with one pattern
result = df.select(pl.col('^(name|city)$'))
print("\nColumns matching 'name' or 'city':")
print(result)

## Part 3: Selectors (dtype-based selection)

In [None]:
# Create DataFrame with mixed types
mixed_df = pl.DataFrame({
    'int_col': [1, 2, 3, 4, 5],
    'float_col': [1.1, 2.2, 3.3, 4.4, 5.5],
    'str_col': ['a', 'b', 'c', 'd', 'e'],
    'bool_col': [True, False, True, False, True],
    'date_col': pl.date_range(pl.date(2023, 1, 1), pl.date(2023, 1, 5), '1d', eager=True)
})

print(mixed_df)
print("\nSchema:")
print(mixed_df.schema)

### cs.numeric() - Select numeric columns

In [None]:
# Select all numeric columns
result = mixed_df.select(cs.numeric())
print("Numeric columns:")
print(result)

# Apply operation to all numeric columns
result = mixed_df.select([
    cs.numeric() * 2  # Double all numeric columns
])
print("\nAll numeric columns doubled:")
print(result)

### cs.string() - Select string columns

In [None]:
# Select string columns
result = mixed_df.select(cs.string())
print("String columns:")
print(result)

# Apply string operation to all string columns
result = mixed_df.select([
    cs.string().str.to_uppercase()
])
print("\nUppercase all string columns:")
print(result)

### Other useful selectors

In [None]:
# cs.integer() - integer columns only
print("Integer columns:")
print(mixed_df.select(cs.integer()))

# cs.float() - float columns only
print("\nFloat columns:")
print(mixed_df.select(cs.float()))

# cs.temporal() - date, datetime, duration, and time columns
print("\nTemporal columns:")
print(mixed_df.select(cs.temporal()))

# cs.boolean() - boolean columns
print("\nBoolean columns:")
print(mixed_df.select(cs.boolean()))

### Combining selectors

In [None]:
# Union: numeric OR string
result = mixed_df.select(cs.numeric() | cs.string())
print("Numeric OR string columns:")
print(result)

# Intersection uses `&` (less common; useful for combining a dtype selector with a name selector)
# Negation: NOT numeric
result = mixed_df.select(~cs.numeric())
print("\nNOT numeric columns:")
print(result)

# Difference: all columns except numeric
result = mixed_df.select(cs.all() - cs.numeric())
print("\nAll except numeric:")
print(result)

## Part 4: Expression Contexts

Expressions can be used in different contexts:
- `select()`: Return specific columns
- `with_columns()`: Add/modify columns
- `filter()`: Filter rows
- `group_by().agg()`: Aggregate

### Context 1: select()

In [None]:
# Select returns only specified columns
result = df.select([
    pl.col('name'),
    (pl.col('income') / 1000).alias('income_k'),
    (pl.col('total_spent') / pl.col('purchases')).alias('avg_per_purchase')
])

print("select() - Only return specified columns:")
print(result)

### Context 2: with_columns()

In [None]:
# with_columns keeps all original columns and adds new ones
result = df.with_columns([
    (pl.col('income') / 1000).alias('income_k'),
    (pl.col('total_spent') / pl.col('purchases')).alias('avg_per_purchase'),
    (pl.col('age') > 30).alias('is_over_30')
])

print("with_columns() - Keep all + add new:")
print(result)

### Context 3: filter()

In [None]:
# filter uses expressions as boolean masks
result = df.filter(
    (pl.col('age') > 30) & (pl.col('income') > 70000)
)

print("filter() - Keep rows where condition is True:")
print(result)

# Multiple conditions
result = df.filter(
    pl.col('city').is_in(['NYC', 'LA']),
    pl.col('purchases') > 5  # Multiple args are AND'ed
)
print("\nMultiple filter conditions:")
print(result)

### Context 4: group_by().agg()

In [None]:
# Aggregation expressions
result = df.group_by('city').agg([
    pl.col('age').mean().alias('avg_age'),
    pl.col('income').sum().alias('total_income'),
    pl.col('purchases').count().alias('num_customers')  # count() skips nulls; pl.len() counts every row
])

print("group_by().agg() - Aggregate expressions:")
print(result)

## Part 5: Conditional Expressions (when/then/otherwise)

In [None]:
# Simple if-else
result = df.select([
    pl.col('name'),
    pl.col('age'),
    pl.when(pl.col('age') >= 35)
      .then(pl.lit('Senior'))
      .otherwise(pl.lit('Junior'))
      .alias('category')
])

print("Simple when/then/otherwise:")
print(result)

In [None]:
# Multiple conditions (if/elif/else)
result = df.select([
    pl.col('name'),
    pl.col('income'),
    pl.when(pl.col('income') >= 90000)
      .then(pl.lit('High'))
      .when(pl.col('income') >= 60000)
      .then(pl.lit('Medium'))
      .otherwise(pl.lit('Low'))
      .alias('income_bracket')
])

print("Multiple conditions:")
print(result)

In [None]:
# Complex conditions
result = df.select([
    pl.col('name'),
    pl.col('age'),
    pl.col('income'),
    pl.when((pl.col('age') > 30) & (pl.col('income') > 70000))
      .then(pl.lit('Premium'))
      .when((pl.col('age') <= 30) & (pl.col('income') > 60000))
      .then(pl.lit('Young Professional'))
      .when(pl.col('age') > 35)
      .then(pl.lit('Senior'))
      .otherwise(pl.lit('Standard'))
      .alias('segment')
])

print("Complex conditions:")
print(result)

## Part 6: Common Expression Operations

### Arithmetic operations

In [None]:
result = df.select([
    pl.col('name'),
    pl.col('income'),
    (pl.col('income') + 5000).alias('income_plus_5k'),
    (pl.col('income') * 1.1).alias('income_10pct_raise'),
    (pl.col('income') / 12).alias('monthly_income'),
    (pl.col('income') % 10000).alias('income_mod_10k')
])

print(result)

### Comparison operations

In [None]:
result = df.select([
    pl.col('name'),
    pl.col('age'),
    (pl.col('age') > 30).alias('over_30'),
    (pl.col('age') == 28).alias('exactly_28'),
    (pl.col('age') >= 30).alias('at_least_30'),
    (pl.col('age').is_between(25, 35)).alias('age_25_to_35')
])

print(result)

### Logical operations

In [None]:
result = df.select([
    pl.col('name'),
    pl.col('age'),
    pl.col('income'),
    ((pl.col('age') > 30) & (pl.col('income') > 70000)).alias('senior_and_high_income'),
    ((pl.col('age') < 30) | (pl.col('income') < 60000)).alias('young_or_low_income'),
    (~(pl.col('age') > 30)).alias('not_over_30')
])

print(result)

### String operations

In [None]:
result = df.select([
    pl.col('name'),
    pl.col('name').str.to_lowercase().alias('name_lower'),
    pl.col('name').str.to_uppercase().alias('name_upper'),
    pl.col('name').str.len_chars().alias('name_length'),
    pl.col('name').str.starts_with('A').alias('starts_with_A'),
    pl.col('name').str.contains('li').alias('contains_li')
])

print(result)

### Null handling

In [None]:
# Create DataFrame with nulls
df_nulls = pl.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value': [10, None, 30, None, 50],
    'category': ['A', 'B', None, 'D', 'E']
})

print("Original with nulls:")
print(df_nulls)

result = df_nulls.select([
    pl.col('id'),
    pl.col('value'),
    pl.col('value').is_null().alias('value_is_null'),
    pl.col('value').is_not_null().alias('value_is_not_null'),
    pl.col('value').fill_null(0).alias('value_filled'),
    pl.col('category').fill_null('Unknown').alias('category_filled')
])

print("\nNull handling:")
print(result)

## Part 7: Advanced Expression Patterns

### Pattern 1: Apply same operation to multiple columns

In [None]:
# Normalize multiple columns (z-score)
result = df.select([
    pl.col('name'),
    ((pl.col('age') - pl.col('age').mean()) / pl.col('age').std()).alias('age_normalized'),
    ((pl.col('income') - pl.col('income').mean()) / pl.col('income').std()).alias('income_normalized'),
])

print("Normalized values:")
print(result)

In [None]:
# Better: Use a helper to avoid repetition
def normalize(col_name):
    return ((pl.col(col_name) - pl.col(col_name).mean()) / pl.col(col_name).std()).alias(f'{col_name}_norm')

result = df.select([
    pl.col('name'),
    normalize('age'),
    normalize('income'),
    normalize('purchases')
])

print("Normalized using helper:")
print(result)

### Pattern 2: Multiple aggregations per column

In [None]:
result = df.group_by('city').agg([
    pl.col('age').mean().alias('avg_age'),
    pl.col('age').min().alias('min_age'),
    pl.col('age').max().alias('max_age'),
    pl.col('income').mean().alias('avg_income'),
    pl.col('income').sum().alias('total_income'),
    pl.len().alias('count')
])

print(result)

### Pattern 3: Creating multiple derived columns at once

In [None]:
# Create several related columns
result = df.with_columns([
    # Income-derived metrics
    (pl.col('income') / 1000).alias('income_k'),
    (pl.col('income') * 0.25).alias('estimated_tax'),
    (pl.col('income') * 0.75).alias('after_tax_income'),
    
    # Purchase metrics
    (pl.col('total_spent') / pl.col('purchases')).alias('avg_per_purchase'),
    (pl.col('total_spent') / pl.col('income') * 100).alias('spending_pct_of_income'),
    
    # Categories
    pl.when(pl.col('purchases') >= 10)
      .then(pl.lit('High Frequency'))
      .when(pl.col('purchases') >= 5)
      .then(pl.lit('Medium Frequency'))
      .otherwise(pl.lit('Low Frequency'))
      .alias('purchase_frequency')
])

print(result)

### Pattern 4: Chained string operations

In [None]:
df_text = pl.DataFrame({
    'text': ['  Hello World  ', 'PYTHON  ', '  data science', 'Machine Learning']
})

result = df_text.select([
    pl.col('text'),
    pl.col('text').str.strip_chars().alias('trimmed'),
    pl.col('text').str.strip_chars().str.to_lowercase().alias('cleaned'),
    pl.col('text').str.strip_chars().str.to_lowercase().str.replace_all(' ', '_').alias('snake_case')  # replace() would only change the first space
])

print(result)

### Pattern 5: Expression over expression (nested)

In [None]:
# Calculate percentile rank
result = df.select([
    pl.col('name'),
    pl.col('income'),
    pl.col('income').rank().alias('income_rank'),
    (pl.col('income').rank() / pl.len() * 100).alias('income_percentile')
])

print(result)

## Part 8: Working with Lists/Arrays in Expressions

In [None]:
# DataFrame with list columns
df_lists = pl.DataFrame({
    'customer': ['Alice', 'Bob', 'Charlie'],
    'purchase_amounts': [[100, 200, 150], [50, 75], [300, 250, 400, 100]],
    'product_ids': [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
})

print("DataFrame with lists:")
print(df_lists)

In [None]:
# List operations
result = df_lists.select([
    pl.col('customer'),
    pl.col('purchase_amounts').list.len().alias('num_purchases'),
    pl.col('purchase_amounts').list.sum().alias('total_spent'),
    pl.col('purchase_amounts').list.mean().alias('avg_purchase'),
    pl.col('purchase_amounts').list.max().alias('max_purchase'),
    pl.col('purchase_amounts').list.first().alias('first_purchase')
])

print("\nList aggregations:")
print(result)

## Part 9: Expression Aliases and Naming

In [None]:
# Different ways to name columns

# 1. Using .alias()
result1 = df.select([
    (pl.col('income') / 1000).alias('income_k')
])

# 2. Using name.suffix() / name.prefix()
result2 = df.select([
    pl.col('age', 'income').name.suffix('_original')
])

# 3. Using name.map()
result3 = df.select([
    pl.col('age', 'income').name.map(lambda x: f'renamed_{x}')
])

print("With .alias():")
print(result1)

print("\nWith .name.suffix():")
print(result2)

print("\nWith .name.map():")
print(result3)

## Part 10: Performance Tips

### Tip 1: Use native expressions instead of map_elements/map_batches when possible

In [None]:
import time

large_df = pl.DataFrame({
    'value': range(100000)
})

# SLOW: map_elements calls a Python function once per value
start = time.perf_counter()
result1 = large_df.select([
    pl.col('value').map_elements(lambda x: x * 2, return_dtype=pl.Int64).alias('doubled')
])
time1 = time.perf_counter() - start

# FAST: native expression, evaluated inside the Polars engine
start = time.perf_counter()
result2 = large_df.select([
    (pl.col('value') * 2).alias('doubled')
])
time2 = time.perf_counter() - start

print(f"map_elements: {time1:.4f}s")
print(f"Expression: {time2:.4f}s")
print(f"Expression is {time1/time2:.1f}x faster!")

### Tip 2: Combine operations to reduce passes over data

In [None]:
# LESS EFFICIENT: three sequential with_columns() calls, each building an intermediate DataFrame
result = df.with_columns([
    (pl.col('income') / 1000).alias('income_k')
]).with_columns([
    (pl.col('total_spent') / pl.col('purchases')).alias('avg_purchase')
]).with_columns([
    (pl.col('age') > 30).alias('is_senior')
])

# MORE EFFICIENT: one call; the independent expressions can be evaluated in parallel
result = df.with_columns([
    (pl.col('income') / 1000).alias('income_k'),
    (pl.col('total_spent') / pl.col('purchases')).alias('avg_purchase'),
    (pl.col('age') > 30).alias('is_senior')
])

print("Combine operations in a single with_columns()!")

## Summary

### Key Concepts:
1. **Expressions** describe computations on columns; they produce a Series only when evaluated in a context
2. **pl.col()** selects columns by name and supports regex patterns anchored with `^` and `$`
3. **Selectors (cs.*)** enable dtype-based operations
4. **when/then/otherwise** for conditional logic
5. **Contexts**: select, with_columns, filter, group_by().agg()
6. **Chaining** expressions creates powerful transformations

### Best Practices:
- Prefer native expressions to Python callbacks such as `map_elements` (typically much faster)
- Combine independent operations in a single `select()` or `with_columns()` call
- Use selectors for operations across many columns
- Chain operations for readability
- Use .alias() for clear column naming