# Polars Interoperability - Comprehensive Guide

This notebook covers how Polars integrates with other data tools and libraries.

## What You'll Learn:
- DuckDB integration (SQL on Polars DataFrames)
- Apache Arrow ecosystem and zero-copy operations
- Pandas interoperability (to_pandas, from_pandas)
- NumPy integration for numerical computing
- When to use each tool
- Performance implications of conversions
- Best practices for multi-tool workflows

## Prerequisites:
```bash
pip install polars duckdb pyarrow pandas numpy
```

In [None]:
import polars as pl
import pandas as pd
import numpy as np
import duckdb
try:
    import pyarrow as pa
    HAS_ARROW = True
except ImportError:
    HAS_ARROW = False
    print("⚠️ PyArrow not installed. Some examples will be skipped.")

from datetime import datetime, date
import time

print(f"Polars version: {pl.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"DuckDB version: {duckdb.__version__}")
if HAS_ARROW:
    print(f"PyArrow version: {pa.__version__}")

---
# Part 1: DuckDB Integration

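DuckDB is an embedded, in-process SQL database optimized for analytics. It can query Polars DataFrames directly: when a table name in the SQL matches a Python variable holding a DataFrame, DuckDB scans that DataFrame.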

## 1.1 Basic DuckDB Queries on Polars

In [None]:
# Create sample Polars DataFrame
df_sales = pl.DataFrame({
    'order_id': range(1, 11),
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 
                 'Alice', 'Diana', 'Charlie', 'Bob', 'Diana'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop',
                'Mouse', 'Keyboard', 'Mouse', 'Monitor', 'Laptop'],
    'quantity': [1, 2, 1, 1, 1, 3, 2, 1, 2, 1],
    'price': [1200, 25, 75, 300, 1200, 25, 75, 25, 300, 1200],
    'date': [date(2024, 1, i) for i in range(1, 11)]
})

print("Polars DataFrame:")
print(df_sales)

In [None]:
# Query Polars DataFrame using DuckDB SQL
# DuckDB can directly reference the DataFrame variable name!

result = duckdb.query("""
    SELECT 
        customer,
        COUNT(*) as num_orders,
        SUM(quantity * price) as total_revenue
    FROM df_sales
    GROUP BY customer
    ORDER BY total_revenue DESC
""").pl()  # .pl() returns Polars DataFrame

print("DuckDB query result (as Polars):")
print(result)
print(f"Type: {type(result)}")

In [None]:
# Complex SQL queries with JOINs
df_customers = pl.DataFrame({
    'customer': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 
              'diana@example.com', 'eve@example.com'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston', 'Phoenix']
})

result = duckdb.query("""
    SELECT 
        c.customer,
        c.city,
        COUNT(s.order_id) as num_orders,
        COALESCE(SUM(s.quantity * s.price), 0) as total_spent
    FROM df_customers c
    LEFT JOIN df_sales s ON c.customer = s.customer
    GROUP BY c.customer, c.city
    ORDER BY total_spent DESC
""").pl()

print("Complex JOIN query:")
print(result)

## 1.2 DuckDB vs Polars SQL

In [None]:
# Polars native SQL (using SQLContext)
ctx = pl.SQLContext()
ctx.register("sales", df_sales)
ctx.register("customers", df_customers)

result_polars = ctx.execute("""
    SELECT 
        customer,
        SUM(quantity * price) as total
    FROM sales
    GROUP BY customer
    ORDER BY total DESC
""").collect()

print("Polars SQL result:")
print(result_polars)

In [None]:
# Performance comparison: DuckDB vs Polars SQL vs Polars expressions
# Create larger dataset
n = 100_000
df_large = pl.DataFrame({
    'id': range(n),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n),
    'value': np.random.randn(n)
})

# Method 1: DuckDB
start = time.time()
result1 = duckdb.query("""
    SELECT category, AVG(value) as avg_value
    FROM df_large
    GROUP BY category
""").pl()
time1 = time.time() - start

# Method 2: Polars SQL
ctx = pl.SQLContext(data=df_large)
start = time.time()
result2 = ctx.execute("""
    SELECT category, AVG(value) as avg_value
    FROM data
    GROUP BY category
""").collect()
time2 = time.time() - start

# Method 3: Polars expressions (native)
start = time.time()
result3 = df_large.group_by('category').agg(
    pl.col('value').mean().alias('avg_value')
)
time3 = time.time() - start

print(f"DuckDB SQL:        {time1:.4f}s")
print(f"Polars SQL:        {time2:.4f}s")
print(f"Polars expressions: {time3:.4f}s")
print("\n💡 Polars expressions are often fastest for simple queries, but single-run timings are noisy. Benchmark on your own data!")

## 1.3 When to Use DuckDB vs Polars

In [None]:
# Comparison guide
comparison = pl.DataFrame({
    'Scenario': [
        'Simple aggregations',
        'Complex SQL (CTEs, subqueries)',
        'Window functions',
        'Multiple joins',
        'Integration with SQL databases',
        'Memory efficiency',
        'Team familiar with SQL',
        'Need best performance',
        'Type safety important'
    ],
    'Polars': [
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐⭐',
        '⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐⭐⭐'
    ],
    'DuckDB': [
        '⭐⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐⭐',
        '⭐⭐⭐'
    ]
})

print("Polars vs DuckDB:")
print(comparison)

---
# Part 2: Apache Arrow Integration

Polars stores its data in the Apache Arrow columnar memory format. This lets it share data with other Arrow-aware tools, in most cases without copying.

## 2.1 Understanding Apache Arrow

In [None]:
if HAS_ARROW:
    # Polars DataFrame -> Arrow Table (zero-copy for most data types)
    df_arrow = pl.DataFrame({
        'id': [1, 2, 3, 4, 5],
        'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
        'value': [10.5, 20.3, 30.1, 40.8, 50.2]
    })
    
    # Convert to Arrow
    arrow_table = df_arrow.to_arrow()
    
    print("PyArrow Table:")
    print(arrow_table)
    print(f"\nType: {type(arrow_table)}")
    print(f"Schema: {arrow_table.schema}")
else:
    print("PyArrow not available")

In [None]:
if HAS_ARROW:
    # Arrow Table -> Polars DataFrame (zero-copy for most data types)
    df_from_arrow = pl.from_arrow(arrow_table)
    
    print("Polars DataFrame from Arrow:")
    print(df_from_arrow)
    print(f"Type: {type(df_from_arrow)}")

## 2.2 Zero-Copy Operations

In [None]:
if HAS_ARROW:
    # Compare buffer sizes across a Polars -> Arrow -> Polars round trip
    df_large = pl.DataFrame({
        'values': list(range(1_000_000))
    })
    
    # Memory before conversion
    mem_before = df_large.estimated_size('mb')
    
    # Convert to Arrow (zero-copy for this integer column)
    arrow_large = df_large.to_arrow()
    
    # Convert back to Polars (zero-copy)
    df_back = pl.from_arrow(arrow_large)
    
    print(f"Original DF memory: {mem_before:.2f} MB")
    # Table.nbytes reports the size of the Arrow buffers; sys.getsizeof would
    # only measure the small Python wrapper object
    print(f"Arrow table memory: {arrow_large.nbytes / 1024 / 1024:.2f} MB")
    print(f"Back to Polars memory: {df_back.estimated_size('mb'):.2f} MB")
    print("\n💡 Zero-copy: the three objects share the same buffers instead of holding separate copies!")

## 2.3 Arrow Ecosystem Benefits

In [None]:
if HAS_ARROW:
    # Arrow enables efficient data exchange
    # DuckDB scans the Polars DataFrame through Arrow (no upfront copy of the input)
    
    df = pl.DataFrame({
        'a': [1, 2, 3],
        'b': [4, 5, 6]
    })
    
    # Via Arrow, DuckDB can query Polars data efficiently
    result = duckdb.query("""
        SELECT a, b, a + b as sum
        FROM df
    """).pl()
    
    # The query result is newly computed data; only the input scan avoids a copy
    print("Polars -> Arrow -> DuckDB -> Polars:")
    print(result)
    print("\n✅ Arrow is the 'universal' format for data tools!")

---
# Part 3: Pandas Interoperability

Converting between Polars and Pandas for library compatibility.

## 3.1 Polars to Pandas

In [None]:
# Create Polars DataFrame
df_polars = pl.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 42],
    'salary': [70000, 80000, 90000, 75000, 95000]
})

# Convert to Pandas
df_pandas = df_polars.to_pandas()

print("Polars DataFrame:")
print(df_polars)
print(f"Type: {type(df_polars)}")

print("\nPandas DataFrame:")
print(df_pandas)
print(f"Type: {type(df_pandas)}")

## 3.2 Pandas to Polars

In [None]:
# Create Pandas DataFrame
df_pd = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [1200, 25, 75],
    'stock': [15, 100, 50]
})

# Convert to Polars
df_pl = pl.from_pandas(df_pd)

print("Pandas DataFrame:")
print(df_pd)
print(f"Type: {type(df_pd)}")

print("\nPolars DataFrame:")
print(df_pl)
print(f"Type: {type(df_pl)}")

## 3.3 Type Mapping and Gotchas

In [None]:
# Type differences between Pandas and Polars
df_types = pl.DataFrame({
    'int64': pl.Series([1, 2, 3], dtype=pl.Int64),
    'float64': pl.Series([1.0, 2.0, 3.0], dtype=pl.Float64),
    'string': pl.Series(['a', 'b', 'c'], dtype=pl.Utf8),
    'categorical': pl.Series(['A', 'B', 'A'], dtype=pl.Categorical),
    'date': pl.Series([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)], dtype=pl.Date)
})

print("Polars dtypes:")
print(df_types.dtypes)

# Convert to Pandas
df_pd_converted = df_types.to_pandas()
print("\nPandas dtypes:")
print(df_pd_converted.dtypes)

print("\n💡 Categorical becomes 'category', Date becomes datetime64")

## 3.4 Performance: Polars vs Pandas

In [None]:
# Create larger dataset for comparison
n = 100_000
data = {
    'id': range(n),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n),
    'value': np.random.randn(n)
}

# Create both DataFrames
df_polars_perf = pl.DataFrame(data)
df_pandas_perf = pd.DataFrame(data)

# GroupBy aggregation in Polars
start = time.time()
result_polars = df_polars_perf.group_by('category').agg([
    pl.col('value').mean().alias('mean'),
    pl.col('value').std().alias('std')
])
time_polars = time.time() - start

# GroupBy aggregation in Pandas
start = time.time()
result_pandas = df_pandas_perf.groupby('category')['value'].agg(['mean', 'std'])
time_pandas = time.time() - start

print(f"Polars: {time_polars:.4f}s")
print(f"Pandas: {time_pandas:.4f}s")
print(f"\nPandas time / Polars time: {time_pandas/time_polars:.1f}x (values above 1 mean Polars was faster)")

## 3.5 When to Use Pandas vs Polars

In [None]:
# Decision matrix
decision = pl.DataFrame({
    'Use Case': [
        'Small data (<100MB)',
        'Large data (>1GB)',
        'Need specific library',
        'Performance critical',
        'Memory constrained',
        'Team familiarity',
        'Time series analysis',
        'Machine learning prep',
        'Data cleaning pipeline'
    ],
    'Polars': [
        '⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐',
        '⭐⭐⭐⭐',
        '⭐⭐⭐⭐',
        '⭐⭐⭐⭐⭐'
    ],
    'Pandas': [
        '⭐⭐⭐⭐⭐',
        '⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐',
        '⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐⭐⭐',
        '⭐⭐⭐'
    ],
    'Recommendation': [
        'Either works fine',
        'Polars strongly preferred',
        'Use Pandas, convert if needed',
        'Polars',
        'Polars',
        'Pandas (but learn Polars!)',
        'Pandas has more tools',
        'Pandas (scikit-learn, etc.)',
        'Polars for speed'
    ]
})

print("Polars vs Pandas Decision Guide:")
print(decision)

---
# Part 4: NumPy Integration

Converting between Polars and NumPy for numerical computing.

## 4.1 Polars to NumPy

In [None]:
# Polars DataFrame to NumPy array
df_numpy = pl.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [10, 20, 30, 40, 50],
    'c': [100, 200, 300, 400, 500]
})

# Convert entire DataFrame to 2D NumPy array
arr = df_numpy.to_numpy()

print("Polars DataFrame:")
print(df_numpy)

print("\nNumPy array:")
print(arr)
print(f"Shape: {arr.shape}")
print(f"Dtype: {arr.dtype}")

In [None]:
# Convert single column (Series) to NumPy array
col_arr = df_numpy['a'].to_numpy()

print("Single column as NumPy array:")
print(col_arr)
print(f"Shape: {col_arr.shape}")
print(f"Type: {type(col_arr)}")

## 4.2 NumPy to Polars

In [None]:
# NumPy array to Polars DataFrame
arr_2d = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# Convert with column names
df_from_numpy = pl.DataFrame(arr_2d, schema=['col_a', 'col_b', 'col_c'])

print("NumPy array:")
print(arr_2d)

print("\nPolars DataFrame:")
print(df_from_numpy)

In [None]:
# NumPy array to Polars Series
arr_1d = np.array([1.5, 2.5, 3.5, 4.5, 5.5])
series = pl.Series('values', arr_1d)

print("NumPy array:")
print(arr_1d)

print("\nPolars Series:")
print(series)

## 4.3 Using NumPy Functions with Polars

In [None]:
# Apply NumPy functions to Polars data
df_math = pl.DataFrame({
    'x': [0, 0.5, 1.0, 1.5, 2.0],
    'y': [1, 2, 3, 4, 5]
})

# Method 1: Convert to NumPy, apply function, convert back
x_arr = df_math['x'].to_numpy()
sin_values = np.sin(x_arr)
df_result = df_math.with_columns(
    pl.Series('sin_x', sin_values)
)

print("Apply NumPy sin function:")
print(df_result)

# Method 2: Use a native Polars expression (preferred when available)
# map_elements(np.sin, ...) also works, but it calls Python once per value and is much slower
df_result2 = df_math.with_columns([
    pl.col('x').sin().alias('sin_x_native')
])

print("\nUsing the native sin() expression:")
print(df_result2)

## 4.4 Advanced: Numerical Computing with Polars + NumPy

In [None]:
# Example: Linear regression using NumPy on Polars data
# Generate sample data
np.random.seed(42)
n = 100
x = np.linspace(0, 10, n)
y = 2.5 * x + 3 + np.random.randn(n) * 2

df_regression = pl.DataFrame({
    'x': x,
    'y': y
})

# Extract as NumPy for linear regression
X = df_regression['x'].to_numpy()
Y = df_regression['y'].to_numpy()

# Fit line: y = mx + b
# Using numpy.polyfit
m, b = np.polyfit(X, Y, 1)
print(f"Linear fit: y = {m:.2f}x + {b:.2f}")

# Add predictions back to Polars DataFrame
predictions = m * X + b
df_regression = df_regression.with_columns(
    pl.Series('y_pred', predictions)
)

print("\nData with predictions:")
print(df_regression.head())

---
# Part 5: Multi-Tool Workflows

## 5.1 Polars + DuckDB + Pandas Pipeline

In [None]:
# Real-world workflow: Combine strengths of each tool

# Step 1: Load and clean with Polars (fast)
df_raw = pl.DataFrame({
    'id': range(1, 101),
    'category': np.random.choice(['A', 'B', 'C'], 100),
    'value': np.random.randn(100) * 10 + 50,
    'date': [date(2024, 1, 1) + pd.Timedelta(days=i) for i in range(100)]
})

print("Step 1 - Polars: Load and clean")
df_clean = (
    df_raw
    .filter(pl.col('value') > 0)
    .with_columns(
        pl.col('value').round(2).alias('value')
    )
)
print(f"Cleaned: {len(df_clean)} rows")

# Step 2: Complex aggregation with DuckDB (SQL)
print("\nStep 2 - DuckDB: Complex SQL query")
df_agg = duckdb.query("""
    SELECT 
        category,
        COUNT(*) as count,
        AVG(value) as avg_value,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) as median_value
    FROM df_clean
    GROUP BY category
    ORDER BY avg_value DESC
""").pl()
print(df_agg)

# Step 3: Use Pandas for specialized analysis (if needed)
print("\nStep 3 - Pandas: Time series resampling")
df_pandas_ts = df_clean.to_pandas()
df_pandas_ts = df_pandas_ts.set_index('date')
df_resampled = df_pandas_ts['value'].resample('7D').mean()
print(df_resampled.head())

print("\n✅ Multi-tool pipeline complete!")

## 5.2 Best Practices for Tool Selection

In [None]:
# Best practices summary
best_practices = pl.DataFrame({
    'Task': [
        'Initial data loading',
        'Data cleaning',
        'Large aggregations',
        'Complex SQL queries',
        'Time series analysis',
        'Machine learning prep',
        'Statistical analysis',
        'Visualization prep',
        'Export to database'
    ],
    'Recommended Tool': [
        'Polars (scan_parquet, scan_csv)',
        'Polars (fast, expressive)',
        'Polars (parallel, lazy)',
        'DuckDB (full SQL support)',
        'Pandas (rich ecosystem)',
        'Pandas (scikit-learn integration)',
        'NumPy/SciPy via Pandas',
        'Convert to Pandas at end',
        'DuckDB or Pandas'
    ],
    'Why': [
        'Lazy loading, columnar format',
        'Fast operations, good API',
        'Automatic parallelization',
        'CTEs, window functions',
        'Pandas has .resample(), rolling',
        'Many ML libs accept Pandas/NumPy input',
        'Mature statistical libraries',
        'Plotting libs prefer Pandas',
        'Better ecosystem support'
    ]
})

print("Tool Selection Best Practices:")
print(best_practices)

---
# Summary

## Interoperability Overview:

### **DuckDB Integration:**
- ✅ Query Polars DataFrames with SQL directly
- ✅ Reads Polars data through Apache Arrow, largely without copying
- ✅ Best for complex SQL queries (CTEs, subqueries)
- ⚠️ Can be slower than native Polars expressions for simple queries (benchmark on your own data)

### **Apache Arrow:**
- ✅ Enables zero-copy data sharing for most data types
- ✅ Universal format for data tools
- ✅ Polars is built on Arrow
- ✅ Use `.to_arrow()` and `pl.from_arrow()`

### **Pandas:**
- ✅ Convert with `.to_pandas()` and `pl.from_pandas()`
- ✅ Use Pandas for specific libraries (scikit-learn, statsmodels)
- ⚠️ Conversion usually copies data, so it adds time and memory overhead (often small for moderate sizes)
- ⚠️ Pandas typically uses more memory than Polars for the same data

### **NumPy:**
- ✅ Convert with `.to_numpy()` and `pl.Series()`
- ✅ Use NumPy for numerical computing
- ✅ Good for matrix operations, linear algebra
- ⚠️ Loses column names (2D array)

## Decision Framework:

| Need | Tool | Convert? |
|------|------|----------|
| Fast data wrangling | Polars | Native |
| Complex SQL | DuckDB | Query directly |
| ML with scikit-learn | Pandas | `.to_pandas()` |
| Time series | Pandas | `.to_pandas()` |
| Linear algebra | NumPy | `.to_numpy()` |
| Plotting | Pandas | `.to_pandas()` |

## Common Patterns:
```python
# Polars -> DuckDB
result = duckdb.query("SELECT * FROM df WHERE x > 10").pl()

# Polars -> Pandas
df_pd = df_pl.to_pandas()

# Pandas -> Polars
df_pl = pl.from_pandas(df_pd)

# Polars -> NumPy
arr = df_pl.to_numpy()

# NumPy -> Polars
df = pl.DataFrame(arr, schema=['a', 'b', 'c'])
```

## Key Principle:
**🎯 Use Polars for heavy lifting (cleaning, aggregations), convert only when needed for specialized libraries!**

---
# Practice Exercises

In [None]:
# Exercise 1: Use DuckDB to write a complex SQL query
# TODO: Create sales data, use DuckDB with CTEs to analyze


In [None]:
# Exercise 2: Build a Polars -> Pandas -> ML pipeline
# TODO: Clean data with Polars, convert to Pandas, train simple model


In [None]:
# Exercise 3: Use NumPy for numerical computation on Polars data
# TODO: Calculate moving averages using NumPy convolution
