# Polars vs Pandas: Large Dataset Performance Comparison

This notebook compares Polars and Pandas performance on real-world datasets with millions of rows.

## Datasets Used

### 1. NYC Taxi Trip Data (~2.9M rows for January 2024)
- **Source**: NYC Taxi & Limousine Commission via AWS Open Data
- **Format**: Parquet (original CSV also available)
- **URL**: https://registry.opendata.aws/nyc-tlc-trip-records-pds/
- **Data**: Yellow taxi trip records including pickup/dropoff times, locations, fares, etc.

### 2. US Airline On-Time Performance (10M+ rows)
- **Source**: Bureau of Transportation Statistics (BTS)
- **Format**: CSV/Parquet
- **URL**: https://www.transtats.bts.gov/
- **Data**: Flight on-time performance including delays, cancellations, carrier info

## Setup

In [None]:
import polars as pl
import pandas as pd
import numpy as np
import time
import psutil
import os
from pathlib import Path

# For downloading data
import requests
from io import BytesIO

print(f"Polars version: {pl.__version__}")
print(f"Pandas version: {pd.__version__}")

## Helper Functions for Benchmarking

In [None]:
def get_memory_usage_mb():
    """Get current process memory usage in MB"""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

def benchmark(func, name, *args, **kwargs):
    """Benchmark a function execution"""
    mem_before = get_memory_usage_mb()
    start = time.perf_counter()  # Monotonic clock, better suited to timing than time.time()
    result = func(*args, **kwargs)
    duration = time.perf_counter() - start
    mem_after = get_memory_usage_mb()
    
    print(f"{name}:")
    print(f"  Time: {duration:.3f}s")
    print(f"  Memory change: {mem_after - mem_before:.2f} MB")
    print(f"  Total memory: {mem_after:.2f} MB")
    
    return result, duration, mem_after - mem_before

def compare_operations(polars_func, pandas_func, description):
    """Compare Polars vs Pandas for a given operation"""
    print(f"\n{'='*60}")
    print(f"Operation: {description}")
    print('='*60)
    
    # Polars
    pl_result, pl_time, pl_mem = benchmark(polars_func, "Polars")
    
    # Pandas
    pd_result, pd_time, pd_mem = benchmark(pandas_func, "Pandas")
    
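    # Comparison
    speedup = pd_time / pl_time if pl_time > 0 else float('inf')
    
    print(f"\n📊 Results:")
    if speedup >= 1:
        print(f"  Speedup: {speedup:.2f}x (Polars is {speedup:.2f}x faster)")
    elif pd_time > 0:
        print(f"  Speedup: {speedup:.2f}x (Pandas is {pl_time / pd_time:.2f}x faster)")
    # RSS deltas can be zero or negative, so report a ratio only when both are positive
    if pl_mem > 0 and pd_mem > 0:
        print(f"  Memory: Pandas uses {pd_mem / pl_mem:.2f}x more memory")
    else:
        print("  Memory: ratio not meaningful (non-positive RSS change)")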
    
    return {
        'operation': description,
        'polars_time': pl_time,
        'pandas_time': pd_time,
        'speedup': speedup,
        'polars_mem': pl_mem,
        'pandas_mem': pd_mem
    }
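
A single timed run can be noisy (disk cache state, lazy imports, background processes). The following is a minimal sketch of a repeated-run timer that could complement `benchmark`. The helper name `time_repeated` and its `repeats` and `warmup` parameters are assumptions for illustration and are not part of the helpers above.

```python
import statistics
import time


def time_repeated(func, repeats=5, warmup=1):
    """Time func() several times and summarize the runs.

    `time_repeated`, `repeats`, and `warmup` are illustrative names,
    not part of the notebook's own helpers.
    """
    # Warm-up runs are executed but not recorded (e.g. to fill caches).
    for _ in range(warmup):
        func()

    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        durations.append(time.perf_counter() - start)

    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "runs": durations,
    }


stats = time_repeated(lambda: sum(range(100_000)), repeats=3)
print(f"min={stats['min']:.6f}s  median={stats['median']:.6f}s")
```

Reporting the minimum or median of several runs is usually more stable than a single measurement, at the cost of a longer total runtime.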

## Dataset 1: NYC Taxi Data (~2.9M rows)

### Download Instructions

The NYC Taxi data is available in Parquet format from AWS S3. We'll download one month of data.

**Direct download URLs** (no authentication required):
- Yellow taxi 2024: `https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet`
- Green taxi 2024: `https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2024-01.parquet`

You can also browse all available files at:
- https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

In [None]:
# Create data directory if it doesn't exist
data_dir = Path('data')
data_dir.mkdir(exist_ok=True)

# NYC Taxi data URL (January 2024 - ~2.9M rows)
nyc_taxi_url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet'
nyc_taxi_file = data_dir / 'nyc_taxi_2024_01.parquet'

# Download if not exists
if not nyc_taxi_file.exists():
    print("Downloading NYC Taxi data...")
    response = requests.get(nyc_taxi_url)
    response.raise_for_status()  # Fail loudly instead of saving an error page
    with open(nyc_taxi_file, 'wb') as f:
        f.write(response.content)
    print(f"✓ Downloaded to {nyc_taxi_file}")
else:
    print(f"✓ File already exists: {nyc_taxi_file}")

# Check file size
file_size_mb = nyc_taxi_file.stat().st_size / 1024 / 1024
print(f"File size: {file_size_mb:.2f} MB")
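
The download cell above holds the whole response in memory before writing it to disk. For larger files, a streamed download avoids that. Here is a minimal sketch that uses the standard library's `urllib.request` instead of `requests`. The helper name `download_streamed`, the `.part` suffix, and the 1 MiB chunk size are assumptions chosen for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_streamed(url, dest, chunk_size=1024 * 1024):
    """Copy the body of `url` to `dest` in chunks; skip if dest exists.

    Hypothetical helper: the name and the 1 MiB default chunk size are
    illustrative choices, not taken from the notebook.
    """
    dest = Path(dest)
    if dest.exists():
        return dest

    # Write to a temporary name first so an interrupted download
    # does not leave a truncated file that looks complete.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    tmp.replace(dest)
    return dest
```

With the variables defined above, a call would look like `download_streamed(nyc_taxi_url, nyc_taxi_file)`.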

### Loading Data: Polars vs Pandas

In [None]:
# Load with Polars
df_pl, pl_load_time, pl_load_mem = benchmark(
    pl.read_parquet,
    "Polars - Load Parquet",
    nyc_taxi_file
)

print(f"\nShape: {df_pl.shape}")
print(f"\nFirst few rows:")
print(df_pl.head())

In [None]:
# Load with Pandas
df_pd, pd_load_time, pd_load_mem = benchmark(
    pd.read_parquet,
    "Pandas - Load Parquet",
    nyc_taxi_file
)

print(f"\nShape: {df_pd.shape}")
print(f"\nFirst few rows:")
print(df_pd.head())

# Compare loading performance
print(f"\n{'='*60}")
print(f"Loading Performance Comparison")
print('='*60)
print(f"Polars: {pl_load_time:.3f}s, Memory: {pl_load_mem:.2f} MB")
print(f"Pandas: {pd_load_time:.3f}s, Memory: {pd_load_mem:.2f} MB")
print(f"Speedup: {pd_load_time/pl_load_time:.2f}x")

### Data Exploration

In [None]:
# Polars exploration
print("Polars DataFrame Info:")
print(df_pl.describe())
print(f"\nColumns: {df_pl.columns}")
print(f"\nData types:")
print(df_pl.schema)

### Performance Benchmarks

#### 1. Filtering Operations

In [None]:
# Benchmark: Filter for trips > $50 with more than 2 passengers
result1 = compare_operations(
    lambda: df_pl.filter(
        (pl.col('total_amount') > 50) & 
        (pl.col('passenger_count') > 2)
    ),
    lambda: df_pd[
        (df_pd['total_amount'] > 50) & 
        (df_pd['passenger_count'] > 2)
    ],
    "Filter: total_amount > $50 AND passenger_count > 2"
)

#### 2. Aggregation Operations

In [None]:
# Benchmark: Group by passenger count and calculate average fare
result2 = compare_operations(
    lambda: df_pl.group_by('passenger_count').agg([
        pl.col('total_amount').mean().alias('avg_fare'),
        pl.col('trip_distance').mean().alias('avg_distance'),
        pl.col('VendorID').count().alias('trip_count')
    ]).sort('passenger_count'),
    lambda: df_pd.groupby('passenger_count').agg({
        'total_amount': 'mean',
        'trip_distance': 'mean',
        'VendorID': 'count'
    }).rename(columns={
        'total_amount': 'avg_fare',
        'trip_distance': 'avg_distance',
        'VendorID': 'trip_count'
    }).sort_index(),
    "GroupBy passenger_count with aggregations"
)

#### 3. Complex Aggregations

In [None]:
# Benchmark: Multiple group by with complex aggregations
result3 = compare_operations(
    lambda: df_pl.group_by(['PULocationID', 'DOLocationID']).agg([
        pl.col('total_amount').sum().alias('total_revenue'),
        pl.col('trip_distance').mean().alias('avg_distance'),
        pl.col('VendorID').count().alias('trip_count'),
        pl.col('tip_amount').max().alias('max_tip')
    ]).filter(pl.col('trip_count') > 100).sort('total_revenue', descending=True).head(20),
    lambda: df_pd.groupby(['PULocationID', 'DOLocationID']).agg({
        'total_amount': 'sum',
        'trip_distance': 'mean',
        'VendorID': 'count',
        'tip_amount': 'max'
    }).rename(columns={
        'total_amount': 'total_revenue',
        'trip_distance': 'avg_distance',
        'VendorID': 'trip_count',
        'tip_amount': 'max_tip'
    }).query('trip_count > 100').sort_values('total_revenue', ascending=False).head(20),
    "Complex GroupBy: Top 20 routes by revenue (with filters)"
)

#### 4. DateTime Operations

In [None]:
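# Benchmark: Extract datetime features from the pickup timestamp
# Note: Polars dt.weekday() is 1-7 (Monday=1); pandas dt.weekday is 0-6 (Monday=0)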
result4 = compare_operations(
    lambda: df_pl.with_columns([
        pl.col('tpep_pickup_datetime').dt.hour().alias('hour'),
        pl.col('tpep_pickup_datetime').dt.day().alias('day'),
        pl.col('tpep_pickup_datetime').dt.month().alias('month'),
        pl.col('tpep_pickup_datetime').dt.weekday().alias('weekday')
    ]),
    lambda: df_pd.assign(
        hour=df_pd['tpep_pickup_datetime'].dt.hour,
        day=df_pd['tpep_pickup_datetime'].dt.day,
        month=df_pd['tpep_pickup_datetime'].dt.month,
        weekday=df_pd['tpep_pickup_datetime'].dt.weekday
    ),
    "DateTime extraction: hour, day, month, weekday"
)

#### 5. Sorting Operations

In [None]:
# Benchmark: Sort by multiple columns
result5 = compare_operations(
    lambda: df_pl.sort(['total_amount', 'trip_distance'], descending=[True, False]),
    lambda: df_pd.sort_values(['total_amount', 'trip_distance'], ascending=[False, True]),
    "Sort by total_amount (desc) and trip_distance (asc)"
)

#### 6. Window Functions

In [None]:
# Benchmark: Running average using window functions
result6 = compare_operations(
    lambda: df_pl.with_columns(
        pl.col('total_amount').rolling_mean(window_size=100).over('PULocationID').alias('rolling_avg_fare')
    ),
    lambda: df_pd.assign(
        rolling_avg_fare=df_pd.groupby('PULocationID')['total_amount'].transform(
            # No min_periods override: both libraries then require a full
            # 100-row window, so the two computations match
            lambda x: x.rolling(window=100).mean()
        )
    ),
    "Window function: Rolling average of fare by pickup location"
)

#### 7. Column Operations

In [None]:
# Benchmark: Create multiple derived columns
result7 = compare_operations(
    lambda: df_pl.with_columns([
        (pl.col('total_amount') / pl.col('trip_distance')).alias('price_per_mile'),
        (pl.col('tip_amount') / pl.col('total_amount')).alias('tip_percentage'),
        (pl.col('tpep_dropoff_datetime') - pl.col('tpep_pickup_datetime')).dt.total_seconds().alias('duration_seconds')
    ]),
    lambda: df_pd.assign(
        price_per_mile=df_pd['total_amount'] / df_pd['trip_distance'],
        tip_percentage=df_pd['tip_amount'] / df_pd['total_amount'],
        duration_seconds=(df_pd['tpep_dropoff_datetime'] - df_pd['tpep_pickup_datetime']).dt.total_seconds()
    ),
    "Create derived columns: price_per_mile, tip_percentage, duration"
)

### Summary of Results

In [None]:
# Collect all results
results = [result1, result2, result3, result4, result5, result6, result7]

# Create summary DataFrame
summary_df = pl.DataFrame(results)

print("\n" + "="*80)
print("PERFORMANCE SUMMARY - NYC Taxi Dataset (~2.9M rows)")
print("="*80)
print(summary_df)

print(f"\n\n📊 Overall Statistics:")
print(f"Average Speedup: {summary_df['speedup'].mean():.2f}x")
print(f"Median Speedup: {summary_df['speedup'].median():.2f}x")
print(f"Max Speedup: {summary_df['speedup'].max():.2f}x")
print(f"Min Speedup: {summary_df['speedup'].min():.2f}x")

## Dataset 2: Multiple Months for 10M+ Rows

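For larger datasets (10M+ rows), we can combine multiple months of NYC Taxi data or use the airline dataset. This section combines four months of yellow taxi data; the airline dataset is listed as an alternative and is not benchmarked here.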

In [None]:
# Download multiple months to get 10M+ rows
months = ['2024-01', '2024-02', '2024-03', '2024-04']
files_to_download = []

for month in months:
    url = f'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_{month}.parquet'
    filepath = data_dir / f'nyc_taxi_{month}.parquet'
    
    if not filepath.exists():
        print(f"Downloading {month} data...")
        response = requests.get(url)
        response.raise_for_status()  # Fail loudly instead of saving an error page
        with open(filepath, 'wb') as f:
            f.write(response.content)
        print(f"✓ Downloaded {filepath.name}")
    else:
        print(f"✓ Already exists: {filepath.name}")
    
    files_to_download.append(str(filepath))

print(f"\nTotal files: {len(files_to_download)}")

### Loading and Combining Large Dataset

In [None]:
# Polars: Load and concatenate multiple files
print("Loading with Polars...")
large_df_pl, pl_large_time, pl_large_mem = benchmark(
    lambda: pl.concat([pl.read_parquet(f) for f in files_to_download]),
    "Polars - Load and Concatenate 4 months"
)

print(f"\nDataset shape: {large_df_pl.shape}")
print(f"Total rows: {large_df_pl.shape[0]:,}")

In [None]:
# Pandas: Load and concatenate multiple files
print("Loading with Pandas...")
large_df_pd, pd_large_time, pd_large_mem = benchmark(
    lambda: pd.concat([pd.read_parquet(f) for f in files_to_download], ignore_index=True),
    "Pandas - Load and Concatenate 4 months"
)

print(f"\nDataset shape: {large_df_pd.shape}")
print(f"Total rows: {large_df_pd.shape[0]:,}")

# Compare
print(f"\n{'='*60}")
print(f"Large Dataset Loading Comparison")
print('='*60)
print(f"Polars: {pl_large_time:.3f}s, Memory: {pl_large_mem:.2f} MB")
print(f"Pandas: {pd_large_time:.3f}s, Memory: {pd_large_mem:.2f} MB")
print(f"Speedup: {pd_large_time/pl_large_time:.2f}x")

### Benchmarks on Large Dataset (10M+ rows)

In [None]:
# Benchmark 1: Complex aggregation on large dataset
large_result1 = compare_operations(
    lambda: large_df_pl.group_by([
        pl.col('tpep_pickup_datetime').dt.date().alias('date'),
        'PULocationID'
    ]).agg([
        pl.col('total_amount').sum().alias('daily_revenue'),
        pl.col('trip_distance').mean().alias('avg_distance'),
        pl.col('VendorID').count().alias('trip_count'),
        pl.col('passenger_count').sum().alias('total_passengers')
    ]).sort('daily_revenue', descending=True).head(100),
    lambda: large_df_pd.assign(
        date=large_df_pd['tpep_pickup_datetime'].dt.date
    ).groupby(['date', 'PULocationID']).agg({
        'total_amount': 'sum',
        'trip_distance': 'mean',
        'VendorID': 'count',
        'passenger_count': 'sum'
    }).rename(columns={
        'total_amount': 'daily_revenue',
        'trip_distance': 'avg_distance',
        'VendorID': 'trip_count',
        'passenger_count': 'total_passengers'
    }).sort_values('daily_revenue', ascending=False).head(100),
    "Large Dataset: Daily revenue by location (top 100)"
)

In [None]:
# Benchmark 2: Filtering on large dataset
large_result2 = compare_operations(
    lambda: large_df_pl.filter(
        (pl.col('trip_distance') > 10) &
        (pl.col('total_amount') > 30) &
        (pl.col('passenger_count') >= 2)
    ).select([
        'tpep_pickup_datetime',
        'PULocationID',
        'DOLocationID',
        'trip_distance',
        'total_amount'
    ]),
    lambda: large_df_pd[
        (large_df_pd['trip_distance'] > 10) &
        (large_df_pd['total_amount'] > 30) &
        (large_df_pd['passenger_count'] >= 2)
    ][[
        'tpep_pickup_datetime',
        'PULocationID',
        'DOLocationID',
        'trip_distance',
        'total_amount'
    ]],
    "Large Dataset: Filter long expensive trips"
)

In [None]:
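# Benchmark 3: Percentile calculations
# Polars defaults to 'nearest' interpolation while pandas defaults to 'linear',
# so 'linear' is requested explicitly to compare the same statistic
large_result3 = compare_operations(
    lambda: large_df_pl.group_by('PULocationID').agg([
        pl.col('total_amount').quantile(0.25, interpolation='linear').alias('p25_fare'),
        pl.col('total_amount').quantile(0.50, interpolation='linear').alias('p50_fare'),
        pl.col('total_amount').quantile(0.75, interpolation='linear').alias('p75_fare'),
        pl.col('total_amount').quantile(0.95, interpolation='linear').alias('p95_fare'),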
    ]),
    lambda: large_df_pd.groupby('PULocationID')['total_amount'].quantile([0.25, 0.50, 0.75, 0.95]).unstack(),
    "Large Dataset: Fare percentiles by location"
)

### Large Dataset Summary

In [None]:
# Large dataset results summary
large_results = [large_result1, large_result2, large_result3]
large_summary_df = pl.DataFrame(large_results)

print("\n" + "="*80)
print(f"PERFORMANCE SUMMARY - Large Dataset ({large_df_pl.shape[0]:,} rows)")
print("="*80)
print(large_summary_df)

print(f"\n\n📊 Large Dataset Statistics:")
print(f"Average Speedup: {large_summary_df['speedup'].mean():.2f}x")
print(f"Median Speedup: {large_summary_df['speedup'].median():.2f}x")
print(f"Max Speedup: {large_summary_df['speedup'].max():.2f}x")

## Lazy Evaluation Demo (Polars Only)

One of Polars' key advantages is lazy evaluation with query optimization.

In [None]:
# Lazy evaluation example
print("Demonstrating Polars Lazy Evaluation...\n")

# Create lazy query
lazy_query = (
    pl.scan_parquet(str(nyc_taxi_file))
    .filter(pl.col('total_amount') > 0)
    .with_columns([
        pl.col('tpep_pickup_datetime').dt.hour().alias('hour'),
        (pl.col('tip_amount') / pl.col('total_amount')).alias('tip_ratio')
    ])
    .group_by('hour')
    .agg([
        pl.col('total_amount').mean().alias('avg_fare'),
        pl.col('tip_ratio').mean().alias('avg_tip_ratio'),
        pl.col('VendorID').count().alias('trip_count')
    ])
    .sort('hour')
)

print("Query plan:")
print(lazy_query.explain())

print("\n" + "="*60)
print("Executing lazy query...")
start = time.perf_counter()
result = lazy_query.collect()
lazy_time = time.perf_counter() - start
print(f"Lazy execution time: {lazy_time:.3f}s")
print("\nResult:")
print(result)

In [None]:
# Compare with eager execution.
# The lazy query reads the Parquet file as part of its execution, so the eager
# version also reads it inside the timed region to keep the comparison fair.
print("\nComparing with eager execution...")
start = time.perf_counter()
eager_result = (
    pl.read_parquet(nyc_taxi_file)
    .filter(pl.col('total_amount') > 0)
    .with_columns([
        pl.col('tpep_pickup_datetime').dt.hour().alias('hour'),
        (pl.col('tip_amount') / pl.col('total_amount')).alias('tip_ratio')
    ])
    .group_by('hour')
    .agg([
        pl.col('total_amount').mean().alias('avg_fare'),
        pl.col('tip_ratio').mean().alias('avg_tip_ratio'),
        pl.col('VendorID').count().alias('trip_count')
    ])
    .sort('hour')
)
eager_time = time.perf_counter() - start

print(f"Eager execution time: {eager_time:.3f}s")
print(f"Eager/lazy time ratio: {eager_time/lazy_time:.2f}x (values above 1 mean lazy was faster)")

## Memory Efficiency Comparison

In [None]:
# Check memory usage of DataFrames
print("Memory Usage Comparison:\n")

# Polars memory usage
pl_memory = df_pl.estimated_size() / 1024 / 1024  # Convert to MB
print(f"Polars DataFrame: {pl_memory:.2f} MB")

# Pandas memory usage
pd_memory = df_pd.memory_usage(deep=True).sum() / 1024 / 1024  # Convert to MB
print(f"Pandas DataFrame: {pd_memory:.2f} MB")

print(f"\nPandas uses {pd_memory/pl_memory:.2f}x more memory than Polars")

## Key Takeaways

### Performance Benefits of Polars:
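1. **Speed**: Often 2-10x faster than Pandas on large datasets, though the factor varies by operation, hardware, and library versions (see the measured speedups above)
2. **Memory**: Often more memory-efficient thanks to its Arrow-based columnar memory format
3. **Lazy Evaluation**: The query optimizer can push filters and column selection down to the scan, so less data is read and processed
4. **Parallel Processing**: Many operations are parallelized across CPU cores automatically
5. **Type Safety**: A stricter type system surfaces type errors earlier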

### When to Use Polars:
- Large datasets (1M+ rows)
- Complex aggregations and transformations
- Performance-critical applications
- New projects without Pandas legacy code

### When Pandas Might Be Better:
- Small datasets (<100k rows) where performance doesn't matter
- Existing codebase with heavy Pandas usage
- Need for specific Pandas ecosystem libraries
- Team familiarity with Pandas API

## Further Resources

### Datasets:
- **NYC Taxi Data**: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- **US Airline Data**: https://www.transtats.bts.gov/
- **More datasets**: https://registry.opendata.aws/

### Documentation:
- **Polars**: https://pola-rs.github.io/polars/
- **Performance Guide**: https://pola-rs.github.io/polars-book/user-guide/performance/
- **Migration from Pandas**: https://pola-rs.github.io/polars-book/user-guide/migration/