## Using Polars

[Polars](https://pola-rs.github.io/polars-book/user-guide/) is another powerful library for working with large datasets. Built on the Apache Arrow memory format, it offers fast, multi-threaded query execution and a convenient API for data manipulation, particularly when working with Parquet files.

Key benefits for working with smart meter data:
- Lazy evaluation through `scan_parquet` and `scan_pyarrow_dataset`
- Efficient filtering and aggregation
- Native support for time-based operations


In [None]:
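import polars as pl
import pyarrow.dataset as ds
from pyarrow import fs

# Configure anonymous (unsigned) access to the public bucket.
# pyarrow's S3FileSystem takes `anonymous=True`; `anon` is the s3fs spelling.
s3 = fs.S3FileSystem(anonymous=True)

# Open the parquet dataset with pyarrow. Because the filesystem is passed
# explicitly, the path is given as bucket/key without the "s3://" scheme.
dataset = ds.dataset("weave.energy/smart-meter/", format="parquet", filesystem=s3)

# Wrap the dataset in a LazyFrame; nothing is read until .collect() is called
df = pl.scan_pyarrow_dataset(dataset)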

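# Example 1: Filter by timestamp
# Compare against a datetime value rather than a string; this assumes the
# column is a timezone-naive Datetime.
settlement_period = pl.datetime(2024, 7, 14, 20, 0, 0)
filtered_by_time = (
    df.filter(pl.col("data_collection_log_timestamp") == settlement_period)
    .collect()
)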

# Example 2: Filter by substation
substation_data = (
    df.filter(pl.col("secondary_substation_unique_id") == "6400603160")
    .collect()
)

# Example 3: Efficient filtering and aggregation with lazy evaluation
hourly_consumption = (
    df.filter(pl.col("dno_alias") == "SSEN")
    .group_by([
        pl.col("data_collection_log_timestamp").dt.hour().alias("hour_of_day"),
        "secondary_substation_unique_id"
    ])
    .agg([
        pl.col("total_consumption_active_import").mean().alias("avg_consumption")
    ])
    .collect()  # Only now is the data actually loaded
)

# Example 4: Time-based window operations
daily_stats = (
    df.sort("data_collection_log_timestamp")  # group_by_dynamic expects a sorted index column
    .group_by_dynamic("data_collection_log_timestamp", every="1d")
    .agg([
        pl.col("total_consumption_active_import").sum().alias("daily_consumption"),
        pl.col("aggregated_device_count_active").mean().alias("avg_active_devices")
    ])
    .collect()
)