Performance: LazyFrames vs DataFrames

In [3]:
import io
import os
import polars as pl

In [5]:
__file__ = "app.ipynb"
csv_file_path = os.path.join(os.path.dirname(__file__), "./vehicletest.csv")
parquet_file_path = os.path.join(os.path.dirname(__file__), "./vehicletest.parquet")

Convert from CSV to Parquet format

In [7]:
if not os.path.exists(parquet_file_path):
    df: pl.DataFrame = pl.read_csv(csv_file_path)
    df.write_parquet(parquet_file_path)

In [None]:
# NOTE:
# pl.scan_parquet reads the file lazily. This lets the query optimizer push predicates
# and projections down to the scan level, potentially reducing memory overhead.

In [11]:
lazy_df: pl.LazyFrame = pl.scan_parquet(parquet_file_path)

In [19]:
# Polars allows you to scan a Parquet input. Scanning delays the actual parsing of the file 
# and instead returns a lazy computation holder called a LazyFrame.

print(lazy_df)

naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)


  PARQUET SCAN ./vehicletest.parquet
  PROJECT */19 COLUMNS


NOTE:
pl.read_parquet reads the file eagerly: all of the data is loaded into memory at once as a DataFrame.

In [20]:
# The data can also be read from the Parquet file straight into memory:
df: pl.DataFrame = pl.read_parquet(parquet_file_path)

NOTE:

A LazyFrame is the Polars way of working with tabular data without loading all of it into memory at once. It holds the information found during the brief scan and reads the rest of the file only when it is needed.

In [None]:
# An example is cleaning the data: the same steps can be applied eagerly
# to the in-memory DataFrame or lazily to the LazyFrame.

# Eager: runs immediately on the data in memory
df = df.fill_nan(None).drop_nulls()

# Lazy: only adds the steps to the query plan
lazy_df = lazy_df.fill_nan(None).drop_nulls()

# To execute the lazy plan, call collect():
# lazy_df.collect()

In [None]:
# The difference becomes noticeable when we run heavier operations on the data, such as:

# Feature columns: all but the last two columns
feature_cols = df.columns[0:-2]

# Eager: mean and standard deviation of each feature, grouped by class,
# computed immediately on the in-memory DataFrame.
# NOTE: group_by is spelled groupby in Polars versions before 0.19.
eager_aggregations = df.group_by("class").agg(
    [pl.col(col).mean().alias(f"{col}_mean") for col in feature_cols]
    + [pl.col(col).std().alias(f"{col}_std") for col in feature_cols]
)

# Lazy: the same aggregations are only recorded in the query plan
lazy_aggregations = lazy_df.group_by("class").agg(
    [pl.col(col).mean().alias(f"{col}_mean") for col in feature_cols]
    + [pl.col(col).std().alias(f"{col}_std") for col in feature_cols]
)

# Materialize the lazy result
aggregations = lazy_aggregations.collect()

# The first approach computes the mean and std on data that is already in memory.
# The second approach describes the same computation, but no data is read
# until collect() executes the plan.
# The lazy approach can help performance on large datasets.
