pip install polars_partitions
This library is not a replacement for Polars. Its main goal is to make working with partitions (writing, reading, filtering) easier by maintaining a Table Of Contents file (hereinafter referred to as "TOC").
polars_parquet.write_data(
df: DataFrame,
columns: array | string
)
df
Polars DataFrame
columns
Array of columns on which to create partitions
Example 🤔🤔🤔
from polars_partitions import easy as pp
from datetime import date
import polars as pl
# Create a test dataset
df = pl.DataFrame({
    'col1': [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 2),
             date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 3), date(2024, 1, 3)],
    'col2': ['A2', 'A2', 'A2', 'A2', 'B2', 'B2', 'B2', 'B2'],
    'col3': [1, 2, 3, 4, 5, 6, 7, 8],
})
path = './your_path'
# Which columns are partitioned by
columns = ['col1', 'col2']
ep = pp.EasyPartition(path)
# Write the partitions
ep.write_data(df, columns)
# Output:
# ./your_path/toc.parquet - done!
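To make the idea of a TOC more concrete, here is a minimal pure-Python sketch of what such an index could contain for the dataset above: one entry per distinct combination of the partition columns. This is not the library's implementation. The helpers `build_toc` and `partition_path` are hypothetical, and the Hive-style `column=value` directory layout is an assumption made only for illustration.

```python
from datetime import date

# Same rows as the example dataset above, as plain tuples (col1, col2, col3).
rows = [
    (date(2024, 1, 1), "A2", 1),
    (date(2024, 1, 1), "A2", 2),
    (date(2024, 1, 2), "A2", 3),
    (date(2024, 1, 2), "A2", 4),
    (date(2024, 1, 2), "B2", 5),
    (date(2024, 1, 3), "B2", 6),
    (date(2024, 1, 3), "B2", 7),
    (date(2024, 1, 3), "B2", 8),
]
partition_columns = ["col1", "col2"]


def build_toc(rows, n_keys):
    """Return the sorted distinct partition-key combinations (hypothetical helper)."""
    return sorted({row[:n_keys] for row in rows})


def partition_path(root, columns, key):
    """Build a Hive-style directory for one key (the layout is an assumption)."""
    parts = [f"{column}={value}" for column, value in zip(columns, key)]
    return "/".join([root, *parts])


toc = build_toc(rows, len(partition_columns))
for key in toc:
    print(partition_path("./your_path", partition_columns, key))
```

The eight rows collapse to four distinct (col1, col2) pairs, so a reader only has to consult four index entries to decide which partitions to open.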
polars_parquet.write_toc(
df: DataFrame on which the partitions are based,
columns: array | string
)
df
Polars DataFrame on which the partitions are based
columns
Array of columns to create partitions for
polars_parquet.get_toc(
filters: dict = None,
between: str = None,
structure: bool = False
)
filters
Dictionary, where the key is the column and the array is the values
between
Works in conjunction with filters. Takes the name of the column to which a between (range) filter is applied; the first two values of that column's array in filters are used as the bounds.
structure
When set to True, prints the structure of the dictionary, since the partitions are nested inside one another.
Example 🤔🤔🤔
ep.get_toc(structure=True)
# Output:
col1
↳col2
shape: (4, 2)
┌────────────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ date ┆ str │
╞════════════╪══════╡
│ 2024-01-02 ┆ A2 │
│ 2024-01-02 ┆ B2 │
│ 2024-01-01 ┆ A2 │
│ 2024-01-03 ┆ B2 │
└────────────┴──────┘
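The interaction between `filters` and `between` can be sketched in plain Python. This is a hypothetical stand-in, not the library's code: `filter_toc` is an illustrative name, the TOC is modelled as a list of dicts, and treating both bounds as inclusive is an assumption.

```python
from datetime import date


def filter_toc(toc, filters=None, between=None):
    """Keep TOC rows matching `filters` (hypothetical stand-in, not library code).

    toc     -- list of dicts, one per partition
    filters -- {column: [values]}
    between -- column whose first two filter values are used as range bounds
    """
    filters = filters or {}
    kept = []
    for row in toc:
        ok = True
        for column, values in filters.items():
            if column == between:
                low, high = values[0], values[1]  # only the first two values are used
                ok = low <= row[column] <= high   # inclusive bounds are an assumption
            else:
                ok = row[column] in values        # plain membership test
            if not ok:
                break
        if ok:
            kept.append(row)
    return kept


toc = [
    {"col1": date(2024, 1, 2), "col2": "A2"},
    {"col1": date(2024, 1, 2), "col2": "B2"},
    {"col1": date(2024, 1, 1), "col2": "A2"},
    {"col1": date(2024, 1, 3), "col2": "B2"},
]

# Range on col1 combined with a membership filter on col2.
filters = {"col1": [date(2024, 1, 1), date(2024, 1, 2)], "col2": ["A2"]}
ranged = filter_toc(toc, filters, between="col1")

# Without `between`, the same kind of list is treated as a set of exact values.
exact = filter_toc(toc, {"col1": [date(2024, 1, 1), date(2024, 1, 3)]})
```

With `between="col1"` the two dates act as a range, while without it they are matched exactly, so 2024-01-02 is excluded from `exact`.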
polars_parquet.get_data(
columns: array | string = "*",
filters: dict = None,
btwn: str = None
) → LazyFrame
columns
Array of columns to return
filters
Dictionary where the key is the column and the array is the values
btwn
Works in conjunction with filters. Takes the name of the column to which a between (range) filter is applied; the first two values of that column's array in filters are used as the bounds.
Example 🤔🤔🤔
filters = {'col1':[date(2024,1,1),date(2024,1,3)]}
with pl.StringCache():
    df = ep.get_data(filters=filters, between='col1', columns=['col1', 'col3']).collect()
df
# Output:
shape: (8, 2)
┌────────────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════════╪══════╡
│ 2024-01-02 ┆ 3 │
│ 2024-01-02 ┆ 4 │
│ 2024-01-02 ┆ 5 │
│ 2024-01-01 ┆ 1 │
│ 2024-01-01 ┆ 2 │
│ 2024-01-03 ┆ 6 │
│ 2024-01-03 ┆ 7 │
│ 2024-01-03 ┆ 8 │
└────────────┴──────┘