polars_partitions

Python

pip install polars_partitions

Description

This library is not a replacement for Polars. Its main goal is to make working with partitions (writing, reading, filtering) easier by maintaining a Table of Contents file (hereinafter "TOC").
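
To make the idea concrete, here is a minimal pure-Python sketch of what a TOC could contain: one entry per distinct combination of partition-column values. The function name `build_toc` and the list-of-dicts representation are assumptions made for illustration; they are not the library's API, and its internal logic may differ. The rows mirror the test dataset used in the examples in this README.

```python
from datetime import date


def build_toc(rows, columns):
    """Return the distinct combinations of partition-column values, in first-seen order.

    Hypothetical helper for illustration only; not part of polars_partitions.
    """
    seen = set()
    toc = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            toc.append(dict(zip(columns, key)))
    return toc


rows = [
    {"col1": date(2024, 1, 1), "col2": "A2", "col3": 1},
    {"col1": date(2024, 1, 1), "col2": "A2", "col3": 2},
    {"col1": date(2024, 1, 2), "col2": "A2", "col3": 3},
    {"col1": date(2024, 1, 2), "col2": "A2", "col3": 4},
    {"col1": date(2024, 1, 2), "col2": "B2", "col3": 5},
    {"col1": date(2024, 1, 3), "col2": "B2", "col3": 6},
    {"col1": date(2024, 1, 3), "col2": "B2", "col3": 7},
    {"col1": date(2024, 1, 3), "col2": "B2", "col3": 8},
]

toc = build_toc(rows, ["col1", "col2"])
print(len(toc))  # 4 distinct (col1, col2) combinations
```

Because the TOC only holds partition keys, it stays small even when the data itself is large, which is what makes filtering on it cheap.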

Write Partition

polars_parquet.write_data(
df: DataFrame,
columns: array | string
)

Parameters

df
Polars DataFrame
columns
Array of columns on which to create partitions

Example 🤔🤔🤔

from polars_partitions import easy as pp
from datetime import date
import polars as pl

# Create a test dataset
df = pl.DataFrame({
    'col1': [date(2024,1,1), date(2024,1,1), date(2024,1,2), date(2024,1,2),
             date(2024,1,2), date(2024,1,3), date(2024,1,3), date(2024,1,3)],
    'col2': ['A2', 'A2', 'A2', 'A2', 'B2', 'B2', 'B2', 'B2'],
    'col3': [1, 2, 3, 4, 5, 6, 7, 8],
})

path = './your_path'

# Columns to partition by
columns = ['col1', 'col2']

ep = pp.EasyPartition(path)

# Write the partitions
ep.write_data(df, columns)

# Output: 
# ./your_path/toc.parquet - done!

Write TOC

polars_parquet.write_toc(
df: DataFrame,
columns: array | string
)

Parameters

df
Polars DataFrame on which the partitions are based
columns
Array of columns to create partitions for

Reading TOC

polars_parquet.get_toc(
filters: dict = None,
between: str = None,
structure: bool = False
)

Parameters

filters
Dictionary in which each key is a column name and each value is an array of values to filter on
between
Used together with filters. Takes the name of the column to which a between (range) filter is applied; the first two values of that column's array in filters are used as the bounds.
structure
When set to True, prints the structure of the dictionary, since partitions are nested within one another.
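
The interaction between filters and between can be sketched in plain Python. This is an assumption-laden illustration, not the library's implementation: the helper name `matches` is hypothetical, ordinary filters are assumed to be membership tests, and the range is assumed to be inclusive on both ends (which is consistent with the Read Partition example output in this README).

```python
from datetime import date


def matches(entry, filters=None, between=None):
    """Check one TOC entry (a dict) against the filters.

    Hypothetical helper for illustration only; not part of polars_partitions.
    """
    for column, values in (filters or {}).items():
        if column == between:
            # Only the first two values are used, as inclusive bounds (assumed).
            low, high = values[0], values[1]
            if not (low <= entry[column] <= high):
                return False
        elif entry[column] not in values:
            # Without `between`, a value must appear in the array (assumed).
            return False
    return True


toc = [
    {"col1": date(2024, 1, 1), "col2": "A2"},
    {"col1": date(2024, 1, 2), "col2": "A2"},
    {"col1": date(2024, 1, 2), "col2": "B2"},
    {"col1": date(2024, 1, 3), "col2": "B2"},
]
filters = {"col1": [date(2024, 1, 1), date(2024, 1, 3)]}

as_range = [e for e in toc if matches(e, filters, between="col1")]
as_list = [e for e in toc if matches(e, filters)]
print(len(as_range), len(as_list))  # 4 2
```

With between set, the two dates act as a range and the 2024-01-02 partitions are included; without it, only the two listed dates match.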

Example 🤔🤔🤔

ep.get_toc(structure=True)

# Output: 
col1
 ↳col2

shape: (4, 2)
┌────────────┬──────┐
│ col1       ┆ col2 │
│ ---        ┆ ---  │
│ date       ┆ str  │
╞════════════╪══════╡
│ 2024-01-02 ┆ A2   │
│ 2024-01-02 ┆ B2   │
│ 2024-01-01 ┆ A2   │
│ 2024-01-03 ┆ B2   │
└────────────┴──────┘
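
The nested outline printed before the table can be reproduced with a few lines of Python. This is only a sketch of the output format shown above; the function name `format_structure` is hypothetical, and the indentation used for a third or deeper level is an assumption, since the example shows only two levels.

```python
def format_structure(columns):
    """Render partition columns as a nested outline, one level per column.

    Hypothetical helper for illustration only; not part of polars_partitions.
    """
    lines = []
    for depth, column in enumerate(columns):
        # The first column is the root; deeper levels are indented and marked.
        prefix = "" if depth == 0 else " " * depth + "↳"
        lines.append(prefix + column)
    return "\n".join(lines)


print(format_structure(["col1", "col2"]))
```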

Read Partition

polars_parquet.get_data(
columns: array | string = "*",
filters: dict = None,
btwn: str = None
) → LazyFrame

Parameters

columns
Array of columns to return
filters
Dictionary in which each key is a column name and each value is an array of values to filter on
btwn
Used together with filters. Takes the name of the column to which a between (range) filter is applied; the first two values of that column's array in filters are used as the bounds.
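
Since columns accepts either an array or a string and defaults to "*", a caller-side view of how such an argument could be normalized is sketched below. The helper `resolve_columns` and its `available` parameter are hypothetical and only illustrate the three accepted shapes; the library may handle this differently.

```python
def resolve_columns(columns, available):
    """Turn a `columns` argument into an explicit list of column names.

    Hypothetical helper for illustration only; not part of polars_partitions.
    """
    if columns == "*":
        # The default selects every available column.
        return list(available)
    if isinstance(columns, str):
        # A single column name is wrapped in a list.
        return [columns]
    return list(columns)


available = ["col1", "col2", "col3"]
print(resolve_columns("*", available))               # ['col1', 'col2', 'col3']
print(resolve_columns("col3", available))            # ['col3']
print(resolve_columns(["col1", "col3"], available))  # ['col1', 'col3']
```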

Example 🤔🤔🤔

filters = {'col1':[date(2024,1,1),date(2024,1,3)]}

with pl.StringCache():
    df = ep.get_data(filters=filters, between='col1', columns=['col1', 'col3']).collect()

df

# Output: 
shape: (8, 2)
┌────────────┬──────┐
│ col1       ┆ col3 │
│ ---        ┆ ---  │
│ str        ┆ i64  │
╞════════════╪══════╡
│ 2024-01-02 ┆ 3    │
│ 2024-01-02 ┆ 4    │
│ 2024-01-02 ┆ 5    │
│ 2024-01-01 ┆ 1    │
│ 2024-01-01 ┆ 2    │
│ 2024-01-03 ┆ 6    │
│ 2024-01-03 ┆ 7    │
│ 2024-01-03 ┆ 8    │
└────────────┴──────┘

dwenlvov/polars_partitions
