dwenlvov/polars_partitions

polars_partitions

pip install polars_partitions

Description

This library is not a replacement for Polars. Its goal is to make working with partitions (writing, reading, and filtering) easier by maintaining a Table Of Contents file (hereinafter referred to as "TOC").

Write Partition

polars_parquet.write_data(
          df: DataFrame,
          columns: array | string
)

Parameters

df
          Polars DataFrame
columns
          Column name, or array of column names, to partition by

Example 🤔🤔🤔
from polars_partitions import easy as pp
from datetime import date
import polars as pl

# Create a test dataset
df = pl.DataFrame({
    'col1': [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 2),
             date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 3), date(2024, 1, 3)],
    'col2': ['A2', 'A2', 'A2', 'A2', 'B2', 'B2', 'B2', 'B2'],
    'col3': [1, 2, 3, 4, 5, 6, 7, 8],
})

path = './your_path'

# Which columns are partitioned by
columns = ['col1', 'col2'] 

ep = pp.EasyPartition(path)

# Write the partitions
ep.write_data(df, columns)

# Output: 
# ./your_path/toc.parquet - done! 

Write TOC

polars_parquet.write_toc(
          df: DataFrame on which the partitions are based,
          columns: array | string
)

Parameters

df
          Polars DataFrame on which the partitions are based
columns
          Column name, or array of column names, that the partitions are built on

Reading TOC

polars_parquet.get_toc(
          filters: dict = None,
          between: str = None,
          structure: bool = False
)

Parameters

filters
          Dictionary in which each key is a column name and each value is an array of values to filter on
between
          Works in conjunction with filters. Takes the name of the column to apply a between filter to; the first two values of that column's array in filters are used as the bounds.
structure
          When True, prints the structure of the dictionary, since the partitions are nested inside one another.

Example 🤔🤔🤔
ep.get_toc(structure=True)

# Output: 
col1col2

shape: (4, 2)
┌────────────┬──────┐
│ col1       ┆ col2 │
│ ---        ┆ ---  │
│ date       ┆ str  │
╞════════════╪══════╡
│ 2024-01-02 ┆ A2   │
│ 2024-01-02 ┆ B2   │
│ 2024-01-01 ┆ A2   │
│ 2024-01-03 ┆ B2   │
└────────────┴──────┘
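
The TOC printed above has one row per distinct (col1, col2) combination in the test dataset. Here is a minimal plain-Python sketch of that idea. It assumes the TOC is essentially the set of distinct partition-key combinations; the helper name `distinct_partitions` is hypothetical and not part of polars_partitions, and the real TOC file may store more than this.

```python
from datetime import date

# Same test data as in the write example, as plain Python lists.
data = {
    'col1': [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 2),
             date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 3), date(2024, 1, 3)],
    'col2': ['A2', 'A2', 'A2', 'A2', 'B2', 'B2', 'B2', 'B2'],
    'col3': [1, 2, 3, 4, 5, 6, 7, 8],
}

def distinct_partitions(data, columns):
    """Hypothetical helper: sorted distinct combinations of the partition columns."""
    rows = zip(*(data[c] for c in columns))  # one tuple of key values per row
    return sorted(set(rows))                 # deduplicate, then sort for stable order

toc = distinct_partitions(data, ['col1', 'col2'])
for key in toc:
    print(key)
```

Eight data rows collapse to four key combinations, matching the `shape: (4, 2)` reported by `get_toc`.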

Read Partition

polars_parquet.get_data(
          columns: array | string = "*",
          filters: dict = None,
          btwn: str = None
) → LazyFrame

Parameters

columns
          Array of columns to return
filters
          Dictionary in which each key is a column name and each value is an array of values to filter on
btwn
          Works in conjunction with filters. Takes the name of the column to apply a between filter to; the first two values of that column's array in filters are used as the bounds.

Example 🤔🤔🤔
filters = {'col1':[date(2024,1,1),date(2024,1,3)]}

with pl.StringCache():
    df = ep.get_data(filters=filters, between='col1', columns=['col1', 'col3']).collect()

df

# Output: 
shape: (8, 2)
┌────────────┬──────┐
│ col1       ┆ col3 │
│ ---        ┆ ---  │
│ str        ┆ i64  │
╞════════════╪══════╡
│ 2024-01-02 ┆ 3    │
│ 2024-01-02 ┆ 4    │
│ 2024-01-02 ┆ 5    │
│ 2024-01-01 ┆ 1    │
│ 2024-01-01 ┆ 2    │
│ 2024-01-03 ┆ 6    │
│ 2024-01-03 ┆ 7    │
│ 2024-01-03 ┆ 8    │
└────────────┴──────┘
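
The output above includes the rows for 2024-01-02, a date that is not listed in `filters`. This is consistent with the between filter treating the first two values as inclusive bounds. The plain-Python sketch below illustrates that behaviour. The helper `select_keys` is hypothetical and is not the library's code, and the exact-match behaviour when no between column is given is an assumption.

```python
from datetime import date

def select_keys(keys, values, between=False):
    """Hypothetical helper illustrating the two filter modes.

    between=False: keep keys that appear in `values` (assumed behaviour).
    between=True:  use the first two entries of `values` as inclusive bounds.
    """
    if between:
        lo, hi = values[0], values[1]
        return [k for k in keys if lo <= k <= hi]
    return [k for k in keys if k in values]

keys = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)]
values = [date(2024, 1, 1), date(2024, 1, 3)]

in_range = select_keys(keys, values, between=True)  # range match on the bounds
exact = select_keys(keys, values)                   # membership match only

print(in_range)
print(exact)
```

With the bounds 2024-01-01 and 2024-01-03, the range match keeps all three dates, while the membership match drops 2024-01-02.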
