
feat: add support for multiple partition columns and filters in to_pyarrow_dataset() and OR filters in write_datalake() #1722

Open · wants to merge 13 commits into base: main

Commits on Oct 13, 2023

  1. Add support for multiple partition columns and multiple partitions in to_pyarrow_dataset()

    - Partitions with multiple columns can be passed as lists of tuples in DNF format
    - Multiple partition filters can be passed

    ldacey committed Oct 13, 2023 · 4c0551a
  2. Add test_pyarrow_dataset_partitions pytest

    - Add tests for various filter/partition scenarios which can be passed to to_pyarrow_dataset()

    ldacey committed Oct 13, 2023 · b128bc3

Commits on Oct 14, 2023

  1. 133f246
  2. Add test_overwriting_multiple_partitions pytest

    - Tests partition filters based on AND and OR conditions using single and multiple partition columns

    ldacey committed Oct 14, 2023 · 3b53dc4
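The AND/OR semantics these overwrite tests exercise can be sketched in plain Python — a hypothetical DNF matcher over partition dicts, not deltalake's actual internals (column names and helper functions here are invented for illustration):

```python
import operator

# Supported comparison operators for one (column, op, value) tuple.
OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    "in": lambda value, allowed: value in allowed,
    "not in": lambda value, allowed: value not in allowed,
}

def matches(partition, conjunction):
    """True if a partition (dict of column -> value) satisfies every tuple (AND)."""
    return all(OPS[op](partition.get(col), val) for col, op, val in conjunction)

def select_partitions(partitions, dnf):
    """Select partitions matching a DNF filter: AND within a list, OR across lists."""
    if dnf and isinstance(dnf[0], tuple):  # single conjunction -> wrap it
        dnf = [dnf]
    return [p for p in partitions if any(matches(p, conj) for conj in dnf)]

partitions = [
    {"year": "2022", "month": "01"},
    {"year": "2023", "month": "01"},
    {"year": "2023", "month": "02"},
]
# OR across two single-column conjunctions:
selected = select_partitions(
    partitions, [[("year", "=", "2022")], [("month", "=", "02")]]
)
```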
  3. f42494c

Commits on Oct 17, 2023

  1. Add validate_filters and stringify_partition_values to _util.py

    - validate_filters ensures partitions and filters are in DNF format (a list of tuples, or a list of lists of tuples) and checks for empty lists
    - stringify_partition_values ensures values are converted from dates, ints, etc. to strings for partition columns

    ldacey committed Oct 17, 2023 · 4d390d5
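Based only on the commit description, the two helpers might look roughly like the following — hypothetical sketches, not the actual `_util.py` implementations:

```python
from datetime import date

def validate_filters(filters):
    """Normalize DNF filters to a list of lists of tuples; reject empty input."""
    if not filters:
        raise ValueError("filters cannot be empty")
    if all(isinstance(f, tuple) for f in filters):
        return [list(filters)]  # single conjunction -> wrap it
    if all(
        isinstance(f, list) and f and all(isinstance(t, tuple) for t in f)
        for f in filters
    ):
        return filters
    raise ValueError("filters must be a list of tuples or a list of lists of tuples")

def stringify_partition_values(conjunction):
    """Convert partition values (dates, ints, ...) to their string form."""
    stringified = []
    for col, op, val in conjunction:
        if isinstance(val, (list, tuple, set)):  # e.g. values for "in" operators
            val = [str(v) for v in val]
        else:
            val = str(val)
        stringified.append((col, op, val))
    return stringified
```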
  2. Refactor dataset expressions and fragment building in DeltaTable

    - Use pyarrow.parquet filters_to_expression instead of the custom implementation
    - Move __stringify_partition_values to _util to make it easier to test
    - Move partition validation to the validate_filters function
    - Move fragment building to a separate method

    ldacey committed Oct 17, 2023 · 99e2041
  3. 87e397f
  4. 32224d4
  5. 1b52925
  6. Update types and add validated_filters variable

    - validated_filters is guaranteed to be a list of lists of tuples

    ldacey committed Oct 17, 2023 · e87a1ec
  7. 87ce8e1
  8. Add test for single filter

    - Shows that the output will still be a list of lists of tuples

    ldacey committed Oct 17, 2023 · 0b8b6bb