[WIP] Read deltatable #8123

Closed
wants to merge 14 commits

Conversation

rajagurunath
Contributor

Support for different filesystem backends like S3, GCP, Azure, etc. in the library we discussed earlier, delta-rs, is still a WIP, so I thought of writing the delta reading logic from scratch. Currently this PR has no dependency other than pyarrow; I referred to another delta reading library, DeltaLakeReader, for developing the utilities related to schema evolution.

This PR is still a work in progress; I need some help/suggestions on the following points.

  1. Initially I started with the plan of supporting both the pyarrow and fastparquet engines, similar to read_parquet. I tried a good number of iterations but was not able to figure out how to implement features like schema evolution in fastparquet. In pyarrow this is achieved using pyarrow.dataset, which assigns NaN to columns that are not present in a given parquet file, since it has access to the pyarrow.Schema and therefore knows each column's datatype - implemented here (see the sketch just after this list).

    So shall I remove the engine parameter altogether and support only the pyarrow engine, or is there another way/workaround in fastparquet that supports schema evolution (i.e. reading a list of parquet files with different schemas into the same DataFrame)?

  2. I tried achieving parallel reading of the parquet files in a few different ways and settled on one of the Dask approaches using delayed. Is there a better/optimized/recommended way to do the same?

  3. Currently the test datasets are stored in one of my repositories; can we move them to a more stable, publicly available storage location? Alternatively, we could create the delta table dynamically with PySpark inside the fixture, but that would add a PySpark dependency :)

  4. I have added some test cases, but need some help/suggestions to improve them or remove redundant ones. Requesting @MrPowers' help here.

  5. I have tested against a private S3 bucket; do we need to add test cases that read from and write to AWS as well?
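
For point 1, here is a minimal sketch of the pyarrow.dataset behaviour described above (the file names and schema are made up): reading files with differing schemas against one explicit schema fills the missing columns with nulls.

```python
import pyarrow as pa
import pyarrow.dataset as pa_ds

# explicit schema for the newest table version (file names here are made up)
schema = pa.schema(
    [("id", pa.int64()), ("name", pa.string()), ("added_later", pa.float64())]
)

# older files that lack "added_later" still load; the missing column comes
# back as null/NaN because the dataset knows the full schema and dtypes
dataset = pa_ds.dataset(
    ["part-0.parquet", "part-1.parquet"], schema=schema, format="parquet"
)
df = dataset.to_table().to_pandas()
```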

@GPUtester
Collaborator

Can one of the admins verify this patch?

@jsignell
Member

jsignell commented Sep 7, 2021

add to allowlist

@mrocklin
Member

mrocklin commented Sep 7, 2021

cc @dask/io and @fjetter

@martindurant
Member

is there any other way/workaround in fast parquet which supports schema Evolution

This doesn't currently exist, and there are no plans as such. I don't suppose it would take too much effort...

@martindurant
Member

can try to create this delta table using Pyspark dynamically inside the fixture, but we need to add Pyspark dependency

No, don't do that! :). Fastparquet has spark as a test dep, and it causes pretty weird behaviour at times, not to mention slowing everything down a lot.

@martindurant
Member

do we need to add test cases to read and write from AWS also?

You can use moto to host files locally in an S3-compatible service; there are already tests in dask that do this (or look at how it's done in s3fs). However, if it works with local files via fsspec, it really should just work for any storage.
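
A hypothetical sketch of the moto approach (port, bucket name, and credentials are made up; assumes moto is installed with its server extra so the moto_server command is available):

```python
import subprocess
import time

import boto3
import dask.dataframe as dd

# start a local, S3-compatible moto server
proc = subprocess.Popen(["moto_server", "s3", "-p", "5555"])
time.sleep(2)  # crude wait for the server to come up

endpoint = "http://127.0.0.1:5555"
s3 = boto3.client(
    "s3",
    endpoint_url=endpoint,
    aws_access_key_id="testing",
    aws_secret_access_key="testing",
)
s3.create_bucket(Bucket="test-bucket")
# ... upload fixture files here ...

# point fsspec/s3fs at the fake endpoint via storage_options
storage_options = {
    "key": "testing",
    "secret": "testing",
    "client_kwargs": {"endpoint_url": endpoint},
}
ddf = dd.read_parquet("s3://test-bucket/table/*.parquet", storage_options=storage_options)

proc.terminate()
```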

Member

@martindurant left a comment

The PR looks OK at a first run-through. I have some comments, but not much.

Coding this up using delayed is OK as a first start, but delayed does not benefit from the optimisations possible for dataframes based on high-level-graphs. For example, in the long run we would like to be able to use selections on the index to pick files to load, or that column selection on the output dataframe should result in only those columns being loaded (pushdown).

Actually, is there a concept of an index here?

Probably, given we have an explicit schema, the call to from_delayed should specify a meta=, to prevent having to load the first data file eagerly.
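
A minimal sketch of what specifying meta= might look like (schema, pq_files, and read_one_file are placeholders; assumes the pyarrow.Schema has already been built from the delta log):

```python
import dask.dataframe as dd
from dask import delayed

# `schema` is the pyarrow.Schema from the delta log, `pq_files` the active
# parquet files, and `read_one_file` the per-file loader (all hypothetical)
meta = schema.empty_table().to_pandas()  # empty frame with the right dtypes

parts = [delayed(read_one_file)(f, schema=schema) for f in pq_files]
ddf = dd.from_delayed(parts, meta=meta)  # avoids eagerly loading the first file
```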

from ..parquet.core import get_engine, read_parquet
from .utils import schema_from_string

__all__ = "read_delta_table"
Member

__all__ = ["read_delta_table"]

Contributor Author

Got it, changed.

self.version = version
self.pq_files = set()
self.delta_log_path = f"{self.path}/_delta_log"
self.fs, _, _ = get_fs_token_paths(path, storage_options=storage_options)
Member

You probably do want this token, to name your tasks below.

Contributor Author

Got it, now added the name parameter in the delayed fn.
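
A rough sketch of what that could look like, using the fsspec token for deterministic task names (read_file, pq_files, path, and storage_options are placeholders):

```python
from dask import delayed
from dask.base import tokenize
from fsspec.core import get_fs_token_paths

fs, fs_token, _ = get_fs_token_paths(path, storage_options=storage_options)

parts = [
    # dask_key_name gives each task a stable, descriptive key in the graph
    delayed(read_file)(f, dask_key_name=f"read-delta-{tokenize(fs_token, f)}")
    for f in pq_files
]
```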

if _last_checkpoint file exists, returns checkpoint_id else zero
"""
try:
with self.fs.open(
Member

May be faster with json.loads(self.fs.cat(fn))

Contributor Author

Thanks for the suggestion; now changed it to fs.cat.
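
Roughly what that looks like (paths are placeholders; assumes the standard _last_checkpoint layout with a "version" field):

```python
import json

import fsspec

fs = fsspec.filesystem("file")  # or whichever backend the table lives on
delta_log_path = "/path/to/table/_delta_log"  # placeholder

# _last_checkpoint is tiny, so one fs.cat call fetches it in a single request
last_checkpoint = json.loads(fs.cat(f"{delta_log_path}/_last_checkpoint"))
checkpoint_id = last_checkpoint["version"]
```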

f"File {checkpoint_path} not found"
)

parquet_checkpoint = read_parquet(
Member

This is a single small file, right? Probably better, then, to load directly with the engine rather than via dask, which adds overhead.

Contributor Author

Now reading with the help of the engine, using engine.read_partition. I needed to pass the columns of checkpoint.parquet, so I added those columns in utils.py.

checkpoint_path, engine=self.engine, storage_options=self.storage_options
).compute()

# reason for this `if condition` was that FastParquetEngine seems to normalize the json present in each column
Member

You mean it's a structure-type column (not JSON).

Contributor Author

@rajagurunath, Sep 11, 2021

Yeah, that's right. fastparquet was flattening the nested column: let's say col1 holds {"a": 1, "b": 2}; pyarrow reads it as a single column, but fastparquet reads it as separate columns like col1.a and col1.b.

So we get different dataframes from the two engines, which is handled with an if condition based on the engine class.

)
for log_file_name, log_version in zip(log_files, log_versions):
with self.fs.open(log_file_name) as log:
for line in log:
Member

Again, fs.cat might well be faster than opening the file and relying on readahead buffering.

Contributor Author

Got it, thanks for the suggestion once again; changed from fs.open to fs.cat.
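
For context, a small sketch of reading a JSON log file via fs.cat (paths are placeholders; delta log files are newline-delimited JSON actions):

```python
import json

import fsspec

fs = fsspec.filesystem("file")  # placeholder filesystem
log_file_name = "/path/to/table/_delta_log/00000000000000000000.json"  # placeholder

# one request for the whole file instead of buffered line-by-line reads
raw = fs.cat(log_file_name)
actions = [json.loads(line) for line in raw.splitlines() if line.strip()]
```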

return (
pa_ds.dataset(
source=f,
schema=schema,
Member

Is providing a schema here the critical thing, to enable evolution?

Contributor Author

I think yes; different versions of delta files may have different schemas, and we control that via the schema parameter of the pyarrow dataset. The user can also explicitly pass a pyarrow.Schema. Does anything need to be changed here?

delta protocol stores the schema string in the json log files, which is converted
into pyarrow.Schema and used for schema evolution (refer to delta/utils.py),
i.e. based on the particular version, some columns may or may not be present.
filter: pyarrow.dataset.Expression
Member

This is not the kind of filter that is used for the normal parquet read functions

Contributor Author

Got it, thanks for letting me know about this; the code now accepts filters like [("x", "==", 1)], similar to read_parquet.
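
A sketch of one way such filters could be translated into a pyarrow.dataset expression (conjunctive lists only; the helper name is made up and is not necessarily how the PR does it):

```python
import operator

import pyarrow.dataset as pa_ds

_OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
        "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def filters_to_expression(filters):
    """Convert e.g. [("x", "==", 1)] into an AND-ed pyarrow.dataset expression."""
    expr = None
    for col, op, val in filters:
        piece = _OPS[op](pa_ds.field(col), val)
        expr = piece if expr is None else expr & piece
    return expr

# usage: dataset.to_table(filter=filters_to_expression([("x", "==", 1)]))
```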

import json
from typing import Union

import pyarrow as pa
Member

I see that everything here is pyarrow specific

Contributor Author

Yeah, I am also a bit worried here, because currently reading delta files is only achieved through the pyarrow dataset API; I was not able to achieve the same using fastparquet. That's why I am wondering whether we can fix the engine as pyarrow-only for reading delta files for now.


# nested field needs special handling
else:
if input_type["type"] == "array":
Member

Do we expect pyarrow to handle the full set of nested list/map/struct types?

Contributor Author

I also initially handled only primitive data types, but after referring to the DeltaLakeReader package code I added the nested fields as well. What do you think, shall I remove those nested types?
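
For reference, a rough sketch of the kind of mapping under discussion, based on the Delta schema-string JSON layout (not the PR's actual implementation; decimal and some other types are omitted):

```python
import pyarrow as pa

_PRIMITIVES = {
    "string": pa.string(), "long": pa.int64(), "integer": pa.int32(),
    "short": pa.int16(), "byte": pa.int8(), "double": pa.float64(),
    "float": pa.float32(), "boolean": pa.bool_(), "binary": pa.binary(),
    "date": pa.date32(), "timestamp": pa.timestamp("us"),
}


def delta_type_to_pyarrow(t):
    """Map a Delta schema-string type (str or dict) to a pyarrow DataType."""
    if isinstance(t, str):
        return _PRIMITIVES[t]
    if t["type"] == "array":
        return pa.list_(delta_type_to_pyarrow(t["elementType"]))
    if t["type"] == "map":
        return pa.map_(delta_type_to_pyarrow(t["keyType"]),
                       delta_type_to_pyarrow(t["valueType"]))
    if t["type"] == "struct":
        return pa.struct([(f["name"], delta_type_to_pyarrow(f["type"]))
                          for f in t["fields"]])
    raise NotImplementedError(t["type"])
```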

@rajagurunath
Contributor Author

Hi @martindurant,

Thanks for reviewing the code and for the detailed feedback; it was really helpful and I learned a lot.

I have the following doubts though:

  1. We have pinned pyarrow==1.0 in environment-3.7.yaml; is there any reason behind this? Because of it, the continuous_integration test for the 3.7 environment was failing with the error: E pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported: key_value: list<key_value: struct<key: string not null, value: string> not null> not null
  2. Since fastparquet does not support schema evolution, reading different versions of delta tables with different columns does not seem feasible, so shall I keep only the pyarrow engine for now?
  3. Can I keep the test delta dataset in one of my GitHub repositories?

Member

@jrbourbeau left a comment

Thank you for your work here @rajagurunath and for reviewing @martindurant!

This PR is proposing we add read_delta_table directly to Dask. I wanted to check in to see if this is intentional, or if this PR is to just work through the read_delta_table implementation before moving read_delta_table to a separate dask-deltatable project?

While delta table support is certainly interesting, it's not something I've seen commonly requested. Given this, the fact that pandas doesn't have a read_delta_table function, and the added maintenance burden associated with adding support for a new data format, I'd prefer not to add read_delta_table directly to Dask right now, but instead include this in a separate dask-deltatable package (which could live in https://github.com/dask-contrib) so we can get a better understanding of how much demand there is for delta table support, what additional features delta table users will want, and what the corresponding maintenance burden might be. For reference, this was discussed previously in #6001 / #8046 (comment)

Also cc @fjetter who may have more thoughts on delta table specifically

@rajagurunath
Contributor Author

Hi @jrbourbeau,

Thanks for looking into the PR and the suggestion.

Yes, initially we started with the intention of merging directly into the Dask repo. I definitely see your concern here, and it makes perfect sense to me. 💯

So, as suggested above, I have created a new repo and package, dask_deltatable, here: https://github.com/rajagurunath/dask_deltatable. It is at an initial (alpha) stage; can someone please review it and give suggestions so that it can be moved to dask-contrib?

@MrPowers
Contributor

@rajagurunath - this looks like great work. Thanks for the contribution.

I just created a dask-interop repo for some integration tests for how Dask works with other systems. Right now the repo only has some simple tests showing how PySpark written Parquet files are readable by Dask, but I plan on adding a lot more.

I'm thinking about adding your lib as a dependency and adding some Delta Lake unit tests. The tests you've written that rely on zip files are great, but I'd like to augment them with tests that actually create Delta Lakes, perform a bunch of operations, and then make assertions. This'll protect us when Delta Lake changes (because those zip files are static).

Let me know your thoughts! Thanks again for this great work!

@rajagurunath
Contributor Author

First of all, thanks a lot @MrPowers for having a look at the dask-deltatable lib; really glad.

I'm thinking about adding your lib as a dependency and adding some Delta Lake unit tests. The tests you've written that rely on zip files are great, but I'd like to augment them with tests that actually create Delta Lakes, perform a bunch of operations, and then make assertions. This'll protect us when Delta Lake changes (because those zip files are static).

That completely makes sense. Yes, I prepared those zip datasets with the io.delta:delta-core_2.12:0.7.0 jar, and I think some changes have since been added to the Delta protocol that add more fields to the delta JSON logs.

I faced some issues while setting up the delta lake environment in CI in the past, which is why I went with the zipped datasets. Now I feel we can make use of the delta-spark library, which makes setting up Spark with the delta lake dependency much easier (see the sketch at the end of this comment).

Please let me know if anything needs to be done from my side.
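
A rough sketch of what a delta-spark based test fixture could look like (paths and table contents are made up; the configuration follows the delta-spark quickstart):

```python
import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("delta-test-fixture")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# write a small table, then append a second version with an extra column
spark.range(0, 5).write.format("delta").save("/tmp/test-delta-table")
(
    spark.range(5, 10)
    .withColumn("extra", pyspark.sql.functions.lit("x"))
    .write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/tmp/test-delta-table")
)
```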

@MrPowers
Contributor

Updated status of this work:

@martindurant
Member

So we should close this, I think. Ping us on the other repo if you want reviews.

@MrPowers
Contributor

@martindurant - Yea, agree we can close this.

@rajagurunath - I tried delta-rs and that'll probably be the easiest way to add Dask support for Delta Lake. Here's how we can get all the files that need to be read into the Dask DataFrame.

from deltalake import DeltaTable

dt = DeltaTable("tmp/some-delta-pyspark")
dt.files() # list of files

This to_pyarrow_dataset function may be useful. There is also a to_pyarrow_table function. Anyone know if there's an easy way to convert a PyArrow Table to a Dask DataFrame?
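
A sketch of how that could be wired up (assuming dt.files() returns paths relative to the table root; the npartitions value is arbitrary):

```python
import os

import dask.dataframe as dd
from deltalake import DeltaTable

path = "tmp/some-delta-pyspark"
dt = DeltaTable(path)

# join the relative file paths back onto the table root and hand them to
# dask's existing parquet reader
files = [os.path.join(path, f) for f in dt.files()]
ddf = dd.read_parquet(files)

# a pyarrow Table can also be converted to a Dask DataFrame via pandas:
# ddf = dd.from_pandas(dt.to_pyarrow_table().to_pandas(), npartitions=4)
```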

@jrbourbeau
Member

Closing. Thanks for all your work @rajagurunath! Looking forward to watching dask-deltatable's progress.
