How to access ADLS folder (delta table storage) from DeltaTable method in deltalake package #392
I think the code below should work, but I am still confirming.
The instance of DeltaTable can be created with an 'abfs' file URI; however, the following error occurs when creating a pyarrow table via the to_pyarrow_table method:
I walked through the source code in the stack trace and I believe the 'abfs' protocol is the issue. The file system URI protocols supported by pyarrow are s3, hdfs, viewfs, and local files, as well as a few others. The 'abfs' protocol does not appear to be supported, and when the code separates the 'file system' from the file path, the error above is thrown. Are there future plans for 'abfs' support?
It may be that you will need to recreate the to_pyarrow_dataset method to use a file system object. Dask created one for Azure Data Lake. I think the core issue is that the container name is needed in the path for the filesystem.

```python
from typing import Optional, List, Tuple, Any

import os
from urllib.parse import urlparse

import adlfs
import pyarrow
from deltalake import DeltaTable
from pyarrow.dataset import dataset, partitioning


def to_pyarrow_dataset2(
    dt: DeltaTable,
    fs,
    container_name,
    partitions: Optional[List[Tuple[str, str, Any]]] = None,
) -> pyarrow.dataset.Dataset:
    """
    Build a PyArrow Dataset using data from the DeltaTable.

    :param partitions: A list of partition filters, see help(DeltaTable.files_by_partitions) for filter syntax
    :return: the PyArrow dataset in PyArrow
    """
    if partitions is None:
        file_paths = dt.file_uris()
    else:
        file_paths = dt.files_by_partitions(partitions)
    paths = [urlparse(curr_file) for curr_file in file_paths]

    empty_delta_table = len(paths) == 0
    if empty_delta_table:
        return dataset(
            [],
            schema=dt.pyarrow_schema(),
            partitioning=partitioning(flavor="hive"),
        )

    # Decide based on the first file whether the data is on cloud storage or local
    if paths[0].netloc:
        query_str = ""
        # pyarrow doesn't properly support the AWS_ENDPOINT_URL environment variable
        # for non-AWS S3-like resources. This is a slight hack until such a
        # point when pyarrow learns about AWS_ENDPOINT_URL
        endpoint_url = os.environ.get("AWS_ENDPOINT_URL")
        if endpoint_url is not None:
            endpoint = urlparse(endpoint_url)
            # This format is specific to the URL schema inference done inside
            # of pyarrow; consult their tests/dataset.py for examples
            query_str += (
                f"?scheme={endpoint.scheme}&endpoint_override={endpoint.netloc}"
            )
        # adlfs expects object keys to be prefixed with the container name
        keys = [container_name + curr_file.path for curr_file in paths]
        return dataset(
            keys,
            schema=dt.pyarrow_schema(),
            filesystem=fs,
            partitioning=partitioning(flavor="hive"),
        )
    else:
        return dataset(
            file_paths,
            schema=dt.pyarrow_schema(),
            format="parquet",
            partitioning=partitioning(flavor="hive"),
        )


storage_options = {
    'account_name': '',
    'account_key': ''
}
fs = adlfs.AzureBlobFileSystem(**storage_options)
# dt is an existing DeltaTable instance
df = to_pyarrow_dataset2(dt, fs, 'container_name').to_table().to_pandas()
```
If I have some extra time in the next week, I can perhaps put together a PR that handles this use case. I think it would make it easier to support other filesystems as well. Or, if the FS object were passed to the main delta table initialization, it could be passed along as needed into any of the methods. Thoughts? Edit: we would need to handle the generic fsspec object specification, and then anything that adheres to it could be passed in.
The code needs to account for the container name. When fs.open() is called and the path of one of the files is passed in by pyarrow.dataset, it fails because, for adlfs, you need to prepend the path with the container name even though it is specified in the URI. It's a bit odd, but the code above performs that prepend.
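To illustrate the prepend with a minimal sketch (the container and account names here are hypothetical, purely for illustration):

```python
from urllib.parse import urlparse

# Hypothetical container/account names for illustration
container_name = "mycontainer"
file_uris = [
    "abfs://mycontainer@myaccount.dfs.core.windows.net/table/part-0000.parquet",
]

# urlparse keeps only the in-container path in .path; adlfs wants
# "container/path" style keys, so the container name must be prepended
keys = [container_name + urlparse(uri).path for uri in file_uris]
print(keys[0])  # mycontainer/table/part-0000.parquet
```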
Thanks for the sample code. I implemented it and it did indeed create the data frame.
@mattc-eostar Are you looking for our opinion or feedback?
@thedern Glad to hear it helped. Definitely hacky for now, but it works! @prasadvaze Not sure if and when I will have the time to submit a PR. I think it could be tricky/confusing to implement, since the package requires environment variables to be set and this hack requires the use of a filesystem object. I think it would be best for the package to support filesystem objects in general, and then this falls right into place. The container name issue also seems unique to Azure, or at the very least to that specific fsspec implementation, adlfs.
This is the key difference. You need to prepend the container name and pass in an FS object from adlfs rather than a URI:

```python
keys = [container_name + curr_file.path for curr_file in paths]
return dataset(
    keys,
    schema=dt.pyarrow_schema(),
    filesystem=fs,
    partitioning=partitioning(flavor="hive"),
)
```

It would be piggybacking on how pyarrow accomplishes working with FS objects: https://arrow.apache.org/docs/python/filesystems.html

This library could implement and maintain its own versions of fsspec objects for AWS, GCP, and Azure to avoid inconsistencies in behavior. For example, with https://github.com/dask/adlfs, when calling fs.open() you need to prepend the file path with the container name, instead of passing in a container name at instantiation and using it for subsequent calls. If deltalake had its own internal implementation of these filesystems, that could be controlled.

That being said, I think the issue is that when the DeltaTable object is created and you get the underlying file paths, those paths do not include the container name when pulled from Azure. So either that needs to change, or adlfs needs to change, to make this work without the hack I put together. Hopefully my ramblings make sense!
I can confirm that the file paths returned do not have the container name associated with them; just what amounts to a 'relative' file path is passed back. Prepending the container name solved the issue. Thanks again.
@thedern When you say solved, does that mean a PR is planned? Also, does that mean a filesystem object is not necessary as long as the container name is given? That would make the change much simpler.
@mattc-eostar, I have not planned a PR at this time. The workaround you provided, where the container name is prepended to the file path, worked for us in our testing. We did not address the filesystem object, just tried your container name insert.
@fvaleye and @houqp Thanks so much for updating the original method.
Using deltalake.DeltaTable (Rust impl) on Ubuntu, we used blobfuse instead of adlfs to connect and mount ADLS, so we bypassed the to_pyarrow_dataset() method when creating a dataframe from the delta table. On Windows we have to use adlfs and then try the new to_pyarrow_dataset() method updated by @fvaleye. We will try that soon.
Hi @mattc-eostar, we are getting the error 'str' object has no attribute 'file_uris' when using your to_pyarrow_dataset2 function with df = to_pyarrow_dataset2('sales_product', fs, 'shared').to_table().to_pandas(). sales_product is my table name and shared is my container name.
Please take a look at the function definition. You have to pass it a DeltaTable object that points to your delta table in your storage account; you cannot just pass it the name of the table.
You may be able to modify the function to do what you need, though.
The object URIs returned by the Rust core should contain the full URI, including the container name for ADLS; see the file_paths variable at delta-rs/python/deltalake/table.py line 248 in 0a05cb4.
I believe the problem is that we pass those paths to urlparse and later reference only the relative path component of the URI when we construct the object keys: delta-rs/python/deltalake/table.py line 275 in 0a05cb4.
We shouldn't need to pass in the container name as an extra argument, because it should already be part of the URIs returned by the Rust core. If we have to make a special case for ADLS in the …
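To illustrate why the container name gets dropped: for an abfss-style URI, urlparse puts the container (and account) into netloc, so code that only reads .path loses the container. A quick sketch with a hypothetical URI:

```python
from urllib.parse import urlparse

# Hypothetical ADLS URI, for illustration only
parsed = urlparse("abfss://container@account.dfs.core.windows.net/table/part-0000.parquet")
print(parsed.netloc)  # container@account.dfs.core.windows.net
print(parsed.path)    # /table/part-0000.parquet
```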
@mattc-eostar Thanks for the reply; it is working now with your to_pyarrow_dataset2 function.
@devilaadi Check out the docs here.
I think that makes sense. And then, if it becomes a pain to keep adding and maintaining new special cases, it could always be abstracted into a more robust cloud provider handler class or something. As long as the information is already available within the class, the container parameter could be removed immediately. The filesystem object needs to remain, unless it could be passed in upon creation of the delta table, since it is sort of a one-to-one thing anyway.
Does the ADLS filesystem library support setting the container name on creation? I am wondering if that would let it handle paths without container names. This is how things work for other backends, i.e. we initiate the filesystem and table root path, then pass in relative paths for each file.
I am not sure that this particular implementation allows for that. But could be wrong. |
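One generic fsspec workaround (not specific to adlfs, and assuming a reasonably recent fsspec) is DirFileSystem, which scopes any filesystem to a prefix such as a container, so relative paths work afterwards. A sketch with an in-memory filesystem standing in for adlfs:

```python
import fsspec
from fsspec.implementations.dirfs import DirFileSystem

# In-memory stand-in for adlfs.AzureBlobFileSystem; paths under "/container"
# play the role of blobs inside an Azure container
inner = fsspec.filesystem("memory")
inner.pipe("/container/table/part-0000.txt", b"data")

# Scope the filesystem to the container so callers can use relative paths
fs = DirFileSystem("/container", inner)
print(fs.cat("table/part-0000.txt"))
```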
Can anyone build the latest Python bindings from source and give them a try? @roeap introduced a big improvement to reuse the native Rust storage backend in Python.
Is anyone having the same issue but in the Rust package?

```
abfss://container@storagename.dfs.core.windows.net/folder/delta-0.8.0
Error: Failed to read delta log object: Invalid object URI
Caused by:
```
We switched from abfss:// to adls2:// in #499. Can you give more context on when you get this error? |
@thovoll Thanks for the pointer. I have set these two ENV VARS:
Then:
Error:
Are paths not supported, or am I missing something obvious? Thanks!
Try AZURE_STORAGE_ACCOUNT_NAME=myaccountname, not the full URL |
As @thovoll points out, it looks like the error you see is related to the scheme being changed, and I agree with the account var only referring to the account name. @thovoll - while …
@thovoll Same error even with myaccountname only (not the full URL). Looking at the code base, I think this is where the storage name is evaluated, and I'm wondering if I'm missing something in my code to indicate the use of feature=azure?
@francisco-ltech - you are correct. For some reason I was thinking about the Python bindings, where azure is enabled by default. You need to activate the `azure` feature.

Also, as of now you need to use a git reference in your Cargo.toml when using azure. Since the crates have not yet been published to crates.io, they are not available in the published deltalake crate. We should be able to fix that quite soon, since the azure crates are now being released; however, there are a few features still missing that we'd like to use here.

```toml
deltalake = { git = "https://github.com/delta-io/delta-rs", rev = "b713c4232b04de44658c9946734a78b170857603", features = [
    "azure",
] }
```
@roeap - That makes sense now, thank you! |
I'd rather see if adls2 sticks, version 0.x is the time to figure this out :) |
This issue has been open for quite a while, and I have successfully been using the Azure backend in multiple scenarios. A lot of work has also been done on the Azure side since then. Closing this for now, but feel free to re-open if it is still relevant.
Description
The Delta Rust API seems a good option for querying a delta table without spinning up a Spark cluster, so I am trying it out following https://databricks.com/blog/2020/12/22/natively-query-your-delta-lake-with-scala-java-and-python.html from a Python app.
The "Reading the Delta Table (Python)" section towards the end of that blog refers to the code snippet dt = DeltaTable("../rust/tests/data/simple_table"). It is not clear whether this path assumes a local folder (which would be a problem, because it would mean I need to download the delta folder to a local drive).
My delta table is on an ADLS path (Azure Data Lake Store), and I do not see a way to authenticate and connect to the ADLS folder path and use it in the above command (https://github.com/delta-io/delta-rs).
Am I missing something basic here?
Use Case
Querying a delta table from an Azure Function app without spinning up a Spark cluster.
Related Issue(s)
Not sure how to authenticate and connect to the ADLS folder where the delta table is stored from the deltalake library (I can connect using the blob client, but then cannot use it in the DeltaTable method in the deltalake package). A better code example would help.