Upload files "Row-Structured" #51
Comments
As you noticed, the AdlCopy tool does a binary copy of the files, and since Blob Storage does not align rows to extent boundaries, that will not work. The upcoming refresh, which should become available next week, should address this issue and make our extractors handle non-aligned boundaries. Until then you have the following option: register your Blob Storage account with ADLA (you can do that through the portal by adding a new data source, or via a PowerShell command). Then write your extract statement directly against the blob store:
Then you can do your processing directly on it, or use an …
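The suggested extract statement is not shown in the thread; a minimal U-SQL sketch might look like the following (the container, account, and path names are placeholders, and the extractor choice depends on your data):

```sql
// Hypothetical names: substitute your own container, storage account, and path.
// Each row is read as one line of text; parsing the JSON would happen downstream.
@rows =
    EXTRACT line string
    FROM "wasb://mycontainer@myaccount.blob.core.windows.net/data/file.json"
    USING Extractors.Text(delimiter: '\n');
```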
Thanks! I am looking forward to the upcoming release and I will use the wasb workaround in the meantime.
Is it really a good solution to fix the extractors instead of the actual problem? Wouldn't it be better to implement an upload option for row-structured text files in AdlCopy and have extent boundaries properly aligned with rows for the files stored in ADLS?
This is how the new extractor framework will handle it. It will not fetch all the adjacent data (only 4 MB). Unfortunately, having extent boundaries aligned cannot be guaranteed for all data uploads (e.g., when using a WebHDFS call), so the extractor framework has to handle it this way for now, until the file system gives us metadata telling us whether a file is indeed aligned (and currently HDFS does not provide such metadata).
Thank you Mike, I understand.
How can I upload large datasets from Blob Storage to ADLS in "row-structured file mode"? If I use AdlCopy, files get uploaded as binary, which results in incorrect splits. Since each line is a valid JSON document, this causes U-SQL jobs to fail. The only option I could find to upload the files correctly is Visual Studio, but that is not a good solution for large datasets.
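The failure mode described here can be shown in a few lines: when a binary copy re-blocks a newline-delimited JSON file at a fixed size, the cut can fall mid-row, and the resulting partial row no longer parses (the sample data and cut offset below are made up for illustration):

```python
import json


def last_row_is_valid(chunk: bytes) -> bool:
    """Check whether the final newline-delimited row in a chunk parses as JSON."""
    last = chunk.rsplit(b"\n", 1)[-1]
    if not last:
        return True  # chunk ended exactly on a row boundary
    try:
        json.loads(last)
        return True
    except json.JSONDecodeError:
        return False


data = b'{"id": 1}\n{"id": 2}\n'
print(last_row_is_valid(data))       # True: every row is complete
print(last_row_is_valid(data[:14]))  # False: the cut truncates the second row
```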