
Upload files "Row-Structured" #51

Closed
ghost opened this issue Jul 27, 2016 · 5 comments

Comments

@ghost

ghost commented Jul 27, 2016

How can I upload large datasets from Blob Storage to ADLS in "Row-Structured File Mode"? If I use Adlcopy, files get uploaded as binary which results in incorrect splits. Since each line is a valid JSON document this causes U-SQL jobs to fail. The only option I could find to upload the files correctly is Visual Studio - but this is not a good solution for large datasets.

@MikeRys
Collaborator

MikeRys commented Jul 27, 2016

As you noticed, the AdlCopy tool does a binary copy of the files, and since Blob Storage does not align rows to extent boundaries, that will not work.

The upcoming refresh, expected to become available next week, should address this issue and make our extractors handle non-aligned boundaries.

Until then you have the following option:

Register your Blob Storage with ADLA (you can do that through the portal by adding a new data source, or via a PowerShell command).

Then write your extract statement directly against the blob store:

```sql
@data = EXTRACT jsondoc string
        FROM "wasb://container@account/folder/jsondocuments.txt"
        USING Extractors.Text(delimiter:'\r'); // or use your own extractor
```

Then you can do your processing directly on it, or use an OUTPUT statement to copy the data into your ADLS account. Note that you currently have to do this one file at a time.
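Putting those pieces together, a minimal copy job could look like the following sketch (a sketch only: the wasb URI and output path are placeholders, the output targets the default ADLS account, and quoting is turned off so each JSON line is written back verbatim rather than wrapped in quotes):

```sql
// Read the row-structured JSON lines straight from Blob Storage...
@data =
    EXTRACT jsondoc string
    FROM "wasb://container@account/folder/jsondocuments.txt"
    USING Extractors.Text(delimiter:'\r');

// ...and write them into the default ADLS account, one row per line.
OUTPUT @data
TO "/folder/jsondocuments.txt"
USING Outputters.Text(delimiter:'\r', quoting:false);
```

One such job per file, given the one-file-at-a-time limitation noted above.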

@ghost
Author

ghost commented Jul 27, 2016

Thanks! I am looking forward to the upcoming release and I will use the wasb workaround in the meantime.

@ghost ghost closed this as completed Jul 27, 2016
@ghost
Author

ghost commented Jul 28, 2016

Is it really a good solution to fix the extractors instead of the actual problem? Wouldn't it be better to implement an upload option for row-structured text files in AdlCopy and have extent boundaries properly aligned with rows for files stored in ADLS?
How will the new extractors handle non-aligned boundaries? Will they fetch the adjacent block, move it over to the node, and complete the fragmented row? That sounds very expensive.

@ghost ghost reopened this Jul 28, 2016
@MikeRys
Collaborator

MikeRys commented Jul 28, 2016

That is essentially how the new extractor framework will handle it, except that it will not fetch all of the adjacent data (only up to 4 MB).

Unfortunately, having extent boundaries aligned cannot be guaranteed for all data uploads (e.g., when using a WebHDFS call), and thus the extractor framework has to handle it this way for now, until the file system gives us metadata telling us whether a file is indeed aligned (currently HDFS does not provide such metadata).
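The scheme Mike describes can be modeled with a short sketch (illustrative only, not the actual U-SQL extractor code): each reader discards the partial row at the start of its extent, because the previous extent's reader completes it, and reads past the end of its own extent to finish the last row it started. That way every row is produced exactly once no matter where the extent boundary falls.

```python
# Illustrative model (not the actual extractor implementation) of how a
# row-oriented reader can handle extent boundaries that split a row.
def rows_for_extent(data: bytes, start: int, end: int) -> list:
    """Return the rows 'owned' by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Discard the (possibly partial) row we land in; the reader of
        # the previous extent is responsible for completing it.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    rows = []
    # Emit every row that *starts* at or before `end`, reading past the
    # extent boundary if needed to finish it -- this bounded read of
    # adjacent data is the "fetch" step described above.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            rows.append(data[pos:])
            pos = len(data)
        else:
            rows.append(data[pos:nl])
            pos = nl + 1
    return rows

data = b"row-one\nrow-two-spans-boundary\nrow-three\n"
# A boundary at byte 12 falls inside the second row: the first reader
# finishes that row, the second reader skips its leading fragment.
first = rows_for_extent(data, 0, 12)           # row-one, row-two-...
second = rows_for_extent(data, 12, len(data))  # row-three
```

Note that only the tail of one row is read from the adjacent extent, which is why the real framework can cap the extra fetch at a small fixed size rather than moving whole extents between nodes.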

@ghost
Author

ghost commented Jul 29, 2016

Thank you Mike, I understand.

@ghost ghost closed this as completed Jul 29, 2016