
Upload files "Row-Structured" #51

Closed
ghost opened this issue Jul 27, 2016 · 5 comments

Comments

@ghost

ghost commented Jul 27, 2016

How can I upload large datasets from Blob Storage to ADLS in "Row-Structured File Mode"? If I use Adlcopy, files get uploaded as binary which results in incorrect splits. Since each line is a valid JSON document this causes U-SQL jobs to fail. The only option I could find to upload the files correctly is Visual Studio - but this is not a good solution for large datasets.

@MikeRys
Collaborator

MikeRys commented Jul 27, 2016

As you noticed, the AdlCopy tool does a binary copy of the files, and since Blob Storage does not align rows to extent boundaries, that will not work.

The upcoming refresh, expected to become available next week, should address this issue and make our extractors handle non-aligned boundaries.

Until then you have the following option:

Register your Blob Storage with ADLA (you can do that through the portal by adding a new data source, or via a PowerShell command).

Then write your extract statement directly against the blob store:

```sql
@data = EXTRACT jsondoc string
        FROM "wasb://container@account/folder/jsondocuments.txt"
        USING Extractors.Text(delimiter:'\r'); // or use your own extractor
```

Then you can do your processing directly on it, or use an OUTPUT statement to copy the data into your ADLS account. Note that you currently have to do this one file at a time.
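Putting those pieces together, a minimal copy job could look like the following sketch (a sketch only: the wasb URI and output path are placeholders, the output targets the default ADLS account, and quoting is turned off so each JSON line is written back verbatim rather than wrapped in quotes):

```sql
// Read the row-structured JSON lines straight from Blob Storage...
@data =
    EXTRACT jsondoc string
    FROM "wasb://container@account/folder/jsondocuments.txt"
    USING Extractors.Text(delimiter:'\r');

// ...and write them into the default ADLS account, one row per line.
OUTPUT @data
TO "/folder/jsondocuments.txt"
USING Outputters.Text(delimiter:'\r', quoting:false);
```

One such job per file, given the one-file-at-a-time limitation noted above.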

@ghost
Author

ghost commented Jul 27, 2016

Thanks! I am looking forward to the upcoming release and I will use the wasb workaround in the meantime.

@ghost ghost closed this as completed Jul 27, 2016
@ghost
Author

ghost commented Jul 28, 2016

Is it really a good solution to fix the extractors instead of the actual problem? Wouldn't it be better to implement an upload option for row-structured text files in AdlCopy and have extent boundaries properly aligned with rows for files stored in ADLS?
How will the new extractors handle non-aligned boundaries? Will they fetch the adjacent block, move it over to the node, and complete the fragmented row? That sounds very expensive.

@ghost ghost reopened this Jul 28, 2016
@MikeRys
Collaborator

MikeRys commented Jul 28, 2016

That is essentially how the new extractor framework will handle it, except that it will not fetch all of the adjacent data (only up to 4 MB).

Unfortunately, having extent boundaries aligned cannot be guaranteed for all data uploads (e.g., when using a WebHDFS call), and thus the extractor framework has to handle it this way for now, until the file system gives us metadata telling us whether a file is indeed aligned (currently HDFS does not provide such metadata).
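The scheme Mike describes can be modeled with a short sketch (illustrative only, not the actual U-SQL extractor code): each reader discards the partial row at the start of its extent, because the previous extent's reader completes it, and reads past the end of its own extent to finish the last row it started. That way every row is produced exactly once no matter where the extent boundary falls.

```python
# Illustrative model (not the actual extractor implementation) of how a
# row-oriented reader can handle extent boundaries that split a row.
def rows_for_extent(data: bytes, start: int, end: int) -> list:
    """Return the rows 'owned' by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Discard the (possibly partial) row we land in; the reader of
        # the previous extent is responsible for completing it.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    rows = []
    # Emit every row that *starts* at or before `end`, reading past the
    # extent boundary if needed to finish it -- this bounded read of
    # adjacent data is the "fetch" step described above.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            rows.append(data[pos:])
            pos = len(data)
        else:
            rows.append(data[pos:nl])
            pos = nl + 1
    return rows

data = b"row-one\nrow-two-spans-boundary\nrow-three\n"
# A boundary at byte 12 falls inside the second row: the first reader
# finishes that row, the second reader skips its leading fragment.
first = rows_for_extent(data, 0, 12)           # row-one, row-two-...
second = rows_for_extent(data, 12, len(data))  # row-three
```

Note that only the tail of one row is read from the adjacent extent, which is why the real framework can cap the extra fetch at a small fixed size rather than moving whole extents between nodes.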

@ghost
Author

ghost commented Jul 29, 2016

Thank you Mike, I understand.

@ghost ghost closed this as completed Jul 29, 2016