[PECO-1751] Refactor CloudFetch downloader: handle files sequentially #405
Merged
Conversation
…; utilize Futures
Signed-off-by: Levko Kravets <levko.ne@gmail.com>
kravets-levko requested review from rcypher-databricks, yunbodeng-db, andrefurlan-db, jackyhu-db and benc-db as code owners on July 2, 2024 at 15:23
kravets-levko changed the title from "[PECO-1751] Refactor CloudFetch downloader: handle files sequentially (WIP)" to "[PECO-1751] Refactor CloudFetch downloader: handle files sequentially" on Jul 2, 2024
jackyhu-db reviewed on Jul 2, 2024
jackyhu-db approved these changes on Jul 10, 2024
kravets-levko added a commit to databricks/databricks-sql-go that referenced this pull request on Aug 23, 2024:
[PECO-1752] This PR is an attempt to fix a CloudFetch error "row number N is not contained in any arrow batch". See also databricks/databricks-sql-python#405 - basically the same issue, with the same root cause and a similar solution.

#### The problem

In the current implementation, all the links from a single TRowSet are added to a concurrent thread (goroutine) pool. The pool downloads them in a random order (all the tasks have the same priority and, as a result, the same chance to be executed first). To maintain the order of results, the `startRowOffset`/`rowCount` fields from each CloudFetch link are used: the library keeps track of the current row number and uses it to pick the right CloudFetch link (looking for the file where the current row is within [startRowOffset; startRowOffset + rowCount]).

This solution has several caveats. First of all, the library allows fetching data only from beginning to end. With a concurrent thread pool, you never know which file will be downloaded first. In the worst case, while the user is waiting for the very first file, the library may download all the other ones and keep them in memory, because the user may need them in the future. This increases latency (on average it will be okay, but we have no control over it) and also memory consumption.

Another problem with this approach is that if any of the files cannot be downloaded, there is no need to download the remaining files, since the user won't be able to process them anyway. But because files are downloaded in an arbitrary order, nobody knows how many files will be downloaded before the user reaches the failed one.

Also, it seems that error handling wasn't done quite right, but that part of the code was a bit unclear to me. In any case, with this fix all errors are properly handled and propagated to the user when needed.

#### The solution

This PR changes the CloudFetch downloader to use a queue. The downloader keeps a list of pending (not yet scheduled) links and of currently running tasks. The number of tasks is limited, so new files are scheduled only when a previous task is completed and extracted from the queue. As the user requests the next file, the downloader picks the first task from the queue and schedules a new one to run in the background, keeping the queue full. Then the downloader waits for the task it picked from the queue and returns its result to the user. Tasks still run in parallel in the background. Also, each task is itself responsible for handling errors (e.g. retrying failed downloads), so when a task completes, it is either eventually successful or failed after all possible retries.

With this approach, the proper order of files is automatically assured. All errors are either handled in the downloader or propagated to the user. If some file cannot be downloaded due to an error, the library will not download the remaining ones (as it did previously). Because new files are downloaded only when the user consumes the previous ones, the library will not keep the whole dataset in memory.

[PECO-1752]: https://databricks.atlassian.net/browse/PECO-1752?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---------

Signed-off-by: Levko Kravets <levko.ne@gmail.com>
PECO-1751
This PR is an attempt to fix various CloudFetch-related issues (no data returned, duplicated rows returned, and possibly some performance and memory issues).
Replaces #362
The problem
In the current implementation, all the links from a single TRowSet are added to a concurrent thread pool. The pool downloads them in a random order (all the tasks have the same priority and, as a result, the same chance to be executed first). To maintain the order of results, the `startRowOffset`/`rowCount` fields from each CloudFetch link are used: the library keeps track of the current row number and uses it to pick the right CloudFetch link (looking for the file where the current row is within [startRowOffset; startRowOffset + rowCount]).

This solution has several caveats. First of all, data can be fetched only from beginning to end. With a concurrent thread pool, you never know which file will be downloaded first. In the worst case, while the user is waiting for the very first file, the library may download all the other ones and keep them in memory, because the user may need them in the future. This increases latency (on average it will be okay, but we have no control over it) and also memory consumption.
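To make the old lookup concrete, here is a minimal sketch of the mechanism described above (the function name `find_link_for_row` and the plain attribute access are illustrative assumptions, not the connector's actual code):

```python
# Hypothetical sketch of the row-offset lookup; `links` is a list of
# CloudFetch link objects carrying startRowOffset/rowCount, as described above.
def find_link_for_row(links, current_row):
    """Return the link whose row range contains current_row, if any."""
    for link in links:
        if link.startRowOffset <= current_row < link.startRowOffset + link.rowCount:
            return link
    # When this bookkeeping goes wrong, the user sees errors like
    # "row number N is not contained in any arrow batch".
    return None
```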
Another problem with this approach is that if any of the files cannot be downloaded (even after retries), there is no need to download the remaining files, since the user won't be able to process them anyway. But because files are downloaded in an arbitrary order, nobody knows how many files will be downloaded before the user reaches the failed one.
Also, instead of utilizing the power of Futures, custom synchronization and error handling are implemented, which may be a cause of synchronization issues. Additionally, retries are implemented outside of the download task, so when a task fails, the outer code has to reschedule it, which is another source of sporadic errors.
The solution
This PR changes the CloudFetch downloader to use a queue. The downloader keeps a list of pending (not yet scheduled) links and of currently running tasks. The number of tasks is limited, so new files are scheduled only when a previous task is completed and extracted from the queue. As the user requests the next file, the downloader picks the first task from the queue and schedules a new one to run in the background, keeping the queue full. Then the downloader waits for the task it picked from the queue and returns its result to the user. Tasks still run in parallel in the background. Also, each task is itself responsible for handling errors (e.g. retrying failed downloads), so when a task completes, it is either eventually successful or failed after all possible retries.
With this approach, the proper order of files is automatically assured. All errors are either handled in the downloader or propagated to the user. If some file cannot be downloaded due to an error, the library will not download the remaining ones (as it did previously). Utilizing Futures solves all possible issues with thread synchronization. Because new files are downloaded only when the user consumes the previous ones, the library will not keep the whole dataset in memory. A sketch of this queue-based approach follows below.
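The following is a minimal sketch of the queue-based downloader described above, built on `concurrent.futures`. The class name `SequentialDownloader` and the injected `fetch_file` callable are illustrative assumptions, not the connector's actual API:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class SequentialDownloader:
    """Illustrative sketch: keeps at most `max_tasks` downloads in flight
    and hands files back strictly in link order."""

    def __init__(self, links, fetch_file, max_tasks=4):
        self._pending = deque(links)   # links not yet scheduled
        self._tasks = deque()          # futures for downloads in flight, in order
        self._fetch_file = fetch_file  # hypothetical callable(link) -> file data
        self._pool = ThreadPoolExecutor(max_workers=max_tasks)
        self._max_tasks = max_tasks
        self._fill_queue()

    def _fill_queue(self):
        # Schedule new downloads only while there is room in the task queue,
        # so the whole dataset is never held in memory at once.
        while self._pending and len(self._tasks) < self._max_tasks:
            link = self._pending.popleft()
            self._tasks.append(self._pool.submit(self._download, link))

    def _download(self, link, retries=3):
        # Each task handles its own retries: a completed future is either
        # eventually successful or failed after all possible retries.
        for attempt in range(retries):
            try:
                return self._fetch_file(link)
            except Exception:
                if attempt == retries - 1:
                    raise

    def next_file(self):
        if not self._tasks:
            return None               # all files consumed
        task = self._tasks.popleft()  # head of the queue = next file in order
        self._fill_queue()            # keep the queue full
        return task.result()          # blocks; propagates download errors
```

Because `next_file()` always takes the head of the queue, ordering no longer depends on `startRowOffset` bookkeeping, and once a failure propagates out of `task.result()`, the remaining pending links are never scheduled.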