Selective BufferedInput without cache #7217
Conversation
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@oerling How hard is it to make
Force-pushed from a173875 to 0563909.
@oerling I did the first pass on the production code and left minors. Thanks!
// Synchronously sets 'data_' to cover 'loadedRegion_'.
void loadSync();

SelectiveBufferedInput* const bufferedInput_;
Please put const members together first.
void loadSync();

SelectiveBufferedInput* const bufferedInput_;
IoStatistics* ioStats_;
s/IoStatistics* ioStats_;/IoStatistics* const ioStats_;/
// Testing function to access loaded state.
void testingData(
    velox::common::Region& loadedRegion,
    memory::Allocation*& data,
Would it be better to use memory::Allocation** and std::string**?
const int32_t loadQuantum_;

// Allocation with loaded data. Has space for region.length or loadQuantum_
// bytes, whichevr is less.
s/whichevr/whichever/
// Offset of current run from start of 'data_'
uint64_t offsetOfRun_;

// Pointer to start of current run in 'entry->data()' or
nit: leave one space between characters? Thanks!
}

void SelectiveBufferedInput::makeLoads(
    std::vector<LoadRequest*> requests,
const std::vector<LoadRequest*>& requests,
void SelectiveBufferedInput::makeLoads(
    std::vector<LoadRequest*> requests,
    bool prefetch) {
  if (requests.empty() || (requests.size() < 2 && !prefetch)) {
Can you comment why skip loads if "(requests.size() < 2 && !prefetch)" is true? Thanks!
  if (requests.empty() || (requests.size() < 2 && !prefetch)) {
    return;
  }
  int32_t maxDistance = options_.maxCoalesceDistance();
const int32_t maxDistance =
    return;
  }
  int32_t maxDistance = options_.maxCoalesceDistance();
  auto loadQuantum = options_.loadQuantum();
const auto loadQuantum =
  // If reading densely accessed, coalesce into large for best throughput, if
  // for sparse, coalesce to quantum to reduce overread. Not all sparse access
  // is correlated.
  auto maxCoalesceBytes = prefetch ? options_.maxCoalesceBytes() : loadQuantum;
ditto: const
      requests,
      maxDistance,
      // Break batches up. Better load more short ones i parallel.
      1000, // limit coalesce by size, not count.
// Ranges limit per IO.
@oerling overall looks good to me % comments. Thanks!
void makeLoads(std::vector<LoadRequest*> requests, bool prefetch);

// Makes a CoalescedLoad for 'requests' to be read together, coalescing
// IO is appropriate. If 'prefetch' is set, schedules the CoalescedLoad
s/is appropriate/if appropriate/
coalescedLoads_;

// Distinct coalesced loads in 'coalescedLoads_'.
std::vector<std::shared_ptr<cache::CoalescedLoad>> allCoalescedLoads_;
Shall we all use SelectiveCoalescedLoad instead of CoalescedLoad? Thanks!
      [&](LoadRequest* request, std::vector<LoadRequest*>& ranges) {
        ranges.push_back(request);
      },
      [&](int32_t /*gap*/, std::vector<LoadRequest*> /*ranges*/) { /*no op*/ },
Don't we need to handle the skipped IO range?
  });
  // Combine adjacent short reads.

  int32_t numNewLoads = 0;
nit: s/numNewLoads/numLoads/
} // namespace

void CoalescedInputStream::loadSync() {
  if (region_.length < SelectiveBufferedInput::kTinySize &&
Shall we check loadedRegion_.length here?
@oerling LGTM % minors. Thanks!
void DirectInputStream::BackUp(int32_t count) {
  VELOX_CHECK_GE(count, 0, "can't backup negative distances");

  uint64_t unsignedCount = static_cast<uint64_t>(count);
const uint64_t unsignedCount =
  if (data_.numPages() == 0) {
    run_ = reinterpret_cast<uint8_t*>(tinyData_.data());
    runSize_ = tinyData_.size();
    offsetInRun_ = offsetInData;
VELOX_CHECK_LT(offsetInRun_, runSize_);
    loadSync();
  }

  auto offsetInData = offsetInRegion_ - (loadedRegion_.offset - region_.offset);
const auto offsetInData
      options_(readerOptions) {}

/// Constructor used by clone().
DirectBufferedInput(
Can we move this to private?
/// call for 'stream' since the load is to be triggered by the first
/// access.
std::shared_ptr<DirectCoalescedLoad> coalescedLoad(
    const SeekableInputStream* FOLLY_NONNULL stream);
ditto
 private:
  std::shared_ptr<IoStatistics> const ioStats_;
  const uint64_t groupId_;
  std::shared_ptr<ReadFileInputStream> const input_;
const std::shared_ptr<ReadFileInputStream> input_;
// Makes a CoalescedLoad for 'requests' to be read together, coalescing
// IO if appropriate. If 'prefetch' is set, schedules the CoalescedLoad
// on 'executor_'. Links the CoalescedLoad to all CacheInputStreams that it
// concerns.
s/concerns/covers/
  uint64_t offsetInRuns = 0;
  for (int i = 0; i < allocation.numRuns(); ++i) {
    auto run = allocation.runAt(i);
    uint64_t bytes = memory::AllocationTraits::pageBytes(run.numPages());
nit: const bytes and readSize
  // Case where request is a little over quantum but is folowed by another
  // within the max distance. Coalesces and allows reading the region of
  // max quantum + max distance in one piece.
  request.loadSize = request.region.length;
s/request.region.length/region.length/
  DirectBufferedInput* const bufferedInput_;
  IoStatistics* const ioStats_;
  std::shared_ptr<ReadFileInputStream> const input_;
const std::shared_ptr<ReadFileInputStream> input_;
  void readRegion(std::vector<LoadRequest*> requests, bool prefetch);

  const uint64_t fileNum_;
  std::shared_ptr<cache::ScanTracker> const tracker_;
const std::shared_ptr<cache::ScanTracker> tracker_;
const std::shared_ptr<IoStatistics> ioStats_;
Summary: DirectBufferedInput - Selective BufferedInput without cache

Adds a BufferedInput that tracks access frequency and coalesces by frequency class, similar to CachedBufferedInput. This does not cache the data but instead owns the data in the BufferedInput, like the base BufferedInput.

Adjusts coalescing so that infrequently accessed data has a smaller max coalesce size, because not all infrequent loading is correlated. Raises the stream count cutoff for a coalesced load from 40 to 1000 streams, because many streams are very small in wide tables (e.g. mostly null columns) and there is no point splitting these up.

Reviewed By: xiaoxmeng
Differential Revision: D50603890
Pulled By: oerling
This pull request was exported from Phabricator. Differential Revision: D50603890
@oerling Thank you for this patch! Does
Summary: #7217 introduced DirectBufferedInput (DBI) to replace BufferedInput (BI). Compared to BI, DBI has several advantages, such as fetching data asynchronously on an executor and breaking chunks into pieces. These are controlled by four main parameters: loadQuantum, prefetchRowGroups, maxCoalesceBytes, and maxCoalesceDistanceBytes. But the default values of these parameters are not optimal for usage like Gluten, which fetches data row group by row group and does not reuse fetched data. As a result, the defaults lead to a large performance regression compared to BI. This PR adds configs such as loadQuantum and prefetchRowGroups to HiveConfig so they can be set when the HiveConnector is constructed. In addition, directorySizeGuess and filePreloadThreshold also affect performance.
- loadQuantum controls the size of each coalesced load. In Gluten, we set the default value to 256MB to make sure the whole row group is fetched, because the default row group size is 128MB. If loadQuantum is smaller than the row group size, the row group is broken into column chunks to be fetched. If it is even less than the chunk size, then only the configured size is loaded on IO threads and all other data is loaded synchronously by the task worker thread, which is why the serious performance regression happens.
- CoalesceDistance controls the size of the over-read. It is common not to select all the columns of a table, so some column chunks should not be fetched, which leads to non-contiguous disk reads. If the skipped chunk is smaller than the coalesce distance, we can over-read the data to keep the disk read contiguous. Currently we set it to 1MB in Gluten.
- CoalesceBytes is the unit used to fetch data on IO threads; with many IO threads on the executor, fetching data in small pieces instead of one whole block may perform better. CoalesceBytes controls that size; currently we configure it as 64MB in Gluten.
- As the name indicates, prefetchRowGroups is how many row groups should be prefetched while the engine is processing the current one. Setting it to more than 1 overlaps the next row group's fetch time with the current row group's processing time, giving the best performance.
- directorySizeGuess is renamed to footerEstimatedSize to better align with its usage. It is used to estimate the footer data to be fetched; the footer includes the metadata of each row group.
- filePreloadThreshold configures a file-size threshold: if a file is smaller than the threshold, the whole file is fetched directly.

The PR solves issue #8041. There is more discussion in PR #7873. Pull Request resolved: #7978 Reviewed By: xiaoxmeng Differential Revision: D52569526 Pulled By: mbasmanova fbshipit-source-id: bba99e61949bb8a7c8db15d714cc15b19f15633b
Velox PR 7217 (facebookincubator/velox#7217) added DirectBufferedInput, which leads to a serious performance regression. The root cause is that the default config in that PR is not optimal for remote storage. The PR added 6 configs: loadQuantum: 256MB (make sure it is larger than the row group size; the Parquet default is 128MB); maxCoalesceDistance: 1MB (in case the columns are not loaded contiguously, like select a, c from table_with_column_a_b_c: if b is a smaller column than 1MB, we can load it to make one large block); CoalesceBytes: 64MB (break the row group fetches into small chunks)