Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6244: [C++][Dataset] Add partition key to DataSource interface #5221

Closed
wants to merge 13 commits into from

Conversation

bkietz
Copy link
Member

@bkietz bkietz commented Aug 28, 2019

The condition is an expression guaranteed to evaluate true for all records in a DataSource. This provides some predicate push down funcitonality: DataSources whose condition precludes a filter expression will not yield any fragments (since those fragments would be filtered out anyway).

This patch does not implement evaluation of filter expressions against an in memory RecordBatch. It makes a half hearted attempt at API compatibility with #5157 which does implement this.

@fsaintjacques fsaintjacques changed the title ARROW-6244: [C++] augment DataSources with an optional partition condition ARROW-6244: [C++] Add partition key to DataSources interface Aug 29, 2019
@fsaintjacques fsaintjacques changed the title ARROW-6244: [C++] Add partition key to DataSources interface ARROW-6244: [C++] Add partition key to DataSource interface Aug 29, 2019
Copy link
Contributor

@fsaintjacques fsaintjacques left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you dissect so we don't merge other stuff:

  • Remove the RecordBatchProjector stuff
  • Remove the fuse of ScanOptions/ScanContext

cpp/src/arrow/dataset/dataset.h Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/dataset_internal.h Outdated Show resolved Hide resolved
@bkietz bkietz force-pushed the 6244-Implement-Partition-DataS branch from f4a8546 to 2fb8f96 Compare August 30, 2019 15:41
@bkietz
Copy link
Member Author

bkietz commented Aug 30, 2019

Created https://issues.apache.org/jira/browse/ARROW-6398 for consolidating ScanContext and ScanOptions

#5239

@bkietz bkietz force-pushed the 6244-Implement-Partition-DataS branch from 2fb8f96 to 77edc15 Compare August 30, 2019 16:35
@bkietz bkietz marked this pull request as ready for review August 30, 2019 16:42
cpp/src/arrow/dataset/file_base.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/dataset.h Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/dataset.h Outdated Show resolved Hide resolved
@bkietz bkietz force-pushed the 6244-Implement-Partition-DataS branch from 440e0d3 to 3e913c1 Compare September 1, 2019 19:38
@fsaintjacques fsaintjacques changed the title ARROW-6244: [C++] Add partition key to DataSource interface ARROW-6244: [C++][Dataset] Add partition key to DataSource interface Sep 4, 2019
cpp/src/arrow/dataset/dataset.cc Outdated Show resolved Hide resolved
@bkietz bkietz force-pushed the 6244-Implement-Partition-DataS branch from f1fb4de to 771ad42 Compare September 6, 2019 16:14
@fsaintjacques
Copy link
Contributor

@bkietz still a C++ error link error (MSCV).

@wesm
Copy link
Member

wesm commented Sep 10, 2019

This PR needs to be rebased.

@bkietz bkietz force-pushed the 6244-Implement-Partition-DataS branch from 9f8f404 to 8a3cc22 Compare September 10, 2019 17:13
@bkietz
Copy link
Member Author

bkietz commented Sep 10, 2019

rebased

@wesm
Copy link
Member

wesm commented Sep 10, 2019

Can you enable Appveyor at https://ci.appveyor.com/project/bkietz/arrow?

@bkietz
Copy link
Member Author

bkietz commented Sep 10, 2019

Unfortunately my appveyor username is not "bkietz" https://ci.appveyor.com/project/BenjaminKietzman/arrow/builds/27316846

@bkietz bkietz force-pushed the 6244-Implement-Partition-DataS branch 2 times, most recently from c92ae06 to c6d9367 Compare September 10, 2019 18:58
@wesm
Copy link
Member

wesm commented Sep 11, 2019

This change gives me a working local build. Let's see what happens in Appveyor

@wesm
Copy link
Member

wesm commented Sep 11, 2019

Weird this issue is present with Visual Studio 2017 but not 2015

@wesm
Copy link
Member

wesm commented Sep 11, 2019

Well, this is awful.

Having many DLLs and needing to share symbols across them is a big problem on Windows.

I think we need to change the way we are building the library to produce a single "kitchen sink" shared library which is similar to the way that other large C++ projects do things. That shared library would also need to include the symbols current in parquet.dll, and probably also arrow_python.dll

In the meantime, I think a workaround for this mess is to make sure that all RecordBatchIterator instantiations are happening in arrow.dll. Does that seem doable?

@wesm
Copy link
Member

wesm commented Sep 11, 2019

cc @kou for ideas also, since this nightmarish situation also will affect our packaging

@bkietz
Copy link
Member Author

bkietz commented Sep 11, 2019

That did it. @wesm @fsaintjacques what do you think?

@kou
Copy link
Member

kou commented Sep 11, 2019

I don't test but can we use {} instead of = default to force using inlined implementation?

diff --git a/cpp/src/arrow/util/iterator.h b/cpp/src/arrow/util/iterator.h
index 328b8e946..8cd70f09a 100644
--- a/cpp/src/arrow/util/iterator.h
+++ b/cpp/src/arrow/util/iterator.h
@@ -32,7 +32,11 @@ namespace arrow {
 template <typename T>
 class Iterator {
  public:
-  virtual ~Iterator() = default;
+  // We can't use "= default" here to avoid using not inlined
+  // implementation with MSVC. If not inlined implementation is used,
+  // we need to export the not inlined implementation. It requires
+  // common DLL that provides the not inlined implementation.
+  virtual ~Iterator() {};
 
   /// \brief Return the next element of the sequence, nullptr when the
   /// iteration is completed

@bkietz
Copy link
Member Author

bkietz commented Sep 12, 2019

@kou we tried that and a few other variations to no avail, unfortunately

template <typename HasNext>
explicit Iterator(HasNext&& has_next)
: Iterator(new Impl<HasNext>(std::forward<HasNext>(has_next)),
"direct impl construction") {}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pitrou here is the commit which changes Iterator from an abstract base to a type erased handle. I'll also extract to a separate pull request

@kou
Copy link
Member

kou commented Sep 12, 2019

Oh, I see.

@bkietz bkietz force-pushed the 6244-Implement-Partition-DataS branch from ac887bc to 42e2ad3 Compare September 17, 2019 14:43
@fsaintjacques
Copy link
Contributor

Travis failure is unrelated, sphinx segfault.

@bkietz bkietz deleted the 6244-Implement-Partition-DataS branch February 25, 2021 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants