ARROW-6238: [C++][Dataset] Implement SimpleDataSource, SimpleDataFragment and SimpleScanTask#5140
cpp/src/arrow/dataset/dataset.h
Outdated
I'm not sure about this interface. I think we should delete it until we make use of it (YAGNI).
Sounds fine to me; sub-file data fragments sound like something we'd want to handle at the DataSource level anyhow.
7a30748 to
a2f5cbf
… SimpleScanTask

The Simple* family of classes are iterators backed by explicit vectors. This can be useful to represent an in-memory data source that rarely changes, e.g. a constant join table.

- SimpleDataSource is backed by a vector<DataFragment>.
- SimpleDataFragment is backed by a vector<RecordBatch>.
- SimpleScanTask is backed by a vector<RecordBatch>.
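The vector-backed pattern described in this commit message can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Batch`, `BatchVector`, `VectorBatchIterator`, `VectorScanTask` and `TotalRows` are hypothetical stand-ins that do not depend on Arrow, and the choice to signal exhaustion with a null pointer is an assumption of the sketch.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for arrow::RecordBatch; only carries a row count.
struct Batch {
  int num_rows;
};

using BatchVector = std::vector<std::shared_ptr<Batch>>;

// Hypothetical iterator that yields each batch of an owned vector once.
class VectorBatchIterator {
 public:
  explicit VectorBatchIterator(BatchVector batches)
      : batches_(std::move(batches)) {}

  // Returns the next batch, or nullptr once the vector is exhausted
  // (an assumed convention for this sketch).
  std::shared_ptr<Batch> Next() {
    if (position_ >= batches_.size()) return nullptr;
    return batches_[position_++];
  }

 private:
  BatchVector batches_;
  std::size_t position_ = 0;
};

// Hypothetical analogue of a scan task backed by an explicit vector.
class VectorScanTask {
 public:
  explicit VectorScanTask(BatchVector batches)
      : batches_(std::move(batches)) {}

  // Each call hands out a fresh iterator over a copy of the pointers,
  // so the same task can be scanned more than once.
  VectorBatchIterator Scan() const { return VectorBatchIterator(batches_); }

 private:
  BatchVector batches_;
};

// Drains one scan of the task and sums the row counts.
int TotalRows(const VectorScanTask& task) {
  int total = 0;
  auto it = task.Scan();
  while (auto batch = it.Next()) total += batch->num_rows;
  return total;
}
```

Because the task keeps its own vector and only copies shared pointers per scan, repeated scans see the same batches, which fits the "rarely changes" use case mentioned above.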
a2f5cbf to
2d41566
Codecov Report
@@ Coverage Diff @@
## master #5140 +/- ##
==========================================
+ Coverage 87.63% 89.22% +1.58%
==========================================
Files 1022 742 -280
Lines 146258 105507 -40751
Branches 1437 0 -1437
==========================================
- Hits 128174 94137 -34037
+ Misses 17722 11370 -6352
+ Partials 362 0 -362
@pitrou comments were addressed.
wesm left a comment
+1. Can you address my comments in the next patch?
  auto builder_fn = [](BuilderType* builder) { builder->UnsafeAppend(T(0)); };
  ASSERT_OK_AND_ASSIGN(auto array, ArrayFromBuilderVisitor(type, size, builder_fn));
  return array;
}
Personally, I would implement Zeros<T>(size) to go alongside the current functions in testing/random.h.
 private:
  int64_t repetitions_;
  std::shared_ptr<RecordBatch> batch_;
};
I still find this really narrow; I commented on ARROW-6161 about this. Wouldn't passing std::vector<shared_ptr<RecordBatch>> be more generally useful? You can even create a function RepeatVector(val, num) to make generating the vector easy.
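For illustration, a helper along the lines of the suggested RepeatVector(val, num) might look like the sketch below. The signature, the `int64_t` count and the `Batch` stand-in type are assumptions made for this example; nothing here is taken from the Arrow codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for arrow::RecordBatch.
struct Batch {
  int num_rows;
};

// Hypothetical helper: returns a vector holding `num` copies of `value`.
// For shared_ptr values this copies the pointer, not the pointee.
template <typename T>
std::vector<T> RepeatVector(const T& value, std::int64_t num) {
  assert(num >= 0);  // a negative repetition count is a caller error
  return std::vector<T>(static_cast<std::size_t>(num), value);
}
```

With such a helper, a reader that takes a plain vector of batches could be fed `RepeatVector(batch, n)` instead of needing a dedicated "repeat one batch" class.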
    int64_t n_batch, int64_t batch_size, std::shared_ptr<Schema> schema) {
  auto batch = GetRecordBatch(batch_size, std::move(schema));
  return GetRecordBatchReader(n_batch, std::move(batch));
}
I am not sure these functions belong here. Their names don't describe what they do, and they are particular to unit tests found in arrow/datasets.