Skip to content

Comments

ARROW-6238: [C++][Dataset] Implement SimpleDataSource, SimpleDataFragment and SimpleScanTask#5140

Closed
fsaintjacques wants to merge 5 commits intoapache:masterfrom
fsaintjacques:ARROW-6238-simple-datasource-datafragment
Closed

ARROW-6238: [C++][Dataset] Implement SimpleDataSource, SimpleDataFragment and SimpleScanTask#5140
fsaintjacques wants to merge 5 commits intoapache:masterfrom
fsaintjacques:ARROW-6238-simple-datasource-datafragment

Conversation

@fsaintjacques
Copy link
Contributor

The Simple* family of classes are iterator backed by explicit vectors. This can be useful to represent a memory datasource that rarely changes, e.g. a constant join table.

  • SimpleDataSource is backed by a vector.
  • SimpleDataFragment is backed by a vector.
  • SimpleScanTask is backed by a vector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this interface, I think we should delete until we make use of it, YAGNI.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds fine to me; sub-file data fragments sounds like something we'd want to handle at the DataSource level anyhow

@pitrou pitrou changed the title ARROW-6238: [C++][Dataset] Implements SimpleDataSource, SimpleDataFragment and SimpleScanTask ARROW-6238: [C++][Dataset] Implement SimpleDataSource, SimpleDataFragment and SimpleScanTask Aug 22, 2019
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some details.

@fsaintjacques fsaintjacques force-pushed the ARROW-6238-simple-datasource-datafragment branch from 7a30748 to a2f5cbf Compare August 22, 2019 18:19
… SimpleScanTask

The Simple* family of classes are iterator backed by explicit vectors.
This can be useful to represent a memory datasource that rarely changes,
e.g. a constant join table.

- SimpleDataSource is backed by a vector<DataFragment>.
- SimpleDataFragment is backed by a vector<RecordBatch>.
- SimpleScanTask is backed by a vector<RecordBatch>.
@fsaintjacques fsaintjacques force-pushed the ARROW-6238-simple-datasource-datafragment branch from a2f5cbf to 2d41566 Compare August 22, 2019 18:55
@codecov-io
Copy link

Codecov Report

Merging #5140 into master will increase coverage by 1.58%.
The diff coverage is 78.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5140      +/-   ##
==========================================
+ Coverage   87.63%   89.22%   +1.58%     
==========================================
  Files        1022      742     -280     
  Lines      146258   105507   -40751     
  Branches     1437        0    -1437     
==========================================
- Hits       128174    94137   -34037     
+ Misses      17722    11370    -6352     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/arrow/dataset/scanner.cc 100% <100%> (ø)
cpp/src/arrow/dataset/dataset_test.cc 100% <100%> (ø)
cpp/src/arrow/dataset/test_util.h 91.83% <100%> (+2.36%) ⬆️
cpp/src/arrow/util/iterator.h 100% <100%> (ø) ⬆️
cpp/src/arrow/dataset/scanner.h 100% <100%> (ø) ⬆️
cpp/src/arrow/dataset/dataset.cc 100% <100%> (ø)
cpp/src/arrow/testing/gtest_util.h 80.53% <45.94%> (-16.84%) ⬇️
cpp/src/arrow/dataset/dataset.h 76.92% <50%> (-23.08%) ⬇️
cpp/src/arrow/testing/util.h 91.3% <80%> (-3.15%) ⬇️
cpp/src/arrow/csv/column_builder.cc 95.54% <0%> (-1.49%) ⬇️
... and 289 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3c9236d...2d41566. Read the comment docs.

@fsaintjacques
Copy link
Contributor Author

@pitrou comments were addressed.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Can you address my comments in the next patch?

auto builder_fn = [](BuilderType* builder) { builder->UnsafeAppend(T(0)); };
ASSERT_OK_AND_ASSIGN(auto array, ArrayFromBuilderVisitor(type, size, builder_fn));
return array;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would implement Zeros<T>(size) to go alongside the current functions in testing/random.h

private:
int64_t repetitions_;
std::shared_ptr<RecordBatch> batch_;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still find this really narrow -- I commented on ARROW-6161 about this. Wouldn't passing std::vector<shared_ptr<RecordBatch>> be more generally useful? You can even create a function RepeatVector(val, num) to make generating the vector easy

int64_t n_batch, int64_t batch_size, std::shared_ptr<Schema> schema) {
auto batch = GetRecordBatch(batch_size, std::move(schema));
return GetRecordBatchReader(n_batch, std::move(batch));
}
Copy link
Member

@wesm wesm Aug 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure these functions belong here. Their names don't describe what they do and they are particular to unit tests found in arrow/datasets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants