-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34858: [Swift] Initial reader impl #34842
Conversation
abandy
commented
Apr 2, 2023
•
edited
edited
- Initial check in for the swift arrow reader impl
- bug fixes found during reader testing
- class/method access modifier changes (mostly from internal to public)
- Closes: [Swift] Add reader support #34858
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
|
26cab61
to
92900b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add these generated files to this repository?
Can we generate them in build time instead of adding them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could generate them during build/test time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misread the intent of this change. I believe these generated files need to be included in the repo. If someone wants to use this lib in there code then they would reference this package in their package.swift file and would need these generated files to be part of the code base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Then could you add a script to generate these files so that we can regenerate them when we update FlatBuffers compiler?
And could you also add the Apache 2.0 license header to these generated files in the script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you create these test data?
If you created them by using other Apache Arrow implementation such as C++ implementation, how about adding a program that generates them and generating them in test time instead of adding these test data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created them using python but can switch to C++ and build executable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the test data to the test as base64 encoded strings. We can remove this data and generate it once the FileWriter has been implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the test data to the test as base64 encoded strings.
It'll work for now but it will not work (too large data) when we have more supported patterns.
How about generate_test_data.py
or something and generating test data in test time?
We can remove this data and generate it once the FileWriter has been implemented?
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to install python into the docker image to build the test data or will it run before the image is built and be pulled into the image
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use the former approach?
If we use the former approach, we don't need PyArrow available environment on host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, will do.
var localIndex = index | ||
var arrayIndex = 0; | ||
var len: UInt = arrays[arrayIndex].length | ||
while(localIndex > len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a space missing here?
while(localIndex > len) { | |
while (localIndex > len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I will remove the parenthesis.
let FILEMARKER = "ARROW1" | ||
let CONTINUATIONMARKER = -1 | ||
|
||
public class ArrowReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have "streaming format" https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format and "file format" https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format .
It seems that this is a reader for "file format".
We may want to add "file" to this class name such as ArrowFileReader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class contains a method for fromFile and fromStream. AFAIK the file format wraps the stream format with some extra data, so the fromFile method basically just unwraps the Stream formatted message and sends that to the fromStream method. Does that sound correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the file format wraps the stream format with some extra data
Correct.
But the file format can random read:
https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
This enables random access to any record batch in the file.
The streaming format needs to read from start to end. Because delta dictionary replacement is supported in the streaming format.
See also: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
In the file format, there is no requirement that dictionary keys should be defined in a DictionaryBatch before they are used in a RecordBatch, as long as the keys are defined somewhere in the file. Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported). Delta dictionaries are applied in the order they appear in the file footer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class contains a method for fromFile and fromStream.
I see. I misread the implementation. Sorry. I think that it's OK with the current implementation as the first step. We can improve it in follow-up pull requests.
99f657a
to
c65eb5a
Compare
c65eb5a
to
7fbb4d8
Compare
7fbb4d8
to
8e2fcf1
Compare
0040eed
to
841978a
Compare
@kou I think I made all the requested updates. Please review again and approve if all looks good. |
swift/Arrow/swift-datagen
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you generate this binary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generated with golang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
Could you add source code instead of built binary so that anyone can update the program?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually didn't keep the code :). This is only temporary until we get the writer working, so maybe this is ok until then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.
Does it mean that your next pull request implements double/bool writer? And the following pull requests implement reader and writer at once instead of implementing reader with test data then implementing writer and remove the test data like this, right?
If so, we can embed the two test data into our test in base64 as you did because they are temporary, we don't update them and we don't add other test data.
If we still keep implementing reader with test data then implementing writer and remove the test data style, I think that we should add source code of test data generator program so that anyone can join the Swift implementation development.
See also: The Apache Way: Open: https://theapacheway.com/open/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, good idea, I will add test data generator source code under swift/data-generator/swift-datagen and add a section to the README on how to build it.
I do plan to add the writer for double/bools in the next PR but having the golang test data will be a good way to test the reader impl.
flatc --swift ../../../../format/SparseTensor.fbs | ||
flatc --swift ../../../../format/Tensor.fbs | ||
flatc --swift ../../../../format/File.fbs | ||
popd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you prepend our license header to the generated files?
popd | |
cat <<HEADER > header.swift | |
// Licensed to the Apache Software Foundation (ASF) under one | |
// or more contributor license agreements. See the NOTICE file | |
// distributed with this work for additional information | |
// regarding copyright ownership. The ASF licenses this file | |
// to you under the Apache License, Version 2.0 (the | |
// "License"); you may not use this file except in compliance | |
// with the License. You may obtain a copy of the License at | |
// | |
// http://www.apache.org/licenses/LICENSE-2.0 | |
// | |
// Unless required by applicable law or agreed to in writing, | |
// software distributed under the License is distributed on an | |
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
// KIND, either express or implied. See the License for the | |
// specific language governing permissions and limitations | |
// under the License. | |
HEADER | |
for generated_swift in *_generated.swift; do | |
mv ${generated_swift} ${generated_swift}.orig | |
cat header.swift ${generated_swift}.orig > ${generated_swift} | |
rm ${generated_swift}.orig | |
done | |
rm header.swift | |
popd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
# KIND, either express or implied. See the License for the | ||
# specific language governing permissions and limitations | ||
# under the License. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: We can check "failed to execute a command" and "referred a nonexistent variable" automatically by set -eu
(-e
is for "failed to execute a command" and -u
is for "referred a nonexistent variable"):
set -eu | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update.
Hi @kou, I hope all is well. I believe I have all the changes in. The failed test doesn't seem related to this change. Please review when you get a chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I think that we can merge this after we remove swift-datagen
and generate it on CI.
## Test data generation | ||
|
||
Test data files for the reader tests are generated by an executable built in go whose source is included in the data-generator directory. | ||
```sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```sh | |
```console |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update.
swift/Arrow/swift-datagen
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove this from this repository and build it on CI time instead?
I don't want to add auto generated files as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if we will need to add golang to the swift docker image for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I think so.
I think that we can use go run swift-datagen
instead of ./swift-datagen
in ci/scripts/swift_test.sh
.
See also: https://pkg.go.dev/cmd/go
If it's difficult for you, I can help it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I have updated the swift image to install golang and updated the script to build swift-datagen.
2f660fb
to
f3b8ba9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you apply this to accept no-license header for this file?
diff --git a/dev/release/rat_exclude_files.txt b/dev/release/rat_exclude_files.txt
index d37790912..f61c21776 100644
--- a/dev/release/rat_exclude_files.txt
+++ b/dev/release/rat_exclude_files.txt
@@ -149,3 +149,4 @@ r/tools/nixlibs-allowlist.txt
.gitattributes
ruby/red-arrow/.yardopts
.github/pull_request_template.md
+swift/data-generator/swift-datagen/go.sum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
f3b8ba9
to
4443b85
Compare
swift/Arrow/swift-datagen
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you remove this file?
Then we can merge this pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will do.
4443b85
to
d8d1592
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thanks!
Benchmark runs are scheduled for baseline = 16dbd98 and contender = 0ea1a10. 0ea1a10 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
Thank you @kou! |
- Initial check in for the swift arrow reader impl - bug fixes found during reader testing - class/method access modifier changes (mostly from internal to public) * Closes: apache#34858 Authored-by: Alva Bandy <abandy@live.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
- Initial check in for the swift arrow reader impl - bug fixes found during reader testing - class/method access modifier changes (mostly from internal to public) * Closes: apache#34858 Authored-by: Alva Bandy <abandy@live.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
- Initial check in for the swift arrow reader impl - bug fixes found during reader testing - class/method access modifier changes (mostly from internal to public) * Closes: apache#34858 Authored-by: Alva Bandy <abandy@live.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>