GH-34858: [Swift] Initial reader impl #34842

Merged: 1 commit, Apr 30, 2023

Contributor

@abandy abandy commented Apr 2, 2023

  • Initial check-in of the Swift Arrow reader implementation
  • Bug fixes found during reader testing
  • Class/method access modifier changes (mostly from internal to public)

github-actions bot commented Apr 2, 2023

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
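
The three title templates above can be checked mechanically. The following is a minimal sketch, not the check that the Apache Arrow bot actually runs: the helper name `is_valid_pr_title` and the regular expression are assumptions derived only from the templates quoted in this comment.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: succeeds when the title matches one of the templates
#   GH-<id>: [<component>] <summary>
#   MINOR: [<component>] <summary>
#   PARQUET-<id>: [<component>] <summary>
is_valid_pr_title() {
  printf '%s\n' "$1" |
    grep -Eq '^(GH-[0-9]+|MINOR|PARQUET-[0-9]+): \[[^]]+\] .+$'
}

# The original and the final title of this pull request.
for title in \
  "[Swift]Initial arrow reader impl" \
  "GH-34858: [Swift] Initial reader impl"; do
  if is_valid_pr_title "$title"; then
    echo "ok:  $title"
  else
    echo "bad: $title"
  fi
done
```

The original title is rejected because it lacks the `GH-<id>: ` prefix and the space after the component.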

github-actions bot added the awaiting review label Apr 2, 2023
abandy changed the title from "[Swift]Initial arrow reader impl" to "GH-34858: [SWIFT] Initial arrow reader impl" Apr 3, 2023
github-actions bot commented Apr 3, 2023

⚠️ GitHub issue #34858 has been automatically assigned in GitHub to PR creator.

kou changed the title from "GH-34858: [SWIFT] Initial arrow reader impl" to "GH-34858: [Swift] Initial reader impl" Apr 3, 2023
Member

Should we add these generated files to this repository?
Can we generate them at build time instead of adding them?

Contributor Author

I guess we could generate them during build/test time.

Contributor Author

I misread the intent of this change. I believe these generated files need to be included in the repo. If someone wants to use this library in their code, they would reference this package in their Package.swift file and would need these generated files to be part of the code base.

Member

OK. Then could you add a script to generate these files so that we can regenerate them when we update the FlatBuffers compiler?
And could you also have the script add the Apache 2.0 license header to these generated files?

Contributor Author

Will do.

Member

How did you create this test data?
If you created it with another Apache Arrow implementation such as the C++ implementation, how about adding a program that generates it and generating it at test time instead of adding the test data itself?

Contributor Author

I created them using Python but can switch to C++ and build an executable.

Contributor Author

@abandy abandy Apr 5, 2023

Added the test data to the test as base64-encoded strings. Can we remove this data and generate it once the FileWriter has been implemented?
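
Embedding binary test data as base64 text, as described above, comes down to an encode/decode round trip. A minimal sketch, assuming a `base64` utility that accepts `--decode` (GNU coreutils and the macOS version both do); the helper name `decode_fixture` and the sample bytes are illustrative and are not the PR's actual test data.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: decode a base64 string ($1) into the file named by $2.
decode_fixture() {
  printf '%s' "$1" | base64 --decode > "$2"
}

workdir=$(mktemp -d)

# Stand-in for a small binary fixture (contains NUL bytes on purpose).
printf 'ARROW1\0\0sample' > "$workdir/original.bin"

# Encode once; the resulting text could be pasted into a test source file.
# tr removes the line wrapping that some base64 implementations add.
encoded=$(base64 < "$workdir/original.bin" | tr -d '\n')

decode_fixture "$encoded" "$workdir/decoded.bin"

# cmp -s is silent and succeeds only when both files are byte-identical.
if cmp -s "$workdir/original.bin" "$workdir/decoded.bin"; then
  echo "round trip ok"
fi
rm -rf "$workdir"
```

Base64 text is about a third larger than the bytes it encodes, which is one reason this approach stops scaling as the number of fixtures grows.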

Member

> Added the test data to the test as base64 encoded strings.

It'll work for now, but it won't scale (the data becomes too large) once we support more patterns.
How about adding generate_test_data.py or something similar and generating the test data at test time?

> We can remove this data and generate it once the FileWriter has been implemented?

Yes.

Contributor Author

Will do.

Contributor Author

Do we need to install Python into the Docker image to build the test data, or will it run before the image is built and be pulled into the image?

Member

Could you use the former approach?
With the former approach, we don't need a PyArrow environment on the host.

Contributor Author

Gotcha, will do.

var localIndex = index
var arrayIndex = 0;
var len: UInt = arrays[arrayIndex].length
while(localIndex > len) {
Member

Is a space missing here?

Suggested change
-while(localIndex > len) {
+while (localIndex > len) {

Contributor Author

Good catch! I will remove the parentheses.
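
The loop under review maps a global index onto one of several arrays. For reference, here is a minimal sketch of that kind of lookup, assuming zero-based indices. It is written in shell for illustration and is not the Swift code from this PR; the function name `locate_index` is hypothetical. With zero-based indices, an index equal to a chunk's length already belongs to the next chunk, hence the strict `-lt` comparison.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: given a zero-based global index followed by the
# length of each chunk, print "<chunk> <offset within chunk>".
# Returns non-zero when the index is past the end of the last chunk.
locate_index() {
  index=$1
  shift
  chunk=0
  for len in "$@"; do
    if [ "$index" -lt "$len" ]; then
      echo "$chunk $index"
      return 0
    fi
    # Skip this chunk: make the index relative to the next one.
    index=$((index - len))
    chunk=$((chunk + 1))
  done
  return 1
}

# Two chunks of length 3 and 4.
locate_index 2 3 4   # last element of the first chunk
locate_index 3 3 4   # first element of the second chunk
```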

let FILEMARKER = "ARROW1"
let CONTINUATIONMARKER = -1

public class ArrowReader {
Member

We have the "streaming format" (https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format) and the "file format" (https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format).
It seems that this is a reader for the "file format".
We may want to add "file" to this class name, for example ArrowFileReader.

Contributor Author

This class contains a method for fromFile and fromStream. AFAIK the file format wraps the stream format with some extra data, so the fromFile method basically just unwraps the Stream formatted message and sends that to the fromStream method. Does that sound correct?

Member

> the file format wraps the stream format with some extra data

Correct.

But the file format also supports random access:

https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

> This enables random access to any record batch in the file.

The streaming format has to be read from start to end, because it supports delta dictionaries and dictionary replacement.

See also: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

> In the file format, there is no requirement that dictionary keys should be defined in a DictionaryBatch before they are used in a RecordBatch, as long as the keys are defined somewhere in the file. Further more, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported). Delta dictionaries are applied in the order they appear in the file footer.
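
According to the format documentation linked above, an IPC file begins and ends with the magic string ARROW1 (the same value as the FILEMARKER constant in the diff), with the streaming-format payload and a footer in between. A quick way to tell the two formats apart is therefore to look at the first and last six bytes. A minimal sketch: `is_arrow_file` is a hypothetical helper, not part of this PR, and the demo file is a fake that carries only the two markers.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: succeeds when the file both starts and ends with the
# six-byte magic string "ARROW1". tr drops NUL bytes so that arbitrary binary
# content can pass through command substitution safely.
is_arrow_file() {
  head_magic=$(head -c 6 "$1" | tr -d '\0')
  tail_magic=$(tail -c 6 "$1" | tr -d '\0')
  [ "$head_magic" = "ARROW1" ] && [ "$tail_magic" = "ARROW1" ]
}

# Fake file for the demo: only the two markers are real, the middle is filler.
demo=$(mktemp)
printf 'ARROW1\0\0filler-instead-of-real-messagesARROW1' > "$demo"

if is_arrow_file "$demo"; then
  echo "looks like the IPC file format"
else
  echo "not the IPC file format (possibly a stream)"
fi
rm -f "$demo"
```

This only inspects the markers; it does not validate the footer or any message in between.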

Member

> This class contains a method for fromFile and fromStream.

I see. I misread the implementation. Sorry. I think the current implementation is OK as a first step. We can improve it in follow-up pull requests.

github-actions bot added the awaiting committer review label and removed the awaiting review label Apr 4, 2023
abandy force-pushed the swift-reader-impl branch 2 times, most recently from 99f657a to c65eb5a April 5, 2023 02:23
github-actions bot added the awaiting changes label and removed the awaiting committer review label Apr 5, 2023
github-actions bot added the awaiting change review and awaiting changes labels and removed the awaiting changes and awaiting change review labels Apr 8, 2023
github-actions bot added the awaiting change review label and removed the awaiting changes label Apr 8, 2023
abandy force-pushed the swift-reader-impl branch 3 times, most recently from 0040eed to 841978a April 12, 2023 12:47
Contributor Author

abandy commented Apr 18, 2023

@kou I think I made all the requested updates. Please review again and approve if all looks good.

Member

How did you generate this binary?

Contributor Author

@abandy abandy Apr 18, 2023

Generated with golang

Member

I see.
Could you add the source code instead of the built binary so that anyone can update the program?

Contributor Author

I actually didn't keep the code :). This is only temporary until we get the writer working, so maybe this is ok until then?

Member

Hmm.
Does that mean your next pull request will implement the double/bool writer, and that later pull requests will implement the reader and writer together, instead of implementing the reader with test data first and then implementing the writer and removing the test data, as this one does?
If so, we can embed the two test data files in our tests as base64, as you did, because they are temporary: we won't update them and we won't add other test data.

If we keep the style of implementing the reader with test data first and then implementing the writer and removing the test data, I think that we should add the source code of the test data generator so that anyone can join development of the Swift implementation.
See also: The Apache Way: Open: https://theapacheway.com/open/

Contributor Author

Gotcha, good idea, I will add test data generator source code under swift/data-generator/swift-datagen and add a section to the README on how to build it.

I do plan to add the writer for double/bools in the next PR but having the golang test data will be a good way to test the reader impl.

flatc --swift ../../../../format/SparseTensor.fbs
flatc --swift ../../../../format/Tensor.fbs
flatc --swift ../../../../format/File.fbs
popd
Member

Could you prepend our license header to the generated files?

Suggested change (replaces the popd line with the following):
cat <<HEADER > header.swift
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
HEADER
for generated_swift in *_generated.swift; do
  mv ${generated_swift} ${generated_swift}.orig
  cat header.swift ${generated_swift}.orig > ${generated_swift}
  rm ${generated_swift}.orig
done
rm header.swift
popd

Contributor Author

Will do.

# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

Member

FYI: set -eu makes the script stop automatically when a command fails (-e) or when it refers to an unset variable (-u):

Suggested change
set -eu
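
To see the effect of the two options, the snippets below are run in child shells so that a failure does not stop the demonstration itself. This is a minimal sketch; `strict_status` and `DEMO_VAR` are hypothetical names used only for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: run a snippet under `set -eu` in a child shell and
# print the child's exit status (0 means the snippet ran to completion).
strict_status() {
  status=0
  sh -c "set -eu
$1" >/dev/null 2>&1 || status=$?
  echo "$status"
}

echo "clean snippet:   $(strict_status 'echo ok')"
echo "failing command: $(strict_status 'false; echo unreachable')"
echo "unset variable:  $(strict_status 'unset DEMO_VAR; echo "$DEMO_VAR"; echo unreachable')"
```

The first line reports 0 and the other two report a non-zero status; the exact non-zero value for the unset variable case depends on the shell.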

Contributor Author

Will update.

github-actions bot removed the awaiting change review label Apr 18, 2023
Contributor Author

abandy commented Apr 27, 2023

Hi @kou, I hope all is well. I believe I have all the changes in. The failed test doesn't seem related to this change. Please review when you get a chance.

Member

@kou kou left a comment

Thanks!

I think that we can merge this after we remove swift-datagen and generate it on CI.

    ## Test data generation

    Test data files for the reader tests are generated by an executable built in go whose source is included in the data-generator directory.
    ```sh
Member

Suggested change
-```sh
+```console

Contributor Author

Will update.

Member

Could you remove this from this repository and build it at CI time instead?
I'd like to avoid adding auto-generated files as much as possible.

Contributor Author

Do you know if we will need to add golang to the swift docker image for this?

Member

Yes. I think so.

I think that we can use go run swift-datagen instead of ./swift-datagen in ci/scripts/swift_test.sh.

See also: https://pkg.go.dev/cmd/go

If it's difficult for you, I can help with it.

Contributor Author

Cool, I have updated the swift image to install golang and updated the script to build swift-datagen.

github-actions bot added the awaiting changes label and removed the awaiting change review label Apr 28, 2023
github-actions bot added the awaiting change review label and removed the awaiting changes label Apr 29, 2023
Member

Could you apply this so that this file is accepted without a license header?

diff --git a/dev/release/rat_exclude_files.txt b/dev/release/rat_exclude_files.txt
index d37790912..f61c21776 100644
--- a/dev/release/rat_exclude_files.txt
+++ b/dev/release/rat_exclude_files.txt
@@ -149,3 +149,4 @@ r/tools/nixlibs-allowlist.txt
 .gitattributes
 ruby/red-arrow/.yardopts
 .github/pull_request_template.md
+swift/data-generator/swift-datagen/go.sum

Contributor Author

Will do.

github-actions bot added the awaiting changes label and removed the awaiting change review label Apr 29, 2023
github-actions bot added the awaiting change review label and removed the awaiting changes label Apr 29, 2023
Member

Could you remove this file?
Then we can merge this pull request.

Contributor Author

Good catch, will do.

github-actions bot added the awaiting changes label and removed the awaiting change review label Apr 29, 2023
github-actions bot added the awaiting change review label and removed the awaiting changes label Apr 29, 2023
Member

@kou kou left a comment

+1

Thanks!

@kou kou merged commit 0ea1a10 into apache:main Apr 30, 2023
45 of 48 checks passed
github-actions bot added the awaiting merge label and removed the awaiting change review label Apr 30, 2023

ursabot commented Apr 30, 2023

Benchmark runs are scheduled for baseline = 16dbd98 and contender = 0ea1a10. 0ea1a10 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.31% ⬆️0.47%] test-mac-arm
[Failed ⬇️12.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.99% ⬆️0.36%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0ea1a103 ec2-t3-xlarge-us-east-2
[Failed] 0ea1a103 test-mac-arm
[Failed] 0ea1a103 ursa-i9-9960x
[Finished] 0ea1a103 ursa-thinkcentre-m75q
[Finished] 16dbd98e ec2-t3-xlarge-us-east-2
[Failed] 16dbd98e test-mac-arm
[Failed] 16dbd98e ursa-i9-9960x
[Finished] 16dbd98e ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented Apr 30, 2023

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm
ursa-i9-9960x

Contributor Author

abandy commented May 1, 2023

Thank you @kou!

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
- Initial check in for the swift arrow reader impl
- bug fixes found during reader testing
- class/method access modifier changes (mostly from internal to public)

* Closes: apache#34858

Authored-by: Alva Bandy <abandy@live.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
Successfully merging this pull request may close these issues.

[Swift] Add reader support
3 participants