Arrow language #8512

Merged
merged 35 commits into from
Jan 12, 2024

Contributor

@hubertp hubertp commented Dec 11, 2023

Pull Request Description

Initial implementation of the Arrow language. Closes #7755.
Currently supported logical types are

  • Date (days and milliseconds)
  • Int (8, 16, 32, 64)

One can currently

  • allocate a new fixed-length, nullable Arrow vector - new[<name-of-the-type>]
  • cast an already existing fixed-length Arrow vector from a memory address - cast[<name-of-the-type>]
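
To make the two operations above more concrete, here is a minimal sketch of what a fixed-length, nullable vector could look like in memory, assuming the usual Arrow layout: a little-endian data buffer plus a validity bitmap with one bit per element, least-significant bit first. The class `NullableInt32Vector` and its methods are hypothetical names used only for illustration; they are not the API added by this PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical illustration only; not the class introduced by this PR. */
final class NullableInt32Vector {
  private final ByteBuffer data;     // 4 bytes per element, little-endian
  private final ByteBuffer validity; // 1 bit per element, least-significant bit first
  private final int length;

  NullableInt32Vector(int length) {
    this.length = length;
    // Direct buffers are zero-initialized, so every element starts out as null.
    this.data = ByteBuffer.allocateDirect(length * Integer.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    this.validity = ByteBuffer.allocateDirect((length + 7) / 8);
  }

  /** Stores a value and marks the slot as non-null in the validity bitmap. */
  void set(int index, int value) {
    check(index);
    data.putInt(index * Integer.BYTES, value);
    int byteIndex = index >> 3;
    validity.put(byteIndex, (byte) (validity.get(byteIndex) | (1 << (index & 7))));
  }

  boolean isNull(int index) {
    check(index);
    return (validity.get(index >> 3) & (1 << (index & 7))) == 0;
  }

  /** Returns the stored value, or null when the validity bit is unset. */
  Integer get(int index) {
    return isNull(index) ? null : data.getInt(index * Integer.BYTES);
  }

  private void check(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index " + index);
    }
  }

  public static void main(String[] args) {
    var vector = new NullableInt32Vector(10);
    vector.set(0, 42);
    System.out.println(vector.get(0)); // 42
    System.out.println(vector.get(1)); // null
  }
}
```

Casting an existing vector, as opposed to allocating a new one, would wrap memory that another producer already filled in instead of calling `allocateDirect`; that part is left out here because it depends on how the address is obtained.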

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.

Member

@JaroslavTulach JaroslavTulach left a comment

I think this is a correct start.

I assume that in the future we want to support "structures" of array elements. E.g. not just an ArrayFixedArrayData32 or 64-bit, but:

ArrayFixedArray of int32, Date32, float, etc. struct. That will probably lead towards abstracting/defining the structure of single array element rather than having special types of array. Possibly.

@JaroslavTulach
Member

JaroslavTulach commented Dec 13, 2023

A recommendation for study: https://www.graalvm.org/truffle/javadoc/com/oracle/truffle/api/staticobject/package-summary.html - that's the official Truffle way to represent immutable (e.g. static) objects.

@hubertp hubertp added the CI: Clean build required CI runners will be cleaned before and after this PR is built. label Dec 13, 2023
@hubertp hubertp changed the title WIP: First take at implementing Arrow Arrow language Dec 15, 2023
@hubertp hubertp marked this pull request as ready for review December 15, 2023 21:37
id = ArrowLanguage.ID,
name = "Truffle implementation of Arrow",
characterMimeTypes = {ArrowLanguage.MIME},
defaultMimeType = ArrowLanguage.MIME,
Member

You have to explicitly specify internal = true. By default internal = false.

Member

The ArrowLanguage shall be an experimental feature accessible via the foreign arrow xyz = """ syntax. I am afraid the language needs to be non-internal to be exposed in the set of foreign languages (as of #7882).

In any case, it'd be good if ArrowLanguage wasn't accessible by default (throw a parsing error for example). Possibly shield it with

var canArrow = false;
assert canArrow = true;
if (!canArrow) throw ...
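
The snippet above relies on a side effect inside an `assert` statement, which only runs when the JVM is started with `-ea`. A self-contained sketch of that idiom follows; the class name `ExperimentalGuard`, the method names, and the choice of `IllegalStateException` are assumptions made for illustration, since the comment leaves the thrown error unspecified.

```java
/** Hypothetical guard; names and exception type are illustrative, not taken from the PR. */
final class ExperimentalGuard {
  /** Returns true only when the JVM runs with assertions enabled (-ea). */
  static boolean assertionsEnabled() {
    boolean enabled = false;
    // The assignment below is evaluated only when assertions are on.
    // Its value (true) is also the asserted condition, so the assert itself never fails.
    assert enabled = true;
    return enabled;
  }

  /** Rejects use of an experimental feature unless assertions are enabled. */
  static void requireExperimental(String feature) {
    if (!assertionsEnabled()) {
      throw new IllegalStateException(feature + " is experimental; run with -ea to enable it");
    }
  }

  public static void main(String[] args) {
    System.out.println("assertions enabled: " + assertionsEnabled());
  }
}
```

Running `java ExperimentalGuard` and `java -ea ExperimentalGuard` shows the two states. The trade-off is that the feature becomes tied to a debugging switch, which is why the comment frames it as a temporary shield.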

A simple example with continuous fixed-size array storing dates is
added. More to follow.
Tests were previously wrong and specialization had a typo, which was
misleading.
In order to be truly comparable (not only up to milliseconds), we had to
create the instant for ZonedDateTime from seconds and nanoseconds.
Also simplified Date implementation to share more code.
Added an example where an IntVector is created in Java, we get a memory
pointer to the buffer, and cast it in our Arrow language. The main trick
was to ensure that the buffer created at the specific address has
`LITTLE_ENDIAN` order.
The test also revealed that we didn't type-adjust the index when
writing/reading to our fixed-size int array.
It makes little sense to include the full dependency just to allow for
memory-mapped byte buffer which we can construct by hand.
Rather than copying over non-null bitmaps (aka `validityBuffer`) we
memory-map it, similarly to the data buffer.
The code can be further simplified if we lock the type of elements
stored in the ByteBuffer to a single one.
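
Two pitfalls mentioned in the commit notes above, byte order and index scaling, can be reproduced with plain `java.nio`. The sketch below is an illustration under assumptions: a heap `ByteBuffer` stands in for memory at a raw address so that the example stays self-contained, and `LittleEndianIndexing` with its helper methods are hypothetical names, not code from the PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper illustrating byte order and index scaling; not code from the PR. */
final class LittleEndianIndexing {
  /** Reads a 32-bit element by scaling the logical index to a byte offset. */
  static int readInt32(ByteBuffer buffer, int index) {
    return buffer.getInt(index * Integer.BYTES);
  }

  /** Reads a 64-bit element by scaling the logical index to a byte offset. */
  static long readInt64(ByteBuffer buffer, int index) {
    return buffer.getLong(index * Long.BYTES);
  }

  public static void main(String[] args) {
    // The int32 values 1 and 2 as a little-endian producer would lay them out.
    byte[] raw = {1, 0, 0, 0, 2, 0, 0, 0};

    ByteBuffer littleEndian = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    ByteBuffer bigEndian = ByteBuffer.wrap(raw); // ByteBuffer defaults to BIG_ENDIAN

    System.out.println(readInt32(littleEndian, 1)); // 2
    System.out.println(readInt32(bigEndian, 1));    // 33554432, same bytes read in the wrong order

    // Without scaling, index 1 is treated as byte offset 1 and straddles two elements.
    System.out.println(littleEndian.getInt(1));
  }
}
```

Locking the buffer to a single element type, as the last note suggests, would let the scaling factor be fixed once instead of being chosen per access.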
Member

@JaroslavTulach JaroslavTulach left a comment

I would be happy if there was a separate arrow-language.jar with own module-info.class. That way the Arrow language was clearly separated from the Enso runtime.

I'd hide the Arrow language behind a flag - otherwise it is hard to find out it is experimental. Possibly behind -ea for now.

build.sbt Outdated
},
Test / addModules := Seq("org.enso.interpreter.arrow"),
Test / javaOptions ++= testLogProviderOptions ++ Seq(
"--add-opens=java.base/java.nio=org.enso.interpreter.arrow",
Member

I am not a fan of --add-opens, but having that in tests only is probably fine.

Member

One can probably use --add-opens=org.enso.interpreter.arrow/org.enso.interpreter.arrow=ALL-UNNAMED - or something like that.

assertNotNull(int32Constr);
Value int32Array =
int32Constr.newInstance(
vector.getDataBufferAddress(),
Member

Address and capacity are OK for now. It would be handy (for interaction with Python) to also support InteropLibrary.isPointer.

@hubertp
Contributor Author

hubertp commented Jan 11, 2024

I would be happy if there was a separate arrow-language.jar with own module-info.class. That way the Arrow language was clearly separated from the Enso runtime.

That's possible already now.

sbt:enso> runtime-language-arrow/package
> jar tf engine/runtime-language-arrow/target/runtime-language-arrow-0.0.0-dev.jar 
...
META-INF/services/com.oracle.truffle.api.provider.TruffleLanguageProvider
module-info.class
...

@JaroslavTulach
Member

I would be happy if there was a separate arrow-language.jar with own module-info.class. That way the Arrow language was clearly separated from the Enso runtime.

That's possible already now.

sbt:enso> runtime-language-arrow/package
> jar tf engine/runtime-language-arrow/target/runtime-language-arrow-0.0.0-dev.jar 
...
META-INF/services/com.oracle.truffle.api.provider.TruffleLanguageProvider
module-info.class
...

The problem is not that it is possible, but that it is impossible to distribute runtime.jar without Arrow!

@hubertp
Contributor Author

hubertp commented Jan 12, 2024

The problem is not that it is possible, but that it is impossible to distribute runtime.jar without Arrow!

I don't think that is the case anymore. Nothing depends on arrow project.

@hubertp
Contributor Author

hubertp commented Jan 12, 2024

I think this first step towards supporting the Arrow language is sufficient. There is obviously more work involved, but the longer I delay the merge, the harder it becomes to keep it in sync with develop.

@hubertp hubertp added CI: Ready to merge This PR is eligible for automatic merge CI: Keep up to date Automatically update this PR to the latest develop. labels Jan 12, 2024
@hubertp hubertp removed the CI: Keep up to date Automatically update this PR to the latest develop. label Jan 12, 2024
Member

@Akirathan Akirathan left a comment

Excuse the delay in my review. Apart from some minor suggestions, looks good.

private BaseFixedWidthVector allocateFixedLengthVector(
BufferAllocator allocator, Object[] testValues, LogicalLayout unit) {
var valueCount = 0;
switch (unit) {
Member

minor: refactor to return switch expression
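
A sketch of the suggested shape, returning directly from a switch expression (Java 14 or later), could look like the following. The enum `Layout` and the method `byteWidth` are hypothetical stand-ins; the constants of the real `LogicalLayout` type are not shown in this conversation, so the ones below are assumptions based on the supported types listed in the description.

```java
/** Hypothetical example of a switch expression; the enum constants are assumed, not from the PR. */
final class SwitchExpressionSketch {
  enum Layout { DATE_DAYS, DATE_MILLIS, INT8, INT16, INT32, INT64 }

  /** Returns the element width in bytes for a fixed-width layout. */
  static int byteWidth(Layout unit) {
    // The compiler checks that every enum constant is covered, so no default branch is needed.
    return switch (unit) {
      case INT8 -> 1;
      case INT16 -> 2;
      case INT32, DATE_DAYS -> 4;
      case INT64, DATE_MILLIS -> 8;
    };
  }

  public static void main(String[] args) {
    for (Layout unit : Layout.values()) {
      System.out.println(unit + " uses " + byteWidth(unit) + " byte(s) per element");
    }
  }
}
```

Compared with a statement switch that assigns to a local variable, this form removes the mutable local and makes a missing case a compile-time error.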

@Akirathan
Member

The problem is not that it is possible, but that it is impossible to distribute runtime.jar without Arrow!

@JaroslavTulach What do you mean by saying that it is impossible to distribute runtime.jar without Arrow? Currently, no classes from the runtime-language-arrow project get assembled into runtime.jar, nor into any other JAR archives that are on the module path or on the class path.

Or did you mean that you would like to distribute the arrow language, but would like to keep it separate from the org.enso.runtime module?

@hubertp hubertp removed the CI: Ready to merge This PR is eligible for automatic merge label Jan 12, 2024
@hubertp hubertp added the CI: Ready to merge This PR is eligible for automatic merge label Jan 12, 2024
@mergify mergify bot merged commit 3c29a58 into develop Jan 12, 2024
26 of 27 checks passed
@mergify mergify bot deleted the wip/hubert/7755-arrow branch January 12, 2024 18:19
@hubertp hubertp mentioned this pull request Feb 21, 2024
3 tasks
Labels
CI: Clean build required CI runners will be cleaned before and after this PR is built. CI: Ready to merge This PR is eligible for automatic merge
Development

Successfully merging this pull request may close these issues.

Arrow language to use industry standard format for columnar data
3 participants