Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reworking Excel support to allow for reading of big files #8403

Merged
merged 77 commits into from
Dec 15, 2023

Conversation

radeusgd
Copy link
Member

@radeusgd radeusgd commented Nov 27, 2023

Pull Request Description

  • Closes Reading a large Excel file resulted in an error about number of records #8111 by making sure that all Excel workbooks are read using a backing file (which should be more memory efficient).
    • If the workbook is being opened from an input stream, that stream is materialized to a Temporary_File.
  • Adds tests fetching Table formats from HTTP.
    • Extends simple-httpbin with ability to serve files for our tests.
  • Ensures that the Infer option on Excel format also works with streams, if content-type metadata is available (e.g. from HTTP headers).
  • Implements a Temporary_File facility that can be used to create a temporary file that is deleted once all references to the Temporary_File instance are GCed.

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@radeusgd radeusgd force-pushed the wip/radeusgd/8111-big-excel-fix branch 2 times, most recently from 89ff416 to 2289c62 Compare November 28, 2023 16:48
@@ -2717,11 +2717,12 @@ val allStdBits: Parser[String] =
lazy val `simple-httpbin` = project
.in(file("tools") / "simple-httpbin")
.settings(
frgaalJavaCompilerSetting,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The frgaal compiler settings have been causing compile error:
cannot find symbol import com.sun.net.httpserver.SimpleFileServer: cannot find symbol.

Apparently SimpleFileServer may not be in the JDK that Frgaal uses? Weird.

Anyway there is no benefit in Frgaal within this sub-project, especially as we are already on JDK21. Removing it allowed me to re-use the SimpleFileServer for serving files for testing, instead of having to roll my own solution or having to bring in some heavy dependencies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frgaal by default only exposes only classes that are in API of the JDK. SimpleFileServer isn't.

no benefit in Frgaal within this sub-project

There is a benefit. It already prevented you from using SimpleFileServer!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frgaal by default only exposes only classes that are in API of the JDK. SimpleFileServer isn't.

no benefit in Frgaal within this sub-project

There is a benefit. It already prevented you from using SimpleFileServer!

I really honestly don't understand. HOW is that a benefit?

Without Frgaal the server is running fine on our JDK. What is the benefit of using it this translation layer at this point?

We introduced it to be able to use newer features before we migrated to the newer JDKs. But now we have these features natively, so what else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

server is running fine on our JDK.

You have compiled your code in a way that allows running it on JDK 11. However your code cannot run on JDK 11. Without Frgaal you wouldn't even notice that your project is misconfigured as it requires JDK18+ to run.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

server is running fine on our JDK.

You have compiled your code in a way that allows running it on JDK 11. However your code cannot run on JDK 11. Without Frgaal you wouldn't even notice that your project is misconfigured as it requires JDK18+ to run.

Hm, I see. It was hard to appreciate that, because from my perspective, removing Frgaal made it work - I never really specifically wanted to compile simple-httpbin to run on JDK 11 - just the current JDK we use - especially since it's not even an artifact we ship but just an internal test scaffolding.

… memory - allowing to handle big files more efficiently

WIP: ensuring the resources are closed when we finish using a workbook
…lexibility in how it is performed

TODO: update Enso part
…nsuring it works even if the files were held open
@radeusgd
Copy link
Member Author

radeusgd commented Dec 7, 2023

Can I please ask someone from engine to approve the build.sbt changes? cc: @Akirathan @hubertp @JaroslavTulach

Copy link
Member

@JaroslavTulach JaroslavTulach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't blame frgaal. Compile against appropriate -target JDK APIs.

@@ -2717,11 +2717,12 @@ val allStdBits: Parser[String] =
lazy val `simple-httpbin` = project
.in(file("tools") / "simple-httpbin")
.settings(
frgaalJavaCompilerSetting,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frgaal by default only exposes only classes that are in API of the JDK. SimpleFileServer isn't.

no benefit in Frgaal within this sub-project

There is a benefit. It already prevented you from using SimpleFileServer!

Copy link
Member

@JaroslavTulach JaroslavTulach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default we should use -target 11 - to support for example the #8094 effort.

@@ -1330,13 +1330,16 @@ lazy val truffleDslSuppressWarnsSetting = Seq(
)

/** A setting to replace javac with Frgaal compiler, allowing to use latest Java features in the code
* and still compile down to JDK 11
* and still compile down to JDK 17
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we change the -target level to 17?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any sophisticated reason for that. I just thought that it was a good idea to bump that to -target 17. Furthermore, I vaguely remember that we talked about this as well. But I think we can fall back to 11.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only updated the documentation comment to be aligned with what is actually being done.


import java.util.Random;

public class RandomHelpers {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use the Faker here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but I was annoyed with how slow it was:

from Standard.Base import all

from Standard.Test import Faker
polyglot java import org.enso.table_test_helpers.RandomHelpers

main =
    n = 10^6

    faker = Faker.new 123
    r1 = Duration.time_execution <| Vector.new n _-> faker.alpha_numeric 190
    IO.println <| "        Faker: " + r1.first.to_display_text

    rng = RandomHelpers.new 123
    r2 = Duration.time_execution <| Vector.new n _-> rng.makeRandomString 190
    IO.println <| "RandomHelpers: " + r2.first.to_display_text
        Faker: 15.3227068s
RandomHelpers: 01.7083387s

Almost 10x slower. I'll merge as-is but happy to replace back with Faker, as the duplication is not ideal. Maybe we should try to improve the Faker's performance? I think the current way has a lot of overhead.

@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Dec 14, 2023
@radeusgd radeusgd self-assigned this Dec 14, 2023
@mergify mergify bot merged commit b5c995a into develop Dec 15, 2023
31 checks passed
@mergify mergify bot deleted the wip/radeusgd/8111-big-excel-fix branch December 15, 2023 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reading a large Excel file resulted in an error about number of records
5 participants