
S3 Read Access, Input Stream based reading #7776

Merged: 32 commits into develop, Sep 20, 2023

Conversation


@jdunkerley jdunkerley commented Sep 8, 2023

Pull Request Description

  • Added a FileSystemSPI allowing protocol resolution to a target type.
  • Separated Input_Stream and Output_Stream from File to allow use in other spaces.
  • File_Format types' read_web changed to read_stream, working with an InputStream.
  • Added directory listing to Auto_Detect allowing for Data.read to list a folder.
  • Adjusted HTTP to return an InputStream not a byte[]:
    • Response_Body adjusted to wrap an InputStream.
    • Added the ability to materialize to either an in-memory vector (<4KB) or a temporary file.
    • Data.fetch will materialize if not a recognized mime-type.
    • Added HTTP_Error to handle IO exceptions from the stream.
  • Excel_Format now supports mime-type and reading a stream.
    • Excel_Workbook can now get an Excel_Section using read_section.
  • Added S3 APIs:
    • parse_uri: splits an S3 URI into bucket and key.
    • list_objects: lists the items in an S3 bucket with a specified prefix.
    • read_bucket: lists prefixes and keys with a delimiter in an S3 bucket with a specified prefix.
    • head: either the head_bucket (tests existence) or head_object API (reads object metadata).
    • get_object: gets an object from S3, returning it as a Response_Body.
  • Added an S3_File type acting like a File:
    • No support for writing in this PR.
    • ToDo: recursive listing, glob filtering, exists, size.
  • Fixed a few invalid type signature lines.
  • Moved create methods for Postgres_Connection and SQLite_Connection into type instead of module.
  • Renamed Column_Fetcher.Builder to Column_Fetcher_Builder.
  • Fixed bug with select_into in Dry Run mode creating permanent tables.

ToDo: Unit tests.
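The materialization behaviour listed above (an in-memory vector for small bodies, a temporary file otherwise) can be sketched in Java. This is a hypothetical illustration of the threshold logic only, not the PR's actual implementation; the class and method names and the exact spill strategy are assumptions, with just the <4KB cut-off taken from the description:

```java
import java.io.*;
import java.nio.file.*;

public class MaterializeSketch {
    static final int THRESHOLD = 4096; // bodies under ~4KB stay in memory, per the PR description

    /** Reads the stream fully; small payloads stay as a byte[], larger ones spill to a temp file. */
    static Object materialize(InputStream in) throws IOException {
        ByteArrayOutputStream head = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        // Read at most THRESHOLD bytes into memory first.
        while (head.size() < THRESHOLD
                && (n = in.read(buf, 0, Math.min(buf.length, THRESHOLD - head.size()))) != -1) {
            head.write(buf, 0, n);
        }
        int next = in.read();
        if (next == -1) return head.toByteArray(); // stream ended within the threshold: keep in memory
        // Larger than the threshold: spill what was read so far plus the rest to a temp file.
        Path tmp = Files.createTempFile("materialized", ".bin");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            head.writeTo(out);
            out.write(next);
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Object small = materialize(new ByteArrayInputStream(new byte[100]));
        Object large = materialize(new ByteArrayInputStream(new byte[10_000]));
        System.out.println(small instanceof byte[]); // true: small payload kept in memory
        System.out.println(large instanceof Path);   // true: large payload spilled to disk
        if (large instanceof Path) Files.deleteIfExists((Path) large);
    }
}
```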

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

Contributor

@hubertp hubertp left a comment


I've only skimmed through lib changes, engine part is 👍

distribution/lib/Standard/AWS/0.0.0-dev/src/S3/S3.enso (outdated, resolved)
Comment on lines 96 to 116
## Gets an object from an S3 bucket.

   Arguments:
   - bucket: the name of the bucket.
   - key: the key of the object.
   - credentials: AWS credentials. If not provided, the default credentials
     will be used.
get_object : Text -> Text -> AWS_Credential | Nothing -> Any ! S3_Error
get_object bucket key credentials:(AWS_Credential | Nothing)=Nothing = handle_s3_errors <|
    client = make_client credentials
    request = GetObjectRequest.builder.bucket bucket . key key . build
    Panic.catch NoSuchBucketException handler=(_->Error.throw (S3_Bucket_Not_Found.Error bucket)) <|
        Panic.catch NoSuchKeyException handler=(_->Error.throw (No_Such_Key.Error bucket key)) <|
            response = client.getObject request
            inner_response = response.response
            mime_type = inner_response.contentType
            s3_uri = URI.parse ("s3://" + bucket + "/" + key)
            input_stream = Input_Stream.new response (handle_io_errors s3_uri)
            Response_Body.Raw_Stream input_stream mime_type s3_uri
Member

Since this returns a Raw_Stream, I assume it is an internal method, right?

In such a case it should be marked as PRIVATE. Maybe the helper methods should be moved to a file separate from the public API? I think we could move in that direction; it would make everything a bit clearer.

Member Author

This is consistent with the low-level HTTP request method; I will mark it ADVANCED as it should be.
Once I merge with Greg's changes, I will make the two consistently hidden.

if uri.starts_with "s3://" . not then Nothing else
    no_prefix = uri.drop 5
    index_of = no_prefix.index_of "/"
    if index_of == 0 then Nothing else
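For illustration, the same parsing (require the `s3://` prefix, drop it, split on the first `/`) could look like this in Java. `parseUri` is a hypothetical helper mirroring the Enso logic, returning null where the Enso code returns Nothing; the handling of a bucket-only URI is an assumption, since the Enso snippet is truncated:

```java
public class S3UriSketch {
    /** Splits "s3://bucket/key" into {bucket, key}; returns null for inputs the Enso code rejects. */
    static String[] parseUri(String uri) {
        if (!uri.startsWith("s3://")) return null;
        String noPrefix = uri.substring(5);        // drop "s3://"
        int slash = noPrefix.indexOf('/');
        if (slash == 0) return null;               // empty bucket name, e.g. "s3:///key"
        if (slash == -1) return new String[] { noPrefix, "" }; // bucket-only URI (assumed behaviour)
        return new String[] { noPrefix.substring(0, slash), noPrefix.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parseUri("s3://my-bucket/path/to/object.csv");
        System.out.println(parts[0]); // my-bucket
        System.out.println(parts[1]); // path/to/object.csv
    }
}
```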
Member

Shouldn't not being able to find the bucket name be an error?

@jdunkerley jdunkerley linked an issue Sep 20, 2023 that may be closed by this pull request
@jdunkerley jdunkerley added the CI: Ready to merge This PR is eligible for automatic merge label Sep 20, 2023
@mergify mergify bot merged commit 74d1d08 into develop Sep 20, 2023
24 of 25 checks passed
@mergify mergify bot deleted the wip/jd/s3-read branch September 20, 2023 15:09
Successfully merging this pull request may close these issues.

Ability to connect and read from S3
6 participants