
S3 Read Access, Input Stream based reading #7776

Merged: 32 commits into develop, Sep 20, 2023

Conversation


@jdunkerley jdunkerley commented Sep 8, 2023

Pull Request Description

  • Added a FileSystemSPI allowing protocol resolution to a target type.
  • Separated Input_Stream and Output_Stream from File to allow use in other spaces.
  • File_Format types' read_web changed to read_stream, working with an InputStream.
  • Added directory listing to Auto_Detect allowing for Data.read to list a folder.
  • Adjusted HTTP to return an InputStream not a byte[]:
    • Response_Body adjusted to wrap an InputStream.
    • Added the ability to materialize to either an in-memory vector (<4KB) or a temporary file.
    • Data.fetch will materialize if not a recognized mime-type.
    • Added HTTP_Error to handle IO exceptions from the stream.
  • Excel_Format now supports mime-type and reading a stream.
    • Excel_Workbook can now get an Excel_Section using read_section.
  • Added S3 APIs:
    • parse_uri: splits an S3 URI into bucket and key.
    • list_objects: lists the items in an S3 bucket with a specified prefix.
    • read_bucket: lists prefixes and keys with a delimiter in an S3 bucket with a specified prefix.
    • head: either the head_bucket (tests existence) or head_object API (reads object metadata).
    • get_object: gets an object from S3, returning it as a Response_Body.
  • Added an S3_File type acting like a File:
    • No support for writing in this PR.
    • ToDo: recursive listing, glob filtering, exists, size.
  • Fixed a few invalid type signature lines.
  • Moved create methods for Postgres_Connection and SQLite_Connection into type instead of module.
  • Renamed Column_Fetcher.Builder to Column_Fetcher_Builder.
  • Fixed bug with select_into in Dry Run mode creating permanent tables.

ToDo: Unit tests.
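The materialization behaviour listed above (an in-memory vector for small bodies, a temporary file otherwise) can be sketched in Java. This is a hypothetical illustration of the threshold logic only, not the PR's actual implementation; the class and method names and the exact spill strategy are assumptions, with just the <4KB cut-off taken from the description:

```java
import java.io.*;
import java.nio.file.*;

public class MaterializeSketch {
    static final int THRESHOLD = 4096; // bodies under ~4KB stay in memory, per the PR description

    /** Reads the stream fully; small payloads stay as a byte[], larger ones spill to a temp file. */
    static Object materialize(InputStream in) throws IOException {
        ByteArrayOutputStream head = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        // Read at most THRESHOLD bytes into memory first.
        while (head.size() < THRESHOLD
                && (n = in.read(buf, 0, Math.min(buf.length, THRESHOLD - head.size()))) != -1) {
            head.write(buf, 0, n);
        }
        int next = in.read();
        if (next == -1) return head.toByteArray(); // stream ended within the threshold: keep in memory
        // Larger than the threshold: spill what was read so far plus the rest to a temp file.
        Path tmp = Files.createTempFile("materialized", ".bin");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            head.writeTo(out);
            out.write(next);
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Object small = materialize(new ByteArrayInputStream(new byte[100]));
        Object large = materialize(new ByteArrayInputStream(new byte[10_000]));
        System.out.println(small instanceof byte[]); // true: small payload kept in memory
        System.out.println(large instanceof Path);   // true: large payload spilled to disk
        if (large instanceof Path) Files.deleteIfExists((Path) large);
    }
}
```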

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

Contributor

@hubertp hubertp left a comment


I've only skimmed through lib changes, engine part is 👍

distribution/lib/Standard/AWS/0.0.0-dev/src/S3/S3.enso (outdated, resolved)
Comment on lines 96 to 116
## Gets an object from an S3 bucket.

   Arguments:
   - bucket: the name of the bucket.
   - key: the key of the object.
   - credentials: AWS credentials. If not provided, the default credentials
     will be used.
get_object : Text -> Text -> AWS_Credential | Nothing -> Any ! S3_Error
get_object bucket key credentials:(AWS_Credential | Nothing)=Nothing = handle_s3_errors <|
    client = make_client credentials
    request = GetObjectRequest.builder.bucket bucket . key key . build
    Panic.catch NoSuchBucketException handler=(_->Error.throw (S3_Bucket_Not_Found.Error bucket)) <|
        Panic.catch NoSuchKeyException handler=(_->Error.throw (No_Such_Key.Error bucket key)) <|
            response = client.getObject request
            inner_response = response.response
            mime_type = inner_response.contentType
            s3_uri = URI.parse ("s3://" + bucket + "/" + key)
            input_stream = Input_Stream.new response (handle_io_errors s3_uri)
            Response_Body.Raw_Stream input_stream mime_type s3_uri
Member

Since this returns a Raw_Stream, I assume it is an internal method, right?

In such a case it should be marked as PRIVATE. Maybe the helper methods should be moved to a file separate from the public API? I think we could move in that direction; it would make everything a bit clearer.

Member Author

This is consistent with the low-level HTTP request method; I will mark it ADVANCED as it should be.
Once I merge with Greg's changes, I will make the two consistently hidden.

if uri.starts_with "s3://" . not then Nothing else
    no_prefix = uri.drop 5
    index_of = no_prefix.index_of "/"
    if index_of == 0 then Nothing else
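For illustration, the same parsing (require the `s3://` prefix, drop it, split on the first `/`) could look like this in Java. `parseUri` is a hypothetical helper mirroring the Enso logic, returning null where the Enso code returns Nothing; the handling of a bucket-only URI is an assumption, since the Enso snippet is truncated:

```java
public class S3UriSketch {
    /** Splits "s3://bucket/key" into {bucket, key}; returns null for inputs the Enso code rejects. */
    static String[] parseUri(String uri) {
        if (!uri.startsWith("s3://")) return null;
        String noPrefix = uri.substring(5);        // drop "s3://"
        int slash = noPrefix.indexOf('/');
        if (slash == 0) return null;               // empty bucket name, e.g. "s3:///key"
        if (slash == -1) return new String[] { noPrefix, "" }; // bucket-only URI (assumed behaviour)
        return new String[] { noPrefix.substring(0, slash), noPrefix.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parseUri("s3://my-bucket/path/to/object.csv");
        System.out.println(parts[0]); // my-bucket
        System.out.println(parts[1]); // path/to/object.csv
    }
}
```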
Member

Shouldn't not being able to find the bucket name be an error?

@jdunkerley jdunkerley linked an issue Sep 20, 2023 that may be closed by this pull request
@jdunkerley jdunkerley added the CI: Ready to merge This PR is eligible for automatic merge label Sep 20, 2023
@mergify mergify bot merged commit 74d1d08 into develop Sep 20, 2023
24 of 25 checks passed
@mergify mergify bot deleted the wip/jd/s3-read branch September 20, 2023 15:09
Successfully merging this pull request may close these issues.

Ability to connect and read from S3
6 participants