
BULK-22, #398: Support reading s3 urls #399

Merged

Conversation

@DavidTaylorCengage (Contributor) commented Feb 8, 2022

Purpose

Add the ability to load records into Cassandra directly from AWS S3, instead of needing to write awkward scripts that load files one-by-one.

Fixes #398

Implementation Notes

Since this is "new functionality added in a backwards compatible manner," I have given the version a minor bump to 1.9.0-SNAPSHOT.

The only way I could find to get the S3 config where I needed it was to set the Config object into BulkLoaderURLStreamHandlerFactory from DataStaxBulkLoader.run(), then pass it into the handlers via provider.maybeCreateURLStreamHandler(). If someone knows a better/cleaner way to achieve this, let me know. :)

Since the URLStreamHandlerProvider implementations are constructed by ServiceLoader, I was not able to build the S3Client in S3URLStreamHandlerProvider's constructor. Instead, the provider calls an init() method when maybeCreateURLStreamHandler() is passed an s3:// URL, with a flag to prevent re-initialization. As before, if someone knows a nicer way to achieve this, I'm happy to make adjustments.
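
For illustration only (not the PR's actual code), here is a minimal sketch of that lazy-init pattern; the method signature, the S3URLStreamHandler constructor, and the config key are assumptions:

```java
import java.net.URLStreamHandler;
import java.util.Optional;

import com.typesafe.config.Config;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class S3URLStreamHandlerProvider {

  private S3Client s3Client; // built lazily, only once an s3:// URL is actually seen

  public Optional<URLStreamHandler> maybeCreateURLStreamHandler(String protocol, Config config) {
    if (!"s3".equalsIgnoreCase(protocol)) {
      return Optional.empty(); // not our protocol; let another provider handle it
    }
    init(config); // no-op after the first call
    return Optional.of(new S3URLStreamHandler(s3Client)); // S3URLStreamHandler is the handler added by this PR
  }

  private synchronized void init(Config config) {
    if (s3Client != null) {
      return; // flag-style guard against re-initialization
    }
    s3Client = S3Client.builder()
        .region(Region.of(config.getString("dsbulk.s3.region"))) // config key is an assumption
        .build();
  }
}
```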

Lastly, I added a few things to CONTRIBUTING.md that I discovered while implementing this feature and could not find documented anywhere else. If there are any other improvements to documentation that folks want to sneak in...

Testing

To test this new feature, you're going to need an S3 bucket with some records in it. I used JSON, but CSV should work as well. You can use either a profile or the access/secret key method for authorization.

  1. Choose some URLs from your S3 bucket. Both individual URLs and a urlfile will work.
  2. Run dsbulk load <options>. I used a config file to provide my options, but you can pass them as arguments as well. The key is to include the dsbulk.s3.* options, per the updated documentation.
  3. Observe the records loaded into the appropriate Cassandra tables.

Notes

This is my first contribution to this repository, so if I've done anything wrong, used the incorrect idioms, or gone against policy, please let me know and I'm happy to make corrections. :) I have signed the CLA.

Motivation:

In order to efficiently upload a large number of records from AWS S3 to
Cassandra, dsbulk needs to be able to read from s3:// URLs.

Modifications:

The main change here is adding the S3URLStreamHandler(Provider) classes.
This makes reading from S3 URLs transparent to the rest of the
application in the same way that reading from stdin/stdout is. A
modification to DataStaxBulkLoader was required to be able to pass the
config into these new classes.

Result:

Users may now pass s3:// URLs to load commands and dsbulk will retrieve
the content from AWS S3 and load it into Cassandra.

Future Work:

For this use case, only reading from S3 was required, but adding support
for writing to S3 should be a fairly simple modification to
S3URLStreamHandler.getOutputStream().
I also added some new documentation to CONTRIBUTING.md for other new
contributors who might not know some of the details of the local build
process.
@DavidTaylorCengage changed the title from "Bulk 22, #398: Support reading s3 urls" to "BULK-22, #398: Support reading s3 urls" on Feb 8, 2022
@DavidTaylorCengage (Contributor, Author):

Does anyone know why four checks are still "expected", and only the one has completed?

@adutra (Contributor) commented Feb 11, 2022

> Does anyone know why four checks are still "expected", and only the one has completed?

Some webhooks seem broken, and some of the checks are not accessible outside of DataStax unfortunately.

And by the way, thank you for this PR; overall it seems solid. I will try to get some time to review it properly next week.

@DavidTaylorCengage (Contributor, Author):

Sounds good. Thanks! :)

@adutra (Contributor) commented Feb 18, 2022

@DavidTaylorCengage I'm behind schedule, but I am about to start reviewing your PR.

> The only way I could find to get the S3 config where I needed it was to set the Config object into BulkLoaderURLStreamHandlerFactory from DataStaxBulkLoader.run(), then pass it into the handlers via provider.maybeCreateURLStreamHandler(). If someone knows a better/cleaner way to achieve this, let me know. :)
> Since the URLStreamHandlerProvider implementations are constructed by ServiceLoader, I was not able to build the S3Client in S3URLStreamHandlerProvider's constructor. Instead, the provider calls an init() method when maybeCreateURLStreamHandler() is passed an s3:// URL, with a flag to prevent re-initialization. As before, if someone knows a nicer way to achieve this, I'm happy to make adjustments.

First of all, thank you so much for explaining your design choices. I do understand the issue, but indeed it seems a bit overkill to pass the whole DSBulk config in order to create a URL handler.

Let's take a step back.

When I saw this PR for the first time, I immediately thought about JDBC URLs and how they can carry a lot of information. I was expecting something similar here. For example, a MySQL URL is something like mysql://<username>:<password>@<host>:<port>/<db_name>.

For S3, according to this doc, the equivalent would be a so-called S3 virtual-hosted-style URL, e.g.: https://bucket-name.s3.Region.amazonaws.com/key-name.

The configuration options we need, according to this PR, are:

  • region: always required
  • credentials: one of the below:
    • accessKeyId + secretAccessKey; or
    • profile

It seems that the region could easily be included in the URL. Did you consider this option?

BTW one thing that is missing IMO is an example of an S3 URL: there is none in this PR. What do they look like? Do they include the region and the bucket name? It seems that you consider the entire host as the bucket name, so it seems a URL would look like s3://my-bucket/key-name. Why not use virtual-hosted URLs instead?

For the credentials, it would indeed be bad to include them in the URL, especially secretAccessKey (even though MySQL, for example, allows passwords in URLs). But putting those credentials in the DSBulk config is also a bit insecure, and VERY insecure if the user passes the credentials via the command line.

It seems from your PR that the credentials are used to create an AwsCredentialsProvider. It also seems that if none is provided, the AWS client is able to read the default AWS configuration files, typically ~/.aws/credentials according to this doc.
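
(For reference, a minimal sketch of what this looks like with the AWS SDK v2 -- not code from this PR; the region is just an example value. The default provider chain already covers ~/.aws/credentials, and a named profile can be selected explicitly.)

```java
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

final class S3ClientExamples {

  // Uses the default chain: env variables, system properties, ~/.aws/credentials, instance roles, etc.
  static S3Client withDefaultCredentials() {
    return S3Client.builder()
        .region(Region.US_WEST_2)
        .credentialsProvider(DefaultCredentialsProvider.create())
        .build();
  }

  // Picks one named profile out of ~/.aws/credentials.
  static S3Client withProfile(String profileName) {
    return S3Client.builder()
        .region(Region.US_WEST_2)
        .credentialsProvider(ProfileCredentialsProvider.create(profileName))
        .build();
  }
}
```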

So my main question now is: shouldn't we rely exclusively on ~/.aws/credentials? This IMO would eliminate the need to pass the config to the URL handler factory:

  • The region would be specified in the URL itself, using virtual-hosted-style URLs (this also has the advantage of using a standard URL format);
  • The credentials would be read from ~/.aws/credentials.

Does that make sense to you? I am not an AWS specialist so please forgive me if I am completely off track.

@adutra (Contributor) commented Feb 18, 2022

A few more notes:

  1. S3 virtual-hosted-style URLs have an https scheme. This obviously wouldn't work here, so we would have to use the s3 scheme as you did. To be complete, I was thinking about something close to this syntax: s3://<bucket>.s3.<region>.amazonaws.com/<key>. Example: s3://my-bucket.s3.us-west-2.amazonaws.com/export.csv

  2. Including the region in the URL would probably force you to create one AWS client per call to openConnection – as opposed to creating one single, shared instance of S3Client. In fact I think I would prefer that. It wouldn't be a performance problem for DSBulk since it would call openConnection only once at the beginning of the operation.

  3. Finally, if you prefer to leave the region out of the URL, that could work for me too. It seems that S3Client is perfectly able to locate the region in the usual locations.

@DavidTaylorCengage (Contributor, Author):

I also am not an AWS specialist, so I've only ever used the s3://my-bucket/key-name format. I'll have to take a look at the s3://<bucket>.s3.<region>.amazonaws.com/<key> format and see if that would work. How would you feel about a "custom format" like s3://<profile>.<bucket>.<region>.s3/<key>, from which we extract the right credentials and then build a canonical URL? It's a hack, but it might solve the credentials problem if we can't find a better solution.

The reason we can't rely 100% on ~/.aws/credentials is that you can include multiple profiles in that file; I've actually got five different profiles for the various buckets I use. Thus, we at least need some way to specify which set of credentials to use.

Certainly I agree that passing the accessKeyId and secretAccessKey is insecure. I included that for folks who "just need to upload some data" and didn't want to go through the whole AWS credentials setup process, or who, for whatever reason (running from a cloud instance, etc.), weren't able to use the ~/.aws/credentials file (although in that case, they should look into using role-based access...). I'm happy to remove the temptation to use insecure options if you would like, though.

> It wouldn't be a performance problem for DSBulk since it would call openConnection only once at the beginning of the operation.

I could be mistaken, but it looks like it will call openConnection once per URL? So if we had a -urlfile specified, wouldn't it open a new connection for each line in that file? Our Staging environment had something like 8.8 million records saved in S3, so creating only a single instance of S3Client would be a really good idea!

Thanks for the feedback! These definitely sound like some good potential improvements. I'll start addressing your comments on Monday. :)

@adutra (Contributor) commented Feb 19, 2022

> How would you feel about a "custom format" like s3://<profile>.<bucket>.<region>.s3/<key>, from which we extract the right credentials and then build a canonical URL? It's a hack, but it might solve the credentials problem if we can't find a better solution.

Not sure I follow: how would you extract the credentials from a URL like s3://<profile>.<bucket>.<region>.s3/<key>?

Regarding the proposed format: afaict profile and region are optional, so I wonder if they wouldn't be better expressed as query parameters? E.g. s3://<bucket>/<key>?profile=<profile>&region=<region>. We are diverging from the virtual-hosted-style format, but I could live with that if that's better for you.

> The reason we can't rely 100% on ~/.aws/credentials is that you can include multiple profiles in that file; I've actually got five different profiles for the various buckets I use. Thus, we at least need some way to specify which set of credentials to use.

OK so this could be addressed by using a URL format that includes the profile, right?

> Certainly I agree that passing the accessKeyId and secretAccessKey is insecure. I included that for folks who "just need to upload some data" and didn't want to go through the whole AWS credentials setup process, or who, for whatever reason (running from a cloud instance, etc.), weren't able to use the ~/.aws/credentials file (although in that case, they should look into using role-based access...). I'm happy to remove the temptation to use insecure options if you would like, though.

I guess I would prefer that indeed. I'm pretty sure it won't be long until we start receiving complaints that DSBulk allows AWS secrets to be stored in insecure config files :-) Now if you really really need it, how about putting these in the URL too? E.g. s3://<bucket>/<key>?profile=<profile>&region=<region>&accessKeyId=<accessKeyId>&secretAccessKey=<secretAccessKey>? Then we could log a big fat warning telling the user not to do it.
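
(A sketch of how such query parameters could be parsed and the warning emitted, using only JDK classes; the parameter names follow the proposal above and none of this is the final implementation.)

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class S3UrlParams {

  static Map<String, String> parse(URI uri) {
    Map<String, String> params = new HashMap<>();
    String query = uri.getQuery(); // e.g. "region=us-west-2&profile=staging"
    if (query != null) {
      for (String pair : query.split("&")) {
        int eq = pair.indexOf('=');
        if (eq < 0) {
          continue; // ignore malformed pairs
        }
        String key = URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8);
        String value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
        params.put(key, value);
      }
    }
    if (params.containsKey("secretAccessKey")) {
      // The "big fat warning" suggested above: allowed, but discouraged.
      System.err.println("WARNING: passing AWS secrets in a URL is insecure; prefer ~/.aws/credentials");
    }
    return params;
  }
}
```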

> I could be mistaken, but it looks like it will call openConnection once per URL? So if we had a -urlfile specified, wouldn't it open a new connection for each line in that file? Our Staging environment had something like 8.8 million records saved in S3, so creating only a single instance of S3Client would be a really good idea!

Indeed, in the case of a urlfile, we would open N connections, one per line. But urlfiles usually contain only a handful of lines. Are you saying that you have urlfiles with 8.8 million lines in them? That's huge indeed; we didn't conceive of urlfiles working with so many lines. I'm even surprised that it worked for you!

@DavidTaylorCengage (Contributor, Author):

> Not sure I follow: how would you extract the credentials from a URL like s3://<profile>.<bucket>.<region>.s3/<key>?

Probably using String.split(".") on the hostname. From that, we can build the s3://<bucket>.s3.<region>.amazonaws.com/<key> or s3://<bucket>/<key> URLs and provide the right credentials to an S3ClientBuilder.
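
(A sketch of that idea with the JDK, assuming the hypothetical s3://<profile>.<bucket>.<region>.s3/<key> format; one subtlety is that String.split takes a regex, so the dot has to be escaped.)

```java
import java.net.URI;

final class CustomS3UrlFormat {

  /** Returns {profile, bucket, region, key} parsed from the hypothetical custom format. */
  static String[] parse(String url) {
    URI uri = URI.create(url); // e.g. "s3://staging.my-bucket.us-west-2.s3/some/key.json"
    // "\\." splits on literal dots; split(".") would treat every character as a
    // delimiter and return an empty array, because split() takes a regex.
    String[] hostParts = uri.getHost().split("\\.");
    String profile = hostParts[0];
    String bucket = hostParts[1]; // caveat: bucket names may themselves contain dots
    String region = hostParts[2];
    String key = uri.getPath().substring(1); // drop the leading '/'
    return new String[] {profile, bucket, region, key};
  }
}
```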

> OK so this could be addressed by using a URL format that includes the profile, right?

Correct, that was my thinking. Upon further reflection, though, I think it's unlikely that we'll have a use case where the user needs to provide multiple sets of credentials for a single run of dsbulk, so maybe passing the same values in every URL is excessive?

> I guess I would prefer that indeed. I'm pretty sure it won't be long until we start receiving complaints that DSBulk allows AWS secrets to be stored in insecure config files :-) Now if you really really need it, how about putting these in the URL too? E.g. s3://<bucket>/<key>?profile=<profile>&region=<region>&accessKeyId=<accessKeyId>&secretAccessKey=<secretAccessKey>? Then we could log a big fat warning telling the user not to do it.

I'll delete these config options the next time I push changes. :) Once we work out the right way to pass credentials -- maybe via URL -- then I'll make sure to include a warning for those using the insecure method.

> Indeed, in the case of a urlfile, we would open N connections, one per line. But urlfiles usually contain only a handful of lines. Are you saying that you have urlfiles with 8.8 million lines in them? That's huge indeed; we didn't conceive of urlfiles working with so many lines. I'm even surprised that it worked for you!

Hehe. 😅 I briefly mentioned it in the #398 issue description: the system that is writing records to S3 is multi-node and multi-threaded, so we can't really combine the records into a smaller number of files (at write-to-S3 time, anyway). With one record per file, we end up with a pretty ridiculous number of files. We did get several out-of-memory errors when running this branch, so I recommended breaking the urlfile into multiple pieces, but haven't heard back from our Cassandra admin on whether that was successful (weekend and holiday and all).

If we think it's reasonable to impose an explicit or implicit limit on the number of files contained in a urlfile, I can work with that. Maybe we can use a singleton to prevent re-initializing with the same parameters?

@adutra (Contributor) commented Feb 25, 2022

> I'll delete these config options the next time I push changes. :) Once we work out the right way to pass credentials -- maybe via URL -- then I'll make sure to include a warning for those using the insecure method.

I will let you decide which URL format works best, and also whether we should include credentials in the URL or leave them out and fetch them from the credentials file. As long as we don't use the DSBulk config for that, I think I am OK – this was indeed the only design choice that was making me uncomfortable.

> We did get several out-of-memory errors when running this branch, so I recommended breaking the urlfile into multiple pieces

Yes, that is inevitable since the urlfile lines are stored in a java.util.List in memory.

This is a separate issue, though. As a workaround I'd indeed suggest that you split your urlfiles. However, if you want to, I'd suggest that you open an issue to improve this. It's "just" a matter of turning that List into a (lazy) Stream or, better yet, a Flux. I suspect, however, that the fix won't be trivial and will require substantial changes to AbstractFileBasedConnector – but nothing impossible.
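
(A sketch of that lazy alternative with Reactor, not part of this PR: stream the urlfile lines instead of materializing them in a List.)

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

import reactor.core.publisher.Flux;

final class UrlfileStreaming {

  static Flux<String> lines(Path urlfile) {
    return Flux.using(
        () -> Files.lines(urlfile), // opens the file lazily, one line at a time
        Flux::fromStream,
        Stream::close); // closes the underlying file handle when the Flux terminates
  }
}
```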

But that leaves us with this other decision to make: whether to use a single S3 client for all URLs, or a dedicated one for each. We'd probably need to benchmark a bit and figure out what is a reasonable maximum size for a urlfile.

@adutra (Contributor) commented Mar 8, 2022

Hi @DavidTaylorCengage, just checking in to see if you need anything from me. FYI we would like to release 1.9.0 at the end of the month; it would be great to get this in (it could be in preview status if we are unsure about the URL format, etc.). Thanks!

@DavidTaylorCengage (Contributor, Author):

Hello @adutra ! Sorry for the delay. This is on my to-do list, but it's been a busy week. I'll try to finish it off by the end of this week, though. As I understand it, there are three outstanding tasks:

  • Move the release notes. (This is already done, just not pushed.)
  • Resolve the credentials issue, probably by using a custom URL format.
  • Address the "S3 client per URL vs S3 client for all URLs" question. I'll see if I can do some performance testing locally to give us an idea of the relative performance of these options. This decision might also be made for us depending on the credentials solution.

Does that sound right/good to you?

@adutra (Contributor) commented Mar 9, 2022

> Does that sound right/good to you?

Absolutely, thank you!

@DavidTaylorCengage (Contributor, Author):

Alas, I was busier than I expected today, and didn't finish either of the two remaining tasks. 😞 I did have some time to think about it, though, and I want to try adding a cache to S3URLStreamHandler so that when I put the credentials on each URL (as query parameters), we don't spend a lot of time re-creating the same client over and over again. I'm not sure whether this cache should be adjustable or a fixed size. As far as I know now, this will be my highest priority on Monday, so I should be able to get something done! 😉

Specifically, in addition to a few minor clean-up tasks, I switched from
specifying the S3 credentials in the config to specifying them as query
params on the individual URLs. To alleviate the performance concerns
about creating a new client for each URL in a urlfile, I have added a
cache with a configurable size.

I also added warnings when someone passes S3 credentials in an insecure
way or when they use a urlfile with over 100,000 lines (which can cause
out-of-memory errors). This limit is pretty arbitrary, and may be
adjusted in the future.
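
(A hedged sketch of such a bounded client cache -- an access-ordered LinkedHashMap evicting the least recently used S3Client; the cache key and eviction details here are illustrative, not the exact dsbulk implementation.)

```java
import java.util.LinkedHashMap;
import java.util.Map;

import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

final class S3ClientCache {

  private final Map<String, S3Client> cache;

  S3ClientCache(int maxSize) {
    // accessOrder = true gives LRU iteration order; removeEldestEntry enforces the bound.
    this.cache = new LinkedHashMap<String, S3Client>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, S3Client> eldest) {
        if (size() > maxSize) {
          eldest.getValue().close(); // release HTTP resources of the evicted client
          return true;
        }
        return false;
      }
    };
  }

  synchronized S3Client get(String region, String profile) {
    return cache.computeIfAbsent(
        region + "|" + profile,
        key -> S3Client.builder()
            .region(Region.of(region))
            .credentialsProvider(ProfileCredentialsProvider.create(profile))
            .build());
  }
}
```
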
@ErickRamirezAU left a comment

David, first up -- thank you for this very valuable contribution! This is very useful and there is definitely demand for it. 🍻

Just thinking out loud, I was wondering if it would be easy for a user to determine which JSON files were or were not processed:

  • will the JSON filenames be sent to STDOUT?
  • will the JSON filenames be persisted to the job log?


#### --s3.clientCacheSize<br />--dsbulk.s3.clientCacheSize _<number>_

The size (count) of the S3Client cache. Since each S3 URL


I'd like to request that we put in additional guidance here. It's not readily obvious to me why 20 is a good default.

For example, what happens if we set the cache size to 100? or 5? Thoughts? 🙂

@DavidTaylorCengage (Contributor, Author):

I pretty much just picked an arbitrary number. 🤷‍♂️ In our use case, we're using the same credentials for all the URLs, so the cache only ever has one entry. I suspect using the same credentials for all the URLs will be the usual use case, and it would be a bit of a pain (and somewhat unexpected) to supply more than 20 different sets of credentials in a single urlfile. I suppose that is actually an argument for a cache size of 1. 🙃

I'm happy to add a sentence explaining this if you want, though. Do note, however, that this file (settings.md) and application.template.conf are automatically generated from dsbulk-reference.conf, so the same explanatory statement will be shared between the three files.

public class S3URLStreamHandlerProvider implements URLStreamHandlerProvider {

  private static final String S3CLIENT_CACHE_SIZE_PATH = "dsbulk.s3.clientCacheSize";
  private static final int DEFAULT_S3CLIENT_CACHE_SIZE = 20;


In line with my request above, could we add a one-line comment on why 20 is a good default please? 🙏

@@ -954,6 +954,14 @@ dsbulk {

}

# Settings applicable for reading from AWS S3 URLs.
s3 {
# The size (count) of the S3Client cache. Since each S3 URL


Whatever additional guidance gets included in settings.md, please add it here too. 🙏

# Settings applicable for reading from AWS S3 URLs.
################################################################################################

# The size (count) of the S3Client cache. Since each S3 URL


Whatever additional guidance gets included in settings.md, please add it here too. 🙏

@DavidTaylorCengage (Contributor, Author):

> Just thinking out loud, I was wondering if it would be easy for a user to determine which JSON files were or were not processed:
>
>   • will the JSON filenames be sent to STDOUT?
>   • will the JSON filenames be persisted to the job log?

Yes, the file names appear in positions.txt, and in operation.log/STDOUT if you have dsbulk.log.verbosity = 2.

@adutra (Contributor) commented Apr 1, 2022

@DavidTaylorCengage just a heads up that we are planning to release 1.9.0 today or beginning of next week. Would it be OK with you if we push this PR to 1.10.0?

@DavidTaylorCengage (Contributor, Author):

Yes, that's fine with me. 🙂

Also update URL-query-parsing logic and document the reason for choosing
20 as the cache size.
@adutra added this to the 1.10.0 milestone on May 27, 2022
@weideng1 modified the milestones: 1.10.0 → 1.11.0 on Aug 12, 2022
@DavidTaylorCengage (Contributor, Author):

I think I've addressed all the feedback? Let me know if there are any additional changes you would like me to make. :)

@DavidTaylorCengage (Contributor, Author):

@adutra? Does this PR look good to you now? I'm happy to make any additional changes you want me to. :)

@adutra (Contributor) commented Nov 29, 2022

@DavidTaylorCengage I'm sorry, I was pulled out of this project but I'll make one last pass today or tomorrow, then some of my colleagues will take it from there.

@adutra (Contributor) commented Nov 29, 2022

Trying to sum up the conversations we had so far:

  • URL format: there was some back and forth, but the current format and parsing seem both OK to me.
  • Client per URL vs cached client: given that the requirements include being able to consume urlfiles with millions of lines, a cache is probably necessary anyway. So OK from my point of view.
    • Note to self: create an issue to "officialize" the support for urlfiles with up to millions of lines. DONE
