BULK-22, #398: Support reading s3 urls #399
Conversation
Motivation: In order to efficiently upload a large number of records from AWS S3 to Cassandra, dsbulk needs to be able to read from s3:// URLs.

Modifications: The main change here is adding the S3URLStreamHandler(Provider) classes. This makes reading from S3 URLs transparent to the rest of the application, in the same way that reading from stdin/stdout is. A modification to DataStaxBulkLoader was required to be able to pass the config into these new classes.

Result: Users may now pass s3:// URLs to load commands, and dsbulk will retrieve the content from AWS S3 and load it into Cassandra.

Future Work: For this use case, only reading from S3 was required, but adding support for writing to S3 should be a fairly simple modification to S3URLStreamHandler.getOutputStream().
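Below is a rough, hypothetical sketch of what an `s3://` `URLStreamHandler` along these lines could look like, assuming the AWS SDK v2 (`software.amazon.awssdk`) S3 client. The class name mirrors the description above, but the body is illustrative only and is not the PR's actual code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

// Illustrative sketch only: resolves s3://bucket/key URLs to an InputStream via GetObject.
class S3URLStreamHandler extends URLStreamHandler {

  private final S3Client s3Client;

  S3URLStreamHandler(S3Client s3Client) {
    this.s3Client = s3Client;
  }

  @Override
  protected URLConnection openConnection(URL url) {
    return new URLConnection(url) {
      @Override
      public void connect() {
        // Nothing to do: the object is fetched lazily in getInputStream().
      }

      @Override
      public InputStream getInputStream() throws IOException {
        // Treat the host as the bucket name and the path (minus the leading '/') as the key.
        String bucket = url.getHost();
        String key = url.getPath().substring(1);
        GetObjectRequest request = GetObjectRequest.builder().bucket(bucket).key(key).build();
        return s3Client.getObject(request); // ResponseInputStream<GetObjectResponse>
      }
    };
  }
}
```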
I also added some new documentation to CONTRIBUTING.md for other new contributors who might not know some of the details of the local build process.
Does anyone know why four checks are still "expected", and only one has completed?
Some webhooks seem broken, and some of the checks are not accessible outside of DataStax, unfortunately. And by the way, thank you for this PR; overall it seems solid. I will try to get some time to review it properly next week.
Sounds good. Thanks! :)
@DavidTaylorCengage I'm behind schedule, but I am about to start reviewing your PR.
First of all, thank you so much for explaining your design choices. I do understand the issue, but indeed it seems a bit overkill to pass the whole DSBulk config in order to create a URL handler. Let's take a step back. When I saw this PR for the first time, I immediately thought about JDBC URLs and how they can carry a lot of information (user, password, host, port, database name), and I was expecting something similar here. For S3, according to this doc, the equivalent would be a so-called S3 virtual-hosted-style URL of the form `https://bucket-name.s3.region-code.amazonaws.com/key-name`. The configuration options we need, according to this PR, are the region, the profile, and the access key ID / secret access key.
It seems that the region could easily be included in the URL. Did you consider this option? BTW, one thing that is missing IMO is an example of an S3 URL: there is none in this PR. What do they look like? Do they include the region and the bucket name? It seems that you consider the entire host as the bucket name, so a URL would presumably look like `s3://bucket-name/path/to/key`. For the credentials, it would indeed be bad to include them in the URL, especially the secretAccessKey (even though MySQL, for example, allows passwords in URLs). But putting those credentials in the DSBulk config is also a bit insecure, and VERY insecure if the user passes the credentials via the command line. It seems from your PR that the credentials are only used to build the S3 client. So my main question now is: shouldn't we rely exclusively on the standard AWS credentials file and profiles for that, instead of DSBulk config options?
Does that make sense to you? I am not an AWS specialist, so please forgive me if I am completely off track.
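For reference, here is a minimal sketch of what relying on the standard AWS mechanisms could look like with the AWS SDK v2: either the default credentials chain (environment variables, `~/.aws/credentials`, instance profile, etc.) or a named profile. This is not code from the PR; the class and method names are placeholders.

```java
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

// Illustrative only: two ways to build an S3Client without putting secrets in DSBulk config.
class S3Clients {

  // Relies entirely on the default chain: env vars, ~/.aws/credentials, instance profile, etc.
  static S3Client fromDefaultChain(Region region) {
    return S3Client.builder()
        .region(region)
        .credentialsProvider(DefaultCredentialsProvider.create())
        .build();
  }

  // Uses a named profile from ~/.aws/credentials.
  static S3Client fromProfile(String profileName, Region region) {
    return S3Client.builder()
        .region(region)
        .credentialsProvider(ProfileCredentialsProvider.create(profileName))
        .build();
  }
}
```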
A few more notes:
I'm also not an AWS specialist, so I've only ever used the plain access/secret key method myself. The reason why we can't 100% rely on profiles alone is that not every environment running dsbulk will have them configured. Certainly I agree that passing the whole DSBulk config around just to build a URL handler is awkward.
I could be mistaken, but it looks like it will get called once for every URL.

Thanks for the feedback! These definitely sound like some good potential improvements. I'll start addressing your comments on Monday. :)
Not sure I follow: how would you extract the credentials from a URL like that? Regarding the proposed format: afaict profile and region are optional, so I wonder if they wouldn't be better expressed as query parameters, e.g. `s3://bucket-name/key?region=<region>&profile=<profile>`?
OK so this could be addressed by using a URL format that includes the profile, right?
I guess I would prefer that indeed. I'm pretty sure it won't be long until we start receiving complaints that DSBulk allows AWS secrets to be stored in insecure config files :-) Now if you really, really need it, how about putting these in the URL too, e.g. as additional query parameters?
Indeed, in the case of a urlfile, we would open N connections, one per line. But urlfiles usually contain only a handful of lines. Are you saying that you have urlfiles with 8.8 million lines in them? That's huge indeed; we didn't conceive of urlfiles to work with so many lines. I'm even surprised that it worked for you!
Probably using query parameters, yes.
Correct, that was my thinking. Upon further reflection, though, I think it's unlikely that we'll have a use case where the user needs to provide multiple sets of credentials for a single run of `dsbulk`.
I'll delete these config options the next time I push changes. :) Once we work out the right way to pass credentials -- maybe via URL -- then I'll make sure to include a warning for those using the insecure method.
Hehe. 😅 I briefly mentioned it in the #398 issue description: the system that is writing records to S3 is multi-node and multi-threaded, so we can't really combine the records into a smaller number of files (at write-to-S3 time, anyway). With one record per file, we end up with a pretty ridiculous number of files. We did get several out-of-memory errors when running this branch, so I recommended breaking the urlfile into multiple pieces, but haven't heard back from our Cassandra admin on whether that was successful (weekend and holiday and all). If we think it's reasonable to impose an explicit or implicit limit on the number of files contained in a urlfile, I can work with that. Maybe we can use a singleton to prevent re-initializing with the same parameters?
I will let you decide which URL format works best. And also decide whether we should include credentials in the URL or leave them out and fetch them in the credentials file. As long as we don't use DSBulk config for that, I think I am Ok – this was indeed the only design choice that was making me uncomfortable.
Yes, that is inevitable since the urlfile lines are stored in a `List` in memory. This is a separate issue though. As a workaround I'd indeed suggest that you split your urlfiles. However, if you want to, I'd suggest that you open an issue to improve this. It's "just" a matter of turning that List into a (lazy) Stream, or better yet, a Flux. I suspect, however, that the fix won't be trivial and will incur substantial changes elsewhere. But that leaves us with this other decision to make: whether to use a single S3 client for all URLs, or a dedicated one for each. We'd probably need to benchmark a bit and figure out what is a reasonable maximum size for a urlfile.
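For illustration only (not DSBulk code), the lazy approach suggested above could look roughly like this, assuming Reactor's `Flux` and `java.nio` are available: the urlfile is streamed line by line instead of being materialized into a `List`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import reactor.core.publisher.Flux;

// Illustrative sketch: emit urlfile lines lazily so huge urlfiles don't have to fit in memory.
class UrlfileReading {

  static Flux<String> lazyUrls(Path urlfile) {
    return Flux.using(
        () -> Files.lines(urlfile), // open the file lazily when subscribed
        Flux::fromStream,           // emit one line at a time
        Stream::close);             // close the underlying stream when done
  }
}
```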
Hi @DavidTaylorCengage just checking in to see if you need anything from me? FYI we would like to release 1.9.0 at the end of the month, it would be great to get this in (could be in preview status if we are unsure about the URL format, etc.). Thanks!
Hello @adutra! Sorry for the delay. This is on my to-do list, but it's been a busy week. I'll try to finish it off by the end of this week, though. As I understand it, there are three outstanding tasks: move the S3 credentials out of the DSBulk config (likely into the URLs themselves), avoid creating a brand-new S3 client for every URL in a urlfile, and warn users about insecure credential passing and very large urlfiles.
Does that sound right/good to you?
Absolutely, thank you!
Alas, I was busier than I expected today, and didn't finish either of the two remaining tasks. 😞 I did have some time to think about it, though, and I want to try adding a cache for the S3 clients.
Specifically, in addition to a few minor clean-up tasks, I switched from specifying the S3 credentials in the config to specifying them as query params on the individual URLs. To alleviate the performance concerns about creating a new client for each URL in a urlfile, I have added a cache with a configurable size. I also added warnings when someone passes S3 credentials in an insecure way or when they use a urlfile with over 100,000 lines (which can cause out-of-memory errors). This limit is pretty arbitrary, and may be adjusted in the future.
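For illustration, parsing such query parameters from an S3 URL might look roughly like the sketch below. The example URL and the parameter names `region` and `profile` are assumptions made for this sketch, not necessarily the exact names used in the PR.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: pull key=value query parameters out of an s3:// URI.
class S3QueryParams {

  static Map<String, String> parse(URI uri) {
    Map<String, String> params = new HashMap<>();
    String query = uri.getQuery();
    if (query != null) {
      for (String pair : query.split("&")) {
        int eq = pair.indexOf('=');
        if (eq > 0) {
          params.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
      }
    }
    return params;
  }

  public static void main(String[] args) {
    // Hypothetical URL; the real parameter names are defined by the PR's documentation.
    URI uri = URI.create("s3://my-bucket/data/file1.json?region=us-east-1&profile=my-profile");
    System.out.println(parse(uri)); // prints both parameters (map order may vary)
  }
}
```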
David, first up -- thank you for this very valuable contribution! This is very useful and there is definitely demand for it. 🍻
Just thinking out loud, I was wondering if it would be easy for a user to determine which JSON files were or were not processed:
- will the JSON filenames be sent to STDOUT?
- will the JSON filenames be persisted to the job log?
#### --s3.clientCacheSize<br />--dsbulk.s3.clientCacheSize _&lt;number&gt;_

The size (count) of the S3Client cache. Since each S3 URL
I'd like to request that we put in additional guidance here. It's not readily obvious to me why 20 is a good default.
For example, what happens if we set the cache size to 100? or 5? Thoughts? 🙂
I pretty much just picked an arbitrary number. 🤷♂️ In our use case, we're using the same credentials for all the URLs, so the cache only ever has one entry. I suspect using the same credentials for all the URLs will be the usual use case, and it would be a bit of a pain (and somewhat unexpected) to supply more than 20 different sets of credentials in a single urlfile. I suppose that is actually an argument for a cache size of 1. 🙃
I'm happy to add a sentence explaining this if you want, though. Do note, however, that this file (settings.md) and application.template.conf are automatically generated from dsbulk-reference.conf, so the same explanatory statement will be shared between the three files.
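Conceptually, the cache being discussed is a small bounded map from a set of credentials (or URL parameters) to an `S3Client`, so that a urlfile reusing the same credentials reuses the same client. Below is a rough LRU sketch of what `clientCacheSize` bounds, purely for illustration and not the PR's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import software.amazon.awssdk.services.s3.S3Client;

// Illustrative sketch: an access-ordered LinkedHashMap gives simple LRU eviction semantics.
class S3ClientCache {

  private final Map<String, S3Client> cache;

  S3ClientCache(int maxSize) {
    this.cache =
        new LinkedHashMap<String, S3Client>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, S3Client> eldest) {
            if (size() > maxSize) {
              eldest.getValue().close(); // release pooled HTTP connections before evicting
              return true;
            }
            return false;
          }
        };
  }

  // Returns the cached client for this key, building one with the factory on a miss.
  S3Client get(String credentialsKey, Function<String, S3Client> factory) {
    return cache.computeIfAbsent(credentialsKey, factory);
  }
}
```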
public class S3URLStreamHandlerProvider implements URLStreamHandlerProvider {

  private static final String S3CLIENT_CACHE_SIZE_PATH = "dsbulk.s3.clientCacheSize";
  private static final int DEFAULT_S3CLIENT_CACHE_SIZE = 20;
In line with my request above, could we add a one-line comment on why 20 is a good default please? 🙏
@@ -954,6 +954,14 @@ dsbulk {

  }

  # Settings applicable for reading from AWS S3 URLs.
  s3 {
    # The size (count) of the S3Client cache. Since each S3 URL
Whatever additional guidance gets included in settings.md, please add it here too. 🙏
# Settings applicable for reading from AWS S3 URLs.
################################################################################################

# The size (count) of the S3Client cache. Since each S3 URL
Whatever additional guidance gets included in settings.md, please add it here too. 🙏
Yes, the file names appear in positions.txt, and in operation.log.
@DavidTaylorCengage just a heads up that we are planning to release 1.9.0 today or beginning of next week. Would it be OK with you if we push this PR to 1.10.0?
Yes, that's fine with me. 🙂
Also update URL-query-parsing logic and document the reason for choosing 20 as the cache size.
I think I've addressed all the feedback? Let me know if there are any additional changes you would like me to make. :)
@adutra? Does this PR satisfy? I'm happy to make any additional changes you want me to. :)
@DavidTaylorCengage I'm sorry, I was pulled out of this project but I'll make one last pass today or tomorrow, then some of my colleagues will take it from there.
Trying to sum up the conversations we had so far:
Purpose
Add the ability to load records into Cassandra directly from AWS S3, instead of needing to write awkward scripts that load files one-by-one.
Fixes #398
Implementation Notes
Since this is "new functionality added in a backwards compatible manner," I have given the version a minor bump to `1.9.0-SNAPSHOT`.

The only way I could find to get the S3 config where I needed it was to set the `Config` object into `BulkLoaderURLStreamHandlerFactory` from `DataStaxBulkLoader.run()`, then pass it into the handlers via `provider.maybeCreateURLStreamHandler()`. If someone knows a better/cleaner way to achieve this, let me know. :)

Since the construction of the `URLStreamHandlerProvider` implementations is done by `ServiceLoader`, I was not able to build the `S3Client` in `S3URLStreamHandlerProvider`'s constructor. Instead, it calls an `init()` method when `maybeCreateURLStreamHandler()` is passed an `s3://` URL, with a flag to prevent re-initialization. As before, if someone knows a nicer way to achieve this, I'm happy to make adjustments.

Lastly, I added a few things to CONTRIBUTING.md that I discovered while implementing this feature that I could not find documented anywhere else. If there are any other improvements to documentation that folks want to sneak in...
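As an illustration of the lazy-init pattern described in the notes above, here is a minimal standalone sketch (not the PR's actual code, and the class and method names are placeholders): `ServiceLoader` instantiates providers with a no-arg constructor, so anything that needs the `Config` is deferred until the first `s3://` URL is seen.

```java
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Illustrative sketch of one-time, deferred initialization guarded by a flag.
class LazyS3Provider {

  private volatile boolean initialized; // flag guarding one-time initialization

  boolean maybeHandle(String protocol, Config config) {
    if (!"s3".equalsIgnoreCase(protocol)) {
      return false;
    }
    if (!initialized) {
      synchronized (this) {
        if (!initialized) {
          init(config);
          initialized = true;
        }
      }
    }
    return true;
  }

  private void init(Config config) {
    // In the real provider, this is where the S3Client would be built from the DSBulk config.
    System.out.println("Initializing S3 support with config: " + config);
  }

  public static void main(String[] args) {
    LazyS3Provider provider = new LazyS3Provider();
    Config config = ConfigFactory.empty();
    System.out.println(provider.maybeHandle("s3", config)); // triggers init() exactly once
    System.out.println(provider.maybeHandle("s3", config)); // already initialized
  }
}
```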
Testing
To test this new feature, you're going to need an S3 bucket with some records in it. I used JSON, but CSV should work as well. You can use either a profile or the access/secret key method for authorization. A `urlfile` will work. Then run `dsbulk load <options>`. I used a config file to provide my options, but you can pass them as arguments as well. The key is to include the `dsbulk.s3.*` options, per the updated documentation.

Notes
This is my first contribution to this repository, so if I've done anything wrong, used the incorrect idioms, or gone against policy, please let me know and I'm happy to make corrections. :) I have signed the CLA.