SOLR-15089: Allow backup/restoration to Amazon's S3 blobstore #120
Conversation
Implementation of Solr's BackupRepository that uses Amazon S3 as the backing store.
solr/contrib/blob-repository/src/java/org/apache/solr/blob/backup/BlobBackupRepository.java
solr/contrib/blob-repository/src/java/org/apache/solr/blob/client/LocalStorageClient.java
Hey Andy and Pierre! It took me too long to get to this, I'm sorry about that. But overall, the PR looks great as things stand today: there are a few open questions that I think we need to answer, but in terms of quality and testing this already looks really close. Thanks for taking the effort to contribute this!
As is a bad habit of mine I left a lot of comments inline and got very verbose here, but I do that mostly as a way of taking notes and thinking aloud, so don't let the uh, volume, scare you. Almost all comments boil down to me thinking about two questions:
- What are the tradeoffs of the BlobRepository abstraction in contrast to having BackupRepository implementations for each blob store? What value does the abstraction bring us?
- What distinguishes a few of the I/O classes in this PR (e.g. BlobIndexInput) from existing alternatives (e.g. BufferedChecksumIndexInput) that do similar things at a quick glance?
BlobRepository Abstraction
The biggest open question I think is whether or not to do the "BlobRepository" abstraction over having repository impls for each blob store we eventually want to support.
Here are the "pros" I can see of having the abstraction:
- Code Reuse: BlobRepository allows some code reuse across blob stores. e.g. input validation and debug logging on BlobRepository methods. copyIndexFileFrom/copyIndexFileTo implementations. BlobIndexInput (though as I've admitted elsewhere, I don't fully understand the role this class plays). Perhaps some NamedList/argument parsing?
- Partial Configuration Uniformity: BlobRepository allows some configuration properties to be defined uniformly across different blob stores. (i.e. 'blob.store.bucket.name' could be used for GCS as well as S3 backups).
Here are the "cons" that I see in having the abstraction:
- Classpath Bloat: Having all blob-store implementations in the same config means that a user using S3 for backups will also be loading a bunch of the GCS (and eventually ABS) related dependencies that they have no intention of ever using. How much it'll matter in practice, idk, but in theory this expanded but unused surface area is a liability in terms of triggering vulnerability scans or true security issues whenever one of these deps has a CVE.
- Limited Code Reuse: While BlobRepository does allow some pieces to be reused, it's not as much as I was hoping might be possible. IMO the actual code interfacing with the S3/GCS client comprises such a large chunk of the logic here that it dwarfs the smaller bits that are able to be reused. As an "example" of this: GCSBackupRepository and S3StorageClient both come in at roughly the same LOC count despite S3StorageClient taking advantage of all the shared logic up in BlobRepository. (Admittedly this is a real shaky comparison: the two code samples use different client libraries, implement different interfaces, and are written in different coding styles with different use of whitespace, comments, etc. And LOC is a dubious stand-in for complexity in any case. But the point stands that S3StorageClient's code reuse still leaves the two in the same ballpark.)
- Documentation Complexity: Combining multiple blob-store implementations under the same BackupRepository impl will add some documentation challenges: in calling out which config props affect which blob store "types", in describing the differing expectations for identifiers like bucket names and paths, etc.
Am I missing any pros/cons here? Anything my descriptions above miss? Weighing these as they stand at least, I'd vote that we're probably better without the abstraction layer and just making this an s3 specific contrib. But I'm curious what I might be missing here.
One side note on this question: some of the code reuse that the abstraction does get us (specifically the input validation, debug-logging, and default method implementations) still seems achievable long-term if it's moved into BackupRepository itself. I can imagine a version of BackupRepository, for example, whose copyFileFrom method validates its inputs, debug-logs the invocation, and then calls an abstract doCopyFileFrom method that subclasses are required to implement. Or alternatively a BackupRepository implementation that does this work on each method before delegating to a BackupRepository object that it wraps. One nice side-effect here is that all BackupRepository impls would benefit, not just the blob-store ones.
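The template-method idea described above can be sketched roughly like this. Note this is a hypothetical illustration, not code from the PR: the class and method names (ValidatingBackupRepository, doCopyFileFrom) are placeholders matching the comment's suggestion.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical sketch: the base class performs shared validation and
// logging once, and subclasses only implement the store-specific work.
abstract class ValidatingBackupRepository {

    public final void copyFileFrom(URI source, String fileName, URI dest) {
        // Shared input validation that every implementation benefits from
        Objects.requireNonNull(source, "source");
        Objects.requireNonNull(fileName, "fileName");
        Objects.requireNonNull(dest, "dest");
        // Shared debug logging (stdout stands in for a real logger here)
        System.out.println("copyFileFrom " + fileName + ": " + source + " -> " + dest);
        doCopyFileFrom(source, fileName, dest);
    }

    // Store-specific logic (S3, GCS, local FS, ...) lives in subclasses
    protected abstract void doCopyFileFrom(URI source, String fileName, URI dest);
}
```

The alternative the comment mentions, a wrapper that delegates to another BackupRepository, is the decorator pattern applied to the same idea; both put the shared checks in exactly one place.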
Anyways, curious on your responses to some of the rambling above, looking forward to getting the ball rolling here; things look great so far!
@Override
public URI resolve(URI baseUri, String... pathComponents) {
    Objects.requireNonNull(baseUri);
    Preconditions.checkArgument(baseUri.isAbsolute());
[0] I love the rigorous input-checking you guys are doing throughout here.
[0] I wonder whether some of these URI assumptions will hold up across other blob stores. I know GCS supports "relative looking" URIs (and in fact the GCS-mock used today for tests actually requires them!)
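As a quick illustration of the concern (a sketch, not code from the PR): java.net.URI treats a URI as absolute only when it has a scheme, so a "relative looking" GCS-style path would fail the isAbsolute() precondition above.

```java
import java.net.URI;

public class UriCheckDemo {
    public static void main(String[] args) {
        URI withScheme = URI.create("s3://my-bucket/backups/index");
        URI relativeLooking = URI.create("my-bucket/backups/index");

        // isAbsolute() is true only when a scheme ("s3:", "gs:", ...) is present
        System.out.println(withScheme.isAbsolute());      // true
        System.out.println(relativeLooking.isAbsolute()); // false
    }
}
```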
/**
 * Class representing the {@code backup} blob config bundle specified in solr.xml. All user-provided config can be
 * overridden via environment variables (use uppercase, with '_' instead of '.'), see {@link BlobBackupRepositoryConfig#toEnvVar}.
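The property-to-env-var mapping the javadoc describes (uppercase, '_' instead of '.') presumably boils down to something like the following. The method name toEnvVar comes from the javadoc above, but this body is a guessed reconstruction, not the PR's actual code.

```java
import java.util.Locale;

public class EnvVarMapping {
    // Hypothetical reconstruction of the documented rule:
    // "blob.s3.bucket.name" -> "BLOB_S3_BUCKET_NAME"
    static String toEnvVar(String property) {
        return property.toUpperCase(Locale.ROOT).replace('.', '_');
    }

    public static void main(String[] args) {
        System.out.println(toEnvVar("blob.s3.bucket.name")); // BLOB_S3_BUCKET_NAME
    }
}
```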
[+1] I love the uniformity of this approach, that all config props can be set by env-var. It's something I didn't do with the GCS repository impl and that now looks like a defect in hindsight.
...contrib/blob-repository/src/java/org/apache/solr/blob/backup/BlobBackupRepositoryConfig.java
solr/contrib/blob-repository/src/java/org/apache/solr/blob/backup/BlobIndexInput.java
solr/contrib/blob-repository/src/java/org/apache/solr/blob/client/S3StorageClient.java
solr/contrib/blob-repository/src/test/org/apache/solr/blob/backup/BlobBackupRepositoryTest.java
@@ -106,6 +106,8 @@ grant {
  permission java.lang.RuntimePermission "writeFileDescriptor";
  // needed by hadoop http
  permission java.lang.RuntimePermission "getProtectionDomain";
  // needed by aws s3 sdk
  permission java.lang.RuntimePermission "accessClassInPackage.jdk.internal.reflect";
[0] Note to self to look into this a bit more.
@@ -0,0 +1,366 @@
/*
[+1] The unit-test coverage in the following files is laudable. The attention to thoroughness is awesome and appreciated!
[-1] That said, there's one checkbox that's missing I think: integration-level testing. We have a base class (AbstractIncrementalBackupTest) that does a passable job here and gets extended by each of the existing BackupRepository impls. We should probably create an S3IncrementalBackupTest extending this, similar to GCSIncrementalBackupTest (as one example).
Good idea on adding the S3IncrementalBackupTest, will do that before merging!
I see that GCSIncrementalBackupTest has a set of locales that it works for, do you think that this is necessary for the S3 test as well, or should I be ok skipping it?
Ok, test added, and it found that the index files are not validated for corruption when building a backup. So very good that we added the test!
Had to modify some of the other tests, but everything should work now, and we can be even more confident with the integration testing.
@gerlowskija I'll address the cons list first, that seems to me the most salient part to discuss:
In my mind, if we go the path of using the BlobRepository abstraction, the ultimate goal would be to have it be a separate unit, apart from the S3 or GCS stuff. That could mean...
There may be other options, but for the bloat reasons you mentioned, I never really considered putting all the blob stores into one multi-cloud repo.
Very fair criticism. For me, it's right on the edge of "are we over-architecting this?". It would of course help to have a third blob client (Azure?); perhaps with just two we're underestimating the value that middle layer could provide?
Is this con still relevant if we pick one of the three options I listed above (and not have them all in one)? If each is separate, their docs would also be separate. Or do you mean internal docs (e.g., Javadoc)? You also asked
In regards to BlobIndexInput, I'll have to defer the 'why' question to @psalagnac. That impl was written when we first wrote this plugin -- it's possible it was to work around a bug, or possible it was an oversight that we built it ourselves and didn't use a pre-built impl. If I get some time, I can also try swapping our use for […]
👍 I was assuming a different structure, or maybe reading too much into the current structure of the WIP, but this makes a lot of sense and would do a lot to alleviate both the classpath and documentation concerns I had in mind previously. Of the specific implementations you mentioned, I'm partial to (2) as it seems like a nice middle ground between (1) and (3). It also seems like the approach that'd make it easiest for a user to use BlobRepository in writing their own plugin. But admittedly that's a bit handwavy. Do you have a preference among those options or see other pros/cons there?
That's definitely the question. I lean I think towards skipping the abstraction until we're in a situation where its value is more clear cut. If someone adds an Azure store next year and more repeated bits of code crop up, we can always add the middle-layer in at that point. But that's just my leaning - if you guys are confident that it'll be better to leave it in, I'm happy to go with that. The BlobIndexInput question might shed light on this too.
I agree that (2) is the best option out of the 3.
I agree with Jason here as well. I might have missed something, but did you mention at some point that this […]
Thanks for all your feedback on this PR @gerlowskija
Any implementation of […] The only two things we have when working with S3 are an […] My understanding is […]

BTW, I just found a bug in your implementation of […]. With AWS, it happens for real. With GCP, it may be just theoretical? That may be some code to share between different blob implementations? 😄
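The partial-read pitfall alluded to here (InputStream.read is allowed to return fewer bytes than requested, which happens in practice with buffered, network-backed streams like S3's) is typically fixed with a read-fully loop. A minimal sketch of the general fix, not the PR's code:

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadFully {
    // read() may legally return fewer bytes than requested, so keep
    // reading until the buffer is full or the stream ends.
    static void readFully(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new EOFException("Stream ended after " + off + " bytes");
            }
            off += n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes();
        byte[] out = new byte[data.length];
        readFully(new ByteArrayInputStream(data), out);
        System.out.println(new String(out)); // hello world
    }
}
```

Code that assumes a single read() fills the buffer usually passes tests against local files (which tend to fill the buffer in one call) and fails only against real cloud storage, which matches the "with AWS it happens for real, with GCP it may be just theoretical" observation.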
Thanks for the clarification @psalagnac. I'll summarize your points to make sure I've got the right idea:
If I've got all that right, then I'm 👍 on keeping it around. Though maybe it should be renamed to remove the suggestion that the class is blob-specific in some way. (AFAICT it's equally useful whether you're wrapping a S3InputStream or a StringBufferInputStream, unless I'm missing something?)
Good catch! As I understand your description - this bug requires (1) buffering somewhere in the stack underlying the InputStream and (2) non-blocking reads from that InputStream. I'll take a look and see whether the GCS code "checks" both of those boxes, unless you've already done that yourself and seen that it does?
I raised this possibility in my previous comment, and I think I am sold on BlobIndexInput's necessity, but I still lean towards skipping the BlobBackupRepository abstraction for now. (Just wanted to close the loop there.)
Hey @athrog - anything holding this up, or did you just miss the notification on the latest round of comments? (If it's a time thing, lmk and I can try to help out a bit myself.) Seems like the consensus so far would be to (1) remove the BlobRepository abstraction and (2) repurpose/rebrand the contrib you're adding now to be s3-specific, pending any big objections or arguments from your side of things?
Hey @gerlowskija, Thanks for your feedback. I think we reached an agreement to keep […]. I will work with @athrog on the following changes:
|
Remove blob abstraction layer.
Rename package to s3.
Hi @gerlowskija, The PR is updated with the new class layout. I tried to align with the existing implementation for GCS. All classes are now in package […]. I kept calls to the AWS API in a standalone class […]. Still have to do end-to-end testing and polish the docs.
@psalagnac could you merge this with […]
Just pushed a merge commit @HoustonPutman, along with an update for our README. Let me know if you have any trouble testing locally, I can add more documentation if needed.
Also, I'm not sure if it's a problem with my setup or maybe I missed some gradle setting somewhere, but just doing a […]
solr/contrib/blob-repository/src/java/org/apache/solr/s3/S3BackupRepository.java
@@ -0,0 +1,47 @@
/*
Moderate OSS Vulnerability:
pkg:maven/com.google.guava/guava@25.1-jre
0 Critical, 0 Severe, 1 Moderate and 0 Unknown vulnerabilities have been found in a direct dependency
MODERATE Vulnerabilities (1)
[CVE-2020-8908] A temp directory creation vulnerability exists in all versions of Guava, allowin...
A temp directory creation vulnerability exists in all versions of Guava, allowing an attacker with access to the machine to potentially access data in a temporary directory created by the Guava API com.google.common.io.Files.createTempDir(). By default, on unix-like systems, the created directory is world-readable (readable by an attacker with access to the system). The method in question has been marked @deprecated in versions 30.0 and later and should not be used. For Android developers, we recommend choosing a temporary directory API provided by Android, such as context.getCacheDir(). For other Java developers, we recommend migrating to the Java 7 API java.nio.file.Files.createTempDirectory() which explicitly configures permissions of 700, or configuring the Java runtime's java.io.tmpdir system property to point to a location whose permissions are appropriately configured.
CVSS Score: 3.3
CVSS Vector: CVSS:3.0/AV:L/AC:L/PR:L/UI:N/S:U/C:L/I:N/A:N
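For reference, the migration the advisory itself recommends looks like this (a sketch of the general fix, not code from this PR):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempDirDemo {
    public static void main(String[] args) throws IOException {
        // Vulnerable (deprecated Guava API): com.google.common.io.Files.createTempDir()
        // creates a world-readable directory on unix-like systems.

        // Recommended replacement: JDK 7+ NIO, which creates the directory
        // with 700 permissions on POSIX file systems.
        Path dir = Files.createTempDirectory("solr-backup-");
        System.out.println(Files.isDirectory(dir)); // true
        Files.delete(dir); // clean up
    }
}
```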
I think the bones are good, but I have a few concerns, mostly around the use of guava (a personal strong aversion to it), and a few minor clean up notes but overall thanks for putting this together!
solr/contrib/blob-repository/src/java/org/apache/solr/s3/AdobeMockS3StorageClient.java
solr/contrib/blob-repository/src/java/org/apache/solr/s3/S3BackupRepository.java
@Override
@SuppressWarnings("unchecked")
public <T> T getConfigProperty(String name) {
This doesn't benefit from generic types AFAICT, we should fix the interface to not do this. Can do so in this issue or a separate one.
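To illustrate the concern (a hypothetical standalone sketch, not the actual Solr interface): the generic return type only moves an unchecked cast from inside the method to the call site, where a mismatch surfaces as a runtime ClassCastException rather than a compile error, so the signature provides no real type safety.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigDemo {
    static final Map<String, Object> props = new HashMap<>();

    // The type parameter is erased at runtime; the compiler inserts the
    // cast at each call site, where it can fail unpredictably.
    @SuppressWarnings("unchecked")
    static <T> T getConfigProperty(String name) {
        return (T) props.get(name);
    }

    public static void main(String[] args) {
        props.put("bucket", "my-bucket");
        String bucket = getConfigProperty("bucket"); // fine
        System.out.println(bucket);
        try {
            Integer bad = getConfigProperty("bucket"); // compiles, fails at runtime
            System.out.println(bad);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException at call site");
        }
    }
}
```

A plain `Object getConfigProperty(String name)` (or a typed `String getConfigProperty(...)`) makes the caller's cast, and therefore the risk, explicit.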
Let's do it in a separate issue.
solr/contrib/blob-repository/src/java/org/apache/solr/s3/S3BackupRepository.java
solr/contrib/blob-repository/src/java/org/apache/solr/s3/S3IndexInput.java
solr/contrib/blob-repository/src/java/org/apache/solr/s3/S3OutputStream.java
solr/contrib/blob-repository/src/java/org/apache/solr/s3/S3StorageClient.java
solr/contrib/blob-repository/src/test/org/apache/solr/s3/S3PathsTest.java
Thank you @madrob for the review. I fixed some of the low-hanging fruit, and will circle back later for the other comments (those will require testing after changes).
@athrog Contribs aren't loaded to the classpath by default. What I'm doing for testing this (not at all recommended for the final product), is adding something to
@athrog & @gerlowskija, I've gotten the precommit to pass (wow this adds a lot of dependencies). The full tests also pass 🎉 As per my comment above, I believe there are 3 areas left to finish before this work can be merged.
Once those are done, I will be happy to merge and backport this to 8x. Let me know what your bandwidth is, and when you think you can get these done by 🙂 No rush, just want to be able to plan accordingly on my side. Also just realized that upgrading the jackson smile dependency broke tests, so looking into fixing those currently...
…ckson-dataformat-smile.
Fixed a weird issue with the upgrade of the smile library. Tests pass now.
Happy to take the other two documentation tasks. @gerlowskija already wrote a short section in the ref guide for the GCS backup repo, so I can modify that for S3 easily. Will try to have those done by the end of the week if I have enough time.
Okay, took my first shot at the ref guide -- let me know what you all think! One thing that stood out to me while documenting all the params was our use of the […]
Yeah I agree. We should remove blob entirely (variables, methods, config, etc); it shouldn't be mentioned at all in the s3-repository contrib module. And will take a look at the docs!
Documentation looks good to me! The only thing we should possibly add is how to add contribs to the classpath/runtime. I imagine this is already in the ref-guide somewhere, and we should be able to link to that pretty easily. Would be good to add for the GCS section as well.
…ludes changing the config bundle parameters specified in solr.xml.
Yeah good idea. I tried adding this to my
This approach is what some other contribs recommend (tika, clustering)...any reason why it wouldn't work for this repo?
To document how to install plugins, I would simply link here:
Furthermore, I would discourage anyone from declaring […]
Ok, so I tested the async stuff and it's working now. The only thing I've found is that the […]

One last thing I want to do now is remove […]
@athrog @gerlowskija I think I'm happy with this PR.
Do y'all think there is anything else that needs to be done (other than adding a CHANGES entry)?
solr/contrib/s3-repository/src/java/org/apache/solr/s3/S3IndexInput.java
solr/contrib/s3-repository/src/java/org/apache/solr/s3/S3OutputStream.java
…#120)
See solr/contrib/s3-repository/README.md for more information.
Co-authored-by: Andy Throgmorton <athrogmorton@salesforce.com>
Co-authored-by: Pierre Salagnac <psalagnac@salesforce.com>
Co-authored-by: Houston Putman <houston@apache.org>
… those were actually correct with the old github repo anyway.
Description
Solr provides a BackupRepository interface with which users can create backups to arbitrary backends. There is now a GCS implementation (see #39), but no S3 impl yet.

Solution
This PR adds a BackupRepository implementation for communicating with S3.

Tests
We've added new unit tests at the BackupRepository level, as well as tests for the S3 interactions (using the S3Mock framework).

Checklist
Please review the following and check all that apply:
- […] the main branch.
- […] ./gradlew check.

We have not yet done the work of adding license files for all the newly added libraries/dependencies. These will be added in a future commit.