Document Expected Checksums in Resolver

apache · Apr 22, 2023 · 64e7b63 · 64e7b63
1 parent 7d17bc3
commit 64e7b63

Showing 4 changed files with 224 additions and 21 deletions.
diff --git a/src/site/markdown/about-checksums.md b/src/site/markdown/about-checksums.md
@@ -39,7 +39,7 @@ of Maven Resolver: there is nothing secure being involved with checksums. Moreov
 algorithm, but even for its "elder brother" MD5. Both algorithms are still widely used today as "transport integrity
 validation" or "error detection" (aka "bit-rot detection").
 
-## Checksum Changes
+## Checksum Algorithms SPI
 
 From a technical perspective, the above written facts infer following consequences: as checksum algorithms are exposed
 to the user, so one can set them via configuration, users are not prevented to ask for SHA-256 or even SHA-512, even if
@@ -50,21 +50,29 @@ wrong use case. The notion of transport validation and secure hashes are being c
 reasons explained above.
 
 Hence, Maven Resolver team decided to make supported set of checksums limited. Instead of directly exposing
-`MessageDigest` algorithms, we introduced an API around checksums. This not only prevents wrong use cases (not
+`MessageDigest` algorithms, we introduced an SPI around checksums. This not only prevents wrong use cases (not
 exposing all supported algorithms of `MessageDigest` to users), but also makes possible to introduce real checksum
 algorithms. Finally, the set of supported checksum algorithms remains extensible: if some required algorithm is
 not provided by Resolver, it can be easily added by creating a factory component for it.
 
-Resolver out of the box supports the following checksum algorithms:
+We are aware that users started using "better SHA" algorithms, and we do not want to break them. Nothing for them
+changes (configuration and everything basically remains the same). But, we do want to prevent any possible further
+proliferation of non-standard checksums.
+
+## Implemented Checksum Algorithms
+
+Resolver out of the box provides the following checksum algorithms (note: algorithm names are case sensitive):
 
 * MD5
 * SHA-1
 * SHA-256
 * SHA-512
 
-We are aware that users started using "better SHA" algorithms, and we do not want to break them. Nothing for them
-changes (configuration and everything basically remains the same). But, we do want to prevent any possible further
-proliferation of non-standard checksums.
+The set of algorithms above is provided by Resolver by default, but using the SPI anyone can extend
+Resolver with new types of Checksum Algorithms.
+
+To see how and when checksums are used in Resolver, continue to the [Expected Checksums](expected-checksums.html)
+page.
 
 Links:
 

diff --git a/src/site/markdown/expected-checksums.md b/src/site/markdown/expected-checksums.md
@@ -0,0 +1,123 @@
+# Expected Checksums
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Checksums in Resolver were historically used during transport, 
+to ensure Artifact integrity. In addition, latest Resolver may 
+use checksums in various other ways too, for example to ensure 
+Artifact integrity during resolution. 
+
+Still, the essence of all checksum uses in Resolver is 
+"integrity validation": Resolver calculates by various
+means the "calculated" checksum (for given payload), 
+then obtains the "expected" checksum (for same payload)
+and compares the two.
+
+In essence and somewhat simplified, Resolver integrity validation looks like this:
+* hash the Artifact payload (file), this is the "calculated" checksum
+* obtain the Artifact "expected" checksum
+* compare the "calculated" checksum with "expected" checksum
+
+This page will cover all the "expected" checksums.
+
+
+## Transport Checksum Strategies
+
+Historically, the "obtain expected checksum" was implemented as simple HTTP GET 
+request against Artifact checksum URL (Artifact URL appended by ".sha1"). This logic 
+is still present in current Resolver, but is "decorated" and extended in multiple 
+ways.
+
+Resolver has broadened the "obtain" step for "expected" checksum with two new strategies,
+so the three expected checksum kinds in transport are: "Provided", "Remote Included" and 
+"Remote External". All these strategies represent the source of "expected" checksum
+as explained above, but they differ in **how** Resolver obtains them.
+
+The **Provided** kind of expected checksums are "provided" to resolver by some alternative
+means, possibly ahead of any transport operation. There is an SPI interface that users may
+implement to provide checksums to resolver in their own way, or they may use the out of the
+box implementation, which simply delegates to "trusted checksums" (more about them later).
+
+The **Remote Included** checksums are "included" by the remote party in some way, most typically
+in their response. Since the advent of modern Repository Managers, most of
+them already send checksums (usually the "standard" SHA-1 and MD5)
+in their response headers. Moreover, Maven Central, and even the Google Mirror of Maven Central,
+send these as well. By extracting these checksums from the response, we can get hashes
+that were provided by the remote repository along with its content.
+
+Finally, the **Remote External** checksums are the classic checksums we all know: they are laid down
+next to Artifact files (hence "external") on the remote repository (hence "remote"), according
+to the remote repository layout. To obtain a Remote External checksum, an HTTP request is
+made toward the remote repository.
+
+During single artifact retrieval, these strategies are executed in above specified order,
+and only if current strategy has "no answer", the next strategy is tried. Hence, if 
+resolver is able to get "expected" checksum from Provided Checksum Source, the Remote Included
+and Remote External sources will not be invoked at all.
+
+The big win here is that by obtaining hashes using "Remote Included" and not by "Remote External"
+strategy, we can halve the count of HTTP requests to download an Artifact.
+
+### Remote Included Strategies
+
+**Note: Remote Included checksums work only with transport-http, they do NOT work with transport-wagon!**
+
+By using the "Remote Included" checksum feature, we are able to halve the issued HTTP request
+count, since many repository services, Maven Central among them, emit the reference checksums in
+the artifact response itself (as HTTP headers). Hence, we are able to get the
+artifact and reference "expected" checksum using only one HTTP round-trip.
+
+
+#### Sonatype Nexus 2
+
+Sonatype Nexus 2 uses SHA-1 hash to generate `ETag` header in "shielded" (à la Plexus Cipher)
+way. Naturally, this means only SHA-1 is available in artifact response header.
+
+Emitted by: Sonatype Nexus2 only.
+
+
+#### Non-standard `X-` headers
+
+Maven Central emits headers `x-checksum-sha1` and `x-checksum-md5` along with artifact response. 
+Google GCS on the other hand uses `x-goog-meta-checksum-sha1` and `x-goog-meta-checksum-md5` 
+headers. Resolver will detect these and use their value.
+
+Emitted by: Maven Central, GCS, some CDNs and probably more.
+
+
+## Trusted Checksums
+
+All the "expected" checksums discussed above are transport bound: they are all
+about HTTP requests and responses, or require Transport related API elements.
+
+Trusted checksums are yet another SPI component, `TrustedChecksumsSource`, that is able
+to deliver "expected" checksums for a given Artifact, with the difference that the trusted
+checksums SPI is unrelated and not coupled to any transport API element.
+
+Since they map ideally onto the transport "Provided Checksum" kind, resolver provides a Provided
+Checksum Source implementation that simply delegates to Trusted Checksums (making
+Provided and Trusted checksums equivalent as far as transport is concerned).
+
+But the biggest game changer of Trusted Checksums is their transport agnosticism: they
+can participate in places where there is no transport happening at all.
+
+One such use of Trusted Checksums in Resolver is in ArtifactResolver "post processing".
+This new functionality, at the cost of checksumming overhead, is able to validate all
+the resolved artifacts against Trusted Checksums, thus making sure that all resolved
+artifacts are "validated" with some known (possibly even cryptographically strong) checksum.
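
Note on the ordering rule documented in `expected-checksums.md` above (Provided, then Remote Included, then Remote External, stopping at the first strategy that has an answer): a minimal Java sketch of that rule could look like the following. It is only an illustration. The class `ExpectedChecksumChain`, the method `firstAnswer`, and the use of `Supplier<Optional<String>>` as a stand-in for a checksum source are hypothetical names chosen here and are not Resolver API.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical illustration only; not part of Maven Resolver.
final class ExpectedChecksumChain {

    private ExpectedChecksumChain() {}

    /**
     * Consults the given sources strictly in list order and returns the first
     * non-empty "expected" checksum. Sources after the first answer are never
     * invoked, which is why a Provided answer avoids any remote lookup.
     */
    static Optional<String> firstAnswer(List<Supplier<Optional<String>>> sources) {
        for (Supplier<Optional<String>> source : sources) {
            Optional<String> answer = source.get();
            if (answer.isPresent()) {
                return answer; // "has an answer": stop here
            }
            // "no answer": fall through to the next strategy
        }
        return Optional.empty(); // no strategy could supply an expected checksum
    }
}
```

Passing the sources in the order Provided, Remote Included, Remote External reproduces the documented behavior: a Remote External lookup (an extra HTTP request) only happens when neither earlier source answers.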
diff --git a/src/site/markdown/included-checksum-strategies.md b/src/site/markdown/included-checksum-strategies.md
@@ -1,4 +1,4 @@
-# Included Checksum Strategies
+# Expected Checksums
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
@@ -18,34 +18,106 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-**Note: these below works only with transport-http, does NOT work with transport-wagon!**
+Checksums in Resolver were historically used during transport, 
+to ensure Artifact integrity. In addition, latest Resolver may 
+use checksums in various other ways too, for example to ensure 
+Artifact integrity during resolution. 
 
-By default, resolver will fetch the payload checksum from remote repository. These
-checksums are used to enforce transport validity (ensure that download was not 
-corrupted during transfer).
+Still, the essence of all checksum uses in Resolver is 
+"integrity validation": Resolver calculates by various
+means the "calculated" checksum (for given payload), 
+then obtains the "expected" checksum (for same payload)
+and compares the two.
 
-This implies, that to get one artifact or metadata, resolver 
-needs to issue two HTTP requests: one to get the payload itself, and one to 
-get the reference checksum.
+In essence and somewhat simplified, Resolver integrity validation looks like this:
+* hash the Artifact payload (file), this is the "calculated" checksum
+* obtain the Artifact "expected" checksum
+* compare the "calculated" checksum with "expected" checksum
 
-By using "included checksums" feature, we are able to halve the issued HTTP request 
-count, as many services along Maven Central emits the reference checksums in
-the artifact response itself (as HTTP headers), hence, we are able to get the
-artifact and reference checksum using only one HTTP round-trip.
+This page will cover all the "expected" checksums.
 
 
-## Sonatype Nexus 2
+## Transport Checksum Strategies
+
+Historically, the "obtain expected checksum" was implemented as simple HTTP GET 
+request against Artifact checksum URL (Artifact URL appended by ".sha1"). This logic 
+is still present in current Resolver, but is "decorated" and extended in multiple 
+ways.
+
+Resolver has broadened the "obtain" step for "expected" checksum with two new strategies,
+so the three expected checksum kinds in transport are: "Provided", "Remote Included" and 
+"Remote External". All these strategies represent the source of "expected" checksum
+as explained above, but they differ in **how** Resolver obtains them.
+
+The **Provided** kind of expected checksums are "provided" to resolver by some alternative
+means, possibly ahead of any transport operation. There is an SPI interface that users may
+implement to provide checksums to resolver in their own way, or they may use the out of the
+box implementation, which simply delegates to "trusted checksums" (more about them later).
+
+The **Remote Included** checksums are "included" by the remote party in some way, most typically
+in their response. Since the advent of modern Repository Managers, most of
+them already send checksums (usually the "standard" SHA-1 and MD5)
+in their response headers. Moreover, Maven Central, and even the Google Mirror of Maven Central,
+send these as well. By extracting these checksums from the response, we can get hashes
+that were provided by the remote repository along with its content.
+
+Finally, the **Remote External** checksums are the classic checksums we all know: they are laid down
+next to Artifact files (hence "external") on the remote repository (hence "remote"), according
+to the remote repository layout. To obtain a Remote External checksum, an HTTP request is
+made toward the remote repository.
+
+During single artifact retrieval, these strategies are executed in above specified order,
+and only if current strategy has "no answer", the next strategy is tried. Hence, if 
+resolver is able to get "expected" checksum from Provided Checksum Source, the Remote Included
+and Remote External sources will not be invoked at all.
+
+The big win here is that by obtaining hashes using "Remote Included" and not by "Remote External"
+strategy, we can halve the count of HTTP requests to download an Artifact.
+
+### Remote Included Strategies
+
+**Note: Remote Included checksums work only with transport-http, they do NOT work with transport-wagon!**
+
+By using the "Remote Included" checksum feature, we are able to halve the issued HTTP request
+count, since many repository services, Maven Central among them, emit the reference checksums in
+the artifact response itself (as HTTP headers). Hence, we are able to get the
+artifact and reference "expected" checksum using only one HTTP round-trip.
+
+
+#### Sonatype Nexus 2
 
 Sonatype Nexus 2 uses SHA-1 hash to generate `ETag` header in "shielded" (à la Plexus Cipher)
 way. Naturally, this means only SHA-1 is available in artifact response header.
 
 Emitted by: Sonatype Nexus2 only.
 
 
-## Non-standard `X-` headers
+#### Non-standard `X-` headers
 
 Maven Central emits headers `x-checksum-sha1` and `x-checksum-md5` along with artifact response. 
 Google GCS on the other hand uses `x-goog-meta-checksum-sha1` and `x-goog-meta-checksum-md5` 
 headers. Resolver will detect these and use their value.
 
 Emitted by: Maven Central, GCS, some CDNs and probably more.
+
+
+## Trusted Checksums
+
+All the "expected" checksums discussed above are transport bound: they are all
+about HTTP requests and responses, or require Transport related API elements.
+
+Trusted checksums are yet another SPI component, `TrustedChecksumsSource`, that is able
+to deliver "expected" checksums for a given Artifact, with the difference that the trusted
+checksums SPI is unrelated and not coupled to any transport API element.
+
+Since they map ideally onto the transport "Provided Checksum" kind, resolver provides a Provided
+Checksum Source implementation that simply delegates to Trusted Checksums (making
+Provided and Trusted checksums equivalent as far as transport is concerned).
+
+But the biggest game changer of Trusted Checksums is their transport agnosticism: they
+can participate in places where there is no transport happening at all.
+
+One such use of Trusted Checksums in Resolver is in ArtifactResolver "post processing".
+This new functionality, at the cost of checksumming overhead, is able to validate all
+the resolved artifacts against Trusted Checksums, thus making sure that all resolved
+artifacts are "validated" with some known (possibly even cryptographically strong) checksum.
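
Note on the non-standard `X-` headers listed in the "Remote Included Strategies" section above: the extraction step could be sketched in Java roughly as below. The four header names are the ones named in the documentation; everything else (the class `IncludedChecksumHeaders`, the method `extract`, and modelling response headers as a plain `Map<String, String>`) is an assumption made for illustration and is not how Resolver necessarily implements it. The Nexus 2 `ETag` form is deliberately left out, since its exact format is not described here.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Maven Resolver.
final class IncludedChecksumHeaders {

    // Lower-cased header name -> checksum algorithm name (case sensitive, as documented).
    private static final Map<String, String> KNOWN_HEADERS = new LinkedHashMap<>();

    static {
        KNOWN_HEADERS.put("x-checksum-sha1", "SHA-1");           // Maven Central
        KNOWN_HEADERS.put("x-checksum-md5", "MD5");              // Maven Central
        KNOWN_HEADERS.put("x-goog-meta-checksum-sha1", "SHA-1"); // Google GCS
        KNOWN_HEADERS.put("x-goog-meta-checksum-md5", "MD5");    // Google GCS
    }

    private IncludedChecksumHeaders() {}

    /**
     * Returns algorithm name -> "expected" checksum for every known checksum
     * header found in the response. Header names are matched case-insensitively;
     * unrelated headers are ignored.
     */
    static Map<String, String> extract(Map<String, String> responseHeaders) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : responseHeaders.entrySet()) {
            String algorithm = KNOWN_HEADERS.get(header.getKey().toLowerCase(Locale.ROOT));
            if (algorithm != null && header.getValue() != null) {
                // Keep the first value seen for an algorithm.
                result.putIfAbsent(algorithm, header.getValue().trim());
            }
        }
        return result;
    }
}
```

An empty result corresponds to the "no answer" case, in which the Remote External strategy would be tried next.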
diff --git a/src/site/site.xml b/src/site/site.xml
@@ -29,8 +29,8 @@ under the License.
       <item name="API Compatibility" href="api-compatibility.html"/>
       <item name="Configuration" href="configuration.html"/>
       <item name="About Checksums" href="about-checksums.html"/>
+      <item name="Expected Checksums" href="expected-checksums.html"/>
       <item name="About Local Repository" href="local-repository.html"/>
-      <item name="Included Checksum Strategies" href="included-checksum-strategies.html"/>
       <item name="Remote Repository Filtering" href="remote-repository-filtering.html"/>
       <item name="Maven 3.8.x" href="maven-3.8.x.html"/>
       <item name="JavaDocs" href="apidocs/index.html"/>