Document Expected Checksums in Resolver
cstamas committed Apr 22, 2023
1 parent 7d17bc3 commit 64e7b63
Showing 4 changed files with 224 additions and 21 deletions.
20 changes: 14 additions & 6 deletions src/site/markdown/about-checksums.md
@@ -39,7 +39,7 @@ of Maven Resolver: there is nothing secure being involved with checksums. Moreov
algorithm, but even for its "elder brother" MD5. Both algorithms are still widely used today as "transport integrity
validation" or "error detection" (aka "bit-rot detection").

## Checksum Changes
## Checksum Algorithms SPI

From a technical perspective, the facts above have the following consequence: because checksum algorithms are exposed
to the user and can be set via configuration, users are not prevented from asking for SHA-256 or even SHA-512, even if
@@ -50,21 +50,29 @@ wrong use case. The notion of transport validation and secure hashes are being c
reasons explained above.

Hence, the Maven Resolver team decided to limit the set of supported checksums. Instead of directly exposing
`MessageDigest` algorithms, we introduced an API around checksums. This not only prevents wrong use cases (not
`MessageDigest` algorithms, we introduced an SPI around checksums. This not only prevents wrong use cases (not
exposing all supported algorithms of `MessageDigest` to users), but also makes it possible to introduce real checksum
algorithms. Finally, the set of supported checksum algorithms remains extensible: if a required algorithm is
not provided by Resolver, it can easily be added by creating a factory component for it.

Resolver out of the box supports the following checksum algorithms:
We are aware that users started using "better SHA" algorithms, and we do not want to break them. Nothing for them
changes (configuration and everything basically remains the same). But, we do want to prevent any possible further
proliferation of non-standard checksums.

## Implemented Checksum Algorithms

Resolver out of the box provides the following checksum algorithms (note: algorithm names are case sensitive):

* MD5
* SHA-1
* SHA-256
* SHA-512

We are aware that users started using "better SHA" algorithms, and we do not want to break them. Nothing for them
changes (configuration and everything basically remains the same). But, we do want to prevent any possible further
proliferation of non-standard checksums.
The algorithms above are provided by Resolver by default, but using the SPI anyone can extend
Resolver with new types of checksum algorithms.
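
As an illustration of the "factory component" idea, here is a minimal sketch that adds a real (non-hash) checksum, CRC-32, behind a factory. The `Algorithm` and `AlgorithmFactory` interfaces below are hypothetical, simplified stand-ins written for this example; they are not the actual Resolver SPI types, and the `CRC32` name and `crc32` file extension are assumptions as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class Crc32FactorySketch {
    /** Hypothetical stand-in for one checksum calculation: fed with bytes, yields a string. */
    interface Algorithm {
        void update(byte[] input);

        String checksum();
    }

    /** Hypothetical stand-in for a factory component that creates fresh calculations. */
    interface AlgorithmFactory {
        String getName();

        String getFileExtension();

        Algorithm getAlgorithm();
    }

    /** A factory for CRC-32, which is an error-detecting checksum, not a MessageDigest hash. */
    static final class Crc32Factory implements AlgorithmFactory {
        public String getName() {
            return "CRC32"; // assumed name; remember that names are case sensitive
        }

        public String getFileExtension() {
            return "crc32"; // assumed extension for the checksum file
        }

        public Algorithm getAlgorithm() {
            CRC32 crc = new CRC32(); // new state for every calculation
            return new Algorithm() {
                public void update(byte[] input) {
                    crc.update(input);
                }

                public String checksum() {
                    // CRC-32 is 32 bits wide, so render it as 8 hex digits
                    return String.format("%08x", crc.getValue());
                }
            };
        }
    }

    public static void main(String[] args) {
        AlgorithmFactory factory = new Crc32Factory();
        Algorithm algorithm = factory.getAlgorithm();
        algorithm.update("123456789".getBytes(StandardCharsets.US_ASCII));
        System.out.println(factory.getName() + " (." + factory.getFileExtension() + "): " + algorithm.checksum());
    }
}
```

The factory hands out a new calculation on every call, so one registered component can serve many concurrent transfers without sharing state.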

To see how and when checksums are used in Resolver, continue to the [Expected Checksums](expected-checksums.html)
page.

Links:

123 changes: 123 additions & 0 deletions src/site/markdown/expected-checksums.md
@@ -0,0 +1,123 @@
# Expected Checksums
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

Checksums in Resolver were historically used during transport,
to ensure Artifact integrity. In addition, the latest Resolver may
use checksums in other ways too, for example to ensure
Artifact integrity during resolution.

Still, the essence of all checksum use in Resolver is
"integrity validation": Resolver computes the "calculated"
checksum (for a given payload) by various means,
then obtains the "expected" checksum (for the same payload)
and compares the two.

In essence and somewhat simplified, Resolver integrity validation looks like this:
* hash the Artifact payload (file), this is the "calculated" checksum
* obtain the Artifact "expected" checksum
* compare the "calculated" checksum with "expected" checksum
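
The three steps above can be sketched in plain Java. This is only an illustration built on the JDK's `MessageDigest`, not Resolver's actual implementation; the class and method names are made up for the example, and the payload is the well-known "abc" SHA-1 test vector.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class IntegrityCheckSketch {
    /** Step 1: hash the payload and return the "calculated" checksum as lowercase hex. */
    static String calculate(byte[] payload, String algorithm) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(payload)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("Unsupported algorithm: " + algorithm, e);
        }
    }

    /** Step 3: compare the two, ignoring hex case and surrounding whitespace. */
    static boolean matches(String calculated, String expected) {
        return calculated.equalsIgnoreCase(expected.trim());
    }

    public static void main(String[] args) {
        byte[] payload = "abc".getBytes(StandardCharsets.UTF_8);
        String calculated = calculate(payload, "SHA-1");
        // Step 2: in real life the "expected" checksum comes from one of the sources described on this page
        String expected = "a9993e364706816aba3e25717850c26c9cd0d89d";
        System.out.println(matches(calculated, expected) ? "OK" : "MISMATCH");
    }
}
```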

This page will cover all the "expected" checksums.


## Transport Checksum Strategies

Historically, the "obtain expected checksum" step was implemented as a simple HTTP GET
request against the Artifact checksum URL (the Artifact URL with ".sha1" appended). This logic
is still present in current Resolver, but is "decorated" and extended in multiple
ways.

Resolver has broadened the "obtain" step for the "expected" checksum with two new strategies,
so the three expected checksum kinds in transport are: "Provided", "Remote Included" and
"Remote External". Each of these strategies represents a source of the "expected" checksum
as explained above; they differ in **how** Resolver obtains it.

The **Provided** kind of expected checksums are "provided" to Resolver by some alternative
means, possibly ahead of any transport operation. There is an SPI interface that users may
implement to provide checksums to Resolver in their own way, or they may use the out of the
box implementation, which simply delegates the call to "trusted checksums" (more about them later).

The **Remote Included** checksums are "included" by the remote party in some way, most typically
in its response. Since the advent of modern Repository Managers, most of
them already send checksums (usually the "standard" SHA-1 and MD5)
in their response headers. Moreover, Maven Central, and even the Google Mirror of Maven Central,
send these as well. By extracting these checksums from the response, we can get hashes
that were provided by the remote repository along with its content.

Finally, the **Remote External** checksums are the classic checksums we all know: they are laid down
next to Artifact files (hence "external") on the remote repository (hence "remote"), according
to the remote repository layout. To obtain a Remote External checksum, an HTTP request is
made to the remote repository.

During a single artifact retrieval, these strategies are executed in the order specified above,
and the next strategy is tried only if the current strategy has "no answer". Hence, if
Resolver is able to get the "expected" checksum from the Provided Checksum Source, the Remote Included
and Remote External sources will not be invoked at all.
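
The "first strategy with an answer wins" ordering can be sketched as follows. This is a simplified model, not Resolver code: each strategy is represented by a hypothetical `Function` from an artifact path to an optional checksum, and the path and checksum values are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

final class ExpectedChecksumChainSketch {
    /** Tries the strategies in list order and stops at the first one that has an answer. */
    static Optional<String> obtainExpected(
            List<Function<String, Optional<String>>> strategies, String artifactPath) {
        for (Function<String, Optional<String>> strategy : strategies) {
            Optional<String> answer = strategy.apply(artifactPath);
            if (answer.isPresent()) {
                return answer; // later strategies are not invoked at all
            }
        }
        return Optional.empty(); // no strategy had an answer
    }

    public static void main(String[] args) {
        List<String> invoked = new ArrayList<>();
        List<Function<String, Optional<String>>> strategies = Arrays.asList(
                path -> { // Provided: nothing known for this artifact
                    invoked.add("Provided");
                    return Optional.<String>empty();
                },
                path -> { // Remote Included: checksum came along with the response
                    invoked.add("Remote Included");
                    return Optional.of("a9993e364706816aba3e25717850c26c9cd0d89d");
                },
                path -> { // Remote External: would cost an extra HTTP request
                    invoked.add("Remote External");
                    return Optional.of("a9993e364706816aba3e25717850c26c9cd0d89d");
                });

        Optional<String> expected = obtainExpected(strategies, "org/example/demo/1.0/demo-1.0.jar");
        System.out.println("expected = " + expected.orElse("<none>"));
        System.out.println("invoked  = " + invoked);
    }
}
```

In this run the Remote External strategy is never called, which is exactly the saved HTTP request.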

The big win here is that by obtaining hashes using the "Remote Included" strategy instead of the
"Remote External" strategy, we can halve the count of HTTP requests needed to download an Artifact.

### Remote Included Strategies

**Note: Remote Included checksums work only with transport-http, they do NOT work with transport-wagon!**

By using the "Remote Included" checksum feature, we are able to halve the issued HTTP request
count, since many repository services, along with Maven Central, emit the reference checksums in
the artifact response itself (as HTTP headers). Hence, we are able to get the
artifact and the reference "expected" checksum using only one HTTP round-trip.


#### Sonatype Nexus 2

Sonatype Nexus 2 uses the SHA-1 hash to generate the `ETag` header in a "shielded" (à la Plexus Cipher)
way. Naturally, this means only SHA-1 is available in the artifact response header.

Emitted by: Sonatype Nexus 2 only.


#### Non-standard `X-` headers

Maven Central emits the headers `x-checksum-sha1` and `x-checksum-md5` along with the artifact response.
Google GCS, on the other hand, uses the `x-goog-meta-checksum-sha1` and `x-goog-meta-checksum-md5`
headers. Resolver will detect these and use their values.
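
A minimal sketch of such header detection, assuming the response headers are available as a simple name-to-value map. Only the four header names come from the text above; the class, the method names and the lookup order are illustrative and do not claim to mirror Resolver's own extractor.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class IncludedChecksumHeadersSketch {
    /** Returns algorithm name to checksum for every known checksum header in the response. */
    static Map<String, String> extract(Map<String, String> responseHeaders) {
        // HTTP header names are case-insensitive, so normalize them before lookup
        Map<String, String> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, String> header : responseHeaders.entrySet()) {
            normalized.put(header.getKey().toLowerCase(Locale.ROOT), header.getValue().trim());
        }
        Map<String, String> checksums = new LinkedHashMap<>();
        putFirstPresent(checksums, "SHA-1", normalized, "x-checksum-sha1", "x-goog-meta-checksum-sha1");
        putFirstPresent(checksums, "MD5", normalized, "x-checksum-md5", "x-goog-meta-checksum-md5");
        return checksums;
    }

    /** Stores the value of the first header name that is present and non-empty. */
    private static void putFirstPresent(
            Map<String, String> checksums, String algorithm, Map<String, String> headers, String... names) {
        for (String name : names) {
            String value = headers.get(name);
            if (value != null && !value.isEmpty()) {
                checksums.put(algorithm, value);
                return;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "application/java-archive");
        headers.put("X-Checksum-SHA1", "a9993e364706816aba3e25717850c26c9cd0d89d");
        headers.put("X-Checksum-MD5", "900150983cd24fb0d6963f7d28e17f72");
        System.out.println(extract(headers));
    }
}
```

An empty result corresponds to the "no answer" case, after which the Remote External strategy would be tried.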

Emitted by: Maven Central, GCS, some CDNs and probably more.


## Trusted Checksums

All the "expected" checksums discussed above are transport bound: they are all
about HTTP requests and responses, or require Transport related API elements.

Trusted checksums are yet another SPI component, `TrustedChecksumsSource`, that is able
to deliver "expected" checksums for a given Artifact, with the difference that the trusted
checksums SPI is not coupled to any transport API element.

Since they map ideally onto the transport "Provided Checksum" kind, Resolver provides a Provided
Checksum Source implementation that simply delegates to Trusted Checksums (making
Provided and Trusted checksums equivalent as far as transport is concerned).

But the biggest game changer of Trusted Checksums is their transport agnosticism: they
can participate in places where there is no transport happening at all.

One such use of Trusted Checksums in Resolver is in ArtifactResolver "post processing".
This new functionality, at the cost of checksumming overhead, is able to validate all
the resolved artifacts against Trusted Checksums, thus making sure that all resolved
artifacts are "validated" with some known (possibly even cryptographically strong) checksum.
100 changes: 86 additions & 14 deletions src/site/markdown/included-checksum-strategies.md
@@ -1,4 +1,4 @@
# Included Checksum Strategies
# Expected Checksums
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -18,34 +18,106 @@ specific language governing permissions and limitations
under the License.
-->

**Note: these below works only with transport-http, does NOT work with transport-wagon!**
Checksums in Resolver were historically used during transport,
to ensure Artifact integrity. In addition, the latest Resolver may
use checksums in other ways too, for example to ensure
Artifact integrity during resolution.

By default, resolver will fetch the payload checksum from remote repository. These
checksums are used to enforce transport validity (ensure that download was not
corrupted during transfer).
Still, the essence of all checksum use in Resolver is
"integrity validation": Resolver computes the "calculated"
checksum (for a given payload) by various means,
then obtains the "expected" checksum (for the same payload)
and compares the two.

This implies, that to get one artifact or metadata, resolver
needs to issue two HTTP requests: one to get the payload itself, and one to
get the reference checksum.
In essence and somewhat simplified, Resolver integrity validation looks like this:
* hash the Artifact payload (file), this is the "calculated" checksum
* obtain the Artifact "expected" checksum
* compare the "calculated" checksum with "expected" checksum

By using "included checksums" feature, we are able to halve the issued HTTP request
count, as many services along Maven Central emits the reference checksums in
the artifact response itself (as HTTP headers), hence, we are able to get the
artifact and reference checksum using only one HTTP round-trip.
This page will cover all the "expected" checksums.


## Sonatype Nexus 2
## Transport Checksum Strategies

Historically, the "obtain expected checksum" step was implemented as a simple HTTP GET
request against the Artifact checksum URL (the Artifact URL with ".sha1" appended). This logic
is still present in current Resolver, but is "decorated" and extended in multiple
ways.

Resolver has broadened the "obtain" step for the "expected" checksum with two new strategies,
so the three expected checksum kinds in transport are: "Provided", "Remote Included" and
"Remote External". Each of these strategies represents a source of the "expected" checksum
as explained above; they differ in **how** Resolver obtains it.

The **Provided** kind of expected checksums are "provided" to Resolver by some alternative
means, possibly ahead of any transport operation. There is an SPI interface that users may
implement to provide checksums to Resolver in their own way, or they may use the out of the
box implementation, which simply delegates the call to "trusted checksums" (more about them later).

The **Remote Included** checksums are "included" by the remote party in some way, most typically
in its response. Since the advent of modern Repository Managers, most of
them already send checksums (usually the "standard" SHA-1 and MD5)
in their response headers. Moreover, Maven Central, and even the Google Mirror of Maven Central,
send these as well. By extracting these checksums from the response, we can get hashes
that were provided by the remote repository along with its content.

Finally, the **Remote External** checksums are the classic checksums we all know: they are laid down
next to Artifact files (hence "external") on the remote repository (hence "remote"), according
to the remote repository layout. To obtain a Remote External checksum, an HTTP request is
made to the remote repository.

During a single artifact retrieval, these strategies are executed in the order specified above,
and the next strategy is tried only if the current strategy has "no answer". Hence, if
Resolver is able to get the "expected" checksum from the Provided Checksum Source, the Remote Included
and Remote External sources will not be invoked at all.

The big win here is that by obtaining hashes using the "Remote Included" strategy instead of the
"Remote External" strategy, we can halve the count of HTTP requests needed to download an Artifact.

### Remote Included Strategies

**Note: Remote Included checksums work only with transport-http, they do NOT work with transport-wagon!**

By using the "Remote Included" checksum feature, we are able to halve the issued HTTP request
count, since many repository services, along with Maven Central, emit the reference checksums in
the artifact response itself (as HTTP headers). Hence, we are able to get the
artifact and the reference "expected" checksum using only one HTTP round-trip.


#### Sonatype Nexus 2

Sonatype Nexus 2 uses the SHA-1 hash to generate the `ETag` header in a "shielded" (à la Plexus Cipher)
way. Naturally, this means only SHA-1 is available in the artifact response header.

Emitted by: Sonatype Nexus 2 only.


## Non-standard `X-` headers
#### Non-standard `X-` headers

Maven Central emits the headers `x-checksum-sha1` and `x-checksum-md5` along with the artifact response.
Google GCS, on the other hand, uses the `x-goog-meta-checksum-sha1` and `x-goog-meta-checksum-md5`
headers. Resolver will detect these and use their values.

Emitted by: Maven Central, GCS, some CDNs and probably more.


## Trusted Checksums

All the "expected" checksums discussed above are transport bound: they are all
about HTTP requests and responses, or require Transport related API elements.

Trusted checksums are yet another SPI component, `TrustedChecksumsSource`, that is able
to deliver "expected" checksums for a given Artifact, with the difference that the trusted
checksums SPI is not coupled to any transport API element.

Since they map ideally onto the transport "Provided Checksum" kind, Resolver provides a Provided
Checksum Source implementation that simply delegates to Trusted Checksums (making
Provided and Trusted checksums equivalent as far as transport is concerned).

But the biggest game changer of Trusted Checksums is their transport agnosticism: they
can participate in places where there is no transport happening at all.

One such use of Trusted Checksums in Resolver is in ArtifactResolver "post processing".
This new functionality, at the cost of checksumming overhead, is able to validate all
the resolved artifacts against Trusted Checksums, thus making sure that all resolved
artifacts are "validated" with some known (possibly even cryptographically strong) checksum.
2 changes: 1 addition & 1 deletion src/site/site.xml
@@ -29,8 +29,8 @@ under the License.
<item name="API Compatibility" href="api-compatibility.html"/>
<item name="Configuration" href="configuration.html"/>
<item name="About Checksums" href="about-checksums.html"/>
<item name="Expected Checksums" href="expected-checksums.html"/>
<item name="About Local Repository" href="local-repository.html"/>
<item name="Included Checksum Strategies" href="included-checksum-strategies.html"/>
<item name="Remote Repository Filtering" href="remote-repository-filtering.html"/>
<item name="Maven 3.8.x" href="maven-3.8.x.html"/>
<item name="JavaDocs" href="apidocs/index.html"/>