v0.0.2

@github-actions github-actions released this 06 Aug 15:34
· 143 commits to main since this release
1f355dd

This release involves some refactoring: different parts of the indexer now live in their own modules.
That made it easier to write unit tests for each resource, so I've now done that, along with two integration tests.
The tests only cover the basics for now; I expect to expand them in future to check error handling and other cases.

The page record indexer now only indexes records that meet a set of conditions intended to identify web documents.
Unfortunately the WACZ spec does not define what a page is in terms we can use here, so I have come up with the following conditions:

  • The WARC record type is one of Response, Revisit, or Resource.
  • The HTTP content-type is one of text/html, application/xhtml+xml, or text/plain.
  • The HTTP status code is 200 OK.

This is an imperfect best-guess attempt to pick out things which might be pages from a WARC file.
I filter for successful status codes because I realised that some failed requests return an HTML page in the response body along with a 404 status.
Those are definitely pages, but I suspect they're not what people want from the pages.jsonl index.
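
The three conditions above can be sketched roughly as follows. This is an illustration, not the indexer's actual code: `WarcRecordType`, `RecordSummary`, and `is_page_candidate` are hypothetical names. It also assumes that content-type parameters such as `; charset=utf-8` are ignored, that the comparison is case-insensitive, and that all three conditions apply to every record type.

```rust
/// WARC record types used in this sketch (hypothetical enum, not the indexer's own type).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WarcRecordType {
    Response,
    Revisit,
    Resource,
    Request,
    Metadata,
}

/// Hypothetical summary of the fields the filter needs from one record.
struct RecordSummary<'a> {
    record_type: WarcRecordType,
    /// Raw HTTP Content-Type header value, if present.
    content_type: Option<&'a str>,
    /// HTTP status code, if the record carries one.
    status: Option<u16>,
}

/// Media types treated as "page-like" in the release notes.
const PAGE_MEDIA_TYPES: [&str; 3] = ["text/html", "application/xhtml+xml", "text/plain"];

/// Returns true only if the record meets all three conditions.
fn is_page_candidate(record: &RecordSummary) -> bool {
    // Condition 1: record type is Response, Revisit, or Resource.
    let type_ok = matches!(
        record.record_type,
        WarcRecordType::Response | WarcRecordType::Revisit | WarcRecordType::Resource
    );

    // Condition 2: media type is in the allowed list.
    // Assumption: drop any parameters after ';' and compare case-insensitively.
    let content_ok = record.content_type.map_or(false, |value| {
        let media_type = value
            .split(';')
            .next()
            .unwrap_or("")
            .trim()
            .to_ascii_lowercase();
        PAGE_MEDIA_TYPES.iter().any(|allowed| *allowed == media_type)
    });

    // Condition 3: status code is exactly 200.
    let status_ok = record.status == Some(200);

    type_ok && content_ok && status_ok
}
```

With this sketch, a Response record with `text/html; charset=utf-8` and status 200 passes, while the same record with status 404 is rejected, which is the HTML error page case described above.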

I made a brief attempt to replace SHA-256 with the faster BLAKE3 hashing algorithm, but this breaks compatibility with py-wacz.
I think this will have to wait until BLAKE3 is integrated into the Python standard library as part of hashlib.

Dependencies

  • This library now depends on surt-rs to create searchable URL strings. It's a fairly minimal library, yet more comprehensive than my own attempt to write a surt-ing function.
  • Bump rawzip to 0.3 (#41), thanks @nickbabcock!