Release v0.0.2 · bodleian/wacksy

This release involves some refactoring: different parts of the indexer are now in their own modules.
As a result, it was easier to write unit tests for each resource, so I've now done that, along with two integration tests.
The tests just cover the basics; I expect to expand them in future to cover error handling and other cases.

The page record indexer now only indexes records that meet a set of conditions intended to identify web documents.
Unfortunately the WACZ spec does not define what a page is in terms we can use here, so I have come up with the following conditions:

The WARC record type is Response, Revisit, or Resource.
The HTTP content-type is text/html, application/xhtml+xml, or text/plain.
The HTTP status code is 200 OK.

This is an imperfect best-guess attempt to pick out things which might be pages from a WARC file.
I filter for successful status codes because I realised that some failed requests return an HTML page in the response body along with a 404 error.
Those are definitely pages, but I guess they're not what people want out of the pages.jsonl index.
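
As a rough sketch, the three conditions above could be expressed as a single predicate. The `RecordType` enum and the `is_page` function are hypothetical names used for illustration and are not wacksy's actual API. Stripping parameters such as `; charset=utf-8` before comparing the content type is also an assumption, not something stated in these notes.

```rust
/// WARC record types relevant to this sketch.
/// Hypothetical enum for illustration; not wacksy's own type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecordType {
    Response,
    Revisit,
    Resource,
    Request,
    Metadata,
}

/// Returns true when a record meets all three page conditions.
fn is_page(record_type: RecordType, content_type: &str, status: u16) -> bool {
    // Condition 1: the record type is Response, Revisit, or Resource.
    let type_ok = matches!(
        record_type,
        RecordType::Response | RecordType::Revisit | RecordType::Resource
    );

    // Condition 2: the media type is one of the three document types.
    // Assumption: parameters after ';' are ignored and case is normalised.
    let media_type = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();
    let content_ok = matches!(
        media_type.as_str(),
        "text/html" | "application/xhtml+xml" | "text/plain"
    );

    // Condition 3: the HTTP status code is exactly 200.
    type_ok && content_ok && status == 200
}
```

With this sketch, `is_page(RecordType::Response, "text/html", 404)` returns `false`, so an HTML error page served with a 404 would be left out of the index.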

I made a brief attempt to replace SHA-256 with the faster BLAKE3 hashing algorithm, but this breaks compatibility with py-wacz.
I think this will have to wait until BLAKE3 is integrated into the Python standard library as part of hashlib.

Dependencies

This library now depends on surt-rs to create searchable URL strings (SURTs). It's a fairly minimal library, but it is more comprehensive than my own attempt to write a SURT-ing function.
Bump rawzip to 0.3 (#41), thanks @nickbabcock!
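
For readers unfamiliar with SURT strings, here is a deliberately simplified sketch of the idea: the host labels are reversed and joined with commas, so that URLs from the same domain sort next to each other. The `simple_surt` function is a hypothetical helper written for illustration. It is not the surt-rs API, and it skips the canonicalisation a real implementation performs (userinfo, query strings without a path, and so on).

```rust
/// Builds a simplified SURT-style key from a URL.
/// Hypothetical helper for illustration; not the surt-rs API.
fn simple_surt(url: &str) -> Option<String> {
    // Drop the scheme ("http://", "https://", ...).
    let (_, rest) = url.split_once("://")?;

    // Split the authority from the path, defaulting to "/" when there is none.
    let (authority, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };

    // Ignore any port and lowercase the host.
    let host = authority.split(':').next()?.to_lowercase();
    if host.is_empty() {
        return None;
    }

    // Reverse the host labels: "example.com" becomes "com,example".
    let reversed: Vec<&str> = host.split('.').rev().collect();
    Some(format!("{}){}", reversed.join(","), path))
}
```

For example, `simple_surt("https://Example.com/index.html")` returns `Some("com,example)/index.html".to_string())`.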
