v0.0.1

@github-actions github-actions released this 20 Jun 09:16
· 167 commits to main since this release
c8fab4c

As of this release, the WACZ writer and indexer can produce (almost) everything needed to turn a WARC file into a fully spec-compliant WACZ file.
The last missing piece was the pages.jsonl file, which the indexer now produces while reading through the WARC file.
I want to avoid reading through the WARC twice to produce two files, so I have wrapped everything into one indexer. Again, there's probably a better way of doing this.

The other happy change in this release is removing the code duplication in the WARC reader between the gzipped and non-gzipped cases.
This is the first time I've tried using type generics in Rust; the code is messy, but it works.
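As a rough illustration of the idea, here is a minimal sketch of one record iterator that is generic over its reader, so the same code path serves both plain and decoded input. All names here (`RecordIter`, `XorDecoder`, `collect_records`) are hypothetical and not taken from this crate. The XOR wrapper is only a toy stand-in for a real decompressing reader, chosen so the example needs nothing beyond the standard library, and one "record" is simplified to one line of text.

```rust
use std::io::{BufRead, BufReader, Cursor, Read};

/// Hypothetical record iterator, generic over any buffered reader.
/// For simplicity, one "record" here is just one line of text.
struct RecordIter<R: BufRead> {
    reader: R,
}

impl<R: BufRead> RecordIter<R> {
    fn new(reader: R) -> Self {
        Self { reader }
    }
}

impl<R: BufRead> Iterator for RecordIter<R> {
    type Item = std::io::Result<String>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut line = String::new();
        match self.reader.read_line(&mut line) {
            Ok(0) => None, // end of input
            Ok(_) => Some(Ok(line.trim_end().to_string())),
            Err(e) => Some(Err(e)),
        }
    }
}

/// Toy stand-in for a decompressing reader (assumption: a real gzip
/// decoder would sit here). It XORs every byte with a fixed key.
struct XorDecoder<R: Read> {
    inner: R,
    key: u8,
}

impl<R: Read> Read for XorDecoder<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let n = self.inner.read(buf)?;
        for b in &mut buf[..n] {
            *b ^= self.key;
        }
        Ok(n)
    }
}

/// One function for both cases: the caller decides which reader to pass in.
fn collect_records<R: BufRead>(reader: R) -> std::io::Result<Vec<String>> {
    RecordIter::new(reader).collect()
}

fn main() -> std::io::Result<()> {
    let plain = b"record-1\nrecord-2\n".to_vec();
    let encoded: Vec<u8> = plain.iter().map(|b| b ^ 0x2a).collect();

    let from_plain = collect_records(Cursor::new(plain))?;
    let decoder = XorDecoder { inner: Cursor::new(encoded), key: 0x2a };
    let from_encoded = collect_records(BufReader::new(decoder))?;

    // Both inputs yield the same records through the same iterator.
    assert_eq!(from_plain, from_encoded);
    println!("{from_plain:?}");
    Ok(())
}
```

The point of the bound `R: BufRead` is that the iterator never needs to know whether its bytes came straight from a file or through a decoder, which is what removes the need for two separate iterators.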

Added

  • (indexer) Use type generics to eliminate code duplication when iterating through records. This finally gets rid of an awkward situation where I had to maintain two separate iterators.
  • add a pages indexer to the WACZ writer, with a struct for page records. This is the main change in this release.

Fixed

  • add a newline to each page record, as needed for the pages.jsonl format (closes #37); a nice and easy change
  • (indexer) skip serialising null fields in page record
  • (datapackage) pass cdxj_index_bytes through to the datapackage
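
The first two fixes above can be pictured with a small standard-library-only sketch: each page record becomes one JSON object on its own line, and an optional field that is absent is left out instead of being written as null. The `PageRecord` struct, its field names and the hand-written serialiser are assumptions for illustration, not this crate's code. With serde, the same omission is usually expressed with the `#[serde(skip_serializing_if = "Option::is_none")]` field attribute.

```rust
/// Hypothetical page record; the field names are illustrative only.
struct PageRecord {
    url: String,
    ts: String,
    title: Option<String>,
}

/// Minimal JSON string escaping, enough for this sketch.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl PageRecord {
    /// Serialise to one JSON Lines entry: a single object plus a newline.
    /// A `None` title is skipped rather than emitted as `"title":null`.
    fn to_jsonl_line(&self) -> String {
        let mut fields = vec![
            format!("\"url\":{}", json_string(&self.url)),
            format!("\"ts\":{}", json_string(&self.ts)),
        ];
        if let Some(title) = &self.title {
            fields.push(format!("\"title\":{}", json_string(title)));
        }
        format!("{{{}}}\n", fields.join(","))
    }
}

fn main() {
    let pages = [
        PageRecord {
            url: "https://example.com/".to_string(),
            ts: "2024-01-01T00:00:00Z".to_string(),
            title: Some("Example".to_string()),
        },
        PageRecord {
            url: "https://example.com/untitled".to_string(),
            ts: "2024-01-01T00:00:01Z".to_string(),
            title: None,
        },
    ];
    // Because every line ends in '\n', concatenation yields valid JSON Lines.
    let body: String = pages.iter().map(PageRecord::to_jsonl_line).collect();
    print!("{body}");
}
```

Without the newline, consecutive records would run together on one line and the file would no longer parse line by line.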

Other

Lots more small documentation and README changes and additions, plus code refactoring.

  • (indexer) use the core library instead of the standard library for error formatting
  • add serde features to dependencies, update cargofile
  • (datapackage) move compose_datapackage into datapackage implementation
  • (datapackage) DataPackageResource::new now returns a Result rather than panicking
  • (indexer) use httparse to parse http status code from response and remove the happily redundant cut_http_headers_from_record function
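
Two of the changes above, formatting errors through `core` and having a constructor return a `Result` instead of panicking, fit together in a short sketch. The `Resource` and `ResourceError` types and the validation rules are hypothetical stand-ins. The real `DataPackageResource::new` signature is not shown in these notes and may well differ.

```rust
use core::fmt;

/// Hypothetical error type for building a resource entry.
#[derive(Debug)]
enum ResourceError {
    EmptyPath,
    EmptyContent,
}

// Display is implemented via `core::fmt`, which `std::fmt` re-exports,
// so the formatting code itself does not depend on `std`.
impl fmt::Display for ResourceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ResourceError::EmptyPath => write!(f, "resource path is empty"),
            ResourceError::EmptyContent => write!(f, "resource content is empty"),
        }
    }
}

impl std::error::Error for ResourceError {}

/// Hypothetical stand-in for a datapackage resource entry.
#[derive(Debug)]
struct Resource {
    path: String,
    bytes: usize,
}

impl Resource {
    /// Returns an error for invalid input instead of panicking,
    /// leaving the caller to decide how to handle it.
    fn new(path: &str, content: &[u8]) -> Result<Self, ResourceError> {
        if path.is_empty() {
            return Err(ResourceError::EmptyPath);
        }
        if content.is_empty() {
            return Err(ResourceError::EmptyContent);
        }
        Ok(Self {
            path: path.to_string(),
            bytes: content.len(),
        })
    }
}

fn main() {
    match Resource::new("indexes/index.cdxj", b"example index bytes") {
        Ok(r) => println!("{} ({} bytes)", r.path, r.bytes),
        Err(e) => eprintln!("could not build resource: {e}"),
    }
    if let Err(e) = Resource::new("", b"data") {
        eprintln!("could not build resource: {e}");
    }
}
```

Returning a `Result` lets callers propagate the failure with `?` rather than having the whole process abort on bad input.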