v0.0.1
As of this release, the WACZ writer and indexer can produce (almost) everything needed to turn a WARC file into a fully spec-compliant WACZ file.
The last thing missing was the pages.jsonl file, which is now produced when reading through the WARC file as part of the indexer.
I want to avoid reading through the WARC twice to produce two files, so I have wrapped everything into one indexer. Again, there's probably a better way of doing this.
The other happy change in this release is removing the code duplication in the WARC reader between gzipped and non-gzipped files.
This is the first time I've tried using type generics in Rust; the code is messy, but it works.
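As a rough illustration of the generics idea described above, here is a minimal sketch rather than the crate's actual code. `LineIter` is a hypothetical name, and using `std::io::Read` as the trait bound is an assumption. The point is that one iterator type can sit on top of any reader, so a plain file and a decompressing wrapper no longer need separate iterators.

```rust
use std::io::{BufRead, BufReader, Read};

/// Hypothetical iterator that yields lines from any `Read` source.
/// The same type works for a plain file, an in-memory buffer, or a
/// gzip-decoding wrapper, as long as the source implements `Read`.
struct LineIter<R: Read> {
    reader: BufReader<R>,
}

impl<R: Read> LineIter<R> {
    fn new(source: R) -> Self {
        Self {
            reader: BufReader::new(source),
        }
    }
}

impl<R: Read> Iterator for LineIter<R> {
    type Item = std::io::Result<String>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut line = String::new();
        match self.reader.read_line(&mut line) {
            Ok(0) => None,           // zero bytes read: end of input
            Ok(_) => Some(Ok(line)), // line keeps its trailing newline
            Err(e) => Some(Err(e)),
        }
    }
}
```

With this shape, `LineIter::new(file)` and `LineIter::new(gzip_decoder)` share one implementation, provided the decoder type implements `Read`. A real WARC reader would yield parsed records rather than raw lines.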
Added
- (indexer) use type generics to eliminate code duplication when iterating through records. This finally gets rid of an awkward situation where I was having to maintain two separate iterators.
- add pages indexer to wacz writer, with a struct for page records. This is the main thing in this release.
Fixed
- add newline to page records, as needed for the pages.jsonl format (closes #37); a nice and easy change
- (indexer) skip serialising null fields in page record
- (datapackage) pass cdxj_index_bytes through to the datapackage
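The two page-record fixes above (omitting null fields and ending each record with a newline) can be sketched as follows. This is a dependency-free stand-in, not the crate's code: the crate uses serde, where the usual way to get the same effect is `#[serde(skip_serializing_if = "Option::is_none")]` on optional fields. The `PageRecord` fields and the `to_jsonl_line` and `json_escape` helpers are hypothetical names chosen for illustration.

```rust
/// Hypothetical page record; the field names are illustrative only.
struct PageRecord {
    url: String,
    ts: String,
    title: Option<String>,
}

/// Escapes the characters that must be escaped inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl PageRecord {
    /// Renders one JSON Lines entry. `None` fields are left out instead of
    /// being written as `null`, and the entry ends with a newline so that
    /// consecutive records land on separate lines.
    fn to_jsonl_line(&self) -> String {
        let mut line = format!(
            "{{\"url\":\"{}\",\"ts\":\"{}\"",
            json_escape(&self.url),
            json_escape(&self.ts)
        );
        if let Some(title) = &self.title {
            line.push_str(&format!(",\"title\":\"{}\"", json_escape(title)));
        }
        line.push_str("}\n");
        line
    }
}
```

Writing the newline as part of each record means the writer can simply append records one after another without tracking whether a separator is needed.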
Other
Lots more small documentation and README changes and additions, code refactoring, etc.
- (indexer) use core instead of std for error formatting
- add serde features to dependencies, update cargofile
- (datapackage) move compose_datapackage into datapackage implementation
- (datapackage) DataPackageResource::new now returns a Result rather than panicking
- (indexer) use httparse to parse the HTTP status code from the response and remove the now happily redundant cut_http_headers_from_record function
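Two of the items above, a constructor that returns a `Result` instead of panicking and error formatting built on `core`, follow a common pattern that can be sketched like this. The names `Resource`, `ResourceError`, and the empty-path check are hypothetical. They are not the real `DataPackageResource` API, whose fields and failure cases are not shown in these notes.

```rust
use core::fmt;

/// Hypothetical error type; not the crate's actual error enum.
#[derive(Debug, PartialEq)]
enum ResourceError {
    EmptyPath,
}

// `Display` only needs `core::fmt`, so the formatting code has no `std` dependency.
impl fmt::Display for ResourceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ResourceError::EmptyPath => write!(f, "resource path must not be empty"),
        }
    }
}

// The `Error` trait is taken from `std` here; newer toolchains also expose it
// as `core::error::Error`.
impl std::error::Error for ResourceError {}

/// Hypothetical resource entry, standing in for a datapackage resource.
#[derive(Debug, PartialEq)]
struct Resource {
    path: String,
    bytes: usize,
}

impl Resource {
    /// Returns an error for invalid input instead of panicking, so the
    /// caller decides how to handle the failure.
    fn new(path: &str, data: &[u8]) -> Result<Self, ResourceError> {
        if path.is_empty() {
            return Err(ResourceError::EmptyPath);
        }
        Ok(Self {
            path: path.to_string(),
            bytes: data.len(),
        })
    }
}
```

Callers can then propagate the failure with `?` or match on the error variant, rather than having the whole program abort inside the constructor.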