Commit ef167da

Author: Damian Stewart
add note that "offset" is unstable
Signed-off-by: Damian Stewart <ot@damianstewart.com>
1 parent e5086d4 commit ef167da

1 file changed: +7 -2 lines changed

README.md

Lines changed: 7 additions & 2 deletions
@@ -389,9 +389,14 @@ Next, we use the `cdxt` command `warc` to retrieve the content and save it local
 * By default `cdxt` avoids overwriting existing files by automatically incrementing the counter in the filename. If you run this again without deleting `TEST-000000.extracted.warc.gz`, the data will be written again to a new file `TEST-000001.extracted.warc.gz`.
 * Limit, timestamp, and crawl index args, as well as URL wildcards, work as for `iter`.
 
-### Indexing the WARC and viewing its contents
+### Indexing the WARC
 
-Finally, we run `cdxj-indexer` on this new WARC to make a CDXJ index of it as in Task 3, and then iterate over the WARC using `warcio-iterator.py` as in Task 2.
+We now run `cdxj-indexer` on our new `TEST-000000.extracted.warc.gz` to make a CDXJ index of it as in Task 3.
+* Note that because the WARC includes metadata that is dynamically generated, you may see a slightly different value for `offset` than the one shown in the output above.
+
+### View the CDXJ index
+
+Finally, we iterate over the WARC using `warcio-iterator.py` as in Task 2.
 
 ## Task 7: Find the right part of the columnar index
 
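The note added in this commit says that `offset` may differ between runs. As a rough illustration of what that means when checking your own output against the README, here is a minimal Python sketch that compares two CDXJ index lines while ignoring `offset`. It assumes the usual CDXJ layout of a URL key, a timestamp, and a JSON object separated by single spaces. The helper names and the two sample lines are hypothetical and are not taken from the README or from real `cdxj-indexer` output.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (urlkey, timestamp, fields dict)."""
    # Only the first two spaces are separators; the JSON payload may contain spaces.
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def same_record_ignoring(line_a, line_b, ignored=("offset",)):
    """Return True if two CDXJ lines match once the ignored fields are dropped."""
    key_a, ts_a, fields_a = parse_cdxj_line(line_a)
    key_b, ts_b, fields_b = parse_cdxj_line(line_b)
    for name in ignored:
        fields_a.pop(name, None)
        fields_b.pop(name, None)
    return (key_a, ts_a, fields_a) == (key_b, ts_b, fields_b)


# Hypothetical sample lines that differ only in "offset".
EXPECTED = (
    'org,example)/ 20240101000000 {"url": "https://example.org/", '
    '"status": "200", "length": "1234", "offset": "379", '
    '"filename": "TEST-000000.extracted.warc.gz"}'
)
ACTUAL = (
    'org,example)/ 20240101000000 {"url": "https://example.org/", '
    '"status": "200", "length": "1234", "offset": "406", '
    '"filename": "TEST-000000.extracted.warc.gz"}'
)

if __name__ == "__main__":
    print(same_record_ignoring(EXPECTED, ACTUAL))
```

Only `offset` is ignored by default, since that is the one field the note calls out. Pass a different `ignored` tuple if other fields turn out to vary in your environment.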