Allow to access WARC record filename and offset #6

Closed
sebastian-nagel opened this issue Nov 5, 2018 · 2 comments

Comments

@sebastian-nagel
Contributor

See this discussion: https://groups.google.com/d/topic/common-crawl/7MuqVmvajoA/discussion

Offset and length are not part of the ArcWarcRecord but are known only to the ArchiveIterator. Ideally, it should be possible to access WARC filename, record offset and length in the process_record method.

@sebastian-nagel
Contributor Author

Actually, accessing the record offset or length causes the entire record to be consumed, so it must be done after the record has been processed.
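
To illustrate the pitfall, here is a minimal sketch using only the standard library. `ToyRecord`, `ToyArchiveIterator` and `payloads` are hypothetical stand-ins invented for this example, not warcio classes; they assume newline-terminated records so that the end of a record is unknown until it has been read completely. Asking for the length first leaves nothing for the processing step to read.

```python
import io


class ToyRecord:
    """Toy record whose payload is read lazily from the shared stream."""

    def __init__(self, stream):
        self._stream = stream
        self._consumed = False

    def read(self):
        # The payload can be read only once; later calls return b"".
        if self._consumed:
            return b""
        self._consumed = True
        return self._stream.readline().rstrip(b"\n")


class ToyArchiveIterator:
    """Toy iterator over newline-terminated records (not warcio's class)."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)
        self._size = len(data)
        self._offset = 0
        self._record = None

    def __iter__(self):
        while True:
            if self._record is not None:
                self._record.read()  # skip whatever the caller left unread
            self._offset = self._stream.tell()
            if self._offset >= self._size:
                return
            self._record = ToyRecord(self._stream)
            yield self._record

    def get_record_length(self):
        # The end of the record is unknown until it has been read completely.
        self._record.read()
        return self._stream.tell() - self._offset


def payloads(data, length_first):
    """Collect (payload, length) pairs, asking for the length first or last."""
    iterator = ToyArchiveIterator(data)
    results = []
    for record in iterator:
        if length_first:
            length = iterator.get_record_length()
            payload = record.read()
        else:
            payload = record.read()
            length = iterator.get_record_length()
        results.append((payload, length))
    return results
```

With `length_first=True` every payload comes back empty, while `length_first=False` returns both the payload and the length, which is the ordering the comment above asks for.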

sebastian-nagel added a commit that referenced this issue Jul 19, 2019
from ArchiveIterator, implements #6
- introduce customizable method
    `iterate_records(warc_file_uri, archive_iterator)`
  which iterates over WARC records and calls `process_record(record)`
- document pitfall: accessing offset and length must be done after
  WARC record is processed
@sebastian-nagel
Contributor Author

Implemented with 7e2f67a: by overriding the method iterate_records, the WARC record and offset can be accessed. See #9 for an example of how this can be used.
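
A rough sketch of what such an override might look like follows. Only the names `iterate_records(warc_file_uri, archive_iterator)` and `process_record(record)` come from this thread; `BaseJob`, `OffsetJob` and `StubArchiveIterator` are hypothetical, and the `get_record_offset()` / `get_record_length()` method names are an assumption about the archive iterator's interface that should be checked against the warcio version in use. The stub exists only so the sketch runs without a real WARC file.

```python
import io


class StubArchiveIterator:
    """Stand-in iterator with fixed (offset, length, payload) entries."""

    def __init__(self, entries):
        self._entries = entries
        self._current = None

    def __iter__(self):
        for entry in self._entries:
            self._current = entry
            yield io.BytesIO(entry[2])  # the "record" only needs read() here

    def get_record_offset(self):
        return self._current[0]

    def get_record_length(self):
        return self._current[1]


class BaseJob:
    """Hypothetical base class: process each record, nothing else."""

    def process_record(self, record):
        return record.read()

    def iterate_records(self, warc_file_uri, archive_iterator):
        for record in archive_iterator:
            yield self.process_record(record)


class OffsetJob(BaseJob):
    """Hypothetical subclass that also reports where each record lives."""

    def iterate_records(self, warc_file_uri, archive_iterator):
        for record in archive_iterator:
            result = self.process_record(record)
            # Query offset and length only after the record was processed.
            offset = archive_iterator.get_record_offset()
            length = archive_iterator.get_record_length()
            yield warc_file_uri, offset, length, result
```

For example, `list(OffsetJob().iterate_records("example.warc.gz", StubArchiveIterator([(0, 120, b"a")])))` yields one tuple holding the file name, offset, length and processing result.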
