Use capture time for warcinfo WARC-Date and timestamp in WARC filename #2

Closed
sebastian-nagel opened this Issue May 23, 2017 · 1 comment


sebastian-nagel commented May 23, 2017

The WARC-Date field (https://github.com/iipc/warc-specifications/blob/gh-pages/specifications/warc-format/warc-1.1/index.md#warc-date-mandatory) is defined as "The timestamp shall represent the instant that data capture for record creation began." For request and response records, that is obviously the time the request was made or the response was received, respectively. For warcinfo records, however, Common Crawl WARC files

  • use as WARC-Date the time the WARC file is written, which happens in the post-processing phase
  • carry in the WARC filename the timestamp at which the entire monthly crawl was started (i.e., the fetch list generation time).

A monthly crawl is fetched over 8-9 days, but the content of a WARC file always relates to one segment, which is fetched within 2 hours. The warcinfo WARC-Date should therefore indicate the time when fetching/capturing starts.
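To illustrate the proposal, here is a minimal sketch of deriving the warcinfo WARC-Date from the capture times of the records in one segment. The helper names `capture_span` and `warc_date` and the sample timestamps are made up for illustration and are not taken from the Common Crawl code; the sketch assumes the capture times are available as timezone-aware datetimes.

```python
from datetime import datetime, timezone

def capture_span(capture_times):
    """Return (start, end): earliest and latest capture time of a segment."""
    if not capture_times:
        raise ValueError("no capture times given")
    return min(capture_times), max(capture_times)

def warc_date(ts):
    """Format a timezone-aware datetime as a UTC ISO 8601 WARC-Date value."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Hypothetical capture times of three response records in one segment.
times = [
    datetime(2017, 5, 23, 10, 15, 30, tzinfo=timezone.utc),
    datetime(2017, 5, 23, 9, 2, 11, tzinfo=timezone.utc),
    datetime(2017, 5, 23, 10, 58, 47, tzinfo=timezone.utc),
]

start, end = capture_span(times)
# The warcinfo WARC-Date would be the start of capturing,
# not the time the WARC file is written during post-processing.
print("WARC-Date:", warc_date(start))
```

With this approach the warcinfo WARC-Date falls inside the roughly 2-hour window in which the segment was fetched.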

Having the timestamp in the WARC/WAT/WET filename indicate the time span in which the content of the file was crawled was also requested on the Common Crawl user group [1,2].


sebastian-nagel commented May 24, 2017

Implemented, included in May 2017 crawl:

  • change WARC filename to
    CC-MAIN-starttime-endtime-serial.warc.gz
    The hostname is dropped because it refers to the cluster master host of the WARC generation step, which is not the host the content was fetched from. Cf. the WARC spec, which recommends Prefix-Timestamp-Serial-Crawlhost.warc.gz as the WARC filename.
  • start and end time shall indicate the capture/fetch time of the WARC content
  • the warcinfo WARC-Date now indicates the starttime
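
A minimal sketch of how a filename following the new pattern could be assembled. The function name `warc_filename` is hypothetical, and the timestamp layout (14-digit UTC, `yyyyMMddHHmmss`) and the 5-digit zero-padded serial are assumptions for illustration; the comment above only fixes the order `CC-MAIN-starttime-endtime-serial.warc.gz`.

```python
from datetime import datetime, timezone

def warc_filename(start, end, serial, prefix="CC-MAIN"):
    """Build <prefix>-<starttime>-<endtime>-<serial>.warc.gz.

    Assumed (not stated in the issue): 14-digit UTC timestamps
    and a 5-digit zero-padded serial number.
    """
    fmt = "%Y%m%d%H%M%S"
    s = start.astimezone(timezone.utc).strftime(fmt)
    e = end.astimezone(timezone.utc).strftime(fmt)
    return f"{prefix}-{s}-{e}-{serial:05d}.warc.gz"

# Hypothetical capture span of one segment.
name = warc_filename(
    datetime(2017, 5, 23, 9, 2, 11, tzinfo=timezone.utc),
    datetime(2017, 5, 23, 10, 58, 47, tzinfo=timezone.utc),
    serial=7,
)
print(name)
```

No crawl host appears in the name, matching the decision to drop the hostname.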
