Index for WET files? #11

Closed
azzurolilc opened this Issue Jun 6, 2016 · 2 comments

azzurolilc commented Jun 6, 2016

Hi, I hope I am posting this question in the right place...

I found the domain index for .WARC files at http://index.commoncrawl.org/CC-MAIN-2016-18//
Is there any index for .WET files?

If not, is there any way I could convert a WARC object address to the WET object address?
For example, if I have:
s3://commoncrawl/crawl-data/CC-MAIN-2016-18/segments/1461860125175.9/warc/CC-MAIN-20160428161525-00221-ip-10-239-7-51.ec2.internal.warc.gz
What would be the corresponding .WET file?

Thx...

Contributor

sebastian-nagel commented Jun 8, 2016

Hi,
the better place for questions would be the Common Crawl user group at
https://groups.google.com/forum/#!forum/common-crawl
You'll probably get a quick answer from users in various time zones.

Unfortunately, we do not provide an index to WET files. However, it's easy to derive the location of a WET (or WAT) file from the location of a WARC file:

  • replace /warc/ with /wet/ in the path
  • add .wet before the suffix .gz
s3://commoncrawl/crawl-data/CC-MAIN-2016-18/segments/1461860125175.9/warc/CC-MAIN-20160428161525-00221-ip-10-239-7-51.ec2.internal.warc.gz
s3://commoncrawl/crawl-data/CC-MAIN-2016-18/segments/1461860125175.9/wet/CC-MAIN-20160428161525-00221-ip-10-239-7-51.ec2.internal.warc.wet.gz
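
The two steps above can be sketched in a few lines of Python. This is only an illustration: the function name `warc_to_wet_path` and the `kind` parameter are hypothetical, and the `"wat"` variant assumes WAT files follow the same pattern with `/wat/` and `.wat`, which the steps above do not spell out.

```python
def warc_to_wet_path(warc_path: str, kind: str = "wet") -> str:
    """Derive a WET path from a WARC path (hypothetical helper).

    Applies the two steps described above: swap the /warc/ directory
    for /wet/ and insert .wet before the .gz suffix.
    """
    # Assumption: "wat" works the same way as "wet".
    if kind not in ("wet", "wat"):
        raise ValueError(f"unsupported kind: {kind!r}")
    if "/warc/" not in warc_path or not warc_path.endswith(".gz"):
        raise ValueError(f"not a WARC path: {warc_path!r}")

    # Split on the last /warc/ directory component in the path.
    head, _, filename = warc_path.rpartition("/warc/")
    # Drop the trailing .gz so the new extension can go before it.
    stem = filename[: -len(".gz")]
    return f"{head}/{kind}/{stem}.{kind}.gz"


warc = (
    "s3://commoncrawl/crawl-data/CC-MAIN-2016-18/segments/1461860125175.9/"
    "warc/CC-MAIN-20160428161525-00221-ip-10-239-7-51.ec2.internal.warc.gz"
)
print(warc_to_wet_path(warc))
```

For the WARC path from the question, this prints the WET path shown above.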

The Common Crawl index also provides offsets into the WARC file, which could be used to estimate the offsets in the WET file.

Contributor

sebastian-nagel commented Feb 24, 2017

Closing this issue as it does not belong in this repository. Thanks!
