Merged

30 commits
7c63ebb
ignore .idea, target
lfoppiano Dec 16, 2025
3f584e5
add pom.xml, Readme.md and the data files
lfoppiano Dec 16, 2025
f9d929e
add makefile
lfoppiano Dec 16, 2025
7e7b3f5
add read warc
lfoppiano Dec 16, 2025
a998133
add CI + spotless
lfoppiano Dec 16, 2025
c808d8c
add figures, editorconfig, .gitignore from the python repository brother
lfoppiano Dec 16, 2025
fa3f707
remove unclear make install, remove venv info from readme
lfoppiano Dec 16, 2025
f6d62bb
update read class, add recompress,
lfoppiano Dec 17, 2025
4aa252a
cleanup, removing the rest of the python stuff for task 0,1,2
lfoppiano Dec 17, 2025
5b018e9
fix missing make install
lfoppiano Dec 18, 2025
817862c
move data under 'data' directory
lfoppiano Dec 18, 2025
620ebee
add Apache header in the code
lfoppiano Dec 18, 2025
886ff0b
make sure we build before running
lfoppiano Dec 18, 2025
6180fce
update .gitignore
lfoppiano Dec 19, 2025
d35e3d8
Implement WARC compression validation for Task 5
lfoppiano Dec 20, 2025
e20c81e
Ignore gzip validation if is uncompressed
lfoppiano Dec 20, 2025
07c9f8b
Merge branch 'main' into luca/feature/part2
lfoppiano Dec 22, 2025
0fa930e
fix compression check, update Readme.md
lfoppiano Dec 22, 2025
78fbac6
add missing apache licence
lfoppiano Dec 22, 2025
6f97782
add commons-compress library
lfoppiano Dec 22, 2025
52fca8c
place Github Actions in the correct directory
lfoppiano Dec 22, 2025
75af0e1
Add CDJX indexer using unreleased JARC code
lfoppiano Dec 23, 2025
077f904
Implement Task 3 and 4
lfoppiano Dec 28, 2025
3a2791a
fix: CI build
lfoppiano Dec 28, 2025
df257e4
fix: Reformat with spotless
lfoppiano Dec 28, 2025
3ed8d61
fix: Rename class
lfoppiano Dec 29, 2025
b3c7252
feat: task 7
lfoppiano Dec 29, 2025
40bb84a
Merge branch 'main' into luca/feature/part3
lfoppiano Jan 16, 2026
87c6ca9
fix(makefile): write stuff in data/
lfoppiano Jan 16, 2026
fbf8147
fix(makefile): avoid reimplementing the wheel
lfoppiano Jan 16, 2026
59 changes: 30 additions & 29 deletions Makefile
@@ -1,19 +1,19 @@
build:
mvn clean package

# cdxj:
# @echo "creating *.cdxj index files from the local warcs"
# cdxj-indexer whirlwind.warc.gz > whirlwind.warc.cdxj
# cdxj-indexer --records conversion whirlwind.warc.wet.gz > whirlwind.warc.wet.cdxj
# cdxj-indexer whirlwind.warc.wat.gz > whirlwind.warc.wat.cdxj
cdxj: build ensure_jwarc
@echo "creating *.cdxj index files from the local warcs"
java -jar jwarc.jar cdxj data/whirlwind.warc.gz > data/whirlwind.warc.cdxj
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wet.gz --records conversion" > data/whirlwind.warc.wet.cdxj
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wat.gz --records metadata" > data/whirlwind.warc.wat.cdxj

extract:
@echo "creating extraction.* from local warcs, the offset numbers are from the cdxj index"
java -jar jwarc.jar extract --payload data/whirlwind.warc.gz 1023 > extraction.html
java -jar jwarc.jar extract --payload data/whirlwind.warc.wet.gz 466 > extraction.txt
java -jar jwarc.jar extract --payload data/whirlwind.warc.wat.gz 443 > extraction.json
@echo "hint: python -m json.tool extraction.json"

# extract:
# @echo "creating extraction.* from local warcs, the offset numbers are from the cdxj index"
# warcio extract --payload whirlwind.warc.gz 1023 > extraction.html
# warcio extract --payload whirlwind.warc.wet.gz 466 > extraction.txt
# warcio extract --payload whirlwind.warc.wat.gz 443 > extraction.json
# @echo "hint: python -m json.tool extraction.json"
#
# cdx_toolkit:
# @echo demonstrate that we have this entry in the index
# cdxt --crawl CC-MAIN-2024-22 --from 20240518015810 --to 20240518015810 iter an.wikipedia.org/wiki/Escopete
@@ -31,15 +31,15 @@ build:
# python ./warcio-iterator.py TEST-000000.extracted.warc.gz
# @echo
#
# download_collinfo:
# @echo "downloading collinfo.json so we can find out the crawl name"
# curl -O https://index.commoncrawl.org/collinfo.json
#
# CC-MAIN-2024-22.warc.paths.gz:
# @echo "downloading the list from s3, requires s3 auth even though it is free"
# @echo "note that this file should be in the repo"
# aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > CC-MAIN-2024-22.warc.paths.gz
#
download_collinfo:
@echo "downloading collinfo.json so we can find out the crawl name"
curl -o data/collinfo.json https://index.commoncrawl.org/collinfo.json

CC-MAIN-2024-22.warc.paths.gz:
@echo "downloading the list from s3, requires s3 auth even though it is free"
@echo "note that this file should be in the repo"
aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > data/CC-MAIN-2024-22.warc.paths.gz

# duck_local_files:
# @echo "warning! 300 gigabyte download"
# python duck.py local_files
@@ -52,11 +52,12 @@ build:
# @echo "warning! this might take 1-10 minutes"
# python duck.py cloudfront
#
get_jwarc:

jwarc.jar:
@echo "downloading JWarc JAR"
curl -fL -o jwarc.jar https://github.com/iipc/jwarc/releases/download/v0.33.0/jwarc-0.33.0.jar

wreck_the_warc: build get_jwarc
wreck_the_warc: build jwarc.jar
@echo
@echo we will break and then fix this warc
cp data/whirlwind.warc.gz data/testing.warc.gz
@@ -67,24 +68,24 @@ wreck_the_warc: build get_jwarc
gzip data/testing.warc
@echo
@echo showing the records in the compressed warc - note the offsets of request and response are
java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
java -jar jwarc.jar ls data/testing.warc.gz
@echo
@echo access the request record - failing
java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
java -jar jwarc.jar extract data/testing.warc.gz 3734 || /usr/bin/true
@echo
@echo access the response record - failing
java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 3734 || /usr/bin/true
java -jar jwarc.jar extract data/testing.warc.gz 3734 || /usr/bin/true
@echo
@echo "now let's do it the right way"
gzip -d data/testing.warc.gz
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.RecompressWARC -Dexec.args="data/testing.warc data/testing.warc.gz"
@echo
@echo showing the records in the compressed warc - note the skewed offsets of request and response
java -jar jwarc-0.33.0.jar ls data/testing.warc.gz
java -jar jwarc.jar ls data/testing.warc.gz
@echo
@echo access the request record - works
java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 518 | head
java -jar jwarc.jar extract data/testing.warc.gz 518 | head
@echo
@echo access the response record - works
java -jar jwarc-0.33.0.jar extract data/testing.warc.gz 1027 | head -n 20
java -jar jwarc.jar extract data/testing.warc.gz 1027 | head -n 20
@echo
71 changes: 69 additions & 2 deletions README.md
@@ -414,11 +414,78 @@ Feel free to experiment more by looking at other part of the records, or extract

## Task 3: Index the WARC, WET, and WAT

TBA
The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
```mermaid
flowchart LR
warc --> indexer --> cdxj & columnar
warc@{shape: cyl}
cdxj@{ shape: stored-data}
columnar@{ shape: stored-data}
```


We have two versions of the index: the CDX index and the columnar index. The CDX index is useful for looking up single pages, whereas the columnar index is better suited to analytical and bulk queries. We'll look at both in this tour, starting with the CDX index.

### CDX(J) index

The CDX index files are sorted plain-text files, with each line containing information about a single capture in the WARC. Technically, Common Crawl uses CDXJ index files since the information about each capture is formatted as JSON. We'll use CDX and CDXJ interchangeably in this tour for legacy reasons 💅

We can create our own CDXJ index from the local WARCs by running:

```make cdxj```

This uses the JWARC library, together with some home-cooked code we wrote to support WET and WAT records, to generate CDXJ index files for our WARC files by running the code below:

<details>
<summary>Click to view code</summary>

```shell
creating *.cdxj index files from the local warcs
java -jar jwarc.jar cdxj data/whirlwind.warc.gz > data/whirlwind.warc.cdxj
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wet.gz --records conversion" > data/whirlwind.warc.wet.cdxj
mvn -q exec:java -Dexec.mainClass=org.commoncrawl.whirlwind.CdxjIndexer -Dexec.args="data/whirlwind.warc.wat.gz --records metadata" > data/whirlwind.warc.wat.cdxj
```

</details>

Now look at the `.cdxj` files with `cat data/whirlwind*.cdxj`. You'll see that each file has one entry in the index. The WARC only has the response record indexed, since by default the indexer guesses that you won't ever want to random-access the request or metadata records. The WET and WAT have the conversion and metadata records indexed, respectively (Common Crawl doesn't publish a WET or WAT index, just WARC).

For each of these records there's one text line in the index - yes, it's a flat file! It starts with a string like `org,wikipedia,an)/wiki/escopete 20240518015810`, followed by a JSON blob. The starting string is the primary key of the index. The first field is a [SURT](http://crawler.archive.org/articles/user_manual/glossary.html#surt) (Sort-friendly URI Reordering Transform). The 14-digit number is a timestamp: an ISO-8601 date and time with the delimiters removed.

What is the purpose of this funky format? It's done this way because these flat files (300 gigabytes total per crawl) can be sorted on the primary key using any out-of-core sort utility e.g. the standard Linux `sort`, or one of the Hadoop-based out-of-core sort functions.

The JSON blob has enough information to cleanly isolate the raw data of a single record: it defines which WARC file the record is in, and the byte offset and length of the record within this file. We'll use that in the next section.

## Task 4: Use the CDXJ index to extract a subset of raw content from the local WARC, WET, and WAT

TBA
Normally, compressed files don't support random access. WARC files use a trick to make it possible: every record is compressed separately, and the resulting gzip members are simply concatenated. The `gzip` format explicitly allows this, but the feature is rarely used.

To extract one record from a WARC file, all you need to know is the filename and the byte offset into the file. If you're reading over the web, it also really helps to know the exact length of the record.

Run:

```make extract```

to run a set of extractions from your local `whirlwind.*.gz` files with JWARC, using the commands below:

<details>
<summary>Click to view code</summary>

```shell
creating extraction.* from local warcs, the offset numbers are from the cdxj index
java -jar jwarc.jar extract --payload data/whirlwind.warc.gz 1023 > extraction.html
java -jar jwarc.jar extract --payload data/whirlwind.warc.wet.gz 466 > extraction.txt
java -jar jwarc.jar extract --payload data/whirlwind.warc.wat.gz 443 > extraction.json
hint: python -m json.tool extraction.json
```

</details>

The offset numbers in the Makefile are the same ones as in the index. Look at the three output files: `extraction.html`, `extraction.txt`, and `extraction.json` (pretty-print the JSON with `python -m json.tool extraction.json`).

Notice that we extracted HTML from the WARC, text from WET, and JSON from the WAT (as shown in the different file extensions). This is because the payload in each file type is formatted differently!

## Task 5: Wreck the WARC by compressing it wrong
