# Loading WARC / Generating CDX (enable more efficient processing)

ArchiveSpark gains its efficiency through a two-step loading approach: common operations such as filtering, sorting, and grouping access only metadata. ArchiveSpark reads the actual records only when content is required, for example to apply additional filters or to derive new information from a record. For Web archives, the required metadata is commonly provided by CDX records. In the following, we show how to generate these CDX records from a collection of WARC.gz files.
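The two-step idea can be illustrated with a small, self-contained sketch in plain Scala. It does not use ArchiveSpark, and all names in it (`Meta`, `LazyRecord`, `TwoStepDemo`) are hypothetical. It only shows the general pattern: filter on cheap metadata first, and read the payload lazily for the records that remain.

```scala
// Hypothetical types for illustration only; this is not ArchiveSpark's implementation.
final case class Meta(url: String, mime: String, status: Int)

// A record pairs cheap metadata with a payload that is only read on demand.
final class LazyRecord(val meta: Meta, load: () => String) {
  lazy val content: String = load()
}

object TwoStepDemo {
  var payloadReads = 0 // counts how often the expensive step actually runs

  def record(url: String, mime: String, status: Int, body: String): LazyRecord =
    new LazyRecord(Meta(url, mime, status), () => { payloadReads += 1; body })

  // Made-up sample data.
  val records: Seq[LazyRecord] = Seq(
    record("http://example.org/", "text/html", 200, "<html>hello</html>"),
    record("http://example.org/logo.png", "image/png", 200, "binary"),
    record("http://example.org/missing", "text/html", 404, "not found")
  )

  // Step 1: filter on metadata only; no payload is touched here.
  val htmlOk: Seq[LazyRecord] =
    records.filter(r => r.meta.mime == "text/html" && r.meta.status == 200)

  // Step 2: access the content only for the records that survived the filter.
  val lengths: Seq[Int] = htmlOk.map(_.content.length)
}
```

In this sketch, only one of the three payloads is ever loaded, which is the effect that metadata such as CDX records makes possible at the scale of a whole collection.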

In [1]:
import org.archive.archivespark._
import org.archive.archivespark.implicits._
import org.archive.archivespark.enrich.functions._
import org.archive.archivespark.specific.warc._
import org.archive.archivespark.specific.warc.enrichfunctions._
import org.archive.archivespark.specific.warc.implicits._
import org.archive.archivespark.specific.warc.specs._

## Loading the dataset from WARC.gz files

In [2]:
val path = "/data/archiveit/ArchiveIt-Collection-2950"
val warcPath = path + "/warc"

ArchiveSpark provides two [DataSpecs](https://github.com/helgeho/ArchiveSpark/blob/master/docs/DataSpecs.md) to load Web archive records from plain (W)ARC files (without CDX). If the records are stored as \*.warc.gz files in which each record is compressed separately, `WarcGzHdfsSpec` should be used for efficiency, as it allows these files to be split. Otherwise, `WarcHdfsSpec` can load any \*.arc(.gz) and \*.warc(.gz) files.

In [3]:
val warc = ArchiveSpark.load(WarcHdfsSpec(warcPath + "/*.*arc*"))

With `WarcGzHdfsSpec`, CDX files with positional information can be generated directly using `.saveAsCdx`. With `WarcHdfsSpec`, by contrast, the whole dataset has to be saved again along with the CDX records using `.saveAsWarc(..., generateCdx = true)`. `WarcGzHdfsSpec` is therefore highly recommended if it suits your dataset:

In [4]:
val warc = ArchiveSpark.load(WarcGzHdfsSpec(warcPath + "/*.warc.gz"))
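Why separately compressed records matter can be shown with a minimal sketch that uses only the Java standard library. It illustrates the general property of concatenated gzip members, not ArchiveSpark's internal reader, and the object name `GzipMembers` and the sample strings are made up for this example. When every record is its own gzip member, a reader that knows a record's byte offset and compressed length can decompress that record alone, without touching the rest of the file.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

object GzipMembers {
  // Compress one "record" as its own, complete gzip member.
  def gzip(text: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(buffer)
    out.write(text.getBytes("UTF-8"))
    out.close() // finishes the member and writes the gzip trailer
    buffer.toByteArray
  }

  // Decompress only the byte range [offset, offset + length) of the file.
  def gunzip(file: Array[Byte], offset: Int, length: Int): String = {
    val in = new GZIPInputStream(new ByteArrayInputStream(file, offset, length))
    try Source.fromInputStream(in, "UTF-8").mkString
    finally in.close()
  }

  // Two members concatenated, standing in for a tiny *.warc.gz file.
  val first: Array[Byte] = gzip("record one")
  val second: Array[Byte] = gzip("record two")
  val file: Array[Byte] = first ++ second

  // Jump straight to the second member using its offset and compressed size.
  val secondOnly: String = gunzip(file, first.length, second.length)
}
```

A file that is compressed as one single gzip stream offers no such entry points, which is presumably why the more general `WarcHdfsSpec` cannot produce positional CDX information for the original files.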

### Take a look at the first record

As we can see, although loaded directly from WARC, the records are internally represented in the same format as datasets with provided CDX data. Hence, we can apply the same operations as well as Enrich Functions. However, the processing will be less efficient than with available CDX records.

In [5]:
warc.peekJson

{
  "record" : {
    "surtUrl" : "-",
    "timestamp" : "20111220013002",
    "originalUrl" : null,
    "mime" : "-",
    "status" : 0,
    "digest" : "LEPWK3MY3EA6X25EUWXJ452252SZRRXN",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 647
  }
}

### Counting the records in this dataset takes a long time, as all headers and contents are read and parsed

In [6]:
warc.count

78993499

*(took 31 minutes)*

## Generate CDX

We can now generate and save the CDX records corresponding to our dataset, so that it can be used more efficiently with ArchiveSpark next time. By adding `.gz` to the path, the output is automatically compressed with gzip:

In [7]:
warc.saveAsCdx(path + "_cdx.gz")
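To make the generated output more concrete, the following sketch parses a single CDX line in plain Scala. It assumes the common space-separated 11-column CDX layout (SURT URL, timestamp, original URL, MIME type, status, digest, redirect URL, meta, compressed size, offset, filename); the exact layout written by `saveAsCdx` should be checked against the actual output. The names `CdxFields` and `CdxSketch` and the example line are made up for illustration and are not part of ArchiveSpark.

```scala
import scala.util.Try

// Hypothetical container for the 11 columns of the assumed CDX layout.
final case class CdxFields(
  surtUrl: String, timestamp: String, originalUrl: String, mime: String,
  status: Int, digest: String, redirectUrl: String, meta: String,
  compressedSize: Long, offset: Long, filename: String)

object CdxSketch {
  // Returns None if the line does not have 11 fields or a numeric size/offset.
  def parse(line: String): Option[CdxFields] = {
    val f = line.trim.split("\\s+")
    if (f.length != 11) None
    else Try(CdxFields(
      f(0), f(1), f(2), f(3),
      Try(f(4).toInt).getOrElse(-1), // "-" or another non-numeric status becomes -1
      f(5), f(6), f(7),
      f(8).toLong, f(9).toLong, f(10)
    )).toOption
  }

  // Made-up example line, not taken from the collection used in this notebook.
  val example: String =
    "org,example)/ 20111220013002 http://example.org/ text/html 200 EXAMPLEDIGEST - - 647 1024 example.warc.gz"

  val parsed: Option[CdxFields] = parse(example)
}
```

The last three columns are the positional information: together, the filename, offset, and compressed size identify where a record sits inside its WARC file, so the record can be fetched later without scanning the file.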

## Re-load dataset with CDX records

As we have CDX records for our dataset now, we can use the `WarcCdxHdfsSpec` to load it more efficiently:

In [8]:
val records = ArchiveSpark.load(WarcCdxHdfsSpec(path + "_cdx.gz", warcPath))

Counting, as well as most of the other [operations](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Operations.md) provided by [Spark](https://spark.apache.org/docs/latest/rdd-programming-guide.html) and [ArchiveSpark](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Operations.md), will now be more efficient.

In [9]:
records.count

78993499

*(took 26 seconds)*