# Downloading a Web archive dataset as WARC/CDX from the Wayback Machine

In [1]:
import org.archive.archivespark._
import org.archive.archivespark.implicits._
import org.archive.archivespark.enrich.functions._
import org.archive.archivespark.specific.warc._
import org.archive.archivespark.specific.warc.enrichfunctions._
import org.archive.archivespark.specific.warc.implicits._
import org.archive.archivespark.specific.warc.specs._

## Loading the dataset from the Wayback Machine

ArchiveSpark provides a [Data Specification (DataSpec)](https://github.com/helgeho/ArchiveSpark/blob/master/docs/DataSpecs.md) that loads records remotely from the Wayback Machine, with their metadata fetched from the [CDX Server](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server). This `WaybackSpec` takes a URL query plus optional parameters that narrow down the retrieved records.

For more details about this and other [DataSpecs](https://github.com/helgeho/ArchiveSpark/blob/master/docs/DataSpecs.md) please read the [docs](https://github.com/helgeho/ArchiveSpark/blob/master/docs/README.md).

The following example loads the first 50 pages of records from the CDX Server for URLs starting with *nytimes.com* (`matchPrefix = true`) between May and June 2012, with 5 blocks per page (please read [this documentation](https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server) for more information on these parameters):

In [2]:
val records = ArchiveSpark.load(WaybackSpec("nytimes.com", matchPrefix = true, from = 201205, to = 201206, blocksPerPage = 5, pages = 50))
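To make the paging parameters more concrete, the following is a rough sketch of what a single CDX Server request for the first result page might look like. The query parameters (`url`, `matchType`, `from`, `to`, `pageSize`, `page`) are the ones described in the CDX Server documentation; the helper `cdxQueryUrl`, the endpoint, and the mapping from the `WaybackSpec` arguments to these parameters are assumptions made for illustration, not ArchiveSpark's actual implementation.

```scala
import java.net.URLEncoder

// Hypothetical helper (not part of ArchiveSpark): builds the CDX Server query
// URL for one result page. How WaybackSpec maps its arguments to these
// parameters is an assumption.
def cdxQueryUrl(url: String, matchPrefix: Boolean, from: Long, to: Long,
                blocksPerPage: Int, page: Int): String = {
  val params = Seq(
    "url" -> url,
    "matchType" -> (if (matchPrefix) "prefix" else "exact"),
    "from" -> from.toString,
    "to" -> to.toString,
    "pageSize" -> blocksPerPage.toString, // number of index blocks per page
    "page" -> page.toString               // CDX Server pages are zero-based
  )
  val query = params
    .map { case (key, value) => s"$key=${URLEncoder.encode(value, "UTF-8")}" }
    .mkString("&")
  s"http://web.archive.org/cdx/search/cdx?$query"
}

val firstPage = cdxQueryUrl("nytimes.com", matchPrefix = true,
  from = 201205, to = 201206, blocksPerPage = 5, page = 0)
```

Under these assumptions, loading 50 pages would correspond to requests with `page` running from 0 to 49.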

### Peek at the first record as JSON

As usual, the records in this dataset can be printed as JSON, and all common [operations](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Operations.md) as well as [Enrich Functions](https://github.com/helgeho/ArchiveSpark/blob/master/docs/EnrichFuncs.md) can be applied, as shown in [other recipes](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Recipes.md).

In [3]:
records.peekJson

{
  "record" : {
    "surtUrl" : "com,nytimes)/",
    "timestamp" : "20120501011208",
    "originalUrl" : "http://www.nytimes.com/",
    "mime" : "text/html",
    "status" : 200,
    "digest" : "I25SXVS4G5FYLISA5CF54UHMYLATAEW3",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 35906
  }
}
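The `timestamp` field is a 14-digit capture time in the form `yyyyMMddHHmmss`. As a minimal sketch in plain Scala, independent of ArchiveSpark, it can be parsed with `java.time` and combined with `originalUrl` into a Wayback Machine replay URL. The helper names `parseWaybackTimestamp` and `replayUrl` are made up for this example, and the timestamp is assumed to be in UTC.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// 14-digit Wayback timestamps follow the pattern yyyyMMddHHmmss (assumed UTC)
val waybackFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

// Hypothetical helper: turns the CDX timestamp string into a date-time value
def parseWaybackTimestamp(timestamp: String): LocalDateTime =
  LocalDateTime.parse(timestamp, waybackFormat)

// Hypothetical helper: builds the public replay URL for a capture
def replayUrl(timestamp: String, originalUrl: String): String =
  s"http://web.archive.org/web/$timestamp/$originalUrl"

// Values taken from the record printed above
val captured = parseWaybackTimestamp("20120501011208")
val replay = replayUrl("20120501011208", "http://www.nytimes.com/")
```

Here `captured` represents 1 May 2012, 01:12:08, which falls inside the requested range of May to June 2012.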

## Save as WARC / CDX

Save the records as local `*.warc.gz` and `*.cdx.gz` files. Because the path ends in `.gz`, the output is automatically compressed with gzip:

In [4]:
records.saveAsWarc("nytimes_201205-201206.warc.gz", WarcMeta(publisher = "Internet Archive"), generateCdx = true)

## Load from WARC / CDX

Now, with the dataset available in local WARC / CDX files, we can load it from there, using `WarcCdxHdfsSpec`:

In [5]:
val records = ArchiveSpark.load(WarcCdxHdfsSpec("nytimes_201205-201206.warc.gz/*.cdx.gz", "nytimes_201205-201206.warc.gz"))

In [6]:
records.peekJson

{
  "record" : {
    "surtUrl" : "com,nytimes)/",
    "timestamp" : "20120502142709",
    "originalUrl" : "http://www.nytimes.com/",
    "mime" : "text/html",
    "status" : 200,
    "digest" : "7PO7ZZLR4PCLW44AE253PWEIKN6NZ5UY",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 35575
  }
}
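The fields printed above correspond to the space-separated columns of a CDX line. The following sketch parses such a line in plain Scala, without ArchiveSpark. It assumes the common layout in which the nine fields shown in the JSON come first and any remaining columns (typically offset and file name) locate the record inside a WARC file. `CdxLine` and `parseCdxLine` are hypothetical names, and the trailing `0 example.warc.gz` in the sample line is made up for illustration.

```scala
import scala.util.Try

// Field names mirror the JSON output above; `location` collects any trailing
// columns (assumed to be offset and file name).
case class CdxLine(surtUrl: String, timestamp: String, originalUrl: String,
                   mime: String, status: Int, digest: String,
                   redirectUrl: String, meta: String, compressedSize: Long,
                   location: Seq[String])

// Returns None for lines with too few columns or non-numeric status/size
def parseCdxLine(line: String): Option[CdxLine] = {
  val f = line.trim.split("\\s+")
  if (f.length < 9) None
  else Try(CdxLine(f(0), f(1), f(2), f(3), f(4).toInt, f(5), f(6), f(7),
                   f(8).toLong, f.drop(9).toSeq)).toOption
}

// First nine values are taken from the record above; the last two are invented
val sample = "com,nytimes)/ 20120502142709 http://www.nytimes.com/ text/html 200 " +
  "7PO7ZZLR4PCLW44AE253PWEIKN6NZ5UY - - 35575 0 example.warc.gz"
val parsed = parseCdxLine(sample)
```

Lines whose status column holds `-` instead of a number are dropped by this sketch; a more careful parser might keep them with an optional status.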