# Loading and resolving *revisit* records in Web archive datasets

Revisit records may be included in a dataset to avoid duplicates and save storage. To access the actual content of such captures, the records need to be resolved first: the original record is identified in the CDX, and its meta information as well as the location of the corresponding WARC record are *copied* to the revisit record.

More information on *revisit records* can be found here:
https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#example-of-revisit-record
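
To make the idea of *resolving* more concrete, here is a minimal plain-Scala sketch that does not use ArchiveSpark at all. It assumes that a revisit is matched to its original by payload digest and that the earliest matching capture is taken as the original; the criteria ArchiveSpark actually applies may differ. `CdxEntry` and `RevisitSketch` are hypothetical names introduced only for this illustration.

```scala
// Illustrative only: a simplified CDX entry, not ArchiveSpark's record type.
final case class CdxEntry(
  url: String,
  timestamp: String, // 14-digit capture time, so string order equals time order
  mime: String,
  digest: String,
  warcFile: String,
  offset: Long
)

object RevisitSketch {
  val RevisitMime = "warc/revisit"

  // Assumption: originals are matched by digest, earliest capture wins.
  def resolve(entries: Seq[CdxEntry]): Seq[CdxEntry] = {
    val originals: Map[String, CdxEntry] = entries
      .filter(_.mime != RevisitMime)
      .groupBy(_.digest)
      .map { case (digest, candidates) => digest -> candidates.minBy(_.timestamp) }

    entries.flatMap { entry =>
      if (entry.mime != RevisitMime) Some(entry)
      else
        // Copy meta and location information from the original to the revisit.
        // Revisits without a matching original are dropped.
        originals.get(entry.digest).map { original =>
          entry.copy(mime = original.mime, warcFile = original.warcFile, offset = original.offset)
        }
    }
  }
}
```

A resolved revisit keeps its own URL and timestamp but points to the WARC location of the original capture, which is what allows its content to be loaded later.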

In [1]:
import org.archive.archivespark._
import org.archive.archivespark.implicits._
import org.archive.archivespark.enrich.functions._
import org.archive.archivespark.specific.warc._
import org.archive.archivespark.specific.warc.enrichfunctions._
import org.archive.archivespark.specific.warc.implicits._
import org.archive.archivespark.specific.warc.specs._

## Loading the dataset from CDX + WARC files

In [2]:
val path = "/data/archiveit/ArchiveIt-Collection-2950"
val cdxPath = path + "/cdx/*.cdx.gz"
val warcPath = path + "/warc"

### As the CDX files contain revisit records, we first load only the CDX (`CdxHdfsSpec`) in order to resolve them

In [3]:
val unresolved = ArchiveSpark.load(CdxHdfsSpec(cdxPath))

Resolving the revisit records is a relatively expensive operation. Depending on the number of records in the dataset, it may be a good idea to increase the parallelism beforehand to prevent out-of-memory errors.

In [4]:
ArchiveSpark.parallelism = 10000

After resolving the revisit records, we can reduce the number of partitions again (`coalesce`), which has become large due to the high parallelism. Finally, the resolved CDX records are cached so that we do not need to compute them multiple times.

In [5]:
val resolved = unresolved.resolveRevisits().coalesce(1000).cache

Before we continue with other operations, the parallelism should be decreased again as well.

In [6]:
ArchiveSpark.parallelism = 100

### We can now load the dataset with the actual WARC files using the resolved CDX (`WarcHdfsCdxRddSpec`)

In [7]:
val records = ArchiveSpark.load(WarcHdfsCdxRddSpec(resolved, warcPath))

From here you can work with the dataset as usual, as shown in other [recipes](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Recipes.md).

For more details on the [DataSpecs](https://github.com/helgeho/ArchiveSpark/blob/master/docs/DataSpecs.md) used here, please [read the docs](https://github.com/helgeho/ArchiveSpark/blob/master/docs/README.md).

## Comparing resolved and unresolved records

We lose only four records, for which no corresponding original record was found during resolution:

In [8]:
unresolved.count

52922792

In [9]:
resolved.count

52922788

In [10]:
records.count

52922788

As expected, all revisit records are turned into their actual counterparts:

In [11]:
unresolved.filter(_.mime == "warc/revisit").count

17502250

In [12]:
resolved.filter(_.mime == "warc/revisit").count

0

In [13]:
records.filter(_.mime == "warc/revisit").count

0
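
The counts reported above can be cross-checked with a little arithmetic. The following plain-Scala sketch only re-uses the numbers printed by the cells in this section; `RevisitStats` is a hypothetical name used for illustration.

```scala
object RevisitStats {
  // Counts copied from the outputs above.
  val unresolvedCount = 52922792L
  val resolvedCount   = 52922788L
  val revisitCount    = 17502250L

  // Records dropped because no original capture was found.
  val dropped: Long = unresolvedCount - resolvedCount

  // Fraction of the unresolved CDX records that were revisits.
  val revisitShare: Double = revisitCount.toDouble / unresolvedCount

  def main(args: Array[String]): Unit = {
    println(s"dropped records: $dropped")
    println(f"share of revisit records: ${revisitShare * 100}%.1f%%")
  }
}
```

Roughly a third of the CDX entries in this collection are revisit records, and all but four of them could be resolved.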

## Storing and loading resolved CDX records

If you work with this dataset often, you can save the resolved CDX records to avoid repeating the resolving process next time. Adding the `.gz` extension to the directory name automatically ensures that the CDX files are compressed using gzip.

In [14]:
records.saveAsCdx(path + "/resolved_cdx.gz")
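
Loading the stored records in a later session could then look like the following sketch. It re-uses only the specs shown earlier in this recipe (`CdxHdfsSpec` and `WarcHdfsCdxRddSpec`). The glob pattern for the saved files is an assumption, since the names of the files written into the output directory are not shown here; adjust it to what you find on disk. The names `resolvedCdxPath`, `reloadedCdx` and `reloadedRecords` are introduced for illustration only.

```scala
import org.archive.archivespark._
import org.archive.archivespark.specific.warc.specs._

val path = "/data/archiveit/ArchiveIt-Collection-2950"
val warcPath = path + "/warc"

// Assumption: the saved CDX files inside the directory end in ".gz".
val resolvedCdxPath = path + "/resolved_cdx.gz/*.gz"

// Load the already resolved CDX records; no call to resolveRevisits is needed.
val reloadedCdx = ArchiveSpark.load(CdxHdfsSpec(resolvedCdxPath))

// Combine the resolved CDX records with the WARC files, as before.
val reloadedRecords = ArchiveSpark.load(WarcHdfsCdxRddSpec(reloadedCdx, warcPath))
```

If the assumption about the file names holds, `reloadedRecords.count` should match the count of `records` reported above.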