## Create HathiTrust Zephir and US Federal Document Overlap Dataset

## Data

**Zephir Notebook**

**This notebook requires the zephir-oclc-feddocs-dataset to run. The dataset is not included in the repository because of its size. If you do not have the dataset, there are two options:**
  * Contact the Zephir team for the latest 'zephir-oclc-feddocs-dataset' dataset.
  * Skip this notebook and use the included 'zephir-htusfd-overlap' dataset, which this notebook creates, in further analysis.


### Imports

In [None]:
import datetime
import collections
import gzip
import json
import os
import shutil


import pandas

### Dataset Loading

#### US Fed Docs Registry (OCLC numbers)

The US Fed Docs Registry: https://github.com/HTGovdocs/feddoc_oclc_nums

In [None]:
with open('data/feddoc_oclc_nums.manifest.json') as json_file:
    usfd_manifest = json.load(json_file)
    
print("US Fed Doc Registry data was committed at {}".format(usfd_manifest["datetime"]))

# header=0 treats the first line as a header row and replaces it with `names`
usfeddoc_registry_df = pandas.read_csv("data/feddoc_oclc_nums.txt",
    names=["oclc"],
    dtype=usfd_manifest["schema"],
    header=0)
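
As a minimal sketch of how `names`, `header` and the manifest schema interact in the call above, the example below reads a tiny in-memory file. The manifest contents and the sample values are assumptions made for illustration; the real `feddoc_oclc_nums.manifest.json` may declare a different type.

```python
import io
import json

import pandas

# Hypothetical manifest; the real file's schema may differ.
example_manifest = json.loads(
    '{"datetime": "2019-01-01T00:00:00", "schema": {"oclc": "int64"}}'
)

# Stand-in for feddoc_oclc_nums.txt. With header=0 the first line is
# consumed as a header row, and `names` supplies the column name instead.
example_file = io.StringIO("some_header\n1234\n5678\n")

example_df = pandas.read_csv(example_file,
    names=["oclc"],
    dtype=example_manifest["schema"],
    header=0)

print(example_df.dtypes["oclc"])
print(len(example_df))
```

Note that `header=0` always consumes the first line, so this call pattern only suits files whose first line is a header rather than data.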

#### Zephir Dataset (Repository-ingested only)
This file is an export of Zephir data from the htmm database, created by a [Jupyter notebook](http://localhost:8888/lab/tree/create_zephir_dataset.ipynb). The original file is not included in version control because of its size.


In [3]:
# stop early with a clear error if the large source dataset is missing
if not os.path.exists("data/zephir-oclc-feddocs-dataset.csv"):
    print("Check for dataset on Google Drive. It was too large for GitHub.")
    raise FileNotFoundError("Missing dataset! See output for details.")
    
with open('data/zephir-oclc-feddocs-dataset.manifest.json') as json_file:
    zephir_manifest = json.load(json_file)
    
print("Zephir data was created {}".format(zephir_manifest["datetime"]))

zephir_df = pandas.read_csv("data/zephir-oclc-feddocs-dataset.csv", dtype=zephir_manifest["schema"])

Zephir data was created 2019-08-16T22:41:27.671011


## Create overlap dataset

### How many volumes in HathiTrust have overlapping OCLCs in the US Fed Docs Registry?

In [4]:
# get overlap
rows_with_matching_oclc = zephir_df["oclc"].isin(usfeddoc_registry_df["oclc"])
usfd_overlap_df = zephir_df[rows_with_matching_oclc]

# count unique volumes only; a volume may have several OCLC numbers and so several matches
volumes_matched = usfd_overlap_df.htid.unique()
print("There are {} volumes w/ OCLC numbers matched in the registry".format(len(volumes_matched)))

There are 1224577 volumes w/ OCLC numbers matched in the registry
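
The difference between matched rows and matched volumes can be seen in a toy example. The column names follow the notebook, but the values below are invented for illustration.

```python
import pandas

# Toy stand-ins for zephir_df and usfeddoc_registry_df (invented values).
toy_zephir_df = pandas.DataFrame({
    "htid": ["a.1", "a.1", "b.2", "c.3"],
    "oclc": [10, 11, 12, 13],
})
toy_registry_df = pandas.DataFrame({"oclc": [10, 11, 13]})

# Keep rows whose OCLC number appears in the registry.
toy_rows_matching = toy_zephir_df["oclc"].isin(toy_registry_df["oclc"])
toy_overlap_df = toy_zephir_df[toy_rows_matching]

matched_rows = len(toy_overlap_df)
matched_volumes = toy_overlap_df["htid"].nunique()

print(matched_rows)     # 3: volume a.1 matches twice, c.3 once
print(matched_volumes)  # 2: a.1 and c.3
```

`isin` compares values as stored, so both `oclc` columns need the same dtype; an integer `10` does not match the string `"10"`.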


#### Write overlap dataset (no filtering or deduplication)
This data is the raw overlap. It includes every match, including duplicate volumes, whether or not the MARC record indicates a US Federal Document (usfd_status).

In [5]:
export_df = usfd_overlap_df.drop(columns="zdb_row").sort_values(by=["contributor_code", "contribsys_id"])
# save the dataset
dataset_name = "zephir-htusfd-overlap"
dataset_file = "data/{}.csv".format(dataset_name)
export_df.to_csv(dataset_file, header=True, index=False)

# gzip large file
with open(dataset_file, "rb") as f_in:
    with gzip.open("{}.gz".format(dataset_file), 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
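
As an alternative sketch, pandas can compress while writing, which avoids the separate copy step. This differs from the cell above, which keeps both the plain CSV and the gzipped copy. The toy frame and file name are assumptions for illustration.

```python
import os
import tempfile

import pandas

# Invented sample data standing in for export_df.
toy_df = pandas.DataFrame({"htid": ["a.1", "c.3"], "oclc": [10, 13]})

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "toy-overlap.csv.gz")
    # compression="gzip" makes pandas write a gzip-compressed CSV directly
    toy_df.to_csv(gz_path, header=True, index=False, compression="gzip")
    # read_csv infers gzip compression from the .gz extension by default
    round_trip_df = pandas.read_csv(gz_path)

print(round_trip_df["htid"].tolist())
print(round_trip_df["oclc"].tolist())
```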

#### Create manifest

In [6]:
manifest = {}
manifest["name"] = dataset_name
manifest["description"] = "Dataset representing the OCLC overlap between Zephir OCLCs and the HT US Fed Docs OCLCs"
# timestamp the manifest with the CSV's last-modified time (UTC)
manifest["datetime"] = datetime.datetime.utcfromtimestamp(os.path.getmtime(dataset_file)).isoformat()

# derive schema from dataframe
schema = collections.OrderedDict()
for column in export_df.columns:
    schema[str(column)] = str(export_df.dtypes[column])
manifest["schema"] = schema

# format
manifest["format"] = {
        "delimiter": ",",
        "encoding": "utf8",
        "extension": "csv",
        "header": True,
        "type": "delimited"
    }
        
# note the origins
manifest["data-origins"] = [
    {
        "origin": "zephir-oclc-feddocs-dataset.csv",
        "datetime": str(zephir_manifest["datetime"])
    },
    {
        "origin": "feddoc_oclc_nums.txt",
        "datetime": str(usfd_manifest["datetime"])
    }
]

# save the manifest
manifest_file = "data/{}.manifest.json".format(dataset_name)
with open(manifest_file, 'w') as outfile:
    json.dump(manifest, outfile, indent=4)
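
The sketch below suggests why storing the schema in the manifest is useful: a later notebook can pass it to `read_csv` and recover the original column types. The toy frame is invented, and its string-typed `oclc` column is an assumption chosen to make the effect visible; the real dataset's types may differ.

```python
import io
import json

import pandas

# Invented data; object dtype keeps the leading zeros as text.
toy_df = pandas.DataFrame(
    {"htid": ["a.1", "c.3"], "oclc": ["0010", "0013"]}, dtype="object"
)

# Derive the schema the same way the manifest cell does, then round-trip it as JSON.
toy_schema = {str(column): str(dtype) for column, dtype in toy_df.dtypes.items()}
toy_manifest_json = json.dumps({"schema": toy_schema})
csv_text = toy_df.to_csv(index=False)

# Reading with the stored schema preserves the text values;
# reading without it lets pandas infer integers and drop the zeros.
loaded_schema = json.loads(toy_manifest_json)["schema"]
typed_df = pandas.read_csv(io.StringIO(csv_text), dtype=loaded_schema)
untyped_df = pandas.read_csv(io.StringIO(csv_text))

print(typed_df["oclc"].tolist())    # ['0010', '0013']
print(untyped_df["oclc"].tolist())  # [10, 13]
```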

#### Finishing up!

In [7]:
print("Completed notebook ({}).".format(datetime.datetime.utcnow().isoformat()))
print("Output created:")
print(dataset_file)
print(manifest_file)

Completed notebook (2019-08-16T22:55:02.420918).
Output created:
data/zephir-htusfd-overlap.csv
data/zephir-htusfd-overlap.manifest.json
