## HathiTrust US Federal Document Suggester Analysis
### *Analysis of unmarked USFDs among known USFDs using the HT US Fed Docs Registry*

### Introduction
HathiTrust (HT) determines US Federal Documents (USFDs) through bibliographic analysis. Specifically, the MARC 008 field is inspected for a designation that the item was published in the US and is a federal document*. Many institutions have deposited USFDs in the HT repository but have yet to indicate them through the 008 in the MARC record. If they were to update their metadata records with the correct indicators, many of these volumes would be bibliographically determined to be public domain. This new determination would give the public access through the HT website.

We can identify many of these unmarked USFDs by using the HT US Federal Documents registry, which maintains a database of OCLC numbers of US Federal Documents. By cross-referencing this database with the OCLC numbers in Zephir's HT metadata database, we can find volumes which are not marked as a USFD, but probably should be. From this we can derive lists to give to contributors suggesting records to inspect and update if they truly are USFDs.
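
The cross-referencing step can be sketched with a pandas inner join. This is a minimal illustration on toy data, not the notebook that built the real overlap dataset: the column names `htid`, `oclc`, and `usfd_status` are the ones used later in this notebook, while the layout of the two input frames and all sample values are assumptions.

```python
import pandas

# Toy stand-in for a Zephir export (sample values are made up)
zephir_df = pandas.DataFrame({
    "htid": ["mdp.001", "mdp.002", "uc1.003"],
    "oclc": ["100", "200", "300"],
    "usfd_status": [True, False, False],
})

# Toy stand-in for the registry's list of known USFD OCLC numbers
registry_df = pandas.DataFrame({"oclc": ["100", "200"]})

# Keep only Zephir rows whose OCLC number also appears in the registry
overlap_df = zephir_df.merge(registry_df, on="oclc", how="inner")

# Volumes in the overlap that are not yet marked as USFD are the candidates
candidates_df = overlap_df[~overlap_df["usfd_status"]]
print(candidates_df["htid"].tolist())
```

Here `mdp.001` is already marked and `uc1.003` has no registry match, so only `mdp.002` would be suggested for review.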

This notebook analyzes this overlap and creates a prototype for generating suggestion lists.

*This is a proxy for US Federal Documents, as sometimes the US government publishes documents outside of the US and other countries may publish documents in the US.
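
As a rough illustration of the 008 check described above, the sketch below tests a raw 008 string. It assumes the MARC 21 layout for book-format records, where character position 17 is the last character of the place-of-publication code (`u` for US codes such as `dcu`) and position 28 is the government publication code (`f` for federal/national). The function names `looks_like_usfd` and `make_008` and the sample values are hypothetical, and HT's production logic may differ.

```python
def looks_like_usfd(field_008: str) -> bool:
    """Return True when a raw 008 string carries both USFD indicators.

    Assumption: book-format 008 layout (place code at 15-17, government
    publication code at 28). This is an illustration, not HT's actual code.
    """
    if len(field_008) < 29:
        return False
    published_in_us = field_008[17] == "u"
    federal_document = field_008[28] == "f"
    return published_in_us and federal_document


def make_008(place: str, gov_pub: str) -> str:
    """Build a blank 40-character 008 with only the two positions of interest set."""
    chars = [" "] * 40
    chars[15:18] = place.ljust(3)
    chars[28] = gov_pub
    return "".join(chars)


marked = make_008("dcu", "f")    # US place code, federal code set
unmarked = make_008("dcu", " ")  # same place, government code left blank

print(looks_like_usfd(marked))
print(looks_like_usfd(unmarked))
```

The second record is the kind of case this notebook is after: a document that may well be federal but whose 008 does not say so.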

## Data

### Imports

In [None]:
import datetime
import collections
import json
import os


import pandas

### Dataset Loading

#### Zephir/USFD Registry Overlap
This is a dataset of the overlap between two data sources (Zephir and the HT USFD Registry). It was created by a [Jupyter notebook](http://localhost:8888/lab/tree/create_overlap_dataset.ipynb) that matches exports from each source, using the OCLC number as the match point. Since HathiTrust volumes can have more than one OCLC number, a volume can appear more than once in this overlap dataset.
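
To make the duplication concrete, here is a small sketch with made-up values, in which one volume matched the registry on two different OCLC numbers. It shows why row counts and volume counts differ in this dataset.

```python
import pandas

# Toy overlap rows: volume mdp.001 matched on two OCLC numbers (values are made up)
toy_overlap_df = pandas.DataFrame({
    "htid": ["mdp.001", "mdp.001", "uc1.002"],
    "oclc": ["100", "101", "200"],
})

row_count = len(toy_overlap_df)                   # one row per (volume, OCLC) match
volume_count = toy_overlap_df["htid"].nunique()   # one per distinct volume
print(row_count, volume_count)
```

Any count of volumes taken from the overlap therefore needs to be based on distinct `htid` values rather than on rows.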




In [None]:
with open('data/zephir-htusfd-overlap.manifest.json') as json_file:
    overlap_manifest = json.load(json_file)
    
print("The {} was created on {}".format(overlap_manifest["name"], overlap_manifest["datetime"]))
print("The dataset was derived from:")
for dataset in overlap_manifest["data-origins"]:
    print("Dataset {} created at {}".format(dataset["origin"],dataset["datetime"]))

usfd_overlap_df = pandas.read_csv("data/zephir-htusfd-overlap.csv", dtype=overlap_manifest["schema"])

## Analysis

### How many volumes in HathiTrust have overlapping OCLCs in the US Fed Docs Registry?

In [None]:
# only count unique volumes, some may have multiple oclcs and matches
volumes_matched = usfd_overlap_df.htid.unique()
print("There are {} volumes w/ OCLC numbers matched in the registry".format(len(volumes_matched)))

### How many of those volumes are not marked USFD?

In [None]:
# Only keep records not marked as USFD
usfd_suggest_df = usfd_overlap_df[usfd_overlap_df['usfd_status']==False]
# Drop OCLC and eliminate duplicates caused by multiple OCLC matches
unique_usfd_suggest_df = usfd_suggest_df.drop(columns="oclc").drop_duplicates(keep="first")
print("There are {} volumes not marked USFD".format(len(unique_usfd_suggest_df)))

### What is the breakdown of those volumes by contributor?

In [None]:
print("Contributors by volumes")
print(unique_usfd_suggest_df[["contributor_code", "htid"]].groupby(["contributor_code"]).count())

### How many bibliographic records submitted from an ILS are not marked USFD?

There can be multiple volumes associated with a single bibliographic record, so this count will be lower. Ultimately, to update the 008, the contributor will need to use the ILS bibliographic identifier rather than the HT volume identifier. We call this identifier the 'contribsys_id' in the Zephir database. The contributor will then need to update that bibliographic record in their ILS and resubmit it.
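
The one-to-many relationship between a bibliographic record and its volumes can be illustrated with a small sketch. The contributor codes and identifiers below are made up; only the column names come from this notebook.

```python
import pandas

# Two volumes share bib record b100; a third belongs to another contributor
toy_suggest_df = pandas.DataFrame({
    "htid": ["mdp.001", "mdp.002", "uc1.003"],
    "contributor_code": ["miu", "miu", "ucb"],
    "contribsys_id": ["b100", "b100", "c200"],
})

# Dropping the volume identifier leaves repeated bib rows, which then collapse
toy_bib_df = toy_suggest_df.drop(columns=["htid"]).drop_duplicates(keep="first")

volume_total = len(toy_suggest_df)
bib_total = len(toy_bib_df)
print(volume_total, bib_total)
```

Three volumes reduce to two bibliographic records, which is the unit a contributor would actually edit in their ILS.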

In [None]:
# Drop HTID and eliminate duplicates from 1-N relationship with contribsys_id (Contributor's ILS Bib #)
unique_contribsys_usfd_df = unique_usfd_suggest_df.drop(columns=["htid"]).drop_duplicates(keep="first")
print("Total count by contribsys_id: {}".format(len(unique_contribsys_usfd_df)))

### What is the contributor summary by bibliographic control identifier?

In [None]:
print("Contributors by contribsys_id (ILS Bib #)")
print(unique_contribsys_usfd_df[["contributor_code", "contribsys_id"]].groupby(["contributor_code"]).count())

## Creating output

#### Saving dataset

Export candidate suggestions to a dataset.
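
Because the suggestions are ultimately meant for individual contributors, one possible follow-up is to split the export into one file per contributor. The sketch below is only an idea for that step, not part of the current notebook: the per-contributor file naming, the temporary output directory, and the sample values are all assumptions.

```python
import os
import tempfile

import pandas

# Toy suggestions frame (contributor codes and identifiers are made up)
toy_suggestions_df = pandas.DataFrame({
    "contributor_code": ["miu", "ucb", "miu"],
    "contribsys_id": ["b200", "c300", "b100"],
})

# Write into a throwaway directory so the sketch never touches data/
output_dir = tempfile.mkdtemp()
written_files = []
for code, group_df in toy_suggestions_df.groupby("contributor_code"):
    # Hypothetical naming scheme: one CSV per contributor code
    path = os.path.join(output_dir, "htusfd-suggestions-{}.csv".format(code))
    group_df.sort_values(by="contribsys_id").to_csv(path, header=True, index=False)
    written_files.append(path)

print(written_files)
```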

In [None]:
# presort the output
sorted_contribsys_suggestions_df = unique_contribsys_usfd_df.sort_values(by=["contributor_code", "contribsys_id"])
# export the sorted frame, without the status column (every remaining row is unmarked)
export_df = sorted_contribsys_suggestions_df.drop(columns="usfd_status")

dataset_name = "htusfd-suggestions"
dataset_file = "data/{}.csv".format(dataset_name)

# save the dataset
export_df.to_csv(dataset_file, header=True, index=False)

#### Saving the manifest

In [None]:
# create a manifest
manifest = {}
manifest["name"] = dataset_name
manifest["description"] = "Dataset for identifying possible unmarked USFD by contributor"
# record the dataset file's modification time (UTC) as its creation timestamp
manifest["datetime"] = datetime.datetime.utcfromtimestamp(os.path.getmtime(dataset_file)).isoformat()

# derive schema from dataframe
schema = collections.OrderedDict()
for column in export_df.columns.array:
    schema[str(column)] = str(export_df.dtypes[column])
manifest["schema"] = schema

# note the format
manifest["format"] = {
        "delimiter": ",",
        "encoding": "utf8",
        "extension": "csv",
        "header": True,
        "type": "delimited"
    }
        
# note the origins
manifest["data-origins"] = [{
    "origin": "zephir-htusfd-overlap.csv",
    "datetime": str(overlap_manifest["datetime"])}]

# save the manifest
manifest_file = "data/{}.manifest.json".format(manifest["name"])
with open(manifest_file, 'w') as outfile:
    json.dump(manifest, outfile, indent=4)

#### Finishing up!

In [None]:
print("Completed notebook ({}).".format(datetime.datetime.utcnow().isoformat()))
print("Output created:")
print(dataset_file)
print(manifest_file)