This notebook lays out and runs the process for loading USNVC source data into a MongoDB store.

In [1]:
import os,requests,json
import pandas as pd
from IPython.display import display
from bis2 import dd

# Source Item
Version 2.02 of the USNVC starts in a ScienceBase Item that contains some basic metadata about the source and its provenance along with the files for processing. I unzipped the original packaged files provided by Alexa McKerrow from the database dump we get from the NatureServe Biotics system and labeled them for use in processing. The following code block reads the ScienceBase Item and sets everything up for processing. The files for processing are built into a simple data structure and displayed for reference.

Note: I broke off the PROV work that I was doing here because it was making my head hurt and taking too long to figure out. I'll revisit that in other code.

In [16]:
# Get the catalog root out of the namespaces because we need to reference it below
sbCatalogRoot = "https://www.sciencebase.gov/catalog/item/"
usnvcVersion2_02SourceItemID = "5aa827a2e4b0b1c392ef337a"

usnvcSource = requests.get(sbCatalogRoot+usnvcVersion2_02SourceItemID+"?format=json&fields=files,contacts,dates").json()

processFiles = {}
for file in [f for f in usnvcSource["files"] if f["title"].split("_")[0] in ["SourceData","CodeList","RelationshipData"]]:
    processFiles[file["title"]] = file["url"]
display (processFiles)

{'CodeList_ClassificationLevel': 'https://www.sciencebase.gov/catalog/file/get/5aa827a2e4b0b1c392ef337a?f=__disk__5f%2F17%2Fcb%2F5f17cbfa22c0a06409fa45297604238f826813b2',
 'CodeList_ConfidenceClassification': 'https://www.sciencebase.gov/catalog/file/get/5aa827a2e4b0b1c392ef337a?f=__disk__3e%2F0b%2F5a%2F3e0b5af927557c70b75e79c2d6bb31dd92ad7b9f',
 'CodeList_CurrentPresenceAbsence': 'https://www.sciencebase.gov/catalog/file/get/5aa827a2e4b0b1c392ef337a?f=__disk__e5%2Fd7%2F22%2Fe5d722957b65391d251b8d698208b18b551c0820',
 'CodeList_DistributionConfidence': 'https://www.sciencebase.gov/catalog/file/get/5aa827a2e4b0b1c392ef337a?f=__disk__84%2Fed%2Fcb%2F84edcbff904f568ea9cf4729825bc680d9f66ca0',
 'CodeList_OccurrenceStatus': 'https://www.sciencebase.gov/catalog/file/get/5aa827a2e4b0b1c392ef337a?f=__disk__13%2Ffd%2Fe3%2F13fde34fafe3c9f25f867bbe90c64b96b5836a57',
 'CodeList_SpatialPattern': 'https://www.sciencebase.gov/catalog/file/get/5aa827a2e4b0b1c392ef337a?f=__disk__76%2Fc3%2F85%2F76c38549

# Unit Attributes, Hierarchy, and Descriptions
The following code block merges the unit and unit description tables into one dataframe that serves as the core data for processing.

In [12]:
units = pd.read_csv(processFiles["SourceData_Units"], sep='\t', encoding = "ISO-8859-1", dtype={"element_global_id":str,"parent_id":str,"classif_confidence_id":int})
unitDescriptions = pd.read_csv(processFiles["SourceData_UnitDescription"], sep='\t', encoding = "ISO-8859-1", dtype={"element_global_id":str})
codes_classificationConfidence = pd.read_csv(processFiles["CodeList_ConfidenceClassification"], sep='\t', encoding = "ISO-8859-1", dtype={"D_CLASSIF_CONFIDENCE_ID":int})
codes_classificationConfidence.rename(columns={'D_CLASSIF_CONFIDENCE_ID':'classif_confidence_id'}, inplace=True)

nvcsUnits = pd.merge(units, unitDescriptions, how='left', on='element_global_id')
nvcsUnits = pd.merge(nvcsUnits, codes_classificationConfidence, how='left', on='classif_confidence_id')
print (nvcsUnits.dtypes)

del units
del unitDescriptions
del codes_classificationConfidence

element_global_id             object
parent_id                     object
d_classification_level_id      int64
elementuid                   float64
classificationcode            object
databasecode                  object
status                        object
colloquialname                object
scientificname                object
formattedscientificname       object
translatedname                object
hierarchylevel                object
unitsort                      object
usstatus                      object
typeconceptsentence           object
parentkey                     object
parentname                    object
typeconcept                   object
diagnosticcharacteristics     object
rationale                     object
classificationcomments        object
similarnvctypescomments       object
physiognomy                   object
floristics                    object
plotcount                    float64
dynamics                      object
environment                   object
r
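
As a small illustration of why the unit merge uses `how='left'`, the sketch below compares a left merge with the default inner merge on toy data. The two toy dataframes and their values are made up for illustration and are not drawn from the USNVC files; `indicator=True` is an optional pandas flag that is not used in the original cell.

```python
import pandas as pd

# Hypothetical stand-ins for the units and unit description tables
toy_units = pd.DataFrame({
    "element_global_id": ["1", "2", "3"],
    "hierarchylevel": ["Class", "Subclass", "Formation"],
})
toy_descriptions = pd.DataFrame({
    "element_global_id": ["1", "2"],
    "typeconcept": ["text for unit 1", "text for unit 2"],
})

# A left merge keeps every unit, even one with no description row
left_merged = pd.merge(toy_units, toy_descriptions, how="left",
                       on="element_global_id", indicator=True)

# The default (inner) merge silently drops units that have no match
inner_merged = pd.merge(toy_units, toy_descriptions, on="element_global_id")

# The _merge column added by indicator=True flags the unmatched units
unmatched = left_merged.loc[left_merged["_merge"] == "left_only",
                            "element_global_id"].tolist()
print(len(left_merged), len(inner_merged), unmatched)
```

Checking the unmatched list after a merge is a quick way to see whether any units are missing descriptions before building documents from them.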

# Unit References
The following dataframes assemble the unit by unit references into a merged dataframe for later query and processing when building the unit documents.

In [13]:
unitByReference = pd.read_csv(processFiles["RelationshipData_UnitXReference"], sep='\t', encoding = "ISO-8859-1", dtype={"element_global_id":str,"reference_id":str})
references = pd.read_csv(processFiles["SourceData_References"], sep='\t', encoding = "ISO-8859-1", dtype={"reference_id":str})
unitReferences = pd.merge(left=unitByReference,right=references, left_on='reference_id', right_on='reference_id')

print (unitReferences.dtypes)

del unitByReference
del references

element_global_id    object
reference_id         object
shortcitation        object
fullcitation         object
dtype: object
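
Because references are later looked up one unit at a time, one option is to group the merged rows by `element_global_id` once, up front. The following is a minimal sketch using plain dictionaries in place of dataframe rows; the `group_by_unit` helper and the sample records are hypothetical and not part of the original notebook.

```python
from collections import defaultdict

def group_by_unit(records, key="element_global_id"):
    """Group a list of row dictionaries by their unit identifier."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key]].append(record)
    return dict(grouped)

# Hypothetical rows shaped like unitReferences.to_dict("records")
sample_rows = [
    {"element_global_id": "1", "reference_id": "a",
     "shortcitation": "Example A", "fullcitation": "Example citation A"},
    {"element_global_id": "1", "reference_id": "b",
     "shortcitation": "Example B", "fullcitation": "Example citation B"},
    {"element_global_id": "2", "reference_id": "a",
     "shortcitation": "Example A", "fullcitation": "Example citation A"},
]

references_by_unit = group_by_unit(sample_rows)

# Units with no references simply fall back to an empty list
print(len(references_by_unit["1"]))
print(references_by_unit.get("3", []))
```

With a structure like this, each unit's references become a dictionary lookup rather than a scan of the whole dataframe.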


# Unit Predecessors
The following codeblock retrieves the unit predecessors for processing.

In [14]:
unitPredecessors = pd.read_csv(processFiles["SourceData_UnitPredecessor"], sep='\t', encoding = "ISO-8859-1", dtype={"element_global_id":str,"predecessor_id":str})

print(unitPredecessors.dtypes)

element_global_id            object
predecessor_id               object
predecessorcode              object
predecessorname              object
predecessorsciname           object
predecessorcolloquialname    object
lineagedate                  object
lineagenote                  object
lineageauthorizedby          object
dtype: object


# Obsolete Records
The following codeblock retrieves the two tables that contain references to obsolete units or names. We may want to examine this in future versions to move from simply capturing notes about obsolescence to keeping track of what is actually changing. Alternatively, we can keep with a whole dataset versioning construct if that works better for the community, but as soon as we start minting individual DOIs for the units, making them citable, that changes the dynamic in how we manage the data moving forward.

In [15]:
obsoleteUnits = pd.read_csv(processFiles["SourceData_ObsoleteName"], sep='\t', encoding = "ISO-8859-1", dtype={"element_global_id":str})
print (obsoleteUnits.dtypes)

obsoleteParents = pd.read_csv(processFiles["SourceData_ObsoleteParent"], sep='\t', encoding = "ISO-8859-1", dtype={"element_global_id":str})
print (obsoleteParents.dtypes)


element_global_id    object
obsoletename         object
obsoletenote         object
obsoletedate         object
obsoleteauthority    object
dtype: object
element_global_id     object
obsoleteparentcode    object
obsoletedivision      object
obsoleteparentname    object
obsoletenote          object
obsoletedate          object
obsoleteauthority     object
dtype: object


# Unit Distribution - Nations and Subnations
The following code block assembles the four tables that make up the code references for unit-by-unit distribution at the national level and in North American states and provinces. I experimented with adding a little value to the nations structure by looking up names and building objects that contain name, abbreviation, uncertainty (true/false), and an info API reference, while also keeping the original raw string of national abbreviations. That process would be a lot smarter if I first pulled together a distinct list of all referenced nation abbreviations here and built a lookup dataframe from it. I'll revisit that at some point or if the code bogs down, but the REST API call is pretty quick.

In [17]:
unitXSubnation = pd.read_csv(processFiles["RelationshipData_UnitXSubnation"], sep='\t', encoding = "ISO-8859-1", dtype=str)
codes_CurrentPresAbs = pd.read_csv(processFiles["CodeList_CurrentPresenceAbsence"], sep='\t', encoding = "ISO-8859-1", dtype=str)
codes_DistConfidence = pd.read_csv(processFiles["CodeList_DistributionConfidence"], sep='\t', encoding = "ISO-8859-1", dtype=str)
codes_Subnations = pd.read_csv(processFiles["CodeList_Subnation"], sep='\t', encoding = "ISO-8859-1", dtype=str)

nvcsDistribution = pd.merge(left=unitXSubnation,right=codes_CurrentPresAbs, left_on='d_curr_presence_absence_id', right_on='D_CURR_PRESENCE_ABSENCE_ID')
nvcsDistribution = pd.merge(left=nvcsDistribution,right=codes_DistConfidence, left_on='d_dist_confidence_id', right_on='D_DIST_CONFIDENCE_ID')
nvcsDistribution = pd.merge(left=nvcsDistribution,right=codes_Subnations, left_on='subnation_id', right_on='subnation_id')

print (nvcsDistribution.dtypes, nvcsDistribution.size)

del unitXSubnation
del codes_CurrentPresAbs
del codes_DistConfidence
del codes_Subnations

element_global_id             object
subnation_id                  object
d_curr_presence_absence_id    object
d_dist_confidence_id          object
D_CURR_PRESENCE_ABSENCE_ID    object
CURR_PRESENCE_ABSENCE_DESC    object
CURR_PRESENCE_ABSENCE_CD      object
D_DIST_CONFIDENCE_ID          object
DIST_CONFIDENCE_CD            object
DIST_CONFIDENCE_DESC          object
iso_nation_cd                 object
subnation_code                object
subnation_name                object
dtype: object 427336
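
The distinct-list idea described at the top of this section could look something like the following sketch. It parses raw nation strings like those in the `nations` column, collects each abbreviation once, and calls a lookup function a single time per abbreviation. The `fetch_name` argument is a hypothetical stand-in for the REST call, the helper names are my own, and the sample strings and names are made up for illustration.

```python
def parse_nations(raw_list):
    """Split a raw nation string into (abbreviation, uncertain) pairs."""
    parsed = []
    for nation in raw_list.split(","):
        token = nation.strip()
        if not token:
            continue
        # A trailing "?" marks an uncertain occurrence
        parsed.append((token.replace("?", "").strip(), token.endswith("?")))
    return parsed

def build_nation_lookup(raw_lists, fetch_name):
    """Look up each distinct abbreviation exactly once."""
    distinct = sorted({abbr for raw in raw_lists for abbr, _ in parse_nations(raw)})
    return {abbr: fetch_name(abbr) for abbr in distinct}

# Stub lookup that records its calls; a real version would call the REST API
calls = []
def fake_fetch_name(abbreviation):
    calls.append(abbreviation)
    stub_names = {"US": "United States of America", "CA": "Canada", "MX": "Mexico"}
    return stub_names.get(abbreviation)

nation_lookup = build_nation_lookup(["CA, US", "US, MX?", "CA?"], fake_fetch_name)
print(nation_lookup)
print(calls)
```

Note that this sketch strips whitespace before checking for the trailing "?", which differs slightly from the per-row logic used later in the notebook.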


# USFS Ecoregions
There is a coded list of USFS Ecoregion information in the unit descriptions, but this would have to be parsed and referenced out anyway and the base information seems to come through a "unitX..." set of tables. This codeblock sets those data up for processing.

In [18]:
unitXUSFSEcoregion1994 = pd.read_csv(processFiles["RelationshipData_UnitXUSFSEcoregion1994"], sep='\t', encoding = "ISO-8859-1", dtype=str)
codes_USFSEcoregions1994 = pd.read_csv(processFiles["CodeList_USFSEcoregion1994"], sep='\t', encoding = "ISO-8859-1", dtype=str)

unitXUSFSEcoregion2007 = pd.read_csv(processFiles["RelationshipData_UnitXUSFSEcoregion2007"], sep='\t', encoding = "ISO-8859-1", dtype=str)
codes_USFSEcoregions2007 = pd.read_csv(processFiles["CodeList_USFSEcoregion2007"], sep='\t', encoding = "ISO-8859-1", dtype=str)

codes_OccurrenceStatus = pd.read_csv(processFiles["CodeList_OccurrenceStatus"], sep='\t', encoding = "ISO-8859-1", dtype=str)

usfsEcoregionDistribution1994 = pd.merge(left=unitXUSFSEcoregion1994,right=codes_USFSEcoregions1994, left_on='usfs_ecoregion_id', right_on='USFS_ECOREGION_ID')
usfsEcoregionDistribution1994 = pd.merge(left=usfsEcoregionDistribution1994,right=codes_OccurrenceStatus, left_on='d_occurrence_status_id', right_on='D_OCCURRENCE_STATUS_ID')

usfsEcoregionDistribution2007 = pd.merge(left=unitXUSFSEcoregion2007,right=codes_USFSEcoregions2007, left_on='usfs_ecoregion_2007_id', right_on='usfs_ecoregion_2007_id')
usfsEcoregionDistribution2007 = pd.merge(left=usfsEcoregionDistribution2007,right=codes_OccurrenceStatus, left_on='d_occurrence_status_id', right_on='D_OCCURRENCE_STATUS_ID')

print (usfsEcoregionDistribution1994.dtypes)
print ("----------")
print (usfsEcoregionDistribution2007.dtypes)

del unitXUSFSEcoregion1994
del codes_USFSEcoregions1994
del unitXUSFSEcoregion2007
del codes_USFSEcoregions2007
del codes_OccurrenceStatus

element_global_id            object
usfs_ecoregion_id            object
d_occurrence_status_id       object
USFS_ECOREGION_ID            object
PARENT_USFS_ECOREGION_ID     object
D_USFS_ECOREGION_LEVEL_ID    object
USFS_ECOREGION_NAME          object
USFS_ECOREGION_CLASS_CD      object
USFS_ECOREGION_CONCAT_CD     object
D_OCCURRENCE_STATUS_ID       object
OCCURRENCE_STATUS_CD         object
OCCURRENCE_STATUS_DESC       object
dtype: object
----------
element_global_id                object
usfs_ecoregion_2007_id           object
d_occurrence_status_id           object
parent_usfs_ecoregion_2007_id    object
d_usfs_ecoregion_level_id        object
usfs_ecoregion_2007_name         object
usfs_ecoregion_2007_concat_cd    object
D_OCCURRENCE_STATUS_ID           object
OCCURRENCE_STATUS_CD             object
OCCURRENCE_STATUS_DESC           object
dtype: object


# Helper Functions
The following functions are somewhat specific to NVCS processing but could be pulled out to BIS functions somewhere if desired. The clean_string function, in particular, is probably something to be generalized.

In [19]:
def clean_string(text):
    # Decode the three HTML entities that show up in the source text fields
    replacements = {'&amp;': '&','&lt;':'<','&gt;':'>'}
    for x,y in replacements.items():
        text = text.replace(x, y)
    return text

def get_hierarchy_from_df(element_global_id):
    # Assumes the full dataframe exists in memory here already
    thisUnitData = nvcsUnits.loc[nvcsUnits["element_global_id"] == str(element_global_id), ["element_global_id","parent_id","hierarchylevel","classificationcode","databasecode","translatedname","colloquialname","unitsort","DISPLAY_ORDER"]]
    
    immediateChildren = nvcsUnits.loc[nvcsUnits["parent_id"] == str(element_global_id), ["element_global_id","parent_id","hierarchylevel","classificationcode","databasecode","translatedname","colloquialname","unitsort","DISPLAY_ORDER"]]

    parentID = thisUnitData["parent_id"].values[0]

    ancestors = []
    while type(parentID) is str:
        ancestor = nvcsUnits.loc[nvcsUnits["element_global_id"] == str(parentID), ["element_global_id","parent_id","hierarchylevel","classificationcode","databasecode","translatedname","colloquialname","unitsort","DISPLAY_ORDER"]]
        ancestors = ancestors + ancestor.to_dict("records")
        parentID = ancestor["parent_id"].values[0]
        
    hierarchyList = []
    for record in ancestors+thisUnitData.to_dict("records")+immediateChildren.to_dict("records"):
        if record["hierarchylevel"] in ["Class","Subclass","Formation","Division"]:
            record["Display Title"] = record["classificationcode"]+" "+record["colloquialname"]+" "+record["hierarchylevel"]
        elif record["hierarchylevel"] in ["Macrogroup","Group"]:
            record["Display Title"] = record["classificationcode"]+" "+record["translatedname"]
        else:
            record["Display Title"] = record["databasecode"]+" "+record["translatedname"]
        hierarchyList.append(record)

    return {"Children":list(map(int, immediateChildren["element_global_id"].tolist())),"Hierarchy":hierarchyList,"Ancestors":list(map(int, [a["element_global_id"] for a in ancestors]))}

def logical_nvcs_root():
    classLevel = nvcsUnits.loc[nvcsUnits["parent_id"].isnull(), ["element_global_id"]]
    nvcsRootDoc = {}
    nvcsRootDoc["_id"] = int(0)
    nvcsRootDoc["title"] = "US National Vegetation Classification"
    nvcsRootDoc["parent"] = None
    nvcsRootDoc["ancestors"] = None
    nvcsRootDoc["children"] = list(map(int, classLevel["element_global_id"].tolist()))
    nvcsRootDoc["Hierarchy"] = {"unitsort":str(0)}
    
    return nvcsRootDoc
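
If `clean_string` were generalized, one possible route is the standard library's `html.unescape`, which decodes all HTML character references rather than just three. The sketch below is an assumption about what a generalized helper might look like, not the function used in this notebook; the name `clean_string_general` and the sample text are hypothetical.

```python
import html

def clean_string_general(text):
    """Decode HTML character references; pass non-string values through unchanged."""
    if not isinstance(text, str):
        # Missing values arrive as float NaN from pandas, so leave them alone
        return text
    return html.unescape(text)

cleaned = clean_string_general("Oak &amp; Pine &lt;i&gt;Forest&lt;/i&gt;")
print(cleaned)
```

Because `html.unescape` also decodes references such as `&quot;` and numeric entities, its output can differ from the three-entity version on text that contains those.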

# Process Source and Build NVCS Docs
The following code block is the meat of this process. It takes quite a while to run as there are a number of steps and conditional logic that need to play out. I used a couple of guiding principles in laying out these documents.

* Store the data according to the basic pattern established by the ESA Veg Panel in helping to design the online USNVC Explorer app so that it can pretty much be navigated and understood in its "native" form.
* Assign more human-friendly attribute names to the things that we will display to people, but retain a few of the "ugly names" for things that have special meaning in the data assembly process.

I ended up using the element_global_id as the unique id value in the documents as it is unique across the recordset and will be used to maintain record integrity over time.

I build and store the same unit-by-unit snapshot of the surrounding hierarchy (ancestors and immediate children), similar to how the current application works. I also store the parent ID, and I build children and ancestors at the root level of the documents, following document database best practices and to support later processing.

For help in later presentation and usability of the structure, I create a logical root document with an ID of 0 and a small amount of information. The "parentless" Class and Cultural Class units are assigned this unit as parent.
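
The parent, ancestors, and children arrangement described above can be sketched with a plain dictionary of parent pointers. This is a simplified illustration under the assumption that parentless units map to `None`; the helper names and the tiny hierarchy are hypothetical, and the sketch assumes the hierarchy contains no cycles.

```python
def ancestors_of(unit_id, parent_of, root_id=0):
    """Walk parent pointers upward, nearest ancestor first.

    Units with no parent fall back to the logical root document.
    """
    chain = []
    current = parent_of.get(unit_id)
    while current is not None:
        chain.append(current)
        current = parent_of.get(current)
    return chain if chain else [root_id]

def children_of(unit_id, parent_of):
    """Return the immediate children of a unit."""
    return sorted(child for child, parent in parent_of.items() if parent == unit_id)

# Hypothetical hierarchy: unit 10 is a parentless Class
parent_of = {10: None, 20: 10, 30: 20, 31: 20}

print(ancestors_of(30, parent_of))
print(ancestors_of(10, parent_of))
print(children_of(20, parent_of))
```

Storing the ancestor list on every document trades some duplication for the ability to fetch a whole branch with a single query on that field.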

Quite a bit of conditional logic goes into building display title from other attributes, and I pull this up to the top of the document as "title" for convenience in later building out the hierarchy.
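
Since the same display title rules apply both to unit documents and to the cached hierarchy records, they could be factored into one helper. The following is a sketch of that idea, assuming a record is a dictionary with the column names used in this notebook; the function name `build_display_title` and the sample values are hypothetical.

```python
def build_display_title(record):
    """Build a display title from the hierarchy level and name attributes."""
    level = record["hierarchylevel"]
    if level in ("Class", "Subclass", "Formation", "Division"):
        # Upper levels: classification code, colloquial name, then the level name
        return " ".join([record["classificationcode"], record["colloquialname"], level])
    if level in ("Macrogroup", "Group"):
        # Mid levels: classification code and translated name
        return record["classificationcode"] + " " + record["translatedname"]
    # Everything else: database code and translated name
    return record["databasecode"] + " " + record["translatedname"]

# Made-up records for illustration only
upper = {"hierarchylevel": "Class", "classificationcode": "1",
         "colloquialname": "Example Upper Unit", "translatedname": "Unused",
         "databasecode": "X1"}
middle = {"hierarchylevel": "Group", "classificationcode": "G001",
          "colloquialname": "Unused", "translatedname": "Example Group",
          "databasecode": "X2"}
lower = {"hierarchylevel": "Association", "classificationcode": "A1",
         "colloquialname": "Unused", "translatedname": "Example Association",
         "databasecode": "X3"}

for sample in (upper, middle, lower):
    print(build_display_title(sample))
```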

In [20]:
nvcsUnitDocs = [logical_nvcs_root()]
for index,row in nvcsUnits.iterrows():
    # Development limiter: stops after the rows with index 0 through 30; remove to process every unit
    if index > 30:
        break
    unitDoc = {"Identifiers":{},"Overview":{},"Hierarchy":{},"Vegetation":{},"Environment":{},"Distribution":{},"Plot Sampling and Analysis":{},"Confidence Level":{},"Conservation Status":{},"Concept History":{},"Synonymy":{},"Authorship":{},"References":[]}

    unitDoc["_id"] = int(row["element_global_id"])

    unitDoc["Identifiers"]["element_global_id"] = int(row["element_global_id"])
    unitDoc["Identifiers"]["Database Code"] = row["databasecode"]
    unitDoc["Identifiers"]["Classification Code"] = row["classificationcode"]

    unitDoc["Overview"]["Scientific Name"] = row["scientificname"]
    unitDoc["Overview"]["Formatted Scientific Name"] = clean_string(row["formattedscientificname"])
    unitDoc["Overview"]["Translated Name"] = row["translatedname"]
    if type(row["colloquialname"]) is str:
        unitDoc["Overview"]["Colloquial Name"] = row["colloquialname"]
    if type(row["typeconceptsentence"]) is str:
        unitDoc["Overview"]["Type Concept Sentence"] = clean_string(row["typeconceptsentence"])
    if type(row["typeconcept"]) is str:
        unitDoc["Overview"]["Type Concept"] = clean_string(row["typeconcept"])
    if type(row["diagnosticcharacteristics"]) is str:
        unitDoc["Overview"]["Diagnostic Characteristics"] = clean_string(row["diagnosticcharacteristics"])
    if type(row["rationale"]) is str:
        unitDoc["Overview"]["Rationale for Nominal Species or Physiognomic Features"] = clean_string(row["rationale"])
    if type(row["classificationcomments"]) is str:
        unitDoc["Overview"]["Classification Comments"] = clean_string(row["classificationcomments"])
    if type(row["similarnvctypescomments"]) is str:
        unitDoc["Overview"]["Similar NVC Types"] = clean_string(row["similarnvctypescomments"])
    if type(row["othercomments"]) is str:
        unitDoc["Overview"]["Other Comments"] = clean_string(row["othercomments"])

    if row["hierarchylevel"] in ["Class","Subclass","Formation","Division"]:
        unitDoc["Overview"]["Display Title"] = row["classificationcode"]+" "+row["colloquialname"]+" "+row["hierarchylevel"]
    elif row["hierarchylevel"] in ["Macrogroup","Group"]:
        unitDoc["Overview"]["Display Title"] = row["classificationcode"]+" "+row["translatedname"]
    else:
        unitDoc["Overview"]["Display Title"] = row["databasecode"]+" "+row["translatedname"]
        
    unitDoc["title"] = unitDoc["Overview"]["Display Title"]
    
    if type(row["physiognomy"]) is str:
        unitDoc["Vegetation"]["Physiognomy and Structure"] = clean_string(row["physiognomy"])
    if type(row["floristics"]) is str:
        unitDoc["Vegetation"]["Floristics"] = clean_string(row["floristics"])
    if type(row["dynamics"]) is str:
        unitDoc["Vegetation"]["Dynamics"] = clean_string(row["dynamics"])
    
    if type(row["environment"]) is str:
        unitDoc["Environment"]["Environmental Description"] = clean_string(row["environment"])

    if type(row["spatialpattern"]) is str:
        unitDoc["Environment"]["Spatial Pattern"] = clean_string(row["spatialpattern"])

    if type(row["range"]) is str:
        unitDoc["Distribution"]["Geographic Range"] = row["range"]

    if type(row["nations"]) is str:
        unitDoc["Distribution"]["Nations"] = {"Raw List":row["nations"],"Nation Info":[]}
        for nation in unitDoc["Distribution"]["Nations"]["Raw List"].split(","):
            thisNation = {"Abbreviation":nation.replace("?","").strip()}
            if nation.endswith("?"):
                thisNation["Uncertainty"] = True
            else:
                thisNation["Uncertainty"] = False
            thisNation["Info API"] = "https://restcountries.eu/rest/v2/alpha/"+thisNation["Abbreviation"]
            thisNationInfo = requests.get(thisNation["Info API"]+"?fields=name").json()
            if "name" in thisNationInfo.keys():
                thisNation["Name"] = thisNationInfo["name"]
            unitDoc["Distribution"]["Nations"]["Nation Info"].append(thisNation)
    
    if type(row["subnations"]) is str:
        unitDoc["Distribution"]["Subnations"] = {"Raw List":row["subnations"]}

    thisDistribution = nvcsDistribution.loc[nvcsDistribution["element_global_id"] == row["element_global_id"]]
    if len(thisDistribution.index) > 0:
        unitDoc["Distribution"]["States/Provinces Raw Data"] = thisDistribution.to_dict("records")
    
    thisUSFSDistribution1994 = usfsEcoregionDistribution1994.loc[usfsEcoregionDistribution1994["element_global_id"] == row["element_global_id"]]
    if len(thisUSFSDistribution1994.index) > 0:
        unitDoc["Distribution"]["1994 USFS Ecoregion Raw Data"] = thisUSFSDistribution1994.to_dict("records")
    
    thisUSFSDistribution2007 = usfsEcoregionDistribution2007.loc[usfsEcoregionDistribution2007["element_global_id"] == row["element_global_id"]]
    if len(thisUSFSDistribution2007.index) > 0:
        unitDoc["Distribution"]["2007 USFS Ecoregion Raw Data"] = thisUSFSDistribution2007.to_dict("records")

    if type(row["tncecoregions"]) is int:
        unitDoc["Distribution"]["TNC Ecoregions"] = row["tncecoregions"]

    if type(row["omernikecoregions"]) is int:
        unitDoc["Distribution"]["Omernik Ecoregions"] = row["omernikecoregions"]

    if type(row["federallands"]) is int:
        unitDoc["Distribution"]["Federal Lands"] = row["federallands"]

    if pd.notnull(row["plotcount"]):
        # plotcount is read as float64, so test for a value and store it as a plain integer
        unitDoc["Plot Sampling and Analysis"]["Plot Count"] = int(row["plotcount"])
    if type(row["plotsummary"]) is str:
        unitDoc["Plot Sampling and Analysis"]["Plot Summary"] = row["plotsummary"]
    if type(row["plottypal"]) is str:
        unitDoc["Plot Sampling and Analysis"]["Plot Type"] = row["plottypal"]
    if type(row["plotarchived"]) is str:
        unitDoc["Plot Sampling and Analysis"]["Plot Archive"] = row["plotarchived"]
    if type(row["plotconsistency"]) is str:
        unitDoc["Plot Sampling and Analysis"]["Plot Consistency"] = row["plotconsistency"]
    if type(row["plotsize"]) is str:
        unitDoc["Plot Sampling and Analysis"]["Plot Size"] = row["plotsize"]
    if type(row["plotmethods"]) is str:
        unitDoc["Plot Sampling and Analysis"]["Plot Methods"] = row["plotmethods"]

    unitDoc["Confidence Level"]["Confidence Level"] = row["CLASSIF_CONFIDENCE_DESC"]
    if type(row["confidencecomments"]) is str:
        unitDoc["Confidence Level"]["Confidence Level Comments"] = clean_string(row["confidencecomments"])

    if type(row["grank"]) is str:
        unitDoc["Conservation Status"]["Global Rank"] = row["grank"]
    if type(row["grankreviewdate"]) is str:
        unitDoc["Conservation Status"]["Global Rank Review Date"] = row["grankreviewdate"]
    if type(row["grankauthor"]) is str:
        unitDoc["Conservation Status"]["Global Rank Author"] = row["grankauthor"]
    if type(row["grankreasons"]) is str:
        unitDoc["Conservation Status"]["Global Rank Reasons"] = row["grankreasons"]
        
    unitDoc["Hierarchy"]["parent_id"] = str(row["parent_id"])
    unitDoc["Hierarchy"]["hierarchylevel"] = row["hierarchylevel"]
    unitDoc["Hierarchy"]["d_classification_level_id"] = row["d_classification_level_id"]
    unitDoc["Hierarchy"]["unitsort"] = row["unitsort"]
    unitDoc["Hierarchy"]["parentkey"] = row["parentkey"]
    unitDoc["Hierarchy"]["parentname"] = row["parentname"]
    
    try:
        unitDoc["parent"] = int(row["parent_id"])
    except (ValueError, TypeError):
        # Units with a missing parent_id hang off the logical root document
        unitDoc["parent"] = int(0)

    thisHierarchyData = get_hierarchy_from_df(row["element_global_id"])
    unitDoc["children"] = thisHierarchyData["Children"]
    unitDoc["Hierarchy"]["Cached Hierarchy"] = thisHierarchyData["Hierarchy"]
    if len(thisHierarchyData["Ancestors"]) > 0:
        unitDoc["ancestors"] = thisHierarchyData["Ancestors"]
    else:
        unitDoc["ancestors"] = [int(0)]
    
    if type(row["lineage"]) is str:
        unitDoc["Concept History"]["Concept Lineage"] = row["lineage"]
    
    thisUnitPredecessors = unitPredecessors.loc[unitPredecessors["element_global_id"] == row["element_global_id"]]
    if len(thisUnitPredecessors.index) > 0:
        unitDoc["Concept History"]["Predecessors Raw Data"] = thisUnitPredecessors.to_dict("records")

    thisUnitObsoleteUnits = obsoleteUnits.loc[obsoleteUnits["element_global_id"] == row["element_global_id"]]
    if len(thisUnitObsoleteUnits.index) > 0:
        unitDoc["Concept History"]["Obsolete Units Raw Data"] = thisUnitObsoleteUnits.to_dict("records")

    thisUnitObsoleteParents = obsoleteParents.loc[obsoleteParents["element_global_id"] == row["element_global_id"]]
    if len(thisUnitObsoleteParents.index) > 0:
        unitDoc["Concept History"]["Obsolete Parents Raw Data"] = thisUnitObsoleteParents.to_dict("records")

    if type(row["synonymy"]) is str:
        unitDoc["Synonymy"]["Synonymy"] = row["synonymy"]

    if type(row["primaryconceptsource"]) is str:
        unitDoc["Authorship"]["Concept Author"] = row["primaryconceptsource"]
    if type(row["descriptionauthor"]) is str:
        unitDoc["Authorship"]["Description Author"] = row["descriptionauthor"]
    if type(row["acknowledgements"]) is str:
        unitDoc["Authorship"]["Acknowledgements"] = row["acknowledgements"]
    if type(row["versiondate"]) is str:
        unitDoc["Authorship"]["Version Date"] = row["versiondate"]
    
    thisUnitReferences = unitReferences.loc[unitReferences["element_global_id"] == row["element_global_id"]]
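    # Use distinct loop names so the outer loop's index and row are not overwritten
    for _, referenceRow in thisUnitReferences.iterrows():
        unitDoc["References"].append({"Short Citation":referenceRow["shortcitation"],"Full Citation":referenceRow["fullcitation"]})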

    nvcsUnitDocs.append(unitDoc)
display (nvcsUnitDocs)

[{'Hierarchy': {'unitsort': '0'},
  '_id': 0,
  'ancestors': None,
  'children': [860217, 860211, 860216, 860213, 860214, 860218, 860215],
  'parent': None,
  'title': 'US National Vegetation Classification'},
 {'Authorship': {'Concept Author': 'Hierarchy Revisions Working Group, Federal Geographic Data Committee (Faber-Langendoen et al. 2014)',
   'Description Author': 'Hierarchy Revisions Working Group',
   'Version Date': '8/2/2016'},
  'Concept History': {},
  'Confidence Level': {'Confidence Level': 'Moderate'},
  'Conservation Status': {'Global Rank': 'GNR',
   'Global Rank Review Date': '3/3/2011'},
  'Distribution': {'Geographic Range': 'Climate zones? Bailey (1989) Domains: Dry, Humid Tropical, Humid Temperate Domain, Subarctic Division of Polar Domain, and Mountain Divisions of Dry Domain. Less common in other divisions of Polar or Dry domains.'},
  'Environment': {'Environmental Description': '<i>Climate:</i> Climates range from humid tropical to boreal and subalpine, with f

# Commit to the Database
Once the structure is built out here as a list of dictionaries, we attach to the MongoDB database we are currently using, flush the current collection, and insert the entire batch of documents. This process can be swapped out to some other database infrastructure over time as this is pretty much a self-contained data processing unit.

In [None]:
bis = dd.getDB("bis")
nvcsCollection = bis["NVCS"]
nvcsCollection.delete_many({})
nvcsCollection.insert_many(nvcsUnitDocs)
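
Since the flush happens before the insert, it may be worth checking the documents before deleting anything. The sketch below wraps the same `delete_many` and `insert_many` calls in a function that first confirms the `_id` values are unique. The `replace_collection` helper and the `FakeCollection` class are hypothetical; the fake only mimics the two methods used here so the example can run without a database.

```python
def replace_collection(collection, docs):
    """Flush a collection and insert docs, refusing to flush if _id values collide."""
    ids = [doc["_id"] for doc in docs]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate _id values; collection left untouched")
    collection.delete_many({})
    collection.insert_many(docs)
    return len(docs)

class FakeCollection:
    """In-memory stand-in that mimics only delete_many and insert_many."""
    def __init__(self):
        self.docs = []

    def delete_many(self, query):
        # The sketch only ever passes an empty filter, so drop everything
        self.docs = []

    def insert_many(self, docs):
        self.docs.extend(docs)

fake = FakeCollection()
fake.insert_many([{"_id": 99, "title": "stale document"}])

inserted = replace_collection(fake, [{"_id": 0}, {"_id": 1}])
print(inserted, [doc["_id"] for doc in fake.docs])
```

In the notebook, the same function could be called with `nvcsCollection` and `nvcsUnitDocs` in place of the fake collection and sample documents.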