# data review 
***See [proposal](https://uoregon.sharepoint.com/:w:/r/sites/O365_LIB_DigitalLibraryServices/Shared%20Documents/OD2_UOregon/REF_assorted_attachments/DRAFT_202410_propose_swap_rda_elements.docx?d=w298fed7e72b14c2b9d3817d9520e44c5&csf=1&web=1&e=aL8M8e) to swap RDA Agent, Work, Expression, and Manifestation properties for RDA Unconstrained, and fix invalid RDA/RDF IRI***

## notes, data of interest 
Results are based on querying ten sample collections (see the `rdf_dumps` output below for collection URL slugs).
- **`workType`** 
    - Lots of usage
        - 249,324 total values
        - 1,187 total *unique* values, see [workType.csv](https://gist.github.com/briesenberg07/06fb741a21a34e5bf8022ecfb51f2cce#file-worktype-csv)
        - All values from two sources: Getty AAT and opaquenamespace.org workType vocab
- **`form_of_work`** 
    - Barely any usage, only 38 total values
    - Only 2 unique values, see [form_of_work.csv](https://gist.github.com/briesenberg07/06fb741a21a34e5bf8022ecfb51f2cce#file-form_of_work-csv)
- The other properties I'm proposing to swap out had little or no usage in these sample collections:
    - [biographical_information.csv](https://gist.github.com/briesenberg07/06fb741a21a34e5bf8022ecfb51f2cce#file-biographical_information-csv)
        - 6 unique values
        - free text
    - [color_content.csv](https://gist.github.com/briesenberg07/06fb741a21a34e5bf8022ecfb51f2cce#file-color_content-csv)
        - 1 unique value
        - Looks like free text? Just "color"
    - `description_of_manifestation`, `layout`, and `mode_of_issuance`
        - 0 values in any of the sample collections
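
The two-source observation for `workType` can be checked by grouping exported values by IRI host. This is a minimal sketch, assuming the values are plain strings such as those written to the CSV files by this notebook. The sample values and the helper name `count_by_host` are hypothetical and only for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_host(values):
    """Count values by IRI host; values without a host are counted as 'literal'."""
    hosts = Counter()
    for value in values:
        host = urlparse(value.strip()).netloc
        hosts[host or "literal"] += 1
    return hosts


# Hypothetical sample values, not taken from the real export
sample = [
    "http://vocab.getty.edu/aat/300046300",
    "http://vocab.getty.edu/aat/300127173",
    "http://opaquenamespace.org/ns/workType/example",
    "color",
]
host_counts = count_by_host(sample)
print(host_counts)
```

Any host other than the two expected ones, or a nonzero `literal` count, would point to values worth a closer look.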


In [20]:
import os, rdflib

In [21]:
# email briesenb@uoregon.edu if you'd like the RDF dumps
data_path = '../../data_scratch/od2colls/'  # local path to the data; change this before running
rdf_dumps = os.listdir(data_path)
print(rdf_dumps)

['fealy.ttl', 'marketing-photos.ttl', 'angelus.ttl', 'building-or.ttl', 'chinavine.ttl', 'nosatsu.ttl', 'osu-historical-images.ttl', 'lowenstam.ttl', 'uo-arch-photos.ttl', 'osu-historical-publications.ttl']


In [22]:
property_iris = {
    'biographical_information': 'http://rdaregistry.info/Elements/a/P50113',
    'description_of_manifestation': 'http://rdaregistry.info/Elements/w/P10271',
    'form_of_work': 'http://rdaregistry.info/Elements/w/P10004',
    'workType': 'http://www.rdaregistry.info/Elements/w/#P10004',
    'color_content': 'http://rdaregistry.info/Elements/e/P20224',
    'layout': 'http://rdaregistry.info/Elements/m/P30155',
    'mode_of_issuance': 'http://rdaregistry.info/Elements/m/P30003'
}
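
The `workType` IRI in this dictionary differs in shape from the other six: it has a `www.` host and a `#` before the property ID. Below is a minimal sketch of a check that flags such entries. It assumes the expected shape is `http://rdaregistry.info/Elements/<set>/P<digits>`, which is inferred from the other entries in `property_iris`, not taken from RDA Registry documentation. `find_irregular_iris` is a hypothetical helper name.

```python
import re

# Assumed pattern, inferred from the majority of IRIs in property_iris
EXPECTED = re.compile(r"^http://rdaregistry\.info/Elements/[a-z]+/P\d+$")


def find_irregular_iris(iris):
    """Return the name/IRI pairs that do not match the assumed pattern."""
    return {name: iri for name, iri in iris.items() if not EXPECTED.match(iri)}


# Two entries copied from property_iris for illustration
sample_iris = {
    'form_of_work': 'http://rdaregistry.info/Elements/w/P10004',
    'workType': 'http://www.rdaregistry.info/Elements/w/#P10004',
}
irregular = find_irregular_iris(sample_iris)
print(irregular)
```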

In [23]:
value_counts = {
    'biographical_information': [],
    'description_of_manifestation': [],
    'form_of_work': [],
    'workType': [],
    'color_content': [],
    'layout': [],
    'mode_of_issuance': []
}

In [24]:
# test
for name, iri in property_iris.items():
    print(f"{name}: {iri}")

biographical_information: http://rdaregistry.info/Elements/a/P50113
description_of_manifestation: http://rdaregistry.info/Elements/w/P10271
form_of_work: http://rdaregistry.info/Elements/w/P10004
workType: http://www.rdaregistry.info/Elements/w/#P10004
color_content: http://rdaregistry.info/Elements/e/P20224
layout: http://rdaregistry.info/Elements/m/P30155
mode_of_issuance: http://rdaregistry.info/Elements/m/P30003


In [25]:
# test: count dcterms:title values in one collection
g = rdflib.Graph().parse(f"{data_path}fealy.ttl")
title_query = """
    SELECT (COUNT(?value) AS ?totalValue) WHERE
    { ?s <http://purl.org/dc/terms/title> ?value . }
    """
result = g.query(title_query)
print(type(result))
print(len(result))
for row in result:
    print(int(row.totalValue))

<class 'rdflib.plugins.sparql.processor.SPARQLResult'>
1
15


In [26]:
# this cell took 5+ minutes to run
with open('_report_selected_props.md', 'w') as mdfile:
    for collection in rdf_dumps:
        mdfile.write(f"# {collection.split('.')[0]}\n")
        g = rdflib.Graph().parse(f"{data_path}{collection}")
        for name, iri in property_iris.items():
            mdfile.write(f"- {name} / {iri}\n")
            # total number of values for this property in this collection
            total_query = f"""
            SELECT (COUNT(?value) AS ?totalCount) WHERE
            {{ ?s <{iri}> ?value . }}
            """
            total = int(next(iter(g.query(total_query))).totalCount)
            mdfile.write(f"\t- {total} values total\n")
            if total == 0:
                continue
            # number of distinct values in this collection
            distinct_query = f"""
            SELECT (COUNT(DISTINCT ?value) AS ?totalDistinct) WHERE
            {{ ?s <{iri}> ?value . }}
            """
            for row in g.query(distinct_query):
                mdfile.write(f"\t- {row.totalDistinct} unique values\n")
            # collect distinct values across all collections
            values_query = f"""
            SELECT DISTINCT ?value WHERE
            {{ ?s <{iri}> ?value . }}
            """
            for row in g.query(values_query):
                if row.value not in value_counts[name]:
                    value_counts[name].append(row.value)
        

In [27]:
# this should've also counted total values
for name, values in value_counts.items():
    print(f"OD2 element << {name} >> has << {len(values)} >> unique values across {len(rdf_dumps)} sample collections")
    if values:
        with open(f"{name}.csv", "w") as csvfile:
            csvfile.write(f"{name}\n")  # header row: the property name
            for value in values:
                csvfile.write(f"{value}\n")

OD2 element << biographical_information >> has << 6 >> unique values across 10 sample collections
OD2 element << description_of_manifestation >> has << 0 >> unique values across 10 sample collections
OD2 element << form_of_work >> has << 2 >> unique values across 10 sample collections
OD2 element << workType >> has << 1187 >> unique values across 10 sample collections
OD2 element << color_content >> has << 1 >> unique values across 10 sample collections
OD2 element << layout >> has << 0 >> unique values across 10 sample collections
OD2 element << mode_of_issuance >> has << 0 >> unique values across 10 sample collections
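
The note in the cell above says total values should also have been counted. One way to get both figures is to tally every value in a `collections.Counter` in place of a de-duplicated list. The tallies can then be written with the `csv` module, so that values containing commas or quotes stay in one field. This is a minimal sketch with hypothetical sample values; `tally_values` and `write_counts` are illustrative names, not part of the notebook.

```python
import csv
import io
from collections import Counter


def tally_values(collections_values):
    """Tally how often each value occurs across all collections."""
    tally = Counter()
    for values in collections_values:
        tally.update(values)
    return tally


def write_counts(tally, name, fileobj):
    """Write value/count rows as CSV, most frequent value first."""
    writer = csv.writer(fileobj)
    writer.writerow([name, "count"])
    for value, count in tally.most_common():
        writer.writerow([value, count])


# Hypothetical values from two collections (non-distinct, one entry per triple)
per_collection = [
    ["color", "color", "black, white"],
    ["color"],
]
tally = tally_values(per_collection)
total, unique = sum(tally.values()), len(tally)

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_counts(tally, "color_content", buffer)
print(total, unique)
print(buffer.getvalue())
```

In the notebook this would mean selecting `?value` without `DISTINCT`, so that each triple contributes one entry to the tally.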
