In [1]:
# This imports the OpenContextAPI from the api.py file in the
# opencontext directory.
%run '../opencontext/api.py'

In [2]:
import numpy as np
import pandas as pd

oc_api = OpenContextAPI()
oc_api.set_cache_file_prefix('murlo-objs')

# Concatenate multiple non-numeric values by default, but express multiple
# values for the attributes keyed below as JSON-formatted strings.
oc_api.multi_value_handle_non_number = 'concat'
oc_api.multi_value_handle_keyed_attribs = {
    'Motif': 'json',
    'Decorative Technique': 'json',
    'Fabric Category': 'json',
}

# Clear old cached records.
oc_api.clear_api_cache()

# This is a search URL for Poggio Civitate objects (artifacts).
url = 'https://opencontext.org/query/Europe/Italy?cat=oc-gen-cat-object&proj=24-murlo&type=subjects'

# Since we're dealing with data from only one project, we won't get too many
# attributes (hopefully!), so we can just request all of them with 'ALL-ATTRIBUTES'.
attribs_for_records = ['ALL-ATTRIBUTES',]

# Make a dataframe by fetching result records from Open Context.
# This will be slow until we finish improvements to Open Context's API.
# However, the results get cached by saving as files locally. That
# makes iterating on this notebook much less painful.
df = oc_api.url_to_dataframe(url, attribs_for_records)

Got records 12601 to 12655 of 12655 from: https://opencontext.org/query/Europe/Italy?attributes=ALL-ATTRIBUTES&cat=oc-gen-cat-object&proj=24-murlo&response=metadata%2Curi-meta&rows=200&start=12600&type=subjects

This particular dataset includes long (sometimes HTML) descriptions of objects. They are cached locally in the JSON results from the API requests to Open Context. However, for the purpose of making analysis-friendly dataframes, we don't need these long free-text attributes, so we'll drop them from the dataframe.
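As a minimal sketch of the dropping step described above, the snippet below removes free-text columns from a toy dataframe using plain pandas. The toy data, the column names, and the contents of `drop_cols` are hypothetical placeholders, not the actual columns of the Murlo dataset.

```python
import pandas as pd

# Toy stand-in for the Open Context dataframe. These column names are
# hypothetical and are NOT taken from the real Murlo dataset.
toy_df = pd.DataFrame({
    'uri': ['https://example.org/1', 'https://example.org/2'],
    'label': ['Object A', 'Object B'],
    'Long Description': ['<p>' + 'text ' * 50 + '</p>', '<p>short</p>'],
})

# Hypothetical list of free-text columns that are not needed for analysis.
drop_cols = ['Long Description', 'Not A Real Column']

# errors='ignore' skips names that are absent instead of raising a KeyError.
toy_slim = toy_df.drop(columns=drop_cols, errors='ignore')
print(toy_slim.columns.tolist())
```

Passing `errors='ignore'` keeps the cell re-runnable even if the list names a column that a given query result does not contain.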

In [3]:
# List the dataframe's columns to help identify which ones to drop.
print(df.columns.tolist())

import os

# Now save the results of all of this as a CSV file.
repo_path = os.path.dirname(os.path.abspath(os.getcwd()))
csv_path = os.path.join(
    repo_path, 
    'files',
    'oc-api-murlo-objects-multivalue-as-json.csv'
)
df.to_csv(csv_path, index=False)
print('Saved this example as a CSV table at: {}'.format(csv_path))


['uri', 'citation uri', 'label', 'item category', 'project label', 'published', 'updated', 'latitude', 'longitude', 'early bce/ce', 'late bce/ce', 'Context (1)', 'Context (2)', 'Context (3)', 'Context (4)', 'Context (5)', 'Context (6)', 'Context (7)', 'project href', 'Subject', 'Subject [URI]', 'Coverage', 'Coverage [URI]', 'Temporal Coverage', 'Temporal Coverage [URI]', 'Creator', 'Creator [URI]', 'License', 'License [URI]', 'ceramic ware (visual works)', 'ceramic ware (visual works) [URI]', 'ceramic ware (visual works) [getty-aat-300386879]', 'ceramic ware (visual works) [getty-aat-300386879] [URI]', 'metal', 'metal [URI]', 'metal [getty-aat-300010900]', 'metal [getty-aat-300010900] [URI]', 'Fragment Noted', 'Record Type', 'inorganic material', 'inorganic material [URI]', 'inorganic material [getty-aat-300010360]', 'inorganic material [getty-aat-300010360] [URI]', 'Materials (hierarchy name)', 'Materials (hierarchy name) [URI]', 'Materials (hierarchy name) [getty-aat-300010357]', 'Ma

Using the already cached JSON obtained from the Open Context API, we can make a second dataframe that is "wider" (has many more columns). This wide dataframe will express multiple values for "Motif", "Decorative Technique", and "Fabric Category" in different columns. We set the dictionary `oc_api.multi_value_handle_keyed_attribs` to do this.
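To illustrate the difference between the two shapes, here is a rough sketch in plain pandas that expands a JSON-string column into one boolean column per value. It is not how `OpenContextAPI` does this internally. The toy data, the assumption that the JSON strings hold simple lists, and the helper `parse_values` are all hypothetical.

```python
import json
import pandas as pd

# Toy data: multiple values stored as JSON-formatted strings (assumed format).
toy = pd.DataFrame({
    'label': ['Object A', 'Object B', 'Object C'],
    'Motif': ['["Panther", "Potnia Theron"]', '["Panther"]', None],
})

def parse_values(cell):
    """Return the list encoded in a JSON string, or an empty list if missing."""
    if isinstance(cell, str):
        return json.loads(cell)
    return []

motifs = toy['Motif'].map(parse_values)

# Collect every distinct value, then add one boolean column per value.
all_values = sorted({v for vals in motifs for v in vals})
toy_wide = toy[['label']].copy()
for value in all_values:
    toy_wide['Motif :: ' + value] = motifs.map(lambda vals, v=value: v in vals)

print(toy_wide)
```

The JSON form keeps the table narrow, at the cost of a parsing step before analysis. The wide form is ready for filtering and counting, but adds a column for every distinct value.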

In [4]:
oc_api.multi_value_handle_non_number = 'concat'
oc_api.multi_value_handle_keyed_attribs = {
    'Motif': 'column_val',
    'Decorative Technique': 'column_val',
    'Fabric Category': 'column_val',
}
df_wide = oc_api.url_to_dataframe(url, attribs_for_records)


Got records 12601 to 12655 of 12655 from: https://opencontext.org/query/Europe/Italy?attributes=ALL-ATTRIBUTES&cat=oc-gen-cat-object&proj=24-murlo&response=metadata%2Curi-meta&rows=200&start=12600&type=subjects

The `df_wide` dataframe handles multiple values for some attributes by making many boolean columns, with each column noting the presence of a given attribute value on a row for an artifact. For example, `True` values in the column "Motif :: Panther" indicate the presence of a "Panther" motif observed on an artifact, and `True` values in the column "Motif :: Potnia Theron" indicate a "Potnia Theron" motif on an artifact.
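Boolean columns of this kind can be filtered and counted directly. The sketch below uses a small hand-made dataframe, not the real `df_wide`, and assumes the "Motif :: " column prefix shown above. In the real data, cells for absent values may be empty rather than `False`, so a `fillna(False)` step might be needed first.

```python
import pandas as pd

# Hand-made stand-in for a wide dataframe; the values are illustrative only.
toy_wide = pd.DataFrame({
    'label': ['Object A', 'Object B', 'Object C'],
    'Motif :: Panther': [True, True, False],
    'Motif :: Potnia Theron': [True, False, False],
})

# Select the motif columns by their shared prefix.
motif_cols = [c for c in toy_wide.columns if c.startswith('Motif :: ')]

# Rows (artifacts) where a "Panther" motif was observed.
panther_rows = toy_wide[toy_wide['Motif :: Panther']]

# Number of artifacts per motif: True counts as 1 when summed.
motif_counts = toy_wide[motif_cols].sum()

print(panther_rows['label'].tolist())
print(motif_counts)
```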

In [5]:

csv_wide_path = os.path.join(
    repo_path, 
    'files',
    'oc-api-murlo-objects-multivalue-as-cols.csv'
)
df_wide.to_csv(csv_wide_path, index=False)
print('Saved this wide example as a CSV table at: {}'.format(csv_wide_path))

NameError: name 'drop_cols' is not defined