## Get Chinese Art Provenance from Cleveland Museum of Art JSON

### get the JSON file

In [None]:
import pandas as pd

!wget -O data.json https://github.com/ClevelandMuseumArt/openaccess/raw/master/data.json
df = pd.read_json('/content/data.json')

--2024-04-27 21:41:29--  https://github.com/ClevelandMuseumArt/openaccess/raw/master/data.json
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/ClevelandMuseumArt/openaccess/master/data.json [following]
--2024-04-27 21:41:29--  https://media.githubusercontent.com/media/ClevelandMuseumArt/openaccess/master/data.json
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 289478751 (276M) [application/octet-stream]
Saving to: ‘data.json’


2024-04-27 21:41:36 (91.0 MB/s) - ‘data.json’ saved [289478751/289478751]



In [None]:
for col in df.columns:
  print(col)

id
accession_number
share_license_status
tombstone
current_location
title
title_in_original_language
series
series_in_original_language
creation_date
creation_date_earliest
creation_date_latest
artists_tags
culture
technique
support_materials
department
collection
type
measurements
dimensions
state_of_the_work
edition_of_the_work
copyright
inscriptions
exhibitions
provenance
find_spot
related_works
former_accession_numbers
did_you_know
description
external_resources
citations
catalogue_raisonne
url
images
alternate_images
creditline
sketchfab_id
sketchfab_url
gallery_donor_text
creators
updated_at


In [None]:
df['department'].unique()

array(['Modern European Painting and Sculpture', 'Drawings',
       'Decorative Art and Design', 'Prints', 'Chinese Art',
       'Japanese Art', 'African Art', 'Photography',
       'European Painting and Sculpture', 'Korean Art',
       'Contemporary Art', 'Textiles', 'American Painting and Sculpture',
       'Indian and Southeast Asian Art',
       'Egyptian and Ancient Near Eastern Art', 'Greek and Roman Art',
       'Medieval Art', 'Islamic Art', 'Oceania', 'Art of the Americas'],
      dtype=object)

### subset the Chinese Art

In [None]:
cleveland_chinese_art_df = df[df['department'] == 'Chinese Art']
len(cleveland_chinese_art_df)

2474

### get provenance

In [None]:
# for index, provenance in enumerate(cleveland_chinese_art_df['provenance']):
#    print(f"Entry {index + 1}: {provenance}")

In [None]:
provenance_list = cleveland_chinese_art_df['provenance'].tolist()

In [8]:
print(provenance_list[:10])

[[{'description': '(Marchant, London, UK, sold to Mr. and Mrs. Joseph P. Keithley)', 'citations': [], 'footnotes': [], 'date': '?–2011'}, {'description': 'Nancy F. and Joseph P. Keithley, Cleveland, OH, given to the Cleveland Museum of Art', 'citations': [], 'footnotes': [], 'date': '2011–2020'}, {'description': 'The Cleveland Museum of Art, Cleveland, OH', 'citations': [], 'footnotes': [], 'date': '2020–'}], [{'description': '(Marchant, London, UK, sold to Mr. and Mrs. Joseph P. Keithley)', 'citations': [], 'footnotes': None, 'date': '?by 1998–2011'}, {'description': 'Nancy F. and Joseph P. Keithley, Cleveland, OH, given to the Cleveland Museum of Art', 'citations': [], 'footnotes': None, 'date': '2011–2020'}, {'description': 'The Cleveland Museum of Art, Cleveland, OH', 'citations': [], 'footnotes': None, 'date': '2020–'}], [{'description': '(K.Y. Fine Art, Hong Kong, sold to Mr. and Mrs. Joseph P. Keithley)', 'citations': [], 'footnotes': None, 'date': '?–2010'}, {'description': 'Na
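
The output above shows that each object's provenance is a list of dicts with the keys `description`, `citations`, `footnotes`, and `date`. As a minimal sketch of how those fields can be read, the block below takes the first object's three records (copied from the printed output) and pulls out the owner name together with the date range. The helper name `owner_and_date` is hypothetical and only for illustration.

```python
# Records copied from the first object in the printed output above.
sample_provenance = [
    {'description': '(Marchant, London, UK, sold to Mr. and Mrs. Joseph P. Keithley)',
     'citations': [], 'footnotes': [], 'date': '?–2011'},
    {'description': 'Nancy F. and Joseph P. Keithley, Cleveland, OH, given to the Cleveland Museum of Art',
     'citations': [], 'footnotes': [], 'date': '2011–2020'},
    {'description': 'The Cleveland Museum of Art, Cleveland, OH',
     'citations': [], 'footnotes': [], 'date': '2020–'},
]

def owner_and_date(entry):
    """Hypothetical helper: (text before the first comma, date string)."""
    name = entry['description'].split(',')[0].strip('()')
    return name, entry['date']

owners = [owner_and_date(entry) for entry in sample_provenance]
for name, date in owners:
    print(f"{date:>10}  {name}")
```

Taking the text before the first comma is a simplification: it assumes the owner's name itself contains no comma.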

### Expand descriptions into separate columns

In [10]:

def extract_description_before_comma(entry):
    # keep only the text before the first comma (the owner's name)
    # and drop any surrounding parentheses
    description = entry['description'].split(',')[0]
    return description.strip('()')

def process_row(provenance_entries):
    # collect the owner name from every provenance entry of one object
    all_descriptions = []
    for entry in provenance_entries:
        if 'description' in entry:
            description = extract_description_before_comma(entry)
            all_descriptions.append(description)
    return all_descriptions

# work on an independent copy so the column assignments below do not
# trigger pandas' SettingWithCopyWarning
cleveland_chinese_art_df = cleveland_chinese_art_df.copy()
cleveland_chinese_art_df['all_descriptions'] = cleveland_chinese_art_df['provenance'].apply(process_row)


max_descriptions = cleveland_chinese_art_df['all_descriptions'].str.len().max()  # Find the max number of descriptions in any row
for i in range(max_descriptions):
    cleveland_chinese_art_df[f'description_{i+1}'] = cleveland_chinese_art_df['all_descriptions'].apply(
        lambda x: x[i] if i < len(x) else None
    )

In [11]:
# keep all description columns
columns_to_keep = [col for col in cleveland_chinese_art_df.columns if col.startswith('description_')]
final_df = cleveland_chinese_art_df[columns_to_keep]

### get pairs

In [12]:


# create a list to hold all (source, target) pairs
links = []

# loop over each column and create pairs of (source, target)
for i in range(1, len(final_df.columns)):  # pair description_i with description_{i+1}, up to the last column
    current_col = f'description_{i}'
    next_col = f'description_{i+1}'
    # Check if next column exists in the DataFrame
    if next_col in final_df.columns:
        for index, row in final_df.iterrows():
            if pd.notna(row[current_col]) and pd.notna(row[next_col]):
                links.append((row[current_col], row[next_col]))


links_df = pd.DataFrame(links, columns=['source', 'target'])
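
The loop above pairs each owner with the next owner of the same object. As an alternative sketch, the same pairs can be built straight from the per-object lists with `zip`, without first spreading them over numbered columns. The `chains` data here is a small hypothetical stand-in for the `all_descriptions` column, and `consecutive_pairs` is an illustrative name, not part of the notebook.

```python
# Hypothetical stand-in for 'all_descriptions': one owner chain per object.
chains = [
    ['Marchant', 'Nancy F. and Joseph P. Keithley', 'The Cleveland Museum of Art'],
    ['The Cleveland Museum of Art'],  # a single owner yields no pair
    [],                               # no provenance recorded
]

def consecutive_pairs(chain):
    """Pair each owner with the one that follows it in the chain."""
    return list(zip(chain, chain[1:]))

pairs = [pair for chain in chains for pair in consecutive_pairs(chain)]
print(pairs)
```

The resulting list of tuples has the same shape as `links`, so it could be passed to `pd.DataFrame(..., columns=['source', 'target'])` in the same way. The pairs come out grouped by object rather than by column position, which does not affect the later counts.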


In [13]:
links_df.to_csv('link_df.csv', index=False, encoding='utf-8-sig')

### correct typos and variant forms of names manually

In [14]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [15]:
# reload the manually corrected copy of link_df.csv from Google Drive
links_df = pd.read_csv('/content/drive/MyDrive/ist671-XimengDeng/link_df.csv')
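
The corrections in this notebook were made by hand in the CSV file. As a rough sketch of how part of that cleanup could be scripted, the block below collapses stray whitespace and maps known variants to one canonical spelling. The entries in `NAME_MAP` and the function `normalize_name` are hypothetical; a real mapping would have to be compiled by reviewing the actual names.

```python
# Hypothetical mapping from variant spellings to one canonical form.
NAME_MAP = {
    'Mr. and Mrs. Joseph P. Keithley': 'Nancy F. and Joseph P. Keithley',
    'Cleveland Museum of Art': 'The Cleveland Museum of Art',
}

def normalize_name(name):
    """Collapse whitespace, then replace a known variant with its canonical form."""
    cleaned = ' '.join(name.split())
    return NAME_MAP.get(cleaned, cleaned)

examples = ['Cleveland  Museum of Art ', 'Marchant', 'Mr. and Mrs. Joseph P. Keithley']
normalized = [normalize_name(name) for name in examples]
print(normalized)
```

A function like this could be applied to both columns with `links_df['source'].map(normalize_name)` and `links_df['target'].map(normalize_name)`. Names missing from the mapping pass through unchanged, so manual review would still be needed for new variants.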

In [16]:
import plotly.graph_objects as go



# count the frequency of each pair
link_counts = links_df.groupby(['source', 'target']).size().reset_index(name='value')

# filter to keep only links with more than 12 connections
# (.copy() avoids a SettingWithCopyWarning when the columns are remapped below)
filtered_link_counts = link_counts[link_counts['value'] > 12].copy()

# create lists of unique sources and targets from the filtered data
all_nodes = pd.concat([filtered_link_counts['source'], filtered_link_counts['target']]).unique()
node_dict = {node: idx for idx, node in enumerate(all_nodes)}

# map the source and target to their respective indices
filtered_link_counts['source'] = filtered_link_counts['source'].map(node_dict)
filtered_link_counts['target'] = filtered_link_counts['target'].map(node_dict)

# create the Sankey diagram
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=1),
        label=list(all_nodes),
    ),
    link=dict(
        source=filtered_link_counts['source'],
        target=filtered_link_counts['target'],
        value=filtered_link_counts['value']
    ))])

fig.update_layout(title_text="Sankey Diagram of Descriptions with More Than 12 Links", font_size=10)
fig.show()


In [17]:
fig.write_html("sankey_diagram.html")


### count provenance entries per object

In [18]:
non_na_counts = final_df.notna().sum(axis=1)
non_na_counts.value_counts().sort_index()

0     714
1     251
2     920
3     296
4      98
5      72
6      37
7      23
8       2
9      59
10      1
11      1
Name: count, dtype: int64