# Data Analysis: Met API Data

This research project began out of interest in open-source art history data and the value it can provide to society. It became an investigation into the tensions between colonialism and community education and the importance of museum decolonization.

### Areas of interest for this project include: 
1. Provenance, exhibition, and legality of ancient and religious artifacts
2. Colonialist implications both of legitimately acquired and of contested artifacts. What does their presence, within deliberate curatorial narratives aimed at international audiences, suggest when placed in conversation with postcolonial theory?

## Addressing Incomplete Data

### Geographical Data
The Met's classification system features naming conventions that reflect the era during which the artifacts were acquired, but many geographical locations referenced in the metadata either no longer exist or have been renamed. This complicates data analysis. 

I generated a standard naming convention based on presently existing countries and nationalities, and added this data to a new series of columns that use the suffix '_cc' for 'contemporary classification'. For example, if an artifact contained the word "Swaziland" in its original 'country' column, I added 'Eswatini' to the 'country_cc' column, since this is the name that the Kingdom of Swaziland assumed in 2018, 50 years after it gained independence from British rule, which had lasted from 1903 to 1968.
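The renaming step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual cleaning code: the `HISTORICAL_TO_CONTEMPORARY` dictionary and the `add_contemporary_country` helper are hypothetical names, and only the Swaziland entry comes from the example above.

```python
import pandas as pd

# Hypothetical lookup table; only the Swaziland entry is taken from the text.
HISTORICAL_TO_CONTEMPORARY = {
    "Swaziland": "Eswatini",
}

def add_contemporary_country(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 'country_cc' column added next to 'country'."""
    out = df.copy()
    # Start from the original value so unmapped countries are kept as they are.
    out["country_cc"] = out["country"]
    for old_name, new_name in HISTORICAL_TO_CONTEMPORARY.items():
        # Plain substring match; missing values are treated as non-matches.
        mask = out["country"].str.contains(old_name, na=False, regex=False)
        out.loc[mask, "country_cc"] = new_name
    return out

# Made-up sample rows for illustration only.
sample = pd.DataFrame({"country": ["Swaziland", "Kingdom of Swaziland", "Peru", None]})
result = add_contemporary_country(sample)
print(result)
```

Keeping the original 'country' column untouched and writing to a separate '_cc' column means the historical name is never lost, which matters for the provenance questions this project raises.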

This is helpful for mapping the origins of the Met's collections, but it causes issues when analyzing artifacts from regions that today contain multiple countries, like the former Yugoslavia. If a city or province is not included in the geographic data, it is challenging to classify the artifact in a way that offers both contemporary context and cultural respect.

The API's geographic data raises larger concerns as well. The Met provides sample analytics of its API data on Google BigQuery, but these queries focus on gender, medium, and year, which are attributes with more complete information, while geographic data is often lacking. Geographic and cultural data also contain occasional typographical errors and irregularities, which I addressed using regular expressions during data cleaning. However, like applying contemporary names to historical regions, fixing misspellings or standardizing content by nationality is difficult for a collection in the millions and tends to erase nuances of region and culture.
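The kind of regular-expression cleanup mentioned above might look like the sketch below. The helper names (`normalize_place`, `fix_spelling`) and the misspelling patterns in `SPELLING_FIXES` are hypothetical and chosen only for illustration; they are not the patterns used on the Met data.

```python
import re
import pandas as pd

def normalize_place(value):
    """Collapse whitespace and strip stray punctuation from one place name."""
    if pd.isna(value):
        return value
    text = re.sub(r"\s+", " ", str(value)).strip()
    # Remove leading or trailing punctuation such as '?', ',', ';' or '|'.
    return re.sub(r"^[\s,;|?]+|[\s,;|?]+$", "", text)

# Hypothetical misspellings, shown only to illustrate the approach.
SPELLING_FIXES = {
    r"\bFrnace\b": "France",
    r"\bGermnay\b": "Germany",
}

def fix_spelling(value):
    """Apply each known misspelling pattern to one place name."""
    if pd.isna(value):
        return value
    for pattern, replacement in SPELLING_FIXES.items():
        value = re.sub(pattern, replacement, value, flags=re.IGNORECASE)
    return value

# Made-up sample values for illustration only.
sample = pd.Series(["  Frnace ", "Germnay?", "Peru", None])
cleaned = sample.map(normalize_place).map(fix_spelling)
print(cleaned.tolist())
```

An explicit table of fixes keeps every correction visible and reviewable, which is one way to limit the risk of silently flattening regional or cultural distinctions.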

Given the amount of controversy the Met has faced over stolen artifacts, and the obscure provenance for many items acquired during colonial periods, this discrepancy in API data is concerning.

### Media & Materials Data

While data regarding media and classification is more comprehensive and less error-prone than the geographic data, the 'classification' column has more missing values than the 'objectName' and 'medium' columns. It does, however, use simpler, more consistent terms, like 'prints' and 'books', whereas the other two columns contain descriptive but highly varied values. As a result, no single column supports a straightforward analysis of the art media used throughout the collection. Therefore, the values from all three columns will be cross-referenced.
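One possible way to cross-reference the three columns is to prefer the consistent 'classification' term and fall back to the other two only when it is missing. The sketch below assumes the column names `classification_cc`, `medium_cc`, and `objectName_cc` from the cleaned dataset; the `coalesce_media` helper, the `media_label` column, and the fallback order are assumptions for illustration, not the method used in this analysis.

```python
import pandas as pd

def coalesce_media(df: pd.DataFrame) -> pd.Series:
    """Take classification where present, else medium, else object name."""
    return (
        df["classification_cc"]
        .combine_first(df["medium_cc"])
        .combine_first(df["objectName_cc"])
    )

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "classification_cc": ["prints", None, None],
    "medium_cc": ["etching", "oil on canvas", None],
    "objectName_cc": ["Print", "Painting", "Vase"],
})
sample["media_label"] = coalesce_media(sample)
print(sample)
```

The fallback values are less standardized than the classification terms, so a label built this way would likely still need grouping before it could be counted or plotted.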

It is important to note that the information that appears on the Met's website is generally error-free, if not always comprehensive, suggesting the free API and the one used for the backend of the Met's website may not be the same.

#### Additional Questions from Data Exploration and Cleaning:

1. What relationships exist between geography and medium?
2. Do the missing pieces of data also tell a story? Which items are more or less likely to be missing either geographic data, medium data, or both? Which types of items (print, pottery, etc.) are most likely to have or not have accurate geographic data and clear provenance?
3. Why do materials and classification columns have fewer typographical errors than the geographic columns? Were the data entered by different people? If not, what accounts for this discrepancy?
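The second question, on which kinds of items are most likely to lack geographic data, could be approached with a grouped missing-value rate. This is a minimal sketch: `missing_share_by_group` is a hypothetical helper, and the sample rows are invented, though the column names match the cleaned dataset.

```python
import pandas as pd

def missing_share_by_group(df: pd.DataFrame, group_col: str, target_col: str) -> pd.Series:
    """Fraction of rows per group where target_col is missing, highest first."""
    # Rows whose group value is itself missing are dropped by groupby's default.
    return (
        df[target_col].isna()
        .groupby(df[group_col])
        .mean()
        .sort_values(ascending=False)
    )

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "classification_cc": ["prints", "prints", "ceramics", "ceramics"],
    "country_new_single": ["France", None, None, None],
})
shares = missing_share_by_group(sample, "classification_cc", "country_new_single")
print(shares)
```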

Below, I attempt to address the major questions and research objectives through data visualization and synthesis of the data cleaned in Step 3.

In [1]:
import os
import pandas as pd
import re
from itertools import chain
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns

In [15]:
# Load the cleaned Met data into a DataFrame
met_df = pd.read_csv("../../data/met/csv/met_data_cleaned.csv")


In [7]:
met_df.columns

Index(['objectID', 'department', 'country_new', 'country_new_single',
       'nationality_new', 'nationality_new_single', 'region_new',
       'region_new_single', 'culture_new', 'culture_new_single', 'continents',
       'continents_single', 'objectName_cc', 'medium_cc', 'classification_cc',
       'objectEndDate_cc', 'accessionYear_cc', 'artistGender', 'creditLine',
       'tags'],
      dtype='object')

In [19]:
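# Cast region to string; note that astype(str) turns missing values into the literal string "nan"
met_df["region_new_single"] = met_df["region_new_single"].astype(str)
# Try to cast accession year to integer; errors="ignore" leaves the column unchanged if the cast fails
met_df["accessionYear_cc"] = met_df["accessionYear_cc"].astype("int64", errors="ignore")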

In [17]:
met_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482468 entries, 0 to 482467
Data columns (total 20 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   objectID                482468 non-null  int64  
 1   department              482468 non-null  object 
 2   country_new             482468 non-null  object 
 3   country_new_single      411602 non-null  object 
 4   nationality_new         482468 non-null  object 
 5   nationality_new_single  404156 non-null  object 
 6   region_new              482468 non-null  object 
 7   region_new_single       482468 non-null  object 
 8   culture_new             482468 non-null  object 
 9   culture_new_single      150951 non-null  object 
 10  continents              482468 non-null  object 
 11  continents_single       411602 non-null  object 
 12  objectName_cc           480127 non-null  object 
 13  medium_cc               475221 non-null  object 
 14  classification_cc   