<a href="https://colab.research.google.com/github/emilyrlong/OpenRefine4Collections/blob/main/VA_Collections_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# V&A Collections Data

This short Python notebook pulls data from the [V&A Collections API](https://developers.vam.ac.uk/guide/v2/welcome.html) for use in a tutorial on cleaning data in OpenRefine.

In [1]:
# Import required packages
import requests
import pandas as pd
# import json
from google.colab import files

## Choose Your Search

We will search for objects with a material/technique of 'ceramic'. Run only one of the two cells below, depending on whether you want all ceramic objects or just the objects on display at V&A South Kensington. Both cells assign `base_string`, so whichever cell runs last decides the search.

In [13]:
# Base URL for a V&A API search of all ceramic objects (a page number is appended later)
base_string = "https://api.vam.ac.uk/v2/objects/search?page_size=100&random=true&q_material_technique=ceramic&response_format=csv&page="

In [49]:
# Base URL for a V&A API search of ceramic objects on display at South Kensington (a page number is appended later)
base_string = "https://api.vam.ac.uk/v2/objects/search?on_display_at=southken&page_size=100&random=true&q_material_technique=ceramic&response_format=csv&page="
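
The two search strings differ only in the `on_display_at` parameter. As a minimal sketch, the same kind of query string can be assembled from a dictionary with the standard library's `urlencode`. The parameter names and values are taken from the cells above; `SEARCH_URL` and `build_base_string` are hypothetical names introduced only for this illustration.

```python
from urllib.parse import urlencode

# Endpoint taken from the cells above.
SEARCH_URL = "https://api.vam.ac.uk/v2/objects/search"


def build_base_string(on_display_at=None):
    """Return a search URL ending in 'page=' so a page number can be appended.

    build_base_string is a hypothetical helper, not part of the V&A API.
    """
    params = {}
    if on_display_at is not None:
        params["on_display_at"] = on_display_at
    params.update(
        {
            "page_size": 100,
            "random": "true",
            "q_material_technique": "ceramic",
            "response_format": "csv",
        }
    )
    # urlencode percent-escapes each value and joins the pairs with '&'.
    return SEARCH_URL + "?" + urlencode(params) + "&page="


all_ceramics = build_base_string()
south_ken = build_base_string(on_display_at="southken")
print(all_ceramics)
print(south_ken)
```

Keeping the parameters in one dictionary makes it harder for a stray quote or a missing `&` to end up in the URL.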

## Get Pages

In the CSV format, we can only pull 100 records per page, so we will fetch multiple pages by adding a page number to the end of the base string. You'll need approximately 5 pages for the ceramics data.

In [3]:
# Update p to be the number of pages you need
p = 5
page_list = range(1, p + 1)
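
The page count follows from the 100-record page size: divide the number of records you want by 100 and round up. A small sketch of that arithmetic, where `pages_needed` is a hypothetical helper and 500 is simply the sample size that five full pages give:

```python
import math

PAGE_SIZE = 100  # records per page, as set by page_size in the search URL


def pages_needed(record_count, page_size=PAGE_SIZE):
    """Return how many pages are required to cover record_count records."""
    return math.ceil(record_count / page_size)


# 500 records corresponds to five full pages of 100.
p = pages_needed(500)
page_list = range(1, p + 1)
print(p, list(page_list))
```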

## Get Data

Iterate over the pages and compile the data into one Pandas dataframe.

In [14]:
# Optional: clear a va_obj dataframe left over from a previous run.
# The check avoids a NameError the first time the notebook is run.
if 'va_obj' in globals():
  del va_obj

In [15]:
# Note: This code may take a few minutes if you are pulling many pages

# Iterate over the pages
for i in page_list:
  # Load a CSV of search results for this page
  object_df = pd.read_csv(base_string + str(i))
  # The first page starts the combined dataframe
  if i == 1:
    va_obj = object_df
  # Later pages are appended to the combined dataframe
  else:
    va_obj = pd.concat([va_obj, object_df])
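
An alternative pattern, sketched here rather than taken from the notebook, is to collect every page in a list and call `pd.concat` once. The example runs offline: `fake_pages` and `fetch_page` are hypothetical stand-ins for the API responses and for `pd.read_csv(base_string + str(i))`.

```python
import io

import pandas as pd

# Two tiny CSV "pages" standing in for API responses (hypothetical data).
fake_pages = {
    1: "systemNumber,objectType\nO1,Mug\nO2,Bowl\n",
    2: "systemNumber,objectType\nO3,Jar\nO2,Bowl\n",
}


def fetch_page(page_number):
    """Hypothetical stand-in for pd.read_csv(base_string + str(page_number))."""
    return pd.read_csv(io.StringIO(fake_pages[page_number]))


# Collect every page first, then concatenate once at the end.
frames = [fetch_page(i) for i in range(1, 3)]
combined = pd.concat(frames)

print(combined.shape)
print(list(combined.index))
```

Each page keeps its own row labels, so the combined index restarts at 0 for every page. Passing `ignore_index=True` to `pd.concat` would give one continuous index instead.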

In [11]:
va_obj.head()

Unnamed: 0,accessionNumber,accessionYear,systemNumber,objectType,_primaryTitle,_primaryPlace,_primaryMaker__name,_primaryMaker__association,_primaryDate,_primaryImageId,_sampleMaterial,_sampleTechnique,_sampleStyle,_currentLocation__displayName,_objectContentWarning,_imageContentWarning
0,S.1076-1996,1996.0,O1138401,Mug,Commemorative mug,Cornwall,"Leaper, Newlyn",makers,1962,2017KM6460,ceramic,ceramic,,In store,False,False
1,S.22-2007,2007.0,O134196,Mug,,Staffordshire,Unknown,,mid 19th century,2017KL4867,ceramic,ceramic,,In store,False,False
2,C.59-2009,2009.0,O1140219,Bowl,,United Kingdom,Unknown,,20th century,2009CP8748,ceramic,,,"Ceramics, Room 143, The Timothy Sainsbury Gallery",False,False
3,C.57-2009,2009.0,O1140217,Jar,,London,"Rie, Lucie",maker,20th century,2009CP8748,ceramic,,,"Ceramics, Room 143, The Timothy Sainsbury Gallery",False,False
4,T.250KK-1982,1982.0,O276024,Button,,Great Britain,Lucie Rie,designer and maker,1945-1948,2019MC0089,ceramic,forming,,In store,False,False


In [16]:
va_obj.shape

(500, 16)

## Investigate Materials & Locations Data

In [17]:
# Get counts by material
material_types = va_obj.groupby('_sampleMaterial')['_sampleMaterial'].count()
material_types = material_types.to_frame()
material_types

_sampleMaterial,count
Ceramic,14
Clay,1
Earthenware,2
Metal,1
Wood,2
ceramic,376
ceramic glaze,1
ceramic tile,9
clay,6
earthenware,2
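
The counts list the same material under different capitalisation ('Ceramic' and 'ceramic', 'Clay' and 'clay'), which is the kind of inconsistency a cleaning step has to resolve. As a rough illustration only, the sketch below re-enters the counts shown above and merges labels that differ only by case. It is not a step in this notebook's workflow.

```python
import pandas as pd

# Counts copied from the output above.
material_counts = pd.Series(
    {
        "Ceramic": 14,
        "Clay": 1,
        "Earthenware": 2,
        "Metal": 1,
        "Wood": 2,
        "ceramic": 376,
        "ceramic glaze": 1,
        "ceramic tile": 9,
        "clay": 6,
        "earthenware": 2,
    }
)

# Group the counts by the lower-cased label and add up each group.
merged = material_counts.groupby(material_counts.index.str.lower()).sum()
print(merged)
```

Lower-casing merges 'Ceramic' with 'ceramic' but leaves 'ceramic glaze' and 'ceramic tile' as separate values, so case folding alone does not settle which labels belong together.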


In [18]:
# Get the counts by room - on display or in storage
room_counts = va_obj.groupby('_currentLocation__displayName')['_currentLocation__displayName'].count()
room_counts

_currentLocation__displayName
British Galleries, Room 126                                                         1
Cast Courts, The Ruddock Family Cast Court, Room 46A                                2
Ceramics, Room 137, The Curtain Foundation Gallery                                 16
Ceramics, Room 139, The Curtain Foundation Gallery                                  1
Ceramics, Room 140, Factory Ceramics                                                2
Ceramics, Room 143, The Timothy Sainsbury Gallery                                  18
Ceramics, Room 145                                                                 14
China, Room 44, The T.T. Tsui Gallery                                               1
Design 1900 to Now, Room 74                                                         1
Design 1900 to Now, Room 76                                                         2
Hallyu! The Korean Wave                                                             1
In Store                

## Check for Duplicates

In [23]:
# Use the duplicated function to indicate 'True' if a row is a duplicate of another row in the dataset
duplicates = va_obj.duplicated(keep='first')
duplicates

0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Length: 500, dtype: bool

In [33]:
# Drop the duplicates
va_obj = va_obj[~duplicates]

In [34]:
va_obj.shape

(434, 16)
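
The shape drops from 500 to 434 rows, so 66 rows were exact repeats of earlier rows. A toy sketch of the same `duplicated` and filter pattern, using hypothetical rows rather than V&A data:

```python
import pandas as pd

# Hypothetical rows; the second and fourth are exact copies of each other.
toy = pd.DataFrame(
    {
        "systemNumber": ["O1", "O2", "O3", "O2"],
        "objectType": ["Mug", "Bowl", "Jar", "Bowl"],
    }
)

# keep='first' flags only the later copies, so the first occurrence survives.
is_duplicate = toy.duplicated(keep="first")
print(is_duplicate.tolist())
print(int(is_duplicate.sum()))

# Keep only the rows that were not flagged.
deduplicated = toy[~is_duplicate]
print(deduplicated.shape)
```

`toy.drop_duplicates()` gives the same result in one call. Summing the boolean flags is a quick way to see how many rows will be removed before dropping them.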

## Download the Data

In [35]:
# Save the ceramic data as a CSV file and download it
va_obj.to_csv('VA_CeramicObjects.csv', index=False)
files.download('VA_CeramicObjects.csv')
