## Parse ArchivesSpace Resources via the API

### Import packages
- configparser: Reads INI-style configuration files, so that settings such as credentials can be customized by end users without editing the code.
- json: Encodes and decodes JSON (JavaScript Object Notation).
- requests: An HTTP library.
- pandas: An open source data analysis and manipulation tool, built on top of the Python programming language.

In [6]:
import configparser
import json
import requests
import pandas as pd 

### Read Configuration File

To authenticate to ArchivesSpace and use the API, you need to supply a separate "config.ini" file (ignored by git) in the home directory. The code below reads it from the current working directory, and it should look like this:

```
[ARCHIVESSPACE]
BaseURL = 
User = 
Password = 
RepositoryID = 
```
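
The key names in "config.ini" must match what the code looks up, so it can help to check for missing keys before making any requests. The following is a minimal sketch, not part of the original notebook: `REQUIRED_KEYS` and `find_missing_keys` are hypothetical names, and the values in the in-memory example are placeholders.

```python
import configparser

# Keys the notebook reads from the [ARCHIVESSPACE] section.
REQUIRED_KEYS = ('BaseURL', 'User', 'Password', 'RepositoryID')


def find_missing_keys(config, section='ARCHIVESSPACE'):
    """Return the required keys that are absent from the given section.

    Hypothetical helper; configparser matches option names
    case-insensitively, but spelling and spacing must still agree.
    """
    if not config.has_section(section):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS
            if not config.has_option(section, key)]


# Example with an in-memory config (placeholder values) instead of a file.
example = configparser.ConfigParser()
example.read_string("""
[ARCHIVESSPACE]
BaseURL = http://localhost:8089
User = admin
Password = admin
""")
print(find_missing_keys(example))  # the RepositoryID key is not set here
```

With a real file, the same check could run right after `config.read('config.ini')`, stopping early if the returned list is not empty.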

In [7]:
print('Reading Configuration File')
config = configparser.ConfigParser()
config.read('config.ini')

base_url = config['ARCHIVESSPACE']['BaseURL']
user = config['ARCHIVESSPACE']['User']
password = config['ARCHIVESSPACE']['Password']
repository_id = config['ARCHIVESSPACE']['RepositoryID']

Reading Configuration File


### Authenticate to ArchivesSpace

In [8]:
print('Authenticating to ArchivesSpace')
endpoint = '/users/' + user + '/login'
params = {'password': password}
response = requests.post(base_url + endpoint, params=params)
print(response.status_code)

response = response.json()
session_key = response['session']

Authenticating to ArchivesSpace
200
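
The cell above prints the status code but then reads `response['session']` regardless, so a failed login would surface as a `KeyError` rather than a clear message. The sketch below shows one way to fail early. `login` is a hypothetical helper, and passing the HTTP function in as the `post` argument is an assumption made here so the logic can be exercised without a live server.

```python
def login(base_url, user, password, post):
    """Return an ArchivesSpace session key, or raise if the login fails.

    Hypothetical helper. `post` is any callable with the same calling
    convention as requests.post(url, params=...).
    """
    endpoint = '/users/' + user + '/login'
    response = post(base_url + endpoint, params={'password': password})
    if response.status_code != 200:
        # Stop here instead of failing later on a missing 'session' key.
        raise RuntimeError('Login failed with status '
                           + str(response.status_code))
    return response.json()['session']


# In the notebook this could be called as:
# session_key = login(base_url, user, password, requests.post)
```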


### Read Resource IDs from Text File

To tell the script which ArchivesSpace Resources to parse, you need to supply a separate "resource_ids.txt" file (ignored by git) in the home directory, with one Resource ID per line for each Resource you want to parse.
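
A blank line in "resource_ids.txt" (for example, a trailing newline at the end of the file) would otherwise be treated as an empty Resource ID. A minimal sketch of reading the file while skipping such lines follows; `read_resource_ids` is a hypothetical helper name, not part of the original notebook.

```python
def read_resource_ids(path):
    """Return the non-empty, whitespace-stripped lines of a text file.

    Hypothetical helper: one Resource ID per line, blank lines ignored.
    """
    with open(path, mode='r') as f:
        return [line.strip() for line in f if line.strip()]


# In the notebook this could be called as:
# resource_ids = read_resource_ids('resource_ids.txt')
```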

### Parse Resources

In [11]:
results = []

with open('resource_ids.txt', mode='r') as f:
    resource_ids = f.readlines()
    
    for resource_id in resource_ids:
        resource_id = resource_id.strip()

        print('  - GETing Resource ' + str(resource_id))
        endpoint = '/repositories/' + str(repository_id) + '/resources/' + str(resource_id)
        headers = {'X-ArchivesSpace-Session': session_key}
        response = requests.get(base_url + endpoint, headers=headers)
        print(response.status_code)

        resource = response.json()

        ## Extract eadid
        eadid = resource['ead_id']

        ## Extract titleproper (drop the first 16 characters, presumably a fixed prefix on every finding aid title)
        titleproper = resource['finding_aid_title'][16:]

        ## Extract abstract
        abstract = ''
        for note in resource['notes']:
            if note.get('type') == 'abstract':
                abstract = note['content'][0]
                
        ## Extract language
        language = resource['finding_aid_language_note'].replace('<language encodinganalog="Language" langcode="eng">English.</language>', 'English.')

        ## Extract scopecontent
        scopecontent = ''
        for note in resource['notes']:
            if note.get('type') == 'scopecontent':
                scopecontent = note['subnotes'][0]['content']
                
        ## Extract bioghist    
        bioghist = ''
        for note in resource['notes']:
            if note.get('type') == 'bioghist':
                bioghist = note['subnotes'][0]['content']

        ## Extract custodhist (not yet implemented)

        ## Extract controlaccess
        subjects = []
        subjects_source = []

        genreforms = []
        genreforms_source = []

        geognames = []
        geognames_source = []

        for subject in resource['subjects']:
            subject_id = subject['ref'].split('/')[-1]
            
            print('  - GETing Subject ' + str(subject_id))
            endpoint = '/subjects/' + str(subject_id)
            response = requests.get(base_url + endpoint, headers=headers)
            print(response.status_code)
            
            subject = response.json()
            
            if subject['terms'][0]['term_type'] == 'topical':
                subjects.append(subject['terms'][0]['term'])
                subjects_source.append(subject.get('source', 'No Source'))
            
            if subject['terms'][0]['term_type'] == 'genre_form':
                genreforms.append(subject['terms'][0]['term'])
                genreforms_source.append(subject.get('source', 'No Source'))
            
            if subject['terms'][0]['term_type'] == 'geographic':
                geognames.append(subject['terms'][0]['term'])
                geognames_source.append(subject.get('source', 'No Source'))

        persnames = []
        persnames_source = []

        corpnames = []
        corpnames_source = []

        famnames = []
        famnames_source = []

        for linked_agent in resource['linked_agents']:
            linked_agent_id = linked_agent['ref'].split('/')[-1]
            
            if 'people' in linked_agent['ref']:
                print('  - GETing Person Agent ' + str(linked_agent_id))
                endpoint = '/agents/people/' + str(linked_agent_id)
                response = requests.get(base_url + endpoint, headers=headers)
                print(response.status_code)
            
                person_agent = response.json()
                persnames.append(person_agent['names'][0]['sort_name'])
                persnames_source.append(person_agent['names'][0].get('source', 'No Source'))
                
            if 'corporate_entities' in linked_agent['ref']:
                print('  - GETing Corporate Entity Agent ' + str(linked_agent_id))
                endpoint = '/agents/corporate_entities/' + str(linked_agent_id)
                response = requests.get(base_url + endpoint, headers=headers)
                print(response.status_code)
            
                corporate_entity_agent = response.json()
                corpnames.append(corporate_entity_agent['names'][0]['sort_name'])
                corpnames_source.append(corporate_entity_agent['names'][0].get('source', 'No Source'))
                
            if 'families' in linked_agent['ref']:
                print('  - GETing Family Agent ' + str(linked_agent_id))
                endpoint = '/agents/families/' + str(linked_agent_id)
                response = requests.get(base_url + endpoint, headers=headers)
                print(response.status_code)
            
                family_agent = response.json()
                famnames.append(family_agent['names'][0]['sort_name'])
                famnames_source.append(family_agent['names'][0].get('source', 'No Source'))
                
        result = [str(resource_id), 
                  eadid, 
                  titleproper, 
                  abstract, 
                  language, 
                  scopecontent, 
                  bioghist, 
                  '; '.join(subjects), 
                  '; '.join(subjects_source), 
                  '; '.join(genreforms), 
                  '; '.join(genreforms_source), 
                  '; '.join(geognames), 
                  '; '.join(geognames_source), 
                  '; '.join(persnames), 
                  '; '.join(persnames_source), 
                  '; '.join(corpnames), 
                  '; '.join(corpnames_source), 
                  '; '.join(famnames), 
                  '; '.join(famnames_source)]
        results.append(result)

# Create the pandas DataFrame 
df = pd.DataFrame(results, columns = ['resource_id',
                                      'eadid', 
                                      'titleproper', 
                                      'abstract', 
                                      'language', 
                                      'scopecontent', 
                                      'bioghist', 
                                      'subjects', 
                                      'subjects_source', 
                                      'genreforms', 
                                      'genreforms_source', 
                                      'geognames', 
                                      'geognames_source', 
                                      'persnames', 
                                      'persnames_source', 
                                      'corpnames', 
                                      'corpnames_source', 
                                      'famnames', 
                                      'famnames_source']) 

  - GETing Resource 2677
200
  - GETing Subject 5742
200
  - GETing Subject 7953
200
  - GETing Subject 7800
200
  - GETing Subject 10643
200
  - GETing Subject 947
200
  - GETing Subject 8086
200
  - GETing Subject 2999
200
  - GETing Subject 1700
200
  - GETing Subject 971
200
  - GETing Subject 8088
200
  - GETing Subject 10644
200
  - GETing Subject 10645
200
  - GETing Subject 6098
200
  - GETing Subject 7081
200
  - GETing Subject 7335
200
  - GETing Subject 8243
200
  - GETing Subject 1639
200
  - GETing Subject 6123
200
  - GETing Subject 6503
200
  - GETing Subject 1638
200
  - GETing Subject 5429
200
  - GETing Subject 3584
200
  - GETing Subject 4263
200
  - GETing Subject 1105
200
  - GETing Subject 3649
200
  - GETing Subject 3944
200
  - GETing Subject 3770
200
  - GETing Subject 4987
200
  - GETing Subject 8197
200
  - GETing Subject 4988
200
  - GETing Subject 5479
200
  - GETing Subject 3579
200
  - GETing Subject 5769
200
  - GETing Subject 5738
200
  - GETing Subject
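
The parsing cell repeats nearly the same loop for the abstract, scopecontent, and bioghist notes, reading `content` for the abstract and `subnotes` for the other two. As a sketch only, the three loops could share one helper. `first_note_text` is a hypothetical name, and unlike the loops above, which keep the last matching note, it returns the first match.

```python
def first_note_text(notes, note_type):
    """Return the text of the first note of the given type, or ''.

    Hypothetical helper covering both shapes used in the notebook:
    notes with a 'content' list (abstract) and notes with a
    'subnotes' list (scopecontent, bioghist).
    """
    for note in notes:
        if note.get('type') != note_type:
            continue
        if note.get('content'):
            return note['content'][0]
        if note.get('subnotes'):
            return note['subnotes'][0].get('content', '')
    return ''


# Made-up sample data shaped like the 'notes' list of a Resource.
sample_notes = [
    {'type': 'abstract', 'content': ['A short summary.']},
    {'type': 'scopecontent', 'subnotes': [{'content': 'Letters and diaries.'}]},
]
print(first_note_text(sample_notes, 'abstract'))
print(first_note_text(sample_notes, 'scopecontent'))
```

In the notebook this could replace each loop with a single line, such as `abstract = first_note_text(resource['notes'], 'abstract')`.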

### Write Results to CSV file

In [10]:
df.to_csv('results.csv', encoding='utf-8', index=False)
