# EAP bulk DOI migration to new website

**Context:** The Endangered Archives Programme (EAP) assigns DOIs to its digitisation projects, but many of these were registered with URLs on the old website, and those URLs now need updating. First we fetch a list of all DOIs in the `bl.eap` account:

In [37]:
import httpx
import pandas as pd
from dotenv import load_dotenv
from os import getenv

load_dotenv()
cl = httpx.Client(
    timeout=httpx.Timeout(5.0, read=60.0),
    auth=(getenv('DATACITE_ID'), getenv('DATACITE_PW')),
)

params = {
    "client-id": "bl.eap",
    "page[size]": 1000,  # we know in advance there are fewer than 1000 DOIs, so one page is exhaustive
}
result = cl.get("https://api.datacite.org/dois", params=params)
result.raise_for_status() # raise an error if any problems
data = result.json()["data"]

print(len(data), "total DOIs")

data = pd.json_normalize(data)
print(data.columns)

377 total DOIs
Index(['id', 'type', 'attributes.doi', 'attributes.identifiers',
       'attributes.creators', 'attributes.titles', 'attributes.publisher',
       'attributes.publicationYear', 'attributes.subjects',
       'attributes.contributors', 'attributes.dates', 'attributes.language',
       'attributes.types.ris', 'attributes.types.bibtex',
       'attributes.types.citeproc', 'attributes.types.schemaOrg',
       'attributes.types.resourceType', 'attributes.types.resourceTypeGeneral',
       'attributes.relatedIdentifiers', 'attributes.sizes',
       'attributes.formats', 'attributes.version', 'attributes.rightsList',
       'attributes.descriptions', 'attributes.geoLocations',
       'attributes.fundingReferences', 'attributes.url',
       'attributes.contentUrl', 'attributes.metadataVersion',
       'attributes.schemaVersion', 'attributes.source', 'attributes.isActive',
       'attributes.state', 'attributes.reason', 'attributes.viewCount',
       'attributes.downloadCount', 'a
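
The single request above is only exhaustive because the account holds fewer than 1000 DOIs. For a larger account, one option is to follow the pagination links in each response. The following is a minimal sketch, assuming each response body carries a `links.next` URL while more pages remain. `fetch_all_dois` and `get_json` are hypothetical names, not part of the notebook; `get_json` is any callable that takes a URL and returns the decoded JSON body.

```python
def fetch_all_dois(get_json, first_url):
    """Collect the "data" records from every page, following links.next.

    get_json is a caller-supplied function (hypothetical): URL -> decoded JSON dict.
    Assumes the API includes links.next only while more pages remain.
    """
    records = []
    url = first_url
    while url:
        body = get_json(url)
        records.extend(body["data"])
        # Stop when the response has no "next" link
        url = body.get("links", {}).get("next")
    return records
```

In the notebook, `get_json` could wrap `cl.get(url)`, call `raise_for_status()` on the response and return `response.json()`.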

The DOIs that need updating can be identified by their URL. A few start with `https://eap.bl.uk/search/site/`:

In [38]:
dois_search_site = data['attributes.url'].str.match(r"https?://eap\.bl\.uk/search/site/")
dois_search_site.sum()

0

But most start with `https://eap.bl.uk/database/overview_project.a4d`:

In [39]:
dois_database = data['attributes.url'].str.match(r"https?://eap\.bl\.uk/database/overview_project\.a4d")
dois_database.sum()

0
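
The two URL tests can also be expressed as a single check, which is convenient for spot-checking individual URLs outside pandas. This is a sketch using only the standard library. `is_legacy_url` is a hypothetical helper, and the dots in the pattern are escaped so that they match only a literal `.`.

```python
import re

# Matches either of the two old-website URL prefixes, over http or https
LEGACY_URL = re.compile(
    r"https?://eap\.bl\.uk/(?:search/site/|database/overview_project\.a4d)"
)

def is_legacy_url(url):
    """Return True if url points at one of the old EAP website paths."""
    # Non-string values (e.g. a missing URL) are treated as not legacy
    return isinstance(url, str) and LEGACY_URL.match(url) is not None
```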

We can combine these two sets to get the final list of DOIs to update. At the same time we'll drop every column except the two we need and give those shorter names:

In [33]:
updates = data[dois_search_site | dois_database][['attributes.doi', 'attributes.url']] \
    .rename({'attributes.doi': 'id', 'attributes.url': 'url'}, axis=1)
updates

| | id | url |
|---|---|---|
| 41 | 10.15130/eap908 | http://eap.bl.uk/database/overview_project.a4d... |
| 42 | 10.15130/eap900 | http://eap.bl.uk/database/overview_project.a4d... |
| 43 | 10.15130/eap890 | http://eap.bl.uk/database/overview_project.a4d... |
| 44 | 10.15130/eap886 | http://eap.bl.uk/database/overview_project.a4d... |
| 45 | 10.15130/eap842 | http://eap.bl.uk/database/overview_project.a4d... |
| ... | ... | ... |
| 333 | 10.15130/eap542 | http://eap.bl.uk/database/overview_project.a4d... |
| 350 | 10.15130/eap1262 | https://eap.bl.uk/search/site/EAP1262 |
| 352 | 10.15130/eap1221 | https://eap.bl.uk/search/site/EAP1221 |
| 354 | 10.15130/eap1234 | https://eap.bl.uk/search/site/EAP1234 |


## What are the changes?

All the new URLs follow a simple pattern: `https://eap.bl.uk/project/<uppercase project ID>`

Since each DOI uses the *lowercase* Project ID as its suffix, we can extract these and then use them to form the new URL. We'll keep both so that EAP can check before we make the changes live.

In [34]:
updates['project'] = updates.id.map(lambda s: s.split('/')[1].upper())
updates['new_url'] = updates.project.map(lambda project_id: f'https://eap.bl.uk/project/{project_id}')
updates

| | id | url | project | new_url |
|---|---|---|---|---|
| 41 | 10.15130/eap908 | http://eap.bl.uk/database/overview_project.a4d... | EAP908 | https://eap.bl.uk/project/EAP908 |
| 42 | 10.15130/eap900 | http://eap.bl.uk/database/overview_project.a4d... | EAP900 | https://eap.bl.uk/project/EAP900 |
| 43 | 10.15130/eap890 | http://eap.bl.uk/database/overview_project.a4d... | EAP890 | https://eap.bl.uk/project/EAP890 |
| 44 | 10.15130/eap886 | http://eap.bl.uk/database/overview_project.a4d... | EAP886 | https://eap.bl.uk/project/EAP886 |
| 45 | 10.15130/eap842 | http://eap.bl.uk/database/overview_project.a4d... | EAP842 | https://eap.bl.uk/project/EAP842 |
| ... | ... | ... | ... | ... |
| 333 | 10.15130/eap542 | http://eap.bl.uk/database/overview_project.a4d... | EAP542 | https://eap.bl.uk/project/EAP542 |
| 350 | 10.15130/eap1262 | https://eap.bl.uk/search/site/EAP1262 | EAP1262 | https://eap.bl.uk/project/EAP1262 |
| 352 | 10.15130/eap1221 | https://eap.bl.uk/search/site/EAP1221 | EAP1221 | https://eap.bl.uk/project/EAP1221 |
| 354 | 10.15130/eap1234 | https://eap.bl.uk/search/site/EAP1234 | EAP1234 | https://eap.bl.uk/project/EAP1234 |
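
Because the new URL is derived from the DOI suffix alone, it may be worth confirming that the project ID embedded in the old URL agrees with it before anything is changed. The sketch below assumes the two old URL shapes seen in this notebook (`...overview_project.a4d?projID=EAP908;r=41` and `.../search/site/EAP1262`). All three function names are hypothetical.

```python
import re

# Project ID as it appears in either of the two old URL shapes
_OLD_URL_ID = re.compile(r"(?:projID=|/search/site/)(EAP\d+)", re.IGNORECASE)

def project_id_from_doi(doi):
    """'10.15130/eap908' -> 'EAP908' (uppercased DOI suffix)."""
    return doi.split("/", 1)[1].upper()

def project_id_from_old_url(url):
    """Extract the project ID from an old-website URL, or None if absent."""
    match = _OLD_URL_ID.search(url)
    return match.group(1).upper() if match else None

def checked_new_url(doi, old_url):
    """Build the new URL, refusing if the old URL names a different project."""
    project_id = project_id_from_doi(doi)
    if project_id_from_old_url(old_url) != project_id:
        raise ValueError(f"{doi}: old URL does not reference {project_id}")
    return f"https://eap.bl.uk/project/{project_id}"
```

Any row that raises here would be one to flag for EAP rather than update automatically.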


Let's save that to a spreadsheet and send it to EAP to check whether the changes look right...

In [35]:
updates.to_excel('eap-changes.xlsx')  # the `encoding` argument was removed in pandas 2.0
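
While the spreadsheet is being checked, the request bodies for the live update can be built and inspected without contacting the API. This is a rough sketch of such a dry run: `build_update_payload` and `plan_updates` are hypothetical helpers, and the payload shape and resource type values simply mirror those used in the update cell of this notebook.

```python
# Resource type values copied from the notebook's update step
RESOURCE_TYPES = {"resourceType": "Web page", "resourceTypeGeneral": "Other"}

def build_update_payload(new_url, types=RESOURCE_TYPES):
    """Request body that sets a DOI's URL and resource type."""
    return {"data": {"attributes": {"url": new_url, "types": dict(types)}}}

def plan_updates(rows):
    """Turn (doi, new_url) pairs into (endpoint, payload) pairs; sends nothing."""
    return [
        (f"https://api.datacite.org/dois/{doi}", build_update_payload(new_url))
        for doi, new_url in rows
    ]
```

For example, `plan_updates(zip(updates.id, updates.new_url))` would list every request the live run is going to make.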

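Once EAP has checked the list, we can make the changes live, updating each DOI's URL and setting its resource type at the same time:

In [36]: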
# resource type metadata to set on every updated DOI
params_types = {
    "resourceType": "Web page",
    "resourceTypeGeneral": "Other"
}

for doi in updates.itertuples():
    print(f"{doi.id}: {doi.url} => {doi.new_url}")
    params = {'data': {'attributes': {'url': doi.new_url, 'types': params_types}}}
    response = cl.put(f'https://api.datacite.org/dois/{doi.id}', json=params)
    response.raise_for_status()
    print('Status:', response.status_code)

10.15130/eap908: http://eap.bl.uk/database/overview_project.a4d?projID=EAP908;r=41 => https://eap.bl.uk/project/EAP908
Status: 200
10.15130/eap900: http://eap.bl.uk/database/overview_project.a4d?projID=EAP900;r=41 => https://eap.bl.uk/project/EAP900
Status: 200
10.15130/eap890: http://eap.bl.uk/database/overview_project.a4d?projID=EAP890;r=41 => https://eap.bl.uk/project/EAP890
Status: 200
10.15130/eap886: http://eap.bl.uk/database/overview_project.a4d?projID=EAP886;r=41 => https://eap.bl.uk/project/EAP886
Status: 200
10.15130/eap842: http://eap.bl.uk/database/overview_project.a4d?projID=EAP842;r=41 => https://eap.bl.uk/project/EAP842
Status: 200
10.15130/eap1094: http://eap.bl.uk/database/overview_project.a4d?projID=EAP1094;r=41 => https://eap.bl.uk/project/EAP1094
Status: 200
10.15130/eap1034: http://eap.bl.uk/database/overview_project.a4d?projID=EAP1034;r=41 => https://eap.bl.uk/project/EAP1034
Status: 200
10.15130/eap1063: http://eap.bl.uk/database/overview_project.a4d?projID=EAP10