# Librarian's quest to exhaustivity and openness

## Extracting AoU data

Project for the EAHIL 2024 conference: https://eahil2024.rsu.lv/

Authors: **Floriane Muller & Pablo Iriarte**, University of Geneva  
Last update: 11.06.2024  

This notebook extracts AoU (Archive ouverte UNIGE) data from the MARC21 records of an OAI-PMH export.

### Sources

* **AoU OAI-PMH import**: https://archive-ouverte.unige.ch/oai?verb=ListRecords&metadataPrefix=marc21&set=full-archive-ouverte

### Results

* DOIs: data/temp/[year]/aou_dois.tsv
* PMIDs: data/temp/[year]/aou_pmids.tsv
* Links: data/temp/[year]/aou_856.tsv
* UNIGE structures: data/temp/[year]/aou_928.tsv
* Publication dates: data/temp/[year]/aou_dates.tsv
* Funders: data/temp/[year]/aou_988f.tsv
* Document types: data/temp/[year]/aou_980a.tsv
* Journal names: data/temp/[year]/aou_773t.tsv

In [1]:
import pandas as pd
import csv
import os
from lxml import etree
from tqdm.auto import tqdm

# parameters

# folder where the results are saved
folder_out = 'data/temp/2024/'

# folder containing the downloaded OAI-PMH XML files
aou_folder = 'data/sources/aou_20240529/'

# display all columns
pd.set_option('display.max_columns', None)
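
The cells below rely on `os.listdir(aou_folder)` and on an existing output folder. As a minimal sketch (the helper names `list_xml_pages` and `ensure_folder` are hypothetical and not part of the notebook), the file list could be restricted to `.xml` files and the output folder created up front:

```python
import os

def list_xml_pages(folder):
    """Return the .xml files found in `folder` as full paths, sorted by name."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith('.xml')
    )

def ensure_folder(folder):
    """Create `folder` (and its parents) if it does not exist yet."""
    os.makedirs(folder, exist_ok=True)
    return folder
```

The sort is alphabetical, so `page-10.xml` comes before `page-2.xml`. That is harmless here because every page is processed independently.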


In [2]:
# TEST
myfilein = aou_folder + 'page-0.xml'

# Parse XML
root = etree.parse(myfilein)

In [3]:
for child in root.iter('*'):
    print(child.tag, child.attrib)

{http://www.openarchives.org/OAI/2.0/}OAI-PMH {'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd http://www.loc.gov/MARC21/slim https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd'}
{http://www.openarchives.org/OAI/2.0/}responseDate {}
{http://www.openarchives.org/OAI/2.0/}request {'verb': 'ListRecords', 'metadataPrefix': 'marc21', 'set': 'full-archive-ouverte'}
{http://www.openarchives.org/OAI/2.0/}ListRecords {}
{http://www.openarchives.org/OAI/2.0/}record {}
{http://www.openarchives.org/OAI/2.0/}header {}
{http://www.openarchives.org/OAI/2.0/}identifier {}
{http://www.openarchives.org/OAI/2.0/}datestamp {}
{http://www.openarchives.org/OAI/2.0/}setSpec {}
{http://www.openarchives.org/OAI/2.0/}setSpec {}
{http://www.openarchives.org/OAI/2.0/}setSpec {}
{http://www.openarchives.org/OAI/2.0/}metadata {}
{http://www.loc.gov/MARC21/slim}collection {'{http://www.w3.org/2001/XMLSchema-instance}schemaLocation': 'http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd http://www.loc.gov/MARC21/slim https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd'}
{http://www.loc.gov/MARC21/slim}record {}
{http://www.loc.gov/MARC21/slim}leader {}
{http://www.loc.gov/MARC21/slim}controlfield {'tag': '001'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '022', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '024', 'ind1': '7', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': '2'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '041', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '082', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/

{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '653', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '700', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '856', 'ind1': '4', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': '3'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'u'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '856', 'ind1': '4', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': '3'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'f'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'u'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'x'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'y'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '856', 'ind1': '4', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}sub

{http://www.loc.gov/MARC21/slim}datafield {'tag': '920', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '928', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '930', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '980', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.openarchives.org/OAI/2.0/}record {}
{http://www.openarchives.org/OAI/2.0/}header {}
{http://www.openarchives.org/OAI/2.0/}identifier {}
{http://www.openarchives.org/OAI/2.0/}datestamp {}
{http://www.openarchives.org/OAI/2.0/}setSpec {}
{http://www.openarchives.org/OAI/2.0/}setSpec 


{http://www.loc.gov/MARC21/slim}datafield {'tag': '653', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '653', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '653', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '653', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '700', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}subfield {'code': '0'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '856', 'ind1': '4', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': '3'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'u'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '856', 'ind1

{http://www.loc.gov/MARC21/slim}subfield {'code': 'c'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '502', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': '8'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '506', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '508', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'e'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '508', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'e'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '520', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': '9'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}dat

{http://www.loc.gov/MARC21/slim}datafield {'tag': '506', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '508', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'e'}
{http://www.loc.gov/MARC21/slim}subfield {'code': '0'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '520', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': '9'}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '653', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '653', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http://www.loc.gov/MARC21/slim}datafield {'tag': '653', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
{http:/

{http://www.loc.gov/MARC21/slim}datafield {'tag': '506', 'ind1': ' ', 'ind2': ' '}
{http://www.loc.gov/MARC21/slim}subfield {'code': 'a'}
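
Printing every element of a page produces a lot of output. A lighter way to explore the structure is to count how often each MARC tag occurs. The sketch below is an illustration only: `count_marc_tags` is a hypothetical helper, the sample is hand-made rather than real AoU data, and the fallback to `xml.etree.ElementTree` is there only so that the example also runs without lxml.

```python
from collections import Counter

try:
    from lxml import etree  # parser used in this notebook
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

MARC = '{http://www.loc.gov/MARC21/slim}'

def count_marc_tags(root):
    """Count how often each MARC datafield tag occurs below `root`."""
    return Counter(field.get('tag') for field in root.iter(MARC + 'datafield'))

# Tiny hand-made sample (not real AoU data)
sample = b"""<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="024" ind1="7" ind2=" "><subfield code="a">x</subfield></datafield>
    <datafield tag="653" ind1=" " ind2=" "><subfield code="a">y</subfield></datafield>
    <datafield tag="653" ind1=" " ind2=" "><subfield code="a">z</subfield></datafield>
  </record>
</collection>"""

counts = count_marc_tags(etree.fromstring(sample))
for tag, n in sorted(counts.items()):
    print(tag, n)
```

The same function accepts the `root` returned by `etree.parse(myfilein)`, since both element trees and elements expose `iter()`.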



In [4]:
# select the record
records = root.findall("./{http://www.openarchives.org/OAI/2.0/}ListRecords/{http://www.openarchives.org/OAI/2.0/}record")
i = 0
for record in records:
    metadata = record.find("./{http://www.openarchives.org/OAI/2.0/}metadata")
    if (metadata is not None) :
        i = i + 1
        print(i)
        for child in metadata.iter("{http://www.loc.gov/MARC21/slim}controlfield"):
            if child.attrib['tag'] == '001':
                print(child.attrib['tag'], child.text)
        for child in metadata.iter("{http://www.loc.gov/MARC21/slim}datafield"):
            if child.attrib['tag'] == '024':
                # print(child.attrib['tag'], child.find("{http://www.loc.gov/MARC21/slim}subfield").text)
                print(child.attrib['tag'], child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='2']").text, child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='a']").text)

1
001 unige:1
024 DOI 10.1016/j.ijpharm.2007.09.007
024 PMID 17997238
2
001 unige:2
024 DOI 10.1039/b714861e
024 PMID 18092080
3
001 unige:3
024 DOI 10.1016/j.ejpb.2007.06.020
024 PMID 17884402
4
001 unige:4
024 DOI 10.1002/jps.20829
024 PMID 17497726
5
001 unige:5
024 DOI 10.1007/s11095-007-9429-7
024 PMID 17985216
6
001 unige:6
024 DOI 10.1016/j.chroma.2007.12.021
024 PMID 18177881
7
001 unige:7
024 DOI 10.1039/b716488b
024 PMID 18253518
8
001 unige:8
024 DOI 10.1002/chir.20439
024 PMID 17600851
9
001 unige:9
024 DOI 10.1002/chem.200701520
024 PMID 18067110
10
001 unige:10
024 DOI 10.1002/jssc.200700537
024 PMID 18306434
11
001 unige:11
024 DOI 10.1021/la7021352
024 PMID 18072793
12
001 unige:12
024 DOI 10.1016/j.chroma.2008.03.034
024 PMID 18395734
13
001 unige:13
024 DOI 10.1016/j.ejpb.2007.09.021
024 PMID 18023564
14
001 unige:14
024 DOI 10.1016/j.ejpb.2007.08.001
024 PMID 17826969
15
001 unige:15
024 DOI 10.1128/AAC.01372-07
024 PMID 18195063
16
001 unige:16
024 DOI 10.1016/j.cli
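
The long `{namespace}tag` strings used above can be shortened with a prefix map passed to `find`, `findall` and `findtext`. The following is a sketch under assumptions: `iter_identifiers` and the `NS` mapping are hypothetical names, and the sample page is hand-made (its header identifier is invented; the 024 values are those printed above for `unige:1`).

```python
try:
    from lxml import etree  # parser used in this notebook
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

# Prefix map (the prefixes 'oai' and 'marc' are arbitrary local names)
NS = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'marc': 'http://www.loc.gov/MARC21/slim',
}

def iter_identifiers(root):
    """Yield (record id, scheme, value) for every 024 field of an OAI-PMH page."""
    for metadata in root.findall('./oai:ListRecords/oai:record/oai:metadata', NS):
        rec_id = metadata.findtext(
            ".//marc:controlfield[@tag='001']", default='', namespaces=NS)
        for field in metadata.findall(".//marc:datafield[@tag='024']", NS):
            scheme = field.findtext("marc:subfield[@code='2']", default='', namespaces=NS)
            value = field.findtext("marc:subfield[@code='a']", default='', namespaces=NS)
            yield rec_id, scheme, value

# Hand-made page with a single record
sample = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>example</identifier></header>
      <metadata>
        <collection xmlns="http://www.loc.gov/MARC21/slim">
          <record>
            <controlfield tag="001">unige:1</controlfield>
            <datafield tag="024" ind1="7" ind2=" ">
              <subfield code="2">DOI</subfield>
              <subfield code="a">10.1016/j.ijpharm.2007.09.007</subfield>
            </datafield>
            <datafield tag="024" ind1="7" ind2=" ">
              <subfield code="2">PMID</subfield>
              <subfield code="a">17997238</subfield>
            </datafield>
          </record>
        </collection>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

identifiers = list(iter_identifiers(etree.fromstring(sample)))
for rec_id, scheme, value in identifiers:
    print(rec_id, scheme, value)
```

Because `findtext` is called with `default=''`, a 024 field lacking `$2` or `$a` yields an empty string instead of raising an `AttributeError`.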

In [5]:
# DOIs

# folder
files = os.listdir(aou_folder)

# create files
f = open(folder_out + 'aou_dois.tsv', mode='w', encoding='utf-8')

# write first line
f.write('id\tdoi\n')

# loop on files
for file in tqdm(files):
    # Parse XML
    root = etree.parse(aou_folder + '/' + file)
    # select the record
    records = root.findall("./{http://www.openarchives.org/OAI/2.0/}ListRecords/{http://www.openarchives.org/OAI/2.0/}record")
    for record in records:
        metadata = record.find("./{http://www.openarchives.org/OAI/2.0/}metadata")
        if (metadata is not None) :
            myid = ''
            my024_2 = ''
            my024_a = ''
            mydoi = ''
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}controlfield"):
                if child.attrib['tag'] == '001':
                    myid = child.text
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}datafield"):
                if child.attrib['tag'] == '024':
                    # $2 holds the identifier type, $a its value (None when the subfield is absent)
                    my024_2 = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='2']")
                    my024_a = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='a']")
                    # skip fields where $a is missing or empty
                    if (my024_2 == 'DOI') and my024_a:
                        mydoi = my024_a
                        # report records holding several space-separated DOIs
                        if ' ' in mydoi:
                            print(myid + ' DOI: ' + mydoi)
                        # write one line per DOI (split() also copes with repeated spaces)
                        for doii in mydoi.split():
                            f.write(myid + '\t' + doii + '\n')
f.close()

unige:168305 DOI: 10.1016/j.freeradbiomed.2021.12.011 10.1016/j.freeradbiomed.2022.01.016
unige:168413 DOI: 10.1111/petr.14186 10.1111/petr.14349
unige:168699 DOI: 10.1007/s00134-021-06346-w 10.1007/s00134-021-06379-1
unige:168806 DOI: 10.1038/s41597-021-00941-8 10.1038/s41597-021-01026-2 10.1038/s41597-021-01044-0
unige:168811 DOI: 10.3389/fpsyt.2022.835580 10.3389/fpsyt.2023.1182124
unige:169087 DOI: 10.3389/fcvm.2022.863968 10.3389/fcvm.2023.1181506
unige:169098 DOI: 10.3389/fonc.2022.1025481 10.3389/fonc.2023.1170338/full
unige:169173 DOI: 10.3233/JAD-201215 10.3233/JAD-219008
unige:169209 DOI: 10.2196/46694 10.2196/49027
unige:169244 DOI: 10.3389/fendo.2022.1031633 10.3389/fendo.2023.1172597
unige:169249 DOI: 10.3389/fendo.2022.971745 10.3389/fendo.2023.1175361
unige:169297 DOI: 10.1038/s41598-022-26963-9 10.1038/s41598-023-30530-1
unige:169299 DOI: 10.1007/s00431-021-04143-7 10.1007/s00431-021-04217-6
unige:169382 DOI: 10.1136/ard-2022-223356 10.1136/ard-2022-223356corr1
unige:16
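
As the output shows, some records store several DOIs in a single `$a`, separated by spaces. A minimal sketch of the splitting step, written with `csv.writer` instead of manual `f.write` calls (`doi_rows` is a hypothetical helper; the sample value is one of the records printed above):

```python
import csv
import io

def doi_rows(record_id, raw_value):
    """Return one (id, doi) row per whitespace-separated DOI in `raw_value`."""
    return [(record_id, doi) for doi in raw_value.split()]

rows = doi_rows('unige:168413', '10.1111/petr.14186 10.1111/petr.14349')

# An in-memory buffer stands in for the output file in this sketch
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(['id', 'doi'])
writer.writerows(rows)
print(buffer.getvalue())
```

`str.split()` without an argument ignores leading, trailing and repeated whitespace, whereas `split(' ')` would produce empty strings in those cases.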

In [6]:
# PMIDs

# folder
files = os.listdir(aou_folder)

# create files
f = open(folder_out + 'aou_pmids.tsv', mode='w', encoding='utf-8')

# write first line
f.write('id\tpmid\n')

# loop on files
for file in tqdm(files):
    # Parse XML
    root = etree.parse(aou_folder + '/' + file)
    # select the record
    records = root.findall("./{http://www.openarchives.org/OAI/2.0/}ListRecords/{http://www.openarchives.org/OAI/2.0/}record")
    for record in records:
        metadata = record.find("./{http://www.openarchives.org/OAI/2.0/}metadata")
        if (metadata is not None) :
            myid = ''
            my024_2 = ''
            my024_a = ''
            mypmid = ''
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}controlfield"):
                if child.attrib['tag'] == '001':
                    myid = child.text
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}datafield"):
                if child.attrib['tag'] == '024':
                    # $2 holds the identifier type, $a its value (None when the subfield is absent)
                    my024_2 = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='2']")
                    my024_a = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='a']")
                    # skip fields where $a is missing or empty
                    if (my024_2 == 'PMID') and my024_a:
                        mypmid = my024_a
                        # report records holding several space-separated PMIDs
                        if ' ' in mypmid:
                            print(myid + ' PMID: ' + mypmid)
                        # write only the first PMID to the file
                        mypmids = mypmid.split()
                        if mypmids:
                            f.write(myid + '\t' + mypmids[0] + '\n')
f.close()

unige:168305 PMID: 34890769 35078692
unige:168413 PMID: 34738698 35799314
unige:168699 PMID: 33506379 33688994
unige:168806 PMID: 34400655 34508107 34556662
unige:168811 PMID: 35815035 37009116
unige:169087 PMID: 35872923 37025679
unige:169098 PMID: 36713528 36937427
unige:169173 PMID: 33720887 34366360
unige:169209 PMID: 37163336 37201181
unige:169244 PMID: 36531463 36992808
unige:169249 PMID: 36313762 36967800
unige:169297 PMID: 36581637 36859417
unige:169299 PMID: 34129099 34363093
unige:169382 PMID: 36357155 36764818
unige:169631 PMID: 35468327 35716679
unige:170239 PMID: 37113782 37484338
unige:170403 PMID: 34791154 34927669
unige:170435 PMID: 36160799 37009273
unige:170501 PMID: 36553975 37372928
unige:170524 PMID: 36517587 36847792
unige:170560 PMID: 37104966 37380558
unige:171595 PMID: 36713419 36949941
unige:173174 PMID: 36312317 37965631
unige:173486 PMID: 37192151 37768888
unige:174036 PMID: 36199149 36804991
unige:174113 PMID: 37649107 37775834
unige:174537 PMID: 35988568 3

In [7]:
# 856

# folder
files = os.listdir(aou_folder)

# create files
f = open(folder_out + 'aou_856.tsv', mode='w', encoding='utf-8')

# write first line
f.write('id\t856_3\t856_f\t856_u\t856_x\t856_z\n')

# loop on files
for file in tqdm(files):
    # Parse XML
    root = etree.parse(aou_folder + '/' + file)
    # select the record
    records = root.findall("./{http://www.openarchives.org/OAI/2.0/}ListRecords/{http://www.openarchives.org/OAI/2.0/}record")
    for record in records:
        metadata = record.find("./{http://www.openarchives.org/OAI/2.0/}metadata")
        if (metadata is not None) :
            myid = ''
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}controlfield"):
                if child.attrib['tag'] == '001':
                    myid = child.text
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}datafield"):
                if child.attrib['tag'] == '856':
                    # findtext() returns None when the subfield is absent
                    my856_3 = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='3']")
                    # keep only fields that have a $3 label other than 'Record' or 'URN'
                    if (my856_3 is not None) and (my856_3 not in ('Record', 'URN')):
                        # missing or empty subfields are written as empty strings
                        my856_f = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='f']", default='')
                        my856_u = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='u']", default='')
                        my856_x = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='x']", default='')
                        my856_z = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='z']", default='')
                        # write info to file
                        f.write('\t'.join([myid, my856_3, my856_f, my856_u, my856_x, my856_z]) + '\n')
f.close()
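
Reading a subfield that may be absent, or present but empty, is the recurring step in these cells. A minimal sketch of a helper that covers both cases (`subfield_text` is a hypothetical name; the sample field and its values are invented for illustration):

```python
try:
    from lxml import etree  # parser used in this notebook
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

MARC = '{http://www.loc.gov/MARC21/slim}'

def subfield_text(field, code, default=''):
    """Return the stripped text of the first subfield `code`, or `default`."""
    sub = field.find(MARC + "subfield[@code='" + code + "']")
    # the element can be missing, or present without any text
    if sub is None or sub.text is None:
        return default
    return sub.text.strip()

# Hand-made 856 field: $f and $z are missing, $x is empty
sample = b"""<datafield xmlns="http://www.loc.gov/MARC21/slim" tag="856" ind1="4" ind2=" ">
  <subfield code="3">Article</subfield>
  <subfield code="u">https://example.org/file.pdf</subfield>
  <subfield code="x"></subfield>
</datafield>"""

field = etree.fromstring(sample)
row = [subfield_text(field, code) for code in ('3', 'f', 'u', 'x', 'z')]
print('\t'.join(row))
```

Stripping the text is a choice made for this sketch; the cell above writes the subfield text unchanged.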

In [8]:
# Structure UNIGE 928

# folder
files = os.listdir(aou_folder)

# create files
f = open(folder_out + 'aou_928.tsv', mode='w', encoding='utf-8')

# write first line
f.write('id\t928\n')

# loop on files
for file in tqdm(files):
    # Parse XML
    root = etree.parse(aou_folder + '/' + file)
    # select the record
    records = root.findall("./{http://www.openarchives.org/OAI/2.0/}ListRecords/{http://www.openarchives.org/OAI/2.0/}record")
    for record in records:
        metadata = record.find("./{http://www.openarchives.org/OAI/2.0/}metadata")
        if (metadata is not None) :
            myid = ''
            my928 = ''
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}controlfield"):
                if child.attrib['tag'] == '001':
                    myid = child.text
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}datafield"):
                if child.attrib['tag'] == '928':
                    # multiple values: $a is repeated
                    my928 = child.findall("{http://www.loc.gov/MARC21/slim}subfield[@code='a']")
                    # findall() returns an empty list (never None) when nothing matches
                    if my928:
                        for my928a in my928:
                            my928a_text = my928a.text or ''
                            # write info to file
                            f.write(myid)
                            f.write('\t')
                            f.write(my928a_text)
                            f.write('\n')
                    else:
                        # 928 field without any $a: report the record
                        print(myid)
f.close()

In [9]:
# Dates

# folder
files = os.listdir(aou_folder)

# create files
f = open(folder_out + 'aou_dates.tsv', mode='w', encoding='utf-8')

# write first line
f.write('id\tdate\n')

# loop on files
for file in tqdm(files):
    # Parse XML
    root = etree.parse(aou_folder + '/' + file)
    # select the record
    records = root.findall("./{http://www.openarchives.org/OAI/2.0/}ListRecords/{http://www.openarchives.org/OAI/2.0/}record")
    for record in records:
        metadata = record.find("./{http://www.openarchives.org/OAI/2.0/}metadata")
        if (metadata is not None) :
            myid = ''
            mydate = ''
            my773 = ''
            year = ''
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}controlfield"):
                if child.attrib['tag'] == '001':
                    myid = child.text
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}datafield"):
                if child.attrib['tag'] == '264':
                    if (child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='c']") is not None):
                        mydate = child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='c']").text
                if child.attrib['tag'] == '773':
                    if (child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='g']") is not None):
                        my773 = child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='g']").text
                        # extract the year: split $g on '(' and inspect the text
                        # that follows each of the first three opening parentheses
                        my773n = 0
                        if ('(' in my773):
                            my773split = my773.split('(')
                            my773n = len(my773split)
                        if my773n > 0:
                            if (my773n > 1):
                                if len(my773split[1]) > 1:
                                    if (((my773split[1][0] == '1') | (my773split[1][0] == '2')) & (' ' not in my773split[1][:4])):
                                        year = my773split[1][:4]
                            if ((year == '') & (my773n > 2)):
                                if len(my773split[2]) > 1:
                                    if ((my773split[2][0] == '1') | (my773split[2][0] == '2')):
                                        year = my773split[2][:4]
                            if ((year == '') & (my773n > 3)):
                                if len(my773split[3]) > 1:
                                    if ((my773split[3][0] == '1') | (my773split[3][0] == '2')):
                                        year = my773split[3][:4]
            if (mydate == '') & (year == ''):
                print(myid)
            if (mydate == '') & (year != ''):
                # write info to file
                f.write(myid)
                f.write('\t')
                f.write(year)
                f.write('\n')
            if (mydate != ''):
                # write info to file
                f.write(myid)
                f.write('\t')
                f.write(mydate)
                f.write('\n')
f.close()
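
The year extraction from 773 `$g` can also be expressed with a regular expression. This is an alternative technique, not the logic of the cell above: it requires four digits right after an opening parenthesis (the cell only checks the first character) and it does not stop at the third parenthesis. `extract_year` is a hypothetical helper and the sample strings are invented.

```python
import re

# an opening parenthesis followed by a 4-digit number starting with 1 or 2
YEAR_AFTER_PAREN = re.compile(r'\(([12]\d{3})')

def extract_year(value):
    """Return the first 4-digit year that directly follows '(' in `value`, else ''."""
    if not value:
        return ''
    match = YEAR_AFTER_PAREN.search(value)
    return match.group(1) if match else ''

for g in ['vol. 12, no. 3 (2008), p. 45-50', 'Suppl. (1) (2010)', 'p. 45-50']:
    print(repr(g), '->', repr(extract_year(g)))
```

Returning an empty string for a missing or unparseable value mirrors the `year = ''` default used in the cell.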

### Funders 988 $f

``` xml
<marc:datafield tag="988" ind1=" " ind2=" ">
<marc:subfield code="2">Projects</marc:subfield>
<marc:subfield code="c">184814</marc:subfield>
<marc:subfield code="f">Swiss National Science Foundation</marc:subfield>
<marc:subfield code="j">CH</marc:subfield>
<marc:subfield code="n">
Hematopoietic Stem Cells and their niche: origin and maturation during embryonic development
</marc:subfield>
</marc:datafield>
```
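
As an illustration, the field shown above can be parsed on its own once the `marc` namespace is declared on it. This is a sketch: the `xmlns:marc` declaration was added for the example, and the `project` dictionary is just one convenient way to look at the subfields.

```python
try:
    from lxml import etree  # parser used in this notebook
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

NS = {'marc': 'http://www.loc.gov/MARC21/slim'}

# The 988 field from the example, with the namespace declaration added
sample = b"""<marc:datafield xmlns:marc="http://www.loc.gov/MARC21/slim" tag="988" ind1=" " ind2=" ">
<marc:subfield code="2">Projects</marc:subfield>
<marc:subfield code="c">184814</marc:subfield>
<marc:subfield code="f">Swiss National Science Foundation</marc:subfield>
<marc:subfield code="j">CH</marc:subfield>
<marc:subfield code="n">
Hematopoietic Stem Cells and their niche: origin and maturation during embryonic development
</marc:subfield>
</marc:datafield>"""

field = etree.fromstring(sample)

# Map each subfield code to its text, stripped of surrounding whitespace
project = {
    sub.get('code'): (sub.text or '').strip()
    for sub in field.findall('marc:subfield', NS)
}
print(project['f'])
print(project['c'])
```

Only `$f` (the funder name) is exported by the cell below; the other subfields are shown here to make the structure of the field visible.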

In [10]:
# 988 $f

# folder
files = os.listdir(aou_folder)

# create files
f = open(folder_out + 'aou_988f.tsv', mode='w', encoding='utf-8')

# write first line
f.write('id\t988f\n')

# loop on files
for file in tqdm(files):
    # Parse XML
    root = etree.parse(aou_folder + '/' + file)
    # select the record
    records = root.findall("./{http://www.openarchives.org/OAI/2.0/}ListRecords/{http://www.openarchives.org/OAI/2.0/}record")
    for record in records:
        metadata = record.find("./{http://www.openarchives.org/OAI/2.0/}metadata")
        if (metadata is not None) :
            myid = ''
            my988 = ''
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}controlfield"):
                if child.attrib['tag'] == '001':
                    myid = child.text
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}datafield"):
                if child.attrib['tag'] == '988':
                    my988f = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='f']")
                    # skip fields where $f is missing or empty
                    if my988f:
                        my988 = my988f
                        # write info to file
                        f.write(myid + '\t' + my988 + '\n')
            # if my988 == '':
                # print(myid)
f.close()

In [11]:
# Document type 980 $a

# folder
files = os.listdir(aou_folder)

# create files
f = open(folder_out + 'aou_980a.tsv', mode='w', encoding='utf-8')

# write first line
f.write('id\t980a\n')

# loop on files
for file in tqdm(files):
    # Parse XML
    root = etree.parse(aou_folder + '/' + file)
    # select the record
    records = root.findall("./{http://www.openarchives.org/OAI/2.0/}ListRecords/{http://www.openarchives.org/OAI/2.0/}record")
    for record in records:
        metadata = record.find("./{http://www.openarchives.org/OAI/2.0/}metadata")
        if (metadata is not None) :
            myid = ''
            my980a = ''
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}controlfield"):
                if child.attrib['tag'] == '001':
                    myid = child.text
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}datafield"):
                if child.attrib['tag'] == '980':
                    my980a_text = child.findtext("{http://www.loc.gov/MARC21/slim}subfield[@code='a']")
                    # skip fields where $a is missing or empty
                    if my980a_text:
                        my980a = my980a_text
                        # write info to file
                        f.write(myid + '\t' + my980a + '\n')
            if my980a == '':
                print(myid)
f.close()

In [6]:
# Journal name 773 $t

# folder
files = os.listdir(aou_folder)

# create files
f = open(folder_out + 'aou_773t.tsv', mode='w', encoding='utf-8')

# write first line
f.write('id\t773t\n')

id_missing = []

# loop on files
for file in tqdm(files):
    # Parse XML
    root = etree.parse(aou_folder + '/' + file)
    # select the record
    records = root.findall("./{http://www.openarchives.org/OAI/2.0/}ListRecords/{http://www.openarchives.org/OAI/2.0/}record")
    for record in records:
        metadata = record.find("./{http://www.openarchives.org/OAI/2.0/}metadata")
        if (metadata is not None) :
            myid = ''
            my773t = ''
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}controlfield"):
                if child.attrib['tag'] == '001':
                    myid = child.text
            for child in metadata.iter("{http://www.loc.gov/MARC21/slim}datafield"):
                if child.attrib['tag'] == '773':
                    if (child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='t']") is not None):
                        my773t = child.find("{http://www.loc.gov/MARC21/slim}subfield[@code='t']").text
                        if (my773t is not None):
                            # write info to file
                            f.write(myid)
                            f.write('\t')
                            f.write(my773t)
                            f.write('\n')
            if (my773t == '') | (my773t is None) :
                id_missing.append(myid)
f.close()

100%|██████████████████████████████████████████████████████████████████████████████| 1091/1091 [00:33<00:00, 32.63it/s]


In [9]:
'unige:74670' in id_missing

False
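
The extraction cells above share the same loop and differ only in the MARC tag and subfield they read. As a sketch of how that loop could be factored out (`extract_rows` is a hypothetical helper, and the sample page with its journal name and document type is invented):

```python
try:
    from lxml import etree  # parser used in this notebook
except ImportError:  # fallback so the sketch also runs without lxml
    import xml.etree.ElementTree as etree

OAI = '{http://www.openarchives.org/OAI/2.0/}'
MARC = '{http://www.loc.gov/MARC21/slim}'

def extract_rows(root, tag, code):
    """Yield (record id, text) for each non-empty subfield `code` of field `tag`."""
    for metadata in root.iter(OAI + 'metadata'):
        rec_id = ''
        for control in metadata.iter(MARC + 'controlfield'):
            if control.get('tag') == '001':
                rec_id = control.text or ''
        for field in metadata.iter(MARC + 'datafield'):
            if field.get('tag') != tag:
                continue
            for sub in field.findall(MARC + 'subfield'):
                if sub.get('code') == code and sub.text:
                    yield rec_id, sub.text

# Hand-made page with a single record
sample = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <collection xmlns="http://www.loc.gov/MARC21/slim">
          <record>
            <controlfield tag="001">unige:1</controlfield>
            <datafield tag="773" ind1=" " ind2=" ">
              <subfield code="t">Example Journal</subfield>
            </datafield>
            <datafield tag="980" ind1=" " ind2=" ">
              <subfield code="a">Article</subfield>
            </datafield>
          </record>
        </collection>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

root = etree.fromstring(sample)
journals = list(extract_rows(root, '773', 't'))
doc_types = list(extract_rows(root, '980', 'a'))
print(journals, doc_types)
```

Unlike the cells above, this sketch yields every repeated subfield with the requested code, and it leaves the reporting of records without a value to the caller.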