### Harvesting metadata about books in the BSB digital collections

#### Introduction

This notebook uses the [Bavarian State Library's OAI Handler](http://bdr.oai.bsb-muenchen.de/OAIHandler) to harvest metadata from the [digital collections](https://oai.bsb-muenchen.de/doc/bayerisches-digitales-repositorium/) of the Bavarian State Library and stores them in a local deployment of [MongoDB Community Edition](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/connect/#local-deployment) using the pymongo client.<br>
Installation instructions for MongoDB community edition, MongoDB Compass and MongoDB Shell are documented [here](https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-windows/#run-mongodb-community-edition-from-the-command-interpreter).<br>
Instructions for pymongo are given [here](https://www.mongodb.com/docs/languages/python/pymongo-driver/current/get-started/).<br>

To start the database: <br>
Open the Windows command line (CMD) as Administrator and execute the following command, adjusting `--dbpath` to the directory in which the data should be stored:
```
    "C:\Program Files\MongoDB\Server\8.0\bin\mongod.exe" --dbpath="c:\data\db"
```
Then open mongosh.exe as Administrator and enter the connection string of the host, here `mongodb://localhost`. <br>
Optionally, open MongoDB Compass, the MongoDB GUI, and connect to localhost:27017 to view databases and collections, create new collections, and follow the harvesting process.<br>

#### Load packages

In [None]:
# Load packages
from sickle import Sickle
from lxml import etree
import re
import json
from pymongo import MongoClient
import bson
import xmltodict

#### Connect to database

In [None]:
# Local deployment of MongoDB
# The client accesses the local MongoDB instance
# A database named "balneologie" with a collection named "bsb" was created using MongoDB Compass;
# the names used below are examples and should be replaced by your own database and collection
uri = "mongodb://localhost:27017/"
client = MongoClient(uri)
database = client["bsb_metadata"] # for example
collection = database["numismatics"] # for example

In [None]:
# Check connection and show existing collections in the database
collection_list = database.list_collections()
for c in collection_list:
    print(c)

#### Configure OAI handler

This notebook shows how metadata harvesting was done using the OAI Handler of the Bavarian State Library.<br>
With little adjustment, other OAI Handlers could be used as well, e.g.:
- Bibliothèque Nationale de France (BNF): http://oai.bnf.fr/oai2/OAIHandler
- Sächsische Landes- und Universitätsbibliothek (SLUB): https://digital.slub-dresden.de/oai
- Leibniz-Informationszentrum Technik und Naturwissenschaften (TIB): https://www.tib.eu/oai/public/repository/open

In [None]:
sickle = Sickle('http://bdr.oai.bsb-muenchen.de/OAIHandler')

In [None]:
oai_sets = sickle.ListSets()
for oai_set in oai_sets:
    print('setSpec value for selective harvesting: ' + oai_set.setSpec)
    print('Name of the set (setName): ' + oai_set.setName + '\n')

In [None]:
oai_formats = sickle.ListMetadataFormats()
for oai_format in oai_formats:
    print(oai_format.metadataPrefix)

#### Configure harvesting

We harvest only the metadata of digitized books whose records match particular patterns.<br>
First, we want books that have variants of the words 'table' or 'statistics' in their title.<br>
Second, we want books that belong to particular series.<br>
The selected series are Merc., Oecon., Enc., Cam., Num.ant., and Num.rec.<br>
These abbreviations correspond to the shelf marks of the series in the BSB (Signaturfach).<br>
The shelf marks are documented in [Haller 2011](https://mdz-nbn-resolving.de/details:bsb00067806).<br>
In the harvesting step, a record is stored as soon as any one of the patterns matches, whether it is a title keyword or a shelf mark.

In [None]:
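# The patterns are defined using regular expressions as follows.
# All patterns are case-insensitive and are searched anywhere in the raw record.
patterns = [
    re.compile(r'tafeln?', re.IGNORECASE),         # Tafel, Tafeln
    re.compile(r'tabelle[ns]?', re.IGNORECASE),    # Tabelle, Tabellen; inside [...] a '|' would be a literal character
    re.compile(r'tables?', re.IGNORECASE),         # table, tables
    re.compile(r'statisti', re.IGNORECASE),        # Statistik, statistics, statistique, ...
    re.compile(r'tab[eu]l{1,2}a', re.IGNORECASE),  # tabula, tabella, tabela
    re.compile(r'Merc\.|Oecon\.|Enc\.|Cam\.|Num\.ant\.', re.IGNORECASE),  # shelf marks
    # add more patterns as needed
]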
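
The effect of such patterns can be checked on a few strings before a long harvest is started. The following is a minimal sketch that uses a subset of the patterns above; the helper `matching_patterns` and the sample snippets are hypothetical and were invented for illustration only. Because the patterns are searched anywhere in the raw record, a short pattern such as `tables?` also matches inside longer words.

```python
import re

# Subset of the notebook's patterns (title keywords and shelf marks)
patterns = [
    re.compile(r'tafeln?', re.IGNORECASE),
    re.compile(r'tables?', re.IGNORECASE),
    re.compile(r'statisti', re.IGNORECASE),
    re.compile(r'Merc\.|Oecon\.|Enc\.|Cam\.|Num\.ant\.', re.IGNORECASE),
]


def matching_patterns(text, patterns):
    """Return the source strings of all patterns found anywhere in text."""
    return [p.pattern for p in patterns if p.search(text)]


# Hypothetical record snippets, invented for illustration only
samples = [
    "<dc:title>Statistische Tafeln des Königreichs</dc:title>",
    "<dc:identifier>Oecon. 1234</dc:identifier>",
    "<dc:title>Vegetables and their culture</dc:title>",
    "<dc:title>Gedichte</dc:title>",
]

for sample in samples:
    print(sample, "->", matching_patterns(sample, patterns))
```

The third sample is reported as a match for `tables?` because "Vegetables" contains "tables". Such false positives have to be filtered out at a later stage, or the pattern can be anchored with `\b` if only whole words should match.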

Requests were made against the set `all`, which covers all datasets:
```
http://bdr.oai.bsb-muenchen.de/OAIHandler?verb=ListRecords&metadataPrefix=MarcXchange&set=all
```
The task was split into several parts using the `from` and `until` arguments of the request, which restrict the harvest to records whose datestamp (the date the record was created or last modified in the repository) falls within the given period.<br>
The patterns are looked up in the raw records. When a pattern matches, the record is stored in the MongoDB database.<br>
Note: Selection of books based on their year of publication is done at a later stage.<br>
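
The split into `from`/`until` periods can be sketched as follows. This is a minimal example under the assumption that one request per calendar year is wanted; the helpers `yearly_windows` and `list_records_url` are hypothetical names introduced here, and the URLs are only built and printed, not sent.

```python
from datetime import date
from urllib.parse import urlencode

# Base URL of the BSB OAI handler used throughout this notebook
BASE_URL = "http://bdr.oai.bsb-muenchen.de/OAIHandler"


def yearly_windows(first_year, last_day):
    """Return one (from, until) pair of ISO dates per calendar year.

    The last window is cut off at last_day.
    """
    windows = []
    for year in range(first_year, last_day.year + 1):
        start = date(year, 1, 1)
        end = min(date(year, 12, 31), last_day)
        windows.append((start.isoformat(), end.isoformat()))
    return windows


def list_records_url(from_date, until_date, metadata_prefix="MarcXchange", oai_set="all"):
    """Build the ListRecords request URL for one window (nothing is sent)."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": oai_set,
        "from": from_date,
        "until": until_date,
    }
    return f"{BASE_URL}?{urlencode(params)}"


for from_date, until_date in yearly_windows(2021, date(2025, 2, 13)):
    print(list_records_url(from_date, until_date))
```

Each `(from, until)` pair can equally be passed to `sickle.ListRecords` in place of the fixed dates.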

In [None]:
# Map namespace URIs to short prefixes so that the keys produced by xmltodict stay readable
namespaces = {
    'http://www.openarchives.org/OAI/2.0/': 'oai',
    'http://www.openarchives.org/OAI/2.0/oai_dc/': 'oai_dc',
    'http://purl.org/dc/elements/1.1/': 'dc'
}

# Initialize counters and lists
processed = 0           # records returned by the OAI handler
count = 0               # records matching at least one pattern
records_to_insert = []  # XML strings of the matching records

# Change 'set' for selective harvesting of digital collections in the BSB Digitale Sammlungen.
# Change 'from' and 'until' to harvest records with a datestamp in a different period.
# 'from' is a reserved word in Python, so the arguments are passed as an unpacked dict.
params = {'metadataPrefix': 'oai_dc', 'set': 'all', 'from': '2021-01-01', 'until': '2021-12-31'}

try:
    for record in sickle.ListRecords(**params):
        processed += 1
        # The patterns are compiled with re.IGNORECASE, so the raw XML is searched as-is
        if any(pattern.search(record.raw) for pattern in patterns):
            count += 1
            xml_string = etree.tostring(record.xml, pretty_print=True, encoding='unicode')
            records_to_insert.append(xml_string)

            # Convert the XML record to a nested dict with prefixed keys (e.g. 'dc:title')
            xml_dict = xmltodict.parse(
                xml_string,
                process_namespaces=True,
                namespaces=namespaces
            )

            # Round-trip through BSON to verify that the document can be encoded;
            # pymongo also encodes the dict to BSON itself on insert
            decoded_data = bson.decode(bson.encode(xml_dict))
            result = collection.insert_one(decoded_data)
            print(f"Stored single record ID: {result.inserted_id}")

except Exception as e:
    # Also reached when the request matches no records (OAI-PMH error noRecordsMatch)
    print(f"An error occurred: {e}")

print(f"\nTotal records processed: {processed}")
print(f"Total matching records found: {count}")
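
The loop above issues one `insert_one` call per matching record. Where this becomes slow, the documents could instead be collected and written in chunks with pymongo's `insert_many`. The following is a minimal sketch under that assumption, not the method used for the harvest documented in this notebook; `insert_in_batches` and the stand-in class `FakeCollection` are hypothetical and exist only so that the example runs without a database.

```python
from itertools import islice


def insert_in_batches(collection, documents, batch_size=500):
    """Write documents with collection.insert_many in chunks of batch_size.

    Returns the number of documents passed to insert_many.
    """
    iterator = iter(documents)
    inserted = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


class FakeCollection:
    """Hypothetical stand-in for a pymongo collection; it only records batch sizes."""

    def __init__(self):
        self.batch_sizes = []

    def insert_many(self, documents):
        self.batch_sizes.append(len(documents))


# Dry run with invented documents instead of harvested records
fake = FakeCollection()
docs = ({"n": i} for i in range(1200))
total = insert_in_batches(fake, docs, batch_size=500)
print(total, fake.batch_sizes)
```

With a real database, the `collection` object created in the connection step would be passed in place of `FakeCollection()`.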

#### Documentation of harvesting results from BSB


Metadata for books with shelf marks Merc., Oecon., Enc., and Cam. were harvested on 12-02-2025 and 13-02-2025.<br>
Metadata for books with shelf marks Num.ant. and Num.rec. were harvested on 09-04-2025.<br>

The harvesting results for the first period are as follows. The resumption token returned with the first response reports the size of the complete result list in its `completeListSize` attribute:<br>

```xml
<resumptionToken expirationDate="2025-02-13T20:29:25Z" completeListSize="1848361" cursor="0">
TGlzdFJlY29yZHM6OjoxMDA6MToyMDI1LTAyLTEzVDIwJTNBMjklM0EyNVo6MTg0ODM2MQ:all:MarcXchange
</resumptionToken>
```
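
The attributes of the resumption token can be read with Python's standard library. The sketch below parses the element exactly as quoted above; in a full OAI-PMH response the element sits in the OAI namespace, which is not handled here. The variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The resumptionToken element as quoted above
token_xml = """<resumptionToken expirationDate="2025-02-13T20:29:25Z" completeListSize="1848361" cursor="0">
TGlzdFJlY29yZHM6OjoxMDA6MToyMDI1LTAyLTEzVDIwJTNBMjklM0EyNVo6MTg0ODM2MQ:all:MarcXchange
</resumptionToken>"""

element = ET.fromstring(token_xml)

# Size of the complete result list and position of this batch within it
complete_list_size = int(element.get("completeListSize"))
cursor = int(element.get("cursor"))

# Time after which the token can no longer be used to resume the harvest
expires = datetime.strptime(element.get("expirationDate"), "%Y-%m-%dT%H:%M:%SZ")

# The token itself is opaque and is sent back unchanged to request the next batch
token = element.text.strip()

print(f"Complete list size: {complete_list_size}")
print(f"Cursor: {cursor}")
print(f"Token expires: {expires.isoformat()}")
```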

- 13.02.2025: from 2025-01-01 until 2025-02-13
    - Total records processed: 46203
    - Total matching records found: 1362
- 12.02.2025: from 2024-01-01 until 2024-12-31
    - Total records processed: 682421
    - Total matching records found: 21262
- 12.02.2025: from 2023-01-01 until 2023-12-31
    - Total records processed: 176073
    - Total matching records found: 7753
- 12.02.2025: from 2022-01-01 until 2022-12-31
    - Total records processed: 256394
    - Total matching records found: 9496
- 12.02.2025: from 2021-01-01 until 2021-12-31
    - Total records processed: 684296
    - Total matching records found: 19923
- 13.02.2025: from 2020-01-01 until 2020-12-31
    - An error occurred: The combination of the values of the from, until, set and metadataPrefix arguments results in an empty list.
    - Total records processed: 0
    - Total matching records found: 0
- 13.02.2025: from 2019-01-01 until 2019-12-31
    - An error occurred: The combination of the values of the from, until, set and metadataPrefix arguments results in an empty list.
    - Total records processed: 0
    - Total matching records found: 0
- 13.02.2025: from 2018-01-01 until 2018-12-31
    - An error occurred: The combination of the values of the from, until, set and metadataPrefix arguments results in an empty list.
    - Total records processed: 0
    - Total matching records found: 0
- 13.02.2025: from 1997-01-01 until 2017-12-31
    - An error occurred: The combination of the values of the from, until, set and metadataPrefix arguments results in an empty list.
    - Total records processed: 0
    - Total matching records found: 0
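
The figures per period can be added up to get an overall picture of the first harvesting period. The sketch below simply re-enters the numbers listed above (periods without records are omitted); the variable names are chosen for illustration.

```python
# (from, until, records processed, matching records), copied from the list above
results = [
    ("2025-01-01", "2025-02-13", 46203, 1362),
    ("2024-01-01", "2024-12-31", 682421, 21262),
    ("2023-01-01", "2023-12-31", 176073, 7753),
    ("2022-01-01", "2022-12-31", 256394, 9496),
    ("2021-01-01", "2021-12-31", 684296, 19923),
]

total_processed = sum(row[2] for row in results)
total_matching = sum(row[3] for row in results)
share = total_matching / total_processed

print(f"Total records processed: {total_processed}")
print(f"Total matching records found: {total_matching}")
print(f"Share of matching records: {share:.2%}")
```

In total, 1845387 records were processed and 59796 of them matched at least one pattern, which is about 3.2 % of the processed records.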