# Counting the number of newspaper issues per set

## Configuration
Here we define the sets that we want to include in our counting task. Each key is a set identifier and each value is a human-readable label for that set.

In [5]:
sets = {
    "9200300": "Austria",
    "9200301": "Finland",
    "9200303": "Latvia",
    "9200338": "Hamburg",
    "9200339": "Serbia",
    "9200355": "Berlin",
    "9200356": "Estonia",
    "9200357": "Poland",
    "9200359": "Netherlands",
    "9200396": "Luxembourg"
}

## Processing

First, we define a function that does the actual counting for one set.

In [6]:
# Libraries needed to retrieve and process the metadata zips
import re
import requests
from io import BytesIO
from zipfile import ZipFile

def count_set(set_id):
    # construct the address of the .zip file with the metadata for one set
    md_zip_url = f'https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/{set_id}.zip'
    
    # retrieve and unpack the .zip file
    print(f'Retrieving {md_zip_url}')
    resp = requests.get(md_zip_url)
    resp.raise_for_status()  # stop early if the download failed
    zipfile = ZipFile(BytesIO(resp.content))
    
    # count the links in the file
    set_count = 0
    for file_name in zipfile.namelist():
        with zipfile.open(file_name, mode='r') as file:
            for line in file:
                text = line.decode('UTF-8')
                set_count += len(re.findall(r'https://www\.europeana\.eu/item/\d+/BibliographicResource_\d+', text))
    print(f'Number of issues found in set {set_id} ({sets[set_id]}): {set_count}')
    
    # return the result
    return set_count
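The counting logic can be checked without downloading anything. The following is a minimal sketch, assuming the same link pattern as `count_set`; the helper name `count_links_in_zip`, the file names, and the file contents are made up for illustration and are not part of the original notebook.

```python
import re
from io import BytesIO
from zipfile import ZipFile

# Link pattern as used in count_set, written as a raw string with escaped dots
ISSUE_LINK = re.compile(r'https://www\.europeana\.eu/item/\d+/BibliographicResource_\d+')

def count_links_in_zip(zip_bytes):
    """Count issue links in all files of a zip archive given as bytes (hypothetical helper)."""
    count = 0
    with ZipFile(BytesIO(zip_bytes)) as archive:
        for file_name in archive.namelist():
            with archive.open(file_name) as file:
                for line in file:
                    count += len(ISSUE_LINK.findall(line.decode('UTF-8')))
    return count

# Build a tiny archive in memory; names and contents are invented test data
buffer = BytesIO()
with ZipFile(buffer, mode='w') as archive:
    archive.writestr(
        'a.xml',
        '<ref>https://www.europeana.eu/item/9200300/BibliographicResource_1</ref>\n'
        '<ref>https://www.europeana.eu/item/9200300/BibliographicResource_2</ref>\n',
    )
    archive.writestr(
        'b.xml',
        '<ref>https://www.europeana.eu/item/9200301/BibliographicResource_3</ref>\n'
        '<ref>https://example.org/not-an-issue</ref>\n',
    )

sample_count = count_links_in_zip(buffer.getvalue())
print(sample_count)  # 3 matching links across the two files
```

Separating the counting from the download in this way makes it possible to test the pattern on small, known inputs before running it against the full metadata archives.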

Now we simply apply the function to all sets and add each of the counts to a running total.

In [7]:
total_count = 0
for set_id in sets:
    total_count += count_set(set_id)

Retrieving https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200300.zip
Number of issues found in set 9200300 (Austria): 147515
Retrieving https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200301.zip
Number of issues found in set 9200301 (Finland): 24164
Retrieving https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200303.zip
Number of issues found in set 9200303 (Latvia): 67870
Retrieving https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200338.zip
Number of issues found in set 9200338 (Hamburg): 130938
Retrieving https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200339.zip
Number of issues found in set 9200339 (Serbia): 22087
Retrieving https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200355.zip
Number of issues found in set 9200355 (Berlin): 134708
Retrieving https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200356.zip
Number of issues found in set 9200356 (Estonia): 92
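The loop above keeps only the running total, so the individual counts are available only in the printed output. A possible variation, sketched here under the assumption that the per-set numbers are wanted later, stores them in a dictionary. The helper `collect_counts` is hypothetical, and the stand-in counter uses made-up numbers so that the example runs offline.

```python
def collect_counts(set_ids, counter):
    """Apply counter to each set id and return (per-set counts, total)."""
    counts = {set_id: counter(set_id) for set_id in set_ids}
    return counts, sum(counts.values())

# Stand-in for count_set with invented numbers, so no download is needed
fake_counts = {"9200300": 5, "9200301": 7}
counts, total = collect_counts(fake_counts, fake_counts.get)

print(counts)
print(total)  # 12
```

In the notebook itself, the call would presumably be `collect_counts(sets, count_set)`, which performs the same downloads as the loop above.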

## Results

In [8]:
print(f'Total number of issues found in {len(sets)} sets: {total_count}')

Total number of issues found in 10 sets: 640461
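To see how the total breaks down, the per-set counts can be shown as shares of the total. The sketch below uses only the seven counts that appear in the output listing above, keyed by set identifier, together with the reported total; `format_shares` is a hypothetical helper and not part of the original notebook.

```python
def format_shares(counts, total):
    """Return one line per set, largest first, with its count and share of the total."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [f'{set_id}  {count:>8}  {count / total:6.1%}' for set_id, count in ordered]

# Counts copied from the output listing above (only seven sets are listed there)
listed_counts = {
    "9200300": 147515,
    "9200301": 24164,
    "9200303": 67870,
    "9200338": 130938,
    "9200339": 22087,
    "9200355": 134708,
    "9200356": 92,
}
reported_total = 640461  # total reported for all 10 sets

for line in format_shares(listed_counts, reported_total):
    print(line)

listed_sum = sum(listed_counts.values())
print(f'Listed sets account for {listed_sum} of {reported_total} issues')
```

The seven listed sets sum to 527374, so the three sets whose counts are not shown in the listing account for the remaining 113087 issues.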
