This notebook creates dataframes from the CSV files images.csv, secties_totaal.csv and registers_totaal.csv (all downloaded from https://doi.org/10.6084/m9.figshare.27221169.v1). <br>
Columns are selected and column names are translated. <br>
The notebook then explores the data, reviews the proposed method for linking register, section and image data, and identifies errors and inconsistencies. <br>
The method adds a 'reg_id' and a 'section_id', derived from registers_totaal.csv and secties_totaal.csv, to the images data. <br>
It does so by using the microfilm numbers and scan numbers recorded for registers and sections to find the corresponding reg_id and section_id for each row in the images table. <br>
The notebook produces CSV files that will be used to resolve inconsistencies and errors in the data before the method is implemented. <br>
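
As a rough illustration of the linking method described above, the sketch below pairs images with registers on the microfilm number and then keeps only the pairs whose scan number falls inside the register's scan range. The values are invented toy data and the `toy_` names are hypothetical; only the column names are taken from this notebook.

```python
import pandas as pd

# Toy data (invented values); column names follow the translated names used in this notebook.
toy_registers = pd.DataFrame({
    "reg_id": [0, 1],
    "microfilm_number": [10, 10],
    "first_scan": [1, 51],
    "last_scan": [50, 100],
})
toy_images = pd.DataFrame({
    "ce_id": [1000, 1001, 1002],
    "microfilm_number": [10, 10, 10],
    "scannr": [5, 60, 120],
})

# Step 1: pair every image with every register on the same microfilm.
toy_pairs = toy_images.merge(toy_registers, how="left", on="microfilm_number")

# Step 2: keep only the pairs where the scan number lies within the register's scan range
# (between() is inclusive on both ends by default).
in_range = toy_pairs["scannr"].between(toy_pairs["first_scan"], toy_pairs["last_scan"])
toy_linked = toy_pairs.loc[in_range, ["ce_id", "scannr", "reg_id"]].reset_index(drop=True)
print(toy_linked)
```

In this toy case the image with scan number 120 drops out, because no register on microfilm 10 covers that scan. Rows like this are the non-matches the notebook sets out to identify.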

In [None]:
import requests
import pandas as pd
import json

In [None]:
# paths to the downloaded copies of registers_totaal.csv, secties_totaal.csv and images.csv
registers = r"C:\STRO10\registers_totaal.csv\registers_totaal.csv" # update filepath if necessary
sections = r"C:\STRO10\secties_totaal.csv\secties_totaal.csv" # update filepath if necessary
images = r"C:\STRO10\images.csv\images.csv" # update filepath if necessary

# 1. Preparation

In [None]:
# retrieve files for preparing entity creation from the STRO 2.0 GitHub repository
owner = 'dhofu'
repo = 'stro20'
column_selection = 'STRO_20_column_selection'
rename_columns = 'STRO_20_column_names'
url_selection = f"https://api.github.com/repos/{owner}/{repo}/contents/{column_selection}"
url_rename = f"https://api.github.com/repos/{owner}/{repo}/contents/{rename_columns}"

In [None]:
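response_column_selection = requests.get(url_selection)
response_rename_columns = requests.get(url_rename)
response_column_selection.raise_for_status()  # stop early if a GitHub API request failed
response_rename_columns.raise_for_status()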

In [None]:
# fetch the files for selecting columns from the 'registers', 'sections' and 'images' dataframes
selection_mappings = {
    'usecols_registers': None,
    'usecols_sections': None,
    'usecols_images': None
}
for file in response_column_selection.json():
    # the dictionary key is the part of the file name between the first underscore and the first dot
    if file['name'].startswith(('registers_', 'sections_', 'images_')):
        key = file['name'].split('_', 1)[1].split('.', 1)[0]
        selection_mappings[key] = requests.get(file['download_url']).json()

usecols_registers = selection_mappings['usecols_registers']
usecols_sections = selection_mappings['usecols_sections']
usecols_images = selection_mappings['usecols_images']
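
The key under which each mapping is stored is derived from the file name. The sketch below isolates that step in a hypothetical helper, `mapping_key`. The file names are invented for illustration, and the actual names in the repository may differ.

```python
def mapping_key(filename: str) -> str:
    """Return the part of the file name between the first underscore and the first dot."""
    return filename.split("_", 1)[1].split(".", 1)[0]


# Hypothetical file names (assumption): '<table>_<key>.json'
for name in ["registers_usecols_registers.json", "images_usecols_images.json"]:
    print(name, "->", mapping_key(name))
```

If no file name produces one of the expected keys, the corresponding entry stays `None`, and `pd.read_csv(..., usecols=None)` then reads all columns without warning. Checking that none of the three entries is `None` before reading the CSV files would catch this.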

In [None]:
# fetch the files for renaming columns from the 'registers', 'sections' and 'images' dataframes 
rename_mappings = {
    'rename_registers': None,
    'rename_sections': None,
    'rename_images': None
}

for file in response_rename_columns.json():
    # the dictionary key is the part of the file name between the first underscore and the first dot
    if file['name'].startswith(('registers_', 'sections_', 'images_')):
        key = file['name'].split('_', 1)[1].split('.', 1)[0]
        rename_mappings[key] = requests.get(file['download_url']).json()

rename_registers = rename_mappings['rename_registers']
rename_sections = rename_mappings['rename_sections']
rename_images = rename_mappings['rename_images']

## create registers dataframe

In [None]:
# create the registers dataframe from registers_totaal.csv
df_registers = pd.read_csv(registers, sep=",", quotechar='"', \
                         usecols=usecols_registers, header=0, encoding="utf-8").rename(columns=rename_registers).reset_index(names="reg_id")

## create sections dataframe

In [None]:
# create the sections dataframe from secties_totaal.csv
df_sections = pd.read_csv(sections, sep=",", quotechar='"', \
                         usecols=usecols_sections, encoding="utf-8").rename(columns=rename_sections)

Matching sections to registers by merging the dataframes and then applying conditions as below results in 3 duplicates for one section and in non-matches for 215 sections. In these cases the scan numbers of the sections are inconsistent with the scan numbers of the registers: often, the end of a section is set at a higher scan number than the end of the register. Moreover, the data about the latest registers is more detailed than the data about the sections from that period. The former error is resolved in the next processing step (see notebook STRO20_4_Images_M.ipynb). Resolving the latter will be extremely time-consuming.
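
One way to surface sections whose scan range extends beyond that of their register is sketched below. It uses invented toy data and assumes one register per microfilm for simplicity. The containment condition is an assumption for illustration and is not necessarily the condition applied in the project.

```python
import pandas as pd

# Toy data (invented values); one register per microfilm is assumed here.
toy_registers = pd.DataFrame({
    "reg_id": [0, 1],
    "microfilm_number": [10, 11],
    "first_scan": [1, 1],
    "last_scan": [100, 80],
})
toy_sections = pd.DataFrame({
    "section_id": [500, 501, 502],
    "microfilm_number": [10, 10, 11],
    "section_first_scan": [1, 61, 1],
    "section_last_scan": [60, 105, 80],
})

toy_secreg = toy_sections.merge(toy_registers, how="left", on="microfilm_number")

# Assumed condition: a section is consistent when its scan range lies
# entirely within the scan range of the register.
inside = (toy_secreg["section_first_scan"] >= toy_secreg["first_scan"]) & (
    toy_secreg["section_last_scan"] <= toy_secreg["last_scan"]
)
toy_inconsistent = toy_secreg.loc[~inside, ["section_id", "section_last_scan", "last_scan"]]
print(toy_inconsistent)
```

Here section 501 is flagged because it ends at scan 105 while its register ends at scan 100, which mirrors the most common inconsistency described above.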

## create images dataframe

In [None]:
# create the images dataframe from images.csv
df_images = pd.read_csv(images, sep=",", quotechar='"', \
                         usecols=usecols_images, header=0, encoding="utf-8").rename(columns=rename_images)

In [None]:
# len(df_images)

In [None]:
# The column 'filename' is split into its constituent parts.
# The microfilm number (the scans originate from a set of microfilms) and the scan number are used below to match register, section and image data.
df_images[['Sonttolregisters', 'mf_nr', 'scan', 'jpg']] = df_images.filename.str.split(r'-|_|\.', expand=True)
df_images.drop(columns=['Sonttolregisters', 'jpg'], inplace=True)
df_images['scannr'] = df_images['scan'].astype(int)
df_images['microfilm_number'] = df_images['mf_nr'].astype(int)
df_images.drop(columns=['mf_nr', 'scan'], inplace=True)
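
The split can be checked on a couple of file names in isolation. The names below are invented, and the pattern `<prefix>-<microfilm>_<scan>.jpg` is an assumption inferred from the four target columns, so this is only a sketch of the step.

```python
import pandas as pd

# Hypothetical file names (assumption): '<prefix>-<microfilm>_<scan>.jpg'
toy_files = pd.Series(["Sonttolregisters-012_0045.jpg", "Sonttolregisters-113_0200.jpg"])

# Split on '-', '_' or '.'; expand=True returns one column per part.
toy_parts = toy_files.str.split(r"-|_|\.", expand=True)
toy_parts.columns = ["prefix", "mf_nr", "scan", "ext"]

# astype(int) drops leading zeros: '0045' becomes 45.
toy_parts["microfilm_number"] = toy_parts["mf_nr"].astype(int)
toy_parts["scannr"] = toy_parts["scan"].astype(int)
print(toy_parts[["microfilm_number", "scannr"]])
```

A file name with additional separators would yield more than four parts, and the four-column assignment would then fail. Such a failure would point to file names that deviate from the expected pattern.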

## merging images and registers dataframes to identify non-matches

In [None]:
df_imgreg = pd.merge(df_images, df_registers, how="left", on="microfilm_number")

In [None]:
df_imgreg2 = df_imgreg[['ce_id', 'scannr', 'reg_id']].loc[(df_imgreg['scannr'] >= df_imgreg['first_scan']) & (df_imgreg['scannr'] <= df_imgreg['last_scan'])]

In [None]:
# len(df_imgreg2)

In [None]:
# find missing entries
# match on ce_id and scannr to account for entries across several scans
missing_CE = df_images.merge(df_imgreg2, on=['ce_id', 'scannr'], how='left', indicator='Match')

In [None]:
# scannr is outside of the scope of the first and last scans in the registers dataframe
# this occurs in 774 cases spread across 54 microfilm numbers (see below)
df_missing_CE = missing_CE.loc[missing_CE['Match'] == 'left_only']

In [None]:
# we can use this information to raise the value of last_scan in the registers dataframe to the highest value in the images dataframe
missing_regs = df_missing_CE.groupby('microfilm_number')[['scannr', 'reg_id']].max().reset_index()
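
The two steps used here, an anti-join through the merge indicator followed by a per-microfilm maximum, can be tried out on invented toy data. The `toy_` names and values are hypothetical. The sketch only shows the mechanics.

```python
import pandas as pd

toy_images = pd.DataFrame({
    "ce_id": [1, 2, 3, 4],
    "microfilm_number": [10, 10, 10, 11],
    "scannr": [5, 101, 104, 90],
})
# Images already linked to a register (hypothetical outcome of the scan-range filter).
toy_matched = pd.DataFrame({"ce_id": [1], "scannr": [5], "reg_id": [0]})

# indicator adds a column stating whether a row occurs in both frames or only in the left one.
toy_check = toy_images.merge(toy_matched, on=["ce_id", "scannr"], how="left", indicator="Match")
toy_unmatched = toy_check.loc[toy_check["Match"] == "left_only"]

# Highest unmatched scan number per microfilm: a candidate new value for last_scan.
toy_new_last_scan = toy_unmatched.groupby("microfilm_number")["scannr"].max().reset_index()
print(toy_new_last_scan)
```

For the unmatched rows `reg_id` is always empty, so which register should receive the raised `last_scan` still has to be decided by hand. That is consistent with the manual processing step mentioned below.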

In [None]:
# missing_regs

In [None]:
# save the dataframe for further manual processing; the processed dataframe is available as CSV file on GitHub (STRO_20_corrections)
# missing_regs.to_csv(r"C:\STRO_HUYGENS\STRO_20_corrections\registers_changeLastScan.csv", encoding='utf-8')

## merging images and sections dataframes to find non-matches

In [None]:
df_imgsec = pd.merge(df_images, df_sections, how="left", on="microfilm_number")

In [None]:
df_imgsec2 = df_imgsec[['ce_id', 'scannr', 'section_id']].loc[(df_imgsec['scannr'] >= df_imgsec['section_first_scan']) & (df_imgsec['scannr'] <= df_imgsec['section_last_scan'])]

In [None]:
# len(df_imgsec2)

In [None]:
df_imgsec2.duplicated(subset=['ce_id', 'scannr'], keep=False).value_counts()
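
The duplicate check can be illustrated with invented toy data in which two sections both cover scan 60. With `keep=False`, every row of a duplicate group is marked, not only the second and later occurrences.

```python
import pandas as pd

# Toy result of a scan-range filter (invented values): sections 500 and 501 overlap on scan 60.
toy_imgsec = pd.DataFrame({
    "ce_id": [1, 2, 2, 3],
    "scannr": [5, 60, 60, 70],
    "section_id": [500, 500, 501, 501],
})

# Select every row whose (ce_id, scannr) combination occurs more than once.
toy_dupes = toy_imgsec[toy_imgsec.duplicated(subset=["ce_id", "scannr"], keep=False)]
print(toy_dupes)
```

Listing the duplicated rows rather than only counting them shows which `section_id` values overlap, which is the information needed to correct the section boundaries.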

In [None]:
# find missing entries
# match on ce_id and scannr to account for entries across several scans
missing_CE = df_images.merge(df_imgsec2, on=['ce_id', 'scannr'], how='left', indicator='Match')

In [None]:
# scannr is outside of the scope of the first and last scans in the sections dataframe
# this occurs in 774 cases spread across 54 microfilm numbers (see below)
df_missing_CE = missing_CE.loc[missing_CE['Match'] == 'left_only']

In [None]:
# df_missing_CE

In [None]:
# we can use this information to raise the value of the last_scan in the sections dataframe to the highest value in the images dataframe 
missing_sections = df_missing_CE[['microfilm_number', 'scannr']].sort_values(by=['microfilm_number', 'scannr'], ascending=True).value_counts().reset_index()

In [None]:
# save the dataframe for further manual processing; the processed dataframe is available as CSV file on GitHub (STRO_20_corrections)
# missing_sections.to_csv(r"C:\STRO_HUYGENS\STRO_20_corrections\sections_changeSectionScanNumber.csv", encoding='utf-8')