# Page number extraction

In this notebook we extract, from the bibliographical notes left by the researchers, the pages that are relevant to the subject of the corpus.

We use regular expressions to detect the page-indication patterns, and compile the results in a CSV file that we will use in the next steps.

In [4]:
import re
import pandas as pd

In [9]:
master_df = pd.read_csv('data/DFKV_Master.csv')
master_df = master_df.dropna(subset=['liens iiif'])

The 'bibliographie' column indicates which pages are of interest. Our main problem is that no two indications are written the same way, so we need to process these strings to find page numbers that we can actually use. Let's make a few observations about them:

- If the document has fewer than 10 pages (and no page indication), we keep them all.
- If 'S.' or 'P.' appears (upper or lower case), a page number follows.
- If a '-' comes after the page number, there is a beginning page and an ending page.
- If 'p.' appears twice, we take the larger interval.
- Otherwise, if there is no indication (this happens only twice in the dataset), we just take the image from the link.
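
The observations above can be sketched on a few made-up strings. This is a minimal illustration, not the function used in the rest of the notebook: the helper names `page_intervals` and `largest_interval`, the pattern `PAGE_MARKER`, and the example strings are all assumptions introduced here.

```python
import re

# Hypothetical pattern: a 'p.' or 'S.' marker (either case), a page number,
# and optionally '-' followed by a second page number.
PAGE_MARKER = re.compile(r"\b[pPsS]\.\s*(\d+)(?:\s*-\s*(\d+))?")


def page_intervals(text):
    """Return every (first_page, last_page) pair introduced by 'p.' or 'S.'."""
    intervals = []
    for first, last in PAGE_MARKER.findall(text):
        start = int(first)
        end = int(last) if last else start  # no '-' means a single page
        intervals.append((start, end))
    return intervals


def largest_interval(text):
    """Return the widest interval found, or None if there is no page marker."""
    intervals = page_intervals(text)
    if not intervals:
        return None
    return max(intervals, key=lambda pair: pair[1] - pair[0])


# Made-up examples, one per observation.
examples = [
    "n° 25, p. 279",      # single page
    "Bd. 3, S. 12-15",    # beginning and ending page
    "p. 4-5 ; p. 10-20",  # two markers: keep the larger interval
    "22.1907",            # no page marker at all
]
for example in examples:
    print(example, "->", largest_interval(example))
```

The number of pages to extract would then be `last_page - first_page + 1` for the chosen interval.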

In [70]:
# Extract the page indication from a bibliography string and return the number of pages it covers
def extract_nb_pages(text):
    if not pd.isna(text):
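        # Everything after a 'p' or 's' and the character that follows it
        # (the '.' is unescaped, so it matches any character, not only a dot)
        after_page = re.findall(r"(?<=[pPsS].).*", text)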
        if len(after_page) != 0 :
            beginning = re.findall(r"\d+(?=-)", after_page[0]) # extract numbers before '-'
            end = re.findall(r"-(\d+)", after_page[0]) # extract numbers after '-'
            if len(beginning) != 0 and len(end) != 0:
                return int(end[0]) - int(beginning[0]) + 1
            else:
                return 1 # means that there is only one page (e.g. 'n° 25, p. 279')
        elif len(re.findall(r"^[0-9]{4}$", text)) > 0:
            return 1  # bibliography is just the year, then we only take one page
        elif len(re.findall(r"Bd.", text)) > 0 : # means that we have a bibliography of the form "5.Jg., Bd.18, 184-186"
            pages = re.findall(r"[^, ]*$", text)[0] # extract the '184-186'
            separate_pages = re.findall(r"[^-]*", pages)
            if len(separate_pages) > 2:
                beginning = separate_pages[0]
                end = separate_pages[2]
                return int(end) - int(beginning) + 1
            else: # If only one page
                return 1
        else:    
            return -1  # means that we didn't find any page indication
    else:
        return 1 # no page indication, we will only take the one from the link
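
One detail of the first pattern is worth checking in isolation: inside the lookbehind, `.` matches any character, so the branch is taken for any 'p' or 's' in the string, not only for 'p.' or 'S.'. The sketch below compares the pattern used above with a stricter variant; the `strict` pattern and the sample strings are assumptions made for illustration, and switching patterns would change which rows end up flagged with -1.

```python
import re

loose = r"(?<=[pPsS].).*"    # pattern used above: '.' matches any character
strict = r"(?<=[pPsS]\.).*"  # assumed alternative: '\.' matches a literal dot

# Made-up string with an 'S' that is not a page marker.
sample = "Supplement, 1900"
print(re.findall(loose, sample))   # ['pplement, 1900']
print(re.findall(strict, sample))  # []

# Made-up string with a real page marker: both patterns agree.
marked = "n° 25, p. 279"
print(re.findall(loose, marked))   # [' 279']
print(re.findall(strict, marked))  # [' 279']
```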

In [71]:
master_df['pages_to_extract'] = master_df['bibliographie'].apply(extract_nb_pages)

If we now check how many documents have no detected page indication, we find 15 of them. We will fill these in manually.

In [72]:
master_df[master_df['pages_to_extract']==-1]

Unnamed: 0,ID,Volume_ID,_journal-id,liens iiif,liens de citation (page),liens de citation (volume),bibliographie,pages_to_extract
1065,15457,2619,1602.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k43...,https://gallica.bnf.fr/ark:/12148/bpt6k431806j...,,148-165,-1
1217,14546,2786,1408.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/ark:/12148/bpt6k2031075...,,18.1878.4,-1
2185,14728,3716,1463.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57...,https://gallica.bnf.fr/ark:/12148/bpt6k5780540...,,22.1907,-1
2308,14718,3847,1408.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/ark:/12148/bpt6k203114c...,,25.1882.6,-1
2351,14585,3890,1463.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/ark:/12148/bpt6k6125759...,,26.1909,-1
2440,14719,3988,1408.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/ark:/12148/bpt6k203117h...,,28.1883.1,-1
2486,15568,4040,1411.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k75...,https://gallica.bnf.fr/ark:/12148/bpt6k7526448...,,"28e année, n° 10.204 (numéro entier)",-1
2800,14720,4362,1408.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/ark:/12148/bpt6k2031206...,,31.1885.6,-1
3766,14607,5379,1602.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k87...,https://gallica.bnf.fr/ark:/12148/bpt6k871028?...,,51.1882,-1
4788,14648,6426,1602.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k87...,https://gallica.bnf.fr/ark:/12148/bpt6k871357/...,https://gallica.bnf.fr/ark:/12148/bpt6k871357?...,81.1887,-1


In [74]:
master_df.at[1065,'pages_to_extract'] = 18
master_df.at[1217,'pages_to_extract'] = 1
master_df.at[2185,'pages_to_extract'] = 1
master_df.at[2308,'pages_to_extract'] = 1
master_df.at[2351,'pages_to_extract'] = 1
master_df.at[2440,'pages_to_extract'] = 1
master_df.at[2486,'pages_to_extract'] = 1
master_df.at[2800,'pages_to_extract'] = 1
master_df.at[3766,'pages_to_extract'] = 1
master_df.at[4788,'pages_to_extract'] = 1
master_df.at[4795,'pages_to_extract'] = 1
master_df.at[5282,'pages_to_extract'] = 1
master_df.at[5283,'pages_to_extract'] = 1
master_df.at[5586,'pages_to_extract'] = 5
master_df.at[6322,'pages_to_extract'] = 1
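
The same kind of manual fill can also be written as a single assignment from a dictionary, which keeps the corrections in one place and makes it easy to check that no -1 is left. This is a sketch on a toy frame: `toy_df`, `manual_pages`, and all the index labels and values in it are made up.

```python
import pandas as pd

# Toy frame standing in for master_df; index labels and values are made up.
toy_df = pd.DataFrame(
    {"pages_to_extract": [-1, 3, -1, -1]},
    index=[10, 11, 12, 13],
)

# Manual corrections keyed by index label (hypothetical values).
manual_pages = {10: 18, 12: 1, 13: 5}

# Assign all corrections at once; the Series is aligned on index labels.
corrections = pd.Series(manual_pages)
toy_df.loc[corrections.index, "pages_to_extract"] = corrections

# Check that no row still carries the -1 sentinel.
remaining = int((toy_df["pages_to_extract"] == -1).sum())
print(remaining)
print(toy_df)
```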

We now save the file, which contains the IIIF link, the document ID, and the number of pages to extract.

In [78]:
master_df = master_df.drop(columns=['liens de citation (page)', 'liens de citation (volume)', 'bibliographie', 'Volume_ID', '_journal-id'])

In [79]:
master_df.to_csv('data/DFKV_pages.csv', index=False)
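
An alternative to dropping columns is to select the ones to keep, which also leaves out any leftover column such as `Unnamed: 0`. The sketch below shows this on a toy frame and reads the result back to confirm its columns; `toy_df`, the `keep` list, and the example URLs are assumptions, and an in-memory buffer replaces the real file path.

```python
import io

import pandas as pd

# Toy frame with a subset of the columns of master_df; values are made up.
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "ID": [15457, 14546],
    "liens iiif": ["https://example.org/iiif/a", "https://example.org/iiif/b"],
    "bibliographie": ["148-165", "18.1878.4"],
    "pages_to_extract": [18, 1],
})

# Keep only the columns needed for the next steps.
keep = ["ID", "liens iiif", "pages_to_extract"]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df[keep].to_csv(buffer, index=False)

# Read the CSV back and check what was saved.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```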