# Building the `Book` instance with basic info
The ARTFL identifiers come from the following project:
> Jaucourt, WILDSHUSEN, Encyclopédie, ou dictionnaire raisonné des sciences, des arts et des métiers, etc., eds. Denis Diderot and Jean le Rond d'Alembert. University of Chicago: ARTFL Encyclopédie Project (Autumn 2022 Edition), Robert Morrissey and Glenn Roe (eds): https://encyclopedie.uchicago.edu. Accessed on 29 August, 2024

In [2]:
import sys
sys.path.append('/home/antoine/Documents/GitHub/linkencyclo')
datapath = '/home/antoine/Documents/GitHub/datas/'

from EncycloObject import Article, Book
import pandas as pd
import json
import pickle

In [3]:
df = pd.read_excel(datapath + 'edda_27082024_clean.xlsx')
df.columns

Index(['volume', 'numero', 'headphrase', 'authors', 'text', 'artfl', 'enccre',
       'article_id'],
      dtype='object')

In [4]:
plainbook = Book(description='basic Book with full text ')

for _,row in df.iterrows():
    current_article = Article(
        volume=row['volume'],
        numero=row['numero'],
        headphrase=row['headphrase'],
        authors=row['authors'],
        text=row['text'].replace('\n',' ').strip(),
        enccre=row['enccre'],
    )
    plainbook.articles.append(current_article)

In [5]:
plainbook

Book with 15510 articles
basic Book with full text 
Attributes :
dict_keys(['volume', 'numero', 'headphrase', 'authors', 'text', 'hash', 'artfl', 'enccre', 'enccre_link'])

Opening the annotations of P. Nugues:

In [6]:
with open(datapath+'diderot_1751_wd.json', 'r') as f:
    # json.load parses the file object directly
    diderot_wd = json.load(f)

`entreeid` can serve as a key for each article. These identifiers come from the ENCCRE project ([http://enccre.academie-sciences.fr/encyclopedie/](http://enccre.academie-sciences.fr/encyclopedie/)).

In [7]:
diderot_dic = {a['entreeid']:a for a in diderot_wd}
len(diderot_wd), len(diderot_dic)

(15274, 15274)
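
The two equal lengths indicate that no two entries share an `entreeid`, so building the dictionary drops nothing. Had the numbers differed, the duplicated keys could be listed along the following lines. This is only a sketch: the `entries` list is made-up toy data, not taken from the annotation file.

```python
from collections import Counter

# Toy entries standing in for the annotation file (hypothetical values)
entries = [
    {'entreeid': 'v1-1-0', 'vedette': 'A'},
    {'entreeid': 'v1-2-0', 'vedette': 'B'},
    {'entreeid': 'v1-2-0', 'vedette': 'B bis'},
]

# Count how many times each key occurs
key_counts = Counter(e['entreeid'] for e in entries)

# A key occurring more than once would be silently overwritten
# by a dict comprehension keyed on 'entreeid'
duplicated = sorted(k for k, n in key_counts.items() if n > 1)
print(duplicated)
```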

The author annotated the headphrase of each article, and also the names of persons when a paragraph is dedicated to their biography.

So, how many articles have a headphrase that P. Nugues tagged successfully (QID != Q0)?

In [8]:
resolved_diderot = [item for item in diderot_wd if 'qid' in item]
print('Number of articles having at least one QID : ', len(resolved_diderot))
# keep the entries whose first QID (the headphrase tag) is resolved, i.e. different from Q0
geocoded = [item for item in resolved_diderot if item['qid'][0] != 'Q0']
print('Number of headphrases with a QID != Q0 : ', len(geocoded))


Number of articles having at least one QID :  9631
Number of headphrases with a QID != Q0 :  9540


# Adding the QID to each article

It appears that some `Article` instances built from our data have a wrong ENCCRE id.
To match each of our articles with the entries from P. Nugues' work, we therefore compare the article bodies.

In addition, the article bodies differ slightly between ARTFL and ENCCRE, as illustrated below. We therefore rely on a fuzzy string similarity function, `string_similarity`.

> WILDSHUSEN, (Géog. mod.)​​ petite ville **d’Alleau** cercle de Westphalie, sur la riviere de Hunde, aux confins du comté d’Oldenbourg, & la capitale d’un petit pays auquel elle donne son nom. (D. J.) http://enccre.academie-sciences.fr/encyclopedie/article/v17-1360-0
> 
> WILDSHUSEN, (Géog. mod.) petite ville **d'Allemagne au**  cercle de Westphalie, sur la riviere de Hunde, aux confins du comté d'Oldenbourg, & la capitale d'un petit pays auquel elle donne son nom. (D. J.) https://artflsrv04.uchicago.edu/philologic4.7/encyclopedie0922/navigate/17/2279 


In [9]:
from difflib import SequenceMatcher
import re
from unidecode import unidecode

def string_similarity(
        s1:str,
        s2:str,
        threshold:float = 0.95,
        shorten:int = 50
        )-> tuple[str,str,bool]:
    """ 
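    Because the ENCCRE and ARTFL OCR outputs differ slightly, we need to tolerate some differences.

    Compares two input strings after cleaning them (ASCII transliteration, lowercasing, letters and whitespace only).
    Returns a tuple with the cleaned strings and a boolean indicating whether their similarity ratio reaches `threshold`.
    If `shorten` is non-zero, only the first `shorten` characters of each cleaned string are compared
    (fewer if one of the strings is shorter).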
    """
    # Cleaning the strings
    s1_cleaned = unidecode(s1).lower().replace('\u200b', '').replace('\n', '')#.replace('  ', ' ').strip()
    s2_cleaned = unidecode(s2).lower().replace('\u200b', '').replace('\n', '')#.replace('  ', ' ').strip()
    s1_cleaned = re.sub(r'[^a-z\s]', '',s1_cleaned).replace('  ', ' ').strip()
    s2_cleaned = re.sub(r'[^a-z\s]', '',s2_cleaned).replace('  ', ' ').strip()

    if shorten :
        # compare prefixes of equal length, capped at `shorten` characters
        n = min(shorten, len(s1_cleaned), len(s2_cleaned))
        s1_cleaned = s1_cleaned[:n]
        s2_cleaned = s2_cleaned[:n]


    similarity_ratio = SequenceMatcher(None, s1_cleaned, s2_cleaned).ratio()

    return s1_cleaned, s2_cleaned, similarity_ratio >= threshold

# Example usage
s1 = '*\u200b A, s. petite riviere de France, \n qui a sa source près de Fontaines en Sologne.'
s2 = '* A, f. petite rivie\nre de France, qui a sa source près de Fontaines en ave slight differences, we need to tolarate some differences'

s1_cleaned, s2_cleaned, check = string_similarity(s1, s2)
s1_cleaned, s2_cleaned, check

('a s petite riviere de france qui a sa source pres ',
 'a f petite riviere de france qui a sa source pres ',
 True)
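
The WILDSHUSEN pair quoted earlier shows how the comparison window affects the score. The sketch below compares the full cleaned texts with their first 50 characters. It is not the notebook's function: `normalize` is a hypothetical helper that uses the standard library's `unicodedata` as a stand-in for `unidecode`, and the URLs are left out of the strings.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    """Hypothetical helper: lowercase, keep ASCII letters and single spaces."""
    s = unicodedata.normalize('NFKD', s).lower()  # split accents from base letters
    s = re.sub(r'[^a-z\s]', '', s)                # drop accents, punctuation, zero-width spaces
    return re.sub(r'\s+', ' ', s).strip()         # collapse runs of whitespace

enccre = ("WILDSHUSEN, (Géog. mod.)\u200b\u200b petite ville d’Alleau cercle de Westphalie, "
          "sur la riviere de Hunde, aux confins du comté d’Oldenbourg, & la capitale "
          "d’un petit pays auquel elle donne son nom. (D. J.)")
artfl = ("WILDSHUSEN, (Géog. mod.) petite ville d'Allemagne au cercle de Westphalie, "
         "sur la riviere de Hunde, aux confins du comté d'Oldenbourg, & la capitale "
         "d'un petit pays auquel elle donne son nom. (D. J.)")

a, b = normalize(enccre), normalize(artfl)

# Similarity over the whole article body
full_ratio = SequenceMatcher(None, a, b).ratio()
# Similarity over the first 50 cleaned characters only
prefix_ratio = SequenceMatcher(None, a[:50], b[:50]).ratio()

print(round(full_ratio, 3), round(prefix_ratio, 3))
```

In this sketch the same six-character divergence weighs more in the 50-character window than in the full text, so the prefix score is the lower of the two. Pairs that diverge early may therefore need a longer window or a lower threshold.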

In [10]:
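from typing import Optional

def read_enccre_id(enccre_id:str)-> Optional[tuple[int,int,int]]:
    """Parse an ENCCRE id such as 'v17-1360-0' into three integers (the first is the volume); return None if the id is malformed."""
    m = re.match(r'v(\d+)-(\d+)-(\d+)', enccre_id)
    if m:
        return (int(m.group(1)), int(m.group(2)), int(m.group(3)))
    else:
        return None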
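
Because the parser returns `None` for a string that does not match the pattern, unpacking its result directly would raise a `TypeError` on a malformed id. A minimal usage sketch with an explicit guard follows; the function is restated in compact form so that the block runs on its own, and `'not-an-id'` is a made-up value.

```python
import re
from typing import Optional

def read_enccre_id(enccre_id: str) -> Optional[tuple[int, int, int]]:
    """Compact restatement of the parser above."""
    m = re.match(r'v(\d+)-(\d+)-(\d+)', enccre_id)
    return (int(m.group(1)), int(m.group(2)), int(m.group(3))) if m else None

for raw in ['v17-1360-0', 'not-an-id']:
    parsed = read_enccre_id(raw)
    if parsed is None:
        # malformed id: skip it instead of failing on tuple unpacking
        print(raw, '-> skipped')
        continue
    vol, _, _ = parsed
    print(raw, '-> volume', vol)
```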

We iterate over the articles whose headphrase was tagged successfully, to find the related `Article` instance in our data.

In [11]:
# iterate over the articles with QIDs
for tagged in geocoded:

    # search for the corresponding article in our data
    candidates = [a for a in plainbook if hasattr(a,'enccre') and a.enccre == tagged['entreeid']]


    if candidates :

        # default to the first candidate, so that `candidate` is always defined
        # even when none of several candidates passes the text check
        candidate = candidates[0]
        if len(candidates) > 1:
            for c in candidates:
                cand_clean, tagged_clean, check1 = string_similarity(
                    c.text,
                    tagged['texte'],
                    threshold=0.9,
                    shorten=50)
                if check1:
                    candidate = c
                    print('Multiple candidates found for ', tagged['entreeid'])
                    print('>>>', tagged_clean)
                    print('+++', cand_clean)
                    print()
                    break

        # check if article is indeed the same
        cand_clean, tagged_clean, check1 = string_similarity(
            candidate.text,
            tagged['texte'],
            threshold=0.9,
            shorten=50)

        # if both are correct, we continue : the enccre_id is correct
        if check1 :
            candidate.enccre_id = tagged['entreeid']
            continue

        # if not, we search for a candidate, based on the text this time
        else :
            print('MISMATCH : ', tagged['entreeid'])
            print('>>>', tagged_clean)
            print('---', cand_clean)
        
            # manual search for candidates
            vol,_,_ = read_enccre_id(tagged['entreeid'])
            candidates = [a for a in plainbook if a.volume == vol]
            for c in candidates :
                # article body check
                c_clean, tagged_clean, check1 = string_similarity(
                    c.text,
                    tagged['texte'],
                    threshold=0.90,
                    shorten=50)
                
                # headphrase check
                c_head, tagged_head, check2 = string_similarity(
                    c.headphrase,
                    tagged['vedette'],
                    threshold=0.95,
                    shorten=50)
                
                if check1 and check2:
                    print('+++', c_clean)
                    print(c.hash)
                    

                    # we update the enccre_id
                    c.enccre_id = tagged['entreeid']

                    break
            print()
            continue

Multiple candidates found for  v1-2865-0
>>> ariano geog ville ditalie au royaume de naples dan
+++ ariano geog ville ditalie au royaume de naples dan

Multiple candidates found for  v1-3123-0
>>> arve geog riviere de fossigny en savoie elle sort 
+++ arve geog riviere de fossigny en savoie elle sort 

Multiple candidates found for  v1-3127-0
>>> arun petite riviere du comte de sussex en angleter
+++ arun petite riviere du comte de sussex en angleter

Multiple candidates found for  v2-1062-0
>>> bernaw geog petite ville dallemagne dans lelectora
+++ bernaw geog petite ville dallemagne dans lelectora

Multiple candidates found for  v2-1063-0
>>> bernbourg geog petite ville dallemagne du cercle d
+++ bernbourg geog petite ville dallemagne du cercle d

Multiple candidates found for  v2-1317-0
>>> binche geog ville ancienne du hainaut sur la rivie
+++ binche geog ville ancienne du hainaut sur la rivie

Multiple candidates found for  v2-2581-0
>>> bulach geog petite ville dallemagne en soua
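
The loop above rebuilds the candidate list by scanning the whole book for every tagged entry. One possible alternative is to index the articles by ENCCRE id once and look candidates up in a dictionary. The sketch below assumes nothing about the real classes: `ToyArticle` is a hypothetical stand-in for `Article`, and the headphrases are toy values.

```python
from collections import defaultdict

class ToyArticle:
    """Hypothetical stand-in for `Article`, reduced to the fields used here."""
    def __init__(self, headphrase: str, enccre: str):
        self.headphrase = headphrase
        self.enccre = enccre

articles = [
    ToyArticle('ARIANO', 'v1-2865-0'),
    ToyArticle('ARIANO (second entry)', 'v1-2865-0'),
    ToyArticle('ARVE', 'v1-3123-0'),
]

# Build the index once: ENCCRE id -> list of articles carrying that id
by_enccre = defaultdict(list)
for art in articles:
    by_enccre[art.enccre].append(art)

# Each lookup is then a dictionary access instead of a scan of the whole book;
# .get() avoids creating an empty entry for unknown ids
candidates = by_enccre.get('v1-2865-0', [])
print([c.headphrase for c in candidates])
```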

Finally, we add the QID.

In [12]:
for tagged in geocoded :
    related_article = [a for a in plainbook if hasattr(a,'enccre_id') and a.enccre_id == tagged['entreeid']]
    if related_article:
        if len(related_article) > 1:
            print('Multiple candidates found for ', tagged['entreeid'])
            for c in related_article:
                print(c.headphrase)
            continue
        related_article = related_article[0]
        related_article.gold_qid = tagged['qid'][0]

**How many tagged headphrases do we now have for our downstream Entity Linking task?**

In [13]:
len([a for a in plainbook if hasattr(a,'gold_qid')])

8987

**How many tags are lost?**

In [14]:
lost_tags=0
for tagged in geocoded :
    related_article = [a for a in plainbook if hasattr(a,'enccre_id') and a.enccre_id == tagged['entreeid']]
    if not related_article :
        lost_tags+=1
print(lost_tags)

553
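
These counts are consistent with the earlier ones: 9540 successfully tagged headphrases minus 553 lost tags gives the 8987 articles that carry a `gold_qid`. The same bookkeeping can be sketched with set operations; the ids below are toy values chosen for illustration.

```python
# Toy ENCCRE ids (hypothetical): tags available vs. ids attached to an article
tagged_ids = {'v1-1-0', 'v1-2-0', 'v1-3-0', 'v1-4-0'}
matched_ids = {'v1-1-0', 'v1-3-0', 'v1-4-0'}

# Tags for which no article was found
lost_ids = tagged_ids - matched_ids
print(sorted(lost_ids))

# Matched and lost tags must add up to the number of tags
assert len(matched_ids) + len(lost_ids) == len(tagged_ids)

# Same check with the counts reported in this notebook
assert 9540 - 553 == 8987
```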


Saving our book

In [15]:
import pickle
with open(datapath+'geobook_plain_28082024.pkl', 'wb') as f:
    pickle.dump(plainbook, f)
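
To reuse the book in a later session, the file can be read back with `pickle.load`. The round trip below is a minimal sketch: a plain list of dicts and a temporary file stand in for `plainbook` and the real path, both assumptions made for illustration. Unpickling the actual `Book` also requires the `EncycloObject` module to be importable, since pickle stores references to classes rather than their definitions.

```python
import os
import pickle
import tempfile

# Toy stand-in for the book (hypothetical content)
toy_book = [{'headphrase': 'WILDSHUSEN', 'volume': 17}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_book.pkl')

    # Write the object in binary mode
    with open(path, 'wb') as f:
        pickle.dump(toy_book, f)

    # Read it back, also in binary mode
    with open(path, 'rb') as f:
        reloaded = pickle.load(f)

print(reloaded == toy_book)
```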