# BookNLP-fr Hands-on Tutorial

Welcome to the BookNLP-fr tutorial! This notebook will guide you through the process of analyzing a French novel using the booknlp-fr library. You'll learn how to load a novel, tokenize it, extract named entities, resolve coreferences, and analyze the main characters.

## Step 1: Installing Required Libraries

Before we begin, we need to install the necessary libraries. We'll use spaCy for tokenization and booknlp_fr for mention detection and coreference resolution.


In [None]:
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1" # Can be used to disable the GPU

In [None]:
import sklearn

!pip install --quiet spacy-transformers
! python -m spacy download fr_dep_news_trf

import spacy
import spacy_transformers

## ----

! pip install booknlp_fr -U
import booknlp_fr


## Step 2: Loading a French Novel

In this step, you will import a novel from Project Gutenberg (or any other source you like). It will serve as the example for the rest of the notebook.

You can select a book from this page: [Project Gutenberg French Books List](https://www.gutenberg.org/browse/languages/fr)

In [1]:
import requests # library used to scrape content from html pages

def load_gutenberg_project_novel_as_string(gutenberg_url):

    gutenberg_book_id = gutenberg_url.split('/')[-1]
    plain_text_url = f"https://www.gutenberg.org/cache/epub/{gutenberg_book_id}/pg{gutenberg_book_id}.txt"
    # Fetch the content of the page
    response = requests.get(plain_text_url, verify=False)
    # Check if the request was successful
    if response.status_code == 200:
        # Get the content as a string
        text_content = response.text
        print("Book content loaded successfully!")
        return text_content
    else:
        print(f"Failed to fetch content. Status code: {response.status_code}")

gutenberg_url = "https://www.gutenberg.org/ebooks/34204" # 34204 corresponds to the index for the novel La Petite Fadette by George Sand
text_content = load_gutenberg_project_novel_as_string(gutenberg_url)

# Display the first 10,000 characters of the novel
print(text_content[:10000])



Book content loaded successfully!
﻿The Project Gutenberg eBook of La petite Fadette
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: La petite Fadette

Author: George Sand

Release date: November 3, 2010 [eBook #34204]

Language: French

Credits: Produced by Claudine Corbasson and the Online Distributed
        Proofreading Team at http://www.pgdp.net (This file was
        produced from images generously made available by the
        Bibliothèque nationale de France (BnF/Gallica) at
        http://gallica.bnf.fr)


*** START OF THE PROJECT GUTENBERG EBOOK LA PE
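
The loader above derives the plain-text URL from the last path segment of the ebook page URL. The following is a minimal sketch of that single step, written so it can be checked without any network call. The helper name `gutenberg_plain_text_url` is hypothetical, and stripping a trailing slash is an added assumption about how URLs might be pasted.

```python
def gutenberg_plain_text_url(gutenberg_url):
    """Build the cached plain-text URL from a Project Gutenberg ebook page URL."""
    # Drop a trailing slash so the last path segment is the numeric book ID
    book_id = gutenberg_url.rstrip("/").split("/")[-1]
    if not book_id.isdigit():
        raise ValueError(f"Could not find a numeric book ID in: {gutenberg_url}")
    return f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"


print(gutenberg_plain_text_url("https://www.gutenberg.org/ebooks/34204"))
```

Raising an error early for an unexpected URL is one possible design choice; it avoids requesting a malformed address and getting a less helpful HTTP error later.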

## Step 3: Preprocessing the Text

The text from Project Gutenberg contains material that is not part of the novel: a license header and footer added by Project Gutenberg, and a preface. We'll strip this content and reformat the text for easier analysis.

We first remove the Project Gutenberg header and footer, which are delimited by the START and END marker lines.

In [None]:
import re
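# Regular expression to capture the content between the START and END markers
# [^\n]*? keeps the title part on the marker line, so a stray " ***" inside the book cannot shift the match
pattern = r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK [^\n]*? \*\*\*(.*?)\*\*\* END OF THE PROJECT GUTENBERG EBOOK [^\n]*? \*\*\*"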

# Use re.DOTALL to make `.` match newline characters
match = re.search(pattern, text_content, re.DOTALL)

if match:
    text_content = match.group(1).strip()
else:
    print("Markers not found in the text.")

print(text_content[:8000])

Produced by Claudine Corbasson and the Online Distributed
Proofreading Team at http://www.pgdp.net (This file was
produced from images generously made available by the
Bibliothèque nationale de France (BnF/Gallica) at
http://gallica.bnf.fr)









  Au lecteur

  Cette version électronique reproduit dans son intégralité
  la version originale.

  La ponctuation n'a pas été modifiée hormis quelques corrections
  mineures.

  L'orthographe a été conservée. Seuls quelques mots ont été modifiés.
  La liste des modifications se trouve à la fin du texte.




  LA

  PETITE FADETTE

  PAR

  GEORGE SAND


  NOUVELLE ÉDITION

  PARIS

  MICHEL LÉVY FRÈRES, ÉDITEURS
  RUE VIVIENNE, 2 BIS, ET BOULEVARD DES ITALIENS, 15
  A LA LIBRAIRIE NOUVELLE

  1869
  Droits de reproduction et de traduction réservés




  OEUVRES

  DE

  GEORGE SAND

  MICHEL LÉVY FRÈRES, ÉDITEURS


  OEUVRES COMPLÈTES
  DE
  GEORGE SAND

  FORMAT GRAND IN-18
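
To see what the marker-based extraction does without downloading a book, here is a small sketch on a made-up string. The function name `strip_gutenberg_markers` and the sample text are illustrative assumptions; the pattern is a variant of the one in the cell above that restricts the title part to the marker line with `[^\n]*`.

```python
import re


def strip_gutenberg_markers(text):
    """Return the text between the START and END marker lines, or None if absent."""
    pattern = (
        r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK [^\n]*\*\*\*"
        r"(.*?)"
        r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK [^\n]*\*\*\*"
    )
    # re.DOTALL lets (.*?) span several lines; [^\n]* never crosses a line break
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None


# Made-up sample that mimics the layout of a Project Gutenberg file
sample = (
    "Title: Example\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\r\n"
    "Chapitre I\r\n\r\nIl était une fois.\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\r\n"
    "License text"
)
print(repr(strip_gutenberg_markers(sample)))
```

Returning `None` when the markers are missing makes it easy for the caller to decide whether to keep the original text or stop.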

To go further, we can now remove the preface and the rest of the front matter to keep only the body of the novel. To do this, we locate the first sentence where the story actually begins.

In [None]:
first_line_content = "Le père Barbeau de la Cosse n'était pas mal dans ses affaires, à"

# Find the position of the first occurrence of the line
start_index = text_content.find(first_line_content)

if start_index != -1:
    # Extract text starting from the first line
    text_content = text_content[start_index:]
else:
    print("First line not found in the text.")

print(text_content[:800])

Le père Barbeau de la Cosse n'était pas mal dans ses affaires, à
preuve qu'il était du conseil municipal de sa commune. Il avait deux
champs qui lui donnaient la nourriture de sa famille, et du profit
par-dessus le marché. Il cueillait dans ses prés du foin à pleins
charrois, et, sauf celui qui était au bord du ruisseau, et qui était
un peu ennuyé par le jonc, c'était du fourrage connu dans l'endroit
pour être de première qualité.

La maison du père Barbeau était bien bâtie, couverte en tuile, établie
en bon air sur la côte, avec un jardin de bon rapport et une vigne de
six journaux. Enfin il avait, derrière sa grange, un beau verger, que
nous appelons chez nous une ouche, où le fruit abondait tant en prunes
qu'en guignes, en poires et en cormes. Mêmement les noyers de ses
bor


Here we can see that the text is hard-wrapped with line breaks, as in a printed book layout. To facilitate further analysis, we remove those extra line breaks while keeping the actual paragraph boundaries.

In [None]:
# Reformat the text
formatted_text = text_content.replace("\r\n\r\n", "[PARAGRAPH_BREAK]").replace("\r\n", " ").replace("[PARAGRAPH_BREAK]", "\n\n")
print(formatted_text[:800])

Le père Barbeau de la Cosse n'était pas mal dans ses affaires, à preuve qu'il était du conseil municipal de sa commune. Il avait deux champs qui lui donnaient la nourriture de sa famille, et du profit par-dessus le marché. Il cueillait dans ses prés du foin à pleins charrois, et, sauf celui qui était au bord du ruisseau, et qui était un peu ennuyé par le jonc, c'était du fourrage connu dans l'endroit pour être de première qualité.

La maison du père Barbeau était bien bâtie, couverte en tuile, établie en bon air sur la côte, avec un jardin de bon rapport et une vigne de six journaux. Enfin il avait, derrière sa grange, un beau verger, que nous appelons chez nous une ouche, où le fruit abondait tant en prunes qu'en guignes, en poires et en cormes. Mêmement les noyers de ses bordures étaient
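
The `replace` chain above relies on the file using `\r\n` line endings. As a hedged alternative, the sketch below normalizes line endings first, so it also works on text that uses plain `\n`. The function name `unwrap_paragraphs` is hypothetical, and unlike the original cell it also collapses repeated spaces inside a paragraph.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines while keeping blank-line paragraph boundaries."""
    # Normalize Windows and old Mac line endings to "\n"
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph boundary is one or more blank lines
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse every run of whitespace into a single space
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


wrapped = "Le père\r\nBarbeau.\r\n\r\nLa maison\r\nétait."
print(unwrap_paragraphs(wrapped))
```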


**For practical reasons, if no GPU is available, we will only work on the first 5,000 characters in order to reduce the processing time.**

In [None]:
import torch

# Maximum number of characters to process when running on CPU
max_cpu_characters = 5000

# Check for GPU availability
if not torch.cuda.is_available():
    # Only keep the beginning of the text if no GPU is available
    formatted_text = formatted_text[:max_cpu_characters]
    print(f"Using only the first {max_cpu_characters:,} characters due to lack of GPU.")
else:
    print("GPU is available, using all characters.")

Using only the first 5,000 characters due to lack of GPU.
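
Slicing at a fixed character position can cut the text in the middle of a word or sentence. A possible refinement, sketched below under the assumption that paragraphs are separated by `\n\n` as in `formatted_text`, is to cut at the last paragraph break before the limit. The helper name `truncate_at_paragraph` is hypothetical.

```python
def truncate_at_paragraph(text, max_chars):
    """Cut text to at most max_chars, ending on a paragraph break when one exists."""
    if len(text) <= max_chars:
        return text
    # Last paragraph break fully contained in the first max_chars characters
    cut = text.rfind("\n\n", 0, max_chars)
    # Fall back to a hard cut when no usable break is found before the limit
    return text[:cut] if cut > 0 else text[:max_chars]


sample = "Premier paragraphe.\n\nSecond paragraphe.\n\nTroisième."
print(repr(truncate_at_paragraph(sample, 30)))
```

The trade-off is that the result can be noticeably shorter than the limit when the last paragraph before it is long.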


## Step 4: Tokenizing the Text

Now, let's tokenize the text using the spaCy library. We will use the fr_dep_news_trf model, a transformer-based model trained for French text. This will allow us to tokenize the text, segment sentences, and extract POS (part-of-speech) tags and syntactic dependencies.


In [None]:
from booknlp_fr import load_spacy_model

spacy_model = load_spacy_model(model_name='fr_dep_news_trf', model_max_length=500000)

Loaded Spacy Model: fr_dep_news_trf
CUDA is not available, model will run on CPU.


In [None]:
from booknlp_fr import generate_tokens_df

text_content = formatted_text

# Initialize spaCy model
if torch.cuda.is_available():
    spacy.prefer_gpu()
    spacy_model = spacy.load('fr_dep_news_trf')

tokens_df = generate_tokens_df(text_content,
                               spacy_model,
                               max_char_sentence_length=5000 # The text is processed in batches of at most 5,000 characters
                               )

print(f"Generated Tokens Count: {len(tokens_df)}")
# Display tokenized data
tokens_df.head(15)

Batch Spacy Tokenization:   0%|          | 0/2 [00:00<?, ?it/s]

Generated Tokens Count: 1134


Unnamed: 0,paragraph_ID,sentence_ID,token_ID_within_sentence,token_ID_within_document,word,lemma,byte_onset,byte_offset,POS_tag,dependency_relation,syntactic_head_ID
0,0,0,0,0,Le,le,0,2,DET,det,1
1,0,0,1,1,père,père,3,7,NOUN,nsubj,9
2,0,0,2,2,Barbeau,Barbeau,8,15,PROPN,flat:name,1
3,0,0,3,3,de,de,16,18,ADP,case,5
4,0,0,4,4,la,le,19,21,DET,det,5
5,0,0,5,5,Cosse,Cosse,22,27,PROPN,nmod,1
6,0,0,6,6,n',ne,28,30,ADV,advmod,9
7,0,0,7,7,était,être,30,35,AUX,cop,9
8,0,0,8,8,pas,pas,36,39,ADV,advmod,9
9,0,0,9,9,mal,mal,40,43,ADV,ROOT,9
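
Despite the `byte_` prefix in the column names, the `byte_onset` and `byte_offset` values in the rows shown above appear to behave as character positions in the Python string: `père` spans 3 to 7 even though `è` takes two bytes in UTF-8. The check below uses only values copied from that output, so it is an observation about this sample rather than a statement about the library in general.

```python
text = "Le père Barbeau de la Cosse n'était pas mal"

# (word, byte_onset, byte_offset) copied from the first rows of the table above
rows = [
    ("Le", 0, 2), ("père", 3, 7), ("Barbeau", 8, 15), ("de", 16, 18),
    ("la", 19, 21), ("Cosse", 22, 27), ("n'", 28, 30), ("était", 30, 35),
    ("pas", 36, 39), ("mal", 40, 43),
]

for word, onset, offset in rows:
    # Slicing the Python string with the two offsets should give back the token
    assert text[onset:offset] == word, (word, text[onset:offset])

print("All offsets match the token text.")
```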


## Step 5: Extracting Mentions (Named Entity Recognition)

In this step, we will extract mentions of different types (characters, locations, facilities, geopolitical entities, time expressions, vehicles) with a pretrained named entity recognition (NER) model.

You can read the description of the model we will use here: [Mentions Detection Model](https://huggingface.co/AntoineBourgois/BookNLP-fr_NER_camembert-large)

In [None]:
from booknlp_fr import load_mentions_detection_model, generate_entities_df
from booknlp_fr import load_tokenizer_and_embedding_model, get_embedding_tensor_from_tokens_df

In [None]:
# Load the pretrained NER model
mentions_detection_model = load_mentions_detection_model(model_path="AntoineBourgois/BookNLP-fr_NER_camembert-large",
                                                         force_download=True)
NER_base_model = mentions_detection_model["base_model_name"]
print(f"NER Foundation Model: {NER_base_model}")

# Generate embeddings for the tokens
tokenizer, embedding_model = load_tokenizer_and_embedding_model(NER_base_model)
tokens_embedding_tensor = get_embedding_tensor_from_tokens_df(tokens_df, tokenizer, embedding_model, mini_batch_size=10)

print(f"\ntokens_embedding_tensor contains 1 vector of {tokens_embedding_tensor.shape[1]} dimensions for each {tokens_embedding_tensor.shape[0]} tokens")

Downloading model from HuggingFace: https://huggingface.co/AntoineBourgois/BookNLP-fr_NER_camembert-large
Model Downloaded Successfully
Saving model locally to: /content/AntoineBourgois/BookNLP-fr_NER_camembert-large
NER Foundation Model: almanach/camembert-large



Some weights of CamembertModel were not initialized from the model checkpoint at almanach/camembert-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenizer and Embedding Model Initialized: almanach/camembert-large


Embedding Tokens:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating average embeddings:   0%|          | 0/1424 [00:00<?, ?it/s]

Averaging subwords embeddings:   0%|          | 0/1134 [00:00<?, ?it/s]


tokens_embedding_tensor contains 1 vector of 1024 dimensions for each 1134 tokens


In [None]:
# Extract entities
entities_df = generate_entities_df(tokens_df, tokens_embedding_tensor, mentions_detection_model, batch_size=32)
print(f"Columns: {entities_df.columns}")

from booknlp_fr import add_features_to_entities

# Add features to each mention (gender, number, nesting level, etc.)
entities_df = add_features_to_entities(entities_df, tokens_df)
print(f"Columns: {entities_df.columns}")

entities_df.head()

Columns: Index(['start_token', 'end_token', 'cat', 'confidence', 'text'], dtype='object')


Extracting Mention Head Infos:   0%|          | 0/54 [00:00<?, ?it/s]

Columns: Index(['start_token', 'end_token', 'cat', 'confidence', 'text', 'mention_len',
       'paragraph_ID', 'sentence_ID', 'start_token_ID_within_sentence',
       'out_to_in_nested_level', 'in_to_out_nested_level',
       'nested_entities_count', 'head_id', 'head_word',
       'head_dependency_relation', 'head_syntactic_head_ID', 'POS_tag', 'prop',
       'number', 'gender', 'grammatical_person'],
      dtype='object')


Unnamed: 0,start_token,end_token,cat,confidence,text,mention_len,paragraph_ID,sentence_ID,start_token_ID_within_sentence,out_to_in_nested_level,...,nested_entities_count,head_id,head_word,head_dependency_relation,head_syntactic_head_ID,POS_tag,prop,number,gender,grammatical_person
0,4,5,PER,0.431295,la cosse,2,0,0,4,0,...,0,5,cosse,nmod,1,PROPN,PROP,Singular,Not_Assigned,3
1,11,11,PER,0.99507,ses,1,0,0,11,0,...,0,11,ses,det,12,DET,PRON,Singular,Ambiguous,3
2,17,17,PER,0.998877,il,1,0,0,17,0,...,0,17,il,nsubj,20,PRON,PRON,Singular,Male,3
3,23,23,PER,0.971053,sa,1,0,0,23,0,...,0,23,sa,det,24,DET,PRON,Singular,Ambiguous,3
4,26,26,PER,0.996804,il,1,0,1,0,0,...,0,26,il,nsubj,27,PRON,PRON,Singular,Male,3


In [None]:
entities_df[:50]

Unnamed: 0,start_token,end_token,cat,confidence,text,mention_len,paragraph_ID,sentence_ID,start_token_ID_within_sentence,out_to_in_nested_level,...,nested_entities_count,head_id,head_word,head_dependency_relation,head_syntactic_head_ID,POS_tag,prop,number,gender,grammatical_person
0,4,5,PER,0.431295,la cosse,2,0,0,4,0,...,0,5,cosse,nmod,1,PROPN,PROP,Singular,Not_Assigned,3
1,11,11,PER,0.99507,ses,1,0,0,11,0,...,0,11,ses,det,12,DET,PRON,Singular,Ambiguous,3
2,17,17,PER,0.998877,il,1,0,0,17,0,...,0,17,il,nsubj,20,PRON,PRON,Singular,Male,3
3,23,23,PER,0.971053,sa,1,0,0,23,0,...,0,23,sa,det,24,DET,PRON,Singular,Ambiguous,3
4,26,26,PER,0.996804,il,1,0,1,0,0,...,0,26,il,nsubj,27,PRON,PRON,Singular,Male,3
5,31,31,PER,0.991698,lui,1,0,1,5,0,...,0,31,lui,iobj,32,PRON,PRON,Singular,Ambiguous,3
6,36,36,PER,0.969236,sa,1,0,1,10,1,...,0,36,sa,det,37,DET,PRON,Singular,Ambiguous,3
7,36,37,PER,0.903803,sa famille,2,0,1,10,0,...,1,37,famille,nmod,34,NOUN,NOM,Plural,Ambiguous,3
8,48,48,PER,0.997841,il,1,0,2,0,0,...,0,48,il,nsubj,49,PRON,PRON,Singular,Male,3
9,51,51,PER,0.966605,ses,1,0,2,3,0,...,0,51,ses,det,52,DET,PRON,Singular,Ambiguous,3


In [None]:
from booknlp_fr import add_features_to_entities
entities_df = add_features_to_entities(entities_df, tokens_df)
print(f"Columns: {entities_df.columns}")
entities_df

Extracting Mention Head Infos:   0%|          | 0/54 [00:00<?, ?it/s]

Columns: Index(['start_token', 'end_token', 'cat', 'confidence', 'text', 'mention_len',
       'paragraph_ID', 'sentence_ID', 'start_token_ID_within_sentence',
       'out_to_in_nested_level', 'in_to_out_nested_level',
       'nested_entities_count', 'head_id', 'head_word',
       'head_dependency_relation', 'head_syntactic_head_ID', 'POS_tag', 'prop',
       'number', 'gender', 'grammatical_person'],
      dtype='object')


Unnamed: 0,start_token,end_token,cat,confidence,text,mention_len,paragraph_ID,sentence_ID,start_token_ID_within_sentence,out_to_in_nested_level,...,nested_entities_count,head_id,head_word,head_dependency_relation,head_syntactic_head_ID,POS_tag,prop,number,gender,grammatical_person
0,4,5,PER,0.431295,la cosse,2,0,0,4,0,...,0,5,cosse,nmod,1,PROPN,PROP,Singular,Not_Assigned,3
1,11,11,PER,0.995070,ses,1,0,0,11,0,...,0,11,ses,det,12,DET,PRON,Singular,Ambiguous,3
2,17,17,PER,0.998877,il,1,0,0,17,0,...,0,17,il,nsubj,20,PRON,PRON,Singular,Male,3
3,23,23,PER,0.971053,sa,1,0,0,23,0,...,0,23,sa,det,24,DET,PRON,Singular,Ambiguous,3
4,26,26,PER,0.996804,il,1,0,1,0,0,...,0,26,il,nsubj,27,PRON,PRON,Singular,Male,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
160,1111,1111,PER,0.787638,les,1,9,34,33,0,...,0,1111,les,obj,1112,PRON,PRON,Plural,Ambiguous,3
161,1113,1115,TIME,0.503698,tous les jours,3,9,34,35,0,...,0,1115,jours,obj,1112,NOUN,NOM,Not_Assigned,Not_Assigned,3
162,1120,1120,PER,0.941849,eux,1,9,34,42,0,...,0,1120,eux,obl:mod,1118,PRON,PRON,Plural,Ambiguous,3
163,1125,1125,PER,0.998174,je,1,9,34,47,0,...,0,1125,je,nsubj,1129,PRON,PRON,Singular,Ambiguous,1
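
The `confidence` column can be used to set aside doubtful mentions, such as `la cosse` (0.43) in the output above. The sketch below works on a few rows copied by hand from that output, in the shape that `entities_df.to_dict("records")` would produce with only three columns kept. The threshold `min_confidence` is a hypothetical value chosen for illustration, not a library default.

```python
from collections import Counter

# A few rows copied from the output above (text, category, confidence only)
mentions = [
    {"text": "la cosse", "cat": "PER", "confidence": 0.431295},
    {"text": "ses", "cat": "PER", "confidence": 0.995070},
    {"text": "il", "cat": "PER", "confidence": 0.998877},
    {"text": "tous les jours", "cat": "TIME", "confidence": 0.503698},
    {"text": "eux", "cat": "PER", "confidence": 0.941849},
]

min_confidence = 0.6  # hypothetical threshold, to be tuned on your own data

confident = [m for m in mentions if m["confidence"] >= min_confidence]
uncertain = [m["text"] for m in mentions if m["confidence"] < min_confidence]

# Number of retained mentions per category, then the mentions set aside
print(Counter(m["cat"] for m in confident))
print(uncertain)
```

On the full DataFrame, the same filter can be written with boolean indexing: `entities_df[entities_df["confidence"] >= min_confidence]`.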


## Step 6: Coreference Resolution

Now, let’s resolve coreferences. This step helps link mentions of the same entity, such as when different parts of the text refer to the same character.

In [None]:
from booknlp_fr import load_coreference_resolution_model, perform_coreference, CoreferenceResolutionModel

In [None]:
# Load the coreference resolution model
coreference_resolution_model = load_coreference_resolution_model(
    model_path="AntoineBourgois/BookNLP-fr_coreference-resolution_camembert-large_FAC_GPE_LOC_PER_TIME_VEH",
    force_download=True)

coreference_resolution_base_model = coreference_resolution_model["base_model_name"]

print(f"Coreference Resolution Foundation Model: {coreference_resolution_base_model}")

if coreference_resolution_base_model != NER_base_model:
    tokenizer, embedding_model = load_tokenizer_and_embedding_model(coreference_resolution_base_model)
    tokens_embedding_tensor = get_embedding_tensor_from_tokens_df(tokens_df, tokenizer, embedding_model, mini_batch_size=20)

# Perform coreference resolution
entities_df = perform_coreference(entities_df=entities_df,
                                  tokens_embedding_tensor=tokens_embedding_tensor,
                                  coreference_resolution_model=coreference_resolution_model,
                                  batch_size=10)

# Display entities after coreference resolution
entities_df

Downloading model from HuggingFace: https://huggingface.co/AntoineBourgois/BookNLP-fr_coreference-resolution_camembert-large_FAC_GPE_LOC_PER_TIME_VEH
Model Downloaded Successfully
Saving model locally to: /content/AntoineBourgois/BookNLP-fr_coreference-resolution_camembert-large_FAC_GPE_LOC_PER_TIME_VEH
Coreference Resolution Foundation Model: almanach/camembert-large



Predicting Coreference Pairs:   0%|          | 0/658 [00:00<?, ?it/s]

  0%|          | 0/165 [00:00<?, ?it/s]

Unnamed: 0,start_token,end_token,cat,confidence,text,mention_len,paragraph_ID,sentence_ID,start_token_ID_within_sentence,out_to_in_nested_level,...,head_id,head_word,head_dependency_relation,head_syntactic_head_ID,POS_tag,prop,number,gender,grammatical_person,COREF
0,4,5,PER,0.431295,la cosse,2,0,0,4,0,...,5,cosse,nmod,1,PROPN,PROP,Singular,Not_Assigned,3,0
1,11,11,PER,0.995070,ses,1,0,0,11,0,...,11,ses,det,12,DET,PRON,Singular,Ambiguous,3,0
2,17,17,PER,0.998877,il,1,0,0,17,0,...,17,il,nsubj,20,PRON,PRON,Singular,Male,3,0
3,23,23,PER,0.971053,sa,1,0,0,23,0,...,23,sa,det,24,DET,PRON,Singular,Ambiguous,3,0
4,26,26,PER,0.996804,il,1,0,1,0,0,...,26,il,nsubj,27,PRON,PRON,Singular,Male,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
160,1111,1111,PER,0.787638,les,1,9,34,33,0,...,1111,les,obj,1112,PRON,PRON,Plural,Ambiguous,3,3
161,1113,1115,TIME,0.503698,tous les jours,3,9,34,35,0,...,1115,jours,obj,1112,NOUN,NOM,Not_Assigned,Not_Assigned,3,34
162,1120,1120,PER,0.941849,eux,1,9,34,42,0,...,1120,eux,obl:mod,1118,PRON,PRON,Plural,Ambiguous,3,3
163,1125,1125,PER,0.998174,je,1,9,34,47,0,...,1125,je,nsubj,1129,PRON,PRON,Singular,Ambiguous,1,2


In [None]:
entities_df[:50]

Unnamed: 0,start_token,end_token,cat,confidence,text,mention_len,paragraph_ID,sentence_ID,start_token_ID_within_sentence,out_to_in_nested_level,...,head_id,head_word,head_dependency_relation,head_syntactic_head_ID,POS_tag,prop,number,gender,grammatical_person,COREF
0,4,5,PER,0.431295,la cosse,2,0,0,4,0,...,5,cosse,nmod,1,PROPN,PROP,Singular,Not_Assigned,3,0
1,11,11,PER,0.99507,ses,1,0,0,11,0,...,11,ses,det,12,DET,PRON,Singular,Ambiguous,3,0
2,17,17,PER,0.998877,il,1,0,0,17,0,...,17,il,nsubj,20,PRON,PRON,Singular,Male,3,0
3,23,23,PER,0.971053,sa,1,0,0,23,0,...,23,sa,det,24,DET,PRON,Singular,Ambiguous,3,0
4,26,26,PER,0.996804,il,1,0,1,0,0,...,26,il,nsubj,27,PRON,PRON,Singular,Male,3,0
5,31,31,PER,0.991698,lui,1,0,1,5,0,...,31,lui,iobj,32,PRON,PRON,Singular,Ambiguous,3,0
6,36,36,PER,0.969236,sa,1,0,1,10,1,...,36,sa,det,37,DET,PRON,Singular,Ambiguous,3,0
7,36,37,PER,0.903803,sa famille,2,0,1,10,0,...,37,famille,nmod,34,NOUN,NOM,Plural,Ambiguous,3,10
8,48,48,PER,0.997841,il,1,0,2,0,0,...,48,il,nsubj,49,PRON,PRON,Singular,Male,3,0
9,51,51,PER,0.966605,ses,1,0,2,3,0,...,51,ses,det,52,DET,PRON,Singular,Ambiguous,3,0
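
Mentions that share the same `COREF` value belong to the same coreference chain. As a small illustration, the sketch below groups the first ten mentions of the output above into chains; the `(text, COREF)` pairs are copied by hand from that output.

```python
from collections import defaultdict

# (text, COREF) pairs copied from the first ten rows of the output above
mentions = [
    ("la cosse", 0), ("ses", 0), ("il", 0), ("sa", 0), ("il", 0),
    ("lui", 0), ("sa", 0), ("sa famille", 10), ("il", 0), ("ses", 0),
]

# Map each coreference ID to the list of mention texts in that chain
chains = defaultdict(list)
for text, coref_id in mentions:
    chains[coref_id].append(text)

for coref_id, texts in sorted(chains.items()):
    print(coref_id, len(texts), texts)
```

On the full DataFrame, an equivalent grouping is `entities_df.groupby("COREF")["text"].apply(list)`.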


## Step 7: Extracting Character Attributes

In this final step, we will extract attributes for characters, such as verbs in which the character is the agent or patient, modifiers, and possessives.

In [None]:
from booknlp_fr import extract_attributes
tokens_df = extract_attributes(entities_df, tokens_df)

from booknlp_fr import generate_characters_dict
characters_dict = generate_characters_dict(tokens_df, entities_df)

In [None]:
print(characters_dict['characters'][0].keys())

dict_keys(['id', 'count', 'gender', 'number', 'mentions', 'agent', 'patient', 'mod', 'poss'])


In [None]:
characters_dict['characters'][0]

{'id': 0,
 'count': {'occurrence': 42, 'mention_ratio': 0.2763},
 'gender': {'ratio': 0.4286,
  'inference': {'Male': 1.0, 'Female': 0.0},
  'max': 1.0,
  'argmax': 'Male'},
 'number': {'ratio': 0.9762,
  'inference': {'Singular': 1.0, 'Plural': 0.0},
  'max': 1.0,
  'argmax': 'Singular'},
 'mentions': {'proper': [{'n': 'la cosse', 'c': 1}],
  'common': [{'n': 'le père barbeau', 'c': 3},
   {'n': 'du père barbeau', 'c': 1},
   {'n': 'notre maître', 'c': 1},
   {'n': 'le père', 'c': 1}],
  'pronoun': [{'n': 'il', 'c': 12},
   {'n': 'ses', 'c': 6},
   {'n': 'sa', 'c': 5},
   {'n': 'lui', 'c': 2},
   {'n': 'me', 'c': 2},
   {'n': 'ma', 'c': 2},
   {'n': 'je', 'c': 2},
   {'n': "m'", 'c': 1},
   {'n': "j'", 'c': 1},
   {'n': 'mon', 'c': 1},
   {'n': 'vous', 'c': 1}]},
 'agent': [{'w': 'avoir', 'i': 27},
  {'w': 'cueillir', 'i': 49},
  {'w': 'avoir', 'i': 130},
  {'w': 'avoir', 'i': 216},
  {'w': 'revenir', 'i': 479},
  {'w': 'faire', 'i': 534},
  {'w': 'étonner', 'i': 544},
  {'w': 'aller'
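
The attribute lists lend themselves to simple counting. The sketch below tallies the verbs for which character 0 is the agent, using only the first entries visible in the output above. Reading `w` as the verb lemma and `i` as a token index in the document is an interpretation of that output, not something stated by the library.

```python
from collections import Counter

# First entries of the 'agent' list shown above
agent = [
    {'w': 'avoir', 'i': 27}, {'w': 'cueillir', 'i': 49}, {'w': 'avoir', 'i': 130},
    {'w': 'avoir', 'i': 216}, {'w': 'revenir', 'i': 479}, {'w': 'faire', 'i': 534},
    {'w': 'étonner', 'i': 544},
]

# Count how often each verb appears with this character as agent
verb_counts = Counter(entry['w'] for entry in agent)
print(verb_counts.most_common(3))
```

The same pattern applies to the `patient`, `mod` and `poss` lists, assuming they share the `w`/`i` structure.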

## Displaying Character Information


In [None]:
# Show at most the first 10 characters (slicing avoids an IndexError when fewer were found)
for character_id, character in enumerate(characters_dict['characters'][:10]):
  print(f"Character ID: {character_id}")
  print(f"Mention count: {character['count']}")
  print(f"Character Gender: {character['gender']['argmax']}")
  if character['mentions']['proper']:
    print(f"Main proper mention: {character['mentions']['proper'][0:5]}")
  if character['mentions']['common']:
    print(f"Main common noun mention: {character['mentions']['common'][0:5]}\n")

Character ID: 0
Mention count: {'occurrence': 42, 'mention_ratio': 0.2763}
Character Gender: Male
Main proper mention: [{'n': 'la cosse', 'c': 1}]
Main common noun mention: [{'n': 'le père barbeau', 'c': 3}, {'n': 'du père barbeau', 'c': 1}, {'n': 'notre maître', 'c': 1}, {'n': 'le père', 'c': 1}]

Character ID: 1
Mention count: {'occurrence': 20, 'mention_ratio': 0.1316}
Character Gender: Female
Main common noun mention: [{'n': 'la mère barbeau', 'c': 2}, {'n': 'sa femme', 'c': 1}, {'n': 'ma femme', 'c': 1}, {'n': 'ma bonne femme', 'c': 1}, {'n': 'la femme', 'c': 1}]

Character ID: 2
Mention count: {'occurrence': 17, 'mention_ratio': 0.1118}
Character Gender: Female
Main common noun mention: [{'n': 'la mère sagette', 'c': 3}, {'n': 'mère barbeau', 'c': 1}]

Character ID: 3
Mention count: {'occurrence': 15, 'mention_ratio': 0.0987}
Character Gender: Null
Main common noun mention: [{'n': 'deux enfants de plus', 'c': 1}, {'n': 'ces deux enfants -là', 'c': 1}, {'n': 'deux bessons si p', '

In [None]:
from booknlp_fr import generate_sacr_file

In [None]:
help(generate_sacr_file)

Help on function generate_sacr_file in module booknlp_fr.booknlp_fr_generate_sacr_file:

generate_sacr_file(file_name, tokens_df, entities_df, end_directory, entity_type_column='cat', coref_name_column='COREF', sacr_extension='.generated_sacr')
    Generates a SACR file with inline annotations for mention detection and coreference chains.
    
    Args:
        file_name (str): Base name for the output file (without extension).
        tokens_df (DataFrame): DataFrame with tokenized text, including byte offsets.
        entities_df (DataFrame): DataFrame with entity metadata.
        end_directory (str): Directory to save the generated SACR file.
        entity_type_column (str): Column name for entity type (default: "cat").
        coref_name_column (str): Column name for coreference ID (default: "COREF").
        sacr_extension (str): File extension for the SACR file (default: ".generated_sacr").
    
    Returns:
        None: Saves the annotated SACR file to the specified directory

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
generate_sacr_file("fadette.sacr", tokens_df, entities_df, "/content/drive/My Drive/")

Generating Sacr Annotations:   0%|          | 0/165 [00:00<?, ?it/s]

File saved at:
/content/drive/My Drive/fadette.sacr.generated_sacr


'Le père Barbeau de {0:EN="PER" la Cosse} n\'était pas mal dans {0:EN="PER" ses} affaires, à preuve qu\'{0:EN="PER" il} était du conseil municipal de {0:EN="PER" sa} commune. {0:EN="PER" Il} avait deux champs qui {0:EN="PER" lui} donnaient la nourriture de {10:EN="PER" {0:EN="PER" sa} famille}, et du profit par-dessus le marché. {0:EN="PER" Il} cueillait dans {0:EN="PER" ses} prés du foin à pleins charrois, et, sauf celui qui était {17:EN="LOC" au bord {18:EN="LOC" du ruisseau}}, et qui était un peu ennuyé par le jonc, c\'était du fourrage connu dans l\'endroit pour être de première qualité.\n\n La maison {0:EN="PER" du père Barbeau} était bien bâtie, couverte en tuile, établie en bon air sur {19:EN="LOC" la côte}, avec {20:EN="FAC" un jardin de bon rapport} et {21:EN="FAC" une vigne de six journaux}. Enfin {0:EN="PER" il} avait, derrière {22:EN="FAC" {0:EN="PER" sa} grange}, {23:EN="FAC" un beau verger}, que {11:EN="PER" nous} appelons chez {11:EN="PER" nous} une ouche, où le fruit ab
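
In the generated string, each mention is wrapped as `{chain_id:EN="TYPE" text}`, and mentions can be nested. The sketch below reads those annotations back with a regular expression, on an excerpt copied from the output above. It assumes that the text itself contains no literal braces, which holds for this excerpt but is not guaranteed in general.

```python
import re
from collections import Counter

# Excerpt copied from the generated annotation string above
sacr_excerpt = (
    '{0:EN="PER" Il} avait deux champs qui {0:EN="PER" lui} donnaient la nourriture de '
    '{10:EN="PER" {0:EN="PER" sa} famille}, et du profit par-dessus le marché.'
)

# Opening part of an annotation: chain ID, then the entity type
opening = re.compile(r'\{(\d+):EN="([A-Z]+)" ')

annotations = [(int(chain), label) for chain, label in opening.findall(sacr_excerpt)]
print(annotations)
print(Counter(chain for chain, _ in annotations))

# Recover the plain text by dropping the opening parts and the closing braces
plain_text = opening.sub("", sacr_excerpt).replace("}", "")
print(plain_text)
```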