### 1. Library Import 

In [2]:
import pandas as pd
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from transformers import AutoTokenizer
import numpy as np

nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /Users/angele/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/angele/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

### 2. Data Import 

In [3]:
df = pd.read_csv("CAIRN_final_cleaned.csv")
df2 = pd.read_csv("LeMonde_auteur_filtré50.csv")

### 3. Looking for natural paragraph indicators

In [4]:
# List to store results
paragraphes_detectes = []

# Iterate through each text
for i, texte in enumerate(df["texte_article"]):
    if pd.isna(texte):
        continue

    # Detect line breaks (single or double)
    if "\n" in texte:
        paragraphes_detectes.append((i, texte[:500]))  # Store just an excerpt

# Display a sample
print(f"✅ {len(paragraphes_detectes)} texts potentially contain paragraphs.\n")

for i, extrait in paragraphes_detectes[:5]:
    print(f"--- Text index {i} ---\n{extrait}\n")

✅ 0 texts potentially contain paragraphs.



In [5]:
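# Count different forms of "line breaks"
count_n = 0          # real newline character
count_escaped_n = 0  # double-escaped form: two backslashes followed by 'n'
count_literal = 0    # literal backslash followed by 'n' (two characters)

for texte in df["texte_article"].dropna():
    if "\n" in texte:
        count_n += 1
    # "\\n" and r'\n' are the same two-character string, so the
    # double-escaped form has to be written with four backslashes
    if "\\\\n" in texte:
        count_escaped_n += 1
    if r'\n' in texte:
        count_literal += 1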

print(f"Texts containing REAL \\n      : {count_n}")
print(f"Texts containing escaped \\\\n : {count_escaped_n}")
print(f"Texts containing LITERAL '\\n' : {count_literal}")


Texts containing REAL \n      : 0
Texts containing escaped \\n : 0
Texts containing LITERAL '\n' : 0


There are no natural paragraph indicators: none of the texts contains a line break, whether real or escaped.
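
As a reminder of what each of these checks can and cannot catch, here is a small sketch of how Python parses the string literals involved. The two sample texts are made up for illustration and do not come from the corpus.

```python
# How Python reads the string literals used when searching for line breaks.
real_newline = "\n"   # one character: an actual line feed
escaped = "\\n"       # two characters: a backslash followed by 'n'
raw = r"\n"           # the same two characters, written as a raw string

print(len(real_newline), len(escaped), len(raw))
print(escaped == raw)

# Hypothetical sample texts (not taken from the corpus)
text_with_break = "First paragraph.\nSecond paragraph."
text_with_marker = "First paragraph.\\nSecond paragraph."

# A real line break is only found by the "\n" check,
# a literal backslash-n marker only by the r"\n" check.
print("\n" in text_with_break, "\n" in text_with_marker)
print(r"\n" in text_with_break, r"\n" in text_with_marker)
```

Since `"\\n"` and `r"\n"` are equal, searching for both tests the same thing twice.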

### 4. Familiarization with Mistral's tokenizer

NLTK's word-level tokenization doesn't reflect how Mistral actually processes text, which is as subword tokens. We therefore count tokens with Mistral's official tokenizer, the same way the model does, to get accurate estimates of text length for processing. NLTK is kept only for sentence segmentation.

#### 4.a Tokenizer import 

In [6]:
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

#### 4.b Preliminary text analysis using Mistral tokenizer

In [7]:
if "texte_article" not in df.columns:
    raise ValueError("The 'texte_article' column is missing from the CSV file.")

# List of common French abbreviations
abbrev_list = ['M', 'Mme', 'Dr', 'Prof', 'K', 'p', 'etc', 'Sr', 'Jr']

def split_sentences_nltk(text):
    text_clean = text.replace('«', '').replace('»', '').replace('"', '').replace('“', '').replace('”', '')

    # Temporarily replace dots in abbreviations with a special symbol
    for abbr in abbrev_list:
        text_clean = re.sub(r'\b' + re.escape(abbr) + r'\.', abbr + '‹dot›', text_clean)

    # Standard NLTK segmentation
    sentences = sent_tokenize(text_clean, language='french')

    # Restore the dots
    sentences = [s.replace('‹dot›', '.') for s in sentences]

    # Remove empty sentences or those without alphanumeric characters
    sentences = [s.strip() for s in sentences if re.search(r'\w', s)]

    return sentences

# Storage for tokens and sentences
tokens_per_sentence = []
sentences_list = []

for text in df["texte_article"].dropna().head(100):
    sentences = split_sentences_nltk(str(text))
    for sentence in sentences:
        tokens = tokenizer.encode(sentence, add_special_tokens=True)
        if len(tokens) >= 3:
            tokens_per_sentence.append(len(tokens))
            sentences_list.append(sentence)

# Statistics
tokens_array = np.array(tokens_per_sentence)
print(f"Total number of sentences: {len(tokens_per_sentence)}")
print(f"Average tokens per sentence: {tokens_array.mean():.2f}")
print(f"Median: {np.median(tokens_array)}")
print(f"Min tokens: {tokens_array.min()} | Sentence: {sentences_list[tokens_array.argmin()]}")
print(f"Max tokens: {tokens_array.max()} | Sentence: {sentences_list[tokens_array.argmax()]}")
print(f"Standard deviation: {tokens_array.std():.2f}")
q1 = np.percentile(tokens_array, 25)
q3 = np.percentile(tokens_array, 75)
print(f"First quartile (25%): {q1}")
print(f"Third quartile (75%): {q3}")
print(f"IQR: {q3 - q1}")

Total number of sentences: 40074
Average tokens per sentence: 50.64
Median: 43.0
Min tokens: 3 | Sentence: Extension.
Max tokens: 13743 | Sentence: H. : 376  Salve Deus Rex Iudaeorum : 123  Samuel (2) : 132, 134  Sands, Philip : 151  Schomberg, Frédéric, duc de : 468  Schwartz, Bernard : 69  Sénèque : 295  Shakespeare, William : 123, 181, 268, 378, 379, 417, 418, 448  Sieyès, L’Abbé : 49, 52, 54, 79, 91, 117, 118, 120, 264, 287, 288, 317, 397, 458, 460  Smith, Adam : 90, 311, 313, 485  Smith, Thomas :207, 273, 277, 316  Solon d’Athènes : 404  Sophocle : 132  Soveraigne Power of Parliaments and Kingdoms : 442  St Germain, Christopher : 71, 141, 142, 205  Stuart, Charles : Voir Charles Ier  Stuart, Henry, Lord Darnley : 416  Stuart, Jacques-François : 454  Stuart, James : Voir Jacques VI et Ier  Stuart, Marie : Voir Marie Stuart Stuarts, Dynastie des : 119, 455  Sudbury, Simon : 113  Talleyrand-Périgord, Charles Maurice de : 93, 127, 349  Thouret, Jacques-Guillaume : 197, 351, 352  Tibèr

Next, I split the articles into paragraph-like blocks. I first considered estimating an average number of tokens per sentence, then building paragraphs from a high estimate of tokens per sentence and therefore a low estimate of sentences per paragraph. However, since sentences range from 3 tokens to more than 13,000 (the maximum above is a book-index entry that NLTK treats as a single sentence), it is simpler to select sentences directly by their token count so that each chunk stays within Mistral's limit.

### 5. Splitting articles into chunks

The code below splits each article into chunks that contain only complete sentences and whose total token count does not exceed 3000. Although the actual limit is 4096 tokens, we need to leave room for the prompt, which already uses 500 tokens, and for an error margin. If adding a sentence would push a chunk over the threshold, the sentence is left out of that chunk and starts the next one.
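
The budget and the chunking rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook's code: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, whereas the notebook counts tokens with Mistral's tokenizer, and the sample sentences are invented.

```python
CONTEXT_LIMIT = 4096     # limit quoted in the text
PROMPT_TOKENS = 500      # prompt length quoted in the text
MAX_CHUNK_TOKENS = 3000  # chunk threshold used in this section

# Tokens left over for the error margin once prompt and chunk are counted
margin = CONTEXT_LIMIT - PROMPT_TOKENS - MAX_CHUNK_TOKENS
print(f"Remaining margin: {margin} tokens")


def count_tokens(sentence):
    """Hypothetical stand-in for the tokenizer: one token per word."""
    return len(sentence.split())


def greedy_chunks(sentences, max_tokens):
    """Group whole sentences into chunks of at most max_tokens tokens."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # A sentence that would overflow the chunk starts the next one
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
for chunk in greedy_chunks(sentences, max_tokens=5):
    print(chunk)
```

With a 5-token limit, the third sentence does not fit after the first two, so it opens a second chunk.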

#### 5.a CAIRN articles

In [9]:
# List of common French abbreviations
abbrev_list = ['M', 'Mme', 'Dr', 'Prof', 'K', 'p', 'etc', 'Sr', 'Jr']

# Sentence segmentation function
def split_sentences_nltk(text):
    text_clean = text.replace('«', '').replace('»', '').replace('"', '').replace('“', '').replace('”', '')
    for abbr in abbrev_list:
        text_clean = re.sub(r'\b' + re.escape(abbr) + r'\.', abbr + '‹dot›', text_clean)
    sentences = nltk.sent_tokenize(text_clean, language='french')
    sentences = [s.replace('‹dot›', '.') for s in sentences if re.search(r'\w', s)]
    return sentences

# Function to split an article into chunks ≤ max_tokens
def split_article_max_tokens(article_text, max_tokens=3000):
    sentences = split_sentences_nltk(article_text)
    chunks = []
    current_chunk = []
    current_tokens = 0
    
    for sentence in sentences:
        sentence_tokens = len(tokenizer.encode(sentence, add_special_tokens=True))
        
        if current_tokens + sentence_tokens > max_tokens:
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [sentence]
            current_tokens = sentence_tokens
        else:
            current_chunk.append(sentence)
            current_tokens += sentence_tokens
    
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    
    return chunks

if "texte_article" not in df.columns:
    raise ValueError("The 'texte_article' column is missing from the CSV file.")

# Split all articles into chunks and repeat other columns
all_chunks = []
for _, row in df.iterrows():
    text = row["texte_article"]
    
    if pd.isna(text):
        continue
        
    chunks = split_article_max_tokens(text, max_tokens=3000)
    for chunk in chunks:
        new_row = row.copy()
        new_row["chunk"] = chunk  # New column for the split text
        all_chunks.append(new_row)

# Create new DataFrame and save
df_chunks = pd.DataFrame(all_chunks)
df_chunks.to_csv("CAIRN_clean_final_chunks.csv", index=False)

print(f"{len(df_chunks)} rows created and saved to 'CAIRN_clean_final_chunks.csv'.")


89921 rows created and saved to 'CAIRN_clean_final_chunks.csv'.
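
One case the threshold does not cover: a single sentence longer than `max_tokens` is placed in a chunk of its own, and that chunk is then over the limit (the 13743-token index entry found in section 4.b is an example). A possible safeguard, sketched here under assumptions, is to cut such a sentence into fixed-size windows of token IDs before chunking. `split_token_ids` is a hypothetical helper, and the integer list stands in for the output of `tokenizer.encode`; with the real tokenizer, each window would then be turned back into text with `tokenizer.decode`.

```python
def split_token_ids(token_ids, max_tokens):
    """Cut a list of token IDs into consecutive windows of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [
        token_ids[i:i + max_tokens]
        for i in range(0, len(token_ids), max_tokens)
    ]


# Stand-in for tokenizer.encode(sentence): 13743 dummy token IDs
fake_ids = list(range(13743))
windows = split_token_ids(fake_ids, max_tokens=3000)

print(len(windows))
print([len(w) for w in windows])
```

Cutting inside a sentence gives up the "complete sentences only" property for those windows, so it may be preferable to flag or drop such index-like passages instead.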


#### 5.b Le Monde articles 

I proceed in the same way for Le Monde: its natural paragraph markers occur too frequently, so splitting on them would leave too little context for studying the article content with an LLM.

In [14]:
# List of common French abbreviations
abbrev_list2 = ['M', 'Mme', 'Dr', 'Prof', 'K', 'p', 'etc', 'Sr', 'Jr']

# Sentence segmentation function
def split_sentences_nltk2(text2):
    text_clean2 = text2.replace('«', '').replace('»', '').replace('"', '').replace('“', '').replace('”', '')
    for abbr2 in abbrev_list2:
        text_clean2 = re.sub(r'\b' + re.escape(abbr2) + r'\.', abbr2 + '‹dot›', text_clean2)
    sentences2 = nltk.sent_tokenize(text_clean2, language='french')
    sentences2 = [s2.replace('‹dot›', '.') for s2 in sentences2 if re.search(r'\w', s2)]
    return sentences2

# Function to split an article into chunks ≤ max_tokens
def split_article_max_tokens2(article_text2, max_tokens2=3000):
    sentences2 = split_sentences_nltk2(article_text2)
    chunks2 = []
    current_chunk2 = []
    current_tokens2 = 0
    
    for sentence2 in sentences2:
        sentence_tokens2 = len(tokenizer.encode(sentence2, add_special_tokens=True))
        
        if current_tokens2 + sentence_tokens2 > max_tokens2:
            if current_chunk2:
                chunks2.append(" ".join(current_chunk2))
            current_chunk2 = [sentence2]
            current_tokens2 = sentence_tokens2
        else:
            current_chunk2.append(sentence2)
            current_tokens2 += sentence_tokens2
    
    if current_chunk2:
        chunks2.append(" ".join(current_chunk2))
    
    return chunks2

if "content" not in df2.columns:
    raise ValueError("The 'content' column is missing from the CSV file.")

# Split all articles into chunks and repeat other columns
all_chunks2 = []
for _, row2 in df2.iterrows():
    text2 = row2["content"]
    
    if pd.isna(text2):
        continue
        
    chunks2 = split_article_max_tokens2(text2, max_tokens2=3000)
    for chunk2 in chunks2:
        new_row2 = row2.copy()
        new_row2["chunk"] = chunk2  # New column for the split text
        all_chunks2.append(new_row2)

# Create new DataFrame and save
df_chunks2 = pd.DataFrame(all_chunks2)
df_chunks2.to_csv("LM_real_final_chunks.csv", index=False)

print(f"{len(df_chunks2)} rows created and saved to 'LM_real_final_chunks.csv'.")


6535 rows created and saved to 'LM_real_final_chunks.csv'.


#### Deleting the "texte_article" column from "CAIRN_clean_final_chunks.csv" because it is too heavy

In [10]:
df_CAIRNcontent=pd.read_csv("CAIRN_clean_final_chunks.csv")

In [11]:
display(df_CAIRNcontent)

Unnamed: 0.1,url,Unnamed: 0,titre,auteur,section,twitter_card,journal,annee,numero,page_debut,page_fin,doi,pdf_url,type_document,variante_recherchee,name,nom_auteur_absent,author_name_absent,chunk
0,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,87,,,Chapitre de Que sais-je ? / Repères,Ackerman,Galia Ackerman,False,False,L 'objectif de ce chapitre est d'envisager à q...
1,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,87,,,Chapitre de Que sais-je ? / Repères,Ackerman,Galia Ackerman,False,False,"Elle dépend, d'autre part, de l'identité de j ..."
2,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,87,,,Chapitre de Que sais-je ? / Repères,Ackerman,Galia Ackerman,False,False,"La politique inverse est alors envisageable, à..."
3,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,87,,,Chapitre de Que sais-je ? / Repères,Ackerman,Galia Ackerman,False,False,Ils sont supposés maximiser leur utilité de ma...
4,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,87,,,Chapitre de Que sais-je ? / Repères,Ackerman,Galia Ackerman,False,False,4) Un décalage entre les actions présentes et ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
89916,https://stm.cairn.info/violences-de-la-maladie...,26782 | 20315 | 22977,2. Géométrie de la souffrance,Claire Marin,Sociologie,summary,Le temps des idées,2015.0,,43,68,,,Chapitre d’ouvrage,Dostoïevski | F. Dostoievski | F. Dostoïevski,Fiodor Dostoïevski,False,False,Disparaît pour mon existence la possibilité d...
89917,https://stm.cairn.info/violences-de-la-maladie...,26782 | 20315 | 22977,2. Géométrie de la souffrance,Claire Marin,Sociologie,summary,Le temps des idées,2015.0,,43,68,,,Chapitre d’ouvrage,Dostoïevski | F. Dostoievski | F. Dostoïevski,Fiodor Dostoïevski,False,False,"Décrite dans la littérature médicale, elle pas..."
89918,https://stm.cairn.info/violences-de-la-maladie...,26782 | 20315 | 22977,2. Géométrie de la souffrance,Claire Marin,Sociologie,summary,Le temps des idées,2015.0,,43,68,,,Chapitre d’ouvrage,Dostoïevski | F. Dostoievski | F. Dostoïevski,Fiodor Dostoïevski,False,False,Sans quoi cette nouvelle perception ne fait qu...
89919,https://stm.cairn.info/violences-de-la-maladie...,26782 | 20315 | 22977,2. Géométrie de la souffrance,Claire Marin,Sociologie,summary,Le temps des idées,2015.0,,43,68,,,Chapitre d’ouvrage,Dostoïevski | F. Dostoievski | F. Dostoïevski,Fiodor Dostoïevski,False,False,Le sujet est ramené à lui-même par le biais d’...


In [12]:
df_CAIRNcontent.drop(columns=['texte_article'], inplace=True)

In [14]:
# index=False avoids writing the row index as an extra "Unnamed: 0" column
df_CAIRNcontent.to_csv("CAIRN_clean_final_chunks.csv", index=False)

### 6. Author Presence Detection per Chunk

In [24]:
# Extract last name (family name = last token) ---
def extract_last_name(full_name):
    if pd.isna(full_name):
        return None
    full_name = re.sub(r'\s+', ' ', full_name.strip())
    return full_name.split(' ')[-1].lower()

# Extract last names from a "|" separated field ---
def extract_last_names_from_field(field):
    if pd.isna(field):
        return []
    return [
        extract_last_name(name)
        for name in field.split('|')
        if extract_last_name(name)
    ]

# Author presence per chunk (INDEXED ON variante_recherchee) ---
def author_presence_per_chunk(row):
    text = str(row['chunk']).lower()

    # Now canonical authors = variante_recherchee
    canonical_authors = extract_last_names_from_field(row['variante_recherchee'])

    # Helper: all searchable names from 'name' (optional)
    name_authors = extract_last_names_from_field(row['name'])

    presence = []

    for author in canonical_authors:
        # author present if its own variant OR any last name taken from 'name' is in the text
        found = author in text or any(name in text for name in name_authors)
        presence.append('1' if found else '0')

    return '|'.join(presence)

# Apply at chunk level ---
df_CAIRNcontent['presence_auteur'] = df_CAIRNcontent.apply(
    author_presence_per_chunk, axis=1
)

# Chunk-level flag ---
df_CAIRNcontent['any_author_in_chunk'] = df_CAIRNcontent['presence_auteur'].apply(
    lambda x: any(v == '1' for v in x.split('|')) if isinstance(x, str) else False
)

# Article-level aggregation ---
article_author_presence = (
    df_CAIRNcontent
    .groupby('url')['any_author_in_chunk']
    .any()
    .reset_index(name='author_present_in_article')
)

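# Merge article info back ---
# Drop columns left by earlier runs so that re-running this cell
# does not create '_x' / '_y' duplicates
stale_cols = [
    c for c in df_CAIRNcontent.columns
    if c.startswith('author_present_in_article')
]
df_CAIRNcontent = df_CAIRNcontent.drop(columns=stale_cols).merge(
    article_author_presence, on='url', how='left'
)

# Articles with NO author in any chunk ---
articles_without_author = article_author_presence[
    ~article_author_presence['author_present_in_article']
]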

print(f"Articles with NO author found in any chunk: {len(articles_without_author)}")
display(articles_without_author.head())



Articles with NO author found in any chunk: 0


Unnamed: 0,url,author_present_in_article
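
The test `author in text` used above is a substring test, so a last name can also match inside a longer word, and an accented spelling does not match its unaccented variant. A stricter matcher could look like the sketch below. `fold_accents` and `name_in_text` are hypothetical helpers that are not part of the notebook, and the sample sentences are invented.

```python
import re
import unicodedata


def fold_accents(text):
    """Lowercase and strip diacritics, e.g. 'Dostoïevski' -> 'dostoievski'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def name_in_text(last_name, text):
    """True if last_name occurs in text as a whole word, ignoring case and accents."""
    pattern = r"(?<!\w)" + re.escape(fold_accents(last_name)) + r"(?!\w)"
    return re.search(pattern, fold_accents(text)) is not None


# Substring inside a longer word: rejected
print(name_in_text("Marin", "La marine marchande recrute."))
# Whole-word occurrence: accepted
print(name_in_text("Marin", "Selon Marin, la maladie isole."))
# Accented name against an unaccented spelling: accepted
print(name_in_text("Dostoïevski", "Les romans de Dostoievski."))
```

Whether accent folding is desirable depends on the corpus; it is included here only because the `variante_recherchee` field already lists both accented and unaccented spellings.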


In [25]:
display(df_CAIRNcontent.head())

Unnamed: 0.1,url,Unnamed: 0,titre,auteur,section,twitter_card,journal,annee,numero,page_debut,...,variante_recherchee,name,nom_auteur_absent,author_name_absent,chunk,presence_auteur,any_author_in_chunk,author_present_in_article_x,author_present_in_article_y,author_present_in_article
0,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,...,Ackerman,Galia Ackerman,False,False,L 'objectif de ce chapitre est d'envisager à q...,0,False,True,True,True
1,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,...,Ackerman,Galia Ackerman,False,False,"Elle dépend, d'autre part, de l'identité de j ...",0,False,True,True,True
2,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,...,Ackerman,Galia Ackerman,False,False,"La politique inverse est alors envisageable, à...",1,True,True,True,True
3,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,...,Ackerman,Galia Ackerman,False,False,Ils sont supposés maximiser leur utilité de ma...,0,False,True,True,True
4,https://droit.cairn.info/Economie-des-donnees-...,86147,IV. Les comportements en matière de vie privée...,Fabrice Rochelandet,Sociologie,summary,Repères,2010.0,,67,...,Ackerman,Galia Ackerman,False,False,4) Un décalage entre les actions présentes et ...,0,False,True,True,True


In [27]:
df_CAIRNcontent.to_csv("CAIRN_chunks_and_authors.csv", index=False)