# 🧠 Named Entity Recognition (NER)

This project focuses on optimizing a natural language processing (NLP) pipeline to detect and classify named entities in **French texts**, across the following categories:

* `PER` – Person
* `LOC` – Location
* `ORG` – Organization
* `MISC` – Miscellaneous

We leverage **multiple NER tools** to maximize accuracy:

* **CasEN**: A linguistic rule-based system based on **Unitex**, developed by linguists.
* **spaCy**: A fast and efficient NLP library.
* **Stanza**: A deep learning-based NLP library from Stanford, well-suited for morphologically rich languages.

---

### 📁 Single vs. Multiple Corpus Processing

We implemented an option that lets you choose whether to generate **one file per description** or a **single file for all descriptions combined**.

To preserve the traceability of each description's origin, we wrap them with custom tags in the merged file:

```xml
<doc id="X">
    [description content]
</doc>
```

This allows the system to:

- ✅ Significantly reduce execution time (more than 2× faster in our tests)

- ✅ Better exploit generic graph-based rules, which can tag all similar entities once one is found

#### 📊 Entity Detection Results

| Mode                     | Total Entities Found | Gain    |
| ------------------------ | -------------------- | ------- |
| One file per description | 9,446                | n/a     |
| One file for all         | 13,233               | +40.09% |
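
As a minimal sketch of the merged-corpus format described above (not the project's actual implementation), the following wraps each description in a `<doc id="X">` tag and shows how the ids allow results to be mapped back to their source description. The function names `build_merged_corpus` and `split_merged_corpus` and the sample sentences are illustrative assumptions.

```python
import re

# Matches one <doc id="N"> ... </doc> block as written by build_merged_corpus
DOC_PATTERN = re.compile(r'<doc id="(\d+)">\n(.*?)\n</doc>', re.DOTALL)


def build_merged_corpus(descriptions):
    """Wrap each description in <doc id="..."> tags and join them into one text."""
    parts = []
    for doc_id, text in enumerate(descriptions):
        parts.append(f'<doc id="{doc_id}">\n{text}\n</doc>')
    return "\n".join(parts)


def split_merged_corpus(corpus):
    """Recover a {doc_id: description} mapping from a merged corpus."""
    return {int(doc_id): text for doc_id, text in DOC_PATTERN.findall(corpus)}


# Made-up sample descriptions, for illustration only
descriptions = [
    "Nathalie et Marianne vivent à Arcachon.",
    "Sara est un système connecté.",
]
corpus = build_merged_corpus(descriptions)
recovered = split_merged_corpus(corpus)
```

Because every description keeps its own id, entities found in the merged file can still be attributed to the right description.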


---

## 🚀 CasEN Optimization (method: `casENOpti`)

We then evaluated the **precision** and **entity yield** of each graph individually.

This analysis helped us identify certain graphs or combinations of graphs that provided the most benefit. We leveraged this insight to **prioritize and retain their extracted entities**, even if they were not detected by other systems.

### 🔍 Example of a Graph Sequence

| Step            | Graph Name               |
|------------------|--------------------------|
| main_graph      | `grfpersCivilitePersonne` |
| second_graph  | `grftagCiviliteS`         |
| third_graph   | `grftagNomFamille`        |

These optimized sequences allow us to improve both recall and consistency across descriptions by capturing entities that would otherwise be missed.
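
The idea of retaining CasEN-only entities when they come from a trusted graph sequence can be sketched with pandas as follows. This is an assumption-laden illustration: the `TRUSTED_SEQUENCES` whitelist, the function name `keep_trusted_casen_only`, and the sample rows are hypothetical; only the column names (`method`, `main_graph`, `second_graph`, `third_graph`) mirror the CasEN output shown later in this document.

```python
import pandas as pd

# Hypothetical whitelist of (main_graph, second_graph, third_graph) sequences
TRUSTED_SEQUENCES = {
    ("grfpersCivilitePersonne", "grftagCiviliteS", "grftagNomFamille"),
}

GRAPH_COLS = ["main_graph", "second_graph", "third_graph"]


def keep_trusted_casen_only(df, trusted=TRUSTED_SEQUENCES):
    """Drop entities found only by CasEN unless their graph sequence is trusted."""
    # Build one tuple per row; missing graphs become empty strings
    sequences = pd.Series(
        list(zip(*(df[col].fillna("") for col in GRAPH_COLS))), index=df.index
    )
    is_trusted = sequences.map(lambda seq: seq in trusted)
    casen_only = df["method"] == "casEN"
    # Entities confirmed by another system are always kept
    return df[~casen_only | is_trusted]


# Made-up sample rows, for illustration only
entities = pd.DataFrame({
    "NER": ["M. Dupont", "exploitation", "Dubreuil"],
    "method": ["casEN", "casEN", "casEN_stanza"],
    "main_graph": ["grfpersCivilitePersonne", "grforgEntreprise", "grfpersContextePersonne"],
    "second_graph": ["grftagCiviliteS", None, "grftagNomFamille"],
    "third_graph": ["grftagNomFamille", None, None],
})
kept = keep_trusted_casen_only(entities)
```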


---
## 🔄 Multi-Model Entity Detection & Cross-Validation

Each text description is first processed individually by all three systems (**CasEN**, **spaCy**, and **Stanza**).
Then, we apply a **cross-validation strategy** during result fusion:

### Cross-System Agreement

* If multiple systems detect the **same entity**, we merge their outputs and label them accordingly.
* Example: If both **CasEN** and **Stanza** detect "Nora" as a `PER`, the merged method becomes `CasEN_Stanza`.

### Conflict Resolution with Priority Rules

When an entity is detected by **multiple systems with different labels**, we apply **priority rules**:

* Entities found by **more systems** are considered more reliable.
* If systems agree on the **entity** but not on the **label**, we prioritize the **most frequent or reliable label** among agreeing systems.

⚠️ **Important:** Currently, this system works only for **PER** entities.  
After a brief analysis, this configuration appears to yield the highest number of entities with minimal loss in precision.
We also combine it with a dictionary of excluded words: terms that these graphs often tag but that we know are wrong. This list removes some recurring ambiguities with `PER` entities.
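
A minimal sketch of such an exclusion filter is shown below. The contents of `EXCLUDED_PER_WORDS`, the function name `drop_excluded_per`, and the sample rows are hypothetical; the project's real dictionary and matching rules may differ.

```python
import pandas as pd

# Hypothetical exclusion list: words assumed to be wrongly tagged as PER
EXCLUDED_PER_WORDS = {"monsieur", "madame", "docteur"}


def drop_excluded_per(df, excluded=EXCLUDED_PER_WORDS):
    """Remove PER entities whose text is in the exclusion list (case-insensitive)."""
    is_per = df["NER_label"] == "PER"
    is_excluded = df["NER"].str.strip().str.lower().isin(excluded)
    return df[~(is_per & is_excluded)]


# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "NER": ["Nora", "Madame", "Arcachon"],
    "NER_label": ["PER", "PER", "LOC"],
})
filtered = drop_excluded_per(sample)
```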


#### Example

![Excel Result Preview](src/images/image.png)

As shown above:

* Both **CasEN** and **Stanza** classify **"Nora"** as a **Person (`PER`)**.
* **spaCy**, however, classifies it as a **Location (`LOC`)**.

As a result, the merged label becomes `CasEN_Stanza_priority`.

This indicates that CasEN and Stanza agreed on both the entity and the label, and their interpretation takes precedence over spaCy's.
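
The agreement and priority rules above can be sketched as a simple majority vote over the labels proposed for one entity. This is an illustrative assumption, not the project's code: the function name `merge_detections`, the `SYSTEM_ORDER` list, and the choice to leave ties unresolved are all hypothetical.

```python
from collections import Counter

# Fixed order used to build the merged method name (assumed for this sketch)
SYSTEM_ORDER = ["CasEN", "spaCy", "Stanza"]


def merge_detections(detections):
    """Merge the labels proposed by several systems for the same entity.

    detections maps a system name to the label it assigned, for example
    {"CasEN": "PER", "Stanza": "PER", "spaCy": "LOC"}.
    Returns (label, method), or (None, None) when the vote is tied.
    """
    counts = Counter(detections.values()).most_common()
    best_label, best_count = counts[0]
    # A tie between labels cannot be settled by majority in this sketch
    if len(counts) > 1 and counts[1][1] == best_count:
        return None, None
    agreeing = [s for s in SYSTEM_ORDER if detections.get(s) == best_label]
    method = "_".join(agreeing)
    # Mark the result when the majority overrode at least one system
    if len(agreeing) < len(detections):
        method += "_priority"
    return best_label, method


nora = merge_detections({"CasEN": "PER", "Stanza": "PER", "spaCy": "LOC"})
print(nora)  # ('PER', 'CasEN_Stanza_priority')
```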

---
## 📊 Named Entity Recognition (NER) – Evaluation Results

This section presents the evolution of NER performance across different configurations using **CasEN**, **spaCy**, **Stanza**, and optimized graph sequences.



### Initial Evaluation (CasEN ∩ spaCy)

Entities detected at the beginning of the pipeline, using the intersection of the CasEN and spaCy outputs.

| Category | Total Entities | Accuracy |
|----------|----------------|----------|
| NE       | 4,085          | 97.67%   |
| PER      | 2,744          | 98.69%   |
| LOC      | 1,212          | 98.68%   |
| ORG      | 129            | 66.67%   |
| MISC     | 0              | 0.00%    |



### 📁 CasEN on Single Corpus File (CasEN ∩ spaCy)

Performance after switching to a **single concatenated file** approach for CasEN.

| Category | Total Entities | Accuracy | Entity Gain | Accuracy Change |
|----------|----------------|----------|-------------|-----------------|
| NE       | 5,327          | ✅ 97.61% | 🔼 +30.40%  | 🔽 -0.06%       |
| PER      | 4,236          | ✅ 98.31% | 🔼 +54.37%  | 🔽 -0.37%       |
| LOC      | 952            | ✅ 98.83% | 🔽 -21.45%  | 🔼 +0.15%       |
| ORG      | 139            | ⚠️ 66.92% | 🔼 +7.75%   | 🔼 +0.26%       |
| MISC     | 0              | ❌ 0.00%  | ➖ 0.00%    | ➖ 0.00%        |



### 🚀 CasEN + Optimized Graphs

Results using **CasEN with graph optimization** strategies.

| Category | Total Entities | Accuracy | Entity Gain | Accuracy Change |
|----------|----------------|----------|-------------|-----------------|
| NE       | 6,010          | ✅ 97.14% | 🔼 +12.82%  | 🔽 -0.47%       |
| PER      | 4,491          | ✅ 98.00% | 🔼 +6.02%   | 🔽 -0.31%       |
| LOC      | 1,294          | ✅ 97.78% | 🔼 +35.92%  | 🔽 -1.05%       |
| ORG      | 225            | ⚠️ 75.12% | 🔼 +61.87%  | 🔼 +8.20%       |
| MISC     | 0              | ❌ 0.00%  | ➖ 0.00%    | ➖ 0.00%        |


### Full System: CasEN + spaCy + Stanza + Optimization & Priority Rules

Final performance combining **all systems** with **graph priority strategies** and **CasEN optimizations**.

| Category | Total Entities | Accuracy | Entity Gain | Accuracy Change |
|----------|----------------|----------|-------------|-----------------|
| NE       | 7,086          | ✅ 97.08% | 🔼 +17.90%  | 🔽 -0.06%       |
| PER      | 5,592          | ✅ 97.37% | 🔼 +24.52%  | 🔽 -0.63%       |
| LOC      | 1,267          | ✅ 98.30% | 🔽 -2.09%   | 🔼 +0.52%       |
| ORG      | 227            | ⚠️ 82.84% | 🔼 +0.89%   | 🔼 +7.72%       |
| MISC     | 0              | ❌ 0.00%  | ➖ 0.00%    | ➖ 0.00%        |



#### ✅ Summary

Overall change between the initial evaluation and the full system:

| Category | Total Entities | Accuracy | Entity Gain | Accuracy Change |
|----------|----------------|----------|-------------|-----------------|
| NE       | 7,086          | ✅ 97.08% | 🔼 +73.46%  | 🔽 -0.60%       |
| PER      | 5,592          | ✅ 97.37% | 🔼 +103.79% | 🔽 -1.31%       |
| LOC      | 1,267          | ✅ 98.30% | 🔼 +4.54%   | 🔽 -0.38%       |
| ORG      | 227            | ⚠️ 82.84% | 🔼 +75.97%  | 🔼 +16.18%      |
| MISC     | 0              | ❌ 0.00%  | ➖ 0.00%    | ➖ 0.00%        |
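
The gain and accuracy columns in these tables appear to be a relative change in entity count and a difference in percentage points. The short sketch below reproduces the `LOC` row of the summary from the initial and final figures. The function names are illustrative assumptions.

```python
def relative_gain(before, after):
    """Relative change in entity count, in percent."""
    return (after - before) / before * 100


def accuracy_change(before_pct, after_pct):
    """Difference between two accuracies, in percentage points."""
    return after_pct - before_pct


# LOC: 1,212 entities at 98.68% initially, 1,267 entities at 98.30% in the full system
loc_gain = relative_gain(1212, 1267)
loc_accuracy = accuracy_change(98.68, 98.30)

print(f"{loc_gain:+.2f}%")      # +4.54%
print(f"{loc_accuracy:+.2f}%")  # -0.38%
```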

---
## 🔄 Suggestions for Further Work / Improvements

- ✅ CasEN has received several updates over the past two months. The graphs should be re-analyzed (some have changed) so that the `CasENOpti` configuration can be updated.

- ✅ Further analysis could vary the order in which the graphs are applied, particularly for the `Generique` graphs.

- ✅ The single text file generated for CasEN could be replaced with several "collection" files, each grouping EPGs from the same collection. This would probably give more coherent results from the generic graphs.

- ✅ The file containing the descriptions could be analyzed further to remove identical descriptions and reduce the downstream workload, while keeping enough information to attach the entities to the right descriptions afterwards.

- Add more excluded words to the dictionary for PERs.

- The `priority` system could also be further improved and extended.  
  Currently, it identifies all composite methods (e.g., `CasEN_Stanza`) and atomic methods (e.g., `CasEN`, `Stanza`) separately.  
  When both a composite and an atomic method detect the same entity but assign different categories, the system applies a priority rule in favor of the composite method.  
  (It might also be worth exploring comparisons between atomic methods themselves to refine the decision-making process.)

⚠️ **Important:** All tests and analyses were carried out on a single day's dataset. On much larger datasets, some functions may stop working and some optimizations may no longer hold.


## 📅 Installation

### 1. Clone the repository

```bash
git clone https://github.com/Valentin-Gauthier/NER.git
cd NER
```



### 2. Install dependencies

```bash
pip install -r requirements.txt
```

### 3. Configure the project

Before running the project, make sure to edit the `config.yaml` file to configure all settings according to your machine.

---

## ✍️ Author

Valentin, Bachelor's degree, 3rd year, Computer Science<br>
Internship at LIFAT - 2025


In [1]:
import importlib
import pandas as pd

# Load Data

In [2]:
# Load the data
DATAS = pd.read_excel("C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\Ressources\\20231105_raw.xlsx")

# CasEN Configuration

| Parameter          | Type   | Description                                                                                                        |
|--------------------|--------|--------------------------------------------------------------------------------------------------------------------|
| `run_casen`        | `bool` | If `True`, executes CasEN. If `False`, assumes the data already exists in `corpus_folder` and `result_folder`.     |
| `single_corpus`    | `bool` | If `True`, produces a single corpus file; otherwise, one file per description in the data.                         |
| `production_mode`  | `bool` | If `True`, keeps only the required columns (uses less memory).                                                     |
| `remove_misc`      | `bool` | If `True`, removes all MISC tags from the output.                                                                  |
| `logging`          | `bool` | Logs the execution time of key functions to a log file.                                                            |
| `timer`            | `bool` | Displays execution times in the console at runtime.                                                                |
| `archiving_result` | `bool` | If `True`, archives the files currently in the CasEN result folder to the Archiving folder before running CasEN.   |
| `verbose`          | `bool` | Enables detailed debug output in the console.                                                                      |

In [3]:
from tools import casen_config
importlib.reload(casen_config)
from tools.casen_config import CasenConfig

# ========================= CASEN EXAMPLE =====================
c = CasenConfig(
    run_casen=True,
    single_corpus=True,
    production_mode=True,  # memory footprint: 5432714 bytes with True vs 8422013 bytes with False
    remove_misc=True,
    logging=False,
    timer=True,
    archiving_result=False,
    verbose=False
)

c_df = c.run(DATAS)
#c_df.to_excel("casen_generique_at_end.xlsx", index=False)
c_df.head()

load_config in : 0.13s
load_data in : 0.00s
generate_corpus in : 0.52s
{'productName', 'geogName', 'date', 'roleName', 'product', 'timePeriod', 'event', 'measure', 'gYear', 'datePeriod', 'geogFeat', 'demonym', 'name', 'org', 'vieuxSigle', 'adress', 'orgName', 'nationality', 'time', 'ref', 'extent', 'placeName', 'persName', 'place'}
corpus.txt
C:\Users\valen\Documents\Informatique-L3\Stage_NER\NER\src\Results\Corpus		 -> results in : C:\Users\valen\Documents\Informatique-L3\Stage_NER\NER\src\Results\CasEN\Res_CasEN
1 files to process with CasEN in  C:\Users\valen\Documents\Informatique-L3\Stage_NER\NER\src\Results\Corpus

run_casEN_on_corpus in : 342.96s
load_files in : 0.00s
get_entities in : 2.96s
CasEN in : 3.03s
run in : 348.54s


Unnamed: 0,NER,NER_label,method,main_graph,second_graph,third_graph,file_id,entity_start,entity_end
0,Christophe Perrin,PER,casEN,grfpersPrenomNom,grftagPrenom,grftagNomFamille,0,0,17
1,exploitation,ORG,casEN,grforgEntreprise,,,0,233,245
2,Nathalie,PER,casEN,grfpersGenerique,,,0,294,302
3,Marianne,PER,casEN,grfpersGenerique,,,0,306,314
4,Dubreuil,PER,casEN,grfpersContextePersonne,grftagNomFamille,,0,444,452


# spaCy Configuration

| Parameter         | Type   | Description                                                                                                                          |
|-------------------|--------|--------------------------------------------------------------------------------------------------------------------------------------|
| `model`           | `str`  | Name of the spaCy pipeline to load, e.g. `fr_core_news_sm`, `fr_core_news_md`, `fr_core_news_lg` (it must be downloaded beforehand). |
| `production_mode` | `bool` | If `True`, keeps only the required columns (uses less memory).                                                                       |
| `logging`         | `bool` | Logs the execution time of key functions to a log file.                                                                              |
| `timer`           | `bool` | Displays execution times in the console at runtime.                                                                                  |
| `verbose`         | `bool` | Enables detailed debug output in the console.                                                                                        |

In [37]:
from tools import spacy_wrapper
importlib.reload(spacy_wrapper)
from tools.spacy_wrapper import SpaCyConfig

sp = SpaCyConfig(
    model = "fr_core_news_sm",
    production_mode = True,
    timer = False,
    logging = False,
    verbose = False
)

sp_df = sp.run(DATAS)
sp_df.head()

Unnamed: 0,NER,NER_label,method,file_id,entity_start,entity_end
0,Christophe Perrin,PER,spaCy,0,0,17
1,bassin d'Arcachon,LOC,spaCy,0,47,64
2,L'Héritage,LOC,spaCy,0,219,229
3,Nathalie,MISC,spaCy,0,294,302
4,Marianne,PER,spaCy,0,306,314


# Stanza Configuration

| Parameter         | Type   | Description                                                                                       |
|-------------------|--------|---------------------------------------------------------------------------------------------------|
| `use_gpu`         | `bool` | Runs Stanza on the GPU to speed it up (the required GPU dependencies must be installed first).    |
| `production_mode` | `bool` | If `True`, keeps only the required columns (uses less memory).                                    |
| `logging`         | `bool` | Logs the execution time of key functions to a log file.                                           |
| `timer`           | `bool` | Displays execution times in the console at runtime.                                               |
| `verbose`         | `bool` | Enables detailed debug output in the console.                                                     |

In [43]:
from tools import stanza_wrapper
importlib.reload(stanza_wrapper)
from tools.stanza_wrapper import StanzaConfig

st = StanzaConfig(
    use_gpu = True,
    production_mode = True,
    timer = False,
    logging = False,
    verbose = False
)

st_df = st.run(DATAS)
st_df.head()

2025-06-20 14:19:40 INFO: Loading these models for language: fr (French):
| Processor | Package            |
----------------------------------
| tokenize  | combined           |
| mwt       | combined           |
| ner       | wikinergold_charlm |

2025-06-20 14:19:40 INFO: Using device: cpu
2025-06-20 14:19:40 INFO: Loading: tokenize
2025-06-20 14:19:40 INFO: Loading: mwt
2025-06-20 14:19:40 INFO: Loading: ner
2025-06-20 14:19:44 INFO: Done loading processors!


Unnamed: 0,NER,NER_label,method,file_id,entity_start,entity_end
0,Christophe Perrin,PER,stanza,0,0,17
1,Arcachon,LOC,stanza,0,56,64
2,L'Héritage,MISC,stanza,0,219,229
3,Nathalie,PER,stanza,0,294,302
4,Marianne,PER,stanza,0,306,314


In [None]:
# We can also load previously saved DataFrames

c_df = pd.read_excel("Results/short_casen.xlsx")
sp_df = pd.read_excel("Results/short_spacy.xlsx")
st_df = pd.read_excel("Results/short_stanza.xlsx")

# NER Configuration

| Parameter                           | Type   | Description                                                                                                                      |
|-------------------------------------|--------|----------------------------------------------------------------------------------------------------------------------------------|
| `process_priority_merge`            | `bool` | If systems agree on the entity but not on the label, keeps the most frequent or most reliable label among the agreeing systems. |
| `process_casen_opti`                | `bool` | Keeps entities found only by CasEN when they come from graphs judged to be precise.                                              |
| `remove_duplicated_entity_per_desc` | `bool` | Removes duplicated entities within the same description.                                                                         |
| `keep_only_trustable_methods`       | `bool` | Keeps only entities found by trusted methods (removes potentially wrong entities).                                               |
| `save_to_file`                      | `bool` | Saves the result to an `xlsx` or `csv` file.                                                                                     |
| `production_mode`                   | `bool` | If `True`, keeps only the required columns (uses less memory).                                                                   |
| `logging`                           | `bool` | Logs the execution time of key functions to a log file.                                                                          |
| `timer`                             | `bool` | Displays execution times in the console at runtime.                                                                              |
| `verbose`                           | `bool` | Enables detailed debug output in the console.                                                                                    |

In [44]:
from tools import ner_config
importlib.reload(ner_config)
from tools.ner_config import NerConfig


ner = NerConfig(
    process_priority_merge = True,
    process_casen_opti = True,
    remove_duplicated_entity_per_desc = True,
    keep_only_trustable_methods = True,
    save_to_file = True,
    production_mode = False,
    logging = False,
    timer = False,
    verbose = False
)

ner_df = ner.run(data=DATAS, dfs=[c_df, sp_df, st_df]) 
ner_df.head()

File saved at : Results\20231105_priority_CasenOpti_TrustMethods.xlsx


Unnamed: 0,NER,NER_label,desc,method,main_graph,second_graph,third_graph,file_id,entity_start,entity_end
0,Christophe Perrin,PER,"Christophe Perrin, un ostréiculteur reconnu du",casEN_spaCy_stanza,grfpersPrenomNom,grftagPrenom,grftagNomFamille,0.0,0.0,17.0
7,Nathalie,PER,le depuis des générations.... Nathalie et Mari...,casEN_stanza_priority,grfpersGenerique,,,0.0,294.0,302.0
8,Marianne,PER,s générations.... Nathalie et Marianne aimerai...,casEN_spaCy_stanza,grfpersGenerique,,,0.0,306.0,314.0
12,Dubreuil,PER,"ement, alors que le capitaine Dubreuil et son ...",casEN_stanza,grfpersContextePersonne,grftagNomFamille,,0.0,444.0,452.0
17,Sara,PER,"ns sa maison ultra connectée. Sara, le système...",casEN_spaCy_stanza,grfpersGenerique,,,1.0,63.0,67.0


### 🧪 Example: Using `NER_Consensus`

(useful for production)

---

#### 📦 Import

```python
from tools.ner_consensus import NER_Consensus
ner_df = NER_Consensus(your_dataframe)
ner_df.head() # Show the output DataFrame
```

#### 🔧 Internal Processing
- Merges results from all NER systems.
- Applies priority rules between detected entities.
- Uses the `casENOpti` configuration.
- Removes duplicated entities per description.
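
As an illustration of the last step, duplicates within a description could be removed with pandas along these lines. This is only a sketch: the function name `remove_duplicates_per_description` and the choice of `file_id`, `NER`, and `NER_label` as the deduplication key are assumptions, and the real implementation may use different criteria.

```python
import pandas as pd


def remove_duplicates_per_description(df):
    """Keep the first occurrence of each (entity, label) pair per description."""
    return df.drop_duplicates(subset=["file_id", "NER", "NER_label"], keep="first")


# Made-up sample rows, for illustration only
found = pd.DataFrame({
    "file_id": [0, 0, 1],
    "NER": ["Nathalie", "Nathalie", "Nathalie"],
    "NER_label": ["PER", "PER", "PER"],
    "entity_start": [294, 510, 12],
})
deduplicated = remove_duplicates_per_description(found)
```

The same name appearing in two different descriptions is kept twice, since each description is treated independently.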

In [35]:
from tools import ner_consensus
importlib.reload(ner_consensus)
importlib.reload(casen_config)
importlib.reload(spacy_wrapper)
importlib.reload(stanza_wrapper)
from tools.ner_consensus import NER_Consensus



result_df =  NER_Consensus(DATAS)
result_df.to_excel("20231105_result.xlsx", index=False)
result_df.head()

2025-06-20 13:09:13 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 432kB [00:00, 23.2MB/s]                    
2025-06-20 13:09:14 INFO: Downloaded file to C:\Users\valen\stanza_resources\resources.json
2025-06-20 13:09:14 INFO: Loading these models for language: fr (French):
| Processor | Package            |
----------------------------------
| tokenize  | combined           |
| mwt       | combined           |
| ner       | wikinergold_charlm |

2025-06-20 13:09:14 INFO: Using device: cpu
2025-06-20 13:09:14 INFO: Loading: tokenize
2025-06-20 13:09:15 INFO: Loading: mwt
2025-06-20 13:09:15 INFO: Loading: ner
2025-06-20 13:09:18 INFO: Done loading processors!


{'time', 'date', 'name', 'demonym', 'vieuxSigle', 'geogName', 'productName', 'placeName', 'nationality', 'timePeriod', 'adress', 'orgName', 'ref', 'geogFeat', 'roleName', 'org', 'extent', 'product', 'persName', 'datePeriod', 'place', 'measure', 'event', 'gYear'}
corpus.txt
C:\Users\valen\Documents\Informatique-L3\Stage_NER\NER\src\Results\Corpus		 -> results in : C:\Users\valen\Documents\Informatique-L3\Stage_NER\NER\src\Results\CasEN\Res_CasEN
1 files to process with CasEN in  C:\Users\valen\Documents\Informatique-L3\Stage_NER\NER\src\Results\Corpus



Unnamed: 0,titles,sub_title,days,channel,category,NER,NER_label,clean_titles,method,file_id
0,L'héritage,Téléfilm\nTéléfilm policier\nDurée : 1h47min\n...,20231105,13eme RUE,Téléfilm,Christophe Perrin,PER,L'héritage,casEN_spaCy_stanza,0.0
7,L'héritage,Téléfilm\nTéléfilm policier\nDurée : 1h47min\n...,20231105,13eme RUE,Téléfilm,Nathalie,PER,L'héritage,casEN_stanza_priority,0.0
8,L'héritage,Téléfilm\nTéléfilm policier\nDurée : 1h47min\n...,20231105,13eme RUE,Téléfilm,Marianne,PER,L'héritage,casEN_spaCy_stanza,0.0
12,L'héritage,Téléfilm\nTéléfilm policier\nDurée : 1h47min\n...,20231105,13eme RUE,Téléfilm,Dubreuil,PER,L'héritage,casEN_stanza,0.0
17,Einstein : équations criminelles (S3-E1),Série TV\nSérie policière\nDurée : 42min\nRéal...,20231105,13eme RUE,Série TV,Sara,PER,Einstein : équations criminelles,casEN_spaCy_stanza,1.0


In [46]:
initial_data = pd.read_excel("../Ressources/20231101_raw.xlsx")
initial_data.head()

Unnamed: 0.1,Unnamed: 0,titles,sub_title,days,channel,category,desc,length,start_hour,start_mins,stop_hour,stop_mins,clean_titles
0,0,Faster than fear,Série TV\nSérie policière\nRéalisateur :\nFlor...,20231101,13eme RUE,Série TV,Ralf a pu prouver son innocence et Sunny a été...,50,1,30,2,20,Faster than fear
1,1,Commissaire Magellan (S1-E30),Série TV\nSérie policière\nDurée : 1h40min\nRé...,20231101,13eme RUE,Série TV,L'oeuvre du talentueux photographe Tristan Gar...,105,2,20,4,5,Commissaire Magellan
2,2,Einstein : équations criminelles (S3-E1),Série TV\nSérie policière\nDurée : 42min\nRéal...,20231101,13eme RUE,Série TV,Un châtelain féru de chasse et avec la gâchett...,45,4,5,4,50,Einstein : équations criminelles
3,3,La mort du Père Noël,Cinéma\nCourt métrage\nDurée : 15min\nRéalisat...,20231101,13eme RUE,Cinéma,Le Père Noël est mort. Qui l'a tué ?,10,4,50,5,0,La mort du Père Noël
4,4,La belle affaire,Cinéma\nCourt métrage\nDurée : 25min\nRéalisat...,20231101,13eme RUE,Cinéma,"A la frontière suisse, une détective est charg...",25,5,0,5,25,La belle affaire


In [None]:
from pathlib import Path


def cleaning_data(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse identical descriptions and assign a collection_id per clean title."""
    # Remember the number of rows before cleaning
    before = df.shape[0]

    # Keep the original index as a unique identifier (file_id)
    df = df.copy()
    df["file_id"] = df.index

    # Build a table listing, for each description, every file_id that uses it
    df_files = df.groupby("desc")["file_id"].agg(list).reset_index()
    df_files = df_files.rename(columns={"file_id": "files_id"})

    # Keep a single row per description (the first one)
    df_unique = df.drop_duplicates(subset="desc", keep="first")

    # Attach the list of related file_ids to each unique description
    df_cleaned = pd.merge(df_unique, df_files, on="desc", how="left")

    # Report how many rows were aggregated
    print(f"Number of aggregated rows: {before - df_cleaned.shape[0]}")

    # --------------------- COLLECTION ---------------
    # One integer id per distinct clean title
    df_cleaned["collection_id"] = pd.factorize(df_cleaned["clean_titles"])[0]

    return df_cleaned


def generate_corpus_by_collection(df: pd.DataFrame) -> None:
    """Generate one text file per collection_id with <doc id=...>description</doc> per line"""
    corpus_folder = Path("C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\Results\\Corpus")

    missing_desc = df["desc"].isna().sum()

    # Group the rows by collection_id
    grouped = df.groupby("collection_id")

    for collection_id, group in grouped:
        file_path = corpus_folder / f"collection_{collection_id}.txt"
        with open(file_path, "w", encoding="utf-8") as f:
            for _, row in group.iterrows():
                desc = row["desc"]
                # Skip rows without a description
                if pd.notna(desc):
                    f.write(f'<doc id="{row["file_id"]}">{desc}</doc>\n')

    # Print the summary once, after all files are written (not once per collection)
    print(f"[generate file(s)] {len(grouped)} collection files generated in: {corpus_folder}")
    print(f"[generate file(s)] Missing description(s): {missing_desc}")


output_data = cleaning_data(initial_data)
#output_data.to_excel("cleaned_data.xlsx")
#output_data.head()
generate_corpus_by_collection(output_data)

Number of aggregated rows: 6282
[generate file(s)] 1980 collection files generated in: C:\Users\valen\Documents\Informatique-L3\Stage_NER\NER\src\Results\Corpus
[generate file(s)] Missing description(s): 1
[generate file(s)] 1980 collection files generated in: C:\Users\valen\Documents\Informatique-L3