# NER (Named Entity Recognition)

## ⚙️ Parameters Overview

| Parameter              | Type  | Description                                                                   |
|------------------------|-------|-------------------------------------------------------------------------------|
| `EXCEL_DATA`           | `str` | Path to the raw Excel file containing the input data.                         |
| `EXCEL_CORRECTION`     | `str` | Path to the Excel file of corrections, passed to `Ner` as `correction`.       |
| `GRF`                  | `str` | Path to `grf.json`, passed to `Ner` as `casEN_graph_validation`.              |
| `EXCLUDED_NAMES`       | `str` | Path to `name.json`, passed to `Ner` as `excluded_names`.                     |
| `PIPELINE_RESULT`      | `str` | Folder for the pipeline results, passed to `Ner` as `ner_result_folder`.      |
| `CASEN_PATH`           | `str` | Path to the `CasEN.ipynb` notebook that executes CasEN.                       |
| `CASEN_CORPUS_FOLDER`  | `str` | Folder for the CasEN corpus file(s).                                          |
| `CASEN_RESULT_FOLDER`  | `str` | Folder for the output files generated by CasEN.                               |
| `LOG_FOLDER`           | `str` | Folder that will contain the log files.                                       |
| `ARCHIVES_FOLDER`      | `str` | Folder for archived corpora and result files.                                 |



In [1]:
import importlib
import pandas as pd

In [2]:
EXCEL_DATA = "D:\\travail\\Stage\\NER\\NER\\Ressources\\20231101_raw.xlsx"
EXCEL_CORRECTION = "D:\\travail\\Stage\\NER\\NER\\Ressources\\20231101_correction.xlsx"
GRF = "D:\\travail\\Stage\\NER\\NER\\NER\\optimisations\\grf.json"
EXCLUDED_NAMES = "D:\\travail\\Stage\\NER\\NER\\NER\\optimisations\\name.json"
PIPELINE_RESULT = "D:\\travail\\Stage\\NER\\NER\\NER\\Results"
CASEN_PATH = "D:\\travail\\Stage\\Stage_NER\\CasEN_fr\\CasEN_fr.2.0\\CasEN.ipynb"
CASEN_CORPUS_FOLDER = "D:\\travail\\Stage\\NER\\NER\\Results\\Corpus"
CASEN_RESULT_FOLDER = "D:\\travail\\Stage\\NER\\NER\\Results\\CasEN\\Res_CasEN"
LOG_FOLDER = "D:\\travail\\Stage\\NER\\NER\\NER\\Logs"
ARCHIVES_FOLDER = "D:\\travail\\Stage\\NER\\NER\\NER\\Archives"
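
Most of the constants above share one prefix, so they can be derived from a single base folder instead of being repeated as absolute strings. The following is only a sketch: `BASE`, `build_paths` and `PATHS` are hypothetical names that do not exist in the notebook, and `PureWindowsPath` is used so that the example behaves the same on any operating system (inside the notebook, `pathlib.Path` would be the natural choice).

```python
from pathlib import PureWindowsPath

# Hypothetical base folder: the common prefix of the paths defined above.
BASE = PureWindowsPath(r"D:\travail\Stage\NER\NER")


def build_paths(base: PureWindowsPath) -> dict[str, PureWindowsPath]:
    """Derive some of the notebook's constants from one base folder."""
    project = base / "NER"
    return {
        "EXCEL_DATA": base / "Ressources" / "20231101_raw.xlsx",
        "EXCEL_CORRECTION": base / "Ressources" / "20231101_correction.xlsx",
        "GRF": project / "optimisations" / "grf.json",
        "EXCLUDED_NAMES": project / "optimisations" / "name.json",
        "LOG_FOLDER": project / "Logs",
        "ARCHIVES_FOLDER": project / "Archives",
    }


PATHS = build_paths(BASE)
```

Moving the project then means changing `BASE` only; `str(PATHS["GRF"])` gives the plain string expected by the wrappers.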

In [3]:
# Load the data
DATAS = pd.read_excel(EXCEL_DATA)

# 🧩 CasEN Configuration

This guide describes how to configure and initialize the **CasEN** tool for processing and analyzing corpora.

---

## ⚙️ Parameters Overview

| Parameter         | Type             | Description                                                                                                 |
|------------------|-------------------|-------------------------------------------------------------------------------------------------------------|
| `path`           | `str`             | Path to the `CasEN.ipynb` notebook that executes CasEN.                                                     |
| `corpus_folder`  | `str`             | Folder containing the input file(s) to be analyzed.                                                         |
| `result_folder`  | `str`             | Folder where output files generated by CasEN will be saved.                                                 |
| `data`           | `DataFrame \| str`| Input data loaded from an Excel file.                                                                       |
| `remove_MISC`    | `bool`            | If `True`, removes all MISC tags from the output.                                                           |
| `archive_folder` | `str \| None`     | Optional folder to archive corpora and result files (`None` disables archiving).                            |
| `single_corpus`  | `bool`            | If `True`, produces a single corpus file; otherwise, one per entry in the `data`.                           |
| `run_casEN`      | `bool`            | If `True`, executes CasEN. If `False`, assumes data already exists in `corpus_folder` and `result_folder`.  |
| `logging`        | `bool`            | Enables logging of key function execution times to a log file.                                              |
| `log_folder`     | `str`             | Path to the folder that will contain the log file.                                                          |
| `timer`          | `bool`            | Displays execution time in the console during runtime.                                                      |
| `verbose`        | `bool`            | Enables detailed debug output in the console.                                                               |

---

💡 **Tip**: Use `remove_MISC=True` when you want to exclude miscellaneous annotations from your output.
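
To make the effect of `remove_MISC` concrete, here is a minimal sketch of the filtering it implies. It is not the `CasEN` class's own code: `drop_misc` is a hypothetical helper, the rows are plain dictionaries rather than a DataFrame, and the keys `NER` and `NER_label` are borrowed from the column names of the final result table.

```python
def drop_misc(rows: list[dict]) -> list[dict]:
    """Keep every entity row whose label is not MISC."""
    return [row for row in rows if row["NER_label"] != "MISC"]


# Illustrative rows: the same surface form can carry different labels.
sample = [
    {"NER": "Sunny", "NER_label": "MISC"},
    {"NER": "Sunny", "NER_label": "LOC"},
    {"NER": "Nora", "NER_label": "PER"},
]

filtered = drop_misc(sample)
```

Only the MISC row is dropped; the `LOC` and `PER` rows are kept even when they share their text with a removed entity.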


In [16]:
from tools import casen
importlib.reload(casen)
from tools.casen import CasEN

CASEN = CasEN(
    path=CASEN_PATH,
    corpus_folder="D:\\travail\\Stage\\NER\\NER\\NER\\Archives\\20250606_210209_corpus",
    result_folder="D:\\travail\\Stage\\NER\\NER\\NER\\Archives\\20250606_210209_results",
    data=DATAS,
    remove_MISC=True,
    archive_folder=None,
    single_corpus=True,
    run_casEN=False,
    logging=True,
    log_folder=LOG_FOLDER,
    timer=True,
    verbose=True,
)

CASEN_DF = CASEN.run()

[load] 1 file(s) loaded
get_entities in : 2.06s
CasEN in : 2.46s
run in : 2.46s
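
Lines such as `get_entities in : 2.06s` are what the `timer` option prints. How `tools.casen` produces them is not shown in this notebook; the sketch below is one common way to get that kind of output, using a hypothetical `timed` decorator whose `sink` argument decides where the message goes (the console by default, or any callable such as a logger method).

```python
import functools
import time
from typing import Callable


def timed(sink: Callable[[str], None] = print):
    """Report how long the wrapped function took, as '<name> in : 1.23s'."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs even if func raises, so failures are timed too.
                elapsed = time.perf_counter() - start
                sink(f"{func.__name__} in : {elapsed:.2f}s")
        return wrapper
    return decorator


messages: list[str] = []


@timed(sink=messages.append)  # collect messages instead of printing them
def get_entities() -> int:
    return sum(range(1000))


result = get_entities()
```

Passing `sink=print` reproduces the console behaviour, while a sink that writes to a file in `log_folder` would correspond to the `logging` option.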


# 🧩 spaCy Wrapper Configuration
This section outlines how to set up the `SpaCy` wrapper, which runs spaCy-based NLP processing on the input data.

| Parameter      | Type                | Description                                                                |
| -------------- | ------------------- | -------------------------------------------------------------------------- |
| `data`         | `DataFrame \| str`  | Input data, generally from an Excel file.                                  |
| `spaCy_model`  | `str`               | The name of the spaCy model to use (e.g., `"fr_core_news_sm"` for French). |
| `timer_option` | `bool`              | If `True`, execution time is shown in the console.                         |
| `log_option`   | `bool`              | If `True`, logs are saved for each processing step.                        |
| `log_path`     | `str`               | Path where log files should be stored.                                     |
| `verbose`      | `bool`              | Enables verbose mode for detailed output.                                  |


In [None]:
from tools import spacy_wrapper
importlib.reload(spacy_wrapper)
from tools.spacy_wrapper import SpaCy

SPACY = SpaCy(
    data=EXCEL_DATA,
    spaCy_model="fr_core_news_sm",
    timer_option=True,
    log_option=True,
    log_path=LOG_FOLDER,
    verbose=True,
)

SPACY_DF = SPACY.run()

# 🧩 Stanza Wrapper Configuration
This section describes how to configure and initialize the Stanza wrapper for linguistic analysis.

| Parameter    | Type                | Description                                                                          |
| ------------ | ------------------- | ------------------------------------------------------------------------------------ |
| `data`       | `DataFrame \| str`  | Input data, typically loaded from an Excel file.                                     |
| `use_gpu`    | `bool`              | If `True`, enables GPU acceleration for processing (recommended for large datasets). |
| `logging`    | `bool`              | Enables logging of processing steps and timing.                                      |
| `log_folder` | `str`               | Path to the folder where log files will be stored.                                   |
| `timer`      | `bool`              | Displays execution time in the console.                                              |
| `verbose`    | `bool`              | Enables detailed output for debugging purposes.                                      |


In [22]:
from tools import stanza_wrapper
importlib.reload(stanza_wrapper)
from tools.stanza_wrapper import Stanza

STANZA = Stanza(
    data=EXCEL_DATA,
    use_gpu=True,
    logging=True,
    log_folder=LOG_FOLDER,
    timer=True,
    verbose=False,
)

STANZA_DF = STANZA.run()

2025-06-09 13:25:23 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 428kB [00:00, ?B/s]                        
2025-06-09 13:25:23 INFO: Downloaded file to C:\Users\valen\stanza_resources\resources.json
2025-06-09 13:25:24 INFO: Loading these models for language: fr (French):
| Processor | Package            |
----------------------------------
| tokenize  | combined           |
| mwt       | combined           |
| ner       | wikinergold_charlm |

2025-06-09 13:25:24 INFO: Using device: cuda
2025-06-09 13:25:24 INFO: Loading: tokenize
2025-06-09 13:25:27 INFO: Loading: mwt
2025-06-09 13:25:27 INFO: Loading: ner
2025-06-09 13:25:31 INFO: Done loading processors!


0 - Entité : Sunny, Type : MISC
0 - Entité : affaire Haffner, Type : MISC
0 - Entité : Sunny, Type : LOC
0 - Entité : Nora, Type : PER
0 - Entité : Haffner, Type : PER
0 - Entité : Marcel, Type : PER
1 - Entité : Tristan Garil, Type : PER
1 - Entité : Delandin, Type : PER
1 - Entité : Estelle Delandin, Type : PER
1 - Entité : Magellan, Type : PER
1 - Entité : Selma Berrayah, Type : PER
3 - Entité : Père Noël, Type : MISC
7 - Entité : fête de la Bière, Type : MISC
7 - Entité : Saignac, Type : LOC
7 - Entité : Christophe Diaz, Type : PER
7 - Entité : Trois Amis, Type : ORG
7 - Entité : Magellan, Type : PER
7 - Entité : Claude Verague, Type : PER
8 - Entité : Magellan, Type : PER
8 - Entité : Berrayah, Type : PER
8 - Entité : Audrey Galvi, Type : PER
8 - Entité : Lucie Mauricourt, Type : PER
8 - Entité : Lorena Bardin, Type : PER
8 - Entité : Magellan, Type : PER
9 - Entité : Felix Winterberg, Type : PER
9 - Entité : Albert Einstein, Type : PER
9 - Entité : 

# NER

In [19]:
CASEN_DF = pd.read_excel("new_casen.xlsx")
SPACY_DF = pd.read_excel("new_spacy.xlsx")
STANZA_DF = pd.read_excel("stanza.xlsx")

In [None]:
from tools import ner
importlib.reload(ner)
from tools.ner import Ner


NER = Ner(
    data=DATAS,
    dfs=[CASEN_DF, SPACY_DF, STANZA_DF],
    casEN_priority_merge=True,
    casEN_graph_validation=GRF,
    extent_optimisation=False,
    remove_duplicate_rows=False,
    ner_result_folder=PIPELINE_RESULT,
    excluded_names=EXCLUDED_NAMES,
    make_excel_file=True,
    correction=EXCEL_CORRECTION,
    logging=True,
    log_folder=LOG_FOLDER,
    timer=True,
    verbose=True,
)

NER.run()

[merge] shape of dataframe n°0 : (12390, 9)
[merge] shape of dataframe n°1 : (20912, 7)
[merge] shape of dataframe n°2 : (20362, 6)
[merge] Final merged DataFrame shape: (34799, 9)
[merge] method value counts:
method
spaCy                 9385
Stanza                7369
spaCy_Stanza          5655
casEN_spaCy_Stanza    5465
casEN                 4645
casEN_Stanza          1873
casEN_spaCy            407
Name: count, dtype: int64
merge in : 0.93s
[casEN_optimisation] SpaCy only      : 9385 lignes
[casEN_optimisation] CasEN only      : 4584 lignes
[casEN_optimisation] CasEN_opti only      : 61 lignes
[casEN_optimisation] Intersection    : 0 lignes
casEN_optimisation in : 0.56s
[casEN_priority] Comparing casEN with: spaCy, spaCy_Stanza, casEN_spaCy_Stanza, casEN_Stanza, Stanza, casEN_spaCy, casEN_opti
[casEN_priority] Updated 408 rows from 'casEN' to 'casEN_priority'
[casEN_priority] spaCy : 9385 lignes
[casEN_priority] Stanza : 7369 lignes
[casEN_priority] spaCy_Stanza : 5655 lignes
[c

Unnamed: 0,manual cat,correct,extent,category,titles,NER,NER_label,desc,method,main_graph,second_graph,third_graph,file_id
0,PER,1,1.0,1,Faster than fear,Haffner,PER,"e s'adresser à... elle. En garde à vue, Haffne...",spaCy,,,,0.0
1,PER,1,1.0,1,Faster than fear,Haffner,PER,lle n'a plus rien à voir avec l'affaire Haffne...,spaCy_Stanza,,,,0.0
2,PER,1,1.0,1,Faster than fear,Nora,PER,"sunny . d ' ailleurs , elle est persuadée que ...",casEN_Stanza,grfpersGenerique,,,0.0
3,PER,1,1.0,1,Faster than fear,Marcel,PER,", haffner n ' avoue toujours pas où se trouve ...",casEN_spaCy_Stanza,grfpersGenerique,,,0.0
4,,,,,Faster than fear,Sunny,PER,Ralf a pu prouver son innocence et Sunny a été...,spaCy,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
34794,,,,,Geld gezocht,Shana,MISC,"de leurs deux fils , elias et noah , que shana...",Stanza,,,,9722.0
34795,PER,1,1.0,1,Geld gezocht,Elias,PER,"C'est aux côtés de leurs deux fils, Elias et N...",spaCy_Stanza,,,,9722.0
34796,,,,,Geld gezocht,Kamel,PER,. ils demandent alors l ' aide de kristel et k...,Stanza,,,,9722.0
34797,PER,1,1.0,1,Geld gezocht,Jelle,PER,"deux fils, Elias et Noah, que Shana et Jelle ...",spaCy_Stanza,,,,9722.0
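
The `method` column in the merge log and in the table above combines tool names, for example `spaCy_Stanza` or `casEN_spaCy_Stanza`. The sketch below shows one way such a label could be derived when several tools report the same entity. It is an assumption for illustration, not the `Ner` class's merge logic: `label_methods` and `TOOL_ORDER` are hypothetical, and entities are matched on `(file_id, entity, label)` only, whereas the real merge may also compare the surrounding context.

```python
from collections import defaultdict

# Order in which tool names appear in the combined labels above.
TOOL_ORDER = ("casEN", "spaCy", "Stanza")


def label_methods(hits):
    """Map each (file_id, entity, label) key to a combined method label.

    `hits` is an iterable of (tool, file_id, entity, label) tuples.
    """
    tools_by_key = defaultdict(set)
    for tool, file_id, entity, label in hits:
        tools_by_key[(file_id, entity, label)].add(tool)
    return {
        key: "_".join(tool for tool in TOOL_ORDER if tool in tools)
        for key, tools in tools_by_key.items()
    }


# Illustrative hits, shaped after the first rows of the result table.
hits = [
    ("spaCy", 0, "Haffner", "PER"),
    ("Stanza", 0, "Haffner", "PER"),
    ("casEN", 0, "Nora", "PER"),
    ("Stanza", 0, "Nora", "PER"),
    ("spaCy", 0, "Sunny", "PER"),
]

methods = label_methods(hits)
```

Joining the names in a fixed order keeps the labels stable regardless of the order in which the per-tool DataFrames are passed in.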
