# NER (Named Entity Recognition)

## ⚙️ Parameters Overview

| Parameter             | Type  | Description                                                                                      |
|-----------------------|-------|--------------------------------------------------------------------------------------------------|
| `EXCEL_DATA`          | `str` | Path to the raw Excel file containing the input data (loaded into `DATAS`).                      |
| `EXCEL_CORRECTION`    | `str` | Path to the Excel correction file, passed to `Ner` as `correction`.                              |
| `GRF`                 | `str` | Path to `grf.json`, passed to `Ner` as `casEN_graph_validation`.                                 |
| `EXCLUDED_NAMES`      | `str` | Path to `name.json`, passed to `Ner` as `excluded_names`.                                        |
| `PIPELINE_RESULT`     | `str` | Folder passed to `Ner` as `ner_result_folder`.                                                   |
| `CASEN_PATH`          | `str` | Path to the `CasEN.ipynb` notebook that executes CasEN.                                          |
| `CASEN_CORPUS_FOLDER` | `str` | Folder containing the input file(s) to be analyzed by CasEN.                                     |
| `CASEN_RESULT_FOLDER` | `str` | Folder where output files generated by CasEN will be saved.                                      |
| `LOG_FOLDER`          | `str` | Folder that will contain the log files.                                                          |
| `ARCHIVES_FOLDER`     | `str` | Archive folder. It is defined here but not used in the calls below (`archive_folder=None`).      |



In [14]:
import importlib
import pandas as pd

In [12]:
EXCEL_DATA = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\Ressources\\20231101_raw.xlsx"
EXCEL_CORRECTION = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\Ressources\\20231101_correction.xlsx"
GRF = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\NER\\optimisations\\grf.json"
EXCLUDED_NAMES = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\NER\\optimisations\\name.json"
PIPELINE_RESULT = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\NER\\Results"
CASEN_PATH = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage\\CasEN_fr.2.0\\CasEN.ipynb"
CASEN_CORPUS_FOLDER = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\Results\\Corpus"
CASEN_RESULT_FOLDER = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\Results\\CasEN\\Res_CasEN"
LOG_FOLDER = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\NER\\Logs"
ARCHIVES_FOLDER = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\src\\NER\\Archives"
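
The absolute paths above tie the notebook to one machine. As a minimal sketch, the same constants could be derived from two root folders with `pathlib`. The helper `build_paths` and the root folders passed to it are hypothetical and not part of the project; the relative layout simply mirrors the paths listed above.

```python
from pathlib import Path


def build_paths(project_root: Path, casen_root: Path) -> dict:
    """Derive the notebook's path constants from two root folders (hypothetical helper)."""
    src = project_root / "src"
    ner = src / "NER"
    return {
        "EXCEL_DATA": str(src / "Ressources" / "20231101_raw.xlsx"),
        "EXCEL_CORRECTION": str(src / "Ressources" / "20231101_correction.xlsx"),
        "GRF": str(ner / "optimisations" / "grf.json"),
        "EXCLUDED_NAMES": str(ner / "optimisations" / "name.json"),
        "PIPELINE_RESULT": str(ner / "Results"),
        "CASEN_PATH": str(casen_root / "CasEN.ipynb"),
        "CASEN_CORPUS_FOLDER": str(src / "Results" / "Corpus"),
        "CASEN_RESULT_FOLDER": str(src / "Results" / "CasEN" / "Res_CasEN"),
        "LOG_FOLDER": str(ner / "Logs"),
        "ARCHIVES_FOLDER": str(ner / "Archives"),
    }


# Assumed roots for illustration only; adjust them to the local checkout.
PATHS = build_paths(Path("Stage_NER") / "NER", Path("Stage") / "CasEN_fr.2.0")
EXCEL_DATA = PATHS["EXCEL_DATA"]
```

The values are converted to `str` because every constant above is a plain string.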

In [15]:
# Load the input data
DATAS = pd.read_excel(EXCEL_DATA)

# 🧩 CasEN Configuration

This guide describes how to configure and initialize the **CasEN** tool for processing and analyzing corpora.

---

## ⚙️ Parameters Overview

| Parameter         | Type             | Description                                                                                                 |
|------------------|-------------------|-------------------------------------------------------------------------------------------------------------|
| `path`           | `str`             | Path to the `CasEN.ipynb` notebook that executes CasEN.                                                     |
| `corpus_folder`  | `str`             | Folder containing the input file(s) to be analyzed.                                                         |
| `result_folder`  | `str`             | Folder where output files generated by CasEN will be saved.                                                 |
| `data`           | `DataFrame \| str`| Input data loaded from an Excel file.                                                                       |
| `remove_MISC`    | `bool`            | If `True`, removes all MISC tags from the output.                                                           |
| `archive_folder` | `str \| None`     | Optional folder to archive corpora and result files (`None` disables archiving).                            |
| `single_corpus`  | `bool`            | If `True`, produces a single corpus file; otherwise, one per entry in the `data`.                           |
| `run_casEN`      | `bool`            | If `True`, executes CasEN. If `False`, assumes data already exists in `corpus_folder` and `result_folder`.  |
| `logging`        | `bool`            | Enables logging of key function execution times to a log file.                                              |
| `log_folder`     | `str`             | Path to the folder that will contain the log file.                                                          |
| `timer`          | `bool`            | Displays execution time in the console during runtime.                                                      |
| `verbose`        | `bool`            | Enables detailed debug output in the console.                                                               |

---

💡 **Tip**: Use `remove_MISC=True` when you want to exclude miscellaneous annotations from your output.
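
To illustrate what dropping MISC tags means, here is a minimal sketch on plain dictionaries. It assumes each result row carries its tag under a `NER_label` key, as in the final result table of this notebook. The helper `drop_misc` is hypothetical and is not the code that CasEN runs internally.

```python
def drop_misc(rows, label_key="NER_label"):
    """Return only the rows whose label is not MISC (hypothetical helper)."""
    return [row for row in rows if row.get(label_key) != "MISC"]


rows = [
    {"NER": "Marcel", "NER_label": "PER"},
    {"NER": "Sunny", "NER_label": "MISC"},
]

kept = drop_misc(rows)
print(kept)  # [{'NER': 'Marcel', 'NER_label': 'PER'}]
```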


In [18]:
from tools import casen
importlib.reload(casen)
from tools.casen import CasEN

CASEN = CasEN(
    path=CASEN_PATH,
    corpus_folder=CASEN_CORPUS_FOLDER,
    result_folder=CASEN_RESULT_FOLDER,
    data=DATAS,
    remove_MISC=True,
    archive_folder=None,
    single_corpus=True,
    run_casEN=False,
    logging=True,
    log_folder=LOG_FOLDER,
    timer=True,
    verbose=True
)

CASEN_DF = CASEN.run()

[load] 1 file(s) loaded
get_entities in : 2.32s
CasEN in : 3.30s
CasEN : (12437, 9)
run in : 3.30s
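
The timing lines in the output above (for example `get_entities in : 2.32s`) are printed when `timer=True`. A minimal sketch of how such timing could be implemented with a decorator follows. The decorator `timed` and the stand-in function body are assumptions for illustration, not the project's actual implementation.

```python
import functools
import time


def timed(func):
    """Print '<name> in : <seconds>s' after each call, mimicking the log format above."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are timed too.
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} in : {elapsed:.2f}s")
    return wrapper


@timed
def get_entities(texts):
    # Stand-in for real work; the actual CasEN step is not shown here.
    return [text.upper() for text in texts]


result = get_entities(["marcel"])
```

`functools.wraps` keeps the original function name, which is what makes the printed label match the wrapped function.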


# 🧩 spaCy Wrapper Configuration
This section describes how to configure the `SpaCy` wrapper, which runs a spaCy model on the input data.

| Parameter      | Type                | Description                                                                |
| -------------- | ------------------- | -------------------------------------------------------------------------- |
| `data`         | `DataFrame \| str`  | Input data, generally from an Excel file.                                  |
| `spaCy_model`  | `str`               | The name of the spaCy model to use (e.g., `"fr_core_news_sm"` for French). |
| `timer_option` | `bool`              | If `True`, execution time is shown in the console.                         |
| `log_option`   | `bool`              | If `True`, logs are saved for each processing step.                        |
| `log_path`     | `str`               | Path where log files should be stored.                                     |
| `verbose`      | `bool`              | Enables verbose mode for detailed output.                                  |


In [19]:
from tools import spacy_wrapper
importlib.reload(spacy_wrapper)
from tools.spacy_wrapper import SpaCy

SPACY = SpaCy(
    data=EXCEL_DATA,
    spaCy_model="fr_core_news_sm",
    timer_option=True,
    log_option=True,
    log_path=LOG_FOLDER,
    verbose=True
)

SPACY_DF = SPACY.run()

[spaCy] spaCy version: 3.8.7
[spaCy] spaCy model: core_news_sm
SpaCy : (20912, 6)
run in : 85.29s


# 🧩 Stanza Wrapper Configuration
This section describes how to configure and initialize the `Stanza` wrapper, which runs a Stanza pipeline for named entity recognition on the input data.

| Parameter    | Type                | Description                                                                          |
| ------------ | ------------------- | ------------------------------------------------------------------------------------ |
| `data`       | `DataFrame \| str`  | Input data, typically loaded from an Excel file.                                     |
| `use_gpu`    | `bool`              | If `True`, requests GPU acceleration (recommended for large datasets). The run below still reports `Using device: cpu`. |
| `logging`    | `bool`              | Enables logging of processing steps and timing.                                      |
| `log_folder` | `str`               | Path to the folder where log files will be stored.                                   |
| `timer`      | `bool`              | Displays execution time in the console.                                              |
| `verbose`    | `bool`              | Enables detailed output for debugging purposes.                                      |


In [25]:
from tools import stanza_wrapper
importlib.reload(stanza_wrapper)
from tools.stanza_wrapper import Stanza

STANZA = Stanza(
    data=EXCEL_DATA,
    use_gpu=True,
    logging=True,
    log_folder=LOG_FOLDER,
    timer=True,
    verbose=False
)

STANZA_DF = STANZA.run()

2025-06-10 11:14:19 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json: 428kB [00:00, 49.2MB/s]                    
2025-06-10 11:14:19 INFO: Downloaded file to C:\Users\valen\stanza_resources\resources.json
2025-06-10 11:14:20 INFO: Loading these models for language: fr (French):
| Processor | Package            |
----------------------------------
| tokenize  | combined           |
| mwt       | combined           |
| ner       | wikinergold_charlm |

2025-06-10 11:14:20 INFO: Using device: cpu
2025-06-10 11:14:20 INFO: Loading: tokenize
2025-06-10 11:14:36 INFO: Loading: mwt
2025-06-10 11:14:36 INFO: Loading: ner
2025-06-10 11:14:40 INFO: Done loading processors!


run in : 970.38s


# NER

In [None]:
# CASEN_DF = pd.read_excel("new_casen.xlsx")
# SPACY_DF = pd.read_excel("new_spacy.xlsx")
# STANZA_DF = pd.read_excel("new_stanza.xlsx")

In [32]:
from tools import ner
importlib.reload(ner)
from tools.ner import Ner


NER = Ner(
    data=DATAS,
    dfs=[CASEN_DF, SPACY_DF, STANZA_DF],
    casEN_priority_merge=True,
    casEN_graph_validation=GRF,
    extent_optimisation=False,
    remove_duplicate_rows=False,
    ner_result_folder=PIPELINE_RESULT,
    excluded_names=EXCLUDED_NAMES,
    make_excel_file=True,
    correction=EXCEL_CORRECTION,
    logging=True,
    log_folder=LOG_FOLDER,
    timer=True,
    verbose=True
)

NER.run()

[merge] shape of dataframe n°0 : (12437, 9)
[merge] shape of dataframe n°1 : (20912, 6)
[merge] shape of dataframe n°2 : (20361, 6)
[merge] Final merged DataFrame shape: (34963, 9)
[merge] method value counts:
method
spaCy                 9424
Stanza                7428
spaCy_Stanza          5674
casEN_spaCy_Stanza    5447
casEN                 4811
casEN_Stanza          1812
casEN_spaCy            367
Name: count, dtype: int64
merge in : 1.84s
[casEN_optimisation] SpaCy only      : 9424 lignes
[casEN_optimisation] CasEN only      : 4317 lignes
[casEN_optimisation] CasEN_opti only      : 494 lignes
[casEN_optimisation] Intersection    : 0 lignes
casEN_optimisation in : 0.67s
[composite_entity_priority] Composite methods: ['spaCy_Stanza', 'casEN_spaCy_Stanza', 'casEN_Stanza', 'casEN_spaCy']
[composite_entity_priority] Atomic methods: ['spaCy', 'Stanza', 'casEN']
[composite_entity_priority] Updated 1627 rows to _priority.
[composite_entity_priority] spaCy : 9424 lignes
[composite_enti

Unnamed: 0,manual cat,correct,extent,category,titles,NER,NER_label,desc,method,main_graph,second_graph,third_graph,file_id
0,PER,1,1.0,1,Faster than fear,Haffner,PER,"voir avec l'affaire Haffner, mais celui-ci",spaCy,,,,0.0
1,PER,1,1.0,1,Faster than fear,Haffner,PER,"voir avec l'affaire Haffner, mais celui-ci",spaCy_Stanza,,,,0.0
2,PER,1,1.0,1,Faster than fear,Marcel,PER,toujours pas où se trouve Marcel.,casEN_spaCy_Stanza,grfpersGenerique,,,0.0
3,PER,0,1.0,PER,Faster than fear,Sunny,MISC,pu prouver son innocence et Sunny a été suspen...,Stanza,,,,0.0
4,PER,0,1.0,PER,Faster than fear,Sunny,LOC,pu prouver son innocence et Sunny a été suspen...,spaCy_Stanza,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
34958,PER,0,1.0,PER,Geld gezocht,Kristel,ORG,alors l'aide de Kristel et Kamel pour réduire ...,casEN,grftagOrgFunder,grfroleName,,9722.0
34959,PER,0,1.0,PER,Geld gezocht,Kristel,LOC,alors l'aide de Kristel et Kamel pour réduire ...,spaCy,,,,9722.0
34960,,,,,Geld gezocht,Kamel,PER,'aide de Kristel et Kamel pour réduire leurs d...,Stanza,,,,9722.0
34961,PER,1,1.0,1,Geld gezocht,Shana,PER,"Elias et Noah, que Shana et Jelle vivent à Lede",spaCy,,,,9722.0
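
The `method` values in the merge log and in the table above (`spaCy`, `spaCy_Stanza`, `casEN_spaCy_Stanza`, ...) name the tools that produced each entity. The sketch below shows one way such labels could be built. It assumes entities are matched on the triple `(file_id, entity, label)`; the matching key that `Ner` actually uses is not shown in this notebook, and `merge_methods` is a hypothetical helper.

```python
from collections import defaultdict

# Fixed order so that labels always read casEN, then spaCy, then Stanza.
TOOL_ORDER = ["casEN", "spaCy", "Stanza"]


def merge_methods(tool_rows):
    """Map each (file_id, entity, label) key to a combined method label.

    tool_rows: dict mapping a tool name to a list of (file_id, entity, label) tuples.
    """
    found_by = defaultdict(set)
    for tool, rows in tool_rows.items():
        for key in rows:
            found_by[key].add(tool)
    return {
        key: "_".join(tool for tool in TOOL_ORDER if tool in tools)
        for key, tools in found_by.items()
    }


merged = merge_methods({
    "casEN": [(0, "Marcel", "PER")],
    "spaCy": [(0, "Marcel", "PER"), (0, "Haffner", "PER")],
    "Stanza": [(0, "Marcel", "PER")],
})
print(merged[(0, "Marcel", "PER")])   # casEN_spaCy_Stanza
print(merged[(0, "Haffner", "PER")])  # spaCy
```

Because the key here ignores the position of the entity in the text, this sketch cannot reproduce rows such as the two `Haffner` lines above, where the same entity appears once as `spaCy` and once as `spaCy_Stanza`.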


In [None]:
"Jean est tombé Malade. Mr Jean Pierre est malade."