# NER (Named Entity Recognition)

## ⚙️ Parameters Overview

| Parameter             | Type  | Description                                                                        |
|-----------------------|-------|------------------------------------------------------------------------------------|
| `EXCEL_DATA`          | `str` | Path to the Excel file containing the raw input data.                              |
| `EXCEL_CORRECTION`    | `str` | Path to the Excel file containing the corrections (passed to `Ner` as `correction`). |
| `GRF`                 | `str` | Path to the `grf.json` file (passed to `Ner` as `casEN_graph_validation`).         |
| `EXCLUDED_NAMES`      | `str` | Path to the `name.json` file (passed to `Ner` as `excluded_names`).                |
| `PIPELINE_RESULT`     | `str` | Folder where the NER pipeline results are saved (passed to `Ner` as `ner_result_folder`). |
| `CASEN_PATH`          | `str` | Path to the `CasEN.ipynb` notebook that executes CasEN.                            |
| `CASEN_CORPUS_FOLDER` | `str` | Folder containing the input file(s) to be analyzed by CasEN.                       |
| `CASEN_RESULT_FOLDER` | `str` | Folder where output files generated by CasEN are saved.                            |
| `LOG_FOLDER`          | `str` | Folder that will contain the log files.                                            |
| `ARCHIVES_FOLDER`     | `str` | Folder used to archive corpora and result files.                                   |



In [1]:
import importlib
import pandas as pd

In [2]:
EXCEL_DATA = "D:\\travail\\Stage\\NER\\NER\\Ressources\\20231101_raw.xlsx"
EXCEL_CORRECTION = "D:\\travail\\Stage\\NER\\NER\\Ressources\\20231101_correction.xlsx"
GRF = "D:\\travail\\Stage\\NER\\NER\\NER\\optimisations\\grf.json"
EXCLUDED_NAMES = "D:\\travail\\Stage\\NER\\NER\\NER\\optimisations\\name.json"
PIPELINE_RESULT = "D:\\travail\\Stage\\NER\\NER\\NER\\Results"
CASEN_PATH = "D:\\travail\\Stage\\Stage_NER\\CasEN_fr\\CasEN_fr.2.0\\CasEN.ipynb"
CASEN_CORPUS_FOLDER = "D:\\travail\\Stage\\NER\\NER\\Results\\Corpus"
CASEN_RESULT_FOLDER = "D:\\travail\\Stage\\NER\\NER\\Results\\CasEN\\Res_CasEN"
LOG_FOLDER = "D:\\travail\\Stage\\NER\\NER\\NER\\Logs"
ARCHIVES_FOLDER = "D:\\travail\\Stage\\NER\\NER\\NER\\Archives"

In [3]:
# Load the input data
DATAS = pd.read_excel(EXCEL_DATA)
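
The absolute paths defined above are tied to one machine. As a minimal sketch, assuming the same folder layout and a hypothetical `BASE` root (not part of the original notebook), a few of them could be derived from a single directory with `pathlib`. `PureWindowsPath` is used here only so that the resulting strings are identical on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical project root; adjust it to your own checkout.
BASE = PureWindowsPath(r"D:\travail\Stage\NER\NER")

# Same layout as the constants defined above, derived from one root.
EXCEL_DATA = str(BASE / "Ressources" / "20231101_raw.xlsx")
EXCEL_CORRECTION = str(BASE / "Ressources" / "20231101_correction.xlsx")
LOG_FOLDER = str(BASE / "NER" / "Logs")
ARCHIVES_FOLDER = str(BASE / "NER" / "Archives")

print(EXCEL_DATA)
```

Changing `BASE` then relocates every derived path at once; on a real Windows machine `pathlib.Path` would be the more usual choice.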

# 🧩 CasEN Configuration

This guide describes how to configure and initialize the **CasEN** tool for processing and analyzing corpora.

---

## ⚙️ Parameters Overview

| Parameter         | Type             | Description                                                                                                 |
|------------------|-------------------|-------------------------------------------------------------------------------------------------------------|
| `path`           | `str`             | Path to the `CasEN.ipynb` notebook that executes CasEN.                                                     |
| `corpus_folder`  | `str`             | Folder containing the input file(s) to be analyzed.                                                         |
| `result_folder`  | `str`             | Folder where output files generated by CasEN will be saved.                                                 |
| `data`           | `DataFrame \| str`| Input data loaded from an Excel file.                                                                       |
| `remove_MISC`    | `bool`            | If `True`, removes all MISC tags from the output.                                                           |
| `archive_folder` | `str \| None`     | Optional folder to archive corpora and result files (`None` disables archiving).                            |
| `single_corpus`  | `bool`            | If `True`, produces a single corpus file; otherwise, one per entry in the `data`.                           |
| `run_casEN`      | `bool`            | If `True`, executes CasEN. If `False`, assumes data already exists in `corpus_folder` and `result_folder`.  |
| `logging`        | `bool`            | Enables logging of key function execution times to a log file.                                              |
| `log_folder`     | `str`             | Path to the folder that will contain the log file.                                                          |
| `timer`          | `bool`            | Displays execution time in the console during runtime.                                                      |
| `verbose`        | `bool`            | Enables detailed debug output in the console.                                                               |

---

💡 **Tip**: Use `remove_MISC=True` when you want to exclude miscellaneous annotations from your output.
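
To illustrate the effect of `remove_MISC=True`, here is a minimal sketch that drops entities labelled `MISC` from a list of rows. It is not the `CasEN` class's implementation. The `NER` and `NER_label` keys mirror the column names of the final pipeline output, and `drop_misc` is a hypothetical helper.

```python
# Rows shaped like the pipeline output, reduced to two columns for brevity.
rows = [
    {"NER": "Haffner", "NER_label": "PER"},
    {"NER": "Sunny", "NER_label": "MISC"},
    {"NER": "Nora", "NER_label": "LOC"},
]


def drop_misc(entities):
    """Return only the entities whose label is not MISC (hypothetical helper)."""
    return [e for e in entities if e["NER_label"] != "MISC"]


filtered = drop_misc(rows)
print([e["NER"] for e in filtered])
```

On a pandas DataFrame the same filter would be a boolean mask on the `NER_label` column.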


In [16]:
from tools import casen
importlib.reload(casen)
from tools.casen import CasEN

CASEN = CasEN(
    path=CASEN_PATH,
    corpus_folder="D:\\travail\\Stage\\NER\\NER\\NER\\Archives\\20250606_210209_corpus",
    result_folder="D:\\travail\\Stage\\NER\\NER\\NER\\Archives\\20250606_210209_results",
    data=DATAS,
    remove_MISC=True,
    archive_folder=None,
    single_corpus=True,
    run_casEN=False,
    logging=True,
    log_folder=LOG_FOLDER,
    timer=True,
    verbose=True,
)

CASEN_DF = CASEN.run()

[load] 1 file(s) loaded
get_entities in : 2.06s
CasEN in : 2.46s
run in : 2.46s
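
The timing lines above (for example `run in : 2.46s`) are what the `timer` option prints. A decorator along the following lines could produce that format. This is a sketch under assumptions: `timed` is a hypothetical name, the stand-in `get_entities` below is not the real function, and the actual `tools` package may measure time differently.

```python
import functools
import time


def timed(func):
    """Print '<name> in : <seconds>s' after each call (hypothetical helper)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} in : {elapsed:.2f}s")
        return result
    return wrapper


@timed
def get_entities(texts):
    # Stand-in workload; the real function is not shown in this notebook.
    return [t.upper() for t in texts]


entities = get_entities(["nora", "marcel"])
```

`functools.wraps` keeps the wrapped function's name, so the printed label matches the function being timed.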


# 🧩 spaCy Wrapper Configuration
This section outlines how to set up the `SpaCy` wrapper class for NLP processing with spaCy.

| Parameter      | Type                | Description                                                                |
| -------------- | ------------------- | -------------------------------------------------------------------------- |
| `data`         | `DataFrame \| str`  | Input data, generally from an Excel file.                                  |
| `spaCy_model`  | `str`               | The name of the spaCy model to use (e.g., `"fr_core_news_sm"` for French). |
| `timer_option` | `bool`              | If `True`, execution time is shown in the console.                         |
| `log_option`   | `bool`              | If `True`, logs are saved for each processing step.                        |
| `log_path`     | `str`               | Path where log files should be stored.                                     |
| `verbose`      | `bool`              | Enables verbose mode for detailed output.                                  |


In [None]:
from tools import spacy_wrapper
importlib.reload(spacy_wrapper)
from tools.spacy_wrapper import SpaCy

SPACY = SpaCy(
    data=EXCEL_DATA,
    spaCy_model="fr_core_news_sm",
    timer_option=True,
    log_option=True,
    log_path=LOG_FOLDER,
    verbose=True,
)

SPACY_DF = SPACY.run()

# 🧩 Stanza Wrapper Configuration
This section describes how to configure and initialize the Stanza wrapper for linguistic analysis.

| Parameter    | Type                | Description                                                                          |
| ------------ | ------------------- | ------------------------------------------------------------------------------------ |
| `data`       | `DataFrame \| str`  | Input data, typically loaded from an Excel file.                                     |
| `use_gpu`    | `bool`              | If `True`, enables GPU acceleration for processing (recommended for large datasets). |
| `logging`    | `bool`              | Enables logging of processing steps and timing.                                      |
| `log_folder` | `str`               | Path to the folder where log files will be stored.                                   |
| `timer`      | `bool`              | Displays execution time in the console.                                              |
| `verbose`    | `bool`              | Enables detailed output for debugging purposes.                                      |


In [None]:
from tools import stanza_wrapper
importlib.reload(stanza_wrapper)
from tools.stanza_wrapper import Stanza

STANZA = Stanza(
    data=EXCEL_DATA,
    use_gpu=True,
    logging=True,
    log_folder=LOG_FOLDER,
    timer=True,
    verbose=False,
)

STANZA_DF = STANZA.run()

# NER

In [19]:
CASEN_DF = pd.read_excel("new_casen.xlsx")
SPACY_DF = pd.read_excel("new_spacy.xlsx")
STANZA_DF = pd.read_excel("stanza.xlsx")

In [21]:
from tools import ner
importlib.reload(ner)
from tools.ner import Ner


NER = Ner(
    data=DATAS,
    dfs=[CASEN_DF, SPACY_DF, STANZA_DF],
    casEN_priority_merge=True,
    casEN_graph_validation=GRF,
    extent_optimisation=False,
    remove_duplicate_rows=False,
    ner_result_folder=PIPELINE_RESULT,
    excluded_names=EXCLUDED_NAMES,
    make_excel_file=True,
    correction=EXCEL_CORRECTION,
    logging=True,
    log_folder=LOG_FOLDER,
    timer=True,
    verbose=True,
)

NER.run()

[merge] shape of dataframe n°0 : (12390, 9)
[merge] shape of dataframe n°1 : (20912, 7)
[merge] shape of dataframe n°2 : (20361, 7)
[merge] Final merged DataFrame shape: (34797, 9)
[merge] method value counts:
method
spaCy                 9384
Stanza                7367
spaCy_Stanza          5656
casEN_spaCy_Stanza    5465
casEN                 4645
casEN_Stanza          1873
casEN_spaCy            407
Name: count, dtype: int64
merge in : 0.91s
[casEN_optimisation] SpaCy only      : 9384 lignes
[casEN_optimisation] CasEN only      : 4584 lignes
[casEN_optimisation] CasEN_opti only      : 61 lignes
[casEN_optimisation] Intersection    : 0 lignes
casEN_optimisation in : 0.59s
[casEN_priority] Comparing casEN with: spaCy, Stanza
[casEN_priority] 539 conflicts found between spaCy and casEN
[casEN_priority] 500 conflicts found between Stanza and casEN
[casEN_priority] 465 new PER entities added based on CasEN conflicts.
[casEN_priority] spaCy : 9384 lignes
[casEN_priority] Stanza : 7367 
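
The figures in the merge log can be cross-checked with simple arithmetic. The sketch below copies the `method` value counts from the log and checks that they add up to the row count of the final merged DataFrame. Reading every `method` label that contains `casEN` as "CasEN took part in this row" is an assumption made for illustration.

```python
# Counts copied from the "[merge] method value counts" log above.
method_counts = {
    "spaCy": 9384,
    "Stanza": 7367,
    "spaCy_Stanza": 5656,
    "casEN_spaCy_Stanza": 5465,
    "casEN": 4645,
    "casEN_Stanza": 1873,
    "casEN_spaCy": 407,
}
merged_rows = 34797  # first value of the final merged DataFrame shape

total = sum(method_counts.values())
assert total == merged_rows, f"{total} != {merged_rows}"

# Rows in which CasEN took part, alone or together with another tool.
casen_rows = sum(n for name, n in method_counts.items() if "casEN" in name)
print(total, casen_rows)  # prints: 34797 12390
```

The second figure, 12390, equals the row count logged for the first input DataFrame, and the `casEN_optimisation` counts are consistent with the table as well: 4584 + 61 = 4645, the `casEN` count.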

Unnamed: 0,manual cat,correct,extent,category,titles,NER,NER_label,desc,method,main_graph,second_graph,third_graph,file_id
0,PER,1,1.0,1,Faster than fear,Haffner,PER,"e s'adresser à... elle. En garde à vue, Haffne...",spaCy,,,,0.0
1,PER,1,1.0,1,Faster than fear,Marcel,PER,", haffner n ' avoue toujours pas où se trouve ...",casEN_spaCy_Stanza,grfpersGenerique,,,0.0
2,PER,1,1.0,1,Faster than fear,Nora,PER,"sunny . d ' ailleurs , elle est persuadée que ...",casEN_Stanza,grfpersGenerique,,,0.0
3,,,,,Faster than fear,Nora,LOC,"nny. D'ailleurs, elle est persuadée que Nora a...",spaCy,,,,0.0
4,PER,0,1.0,PER,Faster than fear,Sunny,MISC,ralf a pu prouver son innocence et sunny a été...,Stanza,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
35179,PER,0,1.0,PER,Geld gezocht,Kamel,ORG,. ils demandent alors l ' aide de kristel et k...,casEN,grftagOrgFunder,,,9722.0
35180,PER,0,1.0,PER,Geld gezocht,Kamel,LOC,ls demandent alors l'aide de Kristel et Kamel ...,spaCy,,,,9722.0
35181,,,,,Geld gezocht,Kristel,PER,petit extra . ils demandent alors l ' aide de ...,Stanza,,,,9722.0
35182,PER,0,1.0,PER,Geld gezocht,Kristel,ORG,petit extra . ils demandent alors l ' aide de ...,casEN,grftagOrgFunder,,,9722.0
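
In the rows shown above, the same entity sometimes receives different labels depending on the tool (for example `Nora` is `PER` for `casEN_Stanza` and `LOC` for `spaCy`). The sketch below shows one way to spot such disagreements. It only detects them and is not the resolution logic of `casEN_priority`. `find_label_conflicts` is a hypothetical helper, and `file_id` is written as an integer for readability.

```python
from collections import defaultdict

# (file_id, NER, NER_label, method) tuples copied from the rows shown above.
rows = [
    (0, "Marcel", "PER", "casEN_spaCy_Stanza"),
    (0, "Nora", "PER", "casEN_Stanza"),
    (0, "Nora", "LOC", "spaCy"),
    (9722, "Kamel", "ORG", "casEN"),
    (9722, "Kamel", "LOC", "spaCy"),
]


def find_label_conflicts(entities):
    """Map (file_id, entity) to its labels when more than one label was assigned."""
    labels = defaultdict(set)
    for file_id, name, label, _method in entities:
        labels[(file_id, name)].add(label)
    return {key: sorted(found) for key, found in labels.items() if len(found) > 1}


conflicts = find_label_conflicts(rows)
for (file_id, name), found in conflicts.items():
    print(file_id, name, found)
```

Entities with a single label, such as `Marcel` here, are left out of the result.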
