# 🧠 NER Pipeline
This notebook demonstrates a complete Named Entity Recognition (NER) pipeline that combines the results of two systems: spaCy and CasEN. It covers data loading, entity extraction, result merging, and export.

## 📄 Overview
The pipeline:

- Loads input data from an Excel file.
- Applies NER using spaCy with the selected French model.
- Formats the data and exports the text input for CasEN.
- Runs the CasEN engine on the generated text.
- Parses the CasEN output using BeautifulSoup and regular expressions.
- Merges the results from both NER systems (spaCy + CasEN).
- Exports the merged results to a final Excel file.
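
The parsing step listed above can be illustrated with a minimal sketch. It assumes that the CasEN output wraps entities in XML-like tags; the tag names (`persName`, `placeName`, `orgName`), the sample sentence, and the `extract_entities` helper are hypothetical and are not taken from `tools/casen.py`. To stay dependency-free, the sketch uses only regular expressions rather than BeautifulSoup.

```python
import re

# Hypothetical annotated sentence: the tag names are assumptions made for
# illustration, not confirmed CasEN output.
SAMPLE = (
    "<s>Ce soir, <persName><forename>Jean</forename> Dupont</persName> "
    "invite le maire de <placeName>Lyon</placeName>.</s>"
)

# Match an outer entity tag and capture everything up to its closing tag.
ENTITY_PATTERN = re.compile(
    r"<(?P<label>persName|placeName|orgName)>(?P<text>.*?)</(?P=label)>",
    re.DOTALL,
)
# Used to strip nested tags (e.g. <forename>) from the captured text.
INNER_TAG = re.compile(r"<[^>]+>")


def extract_entities(annotated):
    """Return (entity_text, label) pairs found in an annotated string."""
    entities = []
    for match in ENTITY_PATTERN.finditer(annotated):
        text = INNER_TAG.sub("", match.group("text")).strip()
        entities.append((text, match.group("label")))
    return entities


entities = extract_entities(SAMPLE)
print(entities)
```

A real implementation would likely rely on BeautifulSoup to walk nested tags and keep regular expressions for cleanup, as the step description suggests.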



---

## 🛠 Dependencies

To run this notebook, make sure you have the following:

### ✅ Requirements

- **Python 3.8 or higher**
- **Jupyter Notebook** or **JupyterLab** (to run `.ipynb` files)

Install Jupyter Notebook:
```bash
pip install notebook
```
### 📦 Python Packages
```bash
pip install pandas spacy beautifulsoup4 openpyxl
python -m spacy download fr_core_news_sm
python -m spacy download fr_core_news_md
python -m spacy download fr_core_news_lg

```
---

In [2]:
!where python

c:\Users\valen\Documents\Informatique-L3\Stage_NER\NER\.venv\Scripts\python.exe
C:\Users\valen\AppData\Local\Microsoft\WindowsApps\python.exe


In [3]:
from tools import casen, spacy_wrapper, ner
import importlib

In [4]:
EXCEL_DATA = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\Ressources\\20231101_raw.xlsx"
EXCEL_CORRECTION = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\Ressources\\20231101_Digital3D_Tele-Loisirs _telerama_NER_weekday_evaluation_v3(1).xlsx"
GRF = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\NER\\optimisations\\grf.json"
EXCLUDED_NAMES = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\NER\\optimisations\\name.json"
PIPELINE_RESULT = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\NER\\Results"
CASEN = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage\\CasEN_fr.2.0\\CasEN.ipynb"
CASEN_CORPUS_FOLDER = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\Results\\Corpus"
CASEN_RESULT_FOLDER = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\Results\\CasEN\\Res_CasEN_Analyse_synthese_grf"
LOG_FOLDER = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\NER\\Logs"
ARCHIVES_FOLDER = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\NER\\Archives"
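
The constants above repeat the same absolute prefix with escaped backslashes. One possible alternative, shown here only as a sketch, is to derive them from a single base directory with `pathlib`. The `BASE` name is an assumption, and `PureWindowsPath` is used so the example behaves the same on any operating system; in the notebook itself `pathlib.Path` would be the natural choice.

```python
from pathlib import PureWindowsPath

# Hypothetical base directory, copied from the prefix shared by the constants.
BASE = PureWindowsPath(r"C:\Users\valen\Documents\Informatique-L3\Stage_NER\NER")

# Each constant is built relative to BASE instead of repeating the prefix.
EXCEL_DATA = BASE / "Ressources" / "20231101_raw.xlsx"
GRF = BASE / "NER" / "optimisations" / "grf.json"
LOG_FOLDER = BASE / "NER" / "Logs"

# str() gives the plain string form if an API only accepts strings.
print(str(EXCEL_DATA))
print(GRF.name, LOG_FOLDER.parent)
```

Changing the project location then only requires editing `BASE`.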

In [6]:
importlib.reload(spacy_wrapper)
from tools.spacy_wrapper import SpaCy

s = SpaCy(
    data = EXCEL_DATA,
    spaCy_model = "fr_core_news_sm",
    timer_option = True,
    log_option = True,
    log_path = LOG_FOLDER,
    verbose = True
)
s_df = s.run()
#s.df.to_excel("new_spacy.xlsx")

[spaCy] spaCy version: 3.8.7
[spaCy] spaCy model: core_news_sm
run in : 124.90s
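
The `run in : 124.90s` line comes from the timer option. How the wrapper implements it is not shown in this notebook; the sketch below is one common way to produce output in that format, using a decorator. The `timed` decorator and the dummy `run` function are hypothetical.

```python
import functools
import time


def timed(func):
    """Print '<function name> in : <seconds>s' after each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{func.__name__} in : {elapsed:.2f}s")
        return result
    return wrapper


@timed
def run():
    # Stand-in for the real work (loading the model, processing the texts).
    time.sleep(0.05)
    return "done"


status = run()
```

`time.perf_counter` is preferred over `time.time` for measuring durations because it is monotonic.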


In [8]:
importlib.reload(casen)
from tools.casen import CasEN


c = CasEN(
    path= CASEN,
    corpus_folder = CASEN_CORPUS_FOLDER,
    result_folder = CASEN_RESULT_FOLDER,
    data = EXCEL_DATA,
    remove_MISC = True,
    archive_folder = ARCHIVES_FOLDER,
    single_corpus = True,
    run_casEN = False,
    logging = True,
    log_folder = LOG_FOLDER,
    timer = True,
    verbose = True
)
c_df = c.run()
# c.df.to_excel("new_casen.xlsx")

load_data in : 1.86s
[load] 1 file(s) loaded
get_entities in : 2.30s
CasEN in : 2.92s
run in : 2.93s


In [10]:
importlib.reload(ner)
from tools.ner import NER

N = NER(
    spaCy = s_df,
    casEN = c_df,
    data = EXCEL_DATA,
    casEN_priority_merge = True,
    casEN_graph_validation = GRF,
    remove_duplicate_rows = False,
    NER_result_folder = PIPELINE_RESULT,
    excluded_names = EXCLUDED_NAMES,
    correction = "C:\\Users\\valen\\Documents\\Informatique-L3\\Stage\\Stage\\NER\\Excels\\NER_analyses.xlsx",
    logging = True,
    log_folder = LOG_FOLDER,
    timer = True,
    verbose = True
)
N.run()

[merge] Found 9731 valid rows
[merge] description without entities : 4243
merge in : 0.52s
[opt] count par méthode après :
method
spaCy           15020
intersection     7503
casEN            5445
casEN_opti        806
Name: count, dtype: int64
casEN_optimisation in : 0.74s
[casEN_priority] 2237 conflicting entities found (spaCy vs casEN)
[casEN_priority] 1571 added.
casEN_priority in : 0.15s
apply_correction in : 2.68s


'File saved in : c:\\Users\\valen\\Documents\\Informatique-L3\\Stage_NER\\NER\\NER\\NER.xlsx'
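
The log above labels each entity with a `method` (`spaCy`, `casEN`, `intersection`) and reports conflicts resolved by `casEN_priority`. The sketch below shows one way such a merge could work for a single description. It is an assumption about the behaviour of `casEN_priority_merge`, not the code in `tools/ner.py`, and the `merge_entities` helper and sample entities are hypothetical.

```python
def merge_entities(spacy_ents, casen_ents, casen_priority=True):
    """Merge two {entity_text: label} dicts into (text, label, method) rows."""
    merged = []
    for text in sorted(set(spacy_ents) | set(casen_ents)):
        s_label = spacy_ents.get(text)
        c_label = casen_ents.get(text)
        if s_label == c_label:
            # Both systems found the entity with the same label.
            merged.append((text, s_label, "intersection"))
        elif c_label is None:
            merged.append((text, s_label, "spaCy"))
        elif s_label is None:
            merged.append((text, c_label, "casEN"))
        elif casen_priority:
            # Conflict: same text, different labels; CasEN wins.
            merged.append((text, c_label, "casEN"))
        else:
            merged.append((text, s_label, "spaCy"))
    return merged


# Hypothetical entities for one description.
spacy_found = {"Jean Dupont": "PER", "Lyon": "LOC", "Canal": "ORG"}
casen_found = {"Jean Dupont": "PER", "Lyon": "ORG", "Paris": "LOC"}

rows = merge_entities(spacy_found, casen_found)
for row in rows:
    print(row)
```

Here `Lyon` is a conflict (`LOC` versus `ORG`), so with `casen_priority=True` the CasEN label is kept.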