<h2 style="text-align:center;font-size:200%;">
    <b>Workplace Accident Database Textual Analysis through LLMs</b>
</h2>
<h3  style="text-align:center;">Keywords : 
    <span style="border-radius:7px;background-color:yellowgreen;color:white;padding:7px;">Large Language Models</span>
    <span style="border-radius:7px;background-color:yellowgreen;color:white;padding:7px;">Natural Language Processing</span>
    <span style="border-radius:7px;background-color:yellowgreen;color:white;padding:7px;">Mistral</span>
    <span style="border-radius:7px;background-color:yellowgreen;color:white;padding:7px;">Work Accidents</span>
    <span style="border-radius:7px;background-color:yellowgreen;color:white;padding:7px;">EHS</span>
</h3>


The [EPICEA database](https://www.inrs.fr/publications/bdd/epicea.html) is managed by [INRS](https://www.inrs.fr/), a French institute in charge of risk prevention in the workplace.

The purpose of this notebook is to build a tool that can:
1. bulk-extract the accident descriptions from the French EPICEA database,
2. apply an LLM prompt to extract structured information from the unstructured descriptions,
3. propose a first analysis of the database.

Epicea is a national, anonymous database gathering more than 21,000 cases of workplace accidents that have occurred since 1990 to employees covered by the general Social Security scheme. These accidents are fatal, serious, or significant for prevention.

The database is not exhaustive, since not all workplace accidents are recorded in it.

The anonymity of natural and legal persons is respected, and the origin of the information is protected.

Each record includes:
- The file number (incremented automatically): the higher the number, the more recent the accident
- The national technical committee (classification of the major activity sectors according to the amended decree of 17 October 1995)
- The company code (until 2015: risk code, a subdivision of the national technical committees; from January 2015: APE code according to the NAF nomenclature)
- The material factor closest to the injuries: object, equipment, material, installation, etc. involved in the accident
- The detailed narrative of the accident, possibly supplemented by attached documents (photos, cause trees, diagrams, etc.)

The material factor (or equipment involved) is structured and refers to a more or less detailed label. For example, 510210 refers to roofs made of fragile materials, 5102* to a part of a building or structure, and 51* to geographical areas and work locations.
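Because these codes are hierarchical, a record can be matched against a category with a simple prefix test. The sketch below is only an illustration: the labels are translated from the three examples above, and the helper names `matches_factor` and `label_chain` are hypothetical, not part of EPICEA or of this notebook.

```python
# Labels translated from the three examples in the text; not the full INRS nomenclature.
FACTOR_LABELS = {
    "51": "geographical areas and work locations",
    "5102": "part of a building or structure",
    "510210": "roofs made of fragile materials",
}


def matches_factor(code: str, pattern: str) -> bool:
    """Return True if `code` falls under `pattern` (e.g. '5102*' or '510210')."""
    if pattern.endswith("*"):
        # A trailing '*' means "any code starting with this prefix"
        return code.startswith(pattern[:-1])
    return code == pattern


def label_chain(code: str) -> list[str]:
    """List the known labels from the broadest category down to the most specific."""
    prefixes = sorted((p for p in FACTOR_LABELS if code.startswith(p)), key=len)
    return [FACTOR_LABELS[p] for p in prefixes]


print(matches_factor("510210", "5102*"))
print(label_chain("510210"))
```

Shorter prefixes select broader families of material factors, which is what makes the codes convenient for multi-criteria selection.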

A set of records is obtained through multi-criteria selection.

# <div style="text-align: left; background-color: yellowgreen; color: white; padding: 10px; line-height:1;border-radius:10px">1. Installing modules and dependencies</div>

In [4]:
# System path configuration (if necessary)
import sys
sys.path.append("C:/Users/arnaud/AppData/Roaming/Python/Python312/site-packages/")
sys.path.append("C:/Windows/System32/")
sys.path.append("C:/Users/Arnaud/AppData/Roaming/Python/Python312/site-packages/onnxruntime/capi/")

In [13]:
# Standard libraries
import datetime  # used for timestamps when logging LLM responses
import json
import re
import sys
import time
from ast import literal_eval
from enum import Enum
from io import StringIO
from os.path import exists
from pathlib import Path
from typing import List, Optional, Sequence, Generic, TypeVar
from urllib.request import urlopen

# Data and analysis libraries
import numpy as np
import pandas as pd

# Natural language processing and AI libraries
from langchain.callbacks import get_openai_callback
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_mistralai import ChatMistralAI, MistralAIEmbeddings
from langchain_openai import ChatOpenAI, OpenAI
from ollama import Client
import openai

# Web scraping and automation libraries
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Visualization and user interface libraries
from IPython.display import Markdown as md
from tqdm import tqdm

# Data validation and modeling libraries
from pydantic import BaseModel, Field, Extra, validator, ConfigDict, field_validator

# Ollama specific imports
from langchain_community.llms import Ollama

In [6]:
# Loading of helpful functions located in helper.py
from helper import *

In [7]:
# Configure Chrome options for headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
# Initialize the Chrome driver with the specified options
driver = webdriver.Chrome(options=chrome_options)
waiting_time = 1.5

# <div style="text-align: left; background-color: yellowgreen; color: white; padding: 10px; line-height:1;border-radius:10px">2. Data Collection from the INRS website</div>

Web scraping is a technique used in data science to automatically extract data from websites.
It involves using a program or script to navigate through web pages, parse the HTML or XML code, and extract specific pieces of information, such as text, images, files or other structured data.

Our web scraping strategy is performed in two separate steps:
- First, we get the list of accident IDs available in the database.
- Then, we extract the information related to each individual accident separately.
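The two steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual scraper (which relies on Selenium and the functions in `helper.py`): the sample markup, the `id` query parameter, the base URL, and the names `AccidentIdParser`, `extract_ids` and `detail_urls` are all assumptions, and the real EPICEA pages may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class AccidentIdParser(HTMLParser):
    """Collect the `id` query parameter of every link found in a results page."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        values = parse_qs(urlparse(href).query).get("id", [])
        self.ids.extend(int(v) for v in values if v.isdigit())


# Hypothetical results page; the real EPICEA markup may differ.
SAMPLE_PAGE = """
<ul>
  <li><a href="/dossier?id=19257">Dossier 19257</a></li>
  <li><a href="/dossier?id=19258">Dossier 19258</a></li>
  <li><a href="/aide">Aide</a></li>
</ul>
"""


def extract_ids(html: str) -> list[int]:
    """Step 1: list the accident IDs present on one results page."""
    parser = AccidentIdParser()
    parser.feed(html)
    return parser.ids


def detail_urls(ids, base="https://example.invalid/dossier?id="):
    """Step 2: build one detail URL per accident ID (placeholder base URL)."""
    return [f"{base}{i}" for i in ids]


ids = extract_ids(SAMPLE_PAGE)
print(ids)
print(detail_urls(ids))
```

Separating the two steps means the list of IDs can be saved once and the slower per-accident extraction can be restarted without rebuilding it.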

## 2.1. Accidents #ID extraction

In [8]:
def extract_all_accident_ids():
    """Main function to extract all accident IDs."""
    driver = initialize_driver()
    try:
        navigate_to_search_page(driver)
        perform_search(driver)
        display_list(driver)
        total_pages = get_total_pages(driver)
        accident_ids = extract_accident_ids(driver, total_pages)
        total_ids = save_accident_ids(accident_ids)
        print(f"Extraction complete. Total accident IDs: {total_ids}")
        return total_ids
    finally:
        driver.quit()

In [9]:
# Execute the extraction
total_ids = extract_all_accident_ids()
print(f"Total number of accident IDs extracted: {total_ids}")

  3%|██▎                                                                          | 132/4349 [03:36<1:55:17,  1.64s/it]


KeyboardInterrupt: 

## 2.2. Detailed data extraction

In [None]:
df, df_analyzed = load_data()
df = filter_unanalyzed_data(df, df_analyzed)
df = initialize_dataframe(df)
new_data = process_accidents(df)

# Combine new data with existing analyzed data
df_analyzed = pd.concat([df_analyzed, new_data], ignore_index=True)
df_analyzed.to_csv('Accident_database.csv', sep='|', index=False, encoding="utf-8")

print(f"Data extraction complete. Total accidents in database: {len(df_analyzed)}")
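The resume logic behind `filter_unanalyzed_data` lives in `helper.py` and is not shown in this notebook. As a rough, hypothetical illustration of the idea (skip every record already present in the saved results), a plain-Python version could look like the following; the function name `filter_unanalyzed` and the sample IDs are assumptions.

```python
def filter_unanalyzed(all_ids, analyzed_ids):
    """Keep, in their original order, the IDs that have not been processed yet."""
    done = set(analyzed_ids)  # set lookup keeps the filter linear in the number of IDs
    return [i for i in all_ids if i not in done]


# Made-up IDs: two of the four records were already extracted in a previous run.
remaining = filter_unanalyzed([101, 102, 103, 104], [102, 104])
print(remaining)
```

With pandas, the same filter is usually expressed with `Series.isin`, for instance on the `Numero_dossier` column used later in this notebook.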

# <div style="text-align: left; background-color: yellowgreen; color: white; padding: 10px; line-height:1;border-radius:10px">3. Extraction of data from narratives</div>

Part of the code uses prompts and variable names written in French. Because the source data is written in French, better results are obtained when the prompts and the description of the expected output are also written in French.

## 3.1. Classes description

In [11]:
class BodyZone(str, Enum):
    HEAD = "tete"
    CHEST = "torse"
    STOMACH = "ventre"
    BACK = "dos"
    ARM = "bras"
    HAND = "main"
    LEG = "jambe"
    FOOT = "pied"
    POSTERIOR = "posterieur"
    HEART = "coeur"
    NA = "NA"

In [12]:
class Accident(BaseModel):
    Metier: str = Field(description="Victim's job, role or function who suffered the accident.")
    Sexe: str = Field(description="Sex (Man or Woman) of the victim who suffered the accident.")
    Age: int = Field(description="Age of the victim who suffered the accident.")
    
    Type_accident: str = Field(description="Type of accident that occurred. 1 or 2 words maximum.")
    Blessure: str = Field(description="Medical description of injuries or symptoms. 1 or 2 words maximum.")
    
    Deces: bool = Field(description="The victim is mentioned as deceased.")
    Circulation: bool = Field(description="Accident related to traffic.")
    Malaise: bool = Field(description="Accident related to a medical condition such as stroke, heart attack.")
    Suicide: bool = Field(description="Accident related to suicide.")
    
    Machine: List[str] = Field(description="Machines, parts or objects involved in the accident. 1 or 2 words maximum per item.")
    Cause: List[str] = Field(description="Factors that directly caused or contributed to the accident. 1 to 3 words maximum per factor.")
    Zone: BodyZone = Field(description="Body area affected by the accident.")
        
    @field_validator('Sexe')
    @classmethod
    def sexe_valide(cls, v):
        if v.lower() not in ['homme', 'femme']:
            raise ValueError('Sex must be "Homme" or "Femme"')
        return v.capitalize()

    @field_validator('Age')
    @classmethod
    def age_valide(cls, v):
        if v is not None and (v < -1 or v > 120):
            raise ValueError('Age must be between -1 and 120')
        return v

    @field_validator('Metier', 'Type_accident', 'Blessure')
    @classmethod
    def non_vide(cls, v):
        if not v.strip():
            raise ValueError('This field cannot be empty')
        return v

    @field_validator('Machine', 'Cause')
    @classmethod
    def liste_non_vide(cls, v):
        if not v:
            return ['Not specified']
        return [item.strip() for item in v if item.strip()]

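    @field_validator('Zone', mode='before')
    @classmethod
    def zone_valide(cls, v):
        # Runs before enum validation so that synonyms such as 'doigt' can be mapped to a zone.
        # 'epaule' is listed under both 'dos' and 'bras'; 'dos' wins because it is checked first.
        zone_mapping = {
            'tete': ['crane', 'visage', 'cou', 'cerveau'],
            'torse': ['poitrine', 'torse', 'poumon'],
            'ventre': ['ventre', 'estomac'],
            'dos': ['dos', 'epaule'],
            'bras': ['bras', 'coude', 'epaule'],
            'main': ['main', 'doigt', 'poignet'],
            'jambe': ['genou', 'cuisse', 'mollet', 'tibia'],
            'pied': ['pied', 'cheville'],
            'posterieur': ['fesses'],
            'coeur': ['coeur']
        }

        if not isinstance(v, str):
            return BodyZone.NA
        v = v.strip().lower()
        for zone, keywords in zone_mapping.items():
            # Accept the canonical zone name itself as well as its synonyms
            if v == zone or v in keywords:
                return BodyZone(zone)
        return BodyZone.NA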

    model_config = ConfigDict(
        extra='forbid',
        use_enum_values=True,
        json_schema_extra={
            'examples': [
                {
                    'Metier': 'Maintenance technician',
                    'Sexe': 'Homme',
                    'Age': 45,
                    'Type_accident': 'Fall',
                    'Blessure': 'Fracture',
                    'Deces': False,
                    'Circulation': False,
                    'Malaise': False,
                    'Suicide': False,
                    'Machine': ['Ladder'],
                    'Cause': ['Slippery floor', 'Lack of PPE'],
                    'Zone': 'jambe'
                }
            ]
        }
    )

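The zone lookup performed by the `Zone` validator can be tried in isolation, without Pydantic. The sketch below copies the keyword mapping from the class above and accepts either a canonical zone name or one of its synonyms. Accent stripping (`strip_accents`) is an addition assumed here so that model output such as "tête" still matches; the function names are hypothetical.

```python
import unicodedata

# Same mapping as in the Accident class: canonical zone -> accepted synonyms
ZONE_KEYWORDS = {
    "tete": ["crane", "visage", "cou", "cerveau"],
    "torse": ["poitrine", "torse", "poumon"],
    "ventre": ["ventre", "estomac"],
    "dos": ["dos", "epaule"],
    "bras": ["bras", "coude", "epaule"],
    "main": ["main", "doigt", "poignet"],
    "jambe": ["genou", "cuisse", "mollet", "tibia"],
    "pied": ["pied", "cheville"],
    "posterieur": ["fesses"],
    "coeur": ["coeur"],
}


def strip_accents(text: str) -> str:
    """Remove combining accents, e.g. turn 'épaule' into 'epaule'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_zone(raw: str) -> str:
    """Map a free-text body part to a canonical zone name, or 'NA' if unknown."""
    key = strip_accents(raw.strip().lower())
    for zone, keywords in ZONE_KEYWORDS.items():
        if key == zone or key in keywords:
            return zone
    return "NA"


for sample in ["Doigt", "tête", "épaule", "oreille"]:
    print(sample, "->", normalize_zone(sample))
```

Because the dictionary is scanned in insertion order, a keyword listed under two zones (here `epaule`) always resolves to the first one, `dos`.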

## 3.2. Running functions

In [None]:
def setup_llm_and_prompt():
    """Set up the language model and prompt for accident analysis."""
    pydantic_parser = PydanticOutputParser(pydantic_object=Accident)
    format_instructions = pydantic_parser.get_format_instructions()

    template_string = """You are a French analyst reviewing accident reports and performing data entry. 
    Analyze the text below between triple backticks and extract the required information.

    Accident description: ```{descriptif}```

    IMPORTANT:
    - All your answers MUST be in French.
    - For the 'Sexe' field, use ONLY 'Homme' or 'Femme'.
    - The 'Metier' field must be a string, not a list.
    - For the 'Zone' field, use ONLY one of the following values according to the affected body area:
      - 'tete' for [crane, visage, cou, cerveau]
      - 'torse' for [poitrine, torse, poumon]
      - 'ventre' for [ventre, estomac]
      - 'dos' for [dos, epaule]
      - 'bras' for [bras, coude, epaule]
      - 'main' for [main, doigt, poignet]
      - 'jambe' for [genou, cuisse, mollet, tibia]
      - 'pied' for [pied, cheville]
      - 'posterieur' for [fesses]
      - 'coeur' for [coeur]
      - 'NA' if the information is not present
    - If the information does not appear in the narrative, use 'NA' for text fields.

    Your response MUST be a valid JSON object, strictly adhering to the following schema. Do NOT include ANY text outside this JSON object.

    {format_instructions}
    """

    prompt = ChatPromptTemplate(
        messages=[
            HumanMessagePromptTemplate.from_template(template_string)  
        ],
        input_variables=["descriptif"],
        partial_variables={"format_instructions": format_instructions}
    )

    llm = ChatOllama(
        model="mistral", 
        format="json",
        temperature=0,
        top_k=10,
        top_p=0.9,
        repeat_penalty=1.1
    )

    return llm, prompt

In [None]:
def load_and_prepare_data(csv_name="Accident_database_refined.csv"):
    """Load and prepare the accident data for analysis."""
    if Path(csv_name).is_file():
        df = pd.read_csv(csv_name, sep="|")
    else:
        df = pd.read_csv('Accident_database.csv', sep="|")
        new_columns = ['Metier', 'Sexe', 'Age', 'Type_accident', 'Blessure', 'Deces', 'Circulation', 'Malaise',
                       'Suicide', 'Machine', 'Cause', 'Zone', 'Status']
        for col in new_columns:
            df[col] = None
        
        # Filter data based on specific enterprise codes
        enterprise_codes = ['241GM', '274CG', '295EC', '2110Z', '2120Z', '244CB', '244DA', '4646Z', '4773Z', '514NA', 
                            '523AB', '1073Z', '1086Z', '1089Z', '157AB', '158VB']
        # str(x) guards against missing (NaN) company codes, which are floats in pandas
        df = df[df['Code_entreprise'].apply(lambda x: any(code in str(x) for code in enterprise_codes))]
        df = df[df['Numero_dossier'] != 19258]

    return df

In [None]:
def analyze_accidents(df, llm, prompt):
    """Analyze accidents using the LLM and update the dataframe."""
    unanalyzed_refs = df.loc[df['Status'].isnull(), 'Numero_dossier'].tolist()

    for num in tqdm(unanalyzed_refs):
        descriptif = df.loc[df['Numero_dossier'] == num, 'Resume'].item()
        messages = prompt.format_messages(descriptif=descriptif)

        chat_model_response = llm.invoke(messages)
        print(f"Raw response for number {num}:")
        print(datetime.datetime.now())
        print(chat_model_response.content)

        content_dict = parse_json_safely(chat_model_response.content)

        if content_dict:
            content_dict = process_content_dict(content_dict)
            content_dict = add_default_values(content_dict)
            content_dict = clean_and_standardize_content(content_dict)

            if validate_content(content_dict):
                for key in content_dict.keys():
                    df.loc[df['Numero_dossier'] == num, key] = content_dict[key]
                df.loc[df['Numero_dossier'] == num, 'Status'] = 'Analyzed'
            else:
                print(f"Warning: Incomplete result for number {num}")
                print(f"Dictionary content: {content_dict}")
        else:
            print(f"Error: Unable to parse the result for number {num}")
            print(f"Description for number {num}:")
            print(descriptif)
            continue

        # Incremental save
        df.to_csv('Accident_database_refined.csv', sep='|', index=False, encoding="utf-8")

    return df
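`parse_json_safely` is defined in `helper.py` and is not reproduced in this notebook. As a hypothetical sketch of what such a tolerant parser might do, the function below (named `parse_json_lenient` to make clear it is not the author's helper) first tries a strict parse, then falls back to the outermost `{...}` span, which helps when a model wraps its JSON in extra prose.

```python
import json
import re


def parse_json_lenient(text):
    """Return a JSON object embedded in `text` as a dict, or None if none is found."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the span from the first '{' to the last '}'
        match = re.search(r"\{.*\}", text, flags=re.DOTALL)
        if match is None:
            return None
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    # Only a JSON object is useful here; lists or scalars are rejected
    return parsed if isinstance(parsed, dict) else None


print(parse_json_lenient('{"Age": 45, "Sexe": "Homme"}'))
print(parse_json_lenient('Voici le resultat : {"Age": 45} Merci.'))
print(parse_json_lenient("pas de JSON ici"))
```

Returning `None` on failure fits the `if content_dict:` check in `analyze_accidents`, which treats a falsy result as a parsing error.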

## 3.3. Main loop

In [None]:
start_time = time.time()

llm, prompt = setup_llm_and_prompt()
df = load_and_prepare_data()
df = analyze_accidents(df, llm, prompt)

inference_time = time.time() - start_time
print("Analysis completed.")
print(f"Total inference time: {inference_time:.2f} seconds")