Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality

Link to the article: https://www.nature.com/articles/s44184-023-00046-7

Study

This repository contains the computer code that was executed to generate the results of the article:

Bey, Romain, Ariel Cohen, Vincent Trebossen, Basile Dura, Pierre-Alexis Geoffroy, Charline Jean, Benjamin Landman, et al.
"Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality."
npj Mental Health Research 3, no. 1 (14 February 2024): 6.
https://doi.org/10.1038/s44184-023-00046-7.

The code was executed on the database of the Greater Paris University Hospitals.

IRB number: CSE210013

⚠️ This repository is not maintained. It contains computer code that is specific to a research study.

Version 1.0.0

Code of the published article.

Setup

You should run the file set_environment.py in order to create a conda environment and an associated jupyter kernel.

python set_environment.py -n env_cse_210013
conda activate env_cse_210013
pip install --upgrade pip

cd cse_210013
poetry install

How to run the code on AP-HP's data platform

You can run all the analysis pipelines with the ./bash/run_analysis.sh command:

bash bash/run_analysis.sh <conf_name>

Example:

bash bash/run_analysis.sh conf_article

This command requires that the machine learning model for suicide attempt (SA) detection has already been trained or imported.

Project structure

Repository organization

bash: Bash files to execute the pipelines and tests
conf: Configuration files
data: Intermediate data and export results
figures: Figures and their associated tables
notebooks: Tutorials and examples
suicide_attempt: Source code (functions and pipelines)

Pipelines

Stay & Document selection: Retrieve documents that mention a lexical variant of Suicide Attempt for the stays that fulfill the inclusion criteria
Rule-based entity classification
Machine learning (ML) entity classification
Stay classification using text data
Stay classification using claim data
Retrieve documents with a risk factor (RF) mention for the visits previously classified as SA (text data)
Rule-based entity classification for the RF
Make plots
Evaluate configuration & data description
Train ML model
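
To illustrate the "Stay classification using text data" step, here is a minimal sketch of how entity-level predictions could be aggregated into a visit-level label. It is not the repository's implementation: the function `classify_stays` and the columns `visit_id` and `sa_visit` are hypothetical, and the sketch assumes that the `is_true_instance` variable and the `threshold_positive_instances` parameter described in the configuration file behave as their descriptions suggest.

```python
import pandas as pd


def classify_stays(entities: pd.DataFrame, threshold_positive_instances: int = 1) -> pd.DataFrame:
    """Flag a visit as positive when it has enough positive SA mentions.

    `entities` holds one row per detected SA mention, with the hypothetical
    column `visit_id` and the boolean column `is_true_instance`.
    """
    # Count the positive mentions per visit (True counts as 1, False as 0).
    counts = (
        entities.groupby("visit_id")["is_true_instance"]
        .sum()
        .rename("n_positive_instances")
        .reset_index()
    )
    # Assumption: a visit is positive when the count reaches the threshold.
    counts["sa_visit"] = counts["n_positive_instances"] >= threshold_positive_instances
    return counts


# Toy data, invented for illustration only.
entities = pd.DataFrame(
    {
        "visit_id": [1, 1, 2, 3, 3, 3],
        "is_true_instance": [True, False, False, True, True, False],
    }
)
stays = classify_stays(entities, threshold_positive_instances=2)
print(stays)
```

With a threshold of 2, only the visit with two positive mentions is flagged in this toy data.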

Configuration file

debug: (Boolean) If set to True, the pipelines will be executed using only a sample of data. Useful for debugging.
schema: Name of the schema to query.
admission_mode: Admission modes to keep. For example: [2-URG] for admission through the emergency department. If None, no criterion is applied.
type_of_visit: Type of visit to keep. For example: [I,U] for hospitalizations and emergency visits, respectively. If None, all visits will be considered.
only_cat_docs: List of text document categories to use exclusively. If None, no action is applied.
rule_select_docs: Method used to select one document per visit. If None, no selection is applied.
text_classification_method: Name of the method used to classify an identified SA entity as positive (is_true_instance variable).
rule_icd10: Name of the rule used to classify a visit as positive for SA using claim data.
icd10_type: Source database that is considered for claim data (either ORBIS or AREM).
threshold_positive_instances: Minimum number of positive suicide attempt text instances found in text to classify the visit as positive.
delta_min_visits: Timedelta used to tag recurrent visits related to the same SA event (string in a format accepted by pd.to_timedelta). If None, no action is applied.
delta_history: Timedelta used to discard SAs that are detected by the NLP algorithms but relate to a patient's history. If the algorithm detects the date of an SA and that date is earlier than the admission date minus delta_history, the visit is not tagged as an SA-caused visit. If None, no action is applied.
date_from: Consider only visits fulfilling start_date >= date_from.
date_upper_limit: Date up to which the analysis is carried out. Only visits that start strictly before date_upper_limit are considered. Also used to fill the values of visits with no visit_end_date for the Kaplan-Meier estimator.
hospitals_train: List of hospitals considered in the training set (trigrams).
hospitals_test: List of hospitals considered in the testing set (trigrams). If None, no action is applied.
ehr_deployement_file: Name of the file containing information on the deployment dates of the electronic health record used for data collection.
encounter_subset: List of encounter numbers to consider exclusively. If None, no action is applied.
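
As an illustration of the date-related parameters above, the following sketch shows one possible way to apply them with pandas. It is not the repository's code: the configuration values, the function names, and the columns `patient_id`, `visit_start_date` and `sa_date` are hypothetical. The reading of delta_min_visits (a visit is recurrent when it starts within that delay of the same patient's previous visit) is also an assumption.

```python
import pandas as pd

# Hypothetical configuration values, for illustration only.
conf = {
    "date_from": "2017-01-01",
    "date_upper_limit": "2022-01-01",
    "delta_history": "30 days",
    "delta_min_visits": "15 days",
}


def filter_visits(visits: pd.DataFrame, conf: dict) -> pd.DataFrame:
    """Keep visits with date_from <= start date < date_upper_limit."""
    start = visits["visit_start_date"]
    keep = (start >= pd.Timestamp(conf["date_from"])) & (
        start < pd.Timestamp(conf["date_upper_limit"])
    )
    return visits[keep]


def is_history(visits: pd.DataFrame, conf: dict) -> pd.Series:
    """True when the detected SA date precedes admission by more than delta_history."""
    if conf["delta_history"] is None:  # no action is applied
        return pd.Series(False, index=visits.index)
    limit = visits["visit_start_date"] - pd.to_timedelta(conf["delta_history"])
    # Visits without a detected SA date (NaT) compare as False.
    return visits["sa_date"] < limit


def tag_recurrent(visits: pd.DataFrame, conf: dict) -> pd.DataFrame:
    """Assumed rule: tag a visit starting soon after the patient's previous visit."""
    visits = visits.sort_values(["patient_id", "visit_start_date"])
    if conf["delta_min_visits"] is None:  # no action is applied
        return visits.assign(recurrent_visit=False)
    gap = visits.groupby("patient_id")["visit_start_date"].diff()
    return visits.assign(recurrent_visit=gap <= pd.to_timedelta(conf["delta_min_visits"]))


# Toy data, invented for illustration only.
visits = pd.DataFrame(
    {
        "patient_id": [1, 1, 2, 3],
        "visit_start_date": pd.to_datetime(
            ["2018-03-01", "2018-03-10", "2016-05-01", "2020-06-15"]
        ),
        "sa_date": pd.to_datetime(["2018-03-01", None, None, "2019-01-01"]),
    }
)
selected = filter_visits(visits, conf)
history_mask = is_history(selected, conf)
tagged = tag_recurrent(selected, conf)
print(tagged.assign(history=history_mask))
```

In this toy data, the 2016 visit falls before date_from and is dropped, the 2020 visit refers to an SA dated more than 30 days before admission, and the second visit of patient 1 starts 9 days after the first.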

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.
