## Toolkit Glossary
https://machinelearningmastery.com/start-here/

## Potential research questions
1. How can we predict a patient's condition based on their anamnesis?
2. Can an ML decision support system help doctors reduce the risk of misdiagnosis?
3. How can doctors learn from ML decision support systems to prevent misdiagnoses among their patients?
4. How can we estimate the cost of a patient's visit based on their predicted diagnosis?


## Pretrained models
https://huggingface.co/Zabihin/Symptom_to_Diagnosis

https://huggingface.co/pucpr/clinicalnerpt-diagnostic


## Datasets

https://huggingface.co/datasets/gretelai/symptom_to_diagnosis

https://huggingface.co/datasets/shanover/disease_symptoms_prec_full

https://paperswithcode.com/dataset/semclinbr

MIMIC-III, a freely accessible critical care database: https://www.nature.com/articles/sdata201635

## Papers and Articles
https://paperswithcode.com/paper/semclinbr-a-multi-institutional-and-multi

https://aclanthology.org/2020.clinicalnlp-1.7/

Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks - https://arxiv.org/pdf/2007.07562.pdf

Fine-Tuning Bidirectional Encoder Representations From Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study - https://medinform.jmir.org/2019/3/e14830/

DNN-Based Diagnosis Prediction from Pediatric Big Healthcare Data - https://www.semanticscholar.org/paper/DeepDiagnosis%3A-DNN-Based-Diagnosis-Prediction-from-Shi-Fan/af4b592a0e8068fab9a6a0aea702f9db36b26a2f

Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks - https://arxiv.org/abs/1706.05764

Diagnosis prediction from electronic health records (EHR) using the binary diagnosis history vector representation - https://pubmed.ncbi.nlm.nih.gov/28686467/

MNN: Multimodal Attentional Neural Networks for Diagnosis Prediction - https://doi.org/10.24963/ijcai.2019%2F823

Multi-label Classification of ICD-10 Codes with BERT - https://ceur-ws.org/Vol-2380/paper_67.pdf

Developing an artificial intelligence-based system for medical prediction, Bulletin of Russian State Medical University - https://cyberleninka.ru/article/n/developing-an-artificial-intelligence-based-system-for-medical-prediction

Predicting patient’s diagnoses and diagnostic categories from clinical-events in EHR data - https://link.springer.com/chapter/10.1007/978-3-030-21642-9_17

Enriching word vectors with subword information - https://arxiv.org/abs/1607.04606

The third-leading cause of death in US most doctors don’t want you to know about - https://www.cnbc.com/2018/02/22/medical-errors-third-leading-cause-of-death-in-america.html

Medical Errors: Modern Condition of the Problem

Health Insurance Cost Prediction by Using Machine Learning - https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4366801

Medical Error Reduction and Prevention - https://www.ncbi.nlm.nih.gov/books/NBK499956/

# Notes from literature review

### 1. Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks

Summarized by Chris

**Connected papers:** https://www.connectedpapers.com/main/7150ede02534303c0c838111f2fcb9b673ecfdda/Predicting-Clinical-Diagnosis-from-Patients-Electronic-Health-Records-Using-BERT%20based-Neural-Networks/derivative

**Notes**:
1. Many patient visits are misdiagnosed, in fact up to 30% in the US.
2. ICD-10 coding specification: https://icd.who.int/browse10/Content/statichtml/ICD10Volume2_en_2019.pdf
3. Consider using pre-trained embeddings. Samples are available from: Multi-label Classification of ICD-10 Codes with BERT - https://ceur-ws.org/Vol-2380/paper_67.pdf

**Research objective**: We formulate the aforementioned clinical decision support problem as a multi-label text classification of clinical notes (anamnesis and stated symptoms) during a patient visit, where the classification is performed for a wide range of diagnosis codes represented by the International Statistical Classification of Diseases (ICD-10).
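To make the multi-label formulation concrete, here is a minimal sketch of how the diagnosis codes attached to one visit could be encoded as a 0/1 target vector. The three-code label set and the helper names `build_label_index` and `to_multi_hot` are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List, Sequence


def build_label_index(codes: Sequence[str]) -> Dict[str, int]:
    """Map each ICD-10 code to a fixed column position (sorted for stability)."""
    return {code: i for i, code in enumerate(sorted(set(codes)))}


def to_multi_hot(visit_codes: Sequence[str], label_index: Dict[str, int]) -> List[int]:
    """Encode the codes assigned to one visit as a 0/1 vector.

    Codes outside the label set are ignored (an assumption of this sketch).
    """
    vector = [0] * len(label_index)
    for code in visit_codes:
        if code in label_index:
            vector[label_index[code]] = 1
    return vector


# Toy label set; the paper works with K = 265 ICD code categories.
label_index = build_label_index(["J06", "I10", "K29"])
example = to_multi_hot(["J06", "I10"], label_index)
print(label_index)
print(example)
```

A classifier trained on such targets would typically emit one score per column, so that several codes can be predicted for the same visit.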

**Research contributions**
1. propose a novel BERT-based model for classification of textual clinical notes, called RuPool-BERT. It differs from other models in the composition of its FC layer
2. compare the performance of our method with various baselines across different text representation techniques and classification models
3. compare the performance of the BERT model pretrained on a large corpus of out-of-domain data [6] with the BERT model pretrained exclusively on in-domain data and using an in-domain tokenizer. (Read more about in-domain vs out-of-domain data here: https://qr.ae/pKL8cd. Read about tokenizers here: https://www.linkedin.com/pulse/demystifying-tokenization-preparing-data-large-models-rany-2nebc/)
4. demonstrate the advantage of the proposed models and their comparable results with a human baseline

**Related Work**

Prior work proposed a deep neural network (DNN)-based diagnosis prediction algorithm, but only for pediatric EHRs.

**Methodology**  
Clinical data covering ~4M hospital visits and ~1M patients.

*Limitation identified:* The researchers did not preprocess the data enough to standardize inputs from different EMRs, e.g. using diagnostic data. They also "did not apply any special preprocessing to training data...and therefore it was presented to the model as a raw text including typos, abbreviations and misspellings". Furthermore, only the symptoms and anamnesis (self-reported notes) of patients were used in the analysis.

The training dataset covers all age groups; about 20% of the training visits are by children under the age of 14.

"Multi-label Classification of ICD-10 Codes with BERT" reports a 6% increase in F1 score on a German medical dataset that was translated to English before being passed to the BERT model.

Fei Li et al. investigated the problem of BERT model fine-tuning for biomedical and clinical entity normalization. Their research includes a comparative analysis of pretrained general-domain and biological-domain BERT models.

We follow this principle and analyze electronic health records of medical patients written in Russian. We work with a large dataset of about 4,000,000 patients’ visits to various clinics in Russia.

The first two datasets (we call them DataN and DataM in the paper) pertain to two large private networks of clinics and the third one (we call it DataT) pertains to the network of public clinics

Patient visits were split into train, validation and test sets. The whole DataT part was assigned to the test set, and the union of DataN and DataM sets was randomly split in the 80/20 proportion to make the train and the validation sets. The final cardinalities of the train, validation and test sets were 1,798,687 (45.23%), 449,672 (11.3%) and 1,728,529 (43.47%) respectively. 
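The split described above can be sketched as follows. This is only an illustration under stated assumptions: each visit is a dictionary with a hypothetical `source` field naming its dataset, the toy data is synthetic, and the function name `split_visits` and the fixed seed are not from the paper.

```python
import random
from typing import Dict, List, Tuple

Visit = Dict[str, str]


def split_visits(
    visits: List[Visit], val_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Visit], List[Visit], List[Visit]]:
    """Hold out every DataT visit for testing; split the rest at random."""
    test = [v for v in visits if v["source"] == "DataT"]
    rest = [v for v in visits if v["source"] != "DataT"]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(rest)
    n_val = round(len(rest) * val_fraction)
    return rest[n_val:], rest[:n_val], test


# Synthetic toy data: 50 DataN, 50 DataM and 40 DataT visits.
visits = (
    [{"source": "DataN", "id": f"n{i}"} for i in range(50)]
    + [{"source": "DataM", "id": f"m{i}"} for i in range(50)]
    + [{"source": "DataT", "id": f"t{i}"} for i in range(40)]
)
train, val, test = split_visits(visits)
print(len(train), len(val), len(test))
```

Keeping one data source entirely out of training means the test score also reflects how well the model transfers to a clinic network it has never seen.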

The validation set was used exclusively to fine-tune hyperparameters for the baseline BERT model. The test set was used to compare different baselines with the proposed models. The reason we decided to keep the entire DataT only as the test set lies in the more reliable nature of this data. More specifically, for each visit in DataT, we have a confirmed diagnosis.

Target variable / labels: K = 265 categories of ICD codes were used for this task, accounting for 95% of all diagnoses in their dataset. A second setting with K = 1000 codes was found to significantly complicate the classification task.
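One plausible way to arrive at a label set that accounts for a given share of all diagnoses is to keep the most frequent codes until their cumulative count reaches the target. The notes do not say how the authors selected their 265 categories, so the function `codes_covering` and the toy counts below are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, List


def codes_covering(codes: Iterable[str], coverage: float = 0.95) -> List[str]:
    """Return the most frequent codes whose cumulative share reaches `coverage`."""
    counts = Counter(codes)
    total = sum(counts.values())
    selected: List[str] = []
    covered = 0
    for code, count in counts.most_common():
        if covered >= coverage * total:
            break  # target share already reached
        selected.append(code)
        covered += count
    return selected


# Toy distribution of 100 diagnoses over five codes.
toy = ["J06"] * 60 + ["I10"] * 30 + ["K29"] * 6 + ["M54"] * 3 + ["H10"] * 1
kept = codes_covering(toy, coverage=0.95)
print(kept)
```

Visits whose code falls outside the kept set would then need to be dropped or mapped to a catch-all class, which is a separate design decision.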


**Model**: 

We present our classification model that predicts the 265 ICD codes based only on textual information of the patient visits to the clinics. We solve the classification problem by using a transformer-based BERT model

The inputs to the model constitute text from the symptoms and the anamnesis fields of the patient's visit to a doctor that are concatenated into a single text sequence.

The tokenization approach follows the wordpiece scheme from Google's 2016 NMT paper: https://arxiv.org/pdf/1609.08144.pdf. This allows the model to handle out-of-vocabulary words (typos, misspellings, etc.) naturally.
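To see why subword tokenization copes with typos, consider a toy greedy longest-match-first splitter in the style of BERT's WordPiece. The vocabulary below is invented for the example and the `##` continuation prefix follows BERT's convention; neither is taken from the paper's actual tokenizer.

```python
from typing import List, Optional, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces.

    Non-initial pieces carry the '##' prefix, as in BERT vocabularies.
    """
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Invented toy vocabulary.
vocab = {"head", "##ache", "##a", "##k", "##e"}
print(wordpiece_tokenize("headache", vocab))  # correctly spelled word
print(wordpiece_tokenize("headake", vocab))   # misspelled word
```

The misspelled word is still broken into known pieces that share the `head` prefix with the correct spelling, instead of collapsing into a single unknown token.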

The mean number of tokens per sequence is around 79 and the median is about 57. Therefore, we decided to allow some margin by limiting each sequence to N = 128 tokens.
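A minimal sketch of how the two text fields might be joined and fitted to a fixed length of 128 tokens is given below. It assumes that the limit includes the `[CLS]` and `[SEP]` special tokens and that shorter sequences are padded with `[PAD]`; the notes do not specify either detail, and `build_input` is a hypothetical helper.

```python
from typing import List, Tuple

MAX_LEN = 128  # sequence limit reported in the notes


def build_input(
    symptoms: List[str],
    anamnesis: List[str],
    max_len: int = MAX_LEN,
    pad: str = "[PAD]",
) -> Tuple[List[str], List[int]]:
    """Concatenate both fields, truncate, and pad to exactly `max_len` tokens."""
    body = (symptoms + anamnesis)[: max_len - 2]  # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + body + ["[SEP]"]
    mask = [1] * len(tokens)  # 1 marks real tokens, 0 marks padding
    padding = max_len - len(tokens)
    return tokens + [pad] * padding, mask + [0] * padding


short_tokens, short_mask = build_input(["cough"] * 3, ["fever"] * 2)
long_tokens, long_mask = build_input(["cough"] * 150, ["fever"] * 50)
print(len(short_tokens), sum(short_mask))
print(len(long_tokens), sum(long_mask))
```

With a median of about 57 tokens, most visits would fit without truncation under this limit, while the long tail is cut off at the end of the anamnesis text.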


![title](img/rupoolbert_architecture.jpeg)

To fully leverage the difference in vocabulary between general texts and medical records, we trained a tokenizer with a vocabulary of 40k tokens on all the texts from DataN dataset. The tokenizer is identical to the one used in the original BERT model with the difference coming from the training data word distribution. Our expectation was that such a medical-domain tokenizer will allow the model to capture a wider range of medical linguistics phenomena.


**Evaluation**

Baseline model comparisons: An RNN model (with GRU units), the FastText model [19] and the multilingual Universal Sentence Encoder (USE). To remove the effect of hyperparameter tuning and compare all the BERT models under the same conditions, hyperparameters were kept the same across all these models.

**Conclusion**
1. Our experiments of applying the developed prediction model to the practical task of classifying 265 diseases showed the advantage of this model compared to the fine-tuned RuBERT base analog and other text representation models
2. We also showed that using a BERT model with a vocabulary and pretraining dataset tailored to the medical texts representation (RuEHR-BERT) improves performance on the classification task, especially on less frequent diseases
3. Comparison of our model with a panel of medical experts showed that the results of our model were similar to the results of experts in terms of the Hit@3 performance measure
4. The most reliable performance of our system is achieved on those samples having longer textual inputs, i.e. text sequences having at least 20 input tokens
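The Hit@3 measure mentioned in the conclusions can be sketched as follows, assuming the usual definition: a sample counts as a hit when its true label is among the three highest-scored labels. The scores and labels are made up, and `hit_at_k` is an illustrative helper rather than the paper's evaluation code.

```python
from typing import Sequence


def hit_at_k(
    scores: Sequence[Sequence[float]], true_labels: Sequence[int], k: int = 3
) -> float:
    """Share of samples whose true label is among the k highest-scored labels."""
    hits = 0
    for row, label in zip(scores, true_labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(true_labels)


# Made-up scores for three visits over four candidate diagnoses.
scores = [
    [0.10, 0.50, 0.20, 0.90],
    [0.70, 0.10, 0.10, 0.05],
    [0.20, 0.30, 0.40, 0.10],
]
true_labels = [2, 3, 0]
print(hit_at_k(scores, true_labels, k=3))
```

Because a doctor can review a short list of suggestions, Hit@3 is a more forgiving measure than top-1 accuracy and matches the decision support use case.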

**Future work**  
Explainability of the predictive model: Our partners in the medical community identified one issue with the proposed method: they maintain that the proposed approach would benefit greatly from clear explanations of how our method arrived at each particular diagnosis. This is the topic of future research on which we plan to focus in the immediate future.