You Don't Know My Name: Transfer Learning to De-Identify Protected Health Information in Electronic Health Records
Arnobio Morelix, Pauline Wang
School of Information, University of California, Berkeley
Date: August 2, 2019
This paper presents an information extraction system that uses Bidirectional Encoder Representations from Transformers (BERT) to de-identify protected health information (PHI) in electronic health records (EHR). Past work on PHI de-identification has used a combination of dictionary-based, rule-based, and machine-learning algorithms to handle the inherent complexity of PHI categories. In this paper we use BERT, a pre-trained model with context-aware word embeddings, to classify PHI categories in a named entity recognition task. Our model performs in line with, and on some measures better than, models that rely on extensive rule-based pre-processing. Because rule-based systems incur high pre-processing costs and depend on site-specific annotation styles (e.g., how a particular hospital records and reports biometrics), we believe the type of model we present here has greater potential to generalize, and we outline opportunities for further refinement.
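To make the task framing concrete, the sketch below (not the authors' code) shows how PHI de-identification is typically cast as token-level named entity recognition with BIO tags, which is the label scheme a BERT token-classification head would be trained to predict. The PHI category names and the example sentence are illustrative assumptions, not drawn from the paper's dataset.

```python
# Illustrative sketch: framing PHI de-identification as token-level NER
# with BIO tags. Category names below are hypothetical placeholders;
# real de-identification corpora define their own PHI annotation schemes.
PHI_TYPES = ["NAME", "DATE", "LOCATION", "ID", "CONTACT"]

def bio_labels(tokens, spans):
    """Convert token-index PHI spans into BIO tags.

    tokens: list of word tokens.
    spans:  list of (start_token, end_token_exclusive, phi_type) triples.
    Returns one label per token: "O" outside any PHI span, "B-<type>"
    at a span start, "I-<type>" inside a span.
    """
    labels = ["O"] * len(tokens)
    for start, end, phi_type in spans:
        labels[start] = f"B-{phi_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{phi_type}"
    return labels

# Hypothetical clinical-note fragment with two annotated PHI spans.
tokens = ["Patient", "John", "Smith", "seen", "on", "March", "3"]
spans = [(1, 3, "NAME"), (5, 7, "DATE")]
print(bio_labels(tokens, spans))
# ['O', 'B-NAME', 'I-NAME', 'O', 'O', 'B-DATE', 'I-DATE']
```

A token classifier trained on labels like these predicts one tag per (sub)word; predicted PHI spans are then masked or replaced with surrogates to de-identify the record.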