NLP Information Extraction
Benzon Carlitos Salazar
This is my 2020 undergraduate project, focusing on automated Information Extraction.
The goal of this project is to automate data/information extraction in order to build a larger database of CSVs for the medical domain (for proprietary research at the University of Wisconsin-Whitewater).
General Overview of the Pipeline
- Case reports are crawled from online resources
- Summary:
- Successfully extracted 223 PDF articles from Trauma Case Reports Online Medical Journal Vol. 10 - Vol. 27.
- The extracted articles include the Editorial Board pages from every volume. These will be removed before NER/Sentence Classification, as they are irrelevant to the case reports.
- Documents are converted and cleaned from PDF to text
- NER model created
- Summary:
- Successfully created a first version of our NER model.
- The attributes we would normally extract by hand are now extracted correctly and automatically by our model.
- After this step, we move on to extracting the action values from our case reports.
- The sample can be found here.
- Validating machine-generated CSVs against human-generated CSVs
- Use of Sentence Classification/Named Entity Recognition on the Case Report section of the literature
- Summary:
- Successfully performed Named Entity Recognition and Information Extraction.
- I have run IE on 50 case reports as a test case; the full implementation will run IE on all 223 PDF articles.
- Sample on 50 case reports can be found here.
- CSV assembly from relevant sentences
- Summary:
- Successfully converted all the NER results into a proper CSV, which can be found here.
- CSV for proprietary research
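The validation step above (machine-generated CSVs against human-generated CSVs) can be sketched roughly as follows. This is only an illustration, not the project's actual validation code: the column names (`case_id`, `age`, `injury`), the toy rows, and the `field_agreement` helper are all hypothetical, and the comparison is a simple case-insensitive exact match per field.

```python
import csv
import io


def field_agreement(machine_csv, human_csv, key="case_id"):
    """Return, per column, the fraction of shared rows where both CSVs agree.

    `key` is a hypothetical identifier column used to line up the rows.
    """
    machine = {row[key]: row for row in csv.DictReader(machine_csv)}
    human = {row[key]: row for row in csv.DictReader(human_csv)}
    shared_ids = sorted(machine.keys() & human.keys())
    scores = {}
    if not shared_ids:
        return scores

    # Compare every column of the human-made CSV except the key itself.
    columns = [c for c in human[shared_ids[0]] if c != key]
    for col in columns:
        matches = 0
        for cid in shared_ids:
            m_val = (machine[cid].get(col) or "").strip().lower()
            h_val = (human[cid].get(col) or "").strip().lower()
            if m_val == h_val:
                matches += 1
        scores[col] = matches / len(shared_ids)
    return scores


# Toy data, made up for illustration only (not taken from the case reports).
machine_text = "case_id,age,injury\n1,34,femur fracture\n2,51,Pelvic fracture\n"
human_text = "case_id,age,injury\n1,34,femur fracture\n2,52,pelvic fracture\n"

scores = field_agreement(io.StringIO(machine_text), io.StringIO(human_text))
print(scores)
```

Exact matching is strict; for free-text fields, one of the similarity measures listed further down may be a more forgiving way to compare values.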
Sources I am following:
For all NLP:
- Natural Language Processing for Information Extraction (Sonit Singh, 2018)
- Pipelines for Procedural Information Extraction from Scientific Literature: Towards Recipes using Machine Learning and Data Science (H Yang, 2019)
- Med7: a transferable clinical natural language processing model for electronic health records (Kormilitzin et al., 2020)
Text preprocessing:
- A General Approach to Preprocessing Text Data
- Text Wrangling & Pre-processing: A Practitioner’s Guide to NLP
- All you need to know about text preprocessing for NLP and Machine Learning
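As a rough idea of what cleaning PDF-derived text can look like, here is a minimal sketch using only the Python standard library. It is not the project's actual cleaning code: the `clean_pdf_text` function and the sample string are hypothetical, and the rules (ligature folding, de-hyphenation, whitespace collapsing) are just common first steps.

```python
import re
import unicodedata


def clean_pdf_text(raw):
    """Apply a few generic clean-up rules to text extracted from a PDF."""
    # Fold compatibility characters, e.g. the ligature "ﬁ" becomes "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Re-join words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines into spaces, keeping blank-line paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    # Normalise paragraph breaks to exactly one blank line.
    text = re.sub(r" ?\n{2,} ?", "\n\n", text)
    return text.strip()


# Made-up sample text, for illustration only.
sample = "The pa-\ntient pre-\nsented with a ﬁbula\nfracture.\n\n\nHe was   discharged."
cleaned = clean_pdf_text(sample)
print(cleaned)
```

One known limitation: the de-hyphenation rule also joins genuine compounds that happen to break at a line end (for example "follow-" / "up" becomes "followup"), so a dictionary check may be worth adding.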
Tools
Text Extraction
Text Annotation
Ontology
- BioThings Explorer
- BioThings Explorer is an engine for autonomously querying a distributed knowledge graph. The distributed knowledge graph is made up of biomedical APIs that have been annotated with semantically-precise descriptions of their inputs and outputs.
Similarity Measures
- Levenshtein Edit Distance
- Jaro/Jaro-Winkler
- Soundex
- Description and Evaluation of Semantic Similarity Measures Approaches
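For reference, Levenshtein edit distance (the first measure above) can be written in a few lines of plain Python. This is a generic textbook sketch using dynamic programming, not necessarily the implementation used in this project; the `similarity` helper, which scales the distance into a 0 to 1 score, is an added assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string `a` into string `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the edit distance to a score between 0.0 and 1.0 (1.0 = identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("fracture", "fractures"))
```

A normalised score like this makes it easier to set one threshold for "close enough" matches across strings of different lengths.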
How to run
- Make sure you have your NER model.
- Make sure you have an input folder of PDFs; check here for an example.
- Run:
$ ./pipeline.sh