Contributors Forks Stargazers Issues Apache-2.0 License

NLP Information Extraction

Benzon Carlitos Salazar

This is my 2020 undergraduate project on automated Information Extraction.

The goal of this project is to automate data/information extraction in order to build a larger database of CSVs for the medical domain (for proprietary research at the University of Wisconsin-Whitewater).

General Overview of the Pipeline

  • Case reports are crawled from online resources
  • Documents are converted from PDF to text and cleaned
  • NER model created
    • Summary:
      • Successfully created a first version of our NER model.
      • The attributes we would generally extract manually are correctly automatically extracted by our model.
      • After this step, we move on to extracting the action values from our case reports.
      • The sample can be found here.
  • Validating machine-generated CSVs against human-generated CSVs
  • Sentence Classification/Named Entity Recognition applied to the Case Report section of the literature
    • Summary:
      • Successfully performed Named Entity Recognition and Information Extraction.
      • I have run IE on 50 case reports as a test case; the full implementation will run IE on the 223 PDF articles.
      • A sample from the 50 case reports can be found here.
  • CSV assembly from relevant sentences
    • Summary:
      • Successfully converted all the NER results into a proper CSV, which can be found here.
  • CSV for proprietary research
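
The validation step above (machine-generated CSVs against human-generated CSVs) could be sketched roughly as follows. This is only an illustration and not the project's actual code: the function names, the `report_id` key column, and the sample columns `age` and `sex` are all assumptions made for the example. It reports, for each column, the fraction of shared case reports where the two CSVs agree.

```python
import csv
import io


def read_rows(csv_text, key):
    """Parse CSV text into a dict mapping each row's key value to the row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def field_agreement(machine_csv, human_csv, key="report_id"):
    """Return, per column, the fraction of shared reports where both CSVs agree.

    Values are compared after stripping whitespace and lowercasing, so
    "Male " and "male" count as a match.
    """
    machine = read_rows(machine_csv, key)
    human = read_rows(human_csv, key)
    shared = sorted(set(machine) & set(human))
    if not shared:
        return {}
    # Use the human-generated CSV's columns as the reference set of fields.
    fields = [f for f in human[shared[0]] if f != key]
    scores = {}
    for field in fields:
        matches = sum(
            1
            for rid in shared
            if (machine[rid].get(field) or "").strip().lower()
            == (human[rid].get(field) or "").strip().lower()
        )
        scores[field] = matches / len(shared)
    return scores


# Hypothetical sample data; the real CSVs and their columns may differ.
HUMAN = "report_id,age,sex\n1,45,male\n2,60,female\n"
MACHINE = "report_id,age,sex\n1,45,Male\n2,61,female\n"

if __name__ == "__main__":
    print(field_agreement(MACHINE, HUMAN))
```

Exact matching after normalization is the simplest possible comparison; free-text fields would likely need a softer similarity measure.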

Sources I am following:

For all NLP:

Text preprocessing:

Tools

Text Extraction

Text Annotation

Ontology

  • BioThings Explorer
    • BioThings Explorer is an engine for autonomously querying a distributed knowledge graph. The distributed knowledge graph is made up of biomedical APIs that have been annotated with semantically-precise descriptions of their inputs and outputs.

Similarity Measures folder

How to run

  1. Make sure you have your NER model.
  2. Make sure you have a folder of input PDFs; check here for an example.
  3. Run:
$ ./pipeline.sh
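
The first two steps can be checked before launching the script. The sketch below is a hypothetical helper, not part of the repository: `check_inputs` and the paths `ner_model` and `input_pdfs` are placeholder names, since the README does not say where the model or the PDF folder should live.

```python
import subprocess
from pathlib import Path


def check_inputs(model_dir, pdf_dir):
    """Return a list of problems that would stop the pipeline from running."""
    problems = []
    if not Path(model_dir).exists():
        problems.append(f"NER model not found: {model_dir}")
    pdf_path = Path(pdf_dir)
    if not pdf_path.is_dir():
        problems.append(f"PDF input folder not found: {pdf_dir}")
    elif not any(p.suffix.lower() == ".pdf" for p in pdf_path.iterdir()):
        problems.append(f"No PDF files in: {pdf_dir}")
    return problems


if __name__ == "__main__":
    # "ner_model" and "input_pdfs" are placeholder paths, not the repo's names.
    issues = check_inputs("ner_model", "input_pdfs")
    if issues:
        for issue in issues:
            print(issue)
    else:
        # Only launch the pipeline once both prerequisites are in place.
        subprocess.run(["./pipeline.sh"], check=True)
```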
