License: Apache-2.0

NLP Information Extraction

Benzon Carlitos Salazar

This is my 2020 undergraduate project on automated information extraction.

The goal of this project is to automate data/information extraction in order to build a larger database of CSVs for the medical domain (for proprietary research at the University of Wisconsin-Whitewater).


General Overview of the Pipeline

  • Case reports are crawled from online resources
  • Documents are converted from PDF to text and cleaned
  • NER model created
    • Summary:
      • Created a first version of our NER model.
      • The attributes we would normally extract by hand are extracted correctly by the model.
      • After this step, we move on to extracting the action values from our case reports.
      • The sample can be found here.
  • Validating machine-generated CSVs against human-generated CSVs
  • Sentence classification/named entity recognition on the Case Report sections of the literature
    • Summary:
      • Named entity recognition and information extraction are complete.
      • I have run IE on 50 case reports as a test case; the full implementation will run IE on all 223 PDF articles.
      • The sample for the 50 case reports can be found here.
  • CSV assembly from relevant sentences
    • Summary:
      • Successfully converted all the NER results into a proper CSV, which can be found here.
  • CSV for proprietary research
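
The CSV assembly and validation steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the column names in `FIELDS`, the `(label, text)` entity format, and the helper names `entities_to_row`, `write_csv`, and `field_agreement` are all assumptions made for the example. It uses only the Python standard library.

```python
import csv
import io

# Hypothetical columns; the project's real attributes are not listed in this README.
FIELDS = ["report_id", "age", "sex", "diagnosis"]


def entities_to_row(report_id, entities):
    """Collapse (label, text) entity pairs into one CSV row, keeping the first hit per label."""
    row = {field: "" for field in FIELDS}
    row["report_id"] = report_id
    for label, text in entities:
        key = label.lower()
        if key in row and key != "report_id" and not row[key]:
            row[key] = text.strip()
    return row


def write_csv(rows):
    """Serialize rows to CSV text (an open file handle would work the same way)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def field_agreement(machine_rows, human_rows):
    """Per-field fraction of shared reports where machine and human values match.

    Values are compared case-insensitively after trimming whitespace.
    Reports missing from the human-generated set are skipped.
    """
    human_by_id = {row["report_id"]: row for row in human_rows}
    matches = {field: 0 for field in FIELDS if field != "report_id"}
    compared = 0
    for row in machine_rows:
        reference = human_by_id.get(row["report_id"])
        if reference is None:
            continue
        compared += 1
        for field in matches:
            if row[field].strip().lower() == reference[field].strip().lower():
                matches[field] += 1
    return {f: (n / compared if compared else 0.0) for f, n in matches.items()}


# Made-up sample data, for illustration only.
machine = [
    entities_to_row("r1", [("AGE", "45"), ("SEX", "male"), ("DIAGNOSIS", "sarcoidosis")]),
    entities_to_row("r2", [("AGE", "30"), ("SEX", "female")]),
]
human = [
    {"report_id": "r1", "age": "45", "sex": "Male", "diagnosis": "sarcoidosis"},
    {"report_id": "r2", "age": "31", "sex": "female", "diagnosis": ""},
]
print(write_csv(machine))
print(field_agreement(machine, human))
```

Exact string matching is the simplest possible agreement measure; a fuzzier comparison may suit free-text fields better.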

Sources I am following:

For all NLP:

Text preprocessing:


Text Extraction

Text Annotation


  • BioThings Explorer
    • BioThings Explorer is an engine for autonomously querying a distributed knowledge graph. The distributed knowledge graph is made up of biomedical APIs that have been annotated with semantically-precise descriptions of their inputs and outputs.

Similarity Measures folder

How to run

  1. Make sure you have your NER model.
  2. Make sure you have a folder of input PDFs; see here for an example.
  3. Run:
$ ./
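
The first two prerequisites can be checked before running the pipeline. The sketch below is only an illustration: it assumes the NER model is saved as a directory on disk, and the function name `check_inputs` and any paths passed to it are hypothetical, not part of this repository.

```python
from pathlib import Path


def check_inputs(model_dir, pdf_dir):
    """Return a list of problems; an empty list means both prerequisites look satisfied."""
    problems = []
    model_path = Path(model_dir)
    pdf_path = Path(pdf_dir)

    # Assumption: the trained NER model is stored as a directory.
    if not model_path.is_dir():
        problems.append(f"NER model directory not found: {model_path}")

    if not pdf_path.is_dir():
        problems.append(f"Input folder not found: {pdf_path}")
    elif not any(p.suffix.lower() == ".pdf" for p in pdf_path.iterdir()):
        problems.append(f"No PDF files in input folder: {pdf_path}")

    return problems
```

Calling `check_inputs` with your own model and input folder paths and printing the returned list gives a quick sanity check before step 3.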

