Tools for manipulating Tabular Document-Concept format

erwanm/tdc-tools

tdc-tools

Tools for representing and manipulating data in the Tabular Document-Concept (TDC) format

Overview

This repository contains Python and Bash scripts to generate and manipulate data in the Tabular Document-Concept (TDC) format. TDC is a format specifically designed to represent the biomedical literature as a collection of documents, each represented by its concepts. In particular, it facilitates the extraction of a knowledge graph of concepts and can be used as a basis for Literature-Based Discovery (LBD).

Most of the biomedical literature is available for download from Medline and PubMedCentral (PMC). PubTatorCentral (PTC) offers an alternative to the raw data format: the BioC format. Although the PTC data is much richer and BioC is more convenient than the raw XML format, all of these formats are fairly low level: very detailed, quite complex to parse, and poorly suited to capturing high-level relations between articles or concepts. By contrast, the TDC format is a high-level representation of the literature in which each document is treated as a collection of concepts and the documents are grouped by year of publication. The format is meant to facilitate the extraction of the individual and joint frequencies of concepts.

  • Common format for different extraction methods, e.g. using the Knowledge Discovery (KD) system or PTC.
  • Suited for Literature-Based Discovery and similar applications
  • Tabular format akin to a relational database
  • Year-based format to facilitate analysis across time or filtering by range of years
  • Can preserve the link between a concept and its source sentence/document (the format allows this, but it is not implemented)
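
To illustrate the idea of documents as collections of concepts grouped by year, here is a minimal Python sketch of counting individual and joint concept frequencies, with optional filtering by range of years. The in-memory layout (`docs_by_year`), the concept ids and the function `count_frequencies` are hypothetical and chosen only for illustration; they do not reflect the actual TDC file layout or the scripts in this repository.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: year -> list of documents, each a set of concept ids.
# This layout is an assumption for illustration, not the actual TDC files.
docs_by_year = {
    2019: [{"C1", "C2"}, {"C1", "C3"}],
    2020: [{"C1", "C2", "C3"}],
}


def count_frequencies(docs_by_year, min_year=None, max_year=None):
    """Count the number of documents containing each concept and each pair."""
    individual = Counter()
    joint = Counter()
    for year, docs in docs_by_year.items():
        # Year-based grouping makes filtering by range of years trivial.
        if min_year is not None and year < min_year:
            continue
        if max_year is not None and year > max_year:
            continue
        for concepts in docs:
            # Each concept counts once per document.
            individual.update(concepts)
            # Sort so that every pair has a single canonical key.
            joint.update(combinations(sorted(concepts), 2))
    return individual, joint


individual, joint = count_frequencies(docs_by_year)
print(individual["C1"])       # 3 documents contain C1
print(joint[("C1", "C2")])    # 2 documents contain both C1 and C2
```

Counting documents rather than mentions is a design choice made for this sketch; association measures such as PMI can then be derived from the individual and joint counts together with the total number of documents.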

Changelog

1.0.2

  • [added] group option for script get-frequency-from-doc-concept-matrix.py and related bash scripts
  • [added] documentation for script build-dcm-from-mesh-descriptors-by-pmid.py
  • [fixed] minor issues in documentation

1.0.1

  • [added] option to ignore PTC types in add-term-from-umls.py
  • [added] new script build-dcm-from-mesh-descriptors-by-pmid.py to generate a doc-concept matrix (dcm) from a file containing the Medline MeSH descriptors by PMID
  • [added] option in filter-column.py to ignore NA values in numerical filtering
  • [added] new script to calculate association measures (e.g. PMI)

1.0.0

  • [added] full documentation

License

This software is published under the GPL 3.0 license. Please see file LICENSE.txt in this repository for details.
