Commit

Corrected version of full corpus data, and corresponding stats and descriptions.
czi-sunil committed Sep 18, 2018
1 parent da03de1 commit e075f26
Showing 5 changed files with 12 additions and 11 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
+# Exclude local doc README.txt
+README.txt
# Exclude compiled java classes
*.class
# compiled Python
13 changes: 6 additions & 7 deletions README.md
@@ -56,23 +56,22 @@ by concatenating the Title and Abstract, separated by a SPACE character. The _Me
is the actual mention between those character positions. The _EntityID_ is the UMLS entity
(concept) id, and the _SemanticTypeID_ is the id for the Semantic Type that entity is linked
to in UMLS. If the UMLS entity is linked to more than one semantic type, then this field
-contains the lowest common ancestor. All UMLS concepts that are not in the 2017-AA Active release are linked to the
+contains a comma-separated list of all these type IDs. All UMLS concepts that are not in the 2017-AA Active release are linked to the
special semantic type _UnknownType_.

Here is an example:
```
25763772|t|DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis
25763772|a|Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients is associated with worse long-term pulmonary disease and shorter survival, and chronic Pa infection (CPA) is associated with reduced lung function, faster rate of lung decline, increased rates of exacerbations and shorter survival. By using exome sequencing and extreme phenotype design, it was recently shown that isoforms of dynactin 4 (DCTN4) may influence Pa infection in CF, leading to worse respiratory disease. The purpose of this study was to investigate the role of DCTN4 missense variants on Pa infection incidence, age at first Pa infection and chronic Pa infection incidence in a cohort of adult CF patients from a single centre. Polymerase chain reaction and direct sequencing were used to screen DNA samples for DCTN4 variants. A total of 121 adult CF patients from the Cochin Hospital CF centre have been included, all of them carrying two CFTR defects: 103 developed at least 1 pulmonary infection with Pa, and 68 patients of them had CPA. DCTN4 variants were identified in 24% (29/121) CF patients with Pa infection and in only 17% (3/18) CF patients with no Pa infection. Of the patients with CPA, 29% (20/68) had DCTN4 missense variants vs 23% (8/35) in patients without CPA. Interestingly, p.Tyr263Cys tend to be more frequently observed in CF patients with CPA than in patients without CPA (4/68 vs 0/35), and DCTN4 missense variants tend to be more frequent in male CF patients with CPA bearing two class II mutations than in male CF patients without CPA bearing two class II mutations (P = 0.06). Our observations reinforce that DCTN4 missense variants, especially p.Tyr263Cys, may be involved in the pathogenesis of CPA in male CF.
-25763772 0 5 DCTN4 T103 C4308010
-25763772 23 63 chronic Pseudomonas aeruginosa infection T038 C0854135
-25763772 67 82 cystic fibrosis T038 C0010674
-25763772 83 120 Pseudomonas aeruginosa (Pa) infection T038 C0854135
+25763772 0 5 DCTN4 T116,T123 C4308010
+25763772 23 63 chronic Pseudomonas aeruginosa infection T047 C0854135
+25763772 67 82 cystic fibrosis T047 C0010674
+25763772 83 120 Pseudomonas aeruginosa (Pa) infection T047 C0854135
...
```
In this example, the Title is 82 characters long. The first mention is for the UMLS concept
"DCTN4 protein, human" whose UMLS id is _C4308010_. This entity is linked to two semantic
-types: "Amino Acid, Peptide, or Protein" (T116) and "Biologically Active Substance" (T123),
-whose lowest common ancestor is "Chemical" (_T103_ ).
+types: "Amino Acid, Peptide, or Protein" (T116) and "Biologically Active Substance" (T123).
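
The format described above can be read with a few lines of Python. The following is a minimal sketch, not the project's own loader: the names `Mention` and `parse_document` are hypothetical, and it assumes that annotation lines are tab-separated (the rendered example only shows whitespace) and that the column order is PMID, start, end, mention text, semantic type IDs, entity ID, as in the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mention:
    """One annotation line (hypothetical container, not part of the corpus tools)."""
    pmid: str
    start: int
    end: int
    text: str
    type_ids: List[str]   # one or more semantic type IDs
    entity_id: str        # UMLS concept ID


def parse_document(lines: List[str]) -> Tuple[str, List[Mention]]:
    """Parse the lines of a single document.

    Returns the document text (Title + one SPACE + Abstract) and its mentions.
    Assumes annotation fields are separated by TAB characters.
    """
    title, abstract = "", ""
    mentions: List[Mention] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a"):
            # Title ("|t|") or abstract ("|a|") line.
            if head[1] == "t":
                title = head[2]
            else:
                abstract = head[2]
            continue
        pmid, start, end, text, types, entity = line.split("\t")
        mentions.append(
            Mention(pmid, int(start), int(end), text, types.split(","), entity)
        )
    # Character offsets index into Title and Abstract joined by a single space.
    return title + " " + abstract, mentions
```

With this layout, `text[m.start:m.end]` should equal `m.text` for every mention `m`, which is a cheap consistency check when loading the corpus.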


## How to cite
8 changes: 4 additions & 4 deletions full/ReadMe.md
@@ -24,11 +24,11 @@ Description | Stat | avg
Number of Concepts in UMLS 2017-AA Active | 3,271,124
Number of Semantic Types (incl. UnknownType) | 128
Number of Annotated Docs in MedMentions | 4,392
-Total number of Mentioned Concepts | 34,720 | (1.06% of UMLS)
-Total number of Mentions in MedMentions | 351,813 | (80.1 / doc)
+Total number of Mentioned Concepts | 34,728 | (1.06% of UMLS)
+Total number of Mentions in MedMentions | 352,594 | (80.3 / doc)
Total Number of Tokens (PTB via StanfordNLP) | 1,176,058 | (267.8 / doc)
-Number of Annotated Tokens | 493,225 | (112.3 / doc)
-Proportion of tokens annotated | 41.9% | (1.4 / mention)
+Number of Annotated Tokens | 493,908 | (112.5 / doc)
+Proportion of tokens annotated | 42.0% | (1.4 / mention)
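
The derived columns in the updated rows can be reproduced from the raw counts. The snippet below is only a sanity check under the assumption that each average is a plain ratio rounded to the precision shown in the table; the variable names are illustrative.

```python
# Raw counts copied from the updated rows of the table above.
num_docs = 4_392
umls_concepts = 3_271_124
mentioned_concepts = 34_728
mentions = 352_594
tokens = 1_176_058
annotated_tokens = 493_908

# Derived values, rounded to the precision used in the table.
derived = {
    "concepts_pct_of_umls": round(100 * mentioned_concepts / umls_concepts, 2),
    "mentions_per_doc": round(mentions / num_docs, 1),
    "tokens_per_doc": round(tokens / num_docs, 1),
    "annotated_tokens_per_doc": round(annotated_tokens / num_docs, 1),
    "pct_tokens_annotated": round(100 * annotated_tokens / tokens, 1),
    "annotated_tokens_per_mention": round(annotated_tokens / mentions, 1),
}

for name, value in derived.items():
    print(f"{name}: {value}")
```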

As a comparison, the [BioCreative V Chemical-Disease Relation Task Corpus (BC5-CDR)](http://www.biocreative.org/resources/biocreative-v/proceedings-biocreative5/)
is a smaller set of 1,500 papers annotated only with Chemical and Disease entity mentions
Binary file modified full/data/corpus_pubtator.txt.gz
Binary file not shown.
Binary file modified full/stats/TypeMentionStats.xlsx
Binary file not shown.
