# BigScience/Biomedical Demo Notebook
### 9/1/22
### Author: David Kartchner

This notebook provides a short demo of loading and inspecting BigBio biomedical datasets.

In [4]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import ujson

from bigbio.dataloader import BigBioConfigHelpers
from tqdm.auto import tqdm

import sys
sys.path.append('..')
from bigbio_utils import get_dataset_df

conhelps = BigBioConfigHelpers()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# TL;DR 
### Get a BigBio dataset (preformatted, deduplicated, etc.) as a pandas DataFrame

In [9]:
# Load the abbreviation dictionary; the context manager closes the file afterwards
with open('../data/abbreviations.json') as f:
    abbrev_dict = ujson.load(f)
data_df = get_dataset_df('ncbi_disease', abbrev_dict=abbrev_dict)
display(data_df.head())
data_df.to_dict(orient='records')[:5]

Found cached dataset ncbi_disease (/nethome/dkartchner3/.cache/huggingface/datasets/ncbi_disease/ncbi_disease_bigbio_kb/1.0.0/e6b217666a5647d5abc614785b2caad62f1d72a94d1631b86c0f615b75dcc865)


  0%|          | 0/3 [00:00<?, ?it/s]

| | document_id | offsets | text | type | db_ids | split | mention_id | deabbreviated_text |
|---|---|---|---|---|---|---|---|---|
| 2 | 10021369 | [[43, 76]] | adenomatous polyposis coli tumour | [Modifier] | [MESH:D011125] | train | 10021369.1 | adenomatous polyposis coli tumour |
| 3 | 10021369 | [[93, 132]] | adenomatous polyposis coli (APC) tumour | [Modifier] | [MESH:D011125] | train | 10021369.2 | adenomatous polyposis coli (APC) tumour |
| 1 | 10021369 | [[357, 372]] | colon carcinoma | [Modifier] | [MESH:D003110] | train | 10021369.3 | colon carcinoma |
| 4 | 10021369 | [[955, 970]] | colon carcinoma | [Modifier] | [MESH:D003110] | train | 10021369.4 | colon carcinoma |
| 0 | 10021369 | [[1090, 1096]] | cancer | [SpecificDisease] | [MESH:D009369] | train | 10021369.5 | cancer |


[{'document_id': '10021369',
  'offsets': [[43, 76]],
  'text': 'adenomatous polyposis coli tumour',
  'type': ['Modifier'],
  'db_ids': ['MESH:D011125'],
  'split': 'train',
  'mention_id': '10021369.1',
  'deabbreviated_text': 'adenomatous polyposis coli tumour'},
 {'document_id': '10021369',
  'offsets': [[93, 132]],
  'text': 'adenomatous polyposis coli (APC) tumour',
  'type': ['Modifier'],
  'db_ids': ['MESH:D011125'],
  'split': 'train',
  'mention_id': '10021369.2',
  'deabbreviated_text': 'adenomatous polyposis coli (APC) tumour'},
 {'document_id': '10021369',
  'offsets': [[357, 372]],
  'text': 'colon carcinoma',
  'type': ['Modifier'],
  'db_ids': ['MESH:D003110'],
  'split': 'train',
  'mention_id': '10021369.3',
  'deabbreviated_text': 'colon carcinoma'},
 {'document_id': '10021369',
  'offsets': [[955, 970]],
  'text': 'colon carcinoma',
  'type': ['Modifier'],
  'db_ids': ['MESH:D003110'],
  'split': 'train',
  'mention_id': '10021369.4',
  'deabbreviated_text': 'colon 

# Load a dataset
**Example below uses MedMentions Full**

Datasets we use in this project: 
* medmentions_full
* medmentions_st21pv  
* bc5cdr  
* gnormplus  
* ncbi_disease  
* nlmchem  
* nlm_gene

Soon to be included:
* bc6id
* plant_norm
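
The config names passed to `for_config_name` appear to follow the pattern `<dataset>_bigbio_kb`, as seen for `medmentions_full` in the next cell and in the `ncbi_disease` cache path above. The following is a minimal sketch that assumes this pattern holds for every dataset in the first list. `kb_config_name` is a hypothetical helper written for illustration and is not part of `bigbio`.

```python
# Datasets listed above as used in this project.
DATASETS = [
    "medmentions_full",
    "medmentions_st21pv",
    "bc5cdr",
    "gnormplus",
    "ncbi_disease",
    "nlmchem",
    "nlm_gene",
]


def kb_config_name(dataset_name):
    """Build a config name, ASSUMING the `<dataset>_bigbio_kb` pattern."""
    return f"{dataset_name}_bigbio_kb"


config_names = [kb_config_name(name) for name in DATASETS]
for config_name in config_names:
    print(config_name)

# Each name could then be loaded the same way as in the next cell, e.g.
#   conhelps.for_config_name(config_name).load_dataset()
```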

In [2]:
data = conhelps.for_config_name("medmentions_full_bigbio_kb").load_dataset()

Reusing dataset medmentions (/home/dkartchner3/.cache/huggingface/datasets/medmentions/medmentions_full_bigbio_kb/1.0.0/3fc6b8a3681d540ae6c7497c238636b543b90764247b5ff3642d243474000794)


  0%|          | 0/3 [00:00<?, ?it/s]

In [12]:
# Get a split of the data
train = data['train']
train

Dataset({
    features: ['id', 'document_id', 'passages', 'entities', 'events', 'coreferences', 'relations'],
    num_rows: 2635
})

In [13]:
doc0 = train[0]
doc0

{'id': '0',
 'document_id': '25763772',
 'passages': [{'id': '110',
   'type': 'title',
   'text': ['DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis'],
   'offsets': [[0, 82]]},
  {'id': '111',
   'type': 'abstract',
   'text': ['Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients is associated with worse long-term pulmonary disease and shorter survival, and chronic Pa infection (CPA) is associated with reduced lung function, faster rate of lung decline, increased rates of exacerbations and shorter survival. By using exome sequencing and extreme phenotype design, it was recently shown that isoforms of dynactin 4 (DCTN4) may influence Pa infection in CF, leading to worse respiratory disease. The purpose of this study was to investigate the role of DCTN4 missense variants on Pa infection incidence, age at first Pa infection and chronic Pa infection incidence in a cohort of adult CF patients from a single centre. Polymerase chain react

## Passages
* id: Unique id of each passage.
* type: Type of passage (e.g. title or abstract). Not always meaningful.
* text: Text of the passage. Represented as a list in case the passage consists of multiple chunks of text.
* offsets: Offsets of each chunk of text in the passage, measured from the beginning of the entire document and assuming exactly one space between passages.
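
The offset convention described above can be checked by rebuilding the document text from its passages. This is a minimal sketch, not part of `bigbio`: `document_text` and `offsets_are_consistent` are hypothetical helpers, and the abstract in the toy input is a shortened stand-in for the real one.

```python
def document_text(passages):
    """Join all passage chunks, in offset order, with single spaces."""
    chunks = []
    for passage in passages:
        for text, (start, end) in zip(passage["text"], passage["offsets"]):
            chunks.append((start, end, text))
    chunks.sort()  # order chunks by their start offset
    return " ".join(text for _, _, text in chunks)


def offsets_are_consistent(passages):
    """Check that slicing the rebuilt document returns each chunk."""
    full = document_text(passages)
    return all(
        full[start:end] == text
        for passage in passages
        for text, (start, end) in zip(passage["text"], passage["offsets"])
    )


# Toy input: the real title of doc0 plus a shortened, illustrative abstract.
passages = [
    {
        "id": "110",
        "type": "title",
        "text": ["DCTN4 as a modifier of chronic Pseudomonas aeruginosa "
                 "infection in cystic fibrosis"],
        "offsets": [[0, 82]],
    },
    {
        "id": "111",
        "type": "abstract",
        "text": ["Pseudomonas aeruginosa (Pa) infection"],
        "offsets": [[83, 120]],
    },
]

print(offsets_are_consistent(passages))
```

The abstract starts at offset 83 rather than 82 because of the single space assumed between passages.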

In [18]:
doc0['passages']

[{'id': '110',
  'type': 'title',
  'text': ['DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis'],
  'offsets': [[0, 82]]},
 {'id': '111',
  'type': 'abstract',
  'text': ['Pseudomonas aeruginosa (Pa) infection in cystic fibrosis (CF) patients is associated with worse long-term pulmonary disease and shorter survival, and chronic Pa infection (CPA) is associated with reduced lung function, faster rate of lung decline, increased rates of exacerbations and shorter survival. By using exome sequencing and extreme phenotype design, it was recently shown that isoforms of dynactin 4 (DCTN4) may influence Pa infection in CF, leading to worse respiratory disease. The purpose of this study was to investigate the role of DCTN4 missense variants on Pa infection incidence, age at first Pa infection and chronic Pa infection incidence in a cohort of adult CF patients from a single centre. Polymerase chain reaction and direct sequencing were used to screen DNA samples f

## Entity Annotations
* id: Unique id assigned to the entity annotation.
* type: Type of the entity. Note that entities with multiple type annotations are listed multiple times.
* text: Text of the entity. Since some entities are discontiguous (may be split up), this is a list with one string per contiguous span (see **Nuances** below).
* offsets: Where in the text the entity is located. Discontiguous entities have multiple sets of offsets (see **Nuances** below).
* normalized: Database identifier(s) to which the entity is linked. In some cases, the text could be linked to multiple entities (e.g. from separate databases).
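
As an illustration of these fields, the sketch below flattens one entity annotation into a flat record similar to the rows shown in the TL;DR section. It makes two assumptions: the hypothetical helper `entity_to_record` is not the actual `get_dataset_df` implementation, and the `db_name:db_id` string format is inferred from the `db_ids` values printed earlier.

```python
def entity_to_record(entity, document_id):
    """Flatten one entity annotation into a flat dict (illustrative only)."""
    return {
        "document_id": document_id,
        "offsets": entity["offsets"],
        # Join the spans of a (possibly discontiguous) mention with spaces.
        "text": " ".join(entity["text"]),
        "type": entity["type"],
        # ASSUMPTION: ids are written as "<db_name>:<db_id>".
        "db_ids": [f"{n['db_name']}:{n['db_id']}" for n in entity["normalized"]],
    }


# Title of doc0 and one of its entity annotations, copied from the outputs.
title = ("DCTN4 as a modifier of chronic Pseudomonas aeruginosa "
         "infection in cystic fibrosis")
entity = {
    "id": "3",
    "type": "T047",
    "text": ["chronic Pseudomonas aeruginosa infection"],
    "offsets": [[23, 63]],
    "normalized": [{"db_name": "UMLS", "db_id": "C0854135"}],
}

record = entity_to_record(entity, document_id="25763772")
print(record)

# The offsets index directly into the document text.
start, end = entity["offsets"][0]
print(title[start:end] == entity["text"][0])
```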

### Nuances
1. If a data point has multiple types, it is included once for each type. For example, "acetaminophen" could be both an `organic chemical` and a `pharmacologic substance`. In this case, there will be two separate entries in the entity list, one with each type.
2. Some entities may be *noncontiguous*. For example, the phrase `cardiovascular and pulmonary fibrosis` contains two entities: `cardiovascular fibrosis` and `pulmonary fibrosis`. In this case, `cardiovascular fibrosis` will have two sets of entity offsets.
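
Both nuances can be handled with a few lines of plain Python. This is a minimal sketch under stated assumptions: `merge_types` and `surface_form` are hypothetical helpers, and the discontiguous example uses offsets relative to the example phrase rather than to a real document.

```python
def merge_types(entities):
    """Collapse entries that share the same offsets into one entry with a list of types."""
    merged = {}
    for ent in entities:
        key = tuple(tuple(span) for span in ent["offsets"])
        if key not in merged:
            merged[key] = {
                "offsets": ent["offsets"],
                "text": ent["text"],
                "types": [],
                "normalized": ent["normalized"],
            }
        if ent["type"] not in merged[key]["types"]:
            merged[key]["types"].append(ent["type"])
    return list(merged.values())


def surface_form(document, offsets):
    """Rebuild a mention from one or more (start, end) spans."""
    return " ".join(document[start:end] for start, end in offsets)


# Nuance 1: the same mention listed once per type (first two entities of doc0).
entities = [
    {"id": "1", "type": "T116", "text": ["DCTN4"], "offsets": [[0, 5]],
     "normalized": [{"db_name": "UMLS", "db_id": "C4308010"}]},
    {"id": "2", "type": "T123", "text": ["DCTN4"], "offsets": [[0, 5]],
     "normalized": [{"db_name": "UMLS", "db_id": "C4308010"}]},
]
merged = merge_types(entities)
print(merged[0]["types"])

# Nuance 2: a discontiguous mention (offsets are relative to this phrase).
phrase = "cardiovascular and pulmonary fibrosis"
cardio_offsets = [[0, 14], [29, 37]]
print(surface_form(phrase, cardio_offsets))
```

Grouping by offsets keeps one row per mention, which is convenient when counting mentions. Keeping one entry per type, as the raw data does, is simpler when filtering by semantic type.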

In [17]:
doc0['entities']

[{'id': '1',
  'type': 'T116',
  'text': ['DCTN4'],
  'offsets': [[0, 5]],
  'normalized': [{'db_name': 'UMLS', 'db_id': 'C4308010'}]},
 {'id': '2',
  'type': 'T123',
  'text': ['DCTN4'],
  'offsets': [[0, 5]],
  'normalized': [{'db_name': 'UMLS', 'db_id': 'C4308010'}]},
 {'id': '3',
  'type': 'T047',
  'text': ['chronic Pseudomonas aeruginosa infection'],
  'offsets': [[23, 63]],
  'normalized': [{'db_name': 'UMLS', 'db_id': 'C0854135'}]},
 {'id': '4',
  'type': 'T047',
  'text': ['cystic fibrosis'],
  'offsets': [[67, 82]],
  'normalized': [{'db_name': 'UMLS', 'db_id': 'C0010674'}]},
 {'id': '5',
  'type': 'T047',
  'text': ['Pseudomonas aeruginosa (Pa) infection'],
  'offsets': [[83, 120]],
  'normalized': [{'db_name': 'UMLS', 'db_id': 'C0854135'}]},
 {'id': '6',
  'type': 'T047',
  'text': ['cystic fibrosis'],
  'offsets': [[124, 139]],
  'normalized': [{'db_name': 'UMLS', 'db_id': 'C0010674'}]},
 {'id': '7',
  'type': 'T047',
  'text': ['CF'],
  'offsets': [[141, 143]],
  'normali