# Visualize the data

This paper used a reverse substitution method to generate the dataset as follows: 

A previous work provided a list of medical acronyms and their corresponding expansions; these formed the key-value pairs of targets. The authors then obtained medical notes and identified sentences containing expansions that are present in the targets dataset. Each such expansion was replaced in place by its acronym from the targets dataset. The data therefore contains the following fields:

- id
- text: the full source sentence
- location: the word index where the acronym was present
- label: the correct expansion of the acronym
- label_num: the integer index of the expansion given in the label field (generated by the load_dataframes function). Since we are training a classification model, this index is what the model actually predicts, and it is used to calculate the loss.
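
The reverse substitution described above can be sketched in a few lines. This is only an illustration under assumptions: the `EXPANSION_TO_ACRONYM` table, the `reverse_substitute` helper, and the example sentence are hypothetical and are not taken from the paper's code or from the ADAM file.

```python
# Hypothetical expansion -> acronym table (toy values for illustration only).
EXPANSION_TO_ACRONYM = {
    "transverse aortic constriction": "TAC",
    "reactive oxygen species": "ROS",
}

# Integer index for every expansion; this plays the role of label_num.
label_to_ix = {label: ix for ix, label in enumerate(sorted(EXPANSION_TO_ACRONYM))}


def reverse_substitute(sentence, expansion):
    """Replace the first occurrence of `expansion` with its acronym.

    Returns a record with the same fields as the dataset, or None when the
    expansion does not occur in the sentence.
    """
    words = sentence.split()
    target = expansion.split()
    n = len(target)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            new_words = words[:i] + [EXPANSION_TO_ACRONYM[expansion]] + words[i + n:]
            return {
                "text": " ".join(new_words),
                "location": i,  # word index of the acronym in the new text
                "label": expansion,
                "label_num": label_to_ix[expansion],
            }
    return None


record = reverse_substitute(
    "rats underwent transverse aortic constriction or a sham operation",
    "transverse aortic constriction",
)
print(record)
```

In this sketch, `location` points at the acronym in the rewritten sentence, so `record["text"].split()[record["location"]]` recovers the acronym.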

In [7]:
import pandas as pd 
import os

from utils import load_dataframes, load_model, train_loop

import torch
from transformers import ElectraTokenizer
from models.tokenizer_and_dataset import HuggingfaceDataset

pd.set_option('display.max_colwidth', None)
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

data_path = "data/train/data.csv"
# List of all valid expansions that were used in reverse substitution
# (from this paper: https://academic.oup.com/bioinformatics/article/22/22/2813/197656)
adam_path = "toy_data/valid_adam.txt"

In [3]:
train_data = pd.read_csv(data_path, nrows=300)

In [4]:
adam_df = pd.read_csv(adam_path, sep='\t')
len(adam_df.EXPANSION.unique())
adam_df.head()

Unnamed: 0,PREFERRED_AB,VARIANT_AB,EXPANSION
0,+dP/dt,+dP/dt:18|+dp/dt:3,contractility
1,+dP/dt,+dP/dt:6,rate of left ventricular pressure development
2,"1,5-AG","1,5-AG:10|1,5AG:3",anhydrodglucitol
3,"1,5-AG","1,5-AG:14|1,5AG:11",anhydroglucitol
4,14C,14C:8,labeled with radioactive carbon


In [5]:
train, valid, test, label_to_ix = \
        load_dataframes(data_dir="data", data_filename="data.csv", adam_path=adam_path)

train.head()

Unnamed: 0,ABSTRACT_ID,TEXT,LOCATION,LABEL,LABEL_NUM
0,14145090,velvet antlers vas are commonly used in traditional chinese medicine and invigorant and contain many PET components for health promotion the velvet antler peptide svap is one of active components in vas based on structural study the svap interacts with tgfβ receptors and disrupts the tgfβ pathway we hypothesized that svap prevents cardiac fibrosis from pressure overload by blocking tgfβ signaling SDRs underwent TAC tac or a sham operation T3 one month rats received either svap mgkgday or vehicle for an additional one month tac surgery induced significant cardiac dysfunction FB activation and fibrosis these effects were improved by treatment with svap in the heart tissue tac remarkably increased the expression of tgfβ and connective tissue growth factor ctgf ROS species C2 and the phosphorylation C2 of smad and ERK kinases erk svap inhibited the increases in reactive oxygen species C2 ctgf expression and the phosphorylation of smad and erk but not tgfβ expression in cultured cardiac fibroblasts angiotensin ii ang ii had similar effects compared to tac surgery such as increases in αsmapositive CFs and collagen synthesis svap eliminated these effects by disrupting tgfβ IB to its receptors and blocking ang iitgfβ downstream signaling these results demonstrated that svap has antifibrotic effects by blocking the tgfβ pathway in CFs,63,transverse aortic constriction,20135
1,1900667,the clinical features of our cases demonstrated some of the already known characteristics of the variable spectrum of hiv infection da are the most important risk category in italy of the arc cases evolved into aids during a month followup on average the most frequent oi in our aids cases were pcp c albicans esophagitis and chronic mucocutaneous ulcers an high percentage of neurologic involvement from hiv was observed and HM were encountered in aids ks and undifferentiated b lymphoma as well as in arc HD statistically significant worsening of the immunologic situation is evident as the disease progresses from las to aids G1 b lymphocytes represent most of the cells of the germinal center during the hyperplastic stage of lymphadenopathy reversal of the tt ratio appears early during the initial stage of lymphadenopathy and is due to a decrease of cd and a relative increase of cd also destruction of the follicular dendritic cells is an early feature which becomes more evident as the disease advances and the lymph node evolves toward progressive involution activated blymphocyte augmentation with polyclonal ig secretion appears to be related to TI b stimulation by coinfection such as cmv ebv and hbv the increase of CD8 lymphocytes seems to be partly related to the excessive activation of b lymphocytes and partially directed to the cells INF by hiv or coated with its proteins the destruction of follicular dendritic cells has been interpreted not only as a killer effect of the virus but also as a result of the intervention of ctl sensitized to the cells containing the virus their destruction may contribute to the impaired recognition of soluble antigen which is one of the main features of the immune deficiency of hiv infection,85,hodgkins lymphoma,9002
2,8625554,ceftobiprole bpr is an investigational cephalosporin with activity against staphylococcus aureus including methicillinresistant s aureus mrsa strains the pharmacodynamic pd profile of bpr against s aureus strains with a variety of susceptibility phenotypes in an immunocompromised murine pneumonia model was characterized the bpr mics of the test isolates ranged from to mugml pharmacokinetic pk studies were conducted with infected neutropenic balbc mice and the bpr concentrations were measured in plasma epithelial lining fluid elf and lung tissue pd studies with these mice were undertaken with eight s aureus isolates two MSSA strains three hospitalacquired mrsa strains and three CA mrsa strains subcutaneous bpr doses of to mgkg of body weightday were administered and the NC in the number of log cfuml in lungs was evaluated after h of therapy the pd profile was characterized by using the free drug exposures f determined from the following parameters the percentage of time that the concentration was greater than the mic t mic the Cmax in serummic and the area under the concentrationtime curvemic the bpr pk parameters were linear over the dose range studied in plasma and the elf concentrations ranged from to of the free plasma concentration ft mic was the parameter that best correlated with tau against a diverse array of s aureus isolates in this mu pneumonia model the effective dose ed ed and stasis exposures appeared to be similar among the isolates studied bpr exerted maximal antibacterial effects when ft mic ranged from to regardless of the phenotypic profile of resistance to betalactam fluoroquinolone erythromycin clindamycin or TCs,90,methicillinsusceptible s aureus,13368
3,8157202,we have taken a basic biologic RPA to elucidate the pathophysiology of BPD bpd the CLD based on cellmolecular mechanisms of physiologic lung OD stretch coordinates PTHrP pthrp signaling between the ATII cell and the mesoderm to coordinately upregulate key genes for the homeostatic FB phenotype including peroxisome proliferator G1 TRG ppargamma adipocyte differentiation related protein adrp and OB and the retrograde stimulation of type ii cell surfactant synthesis by leptin each of these paracrine interactions requires cellspecific receptors on adjacent cells derived from the mesoderm or endoderm respectively to serially upregulate the signaling pathways between and within each celltype it is this PET compartmentation that is key to understanding how specific agonists and antagonists can predictably affect this mechanism of AM homeostasis using a wide variety of pathophysiologic insults associated with bpd barotrauma oxotrauma and infection we have found that there are type ii cell andor FB cellmolecular effects generated by these insults which can lead to the bpd phenotype we have exploited these cellspecific mechanisms to effectively prevent and treat lung injuries using ppargamma agonists to sustain this signaling pathway it is critically important to judiciously select physiologically and developmentally relevant interventions when treating the preterm neonate,26,parathyroid hormonerelated protein,16983
4,6784974,lipoperoxidationderived aldehydes for example MDA mda can damage proteins by generating covalent adducts whose accumulation probably participates in tissue damage during aging however the mechanisms of adduct formation and their stability are scarcely known this article investigates whether oxidative steps are involved in the process as a MM of the process the interaction between mda and bovine SS Al bsa was analyzed incubation of bsa with mda resulted in rapid quenching of tryptophan fluorescence and appearance of mda protein adduct fluorescence transition metal ion traces interfered with the latter process mda induced generation of peroxides in bsa which was preventable with the antioxidant BHT bht mdaexposed bsa underwent aggregation degradation and bhtsensitive gel retardation effects phycoerythrin fluorescence disappearance a marker of damage mediated by ROS species indicated synergism between mda and metal ions the interaction between reactive aldehydes and proteins is likely to occur in several steps some of them oxidative in nature giving rise to T3 LPO endproducts which could participate with AGEs in the generation of tissue damage during aging,157,lipoperoxidation,11668


# Tokenization

Now we have our train, test, and validation data. The sentences need to be tokenized: broken down into chunks that are manageable yet meaningful. There are three ways we can break down a sentence:
- word-level (unigrams, bigrams, etc.). This makes the vocabulary (the list of all possible tokens) extremely large, since it must cover all possible words. The training data may not even contain examples of every word-level token.
- char-level (a, b, c, etc.). The vocabulary is nice and small (26 letters in English), but these tokens hold little information on their own.
- sub-word-level (token, izat, ion). This strikes a middle ground where the vocabulary size is manageable, yet each token still holds some meaning.
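
To make the trade-off concrete, here is a minimal sketch comparing word-level and char-level vocabularies on a toy corpus. The sentences and the unseen word are made up for illustration, and the sub-word split at the end is hand-picked rather than produced by a trained tokenizer.

```python
# Toy training corpus (hypothetical sentences, not from the dataset).
train_sentences = [
    "the patient was treated",
    "the patients were treated",
    "treatment of the patient",
]

# Word-level vocabulary: every distinct whitespace-separated word.
word_vocab = {word for s in train_sentences for word in s.split()}

# Char-level vocabulary: every distinct character, ignoring spaces.
char_vocab = {ch for s in train_sentences for ch in s if ch != " "}

# A word that never appears in the training corpus.
unseen = "treatments"

# Word-level: the unseen word has no token at all.
word_level_covered = unseen in word_vocab

# Char-level: every character was seen, so the word can always be encoded.
char_level_covered = all(ch in char_vocab for ch in unseen)

# Sub-word-level: a hand-picked split into reusable, meaningful pieces.
subword_split = ["treat", "##ment", "##s"]

print(word_level_covered, char_level_covered, subword_split)
```

The word-level vocabulary misses the unseen word entirely, while the char-level vocabulary covers it at the cost of ten tokens that carry little meaning individually; the sub-word split needs only three.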

In the paper, the authors use the pre-built ELECTRA tokenizer, which uses [WordPiece tokenization](https://huggingface.co/course/chapter6/6?fw=pt) under the hood. This is a sub-word tokenizer whose vocabulary is built by repeatedly merging pairs of adjacent pieces, chosen according to how often each pair appears in the training text relative to how often its parts appear on their own.
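
Once a WordPiece vocabulary exists, a word is split by greedy longest-match-first lookup, with `##` marking pieces that continue a word. The sketch below illustrates that lookup only; `wordpiece_tokenize` and `toy_vocab` are hypothetical names with a made-up vocabulary, not the ELECTRA tokenizer's actual implementation or vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"token", "##ization", "##ize", "##s", "anti", "##fibro", "##tic"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("antifibrotic", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A rare medical term such as "antifibrotic" can therefore still be represented by pieces the model has seen in other words, which is the main reason a sub-word tokenizer suits this dataset.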


In [8]:
# Create tokenizer objects
tokenizer = ElectraTokenizer.from_pretrained('google/electra-small-discriminator')

# Create torch Dataset objects
train_data = HuggingfaceDataset(train, tokenizer=tokenizer, device=DEVICE)
valid_data = HuggingfaceDataset(valid, tokenizer=tokenizer, device=DEVICE)