RE datasets #162

Open
luana-be opened this issue Apr 8, 2021 · 17 comments

@luana-be

luana-be commented Apr 8, 2021

Hello,

I'm using the GAD and EUADR datasets for relation extraction and I'm noticing contradictory annotations in both sets.

Here is an example extracted from the EUADR test set:

15 The @GENE$ SNP could be considered as a genetic marker to predict the clinical course of patients suffering from oropharyngeal and @DISEASE$. 0
16 The @GENE$ SNP could be considered as a genetic marker to predict the clinical course of patients suffering from @DISEASE$ and hypopharyngeal cancer. 1

How are these sentences actually annotated?

Thank you for your help,
Luana

@wangweifeng2018

Same issue. Have you got an answer regarding this?

@luana-be
Author

Same issue. Have you got an answer regarding this?

Not yet! I decided to do some cleaning up manually/using grep

@wonjininfo
Member

wonjininfo commented Apr 22, 2021

Hi all,
Thank you for your interest in our paper.

It seems like these examples can be categorized as annotation errors (and they also stem from the inherent nature of the dataset; please see the second section of this reply and the replies below).
The original paper https://doi.org/10.1016/j.jbi.2012.04.004 states that the dataset was annotated by experts.

The given examples are from PMID 18347176: https://pubmed.ncbi.nlm.nih.gov/18347176/

The T393C SNP could be considered as a genetic marker to predict the clinical course of patients suffering from oropharyngeal and hypopharyngeal cancer.

is from the end of the abstract.

Taking a look at the original EUADR corpus from https://biosemantics.erasmusmc.nl/index.php/resources/euadr-corpus , we can find the annotation file "18347176.txt" in it.

...
Target-Disorder	True	concept	T393C	1649	1654	annotator1,annotator2,annotator3	['sda/5', 'sda/2', 'sda/1']	27	SNP & Sequence variations
...
Target-Disorder	True	concept	oropharyngeal	1757	1770	annotator1,annotator2,annotator3	['sda/1', 'sda/10', 'sda/15']	28	Diseases & Disorders
Target-Disorder	True	concept	hypopharyngeal cancer	1775	1796	annotator1,Computer,annotator3	['umls/C0006581', 'sda/17']	29	Diseases & Disorders
...
Target-Disorder	False	relation	27	29	['sda/5', 'sda/2', 'sda/1']	['umls/C0006581', 'sda/17']	1649:1654	1775:1796	annotator1,annotator3	PA
Target-Disorder	True	relation	27	28	['sda/5', 'sda/2', 'sda/1']	['sda/1', 'sda/10', 'sda/15']	1649:1654	1757:1770	annotator1,annotator2,annotator3	PA

From the above lines, you can see that the original annotation file says T393C (id: 27, from the 9th column) and oropharyngeal (28) have a relation (True), while T393C (27) and hypopharyngeal cancer (29) do not (which seems to be an error to me as well).


(Added after the discussion; thanks to James Morrill, luana-be, and Amir Kadivar for the constructive discussion!)

The GAD and EUADR datasets can be classified as weakly labeled (distant supervision) datasets, which are notably noisy. As mentioned in this reply, since we now have multiple high-quality BioRE datasets, I personally suggest refraining from using weakly labeled datasets and moving to others such as ChemProt, DrugProt, or other human-labeled datasets for evaluating BioLMs.


As a BioNLP researcher, I think RE datasets are difficult to build and sometimes contain erroneous examples, as building them requires extensive manual work by healthcare professionals.
Although there are a few wrong samples, I think providing a dataset is a huge contribution to the BioNLP community, and I would like to express my sincere gratitude to the annotators! 😊

Thank you!
Best regards,
Wonjin


Here is a Python script I used to check the dataset (assuming 18347176.txt is in the working directory):

inptext = open("18347176.txt").read()
inpTok = [ele.split("\t") for ele in inptext.splitlines()]
print(len(inpTok))  # should be about 52
entities = {int(ele[8]): "entity : '%s', " % ele[3] + ele[7] for ele in inpTok if ele[2] == "concept"}
print(len(entities))  # should be 36
print(entities[28])  # > "entity : 'oropharyngeal', ['sda/1', 'sda/10', 'sda/15']"
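
As a small extension of the script above (a sketch under the same assumptions about the file format; the column indices follow the relation lines quoted earlier), you can also resolve the relation lines against the concept ids:

# Relation lines: column 1 is the label, columns 3 and 4 are the two concept ids
relations = [(ele[1], int(ele[3]), int(ele[4])) for ele in inpTok if ele[2] == "relation"]
for label, id1, id2 in relations:
    print(label, "|", entities.get(id1, "?"), "<->", entities.get(id2, "?"))
# For the pair (27, 28) this shows the True relation between T393C and oropharyngeal,
# and for (27, 29) the False one with hypopharyngeal cancer.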

@jambo6

jambo6 commented Jun 15, 2021

The issue does not seem to be just a few poorly labelled samples; the labelling seems to be extremely poor overall. Could something else have gone wrong somewhere?

For example, it fails on even the most obvious changes to the input sentence:

# True sentence
sentence_true = 'This result suggests that susceptibility to @DISEASE$ may be associated with the RsaI and @GENE$ polymorphism of the P450IIE1 gene.'

# Negated sentence
sentence_negated = "This result suggests that susceptibility to @DISEASE$ is not associated with the RsaI and @GENE$ polymorphism of the P450IIE1 gene."

# Run the fine tuned model
nlp(sentence_true)    # Returns {'label': 'LABEL_0', 'score': 0.9942463040351868}
nlp(sentence_negated)   # Returns {'label': 'LABEL_0', 'score': 0.9936720728874207}
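
For reference, nlp here is presumably a Transformers text-classification pipeline over the fine-tuned model; a minimal sketch of such a setup (the model path is a placeholder, not an actual checkpoint name):

from transformers import pipeline

# Placeholder path: point this at your fine-tuned GAD relation-extraction checkpoint
nlp = pipeline("text-classification", model="path/to/biobert-gad-finetuned")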

In every case I've tried, the model appears to predict almost zero difference between a positive association and its negated version. This suggests to me that it's not an artifact of just some poor labels.

I think it's also worth noting that the mislabelled examples, which occur continually, are (by and large) not challenging to annotate correctly. I don't quite buy that it's because RE is difficult to do.

@jambo6

jambo6 commented Jun 15, 2021

@luana-be if you did end up doing some manual processing, is there any chance you would share said labels?

@luana-be
Author

@jambo6 my manual processing was not enough to get good results. I totally gave up on these RE datasets! Sorry :-(

@jambo6

jambo6 commented Jun 15, 2021

Ah no worries. Thanks for the reply.

Are you familiar with BioNLP by any chance? I know there exist good gene-gene relationship databases, and I was wondering: is there anything that would likely prevent a model trained on gene-gene relationships from performing well on gene-disease relations? It feels to me the language is relatively similar. Obviously it's not ideal and I'm sure you'd miss certain things, but I wouldn't have thought it would be too bad...

@luana-be
Author

That's a good idea @jambo6 ! Thanks for the insight :-)

@amirkdv

amirkdv commented Jun 16, 2021

@wonjininfo thank you for looking into this.

I also ran into the exact same issue described here and documented my findings in #153. I agree with @jambo6 that this is a serious issue, beyond the usual "labeled data is hard". I'd even add that the scope goes beyond BioBERT and this repository per se. At this point, this GAD RE dataset has become a de facto benchmark used by a lot of other folks, and BioBERT is now part of the genealogy of the dataset (e.g. see BLURB).

I encourage you to look at my findings in #153; you can find a summary of it below.


I set up a tiny experiment: I picked 20 random examples from the official BioBERT GAD RE dataset and verified their prescribed labels. I found that the true/false labels were basically no better than a coin toss. I then tried to trace the genealogy of the GAD RE dataset, which roughly goes: Becker et al. (2004), Bravo et al. (2015), and Lee, Yoon, et al. (2019), i.e. BioBERT.

My conclusion so far is that the main problem with the dataset we have now is the very definition of the true/false labels. Becker et al. (2004)'s dataset was a good old, manually curated RE dataset with 5,000 data points, each being a tuple like (pubmed, gene, disease, label); notably, each sample refers to a PubMed id, i.e. a whole article, not a specific sentence. Then Bravo et al. (2015) do some non-trivial, and IMO questionable, gymnastics and turn the original GAD into a bigger (poorly) labeled sentence-level RE dataset. This is where the labels get weird and, I think, useless. It also seems like BioBERT's GAD RE dataset simply inherits these labels from Bravo et al. (2015); @wonjininfo, can you confirm this? The data of Bravo et al. (2015) is no longer available (defunct URL), but from reading the paper/supplements it looks like a (sentence, gene, disease) triple would be labeled false not only if a human annotated that sentence as such (the desired scenario), but also if the article from which it was taken was not part of the original GAD, for example if it was published after 2004!

It does sound crazy, but it's my best theory of what's gone wrong and how badly.
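
In case anyone wants to repeat this kind of spot check, a minimal sketch (the file path and column layout are assumptions; adjust them to however your copy of the GAD split is laid out):

import csv, random

# Hypothetical path to a GAD train split; last column assumed to hold the 0/1 label
with open("GAD/train.tsv") as f:
    rows = list(csv.reader(f, delimiter="\t"))

for row in random.sample(rows, 20):
    # Print the prescribed label next to the sentence and judge it manually
    print(row[-1], row[0])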

@jambo6

jambo6 commented Jun 16, 2021

Yeah, you are right on this. I just read Gu et al., "Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing", and they state:

The Genetic Association Database corpus was created semi-automatically using the Genetic Association Archive. Specifically, the archive contains a list of gene-disease associations, with the corresponding sentences in the PubMed abstracts reporting the association studies. Bravo et al. used a biomedical NER tool to identify gene and disease mentions, and create the positive examples from the annotated sentences in the archive, and negative examples from gene-disease co-occurrences that were not annotated in the archive.
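
To make the consequence of that construction concrete, here is an illustrative sketch of the labelling heuristic as described in the quoted passage (this is not Bravo et al.'s actual code; the names and data structures are assumptions):

# Positive = gene-disease pair annotated in the Genetic Association Archive for that article;
# negative = any other gene-disease co-occurrence found by NER, regardless of what the sentence says.
def label_pair(pmid, gene, disease, archive_pairs):
    # archive_pairs: set of (pmid, gene, disease) associations listed in the archive
    if (pmid, gene, disease) in archive_pairs:
        return 1  # positive example
    return 0      # negative example, even if the sentence itself asserts an association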

Interestingly, this odd labelling method did not appear to be a cause for concern.

@DunnoHtL

DunnoHtL commented Aug 8, 2021

Thanks for your guys' exploration! I came across the same issue and was glad that I'm not the only one. By the way, does anyone know if there exists any high-quality dataset for RE tasks? Thanks ahead!

@jambo6

jambo6 commented Aug 8, 2021

I used this one recently and it worked quite well: https://github.com/sujunhao/RENET2

I stuck a fine-tuned model on huggingface if you are interested in trying it out: https://huggingface.co/jambo/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-finetuned-renet
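
If you want to try it quickly, something like the following should work (a sketch only; I'm assuming it loads as a standard sequence-classification model, so check the model card for the exact labels):

from transformers import pipeline

# Model id taken from the Hugging Face link above; the example sentence is made up
renet = pipeline("text-classification",
                 model="jambo/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext-finetuned-renet")
print(renet("Mutations in BRCA1 are associated with an increased risk of breast cancer."))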

@DunnoHtL

DunnoHtL commented Aug 9, 2021

Appreciate your links so much, @jambo6! But the dataset link in that repo (http://www.bio8.cs.hku.hk/RENET2/renet2_data_models.tar.gz) doesn't seem to work :(

I also like your idea of utilizing gene-gene relationship databases. In fact, I did a little investigation and found that DisGeNET may serve as a good resource. Taking Alzheimer's Disease as an example, this page https://www.disgenet.org/browser/0/1/1/C0002395/ lists all the association types and the evidence (free text) related to them. I was just wondering whether we could build a corpus based on that and set up a multi-class classification NLP task.

@jambo6

jambo6 commented Aug 11, 2021

It works fine for me, have you tried it in another browser?

DisGeNET is good; however, their association types and evidence are based only on NLP models, not expert curation. As such, anything trained on their labels is likely to be only as good as models trained on the original datasets they used. In fact, I think the RE dataset they used is precisely this problematic GAD one (see their paper, where they explicitly mention the GAD dataset).

@DunnoHtL

DunnoHtL commented Aug 13, 2021

@jambo6 Thanks, I tried IE and it works!

That's insane... I can't believe this problematic dataset has been used by such a highly cited work...

A few days ago, a friend of mine recommended another source of biomedical NLP datasets: https://www.i2b2.org/NLP/DataSets/Main.php. You can also have a look.

@wonjininfo
Member

wonjininfo commented Oct 27, 2021

Hello all,
My apologies for the delay in responding. I've had some busy days handling personal events, including my conscription into the Korean Army.
After conscription, I participated in the relation extraction track of the BioCreative VII challenge, which changed my viewpoint on the quality issues with this RE dataset.

I have carefully read this issue thread and combined it with my experience from the aforementioned RE challenge; I think I overlooked the severity of the situation in my previous comments.

As @amirkdv suggested, I agree that the GAD dataset clearly seems to be weakly labeled (distant supervision).
To the best of my knowledge, we followed Bravo et al. (2015) (I think we got the data from their paper). For more details, I need to check with my co-author who prepared the dataset for the RE task experiments. The dataset I received was already in the format of our currently released dataset, not in the format of the original database-style dataset.

In my defence, what I remember from the time we selected the RE dataset is that we did not have an abundant choice of BioRE datasets when we wrote the paper. We selected the dataset by popularity, i.e. the number of citing papers, since we thought that being highly cited would reflect its "reputation" and quality. (This approach was too naive, and I feel responsible for the studies that followed.) Back then, high-quality BioRE datasets were rare (at least to the best of our knowledge), and GAD seems to have been one of the most widely used RE datasets.
I have to admit that we (the authors) were focusing more on the pretraining of the model than on verifying the quality of each public dataset we used.

To conclude, I agree that the GAD and EUADR datasets are weakly supervised (distant supervision) datasets. And since we now have multiple high-quality BioRE datasets, I personally suggest refraining from using weakly labeled datasets and moving to others such as ChemProt, DrugProt, or other human-labeled datasets for evaluating BioLMs.

Thank you all very much for your constructive comments; I deeply appreciate them!
ps) I will edit my first comment on this issue soon.

@wonjininfo
Member

ps2) @amirkdv, I think your experiment on 20 random examples is very interesting.
According to Riedel et al. (2010), precision for distant supervision was 70-87% (i.e. errors of up to about 30%). Judging from your experiment, it seems like distant supervision works much worse in the biomedical domain.

Riedel, S., Yao, L., & McCallum, A. (2010, September). Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 148-163). Springer, Berlin, Heidelberg.
