

Integrate BigBio NER data sets into HunFlair #3146

Merged
merged 14 commits into from Apr 13, 2023
Conversation

mariosaenger
Collaborator

This PR implements an adapter to integrate biomedical named entity recognition data sets provided by the BigScience biomedical initiative (also known as BigBio):

https://github.com/bigscience-workshop/biomedical

BigBio is an open library of biomedical dataloaders built on Hugging Face's datasets library. It provides programmatic and harmonized access to more than 120 biomedical datasets. This PR implements an adapter for the library's named entity recognition data sets, enabling users to easily work with these corpora in HunFlair (e.g. for model training or evaluation).

@alanakbik
Collaborator

@mariosaenger thanks for adding this! Does this mean that the "old" HUNER dataset classes (like HUNER_CELL_LINE) can be removed?

@mariosaenger
Collaborator Author

Hi @alanakbik! No, the "old" data sets are still needed and used. This is because in HUNER, in addition to the more technical harmonisation (e.g. a common data format), we also standardise data sets on a content-related / semantic level (e.g. standardising different entity type labels).

@alanakbik
Collaborator

Ah I see - but I guess the non-standardized corpora like JNLPBA can be replaced with the BigBio version, or are they also still needed?

@mariosaenger
Collaborator Author

Good point. We have to check these corpora.

@mariosaenger
Collaborator Author

Hej @alanakbik! We discussed the deletion of data sets in our developer group. We would be rather reluctant to do this, as it would break existing implementations referencing these data sets. Furthermore, the BigBio datasets are (unfortunately) sometimes still of mixed quality, whereas our own data set implementations are more thoroughly vetted.

However, if you insist on deleting these data sets: could we implement this in a separate PR and mark the data sets as deprecated first, since deleting them would result in massive code changes?

@alanakbik
Collaborator

No worries, we can keep the old classes in this case.

@marctorsoc
Contributor

@alanakbik after merging this, or some other PR, could we release a new version?

a) The README says we're on 0.12.2, but the latest release I can see is 0.12.1 :)
b) I'm eager to get huggingface-hub unpinned (#3149) so I can unpin a bunch of deps in my repos

If that's not possible, when do you expect to release a new version? Is there a release calendar?

@alanakbik
Collaborator

It was just released on pip!

@marctorsoc
Contributor

amazing! thanks a ton

@mariosaenger
Collaborator Author

Hej @alanakbik! I added a deprecated tag to all data sets that are available in BigBio. Are there any other things that need to be changed before the implementation goes into main?

@alanakbik
Collaborator

Hello @mariosaenger, everything mostly looks good. However, the handling of different sentence splitters is suboptimal in the new classes: in the old HUNER classes, we appended the sentence splitter name to the generated files, which made it possible to switch sentence splitters.

Here is an illustration:

corpus = HUNER_GENE_CELL_FINDER()
print(corpus)

corpus = HUNER_GENE_CELL_FINDER(sentence_splitter=SegtokSentenceSplitter())
print(corpus)

This prints two different corpus sizes, as the first corpus is loaded using the default SciSpaCy sentence splitter, and the second with a different splitter.

However, when doing this with the new classes:

corpus = HUNER_GENE_TMVAR_V3()
print(corpus)

corpus = HUNER_GENE_TMVAR_V3(sentence_splitter=SegtokSentenceSplitter())
print(corpus)

the same corpus gets loaded twice: for the second call, the Segtok sentence splitter is not applied.

Can you fix this? The easiest solution would probably be to use the same approach as in the old classes.

@mariosaenger
Collaborator Author

@alanakbik I will have a look at it

@mariosaenger
Collaborator Author

@alanakbik Fixed this issue. Now the new data sets work as expected 😉

2023-04-12 16:28:29,421 Reading data from /home/mario/.flair/datasets/huner_gene_tmvar_v3/SciSpacySentenceSplitter_core_sci_sm_0.2.5_SciSpacyTokenizer_core_sci_sm_0.2.5
2023-04-12 16:28:29,422 Train: /home/mario/.flair/datasets/huner_gene_tmvar_v3/SciSpacySentenceSplitter_core_sci_sm_0.2.5_SciSpacyTokenizer_core_sci_sm_0.2.5/train.conll
2023-04-12 16:28:29,422 Dev: None
2023-04-12 16:28:29,422 Test: None
Corpus: 4454 train + 495 dev + 550 test sentences
2023-04-12 16:28:31,072 Reading data from /home/mario/.flair/datasets/huner_gene_tmvar_v3/SegtokSentenceSplitter
2023-04-12 16:28:31,072 Train: /home/mario/.flair/datasets/huner_gene_tmvar_v3/SegtokSentenceSplitter/train.conll
2023-04-12 16:28:31,072 Dev: None
2023-04-12 16:28:31,072 Test: None
Corpus: 4364 train + 485 dev + 539 test sentences
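The fix visible in the log above keys the dataset cache directory on the sentence splitter (and, where applicable, its model version), so each splitter configuration gets its own preprocessed files. A minimal, self-contained sketch of that naming scheme (the helper name and signature are hypothetical, not the actual flair implementation):

```python
from pathlib import Path
from typing import Optional

def dataset_cache_dir(base: Path, splitter_name: str,
                      model_suffix: Optional[str] = None) -> Path:
    """Build a per-splitter cache directory under the dataset's base folder.

    Hypothetical helper mirroring the directory names in the log above;
    the real flair code may derive the name differently.
    """
    # Append the splitter's model version when one exists, so e.g.
    # SciSpacySentenceSplitter_core_sci_sm_0.2.5 and SegtokSentenceSplitter
    # resolve to distinct cache folders.
    name = splitter_name if model_suffix is None else f"{splitter_name}_{model_suffix}"
    return base / name

base = Path("/home/mario/.flair/datasets/huner_gene_tmvar_v3")
print(dataset_cache_dir(base, "SegtokSentenceSplitter"))
```

Because the preprocessed CoNLL files land in splitter-specific folders, switching splitters triggers a fresh preprocessing run instead of silently reusing the cached corpus.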

@alanakbik
Collaborator

@mariosaenger thanks for adding this!

@alanakbik alanakbik merged commit dca69ab into master Apr 13, 2023
1 check passed
@alanakbik alanakbik deleted the bigbio-integration branch April 13, 2023 10:02