Create the license-compliant version of the Pile: PubMed Central #74

Closed
albertvillanova opened this issue Nov 17, 2021 · 3 comments · Fixed by huggingface/datasets#3287
Labels: data catalog (Gathering data from data sources), language modeling script (Need Language Modeling loading script), wontfix (This will not be worked on)

Comments

@albertvillanova
Member

albertvillanova commented Nov 17, 2021

Subset of The Pile.

PubMed Central: this was downloaded in a license-compliant fashion.

@albertvillanova albertvillanova changed the title PubMed Central: this was downloaded in a license-compliant fashion The Pile: PubMed Central Nov 17, 2021
@albertvillanova albertvillanova changed the title The Pile: PubMed Central PubMed Central Nov 17, 2021
@albertvillanova albertvillanova self-assigned this Nov 17, 2021
@albertvillanova albertvillanova added the data catalog Gathering data from data sources label Nov 22, 2021
@albertvillanova albertvillanova changed the title PubMed Central Create the license-compliant version of the Pile: PubMed Central Nov 22, 2021
@albertvillanova albertvillanova added the language modeling script Need Language Modeling loading script label Jan 20, 2022
@lvwerra

lvwerra commented Jan 25, 2022

#self-assign

@lvwerra

lvwerra commented Jan 25, 2022

Done:
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_pmc

Sample:

{
'text': "==== Front\nPLoS BiolPLoS BiolpbioplosbiolPLoS Biology1544-91731545-7885Public Library of Science San Francisco, USA 10.1371/journal.pbio.0000005Research ArticleGenetics/Genomics/Gene TherapyInfectious DiseasesMicrobiologyPlasmodiumThe Transcriptome of the Intraerythrocytic Developmental Cycle of Plasmodium falciparum\n P. falciparum IDC TranscriptomeBozdech Zbynek ...",
'meta': "{'pmid': 12929205}"
}
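
For anyone reading these records: in the sample, `meta` looks like a string holding a Python-literal dict (single quotes), not a nested dict, so `json.loads` would reject it. Below is a minimal sketch of one way to recover the PMID, assuming every record follows the shape of the sample. The `record` literal and the `parse_meta` helper are hypothetical and only for illustration; they are not part of the dataset or its loading script.

```python
import ast

# Hypothetical record shaped like the sample above; the text is abbreviated here.
record = {
    "text": "==== Front\nPLoS Biol ...",
    "meta": "{'pmid': 12929205}",
}


def parse_meta(meta):
    """Turn the stringified metadata into a dict.

    ast.literal_eval only accepts Python literals, so it does not
    execute arbitrary code the way eval() would.
    """
    parsed = ast.literal_eval(meta)
    if not isinstance(parsed, dict):
        raise ValueError("meta did not evaluate to a dict")
    return parsed


meta = parse_meta(record["meta"])
pmid = meta["pmid"]
print(pmid)
```

If other records store `meta` differently (for example as valid JSON), the parsing step would need to change accordingly.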

@albertvillanova
Member Author

Thanks @lvwerra.

@albertvillanova albertvillanova added the wontfix This will not be worked on label Jan 25, 2022

2 participants