Create license-compliant version of the Pile: Stack Exchange #376

Closed
Tracked by #65
albertvillanova opened this issue Jan 28, 2022 · 1 comment
Assignees: albertvillanova
Labels: data catalog (Gathering data from data sources), language modeling script (Need Language Modeling loading script)

Comments

@albertvillanova
Member

No description provided.

@albertvillanova added the "data catalog" (Gathering data from data sources) and "language modeling script" (Need Language Modeling loading script) labels on Jan 28, 2022
@albertvillanova self-assigned this on Jan 28, 2022
@albertvillanova
Member Author

DONE: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_the_pile_stack_exchange

Sample:

{'text': "Q:\n\nWhat is the h-index exactly and how does it work?\n\nWhat is the h-index, and how does it work ?\n\nA:\n\nThe h-index is a measure of the impact of someone's publication list. An h-index of 10 for example means that the person has published 10 papers with at least 10 citations. The total number of papers published may be higher, but only 10 will have 10 or more citations.\nCritics argue that this measure disadvantages young researchers who did not have time to publish a lot and whose work has not been published for long and thus may not have attracted many citations. Other criticisms include that it makes a researcher focus on how to increase the citation count for a paper that may be not that good but would increase the h-index.\nFor more explanation, see for example the Wikipedia article.",
 'meta': "{'file': 'academia.stackexchange_0000000005.txt'}"}
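
For anyone consuming records shaped like the sample above, here is a minimal sketch of how one might split a record into its question and answer parts and decode the `meta` field, which is stored as the string form of a Python dict rather than as a dict. The helper name `parse_record` is hypothetical, the `text` is abbreviated, and the sketch assumes every record follows the same `Q:` / `A:` layout as this one sample, which the issue does not confirm.

```python
import ast

# Abbreviated copy of the sample record (answer text shortened for brevity).
record = {
    "text": (
        "Q:\n\nWhat is the h-index exactly and how does it work?\n\n"
        "What is the h-index, and how does it work ?\n\n"
        "A:\n\nThe h-index is a measure of the impact of someone's publication list."
    ),
    "meta": "{'file': 'academia.stackexchange_0000000005.txt'}",
}

Q_MARKER = "Q:\n\n"
A_MARKER = "\n\nA:\n\n"


def parse_record(rec):
    """Hypothetical helper: split one record into question, answer, and metadata."""
    # 'meta' is a string holding a dict literal; literal_eval decodes it
    # without executing arbitrary code.
    meta = ast.literal_eval(rec["meta"])

    # Assumption: the first answer marker separates the question from the answer(s).
    # If a record holds several answers, the later ones stay inside 'answer'.
    question, sep, answer = rec["text"].partition(A_MARKER)
    if not sep:
        raise ValueError("record does not contain an answer marker")

    # Strip the leading question marker if present.
    if question.startswith(Q_MARKER):
        question = question[len(Q_MARKER):]

    return {"question": question, "answer": answer, "meta": meta}


parsed = parse_record(record)
print(parsed["meta"]["file"])
print(parsed["question"].splitlines()[0])
```

`ast.literal_eval` is used instead of `json.loads` because the `meta` string uses single quotes, which is not valid JSON.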
