diff --git a/en/pages/uploads/images/data_governance_map.png b/en/pages/uploads/images/data_governance_map.png new file mode 100644 index 000000000..55c0345f1 Binary files /dev/null and b/en/pages/uploads/images/data_governance_map.png differ diff --git a/en/pages/uploads/presentations/Data Governance at BigScience Episode #1 - Poster.pdf b/en/pages/uploads/presentations/Data Governance at BigScience Episode #1 - Poster.pdf new file mode 100644 index 000000000..51279ab9b Binary files /dev/null and b/en/pages/uploads/presentations/Data Governance at BigScience Episode #1 - Poster.pdf differ diff --git a/en/pages/working-groups/data-governance/index.md b/en/pages/working-groups/data-governance/index.md new file mode 100644 index 000000000..5d798ea11 --- /dev/null +++ b/en/pages/working-groups/data-governance/index.md @@ -0,0 +1,103 @@ +# Big Science 🌸 Data Governance + +## 🐠 Overview + +The [Hugging Face](https://huggingface.co/) "[Big Science](pages/index.md)" +project is a year-long workshop that brings together nearly 600 researchers from +50 countries to work together on better understanding Large Language Models ( +LLMs) +-- a family of deep learning systems that are trained on considerable amounts of +data to learn statistical properties of language. LLMS have been increasingly +adopted in contemporary language technologies, from Internet search to automatic +translation. + +Our goal is to improve the scientific understanding of the capabilities and +limitations of large-scale neural network models in NLP. In doing so, we seek to +create both a multilingual language corpus and an open-source large language +model, open to the scientific community. The +May [MIT Tech Review article by Karen Hao](https://www.technologyreview.com/2021/05/20/1025135/ai-large-language-models-bigscience-project/) +provides an excellent overview of the project in its broader context, and +further details can be found in our recent article in +VentureBeat: [NLP needs to be open. 500+ researchers are trying to make it happen](https://venturebeat.com/2021/07/14/nlp-needs-to-be-open-500-researchers-are-trying-to-make-it-happen/) +. + +A core part of this project focuses on creating ethical protocols for Data +Governance, including the collection and management of a training dataset. There +are a wealth of language data sources to draw from to meet the growing needs of +these technologies, but bringing those together to train large models while +giving due consideration to ethical concerns such as **anonymization** (no PII) +, **consent** (if the dataset does include PII), and **contestation** ( +individuals can request their information be removed from the dataset), remains +an open problem. + +To this end, we are exploring the feasibility of working with a small network of +organizations who are themselves interested in working on aspects of ethical +data governance and can help develop tools and protocols for data collection, +hosting, and management. Together, these organizations serve as ā€œData +Custodiansā€ for the Big Science project. + +### Data Custodians: Overview + +We are starting to implement a global collaborative data governance +structure ([as described in this poster](https://drive.google.com/file/d/10pgOzQllHe-EXe5G6t-0pKlkaGJdZSqC/view)) +, where ā€œData Custodiansā€ help to collect and serve the data. Data Custodians +can be further broken down into Data Providers -- institutions collecting and +providing the data -- and Data Hosts -- institutions serving the data. Data +Providers and Data Hosts may be one and the same entity, or may collaborate +together, for example, on data curation + +#### Data Providers + +Data Providers will work with Data Hosts and other members of Big Science, +including pro-bono legal scholars, on the data collection process. A Data +Provider’s main role is to provide access to data they own or have access to. +This data may come in the form of e.g. web content, digitized books or +scientific articles, audio files such as radio shows or podcasts that can be +transcribed, images of text to be OCR’ed, or any other media containing language +data. + +Collaborations between Data Providers and the rest of the Big Science project +will focus on mechanisms for extracting specific kinds of data at a large scale, +and brainstorming on how to share the processed data beyond the Big Science +language model training run. The collaboration is also intended to catalyze the +sharing of all of the processing tools, which may also be useful for Data +Providers’ indexing (e.g., new audio transcripts), or training models of their +own. + +We are also working on a long form paper on approaches to data governance in the +context of the emergence of data-driven technology, and are hoping to focus on +Data Providers’ existing models of global data governance. + +#### Data Hosts + +In collaboration with Big Science Workshop participants, the role of Data Hosts +is to serve their data for at least 1 year. Each data host will serve their +different data, which together will form the input training data for a large +language model. Data Hosts are selected to help serve a diversity of languages. + +Additionally, Data Hosts are welcome to work with us on refining the data +governance structure, including: + +1. Collecting, obtaining, and/or curating appropriate language data from a + variety of sources and providers +2. Developing tools to manage, index, and visualize these data sources +3. Developing protocols to make the data available to users and other host + organizations subject to the rights and requirements of the data creators, + subjects, and providers + +--- + +In addition to being part of a new approach to handling data in Machine +Learning, a key benefit of participation is access to the shared +infrastructure for data +governance. + +The questions we still need to iron out are: + +4. How can we manage access to the input datasets to most efficiently achieve + the above goals? +5. How might contestation protocols work, for asking that one’s data be removed? + + +For more information, please email: meg@huggingface.co, yacine@huggingface.co +