How to contribute Datasets to ULCA
It's fairly easy to contribute a dataset to the ULCA ecosystem. The submitter just has to upload a zip folder containing two text files and optional reference files such as audio or images. The text files can be in JSON or CSV format. They should be named as follows:
- params.json or params.csv
- data.json or data.csv
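As a rough illustration of the packaging step, the following Python sketch builds such a zip in memory. It assumes that both files sit at the root of the archive and that data.json holds a JSON array of records; neither detail is stated above, and the helper name `build_submission_zip` is hypothetical.

```python
import io
import json
import zipfile


def build_submission_zip(params: dict, records: list) -> bytes:
    """Return the bytes of a zip archive holding params.json and data.json."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # ensure_ascii=False keeps non-Latin text readable inside the files.
        archive.writestr("params.json", json.dumps(params, ensure_ascii=False, indent=2))
        # Assumption: data.json is a JSON array with one object per record.
        archive.writestr("data.json", json.dumps(records, ensure_ascii=False, indent=2))
    return buffer.getvalue()


payload = build_submission_zip(
    {"datasetType": "parallel-corpus"},
    [{"sourceText": "Hello", "targetText": "নমস্কার"}],
)
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    print(sorted(archive.namelist()))  # ['data.json', 'params.json']
```

Optional reference files such as audio or images could be added with further `archive.write(...)` calls before the archive is closed.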
ULCA currently supports the following types of datasets:
- Parallel dataset
- Monolingual dataset
- ASR / TTS dataset
- OCR dataset
- Document Layout dataset
- Transliteration dataset
- Glossary dataset
ULCA relies on submitters to describe their datasets so that the datasets can benefit the wider community. Following the suggestions below will help.
The params file should contain the attributes discussed below.
A dataset must have the following mandatory attributes; each is covered individually. Please note that the mandatory attributes and the values assigned to them are strictly enforced.
- datasetType
- languages
- collectionSource
- domain
- license
- submitter
The following attributes are optional:
- version
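Before uploading, a submitter may want to confirm that none of the mandatory attributes is missing. The sketch below is a minimal local check, not ULCA's own validation; the function name `missing_mandatory_attributes` is hypothetical, and the attribute names are taken from the list above.

```python
MANDATORY_ATTRIBUTES = (
    "datasetType",
    "languages",
    "collectionSource",
    "domain",
    "license",
    "submitter",
)


def missing_mandatory_attributes(params: dict) -> list:
    """Return the mandatory attribute names that are absent from params."""
    return [name for name in MANDATORY_ATTRIBUTES if name not in params]


params = {
    "datasetType": "parallel-corpus",
    "languages": {"sourceLanguage": "en", "targetLanguage": "bn"},
    "domain": "legal",
    "license": "cc-by-4.0",
}
print(missing_mandatory_attributes(params))  # ['collectionSource', 'submitter']
```

This only checks presence; whether each value is acceptable is decided by the schemas ULCA enforces.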
The datasetType attribute defines the type of dataset (parallel, monolingual, ASR, etc.). The allowed values are listed in DatasetType.
Sample usage :
"datasetType": "parallel-corpus"
It is important to convey which language the dataset targets. The structure of the languages attribute should be followed. The same parameter can define either a single language or a language pair. In the following example, languages describes a parallel dataset, which typically has a language pair: sourceLanguage is English and targetLanguage is Bengali. The language codes follow ISO 639-1 and 639-2 and are listed in LanguagePair.
{
"sourceLanguage": "en",
"targetLanguage": "bn"
}
Monolingual, ASR/TTS and OCR datasets typically use a single language, and the languages attribute can be defined as in the following example.
"sourceLanguage": "en"
The domain attribute defines the business area or domain for which the dataset was curated. ULCA accepts only values that are defined under the Domain schema.
Sample usage :
- dataset specifically for the legal domain
"domain": "legal"
- dataset meant for the news domain
"domain": "news"
The license attribute is straightforward: the dataset submitter should choose one of the available License values.
Sample usage:
"license": "cc-by-4.0"
The collectionSource attribute is mostly free text and optional; however, we recommend making it descriptive so that community users can see the sources from which the dataset was curated. A URL along with a short description usually suffices.
Sample usage :
"collectionSource": [
"https://main.sci.gov.in",
"42040.pdf",
"SCI judgment pdfs"
]
The submitter attribute describes the user who submitted the dataset as well as the team members who are part of the project. We suggest acknowledging all team members, however small their contribution. Typically it should also describe the project's or team's goal.
Sample usage :
{
"submitter": {
"name": "Project Anuvaad",
"aboutMe": "Open source project run by ekStep foundation, part of Sunbird project"
},
"team": [
{
"name": "Ajitesh Sharma",
"aboutMe": "NLP team lead at Project Anuvaad"
},
{
"name": "Vishal Mauli",
"aboutMe": "Backend team lead at Project Anuvaad"
},
{
"name": "Aravinth Bheemraj",
"aboutMe": "Data engineering team lead at Project Anuvaad"
},
{
"name": "Rimpa Mondal",
"aboutMe": "Freelancer Bengali translator at Project Anuvaad"
}
]
}
This section explains the params specific to each supported dataset type. We will go through each dataset type individually and in detail.
Parallel dataset params have a few specific attributes, defined below.
- collectionMethod
collectionMethod is an optional field in params for the parallel dataset. It combines collectionDescription and collectionDetails. collectionDescription is mandatory if collectionMethod is included; it defines the methods the user used to create the dataset.
Sample usage :
"collectionMethod": {
"collectionDescription": [
"machine-translated-post-edited"
],
"collectionDetails": {
"translationModel": "Google",
"translationModelVersion": "v2",
"editingTool": "Anuvaad",
"editingToolVersion": "v1.4",
"contributor": {
"name": "Aravinth Bheemaraj",
"aboutMe": "NLP Data team lead at Project Anuvaad"
}
}
}
The values for collectionDescription can be found here.
Based on the collection method defined, collectionDetails can be one of the 4 available schemas.
See detailed sample usage at data.json and params.json
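The rule that collectionDescription becomes mandatory once collectionMethod is present can be checked locally. The following is a minimal sketch; the function name `collection_method_problems` is hypothetical, and the expectation that collectionDescription is a list is an assumption based only on the sample above. It does not validate collectionDetails against any of the schemas.

```python
def collection_method_problems(params: dict) -> list:
    """Return human-readable problems found in params['collectionMethod']."""
    method = params.get("collectionMethod")
    if method is None:
        return []  # collectionMethod is optional, so its absence is fine
    if not isinstance(method, dict):
        return ["collectionMethod should be an object"]
    description = method.get("collectionDescription")
    if not description:
        return ["collectionDescription is required when collectionMethod is present"]
    # Assumption: the sample shows a list of strings, so anything else is flagged.
    if not isinstance(description, list):
        return ["collectionDescription should be a list of values"]
    return []


print(collection_method_problems({}))  # []
print(collection_method_problems({"collectionMethod": {"collectionDetails": {}}}))
```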
To do bitext mining at large scale, submitters often use strategies such as LaBSE or LASER to align sentences and generate a parallel corpus. Large-scale bitext mining of this kind has helped the community considerably. Use this property in params to indicate your bitext mining strategy, and also report the alignmentScore property in data for every record. A sample record is shown below:
{
"sourceText": "In the last 24 hours, 4,987 new confirmed cases have been added.",
"targetText": "उन्होंने बताया कि पिछले 24 घंटे में 4987 नए मामलों की पुष्टि हुई है।",
"collectionMethod": {
"collectionDetails": {
"alignmentScore": 0.79782
}
}
}
ULCA will reject records that do not satisfy the mentioned criterion. This scenario is illustrated in the example data.json and params.json.
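Since alignmentScore should be reported for every record, a quick local pass over the data can reveal records that lack it. This is a sketch only; the function name `records_missing_alignment_score` is hypothetical, the nesting follows the sample record above, and no score threshold is applied because none is stated here.

```python
def records_missing_alignment_score(records: list) -> list:
    """Return the indexes of records without a numeric alignmentScore."""
    missing = []
    for index, record in enumerate(records):
        # Path taken from the sample: collectionMethod -> collectionDetails -> alignmentScore
        details = record.get("collectionMethod", {}).get("collectionDetails", {})
        if not isinstance(details.get("alignmentScore"), (int, float)):
            missing.append(index)
    return missing


records = [
    {
        "sourceText": "a",
        "targetText": "b",
        "collectionMethod": {"collectionDetails": {"alignmentScore": 0.79782}},
    },
    {"sourceText": "c", "targetText": "d"},
]
print(records_missing_alignment_score(records))  # [1]
```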
The properties listed below are specific to OCR datasets.
- format
- dpi
- imageTextType
The format property describes the image file format present in the submitted dataset; choose one of the following image types. Also refer to the example provided.
- jpeg
- bmp
- png
- tiff
"format": "tiff"
The dpi property describes the standard image metadata about pixel density.
- 300_dpi
- 72_dpi
"dpi": "72_dpi"
The imageTextType property defines the kind of image on which the text appears. For example, a text region can be present in a scene or in a document. The defined possibilities are as follows.
- scene-text
- typewriter-typed-text
- computer-typed-text
- handwritten-text
Users can choose among these options as follows, based on the text annotation done on the image type.
"imageTextType": "computer-typed-text"
Please reach out to us via the Discussions forum if you wish to improve the documentation or contribute to ULCA.
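For the OCR properties described above, a submitter could check the chosen values against the listed options before uploading. This is a minimal sketch rather than ULCA's validation; the names `OCR_ALLOWED_VALUES` and `invalid_ocr_properties` are hypothetical, and only properties that are present in params are checked, since this document does not say whether they are mandatory.

```python
OCR_ALLOWED_VALUES = {
    "format": {"jpeg", "bmp", "png", "tiff"},
    "dpi": {"300_dpi", "72_dpi"},
    "imageTextType": {
        "scene-text",
        "typewriter-typed-text",
        "computer-typed-text",
        "handwritten-text",
    },
}


def invalid_ocr_properties(params: dict) -> dict:
    """Return the OCR properties in params whose value is not a listed option."""
    return {
        name: params[name]
        for name, allowed in OCR_ALLOWED_VALUES.items()
        if name in params and params[name] not in allowed
    }


print(invalid_ocr_properties({"format": "tiff", "dpi": "96_dpi"}))  # {'dpi': '96_dpi'}
```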