How to port Models to ULCA
This document describes the steps and processes involved in porting a model onto the ULCA platform. Translation, ASR, OCR, and TTS models can be ported to ULCA provided the guidelines in this document are met.
- Model must be publicly accessible
- The Model submitted to ULCA must be accessible from outside its hosting environment through REST calls. Because ULCA accesses the hosted Model through REST APIs, Model owners are advised to maintain a well-defined API throttling policy on their end as well.
- Whenever the Model is migrated to a different environment or if there is any change in the API endpoint, a new version of the same model must be submitted on ULCA.
- Once the API is publicly available, the Model must accept requests and respond as per the contract defined in the ULCA API contract.
- This contract will be followed by all the Models made available on ULCA, so as to maintain a well-defined way of communication and exception handling.
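The throttling policy recommended above is left to the Model owner. As a minimal sketch of one common approach, a token bucket could look like the following. The class name, the rate and capacity values, and the injectable clock are illustrative assumptions and are not part of ULCA.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.clock = clock               # injectable so the limiter can be tested
        self.tokens = float(capacity)    # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Return True if a request may proceed, consuming one token."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

An inference server could call `allow()` once per incoming request and answer with HTTP 429 whenever it returns False.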
Good practices:
- The API exposed to ULCA should be backed by an API throttling feature.
- The API must have well-defined logic for error and exception handling, since ULCA depends on the error responses from the API. Logging of API hits should be enabled.
- A model can be submitted for onboarding via the APIs that ULCA exposes. These APIs are backed by RBAC, so the submitter must be a registered user on the ULCA platform.
- Model submission can also be done via the ULCA portal; however, the user must be signed in to make any submission.
- A JSON file in the Model JSON format has to be created with all the necessary information about the model. This JSON file is the input to ULCA.
- Sample Model JSONs are available here - Model JSON examples. The steps to create such a file are briefly explained towards the end of this document.
- Once the Model is submitted, ULCA runs a series of validations on it to check that the inference endpoint is reachable and that the request and response are aligned with the contract. If all the checks pass, the Model is accepted and goes live on ULCA for all users to consume.
- The Model can be viewed in the ‘Model’ -> ‘My Contribution’ section of the App once you log in. It will also be visible on the ‘Explore Models’ screen of the App along with all other Models available on ULCA.
- On the ‘Explore Models’ page, click on your Model card and verify the details. ULCA also provides a ‘Try Me’ option on Models for users to try out the Model in real time; please verify that the Model works correctly through the ‘Try Model’ tab.
- If all of these checks pass, the Model is now available on ULCA and can now be benchmarked using different benchmarking techniques available on ULCA.
- Benchmarks can be run on a Model using the existing Benchmark Datasets on the ULCA platform. These datasets are carefully curated per dataset type to ensure that they can be used as Golden Data for any benchmarking process.
- ULCA provides a ‘Run Benchmark’ feature against every Model, where the user must choose the Benchmark Dataset and the Metric against which the Model is to be benchmarked. Once the benchmarking process completes successfully, the Model can be ‘Published’ on ULCA with the benchmark scores.
- Note that a Model can be benchmarked just once for a given Benchmark Dataset + Metric combination.
- Models published after benchmarking appear on the Leader Board, where each Model is placed according to its performance during benchmarking.
- If you are not satisfied with the benchmark scores, you can always unpublish the Model, submit a newer version with the required updates, and repeat the process.
- The old, unpublished Model will remain dormant in the system; the new one will be live on the Leader Board.
The ULCA system currently supports the following types of Models:
- translation
- transliteration
- tts
- asr
- document-layout
- ocr
- glossary
- ner
- txt-lang-detection
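A submitter may want to check the task type locally before submitting. The sketch below hard-codes the task types listed above; the function name `is_supported_task` is hypothetical, and ULCA's own validation may differ.

```python
SUPPORTED_TASK_TYPES = frozenset({
    "translation", "transliteration", "tts", "asr", "document-layout",
    "ocr", "glossary", "ner", "txt-lang-detection",
})


def is_supported_task(model):
    """Return True if model["task"]["type"] is one of the listed task types."""
    task = model.get("task")
    if not isinstance(task, dict):
        return False
    task_type = task.get("type")
    # Guard against non-string values before the set lookup.
    return isinstance(task_type, str) and task_type in SUPPORTED_TASK_TYPES
```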
ULCA relies on the submitter to describe their model so that it can benefit the wider community; following the suggestions below helps achieve that.
The submitted JSON file should contain the attributes discussed here.
A Model should have the following mandatory attributes, each of which is covered individually. Please note that the mandatory attributes and the values assigned to them are strictly enforced.
- name
- version
- description
- task
- languages
- license
- domain
- submitter
- inferenceEndPoint
- trainingDataset
The following attributes are optional:
- refUrl
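Because the mandatory attributes are strictly enforced, it can help to check a model JSON file for missing keys before submitting it. The following is a minimal sketch that only tests for the presence of the top-level attributes listed above; the function name is hypothetical, and it does not validate the values themselves.

```python
import json

MANDATORY_ATTRIBUTES = (
    "name", "version", "description", "task", "languages",
    "license", "domain", "submitter", "inferenceEndPoint", "trainingDataset",
)


def missing_mandatory_attributes(model_json_text):
    """Parse a model JSON document and list the mandatory attributes it lacks."""
    model = json.loads(model_json_text)
    if not isinstance(model, dict):
        raise ValueError("model JSON must be an object at the top level")
    return [key for key in MANDATORY_ATTRIBUTES if key not in model]
```

An empty list means every mandatory attribute is present; the optional refUrl is deliberately not checked.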
The name attribute defines the name of the model as it will be displayed in ULCA.
Sample usage :
"name" : "Ai4Bharat English Named Entity Recognition"
The version attribute defines the version of the model as it will be displayed in ULCA.
Sample usage :
"version": "v1.0"
The description attribute defines the description of the model as it will be displayed in ULCA. It can contain a brief description of the model, its goal, or anything else worth highlighting about it.
Sample usage :
"description": "Ai4Bharat model to detect named entities from provided english text"
The task attribute defines the type of task the model is intended to perform (see taskType).
Sample usage :
"task": { "type" : "ner" }
It is important to convey which languages the model can handle, and the structure of the languages attribute should be followed. The same parameter can be used to define a single language or a language pair. The following example defines the languages for a translation model, which typically has a language pair; here sourceLanguage is English and targetLanguage is Bengali. The language codes follow ISO 639-1 and 639-2 and are listed in LanguagePair.
"languages" :
[{
"sourceLanguage": "en",
"targetLanguage": "bn"
}]
Models for monolingual, ASR/TTS, OCR, or NER tasks typically use a single language, and the following example can be used to define the languages attribute.
"languages" :
[{
"sourceLanguage": "en"
}]
The domain attribute defines the relevant business area or domain in which the model is specialised. ULCA accepts only values that are defined under the Domain schema.
Sample usage :
- model specifically for the legal domain
"domain": "legal"
- model meant for the news domain
"domain": "news"
The license attribute is straightforward: the submitter should choose one from the available License values.
Sample usage:
"license": "cc-by-4.0"
The submitter attribute holds the description of the user who submitted the model as well as the team members who are part of the project. We suggest acknowledging all team members, however small their contribution. Typically it should describe the project or the team's goal.
Sample usage :
"submitter":
{
"name": "Project Anuvaad",
"aboutMe": "Open source project run by ekStep foundation, part of Sunbird project"
},
"team": [
{
"name": "Ajitesh Sharma",
"aboutMe": "NLP team lead at Project Anuvaad"
},
{
"name": "Vishal Mauli",
"aboutMe": "Backend team lead at Project Anuvaad"
},
{
"name": "Aravinth Bheemraj",
"aboutMe": "Data engineering team lead at Project Anuvaad"
},
{
"name": "Rimpa Mondal",
"aboutMe": "Freelancer Bengali translator at Project Anuvaad"
}
]
The trainingDataset attribute holds a description of the dataset used to train and fine-tune the model. It can have two sub-attributes: datasetId (optional) and description (mandatory).
Sample usage :
"trainingDataset": {
"datasetId": "2398749282",
"description": "trained on the general datasets from ULCA"
}
The inferenceEndPoint attribute holds the model-specific core information and the endpoint details needed to send requests to the model and receive responses from it. It has multiple sub-attributes, which are discussed below.
The callbackUrl sub-attribute is a required field. It holds the URL to be called to obtain the model's response.
Sample usage :
"callbackUrl" : "https://developers.ulca.org/aai4b-ner-inference/v0/ner"
The inferenceApiKey sub-attribute is an optional field. It is to be filled in only if the specified callbackUrl requires API-key authentication. It contains two further sub-attributes, name and value: name is the header name (apiKey by default) and value is the corresponding API key used to access the endpoint.
Sample usage :
"inferenceApiKey":
{
"name" : "apiKey",
"value" : "xx019858-b354-4e24-8e92-a7a4b320cxx0"
}
The schema sub-attribute is a required field, and it differs with the model type. Since it varies widely across models, it is discussed separately.
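To illustrate how callbackUrl and inferenceApiKey fit together, the sketch below builds (without sending) an HTTP request from an inferenceEndPoint dictionary. It assumes the API key travels as a request header whose name comes from `name`, as described above; the function names, the POST method, the JSON content type, and the payload shape are assumptions, since the actual request body is governed by the model-specific schema.

```python
import json
import urllib.request


def build_headers(endpoint):
    """Derive HTTP headers from an inferenceEndPoint dictionary."""
    headers = {"Content-Type": "application/json"}
    api_key = endpoint.get("inferenceApiKey")
    if api_key:
        # `name` is the header name (apiKey by default); `value` is the key itself.
        headers[api_key.get("name", "apiKey")] = api_key["value"]
    return headers


def build_request(endpoint, payload):
    """Build, but do not send, a POST request to the endpoint's callbackUrl."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint["callbackUrl"],
        data=body,
        headers=build_headers(endpoint),
        method="POST",
    )
```

The resulting request could then be sent with `urllib.request.urlopen`, using a payload that matches the schema of the model in question.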
The final inferenceEndPoint will look like this :
Sample usage :
"inferenceEndPoint":
{
"callbackUrl": "https://ner-api.ai4bharat.org/inference_authenticated",
"inferenceApiKey":
{
"name" : "apiKey",
"value" : "INSERT-API-KEY-HERE"
},
"schema" : { DISCUSSED SEPARATELY }
}
Sample Model JSONs are available here - Model JSON examples
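As a rough end-to-end illustration, the attributes described in this document can be assembled in Python and serialised to a JSON file. This is a sketch under stated assumptions: the helper name `languages_entry` is hypothetical, the submitter values are placeholders, the combination of sample values into one model is purely illustrative, and `schema` is left empty because it is model-specific.

```python
import json


def languages_entry(source_language, target_language=None):
    """Build one entry of the languages attribute (single language or pair)."""
    entry = {"sourceLanguage": source_language}
    if target_language is not None:
        # Language pairs, as in translation, also carry a targetLanguage.
        entry["targetLanguage"] = target_language
    return entry


model = {
    "name": "Ai4Bharat English Named Entity Recognition",
    "version": "v1.0",
    "description": "Ai4Bharat model to detect named entities from provided english text",
    "task": {"type": "ner"},
    "languages": [languages_entry("en")],
    "license": "cc-by-4.0",
    "domain": "news",
    # Placeholder submitter details, not taken from a real submission.
    "submitter": {"name": "Example Team", "aboutMe": "Placeholder description"},
    "trainingDataset": {"description": "trained on the general datasets from ULCA"},
    "inferenceEndPoint": {
        "callbackUrl": "https://ner-api.ai4bharat.org/inference_authenticated",
        "inferenceApiKey": {"name": "apiKey", "value": "INSERT-API-KEY-HERE"},
        "schema": {},  # placeholder: the schema differs per model type
    },
}

# Serialise to text; this string is what would be saved as the model JSON file.
model_json = json.dumps(model, indent=2)
```

Building the dictionary in code and serialising it with `json.dumps` avoids hand-written syntax slips such as trailing or missing commas.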
Demo videos: https://www.youtube.com/watch?v=6MS--C3V6Fw&list=PLLMnmzzUOBiKvAX9Kc23YjqEDjHT5FYJg
ULCA Intro Page:
https://bhashini.gov.in/ulca
Explore Models Page:
https://bhashini.gov.in/ulca/model/explore-models
GitHub Page:
https://github.com/ULCA-IN/ulca
Please reach out to us via the Discussions forum if you wish to improve the documentation or contribute to ULCA.