Attiri: Dataset and an instruction-following large language model for Tamil based on LLaMA and Stanford Alpaca
Attiri, an extension of LLaMA and Stanford Alpaca, aims to build and share an instruction-following language model for Tamil. Recent breakthroughs in large language models (LLMs) such as LLaMA, LaMDA, and GPT-4 have raised the prospect of Artificial General Intelligence (AGI) and attracted widespread industry attention. However, the high cost of training and deployment has hindered transparent, open academic research in the field. In response, this project promotes open research in the Tamil natural language processing (NLP) community by releasing a Tamil LLaMA model and a Tamil Alpaca model as open-source resources. These models expand the Tamil vocabulary and improve basic semantic understanding through secondary pre-training on Tamil data. Additionally, the project fine-tunes the Tamil LLaMA model on Tamil instruction data, enhancing its ability to understand and follow instructions. Note that these resources are intended solely for academic research.
We also release minimum viable model weights to the Hugging Face model hub.
The repository contains:
- Dataset
- Code to generate the data
- Code to fine-tune the LLaMA 7B model
To use the program, you need Python 3.9+ (3.9 recommended) and the required packages, which you can install with pip as follows:
Create a new Conda environment with Python 3.9:

```bash
conda create --name attiri python=3.9
```

Activate the new environment:

```bash
conda activate attiri
```

Install the required packages:

```bash
pip install -r requirements.txt
```
| S.No | Dataset | Description | Count | Fields |
|---|---|---|---|---|
| 1 | Attiri-Alpaca | Tamil version of the Stanford Alpaca dataset | 52K | Instruction, Input, Output |
| 2 | Attiri-Nomic | Tamil version of the Nomic AI GPT4All dataset | 500K | Prompt, Response |
| 3 | IndicCorp | A single large text file containing one sentence per line. The publicly released version is randomly shuffled, untokenized, and deduplicated. | 31.5M | Sentences |
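For reference, each Attiri-Alpaca record follows the Stanford Alpaca schema with instruction, input, and output fields. The record below is a hypothetical illustration, not an actual row from the released dataset:

```jsonc
// Hypothetical example record (not taken from the released dataset)
{
  "instruction": "பின்வரும் வாக்கியத்தை ஆங்கிலத்தில் மொழிபெயர்க்கவும்.",
  "input": "வணக்கம், எப்படி இருக்கிறீர்கள்?",
  "output": "Hello, how are you?"
}
```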
`attiri_data.py` translates instruction data for the Alpaca dataset from one language to another using the Google Translate API. The program uses Click for command-line argument parsing and tqdm for progress tracking.
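A minimal sketch of the script's core loop is shown below. It assumes the unofficial `googletrans` client, which may differ from the translation binding the actual script uses, and omits the `--dataset` option for brevity:

```python
# Minimal sketch of the translation loop (illustrative; assumes the unofficial
# `googletrans` package, which may differ from the real script's client).
import json

import click
from googletrans import Translator
from tqdm import tqdm


@click.command()
@click.option("--source", "-s", default="en", help="Source language code")
@click.option("--target", "-t", default="ta", help="Target language code")
@click.option("--input", "-i", "input_path", required=True, help="Input JSON file")
@click.option("--output", "-o", "output_path", required=True, help="Output JSON file")
def translate_dataset(source, target, input_path, output_path):
    translator = Translator()
    with open(input_path, encoding="utf-8") as f:
        records = json.load(f)
    for record in tqdm(records, desc="Translating"):
        # Translate each non-empty Alpaca field in place.
        for field in ("instruction", "input", "output"):
            if record.get(field):
                record[field] = translator.translate(
                    record[field], src=source, dest=target
                ).text
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    translate_dataset()
```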
The Attiri-Nomic data is available on request as a CSV file containing each prompt and response in English along with their Tamil translations. To request access, click here.
Here are some examples of how to use the program:
Translate data from `alpaca_data.json` from English to Tamil and save it to `output.json`:

```bash
python attiri_data.py \
    --source en \
    --target ta \
    --dataset alpaca \
    --input alpaca_data.json \
    --output output.json
```
Alternatively, `-s` and `-t` can be used instead of `--source` and `--target`, and `-i` and `-o` can be used instead of `--input` and `--output`, respectively.
The `parameters.json` file contains the configuration parameters for running the model. Make sure to update the parameters in `parameters.json` according to your specific use case before running the model.
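The repository does not spell out the keys of `parameters.json`, so the example below is purely hypothetical: every key name and value here is an assumption about what a LoRA fine-tuning configuration typically contains.

```jsonc
// Hypothetical configuration; the actual keys are not documented in the repo.
{
  "base_model_path": "decapoda-research/llama-7b-hf",
  "data_path": "output.json",
  "num_epochs": 3,
  "learning_rate": 3e-4,
  "batch_size": 128,
  "cutoff_len": 256
}
```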
Now you can fine-tune the model with the following steps:
```python
import attiri.finetune as ft

# BASE_MODEL_PATH: path to the base LLaMA weights
# DATA_PATH: path to the translated instruction data
trainer = ft.LlamaTrainer(BASE_MODEL_PATH, DATA_PATH)
trainer.train()
```
Minimum viable model weights are released to the Hugging Face model hub; you can find them here. (Note: this is not a fully working model yet. Further models will be released as the project progresses.)
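For a quick programmatic check of the released weights, the LoRA adapter can be loaded on top of the base model with Hugging Face transformers and peft. This is a minimal sketch assuming the model IDs from the demo command below, not an officially supported loading script:

```python
# Minimal sketch: load the base LLaMA model and apply the Attiri LoRA weights.
# Assumes the model IDs from the demo command below; requires transformers,
# peft, and bitsandbytes for 8-bit loading.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "adithya-balaji/attiri-lama")

# Generate a short completion for a Tamil prompt.
inputs = tokenizer("வணக்கம்", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```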
To view a quick demo of the model, please follow the instructions below.

Clone the alpaca-lora repository and check out the pinned revision:

```bash
git clone https://github.com/tloen/alpaca-lora.git
cd alpaca-lora
git checkout a48d947
```
To launch the demo, run the following command:

```bash
python generate.py \
    --load_8bit \
    --base_model 'decapoda-research/llama-7b-hf' \
    --lora_weights 'adithya-balaji/attiri-lama' \
    --share_gradio
```
Please cite this project if you use the dataset, model, or code in this repo. (Note: naturally, you should also cite the original LLaMA, Stanford Alpaca, and LoRA papers.)
```bibtex
@misc{Attiri,
  author = {Adithya Balaji},
  title = {Attiri: Dataset and a LLaMA-based instruction-following large language model for Tamil},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/adithyab94/Attiri}},
}
```
This project is actively looking for collaborators. If you are interested in contributing to this project, please raise a pull request or write to me.
- Translate alpaca json data into Tamil
- Translate nomic json data into Tamil
- Clean training data
- Fine-tuning with LLaMA using a local GPU
- Release MVP model
- Output model to Hugging Face
- Demo UI (Hugging Face) [PARTIAL: Self-hosted app]
- Fine-tuning using Cloud GPU (minimum: 8x A100s, 80GB memory)
- Release v1.0 model
- Fine-tuning the 13B, 33B, and 65B models using Cloud GPU (minimum: 8x A100s, 80GB memory)
- Output models to Hugging Face
- Demo UI (Hugging Face / Hosted app)
- Prepare organic dataset customized to suit Tamil language
- Prepare organic dataset customized to suit Kondunthamizh and Romanized Tamil
- Prepare dataset for other languages
- Finetune to create language models
- Fine-tune other LLMs like PaLM, Flan, and GPT, and compare results
- Prepare Toxicity and abuse detection dataset
- Finetune to create safe language model
The LLaMA 7B model is not fine-tuned for Tamil; it is fine-tuned for languages with Latin scripts, so the Alpaca model performs poorly on Tamil prompts.
ChatGPT performs comparatively better but still does not generate meaningful responses.
The Attiri model is fine-tuned for Tamil and already outperforms Alpaca with just the pre-release model, showing the great potential of a large language model customized for Tamil.
Thanks to the open-source projects LLaMA, Stanford Alpaca, and Alpaca-LoRA, which inspired this project.
Thanks to the AI4Bharat team for the IndicCorp dataset and to Nomic AI for the GPT4All dataset.
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
The word "Attiri" ("அத்திரி") is used by the poet Ilango in the famous Tamil epic Silappadikaram which acccording to the Tamil dictionary could be a camel, a distant relative of the Llamas and Alpacas.
வான வண்கையன் அத்திரி ஏற
மான் அமர் நோக்கியும் வையம் ஏறிக்
கோடி பல அடுக்கிய கொழிநிதிக் குப்பை..
- கடலாடு காதை, சிலப்பதிகாரம்