AnimeSpeech: Dataset Generation for Language Model Training and Text-to-Speech Synthesis from Anime Subtitles

This project facilitates the generation of datasets for training large language models (LLMs) and text-to-speech (TTS) models using subtitles from anime videos.

Table of Contents

  1. Introduction
  2. Requirements
  3. Inputs
  4. Functionalities
  5. Directory Structure
  6. How to Use
  7. License


Introduction

AnimeSpeech generates datasets for training LLMs and TTS models from anime subtitles. Its goal is to make it easy to create the data needed to train machine learning models from anime videos and their subtitles. Specifically, it provides functionalities such as extracting conversation data and preparing character voice data for speech synthesis, which are useful for research and development in language modeling and speech synthesis.

Speaker recognition from videos, known as the speaker verification task, is unfortunately not used in this project. Instead, the current approach extracts embeddings from the videos, and the character creation step relies on a human labeling the subtitles. The labeled embeddings serve as training data for a k-nearest neighbors (KNN) classifier. When predicting on a new video, the distance between the labeled embeddings and each new embedding is measured, and a line is recognized as a character only if that distance is smaller than a certain threshold.
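The thresholded nearest-neighbor idea described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's implementation: the cosine distance, the default k of 3, the 0.4 cutoff, and the names cosine_distance and predict_character are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def predict_character(embedding, labeled, k=3, threshold=0.4):
    """Vote among the k nearest labeled embeddings that lie within threshold.

    labeled: list of (character_name, vector) pairs (hypothetical layout).
    Returns (name, nearest_distance), or (None, nearest_distance) when even
    the closest labeled embedding is farther away than the threshold.
    """
    ranked = sorted((cosine_distance(embedding, vec), name) for name, vec in labeled)
    nearest = ranked[:k]
    best_distance = nearest[0][0]
    if best_distance > threshold:
        return None, best_distance  # too far from every known character
    votes = Counter(name for dist, name in nearest if dist <= threshold)
    return votes.most_common(1)[0][0], best_distance
```

Returning None for distant embeddings is what keeps lines spoken by unlabeled characters from being forced onto a known character.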

Explanation Video

Blog post


Requirements

  • demoji==1.1.0
  • gradio==4.20.1
  • matplotlib==3.8.0
  • munch==4.0.0
  • neologdn==0.5.2
  • numpy==1.25.2
  • pandas==2.0.3
  • scikit_learn==1.2.2
  • setuptools==69.1.1
  • speechbrain==0.5.16
  • toml==0.10.2
  • torch==2.0.1
  • torchaudio==2.0.2
  • tqdm==4.66.1
  • transformers==4.36.2


Inputs

  • Video file: The video from which audio and dialogue data will be extracted.
  • Subtitles file: The .srt file containing the subtitles of the video.
  • Annotation file: The CSV file containing the predictions. Provide this file only when creating the audio and dialogue datasets; for labeling or predicting, it is generated automatically.

Both the video file and the subtitles file should be placed in the data/inputs folder. Only the filenames are required; the full path is not needed. For the annotation file, include the folder it is stored in, relative to data/outputs, for example video_name/preds.csv.
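As an illustration of the path convention above, the following sketch shows how the short names map onto the data folder. The helper names resolve_input and resolve_annotation and the DATA_DIR constant are hypothetical and are not taken from the project's code.

```python
from pathlib import Path

# Assumed location of the data folder, relative to the repository root.
DATA_DIR = Path("data")


def resolve_input(filename: str) -> Path:
    """Video and subtitle files are given by filename only."""
    return DATA_DIR / "inputs" / filename


def resolve_annotation(relative_path: str) -> Path:
    """Annotation files include their folder, relative to data/outputs."""
    return DATA_DIR / "outputs" / relative_path


print(resolve_input("video-file.mp4"))
print(resolve_annotation("video_name/preds.csv"))
```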


Functionalities

Create Annotations

This functionality involves processing subtitles and video to generate annotations.

Character Creation

Users can label the data to create representations of the desired characters. The subtitles are converted into tabular data similar to an Excel sheet. Predicting on new data requires the embeddings of the characters, so this process needs to be done at least once.

Three models are currently available: SpeechBrain, WavLM, and Espnet. The most powerful one is Espnet, which runs inside a Docker container; see the Docker section for details.

Character Prediction

This function predicts the character speaking each line. It requires pre-created representations (embeddings) of desired characters and predicts characters only for those with representations.

Create Datasets

This functionality takes an annotations file as input and creates datasets for training LLMs and TTS.

Dialogues Dataset

This creates a conversational dataset suitable for training LLMs. Users can select which characters' dialogues to include.

Audios Dataset

This extracts all audios of a desired character and organizes them into a folder along with corresponding text for TTS training.

FineTune LLM Dialogues

This is a training script designed to facilitate the training of conversational language models (LMs) using the Hugging Face Transformers library. It provides functionality to load pre-trained models, prepare data, train models, and save the training logs.

Directory Structure

├── data
│   ├── inputs
│   │   ├── subtitle-file.srt
│   │   ├── video-file
│   ├── outputs
│   │   ├── subtitle-file.csv
│   │   ├── video-file
│   │   │   ├── preds.csv
│   │   │   ├── voice
│   │   │   ├── embeddings
├── docker
├── pretrained_models
├── src
│   ├── characterdataset
│   │   ├── api
│   │   ├── common
│   │   ├── configs
│   │   ├── datasetmanager
│   │   ├── oshifinder
│   │   ├── train_llm
├── tests
│   ├──
│   ├──
│   ├──

File description

- data: stores the subtitles and video files; the predictions are saved there as well.
- datasetmanager: sub-package that processes subtitle files and the text part.
- oshifinder: sub-package that creates embeddings and makes predictions.
- train_llm: sub-package for fine-tuning an LLM using QLoRA, with a Gradio-based interface for training.

How to Use


git clone
cd animespeechdataset

If you use conda, it is recommended to create a new environment.

conda create -n animespeech python=3.11
conda activate animespeech

Then install the required packages. If your PC does not have an NVIDIA GPU, remove the --index-url line from the requirements file, since that line installs the CUDA software.

pip install -r requirements.txt
pip install -e .

To use the webui just run:



Docker

The audio embedding is a crucial part of the pipeline: its quality affects the quality of the dataset, and more accurate predictions also make annotation corrections easier. For that reason, the best-performing model should be used when possible. Espnet-SPK provides a powerful model for conversational data. However, the Espnet package depends on slightly older libraries, such as Python 3.10 and Torch 2.1.2, while the SpeechBrain and Transformers libraries are compatible with newer versions of Torch, so downgrading the whole environment seemed unnecessary. Espnet is therefore run inside a Docker container. Using volumes, Espnet's output is stored in a folder on the host, and the API requires only simple commands.

If you wish to use Docker, execute the following command:

docker compose up -d

Recreating the container every time is wasteful, because Espnet's conversational model would have to be downloaded again each time. If you do not want to delete the image and container, you can pause it by stopping the container instead. To do so, execute the following command:

docker compose stop

To resume usage, execute the following command:

docker compose start

Creating character representations

  1. Enter the video name and the subtitles name; both files must be placed in data/inputs. If you do not have subtitles, check the transcribe checkbox.
  2. To create representations (embeddings) of the desired characters, first load the dataset with the load df button.
  3. Label the dataframe by entering the character name in the first column.
  4. Save the labeled data using the save annotations button.
  5. Use the Create representation button to extract the embeddings from the labeled data.

Predict characters

  1. Enter the video name and the subtitles name; both files must be placed in data/inputs. If you do not have subtitles, check the transcribe checkbox.
  2. Use the Predict characters button. The annotation file path will be displayed in the annotation file textbox, and the result file will be stored in a folder with the same name as the video file.

Correction of Character Predictions

Since predictions are not perfect, it is recommended to correct the annotations. This task can be quite tedious, so the following steps are provided to make it a bit easier:

  1. Paste the prediction file into the Annotations text box and use the Create file with texts and predictions button. This will generate a file named cleaning.csv.
  2. Use the cleaning.csv file to listen to the audio while correcting the text.
  3. After correcting the cleaning.csv file, paste the prediction file into the Annotations text box and use the Update predictions button. This will generate a file named PREDICTION-FILE_cleaned.csv. Additionally, the names of the embedding and audio files will also be changed.

Create audio and dialogue datasets

  1. Enter the prediction results file, including the folder in which it is stored but not the data/outputs part.
  2. In the Export for training tab, select the type of dataset to create: dialogues or audios.
  3. For dialogues, you can specify the first character and second character, which take the user role and the assistant role. For audios, you have to choose the character.
  4. For dialogues, you can choose the maximum time interval within which two lines are considered a conversation; the default is 5 seconds.
  5. Click the Transform button.
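The dialogue export in the steps above can be pictured with the following sketch. It is only an illustration under assumptions: the row fields (character, text, start, end, in seconds), the function name build_dialogues, and the choice to measure the interval from the end of one line to the start of the next are hypothetical, not the project's actual logic.

```python
def build_dialogues(lines, first_character, second_character, max_gap=5.0):
    """Pair consecutive lines into user/assistant turns.

    lines: list of dicts sorted by start time, each with the assumed keys
    "character", "text", "start" and "end" (times in seconds).
    A pair is kept when the first character speaks, the second character
    answers, and the silence between both lines is at most max_gap seconds.
    """
    pairs = []
    for prev, curr in zip(lines, lines[1:]):
        gap = curr["start"] - prev["end"]
        if (
            prev["character"] == first_character
            and curr["character"] == second_character
            and gap <= max_gap
        ):
            pairs.append({"user": prev["text"], "assistant": curr["text"]})
    return pairs
```

A shorter interval gives fewer but more tightly connected exchanges, while a longer one risks pairing lines that belong to different scenes.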

Adding New Labeled Data

After modifying the prediction file, the embeddings can be used as part of the training dataset. Adding this new data to the labeled embedding dataset should improve prediction accuracy. We want to use samples that are far from neighboring data based on distance, as these samples are considered 'difficult' for the model and thus have high value as training data.

  1. Paste the modified prediction file into the text box under Annotations. In the Create characters tab, expand Add new data.
  2. Set a minimum distance. We suggest a threshold of 0.2; samples with distances above 0.4 are considered doubtful. Then click Add new embeddings to the labeled data. The embedding files will be automatically copied to the Character embeddings folder.
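The selection rule in these steps can be sketched as a simple filter. The row layout (file, character, distance) and the names distance_band and select_new_samples are assumptions for illustration; only the 0.2 and 0.4 values come from the text above.

```python
def distance_band(distance, min_distance=0.2, doubtful_distance=0.4):
    """Classify a corrected sample by its distance to the labeled data."""
    if distance < min_distance:
        return "easy"      # already well covered by the labeled embeddings
    if distance > doubtful_distance:
        return "doubtful"  # far from everything; double-check the label
    return "useful"        # hard enough to add information


def select_new_samples(rows, min_distance=0.2):
    """Keep corrected rows whose distance is at least min_distance.

    rows: list of dicts with the assumed keys "file", "character", "distance".
    """
    return [row for row in rows if row["distance"] >= min_distance]
```

Samples in the doubtful band are still selected here, since the prediction file has already been corrected by hand; the band only signals which labels deserve a second look.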


For speech recognition, we use the NeMo model released by ReazonSpeech. This module cannot be used directly on Windows, although it works without issues under WSL2. For that reason, we have included a simple script that uses FastAPI for speech recognition.

We have also created a Docker image for speech recognition. The container performs speech recognition based on the file names, and the generated annotations file is saved to the host directory: thanks to the use of volumes, the container writes files into the host's data/outputs directory. Since files do not need to be sent to an API, processing is faster.

Check best K for KNN

To look for the best n_neighbors, just run:

python ./tools/
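For readers who want to see what such a search involves, here is a minimal sketch of choosing n_neighbors by leave-one-out accuracy. It is written in plain Python for illustration and is not the script in tools; the Euclidean distance, the candidate values, and the function names are assumptions.

```python
import math
from collections import Counter


def loo_accuracy(samples, k):
    """Leave-one-out accuracy of a majority-vote KNN with k neighbors.

    samples: list of (label, vector) pairs (hypothetical layout).
    """
    correct = 0
    for i, (label, vec) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        nearest = sorted(others, key=lambda s: math.dist(vec, s[1]))[:k]
        predicted = Counter(name for name, _ in nearest).most_common(1)[0][0]
        correct += predicted == label
    return correct / len(samples)


def best_k(samples, candidates=(1, 3, 5)):
    """Return the best-scoring k (first one on ties) and all scores."""
    scores = {k: loo_accuracy(samples, k) for k in candidates}
    return max(scores, key=scores.get), scores
```

With only a few labeled lines per character, large values of k let other characters outvote the correct one, which is why the search is worth running on your own data.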

You can also plot the embeddings of the characters. Run the following command.

python ./tools/

Training QLoRA

To train a conversational LM, a configuration file (default_config.toml) specifying the training parameters is required. This file can be updated using the interface. Training from the command line is also supported.


rank = 64
alpha = 64
dropout = 0.1
bias = "none"

character_name = "THE-CHARACTER-NAME-TO-LEARN"

max_steps = 80
learning_rate = 1e-4
per_device_train_batch_size = 16
optimizer = "adamw_8bit"
save_steps = 5
logging_steps = 5
output_dir = "output"
save_total_limit = 10
push_to_hub = false
warmup_ratio = 0.05
lr_scheduler_type = "constant"
gradient_checkpointing = true
gradient_accumulation_steps = 2
max_grad_norm = 0.3
save_only_model = true

Training QLoRA requires additional packages, such as bitsandbytes and peft. To install them, use the following command.

pip install -r requirements-train.txt

To use the webui for training just run:


To run training from the command line, run the module with the path to the configuration file as an argument (--config_file). If no config_file is provided, the config file inside train_llm is used by default.

python -m characterdataset.train_llm --config_file "YOUR-CONFIG-FILE"
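The fallback behaviour described above corresponds to a standard argparse default. The sketch below shows the pattern only; the default path and the parse_args function are assumptions and may differ from the actual module.

```python
import argparse
from pathlib import Path

# Assumed location of the bundled configuration file.
DEFAULT_CONFIG = Path("src/characterdataset/train_llm/default_config.toml")


def parse_args(argv=None):
    """Parse --config_file, falling back to the bundled default."""
    parser = argparse.ArgumentParser(prog="characterdataset.train_llm")
    parser.add_argument(
        "--config_file",
        type=Path,
        default=DEFAULT_CONFIG,
        help="Path to the TOML file with the training parameters.",
    )
    return parser.parse_args(argv)
```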


  • Change class attributes to function parameters when possible.
  • When creating dialogues, look for (可能) with the character name.
  • Add support for Whisper.
  • Process entire folders, not just individual files.
  • Add a QLoRA script for fine-tuning LLMs.
  • Add n_neighbors as a parameter.




License

This project is licensed under the MIT license. Details are in the LICENSE file.

