AnimeSpeech: Dataset Generation for Language Model Training and Text-to-Speech Synthesis from Anime Subtitles
AnimeSpeech is a project for generating datasets to train language models (LLMs) and text-to-speech (TTS) synthesis systems from anime subtitles. It aims to make it easy to create the data needed for training machine learning models from anime videos: specifically, it provides functionalities such as extracting conversation data and collecting character voices, which are useful for research and development in language modeling and speech synthesis.
Speaker recognition from video (the speaker verification task) is unfortunately not used here directly. Instead, the current approach extracts embeddings from the video, while the character creation step relies on human labeling of the subtitles. The labeled embeddings then serve as training data for a KNN classifier. When predicting on new videos, the distance between each new embedding and the labeled embeddings is measured, and a line is recognized as a character only if that distance is smaller than a certain threshold.
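As a rough illustration of this idea, here is a minimal sketch of distance-thresholded KNN prediction, assuming the embeddings are stored as NumPy arrays (the function name, the cosine metric, and the 0.4 threshold are illustrative, not the exact implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def predict_characters(labeled_embeddings, labels, new_embeddings,
                       threshold=0.4, n_neighbors=4):
    """Assign a character to each new embedding, or None if it is too far
    from every labeled sample."""
    knn = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine")
    knn.fit(np.asarray(labeled_embeddings))
    distances, indices = knn.kneighbors(np.asarray(new_embeddings))

    predictions = []
    for dist, idx in zip(distances, indices):
        if dist[0] < threshold:
            # Majority vote among the nearest labeled samples.
            votes = [labels[i] for i in idx]
            predictions.append(max(set(votes), key=votes.count))
        else:
            # Too far from all labeled characters: leave unassigned.
            predictions.append(None)
    return predictions
```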
- demoji==1.1.0
- gradio==4.20.1
- matplotlib==3.8.0
- munch==4.0.0
- neologdn==0.5.2
- numpy==1.25.2
- pandas==2.0.3
- scikit_learn==1.2.2
- setuptools==69.1.1
- speechbrain==0.5.16
- toml==0.10.2
- torch==2.0.1
- torchaudio==2.0.2
- tqdm==4.66.1
- transformers==4.36.2
- Video file: The video from which audio and dialog data will be extracted.
- Subtitles file: The .srt file containing the subtitles of the video.
- Annotation file: The CSV file containing the predictions. This file is only needed when creating audio and dialogue datasets; for labeling and predicting it is output automatically.
Both the video file and the subtitles file should be placed in the data/inputs folder; only the filenames are required, not the full paths. For the annotation file, the path relative to the data/outputs folder is needed, for example video_name/preds.csv.
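For example, with the layout below (filenames are placeholders), the first two files are introduced by filename only, while the annotation file is introduced by its path relative to data/outputs:

```
data/inputs/video-file.mkv          # introduced as "video-file.mkv"
data/inputs/subtitle-file.srt       # introduced as "subtitle-file.srt"
data/outputs/video_name/preds.csv   # introduced as "video_name/preds.csv"
```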
This functionality processes the subtitles and the video to generate annotations. Users can then label the data to create representations of the desired characters. The converted subtitles are presented as tabular data, similar to an Excel sheet. Predicting on new data requires the character embeddings, so this process needs to be done at least once.
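For instance, the labeled table might look roughly like the sketch below; the column names are illustrative, not the project's exact schema. Only the first column (the character name) is filled in by the user:

```
character,start_time,end_time,text
character-name,00:01:12.340,00:01:14.200,First subtitle line
,00:01:15.000,00:01:17.800,A line left unlabeled
```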
Three models are currently available: SpeechBrain, WavLM, and Espnet. The most powerful one is Espnet; this model runs inside a Docker container, as detailed in the Docker section.
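For illustration, extracting a speaker embedding with SpeechBrain's pretrained ECAPA-TDNN model looks roughly like the sketch below (the VoxCeleb checkpoint and the file paths are assumptions, not necessarily what this project loads):

```python
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the pretrained speaker-embedding model once.
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Load a mono 16 kHz clip cropped to a single subtitle line.
signal, sample_rate = torchaudio.load("data/outputs/video-file/voice/line_001.wav")

# encode_batch returns a [batch, 1, embedding_dim] tensor.
embedding = classifier.encode_batch(signal).squeeze()
print(embedding.shape)  # ECAPA-TDNN produces 192-dimensional embeddings
```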
This function predicts the character speaking each line. It requires pre-created representations (embeddings) of the desired characters and only predicts characters for which representations exist.
This functionality takes an annotations file as input and creates datasets for training LLMs and TTS.
This creates a conversational dataset suitable for training LLMs. Users can select which characters' dialogues to include.
This extracts all audio clips of a desired character and organizes them into a folder, along with the corresponding text, for TTS training.
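The resulting output might be organized roughly like this (a hypothetical sketch; the actual naming scheme may differ):

```
data/outputs/video-file/voice/
├── character-name_0001.wav
├── character-name_0002.wav
└── ...                        # plus the corresponding text for each clip
```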
A training script designed to facilitate the training of conversational language models (LMs) using the Hugging Face Transformers library. It provides functionalities to load pre-trained models, prepare data, train models, and save the training logs.
├── data
│ ├── inputs
│ │ ├── subtitle-file.srt
│ │ ├── video-file
│ ├── outputs
│ │ ├── subtitle-file.csv
│ │ ├── video-file
│ │ │ ├── preds.csv
│ │ │ ├── voice
│ │ │ ├── embeddings
├── docker
├── pretrained_models
├── src
│ ├── characterdataset
│ │ ├── api
│ │ ├── common
│ │ ├── configs
│ │ ├── datasetmanager
│ │ ├── oshifinder
│ │ ├── train_llm
├── tests
│ ├── test_dataset_manager.py
│ ├── test_finder.py
│ ├── test_train_conversational.py
├── webui_finder.py
├── train_webui.py
- `data`: stores the subtitles and video files; the predictions are saved there as well.
- `datasetmanager`: sub-package that processes subtitle files and the text side of the pipeline.
- `oshifinder`: sub-package that creates embeddings and makes predictions.
- `train_llm`: sub-package for fine-tuning LLMs using QLoRA.
- `webui_finder.py`: Gradio-based interface.
- `train_webui.py`: Gradio-based interface for training with QLoRA.
git clone https://github.com/deeplearningcafe/animespeechdataset
cd animespeechdataset
If you use conda, it is recommended to create a new environment:
conda create -n animespeech python=3.11
conda activate animespeech
Then install the required packages. If you don't have an NVIDIA GPU in your PC, remove the `--index-url` line from the requirements file, as that line installs the CUDA libraries.
pip install -r requirements.txt
pip install -e .
To use the webui just run:
python webui_finder.py
Audio embedding is a crucial part of this project: the quality of the embeddings directly affects the quality of the dataset, and higher prediction accuracy also makes annotation corrections easier. Therefore, the best-performing model should be used when possible. Espnet-SPK provides a powerful model for conversational data. However, Espnet's package depends on slightly older libraries, such as Python 3.10 and Torch 2.1.2, while the Speechbrain and Transformers libraries are compatible with newer versions of Torch, so downgrading the whole environment felt unnecessary. Instead, Espnet can be used via a Docker container. Using volumes, Espnet's output is stored in the host folder, and the API requires only simple commands.
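For reference, extracting an embedding with ESPnet-SPK (inside the container) typically looks like this minimal sketch; the model tag shown is an example from the ESPnet model zoo, not necessarily the one this project downloads:

```python
import soundfile as sf
from espnet2.bin.spk_inference import Speech2Embedding

# Download and load a pretrained ESPnet-SPK speaker-embedding model.
speech2embedding = Speech2Embedding.from_pretrained(
    model_tag="espnet/voxcelebs12_rawnet3"
)

# The model expects a 1-D waveform at 16 kHz.
speech, sample_rate = sf.read("line_001.wav")
embedding = speech2embedding(speech)
print(embedding.shape)
```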
If you wish to use Docker, execute the following command:
docker compose up -d
Recreating the container every time is wasteful, because Espnet's conversational model would have to be downloaded again each time. Therefore, if you do not want to delete the image and container, you can stop them instead. To stop, execute the following command:
docker compose stop
To resume usage, execute the following command:
docker compose start
- Introduce the video name and the subtitles name, both placed in `data/inputs`. If you don't have subtitles, use the transcribe checkbox.
- Create representations (embeddings) of the desired characters. To load the dataset, just use the `load df` button.
- Label the dataframe: just introduce the character name in the first column.
- Save the labeled data using the `save annotations` button.
- Use the `Create representation` button to extract the embeddings from the labeled data.
- Introduce the video name and the subtitles name, both placed in `data/inputs`. If you don't have subtitles, use the transcribe checkbox.
- Use the `Predict characters` button; the annotation file path will be displayed in the annotation file textbox. The result file is stored in a folder with the same name as the video file.
Since predictions are not perfect, it is recommended to correct annotations. However, this task can be quite tedious. To make it a bit easier, we have provided the following steps:
- Paste the prediction file into the `Annotations` text box and use the `Create file with texts and predictions` button. This will generate a file named `cleaning.csv`.
- Use the `cleaning.csv` file to listen to the audio while correcting the text.
- After correcting the `cleaning.csv` file, paste the prediction file into the `Annotations` text box and use the `Update predictions` button. This will generate a file named `PREDICTION-FILE_cleaned.csv`. Additionally, the names of the embedding and audio files will be updated accordingly.
- Introduce the prediction results file; include the folder in which it is stored, but not the `data/outputs` part.
- In the `Export for training` tab, select the type of dataset to create: `dialogues` or `audios`.
- For `dialogues` you can specify the `first character` and `second character` (the user and assistant roles). For `audios` you have to choose the character. A hypothetical example of the resulting dialogue format is shown after this list.
- For `dialogues` you can also choose the maximum time interval for two lines to be considered part of the same conversation; the default is 5 seconds.
- Click the `Transform` button.
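For illustration, the exported `dialogues` dataset might look like the sketch below (the column names and CSV layout are assumptions, not the exact output format):

```
user,assistant
"Line spoken by the first character","Reply by the second character"
"Next line within the 5-second interval","Its reply"
```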
After modifying the prediction file, the corresponding embeddings can be used as additional training data. Adding this new data to the labeled embedding set should improve prediction accuracy. We prefer samples that are far from their neighbors in embedding distance, as these are 'difficult' for the model and thus most valuable as training data.
- Paste the modified prediction file into the `Annotations` text box. In the `Create characters` tab, expand `Add new data`.
- Set a minimum distance; we suggest a threshold of 0.2, as samples with distances above 0.4 are considered doubtful. Then click `Add new embeddings to the labeled data`. The embedding files will be automatically copied to the `Character embeddings` folder. A small sketch of this distance filter follows below.
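To make the selection rule concrete, here is a minimal sketch of filtering new samples by their distance to the labeled set (the function and variable names are illustrative, and the 0.2 default follows the suggestion above):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_hard_samples(labeled_embeddings, new_embeddings, min_distance=0.2):
    """Return indices of new samples that are far enough from the labeled
    set to be worth adding as training data."""
    knn = NearestNeighbors(n_neighbors=1, metric="cosine")
    knn.fit(np.asarray(labeled_embeddings))
    distances, _ = knn.kneighbors(np.asarray(new_embeddings))

    # Samples closer than min_distance add little new information;
    # distances above roughly 0.4 were already flagged as doubtful
    # and are assumed to have been corrected by hand at this point.
    return np.where(distances.ravel() >= min_distance)[0]
```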
For speech recognition, we use the NeMo model released by ReazonSpeech. However, this module cannot be used directly on Windows (it works without issues under WSL2), so we have included a simple script, asr_api.py, that exposes speech recognition through FastAPI.
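Such an endpoint can be as small as the following sketch, assuming the reazonspeech.nemo.asr package is installed (the route name and payload are illustrative, not necessarily those of the actual asr_api.py):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from reazonspeech.nemo.asr import audio_from_path, load_model, transcribe

app = FastAPI()
model = load_model()  # load the ReazonSpeech NeMo ASR model once at startup

class TranscribeRequest(BaseModel):
    path: str  # path to an audio file reachable by the server

@app.post("/transcribe")
def transcribe_audio(request: TranscribeRequest):
    audio = audio_from_path(request.path)
    result = transcribe(model, audio)
    return {"text": result.text}
```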
We have also created a Docker image for speech recognition. The container performs the speech recognition based on the file names, and the generated annotations file is saved to the host: thanks to the use of volumes, the container writes directly into the host's data/outputs directory. Since there is no need to send files to an API, the processing speed of the program is improved.
To look for the best `n_neighbors`, just run:
python ./tools/knn_choose.py
You can also plot the embeddings of the characters. Run the following command.
python ./tools/check_knn.py
To train a conversational LM, a configuration file (`default_config.toml`) specifying the required training parameters is needed. This file can be updated from the `train_webui.py` interface. Training from the command line (CMD) is also supported.
[peft]
rank = 64
alpha = 64
dropout = 0.1
bias = "none"
[dataset]
dataset = "YOUR-DATASET-CSV-PATH"
character_name = "THE-CHARACTER-NAME-TO-LEARN"
[train]
base_model = "HUGGINGFACE-MODEL-NAME"
max_steps = 80
learning_rate = 1e-4
per_device_train_batch_size = 16
optimizer = "adamw_8bit"
save_steps = 5
logging_steps = 5
output_dir = "output"
save_total_limit = 10
push_to_hub = false
warmup_ratio = 0.05
lr_scheduler_type = "constant"
gradient_checkpointing = true
gradient_accumulation_steps = 2
max_grad_norm = 0.3
save_only_model = true
Training with QLoRA needs additional packages (bitsandbytes, peft, etc.). If you are going to train, install them with the following command:
pip install -r requirements-train.txt
To use the webui for training just run:
python train_webui.py
To run it from the CMD, run the module with the path to the configuration file as an argument (`--config_file`). If no `config_file` is provided, the default config file inside `train_llm` is used.
python -m characterdataset.train_llm --config_file "YOUR-CONFIG-FILE"
- Change class attributes to function parameters when possible.
- When creating dialogues, look for the character name when possible.
- Add support for Whisper.
- Process entire folders, not just individual files.
- Add QLoRA script for fine-tuning LLMs.
- Add `n_neighbors` as a parameter.
This project is licensed under the MIT license. Details are in the LICENSE file.