I've written a step-by-step guide to using the notebook in this repo. You can find it here (feel free to dismiss the subscription prompt).
This repo description is ~99% generated by GPT-4 and not fully proofread; tell me if you spot errors or GPT-isms.
I've licensed this code under MIT. I have no clue what kind of license fits the dataset itself, and I'm not going to give any legal advice on that front either.
This repository contains the code used to generate the Augmental dataset, as well as the training code to finetune models on it. The dataset's distinguishing feature is its use of GPT-4 to enhance human-written scripts from visual novels, bridging the gap between purely synthetic data and manual data curation.
- Introduction
- Prerequisites
- Data Generation
- Model Training
- Acknowledgements
- References and Useful Links
## Introduction

The Augmental dataset is a novel multiturn dataset containing 7.86k replies spread across about 480 different conversations among 7 distinct characters. This dataset was crafted by refining and enhancing the script of the visual novel Steins;Gate using GPT-4. The dataset prioritizes quality, longer responses, and retaining the human-like essence of the conversation while benefitting from the capabilities of GPT-4.
## Prerequisites

The dataset is generated from the `.scx.txt` files of Steins;Gate. It's essential to have a legal copy of Steins;Gate to extract the required files. Please ensure you have the rights to use the text from the visual novel for your purposes.

- sc3tools for extracting `.scx.txt` files from Steins;Gate.
## Data Generation

- Extract the `.scx.txt` files from Steins;Gate using sc3tools.
- Merge the extracted `.scx.txt` files into a single text file (see the merge sketch after this list).
- Open the `processing_refactor.ipynb` notebook.
- Before running the notebook, ensure you've toggled the `dataset_has_been_manually_edited` flag at the top:
  - Set to `True` if working with the original dataset. When this is true, the notebook (shouldn't) make any OAI calls and will leave any gaps in the dataset alone.
  - Set to `False` if you're generating new data.
- Run the notebook to process the raw text file. The output will be the Augmental dataset, ready for model training.
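For the merge step, something like the following works; the `extracted/` directory and the output filename are placeholders for wherever sc3tools wrote the files and whatever name the notebook expects:

```python
from pathlib import Path

# Placeholder paths: point extracted_dir at wherever sc3tools put the
# .scx.txt files, and name the merged file whatever the notebook expects.
extracted_dir = Path("extracted")
merged_path = Path("steins_gate_script.txt")

with merged_path.open("w", encoding="utf-8") as out:
    # sorted() keeps the merge order deterministic across runs.
    for script in sorted(extracted_dir.glob("*.scx.txt")):
        out.write(script.read_text(encoding="utf-8"))
        out.write("\n")
```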
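For reference, the flag is just a boolean near the top of `processing_refactor.ipynb`; set it before running anything else:

```python
# Near the top of processing_refactor.ipynb
dataset_has_been_manually_edited = True   # original dataset: no OAI calls, gaps left alone
# dataset_has_been_manually_edited = False  # uncomment when generating new data
```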
## Model Training

The training code for finetuning models on the Augmental dataset is contained in `train.py`.
Use the `processing_refactor.ipynb` notebook as you would any other notebook. Cells that call OpenAI will skip generations that have already been saved to files in the `./annotated_convs`, `./scenarios`, and `./anchors` directories (a sketch of this caching pattern is below). !!DO NOT DELETE FILES IN THOSE DIRECTORIES UNLESS YOU WANT PAINFUL ERRORS!!
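The caching works roughly like the sketch below. This is an illustration of the pattern, not the notebook's actual code: the `annotate_conversation` helper, the per-conversation JSON naming scheme, and the `openai>=1.0` client are all assumptions.

```python
import json
from pathlib import Path

from openai import OpenAI  # assumes the openai>=1.0 client; the notebook may differ

client = OpenAI()
CACHE_DIR = Path("annotated_convs")  # same idea applies to scenarios/ and anchors/
CACHE_DIR.mkdir(exist_ok=True)

def annotate_conversation(conv_id: str, conv_text: str) -> dict:
    """Return a GPT-4 annotation for one conversation, reusing the cached file if present."""
    cache_file = CACHE_DIR / f"{conv_id}.json"  # hypothetical naming scheme
    if cache_file.exists():
        # Already generated on a previous run: skip the expensive OpenAI call.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": conv_text}],
    )
    result = {"conv_id": conv_id, "annotation": response.choices[0].message.content}
    # Deleting this file is what forces a (painful) re-generation.
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```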
Then run the training script:

```bash
python train.py
```
## Acknowledgements

This dataset is an evolution of the dataset used to train MythoMakise, a model that achieved notable recognition. The current model, trained on the Augmental dataset, promises even higher-quality interactions and versatility. See the note on the Augmental dataset HF page for legal considerations. TL;DR: if the legal holders of the Steins;Gate IP tell me to take this down, I will without a second thought.
## References and Useful Links

- Reddit Post introducing the dataset.
- Model Card detailing the specifics and innovations behind the model.
- Ko-fi Support Link
- Essay on Substack discussing the rationale behind the dataset.

Feedback, suggestions, and contributions are always welcome!