text to instruction dataset

A collection of scripts used to generate the dataset for the WW-Storytelling-70B-LoRA.

Supports:

Breaking .txt files into chunks based on context length.
Using a local instance of Oobabooga (or anything that supports an OpenAI-style API) to generate prompts and other metadata.
Outputting a final .json file for training with Oobabooga.
Various tools for analyzing the dataset (count common phrases, randomize names, batch generate responses from the final model).

Setup

Clone this repo.
Install requirements with pip install -r requirements.txt.
Edit settings.toml (or make a copy named user.toml).
- The defaults should work out of the box, unless:
  - You're not using Oobabooga to generate prompts or it isn't being hosted on the same machine.
  - You're not working on a storytelling dataset.
    - This should still be doable, but you'll have to live with the keys being named "story", "context", "prompt".
    - See prompt_gen.prompt_file_path in the settings.toml for how to modify the prompt generation request.
(If generating prompts) Launch Oobabooga's Text-Generation-Webui with the API extension enabled and load a model.

Usage

Build a dataset in the form of a folder of .txt files.
Turn your .txt files into "chunks" that you want to train the model on.

python -m extractors.book_to_chunks --input_folder IN_FOLDER 
    --output_folder OUT_FOLDER --max_tokens CHUNK_SIZE
# The CHUNK_SIZE should be the CUTOFF_LENGTH that you specify during training, minus your
# expected 'instruction' size (~300 tokens if using this repo's defaults).

Use a local model to generate prompts (see Setup for details).

python process_prompts.py --input_folder IN_FOLDER 
    --output_folder OUT_FOLDER --mode generate_prompts

Generate the .json file to use with your training scripts.

python finalize_dataset.py --input_folder IN_FOLDER1 IN_FOLDER2
    --output_folder OUT_FOLDER
# You can specify multiple input folders after the --input_folder flag

Use the .json dataset files in the output folder to train in Oobabooga. A compatible format file is provided in the project root directory.

Tools

The COUNT_PHRASES mode of process_prompts.py counts repeated phrases across your prompt json folders. Comparing this across two kinds of input (e.g. literature vs fanfiction) can be useful for finding phrases biases in the dataset.
The RANDOMIZE_NAMES mode of process_prompts.py randomizes names in prompt json folders. Avoids name biases.
python -m tools.inject_hardcoded_keys can be used to batch edit prompt jsons (e.g. add the genre, year of publication, etc.)
python -m tools.prompt_tester can be used to generate sample outputs. Uses a template + list of values to cycle through.
python -m tools.merge_prompts can be used to combine prompts from two different folders. This is mostly for doing partial reverts on the prompt json folders.

Credits

This repo contains lists of male and female names sourced from:

Name		Name	Last commit message	Last commit date
Latest commit History 80 Commits
.idea		.idea
extractors		extractors
library		library
processors		processors
tests		tests
tools		tools
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
finalize_dataset.py		finalize_dataset.py
process_prompts.py		process_prompts.py
requirements.txt		requirements.txt
settings.toml		settings.toml
t2d_oobabooga_training_format.json		t2d_oobabooga_training_format.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

text to instruction dataset

Setup

Usage

Tools

Credits

About

Releases

Packages

Languages

alac/txt_to_dataset

Folders and files

Latest commit

History

Repository files navigation

text to instruction dataset

Setup

Usage

Tools

Credits

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages