# Welcome to Augmentool

This notebook is where you generate all your data.

Augmentool is meant to let instruct-tuned models learn from books, even using the models themselves to generate new data through a sort of bootstrapping method. It is meant to free model creators from working as data annotators, so that they can be actual model trainers. It is meant to allow anyone to make their own high-quality dataset with thousands of entries.

### Here are some ways you can adapt this notebook to your use case, along with a brief description of how to do so, arranged in increasing order of difficulty:

1. ***Change the source texts used to generate training data.*** You can do this in the cell right below this one. **IMPORTANT:** the filenames must follow a specific format, since they are used as part of the prompts and in at least one regex. Name them like this: `[textname], by authorname`. You can also include the publication date after the author name if you want, but note that this will tend to bias most of the characters to live in the era of the source text, which may or may not be what you want.

2. ***Change the personalities of the characters generated.*** Currently, when generating characters for the multiturn conversation step, three randomly-selected traits are appended to the "special instructions" set of the prompt to constrain what kind of character is generated by the model. Depending on what kind of model you want to make, or even just if your preferences vary, then you will probably want to modify this a bit. You can do so in `./generation_functions/special_instructions.py`. A more in-depth description of the trait-axis system that I (over)thought up is available in the comments of that file.

3. ***Change the constants.*** There are a few constant values in this notebook and in `./generation_functions/constant_values.py`. These constants are tested, but if your use case requires special settings (e.g., you want to make conversations from more permutations of existing questions, or you think the character counts for the "duplicate question/answer" validation functions are too restrictive), feel free to change the related setting. The most intuitive and least-likely-to-break-anything settings to change are `REARRANGEMENTS_TO_TAKE` and `DOUBLE_CHECK_COUNTER`. Beyond that, you'll need to figure out what the function does before changing it if you expect it to run.

4. ***Change the model.*** This is as simple as switching the `LOGICAL_MODEL` parameter out for another one, but your mileage may vary significantly. You will also have to adjust RoPE scaling for non-Llama 2 models: e.g., if you're using Mixtral, don't leave `rope_freq_scale=0.33`, which triples the context (you do not need 96k context, only 12k).

5. ***Change the examples.*** If you change the examples you can completely overhaul what this notebook does, but this requires a lot of prompting skill and possibly huge amounts of time to get it working again (source: most of my last three months were spent prompting, and most of this prompting was spent on the examples). Unless you want to convert this notebook from question-and-answer generation to a completely different task, I'd recommend changing only the conversation generation prompts -- they're a bit less finicky, and if you just want to change the kind of characters generated (maybe you want a different writing style) that's where you'd find the differences.
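
For item 4, the arithmetic behind `rope_freq_scale` can be sketched as below. This assumes linear RoPE scaling, where the usable context grows by a factor of `1 / rope_freq_scale`. The helper name and the native context sizes (4096 for Llama 2, 32768 for Mixtral) are assumptions for illustration, not values read from this notebook.

```python
def rope_freq_scale_for(native_ctx: int, target_ctx: int) -> float:
    """Return a linear RoPE frequency scale that stretches native_ctx to target_ctx.

    Assumes linear scaling, i.e. effective context = native_ctx / rope_freq_scale.
    A value of 1.0 means the model already covers the target without scaling.
    """
    if native_ctx <= 0 or target_ctx <= 0:
        raise ValueError("context sizes must be positive")
    if target_ctx <= native_ctx:
        return 1.0
    return native_ctx / target_ctx


# Assumed native context sizes (hypothetical constants, not taken from this notebook).
LLAMA_2_CTX = 4096
MIXTRAL_CTX = 32768

print(rope_freq_scale_for(LLAMA_2_CTX, 12000))  # roughly a third, like the 0.33 used here
print(rope_freq_scale_for(MIXTRAL_CTX, 12000))  # no scaling needed
```

If the second call returns 1.0 for your model, the `rope_freq_scale` argument can most likely be dropped from the `Llama(...)` call altogether.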

## Quickstart:

- Get this notebook and the other repo code onto a machine with the power to run Airoboros-l2-70b-3.1.2.Q4_K_M
- Put the model into `./logical_model` (relative to this notebook), which is the directory the `LOGICAL_MODEL` path in the settings cell points to.
- Run all the cells below and watch as the notebook generates questions, answers, and conversations based on Principles of Chemistry, Simple Sabotage, and Introduction to Philosophy.

If you want to add your own texts, follow the instructions in list item #1 above.
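
Before starting a long run, it can help to check that your filenames follow the expected format. The sketch below is an assumption about that format, based on the `[textname], by authorname` rule and the default filenames. It is not the regex the notebook itself uses, and `parse_source_filename` is a hypothetical helper.

```python
import re
from typing import Optional

# Hypothetical pattern for "<textname>, by <author>[, published <year>].txt".
SOURCE_NAME_PATTERN = re.compile(
    r"^(?P<title>.+?), by (?P<author>.+?)(?:, published (?P<year>\d{4}))?\.txt$"
)


def parse_source_filename(filename: str) -> Optional[dict]:
    """Return the title, author, and year parsed from a filename, or None if it does not match."""
    match = SOURCE_NAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.groupdict()


for name in [
    "Simple Sabotage, by the Office of Strategic Services, published 1944.txt",
    "Introduction to Philosophy, by George Stuart Fullerton.txt",
    "my_notes.txt",
]:
    print(name, "->", parse_source_filename(name))
```

A filename that comes back as `None` is worth renaming before it reaches the prompts.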

### Note: by default, roughly 1/3 of the characters this notebook generates are NSFW. Follow point #2 above to change that if you want something cleaner, or use "Assistant mode", which makes every character your typical sanitized "As an AI language model" assistant.
Notice: Assistant mode is not fully implemented yet.

## Why is it writing so many files?

This notebook writes the final questions generated, the revisions of those questions, and the final multiturn conversations to files. But it also writes the output of every single prompt to a unique file, named with a uuid, in a folder named for the prompt it's a part of. Why all the writing? Taking inspiration from Jon Durbin's cinematika, this notebook records output information so that, in the future, a model could be finetuned specifically for running as the logical model behind the notebook. Writing each step down ensures that a dataset is made and outputs are not wasted. If such a model is ever built, a regex and other code will probably be used to determine which runs (identified by the same uuid across folders) ended successfully, and these will make up the dataset. DPO might also be done on steps that failed vs steps that succeeded.

The folders you want to look out for, by default, are named `qatuples_raw`, `qatuples_revised`, and `multi_turn_convs`.
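
To see which steps a given run reached, you can group the per-prompt output files by their uuid. The sketch below assumes every prompt output is stored as `<step folder>/<run id>.txt`. `group_outputs_by_run` is a hypothetical helper and not part of the notebook.

```python
from collections import defaultdict
from pathlib import Path


def group_outputs_by_run(output_dirs):
    """Map each run ID (the file stem) to the step folders that contain a file for it."""
    runs = defaultdict(list)
    for directory in output_dirs:
        folder = Path(directory)
        if not folder.is_dir():
            continue  # a step that never ran has no folder yet
        for path in sorted(folder.glob("*.txt")):
            runs[path.stem].append(folder.name)
    return dict(runs)


# Step folders written by the vetting loops (adjust to the steps you care about).
runs = group_outputs_by_run([
    "./check_question_generations",
    "./check_answer_relevancy_generations",
    "./check_answer_accuracy_generations",
])
print(f"{len(runs)} run IDs found")
```

A run ID that shows up in the question-check folder but not in the accuracy-check folder stopped somewhere in between, which is a reasonable starting point when hunting for failed runs.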

## Known limitations:

Multi-turn conversations sometimes have impersonation (i.e., one character will describe what another character does in their own message). This only happens occasionally in my testing, and is quite possibly easy to fix by creating a prompt that takes conversations with potential impersonation and rewrites them to have none. I merely have not done this yet.

Multi-turn conversations can have the primary character ask if the secondary character needs anything else in a repetitive way. So for instance, the primary character might end with "Do you need anything else?" twice or thrice in a row. I am unsure whether this is a quirk of the model or the notebook; either way, it should be easy enough to fix with a prompt (plus a regex that checks the end of statements, so that the prompt isn't called on things that are fine).
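
The regex check mentioned above could look something like the sketch below. It is not implemented in the notebook: the pattern, the function name, and the idea of passing in only the primary character's messages are all assumptions for illustration.

```python
import re

# Hypothetical pattern: the message ends with a question containing "anything else".
CLOSING_OFFER = re.compile(r"anything else[^.?!]*\?[\"')\s]*$", re.IGNORECASE)


def has_repeated_closing_offer(messages, min_repeats=2):
    """Return True if at least min_repeats consecutive messages end with a closing offer.

    `messages` should hold only the primary character's messages, in order.
    """
    streak = 0
    for message in messages:
        if CLOSING_OFFER.search(message.strip()):
            streak += 1
            if streak >= min_repeats:
                return True
        else:
            streak = 0
    return False


primary_messages = [
    "Water boils at 100 degrees Celsius. Do you need anything else?",
    "It is made of hydrogen and oxygen. Is there anything else you need?",
    "Goodbye, then.",
]
print(has_repeated_closing_offer(primary_messages))
```

Only conversations for which this returns `True` would then be sent to the rewriting prompt.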

Spelling mistakes -- I had to use RoPE to boost the ctx quite high, and I think this is causing the model to (VERY rarely) misspell things. This happens maybe one in a dozen outputs, maybe less. Models with higher ctx, e.g., Mixtral, probably won't suffer from this problem.

Numbers -- I've found the model missing zeroes occasionally when spelling out dates. Unsure if this is a llama issue, an Airoboros issue, or a notebook/prompting issue.

Sensitive to text differences -- I've tested this on a few texts, but depending on the book you are using and how it's written, what you get with the default Augmentool will vary significantly. This can be unpredictable: for instance, this notebook really struggles with H.G. Wells' "A Short History of the World" but is mostly fine with "Principles of Chemistry", despite both being quite old, factual texts. If you try this on a text you like and it doesn't work, here's the process to debug it: take a few of the outputs where it failed (the notebook saves prompt outputs at each step so you can find where it went stupid), and manually turn each one into a few-shot example for the step that went bad, with the mistake fixed by hand. This should make the notebook less inclined to commit the same error. Then run it again, and if there's a new problem, fix it the same way.

Other limitations -- I've listed the major ones but I'm sure there are a handful I'm forgetting.

I'm just one developer with finite time -- feel free to address any of these issues yourself! I'm happy to give prompting pointers if you ask on the repo or on Discord (I'm @Heralax over there). 



In [1]:
# NOTE NOTEBOOK SETTINGS AND CONSTANTS (some script file constants are in generation_functions/constants.py)

# Put your desired quant of your desired model in the relevant directories


# "./logical_model/airoboros-l2-70b-3.1.2.Q4_K_M.gguf"

LOGICAL_MODEL = "./logical_model/flatorcamaid-13b-v0.2.Q8_0.gguf" # model used for decision-making and base question generation (should be "smart")
RAM_IS_LIMITED = True # change this to False if you can fit both models on your GPU at once
ASSISTANT_MODE = False # change to True if you want all conversations to be with an "AI language model" and not characters. Useful for more professional use cases. Currently TODO.
DOUBLE_CHECK_COUNTER = 3 # Set to 1 to check outputs only once; set to 2 to check twice; set to 3 to check thrice, etc. Set to 0 to break everything in vet_question_loop() and elsewhere. Set to -1 and cause the universe to implode?

REARRANGEMENTS_TO_TAKE = 3 # How many of the possible permutations of tuples in a group to take and make multiturn convs out of. Adjust higher to get more data out of less text, but it might be a bit repetitive. NOTE your eval loss will be basically worthless if you aren't careful with how you shuffle your dataset when you're about to train.

source_texts = [
    "Principles of Chemistry, by Demitry Mendeleev, published 1897.txt",
    "Simple Sabotage, by the Office of Strategic Services, published 1944.txt",
    "Introduction to Philosophy, by George Stuart Fullerton.txt",
]

In [None]:
from transformers import AutoTokenizer
import re
from tqdm import tqdm
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

tokenizer = AutoTokenizer.from_pretrained("Gryphe/MythoMax-L2-13b") # It doesn't matter what model goes here as long as it is sentencepiece

def sentence_chunking_algorithm(file_path, tokenizer, max_token_length=400):
    """
    This function takes a plaintext file and chunks it into sentences.
    
    :param file_path: Path to the plaintext file
    :param tokenizer: SentencePiece tokenizer
    :param max_token_length: The maximum token length for a chunk of sentences
    :return: List of sentence chunks with source text information
    """
    sentence_chunks_with_source = []
    current_chunk = []
    token_count = 0
    source_name = file_path.replace(".txt","")

    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Remove the Project Gutenberg START/END marker lines (the text before and after them is left in place)
    content = re.sub(r'^.*?START OF (THIS|THE) PROJECT GUTENBERG EBOOK.*$\n', '', content, flags=re.MULTILINE)
    content = re.sub(r'^.*?END OF (THIS|THE) PROJECT GUTENBERG EBOOK.*$\n', '', content, flags=re.MULTILINE)

    sentences = sent_tokenize(content)

    for sentence in tqdm(sentences, desc=f"Processing {file_path}"):
        sentence_token_count = len(tokenizer.encode(sentence))

        if token_count + sentence_token_count <= max_token_length:
            current_chunk.append(sentence)
            token_count += sentence_token_count
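        else:
            if current_chunk: # don't emit an empty chunk when the very first sentence is already over the limit
                sentence_chunks_with_source.append((' '.join(current_chunk), source_name))
            current_chunk = [sentence]
            token_count = sentence_token_count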

    # Add the last chunk if it exists
    if current_chunk:
        sentence_chunks_with_source.append((' '.join(current_chunk), source_name))

    return sentence_chunks_with_source



sentence_chunks = []
for source_text in source_texts:
    sentence_chunks += sentence_chunking_algorithm(source_text, tokenizer)

def fix_text(to_replace_arr, text):
    for startup in to_replace_arr:
        text = text.replace(startup[0], startup[1])
    return text

conversions = [("\n"," "), ("  ", " ")]

paragraphs_processed = [(fix_text(conversions, seq[0]), seq[1]) for seq in sentence_chunks]

In [None]:
len(paragraphs_processed)

In [None]:
paragraphs_processed[5]

In [None]:
from llama_cpp import Llama

In [None]:
# rp_llm = Llama(model_path=RP_MODEL,n_ctx=4096,n_gpu_layers=100) # load the RP LLM and offload everything

In [None]:
logic_llm = Llama(model_path=LOGICAL_MODEL,n_gqa=8,offload_kqv=True,n_ctx=12000,rope_freq_scale=0.33,n_gpu_layers=100,verbose=False) # load the logical LLM and offload everything

In [7]:
# This is in no way best practices, but all my prompts being searchable and separate files is a good way to make my life easier.
import pkgutil
import importlib
import generation_functions  # This is the package directory
import sys

sys.path.append('./generation_functions')

# First, import all modules so they can be reloaded
for _, module_name, _ in pkgutil.iter_modules(generation_functions.__path__, generation_functions.__name__ + '.'):
    importlib.import_module(module_name)

# Now, reload each module and import all callable attributes
for _, module_name, _ in pkgutil.iter_modules(generation_functions.__path__, generation_functions.__name__ + '.'):
    # Reload the module
    module = importlib.reload(sys.modules[module_name])
    # Iterate through each attribute in the reloaded module
    for attribute_name in dir(module):
        # Retrieve the attribute
        attribute = getattr(module, attribute_name)
        if callable(attribute):
            # If it's callable, it's a function or class, so you set it in the globals dictionary
            globals()[attribute_name] = attribute


In [None]:
# Determine which paragraphs are worthy of making questions from
judged_worthy_for_questions = []
for idx, p in enumerate(tqdm(paragraphs_processed)): # for each paragraph, try determining if it is suitable for questions or not (judge_paragraph handles the retries, at most three per paragraph)
    judgement = judge_paragraph(p,logic_llm)
    judged_worthy_for_questions.append(judgement)
    try:
        if judgement[0] is not None:
            print(f"DEBUG model decided that index {idx} was suitable")
        else:
            print(f"DEBUG model decided that index {idx} was not suitable")
    except Exception: # judgement could not be indexed, e.g. because it is None
        print(f"DEBUG max retries exceeded for index {idx}")


In [None]:
# Graphing code generated by GPT-4. May be suboptimal/ugly.
import matplotlib.pyplot as plt
from collections import defaultdict

def filter_and_graph(tuples):
    # Count the occurrences of None and non-None for each source text
    source_counts = defaultdict(lambda: [0, 0])  # source -> [none_count, non_none_count]
    for paragraph, source in tuples:
        if paragraph is None:
            source_counts[source][0] += 1
        else:
            source_counts[source][1] += 1

    # Prepare data for the graph
    labels = list(source_counts.keys())
    none_counts = [source_counts[source][0] for source in labels]
    non_none_counts = [source_counts[source][1] for source in labels]

    # Plotting the graph
    x = range(len(labels))
    plt.bar(x, none_counts, width=0.4, label='Not suitable', align='center')
    plt.bar(x, non_none_counts, width=0.4, label='Valid Paragraphs', align='edge')
    plt.xlabel('Source Text')
    plt.ylabel('Number of Paragraphs')
    plt.title('Paragraphs Suitable for Questions by Source Text')
    plt.xticks(x, labels, rotation='vertical')
    plt.legend()
    plt.tight_layout()
    plt.show()

    # Filter out tuples with None and return the new list
    filtered_list = [t for t in tuples if t[0] is not None]
    return filtered_list


In [None]:
filtered_worthy_for_questions = filter_and_graph(judged_worthy_for_questions)

In [None]:

# Answer generation code, begin.
# structure: define a series of helpers, then define the control flow, exception handling, retries etc. in a for loop that iterates over the processed sequences of paragraphs at the end in another cell

# Since some paragraphs can be much shorter 
# First off, question generation.

# Each local LLM function essentially has 3 phases: prompt, regex to extract response, and reaction to that response.
# However I'm not going to build an abstraction for that because I need fine control.

# If any function fails to make things, it won't throw, it'll just return None.

# Strengths of open source AI: hella cheap, very customizable, you can call it as much as you want
# Downside: you need very good regexes to catch its outputs


In [None]:
print(filtered_worthy_for_questions[0])

In [None]:
import os

def write_output_to_file(output, directory, uuid):
    # Ensure directory exists
    os.makedirs(directory, exist_ok=True)
    
    # Define the file path using the directory and UUID
    file_path = os.path.join(directory, f"{uuid}.txt")
    
    # Write the output to the file (UTF-8, so non-ASCII model output does not depend on the platform default)
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(output)
    
    print(f"Output written to {file_path}")

In [None]:
# Control flow helpers

# DOUBLE_CHECK_COUNTER (set in the settings cell) controls how many times each check may run; a check passes once at least half of those runs (rounded up) have passed

import logging
from math import ceil
import uuid

def make_id():
    return str(uuid.uuid4())


# Setup logging
# Except I actually don't use this because I switched to print() because jupyter (only to find out at the end of this )
logging.basicConfig(filename='data_generation.log', 
                    filemode='a', 
                    format='%(asctime)s - %(levelname)s - %(message)s',
                    level=logging.INFO)

def tokenize_and_check_length(text):
    tokens = tokenizer.encode(text)
    return len(tokens) > 500, tokens

def vet_answer_accuracy_loop(qa_tuple,total_retries,run_id):
    try:
        qtuple = qa_tuple
        print(f"\n\nStarting ACCURACY loop for question: {qtuple[0]}, context: {qtuple[2]}")
        passed_checks = 0
        times_checked = 0
        dissenting_reasoning = ""
        while times_checked < DOUBLE_CHECK_COUNTER:
            print(f"\n\nACCURACY CALL CHECK ANSWER: {qtuple[0]}, context: {qtuple[2]}, retries: {total_retries}, dissenting reasoning: {dissenting_reasoning}")
            judgement, answer_accuracy_output = check_answer(qtuple, logic_llm)
            write_output_to_file(answer_accuracy_output, "./check_answer_accuracy_generations", run_id)
            if not judgement[0]:  # if not accurate
                dissenting_reasoning = judgement[1]
            else:
                passed_checks += 1
            times_checked+=1
            if passed_checks >= ceil(DOUBLE_CHECK_COUNTER/2):
                break
            failed_checks = times_checked - passed_checks
            if failed_checks >= ceil(DOUBLE_CHECK_COUNTER/2):
                break
        
        if passed_checks >= ceil(DOUBLE_CHECK_COUNTER/2):  # if question checks passed
            print(f"\n\nANSWER ACCURACY CHECKS PASSED retries: {total_retries}")
            return qtuple
        else:
            # Generate new question and restart the loop
            print(f"\n\nACCURACY CHECKS FAILED - SENDING BACK TO QUESTION LOOP retries: {total_retries}")
            total_retries += 1
            qtuple, generate_new_q_output = generate_new_question(qtuple, logic_llm)
            write_output_to_file(generate_new_q_output, "./regenerate_question_generations", run_id)
            return vet_question_loop(qtuple, question_group_id=run_id.split("--subquestion--")[0], total_retries=total_retries) # going to get one hell of a call stack by the end of this, but it should be fine
    except Exception as e:
        print("!!ERROR!!")
        print(e)
        pass
    
    return (None, None, None, qtuple[3])

def vet_answer_relevance_loop(qa_tuple,total_retries,run_id):
    try:
        qtuple = qa_tuple
        print(f"\n\nStarting RELEVANCE loop for question: {qtuple[0]}, context: {qtuple[2]}")
        passed_checks = 0
        times_checked = 0
        dissenting_reasoning = ""
        while times_checked < DOUBLE_CHECK_COUNTER:
            print(f"\n\nRELEVANCE CALL CHECK ANSWER: {qtuple[0]}, context: {qtuple[2]}, retries: {total_retries}, dissenting reasoning: {dissenting_reasoning}")
            judgement, answer_relevancy_output = check_answer_relevancy_with_text(qtuple, logic_llm)
            write_output_to_file(answer_relevancy_output, "./check_answer_relevancy_generations", run_id)
            if not judgement[0]:  # if not relevant
                dissenting_reasoning = judgement[1]
            else:
                passed_checks += 1
            times_checked += 1
            if passed_checks >= ceil(DOUBLE_CHECK_COUNTER/2):
                break
            failed_checks = times_checked - passed_checks
            if failed_checks >= ceil(DOUBLE_CHECK_COUNTER/2):
                break
        
        if passed_checks >= ceil(DOUBLE_CHECK_COUNTER/2):
            print(f"\n\nRELEVANCE CHECKS PASSED")
            return vet_answer_accuracy_loop(qtuple,total_retries,run_id)
        else:
            print(f"\n\nRELEVANCE CHECKS FAILED - SENDING BACK TO QUESTION LOOP")
            total_retries += 1
            qtuple, generate_new_q_output = generate_new_question(qtuple, logic_llm)
            write_output_to_file(generate_new_q_output, "./regenerate_question_generations", run_id)
            return vet_question_loop(qtuple, question_group_id=run_id.split("--subquestion--")[0], total_retries=total_retries)
    except Exception as e:
        print("!!ERROR!!")
        print(e)
        pass
    
    return (None, None, None, qtuple[3])

def vet_question_loop(qa_tuple, question_group_id=None, total_retries=0):
    try:
        qtuple = qa_tuple
        print(f"\n\nStarting QUESTION loop for question: {qtuple[0]}, context: {qtuple[2]}")
        while total_retries <= 4:
            run_id = question_group_id + "--subquestion--" + make_id()
            passed_checks = 0
            times_checked = 0
            dissenting_reasoning = ""
            while times_checked < DOUBLE_CHECK_COUNTER:
                print(f"\n\nQUESTION CALL CHECK ANSWER: {qtuple[0]}, context: {qtuple[2]}, retries: {total_retries}, dissenting reasoning: {dissenting_reasoning}")
                judgement, check_q_output = check_question(qtuple, logic_llm)
                write_output_to_file(check_q_output, "./check_question_generations", run_id)
                if not judgement[0]:  # if not relevant
                    dissenting_reasoning = judgement[1]
                else:
                    passed_checks += 1
                times_checked += 1
                if passed_checks >= ceil(DOUBLE_CHECK_COUNTER/2):
                    break
                failed_checks = times_checked - passed_checks
                if failed_checks >= ceil(DOUBLE_CHECK_COUNTER/2):
                    break
            
            if passed_checks >= ceil(DOUBLE_CHECK_COUNTER/2):  # if all question checks passed
                print(f"\n\nQUESTION CHECKS PASSED retries: {total_retries}")
                return vet_answer_relevance_loop(qtuple,total_retries,run_id)
            else:
                # Generate new question and restart the loop
                print(f"\n\nQUESTION CHECKS FAILED - GENERATING NEW QUESTION retries: {total_retries}")
                total_retries += 1
                if (total_retries <= 4): # don't regen question if we're already at max regens
                    qtuple, generate_new_q_output = generate_new_question(qtuple, logic_llm) 
                    write_output_to_file(generate_new_q_output, "./regenerate_question_generations", run_id)
                    print("New question: ", qtuple)
                # no calling of vet_question_loop, since we're already in a while loop
    except Exception as e:
        print("!!ERROR!!")
        print(e)
    
    return (None, None, None, qtuple[3])


In [None]:
# control flow
import json
import os
from tqdm import tqdm
import glob

# Directory for QA tuples
qa_tuples_dir = './qatuples_raw'
if not os.path.exists(qa_tuples_dir):
    os.makedirs(qa_tuples_dir)


# Generate questions CoT-style
vetted_qa_tuples = [] # tuple list of qa tuples that have been judged good
for idx, para in enumerate(tqdm(filtered_worthy_for_questions)):
    try:
        existing_files = glob.glob(os.path.join(qa_tuples_dir, f'para_{idx}_*.json')) # check if qs already exist
    
        if len(existing_files) > 0:  # If files exist, skip this paragraph entirely
            print(f"Skipping para_{idx} as files already exist.")
            continue

        question_group_id = make_id()
        print(f"\n\n\nOUTER LOOP CALL GENERATE QPLAN para: {para}, \n\n idx: {idx}")
        plan, questions_plan_output = generate_questions_plan(para,logic_llm)
        write_output_to_file(questions_plan_output, "./question_plan_generations", question_group_id)
        print(f"\n\n\nOUTER LOOP CALL GENERATE Q: {para}, \n\n idx: {idx} \n\n plan: {plan}")
        question_answer_tuples, question_generation_output = generate_questions(para,plan,logic_llm)
        write_output_to_file(question_generation_output, "./question_generation_generations", question_group_id)
        for qnum, question_answer_tuple in enumerate(question_answer_tuples):
            print(f"\n\n=======!!=BEGIN VETTING QA TUPLE {idx}_{qnum}=!!=======\n\n")
            good_qa_tuple = vet_question_loop(question_answer_tuple,question_group_id=question_group_id)
            
            # Write resulting question file if the tuple is not None
            if good_qa_tuple[0] is not None:
                file_path = os.path.join(qa_tuples_dir, f'para_{idx}_q_{qnum}.json')
                with open(file_path, 'w') as file:
                    json.dump(good_qa_tuple, file, indent=4)
            
            vetted_qa_tuples.append(good_qa_tuple) # We must filter out all None values at the end; but appending Nones lets us know where things went wrong, and how often.
    except Exception as e:
        print(f"Q ERROR: {e}")
        
print("-------------- QUESTIONS CREATED ------------- STATS SO FAR:")
nones= list(filter(lambda x: x[0] is None, vetted_qa_tuples))
print(f"Nones: {len(nones)}")
print(f"Non-nones: {len(vetted_qa_tuples) - len(nones)}")
print(f"Total: {len(vetted_qa_tuples)}")
# filter out all None values
vetted_qa_tuples = [qa for qa in vetted_qa_tuples if qa[0] is not None]
print("---------------- ONTO EXAMPLES GENERATION-------------------")


In [None]:
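# Graphing code again generated by GPT-4. I believe that this one is bugged: the previous cell already filtered the None tuples out of vetted_qa_tuples, so the 'None' bars here will always be zero.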
import matplotlib.pyplot as plt

# Counting nones vs non-nones for questions per source text
qa_counts = {}
for qa in vetted_qa_tuples:
    source_text = qa[3] if qa[3] is not None else 'Unknown'
    if source_text not in qa_counts:
        qa_counts[source_text] = {'None': 0, 'Valid': 0}
    
    if qa[0] is None:
        qa_counts[source_text]['None'] += 1
    else:
        qa_counts[source_text]['Valid'] += 1

# Plotting
fig, ax = plt.subplots()

# Data for plotting
sources = list(qa_counts.keys())
none_counts = [qa_counts[src]['None'] for src in sources]
valid_counts = [qa_counts[src]['Valid'] for src in sources]

# Creating the bar plot
ax.bar(sources, valid_counts, label='Valid QA Tuples')
ax.bar(sources, none_counts, bottom=valid_counts, label='None QA Tuples')

# Adding labels and title
ax.set_xlabel('Source Text')
ax.set_ylabel('Count')
ax.set_title('Number of None vs Valid QA Tuples per Source Text')
ax.legend()

# Rotate labels for better readability
plt.xticks(rotation=45)

plt.show()


In [None]:
# Check for and fix the common mistake: mentioning "the text".
old_tuples = vetted_qa_tuples.copy()
for idx, tup in enumerate(vetted_qa_tuples):
    try:
        revision_id = make_id()
        revision, revision_output = check_qatuple_context(tup,logic_llm)
        write_output_to_file(revision_output, "./question_context_revision_generations", revision_id) # incidentally, identifying the problem and fixing it in the same step (without another planning step) works a lot better than identifying it and then trying to fix it in the next step.
        if (isinstance(revision[0],str)): # if the thing was reworded
            vetted_qa_tuples[idx] = revision
        elif (not revision[0]):
            vetted_qa_tuples[idx] = None # prepare item for deletion later; right now we just store it as None because indexes
        # else, if it passed, we just leave it be.
    except Exception as e:
        print ("!!! ERROR!", e)

In [4]:
# If vetted_qa_tuples is not defined (user of notebook skips previous steps because they generated questions in an earlier run) then this cell provides the functionality to read qatuples back in from files.

import json
import os
from pathlib import Path
import re

def read_json_files(directory):
    # List to hold all tuples
    all_tuples = []
    
    # Get all the json files from the given directory
    json_files = [f for f in Path(directory).iterdir() if f.suffix == '.json']
    
    # Function to extract numbers from filename for sorting
    def extract_numbers(filename):
        numbers = re.findall(r'\d+', filename.stem)
        return [int(num) for num in numbers] if numbers else [0]

    # Sort files based on numbers extracted from filenames
    sorted_files = sorted(json_files, key=extract_numbers)

    # Read each file and extract tuples
    for file in sorted_files:
        with open(file, 'r', encoding='utf-8') as f:
            data = json.load(f)
            
            # Extract numbers from filename
            numbers = extract_numbers(file)
            
            # Use the first number as para_num, default to 0 if not present
            para_num = numbers[0] if numbers else 0
            
            # Ensure we have the list to hold tuples for this para_num
            if para_num >= len(all_tuples):
                all_tuples.extend([[] for _ in range(para_num - len(all_tuples) + 1)])
            
            # Extract the tuple and add to the corresponding sublist
            q_a_tuple = (data[0], data[1], data[2], data[3])
            all_tuples[para_num].append(q_a_tuple)
            
    return all_tuples

# The directory path relative to the notebook
directory_path = './qatuples_revised'

# Read vetted QA tuples from files if they are not already defined
try:
    vetted_qa_tuples
except NameError:
    # Fall back to reading the JSON files written by an earlier run
    vetted_qa_tuples = read_json_files(directory_path)
    # Flatten the per-file sublists into a single list of tuples
    vetted_qa_tuples = [item for sublist in vetted_qa_tuples if sublist for item in sublist]
    print(vetted_qa_tuples[:3])  # Show the first 3 tuples for brevity

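The cell above rebuilds each record with `(data[0], data[1], data[2], data[3])` because JSON has no tuple type: a tuple written with `json.dump` comes back as a list. The following is a minimal sketch of that round trip, not part of the pipeline. The sample record and the temporary file are made up for illustration; only the `(question, answer, text, textname)` layout is taken from the notebook.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample record in the (question, answer, text, textname) layout.
sample = ("What is 2 + 2?", "4", "Two plus two equals four.", "Arithmetic Primer, by Nobody")

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "conv_0.json"
    # Use an explicit encoding on both sides so the round trip does not
    # depend on the platform's default encoding.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=4)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

# JSON arrays are loaded as Python lists, so convert back explicitly.
restored = tuple(loaded)
print(type(loaded).__name__, restored == sample)
```

If later code unpacks records positionally, as `group_by_text` does, either a list or a tuple works, but converting back keeps the in-memory data consistent with a fresh run.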

In [None]:
# Print stats related to revised qatuples, and filter out nones (questions that were unanswerable due to lack of context).
import json
import os
print("-------------- QUESTIONS REVISED ------------- STATS SO FAR:")
nones= list(filter(lambda x: x is None, vetted_qa_tuples))
print(f"Nones: {len(nones)}")
print(f"Non-nones: {len(vetted_qa_tuples) - len(nones)}")
print(f"Total: {len(vetted_qa_tuples)}")
# filter out all None values
vetted_qa_tuples = [qa for qa in vetted_qa_tuples if qa is not None]
writepath = "./qatuples_revised"
os.makedirs(writepath, exist_ok=True)
for idx, qatup in enumerate(vetted_qa_tuples):
    file_path = os.path.join(writepath, f'conv_{idx}.json')
    # Write as UTF-8 to match the encoding used when these files are read back in
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(qatup,file,indent=4)
print("---------------- ONTO EXAMPLES GENERATION-------------------")

In [5]:
# Group tuples for multiturn example generation (by chunk of source text) and then run that helper (so that we can make multiturn conversations from questions based on the same paragraphs)
def group_by_text(tuples_list):
    # Dictionary to hold the groups with text as the key
    groups = {}
    
    # Iterate over each tuple in the list
    for question, answer, text, textname in tuples_list:
        # If the text is not yet a key in the dictionary, add it with an empty list
        if text not in groups:
            groups[text] = []
        
        # Append the current tuple to the appropriate list
        groups[text].append((question, answer, text,textname))
        
    # Return the values of the dictionary, which are the lists of tuples grouped by text; also remove duplicates
    return [identify_duplicates(group) for group in list(groups.values())]

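To see what `group_by_text` produces, here is a small self-contained sketch of the same grouping idea on toy data. It is an approximation under stated assumptions: `group_by_text_sketch` and the toy records are hypothetical, and the sketch leaves out the `identify_duplicates` step, which is defined elsewhere in the project and not shown in this notebook.

```python
from collections import defaultdict

def group_by_text_sketch(tuples_list):
    """Group (question, answer, text, textname) tuples by their source text.

    Hypothetical stand-in for group_by_text; it skips deduplication.
    """
    groups = defaultdict(list)
    for question, answer, text, textname in tuples_list:
        groups[text].append((question, answer, text, textname))
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return list(groups.values())

# Toy records: two questions about paragraph A, one about paragraph B.
toy = [
    ("Q1", "A1", "paragraph A", "Some Book, by Some Author"),
    ("Q2", "A2", "paragraph B", "Some Book, by Some Author"),
    ("Q3", "A3", "paragraph A", "Some Book, by Some Author"),
]

grouped = group_by_text_sketch(toy)
for group in grouped:
    print([q for q, _, _, _ in group])
```

Each inner list holds the questions for one paragraph, which is the unit used for one multiturn conversation.
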
In [8]:
qa_tuples_by_paragraph = group_by_text(vetted_qa_tuples)

In [None]:
from math import ceil

# multiturn helpers
# These will probably be used for multiturn rapid-fire answering.

# Idea: use multiple short answers to train the task of answering multiple questions in one response. Two-three short answers per response should be enough.
def make_multiturn_character(qa_tuples,conv_id):
    plan, instructions, card_plan_output = create_character_card_plan_many_tuples(qa_tuples,logic_llm) # I will reuse the many tuples function for short question-answers, there's a lot of prompting in here already
    write_output_to_file(card_plan_output, "./multiturn_card_plan_generations", conv_id)
    char, char_output = create_character_card_many_tuples(qa_tuples,plan,instructions,logic_llm) # creates a character card
    write_output_to_file(char_output, "./multiturn_card_generations", conv_id)    
    return char, instructions

def make_multiturn_scenario(qa_tuples,character,conv_id):
    plan, scenario_plan_output = create_scenario_plan_many_tuples(qa_tuples,character,logic_llm)
    write_output_to_file(scenario_plan_output, "./multiturn_scenario_plan_generations", conv_id)
    scenario, scenario_output = create_scenario_many_tuples(qa_tuples,character,plan,logic_llm) # creates a scenario based on a character card and question/answer tuple
    write_output_to_file(scenario_output, "./multiturn_scenario_generations", conv_id)
    return scenario, plan

def make_multiturn_conversation(qa_tuples,logic_llm):
    conv_id = make_id()
    # thought_plan = create_thought_plan_many_tuples(qa_tuples,character,scenario,logic_llm) # There IS a way to make multiturn chain of thought answering work: generate each pair of messages using a separate prompt or a separate function, each of which has only the thought plan for that question/answer pair. But simply cramming in all the step-by-step things will confuse the hell out of the poor model. So for the pre-alpha, we're skipping it and just giving the response, with no reasoning, in the multiturn convs.
    character, instructions = make_multiturn_character(qa_tuples,conv_id)
    scenario, scenario_plan = make_multiturn_scenario(qa_tuples, character,conv_id)
    conv, conv_output = multi_turn_conversation(qa_tuples, character, scenario, scenario_plan, logic_llm)
    write_output_to_file(conv_output, "./multiturn_conversation_generations", conv_id)
    
    return conv 



def ensure_multiple_answers_are_same(qatuples, conv, logic_llm): # Why is this a whole separate function? Once upon a time, LLMs were used in validation here, too. But programmatic validation SEEMS to catch the common problems. This is here so that I can add it back in if I have to.
    """Loop to ensure that the answers in the conversation are consistent with the answers in the tuples."""
    retries = 0
    c = conv
    while retries < 2: # try twice, since multiturn is an expensive operation
        
        if call_all_processors(c[0],qatuples):  # if programmatic validation passes
            return c

        retries += 1
        if retries >= 2:
            return None
        # If we're here, programmatic validation failed, so generate a new conversation
        print("----------------\n\n\n\nRETRYING!!!!\n\n\n\n----------------")
        retry = make_multiturn_conversation(qatuples, logic_llm)
        if retry is not None:  # Note: retry CANNOT actually be None
            c = retry
        else:
            # If we failed to generate a retry, don't waste compute
            return None

    return None

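The validate-then-retry pattern in `ensure_multiple_answers_are_same` can be sketched in isolation as follows. This is a simplified illustration, not the notebook's function: `generate_with_retries` is a hypothetical name, the generator and validator are stubs, and unlike the cell above it generates the first candidate itself instead of receiving one.

```python
def generate_with_retries(generate, validate, max_attempts=2):
    """Return the first candidate that passes validate, or None.

    generate: zero-argument callable producing a candidate.
    validate: callable returning True if the candidate is acceptable.
    """
    for _ in range(max_attempts):
        candidate = generate()
        if validate(candidate):
            return candidate
    # Every attempt failed validation; give up rather than spend more compute.
    return None

# Stub generator: the first output fails validation, the second passes.
outputs = iter(["bad conversation", "good conversation"])
result = generate_with_retries(
    generate=lambda: next(outputs),
    validate=lambda conv: conv.startswith("good"),
)
print(result)

# A generator that never produces a valid candidate yields None.
always_bad = generate_with_retries(lambda: "bad", lambda conv: False)
print(always_bad)
```

Capping the attempts matters here because each generation involves several LLM calls, so an unbounded loop could be expensive.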

In [None]:

import os
import json
import random
import itertools



multi_turn_convs_dir = './multi_turn_convs'
os.makedirs(multi_turn_convs_dir, exist_ok=True)

multi_turn_convs = []
for idx, group in enumerate(qa_tuples_by_paragraph):
    all_permutations = list(itertools.permutations(group))
    
    sample_size = min(REARRANGEMENTS_TO_TAKE, len(all_permutations))
    sampled_permutations = random.sample(all_permutations, sample_size)
    
    group_convs = []
    
    for perm_idx, perm in enumerate(sampled_permutations):  # perm_idx avoids shadowing the built-in iter()
        file_path = os.path.join(multi_turn_convs_dir, f'conv_{idx}_{perm_idx}.json')
        
        # Skip if file already exists
        if not os.path.exists(file_path):
            conv = make_multiturn_conversation(perm, logic_llm)
            final_conv = ensure_multiple_answers_are_same(perm, conv, logic_llm)
            
            if final_conv is not None:
                with open(file_path, 'w', encoding='utf-8') as file:
                    json.dump(final_conv, file, indent=4)
            
            group_convs.append(final_conv)
        else:
            print(f"Skipped generating {file_path} as it already exists")
    
    multi_turn_convs.append(group_convs)

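The cell above builds `list(itertools.permutations(group))` before sampling, which grows factorially with the group size (10 questions already give 3,628,800 permutations). If groups get large, one possible alternative is to draw random orderings directly and keep only distinct ones. The sketch below assumes that approach; `sample_permutations` is a hypothetical helper, not something the notebook defines.

```python
import math
import random

def sample_permutations(items, k, rng=random):
    """Return up to k distinct random orderings of items.

    Permutes indices rather than the items themselves, so items do not
    need to be hashable. Never materializes all n! permutations.
    """
    items = list(items)
    n = len(items)
    # There are only n! distinct index orderings, so cap the target there.
    target = min(k, math.factorial(n))
    seen = set()
    while len(seen) < target:
        # rng.sample(range(n), n) is a uniformly random permutation of indices.
        seen.add(tuple(rng.sample(range(n), n)))
    return [tuple(items[i] for i in perm) for perm in seen]

# Small group: asking for more than 3! = 6 orderings returns all 6.
small = sample_permutations(["q1", "q2", "q3"], 10)
print(len(small))

# Large group: 20! orderings exist, but only 3 are ever generated.
large = sample_permutations(range(20), 3)
print(len(large))
```

This uses rejection sampling, so it is efficient when `k` is much smaller than the number of possible orderings, as with a small `REARRANGEMENTS_TO_TAKE`. When `k` is close to `n!`, it redraws many duplicates, and materializing the full list is the simpler choice.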

In [None]:
# Since everything was written to files already, technically we're done.
# Convert to Axolotl format: TODO