# Convert and save a subset of the Guanaco instruction fine-tuning data

I downloaded the Guanaco fine-tuning dataset from https://huggingface.co/datasets/timdettmers/openassistant-guanaco, then saved a subset of the training data to demonstrate how you can run fine-tuning without dedicating the time and resources needed for the full training set. In the end I used 9000 random entries, which is nearly the entire 9846-entry training set.

In [1]:
from datasets import load_dataset
import json

In [2]:
guanaco = load_dataset('timdettmers/openassistant-guanaco')

Repo card metadata block was not found. Setting CardData to empty.


In [3]:
training_dataset = guanaco['train']
training_df = training_dataset.to_pandas()

In [4]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9846 entries, 0 to 9845
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    9846 non-null   object
dtypes: object(1)
memory usage: 77.1+ KB
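
Since every row is a single `text` string, its character length is what drives memory use later on. A minimal sketch of how one might look at those lengths before picking a cutoff; the helper name `text_length_summary` and the tiny `example_df` are made up for illustration and are not part of the notebook:

```python
import pandas as pd

def text_length_summary(df, column="text"):
    """Return summary statistics of the character lengths in a text column."""
    lengths = df[column].str.len()
    return lengths.describe()

# Hypothetical stand-in data; in the notebook you would pass training_df instead.
example_df = pd.DataFrame({
    "text": [
        "### Human: Hi### Assistant: Hello",
        "### Human: A",
        "x" * 100,
    ]
})

summary = text_length_summary(example_df)
print(summary[["min", "max"]])
```

Running the same helper on `training_df` would show how many entries sit near any length limit you are considering.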


In [5]:
training_df.sample(10)

      text
8699  ### Human: Come up with an acronym for "Long R...
7695  ### Human: Crie conteúdo para Instagram focado...
9504  ### Human: Do millipedes really have 1000 legs...
337   ### Human: Puedo hacerme a nado el estrello de...
9729  ### Human: ¿Tienes la capacidad para jugar adi...
3592  ### Human: Give me the latest updates of the w...
3810  ### Human: What forms of intelligence exist, a...
3884  ### Human: Мне 22 года. У меня есть цель к 30 ...
3974  ### Human: Please write me a haiku about machi...
5865  ### Human: Quisiera que me ayudaras con una li...


In [6]:
# Take a subset of the overall training set
sample_subset = training_df.sample(9000, random_state=42)
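
Fixing `random_state` makes the draw repeatable, so rerunning the notebook selects the same rows. A small sketch on made-up data shows the effect; the `toy_df` frame is an assumption for illustration only:

```python
import pandas as pd

# Hypothetical stand-in for the training DataFrame.
toy_df = pd.DataFrame({"text": [f"entry {i}" for i in range(100)]})

first_draw = toy_df.sample(10, random_state=42)
second_draw = toy_df.sample(10, random_state=42)

# The same seed yields the same rows in the same order.
print(first_draw.index.equals(second_draw.index))

# Sampling is without replacement by default, so no row is drawn twice.
print(first_draw.index.is_unique)
```

Dropping `random_state` would give a different subset on each run, which makes training results harder to compare.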

In [7]:
# Cap each entry's total length (in characters) and its number of conversation segments.
# Otherwise we'll get out-of-memory errors during training.
size_limit = 15000
segment_limit = 20
reserved_strings = ["[INST]", "[/INST]", "<s>", "</s>"]

# Convert to format expected by Mistral "<s>[INST] User Instruction [/INST] Response </s>"
def convert_to_mistral(human, assistant):
    return f"<s>[INST] {human} [/INST] {assistant} </s>"

def convert_entry(entry):
    converted_segments = []

    # Reject any entries that are too long
    if len(entry) > size_limit:
        return None

    # Reject any entries that contain existing Mistral instruction format markers
    if any(reserved in entry for reserved in reserved_strings):
        return None

    # Reject entries with conversation chains that are too long
    segments = entry.split('### Human:')
    if len(segments) > segment_limit:
        return None

    # For each human -> assistant exchange, append to a list in Mistral format
    for segment in segments:
        convo_split = segment.split('### Assistant:')
        human = convo_split[0].strip()
        # Skip the empty piece that precedes the first '### Human:' marker
        if not human:
            continue
        # An exchange may end without an assistant reply
        assistant = convo_split[1].strip() if len(convo_split) > 1 else ""
        converted_segments.append(convert_to_mistral(human, assistant))

    # Return a map (JSON object) with the text for training
    return {'text': ''.join(converted_segments)}

training_entries = []
for index, entry in sample_subset.iterrows():
    converted_entry = convert_entry(entry['text'])
    if converted_entry:
        training_entries.append(converted_entry)

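# Write all of the data to a JSONL file, one JSON object per line
import os
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
with open("data/fine_tune.jsonl", 'w', encoding="utf-8") as f:
    print(f"Saving {len(training_entries)} entries")
    for training_entry in training_entries:
        f.write(json.dumps(training_entry) + "\n")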

Saving 9000 entries
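
Before starting a training run, it can be worth reading the file back to confirm that each line parses and carries the expected markers. The sketch below is one possible check, not part of the original notebook: the helper name `check_jsonl` is hypothetical, and the demo writes a single made-up record to a temporary file instead of touching `data/fine_tune.jsonl`.

```python
import json
import os
import tempfile

def check_jsonl(path):
    """Count records whose 'text' has matching [INST] / [/INST] markers."""
    valid = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)  # raises if the line is not valid JSON
            text = record["text"]      # raises KeyError if the key is missing
            opens = text.count("[INST]")
            closes = text.count("[/INST]")
            if opens == 0 or opens != closes:
                raise ValueError(
                    f"line {line_number}: unbalanced instruction markers"
                )
            valid += 1
    return valid

# Made-up record in the same shape the notebook writes.
sample = {"text": "<s>[INST] Hello [/INST] Hi there </s>"}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
    checked = check_jsonl(demo_path)

print(f"Checked {checked} entries")
```

To use it on the real output, call `check_jsonl("data/fine_tune.jsonl")` and compare the returned count with the number reported when saving.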
