# Convert and save a subset of the Guanaco instruction fine-tuning data

I downloaded the Guanaco fine-tuning dataset from https://huggingface.co/datasets/timdettmers/openassistant-guanaco. I then save the subset of training entries whose token count is at or under a size limit. I found that limit by binary search, so that training on the saved entries does not cause an out-of-memory error.
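
The binary search itself is not part of this notebook, so here is a minimal sketch of how such a search could look. The function `largest_passing_limit` and the stand-in `fake_training_run` are hypothetical names introduced for illustration. In practice the predicate would launch a short training run at the given token limit and report whether it finished without an out-of-memory error. The sketch assumes the predicate is monotonic: once a limit fails, every larger limit fails too. The threshold of 800 in the stand-in simply mirrors the limit used later in this notebook.

```python
def largest_passing_limit(low, high, fits):
    """Return the largest value in [low, high] for which fits(value) is True.

    Assumes fits is monotonic: once a limit fails, every larger limit fails too.
    Returns None if even `low` fails.
    """
    best = None
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid        # mid works, so try something larger
            low = mid + 1
        else:
            high = mid - 1    # mid fails, so try something smaller
    return best


# Hypothetical stand-in for "train briefly with this token limit and
# report whether it completed without an out-of-memory error".
def fake_training_run(limit):
    return limit <= 800


found_limit = largest_passing_limit(64, 4096, fake_training_run)
print(found_limit)  # prints 800
```

Each probe costs one training attempt, so a search over this range needs only about a dozen runs.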

In [1]:
from datasets import load_dataset
import json
from transformers import AutoTokenizer

In [2]:
guanaco = load_dataset('timdettmers/openassistant-guanaco')

Repo card metadata block was not found. Setting CardData to empty.


In [3]:
training_dataset = guanaco['train']
training_df = training_dataset.to_pandas()

In [4]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9846 entries, 0 to 9845
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    9846 non-null   object
dtypes: object(1)
memory usage: 77.1+ KB


In [5]:
training_df.sample(10)

Unnamed: 0,text
2310,"### Human: hola### Assistant: Hola, ¿Qué tal e..."
3130,### Human: ¿Cómo puedo hablarle a la chica que...
9660,### Human: Write a python code that lists all ...
7359,### Human: Crea una pregunta de examen de alge...
7332,### Human: What date will the united states ec...
1620,### Human: Как защитить линукс систему?### Ass...
4952,### Human: ¿Cuáles son los principales problem...
3431,### Human: 摩托没有后 abs 安全么？### Assistant: ABS （a...
6116,### Human: Nombrame todas las capitales del mu...
624,### Human: ¿Qué películas de terror que este e...
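
As the sample shows, each row packs a whole conversation into one string, with `### Human:` and `### Assistant:` markers separating the turns. Below is a minimal sketch of splitting such a string into turns. The helper `split_turns` and the example string are made up for illustration. It uses `str.partition`, so if a segment contained more than one `### Assistant:` marker it would keep everything after the first one, which differs slightly from the conversion code later in this notebook.

```python
def split_turns(text):
    """Split a Guanaco-style string into (human, assistant) pairs."""
    turns = []
    for segment in text.split("### Human:"):
        human, _, assistant = segment.partition("### Assistant:")
        human = human.strip()
        if not human:
            continue  # skip the empty piece before the first marker
        turns.append((human, assistant.strip()))
    return turns


# Invented example: two human turns, the second one without a reply.
example = "### Human: hola### Assistant: Hola, ¿qué tal?### Human: bien"
print(split_turns(example))
```

A human turn with no reply yields an empty assistant string, so every pair has both parts.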


In [6]:
# Use the whole training set. You may wish to use a smaller sample to experiment.
sample_subset = training_df

In [7]:
# Load the tokenizer
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

In [8]:
# Limit the size of the inputs to a specified amount of tokens. 
# Otherwise we'll get out of memory errors during training.
token_size_limit = 800

# Prevent reserved strings used by the model from appearing in the training data itself.
reserved_strings = ["[INST]", "[/INST]", "<s>", "</s>"]

# Convert to format expected by Mistral "<s> [INST] User Instruction [/INST] Response </s>"
# Use the tokenizer itself to apply the chat template and avoid manual string concatenation.
def convert_convo(messages):
    encoded = tokenizer.apply_chat_template(messages, return_tensors="pt")[0]
    return len(encoded), tokenizer.decode(encoded)

def convert_entry(entry):
    messages = []

    # Reject any entries that contain existing Mistral instruction format markers
    if any(reserved in entry for reserved in reserved_strings):
        return 0, None

    # Split the entry into segments, one per human turn.
    # (Entries that are too long are filtered out later, after conversion.)
    segments = entry.split('### Human:')

    # For each human -> assistant exchange, append a user and an assistant message
    for segment in segments:
        convo_split = segment.split('### Assistant:')
        human = convo_split[0].strip()
        if len(human) == 0:
            continue
        messages.append({"role": "user", "content": human})
        assistant = ""
        if len(convo_split) > 1:
            assistant = convo_split[1].strip()
        messages.append({"role": "assistant", "content": assistant})

    # Convert the list of user and assistant messages to Mistral format.
    token_size, mistral_convo = convert_convo(messages)
        
    # Return the token count and a dict (written later as a JSON object) with the text for training
    return token_size, {'text': mistral_convo}

training_entries = []
for index, entry in sample_subset.iterrows():
    token_size, converted_entry = convert_entry(entry['text'])
    if converted_entry and token_size <= token_size_limit:
        training_entries.append(converted_entry)

# Write all of the data to a JSONL file, one JSON object per line
with open("data/fine_tune.jsonl", 'w') as f:
    print(f"Saving {len(training_entries)} entries")
    for training_entry in training_entries:
        f.write(json.dumps(training_entry) + "\n")

Saving 8679 entries
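
A quick read-back can confirm that the file parses and that every line carries a `text` field. This is a minimal sketch: `read_jsonl` is a hypothetical helper, and the demo rows are placeholder strings written to a throwaway file, not real tokenizer output. In the notebook you would pass `"data/fine_tune.jsonl"` instead.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse a JSONL file into a list of dicts, one per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder rows for the demo only.
demo_rows = [
    {"text": "<s> [INST] hi [/INST] hello</s>"},
    {"text": "<s> [INST] a [/INST] b</s>"},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for row in demo_rows:
            f.write(json.dumps(row) + "\n")
    loaded = read_jsonl(demo_path)

# Check the row count and that each row has the expected key.
print(len(loaded), all("text" in row for row in loaded))
```

For the real file, the row count should match the number reported when saving.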
