# Convert and save a subset of the Guanaco instruction fine tuning data

I downloaded the parquet files for the Guanaco fine tuning dataset (https://huggingface.co/datasets/timdettmers/openassistant-guanaco), then saved a subset of the training data to demonstrate instruction fine tuning without dedicating the time and resources needed to train on the full set.

In [1]:
from datasets import load_dataset
import json

In [2]:
guanaco = load_dataset('timdettmers/openassistant-guanaco')

Repo card metadata block was not found. Setting CardData to empty.


In [3]:
training_dataset = guanaco['train']
training_df = training_dataset.to_pandas()

In [4]:
training_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9846 entries, 0 to 9845
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    9846 non-null   object
dtypes: object(1)
memory usage: 77.1+ KB


In [5]:
training_df.sample(10)

Unnamed: 0,text
6443,### Human: dame la receta de ¿cómo preparar le...
4547,### Human: How could nuclear quadrupole resona...
861,### Human: Me puedes dar una lista de nombres ...
8089,### Human: Please explain the process and impl...
3083,### Human: 这是一个提示### Assistant: 您好，我没有理解您的意思。请...
4888,### Human: ¿Cómo sería una ficha de personaje ...
4446,### Human: Cổ phiếu là gì ?### Assistant: Cổ p...
202,### Human: ¿Cómo podemos alcanzar una comprens...
2412,### Human: อยากให้ตัวเองตั้งใจเรียนมากขึ้นควรท...
799,"### Human: Завдяки чому ти, як нейронна мережа..."


In [8]:
# Take a small subset of the overall training set
sample_subset = training_df.sample(1000, random_state=42)
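
The next step filters entries by character length, so it can help to see how many entries would survive a given limit before picking one. The following is a minimal sketch using only the standard library; `count_under_limit` and `example_texts` are hypothetical names introduced here for illustration, and in the notebook the list would presumably be replaced by `sample_subset['text']`.

```python
def count_under_limit(texts, limits):
    """Return {limit: number of texts shorter than `limit` characters}."""
    return {
        limit: sum(1 for text in texts if len(text) < limit)
        for limit in limits
    }

# Hypothetical stand-in for sample_subset['text'] (made-up entries)
example_texts = [
    "### Human: Hi### Assistant: Hello!",
    "### Human: " + "x" * 300 + "### Assistant: " + "y" * 300,
    "### Human: What is 2+2?### Assistant: 4",
]

counts = count_under_limit(example_texts, [100, 200, 1000])
print(counts)
```

A very small limit keeps memory use low but may discard most of the subset, so comparing a few candidate limits this way gives a rough idea of the trade-off.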

In [13]:
# Keep only entries whose full text (inputs and targets together) is shorter
# than this many characters. Otherwise we'll get out of memory errors during training.
size_limit = 200

# Convert to format expected by Mistral "<s>[INST] User Instruction [/INST] Response </s>"
def convert_to_mistral(human, assistant):
    return f"<s>[INST] {human} [/INST] {assistant} </s>"

def convert_entry(entry):
    converted_segments = []

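    # Each turn starts with '### Human:'; the text before the first marker is empty
    segments = entry.split('### Human:')
    for segment in segments:
        if not segment.strip():
            continue
        # Split this turn (not the whole entry) into prompt and response
        convo_split = segment.split('### Assistant:')
        human = convo_split[0].strip()
        assistant = ""
        if (len(convo_split) > 1):
            assistant = convo_split[1].strip()
        converted_segments.append(convert_to_mistral(human, assistant))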
    return {'text': ''.join(converted_segments)}

training_entries = []
for index, entry in sample_subset.iterrows():
    if (len(entry['text']) < size_limit):
        training_entries.append(convert_entry(entry['text']))
    
with open("data/fine_tune.jsonl", 'w') as f:
    print(f"Saving {len(training_entries)} entries")
    for training_entry in training_entries:
        f.write(json.dumps(training_entry) + "\n")

Saving 57 entries
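
To make the conversion concrete, here is a small self-contained sketch that runs a made-up two-turn conversation through the same kind of split and writes it as one JSONL line. The names `to_mistral` and `convert_text` and the example string are assumptions for illustration, not part of the notebook above; an in-memory buffer stands in for `data/fine_tune.jsonl`.

```python
import io
import json

def to_mistral(human, assistant):
    # Same template as the notebook: one <s>[INST] ... [/INST] ... </s> block per turn
    return f"<s>[INST] {human} [/INST] {assistant} </s>"

def convert_text(text):
    turns = []
    for segment in text.split("### Human:"):
        if not segment.strip():
            continue  # skip the empty piece before the first marker
        # partition splits on the first '### Assistant:' only
        human, _, assistant = segment.partition("### Assistant:")
        turns.append(to_mistral(human.strip(), assistant.strip()))
    return {"text": "".join(turns)}

# Hypothetical entry in the Guanaco text format
example = "### Human: Hi### Assistant: Hello!### Human: Bye### Assistant: Goodbye!"
record = convert_text(example)

# Round trip through JSONL using an in-memory file
buffer = io.StringIO()
buffer.write(json.dumps(record) + "\n")
buffer.seek(0)
loaded = [json.loads(line) for line in buffer]
print(loaded[0]["text"])
```

Reading the file back line by line with `json.loads` is a quick way to confirm that every saved entry is valid JSON with a single `text` field.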
