
# **Finetuning Dataset Preparation**

This tutorial explains how to prepare a dataset for finetuning.

As an example, it converts 'cais/mmlu', one of the datasets published on Hugging Face, into the format required by Friendli Finetuning.

### **0. Download the Hugging Face dataset**

First, download the 'cais/mmlu' dataset published on Hugging Face.

(link: https://huggingface.co/datasets/cais/mmlu/tree/main/all)

This tutorial uses the auxiliary_train-00000-of-00001.parquet file from the `all` subset.
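The download can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the function name `download_auxiliary_train` and its `fetch` parameter are hypothetical and exist only for illustration.

```python
# Repository coordinates taken from the link above.
REPO_ID = "cais/mmlu"
FILENAME = "all/auxiliary_train-00000-of-00001.parquet"


def download_auxiliary_train(fetch=None):
    """Download the parquet file and return its local path.

    `fetch` defaults to huggingface_hub.hf_hub_download; passing another
    callable is only a convenience for trying the sketch without network access.
    """
    if fetch is None:
        # Imported lazily so the sketch loads even without huggingface_hub.
        from huggingface_hub import hf_hub_download as fetch
    return fetch(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
```

The returned path points into the local Hugging Face cache, so either copy the file into the working directory or pass the returned path wherever the tutorial uses `parquet_file_path`.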

```
# dataset sample
{
    "question": "Davis decided to kill Adams. He set out for Adams's house. Before he got there he saw Brooks, who resembled Adams. Thinking that Brooks was Adams, Davis shot at Brooks. The shot missed Brooks but wounded Case, who was some distance away. Davis had not seen Case. In a prosecution under a statute that proscribes any attempt to commit murder, the district attorney should indicate that the intended victim(s) was\/were",
    "subject":"",
    "answer":1,
    "choices":[
        "Adams only.","Brooks only.","Case only.","Adams and Brooks"
    ]
}
```
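In this sample, `answer` is a zero-based index into `choices`, so the value 1 selects "Brooks only." The lookup can be sketched as follows; the function name `resolve_answer` is hypothetical and the question text is elided for brevity.

```python
def resolve_answer(record):
    """Return the text of the choice selected by the record's answer index."""
    choices = record["choices"]
    index = record["answer"]
    if not 0 <= index < len(choices):
        raise ValueError(
            f"answer index {index} is out of range for {len(choices)} choices"
        )
    return choices[index]


sample = {
    "question": "...",  # question text elided for brevity
    "subject": "",
    "answer": 1,
    "choices": ["Adams only.", "Brooks only.", "Case only.", "Adams and Brooks"],
}
print(resolve_answer(sample))  # Brooks only.
```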

### **1. Convert the dataset schema into the format Friendli Finetuning accepts**

Once the dataset is ready, convert it into a form that Friendli Finetuning can use.

Dataset requirements:
- Only .jsonl and .csv files are supported.
- Each record must contain a "messages" column, and only that column is used for training.
- Each "messages" value must have the form [{"role": ..., "content": ...}, ...].
- The training file and the test file must be created separately.


```
# valid data sample
{
    "messages": [
        {
            "role": "user",
            "content": "Davis decided to kill Adams. He set out for Adams's house. Before he got there he saw Brooks, who resembled Adams. Thinking that Brooks was Adams, Davis shot at Brooks. The shot missed Brooks but wounded Case, who was some distance away. Davis had not seen Case. In a prosecution under a statute that proscribes any attempt to commit murder, the district attorney should indicate that the intended victim(s) was\/were"
        },
        {
            "role": "system",
            "content": "Brooks only."
        }
    ]
}

"""
Tip: for role "system", set content to the output you expect from the model;
for role "user", set content to the input the user gives to the model.
"""
```
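The sample above is a JSONL record. For the .csv option, a minimal sketch is shown below. It assumes the "messages" column holds the list as a JSON-encoded string, which is the same assumption the CSV branch of the validation script in section 2 makes. The function name `write_messages_csv` and the short question text are hypothetical.

```python
import csv
import io
import json


def write_messages_csv(records, out):
    """Write records to `out` as CSV with a single JSON-encoded "messages" column."""
    writer = csv.DictWriter(out, fieldnames=["messages"])
    writer.writeheader()
    for record in records:
        writer.writerow({"messages": json.dumps(record["messages"])})


records = [
    {
        "messages": [
            {"role": "user", "content": "Which choice is correct?"},
            {"role": "system", "content": "Brooks only."},
        ]
    }
]

buffer = io.StringIO()
write_messages_csv(records, buffer)

# Read the CSV back to confirm the column round-trips through JSON.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(json.loads(row["messages"])[1]["content"])
```

When writing to a real file instead of an in-memory buffer, open it with `newline=""` as the Python `csv` module recommends.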

Friendli Finetuning currently supports only .jsonl and .csv files, so convert the downloaded .parquet file to .jsonl.

In [36]:
import pandas as pd

def parquet_to_jsonl(parquet_file_path, jsonl_file_path):
    # Read the Parquet file into a DataFrame
    df = pd.read_parquet(parquet_file_path)

    # Save the DataFrame as a JSONL file
    df.to_json(jsonl_file_path, orient='records', lines=True)

    print(f"Conversion complete. JSONL file saved at: {jsonl_file_path}")

# Specify the file paths
parquet_file_path = 'auxiliary_train-00000-of-00001.parquet'
jsonl_file_path = 'mmlu_auxiliary_train.jsonl'

# Perform the conversion
parquet_to_jsonl(parquet_file_path, jsonl_file_path)

Conversion complete. JSONL file saved at: mmlu_auxiliary_train.jsonl
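Before converting the schema, it can help to confirm that every record in the JSONL file carries the fields the next step reads. The following is a rough sketch; the function name `find_incomplete_records` is hypothetical, and the required field names are taken from the dataset sample above.

```python
import json

REQUIRED_FIELDS = ("question", "choices", "answer")


def find_incomplete_records(path, required=REQUIRED_FIELDS):
    """Return (line_number, missing_fields) pairs for records lacking a required field."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            missing = [name for name in required if name not in record]
            if missing:
                problems.append((line_number, missing))
    return problems
```

Calling `find_incomplete_records('mmlu_auxiliary_train.jsonl')` should return an empty list when every record has all three fields.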


Next, run a script that converts the dataset to meet the remaining requirements.

Each record is built so that the user asks the "question" and the model replies with the entry of "choices" selected by "answer".

The script then splits the data into train and test sets at an 80% / 20% ratio.

In [37]:
import json
import random

def convert_schema(input_file):
    # Read all lines from the input file
    with open(input_file, 'r') as infile:
        lines = infile.readlines()

    # Shuffling is disabled, so the split follows file order.
    # Uncomment the next line to randomize the split.
    # random.shuffle(lines)

    # Split the data into 80% training and 20% test
    split_index = int(0.8 * len(lines))
    train_lines = lines[:split_index]
    test_lines = lines[split_index:]

    return train_lines, test_lines

def save_converted_data(lines, output_file):
    with open(output_file, 'w') as outfile:
        for line in lines:
            # Parse the JSON line
            original_data = json.loads(line)

            # Extract the relevant fields
            question = original_data['question']
            answer_index = original_data['answer']
            choices = original_data['choices']
            answer = choices[answer_index]

            # Create the new schema
            new_data = {
                "messages": [
                    {"role": "user", "content": question},
                    {"role": "system", "content": answer}
                ]
            }

            # Write the new data to the output file
            outfile.write(json.dumps(new_data) + '\n')

# Define input and output file paths
input_file = 'mmlu_auxiliary_train.jsonl'
train_output_file = 'train.jsonl'
test_output_file = 'test.jsonl'

# Convert the schema and split the data
train_lines, test_lines = convert_schema(input_file)

# Save the converted data to train and test files
save_converted_data(train_lines, train_output_file)
save_converted_data(test_lines, test_output_file)

print("Data has been successfully converted and saved!\n")

# Optionally, print a sample of the converted data
with open(train_output_file, 'r') as f:
    sample_data = json.loads(f.readline())
    print("Sample from train.jsonl:")
    print(json.dumps(sample_data, indent=4))

Data has been successfully converted and saved!

Sample from train.jsonl:
{
    "messages": [
        {
            "role": "user",
            "content": "Davis decided to kill Adams. He set out for Adams's house. Before he got there he saw Brooks, who resembled Adams. Thinking that Brooks was Adams, Davis shot at Brooks. The shot missed Brooks but wounded Case, who was some distance away. Davis had not seen Case. In a prosecution under a statute that proscribes any attempt to commit murder, the district attorney should indicate that the intended victim(s) was/were"
        },
        {
            "role": "system",
            "content": "Brooks only."
        }
    ]
}


The dataset has been converted successfully!
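The script above splits the lines in file order, because the shuffle call is commented out. If a randomized but repeatable split is preferred, one option is a seeded shuffle. This is only a sketch: the function name `split_lines` and the seed value 42 are arbitrary choices, not part of the original script.

```python
import random


def split_lines(lines, train_ratio=0.8, seed=42):
    """Return (train, test) after shuffling a copy of lines with a fixed seed."""
    shuffled = list(lines)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    split_index = int(train_ratio * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]


train, test = split_lines([f"record-{i}\n" for i in range(10)])
print(len(train), len(test))  # 8 2
```

Using a dedicated `random.Random(seed)` instance keeps the split identical across runs without changing the global random state.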

### **2. Validate the dataset with the given script**

With the dataset converted into the form Friendli Finetuning accepts, the next step is to check that it is valid.

You can also use this script to check in advance whether a dataset will work for finetuning.


In [38]:
import json
import csv

def validate_jsonl(file_path):
    with open(file_path, 'r') as file:
        for line_number, line in enumerate(file, start=1):
            try:
                data = json.loads(line)
                if 'messages' not in data:
                    print(f"Line {line_number}: Missing 'messages' key")
                    return
                messages = data['messages']
                if not isinstance(messages, list):
                    print(f"Line {line_number}: 'messages' is not a list")
                    return
                for message in messages:
                    if not isinstance(message, dict):
                        print(f"Line {line_number}: A message is not a dictionary")
                        return
                    if 'role' not in message or 'content' not in message:
                        print(f"Line {line_number}: Missing 'role' or 'content' key in a message")
                        return
            except json.JSONDecodeError:
                print(f"Line {line_number}: Invalid JSON")
                return
    print("Validation successful: All data conforms to the schema.")

def validate_csv(file_path):
    with open(file_path, 'r') as file:
        reader = csv.DictReader(file)
        for line_number, row in enumerate(reader, start=2):  # CSV files have a header line, so start counting from 2
            if 'messages' not in row:
                print(f"Row {line_number}: Missing 'messages' column")
                return
            try:
                messages = json.loads(row['messages'])
                if not isinstance(messages, list):
                    print(f"Row {line_number}: 'messages' is not a list")
                    return
                for message in messages:
                    if not isinstance(message, dict):
                        print(f"Row {line_number}: A message is not a dictionary")
                        return
                    if 'role' not in message or 'content' not in message:
                        print(f"Row {line_number}: Missing 'role' or 'content' key in a message")
                        return
            except json.JSONDecodeError:
                print(f"Row {line_number}: Invalid JSON in 'messages' column")
                return
    print("Validation successful: All data conforms to the schema.")

def validate(file_path):
    print(f"Validate '{file_path}'")
    if file_path.endswith('.jsonl'):
        validate_jsonl(file_path)
    elif file_path.endswith('.csv'):
        validate_csv(file_path)
    else:
        print("Unsupported file format. Please provide a .jsonl or .csv file.")

def main():
    # Validate the original dataset file
    validate('mmlu_auxiliary_train.jsonl')
    print("\n")

    # Validate the converted train dataset file
    validate('train.jsonl')
    print("\n")

    # Validate the converted test dataset file
    validate('test.jsonl')
    print("\n")

if __name__ == "__main__":
    main()

Validate 'mmlu_auxiliary_train.jsonl'
Line 1: Missing 'messages' key


Validate 'train.jsonl'
Validation successful: All data conforms to the schema.


Validate 'test.jsonl'
Validation successful: All data conforms to the schema.




### **3. Upload the dataset to Hugging Face**

Finally, upload the validated dataset to Hugging Face.

Log in with the Hugging Face account whose token is integrated with your Friendli Suite account, and upload the dataset to a private repository.
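The upload can be done from the Hugging Face web UI or from code. The following is a minimal sketch using `huggingface_hub`, assuming the package is installed and a login token is already saved locally. The function name `upload_splits` and the repository name `your-username/mmlu-finetuning` are placeholders, and storing each file at the repository root under its own file name is an assumption about the layout.

```python
import os


def upload_splits(api, repo_id, files):
    """Create a private dataset repo (if needed) and upload each local file to it.

    `api` is expected to behave like huggingface_hub.HfApi.
    """
    api.create_repo(repo_id=repo_id, repo_type="dataset", private=True, exist_ok=True)
    for local_path in files:
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=os.path.basename(local_path),
            repo_id=repo_id,
            repo_type="dataset",
        )


def main():
    # Imported here so the sketch loads without huggingface_hub installed.
    from huggingface_hub import HfApi

    # Placeholder repository name; replace it with your own.
    upload_splits(HfApi(), "your-username/mmlu-finetuning", ["train.jsonl", "test.jsonl"])
```

Calling `main()` performs the actual upload; it is not invoked automatically here so that nothing is pushed by accident.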


### **4. Launch the finetuning job using the uploaded dataset**

Use the dataset uploaded to Hugging Face to launch a finetuning job in Friendli Suite.

### **Note. If the base model does not have a chat template**

If the model you want to finetune does not have its own chat_template, you can apply a template yourself and supply the result in the {"text": "xxx"} form for finetuning.

```
# pseudo sample
{
    "text": "##|user|## Davis decided to kill Adams. He set out for Adams's house. Before he got there he saw Brooks, who resembled Adams. Thinking that Brooks was Adams, Davis shot at Brooks. The shot missed Brooks but wounded Case, who was some distance away. Davis had not seen Case. In a prosecution under a statute that proscribes any attempt to commit murder, the district attorney should indicate that the intended victim(s) was\/were, ##|system|## Brooks only."
}
```
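Producing such a record from a "messages" list can be sketched as follows. The `##|role|##` markers and the ", " separator simply mirror the pseudo sample above; they are placeholders, not the template of any particular model, and the function name `render_text` is hypothetical.

```python
def render_text(messages, separator=", "):
    """Join messages into one string using the '##|role|##' markers from the pseudo sample."""
    parts = [f"##|{m['role']}|## {m['content']}" for m in messages]
    return {"text": separator.join(parts)}


record = render_text(
    [
        {"role": "user", "content": "Which choice is correct?"},
        {"role": "system", "content": "Brooks only."},
    ]
)
print(record["text"])
# ##|user|## Which choice is correct?, ##|system|## Brooks only.
```

In practice, replace the markers with whatever format the base model was trained to expect.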