## Lab 8: Supervised Fine-Tuning

# 1 Prepare the data

In this section, we will generate instruction data and use it for supervised fine-tuning (SFT) of a pre-trained Llama3-8B-instruct model.

First, we will use the THU Chinese Classical Poetry Corpus (THU-CCPC) as the source for our instruction data. THU-CCPC is part of THUNLP-AIPoet, a long-term project on AI-generated Chinese poetry.

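The records in THU-CCPC contain only basic information about each poem (dynasty, author, content, title, and keywords), so the first step is to preprocess the data and extract the fields we need. We will examine the raw file, reformat each poem and label it by line length, convert the result into instruction-answer pairs, and register the dataset with LLaMA-Factory.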

Again, let's first set the working directory.

In [1]:
%cd /gfshome

/gfshome




In [2]:
# The dataset comes from
# https://github.com/THUNLP-AIPoet/Datasets.git

# It has already been downloaded to /ssdshare/share/lab8/Datasets,
# so the cells below read it directly from that shared path.

# Create a directory for the processed output (-p: no error if it already exists)
!mkdir -p ccpc

In [3]:
# Let's examine the input file: it holds one JSON object per line
!head -20 /ssdshare/share/lab8/Datasets/CCPC/ccpc_train_v1.0.json

{"dynasty": "Ming", "author": "翁万达", "content": "崖悬百尺古|面削一屏开|晴日流丹草|春风长绿苔", "title": "锦屏岩", "keywords": "屏开 晴日 春风 绿苔"}
{"dynasty": "Ming", "author": "童冀", "content": "每忆宋夫子|终年坐北轩|著书良自苦|得意好忘言", "title": "次胡仲申先生斋居述怀韵十首兼简宋景濂先生 其八", "keywords": "著书 终年 好 忘言"}
{"dynasty": "Ming", "author": "管讷", "content": "劝酒重持杯|杯深喜不辞|愿将今日意|同保百年期", "title": "初度日复呈兄勉翁三首 其二", "keywords": "劝酒 杯深 持杯 愿将"}
{"dynasty": "Song", "author": "汪应辰", "content": "仁心均动植|风化正邦家|福庆方骈集|灵符尚辟邪", "title": "太上皇后合端午帖子词 其二", "keywords": "均 风化 灵符 邦家"}
{"dynasty": "Song", "author": "蒲寿宬", "content": "骤来惊辟易|久视益虚无|咫尺星堪摘|波摇又走珠", "title": "心泉二首 其二", "keywords": "波摇 星 辟易 咫尺"}
{"dynasty": "Song", "author": "袁说友", "content": "红妆夸睡足|粉额趁颜开|惟有江梅样|蛾眉淡拂来", "title": "用杨诚斋韵再题欧阳长老墨梅 其一", "keywords": "红妆 江梅 蛾眉 睡足"}
{"dynasty": "Ming", "author": "毕自严", "content": "屈指归来日|竹桃正著花|还应识故主|烂漫吐红霞", "title": "屈指归来日效白体七首 其七", "keywords": "屈指 红霞 识 著花"}
{"dynasty": "Song", "author": "王致", "content": "年近古稀有|不易升此堂|他日留泉下|须留姓氏香", "title": "辞本州教官", "keywords": "古稀 他日 姓氏

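Before transforming the whole file, it can help to trace one record by hand. The sketch below is an illustration rather than part of the lab code: it parses the first record shown above (trimmed to the two fields used here), splits the poem on `|`, and counts the characters in each line. The variable names are our own.

```python
import json

# First record from the output above, trimmed to the two fields used here.
raw_line = '{"content": "崖悬百尺古|面削一屏开|晴日流丹草|春风长绿苔", "keywords": "屏开 晴日 春风 绿苔"}'

record = json.loads(raw_line)

# "|" separates the lines of the poem; keywords are space-separated.
poem_lines = record["content"].split("|")
keywords = record["keywords"].split()
line_lengths = [len(line) for line in poem_lines]

print(poem_lines)
print(line_lengths)  # every line has 5 characters, so this poem is a 五言诗
print(keywords)
```

A poem whose lines all have seven characters would be labelled 七言诗 in the same way.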
In [4]:
# This code transforms the CCPC dataset into a friendlier JSON format,
# keeping only the fields we need.
# We also label each poem as 五言诗 (five characters per line) or 七言诗 (seven characters per line).
import json

# Define input and output files
input_file = "/ssdshare/share/lab8/Datasets/CCPC/ccpc_train_v1.0.json"
output_file = "ccpc/ccpc_transformed_with_format.json"

transformed_data = []

# Load and transform data
with open(input_file, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():  # Skip empty lines
            item = json.loads(line.strip())  # Load each line as a JSON object
            
            # Process the content of the poem
            lines = item["content"].split("|")
            formatted_lines = [f"{line}，" if i % 2 == 0 else f"{line}。" for i, line in enumerate(lines)]
            formatted_content = "\n".join(formatted_lines)

            # Determine the poem type from the number of characters per line
            line_lengths = [len(line.replace("，", "").replace("。", "")) for line in lines]
            if all(length == 5 for length in line_lengths):
                poem_type = "五言诗"
            elif all(length == 7 for length in line_lengths):
                poem_type = "七言诗"
            else:
                poem_type = "杂言诗"

            # Transform the data structure
            transformed_item = {
                "title": item["title"],
                "author": item["author"],
                "content": formatted_content,
                "keywords": item["keywords"].split(),
                "poem_type": poem_type
            }
            transformed_data.append(transformed_item)

# Save the transformed data to output file
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(transformed_data, f, ensure_ascii=False, indent=4)

# Check whether any poem has mixed line lengths (杂言诗)
def check_for_mixed_poems(transformed_data):
    has_mixed_poems = any(item["poem_type"] == "杂言诗" for item in transformed_data)
    return has_mixed_poems


if check_for_mixed_poems(transformed_data):
    print("There are mixed poems in the transformed data.")
else:
    print("There are no mixed poems in the transformed data.")

print("Transformation complete! Transformed data saved to:", output_file)

There are no mixed poems in the transformed data.
Transformation complete! Transformed data saved to: ccpc/ccpc_transformed_with_format.json
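
The check above only reports whether any 杂言诗 exist. To see the full distribution of labels, one option is to count them. This is a minimal sketch, assuming records shaped like `transformed_data`; the function name `count_poem_types` and the three sample records are illustrative and not part of the lab code.

```python
from collections import Counter

def count_poem_types(items):
    """Return a Counter mapping each poem_type label to its frequency."""
    return Counter(item["poem_type"] for item in items)

# Tiny hand-made sample standing in for transformed_data.
sample = [
    {"title": "锦屏岩", "poem_type": "五言诗"},
    {"title": "久雨", "poem_type": "七言诗"},
    {"title": "山谷草圣", "poem_type": "七言诗"},
]

type_counts = count_poem_types(sample)
for poem_type, count in type_counts.most_common():
    print(poem_type, count)
```

In the notebook, calling `count_poem_types(transformed_data)` would show how the corpus splits between the two forms.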


In [5]:
# take a look at the resulting data
!tail -50 ccpc/ccpc_transformed_with_format.json

    },
    {
        "title": "和韵题雨竹并菊",
        "author": "黄仲昭",
        "content": "雨过琅玕湿未乾，\n霜馀黄菊尚堪餐。\n他年何处偏思尔，\n紫陌鸡声旅梦残。",
        "keywords": [
            "他年",
            "琅玕",
            "黄菊",
            "雨过"
        ],
        "poem_type": "七言诗"
    },
    {
        "title": "燕都杂题三首 其三",
        "author": "何失",
        "content": "花市东边柳市西，\n矮堂一笑百金挥。\n如今踪迹无寻处，\n谷雨绵山燕子飞。",
        "keywords": [
            "东边",
            "谷雨",
            "燕子",
            "踪迹"
        ],
        "poem_type": "七言诗"
    },
    {
        "title": "久雨",
        "author": "陈高",
        "content": "新年新雨连残腊，\n一月浑无一日晴。\n晓起莺声都寂寞，\n寒深柳眼未分明。",
        "keywords": [
            "莺声",
            "新年",
            "分明",
            "寂寞"
        ],
        "poem_type": "七言诗"
    },
    {
        "title": "山谷草圣",
        "author": "廖大圭",
        "content": "涪翁醉墨动惊蛇，\n流落人间几岁华。\n长恐六丁起雷电，\n为龙飞去玉皇家。",
        "keywords": [
            "玉皇",
            "流落",
            "岁华",
            "飞去"
        ],
   

In [6]:
# Turn the labelled dataset into SFT format,
# i.e. instruction-answer pairs.

import json
import random

# Define input and output files
input_file = "ccpc/ccpc_transformed_with_format.json"
output_file = "LLaMA-Factory/data/alpaca_sft_dataset_with_varied_instructions.json"  # Saved under LLaMA-Factory/data so LLaMA-Factory can load it

# Keep only 5-character and 7-character poems
def filter_poems(poem):
    return poem["poem_type"] in ["五言诗", "七言诗"]

# Build an instruction from the poem type and its keywords
def generate_instruction(poem_type, theme):
    poem_type_map = {
        "五言诗": "5-character",
        "七言诗": "7-character"
    }
    english_poem_type = poem_type_map.get(poem_type, "unknown")
    themes = ", ".join(theme)
    
    # Define instruction templates in five different styles
    instruction_templates = [
        f"Hi, you are a Chinese poet, can you write a {english_poem_type} 4-line poem about the themes: {themes}?",
        f"Hi, you are a Chinese poet now, can you compose a {english_poem_type} 4-line poem reflecting on the ideas of {themes}?",
        f"Hi, as a Chinese poet, can you help me to create a {english_poem_type} 4-line poem that incorporates the themes of {themes}?",
        f"Hi, please draft a {english_poem_type} 4-line poem based on the themes: {themes}.",
        f"Hi, you are a Chinese poet, please generate a {english_poem_type} 4-line poem exploring the themes of {themes}."
    ]
    return random.choice(instruction_templates)  # Choose a random instruction template

# Transform one poem into an Alpaca-format record
def create_alpaca_data(poem):
    theme = poem["keywords"]
    instruction = generate_instruction(poem["poem_type"], theme)
    return {
        "instruction": instruction,
        "input": "",  # Left empty: the instruction already contains everything the model needs
        "output": poem["content"]
    }

# Load poem data and filter out unsuitable poems
with open(input_file, "r", encoding="utf-8") as f:
    poems = json.load(f)
filtered_poems = [poem for poem in poems if filter_poems(poem)]
alpaca_data = [create_alpaca_data(poem) for poem in filtered_poems]

# Save in Alpaca format
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(alpaca_data, f, ensure_ascii=False, indent=4)

print(f"Alpaca dataset with varied instructions created successfully! Saved to {output_file}")


Alpaca dataset with varied instructions created successfully! Saved to LLaMA-Factory/data/alpaca_sft_dataset_with_varied_instructions.json
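
Because `random.choice` draws a template at random, rerunning the cell above produces a different dataset each time. If reproducibility matters, one option is to draw from a seeded `random.Random` instance. This is a sketch of that idea, not what the lab code does; `pick_template`, the seed value 42, and the shortened templates are assumptions made for illustration.

```python
import random

def pick_template(templates, rng):
    """Pick one template using the given random.Random instance."""
    return rng.choice(templates)

# Shortened stand-ins for the instruction templates.
templates = [
    "write a poem about {themes}",
    "compose a poem reflecting on {themes}",
    "create a poem that incorporates {themes}",
]

# Two generators created with the same seed yield the same sequence of choices.
rng_a = random.Random(42)
rng_b = random.Random(42)
first_run = [pick_template(templates, rng_a) for _ in range(5)]
second_run = [pick_template(templates, rng_b) for _ in range(5)]

print(first_run == second_run)  # True
```

Using a dedicated `random.Random` object rather than the module-level functions keeps the seed from affecting any other code that relies on `random`.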


In [7]:
# take a look at the result data

!head -20 LLaMA-Factory/data/alpaca_sft_dataset_with_varied_instructions.json

[
    {
        "instruction": "Hi, as a Chinese poet, can you help me to create a 5-character 4-line poem that incorporates the themes of 屏开, 晴日, 春风, 绿苔?",
        "input": "",
        "output": "崖悬百尺古，\n面削一屏开。\n晴日流丹草，\n春风长绿苔。"
    },
    {
        "instruction": "Hi, please draft a 5-character 4-line poem based on the themes: 著书, 终年, 好, 忘言.",
        "input": "",
        "output": "每忆宋夫子，\n终年坐北轩。\n著书良自苦，\n得意好忘言。"
    },
    {
        "instruction": "Hi, please draft a 5-character 4-line poem based on the themes: 劝酒, 杯深, 持杯, 愿将.",
        "input": "",
        "output": "劝酒重持杯，\n杯深喜不辞。\n愿将今日意，\n同保百年期。"
    },
    {
        "instruction": "Hi, you are a Chinese poet, please generate a 5-character 4-line poem exploring the themes of 均, 风化, 灵符, 邦家.",
        "input": "",
        "output": "仁心均动植，\n风化正邦家。\n福庆方骈集，\n灵符尚辟邪。"

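Before registering the file, it may be worth checking that every record has the shape we expect. The following is a minimal sketch, assuming the three Alpaca keys shown above and four-line poems; `validate_alpaca_record` is a hypothetical helper, not a LLaMA-Factory function.

```python
REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_alpaca_record(record, expected_lines=4):
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not record.get("instruction", "").strip():
        problems.append("empty instruction")
    n_lines = len(record.get("output", "").split("\n"))
    if n_lines != expected_lines:
        problems.append(f"expected {expected_lines} lines, got {n_lines}")
    return problems

# A record copied from the output above, and a deliberately broken one.
good = {
    "instruction": "Hi, please draft a 5-character 4-line poem based on the themes: 著书, 终年, 好, 忘言.",
    "input": "",
    "output": "每忆宋夫子，\n终年坐北轩。\n著书良自苦，\n得意好忘言。",
}
bad = {"instruction": "", "output": "崖悬百尺古，\n面削一屏开。"}

print(validate_alpaca_record(good))  # []
print(validate_alpaca_record(bad))
```

In the notebook, the same function could be applied to each element of `alpaca_data` to flag any unexpected records.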

In [8]:
# We now have the SFT dataset, but LLaMA-Factory also requires it to be registered
# in LLaMA-Factory/data/dataset_info.json.
# Let's take a look at dataset_info.json first.
!head -100 LLaMA-Factory/data/dataset_info.json

{
  "identity": {
    "file_name": "identity.json"
  },
  "alpaca_en_demo": {
    "file_name": "alpaca_en_demo.json"
  },
  "alpaca_zh_demo": {
    "file_name": "alpaca_zh_demo.json"
  },
  "glaive_toolcall_en_demo": {
    "file_name": "glaive_toolcall_en_demo.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "tools": "tools"
    }
  },
  "glaive_toolcall_zh_demo": {
    "file_name": "glaive_toolcall_zh_demo.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "conversations",
      "tools": "tools"
    }
  },
  "mllm_demo": {
    "file_name": "mllm_demo.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  },
  "mllm_audio_demo": {
    "file_name": "mllm_audio_demo.json",
    "formatting": "sharegpt",
    "columns": {
    

In [9]:
# Now we add our SFT data to LLaMA-Factory/data/dataset_info.json:
# poet_instructions = {
#     "file_name": "alpaca_sft_dataset_with_varied_instructions.json",
#     "formatting": "alpaca"
# }

import os
import json

# Path to the dataset_info.json file
dataset_info_path = os.path.join("LLaMA-Factory", "data", "dataset_info.json")

# Read the existing dataset_info.json
with open(dataset_info_path, "r", encoding="utf-8") as f:
    dataset_info = json.load(f)

# Add the poet_instructions entry
dataset_info["poet_instructions"] = {
    "file_name": "alpaca_sft_dataset_with_varied_instructions.json",
    "formatting": "alpaca"
}

# Write the updated dataset_info.json
with open(dataset_info_path, "w", encoding="utf-8") as f:
    json.dump(dataset_info, f, ensure_ascii=False, indent=4)

print(f"Updated {dataset_info_path} with poet_instructions dataset information.")


Updated LLaMA-Factory/data/dataset_info.json with poet_instructions dataset information.


In [10]:
# ensure the dataset is registered
!tail -10 LLaMA-Factory/data/dataset_info.json

        "columns": {
            "prompt": "content"
        },
        "folder": "python"
    },
    "poet_instructions": {
        "file_name": "alpaca_sft_dataset_with_varied_instructions.json",
        "formatting": "alpaca"
    }
}
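
Rerunning the registration cell simply overwrites the `poet_instructions` entry, so it is safe to run more than once. What it does not verify is that the referenced data file actually exists next to `dataset_info.json`. The sketch below shows one way to check that. `find_missing_files` is a hypothetical helper, and the example runs against a temporary directory so it can be tried without touching the real `LLaMA-Factory/data` folder.

```python
import json
import os
import tempfile

def find_missing_files(data_dir, info_name="dataset_info.json"):
    """Return names of registered datasets whose local file_name is absent."""
    with open(os.path.join(data_dir, info_name), "r", encoding="utf-8") as f:
        dataset_info = json.load(f)
    missing = []
    for name, entry in dataset_info.items():
        file_name = entry.get("file_name")
        # Entries without a local file_name are skipped.
        if file_name and not os.path.exists(os.path.join(data_dir, file_name)):
            missing.append(name)
    return missing

with tempfile.TemporaryDirectory() as data_dir:
    info = {
        "poet_instructions": {
            "file_name": "alpaca_sft_dataset_with_varied_instructions.json",
            "formatting": "alpaca",
        },
        "not_there": {"file_name": "does_not_exist.json"},
    }
    with open(os.path.join(data_dir, "dataset_info.json"), "w", encoding="utf-8") as f:
        json.dump(info, f, ensure_ascii=False, indent=2)
    # Create only the poetry data file; the second entry has no file on disk.
    poem_path = os.path.join(data_dir, info["poet_instructions"]["file_name"])
    with open(poem_path, "w", encoding="utf-8") as f:
        json.dump([], f)

    missing = find_missing_files(data_dir)
    print(missing)  # ['not_there']
```

In the notebook, `find_missing_files("LLaMA-Factory/data")` should not list `poet_instructions` if the earlier cells ran successfully.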

Now that the dataset is ready, let's go back to 01_llama factory.ipynb. :)