<a href="https://colab.research.google.com/github/VincentZuo/Code/blob/main/20230826_hegel_gpt_llm_trainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
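# NOTE: `temperature` and `number_of_examples` are never used below, and `prompt`
# is overwritten before the inference cells run.
prompt = "A model that takes in a puzzle-like reasoning-heavy question in English, and responds with a well-reasoned, step-by-step thought out response in Spanish."
temperature = 0.4
number_of_examples = 100

# System message prepended to every training and inference prompt
system_message = "Given a philosophy question, you will respond with like you are Hegel"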

In [3]:
import pandas as pd
hegel_df = pd.read_pickle('/content/drive/My Drive/cs231/hegel_qa.pkl')
hegel_df.loc[:1250].tail(20)

Unnamed: 0,question,answer
1231,What is the final shape of spirit in absolute ...,"797. What in religion was content, or the form..."
1232,What is the nature and movement of the knowing...,798. This last shape of spirit is that of abso...
1233,What is the nature and content of the knowing ...,"799. The nature, moments, and movement of this..."
1234,"What is the relationship between spirit, consc...","800. However, with regards to the existence of..."
1235,What is the relationship between substance and...,"801. Now, in actuality the substance that is k..."
1236,What is the significance of time in relation t...,9 das Begreifen; conceptually comprehending or...
1237,"What is the relationship between time, conscio...",as long as it does not erase time. Time is the...
1238,What is the nature of spirit and its relations...,"802. For this reason, it must be said that not..."
1239,What is the nature of spirit and its relations...,803. The movement of propelling forward the fo...
1240,What is the nature of spirit and its relations...,but rather is also the equality of the self wi...


In [4]:
import json
import pandas as pd

df = hegel_df.loc[:1249]

# Extract the questions and answers as NumPy arrays
prompts = df['question'].to_numpy()
responses = df['answer'].to_numpy()

for i in [104, 364, 676]:
  print(prompts[i])
  print(responses[i])
  print()

What is the author's perspective on the reception of their proposed characterization of science and its system?
71. While I have posited that science exists as a result of the self-
movement of the concept, and while my way of looking at all the aspects
of this diverges from current ideas55 about the nature and shape of truth –
all of which are in fact quite opposed to my own views (and not only the
ones I have cited but others as well) – there does not seem to be much
promise at all that an attempt to expound the system of science accord-
ing to the characterization I have given of it will be received favorably.
In the meantime, I can bear in mind that, for example, the excellence of
Plato’s philosophy has sometimes been said to lie in his scientifically val-
ueless myths, and there have also been times, which have even been called
times of religious enthusiasm,56 in which the Aristotelian philosophy was
esteemed for the sake of its speculative depth and when Plato’s Parmenides,
perha

In [5]:
import re

def clean_string(s):
    # 1. Remove newline characters. Nothing is inserted in their place, so words
    #    separated only by a line break end up joined (e.g. "aspectsof").
    s = s.replace("\n", "")

    # 2. Remove all digit runs (paragraph and footnote numbers) and any periods that follow them.
    s = re.sub(r'\d+\.*', '', s)

    # 3. Normalize commas and periods: no space before, exactly one space after.
    s = re.sub(r'\s*,\s*', ', ', s)
    s = re.sub(r'\s*\.\s*', '. ', s)

    # 4. Strip leading and trailing whitespace
    s = s.strip()

    return s


# Test the function
test_str = responses[676]
print(clean_string(test_str))

The simple substance of spirit divides itself up as consciousness, or, as consciousness of abstract sensuous being passes over into perception, sodoes the immediate certainty of real ethical being also pass over, and just assimple being becomes for sense-perception a thing of many properties, sofor ethical perception a case of acting becomes an actuality of many ethicalrelations. However, to the former, the useless plurality of properties is con-densed into the essential opposition between singularity and universality, Werk. Here the meaning is “work, ” and not “labor” (Arbeit).
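
The sample output above contains run-together words ("sodoes", "assimple") and a split word ("con-densed"), because newlines are deleted rather than replaced. Below is a minimal sketch of a variant that keeps word boundaries. It is not the function used for the results in this notebook, and the name `clean_string_keep_spaces` is hypothetical. It assumes that a hyphen directly before a newline is a line-break hyphen, so a genuine compound split across lines (for example "self-" followed by "movement") would lose its hyphen.

```python
import re

def clean_string_keep_spaces(s):
    # Rejoin words hyphenated across a line break: "accord-\ning" -> "according"
    s = re.sub(r'(\w)-\n(\w)', r'\1\2', s)
    # Replace the remaining newlines with a space so adjacent words stay separate
    s = s.replace('\n', ' ')
    # Remove digit runs and any periods that follow them
    s = re.sub(r'\d+\.*', '', s)
    # No space before commas and periods, exactly one space after
    s = re.sub(r'\s*,\s*', ', ', s)
    s = re.sub(r'\s*\.\s*', '. ', s)
    # Collapse repeated whitespace and trim the ends
    s = re.sub(r'\s+', ' ', s)
    return s.strip()

sample = "accord-\ning to the aspects\nof this"
print(clean_string_keep_spaces(sample))
```

Swapping this in would change the training text, so the loss values recorded later in the notebook would no longer correspond to it.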


In [6]:
df = hegel_df.loc[:1249]

# Apply the cleaning function, then extract prompts and responses as NumPy arrays
prompts = df['question'].apply(clean_string).to_numpy()
responses = df['answer'].apply(clean_string).to_numpy()

for i in [104, 364, 676]:
  print(prompts[i])
  print(responses[i])
  print()

What is the author's perspective on the reception of their proposed characterization of science and its system?
While I have posited that science exists as a result of the self-movement of the concept, and while my way of looking at all the aspectsof this diverges from current ideas about the nature and shape of truth –all of which are in fact quite opposed to my own views (and not only theones I have cited but others as well) – there does not seem to be muchpromise at all that an attempt to expound the system of science accord-ing to the characterization I have given of it will be received favorably. In the meantime, I can bear in mind that, for example, the excellence ofPlato’s philosophy has sometimes been said to lie in his scientifically val-ueless myths, and there have also been times, which have even been calledtimes of religious enthusiasm, in which the Aristotelian philosophy wasesteemed for the sake of its speculative depth and when Plato’s Parmenides, perhaps the greatest wo

In [7]:
# Create a DataFrame
df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

# Remove duplicates
df = df.drop_duplicates()

print(f'There are {len(df)} examples after removing duplicates. Here are the first few:')

df.head()

There are 1250 examples after removing duplicates. Here are the first few:


Unnamed: 0,prompt,response
0,Why is it considered inappropriate and counter...,"In the preface to a philosophical work, it is ..."
1,What is the significance of determining the re...,"other hand, this would give rise to the follow..."
2,What is the significance of determining the re...,Determining the relation that a philosophical ...
3,What is the significance of differentiating be...,Those who demand both such explanations and th...
4,What is the nature of the subject matter and i...,little more than a contrivance for avoiding wh...


Split into train and test sets.

In [8]:
# Split the data into train and test sets, with 90% in the train set
train_df = df.sample(frac=0.9, random_state=42)
test_df = df.drop(train_df.index)

# Save the dataframes to .jsonl files
train_df.to_json('train.jsonl', orient='records', lines=True)
test_df.to_json('test.jsonl', orient='records', lines=True)
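
With `orient='records', lines=True`, `to_json` writes one JSON object per line, keyed by column name. As a rough illustration of that layout using only the standard library (the two records are made-up placeholders, not rows from the dataset):

```python
import io
import json

# Hypothetical records with the same two columns as train_df / test_df
records = [
    {"prompt": "What is spirit?", "response": "Placeholder answer one."},
    {"prompt": "What is substance?", "response": "Placeholder answer two."},
]

# Write one JSON object per line (the JSON Lines layout)
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read the lines back into a list of dicts
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded[0]["prompt"])
```

Each line can be parsed on its own, which is why the files can later be loaded as a dataset with `prompt` and `response` fields.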

# Install necessary libraries

In [9]:
!pip install -q accelerate peft bitsandbytes transformers trl sentencepiece

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Define Hyperparameters

In [10]:
# Base model. To use the official LLaMA 2 chat model instead, set this to
# "meta-llama/Llama-2-7b-chat-hf" (requires approved access and a Hugging Face token).
model_name = "lmsys/vicuna-7b-v1.5"
dataset_name = "/content/train.jsonl"
new_model = "vicuna-7b-custom"  # directory where the LoRA adapter is saved

# LoRA
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1

# 4-bit quantization (bitsandbytes)
use_4bit = True
bnb_4bit_compute_dtype = "float16"
bnb_4bit_quant_type = "nf4"
use_nested_quant = False

# Training
output_dir = "./results"
num_train_epochs = 1
fp16 = False
bf16 = False
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
gradient_accumulation_steps = 1
gradient_checkpointing = True
max_grad_norm = 0.3
learning_rate = 2e-4
weight_decay = 0.001
optim = "paged_adamw_32bit"
lr_scheduler_type = "constant"
max_steps = -1  # -1 means the run length is set by num_train_epochs
warmup_ratio = 0.03  # only used by schedulers that include warmup; "constant" does not
group_by_length = True
save_steps = 25
logging_steps = 5

# Supervised fine-tuning (SFTTrainer)
max_seq_length = None
packing = False
device_map = {"": 0}  # place the whole model on GPU 0
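
To give a feel for what `lora_r = 64` and `lora_alpha = 16` imply, here is a small arithmetic sketch. It assumes the standard LoRA formulation (the low-rank update is scaled by `alpha / r`, and each adapted weight matrix gains `r * (d_in + d_out)` trainable parameters). The layer count, hidden size, and number of adapted matrices per layer are assumptions for a 7B LLaMA-style model; the notebook does not set `target_modules`, so the actual set depends on the PEFT defaults.

```python
def lora_scaling(alpha, r):
    # The low-rank update is multiplied by alpha / r
    return alpha / r

def lora_params_per_matrix(r, d_in, d_out):
    # A has shape (r, d_in) and B has shape (d_out, r)
    return r * (d_in + d_out)

# Assumed values, not read from the notebook or the model config
hidden_size = 4096
num_layers = 32
matrices_per_layer = 2  # e.g. two attention projections per layer

scaling = lora_scaling(alpha=16, r=64)
trainable = num_layers * matrices_per_layer * lora_params_per_matrix(64, hidden_size, hidden_size)
print(scaling, trainable)
```

Under these assumptions the adapter is a small fraction of the 7B base weights, which is what makes training on a single Colab GPU plausible.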

# Load Datasets and Train

In [11]:
# Load datasets. Each call reads a single JSON Lines file, which load_dataset exposes
# under the default "train" split, hence split="train" for the held-out file as well.
train_dataset = load_dataset('json', data_files='/content/train.jsonl', split="train")
valid_dataset = load_dataset('json', data_files='/content/test.jsonl', split="train")

# Preprocess datasets: wrap each prompt/response pair in the [INST] ... [/INST] template
def format_examples(examples):
    return {'text': [
        f'[INST] <<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n' + prompt + ' [/INST] ' + response
        for prompt, response in zip(examples['prompt'], examples['response'])
    ]}
train_dataset_mapped = train_dataset.map(format_examples, batched=True)
valid_dataset_mapped = valid_dataset.map(format_examples, batched=True)

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

Map:   0%|          | 0/1125 [00:00<?, ? examples/s]

Map:   0%|          | 0/125 [00:00<?, ? examples/s]


In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)


In [13]:
# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="all",
    evaluation_strategy="steps",
    eval_steps=5  # Evaluate every 5 steps
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset_mapped,
    eval_dataset=valid_dataset_mapped,  # Pass validation dataset here
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
trainer.train()
trainer.model.save_pretrained(new_model)



Map:   0%|          | 0/1125 [00:00<?, ? examples/s]

Map:   0%|          | 0/125 [00:00<?, ? examples/s]

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
5,2.959,3.045397
10,2.9463,2.942352
15,2.8521,2.838099
20,2.7698,2.739173
25,2.6858,2.652681
30,2.6071,2.588998
35,2.5652,2.560271
40,2.535,2.526304
45,2.4554,2.508698
50,2.3055,2.503814
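
Two quick calculations help when reading this table, which ends at step 50. The first estimates how many optimizer steps a full epoch would take, assuming a single GPU and the batch settings above; the rounding for gradient accumulation is an approximation and may differ between library versions. The second converts a validation loss into perplexity, assuming the reported loss is a mean per-token cross-entropy in nats. Both function names are illustrative.

```python
import math

def approx_steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps=1):
    # Number of batches per epoch, then group batches into optimizer steps
    batches = math.ceil(num_examples / per_device_batch_size)
    return max(batches // grad_accum_steps, 1)

def perplexity(mean_cross_entropy):
    # Perplexity is the exponential of the mean cross-entropy (in nats)
    return math.exp(mean_cross_entropy)

steps = approx_steps_per_epoch(1125, 4, 1)  # 1125 training examples, batch size 4
ppl = perplexity(2.503814)                  # validation loss at step 50
print(steps, round(ppl, 2))
```

Under these assumptions, 50 steps cover only part of one epoch, and the step-50 validation loss corresponds to a perplexity of roughly 12.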


In [14]:
# Test the model on a sample question
logging.set_verbosity(logging.CRITICAL)
prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\nWhat is the Hegel's perspective on the reception of their proposed characterization of science and its system? [/INST]" # replace the command here with something relevant to your task
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(prompt)
print(result[0]['generated_text'])



[INST] <<SYS>>
Given a philosophy question, you will respond with like you are Hegel
<</SYS>>

What is the Hegel's perspective on the reception of their proposed characterization of science and its system? [/INST] The reception of their proposed characterization of science and its system is, to be sure, not yet the reception of the science itself. Rather, it is the reception of the science as a system, and it is the reception of the science as a system insofar as it is a system of the science. The reception of the science itself is the reception of the science as a science, or as a system of the science. The reception of the science as a system is the reception of the science as a system of the science, or as a system of the science as a science. The reception of the science as a system is the reception of the science as a system of the science as
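
The printed text above includes the full prompt followed by the model's continuation. A small sketch of two helpers may make this easier to work with: one builds the prompt in the same `[INST] <<SYS>> ... [/INST]` layout used for training, and one keeps only the text after the closing `[/INST]` tag. The names `build_prompt` and `extract_answer` are hypothetical, and `fake_output` is a stand-in string rather than real model output.

```python
def build_prompt(system_message, question):
    # Same layout as the training text, stopping right after [/INST]
    return f"[INST] <<SYS>>\n{system_message.strip()}\n<</SYS>>\n\n{question} [/INST]"

def extract_answer(generated_text):
    # Keep only what follows the first closing [/INST] tag
    _, _, answer = generated_text.partition("[/INST]")
    return answer.strip()

system_message = "Given a philosophy question, you will respond with like you are Hegel"
prompt = build_prompt(system_message, "What is subjective?")

# Stand-in for pipeline output, which echoes the prompt before the continuation
fake_output = prompt + " A placeholder continuation."
print(extract_answer(fake_output))
```

Splitting on the tag avoids relying on an exact string match with the prompt, which can fail if the tokenizer round-trip changes whitespace.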


# Run Inference

In [20]:
from transformers import pipeline

prompt = f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n What is subjective. [/INST]" # replace the command here with something relevant to your task
num_new_tokens = 200  # change to the number of new tokens you want to generate

# Count the number of tokens in the prompt
num_prompt_tokens = len(tokenizer(prompt)['input_ids'])

# Calculate the maximum length for the generation
max_length = num_prompt_tokens + num_new_tokens

gen = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=max_length)
result = gen(prompt)
print(result[0]['generated_text'].replace(prompt, ''))

OutOfMemoryError: ignored

# Merge the model and store in Google Drive

In [19]:
# Merge and save the fine-tuned model
from google.colab import drive
drive.mount('/content/drive')

model_path = "/content/drive/My Drive/cs231/vicuna-ps"  # change to your preferred path

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

OutOfMemoryError: ignored
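
A likely, though unconfirmed, contributor to this error is that the cell loads a second copy of the base model in FP16 onto the same GPU (`device_map = {"": 0}`) while the 4-bit model from training may still be resident. The sketch below is a rough weight-only estimate; it ignores activations, the generation cache, and quantization overhead, and the 7e9 parameter count is an assumed round figure for a "7b" model.

```python
def weight_memory_gib(num_params, bits_per_param):
    # Bytes needed for the weights alone, converted to GiB
    return num_params * bits_per_param / 8 / 2**30

num_params = 7e9  # assumption: round figure for a 7B-parameter model

fp16 = weight_memory_gib(num_params, 16)
nf4 = weight_memory_gib(num_params, 4)
print(f"fp16: {fp16:.1f} GiB, 4-bit: {nf4:.1f} GiB")
```

If the two copies together exceed the GPU's memory, restarting the runtime before running the merge cell, or loading the FP16 model on the CPU, are options worth trying.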

# Load a fine-tuned model from Drive and run inference

In [None]:
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer

drive.mount('/content/drive')

model_path = "/content/drive/My Drive/cs231/vicuna-ps"  # path used by the merge cell above; change it if your model is saved elsewhere

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
from transformers import pipeline

prompt = "What is 2 + 2?"  # change to your desired prompt
gen = pipeline('text-generation', model=model, tokenizer=tokenizer)
result = gen(prompt)
print(result[0]['generated_text'])