Cerebras-GPT

In [1]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

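tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-111M")
# The GPT-2 tokenizer has no pad token. Reuse EOS rather than adding a new '[PAD]' token,
# which would get an id (50257) with no row in the model's 50257-entry embedding matrix.
tokenizer.pad_token = tokenizer.eos_token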

model = AutoModelForCausalLM.from_pretrained("cerebras/Cerebras-GPT-111M")

In [2]:
# Use Apple's MPS backend when it is available at runtime, otherwise fall back to CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
device

device(type='mps')

In [3]:
# model = model.to(device)
model.device

device(type='cpu')

In [4]:
prompt = 'How many enterprise Tactics IDs are in the Mitre Att&ck v13 matrix released in April 2023?\n\n Assistant:'

In [5]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
generated_text = pipe(prompt, max_length=256, do_sample=False, no_repeat_ngram_size=2)[0]
print(generated_text['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


How many enterprise Tactics IDs are in the Mitre Att&ck v13 matrix released in April 2023?

 Assistant:  
  -  The Mitra v14 matrix is a very useful tool for the analysis of the data.

-  A very good tool to analyze the information in a way that is very easy to understand.

  1.  In the case of a single enterprise, the number of enterprise types is not very large. The number is usually very small. In this case, a large number will be very few. This is the reason why the enterprise is so large that it is difficult to find a solution. It is also very difficult for a company to solve a problem. A company that has a lot of data is extremely difficult. For example, if a business is to be able to perform a task in which a customer is able, it will not be possible to do this. However, in this situation, there is no way to know the solution, and the company will have to make a decision. If a person is going to have a good solution in order to get a better solution then the problem will become 
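This greedy run uses `no_repeat_ngram_size=2`, which forbids any bigram from appearing twice in the output. A minimal pure-Python sketch of the blocking rule, assuming token IDs are plain integers; `banned_next_tokens` is a hypothetical helper for illustration, not a transformers function:

```python
def banned_next_tokens(tokens: list[int], n: int) -> set[int]:
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n - 1:
        return set()
    # The last n-1 tokens are the prefix of the n-gram about to be generated.
    prefix = tuple(tokens[len(tokens) - n + 1:])
    banned = set()
    # Scan every complete n-gram so far; if its first n-1 tokens match the
    # current prefix, its final token may not be generated again.
    for start in range(len(tokens) - n + 1):
        ngram = tuple(tokens[start:start + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# Sequence "A B C A": the bigram (A, B) already occurred, so B is banned after "A".
A, B, C = 1, 2, 3
print(banned_next_tokens([A, B, C, A], n=2))  # {2}
```

Blocking repeats this way stops verbatim loops, but, as the output above shows, it does not make a small base model answer factual questions.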

PyTorch-based inference

In [6]:
# Encode the prompt using the tokenizer
input_ids = tokenizer.encode(prompt, return_tensors="pt")

In [7]:
# Generate text based on the encoded prompt
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        do_sample=True,
        top_p=0.75,
        top_k=85,
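        temperature=1.99,  # > 1 flattens the distribution, making output more random
        typical_p=1,     # 1.0 leaves typical sampling disabled
        repetition_penalty=1.3,
        max_length=250,  # Maximum total length in tokens, prompt included
        num_beams=5,     # Combined with do_sample=True, this is beam-search multinomial sampling
#         early_stopping=True,  # Beam search only: stop once num_beams finished candidates exist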
    )

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
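`repetition_penalty=1.3` rescales the logits of every token already present in the sequence (prompt included). The sketch below follows the CTRL-style rule used by transformers' `RepetitionPenaltyLogitsProcessor`: positive logits are divided by the penalty and negative ones multiplied, so both become less likely. Logits are a plain dict here purely for illustration:

```python
def apply_repetition_penalty(logits: dict[int, float], seen: set[int], penalty: float) -> dict[int, float]:
    """Penalize tokens that already occurred (penalty > 1 discourages repeats)."""
    out = dict(logits)
    for tok in seen:
        if tok in out:
            score = out[tok]
            # Dividing a positive score or multiplying a negative one both lower it.
            out[tok] = score / penalty if score > 0 else score * penalty
    return out


logits = {0: 2.6, 1: -1.0, 2: 0.5}  # made-up scores; tokens 0 and 1 were already generated
print(apply_repetition_penalty(logits, seen={0, 1}, penalty=1.3))
```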


In [8]:
# Decode the generated text and print it
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

How many enterprise Tactics IDs are in the Mitre Att&ck v13 matrix released in April 2023?

 Assistant: A.B., D.E., D.E., E.R., D.E., D.E., G.L., M.F., D.E., L.D., R.J., C.S., P.A., T.M., R.S., T.N., T.H., S.P., R.G., S.M., M.W., K.M., T.T., D.M., M.J., K.L., T.N., T.J., T.M., S.P., T.M., R.S., R.S., T.N., T.M., T.N., T.M., T.T., S.P.

  ---------------------------------------------- -------------------------------------------------
  1\. To analyze the use of this data we present the following results: (a) We used the two methods in which we obtained the largest number of variables. (b) We found a significant number of variables that could be accounted for by these variables. (c) We compared the results with
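This run samples with `temperature=1.99`, which is a likely reason the output degenerates into noise. Temperature divides the logits before the softmax: values above 1 flatten the distribution toward uniform, values below 1 sharpen it. A stdlib-only sketch with made-up logits:

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax of logits / temperature, stabilized by subtracting the max."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical scores for four candidate tokens
for t in (0.2, 1.0, 1.99):
    probs = softmax_with_temperature(logits, t)
    # The most likely token's share shrinks as the temperature grows.
    print(f"T={t}: top-token probability = {probs[0]:.3f}")
```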


PyTorch-based fine-tuning

In [11]:
import datasets
dataset = datasets.load_dataset('json', data_files='chat_gpt_context/security_base_sample.json', field='train')

Using custom data configuration default-4d7b996aff28bf98


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Downloading and preparing dataset json/default to /Users/choiwb/.cache/huggingface/datasets/json/default-4d7b996aff28bf98/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5...


Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 6213.78it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 499.26it/s]
                            

Dataset json downloaded and prepared to /Users/choiwb/.cache/huggingface/datasets/json/default-4d7b996aff28bf98/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:00<00:00, 409.32it/s]


In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 20
    })
})
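`load_dataset('json', ..., field='train')` reads the list of records stored under the top-level `"train"` key. Judging by the features above, `security_base_sample.json` presumably has the layout below; the record content is a hypothetical placeholder, not taken from the real file:

```python
import json

# Hypothetical example of the expected layout: a top-level "train" key
# holding a list of {instruction, input, output} records.
sample = {
    "train": [
        {
            "instruction": "Explain the following attack type",
            "input": "XSS",
            "output": "Cross-site scripting injects attacker-controlled script into a web page...",
        }
    ]
}

text = json.dumps(sample, indent=2)
records = json.loads(text)["train"]
print(sorted(records[0].keys()))  # ['input', 'instruction', 'output']
```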

In [13]:
cutoff_len = 512

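def generate_prompt(entry):
    # Chat-style template; the optional 'input' field is appended to the instruction
    if entry['input']:
        return f"User: {entry['instruction']}: {entry['input']}\n\nAssistant: {entry['output']}"
    else:
        return f"User: {entry['instruction']}\n\nAssistant: {entry['output']}"

def tokenize(item, add_eos_token=True):
    result = tokenizer(
        generate_prompt(item),
        truncation=True,
        max_length=cutoff_len,
        padding=False,          # padding is left to the data collator
        return_tensors=None,    # plain Python lists, so EOS can be appended below
    )

    # Append EOS so the model learns where an answer ends, unless the sequence
    # already ends with EOS or has reached cutoff_len
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < cutoff_len
        and add_eos_token
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)

    # Causal LM objective: labels are a copy of the inputs (the model shifts them internally)
    result["labels"] = result["input_ids"].copy()

    return result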
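To see what `tokenize()` produces without downloading a tokenizer, the sketch below swaps in a hypothetical whitespace "tokenizer" (`toy_encode`, with a made-up EOS id of 0) but keeps the same rules: build the `User:`/`Assistant:` prompt, truncate to `cutoff_len`, append EOS only when there is room, and copy `input_ids` into `labels`:

```python
EOS_ID = 0  # hypothetical EOS id for the toy vocabulary


def generate_prompt(entry: dict) -> str:
    if entry["input"]:
        return f"User: {entry['instruction']}: {entry['input']}\n\nAssistant: {entry['output']}"
    return f"User: {entry['instruction']}\n\nAssistant: {entry['output']}"


def toy_encode(text: str, vocab: dict[str, int]) -> list[int]:
    # Stand-in for a real BPE tokenizer: one id per whitespace-separated word (ids start at 1).
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def toy_tokenize(entry: dict, vocab: dict[str, int], cutoff_len: int, add_eos_token: bool = True) -> dict:
    input_ids = toy_encode(generate_prompt(entry), vocab)[:cutoff_len]  # truncation
    attention_mask = [1] * len(input_ids)
    if input_ids and input_ids[-1] != EOS_ID and len(input_ids) < cutoff_len and add_eos_token:
        input_ids.append(EOS_ID)
        attention_mask.append(1)
    # Labels mirror the inputs; a causal LM shifts them by one position internally.
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": input_ids.copy()}


vocab: dict[str, int] = {}
entry = {"instruction": "Define", "input": "XSS", "output": "Cross-site scripting"}
print(generate_prompt(entry))
print(toy_tokenize(entry, vocab, cutoff_len=512))
```

With a small `cutoff_len` the sequence is cut before EOS can be added, mirroring the `len(...) < cutoff_len` check in the real function.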

In [14]:
train_val = dataset["train"].train_test_split(test_size=0.2, shuffle=True, seed=42)
train_data = train_val["train"].shuffle().map(tokenize)
val_data = train_val["test"].shuffle().map(tokenize)

100%|██████████| 16/16 [00:00<00:00, 2259.94ex/s]
100%|██████████| 4/4 [00:00<00:00, 1300.36ex/s]
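Because `labels` is a plain copy of `input_ids`, the loss covers every token, prompt included. GPT-2-style models in transformers shift the labels internally, so the prediction at position t is scored against the token at t+1, and positions labelled -100 are ignored. A stdlib sketch of that computation, with made-up per-position probabilities standing in for the model's softmax output:

```python
import math

IGNORE_INDEX = -100


def causal_lm_loss(probs: list[dict[int, float]], labels: list[int]) -> float:
    """Mean negative log-likelihood with the one-position causal shift.

    probs[t] maps token id -> predicted probability of that token at position t + 1.
    """
    shifted_labels = labels[1:]   # target for position t is labels[t + 1]
    shifted_probs = probs[:-1]    # the last position has no next token to predict
    nll = [
        -math.log(p[target])
        for p, target in zip(shifted_probs, shifted_labels)
        if target != IGNORE_INDEX
    ]
    return sum(nll) / len(nll)


labels = [5, 7, 9, IGNORE_INDEX]  # last position is padding
probs = [{7: 0.5}, {9: 0.25}, {}, {}]
print(causal_lm_loss(probs, labels))  # mean of -log(0.5) and -log(0.25)
```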


In [15]:
'''
if 'model' in globals(): 
    del model
    # torch.cuda.empty_cache()

model = transformers.AutoModelForCausalLM.from_pretrained(
    'cerebras/Cerebras-GPT-111M',    
    
    # load_in_8bit=True,
    # torch_dtype=torch.float16,

    # device_map={'': 0}
    #device = torch.device("cpu")
    # device_map = 'auto'
)
'''


In [16]:
'''
import peft

model = peft.prepare_model_for_int8_training(model)

model = peft.get_peft_model(model, peft.LoraConfig(
    r=8,
    lora_alpha=16,
    # target_modules=["q_proj", "v_proj"],
    target_modules=["c_attn"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
))
'''

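The disabled LoRA cell would train only low-rank adapters on `c_attn`. Assuming one rank-`r` pair per `c_attn` module (A: r × in, B: out × r) in each of the 10 blocks, with `c_attn` mapping 768 → 2304 as in the model printout further down, the trainable parameter count can be estimated as follows; `lora_alpha / r` is the scale peft applies to the adapter output:

```python
d_in = 768             # c_attn input width (hidden size)
d_out = 3 * 768        # c_attn output width: fused q, k, v (2304)
n_layers = 10          # GPT2Block count in the printout
r, lora_alpha = 8, 16  # values from the disabled LoraConfig

# One adapter pair per c_attn: A is (r x d_in), B is (d_out x r).
params_per_module = r * d_in + d_out * r
lora_params = params_per_module * n_layers
scaling = lora_alpha / r  # multiplier applied to the adapter's output

approx_model_params = 111e6  # "111M" from the model name
print(lora_params, f"{lora_params / approx_model_params:.2%}", scaling)  # 245760 0.22% 2.0
```

Under these assumptions the adapters are roughly a quarter of a percent of the model, which is what makes LoRA attractive on modest hardware.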

In [17]:
# import peft

# model = peft.PeftModel.from_pretrained(
#     model,
#     # 'lora-cerebras-gpt2.7b-hh-rlhf-helpful-online',
#     output_dir,
#     torch_dtype=torch.float16
# )

In [18]:
import os
import wandb 

output_dir = 'cerebras-gpt111m-finetune'

use_wandb = True
wandb_run_name = f"{output_dir}-{wandb.util.generate_id()}"

# set the wandb project where this run will be logged
os.environ["WANDB_PROJECT"]=output_dir

# save your trained model checkpoint to wandb
os.environ["WANDB_LOG_MODEL"]="true"

# turn off watch to log faster
os.environ["WANDB_WATCH"]="false"
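A trailing comma on an assignment, as in `use_wandb = True,`, creates a one-element tuple rather than a bool. Any non-empty tuple is truthy, so a conditional such as `"wandb" if use_wandb else ...` would still select W&B even after changing the value to `False,`. A short demonstration (using a throwaway name `flag` so notebook state is untouched):

```python
flag = False,                    # trailing comma: this is the tuple (False,)
print(type(flag), bool(flag))    # <class 'tuple'> True

flag = False                     # plain bool, as intended
print(type(flag), bool(flag))    # <class 'bool'> False
```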

In [19]:
training_args = transformers.TrainingArguments(
    # per_device_train_batch_size=16, 
    per_device_train_batch_size=1, 

    gradient_accumulation_steps=8,  
    num_train_epochs=20,  
    learning_rate=1e-4, 
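    # fp16 mixed precision is meant for CUDA GPUs; left off for MPS/CPU.
    # fp16=True,
    
    optim="adamw_torch",
    logging_steps=10, 
    # Note: this run has only 40 optimizer steps in total (see the progress bar),
    # so evaluation and checkpointing every 200 steps never trigger.
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=200,
    save_steps=200,
    output_dir=output_dir, 
    save_total_limit=3,

    # "none" (not None) disables reporting; None falls back to all installed integrations.
    report_to="wandb" if use_wandb else "none",
    run_name=wandb_run_name if use_wandb else None,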
)
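With 16 training examples, `per_device_train_batch_size=1` and `gradient_accumulation_steps=8` give 2 optimizer steps per epoch and 40 over 20 epochs, matching the `0/40` progress bar in the training log. A quick check, assuming a single device and the Trainer's floor division of batches by accumulation steps:

```python
n_train = 16                        # rows after the 80/20 split (see the map progress bar)
batch_size, grad_accum, epochs = 1, 8, 20
eval_steps = 200

batches_per_epoch = -(-n_train // batch_size)              # ceiling division
steps_per_epoch = max(batches_per_epoch // grad_accum, 1)  # optimizer updates per epoch
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)                           # 2 40
print("eval/save ever triggers:", total_steps >= eval_steps)  # False
```

Setting `eval_steps`/`save_steps` to a value such as 10 would be needed for validation loss and checkpoints to appear in a run this short.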

In [20]:
trainer = transformers.Trainer(
    model=model, 
    train_dataset=train_data,
    eval_dataset=val_data,
    args=training_args, 
    data_collator=transformers.DataCollatorForSeq2Seq(
        tokenizer, pad_to_multiple_of=1, return_tensors="pt", padding=True
    ),
)
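`DataCollatorForSeq2Seq` pads each batch to its longest example: `input_ids` with the pad id, `attention_mask` with 0, and `labels` with -100 so padded positions do not count toward the loss. With a training batch size of 1 no padding actually happens in this run. A stdlib sketch of the behaviour, assuming right-side padding (the GPT-2 tokenizer default) and a caller-supplied `pad_id`:

```python
IGNORE_INDEX = -100


def pad_batch(features: list[dict], pad_id: int) -> dict[str, list[list[int]]]:
    """Right-pad input_ids/attention_mask/labels to the longest example in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        gap = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * gap)
        batch["attention_mask"].append(f["attention_mask"] + [0] * gap)
        batch["labels"].append(f["labels"] + [IGNORE_INDEX] * gap)
    return batch


features = [
    {"input_ids": [11, 12, 13], "attention_mask": [1, 1, 1], "labels": [11, 12, 13]},
    {"input_ids": [21], "attention_mask": [1], "labels": [21]},
]
print(pad_batch(features, pad_id=50256))
```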

In [21]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [22]:
model.config.use_cache = False  # the KV cache only helps generation, not training
result = trainer.train()
model.save_pretrained(output_dir)

wandb.finish()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mkhuam1216[0m ([33mngcsm[0m). Use [1m`wandb login --relogin`[0m to force relogin


  0%|          | 0/40 [00:00<?, ?it/s]You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
 25%|██▌       | 10/40 [00:16<00:46,  1.55s/it]

{'loss': 1.5498, 'learning_rate': 7.500000000000001e-05, 'epoch': 5.0}


 50%|█████     | 20/40 [00:32<00:32,  1.64s/it]

{'loss': 0.2073, 'learning_rate': 5e-05, 'epoch': 10.0}


 75%|███████▌  | 30/40 [00:47<00:15,  1.55s/it]

{'loss': 0.112, 'learning_rate': 2.5e-05, 'epoch': 15.0}


100%|██████████| 40/40 [01:05<00:00,  1.80s/it]

{'loss': 0.0894, 'learning_rate': 0.0, 'epoch': 20.0}
{'train_runtime': 71.835, 'train_samples_per_second': 4.455, 'train_steps_per_second': 0.557, 'train_loss': 0.4896343022584915, 'epoch': 20.0}


100%|██████████| 40/40 [01:07<00:00,  1.69s/it]


Run history:
  train/epoch                     ▁▃▆██
  train/global_step               ▁▃▆██
  train/learning_rate             █▆▃▁
  train/loss                      █▂▁▁
  train/total_flos                ▁
  train/train_loss                ▁
  train/train_runtime             ▁
  train/train_samples_per_second  ▁
  train/train_steps_per_second    ▁

Run summary:
  train/epoch                     20.0
  train/global_step               40.0
  train/learning_rate             0.0
  train/loss                      0.0894
  train/total_flos                6175087902720.0
  train/train_loss                0.48963
  train/train_runtime             71.835
  train/train_samples_per_second  4.455
  train/train_steps_per_second    0.557
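The logged numbers are internally consistent. Assuming the Trainer's default linear decay with no warmup, the learning rate after step s is `1e-4 * (40 - s) / 40`; throughput is samples (16 × 20 epochs) or steps divided by the 71.835 s runtime; and `train_loss` is, up to rounding, the mean of the four logged 10-step interval losses:

```python
import math

base_lr, total_steps = 1e-4, 40
logged_lr = {10: 7.5e-05, 20: 5e-05, 30: 2.5e-05, 40: 0.0}
for step, lr in logged_lr.items():
    expected = base_lr * (total_steps - step) / total_steps  # linear decay, no warmup
    assert math.isclose(expected, lr, abs_tol=1e-12), (step, expected, lr)

runtime = 71.835
print(round(16 * 20 / runtime, 3), round(total_steps / runtime, 3))  # 4.455 0.557

interval_losses = [1.5498, 0.2073, 0.112, 0.0894]
print(round(sum(interval_losses) / len(interval_losses), 4))  # 0.4896
```

The steep drop to a loss near 0.09 on only 16 examples over 20 epochs suggests the model is largely memorizing the training set.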


PyTorch-based fine-tuned model: loading & inference

In [9]:
print(model.dtype)

# model.half()
model.eval()  # inference mode (disables dropout); the cell echoes the module structure

torch.float32


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-9): 10 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): GELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
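The printout is enough to recover the "111M" in the model name. Counting the weights and biases of every layer shown, and assuming `lm_head` shares its weight with `wte` (GPT-2's default weight tying, so it adds no parameters):

```python
vocab, d_model, n_ctx, n_layers = 50257, 768, 2048, 10
d_ff = 4 * d_model  # 3072, the c_fc width


def linear(n_in: int, n_out: int) -> int:
    """Weight plus bias of a Conv1D/Linear layer."""
    return n_in * n_out + n_out


layer_norm = 2 * d_model  # scale and shift
block = (
    layer_norm                        # ln_1
    + linear(d_model, 3 * d_model)    # attn.c_attn (fused q, k, v)
    + linear(d_model, d_model)        # attn.c_proj
    + layer_norm                      # ln_2
    + linear(d_model, d_ff)           # mlp.c_fc
    + linear(d_ff, d_model)           # mlp.c_proj
)
# wte + wpe + transformer blocks + ln_f
total = vocab * d_model + n_ctx * d_model + n_layers * block + layer_norm
print(f"{total:,}")  # 111,050,496
```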

In [16]:
# Load the saved (fine-tuned) model
output_dir = 'cerebras-gpt111m-finetune'
model = transformers.AutoModelForCausalLM.from_pretrained(output_dir)
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-9): 10 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): GELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [17]:
# prompt = 'What is XSS (Cross Site Scripting) attack?'
prompt = 'How many enterprise Tactics IDs are in the Mitre Att&ck v13 matrix released in April 2023?'

In [18]:
inputs = tokenizer(prompt, return_tensors="pt")
# input_ids = inputs["input_ids"].to(model.device)
input_ids = inputs["input_ids"]

In [19]:
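generation_config = transformers.GenerationConfig(
    max_new_tokens=100,        # counts generated tokens only, unlike max_length
    temperature=0.2,           # low temperature: close to greedy decoding
    top_p=0.75,
    top_k=50,
    repetition_penalty=1.2,
    do_sample=True,
    early_stopping=True,       # only affects beam search; no effect while num_beams=1
#     num_beams=5,
    
    pad_token_id=model.config.pad_token_id,  # None for this model, hence the notice below
    eos_token_id=model.config.eos_token_id,
)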
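In this configuration `temperature=0.2` sharpens the distribution, then `top_k=50` and `top_p=0.75` restrict sampling to a small candidate set. A stdlib sketch of the two filters applied in that order (top-k, then nucleus/top-p) on a made-up, already-normalized distribution; transformers applies them to logits rather than probabilities, so treat this as an illustration of the idea rather than its exact implementation:

```python
def top_k_top_p_filter(probs: dict[int, float], top_k: int, top_p: float) -> dict[int, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    ranked = [(tok, p / mass) for tok, p in ranked]  # renormalize after top-k

    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:  # stop once the nucleus covers top_p of the mass
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalized sampling distribution


probs = {0: 0.5, 1: 0.2, 2: 0.15, 3: 0.1, 4: 0.05}
print(top_k_top_p_filter(probs, top_k=4, top_p=0.75))
```

Sampling from this reduced set at a low temperature behaves almost like greedy decoding, which fits the short, on-topic answer produced below.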

In [20]:
'''
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        attention_mask=torch.ones_like(input_ids),
        generation_config=generation_config
    )[0].cuda()
'''
with torch.no_grad():
    output = model.generate(
        input_ids=input_ids,
        attention_mask=torch.ones_like(input_ids),
        generation_config=generation_config
    )[0]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [21]:
result = tokenizer.decode(output, skip_special_tokens=True).strip()
print(result)

How many enterprise Tactics IDs are in the Mitre Att&ck v13 matrix released in April 2023?

Assistant: The Mitre Att&ck v13, announced in April 2023, consists of a total of 14 Enterprise Tactics IDs.
