# Phi-3-TinyStories

Author: [Han@Jina AI](https://twitter.com/hxiao)

This 50M-parameter model reconfigures `Phi-3-mini-128k-instruct` (3.8B parameters) by following the guidelines provided by the [Super Tiny Language Models](https://arxiv.org/abs/2405.14159) paper from A*STAR. It is then trained from scratch on [Microsoft's TinyStories dataset](https://arxiv.org/abs/2305.07759).

Note:
- After creating the model, I copied weights from the pretrained `Phi-3-mini-128k-instruct` model into the new model, truncating each tensor to the smaller shape by keeping its trailing slice along every mismatched dimension. This heuristic serves as a good initialization point for training.
- The A*STAR paper uses Llama2 as the base, so the tokenizer and activation function are different.
- Since the original TinyStories dataset does not contain instruction-following data, for instruction tuning, this notebook uses one fixed instruction: `tell me a story`. A better way would be to [generate synthetic instructions using TinyStories metadata](https://huggingface.co/datasets/roneneldan/TinyStories/discussions/11).
- Given the model's size and the very basic training, I don't expect it to generalize well to any out-of-domain data. At best, it may generalize to other fairy tales (i.e., out-of-distribution).
- For an untrained 50M-parameter version, [please look at this notebook](https://colab.research.google.com/drive/188RpybbauEJKSIRPGL3RZi4Lk66HfBJj).


In [None]:
!pip install -U transformers
!pip install huggingface_hub peft bitsandbytes
!pip install trl xformers flash-attn

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [None]:
!nvidia-smi

Mon May 27 21:03:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:01:00.0 Off |                  N/A |
| 41%   48C    P8              37W / 350W |     28MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:21:00.0 Off |  

In [None]:
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "2"

In [None]:
from transformers.utils import is_flash_attn_2_available
is_flash_attn_2_available()

True

In [None]:
def count_model_parameters(model):
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")

In [None]:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, pipeline
MAX_SEQ_LENGTH = 512

config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-128k-instruct", torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation='flash_attention_2')
# The original Phi-3-128k uses 131072 (2**17) here; A*STAR sets this to 512.
# This does not change the number of parameters, but it saves a lot of VRAM during training.
config.max_position_embeddings = MAX_SEQ_LENGTH
config.num_hidden_layers = 10
config.tie_word_embeddings = True
config.hidden_size = 512
config.intermediate_size = 1536
config.num_attention_heads = 16
config.num_key_value_heads = 16

# Adjust the RoPE scaling factors: each list needs one entry per rotary pair,
# i.e. head_dim // 2 = hidden_size // (num_attention_heads * 2) entries.
required_length = config.hidden_size // (config.num_attention_heads * 2)
config.rope_scaling['long_factor'] = config.rope_scaling['long_factor'][:required_length]
config.rope_scaling['short_factor'] = config.rope_scaling['short_factor'][:required_length]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/3.35k [00:00<?, ?B/s]

configuration_phi3.py:   0%|          | 0.00/10.4k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
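
The length of the RoPE factor lists in the cell above follows from the head dimension. The sketch below redoes that arithmetic in plain Python. The original Phi-3-mini values (`hidden_size` 3072, 32 attention heads) are assumptions that do not appear in this notebook.

```python
# Values set in the modified config above
hidden_size = 512
num_attention_heads = 16

head_dim = hidden_size // num_attention_heads                # size of one attention head
required_length = hidden_size // (num_attention_heads * 2)   # same expression as in the notebook

# One scaling factor per rotary pair, so the list length is head_dim // 2
assert required_length == head_dim // 2

# Assumed (not shown in this notebook) values of the original Phi-3-mini config
orig_hidden_size = 3072
orig_num_attention_heads = 32
orig_length = orig_hidden_size // (orig_num_attention_heads * 2)

print(f"head_dim={head_dim}, factors kept={required_length}, factors in original={orig_length}")
```

Under these assumptions the slicing keeps only the first `required_length` entries of each factor list and drops the rest.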


In [None]:
# Initialize a new model with the modified configuration
new_model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation='flash_attention_2')

In [None]:
new_model.config

Phi3Config {
  "_name_or_path": "microsoft/Phi-3-mini-128k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-128k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-128k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "max_position_embeddings": 512,
  "model_type": "phi3",
  "num_attention_heads": 16,
  "num_hidden_layers": 10,
  "num_key_value_heads": 16,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "long_factor": [
      1.0299999713897705,
      1.0499999523162842,
      1.0499999523162842,
      1.0799999237060547,
      1.2299998998641968,
      1.2299998998641968,
      1.299999952316284

In [None]:
count_model_parameters(new_model)

Total parameters: 66,923,008
Trainable parameters: 66,923,008
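
The printed total can be cross-checked by hand. The sketch below assumes a vocabulary size of 32,064 (the truncated config output above does not show it), bias-free linear layers, and the fused `qkv_proj` / `gate_up_proj` layout that the layer names in this notebook suggest.

```python
vocab_size = 32064          # assumption: not visible in the truncated config output
hidden_size = 512
intermediate_size = 1536
num_hidden_layers = 10

embedding = vocab_size * hidden_size
qkv_proj = hidden_size * (3 * hidden_size)            # fused Q, K, V (16 heads, 16 KV heads)
o_proj = hidden_size * hidden_size
gate_up_proj = hidden_size * (2 * intermediate_size)  # fused gate and up projections
down_proj = intermediate_size * hidden_size
layer_norms = 2 * hidden_size                         # input + post-attention norm weights

per_layer = qkv_proj + o_proj + gate_up_proj + down_proj + layer_norms
final_norm = hidden_size

tied_total = embedding + num_hidden_layers * per_layer + final_norm
untied_total = tied_total + embedding                 # separate output head of the same shape

print(f"tied embeddings:   {tied_total:,}")
print(f"untied embeddings: {untied_total:,}")
```

Under these assumptions the untied figure equals the 66,923,008 printed above, and the tied figure is about 50.5M. One plausible reading, not confirmed by the notebook, is that the "50M" in the title refers to the count with the embedding and output head shared.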


# (optional) Copy weights from pretrained models to the new model

In [None]:
# General function that copies the part of `pre_tensor` that fits into `new_tensor`
def copy_tensor(pre_tensor, new_tensor):
    # For every mismatched dimension, select the trailing min(pre_dim, new_dim) entries;
    # dimensions that already match are taken in full.
    slices = tuple(slice(-min(pre_dim, new_dim), None) if pre_dim != new_dim else slice(None)
                   for pre_dim, new_dim in zip(pre_tensor.shape, new_tensor.shape))

    # Copy the selected sub-tensor into `new_tensor` in place
    new_tensor[slices] = pre_tensor[slices]
    return new_tensor

# Function to copy weights with generalized shape handling
def copy_weights(pretrained_model, new_model):
    pretrained_state_dict = pretrained_model.state_dict()
    new_state_dict = new_model.state_dict()

    for key in new_state_dict.keys():
        if key in pretrained_state_dict:
            pre_tensor = pretrained_state_dict[key]
            new_tensor = new_state_dict[key]
            if new_tensor.shape == pre_tensor.shape:
                new_state_dict[key] = pre_tensor
                print(f'{key} get fully copied')
            else:
                new_state_dict[key] = copy_tensor(pre_tensor, new_tensor)
                print(f'{key} get partial copied')
        else:
            print(f"Skipping {key} as it is not present in the pretrained model.")

    new_model.load_state_dict(new_state_dict)
    print('pretrained weights are copied to the new model')
    return new_model

pretrained_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2'
)
# Copy weights from the pretrained model to the new model
new_model = copy_weights(pretrained_model, new_model)

You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

model.embed_tokens.weight get partial copied
model.layers.0.self_attn.o_proj.weight get partial copied
model.layers.0.self_attn.qkv_proj.weight get partial copied
model.layers.0.mlp.gate_up_proj.weight get partial copied
model.layers.0.mlp.down_proj.weight get partial copied
model.layers.0.input_layernorm.weight get partial copied
model.layers.0.post_attention_layernorm.weight get partial copied
model.layers.1.self_attn.o_proj.weight get partial copied
model.layers.1.self_attn.qkv_proj.weight get partial copied
model.layers.1.mlp.gate_up_proj.weight get partial copied
model.layers.1.mlp.down_proj.weight get partial copied
model.layers.1.input_layernorm.weight get partial copied
model.layers.1.post_attention_layernorm.weight get partial copied
model.layers.2.self_attn.o_proj.weight get partial copied
model.layers.2.self_attn.qkv_proj.weight get partial copied
model.layers.2.mlp.gate_up_proj.weight get partial copied
model.layers.2.mlp.down_proj.weight get partial copied
model.layers.2.i
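
To make the "partial copy" concrete, the sketch below reproduces only the slice computation from `copy_tensor` in plain Python and shows which indices it selects. The helper name `tail_slices` and the example shapes (a 32064 x 3072 embedding shrunk to 32064 x 512) are illustrative assumptions, not values printed by this notebook.

```python
def tail_slices(pre_shape, new_shape):
    """Same slice rule as copy_tensor: trailing entries on mismatched dims, all entries otherwise."""
    return tuple(
        slice(-min(pre_dim, new_dim), None) if pre_dim != new_dim else slice(None)
        for pre_dim, new_dim in zip(pre_shape, new_shape)
    )

# Hypothetical shapes for an embedding matrix before and after shrinking
pre_shape = (32064, 3072)
new_shape = (32064, 512)

row_slice, col_slice = tail_slices(pre_shape, new_shape)

# Applying a slice to a range shows exactly which indices would be read
kept_rows = range(pre_shape[0])[row_slice]
kept_cols = range(pre_shape[1])[col_slice]

print("rows kept:", kept_rows)
print("columns kept:", kept_cols)
```

With these shapes every row is kept, and only the last 512 of the 3072 columns are copied, which is what "truncating at the tails" amounts to in this heuristic.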

In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
new_model.to('cuda')


messages = [
    {"role": "user", "content": '''tell me a story'''},
]

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

pipe = pipeline(
    "text-generation",
    model=new_model,
    tokenizer=tokenizer,
)

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


enaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaenaCountryenaenaenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryenaCountryCountryCountryCountryenaCountryenaCountryCountryCountryCountryCountryCountryCountryenaCountryCountryenaCountryCountryCountryCountryCountryCountryenaCountryenaCountryCountryCountryCountryCountryCountryCountryCountryCountryenaCountryCountryCountryenaCountryCountryenaCountryCountryCountryCountryenaCountryCountryCountryenaCountryCountryCountryCountryCountryCountryCountryCountryCountryCountryCountryenaCountryCountryenaCou

In [None]:
from datasets import load_dataset

dataset = load_dataset("roneneldan/TinyStories")

Repo card metadata block was not found. Setting CardData to empty.
  table = cls._concat_blocks(blocks, axis=0)


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})

In [None]:
def formatting_prompts_func(story):
    # Note: the original TinyStories dataset is NOT instruction-following, so for convenience the instruction is fixed to "tell me a story".
    text = f"<|user|>tell me a story<|end|><|assistant|>{story['text']}<|endoftext|>"
    return {"text": text}
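
As a quick illustration, the sketch below applies the same formatting function to one made-up record and prints the resulting training string. The sample story is hypothetical and not taken from the dataset.

```python
def formatting_prompts_func(story):
    # Same template as in the notebook: fixed instruction, then the story as the assistant turn
    text = f"<|user|>tell me a story<|end|><|assistant|>{story['text']}<|endoftext|>"
    return {"text": text}

# Hypothetical record with the same single 'text' field as TinyStories
sample = {"text": "Once upon a time, a little cat found a red ball."}

formatted = formatting_prompts_func(sample)["text"]
print(formatted)
```

Every training example therefore starts with the same user turn, and only the assistant turn varies.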

In [None]:
from transformers import TrainingArguments

per_device_train_batch_size = 64  # adjust this for larger/smaller VRAM; for a 24 GB RTX 3090 this seems fine
gradient_accumulation_steps = 2
num_train_epochs = 1

args = TrainingArguments(
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=True,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    max_steps=-1,
    num_train_epochs=num_train_epochs,
    evaluation_strategy="steps",
    logging_steps=100,
    output_dir='phi-3-tinystories',
    optim="paged_adamw_32bit",
    bf16=True,
)
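
For a rough sense of scale, the sketch below works out the effective batch size and the approximate number of optimizer steps in one epoch. It assumes training on a single device and ignores the final partial batch; the row count is the training split size reported for this dataset in the notebook.

```python
per_device_train_batch_size = 64
gradient_accumulation_steps = 2
num_devices = 1                 # assumption: single-GPU training
num_train_rows = 2_119_719      # size of the TinyStories train split

# Number of sequences that contribute to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Approximate optimizer steps per epoch (final partial batch ignored)
approx_steps_per_epoch = num_train_rows // effective_batch_size

print(f"effective batch size: {effective_batch_size}")
print(f"approx. optimizer steps per epoch: {approx_steps_per_epoch:,}")
```

Under these assumptions, logging every 100 steps corresponds to 12,800 stories per logged row.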

In [None]:
from trl import SFTTrainer
trainer = SFTTrainer(
    model=new_model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    formatting_func=formatting_prompts_func
)

trainer.train()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


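| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 7.0551        | 6.335695        |
| 200  | 6.0578        | 5.799569        |
| 300  | 5.6199        | 5.442186        |
| 400  | 5.3353        | 5.218199        |
| 500  | 5.1461        | 5.057658        |
| 600  | 5.0088        | 4.939473        |
| 700  | 4.8994        | 4.85007         |
| 800  | 4.8195        | 4.785405        |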
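
The validation losses logged above can be read as perplexities by exponentiating them. The sketch below does that conversion, assuming the reported loss is the mean per-token cross-entropy in nats.

```python
import math

# Validation losses copied from the training log above
validation_loss = {
    100: 6.335695,
    200: 5.799569,
    300: 5.442186,
    400: 5.218199,
    500: 5.057658,
    600: 4.939473,
    700: 4.85007,
    800: 4.785405,
}

# Perplexity = exp(mean cross-entropy), under the assumption stated above
perplexity = {step: math.exp(loss) for step, loss in validation_loss.items()}

for step, ppl in perplexity.items():
    print(f"step {step}: perplexity ~ {ppl:.1f}")
```

The values fall steadily across the logged steps, matching the downward trend in the raw loss.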




In [None]:
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])