# Phi-3-Tiny-Untrained

Author: [Han@Jina AI](https://twitter.com/hxiao)

[2024.05.27: Updated version with training](https://colab.research.google.com/drive/12Bm2wQguDgXhpDwOISlORQaP57t5--70)

This 50M-parameter model reconfigures `Phi-3-mini-128k-instruct` (3.8B parameters), following the hyperparameters given in the [Super Tiny Language Models](https://arxiv.org/abs/2405.14159) paper from A*STAR.

Note that the A*STAR paper uses Llama 2 as its base, so the tokenizer and activation function are different.

In [None]:
!pip install -U transformers
!pip install huggingface_hub peft bitsandbytes
!pip install trl xformers flash-attn

Collecting transformers
  Downloading transformers-4.41.1-py3-none-any.whl (9.1 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.41.0
    Uninstalling transformers-4.41.0:
      Successfully uninstalled transformers-4.41.0
Successfully installed transformers-4.41.1
Collecting peft
  Downloading peft-0.11.1-py3-none-any.whl (251 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Collecting accelerate>=0.21.0 (from peft)
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)

In [None]:
!nvidia-smi

Tue May 28 11:40:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
from transformers.utils import is_flash_attn_2_available
is_flash_attn_2_available()

True

In [None]:
def count_model_parameters(model):
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")

In [None]:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, pipeline


config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-128k-instruct", torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation='flash_attention_2')
# The original Phi-3-128k uses 131072 (2**17) here; A*STAR sets it to 512.
# This does not change the parameter count, but it saves a lot of VRAM during training.
config.max_position_embeddings = 512
config.num_hidden_layers = 10
# Left disabled: this looks like a bug, it cannot be turned on during training.
# With it off, the parameter count rises to 66M. The A*STAR baseline is 50M,
# so their implementation presumably has it on.
# config.tie_word_embeddings = True
config.hidden_size = 512
config.intermediate_size = 1536
config.num_attention_heads = 16
config.num_key_value_heads = 16

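# Adjust the rope scaling factors: the rotary embedding needs one factor per
# pair of head dimensions, i.e. (hidden_size // num_attention_heads) // 2 entries.
required_length = config.hidden_size // (config.num_attention_heads * 2)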
config.rope_scaling['long_factor'] = config.rope_scaling['long_factor'][:required_length]
config.rope_scaling['short_factor'] = config.rope_scaling['short_factor'][:required_length]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/3.35k [00:00<?, ?B/s]

configuration_phi3.py:   0%|          | 0.00/10.4k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
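
The slicing of `long_factor` and `short_factor` in the cell above can be checked by hand. Each attention head has `hidden_size // num_attention_heads` dimensions, and rotary embeddings rotate them in pairs, so one scaling factor is needed per pair. The sketch below works through that arithmetic. It assumes the original Phi-3-mini dimensions are a hidden size of 3072 with 32 heads (worth checking against the published `config.json`), and it uses a placeholder list rather than the real factors; `rope_factor_count` is a hypothetical helper, not part of `transformers`.

```python
def rope_factor_count(hidden_size: int, num_attention_heads: int) -> int:
    """Return the number of rotary scaling factors: one per pair of head dimensions."""
    head_dim = hidden_size // num_attention_heads
    return head_dim // 2


# Assumed original Phi-3-mini dimensions (hidden_size=3072, 32 heads).
original_count = rope_factor_count(3072, 32)
# Dimensions used for the tiny model in this notebook.
tiny_count = rope_factor_count(512, 16)

# Placeholder values standing in for config.rope_scaling["long_factor"].
placeholder_factors = [1.0] * original_count
trimmed_factors = placeholder_factors[:tiny_count]

print(f"original: {original_count}, tiny: {tiny_count}, trimmed: {len(trimmed_factors)}")
```

Under these assumptions the list shrinks from 48 entries to 16, which matches `required_length` in the cell above.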


In [None]:
# Initialize a new model with the modified configuration
new_model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation='flash_attention_2')

modeling_phi3.py:   0%|          | 0.00/73.8k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


In [None]:
new_model

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 512, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-9): 10 x Phi3DecoderLayer(
        (self_attn): Phi3FlashAttention2(
          (o_proj): Linear(in_features=512, out_features=512, bias=False)
          (qkv_proj): Linear(in_features=512, out_features=1536, bias=False)
          (rotary_emb): Phi3SuScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=512, out_features=3072, bias=False)
          (down_proj): Linear(in_features=1536, out_features=512, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=512, out_features

In [None]:
new_model.config

Phi3Config {
  "_name_or_path": "microsoft/Phi-3-mini-128k-instruct",
  "architectures": [
    "Phi3ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "microsoft/Phi-3-mini-128k-instruct--configuration_phi3.Phi3Config",
    "AutoModelForCausalLM": "microsoft/Phi-3-mini-128k-instruct--modeling_phi3.Phi3ForCausalLM"
  },
  "bos_token_id": 1,
  "embd_pdrop": 0.0,
  "eos_token_id": 32000,
  "hidden_act": "silu",
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "max_position_embeddings": 512,
  "model_type": "phi3",
  "num_attention_heads": 16,
  "num_hidden_layers": 10,
  "num_key_value_heads": 16,
  "original_max_position_embeddings": 4096,
  "pad_token_id": 32000,
  "resid_pdrop": 0.0,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "long_factor": [
      1.0299999713897705,
      1.0499999523162842,
      1.0499999523162842,
      1.0799999237060547,
      1.2299998998641968,
      1.2299998998641968,
      1.299999952316284

In [None]:
count_model_parameters(new_model)

Total parameters: 66,923,008
Trainable parameters: 66,923,008
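
The total above can be reproduced from the layer shapes printed by `new_model`. The following is a rough sketch of that arithmetic, not code from `transformers`: `phi3_param_count` is a hypothetical helper, and it assumes bias-free linear layers, one weight vector per RMSNorm, and `num_key_value_heads == num_attention_heads` (so the fused `qkv_proj` has `3 * hidden_size` outputs), as in this configuration.

```python
def phi3_param_count(vocab_size, hidden_size, intermediate_size,
                     num_layers, tie_word_embeddings=False):
    """Estimate the parameter count of a Phi-3 style decoder from its shapes."""
    embed = vocab_size * hidden_size
    qkv_proj = hidden_size * 3 * hidden_size           # fused Q, K, V projection
    o_proj = hidden_size * hidden_size
    gate_up_proj = hidden_size * 2 * intermediate_size  # fused gate and up projection
    down_proj = intermediate_size * hidden_size
    layer_norms = 2 * hidden_size                       # input and post-attention RMSNorm
    per_layer = qkv_proj + o_proj + gate_up_proj + down_proj + layer_norms
    final_norm = hidden_size
    # A tied lm_head shares the embedding matrix and adds no parameters.
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    return embed + num_layers * per_layer + final_norm + lm_head


untied = phi3_param_count(32064, 512, 1536, 10, tie_word_embeddings=False)
tied = phi3_param_count(32064, 512, 1536, 10, tie_word_embeddings=True)
print(f"Untied: {untied:,}")
print(f"Tied:   {tied:,}")
```

The untied figure is 66,923,008, matching the output above. Tying the embeddings would remove one 32064 x 512 matrix and give 50,506,240, which is consistent with the 50M baseline mentioned in the configuration comments.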


In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
new_model.to('cuda:0')


messages = [
    {"role": "user", "content": '''can u tell me a joke?'''},
]

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

pipe = pipeline(
    "text-generation",
    model=new_model,
    tokenizer=tokenizer,
)

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


RuntimeError: FlashAttention only supports Ampere GPUs or newer.
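
The error is consistent with the `nvidia-smi` output earlier: the Tesla T4 is a Turing GPU (compute capability 7.5), while FlashAttention 2 needs Ampere (8.0) or newer. `is_flash_attn_2_available()` returned `True` anyway, so it evidently does not rule out older GPUs in this setup. One possible workaround, sketched below, is to pick the attention implementation from the device's compute capability. `pick_attn_implementation` is a hypothetical helper, and the `"eager"` fallback is an assumption about what the remote Phi-3 modeling code accepts.

```python
def pick_attn_implementation(compute_capability):
    """Choose an attention backend from a (major, minor) CUDA compute capability."""
    major, _minor = compute_capability
    # Ampere and newer GPUs report a major version of 8 or higher.
    return "flash_attention_2" if major >= 8 else "eager"


try:
    import torch

    if torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        print(capability, pick_attn_implementation(capability))
except ImportError:
    # torch is not installed; the helper can still be used with a known capability.
    pass
```

The returned string could then be passed as `attn_implementation` in both the `AutoConfig.from_pretrained` and `AutoModelForCausalLM.from_config` calls in place of the hard-coded `'flash_attention_2'`. On a T4 this selects `"eager"`, which is slower but avoids the error.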