# Lesson 6: Model Example

In this lesson, you will reinforce your understanding of the transformer architecture by exploring the decoder-only [model](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) `microsoft/Phi-3-mini-4k-instruct`.

## Setup

We start by setting up the lab: installing the required libraries (`transformers` and `accelerate`) and suppressing warnings. The `accelerate` library is required by the `Phi-3` model. You don't need to worry about installing these libraries yourself, because the requirements for this lab are already installed.

<p style="background-color:#fff6ff; padding:15px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px"> 💻 &nbsp; <b>Access <code>requirements.txt</code> file:</b> If you'd like to access the requirements file: 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Open"</em>. For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>

In [1]:
#!pip install "transformers>=4.46.1" "accelerate>=0.31.0"

In [2]:
# Quote the version specifiers so the shell does not treat ">" as output redirection
!pip install "transformers>=4.48.3" "accelerate>=1.3.0"

In [3]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

## Loading the LLM

Let's first load the model and its tokenizer. For that, you will import the classes `AutoModelForCausalLM` and `AutoTokenizer`. When you want to process a sentence, you can apply the tokenizer first and then the model in two separate steps. Alternatively, you can create a pipeline object that wraps the two steps and then apply the pipeline to the sentence. You'll explore both approaches in this notebook, which is why you'll also import the `pipeline` function.

<p style="background-color:#fff1d7; padding:15px; "> <b>FYI: </b> The transformers library provides, among others, two types of model classes: <code>AutoModelForCausalLM</code> and <code>AutoModelForMaskedLM</code>. Causal language models are the decoder-only models used for text generation. They are called causal because, to predict the next token, the model can attend only to the preceding tokens (those to its left). Masked language models are the encoder-only models used to produce rich text representations. They are called masked because they are trained to predict a masked (hidden) token in a sequence.</p>
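To make the "causal" idea concrete, here is a minimal sketch of a causal attention mask in plain Python. The word-level tokens are only illustrative (the real tokenizer splits text differently), and `causal_mask` is a hypothetical helper, not part of the transformers library.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix.

    Entry [i][j] is True when position i is allowed to attend to position j,
    which for a causal model means j is at or before i.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Illustrative word-level tokens (an assumption, not real tokenizer output).
tokens = ["The", "capital", "of", "Nigeria", "is"]
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r:>10} attends to {visible}")
```

Each row grows by one entry: the first token sees only itself, while the last token sees the whole prompt. A masked language model has no such restriction and can look in both directions.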

In [4]:
# import the required classes
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [5]:
# Load model and tokenizer from a local checkpoint directory
# (this run points at a Kaggle copy of phi-3.5-mini-instruct)
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/phi-3/pytorch/phi-3.5-mini-instruct/2")

model = AutoModelForCausalLM.from_pretrained(
    "/kaggle/input/phi-3/pytorch/phi-3.5-mini-instruct/2",
    device_map="cpu",
    torch_dtype="auto",
    trust_remote_code=True,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

<p style="background-color:#fff1d7; padding:15px; "> <b> Note:</b> You'll receive a warning that the flash-attention package is not found. That's because flash attention requires certain types of GPU hardware to run. Since the model of this lab is not using any GPU, you can ignore this warning.</p>

Now you can wrap the model and the tokenizer in a [pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline) object whose task is "text-generation".

In [6]:
# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False, # False means to not include the prompt text in the returned text
    max_new_tokens=300, 
    do_sample=False, # no randomness in the generated text
)

Device set to use cpu
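The `do_sample=False` setting means the pipeline always picks the most likely next token. The sketch below contrasts that greedy choice with sampling, using a made-up probability table rather than real model output. `pick_greedy` and `pick_sampled` are hypothetical helpers written for this illustration.

```python
import random

# Hypothetical next-token probabilities; these numbers are invented.
next_token_probs = {"Abuja": 0.7, "Lagos": 0.2, "Kano": 0.1}


def pick_greedy(probs):
    """Return the highest-probability token (analogous to do_sample=False)."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Draw a token in proportion to its probability (analogous to do_sample=True)."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the sampled draws are repeatable
greedy_picks = [pick_greedy(next_token_probs) for _ in range(5)]
sampled_picks = [pick_sampled(next_token_probs, rng) for _ in range(5)]

print("greedy :", greedy_picks)   # the same token every time
print("sampled:", sampled_picks)  # may differ from draw to draw
```

Greedy selection makes the output reproducible, which is convenient for a lesson like this one. Sampling trades that reproducibility for more varied text.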


## Generating a Text Response to a Prompt

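You'll now use the pipeline object (named `generator`) to generate a response to the given prompt. Because the pipeline was created with `max_new_tokens=300`, the response stops after at most 300 new tokens, so longer answers are cut off mid-sentence.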

<p style="background-color:#fff6e4; padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ⏳ <b>Note: </b> The model might take around 2 minutes to generate the output.</p>

In [7]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened. "

output = generator(prompt)

print(output[0]['generated_text'])

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48




Subject: Sincere Apologies for the Gardening Mishap

Dear Sarah,

I hope this message finds you well. I am writing to express my deepest apologies for the unfortunate incident that occurred in your garden yesterday.

As you know, I had been looking forward to helping you with your gardening project. Unfortunately, during our time together, a series of unforeseen events led to an accident that resulted in some damage to your beautiful garden.

The incident began when I was attempting to help you transplant a young sapling into its new location. In my eagerness to ensure the plant's successful relocation, I accidentally knocked over a nearby pot of delicate flowers. The pot shattered on impact, scattering soil and broken pottery across your meticulously maintained garden beds.

I understand that this incident has caused you distress and inconvenience. Please know that I am truly sorry for any upset this may have caused you. I take full responsibility for my actions and the subsequent d

## Exploring the Model's Architecture

You can print the model to take a look at its architecture.

In [8]:
model

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out

The vocabulary size is 32064 tokens, and the size of the vector embedding for each token is 3072.

In [9]:
model.model.embed_tokens

Embedding(32064, 3072, padding_idx=32000)
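An embedding layer can be pictured as a lookup table with one row per token id. The sketch below computes the number of weights implied by the printed shape and then performs the same kind of lookup on a tiny made-up table. `embed` and `toy_table` are hypothetical names used only for this illustration.

```python
VOCAB_SIZE = 32064  # rows in the embedding table, from the printed model
EMBED_DIM = 3072    # columns per row, from the printed model

# Number of weights stored in the embedding table.
embedding_params = VOCAB_SIZE * EMBED_DIM
print(f"{embedding_params:,} embedding weights")

# Toy table: 4 token ids, 3-dimensional vectors (made-up values).
toy_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def embed(token_ids, table):
    """Map each token id to its row in the table."""
    return [table[i] for i in token_ids]


vectors = embed([2, 0, 2], toy_table)
print(vectors)
```

The same token id always maps to the same row, so the two occurrences of id 2 produce identical vectors at this stage. It is the transformer blocks that later make the representations depend on context.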

You can just focus on printing the stack of transformer blocks without the LM head component.

In [10]:
model.model

Phi3Model(
  (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
  (embed_dropout): Dropout(p=0.0, inplace=False)
  (layers): ModuleList(
    (0-31): 32 x Phi3DecoderLayer(
      (self_attn): Phi3Attention(
        (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
        (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
      )
      (mlp): Phi3MLP(
        (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
        (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
        (activation_fn): SiLU()
      )
      (input_layernorm): Phi3RMSNorm()
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      (post_attention_layernorm): Phi3RMSNorm()
    )
  )
  (norm): Phi3RMSNorm()
)

There are 32 transformer blocks or layers. You can access any particular block.

In [11]:
model.model.layers[0]

Phi3DecoderLayer(
  (self_attn): Phi3Attention(
    (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
    (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
    (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
  )
  (mlp): Phi3MLP(
    (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
    (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
    (activation_fn): SiLU()
  )
  (input_layernorm): Phi3RMSNorm()
  (resid_attn_dropout): Dropout(p=0.0, inplace=False)
  (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
  (post_attention_layernorm): Phi3RMSNorm()
)
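To get a feel for where the weights live, you can add up the `Linear` layers in one block by hand. The sketch below copies the shapes from the printout above. It leaves out the `Phi3RMSNorm` weights, because their sizes are not shown in the printout. Note that 9216 is 3 × 3072, which is consistent with the query, key, and value projections being fused into `qkv_proj`. Likewise, 16384 is 2 × 8192, which matches the input size of `down_proj`.

```python
# (in_features, out_features) copied from the printed Phi3DecoderLayer.
linear_shapes = {
    "self_attn.qkv_proj": (3072, 9216),
    "self_attn.o_proj": (3072, 3072),
    "mlp.gate_up_proj": (3072, 16384),
    "mlp.down_proj": (8192, 3072),
}

# Every layer has bias=False, so each holds in_features * out_features weights.
per_layer = {name: n_in * n_out for name, (n_in, n_out) in linear_shapes.items()}
layer_total = sum(per_layer.values())
NUM_LAYERS = 32  # from "(0-31): 32 x Phi3DecoderLayer"

for name, count in per_layer.items():
    print(f"{name:<20} {count:>14,}")
print(f"{'one layer':<20} {layer_total:>14,}")
print(f"{'all 32 layers':<20} {layer_total * NUM_LAYERS:>14,}")
```

Under these assumptions, the MLP projections account for about two thirds of the linear weights in each block, and the 32 blocks together hold roughly 3.6 billion weights.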

## Generating a Single Token to a Prompt

Earlier, you used the pipeline object to generate a text response to a prompt. The pipeline provides an abstraction over the underlying process of text generation, in which the tokens of the response are actually generated one at a time.

Let's now give the model a prompt and check the first token it will generate.

In [12]:
prompt = "The capital of Nigeria is"

You'll first need to tokenize the prompt and get the ids of the tokens.

In [13]:
# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids

tensor([[  450,  7483,   310, 20537,   423,   338]])

Let's now pass the token ids to the stack of transformer blocks (`model.model`, that is, everything before the LM head).

In [14]:
# Get the output of the model before the lm_head
model_output = model.model(input_ids)

The stack of transformer blocks outputs a vector of size 3072 (the embedding size) for each token. Let's check the shape of this output.

In [15]:
# Get the shape of the model output before the lm_head
model_output[0].shape

torch.Size([1, 6, 3072])

The first number represents the batch size, which is 1 in this case since we have one prompt. The second number, 6, is the number of tokens in the prompt. Finally, 3072 is the embedding size (the size of the vector that corresponds to each token).

Let's now get the output of the LM head.

In [16]:
# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

In [17]:
lm_head_output.shape

torch.Size([1, 6, 32064])

For each token in the input prompt, the LM head outputs a vector of size 32064 (the vocabulary size). So there are 6 vectors, each of size 32064. Each vector can be mapped to a probability distribution that shows, for every token in the vocabulary, the probability that it comes after the given token in the input prompt.
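The usual way to map such a vector of raw scores (logits) to a probability distribution is the softmax function. Here is a minimal sketch on a made-up 4-entry vector. The real vectors have 32064 entries, and `toy_logits` is invented for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits over a 4-token vocabulary.
toy_logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(toy_logits)

print([round(p, 3) for p in probs])
print("sum =", round(sum(probs), 6))
```

Softmax preserves the ordering of the scores, so the entry with the largest logit is also the most probable token.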

Since we're interested in generating the output token that comes after the last token in the input prompt ("is"), we'll focus on the last vector. So in the next cell, `lm_head_output[0,-1]` is a vector of size 32064 from which you can generate the token that comes after ("is"). You can do that by finding the id of the token that corresponds to the highest value in the vector `lm_head_output[0,-1]` (using `argmax(-1)`, -1 means across the last axis here).

In [18]:
token_id = lm_head_output[0,-1].argmax(-1)
token_id

tensor(1976)

Finally, let's decode the returned token id.

In [19]:
tokenizer.decode(token_id)

'Ab'
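The steps above produce a single token. To generate a full response, the same steps are repeated: the chosen token id is appended to the input and the model is run again. The sketch below shows that loop with a toy stand-in for the model. The six-entry vocabulary, the `NEXT` table, and the functions `toy_logits` and `greedy_generate` are all hypothetical, so this illustrates the control flow rather than how the pipeline is actually implemented.

```python
# Made-up vocabulary and a made-up "most likely next id" table that stand in
# for the tokenizer, the transformer blocks, and the LM head.
VOCAB = ["<eos>", "The", "capital", "is", "Ab", "uja"]
NEXT = {1: 2, 2: 3, 3: 4, 4: 5, 5: 0}


def toy_logits(ids):
    """Return one score per vocabulary entry for the token after ids[-1]."""
    scores = [0.0] * len(VOCAB)
    scores[NEXT[ids[-1]]] = 1.0
    return scores


def greedy_generate(ids, max_new_tokens, eos_id=0):
    """Repeatedly append the highest-scoring next token id."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        scores = toy_logits(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_id == eos_id:  # stop at the end-of-sequence token
            break
        ids.append(next_id)
    return ids


prompt_ids = [1, 2, 3]  # "The capital is"
output_ids = greedy_generate(prompt_ids, max_new_tokens=10)
print(" ".join(VOCAB[i] for i in output_ids))
```

The loop ends either when the end-of-sequence token is chosen or when the `max_new_tokens` budget runs out, which is why a long response from the pipeline can stop mid-sentence.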

In [20]:
prompt = "Help rewrite this using a STAR interview method: I worked with an e-commerce company with a size of 20 people. The company need to predict customer behaviour and sales. I stepped into the project collaborating with the sales and marketing team to apply a machine-learning model to predict customer behaviours and sales. I sourced my data from existing stakeholders involved, cleaned the data, checked for outliers, split the dataset into train and test sets, normalised the provided dataset employed different algorithms i.e. MLP, logistic regression and random forest, the reason for this is the problem I'm solving is a classification problem and I made predictions on the test set. I employed evaluation metrics to measure the model's performance using precision, recall, F1-score, and ROC-AUC. The model achieved an accuracy of 76%, resulting in a 15% increase in sales. One of the challenges encountered was outliers and missing data points, I applied a median imputation to handle the missing data and remove features with the high outliers. The company business was impacted as they made required adjustments based on the sales prediction, I created an interactive dashboard using PowerBI to show the predicted sales revenue over time, with filters for different product categories and regions. The stakeholders, including the sales and marketing teams, could use the dashboard to explore the data, identify trends, and make data-driven decisions. I received positive feedback from the stakeholders on the effectiveness of the visualizations in communicating the insights and results."

output = generator(prompt)

print(output[0]['generated_text'])



# Answer
**Situation:**
At an e-commerce company with a team of 20, I was tasked with enhancing our sales forecasting capabilities. The company aimed to better understand customer behavior to drive sales growth.

**Task:**
I collaborated with the sales and marketing teams to implement a machine learning model that could accurately predict customer behaviors and sales trends.

**Action:**
To build the predictive model, I began by sourcing and cleaning the data from our stakeholders. I meticulously checked for outliers and handled missing data using median imputation. I then split the dataset into training and testing sets and normalized the data to prepare it for various algorithms.

I experimented with different machine learning algorithms, including Multilayer Perceptron (MLP), logistic regression, and random forest, to find the most effective model for our classification problem. After training the models, I evaluated their performance using precision, recall, F1-score, and ROC-AUC

In [21]:
prompt = "What is the capital of Nigeria?"
output = generator(prompt)

print(output[0]['generated_text'])



A) Lagos
B) Abuja
C) Kano
D) Port Harcourt

# Answer
B) Abuja

Abuja was designated as the capital of Nigeria in 1991, replacing Lagos. It is located in the center of the country and was chosen for its central location, which is more accessible to all parts of the nation.


<p style="background-color:#f2f2ff; padding:15px; border-width:3px; border-color:#e2e2ff; border-style:solid; border-radius:6px"> ⬇
&nbsp; <b>Download Notebooks:</b> If you'd like to download the notebook: 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>. For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>