# Running Inference with the Mistral 7B Model

In this notebook, we'll set up and utilize the Mistral 7B "Instruct" model. Our primary objective is to perform inference on this model and experiment with various completions.


### Setup Runtime
For fine-tuning Llama, a GPU instance is essential. Follow the directions below:

1. Go to `Runtime` (located in the top menu bar).
2. Select `Change Runtime Type`.
3. Choose `T4 GPU` (or a comparable option).


### Install Transformers Library from GitHub

The code below installs the `transformers` library directly from the HuggingFace GitHub repository.



In [1]:
!pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-lkhsu3ei
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-lkhsu3ei
  Resolved https://github.com/huggingface/transformers to commit daf281f44f654abd5e7ed07e37985cee2250c3af
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting huggingface-hub<1.0,>=0.23.0 (from transformers==4.42.0.dev0)
  Downloading huggingface_hub-0.23.0-py3-none-any.whl (401 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m401.2/401.2 kB[0m [31m7.3 MB/s[0m eta [36m0:00:00[0m
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
  Created wheel for transformers: filename=transformers-4.42.0.dev0-p

### Installing Additional Libraries

The following commands install several libraries:

- `accelerate`: A library from HuggingFace that aids in utilizing hardware accelerators like GPUs and TPUs more efficiently.
- `bitsandbytes`: Provides fast gradient compression, beneficial for accelerated training, particularly in distributed scenarios.
- `sentencepiece`: A library for Neural Network-based text processing, often used in tokenization processes for language models.

The `-q` flag ensures a quiet installation, minimizing the log output.



In [2]:
!pip install -q peft  accelerate bitsandbytes safetensors

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m251.6/251.6 kB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.6/302.6 kB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m119.8/119.8 MB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:
!pip install sentencepiece




### Model Initialization and Setup

In this section:

- **torch**: The PyTorch library is imported, which will be used for tensor operations and to leverage GPU acceleration.
  
- **AutoModelForCausalLM**: From the HuggingFace Transformers library, this class provides an interface to load models designed for causal language modeling. Causal language models predict the next token in a sequence.

- **AutoTokenizer**: This class is used to load tokenizers that can convert text into tokens suitable for the input of a transformer model.

- `model_name`: Defines the identifier for the model we want to load. In this case, we're using the sharded version of the Mistral-7B model named [`"filipealmeida/Mistral-7B-Instruct-v0.1-sharded"`](https://huggingface.co/filipealmeida/Mistral-7B-Instruct-v0.1-sharded).


In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer,AutoModelForSequenceClassification
import transformers

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"


### Setting up the BitsAndBytes Configuration

The code block below configures the `BitsAndBytes` quantization settings, which are designed to optimize model performance by reducing the memory requirements of the model parameters:

- `load_in_4bit`: This flag, set to `True`, instructs the model to load its weights in 4-bit quantization. This can reduce memory usage significantly, allowing for larger models to fit into memory.

- `bnb_4bit_use_double_quant`: When set to `True`, this flag enables double quantization, which can further enhance the efficiency of 4-bit quantization.

- `bnb_4bit_quant_type`: Specifies the type of 4-bit quantization to use. The value `"nf4"` represents a specific form of quantization, but details on this are needed for a more complete description.

- `bnb_4bit_compute_dtype`: This defines the data type to use for computations when the model weights are quantized. Here, `torch.bfloat16` is used, which is a 16-bit floating point representation that offers a balance between precision and memory usage.


In [5]:
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

### Loading the Pretrained Model with Quantization

The code below is responsible for loading our pretrained Mistral-7B model, utilizing the previously configured `BitsAndBytes` quantization settings:

- `model_name`: Specifies the identifier for the pretrained model we want to load, which we've previously set to the sharded version of the Mistral-7B model.

- `load_in_4bit`: With this set to `True`, the model loads its weights using 4-bit quantization, which significantly reduces memory requirements.

- `torch_dtype`: Specifies the data type for tensor computations. We've set it to `torch.bfloat16` to strike a balance between memory efficiency and computational precision.

- `quantization_config`: We provide the `BitsAndBytes` configuration (`bnb_config`) established in the previous step to apply the specified quantization settings during model loading.

By leveraging these settings, the model is loaded in a memory-optimized manner, ensuring that even large models like Mistral-7B can be effectively used in constrained environments.


In [6]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

In [7]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config

)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/187 [00:00<?, ?B/s]

### Tokenizer Initialization and Configuration

1. **Initialize the Tokenizer**: Using the `AutoTokenizer` class from the `transformers` library, we initialize a tokenizer corresponding to our predefined model, `model_name`.
2. **Set Beginning of Sequence Token**: The `bos_token_id` is set to `1`, designating this token ID as the beginning of a sequence.
3. **Define Stop Tokens**: We define a list of token IDs, `stop_token_ids`, that signify stopping points in token sequences. Here, the token ID `0` is considered a stop token.
4. **Confirmation Print**: A print statement confirms the successful loading of the model into memory.


In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1
stop_token_ids = [0]

print(f"Successfully loaded the model {model_name} into memory")


tokenizer_config.json:   0%|          | 0.00/51.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/73.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Successfully loaded the model meta-llama/Meta-Llama-3-8B-Instruct into memory


### Generating Text with the Model 🚀

1. **Define Instruction Text** 📝: We set up our instruction text in the `text` variable. Remember to replace `~Add your instrunctions here~` with the actual content you wish to provide.
2. **Tokenize Input Text** 🔢: Using our previously initialized `tokenizer`, we convert the instruction text into its tokenized form with `return_tensors="pt"` to get the output as PyTorch tensors.
3. **Model Inference** 🤖: With our tokenized input, we run the model's `generate` function to produce an output. We specify a maximum of 200 new tokens to be generated and enable sampling for diverse outputs.
4. **Decode the Output** 📄: The generated token IDs are decoded back into human-readable text using `tokenizer.batch_decode`.
5. **Print the Result** 🖨️: We display the model's generated output for review.



In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
filename="/content/ft_data.xlsx"

df = pd.read_excel(filename)
df.head()
X_train = list()
X_test = list()
for intent in df['intent'].unique():
    train, test  = train_test_split(df[df.intent==intent],
                                    train_size=0.4,test_size=0.05,
                                    random_state=42)
    X_train.append(train)
    X_test.append(test)
X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)
X_train = X_train.reset_index(drop=True)

In [10]:
X_train[:8]

Unnamed: 0,intent,query
0,Card: temporary limit increase,Can you hike my credit limit just for a short ...
1,Notifications: manage,Canâ€™t deal with all these alerts. Help me se...
2,Spend Account: get cash withdrawal and reload ...,"Help me out, I gotta find a reload center for ..."
3,Spending Tracker: get info,What's the feature to track my financial activ...
4,Document: upload,"I need to send you my proof of address, whatâ€..."
5,Card: disable,"Ugh, lost my wallet. Disable my card before so..."
6,Rewards: opt out,"Yo, how do I stop getting these reward points?"
7,Transaction: history,Where can I find the history of my deposits an...


In [11]:
ans=[]
for query in X_test['query']:
  text =f"""You have to serve as an Conversational Intent Classifier for the query given by the user,
  #             below is the list from which you have to choose the intent.

  #       List of Predefined Intents and there lable:
  #       Card: disable
  #       Card: enable
  #       Card: get shipping status where is
  #       Card: report stolen or lost
  #       Global: get balance
  #       Global: get routing number direct deposit info
  #       Statement: get
  #       Transaction: history
  #       Transaction: report dispute incorrect
  #       User Account: change email address
  #       User Account: change mailing address
  #       Spend Account: transfer funds
  #       Spend Account: transfer funds checks
  #       Spend Account: transfer funds external bank
  #       Savings Account: get info view program
  #       User Account: connect banks
  #       Rewards: view offers
  #       Rewards: opt in
  #       Rewards: opt out
  #       User Account: get help contact customer service
  #       Spending Tracker: get info
  #       Overdraft: opt out
  #       User Account: get secure inbox messages
  #       User Account: get fee plan info
  #       Card: add new
  #       Document: upload
  #       Spend Account: find ATMs
  #       Spend Account: get cash withdrawal and reload locations
  #       User Account: log out
  #       Card: get info status
  #       User Account: esign
  #       Refer a Friend: get info
  #       User Account: change phone number
  #       User Account: edit profile name
  #       Card: reset PIN
  #       Card: cancel close
  #       Card: temporary limit increase
  #       Spend Account: consent to direct deposit
  #       User Account: closure request
  #       Notifications: manage
  #       Notifications: sign up for
  #       Card: report not receiveda
  #       Card: replace or upgrade
  #       User Account: change password post login

  Some Examples for better understanding are given below
 ####
  Query: I want to apply for a temporary limit increase on my card, guide me through it.
  <Intent> Card: temporary limit increase <\Intent>

  Query: 	I'd like to manage my alert settings for low balance warnings
  <Intent>: Notifications: manage <\Intent>

  Query: Can you point me to the nearest cash top-up facility?
  <Intent>: Spend Account: get cash withdrawal and reload locations <\Intent>

  Query: Is there a way to view my expenses breakdown?
  <Intent>: Spending Tracker: get info <\Intent>

  Query: I'm trying to upload my passport scan, can you show me how?
  <Intent> Document: upload <\Intent>

  Query: Ugh, lost my wallet. Disable my card before someone goes on a shopping spree!
  <Intent> Card: disable <\Intent>

  Query: Yo, how do I stop getting these reward points?
  <Intent> Rewards: opt out <\Intent>

  Query: Where can I find the history of my deposits and withdrawals?
  <Intent> Transaction: history <\Intent>
 ####
  Return the intent for below query
  Query:{query}
  """

  encoded = tokenizer(text, return_tensors="pt", add_special_tokens=False)
  model_input = encoded
  generated_ids = model.generate(**model_input, max_new_tokens=70, do_sample=True)
  decoded = tokenizer.batch_decode(generated_ids)
  ans.append(decoded[0])

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for

In [17]:
#import regex library
final=[]
import re
for i in range(len(ans)):
  #Finding the substring

  occur = 9

  #Finding the nth occurrence of substring
  inilist = [i.start() for i in re.finditer('<Intent>', ans[i])]
  if len(inilist)>= occur:
    #Printing result
    idx1=inilist[occur-1]
  inilist1 = [i.start() for i in re.finditer(r'<\\Intent>', ans[i])]
  if len(inilist1)>= occur:
    #Printing result
    idx2=inilist1[occur-1]
  res = ''
  # getting elements in between
  for idx in range(idx1 + len('<Intent>') + 1, idx2):
      res = res + ans[i][idx]
  final.append(res.strip())

In [16]:
print(ans[3])

You have to serve as an Conversational Intent Classifier for the query given by the user,
  #             below is the list from which you have to choose the intent.

  #       List of Predefined Intents and there lable:
  #       Card: disable
  #       Card: enable
  #       Card: get shipping status where is
  #       Card: report stolen or lost
  #       Global: get balance
  #       Global: get routing number direct deposit info
  #       Statement: get
  #       Transaction: history
  #       Transaction: report dispute incorrect
  #       User Account: change email address
  #       User Account: change mailing address
  #       Spend Account: transfer funds
  #       Spend Account: transfer funds checks
  #       Spend Account: transfer funds external bank
  #       Savings Account: get info view program
  #       User Account: connect banks
  #       Rewards: view offers
  #       Rewards: opt in
  #       Rewards: opt out
  #       User Account: get help contact customer serv

In [18]:
final

['Card: report not received',
 'query = query.lower()\n  i',
 '',
 'Card: report lost',
 'Global: get balance',
 'ase answer with the',
 '</Intent>\n  ####',
 'Transaction: history',
 '',
 'User Account: change email address',
 'User Account: change mailing address',
 'Spend Account: transfer funds external bank',
 'ccount: transfer funds checks\n\n  Query:Can I',
 '# For exam',
 'Savings Account: get info view program',
 'User Account: connect banks',
 'he intent label\n  # For exam',
 '#</Intent>\n\n  #</Intent>',
 'ewards: opt out<',
 'User Account: get help contact customer service',
 'p?\n   #Hint: The intent is not in the predefined',
 'Overdraft: opt out',
 'User Account: get secure inbox messages',
 '',
 '',
 'Document: upload',
 'Account: find AT',
 '# The intent is S',
 'User Account: log out',
 'Card: get info status',
 'User Account: esign',
 'Refer a Friend: get info',
 'User Account: change phone number',
 'User Account: edit profile name',
 'Card: reset PIN',
 'Card: can

In [None]:
idx1

2684

In [None]:
idx2

2682

In [None]:
print(ans[1])

You have to serve as an Conversational Intent Classifier for the query given by the user,
  #             below is the list from which you have to choose the intent.

  #       List of Predefined Intents and there lable:
  #       Card: disable
  #       Card: enable
  #       Card: get shipping status where is
  #       Card: report stolen or lost
  #       Global: get balance
  #       Global: get routing number direct deposit info
  #       Statement: get
  #       Transaction: history
  #       Transaction: report dispute incorrect
  #       User Account: change email address
  #       User Account: change mailing address
  #       Spend Account: transfer funds
  #       Spend Account: transfer funds checks
  #       Spend Account: transfer funds external bank
  #       Savings Account: get info view program
  #       User Account: connect banks
  #       Rewards: view offers
  #       Rewards: opt in
  #       Rewards: opt out
  #       User Account: get help contact customer serv

In [None]:
print(ans[3])

You have to serve as an Conversational Intent Classifier for the query given by the user,
  #             below is the list from which you have to choose the intent.

  #       List of Predefined Intents and there lable:
  #       Card: disable
  #       Card: enable
  #       Card: get shipping status where is
  #       Card: report stolen or lost
  #       Global: get balance
  #       Global: get routing number direct deposit info
  #       Statement: get
  #       Transaction: history
  #       Transaction: report dispute incorrect
  #       User Account: change email address
  #       User Account: change mailing address
  #       Spend Account: transfer funds
  #       Spend Account: transfer funds checks
  #       Spend Account: transfer funds external bank
  #       Savings Account: get info view program
  #       User Account: connect banks
  #       Rewards: view offers
  #       Rewards: opt in
  #       Rewards: opt out
  #       User Account: get help contact customer serv

In [20]:
X_test_list=list(X_test['intent'])

In [21]:
count=0
for i in range(len(final)):
  if X_test_list[i].lower()==final[i].lower():
    count+=1
print(count*100/len(final))

47.72727272727273
