# Running Inference with the Mistral 7B Model

In this notebook, we set up a 4-bit quantized Mistral 7B model, run inference on it, and experiment with completions by using few-shot prompts to classify the intent of user queries.


### Setup Runtime
Running inference with Mistral 7B requires a GPU instance. In Google Colab, follow the directions below:

1. Go to `Runtime` (located in the top menu bar).
2. Select `Change Runtime Type`.
3. Choose `T4 GPU` (or a comparable option).
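
After the runtime restarts, it can be worth confirming that a GPU is actually visible. The following is a minimal sketch, assuming PyTorch is already installed in the runtime; `describe_accelerator` is a hypothetical helper name, not part of any library.

```python
def describe_accelerator() -> str:
    """Return a short description of the accelerator visible to PyTorch."""
    try:
        import torch  # assumed to be preinstalled in the runtime
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU visible; check Runtime > Change Runtime Type"

print(describe_accelerator())
```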


### Install Transformers Library from GitHub

The code below installs the `transformers` library directly from the HuggingFace GitHub repository.



In [1]:
!pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-7cax0c5s
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-7cax0c5s
  Resolved https://github.com/huggingface/transformers to commit 1b3dba9417eebe16b7c206d1dfca6a4c7f11dbec
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<1.0,>=0.23.0 (from transformers==4.41.0.dev0)
  Downloading huggingface_hub-0.23.0-py3-none-any.whl (401 kB)
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.41.0.dev0-p

### Installing Additional Libraries

The following commands install several libraries:

- `peft`: HuggingFace's library of parameter-efficient fine-tuning methods such as LoRA. The inference code in this notebook does not call it directly.
- `accelerate`: A library from HuggingFace that handles device placement and helps run models efficiently on hardware accelerators like GPUs and TPUs.
- `bitsandbytes`: Provides the 8-bit and 4-bit quantization routines that `transformers` uses to load models with a reduced memory footprint.
- `safetensors`: A safe and fast file format for storing and loading model weights.
- `sentencepiece`: A subword tokenization library used by the tokenizers of many language models.

The `-q` flag ensures a quiet installation, minimizing the log output.



In [2]:
!pip install -q peft  accelerate bitsandbytes safetensors


In [3]:
!pip install sentencepiece
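
Because `transformers` is installed from the development branch, recording the versions that actually ended up in the environment makes results easier to reproduce. A small sketch using only the standard library; the package list mirrors the install commands above, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["transformers", "peft", "accelerate", "bitsandbytes", "safetensors", "sentencepiece"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```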




### Model Initialization and Setup

In this section:

- **torch**: The PyTorch library is imported, which will be used for tensor operations and to leverage GPU acceleration.
  
- **AutoModelForCausalLM**: From the HuggingFace Transformers library, this class provides an interface to load models designed for causal language modeling. Causal language models predict the next token in a sequence.

- **AutoTokenizer**: This class is used to load tokenizers that can convert text into tokens suitable for the input of a transformer model.

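- `model_name`: Defines the identifier for the model we want to load. In this case, we're using [`"unsloth/mistral-7b-bnb-4bit"`](https://huggingface.co/unsloth/mistral-7b-bnb-4bit), a Mistral 7B checkpoint whose weights are already stored in 4-bit `bitsandbytes` format. The checkpoint name carries no `Instruct` suffix, so treat it as a base (non-instruction-tuned) model unless its model card says otherwise.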


In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers

model_name = "unsloth/mistral-7b-bnb-4bit"


### Setting up the BitsAndBytes Configuration

The code block below configures the `BitsAndBytes` quantization settings, which are designed to optimize model performance by reducing the memory requirements of the model parameters:

- `load_in_4bit`: This flag, set to `True`, instructs the model to load its weights in 4-bit quantization. This can reduce memory usage significantly, allowing for larger models to fit into memory.

- `bnb_4bit_use_double_quant`: When set to `True`, this flag enables double quantization: the per-block quantization constants are themselves quantized, which saves a further fraction of a bit per parameter (roughly 0.4 bits).

- `bnb_4bit_quant_type`: Specifies the type of 4-bit quantization to use. The value `"nf4"` selects the 4-bit NormalFloat data type introduced with QLoRA, whose quantization levels are chosen for normally distributed weights. The alternative value is `"fp4"`, a 4-bit floating point format.

- `bnb_4bit_compute_dtype`: This defines the data type to use for computations when the model weights are quantized. Here, `torch.bfloat16` is used, which is a 16-bit floating point representation that offers a balance between precision and memory usage.
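
To make the idea of block-wise 4-bit quantization concrete, here is a toy sketch in plain Python. It uses uniform absmax quantization, which is a deliberate simplification: `bitsandbytes` uses a non-uniform NF4 codebook and optimized GPU kernels, so this is not its implementation. The function names and the sample weights are made up for illustration.

```python
def quantize_block(values, bits=4):
    """Uniform absmax quantization of one block of floats to signed integer codes."""
    levels = 2 ** (bits - 1) - 1                 # 7 positive levels for 4 bits
    scale = max(abs(v) for v in values) or 1.0   # per-block quantization constant
    codes = [round(v / scale * levels) for v in values]
    return codes, scale

def dequantize_block(codes, scale, bits=4):
    """Map integer codes back to approximate float values."""
    levels = 2 ** (bits - 1) - 1
    return [c / levels * scale for c in codes]

weights = [0.12, -0.40, 0.03, 0.25, -0.08, 0.31]  # made-up weights
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, round(max_error, 4))
```

Each block stores only the small integer codes plus one `scale` constant. Double quantization applies the same idea once more to those constants.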


In [6]:
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

### Loading the Pretrained Model with Quantization

The code below is responsible for loading our pretrained Mistral-7B model, utilizing the previously configured `BitsAndBytes` quantization settings:

- `model_name`: Specifies the identifier for the pretrained model we want to load, which we've previously set to the pre-quantized checkpoint `unsloth/mistral-7b-bnb-4bit`.

- `load_in_4bit`: This argument is commented out in the call below because the same setting is already part of `bnb_config`.

- `torch_dtype`: Specifies the data type for the parameters that are not quantized. We've set it to `torch.bfloat16` to strike a balance between memory efficiency and computational precision.

- `quantization_config`: We provide the `BitsAndBytes` configuration (`bnb_config`) established in the previous step to apply the specified quantization settings during model loading.

By leveraging these settings, the model is loaded in a memory-optimized manner, ensuring that even large models like Mistral-7B can be effectively used in constrained environments.


In [7]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

Unused kwargs: ['quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


model.safetensors:   0%|          | 0.00/4.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
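
The size of the `model.safetensors` download in the log above (4.13 GB) can be sanity-checked with some rough arithmetic. The sketch below assumes about 7.24 billion parameters for Mistral 7B, a figure not stated in this notebook, and counts weights only, ignoring activations and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7.24e9  # assumed parameter count for Mistral 7B

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:9s} ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights take about 14.5 GB in `bfloat16` but only about 3.6 GB at 4 bits. The actual download is somewhat larger, presumably because some tensors stay in higher precision and the quantization constants must be stored as well.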

### Tokenizer Initialization and Configuration

1. **Initialize the Tokenizer**: Using the `AutoTokenizer` class from the `transformers` library, we initialize a tokenizer corresponding to our predefined model, `model_name`.
2. **Set Beginning of Sequence Token**: The `bos_token_id` is explicitly set to `1`, designating this token ID as the beginning of a sequence.
3. **Define Stop Tokens**: We define a list of token IDs, `stop_token_ids`, intended to mark stopping points in generated sequences. Here it contains only the token ID `0`. The list is not passed to `generate` anywhere in this notebook, and the generation log further down shows that the model's `eos_token_id` is `2`.
4. **Confirmation Print**: A print statement confirms that loading has completed.


In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1
stop_token_ids = [0]

print(f"Successfully loaded the model {model_name} into memory")


tokenizer_config.json:   0%|          | 0.00/971 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/438 [00:00<?, ?B/s]

Successfully loaded the model unsloth/mistral-7b-bnb-4bit into memory
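
`stop_token_ids` is defined above but never applied. One simple way to honor such a list, sketched here under the assumption that the generated IDs are available as a plain Python list, is to cut the sequence at the first stop token after generation. `truncate_at_stop` is a hypothetical helper and the token IDs are made up.

```python
def truncate_at_stop(token_ids, stop_ids):
    """Return token_ids up to (not including) the first ID found in stop_ids."""
    stop_set = set(stop_ids)
    for position, token_id in enumerate(token_ids):
        if token_id in stop_set:
            return list(token_ids[:position])
    return list(token_ids)

generated = [1, 733, 16289, 2, 0, 0]        # made-up IDs for illustration
print(truncate_at_stop(generated, [0, 2]))  # prints [1, 733, 16289]
```

With a PyTorch tensor returned by `generate`, a row can be converted first with `.tolist()`.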


### Generating Text with the Model 🚀

The cells below first load the labeled queries and split them per intent, then run the following steps for every query in the test set:

1. **Build the Prompt** 📝: We build a few-shot prompt in the `text` variable. It contains the task description, the list of predefined intents, five labeled examples, and the query to classify.
2. **Tokenize Input Text** 🔢: Using our previously initialized `tokenizer`, we convert the prompt into its tokenized form with `return_tensors="pt"` to get the output as PyTorch tensors.
3. **Model Inference** 🤖: With our tokenized input, we run the model's `generate` function to produce an output. We specify a maximum of 150 new tokens and enable sampling (`do_sample=True`), so outputs can vary between runs.
4. **Decode the Output** 📄: The generated token IDs are decoded back into human-readable text using `tokenizer.batch_decode`. The decoded string contains the prompt followed by the completion.
5. **Collect the Result** 🖨️: Each decoded string is appended to the `ans` list for later post-processing.
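
The prompt-building step can also be factored into a function, so the intent list and the examples are not hard-coded inside the loop. This is only a sketch: `build_prompt` is a hypothetical helper, its wording differs slightly from the prompt used in this notebook, and the sample query is made up.

```python
def build_prompt(intents, examples, query):
    """Assemble a few-shot intent-classification prompt as one string."""
    intent_lines = "\n".join(f"- {name}" for name in intents)
    example_lines = "\n\n".join(
        f"Query: {q}\nIntent: {label}" for q, label in examples
    )
    return (
        "Classify the user's query into exactly one of the intents below.\n\n"
        f"List of Predefined Intents:\n{intent_lines}\n\n"
        f"Examples:\n{example_lines}\n\n"
        "Return the intent for the query below.\n"
        f"Query: {query}\nIntent:"
    )

prompt = build_prompt(
    intents=["Card: disable", "Document: upload"],
    examples=[("I'm trying to upload my passport scan, can you show me how?", "Document: upload")],
    query="Please freeze my card",  # made-up query
)
print(prompt)
```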



In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split

filename = "/content/ft_data.xlsx"
df = pd.read_excel(filename)

# Split each intent separately so that every intent appears in both subsets.
train_parts, test_parts = [], []
for intent in df['intent'].unique():
    train, test = train_test_split(df[df.intent == intent],
                                   train_size=0.4, test_size=0.05,
                                   random_state=42)
    train_parts.append(train)
    test_parts.append(test)

# Shuffle the training rows; the test rows stay grouped by intent.
X_train = pd.concat(train_parts).sample(frac=1, random_state=10).reset_index(drop=True)
X_test = pd.concat(test_parts)
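
With fractional sizes, `train_test_split` has to turn `train_size=0.4` and `test_size=0.05` into whole row counts for each intent. To my understanding, scikit-learn rounds the training count down and the test count up. The sketch below reproduces only that arithmetic; the group sizes are made up, since the number of rows per intent is not shown in this notebook.

```python
import math

def split_sizes(n_rows, train_size=0.4, test_size=0.05):
    """Row counts for fractional train/test sizes (floor for train, ceil for test)."""
    n_train = math.floor(train_size * n_rows)
    n_test = math.ceil(test_size * n_rows)
    return n_train, n_test

for n in (10, 20, 40):  # hypothetical numbers of rows per intent
    print(n, split_sizes(n))
```

Under this rounding, any intent with 20 rows or fewer contributes exactly one test row.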

In [10]:
X_train[:5]

| Unnamed: 0 | intent | query |
|---|---|---|
| 0 | Card: temporary limit increase | I want to apply for a temporary limit increase... |
| 1 | Notifications: manage | I'd like to manage my alert settings for low b... |
| 2 | Spend Account: get cash withdrawal and reload ... | Can you point me to the nearest cash top-up fa... |
| 3 | Spending Tracker: get info | Is there a way to view my expenses breakdown? |
| 4 | Document: upload | I'm trying to upload my passport scan, can you... |


In [23]:
# Collect one raw completion (prompt plus generated text) per test query.
ans=[]
for query in X_test['query']:
  text =f"""You have to serve as a Conversational Intent Classifier for the query given by the user,
  #             below is the list from which you have to choose the intent.

  #       List of Predefined Intents:
  #       Card: disable
  #       Card: enable
  #       Card: get shipping status where is
  #       Card: report stolen or lost
  #       Global: get balance
  #       Global: get routing number direct deposit info
  #       Statement: get
  #       Transaction: history
  #       Transaction: report dispute incorrect
  #       User Account: change email address
  #       User Account: change mailing address
  #       Spend Account: transfer funds
  #       Spend Account: transfer funds checks
  #       Spend Account: transfer funds external bank
  #       Savings Account: get info view program
  #       User Account: connect banks
  #       Rewards: view offers
  #       Rewards: opt in
  #       Rewards: opt out
  #       User Account: get help contact customer service
  #       Spending Tracker: get info
  #       Overdraft: opt out
  #       User Account: get secure inbox messages
  #       User Account: get fee plan info
  #       Card: add new
  #       Document: upload
  #       Spend Account: find ATMs
  #       Spend Account: get cash withdrawal and reload locations
  #       User Account: log out
  #       Card: get info status
  #       User Account: esign
  #       Refer a Friend: get info
  #       User Account: change phone number
  #       User Account: edit profile name
  #       Card: reset PIN
  #       Card: cancel close
  #       Card: temporary limit increase
  #       Spend Account: consent to direct deposit
  #       User Account: closure request
  #       Notifications: manage
  #       Notifications: sign up for
  #       Card: report not receiveda
  #       Card: replace or upgrade
  #       User Account: change password post login

  Some Examples for better understanding are given below
 ####
  Query: I want to apply for a temporary limit increase on my card, guide me through it.
  Intent: Card: temporary limit increase

  Query: I'd like to manage my alert settings for low balance warnings
  Intent: Notifications: manage

  Query: Can you point me to the nearest cash top-up facility?
  Intent: Spend Account: get cash withdrawal and reload locations

  Query: Is there a way to view my expenses breakdown?
  Intent: Spending Tracker: get info

  Query: I'm trying to upload my passport scan, can you show me how?
  Intent: Document: upload
 ####
  Return the intent for below query
  Query:{query}
  Intent: """

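  # Tokenize the prompt and move the tensors to the same device as the model.
  model_input = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
  # Sample up to 150 new tokens; the decoded string holds the prompt plus the completion.
  generated_ids = model.generate(**model_input, max_new_tokens=150, do_sample=True)
  decoded = tokenizer.batch_decode(generated_ids)
  ans.append(decoded[0])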

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
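
Each entry in `ans` contains the whole prompt followed by the completion, and the model may keep generating further `Query:`/`Intent:` pairs after its answer. A lightweight way to pull out the first predicted intent is plain string handling, sketched below. `extract_first_intent` is a hypothetical helper; it assumes the decoded text still contains the sentence "Return the intent for below query" from the prompt and that the end-of-sequence token is rendered as `</s>`.

```python
def extract_first_intent(decoded, marker="Return the intent for below query"):
    """Return the text of the first 'Intent:' line that follows the marker."""
    tail = decoded.split(marker, 1)[-1]          # text after the instructions
    _, found, after = tail.partition("Intent:")  # first "Intent:" after the marker
    if not found or not after.strip():
        return ""
    first_line = after.strip().splitlines()[0]
    return first_line.replace("</s>", "").strip()

sample = (
    "Query: example\n  Intent: Document: upload\n ####\n"
    "  Return the intent for below query\n"
    "  Query:I lost my card\n"
    "  Intent: Card: report stolen or lost\n\n"
    "  Query: something else\n  Intent: Card: enable</s>"
)
print(extract_first_intent(sample))  # prints Card: report stolen or lost
```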

In [24]:
!pip install langchain langchain_openai

Collecting langchain
  Downloading langchain-0.1.20-py3-none-any.whl (1.0 MB)
Collecting langchain_openai
  Downloading langchain_openai-0.1.7-py3-none-any.whl (34 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.6-py3-none-any.whl (28 kB)
Collecting langchain-community<0.1,>=0.0.38 (from langchain)
  Downloading langchain_community-0.0.38-py3-none-any.whl (2.0 MB)
Collecting langchain-core<0.2.0,>=0.1.52 (from langchain)
  Downloading langchain_core-0.1.52-py3-none-any.whl (302 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain)
  Downloading langchai

In [30]:
ans[0]  # inspect the first raw completion (prompt plus generated text)



In [34]:
import os

from langchain_openai import ChatOpenAI

# Read the API key from the environment instead of hard-coding it in the notebook.
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0, api_key=os.environ["OPENAI_API_KEY"])
final=[]
for i in range(len(ans)):
  messages = [
      ("system", """You will be given a prompt together with the output an LLM produced for it. Return only the intent the LLM assigned to the user's query; it must be one of the intents in the list below. The input may contain garbage, so extract just the intent and do not prefix it with the word "Intent".
    Card: disable
    Card: enable
    Card: get shipping status where is
    Card: report stolen or lost
    Global: get balance
    Global: get routing number direct deposit info
    Statement: get
    Transaction: history
    Transaction: report dispute incorrect
    User Account: change email address
    User Account: change mailing address
    Spend Account: transfer funds
    Spend Account: transfer funds checks
    Spend Account: transfer funds external bank
    Savings Account: get info view program
    User Account: connect banks
    Rewards: view offers
    Rewards: opt in
    Rewards: opt out
    User Account: get help contact customer service
    Spending Tracker: get info
    Overdraft: opt out
    User Account: get secure inbox messages
    User Account: get fee plan info
    Card: add new
    Document: upload
    Spend Account: find ATMs
    Spend Account: get cash withdrawal and reload locations
    User Account: log out
    Card: get info status
    User Account: esign
    Refer a Friend: get info
    User Account: change phone number
    User Account: edit profile name
    Card: reset PIN
    Card: cancel close
    Card: temporary limit increase
    Spend Account: consent to direct deposit
    User Account: closure request
    Notifications: manage
    Notifications: sign up for
    Card: report not receiveda
    Card: replace or upgrade
    User Account: change password post login
  """),
      ("human", f"return the intent present in following phrase.{ans[i]}."),
  ]
  final.append(llm.invoke(messages))
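
A second LLM call is one way to clean up the raw completions. A cheaper, deterministic alternative is to snap a string to the closest known label with the standard library's `difflib`. The sketch below is an assumption about how that could look, not what this notebook does; `snap_to_known_intent` and the short `KNOWN` list are illustrative.

```python
import difflib

def snap_to_known_intent(raw, known_intents, cutoff=0.6):
    """Return the known intent closest to `raw`, or None if nothing is similar enough."""
    lowered = {name.lower(): name for name in known_intents}
    matches = difflib.get_close_matches(raw.strip().lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

KNOWN = ["Card: disable", "Card: enable", "User Account: connect banks", "Transaction: history"]
print(snap_to_known_intent("User Account: Connect Banks", KNOWN))
print(snap_to_known_intent("zzzzqqqq", KNOWN))
```

Fuzzy matching can also snap to a wrong label when the raw text is far from every intent, so the `cutoff` value deserves some experimentation.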

In [35]:
# Keep only the text content of each chat response.
ff_ans = [message.content for message in final]

In [36]:
ff_ans

['Card: disable',
 'Card: enable',
 'Card: get shipping status where is',
 'Card: report stolen or lost',
 'Statement: get',
 'Global: get routing number direct deposit info',
 'Transaction: get',
 'Global: get routing number direct deposit info',
 'Transaction: report dispute incorrect',
 'Intent: User Account: change email address',
 'User Account: change mailing address',
 'Spend Account: transfer funds external bank',
 'User Account: change email address',
 'Spend Account: transfer funds external bank',
 'Savings Account: get info view program',
 'User Account: Connect Banks',
 'Rewards: view offers',
 'Notifications: sign up for',
 'Rewards: opt out',
 'User Account: get help contact customer service',
 'Spending Tracker: get info',
 'Overdraft: opt out',
 'User Account: get secure inbox messages',
 'User Account: get help contact customer service',
 'Card: add new',
 'User Account: edit profile name',
 'Spend Account: find ATMs',
 'Spend Account: get cash withdrawal and reload lo

In [39]:
X_test_list=list(X_test['intent'])

In [40]:
# Exact-match accuracy (in percent) of the extracted intents against the gold labels.
count = sum(pred == gold for pred, gold in zip(ff_ans, X_test_list))
print(count*100/len(ff_ans))

65.9090909090909
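
The score above uses exact string comparison, so predictions such as `'Intent: User Account: change email address'` or `'User Account: Connect Banks'` in the `ff_ans` output would be counted as wrong if they differ from the gold label only by a prefix or by letter case. The sketch below normalizes labels before comparing and lists the remaining mismatches. The helper names are hypothetical, and the gold labels in the example are assumed for illustration rather than taken from the dataset.

```python
def normalize_label(label):
    """Lower-case a label and drop a leading 'Intent:' prefix and extra whitespace."""
    text = " ".join(label.split())
    if text.lower().startswith("intent:"):
        text = text[len("intent:"):].strip()
    return text.lower()

def accuracy_report(predictions, gold):
    """Return (accuracy in percent, list of (index, predicted, expected) mismatches)."""
    mismatches = [
        (i, p, g)
        for i, (p, g) in enumerate(zip(predictions, gold))
        if normalize_label(p) != normalize_label(g)
    ]
    accuracy = 100 * (len(gold) - len(mismatches)) / len(gold)
    return accuracy, mismatches

preds = ["Intent: User Account: change email address", "User Account: Connect Banks", "Transaction: get"]
gold = ["User Account: change email address", "User Account: connect banks", "Transaction: history"]  # assumed labels
accuracy, mismatches = accuracy_report(preds, gold)
print(round(accuracy, 1), mismatches)
```

Inspecting the mismatch list shows whether errors come from formatting or from genuinely wrong classifications.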
