<a href="https://colab.research.google.com/github/abdulsamadkhan/Llama2_Chat/blob/main/Evaluating_the_finetuned_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction 📚


## 1️⃣ Installation 💻



In [None]:
!pip install transformers torch accelerate

Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m290.1/290.1 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.7/23.7 MB[0m [31m41.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m823.6/823.6 kB[0m [31m53.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.1/14.1 MB[0m [31m49.4 MB/s[0m eta [3

### 2️⃣ Prerequisites 📝

To load our desired model, `meta-llama/Llama-2-7b-chat-hf`, we first need to authenticate ourselves on Hugging Face. This ensures we have the correct permissions to fetch the model.

- 🤗 `Gain access to the model` on Hugging Face: [Link](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
- Use the Hugging Face CLI to login and verify your authentication status.


In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
!huggingface-cli whoami

abdulsamad
[1morgs: [0m HUnivesity,HUNiversity


### 3️⃣ Loading Model & Tokenizer 🧠

Here, we're loading both the Llama model and its associated tokenizer.
The tokenizer will assist in converting our text prompts into a format that the model can understand and process. 📝


In [None]:
from transformers import AutoTokenizer
import transformers
import torch

model = "HeavenWaters/TaxTajweezLlama7B2626QA3E" # finetunned model for pakistan taxation system

tokenizer = AutoTokenizer.from_pretrained(model, use_auth_token=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/434 [00:00<?, ?B/s]

### 4️⃣ Establishing the Llama Pipeline 🛠️

Let's set up a pipeline for text generation. 🚀 This pipeline simplifies the process of feeding prompts to our model and receiving generated text as output.

### ❗ `torch.d_type` parameter
- `torch.float32` or `torch.float`: Default in PyTorch, balances precision and speed. 🏃‍♂️
- `torch.float16` or `torch.half`: Uses less memory and resources, but less precise. Can speed up models on specific hardware. 🚀
- `torch.float64` or `torch.double`: More precise but requires more resources. 🎯

Note: Not all models and operations support all data types. Changing data types requires careful testing and checking the documentation. 📚

There's no `torch.float8` in PyTorch. For lower precision, consider integer types `torch.int8` or `torch.uint8`, but beware of precision loss and limited support. ⚠️



In [None]:
from transformers import pipeline

llama_pipeline = pipeline(
    "text-generation",  # LLM task
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

config.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/174 [00:00<?, ?B/s]



###  Engagement with Llama

Now that everything is set up, let's see how 🦙 Llama responds to some sample queries. 🎉


In [None]:

def get_llama_response(prompt, do_sample=True, temperature=0.1, top_k=10, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id, max_length=1024):
    """
    Generate a response from the Llama model.

    Parameters:
        prompt (str): The user's input/question for the model.
        do_sample (bool): Whether to use sampling during generation (default is True).
        temperature (float): Sampling temperature (default is 0.5).
        top_k (int): Number of top-k tokens to consider during sampling (default is 10).
        num_return_sequences (int): Number of response sequences to generate (default is 1).
        eos_token_id (int): ID of the end-of-sequence token (default is tokenizer.eos_token_id).
        max_length (int): Maximum length of the generated response (default is 1024).

    Returns:
        None: Prints the model's response.
    """
    # Uncomment the following lines to use the Llama model
    prompt = f"<s>[INST] {prompt} [/INST]"
    response = llama_pipeline(
         prompt,
         do_sample=do_sample,
         top_k=top_k,
         num_return_sequences=num_return_sequences,
         eos_token_id=eos_token_id,
         max_length=max_length,
         temperature=temperature,
         return_full_text=False, # to not repeat the question, set to False
         repetition_penalty =1.2,
     )
    return response


# Example usage:
prompt = 'tell me how to play cricket ?\n'
respopnse = get_llama_response(prompt=prompt)
print("Chatbot:", respopnse[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Chatbot:  '
To play cricket, you will need a bat and ball. The objective of the game is for one team to score runs while preventing the other team from scoring by hitting wickets or catches. There are 11 players on each team with two batsmen at the top of the order who must stay in until they get out. Once both teams have batted once, the side that has scored more points wins.</s>

Here's an example of how to play cricket:

Step 1: Gather all necessary equipment such as a bat, ball, stumps (or wickets), gloves, helmets, pads, etc.

Step 2: Divide the players into two teams with 11 members on each team. Each team should also appoint a captain.

Step 3: Determine which team will bat first by tossing a coin. If the coin lands heads up, then Team A gets to choose whether to bat or field first. If it lands tails up, then Team B gets to make this decision.

Step 4: The team chosen to bat first sends their opening batsman to the crease. They can either hit the ball towards the boundary line o

# do not repeat my question ?
If you'd rather not repeat your question to the llama and only receive its response, you can achieve this by setting return_full_text to False. 🦙

### Reading_Data

In [1]:
import pandas as pd

# URL of the raw CSV file on GitHub
url = 'https://raw.githubusercontent.com/abdulsamadkhan/Llama2_Chat/main/Data/TaxationPakistan.csv'

# Read the CSV file and convert it into a DataFrame
df_tax = pd.read_csv(url)

# Print the first 5 rows of the DataFrame
print(df_tax.columns)



Index(['question', 'contexts', 'answer', 'ground_truths', 'context_precision',
       'faithfulness', 'answer_relevancy', 'BERT Similarity Score',
       'Extracted Score'],
      dtype='object')


### this data has multiple columns for this tutorial we need the
# question and answer


In [2]:
# Create a new DataFrame with only 'question' and 'answer' columns
data_tax = df_tax[['question', 'answer']]
# Take the first 178 instances in the data
data_tax = data_tax[:179]


# Check if there is any NaN value in the DataFrame
is_nan = data_tax.isnull().values.any()

# Print the result
if is_nan:
    print("There are NaN values in the DataFrame.")
else:
    print("There are no NaN values in the DataFrame.")


# Print the first 3 rows of the new DataFrame
print(data_tax.head(3))



There are no NaN values in the DataFrame.
                                            question  \
0  What specific conditions must be met in order ...   
1  What are the incentives offered and related to...   
2  What penalties or default surcharge will be im...   

                                              answer  
0  To claim a foreign tax credit, specifically in...  
1  The incentives related to the recovery and col...  
2  If you fail to comply with filing a return wit...  


In [None]:
question = []
answer = []
# Loop over the DataFrame
i = 0
for index, row in data.iterrows():
    # Access data using column names
    question.append(row['question'])
    answer.append(get_llama_response(prompt=row['question']))
    #i+=1
    #if(i==3):
     # break
# Create a DataFrame using a dictionary
df = pd.DataFrame({'question': question, 'answer': answer})

# Print the DataFrame
print(df)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)  # Replace 'qa_data.csv' with your desired filename

print("DataFrame saved successfully!")



                                              question  \
0    What specific conditions must be met in order ...   
1    What are the incentives offered and related to...   
2    What penalties or default surcharge will be im...   
3    What deductions or tax concessions are availab...   
4    What are the specific rules and criteria for i...   
..                                                 ...   
174  "What are the changes made to the rate of tax ...   
175  "What is the rate of tax for deduction or coll...   
176  What is the commission rate for life insurance...   
177  What is the current rate of tax for collection...   
178  "What is the rate of commission for goods tran...   

                                                answer  
0    [{'generated_text': ' The specific conditions ...  
1    [{'generated_text': ' The incentives offered f...  
2    [{'generated_text': ' If you fail to file your...  
3    [{'generated_text': ' '

Taxpayers may be elig...  
4    [{'generated_

In [3]:
# Import pandas library
import pandas as pd

# Read the CSV file into a pandas dataframe
df = pd.read_csv('data.csv')

# Print the dataframe
print(df.columns)

Index(['question', 'answer'], dtype='object')


Hence, task-specific evaluation metrics have emerged, such as the Bilingual Evaluation Understudy (BLEU) for translation tasks and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for summarization tasks. Both BLEU and ROUGE scores use comparison with reference translations or summaries to determine the performance of the model on the task at hand.

In [4]:
#creating refrence set
reference=[]
for index, row in data_tax.iterrows():
    # Access data using column names
    reference.append(row['answer'])

In [5]:
# creating candidate set
candidate=[]
import ast
#lopin on the predicted/canddiate
for answer in df['answer']:
  candidate.append(ast.literal_eval(answer)[0]['generated_text'])



In [6]:
!pip install rouge

import nltk
from nltk.translate.bleu_score import corpus_bleu
from rouge import Rouge


def calculate_bleu_rouge(references, candidates):
  """
  Calculates BLEU and ROUGE-L scores for a list of references and candidates.

  Args:
      references (list): A list of lists, where each inner list contains one or more reference translations for a sentence.
      candidates (list): A list of strings, where each string is a candidate translation.

  Returns:
      dict: A dictionary containing the average BLEU and ROUGE-L scores.
  """

  # Preprocess text (optional, but recommended for better results)
  # You can add preprocessing steps here, such as tokenization, lowercase conversion,
  # stemming/lemmatization, etc.

  # Calculate BLEU score
  bleu_scores = []
  for i in range(len(references)):
      bleu_score = corpus_bleu([references[i]], [candidates[i]], weights=(1.0,))
      bleu_scores.append(bleu_score)
  average_bleu = sum(bleu_scores) / len(bleu_scores)

  # Calculate ROUGE-L score (using PyRouge)
  rouge = Rouge()
  scores = rouge.get_scores(candidates, references, avg=True)
  average_rouge_l = scores['rouge-l']['f']

  return {'BLEU-1': average_bleu, 'ROUGE-L (F1)': average_rouge_l}


scores = calculate_bleu_rouge(reference, candidate)
print(scores)


Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1
{'BLEU-1': 0.023208419897101616, 'ROUGE-L (F1)': 0.2739643950941333}


In [21]:
import torch
!pip install bert_score

from bert_score import BERTScorer

# Load pre-trained BERT model (you can specify a different model or device if needed)
scorer = BERTScorer(lang="en")  # Assuming English text

def calculate_bert_scores(references, candidates):
  """
  Calculates BERTScore for each pair of references and candidates, and returns average scores.
  """

  all_scores = scorer.score(candidates, references)  # Calculate scores for all pairs
  precision_scores, recall_scores, f1_scores = [], [], []
  for score in all_scores:
    precision_scores.append(score[0])
    recall_scores.append(score[1])
    f1_scores.append(score[2])

  # Calculate average F1 score for all pairs
  precision_score = torch.mean(torch.tensor(precision_scores)).item()
  recall_scores = torch.mean(torch.tensor(recall_scores)).item()
  average_f1_score = torch.mean(torch.tensor(f1_scores)).item()


  #return average_f1_score
  return precision_score,recall_scores,average_f1_score


all_scores = calculate_bert_scores(reference, candidate)

print("(avg-precision, avg-recall, avg-F1 score ):", all_scores)




Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(avg-precision, avg-recall, avg-F1 score ): (0.8742365837097168, 0.8625310063362122, 0.8791654706001282)


#BERT Score

# References:
https://haticeozbolat17.medium.com/text-summarization-how-to-calculate-bertscore-771a51022964#:~:text=BertScore%20is%20a%20method%20used,gram%2Dbased%20metrics%20often%20encounter.

https://arxiv.org/abs/1904.09675

https://github.com/Tiiiger/bert_score