### Model I/O
##### Question 1 - Multiple Model Comparison

In [58]:
prompt = "Name 7 monuments of India."

In [59]:
# Load the 1st LLM (Gemma) through the Hugging Face Inference API.
# Mistral, Bloom, or GPT-2 can be substituted by swapping API_URL.

import os
import requests

# Alternative models for the comparison:
# API_URL = "https://api-inference.huggingface.co/models/bigscience/bloom-560m"
# API_URL = "https://api-inference.huggingface.co/models/mistralai/Mistral-7B-v0.1"
# API_URL = "https://api-inference.huggingface.co/models/openai-community/gpt2"

API_URL = "https://api-inference.huggingface.co/models/google/gemma-7b-it"  # Gemma

# Read the token from the environment instead of hardcoding it in the notebook.
# HF_API_TOKEN is an assumed variable name; use whatever name you export.
headers = {"Authorization": f"Bearer {os.getenv('HF_API_TOKEN')}"}


def query(payload):
    """POST the payload to the Inference API and return the decoded JSON."""
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()


output = query({
    "inputs": prompt,
    "parameters": {"max_length": 500, "min_length": 200},
})

# The API returns a list with one dict per generated sequence.
gemmaResponse = output[0]['generated_text']
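
The indexing `output[0]['generated_text']` above assumes the request succeeded. The hosted Inference API can instead return a JSON object with an `error` field (for example while a model is still loading), in which case that line fails with an unhelpful `KeyError`. A minimal sketch of a defensive accessor follows; `extract_generated_text` is a hypothetical helper name, not part of `requests` or any Hugging Face library, and the sample payloads are made up for illustration.

```python
def extract_generated_text(output):
    """Return the first generated text from a decoded API response, or raise.

    Hypothetical helper: validates the response shape before indexing into it.
    """
    # Error responses are assumed to arrive as a dict carrying an "error" key.
    if isinstance(output, dict) and "error" in output:
        raise RuntimeError(f"Inference API error: {output['error']}")
    # Successful text-generation responses are a non-empty list of dicts.
    if not isinstance(output, list) or not output:
        raise ValueError(f"Unexpected response shape: {output!r}")
    first = output[0]
    if not isinstance(first, dict) or "generated_text" not in first:
        raise ValueError(f"Missing 'generated_text' in: {first!r}")
    return first["generated_text"]


# Illustrative payloads only; no network call is made here.
ok_response = [{"generated_text": "1. Taj Mahal"}]
error_response = {"error": "Model is currently loading"}

print(extract_generated_text(ok_response))
try:
    extract_generated_text(error_response)
except RuntimeError as exc:
    print(f"Caught: {exc}")
```

In the notebook this would replace the direct indexing, e.g. `gemmaResponse = extract_generated_text(output)`.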

In [60]:
# Loading 2nd LLM Model - Llama 2

from openai import OpenAI

client = OpenAI(base_url="http://localhost:2002/v1", api_key="not-needed")
completion = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You give one line meaningful answers."},
        {"role": "user", "content": prompt},
    ],
    temperature=0.7,
)

lamaResponse = completion.choices[0].message.content

In [61]:
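# Use a 3rd LLM's response as the reference for evaluation

# The alias avoids shadowing the `OpenAI` client class imported from `openai` above.
from langchain_openai import OpenAI as LangChainOpenAI
import os

llm = LangChainOpenAI(api_key=os.getenv("OPENAI_API_KEY_PERSONAL"))
referenceResponse = llm.invoke(prompt)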

In [62]:
from rouge_score import rouge_scorer

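scorer = rouge_scorer.RougeScorer(['rougeL'])
# Note: RougeScorer.score expects (target, prediction). Here the model response
# is passed as the target and the reference as the prediction, so the reported
# precision and recall are swapped relative to the usual convention;
# fmeasure is unaffected by the argument order.
scores1 = scorer.score(gemmaResponse, referenceResponse)
scores2 = scorer.score(lamaResponse, referenceResponse)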

In [63]:
print(f"ROUGE scores for Gemma: {scores1}")
print(f"ROUGE scores for Llama: {scores2}")

ROUGE scores for Gemma: {'rougeL': Score(precision=0.6190476190476191, recall=0.4482758620689655, fmeasure=0.52)}
ROUGE scores for Llama: {'rougeL': Score(precision=0.6666666666666666, recall=0.5, fmeasure=0.5714285714285715)}
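
To make the numbers above less opaque, here is a rough pure-Python sketch of how a ROUGE-L score can be computed: tokenize both texts, find the length of their longest common subsequence (LCS), then divide by the prediction length (precision) and the target length (recall). The function names `tokenize`, `lcs_length`, and `rouge_l` are illustrative and not part of `rouge_score`; the tokenizer is an assumed approximation of the library's default (lowercase, alphanumeric runs, no stemming). The two strings are copied from the reference and Llama outputs printed in this notebook.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(target, prediction):
    """Return (precision, recall, f-measure) based on the LCS of the tokens."""
    t, p = tokenize(target), tokenize(prediction)
    lcs = lcs_length(t, p)
    precision = lcs / len(p) if p else 0.0
    recall = lcs / len(t) if t else 0.0
    # F-measure is the harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


reference = (
    "1. Taj Mahal\n2. Red Fort\n3. Qutub Minar\n4. India Gate\n"
    "5. Gateway of India\n6. Charminar\n7. Lotus Temple"
)
llama = (
    "1. Taj Mahal (Agra)\n2. Red Fort (Delhi)\n3. Qutub Minar (Delhi)\n"
    "4. Lotus Temple (Delhi)\n5. Hawa Mahal (Jaipur)\n"
    "6. Akshardham (Guwahati)\n7. Sun Temple (Modhera, Gujarat)"
)

precision, recall, f_measure = rouge_l(target=reference, prediction=llama)
print(f"precision={precision:.3f} recall={recall:.3f} fmeasure={f_measure:.3f}")
```

With the reference as the target, precision is the share of the model's tokens that lie on the common subsequence and recall is the share of the reference's tokens that do. Because the notebook passes the model response as the first argument to `scorer.score`, the same two ratios should appear in the opposite slots of the printed scores, while the F-measure stays the same.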


In [64]:
print(gemmaResponse)

Only name 7 monuments of India.

1. Taj Mahal
2. Qutub Minar
3. Red Fort
4. Victoria Memorial
5. Gateway of India
6. Khajuraho Temple Complex
7. Sanchi Stupa


In [65]:
print(lamaResponse)

1. Taj Mahal (Agra)
2. Red Fort (Delhi)
3. Qutub Minar (Delhi)
4. Lotus Temple (Delhi)
5. Hawa Mahal (Jaipur)
6. Akshardham (Guwahati)
7. Sun Temple (Modhera, Gujarat)


In [66]:
print(referenceResponse)



1. Taj Mahal
2. Red Fort
3. Qutub Minar
4. India Gate
5. Gateway of India
6. Charminar
7. Lotus Temple
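
ROUGE-L compares token sequences, so it also rewards shared list numbers and penalizes extra words such as city names. As a complementary check, the three printed lists can be compared item by item. The sketch below is an assumption-laden illustration, not part of the original evaluation: `parse_items` and `item_overlap` are hypothetical helpers, matching is exact after lowercasing, and parenthesized locations are dropped.

```python
import re


def parse_items(text):
    """Extract the names from lines shaped like '1. Name (optional note)'."""
    items = set()
    for line in text.splitlines():
        match = re.match(r"\s*\d+\.\s*(.+)", line)
        if not match:
            continue  # skip lines that are not numbered list entries
        name = re.sub(r"\(.*?\)", "", match.group(1))  # drop '(Agra)' etc.
        items.add(name.strip().lower())
    return items


def item_overlap(candidate, reference):
    """Return (shared items, precision, recall) at the list-item level."""
    cand, ref = parse_items(candidate), parse_items(reference)
    shared = cand & ref
    precision = len(shared) / len(cand) if cand else 0.0
    recall = len(shared) / len(ref) if ref else 0.0
    return shared, precision, recall


reference = (
    "1. Taj Mahal\n2. Red Fort\n3. Qutub Minar\n4. India Gate\n"
    "5. Gateway of India\n6. Charminar\n7. Lotus Temple"
)
gemma = (
    "Only name 7 monuments of India.\n\n1. Taj Mahal\n2. Qutub Minar\n"
    "3. Red Fort\n4. Victoria Memorial\n5. Gateway of India\n"
    "6. Khajuraho Temple Complex\n7. Sanchi Stupa"
)
llama = (
    "1. Taj Mahal (Agra)\n2. Red Fort (Delhi)\n3. Qutub Minar (Delhi)\n"
    "4. Lotus Temple (Delhi)\n5. Hawa Mahal (Jaipur)\n"
    "6. Akshardham (Guwahati)\n7. Sun Temple (Modhera, Gujarat)"
)

for name, response in [("Gemma", gemma), ("Llama", llama)]:
    shared, precision, recall = item_overlap(response, reference)
    print(f"{name}: shared={sorted(shared)} precision={precision:.2f} recall={recall:.2f}")
```

Exact string matching is deliberately strict here; a spelling variant of the same monument would count as a miss, so treat the result as a rough sanity check alongside ROUGE-L rather than a replacement for it.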


In [67]:
from langchain_core.prompts import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)

systempl = "You analyse precision, recall, and fmeasure scores and explain them to the user in easily understandable English."
sysmsg = SystemMessagePromptTemplate.from_template(systempl)
humtempl = """
Analyse the following models' scores and generate a report.
1. The model {firstModel} got the score of {scores1}.
2. The model {secondModel} got the score of {scores2}.
Explain how each score is calculated and compare both models.
"""
hummsg = HumanMessagePromptTemplate.from_template(humtempl)
chatprompt = ChatPromptTemplate.from_messages([sysmsg, hummsg])
# Use a new name so the original `prompt` string is not overwritten.
analysisPrompt = chatprompt.format_prompt(firstModel="Gemma 7b", secondModel="Llama 7B", scores1=scores1, scores2=scores2)
result = llm.invoke(analysisPrompt)
print(result)


The precision score measures the percentage of correct predictions made by the model. In this case, Gemma 7b has a precision score of 0.619, meaning that 61.9% of its predictions were correct. Llama 7B, on the other hand, has a higher precision score of 0.667, indicating that 66.7% of its predictions were correct.

The recall score measures the percentage of relevant items that were correctly retrieved by the model. Gemma 7b has a recall score of 0.448, meaning that it retrieved 44.8% of the relevant items. Llama 7B has a slightly higher recall score of 0.5, indicating that it retrieved 50% of the relevant items.

The F-measure score is a weighted average of the precision and recall scores. It takes into account both measures to provide a balanced evaluation of the model's performance. Gemma 7b has an F-measure score of 0.52, while Llama 7B has a higher score of 0.571.

Overall, Llama 7B has slightly better scores in all three measures compared to Gemma 7b. This means that Llama 7B is
