#### Testing LLMs

In this section, we discuss testing the LLMs themselves: the same kind of models we used earlier as the judge when evaluating metrics in DeepEval and RAGAs.

<img src="./img/TestingLLMs.png" width="800" height="400" style="display: block; margin: auto;">

#### Simple Code using Ollama for Sentiment Analysis

In [2]:
import ollama

def analyse_sentiment(text):
    prompt = f"Analyse the sentiment of this text and give me POSITIVE, NEGATIVE and NEUTRAL for this {text}"
    
    response = ollama.chat(model='qwen2.5:latest', messages=[{'role': 'user', 'content': prompt}])
    
    return response['message']['content']

texts = [
    "I love this product",
    "I hate the food in this resturant",
    "The movie was just okay"
]

for text in texts:
    sentiment = analyse_sentiment(text)
    print(f"Text: {text}")
    print(f"Sentiment: {sentiment}")

Text: I love this product
Sentiment: The sentiment of the text "I love this product" is:

**POSITIVE**

This statement expresses strong positive feelings towards the product in question.
Text: I hate the food in this resturant
Sentiment: The sentiment of the text "I hate the food in this restaurant" is:

NEGATIVE

This statement expresses a strong dislike or dissatisfaction with the food at the restaurant, which indicates a negative sentiment.
Text: The movie was just okay
Sentiment: The sentiment analysis for "The movie was just okay" would be:

- **POSITIVE**: 0 (There's no positive connotation)
- **NEGATIVE**: 0 (No negative words or connotations are present)
- **NEUTRAL**: 1 (This phrase indicates a neutral stance, suggesting the movie met basic expectations but did not exceed them)

The phrase "just okay" generally implies that the experience was neither particularly good nor bad, hence it is classified as neutral.
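
The three replies above come back in three different shapes (a bold label, a plain label, and a scored list of all three labels), which makes them awkward to assert on in a test. The following is a minimal sketch of one way to normalise such replies. `extract_label` is a hypothetical helper, not part of Ollama, and it assumes the model writes the label in upper case as the prompt requests. It returns `None` when the reply mentions zero labels or more than one.

```python
import re

LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")

def extract_label(reply):
    """Return the single upper-case sentiment label found in a free-form reply.

    Returns None when no label or more than one distinct label is mentioned,
    so the caller can treat the reply as unusable instead of guessing.
    """
    found = {label for label in LABELS if re.search(rf"\b{label}\b", reply)}
    return found.pop() if len(found) == 1 else None

# Shapes taken from the replies recorded above.
print(extract_label('The sentiment of the text "I love this product" is:\n\n**POSITIVE**'))
print(extract_label("- **POSITIVE**: 0\n- **NEGATIVE**: 0\n- **NEUTRAL**: 1"))
```

The second call returns `None` because all three labels appear in the reply. A prompt that asks for exactly one of the three words is likely to be easier to test than one that needs this kind of post-processing.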


#### Text Summarization using Ollama 📚

In [6]:
import ollama

def summarize_text(text):
    prompt = f"Please provide a concise summary of the given text and keep it as simple as possible: {text}"
    
    response = ollama.chat(model='llama3.2:latest', messages=[{'role': 'user', 'content': prompt}])
    
    return response['message']['content']

text = """
        Function calling, which allows LLMs to invoke predetermined functions based on user requests, is a well-established feature of modern AI models. Sometimes referred to as “tool use,” function calling is not mutually exclusive with MCP; the new protocol simply standardizes how this API feature works. 

        Without MCP, when you use a function call directly with an LLM API, you need to:

        Define model-specific function schemas, which are JSON descriptions of the function, acceptable parameters, and what it returns.
        Implement handlers (the actual code that executes when a function is called) for those functions.
        Create different implementations for each model you support.
        MCP standardizes this process by:

        Defining a consistent way to specify tools (functions) across any AI system.
        Providing a protocol for discovering available tools and executing them.
        Creating a universal, plug-and-play format where any AI app can use any tool without custom integration code.
        You might be familiar with AI apps that use function calling, like Custom GPTs using GPT Actions. A Custom GPT can determine which API call resolves the user's prompt, create the necessary JSON, then make the API call with it. While this allows some purpose-built tooling, it’s bound to OpenAI’s ecosystem. MCP brings similar capabilities to any AI application that implements the protocol, regardless of the underlying model vendor.
"""

summary = summarize_text(text)
print(f"Text: {text}")
print(f"Summary: {summary}")


### NLP Library (Transformers) from Hugging Face

In [None]:
# !pip install transformers
# !pip install evaluate
# !pip install tensorflow
# !pip install tf-keras
# !pip install scikit-learn


In [4]:
from transformers import pipeline

classifier = pipeline("text-classification")

classifier("I think I love this food eventhough many people says its terrible and bad and worst")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
Device set to use 0


[{'label': 'NEGATIVE', 'score': 0.9976416826248169}]
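
The sentence above mixes "love" with "terrible", and the default model answers `NEGATIVE` with a very high score. Whether that is correct depends on what the test expects, so a single call is not much of a test. Below is a minimal sketch of a small evaluation loop over labelled cases. `evaluate_classifier`, `fake_classifier` and the two cases are hypothetical and only illustrate the shape; the classifier can be any callable that returns the list-of-dicts format shown above.

```python
def evaluate_classifier(classify, cases):
    """Run `classify` on each (text, expected_label) pair.

    Returns (accuracy, failures), where failures is a list of
    (text, expected, predicted) tuples for the mismatches.
    """
    failures = []
    for text, expected in cases:
        predicted = classify(text)[0]["label"]
        if predicted != expected:
            failures.append((text, expected, predicted))
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures

# Stand-in for a real pipeline so the sketch runs without downloading a model.
def fake_classifier(text):
    label = "POSITIVE" if "love" in text.lower() else "NEGATIVE"
    return [{"label": label, "score": 1.0}]

cases = [
    ("I love this product", "POSITIVE"),
    ("I hate the food in this restaurant", "NEGATIVE"),
]
accuracy, failures = evaluate_classifier(fake_classifier, cases)
print(accuracy, failures)
```

In the notebook, passing the `classifier` pipeline from the cell above in place of `fake_classifier` should work unchanged, since a single-string call returns the same structure.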

#### Text Summarization

In [8]:
from transformers import pipeline

summarizer = pipeline("summarization")

summary = summarizer(text)

No model was supplied, defaulted to google-t5/t5-small and revision df1b051 (https://huggingface.co/google-t5/t5-small).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [10]:
summary[0]['summary_text']

'function calling allows LLMs to invoke predetermined functions based on user requests . the new protocol simply standardizes how this API feature works . MCP brings similar capabilities to any AI application that implements the protocol .'
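
A summary like the one above can be checked mechanically before any LLM-based judging. This is a rough sketch with two cheap signals: how much shorter the summary is than the source, and whether terms the tester considers essential survived. `summary_report` and the list of required terms are assumptions made for illustration, not part of the `transformers` API.

```python
def summary_report(source, summary, required_terms):
    """Return (length_ratio, missing_terms) for a summary.

    length_ratio is summary words divided by source words;
    missing_terms lists required terms absent from the summary (case-insensitive).
    """
    source_words = source.split()
    if not source_words:
        raise ValueError("source text is empty")
    ratio = len(summary.split()) / len(source_words)
    lowered = summary.lower()
    missing = [term for term in required_terms if term.lower() not in lowered]
    return ratio, missing

# Summary recorded above; the source here is only an excerpt of the original text,
# so the ratio is meaningful only when the full `text` is passed instead.
source_excerpt = (
    "Function calling, which allows LLMs to invoke predetermined functions based on "
    "user requests, is a well-established feature of modern AI models."
)
recorded_summary = (
    "function calling allows LLMs to invoke predetermined functions based on user requests . "
    "the new protocol simply standardizes how this API feature works . "
    "MCP brings similar capabilities to any AI application that implements the protocol ."
)
ratio, missing = summary_report(source_excerpt, recorded_summary, ["MCP", "function calling", "handlers"])
print(missing)
```

Here `handlers` is reported as missing, which is the kind of signal a tester can use to decide whether the summary dropped something important.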

#### Testing a custom model from Hugging Face 🧠🤖

In [15]:
from transformers import pipeline

classifier = pipeline("text-classification", model="ExecuteAutomation/bert-base-text-classification-model")

# classifier([
#     'I like learning AI Testing',
#     'I started annoyed with my laptop which has no GPU',
#     'I am low today',
#     'I think I love this food eventhough many people says its terrible and bad and worst'
# ])

classifier('My child plays tennis all the time, and this affects his studies')



[{'label': 'sadness', 'score': 0.28763842582702637}]
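
The score of roughly 0.29 above means the model is far from sure about `sadness`. One simple guard, sketched here under the assumption that a fixed cut-off is acceptable, is to flag predictions below a threshold instead of trusting the top label. `flag_low_confidence` and the 0.5 default are hypothetical choices, not something the model or library prescribes.

```python
def flag_low_confidence(predictions, threshold=0.5):
    """Return the predictions whose score falls below `threshold`."""
    return [p for p in predictions if p["score"] < threshold]

# Prediction recorded above.
recorded = [{"label": "sadness", "score": 0.28763842582702637}]
print(flag_low_confidence(recorded))
```

Flagged items can then be routed to manual review or reported separately in the test run.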

#### Zero-Shot Classification

In [20]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

classifier("I love learning Machine Learning and AI Testing from ExecuteAutomation",
           candidate_labels=["education", "marketing", "motivation", "business"])



No model was supplied, defaulted to FacebookAI/roberta-large-mnli and revision 2a8f12d (https://huggingface.co/FacebookAI/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'I love learning Machine Learning and AI Testing from ExecuteAutomation',
 'labels': ['education', 'motivation', 'business', 'marketing'],
 'scores': [0.48328739404678345,
  0.47531428933143616,
  0.02796463668346405,
  0.013433674350380898]}
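
In the result above, `education` and `motivation` are almost tied, so asserting only on the top label would make a brittle test. A minimal sketch that also reports the gap between the two best scores follows. `top_margin` is a hypothetical helper that works on the dictionary shape shown above.

```python
def top_margin(result):
    """Return (best_label, margin), where margin is the gap between the two highest scores."""
    ranked = sorted(zip(result["scores"], result["labels"]), reverse=True)
    (best_score, best_label), (second_score, _) = ranked[0], ranked[1]
    return best_label, best_score - second_score

# Labels and scores recorded above.
recorded = {
    "labels": ["education", "motivation", "business", "marketing"],
    "scores": [0.48328739404678345, 0.47531428933143616,
               0.02796463668346405, 0.013433674350380898],
}
label, margin = top_margin(recorded)
print(label, round(margin, 4))
```

A margin this small (under 0.01) suggests the test should accept either of the top two labels, or that the candidate labels need to be made more distinct.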

#### NER Classification

In [23]:
from transformers import pipeline

classifier = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True

classifier("Karthik is working in ExecuteAutomation and living in NZ and my email is karthik@ea.com")


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.9893281,
  'word': 'Karthik',
  'start': 0,
  'end': 7},
 {'entity_group': 'ORG',
  'score': 0.9558758,
  'word': 'ExecuteAutomation',
  'start': 22,
  'end': 39},
 {'entity_group': 'LOC',
  'score': 0.9997625,
  'word': 'NZ',
  'start': 54,
  'end': 56},
 {'entity_group': 'PER',
  'score': 0.7651939,
  'word': 'karthik',
  'start': 73,
  'end': 80}]
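
The output above includes character offsets, and the last `PER` entity is the local part of the email address rather than a second mention of the person. Both are things a test can check. The sketch below uses the recorded output directly. `check_spans` and `group_by_type` are hypothetical helpers. A span mismatch is not necessarily a bug, because the pipeline may normalise the reported `word` for some tokenizers.

```python
from collections import defaultdict

def check_spans(text, entities):
    """Return entities whose start/end offsets do not slice out their reported word."""
    return [e for e in entities if text[e["start"]:e["end"]] != e["word"]]

def group_by_type(entities):
    """Collect entity words under their entity_group (PER, ORG, LOC, ...)."""
    grouped = defaultdict(list)
    for e in entities:
        grouped[e["entity_group"]].append(e["word"])
    return dict(grouped)

text = "Karthik is working in ExecuteAutomation and living in NZ and my email is karthik@ea.com"
# Entities recorded above.
entities = [
    {"entity_group": "PER", "score": 0.9893281, "word": "Karthik", "start": 0, "end": 7},
    {"entity_group": "ORG", "score": 0.9558758, "word": "ExecuteAutomation", "start": 22, "end": 39},
    {"entity_group": "LOC", "score": 0.9997625, "word": "NZ", "start": 54, "end": 56},
    {"entity_group": "PER", "score": 0.7651939, "word": "karthik", "start": 73, "end": 80},
]
print(check_spans(text, entities))
print(group_by_type(entities))
```

For this sentence every span matches its word, and the grouping makes it easy to assert, for example, that exactly one `ORG` and one `LOC` were found.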