# LLM Evaluation with MLflow
**What is an LLM (Large Language Model)?**

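***Definition:***

* An LLM is a deep learning model trained on large amounts of text data to understand and generate human-like language. Examples include GPT-2 and GPT-3; BERT is a related encoder-only language model that is used for understanding tasks rather than free-form generation.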

***Key Capabilities:***

* LLMs can perform tasks like text generation, summarization, translation, Q&A, and chat by leveraging contextual patterns learned from massive datasets.

**What is MLflow?**

***Definition:***

* MLflow is an open-source platform for managing the ML lifecycle, including tracking experiments, packaging code, and deploying models.

***Key Features:***

* Tracks parameters, metrics, and artifacts across experiments, and supports logging of models, datasets, visualizations, and more.

**How Does LLM Evaluation Work with MLflow?**

***Evaluation Pipeline:***

* You prompt the LLM with input text, get its output, and compare it against a ground-truth response (reference) using evaluation metrics like ROUGE or BLEU.
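To make the comparison step concrete, here is a minimal sketch of a token-overlap score in plain Python. It is a simplified, ROUGE-1-like F1 written for illustration only; the function name `unigram_f1` and the example strings are assumptions, and the real ROUGE and BLEU metrics loaded later through `evaluate` use more careful tokenization and scoring.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference (ROUGE-1-like)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that match
    recall = overlap / len(ref_tokens)      # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)


# Illustrative strings, not real model output.
model_output = "The capital of France is Paris"
reference = "Paris"
score = unigram_f1(model_output, reference)
print(f"unigram F1 = {score:.3f}")
```

A long output that merely contains the right answer gets full recall but low precision, which is one reason verbose generations score poorly against short references.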

***MLflow Integration:***

An evaluation run can log the following to MLflow:

* Parameters (e.g., model name)

* Artifacts (e.g., prompt-output-reference files)

* Metrics (e.g., ROUGE scores)

This enables experiment tracking and performance comparison across models.
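The three kinds of logged data can be sketched as follows. This is one possible arrangement, not the notebook's own code: `build_run_record` and `log_run` are hypothetical helpers, the sample row and score are made up for illustration, and the artifact file name `samples.json` is an assumption. The MLflow calls used (`start_run`, `log_params`, `log_metrics`, `log_dict`) are part of the MLflow tracking API.

```python
def build_run_record(model_name, samples, scores):
    """Group one evaluation run into the three kinds of data MLflow stores."""
    return {
        "params": {"model": model_name, "num_samples": len(samples)},
        "metrics": {name: float(value) for name, value in scores.items()},
        "artifact": {"samples": samples},
    }


def log_run(record, run_name=None):
    """Write a record to the currently configured MLflow tracking URI."""
    import mlflow  # imported lazily so build_run_record works without MLflow

    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(record["params"])
        mlflow.log_metrics(record["metrics"])
        # log_dict stores the dictionary as a JSON artifact attached to the run.
        mlflow.log_dict(record["artifact"], "samples.json")


# Made-up sample and score, for illustration only.
samples = [{"prompt": "What is the capital of France?",
            "output": "Paris", "reference": "Paris"}]
record = build_run_record("distilgpt2", samples, {"rougeL": 1.0})
print(record["params"])
# log_run(record, run_name="distilgpt2_eval")  # uncomment once MLflow is set up
```

Keeping the record as a plain dictionary makes it easy to inspect what will be logged before anything is sent to the tracking server.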

# **Use Cases of LLM Evaluation with MLflow:**

***Model Comparison & Selection:***

* Helps you evaluate multiple models side-by-side to decide which one performs best on your custom dataset.

***Reproducibility & Experiment Tracking:***

* Keeps a complete log of what was run, with what settings, and how it performed, which is useful in education, research, and production.




# **STEP 1: Install Required Libraries**

In [1]:
!pip install transformers mlflow evaluate rouge_score sacrebleu nltk pyngrok --quiet

# **STEP 2: Import Required Modules**


In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import evaluate
import mlflow
import mlflow.pyfunc
from pyngrok import ngrok
import time
import threading

# **STEP 3: Set Up MLflow Experiment**

In [None]:
# Authenticate ngrok with your authtoken
from pyngrok import ngrok
ngrok.set_auth_token("YOUR-API-KEY")  # Replace with your ngrok authtoken

# Function to launch the MLflow tracking server using a system command
def run_app():
    !mlflow server --host 127.0.0.1 --port 5000

# Terminate any active ngrok tunnels before starting a new one
ngrok.kill()

# Start the MLflow server in a separate thread so the notebook can continue running
app_thread = threading.Thread(target=run_app)
app_thread.start()

# Allow time for the MLflow server to fully start before creating the tunnel
time.sleep(10)

# Create a public URL using ngrok and display it
try:
    public_url = ngrok.connect(5000)
    print("🚀 Your app is live!")
    print(f"🌐 Share this link: {public_url}")
    print("📱 Anyone can access your app with this link!")
except Exception as exc:
    # Catch only regular exceptions so Ctrl+C still interrupts the cell
    print(f"⚠️ Could not create the ngrok tunnel ({exc}). The MLflow UI is still running locally on port 5000.")
[2025-07-17 05:10:56 +0000] [3397] [INFO] Starting gunicorn 23.0.0
[2025-07-17 05:10:56 +0000] [3397] [INFO] Listening at: http://127.0.0.1:5000 (3397)
[2025-07-17 05:10:56 +0000] [3397] [INFO] Using worker: sync
[2025-07-17 05:10:56 +0000] [3398] [INFO] Booting worker with pid: 3398
[2025-07-17 05:10:56 +0000] [3399] [INFO] Booting worker with pid: 3399
[2025-07-17 05:10:56 +0000] [3400] [INFO] Booting worker with pid: 3400
[2025-07-17 05:10:56 +0000] [3401] [INFO] Booting worker with pid: 3401
🚀 Your app is live!
🌐 Share this link: NgrokTunnel: "https://397ecb42e17f.ngrok-free.app" -> "http://localhost:5000"
📱 Anyone can access your app with this link!
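The cell above waits a fixed 10 seconds before opening the tunnel. A possible alternative is to poll the port until the server accepts connections. The sketch below uses only the standard library; `wait_for_port` is a hypothetical helper, and the timeout values are arbitrary assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not ready yet (refused or timed out); wait and try again.
            time.sleep(interval)
    return False


# Possible usage in place of time.sleep(10):
# if wait_for_port("127.0.0.1", 5000):
#     public_url = ngrok.connect(5000)
```

This only confirms that the port is open, not that the MLflow UI has finished initializing, so a short extra delay may still be needed in practice.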


In [4]:
mlflow.set_tracking_uri('http://localhost:5000')
mlflow.set_experiment("LLM_Evaluation_Experiment")

2025/07/17 05:11:09 INFO mlflow.tracking.fluent: Experiment with name 'LLM_Evaluation_Experiment' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/688568944897355669', creation_time=1752729069455, experiment_id='688568944897355669', last_update_time=1752729069455, lifecycle_stage='active', name='LLM_Evaluation_Experiment', tags={}>

# **STEP 4: Define a Custom Dataset (Prompt-Response-Reference Pairs)**

In [5]:
dataset = [
    {"prompt": "Translate to French: Hello, how are you?", "reference": "Bonjour, comment ça va ?"},
    {"prompt": "Summarize: The cat sat on the mat and looked at the stars through the window.",
     "reference": "The cat watched stars from the mat."},
    {"prompt": "What is the capital of France?", "reference": "Paris"}
]

# **STEP 5: Load Evaluation Metrics**

In [6]:

# 📏 Load evaluation metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


# **STEP 6: Safe Generation Wrapper**

In [7]:

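def safe_generate_text(model_name, prompt, tokenizer, model, max_new_tokens=50):
    # model_name is not used here; it is kept so the call in the evaluation loop stays unchanged
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Disable gradient tracking during inference to save memory
    with torch.no_grad():
        # GPT-2 style tokenizers define no pad token, so the EOS token id is used for padding
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    # Note: the decoded text contains the prompt followed by the generated continuation
    return tokenizer.decode(outputs[0], skip_special_tokens=True)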
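For decoder-only models such as GPT-2, `generate` returns the prompt tokens followed by the new tokens, so the string returned by the wrapper starts with the prompt. If only the continuation should be scored, the prompt can be removed first. The sketch below does this at the string level; `strip_prompt` is a hypothetical helper and the decoded string is a made-up example. A token-level alternative is to decode `outputs[0][inputs["input_ids"].shape[1]:]` instead of `outputs[0]`. Either change would alter the scores reported in this notebook.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Remove a leading copy of the prompt from decoded model output."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    # Decoding can alter whitespace, so fall back to the untouched text.
    return generated.strip()


# Made-up decoded output, for illustration only.
decoded = "What is the capital of France? Paris is the capital."
continuation = strip_prompt(decoded, "What is the capital of France?")
print(continuation)
```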


# **STEP 7: Evaluate with Multiple Models**

In [8]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import evaluate
import mlflow
import mlflow.pyfunc

# 📦 Define prompts and reference completions
dataset = [
    {"prompt": "Translate to French: Hello, how are you?", "reference": "Bonjour, comment ça va ?"},
    {"prompt": "Summarize: The cat sat on the mat and looked at the stars through the window.",
     "reference": "The cat watched stars from the mat."},
    {"prompt": "What is the capital of France?", "reference": "Paris"}
]

# 📏 Load evaluation metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

# 🛡️ Safe generation wrapper
def safe_generate_text(model_name, prompt, tokenizer, model, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# 🧪 Evaluation logic
def evaluate_model(model_name):
    print(f"🚀 Evaluating model: {model_name}")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

    preds = []
    refs = []

    for item in dataset:
        pred = safe_generate_text(model_name, item["prompt"], tokenizer, model)
        preds.append(pred)
        refs.append([item["reference"]])  # BLEU expects list of references

    # 📊 Compute metrics
    rouge_score = rouge.compute(predictions=preds, references=[r[0] for r in refs])["rougeL"]
    bleu_score = bleu.compute(predictions=preds, references=refs)["bleu"]
    meteor_score = meteor.compute(predictions=preds, references=[r[0] for r in refs])["meteor"]

    print(f"📊 ROUGE-L: {rouge_score:.4f}")
    print(f"📊 BLEU:    {bleu_score:.4f}")
    print(f"📊 METEOR:  {meteor_score:.4f}")

    # 🧪 Log with MLflow
    with mlflow.start_run(run_name=f"{model_name}_eval"):
        mlflow.log_param("model", model_name)
        mlflow.log_metric("ROUGE-L", rouge_score)
        mlflow.log_metric("BLEU", bleu_score)
        mlflow.log_metric("METEOR", meteor_score)

    return {
        "model": model_name,
        "rougeL": rouge_score,
        "bleu": bleu_score,
        "meteor": meteor_score
    }

# 🔍 Run evaluation on multiple free models
models = ["distilgpt2", "sshleifer/tiny-gpt2"]

all_results = []
for m in models:
    result = evaluate_model(m)
    all_results.append(result)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


🚀 Evaluating model: distilgpt2


📊 ROUGE-L: 0.1212
📊 BLEU:    0.0000
📊 METEOR:  0.1644
🏃 View run distilgpt2_eval at: http://localhost:5000/#/experiments/688568944897355669/runs/0dfe9e89fac54a2ab558165b55f0cd3e
🧪 View experiment at: http://localhost:5000/#/experiments/688568944897355669
🚀 Evaluating model: sshleifer/tiny-gpt2


📊 ROUGE-L: 0.0370
📊 BLEU:    0.0000
📊 METEOR:  0.1012
🏃 View run sshleifer/tiny-gpt2_eval at: http://localhost:5000/#/experiments/688568944897355669/runs/8374cf8142c64a019c83d58b4140cf39
🧪 View experiment at: http://localhost:5000/#/experiments/688568944897355669
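Both models report a BLEU of 0.0000. One likely reason, offered here as an assumption rather than a verified diagnosis, is that unsmoothed BLEU is a geometric mean of 1-gram to 4-gram precisions and becomes zero as soon as any of them is zero, which is common with short references. The sketch below is a simplified illustration that omits the brevity penalty; the helper names and the example sentences are chosen for demonstration and are not the models' actual outputs.

```python
import math
from collections import Counter


def ngram_precision(pred_tokens, ref_tokens, n):
    """Clipped n-gram precision of a prediction against one reference."""
    pred_ngrams = Counter(tuple(pred_tokens[i:i + n]) for i in range(len(pred_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the reference.
    matched = sum((pred_ngrams & ref_ngrams).values())
    return matched / total


def geometric_mean(values):
    """Geometric mean that is 0 as soon as any value is 0."""
    if min(values) <= 0:
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Illustrative sentences, not real model output.
pred = "the cat sat on the mat and looked at the stars".split()
ref = "the cat watched stars from the mat".split()
precisions = [ngram_precision(pred, ref, n) for n in range(1, 5)]
print(precisions)
print(geometric_mean(precisions))
```

Here the unigram and bigram precisions are non-zero, but no trigram or 4-gram matches, so the combined score collapses to zero. ROUGE-L and METEOR do not require long n-gram matches, which is consistent with them staying above zero in the output above.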


# **STEP 8: Evaluation Summary**

In [9]:

print("\n📈 Evaluation Summary:")
for r in all_results:
    print(f"- {r['model']}: ROUGE-L={r['rougeL']:.4f}, BLEU={r['bleu']:.4f}, METEOR={r['meteor']:.4f}")


📈 Evaluation Summary:
- distilgpt2: ROUGE-L=0.1212, BLEU=0.0000, METEOR=0.1644
- sshleifer/tiny-gpt2: ROUGE-L=0.0370, BLEU=0.0000, METEOR=0.1012
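For the model-selection use case, the summary can be reduced to a single pick. The sketch below is one simple way to do that; `best_model` and `summary_rows` are hypothetical names, and the scores are copied from the summary printed above.

```python
def best_model(results, metric="rougeL"):
    """Return the result dict with the highest value for the chosen metric."""
    if not results:
        raise ValueError("results is empty")
    return max(results, key=lambda row: row[metric])


# Scores copied from the evaluation summary above.
summary_rows = [
    {"model": "distilgpt2", "rougeL": 0.1212, "bleu": 0.0, "meteor": 0.1644},
    {"model": "sshleifer/tiny-gpt2", "rougeL": 0.0370, "bleu": 0.0, "meteor": 0.1012},
]
winner = best_model(summary_rows, metric="meteor")
print(f"Best by METEOR: {winner['model']}")
```

With only three prompts, the gap between the two models should be read as a rough indication rather than a reliable ranking.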


In [10]:
print(f"Open this link: {public_url}")

Open this link: NgrokTunnel: "https://397ecb42e17f.ngrok-free.app" -> "http://localhost:5000"
