## Let's Test Which LLM Should Be Used for Recommending an Astrologer

**Sample LLM client test**

In [64]:
from together import Together
client = Together()

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

# Response Generation
response = client.chat.completions.create(
    model=model_name,
    messages=[
      {
        "role": "user",
        "content": "what is the capital of France?"
      }
    ]
)
print(response.choices[0].message.content)

 The capital of France is Paris.


### Testing Various Small LLMs on a Custom Dataset

**Models we will be testing**
- meta-llama/Llama-3-8b-chat-hf
- Qwen/Qwen2.5-7B-Instruct-Turbo
- mistralai/Mistral-7B-Instruct-v0.1
- google/gemma-2-27b-it

In [9]:
import os, json 

**We created a custom sample dataset with ground truth to test the LLMs.**

Here is a sample:

```json
{
  "id": "CASE_001",
  "user_input": {
    "profile": "Male, 28, Software Engineer",
    "chat_transcript": "Hi, I'm feeling really stuck. My career isn't moving, and I'm thinking of starting my own software company but I'm worried about the financial risk. I just don't know if it's the right move for me."
  },
  "ideal_recommendation": {
    "top_3": ["Aarav Sharma", "Ananya Reddy", "Sneha Gupta"],
    "reasoning": "User has a direct career and business question with a financial component, making Aarav Sharma the primary choice. The need for 'Decision Making' about the startup makes Ananya Reddy a strong secondary choice. The general feeling of being 'stuck' relates to 'Life Path & Purpose', making Sneha Gupta relevant for a broader perspective."
  }
}
```

The astrologers' profile data is stored in `data/astrologers.json`.

In [11]:
# Read and parse a JSON file
def read_json_file(file_path):
    """Load a JSON file and return the parsed Python object."""
    with open(file_path, "r", encoding="utf-8") as file:
        return json.load(file)

In [15]:
astrologers_list = read_json_file("data/astrologers.json")
sample_data = read_json_file("data/sample_data.json")

In [18]:
def create_prompt(user_profile, chat_transcript, astrologer_pool):
    """Creates a standardized, detailed prompt for the LLM."""
    
    # Format the astrologer pool into a readable string
    astrologer_list_str = "\n".join([
        f"- {astro['name']}: Specializes in {', '.join(astro['specialties'])}"
        for astro in astrologer_pool
    ])

    # The instruction prompt template
    prompt = f"""
        You are an expert recommendation engine for an astrology platform. Your task is to analyze a user's profile and chat history to recommend the three most suitable astrologers from a provided list.

        **CONTEXT:**
        1.  **USER PROFILE:** {user_profile}
        2.  **CHAT TRANSCRIPT:** "{chat_transcript}"
        3.  **AVAILABLE ASTROLOGERS:** 
        {astrologer_list_str}

        **INSTRUCTIONS:**
        1.  Carefully read the user's profile and chat transcript to understand their primary concerns.
        2.  Match their concerns to the specialties of the available astrologers.
        3.  Provide your response as a JSON object containing the top 3 recommended astrologer names and a brief reasoning for your choices.

        **Your response MUST strictly be a JSON object in the following format and nothing else:**
        {{
        "top_3": ["Astrologer Name 1", "Astrologer Name 2", "Astrologer Name 3"],
        "reasoning": "Your detailed explanation for why these three astrologers were chosen based on the user's issues and the astrologers' specialties."
        }}
    """
    return prompt
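
To make the astrologer formatting step concrete, here is a minimal sketch of the list comprehension used in `create_prompt`, run on a small in-memory pool. The three entries are a hypothetical excerpt: only the `name` and `specialties` keys are confirmed by the code above, and the real `data/astrologers.json` may contain more astrologers and more fields.

```python
# Hypothetical excerpt of the astrologer pool. Only the "name" and
# "specialties" keys are confirmed by create_prompt; the real file may differ.
astrologer_pool = [
    {"name": "Ananya Reddy", "specialties": ["Decision Making"]},
    {"name": "Sneha Gupta", "specialties": ["Life Path & Purpose"]},
    {"name": "Mira Kapoor", "specialties": ["Timing of Events"]},
]

# Same formatting logic as in create_prompt: one bullet per astrologer.
lines = [
    f"- {astro['name']}: Specializes in {', '.join(astro['specialties'])}"
    for astro in astrologer_pool
]
print("\n".join(lines))
```

Each astrologer becomes one bullet line, so the model sees every name next to its specialties when choosing the top three.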


In [None]:
m1 = "meta-llama/Llama-3-8b-chat-hf"
m2 = "Qwen/Qwen2.5-7B-Instruct-Turbo"
m3 = "mistralai/Mixtral-8x7B-Instruct-v0.1"
m4 = "google/gemma-2-27b-it"
# m5 = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

In [None]:
models = [
    m2, 
    m3, 
    m4
]
# m1 was already tested in an earlier run

In [54]:
from tqdm import tqdm 

In [61]:
## Execution 
results = {}
print("Starting LLM Testing...")

for m in tqdm(models, desc="Models", position=0):
    print(f"--- Testing Model: {m} ---")
    results[m] = {}

    for test_case in tqdm(sample_data, desc="Test Cases", position=1, leave=False):
        case_id = test_case['id']

        # Create the detailed prompt for the current test case
        prompt_content = create_prompt(
            user_profile = test_case["user_input"]["profile"], 
            chat_transcript = test_case["user_input"]["chat_transcript"],
            astrologer_pool = astrologers_list
        )

        try: 
            # LLM Call 
            response = client.chat.completions.create(
                model = m, 
                messages = [
                    {
                        "role": "user", 
                        "content": prompt_content
                    }
                ], 
                temperature=0.1
            )

            # Extract the content 
            response_content = response.choices[0].message.content.strip()
            results[m][case_id] = {
                "response": response_content
            }
        except Exception as e:
            results[m][case_id] = {
                "error": str(e)
            }
    print(f"--- Completed Testing Model: {m} ---") 

Starting LLM Testing...


Models:   0%|          | 0/3 [00:00<?, ?it/s]

--- Testing Model: Qwen/Qwen2.5-7B-Instruct-Turbo ---


Models:  33%|███▎      | 1/3 [00:34<01:08, 34.35s/it]

--- Completed Testing Model: Qwen/Qwen2.5-7B-Instruct-Turbo ---
--- Testing Model: mistralai/Mistral-7B-Instruct-v0.1 ---


Models:  67%|██████▋   | 2/3 [01:13<00:36, 36.90s/it]

--- Completed Testing Model: mistralai/Mistral-7B-Instruct-v0.1 ---
--- Testing Model: google/gemma-2-27b-it ---


Models: 100%|██████████| 3/3 [01:59<00:00, 39.88s/it]

--- Completed Testing Model: google/gemma-2-27b-it ---





In [62]:
# Save the raw model responses to disk
with open("_results.json", "w") as f:
    json.dump(results, f, indent=4)
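
The saved responses are raw strings, and a model may wrap the requested JSON object in extra text. The sketch below shows one possible way to recover the object before scoring. The helper name `parse_recommendation` and the sample response are hypothetical and not part of the original notebook.

```python
import json


def parse_recommendation(raw_text):
    """Best-effort extraction of a JSON object from a raw model response.

    Returns a dict on success, or None if no valid JSON object is found.
    """
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Hypothetical response with extra text before the JSON object.
raw = (
    "Here is my answer:\n"
    '{"top_3": ["Aarav Sharma", "Ananya Reddy", "Sneha Gupta"], '
    '"reasoning": "Career focus."}'
)
parsed = parse_recommendation(raw)
print(parsed["top_3"])
```

Returning `None` instead of raising keeps a single malformed response from interrupting a pass over all test cases.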

## Evaluation

### We used a highly capable state-of-the-art model, **Google's Gemini-2.5-pro**, to evaluate the quality of these models

Each model's output for the 20 test cases was evaluated against a predefined "ideal recommendation" using three core metrics:

1.  **Top-1 Accuracy:** Measures if the model's single most important recommendation matches the ideal #1 choice. This tests its ability to identify the most critical expert for the user's primary issue.
2.  **Top-3 Overlap Score:** Counts how many of the model's three recommendations appear in the ideal list of three, regardless of their order. This assesses the overall relevance of the suggested astrologers.
3.  **Reasoning Quality:** A qualitative score (from 1 to 3 per case, for a maximum of 60 across the 20 cases) judging how logical, relevant, and insightful the model's justification for its choices is.
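
The first two metrics are mechanical and can be sketched in a few lines. This is an illustration of the definitions above, not the evaluator's actual procedure (the scoring in this report was done by Gemini). The function names and the predicted list are hypothetical; the ideal list is the one from `CASE_001`.

```python
def top1_accuracy(predicted, ideal):
    """Return 1 if the first predicted name matches the ideal #1 choice, else 0."""
    return int(bool(predicted) and bool(ideal) and predicted[0] == ideal[0])


def top3_overlap(predicted, ideal):
    """Count how many of the first three predictions appear in the ideal top 3, ignoring order."""
    return len(set(predicted[:3]) & set(ideal[:3]))


# Ideal list from CASE_001; the predicted list is a hypothetical model output.
ideal = ["Aarav Sharma", "Ananya Reddy", "Sneha Gupta"]
predicted = ["Aarav Sharma", "Sneha Gupta", "Mira Kapoor"]

print(top1_accuracy(predicted, ideal))  # 1: the first names match
print(top3_overlap(predicted, ideal))   # 2: two of the three names are shared
```

Summing these per-case values over the 20 test cases gives totals out of 20 for Top-1 Accuracy and out of 60 for Top-3 Overlap.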


---

### Evaluation of Recommendation Quality: `meta-llama/Llama-3-8b-chat-hf`

This report analyzes the performance of the `meta-llama/Llama-3-8b-chat-hf` model on the task of recommending astrologers based on user profiles and chat transcripts. The evaluation focuses exclusively on the **quality and relevance of the recommendations**, not the specific output format.

### Overall Performance Summary

**Conclusion:** **Excellent.**

The `meta-llama/Llama-3-8b-chat-hf` model demonstrates a high degree of competence for this recommendation task. It consistently understands the user's core problems, even when they are nuanced, and matches them to the correct astrologer specialties with impressive accuracy. The model's reasoning is its standout feature, providing clear and logical justifications for its choices.

### Quantitative Metrics

| Metric | Score | Percentage / Rating |
| :--- | :--- | :--- |
| **Top-1 Accuracy** | 16 / 20 | 80% |
| **Top-3 Overlap** | 44 out of 60 possible | 73.3% |
| **Reasoning Quality** | 52 out of 60 possible | 86.7% |

### Qualitative Analysis

#### Key Strengths:

*   **Strong Core Logic:** The model rarely makes an illogical recommendation. Even when its choices differ slightly from the ideal, its reasoning demonstrates a solid understanding of the user's problem.
*   **Performance on Clear Cases:** For straightforward user problems, the model performs exceptionally well. In cases like `CASE_007` (needing a Vaastu expert for a new home) or `CASE_020` (a couple facing business and marriage issues), its recommendations were a perfect match with the ideal answer.
*   **Excellent Reasoning:** This is the model's greatest strength. It consistently articulates *why* it chose a specific astrologer, correctly connecting user keywords like "stuck in my career" to an astrologer's specialty in "Career & Business". This builds trust in the recommendation.

#### Areas for Improvement:

*   **Missing Subtle Keywords:** The model's few errors occurred when it missed a subtle but critical keyword.
    *   In `CASE_013`, the user asked, "**When** will I finally get a job?" The model recommended career astrologers but missed the timing cue in "when," failing to recommend **Mira Kapoor**, the specialist in "Timing of Events."
    *   Similarly, in `CASE_014`, the model focused on the legal and family disputes but missed the "father is **sick**" cue, overlooking the need for a "Health & Wellness" expert.
*   **Bias Towards Action-Oriented Advice:** In ambiguous cases like `CASE_006` where the user felt "lost" and questioned their "purpose," the model recommended astrologers focused on tangible actions ("Decision Making," "Career"). This is a valid interpretation, but it missed the deeper, more introspective options like "Spiritual Healing" or "Past Life Karma" that were part of the ideal recommendation.

### Final Verdict

For the core task of generating high-quality, relevant, and well-justified astrologer recommendations, `meta-llama/Llama-3-8b-chat-hf` is a highly effective and robust model. Its performance indicates it is a strong candidate for powering the recommendation engine. The minor issues with subtlety could likely be improved with more advanced prompt engineering or light fine-tuning.

---

### Evaluation of Recommendation Quality: `Qwen/Qwen2.5-7B-Instruct-Turbo`

This report analyzes the performance of the `Qwen/Qwen2.5-7B-Instruct-Turbo` model on the task of recommending astrologers based on user profiles and chat transcripts. The evaluation focuses exclusively on the quality and relevance of the recommendations.

### Overall Performance Summary

**Conclusion:** **Fair to Good.**

The `Qwen2.5-7B` model demonstrates a foundational ability to handle the recommendation task. It correctly identifies the primary expert in the majority of cases and performs very well on complex problems where the user's needs are explicitly stated across multiple domains (e.g., business and family). However, its performance degrades significantly on queries requiring a deeper, more subtle understanding, where it often recommends irrelevant experts.

### Quantitative Metrics

| Metric | Score | Percentage / Rating |
| :--- | :--- | :--- |
| **Top-1 Accuracy** | 15 / 20 | 75% |
| **Top-3 Overlap** | 35 out of 60 possible | 58.3% |
| **Reasoning Quality** | 39 out of 60 possible | 65% |

### Qualitative Analysis

#### Key Strengths:

*   **Handling Explicit, Multi-faceted Problems:** The model's best performance was on complex cases where the user explicitly mentioned multiple problems. In `CASE_017` (land sale involving finance, family, and property harmony) and `CASE_020` (a couple facing business and marriage issues), the model achieved a perfect score, identifying all correct experts.
*   **Identifying the Primary Expert:** The model correctly identified the single most important astrologer in 75% of cases, showing it can generally grasp the user's main issue.
*   **Logical Reasoning for Correct Choices:** When the model did select the correct astrologers, its reasoning for those choices was typically logical and well-articulated.

#### Areas for Improvement:

*   **Difficulty with Nuance and Subtlety:** This is the model's most significant weakness. It repeatedly failed to grasp critical but subtle keywords. For example, it missed "when" in `CASE_013` (a query about a job timeline) and the "more than just physical" cue in `CASE_019` (a query about unexplained pain), leading to poor recommendations.
*   **High Rate of Irrelevant Recommendations:** The model's low `Top-3 Overlap` score (58.3%) highlights a tendency to make irrelevant suggestions. In several cases (`CASE_005`, `CASE_011`, `CASE_018`), after identifying the primary expert, it filled the remaining slots with astrologers whose specialties did not align with the user's problem.
*   **Weakness on Introspective Queries:** The model struggled with less tangible, spiritual, or emotional queries. In cases where users felt "lost" (`CASE_006`) or were exploring dreams (`CASE_016`), the model's recommendations were often generic and missed the more suitable, specialized experts for introspection and spiritual healing.

### Final Verdict

The `Qwen/Qwen2.5-7B-Instruct-Turbo` model is a capable baseline but is less reliable than other models tested. Its inability to consistently grasp subtle user needs and its tendency to suggest irrelevant experts make it a less ideal choice for a production system where nuance and high relevance are critical. It would require significant prompt engineering or fine-tuning to overcome these limitations.

---

### Evaluation of Recommendation Quality: `mixtral-8x7b-instruct-v0.1`

This report analyzes the performance of the `mixtral-8x7b-instruct-v0.1` model on the task of recommending astrologers based on user profiles and chat transcripts. The evaluation focuses exclusively on the quality and relevance of the recommendations.

### Overall Performance Summary

**Conclusion:** **Outstanding.**

The `mixtral-8x7b-instruct-v0.1` model demonstrates a superior and highly reliable understanding of the recommendation task. It performs with exceptional accuracy, consistently grasping both the explicit and subtle nuances of the user's needs. Its reasoning is sharp, logical, and directly tied to the provided expert profiles. This model sets the benchmark for quality and relevance.

### Quantitative Metrics

| Metric | Score | Percentage / Rating |
| :--- | :--- | :--- |
| **Top-1 Accuracy** | 19 / 20 | 95% |
| **Top-3 Overlap** | 56 out of 60 possible | 93.3% |
| **Reasoning Quality** | 58 out of 60 possible | 96.7% |

### Qualitative Analysis

#### Key Strengths:

*   **Near-Perfect Accuracy:** The model's ability to identify the correct astrologers is exceptional. With a 95% Top-1 accuracy and a 93.3% Top-3 overlap, its recommendations are consistently the most relevant and helpful. It almost never makes an illogical choice.
*   **Deep Nuance Comprehension:** This is the model's most impressive trait. It successfully captures subtle keywords that other models missed. For instance, in `CASE_013` it correctly identified "When will I finally get a job?" as a query about **timing** and recommended the "Timing of Events" expert. It also correctly identified the "more than just physical" nature of the health problem in `CASE_019`.
*   **Sharp, Concise, and Relevant Reasoning:** The reasoning provided is consistently of high quality. It is direct, avoids making up justifications, and clearly links the user's specific problem to the astrologer's specialty.
*   **Excellent Performance on All Case Types:** The model excels equally on straightforward queries, complex multi-faceted problems, and ambiguous, introspective requests. This versatility makes it highly reliable across the entire spectrum of potential user issues.

#### Areas for Improvement:

*   **Minimal and Rare Errors:** It is difficult to find significant weaknesses in this model's performance. The very few "errors" are minor and often represent a plausible alternative recommendation rather than a mistake. For example, in `CASE_020` (business + marriage issue), its recommendation of a "Decision Making" expert instead of a "Financial Growth" expert is a different but still highly logical and defensible choice. There are no systemic flaws in its approach.

### Final Verdict

The `mixtral-8x7b-instruct-v0.1` model is an **excellent and highly recommended choice** for this task. Its combination of high accuracy, nuanced understanding, and logical reasoning makes it the top-performing model evaluated. It can be used with a high degree of confidence to power a trustworthy and effective recommendation engine.
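
The percentages in the three Quantitative Metrics tables can be cross-checked from the raw counts with a short script. This is a sketch under the maxima stated in the tables (20 for Top-1 Accuracy, 60 for the other two metrics). The dictionary layout and the helper name `to_percentages` are illustrative.

```python
# Raw counts copied from the Quantitative Metrics tables in this report.
reported = {
    "meta-llama/Llama-3-8b-chat-hf": {"top1": 16, "top3": 44, "reasoning": 52},
    "Qwen/Qwen2.5-7B-Instruct-Turbo": {"top1": 15, "top3": 35, "reasoning": 39},
    "mixtral-8x7b-instruct-v0.1": {"top1": 19, "top3": 56, "reasoning": 58},
}

N_CASES = 20     # test cases per model (maximum for Top-1 Accuracy)
MAX_POINTS = 60  # 20 cases x 3 (maximum for Top-3 Overlap and Reasoning Quality)


def to_percentages(counts):
    """Convert raw counts into percentages rounded to one decimal place."""
    return {
        "top1": round(100 * counts["top1"] / N_CASES, 1),
        "top3": round(100 * counts["top3"] / MAX_POINTS, 1),
        "reasoning": round(100 * counts["reasoning"] / MAX_POINTS, 1),
    }


percentages = {model: to_percentages(c) for model, c in reported.items()}

# Rank models by Top-1 accuracy, highest first.
ranking = sorted(percentages, key=lambda m: percentages[m]["top1"], reverse=True)

for model in ranking:
    print(model, percentages[model])
```

The computed values agree with the reported percentages, and the Top-1 ranking puts `mixtral-8x7b-instruct-v0.1` first, ahead of `meta-llama/Llama-3-8b-chat-hf` and `Qwen/Qwen2.5-7B-Instruct-Turbo`.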