## Multi-agent Approach to Generating More Accurate Responses

This is an attempt at implementing a multi-agent solution for generating more accurate responses. It was inspired by Li et al., 2024 (More Agents Is All You Need), which improves LLM performance by ensembling several sampled answers, and by Chen et al., 2024 (Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems). Chen et al. observed that adding more LLM calls tends to help on easier queries but can hurt on harder ones. Anecdotally, open-ended or complex prompts also led to inconsistent generation times here.
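For questions with a short, exact answer (a number, a name), the simplest ensemble is a plain majority vote over the sampled answers, which is the "Vote" style of system Chen et al. analyze. The sketch below is illustrative only: `majority_vote` is a hypothetical helper, and the normalization (trim and lowercase) is an assumption. Free-form answers such as travel recommendations rarely match exactly, which is why this notebook scores similarity instead.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer after trimming whitespace and lowercasing.

    Ties go to the answer encountered first, because Counter.most_common keeps
    first-seen order among elements with equal counts.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [answer.strip().lower() for answer in answers]
    return Counter(normalized).most_common(1)[0][0]


print(majority_vote(["42", "41", " 42 ", "42."]))  # prints 42
```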

A simple similarity-based scoring scheme is also implemented here to pick the "best" response out of all the generated responses.
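Written out, with $n$ candidate responses $r_1, \dots, r_n$ and a pairwise similarity function $\mathrm{sim}$, each candidate is scored by its total similarity to the others and the highest-scoring one is kept:

$$
s_i = \sum_{j \neq i} \mathrm{sim}(r_i, r_j), \qquad r^{*} = r_{\arg\max_i s_i}
$$

For example, with three candidates and illustrative (made-up) similarities $\mathrm{sim}(r_1, r_2) = 0.9$, $\mathrm{sim}(r_1, r_3) = 0.8$ and $\mathrm{sim}(r_2, r_3) = 0.7$, the scores are $s_1 = 1.7$, $s_2 = 1.6$ and $s_3 = 1.5$, so $r_1$ is picked as the response that agrees most with the rest.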

### Implementing Multi-Agent Approach

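```python
from llama_cpp import Llama
import spacy

# en_core_web_sm ships without word vectors, so Doc.similarity falls back to
# context-sensitive tensors and warns that the scores may not be useful.
# A model with vectors gives more meaningful similarity scores
# (install with: python -m spacy download en_core_web_md).
nlp = spacy.load('en_core_web_md')
```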

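```python
def perform_inference(llm, prompt):
    """Send a single-turn prompt to a Phi-3 model and return the generated text."""
    output = llm(
        f"<|user|>\n{prompt}<|end|>\n<|assistant|>",  # Phi-3 chat template
        max_tokens=256,  # Generate up to 256 tokens
        stop=["<|end|>"],  # Stop at the end-of-turn marker
        temperature=0.7,  # Non-zero temperature so repeated calls can differ
        seed=-1,  # Meant to request a random seed on each call
    )
    return output['choices'][0]['text']
```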
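To check the prompt template and the response parsing without loading a model, the two pure steps inside `perform_inference` can be pulled out into helpers. This is a minimal sketch: `format_phi3_prompt` and `extract_text` are hypothetical names, the template string is copied from `perform_inference`, and the completion shape matches the `output['choices'][0]['text']` access used above.

```python
def format_phi3_prompt(user_message: str) -> str:
    """Wrap one user message in the Phi-3 chat template used by perform_inference."""
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>"


def extract_text(completion: dict) -> str:
    """Return the generated text from a completion dict shaped like llama-cpp-python's."""
    return completion["choices"][0]["text"]


# Usage without a model: inspect the exact prompt string and parse a stand-in result.
print(repr(format_phi3_prompt("What is the best place to visit in South India?")))
print(extract_text({"choices": [{"text": " Mysore is a popular choice."}]}))
```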

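```python
# Load four independent copies of the same quantized Phi-3 model.
# Each copy holds its own weights and KV cache in memory.
llm_instances = [
    Llama(
        model_path="./Phi-3-mini-4k-instruct-q4.gguf",
        n_ctx=4096,
        n_threads=4,
        n_gpu_layers=35,  # More than the model's 32 blocks, so all layers are offloaded
        verbose=True
    ) for _ in range(4)  # Create four instances
]
```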
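Because the four instances load identical weights and are queried one after another, sampling a single instance several times with different seeds should draw from the same distribution while loading the model only once. A minimal sketch under that assumption: `sample_candidates` is a hypothetical helper, and `generate(prompt, seed)` stands for any callable that wraps one `Llama` instance and forwards `seed` to the completion call, as `perform_inference` already does.

```python
from typing import Callable, List


def sample_candidates(
    generate: Callable[[str, int], str], prompt: str, n: int, base_seed: int = 0
) -> List[str]:
    """Draw n candidates from one generator, giving each call its own seed."""
    return [generate(prompt, base_seed + k) for k in range(n)]


# Usage with a stand-in generator (no model needed):
def fake_generate(prompt: str, seed: int) -> str:
    return f"{prompt} -> answer #{seed}"


print(sample_candidates(fake_generate, "Q", 3))
# ['Q -> answer #0', 'Q -> answer #1', 'Q -> answer #2']
```

The trade-off is memory versus concurrency: one instance is cheaper to hold, while separate instances are only worth their memory if they are actually run at the same time.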

```text
llama_model_loader: loaded meta data with 24 key-value pairs and 195 tensors from ./Phi-3-mini-4k-instruct-q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   4:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi3.block_count u32              = 32
llama_model_loader: - kv   6:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi3.attention.head_count_kv u32
```

```python
prompt = input('Enter your prompt: ')
print(prompt)
```

What is the best place to visit in South India?


```python
# Query each model instance in turn; these calls run sequentially, not in parallel
responses = []
for llm in llm_instances:
    response = perform_inference(llm, prompt)
    responses.append(response)
```

```text
Llama.generate: prefix-match hit

llama_print_timings:        load time =     479.23 ms
llama_print_timings:      sample time =      53.43 ms /   256 runs   (    0.21 ms per token,  4791.58 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    8129.62 ms /   256 runs   (   31.76 ms per token,    31.49 tokens per second)
llama_print_timings:       total time =    9121.80 ms /   256 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =     683.34 ms
llama_print_timings:      sample time =      56.80 ms /   256 runs   (    0.22 ms per token,  4507.44 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (     nan ms per token,      nan tokens per second)
llama_print_timings:        eval time =    8033.44 ms /   256 runs   (   31.38 ms per token,    31.87 tokens per second)
llama_print_timings:       to
```
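The generation loop visits the instances one after another, so the total wall time is roughly the sum of the per-call times (the first timing block reports about 9.1 s for 256 tokens). A hypothetical alternative is to submit one call per instance to a thread pool, for example with `functools.partial(perform_inference, llm)` for each `llm` in `llm_instances`. This is an untested sketch: `generate_concurrently` is a made-up name, and whether it is actually faster depends on the hardware, since the instances compete for the same CPU threads and GPU.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def generate_concurrently(
    generators: Sequence[Callable[[str], str]], prompt: str
) -> List[str]:
    """Call every generator on the prompt in its own worker thread.

    Results come back in the same order as `generators`, regardless of which
    call finishes first. Pass one generator per model instance so that no
    instance is shared between threads.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(generators))) as pool:
        futures = [pool.submit(gen, prompt) for gen in generators]
        return [future.result() for future in futures]


# Usage with stand-in generators of different speeds (no model needed):
def make_fake(name: str, delay: float) -> Callable[[str], str]:
    def fake(prompt: str) -> str:
        time.sleep(delay)
        return f"{name}: {prompt}"
    return fake


print(generate_concurrently([make_fake("slow", 0.2), make_fake("fast", 0.0)], "Q"))
# ['slow: Q', 'fast: Q']
```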

### Scoring the Responses Based on Similarity

```python
def calculate_similarity(text1, text2):
    """Return spaCy's Doc.similarity score (cosine of the averaged word vectors)."""
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    return doc1.similarity(doc2)
```
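`calculate_similarity` depends on spaCy's word vectors. As a dependency-free cross-check, a plain bag-of-words cosine similarity can stand in for it. This is a swapped-in technique that measures lexical overlap only, not meaning, and `bow_cosine_similarity` is a hypothetical name; since it has the same `(text1, text2) -> float` shape, it could replace `calculate_similarity` inside `select_best_response`.

```python
import math
import re
from collections import Counter


def bow_cosine_similarity(text1: str, text2: str) -> float:
    """Cosine similarity between the lowercase word-count vectors of two texts."""
    counts1 = Counter(re.findall(r"[a-z0-9']+", text1.lower()))
    counts2 = Counter(re.findall(r"[a-z0-9']+", text2.lower()))
    # Only words present in both texts contribute to the dot product
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # A text with no words is treated as dissimilar to everything
    return dot / (norm1 * norm2)


print(bow_cosine_similarity("Visit Mysore Palace", "Mysore palace is worth a visit"))
# ≈ 0.707 (3 shared words out of 3 and 6)
```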

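```python
def select_best_response(responses):
    """Pick the response with the highest total similarity to all the others."""
    best_response = responses[0]
    highest_similarity = float('-inf')  # Start below any possible score
    for i in range(len(responses)):
        # Sum this response's similarity to every other response
        similarity_sum = 0
        for j in range(len(responses)):
            if i != j:
                similarity_sum += calculate_similarity(responses[i], responses[j])
        # Keep the response that agrees most with the rest (first one wins ties)
        if similarity_sum > highest_similarity:
            highest_similarity = similarity_sum
            best_response = responses[i]
    return best_response
```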
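`select_best_response` computes both `sim(a, b)` and `sim(b, a)`, which is 12 calls for four responses. spaCy's default `Doc.similarity` is a cosine similarity and therefore symmetric, so each unordered pair only needs scoring once (6 calls for four responses). The sketch below does that and also returns every candidate's total, which makes it easy to see how close the candidates were. `score_responses` is a hypothetical helper, and the symmetry of the similarity function is an assumption it relies on.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def score_responses(
    responses: Sequence[str], similarity: Callable[[str, str], float]
) -> Tuple[int, List[float]]:
    """Return (index of the best response, total similarity of each response).

    Assumes `similarity` is symmetric: each unordered pair is scored once and
    the value is added to both members' totals. Ties go to the lowest index.
    """
    if not responses:
        raise ValueError("need at least one response")
    totals = [0.0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        sim = similarity(responses[i], responses[j])
        totals[i] += sim
        totals[j] += sim
    best = max(range(len(responses)), key=totals.__getitem__)
    return best, totals


# Usage with a toy similarity: 1.0 when two texts share a first letter.
same_initial = lambda a, b: 1.0 if a[0] == b[0] else 0.0
print(score_responses(["apple", "avocado", "banana"], same_initial))
# (0, [1.0, 1.0, 0.0])
```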

```python
final_response = select_best_response(responses)

# Print the final response
print("Final Response:", final_response)
```



Final Response:  South India boasts numerous remarkable destinations, each with unique attractions. However, some stand out for their historical significance and natural beauty:


1. **Tiruchirappalli (Trichy)** - Famous for its Rock Fort built by the Pandyas in 762 AD, it features temples like the Meenakshi Amman Temple with stunning architecture and intricate carvings.


2. **Mysore** - Home to the grand Mysore Palace, known for its exquisite paintings and silk sarees made in the nearby Saraswati Mahal (Silk Saree Mills). The city also hosts the annual Dasara Festival with a spectacular chariot procession.


3. **Kolkata** - Although on the border of South India, Kolkata offers an array of experiences from colonial architecture to art galleries and vibrant street life. It's not traditionally in South India but is often included for its rich cultural heritage.


4. **Coorg (Coffee Country)** - A haven for nature lovers, with lush coffee estates, scenic


### References
- Chen et al., 2024. *Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems*. https://arxiv.org/abs/2403.02419
- Li et al., 2024. *More Agents Is All You Need*. https://arxiv.org/abs/2402.05120