# Part 4: Optimization  

Now that we have a structured evaluation process, it’s time to **optimize and refine** our system.  

In this part, we’ll:  
- **Test different models** to compare their strengths.  
- **Analyze trade-offs** between speed, performance, and cost.  
- **Make informed decisions** about improvements and potential deployment.  

This is where we shift from a prototype to something more **robust and practical**. Whether you're thinking about deploying an LLM system or simply improving local performance, this step will help you make data-driven optimizations.  

Unfortunately, there's no free lunch.
Let's change how we run inference and see if we can get a speedup,
then use our evals to understand what the tradeoff is.

## Steps
* Run inference, but this time focus on engineering metrics. In our case we care about wall-clock time.
  * Measure 10 responses using the `gemma2:2b-instruct-fp16` model, then repeat with the `gemma2:2b-instruct-q2_K` model.
  * Measure the total generation time using Python.
  * Think about a confounder for generation time, something that strongly affects it. Use some basic mathematics to account for this.
* Before running evals, what is different about these models?
  * Hint: Google quantization, look at the model size on Ollama, think about what is expensive during inference.
* After running evals:
  * What do you notice is different?
  * What about the quality of the responses? How does that change?
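
One way to handle the confounder from the steps above, assuming the confounder is response length: divide the amount of generated text by the elapsed time, so long and short answers become comparable. The sketch below is a minimal illustration; `normalized_throughput` is a hypothetical helper name and the timings are made-up numbers, not measurements.

```python
def normalized_throughput(response: str, wall_seconds: float) -> float:
    """Return characters generated per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return len(response) / wall_seconds


# Hypothetical measurements: the second call took 10x longer,
# but it also produced 11x more text.
short_rate = normalized_throughput("Australian", 0.5)
long_rate = normalized_throughput("Australian " * 10, 5.0)

print(short_rate, long_rate)
```

Raw seconds would make the second call look ten times slower, while the normalized rates are close. Wall-clock time can also include model loading and prompt processing, so it may be worth discarding the first call before comparing.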

In [1]:
from ollama import chat
from ollama import ChatResponse
import time
import os

In [2]:
article = """
**A Journey Through Time: The History and Achievements of North Melbourne Kangaroos**\n\nThe North Melbourne Kangaroos, affectionately known as the Magpies prior to their 1983 rebranding, are a storied team with a rich tapestry of history. Nestled in Ballarat, Victoria, these kangaroos have been the pride of the region since their inception in 1924. This article delves into their remarkable journey, from their early days as a minor league side to becoming a significant force in Australian rugby union.\n\n**The Founding of the Kangaroos**\n\nNorth Melbourne Kangaroos were born on November 6, 1924, when the North Melbourne Rugby Union Club was established. The club\'s name change to Kangaroos occurred after their 1983 merger with Torquay United, a move that reflected their transition from being based in Melbourne to Ballarat.\n\n**A Legacy of Excellence**\n\nThe Kangaroos\' history is marked by excellence and resilience. Over the decades, they\'ve claimed multiple State Titles, showcasing their dominance on the field. Notable players such as David Nasmith and Adam Droms have been luminaries of the team, with Nasmith holding the distinction of being one of the youngest players to grace the field in 1958 and still active today.\n\n**Current Stance in the League**\n\nSince their move to Ballarat in 2019, the North Melbourne Kangaroos have become a prominent side in the Victorian top-tier competition. Their transition was met with both anticipation and nostalgia from their old fans, who cherished the memories of playing at Belmore Ground, the club\'s historic home.\n\n**Community Connection**\n\nThe kangaroos are deeply rooted in the Ballarat community. Their branding includes the iconic "Kangaroo" logo, symbolizing their connection to the local wildlife. 
The team has also embraced community involvement, with players and staff engaging actively in local initiatives, fostering a strong bond between the club and its surroundings.\n\n**Engaging with Fans**\n\nIn keeping with rugby union traditions, the Kangaroos embrace close-knit relationships with their fans. Whether through match days or annual events, the team consistently demonstrates a commitment to community spirit, further cementing its identity as a local institution.\n\n**Looking Ahead**\n\nThe future of North Melbourne Kangaroos looks promising. With a focus on developing young talent and maintaining high standards, the club is poised for continued success. Fans can expect an exciting season ahead, with the kangaroos aiming to uphold their legacy while exploring new horizons in the league.\n\nIn conclusion, the North Melbourne Kangaroos represent more than just a rugby team; they are a symbol of pride for Ballarat and a testament to the enduring spirit of sportsmanship. As they continue to navigate the challenges and opportunities of modern rugby union, one can look forward to many thrilling matches and memorable moments in their storied history.'
"""

In [None]:
model = 'gemma2:2b'

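def single_turn_with_time(prompt, model):
    """Send one user message and return (response text, wall-clock seconds)."""
    # perf_counter is a monotonic clock intended for measuring elapsed time
    start_time = time.perf_counter()
    response: ChatResponse = chat(model=model, messages=[
      {
        'role': 'user',
        'content': prompt,
      },
    ])
    end_time = time.perf_counter()

    total_time = end_time - start_time
    return response.message.content.strip(), total_time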


prompt = f"Is this about an australian or american team?, print no other words: {article}?"
single_turn_with_time(prompt, model)

In [None]:
model = 'gemma2:2b-instruct-fp16'

for i in range(10):
    prompt = f"Is this about an australian or american team?, print no other words: {article}?"
    print("Response {0}: Seconds {1}".format(*single_turn_with_time(prompt, model=model)))

In [None]:
model = 'gemma2:2b-instruct-fp16'

for i in range(10):
    prompt = f"Is this about an australian or american team?, print no other words: {article}?"
    response, wall_time = single_turn_with_time(prompt, model=model)
    character_per_second = len(response) / wall_time
    print("Response {0}: Characters per second {1}".format(response, character_per_second))
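
Characters per second is only a proxy for tokens per second. The Ollama API also reports generation metadata with each response, including a generated-token count (`eval_count`) and a generation duration in nanoseconds (`eval_duration`). The sketch below shows the arithmetic on plain integers; `tokens_per_second` is a hypothetical helper, the numbers are made up, and whether your installed client exposes these fields on the response object is an assumption to verify.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generated tokens per second, given a duration in nanoseconds."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count * 1e9 / eval_duration_ns


# Hypothetical values: 10 tokens generated in 0.5 seconds.
rate = tokens_per_second(10, 500_000_000)
print(rate)
```

With a live response this would look like `tokens_per_second(response.eval_count, response.eval_duration)`, assuming both fields are populated. Unlike wall-clock timing, this excludes model loading and prompt processing.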

## Quantized model
Now let's run the same measurements with the quantized model. See if you notice anything different.


In [None]:
model = 'gemma2:2b-instruct-q2_K'

for i in range(10):
    prompt = f"Is this about an australian or american team?, print no other words: {article}?"
    print("Response {0}: Seconds {1}".format(*single_turn_with_time(prompt, model=model)))

In [None]:
model = 'gemma2:2b-instruct-q2_K'

for i in range(10):
    prompt = f"Is this about an australian or american team?, print no other words: {article}?"
    response, wall_time = single_turn_with_time(prompt, model=model)
    character_per_second = len(response) / wall_time
    print("Response {0}: Characters per second {1}".format(response, character_per_second))

## Questions
* What is the difference in characters per second (our proxy for tokens per second)?
* What about the quality of the answer?
* Is this tradeoff worth it? 
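
To answer these questions with numbers rather than impressions, it can help to collect the per-run results for each model and summarize them. The sketch below is one minimal way to do that; `speedup` and `fraction_correct` are hypothetical helpers, the rates and responses are invented for illustration, and the substring check is a deliberately crude stand-in for the evals built in the earlier parts.

```python
from statistics import mean


def speedup(baseline_rates, candidate_rates):
    """Ratio of mean throughput: values above 1 mean the candidate is faster."""
    return mean(candidate_rates) / mean(baseline_rates)


def fraction_correct(responses, expected="australian"):
    """Share of responses that contain the expected answer (case-insensitive)."""
    hits = sum(expected in r.lower() for r in responses)
    return hits / len(responses)


# Hypothetical characters-per-second results for each model.
fp16_rates = [20.0, 22.0, 18.0]
q2k_rates = [40.0, 44.0, 36.0]
ratio = speedup(fp16_rates, q2k_rates)

# Hypothetical responses from the quantized model.
quality = fraction_correct(["Australian", "American", "Australian.", "It is australian"])

print(ratio, quality)
```

Reporting the speedup next to a quality number makes the tradeoff explicit: a model that is twice as fast but wrong a quarter of the time may or may not be acceptable, depending on the use case.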

### Why do we pick these models?
In this exercise we want to highlight how drastically inference speed can change **even for the same model with the same number of parameters**.
We picked the highest-precision, "best quality" weights and compare them to the "fastest" but lowest-precision weights.
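
A rough back-of-the-envelope calculation shows why precision matters so much. The sketch below estimates the weight storage for a model at two precisions. The parameter count (about 2.6 billion) and the 3 bits per weight for the low-precision variant are illustrative assumptions, not values taken from the Ollama model pages. Real quantized files mix precisions across layers, so actual sizes will differ.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 2.6e9  # assumed parameter count, for illustration only

fp16_gb = weight_size_gb(N_PARAMS, 16)  # 16 bits per weight
q2_gb = weight_size_gb(N_PARAMS, 3)     # assumed ~3 bits per weight

print(f"fp16: {fp16_gb:.2f} GB, low precision: {q2_gb:.2f} GB, "
      f"ratio: {fp16_gb / q2_gb:.1f}x")
```

Token generation is commonly limited by how fast the weights can be read from memory, so fewer bytes per weight tends to mean faster decoding, at the cost of precision.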

Quantization is just one decision to be made when optimizing serving. There are a number of other methods that speed up inference. A common one is a [KV cache](https://medium.com/@joaolages/kv-caching-explained-276520203249), which most frameworks now include by default. See the linked blog post for how it works, and the Ollama settings below for how to configure it.


Methods can be combined, for example KV caching and quantization. For the deep details, check out [this discussion from the Ollama maintainers](https://github.com/ollama/ollama/issues/5091#issuecomment-2476456862).

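For more settings see this [doc](https://github.com/ollama/ollama/blob/main/docs/faq.md). These variables are read by the Ollama server, so they need to be set in the environment where the server is started:
  * `os.environ["OLLAMA_KV_CACHE_TYPE"] = "q4_0"`
  * `os.environ["OLLAMA_KV_CACHE_TYPE"] = "f16"`
  * `os.environ["OLLAMA_FLASH_ATTENTION"] = "0"`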

For more on efficient serving see the [GenAI Guidebook references](https://ravinkumar.com/GenAiGuidebook/language_models/inference.html)

## 🎯 Recap: What We Learned  

### Inference is as important as training
* There are a number of decisions that need to be made to balance user experience against the cost of running models
* Quantization is one such example, especially for open models
* There are many others, such as prompt caching and speculative decoding