## Loading an LLM when using Ollama

### Importing the required libraries

In [1]:
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

import time
from datetime import datetime

### Keeping the model loaded in memory

The following code shows how to keep the model loaded in memory indefinitely by setting `keep_alive=-1`.

Instantiating the model and chain

In [2]:
llm = ChatOllama(model="llama3", keep_alive=-1)
prompt = ChatPromptTemplate.from_template("Tell me a short joke about {topic}")
chain = prompt | llm | StrOutputParser()

This is the first run, so Ollama has to load the model into memory before it can respond, which takes a few seconds.

In [3]:
start_time = time.time()
print(chain.invoke({"topic": "Space travel"}))
end_time = time.time()
elapsed_time = end_time - start_time
print(f"\nElapsed time: {elapsed_time:.2f} seconds")
current_time = datetime.now().strftime("%H:%M:%S")
print("Current time:", current_time)

Why did the astronaut break up with his girlfriend?

Because he needed space!

(Sorry, it's a bit of a "stellar" pun, but I hope it launched a smile on your face!)

Elapsed time: 6.09 seconds
Current time: 02:42:13


Now let's run the chain again after about 10 minutes. By default, Ollama keeps a model loaded for only 5 minutes, so a fast response here indicates that the model is still loaded in VRAM.

In [4]:
start_time = time.time()
print(chain.invoke({"topic": "Space travel"}))
end_time = time.time()
elapsed_time = end_time - start_time
print(f"\nElapsed time: {elapsed_time:.2f} seconds")
current_time = datetime.now().strftime("%H:%M:%S")
print("Current time:", current_time)

Why did the astronaut break up with his girlfriend before going to Mars?

Because he needed space!

Elapsed time: 0.39 seconds
Current time: 02:52:33


**Conclusion**: The second run started about 10 minutes after the first (02:42:13 vs. 02:52:33) and took 0.39 seconds, compared with 6.09 seconds for the first run. The much shorter elapsed time indicates that the model stayed loaded in memory well past the 5-minute default.
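The timing code repeated in the cells above could be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `timed_invoke` and `EchoChain` are hypothetical names, and `EchoChain` is only a stand-in so the example runs without an Ollama server. With the real `chain` defined earlier, the same helper would be called as `timed_invoke(chain, {"topic": "Space travel"})`.

```python
import time
from datetime import datetime


def timed_invoke(chain, inputs):
    """Invoke `chain` with `inputs`; return (result, elapsed_seconds, wall_clock)."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    result = chain.invoke(inputs)
    elapsed = time.perf_counter() - start
    wall_clock = datetime.now().strftime("%H:%M:%S")
    return result, elapsed, wall_clock


class EchoChain:
    """Hypothetical stand-in for a LangChain chain (assumption for this sketch)."""

    def invoke(self, inputs):
        return f"A joke about {inputs['topic']}"


result, elapsed, wall_clock = timed_invoke(EchoChain(), {"topic": "Space travel"})
print(result)
print(f"\nElapsed time: {elapsed:.2f} seconds")
print("Current time:", wall_clock)
```

Using `time.perf_counter()` instead of `time.time()` is a design choice here: it is not affected by system clock adjustments, which matters when comparing short intervals such as a warm 0.39-second call.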

### Unloading the model immediately after inference

Setting `keep_alive=0` unloads the model from memory as soon as a response has been generated.

Re-instantiating the model and chain

In [6]:
llm = ChatOllama(model="llama3", keep_alive=0)
prompt = ChatPromptTemplate.from_template("Tell me a short joke about {topic}")
chain = prompt | llm | StrOutputParser()

In [8]:
# The model was still loaded from the earlier cells, so this cell was re-run to make sure each call unloads and reloads the model.
start_time = time.time()
print(chain.invoke({"topic": "Space travel"}))
end_time = time.time()
elapsed_time = end_time - start_time
print(f"\nElapsed time: {elapsed_time:.2f} seconds")
current_time = datetime.now().strftime("%H:%M:%S")
print("Current time:", current_time)

Why did the astronaut break up with his girlfriend?

Because he needed space! (get it?)

Elapsed time: 5.81 seconds
Current time: 02:58:35


Now let's run the chain again to see whether the model was unloaded from memory.

In [9]:
start_time = time.time()
print(chain.invoke({"topic": "Space travel"}))
end_time = time.time()
elapsed_time = end_time - start_time
print(f"\nElapsed time: {elapsed_time:.2f} seconds")
current_time = datetime.now().strftime("%H:%M:%S")
print("Current time:", current_time)

Why did the astronaut break up with his girlfriend before going to Mars?

Because he needed space!

Elapsed time: 5.62 seconds
Current time: 02:58:47


**Conclusion**: Both runs took more than 5 seconds (5.81 and 5.62 seconds), close to the 6.09 seconds of the initial cold load. This indicates that the model was unloaded immediately after the first run and reloaded into memory for the second run.
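Elapsed time is an indirect signal. A more direct check is to ask the Ollama server which models are currently loaded, which the `ollama ps` command and the `/api/ps` REST endpoint report. The sketch below assumes the server listens on the default `http://localhost:11434` and that the response is a JSON object with a `models` list whose entries carry a `name` field. The function names and the sample payloads are hypothetical and made up for illustration, not captured from a real server.

```python
import json
from urllib.request import urlopen

# Assumption: Ollama is serving on its default local address.
OLLAMA_PS_URL = "http://localhost:11434/api/ps"


def loaded_model_names(payload):
    """Extract model names from a parsed /api/ps response."""
    return [m.get("name", "") for m in payload.get("models", [])]


def fetch_loaded_models(url=OLLAMA_PS_URL):
    """Query the running Ollama server; raises URLError if it is unreachable."""
    with urlopen(url, timeout=5) as response:
        return loaded_model_names(json.load(response))


# Illustrative payloads (made up for this sketch).
sample_loaded = {"models": [{"name": "llama3:latest"}]}
sample_empty = {"models": []}
print(loaded_model_names(sample_loaded))
print(loaded_model_names(sample_empty))
```

With a server running, calling `fetch_loaded_models()` right after a `keep_alive=0` request would be expected to return a list without the model, while after a `keep_alive=-1` request the model should still be listed.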

### More on the `keep_alive` parameter

Depending on the use case, the `keep_alive` parameter can be set to:
- a duration string (such as "10m" or "24h")
- a number of seconds (such as 3600)
- any negative number, which keeps the model loaded in memory indefinitely (e.g. -1 or "-1m")
- 0, which unloads the model immediately after the response is generated
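To make these cases concrete, here is a small sketch that classifies a `keep_alive` value by the behavior described in this notebook. `describe_keep_alive` is a hypothetical helper written for illustration and is not Ollama's own parser: it assumes only single-unit strings ending in `s`, `m`, or `h`, whereas Ollama may accept other duration formats.

```python
import re

# Assumption: only single-unit duration strings (s, m, h) are handled here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def describe_keep_alive(value):
    """Describe what a keep_alive value means for model residency."""
    if isinstance(value, str):
        match = re.fullmatch(r"(-?\d+)([smh])", value.strip())
        if match is None:
            raise ValueError(f"Unsupported keep_alive string: {value!r}")
        seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    else:
        seconds = value  # plain numbers are interpreted as seconds
    if seconds < 0:
        return "kept loaded indefinitely"
    if seconds == 0:
        return "unloaded immediately after the response"
    return f"kept loaded for {seconds} seconds"


for value in ("10m", "24h", 3600, -1, "-1m", 0):
    print(repr(value), "->", describe_keep_alive(value))
```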