In [1]:
from easyroutine import path_to_parents
path_to_parents(2)

%load_ext autoreload
%autoreload 2

Changed working directory to: /home/francesco/HistoryRevisionismLLM


# Inference Module
`easyroutine` provide a simple interface to interact with various LLMs using different backends. Specifically, it supports:
- **vLLM**: A high-performance inference engine for large language models running on GPUs.
- **LiteLLM**: A lightweight interface for OpenAI, Anthropic, and XAI APIs.


## LiteLLM Inference Model

First load the api keys from the `.env` file:


In [2]:

from dotenv import load_dotenv
load_dotenv()
#get the openai api key from the .env file
import os
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

Then, init the interface with the desired model and API keys:

In [3]:
from easyroutine.inference import LiteLLMInferenceModel, LiteLLMInferenceModelConfig
config = LiteLLMInferenceModelConfig(
    model_name="gpt-4.1-nano-2025-04-14",
    openai_api_key=OPENAI_API_KEY
)
model = LiteLLMInferenceModel(config)

  from .autonotebook import tqdm as notebook_tqdm


INFO 07-14 11:25:17 [__init__.py:244] Automatically detected platform cuda.


All the models are available in the `easyroutine.inference` module have the `.append_with_chat_template` method to append a message to the chat history with the specified role (either "user" or "assistant"). The `.chat` method than will handle the translation of the chat history to the specific model format and return the response.

`append_with_chat_template` method take a message and a role as input, and returns a chat message in the format required by the model. It can also take a `chat_history` parameter to append the message to an existing chat history.


In [4]:
chat_message = model.append_with_chat_template(message="What is the capital of France?", role="user")
print(chat_message)

[{'role': 'user', 'content': 'What is the capital of France?'}]


In [5]:
response = model.chat(chat_message)
print(response)

[Choices(finish_reason='stop', index=0, message=Message(content='The capital of France is Paris.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})]


## Batched inference

In [6]:
inputs = [
    model.append_with_chat_template(message="What is the capital of Italy?", role="user"),
    model.append_with_chat_template(message="What is the capital of Germany?", role="user"),
    model.append_with_chat_template(message="What is the capital of Spain?", role="user"),
]

In [None]:
response = model.batch_chat(inputs)
print([response[i]["choices"][0]["message"].content for i in range(len(response))]  # Extract the content of the responses)

[ModelResponse(id='chatcmpl-Bt9jD8Iy4A9h6OyHUuqQEN8s7qqqq', created=1752485135, model='gpt-4.1-nano-2025-04-14', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='The capital of Italy is Rome.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]), provider_specific_fields={})], usage=Usage(completion_tokens=7, prompt_tokens=14, total_tokens=21, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default'), ModelResponse(id='chatcmpl-Bt9jDAATOtSORl4CtVSBblbsH3oX4', created=1752485135, model='gpt-4.1-nano-2025-04-14', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='sto