# Running LLM inference with Spring AI & Ollama
This notebook implement the basic text-to-text generation using [Spring AI](https://spring.io/projects/spring-ai) and OpenAI.
You need to an OpenAI API key to run. Rename `openaikey.example.json` to `openaikey.secret.json` and update the OpenAI key

Free feel to contribute to add more use cases.

## Install dependencies

In [None]:
// load version variables
%use @file[resources/version.json](currentDir=".")

In [1]:
USE {
    repositories {
        maven { url = "https://repo.spring.io/milestone" }
        mavenCentral()
    }
    dependencies {
        implementation("org.springframework.ai:spring-ai-core:$springAiVersion")
        implementation("org.springframework.ai:spring-ai-ollama:$springAiVersion")
        implementation("org.springframework.ai:spring-ai-openai:$springAiVersion")

        implementation("io.projectreactor:reactor-core:$reactorVersion")
        implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core-jvm:$kotlinCoroutineVersion")
        implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactor:$kotlinCoroutineVersion")
        implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactive:$kotlinCoroutineVersion")
    }
    import(
        "kotlinx.coroutines.*",
        "kotlinx.coroutines.flow.*",
        "kotlinx.coroutines.reactor.*",
        "kotlinx.coroutines.reactive.*",
    )
}
// list the library, if the dependencies doesn't show up, run again and restart the kernel
notebook.currentClasspath.joinToString("\n")

/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-script-classpath-shadowed-zip_extracted/kotlin-stdlib-1.9.23.jar
/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-script-classpath-shadowed-zip_extracted/kotlin-reflect-1.9.23.jar
/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-script-classpath-shadowed-zip_extracted/kotlinx-serialization-core-jvm-1.6.3.jar
/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-script-classpath-shadowed-zip_extracted/annotations-13.0.jar
/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-sc

## Function to measure model performance
A sample implementation using stream to measure key metrics
- time to first token
- input token process rate/s
- output token process rate/s

In [5]:
import org.springframework.ai.chat.model.ChatModel
import org.springframework.ai.chat.prompt.Prompt
import kotlin.time.measureTimedValue

data class ModelPerformance(
    val timeToFirstTokenInMills: Double,
    val totalTimeInMills: Double,
    val promptTokens: Long,
    val generationTokens: Long,
    val inputTokenRatePerSec: Double,
    val outputTokenRatePerSec: Double
)

fun ChatModel.measureModelPerformance(prompt: Prompt = Prompt("tell me 5 jokes")): ModelPerformance {
    var ttft: Long? = null

    val timedResp = measureTimedValue {
        runBlocking {
            val startTime = System.nanoTime()
            stream(prompt)
                .asFlow()
                .onStart {
                    ttft = System.nanoTime() - startTime
                }
                .onEach {
                    print(it.result?.output?.content)
                }
                .last()
        }
    }
    val resp = timedResp.value
    val totalTime = timedResp.duration.inWholeMilliseconds
    val timeToFirstTokenInMills = ttft!! / 1_000_000.0
    return ModelPerformance(
        timeToFirstTokenInMills = timeToFirstTokenInMills,
        totalTimeInMills = timedResp.duration.inWholeMilliseconds.toDouble(),
        promptTokens = resp.metadata.usage.promptTokens,
        generationTokens = resp.metadata.usage.generationTokens,
        inputTokenRatePerSec = resp.metadata.usage.promptTokens * 1000.0 / timeToFirstTokenInMills,
        outputTokenRatePerSec = resp.metadata.usage.generationTokens * 1000.0 / (totalTime - timeToFirstTokenInMills)
    )
}

## Create GPT4o-mini chat model

### Load OpenAI Key into Kotlin Notebook
Rename `openaikey.example.json` to `openaikey.secret.json` and update the OpenAI key


In [4]:
// Load openaikey.json into `openAiKey`
%use @file[resources/openaikey.secret.json](currentDir=".")

In [6]:
import org.springframework.ai.chat.prompt.Prompt
import org.springframework.ai.openai.OpenAiChatModel
import org.springframework.ai.openai.OpenAiChatOptions
import org.springframework.ai.openai.api.OpenAiApi

val model = OpenAiChatModel(
    OpenAiApi(openAiKey),
    OpenAiChatOptions.builder()
        .streamUsage(true)
        .model(OpenAiApi.ChatModel.GPT_4_O_MINI)
        .temperature(0.7)
        .build()
)

model.measureModelPerformance(Prompt("tell me 5 jokes"))

Sure! Here are five jokes for you:

1. Why did the scarecrow win an award?
   Because he was outstanding in his field!

2. What do you call fake spaghetti?
   An impasta!

3. Why don’t scientists trust atoms?
   Because they make up everything!

4. How do you organize a space party?
   You planet!

5. Why did the bicycle fall over?
   Because it was two-tired!

Hope these brought a smile to your face!nullnull

ModelPerformance(timeToFirstTokenInMills=17.110708, totalTimeInMills=2107.0, promptTokens=12, generationTokens=99, inputTokenRatePerSec=701.3152231924009, outputTokenRatePerSec=47.37093030667579)