# Running LLM inference with Spring AI & Ollama
This notebook implements basic text-to-text generation using [Spring AI](https://spring.io/projects/spring-ai) and Ollama.
You need to install Ollama on your machine to run it.

Feel free to contribute more use cases.


## Install dependencies

In [4]:
// load version variables
%use @file[resources/version.json](currentDir=".")

In [5]:
USE {
    repositories {
        maven { url = "https://repo.spring.io/milestone" }
        mavenCentral()
    }
    dependencies {
        implementation("org.springframework.ai:spring-ai-core:$springAiVersion")
        implementation("org.springframework.ai:spring-ai-ollama:$springAiVersion")

        implementation("io.projectreactor:reactor-core:$reactorVersion")
        implementation("org.jetbrains.kotlinx:kotlinx-coroutines-core-jvm:$kotlinCoroutineVersion")
        implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactor:$kotlinCoroutineVersion")
        implementation("org.jetbrains.kotlinx:kotlinx-coroutines-reactive:$kotlinCoroutineVersion")
    }
    import(
        "kotlinx.coroutines.*",
        "kotlinx.coroutines.flow.*",
        "kotlinx.coroutines.reactor.*",
        "kotlinx.coroutines.reactive.*",
    )
}

// List the libraries on the classpath. If the dependencies don't show up, run this cell again and restart the kernel.
notebook.currentClasspath.joinToString("\n")

/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-script-classpath-shadowed-zip_extracted/kotlin-stdlib-1.9.23.jar
/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-script-classpath-shadowed-zip_extracted/kotlin-reflect-1.9.23.jar
/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-script-classpath-shadowed-zip_extracted/kotlinx-serialization-core-jvm-1.6.3.jar
/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-script-classpath-shadowed-zip_extracted/annotations-13.0.jar
/Users/gaplo917/Library/Caches/JetBrains/IntelliJIdea2024.3/kotlinNotebook/kotlin-notebook-standalone.eb20de96/kernels/0.12.0-356/kotlin-jupyter-sc

## Function to measure model performance
A sample implementation that uses the streaming API to measure key metrics:
- time to first token
- input token processing rate (tokens/s)
- output token processing rate (tokens/s)
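
To make the arithmetic behind the two rates explicit, here is a minimal sketch that uses only the Kotlin standard library. `TokenRates` and `tokenRates` are hypothetical names introduced for illustration, and the numbers are made up. The sketch assumes that prompt processing takes roughly the time to first token and that generation takes the remainder.

```kotlin
// Hypothetical helper, not part of Spring AI: derives both rates from raw measurements.
data class TokenRates(val inputPerSec: Double, val outputPerSec: Double)

fun tokenRates(
    promptTokens: Long,
    generationTokens: Long,
    timeToFirstTokenMillis: Double,
    totalTimeMillis: Double,
): TokenRates {
    require(timeToFirstTokenMillis > 0.0) { "time to first token must be positive" }
    require(totalTimeMillis > timeToFirstTokenMillis) { "total time must exceed time to first token" }
    // Assumption: prompt processing is approximated by the time before the first token arrives.
    val inputPerSec = promptTokens * 1000.0 / timeToFirstTokenMillis
    // Assumption: generation is approximated by the time remaining after the first token.
    val outputPerSec = generationTokens * 1000.0 / (totalTimeMillis - timeToFirstTokenMillis)
    return TokenRates(inputPerSec, outputPerSec)
}

fun main() {
    // Illustrative numbers: 10 prompt tokens, 100 generated tokens,
    // first token after 20 ms, 1020 ms in total.
    val rates = tokenRates(10, 100, 20.0, 1020.0)
    println(rates) // TokenRates(inputPerSec=500.0, outputPerSec=100.0)
}
```

Both rates are approximations: the time to first token also includes request and network overhead, so the input rate is a lower bound on the real prompt processing speed.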

In [2]:
import org.springframework.ai.chat.model.ChatModel
import org.springframework.ai.chat.prompt.Prompt
import kotlin.time.measureTimedValue

data class ModelPerformance(
    val timeToFirstTokenInMills: Double,
    val totalTimeInMills: Double,
    val promptTokens: Long,
    val generationTokens: Long,
    val inputTokenRatePerSec: Double,
    val outputTokenRatePerSec: Double
)

fun ChatModel.measureModelPerformance(prompt: Prompt = Prompt("tell me 5 jokes")): ModelPerformance {
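    // Nanoseconds from the start of the request until the first streamed chunk arrives.
    var ttft: Long? = null

    val timedResp = measureTimedValue {
        runBlocking {
            val startTime = System.nanoTime()
            stream(prompt)
                .asFlow()
                .onEach {
                    // Record the latency once, when the first chunk is emitted.
                    // (onStart runs before collection begins, not when the first token arrives.)
                    if (ttft == null) {
                        ttft = System.nanoTime() - startTime
                    }
                    // Print each chunk as it arrives; skip chunks without content.
                    print(it.result?.output?.content ?: "")
                }
                .last()
        }
    }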
    // The last streamed chunk carries the token usage metadata.
    val resp = timedResp.value
    val usage = resp.metadata.usage
    val totalTimeInMills = timedResp.duration.inWholeMilliseconds.toDouble()
    val timeToFirstTokenInMills = ttft!! / 1_000_000.0
    return ModelPerformance(
        timeToFirstTokenInMills = timeToFirstTokenInMills,
        totalTimeInMills = totalTimeInMills,
        promptTokens = usage.promptTokens,
        generationTokens = usage.generationTokens,
        // Prompt processing time is approximated by the time to first token.
        inputTokenRatePerSec = usage.promptTokens * 1000.0 / timeToFirstTokenInMills,
        // Generation time is approximated by the time remaining after the first token.
        outputTokenRatePerSec = usage.generationTokens * 1000.0 / (totalTimeInMills - timeToFirstTokenInMills)
    )
}

## Create Ollama Gemma2 2B INT4 model
You need approximately 2 GB of GPU VRAM to run `gemma2:2b` locally.
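
Before creating the model, it can help to confirm that the local Ollama server is running. The sketch below uses only the JDK's `java.net.http` client (JDK 11+) and queries Ollama's `/api/tags` endpoint, which lists the locally available models. `isOllamaReachable` is a hypothetical helper name, and the default URL assumes Ollama listens on its standard port.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.time.Duration

// Hypothetical helper: returns true if an Ollama server answers at baseUrl.
fun isOllamaReachable(baseUrl: String = "http://localhost:11434"): Boolean =
    try {
        val client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build()
        // GET /api/tags lists the models that are available locally.
        val request = HttpRequest.newBuilder(URI.create("$baseUrl/api/tags"))
            .timeout(Duration.ofSeconds(5))
            .GET()
            .build()
        client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode() == 200
    } catch (e: Exception) {
        // Connection refused, timeout, or malformed URL: treat the server as unreachable.
        false
    }

fun main() {
    if (isOllamaReachable()) {
        println("Ollama is reachable")
    } else {
        println("Ollama is not reachable; start it and run `ollama pull gemma2:2b`")
    }
}
```

The check only confirms that the server responds. It does not verify that `gemma2:2b` has already been pulled.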

In [5]:
import org.springframework.ai.chat.prompt.Prompt
import org.springframework.ai.ollama.OllamaChatModel
import org.springframework.ai.ollama.api.OllamaApi
import org.springframework.ai.ollama.api.OllamaOptions

val model = OllamaChatModel.builder()
    .ollamaApi(OllamaApi("http://localhost:11434"))
    .defaultOptions(
        OllamaOptions.builder()
            .model("gemma2:2b")
            .numCtx(8192)
            .temperature(0.7)
            .build()
    ).build()

model.measureModelPerformance(Prompt("tell me 5 jokes"))


Here are five jokes for you:

1. **Why don't scientists trust atoms?**  Because they make up everything!
2. **What do you call a lazy kangaroo?** A pouch potato!
3. **Why did the scarecrow win an award?** Because he was outstanding in his field! 
4. **Why don't eggs tell jokes?** They'd crack each other up!
5. **What do you get from a pampered cow?** Spoiled milk!


Let me know if you want to hear some more! 😊  


ModelPerformance(timeToFirstTokenInMills=17.103209, totalTimeInMills=1746.0, promptTokens=14, generationTokens=123, inputTokenRatePerSec=818.5598386829045, outputTokenRatePerSec=71.14363369768093)