### Intro to LangKit

Table of Contents
- [Install LangKit](#intro-to-langkit)
- [Initialize LLM metrics](#initialize-llm-metrics)
- [Hello, World!](#hello-world)
- [Comparing Data](#comparing-data)
- [Next Steps](#next-steps)


OK, let's install __langkit__.

In [None]:
# Note: you may need to restart the kernel to use updated packages.
%pip install 'langkit[all]' -q

## Initialize LLM metrics
LangKit provides a toolkit of metrics for LLM applications. Let's initialize them!

In [None]:
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why

why.init(session_type='whylabs_anonymous')
# Note: llm_metrics.init() downloads models, so it is slow the first time it runs.
schema = llm_metrics.init()

## Hello, World!
In the code below, we log a few example prompt/response pairs and send the resulting metrics to WhyLabs.

In [2]:
from langkit.whylogs.samples import load_chats

results = why.log(load_chats(), name="langkit-sample-chats-all", schema=schema)

✅ Aggregated 15 rows into profile 'langkit-sample-chats-all'

Visualize and explore this profile with one-click
🔍 https://hub.whylabsapp.com/resources/model-1/profiles?sessionToken=session-6tuzHg&profile=ref-JHJ75222R6VdaTge
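
The sample chats are just prompt/response pairs, so the same call can be pointed at your own data. The sketch below assumes that LangKit's metrics read columns named `prompt` and `response` (the `response.sentiment_nltk` metric name in the profile suggests this) and that `why.log` accepts a plain dictionary for a single row. The helpers `build_record` and `log_interaction`, and the profile name `my-first-chat`, are hypothetical names introduced only for illustration.

```python
def build_record(prompt: str, response: str) -> dict:
    """Package one interaction under the assumed "prompt"/"response" column names."""
    return {"prompt": prompt, "response": response}


def log_interaction(prompt: str, response: str, schema, name: str = "my-first-chat"):
    """Log a single prompt/response pair with the schema from llm_metrics.init()."""
    # Imported lazily so build_record can be used without whylogs installed.
    import whylogs as why

    return why.log(build_record(prompt, response), name=name, schema=schema)


record = build_record("Hello!", "World!")
print(record)
```

In the notebook, after the initialization cell has run, something like `results = log_interaction("Hello!", "World!", schema)` should produce a one-row profile in the same way as the sample chats.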


## Comparing Data
Things get more interesting when you can compare two sets of metrics from an LLM application. The power of gathering systematic telemetry over time comes from being able to see how these metrics change, or how two sets of profiles compare. For the two toy examples below, we asked GPT for some positive words and then for some negative words.

In [3]:
pos_chats = load_chats("pos")
neg_chats = load_chats("neg")

results_comparison = why.log(multiple={"positive_chats": pos_chats,
                                       "negative_chats": neg_chats},
                            schema=schema)

✅ Aggregated 5 lines into profile 'positive_chats', 5 lines into profile 'negative_chats'

Visualize and explore the profiles with one-click
🔍 https://hub.whylabsapp.com/resources/model-1/profiles?sessionToken=session-6tuzHg&profile=ref-zMGVHP0ww3QrGUXS&profile=ref-Rfryq1IXRIzqKtFL

Or view each profile individually
 ⤷ https://hub.whylabsapp.com/resources/model-1/profiles?profile=ref-zMGVHP0ww3QrGUXS&sessionToken=session-6tuzHg
 ⤷ https://hub.whylabsapp.com/resources/model-1/profiles?profile=ref-Rfryq1IXRIzqKtFL&sessionToken=session-6tuzHg


You can explore and compare specific metrics. In this example, we expect a large and obvious distribution drift in the sentiment scores on the response, which you can see [here](https://hub.whylabsapp.com/resources/model-1/profiles?feature-highlight=response.sentiment_nltk&includeType=discrete&includeType=non-discrete&limit=30&offset=0&profile=ref-zMGVHP0ww3QrGUXS&profile=ref-Rfryq1IXRIzqKtFL&sessionToken=session-6tuzHg&sortModelBy=LatestAlert&sortModelDirection=DESC).
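
To make "distribution drift" concrete, here is a minimal standard-library sketch that compares two sets of sentiment scores using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). This is one common way to quantify drift and is not necessarily what WhyLabs computes. The scores are made-up illustrative values, assumed to lie in [-1, 1]; they are not taken from the profiles above.

```python
from bisect import bisect_right
from statistics import mean


def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in points
    )


# Hypothetical sentiment scores, for illustration only.
positive_scores = [0.62, 0.71, 0.80, 0.55, 0.90]
negative_scores = [-0.70, -0.45, -0.82, -0.60, -0.30]

drift = ks_statistic(positive_scores, negative_scores)
print(f"mean(pos)={mean(positive_scores):.2f}, "
      f"mean(neg)={mean(negative_scores):.2f}, KS={drift:.2f}")
```

The statistic ranges from 0 (identical samples) to 1 (no overlap at all), so two groups of scores that sit on opposite sides of zero, as in this toy data, give the maximum value.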

## Next Steps
If you see value in detecting changes in how your LLM application is behaving, you might take a look at some of our other examples showing how to monitor these metrics as a timeseries for an LLM application in production, or how to customize the metrics logged by using your own surrogate models or critic metrics.
* Check out the [examples](https://github.com/whylabs/langkit/tree/main/langkit/examples) folder for scenarios ranging from "Hello World!" to monitoring an LLM in production!
* Learn more about the [features](https://github.com/whylabs/langkit#features) LangKit extracts out of the box.
* Learn more about LangKit's modules in the [modules documentation](https://github.com/whylabs/langkit/blob/main/langkit/docs/modules.md).
* Explore more on [WhyLabs](https://whylabs.ai/whylabs-free-sign-up?utm_source=github&utm_medium=referral&utm_campaign=langkit) and monitor your LLM application over time!

