<a id="intro-to-langkit"></a>

from: https://github.com/whylabs/langkit/blob/main/langkit/examples/Intro_to_Langkit.ipynb

## 📖 Intro to LangKit

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/whylabs/langkit/blob/main/langkit/examples/Intro_to_Langkit.ipynb)



Table of Contents
- [Intro to LangKit](#intro-to-langkit)
- [What is LangKit](#what-is-langkit)
- [Initialize LLM metrics](#initialize-llm-metrics)
- [Hello, World!](#hello-world)
- [Comparing Data](#comparing-data)
- [Monitor Metrics over time](#monitor-metrics-over-time)
- [Next Steps](#next-steps)

<a id="what-is-langkit"></a>
### 📚 What is LangKit?
> LangKit is an open-source text metrics toolkit for monitoring language models. It offers an array of methods for extracting relevant signals from the input and/or output text, which are compatible with the open-source data logging library [whylogs](https://whylogs.readthedocs.io/en/latest).
> In this example, we'll look for distribution drift in the sentiment scores of the model response. 💡

OK! Let's install __langkit__.

In [1]:
# Note: you may need to restart the kernel to use updated packages.
%pip install -U 'langkit[all]>=0.0.34'
%pip install "whylogs[viz]"

Collecting langkit>=0.0.34 (from langkit[all]>=0.0.34)
  Downloading langkit-0.0.35-py3-none-any.whl.metadata (4.8 kB)
Collecting whylabs-textstat<0.8.0,>=0.7.4 (from langkit>=0.0.34->langkit[all]>=0.0.34)
  Downloading whylabs_textstat-0.7.4-py3-none-any.whl.metadata (15 kB)
Collecting whylogs<2.0.0,>=1.5.0 (from langkit>=0.0.34->langkit[all]>=0.0.34)
  Downloading whylogs-1.6.4-py3-none-any.whl.metadata (7.2 kB)
Collecting datasets<3.0.0,>=2.12.0 (from langkit[all]>=0.0.34)
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting detoxify<0.6.0,>=0.5.2 (from langkit[all]>=0.0.34)
  Downloading detoxify-0.5.2-py3-none-any.whl.metadata (13 kB)
Collecting evaluate<0.5.0,>=0.4.0 (from langkit[all]>=0.0.34)
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting ipywidgets<9.0.0,>=8.1.1 (from langkit[all]>=0.0.34)
  Downloading ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting presidio-analyzer<3.0.0,>=2.2.351 (from langkit[all]>=0.0.34)
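
After the install finishes (and the kernel has been restarted, if needed), it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; `installed_version` and `version_tuple` are hypothetical helper names introduced here, not part of LangKit or whylogs.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str):
    """Turn '0.0.35' into (0, 0, 35); non-numeric parts are ignored."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


# Report what is installed for the two packages this notebook relies on.
for name in ("langkit", "whylogs"):
    print(name, installed_version(name) or "not installed")
```

Comparing `version_tuple(installed_version("langkit"))` against `version_tuple("0.0.34")` gives a rough check of the minimum version requested above; it assumes plain dotted numeric versions.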
 

<a id="initialize-llm-metrics"></a>
### 🚀 Initialize LLM metrics
LangKit provides a toolkit of metrics for LLM applications. Let's initialize them!

NOTE: When prompted for a session type, select option 3 (Local) so that no data is uploaded anywhere.

In [1]:
from langkit import llm_metrics # alternatively use 'light_metrics'
import whylogs as why

why.init(upload_on_log=False)
# Note: llm_metrics.init() downloads models, so it is slow the first time.
schema = llm_metrics.init()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


❓ What kind of session do you want to use?
 ⤷ 1. WhyLabs. Use an api key to upload to WhyLabs.
 ⤷ 2. WhyLabs Anonymous. Upload data anonymously to WhyLabs and get a viewing url.
 ⤷ 3. Local. Don't upload data anywhere.

Enter a number from the list: 3
Initializing session with config /root/.config/whylogs/config.ini

✅ Using session type: LOCAL. Profiles won't be uploaded or written anywhere automatically.


<a id="hello-world"></a>
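### 👋 Hello, World!
In the code below we log a few example prompt/response pairs and compute the LangKit metrics for them. Because we chose a local session, the resulting profile is kept locally rather than uploaded to WhyLabs.

NOTE: You need a Hugging Face (HF) token to run this step.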

In [2]:
from langkit.whylogs.samples import load_chats, show_first_chat

# Let's look at what's in this toy example:
chats = load_chats()
print(f"There are {len(chats)} records in this toy example data, here's the first one:")
show_first_chat(chats)

results = why.log(chats, name="langkit-sample-chats-all", schema=schema)

There are 50 records in this toy example data, here's the first one:
prompt: Hello, response: World!



Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


### Visualize results

We use the whylogs library to visualize the results.

In [10]:
from whylogs.viz import NotebookProfileVisualizer

prof_view = results.view()

visualization = NotebookProfileVisualizer()

# Set the profile to visualize
visualization.set_profiles(target_profile_view=prof_view)

# Convert the profile view to a Pandas DataFrame
profile_df = prof_view.to_pandas()

# Extract the feature names
feature_names = profile_df.index.tolist()

# Display the feature names
print(feature_names)

['prompt', 'prompt.aggregate_reading_level', 'prompt.automated_readability_index', 'prompt.character_count', 'prompt.difficult_words', 'prompt.flesch_reading_ease', 'prompt.has_patterns', 'prompt.jailbreak_similarity', 'prompt.letter_count', 'prompt.lexicon_count', 'prompt.monosyllable_count', 'prompt.polysyllable_count', 'prompt.sentence_count', 'prompt.sentiment_nltk', 'prompt.syllable_count', 'prompt.toxicity', 'response', 'response.aggregate_reading_level', 'response.automated_readability_index', 'response.character_count', 'response.difficult_words', 'response.flesch_reading_ease', 'response.has_patterns', 'response.letter_count', 'response.lexicon_count', 'response.monosyllable_count', 'response.polysyllable_count', 'response.refusal_similarity', 'response.relevance_to_prompt', 'response.sentence_count', 'response.sentiment_nltk', 'response.syllable_count', 'response.toxicity']
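
The names in this list follow a `column.metric` pattern: each metric is attached to the column it was derived from. As a small illustration (a sketch only, using a subset of the names printed above and a hypothetical helper `group_by_column` that is not part of whylogs or LangKit), the names can be grouped by source column with plain Python:

```python
from collections import defaultdict

# A few of the feature names printed above, copied verbatim.
feature_names = [
    "prompt",
    "prompt.jailbreak_similarity",
    "prompt.sentiment_nltk",
    "prompt.toxicity",
    "response",
    "response.refusal_similarity",
    "response.relevance_to_prompt",
    "response.sentiment_nltk",
    "response.toxicity",
]


def group_by_column(names):
    """Map each source column to the metric names derived from it."""
    grouped = defaultdict(list)
    for name in names:
        column, _, metric = name.partition(".")
        if metric:  # skip the bare column entries such as 'prompt'
            grouped[column].append(metric)
    return dict(grouped)


grouped = group_by_column(feature_names)
for column, metrics in grouped.items():
    print(f"{column}: {', '.join(metrics)}")
```

Grouping this way makes it easy to see that some metrics exist for only one side of the conversation: in the list above, `jailbreak_similarity` appears only under `prompt`, while `refusal_similarity` and `relevance_to_prompt` appear only under `response`.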


In [12]:
visualization.double_histogram('prompt.jailbreak_similarity')



<a id="comparing-data"></a>
### 🔍 Comparing Data
Things get more interesting when you can compare two sets of metrics from an LLM application. The power of gathering systematic telemetry over time comes from being able to see how these metrics change, or how two sets of profiles compare. In the two toy examples below, we asked GPT for some positive words and then for some negative words.

> 💡 Take a look at the difference in the `response.sentiment_nltk` distributions between the two profiles. The first example is much more positive than the second example.

In [17]:
pos_chats = load_chats("pos")
show_first_chat(pos_chats)

neg_chats = load_chats("neg")
show_first_chat(neg_chats)

results_pos = why.log(pos_chats, schema=schema)
results_net = why.log(neg_chats, schema=schema)

prompt: What do you think about puppies? response: Puppies are absolutely adorable. Their playful nature and boundless energy can bring a lot of joy and happiness.

prompt: Can you describe a difficult day? response: A difficult day might be filled with challenging tasks, stressful situations, and unexpected obstacles. :-( These moments can feel overwhelming and can lead to feelings of frustration!



### Visualize results


In [18]:
prof_view = results_pos.view()
ref_view = results_net.view()

visualization = NotebookProfileVisualizer()

# Set the target and reference profiles to compare
visualization.set_profiles(target_profile_view=prof_view, reference_profile_view=ref_view)

visualization.summary_drift_report()
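
To build some intuition for what "drift" between two distributions means for a numeric column such as `response.sentiment_nltk`, here is a rough, standard-library-only sketch of the two-sample Kolmogorov-Smirnov statistic, one common way to measure how far apart two samples are. Treat it as an illustration under stated assumptions: the scores below are made up (not taken from the notebook's data), `ks_statistic` is a hypothetical helper, and whylogs works from profile summaries rather than raw values, so this is not the library's implementation.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical sentiment scores in [-1, 1]; NOT taken from the notebook's data.
positive_like = [0.8, 0.6, 0.7, 0.9, 0.5]
negative_like = [-0.7, -0.4, -0.6, 0.1, -0.8]

print(ks_statistic(positive_like, positive_like))  # identical samples: 0.0
print(ks_statistic(positive_like, negative_like))  # fully separated samples: 1.0
```

A value near 0 means the two samples look alike, while a value near 1 means they barely overlap, which is the kind of gap the positive and negative toy examples are meant to show.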