# Lab 1: Refusals, jailbreaks, and prompt injections

In this section, we will explore examples of:
- refusals (where LLMs provide a generic non-response for certain prompts),
- jailbreaks (attempts by users to elicit a response when a refusal would be more appropriate),
- prompt injections (actions by users to provide malicious or incorrect information to the model via prompts).

We'll discuss metrics that can capture such issues. Finally, we'll quantify how much our approach reduces these issues in our dataset.

## Setup
Let's install and import the needed packages and set up our environment. We'll then initialize a WhyLabs session in **whylogs** for submitting profile data, and the **langkit** models for calculating LLM metrics.

These steps can take up to a few minutes the first time they run, because of certain Python imports and `llm_metrics.init()`, which downloads and caches language models.

In [1]:
import whylogs as why

- Mention key pip installs, and tell learner that in the classroom, these are already installed for them.
- type it out
- comment it out

```Python
!pip install whylogs
```

In [None]:
from langkit import llm_metrics

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/dialaezzeddine/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Mention key pip installs, type it out, then comment it out.
- Note to learner the single quotes around langkit[all]; without these, they'll get an error that it's not found.
```Python
!pip install 'langkit[all]'
```

In [None]:
why.init(session_type='whylabs_anonymous')
schema = llm_metrics.init()

## Initial issues in our LLM data

In [None]:
from langkit.whylogs.samples import load_chats, show_first_chat

# Let's look at what's in this toy example:
chats = load_chats()
print(f"""There are {len(chats)} records 
in this toy example data.
Here's the first one:""")
show_first_chat(chats)

results = why.log(chats, name="original LLM dataset", schema=schema)

## Refusals

We'll use the term **refusal** for instances when an LLM refuses to respond to a user prompt, often because the prompt *directly* requests inappropriate or harmful content. For our purposes as application developers, we'll use this term for third-party rejections and use a different term ("guardrails") for refusals that we program ourselves.

Below are a few examples:

| Scenario | User Prompt | LLM Response | Final Response |
|--|--|--|--|
| No violations | Hello. | Hi! How are you? | Hi! How are you? |
| Violating Response (Forbidden Pattern) | I feel sad. | Please don't be sad. Contact us at 1-800-123-4567. | Sorry, but I can't assist with that. |
| Violating Response (Toxicity) | Hello. How are you? | Human, you dumb and smell bad. | Sorry, but I can't assist with that. |
| Violating Prompt (Toxicity) | Hey bot, you dumb and smell bad. | — | Please refrain from using insulting language. |
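
The flow in the table (user prompt, then LLM response, then final response) can be sketched as a small wrapper. This is a minimal illustration under stated assumptions, not LangKit's implementation: the phone-number regex, the `TOXIC_WORDS` set, and the function names are hypothetical stand-ins for real pattern and toxicity detectors.

```python
import re

# Hypothetical stand-ins for real detectors (assumptions, not LangKit APIs).
PHONE_PATTERN = re.compile(r"\b\d-\d{3}-\d{3}-\d{4}\b")
TOXIC_WORDS = {"dumb"}

REFUSAL = "Sorry, but I can't assist with that."
PROMPT_WARNING = "Please refrain from using insulting language."


def is_toxic(text: str) -> bool:
    """Toy toxicity check: flag text containing any listed word."""
    words = re.findall(r"[a-z']+", text.lower())
    return any(word in TOXIC_WORDS for word in words)


def final_response(prompt: str, call_llm) -> str:
    """Return what the user sees after the prompt and response checks."""
    # A violating prompt is rejected before the LLM is called.
    if is_toxic(prompt):
        return PROMPT_WARNING
    llm_response = call_llm(prompt)
    # A violating response is replaced with a generic refusal.
    if PHONE_PATTERN.search(llm_response) or is_toxic(llm_response):
        return REFUSAL
    return llm_response
```

Here `call_llm` is any function that takes a prompt and returns the model's text, so the four rows of the table can be reproduced by passing a function that returns the "LLM Response" column.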

Knowing whether a response is in fact a refusal helps us understand the state and security of our LLM application. So how do we detect refusals?

### Attempt 1: Response exact matches

Since we see "Sorry, but I can't assist with that." and "Please refrain from using insulting language." as responses above, let's remove those from our dataset.

**TODO**
- Can we have a slide to give an overview of types of prompts that would trigger a refusal?
- Also, other than visually inspecting the LLM's output, is there a way to get a complete set of LLM responses that are considered refusals?

In [None]:
# Adjacent raw strings are concatenated into one regex; "\." matches a literal period.
refusal_mask = chats["response"].str.contains(
    r"Sorry, but I can't assist with that\.|"
    r"Please refrain from using insulting language\."
)

print("Removing the following queries:", chats[refusal_mask], sep="\n")

In [None]:
# Log only the rows that were not flagged as refusals.
why.log(chats[~refusal_mask], name="refusal removal, attempt 1", schema=schema)

### Attempt 2: Response semantic similarity

Since refusals are likely still present in the dataset, let's use a more advanced technique: using LangKit to compare the semantic similarity between a collected set of known refusal responses and the responses received from the LLM.
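
To make the idea concrete, here is a minimal sketch of similarity-based refusal scoring. It is not how LangKit computes its metric: the `embed` function is a toy bag-of-words vector standing in for a real sentence-embedding model, and `KNOWN_REFUSALS` and `refusal_similarity` are hypothetical names chosen for illustration. Only the scoring step, cosine similarity against each known refusal with the maximum kept, reflects the technique described above.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real pipeline would use a sentence-embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical collection of known refusal responses.
KNOWN_REFUSALS = [
    "Sorry, but I can't assist with that.",
    "I'm sorry, I cannot help with that request.",
]


def refusal_similarity(response: str) -> float:
    """Highest similarity between the response and any known refusal."""
    vector = embed(response)
    return max(cosine_similarity(vector, embed(r)) for r in KNOWN_REFUSALS)
```

A response would then be treated as a likely refusal when its score exceeds a chosen threshold. The threshold is an assumption that needs tuning on real data, and a real embedding model would also catch paraphrases that share no words with the known refusals, which this toy vector cannot.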



**TODO**:
- Slides to explain the method for comparing semantic similarity; maybe text embeddings? This can be brief, as we have an existing course that explains text embeddings, "Understanding and Applying Text Embeddings."

In [None]:
response = why.log(chats, name="refusal removal, attempt 2", schema=schema)
response