## Classification using Prompt Engineering
Large language models (LLMs) can be a useful tool for information extraction tasks such as text classification. Common applications include classifying the intent of user interactions that arrive via channels such as email, chatbots, and voice, or categorizing documents so that requests can be routed to downstream systems. The first step is to identify the intent or class of the user's request or document. These intents or classes can take many forms, from short single-word labels to thousands of hierarchical classes and sub-classes.

In the following, we demonstrate prompt engineering on synthetic conversation data to extract intents. Additionally, we show how pre-trained models can be assessed to determine if fine-tuning is needed. 

Let's start with a simple example. We have a list of customer interactions with an imaginary health/life insurance company. To start, we use the Llama2-70B-chat model deployed in the previous section.



In [None]:
import json
import utils

inference_instance_type = "ml.g5.48xlarge"

# Llama 70b chat
model_id, model_version = "meta-textgeneration-llama-2-70b-f", "2.*"
endpoint_name = model_id

In [None]:
predictor = utils.get_predictor(
    endpoint_name=endpoint_name,
    model_id=model_id,
    model_version=model_version,
    inference_instance_type=inference_instance_type,
)

`get_predictor` is a helper function that creates a predictor object from a model ID and version. If the specified endpoint doesn't exist, it creates a new endpoint and deploys the model. If the endpoint already exists, it reuses it.

In [None]:
customer_interactions = [
    """Hello, I've recently moved to a new state and I need to update my address for my health insurance policy.
Can you assist me with that?
""",
    """Good afternoon! I'm interested in adding dental coverage to my existing health plan.
Could you provide me the options and prices?
""",
    """I had a disappointing experience with the customer service yesterday regarding my claim.
I want to file a formal complaint and speak with a supervisor.
""",
]

In [None]:
system = """
Your task is to identify the customer intent from their interactions with support bot in the provided text. The intention output is not more than 4 words. If the intent is not clear, please provide a fallback intent of "unknown".
"""

#### Using a capable LLM such as llama-2-70b-chat


In [None]:
def get_intent(system, customer_interactions):
    for customer_interaction in customer_interactions:
        response = utils.llama2_chat(
            predictor,
            system=system,
            user=customer_interaction,
        )
        content = utils.llama2_parse_output(response)
        print(content)

In [None]:
get_intent(system, customer_interactions)

Looking at the output, these seem reasonable as the intents. However, the format and style of the intents can vary depending on the language model. Another limitation of this approach is that intents are not confined to a predefined list, which means the language model might generate and word the intents differently each time we run it.

To address this, we can use in-context learning, a prompt engineering technique, to steer the model towards selecting from a predefined set of intents, or class labels, that we provide. In the example below, alongside the customer conversation, we also include a list of potential intents and ask the model to choose from this list.

In [None]:
system = """
Your task is to identify the intent from the customer interaction with the support bot. Identify the intent of the provided text using the list of provided intent delimited with #### to classify the customer intention. If the intent is not clear, please provide a fallback intent of "unknown". ONLY write the intent.

####
- information change
- add coverage
- complaint
- portal navigation
- free product upgrade
####
"""

In [None]:
get_intent(system, customer_interactions)

Reviewing the results, it's evident that the language model performs well in selecting the appropriate intent in the desired format.
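
Even when the model usually picks from the list, it can help to check each response against the allowed labels before passing it downstream. The following is a minimal sketch of such a check, not part of the notebook's `utils` module: the function name `normalize_intent` and the cleanup rules are assumptions chosen for illustration, and the label list mirrors the one in the prompt above.

```python
# Labels copied from the prompt above.
ALLOWED_INTENTS = [
    "information change",
    "add coverage",
    "complaint",
    "portal navigation",
    "free product upgrade",
]


def normalize_intent(raw_output, allowed=ALLOWED_INTENTS, fallback="unknown"):
    """Map a raw model response onto an allowed intent, or return the fallback.

    Hypothetical helper: the cleanup rules here are an assumption, not the
    notebook's own post-processing.
    """
    # Lowercase, then strip whitespace, list bullets, quotes, and periods
    # from both ends of the response.
    cleaned = raw_output.strip().lower().strip(" \t\n-*\"'.")
    # Collapse any internal runs of whitespace to a single space.
    cleaned = " ".join(cleaned.split())
    return cleaned if cleaned in allowed else fallback


print(normalize_intent("  - Add Coverage\n"))
print(normalize_intent("The intent is complaint."))
```

An exact match after light cleanup is deliberately strict: a response that wraps the label in extra words falls back to `unknown` rather than being guessed at.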

### Sub-intents and intent trees


Real-life use cases are often more complex than the scenario above: intents may span a large number of categories and be organized hierarchically, which makes the classification task more challenging for the model. We therefore refine the prompt to describe the intent hierarchy and to include examples of the expected output, a technique known as n-shot learning (also called k-shot or few-shot learning).
This is the intent tree we are using in this example:


In [None]:
system = """
Your task is to identify the intent from the customer interaction with the support bot. Identify the intent of the provided text using the list of provided intent tree delimited with #### to classify the customer intention. The intents are defined in classes and sub-classes. Write the intention with this format: <main-intent>:<sub-intent>. ONLY write the intent.

OUTPUT EXAMPLE:
profile_update:contact_info

OUTPUT EXAMPLE:
customer_retention:complaint

####
{intents}
####
"""


intents_json = json.dumps(utils.INTENTS, indent=4)
system = system.format(intents=intents_json)
get_intent(system, customer_interactions)

While large language models can often correctly identify the intent from a list of possible intents, they may sometimes produce additional output or fail to adhere to the exact intent structure and output schema. There are also scenarios where intents are not as straightforward as they initially seem, or are highly specific to a business domain context that the model doesn't fully comprehend. For instance, the model may misinterpret customer interactions if it has not been adequately trained on niche intents.
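
One way to cope with extra text around the answer is to extract the first `<main-intent>:<sub-intent>` pair that actually exists in the intent tree. The sketch below is an illustration under assumptions: `extract_intent` is a hypothetical helper, and `EXAMPLE_INTENTS` is a small stand-in that assumes the tree is a list of dicts with `main_intent` and `sub_intents` keys (the shape the flattening cell later in this section iterates over). It contains only labels that appear in this section's prompts, not the full `utils.INTENTS` tree.

```python
import re

# Illustrative subset of an intent tree (assumed shape, see lead-in).
EXAMPLE_INTENTS = [
    {"main_intent": "profile_update", "sub_intents": ["contact_info"]},
    {"main_intent": "customer_retention", "sub_intents": ["complaint"]},
    {"main_intent": "health_cover", "sub_intents": ["remove_extras"]},
]

# Matches lowercase snake_case pairs such as "profile_update:contact_info".
INTENT_PATTERN = re.compile(r"\b([a-z_]+):([a-z_]+)\b")


def extract_intent(raw_output, intents=EXAMPLE_INTENTS):
    """Return the first valid main:sub pair found in the response, else None."""
    valid = {
        (item["main_intent"], sub)
        for item in intents
        for sub in item["sub_intents"]
    }
    for main, sub in INTENT_PATTERN.findall(raw_output.lower()):
        if (main, sub) in valid:
            return f"{main}:{sub}"
    return None


print(extract_intent("Sure! The intent is: profile_update:contact_info."))
```

Returning `None` for anything outside the tree keeps schema violations visible, so they can be counted or logged instead of silently accepted.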

In the sample interactions below, the customer in the first interaction ultimately wants to change their coverage, but their immediate question, and therefore the intent of the interaction, is to get help with "portal navigation". Similarly, in the second interaction, the more appropriate intent is "free product upgrade", which is what the customer is requesting. However, the model is unable to detect these nuanced intents as accurately as desired (see model outputs below):


In [None]:
customer_interactions = [
    "I want to change my coverage plan. But I'm not seeing where to do this on the online website. Could you please point me to it?",
    "I'm unhappy with the current benefits of my plan and I'm considering canceling unless there are better alternatives. What can you offer?",
]

In [None]:
get_intent(system, customer_interactions)

#### Evaluate the model performance


In [None]:
import pandas as pd

intent_dataset_test_file = "data/intent_dataset_test.jsonl"
incorrect_responses_log_file = "data/log_incorrect_responses_llama2.jsonl"
error_responses_log_file = "data/log_error_responses_llama2.jsonl"
test_dataset = utils.load_dataset(intent_dataset_test_file)

df = pd.DataFrame(test_dataset)
print(len(df))

Evaluate the model with zero-shot learning and without the classes in the prompt.

In [None]:
system = """
Your task is to identify the intent from the customer interaction with the support bot. Identify the intent of the provided text. The intents are defined in classes and sub-classes. Write the intention with this format: <main-intent>:<sub-intent>. ONLY write the intent.

OUTPUT EXAMPLE:
profile_update:contact_info

OUTPUT EXAMPLE:
customer_retention:complaint
"""

In [None]:
res = utils.evaluate_model(
    predictor=predictor,
    llm=utils.llama2_chat,
    dataset=test_dataset,
    system_message=system,
    prompt_template="{query}",
    max_tokens=15,
    response_formatter=utils.llama2_chat_output_intent_formatter,
    incorrect_responses_log_file=incorrect_responses_log_file,
    error_responses_log_file=error_responses_log_file,
)

In [None]:
utils.print_eval_result(res, test_dataset)
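
The internals of `utils.evaluate_model` and `utils.print_eval_result` are not shown in this section. As a rough sketch of the kind of metric such an evaluation typically reports, the following computes overall accuracy and accuracy per main intent from two aligned lists. The function name `accuracy_report`, the sample predictions, and the label `health_cover:add_extras` are hypothetical and used only for illustration.

```python
from collections import defaultdict


def accuracy_report(predictions, labels):
    """Return (overall accuracy, accuracy per main intent) for aligned lists."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty dataset")

    correct = 0
    # Maps main intent -> [number correct, number of test samples].
    per_class = defaultdict(lambda: [0, 0])
    for predicted, expected in zip(predictions, labels):
        # Group by the part of the expected label before the colon.
        main_intent = expected.split(":", 1)[0]
        per_class[main_intent][1] += 1
        if predicted == expected:
            correct += 1
            per_class[main_intent][0] += 1

    overall = correct / len(labels)
    by_class = {name: hits / total for name, (hits, total) in per_class.items()}
    return overall, by_class


# Hypothetical predictions and labels, for illustration only.
sample_predictions = [
    "profile_update:contact_info",
    "customer_retention:complaint",
    "health_cover:remove_extras",
    "unknown",
]
sample_labels = [
    "profile_update:contact_info",
    "customer_retention:complaint",
    "health_cover:add_extras",
    "customer_retention:complaint",
]
overall, by_class = accuracy_report(sample_predictions, sample_labels)
print(overall, by_class)
```

Breaking accuracy down by main intent makes it easier to see which branches of the intent tree the model struggles with.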

Evaluate the model with one-shot learning and the classes in the prompt.

In [None]:
system = """Classify the input text only from the intents listed below. Write your response similar to the examples provided below.

ONLY write the intent. DO NOT write any other text other than the intent.

Intents:
{intents}

Example:
Input:

"Given that I haven't used the dental services, is it wise to continue paying for them?"

Response:

health_cover:remove_extras
"""

`llama2` performs better when the in-context intents are flattened in the prompt rather than given in JSON format. So, we'll flatten the intents in the prompt using the following format:

[main intent]:[sub intent]


In [None]:
# convert INTENTS to <main-intent>:<sub-intent> format
intents = []
for item in utils.INTENTS:
    for sub_intent in item["sub_intents"]:
        intents.append(f"{item['main_intent']}:{sub_intent}")

system_message = system.format(intents="\n".join(intents))

print(system_message)
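
The prompt above carries a single example (one-shot). If more labeled examples are available, the same layout can be generated for several of them (n-shot). The sketch below is an assumption-laden illustration rather than the notebook's method: `build_few_shot_prompt` is a hypothetical helper, the intent list is a small subset, and the pairing of the second example text with `customer_retention:complaint` is assumed for demonstration.

```python
def build_few_shot_prompt(intents, examples):
    """Build a system prompt with flattened intents and n labeled examples.

    intents:  list of "main_intent:sub_intent" strings
    examples: list of (input_text, expected_label) tuples
    """
    lines = [
        "Classify the input text only from the intents listed below. "
        "Write your response similar to the examples provided below.",
        "",
        "ONLY write the intent. DO NOT write any other text other than the intent.",
        "",
        "Intents:",
        *intents,
    ]
    # Append one "Example" section per labeled pair, same layout as above.
    for text, label in examples:
        lines += ["", "Example:", "Input:", "", f'"{text}"', "", "Response:", "", label]
    return "\n".join(lines) + "\n"


# Illustrative subset of flattened intents and two examples.
demo_intents = [
    "profile_update:contact_info",
    "customer_retention:complaint",
    "health_cover:remove_extras",
]
demo_examples = [
    (
        "Given that I haven't used the dental services, is it wise to continue paying for them?",
        "health_cover:remove_extras",
    ),
    (
        "I want to file a formal complaint and speak with a supervisor.",
        "customer_retention:complaint",  # assumed label for this text
    ),
]
prompt = build_few_shot_prompt(demo_intents, demo_examples)
print(prompt)
```

Each added example lengthens the prompt, so the number of examples is a trade-off against context length and query cost.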

In [None]:
res = utils.evaluate_model(
    predictor=predictor,
    llm=utils.llama2_chat,
    dataset=test_dataset,
    system_message=system_message,
    prompt_template="{query}",
    max_tokens=15,
    response_formatter=utils.llama2_chat_output_intent_formatter,
    incorrect_responses_log_file=incorrect_responses_log_file,
    error_responses_log_file=error_responses_log_file,
)
utils.print_eval_result(res, test_dataset)

While prompt engineering can successfully extract specific intent classes, and we may achieve success with additional prompt engineering to further clarify such intents, it may not be sufficient when there are many defined classes in a complex hierarchy. Some scenarios where prompt engineering alone might be limiting include:
- A large number of classes and/or long conversations that exceed the context window of the LLM or make queries more costly
- The desired output is in a specific format that the LLM fails to follow
- The need to improve model performance by teaching the domain or the task

In these scenarios, we can fine-tune the LLM using labeled data to improve the performance.
In the next section, we demonstrate how fine-tuning can boost the accuracy of the LLM for the intent classification task attempted above.


