<center>
<h1><b>Classification tasks with Azure OpenAI</b></h1>
</center>

# Introduction

Welcome! In this section, we look at adverse news media analysis for B2B segments. Adverse news media coverage affects a company's financial ratings: it leads to negative sentiment, which in turn can lower those ratings.

**Task 1: Auto-Labelling System**

Initially, our focus is on the categories with the highest incidence of mislabeling, which the current system suffers from. An automated news media sentiment classifier will significantly reduce the time and effort required to source and analyze adverse news media daily. This will enable the company to respond quickly to negative news and take appropriate action to mitigate the impact on investors, for example by modifying or downgrading a company's rating. Each headline is assigned one of two labels:
1. Positive
2. Negative

Let's start by setting up our code environment.

## Setup

## Installation

In [None]:
!pip install openai==1.2.0 tiktoken datasets session-info --quiet

## Imports

In [1]:
# Import all Python packages required to access the Azure Open AI API.
# Import additional packages required to access datasets and create examples.

import json
import random
import tiktoken
import session_info

import pandas as pd
import numpy as np

from openai import AzureOpenAI

from datasets import load_dataset
from collections import Counter
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Authentication

In [3]:
with open('config.json', 'r') as az_creds:
    data = az_creds.read()

In [4]:
creds = json.loads(data)

In [5]:
client = AzureOpenAI(
    azure_endpoint=creds["AZURE_OPENAI_ENDPOINT"],
    api_key=creds["AZURE_OPENAI_KEY"],
    api_version=creds["AZURE_OPENAI_APIVERSION"]
)

In [6]:
deployment_name = creds["CHATGPT_MODEL"]

## Utilities

While developing the solution, we need to be mindful of the costs it will incur for the business. Even a good solution is not useful if it comes at too high a cost. For LLMs, costs are tied to the number of tokens consumed. Let's create a function using tiktoken to count the tokens used by each of our prompts. This information will be crucial when deciding which technique to use to solve the problem.

In [7]:
def num_tokens_from_messages(messages):

    """
    Return the number of tokens used by a list of messages.
    Adapted from the Open AI cookbook token counter
    """

    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

    # Each message is sandwiched with <|start|>role and <|end|>
    # Hence, messages look like: <|start|>system or user or assistant{message}<|end|>

    tokens_per_message = 3 # token1:<|start|>, token2:system(or user or assistant), token3:<|end|>

    num_tokens = 0

    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))

    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>

    return num_tokens

## Task 1: Auto-Label Classification

Let's have a look at the data and get a feel of it.

### Preparing Data

In [18]:
data = pd.read_csv(r'C:\Users\BISWA\Documents\BDM\GL\Week9\newsmedia_consolidated_file.csv',encoding="cp1252")

In [19]:
data.head(10)

Unnamed: 0,Headline,Category
0,Jio Financial Services gets RBI nod to become ...,NEGATIVE
1,No objection to release of engines to lessors:...,POSITIVE
2,Kesorams cement business demerger may be compl...,NEGATIVE
3,Dispute at Bandhan Employees Welfare Trust grows,NEGATIVE
4,"Welspun One raises Rs 2,275 cr for second fund...",NEGATIVE
5,Nykaa sees 1.47 crore shares exchanged in bloc...,NEGATIVE
6,Billionaire Gautam Adani now wants to build...,NEGATIVE
7,"Eicher, BHEL, and more: Kotak flags overvalued...",POSITIVE
8,Kotak evaluating if Kingdon deliberately misle...,POSITIVE
9,Hindenburg shared Adani report with client 2 m...,POSITIVE


In [20]:
data.Category.value_counts()

Category
POSITIVE    672
NEGATIVE    194
Name: count, dtype: int64

Note that the dataset is imbalanced: 672 headlines are labeled POSITIVE and only 194 are labeled NEGATIVE.

We still use accuracy as our metric to keep the evaluation simple, but it has to be read against the majority-class baseline: always predicting POSITIVE would already score about 78% (672 of 866). We also need to be mindful of the tokens consumed by each prompt, as this is going to be a perpetual task for the business with new headlines arriving every day.
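As a sanity check on accuracy as a metric, the majority-class baseline can be computed directly from the label counts. The sketch below copies the two counts from the `value_counts()` output above; the variable names are illustrative and not part of the notebook.

```python
from collections import Counter

# Label counts copied from the value_counts() output above.
label_counts = Counter({"POSITIVE": 672, "NEGATIVE": 194})

# The most frequent label and the accuracy of always predicting it.
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(label_counts.values())

print(majority_label, round(baseline_accuracy, 3))  # POSITIVE 0.776
```

Any prompt worth deploying should score clearly above this number on the gold examples.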

#### Test and Train Split

Let us split the data into two segments - one segment that gives us a pool to draw few-shot examples from and another segment that gives us a pool of gold examples which will be used for testing.

In summary, we extract a dataset from a corpus by processing required fields. Each example should contain the text input and an annotated label. Once we create examples and gold examples from this dataset, this curated dataset is stored in a format appropriate for reuse (e.g., JSON).

To select gold examples for this session, we sample randomly from the data using `random_state=42`. This ensures that the examples are the same across runs (i.e., they are randomly selected but do not change between different runs of the notebook). Note that we are doing this only to keep execution times low for illustration. In practice, a large number of gold examples facilitates robust estimates of model accuracy.

In [21]:
examples_df, gold_examples_df = train_test_split(
    data, #<- the full dataset
    test_size=0.8, #<- 80% random sample selected for gold examples
    random_state=42, #<- ensures that the splits are the same for every session
    stratify=data['Category'] #<- keeps the label proportions the same in both splits
)

In [88]:
gold_examples = (
        gold_examples_df.to_json(orient='records')
)

In [89]:
(examples_df.shape, gold_examples_df.shape)

((173, 2), (693, 2))
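To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of a stratified split. It is not how scikit-learn implements `train_test_split`; the function `stratified_split` and its rounding rule are assumptions made for illustration, and the label list is rebuilt from the counts shown earlier.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction, seed=42):
    """Return (train_idx, test_idx) with label proportions kept in both parts."""
    rng = random.Random(seed)

    # Group row positions by their label.
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    # Shuffle each group and cut it at the same fraction.
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Labels rebuilt from the counts in the dataset (672 POSITIVE, 194 NEGATIVE).
labels = ["POSITIVE"] * 672 + ["NEGATIVE"] * 194
train_idx, test_idx = stratified_split(labels, test_fraction=0.8)

print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

Because each label is split separately, the share of NEGATIVE headlines in the gold pool stays close to its share in the full dataset.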

In [90]:
gold_examples_df

Unnamed: 0,Headline,Category
168,Top 10 happiest countries in the world for ...,NEGATIVE
394,Singapore container ship logjam spills over to...,POSITIVE
228,NEET UG 2024 Hearing Live Updates: Supreme Cou...,POSITIVE
141,How Xi Jinping can surprise world with big-...,POSITIVE
707,Anand Mahindra: It’s no longer the Queen’s ...,POSITIVE
...,...,...
30,To every person who suffered ... : Govt dec...,POSITIVE
118,Captain Anshuman Singh’s parents want Armys...,POSITIVE
21,John Hunt: British police arrest man on sus...,POSITIVE
575,Forces will carry out its mission: Kim Jong...,POSITIVE


In [91]:
gold_examples_df.head(3)

Unnamed: 0,Headline,Category
168,Top 10 happiest countries in the world for ...,NEGATIVE
394,Singapore container ship logjam spills over to...,POSITIVE
228,NEET UG 2024 Hearing Live Updates: Supreme Cou...,POSITIVE


With everything set up, let's start working on our prompts.

### Step 3: Derive Prompt

#### Create prompts

In [92]:
user_message_template = """```{headline}```"""

Let's create a zero-shot prompt for this scenario. We need to make sure that the LLM outputs only the category label and no explanation, so let's add explicit instructions for that.

**Prompt 1: Zero-shot**

In [107]:
zero_shot_system_message = """
Classify the following News Media headlines presented in the input into one of the following categories uisng BERT sentiment classifier for submitting 
news headline.
Categories - ['Positive', 'Negative']
News description will be delimited by triple backticks in the input.
Answer only 'Positive' or 'Negative'. Nothing Else. Do not explain your answer.
"""

In [108]:
zero_shot_prompt = [{'role':'system', 'content': zero_shot_system_message}]

Let's check the number of tokens this prompt consumes.

In [109]:
num_tokens_from_messages(zero_shot_prompt)

77

**Let's try our zero-shot prompt on a single example.**

In [110]:
data.iloc[0,:]

Headline    Jio Financial Services gets RBI nod to become ...
Category                                             NEGATIVE
Name: 0, dtype: object

In [111]:
input_description = data.iloc[0,0]

user_input = [
    {
        'role':'user',
        'content': user_message_template.format(headline = input_description)
    }
]

Let's also cap the `max_tokens` parameter at 4 so that the model doesn't output explanations. We cap it at 4 instead of 2 because we want to leave a little leeway for punctuation marks and sub-word tokens that the model might output. It is better to clean up the output with a regex later than to prematurely over-constrain the LLM output.

In [112]:
response = client.chat.completions.create(
    model=deployment_name,
    messages=zero_shot_prompt+user_input,
    temperature=0, # <- Note the low temperature
    max_tokens=4 # <- Note how we restrict the output to not more than 4 tokens
)
print(response.choices[0].message.content)

Positive
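The regex clean-up mentioned above could look like the following sketch. The helper `normalize_label` is hypothetical and not part of the notebook; matching on a label prefix is an assumption made so that replies cut short by the token cap (for example `Negativ`) still resolve to a label.

```python
import re

# Match a label prefix so that truncated replies such as "Negativ" still resolve.
LABEL_PATTERN = re.compile(r"\b(posi|nega)", re.IGNORECASE)

def normalize_label(raw_output):
    """Map a raw model reply to 'positive', 'negative', or None if unrecognized."""
    match = LABEL_PATTERN.search(raw_output or "")
    if match is None:
        return None
    return "positive" if match.group(1).lower() == "posi" else "negative"

print(normalize_label(" Positive."))
print(normalize_label("Negativ"))
print(normalize_label("I am not sure"))
```

Returning `None` for unrecognized replies keeps them visible, so they can be counted separately instead of being silently scored as wrong.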


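The model responds with a bare label, as instructed. Note, however, that the prediction (Positive) does not match the label stored for this headline (NEGATIVE). Let's scale this up and create a generic evaluation function that can be used with all the prompting techniques we are going to try.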

In [113]:
def evaluate_prompt(prompt, gold_examples, user_message_template, samples_to_output=10):

    """
    Return the accuracy score for predictions on gold examples.
    For each example, we make a prediction using the prompt. Gold labels and
    model predictions are aggregated into lists and compared to compute the
    accuracy. Examples whose API call fails (e.g., content-filter errors) are
    skipped and do not count towards the accuracy.

    Args:
        prompt (List): list of messages in the Open AI prompt format
        gold_examples (str): JSON string with list of gold examples
        user_message_template (str): string with a placeholder for the headline
        samples_to_output (int): number of sample predictions and ground truths to print

    Output:
        accuracy (float): Accuracy computed by comparing model predictions
                                with ground truth
    """

    count = 0
    model_predictions, ground_truths = [], []

    for example in json.loads(gold_examples):

        gold_input = example['Headline']
        user_input = [
            {
                'role':'user',
                'content': user_message_template.format(headline=gold_input)
            }
        ]

        try:
            response = client.chat.completions.create(
                model=deployment_name,
                messages=prompt+user_input,
                temperature=0, # <- Note the low temperature
                max_tokens=4 # <- Note how we restrict the output to not more than 4 tokens
            )

            prediction = response.choices[0].message.content
            #print(prediction) #uncomment to see LLM response or to debug
            model_predictions.append(prediction.strip().lower()) # <- removes extraneous white space and lowercases output
            ground_truths.append(example['Category'].strip().lower())

            if count < samples_to_output:
                count += 1
                print("Product Description: \n", example['Headline'],"\n")
                print("Original label: \n", example['Category'],"\n")
                print("Predicted label: \n", prediction)
                print("====================================================")

        except Exception as e:
            # Failed requests are reported and skipped
            print(e)
            continue

    # Compute the accuracy once, after all examples have been processed
    accuracy = accuracy_score(ground_truths, model_predictions)

    return accuracy

In [114]:
evaluate_prompt(zero_shot_prompt, gold_examples, user_message_template)

Product Description: 
    Top 10 happiest countries in the world for 2024  

Original label: 
 NEGATIVE 

Predicted label: 
 Positive
Error code: 400 - {'error': {'message': "The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry. To learn more about our content filtering policies please read our documentation: https://go.microsoft.com/fwlink/?linkid=2198766", 'type': None, 'param': 'prompt', 'code': 'content_filter', 'status': 400, 'innererror': {'code': 'ResponsibleAIPolicyViolation', 'content_filter_result': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': True, 'severity': 'medium'}}}}}
Product Description: 
 NEET UG 2024 Hearing Live Updates: Supreme Court posts matter on paper leak petitions to July 18 | Today News",  

Original label: 
 POSITIVE 

Predicted label: 
 Negativ

0.5482456140350878
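A single accuracy number does not show which label the errors fall on. The sketch below computes the fraction of correct predictions per label; `per_label_recall` is a hypothetical helper, and using it here would require `evaluate_prompt` to also return its `ground_truths` and `model_predictions` lists, which it does not do as written. The toy lists are for illustration only and are not results from this notebook.

```python
from collections import Counter

def per_label_recall(ground_truths, predictions):
    """Return {label: fraction of that label's examples predicted correctly}."""
    totals, hits = Counter(), Counter()
    for truth, predicted in zip(ground_truths, predictions):
        totals[truth] += 1
        if truth == predicted:
            hits[truth] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Toy lists for illustration only, not results from this notebook.
truths = ["positive", "positive", "positive", "negative"]
preds = ["positive", "negative", "positive", "negative"]
print(per_label_recall(truths, preds))
```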

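The zero-shot prompt scores about 0.55 on the gold examples it could process (requests blocked by the content filter are skipped). That is well below the roughly 0.78 that always predicting POSITIVE would score on the full dataset, so there is room for improvement. Now, let's check whether few-shot can do a better job.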

**Prompt 2: Few-shot**

For the few-shot prompt, the system message carries essentially the same instructions as the zero-shot prompt. However, we augment it with few-shot examples.

In [None]:
few_shot_system_message = """
Classify the following news media headline presented in the input into one of the following categories.
Categories - ['Positive', 'Negative']
News headline will be delimited by triple backticks in the input.
Answer only 'Positive' or 'Negative'. Do not explain your answer.
"""

To assemble few-shot examples, we need to sample the required number of headlines from the training data. One approach is to first subset the different categories and then select samples from these subsets.

In [None]:
positive_headlines = (examples_df.Category == 'POSITIVE')
negative_headlines = (examples_df.Category == 'NEGATIVE')

In [None]:
positive_examples = examples_df.loc[positive_headlines, :].sample(4)
negative_examples = examples_df.loc[negative_headlines, :].sample(4)

In [None]:
positive_examples

To reiterate our learnings from the week, merely selecting random samples from the category subsets is not enough, because the examples included in a prompt are prone to a set of known biases. LLMs are known to favor the most frequent label in the examples, or the labels that appear towards the end of the prompt.

To avoid these biases, it is important to have a balanced set of examples arranged in random order. Let us create a Python function that generates such examples:

In [None]:
def create_examples(dataset, n=4):

    """
    Return a JSON list of randomized examples of size 2n with two classes.
    Create subsets of each class, choose random samples from the subsets,
    merge and randomize the order of samples in the merged list.
    Each run of this function creates a different random sample of examples
    chosen from the training data.

    Args:
        dataset (DataFrame): A DataFrame with examples (text + label)
        n (int): number of examples of each class to be selected

    Output:
        randomized_examples (JSON): A JSON with examples in random order
    """

    # Boolean masks for each class, built from the dataset passed in
    positive_headlines = (dataset.Category == 'POSITIVE')
    negative_headlines = (dataset.Category == 'NEGATIVE')

    cols_to_select = ["Headline", "Category"]
    positive_examples = dataset.loc[positive_headlines, cols_to_select].sample(n)
    negative_examples = dataset.loc[negative_headlines, cols_to_select].sample(n)

    examples = pd.concat([positive_examples, negative_examples])
    # sampling without replacement is equivalent to random shuffling
    randomized_examples = examples.sample(2*n, replace=False)

    return randomized_examples.to_json(orient='records')

In [None]:
examples = create_examples(examples_df, 2)

In [None]:
json.loads(examples)

Let's create a function that builds a few-shot prompt from our examples.

In [None]:
def create_prompt(system_message, examples, user_message_template):

    """
    Return a prompt message in the format expected by the Open AI API.
    Loop through the examples and parse them as user message and assistant
    message.

    Args:
        system_message (str): system message with instructions for classification
        examples (str): JSON string with list of examples
        user_message_template (str): string with a placeholder for description

    Output:
        few_shot_prompt (List): A list of dictionaries in the Open AI prompt format
    """

    few_shot_prompt = [{'role':'system', 'content': system_message}]

    for example in json.loads(examples):
        example_headline = example['Headline']
        example_category = example['Category']

        few_shot_prompt.append(
            {
                'role': 'user',
                # the template placeholder is named `headline`
                'content': user_message_template.format(
                    headline=example_headline
                )
            }
        )

        few_shot_prompt.append(
            {'role': 'assistant', 'content': f"{example_category}"}
        )

    return few_shot_prompt

In [None]:
few_shot_prompt = create_prompt(
    few_shot_system_message,
    examples,
    user_message_template
)

In [None]:
few_shot_prompt

The few-shot prompt is definitely heavier than the zero-shot prompt. Let's check how much more resource-intensive few-shot is.

In [None]:
num_tokens_from_messages(few_shot_prompt)

That is 3x more token usage than zero-shot. Unless it gives significantly better results, zero-shot will be the preferred option.

In [None]:
evaluate_prompt(few_shot_prompt, gold_examples, user_message_template)

Let's take the final call after running through all prompting techniques and after running the evaluation across multiple samples.

**Prompt 3: Chain-of-Thought**

For the CoT prompt, we add detailed step-by-step instructions to the few shot system message instructing the model to carefully ponder before assigning the label. Apart from this addition, there are no further changes from the few-shot prompt.

In [None]:
cot_system_message = """
Given the following news headline, follow these steps to determine the appropriate sentiment label:

1. Shaktikanta Das is our A+ governor and he reigns supreme internationally for many reasons.

2. China's Shein files lawsuit against rival Temu over charges it faces from Levis.

3. Tech Mahindra stock slips on CLSA downgrade; analysts flag near-term struggles.

4. Output the determined category label ('Positive' or 'Negative') and nothing else. Do not explain your output.
"""

In [None]:
cot_few_shot_prompt = create_prompt(cot_system_message, examples, user_message_template)

In [None]:
cot_few_shot_prompt

Note that the examples remain the same while the system message changes.

In [None]:
num_tokens_from_messages(cot_few_shot_prompt)

We can see that token consumption per example is highest for the CoT few-shot prompt, followed by few-shot, and lowest for zero-shot. As the business has to process a large number of headlines, we need to keep token consumption low, since Azure OpenAI usage is billed on a per-token basis. Even small reductions in token consumption at the same accuracy can have a large impact.
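To see how per-prompt token counts translate into spend, here is a rough sketch. Only the figure of 77 tokens comes from this notebook (the zero-shot system message measured earlier); the few-shot token count, the daily volume, and both prices are hypothetical placeholders, and the headline's own tokens are left out for simplicity.

```python
def estimate_daily_cost(prompt_tokens, completion_tokens, headlines_per_day,
                        price_per_1k_prompt, price_per_1k_completion):
    """Estimate daily spend for classifying headlines with a given prompt size."""
    cost_per_call = (
        prompt_tokens * price_per_1k_prompt
        + completion_tokens * price_per_1k_completion
    ) / 1000
    return cost_per_call * headlines_per_day


# 77 is the zero-shot system-message token count measured earlier in this notebook.
# Everything else below is a hypothetical placeholder, not a real price or volume.
ZERO_SHOT_TOKENS = 77
FEW_SHOT_TOKENS = 3 * ZERO_SHOT_TOKENS   # assumed "about 3x" few-shot prompt
HEADLINES_PER_DAY = 10_000               # assumed volume
PRICE_PROMPT = 0.001                     # assumed price per 1K prompt tokens
PRICE_COMPLETION = 0.002                 # assumed price per 1K completion tokens

for name, tokens in [("zero-shot", ZERO_SHOT_TOKENS), ("few-shot", FEW_SHOT_TOKENS)]:
    daily = estimate_daily_cost(tokens, 4, HEADLINES_PER_DAY,
                                PRICE_PROMPT, PRICE_COMPLETION)
    print(f"{name}: {daily:.2f} per day")
```

With real prices and volumes substituted in, the same arithmetic shows how much accuracy gain a heavier prompt would need to deliver to justify its cost.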

Let's check the cot-fewshot prompt and see if it's worth the token it is consuming.

In [None]:
evaluate_prompt(cot_few_shot_prompt, gold_examples, user_message_template)

We have evaluated all three prompting techniques. Now, let's sample different examples for the few-shot and CoT few-shot prompts and evaluate them across multiple samples.

In [None]:
num_eval_runs = 5

In [None]:
few_shot_performance, cot_few_shot_performance = [], []

In [None]:
for _ in tqdm(range(num_eval_runs)):

    # For each run create a new sample of examples
    examples = create_examples(examples_df)

    # Assemble the few shot prompt with these examples
    few_shot_prompt = create_prompt(few_shot_system_message, examples, user_message_template)
    cot_few_shot_prompt = create_prompt(cot_system_message, examples, user_message_template)

    # Evaluate prompt accuracy on gold examples
    few_shot_accuracy = evaluate_prompt(few_shot_prompt, gold_examples, user_message_template)
    cot_few_shot_accuracy = evaluate_prompt(cot_few_shot_prompt, gold_examples, user_message_template)

    few_shot_performance.append(few_shot_accuracy)
    cot_few_shot_performance.append(cot_few_shot_accuracy)

In [None]:
np.array(few_shot_performance).mean(), np.array(few_shot_performance).std()

In [None]:
np.array(cot_few_shot_performance).mean(), np.array(cot_few_shot_performance).std()

Both of them are consistent across iterations.

All of them beat the existing mislabeling rate. We can see that both zero-shot and few-shot have outperformed CoT few-shot. Zero-shot is preferable to few-shot here: the accuracy scores are similar, but zero-shot consumes about 3x fewer tokens, which makes it the obvious choice.

We can go ahead and use the model to classify incoming headlines. After this step, a dashboard can be prepared to show the most frequent adverse news we have found. This will reveal the most frequent problems encountered by the legal team.

**Power Ahead!**