## Compare different models on the same tasks

1. Use a fine-tuned classification model and compare its output with that of a general-purpose model that is prompted to do the same task.
2. Do the same for summarization.

In [None]:
%pip install datasets
%pip install openai
%pip install scikit-learn
%pip install transformers
%pip install torch

In [1]:
from datasets import load_dataset
from openai import AzureOpenAI
from sklearn.metrics import accuracy_score
from transformers import pipeline



### Load dataset

We load a dataset of IMDb movie reviews together with their sentiment labels. Each review is labelled with `0` or `1`, where `0` denotes negative sentiment and `1` positive sentiment.

We use a subset of the original IMDb dataset to limit the traffic to our test endpoint during this exercise.

In [51]:
# Load the IMDb subset from Hugging Face
dataset = load_dataset("Aappo/imdb_subset")
dataset = dataset.shuffle(seed=42)
dataset



DatasetDict({
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50
    })
})

## TODO

Check that you understand what the dataset contains. Hint: the dataset behaves like a dictionary of splits, so you can inspect its content with `dataset["test"]`, `dataset["test"][0]`, and so on.
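
One way to approach this check, as a minimal sketch: iterating over a split such as `dataset["test"]` yields one dictionary per row, so a small helper can count the labels and measure the review lengths. The `summarize_rows` helper and the two sample rows are hypothetical stand-ins so that the snippet runs without downloading anything; pass `dataset["test"]` instead to inspect the real data.

```python
from collections import Counter


def summarize_rows(rows):
    """Summarize an iterable of {"text": str, "label": int} rows."""
    rows = list(rows)
    label_counts = Counter(row["label"] for row in rows)
    lengths = [len(row["text"]) for row in rows]
    return {
        "num_rows": len(rows),
        "label_counts": dict(label_counts),
        "mean_chars": sum(lengths) / len(lengths) if lengths else 0.0,
    }


# Made-up rows that mimic the dataset schema (not real IMDb reviews).
sample_rows = [
    {"text": "Great film.", "label": 1},
    {"text": "Dull and far too long.", "label": 0},
]
summary = summarize_rows(sample_rows)
print(summary)
```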

## Try Azure OpenAI model deployment

Next, we'll try out a model deployed behind an Azure OpenAI API endpoint. The model is an instruction-fine-tuned GPT-3.5 Turbo model, and we call it here using the Python SDK.
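
The client created in the next cell needs an API key and an endpoint. As a minimal sketch of an alternative to pasting the key into the notebook, the settings could be read from environment variables. The names `AZURE_OPENAI_API_KEY` and `AZURE_OPENAI_ENDPOINT` follow the `openai` SDK's conventions; the `load_azure_settings` helper and the placeholder values are hypothetical.

```python
import os


def load_azure_settings(env=None):
    """Read Azure OpenAI settings from environment variables.

    Raises a KeyError naming the missing variables, so a misconfigured
    notebook fails early instead of on the first API call.
    """
    env = os.environ if env is None else env
    required = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "api_key": env["AZURE_OPENAI_API_KEY"],
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
    }


# Demonstration with a fake environment; the values are placeholders.
fake_env = {
    "AZURE_OPENAI_API_KEY": "not-a-real-key",
    "AZURE_OPENAI_ENDPOINT": "https://example-resource.openai.azure.com",
}
settings = load_azure_settings(fake_env)
print(sorted(settings))
```

The resulting dictionary can be unpacked into the constructor, e.g. `AzureOpenAI(api_version="2023-03-15-preview", **settings)`.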

In [10]:
api_key = "api-key"  # TODO: change this to the Azure OpenAI API key

deployment_name = "gpt-35-turbo-instruct"

# Pass only the resource's base URL: the SDK appends the deployment path
# and the api-version query parameter itself.
client = AzureOpenAI(
    api_key=api_key,
    api_version="2023-03-15-preview",
    azure_endpoint="https://learning-sprint-openai.openai.azure.com",
)

prompt = 'Write a tagline for an consulting company excelling in using LLMs. '
response = client.completions.create(model=deployment_name, prompt=prompt, max_tokens=20)

print(f"Prompt: {prompt}")
print(f"Response: {response.choices[0].text}")

Prompt: Write a tagline for an consulting company excelling in using LLMs. 
Response: 

"Achieving Excellence through Advanced LLM Solutions"


## Classification through prompting

Let's see how well the GPT-3.5 model is able to do simple text classification when prompted.

First, let's define two auxiliary functions. The first one generates the prompt for a given text, for example:

`"Task: Classify the text into negative or positive classes. Text: Some example text with positive sentiment"`

The second one encodes the model output as an integer that corresponds to the dataset labels, i.e. `0 = negative sentiment` and `1 = positive sentiment`. Responses that mention no class, or more than one, are encoded as `-1`.

In [52]:
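def prompt(text, classes=("negative", "positive")):
    """Build the classification prompt for `text` from the class names."""
    if len(classes) > 2:
        # Three or more classes are listed as "a, b or c".
        return "Task: Classify the text into " + ", ".join(classes[:-1]) + f" or {classes[-1]} classes. Text: {text}"
    elif len(classes) == 2:
        return f"Task: Classify the text into {classes[0]} or {classes[1]} classes. Text: {text} "
    # Fewer than two classes is not a classification task.
    raise ValueError("At least two classes are required.")

def encode_response(response, classes=("negative", "positive")):
    """Map the completion text to the index of the single class it mentions."""
    response_string = response.choices[0].text.lower()
    # One boolean per class: does the class name occur in the response?
    matches = [c in response_string for c in classes]
    if sum(matches) != 1:
        # No class or several classes mentioned: the answer is ambiguous.
        print(f"Invalid response: {response_string}")
        return -1
    return matches.index(True)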


## TODO 

Make sure you understand the above functions.
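
To see what the two helpers do without calling the API, here is a minimal sketch. It repeats compact versions of the functions (two-class case only, without the diagnostic print) so that it runs on its own, and it fakes the API response with `SimpleNamespace`. The `fake_response` helper and the sample completions are hypothetical.

```python
from types import SimpleNamespace


def prompt(text, classes=("negative", "positive")):
    # Two-class case of the notebook's prompt builder.
    return f"Task: Classify the text into {classes[0]} or {classes[1]} classes. Text: {text} "


def encode_response(response, classes=("negative", "positive")):
    # Same matching rule as in the notebook: exactly one class name must appear.
    response_string = response.choices[0].text.lower()
    matches = [c in response_string for c in classes]
    if sum(matches) != 1:
        return -1
    return matches.index(True)


def fake_response(text):
    """Mimic the `response.choices[0].text` shape of a completion."""
    return SimpleNamespace(choices=[SimpleNamespace(text=text)])


print(prompt("I loved it."))
print(encode_response(fake_response("\n\nPositive")))            # one match
print(encode_response(fake_response("Negative, not positive")))  # two matches
print(encode_response(fake_response("Neutral")))                 # no match
```

The three `encode_response` calls return `1`, `-1` and `-1`: matching is case-insensitive, and ambiguous or off-list answers are rejected.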

## Classification accuracy


In [56]:
predicted_labels = []
# Iterating over the split yields one {"text", "label"} dict per review.
for example in dataset["test"]:
    text = example["text"]
    true_label = example["label"]
    try:
        response = client.completions.create(model=deployment_name, prompt=prompt(text), max_tokens=5)
        encoded_response = encode_response(response)
        print("\ntext:", text)
        print("true_label", true_label)
        print("response", encoded_response)
    except Exception as e:
        # Failed requests are recorded as -1 and therefore count as errors.
        print("Error:", e)
        encoded_response = -1
    predicted_labels.append(encoded_response)

print("\nAccuracy:", accuracy_score(dataset["test"]["label"], predicted_labels))


text: After a very long time Marathi cinema has come with some good movie.This movie is one of the best Marathi movies ever made. It shows how a old grandfather tries to save his grandsons eye. He tries everything that is possible in his hands to save the child's eye. Doctor and a relative of his tries to help him in his attempt.<br /><br />The acting by the grandfather, the boy and the doctor are simply superb. They have shown true picture of a typical Marathi life. Every bit of action has some meaning in it. I would recommend to watch this movie, as initially I thought this one would be of documentary type but this was above my expectations.<br /><br />This film is really going to touch your hearts.I would expect more Marathi movies to come up with performances like this.
true_label 1
response 1

text: "In Love and War" is a simple feel-good TV-film, and should be viewed as such.<br /><br />(Possible spoiler)<br /><br />It is the story of a WWII British soldier, Newby, captured with
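
Because unparseable responses and failed requests are stored as `-1`, they can never match a true label and therefore count as errors in the accuracy computed above. A minimal sketch that reports them separately; the `summarize_predictions` helper and the label lists are made up for illustration.

```python
def summarize_predictions(true_labels, predicted_labels, invalid=-1):
    """Return overall accuracy, invalid count, and accuracy on valid answers."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Label lists must have the same length.")
    pairs = list(zip(true_labels, predicted_labels))
    valid = [(t, p) for t, p in pairs if p != invalid]
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "num_invalid": len(pairs) - len(valid),
        "accuracy_on_valid": (
            sum(t == p for t, p in valid) / len(valid) if valid else 0.0
        ),
    }


# Hypothetical labels: the third prediction could not be parsed.
true_labels = [1, 0, 1, 0]
predicted = [1, 0, -1, 1]
report = summarize_predictions(true_labels, predicted)
print(report)
```

A large gap between `accuracy` and `accuracy_on_valid` suggests that the output format, rather than the sentiment judgement, is the main source of errors.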

### Comparison with fine-tuned model

Let's compare this accuracy with that of a fine-tuned model. Hugging Face hosts several models that are fine-tuned to predict the sentiment of IMDb reviews. Select the model `lvwerra/distilbert-imdb` and calculate its accuracy.

## TODO

Check the model description on Hugging Face. What is the documented accuracy of the model?

### Use the transformers pipeline for predictions

In [57]:
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [58]:
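# Truncate each review to its first 512 characters (a rough, character-level way to keep inputs short).
predicted_labels_finetuned = pipe([text[:512] for text in dataset["test"]["text"]])
# Convert the pipeline's string labels to the dataset's integer labels.
predicted_labels_finetuned_encoded = [1 if item["label"] == "POSITIVE" else 0 for item in predicted_labels_finetuned]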

print("Accuracy: ", accuracy_score(dataset["test"]["label"], predicted_labels_finetuned_encoded))

Accuracy:  0.9
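
The list comprehension above maps every label other than `"POSITIVE"` to `0`, so an unexpected label name would silently become a negative prediction. A minimal sketch of a stricter mapping: the `encode_pipeline_output` helper is hypothetical, `NEGATIVE` is an assumption about the model's other label name, and the scores are invented.

```python
LABEL_TO_ID = {"NEGATIVE": 0, "POSITIVE": 1}


def encode_pipeline_output(outputs, label_to_id=LABEL_TO_ID):
    """Map text-classification results to integer labels, failing on unknown names."""
    encoded = []
    for item in outputs:
        label = item["label"].upper()
        if label not in label_to_id:
            raise ValueError(f"Unexpected label: {item['label']!r}")
        encoded.append(label_to_id[label])
    return encoded


# Invented outputs shaped like the pipeline's list of {"label", "score"} dicts.
sample_outputs = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
]
encoded = encode_pipeline_output(sample_outputs)
print(encoded)
```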


### TODO

1. Do the prompting approach and the fine-tuned approach have the same difficulties in determining the sentiment (are the same texts classified incorrectly)?
2. Can you improve the prompt so that the GPT-3.5 model's faulty classifications are corrected?
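
For the first question, one possible approach is to compare the positions that each model gets wrong. A minimal sketch, assuming the predictions are plain lists of integers like `predicted_labels` and `predicted_labels_finetuned_encoded`; the `error_indices` helper and the example lists are hypothetical.

```python
def error_indices(true_labels, predicted_labels):
    """Return the set of positions where the prediction differs from the truth."""
    return {
        i
        for i, (t, p) in enumerate(zip(true_labels, predicted_labels))
        if t != p
    }


# Made-up labels standing in for the notebook's real predictions.
true_labels = [1, 0, 1, 1, 0]
prompted = [1, 1, 1, 0, 0]
finetuned = [1, 0, 1, 0, 1]

prompt_errors = error_indices(true_labels, prompted)
finetuned_errors = error_indices(true_labels, finetuned)

print("Both wrong:", sorted(prompt_errors & finetuned_errors))
print("Only prompting wrong:", sorted(prompt_errors - finetuned_errors))
print("Only fine-tuned wrong:", sorted(finetuned_errors - prompt_errors))
```

Printing `dataset["test"]["text"][i]` for the indices in each group shows which reviews are hard for which model.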