## Compare different models on the same tasks

1. Use a fine-tuned classification model and compare its output with that of a general-purpose model steered by prompting.
2. Do the same for summarization.

In [None]:
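%pip install datasets
%pip install openai
%pip install scikit-learn
# transformers (with a PyTorch backend) and pandas are also used further down
%pip install transformers torch pandas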

In [1]:
from datasets import load_dataset
from openai import AzureOpenAI
from sklearn.metrics import accuracy_score
from transformers import pipeline



### Load dataset

We load a dataset of IMDb movie reviews together with their sentiment. Each review is labelled `0` or `1`, where `0` denotes negative sentiment and `1` positive sentiment.

In [2]:
# Load the IMDb dataset from Hugging Face
dataset = load_dataset("stanfordnlp/imdb")
dataset



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

## TODO

Check that you understand what the dataset contains. Hint: `dataset` is a dictionary-like `DatasetDict`, so inspect its content with `dataset["train"]` etc.
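
One possible way to work through this check is to look at how balanced the labels of a split are. The sketch below counts label values with `collections.Counter`. The helper name `label_distribution` and the `sample_labels` list are illustrative assumptions, not part of the notebook; in the notebook you would pass `dataset["train"]["label"]` instead.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label value in an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Stand-in labels for illustration only; in the notebook, pass
# dataset["train"]["label"] instead.
sample_labels = [0, 1, 1, 0, 1, 0, 0, 1]
print(label_distribution(sample_labels))
```

Running the same helper on each split is a quick way to notice whether a split carries usable labels at all.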

## Try Azure OpenAI model deployment

Next we'll try out a model deployed behind an Azure OpenAI API endpoint. The model is an instruction-tuned GPT-3.5 Turbo model (`gpt-35-turbo-instruct`), which we call here using the OpenAI Python SDK.
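
Pasting the API key directly into a notebook makes it easy to commit the key by accident. A minimal sketch of an alternative, assuming the key has been exported as the environment variable `AZURE_OPENAI_API_KEY` beforehand; the helper name `read_api_key` is hypothetical.

```python
import os


def read_api_key(env_var="AZURE_OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage in the notebook (assumes the variable is set):
# api_key = read_api_key()
```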

In [9]:
api_key = "api-key"  # TODO: change this to the Azure OpenAI API key

deployment_name = "gpt-35-turbo-instruct"

# azure_endpoint is the base URL of the Azure OpenAI resource; the SDK appends
# the deployment path and the api-version query parameter itself.
client = AzureOpenAI(
    api_key=api_key,
    api_version="2023-03-15-preview",
    azure_endpoint="https://learning-sprint-openai.openai.azure.com",
)


prompt = 'Write a tagline for an consulting company excelling in using LLMs. '
response = client.completions.create(model=deployment_name, prompt=prompt, max_tokens=20)

print(f"Prompt: {prompt}")
print(f"Response: {response.choices[0].text}")

Prompt: Write a tagline for an consulting company excelling in using LLMs. 
Response: 

"Transforming businesses through the power of LLMs."


## Classification through prompting

Let's see how well the GPT-3.5 model is able to do simple text classification when prompted.

First let's define two auxiliary functions. The first one generates the prompt for a given text:

`"Task: Classify the text into negative or positive classes. Text: Some example text with positive sentiment"`

The second one encodes the model output into integers that correspond to the dataset labels, i.e. `0 = negative sentiment` and `1 = positive sentiment`.

In [65]:
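def prompt(text, classes=("negative", "positive")):
    """Build a classification prompt that lists the candidate classes."""
    if len(classes) > 2:
        # Comma-separate all but the last class, then join the last with "or".
        return "Task: Classify the text into " + ", ".join(classes[:-1]) + f" or {classes[-1]} classes. Text: {text}"
    elif len(classes) == 2:
        return f"Task: Classify the text into {classes[0]} or {classes[1]} classes. Text: {text} "
    raise ValueError("At least two classes are required.")


def encode_response(response, classes=("negative", "positive")):
    """Map the completion text to a class index, or -1 if it is ambiguous."""
    response_string = response.choices[0].text.lower()
    matches = [c in response_string for c in classes]
    # Exactly one class name must appear in the response.
    if sum(matches) != 1:
        print(f"Invalid response: {response_string}")
        return -1
    return matches.index(True)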


## TODO 

Make sure you understand the above functions.
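
One way to probe the response encoding without calling the API is to feed it stand-in objects shaped like a completion response. The sketch below repeats the matching logic of `encode_response` (without the diagnostic print) so that it runs on its own; `fake_response` is a hypothetical helper that only mimics the `response.choices[0].text` attribute path.

```python
from types import SimpleNamespace


def fake_response(text):
    """Hypothetical stand-in mimicking the response.choices[0].text shape."""
    return SimpleNamespace(choices=[SimpleNamespace(text=text)])


def encode_response(response, classes=("negative", "positive")):
    # Same matching logic as the notebook cell, repeated so this runs alone.
    response_string = response.choices[0].text.lower()
    matches = [c in response_string for c in classes]
    if sum(matches) != 1:
        return -1
    return matches.index(True)


print(encode_response(fake_response(" Positive")))               # one class name matches
print(encode_response(fake_response("Neither")))                 # no class name matches
print(encode_response(fake_response("Not positive, negative")))  # both class names match
```

The last case shows a limitation of substring matching: a response that mentions both class names is treated as invalid even if its meaning is clear.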

## Classification accuracy
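
Accuracy is the fraction of reviews whose predicted label equals the true label. A response that could not be parsed is stored as `-1`, so it never matches a true label and always counts as an error. The sketch below, in plain Python, reports the share of invalid responses next to the accuracy; the helper name `accuracy_and_invalid_rate` and the toy labels are illustrative assumptions.

```python
def accuracy_and_invalid_rate(true_labels, predicted_labels, invalid=-1):
    """Return (accuracy, share of invalid predictions) for two equal-length lists."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Label lists must have the same length.")
    if not true_labels:
        raise ValueError("Label lists must not be empty.")
    n = len(true_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    n_invalid = sum(p == invalid for p in predicted_labels)
    return correct / n, n_invalid / n


# Toy labels for illustration only: one invalid response, one wrong prediction.
true = [0, 1, 1, 0]
pred = [0, 1, -1, 1]
print(accuracy_and_invalid_rate(true, pred))
```

Tracking the invalid share separately helps tell prompt-format problems apart from genuine misclassifications.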


In [68]:
n_test_samples = 50
# Take the first n_test_samples reviews of the test split once, instead of
# re-reading the full column on every iteration.
test_subset = dataset["test"].select(range(n_test_samples))

predicted_labels = []
for sample in test_subset:
    try:
        response = client.completions.create(
            model=deployment_name, prompt=prompt(sample["text"]), max_tokens=5
        )
        encoded_response = encode_response(response)
        print("\ntext:", sample["text"])
        print("true_label", sample["label"])
        print("response", encoded_response)
    except Exception as e:
        # Record failed API calls as invalid predictions (-1).
        print("Error:", e)
        encoded_response = -1
    predicted_labels.append(encoded_response)

print("\nAccuracy:", accuracy_score(list(test_subset["label"]), predicted_labels))


text: Utterly ridiculous movie which makes fun of the college admission process. While it is true that the SAT's is not everything in evaluating a student for admission to college, what the movie talks about is utterly ridiculous and not worth repeating nor viewing.<br /><br />College admissions officials are made to look like stupid people who have an extremely narrow view of the entire process. The film is an insult to hard-working high school students who work hard and then have to suffer through a long process until they receive that letter of acceptance or rejection from the schools they have applied for. <br /><br />This movie certainly deserves rejection on all levels.
true_label 0
response 0

text: Hi guys, this is my first review and I would had to have picked the worst movie to review. As I only watched 5 minutes of it but trust me you could see this movie was going nowhere. The acting was deplorable, the camera work and lighting looked as though it was shot and run by a pac

### Comparison with fine-tuned model

Let's compare this accuracy with that of a fine-tuned model. Hugging Face hosts several models that are fine-tuned to predict the sentiment of IMDb reviews. Select the model `lvwerra/distilbert-imdb` and calculate its accuracy.

## TODO

Check the model description on Hugging Face. What is the documented accuracy of the model?

### Use the transformers pipeline for predictions

In [69]:
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")



[{'label': 'NEGATIVE', 'score': 0.987957775592804}]
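
The pipeline returns a list of dictionaries with a `label` string and a `score`, as in the output above, so the labels have to be converted to integers before they can be compared with the dataset labels. A minimal sketch of a strict conversion follows, assuming the label names are `NEGATIVE` and `POSITIVE`; the names `LABEL_TO_ID` and `encode_pipeline_output` are hypothetical.

```python
LABEL_TO_ID = {"NEGATIVE": 0, "POSITIVE": 1}


def encode_pipeline_output(items, label_to_id=LABEL_TO_ID):
    """Convert pipeline result dicts into integer labels.

    Raises KeyError on an unexpected label name instead of silently
    mapping it to a class.
    """
    return [label_to_id[item["label"]] for item in items]


# Stand-in output with the same shape as the pipeline result shown above.
example_output = [{"label": "NEGATIVE", "score": 0.987957775592804}]
print(encode_pipeline_output(example_output))
```

Failing on unknown names is a design choice: a model with a different label scheme then surfaces as an error rather than as a misleading accuracy.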

In [80]:
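# Keep only the first 512 characters of each review: a crude way to stay
# within the model's 512-token input limit.
test_texts = [text[:512] for text in dataset["test"]["text"][:n_test_samples]]
predicted_labels_finetuned = pipe(test_texts)
# Map the pipeline's string labels to the dataset's integer labels.
predicted_labels_finetuned_encoded = [1 if item["label"] == "POSITIVE" else 0 for item in predicted_labels_finetuned]

print("Accuracy: ", accuracy_score(dataset["test"]["label"][:n_test_samples], predicted_labels_finetuned_encoded))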

Accuracy:  0.86


### TODO

Do the prompting approach and the fine-tuned approach have the same difficulties in determining the sentiment (are the same texts classified incorrectly)?

In [84]:
import pandas as pd

test_head = dataset["test"][:n_test_samples]  # dict of columns for the first n_test_samples rows
results_df = pd.DataFrame({"true_label": test_head["label"], "predicted_label_finetuned": predicted_labels_finetuned_encoded,
                           "predicted_label_gpt35": predicted_labels, "text": test_head["text"]})
# Rows where the two models disagree
results_df.loc[lambda df: df["predicted_label_finetuned"] != df["predicted_label_gpt35"]]

| | true_label | predicted_label_finetuned | predicted_label_gpt35 | text |
|---|---|---|---|---|
| 2 | 0 | 1 | 0 | Eight Legged Freaks is a modern monster movie,... |
| 8 | 1 | 0 | 1 | When one thinks of Soviet cinema, the propagan... |
| 15 | 1 | 1 | 0 | A film to divide its viewers. Just criticism p... |
| 16 | 1 | 1 | 0 | To most of us, life is an unfolding process of... |
| 22 | 1 | 1 | 0 | Okay, this wasn't the greatest horror movie I'... |
| 27 | 0 | 1 | 0 | If you have few expectations, then this will e... |
| 33 | 0 | 0 | 1 | Read the book, forget the movie! |
| 34 | 1 | 0 | 1 | More of a near miss than a flop, MR. IMPERIUM ... |
| 46 | 1 | 0 | 1 | From the start you will like Sam Elliott's cha... |


In [85]:
results_df.loc[lambda df: (df["predicted_label_finetuned"] == df["predicted_label_gpt35"]) & (df["predicted_label_finetuned"] != df["true_label"])]

| | true_label | predicted_label_finetuned | predicted_label_gpt35 | text |
|---|---|---|---|---|
| 32 | 1 | 0 | 0 | here in Germany it was only shown on TV one ti... |
| 37 | 1 | 0 | 0 | There is something that one of the characters ... |