# Tokenizers and models

Let's begin by testing how to use tokenizers and models from Hugging Face.

In [None]:
%pip install transformers
%pip install datasets
%pip install openai
%pip install scikit-learn
%pip install numpy
%pip install sentence_transformers
%pip install torch

In [None]:

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    pipeline
)
from typing import List
from datasets import load_dataset
from openai import AzureOpenAI
from sklearn.metrics import accuracy_score
import os
from sklearn.neighbors import NearestNeighbors
import numpy as np
from sentence_transformers import SentenceTransformer

# Let's test text generation with different models

### Load the GPT-2 model and tokenizer from Hugging Face

In [None]:
# Load the gpt-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load the gpt-2 model with the text generation head
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

### Try out the loaded tokenizer

In [None]:
# Encoding can be done with the encode method
input_text = "The most important thing in life is"
print("Input text was: ", input_text, "\n")

encoded_input = tokenizer.encode(input_text)
print("Encoded input:", encoded_input, "\n")

# Decoding can be done with the decode method
# When decoding the encoded input, the tokenizer should return the original text.
decoded_input = tokenizer.decode(encoded_input)
print("Decoding the tokens back to original input: ", decoded_input)

### Try out the loaded GPT-2 model

In [None]:
# Inference can be done by calling the .generate method of the model
model_output = gpt2_model.generate(**tokenizer(input_text, return_tensors="pt"), max_new_tokens=10)

print("Model output is just tokens:")
print(model_output[0])

print("\nModel output needs to be decoded with the tokenizer to get meaningful words:")
print(tokenizer.decode(model_output[0]))


### TODO
The above output from the GPT-2 model was somewhat reasonable. What happens if you increase `max_new_tokens`?

Try it out!
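
To build intuition for what `max_new_tokens` controls, the sketch below mimics greedy autoregressive generation: one token is appended per step, each chosen from what has been generated so far. It is not how GPT-2 is implemented. The lookup table `NEXT_TOKEN` and the function `greedy_generate` are hypothetical stand-ins for the model.

```python
# Hypothetical stand-in for a language model: maps the last token to the
# single "most likely" next token. A real model scores the whole vocabulary.
NEXT_TOKEN = {
    "life": "is",
    "is": "to",
    "to": "be",
    "be": "happy",
    "happy": "and",
    "and": "to",
}


def greedy_generate(tokens, max_new_tokens):
    """Append up to max_new_tokens tokens, one at a time."""
    output = list(tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(output[-1])
        if next_token is None:  # no known continuation: stop early
            break
        output.append(next_token)
    return output


print(" ".join(greedy_generate(["life"], max_new_tokens=4)))
print(" ".join(greedy_generate(["life"], max_new_tokens=10)))
```

With the larger budget the toy output starts to loop. Greedy decoding with a real model can fall into similar repetition, which is worth looking for when you raise `max_new_tokens`.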

### Try out a model trained for classification

The previous GPT-2 model was trained for the causal language modelling task, i.e. to predict the text continuation. Let's try out a model trained for a classification task.

The `lvwerra/distilbert-imdb` model is trained to classify text based on its sentiment (positive or negative). It is fine-tuned on IMDb movie review data.

In [None]:
# Load the classification model tokenizer
classification_tokenizer = AutoTokenizer.from_pretrained("lvwerra/distilbert-imdb")

# Load the classification model with the sequence classification head
classification_model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")

### Try out the classification model

Notice that the model is now invoked by calling it directly rather than through the `.generate` method, and that the `max_new_tokens` parameter does not exist here.

In [None]:
input_text = "I love this movie. It was great!"
model_output = classification_model(**classification_tokenizer(input_text, return_tensors="pt"))
print("Model output (logits for negative and positive sentiment):")
print(model_output[0])


### TODO

1. Make sure you understand the model output.
2. Try out the `lvwerra/distilbert-imdb` model some more and test it with other inputs. Can you find examples for which it outputs the wrong sentiment?
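
As a hint for reading the output: the classification model returns raw scores (logits), one per class, and a softmax turns them into probabilities. The sketch below uses made-up logits and assumes the label order is negative first, positive second. With the real model you can check the order in `classification_model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one review (not real model output)
example_logits = [-2.0, 2.0]
labels = ["NEGATIVE", "POSITIVE"]  # assumed label order

probabilities = softmax(example_logits)
predicted = labels[probabilities.index(max(probabilities))]

for label, probability in zip(labels, probabilities):
    print(f"{label}: {probability:.3f}")
print("Predicted:", predicted)
```

The class with the largest logit always gets the largest probability, so the softmax changes how confident the prediction looks, not which class is predicted.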

### HuggingFace pipeline

Hugging Face also has a convenient `pipeline` abstraction for model inference. It offers a simple API for running models without having to load, for instance, the tokenizer separately.


In [None]:
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

input_text = "I love this movie. It was great!"
pipe(input_text)

### Let's test some more advanced models through Azure APIs

It's easy to deploy models to the cloud using any of the LLM API providers. Let's test how to run models deployed with Azure AI services.

In [None]:
# TODO: Insert the provided API key here
api_key_gpt4o = os.getenv("AZURE_GPT4O_KEY")

GPT-4o mini is built specifically for chat, so the deployed model has a "chat/completions" endpoint. Notice that the input also has a predefined structure: a list of messages, each of which has "role" and "content" fields.
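
The message structure can be illustrated without calling the API. The helper below is a hypothetical convenience function (`build_messages` is not part of the `openai` package). It assembles a system message, any earlier turns, and the new user turn into the list the chat endpoint expects. The assistant reply in `history` is invented for the example.

```python
def build_messages(user_prompt, system_prompt="You are a helpful assistant.", history=None):
    """Assemble a chat message list: system message, earlier turns, new user turn."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history or [])
    messages.append({"role": "user", "content": user_prompt})
    return messages


# Earlier turns of a conversation (the assistant reply is made up)
history = [
    {"role": "user", "content": "Give me four basic ingredients for crepes."},
    {"role": "assistant", "content": "Flour, eggs, milk, butter."},
]

messages = build_messages("Which of those can I leave out?", history=history)
for message in messages:
    print(f'{message["role"]}: {message["content"]}')
```

Each request carries the whole conversation so far, so a follow-up question only makes sense to the model if the earlier turns are included in the list.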

In [None]:
deployment_name="gpt-4o-mini"
api_version="2024-08-01-preview"
task = "chat/completions"
endpoint = "https://aiservices-forge-test-westeu.openai.azure.com/"

client = AzureOpenAI(
    api_key=api_key_gpt4o,  
    api_version=api_version,
    azure_endpoint = endpoint
    )
input = "The best way to learn how to build RAG applications is to "

messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me four basic ingredients for crepes. Answer only with a list of ingredients."},
]
chat_completion = client.chat.completions.create(
    model=deployment_name,
    messages=messages
)
chat_completion.choices[0].message.content


In [None]:
# TODO: Insert the provided API key here
api_key_gpt35 = os.getenv("AZURE_GPT35_KEY")

The GPT-3.5 model is trained for causal language modelling (text continuation), and the deployed model has a "completions" endpoint for that purpose.

In [None]:
api_version="2024-02-01"
endpoint = "https://aiservices-forge-test-swe.openai.azure.com/"


client = AzureOpenAI(
    api_key=api_key_gpt35,  
    api_version=api_version,
    azure_endpoint = endpoint
    )

input = "Basic ingredients for crepes are: "
response = client.completions.create(model="gpt-35-turbo-instruct", prompt=input, max_tokens=50)

print(f"Input: {input}")
print(f"Response: {response.choices[0].text}")

You can also deploy models for text embeddings. Let's try one out.

In [None]:
# TODO: Insert the provided API key here
api_key_embedding = os.getenv("AZURE_EMBEDDINS_KEY")

In [None]:
#TODO: deploy this

deployment_name="text-embedding-3-large"
api_version="2023-05-15"
endpoint = "https://aiservices-forge-test-swe.openai.azure.com/"

client = AzureOpenAI(
    api_key=api_key_embedding,  
    api_version=api_version,
    azure_endpoint = endpoint
    )
    
input = "Some text to generate embeddings for."
response = client.embeddings.create(model=deployment_name, input=input)

print(f"Input: {input}")
print(f"Response: {response.data[0].embedding}")
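
An embedding is just a list of floats, and a common way to use two of them is to compare them with cosine similarity. The sketch below uses tiny made-up vectors in place of real embeddings, which have far more dimensions. The variable names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" (not real model output)
crepe_recipe = [0.9, 0.1, 0.0]
pancake_recipe = [0.8, 0.2, 0.1]
tax_form = [0.0, 0.1, 0.9]

print("crepe vs pancake:", round(cosine_similarity(crepe_recipe, pancake_recipe), 3))
print("crepe vs tax form:", round(cosine_similarity(crepe_recipe, tax_form), 3))
```

With real embeddings, the same function can be applied to `response.data[0].embedding` from two different inputs.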

Suggestions for things to try out later on:
1. Search Hugging Face for some models that look interesting and try them out. You can also use the Hugging Face portal's "Inference API" directly if you want.

## Compare how a fine-tuned model and a general-purpose model can be used for the same task

We can classify text into predefined classes by
1. Using a general-purpose model and prompting it to do the specific task (classification). For instance, we can use the GPT-3.5 and GPT-4o mini models from above.
2. Using a purpose-built model that is trained to do the specific task. One purpose-built model for classification is `lvwerra/distilbert-imdb`, which we already tried out above.


## Test these two approaches with some "dummy" IMDb test data

We load a dataset containing IMDb reviews together with their sentiment. Each review is labelled with `0` or `1`, where `0` denotes negative review sentiment and `1` positive review sentiment.

We use a subset of the original IMDb dataset to limit the traffic to our test endpoint during this exercise.

In [None]:
# Load Imdb dataset from Hugging Face
dataset = load_dataset("Aappo/imdb_subset")
dataset = dataset.shuffle(seed=42)
dataset

## TODO

Check that you understand what the dataset contains. Hint: `dataset` behaves like a dictionary, so check its content with `dataset["test"]` etc.

## Classification through prompting

Let's see how well the GPT-3.5 model is able to do simple text classification when prompted.

First, let's define two auxiliary functions. The first one generates the prompt, which is simply:

```
Task: Classify the text into the classes: negative, positive or neutral.

Text: Some text that I want to classify
```

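The second one encodes the model output into integers that correspond to the dataset labels, i.e. `0 = negative sentiment` and `1 = positive sentiment`. A `neutral` answer is encoded as `2` and an answer that cannot be parsed as `-1`. Neither matches a dataset label, so both count as errors.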

In [None]:
SENTIMENT_CLASSES = ["negative","positive", "neutral"]

def prompt(text):
    return "Task: Classify the text into the classes: " + ", ".join(SENTIMENT_CLASSES[:-1]) + f" or {SENTIMENT_CLASSES[-1]}. Text: {text}"

def encode_response(response):
    response_string = response.choices[0].text.lower()
    matches = [c in response_string for c in SENTIMENT_CLASSES]
    if sum(matches) != 1:
        print(f"Invalid response: {response_string}")
        return -1
    return matches.index(True)


In [None]:
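deployment_name = "gpt-35-turbo-instruct"

# `client` was last created for the embeddings deployment, so recreate it
# with the GPT-3.5 key and API version used in the completions cell above.
client = AzureOpenAI(
    api_key=api_key_gpt35,
    api_version="2024-02-01",
    azure_endpoint="https://aiservices-forge-test-swe.openai.azure.com/",
)

test_data = dataset["test"]
predicted_labels = []
for example in test_data:
    text = example["text"]
    true_label = example["label"]
    try:
        response = client.completions.create(model=deployment_name, prompt=prompt(text), max_tokens=5)
        encoded_response = encode_response(response)
        print("\ntext:", text)
        print("true_label", true_label)
        print("response", encoded_response)
    except Exception as e:
        # Count failed requests as invalid predictions
        print("Error:", e)
        encoded_response = -1
    predicted_labels.append(encoded_response)

print("\nAccuracy:", accuracy_score(test_data["label"], predicted_labels))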

## Classification by using a purpose-built model

Let's compare this accuracy with that of a fine-tuned model. Hugging Face hosts several models that are fine-tuned to predict the sentiment of IMDb reviews. Select the model `lvwerra/distilbert-imdb` and calculate its accuracy.

## TODO

Check the model description from HuggingFace. What is the documented accuracy of the model?

In [None]:
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

In [None]:
# Keep only the first 512 characters of each review. This cuts characters,
# not tokens, which in practice keeps the input within the model's 512-token limit.
n_max_chars = 512
texts = [text[:n_max_chars] for text in dataset["test"]["text"]]
predicted_labels_finetuned = pipe(texts)
predicted_labels_finetuned_encoded = [1 if item["label"] == "POSITIVE" else 0 for item in predicted_labels_finetuned]

print("Accuracy: ", accuracy_score(dataset["test"]["label"], predicted_labels_finetuned_encoded))

### TODO

1. Do the prompting approach and the fine-tuned approach have the same difficulties in determining the sentiment (are the same texts classified incorrectly)?
2. Can you improve the prompt so that the faulty classifications by the GPT-3.5 model are corrected?
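
One possible way to approach the first question is to collect the indices each approach gets wrong and compare them as sets. The sketch below uses short made-up label lists. In the notebook you would pass `dataset["test"]["label"]`, `predicted_labels` and `predicted_labels_finetuned_encoded` instead. The helper `misclassified` is a hypothetical name.

```python
def misclassified(true_labels, predicted):
    """Return the set of indices where the prediction differs from the true label."""
    return {i for i, (t, p) in enumerate(zip(true_labels, predicted)) if t != p}


# Hypothetical labels for six reviews; -1 marks an invalid prompted response
true_labels = [1, 0, 1, 1, 0, 0]
prompted = [1, 0, 0, -1, 0, 1]
finetuned = [1, 1, 0, 1, 0, 0]

wrong_prompted = misclassified(true_labels, prompted)
wrong_finetuned = misclassified(true_labels, finetuned)

print("Both wrong:", sorted(wrong_prompted & wrong_finetuned))
print("Only prompting wrong:", sorted(wrong_prompted - wrong_finetuned))
print("Only fine-tuned wrong:", sorted(wrong_finetuned - wrong_prompted))
```

Reading the reviews at the "both wrong" indices is a quick way to see whether those texts are genuinely ambiguous or whether each approach has its own blind spots.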