# Tokenizers and models

Let's begin by testing how to use tokenizers and models from HuggingFace.

In [1]:
%pip install transformers
%pip install datasets
%pip install openai
%pip install scikit-learn
%pip install numpy
%pip install sentence_transformers

In [1]:

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
    pipeline
)
from typing import List
from datasets import load_dataset
from openai import AzureOpenAI
from sklearn.metrics import accuracy_score
import os
from sklearn.neighbors import NearestNeighbors
import numpy as np
from sentence_transformers import SentenceTransformer

# Let's test text generation with different models

### Load GPT-2 model and tokenizer from Huggingface

In [2]:
# Load the gpt-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load the gpt-2 model with the text generation head
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

### Try out the loaded tokenizer

In [3]:
# Encoding can be done with the encode method
input_text = "The most important thing in life is"
print("Input text was: ", input_text, "\n")

encoded_input = tokenizer.encode(input_text)
print("Encoded input:", encoded_input, "\n")

# Decoding can be done with the decode method
# When decoding the encoded input, the tokenizer should return the original text.
decoded_input = tokenizer.decode(encoded_input)
print("Decoding the tokens back to original input: ", decoded_input)

Input text was:  The most important thing in life is 

Encoded input: [464, 749, 1593, 1517, 287, 1204, 318] 

Decoding the tokens back to original input:  The most important thing in life is


### Try out the loaded GPT-2 model

In [4]:
# Inference can be done by calling the .generate method of the model
model_output = gpt2_model.generate(**tokenizer(input_text, return_tensors="pt"), max_new_tokens=10)

print("Model output is just tokens:")
print(model_output[0])

print("\nModel output needs to be decoded with the tokenizer to get meaningful words:")
print(tokenizer.decode(model_output[0]))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Model output is just tokens:
tensor([ 464,  749, 1593, 1517,  287, 1204,  318,  284,  307, 1498,  284,  466,
        1223,  326,  345, 1842,   13])

Model output needs to be decoded with the tokenizer to get meaningful words:
The most important thing in life is to be able to do something that you love.


### TODO
The output above was fairly reasonable for the GPT-2 model. What happens if you increase `max_new_tokens`?

Try it out!
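
To build some intuition for what `max_new_tokens` controls, here is a minimal toy sketch of a greedy generation loop. It is not how GPT-2 is implemented: the `NEXT_TOKEN` lookup table and the `greedy_generate` helper are made up for illustration and stand in for the real model, which predicts the next token from the whole preceding context.

```python
# Toy stand-in for a causal language model: maps the last token to the
# "most likely" next token. The table is invented purely for illustration.
NEXT_TOKEN = {
    "life": "is",
    "is": "to",
    "to": "be",
    "be": "happy",
    "happy": "<eos>",
}


def greedy_generate(prompt_tokens, max_new_tokens):
    """Append up to max_new_tokens tokens, stopping early at <eos>."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break  # the model signalled the end of the text
        tokens.append(next_token)
    return tokens


short = greedy_generate(["life"], max_new_tokens=2)
long = greedy_generate(["life"], max_new_tokens=10)
print(short)
print(long)
```

A small budget cuts the continuation off mid-sentence, while a larger budget lets generation run until an end-of-sequence token appears or the budget is used up.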

### Try out a model trained for classification

The previous GPT-2 model was trained for the causal language modelling task, i.e. to predict how a text continues. Let's try out a model trained for a classification task.

ProsusAI/finbert model description:

"FinBERT is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the BERT language model in the finance domain, using a large financial corpus and thereby fine-tuning it for financial sentiment classification."


In [5]:
# Load the finbert tokenizer
finbert_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

# Load the finbert model with the sequence classification head
finbert_model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



### Try out the classification model

Notice that the model is now invoked by calling it directly rather than through the `.generate` method, and that there is no `max_new_tokens` parameter. The output contains one raw score (logit) per class rather than generated tokens.

In [6]:
input_text = "Top private equity firms put brakes on China dealmaking"
model_output = finbert_model(**finbert_tokenizer(input_text, return_tensors="pt"))
print("Model output (for positive, negative or neutral sentiment):")
print(model_output[0])


Model output (for positive, negative or neutral sentiment):
tensor([[-1.7899,  2.5756,  0.2115]], grad_fn=<AddmmBackward0>)


### TODO

1. Make sure you understand the model output.
2. Try out the finbert model some more and test it with other inputs. Can you find examples for which it outputs a faulty classification (sentiment)?
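
One way to read the model output above is to treat the three numbers as logits and turn them into probabilities with a softmax. The sketch below uses only the standard library and the logits printed above. The label order is an assumption taken from the print statement (positive, negative, neutral); with the real model it is safer to check `finbert_model.config.id2label`.

```python
import math

# Logits copied from the FinBERT output above.
logits = [-1.7899, 2.5756, 0.2115]
# Assumed label order, taken from the print statement above.
labels = ["positive", "negative", "neutral"]


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(values)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
predicted = labels[probs.index(max(probs))]

for label, p in zip(labels, probs):
    print(f"{label}: {p:.4f}")
print("Predicted sentiment:", predicted)
```

Under these assumptions the highest probability, roughly 0.90, goes to the negative class.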

### Let's test some more advanced models through Azure APIs

It's easy to deploy models to the cloud with any of the LLM API providers. Let's see how to run models deployed with Azure AI services.

In [2]:
# TODO: Set the provided API key in the AZURE_LLM_KEY environment variable
api_key = os.getenv("AZURE_LLM_KEY")
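
`os.getenv` silently returns `None` when the variable is missing, which tends to surface later as a confusing authentication error. A small guard like the one below can fail early instead. This is only a sketch: `require_env` is a hypothetical helper, and `DEMO_LLM_KEY` is a throwaway variable name used so the snippet runs anywhere.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty.")
    return value


# Throwaway variable for demonstration; in this notebook the name
# would be AZURE_LLM_KEY.
os.environ["DEMO_LLM_KEY"] = "dummy-value"
print(require_env("DEMO_LLM_KEY"))
```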

GPT-4o mini is built specifically for chat, so the deployed model has a "chat/completions" endpoint. Notice also that the input has a predefined structure: a list of messages, each of which has "role" and "content" fields.

In [4]:
deployment_name="dep-forge-ragdemo-dev-9wgh3p0b-chat"
api_version="2025-01-01-preview"
task = "chat/completions"
endpoint = "https://swedencentral.api.cognitive.microsoft.com/"

client = AzureOpenAI(
    api_key=api_key,
    api_version=api_version,
    azure_endpoint=endpoint,
)
input = "The best way to learn how to build RAG applications is to "

messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me four basic ingredients for crepes. Answer only with a list of ingredients."},
]
chat_completion = client.chat.completions.create(
    model=deployment_name,
    messages=messages
)
chat_completion.choices[0].message.content


'1. Flour  \n2. Milk  \n3. Eggs  \n4. Butter  '

You can also deploy models for text embeddings. Let's try one out.

In [6]:
#TODO: deploy this

deployment_name="text-embedding-3-large"
api_version="2023-05-15"
endpoint = "https://swedencentral.api.cognitive.microsoft.com/"

client = AzureOpenAI(
    api_key=api_key,
    api_version=api_version,
    azure_endpoint=endpoint,
)
    
input = "Some text to generate embeddings for."
response = client.embeddings.create(model=deployment_name, input=input)

print(f"Input: {input}")
print(f"Response: {response.data[0].embedding}")

Input: Some text to generate embeddings for.
Response: [-0.0031519695185124874, 0.006718718912452459, -0.01631944254040718, -0.016004782170057297, 0.028748536482453346, 0.004923723172396421, -0.015961874276399612, 0.0401335284113884, -0.00299642700701952, -0.01780693046748638, 0.015876058489084244, 0.02264126017689705, 0.004022649955004454, 0.002672827336937189, 0.016519682481884956, -0.011334933340549469, -0.010433859191834927, 0.019337324425578117, 0.009489878080785275, -0.030779527500271797, -0.0065828426741063595, 0.0012032192898914218, -0.055265843868255615, -0.006000005640089512, 0.020510150119662285, -0.012300369329750538, -0.01661980152130127, 0.01114184595644474, 0.0016269383486360312, 0.058927349746227264, -0.005785464309155941, -0.017692508175969124, 0.011928497813642025, -0.03961862996220589, 0.017821231856942177, 0.011399295181035995, 0.028290849179029465, 0.036071546375751495, 0.0016099538188427687, -0.00668296217918396, 0.051690153777599335, -0.0014356389874592423, -0.04

Suggestions for things to try out later on:
1. Search HuggingFace for some models that look interesting and try them out. You can also use the HuggingFace portal's "Inference API" directly if you want.
2. Test different embedding models. Can you mix and match them, i.e. are the embeddings somehow comparable across different models?
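
As a starting point for the second suggestion, embeddings are usually compared with cosine similarity. The sketch below uses made-up three-dimensional vectors in place of real embeddings, and `cosine_similarity` is a hypothetical helper. Each model learns its own vector space, often with a different number of dimensions, so vectors from different models are generally not directly comparable. The shape check makes the dimension mismatch explicit.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"Shape mismatch: {a.shape} vs {b.shape}")
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up toy vectors standing in for real embeddings.
doc_a = [0.2, 0.9, 0.1]
doc_b = [0.25, 0.8, 0.05]
doc_c = [0.9, -0.1, 0.4]

print("a vs b:", cosine_similarity(doc_a, doc_b))
print("a vs c:", cosine_similarity(doc_a, doc_c))
```

Even when two models happen to produce vectors of the same length, a high or low similarity between them carries no guaranteed meaning, so comparisons are best kept within a single model.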

### HuggingFace pipeline

HuggingFace also has a convenient `pipeline` abstraction for model inference. It offers a simple API for running models without the need to load, for instance, tokenizers separately.


In [5]:
pipe = pipeline("text-classification", model="ProsusAI/finbert")

input_text = "Top private equity firms put brakes on China dealmaking"
pipe(input_text)

Device set to use mps:0


[{'label': 'negative', 'score': 0.9035528302192688}]

# Embeddings and RAG

Next, let's build a very simple RAG application. The application uses financial news articles as its database. Given an article, it finds similar ones and generates some additional information about the retrieved articles.

### Load a dataset from HuggingFace

In [6]:
fina_news = load_dataset("Aappo/fina_news_1000")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


The loaded dataset contains financial news data (headline, journalists, date, link to the article, and the article text).

In [7]:
fina_news['train'][0]

{'Headline': 'Ivory Coast Keeps Cocoa Export Tax Below 22%, Document Shows',
 'Journalists': ['Baudelaire Mieu'],
 'Date': Timestamp('2011-10-06 15:14:20'),
 'Link': 'http://www.bloomberg.com/news/2011-10-06/ivory-coast-keeps-cocoa-export-tax-below-22-document-shows.html',
 'Article': 'Export taxes on cocoa beans from Ivory Coast , the world’s biggest producer of the chocolate ingredient, won’t exceed 22 percent of the international price this season, meeting a commitment to the International Monetary Fund , according to a finance ministry document. In the 2008-9 season taxes averaged 25.3 percent of international prices, the IMF said in a document posted on its website in November last year. While the country met the commitment in the season just ended, it had a change in government earlier this year. The rate meets a demand by the International Monetary Fund and the World Bank to reform the Ivorian cocoa and coffee industries in order to comply with the terms of its Heavily Indebted 

We will use an embedding model from HuggingFace. Embedding models can be loaded by using the SentenceTransformer class.

In [8]:
embedder = SentenceTransformer("msmarco-distilbert-base-v4")


### Some helper functions

Let's define some helper functions for generating a vector index and for searching the index. In this example case the vector index is a scikit-learn nearest neighbour model.

In [9]:
def index_documents_huggingface(articles:List[str]):
    embeddings = embedder.encode(articles)
    nbrs = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(embeddings)
    return nbrs

In [10]:
def get_nearest_neighbours_huggingface(nbrs, article: str, all_articles: List[str], n_neighbors: int = 2):
    embedding = embedder.encode(article)
    # kneighbors returns a (distances, indices) pair, each of shape (1, n_neighbors)
    distances, indices = nbrs.kneighbors([embedding], n_neighbors=n_neighbors)
    neighbour_articles = np.array(all_articles)[indices[0]]
    # Return the matched articles together with their distances to the query
    return neighbour_articles, distances
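
To see what the nearest-neighbour lookup does without downloading a model, here is a minimal sketch on toy data. It swaps scikit-learn's KD-tree for a brute-force NumPy search using Euclidean distance. The two-dimensional `toy_embeddings`, the `toy_articles` labels and the `brute_force_neighbours` helper are all made up for illustration.

```python
import numpy as np


def brute_force_neighbours(index_vectors, query, n_neighbors=2):
    """Return (indices, distances) of the closest rows by Euclidean distance."""
    index_vectors = np.asarray(index_vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Distance from the query to every indexed vector
    distances = np.linalg.norm(index_vectors - query, axis=1)
    # Indices of the n_neighbors smallest distances, closest first
    order = np.argsort(distances)[:n_neighbors]
    return order, distances[order]


# Made-up 2-D "embeddings" standing in for real sentence embeddings.
toy_embeddings = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
])
toy_articles = np.array(["article A", "article B", "article C", "article D"])

indices, distances = brute_force_neighbours(toy_embeddings, query=[0.9, 0.1], n_neighbors=2)
print(toy_articles[indices], distances)
```

Brute force compares the query against every vector, which is fine for a hundred articles. Tree-based or approximate indexes become worthwhile as the collection grows.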

### Let's index the articles

This can take a short while on Colab, so we are only using the first 100 articles.

In [11]:
nbrs_huggingface = index_documents_huggingface(fina_news["train"]["Article"][:100])

### Find the similar articles of a given one

Let's pick one article from our catalog to use as the query:

In [12]:
article = fina_news["train"]["Article"][10]
display(article)

In [13]:

nearest_articles = get_nearest_neighbours_huggingface(
    nbrs=nbrs_huggingface,
    article=article,
    all_articles=fina_news["train"]["Article"][:100],  # must match the indexed slice
    n_neighbors=5,
)
display(nearest_articles)

(array(['Apple Inc. (AAPL) fans worldwide mourned the death of co-founder Steve Jobs , paying tribute to the man who changed the way they listen to music, use their mobile phones and play on their computers. At Apple’s headquarters -- located at 1 Infinite Loop, Cupertino, California -- flags flew at half-staff and bagpipes sounded to the tune of “Amazing Grace” as people placed flowers around a white iPad with a picture of Jobs, who died yesterday at 56, after a battle with cancer. Mourners flocked to Apple stores from New York to Hong Kong , while a crowd gathered in San Francisco ’s Mission Dolores Park for an iPhone-lit vigil. “Part of the narrative that made Apple what it is today goes out with Steve Jobs,” said Christopher Smith, 40, a former business development manager in San Francisco who joined the vigil. “I came out to honor the fact that one man with vision, courage and unwavering dedication can still change the world. The way that I communicate and the way that I interact 

### Generate some additional information about the retrieved articles

Let's start by generating short summaries of the retrieved articles. Specialized summarization models exist as well, but here we'll simply prompt the GPT-4o model.

In [14]:
deployment_name="dep-forge-ragdemo-dev-9wgh3p0b-chat"
api_version="2025-01-01-preview"
task = "chat/completions"
endpoint = "https://swedencentral.api.cognitive.microsoft.com/"

client = AzureOpenAI(
    api_key=api_key,
    api_version=api_version,
    azure_endpoint=endpoint,
)

for article in nearest_articles[0]:
    messages = [{"role":"system", "content": "You are a helpful assistant giving short one sentence summary of the given text."},
                {"role": "user", "content": article}]
    response = client.chat.completions.create(model=deployment_name, messages=messages, max_tokens=100)
    print("\nSummary:")
    display(response.choices[0].message.content)



Summary:


'Fans and admirers worldwide mourned Apple co-founder Steve Jobs after his death at 56, honoring his visionary impact on technology and culture with tributes at Apple stores, online condolences, and heartfelt memorials.'


Summary:


"In South Korea, shares of Samsung Electronics and LG Electronics rose 4% and 6.6% respectively, amid optimism for increased market share following the death of Apple's Steve Jobs, while Hana Financial Group surged 7.8% ahead of a court ruling related to its acquisition of Korea Exchange Bank."


Summary:


'Eric J. Aronson faces accusations of running a $26 million Ponzi scheme, using investor funds to pay restitution for past fraud, while promising high returns from his company, PermaPave Industries LLC.'


Summary:


'A study by General Electric and Ohio State University reveals that mid-sized U.S. companies expect to continue growing despite economic slowdowns, having added 2.2 million jobs during the recession, but face challenges such as limited access to capital and international competition.'


Summary:


"Reed Hundt proposed a 10-year tax break for alternative-energy companies to foster job creation and innovation, while also addressing the aftermath of Solyndra's bankruptcy and collaborating with states to enhance clean energy funding."

### TODO

You can continue to develop this application further:

1. How could you use the GPT-4o model to classify the articles based on, for instance, their topic or sentiment?
2. How could you change the prompt to use GPT-4o to explain why the articles are similar to each other?
3. What if you use the above `ProsusAI/finbert` model for classification? If there are errors, how could you prevent those?
4. In what type of real life scenario could you use this type of retrieval setup?
5. Modify the code so that you use the model `text-embedding-3-large` for generating the embeddings.
6. Try deploying your own LLM on some API provider's infrastructure and use it to (a) generate the embeddings and (b) generate the additional information.
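
As one possible starting point for the first item, the same chat message structure used for summarization can ask for a label instead. The sketch below only builds the message list and does not call any API. The `build_classification_messages` helper and the prompt wording are hypothetical, so adjust them to your own labels.

```python
from typing import Dict, List


def build_classification_messages(article: str, labels: List[str]) -> List[Dict[str, str]]:
    """Build a chat message list asking the model to choose exactly one label."""
    label_list = ", ".join(labels)
    system_prompt = (
        "You are a helpful assistant that classifies financial news articles. "
        f"Answer with exactly one of these labels: {label_list}."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": article},
    ]


messages = build_classification_messages(
    "Top private equity firms put brakes on China dealmaking",
    labels=["positive", "negative", "neutral"],
)
print(messages)

# With the client defined earlier, the call would look like:
# client.chat.completions.create(model=deployment_name, messages=messages, max_tokens=10)
```

Restricting the answer to a fixed label set makes the response easier to parse and to compare against the `ProsusAI/finbert` predictions.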