# Tokenizers, LLMs and heads

In [42]:

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM
)


## Try out GPT-2 model

In [43]:
# Load the gpt-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load the gpt-2 model with the text generation head
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")



### Try out the loaded tokenizer

In [44]:
# Encoding can be done with the encode method or by calling the tokenizer directly
input_text = "The most important thing in life is"
encoded_input = tokenizer.encode(input_text)

print("Encoded input:")
print(encoded_input)
print("\nEncoded input with tokenizer callable:")
print(tokenizer(input_text))

# Decoding can be done with the decode method
# When decoding the encoded input, the tokenizer should return the original text.
print("\nDecoded input:")
print(tokenizer.decode(encoded_input))

Encoded input:
[464, 749, 1593, 1517, 287, 1204, 318]

Encoded input with tokenizer callable:
{'input_ids': [464, 749, 1593, 1517, 287, 1204, 318], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Decoded input:
The most important thing in life is


### Try out the loaded model

In [45]:
# Inference can be done by calling the model's .generate method
model_output = gpt2_model.generate(**tokenizer(input_text, return_tensors="pt"), max_new_tokens=10)
print("Model output tokens:")
print(model_output[0])
print("\nModel output decoded:")
print(tokenizer.decode(model_output[0]))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Model output tokens:
tensor([ 464,  749, 1593, 1517,  287, 1204,  318,  284,  307, 1498,  284,  466,
        1223,  326,  345, 1842,   13])

Model output decoded:
The most important thing in life is to be able to do something that you love.


### TODO
The output above was fairly reasonable for the GPT-2 model. What happens if you increase `max_new_tokens`?

Try it out.
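One way to approach this exercise is sketched below. It reloads the same `gpt2` checkpoint so that it runs on its own, and compares two generation budgets. The value 50 and the names `inputs`, `prompt_length`, `budget` and `new_tokens` are illustrative choices, not part of the original notebook. Passing `pad_token_id` explicitly is optional and only silences the warning seen earlier.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same checkpoint as above; reloaded here so the cell is self-contained.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

input_text = "The most important thing in life is"
inputs = tokenizer(input_text, return_tensors="pt")
prompt_length = inputs["input_ids"].shape[1]

# Compare the original budget (10) with a larger, arbitrarily chosen one (50).
for budget in (10, 50):
    output = gpt2_model.generate(
        **inputs,
        max_new_tokens=budget,
        pad_token_id=tokenizer.eos_token_id,  # avoids the pad_token_id warning
    )
    # The output contains the prompt tokens followed by the generated ones.
    new_tokens = output.shape[1] - prompt_length
    print(f"\nmax_new_tokens={budget}: generated {new_tokens} new tokens")
    print(tokenizer.decode(output[0]))
```

`max_new_tokens` is an upper bound on how many tokens are appended to the prompt. With a larger budget, a small model such as GPT-2 will often drift off topic or start repeating itself, which is worth checking in the decoded text.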

## Try out a model trained for classification

ProsusAI/finbert:

"FinBERT is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the BERT language model in the finance domain, using a large financial corpus and thereby fine-tuning it for financial sentiment classification."


In [46]:
# Load the finbert tokenizer
finbert_tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

# Load the finbert model with the sequence classification head
finbert_model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

### TODO
Try out the finbert model.

Notice that the model is now called directly as a callable rather than through the `.generate` method, and that the `max_new_tokens` parameter does not apply to a classification model.

In [47]:
input_text = "Top private equity firms put brakes on China dealmaking"
model_output = finbert_model(**finbert_tokenizer(input_text, return_tensors="pt"))
print("Model output (logits for positive, negative, neutral):")
print(model_output[0])


Model output (logits for positive, negative, neutral):
tensor([[-1.7899,  2.5756,  0.2115]], grad_fn=<AddmmBackward0>)
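The three values printed above are raw logits, not probabilities: they can be negative and do not sum to 1. Applying softmax turns them into a probability distribution over the classes. The sketch below does this in plain Python on the logits copied from the output above. The label order is taken from the notebook's print statement and is an assumption here, and the names `softmax`, `probabilities` and `predicted_label` are illustrative.

```python
import math

# Label order as stated in the print statement above (an assumption;
# the model's own config is the authoritative source).
labels = ["positive", "negative", "neutral"]

# Logits copied from the recorded model output.
logits = [-1.7899, 2.5756, 0.2115]


def softmax(values):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the result.
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = dict(zip(labels, softmax(logits)))
predicted_label = max(probabilities, key=probabilities.get)

for label, p in probabilities.items():
    print(f"{label}: {p:.3f}")
print("Predicted label:", predicted_label)
```

Inside the notebook, the same conversion can be done on the tensor with `torch.softmax(model_output.logits, dim=-1)`, and `finbert_model.config.id2label` shows which index the model maps to which label.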


### TODO

If you have time, search Hugging Face for a model that looks interesting and try it out. You can also use the Hugging Face portal's "Inference API" directly if you want.