AI - TP3_8

Bastien SAUVAT et Bastien FAISANT

# Exercise 2: Explore the Hugging Face Transformers library

In [1]:
from transformers import pipeline, AutoTokenizer, TFAutoModelForCausalLM
from datasets import load_dataset
import tensorflow as tf

## Text classification

A pre-trained sentiment analysis pipeline is used to determine the sentiment of a given text. Given a sentence, it returns a label (POSITIVE or NEGATIVE) together with a confidence score. Because no model is specified, the pipeline falls back to a default checkpoint, as the log output reports. A score close to 0.5 on this two-label task means the model is nearly undecided.

In [2]:
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [3]:
classifier("We are very happy to show you the 🤗 Transformers library.")

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [4]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: POSITIVE, with score: 0.9998
label: NEGATIVE, with score: 0.5309
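
The scores above can be read as probabilities over the two labels. The following is a minimal sketch of that last step, assuming the pipeline applies a softmax to the model's two output logits and that index 0 maps to NEGATIVE and index 1 to POSITIVE. The logit values and the helper names `softmax` and `to_prediction` are hypothetical and are not taken from the notebook.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, labels=("NEGATIVE", "POSITIVE")):
    """Pick the most probable label; the label order is an assumption."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits: a clear-cut case and a nearly undecided one.
print(to_prediction([-4.2, 4.2]))
print(to_prediction([0.06, -0.06]))
```

With two labels, the winning score can never fall below 0.5, which is why the 0.5309 reported for the second sentence indicates a very uncertain prediction rather than a moderately confident one.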


## Question answering

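Another pipeline handles extractive question answering. Given a context (here, a passage about Elon Musk) and a question, the model returns the span of the context that best answers the question, together with a confidence score and the position of that span. In the runs below it extracts Musk's birth date with high confidence (score of about 0.98) and his occupation with lower confidence (about 0.48).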

In [5]:
question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForQuestionAnswering.

All the weights of TFDistilBertForQuestionAnswering were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForQuestionAnswering for predictions without further training.


In [6]:
context = '''Elon Reeve Musk FRS (/ˈiːlɒn/ EE-lon; born June 28, 1971) is a business magnate and investor. He is the founder, CEO and chief engineer of SpaceX; 
            angel investor, CEO and product architect of Tesla, Inc.; owner and CEO of Twitter, Inc.; founder of The Boring Company; co-founder of Neuralink 
            and OpenAI; and president of the philanthropic Musk Foundation. With an estimated net worth of around $175 billion as of February 3, 2023, primarily 
            from his ownership stakes in Tesla and SpaceX,[4][5] Musk is the second-wealthiest person in the world, according to both the Bloomberg Billionaires 
            Index and Forbes's real-time billionaires list.[6][7]
            Musk was born in Pretoria, South Africa, and briefly attended at the University of Pretoria before moving to Canada at age 18, acquiring citizenship 
            through his Canadian-born mother. Two years later, he matriculated at Queen's University and transferred to the University of Pennsylvania, 
            where he received bachelor's degrees in economics and physics. He moved to California in 1995 to attend Stanford University. After two days, he 
            dropped out and with his brother Kimbal, co-founded the online city guide software company Zip2. In 1999, Zip2 was acquired by Compaq for $307 million 
            and Musk co-founded X.com, a direct bank. X.com merged with Confinity in 2000 to form PayPal, which eBay acquired for $1.5 billion in 2002.
            With $175.8 million, Musk founded SpaceX in 2002, a spaceflight services company. In 2004, he was an early investor in the electric vehicle manufacturer 
            Tesla Motors, Inc. (now Tesla, Inc.). He became its chairman and product architect, assuming the position of CEO in 2008. In 2006, he helped create 
            SolarCity, a solar energy company that was later acquired by Tesla and became Tesla Energy. In 2015, he co-founded OpenAI, a nonprofit artificial 
            intelligence research company. The following year, he co-founded Neuralink—a neurotechnology company developing brain–computer interfaces—and The Boring 
            Company, a tunnel construction company. Musk has also proposed a hyperloop high-speed vactrain transportation system. In 2022, his acquisition of Twitter 
            for $44 billion was completed.'''

In [7]:
question = "Who is Elon Musk?"
question_answerer(context = context, question = question)

{'score': 0.47643476724624634,
 'start': 63,
 'end': 92,
 'answer': 'business magnate and investor'}

In [8]:
question = "When Elon Musk was born?"
question_answerer(context = context, question = question)

{'score': 0.9790981411933899,
 'start': 43,
 'end': 56,
 'answer': 'June 28, 1971'}
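
The `start` and `end` fields in both results can be read as character offsets into `context`, an interpretation that is consistent with the outputs above. The sketch below checks this on the first sentence of the context only. The names `answer_span` and `context_head` are hypothetical helpers introduced for illustration.

```python
def answer_span(context, result):
    """Recover the answer text from the character offsets in a QA result."""
    return context[result["start"]:result["end"]]

# First sentence of the notebook's context; offsets count from its first character.
context_head = "Elon Reeve Musk FRS (/ˈiːlɒn/ EE-lon; born June 28, 1971) is a business magnate and investor."

# Results copied from the two question-answering cells above.
occupation = {"score": 0.47643476724624634, "start": 63, "end": 92,
              "answer": "business magnate and investor"}
birth_date = {"score": 0.9790981411933899, "start": 43, "end": 56,
              "answer": "June 28, 1971"}

for result in (occupation, birth_date):
    span = answer_span(context_head, result)
    print(f"{span!r} matches answer: {span == result['answer']}")
```

Because the model only selects a span, it cannot produce an answer that is not literally present in the context.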

## Custom model builds

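Here, the pre-trained "gpt2" checkpoint is loaded explicitly for text generation instead of going through `pipeline`. The tokenizer is created with `AutoTokenizer` and the model with `TFAutoModelForCausalLM`. GPT-2 defines no padding token, so the end-of-sequence token is reused as the padding token. The prompt ("Kilian Mbappe is ") is tokenized and passed to `model.generate`, which samples a continuation (`do_sample=True`) with a fixed seed. The second call sets `max_new_tokens=30` to produce a longer continuation.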

In [40]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

In [41]:
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [42]:
model.config.pad_token_id = model.config.eos_token_id

In [46]:
inputs = tokenizer(["Kilian Mbappe is "], return_tensors="tf")

In [47]:
generated = model.generate(**inputs, do_sample=True, seed=(42, 0))
print("Sampling output: ", tokenizer.decode(generated[0]))

Sampling output:  Kilian Mbappe is  a great young player. He's a high-achieving


In [48]:
generated = model.generate(**inputs, do_sample=True, seed=(42, 0), max_new_tokens=30)
print("Sampling output: ", tokenizer.decode(generated[0]))

Sampling output:  Kilian Mbappe is  a great young player. He's a high-achieving individual. It's not that he's the best player in the world, it's
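
Both runs use the same seed, and the longer output begins with exactly the text of the shorter one. The toy sketch below illustrates why a fixed seed gives this behaviour. It is not GPT-2: the vocabulary, the weights and the function `sample_tokens` are hypothetical, and a real language model recomputes the next-token distribution from the preceding tokens at every step.

```python
import random

def sample_tokens(seed, max_new_tokens,
                  vocab=("a", "great", "young", "player", "."),
                  weights=(5, 3, 3, 2, 1)):
    """Draw tokens one at a time from a fixed toy distribution."""
    rng = random.Random(seed)  # a fixed seed makes the draws reproducible
    return [rng.choices(vocab, weights=weights, k=1)[0]
            for _ in range(max_new_tokens)]

short = sample_tokens(seed=42, max_new_tokens=5)
longer = sample_tokens(seed=42, max_new_tokens=8)

print("short :", " ".join(short))
print("longer:", " ".join(longer))
print("same prefix:", longer[:len(short)] == short)
```

The same seed yields the same sequence of random draws, so raising the token limit only appends further tokens to an unchanged prefix. With a different seed, or without one, the continuation would generally differ from run to run.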
