<a href="https://colab.research.google.com/github/enzogranado/Machine-Learning-Primeiro-Lab/blob/main/C%C3%B3pia_de_Lab_Aula_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TRANSFORMERS: WHAT CAN THEY DO?

<p style="font-family: Georgia;">Transformer models are used to solve all kinds of NLP tasks, like the ones mentioned in the previous section.</p>

<p style="font-family: Georgia;">The <a href="https://github.com/huggingface/transformers" style="font-weight: bold;">Transformers Library</a> provides the functionality to create and use those shared models.</p>

<p style="font-family: Georgia;">This notebook will teach you about <b>Natural Language Processing (NLP)</b> using libraries from the <a href="https://huggingface.co/" style="font-weight: bold;">Hugging Face</a> ecosystem. It’s completely free and without ads.</p>

<p style="font-family: Georgia;">Before diving into how Transformer models work under the hood, let’s look at a few examples of how they can be used to solve some interesting NLP problems.</p>

<p style="font-family: Georgia;">The most basic object in the Transformers library is the <b><code>pipeline()</code></b> function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text (or lists of text statements) and get an intelligible answer.</p>

<p style="font-family: Georgia;">By default, a sentiment-analysis pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English; the sentiment analysis cell in this notebook instead passes an explicit multilingual checkpoint through the <code>model</code> argument. The model is downloaded and cached when you create the classifier object. If you rerun the command, the cached model is used and nothing is downloaded again.</p><br>
<p style="font-family: Georgia; font-weight: bold;">There are three main steps involved when you pass some text to a pipeline:</p>
<ol style="font-family: Georgia;">
    <li>The text is preprocessed into a format the model can understand.</li>
    <li>The preprocessed inputs are passed to the model.</li>
    <li>The predictions of the model are post-processed, so you can make sense of them.</li>
</ol><br>
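<p style="font-family: Georgia;">The three steps above can be sketched with a toy example. This is only an illustration of the flow, not the Transformers implementation: the vocabulary, the weights, and the function names (<code>preprocess</code>, <code>model</code>, <code>postprocess</code>, <code>toy_pipeline</code>) are all hypothetical and chosen by hand.</p>

```python
import math

# Hypothetical toy vocabulary and weights, for illustration only.
VOCAB = {"great": 0, "terrible": 1, "service": 2}
WEIGHTS = [            # one row per token id: [negative, positive] contribution
    [-1.5, 2.0],       # "great"
    [2.5, -2.0],       # "terrible"
    [0.0, 0.1],        # "service"
]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the toy model understands."""
    words = text.lower().replace(".", "").split()
    return [VOCAB[w] for w in words if w in VOCAB]


def model(token_ids):
    """Step 2: produce one raw score (logit) per label."""
    logits = [0.0, 0.0]
    for token_id in token_ids:
        for j, weight in enumerate(WEIGHTS[token_id]):
            logits[j] += weight
    return logits


def postprocess(logits):
    """Step 3: convert logits into a readable label and probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, mirroring what a pipeline does for us."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("The service was great."))
```

<p style="font-family: Georgia;">A real pipeline replaces the hand-written vocabulary with a tokenizer and the weight table with a neural network, but the shape of the flow (text in, label and score out) is the same.</p>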
    
<p style="font-family: Georgia; font-weight: bold;">Some of the currently available pipelines are:</p>
<ul style="font-family: Georgia;">
    <li>sentiment-analysis</li>
    <li>feature-extraction (get the vector representation of a text)</li>
    <li>fill-mask</li>
    <li>ner (named entity recognition)</li>
    <li>question-answering</li>
    <li>summarization</li>
    <li>text-generation</li>
    <li>translation</li>
    <li>zero-shot-classification</li>
</ul><br>


In [4]:
#########################
# !pip install --upgrade transformers

# Full Development Version
!pip install -q --upgrade transformers[sentencepiece]

# Install Flair NLP library - https://github.com/flairNLP/flair
!pip install -q --upgrade flair

In [5]:
# Import the pipeline function from the transformers library
from transformers import pipeline

# A pipeline is created for a given task. Accepted tasks include the following
# (the exact list depends on the installed version of transformers):
#    - `"audio-classification"`
#    - `"automatic-speech-recognition"`
#    - `"conversational"`
#    - `"feature-extraction"`
#    - `"fill-mask"`
#    - `"image-classification"`
#    - `"question-answering"`
#    - `"table-question-answering"`
#    - `"text2text-generation"`
#    - `"text-classification"`
#    - `"text-generation"`
#    - `"token-classification"`
#    - `"translation"`
#    - `"translation_xx_to_yy"`
#    - `"summarization"`
#    - `"zero-shot-classification"`

### SENTIMENT ANALYSIS

Let’s have a look at a few of these, starting with sentiment analysis!

In [6]:
import warnings
warnings.filterwarnings("ignore")
from transformers import logging
logging.set_verbosity_error()

classifier = pipeline("sentiment-analysis", model="cardiffnlp/twitter-xlm-roberta-base-sentiment")

# Call the classifier on each of the three statements individually (mixed, negative, positive)
print("\n... INDIVIDUAL CALLS [POSITIVE FOLLOWED BY NEGATIVE] ...\n")
print('\t', classifier("O atendimento da empresa foi excelente, mas a entrega demorou demais."))
print('\t', classifier("O novo software de gestão é confuso e atrapalhou os processos da equipe."))
print('\t', classifier("Estou muito satisfeito com a parceria firmada neste trimestre."))

# Call the classifier once on a list containing the same three statements
print("\n\n... LIST CALL [POSITIVE FOLLOWED BY NEGATIVE] ...\n")
print('\t', classifier([
    "O atendimento da empresa foi excelente, mas a entrega demorou demais.",
    "O novo software de gestão é confuso e atrapalhou os processos da equipe.",
    "Estou muito satisfeito com a parceria firmada neste trimestre."
]))


... INDIVIDUAL CALLS [POSITIVE FOLLOWED BY NEGATIVE] ...

	 [{'label': 'negative', 'score': 0.661409854888916}]
	 [{'label': 'negative', 'score': 0.9031264781951904}]
	 [{'label': 'positive', 'score': 0.9339599609375}]


... LIST CALL [POSITIVE FOLLOWED BY NEGATIVE] ...

	 [{'label': 'negative', 'score': 0.661409854888916}, {'label': 'negative', 'score': 0.9031264781951904}, {'label': 'positive', 'score': 0.9339599609375}]
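<p style="font-family: Georgia;">Each prediction is a plain dictionary, so the results are easy to post-process. As a small sketch, the helper below flags predictions whose score falls under a cutoff; the name <code>flag_uncertain</code> and the threshold of 0.7 are assumptions made for this illustration, not part of the library.</p>

```python
# Predictions copied from the output above.
results = [
    {"label": "negative", "score": 0.661409854888916},
    {"label": "negative", "score": 0.9031264781951904},
    {"label": "positive", "score": 0.9339599609375},
]


def flag_uncertain(predictions, threshold=0.7):
    """Return the indices of predictions whose top score is below `threshold`."""
    return [i for i, pred in enumerate(predictions) if pred["score"] < threshold]


print(flag_uncertain(results))  # [0]
```

<p style="font-family: Georgia;">Only the first statement is flagged, which is plausible: it praises the service and criticizes the delivery in the same sentence.</p>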


### ZERO-SHOT CLASSIFICATION

<p style="font-family: Georgia;">Next, we’ll tackle a more challenging task where we need to classify texts that haven’t been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise.</p>

<p style="font-family: Georgia;">For this use case, the zero-shot-classification pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don’t have to rely on the labels of the pretrained model. You’ve already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.</p>

<p style="font-family: Georgia;">This pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!</p>
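<p style="font-family: Georgia;">One way to picture how this works with an NLI model such as the one used in this section: each candidate label is turned into a hypothesis sentence, the model scores how strongly the text entails each hypothesis, and the scores are normalized across labels. The sketch below assumes the template <code>"This example is {}."</code> and uses made-up entailment logits; the function names are hypothetical, and no model is actually called.</p>

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def rank_labels(labels, entailment_logits):
    """Softmax the entailment logits across labels and sort by score."""
    peak = max(entailment_logits)
    exps = [math.exp(x - peak) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


labels = ["complaint", "refund request", "praise"]
print(build_hypotheses(labels))

# Hypothetical entailment logits, one per (text, hypothesis) pair.
made_up_logits = [2.1, 1.0, -2.0]
for label, score in rank_labels(labels, made_up_logits):
    print(f"{label}: {score:.3f}")
```

<p style="font-family: Georgia;">Because the scores are normalized over the labels you supply, changing the label list changes every score, which is why the choice of candidate labels matters so much.</p>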


### Q: Did the model appear to get the classifications right?
The model appears to have classified the texts correctly, since most of its answers are consistent with the users' comments.

In [7]:
warnings.filterwarnings("ignore")
logging.set_verbosity_error()

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def print_result(example_num, sequence, labels):
    result = classifier(sequence, candidate_labels=labels)
    print("EXAMPLE #{}".format(example_num))
    print("-- Classification --\n {}\n".format(result))

# Run examples
print_result(1,
    "I would like to know more about your prices and delivery times.",
    ["request for quote", "complaint", "partnership"])

print_result(2,
    "The product arrived damaged and I cannot use it.",
    ["complaint", "refund request", "praise"])

print_result(3,
    "We are interested in collaborating on a joint research project.",
    ["partnership", "request for quote", "complaint"])

print_result(4,
    "Your consulting services were extremely helpful for our company.",
    ["praise", "information request", "criticism"])

print_result(5,
    "Could we schedule a meeting to discuss a potential merger?",
    ["meeting request", "partnership", "complaint"])


EXAMPLE #1
-- Classification --
 {'sequence': 'I would like to know more about your prices and delivery times.', 'labels': ['request for quote', 'complaint', 'partnership'], 'scores': [0.6631441116333008, 0.2687123715877533, 0.06814359128475189]}

EXAMPLE #2
-- Classification --
 {'sequence': 'The product arrived damaged and I cannot use it.', 'labels': ['complaint', 'refund request', 'praise'], 'scores': [0.7395321726799011, 0.24804016947746277, 0.012427630834281445]}

EXAMPLE #3
-- Classification --
 {'sequence': 'We are interested in collaborating on a joint research project.', 'labels': ['partnership', 'request for quote', 'complaint'], 'scores': [0.8934212327003479, 0.09395067393779755, 0.012628085911273956]}

EXAMPLE #4
-- Classification --
 {'sequence': 'Your consulting services were extremely helpful for our company.', 'labels': ['praise', 'information request', 'criticism'], 'scores': [0.9438984990119934, 0.03883228078484535, 0.017269158735871315]}

EXAMPLE #5
-- Classificatio

### TEXT GENERATION

<p style="font-family: Georgia;">Now let’s see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it’s normal if you don’t get the same results as shown below.</p>

<p style="font-family: Georgia;"> You can control how many different sequences are generated with the argument <b><code>num_return_sequences</code></b> and the total length of the output text with the argument <b><code>max_length</code></b>.</p>
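<p style="font-family: Georgia;">The randomness comes from sampling the next token instead of always taking the most likely one. The generation cells in this section pass <code>top_p=0.9</code>, which is commonly called nucleus sampling. The sketch below shows the idea on a hand-made distribution; the token probabilities and the helper names <code>top_p_filter</code> and <code>sample_next_token</code> are hypothetical, and a real model works over a vocabulary of tens of thousands of tokens.</p>

```python
import random


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose total probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}  # renormalize


def sample_next_token(probs, top_p=0.9, rng=random):
    """Draw one token from the filtered distribution."""
    filtered = top_p_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token probabilities after the prompt "Companies like".
next_token_probs = {"Google": 0.5, "IBM": 0.3, "Amazon": 0.15, "banana": 0.05}
print(top_p_filter(next_token_probs, top_p=0.9))
print(sample_next_token(next_token_probs, rng=random.Random(0)))
```

<p style="font-family: Georgia;">With this distribution the unlikely token is dropped before sampling, so lowering <code>top_p</code> trades variety for safer continuations.</p>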

In [8]:
warnings.filterwarnings("ignore")
logging.set_verbosity_error()

# Instantiate the default text-generation model
generator = pipeline("text-generation")

print("\n\n... EXAMPLE #1 ...\n")
original_text = "Sales Report - Q3 2025. Overview: The third quarter of 2025 has shown a significant increase in our renewable energy sector sales. Our team focused on expanding the client base in the solar and wind markets, achieving"
pred_text = generator(original_text,
                      return_full_text=False,
                      max_new_tokens=30,
                      top_p=0.9,
                      repetition_penalty=2.0,
                      num_return_sequences=1)
print(f"\n===== Original Phrase =====\n\t--> {original_text}")
print(f"\n----- Generated Text -----\n\t--> {pred_text[0]['generated_text']} ...")


print("\n\n... EXAMPLE #2 ...\n")
original_text = "Innovative industries are changing the way that we consume and invest in our products. Companies like"
pred_text_seqs = generator(original_text,
                           return_full_text=False,
                           max_length=50,
                           top_p=0.9,
                           repetition_penalty=2.0,
                           num_return_sequences=3)
print(f"\n===== Original Phrase =====\n\t--> {original_text}")
for i, pred_text in enumerate(pred_text_seqs, start=1):
    print(f"\n----- Generated Text {i} -----\n\t--> {pred_text['generated_text']}...")


print("\n\n... EXAMPLE #3 ...\n")
original_text = "GPT thought God to be"
pred_text_seqs = generator(original_text,
                           return_full_text=False,
                           max_length=100,
                           repetition_penalty=2.0,
                           num_return_sequences=3)
print(f"\n===== Original Phrase =====\n\t--> {original_text} ... ")
for i, pred_text in enumerate(pred_text_seqs, start=1):
    print(f"\n----- Generated Text {i} -----\n\t--> {pred_text['generated_text']} ...")





... EXAMPLE #1 ...


===== Original Phrase =====
	--> Sales Report - Q3 2025. Overview: The third quarter of 2025 has shown a significant increase in our renewable energy sector sales. Our team focused on expanding the client base in the solar and wind markets, achieving

----- Generated Text -----
	-->  high returns for clients who are already looking to diversify their portfolio across multiple sectors such as automotive or industrial applications; increased profitability by providing customers with low ...


... EXAMPLE #2 ...


===== Original Phrase =====
	--> Innovative industries are changing the way that we consume and invest in our products. Companies like

----- Generated Text 1 -----
	-->  IBM, Cisco or Google have built innovative technologies to drive innovation while providing customers with fast data access for their business."
 [PDF]...

----- Generated Text 2 -----
	-->  Google, Facebook or Amazon have taken on a global responsibility to create an alternative by offer

#### USING ANY MODEL FROM THE HUB IN A PIPELINE

<p style="font-family: Georgia;">The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation. Go to the <a href="https://huggingface.co/models" style="font-weight: bold;">Model Hub</a> and click on the corresponding tag on the left to display only the supported models for that task. You should get to a page like <a href="https://huggingface.co/models?pipeline_tag=text-generation" style="font-weight: bold;">this one.</a></p>

### FILL MASK

<p style="font-family: Georgia;">The next pipeline you’ll try is fill-mask. The idea of this task is to fill in the blanks in a given text. The <b><code>top_k</code></b> argument controls how many possibilities you want to be displayed. Note that here the model fills in the special <b><code>&lt;mask></code></b> word, which is often referred to as a mask token. Other mask-filling models might have different mask tokens, so it’s always good to verify the proper mask word when exploring other models. One way to check it is by looking at the mask word used in the widget.</p>

<p style="font-family: Georgia;">Masked Language Modeling (MLM) is a very common training task for Transformer architectures today. It involves masking part of the input and then teaching a model to predict the missing tokens, essentially reconstructing the original, unmasked input. MLM is often used as a pretraining task, giving models the opportunity to learn textual patterns from unlabeled data.</p>
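<p style="font-family: Georgia;">To make the masking step concrete, here is a simplified sketch of how training pairs could be built from a sentence. The function <code>mask_tokens</code>, its whitespace tokenization, and the masking probability are assumptions for illustration; real pretraining recipes work on subword tokens and use more elaborate replacement rules.</p>

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=None):
    """Randomly hide tokens; return the masked input and the targets to predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[position] = token  # the model is trained to recover this
        else:
            masked.append(token)
    return masked, targets


tokens = "I am going to the bakery".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)   # the input the model would see
print(targets)  # position -> original token the model must predict
```

<p style="font-family: Georgia;">No labels are needed beyond the text itself, which is why MLM can be applied to large amounts of unlabeled data.</p>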

<p style="font-family: Georgia;">Downstream tasks can benefit from models pretrained on MLM too. Suppose, for example, that you need to reconstruct the contents of partially destroyed documents: a model pretrained on MLM can propose the missing words.</p>

<p style="font-family: Georgia;">Take the example <b>“I am <span style="color: blue">&lt;mask></span> to the bakery”.</b></p><ul style="font-family: Georgia;"><li><b style="color:blue;">"going"</b> is the expected missing value here.</li></ul>

Use the <b><code>top_k</code></b> argument to choose how many candidate fillers are returned for each blank.<br><br>

In [9]:
warnings.filterwarnings("ignore")
logging.set_verbosity_error()

# Instantiate the default fill-mask model
unmasker = pipeline("fill-mask")
print("... EXAMPLE #1 - WITH DEFAULT MODEL ...")
for k in unmasker("This course will teach you all about <mask> models.", top_k=2): print(k)

print("\n\n\n... EXAMPLE #2 ...")
for k in unmasker("2 + 2 = <mask>.", top_k=2): print(k)

print("\n\n\n... EXAMPLE #3 ...")
for k in unmasker("Vincent Van Gogh was a <mask>.", top_k=2): print(k)

print("\n\n\n... EXAMPLE #4 ...")
for k in unmasker("Hugging Face engineers prefer to use the <mask> library when coding.", top_k=2): print(k)

print("\n\n\n... EXAMPLE #5 ...")
for k in unmasker("a b c d e f g <mask> i j k l m n o p q r s t u v w x y z", top_k=2): print(k)

print("\n\n\n... EXAMPLE #6 ...")
for k in unmasker("There are some things money can’t buy. For everything else, there’s <mask>.", top_k=2): print(k)

print("\n\n\n... EXAMPLE #7 ...")
for k in unmasker("If you have a body, you are an athlete. Just <mask> it.", top_k=2): print(k)


... EXAMPLE #1 - WITH DEFAULT MODEL ...
{'score': 0.19619767367839813, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}
{'score': 0.04052715748548508, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}



... EXAMPLE #2 ...
{'score': 0.08684605360031128, 'token': 321, 'token_str': ' 0', 'sequence': '2 + 2 = 0.'}
{'score': 0.06310440599918365, 'token': 112, 'token_str': ' 1', 'sequence': '2 + 2 = 1.'}



... EXAMPLE #3 ...
{'score': 0.06565066426992416, 'token': 25760, 'token_str': ' painter', 'sequence': 'Vincent Van Gogh was a painter.'}
{'score': 0.037771765142679214, 'token': 34580, 'token_str': ' philosopher', 'sequence': 'Vincent Van Gogh was a philosopher.'}



... EXAMPLE #4 ...
{'score': 0.05544894188642502, 'token': 46374, 'token_str': ' SDL', 'sequence': 'Hugging Face engineers prefer to use the SDL library when coding.'}
{'score': 0.0280

### Q: Did the model perform well when filling in a blank ([MASK]) in the text? How did it perform on the slogans? Justify.
The model did not perform well, since in some cases it produced clearly wrong completions, for example
'sequence': '2 + 2 = 0.'
and
{'score': 0.15778876841068268, 'token': 364, 'token_str': ' e', 'sequence': 'a b c d e f g e i j k l m n o p q r s t u v w x y z'}
As for the slogans, the model did not perform well either: the completions it generated are not suitable for slogans, because the model only filled in words that fit the context, not necessarily words that would be used in marketing.

### NAMED ENTITY RECOGNITION (NER)

<p style="font-family: Georgia;">Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. It is sometimes referred to as entity chunking, extraction, or identification. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category. For example, an NER machine learning (ML) model might detect the word “super.AI” in a text and classify it as a “Company”.</p>

<p style="font-family: Georgia;">In the first example below, the model correctly identified that Sylvain is a person (PER), Hugging Face an organization (ORG), and Brooklyn a location (LOC).</p>

<p style="font-family: Georgia;">We pass the option <b><code>grouped_entities=True</code></b> in the pipeline creation function to tell the pipeline to regroup the parts of the sentence that correspond to the same entity. In the first example below, the model correctly grouped “Hugging” and “Face” as a single organization, even though the name consists of multiple words. Recent releases of Transformers deprecate this option in favor of <b><code>aggregation_strategy</code></b>, so check the documentation of your installed version.</p>
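<p style="font-family: Georgia;">The regrouping idea can be sketched without a model. The example below assumes token-level predictions in a B-/I- tagging scheme and merges consecutive tokens of the same type; the tags are written by hand, <code>group_entities</code> is a hypothetical helper, and the real pipeline additionally handles subword pieces, character offsets, and scores.</p>

```python
def group_entities(tagged_tokens):
    """Merge consecutive B-/I- tagged tokens of the same type into one entity."""
    entities, current = [], None
    for word, tag in tagged_tokens:
        if tag == "O":               # outside any entity: close the open one
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and current and current["entity_group"] == entity_type:
            current["word"] += " " + word   # continue the open entity
        else:
            current = {"entity_group": entity_type, "word": word}
            entities.append(current)
    return entities


# Hypothetical token-level tags for the sentence used in the first example.
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"), ("Sylvain", "B-PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
    ("Hugging", "B-ORG"), ("Face", "I-ORG"), ("in", "O"), ("Brooklyn", "B-LOC"),
]
print(group_entities(tagged))
```

<p style="font-family: Georgia;">Without this step, “Hugging” and “Face” would be reported as two separate ORG tokens.</p>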


In [10]:
from spacy import displacy
import matplotlib
import matplotlib.pyplot as plt

def convert_hf_to_displacy_format(hf_pred, _original_text, _title=None):
    """ Function to convert prediction to the displacy specific format """
    return [dict(
        text=_original_text,
        ents=[{
            "start":ent["start"],
            "end":ent["end"],
            "label":ent["entity_group"],
            "score":ent["score"]} for ent in hf_pred],
        title=_title
    ),]

# Instantiate the default Named-Entity-Recognition model
#      --> dbmdz/bert-large-cased-finetuned-conll03-english
ner_1 = pipeline("ner", grouped_entities=True)

print("... EXAMPLE #1 WITH DEFAULT MODEL ...")
original_text =  \
    """
        My name is Sylvain and I work at Hugging Face in Brooklyn.
    """
ner_pred = ner_1(original_text)
displacy.render(convert_hf_to_displacy_format(ner_pred, original_text), style="ent", manual=True)

print("\n\n... EXAMPLE #2 WITH DEFAULT MODEL...")
original_text =  \
    """
        Italy, officially the Italian Republic is a country consisting of a peninsula
        delimited by the Alps and several islands surrounding it, whose territory
        largely coincides with the homonymous geographical region. Italy is located
        in the centre of the Mediterranean Sea, in Southern Europe; it is
        also considered part of Western Europe. A unitary parliamentary republic
        with Rome as its capital and largest city. The country covers a total area of
        301,340 km2 (116,350 sq mi) and shares land borders with France, Switzerland,
        Austria, Slovenia, as well as the enclaved microstates of Vatican City and San
        Marino. Italy has a territorial exclave in Switzerland (Campione) and a maritime
        exclave in Tunisian waters (Lampedusa). With around 60 million inhabitants,
        Italy is the third-most populous member state of the European Union.
    """
ner_pred = ner_1(original_text)
displacy.render(convert_hf_to_displacy_format(ner_pred, original_text), style="ent", manual=True)



... EXAMPLE #1 WITH DEFAULT MODEL ...




... EXAMPLE #2 WITH DEFAULT MODEL...


### QUESTION ANSWERING

<p style="font-family: Georgia;">Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language. The question-answering pipeline answers questions using information from a given context. Note that this pipeline works by extracting information from the provided context; it does not generate the answer.</p>
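<p style="font-family: Georgia;">Extraction here means choosing a start and an end position inside the context. A common way to do this, sketched below under simplifying assumptions, is to score every token as a possible start and a possible end and keep the best valid pair. The scores are invented for illustration, <code>best_span</code> is a hypothetical helper, and a real model works on subword tokens.</p>

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end, score) maximizing start + end score with start <= end."""
    best = None
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


context = "My name is Sylvain and I work at Hugging Face in Brooklyn".split()
# Hypothetical per-token scores for the question "Where do I work?".
start_scores = [0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.2, 3.0, 0.4, 0.1, 1.2]
end_scores = [0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.8, 3.1, 0.0, 1.5]

start, end, _ = best_span(start_scores, end_scores)
print(" ".join(context[start:end + 1]))  # Hugging Face
```

<p style="font-family: Georgia;">Since the answer is always a slice of the context, the pipeline cannot answer questions whose answer is not literally present in the text.</p>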

In [11]:
warnings.filterwarnings("ignore")
logging.set_verbosity_error()

def pretty_print_qa(_model, _questions, _context, show_context=True):
    """ Simple function to pretty print the output of QA model """

    # Coerce if necessary
    if not isinstance(_questions, list):
        _questions = [_questions]

    # Show context if required
    if show_context:
        print(f"\n{'-'*100}\nCONTEXT:\n{'-'*100}\n{_context}\n{'-'*100}")

    # Print QA
    for i, _q in enumerate(_questions):
        _a = _model(question=_q, context=_context )
        print(f"\n\tQUESTION #{i+1}: {_q}")
        print(f"\t\tANSWER:\t--> {_a['answer']}")
        print(f"\t\tSCORE:\t--> {_a['score']}")

# Instantiate the default question answering model
question_answerer = pipeline("question-answering")

# Basic Example
print("\n\n\n... EXAMPLE #1 ...")
context_text  = "My name is Sylvain and I work at Hugging Face in Brooklyn"
question_text = "Where do I work?"
pretty_print_qa(question_answerer, question_text, context_text)

# Example from Italy wikipedia. I'd say this is an example of simple QA
print("\n\n\n\n... EXAMPLE #2 ...")
context_text = \
    """
        Italy, officially the Italian Republic is a country consisting of a peninsula
        delimited by the Alps and several islands surrounding it,[15] whose territory
        largely coincides with the homonymous geographical region.[16] Italy is located
        in the centre of the Mediterranean Sea, in Southern Europe;[17][18][19] it is
        also considered part of Western Europe.[20][21] A unitary parliamentary republic
        with Rome as its capital and largest city, the country covers a total area of
        301,340 km2 (116,350 sq mi) and shares land borders with France, Switzerland,
        Austria, Slovenia, as well as the enclaved microstates of Vatican City and San
        Marino. Italy has a territorial exclave in Switzerland (Campione) and a maritime
        exclave in Tunisian waters (Lampedusa). With around 60 million inhabitants,
        Italy is the third-most populous member state of the European Union.
    """
question_text = ["Where is Italy located?",
                 "What is the largest city in Italy?",
                 "What is the most populous EU member state?",
                 "What countries border Italy?",
                 "How large is Italy?",
                 "How large is Italy in Miles?",
                 "What continent is Italy located within?",
                 "Is Italy a country or state?",
                 "What is the relationship between the enclaved microstates of Vatican City and Italy?",
                 "What mountain range is close to Italy?"]
pretty_print_qa(question_answerer, question_text, context_text)

# Example from hard reading comprehension test - hard example
print("\n\n\n\n... EXAMPLE #3 ...")
context_text = \
    """
        'Strange Bedfellows!' lamented the title of a recent letter to Museum News, in which a certain
        Harriet Sherman excoriated the National Gallery of Art in Washington for its handling of
        200,000 tickets to the much-ballyhooed “Van Gogh’s van Goghs” exhibit. A huge proportion
        of the free tickets were snatched up by the opportunists in the dead of winter, who
        then scalped those tickets at $85 apiece to less hardy connoisseurs.
        Yet, Sherman’s bedfellows are far from strange. Art, despite its religious and magical
        origins, very soon became a commercial venture. From bourgeois patrons funding art they
        barely understood in order to share their protegee’s prestige, to museum curators
        stage-managing the cult of artists in order to enhance the market value of museum
        holdings, entrepreneurs have found validation and profit in big-name art. Speculators,
        thieves, and promoters long ago created and fed a market where cultural icons could
        be traded like commodities. This trend toward commodification of high-brow art took
        an ominous, if predictable, turn in the 1980s during the Japanese 'bubble economy.'
        At a time when Japanese share prices more than doubled, individual tycoons and industrial
        giants alike invested record amounts in some of the West’s greatest masterpieces.
        Ryoei Saito, for example, purchased van Gogh’s Portrait of Dr. Gachet for a record-breaking
        $82.5 million. The work, then on loan to the Metropolitan Museum of Modern Art, suddenly
        vanished from the public domain. Later learning that he owed the Japanese government $24
        million in taxes, Saito remarked that he would have the painting cremated with him to spare
        his heirs the inheritance tax. This statement, which he later dismissed as a joke, alarmed
        and enraged many. A representative of the Van Gogh museum, conceding that he had no legal
        redress, made an ethical appeal to Mr. Saito, asserting, 'a work of art remains the
        possession of the world at large'. Ethical appeals notwithstanding, great art will increasingly
        devolve into big business. Firstly, great art can only be certified by its market value.
        Moreover, the 'world at large' hasn’t the means of acquisition. Only one museum currently
        has the funding to contend for the best pieces–the J. Paul Getty Museum, founded by the
        billionaire oilman. The art may disappear into private hands, but its transfer will
        disseminate once static fortunes into the hands of various investors, collectors, and
        occasionally the artist.
    """

question_text = ["What is the main idea being communicated by this passage?",
                 "Which museum might be able to afford to keep or obtain top art pieces?",
                 "What famous artist does this article reference?",
                 "What painting is referenced in this article?",
                 "How much did Ryoei pay for Van Gogh's portrait of Dr. Gachet?",
                 "What did Saito joke about?",
                 "Why would Saito cremate a painting?",
                 "Which group of people does the author of this article like the least?"]
pretty_print_qa(question_answerer, question_text, context_text)





... EXAMPLE #1 ...

----------------------------------------------------------------------------------------------------
CONTEXT:
----------------------------------------------------------------------------------------------------
My name is Sylvain and I work at Hugging Face in Brooklyn
----------------------------------------------------------------------------------------------------

	QUESTION #1: Where do I work?
		ANSWER:	--> Hugging Face
		SCORE:	--> 0.6949766278266907




... EXAMPLE #2 ...

----------------------------------------------------------------------------------------------------
CONTEXT:
----------------------------------------------------------------------------------------------------

        Italy, officially the Italian Republic is a country consisting of a peninsula
        delimited by the Alps and several islands surrounding it,[15] whose territory
        largely coincides with the homonymous geographical region.[16] Italy is located
        in the centr

### SUMMARIZATION

<p style="font-family: Georgia;">Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Since manual text summarization is a time-consuming and generally laborious task, automating it is gaining popularity and therefore constitutes a strong motivation for academic research.</p>
    
<p style="font-family: Georgia;">There are important applications for text summarization in various NLP-related tasks such as text classification, question answering, legal text summarization, news summarization, and headline generation. Moreover, the generation of summaries can be integrated into these systems as an intermediate stage that helps to reduce the length of the document.</p>
    
<p style="font-family: Georgia;">In the big data era, there has been an explosion in the amount of text data from a variety of sources. This volume of text is an inestimable source of information and knowledge which needs to be effectively summarized to be useful. This increasing availability of documents has demanded exhaustive research in the NLP area for automatic text summarization. Automatic text summarization is the task of producing a concise and fluent summary without any human help while preserving the meaning of the original text document.</p>
    
<p style="font-family: Georgia;">It is very challenging, because when we as humans summarize a piece of text, we usually read it entirely to develop our understanding, and then write a summary highlighting its main points. Since computers lack human knowledge and language capability, automatic text summarization is a difficult and non-trivial task.</p>
    
<p style="font-family: Georgia;">Various models based on machine learning have been proposed for this task. Most of these approaches model this problem as a classification problem which outputs whether to include a sentence in the summary or not. Other approaches have used topic information, Latent Semantic Analysis (LSA), Sequence to Sequence models, Reinforcement Learning and Adversarial processes.</p>

<p style="font-family: Georgia;">In general, there are two different approaches for automatic summarization:</p>
<ul style="font-family: Georgia;">
    <li><b>Extractive Summarization</b> - Summarizes using extracted pieces of text from the original corpus that best represent the content.</li>
    <li><b>Abstractive Summarization</b> - Summarizes using generated pieces of text that best represent the context of the original corpus.</li>
</ul>
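<p style="font-family: Georgia;">To make the extractive approach concrete, here is a deliberately simple sketch that scores each sentence by how frequent its words are in the whole text and returns the top sentences verbatim. The function <code>extractive_summary</code> and the sample text are hypothetical, and practical systems add stop-word handling, position features, or learned scoring.</p>

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Preserve the original order of the selected sentences.
    return " ".join(s for s in sentences if s in top)


# Hypothetical sample text for illustration.
sample = (
    "Engineering graduates are declining. "
    "Engineering curricula favor engineering science. "
    "Other countries keep training engineers."
)
print(extractive_summary(sample))
```

<p style="font-family: Georgia;">Every sentence in the result is copied from the input, which is the defining property of extractive summarization; an abstractive model may instead produce wording that never appears in the source.</p>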

Use the <code>min_length</code> and <code>max_length</code> arguments to generate summaries with constrained lengths.</b><br><br>
</div></center>
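
The extractive approach described above, where each sentence receives an include-or-exclude decision, can be illustrated with a very small frequency-based scorer. This is only a sketch under simplifying assumptions, not the method used by the models in this notebook: the helper names (`split_sentences`, `tokenize`, `extractive_summary`), the tiny stop-word list, and the naive sentence splitter are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "as",
              "it", "on", "for", "has", "are"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def extractive_summary(text, num_sentences=2):
    """Keep the highest-scoring sentences, copied verbatim, in source order."""
    sentences = split_sentences(text)
    # Word frequencies over the whole document.
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average document frequency of the sentence's words.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)


sample = ("Engineering graduates are declining. "
          "China trains many engineering graduates. "
          "The weather was nice.")
print(extractive_summary(sample, num_sentences=2))
```

Because every sentence in the result is copied from the input, a scorer like this can drop important content but cannot introduce words that were not in the source.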

In [12]:
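import warnings

from transformers import logging, pipeline

# Silence Python warnings and reduce Transformers logging to errors only
warnings.filterwarnings("ignore")
logging.set_verbosity_error()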

def pretty_print_summary(_model, _text, **kwargs):
    """Print only the summary text once"""
    summary = _model(_text, **kwargs)[0]["summary_text"]
    print(f"\n{'-'*100}\n{summary}\n{'-'*100}")


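# To compare extractive-style and abstractive-style output we instantiate two
# models. Note that both checkpoints are sequence-to-sequence (abstractive)
# models: DistilBART fine-tuned on CNN/DailyMail tends to copy source sentences
# almost verbatim, so it stands in for the extractive style here, while
# PEGASUS fine-tuned on XSum rewrites much more freely.
abstractive_summarizer = pipeline("summarization", model="google/pegasus-xsum")
extractive_summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")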

text = \
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
    """

# We don't include a 'short' answer as it will just be a truncated version of the default.
print("\n\n... EXTRACTIVE MODEL - DEFAULT ANSWER ...")
pretty_print_summary(extractive_summarizer, text)

print("\n\n... EXTRACTIVE MODEL - LONG ANSWER ...")
pretty_print_summary(extractive_summarizer, text, min_length=40, max_length=160)

print("\n\n... ABSTRACTIVE MODEL - DEFAULT ANSWER ...")
pretty_print_summary(abstractive_summarizer, text)

print("\n\n... ABSTRACTIVE MODEL - LONG ANSWER ...")
pretty_print_summary(abstractive_summarizer, text, min_length=40, max_length=160)




... EXTRACTIVE MODEL - DEFAULT ANSWER ...

----------------------------------------------------------------------------------------------------
 China and India graduate six and eight times as many traditional engineers as the U.S. as does other industrial countries . America suffers an increasingly serious decline in the number of engineering graduates and a lack of well-educated engineers . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .
----------------------------------------------------------------------------------------------------


... EXTRACTIVE MODEL - LONG ANSWER ...

----------------------------------------------------------------------------------------------------
 China and India graduate six and eight times as many traditional engineers as the U.S. as does other industrial countries . America suffers an increasingly serious decline in the number of engineering graduates and a lac

### Q: Which of the two techniques (extractive or abstractive summarization) carries the greater risk of hallucination?
Abstractive summarization carries the greater risk of hallucination: when a piece of information is not given in the source, the model falls back on general knowledge to fill in the gap. An extractive summary only reuses text that is already in the source.
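
One rough way to make this difference visible is to count the summary words that never appear in the source. The sketch below is an assumption-laden proxy, not a real hallucination detector: a novel word may be a harmless paraphrase, and a summary built only from source words can still combine them into a false statement. The function names `novel_words` and `novelty_ratio` and the example sentences are hypothetical.

```python
import re


def words(text):
    """Lowercase word and number tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def novel_words(source, summary):
    """Summary words absent from the source, in first-seen order, no repeats."""
    source_vocab = set(words(source))
    seen = []
    for w in words(summary):
        if w not in source_vocab and w not in seen:
            seen.append(w)
    return seen


def novelty_ratio(source, summary):
    """Fraction of summary tokens that do not occur anywhere in the source."""
    summary_words = words(summary)
    if not summary_words:
        return 0.0
    source_vocab = set(words(source))
    return sum(w not in source_vocab for w in summary_words) / len(summary_words)


source = "China and India graduate more engineers than the United States."
extractive_style = "China and India graduate more engineers"
abstractive_style = "Asia trains more engineers because of government subsidies"

print(novel_words(source, extractive_style))   # nothing new: all words copied
print(novel_words(source, abstractive_style))  # words the source never used
print(novelty_ratio(source, abstractive_style))
```

A high ratio is a prompt to check those words against the source, not proof that the summary is wrong.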