<a href="https://colab.research.google.com/github/Walidsati/AAI_612O/blob/main/Week3/Notebook3.3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AAI612: Deep Learning & its Applications

*Notebook 3.3: Practice with Hugging Face*



## Experiment with Hugging Face Transformers

In [56]:
text = """Having served on the COVID Vaccine Development committee at Moderna, USA, \
    Dr. Nader was involved in the fight against the pandemic of the century. As \
    the race was on to develop a vaccine – the ultimate defense against a virus \
    of which little was known – what helped to expedite the process at the \
    pharmaceutical and biotechnology company was the availability of the \
    technology – messenger RNA – which had been 10 years in the making.\
    The development of vaccines in record time encapsulates the prerequisites \
    for discovery: research, technology, anticipation and inquiring minds, skills \
    that should be fostered in education."""

### Text Completion

After you run the code below, look at the `score` field in the output. The higher the score, the higher the probability the model assigns to that candidate token.

In [57]:
from transformers import pipeline

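# Build a fill-mask pipeline backed by BERT
bert_unmasker = pipeline('fill-mask', model="bert-base-uncased")
# Note: this reassigns `text`, so later cells run on this sentence until `text` is redefined
text = "I have to wake up in the morning and [MASK] a doctor"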
result = bert_unmasker(text)
for r in result:
    print(r)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


{'score': 0.645744264125824, 'token': 2156, 'token_str': 'see', 'sequence': 'i have to wake up in the morning and see a doctor'}
{'score': 0.17833644151687622, 'token': 2655, 'token_str': 'call', 'sequence': 'i have to wake up in the morning and call a doctor'}
{'score': 0.0750807598233223, 'token': 2424, 'token_str': 'find', 'sequence': 'i have to wake up in the morning and find a doctor'}
{'score': 0.05682665854692459, 'token': 2131, 'token_str': 'get', 'sequence': 'i have to wake up in the morning and get a doctor'}
{'score': 0.006895723287016153, 'token': 2022, 'token_str': 'be', 'sequence': 'i have to wake up in the morning and be a doctor'}
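
To make the scores concrete, the sketch below treats the five results printed above as plain Python dictionaries (copied from the output, with the `token` and `sequence` fields left out for brevity). It shows that each score is a probability: the best candidate has the largest score, and the five together cover most of the probability mass. The variable names are illustrative and not part of the notebook.

```python
# Results copied from the fill-mask output above (only the fields used here).
results = [
    {"score": 0.645744264125824, "token_str": "see"},
    {"score": 0.17833644151687622, "token_str": "call"},
    {"score": 0.0750807598233223, "token_str": "find"},
    {"score": 0.05682665854692459, "token_str": "get"},
    {"score": 0.006895723287016153, "token_str": "be"},
]

# The candidate with the largest score is the model's preferred fill.
best = max(results, key=lambda r: r["score"])

# Scores are probabilities over the vocabulary, so the top five sum to at most 1.
top5_mass = sum(r["score"] for r in results)

print(f"best candidate: {best['token_str']} ({best['score']:.1%})")
print(f"probability covered by the top five: {top5_mass:.1%}")
```

Whatever probability the top five do not cover is spread across the rest of the vocabulary.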


### Text Classification

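The code below runs a sentiment classifier on `text`. The passage defined at the top of the notebook is classified as positive. Can you change that? Note that at this point `text` holds the masked sentence from the previous cell, which is why the output below reads NEGATIVE.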

In [58]:
#hide_output
from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


In [59]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.989723 |


### Named Entity Recognition

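Named entity recognition (NER) detects spans of text that refer to named entities and assigns each one a category. Named entities are the key subjects of a piece of text, such as names, locations, companies, events and products; some tagging schemes also cover themes, topics, times, monetary values and percentages. The default model used below was fine-tuned on CoNLL-2003 and labels only persons (PER), organizations (ORG), locations (LOC) and miscellaneous entities (MISC).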

In [60]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


### Question Answering

In [61]:
reader = pipeline("question-answering")
question = "What was Dr. Nader involved in?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.576397 | 44 | 52 | a doctor |
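
The `start` and `end` values are character offsets into the context string, and checking them explains the answer. This is a minimal sketch, assuming `context` holds the sentence that `text` referred to when this cell ran (the masked sentence from the Text Completion cell):

```python
# The sentence that `text` was reassigned to in the Text Completion cell.
context = "I have to wake up in the morning and [MASK] a doctor"

# Offsets reported by the question-answering pipeline above.
start, end = 44, 52

# Slicing the context with the offsets recovers the reported answer.
answer = context[start:end]
print(answer)
```

Because the slice reproduces the reported answer, the reader was searching the masked sentence rather than the Dr. Nader passage. Redefining `text` before this cell would give the question a relevant context.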


### Summarization

In [62]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0
Your min_length=56 must be inferior than your max_length=45.
Your max_length is set to 45, but your input_length is only 17. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=8)


 "I have to wake up in the morning and [MASK] a doctor," she says. She says she has to be a doctor every day to get up and go to the doctor. "I'm

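Both warnings above come from length settings that do not fit the input: the model's default `min_length` of 56 exceeds the requested `max_length` of 45, and the input is only 17 tokens long. A minimal sketch of one way to choose consistent bounds from the input length follows. The helper `summary_length_bounds`, its default ratio and its floor are assumptions made for illustration, not part of the transformers API.

```python
def summary_length_bounds(input_tokens: int, ratio: float = 0.5, floor: int = 8):
    """Return (min_length, max_length) for a summary (hypothetical helper).

    max_length is a fraction of the input length, never below `floor`;
    min_length is half of max_length, so it is always the smaller value.
    """
    max_length = max(floor, int(input_tokens * ratio))
    min_length = max(1, max_length // 2)
    return min_length, max_length


# 17 is the input length reported in the warning above.
min_len, max_len = summary_length_bounds(17)
print(min_len, max_len)
```

The pair could then be passed as `summarizer(text, min_length=min_len, max_length=max_len)`.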

### Translation

The cell below uses an English-to-German translation model. Can you change it to translate into French? Google will be your best friend for this task :-)
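
One way to approach the exercise is sketched here, assuming the Helsinki-NLP OPUS-MT checkpoints follow the `opus-mt-<source>-<target>` naming used by the German model below. The helper `translation_config` is hypothetical, and not every language pair has a published checkpoint, so confirm the name on the model hub before using it.

```python
def translation_config(source: str, target: str) -> dict:
    """Build keyword arguments for transformers.pipeline (hypothetical helper)."""
    return {
        "task": f"translation_{source}_to_{target}",
        "model": f"Helsinki-NLP/opus-mt-{source}-{target}",
    }


german = translation_config("en", "de")  # matches the cell below
french = translation_config("en", "fr")  # candidate answer to the exercise
print(german)
print(french)
```

Passing the result as `pipeline(**french)` would then load the French model in place of the German one.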

In [63]:
translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Device set to use cuda:0


Ich muss morgen früh aufwachen und [MASK] einen Arzt haben. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt. Ich bin ein guter Arzt.


### Text Generation

In [64]:
#hide
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results

In [65]:
generator = pipeline("text-generation")
response = "Dear Dr. Nader, Thank you for working on the vaccine."
prompt = text + "\n\nResponse to the story:\n" + response
outputs = generator(prompt, max_length=500)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I have to wake up in the morning and [MASK] a doctor

Response to the story:
Dear Dr. Nader, Thank you for working on the vaccine. It was really helpful in figuring out what was going on. When I was speaking she mentioned Ebola on CNN and when she said her doctors looked at this for a few moments, she immediately said, 'I was on a high alert for Ebola here.' I had seen her talk about Ebola in the news, something that had happened a few times before and I was at that moment really thinking about whether there was this really dangerous pandemic right now. So I just said, 'This is what I'm here to talk about.' When she asked for details, she didn't say there was a specific disease, but she said maybe it was Ebola. It was like, I am not a doctor and I know there is no Ebola and no Ebola, I was only telling my doctor how I feel about Ebola, and that I never heard it out loud until, 'Don't do that!' Because I was there for 12 hours and I felt nothing really for 30 minutes on the phone, so I 

In [71]:
text = """Having served on the COVID Vaccine Development committee at Moderna, USA, \
    Dr. Nader was involved in the fight against the pandemic of the century. As \
    the race was on to develop a vaccine – the ultimate defense against a virus \
    of which little was known – what helped to expedite the process at the \
    pharmaceutical and biotechnology company was the availability of the \
    technology – messenger RNA – which had been 10 years in the making.\
    The development of vaccines in record time encapsulates the prerequisites \
    for discovery: research, technology, anticipation and inquiring minds, skills \
    that should be fostered in education."""

In [82]:
classifier = pipeline("text-classification")
classification_result = classifier(text)
print("Sentiment Classification Result:")
print(classification_result)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


Sentiment Classification Result:
[{'label': 'POSITIVE', 'score': 0.9980323910713196}]


In [73]:
translator = pipeline("translation_en_to_fr")
translation = translator(text)
print("\nFrench Translation:")
print(translation[0]['translation_text'])

No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0



French Translation:
Ayant siégé au comité de développement du vaccin contre la COVID à Moderna, aux États-Unis, le Dr Nader a participé à la lutte contre la pandémie du siècle. Alors que la course à la mise au point d’un vaccin – la défense ultime contre un virus dont on sait peu – était en cours, ce qui a aidé à accélérer le processus chez l’entreprise pharmace


In [74]:
ner_pipeline = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True
ner_results = ner_pipeline(text)
print("\nNamed Entities Detected:")
if ner_results:
    for entity in ner_results:
        # Each entity has an 'entity_group' field (e.g., PER, ORG) and a 'word' field.
        print(f"Entity: {entity['word']}, Type: {entity['entity_group']}")
else:
    print("No entities detected.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0



Named Entities Detected:
Entity: COVID V, Type: ORG
Entity: ##cine Development, Type: ORG
Entity: Moderna, Type: ORG
Entity: USA, Type: LOC
Entity: Nader, Type: PER
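
The first two rows are one entity split in two: `##cine` is a WordPiece continuation of "Vaccine", and the piece between "V" and "cine" was presumably not tagged as part of the entity, so simple grouping could not join the halves. A minimal sketch of a post-processing step follows, assuming each entity also carries the `start` and `end` character offsets that the aggregated NER output includes. The helper `merge_split_entities` is hypothetical, and the offsets below were worked out by hand for a shortened context rather than copied from a run.

```python
def merge_split_entities(text, entities):
    """Join neighbouring entities of the same type that fall inside one word.

    Hypothetical helper: two entities are merged when no whitespace separates
    them in `text`, and the merged word is re-read from `text` by offset.
    """
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        # Text between the previous entity and this one; a space if there is no previous.
        gap = text[prev["end"]:ent["start"]] if prev else " "
        same_word = not any(ch.isspace() for ch in gap)
        if prev and prev["entity_group"] == ent["entity_group"] and same_word:
            prev["end"] = ent["end"]
            prev["word"] = text[prev["start"]:prev["end"]]
        else:
            merged.append(dict(ent))  # copy so the input list is left untouched
    return merged


# Shortened context with hand-computed offsets (illustrative, not pipeline output).
context = "COVID Vaccine Development committee at Moderna"
entities = [
    {"entity_group": "ORG", "word": "COVID V", "start": 0, "end": 7},
    {"entity_group": "ORG", "word": "##cine Development", "start": 9, "end": 25},
    {"entity_group": "ORG", "word": "Moderna", "start": 39, "end": 46},
]

result = merge_split_entities(context, entities)
for ent in result:
    print(f"Entity: {ent['word']}, Type: {ent['entity_group']}")
```

Reading the merged word back from the original text, rather than concatenating the `word` fields, is what recovers the untagged middle piece.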


In [83]:
summarizer = pipeline("summarization")
summary = summarizer(text, max_length=80, min_length=50, do_sample=False)
print("\nSummary:")
print(summary[0]['summary_text'])
print("\nComment on Summary Accuracy:")
print("The summary keeps the passage's main points: Dr. Nader's role in the fight against the pandemic, "
      "his service on the Moderna committee, and the prerequisites for discovery. "
      "It leaves out the messenger RNA detail and the closing point about education.")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0



Summary:
 Dr. Nader was involved in the fight against the pandemic of the century . He served on the COVID Vaccine Development committee at Moderna, USA . The development of vaccines in record time encapsulates the prerequisites for discovery: research, technology, anticipation and inquiring minds .

Comment on Summary Accuracy:
The summary keeps the passage's main points: Dr. Nader's role in the fight against the pandemic, his service on the Moderna committee, and the prerequisites for discovery. It leaves out the messenger RNA detail and the closing point about education.


In [84]:
prompt = (
    "Analyze the following review text and perform the following tasks:\n"
    "1. Classify its sentiment, ensuring that the outcome is negative.\n"
    "2. Translate the text to French.\n"
    "3. Identify and list any named entities (e.g., PERSON, ORGANIZATION, etc.).\n"
    "4. Generate a concise summary and comment on the summary's accuracy in capturing the negative aspects.\n\n"
    "Review: \"" + text + "\""
)
print("\nGenerated Prompt:")
print(prompt)


Generated Prompt:
Analyze the following review text and perform the following tasks:
1. Classify its sentiment, ensuring that the outcome is negative.
2. Translate the text to French.
3. Identify and list any named entities (e.g., PERSON, ORGANIZATION, etc.).
4. Generate a concise summary and comment on the summary's accuracy in capturing the negative aspects.

Review: "Having served on the COVID Vaccine Development committee at Moderna, USA,     Dr. Nader was involved in the fight against the pandemic of the century. As     the race was on to develop a vaccine – the ultimate defense against a virus     of which little was known – what helped to expedite the process at the     pharmaceutical and biotechnology company was the availability of the     technology – messenger RNA – which had been 10 years in the making.    The development of vaccines in record time encapsulates the prerequisites     for discovery: research, technology, anticipation and inquiring minds, skills     that shou