### Hugging Face Transformers: Tasks ###
Use Hugging Face models for NLP tasks
- Text classification
- Named entity recognition
- Question answering
- Text summarization
- Text generation
- Translation

In [1]:
import os
import glob
import numpy as np
import pandas as pd

# PyTorch packages
import torch
import torch.nn as nn

# Hugging Face
from transformers import pipeline
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import set_seed

# Appearance of the Notebook
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
np.set_printoptions(linewidth=110)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)

# Import this module with autoreload
%load_ext autoreload
%autoreload 2
import nlptools as nlpt
print(f'NLP Tools package version:  {nlpt.__version__}')
print(f'PyTorch version:            {torch.__version__}')

NLP Tools package version:  0.0.post1.dev31+gb496182.d20250109
PyTorch version:            2.6.0a0+df5bbc09d1.nv24.11


In [2]:
# GPU checks
is_cuda = torch.cuda.is_available()
print(f'CUDA available: {is_cuda}')
print(f'Number of GPUs found:  {torch.cuda.device_count()}')

if is_cuda:
    print(f'Current device ID:     {torch.cuda.current_device()}')
    print(f'GPU device name:       {torch.cuda.get_device_name(0)}')
    print(f'CUDNN version:         {torch.backends.cudnn.version()}')
    device_str = 'cuda:0'
    torch.cuda.empty_cache() 
else:
    device_str = 'cpu'
device = torch.device(device_str)
print()
print(f'Device for model training/inference: {device}')

CUDA available: True
Number of GPUs found:  1
Current device ID:     0
GPU device name:       NVIDIA GeForce RTX 3070 Laptop GPU
CUDNN version:         90501

Device for model training/inference: cuda:0


### Task: Text classification ###

In [3]:
# Default model: distilbert-base-uncased-finetuned-sst-2-english 
classifier = pipeline(model = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english')
text1 = 'I love you'
text2 = 'I hate you'
outputs = classifier(text2)
display(pd.DataFrame(outputs))

Device set to use cuda:0


Unnamed: 0,label,score
0,NEGATIVE,0.999113
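
The `score` column above is a probability. As a rough sketch of how a single-label classification pipeline arrives at it, the snippet below applies a softmax to two logits and picks the most probable label. The logits are made-up illustrative values (the notebook does not show the real ones), and the `id2label` mapping is assumed to match the SST-2 checkpoint's configuration.

```python
import math

def logits_to_prediction(logits, id2label):
    """Turn raw logits into a {'label', 'score'} dict using a softmax."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted class is the one with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for 'I hate you' (assumption, not taken from the model).
example_logits = [3.9, -3.1]
# Assumed label mapping for the SST-2 sentiment checkpoint.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

prediction = logits_to_prediction(example_logits, id2label)
print(prediction)
```

A logit gap of about 7 already yields a score above 0.999, which is why short, unambiguous inputs tend to produce scores very close to 1.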


### Task: Named entity recognition (NER) ###

In [4]:
ner_tagger = pipeline('ner', aggregation_strategy='simple')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [5]:
text = 'My name is Andreas Werdich and I am working at Harvard Medical School'
outputs = ner_tagger(text)
display(pd.DataFrame(outputs))

Unnamed: 0,entity_group,score,word,start,end
0,PER,0.995936,Andreas Werdich,11,26
1,ORG,0.866248,Harvard Medical School,47,69
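
The table above shows whole entities because `aggregation_strategy='simple'` merges neighbouring token predictions of the same type. The sketch below imitates that idea in a simplified form: it works on word-level predictions instead of WordPiece tokens, and the tags and scores in `token_predictions` are hypothetical. Only the text and the character offsets come from the example above.

```python
text = 'My name is Andreas Werdich and I am working at Harvard Medical School'

# Hypothetical word-level predictions (tags and scores are illustrative only).
token_predictions = [
    {"entity": "B-PER", "score": 0.99, "start": 11, "end": 18},
    {"entity": "I-PER", "score": 0.99, "start": 19, "end": 26},
    {"entity": "B-ORG", "score": 0.90, "start": 47, "end": 54},
    {"entity": "I-ORG", "score": 0.85, "start": 55, "end": 62},
    {"entity": "I-ORG", "score": 0.85, "start": 63, "end": 69},
]

def group_entities(text, predictions):
    """Merge consecutive predictions of the same entity type into one span."""
    groups = []
    for pred in predictions:
        prefix, _, entity_type = pred["entity"].partition("-")
        current = groups[-1] if groups else None
        # An 'I-' tag of the same type continues the current group.
        if current and prefix == "I" and current["entity_group"] == entity_type:
            current["scores"].append(pred["score"])
            current["end"] = pred["end"]
        else:
            groups.append({"entity_group": entity_type,
                           "scores": [pred["score"]],
                           "start": pred["start"],
                           "end": pred["end"]})
    # Report the mean score and recover the surface form from the offsets.
    return [{"entity_group": g["entity_group"],
             "score": sum(g["scores"]) / len(g["scores"]),
             "word": text[g["start"]:g["end"]],
             "start": g["start"],
             "end": g["end"]} for g in groups]

entities = group_entities(text, token_predictions)
for entity in entities:
    print(entity)
```

With `aggregation_strategy='none'` the pipeline would return the individual token predictions instead, which is useful for inspecting where a grouped entity came from.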


In [5]:
# Medical term extraction
model_name = 'blaze999/Medical-NER'
ner_tagger = pipeline(model=model_name, aggregation_strategy='simple')

Device set to use cuda:0


In [6]:
text = 'A 48 year-old female presented with vaginal bleeding and abnormal Pap smears. Upon diagnosis of invasive non-keratinizing SCC of the cervix, she underwent a radical hysterectomy with salpingo-oophorectomy which demonstrated positive spread to the pelvic lymph nodes and the parametrium. Pathological examination revealed that the tumour also extensively involved the lower uterine segment.'
outputs = ner_tagger(text)
display(pd.DataFrame(outputs))

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Unnamed: 0,entity_group,score,word,start,end
0,AGE,0.941736,48 year-old,1,13
1,SEX,0.818373,female,13,20
2,CLINICAL_EVENT,0.853015,presented,20,30
3,BIOLOGICAL_STRUCTURE,0.944933,vaginal,35,43
4,SIGN_SYMPTOM,0.872998,bleeding,43,52
5,LAB_VALUE,0.769152,abnormal,56,65
6,DIAGNOSTIC_PROCEDURE,0.987654,Pap smears,65,76
7,DETAILED_DESCRIPTION,0.934324,invasive,95,104
8,DETAILED_DESCRIPTION,0.961545,non-keratinizing,104,121
9,DISEASE_DISORDER,0.784436,SCC,121,125


In [7]:
model_name = 'varunnagda/bert-medication'
ner_tagger = pipeline(model=model_name, aggregation_strategy='simple')

Device set to use cuda:0


In [8]:
text = """
John Doe, a 53-year-old male, was admitted to City General Hospital on 09/25/2023 with acute exacerbation of COPD 
and community-acquired pneumonia. 
He received high-flow oxygen therapy and intravenous antibiotics, 
which led to significant improvement in his respiratory function. 
By 10/01/2023, he transitioned to oral antibiotics and nasal cannula oxygen. 
Physical therapy sessions enhanced his lung capacity, enabling discharge on 10/05/2023 in stable condition. 
The discharge medications include Albuterol Inhaler, Tiotropium, a tapering course of Prednisone, Azithromycin, and Pantoprazole. 
John is advised to avoid lung irritants, adhere to his medication regimen, and seek medical attention if symptoms recur. 
Follow-up appointments are scheduled with pulmonology on 10/12/2023 and his primary care physician on 10/14/2023.
"""
outputs = ner_tagger(text)
display(pd.DataFrame(outputs))

Unnamed: 0,entity_group,score,word,start,end
0,Medication,0.94449,albuterol,506,515
1,Medication,0.616578,##r,522,523
2,Medication,0.952474,tiotropium,525,535
3,Medication,0.96722,prednisone,558,568
4,Medication,0.972024,azithromycin,570,582
5,Medication,0.975528,pantoprazole,588,600
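
Row 1 in the table above (`##r`) is a leftover WordPiece fragment, not a medication name. One possible clean-up, sketched below, drops any entity whose `word` starts with `##` and optionally applies a score threshold. The rows are copied from the table above; the helper name `clean_entities` and its `min_score` parameter are hypothetical.

```python
# Rows copied from the output table above.
outputs = [
    {"entity_group": "Medication", "score": 0.944490, "word": "albuterol", "start": 506, "end": 515},
    {"entity_group": "Medication", "score": 0.616578, "word": "##r", "start": 522, "end": 523},
    {"entity_group": "Medication", "score": 0.952474, "word": "tiotropium", "start": 525, "end": 535},
    {"entity_group": "Medication", "score": 0.967220, "word": "prednisone", "start": 558, "end": 568},
    {"entity_group": "Medication", "score": 0.972024, "word": "azithromycin", "start": 570, "end": 582},
    {"entity_group": "Medication", "score": 0.975528, "word": "pantoprazole", "start": 588, "end": 600},
]

def clean_entities(entities, min_score=0.0):
    """Drop sub-word fragments ('##...') and entities below min_score."""
    return [
        e for e in entities
        if not e["word"].startswith("##") and e["score"] >= min_score
    ]

medications = [e["word"] for e in clean_entities(outputs)]
print(medications)
```

This only hides the fragment; it does not recover the full phrase "Albuterol Inhaler", which the model split across two spans.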


### Task: Extractive Question-Answering ###

In [9]:
# Context
text = '''
Dear Amazon Customer Service,
I hope this message finds you well. I am writing to bring to your attention an issue with a recent order delivered to me. 
I ordered a web camera (Order #2378), but unfortunately, I received a different model than what I had originally ordered.
I kindly request assistance in resolving this matter. 
Could you please provide guidance on how to return the incorrect item and ensure 
the correct web camera is sent to me as soon as possible? Thank you for your prompt attention to this issue. 
I appreciate your support and look forward to resolving this matter quickly.
Best regards, Andreas Werdich
'''
# Questions
question_list = ['What was wrong with the order?', 
                 'What is the name of the person who wrote the message?',
                 'What is the order number?']

# Models to try
model_name_list = ['distilbert/distilbert-base-cased-distilled-squad',
                   'deepset/roberta-base-squad2']

In [11]:
# Run the models on the task
for model_name in model_name_list:
    print(f'model: {model_name}')
    reader = pipeline(task = 'question-answering', model=model_name)
    outputs = reader(question=question_list, context=text) 
    outputs_df = pd.DataFrame(outputs).assign(question=question_list)
    display(outputs_df)

model: facebook/bart-large-cnn


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at facebook/bart-large-cnn and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


Unnamed: 0,score,start,end,answer,question
0,0.000144,375,443,return the incorrect item and ensure \nthe cor...,What was wrong with the order?
1,0.000146,375,443,return the incorrect item and ensure \nthe cor...,What is the name of the person who wrote the m...
2,0.000142,375,443,return the incorrect item and ensure \nthe cor...,What is the order number?


model: sshleifer/distilbart-cnn-12-6


Some weights of BartForQuestionAnswering were not initialized from the model checkpoint at sshleifer/distilbart-cnn-12-6 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


Unnamed: 0,score,start,end,answer,question
0,0.000267,522,523,\n,What was wrong with the order?
1,0.000192,522,523,\n,What is the name of the person who wrote the m...
2,0.000266,522,523,\n,What is the order number?
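
The warnings above report that the `qa_outputs` weights were newly initialized, so the near-zero scores are not surprising. As a rough illustration, extractive QA can be thought of as choosing the span whose start and end probabilities have the largest product. The sketch below assumes that scoring rule and uses hypothetical logits; `best_span` is an illustrative helper, not the library's implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) of the span maximizing p_start * p_end."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        # The end index must not precede the start index.
        for j in range(i, min(i + max_answer_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Hypothetical logits from a trained head: token 8 ('2378') stands out.
tokens = ["I", "ordered", "a", "web", "camera", "(", "Order", "#", "2378", ")"]
peaked = [0.0] * len(tokens)
peaked[8] = 6.0
start, end, score = best_span(peaked, peaked)
print(tokens[start:end + 1], round(score, 3))

# Hypothetical logits from an untrained head: all positions look alike.
n_tokens = 100
flat_start, flat_end, flat_score = best_span([0.0] * n_tokens, [0.0] * n_tokens)
print(flat_score)
```

With flat logits every span scores roughly `1 / n_tokens**2`, which for a context of around a hundred tokens is on the order of 1e-4, the same magnitude as the scores in the tables above.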


### Task: Text summarization ###

In [12]:
model_name_list = ['facebook/bart-large-cnn',
                   'sshleifer/distilbart-cnn-12-6']
                   
for model_name in model_name_list:
    print(f'model: {model_name}')
    summarizer = pipeline(task='summarization', model=model_name)
    max_length = 130
    min_length = 30
    summary = summarizer(text, 
                         max_length=max_length, 
                         min_length=min_length, 
                         do_sample=True)
    display(summary[0].get('summary_text'))

model: facebook/bart-large-cnn


Device set to use cuda:0


'A German man ordered a web camera but received a different model than what he had ordered. He asked Amazon Customer Service to help him return the incorrect item. The customer service rep sent him a new camera.'

model: sshleifer/distilbart-cnn-12-6


Device set to use cuda:0


' Amazon customer service . Andreas Werdich writes to Amazon Customer Service . He received a different model than what he had originally ordered for a web camera . He asks for assistance in resolving this matter .'
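
Abstractive summaries can contain details that are not in the source, for example "German" in the first summary above. A crude way to spot such candidates, sketched below, is to list the summary words that never occur in the source text. This is only a lexical heuristic (the helper `word_set` is hypothetical), so legitimate paraphrases such as pronouns are flagged as well.

```python
import re

# Source message and first summary, copied from the cells above.
source = """
Dear Amazon Customer Service,
I hope this message finds you well. I am writing to bring to your attention an issue with a recent order delivered to me.
I ordered a web camera (Order #2378), but unfortunately, I received a different model than what I had originally ordered.
I kindly request assistance in resolving this matter.
Could you please provide guidance on how to return the incorrect item and ensure
the correct web camera is sent to me as soon as possible? Thank you for your prompt attention to this issue.
I appreciate your support and look forward to resolving this matter quickly.
Best regards, Andreas Werdich
"""
summary = ('A German man ordered a web camera but received a different model than '
           'what he had ordered. He asked Amazon Customer Service to help him return '
           'the incorrect item. The customer service rep sent him a new camera.')

def word_set(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Words that appear in the summary but nowhere in the source.
novel_words = sorted(word_set(summary) - word_set(source))
print(novel_words)
```

Any flagged word still needs a manual check against the source before it is treated as an error.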

### Task: Text generation ###

In [13]:
response = 'I am sorry that your order was mixed up'
# GPT-2 is not so good at writing a response
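# Flatten the message first: a backslash inside an f-string expression
# is a syntax error before Python 3.12.
flat_text = text.replace('\n', '')
p_1 = f'User: {flat_text}. Customer service representative response: {response}'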

# Let's try a different one
p_2 = 'There was n alligator '

prompt_list = [p_1, p_2]

In [15]:
# Models to try
model_name_list = ['openai-community/gpt2']
                   #'mistralai/Mistral-7B-Instruct-v0.2']

for model_name in model_name_list:
    print(model_name)
    generator = pipeline(task='text-generation', model=model_name)
    for prompt in prompt_list:
        print()
        print(f'prompt: {prompt}')
        set_seed(1334)
        outputs = generator(prompt, max_new_tokens=128, 
                            do_sample=True, return_full_text=False)
        display(outputs)

openai-community/gpt2


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



prompt: User: Dear Amazon Customer Service,I hope this message finds you well. I am writing to bring to your attention an issue with a recent order delivered to me. I ordered a web camera (Order #2378), but unfortunately, I received a different model than what I had originally ordered.I kindly request assistance in resolving this matter. Could you please provide guidance on how to return the incorrect item and ensure the correct web camera is sent to me as soon as possible? Thank you for your prompt attention to this issue. I appreciate your support and look forward to resolving this matter quickly.Best regards, Andreas Werdich. Customer service representative response: I am sorry that your order was mixed up


[{'generated_text': ". I did not receive the same item on my order and my current order. I received the item and it was delivered the same day as you would expect. Your customer service company had promised me this same item which is not only delivered but arrived in my address which is about 5 days before I received it. I am sure this is due to not having a buyer's documentation card used as a delivery method. Any additional inquiries please contact me. Thank you for your quick response.\n\nIf you would like additional guidance, I would like to contact you as soon as possible. Thank you for your prompt response."}]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



prompt: There was n alligator 


[{'generated_text': 'ichthyosaurs, and it has not had dinosaurs," said Mark Ewan Johnson, a doctoral student and former senior scientist of paleoscientists at the Ohio State University, who was not involved with the work.\n\nAn estimated 15 million years ago the dinosaurs disappeared about 5,000 feet away from one of the giants. Archaeologists find skeletons of a young giant that weighs about 500 pounds. The bones of a skeleton known as a jay can go missing in the Cretaceous period about 120,000 years ago.\n\nJohnson said there have been no documented signs of dinosaurs that lived during this era, a rarity even in'}]
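
Because the cell above uses `do_sample=True`, each token is drawn at random, and `set_seed` is what makes a run repeatable. The toy sketch below shows the same effect with Python's `random` module. The vocabulary, logits, and the `sample_tokens` helper are hypothetical, and unlike a language model the distribution here does not depend on the tokens generated so far.

```python
import math
import random

def sample_tokens(token_logits, n_tokens, seed, temperature=1.0):
    """Draw n_tokens from a fixed softmax distribution using a seeded RNG."""
    rng = random.Random(seed)
    vocab = list(token_logits)
    # Unnormalized softmax weights; random.choices normalizes them itself.
    weights = [math.exp(token_logits[t] / temperature) for t in vocab]
    return [rng.choices(vocab, weights=weights, k=1)[0] for _ in range(n_tokens)]

# Hypothetical next-token logits.
toy_logits = {"alligator": 2.0, "swamp": 1.5, "river": 1.0, "teeth": 0.5}

run_a = sample_tokens(toy_logits, n_tokens=8, seed=1334)
run_b = sample_tokens(toy_logits, n_tokens=8, seed=1334)
print(run_a)
print(run_a == run_b)
```

Reusing the seed reproduces the sequence exactly; changing it, or lowering `temperature` to favour the most likely tokens, changes what is drawn.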

### Task: Translation ###

In [17]:
model_name = 'Helsinki-NLP/opus-mt-en-de'
translator = pipeline(model=model_name)

Device set to use cuda:0


In [18]:
outputs = translator(text)
display(text)
print()
display(outputs)

'\nDear Amazon Customer Service,\nI hope this message finds you well. I am writing to bring to your attention an issue with a recent order delivered to me. \nI ordered a web camera (Order #2378), but unfortunately, I received a different model than what I had originally ordered.\nI kindly request assistance in resolving this matter. \nCould you please provide guidance on how to return the incorrect item and ensure \nthe correct web camera is sent to me as soon as possible? Thank you for your prompt attention to this issue. \nI appreciate your support and look forward to resolving this matter quickly.\nBest regards, Andreas Werdich\n'




[{'translation_text': 'Sehr geehrter Amazon Customer Service, ich hoffe, diese Nachricht findet Sie gut. Ich schreibe, um Ihre Aufmerksamkeit auf ein Problem mit einer kürzlich an mich gelieferten Bestellung zu bringen. Ich bestellte eine Web-Kamera (Ordnung #2378), aber leider erhielt ich ein anderes Modell als das, was ich ursprünglich bestellt hatte. Ich bitte um Unterstützung bei der Lösung dieser Angelegenheit. Könnten Sie bitte Hinweise geben, wie Sie den falschen Artikel zurückgeben und sicherstellen, dass die richtige Web-Kamera wird mir so schnell wie möglich gesendet? Vielen Dank für Ihre schnelle Aufmerksamkeit zu diesem Thema. Ich schätze Ihre Unterstützung und freue mich darauf, diese Angelegenheit schnell zu lösen.'}]