<a href="https://colab.research.google.com/github/aaalexlit/medium_articles/blob/main/Leveraging_Huggingface_with_pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
%%capture
!pip install transformers
!pip install datasets

In [8]:
%%capture
!pip3 install memory_profiler
%load_ext memory_profiler

In [9]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained("amandakonet/climatebert-fact-checking")
tokenizer = AutoTokenizer.from_pretrained("amandakonet/climatebert-fact-checking")

In [10]:
features = tokenizer(['Beginning in 2005, however, polar ice modestly receded for several years'],
                     ['Polar Discovery "Continued Sea Ice Decline in 2005'],
                     padding='max_length', truncation=True, return_tensors="pt", max_length=512)

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    label_mapping = ['entailment', 'contradiction', 'neutral']
    labels = [label_mapping[score_max] for score_max in scores.argmax(dim=1)]
    print(labels)

['neutral']
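The cell above only reports the winning label. A minimal sketch of how the raw logits could also be turned into a probability, comparable to the `score` a text-classification pipeline reports, is shown below. It uses only the standard library, and the logits in it are made-up illustration values, not output from the model.

```python
import math

# Label order taken from the cell above
LABEL_MAPPING = ['entailment', 'contradiction', 'neutral']


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_probability(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAPPING[best], probs[best]


# Hypothetical logits for a single claim/evidence pair (assumption, not real output)
example_logits = [0.2, -1.3, 2.1]
label, prob = label_with_probability(example_logits)
print(label, round(prob, 3))
```

On the tensors from the cell above, `torch.softmax(scores, dim=1)` performs the same computation for every row at once.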


In [11]:
from datasets import load_dataset

cf_df = load_dataset("amandakonet/climate_fever_adopted", split='test').to_pandas()



In [12]:
input_claims = cf_df['claim'].values.tolist()
input_evidences = cf_df['evidence'].values.tolist()

In [18]:
def predict_using_sample_code():
    # Tokenizes the whole test set at once (module-level input_claims / input_evidences),
    # padding every pair to 512 tokens
    features = tokenizer(input_claims,
                         input_evidences,
                         padding='max_length', truncation=True, return_tensors="pt", max_length=512)

    model.eval()
    with torch.no_grad():
        # Single forward pass over all pairs
        scores = model(**features).logits
        label_mapping = ['entailment', 'contradiction', 'neutral']
        labels = [label_mapping[score_max] for score_max in scores.argmax(dim=1)]
        return labels
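The function above tokenizes and scores every pair in a single forward pass. One way to keep the same approach while bounding memory would be to process the pairs in fixed-size chunks. The sketch below shows only the chunking logic; `iter_batches`, `predict_in_batches`, and the `predict_batch` callback are hypothetical names introduced for illustration, and the stand-in predictor does not call the model.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def iter_batches(claims: Sequence[str], evidences: Sequence[str],
                 batch_size: int) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield aligned (claims, evidences) chunks of at most batch_size pairs."""
    if len(claims) != len(evidences):
        raise ValueError("claims and evidences must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(claims), batch_size):
        end = start + batch_size
        yield list(claims[start:end]), list(evidences[start:end])


def predict_in_batches(claims: Sequence[str], evidences: Sequence[str],
                       predict_batch: Callable[[List[str], List[str]], List[str]],
                       batch_size: int = 16) -> List[str]:
    """Run predict_batch on each chunk and concatenate the labels in input order."""
    labels: List[str] = []
    for claim_chunk, evidence_chunk in iter_batches(claims, evidences, batch_size):
        labels.extend(predict_batch(claim_chunk, evidence_chunk))
    return labels


# Stand-in predictor for demonstration only: it labels every pair 'neutral'
def fake_predict_batch(claim_chunk: List[str], evidence_chunk: List[str]) -> List[str]:
    return ['neutral'] * len(claim_chunk)


demo_labels = predict_in_batches(['c1', 'c2', 'c3'], ['e1', 'e2', 'e3'],
                                 fake_predict_batch, batch_size=2)
print(demo_labels)
```

In this notebook, `predict_batch` could wrap the tokenizer and model calls from `predict_using_sample_code`, taking the two lists as arguments and using `padding=True` so that each chunk is padded only to its longest pair.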

In [13]:
from transformers import pipeline

def predict_using_pipelienes(claims: list[str], evidences: list[str]) -> tuple[list[str], list[float]]:
    # Generator that feeds claim/evidence pairs to the pipeline one at a time
    def claim_evidence_pair_data():
        for claim, evidence in zip(claims, evidences):
            yield {"text": claim, "text_pair": evidence}

    # device=-1 runs on CPU; padding=True pads only to the longest input in a batch
    pipe = pipeline("text-classification", model=model,
                    tokenizer=tokenizer, device=-1,
                    truncation=True, padding=True)
    labels = []
    probs = []
    # The pipeline yields one result dict per input pair
    for out in pipe(claim_evidence_pair_data(), batch_size=1):
        labels.append(out['label'])
        probs.append(out['score'])
    return labels, probs
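The pipeline takes its label strings from the model configuration. If a configuration defines no class names, Transformers falls back to generic names such as `LABEL_0`. Whether that applies to this model is not shown here, so the helper below is only a sketch under that assumption: `readable_label` is a hypothetical name, and the label order is copied from the sample code earlier in the notebook.

```python
# Label order copied from the sample code earlier in the notebook
LABEL_MAPPING = ['entailment', 'contradiction', 'neutral']


def readable_label(pipeline_label: str, label_mapping=LABEL_MAPPING) -> str:
    """Map a generic 'LABEL_<n>' string to its class name; pass other labels through."""
    prefix = 'LABEL_'
    suffix = pipeline_label[len(prefix):]
    if pipeline_label.startswith(prefix) and suffix.isdigit():
        return label_mapping[int(suffix)]
    # Already a descriptive label (or an unknown format): return unchanged
    return pipeline_label


print([readable_label(name) for name in ['LABEL_0', 'LABEL_2', 'contradiction']])
```

Applied to the function above, this would be `[readable_label(name) for name in labels]` on the returned list.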

In [14]:
%%time
# Without xformers on CPU: about 4min 21s of total CPU time, 44.1 s wall time
%memit pred_labels, pred_probs = predict_using_pipelienes(input_claims, input_evidences)

Xformers is not installed correctly. If you want to use memorry_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


peak memory: 1552.07 MiB, increment: 11.77 MiB
CPU times: user 4min 18s, sys: 2.92 s, total: 4min 21s
Wall time: 44.1 s


In [None]:
%%capture
!pip install xformers

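After installing xformers, the measured memory increment drops from 11.77 MiB to 0.25 MiB; peak memory (about 1552 MiB) and wall time (about 44 s) are essentially unchanged.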



In [16]:
%%time
%memit pred_labels, pred_probs = predict_using_pipelienes(input_claims, input_evidences)

peak memory: 1552.35 MiB, increment: 0.25 MiB
CPU times: user 4min 18s, sys: 2.93 s, total: 4min 21s
Wall time: 43.9 s


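The following cell might kill the notebook: it pads every claim/evidence pair to 512 tokens and runs the whole test set through the model in a single forward pass. In the run below it peaked at about 70,648 MiB of memory.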

In [19]:
%%time
%memit labels = predict_using_sample_code()

peak memory: 70648.12 MiB, increment: 52950.69 MiB
CPU times: user 12min 46s, sys: 6min 37s, total: 19min 23s
Wall time: 3min 48s
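A rough back-of-the-envelope estimate helps explain the peak above. Standard self-attention materializes a score tensor of shape (pairs, heads, sequence length, sequence length) in each layer, so padding every pair to 512 tokens is costly. The sketch below computes the size of that one tensor; the head count of 12, the float32 values, and the pair count of 1,000 are assumptions chosen for illustration, not figures taken from this model or dataset.

```python
def attention_scores_mib(n_pairs: int, seq_len: int = 512,
                         n_heads: int = 12, bytes_per_value: int = 4) -> float:
    """Size in MiB of one layer's attention score tensor.

    The tensor has shape (n_pairs, n_heads, seq_len, seq_len).
    n_heads=12 and 4-byte floats are assumptions, not values read from the model.
    """
    return n_pairs * n_heads * seq_len * seq_len * bytes_per_value / 2**20


# One pair padded to 512 tokens versus one pair padded to a hypothetical 64 tokens
per_pair_padded = attention_scores_mib(1, seq_len=512)
per_pair_short = attention_scores_mib(1, seq_len=64)

# Hypothetical batch of 1,000 pairs in a single forward pass
whole_batch = attention_scores_mib(1000, seq_len=512)

print(per_pair_padded, per_pair_short, whole_batch)
```

Because the cost grows with the square of the sequence length, padding only to the longest pair in a small batch, as the pipeline version does with `padding=True` and `batch_size=1`, keeps this tensor far smaller.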
