<a href="https://colab.research.google.com/github/coder-omer/NLP/blob/main/Transformers_what_can_they_do.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers, what can they do?

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
!pip install datasets transformers[sentencepiece]


## Sentiment Analysis

In [2]:
from transformers import pipeline   # pipeline automates all of the processing steps for us.

classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
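
The message above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision printed in that message. `PINNED_CHECKPOINT` and `load_pinned_classifier` are illustrative names rather than part of the library, and the sketch assumes this short revision hash can still be resolved on the Hub.

```python
# Illustrative names (not part of transformers): PINNED_CHECKPOINT, load_pinned_classifier.
PINNED_CHECKPOINT = {
    "task": "sentiment-analysis",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "af0f99b",  # revision reported in the message above
}


def load_pinned_classifier():
    """Build the pipeline with an explicit model and revision."""
    # Imported lazily so that defining this helper does not load the library.
    from transformers import pipeline

    return pipeline(**PINNED_CHECKPOINT)
```

Calling `load_pinned_classifier()` should then build the same classifier without triggering the "No model was supplied" message.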



In [3]:
classifier("I've been waiting for a HuggingFace course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

In [4]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## How can we use this model on our own data?

In [5]:
classifier("I've been waiting for a HuggingFace course my whole life.")[0]

{'label': 'POSITIVE', 'score': 0.9598049521446228}

In [6]:
classifier("I've been waiting for a HuggingFace course my whole life.")[0]["label"]

'POSITIVE'

In [7]:
texts = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]

# Encode each prediction as an integer: POSITIVE -> 0, NEGATIVE -> 1.
y_pred = []
for text in texts:
    label = classifier(text)[0]["label"]
    y_pred.append(0 if label == "POSITIVE" else 1)
y_pred

[0, 1]
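
Once predictions are encoded as integers, they can be compared with reference labels. The sketch below wraps the encoding step from the loop above in a helper and adds a plain accuracy function. `predict_ids`, `accuracy` and `LABEL_TO_ID` are hypothetical names, and any reference labels you compare against would come from your own data.

```python
def predict_ids(classifier, texts, label_to_id):
    """Run `classifier` on `texts` and map each predicted label to an integer."""
    results = classifier(list(texts))
    return [label_to_id[result["label"]] for result in results]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the reference label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Same encoding as the loop above: POSITIVE -> 0, NEGATIVE -> 1.
LABEL_TO_ID = {"POSITIVE": 0, "NEGATIVE": 1}
```

With the sentiment pipeline loaded, `predict_ids(classifier, texts, LABEL_TO_ID)` should give the same list as the loop, and `accuracy(y_true, y_pred)` scores it against your labels.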

## Zero-Shot Classification

In [8]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [9]:
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445988297462463, 0.11197440326213837, 0.04342682659626007]}

In [10]:
classifier(
    ["world cup will be held in qatar this year", "ballet and theater are indispensable for me.", "This year, a new law on animal rights was enacted."],
    candidate_labels=["education", "politics", "business", "art", "sport", "justice"],
)

[{'sequence': 'world cup will be held in qatar this year',
  'labels': ['sport', 'politics', 'business', 'justice', 'art', 'education'],
  'scores': [0.878333568572998,
   0.04051615670323372,
   0.03470335155725479,
   0.01913805864751339,
   0.017712445929646492,
   0.009596425108611584]},
 {'sequence': 'ballet and theater are indispensable for me.',
  'labels': ['art', 'justice', 'sport', 'business', 'education', 'politics'],
  'scores': [0.9215368628501892,
   0.03311404958367348,
   0.023203495889902115,
   0.009502376429736614,
   0.009387053549289703,
   0.0032561684492975473]},
 {'sequence': 'This year, a new law on animal rights was enacted.',
  'labels': ['justice', 'sport', 'business', 'politics', 'art', 'education'],
  'scores': [0.572504997253418,
   0.1423110067844391,
   0.08730437606573105,
   0.08305704593658447,
   0.06940542906522751,
   0.04541708156466484]}]
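
Each result pairs `labels` with `scores` position by position, so picking a single class means taking the highest-scoring pair. A small sketch, with the hypothetical helper `top_label` and an optional `min_score` cut-off (an assumption added here) for rejecting low-confidence predictions such as the third sentence above:

```python
def top_label(result, min_score=0.0):
    """Return (label, score) for the best candidate, or None if it scores below `min_score`."""
    # max() is used so the helper does not rely on the labels being pre-sorted.
    label, score = max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
    return (label, score) if score >= min_score else None


# Third result copied from the output above.
example = {
    "sequence": "This year, a new law on animal rights was enacted.",
    "labels": ["justice", "sport", "business", "politics", "art", "education"],
    "scores": [0.572504997253418, 0.1423110067844391, 0.08730437606573105,
               0.08305704593658447, 0.06940542906522751, 0.04541708156466484],
}

print(top_label(example))                 # ('justice', 0.572504997253418)
print(top_label(example, min_score=0.8))  # None
```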

## Text Generation (completing a prompt in context)

In [11]:
from transformers import pipeline

generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [12]:
generator("In this course, we will teach you how to")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use the AngularJS CLI to interact in a way that your UI can respond to your Angular application.\n\nThe AngularJS CLI offers many benefits when using RESTful Request/Response frameworks.\n\n'}]

In [13]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2", device="cuda:0")


In [14]:
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to avoid using your computer to make a habit of switching between multiple computers. We will discuss the basic basics'},
 {'generated_text': 'In this course, we will teach you how to write and manipulate data with a specific knowledge. In this course, you will be able to learn how'}]

## Fill Mask (predicting a masked token)

In [15]:
from transformers import pipeline

unmasker = pipeline("fill-mask")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [16]:
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.19619810581207275,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052736610174179,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

## Named Entity Recognition (Token Classification)

In [17]:
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")  # replaces the deprecated grouped_entities=True

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [18]:
ner("My name is Sylvain and I work at Hugging Face in Brooklyn. my phone is 535 555 55 55")

[{'entity_group': 'PER',
  'score': 0.99789876,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9891248,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.993338,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
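
The `start` and `end` fields are character offsets into the input string, so each entity can be recovered by slicing the original text. The phone number is not returned, which is consistent with the CoNLL-03 label set of this model (PER, ORG, LOC, MISC). A minimal sketch that groups the entities above by type; `group_by_type` is a hypothetical helper, and the entity list is copied from the output with scores omitted.

```python
from collections import defaultdict

text = "My name is Sylvain and I work at Hugging Face in Brooklyn. my phone is 535 555 55 55"

# Entities copied from the output above (scores omitted for brevity).
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]


def group_by_type(text, entities):
    """Map each entity type to the text spans found for it, using the character offsets."""
    by_type = defaultdict(list)
    for ent in entities:
        by_type[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(by_type)


print(group_by_type(text, entities))
# {'PER': ['Sylvain'], 'ORG': ['Hugging Face'], 'LOC': ['Brooklyn']}
```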

## Question Answering

In [19]:
from transformers import pipeline

question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [20]:
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.6949766278266907, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

In [21]:
question_answerer(
    question=["Where do you work?", "In which city do you work?", "what is your name"],
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

[{'score': 0.7144075632095337,
  'start': 33,
  'end': 45,
  'answer': 'Hugging Face'},
 {'score': 0.874370276927948, 'start': 49, 'end': 57, 'answer': 'Brooklyn'},
 {'score': 0.990521252155304, 'start': 11, 'end': 18, 'answer': 'Sylvain'}]
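
The list output does not repeat the questions, so it helps to pair each answer with the question that produced it. The sketch below assumes the results come back in the same order as the questions (as the output above suggests) and keeps only answers above an illustrative score threshold of 0.8. `confident_answers` is a hypothetical helper.

```python
questions = ["Where do you work?", "In which city do you work?", "what is your name"]

# Results copied from the output above.
results = [
    {"score": 0.7144075632095337, "start": 33, "end": 45, "answer": "Hugging Face"},
    {"score": 0.874370276927948, "start": 49, "end": 57, "answer": "Brooklyn"},
    {"score": 0.990521252155304, "start": 11, "end": 18, "answer": "Sylvain"},
]


def confident_answers(questions, results, min_score=0.8):
    """Pair questions with answers by position, dropping answers scored below `min_score`."""
    return {
        question: result["answer"]
        for question, result in zip(questions, results)
        if result["score"] >= min_score
    }


print(confident_answers(questions, results))
# {'In which city do you work?': 'Brooklyn', 'what is your name': 'Sylvain'}
```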

## Summarization

In [22]:
from transformers import pipeline

summarizer = pipeline("summarization", device="cuda:0")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [23]:
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

[{'summary_text': ' America has changed dramatically during recent years . The number of engineering graduates in the U.S. has declined in traditional engineering disciplines such as mechanical, civil,    electrical, chemical, and aeronautical engineering . Rapidly developing economies such as China and India continue to encourage and advance the teaching of engineering .'}]

## Translation

In [24]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")


In [25]:
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

## How to use a specific pre-trained model from the Hugging Face Hub

### Question Answering

In [26]:
from transformers import pipeline

question_answerer = pipeline("question-answering", model="deepset/roberta-base-squad2")


In [27]:
question_answerer(
    question="Which team beat real madrid?",
    context="Barcelona became champion of Spain after beating Real Madrid",
)

{'score': 0.9333109259605408, 'start': 0, 'end': 9, 'answer': 'Barcelona'}

### Translation

In [28]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-tr-en")


In [29]:
translator("Bugün çok mutluyum.")

[{'translation_text': "I'm very happy today."}]

### Zero-Shot Classification

In [30]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")



In [31]:
classifier(
    ["Galatasaray bu sene şampiyonlar ligine katılacak", "Savaş sonrası enflasyon arttı"],
    candidate_labels=["spor", "siyaset", "ekonomi", "sanat"],
)



[{'sequence': 'Galatasaray bu sene şampiyonlar ligine katılacak',
  'labels': ['spor', 'siyaset', 'sanat', 'ekonomi'],
  'scores': [0.7193394899368286,
   0.14446699619293213,
   0.06898359209299088,
   0.06720992922782898]},
 {'sequence': 'Savaş sonrası enflasyon arttı',
  'labels': ['ekonomi', 'siyaset', 'sanat', 'spor'],
  'scores': [0.8702464699745178,
   0.10678984224796295,
   0.012205720879137516,
   0.01075796503573656]}]

### Fill Mask

In [32]:
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [33]:
unmasker("Galatasaray [MASK] Fenerbahce 5-0.", top_k=2)

[{'score': 0.479656845331192,
  'token': 3249,
  'token_str': 'defeated',
  'sequence': 'galatasaray defeated fenerbahce 5 - 0.'},
 {'score': 0.4429788887500763,
  'token': 3786,
  'token_str': 'beat',
  'sequence': 'galatasaray beat fenerbahce 5 - 0.'}]

In [34]:
unmasker("Fenerbahce [MASK] 5-0 to Galatasaray", top_k=2)

[{'score': 0.890109658241272,
  'token': 2439,
  'token_str': 'lost',
  'sequence': 'fenerbahce lost 5 - 0 to galatasaray'},
 {'score': 0.04882640764117241,
  'token': 4558,
  'token_str': 'lose',
  'sequence': 'fenerbahce lose 5 - 0 to galatasaray'}]
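
The two fill-mask models in this notebook expect different placeholders: `<mask>` for distilroberta-base and `[MASK]` for bert-base-uncased. One way to avoid hard-coding either is to read the token from the pipeline's tokenizer. In the sketch below, `masked_prompt` and the `{mask}` placeholder are hypothetical, and the `SimpleNamespace` objects are stand-ins that mimic the `.tokenizer.mask_token` attribute so the example runs without downloading a model.

```python
from types import SimpleNamespace


def masked_prompt(unmasker, template, placeholder="{mask}"):
    """Replace `placeholder` in `template` with the mask token of the pipeline's tokenizer."""
    return template.replace(placeholder, unmasker.tokenizer.mask_token)


# Stand-ins exposing the same `.tokenizer.mask_token` attribute as the real pipelines.
roberta_like = SimpleNamespace(tokenizer=SimpleNamespace(mask_token="<mask>"))
bert_like = SimpleNamespace(tokenizer=SimpleNamespace(mask_token="[MASK]"))

print(masked_prompt(roberta_like, "This course will teach you all about {mask} models."))
# This course will teach you all about <mask> models.
print(masked_prompt(bert_like, "Galatasaray {mask} Fenerbahce 5-0."))
# Galatasaray [MASK] Fenerbahce 5-0.
```

With a real pipeline, `unmasker(masked_prompt(unmasker, "..."), top_k=2)` would then work for either model.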

## Sentence Similarity

In [35]:
!pip install -U sentence-transformers



In [36]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')


In [37]:
sentences = ["This is an example sentence", "Each sentence is converted", "Let me give you an example from the sentence"]
embeddings = model.encode(sentences)
embeddings

array([[-0.39309978,  0.03886306,  1.9874251 , ..., -0.6093677 ,
        -1.0946214 ,  0.3264902 ],
       [ 0.06153387,  0.32736215,  1.8332328 , ..., -0.12985355,
         0.4608941 ,  0.2403545 ],
       [ 0.0392002 , -0.08954631,  2.0578134 , ..., -0.11227329,
        -0.9744815 ,  0.11509843]], dtype=float32)

In [38]:
embeddings[0]

array([-3.93099785e-01,  3.88630554e-02,  1.98742509e+00, -1.36893794e-01,
        1.93089887e-01,  3.74967426e-01,  1.15454979e-01,  3.02820861e-01,
        2.32356101e-01, -1.23269022e-01, -2.69239783e-01,  4.10017967e-01,
       -2.14587703e-01,  1.45402700e-01,  4.17345971e-01, -2.67232835e-01,
       -2.92259634e-01, -1.81809559e-01,  9.90739346e-01, -7.87549436e-01,
       -7.95893893e-02,  7.74835050e-01, -3.67453665e-01, -1.04439950e+00,
        3.26537162e-01, -8.63254726e-01,  3.20691079e-01, -1.12830269e+00,
       -4.59388018e-01, -4.49139737e-02,  6.30563498e-02, -6.13953710e-01,
        3.75282139e-01, -1.02702193e-01,  8.16333666e-02,  2.59928197e-01,
        4.26196963e-01, -1.09221926e-02,  1.49220422e-01,  2.61052787e-01,
        8.91624331e-01, -5.76651275e-01,  9.52781379e-01,  1.79337636e-01,
       -9.76019442e-01, -6.75556660e-01, -7.54613757e-01,  3.20075423e-01,
       -3.51041049e-01, -7.56071210e-01, -1.71005118e+00,  3.14682752e-01,
        3.91977638e-01,  

In [39]:
len(embeddings[0])

768

In [40]:
from sklearn.metrics.pairwise import cosine_similarity

In [41]:
# Compare the first sentence with each of the others.
for i in range(1, len(embeddings)):
    print(cosine_similarity(embeddings[[0]], embeddings[[i]]))

[[0.5783693]]
[[0.8971247]]
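
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them. A dependency-free sketch of that formula; `cosine` is a hypothetical helper written for illustration, not the scikit-learn implementation, and raising an error on zero vectors is a choice made here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0  (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0  (orthogonal)
print(cosine([3.0, 4.0], [4.0, 3.0]))  # 0.96
```

Applied to the embeddings above, `cosine(embeddings[0], embeddings[2])` should agree with the scikit-learn value up to floating-point rounding.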


In [42]:
sentences = ["How old are you?", "What is your age?", "How old do I show?"]
embeddings = model.encode(sentences)
# Compare the first sentence with each of the others.
for i in range(1, len(embeddings)):
    print(cosine_similarity(embeddings[[0]], embeddings[[i]]))

[[0.81821966]]
[[0.940009]]
