https://huggingface.co/blog/bert-101

### 1. Hello world BERT

https://colab.research.google.com/drive/1YtTqwkwaqV2n56NC8xerflt95Cjyd4NE?usp=sharing

In [1]:
from transformers import pipeline



In [2]:
unmasker = pipeline('fill-mask', model='bert-base-uncased')

Downloading: 100%|██████████████████████████████████████████████████████████████████| 440M/440M [00:41<00:00, 10.7MB/s]
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClass

In [3]:
unmasker("Artificial Intelligence [MASK] take over the world.")

[{'score': 0.31824126839637756,
  'token': 2064,
  'token_str': 'can',
  'sequence': 'artificial intelligence can take over the world.'},
 {'score': 0.1829964816570282,
  'token': 2097,
  'token_str': 'will',
  'sequence': 'artificial intelligence will take over the world.'},
 {'score': 0.05600154399871826,
  'token': 2000,
  'token_str': 'to',
  'sequence': 'artificial intelligence to take over the world.'},
 {'score': 0.045194920152425766,
  'token': 2015,
  'token_str': '##s',
  'sequence': 'artificial intelligences take over the world.'},
 {'score': 0.04515308886766434,
  'token': 2052,
  'token_str': 'would',
  'sequence': 'artificial intelligence would take over the world.'}]

In [4]:
unmasker("The man worked as a [MASK].")

[{'score': 0.09747567027807236,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'the man worked as a carpenter.'},
 {'score': 0.0523834191262722,
  'token': 15610,
  'token_str': 'waiter',
  'sequence': 'the man worked as a waiter.'},
 {'score': 0.04962713643908501,
  'token': 13362,
  'token_str': 'barber',
  'sequence': 'the man worked as a barber.'},
 {'score': 0.03788609057664871,
  'token': 15893,
  'token_str': 'mechanic',
  'sequence': 'the man worked as a mechanic.'},
 {'score': 0.03768084570765495,
  'token': 18968,
  'token_str': 'salesman',
  'sequence': 'the man worked as a salesman.'}]

In [5]:
unmasker("The woman worked as a [MASK].")

[{'score': 0.21981455385684967,
  'token': 6821,
  'token_str': 'nurse',
  'sequence': 'the woman worked as a nurse.'},
 {'score': 0.15974152088165283,
  'token': 13877,
  'token_str': 'waitress',
  'sequence': 'the woman worked as a waitress.'},
 {'score': 0.11547324061393738,
  'token': 10850,
  'token_str': 'maid',
  'sequence': 'the woman worked as a maid.'},
 {'score': 0.03796883299946785,
  'token': 19215,
  'token_str': 'prostitute',
  'sequence': 'the woman worked as a prostitute.'},
 {'score': 0.030423857271671295,
  'token': 5660,
  'token_str': 'cook',
  'sequence': 'the woman worked as a cook.'}]

### 2. Visual BERT with Text Classification (Sentence Sentiment model)

https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb

In [6]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

In [7]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

In [8]:
print(df.shape)

(6920, 2)


In [9]:
df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


In [10]:
batch_1 = df[:2000]

In [11]:
batch_1[1].value_counts()

1    1041
0     959
Name: 1, dtype: int64

#### 2.1 Loading the Pre-trained BERT model

In [12]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Downloading: 100%|██████████████████████████████████████████████████████████████████| 268M/268M [00:25<00:00, 10.6MB/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### 2.2 Tokenization

In [13]:
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [17]:
print(tokenized.shape)
tokenized[0]

(2000,)


[101,
 1037,
 18385,
 1010,
 6057,
 1998,
 2633,
 18276,
 2128,
 16603,
 1997,
 5053,
 1998,
 1996,
 6841,
 1998,
 5687,
 5469,
 3152,
 102]

#### 2.3 Padding

In [20]:
max_len = max(tokenized.str.len())

padded = np.array([sentence + [0]*(max_len-len(sentence)) for sentence in tokenized.values])
padded.shape

(2000, 59)

#### 2.4 Masking (to ignore padding)

In [21]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 59)
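
As a minimal sketch of what sections 2.3 and 2.4 do, the same padding and masking can be reproduced on two toy sequences in plain Python. The token ids are illustrative only (101 and 102 mirror the first and last ids in the tokenized output above), and `PAD_ID` is a name introduced here for the padding value 0.

```python
# Toy token-id sequences of different lengths (illustrative values).
sequences = [[101, 1037, 102], [101, 1037, 18385, 1010, 102]]

PAD_ID = 0  # hypothetical name for the padding value used in the cell above

max_len = max(len(seq) for seq in sequences)

# Right-pad every sequence with PAD_ID up to max_len.
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in sequences]

# 1 marks a real token, 0 marks padding that the model should ignore.
attention_mask = [[1 if tok != PAD_ID else 0 for tok in seq] for seq in padded]

print(padded)
print(attention_mask)
```

The mask has the same shape as the padded input, which is why both cells above report `(2000, 59)`.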

#### 2.5 Run DistilBERT model

In [22]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [26]:
print(last_hidden_states[0].shape)

torch.Size([2000, 59, 768])


<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

Keep only the output at the [CLS] position, which serves as a "sentence embedding" for text classification.

In [30]:
features = last_hidden_states[0][:,0,:].numpy()
print(features.shape)

(2000, 768)
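
The slice `[:,0,:]` may be easier to follow on a tiny example. The sketch below uses nested Python lists instead of a tensor, with made-up numbers and a toy shape of `[batch=2][seq_len=3][hidden=4]`; it only illustrates the indexing, not the model.

```python
# Toy "last hidden state": 2 sentences, 3 token positions, 4 hidden units.
# All numbers are invented for illustration.
hidden = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Equivalent of tensor[:, 0, :]: take position 0 (the [CLS] slot) of every sentence.
cls_vectors = [sentence[0] for sentence in hidden]

print(len(cls_vectors), len(cls_vectors[0]))
```

The sequence dimension disappears, leaving one vector per sentence, which matches the change from `[2000, 59, 768]` to `(2000, 768)` above.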


In [31]:
labels = batch_1[1]
labels.shape

(2000,)

#### 2.6 Run Logistic Regression model (on the [CLS] sentence embeddings from BERT)

In [34]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

In [39]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

In [40]:
lr_clf.score(test_features, test_labels)

0.818

#### 2.7 Evaluating model

In [41]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.523 (+/- 0.00)
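
To see where a score of this size comes from, the sketch below computes the accuracy of always predicting the most frequent class. It assumes the dummy classifier behaves that way (the spread of 0.00 suggests a constant prediction), and it uses the class counts of the whole `batch_1` rather than the training split, so the value is close to, but not identical to, the score above.

```python
# Class counts copied from the value_counts() output for batch_1.
counts = {1: 1041, 0: 959}

total = sum(counts.values())

# Label that a "most frequent class" baseline would always predict.
majority_label = max(counts, key=counts.get)

# Accuracy of that constant prediction on the same data.
majority_accuracy = counts[majority_label] / total

print(majority_label, round(majority_accuracy, 4))
```

Against this baseline of roughly 0.52, the logistic regression score of 0.818 shows that the [CLS] features carry real sentiment information.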


### 3. Fine-tuning

https://huggingface.co/docs/transformers/training#finetune-a-pretrained-model

https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/tensorflow/training.ipynb

#### 3.1 Prepare dataset

In [1]:
from datasets import load_dataset

In [2]:
dataset = load_dataset("yelp_review_full")

Found cached dataset yelp_review_full (C:/Users/victor.cordero/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)



In [3]:
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [4]:
from transformers import AutoTokenizer

In [5]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [6]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [7]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at C:/Users/victor.cordero/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf\cache-a64a0ad512b79a7d.arrow



In [8]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Loading cached shuffled indices for dataset at C:/Users/victor.cordero/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf\cache-3d316b0c01c52ca4.arrow


In [9]:
small_train_dataset

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})

#### 3.2 Train

In [10]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

In [11]:
# convert to tensorflow tensors

In [12]:
tf_train_dataset = small_train_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = small_eval_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

In [13]:
tf_train_dataset

<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(None, 512), dtype=tf.int64, name=None), 'token_type_ids': TensorSpec(shape=(None, 512), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(None, 512), dtype=tf.int64, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

In [14]:
# compile and fit fine-tune

In [15]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)

### 4. Question answering

https://huggingface.co/tasks/question-answering

In [17]:
from transformers import pipeline

qa_model = pipeline("question-answering") # distilbert-base-cased-distilled-squad model

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [31]:
question = "Who won the match yesterday?"
context = """Last night Real Madrid and Barcelona played the final of the champions league tournament. 
The result was 2-1 for Real Madrid, although the Barcelona was superior along the all match"""
qa_model(question = question, context = context)

{'score': 0.34080541133880615,
 'start': 114,
 'end': 125,
 'answer': 'Real Madrid'}

In [33]:
question = "Who won the match yesterday?"
context = """Last night Barcelona and Real Madrid played the final of the champions league tournament. 
The result was 2-1 for Barcelona, although the Real Madrid was superior along the all match"""
qa_model(question = question, context = context)

{'score': 0.2991205155849457, 'start': 25, 'end': 36, 'answer': 'Real Madrid'}
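
The `start` and `end` values in these results are character offsets into `context`. A minimal sketch, reusing the first example's context and result dictionary (both copied from above, with the trailing space and line break written out explicitly):

```python
# Context from the first example; the trailing space and the newline are kept
# because the offsets count every character.
context = (
    "Last night Real Madrid and Barcelona played the final of the champions league tournament. \n"
    "The result was 2-1 for Real Madrid, although the Barcelona was superior along the all match"
)

# Result dictionary copied from the first pipeline output above.
result = {"score": 0.34080541133880615, "start": 114, "end": 125, "answer": "Real Madrid"}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
```

In the second example the offsets 25 to 36 point at the first mention of "Real Madrid", even though that context states Barcelona won, and the score of about 0.30 is correspondingly low.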

### 5. Summarization

https://huggingface.co/tasks/summarization

In [34]:
from transformers import pipeline

classifier = pipeline("summarization") # sshleifer/distilbart-cnn-12-6

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [35]:
classifier("""Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 
2018, in an area of more than 105 square kilometres (41 square miles). The City of Paris is the centre and seat of government 
of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 
percent of the population of France as of 2017.""")

Your max_length is set to 142, but you input_length is only 102. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)


[{'summary_text': ' Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018 . The city is the centre and seat of government of the region and province of Île-de-France, or Paris Region . Paris Region has an estimated  population of 12,174,880, or about 18 percent of the population of France .'}]

### 6. Text generation

https://huggingface.co/tasks/text-generation

In [9]:
# text generation

In [1]:
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')  # request GPT-2 explicitly


In [2]:
generator("Hello, I'm a language model", max_length = 30, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model. I make a database and I define a collection of tables. Now, what does this mean? Well, we"},
 {'generated_text': "Hello, I'm a language modeler. One of my main goals is to help people make good languages in their personal development processes by helping you discover"},
 {'generated_text': 'Hello, I\'m a language modeler. I was learning C for a very short time until I was a language geek."\n\nKirk looks'}]

In [5]:
generator("partido Madrid - Leizpig, Gana el Madrid, 2 goles minutos finales", max_length = 60, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'partido Madrid - Leizpig, Gana el Madrid, 2 goles minutos finales de un acesirado, por si esta como e español.\n\nParsas\n\nSaleo Español (El Castillerad'},
 {'generated_text': "partido Madrid - Leizpig, Gana el Madrid, 2 goles minutos finales 1 0 0\n\nGiro started 1-0 ahead of Milan but didn't advance after a late goal-keeping error by David Paterson. Despite a 2-0 deficit in favour"},
 {'generated_text': 'partido Madrid - Leizpig, Gana el Madrid, 2 goles minutos finales, en nombre español - Leizpig de la México, 1.5 goles minutos finales, en y goles minutos'}]

In [6]:
generator("match Madrid - Leipzig, Madrid wins, 2 goals final minutes", max_length = 60, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "match Madrid - Leipzig, Madrid wins, 2 goals final minutes in group A. 'It will start out as a struggle because you have to work hard but it is all about the team,' said Ronaldo. 'And that is my only job and my job is to work hard to win the"},
 {'generated_text': 'match Madrid - Leipzig, Madrid wins, 2 goals final minutes - Madrid v FC Jupp Heynckes (Germany vs Italy, 1:12, 7 points): 2-2, 2 goals, 0 draws, 1 defeat: (Spain vs Netherlands, 1:16, 7 points'},
 {'generated_text': 'match Madrid - Leipzig, Madrid wins, 2 goals final minutes on home soil\n\nJoré Moustafa takes his first ever match victory with Spain\n\nTottenham 2-0 Chelsea: Tottenham were in front almost twice as they were forced to give up late chances in the first'}]

In [10]:
# text2text generation

In [8]:
text2text_generator = pipeline("text2text-generation") # t5-base by default

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [11]:
text2text_generator("question: What is 42 ? context: 42 is the answer to life, the universe and everything")

[{'generated_text': 'the answer to life, the universe and everything'}]

In [12]:
text2text_generator("translate from English to French: I'm very happy")

[{'generated_text': 'Je suis très heureux'}]

In [13]:
text2text_generator("Write an article about football with this information: match Madrid - Leipzig, Madrid wins, 2 goals final minutes")

[{'generated_text': 'with this information: match Madrid - Leipzig, Madrid wins, Madrid wins, 2 goals'}]

In [14]:
text2text_generator("match Madrid - Leipzig, Madrid wins, 2 goals final minutes")

[{'generated_text': 'Madrid - Leipzig, Madrid wins, Madrid wins, 2 goals final minutes, Madrid wins'}]

In [15]:
# text2text generation with T0

In [None]:
text2text_generator = pipeline("text2text-generation", model = "bigscience/T0_3B")

Downloading:   0%|          | 0.00/632 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/11.4G [00:00<?, ?B/s]

In [None]:
text2text_generator("Is the word 'table' used in the same meaning in the two previous sentences? Sentence A: you can leave the books on the table over there. Sentence B: the tables in this book are very hard to read." )

In [None]:
text2text_generator("A is the son's of B's brother. What is the family relationship between A and B?")

In [None]:
text2text_generator("Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy")

In [None]:
text2text_generator("Reorder the words in this sentence: justin and name bieber years is my am I 27 old.")