<a href="https://colab.research.google.com/github/catafest/colab_google/blob/master/catafest_021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook experiments with BERT, introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (https://arxiv.org/abs/1810.04805).

I used the transformers library. For background, see this video: https://www.youtube.com/watch?v=SZorAJ4I-sA and this paper: https://arxiv.org/pdf/2003.08271.pdf.

The vocab.txt file is distributed with Google's BERT models. One copy of the bert-base-uncased vocabulary can be found here: https://github.com/microsoft/SDNet/blob/master/bert_vocab_files/bert-base-uncased-vocab.txt
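
The vocabulary file is plain text with one token per line, and a token's id is its zero-based line number. A minimal sketch of reading such a file into a lookup table follows. The `load_vocab` helper and the five-line sample file are hypothetical and only stand in for the real vocab.txt, so the ids printed here are not the real BERT ids.

```python
import os
import tempfile


def load_vocab(path):
    """Map each token to its id, i.e. its zero-based line number in the file."""
    vocab = {}
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            vocab[line.rstrip("\n")] = index
    return vocab


# Hypothetical miniature vocabulary; the real file is far longer.
sample_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(sample_tokens) + "\n")

vocab = load_vocab(tmp.name)
os.remove(tmp.name)  # clean up the temporary sample file
print(vocab["[MASK]"])  # id of [MASK] in this toy file only
```

Pointing `load_vocab` at a downloaded vocab.txt should give the same kind of mapping the tokenizer uses internally.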


In [57]:
!pip install transformers



In [58]:
from transformers import pipeline

You can use this model directly with a pipeline for masked language modeling:

In [59]:
unmasker = pipeline('fill-mask', model='bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


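[MASK], [CLS] and [SEP] are special tokens used in BERT's input. [MASK] marks the position the model must predict in masked language modeling. [CLS] is placed at the start of every input, and its final hidden state is used for classification tasks such as *Next Sentence Prediction (NSP)*. [SEP] separates the two sentences of a pair and also marks the end of the input.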
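
To make the layout concrete, here is a minimal sketch in plain Python of how a sentence pair is wrapped with these tokens before it reaches the model. The `build_pair_input` helper is hypothetical. The real tokenizer also splits words into WordPiece units and maps them to ids, so this only shows where [CLS] and [SEP] go and how segment ids tell the two sentences apart.

```python
def build_pair_input(tokens_a, tokens_b):
    """Return (tokens, segment_ids) in the [CLS] A [SEP] B [SEP] layout."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(["one", "[MASK]"], ["two"])
print(tokens)       # ['[CLS]', 'one', '[MASK]', '[SEP]', 'two', '[SEP]']
print(segment_ids)  # [0, 0, 0, 0, 1, 1]
```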

In [60]:
unmasker("Hello I'm a [MASK].")

[{'score': 0.026759833097457886,
  'sequence': "hello i'm a friend.",
  'token': 2767,
  'token_str': 'friend'},
 {'score': 0.020836783573031425,
  'sequence': "hello i'm a lawyer.",
  'token': 5160,
  'token_str': 'lawyer'},
 {'score': 0.020397325977683067,
  'sequence': "hello i'm a doctor.",
  'token': 3460,
  'token_str': 'doctor'},
 {'score': 0.017135562375187874,
  'sequence': "hello i'm a girl.",
  'token': 2611,
  'token_str': 'girl'},
 {'score': 0.01617709919810295,
  'sequence': "hello i'm a stranger.",
  'token': 7985,
  'token_str': 'stranger'}]

In [61]:
unmasker("Tell this word: [MASK].")

[{'score': 0.033313069492578506,
  'sequence': 'tell this word : no.',
  'token': 2053,
  'token_str': 'no'},
 {'score': 0.018001610413193703,
  'sequence': 'tell this word : go.',
  'token': 2175,
  'token_str': 'go'},
 {'score': 0.010956508107483387,
  'sequence': 'tell this word : yes.',
  'token': 2748,
  'token_str': 'yes'},
 {'score': 0.009109552018344402,
  'sequence': 'tell this word : good.',
  'token': 2204,
  'token_str': 'good'},
 {'score': 0.008192724548280239,
  'sequence': 'tell this word : death.',
  'token': 2331,
  'token_str': 'death'}]

In [62]:
unmasker("[CLS] One [MASK] [SEP] Two [SEP]")

[{'score': 0.9190663695335388,
  'sequence': 'one. two',
  'token': 1012,
  'token_str': '.'},
 {'score': 0.057839520275592804,
  'sequence': 'one ; two',
  'token': 1025,
  'token_str': ';'},
 {'score': 0.01602681539952755,
  'sequence': 'one! two',
  'token': 999,
  'token_str': '!'},
 {'score': 0.006792058702558279,
  'sequence': 'one? two',
  'token': 1029,
  'token_str': '?'},
 {'score': 0.00022950586571823806,
  'sequence': 'one | two',
  'token': 1064,
  'token_str': '|'}]
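
Each call returns a list of dictionaries, one per candidate token. Below is a small sketch of post-processing that list, assuming the result format shown above (`score`, `sequence`, `token`, `token_str`). The `top_prediction` helper and its `min_score` threshold are hypothetical, and the sample values are copied from the last output.

```python
def top_prediction(results, min_score=0.5):
    """Return the best token_str, or None when the top score is below min_score."""
    best = max(results, key=lambda item: item["score"])
    return best["token_str"] if best["score"] >= min_score else None


results = [
    {"score": 0.9190663695335388, "sequence": "one. two", "token": 1012, "token_str": "."},
    {"score": 0.057839520275592804, "sequence": "one ; two", "token": 1025, "token_str": ";"},
    {"score": 0.01602681539952755, "sequence": "one! two", "token": 999, "token_str": "!"},
]
print(top_prediction(results))  # prints .
```

With the "Hello I'm a [MASK]." output, whose best score is about 0.027, the same helper returns None. This reflects how uncertain the model is when many words fit the blank.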

The next cells load the base model without any task-specific head. Because the checkpoint was saved together with the pre-training heads (masked language modeling and next sentence prediction), the warning reports that those weights were not used, which is expected.

PyTorch

In [63]:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


TensorFlow

In [64]:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Sentiment analysis of a ProTV news text:

In [65]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I found out what awaits the 139 Afghans who arrived in Romania. The ridiculous amount they will receive from the state.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9735418558120728}]
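
The log line above shows that the pipeline chose a default model because none was given. Here is a hedged sketch of pinning that same model by name, so that results stay comparable if the library's default ever changes. `load_classifier` and `describe` are hypothetical helper names, and the import sits inside the function only so that the formatting helper can be used without loading the library.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"


def load_classifier(model_name=MODEL_NAME):
    """Build a sentiment pipeline with an explicitly named model."""
    # Lazy import; the pipeline call downloads the weights on first use.
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_name)


def describe(result):
    """Format one pipeline result such as {'label': 'NEGATIVE', 'score': 0.97}."""
    return f"{result['label']} ({result['score']:.1%})"


print(describe({"label": "NEGATIVE", "score": 0.9735418558120728}))  # NEGATIVE (97.4%)
```

In a notebook cell this would be used as `classifier = load_classifier()` followed by `describe(classifier("some text")[0])`.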

Sentiment analysis of a text from The Verge: https://www.theverge.com/2021/9/11/22668790/google-one-adds-5tb-storage-plan-photos

In [66]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("After it ended free unlimited storage for Google Photos in June, many Google users had figure out how to store images and other data in the Google accounts. They could keep their Google account stored data under 15GB, or pay for a Google One plan. Options included a 100GB plan for $1.99 per month, a 200GB plan for $2.99 a month, a 2TB plan for $9.99 a month, or a plan with 10TB of storage for $49.99 per month. 20TB and 30TB plans are also available, for $99.99 and $149.99 per month, respectively.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9966930150985718}]

Sentiment analysis of the same article from The Verge, this time with more of its text: https://www.theverge.com/2021/9/11/22668790/google-one-adds-5tb-storage-plan-photos

In [67]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("After it ended free unlimited storage for Google Photos in June, many Google users had figure out how to store images and other data in the Google accounts. They could keep their Google account stored data under 15GB, or pay for a Google One plan. Options included a 100GB plan for $1.99 per month, a 200GB plan for $2.99 a month, a 2TB plan for $9.99 a month, or a plan with 10TB of storage for $49.99 per month. 20TB and 30TB plans are also available, for $99.99 and $149.99 per month, respectively. Now Google’s introduced a middle option between 2TB and 10TB, as noticed by 9to5Google. The 5TB Google One plan costs $24.99 per month, a good (and less expensive) option for people who want a little more than 2TB but don’t quite need a plan with 10TB of storage or more.If you’re sure the 5TB plan will meet your needs, you can save a little money by prepaying for a year’s subscription; it will run you $249.99. Like the 2TB and 10TB plans, the 5TB plan also includes 10 percent off Google Store purchases, the option to add family members, access to Google experts, and a VPN for Android phones.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9948194622993469}]