# 🤗 Transformers

In [None]:
!pip install transformers

In [None]:
from transformers import BertForSequenceClassification
from transformers import BertTokenizer

In [None]:
model_name = "ProsusAI/finbert"

In [None]:
model = BertForSequenceClassification.from_pretrained(model_name)


In [None]:

tokenizer = BertTokenizer.from_pretrained(model_name)


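The prediction pipeline has four steps:

1. Tokenize the input text into token IDs.
2. Pass the token IDs to the model.
3. Convert the model's output activations (logits) to probabilities with softmax.
4. Take the argmax of those probabilities to get the predicted class.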

In [None]:
# this is our example text
txt = ("Given the recent downturn in stocks especially in tech which is likely to persist as yields keep going up, "
       "I thought it would be prudent to share the risks of investing in ARK ETFs, written up very nicely by "
       "[The Bear Cave](https://thebearcave.substack.com/p/special-edition-will-ark-invest-blow). The risks comes "
       "primarily from ARK's illiquid and very large holdings in small cap companies. ARK is forced to sell its "
       "holdings whenever its liquid ETF gets hit with outflows as is especially the case in market downturns. "
       "This could force very painful liquidations at unfavorable prices and the ensuing crash goes into a "
       "positive feedback loop leading into a death spiral enticing even more outflows and predatory shorts.")

In [None]:
tokens = tokenizer.encode_plus(txt,
                               max_length=512,
                               truncation=True,
                               padding='max_length',
                               add_special_tokens=True,
                               return_tensors='pt')

In [None]:
tokens[:15]

{'input_ids': tensor([[  101,  2445,  1996,  3522,  2091, 22299,  1999, 15768,  2926,  1999,
           6627,  2029,  2003,  3497,  2000, 29486,  2004, 16189,  2562,  2183,
           2039,  1010,  1045,  2245,  2009,  2052,  2022, 10975, 12672,  3372,
           2000,  3745,  1996, 10831,  1997, 19920,  1999, 15745,  3802, 10343,
           1010,  2517,  2039,  2200, 19957,  2011,  1031,  1996,  4562,  5430,
           1033,  1006, 16770,  1024,  1013,  1013,  1996,  4783,  2906, 27454,
           1012,  4942,  9153,  3600,  1012,  4012,  1013,  1052,  1013,  2569,
           1011,  3179,  1011,  2097,  1011, 15745,  1011, 15697,  1011,  6271,
           1007,  1012,  1996, 10831,  3310,  3952,  2013, 15745,  1005,  1055,
           5665, 18515, 21272,  1998,  2200,  2312,  9583,  1999,  2235,  6178,
           3316,  1012, 15745,  2003,  3140,  2000,  5271,  2049,  9583,  7188,
           2049,  6381,  3802,  2546,  4152,  2718,  2007,  2041, 12314,  2015,
           2004,  2003,  29

Here we have specified a few arguments that require some explanation.

* `max_length` - this tells the tokenizer the maximum number of tokens we want in each sample. For BERT we almost always use 512, as that is the longest sequence BERT can consume.

* `truncation` - if our input string `txt` contains more tokens than allowed (as specified by the `max_length` parameter), we cut all tokens past the `max_length` limit.

* `padding` - if our input string `txt` contains fewer tokens than specified by `max_length`, we pad the sequence with zeros (0 is the token ID for '[PAD]', BERT's padding token).

* `add_special_tokens` - whether or not to add special tokens, when using BERT we always want this to be True unless we are adding them ourselves.



| Token | ID | Description |
| --- | --- | --- |
| [PAD] | 0 | Fills empty space when the input sequence is shorter than the sequence length the model requires |
| [UNK] | 100 | Represents a word or character that is not found in BERT's vocabulary (the *unknown* token) |
| [CLS] | 101 | Marks the start of a sequence |
| [SEP] | 102 | Separator token that marks the end of a sequence and separates multiple sequences |
| [MASK] | 103 | Masks other tokens; used for masked language modeling |

* `return_tensors` - here we specify either 'pt' to return PyTorch tensors, or 'tf' to return TensorFlow tensors.
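To make the truncation and padding behaviour concrete, here is a minimal pure-Python sketch. `encode_fixed_length` is a hypothetical helper, not part of the transformers API, and the content token IDs are made up. It only mimics what the tokenizer does for a single sequence, using the special-token IDs from the table above.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # IDs taken from the special-token table

def encode_fixed_length(content_ids, max_length):
    """Hypothetical helper: add [CLS]/[SEP], then truncate or right-pad to max_length."""
    # Truncation: keep at most max_length - 2 content tokens so that
    # the two special tokens still fit.
    kept = content_ids[: max_length - 2]
    input_ids = [CLS_ID] + kept + [SEP_ID]
    # Padding: fill any remaining positions with [PAD] (ID 0).
    input_ids += [PAD_ID] * (max_length - len(input_ids))
    return input_ids

short_ids = encode_fixed_length([2445, 1996, 3522], max_length=8)
long_ids = encode_fixed_length(list(range(2000, 2020)), max_length=8)

print(short_ids)  # [101, 2445, 1996, 3522, 102, 0, 0, 0]
print(long_ids)   # [101, 2000, 2001, 2002, 2003, 2004, 2005, 102]
```

Both results have exactly `max_length` entries: the short input is padded on the right, and the long input loses its trailing content tokens while keeping `[CLS]` and `[SEP]`.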

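The output is a dictionary-like object containing three tensors: 'input_ids', 'token_type_ids', and 'attention_mask'. We can ignore 'token_type_ids' here: BERT uses them to tell apart the two segments of a sentence pair, and for a single input sequence like ours they are all zeros. The other two tensors are the ones that matter.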

* `input_ids` are the token ID representations of our input text. These are passed into an embedding layer, where the vector representation of each token is looked up and passed on to the following BERT layers.

* `attention_mask` tells the attention layers in BERT which tokens to calculate attention for. If you look at this tensor you will see that each 1 value maps to a real token in the `input_ids` tensor, whereas each 0 value maps to a padding token in the `input_ids` tensor. In the attention layers, positions that map to padding tokens are masked out, so they contribute effectively nothing to the result.
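The effect of the mask can be sketched in plain Python. This is a simplified illustration of one common masking approach (pushing masked scores to a very large negative value before softmax), not the library's actual implementation. The token IDs, the raw scores, and the `masked_softmax` helper are all made up for the example, and the mask is derived from the pad ID here only for brevity.

```python
import math

PAD_ID = 0
input_ids = [101, 2445, 1996, 3522, 102, 0, 0, 0]  # made-up padded sequence
# 1 for real tokens, 0 for padding positions.
attention_mask = [0 if t == PAD_ID else 1 for t in input_ids]

def masked_softmax(scores, mask):
    """Hypothetical helper: softmax that ignores positions where mask == 0."""
    # Push masked scores far down so their exponentials underflow to zero.
    masked = [s if m == 1 else s - 1e9 for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores of one query token against all 8 positions.
scores = [0.5, 1.2, -0.3, 0.8, 0.1, 2.0, 2.0, 2.0]
weights = masked_softmax(scores, attention_mask)

print(attention_mask)
print([round(w, 3) for w in weights])
```

Even though the padding positions have the highest raw scores in this example, their attention weights come out as zero, and the weights over the real tokens still sum to 1.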

In [None]:
output = model(**tokens)

In [None]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.8200,  2.4484,  0.0216]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [None]:
import torch.nn.functional as F

In [None]:
# softmax over the last (class) dimension turns the logits into probabilities
probs = F.softmax(output.logits, dim=-1)

In [None]:
import torch

# index of the most probable class
pred = torch.argmax(probs)
pred.item()

1
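The last two steps can be reproduced by hand from the logits printed above. The sketch below uses only the standard library. The `id2label` dictionary is an assumption about the label order of `ProsusAI/finbert`; in the notebook it is safer to read the mapping from `model.config.id2label` than to hard-code it.

```python
import math

# Logits copied from the SequenceClassifierOutput printed above.
logits = [-1.8200, 2.4484, 0.0216]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# Argmax: index of the largest probability.
pred = max(range(len(probs)), key=probs.__getitem__)

# ASSUMPTION: label order for ProsusAI/finbert; verify with model.config.id2label.
id2label = {0: "positive", 1: "negative", 2: "neutral"}

print([round(p, 3) for p in probs])
print(pred, id2label[pred])
```

Because softmax preserves the ordering of the logits, the argmax of the probabilities is the same index as the argmax of the raw logits. The softmax step is only needed when the probabilities themselves are of interest.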