# Applications of Transformers: hands-on with BERT

## 10.1 Introduction: Working with BERT in practice
* Pretraining BERT and other transformer models from scratch on large amounts of data is expensive :(
    * BERT: an estimated $10K for the original authors
    * XLNet: $60K
    * GPT-3: $4.6M
* Luckily, smaller pretrained models are becoming available for free!
    * Ideally, that means you can download a pre-trained model and incorporate it into your own network

## 10.2 A BERT layer
* In deep learning networks, BERT layers can be used on top of an input layer!
    * They encode words in the input layer
* To work with BERT, we need
    * A pre-trained BERT model
    * A facility for importing such a model and exposing it to the rest of our code
* Thank Google for TensorFlow Hub!
    * A site for downloading models and other pre-constructed deep learning networks.
* When working with BERT, we need to balance time complexity and quality

## 10.3 Training BERT on your own data
* There are many Python frameworks based on Keras that allow us to work with BERT
    * fast-bert
    * keras-bert
* Start with a list of normal sentences
    * In this case we're going to use Edgar Allan Poe's The Cask of Amontillado.
    * We can choose to split on '.' but we can also delimit on other closing punctuation to get additional (pseudo-)sentences.
* Next create the BERT data
    * We produce this with a generator, so the full dataset never has to fit in the machine's memory at once.
    * Under the hood, keras_bert inserts the [CLS] and [SEP] delimiters into our paired sentence data.
* Overall process:
    * TensorFlow Hub: load tokenizer
        * Tokenize data, mask data, generate segment positions, gather label of input example
        * Processed Data
    * Data -> Process Data
        * convert_examples_to_features
* For our complementary training tasks:
    * Some stored sentiment data
        * I hate pizza, negative
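
To make the data steps above concrete, here is a minimal pure-Python sketch of the pipeline: split raw text into (pseudo-)sentences, lazily pair up consecutive sentences, and turn each pair into padded features. It is not the keras_bert API (which inserts the delimiters and does the masking itself). The function names, the sample text, the whitespace tokenizer, and the toy vocabulary are all assumptions made for illustration; real BERT uses a fixed WordPiece vocabulary.

```python
import re


def split_sentences(text, delimiters=".!?"):
    """Split raw text on closing punctuation; more delimiters give more (pseudo-)sentences."""
    parts = re.split("[" + re.escape(delimiters) + "]+", text)
    return [part.strip() for part in parts if part.strip()]


def sentence_pairs(sentences):
    """Lazily yield consecutive (first, second) token lists, one pair at a time."""
    previous = None
    for sentence in sentences:
        tokens = sentence.lower().split()  # stand-in for a real WordPiece tokenizer
        if previous is not None:
            yield previous, tokens
        previous = tokens


def pair_to_features(tokens_a, tokens_b, vocab, max_seq_len):
    """Turn one sentence pair into padded id, mask, and segment lists."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    # Reserve room for [CLS] and two [SEP] tokens, trimming the longer side first.
    while len(tokens_a) + len(tokens_b) > max_seq_len - 3:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + first sentence + [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    input_mask = [1] * len(input_ids)  # 1 = real token, 0 = padding
    padding = max_seq_len - len(input_ids)
    return {
        "tokens": tokens,
        "input_ids": input_ids + [vocab["[PAD]"]] * padding,
        "input_mask": input_mask + [0] * padding,
        "segment_ids": segment_ids + [0] * padding,
    }


# Made-up sample text standing in for the story.
sample_text = "The wine is old. He tasted it! We walked on; the vaults were damp."
sentences = split_sentences(sample_text, delimiters=".!?;")

# Toy vocabulary built from the sample itself (an assumption for this sketch).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3}
for sentence in sentences:
    for word in sentence.lower().split():
        vocab.setdefault(word, len(vocab))

features = [pair_to_features(a, b, vocab, max_seq_len=12)
            for a, b in sentence_pairs(sentences)]
print(features[0]["tokens"])
```

Because `sentence_pairs` is a generator, it also accepts a file iterator or any other lazy source, so the pairs never need to be held in memory all at once.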

## 10.4 Fine-tuning BERT
* Pick up a pre-trained model and use a labeled dataset!
    * BERT is added as a layer that learns the labeling task
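
As a rough sketch of what this can look like in Keras: the code below parses labeled lines such as the pizza example and stacks a small classification head on a trainable BERT encoder loaded from TensorFlow Hub. This is one possible setup, not necessarily the one the chapter uses. The two hub handles, the `positive` example line, the head (dropout plus one dense unit), and the learning rate are assumptions for illustration. It also assumes a TensorFlow 2.x install where `tf.keras` is Keras 2 (for example TF 2.15), with `tensorflow_hub` and `tensorflow_text` available.

```python
# Hub handles are assumptions for illustration; any matching
# preprocessor/encoder pair could be swapped in.
PREPROCESS_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_HANDLE = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1"

LABELS = {"negative": 0, "positive": 1}


def parse_labeled_line(line):
    """Split 'text, label' on the last comma so commas inside the text survive."""
    text, label = line.rsplit(",", 1)
    return text.strip(), LABELS[label.strip()]


def build_classifier():
    """Stack a small classification head on a trainable BERT encoder."""
    # Imported lazily so the parsing helper works without TensorFlow installed.
    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text  # noqa: F401  (registers ops the preprocessor needs)

    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
    encoder_inputs = hub.KerasLayer(PREPROCESS_HANDLE, name="preprocessing")(text_input)
    # trainable=True lets the BERT weights be updated during fine-tuning.
    outputs = hub.KerasLayer(ENCODER_HANDLE, trainable=True, name="bert")(encoder_inputs)
    pooled = tf.keras.layers.Dropout(0.1)(outputs["pooled_output"])
    logits = tf.keras.layers.Dense(1, name="classifier")(pooled)

    model = tf.keras.Model(text_input, logits)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        # Threshold 0.0 because the model outputs logits, not probabilities.
        metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)],
    )
    return model


# The second line is a hypothetical example added to show a positive label.
lines = ["I hate pizza, negative", "Well, I love pizza, positive"]
texts, labels = zip(*(parse_labeled_line(line) for line in lines))
```

With TensorFlow installed, training would then be along the lines of `build_classifier().fit(tf.constant(texts), tf.constant(labels), epochs=3)`, on a realistically sized dataset rather than two lines.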

## 10.5 Inspecting BERT
* Unlike Word2Vec, BERT can identify homonyms since it uses contextual vectors!

## 10.6 Applying BERT
* Large models like BERT/Word2Vec can exhibit bias due to the sources underlying the models.
    * Can be measured, as a first step toward counteracting it, using
        * Word-Embedding Association Test
        * Relational Inner Product Association Test
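
For a feel of how the Word-Embedding Association Test works, here is a minimal sketch of its effect size: for each target word, take its mean cosine similarity to attribute set A minus its mean similarity to attribute set B, then compare the two target sets. The 2-D vectors are made-up toy values, not real embeddings, and using the sample standard deviation is an implementation choice in this sketch. With BERT you would also first have to decide how to extract a vector per word, since its vectors are contextual.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    sim_a = statistics.mean(cosine(w, a) for a in attrs_a)
    sim_b = statistics.mean(cosine(w, b) for b in attrs_b)
    return sim_a - sim_b


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations, normalized by the spread over all targets."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    # Sample standard deviation (an assumption; some implementations use population std).
    return (statistics.mean(s_x) - statistics.mean(s_y)) / statistics.stdev(s_x + s_y)


# Toy vectors: targets_x lean toward attribute A, targets_y toward attribute B.
targets_x = [[1.0, 0.1], [0.9, 0.2]]
targets_y = [[0.1, 1.0], [0.2, 0.9]]
attrs_a = [[1.0, 0.0]]
attrs_b = [[0.0, 1.0]]

effect = weat_effect_size(targets_x, targets_y, attrs_a, attrs_b)
print(f"WEAT effect size: {effect:.3f}")
```

A value near zero suggests the two target sets relate to the attributes in the same way; a large positive or negative value points to a systematic association worth investigating.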

## 10.7 Summary
* Existing BERT models can be imported into your Keras network.
* You can train BERT on your own (raw text) data.
* Fine-tuning BERT models on additional labeled data (downstream tasks) may be beneficial.
* As with any data-driven model, BERT is susceptible to bias, and may produce undesirable associations between words, reflecting cultural and societal biases that have crept into the raw data underlying a BERT model. It is important to be aware of this as an NLP engineer.