### Transformer-based Text Representation

Transformer-based text representation refers to word or sentence embeddings produced by models that use the self-attention mechanism to convert text into vectors (numerical representations). The approach became widespread through models such as BERT, GPT, T5, RoBERTa, and XLNet.
<br>
<br>

- #### Multi-Head Attention:

  Multi-Head Attention is a structure that runs multiple self-attention mechanisms in parallel.

  - Features:

    - Captures different contexts &rarr; Each head has its own learned weights, so different heads can learn word relationships from different perspectives.
    - Runs in parallel &rarr; Each head computes self-attention independently; the head outputs are then concatenated and passed through a linear projection.
    - Richer representation &rarr; A single attention mechanism blends all relationships into one set of weights, while multiple heads keep several distinct patterns side by side.
      <br>
      <br>
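
The parallel-heads idea can be sketched in plain Python. This is a minimal illustration under a strong simplifying assumption: the learned query, key, value, and output projections that real models apply are left out, so each head simply attends over its own slice of the input vectors. The function names `attention` and `multi_head_attention` are hypothetical and do not come from any library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

def multi_head_attention(x, num_heads):
    """Split each vector into num_heads slices, attend per slice, concatenate.

    Assumption: learned projections are omitted, so this only shows the
    split / attend / concatenate flow.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(attention(part, part, part))
    # Concatenate the per-head outputs back into one vector per token.
    return [[val for head in heads for val in head[t]] for t in range(len(x))]

# Three tokens with 4-dimensional vectors (made-up values), two heads.
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, num_heads=2)
print(len(out), len(out[0]))
```

The output keeps one vector per token with the same width as the input, which is what allows attention blocks to be stacked.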

- #### Masked Multi-Head Attention:

  Masked Multi-Head Attention hides future positions, so each token can attend only to itself and to earlier tokens.
  It is used in autoregressive language models (such as GPT) and in the decoder of seq2seq models (e.g., chatbots, machine translation).

  - What’s the difference?

    - Normal Self-Attention &rarr; Every token attends to all tokens in the sequence.
    - Masked Self-Attention &rarr; Each token attends only to itself and to previous tokens; future tokens are "masked" so the model cannot see the words it is trying to predict.
      <br>
      <br>
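
The masking step can be sketched as follows. This is an illustrative example with made-up scores, and `causal_attention_weights` is a hypothetical helper name: scores for future positions are set to negative infinity before the softmax, so those positions receive a weight of exactly zero.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Apply a causal mask to a square score matrix, then softmax each row.

    scores[i][j] is the raw score of query position i for key position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions j > i lie in the future: replace their score with -inf
        # so that softmax assigns them zero probability.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        weights.append(softmax(masked))
    return weights

# Made-up raw attention scores for a 3-token sequence.
scores = [[0.5, 1.0, 2.0],
          [1.0, 1.0, 3.0],
          [0.2, 0.4, 0.6]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The resulting weight matrix is lower triangular: the first token can only attend to itself, while the last token attends to the whole sequence.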

- #### Add & Norm (Addition and Normalization):

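  - These two components wrap each sub-layer (attention and feed-forward) in a transformer block:

    - Residual Connection (Add) &rarr; Adds the sub-layer's input to its output, so the original signal is preserved and gradients flow more easily through deep stacks.
    - Layer Normalization (Norm) &rarr; Normalizes each token's vector, which stabilizes the model's training process.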
      <br>
      <br>
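
A minimal sketch of the two steps on a single token vector, under stated assumptions: it uses the arrangement of the original Transformer (normalize after the addition), whereas many later models normalize before the sub-layer, and it omits the learned scale and shift parameters of layer normalization. The names `layer_norm` and `add_and_norm` and all values are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer_out)]  # Add
    return layer_norm(residual)                          # Norm

x = [1.0, 2.0, 3.0, 4.0]               # input to the sub-layer
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # made-up sub-layer output
y = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in y])
```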

- #### Feed Forward Network (FFN):

  After each attention layer, a transformer applies a fully connected neural network (MLP) to every position independently, using the same weights at each position.
  <br>
  <br>
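
As a rough sketch, a position-wise feed-forward network is two linear layers with a non-linearity between them. The example below assumes ReLU, as in the original Transformer (BERT uses GELU instead), and uses tiny hand-picked weights purely for illustration; `linear` and `feed_forward` are hypothetical helper names.

```python
def linear(x, weights, bias):
    """Compute x @ weights + bias for one vector x (weights: in_dim x out_dim)."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
            for j in range(len(bias))]

def feed_forward(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear, applied to a single token vector."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU
    return linear(hidden, w2, b2)

# Toy weights: expand from 2 to 3 dimensions, then project back to 2.
w1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[2.0, 0.0],
      [0.0, 2.0],
      [1.0, 1.0]]
b2 = [0.5, 0.5]

# The same network is applied to each position separately.
sequence = [[1.0, -2.0], [0.0, 3.0]]
outputs = [feed_forward(token, w1, b1, w2, b2) for token in sequence]
print(outputs)
```

Because each token is processed on its own, this layer does not mix information between positions; that mixing happens only in the attention layers.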

- #### Input & Output Embeddings:

  Transformers process text by converting tokens into vectors.

  - Input Embedding:

    - Each token (a word or subword piece) is converted into a fixed-length vector.
    - These vectors come from a learned embedding table. At this stage they do not depend on context; contextual meaning is added later by the attention layers.
    - Example: "hello" &rarr; [0.12, 0.85, -0.45, ...]
      <br>
      <br>
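
The lookup step can be illustrated with a toy table. The vocabulary, the vector values, and the `embed` helper below are all hypothetical; real models learn the table during training and use far larger vocabularies and vector sizes.

```python
# Hypothetical toy vocabulary and embedding table (values are made up).
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
embedding_table = [
    [0.00, 0.00, 0.00],   # [UNK]: fallback for unknown tokens
    [0.12, 0.85, -0.45],  # hello
    [0.33, -0.10, 0.27],  # world
]

def embed(tokens):
    """Map each token to its id, then look up the corresponding vector."""
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return [embedding_table[i] for i in ids]

vectors = embed(["hello", "world", "hello"])
print(vectors)
```

Both occurrences of "hello" receive the same vector, since a table lookup does not take neighboring tokens into account.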

- #### Linear (Fully Connected Layer):

  In a language-modeling setup, this is the final layer of the transformer: it projects each output vector to a vector of scores (logits), one per vocabulary word, for word prediction.
  <br>
  <br>

- #### Softmax (Probability Distribution):

  Softmax converts the output scores into a probability distribution over the vocabulary. The most likely word can then be selected, or a word can be sampled from the distribution.
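
A small sketch of the final step, assuming a three-word vocabulary and made-up scores: softmax turns the scores into probabilities, and a separate step (here a greedy pick of the maximum) chooses the word.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "bird"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.1]        # made-up scores from the linear layer

probs = softmax(logits)

# Greedy selection: pick the index with the highest probability.
best = max(range(len(vocab)), key=lambda i: probs[i])
predicted = vocab[best]
print(predicted, [round(p, 3) for p in probs])
```

Greedy selection is only one option; text generation often samples from `probs` instead to produce more varied output.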


---


In [11]:
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pretrained model and its tokenizer
model_name = "bert-base-uncased"  # "uncased": input text is lowercased
model = AutoModel.from_pretrained(model_name)  # pretrained BERT encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)  # matching tokenizer

# Text
text = "Transformers are amazing for natural language processing."

# Tokenize
inputs = tokenizer(text, return_tensors="pt")

# Text representation: forward pass without gradient tracking (inference only)
with torch.no_grad():
    outputs = model(**inputs)

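# Output: last_hidden_state has shape [batch, tokens, hidden_size]
last_hidden_state = outputs.last_hidden_state
# [0, 0, :] selects the first token of the first sequence, which is the
# special [CLS] token that the BERT tokenizer prepends to the input
first_token_embedding = last_hidden_state[0, 0, :].numpy()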

print("Text Representation (First token):\n", first_token_embedding)

Text Representation (First token):
 [-1.35336190e-01 -1.13854200e-01 -1.07623339e-01  1.60552442e-01
 -4.46132481e-01 -3.62620324e-01  2.91389942e-01  6.56760871e-01
  1.24576457e-01 -2.41161659e-01  1.84239879e-01  9.58727002e-02
  4.85659018e-02  9.06549916e-02  1.08772933e-01 -5.23374006e-02
 -2.53350526e-01  6.48282230e-01  2.65885174e-01  1.44905701e-01
 -1.14159741e-01 -3.58102292e-01  3.82253021e-01 -2.88915988e-02
  6.59064576e-02 -3.52836013e-01  1.00523211e-01 -2.54869074e-01
 -2.54663825e-01  1.89151853e-01 -3.58156115e-01  4.04400676e-01
 -2.78848946e-01 -4.03717488e-01  4.03150648e-01 -6.64508194e-02
  1.25667036e-01 -1.86263055e-01  1.02758877e-01 -1.58098474e-01
 -2.72423059e-01 -2.71067470e-02  1.15523800e-01 -1.35344446e-01
 -6.70103788e-01 -1.70308184e-02 -3.22344208e+00 -8.86515826e-02
 -3.37871909e-01 -3.52163702e-01  2.26560056e-01 -2.31863350e-01
  6.39325604e-02  3.76260191e-01 -7.27623701e-02  1.21075355e-01
 -2.58751571e-01  3.81401926e-01  8.95989016e-02  1.41