# Transformers

https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)#:~:text=Transformers%2C%20using%20an%20attention%20mechanism,leads%20to%20improved%20training%20speed.


Transformer architectures in GPT models do not rely on a small fixed-size window of preceding words. Instead, self-attention weights the importance of every token in the sequence, regardless of its positional distance from the token being predicted. This allows the model to consider the entire context (up to a maximum sequence length) when making predictions.
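
As a rough illustration of the idea above, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The function name `self_attention`, the random weight matrices, and the toy sizes are all assumptions made for this example; real GPT models use many heads, learned weights, and additional layers. The `causal` flag hides later positions, which is how next-token predictors are usually set up.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v, causal=False):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (seq_len, seq_len)
    if causal:
        # Mask out positions that come after the current one.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1
    return weights @ v, weights

# Toy inputs: random "token vectors" and random (untrained) projections.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v, causal=True)
```

Each row of `weights` says how much one position attends to every other position, so a distant token can receive as much weight as an adjacent one.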


## GPT
GPT models are trained on a large corpus of text without manual labels (often described as unsupervised or self-supervised learning). They predict the next word in a sequence given all the previous words, thereby learning the probabilities of word sequences in a way that is more flexible and context-aware than n-gram models.
- Why not an n-gram model for ChatGPT? Because an n-gram model conditions only on the previous n-1 words, so it is far less contextually aware than GPT.

GPT models, including ChatGPT, are trained using a variant of the language modeling objective called "autoregressive language modeling". In this training setup, the model learns to predict the probability of the next token (a word or subword) given all the previous tokens. This is done across a massive dataset, enabling the model to learn a wide range of language patterns, idioms, facts, and writing styles.
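
The objective described above can be sketched in a few lines of plain Python. This is only an illustration: `next_token_examples`, `average_nll`, and the `uniform_model` stand-in are hypothetical names, and a real model would produce context-dependent probabilities from a neural network.

```python
import math

def next_token_examples(tokens):
    """Pair every prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def average_nll(tokens, prob_fn):
    """Mean negative log-likelihood of each token given its prefix."""
    examples = next_token_examples(tokens)
    total = sum(math.log(prob_fn(context, nxt)) for context, nxt in examples)
    return -total / len(examples)

tokens = ["the", "cat", "sat", "down"]
vocab = sorted(set(tokens))

def uniform_model(context, next_token):
    # Hypothetical stand-in for a trained model: ignores the context
    # and assigns the same probability to every vocabulary entry.
    return 1.0 / len(vocab)

examples = next_token_examples(tokens)
loss = average_nll(tokens, uniform_model)
```

Training lowers this loss by assigning higher probability to the token that actually follows each prefix.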




# Tokenization

A "token" is an individual unit of data or text. It is the outcome of breaking text down into smaller parts that can be easily analyzed or processed.

Tokens are building blocks for further analysis and model training in various NLP tasks. In the context of robotic learning, "token" doesn't have a standardized, universally accepted meaning as it does in NLP, but it can still denote a symbolic unit of information.

Tokens are the input data of a model. When a model processes text, the first step often involves tokenizing the input, meaning the text is split into manageable pieces such as words, subwords, n-grams, phrases, or characters.

Because models operate on numbers, each token is then mapped to a numerical representation, such as a one-hot vector or a word/subword embedding.

Outputs can be tokens as well, especially in generative or sequence-to-sequence models (such as text generation or translation):
- For text generation, models such as GPT generate one token at a time. The model predicts the next token in the sequence based on the previous tokens, and this process repeats until a stopping criterion is met.
- For sequence-to-sequence tasks, like machine translation, the model's output is a sequence of tokens in the target language, generated based on the tokens of the input sequence in the source language.
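
The one-token-at-a-time loop described in the list above might look roughly like this. The `generate` helper, the `<eos>` stop token, and the lookup-table "model" are assumptions for illustration only; a real model would choose the next token from a predicted probability distribution.

```python
def generate(next_token_fn, prompt, stop_token="<eos>", max_new_tokens=10):
    """Append one predicted token at a time until a stopping criterion is met."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):       # criterion 1: length limit
        token = next_token_fn(tokens)
        if token == stop_token:           # criterion 2: explicit stop token
            break
        tokens.append(token)
    return tokens

# Hypothetical toy "model": looks only at the last token.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": "<eos>"}

def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], "<eos>")

result = generate(toy_next_token, ["the"])
```

Here the two stopping criteria are a maximum number of new tokens and a dedicated end-of-sequence token; other criteria are possible.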


## Symbolic Representation
In symbolic AI (a foundation for some approaches to robotics), a "token" might represent a discrete, symbolic piece of information that can be used in the decision-making process, where different tokens represent different states, actions, or objects that a robot can recognize. Tokens are abstract representations of real-world entities or actions that a robot needs to understand.

For robotic manipulation tasks, "tokens" can refer to classes of objects the robot interacts with. For instance, in a sorting task, different types of objects might be assigned tokens that represent their category, and the robot's learning algorithm uses these tokens to decide how to handle each object.

Tokens can also refer to instructions or commands for the robot, for example when teaching a robot to perform tasks through demonstration or verbal command.

Tokens in reinforcement learning can represent an element of the state or action space. For example, when a robot is learning to navigate an environment, tokens could represent possible movements or interactions (e.g. move forward, turn left) or features of the environment that are relevant to the robot's learning process.
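
To make the navigation example concrete, here is a small sketch in which action tokens are mapped to integer ids and applied to a grid position. The token names, the `(x, y, heading)` state layout, and the `step` function are all hypothetical choices for illustration, not a standard robotics interface.

```python
ACTION_TOKENS = ["move_forward", "turn_left", "turn_right", "stop"]
TOKEN_TO_ID = {token: i for i, token in enumerate(ACTION_TOKENS)}

# Headings as (dx, dy) in clockwise order: north, east, south, west.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def step(state, action_id):
    """Apply one action token to a state of the form (x, y, heading_index)."""
    x, y, heading = state
    action = ACTION_TOKENS[action_id]
    if action == "move_forward":
        dx, dy = HEADINGS[heading]
        return (x + dx, y + dy, heading)
    if action == "turn_left":
        return (x, y, (heading - 1) % 4)
    if action == "turn_right":
        return (x, y, (heading + 1) % 4)
    return state  # "stop" leaves the state unchanged

state = (0, 0, 0)  # at the origin, facing north
for name in ["move_forward", "turn_right", "move_forward"]:
    state = step(state, TOKEN_TO_ID[name])
```

A learning algorithm would then choose among the integer ids rather than among free-form commands.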

In hierarchical learning systems, tokens might represent higher-level goals or subtasks that a robot needs to accomplish as part of a broader task. This helps structure the learning process and decompose complex tasks into manageable subtasks.


## NLP 
In NLP, words, subwords, characters, phrases, or n-grams are considered tokens.
- For words, "Machine Learning is fascinating" can be tokenized as ["Machine", "Learning", "is", "fascinating"].
- For subwords or characters, tokenization may help with handling unknown words, morphological variations, or languages with rich compounding.
- For phrases or n-grams, tokens can be consecutive sequences of words, known as n-grams.
    - For example, in bigram (2-gram) tokenization, the phrase "natural language processing" might be split into ["natural language", "language processing"].
    - N-gram models are a more traditional NLP approach: they look at the previous n-1 words and estimate the probability of the next word from historical occurrences.
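
The bigram split and the count-based probability from the list above can be sketched as follows. The helper names and the tiny corpus are made up for this example, and the estimate is the plain maximum-likelihood ratio of counts with no smoothing.

```python
from collections import Counter

def ngrams(words, n):
    """Return all runs of n consecutive words, joined with spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def bigram_probability(corpus_words, prev_word, word):
    """Estimate P(word | prev_word) as count(prev_word, word) / count(prev_word)."""
    pair_counts = Counter(zip(corpus_words, corpus_words[1:]))
    prev_counts = Counter(corpus_words[:-1])
    if prev_counts[prev_word] == 0:
        return 0.0
    return pair_counts[(prev_word, word)] / prev_counts[prev_word]

bigrams = ngrams("natural language processing".split(), 2)
# bigrams == ["natural language", "language processing"]

corpus = "the cat sat on the mat the cat ran".split()
p = bigram_probability(corpus, "the", "cat")  # "the" is followed by "cat" 2 times out of 3
```

Because the estimate depends only on the immediately preceding word, anything earlier in the sentence has no effect on the prediction.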
    
Tokenization is the process of converting text into tokens. It is a crucial step in preparing data for ML models, turning raw text into a format that algorithms can process.
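
As one possible illustration of subword tokenization, the sketch below splits a word by greedily taking the longest piece found in a vocabulary. The vocabulary, the `<unk>` marker, and the function name are hypothetical; production tokenizers learn their vocabularies from data and differ in the details.

```python
def subword_tokenize(word, vocab, unk="<unk>"):
    """Greedy longest-match segmentation of one word into known subwords."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [unk]  # no known piece starts at this position
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical hand-written vocabulary for the example.
VOCAB = {"token", "ization", "izer", "s", "un", "known"}

pieces = subword_tokenize("tokenization", VOCAB)
```

A word never seen as a whole, such as "tokenizers", can still be represented from known pieces, which is the main appeal of subword units.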

For numerical representations, tokens are often converted into vectors using methods such as one-hot encoding, word embeddings, or subword embeddings. This numerical representation is what is actually fed into the model.
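
A small NumPy sketch of the two representations mentioned above, assuming a four-word vocabulary and a randomly initialized (untrained) embedding table. It also shows that looking up a row of the table gives the same result as multiplying a one-hot vector by the table.

```python
import numpy as np

vocab = ["machine", "learning", "is", "fascinating"]
token_to_id = {token: i for i, token in enumerate(vocab)}

ids = np.array([token_to_id[t] for t in ["machine", "learning", "is"]])

# One-hot: a sparse vector per token with a single 1 at the token's index.
one_hot = np.eye(len(vocab))[ids]                     # shape (3, 4)

# Embedding: a dense vector per token. Random here; learned in a real model.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 3))    # shape (4, 3)

dense_lookup = embedding_table[ids]                   # row lookup
dense_matmul = one_hot @ embedding_table              # same values via matmul
```

The one-hot vectors grow with the vocabulary size, while the embedding dimension (3 here) is chosen independently of it.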

### Importance of Tokens
Feature Representation - Tokens serve as the basis for feature representation in NLP. For text classification, tokens are used to create a "bag-of-words" or TF-IDF (Term Frequency-Inverse Document Frequency) representation.
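
Both representations can be sketched with the standard library. The TF-IDF variant below is one simple, unsmoothed choice (term frequency times `log(N / document frequency)`); libraries typically apply different smoothing and normalization, so treat the exact formula as an assumption of this example.

```python
import math
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs, ignoring order."""
    return Counter(tokens)

def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
counts = bag_of_words(docs[0])
scores = tf_idf(docs)
```

In this toy corpus "the" appears in every document and gets weight 0, while words unique to one document get a positive weight.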

Input to Models - In NLP models based on transformer architectures (e.g. BERT, GPT), tokens are converted into embeddings (dense vector representations) before being fed into the model. These embeddings capture the semantic properties of the tokens.

Handling Variability - Tokenization helps standardize text by breaking it down into uniform units, making it easier to handle variability in language use, such as different forms of a word or compound words.

### Challenges 
Different languages pose unique challenges for tokenization. For example, Chinese and Japanese don't use spaces to separate words, requiring more sophisticated methods to identify token boundaries. 

Tokenization can sometimes be ambiguous, especially with compound words, words that have multiple meanings, or contractions (e.g. "can't" could be tokenized into ["can", "not"]).
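
The two challenges above can be illustrated with a short sketch. The contraction table, the regular expression, and the fallback rule for words ending in "n't" are assumptions for this example; real tokenizers make different choices, which is exactly where the ambiguity comes from.

```python
import re

# Hypothetical table for contractions whose stem changes when expanded.
CONTRACTIONS = {"can't": ["can", "not"], "won't": ["will", "not"]}

def tokenize(text):
    """Lowercase word/punctuation tokenizer that expands negative contractions."""
    tokens = []
    for raw in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", text):
        key = raw.lower()
        if key in CONTRACTIONS:
            tokens.extend(CONTRACTIONS[key])
        elif key.endswith("n't"):
            tokens.extend([key[:-3], "not"])   # e.g. "isn't" -> "is", "not"
        else:
            tokens.append(key)
    return tokens

english = tokenize("Can't stop, won't stop.")

# Splitting on whitespace finds no boundaries in unsegmented Chinese text.
unsegmented = "我喜欢机器学习".split()
```

Whitespace splitting returns the whole Chinese sentence as a single token, so languages written without spaces need a dedicated segmentation step.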

# Attention Mechanism


# Neural Networks
