<a href="https://colab.research.google.com/github/gitmystuff/HuggingFace/blob/main/AutoTokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AutoTokenizer

* https://dev.to/ajmal_hasan/using-hugging-face-models-in-google-colab-a-beginners-35ll

## Diffuser Sidetrack

Switch to GPU

In [1]:
#  !nvidia-smi

In [2]:
# from diffusers import StableDiffusionPipeline
# import torch

# model_id = "CompVis/stable-diffusion-v1-4"
# pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# pipe = pipe.to("cuda")

# prompt = "Cat playing upright bass"
# image = pipe(prompt).images[0]
# image.save("cat.png")

## Datasets

In [3]:
# pip install datasets

In [4]:
# get data

In [5]:
# create df

In [6]:
# sentiment analysis pipeline

In [7]:
# sa example

In [8]:
# labels == 0

In [9]:
# def classify_sentiment(row):
#     label = sa(row)[0]['label']
#     if label == 'POSITIVE':
#         return 1
#     else:
#         return 0


In [10]:
# accuracy

## Tokens and Word Embeddings

* https://medium.com/data-science-collective/modern-nlp-tokenization-embedding-and-text-classification-448826f489bf

In the context of Hugging Face, a tokenizer is a crucial component of the natural language processing (NLP) pipeline. Here's a breakdown:

* **Purpose:**
    * Tokenizers are responsible for converting raw text into a format that machine learning models can understand. Since models work with numbers, tokenizers translate text into numerical representations.
    * Essentially, they break down text into smaller units (tokens) and then map those tokens to numerical IDs.

* **Key Functions:**
    * **Tokenization:** Splitting text into meaningful units (words, subwords, or characters).
    * **Vocabulary Creation:** Establishing a mapping between tokens and their corresponding numerical IDs.
    * **Preprocessing:** Handling tasks like normalization (e.g., lowercase conversion), truncation, and padding.
    * **Adding Special Tokens:** Incorporating tokens that have special meanings for models (e.g., `[CLS]`, `[SEP]`, `[UNK]`).

* **Hugging Face's Role:**
    * The Hugging Face `tokenizers` library provides efficient and versatile implementations of various tokenization algorithms.
    * It's designed for performance, making it suitable for both research and production environments.
    * Hugging Face also provides pretrained tokenizers that are designed to work with its pretrained models. This ensures that text is processed the same way it was when the model was trained.

* **Importance:**
    * Tokenization is a fundamental step in NLP, as it directly impacts the quality of the model's input.
    * Choosing the right tokenizer is essential for achieving optimal performance.

In essence, a Hugging Face tokenizer bridges the gap between human language and machine understanding, enabling NLP models to effectively process and analyze text.


### Text -> Tokenization -> Input IDs (Numerical Representation) -> Word Embeddings (Vectors)

1.  **Tokenization:**
    * This is the initial step where raw text is broken down into smaller, meaningful units called "tokens." These tokens can be words, subwords (like parts of words), or even individual characters.
    * The specific method of tokenization can vary (e.g., word-based, subword-based like Byte-Pair Encoding (BPE) or WordPiece).
    * Hugging Face tokenizers handle this step efficiently.

2.  **Conversion to Input IDs:**
    * Once the text is tokenized, each token is then converted into a numerical representation. This is done by looking up the token in the tokenizer's vocabulary, which assigns a unique ID to each token.
    * This results in a sequence of numbers, called input IDs, that the model can understand.

3.  **Word Embeddings (Vectorization):**
    * These input IDs are then fed into an embedding layer within the neural network.
    * The embedding layer transforms each input ID into a dense vector of real numbers, known as a word embedding.
    * These embeddings capture the semantic meaning of the tokens, allowing the model to understand relationships between words.
    * In other words, each input ID is an index used to look up the corresponding vector (row) in the embedding matrix.

Therefore, to summarize:

* Text -> Tokenization -> Input IDs (Numerical Representation) -> Word Embeddings (Vectors)

This process allows NLP models to effectively process and understand textual data.
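
The pipeline above can be sketched with plain Python. This is a toy illustration, not how a Hugging Face tokenizer is implemented: the vocabulary, the whitespace `tokenize` function, and the random embedding matrix are all assumptions made up for the example.

```python
import random

# Hypothetical toy vocabulary; a real tokenizer ships a much larger one.
vocab = {"[UNK]": 0, "i": 1, "like": 2, "cats": 3, "dogs": 4}

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for subword tokenization)."""
    return text.lower().split()

def to_input_ids(tokens):
    """Look each token up in the vocabulary, falling back to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

# Toy embedding matrix: one row of 4 random floats per vocabulary entry.
random.seed(0)
embedding_dim = 4
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]

def embed(input_ids):
    """An input ID is simply a row index into the embedding matrix."""
    return [embedding_matrix[i] for i in input_ids]

tokens = tokenize("I like cats")
input_ids = to_input_ids(tokens)
vectors = embed(input_ids)

print(tokens)                         # ['i', 'like', 'cats']
print(input_ids)                      # [1, 2, 3]
print(len(vectors), len(vectors[0]))  # 3 4
```

In a trained model the rows of the embedding matrix are learned rather than random, which is what lets them carry meaning.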


In [11]:
# bert

In [12]:
# tokenizer

In [13]:
# tokens

In [14]:
# multi line

**1. `input_ids`**

* **What it is:**
    * This is the core output. It's a list of lists, where each inner list represents a sentence from your `data` list.
    * Each number within those inner lists is the numerical ID of a token from the BERT vocabulary.
* **Explanation of the numbers:**
    * `101` represents the `[CLS]` token (classification token). It's added at the beginning of each sentence and is often used for tasks like sentence classification.
    * `102` represents the `[SEP]` token (separator token). It's added at the end of each sentence, signifying the end.
    * The numbers in between (e.g., `1045`, `2066`, `8870`) are the IDs of the words or subwords that were tokenized from your sentences.
    * For example:
        * `1045` corresponds to "i"
        * `2066` corresponds to "like"
        * `8870` corresponds to "cats"
    * The tokenizer's vocabulary maps each of these numbers back to their corresponding tokens.
* **Why it's important:**
    * These `input_ids` are the direct numerical representation of your text that the BERT model (or any transformer-based model) uses as input.

**2. `token_type_ids`**

* **What it is:**
    * This is another list of lists, mirroring the structure of `input_ids`.
    * It's used primarily for tasks involving multiple sentences (e.g., question answering, sentence pair classification).
    * It indicates which segment each token belongs to.
* **Explanation in your case:**
    * In your example, all the `token_type_ids` are `0`. This is because you're only processing single sentences.
    * If you were to provide two sentences to the tokenizer (e.g., a question and an answer), the `token_type_ids` would be used to distinguish between them. The first sentence would have `0`s, and the second sentence would have `1`s.
* **Why it's important:**
    * It helps the model understand the relationships between different segments of text.

**3. `attention_mask`**

* **What it is:**
    * This is also a list of lists, with the same structure as `input_ids`.
    * It's used to tell the model which tokens should be attended to and which should be ignored.
* **Explanation in your case:**
    * All the values in your `attention_mask` are `1`. This means that all the tokens are considered valid and should be attended to.
    * The attention mask becomes particularly important when you're dealing with sequences of varying lengths. To handle this, tokenizers often pad shorter sequences to match the length of the longest sequence in a batch.
    * When padding is added, the `attention_mask` is used to indicate which tokens are real (1) and which are padding (0). The model then ignores the padding tokens during attention.
* **Why it's important:**
    * It ensures that the model focuses on the relevant parts of the input and avoids being misled by padding tokens.

**In summary:**

* `input_ids` are the numerical representation of your text.
* `token_type_ids` are used to distinguish between different segments of text.
* `attention_mask` is used to indicate which tokens should be attended to and which should be ignored (especially padding tokens).

These outputs are essential for feeding your text data into a BERT model and allowing it to perform various NLP tasks.
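
To make the role of padding concrete, here is a minimal sketch that builds the three outputs by hand for a batch of unequal-length sentences. The helper `pad_batch` is hypothetical, the shorter ID sequences are invented from the IDs quoted above, and the padding ID of `0` is an assumption for illustration.

```python
# Hypothetical token IDs for three sentences of different lengths.
# 101 and 102 stand for [CLS] and [SEP]; 0 is assumed to be the [PAD] ID.
PAD_ID = 0
batch = [
    [101, 1045, 2066, 8870, 102],
    [101, 1045, 2066, 102],
    [101, 8870, 102],
]

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, token_type_ids = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Single sentences belong entirely to segment 0.
        token_type_ids.append([0] * max_len)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }

encoded = pad_batch(batch)
for key, value in encoded.items():
    print(key, value)
```

Every inner list now has the same length, and the zeros in `attention_mask` line up exactly with the padded positions in `input_ids`.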



In [15]:
# from transformers import AutoTokenizer
# import pprint

# model = 'bert-base-uncased'
# tokenizer = AutoTokenizer.from_pretrained(model)

# sentence1 = "What is the capital of France?"
# sentence2 = "Paris is the capital."

# encoded_input = tokenizer(sentence1, sentence2, padding="max_length", max_length=15)
# for k, v in encoded_input.items():
#     print(k, v)

# ids = encoded_input['input_ids']
# print(len(ids))
# pprint.pprint(tokenizer.decode(ids))

### List of Sentences

In [16]:
# model = 'bert-base-uncased'
# tokenizer = AutoTokenizer.from_pretrained(model)

# sentence1 = "What is the capital of France?"
# sentence2 = "Paris is the capital."
# sentence3 = "My cats speak french better than i do."

# sentences = [sentence1, sentence2, sentence3] # Put sentences into a list.

# tokens = tokenizer(sentences, padding="max_length", max_length=15, truncation=True)

# for k, v in tokens.items():
#     print(k, v)

# ids = tokens['input_ids'][0] # Access the first sentence input ids.
# print(len(ids))
# pprint.pprint(tokenizer.decode(ids))

# ids = tokens['input_ids'][1] # Access the second sentence input ids.
# print(len(ids))
# pprint.pprint(tokenizer.decode(ids))

# ids = tokens['input_ids'][2] # Access the third sentence input ids.
# print(len(ids))
# pprint.pprint(tokenizer.decode(ids))

When the Hugging Face `tokenizer` is called with two string arguments, it treats them as a sentence *pair* rather than as a list of independent sentences. This design stems from the input format and training objectives of Transformer models like BERT. Here's a breakdown:

**1. Next Sentence Prediction (NSP) in BERT:**

* BERT, in its original pre-training, was trained with a task called "Next Sentence Prediction" (NSP).
* The model was presented with pairs of sentences and had to predict whether the second sentence followed the first sentence in the original text.
* This task was designed to help BERT understand relationships between sentences, which is crucial for tasks like question answering and natural language inference.
* To facilitate NSP, the input to BERT was structured as two segments, separated by the `[SEP]` token.
* The `token_type_ids` were used to distinguish between these two segments.

**2. Tokenizer Design Reflects BERT's Training:**

* The Hugging Face `tokenizer` was designed to be compatible with pre-trained models like BERT.
* Therefore, its interface for handling multiple string arguments directly reflects the expected input format for BERT's NSP task.
* When you provide two string arguments, the tokenizer assumes you're providing a sentence pair for a task that might involve sentence-level relationships.

**3. Efficiency and Clarity:**

* For tasks that specifically involve sentence pairs (e.g., question answering, paraphrase detection), this design provides a convenient way to prepare the input.
* It's efficient because it directly produces the input format that BERT expects.
* It also helps to make the code more readable when working with those types of tasks.

**4. Handling Arbitrary Lists:**

* For processing arbitrary lists of sentences, the `tokenizer` provides a different interface: passing a list of strings as the first argument.
* This allows you to tokenize multiple independent sentences without assuming a sentence pair relationship.
* This is the proper way to input many sentences into the tokenizer.

**5. Evolution of Transformer Training:**

* It's worth noting that the effectiveness of NSP has been debated, and some later Transformer models (like RoBERTa) have omitted it from their pre-training.
* However, the `tokenizer`'s design still reflects the historical context of BERT's training.

In essence, the tokenizer's behavior is a direct consequence of the training objectives and input format of the models it was designed to support. It prioritizes efficient handling of sentence pairs while also providing a separate mechanism for processing arbitrary lists of sentences.
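
The difference between the two interfaces can be sketched without any library. The functions `encode_pair` and `encode_batch` below are hypothetical helpers that mimic the BERT-style layout on pre-split tokens; they are an illustration of the format, not the tokenizer's actual code.

```python
def encode_pair(tokens_a, tokens_b):
    """Join two segments into one sequence: [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], A and the first [SEP]; segment 1 covers B and the last [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

def encode_batch(list_of_token_lists):
    """Treat each item as an independent sentence: [CLS] sentence [SEP]."""
    return [["[CLS]"] + toks + ["[SEP]"] for toks in list_of_token_lists]

question = ["what", "is", "the", "capital", "of", "france", "?"]
answer = ["paris", "is", "the", "capital", "."]

pair_tokens, pair_types = encode_pair(question, answer)
print(len(pair_tokens))  # 15
print(pair_types)        # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

batch_tokens = encode_batch([question, answer])
print(len(batch_tokens))  # 2
```

The pair produces one sequence with two segments, while the list produces two separate sequences that each start with `[CLS]`.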


### Colors by Allison

## Transformers

* Colab User... GenAI... Transformers O

### Encoder vs Decoder

### BERT vs GPT3



The shift from fine-tuning to few-shot learning.

**1. "The parameters are basically statistical weights about how text tends to flow"**

* **Explanation:**
    * Large language models (LLMs) like GPT-3 are trained on vast amounts of text data.
    * During training, they learn to predict the next word in a sequence.
    * The "parameters" of the model are numerical values (weights) that determine the probability of each word appearing in a given context.
    * Essentially, these parameters capture statistical patterns in the training data, reflecting how words and phrases tend to occur together.
    * So, when you say "text tends to flow," you're referring to the model's ability to generate text that follows these learned statistical patterns, creating coherent and seemingly natural language.

**2. "Different from BERT, GPT3 attempts to replace the downstream fine-tuning with few-shot learning."**

* **BERT vs. GPT-3:**
    * **BERT (Bidirectional Encoder Representations from Transformers):**
        * BERT is primarily an encoder-based model, meaning it excels at understanding the context of words within a sentence.
        * Traditionally, BERT is fine-tuned for specific downstream tasks (e.g., sentiment analysis, question answering) by adding a task-specific layer and training it on labeled data.
    * **GPT-3 (Generative Pre-trained Transformer 3):**
        * GPT-3 is a decoder-based model, designed for generating text.
        * It was designed to demonstrate that, with enough scale, a language model could perform various tasks with minimal or no task-specific training.
        * GPT-3 aims to use in-context learning and to minimize the need for fine-tuning.

**3. "Few-Shot Learning (FSL) is a Machine Learning framework that enables a pre-trained model to generalize over new categories of data (that the pre-trained model has not seen during training) using only a few labeled samples per class."**

* **Explanation:**
    * FSL addresses the challenge of training models with limited labeled data.
    * It allows a model to quickly adapt to new tasks or domains by providing only a small number of examples.
    * This is particularly useful when collecting large labeled datasets is expensive or impractical.
    * It is closely related to meta-learning ("learning to learn"), which is one common way to approach it.

**4. "GPT3 is given a small amount of data demonstration of the task at the inference time as conditioning but unlike the fine-tuning approach, there is no weight update. This is inspired by the fact that once humans have a general language understanding we don’t necessarily need large supervised datasets to learn most language tasks."**

* **In-Context Learning:**
    * Instead of updating the model's weights through fine-tuning, GPT-3 uses "in-context learning."
    * This means that you provide a few examples of the task you want the model to perform directly in the input prompt.
    * For example, you might give GPT-3 a few examples of translating English to French, and then ask it to translate a new sentence.
    * The model uses these examples to understand the task and generate the desired output, without changing its underlying parameters.
* **Human Analogy:**
    * The approach is inspired by how humans learn. We can often grasp new concepts or tasks by seeing just a few examples, without needing extensive training.
    * GPT-3's ability to perform few-shot learning suggests that it has learned a broad understanding of language and can apply that knowledge to new situations.

**In summary:**

* GPT-3 represents a shift towards models that can perform a wide range of tasks with minimal task-specific training.
* Few-shot learning allows these models to quickly adapt to new tasks by providing examples in the input prompt.
* This approach can be more flexible than traditional fine-tuning, since no task-specific training run is needed, and it resembles the way humans learn from a handful of examples.
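
As a rough sketch of in-context learning, the snippet below assembles a few-shot prompt for the English-to-French example mentioned above. The function name `build_few_shot_prompt` and the prompt layout are assumptions for illustration; no particular model requires this exact format.

```python
def build_few_shot_prompt(examples, query, instruction="Translate English to French."):
    """Place a few worked examples in the prompt, then leave the last answer blank."""
    lines = [instruction, ""]
    for english, french in examples:
        lines.append(f"English: {english}")
        lines.append(f"French: {french}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("French:")
    return "\n".join(lines)

# Two demonstrations are the "few shots"; the model's weights are never updated.
examples = [("cheese", "fromage"), ("cat", "chat")]
prompt = build_few_shot_prompt(examples, "dog")
print(prompt)
```

The resulting string would be sent to the model as-is, and the model is expected to continue the pattern after the final `French:`.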


### Temperature, Top-K, Top-P

* Temperature rescales the logits before softmax: values below 1 sharpen the distribution (less random), and values above 1 flatten it (more random)
* Top-K sampling limits the selection to the K most likely tokens
* Top-P sampling, or nucleus sampling, limits the selection to the smallest set of most likely tokens whose cumulative probability reaches a threshold P
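
The three controls can be sketched in plain Python on a made-up list of four logits. The helper names (`softmax`, `top_k_filter`, `top_p_filter`) are hypothetical, and this is a simplified illustration rather than the code any library uses.

```python
import math

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

logits = [2.0, 1.0, 0.5, -1.0]           # made-up scores for a 4-token vocabulary
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # more peaked
flat = softmax(logits, temperature=2.0)   # closer to uniform

print(sorted(top_k_filter(probs, 2)))    # [0, 1]
print(sorted(top_p_filter(probs, 0.9)))  # [0, 1, 2]
```

After filtering, the next token would be drawn at random from the surviving entries according to their renormalized probabilities.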

## Generative AI

### PyTorch

PyTorch is a very popular open-source machine learning framework. Essentially, it's a tool that helps developers and researchers build and train neural networks. Here's a breakdown of its key aspects:

**Core Features:**

* **Tensor Computation:**
    * At its heart, PyTorch works with "tensors," which are multi-dimensional arrays similar to NumPy arrays.
    * It provides tools for performing various mathematical operations on these tensors, and crucially, it can accelerate these computations using GPUs (Graphics Processing Units), significantly speeding up machine learning tasks.
* **Automatic Differentiation (Autograd):**
    * A fundamental part of training neural networks involves calculating gradients. PyTorch's "autograd" feature automates this process.
    * This means you don't have to manually derive complex mathematical formulas; PyTorch keeps track of the operations performed and calculates the gradients for you.
* **Dynamic Computation Graphs:**
    * PyTorch uses dynamic computation graphs, which provide flexibility and make it easier to debug models. This means the graph of operations is built as the code is executed, allowing for more intuitive and flexible development.
* **Neural Network Modules (torch.nn):**
    * PyTorch provides a set of pre-built neural network layers and functions within the `torch.nn` module. This simplifies the process of creating complex neural network architectures.
* **Ease of Use and Flexibility:**
    * PyTorch is known for its user-friendly interface and its close integration with the Python programming language. This makes it relatively easy to learn and use.
    * Its flexibility makes it a favorite among researchers who need to experiment with novel neural network architectures.

**Key Uses:**

* Deep learning research
* Computer vision (image recognition, etc.)
* Natural language processing (text analysis, etc.)
* And many other machine learning applications.

In essence, PyTorch provides the tools and infrastructure needed to build and train powerful machine learning models.
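
To give a feel for what autograd and a dynamic computation graph do, here is a deliberately tiny scalar version written in plain Python. The `Value` class is a hypothetical teaching toy, not PyTorch code, and the `grad_fn` strings only imitate the naming style PyTorch uses.

```python
class Value:
    """A scalar that remembers how it was computed (a toy stand-in for a tensor)."""

    def __init__(self, data, parents=(), grad_fn="leaf"):
        self.data = data
        self.grad = 0.0
        self.parents = parents   # Values this one was computed from
        self.grad_fn = grad_fn   # name of the operation that produced it
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), "AddBackward")
        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), "MulBackward")
        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        """Walk the recorded graph from the output back to the leaves."""
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# The graph is built while this line runs, which is the "dynamic" part.
x, w, b = Value(3.0), Value(2.0), Value(1.0)
y = w * x + b
y.backward()

print(y.data, y.grad_fn)       # 7.0 AddBackward
print(w.grad, x.grad, b.grad)  # 3.0 2.0 1.0
```

PyTorch applies the same idea to whole tensors and a much larger set of operations, so gradients never have to be derived by hand.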


In [17]:
# next word setup

This output shows the download progress of different components required for a language model, likely GPT-2 in this case. Here's a breakdown of what each file represents:

* **`config.json`:**  This file contains the configuration settings for the model, such as the number of layers, attention heads, hidden units, and other architectural choices. It defines the structure and hyperparameters of the model.

* **`model.safetensors`:** This file stores the learned weights (parameters) of the neural network. These weights determine how the model processes and generates text, and they are the result of the model's training on a massive dataset. The `.safetensors` format is a newer format for storing model weights that focuses on safety (unlike pickle-based checkpoints, loading it does not execute arbitrary code) and portability.

* **`generation_config.json`:** This file contains settings specifically related to text generation with the model. This might include parameters like `max_length`, `temperature` (controlling randomness), and `top_k` (limiting the choices for the next token).

* **`tokenizer_config.json`:**  This file holds the configuration for the tokenizer associated with the model. It defines how the tokenizer splits text into tokens, maps tokens to numerical IDs, and handles special tokens.

* **`vocab.json`:** This file contains the model's vocabulary, which is a list of all the tokens (words, subwords, or characters) the model knows. Each token is associated with a unique numerical ID.

* **`merges.txt`:** This file is used by byte-pair encoding (BPE) tokenizers (like the one for GPT-2). It lists the merge rules, in priority order, that combine adjacent characters or subword units into the larger subword tokens found in the vocabulary.

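* **`tokenizer.json`:** This file is a single-file serialization of the tokenizer used by the fast (`tokenizers`-based) implementation. It typically bundles the vocabulary, the merge rules, and the pre- and post-processing settings.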

**In summary, these downloads represent the essential components that make up a language model:**

* The model's architecture (`config.json`)
* The learned knowledge (`model.safetensors`)
* The text processing rules (`tokenizer_config.json`, `vocab.json`, `merges.txt`, `tokenizer.json`)
* The generation parameters (`generation_config.json`)

By downloading all these files, you get a complete and ready-to-use language model that can be loaded and used for various natural language processing tasks.
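
To show what merge rules like those in `merges.txt` do, here is a small sketch of byte-pair encoding applied to a single word. The three merge rules are hypothetical and far fewer than a real file contains, and the function `bpe` is a simplified illustration rather than GPT-2's actual tokenizer.

```python
# Hypothetical merge rules, highest priority first.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
ranks = {pair: rank for rank, pair in enumerate(merges)}

def bpe(word):
    """Repeatedly merge the highest-priority adjacent pair until none applies."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge rule matches any adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe("lower"))  # ['low', 'er']
print(bpe("new"))    # ['n', 'e', 'w']
```

Each resulting piece would then be looked up in `vocab.json` to obtain its numerical ID.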


In [18]:
# AutoModelForCausalLM

Let's break down the output shape `torch.Size([1, 4, 50257])` that you're getting from your GPT-2 model:

**Understanding the Output Shape**

The output shape `torch.Size([1, 4, 50257])` represents the dimensions of the `logits` tensor produced by your GPT-2 model. Here's what each dimension signifies:

* **Dimension 1: Batch Size (1)**
    * The first dimension (with size 1) corresponds to the batch size. In this case, you're processing a single input sequence. If you were processing multiple sequences at once, this dimension would be equal to the number of sequences in your batch.
* **Dimension 2: Sequence Length (4)**
    * The second dimension (with size 4) represents the length of the input sequence. This means your `input_ids` tensor contains 4 tokens. GPT-2 processes all of these tokens in a single forward pass and produces, at each position, a score for every possible next token given the tokens up to that position.
* **Dimension 3: Vocabulary Size (50257)**
    * The third dimension (with size 50257) corresponds to the size of GPT-2's vocabulary. This means the model is considering 50257 possible tokens that could follow the input sequence. The `logits` tensor contains a score (logit) for each of these possible tokens at each position in the sequence.

**What are Logits?**

* Logits are the raw, unnormalized scores output by the model before they are converted into probabilities using a softmax function.
* Each logit represents how likely the model thinks a particular token is to be the next token in the sequence.
* Higher logits indicate higher likelihood.

**Visualizing the Output**

You can think of the `logits` tensor as a 3D array:

* The first dimension (batch size) indexes the sequences in the batch; each sequence has its own 2D table of scores.
* The second dimension (sequence length) represents the different positions within the input sequence.
* The third dimension (vocabulary size) contains a score for each possible token at each position.

**Example**

Let's say your input sequence is "The cat sat on". The `logits` tensor would have:

* 1st dimension: 1 (because you have one input sequence)
* 2nd dimension: 4 (because "The cat sat on" has 4 tokens)
* 3rd dimension: 50257 (the vocabulary size of GPT-2)

For each of the 4 positions in the sequence, you'd have 50257 logits, representing the model's scores for each possible next token.

**In Summary**

The output shape `torch.Size([1, 4, 50257])` indicates that your GPT-2 model has processed a single input sequence of length 4 and has produced a score for each of the 50257 possible next tokens at each position in the sequence. These scores (logits) can then be used to determine the most likely next token or to generate text.
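
The same layout can be mimicked with nested Python lists. In this sketch the scores are random numbers and the vocabulary size is shrunk to 8 so the output stays readable; both are assumptions for illustration only.

```python
import random

random.seed(0)
batch_size, seq_len, vocab_size = 1, 4, 8  # 8 stands in for GPT-2's 50257

# logits[b][t][v]: score for vocabulary entry v as the token that follows
# position t of sequence b.
logits = [
    [[random.gauss(0, 1) for _ in range(vocab_size)] for _ in range(seq_len)]
    for _ in range(batch_size)
]

shape = (len(logits), len(logits[0]), len(logits[0][0]))
print(shape)  # (1, 4, 8)

# Only the last position scores the token that would follow the whole input.
final_logits = logits[0][-1]
next_token_id = max(range(vocab_size), key=lambda v: final_logits[v])
print(len(final_logits), next_token_id)
```

Indexing the first sequence and the last position is the list equivalent of selecting the final logits from the tensor.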

In [19]:
# logits

In [20]:
# final logits

In [21]:
# top10

### Softmax

The softmax function is a crucial tool in machine learning, particularly in multi-class classification problems. Here's a breakdown of what it is and why it's used:

**Core Function:**

* **Converting Raw Scores to Probabilities:**
    * The softmax function takes a vector of real numbers (often called "logits") as input and transforms them into a probability distribution. This means it outputs a vector of values where each value is between 0 and 1, and the sum of all the values is equal to 1.
* **Multi-Class Classification:**
    * It's primarily used in situations where you need to classify data into multiple distinct categories. For example:
        * Identifying different types of animals in an image.
        * Determining the topic of a text document.
        * Predicting which word comes next in a sentence.

**How It Works:**

1.  **Exponentiation:**
    * First, it takes the exponential of each input value. This ensures that all output values are positive.
2.  **Normalization:**
    * Then, it divides each exponentiated value by the sum of all the exponentiated values. This normalization step is what makes the output values sum to 1, creating a valid probability distribution.

**Why It's Important:**

* **Probabilistic Output:**
    * Softmax provides a clear and interpretable output, giving you the probability of each class. This allows you to understand the model's confidence in its predictions.
* **Facilitates Training:**
    * Because softmax produces a differentiable output, it's compatible with gradient-based optimization algorithms used to train neural networks.
* **Enables Comparisons:**
    * By converting raw scores into probabilities, softmax makes it easy to compare the likelihood of different classes.

**In essence:**

The softmax function is a way to take a set of arbitrary numbers and turn them into a set of probabilities, which is essential for many classification tasks in machine learning.
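
The two steps above translate almost directly into code. This is a minimal pure-Python sketch on made-up logits; subtracting the maximum before exponentiating is an extra, commonly used stability trick that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Exponentiate each score, then divide by the sum of the exponentials."""
    m = max(logits)  # guards against overflow for large logits
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The largest logit receives the largest probability, and all three values add up to 1 (up to floating-point rounding).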


In [22]:
# top10 probabilities

In [23]:
# probabilities sum

When you see `tensor(0.9999, grad_fn=<AddBackward0>)`, here's a breakdown of what it means in the context of PyTorch:

**1. `tensor(0.9999)`:**

* This indicates the numerical value of the result. In this case, it's 0.9999.
* This value represents the sum of the probabilities obtained after applying the softmax function to your `final_logits`.
* Ideally, the sum of probabilities should be exactly 1.0. The slight deviation (0.9999) is due to floating-point rounding: the probabilities are typically stored as 32-bit floats, and small rounding errors accumulate when they are added.

**2. `grad_fn=<AddBackward0>`:**

* This part is related to PyTorch's automatic differentiation (autograd) system.
* `grad_fn` stands for "gradient function." It indicates the function that was used to compute this tensor, and it's used to calculate gradients during backpropagation.
* `<AddBackward0>` specifically means that the tensor was produced by an addition operation, which is what happens when the probabilities are added together with `+` or Python's built-in `sum()`. (Calling `torch.sum()` on a tensor reports `<SumBackward0>` instead.)
* In essence, PyTorch is keeping track of the operations that were performed to create this tensor so that it can calculate gradients if needed. This is crucial for training neural networks, where gradients are used to update the model's weights.
* In short, it says that the operation that created the tensor was an addition, and that PyTorch is able to calculate the gradient of that operation.

**In simpler terms:**

* The `tensor(0.9999)` tells you that the sum of your probabilities is very close to 1, which is what you expect.
* The `grad_fn` part is PyTorch's way of saying, "I know how this number was calculated, and I can figure out how to adjust the inputs if needed."

**Key takeaway:**

* The numerical value (0.9999) is the result you're interested in for confirming that your probabilities are correctly normalized.
* The `grad_fn` is PyTorch's internal bookkeeping for automatic differentiation, which is essential for training neural networks.
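
The rounding effect is not specific to PyTorch. The sketch below reproduces it with ordinary Python floats (64-bit rather than the 32-bit values a model usually produces), using ten copies of `0.1` as a made-up stand-in for a list of probabilities.

```python
import math

values = [0.1] * 10  # ten "probabilities" that should add up to exactly 1

# Add the values one at a time, the way a simple running total would.
total = 0.0
for v in values:
    total += v

print(total)                     # 0.9999999999999999
print(total == 1.0)              # False
print(math.isclose(total, 1.0))  # True
print(math.fsum(values))         # 1.0
```

Comparing with a tolerance, as `math.isclose` does, is the usual way to check that probabilities are normalized.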


In [24]:
# sum version 2