<a href="https://colab.research.google.com/github/gitmystuff/INFO4080/blob/main/Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers

## Abstract Ideas

* https://energywavetheory.com/equations/emc2/
* https://en.wikipedia.org/wiki/Musical_note
* https://www.dailyarthub.com/shop/objects/free-science-symbols-clip-art-set/
* https://decodingdatascience.com/attention-is-all-you-need-transforming-the-landscape-of-machine-learning/

## Neural Networks

### ANNs

* https://en.wikipedia.org/wiki/Neural_network_(machine_learning)

### CNNs

* https://en.wikipedia.org/wiki/Convolutional_neural_network
* https://www.ibm.com/topics/convolutional-neural-networks
* https://ieeexplore.ieee.org/document/726791

### RNNs

* https://en.wikipedia.org/wiki/Recurrent_neural_network
* At each time step t, the input x(t) is fed to the RNN unit together with the previous hidden state h(t-1), and the unit outputs a new hidden vector, or state, h(t)
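
As a rough illustration of the recurrence described above, here is a minimal sketch of one vanilla RNN step in plain Python. The function name `rnn_step` and all weight values are made up for the example; real networks learn their weights and use tensor libraries.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

# Toy weights (hypothetical values): 2 input features, 3 hidden units.
W_xh = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W_hh = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
b_h = [0.0, 0.1, -0.1]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # x at t = 0, 1, 2
h = [0.0, 0.0, 0.0]  # initial hidden state
states = []
for x_t in sequence:
    # Each step needs the previous state, so the loop cannot run in parallel.
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    states.append(h)
```

The loop makes the sequential dependency visible: the state at time t cannot be computed until the state at time t-1 is known.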

#### Types of RNNs

* https://www.educative.io/answers/what-are-the-types-of-rnn

### LSTM

* https://en.wikipedia.org/wiki/Long_short-term_memory
* Cell
* Input Gate
* Output Gate
* Forget Gate
* Activation Functions (the sigmoid is represented by $\sigma$; tanh is also used)
* https://www.researchgate.net/figure/Structure-of-the-LSTM-cell-and-equations-that-describe-the-gates-of-an-LSTM-cell_fig5_329362532
* The cell state is the network's memory, carrying information across time steps, while the hidden state holds the newly processed information that is passed to the next time step
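
To make the roles of the gates, the cell state, and the hidden state concrete, here is a minimal sketch of a single-unit LSTM step using the standard gate equations. The names `lstm_step` and `params` and all weight values are illustrative assumptions, not taken from a specific library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM; p maps gate name -> (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much old memory to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much memory to expose
    g = math.tanh(pre("candidate"))   # candidate values to write
    c_t = f * c_prev + i * g          # cell state: memory carried over time
    h_t = o * math.tanh(c_t)          # hidden state: passed to the next step
    return h_t, c_t

params = {  # hypothetical weights
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, -0.1, 0.0),
    "output": (0.4, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
}
h, c = 0.0, 0.0
for x_t in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x_t, h, c, params)
```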

### Seq2Seq

* Encoder / Decoder
* https://en.wikipedia.org/wiki/Seq2seq
* https://www.researchgate.net/figure/Seq2Seq-model-neural-encoder-and-decoder-32_fig4_334023532
* h(t), a fixed-size vector that compresses the input (the "thought vector"), sits between the encoder and the decoder
* Seq2Seq struggles with long passages: long-term dependencies and earlier context are hard to retain
* h(t) always has the same size regardless of input length, which can cause a bottleneck

### LSTM vs Seq2Seq

* Both can be used as many-to-many models
* LSTMs are building blocks
* Seq2Seq is an architecture built from encoder LSTM and decoder LSTM, etc.

### All You Need is Attention

* For each output token, we want to know which input to pay attention to by using weights
* https://theaisummer.com/attention/
* An attention layer maps inputs to outputs: each output is a weighted combination of the inputs
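
The idea of weighting inputs for each output token can be sketched as follows. This is a simplified dot-product attention over toy encoder states; the function name `attend` and the numbers are assumptions for illustration only.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one output step using dot-product scores."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)  # one weight per input position, summing to 1
    dim = len(encoder_states[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values, one per input token
decoder_state = [2.0, 0.0]                              # state for the current output token
weights, context = attend(decoder_state, encoder_states)
```

Inputs whose states align with the decoder state receive larger weights, so they contribute more to the context vector used for that output token.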

#### No More RNN

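* RNNs are slow because they process tokens sequentially
* Computation across time steps cannot be parallelized, since each step depends on the previous hidden state
* They suffer from vanishing gradients on long sequences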

### Summary

(RNN, Seq2Seq, Attention, Transformers)

* Attention is all you need - https://arxiv.org/abs/1706.03762
* The attention mechanism lets the model focus on different parts of the input text, which helps it make more accurate predictions. Each word in a sequence can attend to the earlier words, which helps the model handle long text sequences.
* Text generation can be thought of as autoregression (what is the next best word)
* Transformers are large and computationally expensive, but unlike RNNs they process the tokens of a sequence in parallel during training
* Attention provides a longer effective range than an LSTM
* Attention was originally created for RNNs (LSTMs), but transformers use only attention
* Assign varying levels of importance to different parts of a sentence or text
* Capture contextual information effectively
* Understand the relationships between words
* Extract the most relevant parts of a query while disregarding less important ones
* Comprehend pronouns, ambiguous words (words with more than one meaning), and related sentences more effectively
* Generate coherent answers for queries
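
The point about words attending only to earlier words in the sequence can be illustrated with a causal mask. The sketch below is a simplified illustration with made-up scores and a hypothetical function name, not the implementation from the paper.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is how strongly position i matches position j.
    Position i may only attend to positions j <= i (itself and earlier tokens)."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # drop future positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)         # future tokens get zero weight
        weights.append([e / total for e in exps] + hidden)
    return weights

scores = [[0.0, 5.0, 5.0], [1.0, 1.0, 5.0], [0.0, 1.0, 2.0]]  # toy values
w = causal_attention_weights(scores)
```

Even though the first row has high scores for later positions, those positions receive zero weight, which is what makes next-word (autoregressive) prediction possible.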


## LLMs - Tokens and Transformers

* Completion vs Chat
* Chat needs context, memory
* LangChain
* Transfer Learning: In transfer learning, a machine exploits the knowledge gained from a previous task to improve generalization about another. For example, in training a classifier to predict whether an image contains food, you could use the knowledge it gained during training to recognize drinks. - https://builtin.com/data-science/transfer-learning
* HuggingFace Transformers - https://huggingface.co/models

Transformer Architecture

* https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention

Encoding vs Decoding

The main difference is that encoders are designed to learn embeddings that can be used for various predictive modeling tasks such as classification. In contrast, decoders are designed to generate new text, for example, answering user queries.

https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder

Attention

* The chicken crossed the road because it was hungry
* The chicken crossed the road because it was the way home
* What does "it" refer to in each sentence?

Query, Key, Value

* The query vector represents what the current token is looking for. Each key vector describes what a token offers for matching against queries, and each value vector carries the information that is mixed into the output once the query-key match has set the weights.
* https://www.thecloudgirl.dev/blog/how-attention-makes-llms-powerful

Pipeline

<pre>
from transformers import pipeline
</pre>

* Text Processing
* Tokenize Data
* Model
* Numerical Predictions
* Post Processing
* Human Readable Predictions
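
The stages listed above can be mimicked with a toy, dependency-free sketch. Everything here (the vocabulary, the `toy_model` scoring rule, and the function names) is a hypothetical stand-in to show the data flow; it is not how the `transformers` pipeline is implemented.

```python
import math

VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "mondays": 4}  # toy vocabulary
LABELS = ["NEGATIVE", "POSITIVE"]

def preprocess(text):
    """Text processing: normalize the raw string."""
    return text.lower().strip()

def tokenize(text):
    """Tokenize data: map each word to an integer ID."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.split()]

def toy_model(token_ids):
    """Stand-in for a neural network: returns one raw score (logit) per label."""
    positive = float(token_ids.count(VOCAB["love"]))
    negative = float(token_ids.count(VOCAB["hate"]))
    return [negative, positive]

def postprocess(logits):
    """Post processing: turn numerical predictions into a readable label."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}

def toy_pipeline(text):
    return postprocess(toy_model(tokenize(preprocess(text))))

result = toy_pipeline("I love Mondays")
```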

Highlights

* Tokenizer
* Sentiment Analysis
* Text Generation
* Masked Language Modeling
* NER
* Summarization
* Translation
* Question and Answer
* Zero-Shot Classification

Source

* https://deeplearningcourses.com/c/data-science-transformers-nlp

### Transformer

* Stack of attention layers
* Pre-trained stack of mappings from encoder to decoder
* The attention function can be considered a mapping between a query and a set of key-value pairs to an output
* Query, Key, Value
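
The query/key/value mapping can be written out as scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as defined in the "Attention Is All You Need" paper. The sketch below uses plain Python lists and toy numbers; the function name and the inputs are chosen for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the attention-weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        outputs.append(out)
    return outputs

keys = [[1.0, 0.0], [0.0, 1.0]]       # toy values
values = [[10.0, 0.0], [0.0, 10.0]]
out = scaled_dot_product_attention([[5.0, 0.0]], keys, values)
```

The query matches the first key much better than the second, so the output is dominated by the first value vector.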

Sources

* https://arxiv.org/abs/1706.03762
* https://machinelearningmastery.com/the-transformer-attention-mechanism/
* https://www.linkedin.com/pulse/unpacking-query-key-value-transformers-analogy-database-mohamed-nabil/

### Tokenizer

* Different models yield different outputs
* Padding and Truncation
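
One reason different models yield different tokens is that each ships its own vocabulary and splitting algorithm. As a rough sketch, the following shows the greedy longest-match-first idea used by WordPiece-style tokenizers such as BERT's, where `##` marks a piece that continues a word. The vocabulary here is invented, and real tokenizers also normalize text and handle punctuation.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter candidate
        if piece is None:
            return ["[UNK]"]                  # no piece matched
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"happy", "fri", "##day", "un", "##happi", "##ness"}  # hypothetical

print(wordpiece("friday", toy_vocab))
print(wordpiece("unhappiness", toy_vocab))
```

A long, rare word is broken into several known pieces instead of being mapped to a single unknown token.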

In [None]:
# from transformers import AutoTokenizer

# model = 'bert-base-uncased'
# tokenizer = AutoTokenizer.from_pretrained(model)

In [None]:
# # https://huggingface.co/docs/transformers/en/model_doc/bert
# tokenizer('happy friday')

In [None]:
# tokenizer('Antidisestablishmentarianism')
# # This 28-letter word refers to a 19th-century political movement that opposed the disestablishment of the Church of England.

In [None]:
# tokens = tokenizer.tokenize('happy friday')
# print(tokens)
# ids = tokenizer.convert_tokens_to_ids(tokens)
# print(ids)
# print(tokenizer.decode(ids))
# ids = tokenizer.encode('happy friday')
# print(ids)
# tokenizer.decode(ids)

In [None]:
# data = [
#     'I like cats.',
#     'Do you like cats too?',
#     'Wow, those cats like really know how to play groovy music.',
# ]
# tokenizer(data)

In [None]:
# sentence1 = "What is the capital of India?"
# sentence2 = "new Delhi is the capital."

# encoded_input = tokenizer(sentence1, sentence2, padding="max_length", max_length=20)
# encoded_input

In [None]:
# tokenizer.decode(encoded_input['input_ids'])

**Attention Mask**

* **Purpose:** Tells the model which tokens to pay attention to and which to ignore.
* **Different Values:** You'll see 0s (instead of all 1s) when you have:
    * **Padding:** If your input texts have different lengths, they need to be padded to the same length for efficient batch processing. The `attention_mask` will have 0s for the padded tokens, signaling the model to ignore them.
* **Not the same as Masking for Prediction:** During training for masked language modeling, some tokens are intentionally hidden and the model tries to predict them. Those tokens are replaced by a special mask token (such as `[MASK]`) in `input_ids`; their `attention_mask` entries stay 1, because the model must still attend to those positions.

**Token Type IDs**

* **Purpose:**  Distinguishes between different sentences or segments in the input.
* **Different Values:** You'll see 1s (instead of all 0s) when you have:
    * **Multiple Sentences:** For tasks like question answering or sentence pair classification, you input two sentences. The `token_type_ids` will have 0s for the tokens of the first sentence and 1s for the tokens of the second sentence.
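
To show how these fields line up position by position, here is a small sketch that builds BERT-style inputs for a sentence pair by hand. The vocabulary IDs and the helper name `encode_pair` are hypothetical; a real tokenizer does this for you.

```python
def encode_pair(tokens_a, tokens_b, vocab, max_length):
    """Build BERT-style inputs for a sentence pair, padded to max_length."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    input_ids = [vocab[t] for t in tokens]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(input_ids)       # real tokens
    n_pad = max(0, max_length - len(input_ids))
    return {
        "input_ids": input_ids + [vocab["[PAD]"]] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # padding is ignored
    }

toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "is": 3, "it": 4, "cold": 5, "yes": 6}
encoded = encode_pair(["is", "it", "cold"], ["yes"], toy_vocab, max_length=10)
```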

**Example**

Let's say you have two sentences:

* Sentence 1: "What is the capital of India?"
* Sentence 2: "New Delhi is the capital."

Here's how the tokenizer might process them, including padding:

```python
from transformers import AutoTokenizer

model = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model)

sentence1 = "What is the capital of India?"
sentence2 = "New Delhi is the capital."

# The pair takes 16 tokens (including [CLS] and two [SEP]), so max_length=20 leaves room for padding
encoded_input = tokenizer(sentence1, sentence2, padding="max_length", max_length=20)
print(encoded_input)
```

Output (likely similar, but exact IDs might vary):

```
{
 'input_ids': [101, 2054, 2003, 1996, 3007, 1997, 2634, 1029, 102, 2047, 6768, 2003, 1996, 3007, 1012, 102, 0, 0, 0, 0],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
}
```

**Explanation:**

* **`input_ids`:** The numerical IDs of the tokens from both sentences, including the special tokens `[CLS]` (101) and `[SEP]` (102) and padding (0s).
* **`token_type_ids`:** 0s for the tokens in the first sentence ("What is the capital of India?") and 1s for the second sentence ("New Delhi is the capital."). Padding positions are also 0.
* **`attention_mask`:** 1s for all the actual tokens and 0s for the four padding tokens at the end.

This example shows how `attention_mask` and `token_type_ids` help the model understand the structure and relevant parts of the input text.


### Sentiment Analysis

In [None]:
# from transformers import pipeline

# sa = pipeline('sentiment-analysis')
# print(sa('I love Thursdays'))
# print(sa('I am sad to see the class end on Thursday night'))
# print(sa('I have mixed emotions about Thursdays'))

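This output shows the download progress of the files a pipeline fetches the first time it loads a model. The file list below matches a GPT-2 style model, such as the default for the text-generation pipeline; other models ship a slightly different set of files. Here's a breakdown of what each file represents: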

* **`config.json`:**  This file contains the configuration settings for the model, such as the number of layers, attention heads, hidden units, and other architectural choices. It defines the structure and hyperparameters of the model.

* **`model.safetensors`:** This file stores the learned weights (parameters) of the neural network. These weights determine how the model processes and generates text, and they are the result of the model's training on a massive dataset. The ".safetensors" format is a newer weight format that, unlike pickle-based PyTorch checkpoints, cannot execute arbitrary code when loaded.

* **`generation_config.json`:** This file contains settings specifically related to text generation with the model. This might include parameters like `max_length`, `temperature` (controlling randomness), and `top_k` (limiting the choices for the next token).

* **`tokenizer_config.json`:**  This file holds the configuration for the tokenizer associated with the model. It defines how the tokenizer splits text into tokens, maps tokens to numerical IDs, and handles special tokens.

* **`vocab.json`:** This file contains the model's vocabulary, which is a list of all the tokens (words, subwords, or characters) the model knows. Each token is associated with a unique numerical ID.

* **`merges.txt`:** This file is used by byte-pair encoding (BPE) tokenizers (like the one for GPT-2). It lists, in priority order, the pairs of symbols to merge when building subword tokens up from individual characters.

* **`tokenizer.json`:** This file is the serialized "fast" tokenizer. It bundles the vocabulary, the merge rules, and the normalization and pre-tokenization settings in a single file.
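
As a rough sketch of how ranked merge rules like those in `merges.txt` are applied, the following starts from single characters and repeatedly merges the highest-priority adjacent pair. The merge list is invented for the example, and GPT-2's real tokenizer works on bytes with extra pre-processing.

```python
def apply_merges(word, merges):
    """Apply ranked BPE merge rules to one word, starting from single characters."""
    rank = {pair: i for i, pair in enumerate(merges)}  # lower index = higher priority
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break                                      # no applicable rule left
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]     # hypothetical merge rules
print(apply_merges("lower", toy_merges))
```

Each resulting symbol would then be looked up in `vocab.json` to obtain its numerical ID.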

**In summary, these downloads represent the essential components that make up a language model:**

* The model's architecture (`config.json`)
* The learned knowledge (`model.safetensors`)
* The text processing rules (`tokenizer_config.json`, `vocab.json`, `merges.txt`, `tokenizer.json`)
* The generation parameters (`generation_config.json`)

By downloading all these files, you get a complete and ready-to-use language model that can be loaded and used for various natural language processing tasks.


In [None]:
# sa(['i really like talking in my sleep',
#     'i wish i would stop snoring because I can\'t hear myself talk and it\'s driving me cuckoo'])

### Text Generation

In [None]:
# url = 'https://raw.githubusercontent.com/gitmystuff/Datasets/main/alice%20in%20wonderland%20chapter%201.txt'

# from transformers import pipeline
# import requests

# response = requests.get(url)
# data = response.text
# print(data)

In [None]:
# lines = data.splitlines()
# lines[2]

In [None]:
# gen = pipeline('text-generation', max_length=100)
# print(gen(lines[2]))

In [None]:
# import textwrap

# def wrap(x):
#   return textwrap.fill(x, replace_whitespace=False, fix_sentence_endings=True)

# prompt = 'The book had no pictures or conversations in it, and what is the use of a book, thought Alice without pictures or conversations'
# out = gen(prompt, max_length=300)
# print(wrap(out[0]['generated_text']))

### Masked Language Modeling

In [None]:
# !wget -nc https://raw.githubusercontent.com/gitmystuff/Datasets/main/news_stories.csv

In [None]:
# import pandas as pd

# df = pd.read_csv('news_stories.csv')
# df.head()

In [None]:
# df = df[df['category'].isin(['POLITICS', 'TRAVEL', 'WEDDINGS', 'EDUCATION', 'SCIENCE'])]
# df['category'].value_counts()

In [None]:
# cats = 'POLITICS'
# texts = df[df['category'] == cats]['short_description']
# texts = texts.reset_index()
# texts.drop('index', axis=1, inplace=True)
# texts.head()

In [None]:
# texts.iloc[3]['short_description']

In [None]:
# mlm = pipeline('fill-mask')

In [None]:
# mlm('He\'s not saying <mask> should go if the sexual assault allegations are true. He\'s telling him to get out now')

### Question Answer

### Zero Shot Classification / Learning

Zero-shot, one-shot, and few-shot prompting describe how many worked examples an LLM is shown before it is asked to perform a task. The examples are supplied in the prompt at inference time; the model's weights are not updated:

* **Zero-shot:** The LLM is given **no examples** and has to complete the task based on its general knowledge from training. It's like taking a test without studying!
    * Example:  Asking an LLM to translate a sentence from English to French without providing any prior examples of English-French translations.

* **One-shot:** The LLM is given **one example** of the task before being asked to do it. It's like being shown how to solve one math problem and then being asked to solve a similar one.
    * Example: Showing an LLM one example of a movie review labeled as "positive" and then asking it to classify the sentiment of a new movie review.

* **Few-shot:** The LLM is given a **small number of examples** (usually 2-10) before being asked to perform the task.  This gives it a bit more context to work with.
    * Example: Providing an LLM with a few examples of questions and answers about a specific topic and then asking it to answer a new question on that topic.
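
The difference between the three settings comes down to how many labeled examples are placed in the prompt. The sketch below builds such prompts as plain strings; the helper name `build_prompt` and the prompt wording are assumptions for illustration, not a required format.

```python
def build_prompt(task, examples, query):
    """Assemble a prompt with zero or more labeled examples before the query."""
    parts = [task]
    for text, label in examples:
        parts.append(f"Review: {text}\nSentiment: {label}")
    parts.append(f"Review: {query}\nSentiment:")   # left open for the model to complete
    return "\n\n".join(parts)

task = "Classify the sentiment of each movie review as positive or negative."
examples = [
    ("A delightful surprise from start to finish.", "positive"),
    ("I want those two hours back.", "negative"),
    ("The cast is wonderful.", "positive"),
]
query = "The plot made no sense."

zero_shot = build_prompt(task, [], query)
one_shot = build_prompt(task, examples[:1], query)
few_shot = build_prompt(task, examples, query)
print(few_shot)
```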

**Why are these "shots" important?**

* **Generalization:** They test how well an LLM can generalize its knowledge to new situations and tasks.
* **Learning efficiency:**  They assess how quickly an LLM can learn new concepts from limited examples.
* **Adaptability:** They evaluate the LLM's ability to adapt to different tasks and domains.

**Beyond the basics:**

While zero-shot, one-shot, and few-shot are the most common types of "shots," there are other variations:

* **Multi-shot:**  Providing more than a few examples, but still a relatively small number.
* **In-context learning:**  A broader term that encompasses few-shot learning and refers to the LLM's ability to learn from examples within the prompt itself.

The type of "shot" used depends on the specific task, the capabilities of the LLM, and the desired level of performance.


In [None]:
# from transformers import pipeline

# zsc = pipeline('zero-shot-classification')

In [None]:
# zsc('This is really cool', candidate_labels=['positive', 'negative'])

In [None]:
# idx = 20
# cats = list(set(df['category']))
# print(cats)
# print(df.iloc[idx]['short_description'])
# print(df.iloc[idx]['category'])

In [None]:
# zsc(df.iloc[idx]['short_description'], candidate_labels=cats)

In [None]:
# !pip install transformers langchain langchain-community

In [None]:
# !pip install langchain_huggingface

In [None]:
# from langchain_huggingface import HuggingFacePipeline

# llm = HuggingFacePipeline.from_model_id(
#     model_id="gpt2",
#     task="text-generation",
#     pipeline_kwargs={
#         "max_new_tokens": 100,
#         "top_k": 50,
#         "top_p": 0.95,
#         "temperature": 0.1,
#     },
# )

# llm.invoke("Data Science is")

You're looking at code that creates a Hugging Face language model (LLM) for text generation within LangChain! Let's break down the arguments to `from_model_id`:

* **`model_id="gpt2"`:**  This specifies the name of the pre-trained model you want to use from the Hugging Face Model Hub. In this case, it's "gpt2", a popular language model.

* **`task="text-generation"`:** This tells the `HuggingFacePipeline` that you want to use the model for generating text.

Now, let's dive into the `pipeline_kwargs`:

* **`max_new_tokens=100`:** This limits the number of new tokens (roughly words or sub-words) that the model will generate in its output. Here, the model will add at most 100 tokens beyond the prompt.

* **`top_k=50`:** This parameter uses top-k sampling to control the randomness of the generated text.  The model will only consider the 50 most likely tokens when generating the next word. This helps to make the output more focused and coherent.

* **`top_p=0.95`:** This parameter uses nucleus sampling (also known as top-p sampling). It keeps the smallest set of most likely tokens whose cumulative probability reaches the `top_p` value and samples only from that set. This dynamically adjusts the number of tokens considered, allowing for more flexibility and potentially more diverse output.

* **`temperature=0.1`:** This controls the "creativity" or randomness of the generated text. A lower temperature (like 0.1) makes the output more deterministic and focused, a temperature of 1 leaves the model's probabilities unchanged, and values above 1 make the output more random and creative.

**In summary:** These `pipeline_kwargs` adjust the behavior of the Hugging Face language model, influencing the length, randomness, and diversity of the generated text. Note that `top_k`, `top_p`, and `temperature` only take effect when sampling is enabled (`do_sample=True`); with greedy decoding they are ignored. They give you control over how the model generates responses, allowing you to tailor it to your specific needs within your LangChain application.
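
To see what temperature and `top_p` do to a distribution, here is a small numerical sketch in plain Python. The logits and function names are made up, and libraries may differ in details such as tie-breaking.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before the softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]                 # toy scores for 4 tokens
sharp = softmax_with_temperature(logits, 0.1)  # low temperature: near-deterministic
flat = softmax_with_temperature(logits, 1.0)   # temperature 1: unchanged softmax
nucleus = top_p_filter(flat, 0.9)              # drops the least likely token here
```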

**top_k and temperature**

`top_k` is another parameter used to control the randomness of language model output, but it works differently than temperature. While temperature scales the entire probability distribution, `top_k` focuses on a specific number of the most likely tokens.

Here's how it works:

1. **Probability Distribution:** At each step of text generation, the language model calculates a probability distribution over its vocabulary (all the possible words or tokens it knows). This distribution represents how likely each word is to be the next one in the sequence.

2. **Selecting Top Tokens:**  `top_k` sampling selects the top `k` most likely tokens from this distribution. For example, if `top_k` is set to 40, only the 40 most probable words will be considered for the next token in the generated text.

3. **Resampling:** The probability distribution is then re-normalized over these selected `k` tokens, and the model samples from this narrowed-down distribution.
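
The three steps above can be sketched in a few lines. The vocabulary, probabilities, and function names are invented for the example, and the random generator is seeded only to make the run repeatable.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalize their probabilities."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

def sample(filtered, rng):
    """Draw one token index from the narrowed-down distribution."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

vocab = ["the", "a", "cat", "dog", "zebra"]   # toy vocabulary
probs = [0.4, 0.3, 0.15, 0.1, 0.05]           # step 1: model's distribution
filtered = top_k_filter(probs, k=2)           # steps 2 and 3: select, renormalize
rng = random.Random(0)
next_token = vocab[sample(filtered, rng)]
```

With `k=2`, only "the" and "a" can ever be sampled, no matter how many draws are made.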

**Effects of `top_k`:**

* **Increased Focus:** By limiting the choices to the most likely tokens, `top_k` sampling makes the output more focused and coherent. It prevents the model from considering very low-probability tokens that might lead to nonsensical or irrelevant output.
* **Controlled Randomness:** Even though it reduces the choices, `top_k` still allows for randomness within the top `k` tokens. This helps maintain some diversity and creativity in the generated text.

**Typical Usage:**

* **Smaller `top_k` values (e.g., 10, 20):**  Produce more deterministic and predictable output, suitable for tasks where accuracy is crucial.
* **Larger `top_k` values (e.g., 50, 100):** Allow for more diversity and creativity, useful for tasks like story writing or brainstorming.

**Combining with Temperature:**

`top_k` can be used in conjunction with temperature. Temperature reshapes the distribution (values below 1 sharpen it, values above 1 flatten it), while `top_k` controls how many tokens remain candidates.

**In your Hugging Face pipeline:**

You can set the `top_k` parameter through `pipeline_kwargs` when initializing your text generation pipeline, as in the `HuggingFacePipeline.from_model_id` call shown earlier.

Experiment with different `top_k` values, along with temperature, to find the optimal balance between coherence and diversity for your specific use case.


In [None]:
# from transformers import pipeline
# from langchain_core.prompts import PromptTemplate
# from langchain_huggingface import HuggingFacePipeline

# zsc = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# # note: bart-large-mnli is an NLI classifier; a causal LM such as gpt2 is the usual choice for text-generation
# llm = HuggingFacePipeline.from_model_id(
#     model_id="facebook/bart-large-mnli",
#     task="text-generation",
#     pipeline_kwargs={
#         "max_new_tokens": 1, # trying to force sentiment analysis onto a chat bot
#     },
# )

# template = """You are a helpful assistant that can classify text.
# Here is the text: {human_message}
# Classification results: {classification_results}
# """
# prompt = PromptTemplate(
#     input_variables=["human_message", "classification_results"], template=template
# )

# # LLMChain and chain.run are deprecated in recent LangChain; compose with | and call invoke
# chain = prompt | llm

# human_message = "This is really cool"
# classification_results = zsc(human_message, candidate_labels=['positive', 'negative'])

# output = chain.invoke({
#     "human_message": human_message,
#     "classification_results": classification_results
# })

# print(output)