Focus of the content
- overview of technical concepts
- training approaches
- will not dive deep into generative models for sound or video
- high-level understanding of the techniques


Will be following this structure:
1. Introduction to generative AI
2. Understanding LLMs
3. Text-to-image models

# Some of the popular models
* __Text-to-text__: Models that generate text from input text, like conversational agents. E.g. LLaMA 2, GPT-4, Claude and PaLM 2
* __Text-to-image__: Models that generate images from text captions. E.g. DALL-E 2, Stable Diffusion and Imagen
* __Text-to-audio__: Models that generate audio clips and music from text. E.g. Jukebox, AudioLM and MusicGen.
* __Text-to-video__: Models that generate video content from text descriptions. Example: Phenaki and Emu Video.
* __Text-to-speech__: Models that synthesize speech audio from input text. E.g. WaveNet and Tacotron
* __Speech-to-text__: Models that transcribe speech to text, also known as Automatic Speech Recognition (ASR). E.g. Whisper and SpeechGPT
* __Image-to-text__: Models that generate captions from images. E.g. BLIP and Flamingo. Related: CLIP, which matches images with text rather than generating captions.
* __Image-to-Image__: Applications for this type of model are data augmentation such as super resolution, style transfer, and inpainting.
* __Text-to-code__: Models that generate programming code from text. E.g. Codex and AlphaCode
* __Video-to-audio__: Models that analyze video and generate matching audio. E.g. Soundify.


---
_Importance of the number of parameters in an LLM: The more parameters a model has, the higher its capacity to capture relationships between words and phrases as knowledge. With more parameters, a model can learn higher-order correlations, for example that 'cat' is more likely to be followed by 'dog' if it is preceded by 'chase'. In general, the lower a model's perplexity, the better it performs, for example at answering questions._

_In models with 2 to 7 billion parameters, new capabilities emerge, such as the ability to generate creative text in different formats (poems, code, scripts, etc.) and to answer even open-ended and challenging questions._

---
__Representation Learning__ is about a model learning its internal representations of raw data to perform a machine learning task, rather than depending on engineered feature extraction. The model isn't told explicitly what features to look for; for example, an image model learns representations of the raw pixel data that help it make predictions.

__Problem__

Language models still face limitations when dealing with complex mathematical or logical reasoning tasks. It remains uncertain whether continually increasing the scale of language models will inevitably lead to new reasoning capabilities. LLMs are known to return the most probable answers within the context, which can sometimes yield fabricated information, termed hallucinations.

---

* __Foundation Model__ AKA base model is a large model that was trained on an immense quantity of data at scale so that it can be adapted to a wide range of downstream tasks. In GPT models, this pre-training is done via self-supervised learning.
* GPT-3 was trained on 300 billion tokens and has 175 billion parameters. GPT-4 reportedly has about 1.73 trillion parameters (OpenAI has not published the figure) and is able to process 8x more words than GPT-3. It reportedly keeps costs reasonable by using a Mixture of Experts (MoE) model consisting of 16 experts, each having about 111 billion parameters (for reference, GPT-3 has 175 billion). GPT-4 was reportedly trained on about 13 trillion tokens.
* The multi-modal version of GPT-4 incorporates a separate vision encoder, trained on joint image and text data, giving the model the capability to read web pages and transcribe what's in images and video.
* Google's PaLM 2 focuses on improving multilingual and reasoning capabilities while being more compute-efficient. It is smaller than its predecessor PaLM and exhibits faster and more efficient inference, allowing for broader deployment and faster response times for a more natural pace of interaction.
* LLaMa and LLaMa 2 by Meta AI come in sizes of up to 70B parameters. Their release allows the community to build on top of them, enabling open-source LLMs and resulting in Alpaca, Koala, MPT, Gorilla, and WizardCoder. The LLaMa 2 70B model outperforms PaLM (540B) on almost all benchmarks, but a large performance gap remains between LLaMa 2 70B and both GPT-4 and PaLM-2-L.
* LLaMa 2 was trained on a new mix of data, with the pre-training corpus size increased by 40% (to 2 trillion tokens).
* Claude and Claude 2 are AI assistants created by Anthropic. Claude 2, released in July 2023, is among the strongest competitors to GPT-4. Key model improvements include an expanded context size of up to 200K tokens, far larger than most available models. Unlike LLaMa 2, Claude's weights are not openly released. Anthropic's model card shows that Claude 2 still has limitations in areas like confabulation, bias, factual errors, and potential for misuse, problems it has in common with all LLMs.

---

### Major Players
- The largest LLaMa 2 model uses 70 billion parameters and was trained on 2 trillion tokens.
- PaLM 2 reportedly uses 340 billion parameters and has a larger scale of training data covering at least 100 languages.
- Mistral has an open-license 7B model, trained on private datasets and developed with the intent to support the open generative AI community.

### Transformer
- Hidden state representations from encoders consider not only the inherent meaning of the words (their semantic value) but also their context in the sequence.
- The decoder uses the encoded information to generate the output sequence one item at a time, using the context of the previously generated items.
- Layer Normalization: Stabilizes the network's learning. It normalizes the model's inputs across the feature dimension instead of the batch dimension, improving the overall speed and stability of learning.

    * The Transformer's success is due to its ability to maintain performance across longer sequences better than other models.
    * The attention mechanism computes a weighted sum of the values associated with each position in the input sequence, based on the similarity between the current position and all other positions. This weighted sum, known as the context vector, is then used as an input to the subsequent layers of the model, enabling the model to selectively attend to relevant parts of the input during the decoding process.


    1. Early attention mechanisms scaled quadratically with the length of the sequence (context size), rendering them inapplicable to settings with long sequences. Many LLMs use some form of Multi-Query Attention (MQA) to reduce the memory and compute cost of attention during inference. MQA shares a single key and value head across all query heads, which removes the heads dimension from certain computations. It has been reported to allow 11 times better throughput and 30% lower latency in inference tasks compared to a baseline without MQA.
    2. LLaMa 2 uses Grouped-Query Attention (GQA). A standard practice in autoregressive decoding is to cache the key (K) and value (V) pairs for the previous tokens in the sequence. However, as the context window or batch size increases, the memory cost of this KV cache grows significantly. GQA reduces it by sharing each key/value head across a group of query heads.
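
The attention and KV-sharing notes above can be made concrete with a small sketch. It is a toy, pure-Python illustration rather than any model's actual implementation: `attend`, `grouped_query_attention`, and `kv_cache_entries` are hypothetical names, and the head counts in the usage lines are illustrative assumptions. With one KV head the grouping reduces to MQA, and with as many KV heads as query heads it reduces to ordinary multi-head attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query: a weighted sum of the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def grouped_query_attention(queries, kv_cache):
    """queries: one query vector per query head.
    kv_cache: one (keys, values) pair per KV head. Each KV head is shared
    by a group of len(queries) // len(kv_cache) consecutive query heads."""
    if len(queries) % len(kv_cache) != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = len(queries) // len(kv_cache)
    return [
        attend(query, *kv_cache[head // group_size])
        for head, query in enumerate(queries)
    ]


def kv_cache_entries(n_kv_heads, seq_len, head_dim):
    """Cached scalars per layer: keys plus values for every cached token."""
    return 2 * n_kv_heads * seq_len * head_dim


# Four query heads sharing two KV heads (toy numbers).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
outputs = grouped_query_attention(queries, [(keys, values), (keys, values)])
print(outputs[0])

# Cache size for 64, 8, and 1 KV heads (illustrative sizes only).
for n_kv_heads in (64, 8, 1):
    print(n_kv_heads, kv_cache_entries(n_kv_heads, seq_len=4096, head_dim=128))
```

The cache size scales linearly with the number of KV heads, which is why sharing them across query heads pays off as context windows and batch sizes grow.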
 
---
### Pretraining
1. Masked Language Modeling (MLM): Used in BERT. Some of the input tokens are replaced with a mask, and the model attempts to predict the missing tokens based on the surrounding context.
2. Negative Log-Likelihood (NLL) and Perplexity (PPL) are metrics used in training and evaluating language models. NLL is a loss function: minimizing it maximizes the probability assigned to the correct predictions. Lower NLL indicates that the network has successfully learned patterns from the training set.
3. PPL is the exponential of the average per-token NLL, providing a more intuitive way to understand the model's performance. A small PPL indicates a well-trained network, while higher values indicate poor learning performance.
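
The relationship between NLL and PPL described above can be sketched in a few lines. This is a minimal illustration, assuming we already have the probability the model assigned to each correct token; the function names and the example probabilities are made up for the sketch.

```python
import math


def negative_log_likelihood(token_probs):
    """Mean NLL in nats, given the probability assigned to each correct token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the mean per-token NLL."""
    return math.exp(negative_log_likelihood(token_probs))


confident = [0.9, 0.8, 0.95, 0.85]    # model mostly right
uncertain = [0.25, 0.25, 0.25, 0.25]  # model guessing among four options

print(negative_log_likelihood(confident), perplexity(confident))
print(negative_log_likelihood(uncertain), perplexity(uncertain))
```

A model that guesses uniformly among four options has a perplexity of about 4, which is one way to read the metric: roughly how many equally likely choices the model is hesitating between at each step.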

---
### Tokenization
1. Common subword tokenization methods include Byte-Pair Encoding (BPE), WordPiece, and SentencePiece.
2. LLaMa 2's BPE tokenizer splits numbers into individual digits and uses bytes to decompose unknown UTF-8 characters. The total vocabulary size is 32K tokens.
3. An LLM can only generate outputs based on a sequence of tokens that does not exceed its context window (e.g. 1,000 to 10,000 tokens).
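
To give a feel for how BPE builds its vocabulary, here is a toy sketch of the merge-learning loop: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. The function names and the tiny corpus are assumptions for illustration. Production tokenizers add details this sketch omits, such as byte fallback for unknown characters, digit splitting, and special tokens.

```python
from collections import Counter


def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken in favor of the pair seen first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Start from single characters and apply num_merges greedy merges."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, segmented = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(segmented)
```

After two merges the frequent stem "low" has become a single symbol, while the rarer suffixes are still spelled out character by character. That is the trade-off subword tokenizers aim for: short sequences for common strings without an unbounded vocabulary.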

## Conditioning
Conditioning has recently become an important skill for adapting a model to specific tasks. It includes fine-tuning and prompting.
* __Fine-tuning__ involves modifying a pretrained language model by training it on a specific task using supervised learning. Pretrained models are usually trained again using Reinforcement Learning from Human Feedback (RLHF) to be helpful and harmless.
* __Prompting techniques__ present problems in text form to generative models. There are many different kinds of prompting techniques, ranging from simple questions to detailed instructions. Prompts can include examples of similar problems and their solutions. Zero-shot prompting involves no examples, while few-shot prompting includes a small number of examples of relevant problem and solution pairs. Prompt engineering and conditioning methods will be explored further in Chapter 8, Customizing LLMs and Their Output.
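
The difference between zero-shot and few-shot prompting comes down to what text is sent to the model. The sketch below only assembles the prompt string; `build_prompt` and the `Q:`/`A:` layout are assumptions chosen for illustration, not a format any particular model requires.

```python
def build_prompt(task, question, examples=()):
    """Zero-shot when examples is empty, few-shot otherwise."""
    lines = [task, ""]
    for example_question, example_answer in examples:
        lines += [f"Q: {example_question}", f"A: {example_answer}", ""]
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)


task = "Classify the sentiment of each review as positive or negative."

zero_shot = build_prompt(task, "The battery died after two days.")

few_shot = build_prompt(
    task,
    "The battery died after two days.",
    examples=[
        ("Great screen and very fast.", "positive"),
        ("Arrived broken and support never replied.", "negative"),
    ],
)

print(zero_shot)
print("-" * 40)
print(few_shot)
```

Both prompts end with an open `A:` so the model's continuation is the answer; the few-shot version simply shows the expected input/output pattern first.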

## What are text-to-image models?
Text-to-image models are a type of generative AI that creates realistic images from textual descriptions. They have diverse use cases in creative industries and design, such as generating advertisements, product prototypes, fashion images, and visual effects.

*  __Text-conditioned image generation__: Creating original images from text prompts like "a painting of a cat in a field of flowers."
* __Image inpainting__: Filling in missing or corrupted parts of an image based on the surrounding context. This can restore damaged images (denoising, dehazing, and deblurring) or edit out unwanted elements.
* __Image-to-image translation__: Converting input images to a different style or domain specified through text, like "make this photo look like a Monet painting."
* __Image recognition__: Large foundation models can be used to recognize images, including classifying scenes, but also object detection, for example, detecting faces.

Models like Midjourney, DALL-E 2, and Stable Diffusion provide creative and realistic images derived from textual input or other images. These models work by training deep neural networks on large datasets of image-text pairs.

- Key technique used is diffusion models, which start with random noise and gradually refine it into an image through repeated denoising steps.
- Popular models like Stable Diffusion and DALL-E 2 use a text encoder to map input text into an embedding space. The text embedding is fed into a series of conditional diffusion models, which denoise and refine a latent image in successive stages. The final model output is a high-resolution image aligned with the textual description.
- Two main classes of models: Generative Adversarial Networks (GANs) and diffusion models. GANs like StyleGAN or GANPaint Studio can produce highly realistic images, but training is unstable and computationally expensive. They consist of two networks that are pitted against each other in a game-like setting: the generator, which generates new images from text embeddings and noise, and the discriminator, which estimates the probability of the new data being real. As these two networks compete, GANs get better at their task, generating realistic images and other types of data.
- Diffusion models have become popular and promising for a wide range of generative tasks, including text-to-image synthesis. These models offer advantages over previous approaches, such as GANs, by reducing computation costs and sequential error accumulation.
    - Diffusion models operate through a process like diffusion in physics. They follow a __forward diffusion process__ that adds noise to an image until it becomes uncharacteristic and noisy.
    - The unique aspect of generative image models is the __reverse diffusion process__, where the model attempts to recover the original image from a noisy, meaningless image.
        1. By iteratively applying noise removal transformations, the model generates images of increasing resolution that align with the given text input.
        2. The final output is an image that has been modified based on the text input.
        3. E.g. the Imagen text-to-image model (_Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding_, May 2022), which incorporates frozen text embeddings from LLMs pretrained on text-only corpora. A text encoder first maps the input text to a sequence of embeddings.
        4. A cascade of conditional diffusion models takes the text embeddings as input and generates images.
        5. Only some steps within the 40-step generation process are shown in the diagram (Fig 1.8).
        6. The U-Net denoising process uses the __Denoising Diffusion Implicit Model (DDIM)__ sampler, which repeatedly removes Gaussian noise; the denoised output is then decoded into pixel space.

    Although they sometimes produce striking results, instability and inconsistency remain significant challenges to applying these models broadly.
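
The forward diffusion process mentioned above has a convenient closed form in the common DDPM-style formulation: a noisy sample at step t can be drawn directly as sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise. The sketch below assumes that formulation and a linear noise schedule; the notes do not say which schedule any given model uses, and all function names here are illustrative.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t is the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Draw x_t from the clean sample x0 in one shot; also return the noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    xt = [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]
    return xt, noise


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]  # stand-in for a tiny "image"

early, early_noise = add_noise(x0, alpha_bars[0], rng)   # still close to x0
late, late_noise = add_noise(x0, alpha_bars[-1], rng)    # almost pure noise
print(alpha_bars[0], alpha_bars[-1])
print(early)
print(late)
```

Training a diffusion model amounts to teaching a network to predict the returned `noise` from `xt`, which is what makes the reverse process possible.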

### Stable Diffusion

Developed by CompVis group at LMU Munich
- Significantly cuts training costs and sampling time compared to previous (pixel-based) diffusion models. By creating high-fidelity images from text on consumer GPUs, the SD model democratizes access. Further, the model's source code and even the weights have been released under the CreativeML OpenRAIL-M license, which permits reuse, distribution, commercialization, and adaptation, subject to the license's use-based restrictions.
- SD introduced operations in latent (lower-dimensional) space representations, which capture the essential properties of an image, in order to improve computational efficiency. A VAE provides latent space compression (called perceptual compression in the paper), while a U-Net performs iterative denoising.
- Generation process:
  1. It starts by producing a random tensor (random image) in the latent space, which serves as the noise for our initial image.
  2. A noise predictor (U-Net) takes in both the latent noisy image and the provided text prompt and predicts the noise.
  3. The model then subtracts the latent noise from the latent image.
  4. Steps 2 and 3 are repeated for a set number of sampling steps, for instance, 40 times, as shown in the plot.
  5. Finally, the decoder component of the VAE transforms the latent image back into pixel space, providing the final output image.
- __VAE__ is a model that encodes data into a learned, smaller representation (encoding). These representations can then be used to generate new data similar to that used for training (decoding). This VAE is trained first.
- U-Net is a CNN with a symmetric encoder-decoder structure. It is commonly used for image segmentation tasks.
    -  In SD it is used to predict and remove noise in the image. The U-Net takes a noisy image (seed) as input and processes it through a series of convolutional layers to extract features and learn semantic representations. These convolutional layers, typically organized in a contracting path, reduce the spatial dimensions while increasing the number of channels.
    -  Once the contracting path reaches the bottleneck of the U-Net, it expands through a symmetric expanding path.
    -  In the expanding path, transposed convolutions (aka upsampling or deconvolutions) are applied to progressively upsample the spatial dimensions while reducing the number of channels.
- __Mean Squared Error (MSE)__ is commonly used as the loss for training the image generation model in the latent space itself (latent diffusion model). In practice it typically measures the difference between the noise predicted by the U-Net and the noise that was actually added, which in turn determines how close the generated image is to the target image.
- Training data: e.g. the LAION-5B dataset, comprising image-text pairs from sites such as Pinterest, WordPress, Blogspot, Flickr, and DeviantArt.
- Uses the concept of forward and reverse diffusion processes, operating in a lower-dimensional latent space for efficiency.
- The conditioning process allows these models to be influenced by specific input textual prompts or by input types like depth maps or outlines, giving greater precision in creating relevant images. E.g. the text conditioning is an embedding generated by a text transformer.

- See NeRF Diffusion
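
The five generation steps in the Stable Diffusion notes above can be traced in a toy control-flow sketch. Everything here is a stand-in: `encode_prompt`, `predict_noise`, and `decode` are hypothetical placeholders for the text encoder, the U-Net, and the VAE decoder, and they contain no learned weights. In particular, the fake noise predictor simply treats the prompt embedding as the clean latent, which a real model does not do. The point is only the shape of the loop.

```python
import random


def encode_prompt(prompt, dim):
    """Stand-in text encoder: a deterministic pseudo-embedding of the prompt."""
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def predict_noise(latent, prompt_embedding, steps_left):
    """Stand-in U-Net: returns the share of the remaining deviation from the
    (pretend) clean latent that should be removed at this step."""
    return [(z - target) / steps_left for z, target in zip(latent, prompt_embedding)]


def decode(latent):
    """Stand-in VAE decoder: map latent values in [-1, 1] to 0..255 pixels."""
    return [round(255 * min(1.0, max(0.0, (z + 1.0) / 2.0))) for z in latent]


def generate(prompt, latent_dim=8, sampling_steps=40, seed=0):
    rng = random.Random(seed)
    prompt_embedding = encode_prompt(prompt, latent_dim)
    # Step 1: start from a random tensor in latent space.
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    # Step 4: repeat steps 2 and 3 for a fixed number of sampling steps.
    for step in range(sampling_steps):
        # Step 2: predict the noise from the noisy latent and the prompt.
        noise = predict_noise(latent, prompt_embedding, sampling_steps - step)
        # Step 3: subtract the predicted noise from the latent.
        latent = [z - n for z, n in zip(latent, noise)]
    # Step 5: decode the latent back into pixel space.
    return latent, decode(latent)


latent, pixels = generate("a painting of a cat in a field of flowers")
print(latent)
print(pixels)
```

Because the expensive loop runs on a small latent rather than on full-resolution pixels, only the final decode touches pixel space, which is where the efficiency gain described above comes from.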