# Generative AI

Generative AI (Gen AI) is a field of AI that focuses on creating new content, such as images, music, and text.

<img src="images/timeline-genai.png" width="600" height="400" />

## Gen AI Use Cases

**Text Generation**: Create new text, such as articles, stories, or conversations.

**Image Generation**: Create new images, such as photos, artwork, or designs.

**Music Generation**: Compose music, melodies, or sound effects.

**Video Generation**: Create new videos, such as animations or clips.

**Data Generation**: Create synthetic data for training or testing.

**Style Transfer**: Transfer styles between images, music, or text.

**Image-to-Image Translation**: Translate images from one domain to another.

**Text-to-Image Synthesis**: Generate images from text descriptions.

**Dialogue Generation**: Engage in conversation, responding to user input.

**Creative Writing**: Generate creative writing, such as poetry or short stories.

## Deep Learning

Deep learning is a subfield of machine learning that focuses on neural networks.

Neural Networks are a type of model that is inspired by the human brain. They consist of layers of neurons that process input data and produce output data. Deep Neural Networks are neural networks with many layers. They are capable of learning complex patterns in data.

<img src="images/nn.png" width="600" height="400" />
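
To make the idea of layers concrete, here is a minimal sketch of a forward pass through a two-layer network in NumPy. The layer sizes, the random weights, and the ReLU activation are illustrative assumptions, not something the text above prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Element-wise non-linearity applied between layers
    return np.maximum(0, x)

# Hypothetical sizes: 4 input features, 8 hidden neurons, 3 outputs
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def forward(x):
    hidden = relu(x @ W1 + b1)   # first layer: linear transform + activation
    return hidden @ W2 + b2      # output layer: linear transform only

batch = rng.normal(size=(5, 4))  # 5 example inputs
output = forward(batch)
print(output.shape)  # (5, 3)
```

A deep network simply stacks more of these layers; training adjusts `W1`, `b1`, `W2`, and `b2` instead of leaving them random.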

## Recurrent Neural Networks (RNNs)

Recurrent Neural Networks (RNNs) are a type of neural network that is designed to process sequences of data. They are commonly used for text and speech processing.

<img src="images/RNN.png" width="600" height="400" style=" background-color: white;" />

**Backward Propagation**: RNNs use backpropagation to update their weights and biases during training. This involves computing the gradient of the loss function with respect to the weights and biases of the network.

**Forward Propagation**: RNNs use forward propagation to compute the output of the network given an input sequence. This involves passing the input sequence through the network and computing the output at each time step.

**Long Short-Term Memory (LSTM)**: LSTMs are a type of RNN that are designed to capture long-term dependencies in data. They are capable of learning patterns that are separated by long sequences of data.
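
The forward propagation step described above can be sketched for a plain (non-LSTM) RNN cell as follows. The sizes, the weight names (`W_xh`, `W_hh`, `b_h`), and the `tanh` activation are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 input features, 5 hidden units, 6 time steps
input_size, hidden_size, seq_len = 3, 5, 6
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_forward(inputs):
    h = np.zeros(hidden_size)          # initial hidden state
    states = []
    for x_t in inputs:                 # one step per element of the sequence
        # The new state depends on the current input and the previous state
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

sequence = rng.normal(size=(seq_len, input_size))
hidden_states = rnn_forward(sequence)
print(hidden_states.shape)  # (6, 5)
```

Because the same weights are reused at every time step, gradients flow back through the whole loop during training, which is where long sequences become hard for plain RNNs and LSTMs help.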

Use Cases:
 * Natural Language Processing (NLP)
 * Speech Recognition
 * Time Series Prediction / Anomaly Detection
 * Music Composition
 * Handwriting Recognition
 * Video Activity Recognition



## Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) are a type of neural network that is designed to process images. They are commonly used for image recognition and computer vision.

<img src="images/CNN.jpeg" width="600" height="400" style=" background-color: white;" />

CNNs use convolutional layers to extract features from images. These layers apply filters to the input image to detect patterns, such as edges, corners, and textures.
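
As a rough illustration of how a filter detects a pattern, the sketch below slides a hand-written vertical-edge kernel over a toy image. In a real CNN the kernel values are learned; the image, the kernel, and the `conv2d` helper here are assumptions for demonstration only.

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" 2D cross-correlation (what deep learning libraries call
    # convolution): no padding, stride 1
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image: dark left part (0), bright right part (1)
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# Hand-written vertical-edge filter (a Sobel-style kernel)
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d(image, kernel)
print(feature_map)
```

The feature map is zero where the window covers only the dark region and positive where the window overlaps the dark-to-bright boundary.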

Use Cases:
* Image Classification
* Object Detection
* Image Segmentation
* Face Recognition
* Medical Imaging
* Gesture Recognition
* Image Synthesis
* Self-Driving Cars


## Neural Networks Zoo

The Neural Network Zoo is a collection of neural network architectures that have been developed over the years. It includes a wide range of models, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers. The Neural Network Zoo is a useful resource for researchers and practitioners who want to explore different neural network architectures.

<img src="images/nn-zoo.png" width="800" height="1024" />

Source: https://www.asimovinstitute.org/neural-network-zoo/

## Transformers

Transformers are a type of neural network architecture that has been successful in many Gen AI tasks. They are based on self-attention mechanisms that allow them to model long-range dependencies in data. Transformers have been used in many applications, such as language modeling, translation, and image generation.
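
The self-attention mechanism mentioned above can be sketched in a few lines of NumPy. This is a single-head, unmasked version with random projection matrices; the sizes and names (`W_q`, `W_k`, `W_v`) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity of every token to every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                       # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))       # one embedding per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)  # (4, 8) (4, 4)
```

Every token attends to every other token in one step, regardless of distance, which is why attention handles long-range dependencies well.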

<img src="images/transformers.png" width="600" height="600" />

Use Cases:
* Natural Language Processing (NLP)
* Language Modeling
* Question Answering
* Speech Recognition
* Time Series Forecasting
* Image Recognition

## Embeddings

Embeddings represent data as dense numeric vectors, usually in a space with far fewer dimensions than the raw input. They are commonly used in Gen AI to represent words, images, and other types of data. Embeddings are learned during training and capture the relationships between data points: similar items end up close together.

<img src="images/Transformers-attention-1.png" />

Cosine similarity is a common way to find similar embeddings.
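
A minimal sketch of cosine similarity follows. The three vectors are made-up 3-dimensional "embeddings" purely for illustration; real embeddings have hundreds of dimensions and come from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1 means same direction, 0 means orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings
cat = np.array([1.0, 0.9, 0.1])
kitten = np.array([0.9, 1.0, 0.2])
car = np.array([0.1, 0.2, 1.0])

print("cat vs kitten:", cosine_similarity(cat, kitten))
print("cat vs car:   ", cosine_similarity(cat, car))
```

The first score is close to 1 and the second is much lower, matching the intuition that "cat" is nearer to "kitten" than to "car".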

## Embedding Code Example

```python
import torch
from transformers import BertModel, BertTokenizer

# Initialize tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Define a sentence
sentence = "I love coding Gen AI applications."

# Tokenize the sentence and obtain the input IDs
inputs = tokenizer(sentence, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    outputs = model(**inputs)

# The last hidden state is the first element of the output tuple
last_hidden_states = outputs.last_hidden_state

print(last_hidden_states)
```

```text
tensor([[[ 0.1290,  0.0631,  0.0218,  ..., -0.2822,  0.3723,  0.4550],
         [ 0.7875, -0.0711, -0.3430,  ...,  0.0156,  0.5159,  0.1851],
         [ 1.1745,  0.9952,  0.3751,  ...,  0.0757,  0.2636, -0.0634],
         ...,
         [-0.1707,  0.2353, -0.0354,  ..., -0.6519, -0.3293,  0.1394],
         [ 0.1527, -0.3066,  0.0453,  ...,  0.2441,  0.2385, -0.5898],
         [ 0.4515,  0.0422,  0.0441,  ...,  0.2707, -0.4147, -0.2581]]])
```


### What is a Tensor?

* Key data structure used in creating, training, and testing neural networks
* Generalization of vectors and matrices
* Represented as a multi-dimensional array of numbers
* Tensor Ranks:
  * 0: Scalar
  * 1: Vector
  * 2: Matrix (2D array)
  * 3: 3D array

```python
import numpy as np

# A scalar (rank 0 tensor)
scalar = np.array(5)
print("Scalar (Rank 0 tensor):\n", scalar)

# A vector (rank 1 tensor)
vector = np.array([1, 2, 3])
print("\nVector (Rank 1 tensor):\n", vector)

# A matrix (rank 2 tensor)
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\nMatrix (Rank 2 tensor):\n", matrix)

# A rank 3 tensor
tensor = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   [[10, 11, 12], [13, 14, 15], [16, 17, 18]],
                   [[19, 20, 21], [22, 23, 24], [25, 26, 27]]])
print("\nRank 3 tensor:\n", tensor)
```

```text
Scalar (Rank 0 tensor):
 5

Vector (Rank 1 tensor):
 [1 2 3]

Matrix (Rank 2 tensor):
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

Rank 3 tensor:
 [[[ 1  2  3]
  [ 4  5  6]
  [ 7  8  9]]

 [[10 11 12]
  [13 14 15]
  [16 17 18]]

 [[19 20 21]
  [22 23 24]
  [25 26 27]]]
```


## Vector Database

A vector database is a database that stores vectors in a high-dimensional space. It allows you to query vectors based on their similarity to other vectors. This is useful for many Gen AI tasks, such as image search, recommendation systems, and content generation.
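
The core query a vector database answers, "which stored vectors are most similar to this one?", can be sketched with a brute-force in-memory store. `TinyVectorStore` is a hypothetical class written for this example; real vector databases add approximate-nearest-neighbor indexes, persistence, and filtering on top of the same idea.

```python
import numpy as np

class TinyVectorStore:
    """Hypothetical in-memory store using exhaustive cosine-similarity search."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vector):
        v = np.asarray(vector, dtype=float)
        self.ids.append(item_id)
        self.vectors.append(v / np.linalg.norm(v))   # store unit vectors

    def query(self, vector, k=2):
        q = np.asarray(vector, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q          # cosine similarity to every stored vector
        top = np.argsort(scores)[::-1][:k]           # indices of the k best matches
        return [(self.ids[i], float(scores[i])) for i in top]

store = TinyVectorStore()
store.add("cat", [1.0, 0.9, 0.1])      # made-up embeddings
store.add("kitten", [0.9, 1.0, 0.2])
store.add("car", [0.1, 0.2, 1.0])

results = store.query([1.0, 1.0, 0.0], k=2)
print(results)
```

Exhaustive search like this scales linearly with the number of stored vectors, which is the main reason dedicated vector databases rely on specialized indexes.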

<img src="images/vector_databases.png" width="600" height="400" />

Popular Vector Databases:
 * LanceDB
 * Pinecone
 * Chroma
 * OpenSearch
 * Redis
 * Postgres (pgvector)
 

## Large Language Models

Large language models are a type of Gen AI model that is trained on a large corpus of text data. They are capable of generating human-like text and engaging in conversation. Large language models have been used in many applications, such as chatbots, content generation, and translation.

Many people see an LLM as just a chatbot, but it is more than that.

<img src="images/cb-hero-top.png" width="300" height="400" style=" background-color: white;" />

A better metaphor is to see an LLM as an operating system (OS).

<img src="images/llm-os.png" width="600" height="400" style=" background-color: white;" />

[GPT-4o](https://openai.com/index/hello-gpt-4o/) and Google's Gen AI products announced at [Google I/O 2024](https://www.tomsguide.com/news/live/google-io-2024-keynote) are moving in exactly this direction.

## ChatGPT

GPT means: Generative Pre-trained Transformer.

ChatGPT is the poster child for LLMs and the Transformer architecture.

ChatGPT is a large language model (LLM) that is based on the GPT-series. It is capable of generating human-like text responses to user input. ChatGPT has been used in many applications, such as chatbots, conversational agents, and creative writing.

<img src="images/GPT_VS_BERT.png" />

### GPT Cost to train

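| Model   | Parameters                                                 | Estimated cost (USD) | Training time |
| ------- | ---------------------------------------------------------- | -------------------- | ------------- |
| GPT-3   | 175B                                                       | $4.6M                | 34 days       |
| GPT-4   | 100T (unconfirmed; OpenAI has not disclosed the model size) | $2.6B                | 100 days      |
| LLaMA 2 | 70B                                                        | $20M                 | 23 days       |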

According to the Llama 2 paper, training the 7B model took 184,320 A100 GPU-hours: 184,320 hours = 7,680 days ≈ 21 years on a single GPU. Renting an AWS p4d.24xlarge instance (8 GPUs) costs $32.7726 per hour.
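
The arithmetic above can be checked, and extended to a rough rental cost, in a few lines of Python. The sketch assumes all eight GPUs of each instance are fully used for the whole run and that the hourly price quoted above applies throughout; real training runs involve restarts, idle time, and negotiated pricing.

```python
# Figures quoted in the text above
GPU_HOURS = 184_320                 # GPU-hours for the training run
GPUS_PER_INSTANCE = 8               # GPUs in one p4d.24xlarge instance
PRICE_PER_INSTANCE_HOUR = 32.7726   # USD per instance-hour

single_gpu_days = GPU_HOURS / 24
single_gpu_years = single_gpu_days / 365

# Assumption: perfect utilization of all 8 GPUs, no restarts or idle time
instance_hours = GPU_HOURS / GPUS_PER_INSTANCE
rental_cost_usd = instance_hours * PRICE_PER_INSTANCE_HOUR

print(f"{single_gpu_days:,.0f} days on one GPU (~{single_gpu_years:.0f} years)")
print(f"{instance_hours:,.0f} instance-hours, about ${rental_cost_usd:,.0f}")
```

Under these assumptions the run needs 23,040 instance-hours, which puts the rental bill in the hundreds of thousands of dollars.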

Cost References:
* [GPT-4 Training days](https://towardsdatascience.com/the-carbon-footprint-of-gpt-4-d6c676eb21ae#:~:text=Recall%20that%20it's%20estimated%20that,to%202%2C600%20hours%20per%20server.)
* [GPT-3 Cost](https://www.forbes.com/sites/craigsmith/2023/09/08/what-large-models-cost-you--there-is-no-free-ai-lunch/?sh=3561d9e24af7)
* [GPT-3 Cost comments](https://ai.stackexchange.com/questions/43128/what-is-accelerated-years-in-describing-the-amount-of-the-training-time#:~:text=Since%20GPT%2D3%20is%20a,train%20the%20GPT%2D3%20model.)
* [GPT Cost](https://www.moomoo.com/community/feed/109834449715205#:~:text=In%20terms%20of%20the%20training,reach%2046%20million%20U.S.%20dollars.)
* [LLAMA 2 model card](https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md)
* [LLAMA 2 Cost](https://www.quora.com/How-much-money-did-Meta-cost-to-train-Llama-2-Why-has-NVIDIAs-stock-price-been-rising-and-how-high-could-it-ultimately-go#:~:text=Cost%20Breakdown%3A%20Meta%20allocated%20a,the%20training%20of%20Llama%202.)
* [LLAMA2 Cost comments](https://www.linkedin.com/posts/hherry_according-to-the-llama-v2-paper-it-took-activity-7128798861606694912-FLaT/?trk=public_profile_like_view)
* [LLAMA2 training days](https://news.ycombinator.com/item?id=35008694)

Cost of training on AWS SageMaker

<img src="images/LLM-Cost-of-training-SageMaker-2023.png" />

Size of GPT-4 model

<img src="images/GPT_4_metaphor.png" />

## Prompt Engineering

Prompt engineering is a technique used to guide the output of a language model by providing a specific prompt. It involves designing prompts that encourage the model to generate desired responses. Prompt engineering is commonly used in chatbots, content generation, and translation.

<img src="images/Prompt-Engineering-Example.png" width="600" height="400" />




### Approaches to prompt engineering

**Zero-Shot Learning** The model is given a task without any examples in the prompt and is expected to perform it from the instruction alone. For instance, to translate English to French, you provide only the English sentence you want translated, with no example pairs of English sentences and their French translations.

**Few-Shot Learning** The model is given a few examples of the task at the beginning of the prompt. These examples help the model understand the task it needs to perform. For instance, if you want the model to translate English to French, you might provide a few examples of English sentences and their French translations before providing the English sentence you want translated. The idea is that these examples guide the model towards the correct output.

**Chain-of-Thought** The model is prompted to work through intermediate reasoning steps before giving its final answer, either by showing worked examples that include the reasoning or by simply asking it to think step by step. For instance, for a math word problem, the model writes out each calculation and only then states the result. The idea is that spelling out the steps improves accuracy on multi-step problems.

**Tree-of-Thought** The model explores several candidate reasoning paths organized as a tree instead of a single chain. At each step it proposes multiple possible next thoughts, evaluates them, and continues with the most promising branches, backtracking when a branch leads nowhere. The idea is that the tree structure helps the model search the space of possible solutions.
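
The translation example used for the first two approaches above can be turned into actual prompt strings. The helper names and the prompt wording below are assumptions for illustration; the resulting string would be sent to whichever LLM you use.

```python
def zero_shot_prompt(sentence):
    # Only the instruction and the input, no examples
    return f"Translate English to French.\nEnglish: {sentence}\nFrench:"

def few_shot_prompt(sentence, examples):
    # The same instruction, followed by worked examples and then the real input
    blocks = ["Translate English to French."]
    for english, french in examples:
        blocks.append(f"English: {english}\nFrench: {french}")
    blocks.append(f"English: {sentence}\nFrench:")
    return "\n\n".join(blocks)

examples = [
    ("Good morning.", "Bonjour."),
    ("Thank you very much.", "Merci beaucoup."),
]

print(zero_shot_prompt("Where is the library?"))
print("---")
print(few_shot_prompt("Where is the library?", examples))
```

Both prompts end with an open `French:` line, so the model's most natural continuation is the translation itself.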

## Hugging Face

Hugging Face is a company that specializes in natural language processing (NLP) and Gen AI. It provides a wide range of models, datasets, and tools for developers to use in their projects. Hugging Face is known for its Transformers library, a popular open-source library for working with transformer models. Hugging Face is like GitHub for AI models.

### Hugging Face Tasks

<img src="images/hf-tasks.png" width="600" height="400" />

https://huggingface.co/tasks

It's the real deal! It makes it easy to consume models and perform common AI tasks on text, audio, images, or video.

```python
from transformers import pipeline

# Initialize the Hugging Face sentiment-analysis pipeline
nlp = pipeline("sentiment-analysis")

# Use the pipeline to analyze this sentence
result = nlp("I love using Generative AI")[0]

# Print the result
print(f"label: {result['label']}, with score: {result['score']}")
```

```text
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

label: POSITIVE, with score: 0.9988325238227844
```


## LLM Model Evaluation Metrics

**BLEU Score**: The BLEU score is a metric used to evaluate the quality of machine translation. It measures the overlap between the generated translation and the reference translation. The BLEU score ranges from 0 to 1, with higher scores indicating better translations.

**ROUGE Score**: The ROUGE score is a metric used to evaluate the quality of text summarization. It measures the overlap between the generated summary and the reference summary. The ROUGE score ranges from 0 to 1, with higher scores indicating better summaries.
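
Both BLEU and ROUGE build on counting overlapping n-grams. The sketch below shows only the simplest case, unigram overlap: precision is the BLEU-style view and recall is the ROUGE-style view. It is a simplification for illustration; real BLEU combines several n-gram orders with a brevity penalty, and ROUGE has several variants.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())       # clipped count of shared words
    precision = overlap / sum(cand.values())   # share of generated words found in the reference
    recall = overlap / sum(ref.values())       # share of reference words that were generated
    return precision, recall

precision, recall = unigram_overlap(
    "the cat sat on the mat",   # generated text
    "the cat is on the mat",    # reference text
)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here five of the six words match in both directions, so both scores are 5/6.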

**Perplexity**: Perplexity is a metric used to evaluate the quality of language models. It measures how well the model predicts the next word in a sequence. Lower perplexity scores indicate better language models.
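
Perplexity can be computed directly from the probabilities a model assigns to the tokens that actually occurred. The probabilities below are invented for illustration.

```python
import math

def perplexity(token_probabilities):
    # Perplexity = exp of the average negative log-probability of the observed tokens
    nll = [-math.log(p) for p in token_probabilities]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities two models assigned to the same 4 correct next tokens
confident_model = [0.9, 0.8, 0.95, 0.85]
unsure_model = [0.25, 0.25, 0.25, 0.25]

print(perplexity(confident_model))
print(perplexity(unsure_model))
```

A model that always assigns probability 0.25 to the correct token has a perplexity of about 4: it is as uncertain as if it were choosing uniformly among four options.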

**F1 Score**: The F1 score is a metric used to evaluate the quality of classification models. It measures the balance between precision and recall. The F1 score ranges from 0 to 1, with higher scores indicating better classification models.

**Accuracy**: Accuracy is a metric used to evaluate the quality of classification models. It measures the percentage of correctly classified examples. Higher accuracy scores indicate better classification models.
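
Accuracy and F1 can be computed from scratch for a binary classifier as follows. The labels are made up, and `classification_metrics` is a helper written for this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)   # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)   # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)   # false negatives
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy counts all correct predictions equally, while F1 focuses on the positive class, which matters when classes are imbalanced.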

## LLM Model Evaluation Benchmarks

LLM Benchmarking is a way to evaluate the performance of large language models (LLMs) on a variety of tasks. It involves testing the model on different datasets and measuring its performance using various metrics. LLM benchmarking helps researchers and practitioners understand the strengths and weaknesses of different LLMs and compare them to each other.

<img src="images/llm_leaderboard_sept_2023.png" width="600" height="400" />

Sample Benchmark example (from [Sep/2023](https://www.trustbit.tech/blog/2023/9/20/llm-performance-series-batching))

**MMLU - Massive Multitask Language Understanding**: MMLU is a benchmark that measures the knowledge and problem-solving ability of large language models (LLMs) using multiple-choice questions across dozens of subjects, from elementary mathematics to law and medicine. It helps researchers and practitioners compare how broadly and deeply different LLMs understand real-world topics.

**GLUE Benchmark**: The General Language Understanding Evaluation (GLUE) benchmark provides a standardized set of diverse NLP tasks to evaluate the effectiveness of different language models.

**SuperGLUE Benchmark**: The Super General Language Understanding Evaluation (SuperGLUE) benchmark is a successor to GLUE with more challenging and diverse tasks, such as natural language inference, coreference resolution, and commonsense reasoning. It comes with comprehensive human baselines for comparison.

**HellaSwag**: HellaSwag is a benchmark that evaluates commonsense reasoning. Given the start of a scenario, the model must choose the most plausible ending from several candidates, which tests how well an LLM can complete a sentence.

**TruthfulQA**: TruthfulQA is a benchmark that evaluates how well an LLM generates truthful answers to questions, using questions that people often answer incorrectly because of common misconceptions.




## ONNX - Open Neural Network Exchange

ONNX is an open-source format for representing deep learning models. It allows models to be trained in one framework and deployed in another. ONNX is supported by many popular deep learning frameworks, such as PyTorch, TensorFlow, and MXNet.

ONNX is a big deal because you can not only go from PyTorch <--> TensorFlow, but also train a model in Python and run inference on it in Java.

<img src="images/onnx.png" width="600" height="400" />

## LangChain

LangChain is a framework designed to simplify the creation of applications using large language models (LLMs). As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.

<img src="images/overall-langchain.jpeg" width="600" height="500" />

For Java, [LangChain4j](https://docs.langchain4j.dev/) provides a LangChain-style implementation.

Using Langchain we can do:
* Smooth integration into your Java applications (or other languages). The integration is two-way: you can call LLMs from Java and allow LLMs to call your Java code in return.
* Prompt Templating
* Output parsing
* Patterns like RAG or Agents
* Vector database integration (Pinecone, OpenSearch, Redis, pgvector, and more)

## RAG (Retrieval-Augmented Generation)

LLMs can hallucinate.

<img src="images/LLM_hallucinations.png"  />

RAG is an approach that combines retrieval and generation models to improve the performance of language models. It uses a retrieval model to find relevant information in a large corpus of text and a generation model to produce responses based on that information.

<img src="images/Rag-Simple.png"  />

RAG is key because it produces more accurate and relevant responses by combining the strengths of retrieval and generation models. RAG can also reduce the cost of fine-tuning.
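
The retrieve-then-generate flow can be sketched end to end without any external service. Everything here is a stand-in: the three-sentence corpus, the word-count "embedding", and the prompt wording are assumptions for illustration, and the final prompt would be sent to a real LLM instead of being printed.

```python
import numpy as np

# Toy corpus standing in for a real document store
documents = [
    "LangChain4j is a Java library for building LLM applications.",
    "pgvector adds vector similarity search to Postgres.",
    "ONNX is an open format for exchanging deep learning models.",
]

def tokenize(text):
    return text.lower().replace(".", "").replace("?", "").split()

def embed(text, vocabulary):
    # Stand-in embedding: a word-count vector. A real system would use a learned embedding model.
    words = tokenize(text)
    return np.array([words.count(w) for w in vocabulary], dtype=float)

def retrieve(question, docs, k=1):
    vocabulary = sorted({w for text in docs + [question] for w in tokenize(text)})
    q = embed(question, vocabulary)
    scores = []
    for doc in docs:
        v = embed(doc, vocabulary)
        scores.append(float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))))
    best = np.argsort(scores)[::-1][:k]          # indices of the k most similar documents
    return [docs[i] for i in best]

def build_prompt(question, context_docs):
    context = "\n".join(f"- {d}" for d in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

question = "What does pgvector add to Postgres?"
prompt = build_prompt(question, retrieve(question, documents))
print(prompt)  # this string would then be sent to the LLM
```

Because the answer is grounded in retrieved text, updating the corpus changes what the model can answer without retraining it.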


## Stable Diffusion

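Stable Diffusion is a diffusion model. During training, noise is added step by step to an image to create a sequence of increasingly noisy images, and the model learns to denoise them to recover the original image. To generate a new image, the model starts from random noise and applies the denoising step repeatedly, guided by the text prompt, until a high-quality image emerges.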
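
The noising half of the process described above can be sketched in NumPy. The linear schedule with 1000 steps is a common DDPM-style choice used here as an assumption; it is not necessarily the schedule Stable Diffusion itself uses, and the denoising network is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear noise schedule with 1000 steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)      # fraction of the original signal kept after each step

def add_noise(x0, t):
    # Closed form of the forward process:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise                      # a denoiser is typically trained to predict this noise

x0 = rng.uniform(-1, 1, size=(8, 8))       # stand-in for an image
for t in (0, 499, 999):
    x_t, _ = add_noise(x0, t)
    print(f"step {t}: signal kept = {alpha_bars[t]:.4f}")
```

Early steps keep almost all of the original signal, while the last step is nearly pure noise, which is why generation can start from random noise and run the process in reverse.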

<img src="images/Stable_Diffusion_architecture.png" />

Popular text-to-image models:
* Stable Diffusion 3 (Stability AI)
* DALL-E 3 (OpenAI)
* Imagine with Meta AI (Meta)
* Midjourney

Challenges with Stable Diffusion:
* Computational Intensity / High Hardware Demands
* Quality Variance
* Technical Complexity


## Generative Adversarial Networks (GAN)

Generative Adversarial Networks (GANs) are a type of neural network that is designed to generate new data. They consist of two networks: a generator and a discriminator. The generator creates new data samples, while the discriminator tries to distinguish between real and fake samples.

<img src="images/GANs.png" style=" background-color: white;" />

GAN Pros:
  * Synthetic data generation
  * High-quality results
  * Versatility

GAN Cons:
  * Training Instability
  * Computational Cost
  * Overfitting
  * Bias and Fairness
  * Interpretability and Accountability

Stable Diffusion vs GAN: The main difference between the two methods is their approach to generating new data. Stable Diffusion uses a process of adding and removing noise to an image, while GANs use a game-theoretic approach where two networks compete against each other.

## Scaling LLMs

Large Language Models (LLMs) are fundamentally inefficient. They require a lot of computational resources to train and run, so scaling them is a challenge. The community is working on many optimizations and improvements to make LLMs more efficient.

Common LLM architecture improvements:
* Sparsity: Reduce the number of parameters in the model by using sparse matrices.
* Quantization: Reduce the precision of the model's weights and activations to save memory.
* Pruning: Remove unnecessary connections in the model to reduce the number of parameters.
* Knowledge Distillation: Train a smaller model to mimic the behavior of a larger model.
* Model Parallelism: Split the model across multiple devices to speed up training and inference. However, communication between devices can increase latency.
* Data Parallelism: Split the data across multiple devices to speed up training. However, synchronizing gradients across devices adds communication overhead.
* Mixed Precision Training: Use a combination of single and half precision to speed up training.
* Gradient Checkpointing: Store only a subset of intermediate activations and recompute the rest during the backward pass, trading extra compute for lower memory usage.
* Efficient Attention Mechanisms: Use more efficient attention mechanisms to reduce the computational cost of self-attention.
* Efficient Transformers: Use more efficient transformer architectures to reduce the number of parameters in the model.
* Efficient Embeddings: Use more efficient embeddings to reduce the size of the model.
* Efficient Activation Functions: Use more efficient activation functions to reduce the computational cost of the model.
* Efficient Loss Functions: Use more efficient loss functions to reduce the computational cost of the model.

### Efficient Transformers

**Flash Attention**: Attention operations have a memory bottleneck. Flash Attention is an attention algorithm that computes exact attention while reducing reads and writes to GPU memory: it processes the attention matrix in tiles that fit in fast on-chip memory instead of materializing the full matrix. This lets transformer-based models scale more efficiently, enabling faster training and inference.

<img src="images/attention-vs-flash-attention.png" />

**Paged Attention**: PagedAttention attempts to optimize memory use by partitioning the KV cache into blocks that are accessed through a lookup table. Thus, the KV cache does not need to be stored in contiguous memory, and blocks are allocated as needed. The memory efficiency can increase GPU utilization on memory-bound workloads, so more inference batches can be supported.
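
The block lookup table described above can be sketched as a small bookkeeping class. This toy version tracks only which physical block holds which token position; it stores no actual key/value tensors, and the class name, block size, and method names are assumptions for illustration.

```python
BLOCK_SIZE = 4   # tokens per block (hypothetical value)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # pool of physical blocks
        self.block_tables = {}                       # sequence id -> list of physical block ids
        self.lengths = {}                            # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:                 # no block yet, or the last one is full
            table.append(self.free_blocks.pop(0))    # allocate a new block only when needed
        self.lengths[seq_id] = length + 1

    def locate(self, seq_id, token_index):
        # Translate a logical token position into (physical block, offset within block)
        block = self.block_tables[seq_id][token_index // BLOCK_SIZE]
        return block, token_index % BLOCK_SIZE

cache = PagedKVCache(num_blocks=8)
for _ in range(5):
    cache.append_token("A")
for _ in range(3):
    cache.append_token("B")
for _ in range(4):
    cache.append_token("A")

print(cache.block_tables)      # {'A': [0, 1, 3], 'B': [2]}
print(cache.locate("A", 8))    # (3, 0)
```

Sequence A ends up in blocks 0, 1, and 3, which are not contiguous, yet the lookup table still finds every token. That is what lets many sequences share GPU memory without reserving large contiguous regions up front.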

<img src="images/PagedAttentionKV.jpg" />

**Quantization**: Quantization is how we can make LLMs smaller. GPTQ, for example, is a post-training quantization method: it quantizes each layer by finding a compressed version of its weights that yields the minimum mean squared error.
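
To show the basic idea of storing weights at lower precision, the sketch below uses simple symmetric absmax quantization to int8. This is deliberately not GPTQ, which is considerably more sophisticated; it only illustrates the size reduction and the resulting error.

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric absmax quantization: map [-max|w|, +max|w|] onto the int8 range [-127, 127]
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)   # stand-in for a layer's weights

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(weights.nbytes, "->", q.nbytes, "bytes")
print("mean squared error:", float(np.mean((weights - restored) ** 2)))
```

The int8 copy takes a quarter of the memory of the float32 original, at the price of a small rounding error per weight.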

<img src="images/LLM-Quantization.png" />



## KAN

[KAN](https://arxiv.org/abs/2404.19756) (Kolmogorov-Arnold Network) is a potential alternative to the MLP.

<img src="images/KAN-simple.png" />

KANs diverge from traditional Multi-Layer Perceptrons (MLPs) by replacing fixed activation functions with learnable functions, effectively eliminating the need for linear weight matrices.

<img src="images/KAN_VS_MLP.png" />

## LLM Security

[OWASP Top 10 for LLMs](https://owasp.org/www-project-top-10-for-large-language-model-applications/)

LLM01: Prompt Injection
 * Crafted inputs cause unintended actions by the LLM

LLM02: Insecure Output Handling
 * XSS, CSRF, SSRF

LLM03: Training Data Poisoning
 * LLM training data is tampered

LLM04: Model Denial of Service
 * Attackers cause resource-heavy operations on LLMs

LLM05: Supply Chain Vulnerabilities
 * Third-party datasets, pre-trained models, and plugins can add vulnerabilities

LLM06: Sensitive Information Disclosure 
 * LLMs may inadvertently reveal confidential data in their responses
 * Unauthorized data access, privacy violations, and security breaches.

LLM07: Insecure Plugin Design
 * LLM plugins can have insecure inputs and insufficient access control

LLM08: Excessive Agency
 * Excessive functionality, permissions, or autonomy granted to the LLM-based systems

LLM09: Overreliance
 * Misinformation, miscommunication, legal issues, and security vulnerabilities

LLM10: Model Theft
 * Unauthorized access, copying, or exfiltration of proprietary LLM models
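
As one small illustration of LLM02 (Insecure Output Handling) above, model output can be treated like any other untrusted input and escaped before it is placed in a web page. The `render_llm_reply` helper and the malicious reply are hypothetical; this sketch covers only HTML escaping, not the full set of mitigations OWASP describes.

```python
import html

def render_llm_reply(reply):
    # Treat model output as untrusted: escape it before embedding it in HTML
    return f'<div class="reply">{html.escape(reply)}</div>'

# Hypothetical reply in which the model was tricked into emitting a script tag
malicious_reply = 'Sure! <script>alert("xss")</script>'
print(render_llm_reply(malicious_reply))
```

After escaping, the browser shows the script tag as text instead of executing it.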
