<center>
<img src="./images/00_main_arcada.png" style="width:1400px">
</center>

# Foundational Models 1/3  
## Introduction

#### Presenter:
> Anton Akusok (PhD), **Senior Lecturer in IT / Senior Data engineer at Zalando**<br/> 
(*email*: anton.akusok@arcada.fi)

## Goals

* What are Foundational models
* Why they exist
* How to use them

# Intro

We wrote a paper on fake signature recognition 6 years ago.

<img src="./images/f1-signatures.png" alt="Signatures" width="90%">

The original authors of the dataset created and trained a complex deep learning network for signature verification.

<img src="./images/f1-net-original.png" alt="Original net" width="80%">

We got marginally better results without training any deep learning models.

### How?

<img src="./images/f1-inception.png" alt="Original net" width="80%">

Instead of creating and training a DL model specifically for our task, we used an existing DL model trained for image classification: a **foundational model**.
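A minimal sketch of this pattern in NumPy, under loud assumptions: `extract_features` is a hypothetical stand-in for a frozen pre-trained network (here only a fixed random projection, not Inception), and the two classes are synthetic vectors rather than signature images. Only the small linear classifier on top is fitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: stand-in for a frozen pre-trained model. A real setup would
# pass images through an existing network and take its embeddings.
W_frozen = rng.normal(size=(64, 32))

def extract_features(samples):
    """Frozen feature extractor: its weights are never updated."""
    return np.maximum(samples @ W_frozen, 0.0)

def add_bias(features):
    return np.hstack([features, np.ones((len(features), 1))])

# Synthetic two-class data (e.g. "genuine" = 1, "forged" = 0).
n = 200
X = np.vstack([rng.normal(loc=1.0, size=(n, 64)),
               rng.normal(loc=-1.0, size=(n, 64))])
y = np.array([1] * n + [0] * n)

# Shuffle, then split into training and test parts.
idx = rng.permutation(2 * n)
train, test = idx[:300], idx[300:]

# The only "training": a least-squares linear classifier on frozen features.
features = extract_features(X)
w, *_ = np.linalg.lstsq(add_bias(features[train]), y[train], rcond=None)

pred = (add_bias(features[test]) @ w > 0.5).astype(int)
accuracy = (pred == y[test]).mean()
print(f"test accuracy: {accuracy:.2f}")
```

The point of the design is cost: the expensive part is reused as-is, and the task-specific part is a single linear solve.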

Comparison:

* the signatures dataset has 30 signatures for each of 10,000 "people": 300,000 samples and 2 classes
* "Inception21k" is trained on 14 million images with 21,000 classes



# What are Foundational Models

### *Foundation models are AI neural networks trained on massive unlabeled datasets to handle a wide variety of jobs from translating text to analyzing medical images.*  
(NVIDIA, https://blogs.nvidia.com/blog/what-are-foundation-models/)

They learn the data itself, and make handling this or similar data much easier.

<img src="./images/f1-models.jpg" alt="Models" width="90%">

### What makes Foundational models different from DL or ML models?

* trained on mountains of raw data
* foundational model is per "type of data"
    * regular DL or ML model is per "task"
* mostly with unsupervised learning (no labels)
* but data enrichments can apply: 
    * human validated answers for LLM
    * image + caption for image models
    * task caption + video recording for robot training
* can be adapted to a broad range of tasks

### Why do Foundational models exist?

Because data is hard, and foundational models make data easy.

Mathematically, they learn the data manifold.  
(https://prateekvjoshi.com/2014/06/21/what-is-manifold-learning/)

<img src="./images/f1-swissroll.png" alt="Models" width="40%">

### Foundational models make for very cheap solutions

Models themselves are ridiculously expensive to train, but they are trained only once.
<img src="./images/f1-training.jpg" alt="Training" width="80%">

The expensive training is necessary to get good results on complex data like text or images.  

This prevented creating universal text/image models until a few years ago.

Human equivalent:

* foundational human vision model: evolving a human being from an amoeba
* task adaptation: learning how to read

Evolving a new human being for every class in school would be too expensive!

### Foundational models can learn from two types of data

Combining two data types in one model enables some very useful applications!

* text + image: image generation
* text + audio: text-to-speech, song and soundtrack synthesis
* text + shape: 3D object recognition and generation
* text + video: robot task planning, robot actions generation from request

## What types of Foundational models exist?

Foundational models can be grouped by distinctive "features", but there is no hierarchy and one model can have several "features"  
https://github.com/awaisrauf/Awesome-CV-Foundational-Models

<img src="./images/f1-taxonomy-1.png" alt="Models" width="90%">


Some architecture styles of vision models

Models re-use standard blocks in different combinations; researchers write NN layers, backpropagation rules, etc. to implement these blocks as one model trainable on GPUs.

<img src="./images/f1-taxonomy-2.png" alt="Models" width="90%">

## Feature: Autoregressive models

An "autoregressive" model predicts the next step from the previous steps.

Canonical example: LLMs that generate text one word at a time.

This is why LLMs are "slow": they cannot generate a batch of 1000 words at once. Each new word depends on the words before it, so the whole model runs again for every word.

The term **autoregressive** is very old. It refers to a classical linear regression model where the inputs are `X[:-1]` and the targets are `X[1:]`, so the model is doing regression upon itself, hence "auto"regressive.
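A minimal sketch of that classical setup, assuming a synthetic series with a made-up coefficient of 0.9. It fits a single coefficient by least squares and then generates new values one step at a time, each from the previous one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series (assumption for illustration): x[t] = 0.9 * x[t-1] + noise
true_coef = 0.9
X = np.zeros(500)
for t in range(1, len(X)):
    X[t] = true_coef * X[t - 1] + rng.normal(scale=0.1)

# Regression of the series upon itself.
inputs = X[:-1]
targets = X[1:]
coef = (inputs @ targets) / (inputs @ inputs)  # one-parameter least squares

# Autoregressive generation: every new step needs the previous one,
# so the steps cannot be computed in one parallel batch.
generated = [X[-1]]
for _ in range(5):
    generated.append(coef * generated[-1])

print(f"estimated coefficient: {coef:.3f}")
```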

LLMs and modern models have **state** that is carried from one sample to the next, slowly changing over time.

**Attention** is a set of weights that connects one sample to selected previous samples instead of all previous samples.

<img src="./images/f1-autoregressive.png" alt="Models" width="40%">
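A minimal sketch of attention weights, assuming the common scaled dot-product form with a causal mask so that each step only looks at itself and earlier steps. Real models also apply learned projections to get queries, keys and values; those are omitted here, and the function name is made up for illustration.

```python
import numpy as np

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where step i ignores steps after i."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)

    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax turns scores into weights that sum to 1.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
steps = rng.normal(size=(5, 8))  # 5 time steps, 8 features each
output, weights = causal_attention(steps, steps, steps)
print(np.round(weights, 2))  # lower-triangular: large entries are the "selected" steps
```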

## Feature: Diffusion models
https://stable-diffusion-art.com/how-stable-diffusion-work/


Widely used in image generation. They can also generate a new image from an existing sample image:

* super-resolution (literally drawing a new image from a low-res sample image)
* inpainting (drawing masked parts of an image; can extend an image outwards, making up its surroundings)
* specific image improvements guided by a text prompt

This is the craziest model idea because it starts with pure noise, **predicts noise**, and removes the predicted noise step by step until an image is developed like a photo...

<img src="./images/f1-diffusion.png" alt="Models" width="90%">
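A minimal sketch of the noise arithmetic, assuming the standard DDPM-style formulation with a linear noise schedule. There is no trained network here: the "predicted" noise is simply the true noise (an oracle), to show that a good noise prediction is enough to recover the image. Real samplers take many small steps instead of one jump.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule (assumed values, a common linear choice).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t, noise):
    """Forward process: blend the clean image with Gaussian noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def remove_noise(xt, t, predicted_noise):
    """Invert the blend, given a prediction of the noise that was added."""
    return (xt - np.sqrt(1.0 - alpha_bar[t]) * predicted_noise) / np.sqrt(alpha_bar[t])

image = rng.uniform(-1, 1, size=(8, 8))  # toy "image"
noise = rng.normal(size=image.shape)
t = 500

noisy = add_noise(image, t, noise)
# ASSUMPTION: an oracle stands in for the trained noise-prediction network.
recovered = remove_noise(noisy, t, predicted_noise=noise)
print(np.abs(recovered - image).max())
```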

## Feature: Variational Autoencoder

A model that *compresses* data to a lower-dimensional vector, and *decompresses* it back.

It does not have enough capacity to store the raw data, so it learns to improvise and keep only the important features.

It works together with a diffusion model to enable denoising in a small *latent* space.

This enables very high-resolution images because the work happens in a resolution-agnostic latent space.

<img src="./images/f1-vae.png" alt="Models" width="80%">
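A minimal sketch of the compress/decompress idea only. It uses a plain linear autoencoder built from an SVD, which is not variational and not a neural network; a real VAE learns non-linear encoders and decoders and adds randomness in the latent space. The data is synthetic: 20-dimensional points that secretly depend on 3 hidden factors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 20 visible dimensions driven by 3 hidden factors plus noise.
hidden = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
data = hidden @ mixing + 0.01 * rng.normal(size=(200, 20))

# "Train" the linear autoencoder: keep the 3 strongest directions.
mean = data.mean(axis=0)
_, _, Vt = np.linalg.svd(data - mean, full_matrices=False)
components = Vt[:3]

def encode(x):
    """Compress 20 numbers into a 3-number latent vector."""
    return (x - mean) @ components.T

def decode(z):
    """Decompress a latent vector back to 20 numbers."""
    return z @ components + mean

latent = encode(data)
restored = decode(latent)
relative_error = np.linalg.norm(data - restored) / np.linalg.norm(data)
print(latent.shape, f"relative error: {relative_error:.4f}")
```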

## Feature: Generative Adversarial Network (GAN)

A training approach rather than a network itself. Originally for image generation but useful in many other areas like anomaly detection.

You can build your own simple GAN:  
https://www.geeksforgeeks.org/generative-adversarial-network-gan/

<img src="./images/f1-gan.png" alt="Models" width="80%">

## Feature: Contrastive Learning (multimodal models)

The training data consists of pairs of different data types, e.g. text and image in CLIP (Contrastive Language-Image Pre-training). 

The model learns to generate similar embeddings for correct pairs and different embeddings for wrong pairs.  
https://www.marqo.ai/course/introduction-to-clip-and-multimodal-models

<img src="./images/f1-clip.png" alt="Models" width="60%">
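A minimal sketch of a CLIP-style contrastive loss in NumPy. The embeddings are random vectors standing in for encoder outputs, and the temperature of 0.07 is an assumed value. The loss is low when each image embedding is closest to its own text embedding, and high when the pairs are mismatched.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def diagonal_cross_entropy(logits):
    """Cross-entropy where the correct class for row i is column i."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric loss over the image-to-text and text-to-image directions."""
    logits = normalize(image_emb) @ normalize(text_emb).T / temperature
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 16))                       # 4 captions
aligned_images = text_emb + 0.1 * rng.normal(size=(4, 16))  # correct pairs
shuffled_images = aligned_images[[1, 2, 3, 0]]             # wrong pairs

loss_aligned = contrastive_loss(aligned_images, text_emb)
loss_shuffled = contrastive_loss(shuffled_images, text_emb)
print(f"aligned: {loss_aligned:.3f}, shuffled: {loss_shuffled:.3f}")
```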

# How to use Foundational models?

1. Do not train them :D
2. Give them text instructions: Prompt engineering
3. Fine-tuning: Low-Rank Adaptation (LoRA), domain adaptation

Text input in foundational models is literally for **our instructions**!
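For the fine-tuning option, a minimal sketch of the LoRA idea on a single weight matrix, with assumed sizes (64x64, rank 4). The large matrix stays frozen and only two thin matrices would be trained. The usual scaling factor and the training loop itself are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4  # assumed sizes for illustration

W_frozen = rng.normal(size=(d_out, d_in))      # pre-trained weights, never updated
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable, small random init
B = np.zeros((d_out, rank))                    # trainable, zero init

def forward(x):
    """Frozen layer output plus the low-rank correction B @ A."""
    return x @ W_frozen.T + x @ A.T @ B.T

x = rng.normal(size=(2, d_in))
unchanged = np.allclose(forward(x), x @ W_frozen.T)  # B = 0, so no change yet

full_params = W_frozen.size
lora_params = A.size + B.size
print(f"trainable: {lora_params} of {full_params} ({lora_params / full_params:.1%})")
```

Starting with `B` at zero means the adapted model behaves exactly like the original until training moves it.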

### Foundational models are not omnipotent  
https://www.scribbledata.io/blog/foundation-models-101-a-step-by-step-guide-for-beginners/

<img src="./images/f1-limitations.jpg" alt="Models" width="60%">

### Practical tips:

* never ask a model to create a **whole** solution!
* split the solution into many simple steps
* use AI models only where necessary; use basic code if you can
* validate every model usage separately
* create a system that tests different prompts or models; use the best ones
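The last tip can be sketched as a tiny evaluation harness. Everything here is hypothetical: `fake_model` is a stub standing in for a real model call, and the prompts and validation examples are made up. The structure is the point: score every prompt on the same validation set and keep the winner.

```python
def fake_model(prompt, text):
    """HYPOTHETICAL stub; replace with a call to a real model."""
    if "one word" in prompt:
        return "positive" if "good" in text else "negative"
    return "I think the sentiment is probably positive."

# Small validation set with known expected answers.
validation_set = [
    ("good product", "positive"),
    ("bad service", "negative"),
    ("good value", "positive"),
]

# Candidate prompts to compare.
prompts = {
    "vague": "What is the sentiment?",
    "strict": "Answer in one word, positive or negative.",
}

def accuracy(prompt):
    """Fraction of validation examples answered exactly as expected."""
    hits = sum(fake_model(prompt, text) == expected
               for text, expected in validation_set)
    return hits / len(validation_set)

scores = {name: accuracy(prompt) for name, prompt in prompts.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The same loop works for comparing models instead of prompts: keep the prompt fixed and iterate over model functions.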

Domain adaptation is an open topic with no general solution. Be creative and check what others have done before.