# Autoencoder in Computer Vision

Autoencoders are a type of neural network designed to learn efficient representations of data, typically for the purpose of dimensionality reduction or feature learning. They consist of two main parts:

1. **Encoder**: This part compresses the input data into a lower-dimensional representation.
2. **Decoder**: This part reconstructs the original data from the compressed representation.

<img src="../assets/Schematic-overview-of-variational-autoencoder.png" width="600">

## How Do Autoencoders Work?

Think of an autoencoder as a sophisticated data compressor and decompressor. Here’s how it works step-by-step:

1. **Input**: You start with your original data, for example, an image.
2. **Encoding**:
   - The encoder takes the image and processes it through multiple layers of neurons.
   - Each layer extracts more abstract features from the image, gradually reducing its dimensionality.
   - The final layer of the encoder outputs a compressed representation (latent space), which captures the essential features of the image in a smaller set of dimensions.

3. **Latent Space**: This is the compressed form of the data. It contains the most important features needed to reconstruct the original image.

4. **Decoding**:
   - The decoder takes the compressed representation and processes it through multiple layers of neurons, but in reverse order compared to the encoder.
   - Each layer adds back details and increases the dimensionality until it reconstructs the original image.

5. **Output**: The final output is an approximation of the original image. The goal is for this reconstructed image to be as close to the original as possible.
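
The steps above can be sketched as a forward pass through a small fully connected autoencoder. This is a minimal, untrained illustration in NumPy: the layer sizes (784, 128, 32), the activation functions, and the random "image" are assumptions chosen for the example, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 28x28 grayscale image and a 32-dimensional latent space.
input_dim, hidden_dim, latent_dim = 28 * 28, 128, 32

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Encoder: 784 -> 128 -> 32. The decoder mirrors it: 32 -> 128 -> 784.
enc1, enc2 = init_layer(input_dim, hidden_dim), init_layer(hidden_dim, latent_dim)
dec1, dec2 = init_layer(latent_dim, hidden_dim), init_layer(hidden_dim, input_dim)

def encode(x):
    h = relu(x @ enc1[0] + enc1[1])
    return h @ enc2[0] + enc2[1]            # latent representation

def decode(z):
    h = relu(z @ dec1[0] + dec1[1])
    return sigmoid(h @ dec2[0] + dec2[1])   # pixel values in [0, 1]

image = rng.random((1, input_dim))          # stand-in for a flattened image (batch of 1)
latent = encode(image)
reconstruction = decode(latent)

print(image.shape, latent.shape, reconstruction.shape)
# (1, 784) (1, 32) (1, 784)
```

Because the weights are random, the reconstruction here is meaningless; training would adjust the weights so that `reconstruction` approaches `image`.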

## How Are Autoencoders Used in Vision Models?

Autoencoders have various applications in vision tasks, such as:

1. **Image Denoising**: Autoencoders can be trained to remove noise from images. A noisy image is given as input, and the autoencoder is trained to reconstruct the clean version.

<img src="../assets/denoising.jpg" width="600">

2. **Anomaly Detection**: In manufacturing, for example, an autoencoder can be trained on images of defect-free products. When an image of a defective product is fed into the network, the reconstruction error will be higher, indicating an anomaly.

<img src="../assets/encoder_anomaly_detection.jpg" width="600">

3. **Feature Extraction**: Autoencoders can learn useful features from images, which can then be used in other tasks like image classification or clustering.

<img src="../assets/Structure-of-clustering-model-with-autoencoder-and-K-means-combination.png" width="600">
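
The anomaly detection idea can be sketched with reconstruction error and a threshold. To keep the example short and deterministic, it uses a linear autoencoder fitted in closed form with an SVD (equivalent to PCA) as a stand-in for a trained neural network, and synthetic vectors as a stand-in for product images. The data, the latent size of 8, and the 99th-percentile threshold are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for defect-free images: 64-dim vectors that lie close to
# an 8-dimensional subspace, plus a little noise.
basis = rng.normal(size=(8, 64))
normal_train = rng.normal(size=(500, 8)) @ basis + 0.05 * rng.normal(size=(500, 64))

# "Train" a linear autoencoder in closed form: the top-k principal directions
# give the best linear encoder/decoder pair under squared error.
mean = normal_train.mean(axis=0)
_, _, vt = np.linalg.svd(normal_train - mean, full_matrices=False)
components = vt[:8]                        # shape (8, 64)

def reconstruct(x):
    latent = (x - mean) @ components.T     # encode
    return latent @ components + mean      # decode

def reconstruction_error(x):
    """Per-sample mean squared error between input and reconstruction."""
    return np.mean((x - reconstruct(x)) ** 2, axis=1)

# Threshold chosen from the errors on defect-free data (here: 99th percentile).
threshold = np.percentile(reconstruction_error(normal_train), 99)

normal_test = rng.normal(size=(5, 8)) @ basis + 0.05 * rng.normal(size=(5, 64))
defective_test = normal_test + rng.normal(scale=1.0, size=(5, 64))  # off-subspace "defects"

print("normal flagged:   ", reconstruction_error(normal_test) > threshold)
print("defective flagged:", reconstruction_error(defective_test) > threshold)
```

The model only ever sees defect-free data, so anything it cannot reconstruct well is treated as suspicious. In practice the threshold is a tuning choice that trades false alarms against missed defects.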

## Technical Details

- **Loss Function**: Usually, mean squared error (MSE) between the input and output images is used as the loss function to train the autoencoder.
- **Architectures**: Variants like Convolutional Autoencoders (CAE) are used for image data to leverage the spatial structure of images, using convolutional layers instead of fully connected layers.
- **Regularization**: Techniques like adding noise to the input (Denoising Autoencoders) or penalizing the latent activations so that only a few are active at a time (Sparse Autoencoders) are used to improve robustness and feature learning.
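
A minimal NumPy sketch of the loss and regularization pieces listed above. The function names, the Gaussian noise level, and the L1 form of the sparsity penalty are assumptions for illustration; other noise types and sparsity penalties (for example a KL-divergence term) are also used.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_loss(original, reconstruction):
    """Mean squared error averaged over all pixels and images."""
    return np.mean((original - reconstruction) ** 2)

def add_gaussian_noise(images, std=0.1):
    """Corrupt inputs for a denoising autoencoder; pixel values stay in [0, 1]."""
    noisy = images + rng.normal(0.0, std, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def sparsity_penalty(latent, weight=1e-3):
    """L1 penalty on latent activations, one common choice for sparse autoencoders."""
    return weight * np.mean(np.abs(latent))

clean = rng.random((4, 28 * 28))   # stand-in batch of flattened images
noisy = add_gaussian_noise(clean)

# In a denoising setup the network receives `noisy` but is scored against `clean`.
# Here the noisy input itself plays the role of a (poor) reconstruction.
print(mse_loss(clean, clean))          # 0.0
print(mse_loss(clean, noisy) > 0.0)    # True
```

For a sparse autoencoder, the training objective would be the sum `mse_loss(...) + sparsity_penalty(latent)`.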

## Conclusion

In summary, autoencoder vision models are powerful tools for compressing and reconstructing images, which can be used for tasks such as image denoising, anomaly detection, and feature extraction. They work by learning to encode the essential features of the input data into a lower-dimensional representation and then decoding it back to the original form.




# Transformers and Vision Transformers

<img src="../assets/transformer_artistic.jpeg" width="300" >

## Transformers

Transformers are a type of neural network originally developed for natural language processing (NLP) tasks such as translation, text summarization, and sentiment analysis. They were introduced in the 2017 paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) by Vaswani et al. Transformers are designed to handle sequential data and are particularly good at capturing long-range dependencies in text.

### Key Components of Transformers

<img src="../assets/The-Transformer-model-architecture.png" width="400">

1. **Attention Mechanism**: 
   - The core idea behind transformers is the attention mechanism, which allows the model to focus on different parts of the input sequence when making predictions.
   - Self-Attention: In the context of a single sequence, self-attention allows the model to weigh the importance of different words relative to each other.

2. **Positional Encoding**: 
   - Because transformers process all tokens of a sequence in parallel rather than one at a time like traditional RNNs, they add positional encodings to the token embeddings so the model knows the position of each word in the sequence.

3. **Encoder-Decoder Architecture**: 
   - **Encoder**: Processes the input sequence and generates a set of encodings.
   - **Decoder**: Uses these encodings to produce the output sequence.
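
The first two components can be sketched in a few lines of NumPy: sinusoidal positional encodings as described in "Attention Is All You Need", followed by single-head scaled dot-product self-attention. The sequence length, model width, and random weights are hypothetical; a real transformer uses multiple heads, learned weights, residual connections, and layer normalization, all omitted here.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sine/cosine position encodings (d_model assumed even)."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                # even dimensions
    encoding[:, 1::2] = np.cos(angles)                # odd dimensions
    return encoding

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)           # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                              # hypothetical sizes
token_embeddings = rng.normal(size=(seq_len, d_model))
x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)

w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
# (5, 16) (5, 5)
```

Row `i` of `weights` shows how strongly position `i` attends to every other position, which is the "weighing the importance of different words" described above.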

## Vision Transformers (ViTs)

Vision Transformers adapt the transformer architecture, originally designed for NLP, to computer vision tasks such as image classification, object detection, and segmentation. They were first described in the paper ["An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"](https://arxiv.org/abs/2010.11929).

### How Vision Transformers Work

<img src="../assets/VIT.png" width="600">

1. **Image as a Sequence of Patches**: 
   - Instead of processing an image as a whole, a vision transformer divides it into fixed-size patches (e.g., 16x16 pixels).
   - Each patch is flattened into a vector and treated like a "word" in a text sequence.

2. **Linear Embedding**: 
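   - Each flattened patch is passed through a learned linear projection to produce a fixed-size embedding vector. The embedding is not necessarily smaller than the patch: a 16x16 RGB patch has 768 values, the same as the embedding size of the base ViT model.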

3. **Positional Encoding**: 
   - Similar to text transformers, positional encodings are added to each patch embedding to retain spatial information about where each patch is located in the image.

4. **Transformer Layers**: 
   - The sequence of patch embeddings (with positional encodings) is passed through multiple transformer layers.
   - Each layer applies self-attention and feed-forward neural networks to capture the relationships between different patches.

5. **Classification Token**: 
   - A special classification token is prepended to the sequence of patch embeddings. 
   - After passing through the transformer layers, the representation corresponding to this token is used for classification tasks.
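
Steps 1, 2, 3, and 5 (everything before the transformer layers) can be sketched as array operations. The sizes below (a 224x224 RGB image, 16x16 patches, 768-dimensional embeddings) match a common ViT configuration, but the random weights, classification token, and position embeddings are placeholders; in a real model they are learned during training.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)        # (rows, cols, p, p, c)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                     # stand-in for a real image
patch_size, d_model = 16, 768

# 1. Image as a sequence of patches: 14 x 14 = 196 patches of 16*16*3 = 768 values.
patches = image_to_patches(image, patch_size)

# 2. Linear embedding of each flattened patch.
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
patch_embeddings = patches @ w_embed

# 5. Classification token prepended to the sequence.
cls_token = rng.normal(scale=0.02, size=(1, d_model))
tokens = np.concatenate([cls_token, patch_embeddings], axis=0)

# 3. Position embeddings added to every token (including the class token).
position_embeddings = rng.normal(scale=0.02, size=tokens.shape)
tokens = tokens + position_embeddings

print(patches.shape, tokens.shape)
# (196, 768) (197, 768)
```

The resulting `tokens` array is what the transformer layers in step 4 would consume; after those layers, row 0 (the classification token) would be fed to a classification head.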


## Understanding LLMs and Large Language Vision Models

<img src="../assets/chatgpt.jpeg" width="600">

### Large Language Models (LLMs)

**Large Language Models** are advanced artificial intelligence systems designed to understand and generate human language. Here’s how they work:

1. **Training Data**: LLMs are trained on massive amounts of text data from books, articles, websites, and other sources. This data includes diverse examples of human language.

2. **Neural Networks**: LLMs use neural networks, specifically a type called transformers, to process and learn from the text data. These networks consist of layers of interconnected nodes (neurons), a design loosely inspired by the human brain.

3. **Understanding Context**: During training, the model learns to predict the next word in a sentence, given the previous words. This helps the model understand context, grammar, and the nuances of language.

4. **Generating Text**: Once trained, LLMs can generate coherent and contextually relevant text. They can answer questions, write essays, create dialogue, and more.

5. **Fine-Tuning**: LLMs can be fine-tuned on specific datasets to improve performance in particular domains, like medical texts or legal documents.
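
The next-word prediction objective from step 3 is usually implemented as a cross-entropy loss in which the model's output at position `t` is scored against the token at position `t + 1`. The sketch below shows only the loss computation in NumPy; the vocabulary size, token ids, and hand-built logits are hypothetical, and no actual model is involved.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at position t.

    logits:    (seq_len, vocab_size) model outputs, one row per input position.
    token_ids: (seq_len,) integer ids of the input sequence.
    """
    predictions = logits[:-1]           # positions 0 .. T-2 predict ...
    targets = token_ids[1:]             # ... tokens 1 .. T-1
    log_probs = log_softmax(predictions)
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 10
token_ids = np.array([3, 1, 4, 1, 5])

# Uniform logits: the model is equally unsure about every token.
uniform_logits = np.zeros((len(token_ids), vocab_size))
print(next_token_loss(uniform_logits, token_ids))   # log(10), about 2.3026

# Logits that strongly favour the correct next token give a loss near zero.
confident_logits = np.zeros((len(token_ids), vocab_size))
confident_logits[np.arange(len(token_ids) - 1), token_ids[1:]] = 20.0
print(next_token_loss(confident_logits, token_ids) < 1e-6)   # True
```

Training lowers this loss across a very large corpus. Fine-tuning typically reuses the same objective on a smaller, domain-specific dataset.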

### Large Language Vision Models (LLVMs)

**Large Language Vision Models** combine the capabilities of LLMs with visual understanding. They can process and interpret both text and images. Here’s how they work:

1. **Multi-Modal Training Data**: LLVMs are trained on datasets that include both text and images. For example, they might learn from image captions, descriptions, and other text associated with pictures.

2. **Transformer Architecture**: Similar to LLMs, LLVMs use transformers, but they are adapted to handle both visual and textual information. This involves processing images and text separately before combining the information.

3. **Vision Component**: The vision part of the model uses techniques like convolutional neural networks (CNNs) or vision transformers to extract features from images. These features represent various aspects of the image, such as shapes, colors, and objects.

4. **Language Component**: The language part of the model processes the text data, understanding context and meaning.

5. **Integration**: The model integrates visual features with textual information, allowing it to understand and generate descriptions of images, answer questions about pictures, and more.

6. **Applications**: LLVMs are used in various applications, including image captioning, visual question answering, and generating detailed descriptions of scenes.
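
The text does not say how the integration step is implemented, and real systems differ. One common approach, assumed here purely for illustration, is to project the image features into the language model's embedding space and concatenate them with the text token embeddings so the transformer sees a single sequence. All sizes and weights below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
num_patches, vision_dim = 196, 1024     # features from a vision encoder (e.g. a ViT)
num_text_tokens, text_dim = 12, 512     # embeddings of the text prompt

image_features = rng.normal(size=(num_patches, vision_dim))
text_embeddings = rng.normal(size=(num_text_tokens, text_dim))

# A projection (learned in a real model) maps image features into the
# language model's embedding space.
w_project = rng.normal(scale=0.02, size=(vision_dim, text_dim))
image_tokens = image_features @ w_project           # (196, 512)

# Image tokens and text tokens form one sequence for the transformer.
multimodal_sequence = np.concatenate([image_tokens, text_embeddings], axis=0)
print(multimodal_sequence.shape)
# (208, 512)
```

Other designs keep the two streams separate and let the language model attend to image features through cross-attention layers instead of concatenation.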

### Key Points

- **LLMs**: Focus on understanding and generating human language using large text datasets and transformer neural networks.
- **LLVMs**: Combine language and vision capabilities, processing both text and images to perform tasks that require understanding both modalities.

By leveraging vast amounts of data and sophisticated neural network architectures, these models can perform a wide range of tasks, from generating natural language to interpreting complex visual scenes.

# Med Gemini

[Advancing Multimodal Medical Capabilities of Gemini](https://arxiv.org/pdf/2405.03162)

<img src="../assets/Med-Gemini-4-Overview.width-800.png" width="600" >

<img src="../assets/Med-Gemini-3-Benchmarks.width-800.png" width="600">

<img src="../assets/Med-Gemini-5a-CTScans.width-800.png" width="600">

