### VLMs Get Embeddings

### This file extracts embeddings using VLMs

* classifier (str): Name of the VLM; BLIP2 and LLAVA are supported
* nproc_per_node (int): Number of worker processes `torchrun` launches per node (here, the number of CPUs available)
* batch_size (int): Batch Size
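
The source of `src/vlm_embeddings.py` is not shown in this notebook, so the following is only a minimal sketch of how a script could accept the flags used in the commands below, assuming `argparse`. The function name `build_parser`, the help strings, and the default batch size are assumptions for illustration. Note that `--nproc_per_node` is consumed by `torchrun` itself and is therefore not declared by the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags seen in this notebook's commands."""
    parser = argparse.ArgumentParser(description="Extract embeddings with a VLM")
    # The notebook states that BLIP2 and LLAVA are the supported models.
    parser.add_argument("--classifier", type=str, choices=["BLIP2", "LLAVA"], required=True)
    # Default of 4 is an assumption; every command in the notebook passes 4 explicitly.
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--dataset_path", type=str, required=True)
    parser.add_argument("--image_col", type=str, required=True, help="column holding the image identifier")
    parser.add_argument("--text_col", type=str, required=True, help="column holding the text")
    parser.add_argument("--output_dir", type=str, required=True)
    parser.add_argument("--output_file", type=str, required=True)
    return parser


# Parse the same arguments as the DAQUAR command (minus the torchrun launcher flag).
example_args = build_parser().parse_args([
    "--classifier", "BLIP2",
    "--batch_size", "4",
    "--dataset_path", "./datasets/daquar",
    "--image_col", "image_id",
    "--text_col", "question",
    "--output_dir", "./Embeddings_vlm/daquar",
    "--output_file", "embeddings_blip2.csv",
])
print(example_args)
```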

## 1. DAQUAR

* **[DAQUAR Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge#c7057)**:

The DAQUAR (DAtaset for QUestion Answering on Real-world images) dataset was created to advance research in visual question answering (VQA). It consists of indoor scene images, each accompanied by a set of questions about the scene's content. The dataset serves as a benchmark for training and evaluating models that must understand images and answer questions about them.

In [1]:
!torchrun --nproc_per_node=15 src/vlm_embeddings.py --classifier BLIP2 --batch_size 4 --dataset_path ./datasets/daquar --image_col image_id --text_col question --output_dir ./Embeddings_vlm/daquar --output_file embeddings_blip2.csv

2024-02-04 00:02:06.742377: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-04 00:02:06.742380: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-04 00:02:06.749576: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-04 00:02:06.771927: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-04 00:02:06.776670: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-04 00:02:06.783755: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-

## 2. COCO-QA

* **[COCO-QA Dataset](https://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/)**:

The COCO-QA (COCO Question-Answering) dataset is designed for the task of visual question answering. It is built on images from COCO (Common Objects in Context), a large-scale dataset of images with object annotations, and extends them with questions and answers. Each image in COCO-QA is accompanied by a set of questions and their corresponding answers.

In [None]:
!torchrun --nproc_per_node=30 src/vlm_embeddings.py --classifier BLIP2 --batch_size 4 --dataset_path ./datasets/coco-qa --image_col image_id --text_col questions --output_dir ./Embeddings_vlm/coco-qa --output_file embeddings_blip2.csv

## 3. Fakeddit

* **[Fakeddit Dataset](https://fakeddit.netlify.app/)**:

Fakeddit is a large-scale multimodal dataset for fine-grained fake news detection. It consists of over 1 million samples from multiple categories of fake news, including satire, misinformation, and fabricated news. The dataset includes text, images, metadata, and comment data, making it a rich resource for developing and evaluating fake news detection models.

In [2]:
!torchrun --nproc_per_node=30 src/vlm_embeddings.py --classifier BLIP2 --batch_size 4 --dataset_path ./datasets/fakeddit --image_col id --text_col title --output_dir ./Embeddings_vlm/fakeddit --output_file embeddings_blip2.csv

2024-02-05 02:52:47.594796: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 02:52:47.594798: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 02:52:47.594796: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 02:52:47.598940: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 02:52:47.599006: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 02:52:47.599021: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 02:52:47.599029: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 02:52:47.599032: I tensorflow/tsl/cud

## 4. Recipes5k

* **[Recipes5k Dataset](http://www.ub.edu/cvub/recipes5k/)**:

The Recipes5k dataset comprises 4,826 recipes, each with an image and a corresponding ingredient list. It contains 3,213 unique ingredients, simplified to 1,014 by removing overly descriptive particles. The recipes were extracted from Yummly, with 50 recipes per food type, and offer a diverse collection of alternative preparations for each of the 101 food types from Food101. The dataset is balanced across training, validation, and test splits and is designed to capture both intra- and inter-class variability.

In [3]:
!torchrun --nproc_per_node=30 src/vlm_embeddings.py --classifier BLIP2 --batch_size 4 --dataset_path ./datasets/Recipes5k --image_col image --text_col ingredients --output_dir ./Embeddings_vlm/Recipes5k --output_file embeddings_blip2.csv

2024-02-05 09:44:26.799873: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 09:44:26.810695: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 09:44:26.833743: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 09:44:26.838316: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-05 09:44:26.849048: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with 

## 5. BRSET

* **[BRSET Dataset](https://physionet.org/content/brazilian-ophthalmological/1.0.0/)**:

The Brazilian Multilabel Ophthalmological Dataset (BRSET) aims to close a gap in ophthalmological datasets, particularly for under-represented populations in low- and middle-income countries. It comprises 16,266 images from 8,524 Brazilian patients and includes demographics, anatomical parameters of the macula, optic disc, and vessels, as well as quality-control metrics such as focus, illumination, image field, and artifacts.

In [1]:
!torchrun --nproc_per_node=30 src/vlm_embeddings.py --classifier BLIP2 --batch_size 4 --dataset_path ./datasets/brset --image_col image_id --text_col text --output_dir ./Embeddings_vlm/brset --output_file embeddings_blip2.csv

2024-02-05 15:16:39.965440: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 15:16:40.004241: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-05 15:16:40.011938: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 15:16:40.038864: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 15:16:40.051029: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with 

## 6. HAM10000

* **[HAM10000 Dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T)**:

The HAM10000 dataset (also distributed as "Skin Cancer MNIST: HAM10000") is a large collection of dermatoscopic images from different populations, acquired and stored by the Department of Dermatology at the Medical University of Vienna, Austria. It consists of 10,015 dermatoscopic images that can serve as a training set for academic machine learning tasks such as skin lesion analysis and classification, including the detection of melanoma.

In [2]:
!torchrun --nproc_per_node=30 src/vlm_embeddings.py --classifier BLIP2 --batch_size 4 --dataset_path ./datasets/ham10000 --image_col image_id --text_col text --output_dir ./Embeddings_vlm/ham10000 --output_file embeddings_blip2.csv

2024-02-05 22:14:14.861526: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 22:14:14.875600: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 22:14:14.881054: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 22:14:14.899486: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-05 22:14:14.913370: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with 

## 7. Joslin Center Data

In [None]:
!torchrun --nproc_per_node=30 src/vlm_embeddings.py --classifier BLIP2 --batch_size 4 --dataset_path ./datasets/joslin --image_col ID --text_col text --output_dir ./Embeddings_vlm/joslin --output_file embeddings_blip2.csv
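
The commands in this notebook differ only in the dataset directory, the image and text column names, and the process count. As a rough sketch, they could be generated from a single table instead of being written out by hand. The values in `DATASETS` are copied from the commands above; the helper `build_command` and the pattern used to derive the output file name for a classifier other than BLIP2 are assumptions for illustration.

```python
import shlex

# dataset directory -> (image_col, text_col, nproc_per_node), as used in this notebook
DATASETS = {
    "daquar": ("image_id", "question", 15),
    "coco-qa": ("image_id", "questions", 30),
    "fakeddit": ("id", "title", 30),
    "Recipes5k": ("image", "ingredients", 30),
    "brset": ("image_id", "text", 30),
    "ham10000": ("image_id", "text", 30),
    "joslin": ("ID", "text", 30),
}


def build_command(dataset: str, classifier: str = "BLIP2", batch_size: int = 4) -> str:
    """Return the torchrun command line for one dataset (hypothetical helper)."""
    image_col, text_col, nproc = DATASETS[dataset]
    args = [
        "torchrun", f"--nproc_per_node={nproc}", "src/vlm_embeddings.py",
        "--classifier", classifier,
        "--batch_size", str(batch_size),
        "--dataset_path", f"./datasets/{dataset}",
        "--image_col", image_col,
        "--text_col", text_col,
        "--output_dir", f"./Embeddings_vlm/{dataset}",
        # Assumed naming pattern: embeddings_<classifier in lowercase>.csv
        "--output_file", f"embeddings_{classifier.lower()}.csv",
    ]
    # shlex.join quotes any argument that would need it in a POSIX shell.
    return shlex.join(args)


for name in DATASETS:
    print(build_command(name))
```

Each printed line can be pasted into a notebook cell after a leading `!`, or passed to `subprocess.run` after splitting it with `shlex.split`.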