### Setup Environment:

In [1]:
from src.vlm_models import CLIP, BLIP2, LLAVA
from src.classifiers_base import preprocess_df
import pandas as pd
import os

2024-02-08 18:56:34.576997: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-08 18:56:34.614132: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Embeddings Generation

* **Dataframe:** Pandas dataframe with image path and text

* **Image Column:** Column with the path to the images

* **Text Column:** Column with text data

* **Batch Size:** Integer with the size of the batch
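
As a rough illustration of how the batch size relates to the number of batches reported by the progress bars later in this notebook, here is a minimal sketch. It assumes the final partial batch is still processed; the internals of `get_embeddings` are not shown here, so `num_batches` is a hypothetical helper, not part of `src.vlm_models`.

```python
import math


def num_batches(n_rows: int, batch_size: int) -> int:
    """Number of batches needed to cover n_rows, counting a final partial batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return math.ceil(n_rows / batch_size)


# Row counts taken from the progress bars in the sections below.
print(num_batches(16266, 24))  # 678
print(num_batches(10015, 24))  # 418
print(num_batches(1560, 24))   # 65
```

A larger batch size means fewer iterations but more memory per step, so it is usually tuned to the available hardware.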

In [2]:
#model_name = 'blip2'
#model_name = 'llava'
model_name = 'clip'

In [3]:
if model_name.lower() == 'clip':
    print('Creating Instance of CLIP model')
    model = CLIP()
elif model_name.lower() == 'blip2':
    print('Creating Instance of BLIP 2 model')
    model = BLIP2()
elif model_name.lower() == 'llava':
    print('Creating Instance of LLAVA model')
    model = LLAVA()
else:
    raise NotImplementedError('The model should be clip, blip2 or llava')

Creating Instance of CLIP model
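
The `if`/`elif` chain above can also be written as a lookup table, which keeps the list of supported models in one place. This is only a sketch: the three classes below are empty stand-ins so the snippet runs on its own, and `MODEL_REGISTRY` and `create_model` are hypothetical names that do not exist in `src.vlm_models`.

```python
# Stand-in classes so the sketch is self-contained; in the notebook these
# would be the CLIP, BLIP2 and LLAVA classes imported from src.vlm_models.
class CLIP:
    pass


class BLIP2:
    pass


class LLAVA:
    pass


# Map lower-case model names to their classes.
MODEL_REGISTRY = {'clip': CLIP, 'blip2': BLIP2, 'llava': LLAVA}


def create_model(model_name):
    """Instantiate the model registered under model_name (case-insensitive)."""
    key = model_name.lower()
    if key not in MODEL_REGISTRY:
        raise NotImplementedError('The model should be clip, blip2 or llava')
    return MODEL_REGISTRY[key]()


model = create_model('clip')
print(type(model).__name__)  # CLIP
```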


## 1. DAQUAR

* **[DAQUAR Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge#c7057)**:

The DAQUAR (DAtaset for QUestion Answering on Real-world images) dataset was created to advance research in visual question answering (VQA). It consists of indoor scene images, each accompanied by a set of questions about the scene's content. The dataset serves as a benchmark for training and evaluating models that must understand images and answer questions about them.

We'll call `model.get_embeddings` to generate embeddings for the images in `datasets/daquar/images` and the questions in `labels.csv`, and store them in `Embeddings_vlm/daquar/embeddings_{model_name}.csv`.

In [None]:
batch_size = 16
dataset = 'daquar'
image_col = 'image_id'
text_col = 'question'
output_dir = f'Embeddings_vlm/{dataset}/'
output_file = f'embeddings_{model_name}.csv'

dataset_path = f'datasets/{dataset}/'
images_dir = 'images/'
labels = 'labels.csv'

images_path = os.path.join(dataset_path, images_dir)
labels_path = os.path.join(dataset_path, labels)

df = preprocess_df(df=pd.read_csv(labels_path), image_columns=image_col, images_path=images_path)

model.get_embeddings(dataframe=df, batch_size=batch_size, image_col_name=image_col, text_col_name=text_col, output_dir=output_dir, output_file=output_file)

100%|██████████| 12468/12468 [00:01<00:00, 9739.57it/s] 
100%|██████████| 12468/12468 [00:04<00:00, 2774.97it/s]


Processing batches:   0%|          | 0/520 [00:00<?, ?it/s]

It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.

Batch 0
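
The rescale warning above suggests that the images reaching the image processor already have pixel values in the range 0 to 1, and that they are then scaled by 1/255 a second time. The sketch below illustrates the effect with plain Python lists; `rescale` and `looks_rescaled` are hypothetical helpers written for this illustration, not functions of any library. `do_rescale=False` is the option named in the warning itself; where it would be passed inside `get_embeddings` is not shown in this notebook.

```python
def rescale(pixels):
    """Map 0-255 pixel values to the 0-1 range."""
    return [p / 255 for p in pixels]


def looks_rescaled(pixels):
    """Heuristic: all values already lie between 0 and 1."""
    return all(0.0 <= p <= 1.0 for p in pixels)


raw = [0, 128, 255]    # typical 8-bit pixel values
once = rescale(raw)    # correct: values now span 0 to 1
twice = rescale(once)  # double rescaling: values collapse towards 0

print(looks_rescaled(raw))   # False
print(looks_rescaled(once))  # True
print(max(twice) < 0.004)    # True: the image is almost black
```

If the images are rescaled twice, the embeddings are computed from nearly black inputs, so it is worth checking which step of the pipeline performs the first rescaling.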


## 2. COCO-QA

* **[COCO-QA Dataset](https://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/)**:

The COCO-QA (COCO Question-Answering) dataset is designed for the task of visual question-answering. It is a subset of the COCO (Common Objects in Context) dataset, which is a large-scale dataset containing images with object annotations. The COCO-QA dataset extends the COCO dataset by including questions and answers associated with the images. Each image in the COCO-QA dataset is accompanied by a set of questions and corresponding answers.

We'll call `model.get_embeddings` to generate embeddings for the images in `datasets/coco-qa/images` and the questions in `labels.csv`, and store them in `Embeddings_vlm/coco-qa/embeddings_clip.csv`.

In [None]:
batch_size = 24
dataset = 'coco-qa'
image_col = 'image_id'
text_col = 'questions'
output_dir = f'Embeddings_vlm/{dataset}'
output_file = f'embeddings_{model_name}.csv'

dataset_path = f'datasets/{dataset}/'
images_dir = 'images/'
labels = 'labels.csv'

images_path = os.path.join(dataset_path, images_dir)
labels_path = os.path.join(dataset_path, labels)

df = preprocess_df(df=pd.read_csv(labels_path), image_columns=image_col, images_path=images_path)

model.get_embeddings(dataframe=df, batch_size=batch_size, image_col_name=image_col, text_col_name=text_col, output_dir=output_dir, output_file=output_file)


## 3. Fakeddit

* **[Fakeddit Dataset](https://fakeddit.netlify.app/)**:

Fakeddit is a large-scale multimodal dataset for fine-grained fake news detection. It consists of over 1 million samples from multiple categories of fake news, including satire, misinformation, and fabricated news. The dataset includes text, images, metadata, and comment data, making it a rich resource for developing and evaluating fake news detection models.

We'll call `model.get_embeddings` to generate embeddings for the images in `datasets/fakeddit/images` and the post titles in `labels_subset.csv`, and store them in `Embeddings_vlm/fakeddit/embeddings_clip.csv`.

In [None]:
batch_size = 24
dataset = 'fakeddit'
image_col = 'id'
text_col = 'title'
output_dir = f'Embeddings_vlm/{dataset}'
output_file = f'embeddings_{model_name}.csv'

dataset_path = f'datasets/{dataset}/'
images_dir = 'images/'
labels = 'labels_subset.csv'

images_path = os.path.join(dataset_path, images_dir)
labels_path = os.path.join(dataset_path, labels)

df = preprocess_df(df=pd.read_csv(labels_path), image_columns=image_col, images_path=images_path)

model.get_embeddings(dataframe=df, batch_size=batch_size, image_col_name=image_col, text_col_name=text_col, output_dir=output_dir, output_file=output_file)

## 4. Recipes5k

* **[Recipes5k Dataset](http://www.ub.edu/cvub/recipes5k/)**:

The Recipes5k dataset comprises 4,826 recipes, each with an image and its ingredient list. It contains 3,213 unique ingredients, which are also provided in a simplified version of 1,014 ingredients obtained by removing overly descriptive particles. The recipes were extracted from Yummly and offer about 50 alternative preparations for each of the 101 food types from Food101, balanced across training, validation, and test splits. The dataset therefore captures both intra- and inter-class variability.


We'll call `model.get_embeddings` to generate embeddings for the images in `datasets/Recipes5k/images` and the ingredient lists in `labels.csv`, and store them in `Embeddings_vlm/Recipes5k/embeddings_clip.csv`.

In [None]:
batch_size = 24
dataset = 'Recipes5k'
image_col = 'image'
text_col = 'ingredients'
output_dir = f'Embeddings_vlm/{dataset}'
output_file = f'embeddings_{model_name}.csv'

dataset_path = f'datasets/{dataset}/'
images_dir = 'images/'
labels = 'labels.csv'

images_path = os.path.join(dataset_path, images_dir)
labels_path = os.path.join(dataset_path, labels)

df = preprocess_df(df=pd.read_csv(labels_path), image_columns=image_col, images_path=images_path)

model.get_embeddings(dataframe=df, batch_size=batch_size, image_col_name=image_col, text_col_name=text_col, output_dir=output_dir, output_file=output_file)

## 5. BRSET
* **[BRSET Dataset](https://physionet.org/content/brazilian-ophthalmological/1.0.0/)**:

The Brazilian Multilabel Ophthalmological Dataset (BRSET) aims to bridge the gap in ophthalmological datasets, particularly for under-represented populations in low- and middle-income countries. It comprises 16,266 images from 8,524 Brazilian patients, together with demographics, anatomical parameters of the macula, optic disc, and vessels, and quality control metrics such as focus, illumination, image field, and artifacts.

In [4]:
batch_size = 24
dataset = 'brset'
image_col = 'image_id'
text_col = 'text'
output_dir = f'Embeddings_vlm/{dataset}'
output_file = f'embeddings_{model_name}.csv'

dataset_path = f'datasets/{dataset}/'
images_dir = 'images/'
labels = 'labels.csv'

images_path = os.path.join(dataset_path, images_dir)
labels_path = os.path.join(dataset_path, labels)

df = preprocess_df(df=pd.read_csv(labels_path), image_columns=image_col, images_path=images_path)

model.get_embeddings(dataframe=df, batch_size=batch_size, image_col_name=image_col, text_col_name=text_col, output_dir=output_dir, output_file=output_file)

100%|██████████| 16266/16266 [00:03<00:00, 4597.47it/s]
100%|██████████| 16266/16266 [00:22<00:00, 717.60it/s]


Processing batches:   0%|          | 0/678 [00:00<?, ?it/s]

Unnamed: 0,image_id,DR_ICDR,text,DR_2,DR_3,split,image_embedding_0,image_embedding_1,image_embedding_2,image_embedding_3,...,text_embedding_502,text_embedding_503,text_embedding_504,text_embedding_505,text_embedding_506,text_embedding_507,text_embedding_508,text_embedding_509,text_embedding_510,text_embedding_511
0,datasets/brset/images/img00001.jpg,0,"An image from the right eye of a male patient,...",0,0,train,0.012998,-0.012632,0.010281,0.016375,...,-0.054316,0.026362,0.062661,0.023066,-0.023967,-0.013421,0.033898,-0.058343,0.027308,0.011322
1,datasets/brset/images/img00002.jpg,0,"An image from the left eye of a male patient, ...",0,0,test,0.013524,-0.017406,0.012482,0.012262,...,-0.054454,0.018202,0.061717,0.022467,-0.011161,-0.011376,0.030134,-0.049021,0.021231,0.010651
2,datasets/brset/images/img00003.jpg,0,An image from the right eye of a female patien...,0,0,train,0.036357,-0.018087,-0.001801,0.010425,...,-0.055949,0.011358,0.072995,0.030307,-0.011789,-0.011042,0.042255,-0.057926,0.033326,0.018503
3,datasets/brset/images/img00004.jpg,0,An image from the left eye of a female patient...,0,0,train,0.022021,-0.011668,0.010509,0.022121,...,-0.055314,0.002594,0.074490,0.034430,-0.000350,-0.007918,0.042707,-0.047150,0.027270,0.020940
4,datasets/brset/images/img00005.jpg,0,"An image from the right eye of a male patient,...",0,0,test,0.018684,-0.010326,0.004721,0.006435,...,-0.052848,0.021930,0.068045,0.027663,-0.020903,-0.012616,0.037164,-0.051632,0.033358,0.013019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16261,datasets/brset/images/img16262.jpg,1,"An image from the left eye of a male patient, ...",1,1,test,0.020267,-0.014308,0.007233,0.009556,...,-0.051821,0.018065,0.063733,0.028109,-0.009481,-0.011907,0.032493,-0.043669,0.024307,0.012130
16262,datasets/brset/images/img16263.jpg,0,"An image from the right eye of a male patient,...",0,0,train,0.022504,-0.012735,0.007244,0.008257,...,-0.053089,0.022047,0.066592,0.027557,-0.017143,-0.013583,0.038866,-0.059881,0.031864,0.018169
16263,datasets/brset/images/img16264.jpg,0,"An image from the left eye of a male patient, ...",0,0,test,0.020345,-0.018842,0.001421,0.002808,...,-0.053673,0.012375,0.067669,0.029927,-0.003204,-0.011643,0.036917,-0.050136,0.027056,0.018447
16264,datasets/brset/images/img16265.jpg,0,"An image from the right eye of a male patient,...",0,0,train,0.016526,-0.019370,0.002399,0.017877,...,-0.047541,0.028574,0.065664,0.030953,-0.015932,-0.018688,0.031363,-0.038107,0.039231,0.005134
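
The output above mixes metadata columns with columns named `image_embedding_<i>` and `text_embedding_<i>`. A minimal sketch for separating them by prefix is shown below. `split_embedding_columns` is a hypothetical helper, and the prefixes are taken from the column names visible in the output rather than from the source of `get_embeddings`.

```python
def split_embedding_columns(columns):
    """Split column names into image-embedding, text-embedding and metadata groups."""
    image_cols = [c for c in columns if c.startswith('image_embedding_')]
    text_cols = [c for c in columns if c.startswith('text_embedding_')]
    embedding_cols = set(image_cols) | set(text_cols)
    meta_cols = [c for c in columns if c not in embedding_cols]
    return image_cols, text_cols, meta_cols


# Shortened version of the BRSET header shown above.
columns = ['image_id', 'DR_ICDR', 'text', 'split',
           'image_embedding_0', 'image_embedding_1',
           'text_embedding_0', 'text_embedding_1']

image_cols, text_cols, meta_cols = split_embedding_columns(columns)
print(image_cols)  # ['image_embedding_0', 'image_embedding_1']
print(text_cols)   # ['text_embedding_0', 'text_embedding_1']
print(meta_cols)   # ['image_id', 'DR_ICDR', 'text', 'split']
```

With a pandas dataframe `df` loaded from the embeddings file, the same lists can be used as `df[image_cols]` and `df[text_cols]`, and the `split` column can be used to select the train or test rows.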


## 6. HAM10000

* **[HAM10000 Dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T)**:

The HAM10000 ("Human Against Machine with 10000 training images") dataset is a large collection of dermatoscopic images from different populations, acquired and stored by the Department of Dermatology at the Medical University of Vienna, Austria. It consists of 10,015 dermatoscopic images that can serve as a training set for academic machine learning tasks such as skin lesion analysis and classification, including the detection of melanoma.

In [5]:
batch_size = 24
dataset = 'ham10000'
image_col = 'image_id'
text_col = 'text'
output_dir = f'Embeddings_vlm/{dataset}'
output_file = f'embeddings_{model_name}.csv'

dataset_path = f'datasets/{dataset}/'
images_dir = 'images/'
labels = 'labels.csv'

images_path = os.path.join(dataset_path, images_dir)
labels_path = os.path.join(dataset_path, labels)

df = preprocess_df(df=pd.read_csv(labels_path), image_columns=image_col, images_path=images_path)

model.get_embeddings(dataframe=df, batch_size=batch_size, image_col_name=image_col, text_col_name=text_col, output_dir=output_dir, output_file=output_file)

  0%|          | 0/10015 [00:00<?, ?it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Processing batches:   0%|          | 0/418 [00:00<?, ?it/s]
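
The `huggingface/tokenizers` warning above names its own remedy: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch follows, assuming the assignment runs in the first cell of the notebook, before any tokenizer is used; choosing `'false'` here is a judgment call that trades tokenizer parallelism for a quieter, fork-safe run.

```python
import os

# Set explicitly before any tokenizer is used, so that forked worker
# processes do not trigger the parallelism warning.
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

print(os.environ['TOKENIZERS_PARALLELISM'])  # false
```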

Unnamed: 0,image_id,dx,text,split,image_embedding_0,image_embedding_1,image_embedding_2,image_embedding_3,image_embedding_4,image_embedding_5,...,text_embedding_502,text_embedding_503,text_embedding_504,text_embedding_505,text_embedding_506,text_embedding_507,text_embedding_508,text_embedding_509,text_embedding_510,text_embedding_511
0,datasets/ham10000/images/ISIC_0033319.jpg,nv,Patient diagnosed via histo. Age: 35 years. Se...,train,0.017459,-0.006133,0.042509,0.032119,-0.020298,-0.029999,...,-0.085103,-0.035549,0.022616,0.014842,-0.010682,0.007425,-0.010183,0.059058,0.028891,0.073401
1,datasets/ham10000/images/ISIC_0030823.jpg,nv,Patient diagnosed via follow_up. Age: 40 years...,train,0.013314,-0.004718,0.036896,0.013657,-0.018710,-0.000790,...,-0.024121,0.001504,0.007381,0.028216,-0.033343,-0.007173,0.025764,0.037911,-0.028967,0.032034
2,datasets/ham10000/images/ISIC_0028730.jpg,akiec,Patient diagnosed via histo. Age: 65 years. Se...,train,0.023076,-0.006460,0.046531,-0.007525,-0.052272,0.024759,...,-0.076046,-0.025927,0.019258,0.013950,-0.013910,0.000220,-0.016514,0.060165,0.028732,0.068683
3,datasets/ham10000/images/ISIC_0027299.jpg,nv,Patient diagnosed via follow_up. Age: 40 years...,train,0.002341,-0.042092,0.056254,0.000176,-0.013943,0.011410,...,-0.017013,-0.020450,0.020093,0.021699,-0.018428,0.018907,0.022027,0.045309,-0.014772,0.017288
4,datasets/ham10000/images/ISIC_0032444.jpg,nv,Patient diagnosed via histo. Age: 65 years. Se...,train,0.012029,-0.003644,0.028491,0.017455,-0.017562,-0.005040,...,-0.093017,-0.022929,0.013042,0.007998,-0.020830,-0.015631,-0.007309,0.057171,0.030297,0.068561
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10010,datasets/ham10000/images/ISIC_0034116.jpg,nv,Patient diagnosed via histo. Age: 35 years. Se...,test,0.034317,-0.005902,0.033596,0.013734,-0.001983,0.006587,...,-0.089331,-0.024633,0.013912,0.015758,-0.022870,-0.014493,-0.007229,0.065375,0.024135,0.075245
10011,datasets/ham10000/images/ISIC_0026453.jpg,bcc,Patient diagnosed via histo. Age: 55 years. Se...,test,0.039306,-0.022963,0.006208,0.007151,-0.000994,0.003135,...,-0.092979,-0.026890,0.015404,0.007477,-0.018264,-0.009003,-0.005414,0.054793,0.026829,0.066075
10012,datasets/ham10000/images/ISIC_0029885.jpg,mel,Patient diagnosed via histo. Age: 35 years. Se...,test,-0.007351,-0.020153,0.009813,-0.022048,-0.008027,0.006708,...,-0.101426,-0.029974,0.016112,0.007236,-0.021058,-0.013079,-0.002877,0.057138,0.029186,0.076712
10013,datasets/ham10000/images/ISIC_0033226.jpg,mel,Patient diagnosed via histo. Age: 65 years. Se...,test,0.010150,-0.013687,0.011988,0.025662,0.002185,-0.017817,...,-0.080767,-0.036380,0.022734,0.016851,-0.005421,0.005429,-0.022412,0.060543,0.029883,0.065808


## 7. Colombian Multimodal Satellite dataset
* **[A Multi-Modal Satellite Imagery Dataset for Public Health Analysis in Colombia](https://physionet.org/content/multimodal-satellite-data/1.0.0/)**:

The Multi-Modal Satellite Imagery Dataset in Colombia integrates economic, demographic, meteorological, and epidemiological data. It comprises 12,636 high-quality satellite images from 81 municipalities between 2016 and 2018, with minimal cloud cover. Its applications include deforestation monitoring, forecasting of education indices, water quality assessment, tracking of extreme climatic events, epidemic disease response, and precision agriculture.

In [4]:
batch_size = 24
dataset = 'satellitedata'
image_col = 'image_id'
text_col = 'text'
output_dir = f'Embeddings_vlm/{dataset}'
output_file = f'embeddings_{model_name}.csv'

dataset_path = f'datasets/{dataset}/'
images_dir = 'images/'
labels = 'labels.csv'

images_path = os.path.join(dataset_path, images_dir)
labels_path = os.path.join(dataset_path, labels)

df = preprocess_df(df=pd.read_csv(labels_path), image_columns=image_col, images_path=images_path)

model.get_embeddings(dataframe=df, batch_size=batch_size, image_col_name=image_col, text_col_name=text_col, output_dir=output_dir, output_file=output_file)

100%|██████████| 1560/1560 [00:01<00:00, 804.88it/s]
100%|██████████| 1560/1560 [00:01<00:00, 1367.35it/s]


Processing batches:   0%|          | 0/65 [00:00<?, ?it/s]

Unnamed: 0,image_id,text,Labels,split,image_embedding_0,image_embedding_1,image_embedding_2,image_embedding_3,image_embedding_4,image_embedding_5,...,text_embedding_502,text_embedding_503,text_embedding_504,text_embedding_505,text_embedding_506,text_embedding_507,text_embedding_508,text_embedding_509,text_embedding_510,text_embedding_511
0,datasets/satellitedata/images/73001_2016-06-12...,An image from city Ibagué taken in date 2016-0...,3,train,-0.000648,0.047631,-0.007688,-0.015954,0.033954,-0.024987,...,0.003876,-0.020030,0.014448,0.025682,0.010953,0.002404,0.021032,0.009018,0.002254,-0.014104
1,datasets/satellitedata/images/76001_2017-06-11...,An image from city Cali taken in date 2017-06-...,1,train,0.017359,0.030042,0.016325,-0.010410,0.046289,-0.025459,...,0.019132,-0.028726,0.012622,0.030623,0.007002,0.006732,0.010945,-0.010793,-0.000270,-0.012476
2,datasets/satellitedata/images/8001_2018-04-15.jpg,An image from city Barranquilla taken in date ...,1,train,-0.022790,0.021263,0.040344,-0.016606,0.041269,-0.028196,...,0.014796,-0.006373,-0.014684,0.023051,0.005047,0.000785,0.005979,-0.020611,-0.006694,0.008903
3,datasets/satellitedata/images/23001_2016-05-08...,An image from city Montería taken in date 2016...,1,train,-0.013127,-0.022996,-0.049374,-0.006306,0.013601,-0.003762,...,0.013533,-0.003991,-0.011703,0.025439,0.004510,-0.007244,0.005063,0.042533,-0.002328,-0.006062
4,datasets/satellitedata/images/5001_2017-04-30.jpg,An image from city Medellín taken in date 2017...,1,train,-0.034772,0.040687,0.007086,-0.026446,0.017892,-0.021177,...,0.015247,-0.008037,0.022611,0.020556,0.020584,-0.001789,0.015643,-0.005070,0.008810,0.005491
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1555,datasets/satellitedata/images/50001_2017-03-19...,An image from city Villavicencio taken in date...,1,test,0.011827,-0.015805,-0.005934,-0.023800,0.020632,-0.012690,...,0.007440,-0.015429,0.006488,0.032507,0.011390,-0.019991,0.025756,-0.001323,-0.010401,-0.010772
1556,datasets/satellitedata/images/23001_2017-03-26...,An image from city Montería taken in date 2017...,0,test,-0.013127,-0.022996,-0.049374,-0.006306,0.013601,-0.003762,...,0.014595,-0.012074,-0.008773,0.026105,0.002282,-0.001614,0.008737,0.028377,-0.004192,-0.003824
1557,datasets/satellitedata/images/8001_2017-01-22.jpg,An image from city Barranquilla taken in date ...,0,test,-0.013127,-0.022996,-0.049374,-0.006306,0.013601,-0.003762,...,0.015668,-0.014151,-0.005850,0.026696,0.014855,-0.003303,0.007501,-0.014224,-0.006461,0.009316
1558,datasets/satellitedata/images/76001_2017-09-10...,An image from city Cali taken in date 2017-09-...,0,test,0.008757,-0.000926,0.009275,-0.006556,0.058230,-0.016975,...,0.016767,-0.032829,0.012773,0.030282,0.012048,0.008870,0.012012,-0.005184,-0.003771,-0.007549


## 8. MIMIC CXR
* **[MIMIC CXR](https://physionet.org/content/mimic-cxr/2.0.0/#files-panel)**:

The MIMIC-CXR (Medical Information Mart for Intensive Care, Chest X-Ray) dataset is a large, publicly available collection of chest radiographs with associated radiology reports. It was developed by the MIT Lab for Computational Physiology and provides an extensive resource for training and evaluating machine learning models in the field of medical imaging, particularly in automated radiograph interpretation and natural language processing for clinical narratives.

In [None]:
batch_size = 24
dataset = 'mimic'
image_col = 'image_id'
text_col = 'text'
output_dir = f'Embeddings_vlm/{dataset}'
output_file = f'embeddings_{model_name}.csv'

dataset_path = f'datasets/{dataset}/'
images_dir = 'images/'
labels = 'labels.csv'

images_path = os.path.join(dataset_path, images_dir)
labels_path = os.path.join(dataset_path, labels)

df = preprocess_df(df=pd.read_csv(labels_path), image_columns=image_col, images_path=images_path)

model.get_embeddings(dataframe=df, batch_size=batch_size, image_col_name=image_col, text_col_name=text_col, output_dir=output_dir, output_file=output_file)
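
Every dataset cell above repeats the same path construction with only the dataset name, column names, and labels file changing. One way to reduce the repetition is a small helper that builds the paths, sketched below. `dataset_config` and its parameter names are hypothetical and not part of this repository; the directory layout (`datasets/<name>/images/`, `Embeddings_vlm/<name>`) is the one used in the cells above.

```python
import os


def dataset_config(dataset, model_name, labels='labels.csv',
                   data_root='datasets', output_root='Embeddings_vlm'):
    """Build the input and output paths used by a dataset cell."""
    dataset_path = os.path.join(data_root, dataset)
    return {
        # The empty last component keeps a trailing separator, as in 'images/'.
        'images_path': os.path.join(dataset_path, 'images', ''),
        'labels_path': os.path.join(dataset_path, labels),
        'output_dir': os.path.join(output_root, dataset),
        'output_file': f'embeddings_{model_name}.csv',
    }


# Fakeddit uses a non-default labels file.
cfg = dataset_config('fakeddit', 'clip', labels='labels_subset.csv')
print(cfg['labels_path'])
print(cfg['output_file'])  # embeddings_clip.csv
```

The returned values can then be passed to `preprocess_df` and `model.get_embeddings` exactly as in the cells above.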