## Setup Environment

In [6]:
from src.embeddings import get_embeddings_df
import pandas as pd

## Embeddings Generation

* **Batch Size:** Number of images converted to embeddings per batch (adjust to fit your available memory)

* **Path:** Path to the images

* **Output Directory:** Directory to save the embeddings

* **Backbone:** Select a backbone from the list of possible backbones:
    * 'dinov2_small'
    * 'dinov2_base'
    * 'dinov2_large'
    * 'dinov2_giant'
    * 'sam_base'
    * 'sam_large'
    * 'sam_huge'
    * 'clip_base'
    * 'clip_large'
    * 'convnextv2_tiny'
    * 'convnextv2_base'
    * 'convnextv2_large'
    * 'convnext_tiny'
    * 'convnext_small'
    * 'convnext_base'
    * 'convnext_large'
    * 'swin_tiny'
    * 'swin_small'
    * 'swin_base'
    * 'vit_base'
    * 'vit_large'

In [7]:
# Foundational Models
dino_backbone = ['dinov2_small', 'dinov2_base', 'dinov2_large', 'dinov2_giant']

sam_backbone = ['sam_base', 'sam_large', 'sam_huge']

clip_backbone = ['clip_base', 'clip_large']

# ImageNet:

### Convnext
convnext_backbone = ['convnextv2_tiny', 'convnextv2_base', 'convnextv2_large'] + ['convnext_tiny', 'convnext_small', 'convnext_base', 'convnext_large']

### Swin Transformer
swin_transformer_backbone = ['swin_tiny', 'swin_small', 'swin_base']

### ViT
vit_backbone = ['vit_base', 'vit_large']

backbones = dino_backbone + clip_backbone + sam_backbone + convnext_backbone + swin_transformer_backbone + vit_backbone

backbones

['dinov2_small',
 'dinov2_base',
 'dinov2_large',
 'dinov2_giant',
 'clip_base',
 'clip_large',
 'sam_base',
 'sam_large',
 'sam_huge',
 'convnextv2_tiny',
 'convnextv2_base',
 'convnextv2_large',
 'convnext_tiny',
 'convnext_small',
 'convnext_base',
 'convnext_large',
 'swin_tiny',
 'swin_small',
 'swin_base',
 'vit_base',
 'vit_large']
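
To generate embeddings for several backbones in one run, the list above can be iterated. The following is a minimal sketch, not part of `src.embeddings`: `run_backbones` and its `embed_fn` parameter are hypothetical names introduced here, and the sketch assumes `embed_fn` accepts the same keyword arguments as the `get_embeddings_df` calls in this notebook.

```python
from typing import Callable, Dict, Iterable


def run_backbones(embed_fn: Callable[..., object],
                  backbones: Iterable[str],
                  **common_kwargs) -> Dict[str, str]:
    """Call embed_fn once per backbone and record 'ok' or the error message."""
    status = {}
    for name in backbones:
        try:
            embed_fn(backbone=name, **common_kwargs)
            status[name] = "ok"
        except Exception as exc:  # keep going so one failure does not stop the run
            status[name] = f"failed: {exc}"
    return status
```

A call might look like `run_backbones(get_embeddings_df, backbones, batch_size=32, path='datasets/daquar/images', dataset_name='daquar', directory='Embeddings')`. Collecting a status per backbone keeps a single failing model (for example, one that does not fit in memory) from aborting the remaining runs.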

## 1. DAQUAR

* **[DAQUAR Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge#c7057)**:

The DAQUAR (DAtaset for QUestion Answering on Real-world images) dataset was created to advance research in visual question answering (VQA). It consists of indoor scene images, each accompanied by a set of questions about the scene's content. The dataset serves as a benchmark for training and evaluating models that must understand images and answer questions about them.

We'll use the function `get_embeddings_df` to generate embeddings for the images in `datasets/daquar/images` and store them in `Embeddings/daquar/Embeddings_Backbone.csv`.

In [None]:
batch_size = 32
path = 'datasets/daquar/images'
dataset = 'daquar'
backbone = 'dinov2_base'
out_dir = 'Embeddings'

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir)
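
Before launching a long embedding run, it can help to confirm that the image folder exists and is not empty. The following is a minimal sketch under stated assumptions: `count_images` and the extension set are introduced here for illustration and are not part of `src.embeddings`, and only files directly inside the folder are counted.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(path: str) -> int:
    """Return the number of image files directly inside `path`."""
    folder = Path(path)
    if not folder.is_dir():
        raise FileNotFoundError(f"Image folder not found: {folder}")
    return sum(
        1 for f in folder.iterdir()
        if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Calling `count_images('datasets/daquar/images')` before `get_embeddings_df` surfaces a wrong path immediately instead of partway through model loading.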

## 2. COCO-QA

* **[COCO-QA Dataset](https://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/)**:

The COCO-QA (COCO Question-Answering) dataset is designed for the task of visual question-answering. It is a subset of the COCO (Common Objects in Context) dataset, which is a large-scale dataset containing images with object annotations. The COCO-QA dataset extends the COCO dataset by including questions and answers associated with the images. Each image in the COCO-QA dataset is accompanied by a set of questions and corresponding answers.

We'll use the function `get_embeddings_df` to generate embeddings for the images in `datasets/coco-qa/images` and store them in `Embeddings/coco-qa/Embeddings_Backbone.csv`.

In [None]:
batch_size = 32
path = 'datasets/coco-qa/images'
dataset = 'coco-qa'
backbone = 'dinov2_base'
out_dir = 'Embeddings'

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir)

## 3. Fakeddit

* **[Fakeddit Dataset](https://fakeddit.netlify.app/)**:

Fakeddit is a large-scale multimodal dataset for fine-grained fake news detection. It consists of over 1 million samples from multiple categories of fake news, including satire, misinformation, and fabricated news. The dataset includes text, images, metadata, and comment data, making it a rich resource for developing and evaluating fake news detection models.

We'll use the function `get_embeddings_df` to generate embeddings for the images in `datasets/fakeddit/images` and store them in `Embeddings/fakeddit/Embeddings_Backbone.csv`.

In [None]:
batch_size = 32
path = 'datasets/fakeddit/images'
dataset = 'fakeddit'
backbone = 'dinov2_base'
out_dir = 'Embeddings'
image_files = pd.read_csv('datasets/fakeddit/labels.csv')['id'].tolist()

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir, image_files=image_files)
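
Because `image_files` is read from `labels.csv`, some listed names may not exist in the image folder. The sketch below splits a list into present and missing files. `split_existing` is a hypothetical helper introduced here, and it assumes each entry is a file name relative to `path`; the notebook does not say whether the `id` column includes a file extension, so adjust the names accordingly.

```python
from pathlib import Path
from typing import List, Tuple


def split_existing(path: str, image_files: List[str]) -> Tuple[List[str], List[str]]:
    """Split image_files into (present, missing) relative to the folder `path`."""
    folder = Path(path)
    present, missing = [], []
    for name in image_files:
        # str() guards against numeric ids read from a CSV column
        if (folder / str(name)).is_file():
            present.append(name)
        else:
            missing.append(name)
    return present, missing
```

Passing only the `present` list as `image_files` is one way to avoid a failure late in a long run; the `missing` list can be logged for later inspection.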

## 4. Recipes5k

* **[Recipes5k Dataset](http://www.ub.edu/cvub/recipes5k/)**:

The Recipes5k dataset comprises 4,826 recipes, each with an image and a corresponding ingredient list. It contains 3,213 unique ingredients, simplified to 1,014 by removing overly descriptive particles. The recipes were extracted from Yummly, with 50 recipes per food type, and offer alternative preparations for each of the 101 food types from Food101, balanced across training, validation, and test splits. The dataset is designed to address intra- and inter-class variability.


We'll use the function `get_embeddings_df` to generate embeddings for the images in `datasets/Recipes5k/images` and store them in `Embeddings/Recipes5k/Embeddings_Backbone.csv`.

In [None]:
batch_size = 32
path = 'datasets/Recipes5k/images'
dataset = 'Recipes5k'
backbone = 'dinov2_base'
out_dir = 'Embeddings'
image_files = pd.read_csv('datasets/Recipes5k/labels.csv')['image'].tolist()

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir, image_files=image_files)

## 5. BRSET
* **[BRSET Dataset](https://physionet.org/content/brazilian-ophthalmological/1.0.0/)**:

The Brazilian Multilabel Ophthalmological Dataset (BRSET) aims to bridge the gap in ophthalmological datasets, particularly for under-represented populations in low- and middle-income countries. It encompasses 16,266 images from 8,524 Brazilian patients and includes demographics, anatomical parameters of the macula, optic disc, and vessels, along with quality control metrics such as focus, illumination, image field, and artifacts.
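
With a fixed `batch_size`, the number of batches gives a rough idea of how long a run will take and how far a per-batch progress log should count. The sketch below shows that arithmetic for the image count quoted above; `num_batches` is a name introduced here, and it assumes a final partial batch is still processed.

```python
import math


def num_batches(n_images: int, batch_size: int) -> int:
    """Number of batches needed to cover n_images, counting a final partial batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(n_images / batch_size)


print(num_batches(16_266, 32))  # 509
```

Halving `batch_size` to reduce memory use roughly doubles the batch count, which is worth keeping in mind when comparing progress logs across runs.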

In [4]:
batch_size = 32
#path = 'datasets/brset/images'
path = '/gpfs/workdir/restrepoda/datasets/BRSET/brset/images'
dataset = 'brset'
backbone = 'dinov2_giant'
out_dir = 'Embeddings'

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir)

##################################################  dinov2_giant  ##################################################


Downloading: "https://github.com/facebookresearch/dinov2/zipball/main" to /gpfs/workdir/restrepoda/.cache/torch/hub/main.zip
Downloading: "https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth" to /gpfs/workdir/restrepoda/.cache/torch/hub/checkpoints/dinov2_vitg14_pretrain.pth
100%|██████████| 4.23G/4.23G [00:39<00:00, 115MB/s] 


Processed batch number: 10
Processed batch number: 20
Processed batch number: 30
Processed batch number: 40
Processed batch number: 50
Processed batch number: 60
Processed batch number: 70
Processed batch number: 80
Processed batch number: 90
Processed batch number: 100
Processed batch number: 110
Processed batch number: 120
Processed batch number: 130
Processed batch number: 140
Processed batch number: 150
Processed batch number: 160
Processed batch number: 170
Processed batch number: 180
Processed batch number: 190
Processed batch number: 200
Processed batch number: 210
Processed batch number: 220
Processed batch number: 230
Processed batch number: 240
Processed batch number: 250
Processed batch number: 260
Processed batch number: 270
Processed batch number: 280
Processed batch number: 290
Processed batch number: 300
Processed batch number: 310
Processed batch number: 320
Processed batch number: 330
Processed batch number: 340
Processed batch number: 350
Processed batch number: 360

## 6. mBRSET
* **[mBRSET Dataset](https://physionet.org/content/mbrset/1.0/)**:

The Mobile Brazilian Multilabel Ophthalmological Dataset (mBRSET) aims to bridge the gap in ophthalmological datasets captured with mobile cameras, particularly for under-represented populations in low- and middle-income countries.

In [5]:
batch_size = 32
#path = 'datasets/mbrset/images'
path = '/gpfs/workdir/restrepoda/datasets/mBRSET/mbrset/images'
dataset = 'mbrset'
backbone = 'dinov2_giant'
out_dir = 'Embeddings'

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir)

##################################################  dinov2_giant  ##################################################


Using cache found in /gpfs/workdir/restrepoda/.cache/torch/hub/facebookresearch_dinov2_main


Processed batch number: 10
Processed batch number: 20
Processed batch number: 30
Processed batch number: 40
Processed batch number: 50
Processed batch number: 60
Processed batch number: 70
Processed batch number: 80
Processed batch number: 90
Processed batch number: 100
Processed batch number: 110
Processed batch number: 120
Processed batch number: 130
Processed batch number: 140
Processed batch number: 150
Processed batch number: 160


## 7. HAM10000

* **[HAM10000 Dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T)**:

The HAM10000 ("Human Against Machine with 10000 training images") dataset is a large collection of dermatoscopic images from different populations, acquired and stored by the Department of Dermatology at the Medical University of Vienna, Austria. It consists of 10,015 dermatoscopic images that can serve as a training set for academic machine learning tasks such as skin lesion analysis and classification, including the detection of melanoma.

In [None]:
batch_size = 32
path = 'datasets/ham10000/images'
dataset = 'ham10000'
backbone = 'dinov2_base'
out_dir = 'Embeddings'

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir)

## 8. Colombian Multimodal Satellite Dataset
* **[A Multi-Modal Satellite Imagery Dataset for Public Health Analysis in Colombia](https://physionet.org/content/multimodal-satellite-data/1.0.0/)**:

The Multi-Modal Satellite Imagery Dataset in Colombia integrates economic, demographic, meteorological, and epidemiological data. It comprises 12,636 high-quality satellite images with minimal cloud cover, covering 81 municipalities between 2016 and 2018. Potential applications include deforestation monitoring, forecasting of education indices, water quality assessment, tracking of extreme climatic events, epidemic disease response, and precision agriculture.

In [None]:
batch_size = 32
path = 'datasets/satellitedata/images'
dataset = 'satellitedata'
backbone = 'dinov2_base'
out_dir = 'Embeddings'

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir)

## 9. MIMIC-CXR
* **[MIMIC-CXR Dataset](https://physionet.org/content/mimic-cxr/2.0.0/#files-panel)**:

The MIMIC-CXR (Medical Information Mart for Intensive Care, Chest X-Ray) dataset is a large, publicly available collection of chest radiographs with associated radiology reports. It was developed by the MIT Lab for Computational Physiology and provides an extensive resource for training and evaluating machine learning models in the field of medical imaging, particularly in automated radiograph interpretation and natural language processing for clinical narratives.

In [None]:
batch_size = 32
path = 'datasets/mimic/images'
dataset = 'mimic'
backbone = 'dinov2_base'
out_dir = 'Embeddings'

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir)

## 10. Joslin Center Data

In [None]:
batch_size = 32
path = 'datasets/joslin/images'
dataset = 'joslin'
backbone = 'dinov2_base'
out_dir = 'Embeddings'
device = "cuda"  # NOTE: defined here but not passed to get_embeddings_df below

get_embeddings_df(batch_size=batch_size, path=path, dataset_name=dataset, backbone=backbone, directory=out_dir)