## Setup Environment

In [1]:
#from src.embeddings import get_embeddings_df
from src.nlp_models import LLAMA
#from src.nlp_models import GPT
import pandas as pd

## Embeddings Generation

* **Batch Size:** Samples per batch to convert to embeddings (adjust depending on your memory)

* **Path:** Path to the labels CSV file containing the text to embed (set through `model.path`)

* **Output Directory:** Directory to save the embeddings

* **Backbone:** Select a backbone from the list of possible backbones:
    * GPT-3.5 Turbo
    * GPT 4
    * LLAMA 2 7B
    * LLAMA 2 13B
    * LLAMA 2 70B

In [2]:
# Choose your model from the list of models:
#model = GPT()
model = LLAMA(embeddings=True, n_gpu_layers=400)

Model installation aborted.


ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 CUDA devices:
  Device 0: Tesla P100-SXM2-16GB, compute capability 6.0, VMM: yes
  Device 1: Tesla P100-SXM2-16GB, compute capability 6.0, VMM: yes


## 1. DAQUAR

* **[DAQUAR Dataset](https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/research/vision-and-language/visual-turing-challenge#c7057)**:

The DAQUAR (DAtaset for QUestion Answering on Real-world images) dataset was created to advance research in visual question answering (VQA). It consists of indoor scene images, each accompanied by a set of questions about the scene's content. The dataset serves as a benchmark for training and evaluating models that must understand images and answer questions about them.

We'll use the method `get_embedding_df` to embed the `question` column of `datasets/daquar/labels.csv` and store the embeddings in `Embeddings/daquar/text_embeddings.csv`.

In [3]:
model.path = 'datasets/daquar/labels.csv'
column = 'question'
directory = 'Embeddings/daquar'
file = 'text_embeddings.csv'

model.get_embedding_df(column, directory, file)
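
For readers who want to see the general shape of this step, here is a minimal sketch of what a method like `get_embedding_df` might do: read a CSV, embed one text column row by row, and write the original columns plus the vector to `directory/file`. This is not the implementation in `src.nlp_models`. The names `toy_embed` and `embed_column`, the `text_<j>` output column names, and the `log_every` interval are all assumptions made for illustration, and `toy_embed` is only a deterministic stand-in for a real model call.

```python
import csv
import os


def toy_embed(text, dim=4):
    # Hypothetical stand-in for a real model call: returns a deterministic toy vector.
    codes = [ord(ch) for ch in text]
    return [float(sum(codes[i::dim])) for i in range(dim)]


def embed_column(csv_path, column, directory, file, embed_fn=toy_embed, log_every=500):
    """Embed one text column of a CSV and write rows plus vectors to directory/file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    os.makedirs(directory, exist_ok=True)
    out_path = os.path.join(directory, file)

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = None
        for i, row in enumerate(rows):
            vector = embed_fn(row[column])
            # Assumed layout: one output column per vector component.
            row.update({f"text_{j}": value for j, value in enumerate(vector)})
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(row.keys()))
                writer.writeheader()
            writer.writerow(row)
            if i % log_every == 0:
                print(f"{i} Embeddings generated!")
    return out_path
```

Writing each row as soon as it is embedded keeps partial results on disk if a long run is interrupted; whether the actual method behaves this way is not something this notebook confirms.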

## 2. COCO-QA

* **[COCO-QA Dataset](https://www.cs.toronto.edu/~mren/research/imageqa/data/cocoqa/)**:

The COCO-QA (COCO Question-Answering) dataset is designed for the task of visual question-answering. It is a subset of the COCO (Common Objects in Context) dataset, which is a large-scale dataset containing images with object annotations. The COCO-QA dataset extends the COCO dataset by including questions and answers associated with the images. Each image in the COCO-QA dataset is accompanied by a set of questions and corresponding answers.

We'll use the method `get_embedding_df` to embed the `questions` column of `datasets/coco-qa/labels.csv` and store the embeddings in `Embeddings/coco-qa/text_embeddings.csv`.

In [None]:
model.path = 'datasets/coco-qa/labels.csv'
column = 'questions'
directory = 'Embeddings/coco-qa'
file = 'text_embeddings.csv'

model.get_embedding_df(column, directory, file)
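
The text column is named differently from one dataset to the next (`question` for DAQUAR, `questions` here), so it can be worth confirming the column exists before starting a long embedding run. The helper below is a hypothetical convenience, not part of `src.nlp_models`; it only reads the CSV header.

```python
import csv


def missing_columns(csv_path, required):
    """Return the names in `required` that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return [name for name in required if name not in header]
```

For example, `missing_columns('datasets/coco-qa/labels.csv', ['questions'])` returns an empty list when the column is present, and a list containing `'questions'` otherwise.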


## 3. Fakeddit

* **[Fakeddit Dataset](https://fakeddit.netlify.app/)**:

Fakeddit is a large-scale multimodal dataset for fine-grained fake news detection. It consists of over 1 million samples from multiple categories of fake news, including satire, misinformation, and fabricated news. The dataset includes text, images, metadata, and comment data, making it a rich resource for developing and evaluating fake news detection models.

We'll use the method `get_embedding_df` to embed the `title` column of `datasets/fakeddit/labels.csv` and store the embeddings in `Embeddings/fakeddit/text_embeddings.csv`.

In [None]:
model.path = 'datasets/fakeddit/labels.csv'
column = 'title'
directory = 'Embeddings/fakeddit'
file = 'text_embeddings.csv'

model.get_embedding_df(column, directory, file)

0 Embeddings generated!
5000 Embeddings generated!
10000 Embeddings generated!


## 4. Recipes5k

* **[Recipes5k Dataset](http://www.ub.edu/cvub/recipes5k/)**:

The Recipes5k dataset comprises 4,826 recipes, each with an image and the corresponding ingredient list. It contains 3,213 unique ingredients, which are also provided in a simplified form of 1,014 ingredients obtained by removing overly descriptive particles. The recipes were extracted from Yummly, with 50 recipes for each of the 101 food types from Food101, and are balanced across training, validation, and test splits. The alternative preparations of each food type give the dataset both intra- and inter-class variability.


We'll use the method `get_embedding_df` to embed the `ingredients` column of `datasets/Recipes5k/labels.csv` and store the embeddings in `Embeddings/Recipes5k/text_embeddings.csv`.

In [4]:
model.path = 'datasets/Recipes5k/labels.csv'
column = 'ingredients'
directory = 'Embeddings/Recipes5k'
file = 'text_embeddings.csv'

model.get_embedding_df(column, directory, file)

## 5. BRSET
* **[BRSET Dataset](https://physionet.org/content/brazilian-ophthalmological/1.0.0/)**:

The Brazilian Multilabel Ophthalmological Dataset (BRSET) aims to close the gap in ophthalmological datasets, particularly for under-represented populations in low- and middle-income countries. It contains 16,266 images from 8,524 Brazilian patients, together with demographics, anatomical parameters of the macula, optic disc, and vessels, and quality control metrics such as focus, illumination, image field, and artifacts.

In [3]:
model.path = 'datasets/brset/labels.csv'
column = 'text'
directory = 'Embeddings/brset'
file = 'text_embeddings.csv'

model.get_embedding_df(column, directory, file)

0 Embeddings generated!
500 Embeddings generated!
1000 Embeddings generated!
1500 Embeddings generated!
2000 Embeddings generated!
2500 Embeddings generated!
3000 Embeddings generated!
3500 Embeddings generated!
4000 Embeddings generated!
4500 Embeddings generated!
5000 Embeddings generated!
5500 Embeddings generated!
6000 Embeddings generated!
6500 Embeddings generated!
7000 Embeddings generated!
7500 Embeddings generated!
8000 Embeddings generated!
8500 Embeddings generated!
9000 Embeddings generated!
9500 Embeddings generated!
10000 Embeddings generated!
10500 Embeddings generated!
11000 Embeddings generated!
11500 Embeddings generated!
12000 Embeddings generated!
12500 Embeddings generated!
13000 Embeddings generated!
13500 Embeddings generated!
14000 Embeddings generated!
14500 Embeddings generated!
15000 Embeddings generated!
15500 Embeddings generated!
16000 Embeddings generated!


## 6. HAM10000

* **[HAM10000 Dataset](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T)**:

The HAM10000 ("Human Against Machine with 10000 training images") dataset is a large collection of dermatoscopic images from different populations, acquired and stored by the Department of Dermatology at the Medical University of Vienna, Austria. It consists of 10,015 dermatoscopic images that can serve as a training set for academic machine learning tasks such as skin lesion analysis and classification, including the detection of melanoma.

In [4]:
model.path = 'datasets/ham10000/labels.csv'
column = 'text'
directory = 'Embeddings/ham10000'
file = 'text_embeddings.csv'

model.get_embedding_df(column, directory, file)

0 Embeddings generated!
500 Embeddings generated!
1000 Embeddings generated!
1500 Embeddings generated!
2000 Embeddings generated!
2500 Embeddings generated!
3000 Embeddings generated!
3500 Embeddings generated!
4000 Embeddings generated!
4500 Embeddings generated!
5000 Embeddings generated!
5500 Embeddings generated!
6000 Embeddings generated!
6500 Embeddings generated!
7000 Embeddings generated!
7500 Embeddings generated!
8000 Embeddings generated!
8500 Embeddings generated!
9000 Embeddings generated!
9500 Embeddings generated!
10000 Embeddings generated!


## 7. Colombian Multimodal Satellite dataset
* **[A Multi-Modal Satellite Imagery Dataset for Public Health Analysis in Colombia](https://physionet.org/content/multimodal-satellite-data/1.0.0/)**:

The Multi-Modal Satellite Imagery Dataset in Colombia integrates economic, demographic, meteorological, and epidemiological data. It comprises 12,636 high-quality satellite images from 81 municipalities between 2016 and 2018, with minimal cloud cover. Its applications include monitoring deforestation, forecasting education indices, assessing water quality, tracking extreme climatic events, addressing epidemic illnesses, and optimizing precision agriculture.

In [3]:
model.path = 'datasets/satellitedata/labels.csv'
column = 'text'
directory = 'Embeddings/satellitedata'
file = 'text_embeddings.csv'

model.get_embedding_df(column, directory, file)

0 Embeddings generated!
500 Embeddings generated!
1000 Embeddings generated!
1500 Embeddings generated!


## 8. MIMIC CXR
* **[MIMIC CXR](https://physionet.org/content/mimic-cxr/2.0.0/#files-panel)**:

The MIMIC-CXR (Medical Information Mart for Intensive Care, Chest X-Ray) dataset is a large, publicly available collection of chest radiographs with associated radiology reports. It was developed by the MIT Lab for Computational Physiology and provides an extensive resource for training and evaluating machine learning models in the field of medical imaging, particularly in automated radiograph interpretation and natural language processing for clinical narratives.

In [3]:
model.path = 'datasets/mimic/labels.csv'
column = 'text'
directory = 'Embeddings/mimic'
file = 'text_embeddings.csv'

model.get_embedding_df(column, directory, file)

0 Embeddings generated!
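
Every dataset cell above follows the same pattern: set `model.path`, choose a column, and call `model.get_embedding_df(column, directory, file)`. As a hedged sketch, the eight cells could be driven from one table. The paths and column names are copied from the cells in this notebook; the `TextJob` class and `run_all` function are hypothetical helpers introduced here, and `model` is assumed to expose the same `path` attribute and `get_embedding_df` method used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextJob:
    name: str    # dataset folder name under datasets/ and Embeddings/
    column: str  # text column to embed in labels.csv

    @property
    def path(self):
        return f"datasets/{self.name}/labels.csv"

    @property
    def directory(self):
        return f"Embeddings/{self.name}"


# Folder names and columns as used in the cells of this notebook.
JOBS = [
    TextJob("daquar", "question"),
    TextJob("coco-qa", "questions"),
    TextJob("fakeddit", "title"),
    TextJob("Recipes5k", "ingredients"),
    TextJob("brset", "text"),
    TextJob("ham10000", "text"),
    TextJob("satellitedata", "text"),
    TextJob("mimic", "text"),
]


def run_all(model, jobs=JOBS, file="text_embeddings.csv"):
    """Run the same embedding call for every dataset in `jobs`, in order."""
    for job in jobs:
        model.path = job.path
        model.get_embedding_df(job.column, job.directory, file)
```

Calling `run_all(model)` issues the same calls as the individual cells; passing a shorter `jobs` list restricts the run to selected datasets.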
