## Introduction

##### In this notebook we will explore multimodal models using OpenFlamingo. Inspired by DeepMind's Flamingo model, it is an open-source model with loads of possibilities.

##### To start off, we're going to review the basics by following OpenFlamingo's creators, so that we can understand its architecture and usage. After we review this, we will apply the model to our own data, which contains unique UI datasets and their descriptions.

In [None]:
!pip install opendatasets --upgrade

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


##### Before running the cell below, please make sure you create a Kaggle account and then go to your settings. In the settings, go to the section named API, where you can generate a new token or expire an existing one. Clicking the button to generate a new token automatically downloads a JSON file named kaggle.json. Open this file, then run the cell below and enter the username and key from it when prompted.
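
##### As a quick sanity check before entering the credentials, the sketch below reads a token file and confirms that it holds the two values the prompt asks for. The helper name `read_kaggle_credentials` and the example path are assumptions made for illustration, and the `username` and `key` field names reflect the usual layout of `kaggle.json`; adjust them if your file differs.

```python
import json
from pathlib import Path


def read_kaggle_credentials(path):
    """Return (username, key) from a Kaggle API token file.

    Hypothetical helper: assumes the file is a JSON object with
    non-empty "username" and "key" fields.
    """
    creds = json.loads(Path(path).read_text())
    missing = [field for field in ("username", "key") if not creds.get(field)]
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {missing}")
    return creds["username"], creds["key"]


# Example usage (the path is an assumption; point it at your downloaded file):
# username, key = read_kaggle_credentials("kaggle.json")
```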

In [None]:
import opendatasets as od
dataset_url = 'https://www.kaggle.com/datasets/onurgunes1993/rico-dataset'
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: carollineosei
Your Kaggle Key: ··········
Downloading rico-dataset.zip to ./rico-dataset


100%|██████████| 6.24G/6.24G [01:22<00:00, 81.6MB/s]





In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = '/content'

###### While we don't need it yet, we are downloading the dataset now so that we can use it with the model once we're done reviewing the basics.

In [None]:
!kaggle datasets download -d onurgunes1993/rico-dataset

Downloading rico-dataset.zip to /content
100% 6.24G/6.24G [01:10<00:00, 178MB/s]
100% 6.24G/6.24G [01:10<00:00, 95.1MB/s]


In [None]:
import pandas as pd

rico_data = pd.read_csv('rico-dataset.csv')
rico_data = rico_data.iloc[:6]  # Keep only the first 6 rows of the dataset for scaling purposes

##### To install the package, you can use the command `pip install open-flamingo` in an existing environment. Alternatively, you can create a new conda environment for running open-flamingo by running `conda env create -f environment.yml`.

In [None]:
!pip install open-flamingo[all]

Collecting webdataset (from open-flamingo[all])
  Downloading webdataset-0.2.79-py3-none-any.whl (65 kB)
Collecting inflection (from open-flamingo[all])
  Downloading inflection-0.5.1-py2.py3-none-any.whl (9.5 kB)
Collecting wandb (from open-flamingo[all])
  Downloading wandb-0.16.0-py3-none-any.whl (2.1 MB)
Collecting pycocoevalcap (from open-flamingo[all])
  Downloading pycocoevalcap-1.2-py3-none-any.whl (104.3 MB)
Collecting braceexpand (from open-flamingo[all])
  Downloading braceexpand-0.1.7-py2.py3-none-any.whl (5.9 kB)
Collecting GitPython!=3.1.29,>=1.0.0 (from wandb->open-flamingo[all])
  Downloading GitPython-3.1.40-

In [None]:
!pip install pre-commit

Collecting pre-commit
  Downloading pre_commit-3.5.0-py2.py3-none-any.whl (203 kB)
Collecting cfgv>=2.0.0 (from pre-commit)
  Downloading cfgv-3.4.0-py2.py3-none-any.whl (7.2 kB)
Collecting identify>=1.0.0 (from pre-commit)
  Downloading identify-2.5.33-py2.py3-none-any.whl (98 kB)
Collecting nodeenv>=0.11.1 (from pre-commit)
  Downloading nodeenv-1.8.0-py2.py3-none-any.whl (22 kB)
Collecting virtualenv>=20.10.0 (from pre-commit)
  Downloading virtualenv-20.25.0-py3-none-any.whl (3.8 MB)
Collecting distlib<1,>=0.3.7 (from virtualenv>=20.10.0->pre-commit)
  Downloading distlib-0.3.7-py2.py3-none-any.whl (468 kB)

In [None]:
!pre-commit --version

pre-commit 3.5.0


## Initializing an OpenFlamingo Model

##### Before running the following cell, it's important to understand what it does. The code imports the `create_model_and_transforms` function from the open_flamingo package and calls it to create a model, an image processor, and a tokenizer. The `clip_vision_encoder_path` parameter names the CLIP vision encoder used in the model, which is set to `ViT-L-14`.

##### The `clip_vision_encoder_pretrained` parameter specifies the pretrained weights for the vision encoder, which are set to `openai`. The `lang_encoder_path` parameter specifies the language model used in the model, here the Hugging Face model `anas-awadalla/mpt-1b-redpajama-200b`. The `tokenizer_path` parameter specifies the tokenizer, which is also set to `anas-awadalla/mpt-1b-redpajama-200b`.

##### The `cross_attn_every_n_layers` parameter specifies the number of layers between cross-attention layers, which is set to 1. Finally, the `cache_dir` parameter specifies the directory where downloaded files are cached. The value `"PATH/TO/CACHE/DIR"` is a placeholder to replace with your own path; if the parameter is omitted, it defaults to `~/.cache`.
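
##### To make the effect of `cross_attn_every_n_layers` more concrete, here is a minimal sketch of which language-model layers would receive a cross-attention block for a given interval. It assumes a block is attached to every n-th layer, counting from 1; the helper name and the layer count of 24 are hypothetical values chosen for illustration, not read from the model.

```python
def cross_attention_layer_indices(num_layers, cross_attn_every_n_layers):
    """List the 0-based layers that would get a cross-attention block.

    Illustrative assumption: a block is attached to every n-th layer,
    counting from 1, so n=1 selects every layer.
    """
    if cross_attn_every_n_layers < 1:
        raise ValueError("cross_attn_every_n_layers must be at least 1")
    return [
        layer_idx
        for layer_idx in range(num_layers)
        if (layer_idx + 1) % cross_attn_every_n_layers == 0
    ]


# num_layers=24 is a hypothetical depth used only for this example.
print(cross_attention_layer_indices(24, 1)[:4])  # [0, 1, 2, 3]
print(cross_attention_layer_indices(24, 4))      # [3, 7, 11, 15, 19, 23]
```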

In [None]:
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
    cache_dir="PATH/TO/CACHE/DIR"  # Defaults to ~/.cache
)

100%|███████████████████████████████████████| 933M/933M [00:10<00:00, 85.6MiB/s]


tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.13k [00:00<?, ?B/s]

configuration_mosaic_gpt.py:   0%|          | 0.00/8.87k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b:
- configuration_mosaic_gpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


mosaic_gpt.py:   0%|          | 0.00/20.5k [00:00<?, ?B/s]

param_init_fns.py:   0%|          | 0.00/15.9k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b:
- param_init_fns.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


attention.py:   0%|          | 0.00/13.8k [00:00<?, ?B/s]

low_precision_layernorm.py:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b:
- low_precision_layernorm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b:
- attention.py
- low_precision_layernorm.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


gpt_blocks.py:   0%|          | 0.00/3.11k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b:
- gpt_blocks.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b:
- mosaic_gpt.py
- param_init_fns.py
- attention.py
- gpt_blocks.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


pytorch_model.bin:   0%|          | 0.00/5.25G [00:00<?, ?B/s]

KeyboardInterrupt: ignored

## Downloading Pretrained Weights

##### This code imports the `hf_hub_download` function from the huggingface_hub package and uses it to download the checkpoint file for the OpenFlamingo-3B-vitl-mpt1b model from the Hugging Face model hub. Here the function is called with two arguments: the repository ID, `"openflamingo/OpenFlamingo-3B-vitl-mpt1b"`, and the name of the file to fetch, `"checkpoint.pt"`.

##### The function returns the local path to the downloaded checkpoint file, which is assigned to the `checkpoint_path` variable. `torch.load` then reads the checkpoint, and the model's `load_state_dict` method copies the stored weights into the model. The `strict` parameter is set to `False` so that loading succeeds even if the keys in the checkpoint do not exactly match the keys of the model, for example when the checkpoint holds only a subset of the weights.
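
##### The behaviour of `strict=False` can be seen on a toy module. The sketch below is not OpenFlamingo code: `ToyModel` and its layer names are invented for illustration. It loads a state dictionary that covers only part of the module and prints which keys were missing, then shows that a tensor with the wrong shape still raises an error.

```python
import torch
from torch import nn


class ToyModel(nn.Module):
    """Stand-in for a model whose checkpoint covers only some submodules."""

    def __init__(self):
        super().__init__()
        self.pretrained_part = nn.Linear(4, 4)
        self.new_part = nn.Linear(4, 2)


toy = ToyModel()

# Pretend the checkpoint only stores the weights of `new_part`.
partial_state = {
    name: tensor
    for name, tensor in toy.state_dict().items()
    if name.startswith("new_part")
}

# With strict=False, keys absent from the checkpoint are reported, not fatal.
result = toy.load_state_dict(partial_state, strict=False)
print("Missing keys:", result.missing_keys)
print("Unexpected keys:", result.unexpected_keys)

# A shape mismatch is still an error, even with strict=False.
shape_error_raised = False
try:
    toy.load_state_dict({"new_part.weight": torch.zeros(3, 3)}, strict=False)
except RuntimeError:
    shape_error_raised = True
print("Shape mismatch raised an error:", shape_error_raised)
```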

In [None]:
# grab model checkpoint from huggingface hub
from huggingface_hub import hf_hub_download
import torch

checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-3B-vitl-mpt1b", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)

## Generating Text

##### This code loads three images from the internet using their URLs. The images are then preprocessed to be used as input to the OpenFlamingo model.

##### In Step 2, the images are converted to a tensor of shape batch_size x num_media x num_frames x channels x height x width, where batch_size = 1, num_media = 3, num_frames = 1, channels = 3, height = 224, and width = 224.

##### In Step 3, the text is preprocessed to include special tokens that indicate the location of images and the end of text portions associated with images.

##### Finally, in Step 4, the OpenFlamingo model is used to generate text based on the images and text input. The `generate` method of the model is called with the preprocessed images and text as input, along with some additional parameters that control the generation process. The generated text is then printed to the console.
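
##### The prompt format from Step 3 can be assembled programmatically. The helper below is a small sketch, and `build_few_shot_prompt` is a name invented here rather than part of the open_flamingo package: each example caption is wrapped in `<image>` and `<|endofchunk|>`, and the final `<image>` is followed by an unfinished caption for the model to complete.

```python
def build_few_shot_prompt(example_captions, query_prefix=""):
    """Interleave <image> tokens with captions, ending on an open query."""
    parts = [f"<image>{caption}<|endofchunk|>" for caption in example_captions]
    # The last image has no <|endofchunk|>, so the model continues the text.
    parts.append(f"<image>{query_prefix}")
    return "".join(parts)


prompt = build_few_shot_prompt(
    ["An image of two cats.", "An image of a bathroom sink."],
    query_prefix="An image of",
)
print(prompt)
```

##### The number of `<image>` tokens in the prompt should match the number of images stacked into `vision_x`.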

In [None]:
from PIL import Image
import requests
import torch

"""
Step 1: Load images
"""
demo_image_one = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

demo_image_two = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028137.jpg",
        stream=True
    ).raw
)

query_image = Image.open(
    requests.get(
        "http://images.cocodataset.org/test-stuff2017/000000028352.jpg",
        stream=True
    ).raw
)


"""
Step 2: Preprocessing images
Details: For OpenFlamingo, we expect the image to be a torch tensor of shape
 batch_size x num_media x num_frames x channels x height x width.
 In this case batch_size = 1, num_media = 3, num_frames = 1,
 channels = 3, height = 224, width = 224.
"""
vision_x = [image_processor(demo_image_one).unsqueeze(0), image_processor(demo_image_two).unsqueeze(0), image_processor(query_image).unsqueeze(0)]
vision_x = torch.cat(vision_x, dim=0)
vision_x = vision_x.unsqueeze(1).unsqueeze(0)

"""
Step 3: Preprocessing text
Details: In the text we expect an <image> special token to indicate where an image is.
 We also expect an <|endofchunk|> special token to indicate the end of the text
 portion associated with an image.
"""
tokenizer.padding_side = "left" # For generation padding tokens should be on the left
lang_x = tokenizer(
    ["<image>An image of two cats.<|endofchunk|><image>An image of a bathroom sink.<|endofchunk|><image>An image of"],
    return_tensors="pt",
)


"""
Step 4: Generate text
"""
generated_text = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)

print("Generated text: ", tokenizer.decode(generated_text[0]))

## Training

##### The command below runs the `train.py` script from the open_flamingo package. The script trains a multimodal language model on the **Multimodal C4** and **LAION** datasets, using **MPT-1B-RedPajama-200B** as the language model. The `lm_path` parameter specifies the path to the language model, which is set to `anas-awadalla/mpt-1b-redpajama-200b`. The `tokenizer_path` parameter specifies the path to the tokenizer used by the language model, which is also set to `anas-awadalla/mpt-1b-redpajama-200b`.

##### The `cross_attn_every_n_layers` parameter specifies the number of layers between cross-attention layers, which is set to 1. The `dataset_resampled` flag specifies that the dataset should be resampled. The `batch_size_mmc4` parameter specifies the batch size for the Multimodal C4 dataset, which is set to 32. The `batch_size_laion` parameter specifies the batch size for the LAION dataset, which is set to 64. The `train_num_samples_mmc4` parameter specifies the number of samples to use for training on the Multimodal C4 dataset, which is set to 125000. The `train_num_samples_laion` parameter specifies the number of samples to use for training on the **LAION** dataset, which is set to 250000.

##### The `loss_multiplier_laion` parameter specifies the loss multiplier for the **LAION** dataset, which is set to 0.2. The `workers` parameter specifies the number of workers to use for data loading, which is set to 4. The `run_name` parameter specifies the name of the run, which is set to `OpenFlamingo-3B-vitl-mpt1b`. The `num_epochs` parameter specifies the number of epochs to train for, which is set to 480. The `warmup_steps` parameter specifies the number of warmup steps, which is set to 1875. The `mmc4_textsim_threshold` parameter specifies the text similarity threshold for the **Multimodal C4** dataset, which is set to 0.24.

##### The `laion_shards` parameter specifies the path to the shards for the LAION dataset. The `mmc4_shards` parameter specifies the path to the shards for the Multimodal C4 dataset. Finally, the `report_to_wandb` flag specifies that the results should be reported to Weights & Biases.
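
##### To get a feel for how these numbers relate, the sketch below works out the number of optimizer steps implied by the sample counts and batch sizes. It assumes that the batch-size flags are per GPU and that steps per epoch are the sample count divided (rounding down) by the global batch size; how `train.py` actually counts steps should be checked against the script. The helper name `steps_per_epoch` is invented for this example.

```python
def steps_per_epoch(num_samples, per_gpu_batch_size, num_gpus):
    """Steps needed to see num_samples once, under the assumptions above."""
    return num_samples // (per_gpu_batch_size * num_gpus)


num_gpus = 4  # --nnodes=1 with --nproc_per_node=4

mmc4_steps = steps_per_epoch(125_000, 32, num_gpus)
laion_steps = steps_per_epoch(250_000, 64, num_gpus)
total_steps = mmc4_steps * 480  # --num_epochs 480

print("MMC4 steps per epoch:", mmc4_steps)
print("LAION steps per epoch:", laion_steps)
print("Total steps:", total_steps)
print(f"Warmup share of training: {1875 / total_steps:.2%}")
```

##### Under these assumptions both datasets yield the same number of steps per epoch, which is consistent with the LAION batch size and sample count both being twice the Multimodal C4 values.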

In [None]:
torchrun --nnodes=1 --nproc_per_node=4 open_flamingo/train/train.py \
  --lm_path anas-awadalla/mpt-1b-redpajama-200b \
  --tokenizer_path anas-awadalla/mpt-1b-redpajama-200b \
  --cross_attn_every_n_layers 1 \
  --dataset_resampled \
  --batch_size_mmc4 32 \
  --batch_size_laion 64 \
  --train_num_samples_mmc4 125000 \
  --train_num_samples_laion 250000 \
  --loss_multiplier_laion 0.2 \
  --workers=4 \
  --run_name OpenFlamingo-3B-vitl-mpt1b \
  --num_epochs 480 \
  --warmup_steps 1875 \
  --mmc4_textsim_threshold 0.24 \
  --laion_shards "/path/to/shards/shard-{0000..0999}.tar" \
  --mmc4_shards "/path/to/shards/shard-{0000..0999}.tar" \
  --report_to_wandb

## Evaluation

##### To evaluate the output, we're going to download the WordNet database, a large English lexical database that is used in NLP tasks such as text classification, sentiment analysis, and machine translation. We're also going to create a word cloud from the generated text to display and visualize word frequency.

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter

# Convert the generated text to a string
generated_text_str = tokenizer.decode(generated_text[0], skip_special_tokens=True)

# Split the text into words (tokens)
generated_words = generated_text_str.split()

# Calculate the word count and word frequency
word_count = len(generated_words)
word_frequency = Counter(generated_words)

# Print some basic statistics
print("Total word count:", word_count)
print("Vocabulary size:", len(word_frequency))
print("Most common words:")
for word, freq in word_frequency.most_common(10):
    print(f"{word}: {freq} times")

# Create and display a word cloud to visualize word frequency
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_frequency)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Generated Text")
plt.show()

## Citations

Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Jitsev, J., Kornblith, S., Koh, P.W., Ilharco, G., Wortsman, M., Schmidt, L. (2023). OpenFlamingo. (Version 3.5.0). https://github.com/mlfoundations/open_flaming

## Applying the model to UI Dataset and Descriptions

In [None]:
# Import necessary libraries

from open_flamingo import create_model_and_transforms
from huggingface_hub import hf_hub_download
from PIL import Image
import requests
import torch
import os
import json

In [None]:
# Create model and transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
    cache_dir="PATH/TO/CACHE/DIR"  # Defaults to ~/.cache
    )

In [None]:
# Load model checkpoint

checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-3B-vitl-mpt1b", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)

In [None]:
# List all files in the directory containing your dataset
rico_data = os.listdir('/content/rico-dataset/rico_dataset_v0.1_semantic_annotations/semantic_annotations')

# Sort the files to ensure they are in the correct order
rico_data.sort()

# Select the first 6 items
rico_data = rico_data[:6]

# Separate the images and json files into two lists
ui_images = [file for file in rico_data if file.endswith('.png')]
ui_descriptions_files = [file for file in rico_data if file.endswith('.json')]

# For each json file, read the file and extract the description
ui_descriptions = []
for file in ui_descriptions_files:
    with open('/content/rico-dataset/rico_dataset_v0.1_semantic_annotations/semantic_annotations/' + file) as json_file:
        data = json.load(json_file)
        ui_descriptions.append(data.get('description', ''))  # Use an empty string as default if 'description' is not found
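
Taking the first six entries of a sorted directory listing only gives matched image and description pairs if every screenshot has a JSON file with the same base name right next to it. As a more defensive sketch, the helper below pairs files by their base name and skips anything without a partner. The name `pair_by_stem` is invented for illustration, and the assumption that each `N.png` belongs with `N.json` should be checked against the dataset.

```python
import os


def pair_by_stem(filenames):
    """Return (image, json) filename pairs that share the same base name."""
    by_stem = {}
    for name in filenames:
        stem, ext = os.path.splitext(name)
        by_stem.setdefault(stem, {})[ext.lower()] = name
    # Keep only stems that have both a screenshot and an annotation file.
    return [
        (found[".png"], found[".json"])
        for stem, found in sorted(by_stem.items())
        if ".png" in found and ".json" in found
    ]


# Example with made-up filenames; "7.png" is dropped because it has no JSON.
pairs = pair_by_stem(["10.png", "2.json", "2.png", "10.json", "7.png"])
print(pairs)
```

The resulting pairs can be sliced (for example `pairs[:3]`) and unpacked into separate image and description lists.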

In [None]:
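# Preprocess your images and text

tokenizer.padding_side = "left"  # For generation, padding tokens should be on the left
data_dir = '/content/rico-dataset/rico_dataset_v0.1_semantic_annotations/semantic_annotations/'

# Build one (vision_x, lang_x) pair per image so that nothing is overwritten in the loop
model_inputs = []
for image_file, desc in zip(ui_images, ui_descriptions):
    # Shape: batch_size x num_media x num_frames x channels x height x width
    vision_x = image_processor(Image.open(data_dir + image_file)).unsqueeze(0).unsqueeze(1).unsqueeze(0)
    lang_x = tokenizer(
        ["<image>" + desc + "<|endofchunk|>"],
        return_tensors="pt",
    )
    model_inputs.append((vision_x, lang_x))

In [None]:
# Generate text for each image-description pair

generated_texts = []
for i, (vision_x, lang_x) in enumerate(model_inputs):
    generated_text = model.generate(
        vision_x=vision_x,
        lang_x=lang_x["input_ids"],
        attention_mask=lang_x["attention_mask"],
        max_new_tokens=20,
        num_beams=3,
    )
    generated_texts.append(generated_text)  # Keep every output for the analysis below
    print("Generated text for image ", i, ": ", tokenizer.decode(generated_text[0]))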


## Comparative Analysis

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize a TfidfVectorizer
vectorizer = TfidfVectorizer()

# For each image-description pair
for i in range(len(ui_images)):
    # ... (existing code to generate text) ...

    # Get the original and generated descriptions
    original_description = ui_descriptions[i]
    generated_description = tokenizer.decode(generated_text[0])

    # Calculate the TF-IDF vectors of the original and generated descriptions
    tfidf_matrix = vectorizer.fit_transform([original_description, generated_description])

    # Calculate the cosine similarity
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])

    print("Similarity for image ", i, ": ", similarity[0][0])

This will print the cosine similarity between the original and generated descriptions for each image.
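
To see what the similarity score measures, here is a simplified, dependency-free sketch. It uses raw word counts rather than TF-IDF weights, so its values will differ from the scikit-learn result above, but the cosine formula is the same: the dot product of the two vectors divided by the product of their lengths. The function name `cosine_similarity_counts` and the example sentences are invented for illustration.

```python
import math
from collections import Counter


def cosine_similarity_counts(text_a, text_b):
    """Cosine similarity between two texts using plain word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity_counts("a login screen", "a login screen"))  # close to 1
print(cosine_similarity_counts("login screen", "login form"))        # close to 0.5
print(cosine_similarity_counts("a login screen", "photo gallery"))   # 0.0
```

Scores near 1 mean the two descriptions use largely the same words, and scores near 0 mean they share almost none.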

Alternatively, you can manually inspect a sample of the generated and original texts to get a sense of how well the model is performing. Here's how you could print a sample:

In [None]:
# Print a sample of original and generated descriptions
# (assumes generated_texts holds one model.generate output per image, in order)
for i in range(min(5, len(ui_images))):  # Print up to 5 samples
    print("Original description for image ", i, ": ", ui_descriptions[i])
    print("Generated description for image ", i, ": ", tokenizer.decode(generated_texts[i][0]))

## Visualizations

In [None]:
# Install wordcloud and matplotlib
!pip install wordcloud matplotlib

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine all generated text into one string
all_generated_text = ' '.join([tokenizer.decode(generated_text[0]) for generated_text in generated_texts])

# Create a WordCloud object
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = None,
                min_font_size = 10).generate(all_generated_text)

# Plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)

plt.show()


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

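# Combine your original descriptions and the decoded generated texts into one list
# (generated_texts is the list of model.generate outputs, one per image)
all_text = ui_descriptions + [tokenizer.decode(generated_text[0]) for generated_text in generated_texts]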

# Convert your text into TF-IDF vectors
tfidf_matrix = vectorizer.fit_transform(all_text)

# Calculate the cosine similarity matrix
cosine_sim_matrix = cosine_similarity(tfidf_matrix)


The resulting matrix will have the shape (n_samples, n_samples), where n_samples is the number of texts in all_text. The diagonal of this matrix will be 1s (since a text is perfectly similar to itself), and the off-diagonal elements will be the cosine similarities between different texts. Once you have your similarity matrix, you can visualize it as a heatmap using a library like seaborn:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
sns.heatmap(cosine_sim_matrix, annot=True, cmap='coolwarm')
plt.show()

2. **Comparative Analysis**: You can compare the generated text with the original descriptions to see how well the model is performing. This could involve calculating similarity metrics, or manually inspecting a sample of the generated and original texts.



### Visualization

1. **Word Clouds**: A word cloud can give you a quick visual overview of the most common words in your generated text.


3. **Heatmaps**: If you calculate similarity metrics between the original and generated descriptions, you could visualize these as a heatmap.



### Improving the model:

1. **Tune Hyperparameters**: Adjusting the model's hyperparameters can often improve performance. This could involve changing learning rates, batch sizes etc.

2. **Use More Data**: If possible, training the model on a larger or more diverse dataset can often lead to better results.

3. **Advanced Techniques**: Depending on your specific use case and the model architecture you're using, there may be advanced techniques you can use to improve performance. For example, if your model is overfitting, you might try regularization techniques like dropout or weight decay.

4. **Model Architecture**: Try different model architectures or pre-trained models which may perform better on your specific task.




Existing code:

```python
import os
import json
from PIL import Image
import requests
import torch
from huggingface_hub import hf_hub_download
from open_flamingo import create_model_and_transforms

# Create model and transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
    cache_dir="PATH/TO/CACHE/DIR"  # Defaults to ~/.cache
)

# Load model checkpoint

checkpoint_path = hf_hub_download("openflamingo/OpenFlamingo-3B-vitl-mpt1b", "checkpoint.pt")
model.load_state_dict(torch.load(checkpoint_path), strict=False)

# List all files in the directory containing your dataset
files = os.listdir('path_to_your_dataset')

# Sort the files to ensure they are in the correct order
files.sort()

# Select the first 6 items
files = files[:6]

# Separate the images and json files into two lists
ui_images = [file for file in files if file.endswith('.png')]
ui_descriptions_files = [file for file in files if file.endswith('.json')]

# For each json file, read the file and extract the description
ui_descriptions = []
for file in ui_descriptions_files:
    with open('path_to_your_dataset/' + file) as json_file:
        data = json.load(json_file)
        ui_descriptions.append(data.get('description', ''))  # Use an empty string as default if 'description' is not found

# Preprocess your images and text, and generate text for each image-description pair

tokenizer.padding_side = "left"

for i in range(len(ui_images)):
    vision_x = image_processor(Image.open('path_to_your_dataset/' + ui_images[i])).unsqueeze(0).unsqueeze(1).unsqueeze(0)
    
    lang_x = tokenizer(
        "<image>" + ui_descriptions[i] + "<|endofchunk|>",
        return_tensors="pt",
    )

    generated_text = model.generate(
        vision_x=vision_x,
        lang_x=lang_x["input_ids"],
        attention_mask=lang_x["attention_mask"],
        max_new_tokens=20,
        num_beams=3,
    )

    print("Generated text for image ", i, ": ", tokenizer.decode(generated_text[0]))
```

The training process for this model is contained within the `train.py` script located in the `open_flamingo/train` directory as mentioned in the GitHub readme file. The command provided in the readme file runs this script with a set of specified parameters.

The command includes several hyperparameters that you could tune to potentially improve the model's performance:

1. **Batch Size**: This is specified by the `--batch_size_mmc4` and `--batch_size_laion` arguments. You could try different values for these to see if it improves performance.

2. **Number of Epochs**: This is specified by the `--num_epochs` argument. Again, you could try different values for this.

3. **Warmup Steps**: This is specified by the `--warmup_steps` argument. Warmup steps are used by learning-rate schedules to gradually increase the learning rate at the start of training, which can sometimes improve stability and performance.

4. **Loss Multiplier**: The `--loss_multiplier_laion` argument specifies a multiplier applied to the loss computed on the LAION data. Changing it shifts how much the model learns from LAION relative to the Multimodal C4 data.
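
As an illustration of what warmup does, the sketch below computes a learning-rate multiplier that rises linearly from 0 to 1 over the warmup steps and then stays constant. This is one common schedule shape, chosen here as an assumption; the schedule that `train.py` actually applies may differ, and the function name `warmup_multiplier` is invented for this example.

```python
def warmup_multiplier(step, warmup_steps):
    """Learning-rate multiplier for linear warmup followed by a constant rate."""
    if warmup_steps <= 0:
        return 1.0  # No warmup requested: use the full learning rate at once
    return min(1.0, step / warmup_steps)


# With 2000 warmup steps, the full learning rate is reached at step 2000.
for step in (0, 500, 1000, 2000, 5000):
    print(step, warmup_multiplier(step, warmup_steps=2000))
```

The effective learning rate at a given step is the configured learning rate times this multiplier, so a larger `--warmup_steps` value means a slower ramp-up.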

To modify these hyperparameters, you would adjust their values in the command used to run `train.py`.

For example, if you wanted to change the batch size to 64 and 128 for MMC4 and LAION datasets respectively, and train for 500 epochs with 2000 warmup steps, your command might look something like this:

```bash
torchrun --nnodes=1 --nproc_per_node=4 open_flamingo/train/train.py \
  --lm_path anas-awadalla/mpt-1b-redpajama-200b \
  --tokenizer_path anas-awadalla/mpt-1b-redpajama-200b \
  --cross_attn_every_n_layers 1 \
  --dataset_resampled \
  --batch_size_mmc4 64 \
  --batch_size_laion 128 \
  --train_num_samples_mmc4 125000 \
  --train_num_samples_laion 250000 \
  --loss_multiplier_laion 0.2 \
  --workers=4 \
  --run_name OpenFlamingo-3B-vitl-mpt1b \
  --num_epochs 500 \
  --warmup_steps 2000 \
  --mmc4_textsim_threshold 0.24 \
  --laion_shards "/path/to/shards/shard-{0000..0999}.tar" \
  --mmc4_shards "/path/to/shards/shard-{0000..0999}.tar" \
  --report_to_wandb
```

Remember to adjust the paths to the shard files and other parameters as needed for your specific setup.

As for using more data, improving data quality, changing model architecture, or using advanced techniques like ensembling or transfer learning, you would need to modify the `train.py` script and potentially other parts of the codebase. The specifics of how to do this would depend on your particular use case and the structure of the existing codebase.