## Introduction

This notebook walks through the following steps:
1. Build a KerasNLP `Backbone` model from the Hugging Face configuration (`config.json`). This creates a backbone with randomly initialized weights.
2. Get the KerasNLP `Preprocessor` from Kaggle. We need it later to run an end-to-end experiment.
3. Build a `CausalLM` from the backbone (randomly initialized weights) and the preprocessor.
4. Run generation with the `CausalLM` model.

## Setup and Imports

In [None]:
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"

In [None]:
import json
from huggingface_hub import hf_hub_download
from keras_nlp.models import (
    PaliGemmaBackbone,
    PaliGemmaCausalLMPreprocessor,
    PaliGemmaCausalLM,
)

## Download the `config.json` from HF Hub

In [None]:
model_id = "google/paligemma-3b-pt-224"
hf_config_file = hf_hub_download(model_id, "config.json")
with open(hf_config_file) as f:
    transformers_config = json.load(f)

In [None]:
text_config = transformers_config["text_config"]
vision_config = transformers_config["vision_config"]
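
Before building the backbone, it can help to confirm that the downloaded config actually contains the keys read in the next section. The following is a minimal sketch, not part of the original notebook: `missing_keys`, `TEXT_KEYS`, `VISION_KEYS`, and the sample dictionary are hypothetical names introduced for illustration, and the key lists are copied from the backbone cell.

```python
def missing_keys(config, required):
    """Return the required keys that are absent from `config`, in order."""
    return [key for key in required if key not in config]


# Keys the backbone cell reads from each sub-config.
TEXT_KEYS = [
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
    "hidden_size",
    "intermediate_size",
    "num_image_tokens",
]
VISION_KEYS = [
    "patch_size",
    "num_attention_heads",
    "hidden_size",
    "num_hidden_layers",
    "intermediate_size",
]

# Hypothetical, deliberately incomplete config used only to show the check.
sample_text_config = {"num_hidden_layers": 2, "hidden_size": 8}
print(missing_keys(sample_text_config, TEXT_KEYS))
```

In the notebook, the same function could be called as `missing_keys(text_config, TEXT_KEYS)` and `missing_keys(vision_config, VISION_KEYS)`; an empty list means every key is present.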

## Build the KerasNLP Backbone using the HF Configuration

Building the backbone means mapping between the Kaggle and Hugging Face configurations, because the two use different key names for the same settings. For the PaliGemma model, I referred to the [Kaggle Config](https://www.kaggle.com/models/google/paligemma?select=config.json) and the [Hugging Face Config](https://huggingface.co/google/paligemma-3b-pt-224/blob/main/config.json).

In [None]:
backbone = PaliGemmaBackbone(
    # image_token_index stands in for the vocabulary size in this mapping.
    vocabulary_size=transformers_config["image_token_index"],
    # Fall back to 224 when the vision config omits image_size.
    image_size=vision_config.get("image_size", 224),
    num_layers=text_config["num_hidden_layers"],
    num_query_heads=text_config["num_attention_heads"],
    num_key_value_heads=text_config["num_key_value_heads"],
    hidden_dim=text_config["hidden_size"],
    # The Keras argument is set to twice the HF intermediate_size.
    intermediate_dim=text_config["intermediate_size"] * 2,
    # No head_dim key is read from the HF config; num_image_tokens is used in its place.
    head_dim=text_config["num_image_tokens"],
    vit_patch_size=vision_config["patch_size"],
    vit_num_heads=vision_config["num_attention_heads"],
    vit_hidden_dim=vision_config["hidden_size"],
    vit_num_layers=vision_config["num_hidden_layers"],
    vit_intermediate_dim=vision_config["intermediate_size"],
)
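
Since the mapping above was worked out by comparing two configuration files by hand, a small comparison helper can make mismatches easier to spot. This is a sketch under assumptions: `diff_configs` is a hypothetical helper, and the two dictionaries below hold toy values rather than real PaliGemma settings. In practice one side might be the dictionary returned by `backbone.get_config()` and the other the settings listed in the Kaggle `config.json`.

```python
def diff_configs(built, reference):
    """Map each key that differs, or is missing on one side, to a (built, reference) pair."""
    sentinel = "<missing>"
    diffs = {}
    for key in sorted(set(built) | set(reference)):
        left = built.get(key, sentinel)
        right = reference.get(key, sentinel)
        if left != right:
            diffs[key] = (left, right)
    return diffs


# Toy values for illustration only; these are not the real PaliGemma settings.
built = {"num_layers": 2, "hidden_dim": 8, "head_dim": 4}
reference = {"num_layers": 2, "hidden_dim": 8, "head_dim": 2, "image_size": 224}
print(diff_configs(built, reference))
```

An empty result means the two dictionaries agree on every key. Values that are equal in meaning but differ in type (for example a list versus a tuple) would still be reported, so some normalization may be needed on real configs.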

## Get the Model Summary

In [None]:
backbone.summary()

## Build the CausalLM

Here we first fetch the Preprocessor from Kaggle, and then use the previously built Backbone and the Preprocessor to instantiate our CausalLM.

In [None]:
import os
from google.colab import userdata

os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")
os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
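
The cell above only works inside Colab, because it imports `google.colab`. As a rough sketch of a more portable approach, the helper below prefers values that are already in the environment and only falls back to a secret store when one is supplied. `kaggle_credentials`, `fake_env`, and `fake_secrets` are hypothetical names, and the credential values are placeholders.

```python
import os


def kaggle_credentials(secret_getter=None, env=None):
    """Return the two Kaggle credentials as a dict, or raise if one is missing.

    Values already present in `env` win; `secret_getter` is only consulted
    for names that `env` does not provide.
    """
    env = os.environ if env is None else env
    found = {}
    for name in ("KAGGLE_USERNAME", "KAGGLE_KEY"):
        value = env.get(name)
        if not value and secret_getter is not None:
            value = secret_getter(name)
        if not value:
            raise RuntimeError(f"{name} is not set")
        found[name] = value
    return found


# Placeholder stand-ins so the sketch runs anywhere.
fake_env = {"KAGGLE_USERNAME": "alice"}
fake_secrets = {"KAGGLE_KEY": "not-a-real-key"}
creds = kaggle_credentials(secret_getter=fake_secrets.get, env=fake_env)
print(sorted(creds))
```

In Colab, `userdata.get` could be passed as `secret_getter`, followed by `os.environ.update(creds)`; outside Colab, exporting the two variables in the shell before starting the notebook would be enough.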

In [None]:
processor = PaliGemmaCausalLMPreprocessor.from_preset(
    "pali_gemma_3b_224"
)

causal_lm = PaliGemmaCausalLM(
    preprocessor=processor,
    backbone=backbone,
)

Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_224/1/download/model.safetensors...
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_224/1/download/model.safetensors.index.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_224/1/download/metadata.json...
100%|██████████| 143/143 [00:00<00:00, 171kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_224/1/download/preprocessor.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_224/1/download/tokenizer.json...
100%|██████████| 410/410 [00:00<00:00, 394kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_224/1/download/assets/tokenizer/vocabulary.spm...
100%|██████████| 4.07M/4.07M [00:01<00:00, 2.84MB/s]


## Run an End-to-End Generation

In [None]:
import numpy as np

image = np.random.rand(224, 224, 3)
prompt = ["where is the cow standing?"]

In [None]:
outputs = causal_lm.generate({
    "images": image,
    "prompts": prompt,
})

You will notice that the outputs are gibberish. This is because the backbone has randomly initialized weights!

In [None]:
outputs

['where is the cow standing? forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested forested
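
One rough way to flag this kind of degenerate output programmatically is to measure how much of the continuation is taken up by a single repeated token. The sketch below is an illustration only: `top_token_share` is a hypothetical helper, it splits on whitespace rather than using the model's tokenizer, and the sample string is a shortened version of the output above.

```python
from collections import Counter


def top_token_share(text, prompt=""):
    """Fraction of whitespace-separated tokens after `prompt` taken by the most common one."""
    # Drop the prompt only when the text really starts with it.
    continuation = text[len(prompt):] if prompt and text.startswith(prompt) else text
    tokens = continuation.split()
    if not tokens:
        return 0.0
    return Counter(tokens).most_common(1)[0][1] / len(tokens)


sample = "where is the cow standing? forested forested forested forested"
print(top_token_share(sample, prompt="where is the cow standing?"))
```

A share close to 1.0 indicates that the model is repeating one token, which is what one would expect from a backbone that has not been trained.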