<h1 style="text-align: center;"><strong>Custom Encoder Decoder Model</strong> :</h1>

## We'll learn how to create a VisionEncoderDecoderModel using any Vision model as Encoder and any LLM as Decoder.

<a id="1"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(60, 121, 245) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 1. Importing Some Libraries / Dependencies </b></div>

In [35]:
!pip install jiwer

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [29]:
from dataclasses import dataclass
import warnings
import torch
from datasets import Dataset, DatasetDict
import numpy as np
import pandas as pd
from PIL import Image
import os
from jiwer import wer , cer
from transformers import (
    VisionEncoderDecoderModel,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    ViTImageProcessor,
    AutoTokenizer,
    default_data_collator
)
warnings.filterwarnings("ignore")

<a id="2"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(255, 217, 19) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 2. ⚙️ The "Too Cool for School" Config</b></div>

In [30]:
@dataclass
class Config:
    output_dir: str = "/kaggle/working/"
    encoder_checkpoint : str = "google/vit-base-patch16-224-in21k"  
    decoder_checkpoint : str = "gpt2"
    max_length: int = 512  
    early_stoping : str = 'never'
    no_n_gram : int = 3
    length_penalty : int = 2.0
    num_beams : int = 4
    per_device_train_batch_size: int = 2 
    per_device_eval_batch_size: int = 12
    saving_steps : int = 25
    logging_steps : int = 25
    eval_steps : int = 25
    lr: float = 2e-5
    report : str = 'none'
    startegy : str = 'steps'
    epoch : int = 1
    
cf = Config()
    

<a id="3"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(255, 7, 19) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 3. Initializing the Custom Encoder Decoder Model </b></div>

In [17]:
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    cf.encoder_checkpoint, cf.decoder_checkpoint ,trust_remote_code=True
)

feature_extractor = ViTImageProcessor.from_pretrained(cf.encoder_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(cf.decoder_checkpoint)

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.crossattention.c_attn.bias', 'h.0.crossattention.c_attn.weight', 'h.0.crossattention.c_proj.bias', 'h.0.crossattention.c_proj.weight', 'h.0.crossattention.q_attn.bias', 'h.0.crossattention.q_attn.weight', 'h.0.ln_cross_attn.bias', 'h.0.ln_cross_attn.weight', 'h.1.crossattention.c_attn.bias', 'h.1.crossattention.c_attn.weight', 'h.1.crossattention.c_proj.bias', 'h.1.crossattention.c_proj.weight', 'h.1.crossattention.q_attn.bias', 'h.1.crossattention.q_attn.weight', 'h.1.ln_cross_attn.bias', 'h.1.ln_cross_attn.weight', 'h.10.crossattention.c_attn.bias', 'h.10.crossattention.c_attn.weight', 'h.10.crossattention.c_proj.bias', 'h.10.crossattention.c_proj.weight', 'h.10.crossattention.q_attn.bias', 'h.10.crossattention.q_attn.weight', 'h.10.ln_cross_attn.bias', 'h.10.ln_cross_attn.weight', 'h.11.crossattention.c_attn.bias', 'h.11.crossattention.c_attn.weight', 'h.11.crossat

# <div style="font-size:32px; font-family:consolas; text-align:center; color:rgb(255, 0, 0);">
# <b> Set special <span style="color:rgb(0, 255, 0);">Tokens</span> for the <span style="color:rgb(0, 0, 255);">Encoder Decoder Model</span> </b>
### > eos
### > bos
### > pad

In [31]:
model.config.decoder_start_token_id = tokenizer.bos_token_id
tokenizer.pad_token = tokenizer.eos_token  
model.config.pad_token_id = tokenizer.pad_token_id

# <div style="font-size:32px; font-family:consolas; text-align:center; color:rgb(255, 70, 80);">
# <b> Make sure <span style="color:rgb(255, 30, 20);">vocab size</span> is set <span style="color:rgb(3, 3, 255);">correctly</span> </b>

In [6]:
model.config.vocab_size = model.config.decoder.vocab_size

# <div style="font-size:32px; font-family:consolas; text-align:center; color:rgb(255, 70, 80);">
# <b> Set <span style="color:rgb(255, 30, 20);">Beam Search </span> parameters <span style="color:rgb(3, 3, 255);"></span> </b>

In [7]:
model.config.eos_token_id = tokenizer.sep_token_id
model.config.max_length = cf.max_length
model.config.early_stopping = cf.early_stoping
model.config.no_repeat_ngram_size = cf.no_n_gram
model.config.length_penalty = cf.length_penalty
model.config.num_beams = cf.num_beams
model.decoder.resize_token_embeddings(len(tokenizer))

Embedding(50257, 768)

<a id="4"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(255, 0, 19) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 4.MODEL ARCHITECTURE</b></div>

In [8]:
model

VisionEncoderDecoderModel(
  (encoder): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): ViTPatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ViTLayer(
          (attention): ViTSdpaAttention(
            (attention): ViTSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(i

![02szHZu.png](https://imgur.com/02szHZu.png)

# <div style="font-size:32px; font-family:consolas; text-align:center; color:rgb(255, 70, 80);">
# <b>Verify <span style="color:rgb(255, 30, 20);"></span><span style="color:rgb(3, 3, 255);">Configurtions</span> </b>

In [9]:
print(f"Decoder Start Token ID: {model.config.decoder_start_token_id}")
print(f"Pad Token ID: {model.config.pad_token_id}")
print(f"Vocabulary Size: {model.config.vocab_size}")

#Very Important Step otherwise you'll see errors during training { adjust this as your choice}

Decoder Start Token ID: 50256
Pad Token ID: 50256
Vocabulary Size: 50257


In [10]:
max_target_length = cf.max_length

<a id="5"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(31, 193, 27) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 5. Loading Files </b></div>

In [11]:
csv_file = "/kaggle/input/ai-god/train.csv"
df = pd.read_csv(csv_file)

df = df.head(5000)  #------> only using first 5k data , you can use as much as you can ^_^

In [12]:
image_folder = "/kaggle/input/ai-god-2/train_images/train_images"

<a id="6"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(82, 15, 70) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 6. A Function to load images using the ID </b></div>

In [13]:
def load_images(image_id):
    image_path = os.path.join(image_folder, f"{image_id}.png")  #----> Adjust the file extension if needed
    image = Image.open(image_path).convert("RGB")
    return feature_extractor(images=image, return_tensors="pt").pixel_values.squeeze(0)

# <div style="font-size:32px; font-family:consolas; text-align:center; color:rgb(255, 70, 80);">
# <b>Converting the <span style="color:rgb(255, 30, 20);"></span> DataFrame into a <span style="color:rgb(3, 3, 255);">Dataset</span> </b>

In [32]:
dataset = Dataset.from_pandas(df)

<a id="7"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(82, 15, 70) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 7. A Function to load images and add them to the dataset</b></div>

In [15]:
def add_image(example):
    example['pixel_values'] = load_images(example['unique Id'])
    return example

In [33]:
dataset = dataset.map(add_image)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

# <div style="font-size:32px; font-family:consolas; text-align:center; color:rgb(255, 70, 80);">
# <b>Tokenizing the <span style="color:rgb(255, 30, 20);">Labels </span> and add them to the <span style="color:rgb(3, 3, 255);">Dataset</span> </b>

In [19]:
def tokenize_labels(example):
    example['labels'] = tokenizer(example['transcription'],
                                  return_tensors='pt',
                                  truncation=True,
                                  padding="max_length", 
                                  max_length=max_target_length).input_ids.squeeze(0)
    return example

In [34]:
dataset = dataset.map(tokenize_labels)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [23]:
dataset = dataset.remove_columns(["unique Id", "transcription"]) #------> Removing the unnecessary columns

# <div style="font-size:32px; font-family:consolas; text-align:center; color:rgb(255, 70, 80);">
# <b>Spliting the <span style="color:rgb(255, 30, 20);">Dataset </span> into the <span style="color:rgb(3, 3, 255);">train , test , valid sets</span> </b>

In [24]:
train_testvalid = dataset.train_test_split(0.1)
test_valid = train_testvalid['test'].train_test_split(0.5)

In [25]:
train_test_valid_dataset = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']
})

print(train_test_valid_dataset)  

DatasetDict({
    train: Dataset({
        features: ['pixel_values', 'labels'],
        num_rows: 4500
    })
    test: Dataset({
        features: ['pixel_values', 'labels'],
        num_rows: 250
    })
    valid: Dataset({
        features: ['pixel_values', 'labels'],
        num_rows: 250
    })
})


<a id="8"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(200, 13, 12) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 8. Custom compute metrices ( WER , CER )</b></div>

In [26]:
def custom_compute_metrics(pred):
    preds = pred.predictions
    labels = pred.label_ids

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    wer_score = wer(decoded_labels, decoded_preds)  #---->WER

    cer_score = cer(decoded_labels, decoded_preds)  #---->CER

    return {
        'wer': wer_score,
        'cer': cer_score,
    }

In [27]:
for param in model.encoder.parameters():
    param.requires_grad = False

<a id="9"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(20, 13, 121) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> 9. Training Arguments / Trainer </b></div>

In [28]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    eval_strategy= cf.startegy,
    per_device_train_batch_size=cf.per_device_train_batch_size,
    per_device_eval_batch_size=cf.per_device_eval_batch_size,
    overwrite_output_dir=True,
    fp16= True,
    load_best_model_at_end=True,
    output_dir=cf.output_dir,
    logging_steps=cf.logging_steps,
    save_steps=cf.saving_steps,
    eval_steps=cf.eval_steps,
    report_to =cf.report,
    learning_rate = cf.lr,
    num_train_epochs = cf.epoch
)

In [29]:
trainer = Seq2SeqTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=train_test_valid_dataset['train'],
        eval_dataset=train_test_valid_dataset['valid'],
        data_collator=default_data_collator,
        compute_metrics=custom_compute_metrics,
        )

In [None]:
trainer.train()

Step,Training Loss,Validation Loss,Wer,Cer
25,4.7689,0.24776,1.032375,1.030173
50,0.2395,0.227945,1.0,1.0
75,0.2319,0.220672,1.032012,1.025669
100,0.223,0.214311,1.046199,0.754969
125,0.2139,0.210121,1.644234,0.822251
150,0.2229,0.206539,1.28956,0.739311


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask an

<a id="BONUS"></a>
# <div style="box-shadow: rgba(0, 0, 0, 0.16) 0px 1px 4px inset, rgb(20, 13, 121) 0px 0px 0px 3px inset; padding:20px; font-size:32px; font-family: consolas; text-align:center; display:fill; border-radius:15px;  color:rgb(34, 34, 34);"> <b> BONUS : INFERENCE</b></div>

In [None]:
from transformers import VisionEncoderDecoderModel, AutoTokenizer ,ViTImageProcessor

In [None]:
model = VisionEncoderDecoderModel.from_pretrained(saved_model_checkpoint)

feature_extractor = ViTImageProcessor.from_pretrained(saved_model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(saved_model_checkpoint)

In [None]:
model.eval()

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

<div style="box-shadow: rgba(240, 46, 170, 0.4) -5px 5px inset, rgba(240, 46, 170, 0.3) -10px 10px inset, rgba(240, 46, 17, 0.2) -15px 15px inset, rgba(24, 46, 170, 0.1) -20px 20px inset, rgba(240, 46, 170, 0.05) -25px 25px inset; padding:20px; font-size:30px; font-family: consolas; display:fill; border-radius:15px; color: rgba(240, 0, 170, 0.7)"> <b> 💻 Thank You!</b></div>

<p style="font-family:verdana; color:rgb(34, 34, 34); font-family: consolas; font-size: 16px;"> If you enjoy this Custom Vision Encoder Decoder Model,upvote this notebook. Happy coding!🚀💻🌟. <br>
    </p>

![6np2LHr.gif](https://imgur.com/6np2LHr.gif)