<a href="https://colab.research.google.com/github/aimedvedeva/APITradersAnalytics/blob/main/fine_tune_BLIP_experiment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune BLIP using Hugging Face `transformers` and `datasets` 🤗

This tutorial is largely based on the [GiT tutorial](https://colab.research.google.com/drive/1HLxgrG7xZJ9FvXckNG61J72FkyrbqKAA?usp=sharing) on how to fine-tune GiT on a custom image captioning dataset.

## Set-up environment

In [None]:
!pip install git+https://github.com/huggingface/transformers.git@main

## Create PyTorch Dataset

The lines below are entirely copied from the original notebook!

In [None]:
import pandas as pd
df =  pd.read_csv('/content/drive/MyDrive/IndustrialML/train_data/dataset_with_short_descriptions_140_photos.csv')

In [None]:
descriptions = df['description'].tolist()
image_paths = df['image_path'].tolist()
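
Before building the dataset, it can help to confirm that every row points at an existing file and has a non-empty caption. The following is a minimal sketch under that assumption; `filter_valid_pairs` is a hypothetical helper and is not part of the original notebook.

```python
import os

def filter_valid_pairs(image_paths, descriptions):
    """Keep only (path, caption) pairs whose file exists and whose caption is non-empty."""
    kept_paths, kept_texts = [], []
    for path, text in zip(image_paths, descriptions):
        # Captions read from a CSV may be NaN (a float) or whitespace-only strings
        if not isinstance(text, str) or not text.strip():
            continue
        # Skip rows whose image file is missing on disk
        if not os.path.isfile(path):
            continue
        kept_paths.append(path)
        kept_texts.append(text.strip())
    return kept_paths, kept_texts
```

It could be applied as `image_paths, descriptions = filter_valid_pairs(image_paths, descriptions)` before the dataset is created.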

In [None]:
from torch.utils.data import Dataset, DataLoader
from PIL import Image

class ImageCaptioningDataset(Dataset):
    def __init__(self, list_image_path, list_txt, processor):
        # Store image paths and captions; both are processed lazily in __getitem__
        self.image_path = list_image_path
        self.processor = processor
        self.texts = list_txt

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Load the image and convert it to RGB so grayscale or RGBA files get three channels
        image = Image.open(self.image_path[idx]).convert("RGB")
        text = self.texts[idx]
        # The BLIP processor preprocesses the image and tokenizes the caption
        encoding = self.processor(image, text, padding="max_length", return_tensors="pt")
        # remove the batch dimension added by return_tensors="pt"
        encoding = {k: v.squeeze(0) for k, v in encoding.items()}
        return encoding



## Load model and processor

In [None]:
from transformers import AutoProcessor, BlipForConditionalGeneration

processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

Now that we have loaded the processor, let's load the dataset and the dataloader:

In [None]:
train_dataset = ImageCaptioningDataset(image_paths, descriptions, processor)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=2)

In [None]:
next(iter(train_dataloader))

## Train the model

Let's train the model! Simply run the cell below.

In [None]:
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

model.train()

for epoch in range(50):
  print("Epoch:", epoch)
  for idx, batch in enumerate(train_dataloader):
    input_ids = batch.pop("input_ids").to(device)
    pixel_values = batch.pop("pixel_values").to(device)

    outputs = model(input_ids=input_ids,
                    pixel_values=pixel_values,
                    labels=input_ids)

    loss = outputs.loss

    print("Loss:", loss.item())

    loss.backward()

    optimizer.step()
    optimizer.zero_grad()

Epoch: 0
Loss: 12.905766487121582
Loss: 10.327374458312988
Loss: 9.581689834594727
Loss: 9.386303901672363
Loss: 9.157736778259277
Loss: 8.90240478515625
Loss: 8.773415565490723
Loss: 8.620710372924805
Loss: 8.421791076660156
Loss: 8.287577629089355
Loss: 8.18593978881836
Loss: 8.16653060913086
Loss: 7.994688987731934
Loss: 7.918001651763916
Loss: 7.717723369598389
Loss: 7.639864444732666
Loss: 7.525322437286377
Loss: 7.431062698364258
Loss: 7.2976202964782715
Loss: 7.181963920593262
Loss: 7.068962097167969
Loss: 7.004933834075928
Loss: 6.851524353027344
Loss: 6.7672038078308105
Loss: 6.679562091827393
Loss: 6.506647109985352
Loss: 6.388906955718994
Loss: 6.263095855712891
Loss: 6.145172119140625
Loss: 6.069498538970947
Loss: 5.8921589851379395
Loss: 5.847213268280029
Loss: 5.688695430755615
Loss: 5.587106704711914
Loss: 5.404354095458984
Loss: 5.3541483879089355
Loss: 5.224182605743408
Loss: 5.1114091873168945
Loss: 4.968290328979492
Loss: 4.889471054077148
Loss: 4.766401290893555
Los
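
Printing the loss at every step produces long logs like the one above. One option is to accumulate the step losses and report a single mean per epoch. The sketch below assumes that approach; `RunningMean` is a hypothetical helper, and the three values are simply the first losses printed above.

```python
class RunningMean:
    """Accumulate scalar values and report their mean."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += float(value)
        self.count += 1

    @property
    def mean(self):
        # Avoid dividing by zero before the first update
        return self.total / self.count if self.count else 0.0

# Illustrative use with the first three losses from the log above
meter = RunningMean()
for step_loss in [12.905766487121582, 10.327374458312988, 9.581689834594727]:
    meter.update(step_loss)
print(f"mean loss: {meter.mean:.4f}")
```

Inside the training loop, `meter.update(loss.item())` would replace the per-step `print`, with one `print` of `meter.mean` at the end of each epoch.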

## Save model to the Hugging Face Hub

In [None]:
import locale
print(locale.getpreferredencoding())

ANSI_X3.4-1968


In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
!pip install huggingface_hub



In [None]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the '

In [None]:
# huggingface api key (write)
# (token redacted; access tokens should not be stored in a notebook cell)
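
Rather than keeping a token in a notebook cell, it can be read from the environment at run time. The sketch below assumes the token has been exported as `HF_TOKEN`; `read_hf_token` is a hypothetical helper, not something the original notebook defines.

```python
import os

def read_hf_token(env=None, var_name="HF_TOKEN"):
    """Return the token stored in an environment variable, or raise if it is missing."""
    # Allow a plain dict to be passed in for testing; default to the real environment
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} before pushing to the Hub")
    return token
```

The returned string could then be passed to `huggingface_hub.login(token=...)` instead of being typed or pasted into a cell.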

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

In [None]:
model.push_to_hub('blip-image-captioning-base-fashionimages-finetuned')
processor.push_to_hub('blip-image-captioning-base-fashionimages-finetuned-processor')

CommitInfo(commit_url='https://huggingface.co/alesanm/blip-image-captioning-base-fashionimages-finetuned-processor/commit/2c62fbd915a880d7bac3cb78d3734c39a2caba7a', commit_message='Upload processor', commit_description='', oid='2c62fbd915a880d7bac3cb78d3734c39a2caba7a', pr_url=None, pr_revision=None, pr_num=None)

## Load model from the Hugging Face Hub and run inference

In [None]:
from transformers import AutoProcessor, BlipForConditionalGeneration

processor1 = AutoProcessor.from_pretrained("alesanm/blip-image-captioning-base-fashionimages-finetuned-processor")
model1 = BlipForConditionalGeneration.from_pretrained("alesanm/blip-image-captioning-base-fashionimages-finetuned")

In [None]:
from PIL import Image

example_img = Image.open('/content/drive/MyDrive/IndustrialML/chanel/chanel_17.jpg')

In [None]:
# move the reloaded model to the same device as the inputs
model1 = model1.to(device)

In [None]:
inputs = processor1(images=example_img, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

generated_ids = model1.generate(pixel_values=pixel_values, max_length=50)
generated_caption = processor1.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_caption)

clothes : formal, eveningwear colors : metallic, ivory occasion : gala, special event details : sequins, beading trends : glamorous, sparkly figures : slender, petite demographic : women, 18 - 35


## Inference

Let's check the results on an image from our training dataset.

In [None]:
# load image
example = Image.open('/content/drive/MyDrive/IndustrialML/chanel/chanel_17.jpg')

In [None]:
# prepare image for the model
inputs = processor(images=example, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_caption)

clothes : formal, eveningwear colors : metallic, ivory occasion : gala, special event details : sequins, beading trends : glamorous, sparkly figures : slender, petite demographic : women, 18 - 35
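
The check above uses an image the model was trained on, so it mostly shows that the caption was memorized. To get a sense of generalization, some rows could be held out before training. The sketch below assumes a simple random split; `split_indices` is a hypothetical helper, and the count of 140 is taken from the CSV filename purely for illustration.

```python
import random

def split_indices(n_items, holdout_fraction=0.2, seed=0):
    """Shuffle indices 0..n_items-1 and split them into (train, holdout) lists."""
    indices = list(range(n_items))
    # A seeded generator keeps the split reproducible across runs
    random.Random(seed).shuffle(indices)
    # Hold out at least one item whenever there is more than one item in total
    n_holdout = max(1, int(n_items * holdout_fraction)) if n_items > 1 else 0
    return indices[n_holdout:], indices[:n_holdout]

train_idx, holdout_idx = split_indices(140)
print(len(train_idx), len(holdout_idx))  # 112 28
```

The two index lists could then select entries from `image_paths` and `descriptions`, with only the training part passed to `ImageCaptioningDataset`.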


## Add evaluation

In [None]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model_bert = BertModel.from_pretrained("bert-base-uncased")

In [None]:
def get_sentence_embeddings(sentences, tokenizer, model):
    # Tokenize the sentences and convert them to tensors
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Get the sentence embeddings from the model passed in as an argument
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        # Average pooling over all token positions
        embeddings = outputs.last_hidden_state.mean(dim=1)

    return embeddings
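
Note that `mean(dim=1)` averages over every token position, which includes padding positions when several sentences of different lengths are embedded in one batch. A plain-Python sketch of mask-aware mean pooling for a single sentence follows; the function name and the 2-dimensional token vectors are made up for illustration.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0 (padding)."""
    dim = len(token_vectors[0])
    sums = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                sums[i] += value
    # Guard against a row that consists only of padding
    return [s / kept for s in sums] if kept else sums

tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last position is padding
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

With one sentence per call, as in this notebook, no padding is added and the plain mean gives the same result.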

In [None]:
model.to(device)

model.eval()

test_img = Image.open(image_paths[0])

# prepare image for the model
inputs = processor(images=test_img, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

In [None]:
print(generated_caption)

a model walks the runway at chan chan


In [None]:
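# Overwrite the generated caption with a fixed string; the similarity below is
# computed on this text, not on the model output printed above
generated_caption = 'Clothes:: formal dresses, evening gowns ; style : elegant, sophisticated ; colors : black, white, gold ; occasion : formal dinner, wedding, : sequins, lace ; trends : metallic fabrics, evening gowns ; body types : slim'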


In [None]:
test_caption = descriptions[0]

In [None]:
# Get embeddings for generated and test sentences
generated_embeddings = get_sentence_embeddings(generated_caption, tokenizer, model_bert)
test_embeddings = get_sentence_embeddings(test_caption, tokenizer, model_bert)

# Calculate similarity between each generated sentence and each test sentence
similarity_scores = cosine_similarity(generated_embeddings, test_embeddings)

In [None]:
similarity_scores

array([[0.9718214]], dtype=float32)
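
For reference, the score is the dot product of the two embedding vectors divided by the product of their norms. A small plain-Python sketch of that computation, with toy vectors rather than real BERT embeddings (`cosine` is an illustrative name):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```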