# **1- KAGGLE API + DATASET**
I USED THIS SECTION TO SET UP THE KAGGLE API AND DOWNLOAD THE DATASET IN GOOGLE COLAB.

In [None]:
# ADD THE DATASET HERE
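
The cell above is left empty in the notebook. As a minimal sketch of how it is commonly filled in, the helper below writes a Kaggle API token to the folder where the Kaggle CLI looks for it by default (`~/.kaggle/kaggle.json`). The helper name `install_kaggle_token`, the credential placeholders, the `/content/data` target folder, and the `<owner>/<dataset-slug>` placeholder are assumptions for illustration; the actual dataset is not named in this notebook.

```python
import json
import os
import stat
from pathlib import Path


def install_kaggle_token(username, key, config_dir=None):
    """Write kaggle.json into the Kaggle config folder (hypothetical helper)."""
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    # Restrict the file to the current user; the Kaggle CLI warns otherwise.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Assumed usage in a Colab cell (replace every placeholder with your own values):
# install_kaggle_token("<kaggle-username>", "<kaggle-api-key>")
# !pip install kaggle
# !kaggle datasets download -d <owner>/<dataset-slug> -p /content/data --unzip
```

Keeping the token out of the notebook itself (for example by uploading `kaggle.json` at runtime) avoids committing credentials together with the code.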

# **2- LIBRARIES AND GOOGLE DRIVE**

In [None]:
# PyTorch stack from the CUDA 12.8 wheel index
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
# Model, training and data libraries
!pip install timm transformers accelerate datasets
!pip install pandas numpy pillow tqdm scikit-learn nltk

# CIDEr metric: the code below imports it from the pycocoevalcap pip package;
# the cloned repository is only kept as a reference copy.
!git clone https://github.com/tylin/coco-caption.git
!pip install pycocoevalcap

from google.colab import drive
drive.mount('/content/drive')

# **3- INITIAL TRAINING**
THE CODE HERE IS THE STARTING CODE FOR MODEL TRAINING. I DID THE INITIAL TRAINING HERE, AND AFTER 10 EPOCHS I SWITCHED TO THE CONTINUATION CODE.

In [None]:
import os
import pandas as pd
import re
from PIL import Image
from tqdm import tqdm
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration, get_cosine_schedule_with_warmup
from torch import optim
from torch.nn import CrossEntropyLoss
import torchvision.transforms as T
from sklearn.model_selection import train_test_split
from torch.amp import GradScaler, autocast  # torch.cuda.amp.GradScaler is deprecated
from pycocoevalcap.cider.cider import Cider


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def clean_caption(caption):
    caption = re.sub(r'[^\x00-\x7F]+', ' ', caption)
    caption = re.sub(r'\s+', ' ', caption).strip()
    return caption


class CaptionDataset(Dataset):
    def __init__(self, dataframe, image_folder, processor):
        self.df = dataframe.reset_index(drop=True)
        self.image_folder = image_folder
        self.processor = processor
        self.transform = T.Compose([
            T.Resize((384, 384)),
            T.ColorJitter(brightness=0.05, contrast=0.05),
            T.RandomHorizontalFlip(p=0.2),
        ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # Cast to str so the DataLoader collates IDs into a list usable as dict keys for CIDEr
        image_id = str(row['image_id'])
        caption = clean_caption(row['caption'])
        image_path = os.path.join(self.image_folder, f"{image_id}.jpg")
        image = Image.open(image_path).convert('RGB')
        image = self.transform(image)

        inputs = self.processor(images=image, text=caption, return_tensors="pt", padding="max_length", truncation=True, max_length=64)

        return {
            'pixel_values': inputs['pixel_values'].squeeze(0),
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0),
            'caption': caption,
            'image_id': image_id
        }


def compute_cider_score(preds, gts):
    scorer = Cider()
    score, _ = scorer.compute_score(gts, preds)
    return score

csv_path = ""  # TODO: path of the CSV file that lists the image IDs used for training
image_folder = ""  # TODO: path of the training images folder
model_save_path = "/content/drive/MyDrive/blip_finetuned"

df = pd.read_csv(csv_path)
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

model_name = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name).to(device)

train_dataset = CaptionDataset(train_df, image_folder, processor)
val_dataset = CaptionDataset(val_df, image_folder, processor)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

num_epochs = 30
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=500, num_training_steps=num_training_steps)
loss_fn = CrossEntropyLoss(label_smoothing=0.1)
scaler = GradScaler()

print("✅ Training started from scratch...")
for epoch in range(num_epochs):
    print(f"\n📆 Epoch {epoch+1}/{num_epochs}")
    model.train()
    total_loss = 0
    steps = 0
    loop = tqdm(train_loader, desc="Train", leave=False)

    for batch in loop:
        pixel_values = batch['pixel_values'].to(device)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        decoder_input_ids = input_ids[:, :-1]
        labels = input_ids[:, 1:].detach().clone()
        labels[labels == processor.tokenizer.pad_token_id] = -100

        with autocast(device_type='cuda'):
            outputs = model(pixel_values=pixel_values, input_ids=decoder_input_ids, attention_mask=attention_mask[:, :-1])
            loss = loss_fn(outputs.logits.view(-1, outputs.logits.size(-1)), labels.view(-1))

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        lr_scheduler.step()

        total_loss += loss.item()
        steps += 1
        loop.set_postfix(loss=loss.item())

    avg_loss = total_loss / steps
    print(f"📉 Avg Train Loss: {avg_loss:.4f}")

    model.eval()
    preds, gts = {}, {}
    with torch.no_grad():
        for batch in tqdm(val_loader, desc="Validation", leave=False):
            pixel_values = batch['pixel_values'].to(device)
            image_ids = batch['image_id']
            captions = batch['caption']

            generated_ids = model.generate(
                pixel_values=pixel_values,
                max_length=64,
                num_beams=4,
                decoder_start_token_id=processor.tokenizer.cls_token_id
            )
            generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

            for img_id, gt, pred in zip(image_ids, captions, generated_texts):
                preds[img_id] = [pred.strip()]
                gts[img_id] = [gt.strip()]

    cider = compute_cider_score(preds, gts)
    print(f"🍏 Validation CIDEr: {cider:.4f}")

    epoch_save_path = os.path.join(model_save_path, f"epoch_{epoch+1}")
    os.makedirs(epoch_save_path, exist_ok=True)
    model.save_pretrained(epoch_save_path)
    processor.save_pretrained(epoch_save_path)

print("\n🏁 Training finished.")
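
The training cell above relies on `get_cosine_schedule_with_warmup` with 500 warmup steps. The sketch below reproduces, in plain Python, the multiplier that this kind of schedule applies to the base learning rate: linear warmup followed by a half-cosine decay. It is an illustration of the formula as I understand it, not the library's source; the function name `cosine_warmup_factor` and the figure of 1000 batches per epoch are assumptions.

```python
import math


def cosine_warmup_factor(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """LR multiplier at a given optimizer step: linear warmup, then cosine decay."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


base_lr = 1e-5       # same value as the optimizer in the cell above
warmup = 500         # same value as num_warmup_steps in the cell above
total = 30 * 1000    # ASSUMPTION: 1000 batches per epoch, for illustration only

# Print the effective learning rate at a few points of the schedule.
for step in (0, 250, 500, total // 2, total):
    print(step, base_lr * cosine_warmup_factor(step, warmup, total))
```

Because the schedule is sized for all 30 epochs, stopping after 10 epochs leaves the learning rate on the early part of the decay curve, at roughly three quarters of its peak when the warmup is short compared with the total number of steps.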


# **4- CONTINUE TRAINING CODE**
THE TRAINING OF THE MODEL THAT WAS TRAINED ABOVE AND SAVED TO DRIVE WAS CONTINUED HERE. THE FOCUS WAS ON SAMPLES WITH LOW CIDER SCORES, AND THE QUALITY IMPROVED. AFTER 6 EPOCHS STARTING AT LR=1E-5, I CHANGED THE LR TO THE VALUE BELOW AND TRAINED FOR 4 MORE EPOCHS. IN THIS WAY I TRIED TO TRAIN PROPERLY WITHOUT OVERFITTING.
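
The idea of focusing on low-scoring samples can be sketched as a simple filter over per-image scores. The code cell that follows uses an exact-match test between prediction and reference as its selection rule; the variant below instead assumes a list of per-image scores is available (as far as I know, `Cider.compute_score` returns such an array as its second value, which the notebook discards as `_`). The function name `select_low_score_samples`, the threshold, and all sample values are made up for illustration, and the scores are assumed to be aligned with `image_ids`.

```python
def select_low_score_samples(image_ids, references, scores, threshold):
    """Return training rows for samples whose per-image score is below threshold."""
    hard = []
    for image_id, caption, score in zip(image_ids, references, scores):
        if score < threshold:
            hard.append({"image_id": image_id, "caption": caption})
    return hard


# Made-up values for illustration only.
image_ids = ["img_a", "img_b", "img_c"]
references = ["a dog runs on grass", "two people ride bikes", "a red balloon in the sky"]
scores = [1.8, 0.2, 0.6]

hard_samples = select_low_score_samples(image_ids, references, scores, threshold=0.7)
print(hard_samples)
```

A score threshold keeps near-miss captions out of the hard set, whereas an exact-match rule marks almost every sample as hard, since generated captions rarely reproduce the reference word for word.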

In [None]:
import os
import pandas as pd
import re
import random
from PIL import Image
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration, get_cosine_schedule_with_warmup
import torch
from torch import optim
from torch.nn import CrossEntropyLoss
import torchvision.transforms as T
from sklearn.model_selection import train_test_split
from torch.cuda.amp import GradScaler, autocast
from pycocoevalcap.cider.cider import Cider

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def clean_caption(caption):
    caption = re.sub(r'[^\x00-\x7F]+', ' ', caption)
    caption = re.sub(r'\s+', ' ', caption).strip()
    return caption

def augment_caption(caption):
    # Simple synonym replacement (whole-word, case-sensitive lookup)
    synonyms = {
        "man": "guy", "woman": "lady", "dog": "puppy",
        "cat": "kitten", "bike": "bicycle", "car": "vehicle"
    }
    words = caption.split()
    new_caption = [synonyms.get(word, word) for word in words]
    return ' '.join(new_caption)

class CaptionDataset(Dataset):
    def __init__(self, dataframe, image_folder, processor, augment=False):
        self.df = dataframe.reset_index(drop=True)
        self.image_folder = image_folder
        self.processor = processor
        self.augment = augment
        self.transform = T.Compose([
            T.Resize((384, 384)),
            T.RandomApply([T.RandomRotation(10)], p=0.3),
            T.RandomApply([T.RandomCrop(384, pad_if_needed=True)], p=0.3),
            T.ColorJitter(brightness=0.1, contrast=0.1),
            T.RandomHorizontalFlip(p=0.5),
        ])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        image_id = str(self.df.iloc[idx]['image_id'])
        caption = clean_caption(self.df.iloc[idx]['caption'])
        if self.augment:
            caption = augment_caption(caption)
        image_path = os.path.join(self.image_folder, str(image_id) + ".jpg")
        image = Image.open(image_path).convert('RGB')
        image = self.transform(image)

        inputs = self.processor(
            images=image,
            text=caption,
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=64
        )

        return {
            'pixel_values': inputs['pixel_values'].squeeze(0),
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0),
            'caption': caption,
            'image_id': image_id
        }

def compute_cider_score(preds, gts):
    scorer = Cider()
    score, _ = scorer.compute_score(gts, preds)
    return score

csv_path = ""  # TODO: path of the train.csv file
image_folder = ""  # TODO: path of the training images folder
df = pd.read_csv(csv_path)
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

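# NOTE: the initial-training cell saves each checkpoint under ".../blip_finetuned/epoch_N".
# from_pretrained() needs the model files directly inside model_path, so copy the
# chosen checkpoint here (or point model_path at that epoch folder).
model_path = "/content/drive/MyDrive/blip_finetuned"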
processor = BlipProcessor.from_pretrained(model_path)
model = BlipForConditionalGeneration.from_pretrained(model_path).to(device)

num_epochs = 30
optimizer = optim.AdamW(model.parameters(), lr=5e-6)
loss_fn = CrossEntropyLoss(label_smoothing=0.1)
scaler = GradScaler()
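# Ceiling division so the last partial batch is counted (DataLoader keeps it by default).
# NOTE: train_df grows below as hard samples are appended, so the real step count
# can exceed this estimate.
num_training_steps = num_epochs * -(-len(train_df) // 16)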
lr_scheduler = get_cosine_schedule_with_warmup(optimizer, 500, num_training_steps)

for epoch in range(num_epochs):
    print(f"\n📆 Epoch {epoch + 1} / {num_epochs}")
    model.train()
    train_dataset = CaptionDataset(train_df, image_folder, processor, augment=True)
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    total_loss, steps = 0, 0
    loop = tqdm(train_loader, desc="Train", leave=False)

    for batch in loop:
        pixel_values = batch['pixel_values'].to(device)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

        decoder_input_ids = input_ids[:, :-1]
        labels = input_ids[:, 1:].clone()
        labels[labels == processor.tokenizer.pad_token_id] = -100

        with autocast():
            outputs = model(
                pixel_values=pixel_values,
                input_ids=decoder_input_ids,
                attention_mask=attention_mask[:, :-1]
            )
            logits = outputs.logits
            loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))

        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        lr_scheduler.step()

        total_loss += loss.item()
        steps += 1
        loop.set_postfix(loss=loss.item())

    avg_loss = total_loss / steps
    print(f"📉 Epoch {epoch+1} - Avg Train Loss: {avg_loss:.4f}")

    model.eval()
    preds, gts, bad_samples = {}, {}, []
    val_loader = DataLoader(CaptionDataset(val_df, image_folder, processor), batch_size=32)

    with torch.no_grad():
        for batch in tqdm(val_loader, desc="Validation", leave=False):
            pixel_values = batch['pixel_values'].to(device)
            image_ids = batch['image_id']
            captions = batch['caption']

            generated_ids = model.generate(pixel_values=pixel_values, max_length=64, num_beams=5)
            generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

            for img_id, gt, pred in zip(image_ids, captions, generated_texts):
                preds[img_id] = [pred.strip()]
                gts[img_id] = [gt.strip()]
                if pred.strip().lower() != gt.strip().lower():
                    bad_samples.append({'image_id': img_id, 'caption': gt})

    cider_score = compute_cider_score(preds, gts)
    print(f"🍏 Validation CIDEr: {cider_score:.4f}")

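    # Append every validation sample whose prediction was not an exact match so the
    # next epoch sees it again. NOTE: these rows come from val_df, so validation
    # CIDEr in later epochs is measured partly on data the model has trained on.
    if bad_samples:
        train_df = pd.concat([train_df, pd.DataFrame(bad_samples)], ignore_index=True)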

    model.save_pretrained(model_path)
    processor.save_pretrained(model_path)

print("\n🏁 Training finished.")


# **5- PREDICTION CODE**
THE CODE HERE IS USED TO GENERATE PREDICTIONS FOR THE IMAGES IN THE TEST FOLDER.

In [None]:
import os
import pandas as pd
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# File paths
test_csv_path = ""  # TODO: path of the CSV file that lists the image IDs used for testing
test_image_folder = ""  # TODO: path of the test images folder
model_path = '/content/drive/MyDrive/blip_finetuned'

test_df = pd.read_csv(test_csv_path)

processor = BlipProcessor.from_pretrained(model_path)
model = BlipForConditionalGeneration.from_pretrained(model_path).to(device)
model.eval()

results = []

for idx, row in tqdm(test_df.iterrows(), total=len(test_df), desc="Generating Captions"):
    image_id = row['image_id']
    image_path = os.path.join(test_image_folder, f"{image_id}.jpg")

    # Load the image and force 3-channel RGB
    image = Image.open(image_path).convert('RGB')

    # Convert the image into model inputs
    inputs = processor(images=image, return_tensors="pt").to(device)

    # Beam search without gradient tracking
    with torch.no_grad():
        generated_ids = model.generate(
            pixel_values=inputs['pixel_values'],
            max_length=64,
            num_beams=5,
            repetition_penalty=2.0,
            length_penalty=1.0,
            early_stopping=True
        )

    # Decode the returned sequence into text
    caption = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True).strip()

    results.append({
        "image_id": image_id,
        "caption": caption
    })


submission_df = pd.DataFrame(results)
submission_df.to_csv("/content/drive/MyDrive/tahminler.csv", index=False)

print("✅ Predictions saved to 'tahminler.csv'.")


# **6- SINGLE PREDICTION CODE**
THE CODE HERE IS USED TO GENERATE A PREDICTION FOR THE SELECTED IMAGE.

In [None]:
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_path = 'blip_finetuned'
processor = BlipProcessor.from_pretrained(model_path)
model = BlipForConditionalGeneration.from_pretrained(model_path).to(device)
model.eval()

image_path = 'fotolar/balon.jpg'

image = Image.open(image_path).convert('RGB')

inputs = processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    generated_ids = model.generate(
        pixel_values=inputs['pixel_values'],
        max_length=64,
        num_beams=5,
        repetition_penalty=2.0,
        length_penalty=1.0,
        early_stopping=True
    )

caption = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True).strip()
print(f"🖼️ Caption: {caption}")