# Embedding Fine-tuning Tutorial

This notebook is a step-by-step guide to fine-tuning the Turkish BERT and BGE models on question-answer data.

## Table of Contents
1. [Setup](#kurulum)
2. [Loading and Preparing Data](#veri-yükleme)
3. [Turkish BERT Fine-tuning](#turkish-bert)
4. [BGE Model Fine-tuning](#bge-model)
5. [Model Evaluation](#değerlendirme)
6. [Usage Examples](#kullanım)

## 1. Setup <a id="kurulum"></a>

Install the required libraries:

In [None]:
!pip install -q sentence-transformers transformers torch pandas numpy scikit-learn scipy tqdm matplotlib seaborn

In [None]:
import json
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from torch.utils.data import DataLoader
import torch
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# GPU check
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Loading and Preparing Data <a id="veri-yükleme"></a>

In [None]:
# Load the example data
with open('example_qa_data.json', 'r', encoding='utf-8') as f:
    qa_data = json.load(f)

print(f"Loaded {len(qa_data)} question-answer pairs in total.")
print("\nFirst example:")
print(f"Question: {qa_data[0]['question']}")
print(f"Answer: {qa_data[0]['answer']}")

In [None]:
# Convert the data to a DataFrame
df = pd.DataFrame(qa_data)
df.head()

In [None]:
# Dataset statistics
print("Dataset statistics:")
print(f"Total number of examples: {len(df)}")
print(f"Average question length: {df['question'].str.len().mean():.2f} characters")
print(f"Average answer length: {df['answer'].str.len().mean():.2f} characters")

In [None]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(f"Training examples: {len(train_df)}")
print(f"Test examples: {len(test_df)}")

In [None]:
# Convert to InputExample format
def create_examples(df):
    examples = []
    for _, row in df.iterrows():
        examples.append(InputExample(texts=[row['question'], row['answer']], label=1.0))
    return examples

train_examples = create_examples(train_df)
eval_examples = create_examples(test_df)

print(f"Training examples: {len(train_examples)}")
print(f"Evaluation examples: {len(eval_examples)}")
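
Every pair built above is a positive example, and the `MultipleNegativesRankingLoss` used in the following sections treats the other answers in a batch as negatives. An answer that appears under several questions can therefore end up as a false negative. The sketch below is one way to spot such duplicates before training; the helper name `find_duplicate_answers` and the toy pairs are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def find_duplicate_answers(pairs):
    """Return answers that occur more than once among (question, answer) pairs."""
    counts = Counter(answer.strip() for _, answer in pairs)
    return {answer: n for answer, n in counts.items() if n > 1}


# Hypothetical toy data standing in for the rows of train_df
toy_pairs = [
    ("Python nedir?", "Python bir programlama dilidir."),
    ("Python ne tür bir dildir?", "Python bir programlama dilidir."),
    ("BERT nedir?", "BERT bir dil modelidir."),
]

duplicates = find_duplicate_answers(toy_pairs)
print(duplicates)
```

On real data, the same check could be run on `zip(train_df['question'], train_df['answer'])`; whether to drop or merge the duplicates depends on the dataset.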

## 3. Turkish BERT Fine-tuning <a id="turkish-bert"></a>

In [None]:
# Load the model
turkish_bert = SentenceTransformer('dbmdz/bert-base-turkish-cased')
turkish_bert.max_seq_length = 128

print(f"Model loaded: {turkish_bert}")
print(f"Max sequence length: {turkish_bert.max_seq_length}")

In [None]:
# Quick test with the original model (before fine-tuning)
test_questions = [
    "Python nedir?",
    "Makine öğrenmesi nasıl çalışır?"
]
test_answers = [
    "Python yüksek seviyeli bir programlama dilidir.",
    "Makine öğrenmesi verilerden pattern öğrenir."
]

q_emb_before = turkish_bert.encode(test_questions, normalize_embeddings=True)
a_emb_before = turkish_bert.encode(test_answers, normalize_embeddings=True)

sim_before = cosine_similarity(q_emb_before, a_emb_before)
print("Similarity scores BEFORE fine-tuning:")
for i, (q, a) in enumerate(zip(test_questions, test_answers)):
    print(f"  {q} -> {a}")
    print(f"  Similarity: {sim_before[i][i]:.4f}\n")
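
Because the embeddings are encoded with `normalize_embeddings=True`, each vector has unit length, so cosine similarity reduces to a plain dot product. The following is a minimal sketch of that equivalence on two hand-picked vectors; the helper names are illustrative and the vectors are not model outputs.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u, v = [3.0, 4.0], [4.0, 3.0]
cos_uv = cosine(u, v)
dot_normalized = dot(l2_normalize(u), l2_normalize(v))
print(cos_uv, dot_normalized)
```

Both values agree up to floating-point error, which is why a dot product over normalized embeddings can stand in for `cosine_similarity`.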

In [None]:
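# Create the DataLoader and the loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(turkish_bert)

# Evaluator: all eval pairs are positives (label=1.0), so the correlation computed by
# EmbeddingSimilarityEvaluator would be undefined. Rank each question's answer
# against the other test answers instead.
from sentence_transformers.evaluation import InformationRetrievalEvaluator

eval_queries = {str(i): ex.texts[0] for i, ex in enumerate(eval_examples)}
eval_corpus = {str(i): ex.texts[1] for i, ex in enumerate(eval_examples)}
eval_relevant = {qid: {qid} for qid in eval_queries}
evaluator = InformationRetrievalEvaluator(eval_queries, eval_corpus, eval_relevant, name='turkish-bert-eval')

print("DataLoader and loss function are ready.")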
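
To make the training objective more concrete, here is a rough sketch of the idea behind `MultipleNegativesRankingLoss`: within a batch, question `i` should score its own answer `i` higher than every other answer, which can be written as a cross-entropy over the rows of a scaled similarity matrix. This is a simplified pure-Python illustration, not the library's implementation; the `scale=20.0` value is assumed to mirror the library default, and the toy matrices are made up.

```python
import math


def mnr_loss(sim_matrix, scale=20.0):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [scale * s for s in row]
        # Numerically stable log-sum-exp over the row
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim_matrix)


# Rows are questions, columns are answers (toy cosine similarities)
well_aligned = [[0.9, 0.1], [0.2, 0.8]]   # matching pairs score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]     # wrong answers score highest

print(mnr_loss(well_aligned), mnr_loss(misaligned))
```

The loss is close to zero when the diagonal dominates and grows when off-diagonal answers outrank the correct one. Larger batches supply more in-batch negatives.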

In [None]:
# Fine-tuning
output_path = './models/turkish-bert-qa-finetuned'

turkish_bert.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=3,
    warmup_steps=100,
    optimizer_params={'lr': 2e-5},
    output_path=output_path,
    evaluation_steps=50,
    save_best_model=True,
    show_progress_bar=True
)

print(f"\nModel saved to: {output_path}")
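
With `batch_size=16`, `epochs=3` and `warmup_steps=100`, it is worth checking how the warmup compares with the total number of optimizer steps, since on a small dataset the warmup can cover most or all of training. The sketch below does that arithmetic; the example sizes (800 and 100 training pairs) are hypothetical and not taken from the notebook's data.

```python
import math


def training_steps(num_examples, batch_size, epochs):
    """Total optimizer steps, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


warmup_steps = 100

# Hypothetical dataset sizes, for illustration only
for num_examples in (800, 100):
    total = training_steps(num_examples, batch_size=16, epochs=3)
    print(f"{num_examples} examples -> {total} steps, "
          f"warmup covers {min(1.0, warmup_steps / total):.0%} of training")
```

If the warmup covers most of the run, a smaller `warmup_steps` (a common heuristic is roughly 10% of the total steps) may be a better fit.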

In [None]:
# Load the fine-tuned model and test it
turkish_bert_ft = SentenceTransformer(output_path)

q_emb_after = turkish_bert_ft.encode(test_questions, normalize_embeddings=True)
a_emb_after = turkish_bert_ft.encode(test_answers, normalize_embeddings=True)

sim_after = cosine_similarity(q_emb_after, a_emb_after)

print("Similarity scores AFTER fine-tuning:")
for i, (q, a) in enumerate(zip(test_questions, test_answers)):
    print(f"  {q} -> {a}")
    print(f"  Before: {sim_before[i][i]:.4f}")
    print(f"  After: {sim_after[i][i]:.4f}")
    # Absolute change in cosine similarity, not a relative percentage
    print(f"  Change: {sim_after[i][i] - sim_before[i][i]:+.4f}\n")

## 4. BGE Model Fine-tuning <a id="bge-model"></a>

In [None]:
# Load the BGE-M3 model
bge_model = SentenceTransformer('BAAI/bge-m3')
bge_model.max_seq_length = 512

print(f"Model loaded: {bge_model}")
print(f"Max sequence length: {bge_model.max_seq_length}")

In [None]:
# Add an instruction prefix to the queries for BGE (the prefix stays in Turkish to match the data)
query_instruction = "Bu soruyu cevaplamak için ilgili bilgiyi ara: "

def create_bge_examples(df, use_instruction=True):
    examples = []
    for _, row in df.iterrows():
        if use_instruction:
            question = f"{query_instruction}{row['question']}"
        else:
            question = row['question']
        examples.append(InputExample(texts=[question, row['answer']], label=1.0))
    return examples

train_examples_bge = create_bge_examples(train_df, use_instruction=True)
eval_examples_bge = create_bge_examples(test_df, use_instruction=True)

print(f"BGE Training examples: {len(train_examples_bge)}")
print(f"BGE Evaluation examples: {len(eval_examples_bge)}")

In [None]:
# Test the original BGE model
test_questions_inst = [f"{query_instruction}{q}" for q in test_questions]

q_emb_bge_before = bge_model.encode(test_questions_inst, normalize_embeddings=True)
a_emb_bge_before = bge_model.encode(test_answers, normalize_embeddings=True)

sim_bge_before = cosine_similarity(q_emb_bge_before, a_emb_bge_before)
print("BGE similarity scores BEFORE fine-tuning:")
for i, (q, a) in enumerate(zip(test_questions, test_answers)):
    print(f"  {q} -> {a}")
    print(f"  Similarity: {sim_bge_before[i][i]:.4f}\n")

In [None]:
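# DataLoader and loss
train_dataloader_bge = DataLoader(train_examples_bge, shuffle=True, batch_size=16)
train_loss_bge = losses.MultipleNegativesRankingLoss(bge_model)
# All eval pairs are positives (label=1.0), so a similarity correlation is undefined;
# rank each question's answer against the other test answers instead.
from sentence_transformers.evaluation import InformationRetrievalEvaluator
queries_bge = {str(i): ex.texts[0] for i, ex in enumerate(eval_examples_bge)}
corpus_bge = {str(i): ex.texts[1] for i, ex in enumerate(eval_examples_bge)}
relevant_bge = {qid: {qid} for qid in queries_bge}
evaluator_bge = InformationRetrievalEvaluator(queries_bge, corpus_bge, relevant_bge, name='bge-eval')

print("BGE DataLoader and loss function are ready.")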

In [None]:
# BGE Fine-tuning
output_path_bge = './models/bge-m3-qa-finetuned'

bge_model.fit(
    train_objectives=[(train_dataloader_bge, train_loss_bge)],
    evaluator=evaluator_bge,
    epochs=3,
    warmup_steps=100,
    optimizer_params={'lr': 1e-5},  # lower learning rate for BGE
    output_path=output_path_bge,
    evaluation_steps=50,
    save_best_model=True,
    show_progress_bar=True,
    scheduler='warmuplinear'
)

print(f"\nBGE model saved to: {output_path_bge}")
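
The `scheduler='warmuplinear'` setting together with `warmup_steps=100` shapes the learning rate over training. Assuming this corresponds to the usual linear warmup followed by linear decay to zero, the schedule can be sketched as below. The function name and the total of 300 steps are illustrative assumptions.

```python
def warmup_linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 300 total steps with the notebook's lr and warmup values
for step in (0, 50, 100, 200, 300):
    lr = warmup_linear_lr(step, base_lr=1e-5, warmup_steps=100, total_steps=300)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

The rate peaks at `base_lr` when the warmup ends and falls back to zero at the final step.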

In [None]:
# Test the fine-tuned BGE model
bge_model_ft = SentenceTransformer(output_path_bge)

q_emb_bge_after = bge_model_ft.encode(test_questions_inst, normalize_embeddings=True)
a_emb_bge_after = bge_model_ft.encode(test_answers, normalize_embeddings=True)

sim_bge_after = cosine_similarity(q_emb_bge_after, a_emb_bge_after)

print("BGE similarity scores AFTER fine-tuning:")
for i, (q, a) in enumerate(zip(test_questions, test_answers)):
    print(f"  {q} -> {a}")
    print(f"  Before: {sim_bge_before[i][i]:.4f}")
    print(f"  After: {sim_bge_after[i][i]:.4f}")
    # Absolute change in cosine similarity, not a relative percentage
    print(f"  Change: {sim_bge_after[i][i] - sim_bge_before[i][i]:+.4f}\n")

## 5. Model Evaluation <a id="değerlendirme"></a>

In [None]:
# Evaluate on the test set
test_questions_all = test_df['question'].tolist()
test_answers_all = test_df['answer'].tolist()

# Turkish BERT
q_emb_bert = turkish_bert_ft.encode(test_questions_all, normalize_embeddings=True)
a_emb_bert = turkish_bert_ft.encode(test_answers_all, normalize_embeddings=True)
sim_matrix_bert = cosine_similarity(q_emb_bert, a_emb_bert)

# BGE
test_questions_all_inst = [f"{query_instruction}{q}" for q in test_questions_all]
q_emb_bge = bge_model_ft.encode(test_questions_all_inst, normalize_embeddings=True)
a_emb_bge = bge_model_ft.encode(test_answers_all, normalize_embeddings=True)
sim_matrix_bge = cosine_similarity(q_emb_bge, a_emb_bge)

print("Embeddings computed.")

In [None]:
# Compute Accuracy@k
def calculate_accuracy_at_k(sim_matrix, k_values=(1, 3, 5)):
    # Row i holds question i's similarity to every answer; the correct answer is column i
    n = len(sim_matrix)
    results = {}

    for k in k_values:
        correct = 0
        for i in range(n):
            # Indices of the k most similar answers, best first
            top_k_indices = np.argsort(sim_matrix[i])[-k:][::-1]
            if i in top_k_indices:
                correct += 1
        results[f'accuracy@{k}'] = correct / n

    return results

# Turkish BERT metrics
metrics_bert = calculate_accuracy_at_k(sim_matrix_bert)
print("Turkish BERT metrics:")
for metric, value in metrics_bert.items():
    print(f"  {metric}: {value:.4f}")

# BGE metrics
metrics_bge = calculate_accuracy_at_k(sim_matrix_bge)
print("\nBGE metrics:")
for metric, value in metrics_bge.items():
    print(f"  {metric}: {value:.4f}")
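
To see what Accuracy@k measures, it helps to work through a tiny similarity matrix by hand. The sketch below is a pure-Python variant of the same idea, applied to three made-up questions whose correct answers are ranked first, second and third respectively. The function name and the numbers are illustrative.

```python
def accuracy_at_k(sim_matrix, k):
    """Fraction of rows whose diagonal entry is among the k largest in that row."""
    hits = 0
    for i, row in enumerate(sim_matrix):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim_matrix)


toy_matrix = [
    [0.9, 0.2, 0.1],  # question 0: correct answer ranked first
    [0.6, 0.5, 0.1],  # question 1: correct answer ranked second
    [0.7, 0.4, 0.3],  # question 2: correct answer ranked third
]

for k in (1, 2, 3):
    print(f"accuracy@{k}: {accuracy_at_k(toy_matrix, k):.4f}")
```

Accuracy@k can only stay the same or increase as `k` grows, and it reaches 1.0 once `k` equals the number of candidate answers.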

In [None]:
# Visualize the similarity distributions
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Turkish BERT
correct_sims_bert = [sim_matrix_bert[i][i] for i in range(len(sim_matrix_bert))]
axes[0].hist(correct_sims_bert, bins=20, alpha=0.7, color='blue', edgecolor='black')
axes[0].set_title('Turkish BERT - Similarity Distribution of Correct Matches')
axes[0].set_xlabel('Cosine Similarity')
axes[0].set_ylabel('Frequency')
axes[0].axvline(np.mean(correct_sims_bert), color='red', linestyle='--', label=f'Mean: {np.mean(correct_sims_bert):.3f}')
axes[0].legend()

# BGE
correct_sims_bge = [sim_matrix_bge[i][i] for i in range(len(sim_matrix_bge))]
axes[1].hist(correct_sims_bge, bins=20, alpha=0.7, color='green', edgecolor='black')
axes[1].set_title('BGE - Similarity Distribution of Correct Matches')
axes[1].set_xlabel('Cosine Similarity')
axes[1].set_ylabel('Frequency')
axes[1].axvline(np.mean(correct_sims_bge), color='red', linestyle='--', label=f'Mean: {np.mean(correct_sims_bge):.3f}')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Model comparison
comparison_df = pd.DataFrame([
    {'Model': 'Turkish BERT', **metrics_bert, 'Avg Similarity': np.mean(correct_sims_bert)},
    {'Model': 'BGE-M3', **metrics_bge, 'Avg Similarity': np.mean(correct_sims_bge)}
])

print("\nModel comparison:")
print(comparison_df.to_string(index=False))

## 6. Usage Examples <a id="kullanım"></a>

In [None]:
# Semantic search example
def semantic_search(query, model, answer_database, top_k=3, use_instruction=False):
    """
    Find the most relevant answers with semantic search.
    """
    if use_instruction:
        query = f"{query_instruction}{query}"

    # Compute the embeddings
    query_emb = model.encode([query], normalize_embeddings=True)
    answer_embs = model.encode(answer_database, normalize_embeddings=True)

    # Compute the similarities
    similarities = cosine_similarity(query_emb, answer_embs)[0]

    # Take the top-k results, best first
    top_k_indices = np.argsort(similarities)[-top_k:][::-1]

    results = []
    for idx in top_k_indices:
        results.append({
            'answer': answer_database[idx],
            'score': similarities[idx]
        })

    return results

# Answer database
answer_db = df['answer'].tolist()

# Example queries
query1 = "Python hangi tür bir dildir?"
query2 = "Derin öğrenme ile ilgili bilgi ver"

print("=" * 70)
print("TURKISH BERT - Semantic Search")
print("=" * 70)

for query in [query1, query2]:
    print(f"\nQuery: {query}")
    results = semantic_search(query, turkish_bert_ft, answer_db, top_k=3)
    for i, result in enumerate(results, 1):
        print(f"  {i}. {result['answer'][:60]}... (Score: {result['score']:.4f})")

print("\n" + "=" * 70)
print("BGE - Semantic Search")
print("=" * 70)

for query in [query1, query2]:
    print(f"\nQuery: {query}")
    results = semantic_search(query, bge_model_ft, answer_db, top_k=3, use_instruction=True)
    for i, result in enumerate(results, 1):
        print(f"  {i}. {result['answer'][:60]}... (Score: {result['score']:.4f})")
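
Note that `semantic_search` re-encodes the whole answer database on every call. When many queries hit the same database, one option is to encode the answers once and reuse the vectors. The sketch below illustrates that pattern with a hypothetical `AnswerIndex` class and a toy keyword-count encoder standing in for the model; in a real run the `encode` argument could be something like `lambda texts: model.encode(texts, normalize_embeddings=True)`.

```python
import heapq


class AnswerIndex:
    """Encodes the answers once and reuses the vectors for every query."""

    def __init__(self, encode, answers):
        self.encode = encode
        self.answers = list(answers)
        self.vectors = encode(self.answers)  # computed a single time

    def search(self, query, top_k=3):
        query_vec = self.encode([query])[0]
        # Dot product equals cosine similarity when the vectors are normalized
        scores = [sum(q * a for q, a in zip(query_vec, vec)) for vec in self.vectors]
        best = heapq.nlargest(top_k, range(len(scores)), key=scores.__getitem__)
        return [(self.answers[i], scores[i]) for i in best]


# Hypothetical stand-in encoder: counts two keywords and records each call size
calls = []


def toy_encode(texts):
    calls.append(len(texts))
    return [[float(t.lower().count(w)) for w in ("python", "bert")] for t in texts]


toy_answers = ["Python bir programlama dilidir.", "BERT bir dil modelidir."]
index = AnswerIndex(toy_encode, toy_answers)
result = index.search("BERT nedir?", top_k=1)
print(result, calls)
```

The recorded call sizes show that the answers are encoded once at construction and only the single query is encoded per search.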

In [None]:
# Clustering example
from sklearn.cluster import KMeans

# Compute embeddings for all answers
all_answers = df['answer'].tolist()
answer_embeddings = bge_model_ft.encode(all_answers, normalize_embeddings=True)

# K-means clustering (n_init set explicitly so behavior is consistent across scikit-learn versions)
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
clusters = kmeans.fit_predict(answer_embeddings)

# Show examples from each cluster
print("\n" + "=" * 70)
print("CLUSTERING RESULTS")
print("=" * 70)

for cluster_id in range(n_clusters):
    print(f"\nCluster {cluster_id + 1}:")
    cluster_answers = [all_answers[i] for i in range(len(all_answers)) if clusters[i] == cluster_id]
    for answer in cluster_answers[:3]:  # first 3 examples
        print(f"  - {answer[:60]}...")

In [None]:
# Saving and loading the models
print("\nModels saved to:")
print(f"  Turkish BERT: {output_path}")
print(f"  BGE-M3: {output_path_bge}")
print("\nTo load a model:")
print(f"  model = SentenceTransformer('{output_path}')")

## Conclusion

In this notebook, we:
1. ✅ Fine-tuned the Turkish BERT model
2. ✅ Fine-tuned the BGE-M3 model
3. ✅ Evaluated both models
4. ✅ Built semantic search and clustering examples

### Recommendations:
- Larger datasets generally yield better results
- Experiment with different hyperparameters
- For production use, you can convert the model to ONNX format
- Libraries such as FAISS enable large-scale search

### References:
- [Sentence Transformers Docs](https://www.sbert.net/)
- [BGE Models](https://huggingface.co/BAAI)
- [Turkish BERT](https://huggingface.co/dbmdz/bert-base-turkish-cased)