# Code Search Engine with Fine-tuning

Building a semantic code search engine that understands natural language queries. Testing it on the CoSQA dataset and fine-tuning to improve performance.

**Stack**: Qdrant for vector storage, Sentence Transformers for embeddings, PyTorch Lightning for training.


## Setup


In [None]:
%pip install -q ipywidgets qdrant-client sentence-transformers torch pytorch-lightning datasets transformers matplotlib


In [3]:
from src.search_engine import SearchEngine
from src.metrics import evaluate_ranking
from src.data import load_cosqa_eval
from scripts.evaluate import evaluate_model
from scripts.train import train, plot_losses
import json
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')


## Basic Search Demo

Simple case: searching through a few animal-related documents to see how the engine works.


In [None]:
documents = [
  "Elephants are the largest living terrestrial animals. Three living species are currently recognised: the African bush elephant (Loxodonta africana), the African forest elephant (L. cyclotis), and the Asian elephant (Elephas maximus). They are the only surviving members of the family Elephantidae and the order Proboscidea.",
  "A dolphin is a common name used for some of the aquatic mammals in the cetacean clade Odontoceti, the toothed whales. Dolphins belong to the families Delphinidae (the oceanic dolphins), along with the river-dolphin families Platanistidae, Iniidae, Pontoporiidae and probably the extinct Lipotidae (baiji or Chinese river dolphin).",
  "The tiger (Panthera tigris) is a large cat and a member of the genus Panthera native to Asia. It has a powerful, muscular body with a large head and paws, a long tail and orange fur with black, mostly vertical stripes.",
  "Penguins are a group of flightless, semi-aquatic birds which live almost exclusively in the Southern Hemisphere. They have highly adapted for life in the ocean: flippers instead of wings, streamlined bodies for swimming, and counter-shaded dark and white plumage.",
  "A honey bee (also spelled honeybee) is a eusocial flying insect from the genus Apis of the largest bee family, Apidae. All honey bees are nectarivorous pollinators native to mainland Afro-Eurasia, but human migrations since the Age of Discovery introduced multiple subspecies of the western honey bee into the New World and Australia, resulting in the current cosmopolitan distribution of honey bees in all continents except Antarctica.",
  "The cheetah (Acinonyx jubatus) is a lightly built, spotted cat characterised by a small rounded head, a short snout, black tear-like facial streaks, a deep chest, long thin legs and a long tail. Its slender, canine-like form is highly adapted for speed, contrasting sharply with other large felids.",
  "Owls are birds from the order Strigiformes, which includes over 200 species of mostly solitary and nocturnal birds of prey typified by an upright stance, a large broad head, binocular vision, binaural hearing, sharp talons and feathers adapted for silent flight.",
  "The gray wolf or grey wolf (Canis lupus) is a large canine native to wilderness and remote areas of Eurasia and North America. Adult wolves measure 105-160 cm (41-63 in) in length and 80-85 cm (31-33 in) at shoulder height.",
]

print(f"Sample collection: {len(documents)} documents")
for i, doc in enumerate(documents, 1):
    print(f"{i}. {doc}")


Sample collection: 8 documents
1. Elephants are the largest living terrestrial animals. Three living species are currently recognised: the African bush elephant (Loxodonta africana), the African forest elephant (L. cyclotis), and the Asian elephant (Elephas maximus). They are the only surviving members of the family Elephantidae and the order Proboscidea.
2. A dolphin is a common name used for some of the aquatic mammals in the cetacean clade Odontoceti, the toothed whales. Dolphins belong to the families Delphinidae (the oceanic dolphins), along with the river-dolphin families Platanistidae, Iniidae, Pontoporiidae and probably the extinct Lipotidae (baiji or Chinese river dolphin).
3. The tiger (Panthera tigris) is a large cat and a member of the genus Panthera native to Asia. It has a powerful, muscular body with a large head and paws, a long tail and orange fur with black, mostly vertical stripes.
4. Penguins are a group of flightless, semi-aquatic birds which live almost exclusivel

In [9]:
engine = SearchEngine(model_name="sentence-transformers/all-MiniLM-L6-v2", collection_name="demo")
engine.add_documents(documents)
print(f"Indexed {len(documents)} documents")


Indexed 8 documents


In [None]:
queries = [
    "What are the different species of elephants?",
    "What adaptations help owls hunt at night?",
    "Where do tigers live and what do they look like?",
]

for query in queries:
    print(f"\nQuery: {query}")
    results = engine.search(query, top_k=3)
    for i, res in enumerate(results, 1):
        print(f"  {i}. [{res['score']:.3f}] {res['text'][:55]}...")



Query: What are the different species of elephants?
  1. [0.636] Elephants are the largest living terrestrial animals. T...
  2. [0.426] Owls are birds from the order Strigiformes, which inclu...
  3. [0.333] The tiger (Panthera tigris) is a large cat and a member...

Query: What adaptations help owls hunt at night?
  1. [0.643] Owls are birds from the order Strigiformes, which inclu...
  2. [0.288] Penguins are a group of flightless, semi-aquatic birds ...
  3. [0.211] The gray wolf or grey wolf (Canis lupus) is a large can...

Query: Where do tigers live and what do they look like?
  1. [0.614] The tiger (Panthera tigris) is a large cat and a member...
  2. [0.378] The gray wolf or grey wolf (Canis lupus) is a large can...
  3. [0.351] Elephants are the largest living terrestrial animals. T...
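
The scores above behave like similarity values between a query embedding and each document embedding. As a rough illustration of that ranking step, here is a minimal sketch using hand-made 3-dimensional vectors in place of real embeddings. The helper `rank_by_cosine` and the toy vectors are assumptions for illustration only; they are not the internals of `SearchEngine`, which stores its vectors in Qdrant.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_cosine(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy "embeddings" (hypothetical values, not model output)
doc_vecs = [
    [0.9, 0.1, 0.0],  # stands in for the elephant document
    [0.1, 0.9, 0.1],  # stands in for the owl document
    [0.2, 0.1, 0.9],  # stands in for the tiger document
]
query_vec = [0.8, 0.2, 0.1]  # a query pointing mostly toward document 0

for idx, score in rank_by_cosine(query_vec, doc_vecs):
    print(f"doc {idx}: {score:.3f}")
```

A real engine would replace the toy vectors with model embeddings and the linear scan with a vector index, but the ranking logic is the same idea.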


## Evaluation on CoSQA

Now testing on a real dataset. CoSQA pairs natural language queries with code snippets; the task is to retrieve the function that matches each query.

I'm tracking three metrics: Recall@10 (did the right code appear in the top 10?), MRR@10 (how high was it ranked?), and NDCG@10 (overall ranking quality, discounted by position).
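
To make the three metrics concrete, here is a minimal per-query sketch with binary relevance. The function names and the toy IDs are assumptions for illustration; this is not the code in `src.metrics`, which may differ in details such as how scores are averaged across queries.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant item in the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking for one query: the single relevant snippet is at rank 3
ranked = ["c7", "c2", "c9", "c4"]
relevant = {"c9"}
print(recall_at_k(ranked, relevant))  # 1.0
print(mrr_at_k(ranked, relevant))     # 0.333...
print(ndcg_at_k(ranked, relevant))    # 0.5
```

With a single relevant snippet per query, Recall@10 is either 0 or 1 for each query, while MRR@10 and NDCG@10 also reward placing that snippet closer to the top.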


In [None]:
corpus, queries_dict, qrels = load_cosqa_eval()

print("Dataset loaded:")
print(f"Corpus: {len(corpus)} code snippets")
print(f"Queries: {len(queries_dict)} total")
print(f"Test queries: {len(qrels)}")


In [None]:
print("Evaluating baseline model...")
baseline_metrics = evaluate_model("sentence-transformers/all-MiniLM-L6-v2", k=10)

with open("baseline_results.json", "w") as f:
    json.dump(baseline_metrics, f, indent=2)


## Fine-tuning

Time to make the model better at code search. The loss is cross-entropy with in-batch negatives: instead of manually constructing negative examples, each query treats the other code snippets in its batch as negatives.
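
The idea can be sketched without any framework. In the snippet below, `query_vecs[i]` and `code_vecs[i]` form a positive pair, and every other code vector in the batch serves as a negative for query `i`. The function name, the toy vectors, and the `scale` value are assumptions for illustration; the actual training code in `scripts.train` would operate on PyTorch tensors and may differ.

```python
import math

def in_batch_negatives_loss(query_vecs, code_vecs, scale=20.0):
    """Mean cross-entropy over a batch where code_vecs[i] is the positive
    for query_vecs[i] and all other code vectors act as negatives."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    queries = [normalize(q) for q in query_vecs]
    codes = [normalize(c) for c in code_vecs]

    total = 0.0
    for i, q in enumerate(queries):
        # Row i of the similarity matrix: query i against every code in the batch
        logits = [scale * sum(a * b for a, b in zip(q, c)) for c in codes]
        # Numerically stable log-sum-exp over the row
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with target class i (the diagonal entry)
        total += log_sum_exp - logits[i]
    return total / len(queries)

# Toy batch (hypothetical vectors): aligned pairs vs. deliberately swapped pairs
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned: {aligned:.6f}, swapped: {swapped:.6f}")
```

The loss is near zero when each query is closest to its own code snippet and large when it is closer to another snippet in the batch, which is what pushes matching pairs together during training. Larger batches supply more negatives per query at no extra labeling cost.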


In [None]:
from config import EPOCHS, PLOT_PATH

losses = train()
plot_losses(losses, EPOCHS, PLOT_PATH)

print("\nTraining complete:")
print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.4f}")
print(f"Reduction: {((losses[0] - losses[-1]) / losses[0] * 100):.1f}%")


## Results


In [None]:
from config import OUTPUT_DIR

print("Evaluating fine-tuned model...")
finetuned_metrics = evaluate_model(OUTPUT_DIR, k=10)

print("\nComparison:")
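for metric in baseline_metrics:
    base = baseline_metrics[metric]
    tuned = finetuned_metrics[metric]
    # Relative change in percent; report 0 when the baseline is 0 to avoid dividing by zero
    improvement = ((tuned - base) / base * 100) if base > 0 else 0
    print(f"{metric}: {base:.4f} -> {tuned:.4f} ({improvement:+.1f}%)")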


## Bonus Experiments

Tried two extra things:

**Function names vs whole bodies** - Does indexing just the function name work as well as the whole function body?

**Distance metrics** - Cosine similarity is standard, but how do other metrics compare?
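
As a quick illustration of why the choice of distance metric can matter, the sketch below compares cosine similarity, dot product, and Euclidean distance on two hand-made vectors. The vectors and names are assumptions chosen to make the difference visible; they are not taken from the experiment itself.

```python
import math

def dot(a, b):
    """Dot product: sensitive to both direction and vector length."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: depends on direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    """Euclidean distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
docs = {
    "short_aligned": [1.0, 0.0],  # same direction as the query, small norm
    "long_off_axis": [3.0, 3.0],  # 45 degrees off the query, large norm
}

best_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
best_dot = max(docs, key=lambda name: dot(query, docs[name]))
best_euclid = min(docs, key=lambda name: euclidean(query, docs[name]))

print(best_cosine, best_dot, best_euclid)
```

Here cosine and Euclidean distance prefer the aligned vector, while the dot product prefers the longer one. For unit-normalized embeddings all three produce the same ranking, so any difference between metrics depends on whether the model's outputs are normalized.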


In [None]:
from scripts.bonus_experiments import evaluate_with_function_names, evaluate_with_distance_metrics
func_results = evaluate_with_function_names()
dist_results = evaluate_with_distance_metrics()
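
For the function-name experiment, one plausible way to get a name out of a snippet is Python's `ast` module. This is a sketch under the assumption that the snippets are parseable Python; `extract_function_name` and `name_to_text` are hypothetical helpers, not the code in `scripts.bonus_experiments`.

```python
import ast

def extract_function_name(source):
    """Return the name of the first function defined in `source`,
    or None if the snippet does not parse or defines no function."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            return node.name
    return None

def name_to_text(name):
    """Turn a snake_case identifier into space-separated words for embedding."""
    return " ".join(part for part in name.split("_") if part)

snippet = (
    "def read_json_file(path):\n"
    "    import json\n"
    "    with open(path) as f:\n"
    "        return json.load(f)\n"
)
name = extract_function_name(snippet)
print(name, "->", name_to_text(name))
```

Indexing only the name gives the model far less text to work with, so comparing it against whole-body indexing shows how much of the retrieval signal lives in the identifier alone.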


## Wrapping Up

Built a working code search engine that takes natural language queries and finds relevant code snippets. The baseline model does okay, but fine-tuning on the CoSQA training data makes it noticeably better at the task.

The in-batch negatives approach is convenient: you get contrastive learning without extra work, since every batch naturally supplies one positive and several negatives for each query. Skipping a separate negative-mining step also keeps the training pipeline simple.
