# Code API for custom database
I've implemented search over two embedding models. For a given query, we look up the closest matches in two embedding spaces: one generated by the NLP model **sentence-transformers/all-MiniLM-L6-v2**, which embeds function names, and another generated by **codesage/codesage-base-v2**, which embeds function bodies.
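To make the two-space lookup concrete, here is a minimal sketch with toy 2-D vectors in place of real embeddings. It assumes cosine distance and 1-based integer keys; the names `Match`, `cosine_distance`, and `top_k` are hypothetical and are not part of `CodeSearchAPI`.

```python
import math
from typing import NamedTuple


class Match(NamedTuple):
    key: int         # 1-based function index (assumed convention)
    distance: float  # cosine distance, lower means closer


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query, vectors, k=10):
    """Exhaustively rank all stored vectors by distance to the query."""
    matches = [Match(key, cosine_distance(query, vec)) for key, vec in vectors.items()]
    return sorted(matches, key=lambda m: m.distance)[:k]


# Toy vectors standing in for function-name and function-body embeddings.
name_space = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.7, 0.7]}
body_space = {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.5, 0.5]}

# The query is embedded once per model; both query vectors here are made up.
nlp_results = top_k([1.0, 0.1], name_space, k=2)
code_results = top_k([1.0, 0.2], body_space, k=2)
```

Each space is searched independently, so the two result lists can rank the same function differently.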

In [3]:
from code_search_api import CodeSearchAPI

api = CodeSearchAPI()
api.create_vector_db("demo_code_repository/example_code_file_1.py")
functions = api.extract_function_bodies("demo_code_repository/example_code_file_1.py")

(nlp_model_results, code_model_results) = api.search("function that adds two numbers")

print()
print("NLP model results:")
for match in nlp_model_results:
    print(f"Function name match: {match.key} with distance {match.distance} : {functions[match.key - 1][0]}")

print()
print("Code model results:")
for match in code_model_results:
    print(f"Function body match: {match.key} with distance {match.distance} : {functions[match.key - 1][0]}")

No existing query vector database file.
No existing code vector database file.
Using device for nlp embedding model: cuda
Using device for code embedding model: cuda


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



NLP model results:
Function name match: 1 with distance 0.3948790431022644 : add_numbers
Function name match: 3 with distance 0.5458637475967407 : multiply_numbers
Function name match: 2 with distance 0.5667691230773926 : subtract_numbers
Function name match: 18 with distance 0.6176607608795166 : sum_list
Function name match: 50 with distance 0.6925162672996521 : count_digits
Function name match: 25 with distance 0.7290886044502258 : fibonacci
Function name match: 4 with distance 0.741529107093811 : divide
Function name match: 10 with distance 0.7870303392410278 : factorial
Function name match: 23 with distance 0.8031855225563049 : count_occurrences
Function name match: 46 with distance 0.804297685623169 : has_duplicates

Code model results:
Function body match: 1 with distance 0.38877278566360474 : add_numbers
Function body match: 3 with distance 0.7932268381118774 : multiply_numbers
Function body match: 2 with distance 0.8255326747894287 : subtract_numbers
Function body match: 32 wi

## Evaluation

I evaluated the model on the *test* split of the **CoSQA** dataset.

**Recall@10**:

$$
\text{Recall@10} = \frac{\text{Number of queries whose desired result appears in the top 10 retrieved results}}{\text{Number of queries in the evaluation dataset}}
$$
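A minimal sketch of this metric, assuming each query has exactly one desired result and that `ranks[i]` holds its 1-based position in the retrieved list (or `None` when it was not retrieved). The function name and input format are hypothetical, not taken from `evaluate_model`.

```python
def recall_at_k(ranks, k=10):
    """Fraction of queries whose desired result appears within the top k."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)


# Five toy queries: three land in the top 10, one is ranked 12th, one is missing.
example_recall = recall_at_k([1, 3, None, 12, 10])
```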

**MRR@10**:  
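Measures the average reciprocal rank of the first relevant result within the top 10 retrieved results, where $\text{rank}_i$ is the position of that result for query $i$ and queries with no relevant result in the top 10 contribute 0.  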
$$
\text{MRR@10} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}
$$
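The same formula as a short sketch, assuming `ranks[i]` is the 1-based rank of the first relevant result for query `i`, or `None` when nothing relevant was retrieved. The function name is hypothetical.

```python
def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank; results beyond position k or missing count as 0."""
    total = sum(1.0 / rank for rank in ranks if rank is not None and rank <= k)
    return total / len(ranks)


# Reciprocal ranks 1 + 1/2 + 1/4 = 1.75, averaged over five queries.
example_mrr = mrr_at_k([1, 2, None, 12, 4])
```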


**NDCG@10**:  
Evaluates ranking quality by weighting relevant items more heavily the nearer they appear to the top of the ranked list. The formula below is the special case of one relevant item per query, where the ideal DCG equals 1.  

$$
\text{NDCG@10} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\log_2(\text{rank}_i + 1)}
$$
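A sketch of the single-relevant-item form of NDCG@10 given above, under the assumption that queries whose relevant item falls outside the top 10 contribute 0. The function name and the `ranks` input format are hypothetical.

```python
import math


def ndcg_at_k(ranks, k=10):
    """NDCG with one relevant item per query: the gain is 1 / log2(rank + 1)."""
    total = sum(
        1.0 / math.log2(rank + 1)
        for rank in ranks
        if rank is not None and rank <= k
    )
    return total / len(ranks)


# Rank 1 contributes 1.0, rank 3 contributes 0.5, the missing item contributes 0.
example_ndcg = ndcg_at_k([1, 3, None])
```

Compared with the reciprocal rank, the logarithmic discount penalises lower positions more gently, which is consistent with the NDCG@10 values in the results being higher than the MRR@10 values.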



In the CoSQA dataset, the function bodies include descriptions, so for this evaluation we embed those descriptions with our NLP model instead of the function names.
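One way such a description could be pulled out of a function body is sketched below. It assumes the descriptions are ordinary Python docstrings and that the snippet parses under the running interpreter; `extract_description` is a hypothetical helper, not a function from this repository.

```python
import ast
from typing import Optional


def extract_description(function_source: str) -> Optional[str]:
    """Return the docstring of the first function found, or None."""
    try:
        tree = ast.parse(function_source)
    except SyntaxError:
        # Snippets that do not parse (for example Python 2 code) are skipped.
        return None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            return ast.get_docstring(node)
    return None


source = '''
def add_numbers(a, b):
    """Return the sum of two numbers."""
    return a + b
'''
description = extract_description(source)
```

The returned string would then be passed to the NLP model in place of the function name.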

In [4]:
from create_vector_db_from_CoSQA import create_vector_db_from_cosqa
from evaluate_model import evaluate_model

create_vector_db_from_cosqa()
evaluate_model()

Datasets loaded.
Query vector database loaded from file
No existing code vector database file.
Using device for code embedding model: cuda
Using device for nlp embedding model: cuda


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Processing row 20150/20604
Processing row 20200/20604
Processing row 20250/20604
Processing row 20300/20604
Processing row 20350/20604
Processing row 20400/20604
Processing row 20450/20604
Processing row 20500/20604
Processing row 20550/20604
Processing row 20600/20604
Datasets loaded.
Query vector database loaded from file
Code vector database loaded from file
Using device for code embedding model: cuda
Using device for nlp embedding model: cuda


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



Recall@10 for code embeddings: 448/500 = 0.8960
Recall@10 for query embeddings: 440/500 = 0.8800

MRR@10 for code embeddings: 377.2884920634921/500 = 0.7546
MRR@10 for query embeddings: 324.3527777777777/500 = 0.6487

NDCG@10 for code embeddings: 395.1079768572858/500 = 0.7902
NDCG@10 for query embeddings: 352.5450813293562/500 = 0.7051


# Finetuning

Since this is only a demonstration, to save time I fine-tune only the **codesage** code model. The first experiment trains on the test split of the dataset to check that the model can overfit as expected.
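The notebook does not show which training objective `fine_tune_code_model` uses. One common choice for retrieval fine-tuning is an in-batch contrastive (InfoNCE-style) loss, where each query's paired code snippet is the positive and the other snippets in the batch act as negatives. The sketch below is an assumption for illustration only; the function name and the temperature value are hypothetical.

```python
import math


def in_batch_contrastive_loss(similarities, temperature=0.05):
    """Mean cross-entropy where similarities[i][i] is the positive pair.

    similarities[i][j] is the similarity between query i and code snippet j.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [value / temperature for value in row]
        # Numerically stable log-sum-exp over the row.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(similarities)


# Uninformative similarities give log(batch size); well-separated pairs give ~0.
uniform_loss = in_batch_contrastive_loss([[0.5, 0.5], [0.5, 0.5]])
aligned_loss = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
```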

In [5]:
from evaluate_model import evaluate_model
from create_vector_db_from_CoSQA import create_vector_db_from_cosqa
from fine_tune_model_code import fine_tune_code_model

print("Training on test split (for demonstration purposes to check if model works we try to overfit first).")
fine_tune_code_model(split="test")

print("\nCreating vector DB from CoSQA corpus with fine-tuned model and evaluating.")
create_vector_db_from_cosqa(code_model_path="fine-tuned_models/code_model_finetuned_test.pth")

print("\nEvaluating fine-tuned model on CoSQA test set.")
evaluate_model(code_model_path="fine-tuned_models/code_model_finetuned_test.pth")

Training on test split (for demonstration purposes to check if model works we try to overfit first).
Datasets loaded.
Number of batches in training data: 167
Using device for code embedding model: cuda


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Batch 50/167
Batch 100/167
Batch 150/167
Error in batch 164 : CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 8.00 GiB of which 0 bytes is free. Of the allocated memory 13.69 GiB is allocated by PyTorch, and 602.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Epoch 1, Loss: 0.3219295811956514
Total errors during training: 1

Creating vector DB from CoSQA corpus with fine-tuned model and evaluating.
Datasets loaded.
Query vector database loaded from file
No existing code vector database file.
Using device for code embedding model: cuda
Loaded model weights from fine-tuned_models/code_model_finetuned_test.pth
Using device for nlp embedding model: cuda


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Processing row 20150/20604
Processing row 20200/20604
Processing row 20250/20604
Processing row 20300/20604
Processing row 20350/20604
Processing row 20400/20604
Processing row 20450/20604
Processing row 20500/20604
Processing row 20550/20604
Processing row 20600/20604

Evaluating fine-tuned model on CoSQA test set.
Datasets loaded.
Query vector database loaded from file
Code vector database loaded from file for finetuned model
Using device for code embedding model: cuda
Loaded model weights from fine-tuned_models/code_model_finetuned_test.pth
Using device for nlp embedding model: cuda


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



Recall@10 for code embeddings: 492/500 = 0.9840
Recall@10 for query embeddings: 440/500 = 0.8800

MRR@10 for code embeddings: 443.9440476190475/500 = 0.8879
MRR@10 for query embeddings: 324.3527777777777/500 = 0.6487

NDCG@10 for code embeddings: 456.1547442424455/500 = 0.9123
NDCG@10 for query embeddings: 352.5450813293562/500 = 0.7051


As we can see, after overfitting we reach Recall@10 = 0.9840, MRR@10 = 0.8879, and NDCG@10 = 0.9123, an improvement over the baseline on all three metrics.
Let's try again, this time on the train split of the dataset.
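The overfitting run logged one out-of-memory error in batch 164 and still completed the epoch, which suggests the training loop skips failing batches and counts them. A minimal sketch of that pattern follows; `train_one_epoch` and `train_step` are hypothetical names, and the actual loop in `fine_tune_model_code` may differ.

```python
def train_one_epoch(batches, train_step):
    """Run train_step on every batch, skipping batches that raise RuntimeError."""
    losses = []
    errors = 0
    for index, batch in enumerate(batches, start=1):
        try:
            losses.append(train_step(batch))
        except RuntimeError as exc:
            # CUDA out-of-memory failures in PyTorch surface as RuntimeError.
            errors += 1
            print(f"Error in batch {index} : {exc}")
    mean_loss = sum(losses) / len(losses) if losses else float("nan")
    return mean_loss, errors


def fake_step(batch):
    """Stand-in for a real optimisation step; fails on one batch."""
    if batch == 3:
        raise RuntimeError("CUDA out of memory (simulated)")
    return 0.5


mean_loss, errors = train_one_epoch([1, 2, 3, 4], fake_step)
```

Skipped batches never contribute a gradient update, so a large error count would make the reported loss unrepresentative of the full split.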

In [None]:
from evaluate_model import evaluate_model
from create_vector_db_from_CoSQA import create_vector_db_from_cosqa
from fine_tune_model_code import fine_tune_code_model

print("Training on train split")
#fine_tune_code_model(split="train") # Uncomment to retrain

print("\nCreating vector DB from CoSQA corpus with fine-tuned model and evaluating.")
create_vector_db_from_cosqa(code_model_path="fine-tuned_models/code_model_finetuned_train.pth", split="train")

print("\nEvaluating fine-tuned model on CoSQA test set.")
evaluate_model(code_model_path="fine-tuned_models/code_model_finetuned_train.pth", split = "train")

Training on train split

Creating vector DB from CoSQA corpus with fine-tuned model and evaluating.
Datasets loaded.
Query vector database loaded from file
Code vector database loaded from file
Using device for code embedding model: cuda
Loaded model weights from fine-tuned_models/code_model_finetuned_train.pth
Using device for nlp embedding model: cuda


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Processing row 20150/20604
Processing row 20200/20604
Processing row 20250/20604
Processing row 20300/20604
Processing row 20350/20604
Processing row 20400/20604
Processing row 20450/20604
Processing row 20500/20604
Processing row 20550/20604
Processing row 20600/20604

Evaluating fine-tuned model on CoSQA test set.
Datasets loaded.
Query vector database loaded from file
Code vector database loaded from file for finetuned model
Using device for code embedding model: cuda
Loaded model weights from fine-tuned_models/code_model_finetuned_train.pth
Using device for nlp embedding model: cuda


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



Recall@10 for code embeddings: 452/500 = 0.9040
Recall@10 for query embeddings: 440/500 = 0.8800

MRR@10 for code embeddings: 347.99999999999994/500 = 0.6960
MRR@10 for query embeddings: 324.3527777777777/500 = 0.6487

NDCG@10 for code embeddings: 373.4832661627109/500 = 0.7470
NDCG@10 for query embeddings: 352.5450813293562/500 = 0.7051


As we can see, after training for half an epoch on the train split we reach Recall@10 = 0.9040, MRR@10 = 0.6960, and NDCG@10 = 0.7470. Recall@10 improved over the baseline (0.8960), while MRR@10 and NDCG@10 came out worse (baseline 0.7546 and 0.7902). Longer training might help.

Vector database parameters:
- connectivity : Number of edges per node in the ANN (Approximate Nearest Neighbour) graph, which makes retrieval of the closest vectors more efficient.
- expansion_add : Number of existing nodes considered as connection candidates when a new node is added to the ANN graph.
- expansion_search : Number of candidate neighbours explored when performing a search in the ANN graph.
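To illustrate how these three parameters interact, here is a toy single-layer proximity graph in pure Python. It is a simplification written for this explanation, not the vector database used above: `ToyGraphIndex` and its methods are hypothetical, it uses Euclidean distance, and real ANN indexes add layers and smarter pruning.

```python
import heapq
import math


class ToyGraphIndex:
    """Single-layer proximity graph searched with a bounded best-first beam."""

    def __init__(self, connectivity=4, expansion_add=8, expansion_search=8):
        self.connectivity = connectivity          # max edges kept per node
        self.expansion_add = expansion_add        # beam width while inserting
        self.expansion_search = expansion_search  # beam width while querying
        self.vectors = {}
        self.edges = {}

    def _beam_search(self, query, beam_width):
        """Return up to beam_width (distance, key) pairs, closest first."""
        entry = next(iter(self.vectors))
        start = math.dist(query, self.vectors[entry])
        visited = {entry}
        frontier = [(start, entry)]  # min-heap of nodes still to expand
        best = [(-start, entry)]     # max-heap (negated) of the current beam
        while frontier:
            distance, node = heapq.heappop(frontier)
            if len(best) >= beam_width and distance > -best[0][0]:
                break  # nothing left can improve the beam
            for neighbour in self.edges[node]:
                if neighbour in visited:
                    continue
                visited.add(neighbour)
                d = math.dist(query, self.vectors[neighbour])
                if len(best) < beam_width or d < -best[0][0]:
                    heapq.heappush(frontier, (d, neighbour))
                    heapq.heappush(best, (-d, neighbour))
                    if len(best) > beam_width:
                        heapq.heappop(best)
        return sorted((-negated, key) for negated, key in best)

    def add(self, key, vector):
        """Link the new node to its closest candidates, then prune neighbours."""
        neighbours = []
        if self.vectors:
            candidates = self._beam_search(vector, self.expansion_add)
            neighbours = [k for _, k in candidates[: self.connectivity]]
        self.vectors[key] = vector
        self.edges[key] = set(neighbours)
        for neighbour in neighbours:
            self.edges[neighbour].add(key)
            if len(self.edges[neighbour]) > self.connectivity:
                # Drop the farthest edge so the degree stays bounded.
                farthest = max(
                    self.edges[neighbour],
                    key=lambda k: math.dist(self.vectors[neighbour], self.vectors[k]),
                )
                self.edges[neighbour].discard(farthest)

    def search(self, query, count):
        beam_width = max(count, self.expansion_search)
        return self._beam_search(query, beam_width)[:count]


points = {1: (0, 0), 2: (1, 0), 3: (0, 1), 4: (1, 1), 5: (2, 2), 6: (3, 1)}
index = ToyGraphIndex(connectivity=8, expansion_add=8, expansion_search=8)
for key, vector in points.items():
    index.add(key, vector)

results = index.search((0.9, 0.2), count=2)
nearest_keys = [key for _, key in results]
```

In this sketch, larger values of any of the three parameters trade memory or time for a better chance of finding the true nearest neighbours; with only six points and `connectivity=8` the graph is fully connected, so the search is exact.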