# Part 4. Experiment with a different embedding model

The previous parts of the research focused on developing a combined strategy using BAAI embeddings. In this part we will recreate the experiment with LLAMA3 embeddings and compare the performance.

## Tools

In [4]:
import pandas as pd

from llama_index.embeddings.ollama import OllamaEmbedding

from utils.generic import get_driver, Models, Vectors
from utils.evalutation import (
    mrr_score,
    hits_at_n_score,
)
from utils.index_search_helpers import (
    predict_with_vector_index,
    get_embedding_col_name,
    vector_index_query,
    get_combined_search_for_df,
)
from utils.index_update_helpers import batch_update_synonym_centorid_embeddings

In [2]:
driver = get_driver()

In [3]:
df = pd.read_csv('../data/processed/ncbi_specific_disease_singular_id.csv', sep=',')

## Vector indices on LLAMA3 embeddings

Before we proceed, we need to create embeddings for `Description` in the dataset.

In [8]:
embed_model = OllamaEmbedding(
    model_name="llama3",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

df['DiseaseEmbedding-llama3'] = df['Description'].apply(lambda text: embed_model.get_text_embedding(text))

In [9]:
df.to_csv('../data/processed/ncbi_specific_disease_singular_id.csv', sep=',', index=False)

We already created Llama 3 embeddings for `DiseaseName` and `Synonyms` during the data processing step. Now we can compute the centroid of the synonym embeddings for each disease and update the knowledge graph with it.

In [6]:
batch_update_synonym_centorid_embeddings(driver=driver, embedding_model_name=Models.LLAMA3.value)
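
As a reminder of what a centroid is, here is a minimal sketch, assuming the centroid is the element-wise mean of a disease's synonym embeddings. The function name `synonym_centroid` and the toy vectors are hypothetical; the actual logic lives in `batch_update_synonym_centorid_embeddings`, which is not shown in this part and may differ.

```python
def synonym_centroid(embeddings):
    """Return the element-wise mean of a list of equal-length vectors.

    Assumption: this mirrors what a "centroid of synonym embeddings" means;
    the project's own helper may handle batching or edge cases differently.
    """
    if not embeddings:
        raise ValueError("at least one embedding is required")
    length = len(embeddings[0])
    if any(len(vector) != length for vector in embeddings):
        raise ValueError("all embeddings must have the same length")
    # zip(*embeddings) iterates over the vectors dimension by dimension.
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


# Toy 3-dimensional vectors; real Llama 3 embeddings have 4096 dimensions.
toy_synonym_embeddings = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
centroid = synonym_centroid(toy_synonym_embeddings)
```

The centroid has the same number of dimensions as the input vectors, which is why the index on it can use the same `vector.dimensions` value as the index on the individual embeddings.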

Let us create a vector index for it. The Llama 3 model produces embeddings of length 4096, so we set `vector.dimensions` to that value.

In [37]:
create_llama3_synonyms_combined_vector_index_query = """
    CREATE VECTOR INDEX llama3VectorIndex_combinedSynonym IF NOT EXISTS
    FOR (d:Disease)
    ON d.`SynonymsCentroidEmbedding-llama3`
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 4096,
            `vector.similarity_function`: 'cosine'
        }
    }
"""

create_llama3_disease_name_vector_index_query = """
    CREATE VECTOR INDEX llama3VectorIndex IF NOT EXISTS
    FOR (d:Disease)
    ON d.`DiseaseEmbedding-llama3`
    OPTIONS {
        indexConfig: {
            `vector.dimensions`: 4096,
            `vector.similarity_function`: 'cosine'
        }
    }
"""
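
Both indices use `'cosine'` as `vector.similarity_function`. As a reminder of what that measures, here is a minimal plain-Python sketch. The helper name `cosine_similarity` and the toy vectors are hypothetical and are not part of the project's `utils` package. The score that Neo4j reports for an index query may be a rescaled version of this value, so it is worth checking the Neo4j documentation before interpreting a setting such as `threshold=0.8`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors for illustration only.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Cosine similarity ignores vector length and compares direction only, which is a common choice for text embeddings.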

In [11]:
with driver.session() as session:
    session.run(create_llama3_synonyms_combined_vector_index_query)

In [38]:
with driver.session() as session:
    session.run(create_llama3_disease_name_vector_index_query)

Let us test these indices.

In [44]:
predicted_values_name_vector = predict_with_vector_index(
    driver=driver,
    index="llama3VectorIndex",
    query=vector_index_query,
    dataset=df,
    embedding_col=get_embedding_col_name(Models.LLAMA3, 'DiseaseEmbedding'),
    limit=10,
    threshold=0.8
    )

In [45]:
mrr_score(predicted_values_name_vector)

0.1050902311434035

In [46]:
hits_at_n_score(predicted_values_name_vector, 1)

0.09273054199845877

In [47]:
hits_at_n_score(predicted_values_name_vector, 5)

0.12098638582070383
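
To make the two metrics concrete, here is a minimal sketch of how MRR and hits@n are commonly computed. It assumes each prediction is a pair of the true identifier and a ranked list of candidate identifiers; the actual input format of `mrr_score` and `hits_at_n_score` is not shown in this part and may differ. The function names `mrr` and `hits_at_n` and the identifiers `D1` to `D9` are hypothetical.

```python
def mrr(predictions):
    """Mean reciprocal rank; a query whose true id is not retrieved adds 0."""
    total = 0.0
    for true_id, ranked_ids in predictions:
        if true_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(true_id) + 1)
    return total / len(predictions)


def hits_at_n(predictions, n):
    """Share of queries whose true id appears among the top n candidates."""
    hits = sum(1 for true_id, ranked_ids in predictions if true_id in ranked_ids[:n])
    return hits / len(predictions)


toy_predictions = [
    ("D1", ["D1", "D7", "D3"]),  # true id at rank 1
    ("D2", ["D9", "D2", "D4"]),  # true id at rank 2
    ("D3", ["D5", "D6", "D8"]),  # true id not retrieved
    ("D4", []),                  # no candidate passed the threshold
]

toy_mrr = mrr(toy_predictions)
toy_hits_at_1 = hits_at_n(toy_predictions, 1)
toy_hits_at_5 = hits_at_n(toy_predictions, 5)
```

Under these definitions MRR always lies between hits@1 and hits@n for the largest retrieved rank, which matches the ordering of the three scores above.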

In [49]:
predicted_values_syn_vector = predict_with_vector_index(
    driver=driver,
    index="llama3VectorIndex_combinedSynonym",
    query=vector_index_query,
    dataset=df,
    embedding_col=get_embedding_col_name(Models.LLAMA3, 'DiseaseEmbedding'),
    limit=10,
    threshold=0.8
    )

In [50]:
mrr_score(predicted_values_syn_vector)

0.022950228126184975

In [51]:
hits_at_n_score(predicted_values_syn_vector, 1)

0.01592602106344721

In [52]:
hits_at_n_score(predicted_values_syn_vector, 5)

0.033650141279219115

As we can see, the results from querying each index in isolation are not very promising. Now let us apply the combined search strategy that we used for BAAI embeddings and compare the results.

The best results for BAAI embeddings, obtained with limit=100, were:
- accuracy: 0.6696634985872079 (baseline: 0.52520366598778)
- mrr score: 0.7013352845422232
- hit@5 score: 0.7467248908296943

In [5]:
combined_search_limit_10 = get_combined_search_for_df(
    dataset=df,
    embedding_col=get_embedding_col_name(Models.LLAMA3, 'DiseaseEmbedding'),
    driver=driver,
    limit=10,
    name_vec_index=Vectors.LLAMA3_DISEASE_NAME.value,
    centoid_vec_index=Vectors.LLAMA3_DISEASE_SYNONYMS_CENTROID.value
)

In [6]:
mrr_score(combined_search_limit_10)

0.64882012484761

In [7]:
hits_at_n_score(combined_search_limit_10, 1)

0.6175186231697919

In [8]:
hits_at_n_score(combined_search_limit_10, 5)

0.6884151040328795

We see an immediate improvement over the isolated indices for the same limit=10: MRR rises from 0.105 to 0.649. Now let us compare the accuracy if we allow 100 candidates.

In [9]:
combined_search_limit_100 = get_combined_search_for_df(
    dataset=df,
    embedding_col=get_embedding_col_name(Models.LLAMA3, 'DiseaseEmbedding'),
    driver=driver,
    limit=100,
    name_vec_index=Vectors.LLAMA3_DISEASE_NAME.value,
    centoid_vec_index=Vectors.LLAMA3_DISEASE_SYNONYMS_CENTROID.value
)

In [10]:
mrr_score(combined_search_limit_100)

0.6517433872628721

In [11]:
hits_at_n_score(combined_search_limit_100, 1)

0.6175186231697919

In [12]:
hits_at_n_score(combined_search_limit_100, 5)

0.6884151040328795

As we can see, the larger limit barely changes the results: MRR rises from 0.6488 to 0.6517, while hits@1 and hits@5 stay exactly the same. Let us now summarize these findings.

## Results analysis and summary

The usual hypothesis is that a larger embedding model captures the nuances of the target language better and therefore produces better results. The experiment above does not support this for Llama 3 embeddings: with the combined strategy at limit=100, Llama 3 reaches an MRR of 0.652 and hits@5 of 0.688, compared with 0.701 and 0.747 for BAAI embeddings.

These experiments confirm that the choice of embedding model influences the accuracy of the predictions. However, combining different search strategies with domain-appropriate heuristics for boosting accuracy proved to have a larger effect: for Llama 3, it raised MRR from 0.105 (disease name index alone) to 0.649 at the same limit=10.
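
For reference, the scores of the combined strategy can be put side by side with a short sketch like the one below. The numbers are copied from the cells above; treating the BAAI "accuracy" as comparable to hits@1 is an assumption, and the dictionary layout is only illustrative.

```python
# Scores copied from this notebook (combined search strategy only).
# Assumption: the "accuracy" reported for BAAI is comparable to hits@1.
results = {
    "BAAI, limit=100": {
        "hits@1": 0.6696634985872079,
        "mrr": 0.7013352845422232,
        "hits@5": 0.7467248908296943,
    },
    "Llama 3, limit=10": {
        "hits@1": 0.6175186231697919,
        "mrr": 0.64882012484761,
        "hits@5": 0.6884151040328795,
    },
    "Llama 3, limit=100": {
        "hits@1": 0.6175186231697919,
        "mrr": 0.6517433872628721,
        "hits@5": 0.6884151040328795,
    },
}

baseline = results["BAAI, limit=100"]
llama3 = results["Llama 3, limit=100"]
# Negative values mean Llama 3 scores below BAAI on that metric.
gap = {metric: llama3[metric] - baseline[metric] for metric in baseline}

for name, scores in results.items():
    row = "  ".join(f"{metric}={value:.3f}" for metric, value in scores.items())
    print(f"{name:<20} {row}")
print("Llama 3 minus BAAI (limit=100):", {m: round(d, 3) for m, d in gap.items()})
```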