# Cosine Similarity between Two Word Embeddings



In [10]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertModel
from scipy.spatial.distance import cosine
import re

## Find the 5 Most Similar Entities with BERT
Our statistics on similarity matching show that extracting the 5 most similar entities covers most of the example questions.

I tried both BERT and spaCy models for entity similarity matching. bert-base-uncased places the most relevant entities within the top five for most questions, whereas the other models do not; this is a significant improvement. For example, the similarity results from spaCy's en_core_web_lg are largely irrelevant to the given entities.




In [1]:
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


In [2]:
def bert_encode(text):
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        output = model(**encoded_input)
    embedding = output.last_hidden_state.mean(dim=1).squeeze()
    return embedding

The cosine similarity between the two embeddings is computed as 1 - cosine_distance. The score lies between -1 and 1: 1 means the embeddings point in the same direction, 0 means they are orthogonal (unrelated), and negative values mean they point in opposing directions.
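
As a quick sanity check of the relation above, the following sketch computes cosine similarity directly from its definition in plain Python. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions and not part of the notebook; `scipy.spatial.distance.cosine` returns the distance, that is, 1 minus this value.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # anti-parallel vectors

print(same, orthogonal, opposite)
```

Parallel vectors score close to 1, perpendicular vectors score 0, and anti-parallel vectors score close to -1, so any threshold on the score should be chosen with the full [-1, 1] range in mind.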

In [3]:
def compute_similarity(text1, text2):
    # Encode the texts
    embedding1 = bert_encode(text1)
    embedding2 = bert_encode(text2)
    # Convert the embeddings to 1-D NumPy arrays; .cpu() is a no-op for tensors already on the CPU
    embedding1_np = embedding1.cpu().numpy()
    embedding2_np = embedding2.cpu().numpy()
    # Compute cosine similarity
    return 1 - cosine(embedding1_np, embedding2_np)


This function extracts abbreviations from text within parentheses. If no parentheses are found, it returns the original text.

In [4]:
def extract_abbreviation(text):
    if isinstance(text, str):
        match = re.search(r'\(([^)]+)\)', text)
        return match.group(1) if match else text
    else:
        return text
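
A short usage sketch of the function above may help. The function body is repeated so the snippet runs on its own; the sample labels are taken from the entity lists shown in this notebook, and the last call illustrates the pass-through for non-string input.

```python
import re

def extract_abbreviation(text):
    # Same logic as the notebook cell: return the first parenthesised group, if any
    if isinstance(text, str):
        match = re.search(r'\(([^)]+)\)', text)
        return match.group(1) if match else text
    return text

print(extract_abbreviation('Molecular Dynamics (MD)'))  # MD
print(extract_abbreviation('MaterialsMine (MM)'))       # MM
print(extract_abbreviation('pyiron'))                   # pyiron
print(extract_abbreviation(None))                       # None
```

Note that any parenthesised text is treated as an abbreviation, so a label with an explanatory remark in parentheses would be reduced to that remark.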

In [8]:
def find_most_similar_entities(df, entity_column):
    # Read data from an Excel file
    df_entity = pd.read_excel('/content/extracted_headentity_list.xlsx')
    df_entity['entity_abb'] = df_entity['entity_label'].apply(extract_abbreviation)
    df_entity['entity_lowercase'] = df_entity['entity_abb'].str.lower()

    def process_entity(question_entity):
        similarity_scores = {}
        threshold = 0.5  # You may need to adjust this based on BERT's behavior

        for index, row in df_entity.iterrows():
            if pd.isna(row['entity_lowercase']):
                continue

            similarity = compute_similarity(question_entity.lower(), row['entity_lowercase'])
            if similarity > threshold:
                similarity_scores[index] = similarity

        top_5_similarities = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)[:5]

        if top_5_similarities:
            similar_entities = [df_entity.at[index, 'entity_label'] for index, _ in top_5_similarities]
            entity_uris = [df_entity.at[index, 'entity_uri'] for index, _ in top_5_similarities]
            return similar_entities, entity_uris
        else:
            return None, None

    def process_entity_list(entity_list):
        similar_entities_list = []
        entity_uris_list = []

        for entity in entity_list.split(','):
            entity = entity.strip()
            if entity:
                similar_entities, entity_uris = process_entity(entity)
                if similar_entities and entity_uris:
                    similar_entities_list.extend(similar_entities)
                    entity_uris_list.extend(entity_uris)

        if similar_entities_list:
            return similar_entities_list, entity_uris_list
        else:
            return None, None

    results = df[entity_column].apply(lambda x: process_entity(x) if isinstance(x, str) and ',' not in x else process_entity_list(x) if isinstance(x, str) else (None, None))
    df['Similar Entities'] = results.apply(lambda x: x[0] if x else None)
    df['Entity URIs'] = results.apply(lambda x: x[1] if x else None)
    return df

df = pd.read_excel('EntityandRelationfromQuestion.xlsx')
# df = df.head(2)  # Uncomment to try only the first two rows; running the whole dataset takes a long time
# 'Named Entities' is the column of df that holds the entities extracted from each question
df = find_most_similar_entities(df, 'Named Entities')  # Replace 'Named Entities' with the actual column name

# Save the modified DataFrame to an Excel file
df.to_excel("SimilarEntities5.xlsx", index=False)
df

Unnamed: 0,Text,Named Entities,Predicate Verbs,Similar Entities,Entity URIs
0,Who is working in the Computational Materials ...,the Computational Materials Science,work,"[computational materials science, Computationa...","[http://demo.fiz-karlsruhe.de/matwerk/E67431, ..."
1,What are the research projects associated to E...,EMMO,research project associate,[Elemental Multiperspective Material Ontology ...,[http://demo.fiz-karlsruhe.de/matwerk/E1126751...
2,"Who are the contributors of the data ""datasets""?",datasets,contributor,"[datasets, dataset, Image data, data portal]",[http://demo.fiz-karlsruhe.de/matwerk/E1172216...
3,"Who is working with Researcher ""Ebrahim Norouz...",Ebrahim Norouzi,work,"[Ebrahim Norouzi, Ahmad Zainul Ihsan, Mirza Mo...","[http://demo.fiz-karlsruhe.de/matwerk/E15879, ..."
4,"Who is the email address of ""ParaView""?",ParaView,email address,"[paraview, ParaView, data portal, dataset]",[http://demo.fiz-karlsruhe.de/matwerk/E1231097...
5,What are the affiliations of Volker Hofmann?,Volker Hofmann,affiliation,"[Dr. Volker Hofmann, Niklas Siemer, Dr. Tilma...","[http://demo.fiz-karlsruhe.de/matwerk/E9912, h..."
6,"What is ""Molecular Dynamics"" Software? List th...","Molecular Dynamics"" Software?","programming language , documentation page , r...","[Atomic Simulation Recipes, Workshop: From Ele...","[http://demo.fiz-karlsruhe.de/matwerk/E552776,..."
7,What are pre- and post-processing tools for MD...,MD,pre- and post - processing tool,"[Molecular Dynamics (MD), Crystallography Open...","[http://demo.fiz-karlsruhe.de/matwerk/E61379, ..."
8,What are some workflow environments for comput...,computational materials science,some workflow environment,"[computational materials science, Computationa...","[http://demo.fiz-karlsruhe.de/matwerk/E67431, ..."
9,How should I cite pyiron?,pyiron,cite,"[Pyiron, Pyrho, pyDOE, cython]","[http://demo.fiz-karlsruhe.de/matwerk/E457491,..."


Note: The similarity matching above is not always effective. For example, for the question entity "BAM reference data", the knowledge-graph entity 'BAM reference data: results of ASTM E139 -11 creep tests on a reference material of Nimonic 75 nickel-base alloy' should appear among the top 5 similar entities. However, the top five matches may not contain this correct answer.


This suggests that the similarity matching method needs improvement, which we leave as future work. For now, we simply add entities from the knowledge graph whose labels contain the entity mentioned in the question.
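
The fallback idea can be sketched as a plain substring search over the knowledge-graph labels. This is a hedged variation rather than the notebook's exact logic: the helper name `substring_matches` is hypothetical, it ignores case, and it returns every matching label, whereas the cell below is case-sensitive and keeps only the last match.

```python
def substring_matches(named_entity, labels):
    """Return every label that contains the named entity, ignoring case."""
    needle = named_entity.strip().lower()
    if not needle:
        return []  # an empty query would otherwise match every label
    return [label for label in labels
            if isinstance(label, str) and needle in label.lower()]

labels = [
    'BAM reference data: results of ASTM E139 -11 creep tests on a reference '
    'material of Nimonic 75 nickel-base alloy',
    'Image data',
    None,  # missing labels are skipped
]
matches = substring_matches('BAM reference data', labels)
print(matches)
```

A substring test recovers long labels that embedding similarity ranks poorly, at the cost of also matching very short entity names inside unrelated labels.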



In [9]:
import pandas as pd
import ast  # Add this import to use ast.literal_eval

# Load the entity list and the results saved by the previous step
df_entity = pd.read_excel('/content/extracted_headentity_list.xlsx')
df = pd.read_excel('SimilarEntities5.xlsx')


# Function to safely convert string representations of lists to actual lists
def safe_list_eval(cell_value):
    # Cells that are not strings (e.g. NaN for missing values) cannot be parsed
    if not isinstance(cell_value, str):
        return [] if pd.isna(cell_value) else [cell_value]
    try:
        # Attempt to evaluate the string as a list
        return ast.literal_eval(cell_value)
    except (ValueError, SyntaxError):
        # Not a string representation of a list: return a list with the original cell_value
        return [cell_value] if cell_value else []

# Convert the 'Similar Entities' and 'Entity URIs' cells from strings back to lists
for i in range(len(df)):
    df.at[i, 'Similar Entities'] = safe_list_eval(df.at[i, 'Similar Entities'])
    df.at[i, 'Entity URIs'] = safe_list_eval(df.at[i, 'Entity URIs'])

# Iterate through each element in the "Named Entities" column of df
for i, named_entities in enumerate(df['Named Entities']):
    indexes = []
    # Iterate through df_entity and check each element
    for index, row in df_entity.iterrows():
        word = row['entity_label']
        if isinstance(word, str) and named_entities in word:
            indexes.append(index)

    if indexes:
        matched_entity_label = df_entity.at[indexes[-1], 'entity_label']
        matched_entity_uri = df_entity.at[indexes[-1], 'entity_uri']

        if matched_entity_label not in df.at[i, 'Similar Entities']:
            df.at[i, 'Similar Entities'].append(matched_entity_label)
            df.at[i, 'Entity URIs'].append(matched_entity_uri)

# Save the modified DataFrame to an Excel file
df.to_excel("SimilarEntities5+1.xlsx", index=False)
df

Unnamed: 0,Text,Named Entities,Predicate Verbs,Similar Entities,Entity URIs
0,Who is working in the Computational Materials ...,the Computational Materials Science,work,"[computational materials science, Computationa...","[http://demo.fiz-karlsruhe.de/matwerk/E67431, ..."
1,What are the research projects associated to E...,EMMO,research project associate,[Elemental Multiperspective Material Ontology ...,[http://demo.fiz-karlsruhe.de/matwerk/E1126751...
2,"Who are the contributors of the data ""datasets""?",datasets,contributor,"[datasets, dataset, Image data, data portal, F...",[http://demo.fiz-karlsruhe.de/matwerk/E1172216...
3,"Who is working with Researcher ""Ebrahim Norouz...",Ebrahim Norouzi,work,"[Ebrahim Norouzi, Ahmad Zainul Ihsan, Mirza Mo...","[http://demo.fiz-karlsruhe.de/matwerk/E15879, ..."
4,"Who is the email address of ""ParaView""?",ParaView,email address,"[paraview, ParaView, data portal, dataset]",[http://demo.fiz-karlsruhe.de/matwerk/E1231097...
5,What are the affiliations of Volker Hofmann?,Volker Hofmann,affiliation,"[Dr. Volker Hofmann, Niklas Siemer, Dr. Tilma...","[http://demo.fiz-karlsruhe.de/matwerk/E9912, h..."
6,"What is ""Molecular Dynamics"" Software? List th...","Molecular Dynamics"" Software?","programming language , documentation page , r...","[Atomic Simulation Recipes, Workshop: From Ele...","[http://demo.fiz-karlsruhe.de/matwerk/E552776,..."
7,What are pre- and post-processing tools for MD...,MD,pre- and post - processing tool,"[Molecular Dynamics (MD), Crystallography Open...","[http://demo.fiz-karlsruhe.de/matwerk/E61379, ..."
8,What are some workflow environments for comput...,computational materials science,some workflow environment,"[computational materials science, Computationa...","[http://demo.fiz-karlsruhe.de/matwerk/E67431, ..."
9,How should I cite pyiron?,pyiron,cite,"[Pyiron, Pyrho, pyDOE, cython, https://github....","[http://demo.fiz-karlsruhe.de/matwerk/E457491,..."
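
When the results are saved to Excel and read back, list-valued cells come back as strings such as `"['Pyiron', 'Pyrho']"`, which is why `safe_list_eval` is applied above. The snippet below is a small standalone sketch of that round trip using only `ast.literal_eval`; the sample values are illustrative.

```python
import ast

original = ['Pyiron', 'Pyrho', 'pyDOE']
as_text = str(original)               # how a list cell looks after being stored as text
restored = ast.literal_eval(as_text)  # parses Python literals only, never runs code

# Anything that is not a plain literal is rejected instead of being executed
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(restored, rejected)
```

This is the reason to prefer `ast.literal_eval` over `eval` when parsing cell contents that come from a file.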


## Find the 7 Most Similar Relations with spaCy, plus "description" and "type"
Our statistics on similarity matching show that extracting the 7 most similar relations covers most of the example questions. In addition, the answers to some questions are contained in descriptive text, so "mwo:description" and "dcterms:description" are always added, along with "rdf:type".

Here, spaCy's en_core_web_lg model is more effective for relation matching than the BERT model.

In [6]:
!python -m spacy download en_core_web_lg



In [9]:
import spacy
def find_most_similar_relationships(df, relationship_column):
    # Load the spaCy English model
    nlp = spacy.load("en_core_web_lg")

    # Read data from an Excel file
    df_relationship = pd.read_excel('/content/extracted_relation_list.xlsx')

    # Lowercase the 'Predicate readable' column and save it in a new column 'relationship_lowercase'
    df_relationship['relationship_lowercase'] = df_relationship['Predicate readable'].str.lower()

    # Define a function to remove specific words and plural 's'
    def preprocess_text(text):
        # Remove specific words
        words_to_remove = {'has', 'is', 'of', 'in'}
        tokens = text.split()
        tokens = [word for word in tokens if word not in words_to_remove]

        # Remove trailing 's' for plurals
        processed_text = ' '.join(tokens)
        if processed_text.endswith('s'):
            processed_text = processed_text[:-1]

        return processed_text

    # Apply preprocessing to 'relationship_lowercase'
    df_relationship['cleaned_relationship'] = df_relationship['relationship_lowercase'].apply(preprocess_text)

    # Function to find most similar relationships
    def process_relationship(question_relationship):
        if not isinstance(question_relationship, str):
            return None, None, None

        question_word = nlp(preprocess_text(question_relationship.lower()))

        # Initialize a dictionary to store similarity scores
        similarity_scores = {}

        # Set a similarity threshold
        threshold = 0.6

        # Iterate through each relation in the dataset and calculate its similarity to the question relation
        for index, row in df_relationship.iterrows():
            # Skip if the relation is NaN
            if pd.isna(row['cleaned_relationship']):
                continue

            word2 = nlp(row['cleaned_relationship'])
            similarity = question_word.similarity(word2)

            # Only store relations with similarity scores at or above the threshold
            if similarity >= threshold:
                similarity_scores[index] = similarity

        # Find the top 9 highest similarity scores
        top_9_similarities = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)[:9]
        # For Q3 (Who is the email address of "ParaView"?), we extracted the relation "email address";
        # the correct relation "mwo:hasContactPoint" is only the 7th most similar relation.

        # For Q10 (How should I cite pyiron?),
        # the correct relation "schema:citation" is only the 9th most similar relation.

        similar_relationships = []
        relationship_uris = []
        relationship_uris_withNS = []

        for index, similarity_score in top_9_similarities:
            if 0 <= index < len(df_relationship):
                similar_relationships.append(df_relationship.at[index, 'Predicate readable'])
                relationship_uris.append(df_relationship.at[index, 'Predicate_uri'])
                relationship_uris_withNS.append(df_relationship.at[index, 'Predicate with Namespace'])
        if 'mwo:description' not in relationship_uris_withNS:
            relationship_uris_withNS.append('mwo:description')
        if 'dcterms:description' not in relationship_uris_withNS:
            relationship_uris_withNS.append('dcterms:description')
        if 'rdf:type' not in relationship_uris_withNS:
            relationship_uris_withNS.append('rdf:type')

        return similar_relationships, relationship_uris, relationship_uris_withNS

    # Apply to each entity in the provided column of df
    results = df[relationship_column].apply(lambda x: process_relationship(x))
    df['Similar Relations'] = results.apply(lambda x: x[0] if x else None)
    df['Relation URIs'] = results.apply(lambda x: x[1] if x else None)
    df['Relation_uris_with_Namespace'] = results.apply(lambda x: x[2] if x else None)

    return df

df = pd.read_excel("SimilarEntities5+1.xlsx")
# 'Predicate Verbs' is the column of df that holds the relation phrases extracted from each question
df = find_most_similar_relationships(df, 'Predicate Verbs')

# To save the modified DataFrame:
df.to_excel("phrase_similarity5+1X9.xlsx", index=False)

df



Unnamed: 0,Text,Named Entities,Predicate Verbs,Similar Entities,Entity URIs,Similar Relations,Relation URIs,Relation_uris_with_Namespace
0,Who is working in the Computational Materials ...,the Computational Materials Science,work,"['computational materials science', 'Computati...",['http://demo.fiz-karlsruhe.de/matwerk/E67431'...,"[has work package, has expertise in, has fundi...",[https://nfdi.fiz-karlsruhe.de/ontology/exampl...,"[mwo:hasWorkPackage, mwo:hasExpertiseIn, nfdic..."
1,What are the research projects associated to E...,EMMO,research project associate,['Elemental Multiperspective Material Ontology...,['http://demo.fiz-karlsruhe.de/matwerk/E112675...,"[has related Project, related participant proj...","[https://schema.org/dateCreated, http://nfdi.f...","[nfdicore:relatedProject, mwo:relatedParticipa..."
2,"Who are the contributors of the data ""datasets""?",datasets,contributor,"['datasets', 'dataset', 'Image data', 'data po...",['http://demo.fiz-karlsruhe.de/matwerk/E117221...,"[has contributor, related participant project ...",[http://www.geneontology.org/formats/oboInOwl#...,"[mwo:hasContributor, mwo:relatedParticipantPro..."
3,"Who is working with Researcher ""Ebrahim Norouz...",Ebrahim Norouzi,work,"['Ebrahim Norouzi', 'Ahmad Zainul Ihsan', 'Mir...",['http://demo.fiz-karlsruhe.de/matwerk/E15879'...,"[has work package, has expertise in, has fundi...",[https://nfdi.fiz-karlsruhe.de/ontology/exampl...,"[mwo:hasWorkPackage, mwo:hasExpertiseIn, nfdic..."
4,"Who is the email address of ""ParaView""?",ParaView,email address,"['paraview', 'ParaView', 'data portal', 'datas...",['http://demo.fiz-karlsruhe.de/matwerk/E123109...,"[has email address , has postal address, has w...","[http://purl.obolibrary.org/obo/IAO_0000119, h...","[mwo:emailAddress, mwo:hasPostalAddress, mwo:h..."
5,What are the affiliations of Volker Hofmann?,Volker Hofmann,affiliation,"['Dr. Volker Hofmann', 'Niklas Siemer', 'Dr. ...","['http://demo.fiz-karlsruhe.de/matwerk/E9912',...","[has affiliation, has curation status, has par...",[http://purls.helmholtz-metadaten.de/mwo/isOnl...,"[mwo:hasAffiliation, ns2:IAO_0000114, nfdicore..."
6,"What is ""Molecular Dynamics"" Software? List th...","Molecular Dynamics"" Software?","programming language , documentation page , r...","['Atomic Simulation Recipes', 'Workshop: From ...",['http://demo.fiz-karlsruhe.de/matwerk/E552776...,"[has documentation, has bibliographic citation...",[https://w3id.org/scholarlydata/ontology/confe...,"[mwo:hasDocumentation, dcterms:bibliographicCi..."
7,What are pre- and post-processing tools for MD...,MD,pre- and post - processing tool,"['Molecular Dynamics (MD)', 'Crystallography O...",['http://demo.fiz-karlsruhe.de/matwerk/E61379'...,"[required tool, has related resource, related ...",[http://purls.helmholtz-metadaten.de/mwo/hasWe...,"[mwo:requiredTool, mwo:hasRelatedResource, mwo..."
8,What are some workflow environments for comput...,computational materials science,some workflow environment,"['computational materials science', 'Computati...",['http://demo.fiz-karlsruhe.de/matwerk/E67431'...,"[has some values from, has work package, has r...",[http://purls.helmholtz-metadaten.de/mwo/hasWo...,"[owl:someValuesFrom, mwo:hasWorkPackage, mwo:h..."
9,How should I cite pyiron?,pyiron,cite,"['Pyiron', 'Pyrho', 'pyDOE', 'cython', 'Pyreti...",['http://demo.fiz-karlsruhe.de/matwerk/E457491...,"[has annotated source , has bibliographic cita...",[http://emmo.info/emmo#EMMO_967080e5_2f42_4eb2...,"[owl:annotatedSource, dcterms:bibliographicCit..."
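
One detail of the `preprocess_text` helper in the cell above is worth illustrating: it strips a trailing "s" from the whole phrase, which also shortens words that are not plurals. The sketch below reproduces that logic and compares it with a hypothetical guarded variant (`preprocess_text_guarded` is an assumption, not part of the notebook) that leaves words ending in "ss" untouched.

```python
STOP_WORDS = {'has', 'is', 'of', 'in'}

def preprocess_text(text):
    # Mirrors the notebook helper: drop stop words, then strip one trailing 's'
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    processed = ' '.join(tokens)
    return processed[:-1] if processed.endswith('s') else processed

def preprocess_text_guarded(text):
    # Hypothetical variant: keep the trailing 's' of words ending in 'ss'
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    processed = ' '.join(tokens)
    if processed.endswith('s') and not processed.endswith('ss'):
        processed = processed[:-1]
    return processed

print(preprocess_text('has email address'))          # email addres
print(preprocess_text_guarded('has email address'))  # email address
print(preprocess_text_guarded('research projects'))  # research project
```

Because the same preprocessing is applied to both the question phrase and the relation labels, the truncation is consistent on both sides, but a truncated token such as "addres" may have no useful word vector. The guard is only a heuristic; lemmatizing with spaCy's `token.lemma_` might be a more robust alternative.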


## Sentence Similarity between the given question and descriptions
We find that the entity descriptions can contain the answers to certain questions, so we perform sentence similarity matching between the given question and each entity description to find the N most similar entities, which contain relevant information. It is not easy to find the correct N, because the similarity is easily affected by the
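
The selection step described above, keeping only candidates above a similarity threshold and then taking the N best, can be sketched independently of BERT. The function name `top_n_above_threshold`, the scores, and the labels are illustrative assumptions.

```python
import heapq

def top_n_above_threshold(scored_items, n, threshold):
    """Keep items scoring above `threshold`, then return the n best, highest first."""
    kept = [(score, item) for score, item in scored_items if score > threshold]
    return heapq.nlargest(n, kept, key=lambda pair: pair[0])

# Toy scores; the labels are placeholders, not real knowledge-graph entities
scored = [(0.77, 'entity A'), (0.69, 'entity B'), (0.75, 'entity C'), (0.74, 'entity D')]

best_two = top_n_above_threshold(scored, n=2, threshold=0.7)
best_eight = top_n_above_threshold(scored, n=8, threshold=0.7)
print(best_two)
print(best_eight)
```

With a large N the threshold becomes the effective limit, and with a low threshold N does; the two parameters therefore need to be tuned together.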

In [29]:
# Questions whose answers are actually contained in the entity descriptions
texts = [
    'What is "Molecular Dynamics" Software? List the programming language, documentation page, repository, and license information.', #Q7
    'What are pre- and post-processing tools for MD simulations?', #Q8
    'What are some workflow environments for computational materials science?', #Q9
    'Where can I find a list of interatomic potentials?', #Q11
    'What are python libraries used for calculating local atomic structural environment?', #Q12
    'What are the electronic lab notebooks available?', #Q13
    'What are the software for Molecular Dynamics (MD)?', #Q14
    'What are the ontologies in nanomaterials domain?', #Q15
    'What is DAMASK?', #Q16
    'What are the data portals for materials science ontologies?',#Q17
    'What are the instruments for APT?', #Q18
    'In which institution can I find tomography equipment?',#Q19
    'What are the educational resources for Ontology?',
    'What is the API of Materials Project?',
    'Which simulation software have a python API?',
    'What is the documentation of the "MatDB Online"?',
    'What are the types of software licenses?',
    'What are the software used to produce the data in the Materials Cloud repository?',
    'What are datasets produced by the BAM organization?',
    'What are some available datasets of mechanical properties of steels?',
    'What are datasets related to "Transmission electron microscopy"?',
    'What is the license of the dataset "Elastic Constant Demo Data"?',
    'What is the repository for "BAM reference data"?',
    'What are the different data formats in the "BAM reference data"?',
    'What is the software version of "pacemaker"?',
    'What is the field of research "BAM reference data"?',
    'What is the description of the "BAM reference data"?',
    'What are the datasets produced in 2022?',
    'Who are the creators of the "BAM reference data"?',
    'What are the datasets published by "BAM"?'

]

In [30]:
import pandas as pd
from transformers import BertTokenizer, BertModel
import torch
from scipy.spatial.distance import cosine

# Load your DataFrame
df_des = pd.read_excel('/content/extracted_headentity_list.xlsx')

# Function to load the model and tokenizer
def get_model_and_tokenizer(model_name="bert-base-uncased"):
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    return model, tokenizer

# Function to encode a sentence
def encode_sentence(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

# Function to compare two sentences
def compare_sentences(sentence1, sentence2, model, tokenizer):
    embedding1 = encode_sentence(sentence1, model, tokenizer)
    embedding2 = encode_sentence(sentence2, model, tokenizer)
    similarity = 1 - cosine(embedding1[0].numpy(), embedding2[0].numpy())
    return similarity

# Load the pre-trained model and tokenizer
model, tokenizer = get_model_and_tokenizer()

# Create an empty DataFrame to store the results
results_df = pd.DataFrame(columns=["Text", "Similar Entities", "Entity URIs", "Similarities"])
threshold = 0.7  # Define the similarity score threshold

# For each sentence in texts
for sentence1 in texts:
    similarities = []
    # Compute similarity with each sentence in df_des
    for idx, row in df_des.iterrows():
        sentence2 = str(row["entity_description"])  # Ensure sentence2 is a string
        similarity = compare_sentences(sentence1, sentence2, model, tokenizer)
        if similarity > threshold:  # Only consider similarities above the threshold
            similarities.append((similarity, idx))

    # Continue only if there are similarities above the threshold
    if similarities:
        # Sort indices by similarity and select top results
        top_indices = sorted(similarities, key=lambda x: x[0], reverse=True)[:8]

        # Extract similar entities, their URIs, and similarities
        # idx is an index label from iterrows(), so look it up by label
        similar_entities = [df_des.at[idx, "entity_label"] for _, idx in top_indices]
        entity_uris = [df_des.at[idx, "entity_uri"] for _, idx in top_indices]
        similarity_scores = [sim[0] for sim in top_indices]

        # Add to the results DataFrame
        results_df = pd.concat([results_df, pd.DataFrame({"Text": [sentence1],
                                                          "Similar Entities": [similar_entities],
                                                          "Entity URIs": [entity_uris],
                                                          "Similarities": [similarity_scores]})], ignore_index=True)

# To save the modified DataFrame:
results_df.to_excel("sentence_similarity.xlsx", index=False)

# Print out the sentences with their similarities above the threshold
for index, row in results_df.iterrows():
    print(f"Text: {row['Text']}")
    for i, (entity, uri, similarity) in enumerate(zip(row['Similar Entities'], row['Entity URIs'], row['Similarities'])):
        print(f"{i+1}. Entity: {entity}, URI: {uri}, Similarity: {similarity}")
    print("\n")



Text: Who is working in the Computational Materials Science field?
1. Entity: Materials Design Ontology (MDO), URI: http://demo.fiz-karlsruhe.de/matwerk/E1131257, Similarity: 0.7672711610794067
2. Entity: Open Materials Database, URI: http://demo.fiz-karlsruhe.de/matwerk/E643412, Similarity: 0.7573269009590149
3. Entity: Pyiron YouTube channel, URI: http://demo.fiz-karlsruhe.de/matwerk/E1245283, Similarity: 0.7545511722564697
4. Entity: MaterialsProject, URI: http://demo.fiz-karlsruhe.de/matwerk/E1025597, Similarity: 0.7534968256950378
5. Entity: MaterialsMine (MM), URI: http://demo.fiz-karlsruhe.de/matwerk/E1128252, Similarity: 0.7513790726661682
6. Entity: Thermo-Calc, URI: http://demo.fiz-karlsruhe.de/matwerk/E451776, Similarity: 0.751322865486145
7. Entity: Avogadro, URI: http://demo.fiz-karlsruhe.de/matwerk/E472907, Similarity: 0.743194580078125
8. Entity: Polymer Genome, URI: http://demo.fiz-karlsruhe.de/matwerk/E1066071, Similarity: 0.7423355579376221


Text: Who are the contrib

In [11]:
df_sentence = pd.read_excel('/content/SimilarEntities5+1.xlsx')
df_sentence

Unnamed: 0,Text,Named Entities,Predicate Verbs,Similar Entities,Entity URIs
0,Who is working in the Computational Materials ...,the Computational Materials Science,work,"['computational materials science', 'Computati...",['http://demo.fiz-karlsruhe.de/matwerk/E67431'...
1,What are the research projects associated to E...,EMMO,research project associate,['Elemental Multiperspective Material Ontology...,['http://demo.fiz-karlsruhe.de/matwerk/E112675...
2,"Who are the contributors of the data ""datasets""?",datasets,contributor,"['datasets', 'dataset', 'Image data', 'data po...",['http://demo.fiz-karlsruhe.de/matwerk/E117221...
3,"Who is working with Researcher ""Ebrahim Norouz...",Ebrahim Norouzi,work,"['Ebrahim Norouzi', 'Ahmad Zainul Ihsan', 'Mir...",['http://demo.fiz-karlsruhe.de/matwerk/E15879'...
4,"Who is the email address of ""ParaView""?",ParaView,email address,"['paraview', 'ParaView', 'data portal', 'datas...",['http://demo.fiz-karlsruhe.de/matwerk/E123109...
5,What are the affiliations of Volker Hofmann?,Volker Hofmann,affiliation,"['Dr. Volker Hofmann', 'Niklas Siemer', 'Dr. ...","['http://demo.fiz-karlsruhe.de/matwerk/E9912',..."
6,"What is ""Molecular Dynamics"" Software? List th...","Molecular Dynamics"" Software?","programming language , documentation page , r...","['Atomic Simulation Recipes', 'Workshop: From ...",['http://demo.fiz-karlsruhe.de/matwerk/E552776...
7,What are pre- and post-processing tools for MD...,MD,pre- and post - processing tool,"['Molecular Dynamics (MD)', 'Crystallography O...",['http://demo.fiz-karlsruhe.de/matwerk/E61379'...
8,What are some workflow environments for comput...,computational materials science,some workflow environment,"['computational materials science', 'Computati...",['http://demo.fiz-karlsruhe.de/matwerk/E67431'...
9,How should I cite pyiron?,pyiron,cite,"['Pyiron', 'Pyrho', 'pyDOE', 'cython', 'Pyreti...",['http://demo.fiz-karlsruhe.de/matwerk/E457491...


In [13]:
df_word = pd.read_excel('/content/phrase_similarity5+1X9.xlsx')
df_sentence = pd.read_excel('sentence_similarity.xlsx')
df_sentence.drop(columns=['Similarities'], inplace=True)

import ast

# The list columns are read back from Excel as strings; convert them to lists if they are not lists already
for col in ['Similar Entities', 'Entity URIs']:
    if not isinstance(df_word[col].iloc[0], list):
        df_word[col] = df_word[col].apply(ast.literal_eval)
    if not isinstance(df_sentence[col].iloc[0], list):
        df_sentence[col] = df_sentence[col].apply(ast.literal_eval)


# Perform the merge operation ensuring all entries from df_word are retained
df_merged = pd.merge(df_sentence, df_word, on='Text', how='right', suffixes=('_sentence', '_word'))

# Combine 'Similar Entities' and 'Entity URIs' columns while removing duplicates
df_merged['Similar Entities'] = df_merged.apply(
    lambda row: list(set((row['Similar Entities_sentence'] if isinstance(row['Similar Entities_sentence'], list) else []) +
                         (row['Similar Entities_word'] if isinstance(row['Similar Entities_word'], list) else []))),
    axis=1
)
df_merged['Entity URIs'] = df_merged.apply(
    lambda row: list(set((row['Entity URIs_sentence'] if isinstance(row['Entity URIs_sentence'], list) else []) +
                         (row['Entity URIs_word'] if isinstance(row['Entity URIs_word'], list) else []))),
    axis=1
)

# Drop the now redundant columns
df_merged.drop(columns=['Similar Entities_sentence', 'Similar Entities_word',
                        'Entity URIs_sentence', 'Entity URIs_word'], inplace=True)

# Save the merged DataFrame and display it
df_merged.to_excel('relevant_entities_relations.xlsx', index=False)

df_merged

Unnamed: 0,Text,Named Entities,Predicate Verbs,Similar Relations,Relation URIs,Relation_uris_with_Namespace,Similar Entities,Entity URIs
0,Who is working in the Computational Materials ...,the Computational Materials Science,work,"['has work package', 'has expertise in', 'has ...",['https://nfdi.fiz-karlsruhe.de/ontology/examp...,"['mwo:hasWorkPackage', 'mwo:hasExpertiseIn', '...","[Computational Material Science, computational...","[http://demo.fiz-karlsruhe.de/matwerk/E67431, ..."
1,What are the research projects associated to E...,EMMO,research project associate,"['has related Project', 'related participant p...","['https://schema.org/dateCreated', 'http://nfd...","['nfdicore:relatedProject', 'mwo:relatedPartic...","[ruby, R. S. Elliott and E. B. Tadmor, ""Knowle...","[http://demo.fiz-karlsruhe.de/matwerk/E837572,..."
2,"Who are the contributors of the data ""datasets""?",datasets,contributor,"['has contributor', 'related participant proje...",['http://www.geneontology.org/formats/oboInOwl...,"['mwo:hasContributor', 'mwo:relatedParticipant...","[Image data, Framework for curation and distri...",[http://demo.fiz-karlsruhe.de/matwerk/E1196832...
3,"Who is working with Researcher ""Ebrahim Norouz...",Ebrahim Norouzi,work,"['has work package', 'has expertise in', 'has ...",['https://nfdi.fiz-karlsruhe.de/ontology/examp...,"['mwo:hasWorkPackage', 'mwo:hasExpertiseIn', '...","[Dr. Amir Laadhar, Mirza Mohtashim Alam, Ahmad...","[http://demo.fiz-karlsruhe.de/matwerk/E10181, ..."
4,"Who is the email address of ""ParaView""?",ParaView,email address,"['has email address ', 'has postal address', '...","['http://purl.obolibrary.org/obo/IAO_0000119',...","['mwo:emailAddress', 'mwo:hasPostalAddress', '...","[paraview, data portal, Standardised documenta...",[http://demo.fiz-karlsruhe.de/matwerk/E1231097...
5,What are the affiliations of Volker Hofmann?,Volker Hofmann,affiliation,"['has affiliation', 'has curation status', 'ha...",['http://purls.helmholtz-metadaten.de/mwo/isOn...,"['mwo:hasAffiliation', 'ns2:IAO_0000114', 'nfd...","[Prof. Dr. Jörg Neugebauer, Markus Schilling,...","[http://demo.fiz-karlsruhe.de/matwerk/E33641, ..."
6,"What is ""Molecular Dynamics"" Software? List th...","Molecular Dynamics"" Software?","programming language , documentation page , r...","['has documentation', 'has bibliographic citat...",['https://w3id.org/scholarlydata/ontology/conf...,"['mwo:hasDocumentation', 'dcterms:bibliographi...","[controlled vocabulary, Atomic Simulation Reci...","[http://demo.fiz-karlsruhe.de/matwerk/E63482, ..."
7,What are pre- and post-processing tools for MD...,MD,pre- and post - processing tool,"['required tool', 'has related resource', 'rel...",['http://purls.helmholtz-metadaten.de/mwo/hasW...,"['mwo:requiredTool', 'mwo:hasRelatedResource',...","[MDTraj, Pizza.py Toolkit, Silicon, Extensible...","[http://demo.fiz-karlsruhe.de/matwerk/E469997,..."
8,What are some workflow environments for comput...,computational materials science,some workflow environment,"['has some values from', 'has work package', '...",['http://purls.helmholtz-metadaten.de/mwo/hasW...,"['owl:someValuesFrom', 'mwo:hasWorkPackage', '...","[Computational Material Science, computational...",[http://demo.fiz-karlsruhe.de/matwerk/E1066071...
9,How should I cite pyiron?,pyiron,cite,"['has annotated source ', 'has bibliographic c...",['http://emmo.info/emmo#EMMO_967080e5_2f42_4eb...,"['owl:annotatedSource', 'dcterms:bibliographic...","[pyDOE, cython, Pyretis, Pyrho, https://github...","[http://demo.fiz-karlsruhe.de/matwerk/E598872,..."
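
One caveat about the merge cell above: it de-duplicates 'Similar Entities' and 'Entity URIs' with two independent `set` calls, and sets do not preserve order, so position i in one column is not guaranteed to correspond to position i in the other. A possible alternative, sketched below under the assumption that labels and URIs arrive as aligned lists, is to merge them as (label, URI) pairs. The helper name `merge_entity_pairs` and the sample values are hypothetical.

```python
def merge_entity_pairs(*pair_lists):
    """Merge (label, uri) lists, dropping repeated URIs and preserving first-seen order."""
    seen = set()
    labels, uris = [], []
    for pairs in pair_lists:
        for label, uri in pairs:
            if uri not in seen:
                seen.add(uri)
                labels.append(label)
                uris.append(uri)
    return labels, uris

# Placeholder labels and URIs for illustration only
word_hits = list(zip(['Pyiron', 'Pyrho'], ['uri:1', 'uri:2']))
sentence_hits = list(zip(['Pyrho', 'pyDOE'], ['uri:2', 'uri:3']))

labels, uris = merge_entity_pairs(sentence_hits, word_hits)
print(labels)
print(uris)
```

Keying the de-duplication on the URI keeps each label next to its own URI, which matters if a later step looks entities up by position.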
