# Entity Alignment: Hash Approach

This notebook presents a method for entity alignment and merging that combines sentence embeddings with hashing, with the aim of making processing more efficient. It addresses the entity merging and alignment problem for our knowledge graph. A hashing function groups relations by entity type, so that similarity is only computed within a group. Practical examples show how entities are merged when their similarity reaches a threshold, and how 'also known as' relations are generated for entities that have different names but are identified as similar. The approach is preliminary and should be revised further.
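
The grouping step can be illustrated in isolation, without loading the embedding model. The following is a minimal sketch, assuming relations are plain dictionaries with a `head_type` field as in the cells below; the names `type_bucket_key`, `bucket_by_head_type` and `toy_relations` are hypothetical and used only for illustration.

```python
import hashlib
from collections import defaultdict

def type_bucket_key(relation):
    # The key depends only on head_type, so relations with the same
    # head_type string always receive the same key.
    return hashlib.md5(relation['head_type'].encode()).hexdigest()

def bucket_by_head_type(relations):
    # Group relations by the hash of their head_type.
    buckets = defaultdict(list)
    for relation in relations:
        buckets[type_bucket_key(relation)].append(relation)
    return buckets

# Hypothetical toy relations (illustrative only).
toy_relations = [
    {'head': 'Université Lyon 2', 'head_type': 'org'},
    {'head': 'Napoleon', 'head_type': 'Person'},
    {'head': 'Université Lumière Lyon 2', 'head_type': 'org'},
]

buckets = bucket_by_head_type(toy_relations)
for key, members in buckets.items():
    # Print a short prefix of the key and the heads that share it.
    print(key[:8], [m['head'] for m in members])
```

Because the hash is an exact match on the type string, `'org'` and `'Org'` would fall into different groups. Normalizing type labels before hashing may be worth considering when the approach is revised.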

In [None]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125923 sha256=94a331d8c1900b87b75e6552bc84c695f61d962fcee47c97c1ce8b087abe5680
  Stored in directory: /root/.cache/pip/wheels/62/f2/10/1e606fd5f02395388f74e7462910fe851042f97238cbbd902f
Successfully built sentence-t

In [None]:
!pip install rdflib

Collecting rdflib
  Downloading rdflib-7.0.0-py3-none-any.whl (531 kB)
Collecting isodate<0.7.0,>=0.6.0 (from rdflib)
  Downloading isodate-0.6.1-py2.py3-none-any.whl (41 kB)
Installing collected packages: isodate, rdflib
Successfully installed isodate-0.6.1 rdflib-7.0.0


In [None]:
from sentence_transformers import SentenceTransformer, util
import hashlib
from collections import defaultdict

# Initialize the model
merge_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

r1 = {'head': 'Université Lyon 2', 'head_type': 'org', 'type': 'subsidiary', 'tail': 'Paris School of Economics', 'tail_type': 'org'}
r2 = {'head': 'Université Lumière Lyon 2', 'head_type': 'org', 'type': 'part of', 'tail': 'GATE', 'tail_type': 'org'}

# r1 = {'head': 'Napoleon', 'head_type': 'Person', 'type': 'Lead', 'tail': 'War of waterloo', 'tail_type': 'Event'}
# r2 = {'head': 'Napoleon Bonaparte', 'head_type': 'Person', 'type': 'Born', 'tail': 'Corsica', 'tail_type': 'Location'}

relations = [r1, r2]

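def encode_relation(relation):
    # Embed the head name together with the relation type, so both affect the similarity.
    relation_str = ' '.join([relation['head'], relation['type']])
    return merge_model.encode(relation_str)

def similarity_score(enc1, enc2):
    # Cosine similarity between the two embeddings, returned as a Python float.
    return util.pytorch_cos_sim(enc1, enc2).item()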

def should_merge(rel1, rel2, threshold):
    enc1 = encode_relation(rel1)
    enc2 = encode_relation(rel2)
    sim = similarity_score(enc1, enc2)
    print('Similarity between', rel1['head'], 'and', rel2['head'], 'is :', sim)
    return sim >= threshold

def hash_relation(relation):
    # Relations with the same head_type get the same key, so they land in the same group.
    return hashlib.md5(relation['head_type'].encode()).hexdigest()

def align_and_merge_relations(relations):
    hashed_groups = defaultdict(list)
    merged_relations = []
    aka_relations = []

    for relation in relations:
        hash_key = hash_relation(relation)
        hashed_groups[hash_key].append(relation)

    for group in hashed_groups.values():
        merged_relations.extend(merge_group(group, aka_relations))

    return merged_relations, aka_relations

def merge_group(group, aka_relations, threshold=0.8):
    merged = []

    while group:
        base = group.pop()
        merge_candidates = [base]

        for other in list(group):
            if should_merge(base, other, threshold):
                merge_candidates.append(other)
                group.remove(other)
                # 'also known as' relation if names are different
                if base['head'] != other['head']:
                    aka_relations.append({'head': base['head'], 'type': 'also known as', 'tail': other['head']})

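        # The merged relation keeps the head, head type and relation type of the base candidate.
        merged_head = merge_candidates[0]['head']
        merged_head_type = merge_candidates[0]['head_type']
        merged_type = merge_candidates[0]['type']
        # Collect distinct tails; a set does not guarantee their order in the joined string.
        tails = {cand['tail'] for cand in merge_candidates}
        merged_tail = ', '.join(tails)
        # Derive the tail type from the candidates instead of hard-coding 'org'.
        merged_tail_type = ', '.join(sorted({cand['tail_type'] for cand in merge_candidates}))

        merged.append({'head': merged_head, 'head_type': merged_head_type, 'type': merged_type, 'tail': merged_tail, 'tail_type': merged_tail_type})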

    return merged

merged_relations, aka_relations = align_and_merge_relations(relations)
print("Merged Relations:", merged_relations)
print("Also Known As Relations:", aka_relations)


Similarity between Université Lumière Lyon 2 and Université Lyon 2 is : 0.8301360607147217
Merged Relations: [{'head': 'Université Lumière Lyon 2', 'head_type': 'org', 'type': 'part of', 'tail': 'Paris School of Economics, GATE', 'tail_type': 'org'}]
Also Known As Relations: [{'head': 'Université Lumière Lyon 2', 'type': 'also known as', 'tail': 'Université Lyon 2'}]
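
The similarity printed above is a cosine similarity between two embedding vectors, and the merge decision is a comparison of that value against the threshold of 0.8. As a rough illustration of what this comparison does, here is a minimal sketch in plain Python. The helpers `cosine_similarity` and `passes_threshold` and the three-dimensional vectors are hypothetical stand-ins, not part of the notebook; real embeddings come from the model.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes_threshold(a, b, threshold=0.8):
    # Same comparison as should_merge: merge only if similarity >= threshold.
    return cosine_similarity(a, b) >= threshold

# Hypothetical toy vectors standing in for sentence embeddings.
v1 = [1.0, 0.0, 0.0]
v2 = [1.0, 1.0, 0.0]  # 45 degrees away from v1
v3 = [3.0, 0.5, 0.0]  # nearly the same direction as v1, different length

print(passes_threshold(v1, v2))  # similarity is about 0.707, below 0.8
print(passes_threshold(v1, v3))  # similarity is about 0.986, above 0.8
```

Cosine similarity ignores vector length and looks only at direction, which is why `v3` counts as similar to `v1` despite being longer. The example pair in this notebook scores about 0.83, only slightly above the 0.8 threshold, so the outcome is sensitive to the threshold chosen.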
