# Entity alignment experiment

Problem: Entities extracted by an LLM can be messy. We need to align them to a canonical form so that different surface forms of the same entity map to one name.

Steps:

1. Extract entities
2. Project to semantic space
3. Use a distance metric to define each entity's canonical form, semi-manually
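
The third step can be sketched as a greedy pass over the embedded names: each name is mapped to the first existing canonical name whose vector is close enough, and otherwise becomes a new canonical name. This is only a minimal sketch under assumptions, not the method used in this notebook: `assign_canonical`, the 0.9 threshold, the variant "Wasatch Fm." and the 2-D toy vectors are all hypothetical stand-ins for real embeddings, and the resulting mapping would still need manual review.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_canonical(
    names: list[str], vectors: list[list[float]], threshold: float = 0.9
) -> dict[str, str]:
    """Greedily map each name to a canonical name (hypothetical helper).

    A name joins the first canonical entry whose vector has cosine
    similarity >= threshold; otherwise it becomes a new canonical entry.
    """
    canonical: list[tuple[str, list[float]]] = []
    mapping: dict[str, str] = {}
    for name, vec in zip(names, vectors):
        for canon_name, canon_vec in canonical:
            if cosine_similarity(vec, canon_vec) >= threshold:
                mapping[name] = canon_name
                break
        else:
            # No existing canonical entry is close enough.
            canonical.append((name, vec))
            mapping[name] = name
    return mapping


# Toy 2-D vectors standing in for real embeddings (made up for illustration).
names = ["Wasatch Formation", "Wasatch Fm.", "Mancos Shale"]
vectors = [[1.0, 0.0], [0.98, 0.1], [0.0, 1.0]]
mapping = assign_canonical(names, vectors)
```

The result depends on input order, because the first name seen in a group becomes its canonical form. Sorting the names by frequency beforehand, or editing the mapping by hand afterwards, is one way to keep the process semi-manual.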

In [1]:
import pandas as pd

In [15]:
df = pd.read_sql("SELECT * FROM entities", "sqlite:///data/entities.db")
df.head()

Unnamed: 0,hashed_text,paper_id,locations,stratigraphic_names,lithologies
0,e4e6002ff052b9a38219850ded0fb4d6e7a71c78a22558...,5579aff0e138231c9f6883e9,"Fortieth Parallel, Danforth Hills KRCRA, White...",coal fields,coal
1,3350d7bfada96f205f228e871c3685de2a9005bb6cecca...,5579aff0e138231c9f6883e9,"Fort Union, Wasatch, Green River","Fort Union Formation, Wasatch Formation, Green...","detrital material, conglomerates, sandstones, ..."
2,9c1ac904c7057e1bb19ad43f41b6bc9b2eda5af5ce71b0...,5579aff0e138231c9f6883e9,northeastern part of the quadrangle,"Williams Fork Formation, Trout Creek Sandstone...","sandstone, siltstone, claystone, carbonaceous ..."
3,02916f6d7e87ab1c8065656bcb7a56428d02c81e946bc8...,5579aff0e138231c9f6883e9,western half and south-central parts of the qu...,"Wasatch Formation, Green River Formation, Anvi...","fluvialte sandstone, siltstone, shale, dark-gr..."
4,9273c5ae295642f4990cfbd61a9b4b887e30b70506612e...,5579aff0e138231c9f6883e9,Tennessee Gas Transmission No. 1 USA Chorney w...,"Mesaverde Group, Late Cretaceous age, Mancos S...",fine-grained thick-bedded to massive sandstone...


In [30]:
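def flatten(x: pd.Series) -> list[str]:
    """Split comma-separated strings into a sorted list of unique, stripped names."""
    outputs = []
    for i in x:
        # Skip empty fragments left by stray or trailing commas.
        outputs.extend([j.strip() for j in i.split(",") if j.strip()])
    return sorted(set(outputs))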

In [31]:
locations = flatten(df.locations)
locations[:3]

['! see. 19', '% mile northeast of Anchorage', '% miles northwest of well 155']

In [32]:
strats = flatten(df.stratigraphic_names)
strats[:3]

["'Bend series'", "'Cascade River Schist of Misch (1966)'", "'D' coal bed"]

In [33]:
liths = flatten(df.lithologies)
liths[:3]

["'Stratified",
 "'bedded intervals several meters thick of light-gray to white diatomaceous earth'",
 "'bedding faults'"]

In [34]:
df = pd.DataFrame(
    {
        "category": ["location"] * len(locations)
        + ["stratigraphic_name"] * len(strats)
        + ["lithology"] * len(liths),
        "name": locations + strats + liths,
    }
)

In [35]:
df

Unnamed: 0,category,name
0,location,! see. 19
1,location,% mile northeast of Anchorage
2,location,% miles northwest of well 155
3,location,% miles south of Derby
4,location,'1'. 12 S.
...,...,...
21887,lithology,zoisite
21888,lithology,zoisite(?)
21889,lithology,zone of gash veins
21890,lithology,zoned plagioclase
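
The table above contains near-duplicates such as `zoisite` and `zoisite(?)`, and earlier outputs show names wrapped in quotes (`'Bend series'`). A light text cleanup before embedding might collapse some of these cheaply. The sketch below is an assumption, not part of the original pipeline: `normalize_name` is a hypothetical helper, and lowercasing in particular may discard distinctions that matter for some names.

```python
import re


def normalize_name(name: str) -> str:
    """Cheap surface cleanup for an extracted name (hypothetical helper)."""
    name = name.strip()
    # Drop a trailing uncertainty marker such as "zoisite(?)".
    name = re.sub(r"\(\?\)\s*$", "", name).strip()
    # Remove quotes only when the same quote wraps the whole name,
    # so "'D' coal bed" keeps its inner quotes.
    if len(name) >= 2 and name[0] == name[-1] and name[0] in "'\"":
        name = name[1:-1]
    # Collapse repeated whitespace and lowercase.
    return re.sub(r"\s+", " ", name).strip().lower()


examples = ["zoisite(?)", "'Bend series'", "'D' coal bed"]
cleaned = [normalize_name(n) for n in examples]
```

Keeping the original name next to the cleaned one (for example as a second DataFrame column) makes it possible to trace a canonical form back to the raw extraction.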


In [None]:
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(sentences)
print(embeddings)