# Named Entity Linking Using a Knowledge Base

In this notebook, we show how we link ambiguous entities using spaCy's knowledge base.
The dataset we used is the list of all company entities from DBpedia.
We may also explore other datasets in the future (Wikidata, Bloomberg, etc.).

In [1]:
import glob

import numpy as np
import pandas as pd
import spacy
from spacy.kb import KnowledgeBase

# Load the large English pipeline; its vocab is shared with the knowledge base
nlp = spacy.load("en_core_web_lg")


In [2]:
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [12]:
def load_entities():
    """
        This function loads the entity table and builds dictionaries keyed by entity ID.
    """
    # load the pre-processed entity dataframe from DBpedia
    ent_df = pd.read_csv("ent_df.csv")

    # create dictionaries to store the info
    names = dict()
    descriptions = dict()
    alias = dict()

    # columns are read by position: ID, name, alias, description
    for idx, row in ent_df.iterrows():
        cid = row.iloc[0]
        name = row.iloc[1]
        ali = row.iloc[2]
        desc = row.iloc[3]

        # assign to the dictionaries
        names[cid] = name
        descriptions[cid] = desc

        # keep the alias only if it differs from the name
        if name != ali:
            alias[cid] = ali

    return names, descriptions, alias, ent_df

In [19]:
def insert_entities(kb, dic, length=300, freq=1):
    """
        This function adds entities to the KB based on the given dictionary.
    """
    for qid, item in dic.items():

        # Skip the entry if its value is already registered as an alias in the KB
        if len(kb.get_candidates(str(item))) > 0:
            continue

        # Create the entity vector. For now, it's a vector of zeros
        item_enc = np.zeros(length)

        # insert the entity
        kb.add_entity(entity=qid, entity_vector=item_enc, freq=freq)

    return kb

In [None]:
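def insert_alias(kb, dic, prob=None):
    """
        This function adds aliases to entities in the KB based on the given dictionary.
    """
    # by default, each alias points to its single entity with prior probability 1
    if prob is None:
        prob = [1]

    for qid, ali in dic.items():
        # register both the original and the lower-cased spelling
        kb.add_alias(alias=str(ali), entities=[qid], probabilities=prob)
        kb.add_alias(alias=str(ali).lower(), entities=[qid], probabilities=prob)

    # return the KB so that `kb = insert_alias(kb, ...)` does not set kb to None
    return kb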

In [14]:
# load entities based on the files
name_dict, desc_dict, alias_dict, ent_df = load_entities()

# Below is how we added the entities to a newly created knowledge base.
# Since we already have a pre-built knowledge base, we load that one instead.

kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

"""
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)
for qid, desc in desc_dict.items():
    #desc_doc = nlp(str(desc))
    #desc_enc = desc_doc.vector
    desc_enc = np.zeros(300)
    kb.add_entity(entity=qid, entity_vector=desc_enc, freq=5)

# Add entities
kb = insert_entities(kb, desc_dict)

# Add entity names as aliases
kb = insert_alias(kb, name_dict)

# Add alternative names as aliases
kb = insert_alias(kb, alias_dict)


"""

# Load the pre-built knowledge base
kb.load_bulk("kb_new")


## Add entities from the knowledge-graph

To add the entities we extracted from 10-K filings to the knowledge base, we first check whether each entity is already in the KB. If it is not, we add it as a new entity.

In [20]:
def check_add_kb(kb, lst):
    """
        This function first checks if the given entity is in the KB.
        If not, we add the entity to the KB.
        It also adds aliases to the KB for future use.
    """
    for item in lst:
        # skip items that already resolve to at least one candidate
        if kb.get_candidates(item):
            continue

        # print the new item, then register it with a zero vector
        print(item)
        kb.add_entity(entity=item, entity_vector=np.zeros(300), freq=1)
        kb.add_alias(alias=str(item), entities=[item], probabilities=[1])
        kb.add_alias(alias=str(item).lower(), entities=[item], probabilities=[1])

    return kb
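
The check-then-add logic can be tried without loading spaCy. The sketch below is only an illustration: `ToyKB` is a hypothetical stand-in that stores aliases in a plain dictionary and imitates just the calls used above, it is not spaCy's `KnowledgeBase`, and the company names are made up.

```python
class ToyKB:
    """Hypothetical stand-in for a KB: maps an alias string to entity IDs."""

    def __init__(self):
        self.entities = set()
        self.aliases = {}

    def get_candidates(self, alias):
        # exact string match, so "Acme" and "acme" are different aliases
        return self.aliases.get(alias, [])

    def add_entity(self, entity):
        self.entities.add(entity)

    def add_alias(self, alias, entities):
        known = self.aliases.setdefault(alias, [])
        for entity in entities:
            if entity not in known:
                known.append(entity)


def check_add(kb, names):
    """Add every name that has no candidate yet; return the names added."""
    added = []
    for name in names:
        if kb.get_candidates(name):
            continue
        kb.add_entity(name)
        kb.add_alias(name, [name])
        kb.add_alias(name.lower(), [name])
        added.append(name)
    return added


kb_demo = ToyKB()
kb_demo.add_entity("Q1")
kb_demo.add_alias("Acme Motors", ["Q1"])

# the second "Globex" is skipped because the first one was just added
added = check_add(kb_demo, ["Acme Motors", "Globex", "Globex"])
print(added)
```

Because the lookup in this sketch is an exact string match, registering the lower-cased spelling as a second alias is what lets `"globex"` resolve to the same entry.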


In [None]:
# Load the sentences
sent_df = pd.read_csv("automobile_sents.csv")

# We are going to use the two columns (copy to avoid modifying a view)
sent_df = sent_df[['source','target']].copy()

# removing the parentheses, the / signs, and trailing white space
# (regex=True is required in recent pandas, where str.replace is literal by default)
sent_df['source'] = sent_df['source'].str.replace(r"\(.*\)", "", regex=True)
sent_df['target'] = sent_df['target'].str.replace(r"\(.*\)", "", regex=True)

# removing the / signs
sent_df['source'] = sent_df['source'].str.replace(r"\/.*\/*", "", regex=True)
sent_df['target'] = sent_df['target'].str.replace(r"\/.*\/", "", regex=True)

# removing 's
sent_df['target'] = sent_df['target'].str.replace("’s", "", regex=False)
sent_df['target'] = sent_df['target'].str.replace("'s", "", regex=False)

# removing "the "
sent_df['source'] = sent_df['source'].str.replace("the ", "", regex=False)
sent_df['target'] = sent_df['target'].str.replace("the ", "", regex=False)

# removing white space
sent_df['source'] = sent_df['source'].str.strip()
sent_df['target'] = sent_df['target'].str.strip()


# get the list of unique entities
source_list = sent_df['source'].unique()
target_list = sent_df['target'].unique()
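
To see what these cleaning steps do to a single string, here is a minimal sketch using the standard `re` module. The helper `clean_name` and the sample names are hypothetical; it applies the patterns used for the `target` column, in the same order.

```python
import re


def clean_name(text):
    """Apply the target-column cleaning steps to one string (illustrative only)."""
    text = re.sub(r"\(.*\)", "", text)   # drop a parenthesised part
    text = re.sub(r"\/.*\/", "", text)   # drop text enclosed in slashes
    text = text.replace("’s", "").replace("'s", "")  # drop possessive 's
    text = text.replace("the ", "")      # drop the literal substring "the "
    return text.strip()


samples = ["the Acme Motor Company (Acme)", "Globex’s", "Initech Corp /DE/"]
cleaned = [clean_name(s) for s in samples]
print(cleaned)
```

Note that `"the "` is removed wherever it occurs, including inside a word: `clean_name("Lathe Works")` returns `"LaWorks"`. If that matters for the data at hand, a word-boundary pattern such as `r"\bthe "` may be a safer choice.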

In [None]:
# If the target is the same as the source, we know that it is already in the KB.
# So we keep only the rows whose target differs from the source.
# (DataFrame.append was removed in pandas 2.0; a boolean mask selects the same rows.)

new_df = sent_df[sent_df['source'] != sent_df['target']]

Let's check which entities are already in the KB and which are not.

In [23]:
quick = pd.read_csv("quick_check.csv", header=None)
lst = quick[1]

# Print the candidates if the item is in the KB
# If not, print the name of the company
for item in lst:
    candidates = kb.get_candidates(item)
    if candidates:
        print(candidates)
    else:
        print(item)
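
To get counts rather than a printed list, the same check can be wrapped in a small helper. This is a sketch under assumptions: `split_by_kb_membership` is a hypothetical name, and `lookup` is any callable that returns a non-empty result for known names. Here a plain dictionary with made-up entries stands in for the KB.

```python
def split_by_kb_membership(names, lookup):
    """Split names into those that lookup() resolves and those it does not."""
    found, missing = [], []
    for name in names:
        if lookup(name):
            found.append(name)
        else:
            missing.append(name)
    return found, missing


# made-up alias table standing in for the knowledge base
known_aliases = {"Acme Motors": ["Q1"], "Globex": ["Q2"]}

found, missing = split_by_kb_membership(
    ["Acme Motors", "Initech", "Globex"],
    lambda name: known_aliases.get(name, []),
)
print(len(found), "in the KB;", len(missing), "not in the KB")
```

With the knowledge base loaded above, passing `kb.get_candidates` as `lookup` should give the corresponding split for `lst`.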
