# Report 1
### Identifying entities with named entity recognition systems 

##### Problem to solve:
Entity identification with spaCy's NER system is not always accurate. For instance, within a single document the names "Beyonce" and "Beyonce Giselle Knowles" refer to the same person, yet they receive different labels. We are trying to improve the NER output so that mentions of the same entity carry identical labels.
##### Idea 2. Ensure the correct labeling with similarity


The idea is to improve the accuracy of entity recognition by using each entity's similarity to the other entities. For instance, if an entity is labeled ORG but is similar to many entities labeled PERSON, we would consider changing its label to PERSON.

The algorithm works as follows:

1. For each entity, calculate its similarity with all other entities.

2. Keep only the entities whose similarity to this entity exceeds a threshold.

3. Find the most common label among these similar entities and use it as the entity's updated label.
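
The three steps above can be sketched without spaCy on toy data. This is a minimal illustration, not the notebook's implementation: the two-dimensional vectors, the helper names `cosine` and `relabel_by_similarity`, and the toy labels are assumptions made up for the example. As in the notebook code further down, each entity also votes for itself, since its similarity to itself is 1.0.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relabel_by_similarity(entities, threshold=0.6):
    """entities maps name -> (vector, label); returns name -> majority label."""
    updated = {}
    for name, (vec, label) in entities.items():
        # Steps 1 and 2: score against every entity, keep those above the threshold.
        votes = [
            other_label
            for other_vec, other_label in entities.values()
            if cosine(vec, other_vec) > threshold
        ]
        # Step 3: majority vote; keep the original label if nothing passed the filter.
        updated[name] = Counter(votes).most_common(1)[0][0] if votes else label
    return updated


# Hypothetical vectors and labels, chosen only to illustrate the vote.
toy = {
    "Beyoncé": ([0.9, 0.1], "ORG"),
    "Beyoncé Giselle Knowles-Carter": ([0.8, 0.2], "PERSON"),
    "Mathew Knowles": ([0.85, 0.3], "PERSON"),
    "Houston": ([0.1, 0.9], "GPE"),
}
new_labels = relabel_by_similarity(toy)
print(new_labels)
```

In this toy setup "Beyoncé" collects one ORG vote (its own) and two PERSON votes, so the majority label flips it to PERSON, while "Houston" has no similar neighbours and keeps GPE.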


##### Step1, obtain the text to process from a JSON file (only the first 20 paragraphs are processed here)

In [9]:
import json
import spacy
from spacy.tokens import Span
from spacy import displacy

nlp = spacy.load("en_core_web_lg")
with open("/Users/leo/PycharmProjects/EECS4080/train-v2.0.json") as f:
    data = json.load(f)

# collect the "context" text of the first 20 paragraphs from the first two articles
count = 0
result = []
for i in range(2):
    for d in data["data"][i]["paragraphs"]:
        if count == 20:
            break
        if "context" in d:
            print(d["context"])
            result.append(d["context"])
            print("\n---------------------------------------------------------------------------\n")
            count += 1

print(count)

##### Step2, visualize the original NER labels
The short name "Beyoncé" is labeled as an organization (ORG) instead of a person, whereas Beyoncé's full name is correctly labeled as a person (PERSON).

In [2]:
text="\n---------------------------------------------------------------------------------\n".join(result)
print(text)   
doc = nlp(text)
displacy.serve(doc, style="ent")

Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
---------------------------------------------------------------------------------
Following the disbandment of Destiny's Child in June 2005, she released her second solo album, B'Day (2006), which contained hits "Déjà Vu", "Irreplaceable", and "Beautiful Liar". Beyoncé also ventured into acting, with a G


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


##### Step3, calculate similarity and filter out unrelated entities with a threshold

In [4]:
import warnings
warnings.filterwarnings("ignore")
from collections import Counter
doc1 = nlp(text)

# collect the distinct entity strings in a set
ent_set = set()
# map each distinct entity string to its first occurrence (a Span);
# repeated mentions are dropped, no part-of-speech filtering is applied
ent_Noun_dict = {}
for ent in doc1.ents:
    ent_set.add(ent.text)
    if ent.text not in ent_Noun_dict:
        ent_Noun_dict[ent.text] = ent

# add ent pairs to similarity_dict
# calculate the similarity between each ent pairs
similarity_dict = {}
for k1 in ent_Noun_dict.keys():
    for k2 in ent_Noun_dict.keys():
        s = ent_Noun_dict[k1].similarity(ent_Noun_dict[k2])
        if k1 not in similarity_dict:
            similarity_dict[k1] = [(ent_Noun_dict[k2], s)]
        else:
            similarity_dict[k1].append((ent_Noun_dict[k2], s))

                
# collect list of labels based on similar words (threshold > 0.6)
label_dict = {}
for k1 in ent_Noun_dict.keys():
    for k2 in ent_Noun_dict.keys():
        s = ent_Noun_dict[k1].similarity(ent_Noun_dict[k2])
        if s > 0.6: # threshold
            if k1 not in label_dict:
                label_dict[k1] = [ent_Noun_dict[k2].label_]
            else:
                label_dict[k1].append(ent_Noun_dict[k2].label_)
                
# find the most frequent label in the list of labels
label_dict_update = {}
for k in label_dict:
    label_dict_update[k] = Counter(label_dict[k]).most_common(1)[0][0]
    
     
print(similarity_dict['Beyoncé'])
print("\n----------------------------------------------------------------------------------\n")
print(label_dict['Beyoncé'])


[(Beyoncé Giselle Knowles-Carter, 0.7206297), (September 4, 1981, -0.16642813), (American, 0.08907436), (Houston, 0.08850297), (Texas, -0.00039640086), (the late 1990s, -0.18449691), (R&B, 0.0), (Mathew Knowles, 0.46935746), (one, -0.13911682), (Beyoncé, 1.0), (Dangerously in Love, -0.021065364), (2003, -0.18708612), (five, -0.17569552), (Grammy Awards, 0.20545188), (Billboard, 0.1288819), (100, -0.15955852), (Crazy in Love, -0.06071433), (Baby Boy, 0.034600094), (June 2005, -0.08648711), (second, -0.13986963), (B'Day, 0.0), (2006, -0.17212312), (Déjà Vu, 0.14225185), (Beautiful Liar, 0.1254584), (Dreamgirls, 0.38043478), (The Pink Panther, -0.01855444), (2009, -0.13535015), (Jay Z, 0.24507129), (Etta James, 0.2858564), (Cadillac Records, 0.10079185), (third, -0.14467286), (Sasha Fierce, 0.2748853), (six, -0.16174723), (2010, -0.13955076), (fourth, -0.106824405), (4 (2011, -0.21309662), (1970s, -0.13497537), (1980s, -0.10630715), (1990s, -0.074820936), (fifth, -0.09991249), (2013, -0.1
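
To make the scores above easier to read: as far as I understand, spaCy's default `similarity` is the cosine similarity of the two spans' vectors, where a span's vector is the average of its token vectors. The sketch below reproduces that computation on made-up three-dimensional token vectors (the vectors and the helper names `mean_vector` and `cosine` are assumptions, not values from `en_core_web_lg`). It also computes each unordered pair only once, since the score is symmetric. A score of exactly 0.0, as for "R&B" and "B'Day" above, may indicate that the span has no usable vector.

```python
from itertools import combinations
from math import sqrt


def mean_vector(token_vectors):
    """Average a list of equal-length token vectors component by component."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical token vectors for three entity strings.
token_vectors = {
    "Beyoncé": [[0.6, 0.8, 0.0]],
    "Mathew Knowles": [[0.6, 0.0, 0.8], [0.0, 0.8, 0.6]],
    "B'Day": [[0.0, 0.0, 0.0]],  # stands in for a span without a vector
}
span_vectors = {name: mean_vector(vecs) for name, vecs in token_vectors.items()}

# Score every unordered pair once and store it under both orderings.
scores = {}
for a, b in combinations(span_vectors, 2):
    scores[(a, b)] = scores[(b, a)] = cosine(span_vectors[a], span_vectors[b])

for pair, score in scores.items():
    print(pair, round(score, 4))
```

Computing each pair once roughly halves the number of similarity calls compared with the nested loops over all keys, at the cost of not storing the trivial self-similarity of 1.0.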

##### Step4, update each entity's label with the most common label among its similar entities

In [5]:
# update the ent list
label_dict = label_dict_update
change_list = []
change_dict = {}
for ent in doc.ents:
    # if ent is in label_dict and its label is different from label in the dict, we need to update it
    if ent.text in label_dict and ent.label_ != label_dict[ent.text]:
        # remove the ent from the ent list
        ents = list(doc.ents)
        ents.remove(ent)
        doc.ents = tuple(ents)
        # add ent to the change_list
        change_list.append(ent)
        if ent.text not in change_dict:
            change_dict[ent.text] = [ent.label_, label_dict[ent.text]]
        
update_ent = []
for e in change_list:
    span = Span(doc, e.start, e.end, label=label_dict[e.text])
    update_ent.append(span)
print(change_dict)
doc.ents = list(doc.ents) + update_ent       
        

{'one': ['CARDINAL', 'WORK_OF_ART'], 'Beyoncé': ['ORG', 'PERSON'], 'five': ['CARDINAL', 'DATE'], 'Grammy Awards': ['PERSON', 'EVENT'], 'Billboard': ['ORG', 'FAC'], 'second': ['ORDINAL', 'DATE'], 'over 118 million': ['CARDINAL', 'MONEY'], '60 million': ['CARDINAL', 'MONEY'], '20': ['CARDINAL', 'DATE'], 'The Recording Industry Association of America': ['ORG', 'WORK_OF_ART'], 'America': ['GPE', 'NORP'], 'the Top Female Artist of': ['ORG', 'WORK_OF_ART'], "St. Mary's": ['GPE', 'PERSON'], 'the High School for the Performing and Visual Arts': ['ORG', 'WORK_OF_ART'], "St. John's": ['FAC', 'PERSON'], 'United Methodist Church': ['ORG', 'GPE'], 'three': ['CARDINAL', 'DATE'], 'Atlanta Records': ['ORG', 'GPE'], 'first': ['ORDINAL', 'WORK_OF_ART'], "Destiny's Child": ['ORG', 'WORK_OF_ART'], 'Best R&B/Soul or Rap New Artist': ['ORG', 'WORK_OF_ART'], 'Best R&B/Soul Single': ['ORG', 'WORK_OF_ART'], 'number-one': ['CARDINAL', 'WORK_OF_ART'], 'Annual Grammy Awards': ['EVENT', 'WORK_OF_ART'], 'Boyz II Me
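
The update in this step removes each changed entity and re-adds it with the new label. The same bookkeeping can be sketched in a single pass over plain tuples, which may be easier to follow. This is only an illustration under assumptions: entities are represented as hypothetical `(start, end, text, label)` tuples rather than spaCy `Span` objects, and `apply_label_updates` is a made-up helper name.

```python
def apply_label_updates(ents, label_by_text):
    """Return relabeled entities and a record of [old_label, new_label] per changed text."""
    updated = []
    changes = {}
    for start, end, text, label in ents:
        # Fall back to the current label when no vote exists for this text.
        new_label = label_by_text.get(text, label)
        if new_label != label:
            # Record only the first change seen for each entity text.
            changes.setdefault(text, [label, new_label])
        updated.append((start, end, text, new_label))
    return updated, changes


# Hypothetical entities and voted labels.
ents = [
    (0, 1, "Beyoncé", "ORG"),
    (5, 7, "Mathew Knowles", "PERSON"),
    (9, 10, "Beyoncé", "ORG"),
]
label_by_text = {"Beyoncé": "PERSON", "Mathew Knowles": "PERSON"}

updated, changes = apply_label_updates(ents, label_by_text)
print(changes)
```

Because the spans keep their original positions and only the label changes, the order of the entity list is preserved and no entity is ever temporarily missing.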

##### Step5, display the updated NER labels of the document

In [6]:
displacy.serve(doc, style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
