### Part-(a)

**(a) For each award, create a table (e.g., a pandas DataFrame) with two columns: the recordid, and a list of the persons and organizations identified by NER (excluding the name of the proposing organization, which is in the Company column). You may need to pre-process the company name before eliminating it. For example, for a company called ABC Inc., you may want to remove the substring "Inc." (and others such as LLC, Corp, etc.).**

In [1]:
#!pip install psycopg2-binary

In [2]:
#!pip install spacy

In [3]:
#!python -m spacy download en_core_web_sm

In [4]:
from nltk.tokenize.treebank import TreebankWordDetokenizer


In [5]:
import psycopg2
import pandas as pd
from nltk.tokenize import word_tokenize
import string

In [6]:
import spacy
nlp_spacy = spacy.load("en_core_web_sm")

In [7]:
conn = psycopg2.connect(database="dse203",
                        host="localhost",
                        user="postgres",
                        password="admin",
                        port="5432")

In [8]:
cursor = conn.cursor()

In [9]:
sql = '''SELECT "RecordID","Abstract","Company"
                FROM award_data 
                WHERE "Abstract" like '%radiation hardness%'
                '''

cursor.execute(sql)
result = cursor.fetchall()
conn.close()
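
The search phrase in the query above is written directly into the SQL string. As a minimal sketch, it could instead be passed as a bound parameter. The example uses an in-memory `sqlite3` database as a stand-in so that it runs without the `dse203` PostgreSQL instance; the table rows and the helper name `find_awards` are invented for illustration. Placeholder styles differ between drivers: `sqlite3` uses `?`, while psycopg2 uses `%s`. SQLite's `LIKE` is also case-insensitive for ASCII by default, unlike PostgreSQL's.

```python
import sqlite3

# In-memory stand-in for the award_data table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE award_data ("RecordID" INTEGER, "Abstract" TEXT, "Company" TEXT)')
conn.executemany(
    "INSERT INTO award_data VALUES (?, ?, ?)",
    [
        (1, "We study radiation hardness of CMOS sensors.", "ABC Inc."),
        (2, "A new battery chemistry.", "XYZ LLC"),
    ],
)

def find_awards(connection, phrase):
    """Return (RecordID, Abstract, Company) rows whose abstract contains phrase."""
    sql = 'SELECT "RecordID", "Abstract", "Company" FROM award_data WHERE "Abstract" LIKE ?'
    # The wildcard pattern is bound as a parameter rather than pasted into the SQL.
    return connection.execute(sql, (f"%{phrase}%",)).fetchall()

result = find_awards(conn, "radiation hardness")
conn.close()
print(result)
```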

In [10]:
#dataframe to save the final list of persons and org names
output_df = pd.DataFrame()

In [11]:
#This routine checks whether an organization name found by NER matches the proposing company's name
def is_matching_org(org_name, proposed_name):
    #Translation table that deletes every punctuation character
    remove_punct = str.maketrans('', '', string.punctuation)

    #Strip punctuation from the company name, lowercase it and tokenize it
    proposed_name = proposed_name.translate(remove_punct).lower()
    word_tokens = word_tokenize(proposed_name)

    #Custom stop words (legal suffixes) we want to remove
    stop_words = {'inc', 'corp', 'llc'}

    #Removing stop words
    filtered_proposed_name = [w for w in word_tokens if w not in stop_words]

    #De-tokenize so the two org. names can be compared as strings
    filtered_proposed_name = TreebankWordDetokenizer().detokenize(filtered_proposed_name)

    #Normalize the NER organization name the same way (punctuation and case)
    org_name = org_name.translate(remove_punct).lower()

    #Return True only if the normalized names are identical
    return org_name == filtered_proposed_name
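
The matcher above strips only `inc`, `corp` and `llc`, so a suffix such as "Corporation" or "Ltd" is left in the company name. As a minimal sketch, assuming a longer suffix list suits this dataset, the same normalization can be written with the standard library alone. The names `LEGAL_SUFFIXES`, `normalize_org` and `same_org` are hypothetical and not part of the notebook, and the suffix list is illustrative rather than exhaustive.

```python
import string

# Hypothetical, non-exhaustive suffix list; extend it to suit the data.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "co", "company"}

# Translation table that deletes every punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_org(name):
    """Lowercase a name, strip punctuation and drop legal suffixes."""
    tokens = name.translate(_PUNCT_TABLE).lower().split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def same_org(org_name, company):
    """True when both names normalize to the same non-empty string."""
    left = normalize_org(org_name)
    return bool(left) and left == normalize_org(company)

print(normalize_org("ABC Inc."))                                  # abc
print(same_org("Alphacore", "Alphacore, Inc."))                   # True
print(same_org("Brimrose", "BRIMROSE TECHNOLOGY CORPORATION"))    # False
```

The last call shows the limit of exact matching: a short form of the company name still does not match after suffix removal, so a looser comparison (for example, a token-subset check) would be needed to catch it.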

In [12]:
#This part goes through the results from the SQL query and adds the required results to a dataframe

rows = []

for x in result:

    rec_id = int(x[0])
    doc_spacy = nlp_spacy(x[1])
    persons = set()
    orgs = set()

    #Using spaCy NER to find persons and organizations
    for ent in doc_spacy.ents:
        if ent.label_ == 'PERSON':
            persons.add(ent.text)
        elif ent.label_ == 'ORG':
            orgs.add(ent.text)

    #Company name from our data
    company = x[2]

    #Keeping only the NER organizations that do not match the company name
    orgs_new = {org_name for org_name in orgs if not is_matching_org(org_name, company)}

    #Collecting the required record
    rows.append({'rec_id': rec_id, 'persons': persons, 'orgs': orgs_new, 'company': company})

#Building the dataframe from the collected rows in one step
output_df = pd.DataFrame(rows, columns=['rec_id', 'persons', 'orgs', 'company'])
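
The task statement asks for two columns, the record id and a single list of persons and organizations, whereas the frame built above keeps persons and organizations in separate columns. A minimal sketch of collapsing them follows, written on plain dictionaries so it runs without the database or spaCy. The function `to_two_columns`, the column name `entities` and the sample rows are hypothetical.

```python
def to_two_columns(rows):
    """Collapse 'persons' and 'orgs' into one sorted list per record."""
    return [
        {
            "recordid": row["rec_id"],
            "entities": sorted(set(row["persons"]) | set(row["orgs"])),
        }
        for row in rows
    ]

# Invented sample rows in the same shape as the records collected above.
sample_rows = [
    {"rec_id": 101, "persons": {"Jane Doe"}, "orgs": {"DOE", "Fermilab"}, "company": "ABC Inc."},
    {"rec_id": 156, "persons": set(), "orgs": set(), "company": "XYZ LLC"},
]

two_col = to_two_columns(sample_rows)
print(two_col)
```

Passing the result to `pd.DataFrame(two_col)` would give the two-column table; with the notebook's own frame, `output_df.to_dict('records')` should supply rows in the expected shape.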

In [13]:
pd.set_option('display.max_colwidth', None)

In [14]:
#Printing sample from the final output dataframe
display(output_df)

| | rec_id | persons | orgs | company |
|---|---|---|---|---|
| 0 | 101.0 | {Alphacore, Exoatmospheric Kill Vehicle} | {PCB, CMOS, GaN, AC, DC-DC, POL} | Alphacore, Inc. |
| 1 | 432.0 | {Phase II FRIB} | {the Facility for Rare Isotope Beams, Success, Advent Technologies, CVD, GLCT, Michigan State University, FRIB} | Great Lakes Crystal Technologies, Inc. |
| 2 | 156.0 | {} | {} | BLUESHIFT OPTICS LLC |
| 3 | 160.0 | {} | {Neutron, Hg2Br2, COTS, Brimrose, SiPM, DOE, neutron detection} | BRIMROSE TECHNOLOGY CORPORATION |
| 4 | 250.0 | {} | {Fermilab, RF, the Relativistic Heavy Ion Collider} | Caporus Technologies, LLC |
| ... | ... | ... | ... | ... |
| 367 | 170647.0 | {} | {Harvard University, CCD, Transport of Radiation in Matter, the Charge Transfer Efficiency} | Princeton Optronics |
| 368 | 163576.0 | {} | {NETD, GaAs, the Minnesota Consortium for Defense Conversion, BMDO} | Top-Vu Technology, Inc. |
| 369 | 163577.0 | {} | {GaAs, GaAas} | Top-Vu Technology, Inc. |
| 370 | 164801.0 | {} | {Bond Etchback SOI, CMOS, Silicon, Epitaxial Layer Overgrowth, CVD, SOD, SOS, SOI} | Crystallume/edi |
