# Purpose
This notebook compares two approaches to identifying the two nodes of a hypothesis: extracting them with the entity extraction model, and simply splitting the sentence on connecting words.

## Import

### Packages

In [1]:
# General
import codecs, io, os, re, sys, time

import pandas as pd

### Custom Functions

In [2]:
sys.path.append('*')
from source_entity_extraction import *

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\canfi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\canfi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Data
The training data is imported, the relevant columns are collected into a DataFrame under standardized names, and rows with missing values are dropped.

In [3]:
# Import data
path_dir_data = "./../data/input/"
file_training_data = "training_data_dir_multiclass.xlsx"
path_training_data = os.path.join(path_dir_data, file_training_data)
dataset = pd.read_excel(path_training_data, engine='openpyxl')

# Collect the relevant columns under standardized names
df = pd.DataFrame({
    'file_name': dataset.file_name,
    'hypothesis_num': dataset.hypothesis_num,
    'hypothesis': dataset.sentence,
    'node1': dataset.node_1,
    'node2': dataset.node_2
    })

# Drop rows with missing values
df.dropna(inplace=True)

## Pre-Processing
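The hypotheses and the labelled nodes receive the same text preprocessing (lowercasing, punctuation removal, replacing & with "and", whitespace cleanup) so that the entity extraction methods are compared on consistently formatted text.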
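
To make the effect of these steps concrete, the following is a minimal, self-contained sketch of the normalization applied to a single string. The name `normalize_sketch` is hypothetical, and the final `re.sub` call is only an assumed stand-in for `remove_whitespace` from `source_entity_extraction`, whose exact behaviour is not shown in this notebook.

```python
import re

def normalize_sketch(text, drop_chars="."):
    """Lowercase, drop the given characters, replace '&', and tidy whitespace."""
    text = text.lower()
    for ch in drop_chars:
        text = text.replace(ch, "")
    text = text.strip().replace("&", "and")
    # Assumption: remove_whitespace collapses runs of whitespace into one space
    return re.sub(r"\s+", " ", text).strip()

# Nodes drop periods only; hypotheses also drop colons
node_example = normalize_sketch("Research &  Development.")
hypothesis_example = normalize_sketch("Note: Training & Pay.", drop_chars=".:")
print(node_example)
print(hypothesis_example)
```

The two calls mirror the difference between the node loops and the hypothesis loop in the cell below: only the hypotheses have colons removed.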

In [4]:
# Regex patterns
whitespace_pattern = re.compile(r'\s+')

# Extract columns as lists
hypothesis_list = df['hypothesis'].tolist()
node1_list = df['node1'].tolist()
node2_list = df['node2'].tolist()

# Remove hypothesis tags
hypothesis_list = remove_hypothesis_tag(hypothesis_list)

# Text processing
def normalize_text(text, drop_chars="."):
    """Lowercase, drop the given characters, replace '&', and remove extra whitespace."""
    text = text.lower()
    for ch in drop_chars:
        text = text.replace(ch, "")
    text = text.strip().replace("&", "and")
    return remove_whitespace(text)

# Nodes: drop periods only
node1_list = [normalize_text(n) for n in node1_list]
node2_list = [normalize_text(n) for n in node2_list]

# Hypotheses: drop periods and colons
hypothesis_list = [normalize_text(h, drop_chars=".:") for h in hypothesis_list]

df['hypothesis_pr'] = hypothesis_list
df['node1_pr'] = node1_list
df['node2_pr'] = node2_list

In [5]:
df.sample(10)

| | file_name | hypothesis_num | hypothesis | node1 | node2 | hypothesis_pr | node1_pr | node2_pr |
|---|---|---|---|---|---|---|---|---|
| 135 | wb93smj.txt | h6 | hypo 4: firm performance will be negatively as... | firm performance | turnover | firm performance will be negatively associated... | firm performance | turnover |
| 553 | kl02ms.txt | h2 | hypo 1: the more a ﬁrm prevents waste, the hig... | ﬁrm prevents waste | ﬁnancial performance | the more a ﬁrm prevents waste, the higher its ... | ﬁrm prevents waste | ﬁnancial performance |
| 271 | atc05ijhrm.txt | h5 | hypo 3: in general, the results indicated that... | ﬁrms | level of overall ﬁrm performance | in general, the results indicated that larger ... | ﬁrms | level of overall ﬁrm performance |
| 59 | clv11ijhrm.txt | h6 | hypo 3: the relational side of social capital ... | relational side of social capital | unique human capital | the relational side of social capital has a po... | relational side of social capital | unique human capital |
| 386 | bs05jams.txt | h3 | hypo 1: em is directly and positively related ... | em | firm’s new product success in its principal ma... | em is directly and positively related to the f... | em | firm’s new product success in its principal ma... |
| 50 | cl10pp.txt | h6 | hypo 3: employees’ shared climate perception o... | employees’ shared climate perception of the un... | collective employee helping behavior | employees’ shared climate perception of the un... | employees’ shared climate perception of the un... | collective employee helping behavior |
| 78 | ct05crr.txt | h10 | hypo 2: the ﬁrm’s growth positively related to... | ﬁrm’s growth | ﬁrm’s proﬁtability | the ﬁrm’s growth positively related to the ﬁrm... | ﬁrm’s growth | ﬁrm’s proﬁtability |
| 613 | ltg13jcp.txt | h6 | hypo 2: green products is positively associate... | green products | ﬁrm | green products is positively associated with ﬁrm | green products | ﬁrm |
| 431 | fs90amj.txt | h4 | hypo 1: the greater a firm's current market pe... | firm's current market performance | reputation | the greater a firm's current market performanc... | firm's current market performance | reputation |
| 514 | k03jom.txt | h16 | hypo 2: training is positively related to qual... | training | quality data and reporting | training is positively related to quality data... | training | quality data and reporting |


# Split Hypothesis
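The keyword split uses a manually compiled list of keywords to cut each hypothesis into two parts, instead of extracting the two entities directly. Hypotheses containing "between" are split on "and", hypotheses containing "the greater" are split on the comma, and all other hypotheses are split on a keyword from the list.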
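
The splitting rule can be illustrated on single sentences before it is run over the whole table. The sketch below is not the notebook's cell: the function name `keyword_split`, the shortened keyword list, and the example sentences are hypothetical, and the sketch strips surrounding whitespace from each part for readability.

```python
def keyword_split(sentence, split_words):
    """Return (left, right) parts of a sentence, or ("", "") when no rule applies."""
    if "between" in sentence:
        parts = sentence.split("and")
    elif "the greater" in sentence:
        parts = sentence.split(",")
    else:
        parts = []
        for word in split_words:
            candidate = sentence.split(word)
            if len(candidate) > 1:
                parts = candidate  # a later matching keyword overrides an earlier one
    if len(parts) > 1:
        return parts[0].strip(), parts[1].strip()
    return "", ""

# Shortened, illustrative keyword list
example_words = ["associated", "related"]

keyword_result = keyword_split(
    "training is positively related to quality data and reporting", example_words
)
comma_result = keyword_split(
    "the greater the training budget, the greater the retention", example_words
)
print(keyword_result)
print(comma_result)
```

Because the split keeps everything on either side of the keyword, the parts usually contain more than the entity itself (for example "training is positively" rather than "training"), which is the trade-off this notebook examines.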

## Initialize

In [6]:
lst_split_words =[
    "more likely", "impact", "effect", "association", "interact", 
    "associated", "enhance", "related", "contributes", "influence",
    "inﬂuence" , "performance", "results", "produces"
]

df["node1_split"] = ""
df["node2_split"] = ""

## Execute

In [7]:
for i in df.index:
    text = df.at[i, "hypothesis_pr"]

    if "between" in text:
        # "... between A and B": split on "and"
        nodes = text.split("and")
    elif "the greater" in text:
        # "the greater A, the greater B": split on the comma
        nodes = text.split(",")
    else:
        # Split on a keyword; a later matching keyword overrides an earlier one
        nodes = []
        for split_word in lst_split_words:
            parts = text.split(split_word)
            if len(parts) > 1:
                nodes = parts

    # Only store a result when the split produced at least two parts;
    # df.at avoids the chained assignment df[col][i] = ...
    if len(nodes) > 1:
        df.at[i, "node1_split"] = nodes[0]
        df.at[i, "node2_split"] = nodes[1]

### Output

In [10]:
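path_dir_data = "./../data"
subfolder_a = "output"
subfolder_entity = "entity_extraction"
file_name = "entity_extraction_vs_keyword_split.csv"
path = os.path.join(path_dir_data, subfolder_a, subfolder_entity, file_name)

# Create the output folder if it does not exist yet, then write the results
os.makedirs(os.path.dirname(path), exist_ok=True)
df.to_csv(path, index=False)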

# Results

In [9]:
df.sample(10)

| | file_name | hypothesis_num | hypothesis | node1 | node2 | hypothesis_pr | node1_pr | node2_pr | node1_split | node2_split |
|---|---|---|---|---|---|---|---|---|---|---|
| 131 | wb93smj.txt | h2 | hypo 3: environmental complexity will be posit... | environmental complexity | turnover | environmental complexity will be positively as... | environmental complexity | turnover | environmental complexity will be positively | with turnover within the top management team |
| 99 | cys14jcp.txt | h2 | hypo 2: the greater the ﬁrm’s eco-organization... | ﬁrm’s eco-organizational innovation | eco-product innovation | the greater the ﬁrm’s eco-organizational innov... | ﬁrm’s eco-organizational innovation | eco-product innovation | the greater the ﬁrm’s eco-organizational innov... | the greater its eco-product innovation |
| 125 | w02jom.txt | h5 | hypo 1: results indicate a negative associatio... | hpws | workforce turnover | results indicate a negative association betwee... | hpws | workforce turnover | results indicate a negative association betwee... | workforce turnover |
| 495 | jv04amj.txt | h4 | hypo 2: a performance orientation is positivel... | performance orientation | in-role job performance | a performance orientation is positively relate... | performance orientation | in-role job performance | a | orientation is positively related to in-role ... |
| 244 | sc02amj.txt | h5 | hypo 4: senior executive turnover following an... | senior executive turnover following an outside... | postsuccession operational performance | senior executive turnover following an outside... | senior executive turnover following an outside... | postsuccession operational performance | senior executive turnover following an outside... | |
| 309 | bb95jibs.txt | h16 | hypo 4: subsidiaries with matched strategies e... | subsidiaries with matched strategies | rates of turnover | subsidiaries with matched strategies experienc... | subsidiaries with matched strategies | rates of turnover | | |
| 179 | zn02smj.txt | h19 | hypo 3: focusing on external technological sou... | outsourcing and alliances | new product radicalness and tc speed | focusing on external technological sources, ou... | outsourcing and alliances | new product radicalness and tc speed | focusing on external technological sources, ou... | to new product radicalness and tc speed |
| 53 | clrv08aos.txt | h1 | hypo 1: environmental performance and the leve... | environmental performance | level of discretionary environmental disclosures | environmental performance and the level of dis... | environmental performance | level of discretionary environmental disclosures | environmental | and the level of discretionary environmental ... |
| 163 | zmzttj11jcp.txt | h8 | hypo 2: government’s driving forces impact sig... | government’s driving forces | economic performance of smes | government’s driving forces impact signiﬁcantl... | government’s driving forces | economic performance of smes | government’s driving forces impact signiﬁcantl... | of smes |
| 57 | clv11ijhrm.txt | h4 | hypo 3: the relational side of social capital ... | relational side of social capital | human capital | the relational side of social capital has a po... | relational side of social capital | human capital | the relational side of social capital has a po... | on human capital |
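
A random sample only gives an impression of how well the split parts line up with the labelled nodes. One possible way to put a number on it, sketched here under assumptions, is to count how often the labelled node appears inside the corresponding split part. The function name `containment_rate` is hypothetical and the input values are made up to resemble the `node1_pr` and `node1_split` columns; this is not a metric defined by the notebook.

```python
def containment_rate(labelled_nodes, split_parts):
    """Share of rows whose labelled node appears inside the matching split part."""
    pairs = list(zip(labelled_nodes, split_parts))
    if not pairs:
        return 0.0
    # Empty labels are never counted as hits
    hits = sum(1 for node, part in pairs if node and node in part)
    return hits / len(pairs)

# Illustrative values only, shaped like node1_pr and node1_split
labelled = ["environmental complexity", "performance orientation", "training"]
split_parts = ["environmental complexity will be positively", "a", ""]

rate = containment_rate(labelled, split_parts)
print(f"{rate:.2f}")
```

With the notebook's table, the same function could presumably be called as `containment_rate(df["node1_pr"], df["node1_split"])` and again for the second node.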
