## Pre-processing and Extracting Features

This Python code carries out the entire preprocessing and feature extraction for the corpus. It builds the feature vector from the terms annotated in each sentence using the external knowledge base, and it checks against the distant resource whether a pair of terms is a valid example of a hypernym relation. <br>
To run the script on Stout, use the command **python Features_Extract.py >> output.txt**


###  Python Modules
We first import, in one place, all the modules required for the entire process. These include NLP libraries such as spaCy and NLTK.<br> We also load the English models from these modules, since our analysis is based on an English corpus. <br>
<br>
Pymagnitude is a Python module that we use to obtain embeddings of grammatical and syntactic features.


In [None]:
import subprocess
from collections import Counter

import numpy as np
import nltk
import spacy
import en_core_web_sm
from nltk.corpus import wordnet as wn
from pymagnitude import FeaturizerMagnitude
from spacy import displacy
from spacy.matcher import PhraseMatcher

#Loading the small English spaCy model once, for use in all the functions below
nlp = en_core_web_sm.load()

###  Creating Dictionary
Since we process the input one sentence at a time, the function below creates a dictionary whose keys are the tokens of the sentence.<br> The value for each key is a list of features that can be extracted directly with spaCy's built-in token attributes.<br> Once this dictionary is built, we pass it to the function that eventually creates our feature vector.

In [None]:
def create_dict(test_text,terms):
    listofTokens = []
    listofFeatures = []
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()

    #Converting each and every word of the sentence to lower case
    test_text = test_text.lower()
    #Tokenizing our sentence
    word_list = nltk.word_tokenize(test_text)
    #Lemmatizing each and every token
    test_text = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    doc = nlp(test_text)
    #print(doc)
    for i in range(len(doc)):
         # print("Token text:",doc[i].text, "Token Lemma:", doc[i].lemma_,"Token POS :", doc[i].pos_,"Token tag:", doc[i].tag_, "Token dependency:",doc[i].dep_,
         #       "Token Shape:", doc[i].shape_,"Is token alphabetic:", doc[i].is_alpha,"Is it a stop word: ", doc[i].is_stop)
        #Storing each token of the sentence in a list
        listofTokens.append(str(doc[i]))
        #For each token we store its lemma, coarse POS, fine-grained POS tag, dependency tag, shape,
        #whether it is alphabetic, and whether it is a stop word
        temp_list = [doc[i].lemma_,doc[i].pos_,doc[i].tag_,doc[i].dep_,doc[i].shape_,doc[i].is_alpha,doc[i].is_stop]
        listofFeatures.append(temp_list)
    #print(listofTokens)

    #print(listofFeatures)
    zipbObj = zip(listofTokens, listofFeatures)
    #Creating a dictionary with key as the token, and its value as its certain features
    dictOfWords = dict(zipbObj)
    #Named-Entity-Recognition for tokens which are present in the sentence
    for ent in doc.ents:
        if(ent.text in listofTokens):
            dictOfWords[ent.text].append(ent.label_)
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
    print(dictOfWords)
    #Calling the features_extract function to extract features and create the feature vector
    features_extract(test_text,dictOfWords,terms)
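
One consequence of keying the dictionary by token text is that a token repeated within a sentence keeps only the features of its last occurrence. The sketch below illustrates this with hand-written, hypothetical tokens and feature lists (not real spaCy output), and shows one possible alternative that keys by token position instead.

```python
# Hypothetical tokens and [lemma, POS, dependency] features; not real spaCy output.
tokens = ["the", "dog", "chased", "the", "cat"]
features = [
    ["the", "DET", "det"],
    ["dog", "NOUN", "nsubj"],
    ["chase", "VERB", "ROOT"],
    ["the", "DET", "det"],
    ["cat", "NOUN", "dobj"],
]

# Keyed by token text, as in create_dict: the second "the" overwrites the first.
by_text = dict(zip(tokens, features))

# Keyed by position: every occurrence keeps its own entry.
by_index = {i: (tok, feats) for i, (tok, feats) in enumerate(zip(tokens, features))}

print(len(by_text))   # number of distinct token strings
print(len(by_index))  # number of token positions
```

Whether the overwrite matters depends on how often the same surface form appears with different tags in one sentence.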

###   Feature Extraction Function
This function takes as arguments our sentence (test_text), the dictionary whose keys are the tokens of this sentence and whose values are their grammatical and syntactic features, and the list of terms from the external Knowledge Base. <br> It implements our approach of annotating the terms found in the sentence using spaCy's PhraseMatcher. It also retrieves the features of at most 3 tokens to the left of the first term and to the right of the second term.<br>The function only considers pairs of terms that lie within a window of at most 10 tokens of each other. In addition, it retrieves features from the token sequence that appears on the shortest dependency path between the two terms.<br><br>
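
When a multi-token term and one of its sub-terms both match, only the longer match should be kept. A simplified sketch of that idea, working on plain (start, end) offsets rather than spaCy objects, is given here; the helper name `keep_longest_matches` and the offsets are hypothetical, and unlike the function in this notebook it only drops spans that are fully contained in a longer one.

```python
def keep_longest_matches(matches):
    """Drop any (start, end) span that lies inside a longer kept span.

    `matches` holds token offsets with an exclusive end, the same
    convention spaCy uses for match results.
    """
    # Longest spans first, so a contained span always meets its container earlier.
    ordered = sorted(set(matches), key=lambda m: (-(m[1] - m[0]), m[0]))
    kept = []
    for start, end in ordered:
        contained = any(s <= start and end <= e for s, e in kept)
        if not contained:
            kept.append((start, end))
    return sorted(kept)


# Hypothetical offsets: a two-token term at (2, 4) contains a one-token term at (3, 4).
matches = [(2, 4), (3, 4), (6, 7)]
print(keep_longest_matches(matches))
```

Spans that overlap only partially are both kept by this sketch; handling them would need an explicit tie-breaking rule.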
We use the Pymagnitude module to encode our syntactic and grammatical features as 4-dimensional embeddings, which we integrate into our feature vector. These embeddings are looked up for every token we examine as part of our feature vector.
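
The embeddings of several context tokens are folded into one fixed-size vector by elementwise multiplication, starting from a vector of ones. The sketch below reproduces that folding with plain Python lists and hypothetical 4-dimensional values (not actual Pymagnitude output). It also shows that, because multiplication is commutative, the combined vector is the same whatever order the tokens appear in.

```python
from functools import reduce


def combine(vectors, dim=4):
    """Fold a list of equal-length vectors into one by elementwise product."""
    ones = [1.0] * dim  # same neutral start value as the [1,1,1,1] default
    return reduce(lambda acc, v: [a * b for a, b in zip(acc, v)], vectors, ones)


# Hypothetical 4-dimensional embeddings for three context tokens.
det = [0.5, -0.5, 0.25, 1.0]
adj = [2.0, 1.0, -1.0, 0.5]
noun = [1.0, 4.0, 2.0, -2.0]

forward = combine([det, adj, noun])
backward = combine([noun, adj, det])
print(forward)
print(forward == backward)
```

If token order needs to be reflected in the features, concatenating the per-token embeddings would be one alternative, at the cost of a longer vector.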
    

In [None]:
def features_extract(test_text,dictOfWords,terms):
    
    #Creating a copy of terms from external KB
    terms_temp = terms.copy()
    #Using pymagnitude module to retrieve embeddings of POS and dependency tags
    pos_vectors = FeaturizerMagnitude(100, namespace = "PartsOfSpeech")
    dependency_vectors = FeaturizerMagnitude(100, namespace = "SyntaxDependencies")
    pos_vectors.dim
    nlp = spacy.load('en_core_web_sm')
    #Loading PhraseMatcher from spaCy
    matcher = PhraseMatcher(nlp.vocab)
    #dictionaries for storing features of tokens before and after the terms
    dictbefore={}
    dictafter={}


    #Stripping leading and trailing double quotes from the sentence
    test_text = test_text.strip('"')
    print(test_text)
    
    
    # Only run nlp.make_doc to speed things up
    patterns = [nlp.make_doc(text) for text in terms]
    matcher.add("TerminologyList", None, *patterns)
    #Making our sentence a spaCy object
    doc = nlp(test_text)
    #matches is a list that will store the start index, end index and the match id of the term
    matches = matcher(doc)
    start_idx = []
    end_idx = []
    # Flagit is a list which will avoid duplication of terms in the sentence
    flagit = []
    new_matches = []
    
    for match_id, start, end in matches:
        #the term matched
        span = doc[start:end]
        
        if span in flagit:
            continue
        #If the term has more than one token
        if end - start>1:
            #the following code snippet ensures that only the term with the most tokens will be selected
            for item in matches:
            
                if item[1] >=start and item[1]<end:
                    
                    
                    flag_term = doc[item[1]:item[2]]
                   
                    if item in new_matches:
                       
                        new_matches.remove(item)
                    flagit.append(flag_term)
            
            for i in range(start,end,1):
                
                #After selecting the largest length for the term, we remove the tokens that constitute it from the term list
                t1 = str(doc[i:i+1])
                
                if t1 in terms_temp:
                    
                    if i in start_idx:
                        start_idx.remove(i)
                    if i in end_idx:
                        end_idx.remove(i)
                    try:
                        terms.remove(t1)
                    except ValueError:
                        #t1 was already removed from the list
                        continue
       
        #new matches will discard all the duplicate terms and terms which were part of a larger term
        add_match_entity = [match_id,start,end]
        new_matches.append(add_match_entity)
        start_idx.append(start)
        end_idx.append(end-1)
        print("term matched:",span.text)
      
    size = len(doc)
    cnt=0
    print(start_idx)
    print(end_idx)
    for i in range(len(doc)):
    
        #We look at the features of the tokens before the term
        if i in start_idx :
            id = start_idx.index(i)
            
            cnt+=1
            print("Tokens before T"+str(cnt))
            print(i)
        
            temp = []
            #Since we look at no more than 3 tokens, this handles the corner case near the start of the sentence
            if i < 3:
                for j in range(i-1,-1,-1) :
                    if j in end_idx:
                       break
                    print(doc[j],":",dictOfWords[str(doc[j])])
                    temp.append(dictOfWords[str(doc[j])])
            else:
                for j in range(i-1,i-4,-1):
                    if j in end_idx:
                        break
                    print(doc[j],":",dictOfWords[str(doc[j])])
                    temp.append(dictOfWords[str(doc[j])])
            dictbefore["T"+str(cnt)] = temp
            #We look at the features of the tokens after the term
            k = end_idx[id]
            temp = []
            print("Tokens after T"+str(cnt))
            #Corner case: fewer than 3 tokens remain after the last token (k) of the term
            if k > size - 4:
                for j in range(k+1,size,1):
                    if j in start_idx:
                        break
                    print(doc[j],":",dictOfWords[str(doc[j])])
                    temp.append(dictOfWords[str(doc[j])])
            else:
                for j in range(k+1,k+4,1):
                    if j in start_idx:
                        break
                    print(doc[j],":",dictOfWords[str(doc[j])])
                    temp.append(dictOfWords[str(doc[j])])
            dictafter["T"+str(cnt)] = temp

                
    for key in dictbefore:
        #Default value for our embeddings is set as [1,1,1,1]
        pos_temp = np.float64([1,1,1,1])        
        tag_temp = np.float64([1,1,1,1])
        dependency_temp = np.float64([1,1,1,1])
    
    
        #We combine the embeddings of the context tokens into one fixed-size vector by elementwise multiplication.
        #Note that this product is commutative, so it does not depend on the order of the tokens
        for i in range(0,len(dictbefore[key]),1):
            pos_temp *= np.array(pos_vectors.query(dictbefore[key][i][1]))
            tag_temp *= np.array(pos_vectors.query(dictbefore[key][i][2]))
            dependency_temp *= np.array(dependency_vectors.query(dictbefore[key][i][3]))
                
        #Converting numpy array to a list and then merging all the lists into one
        
        #Needs to be converted to list for babelnet 
        #pos_temp = pos_temp.tolist()
        #tag_temp = tag_temp.tolist()
        #dependency_temp = dependency_temp.tolist()
        temp = [pos_temp,tag_temp,dependency_temp]
        temp = [val for sublist in temp for val in sublist]
        dictbefore[key] = temp
   
    
    #Doing the same as above for the terms after the second term
    for key in dictafter:
        pos_temp = np.float64([1,1,1,1])        
        tag_temp = np.float64([1,1,1,1])
        dependency_temp = np.float64([1,1,1,1])
   
        for i in range(0,len(dictafter[key]),1):
                
                    pos_temp *= np.array(pos_vectors.query(dictafter[key][i][1]))
                    tag_temp *= np.array(pos_vectors.query(dictafter[key][i][2]))
                    dependency_temp *= np.array(dependency_vectors.query(dictafter[key][i][3]))
                
        #Needs to be converted to list for babelnet  
        #pos_temp = pos_temp.tolist()
        #tag_temp = tag_temp.tolist()
        #dependency_temp = dependency_temp.tolist()
        temp = [pos_temp,tag_temp,dependency_temp]
        temp = [val for sublist in temp for val in sublist]
        dictafter[key] = temp

    dictentity = {}
    count=1

    #The following code snippet is for the shortest dependency path
    for match_id, start, end in new_matches:
        span = doc[start:end]
       
        if start not in start_idx or end-1 not in end_idx:
            continue
        
      
        flag=1
        if end - start > 1:
            for i in range(start,end,1):
                temp = dictOfWords[str(doc[i])]
        
            #Since shortest dependency path is token based,
            #for terms with more than one tokens, we will choose only
            #the token that is a noun or the rightmost token
        
                if temp[1] in {"PROPN","NOUN"}:
                    flag=0
                    dictentity["T"+str(count)] = str(doc[i])
            if flag==1:
                    dictentity["T"+str(count)] = str(doc[end-1])
        else:
            dictentity["T"+str(count)] = str(span)
        count = count + 1
    import networkx as nx
    dictSDP={}
    #Assigning shortest dependency from term1 to term2
    for token in doc:
        print((token.head.text, token.text, token.dep_))
    edges = []
    for token in doc:
        for child in token.children:
            edges.append(('{0}'.format(token.lower_),
                          '{0}'.format(child.lower_)))
            
    
    #Constructing features for the tokens that are present in the shortest dependency path the same way we did earlier
    for i in range(0,len(dictentity),1):
        for j in range(i+1,len(dictentity),1):
            entity1 = dictentity["T"+str(i+1)]
            entity2 = dictentity["T"+str(j+1)]
            vec_temp = np.float64([1,1,1,1])
            dictSDP[str(i+1)+str(j+1)]  = vec_temp
            
            #If both refer to the same term, then continue
            if entity1 == entity2:                
                continue
            graph = nx.Graph(edges)
            #If there exists a shortest dependency path between the two tokens, then we build the features with embeddings
            try:
                print("Shortest path length between T"+str(i+1),"and T"+str(j+1),":",nx.shortest_path_length(graph, source=entity1, target=entity2))
                path=nx.shortest_path(graph, source=entity1, target=entity2)
                print("Tokens in shortest path:",path)
                vec = []
                dep_temp = np.float64([1,1,1,1])
                print("Dependency tags of the tokens:")
                for x in path:
                    print(dictOfWords[x][3])
                    dep_temp *= np.array(dependency_vectors.query(dictOfWords[x][3]))
                #Converting to a list once, after all the tokens on the path have been multiplied in
                vec = dep_temp.tolist()
                dictSDP[str(i+1)+str(j+1)]  = vec
            except:
                continue

            features_list = []
            
    #The following for loops are solely responsible for creating the feature vectors
    for i in range(len(start_idx)):
        for j in range(i+1,len(end_idx),1):
            
            #This takes care of the fact that the window for the two terms should be at max 10 tokens
            if start_idx[j] - end_idx[i] <= 10 and end_idx[j]!=end_idx[i] and start_idx[j]!=start_idx[i]: 
                termA = "T"+str(i+1)
                termB = "T"+str(j+1)
                
                
                if dictentity[termA] == dictentity[termB]:
                    continue
                print("Distance between T"+str(i+1)+" and T"+str(j+1)+":", start_idx[j]- end_idx[i] -1)
                print("Tokens between T"+str(i+1)+" and T"+str(j+1))
                print("the terms:" ,dictentity["T"+str(i+1)], dictentity["T"+str(j+1)])
                file='hyp2.txt' 
                t1 = str(dictentity["T"+str(i+1)])
                t2 = str(dictentity["T"+str(j+1)])
                
                
                #Checker is 0 if it is not hypernym relation and 1 if it is.
                checker = 0
                #All synsets for the lemmas of term1 and term2 from WordNet
                size1 = len(wn.synsets(t1))
                size2 = len(wn.synsets(t2))
                
                
                #To check if term1 is a hypernym of term2, i.e. some synset of term2 is among the hyponyms of a synset of term1
                for i1 in range(0,size1,1):
                    if checker == 1:
                        break
                    t1_syn = wn.synsets(t1)[i1]
                    for j1 in range(0,size2,1):
                        t2_syn = wn.synsets(t2)[j1]
                        hypo1 = set([i1 for i1 in t1_syn.closure(lambda s:s.hyponyms())])
                      
                        if t2_syn in hypo1 :
                            checker = 1
                            
                #To check if term2 is a hypernym of term1, i.e. some synset of term1 is among the hyponyms of a synset of term2
                
                for i1 in range(0,size2,1):
                    if checker == 1:
                        break
                    t2_syn = wn.synsets(t2)[i1]
                    for j1 in range(0,size1,1):
                        t1_syn = wn.synsets(t1)[j1]
                        hypo2 = set([i1 for i1 in t2_syn.closure(lambda s:s.hyponyms())])
                       # print(hypo2)
                        if t1_syn in hypo2 :
                            checker = 1

                print(checker)
                
                
                #This part is for Babelnet
               # with open(file, 'w') as filetowrite:
               #         t1 = str(dictentity["T"+str(i+1)])
               #         t2 = str(dictentity["T"+str(j+1)])
               #         filetowrite.write(dict_of_terms[t1][0]+";"+dict_of_terms[t2][0])
                
               # out = subprocess.Popen(['bash','hyp_check2.sh'], 
               #     stdout=subprocess.PIPE, 
               #     stderr=subprocess.STDOUT)
               # stdout,stderr = out.communicate()
               # x=stdout.splitlines()
               # result= str(x[-1])
               # result = result.split(";")
               # tf = result[-1]
               # tf = tf[:-1]
               
                #Building features from the tokens between the terms, the same way as for the tokens before and after.
                #pos_temp must be reset here; otherwise it carries over its value from the previous loop
                pos_temp = np.float64([1,1,1,1])
                tag_temp = np.float64([1,1,1,1])
                dependency_temp = np.float64([1,1,1,1])
                for k in range(end_idx[i]+1,start_idx[j],1):
                    
                    temp = []
                    pos_temp *= np.array(pos_vectors.query(dictOfWords[str(doc[k])][1]))
                    tag_temp *= np.array(pos_vectors.query(dictOfWords[str(doc[k])][2]))
                    dependency_temp *= np.array(dependency_vectors.query(dictOfWords[str(doc[k])][3]))

               #Needs to be converted to list for babelnet   
               # pos_temp = pos_temp.tolist()
               # tag_temp = tag_temp.tolist()
               # dependency_temp = dependency_temp.tolist()
             
                #Combining all the feature list
                try:
                    temp = [dictbefore[termA],pos_temp,tag_temp,dependency_temp,dictafter[termB],dictSDP[str(i+1)+str(j+1)]]
                    
                
                except:
                    continue
                #Merging all the feature lists to a feature vector
                flattened = [val for sublist in temp for val in sublist]
                flattened.insert(0,start_idx[j]- end_idx[i] -1)
                
                
                flattened.append(checker)
                        
                print(flattened)
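
The last element of the feature vector is the hypernym label: 1 if either term is a (transitive) hypernym of the other, and 0 otherwise. The sketch below mirrors that check on a small hand-written taxonomy, so it runs without WordNet; the `HYPONYMS` table and the helper names are illustrative assumptions, not part of the notebook or of NLTK.

```python
# Toy taxonomy standing in for WordNet: each key maps to its direct hyponyms.
# The entries are illustrative assumptions, not WordNet data.
HYPONYMS = {
    "animal": ["dog", "cat"],
    "dog": ["poodle"],
    "cat": [],
    "poodle": [],
}


def hyponym_closure(term):
    """Return every term reachable from `term` by following hyponym links."""
    seen = set()
    stack = list(HYPONYMS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(HYPONYMS.get(current, []))
    return seen


def is_hypernym_pair(t1, t2):
    """Return 1 if either term is a transitive hypernym of the other, else 0."""
    return int(t2 in hyponym_closure(t1) or t1 in hyponym_closure(t2))


print(is_hypernym_pair("poodle", "animal"))
print(is_hypernym_pair("dog", "cat"))
```

The check is symmetric on purpose: the label records that a hypernym relation holds between the pair, not its direction.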


The cell below retrieves the list of terms from the Knowledge Base, together with the dictionary that maps each term to its original form or synsets.

In [None]:
import pickle
with open ('list_wordnet', 'rb') as fp:
    list_of_terms = pickle.load(fp)
with open ('dict_wordnet', 'rb') as fp:
    dict_of_terms = pickle.load(fp)
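
Both files are ordinary pickles. How they were produced is not shown in this notebook; a minimal sketch of writing and reading files of this shape, using hypothetical contents and a temporary directory, might look like this:

```python
import os
import pickle
import tempfile

# Hypothetical contents: a list of term strings and a term -> synset-names mapping.
list_of_terms = ["dog", "animal"]
dict_of_terms = {"dog": ["dog.n.01"], "animal": ["animal.n.01"]}

with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "list_wordnet")
    dict_path = os.path.join(tmp, "dict_wordnet")

    # Write both objects in binary mode.
    with open(list_path, "wb") as fp:
        pickle.dump(list_of_terms, fp)
    with open(dict_path, "wb") as fp:
        pickle.dump(dict_of_terms, fp)

    # Read them back in binary mode, as the loading cell does.
    with open(list_path, "rb") as fp:
        loaded_terms = pickle.load(fp)
    with open(dict_path, "rb") as fp:
        loaded_dict = pickle.load(fp)

print(loaded_terms == list_of_terms, loaded_dict == dict_of_terms)
```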
    
    

###  Multiprocessing 
As the entire preprocessing task is CPU-intensive and time-consuming, we distribute the work in parallel across the CPUs available on the Stout cluster, with the aim of reducing the overall running time.

In [None]:
import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())


###  Running it on the Corpus
The following code generates sample positive and negative examples from 10 pages of the corpus, in order to test our preprocessing and feature extraction technique. We also use the process pool to reduce the running time.

In [None]:

import string
#Downloading the sentence tokenizer models once, before the loop
nltk.download('punkt')
#To run over 10 pages of our corpus (files numbered 0 to 9)
for i in range(0,10):
    #Creating base string to help in automation
    path = '/data/bphukan/webbase_all/delorme.com_shu.pages_'
    path+=str(i)+'.txt'
    with open(path, 'r') as f:
        #Reading one line at a time from the text file
        line = f.readline()
        cnt = 1
        while line:
           #a_list contains the list of sentences from the line currently being read
           a_list = nltk.tokenize.sent_tokenize(line)
            #For each sentence in a_list
           for line_text in a_list:
               #Skipping sentences that are empty or contain only whitespace
               if line_text and not line_text.isspace():
                    #Removing leading and trailing punctuation marks from the sentence
                    line_text=line_text.strip(string.punctuation)
                    
                    #Submitting the sentence to the pool. Note that pool.apply blocks until the call returns,
                    #so sentences are still processed one at a time
                    results = [pool.apply(create_dict, args=(line_text,list_of_terms))]
                    print(results)

                    
                   
           line = f.readline()
pool.close()
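
`Pool.apply` waits for each call to finish before returning, so calling it once per sentence inside a loop does not run sentences concurrently. One way to spread the work across processes is to collect the sentences first and hand them to `Pool.map`. The sketch below shows the pattern with a cheap stand-in function; `count_tokens` and `process_all` are hypothetical names used only for illustration, and in the notebook the worker would be `create_dict`.

```python
import multiprocessing as mp


def count_tokens(sentence):
    """Stand-in for create_dict: a cheap, picklable function of one sentence."""
    return len(sentence.split())


def process_all(sentences, workers=2):
    """Distribute sentences across worker processes and collect results in order."""
    with mp.Pool(workers) as pool:
        return pool.map(count_tokens, sentences)


if __name__ == "__main__":
    sentences = ["a dog is an animal", "maps show roads"]
    print(process_all(sentences))
```

The worker function has to be defined at module level so that it can be pickled and sent to the worker processes, and any value it should hand back must be returned rather than printed.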
       

