## Knowledge graph generation from scientific text

This Jupyter notebook demonstrates the full end-to-end framework for knowledge graph generation. The input text is read from the `./Data/TestData/` folder, which contains abstracts of papers from [paperswithcode.com](https://paperswithcode.com/). We use these abstracts as our dataset to create reproducible graphs of different modalities.

### Environment

The required packages are listed in [the requirements file](./requirements.txt) in the `text2graph` folder. To install them in the target Python environment, run:

`pip install -r requirements.txt`

Detailed Instructions:

The sample commands below create a new conda environment named 't2g', install the requirements, register the environment as a Jupyter kernel, and start the notebook server:

    conda create --name t2g python=3.6
    conda activate t2g
    cd 'path-to-text2graph-folder'
    pip install -r requirements.txt
    python -m ipykernel install --user --name=t2g
    jupyter notebook

Select the kernel 't2g' in the notebook.

### Loading the libraries

Import the required libraries, including `spacy`, and define the data folders. The directories are:

* NER Models at `./Models/`
* Paper abstracts at `./Data/TestData/`
* NER Model outputs at `./Output/`

In [1]:
from __future__ import unicode_literals, print_function
import os
from pathlib import Path
import spacy
import plac

model_dir = './Models/'
test_dir = './Data/TestData/'
output_dir = './Output/'
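
Before running the remaining cells, it can help to confirm that the three folders listed above exist. The following is a minimal sketch and not part of the original notebook; the helper name `missing_dirs` is hypothetical.

```python
from pathlib import Path

def missing_dirs(paths):
    """Return the entries of `paths` that are not existing directories."""
    return [p for p in paths if not Path(p).is_dir()]

# Folder names as used in this notebook.
missing = missing_dirs(['./Models/', './Data/TestData/', './Output/'])
if missing:
    print('Missing folders:', ', '.join(missing))
```

If anything is printed, create the folder or adjust the paths before loading the models.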



## Load NER Models

Now load the NER models that we saved from the NER pipeline. 

In [2]:
nlp = spacy.load(model_dir)

## Read the abstracts

Read the abstracts located in the `./Data/TestData/` folder (the cell below reads a single one, `LeNet.txt`). Each abstract is lower-cased and then split into sentences using the `nltk` function `sent_tokenize`, which produces a list of sentences.

In [3]:
import nltk

abstracts = []
sentences = []
ent_tagged_text = []

def tag_entity_text(sentence, text, replacementText):
    # Replace each entity string in `text` with its tagged counterpart in
    # `replacementText`. str.replace substitutes every occurrence, so an
    # entity that is a substring of another one ends up nested inside it.
    for (t, r) in zip(text, replacementText):
        sentence = sentence.replace(t, r)
    return sentence

# Read one abstract and lower-case it.
with open(os.path.join(test_dir, 'LeNet.txt')) as file:
    abstxt = file.read().lower()
    abstracts.append(abstxt)
    

# split each abstract into sentences (sent_tokenize needs the NLTK Punkt data)
for abstract in abstracts:
    sents = nltk.sent_tokenize(abstract)
    for sent in sents:
        sentences.append(sent)
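
The cell above reads only `LeNet.txt`. To process every abstract in the folder, the reading step could be generalized along the following lines. This is a sketch under assumptions that the source does not confirm: that all abstracts are stored as UTF-8 `.txt` files, and that lower-casing should be applied to each of them as it is above. The function name `load_abstracts` is hypothetical.

```python
from pathlib import Path

def load_abstracts(folder):
    """Read every .txt file in `folder` and return the lower-cased contents."""
    texts = []
    # Sort the paths so that the order of abstracts is reproducible.
    for path in sorted(Path(folder).glob('*.txt')):
        texts.append(path.read_text(encoding='utf-8').lower())
    return texts

# Example (assumes the folder layout used in this notebook):
# abstracts = load_abstracts('./Data/TestData/')
```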


### Create NER tagged sentences from our Spacy model

Each sentence is then passed to the loaded `spacy` pipeline `nlp`, which tags the entities and returns a `Doc` object. From it we can extract the following:

* `.ents` on the `Doc` returns a tuple of all the entities tagged by `spacy`
* `.text` on each entity gives the text that has been tagged


### Convert Spacy tagged sentences to semeval
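The `spacy` tagged sentences are converted to `semeval` style markup text: in each sentence that contains exactly two entities, the entities are wrapped in `<e1>...</e1>` and `<e2>...</e2>` tags. Sentences with any other number of entities are skipped.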

In [6]:
text = []
replacementText = []
semeval_tagged = []
for sentence in sentences:
    ner_tagged = nlp(sentence)
    tagged_entities = ner_tagged.ents
    semevalify = ""
    # Only sentences with exactly two entities form an <e1>/<e2> pair.
    if len(tagged_entities) == 2:
        text = []
        replacementText = []
        for (i, ent) in enumerate(tagged_entities):
            text.append(ent.text)
            replacementText.append('<e' + str(i+1) + '>' + ent.text + '</e' + str(i+1) + '>')
        semevalify = tag_entity_text(sentence, text, replacementText)
    semeval_tagged.append(semevalify)

# Drop the empty strings left by the sentences that were skipped.
semeval_tagged = list(filter(None, semeval_tagged))
for sentence in semeval_tagged:
    print(sentence)
    print("\n")

<e1>convolutional <e2>neural networks</e2></e1> are are a special kind of multi-layer <e2>neural networks</e2>.

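The nested tags in the output above arise because `tag_entity_text` uses `str.replace`, which also matches "neural networks" inside "convolutional neural networks". One possible way to avoid this is to tag by character offsets instead of by text. The sketch below is not the notebook's method: the function name `tag_by_offsets` is hypothetical, the example sentence is simplified, and the offsets are computed by hand. With `spacy`, the offsets could be taken from each entity's `start_char` and `end_char` attributes. Whether the downstream classifier behaves better on this form is not stated in the source.

```python
def tag_by_offsets(sentence, spans):
    """Wrap character spans in <e1>, <e2>, ... tags.

    `spans` is a list of (start, end) character offsets in `sentence`,
    given in reading order and assumed not to overlap.
    """
    # Number the spans in reading order, then insert the tags from right to
    # left so that earlier offsets stay valid while the string grows.
    numbered = list(enumerate(spans, start=1))
    for i, (start, end) in reversed(numbered):
        sentence = (sentence[:start] + '<e%d>' % i + sentence[start:end]
                    + '</e%d>' % i + sentence[end:])
    return sentence

s = 'convolutional neural networks are a special kind of multi-layer neural networks.'
first = (0, len('convolutional neural networks'))
start2 = s.rindex('neural networks')
second = (start2, start2 + len('neural networks'))
print(tag_by_offsets(s, [first, second]))
```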



## Call the saved relationship classification module

The list of `semeval` tagged sentences, `semeval_tagged`, is then passed to the next module, which classifies the relationship between the two entities. We first import the required libraries and then load the saved relationship extraction model; the cell below reads it from `Models/re_model_latest_auto_ann_run_2019_25_09`.


In [5]:
from models import KerasTextClassifier
import numpy as np
from sklearn.model_selection import train_test_split

clf = KerasTextClassifier()
clf.load('Models/re_model_latest_auto_ann_run_2019_25_09')



### List the available relationship classes

The following code lists the relationship classes that our model can extract. Since the classes are label-encoded, we create a dictionary that maps each encoded index to its class name. This dictionary is used later when generating knowledge graph triples.

In [7]:
# Create a label dictionary
label_dict = {}
for i,c in enumerate(list(clf.encoder.classes_)):
    print(str(i) + ": " + c)
    label_dict[i] = c

0: Compare
1: Conjunction
2: Evaluate-for
3: Feature-of
4: Part-of
5: Used-for
6: isA
7: sameAs
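
The mapping built above can be illustrated without the trained model. In this sketch the class names are copied from the printed output, while the variable names `index_to_label` and `example_pred` and the predicted indices are hypothetical and used only for illustration.

```python
# Class names copied from the output above, in encoder order.
classes = ['Compare', 'Conjunction', 'Evaluate-for', 'Feature-of',
           'Part-of', 'Used-for', 'isA', 'sameAs']

# Index -> name, equivalent to the loop that fills label_dict.
index_to_label = dict(enumerate(classes))

# Hypothetical predicted indices, for illustration only.
example_pred = [7, 4]
print([index_to_label[i] for i in example_pred])
```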


## Predict the relationship for each sentence

For each tagged sentence from the abstract, we now predict the relationship class.

In [8]:
y_pred = clf.predict(semeval_tagged)

In [9]:
## Show predictions side by side
for i,sentence in enumerate(semeval_tagged):
    print(sentence + ":\n" +  label_dict.get(y_pred[i]) + "\n")

<e1>convolutional <e2>neural networks</e2></e1> are are a special kind of multi-layer <e2>neural networks</e2>.:
sameAs



In [10]:
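import re

def generate_graph(sentence, predicted_class):
    # Extract the text inside the <e1> and <e2> tags and return the triple
    # (entity 1, relationship, entity 2).
    e1 = re.search(r'<e1>(.+?)</e1>', sentence)
    e2 = re.search(r'<e2>(.+?)</e2>', sentence)
    return (e1.group(1), predicted_class, e2.group(1))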
    

In [11]:
for i,sentence in enumerate(semeval_tagged):
    print(generate_graph(sentence,label_dict.get(y_pred[i])))

('convolutional <e2>neural networks</e2>', 'sameAs', 'neural networks')
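
The triple above still contains `<e2>` markup inside its first entity, and `generate_graph` raises an `AttributeError` if either tag pair is missing. A more defensive variant might look like the following sketch. The function name `extract_triple` is hypothetical, and stripping nested tags is an assumption about the desired output rather than something the source specifies.

```python
import re

# Matches any opening or closing <e1>/<e2> tag.
TAG = re.compile(r'</?e[12]>')

def extract_triple(sentence, predicted_class):
    """Return (entity1, relation, entity2), or None if a tag pair is missing."""
    e1 = re.search(r'<e1>(.+?)</e1>', sentence)
    e2 = re.search(r'<e2>(.+?)</e2>', sentence)
    if e1 is None or e2 is None:
        return None
    # Remove any markup nested inside an entity.
    return (TAG.sub('', e1.group(1)), predicted_class, TAG.sub('', e2.group(1)))

tagged = ('<e1>convolutional <e2>neural networks</e2></e1> are a special '
          'kind of multi-layer <e2>neural networks</e2>.')
print(extract_triple(tagged, 'sameAs'))
```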
