# OpenIE Implementation

Note: This notebook was written in Google Colaboratory. Adjust it as needed before running it in Jupyter Notebook or any other interactive Python environment.

Instructions to run on Colab: Upload the CSV file (dataset.csv, found in the repo) to Files (in the sidebar) so that the notebook can access it.
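
As a rough sketch of the upload step, the helper below checks whether the CSV is already present and, only when running inside Colab, opens the upload widget. The function name `ensure_dataset` is made up for illustration; `google.colab.files.upload()` is assumed to save the chosen files into the current working directory, which is how it behaves on Colab.

```python
import os


def ensure_dataset(path="dataset.csv"):
    """Return True if `path` exists; on Colab, prompt for an upload when it is missing.

    `ensure_dataset` is a hypothetical helper, not part of the original notebook.
    """
    if os.path.exists(path):
        return True
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import files
    except ImportError:
        return False
    # Opens the Colab upload widget; uploaded files are written to the working directory
    files.upload()
    return os.path.exists(path)
```

Calling `ensure_dataset()` before `pd.read_csv("dataset.csv")` would surface a missing file early instead of failing at read time.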

In [None]:
!pip install stanza

Collecting stanza
  Downloading https://files.pythonhosted.org/packages/50/ae/a70a58ce6b4e2daad538688806ee0f238dbe601954582a74ea57cde6c532/stanza-1.2-py3-none-any.whl (282kB)

In [None]:
import stanza
import pandas as pd
import numpy as np
import re

## Starting the CoreNLP Server with the OpenIE Client

In [None]:
# Download the Stanford CoreNLP package with Stanza's installation command
# This takes several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)

# Set the CORENLP_HOME environment variable to point to the installation location
import os
os.environ["CORENLP_HOME"] = corenlp_dir

2021-04-01 07:34:17 INFO: Installing CoreNLP package into ./corenlp...
Downloading http://nlp.stanford.edu/software/stanford-corenlp-latest.zip: 100%|██████████| 505M/505M [03:54<00:00, 2.15MB/s]


In [None]:
# Check for successful installation: the listing should include several .jar files
!ls $CORENLP_HOME

build.xml				  jollyday.jar
corenlp.sh				  LIBRARY-LICENSES
CoreNLP-to-HTML.xsl			  LICENSE.txt
ejml-core-0.39.jar			  Makefile
ejml-core-0.39-sources.jar		  patterns
ejml-ddense-0.39.jar			  pom-java-11.xml
ejml-ddense-0.39-sources.jar		  pom.xml
ejml-simple-0.39.jar			  protobuf.jar
ejml-simple-0.39-sources.jar		  README.txt
input.txt				  RESOURCE-LICENSES
input.txt.out				  SemgrexDemo.java
input.txt.xml				  ShiftReduceDemo.java
javax.activation-api-1.2.0.jar		  slf4j-api.jar
javax.activation-api-1.2.0-sources.jar	  slf4j-simple.jar
javax.json-api-1.0-sources.jar		  stanford-corenlp-4.2.0.jar
javax.json.jar				  stanford-corenlp-4.2.0-javadoc.jar
jaxb-api-2.4.0-b180830.0359.jar		  stanford-corenlp-4.2.0-models.jar
jaxb-api-2.4.0-b180830.0359-sources.jar   stanford-corenlp-4.2.0-sources.jar
jaxb-core-2.3.0.1.jar			  StanfordCoreNlpDemo.java
jaxb-core-2.3.0.1-sources.jar		  StanfordDependenciesManual.pdf
jaxb-impl-2.4.0-b180830.0438.jar	  sutime
jaxb-impl-2.4.0-b180830.0438-sources

In [None]:
from stanza.server import CoreNLPClient

In [None]:
client = CoreNLPClient(
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner','openie'], 
    memory='4G', 
    endpoint='http://localhost:9001',
    be_quiet=True)
print(client)

# Start the background server and give it some time to come up
# This step is optional: by default the server starts when the first annotation is requested
client.start()
import time
time.sleep(10)

2021-04-01 07:42:24 INFO: Writing properties to tmp file: corenlp_server-8243c47f82734441.props
2021-04-01 07:42:24 INFO: Starting server with command: java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-8243c47f82734441.props -annotators tokenize,ssplit,pos,lemma,ner,openie -preload -outputFormat serialized


<stanza.server.client.CoreNLPClient object at 0x7f8e85da4750>


In [None]:
# Check whether the server is running
!ps -o pid,cmd | grep java

    196 java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-8243c47f82734441.props -annotators tokenize,ssplit,pos,lemma,ner,openie -preload -outputFormat serialized
    225 /bin/bash -c ps -o pid,cmd | grep java
    227 grep java
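
Listing processes shows that a Java process exists, but not that the server is accepting connections. A complementary check, sketched below, tries to open a TCP connection to the endpoint. The helper name `is_port_open` is an assumption for illustration; the port 9001 comes from the `endpoint` passed to `CoreNLPClient` above.

```python
import socket


def is_port_open(host="localhost", port=9001, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    `is_port_open` is a hypothetical helper, not part of Stanza or CoreNLP.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts
        return False
```

Polling `is_port_open()` in a short loop could replace the fixed `time.sleep(10)`, since the server may need more or less time depending on the machine.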


## Importing the Dataset

In [None]:
dataset = pd.read_csv("dataset.csv")

Finding the average number of characters per article

In [None]:
l = list(map(len, dataset.articles))
print(sum(l)/len(l))

2837.75565


Finding the number of sentences in each article (approximated by counting periods)

In [None]:
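def num_lines(text):
    # Approximate the sentence count by the number of periods in the text;
    # abbreviations and decimal numbers inflate the count
    return len(re.findall(r'\.', text))

dataset['sentences'] = dataset['articles'].apply(num_lines)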
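
Counting every period overcounts decimals such as "1.2" and ignores sentences that end in "?" or "!". The sketch below is one possible refinement, not the notebook's method: it counts runs of sentence-ending punctuation (optionally followed by a closing quote or bracket) only when whitespace or the end of the text follows. The name `count_sentences` is made up for illustration, and the result is still a heuristic; abbreviations such as "Dr. " are counted as sentence ends.

```python
import re

# Sentence-ending punctuation, optional closing quote/bracket, then whitespace or end of text
_SENTENCE_END = re.compile(r'[.!?]+["\'”’)]*(?=\s|$)')


def count_sentences(text):
    """Heuristic sentence count; `count_sentences` is a hypothetical helper."""
    return len(_SENTENCE_END.findall(text))
```

It can be applied in the same way as `num_lines`, for example `dataset['articles'].apply(count_sentences)`.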

In [48]:
articles = list(dataset.articles)[:25]
print(articles[-1])

Private offices in Mumbai are set to reopen on Monday with 10 per cent strength, if required, with the remaining employees working from home, as part of the Maharashtra government’s ‘Mission Begin Again’. Mumbai Traffic Police have prepared for a surge in vehicular traffic on roads considering the state’s phased easing of the lockdown. While some companies have made arrangements for sanitisers and disposable face masks for their employees, there are others still taking a guarded approach to opening up. Parth Shah, from LSD films Pvt ltd, said they are not opening their office for the next few days. “Only housekeeping staff is coming to office.” Several offices have adopted a wait-and-watch approach before allowing employees to function, even as some offices in the business districts of Bandra Kurla Complex and Nariman Point spruced up their premises for the limited opening. “We will be opening our office from tomorrow. Arrangements have been made to ensure that every employee gets chec

Finding all triplets for the first 25 articles

(Note: A new file, openie_triplets.txt, will be created in the Files menu. Double-click it, or download it, to view the triplets.)

In [52]:
# Write the OpenIE triples of every sentence in each article to a text file
with open("openie_triplets.txt", "w") as file_openie:
    for i, article in enumerate(articles, start=1):
        file_openie.write("Article " + str(i) + "\n\n")
        doc = client.annotate(article)

        for sentence in doc.sentence:
            # Rebuild the sentence from CoreNLP's own tokens; splitting the raw text
            # on "." does not line up with CoreNLP's sentence boundaries
            sentence_text = " ".join(token.word for token in sentence.token)
            file_openie.write("Sentence: " + sentence_text + "\n\n")
            file_openie.write("Triples:\n")
            for triple in sentence.openieTriple:
                file_openie.write("\n-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*\n")
                file_openie.write("Subject: " + triple.subject)
                file_openie.write("\nRelation: " + triple.relation)
                file_openie.write("\nObject: " + triple.object)

            file_openie.write("\n-----------------------------------\n")

2021-04-01 09:01:19 INFO: Starting server with command: java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-8243c47f82734441.props -annotators tokenize,ssplit,pos,lemma,ner,openie -preload -outputFormat serialized
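
The text file is convenient for reading, but a tabular form is easier to filter and count. The sketch below collects triples into a list of dictionaries, one per triple. The helper name `triples_to_rows` is an assumption, and the `fake_doc` object is a made-up stand-in that only mimics the `sentence`, `openieTriple`, `subject`, `relation`, and `object` attributes used above; it is not real OpenIE output.

```python
from types import SimpleNamespace


def triples_to_rows(doc, article_id):
    """Flatten the OpenIE triples of an annotated document into a list of dicts.

    `triples_to_rows` is a hypothetical helper, not part of Stanza.
    """
    rows = []
    for sentence_id, sentence in enumerate(doc.sentence, start=1):
        for triple in sentence.openieTriple:
            rows.append({
                "article": article_id,
                "sentence": sentence_id,
                "subject": triple.subject,
                "relation": triple.relation,
                "object": triple.object,
            })
    return rows


# Made-up stand-in for an annotated document, so the helper can run without the server
fake_doc = SimpleNamespace(sentence=[
    SimpleNamespace(openieTriple=[
        SimpleNamespace(subject="offices", relation="reopen on", object="Monday"),
    ]),
    SimpleNamespace(openieTriple=[]),
])
rows = triples_to_rows(fake_doc, article_id=1)
```

With a real document from `client.annotate(article)`, the resulting rows could be passed to `pd.DataFrame(rows)` for further analysis.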
