<a href="https://colab.research.google.com/github/devnac221990/KDM---2/blob/main/KDM_ICP2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [31]:
# Install Stanza. Installing and importing it takes only the two commands below.
!pip install stanza

# Import stanza
import stanza



Setting up Stanford CoreNLP

For the interface to work, the Stanford CoreNLP library must be installed and the CORENLP_HOME environment variable must point to the installation location.

The cell below downloads and installs the CoreNLP library on your machine using Stanza's installation command:

In [32]:
# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)

# Set the CORENLP_HOME environment variable to point to the installation location
import os
os.environ["CORENLP_HOME"] = corenlp_dir



That's all for the installation!

We can now double-check that the installation succeeded by listing the files in the CoreNLP directory.

Running the following command should show a number of .jar files:

In [33]:
# Examine the CoreNLP installation folder to make sure the installation is successful
!ls $CORENLP_HOME

build.xml				  jollyday.jar
corenlp.sh				  LIBRARY-LICENSES
CoreNLP-to-HTML.xsl			  LICENSE.txt
ejml-core-0.39.jar			  Makefile
ejml-core-0.39-sources.jar		  patterns
ejml-ddense-0.39.jar			  pom-java-11.xml
ejml-ddense-0.39-sources.jar		  pom.xml
ejml-simple-0.39.jar			  protobuf.jar
ejml-simple-0.39-sources.jar		  README.txt
input.txt				  RESOURCE-LICENSES
input.txt.out				  SemgrexDemo.java
input.txt.xml				  ShiftReduceDemo.java
javax.activation-api-1.2.0.jar		  slf4j-api.jar
javax.activation-api-1.2.0-sources.jar	  slf4j-simple.jar
javax.json-api-1.0-sources.jar		  stanford-corenlp-4.2.0.jar
javax.json.jar				  stanford-corenlp-4.2.0-javadoc.jar
jaxb-api-2.4.0-b180830.0359.jar		  stanford-corenlp-4.2.0-models.jar
jaxb-api-2.4.0-b180830.0359-sources.jar   stanford-corenlp-4.2.0-sources.jar
jaxb-core-2.3.0.1.jar			  StanfordCoreNlpDemo.java
jaxb-core-2.3.0.1-sources.jar		  StanfordDependenciesManual.pdf
jaxb-impl-2.4.0-b180830.0438.jar	  sutime
jaxb-impl-2.4.0-b180830.0438-sources
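The same check can be done from Python instead of reading the `ls` output by eye. The sketch below is only an illustration: the helper names `list_jars` and `has_corenlp_jars` are hypothetical (they are not part of Stanza), and the file-name patterns are assumptions based on the listing above.

```python
from pathlib import Path


def list_jars(corenlp_home):
    """Return the sorted names of all .jar files directly inside corenlp_home."""
    return sorted(p.name for p in Path(corenlp_home).glob("*.jar"))


def has_corenlp_jars(corenlp_home):
    """Heuristic check: a main CoreNLP jar and a models jar are both present.

    The name patterns are an assumption taken from the directory listing
    above (stanford-corenlp-<version>.jar and ...-models.jar).
    """
    names = list_jars(corenlp_home)
    extras = ("-models.jar", "-sources.jar", "-javadoc.jar")
    has_main = any(
        n.startswith("stanford-corenlp-") and not n.endswith(extras) for n in names
    )
    has_models = any(
        n.startswith("stanford-corenlp-") and n.endswith("-models.jar") for n in names
    )
    return has_main and has_models
```

In the notebook, `has_corenlp_jars(os.environ["CORENLP_HOME"])` would be expected to return `True` after a successful installation.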

Constructing CoreNLPClient

At a high level, the CoreNLP Python interface works by first starting a background Java CoreNLP server process, and then initializing a client instance in Python which can pass the text to the background server process, and accept the returned annotation results.

We wrap these functionalities in a CoreNLPClient class. Therefore, we need to start by importing this class from Stanza.

In [34]:
# Import client module
from stanza.server import CoreNLPClient

After the import is done, we can construct a CoreNLPClient instance. The constructor takes a Python list of annotator names as an argument. Here let's explore some basic annotators: tokenization, sentence splitting, part-of-speech tagging, lemmatization, named entity recognition (NER), constituency parsing, dependency parsing, and coreference resolution.

Additionally, the client constructor accepts a memory argument, which specifies how much memory is allocated to the background Java process. The endpoint option sets the address, including the port number, that the client and the server use to communicate. The default port is 9000. However, since this port is already occupied by a system process in Colab, we set it to 9001 in the following example.

We also set be_quiet=True to avoid an I/O issue in Colab notebooks. On your own computer you should be able to use be_quiet=False, which prints detailed logging information from CoreNLP during usage.

For more options for constructing the client, please refer to https://stanfordnlp.github.io/stanza/corenlp_client.html#corenlp-client-options
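Since the choice of port depends on what is already running on the machine, it can help to probe for a free one before building the endpoint string. The following is a minimal sketch using only the standard library; `port_is_free` and `pick_endpoint` are hypothetical helper names, and the candidate port list is an arbitrary assumption.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use": something else holds the port
            return False
        return True


def first_free_port(candidates=(9000, 9001, 9002, 9003)):
    """Return the first candidate port that is not in use."""
    for port in candidates:
        if port_is_free(port):
            return port
    raise RuntimeError("none of the candidate ports is free")


def pick_endpoint(candidates=(9000, 9001, 9002, 9003)):
    """Build an endpoint string in the same form used in this notebook."""
    return "http://localhost:{}".format(first_free_port(candidates))
```

The string returned by `pick_endpoint()` could then be passed as the `endpoint` argument. Note that the check is only a snapshot: another process may still claim the port between the probe and the server start.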

In [35]:
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(
    annotators=['tokenize','ssplit','pos','lemma','ner', 'parse', 'depparse','coref'], 
    memory='4G', 
    endpoint='http://localhost:9001',
    be_quiet=True)
print(client)

# Start the background server and give it some time to come up
# Note that this step is optional: by default the server is started when the first annotation is performed
client.start()
import time
time.sleep(10)

2021-01-28 18:11:53 INFO: Writing properties to tmp file: corenlp_server-34b5c5bb0adf4d4b.props
2021-01-28 18:11:53 INFO: Starting server with command: java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-34b5c5bb0adf4d4b.props -annotators tokenize,ssplit,pos,lemma,ner,parse,depparse,coref -preload -outputFormat serialized


<stanza.server.client.CoreNLPClient object at 0x7ff557377780>


After the above code block finishes executing, if you print the background processes, you should be able to find the Java CoreNLP server running.

In [36]:
# Print background processes and look for java
# You should be able to see a StanfordCoreNLPServer java process running in the background
!ps -o pid,cmd | grep java

    335 java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-34b5c5bb0adf4d4b.props -annotators tokenize,ssplit,pos,lemma,ner,parse,depparse,coref -preload -outputFormat serialized
    356 /bin/bash -c ps -o pid,cmd | grep java
    358 grep java
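The fixed 10-second sleep used above is a guess at how long the server needs. As an alternative, one could poll the port until it accepts connections. This is a sketch under stated assumptions: `wait_for_port` is a hypothetical helper, not a Stanza function, and a successful TCP connection only shows that the server process is listening, not that all models have finished loading.

```python
import socket
import time


def wait_for_port(port, host="localhost", timeout=30.0, interval=0.5):
    """Poll until (host, port) accepts a TCP connection or timeout expires.

    Returns True as soon as a connection succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or connection refused): wait and retry
            time.sleep(interval)
    return False
```

In this notebook the call would look like `wait_for_port(9001)` right after `client.start()`.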


Annotating Text

Annotating a piece of text is as simple as passing it to the annotate method of the client object. Once annotation is complete, a Document object is returned with all annotations.

Note that although annotation is generally fast, the first call might take a while to complete in the notebook. Please be patient.

In [37]:
# Annotate some text
text = "Xi Jinping is a Chinese politician who has served as General Secretary of the Chinese Communist Party (CCP) and Chairman of the Central Military Commission (CMC) since 2012, and President of the People's Republic of China (PRC) since 2013. He has been the paramount leader of China, the most prominent political leader in the country, since 2012. The son of Chinese Communist veteran Xi Zhongxun, he was exiled to rural Yanchuan County as a teenager following his father's purge during the Cultural Revolutionandlived in a cave in the village of Liangjiahe, where he joined the CCP and worked as the party secretary."
document = client.annotate(text)
print(type(document))

<class 'CoreNLP_pb2.Document'>


Accessing Annotations

Annotations can be accessed from the returned Document object.

A Document contains a list of Sentences, each of which contains a list of Tokens. Here let's first explore the annotations stored in all tokens.

In [38]:
print("{:12s}\t{:12s}\t{:6s}\t{}".format("Word", "Lemma", "POS", "NER"))

for i, sent in enumerate(document.sentence):
    print("[Sentence {}]".format(i+1))
    for t in sent.token:
        print("{:12s}\t{:12s}\t{:6s}\t{}".format(t.word, t.lemma, t.pos, t.ner))
    print("")

Word        	Lemma       	POS   	NER
[Sentence 1]
Xi          	Xi          	NNP   	PERSON
Jinping     	Jinping     	NNP   	PERSON
is          	be          	VBZ   	O
a           	a           	DT    	O
Chinese     	chinese     	JJ    	NATIONALITY
politician  	politician  	NN    	TITLE
who         	who         	WP    	O
has         	have        	VBZ   	O
served      	serve       	VBN   	O
as          	as          	IN    	O
General     	General     	NNP   	TITLE
Secretary   	Secretary   	NNP   	O
of          	of          	IN    	O
the         	the         	DT    	O
Chinese     	Chinese     	NNP   	ORGANIZATION
Communist   	Communist   	NNP   	ORGANIZATION
Party       	Party       	NNP   	ORGANIZATION
(           	(           	-LRB- 	O
CCP         	CCP         	NNP   	ORGANIZATION
)           	)           	-RRB- 	O
and         	and         	CC    	O
Chairman    	chairman    	NN    	TITLE
of          	of          	IN    	O
the         	the         	DT    	O
Central     	Central     	NNP   	O
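For further processing it is often handier to collect the token annotations into plain Python tuples rather than printing them. The sketch below uses only the fields shown above (`sentence`, `token`, `word`, `lemma`, `pos`, `ner`). So that it runs without a CoreNLP server, it builds a small stand-in document with `SimpleNamespace`; the stand-in and the helper names are assumptions for illustration, not part of Stanza.

```python
from types import SimpleNamespace


def token_rows(document):
    """Flatten a document into (sentence_number, word, lemma, pos, ner) tuples."""
    rows = []
    for i, sent in enumerate(document.sentence, start=1):
        for t in sent.token:
            rows.append((i, t.word, t.lemma, t.pos, t.ner))
    return rows


def words_with_tag(document, ner_tag):
    """Return the words whose token-level NER tag equals ner_tag."""
    return [word for _, word, _, _, ner in token_rows(document) if ner == ner_tag]


def _tok(word, lemma, pos, ner):
    # Stand-in for a CoreNLP token; only the four fields used here
    return SimpleNamespace(word=word, lemma=lemma, pos=pos, ner=ner)


# Stand-in document mirroring the first rows of the table above
demo_doc = SimpleNamespace(sentence=[SimpleNamespace(token=[
    _tok("Xi", "Xi", "NNP", "PERSON"),
    _tok("Jinping", "Jinping", "NNP", "PERSON"),
    _tok("is", "be", "VBZ", "O"),
])])

print(words_with_tag(demo_doc, "PERSON"))
```

With the real `document` returned by `client.annotate(text)`, the same two functions should work unchanged, since they touch only the attributes used in the cell above.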

Alternatively, you can browse the NER results by iterating over the entity mentions in each sentence. For example:

In [39]:
# Iterate over all detected entity mentions
print("{:30s}\t{}".format("Mention", "Type"))

for sent in document.sentence:
    for m in sent.mentions:
        print("{:30s}\t{}".format(m.entityMentionText, m.entityType))

Mention                       	Type
Xi Jinping                    	PERSON
Chinese                       	NATIONALITY
politician                    	TITLE
General                       	TITLE
Chinese Communist Party       	ORGANIZATION
CCP                           	ORGANIZATION
Chairman                      	TITLE
Central Military Commission   	ORGANIZATION
CMC                           	ORGANIZATION
2012                          	DATE
President                     	TITLE
People's                      	LOCATION
Republic of China             	COUNTRY
PRC                           	COUNTRY
2013                          	DATE
leader                        	TITLE
China                         	COUNTRY
leader                        	TITLE
2012                          	DATE
He                            	PERSON
Chinese                       	NATIONALITY
Communist                     	IDEOLOGY
Xi Zhongxun                   	PERSON
Yanchuan County               	LOCATION
Cultural Revolutionan
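A flat list of mentions is easier to scan once it is grouped by entity type. The following sketch does that with a `defaultdict`, relying only on the `mentions`, `entityType`, and `entityMentionText` fields used above. The stand-in document is an assumption that lets the example run without a server.

```python
from collections import defaultdict
from types import SimpleNamespace


def mentions_by_type(document):
    """Group entity mention texts by their entity type, preserving order."""
    grouped = defaultdict(list)
    for sent in document.sentence:
        for m in sent.mentions:
            grouped[m.entityType].append(m.entityMentionText)
    return dict(grouped)


def _mention(text, entity_type):
    # Stand-in for a CoreNLP entity mention; only the two fields used here
    return SimpleNamespace(entityMentionText=text, entityType=entity_type)


# Stand-in document with a few of the mentions listed above
demo_doc = SimpleNamespace(sentence=[SimpleNamespace(mentions=[
    _mention("Xi Jinping", "PERSON"),
    _mention("Chinese", "NATIONALITY"),
    _mention("Chinese Communist Party", "ORGANIZATION"),
    _mention("CCP", "ORGANIZATION"),
])])

for entity_type, texts in mentions_by_type(demo_doc).items():
    print("{:15s}\t{}".format(entity_type, ", ".join(texts)))
```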

To print all the annotations of a sentence, token, or mention, simply print the corresponding object.

In [40]:
# Print annotations of a token
print(document.sentence[0].token[0])

# Print annotations of a mention
print(document.sentence[0].mentions[0])

word: "Xi"
pos: "NNP"
value: "Xi"
before: ""
after: " "
originalText: "Xi"
ner: "PERSON"
lemma: "Xi"
beginChar: 0
endChar: 2
utterance: 0
speaker: "PER0"
beginIndex: 0
endIndex: 1
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false
coarseNER: "PERSON"
fineGrainedNER: "PERSON"
corefMentionIndex: 0
entityMentionIndex: 0
nerLabelProbs: "PERSON=0.9477364553819386"

sentenceIndex: 0
tokenStartInSentenceInclusive: 0
tokenEndInSentenceExclusive: 2
ner: "PERSON"
entityType: "PERSON"
entityMentionIndex: 0
canonicalEntityMentionIndex: 0
entityMentionText: "Xi Jinping"
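The `tokenStartInSentenceInclusive` and `tokenEndInSentenceExclusive` fields printed above describe a half-open token range within the sentence, so a mention can be mapped back to its tokens with an ordinary slice. A minimal sketch, using a stand-in sentence and mention (assumptions for illustration) in place of live CoreNLP output:

```python
from types import SimpleNamespace


def mention_words(sentence, mention):
    """Return the words of the tokens covered by an entity mention."""
    start = mention.tokenStartInSentenceInclusive
    end = mention.tokenEndInSentenceExclusive
    # Half-open range [start, end), matching Python slice semantics
    return [t.word for t in sentence.token[start:end]]


# Stand-in sentence and mention mirroring the printed values above
demo_sentence = SimpleNamespace(token=[
    SimpleNamespace(word="Xi"),
    SimpleNamespace(word="Jinping"),
    SimpleNamespace(word="is"),
])
demo_mention = SimpleNamespace(
    tokenStartInSentenceInclusive=0,
    tokenEndInSentenceExclusive=2,
)

print(mention_words(demo_sentence, demo_mention))
```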



In [21]:
# get the first sentence
sentence = document.sentence[0]

# get the constituency parse of the first sentence
print('---')
print('constituency parse of first sentence')
constituency_parse = sentence.parseTree
print(constituency_parse)

---
constituency parse of first sentence
child {
  child {
    child {
      child {
        value: "Xi"
      }
      value: "NNP"
      score: -12.861265182495117
    }
    child {
      child {
        value: "Jinping"
      }
      value: "NNP"
      score: -12.74289608001709
    }
    value: "NP"
    score: -28.568775177001953
  }
  child {
    child {
      child {
        value: "is"
      }
      value: "VBZ"
      score: -0.14797931909561157
    }
    child {
      child {
        child {
          child {
            child {
              value: "a"
            }
            value: "DT"
            score: -1.5601264238357544
          }
          child {
            child {
              value: "Chinese"
            }
            value: "JJ"
            score: -5.580296039581299
          }
          child {
            child {
              value: "politician"
            }
            value: "NN"
            score: -9.421514511108398
          }
          value: "NP"
      

In [22]:
# get the first subtree of the constituency parse
print('first subtree of constituency parse')
print(constituency_parse.child[0])

first subtree of constituency parse
child {
  child {
    child {
      value: "Xi"
    }
    value: "NNP"
    score: -12.861265182495117
  }
  child {
    child {
      value: "Jinping"
    }
    value: "NNP"
    score: -12.74289608001709
  }
  value: "NP"
  score: -28.568775177001953
}
child {
  child {
    child {
      value: "is"
    }
    value: "VBZ"
    score: -0.14797931909561157
  }
  child {
    child {
      child {
        child {
          child {
            value: "a"
          }
          value: "DT"
          score: -1.5601264238357544
        }
        child {
          child {
            value: "Chinese"
          }
          value: "JJ"
          score: -5.580296039581299
        }
        child {
          child {
            value: "politician"
          }
          value: "NN"
          score: -9.421514511108398
        }
        value: "NP"
        score: -19.258865356445312
      }
      child {
        child {
          child {
            child {
          

In [23]:
# get the value of the first subtree
print('---')
print('value of first subtree of constituency parse')
print(constituency_parse.child[0].value)

---
value of first subtree of constituency parse
S


In [24]:
# get the first token of the first sentence
print('first token of first sentence')
token = sentence.token[0]
print(token)

first token of first sentence
word: "Xi"
pos: "NNP"
value: "Xi"
before: ""
after: " "
originalText: "Xi"
ner: "PERSON"
lemma: "Xi"
beginChar: 0
endChar: 2
utterance: 0
speaker: "PER0"
beginIndex: 0
endIndex: 1
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false
coarseNER: "PERSON"
fineGrainedNER: "PERSON"
corefMentionIndex: 0
entityMentionIndex: 0
nerLabelProbs: "PERSON=0.9477364553819386"



In [25]:
# get the part-of-speech tag
print('part of speech tag of token')
print(token.pos)

part of speech tag of token
NNP


In [26]:
# get the named entity tag
print('named entity tag of token')
print(token.ner)

named entity tag of token
PERSON


In [27]:
# get an entity mention from the first sentence
print('first entity mention in sentence')
print(sentence.mentions[0])

first entity mention in sentence
sentenceIndex: 0
tokenStartInSentenceInclusive: 0
tokenEndInSentenceExclusive: 2
ner: "PERSON"
entityType: "PERSON"
entityMentionIndex: 0
canonicalEntityMentionIndex: 0
entityMentionText: "Xi Jinping"



In [28]:
# access the coref chain
print('coref chains for the example')
print(document.corefChain)

coref chains for the example
[chainID: 18
mention {
  mentionID: 18
  mentionType: "NOMINAL"
  number: "SINGULAR"
  gender: "NEUTRAL"
  animacy: "INANIMATE"
  beginIndex: 15
  endIndex: 17
  headIndex: 16
  sentenceIndex: 1
  position: 5
}
mention {
  mentionID: 14
  mentionType: "PROPER"
  number: "SINGULAR"
  gender: "NEUTRAL"
  animacy: "INANIMATE"
  beginIndex: 7
  endIndex: 8
  headIndex: 7
  sentenceIndex: 1
  position: 1
}
representative: 1
, chainID: 34
mention {
  mentionID: 1
  mentionType: "PROPER"
  number: "UNKNOWN"
  gender: "NEUTRAL"
  animacy: "INANIMATE"
  beginIndex: 18
  endIndex: 19
  headIndex: 18
  sentenceIndex: 0
  position: 2
}
mention {
  mentionID: 34
  mentionType: "PROPER"
  number: "UNKNOWN"
  gender: "NEUTRAL"
  animacy: "INANIMATE"
  beginIndex: 40
  endIndex: 42
  headIndex: 41
  sentenceIndex: 2
  position: 15
}
representative: 1
, chainID: 27
mention {
  mentionID: 0
  mentionType: "PROPER"
  number: "SINGULAR"
  gender: "UNKNOWN"
  animacy: "ANIMATE"

In [29]:
mychains = []
chains = document.corefChain
for chain in chains:
    mychain = []
    # Loop through every mention of this chain
    for mention in chain.mention:
        # Look up the sentence containing this mention and slice out its tokens.
        # A mention can span several words: a pronoun like "he", but also a
        # compound noun phrase like "His wife Michelle".
        words_list = document.sentence[mention.sentenceIndex].token[mention.beginIndex:mention.endIndex]
        # Build a string out of the words of this mention
        ment_word = ' '.join([x.word for x in words_list])
        mychain.append(ment_word)
    mychains.append(mychain)

for chain in mychains:
    print(' <-> '.join(chain))

the country <-> China
CCP <-> the CCP
Xi Jinping <-> He <-> his <-> he <-> he
2012 <-> 2012
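Each chain printed earlier also carries a `representative` field. The sketch below assumes that this value is an index into the chain's own `mention` list. That reading is an assumption: it fits the output above (for the chain "the country <-> China", `representative: 1` would select "China"), but it should be checked against the CoreNLP protobuf definition before relying on it. The stand-in document and chain exist only so the example runs without a server.

```python
from types import SimpleNamespace


def chain_texts(document, chain):
    """Return the text of every mention in a coreference chain."""
    texts = []
    for m in chain.mention:
        toks = document.sentence[m.sentenceIndex].token[m.beginIndex:m.endIndex]
        texts.append(" ".join(t.word for t in toks))
    return texts


def representative_text(document, chain):
    """Text of the representative mention.

    ASSUMPTION: chain.representative indexes into chain.mention.
    """
    return chain_texts(document, chain)[chain.representative]


# Stand-in data: one sentence, and one chain linking "the country" to "China"
demo_doc = SimpleNamespace(sentence=[SimpleNamespace(token=[
    SimpleNamespace(word=w) for w in ["China", "is", "the", "country"]
])])
demo_chain = SimpleNamespace(
    mention=[
        SimpleNamespace(sentenceIndex=0, beginIndex=2, endIndex=4),
        SimpleNamespace(sentenceIndex=0, beginIndex=0, endIndex=1),
    ],
    representative=1,
)

print(representative_text(demo_doc, demo_chain), "<-", chain_texts(demo_doc, demo_chain))
```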


Shutting Down the CoreNLP Server

To shut down the background CoreNLP server process, simply call the stop() method of the client. Note that once a server is shut down, you'll have to restart it with the start() method before requesting any further annotation.

In [30]:
# Shut down the background CoreNLP server
client.stop()

time.sleep(10)
!ps -o pid,cmd | grep java

KeyboardInterrupt: ignored