
# Stanza: A Tutorial on the Python CoreNLP Interface

![Latest Version](https://img.shields.io/pypi/v/stanza.svg?colorB=bc4545)
![Python Versions](https://img.shields.io/pypi/pyversions/stanza.svg?colorB=bc4545)

While the Stanza library implements accurate neural network modules for basic functionalities such as part-of-speech tagging and dependency parsing, the [Stanford CoreNLP Java library](https://stanfordnlp.github.io/CoreNLP/) has been developed over many years and offers complementary features such as coreference resolution and relation extraction. To unlock these features, the Stanza library also offers an officially maintained Python interface to the CoreNLP Java library. This interface allows you to get NLP annotations from CoreNLP by writing native Python code.


This tutorial walks you through the installation, setup and basic usage of this Python CoreNLP interface. If you want to learn how to use the neural network components in Stanza, please refer to other tutorials.

## 1. Installation

Before the installation starts, please make sure that you have Python 3 and Java installed on your computer. Since Colab already has them installed, we'll skip this procedure in this notebook.

### Installing Stanza

Installing and importing Stanza are as simple as running the following commands:

In [6]:
# Install stanza; note that the prefix "!" is not needed if you are running in a terminal
!pip install stanza

# Import stanza
import stanza



### Setting up Stanford CoreNLP

For the interface to work, the Stanford CoreNLP library has to be installed and the `CORENLP_HOME` environment variable has to point to the installation location.

Here we are going to show you how to download and install the CoreNLP library on your machine, with Stanza's installation command:

In [7]:
# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)

# Set the CORENLP_HOME environment variable to point to the installation location
import os
os.environ["CORENLP_HOME"] = corenlp_dir



That's all for the installation! 🎉  We can now double check if the installation is successful by listing files in the CoreNLP directory. You should be able to see a number of `.jar` files by running the following command:

In [3]:
# Examine the CoreNLP installation folder to make sure the installation is successful
!ls $CORENLP_HOME

build.xml				  LIBRARY-LICENSES
corenlp.sh				  LICENSE.txt
CoreNLP-to-HTML.xsl			  Makefile
ejml-core-0.39.jar			  patterns
ejml-core-0.39-sources.jar		  pom-java-11.xml
ejml-ddense-0.39.jar			  pom-java-17.xml
ejml-ddense-0.39-sources.jar		  pom.xml
ejml-simple-0.39.jar			  protobuf-java-3.19.6.jar
ejml-simple-0.39-sources.jar		  README.txt
input.txt				  RESOURCE-LICENSES
input.txt.out				  sample-project-pom.xml
input.txt.xml				  SemgrexDemo.java
istack-commons-runtime-3.0.7.jar	  ShiftReduceDemo.java
istack-commons-runtime-3.0.7-sources.jar  slf4j-api.jar
javax.activation-api-1.2.0.jar		  slf4j-simple.jar
javax.activation-api-1.2.0-sources.jar	  stanford-corenlp-4.5.7.jar
javax.json-api-1.0-sources.jar		  stanford-corenlp-4.5.7-javadoc.jar
javax.json.jar				  stanford-corenlp-4.5.7-models.jar
jaxb-api-2.4.0-b180830.0359.jar		  stanford-corenlp-4.5.7-sources.jar
jaxb-api-2.4.0-b180830.0359-sources.jar   StanfordCoreNlpDemo.java
jaxb-impl-2.4.0-b180830.0438.jar	  StanfordDependenci

**Note 1**:
If you want to use the interface in a terminal (instead of a Colab notebook), you can set the `CORENLP_HOME` environment variable with:

```bash
export CORENLP_HOME=path_to_corenlp_dir
```

Here we instead set this variable with the Python `os` library, simply because the `export` command is not well supported in Colab notebooks.
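Whichever way the variable is set, it can be checked from Python before starting the client. The following is a minimal sketch, not part of Stanza: `check_corenlp_home` is a hypothetical helper name, and the only assumption it makes is the one stated above, that a working installation directory contains `.jar` files.

```python
import os
from pathlib import Path


def check_corenlp_home(env=os.environ):
    """Return the sorted .jar file names found under CORENLP_HOME.

    Hypothetical helper (not a Stanza function). Raises RuntimeError if
    the variable is unset, is not a directory, or holds no .jar files.
    """
    home = env.get("CORENLP_HOME")
    if not home:
        raise RuntimeError("CORENLP_HOME is not set")
    path = Path(home)
    if not path.is_dir():
        raise RuntimeError(f"CORENLP_HOME is not a directory: {home}")
    jars = sorted(p.name for p in path.glob("*.jar"))
    if not jars:
        raise RuntimeError(f"No .jar files found in {home}")
    return jars
```

Calling `check_corenlp_home()` with no arguments reads the real environment; passing a plain dictionary instead makes the check easy to try out without touching `os.environ`.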


**Note 2**:
The `stanza.install_corenlp()` function has been available since Stanza v1.1.1. If you are using an earlier version of Stanza, please check out our [manual installation page](https://stanfordnlp.github.io/stanza/client_setup.html#manual-installation) for how to install CoreNLP on your computer.

**Note 3**:
Besides the installation function, we also provide a `stanza.download_corenlp_models()` function to help you download additional CoreNLP models for languages that are not shipped with the default installation. Check out our [automated installation page](https://stanfordnlp.github.io/stanza/client_setup.html#automated-installation) for more information on how to use it.

## 2. Annotating Text with CoreNLP Interface

### Constructing CoreNLPClient

At a high level, the CoreNLP Python interface works by first starting a background Java CoreNLP server process, and then initializing a client instance in Python which can pass the text to the background server process, and accept the returned annotation results.

We wrap these functionalities in a `CoreNLPClient` class. Therefore, we need to start by importing this class from Stanza.

In [2]:
# Import client module
from stanza.server import CoreNLPClient

After the import is done, we can construct a `CoreNLPClient` instance. The constructor takes a Python list of annotator names as an argument. Here let's explore some basic annotators, including tokenization, sentence splitting, part-of-speech tagging, lemmatization and named entity recognition (NER).

Additionally, the client constructor accepts a `memory` argument, which specifies how much memory will be allocated to the background Java process. The `endpoint` option specifies the URL, including the port number, over which the server and the client communicate. The default port is 9000. However, since this port is already occupied by a system process in Colab, we'll manually set it to 9001 in the following example.

Also, here we manually set `be_quiet=True` to avoid an IO issue in Colab notebooks. You should be able to use `be_quiet=False` on your own computer, which will print detailed logging information from CoreNLP during usage.

For more options in constructing the clients, please refer to the [CoreNLP Client Options List](https://stanfordnlp.github.io/stanza/corenlp_client.html#corenlp-client-options).

In [5]:
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'],
    memory='4G',
    endpoint='http://localhost:9001',
    be_quiet=True)
print(client)

# Start the background server and wait for some time
# Note that in practice this is totally optional, as by default the server will be started when the first annotation is performed
client.start()
import time; time.sleep(10)

INFO:stanza:Writing properties to tmp file: corenlp_server-76e938a6f4fd46e2.props
INFO:stanza:Starting server with command: java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-76e938a6f4fd46e2.props -annotators tokenize,ssplit,pos,lemma,ner -preload -outputFormat serialized


<stanza.server.client.CoreNLPClient object at 0x7c6650ebe830>


After the above code block finishes executing, if you print the background processes, you should be able to find the Java CoreNLP server running.

In [6]:
# Print background processes and look for java
# You should be able to see a StanfordCoreNLPServer java process running in the background
!ps -o pid,cmd | grep java

   3151 java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -time
   3210 /bin/bash -c ps -o pid,cmd | grep java
   3212 grep java
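The cell that starts the server pauses for a fixed ten seconds after `client.start()`. As an alternative, here is a minimal sketch that polls the port until something accepts connections. `port_is_open` and `wait_for_port` are hypothetical helper names, not part of Stanza, and an open port only shows that a process is listening, not that CoreNLP has finished loading its models.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike
        return False


def wait_for_port(host, port, max_wait=30.0, interval=0.5):
    """Poll until the port accepts connections or max_wait seconds pass."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if port_is_open(host, port):
            return True
        time.sleep(interval)
    return False
```

With the client configured as in this notebook, `wait_for_port("localhost", 9001)` could stand in for the fixed sleep and returns `False` if nothing is listening after `max_wait` seconds.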


### Annotating Text

Annotating a piece of text is as simple as passing the text to the `annotate` function of the client object. After the annotation is complete, a `Document` object will be returned with all annotations.

Note that although in general annotations are very fast, the first annotation might take a while to complete in the notebook. Please stay patient.

In [7]:
# Annotate some text
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
document = client.annotate(text)
print(type(document))

<class 'CoreNLP_pb2.Document'>


In [8]:
document

text: "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
sentence {
  token {
    word: "Albert"
    pos: "NNP"
    value: "Albert"
    before: ""
    after: " "
    originalText: "Albert"
    ner: "PERSON"
    coarseNER: "PERSON"
    fineGrainedNER: "PERSON"
    nerLabelProbs: "PERSON=0.9999331283889166"
    lemma: "Albert"
    beginChar: 0
    endChar: 6
    tokenBeginIndex: 0
    tokenEndIndex: 1
    hasXmlContext: false
    isNewline: false
    entityMentionIndex: 0
  }
  token {
    word: "Einstein"
    pos: "NNP"
    value: "Einstein"
    before: " "
    after: " "
    originalText: "Einstein"
    ner: "PERSON"
    coarseNER: "PERSON"
    fineGrainedNER: "PERSON"
    nerLabelProbs: "PERSON=0.9999992044892745"
    lemma: "Einstein"
    beginChar: 7
    endChar: 15
    tokenBeginIndex: 1
    tokenEndIndex: 2
    hasXmlContext: false
    isNewline: false
    entityMentionIndex: 0
  }
  token {
    word: "was"
    pos: "VBD"
    value: "

## 3. Accessing Annotations

Annotations can be accessed from the returned `Document` object.

A `Document` contains a list of `Sentence`s, which contain a list of `Token`s. Here let's first explore the annotations stored in all tokens.

In [9]:
# Iterate over all tokens in all sentences, and print out the word, lemma, pos and ner tags
print("{:12s}\t{:12s}\t{:6s}\t{}".format("Word", "Lemma", "POS", "NER"))

for i, sent in enumerate(document.sentence):
    print("[Sentence {}]".format(i+1))
    for t in sent.token:
        print("{:12s}\t{:12s}\t{:6s}\t{}".format(t.word, t.lemma, t.pos, t.ner))
    print("")

Word        	Lemma       	POS   	NER
[Sentence 1]
Albert      	Albert      	NNP   	PERSON
Einstein    	Einstein    	NNP   	PERSON
was         	be          	VBD   	O
a           	a           	DT    	O
German      	German      	JJ    	NATIONALITY
-           	-           	HYPH  	O
born        	bear        	VBN   	O
theoretical 	theoretical 	JJ    	TITLE
physicist   	physicist   	NN    	TITLE
.           	.           	.     	O

[Sentence 2]
He          	he          	PRP   	O
developed   	develop     	VBD   	O
the         	the         	DT    	O
theory      	theory      	NN    	O
of          	of          	IN    	O
relativity  	relativity  	NN    	O
.           	.           	.     	O



Alternatively, you can browse the NER results by iterating over the entity mentions in each sentence. For example:

In [10]:
# Iterate over all detected entity mentions
print("{:30s}\t{}".format("Mention", "Type"))

for sent in document.sentence:
    for m in sent.mentions:
        print("{:30s}\t{}".format(m.entityMentionText, m.entityType))

Mention                       	Type
Albert Einstein               	PERSON
German                        	NATIONALITY
theoretical physicist         	TITLE
He                            	PERSON
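When the same entity type occurs several times, it can help to group the mentions by type. The following is a minimal sketch that relies only on the `sentence`, `mentions`, `entityMentionText` and `entityType` fields used in the cell above; `mentions_by_type` is a hypothetical helper name.

```python
from collections import defaultdict


def mentions_by_type(document):
    """Map each entity type to its mention texts, in document order."""
    grouped = defaultdict(list)
    for sent in document.sentence:
        for m in sent.mentions:
            grouped[m.entityType].append(m.entityMentionText)
    return dict(grouped)
```

For the document annotated above, this would place "Albert Einstein" and "He" together under `PERSON`, with "German" under `NATIONALITY` and "theoretical physicist" under `TITLE`.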


To print all the annotations that a sentence, token or mention has, you can simply print the corresponding object.

In [11]:
# Print annotations of a token
print(document.sentence[0].token[0])

# Print annotations of a mention
print(document.sentence[0].mentions[0])

word: "Albert"
pos: "NNP"
value: "Albert"
before: ""
after: " "
originalText: "Albert"
ner: "PERSON"
coarseNER: "PERSON"
fineGrainedNER: "PERSON"
nerLabelProbs: "PERSON=0.9999331283889166"
lemma: "Albert"
beginChar: 0
endChar: 6
tokenBeginIndex: 0
tokenEndIndex: 1
hasXmlContext: false
isNewline: false
entityMentionIndex: 0

sentenceIndex: 0
tokenStartInSentenceInclusive: 0
tokenEndInSentenceExclusive: 2
ner: "PERSON"
entityType: "PERSON"
entityMentionIndex: 0
canonicalEntityMentionIndex: 0
entityMentionText: "Albert Einstein"
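The `tokenStartInSentenceInclusive` and `tokenEndInSentenceExclusive` fields printed above give the mention's token span within its sentence, so a mention can be mapped back to its tokens with a slice. A minimal sketch, with `mention_tokens` as a hypothetical helper name:

```python
def mention_tokens(sentence, mention):
    """Return the tokens of `sentence` that `mention` covers."""
    start = mention.tokenStartInSentenceInclusive
    end = mention.tokenEndInSentenceExclusive  # exclusive, so it fits a slice
    return list(sentence.token[start:end])
```

For the first mention above (start 0, end 2), `[t.word for t in mention_tokens(sent, m)]` would yield the words "Albert" and "Einstein", which gives access to per-token annotations such as `pos` or `lemma` for each mention.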



**Note**: Since the Stanza CoreNLP client interface simply ports the CoreNLP annotation results to native Python objects, for a comprehensive list of available annotators and how their annotation results can be accessed, you will need to visit the [Stanford CoreNLP website](https://stanfordnlp.github.io/CoreNLP/).

## 4. Shutting Down the CoreNLP Server

To shut down the background CoreNLP server process, simply call the `stop()` function of the client. Note that once a server is shut down, you'll have to restart it with the `start()` function before any annotation is requested.

In [18]:
# Shut down the background CoreNLP server
client.stop()

time.sleep(10)
!ps -o pid,cmd | grep java

   5625 /bin/bash -c ps -o pid,cmd | grep java
   5627 grep java


### More Information

For more information on how to use the `CoreNLPClient`, please go to the [CoreNLPClient documentation page](https://stanfordnlp.github.io/stanza/corenlp_client.html).

## 5. Simplifying Client Usage with the Python `with` statement

In the above demo, we explicitly called the `client.start()` and `client.stop()` functions to start and stop a client-server connection. However, doing this in practice is usually suboptimal, since you may forget to call the `stop()` function at the end, resulting in an unused server process occupying your machine memory.

To solve this, a simple solution is to use the client interface with the [Python `with` statement](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement). The `with` statement provides an elegant way to automatically start and stop the server process in your Python program, without you needing to worry about it. The following code snippet demonstrates how to establish a client, annotate an example text and then stop the server with a simple `with` statement. Note that we **always recommend** using the `with` statement when working with the Stanza CoreNLP client interface.

In [14]:
print("Starting a server with the Python \"with\" statement...")
with CoreNLPClient(annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'],
                   memory='4G', endpoint='http://localhost:9001', be_quiet=True) as client:
    text = "Albert Einstein was a German-born theoretical physicist."
    document = client.annotate(text)

    print("{:30s}\t{}".format("Mention", "Type"))
    for sent in document.sentence:
        for m in sent.mentions:
            print("{:30s}\t{}".format(m.entityMentionText, m.entityType))

print("\nThe server should be stopped upon exit from the \"with\" statement.")

INFO:stanza:Writing properties to tmp file: corenlp_server-9540d3fa061f47c3.props
INFO:stanza:Starting server with command: java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-9540d3fa061f47c3.props -annotators tokenize,ssplit,pos,lemma,ner -preload -outputFormat serialized


Starting a server with the Python "with" statement...
Mention                       	Type
Albert Einstein               	PERSON
German                        	NATIONALITY
theoretical physicist         	TITLE

The server should be stopped upon exit from the "with" statement.
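To see why the `with` statement is the safer choice, it helps to look at what Python's context-manager protocol guarantees. The sketch below uses `RecordingServer`, a hypothetical stand-in class that only records calls. It is not how `CoreNLPClient` is implemented; it merely illustrates that the cleanup step runs even when the body of the `with` block raises.

```python
class RecordingServer:
    """Hypothetical stand-in for a client that owns a background server."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and on exceptions alike
        self.stop()
        return False  # do not swallow the exception


def run_failing_job(server):
    """Use the server in a `with` block whose body raises."""
    try:
        with server:
            raise ValueError("annotation failed")
    except ValueError:
        pass
    return server.events
```

Here `run_failing_job(RecordingServer())` returns `["start", "stop"]`: the stop step still ran although the body failed, which is the behavior that explicit `start()` and `stop()` calls only provide if wrapped in `try`/`finally`.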


In [None]:
import json
from pathlib import Path

import networkx as nx
from tqdm import tqdm

# Path to the input file
load_file = "/content/twitter/train.txt"

# Output directory for JSON files
output_dir = Path("/content/output")
output_dir.mkdir(parents=True, exist_ok=True)

# Function to process a single sentence and extract entities and relations
def process_sentence(client, sentence):
    document = client.annotate(sentence)

    # Create a directed graph to store entities and relations
    G = nx.DiGraph()

    entities = []
    relations = []

    for sent in document.sentence:
        tokens = sent.token
        dependencies = sent.basicDependencies.edge

        # Extract entities
        if hasattr(sent, 'mentions'):
            for m in sent.mentions:
                entity = {
                    'text': m.entityMentionText,
                    'type': m.entityType
                }
                entities.append(entity)
                G.add_node(entity['text'], type=entity['type'])

        # Extract relations using dependency parsing
        for dep in dependencies:
            dep_type = dep.dep
            governor = tokens[dep.source - 1].word
            dependent = tokens[dep.target - 1].word

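            if dep_type == 'nsubj':
                # Nominal subject, e.g., "He" in "He developed ..."
                subject = dependent
                predicate = governor
                relations.append({
                    'predicate': predicate,
                    'relation': dep_type,
                    'object': subject
                })
                G.add_edge(predicate, subject, relation=dep_type)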
            elif dep_type in ['dobj', 'obj']:
                object_ = dependent
                predicate = governor
                relations.append({
                    'predicate': predicate,
                    'relation': dep_type,
                    'object': object_
                })
                G.add_edge(predicate, object_, relation=dep_type)
            elif dep_type == 'amod':
                # Modifier relationship, e.g., adjective modifying a noun
                object_ = governor
                modifier = dependent
                relations.append({
                    'predicate': modifier,
                    'relation': dep_type,
                    'object': object_
                })
                G.add_edge(modifier, object_, relation=dep_type)
            elif dep_type == 'cop':
                # Copula relation, e.g., "was"
                predicate = governor
                object_ = dependent
                relations.append({
                    'predicate': predicate,
                    'relation': dep_type,
                    'object': object_
                })
                G.add_edge(predicate, object_, relation=dep_type)

    return {
        "entities": entities,
        "relations": relations
    }
# Read and parse the input file
with open(load_file, "r", encoding="utf-8") as f:
    lines = f.readlines()
    raw_words, raw_targets = [], []
    raw_word, raw_target = [], []
    imgs = []

    for line in lines:
        if line.startswith("IMGID:"):
            img_id = line.strip().split('IMGID:')[1]
            imgs.append(img_id)
            continue
        if line.strip() != "":
            parts = line.strip().split('\t')
            if len(parts) >= 2:
                raw_word.append(parts[0])
                label = parts[1]
                if 'OTHER' in label:
                    label = label.replace('OTHER', 'MISC')
                raw_target.append(label)
        else:
            if raw_word and raw_target:
                raw_words.append(raw_word)
                raw_targets.append(raw_target)
                raw_word, raw_target = [], []
    # Handle the last sentence if the file doesn't end with a newline
    if raw_word and raw_target:
        raw_words.append(raw_word)
        raw_targets.append(raw_target)
# Initialize the CoreNLPClient once for efficiency
with CoreNLPClient(
    annotators=['tokenize', 'ssplit', 'pos', 'depparse', 'lemma', 'ner'],
    memory='4G',
    endpoint='http://localhost:9004',
    be_quiet=True
) as client:
    for i in tqdm(range(len(raw_words))):
        sentence = " ".join(raw_words[i]).strip()
        img_id = imgs[i] if i < len(imgs) else f"unknown_{i}.jpg"

        if not sentence:
            print(f"Skipping empty sentence for image {img_id}.")
            continue

        try:
            output = process_sentence(client, sentence)
        except Exception as e:
            print(f"Error processing sentence for image {img_id}: {e}")
            output = {"entities": [], "relations": []}

        # Define the output JSON file path
        json_file_path = output_dir / f"{img_id}.json"

        # Write the JSON output to the file
        with open(json_file_path, "w", encoding="utf-8") as json_file:
            json.dump(output, json_file, ensure_ascii=False, indent=4)

        print(f"Processed and saved JSON for image {img_id}.")
print("All sentences have been processed and JSON files have been saved.")

INFO:stanza:Writing properties to tmp file: corenlp_server-07378623176d4625.props
INFO:stanza:Starting server with command: java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9004 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-07378623176d4625.props -annotators tokenize,ssplit,pos,depparse,lemma,ner -preload -outputFormat serialized
  0%|          | 0/4000 [00:00<?, ?it/s]
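The cell above collects image IDs and sentences in two separate lists and pairs them by position, so the two can drift apart if a block contains an `IMGID:` line but no valid token line. A sketch that keeps each ID together with its sentence is shown below. `parse_blocks` is a hypothetical helper name, and the file layout (an `IMGID:` line, then tab-separated word/label lines, then a blank line) is inferred from the parsing code above rather than from a format specification.

```python
def parse_blocks(lines):
    """Parse 'IMGID:'-headed blocks of tab-separated word/label lines.

    Returns a list of (img_id, words, labels) tuples. A block with no
    valid token line is dropped together with its image ID.
    """
    blocks = []
    img_id, words, labels = None, [], []
    for raw in list(lines) + [""]:  # trailing "" flushes the final block
        line = raw.strip()
        if line.startswith("IMGID:"):
            if words:
                blocks.append((img_id, words, labels))
            img_id, words, labels = line[len("IMGID:"):], [], []
        elif not line:
            if words:
                blocks.append((img_id, words, labels))
            img_id, words, labels = None, [], []
        else:
            parts = line.split("\t")
            if len(parts) >= 2:
                words.append(parts[0])
                # Same label mapping as in the cell above
                labels.append(parts[1].replace("OTHER", "MISC"))
    return blocks
```

The main loop could then iterate over `parse_blocks(lines)` and unpack `img_id, words, labels` directly instead of indexing `imgs[i]`; blocks without an `IMGID:` line come back with `img_id` set to `None`.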

## 6. Other Resources

- [Stanza Homepage](https://stanfordnlp.github.io/stanza/)
- [FAQs](https://stanfordnlp.github.io/stanza/faq.html)
- [GitHub Repo](https://github.com/stanfordnlp/stanza)
- [Reporting Issues](https://github.com/stanfordnlp/stanza/issues)
