This repository has been archived by the owner on Dec 12, 2018. It is now read-only.

Coref might be done
Diane M. Napolitano committed Mar 25, 2013
1 parent 9668932 commit d29ad8f
Showing 9 changed files with 541 additions and 59 deletions.
33 changes: 33 additions & 0 deletions README_coref.md
@@ -0,0 +1,33 @@
How to Resolve Coreferences with the Stanford Deterministic Coreference Resolution System (dcoref) via this Apache Thrift Server
================================================================================================================================

## How to Interact with the Methods and Data Structures

Presently, every method below returns a Java `ArrayList<String>`/Python list of `unicode` strings, where each element is one original sentence, tokenized and annotated in-line with MUC-style SGML. Run `scripts/coref_client.py` to see examples of the output each method produces. In short, each annotation has an `ID` field and, in most cases, a `REF` field. The numbers are assigned internally by dcoref during the coreference resolution process. If dcoref thinks a word or phrase could be a coreferring mention, it assigns that mention an `ID`. If the mention refers to another one, the `ID` of the mention it refers to is given in `REF`. Thus, you may see output like:

```XML
<COREF ID="1">Barack Obama</COREF> ... .
<COREF ID="4" REF="1">He</COREF> ... .
```
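
To make the `ID`/`REF` relationship concrete, here is a minimal sketch of how a client might group the returned annotations into chains. It assumes the tags look exactly like the two examples above (`ID` first, optional `REF` second, no nested `<COREF>` elements); the helper names `extract_mentions` and `build_chains` are hypothetical and not part of this repository.

```Python
import re
from collections import defaultdict

# Assumed tag shape, taken from the example above: ID first, REF optional.
# Nested <COREF> elements are NOT handled by this pattern.
COREF_TAG = re.compile(r'<COREF ID="(\d+)"(?: REF="(\d+)")?>(.*?)</COREF>')


def extract_mentions(annotated_sentences):
    """Return (sentence_index, id, ref_or_None, text) for every COREF tag."""
    mentions = []
    for sent_idx, sentence in enumerate(annotated_sentences):
        for match in COREF_TAG.finditer(sentence):
            mention_id, ref_id, text = match.groups()
            ref = int(ref_id) if ref_id else None
            mentions.append((sent_idx, int(mention_id), ref, text))
    return mentions


def build_chains(mentions):
    """Group mention texts under the ID reached by following REF links."""
    ref_of = {mention_id: ref for _, mention_id, ref, _ in mentions}

    def root(mention_id):
        seen = set()  # guards against an accidental REF cycle
        while ref_of.get(mention_id) is not None and mention_id not in seen:
            seen.add(mention_id)
            mention_id = ref_of[mention_id]
        return mention_id

    chains = defaultdict(list)
    for _, mention_id, _, text in mentions:
        chains[root(mention_id)].append(text)
    return dict(chains)


sentences = [
    '<COREF ID="1">Barack Obama</COREF> is the president .',
    '<COREF ID="4" REF="1">He</COREF> took office in 2009 .',
]
chains = build_chains(extract_mentions(sentences))
print(chains)
```

If the real output nests annotations or orders the attributes differently, a proper SGML/XML parser would be a safer choice than this regular expression.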

Notice that these results span two sentences. You'll typically want to call these methods with more than one sentence, although you can just as easily call any of them with only one. If you'd like to resolve coreferences in:

* arbitrary (potentially several sentences' worth of) untokenized, unparsed, untagged text, and you're happy to let CoreNLP handle all of those tasks for you, call `resolve_coreferences_in_text(text)`. `text` is a Java `String`/Python `str` or `unicode`.
* one or more sentences' worth of tokens (the output of a sentence tokenizer followed by a word tokenizer), join each sentence's tokens into a single space-separated string. For example, if your tokenizer produced:

```Python
output = [["Barack", "Hussein", "Obama", "II", "is", "the", "44th", "and", "current", "President", "of", "the", "United", "States", ",", "in", "office", "since", "2009", "."],
["He", "is", "the", "first", "African", "American", "to", "hold", "the", "office", "."]]
```

you can then create a list `tokenized_sentences` as

```Python
tokenized_sentences = [" ".join(o) for o in output]
```

and then call `resolve_coreferences_in_tokenized_sentences(tokenized_sentences)`.

* one or more parse trees in Stanford Parser's "oneline" output format, call `resolve_coreferences_in_trees(trees)`, where `trees` is a Java `List<String>`/Python list containing `str`/`unicode`.
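
As a rough end-to-end sketch of the tokenized-sentence case above: the helper below joins the tokens and forwards them to whatever client object it is given. The `open_client` function shows one plausible way to build that client with the Thrift Python library; the host, port, buffered transport, and binary protocol are all assumptions here, so check them against how your server is actually started. `join_tokens`, `resolve_tokenized`, and `open_client` are hypothetical names, not part of this repository.

```Python
def join_tokens(token_lists):
    """Turn [[token, ...], ...] into one space-separated string per sentence."""
    return [" ".join(tokens) for tokens in token_lists]


def resolve_tokenized(client, token_lists):
    """Join the tokens, then call the service method named in corenlp.thrift."""
    sentences = join_tokens(token_lists)
    return client.resolve_coreferences_in_tokenized_sentences(sentences)


def open_client(host="localhost", port=9999):
    """Build a client. ASSUMED: host, port, buffered transport, binary protocol."""
    # Imported lazily so the helpers above work without Thrift installed.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from corenlp import StanfordCoreNLP  # generated into gen-py/ by Thrift

    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    client = StanfordCoreNLP.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()
    return client, transport  # the caller should call transport.close()


tokens = [["He", "is", "the", "first", "to", "hold", "the", "office", "."]]
print(join_tokens(tokens))
```

With a running server, usage would look roughly like `client, transport = open_client()`, then `resolve_tokenized(client, tokens)`, then `transport.close()`.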

Each of these methods calls Stanford's Named Entity Recognizer before running dcoref, because dcoref requires its output in order to work.
2 changes: 2 additions & 0 deletions corenlp.thrift
@@ -29,5 +29,7 @@ service StanfordCoreNLP
    list<NamedEntity> get_entities_from_text(1:string text),
    list<NamedEntity> get_entities_from_tokens(1:list<string> tokens),
    list<NamedEntity> get_entities_from_trees(1:list<string> trees),
    list<string> resolve_coreferences_in_text(1:string text),
    list<string> resolve_coreferences_in_tokenized_sentences(1:list<string> sentencesWithTokensSeparatedBySpace),
    list<string> resolve_coreferences_in_trees(1:list<string> trees)
}
14 changes: 14 additions & 0 deletions gen-py/corenlp/StanfordCoreNLP-remote
@@ -30,6 +30,8 @@ if len(sys.argv) <= 1 or sys.argv[1] == '--help':
  print ' get_entities_from_text(string text)'
  print ' get_entities_from_tokens( tokens)'
  print ' get_entities_from_trees( trees)'
  print ' resolve_coreferences_in_text(string text)'
  print ' resolve_coreferences_in_tokenized_sentences( sentencesWithTokensSeparatedBySpace)'
  print ' resolve_coreferences_in_trees( trees)'
  print ''
  sys.exit(0)
@@ -124,6 +126,18 @@ elif cmd == 'get_entities_from_trees':
    sys.exit(1)
  pp.pprint(client.get_entities_from_trees(eval(args[0]),))

elif cmd == 'resolve_coreferences_in_text':
  if len(args) != 1:
    print 'resolve_coreferences_in_text requires 1 args'
    sys.exit(1)
  pp.pprint(client.resolve_coreferences_in_text(args[0],))

elif cmd == 'resolve_coreferences_in_tokenized_sentences':
  if len(args) != 1:
    print 'resolve_coreferences_in_tokenized_sentences requires 1 args'
    sys.exit(1)
  pp.pprint(client.resolve_coreferences_in_tokenized_sentences(eval(args[0]),))

elif cmd == 'resolve_coreferences_in_trees':
  if len(args) != 1:
    print 'resolve_coreferences_in_trees requires 1 args'