Coref might be done

dmnapolitano · Mar 25, 2013 · d29ad8f
1 parent 9668932
commit d29ad8f

Showing 9 changed files with 541 additions and 59 deletions.
diff --git a/README_coref.md b/README_coref.md
@@ -0,0 +1,33 @@
+How to Resolve Coreferences with the Stanford Deterministic Coreference Resolution System (dcoref) via this Apache Thrift Server
+================================================================================================================================
+
+## How to Interact with the Methods and Data Structures
+
+Presently, every method below returns a Java `ArrayList<String>`/Python list of `unicode` strings, where each element is the original, tokenized sentence, annotated in-line with MUC-style SGML.  Run `scripts/coref_client.py` to see examples of this output as each method produces it.  In short, each annotation has an `ID` field and, most likely, a `REF` field.  The numbers are assigned internally by dcoref during the coreference resolution process.  If dcoref thinks a word or phrase could be a coreferent mention, it assigns it an `ID`.  If that mention refers to another one, the `ID` of the one it refers to is given in `REF`.  Thus, you may see output like:
+
+```XML
+<COREF ID="1">Barack Obama</COREF> ... . 
+<COREF ID="4" REF="1">He</COREF> ... .
+```
+
+Notice that these results span two sentences.  You'll typically want to call these methods with more than one sentence, although you could, just as easily, call any of them with only one.  If you'd like to resolve coreferences in:
+
+* arbitrary (potentially several sentences' worth of), untokenized, un-parsed, un-tagged text, and you're cool with CoreNLP handling all of those tasks for you, call `resolve_coreferences_in_text(text)`.  `text` is a Java `String`/Python `str` or `unicode`.
+* one or more sentences' worth of tokens (the output of a sentence tokenizer followed by a word tokenizer), pass a list with one string per sentence, in which that sentence's tokens are separated by single spaces.  For example, if your tokenizer produced:
+
+```Python
+output = [["Barack", "Hussein", "Obama", "II", "is", "the", "44th", "and", "current", "President", "of", "the", "United", "States", ",", "in", "office", "since", "2009", "."],
+          ["He", "is", "the", "first", "African", "American", "to", "hold", "the", "office", "."]]
+```
+
+you can then create a list `tokenized_sentences` as
+
+```Python
+tokenized_sentences = [" ".join(o) for o in output]
+```
+
+and then call `resolve_coreferences_in_tokenized_sentences(tokenized_sentences)`.
+
+* one or more parse trees in Stanford Parser's "oneline" output format, call `resolve_coreferences_in_trees(trees)`, where `trees` is a Java `List<String>`/Python list containing `str`/`unicode`.
+
+Each one of these methods will call Stanford's Named Entity Recognizer prior to running dcoref (it is required in order for dcoref to perform its magic).
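
Note on consuming the output described in `README_coref.md`: since each returned element is a sentence with in-line `<COREF ID="..." REF="...">` annotations, a client will probably want to pull those annotations back out. The following is only a rough sketch of one way to do that, not code from this commit. The helper names `extract_mentions` and `group_by_antecedent` are hypothetical, and the regular expression assumes the attribute order and quoting shown in the README example, with no nested `COREF` tags.

```python
import re

# Assumption: annotations look exactly like the README example,
# <COREF ID="4" REF="1">He</COREF>, and are not nested inside each other.
COREF_RE = re.compile(r'<COREF ID="(\d+)"(?: REF="(\d+)")?>(.*?)</COREF>')


def extract_mentions(annotated_sentences):
    """Return one dict per annotation found in the annotated sentences."""
    mentions = []
    for index, sentence in enumerate(annotated_sentences):
        for match in COREF_RE.finditer(sentence):
            mention_id, ref, text = match.groups()
            mentions.append({
                "sentence": index,
                "id": int(mention_id),
                # REF is optional: a mention without it refers to nothing else.
                "ref": int(ref) if ref else None,
                "text": text,
            })
    return mentions


def group_by_antecedent(mentions):
    """Group mention texts under the ID they point at (or their own ID).

    Only the direct REF is used; chains of REFs are not followed.
    """
    chains = {}
    for mention in mentions:
        head = mention["ref"] if mention["ref"] is not None else mention["id"]
        chains.setdefault(head, []).append(mention["text"])
    return chains
```

With a connected client, the list returned by any of the three `resolve_coreferences_in_*` methods could be passed straight to `extract_mentions`. If dcoref emits nested annotations, a real SGML/XML parser would be a safer choice than a regular expression.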
diff --git a/corenlp.thrift b/corenlp.thrift
@@ -29,5 +29,7 @@ service StanfordCoreNLP
     list<NamedEntity> get_entities_from_text(1:string text),
     list<NamedEntity> get_entities_from_tokens(1:list<string> tokens),
     list<NamedEntity> get_entities_from_trees(1:list<string> trees),
+    list<string> resolve_coreferences_in_text(1:string text),
+    list<string> resolve_coreferences_in_tokenized_sentences(1:list<string> sentencesWithTokensSeparatedBySpace),
     list<string> resolve_coreferences_in_trees(1:list<string> trees)
 }
diff --git a/gen-py/corenlp/StanfordCoreNLP-remote b/gen-py/corenlp/StanfordCoreNLP-remote
@@ -30,6 +30,8 @@ if len(sys.argv) <= 1 or sys.argv[1] == '--help':
   print '   get_entities_from_text(string text)'
   print '   get_entities_from_tokens( tokens)'
   print '   get_entities_from_trees( trees)'
+  print '   resolve_coreferences_in_text(string text)'
+  print '   resolve_coreferences_in_tokenized_sentences( sentencesWithTokensSeparatedBySpace)'
   print '   resolve_coreferences_in_trees( trees)'
   print ''
   sys.exit(0)
@@ -124,6 +126,18 @@ elif cmd == 'get_entities_from_trees':
     sys.exit(1)
   pp.pprint(client.get_entities_from_trees(eval(args[0]),))
 
+elif cmd == 'resolve_coreferences_in_text':
+  if len(args) != 1:
+    print 'resolve_coreferences_in_text requires 1 args'
+    sys.exit(1)
+  pp.pprint(client.resolve_coreferences_in_text(args[0],))
+
+elif cmd == 'resolve_coreferences_in_tokenized_sentences':
+  if len(args) != 1:
+    print 'resolve_coreferences_in_tokenized_sentences requires 1 args'
+    sys.exit(1)
+  pp.pprint(client.resolve_coreferences_in_tokenized_sentences(eval(args[0]),))
+
 elif cmd == 'resolve_coreferences_in_trees':
   if len(args) != 1:
     print 'resolve_coreferences_in_trees requires 1 args'