
Now with part-of-speech tagging 😄

1 parent 0bd2e64 commit 74692ffa4d53a8de88181dd9d096a5223a42e9e6 @dmnapolitano committed May 10, 2013
@@ -9,7 +9,8 @@ Things you can do with it:
- Parse Trees **See README_parser.md**
- Named Entities **See README_ner.md**
- Resolved Coreferences **See README_coref.md**
- - Evaluate Stanford Tregex patterns over parse trees **See README_tregex.md **
+ - Stanford Tregex patterns evaluated over parse trees **See README_tregex.md**
+ - Sentences tagged for Part-of-Speech **See README_tagger.md**
* Send unicode (optional), receive unicode (always).
* Do these things in a multithreaded way without having to think about it too much (Thrift provides ten threads).
* Communicate with the server using the language of your choice (with some additional coding if your choice isn't "Java" or "Python").
@@ -0,0 +1,18 @@
+How to Get Part-of-Speech Tags from the Stanford POS Tagger via this Apache Thrift Server
+=========================================================================================
+
+## How to Interact with the Methods and Data Structures
+
+The core return type here is a data structure called `TaggedToken` which has two members:
+
+* `tag`: A Unicode string containing the Penn Treebank part-of-speech tag assigned to this token. Should always be upper-case.
+* `token`: The token itself, a Unicode string, from the sentence, with that corresponding part-of-speech tag. Its original casing is preserved.
+
+A `List<TaggedToken>` is a Python list (Java `ArrayList<TaggedToken>`) of `TaggedToken` objects corresponding to one part-of-speech-tagged sentence, preserving the order of the tokens in that sentence. Depending on what you tag, you'll get back either a single sentence's worth of tags or several sentences' worth (one per sentence found by Stanford's Tokenizer). If you'd like to tag:
+
+* arbitrary, untokenized text (potentially several sentences' worth), and you're cool with CoreNLP performing the necessary tokenization: call `tag_text(untokenizedText)`, where `untokenizedText` is a Java `String`/Python `str` or `unicode`.
+Returns: a Java `List<List<TaggedToken>>`, an `ArrayList` of sentences in the order they appeared in `untokenizedText`.
+* one sentence's worth of tokens (the output of sentence and word tokenizers of your choosing): call `tag_tokenized_sentence(tokens)`, where `tokens` is a Java `List<String>`/Python list of `str`/`unicode`.
+Returns: a Java `ArrayList<TaggedToken>`/Python list of `TaggedToken` objects.
+
+For examples of how to call both of these methods, and how one can do some nice things with the value(s) returned, see `scripts/tagger_client.py`.
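The return shape described above can be sketched without a running server. In this sketch, `TaggedToken` is a stand-in `namedtuple` for the Thrift-generated class, and the tagged sentences are a hypothetical result of the kind `tag_text` would return:

```python
from collections import namedtuple

# Stand-in for the Thrift-generated TaggedToken struct (tag and token members).
TaggedToken = namedtuple("TaggedToken", ["tag", "token"])

def format_tagged_text(sentences):
    """Render tag_text()-style output (a list of lists of TaggedToken)
    as one 'token/TAG' string per sentence."""
    return [" ".join("%s/%s" % (tt.token, tt.tag) for tt in sentence)
            for sentence in sentences]

# Hypothetical result for something like tag_text(u"The dog barked. It ran.")
result = [
    [TaggedToken(u"DT", u"The"), TaggedToken(u"NN", u"dog"),
     TaggedToken(u"VBD", u"barked"), TaggedToken(u".", u".")],
    [TaggedToken(u"PRP", u"It"), TaggedToken(u"VBD", u"ran"),
     TaggedToken(u".", u".")],
]
print(format_tagged_text(result))
```

Because the outer list preserves sentence order and each inner list preserves token order, this kind of flattening is a one-liner per sentence.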
@@ -15,6 +15,12 @@ struct NamedEntity
4:i32 endOffset
}
+struct TaggedToken
+{
+ 1:string tag,
+ 2:string token
+}
+
exception SerializedException
{
1: required binary payload
@@ -33,5 +39,7 @@ service StanfordCoreNLP
list<string> resolve_coreferences_in_text(1:string text),
list<string> resolve_coreferences_in_tokenized_sentences(1:list<string> sentencesWithTokensSeparatedBySpace),
list<string> resolve_coreferences_in_trees(1:list<string> trees),
- list<string> evaluate_tregex_pattern(1:string parseTree, 2:string tregexPattern)
+ list<string> evaluate_tregex_pattern(1:string parseTree, 2:string tregexPattern),
+ list<list<TaggedToken>> tag_text(1:string untokenizedText),
+ list<TaggedToken> tag_tokenized_sentence(1:list<string> tokenizedSentence)
}
@@ -35,6 +35,8 @@ if len(sys.argv) <= 1 or sys.argv[1] == '--help':
print ' resolve_coreferences_in_tokenized_sentences( sentencesWithTokensSeparatedBySpace)'
print ' resolve_coreferences_in_trees( trees)'
print ' evaluate_tregex_pattern(string parseTree, string tregexPattern)'
+ print ' tag_text(string untokenizedText)'
+ print ' tag_tokenized_sentence( tokenizedSentence)'
print ''
sys.exit(0)
@@ -158,6 +160,18 @@ elif cmd == 'evaluate_tregex_pattern':
sys.exit(1)
pp.pprint(client.evaluate_tregex_pattern(args[0],args[1],))
+elif cmd == 'tag_text':
+ if len(args) != 1:
+ print 'tag_text requires 1 args'
+ sys.exit(1)
+ pp.pprint(client.tag_text(args[0],))
+
+elif cmd == 'tag_tokenized_sentence':
+ if len(args) != 1:
+ print 'tag_tokenized_sentence requires 1 args'
+ sys.exit(1)
+ pp.pprint(client.tag_tokenized_sentence(eval(args[0]),))
+
else:
print 'Unrecognized method %s' % cmd
sys.exit(1)
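The helper script above passes the list argument for `tag_tokenized_sentence` through `eval`. A safer alternative (a suggestion, not part of this commit; `parse_token_list` is a hypothetical helper name) is `ast.literal_eval`, which only accepts Python literals:

```python
import ast

def parse_token_list(arg):
    """Parse a command-line argument like "['Hello', 'world', '!']"
    into a list of token strings, rejecting anything but a literal list."""
    value = ast.literal_eval(arg)
    if not isinstance(value, list) or not all(isinstance(t, str) for t in value):
        raise ValueError("expected a list of token strings, got: %r" % arg)
    return value

print(parse_token_list("['Hello', 'world', '!']"))
```

Unlike `eval`, `literal_eval` never executes function calls or attribute lookups, so a malicious command-line argument raises an error instead of running code.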
