
Now with the ability to tokenize arbitrary text and untokenize a single sentence.
1 parent 809daf3 commit 854f872150cfdcaf098e3da3e348f02d0acaee22 @dmnapolitano committed Jun 5, 2013
@@ -11,6 +11,7 @@ Things you can do with it:
- Resolved Coreferences **See README_coref.md**
- Stanford Tregex patterns evaluated over parse trees **See README_tregex.md**
- Sentences tagged for Part-of-Speech **See README_tagger.md**
+ - Tokenized (or even untokenized) text **See README_tokenizer.md**
* Send unicode (optional), receive unicode (always).
* Do these things in a multithreaded way without having to think about it too much (Thrift provides ten threads).
* Communicate with the server using the language of your choice (with some additional coding if your choice isn't "Java" or "Python").
@@ -0,0 +1,15 @@
+How to Get Tokenized Text (and Untokenized Text) From the Stanford PTB Tokenizer via this Apache Thrift Server
+==============================================================================================================
+
+## How to Interact with the Methods
+
+Two methods here, hopefully pretty straightforward:
+
+* `untokenize_sentence(sentenceTokens)` where `sentenceTokens` is a Python `list`/Java `ArrayList<String>` corresponding to one sentence worth of tokens.
+ Returns: a Python `unicode`/Java `String` which is `sentenceTokens` untokenized by Stanford CoreNLP.
+* `tokenize_text(arbitraryText)` where `arbitraryText` is a Python `str` or `unicode`/Java `String` holding some arbitrary text you'd like tokenized.
+ Returns: a Python `list` of lists of `unicode` objects/Java `ArrayList<ArrayList<String>>`, where each inner list contains the tokens of one sentence (so each element in the outer list is one sentence).
+
+The only thing this doesn't do yet is return untokenized sentences (i.e., split the text into sentences without tokenizing them). Of course you _could_ call `untokenize_sentence()` on each sentence returned by `tokenize_text()` if you really needed this and didn't feel like waiting for me to implement it. :grin:
+
+For examples of how to call both of these methods, see `scripts/tokenizer_client.py`.
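
To make the two methods above concrete, here is a minimal Python client sketch. It is only a sketch: the generated module name (`corenlp`), the `StanfordCoreNLP` import, and the host/port (`localhost:9999`) are assumptions for illustration, not confirmed by this commit; `scripts/tokenizer_client.py` shows the project's actual client code.

```python
# Minimal sketch of calling the new tokenizer methods over Thrift.
# Assumptions (not confirmed by this commit): the generated Python
# package imports as `corenlp` and the server listens on localhost:9999.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from corenlp import StanfordCoreNLP  # assumed name of the generated module

transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9999))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = StanfordCoreNLP.Client(protocol)
transport.open()

# tokenize_text: arbitrary text in, one list of tokens per sentence out.
sentences = client.tokenize_text("Hello there!  How are you?")
# e.g. [[u'Hello', u'there', u'!'], [u'How', u'are', u'you', u'?']]

# untokenize_sentence: one sentence's worth of tokens in, a single string out.
print client.untokenize_sentence(sentences[0])

transport.close()
```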
@@ -41,5 +41,7 @@ service StanfordCoreNLP
list<string> resolve_coreferences_in_trees(1:list<string> trees),
list<string> evaluate_tregex_pattern(1:string parseTree, 2:string tregexPattern),
list<list<TaggedToken>> tag_text(1:string untokenizedText),
- list<TaggedToken> tag_tokenized_sentence(1:list<string> tokenizedSentence)
+ list<TaggedToken> tag_tokenized_sentence(1:list<string> tokenizedSentence),
+ string untokenize_sentence(1:list<string> sentenceTokens),
+ list<list<string>> tokenize_text(1:string arbitraryText)
}
@@ -37,6 +37,8 @@ if len(sys.argv) <= 1 or sys.argv[1] == '--help':
print ' evaluate_tregex_pattern(string parseTree, string tregexPattern)'
print ' tag_text(string untokenizedText)'
print ' tag_tokenized_sentence( tokenizedSentence)'
+ print ' string untokenize_sentence( sentenceTokens)'
+ print ' tokenize_text(string arbitraryText)'
print ''
sys.exit(0)
@@ -172,6 +174,18 @@ elif cmd == 'tag_tokenized_sentence':
sys.exit(1)
pp.pprint(client.tag_tokenized_sentence(eval(args[0]),))
+elif cmd == 'untokenize_sentence':
+ if len(args) != 1:
+ print 'untokenize_sentence requires 1 args'
+ sys.exit(1)
+ pp.pprint(client.untokenize_sentence(eval(args[0]),))
+
+elif cmd == 'tokenize_text':
+ if len(args) != 1:
+ print 'tokenize_text requires 1 args'
+ sys.exit(1)
+ pp.pprint(client.tokenize_text(args[0],))
+
else:
print 'Unrecognized method %s' % cmd
sys.exit(1)
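
Given the dispatch above, the new commands can be exercised from a shell roughly as follows. The script name is taken from the README reference earlier in this commit and may not be exact; note that `untokenize_sentence` takes its list argument as a Python literal because the script `eval`s it:

```
python scripts/tokenizer_client.py tokenize_text 'Hello there!  How are you?'
python scripts/tokenizer_client.py untokenize_sentence "[u'Hello', u'there', u'!']"
```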