Clojure text parser with annotations using Stanford CoreNLP

stanford-talk

Uses Stanford CoreNLP to parse and annotate text.

Usage

The library is experimental, but basic usage is:

(process-text "My name is Carin. I love cheese. I live in Cincinnati.")

It returns a data structure like:

{:token-data
 ([{:sent-num 0, :token "My", :pos "PRP$", :net "O", :lemma "my", :sentiment "Neutral"}
   {:sent-num 0, :token "name", :pos "NN", :net "O", :lemma "name", :sentiment "Neutral"}
   {:sent-num 0, :token "is", :pos "VBZ", :net "O", :lemma "be", :sentiment "Neutral"}
   {:sent-num 0, :token "Carin", :pos "NNP", :net "PERSON", :lemma "Carin", :sentiment "Neutral"}
   {:sent-num 0, :token ".", :pos ".", :net "O", :lemma ".", :sentiment "Neutral"}]
  [{:sent-num 1, :token "I", :pos "PRP", :net "O", :lemma "I", :sentiment "Neutral"}
   {:sent-num 1, :token "love", :pos "VBP", :net "O", :lemma "love", :sentiment "Very positive"}
   {:sent-num 1, :token "cheese", :pos "NN", :net "O", :lemma "cheese", :sentiment "Neutral"}
   {:sent-num 1, :token ".", :pos ".", :net "O", :lemma ".", :sentiment "Neutral"}]
  [{:sent-num 2, :token "I", :pos "PRP", :net "O", :lemma "I", :sentiment "Neutral"}
   {:sent-num 2, :token "live", :pos "VBP", :net "O", :lemma "live", :sentiment "Neutral"}
   {:sent-num 2, :token "in", :pos "IN", :net "O", :lemma "in", :sentiment "Neutral"}
   {:sent-num 2, :token "Cincinnati", :pos "NNP", :net "LOCATION", :lemma "Cincinnati", :sentiment "Neutral"}
   {:sent-num 2, :token ".", :pos ".", :net "O", :lemma ".", :sentiment "Neutral"}]),
 :refs
 [[{:sent-num 0, :token "Carin", :gender "UNKNOWN", :mention-type "PROPER", :number "SINGULAR", :animacy "ANIMATE"}]
  [{:sent-num 0, :token "My", :gender "UNKNOWN", :mention-type "NOMINAL", :number "SINGULAR", :animacy "INANIMATE"}]
  [{:sent-num 0, :token "My", :gender "UNKNOWN", :mention-type "PRONOMINAL", :number "SINGULAR", :animacy "ANIMATE"}
   {:sent-num 1, :token "I", :gender "UNKNOWN", :mention-type "PRONOMINAL", :number "SINGULAR", :animacy "ANIMATE"}
   {:sent-num 2, :token "I", :gender "UNKNOWN", :mention-type "PRONOMINAL", :number "SINGULAR", :animacy "ANIMATE"}]
  [{:sent-num 2, :token "Cincinnati", :gender "NEUTRAL", :mention-type "PROPER", :number "SINGULAR", :animacy "INANIMATE"}]]}

where

  • :token-data = The data associated with each word token, grouped by sentence
    • :sent-num = Sentence number
    • :token = Original text
    • :pos = Part-of-speech tag (Penn Treebank tag set, e.g. "NNP" for a singular proper noun)
    • :net = Named entity tag: PERSON, LOCATION, ORGANIZATION, MISC, MONEY, NUMBER, ORDINAL, PERCENT, DATE, TIME, DURATION, SET, or "O" when the token is not part of a named entity
    • :lemma = [lemma](https://simple.wikipedia.org/wiki/Lemma_(linguistics)) for the token
    • :sentiment = "Very negative", "Negative", "Neutral", "Positive", or "Very positive"
  • :refs = The mentions/references found in the text; mentions that refer to the same thing are grouped in one vector
    • :sent-num = Sentence number
    • :token = The word
    • :gender = Gender of the word if it can be determined
    • :mention-type = PRONOMINAL, NOMINAL, PROPER, LIST
    • :number = SINGULAR, PLURAL, UNKNOWN
    • :animacy = ANIMATE, INANIMATE, UNKNOWN (see animacy)
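
As a rough illustration of working with this structure, the sketch below pulls the named entities out of :token-data and keeps only the :refs groups that contain more than one mention. It runs against a trimmed copy of the sample result above, so CoreNLP is not needed. The helper names `named-entities` and `shared-references` are hypothetical and not part of this library.

```clojure
;; Trimmed copy of the sample result shown above (not produced by calling the library).
(def result
  {:token-data
   (list
    [{:sent-num 0, :token "My", :pos "PRP$", :net "O", :lemma "my", :sentiment "Neutral"}
     {:sent-num 0, :token "Carin", :pos "NNP", :net "PERSON", :lemma "Carin", :sentiment "Neutral"}]
    [{:sent-num 1, :token "I", :pos "PRP", :net "O", :lemma "I", :sentiment "Neutral"}
     {:sent-num 1, :token "love", :pos "VBP", :net "O", :lemma "love", :sentiment "Very positive"}]
    [{:sent-num 2, :token "I", :pos "PRP", :net "O", :lemma "I", :sentiment "Neutral"}
     {:sent-num 2, :token "Cincinnati", :pos "NNP", :net "LOCATION", :lemma "Cincinnati", :sentiment "Neutral"}])
   :refs
   [[{:sent-num 0, :token "Carin", :gender "UNKNOWN", :mention-type "PROPER", :number "SINGULAR", :animacy "ANIMATE"}]
    [{:sent-num 0, :token "My", :gender "UNKNOWN", :mention-type "PRONOMINAL", :number "SINGULAR", :animacy "ANIMATE"}
     {:sent-num 1, :token "I", :gender "UNKNOWN", :mention-type "PRONOMINAL", :number "SINGULAR", :animacy "ANIMATE"}
     {:sent-num 2, :token "I", :gender "UNKNOWN", :mention-type "PRONOMINAL", :number "SINGULAR", :animacy "ANIMATE"}]
    [{:sent-num 2, :token "Cincinnati", :gender "NEUTRAL", :mention-type "PROPER", :number "SINGULAR", :animacy "INANIMATE"}]]})

(defn named-entities
  "Hypothetical helper: returns a [token tag] pair for every token
  whose :net tag is something other than \"O\"."
  [result]
  (for [sentence (:token-data result)   ; one vector of token maps per sentence
        {:keys [token net]} sentence
        :when (not= net "O")]
    [token net]))

(defn shared-references
  "Hypothetical helper: keeps the :refs groups with more than one mention
  and returns each group as a vector of [sent-num token] pairs."
  [result]
  (->> (:refs result)
       (filter #(> (count %) 1))
       (mapv (fn [group] (mapv (juxt :sent-num :token) group)))))

(named-entities result)
;; => (["Carin" "PERSON"] ["Cincinnati" "LOCATION"])

(shared-references result)
;; => [[[0 "My"] [1 "I"] [2 "I"]]]
```

Because the result is plain Clojure data, ordinary sequence functions such as `filter`, `for`, and `group-by` are enough to query it.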

License

Copyright © 2015 Carin Meier

Distributed under the Eclipse Public License, either version 1.0 or (at your option) any later version.