Speech act classifier for text based on Stanford CoreNLP and Weka

An experiment in parsing natural language and classifying the speech act of a sentence. This matters most when a machine is trying to understand the meaning of a sentence in an environment, such as a chat session, where missing punctuation is common.

This project classifies three speech acts: statements, questions, and expressives. Expressives are speech acts that express a mental state of the speaker, for example "Thanks", "Ok", or "lol".

Parsing and annotation are done with a wrapper around the Stanford CoreNLP library.

Classification uses the Weka Java library. A random forest model was trained on the following sentence features, derived from the part-of-speech (POS) annotations:

  • Sentence length
  • Number of nouns in the sentence (NN, NNS, NNP, NNPS)
  • Whether the sentence ends in a noun or adjective (NN, NNS, NNP, NNPS, JJ, JJR, JJS)
  • Whether the sentence begins with a verb (VB, VBD, VBG, VBP, VBZ)
  • The count of wh- markers, such as who and what (WDT, WRB, WP, WP$)
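
The features above can be sketched as a small function over a sentence's POS tags. This is a minimal illustration, not the project's actual code: `sentence-features` and the tag-set names are hypothetical, sentence length is assumed to mean the number of tokens, and the verb set uses VBZ, the Penn Treebank tag for third-person singular present verbs. The example tags are written by hand rather than produced by CoreNLP.

```clojure
;; Penn Treebank tag groups, following the feature list above.
(def noun-tags      #{"NN" "NNS" "NNP" "NNPS"})
(def adjective-tags #{"JJ" "JJR" "JJS"})
(def verb-tags      #{"VB" "VBD" "VBG" "VBP" "VBZ"})
(def wh-tags        #{"WDT" "WRB" "WP" "WP$"})

(defn sentence-features
  "Hypothetical helper: turns a sequence of POS tags (one per token)
  into a map of the five features. Length is assumed to be a token count."
  [pos-tags]
  (let [tags     (vec pos-tags)
        last-tag (peek tags)]            ; nil for an empty sentence
    {:length             (count tags)
     :noun-count         (count (filter noun-tags tags))
     :ends-in-noun-or-adj? (boolean (or (noun-tags last-tag)
                                        (adjective-tags last-tag)))
     :begins-with-verb?  (contains? verb-tags (first tags))
     :wh-count           (count (filter wh-tags tags))}))

;; Hand-written tags for "How do you make cheese" (an assumption,
;; not real tagger output).
(sentence-features ["WRB" "VBP" "PRP" "VB" "NN"])
```

A feature map like this would still need to be converted into a Weka instance before it could be passed to the random forest; that step is not shown here.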

Training data for statements and questions was scraped from answers.com and then cleaned up by hand. The expressives were entered by hand.

  • ~ 200 statements
  • ~ 200 questions
  • ~ 80 expressives

Summary of the trained model with cross-validation:

Correctly Classified Instances         407               85.3249 %
Incorrectly Classified Instances        70               14.6751 %
Kappa statistic                          0.7658
Mean absolute error                      0.1185
Root mean squared error                  0.2665
Relative absolute error                 28.3497 %
Root relative squared error             58.3073 %
Total Number of Instances              477
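
As a quick sanity check, the two percentages in the summary follow directly from the instance counts. The sketch below only redoes that arithmetic; `percent` is a hypothetical helper, and the other statistics (kappa, error measures) cannot be recomputed from these counts alone.

```clojure
(defn percent
  "Share of `part` in `total`, as a percentage (double)."
  [part total]
  (* 100.0 (/ part total)))

(def total-instances 477)
(def correct 407)
(def incorrect 70)

;; The two counts account for every instance.
(= total-instances (+ correct incorrect))

(percent correct total-instances)   ; roughly 85.3249
(percent incorrect total-instances) ; roughly 14.6751
```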

The random forest model was chosen after interactively running the data through different models in the Weka Explorer.


There are two main ways to use it.

The first is to use the classify-text function in the core namespace. It returns a keyword: :question, :statement, or :expressive.

(ns talk
  (:require [speech-acts-classifier.core :as c]))

(c/classify-text "I like cheese")
;; -> :statement

(c/classify-text "How do you make cheese")
;; -> :question

(c/classify-text "Right on")
;; -> :expressive

The second way is even more fun. It is a super simple chat bot that responds to your text. It first does a quick check to see whether the text ends with a question mark; if not, it runs the classifier.

lein run
Hello.  Let's chat.
>> I like cheese
Nice to know.
>> Where do you go to buy your cheese
That is an interesting question.
>> wow
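
The question-mark shortcut described above could look roughly like the following. This is a sketch under stated assumptions, not the bot's actual source: `speech-act` and `stub-classify` are hypothetical names, the classifier is passed in as a function so the example runs without the trained model, and `clojure.string/ends-with?` requires Clojure 1.8 or later.

```clojure
(ns example.chat
  (:require [clojure.string :as str]))

(defn speech-act
  "Hypothetical helper. `classify` is any function from text to
  :question, :statement, or :expressive (for example, classify-text).
  Text ending in a question mark skips the classifier entirely."
  [classify text]
  (if (str/ends-with? (str/trim text) "?")
    :question
    (classify text)))

;; Stand-in classifier so the sketch runs without the trained model.
(defn stub-classify [_text] :statement)

(speech-act stub-classify "Is that cheddar?") ; => :question, stub not called
(speech-act stub-classify "I like cheese")    ; => :statement, from the stub
```

Passing the classifier as an argument keeps the cheap punctuation check separate from the model, so the shortcut can be exercised on its own.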


  • Classifying Sentences as Speech Acts in Message Board Posts
  • Automated Speech Act Classification For Online Chat
  • Student Speech Act Classification Using Machine Learning

Further Exploration


Copyright © 2015 Carin Meier

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.