# Small Test using NPS Chat for Dialogue Act Classification

I used a quick-and-dirty approach to classification. Most of the techniques follow the NLTK documentation. The raw Treebank corpus (treebank_raw) is used to train the sentence segmenter, and the NPS Chat corpus is used to train the dialogue act classifier.

In [1]:
# import os, sys
import nltk
from nltk.corpus import nps_chat
from nltk.corpus import treebank_raw as treebank

#tar_sentence = """Hello everyone, I'm trying to solve the seek and destroy bonfire but I'm trying to understand how to fix some logic in my code."""

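def tokens_and_boundaries(dataset):
    # Flatten a list of tokenized sentences into one token list and record
    # the index of the last token of each sentence as a boundary.
    tokens = []
    boundaries = set()
    offset = 0
    for sent in dataset:
        tokens.extend(sent)
        offset += len(sent)
        boundaries.add(offset-1)
    return tokens, boundaries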


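def punct_features(tokens, i):
    # Treat a missing neighbour (start or end of the token list) as an empty
    # token, so a punctuation mark in the first or last position does not
    # raise IndexError or wrap around to the other end of the list.
    next_word = tokens[i+1] if i + 1 < len(tokens) else ''
    prev_word = tokens[i-1] if i > 0 else ''
    return {'next-word-capitalized': next_word[:1].isupper(),
            'prev-word': prev_word.lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(prev_word) == 1}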

def featuredsets(dataset):
    tokens, boundaries = tokens_and_boundaries(dataset)
    # boundaries stays a set, so each `i in boundaries` check is O(1)
    return [(punct_features(tokens, i), i in boundaries)
            for i in range(1, len(tokens)-1) if tokens[i] in '.?!']

def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word == '...':
            word = '.'  # treat an ellipsis as a candidate sentence boundary
        # 1) the token is a punctuation mark AND the classifier labels it as a boundary
        if word in '.?!' and classifier_segsen.classify(punct_features(words, i)):
            sents.append(words[start:i+1])  # 2) store the tokens from the sentence start through the punctuation mark
            start = i+1                     # 3) the next sentence starts right after it
    if start < len(words):                  # 4) tokens remain after the last detected boundary
        sents.append(words[start:])         # 5) keep them as a final sentence
    return sents                            # 6) return a list of all sentences


def dialogue_act_features(sen, tok=False):
    # sen is a raw string unless tok is True, in which case it is already a list of tokens
    words = sen if tok else nltk.word_tokenize(sen)
    return {'contains({})'.format(word.lower()): True for word in words}
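To make the segmentation helpers concrete, here is a minimal sketch that runs the boundary bookkeeping and the punctuation features on a made-up two-sentence dataset. The toy sentences and the compact copies of the two functions are assumptions for illustration only; no NLTK corpus is needed to run it.

```python
# Compact copies of the helpers, applied to toy data (illustrative only).

def tokens_and_boundaries(dataset):
    tokens, boundaries, offset = [], set(), 0
    for sent in dataset:
        tokens.extend(sent)
        offset += len(sent)
        boundaries.add(offset - 1)  # index of the last token of each sentence
    return tokens, boundaries

def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

# Hypothetical pre-tokenized sentences, in the shape a corpus .sents() call returns.
toy_sents = [['Mr', '.', 'Smith', 'left', '.'], ['He', 'was', 'late', '.']]

tokens, boundaries = tokens_and_boundaries(toy_sents)
print(sorted(boundaries))

# Same loop shape as featuredsets: the first and last tokens are skipped.
for i in range(1, len(tokens) - 1):
    if tokens[i] in '.?!':
        print(i, punct_features(tokens, i), i in boundaries)
```

The period after "Mr" (index 1) is labeled False and the period after "left" (index 4) is labeled True, which is exactly the distinction the segmentation classifier has to learn.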

With the functions defined, two classifiers are trained using NaiveBayesClassifier: one for sentence segmentation (on Treebank) and one for dialogue acts (on NPS Chat).

In [2]:
train_sentsegset = treebank.sents()
train_featsentsegset = featuredsets(train_sentsegset)
#print(train_featsentsegset[:10])
classifier_segsen = nltk.NaiveBayesClassifier.train(train_featsentsegset)

#train_dialactset = nps_chat.xml_posts()[:10000]
train_dialactset = nps_chat.xml_posts()
train_featdialactset = [(dialogue_act_features(train_sen.text), train_sen.get('class')) for train_sen in train_dialactset]
classifier_dialact = nltk.NaiveBayesClassifier.train(train_featdialactset)
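
The cell above trains on every available example, so it gives no estimate of how well either classifier generalizes. A minimal sketch of a held-out check follows, using made-up posts in place of the NPS Chat data. The toy posts, the whitespace tokenization, and the train/test split are all assumptions for illustration.

```python
import nltk

def bag_of_words(text):
    # Same 'contains(word)' feature scheme as dialogue_act_features, but with
    # whitespace tokenization so that no tokenizer model has to be downloaded.
    return {'contains({})'.format(w.lower()): True for w in text.split()}

# Made-up labeled posts; the labels mimic NPS Chat class names.
train_posts = [
    ('hello everyone', 'Greet'),
    ('hi there', 'Greet'),
    ('hello again', 'Greet'),
    ('what is this ?', 'whQuestion'),
    ('how do i do this ?', 'whQuestion'),
    ('where is it ?', 'whQuestion'),
]
# Held-out posts that the classifier never sees during training.
test_posts = [
    ('hello there', 'Greet'),
    ('what is it ?', 'whQuestion'),
]

train_set = [(bag_of_words(text), label) for text, label in train_posts]
test_set = [(bag_of_words(text), label) for text, label in test_posts]

toy_classifier = nltk.NaiveBayesClassifier.train(train_set)
toy_accuracy = nltk.classify.accuracy(toy_classifier, test_set)
print(toy_accuracy)
```

On the real data, the same pattern would mean holding out a slice of train_featdialactset (for example, the last 10%) before calling train, and reporting accuracy on that slice only.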


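A handful of messages were selected as test input. The selection was not fully random, since the goal was to evaluate the classification results. Each message ends with a sentinel END token, which is filtered out before classification.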

In [3]:
tar_sentences = ["""The content part is transcribing other peoples work and putting it on the website. The html, css, and javascript is either fixing glitches with the site or adding a tweak to it. It is estimated that it is going to be approximately 16 hours a week of work. Paid position, don't know how much. Please feel free to reach out to the e-mail address or call to get further information. END""",
               """@sb Yes, as I said I do not know if comparing results would be really possible but since I have little to no experience with JS I was hoping that someone might have some idea on how to do it properly. END""",
               """@sb Seek & Destroy is one of the more difficult bonfires... did you try Bonfire-seek-and-destroy hint? END""",
               """can someone help me with this...how do i pick subindex of an array? END""",
               """Oh, I'm sorry @sb, you probably meant I should go the the HelpBonfires chat room. Thanks! END""",
               """Hi everyone. Hope you're having fun coding. It's been a few days I'm trying to finish the 'Seek & Destroy' bonfire but I don't manage to do it. I know the solution should be simple, but I can't get it to work. Has anyone finished it and would be kind enough to give me a tip? END""",
               """@sb Does the webpage have to look the same? END""",
               """i'm also trying to set env variables using package config. is there a particular way of doing that with heroku beside heroku set? END""",
               """hey, someone did bonfire-convert-html entities? END""",
               """This could does not generate any error but does not move to the next exercise too. END""",
               """@sb It is because your location was not correctly detected by http://ip-api.com/. Could you open this link and verify the City field in a result table? END""",
               """@sb That should be good sign. You should be challenged while learning new stuff. If you need help with specific bonfire post your code here and ask question. END""",
               """usually the code editor , codepen / freecodecamp editor, displays a notification to the left of any line with bad code. I imagine the line with meter+= showed some red or yellow indicator and there's also the browser's console , which often also shows some kind of information when it can't make sense of your code. END""",
               """without putting something in the div or what razetime wrote there is nothing to apply the background color to. END""",
               ]

The classification procedure follows: each message is tokenized, split into sentences, and each sentence is labeled with a dialogue act.
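
One preprocessing step in the cell below deserves a closer look: an ellipsis inside a message is rewritten as a period and the following token is capitalized, so that the segmenter sees an ordinary sentence break. A minimal standalone sketch of that step follows; the function name normalize_ellipses and the hand-written token lists are assumptions for illustration.

```python
def normalize_ellipses(tokens):
    # Work on a copy so the caller's list is left untouched.
    tokens = list(tokens)
    for i, tok in enumerate(tokens):
        # An ellipsis in the last two positions is left alone, as in the cell below.
        if tok == '...' and i < len(tokens) - 2:
            tokens[i] = '.'
            tokens[i + 1] = tokens[i + 1].title()
    return tokens

# Hypothetical token list, hand-written instead of produced by nltk.word_tokenize.
toy_tokens = ['can', 'someone', 'help', '...', 'how', 'do', 'i', 'start', '?', 'END']
print(normalize_ellipses(toy_tokens))
# ['can', 'someone', 'help', '.', 'How', 'do', 'i', 'start', '?', 'END']
```

Note that str.title() also rewrites the rest of the token (for example, 'JS' becomes 'Js'), so a token that follows an ellipsis may not keep its original casing.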

In [4]:
for tar_sentence in tar_sentences:
    tar_sentence_tok = nltk.word_tokenize(tar_sentence)
    for iw, w in enumerate(tar_sentence_tok):
        if w == '...' and iw < (len(tar_sentence_tok)-2):
            tar_sentence_tok[iw] = '.'
            tar_sentence_tok[iw+1] = tar_sentence_tok[iw+1].title()
    # To use the classifier for sentence segmentation, check each punctuation
    # mark to see whether it is labeled as a boundary, and divide the list of
    # words at the boundary marks.

    for sen in segment_sentences(tar_sentence_tok):
        if sen != ['END']:
            tar_featuredsen = dialogue_act_features(sen, True)
            #print(train_featdialactset[0], tar_featuredsen)
            print(' '.join(sen),'-------->>',classifier_dialact.classify(tar_featuredsen))

The content part is transcribing other peoples work and putting it on the website . -------->> nAnswer
The html , css , and javascript is either fixing glitches with the site or adding a tweak to it . -------->> Continuer
It is estimated that it is going to be approximately 16 hours a week of work . -------->> Statement
Paid position , do n't know how much . -------->> Reject
Please feel free to reach out to the e-mail address or call to get further information . -------->> Clarify
@ sb Yes , as I said I do not know if comparing results would be really possible but since I have little to no experience with JS I was hoping that someone might have some idea on how to do it properly . -------->> nAnswer
@ sb Seek & Destroy is one of the more difficult bonfires . -------->> Statement
Did you try Bonfire-seek-and-destroy hint ? -------->> ynQuestion
can someone help me with this . -------->> Statement
How do i pick subindex of an array ? -------->> whQuestion
Oh , I 'm sorry @ sb , you prob