*Practical Dialogue Modelling, University of Potsdam, summer semester 2019, David Schlangen*

# The "Dialogue State Tracking Challenge" (DSTC) Data

In this notebook we explore the data from the "Dialogue State Tracking Challenge 2" (see the [DSTC website](http://camdial.org/~mh521/dstc/)).

To run the notebook, you first need to run the script `download_data.sh` in this directory.

In [1]:
import pandas as pd
import json
import sys

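# Show full cell contents instead of truncating long strings
# (pandas >= 1.0 expects None here instead of -1).
pd.set_option('display.max_colwidth', -1)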

In [2]:
# nice JSON viewer, from https://stackoverflow.com/questions/18873066/pretty-json-formatting-in-ipython-notebook
import uuid
from IPython.display import display_javascript, display_html, display
import json

class RenderJSON(object):
    def __init__(self, json_data):
        if isinstance(json_data, dict):
            self.json_str = json.dumps(json_data)
        else:
            self.json_str = json_data
        self.uuid = str(uuid.uuid4())

    def _ipython_display_(self):
        display_html('<div id="{}" style="height: 600px; width:100%;"></div>'.format(self.uuid), raw=True)
        display_javascript("""
        require(["https://rawgit.com/caldwell/renderjson/master/renderjson.js"], function() {
        document.getElementById('%s').appendChild(renderjson(%s))
        });
        """ % (self.uuid, self.json_str), raw=True)

In [3]:
sys.path.append('./')
from _dstc2_scripts.dataset_walker import dataset_walker
from _dstc2_scripts import misc

In [4]:
dataset = dataset_walker("dstc2_dev", dataroot="_Data/", labels=True)

## The Data as a JSON Structure

In [5]:
this_dial_id = 6

for m, call in enumerate(dataset):
    if m == this_dial_id:
        for n, (turn, label) in enumerate(call):
            print('{0} {1} {0}'.format('-' * 20, n))
            # For a full view of the JSON, including ASR output, uncomment the following line:
            #display(RenderJSON(turn))
            # Show the JSON of the system utterance and the gold-standard info about the
            #  user input:
            display(RenderJSON(turn['output']))
            display(RenderJSON(label))
            # The same, but unfolded (as indented plain text):
            #print(json.dumps(turn['output'], indent=4))
            #print(json.dumps(label, indent=4))
    elif m > this_dial_id:
        break

-------------------- 0 --------------------


-------------------- 1 --------------------


-------------------- 2 --------------------


## As a Pandas DataFrame

In [6]:
def json_to_cam(da_json):
    """Render a list of dialogue acts as a string like 'act(slot=value) | act(...)'."""
    full_act = []
    for this_da in da_json:
        this_act = this_da['act']
        slots = []
        for this_slot in this_da['slots']:
            slots.append('{}={}'.format(this_slot[0], this_slot[1]))
        full_act.append('{}({})'.format(this_act, ', '.join(slots)))
    return ' | '.join(full_act)
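
To make the expected input format concrete, the following is a small sketch that applies the same conversion to a hand-written list of dialogue acts. The sample list is an assumption modelled on the system output shown for dialogue 9 in this notebook, not copied from the data files, and the function is repeated in compact form only so that the snippet runs on its own.

```python
def json_to_cam(da_json):
    # Same conversion as in the cell above, repeated so this snippet is self-contained.
    full_act = []
    for this_da in da_json:
        slots = ['{}={}'.format(slot, value) for slot, value in this_da['slots']]
        full_act.append('{}({})'.format(this_da['act'], ', '.join(slots)))
    return ' | '.join(full_act)

# Hypothetical sample in the shape of turn['output']['dialog-acts'].
sample_acts = [
    {'act': 'offer', 'slots': [['name', 'la raza']]},
    {'act': 'inform', 'slots': [['food', 'spanish']]},
]
sample_cam = json_to_cam(sample_acts)
print(sample_cam)  # offer(name=la raza) | inform(food=spanish)

# An act without slots is rendered with empty parentheses.
no_slot_cam = json_to_cam([{'act': 'welcomemsg', 'slots': []}])
print(no_slot_cam)  # welcomemsg()
```

Each act becomes `act(slot=value, ...)`, and several acts in one turn are joined with ` | `.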

In [7]:
out = []
for m, call in enumerate(dataset):
    for n, (turn, label) in enumerate(call):
        # Pair the system utterance and its dialogue acts with the user's reply.
        sys_utt = turn['output']['transcript']
        usr_utt = label['transcription']
        sys_sem = json_to_cam(turn['output']['dialog-acts'])
        usr_sem = label['semantics']['cam'].replace(',', ', ')
        out.append((m, n, sys_utt, sys_sem, usr_utt, usr_sem))

columns = 'dial_id turn_id sys_utt sys_sem usr_utt usr_sem'.split()
dstc_df = pd.DataFrame(out, columns=columns)

In [8]:
dstc_df[dstc_df['dial_id'] == 9]

| | dial_id | turn_id | sys_utt | sys_sem | usr_utt | usr_sem |
|---|---|---|---|---|---|---|
| 76 | 9 | 0 | Hello , welcome to the Cambridge restaurant system? You can ask for restaurants by area , price range or food type . How may I help you? | welcomemsg() | im looking for spanish food | inform(food=spanish, task=find) |
| 77 | 9 | 1 | la raza serves spanish food . | offer(name=la raza) \| inform(food=spanish) | what is their address | request(addr) |
| 78 | 9 | 2 | Sure , la raza is on 4 - 6 Rose Crescent | offer(name=la raza) \| inform(addr=4 - 6 rose crescent) | what is their phone number | request(phone) |
| 79 | 9 | 3 | The phone number of la raza is 01223 464550 . | offer(name=la raza) \| inform(phone=01223 464550) | thank you good bye | thankyou()\|bye() |


## The Underlying Ontology

In [9]:
with open('_dstc2_scripts/config/ontology_dstc2.json', 'r') as f:
    ontology = json.load(f)

In [10]:
ontology

{u'informable': {u'area': [u'centre', u'north', u'west', u'south', u'east'],
  u'food': [u'afghan',
   u'african',
   u'afternoon tea',
   u'asian oriental',
   u'australasian',
   u'australian',
   u'austrian',
   u'barbeque',
   u'basque',
   u'belgian',
   u'bistro',
   u'brazilian',
   u'british',
   u'canapes',
   u'cantonese',
   u'caribbean',
   u'catalan',
   u'chinese',
   u'christmas',
   u'corsica',
   u'creative',
   u'crossover',
   u'cuban',
   u'danish',
   u'eastern european',
   u'english',
   u'eritrean',
   u'european',
   u'french',
   u'fusion',
   u'gastropub',
   u'german',
   u'greek',
   u'halal',
   u'hungarian',
   u'indian',
   u'indonesian',
   u'international',
   u'irish',
   u'italian',
   u'jamaican',
   u'japanese',
   u'korean',
   u'kosher',
   u'latin american',
   u'lebanese',
   u'light bites',
   u'malaysian',
   u'mediterranean',
   u'mexican',
   u'middle eastern',
   u'modern american',
   u'modern eclectic',
   u'modern european',
   u'modern
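
Because the full ontology is long, a summary is often easier to read than the raw dictionary. The sketch below is one possible way to do this; it works on a small hand-copied excerpt (`ontology_excerpt`, a stand-in introduced here for illustration) rather than on the loaded `ontology`, and the two helper functions are hypothetical names, not part of the DSTC scripts.

```python
# Hand-copied excerpt for illustration; the real ontology lists many more food values.
ontology_excerpt = {
    'informable': {
        'area': ['centre', 'north', 'west', 'south', 'east'],
        'food': ['afghan', 'african', 'afternoon tea'],
    }
}

def slot_value_counts(onto):
    """Map each informable slot to the number of values it can take."""
    return {slot: len(values) for slot, values in onto['informable'].items()}

def slots_for_value(onto, value):
    """Return the informable slots whose value list contains `value`."""
    return sorted(slot for slot, values in onto['informable'].items()
                  if value in values)

counts = slot_value_counts(ontology_excerpt)
print(counts['area'])  # 5
north_slots = slots_for_value(ontology_excerpt, 'north')
print(north_slots)  # ['area']
```

Assuming the loaded dictionary has the same `informable` layout as the output above, the same helpers could be called with `ontology` in place of the excerpt.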

## As an Aid for Writing Rules

The data indexed the other way round: one row per user dialogue act (intent, slot, value) instead of one row per turn.

In [11]:
out = []
for m, call in enumerate(dataset):
    for n, (turn, label) in enumerate(call):
        usr_utt = label['transcription']
        usr_sem = label['semantics']['json']
        for full_da in usr_sem:
            # The column list below assumes at most one slot/value pair per dialogue
            #  act. This seems to be the case in the data, but the format would allow
            #  arbitrarily many. Acts without slots (e.g. thankyou) yield shorter rows,
            #  which pandas pads with missing values.
            slot_vals = []
            for slot_value in full_da['slots']:
                slot_vals.extend([slot_value[0], slot_value[1]])
            out.append([m, n, usr_utt, full_da['act']] + slot_vals)

columns = 'dial_id turn_id usr_utt intent slot val'.split()
dstc_intent_df = pd.DataFrame(out, columns=columns)

In [12]:
dstc_intent_df.head(20)

| | dial_id | turn_id | usr_utt | intent | slot | val |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | i would like to find an expensive restaurant in the south part | inform | pricerange | expensive |
| 1 | 0 | 0 | i would like to find an expensive restaurant in the south part | inform | area | south |
| 2 | 0 | 1 | does not matter | inform | this | dontcare |
| 3 | 0 | 2 | any type of food is okay | inform | food | dontcare |
| 4 | 0 | 3 | what is the address | request | slot | addr |
| 5 | 0 | 4 | what is the phone number | request | slot | phone |
| 6 | 0 | 5 | what type of food | request | slot | food |
| 7 | 0 | 6 | okay thank | thankyou | | |
| 8 | 0 | 7 | thank you good bye | thankyou | | |
| 9 | 0 | 7 | thank you good bye | bye | | |
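
When writing rules, it helps to see every wording that was observed for a given act. The following is a minimal sketch of that grouping in plain Python, using a few rows copied from the table above; `rows` and `wordings` are names introduced here for illustration, and this is one possible approach rather than part of the original notebook.

```python
from collections import defaultdict

# A few (dial_id, turn_id, usr_utt, intent, slot, val) rows copied from the table above.
rows = [
    (0, 3, 'what is the address', 'request', 'slot', 'addr'),
    (0, 4, 'what is the phone number', 'request', 'slot', 'phone'),
    (0, 6, 'okay thank', 'thankyou', None, None),
    (0, 7, 'thank you good bye', 'thankyou', None, None),
    (0, 7, 'thank you good bye', 'bye', None, None),
]

# Group the distinct user wordings under each (intent, slot, val) triple.
wordings = defaultdict(set)
for dial_id, turn_id, usr_utt, intent, slot, val in rows:
    wordings[(intent, slot, val)].add(usr_utt)

# Print the groups in a stable order; str() makes keys containing None sortable.
for key in sorted(wordings, key=str):
    print('{} -> {}'.format(key, sorted(wordings[key])))
```

On the full frame, a comparable overview could presumably be obtained with `dstc_intent_df.groupby('intent')['usr_utt'].unique()`.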
