This notebook shows examples of coreference in the MultiWOZ dataset, involving words such as "it", "that", and "them". A short list of coreference resolution libraries is also provided for analyzing the whole dataset in the future.

### Load data and libraries

In [1]:
from pathlib import Path
import json

In [2]:
RED = '\x1b[31m'
BLUE = '\x1b[34m'
NC = '\x1b[0m'

In [3]:
dataset_dir = Path('../../data/multiwoz2_parsed')

raw_dials_path = dataset_dir / '..' / 'MULTIWOZ2 2' / 'data.json'
delex_dials_path = dataset_dir / 'multi-woz' / 'delex.json'
train_dials_path = dataset_dir / 'train_dials.json'
valid_dials_path = dataset_dir / 'val_dials.json'
test_dials_path = dataset_dir / 'test_dials.json'

gen_dir = Path('../multiwoz/model/data')

valid_dials_gen_path = gen_dir / 'val_dials' / 'val_dials_gen.json'
test_dials_gen_path = gen_dir / 'test_dials' / 'test_dials_gen.json'

In [4]:
with open(raw_dials_path, 'r') as raw_dial_f:
    raw_dials = json.load(raw_dial_f)

# Uncomment to load the parsed and generated dialogues.
# with open(delex_dials_path, 'r') as delex_dial_f:
#     delex_dials = json.load(delex_dial_f)
# with open(valid_dials_path, 'r') as val_dial_f:
#     valid_dials = json.load(val_dial_f)
# with open(test_dials_path, 'r') as test_dial_f:
#     test_dials = json.load(test_dial_f)

# with open(valid_dials_gen_path, 'r') as val_dial_gen_f:
#     valid_dials_gen = json.load(val_dial_gen_f)
# with open(test_dials_gen_path, 'r') as test_dial_gen_f:
#     test_dials_gen = json.load(test_dial_gen_f)

In [5]:
def show_turn(dial_id, turn_id, filt='11111'):
    """Format one user/system exchange of a dialogue with ANSI colors.

    `filt` is a 5-character mask; a '0' at position i hides field i:
    0 user (raw), 1 user (delex), 2 system (raw), 3 system (delex),
    4 system (gen). Fields 1, 3 and 4 are commented out below; uncomment
    them to display the parsed and generated dialogues.
    """
    # User utterances sit at even indices of the log, system ones at odd indices.
    return '\n'.join(filter(None, [
        '' if filt[0] == '0' else '{}User   (raw):\n{}\n{}'.
            format(RED, raw_dials[dial_id]['log'][turn_id*2]['text'], NC),
#         '' if filt[1] == '0' else '{}User   (delex) (input):\n{}\n{}'.
#             format(RED, valid_dials[dial_id]['usr'][turn_id].strip(), NC),
        '' if filt[2] == '0' else '{}System (raw):\n{}\n{}'.
            format(BLUE, raw_dials[dial_id]['log'][turn_id*2+1]['text'], NC),
#         '' if filt[3] == '0' else '{}System (delex) (ground truth):\n{}\n{}'.
#             format(BLUE, valid_dials[dial_id]['sys'][turn_id].strip(), NC),
#         '' if filt[4] == '0' else '{}System (gen):\n{}\n{}'.
#             format(BLUE, valid_dials_gen[dial_id][turn_id], NC),
    ]))
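
`show_turn` relies on the layout of `log`: user utterances sit at even indices and system utterances at odd ones. A minimal sketch of that indexing on a hand-made log follows. The `pair_turns` helper and the `toy_log` contents are illustrative assumptions, not part of the dataset or of this notebook.

```python
def pair_turns(log):
    """Group a flat MultiWOZ-style log into (user, system) text pairs."""
    n_turns = len(log) // 2  # a trailing unpaired entry is ignored
    return [
        (log[2 * i]['text'], log[2 * i + 1]['text'])
        for i in range(n_turns)
    ]


# Hypothetical toy log mimicking the structure of raw_dials[dial_id]['log'].
toy_log = [
    {'text': 'I need a hotel.'},
    {'text': 'What is your price range?'},
    {'text': 'Cheap, please.'},
    {'text': 'I recommend the Allenbell.'},
]

pairs = pair_turns(toy_log)
for user_text, system_text in pairs:
    print('User:  ', user_text)
    print('System:', system_text)
```

The same `len(log) // 2` count is what the "that" example uses to iterate over every turn of a dialogue.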

In [6]:
def sent_iterator(raw_dials):
    for dial_id, dial in raw_dials.items():
        yield dial_id
        for turn in dial['log']:
            yield turn['text']
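
Note that `sent_iterator` interleaves dialogue IDs with utterance texts in a single stream, so a consumer has to tell the two apart. A small sketch on made-up data shows the resulting order. The generator is repeated so the snippet runs on its own, and `toy_dials` is a hypothetical input.

```python
def sent_iterator(raw_dials):
    # Same generator as in the cell above, repeated for self-containment.
    for dial_id, dial in raw_dials.items():
        yield dial_id
        for turn in dial['log']:
            yield turn['text']


# Hypothetical two-dialogue input; the IDs and texts are made up.
toy_dials = {
    'A.json': {'log': [{'text': 'Hello.'}, {'text': 'Hi, how can I help?'}]},
    'B.json': {'log': [{'text': 'Book it.'}]},
}

stream = list(sent_iterator(toy_dials))
print(stream)
```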

### Example with "it"

In [7]:
dial_id = list(raw_dials)[2]

for turn in range(4):
    print(show_turn(dial_id, turn, filt='10100'))

User   (raw):
I need to book a hotel in the east that has 4 stars.

System (raw):
I can help you with that. What is your price range?

User   (raw):
That doesn't matter as long as it has free wifi and parking.

System (raw):
If you'd like something cheap, I recommend the Allenbell. For something moderately priced, I would recommend the Warkworth House.

User   (raw):
Could you book the Wartworth for one night, 1 person?

System (raw):
What day will you be staying?

User   (raw):
Friday and Can you book it for me and get a reference number ?

System (raw):
Booking was successful.
Reference number is : BMUKPTG6.  Can I help you with anything else today?


### Example with "that"

In [8]:
dial_id = 'PMUL1635.json' #list(raw_dials)[1]
n_turns = len(raw_dials[dial_id]['log']) // 2

for turn in range(n_turns):
    print(show_turn(dial_id, turn, filt='10100'))

User   (raw):
I need to book a hotel in the east that has 4 stars.

System (raw):
I can help you with that. What is your price range?

User   (raw):
That doesn't matter as long as it has free wifi and parking.

System (raw):
If you'd like something cheap, I recommend the Allenbell. For something moderately priced, I would recommend the Warkworth House.

User   (raw):
Could you book the Wartworth for one night, 1 person?

System (raw):
What day will you be staying?

User   (raw):
Friday and Can you book it for me and get a reference number ?

System (raw):
Booking was successful.
Reference number is : BMUKPTG6.  Can I help you with anything else today?

User   (raw):
I am looking to book a train that is leaving from Cambridge to Bishops Stortford on Friday.

System (raw):
There are a number of trains leaving throughout the day.  What time would you like to travel?

User   (raw):
I want to g

### Example with "them"

In [9]:
dial_id = list(raw_dials)[6]

for turn in range(3, 5):
    print(show_turn(dial_id, turn, filt='10100'))

User   (raw):
The price doesn't really matter. I just need free parking. It doesn't really need to have internet though.

System (raw):
There are 5 guesthouses that have free parking. Should I book one of them for you?

User   (raw):
Okay, none of them DON'T offer free wifi? If not, I'll need the address for one that does have wifi, please. Tell me about your favorite.

System (raw):
The allenbell is a guesthouse on the east.  The addres sis 517a coldham lane post code cb13js.


### Libraries for coreference resolution

- The Stanford CoreNLP toolkit has a [coreference resolution module](https://stanfordnlp.github.io/CoreNLP/coref.html), but it is written in Java.
- [This GitHub repository](https://github.com/huggingface/neuralcoref) has a pre-trained model integrated into the spaCy pipeline, but it is implemented in Cython instead of ordinary Python. Its authors claim that the model is trained for coreference resolution in dialogues.
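
Before committing to one of these libraries, a rough idea of how often the pronouns from the examples occur can be obtained with the standard library alone. The sketch below is a crude surface-level count, not coreference resolution: it also matches non-referential uses such as "that" as a relative pronoun. The `count_pronouns` helper, the pronoun list, and `toy_dials` are assumptions introduced for illustration.

```python
import re
from collections import Counter

# Pronouns highlighted in this notebook; the list is an assumption and can be extended.
PRONOUNS = ('it', 'that', 'them')
PRONOUN_RE = re.compile(r'\b(' + '|'.join(PRONOUNS) + r')\b', re.IGNORECASE)


def count_pronouns(dials):
    """Count surface occurrences of the target pronouns over all turns."""
    counts = Counter()
    for dial in dials.values():
        for turn in dial['log']:
            # findall returns the matched group for each occurrence.
            counts.update(m.lower() for m in PRONOUN_RE.findall(turn['text']))
    return counts


# Hypothetical input with the same shape as raw_dials.
toy_dials = {
    'X.json': {'log': [
        {'text': 'Can you book it for me?'},
        {'text': 'I can help you with that.'},
        {'text': 'That is fine. Should I book one of them?'},
    ]},
}

counts = count_pronouns(toy_dials)
print(counts)
```

Passing `raw_dials` instead of `toy_dials` would give counts over the full dataset, which only bound the number of candidate mentions from above.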