# RSA-UID-switchboard

Investigating (possibly) meaningful disfluencies in the Switchboard corpus

### Objectives
1. Analysis: do disfluencies appear before high-surprisal items?
2. Listener experiment: when listeners see 'uh', are they inclined to choose high-surprisal items as good candidates for what comes next?

### Resources
* http://www.nltk.org/_modules/nltk/corpus/reader/switchboard.html

In [83]:
from collections import Counter, defaultdict
from nltk.corpus import switchboard as sb
import numpy as np
import random
import re

Example discourse...

In [2]:
discourses = [d for d in sb.discourses()]

In [3]:
discourses[0][:5]

[<A.1: 'Uh , do you have a pet Randy ?'>,
 <B.2: 'Uh , yeah , currently we have a poodle .'>,
 <A.3: 'A poodle , miniature or , uh , full size ?'>,
 <B.4: "Yeah , uh , it 's , uh miniature .">,
 <A.5: 'Uh-huh .'>]

### Preprocessing helpers

In [19]:
def replace_start(s):
    start_patt = r"^<[AB]\.[0-9]{1,}:\s[\'\"]"
    replace_patt = ''
    s = re.sub(start_patt, replace_patt, s)
    return s

def replace_end(s):
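    # Anchor at the end of the string so a '>' inside the turn text is left alone;
    # the closing quote may be ' or " depending on the turn's contents.
    end_patt = r"\s?[\'\"]?>$"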
    replace_patt = ''
    s = re.sub(end_patt, replace_patt, s)
    return s

def remove_punct(s):
    s = re.sub(r'[,.?:{}\[\]]', '', s)
    return s

def turn2Str(t):
    # repr(t) looks like "<A.3: '...'>"; older NLTK releases exposed the same
    # string as t.unicode_repr().
    unicodeStr = repr(t)
    s = preprocess_turn_str(unicodeStr)
    return s

def preprocess_turn_str(s):
    s = replace_start(s)
    s = replace_end(s)
    s = remove_punct(s)
    s = s.lower()
    s = re.sub(r'\s{2,}', ' ', s)
    s = re.sub(r'"$', '', s)
    s = s.rstrip()
    return s

check...

In [20]:
exTurn = discourses[0][2]
print(str(exTurn))
turn2Str(exTurn)

<A.3: 'A poodle , miniature or , uh , full size ?'>


'a poodle miniature or uh full size'

## Data

In [23]:
allStrs = [turn2Str(t) for d in discourses for t in d]
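# Wrap each turn in ^ ... $ boundary markers. split() with no argument drops the
# empty tokens that a turn reduced to '' by preprocessing would otherwise leave.
orderedTokens = ' '.join('^ ' + s + ' $' for s in allStrs).split()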
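
To make the shape of the token stream concrete, here is a small sketch on two made-up utterances (the strings are illustrative, not corpus data). Note that joining all turns into one stream means the last `$` of one turn is directly followed by the `^` of the next, so a bigram model built on it will also count `$` -> `^` transitions.

```python
# Two toy, already-preprocessed turns (hypothetical, not taken from the corpus).
toy_strs = ['a poodle', 'uh-huh']

# Same construction as in the notebook: boundary markers, then one flat stream.
toy_tokens = ' '.join('^ ' + s + ' $' for s in toy_strs).split()
print(toy_tokens)
# ['^', 'a', 'poodle', '$', '^', 'uh-huh', '$']
```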

### LM

In [114]:
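class Ngram:
    """Bigram counts: lm[w1][w2] is how often w2 directly follows w1."""

    def __init__(self, n=2):
        # NOTE: n is stored, but only bigram counting is implemented below.
        self.n = n

    def ingest(self, orderedTokens):
        d = defaultdict(Counter)
        # Pair every token with its successor.
        for word, nextWord in zip(orderedTokens, orderedTokens[1:]):
            d[word][nextWord] += 1
        self.lm = d

    def get_lm(self):
        return self.lm

    def get_candidates(self, token):
        # Counter of tokens seen after `token` (empty if it was never seen).
        return self.lm[token]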

check...

In [115]:
ng = Ngram()
exCorpus = ['apple', 'on', 'the', 'apple', 'range', 'apple']
ng.ingest(exCorpus)
ng.get_lm()

defaultdict(collections.Counter,
            {'apple': Counter({'on': 1, 'range': 1}),
             'on': Counter({'the': 1}),
             'range': Counter({'apple': 1}),
             'the': Counter({'apple': 1})})

Build switchboard ngram

In [116]:
ng = Ngram()
ng.ingest(orderedTokens)
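
Objective 1 needs a surprisal value per token, which the counts above do not give directly. A minimal sketch, assuming surprisal is defined as -log2 P(word | previous word) with unsmoothed maximum-likelihood bigram estimates. The helpers `build_bigram_counts` and `bigram_surprisal` and the toy tokens are hypothetical, and unseen bigrams are simply assigned infinite surprisal here.

```python
import math
from collections import Counter, defaultdict

def build_bigram_counts(tokens):
    """Count next-token occurrences per token (same layout as Ngram.lm)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def bigram_surprisal(counts, prev, word):
    """Surprisal in bits of `word` given `prev`; inf if the bigram was never seen."""
    followers = counts.get(prev)  # .get avoids inserting new keys into the defaultdict
    if not followers or followers[word] == 0:
        return math.inf
    return -math.log2(followers[word] / sum(followers.values()))

# Toy stream (hypothetical): 'a' is followed by 'poodle' once and by 'uh' once.
toy_tokens = ['^', 'a', 'poodle', '$', '^', 'a', 'uh', 'mutt', '$']
toy_counts = build_bigram_counts(toy_tokens)
print(bigram_surprisal(toy_counts, 'a', 'poodle'))  # P = 1/2, so 1.0 bit
```

With the notebook's model, the same function could be called as `bigram_surprisal(ng.get_lm(), 'a', 'uh')`. Some form of smoothing would probably be needed before comparing rare items.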

## Experiment - brainstorm

* Task: sentence completion - participants see a sentence fragment and fill in the missing item from a choice of five candidates.
    + include the "true" candidate?
    + include a spectrum of high->low surprisal items
* What if we had people incrementally build sentences from the LM candidates?
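
One way to realise the "spectrum of high->low surprisal" idea is to rank a context's followers by count and pick evenly spaced items from that ranking. This is only a sketch under assumptions: `spectrum_candidates` is a hypothetical helper, it relies on the fact that for a fixed context a higher bigram count means lower surprisal, and it appends the true continuation whenever the spread misses it (so the result can hold k + 1 items).

```python
from collections import Counter

def spectrum_candidates(followers, true_candidate, k=5):
    """Pick up to k followers spread from most to least frequent, plus the true item.

    `followers` is a Counter of next-token counts for one context; k must be >= 2.
    """
    ranked = [w for w, _ in followers.most_common()]  # lowest surprisal first
    if len(ranked) <= k:
        picks = list(ranked)
    else:
        # Evenly spaced positions across the ranking, both endpoints included.
        step = (len(ranked) - 1) / (k - 1)
        picks = [ranked[round(i * step)] for i in range(k)]
    if true_candidate not in picks:
        picks.append(true_candidate)
    return picks

# Toy follower counts for one context (hypothetical numbers).
toy_followers = Counter({'dog': 7, 'cat': 6, 'uh': 5, 'lot': 4,
                         'poodle': 3, 'mutt': 2, 'chowperd': 1})
print(spectrum_candidates(toy_followers, 'uh', k=3))
# ['dog', 'lot', 'chowperd', 'uh']
```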
    

In [204]:
# sent = random.choice(allStrs).split(' ')
# index = random.randint(0, len(sent))
# sent[:index]
exSent = 'yeah i have a uh well a mutt myself i call it a uh uh chowperd'
exSentSplit = exSent.split(' ')
# Start at 1 so the fragment is never empty (fragment[-1] is needed below).
index = random.randint(1, len(exSentSplit)-1)
fragment = exSentSplit[:index]

In [205]:
lastToken = fragment[-1]
trueCandidate = exSentSplit[index]

In [206]:
lm_candidates = list(ng.get_candidates(lastToken).keys())
candidates = set(random.sample(lm_candidates, min(5, len(lm_candidates)))) | {trueCandidate}

In [207]:
print("Full utterance:\t{}".format(exSent))
print('Fragment:\t{}'.format(' '.join(fragment)))
print('Last token:\t{}'.format(lastToken))
print('Candidates:\t{}'.format(candidates))
print('True candidate:\t{}'.format(trueCandidate))

Full utterance:	yeah i have a uh well a mutt myself i call it a uh uh chowperd
Fragment:	yeah i have a
Last token:	a
Candidates:	{'reporter', 'uh', 'minimum', 'belt', 'historical', 'daughter'}
True candidate:	uh
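
The other brainstorm idea, having people build sentences incrementally from LM candidates, could be prototyped along these lines. Everything here is an assumption made for illustration: `build_incrementally` is a hypothetical helper, the `choose` callback stands in for the participant, and options are drawn uniformly at random from the followers of the last token.

```python
import random
from collections import Counter

def build_incrementally(lm, choose, start='^', end='$',
                        n_options=5, max_len=20, rng=random):
    """Grow an utterance one token at a time from bigram candidates.

    `choose(fragment, options)` returns one of `options`. Building stops when
    the end marker is chosen, no followers exist, or max_len is reached
    (max_len counts the start marker).
    """
    fragment = [start]
    while len(fragment) < max_len:
        followers = sorted(lm.get(fragment[-1], {}))  # sorted for reproducibility
        if not followers:
            break
        options = rng.sample(followers, min(n_options, len(followers)))
        token = choose(fragment, options)
        if token == end:
            break
        fragment.append(token)
    return fragment[1:]  # drop the start marker

# Toy model with a single possible path (hypothetical data).
toy_lm = {'^': Counter({'a': 1}), 'a': Counter({'poodle': 1}),
          'poodle': Counter({'$': 1})}
print(build_incrementally(toy_lm, choose=lambda frag, opts: opts[0]))
# ['a', 'poodle']
```

In the notebook this could be tried with `ng.get_lm()` as `lm` and a `choose` function that prompts for input.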


In [201]:
uhStrs = [s for s in allStrs if 'uh' in s.split()]
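
For objective 1, one possible next step is to compare the surprisal of words that directly follow 'uh' with the surprisal of all other words. The sketch below makes several assumptions that are not in the notebook: 'uh' tokens are removed before scoring, so each word is conditioned on the last fluent word; probabilities are unsmoothed bigram estimates taken from the same utterances being scored; and `surprisal_by_position` is a hypothetical helper. The toy utterances are made up.

```python
import math
from collections import Counter, defaultdict
from statistics import mean

def surprisal_by_position(utterances, filler='uh'):
    """Return (after_filler, elsewhere): lists of bigram surprisals in bits."""
    # Strip fillers, remembering which surviving tokens they preceded.
    cleaned = []
    for utt in utterances:
        tokens, flags, pending = ['^'], [False], False
        for tok in utt.split():
            if tok == filler:
                pending = True
                continue
            tokens.append(tok)
            flags.append(pending)
            pending = False
        tokens.append('$')
        flags.append(pending)
        cleaned.append((tokens, flags))

    # Bigram counts within utterances only (no '$' -> '^' transitions).
    counts = defaultdict(Counter)
    for tokens, _ in cleaned:
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1

    after_filler, elsewhere = [], []
    for tokens, flags in cleaned:
        for prev, nxt, flagged in zip(tokens, tokens[1:], flags[1:]):
            p = counts[prev][nxt] / sum(counts[prev].values())
            (after_filler if flagged else elsewhere).append(-math.log2(p))
    return after_filler, elsewhere

toy = ['i have a uh mutt', 'i have a dog', 'i have a dog']
after, other = surprisal_by_position(toy)
print(mean(after), mean(other))
```

On the real data this could be called as `surprisal_by_position(allStrs)`. Because the estimates are in-sample and unsmoothed, any difference in means would be a first look, not a test of the hypothesis.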