# RSA-UID-switchbaord

Investigating (possibly) meaningful disfluencies in switchboard

### Objectives
1. analysis: do disfluencies appear before high-surprisal items?
2. listener exp: are listeners inclined to choose high-surprisal items as good candidates when they see 'uh'

### Resources
* Switchboard http://www.nltk.org/_modules/nltk/corpus/reader/switchboard.html
* typing animation https://codepen.io/rusjames/pen/uAFhE

In [107]:
from collections import Counter, defaultdict
from nltk.corpus import switchboard as sb
import numpy as np
import pdb
import random
import re

Example discourse...

In [2]:
discourses = [d for d in sb.discourses()]

In [3]:
discourses[0][:5]

[<A.1: 'Uh , do you have a pet Randy ?'>,
 <B.2: 'Uh , yeah , currently we have a poodle .'>,
 <A.3: 'A poodle , miniature or , uh , full size ?'>,
 <B.4: "Yeah , uh , it 's , uh miniature .">,
 <A.5: 'Uh-huh .'>]

### Preprocessing helpers

In [4]:
def replace_start(s):
    start_patt = r"^<[AB]\.[0-9]{1,}:\s[\'\"]"
    replace_patt = ''
    s = re.sub(start_patt, replace_patt, s)
    return s

def replace_end(s):
    end_patt = r"\s?\'?>"
    replace_patt = ''
    s = re.sub(end_patt, replace_patt, s)
    return s

def remove_punct(s):
    s = re.sub('[,\.\?:\{\}\[\]]', '', s)
    return s

def turn2Str(t):
    unicodeStr = t.unicode_repr()
    s = preprocess_turn_str(unicodeStr)
    return s

def preprocess_turn_str(s):
    s = replace_start(s)
    s = replace_end(s)
    s = remove_punct(s)
    s = s.lower()
    s = re.sub('\s{2,}', ' ', s)
    s = re.sub(r'"$', '', s)
    s = s.rstrip()
    return s

check...

In [5]:
exTurn = discourses[0][2]
print(str(exTurn))
turn2Str(exTurn)

<A.3: 'A poodle , miniature or , uh , full size ?'>


'a poodle miniature or uh full size'

## Data

In [135]:
allStrs = [turn2Str(t) for d in discourses for t in d]
allStrsStartEnd = ['^ ' + s + ' $' for s in allStrs]
orderedTokens = ' '.join(allStrsStartEnd).split(' ')

### LM

In [181]:
class Ngram:
    def __init__(self, n=2):
        self.n = n
        self.vocab_size = 0
        
    def ingest(self, orderedTokens):
        self.vocab_size = len(orderedTokens)
        d = defaultdict(Counter)
        for i, word in enumerate(orderedTokens):
            if i == len(orderedTokens)-self.n:
                break
            currTup = tuple(orderedTokens[i:i+self.n-1])
            d[currTup][orderedTokens[i+self.n-1]] += 1
        self.lm = d
        
    def get_lm(self):
        return self.lm
    
    def get_candidates(self, token):
        return self.lm[token]
    
    def get_lm_prob(self, tup):
        context = tup[0:self.n-1]
        target = tup[self.n-1]
        num = self.lm[context][target]
        denom = np.sum([d[1] for d in self.lm[context].items()])
        return num / denom
        

check...

In [121]:
ng = Ngram(n=3)
exCorpus = ['apple', 'on', 'the', 'apple', 'on', 'a', 'range', 'apple']
ng.ingest(exCorpus)
ng.get_lm_prob(('apple', 'on', 'apple'))

0.0

Build switchboard ngram

In [122]:
ng = Ngram(2)
ng.ingest(orderedTokens)

## Experiment - brainstorm

* Task: sentence completion - particiants see a sentence fragment and have to fill in the missing item from a choice of five candidates. 
    + include the "true" candidate?
    + include a spectrum of high->low suprisal items
* What if we had people incrementally build sentences from the LM candidates?
    

In [123]:
# sent = random.choice(allStrs).split(' ')
# index = random.randint(0, len(sent))
# sent[:index]
exSent = 'yeah i have a uh well a mutt myself i call it a uh uh chowperd'
exSentSplit = exSent.split(' ')
index = random.randint(1, len(exSentSplit)-1)
fragment = exSentSplit[:index]

In [124]:
lastToken = fragment[-1]
print(lastToken)
trueCandidate = exSentSplit[index]

a


In [50]:
lm_candidates = list(ng.get_candidates().keys())
print(lm_candidates)
candidates = set(random.sample(lm_candidates, min(5, len(lm_candidates)))) | {trueCandidate}

[]


In [51]:
print("Full utterace:\t{}".format(exSent))
print('Fragment:\t{}'.format(' '.join(fragment)))
print('Last token:\t{}'.format(lastToken))
print('Candidates:\t{}'.format(candidates))
print('True candidate:\t{}'.format(trueCandidate))

Full utterace:	yeah i have a uh well a mutt myself i call it a uh uh chowperd
Fragment:	yeah i have a uh well a mutt myself i call it a uh uh
Last token:	uh
Candidates:	{'chowperd'}
True candidate:	chowperd


## Experiment - just "uh's"

In [136]:
uhStrs = [s for s in allStrsStartEnd if 'uh' in s.split()]
uhStrs_sample = random.sample(uhStrs, 10)
uhStrs_sample[0]

'^ well this was not a learning thing by any means it was just a uh $'

In [137]:
def get_match_indices(x, s):
    return [i for i, w in enumerate(s.split(' ')) if w == x]

In [138]:
indices = get_match_indices('uh', uhStrs_sample[0])
index = indices[0]; index

15

In [139]:
' '.join(uhStrs_sample[0].split(' ')[0:index+1])

'^ well this was not a learning thing by any means it was just a uh'

In [141]:
class categoricalDistr:
    def __init__(self, vs, ps):
        self.vs = vs
        self.ps = ps
        self.normed_ps = self.normalize(self.ps)
        self.distr = dict(zip(self.vs, self.normed_ps))
        
    def sample(self, k=5):
        vals = np.random.choice(a=self.vs, size=(k), p=self.normed_ps)
        d = {}
        for v in vals:
            d[v] = self.distr[v]
        return d

    def normalize(self, x):
        return x / np.sum(x)
        


In [20]:
vs = list(ng.get_candidates('uh').keys())
ps = list(ng.get_candidates('uh').values())
C = categoricalDistr(vs, ps)

In [21]:
C.sample()

{'$': 0.056916996047430828,
 'even': 0.0015810276679841897,
 'have': 0.0023715415019762848,
 'i': 0.11304347826086956,
 'uh': 0.039525691699604744}

## Reproducing Levy & Jaeger (2007) that-omission analysis (section 4)

**Objective**:
What is the relationship between `uh/um` presence and the informativity of the following text?
* H_0: there is no relationship
* H_1: presence of `uh/um` indicates the following content is high-surprisal.

In [182]:
n = 3
ng = Ngram(n)
ng.ingest(orderedTokens)

In [183]:
uhStrs = [s for s in allStrsStartEnd if 'uh' in s.split()]
uhStrs[0]

'^ uh do you have a pet randy $'

In [184]:
def getKtuple(s, k=2):
    s_split = s.split(' ')
    d = []
    for i, token in enumerate(s_split):
        if i == len(s_split)-k:
            break
        curr_tup = tuple(s_split[i:i+k])
        d.append(curr_tup)
    return d

In [185]:
getKtuple(uhStrs[0], n)

[('^', 'uh', 'do'),
 ('uh', 'do', 'you'),
 ('do', 'you', 'have'),
 ('you', 'have', 'a'),
 ('have', 'a', 'pet'),
 ('a', 'pet', 'randy')]

In [186]:
for tup in getKtuple(uhStrs[0], n):
    print('{}, {}'.format(tup, ng.get_lm_prob(tup)))

('^', 'uh', 'do'), 0.007042253521126761
('uh', 'do', 'you'), 0.5
('do', 'you', 'have'), 0.2112676056338028
('you', 'have', 'a'), 0.24285714285714285
('have', 'a', 'pet'), 0.008771929824561403
('a', 'pet', 'randy'), 1.0


### Todo

* Get all instances of find all context / targets (w/ and w/o disfluencies and get surprisal)