In [84]:
#Prints **all** console output, not just last item in cell 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Motivation/Problem-Statement" data-toc-modified-id="Motivation/Problem-Statement-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Motivation/Problem Statement</a></span><ul class="toc-item"><li><span><a href="#Code-requirements" data-toc-modified-id="Code-requirements-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Code requirements</a></span></li><li><span><a href="#Exposition-/-other-notebooks" data-toc-modified-id="Exposition-/-other-notebooks-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Exposition / other notebooks</a></span></li></ul></li><li><span><a href="#Boilerplate-code-for-representing-and-manipulating-probability-distributions" data-toc-modified-id="Boilerplate-code-for-representing-and-manipulating-probability-distributions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Boilerplate code for representing and manipulating probability distributions</a></span></li><li><span><a href="#Boilerplate-code-for-generating-model-code-given-model-inputs" data-toc-modified-id="Boilerplate-code-for-generating-model-code-given-model-inputs-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Boilerplate code for generating model code given model inputs</a></span></li><li><span><a href="#Getting-data" data-toc-modified-id="Getting-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Getting data</a></span><ul class="toc-item"><li><span><a href="#Converting-data-to-the-required-probability-distributions" data-toc-modified-id="Converting-data-to-the-required-probability-distributions-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Converting data to the required probability distributions</a></span><ul class="toc-item"><li><span><a href="#Explorations-/-asides-/-sanity-checks" data-toc-modified-id="Explorations-/-asides-/-sanity-checks-4.1.1"><span class="toc-item-num">4.1.1&nbsp;&nbsp;</span>Explorations / asides / sanity checks</a></span></li><li><span><a href="#Testing" 
data-toc-modified-id="Testing-4.1.2"><span class="toc-item-num">4.1.2&nbsp;&nbsp;</span>Testing</a></span></li></ul></li></ul></li><li><span><a href="#Generating-a-model" data-toc-modified-id="Generating-a-model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Generating a model</a></span><ul class="toc-item"><li><span><a href="#Testing-the-model" data-toc-modified-id="Testing-the-model-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Testing the model</a></span></li></ul></li></ul></div>

**Notebook author:** emeinhardt@ucsd.edu

# Motivation/Problem Statement

The goal of this collection of notebooks ('Model Notebook k', $k \in \{0, 1, 2, 3\}$) is to document/develop code for calculations related to an isolated word recognition task where:

 - A *lexicon* $\mathcal{L}$ is a set of **wordforms** $w$, where a wordform is a finite sequence $s_0, s_1, \ldots, s_f$ of segments with $f = |w| - 1$. For $0 \leq i \leq f$, $s_0^i$ denotes the prefix $s_0, s_1, \ldots, s_i$ of $w$ (so $s_0^f = w$).
 - In a single episode of the task, a **speaker** samples a wordform $w = s_0^f$ from $\mathcal{L}$ according to the prior $p(s_0^f)$.
 - The speaker incrementally produces their intended wordform (one segment at a time, for our purposes) and the **listener** incrementally perceives $\sigma_0^i \sim p(\sigma_0^i | s_0^i)$.
 - The listener considers what they have perceived so far, $\sigma_0^i$, and reasons about which wordform the speaker most likely intended: $p(\hat{s_0^f}|\sigma_0^i) \propto p(\sigma_0^i|\hat{s_0^f}) p(\hat{s_0^f}) = p(\sigma_0^i|\hat{s_0^i}) p(\hat{s_0^f})$, where the last step holds because what has been perceived so far depends only on the segments produced so far.


In particular, we want, for each possible intended wordform $s_0^f \in \mathcal{L}$, for each possible prefix $s_0^i$ of $s_0^f$, the listener's expected distribution over the speaker's intended wordform given the actual prefix produced so far $s_0^i$, where the expectation is taken with respect to possible perceived sequences $\sigma_0^i$: 

$$p(\hat{s_0^f}|s_0^i) = \sum_{\sigma_0^i} p(\hat{s_0^f}|\sigma_0^i)p(\sigma_0^i|s_0^i) = \sum_{\sigma_0^i} \frac{p(\sigma_0^i|\hat{s_0^f})p(\hat{s_0^f})}{p(\sigma_0^i)}p(\sigma_0^i|s_0^i)$$

## Code requirements

The code for calculating 
$$p(\hat{s_0^f}|s_0^i) = \sum_{\sigma_0^i} p(\hat{s_0^f}|\sigma_0^i)p(\sigma_0^i|s_0^i) = \sum_{\sigma_0^i} \frac{p(\sigma_0^i|\hat{s_0^f})p(\hat{s_0^f})}{p(\sigma_0^i)}p(\sigma_0^i|s_0^i)$$
needs to take only two basic inputs:

 - a prior distribution over the lexicon $p(s_0^f) = p(\hat{s_0^f})$.
 - an incrementally defined channel distribution $p(\sigma_0^i|s_0^i) = p(\sigma_0^i|\hat{s_0^i})$, defined (for now) in terms of a uniphone channel distribution $p(\sigma_i|s_i)$, where the distribution over what segment the listener perceives as the $i$th one depends (by assumption) *only* on the actual $i$th segment $s_i$ the speaker produced (if they've actually produced it at the time we're asking about).

from which it needs to generate all relevant code and representations of probability distributions or means of estimating them.
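
As a rough illustration of the quantity defined above, the following sketch computes the expected posterior exactly on a tiny example by enumerating every perceivable sequence. The two-word lexicon, the 1/10 confusion rate, and all function names here are hypothetical and chosen only for illustration; they are not the notebook's data or its implementation.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy inputs (not the notebook's data).
# Prior over a two-word lexicon; wordforms are tuples of segments.
prior = {("a", "b"): Fraction(3, 4), ("b", "b"): Fraction(1, 4)}

# Hypothetical uniphone channel p(sigma_k | s_k): each segment is misheard
# with probability 1/10.
channel = {
    "a": {"a": Fraction(9, 10), "b": Fraction(1, 10)},
    "b": {"a": Fraction(1, 10), "b": Fraction(9, 10)},
}

def channel_prob(perceived, produced):
    """p(sigma_0^i | s_0^i) as a product of per-segment channel probabilities."""
    p = Fraction(1)
    for sigma_k, s_k in zip(perceived, produced):
        p *= channel[s_k].get(sigma_k, Fraction(0))
    return p

def posterior(perceived):
    """p(s_0^f | sigma_0^i) over the whole lexicon, by Bayes' rule."""
    i = len(perceived)
    scores = {w: channel_prob(perceived, w[:i]) * prior[w] for w in prior}
    total = sum(scores.values())  # this is p(sigma_0^i)
    return {w: score / total for w, score in scores.items()}

def expected_posterior(produced_prefix):
    """Average the listener's posterior over every perceivable sigma_0^i."""
    # In this toy case the input and output alphabets coincide.
    alphabet = sorted(channel)
    result = {w: Fraction(0) for w in prior}
    for perceived in product(alphabet, repeat=len(produced_prefix)):
        weight = channel_prob(perceived, produced_prefix)
        if weight == 0:
            continue
        post = posterior(perceived)
        for w in prior:
            result[w] += weight * post[w]
    return result

expected = expected_posterior(("a",))
print(expected)
```

Exhaustive enumeration like this is only feasible for very short prefixes and small alphabets, which is one reason to estimate the sum by sampling instead.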

## Exposition / other notebooks

*Notebook 0* introduces the model being developed and implemented, the organization of the other notebooks in the collection, and the adapted form of Peter Norvig's code for conveniently representing and manipulating probability distributions that I use in the other notebooks.

*Notebook 1* introduces the final model implementation and the derivations underlying it, using randomly generated but highly constrained binary lexicons with a simple noise model. It is intended to present the implementation and derivation in a context where the behavior, purpose, and basic correctness of the code are discernible from reading through the notebook and the generated example.

*Notebook 2* is mainly for testing. It contains multiple implementations of the model. One is defined entirely in terms of the abstractions Peter Norvig provides: abstractions that cannot scale, but which were easy to use and whose correctness I have high confidence in. The second implementation is the one presented in Notebooks 1 and 3.

*This notebook*, *Notebook 3*, is a demonstration notebook where real data is loaded and basic queries are calculated.

# Boilerplate code for representing and manipulating probability distributions

In [2]:
import random

#from 
#    http://nbviewer.jupyter.org/url/norvig.com/ipython/Probability.ipynb
#    http://nbviewer.jupyter.org/url/norvig.com/ipython/ProbabilityParadox.ipynb
#with slight modification.

from fractions import Fraction

from collections import defaultdict, Counter

is_predicate = callable

def P(event, space): 
    """The probability of an event, given a sample space of equiprobable outcomes. 
    event: a collection of outcomes, or a predicate that is true of outcomes in the event. 
    space: a set of outcomes or a probability distribution of {outcome: frequency} pairs."""
    if is_predicate(event):
        event = such_that(event, space)
    if isinstance(space, ProbDist):
        return sum(space[o] for o in space if o in event)
    else:
        return Fraction(len(event & space), len(space))
    
def such_that(predicate, space): 
    """The outcomes in the sample space for which the predicate is true.
    If space is a set, return a subset {outcome,...};
    if space is a ProbDist, return a ProbDist {outcome: frequency,...};
    in both cases only with outcomes where predicate(element) is true."""
    if isinstance(space, ProbDist):
        return ProbDist({o:space[o] for o in space if predicate(o)})
    else:
        return {o for o in space if predicate(o)}

# class ProbDist(dict):
class ProbDist(Counter):
    "A Probability Distribution; an {outcome: probability} mapping where probabilities sum to 1."
    def __init__(self, mapping=(), **kwargs):
        self.update(mapping, **kwargs)
        total = sum(self.values())
        if isinstance(total, int): 
            total = Fraction(total, 1)
        for key in self: # Make probabilities sum to 1.
            self[key] = self[key] / total
            
    def __and__(self, predicate): # Call this method by writing `probdist & predicate`
        "A new ProbDist, restricted to the outcomes of this ProbDist for which the predicate is true."
        return ProbDist({e:self[e] for e in self if predicate(e)})
    
    def __repr__(self):
        s = ""
        for k in self:
            if isinstance(self[k], Fraction):
                s+="{0}: {2}/{3} = {1}\n".format(transcriptionReprHack(k), float(self[k]), self[k].numerator, self[k].denominator)
            else:
                s+="{0}: {1}\n".format(transcriptionReprHack(k), float(self[k]))
        return s

dottedStringToTuple = lambda ds: tuple(ds.split('.'))
tupleToDottedString = lambda t: '.'.join(t)

def transcriptionReprHack(k):
    "Show a tuple of strings as a dotted string; fall back to repr for anything else."
    if isinstance(k, tuple) and all(isinstance(el, str) for el in k):
        return tupleToDottedString(k)
    return repr(k)

def Uniform(outcomes): return ProbDist({e: 1 for e in outcomes})

def joint(A, B):
    """The joint distribution of two independent probability distributions. 
    Result is all entries of the form {(a, b): P(a) * P(b)}"""
    return ProbDist({(a,b): A[a] * B[b]
                    for a in A
                    for b in B})

from itertools import product
from functools import reduce
import operator

def prod(iterable):
    return reduce(operator.mul, iterable, 1)

def union(iterable):
    return reduce(set.union, iterable)

def joint2(iter_of_dists):
    #ProbDist({(a,b): A[a] * B[b] for a in A for b in B})
    #ProbDist({ab: A[ab[0]] * B[ab[1]] for ab in product(A,B)})
    return ProbDist({each : prod(dist[each[i]] for i,dist in enumerate(iter_of_dists)) for each in list(product(*iter_of_dists))})
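
To make the n-way joint concrete, here is a minimal sketch of the same idea using plain dicts instead of `ProbDist`. The helper name `joint_of` and the coin distributions are hypothetical. The sketch copies its argument into a list first because the distributions are traversed more than once, which a one-shot iterator would not survive.

```python
from fractions import Fraction
from functools import reduce
from itertools import product
import operator

def joint_of(dists):
    """Joint distribution of independent distributions given as plain dicts."""
    dists = list(dists)  # materialize: we iterate over the distributions twice
    return {
        outcome: reduce(operator.mul,
                        (d[x] for d, x in zip(dists, outcome)),
                        Fraction(1))
        for outcome in product(*dists)  # product over each dict's keys
    }

# Hypothetical example distributions.
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
biased = {"H": Fraction(2, 3), "T": Fraction(1, 3)}

three = joint_of([coin, biased, coin])
print(three[("H", "H", "T")])
```

The size of the joint grows as the product of the component sizes, so this representation is practical only for a handful of small distributions.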


In [3]:
#bookkeeping - don't worry about this function...
def getBoundaryVal(dist, i):
    #p = {a:1/2, b:1/2} -> 2 items -> 2-1=1 boundaries where 
    #                   the boundary separating 'a' (item 0) from 'b' (item 1) is at 0+p(a)
    #q = {0:1/3, 1:1/3, 2:1/6, 3:1/6} -> 4 items -> 4-1=3 boundaries where 
    #                   the boundary separating 0 from 1   is at 0 + q(0) = 1/3,
    #                   the boundary separating 1 from 2   is at 0 + q(0) + q(1) = 2/3,
    #                   the boundary separating 2 from 3   is at 0 + q(0) + q(1) + q(2) = 5/6,
    #                ...the boundary separating i from i+1 is at \sum_{j=0}^{j=i} q(j)
    outcomes = list(dist.keys())
    if i >= len(outcomes) - 1:
        raise Exception("Boundary i = {0} out of bounds / does not exist for distribution {1} with {2} outcomes.".format(i, dist, len(outcomes)))
    if i == 0:
        return dist[outcomes[0]]
    return dist[outcomes[i]] + getBoundaryVal(dist, i-1)

#bookkeeping - don't worry about this function...
def getSampleOutcomeIndex(randReal, boundariesLeft, currIndex):
#     print("boundariesLeft: {0}".format(boundariesLeft))
#     print("currIndex: {0}".format(currIndex))
    if boundariesLeft == [] or randReal <= boundariesLeft[0]:
        return currIndex
    return getSampleOutcomeIndex(randReal, boundariesLeft[1:], currIndex + 1)


def sampleFrom(dist, num_samples = None):
    """
    Given a distribution (either an {outcome: probability} mapping where the 
    probabilities sum to 1 or an implicit definition of a distribution via a thunk), 
    this returns a single sample from the distribution, unless num_samples is specified, 
    in which case a generator with num_samples samples is returned.
    """
    if num_samples == None:
        if callable(dist):
            return dist()
        elif isinstance(dist, ProbDist):
            outcomes = list(dist.keys())
        #     print("outcomes: {0}".format(outcomes))

            boundaries = [getBoundaryVal(dist, i) for i in range(len(outcomes)-1)]
        #     print("boundaries: {0}".format(boundaries))

            randVal = random.random() #random real from unit interval
        #     print("randval: {0}".format(randVal))

            sampledOutcomeIndex = getSampleOutcomeIndex(randVal, boundaries, 0)
        #     print("sampledOutcomeIndex: {0}".format(sampledOutcomeIndex))
            if not (sampledOutcomeIndex >= 0 and sampledOutcomeIndex < len(outcomes)):
                print('sampledOutcomeIndex: {0}'.format(sampledOutcomeIndex))
                print('len(outcomes): {0}'.format(len(outcomes)))
                if len(outcomes) == 0:
                    print('len(outcomes) == 0! dist:')
                    print(type(dist))
                    print(dist)
            assert(sampledOutcomeIndex >= 0 and sampledOutcomeIndex < len(outcomes))

            sampledOutcome = outcomes[sampledOutcomeIndex]
        #     print("sampledOutcome: {0}".format(sampledOutcome))
            return sampledOutcome
    else:
        if callable(dist):
            return (dist() for each in range(num_samples))
        elif isinstance(dist, ProbDist):
            return (sampleFrom(dist, num_samples = None) for each in range(num_samples))

from collections import Counter

def frequencies(samples):
    return Counter(samples)

def makeSampler(dist):
    """
    Given a ProbDist, returns a thunk that when called, returns one sample from dist.
    """
    return lambda: sampleFrom(dist)
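
The boundary-based sampling above recurses once per outcome, so for distributions with many outcomes it may hit Python's recursion limit and recomputes the boundaries on every draw. A possible iterative alternative, sketched here under the assumption that the distribution is a plain `{outcome: probability}` mapping, precomputes the cumulative boundaries once and uses binary search. The name `make_fast_sampler` and the example distribution are hypothetical.

```python
import random
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

def make_fast_sampler(dist, rng=random):
    """Return a thunk that draws one outcome from dist per call."""
    outcomes = list(dist)
    # boundaries[i] = sum of the probabilities of outcomes[0..i]
    boundaries = list(accumulate(dist[o] for o in outcomes))
    last = len(outcomes) - 1

    def sample():
        r = rng.random()  # random real from [0, 1)
        # First index whose boundary is >= r; the clamp guards against
        # floating-point round-off in the final boundary.
        return outcomes[min(bisect_left(boundaries, r), last)]

    return sample

# Hypothetical example distribution.
dist = {"a": 0.5, "b": 0.25, "c": 0.25}
rng = random.Random(0)
sampler = make_fast_sampler(dist, rng)
counts = Counter(sampler() for _ in range(10000))
print(counts)
```

Each draw then costs a binary search rather than a linear recursive scan.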

In [4]:
leftEdge = '⋊'
rightEdge = '⋉'
edgeSymbols = set([leftEdge, rightEdge])

# Boilerplate code for generating model code given model inputs

In [5]:
# Generating functions and values derived from p(X_0^f)...

import itertools

def getPrefixes(s):
    return set(s[0:i] for i in range(1, len(s)+1))

def getHasPrefix(prefix):
    l = len(prefix)
    hasAsPrefix = lambda full_input_seq: full_input_seq[0:l] == prefix
    return hasAsPrefix
  
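def getHasSuffix(suffix):
    l = len(suffix)
    # Slice from an explicit start index: seq[-0:] returns the whole sequence,
    # so a negative-index slice would mishandle the empty suffix.
    hasAsSuffix = lambda full_input_seq: l <= len(full_input_seq) and full_input_seq[len(full_input_seq) - l:] == suffix
    return hasAsSuffix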

def padInputSequenceWithBoundaries(inputSeq):
    temp = list(inputSeq)
    temp = tuple([leftEdge] + temp + [rightEdge])
    return temp

def trimBoundariesFromSequence(seq):
    temp = list(seq)
    if temp[0] == leftEdge:
        temp = temp[1:]
    if temp[-1] == rightEdge:
        temp = temp[:-1]
    return tuple(temp)

def generateInputModel(inputDist, pad_with_boundaries = None):
    if pad_with_boundaries == None:
        pad_with_boundaries = False
    model = {} #FIXME consider changing to a namedtuple/frozendict (possibly just before returning a value)
    
    if pad_with_boundaries:
        old_map = list(inputDist.items())
        old_inputs = list(map(lambda p: p[0], old_map))
#         old_inputs = list(inputDist.keys())
        old_values = [v for k,v in old_map]
        new_inputs = list(map(padInputSequenceWithBoundaries, old_inputs))
        assert(list(map(trimBoundariesFromSequence, new_inputs)) == old_inputs)
        assert(list(zip(old_inputs, old_values)) == old_map)
        new_map = dict(list(zip(new_inputs, old_values)))
        # Re-wrap as a ProbDist: P(...) and `inputDist & predicate` below need one.
        inputDist = ProbDist(new_map)
    
    model['inputDist'] = inputDist
    model['p(X_0^f)'] = inputDist
    
    inputs = list(inputDist.keys())
    model['inputs'] = inputs
    
    
    inputAlphabet = set(itertools.chain.from_iterable([set(k) for k in inputs ]))
#   print(inputAlphabet)
#     if pad_with_boundaries and (leftEdge not in inputAlphabet):
#         inputAlphabet += [leftEdge]
#     if pad_with_boundaries and (rightEdge not in inputAlphabet):
#         inputAlphabet += [rightEdge]
    model['inputAlphabet'] = inputAlphabet
    

    prefixes = map(getPrefixes, inputs)
    prefixes = set(itertools.chain.from_iterable(prefixes))
#   print("<= 10 prefixes of lexicon:")
#   print(list(prefixes)[0:11])
    model['prefixes'] = prefixes
  
    
    # convenient calculation of p(x_0^i) as pmf, not as 'dist'
    def prefixProb(input_prefix):
        hasAsPrefix = getHasPrefix(input_prefix)
        return P(hasAsPrefix, inputDist)
    model['prefixProb'] = prefixProb
    model['p(x_0^i)'] = prefixProb
    
    
    # definition of p(X_0^f | x_0^i) as dist
    def inputDist_givenPrefix(prefix):
        hasPrefix = getHasPrefix(prefix)
        return inputDist & hasPrefix
    model['inputDist_givenPrefix'] = inputDist_givenPrefix
    model['p(X_0^f | x_0^i)'] = inputDist_givenPrefix
    
    
    # definition of p(x_{i+1}^f | x_0^i) as pmf, not as 'dist'
    def suffixProb(input_suffix, input_prefix):
        #p(X_0^f | x_0^i)
        d = inputDist_givenPrefix(input_prefix)
        isExactlyEqualTo = lambda s: s == tuple(list(input_prefix) + list(input_suffix))
        return P(isExactlyEqualTo, d)
    model['suffixProb'] = suffixProb
    model['p(x_{i+1}^f | x_0^i)'] = suffixProb
    
    return model
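
The prefix-related quantities built by `generateInputModel` can be illustrated on a tiny lexicon. This is a minimal sketch with plain dicts and hypothetical names (`toy_lexicon`, `prefixes_of`, `prefix_prob`, `given_prefix`); it assumes the queried prefix actually occurs in the lexicon.

```python
from fractions import Fraction

# Hypothetical three-word lexicon with a prior over wordforms.
toy_lexicon = {
    ("k", "æ", "t"): Fraction(1, 2),
    ("k", "æ", "b"): Fraction(1, 4),
    ("d", "ɔ", "g"): Fraction(1, 4),
}

def prefixes_of(word):
    """All non-empty prefixes of a wordform, including the word itself."""
    return {word[:i] for i in range(1, len(word) + 1)}

def prefix_prob(prefix, lexicon):
    """p(x_0^i): total prior mass of the words that begin with prefix."""
    n = len(prefix)
    return sum(p for w, p in lexicon.items() if w[:n] == prefix)

def given_prefix(prefix, lexicon):
    """p(X_0^f | x_0^i): the prior restricted to matching words, renormalized.

    Assumes prefix_prob(prefix, lexicon) is nonzero.
    """
    n = len(prefix)
    total = prefix_prob(prefix, lexicon)
    return {w: p / total for w, p in lexicon.items() if w[:n] == prefix}

ka = ("k", "æ")
print(prefixes_of(("k", "æ", "t")))
print(prefix_prob(ka, toy_lexicon))
print(given_prefix(ka, toy_lexicon))
```

The probability of a particular continuation given a prefix is then just the conditional probability of the completed word, which is what `suffixProb` looks up.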

In [85]:
# Generating functions and values derived from p(Y_i|x_i)...


def generateChannelModel(channelOutput_i_dist, inputModel, pad_with_boundaries = None):
    if pad_with_boundaries == None:
        pad_with_boundaries = False
    
    model = {} #FIXME consider changing to a namedtuple/frozendict (possibly just before returning a value)
    
    model['channelOutput_i_dist'] = channelOutput_i_dist
    model['p(Y_i|x_i)'] = channelOutput_i_dist
    
    
    # definition of p(Y_i) as a dist
    sourceInputAlphabet = inputModel['inputAlphabet']
    
    
    getKeys = lambda d: set(d.keys()) #assumes they're hashable
#     getVals = lambda d: set(d.values()) #assumes they're hashable
    channelInputAlphabet = getKeys(channelOutput_i_dist)
    
    getOutputSymbols = lambda cond_channel_dist: getKeys(cond_channel_dist)
    outputSymbolsByInput = map(lambda eachInputSymbol: getOutputSymbols(channelOutput_i_dist[eachInputSymbol]), channelOutput_i_dist)
    channelOutputAlphabet = reduce(set.union, outputSymbolsByInput)
    
    totalAlphabet = reduce(set.union, [channelOutputAlphabet, channelInputAlphabet, sourceInputAlphabet])
    
    
#     channelOutput_i_marginal_dist = ProbDist({y: sum(channelOutput_i_dist(x)[y] for x in inputAlphabet) for y in inputAlphabet})
    channelOutput_i_marginal_dist = ProbDist({y: sum(channelOutput_i_dist[x].get(y, 0.0) for x in channelInputAlphabet) for y in channelOutputAlphabet})
    model['channelOutput_i_marginal_dist'] = channelOutput_i_marginal_dist
    model['p(Y_i)'] = channelOutput_i_marginal_dist
    
    
    # definition of p(y_0^j|x_0^i) as a pmf but not as a 'dist', assuming uniphone noise/channel model
    prefixes = inputModel['prefixes']
    def channelOutput_prefix_prob(inputPrefix, outputPrefix):
        if len(outputPrefix) == len(inputPrefix):
        #         |y_0^j|    ==      |x_0^i|
#             slice_prob = lambda y_i, x_i: P({y_i}, channelOutput_i_dist(x_i))
            slice_prob = lambda y_i, x_i: P({y_i}, channelOutput_i_dist[x_i])
            return prod(slice_prob(y, x) for y,x in zip(outputPrefix, inputPrefix))
        elif len(outputPrefix) < len(inputPrefix): #truncate excess part of inputPrefix
                  #|y_0^j|     <      |x_0^i|
            return channelOutput_prefix_prob(inputPrefix[:len(outputPrefix)], outputPrefix)
        else:
                  #|y_0^j|     >      |x_0^i|
            hasInputPrefixAsPrefix = getHasPrefix(inputPrefix)
            my_prefixes = (pref for pref in prefixes if len(pref) == len(outputPrefix) and hasInputPrefixAsPrefix(pref))
            #         p(y_0^j|x_0^j)                                 p(x_0^j)          1   / p(x_0^i)
            terms = ((channelOutput_prefix_prob(pref, outputPrefix), inputModel['prefixProb'](pref), 0.0 if inputModel['prefixProb'](inputPrefix) == 0.0 else 1.0 / inputModel['prefixProb'](inputPrefix) ) for pref in my_prefixes)
            return sum(map(prod, terms))
    model['p(y_0^j|x_0^i)'] = channelOutput_prefix_prob
    model['channelOutput_prefix_prob'] = channelOutput_prefix_prob


    # definition of sampler from p(Y_0^i|x_0^i), assuming uniphone noise/channel model
    def channelOutput_prefix_sampler(inputPrefix):
#         return tuple([sampleFrom(channelOutput_i_dist(x_i)) for x_i in inputPrefix])
        return tuple([sampleFrom(channelOutput_i_dist[x_i]) for x_i in inputPrefix])
    model['sample from p(Y_0^i|x_0^i)'] = channelOutput_prefix_sampler
    model['channelOutput_prefix_sampler'] = channelOutput_prefix_sampler

    
    # calculation of p(y_0^i) as a pmf but not as a 'dist'
    #   marginalizes over all prefixes of input sequences with length matching y_0^i 
    def channelOutput_prefix_marginal_prob(outputPrefix):

        l = len(outputPrefix)
        my_prefixes = (prefix for prefix in prefixes if len(prefix) == l)

        #\sum_{X_0^i} p(x_0^i) * p(y_0^i|x_0^i)
        terms = ((inputModel['prefixProb'](input_prefix), channelOutput_prefix_prob(input_prefix, outputPrefix)) for input_prefix in my_prefixes)
        probs = map(prod, terms)
    #     probs = (prefixProb(input_prefix) * channelOutput_prefix_prob(input_prefix, outputPrefix) for input_prefix in my_prefixes)
    #     probs = [prefixProb(input_prefix) * channelOutput_prefix_prob(input_prefix, outputPrefix) for input_prefix in my_prefixes] #only for testing; use generator exp otherwise
    #     print('terms of \sum_{X_0^i} p(x_0^i) * p(y_0^i|x_0^i):')
    #     print(probs)
        return sum(probs)
    model['p(y_0^i)'] = channelOutput_prefix_marginal_prob
    model['channelOutput_prefix_marginal_prob'] = channelOutput_prefix_marginal_prob
    
    
    
    return model
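
One sanity check on the marginal computed by `channelOutput_prefix_marginal_prob` is that it sums to 1 over all output sequences of a given length. The sketch below checks this on a toy case. The prefix probabilities, the channel, and the function names are hypothetical and written with plain dicts rather than the notebook's model dictionary.

```python
from fractions import Fraction
from itertools import product

# Hypothetical p(x_0^1): probabilities of the length-2 input prefixes.
prefix_probs = {("a", "b"): Fraction(3, 4), ("b", "b"): Fraction(1, 4)}

# Hypothetical uniphone channel p(y_k | x_k).
channel = {
    "a": {"a": Fraction(9, 10), "b": Fraction(1, 10)},
    "b": {"a": Fraction(1, 10), "b": Fraction(9, 10)},
}

def output_given_input(y, x):
    """p(y_0^i | x_0^i) for equal-length sequences, one factor per position."""
    p = Fraction(1)
    for y_k, x_k in zip(y, x):
        p *= channel[x_k].get(y_k, Fraction(0))
    return p

def output_marginal(y):
    """p(y_0^i) = sum over input prefixes of p(x_0^i) * p(y_0^i | x_0^i)."""
    return sum(p_x * output_given_input(y, x)
               for x, p_x in prefix_probs.items() if len(x) == len(y))

output_alphabet = sorted({y for row in channel.values() for y in row})
all_outputs = list(product(output_alphabet, repeat=2))
total = sum(output_marginal(y) for y in all_outputs)
print(output_marginal(("a", "b")), total)
```

If the total differs from 1, either the prefix probabilities or some row of the channel distribution is not normalized.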

In [86]:
from math import exp

#Hoeffding's inequality w.r.t. mc estimate p-hat(x-hat_0^j|x_0^i):
# Let  X = p-hat(x-hat_0^j|x_0^i)
#      𝛍 = p(x-hat_0^j|x_0^i)
#      n = the number of samples that go into calculating X
# then, assuming each sampled term lies in [0, 1],
#      p(X - 𝛍 ≥ 𝛆) ≤ exp(-2n𝛆²)        (one-sided)
#      p(|X - 𝛍| ≥ 𝛆) ≤ 2*exp(-2n𝛆²)    (two-sided)
# hoeffdingProb below returns the one-sided bound; the examples use it too.
#
#  E.g. for n = 1000, 𝛆 = 0.01
#      p(X - 𝛍 ≥ 𝛆) ≤ ≈0.82
#
#       for n = 1000, 𝛆 = 0.03
#     p(X - 𝛍 ≥ 𝛆) ≤ ≈0.165
#
#       for n = 1000, 𝛆 = 0.05
#     p(X - 𝛍 ≥ 𝛆) ≤ ≈0.0068
#
#       for n = 1000, 𝛆 = 0.1
#     p(X - 𝛍 ≥ 𝛆) ≤ ≈2.06*10^-9
def hoeffdingProb(epsilon, num_samples):
    upperBound = exp(-2.0 * num_samples * epsilon * epsilon)
    return upperBound

def generateTotalModel(inputDist, channelOutput_i_dist, pad_with_boundaries = None):
    if pad_with_boundaries == None:
        pad_with_boundaries = False
    
    inputModel = generateInputModel(inputDist, pad_with_boundaries = pad_with_boundaries)
    channelModel = generateChannelModel(channelOutput_i_dist, inputModel, pad_with_boundaries = pad_with_boundaries)
    model = {**inputModel, **channelModel} #FIXME consider changing to a namedtuple/frozendict (possibly just before returning a value)

    # receiver/listener model
    #  = mc estimate of p(x-hat_0^j|x_0^i) as pmf, not as 'dist', assuming uniphone error probs
    #                         x_0^i             x-hat_0^j
    def est_channelInput_prob(true_inputPrefix, poss_inputPrefix, num_samples = None):
        if num_samples == None:
            num_samples = 1000

    #     print('true x_0^i: {0}'.format(true_inputPrefix))
    #     print('x-hat_0^j: {0}'.format(poss_inputPrefix))
    #     print('num samples: {0}'.format(num_samples))

        lenTruePrefix = len(true_inputPrefix) #|x_0^i|
        lenPossPrefix = len(poss_inputPrefix) #|x-hat_0^j|

        #if |x-hat_0^j|  < |x_0^i|, then calculate \sum_\limits{x-hat_0^i | x-hat_0^j is a prefix} p(x-hat_0^i|x_0^i)
        if lenPossPrefix < lenTruePrefix:
            hasPossPrefixAsPrefix = getHasPrefix(poss_inputPrefix)
            my_prefixes = (pref for pref in model['prefixes'] if len(pref) == lenTruePrefix and hasPossPrefixAsPrefix(pref))
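            # Pass num_samples through so the recursive calls honor the caller's setting.
            probs = (est_channelInput_prob(true_inputPrefix, pref, num_samples = num_samples) for pref in my_prefixes)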
            return sum(probs)

        #if |x-hat_0^j| ≥ |x_0^i|, then proceed 'normally'...

        #samples from p(Y_0^i|x_0^i)
        samples = sampleFrom(lambda: model['channelOutput_prefix_sampler'](true_inputPrefix), num_samples)
    #     samples = list(sampleFrom(lambda: channelOutput_prefix_sampler(true_inputPrefix), num_samples)) # only for debugging - otherwise use generator exp
    #     print('samples from p(Y_0^i|x_0^i):')
    #     print(samples)
    #     likelihoods = [channelOutput_prefix_prob(poss_inputPrefix, outputPrefix) for outputPrefix in samples]
    #     print('likelihoods p(y_0^i|x-hat_0^i):')
    #     print(likelihoods)
    #     priorOfOutputs = [channelOutput_prefix_marginal_prob(outputPrefix) for outputPrefix in samples]
    #     print('priorOfOutputs p(y_0^i):')
    #     print(priorOfOutputs)

        #p(x-hat_0^j)
        poss_inputPrefix_prob = model['prefixProb'](poss_inputPrefix)
    #     print('p(x-hat_0^j) = {0}'.format( poss_inputPrefix_prob ))

        #terms in \sum_{samples y_0^i from p(Y_0^i|x_0^i)} p(y_0^i|x-hat_0^i) * p(x-hat_0^j) / p(y_0^i)
#         terms = (model['channelOutput_prefix_prob'](poss_inputPrefix, outputPrefix)  * poss_inputPrefix_prob / model['channelOutput_prefix_marginal_prob'](outputPrefix) for outputPrefix in samples)
        terms = ((model['channelOutput_prefix_prob'](poss_inputPrefix, outputPrefix), 0.0 if model['channelOutput_prefix_marginal_prob'](outputPrefix) == 0.0 else poss_inputPrefix_prob / model['channelOutput_prefix_marginal_prob'](outputPrefix) ) for outputPrefix in samples)
    #     for outputPrefix in samples:
    #       print('y_0^i: {0}'.format(outputPrefix))
    #       print('term: {0} * {1} / {2}'.format(channelOutput_prefix_prob(poss_inputPrefix, outputPrefix), poss_inputPrefix_prob, channelOutput_prefix_marginal_prob(outputPrefix)))
    #     terms = [channelOutput_prefix_prob(poss_inputPrefix, outputPrefix)  * poss_inputPrefix_prob / channelOutput_prefix_marginal_prob(outputPrefix) for outputPrefix in samples] #use only for debugging, otherwise use generator exp
    #     print('terms in sum:')
    #     print(terms)
#         est = sum(terms) / num_samples
        est = sum(map(prod,terms)) / num_samples

        return est
    model['p-hat(x-hat_0^j|x_0^i)'] = est_channelInput_prob
    model['est_channelInput_prob'] = est_channelInput_prob
    
    return model
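
The following sketch shows, on a one-segment toy case, how a Monte Carlo estimate of this kind behaves and how the one-sided Hoeffding bound computed by `hoeffdingProb` can be inverted to choose a sample count. Everything here (`samples_needed`, the toy prior and channel, the seed) is hypothetical and independent of the notebook's model dictionary.

```python
import math
import random

def samples_needed(epsilon, delta):
    """Smallest n such that exp(-2 * n * epsilon**2) <= delta."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# Hypothetical one-segment prior and uniphone channel.
prior = {"a": 0.75, "b": 0.25}
channel = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.1, "b": 0.9}}
rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_output(x):
    """Draw one perceived segment y from p(Y | x)."""
    outs = list(channel[x])
    weights = [channel[x][y] for y in outs]
    return rng.choices(outs, weights=weights)[0]

def posterior(x_hat, y):
    """p(x_hat | y) by Bayes' rule."""
    evidence = sum(channel[x][y] * prior[x] for x in prior)
    return channel[x_hat][y] * prior[x_hat] / evidence

def estimate(true_x, x_hat, n):
    """Monte Carlo estimate of E_{y ~ p(Y|true_x)}[ p(x_hat | y) ]."""
    return sum(posterior(x_hat, sample_output(true_x)) for _ in range(n)) / n

n = samples_needed(0.05, 0.01)
est = estimate("a", "a", n)
exact = 0.9 * posterior("a", "a") + 0.1 * posterior("a", "b")
print(n, est, exact)
```

Because the bound is distribution-free, it tends to be conservative: the estimate is often closer to the exact value than the bound alone guarantees.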


# Getting data

In [8]:
%pwd

'/Users/ericmeinhardt/Google Drive/Ideas/HH/recoverability/c2-jn'

In [9]:
%ls *IPhOD*

IPhOD-aligned gate-3 diphone channel distribution.json
IPhOD-aligned gate-6 diphone channel distribution.json
IPhOD-aligned gate3 uniphone channel distribution.json
IPhOD-aligned gate6 uniphone channel distribution.json
IPhOD-aligned response diphones.txt
IPhOD-aligned stimuli diphones.txt
IPhOD-aligned stimuli uniphones.txt
IPhOD-aligned trials just-gate3.csv
IPhOD-aligned trials just-gate6.csv
IPhOD-aligned uniphone channel distribution.json
IPhOD2_Words_IPA_prob_caughtCotMerged_schwa_phonWordToProb.json
IPhOD_aligned_trials.csv
IPhOD_noStress_lowercase_IPA_ALLDUPES.json
IPhOD_noStress_lowercase_IPA_noDupes_consolidated_UTFesc.json
IPhOD_noStress_lowercase_IPA_noDupes_consolidated_noUTFesc.json
IPhOD_processed.json
IPhOD_processed_UTFesc.json
IPhOD_processed_noUTFesc.json
IPhODv2.0_REALS.zip
illegal triphones from IPhOD-aligned response diphones.txt
illegal triphones from IPhOD-aligned stimuli diphones.txt
triphones from IPhOD-aligned response diphones.txt
triph

In [10]:
%ls Hammond*

Hammond's mysterious newdic.txt
Hammond-aligned gate-3 diphone channel distribution.json
Hammond-aligned gate-6 diphone channel distribution.json
Hammond-aligned gate3 uniphone channel distribution.json
Hammond-aligned gate6 uniphone channel distribution.json
Hammond-aligned response diphones.txt
Hammond-aligned stimuli diphones.txt
Hammond-aligned stimuli uniphones.txt
Hammond-aligned trials just-gate3.csv
Hammond-aligned trials just-gate6.csv
Hammond-aligned uniphone channel distribution.json
Hammond_aligned_trials.csv
Hammond_newdic_IPA.csv
Hammond_newdic_IPA_aligned.csv
Hammond_newdic_IPA_aligned_phonWordToNlprob.json
Hammond_newdic_IPA_aligned_phonWordToProb.json


In [11]:
uniphone_dist_fn_IPhOD = 'IPhOD-aligned uniphone channel distribution.json'
uniphone_dist_fn_hammond = 'Hammond-aligned uniphone channel distribution.json'
uniphone_dist_fn = ''

In [12]:
lexicon_dist_fn_IPhOD = 'IPhOD2_Words_IPA_prob_caughtCotMerged_schwa_phonWordToProb.json'
lexicon_dist_fn_hammond = 'Hammond_newdic_IPA_aligned_phonWordToProb.json'
lexicon_dist_fn = ''

In [13]:
which = 'IPhOD'
# which = 'hammond'

In [14]:
if which == 'IPhOD':
    uniphone_dist_fn = uniphone_dist_fn_IPhOD
    lexicon_dist_fn = lexicon_dist_fn_IPhOD
else:
    uniphone_dist_fn = uniphone_dist_fn_hammond
    lexicon_dist_fn = lexicon_dist_fn_hammond

In [15]:
import json

In [16]:
with open(uniphone_dist_fn, encoding='utf-8') as data_file:
    uniphone_dist_in = json.load(data_file)

In [17]:
test_k = list(uniphone_dist_in.keys())[0]
test_k
uniphone_dist_in[test_k]

'p'

{'aʊ': 0.0006578947368421052,
 'b': 0.006048976608187134,
 'd': 0.00034722222222222224,
 'dʒ': 0.0006944444444444445,
 'eɪ': 0.001388888888888889,
 'f': 0.004422514619883041,
 'g': 0.0010416666666666669,
 'h': 0.01822002923976608,
 'i': 0.00034722222222222224,
 'k': 0.030957602339181284,
 'l': 0.004093567251461988,
 'm': 0.0010051169590643274,
 'n': 0.0003289473684210526,
 'oʊ': 0.0010233918128654971,
 'p': 0.8784173976608189,
 'r': 0.00034722222222222224,
 's': 0.0006944444444444445,
 't': 0.003654970760233918,
 'tʃ': 0.0006944444444444445,
 'u': 0.0003289473684210526,
 'v': 0.0030884502923976607,
 'w': 0.0006944444444444445,
 'ŋ': 0.0006944444444444445,
 'ɑ': 0.0013340643274853802,
 'ɚ': 0.00034722222222222224,
 'ɛ': 0.003801169590643275,
 'ɪ': 0.002741228070175439,
 'ʃ': 0.0013523391812865497,
 'ʌ': 0.02814327485380117,
 'θ': 0.0030884502923976603}
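
Before using a loaded channel distribution, it can help to confirm that each conditional distribution is non-negative and sums to 1 within floating-point tolerance. The check below is a minimal sketch: `check_channel` and the toy channel are hypothetical, and the tolerance is an arbitrary choice.

```python
import math

def check_channel(channel, tol=1e-9):
    """Return the input symbols whose conditional distribution is malformed."""
    bad = []
    for x, row in channel.items():
        has_negative = any(p < 0 for p in row.values())
        sums_to_one = math.isclose(sum(row.values()), 1.0, abs_tol=tol)
        if has_negative or not sums_to_one:
            bad.append(x)
    return bad

# Hypothetical toy channel: the row for 'b' is deliberately missing mass.
toy_channel = {
    "p": {"p": 0.9, "b": 0.1},
    "b": {"p": 0.2, "b": 0.7},
}

bad_rows = check_channel(toy_channel)
print(bad_rows)
```

Applied to the loaded data, the call would be `check_channel(uniphone_dist_in)`, with an empty list indicating that every row passed.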

In [18]:
uniphone_dist = uniphone_dist_in

In [19]:
# %cd ..
%cd IPhODv2.0_REALS 

/Users/ericmeinhardt/Google Drive/Ideas/HH/recoverability/c2-jn/IPhODv2.0_REALS


In [20]:
with open(lexicon_dist_fn, encoding='utf-8') as data_file:
    lexicon_in = json.load(data_file)

In [21]:
%cd ..

/Users/ericmeinhardt/Google Drive/Ideas/HH/recoverability/c2-jn


In [22]:
test_k = list(lexicon_in.keys())[0]
test_k
lexicon_in[test_k]

'⋊.ʃ.ʊ.r.ə.t.i.⋉'

1.4555929067625541e-07

In [23]:
lexicon = {dottedStringToTuple(k):lexicon_in[k] for k in lexicon_in}
dottedStringToTuple(test_k)
lexicon[dottedStringToTuple(test_k)]

('⋊', 'ʃ', 'ʊ', 'r', 'ə', 't', 'i', '⋉')

1.4555929067625541e-07
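
The dotted-string keys are converted to tuples because a segment can span more than one character (for example a diphthong such as `eɪ`), so indexing into a plain string would not index segments. A minimal sketch of the round trip, with hypothetical function names and a hypothetical transcription:

```python
def dotted_to_tuple(dotted):
    """'⋊.k.eɪ.k.⋉' -> ('⋊', 'k', 'eɪ', 'k', '⋉')"""
    return tuple(dotted.split("."))

def tuple_to_dotted(segments):
    """Inverse of dotted_to_tuple for segments that contain no '.'."""
    return ".".join(segments)

word = "⋊.k.eɪ.k.⋉"  # hypothetical entry, with edge symbols
segments = dotted_to_tuple(word)

print(segments)
print(len(segments))           # number of segments, edge symbols included
print(len("".join(segments)))  # number of characters, which is larger
```

Prefix slicing such as `segments[:3]` then always cuts at segment boundaries.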

## Converting data to the required probability distributions

In [24]:
# p(X_0^f)
inputDist = ProbDist(lexicon)

phonWords = set(inputDist.keys())
someWords = list(phonWords)[:10]
{w:inputDist[w] for w in someWords}

{('⋊',
  'd',
  'j',
  'u',
  'p',
  'l',
  'ə',
  'k',
  'eɪ',
  't',
  'ɪ',
  'd',
  '⋉'): 4.262807798377089e-07,
 ('⋊',
  'f',
  'ɑ',
  'r',
  't',
  'ə',
  'f',
  'ə',
  'k',
  'eɪ',
  'ʃ',
  'ə',
  'n',
  '⋉'): 1.0397092191163634e-07,
 ('⋊', 'h', 'ɑ', 'ŋ', 'k', 'ɪ', 'ŋ', '⋉'): 3.1607160261137446e-06,
 ('⋊', 'k', 'æ', 'b', 'n', 'ə', 't', '⋉'): 4.330388897619654e-06,
 ('⋊',
  'm',
  'ə',
  't',
  'ɪ',
  'r',
  'i',
  'ə',
  'l',
  'aɪ',
  'z',
  'ɪ',
  'z',
  '⋉'): 6.23825531469818e-08,
 ('⋊', 'r', 'æ', 'd', 'm', 'ə', 'n', '⋉'): 2.079418438232727e-08,
 ('⋊',
  's',
  'ɛ',
  'k',
  'ə',
  'n',
  'd',
  'ɛ',
  'r',
  'i',
  '⋉'): 4.262807798377089e-06,
 ('⋊',
  's',
  'ʌ',
  'f',
  'ə',
  'k',
  'eɪ',
  'ʃ',
  'ə',
  'n',
  '⋉'): 7.797819143372725e-07,
 ('⋊', 'æ', 'b', 'ə', 'l', 'oʊ', 'n', 'i', '⋉'): 5.302517017493452e-07,
 ('⋊', 'ə', 'n', 't', 'æ', 'ŋ', 'l', 'ɪ', 'ŋ', '⋉'): 1.0397092191163634e-07}

### Explorations / asides / sanity checks

 - How many words in the lexicon have probability 0? (The answer should be that zero words do.)

In [25]:
zeroWords = {k for k in phonWords if inputDist[k] == 0}
len(zeroWords)

0

 - How many words are there of each length (in phonemes, i.e. not counting word edge symbols)?

In [27]:
def getLength(word):
    '''
    Returns the length of a word in a lexicon distribution, not counting the two edge symbols.
    '''
    return len(word) - 2

len(('⋊', 'æ', 'b', 'ə', 'l', 'oʊ', 'n', 'i', '⋉'))
getLength(('⋊', 'æ', 'b', 'ə', 'l', 'oʊ', 'n', 'i', '⋉'))

9

7

In [57]:
myWords = list(inputDist.keys())

In [58]:
lengths = list(map(getLength, myWords))

In [59]:
from statistics import mean, median
mean(lengths)
median(lengths)

6.7828168794767985

7.0

In [67]:
lengthCounter = Counter(lengths)
lengthCounter.keys()
lengthCounter.most_common()

dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 28])

[(6, 9137),
 (5, 8185),
 (7, 8153),
 (8, 6641),
 (4, 5556),
 (9, 4605),
 (10, 2940),
 (3, 2310),
 (11, 1783),
 (12, 1027),
 (13, 464),
 (2, 291),
 (14, 166),
 (15, 65),
 (16, 29),
 (1, 13),
 (17, 8),
 (18, 1),
 (20, 1),
 (28, 1)]

In [61]:
lengths_with_more_than_1000_words = {k for k in lengthCounter.keys() if lengthCounter[k] > 1000}
len(lengths_with_more_than_1000_words)
min(lengths_with_more_than_1000_words)
max(lengths_with_more_than_1000_words)
all(l >= min(lengths_with_more_than_1000_words) and l <= max(lengths_with_more_than_1000_words) for l in lengths_with_more_than_1000_words)
sum(lengthCounter[k] for k in lengths_with_more_than_1000_words)
sum(lengthCounter[k] for k in lengthCounter)

10

3

12

True

50337

51376

 - 98% of the pronunciations in IPhOD are 3-12 phonemes long.
 - Just about 50% of the pronunciations in IPhOD are either 5, 6, or 7 phonemes long.
 - About 16% of the pronunciations in IPhOD are 4 phonemes long or shorter.
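The percentages in this list can be recomputed from the length counts. The sketch below hard-codes the counts printed by `lengthCounter.most_common()` above; `share_in_range` is a hypothetical helper name introduced only for this illustration.

```python
from collections import Counter

# Word-type counts by length in phonemes, copied from the Counter output above.
length_counts = Counter({6: 9137, 5: 8185, 7: 8153, 8: 6641, 4: 5556, 9: 4605,
                         10: 2940, 3: 2310, 11: 1783, 12: 1027, 13: 464, 2: 291,
                         14: 166, 15: 65, 16: 29, 1: 13, 17: 8, 18: 1, 20: 1, 28: 1})

def share_in_range(counts, low, high):
    """Fraction of all items whose length lies in the closed interval [low, high]."""
    total = sum(counts.values())
    in_range = sum(n for length, n in counts.items() if low <= length <= high)
    return in_range / total

for low, high in [(3, 12), (5, 7), (1, 4)]:
    share = share_in_range(length_counts, low, high)
    print('{0}-{1} phonemes: {2:.1%}'.format(low, high, share))
```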

#FIXME plotnine plot of length of word type vs. count of word types with that length...

In [74]:
word_lengths = list(range(min(lengths) + 2, max(lengths) + 2 + 1))
len(word_lengths)
words_by_length = {l:[w for w in myWords if len(w) == l] for l in word_lengths}

28

In [79]:
words_by_length[22]

[('⋊',
  'd',
  'i',
  'ɪ',
  'n',
  's',
  't',
  'ɪ',
  't',
  'u',
  'ʃ',
  'ə',
  'n',
  'ə',
  'l',
  'ə',
  'z',
  'eɪ',
  'ʃ',
  'ə',
  'n',
  '⋉')]

 - How many unique prefixes of each length are there (not counting edge symbols, again)?

In [62]:
my_prefixes = list(map(getPrefixes, myWords))
my_prefixes = set(itertools.chain.from_iterable(my_prefixes))
len(myWords)
len(my_prefixes)

51376

162198
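`getPrefixes` is defined elsewhere in the notebook. Judging from its output later in this section, it appears to return every non-empty initial slice of a word, from the bare left edge symbol up to the full wordform. A minimal sketch under that assumption (`get_prefixes_sketch` is a hypothetical name, and the two toy words are made up):

```python
import itertools

def get_prefixes_sketch(word):
    """All non-empty initial slices of `word`, shortest first."""
    return [word[:end] for end in range(1, len(word) + 1)]

toy_words = [('⋊', 'k', 'æ', 't', '⋉'), ('⋊', 'k', 'æ', 'b', '⋉')]

# Pool the prefixes of every word and keep each distinct prefix once.
unique_prefixes = set(itertools.chain.from_iterable(map(get_prefixes_sketch, toy_words)))
print(len(unique_prefixes))
for prefix in sorted(unique_prefixes, key=len):
    print(prefix)
```

Because words share initial material, the number of unique prefixes grows more slowly than the total number of segments in the lexicon.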

In [63]:
def getPrefixLength(prefix):
    '''
    Returns the length of a prefix in phonemes, not counting any edge symbols.
    '''
    adjustment = 0
    if prefix[0] == leftEdge:
        adjustment -= 1
    if prefix[-1] == rightEdge:
        adjustment -= 1
    return len(prefix) + adjustment
len(('⋊', 'æ', 'b', 'ə', 'l', 'oʊ', 'n', 'i', '⋉'))
getPrefixLength(('⋊', 'æ', 'b', 'ə', 'l', 'oʊ', 'n', 'i', '⋉'))
len(('⋊', 'æ', 'b', 'ə', 'l', 'oʊ', 'n', 'i'))
getPrefixLength(('⋊', 'æ', 'b', 'ə', 'l', 'oʊ', 'n', 'i'))

9

7

8

7

In [64]:
prefixLengths = list(map(getPrefixLength, my_prefixes))
mean(prefixLengths)
median(prefixLengths)

6.562713473655656

6.0

In [66]:
prefix_lengthCounter = Counter(prefixLengths)
prefix_lengthCounter.keys()
prefix_lengthCounter.most_common()
sum(prefix_lengthCounter.values())

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28])

[(6, 29673),
 (5, 29127),
 (7, 25047),
 (4, 20596),
 (8, 19251),
 (9, 13023),
 (10, 8076),
 (3, 7368),
 (11, 4726),
 (12, 2546),
 (13, 1120),
 (2, 913),
 (14, 411),
 (15, 164),
 (16, 68),
 (1, 51),
 (17, 19),
 (18, 4),
 (20, 3),
 (19, 2),
 (28, 2),
 (0, 1),
 (21, 1),
 (22, 1),
 (23, 1),
 (24, 1),
 (25, 1),
 (26, 1),
 (27, 1)]

162198

 - ≈94% of all unique prefixes in IPhOD are 3-10 phonemes long.
 - ≈84% of all unique prefixes in IPhOD are 4-9 phonemes long.
 - ≈52% of all unique prefixes in IPhOD are 5-7 phonemes long.
 - ≈18% of all unique prefixes (= prefix types) in IPhOD are 1-4 phonemes long. 

In [80]:
prefix_lengths = list(range(min(prefixLengths), max(lengths) + 2 + 1))
len(prefix_lengths)
prefixes_by_length = {l:[p for p in my_prefixes if len(p) == l] for l in prefix_lengths}
prefixes_by_length[24]

31

[('⋊',
  'æ',
  'n',
  't',
  'aɪ',
  'd',
  'ɪ',
  's',
  'ə',
  's',
  't',
  'æ',
  'b',
  'l',
  'ɪ',
  'ʃ',
  'm',
  'ə',
  'n',
  't',
  'ɛ',
  'r',
  'i',
  'ə')]

 - How many neighbors does each unique full word and each unique prefix have?

In [81]:
def num_shared_indices(wordA, wordB):
    '''
    Given two sequences, this function returns the number of indices where wordA and wordB have the same values. 
    '''
    #NB if one word is longer than the other, zip truncates the extra values in the longer word - which is just fine.
    equalVals = [pairedVals[0] == pairedVals[1] for pairedVals in zip(wordA, wordB)]
    return sum(equalVals)

def hammingDistance(wordA, wordB):
    '''
    Returns the Hamming distance between two sequences, where the Hamming distance between two words of unequal length is stipulated as undefined, which is to say 'None'.
    '''
    if len(wordA) != len(wordB):
        return None
    differentIndices = len(wordA) - num_shared_indices(wordA, wordB)
    return differentIndices

def isNeighbor(targetWord, possibleNeighbor, distanceThreshold = 1):
    distance = hammingDistance(targetWord, possibleNeighbor)
    if distance is None:
        return False
    return distance <= distanceThreshold

def getNeighbors(targetWord, words, distanceThreshold = 1):
#     sameLengthWords = [w for w in words if len(w) == len(targetWord)]
#     return [w for w in sameLengthWords if isNeighbor(targetWord, w, distanceThreshold) and targetWord != w]
    return [w for w in words if isNeighbor(targetWord, w, distanceThreshold) and targetWord != w]

def numNeighbors(targetWord, words, distanceThreshold = 1):
    return len(getNeighbors(targetWord, words, distanceThreshold))

In [82]:
fullWord_neighborCounts = [numNeighbors(w, words_by_length[len(w)]) for w in myWords]
mean(fullWord_neighborCounts)
median(fullWord_neighborCounts)
fullWord_neighborCounter = Counter(fullWord_neighborCounts)
fullWord_neighborCounter.keys()
fullWord_neighborCounter.most_common()
sum(fullWord_neighborCounter.values())

3.3258330738087825

1.0

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43])

[(0, 21456),
 (1, 10445),
 (2, 4535),
 (3, 2540),
 (4, 1718),
 (5, 1344),
 (6, 1156),
 (7, 918),
 (8, 784),
 (9, 696),
 (10, 631),
 (11, 524),
 (12, 508),
 (13, 380),
 (15, 344),
 (14, 337),
 (16, 312),
 (17, 273),
 (18, 257),
 (19, 223),
 (20, 187),
 (21, 180),
 (22, 170),
 (24, 153),
 (23, 148),
 (25, 137),
 (27, 129),
 (28, 128),
 (26, 124),
 (29, 119),
 (32, 101),
 (30, 97),
 (31, 76),
 (33, 66),
 (34, 36),
 (35, 35),
 (37, 29),
 (36, 26),
 (38, 21),
 (39, 16),
 (40, 6),
 (41, 6),
 (43, 4),
 (42, 1)]

51376

In [83]:
prefix_neighborCounts = [numNeighbors(p, prefixes_by_length[len(p)]) for p in my_prefixes]
mean(prefix_neighborCounts)
median(prefix_neighborCounts)
prefix_neighborCounter = Counter(prefix_neighborCounts)
prefix_neighborCounter.keys()
prefix_neighborCounter.most_common()
sum(prefix_neighborCounter.values())

5.020542793375998

2.0

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61])

[(0, 45698),
 (1, 33872),
 (2, 18003),
 (3, 11104),
 (4, 7668),
 (5, 5686),
 (6, 4440),
 (7, 3531),
 (8, 2943),
 (9, 2542),
 (10, 2269),
 (11, 2006),
 (12, 1924),
 (13, 1685),
 (14, 1502),
 (15, 1344),
 (16, 1218),
 (17, 1093),
 (18, 1002),
 (19, 1001),
 (20, 862),
 (22, 813),
 (21, 800),
 (23, 733),
 (24, 657),
 (26, 607),
 (25, 587),
 (27, 554),
 (28, 491),
 (31, 447),
 (29, 433),
 (30, 409),
 (33, 383),
 (32, 372),
 (35, 346),
 (37, 324),
 (34, 307),
 (36, 291),
 (38, 284),
 (40, 259),
 (41, 251),
 (39, 238),
 (42, 207),
 (43, 192),
 (44, 181),
 (45, 134),
 (46, 127),
 (47, 117),
 (48, 63),
 (49, 49),
 (50, 39),
 (51, 32),
 (52, 30),
 (53, 11),
 (55, 11),
 (56, 8),
 (54, 7),
 (57, 4),
 (59, 3),
 (60, 2),
 (58, 1),
 (61, 1)]

162198
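The neighbor counts above compare each item against every other item of the same length, which is quadratic in the size of each length group. As a hedged alternative sketch (not the notebook's method), items can be bucketed by "the sequence with one position masked out". Two distinct same-length sequences fall into a common bucket exactly when they differ at a single position, i.e. when their Hamming distance is 1. `neighbor_counts_sketch` and the toy words are hypothetical.

```python
from collections import defaultdict

def neighbor_counts_sketch(items):
    """Count, for each sequence, the same-length sequences that differ from it
    at exactly one position (Hamming distance 1)."""
    items = set(items)
    buckets = defaultdict(set)
    for item in items:
        for i in range(len(item)):
            # Key = the sequence with position i masked out. The key fixes the
            # total length, so only same-length sequences can share a bucket.
            buckets[(i, item[:i], item[i + 1:])].add(item)
    counts = {}
    for item in items:
        neighbors = set()
        for i in range(len(item)):
            neighbors |= buckets[(i, item[:i], item[i + 1:])]
        neighbors.discard(item)  # a sequence is not its own neighbor
        counts[item] = len(neighbors)
    return counts

toy_words = [('⋊', 'k', 'æ', 't', '⋉'), ('⋊', 'b', 'æ', 't', '⋉'),
             ('⋊', 'k', 'æ', 'b', '⋉'), ('⋊', 'd', 'ɔ', 'g', '⋉')]
toy_counts = neighbor_counts_sketch(toy_words)
for word in toy_words:
    print('.'.join(word), toy_counts[word])
```

This only covers a distance threshold of 1, which is the default used in the cells above. Larger thresholds would need a different scheme.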

### Testing

Let's trim the lexicon down to 10 words so that results stay easy to inspect throughout the tests in this section.

In [26]:
#let's trim the lexicon down to 10 words for the sake of inspectability
lexicon_actual = inputDist
inputDist = ProbDist({w:inputDist[w] for w in someWords})
inputDist

ProbDist({('⋊',
           'b',
           'oʊ',
           'l',
           'ʃ',
           'ə',
           'v',
           'ɪ',
           'k',
           's',
           '⋉'): 0.006997455470737914,
          ('⋊', 'k', 'oʊ', 'k', 'oʊ', '⋉'): 0.31075063613231546,
          ('⋊',
           'k',
           'ɑ',
           'r',
           'b',
           'j',
           'ɚ',
           'i',
           'ʃ',
           'ə',
           'n',
           '⋉'): 0.0006361323155216284,
          ('⋊',
           'k',
           'ə',
           'm',
           'p',
           'r',
           'ɛ',
           's',
           't',
           '⋉'): 0.028625954198473278,
          ('⋊',
           'n',
           'æ',
           'n',
           'oʊ',
           's',
           'ɛ',
           'k',
           'ə',
           'n',
           'd',
           '⋉'): 0.004452926208651399,
          ('⋊', 'p', 'r', 'ɑ', 's', 'ɛ', 's', 't', '⋉'): 0.07983460559796436,
          ('⋊', 's', 'k', 'i', 'm', 'z', '

In [27]:
print(inputDist)
q = list(inputDist.keys())[0]
q
type(q)
inputDist[q]
type(inputDist[q])

⋊.s.k.i.m.z.⋉: 0.05597964376590331
⋊.θ.r.ʌ.s.s.⋉: 0.0025445292620865138
⋊.s.ɪ.ŋ.ɚ.⋉: 0.4990458015267174
⋊.k.ɑ.r.b.j.ɚ.i.ʃ.ə.n.⋉: 0.0006361323155216284
⋊.p.r.ɑ.s.ɛ.s.t.⋉: 0.07983460559796436
⋊.b.oʊ.l.ʃ.ə.v.ɪ.k.s.⋉: 0.006997455470737914
⋊.k.ə.m.p.r.ɛ.s.t.⋉: 0.028625954198473278
⋊.n.æ.n.oʊ.s.ɛ.k.ə.n.d.⋉: 0.004452926208651399
⋊.k.oʊ.k.oʊ.⋉: 0.31075063613231546
⋊.ɪ.n.ɑ.k.j.ə.l.eɪ.t.⋉: 0.011132315521628498



('⋊', 's', 'k', 'i', 'm', 'z', '⋉')

tuple

0.05597964376590331

float

In [28]:
test_k = list(uniphone_dist_in.keys())[0]
test_k

uniphone_dist_in.update({rightEdge:{rightEdge: 1.0}})
uniphone_dist_in.update({leftEdge:{leftEdge: 1.0}})

#P(Y_i|X_i) as a dict mapping to ProbDists
channelOutput_i_dist = {phone:ProbDist(uniphone_dist[phone]) for phone in uniphone_dist}
channelOutput_i_dist[test_k]

'aɪ'

ProbDist({'aɪ': 0.8234302484302483,
          'aʊ': 0.0033783783783783825,
          'b': 0.0012643325143325162,
          'd': 0.0013513513513513532,
          'eɪ': 0.11726214851214868,
          'f': 0.00037878787878787933,
          'g': 0.00037878787878787933,
          'h': 0.0016635954135954156,
          'i': 0.001515151515151517,
          'j': 0.0020833333333333363,
          'n': 0.00037878787878787933,
          'oʊ': 0.0003378378378378383,
          'r': 0.0029484029484029527,
          't': 0.0007575757575757587,
          'u': 0.00018939393939393966,
          'æ': 0.005023205023205031,
          'ð': 0.0003378378378378383,
          'ŋ': 0.00037878787878787933,
          'ɑ': 0.02136568386568389,
          'ɔɪ': 0.0043355855855855925,
          'ɚ': 0.0005272317772317779,
          'ɛ': 0.0026566339066339106,
          'ɪ': 0.0025081900081900116,
          'ʌ': 0.00445331695331696,
          'ʒ': 0.0003378378378378383,
          'θ': 0.0007575757575757587})

In [29]:
print('p(Y_i|X_i = p):')
print(channelOutput_i_dist['p'])
print('p(Y_i|X_i = b):')
print(channelOutput_i_dist['b'])

p(Y_i|X_i = p):
'd': 0.000347222222222222
'p': 0.8784173976608184
'tʃ': 0.000694444444444444
'n': 0.0003289473684210524
'θ': 0.003088450292397658
'u': 0.0003289473684210524
'ʃ': 0.0013523391812865489
'l': 0.004093567251461985
'b': 0.006048976608187129
'm': 0.0010051169590643268
'h': 0.01822002923976607
'r': 0.000347222222222222
'ŋ': 0.000694444444444444
'oʊ': 0.0010233918128654965
'k': 0.030957602339181264
's': 0.000694444444444444
't': 0.0036549707602339153
'v': 0.0030884502923976586
'ʌ': 0.028143274853801154
'f': 0.004422514619883039
'aʊ': 0.0006578947368421048
'ɚ': 0.000347222222222222
'w': 0.000694444444444444
'i': 0.000347222222222222
'eɪ': 0.001388888888888888
'g': 0.0010416666666666662
'dʒ': 0.000694444444444444
'ɑ': 0.0013340643274853793
'ɛ': 0.003801169590643272
'ɪ': 0.002741228070175437

p(Y_i|X_i = b):
'd': 0.0264285714285714
'p': 0.15612781954887195
'tʃ': 0.001428571428571427
'n': 0.005357142857142851
'θ': 0.007142857142857135
'z': 0.00035714285714285676
'l': 0.004943609022

**Note!** A peculiarity of the diphone gating trial data is that while [ə] is a segment recognized in the original dataset, participants were not asked to differentiate it from [ʌ] -- indeed, North American/non-British linguists typically consider these to be stress variants of each other rather than distinct phonemes in (at least) N. American English. One result is that there are exactly zero trials in which participants identified [ə] as [ə]. *In terms of the code below, this means that $p(Y = [ə]|X = [ə]) = 0$.*

This matters because, in the code below, I often examine conditional distributions -- e.g. $p(y_0^j|x_0^i)$ or $p(\hat{x}_0^j|x_0^i)$ -- where one of the two strings/events in question is a prefix of the other, and sometimes identical to it. Because $p(Y = [ə]|X = [ə]) = 0$, we get $p(y_0^j = s|x_0^i = s) = 0$ whenever [ə] is in the produced prefix $s$. *Don't be spooked by seeing zero probabilities like this - it's not evidence that something, somewhere is wrong.*

In [30]:
channelOutput_i_dist['ə']
'ə' in channelOutput_i_dist['ə']

ProbDist({'aɪ': 0.025272556390977443,
          'aʊ': 0.005557644110275688,
          'b': 0.006278195488721804,
          'd': 0.002086466165413534,
          'eɪ': 0.1606766917293233,
          'f': 0.002261904761904762,
          'g': 0.0014285714285714286,
          'h': 0.003593358395989975,
          'i': 0.005357142857142857,
          'j': 0.0008959899749373433,
          'n': 0.0010150375939849624,
          'oʊ': 0.044251253132832055,
          'p': 0.0006578947368421052,
          'r': 0.0026879699248120296,
          's': 0.0002380952380952381,
          't': 0.0013721804511278195,
          'u': 0.002086466165413534,
          'w': 0.0003289473684210526,
          'æ': 0.0019392230576441103,
          'ɑ': 0.05714598997493732,
          'ɔɪ': 0.030375939849624056,
          'ɚ': 0.04531954887218044,
          'ɛ': 0.10053884711779446,
          'ɪ': 0.08782581453634083,
          'ʊ': 0.044047619047619044,
          'ʌ': 0.36676065162907273})

False
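To make the consequence of the note above concrete, here is a small sketch with a made-up channel. It assumes that each segment passes through the channel independently of its neighbors, which is a simplification adopted for this illustration. The probabilities are invented, except that [ə] is never reported as [ə]. `prefix_channel_prob_sketch` is a hypothetical name.

```python
# Toy per-segment channel p(Y_i | X_i). Numbers are invented for illustration;
# the only property carried over from the note is p(Y = 'ə' | X = 'ə') = 0.
toy_channel = {
    '⋊': {'⋊': 1.0},
    'k': {'k': 0.9, 'g': 0.1},
    'ə': {'ʌ': 0.6, 'ɛ': 0.4},  # no 'ə' outcome at all
    'm': {'m': 0.8, 'n': 0.2},
}

def prefix_channel_prob_sketch(produced, perceived, channel):
    """p(perceived | produced) for equal-length prefixes, ASSUMING each segment
    is transmitted independently of the others."""
    if len(produced) != len(perceived):
        raise ValueError('this sketch only handles equal-length prefixes')
    prob = 1.0
    for x, y in zip(produced, perceived):
        prob *= channel[x].get(y, 0.0)  # unseen outcomes get probability 0
    return prob

s = ('⋊', 'k', 'ə', 'm')
p_identical = prefix_channel_prob_sketch(s, s, toy_channel)
p_with_wedge = prefix_channel_prob_sketch(s, ('⋊', 'k', 'ʌ', 'm'), toy_channel)
print(p_identical)   # one zero factor makes the whole product zero
print(p_with_wedge)
```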

# Generating a model

In [31]:
print('DERIVED VALUES/FUNCTIONS:')

total_model = generateTotalModel(inputDist, channelOutput_i_dist, pad_with_boundaries = False)
total_model

DERIVED VALUES/FUNCTIONS:


{'channelOutput_i_dist': {'aɪ': ProbDist({'aɪ': 0.8234302484302483,
            'aʊ': 0.0033783783783783825,
            'b': 0.0012643325143325162,
            'd': 0.0013513513513513532,
            'eɪ': 0.11726214851214868,
            'f': 0.00037878787878787933,
            'g': 0.00037878787878787933,
            'h': 0.0016635954135954156,
            'i': 0.001515151515151517,
            'j': 0.0020833333333333363,
            'n': 0.00037878787878787933,
            'oʊ': 0.0003378378378378383,
            'r': 0.0029484029484029527,
            't': 0.0007575757575757587,
            'u': 0.00018939393939393966,
            'æ': 0.005023205023205031,
            'ð': 0.0003378378378378383,
            'ŋ': 0.00037878787878787933,
            'ɑ': 0.02136568386568389,
            'ɔɪ': 0.0043355855855855925,
            'ɚ': 0.0005272317772317779,
            'ɛ': 0.0026566339066339106,
            'ɪ': 0.0025081900081900116,
            'ʌ': 0.00445331695331696,
           

What can we do with this model?

In [32]:
sorted(list(total_model.keys()))

['channelOutput_i_dist',
 'channelOutput_i_marginal_dist',
 'channelOutput_prefix_marginal_prob',
 'channelOutput_prefix_prob',
 'channelOutput_prefix_sampler',
 'est_channelInput_prob',
 'inputAlphabet',
 'inputDist',
 'inputDist_givenPrefix',
 'inputs',
 'p(X_0^f | x_0^i)',
 'p(X_0^f)',
 'p(Y_i)',
 'p(Y_i|x_i)',
 'p(x_0^i)',
 'p(x_{i+1}^f | x_0^i)',
 'p(y_0^i)',
 'p(y_0^i|x_0^i)',
 'p-hat(x-hat_0^j|x_0^i)',
 'prefixProb',
 'prefixes',
 'sample from p(Y_0^i|x_0^i)',
 'suffixProb']

In [33]:
{k:type(total_model[k]) for k in total_model.keys()}

{'channelOutput_i_dist': dict,
 'channelOutput_i_marginal_dist': __main__.ProbDist,
 'channelOutput_prefix_marginal_prob': function,
 'channelOutput_prefix_prob': function,
 'channelOutput_prefix_sampler': function,
 'est_channelInput_prob': function,
 'inputAlphabet': set,
 'inputDist': __main__.ProbDist,
 'inputDist_givenPrefix': function,
 'inputs': list,
 'p(X_0^f | x_0^i)': function,
 'p(X_0^f)': __main__.ProbDist,
 'p(Y_i)': __main__.ProbDist,
 'p(Y_i|x_i)': dict,
 'p(x_0^i)': function,
 'p(x_{i+1}^f | x_0^i)': function,
 'p(y_0^i)': function,
 'p(y_0^i|x_0^i)': function,
 'p-hat(x-hat_0^j|x_0^i)': function,
 'prefixProb': function,
 'prefixes': set,
 'sample from p(Y_0^i|x_0^i)': function,
 'suffixProb': function}

## Testing the model

In [34]:
# print('TEST INPUTS:')
an_input = total_model['inputs'][1]
prefix_a = an_input[0:2]
prefix_b = an_input[0:3]
prefix_c = an_input[0:4]
prefix_d = an_input[:-2]
suffix_a = an_input[-3:]
suffix_b = tuple([an_input[-2],'⋉'])
suffix_c = an_input[-2:]

print("{0}: {1}".format('an_input', an_input))
print("{0}: {1}".format('prefix_a', prefix_a))
print("{0}: {1}".format('prefix_b', prefix_b))
print("{0}: {1}".format('prefix_c', prefix_c))
print("{0}: {1}".format('prefix_d', prefix_d))
print("{0}: {1}".format('suffix_a', suffix_a))
print("{0}: {1}".format('suffix_b', suffix_b))
print("{0}: {1}".format('suffix_c', suffix_c))

an_input: ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')
prefix_a: ('⋊', 'θ')
prefix_b: ('⋊', 'θ', 'r')
prefix_c: ('⋊', 'θ', 'r', 'ʌ')
prefix_d: ('⋊', 'θ', 'r', 'ʌ', 's')
suffix_a: ('s', 's', '⋉')
suffix_b: ('s', '⋉')
suffix_c: ('s', '⋉')


The two cells below show properties of the input distribution $p(X_0^f)$ and calculations derived from it...

In [35]:
someInputs = total_model['inputs'][:10]
print('some inputs:')
print(someInputs)
print('preview of p(X_0^f):')
print({k:total_model['p(X_0^f)'][k] for k in someInputs})
assert(all([k in total_model['p(X_0^f)'] for k in someInputs]))

print('input alphabet:')
print(total_model['inputAlphabet'])
print('input alphabet - edge symbols:')
total_model['inputAlphabet'] - edgeSymbols

print("<= 10 prefixes of lexicon:")
print(list(total_model['prefixes'])[0:11])

some inputs:
[('⋊', 's', 'k', 'i', 'm', 'z', '⋉'), ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉'), ('⋊', 's', 'ɪ', 'ŋ', 'ɚ', '⋉'), ('⋊', 'k', 'ɑ', 'r', 'b', 'j', 'ɚ', 'i', 'ʃ', 'ə', 'n', '⋉'), ('⋊', 'p', 'r', 'ɑ', 's', 'ɛ', 's', 't', '⋉'), ('⋊', 'b', 'oʊ', 'l', 'ʃ', 'ə', 'v', 'ɪ', 'k', 's', '⋉'), ('⋊', 'k', 'ə', 'm', 'p', 'r', 'ɛ', 's', 't', '⋉'), ('⋊', 'n', 'æ', 'n', 'oʊ', 's', 'ɛ', 'k', 'ə', 'n', 'd', '⋉'), ('⋊', 'k', 'oʊ', 'k', 'oʊ', '⋉'), ('⋊', 'ɪ', 'n', 'ɑ', 'k', 'j', 'ə', 'l', 'eɪ', 't', '⋉')]
preview of p(X_0^f):
{('⋊', 's', 'k', 'i', 'm', 'z', '⋉'): 0.05597964376590331, ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉'): 0.0025445292620865138, ('⋊', 's', 'ɪ', 'ŋ', 'ɚ', '⋉'): 0.4990458015267174, ('⋊', 'k', 'ɑ', 'r', 'b', 'j', 'ɚ', 'i', 'ʃ', 'ə', 'n', '⋉'): 0.0006361323155216284, ('⋊', 'p', 'r', 'ɑ', 's', 'ɛ', 's', 't', '⋉'): 0.07983460559796436, ('⋊', 'b', 'oʊ', 'l', 'ʃ', 'ə', 'v', 'ɪ', 'k', 's', '⋉'): 0.006997455470737914, ('⋊', 'k', 'ə', 'm', 'p', 'r', 'ɛ', 's', 't', '⋉'): 0.028625954198473278, ('⋊', '

{'b',
 'd',
 'eɪ',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'oʊ',
 'p',
 'r',
 's',
 't',
 'v',
 'z',
 'æ',
 'ŋ',
 'ɑ',
 'ə',
 'ɚ',
 'ɛ',
 'ɪ',
 'ʃ',
 'ʌ',
 'θ'}

<= 10 prefixes of lexicon:
[('⋊', 's', 'k', 'i', 'm', 'z', '⋉'), ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉'), ('⋊', 's', 'ɪ', 'ŋ', 'ɚ', '⋉'), ('⋊', 'b', 'oʊ', 'l', 'ʃ', 'ə', 'v', 'ɪ'), ('⋊', 's', 'ɪ', 'ŋ', 'ɚ'), ('⋊', 'n'), ('⋊', 'k', 'ɑ', 'r', 'b', 'j', 'ɚ', 'i', 'ʃ', 'ə', 'n', '⋉'), ('⋊', 'p', 'r', 'ɑ', 's', 'ɛ', 's', 't', '⋉'), ('⋊', 'p', 'r'), ('⋊', 'b', 'oʊ', 'l'), ('⋊', 'b', 'oʊ', 'l', 'ʃ', 'ə', 'v', 'ɪ', 'k', 's', '⋉')]


...like 
 - the probability the speaker's intended wordform has a particular prefix $p(x_0^i)$.
 - the distribution over full wordforms given a specific prefix $p(X_0^f|x_0^i)$.
 - the probability of a particular suffix given a particular prefix $p(x_{i+1}^f|x_0^i)$.
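The three quantities in this list can be sketched on a toy lexicon. This is only an illustration of the definitions, not the code behind `total_model`. The lexicon, its probabilities, and the `_sketch` function names are all hypothetical.

```python
toy_lexicon = {
    ('⋊', 'k', 'æ', 't', '⋉'): 0.5,
    ('⋊', 'k', 'æ', 'b', '⋉'): 0.25,
    ('⋊', 'd', 'ɔ', 'g', '⋉'): 0.25,
}

def has_prefix(word, prefix):
    return word[:len(prefix)] == prefix

def prefix_prob_sketch(prefix, lexicon):
    """p(x_0^i): total probability of all wordforms that start with `prefix`."""
    return sum(p for w, p in lexicon.items() if has_prefix(w, prefix))

def dist_given_prefix_sketch(prefix, lexicon):
    """p(X_0^f | x_0^i): the lexicon restricted to `prefix`, renormalized."""
    z = prefix_prob_sketch(prefix, lexicon)
    return {w: p / z for w, p in lexicon.items() if has_prefix(w, prefix)}

def suffix_prob_sketch(suffix, prefix, lexicon):
    """p(x_{i+1}^f | x_0^i): probability that `suffix` completes `prefix`."""
    return dist_given_prefix_sketch(prefix, lexicon).get(prefix + suffix, 0.0)

test_prefix = ('⋊', 'k', 'æ')
p_prefix = prefix_prob_sketch(test_prefix, toy_lexicon)
conditional = dist_given_prefix_sketch(test_prefix, toy_lexicon)
p_suffix = suffix_prob_sketch(('t', '⋉'), test_prefix, toy_lexicon)
print(p_prefix)
print(conditional)
print(p_suffix)
```

When only one wordform is compatible with a prefix, the conditional distribution puts probability 1.0 on that wordform, which is the pattern visible in the ten-word test lexicon.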

In [36]:
print('p(X_0^f):')
# print('...')
print(inputDist)
test_prefix = prefix_a
print("TEST PREFIX: {0}".format(test_prefix))
# print("{0} in dist:{1}".format(test_prefix, test_prefix in total_model['prefixProb']))
print('p(x_0^i = {0}) = {1}'.format(test_prefix, total_model['prefixProb'](test_prefix)))
print('p(X_0^f| x_0^i = {0}):'.format(test_prefix))
print(total_model['inputDist_givenPrefix'](test_prefix))
test_prefix = prefix_b
print("TEST PREFIX: {0}".format(test_prefix))
print('p(x_0^i = {0}) = {1}'.format(test_prefix, total_model['prefixProb'](test_prefix)))
print('p(X_0^f| x_0^i = {0}):'.format(test_prefix))
print(total_model['inputDist_givenPrefix'](test_prefix))
test_prefix = prefix_c
print("TEST PREFIX: {0}".format(test_prefix))
print('p(x_0^i = {0}) = {1}'.format(test_prefix, total_model['prefixProb'](test_prefix)))
print('p(X_0^f| x_0^i = {0}):'.format(test_prefix))
print(total_model['inputDist_givenPrefix'](test_prefix))

test_suffix = suffix_a
print("TEST PREFIX: {0}".format(test_prefix))
print('TEST SUFFIX: {0}'.format(test_suffix))
print('Let j = i+1:')
print(' p(x_j^f = {0}|x_0^i = {1}) = {2}'.format(test_suffix, test_prefix, total_model['suffixProb'](test_suffix, test_prefix)))
test_prefix = prefix_d
test_suffix = suffix_c
print("TEST PREFIX: {0}".format(test_prefix))
print('TEST SUFFIX: {0}'.format(test_suffix))
print('Let j = i+1:')
print(' p(x_j^f = {0}|x_0^i = {1}) = {2}'.format(test_suffix, test_prefix, total_model['suffixProb'](test_suffix, test_prefix)))


p(X_0^f):
⋊.s.k.i.m.z.⋉: 0.05597964376590331
⋊.θ.r.ʌ.s.s.⋉: 0.0025445292620865138
⋊.s.ɪ.ŋ.ɚ.⋉: 0.4990458015267174
⋊.k.ɑ.r.b.j.ɚ.i.ʃ.ə.n.⋉: 0.0006361323155216284
⋊.p.r.ɑ.s.ɛ.s.t.⋉: 0.07983460559796436
⋊.b.oʊ.l.ʃ.ə.v.ɪ.k.s.⋉: 0.006997455470737914
⋊.k.ə.m.p.r.ɛ.s.t.⋉: 0.028625954198473278
⋊.n.æ.n.oʊ.s.ɛ.k.ə.n.d.⋉: 0.004452926208651399
⋊.k.oʊ.k.oʊ.⋉: 0.31075063613231546
⋊.ɪ.n.ɑ.k.j.ə.l.eɪ.t.⋉: 0.011132315521628498

TEST PREFIX: ('⋊', 'θ')
p(x_0^i = ('⋊', 'θ')) = 0.0025445292620865138
p(X_0^f| x_0^i = ('⋊', 'θ')):
⋊.θ.r.ʌ.s.s.⋉: 1.0

TEST PREFIX: ('⋊', 'θ', 'r')
p(x_0^i = ('⋊', 'θ', 'r')) = 0.0025445292620865138
p(X_0^f| x_0^i = ('⋊', 'θ', 'r')):
⋊.θ.r.ʌ.s.s.⋉: 1.0

TEST PREFIX: ('⋊', 'θ', 'r', 'ʌ')
p(x_0^i = ('⋊', 'θ', 'r', 'ʌ')) = 0.0025445292620865138
p(X_0^f| x_0^i = ('⋊', 'θ', 'r', 'ʌ')):
⋊.θ.r.ʌ.s.s.⋉: 1.0

TEST PREFIX: ('⋊', 'θ', 'r', 'ʌ')
TEST SUFFIX: ('s', 's', '⋉')
Let j = i+1:
 p(x_j^f = ('s', 's', '⋉')|x_0^i = ('⋊', 'θ', 'r', 'ʌ')) = 1.0
TEST PREFIX: ('⋊', 'θ', 'r', 'ʌ', 's')
TE

Below are the channel distributions over a single segment $p(Y|X = x)$ (for two example segments) and the marginal distribution $p(Y)$...

In [37]:
test_phone_a = list(total_model['inputAlphabet'])[3]
test_phone_b = list(total_model['inputAlphabet'])[23]

print('p(Y_i|X_i = {0}):'.format(test_phone_a))
# print(channelOutput_i_dist(0))
print(channelOutput_i_dist[test_phone_a])
print('p(Y_i|X_i = {0}):'.format(test_phone_b))
# print(channelOutput_i_dist(1))
print(channelOutput_i_dist[test_phone_b])

print("p(Y_i):")
print(total_model['channelOutput_i_marginal_dist'])

p(Y_i|X_i = n):
'aɪ': 0.0003289473684210526
'd': 0.0006944444444444445
'p': 0.0006944444444444445
'tʃ': 0.0020833333333333337
'ʒ': 0.0006944444444444445
'n': 0.8828947368421052
'θ': 0.0006944444444444445
'u': 0.001736111111111111
'z': 0.00034722222222222224
'l': 0.003435672514619883
'b': 0.0006578947368421052
'ð': 0.0006944444444444445
'ɑ': 0.003983918128654971
'ŋ': 0.019060672514619886
'h': 0.004093567251461989
'ɪ': 0.008607456140350878
'm': 0.02388523391812866
'oʊ': 0.00034722222222222224
'k': 0.0006944444444444445
'æ': 0.0006944444444444445
'j': 0.0010233918128654971
'ʊ': 0.002722953216374269
'ʌ': 0.025292397660818724
'ɚ': 0.0010416666666666669
'w': 0.007346491228070177
'i': 0.0006944444444444445
'g': 0.0020833333333333337
'dʒ': 0.001388888888888889
'v': 0.001388888888888889
'ɛ': 0.0006944444444444445

p(Y_i|X_i = oʊ):
'aɪ': 0.0003378378378378383
'p': 0.00036764705882352984
'ʒ': 0.00036764705882352984
'n': 0.0009191176470588246
'θ': 0.00036764705882352984
'u': 0.009994369369369382
'

Below are example calculations related to the following channel distributions over prefixes:
 - $p(y_0^i|x_0^i)$
 - $p(y_0^j|x_0^i)$

In [38]:
print("p(X_0^f):")
print(inputDist)
  
test_input_full = an_input
test_output_full = an_input
print("TEST INPUT full: {0}".format(test_input_full))
print("TEST OUTPUT full: {0}".format(test_output_full))
print('p(y_0^i = {0}| x_0^i = {1}) = {2}'.format(test_output_full, test_input_full, total_model['channelOutput_prefix_prob'](test_input_full, test_output_full)))
# print('len({0}) = {1}'.format(trimBoundariesFromSequence(test_input_full), len(trimBoundariesFromSequence(test_input_full))))
# print('p(no_error) = {0}'.format(channelOutput_i_dist[]))
# print('{2}^(len({0})) = {1}'.format(trimBoundariesFromSequence(test_input_full), pNoError_i ** len(trimBoundariesFromSequence(test_input_full)), pNoError_i))

print(' ')
test_input_slice = prefix_a
test_output_slice = prefix_a
print("TEST INPUT prefix: {0}".format(test_input_slice))
print("TEST OUTPUT prefix: {0}".format(test_output_slice))
print('p(y_0^i = {0}| x_0^i = {1}) = {2}'.format(test_output_slice, test_input_slice, total_model['channelOutput_prefix_prob'](test_input_slice, test_output_slice)))

print(' ')
test_input_slice = prefix_a
test_output_slice = prefix_b
print("TEST INPUT prefix: {0}".format(test_input_slice))
print("TEST OUTPUT prefix: {0}".format(test_output_slice))
print('p(y_0^j = {0}| x_0^i = {1}) = {2}'.format(test_output_slice, test_input_slice, total_model['channelOutput_prefix_prob'](test_input_slice, test_output_slice)))

print(' ')
test_input_slice = prefix_b
test_output_slice = prefix_a
print("TEST INPUT prefix: {0}".format(test_input_slice))
print("TEST OUTPUT prefix: {0}".format(test_output_slice))
print('p(y_0^j = {0}| x_0^i = {1}) = {2}'.format(test_output_slice, test_input_slice, total_model['channelOutput_prefix_prob'](test_input_slice, test_output_slice)))

p(X_0^f):
⋊.s.k.i.m.z.⋉: 0.05597964376590331
⋊.θ.r.ʌ.s.s.⋉: 0.0025445292620865138
⋊.s.ɪ.ŋ.ɚ.⋉: 0.4990458015267174
⋊.k.ɑ.r.b.j.ɚ.i.ʃ.ə.n.⋉: 0.0006361323155216284
⋊.p.r.ɑ.s.ɛ.s.t.⋉: 0.07983460559796436
⋊.b.oʊ.l.ʃ.ə.v.ɪ.k.s.⋉: 0.006997455470737914
⋊.k.ə.m.p.r.ɛ.s.t.⋉: 0.028625954198473278
⋊.n.æ.n.oʊ.s.ɛ.k.ə.n.d.⋉: 0.004452926208651399
⋊.k.oʊ.k.oʊ.⋉: 0.31075063613231546
⋊.ɪ.n.ɑ.k.j.ə.l.eɪ.t.⋉: 0.011132315521628498

TEST INPUT full: ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')
TEST OUTPUT full: ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')
p(y_0^i = ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')| x_0^i = ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')) = 0.33122571061530065
 
TEST INPUT prefix: ('⋊', 'θ')
TEST OUTPUT prefix: ('⋊', 'θ')
p(y_0^i = ('⋊', 'θ')| x_0^i = ('⋊', 'θ')) = 0.6907205882352944
 
TEST INPUT prefix: ('⋊', 'θ')
TEST OUTPUT prefix: ('⋊', 'θ', 'r')
p(y_0^j = ('⋊', 'θ', 'r')| x_0^i = ('⋊', 'θ')) = 0.5782879288175375
 
TEST INPUT prefix: ('⋊', 'θ', 'r')
TEST OUTPUT prefix: ('⋊', 'θ')
p(y_0^j = ('⋊', 'θ')| x_0^i = ('⋊'

Below are samples from $p(Y_0^i|x_0^i)$:

In [39]:
test_input_full = an_input
print("TEST INPUT full: {0}".format(test_input_full))
outputSamples = list(sampleFrom(lambda : total_model['channelOutput_prefix_sampler'](test_input_full), 1000))
outputSamplesDS = list(map(tupleToDottedString, outputSamples))
outputSamplesDS_dist = ProbDist(Counter(outputSamplesDS))
print('est. p(Y_0^i = {0}|x_0^i = {0}) based on 1,000 samples:'.format(an_input))
print(outputSamplesDS_dist[tupleToDottedString(test_input_full)])
print('Examples of output samples:')
outputSamplesDS[:10]

TEST INPUT full: ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')
est. p(Y_0^i = ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')|x_0^i = ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')) based on 1,000 samples:
333/1000
Examples of output samples:


['⋊.θ.r.ʌ.s.s.⋉',
 '⋊.θ.r.ʌ.s.s.⋉',
 '⋊.θ.r.ɑ.i.s.⋉',
 '⋊.p.r.ʌ.s.ɔɪ.⋉',
 '⋊.θ.r.ʌ.s.s.⋉',
 '⋊.θ.r.ʊ.s.s.⋉',
 '⋊.θ.r.ʊ.s.s.⋉',
 '⋊.f.r.ɚ.s.s.⋉',
 '⋊.θ.r.ɑ.s.s.⋉',
 '⋊.ʌ.r.ʌ.s.s.⋉']

In [40]:
print("TEST INPUT prefix: {0}".format(prefix_b))
print('10 samples from p(Y_0^i|x_0^i = {0}): {1}'.format(prefix_b, list(sampleFrom(lambda : total_model['channelOutput_prefix_sampler'](prefix_b), 10)) ))

TEST INPUT prefix: ('⋊', 'θ', 'r')
10 samples from p(Y_0^i|x_0^i = ('⋊', 'θ', 'r')): [('⋊', 'f', 'æ'), ('⋊', 'θ', 'r'), ('⋊', 'θ', 'r'), ('⋊', 'θ', 'r'), ('⋊', 'dʒ', 'r'), ('⋊', 'ð', 'r'), ('⋊', 'ɛ', 'r'), ('⋊', 'θ', 'r'), ('⋊', 'θ', 'ɚ'), ('⋊', 'θ', 'r')]
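The idea behind estimating $p(Y_0^i = s|x_0^i = s)$ from samples can be sketched as follows. This is not the notebook's sampler: it assumes segments are transmitted independently and uses an invented toy channel. `sample_output_sketch` is a hypothetical name.

```python
import random
from collections import Counter

# Invented per-segment channel, for illustration only.
toy_channel = {
    '⋊': {'⋊': 1.0},
    'k': {'k': 0.9, 'g': 0.1},
    'm': {'m': 0.8, 'n': 0.2},
    '⋉': {'⋉': 1.0},
}

def sample_output_sketch(produced, channel, rng):
    """Draw one perceived sequence, sampling each segment independently."""
    sampled = []
    for x in produced:
        outcomes = list(channel[x].keys())
        weights = list(channel[x].values())
        sampled.append(rng.choices(outcomes, weights=weights, k=1)[0])
    return tuple(sampled)

rng = random.Random(0)  # fixed seed so the run is repeatable
produced = ('⋊', 'k', 'm', '⋉')
n_samples = 10000
sample_counts = Counter(sample_output_sketch(produced, toy_channel, rng)
                        for _ in range(n_samples))

estimate = sample_counts[produced] / n_samples
exact = 0.9 * 0.8  # product of the per-segment "no error" probabilities
print('estimated:', estimate)
print('exact:    ', exact)
```

The estimate converges on the exact product as the number of samples grows, which gives a cheap sanity check on a closed-form channel probability.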


Below is an example calculation of $p(y_0^i)$:

In [41]:
print('TEST OUTPUT prefix = {0}'.format(prefix_b))
print('p(y_0^i = {0}) = {1}'.format(prefix_b, total_model['channelOutput_prefix_marginal_prob'](prefix_b)))

TEST OUTPUT prefix = ('⋊', 'θ', 'r')
p(y_0^i = ('⋊', 'θ', 'r')) = 0.0016787690901115722


Finally, below are example calculations of $p(\hat{x}_0^j| x_0^i)$:

In [42]:
print("p(X_0^f):")
print(inputDist)

print('est p(x-hat_0^j = {0}| x_0^i = {1}) = {2}'.format(('⋊', 1), prefix_b, total_model['est_channelInput_prob'](prefix_b, ('⋊', 1))))
print(' ')

print('est p(x-hat_0^j = {0}| x_0^i = {1}) = {2}'.format(prefix_b, prefix_b, total_model['est_channelInput_prob'](prefix_b, prefix_b)))
print(' ')
print('est p(x-hat_0^j = {0}| x_0^i = {1}) = {2}'.format(an_input, prefix_b, total_model['est_channelInput_prob'](prefix_b, an_input)))
print(' ')
print('est p(x-hat_0^j = {0}| x_0^i = {1}) = {2}'.format(prefix_b, an_input, total_model['est_channelInput_prob'](an_input, prefix_b)))

print(' ')
print(' ')
#arguments are (x_0^i, x-hat_0^j), as in the calls above
print('est p(x-hat_0^j = {0}| x_0^i = {1}) = {2}'.format(prefix_a, prefix_a, total_model['est_channelInput_prob'](prefix_a, prefix_a)))
print(' ')
print('est p(x-hat_0^j = {0}| x_0^i = {1}) = {2}'.format(prefix_a, prefix_b, total_model['est_channelInput_prob'](prefix_b, prefix_a)))
print(' ')
print('est p(x-hat_0^j = {0}| x_0^i = {1}) = {2}'.format(prefix_b, prefix_a, total_model['est_channelInput_prob'](prefix_a, prefix_b)))

p(X_0^f):
⋊.s.k.i.m.z.⋉: 0.05597964376590331
⋊.θ.r.ʌ.s.s.⋉: 0.0025445292620865138
⋊.s.ɪ.ŋ.ɚ.⋉: 0.4990458015267174
⋊.k.ɑ.r.b.j.ɚ.i.ʃ.ə.n.⋉: 0.0006361323155216284
⋊.p.r.ɑ.s.ɛ.s.t.⋉: 0.07983460559796436
⋊.b.oʊ.l.ʃ.ə.v.ɪ.k.s.⋉: 0.006997455470737914
⋊.k.ə.m.p.r.ɛ.s.t.⋉: 0.028625954198473278
⋊.n.æ.n.oʊ.s.ɛ.k.ə.n.d.⋉: 0.004452926208651399
⋊.k.oʊ.k.oʊ.⋉: 0.31075063613231546
⋊.ɪ.n.ɑ.k.j.ə.l.eɪ.t.⋉: 0.011132315521628498

est p(x-hat_0^j = ('⋊', 1)| x_0^i = ('⋊', 'θ', 'r')) = 0
 
est p(x-hat_0^j = ('⋊', 'θ', 'r')| x_0^i = ('⋊', 'θ', 'r')) = 0.7132299796807328
 
est p(x-hat_0^j = ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')| x_0^i = ('⋊', 'θ', 'r')) = 0.7215248357105462
 
est p(x-hat_0^j = ('⋊', 'θ', 'r')| x_0^i = ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')) = 0.9999763433531295
 
 
est p(x-hat_0^j = ('⋊', 'θ')| x_0^i = ('⋊', 'θ')) = 0.720068487711772
 
est p(x-hat_0^j = ('⋊', 'θ')| x_0^i = ('⋊', 'θ', 'r')) = 0.7201675102912667
 
est p(x-hat_0^j = ('⋊', 'θ', 'r')| x_0^i = ('⋊', 'θ')) = 0.7181326246871341
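This section does not show how `est_channelInput_prob` is computed. One natural reading of $p(\hat{x}|x)$, offered here purely as an assumption, is the probability that a Bayesian listener recovers $\hat{x}$ when the speaker produces $x$: $\sum_y p(y|x)\,p(\hat{x}|y)$, with $p(\hat{x}|y) \propto p(y|\hat{x})\,p(\hat{x})$. The sketch below computes this exactly on an invented two-word lexicon with an invented, segment-independent channel. All names and numbers are hypothetical.

```python
import itertools

toy_lexicon = {('k', 'æ', 't'): 0.6, ('b', 'æ', 't'): 0.4}
toy_channel = {
    'k': {'k': 0.8, 'b': 0.2},
    'b': {'b': 0.7, 'k': 0.3},
    'æ': {'æ': 1.0},
    't': {'t': 1.0},
}

def channel_prob(produced, perceived):
    """p(y | x) for equal-length sequences, assuming independent segments."""
    prob = 1.0
    for x, y in zip(produced, perceived):
        prob *= toy_channel[x].get(y, 0.0)
    return prob

def posterior(perceived):
    """p(x-hat | y) over the toy lexicon, by Bayes' rule."""
    joint = {w: p * channel_prob(w, perceived) for w, p in toy_lexicon.items()}
    total = sum(joint.values())
    return {w: j / total for w, j in joint.items()}

def recovery_prob(produced, guess):
    """Sum over every percept y reachable from `produced` (a lexicon word)
    of p(y | produced) * p(guess | y)."""
    percepts = itertools.product(*(toy_channel[x].keys() for x in produced))
    return sum(channel_prob(produced, y) * posterior(y)[guess] for y in percepts)

cat, bat = ('k', 'æ', 't'), ('b', 'æ', 't')
p_cat_given_cat = recovery_prob(cat, cat)
p_bat_given_cat = recovery_prob(cat, bat)
print(p_cat_given_cat)
print(p_bat_given_cat)
```

Exact enumeration like this is only feasible for tiny examples, since the number of possible percepts grows exponentially with sequence length. Sampling percepts instead of enumerating them is one way around that.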


How long does it take to calculate the query $p(\hat{X}_0^i = s|X_0^i = s)$ for each prefix $s$ of a test word from our tiny ten-word lexicon?

In [43]:
test_input_full = an_input
print("TEST INPUT full: {0}".format(test_input_full))
testPrefixes = sorted(list(getPrefixes(an_input)), key = lambda prefix: len(prefix))
len(testPrefixes)
print('Test prefixes of increasing length:')
testPrefixes
# testPrefixes - {tuple(['⋊'])}
# testPrefixes = testPrefixes- {tuple(['⋊'])}

TEST INPUT full: ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')


7

Test prefixes of increasing length:


[('⋊',),
 ('⋊', 'θ'),
 ('⋊', 'θ', 'r'),
 ('⋊', 'θ', 'r', 'ʌ'),
 ('⋊', 'θ', 'r', 'ʌ', 's'),
 ('⋊', 'θ', 'r', 'ʌ', 's', 's'),
 ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉')]

In [44]:
import time
import statistics

def timeCall(thunk):
    '''
    Calls thunk (a zero-argument function) once and returns the elapsed wall-clock time in seconds.
    '''
    timeStart = time.perf_counter()
    thunk()
    timeEnd = time.perf_counter()
    return timeEnd - timeStart


times = dict()
for eachPrefix in testPrefixes:
#     print(eachPrefix)
#     results[eachPrefix] = total_model['est_channelInput_prob'](eachPrefix, eachPrefix)
    times[eachPrefix] = timeCall(lambda: total_model['est_channelInput_prob'](eachPrefix, eachPrefix))
#     print('{0}: {1}'.format(eachPrefix, times[eachPrefix]))
print('times:')
times
timeValues = [times[eachPrefix] for eachPrefix in testPrefixes]

print('difference between successive queries (of increasing prefix lengths):')
diffs = [trial_next_time - trial_time for trial_time, trial_next_time in zip(timeValues, timeValues[1:])]
diffs
print('mean difference:')
statistics.mean(diffs)

print('time/segment:')
timePerSegment = [times[eachPrefix] / len(eachPrefix) for eachPrefix in testPrefixes]
timePerSegment
print('mean time/segment:')
statistics.mean(timePerSegment)

times:


{('⋊',): 0.15563607215881348,
 ('⋊', 'θ'): 1.1722090244293213,
 ('⋊', 'θ', 'r'): 2.2120699882507324,
 ('⋊', 'θ', 'r', 'ʌ'): 2.6993730068206787,
 ('⋊', 'θ', 'r', 'ʌ', 's'): 3.2806639671325684,
 ('⋊', 'θ', 'r', 'ʌ', 's', 's'): 3.9217159748077393,
 ('⋊', 'θ', 'r', 'ʌ', 's', 's', '⋉'): 3.872938871383667}

difference between successive queries (of increasing prefix lengths):


[1.0165729522705078,
 1.0398609638214111,
 0.4873030185699463,
 0.5812909603118896,
 0.6410520076751709,
 -0.048777103424072266]

mean difference:


0.6195504665374756

time/segment:


[0.15563607215881348,
 0.5861045122146606,
 0.7373566627502441,
 0.6748432517051697,
 0.6561327934265136,
 0.6536193291346232,
 0.5532769816262382]

mean time/segment:


0.5738528004308947
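
Each prefix above is timed exactly once, so individual values are noisy: the last difference is negative (about -0.049 s) even though the prefix is one segment longer. One way to reduce that noise is to repeat each measurement and report the median. The sketch below assumes the query is cheap enough to repeat; `time_call_repeated`, `summarize`, and `dummy_query` are hypothetical names introduced for illustration, and `dummy_query` merely stands in for a call such as `total_model['est_channelInput_prob'](prefix, prefix)`.

```python
import statistics
import time

def time_call_repeated(thunk, repeats=5):
    """Call thunk `repeats` times; return the list of elapsed times in seconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        thunk()
        elapsed.append(time.perf_counter() - start)
    return elapsed

def summarize(elapsed):
    """Return (median, min, max) of a list of timings."""
    return statistics.median(elapsed), min(elapsed), max(elapsed)

# Hypothetical stand-in workload; replace with the real query in the notebook.
def dummy_query():
    return sum(i * i for i in range(10000))

timings = time_call_repeated(dummy_query, repeats=5)
median_s, fastest_s, slowest_s = summarize(timings)
print('median: {0:.6f}s, min: {1:.6f}s, max: {2:.6f}s'.format(median_s, fastest_s, slowest_s))
```

The median is less sensitive than the mean to a single slow run, which matters when other processes compete for the machine.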

How long does it take to calculate $p(\hat{X}_0^f = s|X_0^f = s)$ as the size of the lexicon grows?

In [45]:
#first we want to restore the input distribution to the full one we imported...

len(inputDist) #current input distribution
len(lexicon_actual) #reference to the originally imported lexicon
inputDist = lexicon_actual
new_total_model = generateTotalModel(inputDist, channelOutput_i_dist, pad_with_boundaries = False)
# new_total_model
sorted(new_total_model.keys())

10

51376

['channelOutput_i_dist',
 'channelOutput_i_marginal_dist',
 'channelOutput_prefix_marginal_prob',
 'channelOutput_prefix_prob',
 'channelOutput_prefix_sampler',
 'est_channelInput_prob',
 'inputAlphabet',
 'inputDist',
 'inputDist_givenPrefix',
 'inputs',
 'p(X_0^f | x_0^i)',
 'p(X_0^f)',
 'p(Y_i)',
 'p(Y_i|x_i)',
 'p(x_0^i)',
 'p(x_{i+1}^f | x_0^i)',
 'p(y_0^i)',
 'p(y_0^i|x_0^i)',
 'p-hat(x-hat_0^j|x_0^i)',
 'prefixProb',
 'prefixes',
 'sample from p(Y_0^i|x_0^i)',
 'suffixProb']

Across $m$ random $n$-word sublexicons of the actual lexicon, what is the average time to make $k$ random queries of $p(\hat{X}_0^f = s|X_0^f = s)$, where $s$ is chosen uniformly at random from the words in the sublexicon? Let's start with $m=3$ and $k=10$ and let $n$ vary over $\{10, 100, 1000, 10000, 20000, 40000\}$.

In [46]:
my_m = 3
my_k = 10
my_ns = [10, 100, 1000, 10000, 20000, 40000]

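import random #harmless if already imported earlier in the notebook

def getLexiconSampleOfSize(full_lexicon, n):
    #sample n words without replacement from the lexicon passed in (assumed dict-like), not from a global
    some_entries = random.sample(list(full_lexicon.keys()), n)
    newLexicon = ProbDist({w:full_lexicon[w] for w in some_entries})
    return newLexicon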

def genMatchingModel(random_sublexicon):
    model = generateTotalModel(random_sublexicon, channelOutput_i_dist, pad_with_boundaries = False)
    return model

def makeRandomQuery(a_total_model):
    random_word = random.choice(a_total_model['inputs'])
    ds = tupleToDottedString(random_word)
    # print('Starting query for {0}'.format(ds))
    timeStart = time.time()
    query_result = a_total_model['p-hat(x-hat_0^j|x_0^i)'](random_word, random_word)
    timeEnd = time.time()
    timeTaken = timeEnd - timeStart #seconds
    # print('Finished query. Time taken (s): {0}'.format(timeTaken))
    return (random_word, query_result, timeTaken)

def getTime(model, a_k_val):
    timeStart = time.time()
    makeQuery = lambda: makeRandomQuery(model)
    queries = [makeQuery() for each in range(a_k_val)]
    timeEnd = time.time()
    timeTaken = timeEnd - timeStart #seconds
    print('Time taken (s) for {0} queries: {1}'.format(a_k_val, timeTaken))
    return timeTaken

def collectTimes(m, n, k):
    parameters = {'m':m, 'n':n, 'k':k}
    random_sublexicons = list(getLexiconSampleOfSize(inputDist, n) for each in range(m))
    models = list(map(genMatchingModel, random_sublexicons))
    times = list(map(lambda model: getTime(model, k), models))
    mean_time = statistics.mean(times)
    print('Mean time (over {0} sublexicons) to make {1} queries over a {2}-word sublexicon: {3}'.format(m, k, n, mean_time))
    return {
        'parameters':parameters, 
        'sublexicons':random_sublexicons, 
        'models':models, 
        'times':times, 
        'mean time':mean_time
    }

observations = [collectTimes(my_m, each_n, my_k) for each_n in my_ns]

Time taken (s) for 10 queries: 42.809144020080566
Time taken (s) for 10 queries: 54.05713701248169
Time taken (s) for 10 queries: 62.51040315628052
Mean time (over 3 sublexicons) to make 10 queries over a 10-word sublexicon: 53.12556139628092
Time taken (s) for 10 queries: 103.55769801139832
Time taken (s) for 10 queries: 99.49169898033142
Time taken (s) for 10 queries: 101.20855593681335
Mean time (over 3 sublexicons) to make 10 queries over a 100-word sublexicon: 101.4193176428477
Time taken (s) for 10 queries: 442.6917288303375
Time taken (s) for 10 queries: 557.3850648403168
Time taken (s) for 10 queries: 492.1795291900635
Mean time (over 3 sublexicons) to make 10 queries over a 1000-word sublexicon: 497.41877428690594
Time taken (s) for 10 queries: 64854.661350011826
Time taken (s) for 10 queries: 4353.231388092041
Time taken (s) for 10 queries: 5082.366680145264
Mean time (over 3 sublexicons) to make 10 queries over a 10000-word sublexicon: 24763.419806083042


KeyboardInterrupt: 

(Jupyter is buggy and slow; when the corresponding code is run *outside* of a notebook, the mean time to make 10 queries is between 45 and 65 seconds, no matter how large the (sub)lexicon in question is.)
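
The in-notebook timings printed above can also be summarized with a median alongside the mean. In the 10000-word run, a single measurement of about 64855 s dominates the mean. The sketch below uses only the figures copied from that output; `recorded_times` and `summary` are names introduced here for illustration.

```python
import statistics

# Times (s) for 10 queries, copied from the output above, keyed by sublexicon size n.
recorded_times = {
    10: [42.809144020080566, 54.05713701248169, 62.51040315628052],
    100: [103.55769801139832, 99.49169898033142, 101.20855593681335],
    1000: [442.6917288303375, 557.3850648403168, 492.1795291900635],
    10000: [64854.661350011826, 4353.231388092041, 5082.366680145264],
}

summary = {}
for n, times in sorted(recorded_times.items()):
    summary[n] = {
        'mean': statistics.mean(times),
        'median': statistics.median(times),
    }
    print('n = {0:>5}: mean = {1:10.2f} s, median = {2:10.2f} s'.format(
        n, summary[n]['mean'], summary[n]['median']))
```

For the three smaller sizes the mean and median are close. For $n = 10000$ the mean is several times the median, so the median is the more representative figure for that row.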