In [1]:
#Prints **all** console output, not just last item in cell 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

**Eric Meinhardt / emeinhardt@ucsd.edu**

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Motivation" data-toc-modified-id="Motivation-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Motivation</a></span></li><li><span><a href="#Use" data-toc-modified-id="Use-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Use</a></span></li><li><span><a href="#Import-libraries-and-data" data-toc-modified-id="Import-libraries-and-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Import libraries and data</a></span></li><li><span><a href="#Basic-representations---words-and-prefixes" data-toc-modified-id="Basic-representations---words-and-prefixes-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Basic representations - words and prefixes</a></span></li><li><span><a href="#Basic-vectorized-representations" data-toc-modified-id="Basic-vectorized-representations-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Basic vectorized representations</a></span><ul class="toc-item"><li><span><a href="#One-hot-representations" data-toc-modified-id="One-hot-representations-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>One-hot representations</a></span></li><li><span><a href="#Padding-and-trimming-to-create-a-fixed-size-representation" data-toc-modified-id="Padding-and-trimming-to-create-a-fixed-size-representation-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Padding and trimming to create a fixed-size representation</a></span></li></ul></li><li><span><a href="#Prefixes" data-toc-modified-id="Prefixes-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Prefixes</a></span><ul class="toc-item"><li><span><a href="#Generating-prefixes" data-toc-modified-id="Generating-prefixes-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Generating prefixes</a></span></li><li><span><a href="#Generating-padded/trimmed-prefixes" data-toc-modified-id="Generating-padded/trimmed-prefixes-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Generating padded/trimmed prefixes</a></span></li><li><span><a 
href="#Detecting-whether-p-is-a-prefix-of-w" data-toc-modified-id="Detecting-whether-p-is-a-prefix-of-w-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Detecting whether <code>p</code> is a prefix of <code>w</code></a></span></li><li><span><a href="#Generating-a-prefix-word-relation" data-toc-modified-id="Generating-a-prefix-word-relation-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Generating a <code>prefix-word</code> relation</a></span></li><li><span><a href="#The-(p,w,l)-relation-where-w-trimmed-to-l-is-p" data-toc-modified-id="The-(p,w,l)-relation-where-w-trimmed-to-l-is-p-6.5"><span class="toc-item-num">6.5&nbsp;&nbsp;</span>The <code>(p,w,l)</code> relation where <code>w</code> trimmed to <code>l</code> is <code>p</code></a></span></li></ul></li><li><span><a href="#Hamming-distance" data-toc-modified-id="Hamming-distance-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Hamming distance</a></span><ul class="toc-item"><li><span><a href="#Distance-between-symbol-vectors" data-toc-modified-id="Distance-between-symbol-vectors-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Distance between symbol vectors</a></span></li><li><span><a href="#Hamming-distance-between-stacks-of-symbol-vectors-(strings)" data-toc-modified-id="Hamming-distance-between-stacks-of-symbol-vectors-(strings)-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Hamming distance between stacks of symbol vectors (strings)</a></span></li><li><span><a href="#Distance-between-a-string-and-a-stack-of-strings" data-toc-modified-id="Distance-between-a-string-and-a-stack-of-strings-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Distance between a string and a stack of strings</a></span></li><li><span><a href="#Hamming-distance-between-every-pair-of-strings-in-a-stack" data-toc-modified-id="Hamming-distance-between-every-pair-of-strings-in-a-stack-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>Hamming distance between every pair of strings in a 
stack</a></span></li></ul></li><li><span><a href="#$k$-cousin-calculation" data-toc-modified-id="$k$-cousin-calculation-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>$k$-cousin calculation</a></span><ul class="toc-item"><li><span><a href="#Definitions,-motivation,-and-calculation-sketch" data-toc-modified-id="Definitions,-motivation,-and-calculation-sketch-8.1"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>Definitions, motivation, and calculation sketch</a></span></li></ul></li><li><span><a href="#Export" data-toc-modified-id="Export-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Export</a></span></li></ul></div>

# Motivation

Given a finite set of strings (wordforms) $L$, we may want to efficiently calculate for subsequent use
 - the prefix relation between $L$ and its set of prefixes $P$: for each string $s \in L$, which $p \in P$ are prefixes of $s$, and for each $p \in P$, which strings $s \in L$ have $p$ as a prefix
 - the matrix of Hamming distances between all pairs of strings (full wordforms) in $L$
 - the matrix of Hamming distances between all pairs of prefixes of strings in $L$
 - the "$k$-cousin" function/relation between strings in $L$ and prefixes of strings of $L$. (See the *$k$-cousin calculation* section for an explanation.)

This notebook documents vectorized and otherwise parallelized code for such calculations.
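
As a plain-Python point of reference for the first item above, here is a minimal sketch of the prefix relation on a toy lexicon. The helper `get_prefixes`, the toy wordforms, and the matrix orientation (prefixes as rows, wordforms as columns) are illustrative assumptions, not the notebook's own definitions.

```python
import numpy as np

def get_prefixes(word):
    """Return all non-empty prefixes of a dot-separated segment string (hypothetical helper)."""
    segs = word.split('.')
    return {'.'.join(segs[:i]) for i in range(1, len(segs) + 1)}

# Toy lexicon (assumed for illustration only)
lexicon = ('k.a.t', 'k.a.p', 't.a')
prefixes = tuple(sorted(set().union(*(get_prefixes(w) for w in lexicon))))

# relation[i, j] == 1 iff prefixes[i] is a prefix of lexicon[j]
relation = np.array([[p in get_prefixes(w) for w in lexicon] for p in prefixes],
                    dtype=np.uint8)
```

Reading a row of `relation` answers "which wordforms have this prefix?"; reading a column answers "which prefixes does this wordform have?".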

# Use

Given 
 - a filepath $p$ to *either*
    - a conditional distribution on segmental wordforms given an orthographic wordform $p(W|V)$
    - an unconditioned distribution on segmental wordforms $p(W)$
 - an output filepath prefix $o$
 
this notebook calculates and writes to file 
 - the prefix relation of $W$
 - the Hamming distance between all pairs of wordforms in $W$
   - **NB:** for storage and time complexity reasons, $-1$ is used instead of $\infty$ to represent the distance between strings of differing length. ($\infty$ requires floats, whereas everything else here is nicely represented using (u)int8 types; the same note applies to the other two output matrices representing Hamming distance information.) 
 - the Hamming distance between all pairs of prefixes of $W$
 - the $k$-cousin relation/function between all prefixes of $W$ and $W$
   - **NB:** the matrix describing this is memory-mapped, unlike the other two.
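
The $-1$ convention for strings of differing length can be sketched on a toy example. This is a plain-Python illustration under assumed names (`hamming_or_sentinel`, `words`), not the vectorized calculation this notebook uses. Note that storing $-1$ requires a signed type such as `int8`.

```python
import numpy as np

def hamming_or_sentinel(u, v, sentinel=-1):
    """Hamming distance between two segment tuples, or `sentinel` if their lengths differ."""
    if len(u) != len(v):
        return sentinel
    return sum(a != b for a, b in zip(u, v))

# Toy wordforms as tuples of segments (assumed for illustration only)
words = [('k', 'a', 't'), ('k', 'a', 'p'), ('t', 'a')]

# Pairwise distance matrix stored compactly as signed 8-bit integers
D = np.array([[hamming_or_sentinel(u, v) for v in words] for u in words],
             dtype=np.int8)
```

Here `D` occupies one byte per entry, whereas a float matrix holding `np.inf` would need at least two bytes per entry (`float16`) and typically eight (`float64`).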

# Import libraries and data

In [2]:
from os import getcwd, chdir, listdir, path, mkdir, makedirs

In [3]:
from boilerplate import *

In [4]:
from funcy import *

In [5]:
# Parameters

p = ''
# p = 'LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_CM_filtered_LM_filtered.pW_V.json'
# p = 'LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_CM_filtered_LM_filtered_trim.pW_V.json'
# p = 'LTR_newdic_destressed_aligned_w_GD_AmE_destressed/LTR_newdic_destressed_aligned_CM_filtered_LM_filtered.pW_V.json'
# p = 'LTR_newdic_destressed_aligned_w_GD_AmE_destressed/LTR_newdic_destressed_aligned_CM_filtered_LM_filtered_trim.pW_V.json'
# p = 'LTR_CMU_destressed_aligned_w_GD_AmE_destressed/LTR_CMU_destressed_aligned_CM_filtered_LM_filtered.pW_V.json'
# p = 'LTR_CMU_destressed_aligned_w_GD_AmE_destressed/LTR_CMU_destressed_aligned_CM_filtered_LM_filtered_trim.pW_V.json'
# p = 'LTR_NXT_swbd_destressed_aligned_w_GD_AmE_destressed/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_trim.pW_V.json'
# p = 'LTR_NXT_swbd_destressed_aligned_w_GD_AmE_destressed/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered.pW_V.json'
# p = 'CM_AmE_destressed_aligned_w_LTR_NXT_swbd_destressed_pseudocount0.01/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_pX0X1X2.json'

o = ''
# o = 'LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_CM_filtered_LM_filtered'
# o = 'LTR_Buckeye_aligned_w_GD_AmE_destressed/LTR_Buckeye_aligned_CM_filtered_LM_filtered_trim'
# o = 'LTR_newdic_destressed_aligned_w_GD_AmE_destressed/LTR_newdic_destressed_aligned_CM_filtered_LM_filtered'
# o = 'LTR_newdic_destressed_aligned_w_GD_AmE_destressed/LTR_newdic_destressed_aligned_CM_filtered_LM_filtered_trim'
# o = 'LTR_CMU_destressed_aligned_w_GD_AmE_destressed/LTR_CMU_destressed_aligned_CM_filtered_LM_filtered'
# o = 'LTR_CMU_destressed_aligned_w_GD_AmE_destressed/LTR_CMU_destressed_aligned_CM_filtered_LM_filtered_trim'
# o = 'LTR_NXT_swbd_destressed_aligned_w_GD_AmE_destressed/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered'
# o = 'LTR_NXT_swbd_destressed_aligned_w_GD_AmE_destressed/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_trim'
# o = 'CM_AmE_destressed_aligned_w_LTR_NXT_swbd_destressed_pseudocount0.01/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2'

g = ''
# g = 'False'

In [6]:
if g == '' or g == 'True' or g == True:
    g = True
elif g == 'False' or g == False:
    g = False
else:
    raise Exception(f"g must be one of {'', 'True', 'False'}, got {g} instead.")

In [7]:
from probdist import *
from string_utils import *

In [8]:
from tqdm import tqdm

from joblib import Parallel, delayed

J = -1
BACKEND = 'multiprocessing'
# BACKEND = 'loky'
V = 10
PREFER = 'processes'
# PREFER = 'threads'

def identity(x):
    return x

def par(gen_expr):
    return Parallel(n_jobs=J, backend=BACKEND, verbose=V, prefer=PREFER)(gen_expr)

In [9]:
import sparse

In [10]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           125G        868M        8.2G        105M        116G        123G
Swap:          2.0G        104M        1.9G


In [11]:
if 'pW_V' in p:
    pW_V = condDistsAsProbDists(importProbDist(p))
elif 'pX0X1X2' in p:
    pW = ProbDist(importProbDist(p))
else:
    raise Exception(f"Unknown type of 'p' parameter = {p}")

In [12]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           125G        871M        8.2G        105M        116G        123G
Swap:          2.0G        104M        1.9G


In [13]:
testing = False
benchmark = False

In [14]:
my_dtype = np.int8

In [15]:
import torch

In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

#Additional Info when using cuda
if device.type == 'cuda':
#     if g and l:
#         print("Disabling 'parallelize' flag...")
#         l = False
#     import cupy
    
    print(torch.cuda.get_device_name(0))
    total_mem_MB = torch.cuda.get_device_properties(device).total_memory / 1e6
    print('Total Memory: {0}'.format(total_mem_MB) )
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_cached(0)/1024**3,1), 'GB')
    if torch.cuda.device_count() > 1:
        print(torch.cuda.get_device_name(1))
        print('Memory Usage:')
        print('Allocated:', round(torch.cuda.memory_allocated(1)/1024**3,1), 'GB')
        print('Cached:   ', round(torch.cuda.memory_cached(1)/1024**3,1), 'GB')
elif g:
    print("g set to 'True', but torch cannot find a GPU. Setting g to 'False'.")
    g = False
else:
    pass
#     raise Exception(f"g set to 'True' but torch cannot find a GPU.")

Using device: cuda

GeForce RTX 2070
Total Memory: 8367.439872
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB


In [17]:
gpu = torch.device('cuda')
cpu = torch.device('cpu')

my_device = cpu

In [18]:
cuda_ft = torch.cuda.FloatTensor
cuda_dt = torch.cuda.DoubleTensor

ft = torch.FloatTensor
dt = torch.DoubleTensor

my_ft = ft
my_dt = dt

my_type = my_ft
# my_type = my_dt

torch.set_default_tensor_type(my_type)

my_cpu_type = torch.int8
my_cuda_type = torch.float16
# my_tt = torch.float32
# my_tt = torch.float64

# Basic representations - words and prefixes

These are the reference objects we will work with and use to check the vectorized calculations.

In [19]:
if 'pW_V' in p:
    # Vs = set(pW_V.keys())
    Ws = union(mapValues(lambda dist: set(conditions(dist)), 
                         pW_V).values())
elif 'pX0X1X2' in p:
    Ws = set(conditions(pW))
else:
    raise Exception(f"Unknown type of 'p' parameter = {p}")

# len(Vs)
len(Ws)

6737

In [20]:
Ws_t = tuple(sorted(list(Ws)))

In [21]:
#≈200s on CMU on solomonoff
Ps = union(list(par(delayed(getPrefixes)(w) for w in Ws)))
# Ps = union(par(delayed(getPrefixes)(w) for w in Ws))
# Ps = union([getPrefixes(w) for w in Ws])
Ps_t = tuple(sorted(list(Ps)))
len(Ps_t)

[Parallel(n_jobs=-1)]: Using backend MultiprocessingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0030s.) Setting batch_size=132.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 6737 out of 6737 | elapsed:    0.1s finished


21180

# Basic vectorized representations

We will want to work with 
 1. one-hot vector-based representations of strings
 2. fixed-dimension representations of strings

To support #2, we will want to pad or trim one-hot representations of strings. Trimming removes material from the right edge of the string (i.e. it de-suffixes the string).

## One-hot representations

In [22]:
def to_uint8(arr):
    return arr.astype(np.uint8)

np.ones(3).dtype
to_uint8(np.ones(3)).dtype

dtype('float64')

dtype('uint8')

In [23]:
Xs = lexiconToInventory(Ws)
len(Xs)

Xmap = seqsToIndexMap(Xs)
XOHmap = seqsToOneHotMap(Xs)
# XOHmap = mapValues(to_uint8, seqsToOneHotMap(Xs))

def dsToUniphoneIndices(ds, uniphoneToIndexMap):
    uniphoneSeq = ds2t(ds)
    return np.array([uniphoneToIndexMap[uniphone] for uniphone in uniphoneSeq], dtype=np.uint8)

def dsToUniphoneOHs(ds, uniphoneToOHmap):
    uniphoneSeq = ds2t(ds)
    return np.array([uniphoneToOHmap[uniphone] for uniphone in uniphoneSeq], dtype=np.uint8)

dsToUniphoneIndices('t.i.f.l', Xmap)
dsToUniphoneOHs('t.i.f.l', XOHmap)

41

array([18,  9,  6, 12], dtype=uint8)

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=uint8)

In [24]:
OHXmap = oneHotToSeqMap(Xs)

def OHsToDS(OHs, OHtoUniphoneMap):
    return t2ds([OHtoUniphoneMap(OH)
                 for OH in OHs if OH.sum() > 0])

#should give us back what we put in
OHsToDS(dsToUniphoneOHs('t.i.f.l', XOHmap),
        OHXmap)

#should yield the empty string
OHsToDS(np.array([0]), OHXmap)

't.i.f.l'

''

In [25]:
random_w = choice(Ws_t); random_w
len(ds2t(random_w))

'⋊.oʊ.ɑ.l.⋉.⋉'

6

In [26]:
dsToUniphoneIndices(random_w, Xmap)
random_w_OH = dsToUniphoneOHs(random_w, XOHmap)
random_w_OH.shape

array([40, 15, 27, 12, 39, 39], dtype=uint8)

(6, 41)

## Padding and trimming to create a fixed-size representation

The padding one-hot vector is **the zero vector**.
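
One convenient consequence of padding with zero vectors, sketched here on a toy 4-symbol inventory (the inventory size and variable names are assumptions for illustration): row sums act as a mask separating real symbols from padding, and padding rows contribute nothing to elementwise products.

```python
import numpy as np

n_symbols = 4
eye = np.eye(n_symbols, dtype=np.uint8)

# One-hot rows for a length-3 string over symbol indices 0, 2, 1
a = eye[[0, 2, 1]]

# Pad on the right with two all-zero rows to reach a fixed length of 5
a_padded = np.pad(a, ((0, 2), (0, 0)), mode='constant', constant_values=0)

# 1 for rows holding a real symbol, 0 for padding rows
row_sums = a_padded.sum(axis=1)
true_length = int(row_sums.sum())

# Position-wise agreement of the string with itself: padding rows add 0
matches = int((a_padded * a_padded).sum())
```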

In [27]:
def padWord(w_OHs, goal_length):
    l = w_OHs.shape[0]
    if l > goal_length:
        raise Exception(f"word length = {l} > goal length = {goal_length}")
    if l == goal_length:
        return w_OHs
    return np.pad(w_OHs,
                  ((0, goal_length - l), (0,0)),
                  mode='constant',
                  constant_values=0)


def trimWord(w_OHs, goal_length):
    l = w_OHs.shape[0]
    if l < goal_length:
        raise Exception(f"word length = {l} < goal length = {goal_length}")
    if l == goal_length:
        return w_OHs
    return w_OHs[:goal_length]


def adjustWord(w_OHs, goal_length):
    l = w_OHs.shape[0]
    if l == goal_length:
        return w_OHs
    elif l < goal_length:
        return padWord(w_OHs, goal_length)
    else:
        return trimWord(w_OHs, goal_length)

    
def lexiconToFixedSizeOHs(Ws, fixed_size = None):
    maxL = max({len(ds2t(w)) for w in Ws})
    if fixed_size is None:
        fixed_size = maxL    
    
    Ws_OH = (dsToUniphoneOHs(w, XOHmap) for w in Ws)
    Ws_OH_adjusted = np.array([adjustWord(w_OH, fixed_size) for w_OH in Ws_OH])
    return Ws_OH_adjusted

In [28]:
random_w_OH
random_w_OH.shape

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
      dtype=uint8)

(6, 41)

In [29]:
padWord(random_w_OH, random_w_OH.shape[0] + 1)
assert np.array_equal(padWord(random_w_OH, random_w_OH.shape[0] + 1), 
                      adjustWord(random_w_OH, random_w_OH.shape[0] + 1))

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=uint8)

In [30]:
trimWord(random_w_OH, random_w_OH.shape[0] - 1)
assert np.array_equal(trimWord(random_w_OH, random_w_OH.shape[0] - 1), 
                      adjustWord(random_w_OH, random_w_OH.shape[0] - 1))

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
      dtype=uint8)

In [31]:
Ws_npf = lexiconToFixedSizeOHs(Ws_t)
Ws_npf.dtype
Ws_npf.shape #:: (|Ws|, maxL, |Xs|) = (n, L_bar, s)
Ws_npf.nbytes / 1e6
Ws_npf.nbytes / 1e9

dtype('uint8')

(6737, 6, 41)

1.657302

0.001657302

In [32]:
Ws_sf = sparse.COO.from_numpy(Ws_npf)
Ws_sf.dtype
Ws_sf.shape
Ws_sf.nbytes / 1e6
Ws_sf.density

dtype('uint8')

(6737, 6, 41)

1.01055

0.024390243902439025

We may also want to detect and/or undo padding/trimming:

In [33]:
#Recall: a padded OH matrix will have at least one row that is a zero vector
def isPaddedOHstack(p_OH):
    # np.prod replaces np.product, which was removed in NumPy 2.0
    return not np.prod( np.sum(p_OH, axis=1) )

def unpad(padded_p_OH):
#     if not isPaddedOHstack(p_OH):
#         return padded_p_OH
    rowIsUnPadded = np.sum(padded_p_OH, axis=1)
    isPadded = not np.prod(rowIsUnPadded)
    if not isPadded:
        return padded_p_OH
    nonPaddingRows = np.array([padded_p_OH_row 
                               for i, padded_p_OH_row in enumerate(padded_p_OH) 
                               if rowIsUnPadded[i]])
    return nonPaddingRows
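
The row-by-row filtering in `unpad` can also be expressed with a boolean mask. The following is a minimal sketch under the assumption that padding rows are exactly the all-zero rows; `unpad_masked` and `is_padded` are hypothetical names, not functions defined elsewhere in this notebook.

```python
import numpy as np

def is_padded(oh):
    """True iff at least one row of the 2-D one-hot stack is all zeros."""
    return not oh.any(axis=1).all()

def unpad_masked(padded_oh):
    """Drop all-zero (padding) rows from a 2-D one-hot stack via boolean indexing."""
    keep = padded_oh.any(axis=1)   # True for rows that hold a real symbol
    return padded_oh[keep]

# Toy stack: two real symbols followed by one padding row
x = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=np.uint8)
x_unpadded = unpad_masked(x)
```

One difference from `unpad` worth noting: if every row is padding, this sketch returns an array of shape `(0, n_symbols)` rather than an empty 1-D array.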

In [36]:
def containsAnyPaddedOHstacks(L_OHs):
    return any(map(isPaddedOHstack, L_OHs))

In [37]:
containsAnyPaddedOHstacks(Ws_npf)

False

In [38]:
w0 = Ws_t[0]; w0
w0_l = len(ds2t(Ws_t[0])); w0_l

# random_w_OH = dsToUniphoneOHs(random_w, XOHmap)
unpadded_w0_OH_rep = dsToUniphoneOHs(w0, XOHmap); unpadded_w0_OH_rep.shape
OHsToDS(unpadded_w0_OH_rep, OHXmap)
assert not isPaddedOHstack(unpadded_w0_OH_rep)

padded_w0_OH_rep = Ws_npf[0]; padded_w0_OH_rep.shape
OHsToDS(padded_w0_OH_rep, OHXmap)
assert isPaddedOHstack(padded_w0_OH_rep) or not containsAnyPaddedOHstacks(Ws_npf)

'⋊.aɪ.b.aɪ.⋉.⋉'

6

(6, 41)

'⋊.aɪ.b.aɪ.⋉.⋉'

(6, 41)

'⋊.aɪ.b.aɪ.⋉.⋉'

In [39]:
padded_w0_OH_rep.shape
padded_w0_OH_rep

(6, 41)

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
      dtype=uint8)

In [40]:
np.sum(padded_w0_OH_rep, axis=1)
np.sum(padded_w0_OH_rep, axis=1).sum()

array([1, 1, 1, 1, 1, 1], dtype=uint64)

6

In [41]:
def trueLength(possibly_padded_OHs):
    return np.sum(possibly_padded_OHs, axis=1).sum()

def unpaddedMask(possibly_padded_OHs):
    return np.sum(possibly_padded_OHs, axis=1)

trueLength(padded_w0_OH_rep)
assert trueLength(padded_w0_OH_rep) == unpadded_w0_OH_rep.shape[0]

6

# Prefixes

## Generating prefixes

In [42]:
random_w
random_w_OH
a_prefix_of_random_w_OH = random_w_OH[:-3] # <- that's a prefix
a_prefix_of_random_w = OHsToDS(a_prefix_of_random_w_OH, OHXmap)
a_prefix_of_random_w

'⋊.oʊ.ɑ.l.⋉.⋉'

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
      dtype=uint8)

'⋊.oʊ.ɑ'

In [43]:
len(ds2t(random_w))
random_w_OH.shape
random_w_OH[:2].shape
random_w_OH[:10].shape

6

(6, 41)

(2, 41)

(6, 41)

In [44]:
def getPrefixes_OH(w_OH):
    return [w_OH] + [w_OH[:-i] for i in range(1,len(w_OH))]

random_w_OH.shape
lmap(lambda m: m.shape, getPrefixes_OH(random_w_OH))
lmap(lambda m: np.array_equal(m, random_w_OH), getPrefixes_OH(random_w_OH)) #< only the leftmost value should be True

(6, 41)

[(6, 41), (5, 41), (4, 41), (3, 41), (2, 41), (1, 41)]

[True, False, False, False, False, False]

In [45]:
wordlengths = {len(ds2t(w)) for w in Ws}
wordlengths
wordlengths = tuple(range(min(wordlengths), max(wordlengths)+1))
wordlengths

{6}

(6,)

In [46]:
random_w_OH.shape
max(wordlengths)
diff = max(wordlengths) - random_w_OH.shape[0]; diff

random_w_OH_padded = np.pad(random_w_OH, 
                            ((0, max(wordlengths) - random_w_OH.shape[0]), (0,0)), 
                            mode='constant', 
                            constant_values=0.0)
random_w_OH_padded.shape
assert np.array_equal(random_w_OH_padded[:random_w_OH.shape[0]],
                      random_w_OH)
random_w_OH_padded[random_w_OH.shape[0]:].shape
assert np.array_equal(random_w_OH_padded[random_w_OH.shape[0]:], 
                      np.zeros((diff, random_w_OH.shape[1])))

(6, 41)

6

0

(6, 41)

(0, 41)

## Generating padded/trimmed prefixes

Let's incorporate padding and trimming...

In [47]:
def getPrefixes_OH(w_OH, padded_length=None):
    unpadded = [w_OH] + [w_OH[:-i] for i in range(1,len(w_OH))]
    if padded_length is None:
        return unpadded
    return list(map(lambda p_OH: padWord(p_OH, padded_length), unpadded))

In [48]:
random_w
random_w_OH.shape
list(map(lambda m: m.shape, getPrefixes_OH(random_w_OH)))
list(map(lambda m: np.array_equal(m, random_w_OH), getPrefixes_OH(random_w_OH)))  #< only the leftmost value should be True
list(map(lambda m: m.shape, getPrefixes_OH(random_w_OH, max(wordlengths))))

'⋊.oʊ.ɑ.l.⋉.⋉'

(6, 41)

[(6, 41), (5, 41), (4, 41), (3, 41), (2, 41), (1, 41)]

[True, False, False, False, False, False]

[(6, 41), (6, 41), (6, 41), (6, 41), (6, 41), (6, 41)]

In [49]:
padded_prefixes_random_w_OH = getPrefixes_OH(random_w_OH, max(wordlengths))

padded_prefixes_random_w_OH2 = lmap(partial(adjustWord, goal_length=max(wordlengths)), 
                                    getPrefixes_OH(random_w_OH))

# type(padded_prefixes_random_w_OH)
# type(padded_prefixes_random_w_OH2)
assert len(padded_prefixes_random_w_OH) == len(padded_prefixes_random_w_OH2)

for pair in zip(padded_prefixes_random_w_OH, padded_prefixes_random_w_OH2):
    assert np.array_equal(pair[0], pair[1])

Let's re-use the `adjustWord` function and return a fixed dimension ndarray...

In [50]:
#FINAL version
def getPrefixes_OH(w_OH, goal_length=None):
    my_prefixes = [w_OH] + [w_OH[:-i] for i in range(1,len(w_OH))]
    if goal_length is None:
        return my_prefixes
    return np.array(lmap(partial(adjustWord, goal_length=goal_length),
                         my_prefixes))

padded_prefixes_random_w_OH3 = getPrefixes_OH(random_w_OH, max(wordlengths))

for pair in zip(padded_prefixes_random_w_OH2, padded_prefixes_random_w_OH3):
    assert np.array_equal(pair[0], pair[1])

Downstream calculations probably only need prefixes of length 3 or more (because of triphones), but we leave that filtering to downstream notebooks and contexts of use.

In [51]:
only_viable_prefixes = False

In [52]:
if only_viable_prefixes:
    prefixlengths = range(3, max(wordlengths)+1)
else:
    prefixlengths = range(1, max(wordlengths)+1)
prefixlengths

range(1, 7)

In [53]:
prefixlengths
len(list(prefixlengths))

range(1, 7)

6

In [54]:
wordlengths
len(wordlengths)

(6,)

1

Below is a sequence of fixed-length representations of the lexicon, one per length `i`, in order of increasing length:
 - `Ps_l[i]` :: (|Ws|, i, |Xs|)
 - `Ps_l[i][j]` :: (i, |Xs|) is the matrix representing wordform `j` padded or trimmed to be length `i`
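
The layout described above can be sketched on a toy lexicon. Everything here (the 4-symbol inventory, `one_hot`, `adjust`, `decode`, and `stacks`) is a hypothetical stand-in for the notebook's own `XOHmap`, `adjustWord`, `OHsToDS`, and `Ps_l`.

```python
import numpy as np

symbols = ('a', 'k', 'p', 't')
index = {s: i for i, s in enumerate(symbols)}
eye = np.eye(len(symbols), dtype=np.uint8)

def one_hot(word):
    """Stack of one-hot rows for a dot-separated segment string."""
    return eye[[index[s] for s in word.split('.')]]

def adjust(oh, length):
    """Trim to `length` rows, or pad with zero rows up to `length`."""
    if oh.shape[0] >= length:
        return oh[:length]
    return np.pad(oh, ((0, length - oh.shape[0]), (0, 0)), mode='constant')

def decode(oh):
    """Map one-hot rows back to a dot-separated string, skipping padding rows."""
    return '.'.join(symbols[int(row.argmax())] for row in oh if row.any())

lexicon = ('k.a.t', 'k.a.p', 't.a')
max_len = max(len(w.split('.')) for w in lexicon)

# stacks[l] has shape (len(lexicon), l, len(symbols)); index 0 is a placeholder
stacks = [None] + [np.array([adjust(one_hot(w), l) for w in lexicon])
                   for l in range(1, max_len + 1)]
```

With this layout, `decode(stacks[2][0])` recovers the length-2 prefix of the first wordform, and a wordform shorter than `l` appears in `stacks[l]` with trailing zero rows.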

In [55]:
#32s CMU/solomonoff
#13s CMU/sidious
Ps_l = [None for each in range(min(prefixlengths))] + list(par(delayed(lexiconToFixedSizeOHs)(Ws_t, l) for l in prefixlengths))

[Parallel(n_jobs=-1)]: Using backend MultiprocessingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0538s.) Setting batch_size=6.
[Parallel(n_jobs=-1)]: Done   2 out of   6 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done   3 out of   6 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done   4 out of   6 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:    0.1s finished


In [56]:
Ps_l[4].shape

(6737, 4, 41)

In [57]:
max_length = max(wordlengths); max_length

6

In [58]:
for l in prefixlengths:
    Ps_l[l].shape

sum([Ps_l[l].nbytes / 1e9 for l in prefixlengths])

(6737, 1, 41)

(6737, 2, 41)

(6737, 3, 41)

(6737, 4, 41)

(6737, 5, 41)

(6737, 6, 41)

0.005800557

In [59]:
OHsToDS(Ps_l[5][Ws_t.index(random_w)], OHXmap)

'⋊.oʊ.ɑ.l.⋉'

In [60]:
random_w
Ws_t.index(random_w)
OHsToDS(Ws_npf[Ws_t.index(random_w)], OHXmap)
for l in prefixlengths:
    OHsToDS(Ps_l[l][Ws_t.index(random_w)], OHXmap)

'⋊.oʊ.ɑ.l.⋉.⋉'

3054

'⋊.oʊ.ɑ.l.⋉.⋉'

'⋊'

'⋊.oʊ'

'⋊.oʊ.ɑ'

'⋊.oʊ.ɑ.l'

'⋊.oʊ.ɑ.l.⋉'

'⋊.oʊ.ɑ.l.⋉.⋉'

In [62]:
random_l = choice(list(wordlengths))
random_l

Ps_l[random_l][Ws_t.index(random_w)].shape
np.sum(Ps_l[random_l][Ws_t.index(random_w)], axis=1)
np.prod( np.sum(Ps_l[random_l][Ws_t.index(random_w)], axis=1) )
not np.prod( np.sum(Ps_l[random_l][Ws_t.index(random_w)], axis=1) )

6

(6, 41)

array([1, 1, 1, 1, 1, 1], dtype=uint64)

1

False

In [63]:
def retrievePrefixes(w_idx=None, w=None, Ws_t=None, Ps_t=None, max_l=None, asType='indices'):
    if asType == 'indices' and Ps_t is None:
        raise Exception("Must specify sorted prefix iterable if asType = 'indices'")
    if w_idx is None and (w is None or Ws_t is None):
        raise Exception("Not enough information provided to specify a wordform index.")
    
    if max_l is None and Ws_t is not None:
        max_l = max({len(ds2t(w)) for w in Ws_t})
    if max_l is None and Ws_t is None:
        max_l = max({len(ds2t(w)) for w in Ps_t})
    
    
    if w_idx is None:
        w_idx = Ws_t.index(w)
    
    prefixSuperset = [Ps_l[l][w_idx] for l in range(min(prefixlengths), max_l+1)]
    if asType == 'padded OHs':
        return prefixSuperset
    
    isPadded = np.array([isPaddedOHstack(p_OH) for p_OH in prefixSuperset])
    uniqueOHs = [p_OH for i, p_OH in enumerate(prefixSuperset) if not isPadded[i]]
    if asType == 'OHs':
        return uniqueOHs
    
    uniqueStrings = list(map(lambda p_OH: OHsToDS(p_OH, OHXmap), uniqueOHs))
    if asType == 'ds':
        return uniqueStrings
    
    uniqueIndices = list(map(lambda p: Ps_t.index(p), uniqueStrings))
    if asType == 'indices':
        return uniqueIndices
    raise Exception('Function should have returned something before now...')

In [64]:
my_max_l = max({len(ds2t(w)) for w in Ws_t})

In [65]:
longest_wordforms = {w for w in Ws if len(ds2t(w)) == my_max_l}; longest_wordforms
longest_wordform = list(longest_wordforms)[0]; longest_wordform

{'⋊.p.ʌ.z.⋉.⋉',
 '⋊.ɹ.æ.s.⋉.⋉',
 '⋊.m.k.oʊ.⋉.⋉',
 '⋊.ɚ.m.j.⋉.⋉',
 '⋊.ɛ.v.ɪ.⋉.⋉',
 '⋊.i.ə.n.⋉.⋉',
 '⋊.k.d.eɪ.⋉.⋉',
 '⋊.i.n.i.⋉.⋉',
 '⋊.oʊ.k.oʊ.⋉.⋉',
 '⋊.eɪ.tʃ.d.⋉.⋉',
 '⋊.k.b.ʌ.⋉.⋉',
 '⋊.s.t.u.⋉.⋉',
 '⋊.ʌ.l.tʃ.⋉.⋉',
 '⋊.f.t.l.⋉.⋉',
 '⋊.m.ɪ.ŋ.⋉.⋉',
 '⋊.ŋ.l.i.⋉.⋉',
 '⋊.k.æ.k.⋉.⋉',
 '⋊.ŋ.g.ɚ.⋉.⋉',
 '⋊.s.g.ʌ.⋉.⋉',
 '⋊.z.ɪ.k.⋉.⋉',
 '⋊.j.u.ʃ.⋉.⋉',
 '⋊.k.oʊ.s.⋉.⋉',
 '⋊.i.z.d.⋉.⋉',
 '⋊.s.k.ɪ.⋉.⋉',
 '⋊.w.aɪ.i.⋉.⋉',
 '⋊.s.l.ə.⋉.⋉',
 '⋊.aɪ.t.ɹ.⋉.⋉',
 '⋊.g.eɪ.m.⋉.⋉',
 '⋊.l.p.ɑ.⋉.⋉',
 '⋊.ɪ.n.u.⋉.⋉',
 '⋊.aɪ.t.m.⋉.⋉',
 '⋊.ɪ.d.ɚ.⋉.⋉',
 '⋊.θ.i.d.⋉.⋉',
 '⋊.k.ɚ.k.⋉.⋉',
 '⋊.ɑ.ŋ.t.⋉.⋉',
 '⋊.p.ɚ.tʃ.⋉.⋉',
 '⋊.n.i.ɪ.⋉.⋉',
 '⋊.w.eɪ.f.⋉.⋉',
 '⋊.s.ə.z.⋉.⋉',
 '⋊.æ.d.ɹ.⋉.⋉',
 '⋊.ɪ.z.eɪ.⋉.⋉',
 '⋊.l.ɚ.dʒ.⋉.⋉',
 '⋊.l.oʊ.ʒ.⋉.⋉',
 '⋊.s.i.aɪ.⋉.⋉',
 '⋊.ŋ.g.ɪ.⋉.⋉',
 '⋊.t.ɚ.b.⋉.⋉',
 '⋊.b.ɚ.l.⋉.⋉',
 '⋊.ʌ.b.d.⋉.⋉',
 '⋊.i.ð.z.⋉.⋉',
 '⋊.u.m.ə.⋉.⋉',
 '⋊.s.oʊ.m.⋉.⋉',
 '⋊.ɹ.æ.v.⋉.⋉',
 '⋊.ʃ.ə.z.⋉.⋉',
 '⋊.p.ʌ.n.⋉.⋉',
 '⋊.g.ə.z.⋉.⋉',
 '⋊.b.ɑ.d.⋉.⋉',
 '⋊.dʒ.m.ɪ.⋉.⋉',
 '⋊.u.b.i.⋉.⋉',
 '⋊.b.ɪ.z.⋉.⋉',
 '⋊.æ.t.l.⋉.⋉',
 '⋊.ɚ.g.ɹ.⋉.⋉',
 '⋊.

'⋊.p.ʌ.z.⋉.⋉'

In [66]:
random_w
retrievePrefixes(w=random_w, Ws_t=Ws_t, max_l=my_max_l, asType='ds')
retrievePrefixes(w=random_w, Ws_t=Ws_t, Ps_t=Ps_t, max_l=my_max_l, asType='indices')

'⋊.oʊ.ɑ.l.⋉.⋉'

['⋊', '⋊.oʊ', '⋊.oʊ.ɑ', '⋊.oʊ.ɑ.l', '⋊.oʊ.ɑ.l.⋉', '⋊.oʊ.ɑ.l.⋉.⋉']

[0, 9159, 9597, 9601, 9602, 9603]

In [67]:
longest_wordform
retrievePrefixes(w=longest_wordform, Ws_t=Ws_t, max_l=my_max_l, asType='ds')
retrievePrefixes(w=longest_wordform, Ws_t=Ws_t, Ps_t=Ps_t, max_l=my_max_l, asType='indices')

'⋊.p.ʌ.z.⋉.⋉'

['⋊', '⋊.p', '⋊.p.ʌ', '⋊.p.ʌ.z', '⋊.p.ʌ.z.⋉', '⋊.p.ʌ.z.⋉.⋉']

[0, 9756, 10442, 10461, 10462, 10463]

## Detecting whether `p` is a prefix of `w`

In [68]:
#naive implementation
# could be made more efficient if that's important
def is_a_prefix(p_OH, w_OH):
    unpadded_p_OH, unpadded_w_OH = unpad(p_OH), unpad(w_OH)
    p_l = unpadded_p_OH.shape[0]
    w_l = unpadded_w_OH.shape[0]
    if p_l > w_l:
#         print('case 1')
        return False
    elif p_l == w_l:
#         print('case 2')
        return np.array_equal(unpadded_p_OH, unpadded_w_OH)
    else: #p_l < w_l
#         print('case 3')
        trimmed_w_OH = unpadded_w_OH[:p_l]
        return np.array_equal(unpadded_p_OH, trimmed_w_OH)
#         return np.array_equal(np.dot(unpadded_p_OH, 
#                                      trimmed_w_OH.T),
#                               np.eye(p_l))

In [69]:
random_w
a_prefix_of_random_w
random_w_OH.shape
a_prefix_of_random_w_OH.shape
assert (a_prefix_of_random_w in getPrefixes(random_w)) == is_a_prefix(a_prefix_of_random_w_OH, random_w_OH)
lmap(lambda p: is_a_prefix(p, random_w_OH),
     getPrefixes_OH(random_w_OH))
' '
random_other_p = choice(list(getPrefixes(choice(Ws_t))));
random_w
random_other_p
random_other_p_OH = dsToUniphoneOHs(random_other_p, XOHmap)
random_other_p_OH.shape

assert (random_other_p in getPrefixes(random_w)) == is_a_prefix(random_other_p_OH, random_w_OH) 

'⋊.oʊ.ɑ.l.⋉.⋉'

'⋊.oʊ.ɑ'

(6, 41)

(3, 41)

[True, True, True, True, True, True]

' '

'⋊.oʊ.ɑ.l.⋉.⋉'

'⋊.n.i'

(3, 41)

## Generating a `prefix-word` relation

In [70]:
prefix_relation_shape = (len(Ws_t), len(Ps_t))
prefix_relation_shape

(6737, 21180)

In [71]:
def prefixIndicesToOHslice(prefix_idxs, num_Ps_t):
    '''
    Takes a list of prefix indices (e.g. that are prefixes of some w)
    and returns a (dense) binary vector where those indices are 1 and
    others are zero.
    '''
    my_slice = np.zeros(shape=(num_Ps_t,), dtype=np.uint8)
#     for idx in prefix_idxs:
#         my_slice[idx] = 1.0
#     return my_slice
#     return np.put(my_slice, prefix_idxs, 1) #<<< returns None because numpy is stateful AF
    np.put(my_slice, prefix_idxs, 1)
    return my_slice

# retrievePrefixes(w_idx, Ps_t, asType='indices')

In [72]:
def slice_calc(w_idx):
    return prefixIndicesToOHslice(retrievePrefixes(w_idx=w_idx, 
                                                   Ps_t=Ps_t,
                                                   max_l=my_max_l,
                                                   asType='indices'), 
                                  len(Ps_t))

# ≈3m on CMU on solomonoff
# 50s CMU / sidious
prefix_relation_np = np.stack(list(par(delayed(slice_calc)(w_idx)
                                       for w_idx in np.arange(prefix_relation_shape[0]))))#, 
#                               dtype=np.uint8)

# prefix_relation_np = np.stack([prefixIndexListToSlice(retrievePrefixes(w_idx=w_idx, 
#                                                                        Ps_t=Ps_t, 
#                                                                        asType='indices'), 
#                                                       len(Ps_t))
#                                for w_idx in np.arange(prefix_relation_shape[1])])
prefix_relation_np.shape

[Parallel(n_jobs=-1)]: Using backend MultiprocessingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0049s.) Setting batch_size=80.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0886s.) Setting batch_size=360.
[Parallel(n_jobs=-1)]: Done 6737 out of 6737 | elapsed:    1.1s finished


(6737, 21180)
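
As a small-scale illustration of the relation built above, here is a minimal sketch on a hypothetical two-word lexicon. The names `words`, `prefixes`, `get_prefixes`, and `indices_to_row` are stand-ins invented for this example, not the notebook's own objects. Each row is one wordform, each column is one prefix, and a 1 marks "this prefix is a prefix of this wordform".

```python
import numpy as np

# Hypothetical toy stand-ins for Ws_t and Ps_t.
words = ('a.b', 'a.c')
prefixes = ('a', 'a.b', 'a.c')

def get_prefixes(w):
    """All non-empty prefixes of a dotted string."""
    symbols = w.split('.')
    return ['.'.join(symbols[:i]) for i in range(1, len(symbols) + 1)]

def indices_to_row(idxs, n):
    """Dense binary vector of length n with ones at idxs."""
    row = np.zeros(n, dtype=np.uint8)
    np.put(row, idxs, 1)  # modifies row in place; np.put itself returns None
    return row

# A dict lookup replaces repeated linear scans with .index().
p_index = {p: i for i, p in enumerate(prefixes)}

relation = np.stack([
    indices_to_row([p_index[p] for p in get_prefixes(w)], len(prefixes))
    for w in words
])
print(relation.shape)
print(relation)
```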

## The `(p,w,l)` relation where `w` trimmed to `l` is `p`

(The logical place for this calculation is here, but the motivation is given in the section on $k$-cousins.)

In [73]:
# p_to_l = {p:len(ds2t(p)) for p in Ps}

In [74]:
(len(Ws_t), len(Ps_t))
prefix_relation_np.shape

(6737, 21180)

(6737, 21180)

First, we want to be able to retrieve, for each prefix, the indices (or strings) of the wordforms that it is a prefix of...
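
A minimal sketch of that lookup, assuming a binary (wordform × prefix) relation matrix like the one built earlier. The toy `relation`, `words`, `prefixes`, and the helper `words_with_prefix` are hypothetical names for this example only. The wordforms that share a prefix are the nonzero rows of that prefix's column.

```python
import numpy as np

# Hypothetical toy data: rows are wordforms, columns are prefixes.
words = ('a.b', 'a.c')
prefixes = ('a', 'a.b', 'a.c')
relation = np.array([[1, 1, 0],
                     [1, 0, 1]], dtype=np.uint8)

p_index = {p: i for i, p in enumerate(prefixes)}

def words_with_prefix(p):
    """Indices of the wordforms that p is a prefix of."""
    column = relation[:, p_index[p]]
    return column.nonzero()[0]

print(words_with_prefix('a'))
print([words[i] for i in words_with_prefix('a.b')])
```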

In [75]:
# #est 30m on cmu+solomonoff

# #maps each prefix p to an array of wordform indices s.t.
# # p is a prefix of each of the wordforms with those indices
# # p_to_w_idxs = {p:prefix_relation_np[:,Ps_t.index(p)].nonzero()[0]
# #                for p in tqdm(Ps)}

# #est 17m on cmu/solomonoff
# # def p_to_w_idx_calc(p):
# #     return p, prefix_relation_np[:,Ps_t.index(p)].nonzero()[0]

# # p_to_w_idxs = dict(par(delayed(p_to_w_idx_calc)(p)
# #                        for p in Ps))

# def p_to_w_idxs(p):
#     return prefix_relation_np[:,Ps_t.index(p)].nonzero()[0]

In [76]:
# random_w
# a_prefix_of_random_w
# ' '
# Ws_t.index(random_w)
# p_to_w_idxs(a_prefix_of_random_w)
# lmap(lambda w_idx: Ws_t[w_idx], 
#      p_to_w_idxs(a_prefix_of_random_w))

In [77]:
#est 30-90m cmu+solomonoff
# p_to_ws = {p:set(map(lambda w_idx: Ws_t[w_idx],
#                      p_to_w_idxs(p)))
#            for p in tqdm(Ps, total=len(Ps))}

#est ?
# def p_to_ws_calc(p):
#     return p, set(map(lambda w_idx: Ws_t[w_idx],
#                       p_to_w_idxs(p)))

# p_to_ws = dict(par(delayed(p_to_ws_calc)(p)
#                    for p in Ps))

# def p_to_ws(p):
#     return set(map(lambda w_idx: Ws_t[w_idx],
#                    p_to_w_idxs(p)))

In [78]:
# p_to_ws(a_prefix_of_random_w)

In [79]:
# #est 10-15m cmu+solomonoff
# #maps each prefix index to an arbitrary wordform index s.t.
# # that prefix is a prefix of that wordform
# # p_idx_to_w_idx = np.array([p_to_w_idxs(Ps_t[p_idx])[0]
# #                            for p_idx in tqdm(np.arange(len(Ps_t)), total=len(Ps_t))], 
# #                           dtype=np.int8)

# def p_idx_to_w_idx_calc(p_idx):
#     indices = p_to_w_idxs(Ps_t[p_idx])
#     if len(indices) > 0:
#         return indices[0]
#     else:
#         return -1

# # ≈6m cmu+solomonoff
# # ?m cmu+sidious
# # ≈1.3m cmu+wittgenstein
# p_idx_to_w_idx = np.array(list(par(delayed(p_idx_to_w_idx_calc)(p_idx)
#                                    for p_idx in np.arange(len(Ps_t)))), 
#                           dtype=np.int32)

In [80]:
# len(Ps_t)
# p_idx_to_w_idx.shape
# p_idx_to_w_idx.dtype
# p_idx_to_w_idx.nbytes / 1e9 #FIXME reconsider dtype
# p_idx_to_w_idx[Ps_t.index(a_prefix_of_random_w)]

In [81]:
# Ps_t[0]
# p_idx_to_w_idx[0]
# Ws_t[p_idx_to_w_idx[0]]

In [82]:
# Ps_t[2]
# p_idx_to_w_idx[2]

In [83]:
# np.where(p_idx_to_w_idx == -1)[0]
# assert np.where(p_idx_to_w_idx == -1)[0].size == 0

In [84]:
# w_idx_to_p_idx = {p_idx_to_w_idx[p_idx]:p_idx
#                   for p_idx in tqdm(range(len(Ps_t)), total=len(Ps_t))}

In [85]:
# Ws_t[232]
# w_idx_to_p_idx[232]
# Ps_t[w_idx_to_p_idx[232]]

In [86]:
# w_idx_to_l_to_p_idx = {(p_idx_to_w_idx[p_idx], p_to_l[Ps_t[p_idx]]):p_idx
#                        for p_idx in tqdm(range(len(Ps_t)), total=len(Ps_t))}

# w_idx_to_l_to_p_idx2 = {(w_idx, l): Ps_t.index( t2ds(ds2t(Ws_t[w_idx])[:l]) )
#                         for w_idx in tqdm(range(len(Ws_t)), total=len(Ws_t)) for l in range(1, len(ds2t(Ws_t[w_idx])))}

# w_idx_to_l_to_p_idx3 = {(Ws_t.index(w), len(ds2t(p))): Ps_t.index(p)
#                         for w in tqdm(Ws_t) for p in getPrefixes(w)}

# ≈10m on CMU/wittgenstein
# w_idx_to_l_to_p_idx = {(Ws_t.index(w), len(ds2t(p))): Ps_t.index(p)
#                         for w in tqdm(Ws_t) for p in getPrefixes(w)}

def w_idx_to_l_to_p_idx_calc(w, p):
    return ((Ws_t.index(w), len(ds2t(p))), Ps_t.index(p))

# ≈1.3m cmu+wittgenstein
w_idx_to_l_to_p_idx = dict(par(delayed(w_idx_to_l_to_p_idx_calc)(w,p)
                               for w in Ws_t for p in getPrefixes(w)))

[Parallel(n_jobs=-1)]: Using backend MultiprocessingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0037s.) Setting batch_size=108.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0444s.) Setting batch_size=972.
[Parallel(n_jobs=-1)]: Done 1900 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 3736 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done 40422 out of 40422 | elapsed:    1.0s finished
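
For reference, the shape of the resulting dictionary can be sketched on hypothetical toy data: each key is a pair (wordform index, length `l`) and each value is the index of the prefix obtained by trimming that wordform to `l` symbols. The names `w_index`, `p_index`, and `trim` are assumptions made for this example. It also precomputes string-to-index dictionaries, which avoids a linear scan with `.index()` on every lookup; whether that matters at the notebook's scale is untested here.

```python
# Hypothetical toy stand-ins for Ws_t and Ps_t.
words = ('a.b', 'a.c')
prefixes = ('a', 'a.b', 'a.c')

# Precomputed lookups instead of repeated tuple.index() calls.
w_index = {w: i for i, w in enumerate(words)}
p_index = {p: i for i, p in enumerate(prefixes)}

def trim(w, l):
    """The first l symbols of a dotted string."""
    return '.'.join(w.split('.')[:l])

w_l_to_p = {(w_index[w], l): p_index[trim(w, l)]
            for w in words
            for l in range(1, len(w.split('.')) + 1)}

print(w_l_to_p)
```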


In [87]:
# len(w_idx_to_l_to_p_idx)
# # len(w_idx_to_l_to_p_idx2)
# len(w_idx_to_l_to_p_idx3)

In [88]:
random_w
len(ds2t(random_w))
associated_p_idx = w_idx_to_l_to_p_idx[(Ws_t.index(random_w), len(ds2t(random_w)))]; associated_p_idx
Ps_t[associated_p_idx]

'⋊.oʊ.ɑ.l.⋉.⋉'

6

9603

'⋊.oʊ.ɑ.l.⋉.⋉'

In [89]:
random_wordform_idx = choice(range(len(Ws_t)))
Ws_t[random_wordform_idx]
[w_idx_to_l_to_p_idx.get((random_wordform_idx,l), None) for l in prefixlengths if w_idx_to_l_to_p_idx.get((random_wordform_idx,l), None) is not None]
lmap(lambda p_idx: Ps_t[p_idx],
     [w_idx_to_l_to_p_idx.get((random_wordform_idx,l), None) for l in prefixlengths if w_idx_to_l_to_p_idx.get((random_wordform_idx,l), None) is not None])

'⋊.d.ə.s.⋉.⋉'

[0, 1469, 1973, 2001, 2002, 2003]

['⋊', '⋊.d', '⋊.d.ə', '⋊.d.ə.s', '⋊.d.ə.s.⋉', '⋊.d.ə.s.⋉.⋉']

In [90]:
random_prefix = choice(Ps_t); random_prefix
random_prefix_idx = Ps_t.index(random_prefix); random_prefix_idx
random_prefix_l = len(ds2t(random_prefix)); random_prefix_l
# p_to_l[random_prefix]
# associated_w_idx = p_idx_to_w_idx[random_prefix_idx]; associated_w_idx
# Ws_t[associated_w_idx]
# Ps_l[random_prefix_l][associated_w_idx].shape
# w_idx_to_l_to_p_idx[(associated_w_idx, random_prefix_l)]
# Ps_t[w_idx_to_l_to_p_idx[(associated_w_idx, random_prefix_l)]]

'⋊.ʌ.s.p.⋉'

20777

5

# Hamming distance

In [91]:
random_prefixes = choices(Ps_t, k=1000)

In [92]:
random_w
random_other_p
some_random_prefixes = random_prefixes[:10]; some_random_prefixes

'⋊.oʊ.ɑ.l.⋉.⋉'

'⋊.n.i'

['⋊.ɪ.ɹ.i.⋉',
 '⋊.u.k.l.⋉.⋉',
 '⋊.d.l.ɪ.⋉.⋉',
 '⋊.ʃ.oʊ.d',
 '⋊.w.ʊ.d.⋉.⋉',
 '⋊.g.dʒ.ɛ',
 '⋊.h.ə.b.⋉.⋉',
 '⋊.aɪ.t.æ.⋉',
 '⋊.aɪ.p.ə.⋉',
 '⋊.m.oʊ.ʃ']

In [93]:
random_other_p_OH.dtype

dtype('uint8')

In [94]:
# length_mismatch_constant = np.inf
length_mismatch_constant = -1

## Distance between symbol vectors

In [95]:
zero = np.zeros(shape=XOHmap['f'].shape, dtype=my_dtype)
zero

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int8)

In [96]:
XOHmap['f'] #a
XOHmap['g'] #b

diff_V = XOHmap['f'] - XOHmap['g']; diff_V #will be the zero vector iff a = b
sum_V = XOHmap['f'] + XOHmap['g']; sum_V
prod_V = XOHmap['f'] * XOHmap['g']; prod_V #a * b will be the zero vector iff a ≠ b and a * b = a = b iff a = b
dot_prod_V = np.dot(XOHmap['f'], XOHmap['g']); dot_prod_V #a.b will be 0 iff a ≠ b and a.b = 1 iff a = b
np.dot(XOHmap['f'], XOHmap['f'])

array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

array([ 0.,  0.,  0.,  0.,  0.,  0.,  1., -1.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.])

array([0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

0.0

1.0

In [97]:
random_OHs = choices(list(map(lambda x: XOHmap[x], Xs)), k=1000)

In [98]:
if benchmark:
    %timeit not np.array_equal(choice(random_OHs), choice(random_OHs))

In [99]:
if benchmark:
    %timeit not np.array_equal(choice(random_OHs) - choice(random_OHs), zero)

In [100]:
if benchmark:
    %timeit not (choice(random_OHs) - choice(random_OHs)).any()

In [101]:
if benchmark:
    %timeit not (choice(random_OHs) - choice(random_OHs)).sum()

In [102]:
if benchmark:
    %timeit (choice(random_OHs) * choice(random_OHs)).any()

In [103]:
if benchmark:
    %timeit np.dot(choice(random_OHs), choice(random_OHs))

(Unsurprising) conclusion on checking for vector equality: `np.dot` is about 2-3 times as fast as methods involving element-wise array equality checking or sums and differences possibly involving the zero vector.
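
The reason the dot product works as an equality test can be sketched on a hypothetical four-symbol alphabet (the names `one_hot` and `symbol_distance` are invented for this example). Two one-hot vectors have dot product 1 exactly when their single 1 sits in the same position, and 0 otherwise.

```python
import numpy as np

# Hypothetical toy alphabet; each symbol gets one row of the identity matrix.
eye = np.eye(4, dtype=np.uint8)
one_hot = {s: eye[i] for i, s in enumerate('abcd')}

def symbol_distance(x, y):
    """0 if the two one-hot vectors encode the same symbol, else 1."""
    return int(not np.dot(x, y))

print(symbol_distance(one_hot['a'], one_hot['a']))
print(symbol_distance(one_hot['a'], one_hot['b']))
```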

In [104]:
def d_s_np(x_OH, y_OH):
    '''
    Hamming distance between symbol x and symbol y,
    where both symbols are one-hot vectors.
    '''
    return not np.dot(x_OH, y_OH)

In [105]:
for each_OH in random_OHs:
    assert d_s_np(each_OH, each_OH) == 0 and np.array_equal(each_OH, each_OH)
    random_OH = choice(random_OHs)
    assert d_s_np(each_OH, random_OH) == (not np.array_equal(each_OH, random_OH))

## Hamming distance between stacks of symbol vectors (strings)

In [106]:
np.zeros((2,3)).astype(np.int64)

array([[0, 0, 0],
       [0, 0, 0]])

In [107]:
'Direct comparison for equality:'
np.array_equal(dsToUniphoneOHs('t.i.f', XOHmap), dsToUniphoneOHs('t.i.f', XOHmap)) #true
np.array_equal(dsToUniphoneOHs('t.i.f', XOHmap), dsToUniphoneOHs('t.i.g', XOHmap)) #false

'Difference:'
(dsToUniphoneOHs('t.i.f', XOHmap) - dsToUniphoneOHs('t.i.f', XOHmap)).sum()
(dsToUniphoneOHs('t.i.f', XOHmap) - dsToUniphoneOHs('t.i.f', XOHmap)).prod()
(dsToUniphoneOHs('t.i.f', XOHmap).astype(np.int64) - dsToUniphoneOHs('t.i.g', XOHmap).astype(np.int64)).sum()
(dsToUniphoneOHs('t.i.f', XOHmap) - dsToUniphoneOHs('t.i.g', XOHmap)).prod()
# (dsToUniphoneOHs('t.i.f', XOHmap) - dsToUniphoneOHs('t.i.g', XOHmap)).sum()

'Hadamard product:'
3 - (dsToUniphoneOHs('t.i.f', XOHmap) * dsToUniphoneOHs('t.i.f', XOHmap)).sum()
3 - (dsToUniphoneOHs('t.i.f', XOHmap) * dsToUniphoneOHs('t.i.g', XOHmap)).sum()
3 - (dsToUniphoneOHs('t.i.f', XOHmap) * dsToUniphoneOHs('d.i.g', XOHmap)).sum()
3 - (dsToUniphoneOHs('t.i.f', XOHmap) * dsToUniphoneOHs('d.u.g', XOHmap)).sum()

'Dot product:'
np.dot(dsToUniphoneOHs('t.i.f', XOHmap),
       dsToUniphoneOHs('t.i.f', XOHmap).T)
dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.i.f', XOHmap).T
dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.i.g', XOHmap).T
dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.u.f', XOHmap).T
(dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.i.f', XOHmap).T).sum(axis=0)
(dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.i.f', XOHmap).T).sum(axis=1)
(dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.i.g', XOHmap).T).sum(axis=0)
(dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.i.g', XOHmap).T).sum(axis=1)
(dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.u.g', XOHmap).T).sum(axis=0)
(dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.u.g', XOHmap).T).sum(axis=1)
(dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.u.g', XOHmap).T).sum(axis=0).sum()
3 - (dsToUniphoneOHs('t.i.f', XOHmap) @ dsToUniphoneOHs('t.u.g', XOHmap).T).sum()

'Direct comparison for equality:'

True

False

'Difference:'

0

0

0

0

'Hadamard product:'

0.0

1.0

2.0

3.0

'Dot product:'

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]], dtype=uint8)

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]], dtype=uint8)

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0]], dtype=uint8)

array([[1, 0, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=uint8)

array([1, 1, 1], dtype=uint64)

array([1, 1, 1], dtype=uint64)

array([1, 1, 0], dtype=uint64)

array([1, 1, 0], dtype=uint64)

array([1, 0, 0], dtype=uint64)

array([1, 0, 0], dtype=uint64)

1

2.0

In [108]:
def d_w_np_hadamard(x_OHs, y_OHs):
    '''
    Hamming distance between stacks of symbols x and y,
    where both stacks are of one-hot vectors.
    '''
    l = x_OHs.shape[0]
    
    return l - (x_OHs * y_OHs).sum()

# turns out to be both incorrect and poorly scaling
# def d_w_np_dot(x_OHs, y_OHs):
#     '''
#     Hamming distance between stacks of symbols x and y,
#     where both stacks are of one-hot vectors.
#     '''
#     l = x_OHs.shape[0]
    
#     return l - (x_OHs @ y_OHs.T).sum()

def d_w_np_direct(x_OHs, y_OHs):
    '''
    Hamming distance between stacks of symbols x and y,
    where both stacks are of one-hot vectors.
    '''
    l = x_OHs.shape[0]
    return np.array([(not np.array_equal(x_OHs[i], y_OHs[i])) for i in range(l)]).sum()

d_s_npu = np.vectorize(d_s_np, otypes=[np.uint8], signature="(s),(s)->()")

def d_w_np_u(x_OHs, y_OHs):
    return d_s_npu(x_OHs, y_OHs).sum()
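
One way to see how a matrix-product version can go wrong, sketched on hypothetical toy data: summing all of `x @ y.T` counts every pair of positions (i, j) whose symbols match, not only the aligned positions i = j. Strings with repeated symbols are therefore over-counted. Restricting the sum to the diagonal gives the aligned count again, which the Hadamard product computes directly.

```python
import numpy as np

# Hypothetical two-symbol alphabet.
eye = np.eye(2, dtype=np.int64)
a = eye[0]

# The string 'a.a' compared with itself; the true Hamming distance is 0.
x = np.stack([a, a])
y = np.stack([a, a])
l = x.shape[0]

hadamard_distance = l - (x * y).sum()       # counts aligned matches only
full_matmul_distance = l - (x @ y.T).sum()  # counts all (i, j) matches
diagonal_distance = l - np.trace(x @ y.T)   # aligned matches via the diagonal

print(hadamard_distance, full_matmul_distance, diagonal_distance)
```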

In [109]:
num_random_fixed_size_OHs = 1000

my_fixed_size = 20 #longer length = more revealing

random_fixed_size_OHs = [np.stack([choice(random_OHs) for each in range(my_fixed_size)])
                         for each in range(num_random_fixed_size_OHs)]

In [110]:
random_fixed_size_OHs[0].shape

(20, 41)

In [111]:
if benchmark:
    #indicates overhead of choosing random inputs
    %timeit (choice(random_fixed_size_OHs), choice(random_fixed_size_OHs))

In [112]:
if benchmark:
    %timeit d_w_np_hadamard(choice(random_fixed_size_OHs), choice(random_fixed_size_OHs))

In [113]:
# %%timeit

# d_w_np_dot(choice(random_fixed_size_OHs), choice(random_fixed_size_OHs))

In [114]:
if benchmark:
    %timeit d_w_np_direct(choice(random_fixed_size_OHs), choice(random_fixed_size_OHs))

In [115]:
if benchmark:
    %timeit d_w_np_u(choice(random_fixed_size_OHs), choice(random_fixed_size_OHs))

Check for correctness...

In [116]:
if testing:
    for my_hamming_distance_function in (d_w_np_direct, d_w_np_hadamard, d_w_np_u):
    # for my_hamming_distance_function in (d_w_np_direct, d_w_np_hadamard, d_w_np_dot, d_w_np_u):
        print(f'Checking {str(my_hamming_distance_function)}')
        for each_OHstack in random_fixed_size_OHs:
        #     my_hamming_distance_function = d_w_np_hadamard
    #         my_hamming_distance_function = d_w_np_direct
        #     my_hamming_distance_function = 
            if not (my_hamming_distance_function(each_OHstack, each_OHstack) == 0 and np.array_equal(each_OHstack, each_OHstack) == True):
                each_s = OHsToDS(each_OHstack, OHXmap)
                print(f'each_s = {each_s}')
                print(f'd_h = {d_h(each_s, each_s)}')
                print(f'my_hamming_distance_function(each_OHstack, each_OHstack) = {my_hamming_distance_function(each_OHstack, each_OHstack)}')
            assert my_hamming_distance_function(each_OHstack, each_OHstack) == 0 and np.array_equal(each_OHstack, each_OHstack) == True

            random_OHstack = choice(random_fixed_size_OHs)
            each_s = OHsToDS(each_OHstack, OHXmap)
            random_s = OHsToDS(random_OHstack, OHXmap)

            if not (my_hamming_distance_function(each_OHstack, random_OHstack) == d_h(each_s, random_s)):
                print(f'each_s = {each_s}', f'random_s = {random_s}')
                pprint_aligned_DSs(align_DSs([each_s, random_s]))
                print(f'd_h = {d_h(each_s, random_s)}')
                print(f'my_hamming_distance_function(each_OHstack, random_OHstack) = {my_hamming_distance_function(each_OHstack, random_OHstack)}')
            assert my_hamming_distance_function(each_OHstack, random_OHstack) == d_h(each_s, random_s)

**Conclusion:** The Hadamard product scales very well for checking Hamming distance between two (unpadded one-hot) strings.

In [117]:
def d_h_np(x_OHs, y_OHs):
    '''
    Hamming distance between sequences of symbols x and y,
    where both symbols are represented by one-hot vectors and
    neither is a padded stack.
    '''
    l = x_OHs.shape[0]
    if l != y_OHs.shape[0]:
        return length_mismatch_constant
#         return np.infty
    return l - (x_OHs * y_OHs).sum(dtype=my_dtype)

To accommodate padded OH stacks, we need a way to ignore the padding vectors when measuring length and distance.
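
A minimal sketch of the idea, under two assumptions that are not verified against the notebook's helpers: padding rows are all-zero vectors, and padding only appears at the end of a stack. Under those assumptions, the sum of all entries is the true length. The names `true_length`, `padded_hamming`, and `length_mismatch` are hypothetical stand-ins.

```python
import numpy as np

length_mismatch = -1  # stand-in for length_mismatch_constant

def true_length(stack):
    """Number of real symbols, assuming padding rows are all zeros."""
    return int(stack.sum())

def padded_hamming(x, y):
    """Hamming distance between two trailing-padded one-hot stacks."""
    x_l, y_l = true_length(x), true_length(y)
    if x_l != y_l:
        return length_mismatch
    # Compare only the unpadded rows.
    return x_l - int((x[:x_l] * y[:x_l]).sum())

eye = np.eye(3, dtype=np.int64)
pad = np.zeros(3, dtype=np.int64)
ab_padded = np.stack([eye[0], eye[1], pad])
ac_padded = np.stack([eye[0], eye[2], pad])
a_padded = np.stack([eye[0], pad, pad])

print(padded_hamming(ab_padded, ab_padded))
print(padded_hamming(ab_padded, ac_padded))
print(padded_hamming(ab_padded, a_padded))
```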

In [118]:
def d_h_np(x_OHs, y_OHs, paddedOHs=False):
    '''
    Hamming distance between sequences of symbols x and y,
    where both symbols are represented by one-hot vectors.
    '''
    if paddedOHs:
        x_l = trueLength(x_OHs)
        y_l = trueLength(y_OHs)
        if x_l != y_l:
            return length_mismatch_constant
#             return np.infty
        else: #true lengths *are* the same...
            true_l = x_l
#             x_pl = x_OHs.shape[0]
#             y_pl = y_OHs.shape[0]
            
            #correct but involves the creation of new OH stacks
#             trimmed_x_OHs, trimmed_y_OHs = adjustWord(x_OHs, true_l), adjustWord(y_OHs, true_l)
#             return true_l - (trimmed_x_OHs * trimmed_y_OHs).sum()

            return true_l - ((x_OHs[:true_l] * y_OHs[:true_l])).sum(dtype=my_dtype)
            
    l = x_OHs.shape[0]
    if l != y_OHs.shape[0]:
        return length_mismatch_constant
#         return np.infty
    return l - (x_OHs * y_OHs).sum(dtype=my_dtype)

In [122]:
# w0 = Ws_t[0]; w0
w0
# w0_l = len(ds2t(Ws_t[0])); 
w0_l
' '
# unpadded_w0_OH_rep = dsToUniphoneOHs(w0, XOHmap); unpadded_w0_OH_rep.shape
unpadded_w0_OH_rep.shape
OHsToDS(unpadded_w0_OH_rep, OHXmap)
assert not isPaddedOHstack(unpadded_w0_OH_rep)
' '
# padded_w0_OH_rep = Ws_npf[0]; padded_w0_OH_rep.shape
padded_w0_OH_rep.shape
OHsToDS(padded_w0_OH_rep, OHXmap)
assert isPaddedOHstack(padded_w0_OH_rep) or not containsAnyPaddedOHstacks(Ws_npf)
' '
d_h_np(unpadded_w0_OH_rep, unpadded_w0_OH_rep)
d_h_np(unpadded_w0_OH_rep, padded_w0_OH_rep, True)
d_h_np(padded_w0_OH_rep, padded_w0_OH_rep, True)
' '
random_other_p
random_other_p_OH.shape
random_other_p_OH_padded = adjustWord(random_other_p_OH, 20)
random_other_p_OH_padded.shape
assert isPaddedOHstack(random_other_p_OH_padded) or not containsAnyPaddedOHstacks(Ws_npf)
d_h(w0, random_other_p)
d_h_np(padded_w0_OH_rep, random_other_p_OH_padded, True)

'⋊.aɪ.b.aɪ.⋉.⋉'

6

' '

(6, 41)

'⋊.aɪ.b.aɪ.⋉.⋉'

' '

(6, 41)

'⋊.aɪ.b.aɪ.⋉.⋉'

' '

0

0.0

0.0

' '

'⋊.n.i'

(3, 41)

(20, 41)

inf

-1

In [123]:
num_random_padded_OHs = 1000

random_padded_OHs = [choice(Ws_npf)
                     for each in range(num_random_padded_OHs)]

In [124]:
def my_d_h(u,v):
    result = d_h(u,v)
    if result == np.inf:
        return length_mismatch_constant
    else:
        return result

In [125]:
if testing:
    for each_OH_stack in random_padded_OHs:
        other_OH_stack = choice(random_padded_OHs)

        each_s  = OHsToDS(each_OH_stack, OHXmap)
        other_s = OHsToDS(other_OH_stack, OHXmap)

        if trueLength(each_OH_stack) == trueLength(other_OH_stack):
            print('matching true lengths...') #more useful tests

        assert my_d_h(each_s, other_s) == d_h_np(each_OH_stack, other_OH_stack, True)

In [126]:
if benchmark:
    %timeit d_h_np(choice(random_padded_OHs), choice(random_padded_OHs), True)

## Distance between a string and a stack of strings

The cell below is correct except that it does not account for padded stacks. (In those cases, Levenshtein distance enters the calculation...)
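
The one-against-many calculation relies on broadcasting, which can be sketched on hypothetical toy data (the `encode` helper and the three-symbol alphabet are assumptions for this example). A single `(l, s)` stack is multiplied against an `(n, l, s)` stack of strings, and `einsum('nls->n', ...)` sums the aligned matches per string.

```python
import numpy as np

# Hypothetical three-symbol alphabet, no padding, all strings of length 2.
eye = np.eye(3, dtype=np.int64)

def encode(s):
    """Stack of one-hot rows for a string over 'abc'."""
    return np.stack([eye['abc'.index(c)] for c in s])

stack = np.stack([encode(s) for s in ('ab', 'ac', 'cb', 'ca')])  # shape (4, 2, 3)
x = encode('ab')                                                 # shape (2, 3)

# x broadcasts across the first axis of stack; the einsum sums over l and s.
matches = np.einsum('nls->n', x * stack)
distances = x.shape[0] - matches
print(distances)
```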

In [127]:
random_w
random_w_idx = Ws_t.index(random_w)
random_w_OHf = Ws_npf[random_w_idx]; random_w_OHf.shape
Ws_npf.shape
np.array(lmap(lambda w_OHf: d_h_np(random_w_OHf, w_OHf, False),
              Ws_npf))

random_w_OHf.shape[0] - np.einsum('nls->n', random_w_OHf * Ws_npf)
random_w_OHf.shape[0] - np.einsum('nls->n', np.einsum('ls,nls->nls', random_w_OHf, Ws_npf))
random_w_OHf.shape[0] - np.einsum('nls->n', (random_w_OHf * Ws_sf).todense())

'⋊.oʊ.ɑ.l.⋉.⋉'

(6, 41)

(6737, 6, 41)

array([3, 3, 3, ..., 3, 3, 3])

array([3, 3, 3, ..., 3, 3, 3], dtype=uint8)

array([3, 3, 3, ..., 3, 3, 3], dtype=uint8)

array([3, 3, 3, ..., 3, 3, 3], dtype=uint8)

In [128]:
# Ws_npfProd = np.einsum('mls,nls->mnls', Ws_npf, Ws_npf)

In [129]:
# only correct for random_w_OHf
# Ws_npfReduc = random_w_OHf.shape[0] - np.einsum('mnls->mn', Ws_npfProd)

In [130]:
# random_w_idx

In [131]:
# Ws_npfReduc[random_w_idx]
# np.array_equal(Ws_npfReduc[random_w_idx], random_w_OHf.shape[0] - np.einsum('nls->n', random_w_OHf * Ws_npf))

Let's modify these calculations to account for padding:

In [132]:
trueLength(random_w_OHf)
Ws_npf.shape
Ws_npf_trueLengths = np.sum(Ws_npf, axis=2, dtype=my_dtype).sum(axis=1, dtype=my_dtype)
Ws_npf_trueLengths.shape
Ws_npf_trueLengths

6

(6737, 6, 41)

(6737,)

array([6, 6, 6, ..., 6, 6, 6], dtype=int8)

In [133]:
trueLength_match = Ws_npf_trueLengths == trueLength(random_w_OHf); trueLength_match.nonzero()[0]
trueLength_mismatch = Ws_npf_trueLengths != trueLength(random_w_OHf); trueLength_mismatch
# Ws_npf_trueLengths[0], trueLength_mismatch[0]
# Ws_npf_trueLengths[1], trueLength_mismatch[1]
# Ws_npf_trueLengths[2], trueLength_mismatch[2]
random_w_OHf_distance_mask = np.ones(Ws_npf_trueLengths.shape)
np.putmask(random_w_OHf_distance_mask, trueLength_mismatch, length_mismatch_constant)
random_w_OHf_distance_mask
random_w_OHf_distance_mask[trueLength_match]
# random_w_OHf_distance_mask[0]
# random_w_OHf_distance_mask[1]
# random_w_OHf_distance_mask[2]

array([   0,    1,    2, ..., 6734, 6735, 6736])

array([False, False, False, ..., False, False, False])

array([1., 1., 1., ..., 1., 1., 1.])

array([1., 1., 1., ..., 1., 1., 1.])

In [134]:
def OHstack_to_trueLength_mask(paddedOHstack, trueLengths):
    mask = np.ones(trueLengths.shape)
    np.putmask(mask, trueLengths != trueLength(paddedOHstack), length_mismatch_constant)
    return mask.astype(my_dtype)

assert np.array_equal(random_w_OHf_distance_mask,
                      OHstack_to_trueLength_mask(random_w_OHf, Ws_npf_trueLengths))

In [135]:
random_w
random_w_idx = Ws_t.index(random_w)
random_w_OHf = Ws_npf[random_w_idx]; random_w_OHf.shape
Ws_npf.shape
np.array(lmap(lambda w_OHf: d_h_np(random_w_OHf, w_OHf, True),
              Ws_npf))

(random_w_OHf.shape[0] - np.einsum('nls->n', random_w_OHf * Ws_npf)) * random_w_OHf_distance_mask
(random_w_OHf.shape[0] - np.einsum('nls->n', np.einsum('ls,nls->nls', random_w_OHf, Ws_npf))) * random_w_OHf_distance_mask
(random_w_OHf.shape[0] - np.einsum('nls->n', (random_w_OHf * Ws_sf).todense())) * random_w_OHf_distance_mask

'⋊.oʊ.ɑ.l.⋉.⋉'

(6, 41)

(6737, 6, 41)

array([3., 3., 3., ..., 3., 3., 3.])

array([3., 3., 3., ..., 3., 3., 3.])

array([3., 3., 3., ..., 3., 3., 3.])

array([3., 3., 3., ..., 3., 3., 3.])

In [136]:
if benchmark:
    %timeit -r 10 -n 10 np.array(lmap(lambda w_OHf: d_h_np(choice(Ws_npf), w_OHf), Ws_npf))

In [137]:
if benchmark:
    rand_w_OHf = choice(Ws_npf)
    %timeit -r 10 -n 10 rand_w_OHf.shape[0] - np.einsum('nls->n', rand_w_OHf * Ws_npf)

In [138]:
if benchmark:
    rand_w_OHf = choice(Ws_npf)
    %timeit -r 10 -n 10 rand_w_OHf.shape[0] - np.einsum('nls->n', np.einsum('ls,nls->nls', rand_w_OHf, Ws_npf))

In [139]:
if benchmark:
    rand_w_OHf = choice(Ws_npf)
    %timeit -r 10 -n 10 rand_w_OHf.shape[0] - np.einsum('nls->n', (rand_w_OHf * Ws_sf).todense())

In [140]:
rand_w_OHf = choice(Ws_npf)

map_result = np.array(lmap(lambda w_OHf: d_h_np(rand_w_OHf, w_OHf),
                           Ws_npf))

einsum_hadamard_result = rand_w_OHf.shape[0] - np.einsum('nls->n', rand_w_OHf * Ws_npf)

einsum_einsum_result = rand_w_OHf.shape[0] - np.einsum('nls->n', np.einsum('ls,nls->nls', rand_w_OHf, Ws_npf))

einsum_hadamard_sparse_result = rand_w_OHf.shape[0] - np.einsum('nls->n', (rand_w_OHf * Ws_sf).todense())

assert np.array_equal(map_result, einsum_hadamard_result)
assert np.array_equal(map_result, einsum_einsum_result)
assert np.array_equal(map_result, einsum_hadamard_sparse_result)

In [141]:
def d_h_np_string_to_strings(x_OHs, L_OHs, paddedOHs=False, L_OHs_trueLengths=None, use_GPU=False):
    memTrigger()
    if not use_GPU:
        x_OHs = x_OHs.astype(my_dtype)
        L_OHs = L_OHs.astype(my_dtype)
        l = x_OHs.shape[0]
        if not paddedOHs:
            return l - np.einsum('nls->n', x_OHs * L_OHs, dtype=my_dtype)
        else:
            true_l = trueLength(x_OHs)
            if L_OHs_trueLengths is None:
                L_OHs_trueLengths = np.sum(L_OHs, axis=2, dtype=np.uint8).sum(axis=1, dtype=np.uint8)
            trueLength_mask = OHstack_to_trueLength_mask(x_OHs, L_OHs_trueLengths)
            return (true_l - np.einsum('nls->n', x_OHs * L_OHs, dtype=my_dtype)) * trueLength_mask
    else:
        if not paddedOHs:
            l = x_OHs.shape[0]
            x_OHs_t = x_OHs
            L_OHs_t = L_OHs
#             x_OHs_t = torch.from_numpy(x_OHs)#.type(torch.float16)
#             L_OHs_t = torch.from_numpy(L_OHs)#.type(torch.float16)
            return l - torch.einsum('nls->n', x_OHs_t.cuda() * L_OHs_t.cuda()).cpu().numpy().astype(my_dtype)
#             return l - torch.einsum('nls->n', x_OHs_t.type(torch.float16).cuda() * L_OHs_t.type(torch.float16).cuda())
        else:
            x_OHs_t = x_OHs
            L_OHs_t = L_OHs
#             true_l = torch.sum(x_OHs_t, dim=1).sum()
            true_l = trueLength(x_OHs_t.numpy())
#             true_l = trueLength(x_OHs_t)
            if L_OHs_trueLengths is None:
                L_OHs_trueLengths = np.sum(L_OHs.numpy(), axis=2, dtype=np.uint8).sum(axis=1, dtype=np.uint8)
#                 L_OHs_trueLengths = torch.sum(L_OHs_t, dim=2, dtype=torch.int32).sum(dim=1, dtype=torch.int32)
            trueLength_mask = OHstack_to_trueLength_mask(x_OHs_t.numpy(), L_OHs_trueLengths)
#             trueLength_mask = torch.from_numpy(OHstack_to_trueLength_mask(x_OHs_t.numpy(), L_OHs_trueLengths.numpy()))
#             trueLength_mask = OHstack_to_trueLength_mask(x_OHs_t, L_OHs_trueLengths)
            return (true_l - torch.einsum('nls->n', x_OHs_t.cuda() * L_OHs_t.cuda()).cpu().numpy().astype(my_dtype)) * trueLength_mask
                
        


# def d_h_np_string_to_strings(x_OHs, L_OHs, paddedOHs=False, L_OHs_trueLengths=None, use_GPU=False):
# # def d_h_np_string_to_strings(x_OHs, L_OHs, paddedOHs=False, L_OHs_trueLengths=None, my_dtype=None):
# #     if my_dtype is None:
# #         my_dtype = np.uint8
# #         my_dtype = np.int8
#     x_OHs = x_OHs.astype(my_dtype)
#     L_OHs = L_OHs.astype(my_dtype)
#     l = x_OHs.shape[0]
#     if not paddedOHs:
#         if not use_GPU:
#             return l - np.einsum('nls->n', x_OHs * L_OHs, dtype=my_dtype)
#         else:
#             x_OHs_t = torch.from_numpy(x_OHs)#.type(torch.float16)
#             L_OHs_t = torch.from_numpy(L_OHs)#.type(torch.float16)
#             return l - torch.einsum('nls->n', x_OHs_t.cuda() * L_OHs_t.cuda()).cpu().numpy().astype(my_dtype)
# #             return l - torch.einsum('nls->n', x_OHs_t.type(torch.float16).cuda() * L_OHs_t.type(torch.float16).cuda())
# #             l_t
# #             raise Exception('under construction')
#     else:
#         true_l = trueLength(x_OHs)
#         if L_OHs_trueLengths is None:
#             L_OHs_trueLengths = np.sum(L_OHs, axis=2, dtype=my_dtype).sum(axis=1, dtype=my_dtype)
#         trueLength_mask = OHstack_to_trueLength_mask(x_OHs, L_OHs_trueLengths)
#         if not use_GPU:
#             return (true_l - np.einsum('nls->n', x_OHs * L_OHs, dtype=my_dtype)) * trueLength_mask
#         else:
# #             true_l_t = 
#             x_OHs_t = torch.from_numpy(x_OHs)#.type(torch.float16)
#             L_OHs_t = torch.from_numpy(L_OHs)#.type(torch.float16)
#             L_OHs_trueLengths_t = torch.from_numpy(L_OHs_trueLengths)#.type(torch.float16)
#             trueLength_mask_t = torch.from_numpy(trueLength_mask)#.type(torch.float16)
#             return (true_l - torch.einsum('nls->n', x_OHs_t.cuda() * L_OHs_t.cuda()).cpu().numpy().astype(my_dtype))
# #             raise Exception('under construction')

In [142]:
rand_w_OHf.shape
rand_w = OHsToDS(rand_w_OHf, OHXmap); rand_w

d_h_np_string_to_strings(rand_w_OHf, Ws_npf, True)

(6, 41)

'⋊.t.aɪ.g.⋉.⋉'

array([3, 3, 3, ..., 3, 3, 3], dtype=int8)

In [143]:
if testing:
    rand_w_dists = d_h_np_string_to_strings(rand_w_OHf, Ws_npf, True)

    for i, each_OH in enumerate(Ws_npf):
        if trueLength(each_OH) == trueLength(rand_w_OHf):
            each_w = OHsToDS(each_OH, OHXmap)
            assert d_h(rand_w, each_w) == rand_w_dists[i]

In [144]:
if benchmark:
    %timeit d_h_np_string_to_strings(choice(Ws_npf), Ws_npf)

In [145]:
if benchmark:
    %timeit d_h_np_string_to_strings(choice(Ws_npf), Ws_npf, True, Ws_npf_trueLengths)

In [146]:
if benchmark:
    %timeit d_h_np_string_to_strings(choice(Ws_npf), Ws_npf, True)

**Conclusion:** Fortunately, it looks like applying a mask to account for padding adds only a small cost, provided you pre-calculate the true length of every padded vector in the stack of strings you are computing distances against.
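The masking idea can be sketched in isolation. The following is a minimal NumPy sketch and not the notebook's implementation: it assumes padded positions are all-zero rows of the one-hot matrix, and the names `one_hot`, `hamming_to_stack` and the sentinel `MISMATCH` (a stand-in for `length_mismatch_constant`) are hypothetical.

```python
import numpy as np

MISMATCH = -1  # hypothetical sentinel meaning "lengths differ"

def one_hot(indices, max_len, alphabet_size):
    """Encode symbol indices as a (max_len, alphabet_size) one-hot matrix.
    Positions past the end of the sequence stay all-zero (assumed padding)."""
    idx = np.asarray(indices, dtype=np.intp)
    oh = np.zeros((max_len, alphabet_size), dtype=np.int8)
    oh[np.arange(len(idx)), idx] = 1
    return oh

def hamming_to_stack(x_oh, stack_oh, stack_lengths=None):
    """Hamming distance from one padded one-hot string to each string in a stack."""
    x_len = int(x_oh.sum())  # true length = number of non-padding rows
    if stack_lengths is None:
        # The step worth pre-computing once per stack.
        stack_lengths = stack_oh.sum(axis=(1, 2))
    # Number of positions where x and each stacked string carry the same symbol.
    matches = np.einsum('nls->n', x_oh * stack_oh, dtype=np.int64)
    dists = x_len - matches
    # Only equal-length pairs have a defined Hamming distance.
    return np.where(stack_lengths == x_len, dists, MISMATCH)

stack = np.stack([one_hot(w, 4, 4) for w in ([0, 1, 2], [0, 1, 3], [0, 2])])
lengths = stack.sum(axis=(1, 2))  # computed once, reused for every query
print(hamming_to_stack(one_hot([0, 1, 2], 4, 4), stack, lengths).tolist())
```

Passing `stack_lengths` in avoids re-summing the whole stack on every call, which is the saving the conclusion above refers to.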

## Hamming distance between every pair of strings in a stack
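The cells in this section compute the full distance matrix either one row at a time or in blocks of rows. The block idea can be sketched on its own. This is a simplified sketch and not the code used below: it assumes unpadded one-hot strings of a common length, the name `pairwise_matches_blocked` is hypothetical, and it contracts each block directly to an `(m, n)` result rather than first building the 4-D `mnls` product.

```python
import numpy as np

def pairwise_matches_blocked(A, B, block_size):
    """Count matching positions between every row of A and every row of B,
    processing A in row blocks to bound peak memory."""
    m = A.shape[0]
    n_blocks = -(-m // block_size)  # ceiling division: the last block may be short
    out = []
    for b in range(n_blocks):
        rows = slice(b * block_size, min((b + 1) * block_size, m))
        # Contract over length (l) and symbol (s) axes in a single call.
        out.append(np.einsum('mls,nls->mn', A[rows], B, dtype=np.int64))
    return np.concatenate(out)

rng = np.random.default_rng(0)
indices = rng.integers(0, 4, size=(7, 3))   # 7 strings of length 3 over 4 symbols
A = np.eye(4, dtype=np.int8)[indices]       # shape (7, 3, 4)
dists = A.shape[1] - pairwise_matches_blocked(A, A, block_size=3)
```

Smaller blocks lower peak memory at the price of more iterations, which matches the block-size trade-offs recorded in the timing comments further down.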

In [147]:
# map_result2 = np.array(lmap(lambda key_w_OHf: np.array(lmap(lambda w_OHf: d_h_np(key_w_OHf, w_OHf, True),
#                                                             Ws_npf)),
#                             Ws_npf))

In [148]:
if testing:
    #≈4m cmu+wittgenstein
    map_result3 = np.stack([d_h_np_string_to_strings(key_w_OHf, Ws_npf, True, Ws_npf_trueLengths)
                            for key_w_OHf in tqdm(Ws_npf)])
    map_result3.shape

In [149]:
if testing:
    map_result3[random_w_idx]
    (map_result3 != length_mismatch_constant).nonzero()
    word_idx_pairs_w_finite_hamming_distance = lzip(*(map_result3 != length_mismatch_constant).nonzero())
    choices(word_idx_pairs_w_finite_hamming_distance, k=100)

In [150]:
if testing:
    for idx_u, idx_v in choices(word_idx_pairs_w_finite_hamming_distance, k=100):
        print('------------------------')
        pprint_aligned_DSs(align_DSs([OHsToDS(Ws_npf[idx_u], OHXmap), 
                                      OHsToDS(Ws_npf[idx_v], OHXmap)]))
        map_result3[idx_u, idx_v]

In [151]:
if testing:
    assert np.array_equal(map_result3[random_w_idx], 
                          d_h_np_string_to_strings(random_w_OHf, Ws_npf, True))
    del map_result3

In [152]:
def construct_hadamard_product_block(row_indices, A, B):
    return np.einsum('mls,nls->mnls', A[row_indices], B, dtype=my_dtype)
# def construct_hadamard_product_block(A_slice, B_slice):
#     return np.einsum('mls,nls->mnls', A_slice, B_slice)

def calculate_block_sum(block):
    return np.einsum('mnls->mn', block, dtype=my_dtype)

def block_sum_op(row_indices, A, B, l):
    memTrigger()
    return l - calculate_block_sum(construct_hadamard_product_block(row_indices, A, B))

def construct_hadamard_product_block_t(A_block, B, use_GPU=True):
    return torch.einsum('mls,nls->mnls', A_block, B).type(my_cpu_type)
#     return torch.einsum('mls,nls->mnls', A_block, B)#.type(my_cpu_type)

def calculate_block_sum_t(block):
#     print(f"block.dtype = {block.dtype}")
#     print(f"block.device = {block.device}")
#     block_sum = torch.einsum('mnls->mn', block).type(my_cpu_type)
#     print(f"block_sum.dtype = {block_sum.dtype}")
#     print(f"block_sum.device = {block_sum.device}")
#     print('computed block_sum.')
#     return block_sum
    return torch.einsum('mnls->mn', block).type(my_cpu_type)
#     return torch.einsum('mnls->mn', block)#.type(my_cpu_type)

def block_sum_op_t(A_block, B, l, use_GPU=True):
# def block_sum_op_t(row_indices, A, B, l, use_GPU=True):
    memTrigger()
    torch.cuda.empty_cache()
#     print(f'row_indices.dtype = {row_indices.dtype}')
#     print(f'A.dtype = {A.dtype}')
#     print(f'B.dtype = {B.dtype}')
#     print(f'l.dtype = {l.dtype}')

#     A_c = A[row_indices].cuda()
#     B_c = B.cuda()
#     prodBlock = construct_hadamard_product_block_t(A_c, B_c)

#     print(f"prodBlock.dtype = {prodBlock.dtype}")
#     print(f"prodBlock.device = {prodBlock.device}")

#     blockSum_c = calculate_block_sum_t(prodBlock)
#     blockSum = blockSum_c.cpu()
#     print(f"blockSum.dtype = {blockSum.dtype}")
#     print(f"l.dtype = {l.dtype}")
#     result = l - blockSum.type(my_cpu_type)
    
#     return result
#     return l - (calculate_block_sum_t(construct_hadamard_product_block_t(A[row_indices].cuda(), B.cuda())).cpu())
#     return l - calculate_block_sum_t(construct_hadamard_product_block_t(A[row_indices].cuda(), B.cuda())).cpu().type(my_cpu_type)
#     return l - calculate_block_sum_t(construct_hadamard_product_block_t(A[row_indices].cuda(), B.cuda())).cpu()
#     return (l.cuda() - calculate_block_sum_t(construct_hadamard_product_block_t(A[row_indices].cuda(), B.cuda()))).cpu()
#     return (l.cuda() - calculate_block_sum_t(construct_hadamard_product_block_t(A_block.cuda(), B.cuda()))).cpu()
    return l - (calculate_block_sum_t(construct_hadamard_product_block_t(A_block.cuda(), B.cuda()))).cpu()

def H_d_np(L_OHs, paddedOHs=False, parallel=False, use_GPU=False, wec=False, wec_block_size=100):
# def H_d_np(L_OHs, paddedOHs=False, parallel=False, my_dtype=None):
#     if my_dtype is None:
#         my_dtype = np.uint8
#         my_dtype = np.int8
    L_OHs = L_OHs.astype(my_dtype)
    L_OHs_trueLengths = np.sum(L_OHs, axis=2, dtype=np.uint8).sum(axis=1, dtype=np.uint8)
    if not parallel and not use_GPU:
        if not wec:
    #         return np.stack([d_h_np_string_to_strings(key_w_OHf, L_OHs, paddedOHs=paddedOHs, my_dtype=my_dtype)
            return np.stack([d_h_np_string_to_strings(key_w_OHf, L_OHs, paddedOHs=paddedOHs, L_OHs_trueLengths=L_OHs_trueLengths).astype(my_dtype)
                             for key_w_OHf in tqdm(L_OHs)]).astype(my_dtype)
        else:
            m = L_OHs_trueLengths.shape[0]
            n = m
            stampedNote('Start wec')
            lengthTerm = np.einsum('m,mn->mn', L_OHs_trueLengths.astype(my_dtype), np.ones((m,n), dtype=my_dtype))
            stampedNote(f'lengthTerm.nbytes / 1e9 = {lengthTerm.nbytes / 1e9}')
            print(f'{lengthTerm.dtype}')
            
            block_length = wec_block_size
            num_blocks = int(np.ceil( m / block_length ))  # ceil: rint yields 0 blocks when m < block_length / 2
            block_onsets = [block_index * block_length 
                            for block_index in range(num_blocks)]
            block_ends = block_onsets[1:] + [m]
            block_startStop_pairs = tuple(zip(block_onsets, block_ends))
            V = 1
            P_d = np.concatenate(list(par(delayed(block_sum_op)(np.arange(block_start, block_end), L_OHs, L_OHs, lengthTerm[np.arange(block_start, block_end)])
                                          for block_start, block_end in tqdm(block_startStop_pairs, 
                                                                             total=len(block_startStop_pairs)))))
            V = 10
            return P_d
#             prodTerm = np.einsum('mij,nij->mnij', L_OHs, L_OHs) #memory error, naturally
#             stampedNote(f'prodTerm.nbytes / 1e9 = {prodTerm.nbytes / 1e9}')
#             print(f'{prodTerm.dtype}')
            
#             reducTerm = np.einsum('mnls->mn', prodTerm)
#             del prodTerm
#             stampedNote(f'reducTerm.nbytes / 1e9 = {reducTerm.nbytes / 1e9}')
#             print(f'{reducTerm.dtype}')
            
#             result = lengthTerm - reducTerm
#             del lengthTerm
#             del reducTerm
#             return result
    elif parallel and not use_GPU:
#         return np.stack(par(delayed(d_h_np_string_to_strings)(key_w_OHf, L_OHs, paddedOHs=paddedOHs, my_dtype=my_dtype)
        return np.stack(par(delayed(d_h_np_string_to_strings)(key_w_OHf, L_OHs, paddedOHs=paddedOHs, L_OHs_trueLengths=L_OHs_trueLengths)
                            for key_w_OHf in L_OHs)).astype(my_dtype)
    else:
        if not wec:
            return np.stack([d_h_np_string_to_strings(torch.from_numpy(key_w_OHf), torch.from_numpy(L_OHs), paddedOHs=paddedOHs, L_OHs_trueLengths=torch.from_numpy(L_OHs_trueLengths), use_GPU=True)
                             for key_w_OHf in tqdm(L_OHs)]).astype(my_dtype)
        else:
            torch.cuda.empty_cache()
            m = L_OHs_trueLengths.shape[0]
            n = m
            stampedNote('Start wec')
            lengthTerm = np.einsum('m,mn->mn', L_OHs_trueLengths.astype(my_dtype), np.ones((m,n), dtype=my_dtype))
            stampedNote(f'lengthTerm.nbytes / 1e9 = {lengthTerm.nbytes / 1e9}')
            print(f'{lengthTerm.dtype}')
            lengthTerm = torch.from_numpy(lengthTerm)
            
            block_length = wec_block_size
            num_blocks = int(np.ceil( m / block_length ))  # ceil: rint yields 0 blocks when m < block_length / 2
            block_onsets = [block_index * block_length 
                            for block_index in range(num_blocks)]
            block_ends = block_onsets[1:] + [m]
            block_startStop_pairs = tuple(zip(block_onsets, block_ends))
            blockRanges = tuple([torch.arange(block_start, block_end)
                                 for block_start, block_end in block_startStop_pairs])
            
            L_OHs_t = torch.from_numpy(L_OHs)
            
#             P_d = np.concatenate([block_sum_op_t(torch.arange(block_start, block_end), L_OHs_t, L_OHs_t, lengthTerm[torch.arange(block_start, block_end)]).numpy()
#                                   for block_start, block_end in tqdm(block_startStop_pairs, 
#                                                                      total=len(block_startStop_pairs))])
            P_d = np.concatenate([block_sum_op_t(L_OHs_t[block_range], L_OHs_t, lengthTerm[block_range]).numpy()
                                          for block_range in tqdm(blockRanges, 
                                                                  total=len(blockRanges))])
            return P_d

In [153]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           125G        1.1G        7.9G        105M        116G        123G
Swap:          2.0G        104M        1.9G


In [154]:
# ≈2.5m cmu+wittgenstein
# 36s NXT_swbd+wittgenstein
H_d_np_W = H_d_np(Ws_npf, paddedOHs=True, parallel=True)
H_d_np_W.shape
H_d_np_W.nbytes / 1e9
H_d_np_W.dtype

[Parallel(n_jobs=-1)]: Using backend MultiprocessingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0114s.) Setting batch_size=34.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1387s.) Setting batch_size=98.
[Parallel(n_jobs=-1)]: Done 642 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 1220 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 6737 out of 6737 | elapsed:    1.4s finished


(6737, 6737)

0.045387169

dtype('int8')

In [155]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           125G        1.2G        7.8G        105M        116G        123G
Swap:          2.0G        104M        1.9G


In [156]:
if g and testing:
    # ≈2.5m cmu+wittgenstein
    torch.cuda.empty_cache()
    H_d_np_W_g = H_d_np(Ws_npf, paddedOHs=True, parallel=False, use_GPU=True)
    torch.cuda.empty_cache()

In [157]:
Ws_npf.shape

(6737, 6, 41)

In [158]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           125G        1.2G        7.9G        105M        116G        123G
Swap:          2.0G        104M        1.9G


In [159]:
if testing:
    # ≈0.5m NXT_swbd+wittgenstein, for block size 25 and memory overhead is ? (peak=?GB)
    # ≈1.8m cmu+wittgenstein, for block size 100 and memory overhead is ENORMOUS (peak=90-95GB)
    # ≈1.7m cmu+wittgenstein, for block size 50 and memory overhead is tolerable (peak=45-50GB)
    # ≈1.9m cmu+wittgenstein, for block size 25 and memory overhead is tolerable (peak=25-27GB)
    H_d_np_W_wec = H_d_np(Ws_npf, paddedOHs=True, parallel=False, use_GPU=False, wec=True, wec_block_size=25)

if g and testing:
    # 28s NXT_swbd+wittgenstein, block size 10, peak GPU mem usage = 1.6GB
    # ≈3m cmu+wittgenstein, for block size 5, peak GPU mem usage = 1.8GB 
    # ≈3.4m cmu+wittgenstein, for block size 20, peak GPU mem usage = 5.4GB 
    # ≈2.6m cmu+wittgenstein, for block size 10, peak GPU mem usage = 3.0GB 
    # ≈1m cmu+wittgenstein, for block size 10, peak GPU mem usage = 5.6GB 
    torch.cuda.empty_cache()
    H_d_np_W_wec = H_d_np(Ws_npf, paddedOHs=True, parallel=False, use_GPU=True, wec=True, wec_block_size=10)
    torch.cuda.empty_cache()

In [160]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           125G        1.2G        7.9G        105M        116G        123G
Swap:          2.0G        104M        1.9G


In [161]:
# if testing:
#     #using the multiprocessing backend ensures parallelization preserves order
#     H_d_np_W_noPar = H_d_np(Ws_npf, paddedOHs=True, parallel=False)
#     assert np.array_equal(H_d_np_W, H_d_np_W_noPar)

In [162]:
# !free -h

In [163]:
Ps_npf = lexiconToFixedSizeOHs(Ps_t)
Ps_npf.shape
Ps_npf.nbytes / 1e9

(21180, 6, 41)

0.00521028

In [164]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           125G        1.2G        7.8G        105M        116G        123G
Swap:          2.0G        104M        1.9G


In [165]:
len(Ws_t)
len(Ps_t)
(len(Ws_t) * len(Ws_t)) / (len(Ps_t) * len(Ps_t))

#wrong by about an order of magnitude for cmu?
#est amount of memory required for H_d_np_P as a multiple of the memory required for H_d_np_W
# 1 / ((len(Ws_t) * len(Ws_t)) / (len(Ps_t) * len(Ps_t)))

6737

21180

0.10117685676351182

In [166]:
(H_d_np_W.nbytes / 1e9)

#wrong by about an order of magnitude for cmu?
#est amount of memory required for H_d_np_P in GB
# H_d_np_P_est_space_GB = (H_d_np_W.nbytes / 1e9) * (1 / ((len(Ws_t) * len(Ws_t)) / (len(Ps_t) * len(Ps_t))))
# H_d_np_P_est_space_GB


memAvailable()
# (H_d_np_W.nbytes / 1e9) / memTotal()

0.045387169

123.32780456542969

In [167]:
o

'CM_AmE_destressed_aligned_w_LTR_NXT_swbd_destressed_pseudocount0.01/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2'

In [168]:
o + '_H_d_P' + '.npy'

'CM_AmE_destressed_aligned_w_LTR_NXT_swbd_destressed_pseudocount0.01/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2_H_d_P.npy'

In [169]:
torch.cuda.empty_cache()

In [170]:
# if true:
if len(Ps_t) > 60000 and (memAvailable() < 160):
# if H_d_np_P_est_space_GB > 100 or (memAvailable() - H_d_np_P_est_space_GB) < 5:
    print('Constructing H_d_np_P via memory mapping *now*...')
    
    H_d_np_P_fp = o + '_H_d_P' + '.npy'
    H_d_np_P = np.memmap(H_d_np_P_fp, dtype=my_dtype, mode='w+', shape=(len(Ps_t), len(Ps_t)))
    if g:
        H_d_np_P[:] = H_d_np(Ps_npf, paddedOHs=True, parallel=False, use_GPU=True)
    else:
        H_d_np_P[:] = H_d_np(Ps_npf, paddedOHs=True, parallel=True)
        
    H_d_P_md = {'W':{'from fp':p,
                     'changes':'sorted',
                     'size':len(Ws_t)},
                     'P':{'from_fp':p,
                          'changes':'extracted from W, sorted',
                          'size':len(Ps_t)}}
    exportMatrixMetadata(o + '_H_d_P' + '.npy' + '_metadata.json',
                         path.basename(o) + '_H_d_P' + '.npy' + '_metadata.json',
                         H_d_np_P,
                         H_d_P_md,
                         'Step 4b',
                         'Calculate word-prefix relation, Hamming distances, and k-cousin relation.ipynb',
                        {'Storage':'file is MEMORY MAPPED.'})
        
    alreadyMemoryMapped_H_d_p = True
else:
    alreadyMemoryMapped_H_d_p = False
#     paddedOHs, parallel, use_GPU, wec, wec_block_size


    #10.5m = 91.5cps NXT_swbd+wittgenstein, peak memory usage @ ?/54941 calcs ≈GB (baseline 25-27GB), peak GPU RAM usage 6.7GB
#     H_d_np_P = H_d_np(Ps_npf, True, False, True, True, 15)

    #≈35cps cmu+wittgenstein, peak memory usage @ ≈12940/129403 calcs ≈29.5GB (baseline 13GB), peak GPU RAM use 5.6GB
    #8.83m = 103.6cps NXT_swbd+wittgenstein, peak memory usage @ ?/54941 calcs ≈GB (baseline 25-27GB), peak GPU RAM usage 2.7GB
#     H_d_np_P = H_d_np(Ps_npf, True, False, True, True, 5)
    
#     H_d_np_P = H_d_np(Ps_npf, True, False, True, True, 3) #≈33cps cmu+wittgenstein, peak memory usage @ 12942/129403 calcs ≈37.8GB (baseline 21GB), peak GPU RAM use 3.7GB


    #5.5m = 166.5cps NXT_swbd+wittgenstein, peak memory usage ≈102GB, (baseline 34GB)
#     H_d_np_P = H_d_np(Ps_npf, True, False, False, True, 45)

    #5.5m = 166.5cps NXT_swbd+wittgenstein, peak memory usage ≈73GB, (baseline 28-30GB)
#     H_d_np_P = H_d_np(Ps_npf, True, False, False, True, 25)

    #4.5cps (67.5cps?) cmu+wittgenstein, peak memory usage @ 300/129403 calcs ≈98GB
    #5.5m = 166.5cps NXT_swbd+wittgenstein, peak memory usage ≈56GB (baseline 28GB)
    H_d_np_P = H_d_np(Ps_npf, True, False, False, True, 15)

    #6.6cps (66cps?) cmu+wittgenstein, peak memory usage @ 1246/129403 calcs ≈58GB
#     H_d_np_P = H_d_np(Ps_npf, True, False, False, True, 10) 

#     if g:
#         #≈53 cps cmu+wittgenstein, peak memory usage @ ≈12940/129403 calcs ≈10.8GB (baseline 7GB), peak GPU RAM usage 1.7GB
#         #7.13m = 128 cps NXT_swbd+wittgenstein, peak GPU RAM usage 1.1GB
#         H_d_np_P = H_d_np(Ps_npf, True, False, True) 
#     else:
#         #57 cps cmu+wittgenstein, w/ 129403 calcs to do for cmu
#         #7.9m = 115.9 cps NXT_swbd+wittgenstein w/ 54941 calcs to do
#         H_d_np_P = H_d_np(Ps_npf, True, True) 
    
    H_d_np_P.shape
    H_d_np_P.nbytes / 1e9
    H_d_np_P.dtype
    

Start wec @ 12:24:55


  0%|          | 0/1412 [00:00<?, ?it/s]

lengthTerm.nbytes / 1e9 = 0.4485924 @ 12:24:55
int8


[Parallel(n_jobs=-1)]: Using backend MultiprocessingBackend with 32 concurrent workers.
  0%|          | 1/1412 [00:00<14:21,  1.64it/s][Parallel(n_jobs=-1)]: Batch computation too fast (0.1084s.) Setting batch_size=2.
  5%|▍         | 64/1412 [00:00<09:37,  2.34it/s][Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.2s
  7%|▋         | 99/1412 [00:00<04:39,  4.70it/s][Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:    0.4s
  9%|▊         | 123/1412 [00:01<03:13,  6.65it/s][Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.5s
 10%|█         | 142/1412 [00:01<02:15,  9.34it/s][Parallel(n_jobs=-1)]: Done  49 tasks      | elapsed:    0.6s
 12%|█▏        | 170/1412 [00:01<01:35, 13.05it/s][Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    0.7s
 16%|█▌        | 221/1412 [00:01<00:39, 29.79it/s][Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:    1.3s
 18%|█▊        | 250/1412 [00:02<00:24, 47.70it/s][Parallel(n_jobs=-1)]: Done 132 tasks      | elapsed:    1.6s

(21180, 21180)

0.4485924

dtype('int8')

In [171]:
if testing:
    # NaN never compares equal to anything, so `== np.nan` is always False; use np.isnan instead
    np.isnan(H_d_np_P).nonzero() #should be empty
    assert not np.isnan(H_d_np_P).any()

In [172]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           125G        2.0G        7.0G        105M        116G        122G
Swap:          2.0G        104M        1.9G


# $k$-cousin calculation

## Definitions, motivation, and calculation sketch

Let $s$ be a finite-length string over $\Sigma$ and let $L$ be a finite set of strings over $\Sigma$.

**k-sphere**: $s'$ is in the *exact* $k$-sphere of $s$ w.r.t. $L$ iff $s' \in L$ and the Hamming distance of $s'$ from $s$ is *exactly* $k$.

**k-cousin**: string $p$ is an *exact* $k$-cousin of segmental wordform $w$ w.r.t. $L$ iff
 - $w \in L$
 - $p \in \text{prefixes}(L)$
 - $\exists p' \in \text{exact-}k\text{-sphere}(p) \cap \text{prefixes}(w)$
 - i.e. if $w$, when trimmed to length $|p|$ to produce prefix $p'$, has Hamming distance exactly $k$ from $p$, then $p$ and $w$ are exact $k$-cousins.
 - *NB:* if $|w| < |p|$, then $p$ and $w$ are $\infty$-cousins, since the Hamming distance between the closest prefix $p' = w$ of $w$ and $p$ is $\infty$.
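The definition can be illustrated on plain tuples of segments. This is a minimal sketch under simplifying assumptions: the helper names `hamming` and `exact_k` are hypothetical, the membership conditions ($w \in L$, $p \in \text{prefixes}(L)$) are not checked, and a length mismatch is represented by `math.inf` rather than the integer sentinel used in the arrays.

```python
import math

def hamming(a, b):
    """Hamming distance between equal-length sequences; infinite otherwise."""
    if len(a) != len(b):
        return math.inf
    return sum(x != y for x, y in zip(a, b))

def exact_k(p, w):
    """Return the k for which prefix p and wordform w are exact k-cousins."""
    if len(w) < len(p):
        return math.inf          # w is too short to be trimmed to |p|
    p_prime = w[:len(p)]         # w trimmed to length |p|
    return hamming(p, p_prime)

w = ('k', 'a', 't')
print(exact_k(('k', 'a'), w))            # p is itself a prefix of w
print(exact_k(('k', 'o'), w))            # one substituted segment
print(exact_k(('k', 'a', 't', 's'), w))  # p is longer than w
```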

**Motivation**: Consider incremental word recognition:
 - for low $k$, the exact $k$-cousins of a prefix $p$ are complete wordforms that are more plausible candidates for the intended wordform that gave rise to $p$ than its exact $k'$-cousins for higher $k'$
 - for low $k$, the exact $k$-cousins of a wordform $w$ are prefixes that are more likely to be incremental misperceptions or misproductions of $w$ than its exact $k'$-cousins for higher $k'$

**Calculation sketch**:
 1. Calculate the pairwise Hamming distances between all pairs of prefixes.
 2. Given a mapping (calculated earlier) from every wordform (index) $w$ and length $l$ to the prefix (index) $p'$ that results when $w$ is trimmed to length $l$, we can trivially calculate for every prefix-wordform pair $(p, w)$ the exact $k$ s.t. $p$ and $w$ are exact $k$-cousins: look up the distance between $p$ and the prefix $p'$ obtained by trimming $w$ to length $|p|$.
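The two steps can be sketched on a toy lexicon. This is an illustrative sketch under stated assumptions, not the notebook's code: segments are single characters, `MISMATCH` is a hypothetical stand-in for `length_mismatch_constant`, and the trimming map is stored as a dense array `trim_idx` (with `-1` for "word too short") instead of the dictionary used in the cells below.

```python
import numpy as np

MISMATCH = -1  # hypothetical sentinel for "no finite distance"

words = ['ab', 'ac', 'b']
prefixes = sorted({w[:l] for w in words for l in range(1, len(w) + 1)})
p_index = {p: i for i, p in enumerate(prefixes)}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b)) if len(a) == len(b) else MISMATCH

# Step 1: pairwise Hamming distances between all prefixes.
H_P = np.array([[hamming(p, q) for q in prefixes] for p in prefixes], dtype=np.int8)

# Mapping (word index, length) -> index of the word trimmed to that length.
max_len = max(map(len, words))
trim_idx = np.full((len(words), max_len + 1), -1, dtype=np.int64)
for w_i, w in enumerate(words):
    for l in range(1, len(w) + 1):
        trim_idx[w_i, l] = p_index[w[:l]]

# Step 2: for each prefix p, look up the distance to every word trimmed to |p|.
p_lens = np.array([len(p) for p in prefixes])
cols = trim_idx[:, p_lens].T                     # cols[p, w] = trimmed-prefix index or -1
rows = np.arange(len(prefixes))[:, None]
K = np.where(cols >= 0, H_P[rows, cols], MISMATCH).astype(np.int8)
print(prefixes)
print(K)
```

Row `i` of `K` then holds, for prefix `prefixes[i]`, the exact $k$ with respect to each wordform, so the exact $k$-cousins of a prefix are the columns where that row equals $k$.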

In [173]:
H_d_np_P.shape

(21180, 21180)

In [174]:
# P_idxs_of_Ws_t = np.array([w_idx_to_p_idx[w_idx] for w_idx in range(len(Ws_t))])
# assert Ws_t == tuple([Ps_t[p_idx] for p_idx in P_idxs_of_Ws_t])

In [175]:
P_idxs_of_trimmed_Ws_t = lambda l: np.array([w_idx_to_l_to_p_idx.get((w_idx, l), None)
                                             for w_idx in range(len(Ws_t))])
P_idxs_of_trimmed_Ws_t(5)

array([    4,     7,    10, ..., 21171, 21175, 21178])

In [176]:
k_cousin_function_np_shape = (len(Ps_t), len(Ws_t))
k_cousin_function_np_shape

(21180, 6737)

In [177]:
H_d_np_col_retrieval = lambda p_idx, p_idxs_or_Nones: np.array([H_d_np_P[p_idx, p_idx_prime]
                                                                if p_idx_prime is not None else length_mismatch_constant
                                                                for p_idx_prime in p_idxs_or_Nones])
def H_d_np_col_retrieval_par(p_idx):
    return np.array([H_d_np_P[p_idx, p_idx_prime]
                     if p_idx_prime is not None else length_mismatch_constant
                     for p_idx_prime in P_idxs_of_trimmed_Ws_t( len(ds2t(Ps_t[p_idx])) )])

In [178]:
#67s NXT_swbd+wittgenstein, w/ baseline memory usage 39GB, peak ≈45GB?
k_cousin_function_np = np.stack(par(delayed(H_d_np_col_retrieval_par)(p_idx)
                                    for p_idx in range(len(Ps_t)))).astype(my_dtype)

#7.5m NXT_swbd + wittgenstein
# k_cousin_function_np = np.stack([H_d_np_col_retrieval(p_idx, P_idxs_of_trimmed_Ws_t( len(ds2t(p)) ))
#                                 for p_idx, p in tqdm(enumerate(Ps_t), total=len(Ps_t))]).astype(my_dtype)
# k_cousin_function_np = np.stack([H_d_np_P[p_idx, P_idxs_of_trimmed_Ws_t( len(ds2t(p)) )]
#                                 for p_idx, p in tqdm(enumerate(Ps_t), total=len(Ps_t))])
k_cousin_function_np.shape
k_cousin_function_np.nbytes / 1e9
k_cousin_function_np.dtype

[Parallel(n_jobs=-1)]: Using backend MultiprocessingBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0181s.) Setting batch_size=22.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  49 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.1427s.) Setting batch_size=60.
[Parallel(n_jobs=-1)]: Done 438 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 812 tasks      | elapsed:    0.4s
[Parallel(n_jobs=-1)]: Done 1230 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 1952 tasks      | elapsed:    1.0s
[Parallel(n_jobs=-1)]: Done 3212 tasks      | elapsed:    1.1s
[Parallel(n_jobs=-1)]: Done 4472 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done 5852 tasks      | elapsed:    

(21180, 6737)

0.14268966

dtype('int8')

In [179]:
rand_pref = choice(Ps_t)
while rand_pref[-1] == rightEdge:
    rand_pref = choice(Ps_t)
rand_pref

rand_pref_idx = Ps_t.index(rand_pref)
rand_pref_idx

rand_pref_l = len(ds2t(rand_pref))
rand_pref_l

'⋊.aɪ.k.l'

118

4

In [180]:
check_arr = []
for w_idx in tqdm(range(len(Ws_t))):
    w = Ws_t[w_idx]
    w_l = len(ds2t(w))
    if w_l >= rand_pref_l:
        my_p_prime_t = ds2t(w)[:rand_pref_l]
        my_p_prime = t2ds(my_p_prime_t)
        my_p_prime_idx = Ps_t.index( my_p_prime )
#         print(my_p_prime, my_p_prime_idx)
        k_val = H_d_np_P[rand_pref_idx, my_p_prime_idx]
    else:
        k_val = length_mismatch_constant
    check_arr.append(k_val)
check_arr_np = np.array(check_arr)
k_cousin_function_np[rand_pref_idx] == check_arr_np
assert np.array_equal(k_cousin_function_np[rand_pref_idx], check_arr_np)

100%|██████████| 6737/6737 [00:00<00:00, 6814.14it/s] 


array([ True,  True,  True, ...,  True,  True,  True])

In [187]:
if testing:
    rand_pref_5cousins = get_k_cousins(rand_pref, 5, Ws_t, Ps_t, exactlyK = True)
    sorted(rand_pref_5cousins)

[]

In [188]:
if testing:
    (k_cousin_function_np[rand_pref_idx] == 5).nonzero()[0]

    k_cousin_function_np[rand_pref_idx, 
                         (k_cousin_function_np[rand_pref_idx] == 5).nonzero()[0]  ]

    lmap(lambda w_idx: Ws_t[w_idx], 
         (k_cousin_function_np[rand_pref_idx] == 5).nonzero()[0])

    assert sorted(rand_pref_5cousins) == sorted(lmap(lambda w_idx: Ws_t[w_idx], 
                                                     (k_cousin_function_np[rand_pref_idx] == 5).nonzero()[0]))


array([], dtype=int64)

array([], dtype=int8)

[]

In [183]:
if testing:
    num_checks = 1000

    rand_prefs = []
    while len(rand_prefs) < num_checks:
        rand_pref = choice(Ps_t)
        while rand_pref[-1] == rightEdge:
            rand_pref = choice(Ps_t)
        rand_prefs.append(rand_pref)

    rand_pref_idxs = lmap(lambda p: Ps_t.index(p), 
                          rand_prefs)
    rand_pref_ls = lmap(lambda p: len(ds2t(p)),
                        rand_prefs)
    rand_ks = [choice([1,2,3,4]) for each in rand_prefs]

    for p, p_idx, p_l, k in tqdm(zip(rand_prefs, rand_pref_idxs, rand_pref_ls, rand_ks),
                                 total=len(rand_prefs)):
        #reference implementation
        rand_pref_k_cousins_ref = sorted(get_k_cousins(p, k, Ws_t, Ps_t, exactlyK = True))

        rand_pref_k_cousins = sorted(lmap(lambda w_idx: Ws_t[w_idx],
                                          (k_cousin_function_np[p_idx] == k).nonzero()[0]))
        assert rand_pref_k_cousins_ref == rand_pref_k_cousins

In [184]:
def get_k_cousins_of_pref(p, k):
    p_idx = Ps_t.index(p)
    return lmap(lambda w_idx: Ws_t[w_idx],
                (k_cousin_function_np[p_idx] == k).nonzero()[0])

rand_pref
get_k_cousins_of_pref(rand_pref, 5)

'⋊.aɪ.k.l'

[]

# Export

We want to export
 - the prefix-word relation
 - the Hamming distance matrix between all pairs of wordforms
 - the Hamming distance matrix between all pairs of prefixes
 - the $k$-cousin relation between all pairs of prefixes and wordforms
 
plus associated metadata.
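One point to keep in mind when reading these files back: `np.save` writes a `.npy` header, whereas `np.memmap(..., mode='w+')` writes raw bytes only, so a memory-mapped file has to be reopened with the same `dtype` and `shape`. A minimal sketch of both round trips, using hypothetical file names in a temporary directory; `np.lib.format.open_memmap` is shown only as a possible alternative that keeps the header, not as what this notebook does:

```python
import os
import tempfile
import numpy as np

shape, dtype = (4, 4), np.int8
data = np.arange(16, dtype=dtype).reshape(shape)

with tempfile.TemporaryDirectory() as tmp:
    # Raw memory map: no header, so dtype and shape must be supplied again on read.
    raw_fp = os.path.join(tmp, 'raw_matrix.npy')
    mm = np.memmap(raw_fp, dtype=dtype, mode='w+', shape=shape)
    mm[:] = data
    mm.flush()
    del mm
    back = np.memmap(raw_fp, dtype=dtype, mode='r', shape=shape)
    raw_roundtrip_ok = np.array_equal(back, data)
    del back

    # Alternative: a memory map with a real .npy header, readable via np.load.
    npy_fp = os.path.join(tmp, 'with_header.npy')
    om = np.lib.format.open_memmap(npy_fp, mode='w+', dtype=dtype, shape=shape)
    om[:] = data
    om.flush()
    del om
    loaded = np.load(npy_fp, mmap_mode='r')
    header_roundtrip_ok = np.array_equal(loaded, data)
    del loaded

print(raw_roundtrip_ok, header_roundtrip_ok)
```

Recording the matrix shape and dtype alongside a raw memory-mapped file is what makes it recoverable later.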

In [190]:
prefix_relation_np.shape
len(Ws_t), len(Ps_t)

np.save(o + '_prefix_relation' + '.npy', prefix_relation_np)

(6737, 21180)

(6737, 21180)

In [191]:
prefix_relation_md = {'W':{'from fp':p,
                           'changes':'sorted',
                           'size':len(Ws_t)},
                      'P':{'from_fp':p,
                           'changes':'extracted from W, sorted',
                           'size':len(Ps_t)}}
exportMatrixMetadata(o + '_prefix_relation' + '.npy' + '_metadata.json',
                     path.basename(o) + '_prefix_relation' + '.npy' + '_metadata.json',
                     prefix_relation_np,
                     prefix_relation_md,
                     'Step 4b',
                     'Calculate word-prefix relation, Hamming distances, and k-cousin relation.ipynb',
                    {})

Wrote metadata for 
	LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2_prefix_relation.npy_metadata.json
 to 
	CM_AmE_destressed_aligned_w_LTR_NXT_swbd_destressed_pseudocount0.01/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2_prefix_relation.npy_metadata.json


In [192]:
H_d_np_W.shape
len(Ws_t), len(Ws_t)

np.save(o + '_H_d_W' + '.npy', H_d_np_W)

(6737, 6737)

(6737, 6737)

In [193]:
H_d_W_md = {'W':{'from fp':p,
                 'changes':'sorted',
                 'size':len(Ws_t)}}
exportMatrixMetadata(o + '_H_d_W' + '.npy' + '_metadata.json',
                     path.basename(o) + '_H_d_W' + '.npy' + '_metadata.json',
                     H_d_np_W,
                     H_d_W_md,
                     'Step 4b',
                     'Calculate word-prefix relation, Hamming distances, and k-cousin relation.ipynb',
                    {})

Wrote metadata for 
	LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2_H_d_W.npy_metadata.json
 to 
	CM_AmE_destressed_aligned_w_LTR_NXT_swbd_destressed_pseudocount0.01/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2_H_d_W.npy_metadata.json


In [194]:
H_d_np_P.shape
len(Ps_t), len(Ps_t)

if not alreadyMemoryMapped_H_d_p:
    H_d_np_P_mm = np.memmap(o + '_H_d_P' + '.npy', dtype=my_dtype, mode='w+', shape=(len(Ps_t), len(Ps_t)))
    H_d_np_P_mm[:] = H_d_np_P
#     np.save(path.join(o, o + '_H_d_P' + '.npy'), H_d_np_P)
    
    H_d_P_md = {'W':{'from fp':p,
                     'changes':'sorted',
                     'size':len(Ws_t)},
                     'P':{'from_fp':p,
                          'changes':'extracted from W, sorted',
                          'size':len(Ps_t)}}
    exportMatrixMetadata(o + '_H_d_P' + '.npy' + '_metadata.json',
                         path.basename(o) + '_H_d_P' + '.npy' + '_metadata.json',
                         H_d_np_P,
                         H_d_P_md,
                         'Step 4b',
                         'Calculate word-prefix relation, Hamming distances, and k-cousin relation.ipynb',
                        {'Storage':'file is MEMORY MAPPED.'})

(21180, 21180)

(21180, 21180)

Wrote metadata for 
	LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2_H_d_P.npy_metadata.json
 to 
	CM_AmE_destressed_aligned_w_LTR_NXT_swbd_destressed_pseudocount0.01/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2_H_d_P.npy_metadata.json


In [195]:
k_cousin_function_np.shape
len(Ps_t), len(Ws_t)

np.save(o + '_k_cousin_function' + '.npy', k_cousin_function_np)

(21180, 6737)

(21180, 6737)

In [196]:
k_cousin_function_md = {'P':{'from_fp':p,
                             'changes':'extracted from W, sorted',
                             'size':len(Ps_t)},
                        'W':{'from fp':p,
                             'changes':'sorted',
                             'size':len(Ws_t)}}
exportMatrixMetadata(o + '_k_cousin_function' + '.npy' + '_metadata.json',
                     path.basename(o) + '_k_cousin_function' + '.npy' + '_metadata.json',
                     k_cousin_function_np,
                     k_cousin_function_md,
                     'Step 4b',
                     'Calculate word-prefix relation, Hamming distances, and k-cousin relation.ipynb',
                    {})

Wrote metadata for 
	LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2_k_cousin_function.npy_metadata.json
 to 
	CM_AmE_destressed_aligned_w_LTR_NXT_swbd_destressed_pseudocount0.01/LTR_NXT_swbd_destressed_aligned_CM_filtered_LM_filtered_X0X1X2_k_cousin_function.npy_metadata.json
