In [1]:
cd ..

/home/jonatan/Dropbox/FairyText


## Vecs Containers

When training NLP models, the input samples usually have variable length, for example sentences with different numbers of words. To train with minibatches, however, you need to stack them into a single matrix, which means padding the shorter samples.
I have implemented several containers for extremely fast minibatch creation from variable-length vectors.

In [2]:
from rush.vecs import FloatVecs, IntVecs, ShortVecs

You can create instances of these containers from a list of 1-D numpy.ndarrays or torch tensors.
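
To give some intuition for what such a container could hold, here is a minimal sketch of one common layout: all vectors concatenated into one flat buffer, plus an array of offsets. The class `ToyIntVecs` is hypothetical and written only for illustration; it is an assumption about a plausible design, not the actual internals of `rush.vecs`.

```python
import numpy as np


class ToyIntVecs:
    """Hypothetical stand-in: one flat int32 buffer plus start/end offsets."""

    def __init__(self, vecs):
        lengths = np.array([len(v) for v in vecs], dtype=np.int64)
        # offsets[i] is where vector i starts; offsets[i + 1] is where it ends.
        self.offsets = np.concatenate(([0], np.cumsum(lengths)))
        # np.concatenate needs at least one array, so handle the empty case separately.
        self.data = (np.concatenate(vecs).astype(np.int32)
                     if len(vecs) else np.empty(0, dtype=np.int32))
        self.num_vecs = len(vecs)

    def __getitem__(self, i):
        # Slicing the flat buffer returns a view, so no data is copied here.
        return self.data[self.offsets[i]:self.offsets[i + 1]]


toy = ToyIntVecs([np.array([4, 7, 1]), np.array([9]), np.array([2, 5])])
print(toy.num_vecs)            # 3
print(toy[0], toy[1], toy[2])  # the three original vectors
```

With this layout, looking up a vector is a single slice of contiguous memory rather than a lookup of a separate Python object per sample.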

### Example

Load some sentences

In [3]:
import numpy as np
import pandas as pd
import re

pattern = re.compile('[^a-z +]')
with open('data/multi30k/train.en','r') as f:
    sentences = pd.Series([pattern.sub(' ',x.lower()) for x in f.readlines()])
sentences.head()

0    two young  white males are outside near many b...
1    several men in hard hats are operating a giant...
2     a little girl climbing into a wooden playhouse  
3    a man in a blue shirt is standing on a ladder ...
4            two men are at the stove preparing food  
dtype: object

Build vocabulary and mappings to and from integers

In [4]:
# Collect every distinct token; splitting on single spaces also yields '' for repeated spaces.
word_set = set()
for sent in sentences:
    word_set.update(x.strip() for x in sent.split(' '))
all_words = sorted(word_set)
all_words[:25]

['',
 'a',
 'aaa',
 'aaron',
 'abandon',
 'abandoned',
 'abdomen',
 'aberdeen',
 'able',
 'aboard',
 'abound',
 'about',
 'above',
 'abroad',
 'abs',
 'abstract',
 'accelerates',
 'accept',
 'accepting',
 'accepts',
 'accessing',
 'accessories',
 'accident',
 'accommodates',
 'accompanied']

In [5]:
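# Map integer IDs to words; the extra last ID is reserved for 'Ø' (unknown word / padding).
id2word_dict = dict(enumerate(all_words))
id2word_dict[len(all_words)] = 'Ø'
word2id_dict = {word: idx for idx, word in id2word_dict.items()}

def get_word_IDs(sentence):
    """Convert a sentence into an int32 array of word IDs."""
    unknown_id = word2id_dict['Ø']
    # Words missing from the vocabulary fall back to the 'Ø' ID.
    IDs = [word2id_dict.get(word, unknown_id) for word in sentence.split(' ')]
    return np.asarray(IDs, dtype=np.int32)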
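
As a small self-contained illustration of the mapping, the sketch below uses a toy vocabulary. The names `toy_words`, `toy_word2id` and `toy_get_word_ids` are hypothetical and exist only for this example. It shows that an unknown word receives the 'Ø' ID and that a double space produces an empty-string token.

```python
import numpy as np

# Toy vocabulary (hypothetical), with 'Ø' reserved as the last ID.
toy_words = ['', 'a', 'dog', 'runs']
toy_word2id = {word: idx for idx, word in enumerate(toy_words)}
toy_word2id['Ø'] = len(toy_words)


def toy_get_word_ids(sentence, word2id):
    unknown_id = word2id['Ø']
    # dict.get returns the fallback ID when the word is not in the vocabulary.
    return np.asarray([word2id.get(w, unknown_id) for w in sentence.split(' ')],
                      dtype=np.int32)


# 'fast' is not in the vocabulary; the double space yields the '' token (ID 0).
ids = toy_get_word_ids('a dog  runs fast', toy_word2id)
print(ids)  # [1 2 0 3 4]
```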

Get sentences as variable-length vectors of integers

In [6]:
Sentences_as_ints = pd.Series([get_word_IDs(sent) for sent in sentences])
Sentences_as_ints.head()

0    [9027, 9666, 0, 9481, 5028, 319, 5775, 5503, 5...
1    [7384, 5190, 4245, 3896, 3916, 319, 5701, 1, 3...
2     [1, 4870, 3585, 1657, 4360, 1, 9570, 6253, 0, 0]
3    [1, 5032, 4245, 1, 878, 7473, 4387, 8090, 5677...
4    [9027, 5190, 319, 407, 8640, 8202, 6431, 3341,...
dtype: object

Now we can make an instance of an IntVecs container

In [7]:
variable_length_int_vecs = IntVecs(Sentences_as_ints.tolist())

### make_padded_minibatch method

Let's make 5 different random minibatches

In [8]:
minibatchsize = 256
for i in range(5):
    minibatch = variable_length_int_vecs.make_padded_minibatch(
        np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), # random indices
        fill_value = word2id_dict['Ø'] # padding value
    )
    print(minibatch)


 6062  4245     1  ...   9688  9688  9688
 9027  3588  4245  ...   9688  9688  9688
    1  2510  8920  ...   9688  9688  9688
       ...          ⋱          ...       
    1  7749  5032  ...   9688  9688  9688
 3405  6257     1  ...   9688  9688  9688
    1  8224  9193  ...   9688  9688  9688
[torch.IntTensor of size 256x32]


    1  2510  4495  ...   9688  9688  9688
    1  5032  4245  ...   9688  9688  9688
 5032  9547     1  ...   9688  9688  9688
       ...          ⋱          ...       
 9027  1547  1652  ...   9688  9688  9688
    1  5032  4245  ...   9688  9688  9688
    1  5032  9547  ...   9688  9688  9688
[torch.IntTensor of size 256x33]


 7616  9201  1764  ...   9688  9688  9688
    1  3894  8604  ...   9688  9688  9688
    1  5032  8501  ...   9688  9688  9688
       ...          ⋱          ...       
    1  5032  4245  ...   9688  9688  9688
    1  5032   222  ...   9688  9688  9688
 8680  6062  6985  ...   9688  9688  9688
[torch.IntTensor of size 256x35]


 2502  9405 

A naive implementation would look as follows:

In [9]:
import torch
def naive_implementation(variable_length_vecs, indices, fill_value):
    max_len = max([len(x) for x in variable_length_vecs.iloc[indices]])
    out = np.empty( (len(indices),max_len), dtype=np.int32)
    out.fill(fill_value)
    for i, vec in enumerate(variable_length_vecs.iloc[indices]):
        out[i,:len(vec)] = vec
    return torch.from_numpy(out)
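
For comparison, the per-row copy loop can also be replaced by a boolean mask in plain NumPy. This is a different technique from both the naive version and the container, shown here only as a sketch on a toy input; `masked_pad` is a hypothetical helper name.

```python
import numpy as np


def masked_pad(vecs, fill_value):
    """Pad a list of 1-D integer arrays into one matrix using a boolean mask."""
    lengths = np.array([len(v) for v in vecs])
    max_len = lengths.max()
    out = np.full((len(vecs), max_len), fill_value, dtype=np.int32)
    # mask[i, j] is True where row i holds a real element (j < length of row i).
    mask = np.arange(max_len) < lengths[:, None]
    # Boolean assignment fills the True cells in row-major order,
    # which matches the order of the concatenated vectors.
    out[mask] = np.concatenate(vecs)
    return out


batch = masked_pad([np.array([4, 7, 1]), np.array([9]), np.array([2, 5])],
                   fill_value=-1)
print(batch)
# [[ 4  7  1]
#  [ 9 -1 -1]
#  [ 2  5 -1]]
```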
    

Let's time the two

In [10]:
%timeit naive_implementation(Sentences_as_ints, np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), fill_value = word2id_dict['Ø'])

374 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [11]:
%timeit variable_length_int_vecs.make_padded_minibatch( np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), fill_value = word2id_dict['Ø'])

21.3 µs ± 3.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In this run the container is roughly 17 times faster than the naive implementation (21.3 µs versus 374 µs per minibatch).


### make_minibatch_with_random_lengths method

Now let's say you have huge documents instead of short sentences. Then you might want to sample a random position in each document and take a window of fixed length from there.
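
The sketch below shows one plausible reading of this idea in plain NumPy. It is an assumption about the intended behaviour, not the container's actual code: vectors longer than `max_len` contribute a random contiguous window, and shorter ones are copied whole and padded. The function name `sample_fixed_windows` and the toy documents are hypothetical.

```python
import numpy as np


def sample_fixed_windows(vecs, indices, max_len, fill_value, rng):
    """For each selected vector, copy a random window of at most max_len items."""
    out = np.full((len(indices), max_len), fill_value, dtype=np.int32)
    for row, idx in enumerate(indices):
        vec = vecs[idx]
        if len(vec) > max_len:
            # Pick a random start so that the whole window fits inside the vector.
            start = rng.integers(0, len(vec) - max_len + 1)
            out[row] = vec[start:start + max_len]
        else:
            # Short vectors are copied whole; the tail keeps fill_value.
            out[row, :len(vec)] = vec
    return out


rng = np.random.default_rng(0)
docs = [np.arange(100, 130), np.array([1, 2, 3])]  # one long and one short "document"
batch = sample_fixed_windows(docs, [0, 1, 0], max_len=5, fill_value=-1, rng=rng)
print(batch.shape)  # (3, 5)
print(batch)
```

Every row has exactly `max_len` columns, so the output shape is fixed regardless of which documents are drawn.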

In [12]:
max_len = 20
out_buffer = np.empty( (minibatchsize,max_len), dtype=np.int32 )
for i in range(5):
    minibatch = variable_length_int_vecs.make_minibatch_with_random_lengths(
        np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), 
        fill_value = word2id_dict['Ø'],
        out=out_buffer,
        max_len=max_len)
    print(minibatch)


    1  1757  6257  ...   9688  9688  9688
    1  4892   360  ...   9688  9688  9688
    1  5032  4245  ...   9688  9688  9688
       ...          ⋱          ...       
 6062  8090   407  ...   9688  9688  9688
 9560  5677  8640  ...   9688  9688  9688
 3403  7628   319  ...      0  9688  9688
[torch.IntTensor of size 256x20]


    1   626  6248  ...   9688  9688  9688
    1  3763  5651  ...   9688  9688  9688
    1  4689  2126  ...   9688  9688  9688
       ...          ⋱          ...       
    1  5668  5032  ...   9688  9688  9688
    1  5032  4245  ...   9688  9688  9688
 5032  4245  3913  ...   9688  9688  9688
[torch.IntTensor of size 256x20]


    1  9666  3585  ...   9688  9688  9688
    1  9481  5027  ...      0  9688  9688
    0     0  6062  ...   6254     1  3500
       ...          ⋱          ...       
    1  9666   984  ...   9688  9688  9688
 7616  6062  4029  ...   9688  9688  9688
 4050  6062  7613  ...   9688  9688  9688
[torch.IntTensor of size 256x20]


    1  2510 