In [12]:
cd ..

/home/jonatan/Dropbox


## Vecs Containers

When training NLP models, the input samples usually have variable length, for example sentences with different numbers of words. To train with minibatches, however, you need to stack them into a single matrix, which means padding the shorter samples.
I have implemented several containers for very fast minibatch creation from variable-length vectors.

In [13]:
from rush.vecs import FloatVecs, IntVecs, ShortVecs

You can create instances of these containers from a list of 1-D numpy.ndarrays or torch tensors.

### Example

Load some sentences

In [14]:
import numpy as np
import pandas as pd
import re

pattern = re.compile('[^a-z +]')
with open('data/multi30k/train.en','r') as f:
    sentences = pd.Series([pattern.sub(' ',x.lower()) for x in f.readlines()])
sentences.head()

0    two young  white males are outside near many b...
1    several men in hard hats are operating a giant...
2     a little girl climbing into a wooden playhouse  
3    a man in a blue shirt is standing on a ladder ...
4            two men are at the stove preparing food  
dtype: object

Build mappings to integers

In [15]:
# Collect every distinct token; splitting on single spaces also yields '' for repeated spaces
word_set = set()
for sent in sentences:
    word_set |= {x.strip() for x in sent.split(' ')}
all_words = sorted(word_set)
id2word_dict = dict(enumerate(all_words))
id2word_dict[len(all_words)] = 'Ø'  # extra ID used for padding and unknown words
word2id_dict = {word: idx for idx, word in id2word_dict.items()}

def get_word_IDs(sentence):
    IDs = []
    for word in sentence.split(' '):
        try:
            IDs.append( word2id_dict[word] )
        except KeyError:  # unknown word: fall back to the 'Ø' ID
            IDs.append( word2id_dict['Ø'] )
    return np.asarray(IDs, dtype=np.int32)
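
To see how such a mapping behaves, here is a small self-contained sketch on a toy vocabulary. The names `toy_word2id` and `encode` are hypothetical and not part of the notebook; the sketch uses `dict.get` for the fallback, which should behave the same as the `try`/`except` version for this purpose.

```python
import numpy as np

# Hypothetical toy vocabulary; 'Ø' plays the same padding/unknown role as above.
toy_words = sorted({"a", "dog", "runs"})
toy_word2id = {word: idx for idx, word in enumerate(toy_words)}
toy_word2id["Ø"] = len(toy_words)

def encode(sentence):
    """Map each space-separated token to its ID, falling back to the 'Ø' ID."""
    unknown = toy_word2id["Ø"]
    return np.asarray(
        [toy_word2id.get(word, unknown) for word in sentence.split(" ")],
        dtype=np.int32,
    )

ids = encode("a dog barks")
print(ids)  # 'barks' is not in the toy vocabulary, so it gets the 'Ø' ID
```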

Get sentences as variable length vectors of integers

In [16]:
Sentences_as_ints = pd.Series([get_word_IDs(sent) for sent in sentences])
Sentences_as_ints.head()

0    [9027, 9666, 0, 9481, 5028, 319, 5775, 5503, 5...
1    [7384, 5190, 4245, 3896, 3916, 319, 5701, 1, 3...
2     [1, 4870, 3585, 1657, 4360, 1, 9570, 6253, 0, 0]
3    [1, 5032, 4245, 1, 878, 7473, 4387, 8090, 5677...
4    [9027, 5190, 319, 407, 8640, 8202, 6431, 3341,...
dtype: object

Now we can make an instance of an IntVecs container

In [17]:
variable_length_int_vecs = IntVecs(Sentences_as_ints.tolist())

### make_padded_minibatch method

Let's make 5 different random minibatches

In [18]:
minibatchsize = 256
for i in range(5):
    minibatch = variable_length_int_vecs.make_padded_minibatch(
        np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), # random indices
        fill_value = word2id_dict['Ø'] # padding value
    )
    print(minibatch)


    1  9560  4245  ...   9688  9688  9688
 5032  9322  4870  ...   9688  9688  9688
 6694  6062   407  ...   9688  9688  9688
       ...          ⋱          ...       
    1  9560  4245  ...   9688  9688  9688
  218  5669  9560  ...   9688  9688  9688
 8680  6062   319  ...   9688  9688  9688
[torch.IntTensor of size 256x30]


 9027  6774   986  ...   9688  9688  9688
    1  3804  4387  ...   9688  9688  9688
 8604  6249   319  ...   9688  9688  9688
       ...          ⋱          ...       
    1  1143  5651  ...   9688  9688  9688
    1  1544  4387  ...   9688  9688  9688
    1  5032  6447  ...   9688  9688  9688
[torch.IntTensor of size 256x29]


 4394  8952  2628  ...   9688  9688  9688
    1  5032  9547  ...   9688  9688  9688
    1  5032  7533  ...   9688  9688  9688
       ...          ⋱          ...       
 9027  9666  9561  ...   9688  9688  9688
    1  4870  3585  ...   9688  9688  9688
 5680  5032  4387  ...   9688  9688  9688
[torch.IntTensor of size 256x29]


    1  6771 

A naive implementation would look as follows:

In [19]:
import torch
def naive_implementation(variable_length_vecs, indices, fill_value):
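    # The longest selected vector determines the width of the minibatch
    max_len = max([len(x) for x in variable_length_vecs.iloc[indices]])
    out = np.empty( (len(indices),max_len), dtype=np.int32)
    out.fill(fill_value)
    # Copy each vector into the start of its row; the rest of the row stays padded
    for i, vec in enumerate(variable_length_vecs.iloc[indices]):
        out[i,:len(vec)] = vec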
    return torch.from_numpy(out)
    

Let's time the two

In [20]:
%timeit naive_implementation(Sentences_as_ints, np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), fill_value = word2id_dict['Ø'])

334 µs ± 16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [21]:
%timeit variable_length_int_vecs.make_padded_minibatch( np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), fill_value = word2id_dict['Ø'])

33.9 µs ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
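
In these measurements the container is roughly ten times faster than the naive loop (about 34 µs versus 334 µs per minibatch), although the second timing has a large spread. The notebook does not show how the containers are implemented. One common way to avoid a per-row Python loop is to store all vectors in one contiguous buffer with per-vector offsets and to fill the padded matrix with a single vectorized copy. The sketch below illustrates that idea only; it is an assumption, not the rush implementation, `FlatIntVecs` is a hypothetical name, and it returns a NumPy array rather than a torch tensor.

```python
import numpy as np

class FlatIntVecs:
    """Hypothetical container: one contiguous buffer plus per-vector offsets."""

    def __init__(self, vecs):
        self.lengths = np.array([len(v) for v in vecs], dtype=np.int64)
        # offsets[i] is the position where vector i starts inside self.data
        self.offsets = np.concatenate(([0], np.cumsum(self.lengths)))
        self.data = np.concatenate([np.asarray(v, dtype=np.int32) for v in vecs])
        self.num_vecs = len(vecs)

    def make_padded_minibatch(self, indices, fill_value):
        indices = np.asarray(indices)
        lens = self.lengths[indices]
        out = np.full((len(indices), lens.max()), fill_value, dtype=np.int32)
        columns = np.arange(out.shape[1])
        # True for every cell that holds real data rather than padding
        mask = columns < lens[:, None]
        # Position in the flat buffer of every valid cell, in row-major order
        src = (self.offsets[indices][:, None] + columns)[mask]
        out[mask] = self.data[src]
        return out

vecs = [np.array([5, 6, 7]), np.array([8]), np.array([9, 10])]
container = FlatIntVecs(vecs)
batch = container.make_padded_minibatch([2, 0, 1], fill_value=-1)
print(batch)
```

Whether this layout is actually faster depends on the data and batch size, so it should be benchmarked rather than assumed.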


### make_minibatch_with_random_lengths method

Now let's say you have huge documents instead of short sentences. Then you might want to randomly sample a fixed-length segment from each document.
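
The idea can be sketched in plain NumPy. The exact sampling rule used by the container method is not documented in this notebook, so the sketch assumes one plausible rule: a vector longer than `max_len` contributes a window starting at a uniformly random position, and a shorter vector is copied whole and padded. `sample_fixed_length` is a hypothetical helper, not part of rush.

```python
import numpy as np

def sample_fixed_length(vecs, indices, max_len, fill_value, rng):
    """Hypothetical helper: one window of at most max_len items per chosen vector."""
    out = np.full((len(indices), max_len), fill_value, dtype=np.int32)
    for row, idx in enumerate(indices):
        vec = vecs[idx]
        # Assumed rule: random start for long vectors, start at 0 for short ones
        start = rng.integers(0, len(vec) - max_len + 1) if len(vec) > max_len else 0
        window = vec[start:start + max_len]
        out[row, :len(window)] = window
    return out

rng = np.random.default_rng(0)
docs = [np.arange(100, dtype=np.int32), np.array([7, 8, 9], dtype=np.int32)]
batch = sample_fixed_length(docs, [0, 1, 0], max_len=5, fill_value=-1, rng=rng)
print(batch.shape)  # (3, 5): every row has the same fixed width
```

Because the output width is fixed in advance, the result can be written into a preallocated buffer, which matches the `out` and `max_len` arguments passed to the container method in the next cell.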

In [22]:
max_len = 20
out_buffer = np.empty( (minibatchsize,max_len), dtype=np.int32 )
for i in range(5):
    minibatch = variable_length_int_vecs.make_minibatch_with_random_lengths(
        np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), 
        fill_value = word2id_dict['Ø'],
        out=out_buffer,
        max_len=max_len)
    print(minibatch)


    1  5447  7377  ...   3786  4833  4327
    1  5032  4245  ...   9688  9688  9688
    1  5032  4245  ...   9688  9688  9688
       ...          ⋱          ...       
    1  2510  4387  ...   9688  9688  9688
    1  9560  3841  ...   9688  9688  9688
  218  5669   373  ...   9688  9688  9688
[torch.IntTensor of size 256x20]


 5677  8224  2916  ...   7318  2626  1213
 1547  6254  9547  ...   9688  9688  9688
 9027  3348  6249  ...   9688  9688  9688
       ...          ⋱          ...       
    1  2126  5651  ...   9688  9688  9688
    1  2510  4387  ...   9688  9688  9688
    1  1544  9405  ...      0  9688  9688
[torch.IntTensor of size 256x20]


    1  6092  4245  ...   9688  9688  9688
    1  9560  9324  ...   9688  9688  9688
  984  4245  5715  ...   9688  9688  9688
       ...          ⋱          ...       
    1  5032  4387  ...   9688  9688  9688
    1  3434  6119  ...   7749     0     0
 5061  6062   319  ...   9688  9688  9688
[torch.IntTensor of size 256x20]


    1   823 