In [1]:
cd ..

/home/jonatan/Dropbox/FairyText


## Vecs Containers

When training NLP models, input samples usually have variable length, for example sentences with different numbers of words. To train with minibatches, however, you need to stack them into a single matrix.
I have implemented several containers for extremely fast minibatch creation from variable-length vectors.

In [2]:
from rush.vecs import FloatVecs, IntVecs, ShortVecs

You can create instances of these containers from a list of 1-D numpy.ndarrays or torch tensors.
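To make the idea of a variable-length container concrete, here is a minimal NumPy sketch of one common layout: all vectors concatenated into one flat buffer, plus an offsets array. This is an assumption for illustration only. The class `FlatVecs` and its method `padded_batch` are hypothetical names and do not describe how `rush.vecs` is actually implemented.

```python
import numpy as np

class FlatVecs:
    """Hypothetical flat-buffer container (illustration, not rush.vecs)."""

    def __init__(self, vecs, dtype=np.int32):
        self.lengths = np.array([len(v) for v in vecs], dtype=np.int64)
        # offsets[i] is the position where vector i starts in the flat buffer
        self.offsets = np.concatenate(([0], np.cumsum(self.lengths)))
        self.data = np.concatenate([np.asarray(v, dtype=dtype) for v in vecs])
        self.num_vecs = len(vecs)

    def padded_batch(self, indices, fill_value):
        indices = np.asarray(indices)
        lens = self.lengths[indices]
        max_len = int(lens.max())
        # Start from a matrix that contains only the padding value
        out = np.full((len(indices), max_len), fill_value, dtype=self.data.dtype)
        cols = np.arange(max_len)[None, :]
        # mask[i, j] is True where row i holds a real element in column j
        mask = cols < lens[:, None]
        # Position in the flat buffer of every (row, column) pair
        src = self.offsets[indices][:, None] + cols
        out[mask] = self.data[src[mask]]
        return out

store = FlatVecs([np.array([1, 2, 3]), np.array([4]), np.array([5, 6])])
batch = store.padded_batch([2, 0, 1], fill_value=-1)
print(batch.tolist())  # → [[5, 6, -1], [1, 2, 3], [4, -1, -1]]
```

Keeping everything in one contiguous buffer avoids touching many small Python objects when a batch is assembled, which is one plausible reason such a layout can be fast.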

### Example

Load some sentences

In [3]:
import numpy as np
import pandas as pd
import re

pattern = re.compile('[^a-z +]')
with open('data/multi30k/train.en','r') as f:
    sentences = pd.Series([pattern.sub(' ',x.lower()) for x in f.readlines()])
sentences.head()

0    two young  white males are outside near many b...
1    several men in hard hats are operating a giant...
2     a little girl climbing into a wooden playhouse  
3    a man in a blue shirt is standing on a ladder ...
4            two men are at the stove preparing food  
dtype: object

Build mappings to integers

In [4]:
# Collect the vocabulary: every distinct token across all sentences
word_set = set()
for sent in sentences:
    word_set |= {x.strip() for x in sent.split(' ')}
all_words = sorted(word_set)
id2word_dict = dict(enumerate(all_words))
id2word_dict[len(all_words)] = 'Ø'  # extra symbol, used for unknown words and as padding
word2id_dict = {word: idx for idx, word in id2word_dict.items()}

def get_word_IDs(sentence):
    """Map a sentence to an int32 vector of word IDs; unknown words map to 'Ø'."""
    unknown_id = word2id_dict['Ø']
    IDs = [word2id_dict.get(word, unknown_id) for word in sentence.split(' ')]
    return np.asarray(IDs, dtype=np.int32)
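
As a quick illustration of the unknown-word fallback, here is a self-contained toy version of the same mapping. The vocabulary `toy_words` and the helper `encode` are made up for this example and are not part of the notebook's data.

```python
import numpy as np

UNKNOWN = 'Ø'  # same placeholder symbol the notebook uses

# Toy vocabulary (hypothetical, for illustration only)
toy_words = ['a', 'dog', 'runs']
toy_word2id = {word: idx for idx, word in enumerate(toy_words)}
toy_word2id[UNKNOWN] = len(toy_words)  # the unknown symbol gets the last ID

def encode(sentence, word2id):
    """Map each space-separated token to its ID, falling back to UNKNOWN."""
    unknown_id = word2id[UNKNOWN]
    ids = [word2id.get(token, unknown_id) for token in sentence.split(' ')]
    return np.asarray(ids, dtype=np.int32)

ids = encode('a cat runs', toy_word2id)
print(ids)  # → [0 3 2]  ('cat' is not in the vocabulary)
```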

Get sentences as variable length vectors of integers

In [5]:
Sentences_as_ints = pd.Series([get_word_IDs(sent) for sent in sentences])
Sentences_as_ints.head()

0    [9027, 9666, 0, 9481, 5028, 319, 5775, 5503, 5...
1    [7384, 5190, 4245, 3896, 3916, 319, 5701, 1, 3...
2     [1, 4870, 3585, 1657, 4360, 1, 9570, 6253, 0, 0]
3    [1, 5032, 4245, 1, 878, 7473, 4387, 8090, 5677...
4    [9027, 5190, 319, 407, 8640, 8202, 6431, 3341,...
dtype: object

Now we can make an instance of an IntVecs container

In [6]:
variable_length_int_vecs = IntVecs(Sentences_as_ints.tolist())

### make_padded_minibatch method

Let's make 5 different random minibatches

In [7]:
minibatchsize = 256
for i in range(5):
    minibatch = variable_length_int_vecs.make_padded_minibatch(
        np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), # random indices
        fill_value = word2id_dict['Ø'] # padding value
    )
    print(minibatch)


    1  5032   222  ...   9688  9688  9688
 8640  2502  9134  ...   9688  9688  9688
    1  5032  1928  ...   9688  9688  9688
       ...          ⋱          ...       
  218   373  9560  ...   9688  9688  9688
    1  5032  9405  ...   9688  9688  9688
 9027  6062  8088  ...   9688  9688  9688
[torch.IntTensor of size 256x28]


    1  3763  5651  ...   9688  9688  9688
 6062  2233   407  ...   9688  9688  9688
 6062  3562  5677  ...   9688  9688  9688
       ...          ⋱          ...       
  218   373  5032  ...   9688  9688  9688
 8650   319  4178  ...   9688  9688  9688
    1  3763  5651  ...   9688  9688  9688
[torch.IntTensor of size 256x40]


    1  2510  4387  ...   9688  9688  9688
 8650   319  9027  ...   9688  9688  9688
    1  4703  5032  ...   9688  9688  9688
       ...          ⋱          ...       
    1  9560  9547  ...   9688  9688  9688
 9027  5190   222  ...   9688  9688  9688
    1  8420  2638  ...   9688  9688  9688
[torch.IntTensor of size 256x31]


    1  9666 

A naive implementation would look as follows:

In [8]:
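import torch
def naive_implementation(variable_length_vecs, indices, fill_value):
    # The longest selected vector determines the number of columns
    max_len = max([len(x) for x in variable_length_vecs.iloc[indices]])
    # Allocate the output matrix and fill it with the padding value
    out = np.empty( (len(indices),max_len), dtype=np.int32)
    out.fill(fill_value)
    # Copy each vector into the beginning of its row
    for i, vec in enumerate(variable_length_vecs.iloc[indices]):
        out[i,:len(vec)] = vec
    return torch.from_numpy(out)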
    

Let's time the two

In [9]:
%timeit naive_implementation(Sentences_as_ints, np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), fill_value = word2id_dict['Ø'])

352 µs ± 18.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [10]:
%timeit variable_length_int_vecs.make_padded_minibatch( np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), fill_value = word2id_dict['Ø'])

19.5 µs ± 792 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
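
To put the two timings side by side, the ratio of the reported means can be computed directly. Note that both measurements include the cost of drawing the random indices with `np.random.randint`, so the ratio for the padding step alone would be somewhat larger. The variable names below are only for this calculation.

```python
# Mean times per loop as reported by %timeit above, in microseconds
naive_us = 352.0
container_us = 19.5

speedup = naive_us / container_us
print(f"{speedup:.1f}x")  # → 18.1x
```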


### make_minibatch_with_random_lengths method

Now let's say you have huge documents instead of short sentences. In that case you might want to sample a fixed-length span at a random position within each document.
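
For reference, a naive NumPy version of this kind of sampling might look like the sketch below. It assumes that each row is a contiguous window of at most `max_len` elements starting at a random position, and that shorter vectors are padded with `fill_value`. The real `make_minibatch_with_random_lengths` method may choose windows or lengths differently, and `naive_random_windows` is a hypothetical helper, not part of `rush.vecs`.

```python
import numpy as np

def naive_random_windows(vecs, indices, max_len, fill_value, rng=None):
    """Sample one random window of up to max_len elements per selected vector."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.full((len(indices), max_len), fill_value, dtype=np.int32)
    for row, idx in enumerate(indices):
        vec = vecs[idx]
        # Latest start that still leaves a full window (0 for short vectors)
        last_start = max(len(vec) - max_len, 0)
        start = rng.integers(0, last_start + 1)
        window = vec[start:start + max_len]
        out[row, :len(window)] = window
    return out

docs = [np.arange(100, dtype=np.int32), np.array([7, 8, 9], dtype=np.int32)]
batch = naive_random_windows(docs, [0, 1], max_len=5, fill_value=-1,
                             rng=np.random.default_rng(0))
print(batch.shape)  # → (2, 5)
```

The first row is five consecutive values from the long document, while the second row contains the whole short vector followed by padding.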

In [11]:
max_len = 20
out_buffer = np.empty( (minibatchsize,max_len), dtype=np.int32 )
for i in range(5):
    minibatch = variable_length_int_vecs.make_minibatch_with_random_lengths(
        np.random.randint(0,variable_length_int_vecs.num_vecs, minibatchsize), 
        fill_value = word2id_dict['Ø'],
        out=out_buffer,
        max_len=max_len)
    print(minibatch)


 4689  5190   319  ...   3897     0     0
 9481  5032  4077  ...   9688  9688  9688
 8640   493  5651  ...   3845     0     0
       ...          ⋱          ...       
 8680  9666  3588  ...   9688  9688  9688
    1  9560  7497  ...   9688  9688  9688
 7865  6062  6894  ...   9688  9688  9688
[torch.IntTensor of size 256x20]


    1  3894  9666  ...      1  7559     0
    1  5032  4387  ...   9688  9688  9688
    1   544     0  ...   9688  9688  9688
       ...          ⋱          ...       
    1  3763  5651  ...   9688  9688  9688
    1  3763  5651  ...   1608     0     0
    1  5032  4245  ...   9688  9688  9688
[torch.IntTensor of size 256x20]


    1  5032  4245  ...   8192     0     0
 9560  4245  9481  ...   6211   222  3808
 1898  9585  4245  ...   9688  9688  9688
       ...          ⋱          ...       
 8640  3038   319  ...   9688  9688  9688
 9027  5190  8090  ...   9688  9688  9688
  109  5032  4077  ...   9515     0     0
[torch.IntTensor of size 256x20]


 8640  4653 