The lib contains several layers to use in PyTorch models.
```python
from junky.layers import Masking

layer = Masking(input_size, mask=float('-inf'),
                indices_to_highlight=-1, highlighting_mask=1,
                batch_first=False)
output = layer(x, lens)
```
Replaces certain elements of the incoming data `x` with the given `mask` value.
Args:

**input_size**: The number of expected features in the input `x`.

**mask**: The value to replace with. Default is `-inf`.

**indices_to_highlight**: Positions in the feature dimension of the masked positions of the incoming data that must **not** be replaced with the mask. Default is `-1`; `None` means "replace all".

**highlighting_mask**: The value to place in the highlighted positions. If `None`, the data in those positions is left as is. Default is `1`.

**batch_first**: If `True`, the input and output tensors are provided as `(batch, seq, feature)` (i.e. `(N, *, H)`). Default: `False`.
Shape:

- Input:
    - `x`: `(*, N, H)` where `*` means any number of additional dimensions and `H = input_size`.
    - `lens`: array of lengths of `x` along the `seq` dimension. We mask the data in all `seq` positions greater than `lens`.
- Output: `(*, N, H)` where all dimensions have the same shape as the input and `H = input_size`.
**NB:** The `Masking` layer was made to be used right before `Softmax`. In that case, with `mask=-inf` (the default), the softmax output will have zeroes in all masked positions.

**NB:** Usually, you'll mask the positions of all non-pad tags in the padded endings of the input data. Thus, after softmax, you'll always have the padding tag predicted for those endings. As a result, you'll have `loss = 0` there, which prevents your model from learning on padding.
Examples:

```python
>>> m = Masking(4, batch_first=True)
>>> input = torch.randn(2, 3, 4)
>>> output = m(input, [1, 3])
>>> print(output)
tensor([[[ 1.1912, -0.6164,  0.5299, -0.6446],
         [   -inf,    -inf,    -inf,  1.0000],
         [   -inf,    -inf,    -inf,  1.0000]],

        [[-0.3011, -0.7185,  0.6882, -0.1656],
         [-0.3316, -0.3521, -0.9717,  0.5551],
         [ 0.7721,  0.2061,  0.8932, -1.5827]]])

>>> m = Masking(4, batch_first=True, mask=4.,
                indices_to_highlight=(1, -1), highlighting_mask=None)
>>> input = torch.randn(2, 3, 4)
>>> output = m(input, [1, 3])
>>> print(output)
tensor([[[-0.4479, -0.8719, -1.0129, -1.5431],
         [ 4.0000,  0.6978,  4.0000,  0.1203],
         [ 4.0000,  0.1990,  4.0000, -0.4277]],

        [[ 0.2840,  1.1241, -0.5342,  0.2857],
         [ 0.3409,  0.7630,  0.4099,  0.1182],
         [ 1.3610, -0.1528, -1.7044, -0.4466]]])
```
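For instance, feeding the masked output to softmax gives exactly the behavior described in the notes above. A minimal sketch:

```python
import torch
import torch.nn.functional as F
from junky.layers import Masking

m = Masking(4, batch_first=True)    # defaults: mask=-inf, highlighting_mask=1
logits = torch.randn(2, 3, 4)       # (N, S, H) scores for 4 tags
masked = m(logits, [1, 3])          # sequence 0 has only 1 real position
probs = F.softmax(masked, dim=-1)
# In the two padded positions of sequence 0, probs is [0, 0, 0, 1]:
# the -inf logits vanish under softmax, and the highlighted last
# feature (the pad tag, set to 1) takes all the probability mass.
print(probs[0])
```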
```python
from junky.layers import CharEmbeddingRNN

layer = CharEmbeddingRNN(alphabet_size, emb_layer=None, emb_dim=300,
                         pad_idx=0, out_type='final_concat')
```
Produces character embeddings using a bidirectional LSTM.
Args:

**alphabet_size**: Length of the character vocabulary.

**emb_layer**: Optional pre-trained embeddings, initialized as `torch.nn.Embedding.from_pretrained()` or elsewise.

**emb_dim**: Character embedding dimensionality.

**emb_dropout**: Dropout for the embedding layer. Default: `0.0` (no dropout).

**pad_idx**: Index of the padding element in the character vocabulary.

**out_type**: Defines what to get as a result after the BiLSTM. Possible values:

- `'final_concat'`: concatenate the final hidden states of the forward and backward LSTMs;
- `'final_mean'`: take the mean of the final hidden states of the forward and backward LSTMs;
- `'all_mean'`: take the mean of all timeframes;
- `'all_max'`: take the maximum of all timeframes.
Shape:

- Input:
    - `x`: `[batch[seq[word[ch_idx + pad] + word[pad]]]]`; a `torch.Tensor` of shape `(N, S(padded), C(padded))`, where `N` is the batch size, `S` is the sequence length and `C` is the maximum character length of a word in the current batch.
    - `lens`: `[seq[word_char_count]]`; a `torch.Tensor` of word lengths for each sequence in the batch. Used in masking and in packing/unpacking the sequences for the LSTM.
- Output: `(N, S, H)` where `N` and `S` are the same as in the input and `H` is the LSTM hidden size.
**NB:** In the LSTM layer, we ignore padding by applying the mask to the tensor and eliminating all words of length `0`. After the LSTM layer, the initial dimensions are restored using the same mask.
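A usage sketch with dummy data. The values here are illustrative, and we assume `lens` holds the per-word character counts, one row per sequence:

```python
import torch
from junky.layers import CharEmbeddingRNN

layer = CharEmbeddingRNN(alphabet_size=26, emb_dim=300, pad_idx=0,
                         out_type='final_concat')
x = torch.randint(1, 26, (2, 3, 5))    # (N, S, C) character indices
x[0, 2] = 0                            # a fully padded word (len = 0)
lens = torch.tensor([[5, 3, 0],        # character counts of each word
                     [4, 5, 2]])
output = layer(x, lens)                # (N, S, H)
```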
```python
from junky.layers import CharEmbeddingCNN

layer = CharEmbeddingCNN(alphabet_size, emb_layer=None, emb_dim=300,
                         emb_dropout=0.0, pad_idx=0, kernels=[3, 4, 5],
                         cnn_kernel_multiplier=1)
```
Produces character embeddings using a multiple-filter CNN. Max-over-time pooling and ReLU are applied to the concatenated convolution layers.
Args:

**alphabet_size**: Length of the character vocabulary.

**emb_layer**: Optional pre-trained embeddings, initialized as `torch.nn.Embedding.from_pretrained()` or elsewise.

**emb_dim**: Character embedding dimensionality.

**emb_dropout**: Dropout for the embedding layer. Default: `0.0` (no dropout).

**pad_idx**: Index of the padding element in the character vocabulary.

**kernels**: Convolution filter sizes for the CNN layers.

**cnn_kernel_multiplier**: Defines how many filters are created for each kernel size. Default: `1`.
Shape:

- Input:
    - `x`: `[batch[seq[word[ch_idx + pad] + word[pad]]]]`; a `torch.Tensor` of shape `(N, S(padded), C(padded))`, where `N` is the batch size, `S` is the sequence length with padding and `C` is the character length with padding in the current batch.
    - `lens`: `[seq[word_char_count]]`; a `torch.Tensor` of word lengths for each sequence in the batch. Used for eliminating padding in the CNN layers.
- Output: `(N, S, E)` where `N` and `S` are the same as in the input and `E = emb_dim`.
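A usage sketch with dummy data, analogous to the RNN version above (note that `C` must not be smaller than the largest kernel size):

```python
import torch
from junky.layers import CharEmbeddingCNN

layer = CharEmbeddingCNN(alphabet_size=26, emb_dim=300, pad_idx=0,
                         kernels=[3, 4, 5])
x = torch.randint(1, 26, (2, 3, 7))    # (N, S, C) character indices
lens = torch.tensor([[7, 4, 2],        # character counts of each word
                     [5, 7, 3]])
output = layer(x, lens)                # (N, S, E), E == emb_dim
```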
```python
from junky.layers import HighwayNetwork

layer = HighwayNetwork(in_features, out_features=None,
                       U_layer=None, U_init_=None, U_dropout=0,
                       H_features=None, H_activation=F.relu, H_dropout=0,
                       gate_type='generic', global_highway_input=False,
                       num_layers=1)
output = layer(x, x_hw, *U_args, **U_kwargs)
```
The Highway Network is described in the papers of Srivastava et al., and its formulation is `H(x)*T(x) + x*(1 - T(x))`, where:

`H(x)`: an affine transformation followed by a non-linear activation;

`T(x)`: the transform gate: an affine transformation followed by a sigmoid activation;

`*`: element-wise multiplication.

There are some variations of it, so we implement a more universal architecture: `U(x)*H(x)*T(x) + x*C(x)`, where:

`U(x)`: a user-defined layer that we wrap the Highway around. By default, `U(x) = I` (the identity matrix);

`C(x)`: the carry gate: generally, an affine transformation followed by a sigmoid activation. By default, `C(x) = 1 - T(x)`.
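In plain PyTorch, one highway step with the default generic gate looks like this (a sketch; `W_h` and `W_t` are illustrative affine layers, not names from the library):

```python
import torch
import torch.nn.functional as F

in_features = 8
W_h = torch.nn.Linear(in_features, in_features)    # affine part of H(x)
W_t = torch.nn.Linear(in_features, in_features)    # affine part of T(x)

x = torch.randn(4, in_features)
H = F.relu(W_h(x))           # non-linear transformation H(x)
T = torch.sigmoid(W_t(x))    # transform gate T(x) in (0, 1)
y = H * T + x * (1 - T)      # generic gate: C(x) = 1 - T(x)
```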
Args:

**in_features**: Number of features in the input.

**out_features**: Number of features in the output. If `None` (default), `out_features = in_features`.

**U_layer**: Layer that implements `U(x)`. Default is `None`. If `U_layer` is callable, it will be used to create the layer; elsewise, we'll use it as is (if `num_layers > 1`, we'll copy it). Note that the number of input features of `U_layer` must equal `out_features` if `num_layers > 1`.

**U_init_**: Callable to inplace-init the weights of `U_layer`.

**U_dropout**: If non-zero, introduces a Dropout layer on the outputs of `U(x)` in each layer, with dropout probability equal to `U_dropout`. Default: `0`.

**H_features**: Number of input features of `H(x)`. If `None` (default), `H_features = in_features`. If `0`, `H(x)` is not used.

**H_activation**: Non-linear activation after `H(x)`. If `None`, no activation function is used. Default is `F.relu`.

**H_dropout**: If non-zero, introduces a Dropout layer on the outputs of `H(x)` in each layer, with dropout probability equal to `H_dropout`. Default: `0`.

**gate_type**: The type of the transform and carry gates:

- `'generic'` (default): `C(x) = 1 - T(x)`;
- `'independent'`: use both independent `C(x)` and `T(x)`;
- `'T_only'`: don't use the carry gate: `C(x) = I`;
- `'C_only'`: don't use the transform gate: `T(x) = I`;
- `'none'`: `C(x) = T(x) = I`.

**global_highway_input**: If `True`, we treat the input of the whole network as the highway input of every layer; thus, we use `T(x)` and `C(x)` only once. If `False` (default), every layer receives the output of the previous layer as its highway input, so `T(x)` and `C(x)` use different weight matrices in each layer.

**num_layers**: Number of highway layers.
The `.forward()` method receives params as follows:

**x** and **x_hw**: inputs of the network. The first layer of the network executes the formula `x = U(x)*H(x)*T(x_hw) + x_hw*C(x_hw)`. Next, if `global_highway_input` is `False`, `x_hw = x`; if `True`, then `x_hw = x_hw*C(x_hw)` and it won't change anymore in the subsequent layers. If `x_hw` is `None`, we adopt `x_hw = x`.

**\*U_args** and **\*\*U_kwargs**: params for `U_layer` if it needs any.
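A usage sketch with the default `U(x) = I` and the generic gate (dummy data; passing `x_hw=None` makes the layer adopt `x_hw = x`, as described above):

```python
import torch
from junky.layers import HighwayNetwork

layer = HighwayNetwork(in_features=16, num_layers=2)
x = torch.randn(8, 16)
y = layer(x, None)    # x_hw is None, so x_hw = x is adopted
print(y.shape)        # torch.Size([8, 16])
```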
```python
from junky.layers import HighwayBiLSTM

layer = HighwayBiLSTM(hw_num_layers, lstm_hidden_dim, lstm_num_layers,
                      in_features, out_features, lstm_dropout,
                      init_weight=True, init_weight_value=2.0,
                      batch_first=True)
output = layer(x)
```
A Highway biLSTM implementation, modified from https://github.com/bamtercelboo/pytorch_Highway_Networks/blob/master/models/model_HBiLSTM.py (see the original article).
Params:

**hw_num_layers**: Number of highway biLSTM layers.

**in_features**: Number of features in the input.

**out_features**: Number of features in the output.

**lstm_hidden_dim**: Hidden dim for the LSTM layers.

**lstm_num_layers**: Number of LSTM layers.

**lstm_dropout**: Dropout between 2+ LSTM layers.

**init_weight**: Whether to initialize the biLSTM weights with `xavier_uniform_`.

**init_weight_value**: The biLSTM weight initialization gain is defined as `np.sqrt(init_weight_value)`.

**batch_first**: `True` if `input.size(0) == batch_size`.
Input:

- `x`: an input tensor of shape `(N, S, in_features)` or `(S, N, in_features)`, where `N` is the batch size. Please specify `batch_first=True` if the input tensor has shape `(N, S, E)`.
- `lens`: a tensor of sequence lengths without padding.

Output:

- A tensor of shape `(N, S, out_features)`.
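A usage sketch with dummy data. The call line above shows `layer(x)`, but the Input section also lists `lens`, so we assume here that the forward pass accepts both:

```python
import torch
from junky.layers import HighwayBiLSTM

layer = HighwayBiLSTM(hw_num_layers=2, lstm_hidden_dim=32,
                      lstm_num_layers=1, in_features=16,
                      out_features=16, lstm_dropout=0.0,
                      batch_first=True)
x = torch.randn(4, 10, 16)            # (N, S, in_features)
lens = torch.tensor([10, 7, 5, 3])    # sequence lengths without padding
output = layer(x, lens)               # (N, S, out_features)
```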