
[MXNET-96] Language model with Google's billion words dataset #10025

Merged: 11 commits into apache:master on Mar 22, 2018

Conversation

@eric-haibin-lin (Member) commented Mar 7, 2018

Description

This example reproduces the result (~42 perplexity) of Exploring the Limits of Language Modeling on the GBW dataset.
See README.md for details.
@mli @szha @zheng-da @piiswrong

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Language model with Google's billion words dataset (#197)
@@ -214,9 +238,7 @@ If ``no_bias`` is set to be true, then the ``bias`` term is ignored.
.set_attr<nnvm::FInferShape>("FInferShape", FullyConnectedShape)
.set_attr<nnvm::FInferType>("FInferType", FullyConnectedType)
.set_attr<FCompute>("FCompute<cpu>", FullyConnectedCompute<cpu>)
#if MXNET_USE_MKLDNN == 1
Member:

Why remove this?

Member Author:

This is for adding sparse support for FC. See line 213.

Member:

I see. Both the sparse and MKL-DNN support paths need to go through this dispatch.

std::vector<TBlob> out_blobs(outputs.size());
for (size_t i = 0; i < out_blobs.size(); i++) out_blobs[i] = outputs[i].data();
FullyConnectedCompute<cpu>(attrs, ctx, in_blobs, req, out_blobs);
#endif
Member:

I think FallBackCompute should be used to fall back the computation to the original CPU implementation.

Member Author:

This block will only be executed when MKL is absent

Contributor:

Why do you not use FallBackCompute for the fallback?
If an input is a sparse matrix, does data() return a dense NDArray? It doesn't seem that SetTBlob is doing that.

Member Author:

Does MKL support kFComputeFallback dispatch mode?
Are you both referring to line 93 - line 99? FallBackCompute is only defined when USE_MKL=1. Can I still use it?
What I need to address is the following case for inference:

  • data = dense
  • weight = rowsparse
  • bias = rowsparse
  • output = dense
But I don't know how to deal with this efficiently when USE_MKL=1.
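A minimal sketch of that inference case through the Python NDArray API, assuming the sparse FC dispatch this PR adds (shapes and values are illustrative only):

```python
import mxnet as mx

# Hypothetical illustration: dense data, row_sparse weight and bias,
# dense output. Assumes FullyConnected dispatches on these storage
# types as added in this PR.
data = mx.nd.ones((2, 4))                          # dense
weight = mx.nd.ones((3, 4)).tostype('row_sparse')  # row_sparse
bias = mx.nd.ones((3, 1)).tostype('row_sparse')    # row_sparse, 2-D
out = mx.nd.FullyConnected(data=data, weight=weight, bias=bias, num_hidden=3)
print(out.stype)  # expected: 'default' (dense)
```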

Member Author:

When USE_MKL=1,
I wanted to simply use the non-MKL FCForward with

data.data(),
weight.data(),
bias.data(),

assuming data.data() returns a TBlob with a normal CPU layout even if data is in MKL layout. weight.data() and bias.data() should always return a normal CPU layout if weight and bias are row_sparse. Please advise.

@szha self-assigned this Mar 8, 2018
@eric-haibin-lin (Member Author):

@zihaolucky

#endif
if (valid_data && valid_weight && valid_bias && valid_out) {
std::vector<TBlob> in_blobs(inputs.size());
for (size_t i = 0; i < in_blobs.size(); i++) in_blobs[i] = inputs[i].data();
Contributor:

Actually, if an NDArray uses MKLDNN format, data() will convert its layout inside the array. This caused a race condition, which I just fixed. You shouldn't call data() on an NDArray with MKLDNN format. Please check FallBackCompute to see how it is handled correctly.

I also don't understand what happens if the input array is row sparse. If the row sparse array doesn't have zero-entry rows, it works fine. But if the row sparse array does have zero-entry rows, isn't the memory returned from data() smaller than we expect?
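A small sketch of that concern using the Python sparse API: with zero-entry rows, the values array holds only the stored rows, so it is smaller than the dense shape suggests.

```python
import mxnet as mx

# 2 stored rows (indices 0 and 4) out of 5 logical rows.
w = mx.nd.sparse.row_sparse_array(
    (mx.nd.ones((2, 3)), mx.nd.array([0, 4])), shape=(5, 3))
print(w.shape)       # (5, 3): the logical, dense shape
print(w.data.shape)  # (2, 3): only the stored rows are materialized
```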

Contributor:

So the functionality of data() should be improved. A developer doesn't know there's a potential issue when working with other NDArray formats.

Contributor:

I guess the only option is to disable data() for NDArrays with MKLDNN format.

@cjolivier01 (Member) left a comment:

Please create and link a JIRA ticket

@sxjscience (Member) commented Mar 9, 2018

@eric-haibin-lin RNN.unroll was just updated to support variable-length sequences. You need to resolve the conflict.

@@ -181,3 +181,126 @@ def unroll(self, length, inputs, begin_state=None, layout='NTC', merge_outputs=N
outputs, _, _, _ = _format_sequence(length, outputs, layout, merge_outputs)

return outputs, states


class LSTMPCell(HybridRecurrentCell):
Member:

Can this be implemented as a ModifierCell? The implementation should be similar to ZoneOut.

Member:

Good idea

Member Author:

Is there an easy way to modify state in a ModifierCell (especially during unroll)?

Member Author:

I don't think it's a good abstraction as a modifier cell. The projection only happens on the hidden state of the LSTM, not on the cell state. I am not sure what the expected behavior would be for GRU and RNN cells, where there is only one state. Also, the paper calls this LSTMP; it's specific to LSTM. I think it's fine to just call it contrib.LSTMP, which inherits from HybridRecurrentCell.
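For reference, a minimal NumPy sketch of one LSTMP step (parameter names are illustrative, not this PR's implementation). It shows the point above: the projection applies only to the hidden state, never to the cell state.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, h, c, Wx, Wh, b, Wproj):
    """One LSTMP step. Wproj maps num_hidden -> num_proj; c stays unprojected."""
    gates = x @ Wx + h @ Wh + b                      # (batch, 4 * num_hidden)
    i, f, g, o = np.split(gates, 4, axis=-1)
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_next = (sigmoid(o) * np.tanh(c_next)) @ Wproj  # projected hidden state
    return h_next, c_next
```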

@@ -56,7 +56,10 @@ static bool FullyConnectedShape(const nnvm::NodeAttrs& attrs,
}
SHAPE_ASSIGN_CHECK(*in_shape, fullc::kWeight, Shape2(param.num_hidden, num_input));
if (!param.no_bias) {
SHAPE_ASSIGN_CHECK(*in_shape, fullc::kBias, Shape1(param.num_hidden));
if (!shape_assign(&(*in_shape)[fullc::kBias], Shape1(param.num_hidden)) &&
!shape_assign(&(*in_shape)[fullc::kBias], Shape2(param.num_hidden, 1))) {
Member:

Why should we check it against (num_hidden, 1)?

Member Author:

The bias is trained with sparse embedding, which requires a 2-D shape.
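A short sketch of the constraint being described, assuming the usual MXNet rule that row_sparse arrays are at least 2-D:

```python
import mxnet as mx

# A 1-D bias cannot be stored as row_sparse, so the sparse-trained bias
# uses shape (num_hidden, 1) instead of the usual (num_hidden,).
num_hidden = 8
bias = mx.nd.zeros((num_hidden, 1)).tostype('row_sparse')
print(bias.shape)  # (8, 1)
```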

""" A dataset for truncated bptt with multiple sentences.
Adapted from @rafaljozefowicz's implementation.
"""
def __init__(self, vocab, file_pattern, deterministic=False):
Member:

Use shuffle instead of deterministic, since a fixed random seed may still produce deterministic but shuffled results.
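A tiny sketch of why the name matters: a seeded shuffle is reproducible across runs, i.e. deterministic yet shuffled.

```python
import random

random.seed(42)
order = list(range(5))
random.shuffle(order)
print(order)  # the same shuffled order on every run with this seed
```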

Member Author:

ok

def cross_entropy_loss(inputs, labels, rescale_loss=1):
""" cross entropy loss """
criterion = mx.gluon.loss.SoftmaxCrossEntropyLoss()
loss = criterion.hybrid_forward(S, inputs, labels)
Member:

loss = criterion(inputs, labels) should do. Also, rescale_loss can be put in the constructor call of SoftmaxCELoss using the weight argument.
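A sketch of the suggested rewrite (not the committed code):

```python
import mxnet as mx

def cross_entropy_loss(inputs, labels, rescale_loss=1):
    """Softmax cross-entropy loss, with rescale_loss folded into the
    constructor through the `weight` argument."""
    criterion = mx.gluon.loss.SoftmaxCrossEntropyLoss(weight=rescale_loss)
    return criterion(inputs, labels)
```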

Member Author:

ok

Tensor<xpu, 1, DType> bias = in_data[fullc::kBias].get_with_shape<xpu, 1, DType>(
Shape1(wmat.shape_[0]), s);
CHECK_EQ(bias.shape_[0], wmat.shape_[0])
<< "bias.data().shape[0] != weight.data().shape[0]. Not supported by FCForward";
Member:

Use words to describe the error instead. Also, if flatten is True the error message for data.data() might not make sense.

"""
def __init__(self):
self._token_to_id = {}
self._token_to_count = {}
Member:

collections.Counter?
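A minimal sketch of the suggestion; the `add` method is hypothetical, just to show Counter in use:

```python
import collections

class Vocabulary(object):
    def __init__(self):
        self._token_to_id = {}
        self._token_to_count = collections.Counter()  # no manual count init

    def add(self, token):
        self._token_to_count[token] += 1
        self._token_to_id.setdefault(token, len(self._token_to_id))
```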

Member Author:

ok

@cjolivier01 (Member) left a comment:

Please add a JIRA

@eric-haibin-lin changed the title from "Language model with Google's billion words dataset" to "[MXNET-96] Language model with Google's billion words dataset" Mar 13, 2018
@cjolivier01 dismissed their stale review March 15, 2018 00:48

JIRA ticket was added

@eric-haibin-lin (Member Author):

@TaoLv @zheng-da could you review the updated code for FC?

// inputs
std::vector<TBlob> in_blobs(inputs.size());
auto get_data = [](const NDArray& nd) -> TBlob {
if (nd.storage_type() == kDefaultStorage) return nd.Reorder2Default().data();
Contributor:

nd.Reorder2Default() returns a temp object here, which controls a piece of memory. This memory will be freed when the function returns, and the TBlob will reference freed memory.

@eric-haibin-lin (Member Author):

Anything else to address?
@sxjscience @zheng-da @szha @cjolivier01 @TaoLv

@zheng-da (Contributor):

looks good to me

@szha merged commit 57534bf into apache:master Mar 22, 2018
ashokei pushed a commit to ashokei/incubator-mxnet that referenced this pull request Mar 27, 2018
…#10025)

* Language model with Google's billion words dataset (apache#197)

* fix lint

* ffix license

* patch

* fix lint

* cr comment

* update fc fallback

* fix build

* fix temp memory in fc

* fix compilation
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018 (same commit list as above)

rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018 (same commit list as above)

zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018 (same commit list as above)
@eric-haibin-lin deleted the language-model branch September 2, 2019 23:35