
BiMPM model #1594

Merged: 12 commits into allenai:master on Aug 17, 2018

Conversation

@hzeng-otterai (Contributor)

This PR includes the implementation of the BiMPM model (#1503) and a Quora paraphrase dataset reader. I am still working on some comment improvements and memory-usage optimization, but I wanted to send out the PR to collect feedback in the meantime.

Thanks!

@hzeng-otterai (Author)

Looks like documentation for the classes is missing. I will update the PR.

@hzeng-otterai (Author)

Pinging @schmmd, could you please provide comments on this PR?
I asked the original author of the paper (https://github.com/zhiguowang/) to review the code; he said he has been busy recently and will review once he gets some time.

@matt-gardner (Contributor)

Sorry to be slow @handsomezebra, we had a hackathon last week at AI2 that put a bunch of other stuff on pause. I'll look at this, though today is pretty busy for me, and it'll probably be tomorrow before I get to it.

matt-gardner self-assigned this on Aug 14, 2018
@matt-gardner (Contributor) left a review

Overall, this looks great, thanks for doing this! I didn't double check the logic against the paper, as you've said you got close to the same results. What I saw in the logic looked reasonable to me.

The one major quibble I have with this is that the variable and method names are all very abbreviated. We strongly prefer longer, more descriptive variable names, as having code be easy to understand is far more important than saving a few characters of horizontal space. I can figure out that ml_fw probably means forward_matching_layer, for instance, but that's a guess, and it adds to my cognitive load as I read this code. Can you expand the variable names to avoid abbreviations throughout?

@overrides
def _read(self, file_path):
logger.info("Reading instances from lines in file at: %s", file_path)
file_name, member = parse_file_uri(file_path)
@matt-gardner (Contributor)

You've duplicated the logic below between the two cases (where member is None and where it isn't). Seems like it would be cleaner to push the file decision logic into this method. So you'd have something like:

def open_uri(file_path):
    file_name, member = parse_file_uri(file_path)
    file_path = cached_path(file_path)
    if member is None:
        return open(file_path, 'r')
    else:
        # zipfile opening logic here

def _read(self, file_path):
    with open_uri(file_path) as data_file:
       tsvin = ...

@joelgrus probably knows more than I do about how to handle the with stuff properly in this case.
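
If it helps, here is a minimal sketch of how the zip branch could satisfy the with statement (this is just a sketch assuming the standard zipfile and io modules; open_uri and parse_file_uri are the hypothetical names from the suggestion above, not a final API):

import io
import zipfile

from allennlp.common.file_utils import cached_path

def open_uri(file_path: str):
    # parse_file_uri is assumed to split an "archive!member" style URI
    # into the archive path and the member name (None for plain files).
    file_name, member = parse_file_uri(file_path)
    cached = cached_path(file_name)
    if member is None:
        return open(cached, 'r')
    # ZipFile.open returns a binary stream; wrap it for text-mode reads.
    # Both the wrapper and a plain file work as context managers, so
    # `with open_uri(...) as data_file:` closes the stream either way.
    archive = zipfile.ZipFile(cached)
    return io.TextIOWrapper(archive.open(member))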

@hzeng-otterai (Author)

@matt-gardner @joelgrus Sure, I will try to refactor this. One related question: I am currently using my personal s3 bucket to host the data file, https://s3-us-west-1.amazonaws.com/handsomezebra/public/Quora_question_pair_partition.zip. Do you think you could help me upload the files to allennlp's s3?

@matt-gardner (Contributor)

Yes, I'd be happy to put that data file (and any model files you have) into our s3 bucket. Just tell me what all of them are.


@matt-gardner (Contributor) commented Aug 16, 2018

Ok, I unzipped it and put the train, dev and test files here: https://s3-us-west-2.amazonaws.com/allennlp/datasets/quora-question-paraphrase/{train,dev,test}.tsv, along with the README. I left out the word vector file.

@matt-gardner (Contributor)

(Unzipping it also makes it so you can just remove the zip file logic and simplify that piece of the code. I'd vote for handling zip files in a more general way in our common module, which we can certainly do, but is probably more appropriate for a separate PR.)

@hzeng-otterai (Author)

Sure, I will remove the zip file reading.
I tried the s3 URL https://s3-us-west-1.amazonaws.com/allennlp/datasets/quora-question-paraphrase/train.tsv but got a redirection error. This URL works fine, though: https://allennlp.s3-us-west-2.amazonaws.com/datasets/quora-question-paraphrase/train.tsv. I can use the second URL, but other data on allennlp's s3 doesn't seem to have this problem.

@matt-gardner (Contributor) commented Aug 16, 2018

Ah, sorry, that's my fault, I had a copy-paste error in the link I gave you. (I edited the previous post to fix it.)

@DatasetReader.register("quora_paraphrase")
class QuoraParaphraseDatasetReader(DatasetReader):
"""
Reads a Quora paraphrase data
@matt-gardner (Contributor)

An explanation of the expected data format here would be nice (e.g., this is a TSV file, with the following columns).
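
For illustration, a docstring along those lines might read (the column layout is taken from the discussion further down; the wording here is only a sketch):

@DatasetReader.register("quora_paraphrase")
class QuoraParaphraseDatasetReader(DatasetReader):
    """
    Reads a tsv file of Quora paraphrase pairs, one example per line,
    with four tab-separated columns:

        label    sentence1    sentence2    id

    where ``label`` is 1 if the two sentences are paraphrases and 0
    otherwise, and both sentences are pre-tokenized.
    """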

Aggregator of all BiMPM matching vectors
classifier_feedforward : ``FeedForward``
Fully connected layers for classification.
dropout : ``float``
@matt-gardner (Contributor)

This should be marked optional.

BiMPM matching on the output of word embeddings of premise and hypothesis.
encoder1 : ``Seq2SeqEncoder``
First encoder layer for the premise and hypothesis
matcher_fw1 : ``BiMPMMatching``
@matt-gardner (Contributor)

Instead of fw and bw, can you expand those to forward and backward? It's much easier to read.

self.matcher_fw1.get_output_dim() + self.matcher_bw1.get_output_dim() + \
self.matcher_fw2.get_output_dim() + self.matcher_bw2.get_output_dim()

if matching_dim != self.aggregator.get_input_dim():
@matt-gardner (Contributor)

We have a check_dimensions_match function that makes these checks a little easier:

check_dimensions_match(text_field_embedder.get_output_dim(), encoder.get_input_dim(),
                       "text field embedding dim", "encoder input dim")

from allennlp.nn.util import get_lengths_from_binary_sequence_mask


def masked_max(vector: torch.Tensor,
@matt-gardner (Contributor)

These two functions should go in nn.util.
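
For reference, a minimal sketch of what the two helpers could look like once moved to nn.util (signatures inferred from the diff; mask is assumed to be broadcastable to vector):

import torch

def masked_max(vector: torch.Tensor, mask: torch.Tensor, dim: int) -> torch.Tensor:
    # Push masked-out positions to a very negative value so they can
    # never win the max, then reduce along the requested dimension.
    replaced = vector.masked_fill(mask == 0, torch.finfo(vector.dtype).min)
    return replaced.max(dim=dim)[0]

def masked_mean(vector: torch.Tensor, mask: torch.Tensor, dim: int,
                eps: float = 1e-8) -> torch.Tensor:
    # Zero out masked positions, then divide by the clamped count of
    # valid positions (mirroring value_count.clamp(min=eps) elsewhere
    # in the diff).
    value_sum = (vector * mask.float()).sum(dim=dim)
    value_count = mask.float().sum(dim=dim)
    return value_sum / value_count.clamp(min=eps)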

share_weight_between_directions : ``bool``, optional (default = True)
If True, share weight between premise to hypothesis and hypothesis to premise,
useful for non-symmetric tasks
wo_full_match : ``bool``, optional (default = False)
@matt-gardner (Contributor)

What is wo? "without"? That's not obvious.

if num_matching <= 0:
raise ConfigurationError("At least one of the matching method should be enabled")

params = [nn.Parameter(torch.rand(num_perspective, hidden_dim)) for i in range(num_matching)]
@matt-gardner (Contributor)

torch.rand() isn't a great way to initialize parameters. Can you use something more reasonable as a default here, like torch.nn.init.xavier_uniform_?

Also, having a list of parameters here is pretty hard to interpret if you're looking at parameter values in tensorboard, or during debugging, or similar. I think it'd be better to give these parameters names, like full_match_weights, and assign them directly to self, assigning None if they aren't selected. That would also let you get rid of num_matching, mv_idx, and mv_idx_increment entirely, which are a little confusing.
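
A rough sketch of the named-parameter shape being suggested here (the class body and names are hypothetical, not the PR's code):

import torch
from torch import nn

class BiMpmMatching(nn.Module):
    def __init__(self, hidden_dim: int, num_perspectives: int,
                 with_full_match: bool = True) -> None:
        super().__init__()
        if with_full_match:
            # A named attribute shows up as "full_match_weights" in
            # tensorboard and the debugger, unlike an anonymous list entry.
            self.full_match_weights = nn.Parameter(
                torch.empty(num_perspectives, hidden_dim))
            nn.init.xavier_uniform_(self.full_match_weights)
        else:
            self.full_match_weights = None
        # ...and similarly for the other matching methods.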

@hzeng-otterai (Author)

I am using an initializer in the config file to initialize these weights, so the weights here will be overwritten. But I can use torch.FloatTensor() to create them here.

@matt-gardner (Contributor)

Yeah, I know you re-initialize them in your config, but the default should also do something reasonable, in case people forget or don't want to use a special initialization.

return (mul_result / norm_value.clamp(min=eps)).permute(0, 2, 3, 1)


class BiMPMMatching(nn.Module, FromParams):
@matt-gardner (Contributor)

We prefer to use Google's definition of camel case, where this would be BiMpmMatching (and BiMpm instead of BiMPM).

return value_sum / value_count.clamp(min=eps)


def mpm(vector1: torch.Tensor,
@matt-gardner (Contributor)

A more descriptive name would be better here (and below), like multi_perspective_match().
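
For context, the operation being renamed is the paper's multi-perspective cosine match; a minimal sketch (the tensor shapes are assumptions, not the exact code under review):

import torch
import torch.nn.functional as F

def multi_perspective_match(vector1: torch.Tensor,
                            vector2: torch.Tensor,
                            weight: torch.Tensor) -> torch.Tensor:
    # vector1, vector2: (batch, seq_len, hidden_dim)
    # weight: (num_perspectives, hidden_dim)
    # Each perspective reweights the hidden dimensions elementwise
    # before the cosine similarity, as in the BiMPM paper.
    weight = weight.unsqueeze(0).unsqueeze(2)                # (1, P, 1, H)
    weighted1 = weight * vector1.unsqueeze(1)                # (B, P, L, H)
    weighted2 = weight * vector2.unsqueeze(1)
    return F.cosine_similarity(weighted1, weighted2, dim=3)  # (B, P, L)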

@hzeng-otterai (Author)

Thanks a lot, Matt. I agree with most of the comments and I am making the changes now...

@hzeng-otterai (Author) commented Aug 17, 2018

PR updated with all of those issues fixed according to the comments. Please review. Thanks!

@matt-gardner (Contributor) left a review

This looks great! I think it's way easier to read now, thanks for making those changes.

Do you have any trained models that we should upload? Have you put any thought into building a demo for this that we can add to demo.allennlp.org?

logger = logging.getLogger(__name__) # pylint: disable=invalid-name


def parse_file_uri(uri: str):
@matt-gardner (Contributor)

I think you don't need this anymore. (Though, as I said before, I think it's totally reasonable to talk about adding this functionality more generally, in common.file_utils. It should just be in a separate PR.)

@hzeng-otterai (Author)

Oh, right. Let me remove this.

label: str = None) -> Instance:
# pylint: disable=arguments-differ
fields: Dict[str, Field] = {}
tokenized_premise = self._tokenizer.tokenize(premise)
@matt-gardner (Contributor)

Oh, these are pre-tokenized! That's an important point, thanks for adding it to the class docstring. I think that means you should not tokenize the text here. The reason is that in a demo, you'll want to do tokenization (because someone typing text into a demo won't necessarily tokenize it nicely for you). So if/when we build a demo for this, we'll want the Predictor to have a tokenizer and pass pre-tokenized text into here. So I'd recommend having premise: List[str] and hypothesis: List[str], removing the tokenizer, and just calling row[1].split() and row[2].split() above. Does that make sense?
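
Concretely, the pre-tokenized signature could look something like this (a sketch with assumed field names, not the PR's final code):

from typing import Dict, List

from allennlp.data import Instance
from allennlp.data.fields import Field, LabelField, TextField
from allennlp.data.tokenizers import Token

def text_to_instance(self,  # type: ignore
                     premise: List[str],
                     hypothesis: List[str],
                     label: str = None) -> Instance:
    # pylint: disable=arguments-differ
    # The reader calls row[1].split() / row[2].split() upstream, so a
    # future Predictor can run its own tokenizer before calling this.
    fields: Dict[str, Field] = {}
    fields["premise"] = TextField([Token(t) for t in premise], self._token_indexers)
    fields["hypothesis"] = TextField([Token(t) for t in hypothesis], self._token_indexers)
    if label is not None:
        fields["label"] = LabelField(label)
    return Instance(fields)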

@hzeng-otterai (Author)

Sure, that makes sense. But since text_to_instance is an overridden method, will the change of types cause any side effects? How about we do this in the next PR, where we add the predictor and demo?

@hzeng-otterai (Author)

Also, the raw Quora data is actually not pre-tokenized; only this particular train/validation/test split is. At some point I will want to use the same split with the non-tokenized version, to see whether better tokenization could further improve performance.

@matt-gardner (Contributor)

Oh, ok, that's a good point. And yeah, it's fine to worry about this in a subsequent PR. It's just something to be careful about when this actually goes into a demo.

@matt-gardner (Contributor)

Oh, and for the types, we're already ignoring types on these methods, because none of them agree with the super class. The point of that method existing on the super class is just to tell you that you really should implement a method with this name, though the particulars of how it looks are going to be specific to your data.

----------
hidden_dim : ``int``, optional (default = 100)
The hidden dimension of the representations
num_perspective : ``int``, optional (default = 20)
@matt-gardner (Contributor)

Nit: this should probably be num_perspectives, here, in the description on the next line, and as a variable name in the code.

@hzeng-otterai (Author)

Sure, I will rename it.

@hzeng-otterai (Author)

Currently I have an experiment running on this version of the code. It is not finished yet, but it will very probably reach 0.89+ accuracy on the Quora test data. After the PR is merged I can test and upload the model.

And yes, I am willing to build a demo from it. Let me work on a PR, maybe next week.

@matt-gardner (Contributor)

Awesome, thanks again for all of this work!

matt-gardner merged commit 76a65a8 into allenai:master on Aug 17, 2018
@hzeng-otterai (Author) commented Aug 17, 2018

I put the trained model file here: https://s3-us-west-1.amazonaws.com/handsomezebra/public/bimpm-quora-2018.08.17.tar.gz

Running the following command under the latest master:

./bin/allennlp evaluate https://s3-us-west-1.amazonaws.com/handsomezebra/public/bimpm-quora-2018.08.17.tar.gz https://s3-us-west-2.amazonaws.com/allennlp/datasets/quora-question-paraphrase/test.tsv

gives an accuracy of 89.3%.

@saurabhvyas

Hi, thanks for your hard work! Can you update the docs on how to train your model on custom data?

@hzeng-otterai (Author) commented Aug 20, 2018

@saurabhvyas
To train and evaluate the BiMPM model, use the following commands:

allennlp train training_config/bimpm.json -s <serialization_folder>
allennlp evaluate <serialization_folder>/model.tar.gz https://s3-us-west-2.amazonaws.com/allennlp/datasets/quora-question-paraphrase/test.tsv

If you want to use your own data and it has exactly the same format as the Quora data (a tsv file with four columns: label, sentence1, sentence2, and id, with the words pre-tokenized), just change train_data_path and validation_data_path in bimpm.json.
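
For illustration, a line in that format might look like this (a made-up example; the columns are tab-separated):

1	What is the best way to learn Python ?	How do I learn Python ?	q12345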

Otherwise, if your data has a different format, please write your own dataset reader. See allennlp's tutorials at https://github.com/allenai/allennlp/tree/master/tutorials for more information.

@schmmd (Member) commented Aug 20, 2018

@handsomezebra yes, thank you! And sorry for the slow responses; I was out last week. If you're willing to build a demo, it would be great to have one for the Quora dataset. If you make a rough demo, we're able to pretty it up.

@saurabhvyas

@handsomezebra thanks for your quick reply, I will follow the steps you mentioned.

@schmmd (Member) commented Aug 21, 2018

I moved the model into our S3.

$ allennlp evaluate https://s3-us-west-2.amazonaws.com/allennlp/models/bimpm-quora-2018.08.17.tar.gz https://s3-us-west-2.amazonaws.com/allennlp/datasets/quora-question-paraphrase/test.tsv
...
2018-08-21 09:34:18,226 - INFO - allennlp.commands.evaluate - Metrics:
2018-08-21 09:34:18,226 - INFO - allennlp.commands.evaluate - accuracy: 0.893

gabrielStanovsky pushed a commit to gabrielStanovsky/allennlp that referenced this pull request Sep 7, 2018
* Adding Quora data reader and BiMPM model.

* Refactoring and renaming to pass pylint.

* Reduce batch size.

* Adding docs.

* Make title underline longer.

* Adding doc toctree.

* Various improvements to speed and memory.

* Improve comments.

* 1. Remove zip file handling.
  2. Use allennlp s3 for quora data download.
  3. Move masked_max and masked_mean to nn.util.
  4. Various variable renaming, comment improvements, etc.

* Remove unused url pattern match and change num_perspective to num_perspectives.