
BiMPM model #1594

Merged: 12 commits into allenai:master on Aug 17, 2018

Conversation

@hzeng-otterai (Contributor)

This PR includes the implementation of the BiMPM model (#1503) and a Quora paraphrase dataset reader. I am still working on some comment improvements and memory-usage optimization, but I wanted to send out the PR to collect feedback in the meantime.

Thanks!

@hzeng-otterai (Author)

Looks like documentation for the classes is missing. I will update the PR.

@hzeng-otterai (Author)

Pinging @schmmd, could you please provide comments on this PR?
I asked the original author of the paper (https://github.com/zhiguowang/) to review the code; he said he has been busy recently and will review once he gets some time.

@matt-gardner (Contributor)

Sorry to be slow @handsomezebra, we had a hackathon last week at AI2 that put a bunch of other stuff on pause. I'll look at this, though today is pretty busy for me, and it'll probably be tomorrow before I get to it.

matt-gardner self-assigned this on Aug 14, 2018
@matt-gardner (Contributor) left a review

Overall, this looks great, thanks for doing this! I didn't double check the logic against the paper, as you've said you got close to the same results. What I saw in the logic looked reasonable to me.

The one major quibble I have with this is that the variable and method names are all very abbreviated. We strongly prefer longer, more descriptive variable names, as having code be easy to understand is far more important than saving a few characters of horizontal space. I can figure out that ml_fw probably means forward_matching_layer, for instance, but that's a guess, and it adds to my cognitive load as I read this code. Can you expand the variable names to avoid abbreviations throughout?

@overrides
def _read(self, file_path):
logger.info("Reading instances from lines in file at: %s", file_path)
file_name, member = parse_file_uri(file_path)
@matt-gardner (Contributor)

You've duplicated the logic below between the two cases (where member is None and where it isn't). Seems like it would be cleaner to push the file decision logic into this method. So you'd have something like:

def open_uri(file_path):
    file_name, member = parse_file_uri(file_path)
    file_path = cached_path(file_path)
    if member is None:
        return open(file_path, 'r')
    else:
        # zipfile opening logic here

def _read(self, file_path):
    with open_uri(file_path) as data_file:
       tsvin = ...

@joelgrus probably knows more than I do about how to handle the with stuff properly in this case.
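
If it helps, here is a minimal sketch of how the zip branch could satisfy the with statement (this is just a sketch assuming the standard zipfile and io modules; open_uri and parse_file_uri are the hypothetical names from the suggestion above, not a final API):

import io
import zipfile

from allennlp.common.file_utils import cached_path

def open_uri(file_path: str):
    # parse_file_uri is assumed to split an "archive!member" style URI
    # into the archive path and the member name (None for plain files).
    file_name, member = parse_file_uri(file_path)
    cached = cached_path(file_name)
    if member is None:
        return open(cached, 'r')
    # ZipFile.open returns a binary stream; wrap it for text-mode reads.
    # Both the wrapper and a plain file work as context managers, so
    # `with open_uri(...) as data_file:` closes the stream either way.
    archive = zipfile.ZipFile(cached)
    return io.TextIOWrapper(archive.open(member))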

@hzeng-otterai (Author)

@matt-gardner @joelgrus Sure, I will try to refactor this. One related question: I am currently using my personal s3 bucket to host the data file, https://s3-us-west-1.amazonaws.com/handsomezebra/public/Quora_question_pair_partition.zip. Do you think you could help me upload the files to allennlp's s3?

@matt-gardner (Contributor)

Yes, I'd be happy to put that data file (and any model files you have) into our s3 bucket. Just tell me what all of them are.


@matt-gardner (Contributor) commented Aug 16, 2018

Ok, I unzipped it and put the train, dev and test files here: https://s3-us-west-2.amazonaws.com/allennlp/datasets/quora-question-paraphrase/{train,dev,test}.tsv, along with the README. I left out the word vector file.

@matt-gardner (Contributor)

(Unzipping it also makes it so you can just remove the zip file logic and simplify that piece of the code. I'd vote for handling zip files in a more general way in our common module, which we can certainly do, but is probably more appropriate for a separate PR.)

@hzeng-otterai (Author)

Sure, I will remove the zip file reading.
I tried the s3 URL https://s3-us-west-1.amazonaws.com/allennlp/datasets/quora-question-paraphrase/train.tsv but got a redirection error. This URL works fine, though: https://allennlp.s3-us-west-2.amazonaws.com/datasets/quora-question-paraphrase/train.tsv. I can use the second URL, but other data on allennlp's s3 doesn't seem to have this problem.

@matt-gardner (Contributor) commented Aug 16, 2018

Ah, sorry, that's my fault, I had a copy-paste error in the link I gave you. (I edited the previous post to fix it.)

@DatasetReader.register("quora_paraphrase")
class QuoraParaphraseDatasetReader(DatasetReader):
"""
Reads a Quora paraphrase data
@matt-gardner (Contributor)

An explanation of the expected data format here would be nice (e.g., this is a TSV file, with the following columns).
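
For illustration, a docstring along those lines might read (the column layout is taken from the discussion further down; the wording here is only a sketch):

@DatasetReader.register("quora_paraphrase")
class QuoraParaphraseDatasetReader(DatasetReader):
    """
    Reads a tsv file of Quora paraphrase pairs, one example per line,
    with four tab-separated columns:

        label    sentence1    sentence2    id

    where ``label`` is 1 if the two sentences are paraphrases and 0
    otherwise, and both sentences are pre-tokenized.
    """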

Aggregator of all BiMPM matching vectors
classifier_feedforward : ``FeedForward``
Fully connected layers for classification.
dropout : ``float``
@matt-gardner (Contributor)

This should be marked optional.

BiMPM matching on the output of word embeddings of premise and hypothesis.
encoder1 : ``Seq2SeqEncoder``
First encoder layer for the premise and hypothesis
matcher_fw1 : ``BiMPMMatching``
@matt-gardner (Contributor)

Instead of fw and bw, can you expand those to forward and backward? It's much easier to read.

self.matcher_fw1.get_output_dim() + self.matcher_bw1.get_output_dim() + \
self.matcher_fw2.get_output_dim() + self.matcher_bw2.get_output_dim()

if matching_dim != self.aggregator.get_input_dim():
@matt-gardner (Contributor)

We have a check_dimensions_match function that makes these checks a little easier:

check_dimensions_match(text_field_embedder.get_output_dim(), encoder.get_input_dim(),
                       "text field embedding dim", "encoder input dim")

from allennlp.nn.util import get_lengths_from_binary_sequence_mask


def masked_max(vector: torch.Tensor,
@matt-gardner (Contributor)

These two functions should go in nn.util.
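
For reference, a minimal sketch of what the two helpers could look like once moved to nn.util (signatures inferred from the diff; mask is assumed to be broadcastable to vector):

import torch

def masked_max(vector: torch.Tensor, mask: torch.Tensor, dim: int) -> torch.Tensor:
    # Push masked-out positions to a very negative value so they can
    # never win the max, then reduce along the requested dimension.
    replaced = vector.masked_fill(mask == 0, torch.finfo(vector.dtype).min)
    return replaced.max(dim=dim)[0]

def masked_mean(vector: torch.Tensor, mask: torch.Tensor, dim: int,
                eps: float = 1e-8) -> torch.Tensor:
    # Zero out masked positions, then divide by the clamped count of
    # valid positions (mirroring value_count.clamp(min=eps) elsewhere
    # in the diff).
    value_sum = (vector * mask.float()).sum(dim=dim)
    value_count = mask.float().sum(dim=dim)
    return value_sum / value_count.clamp(min=eps)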

share_weight_between_directions : ``bool``, optional (default = True)
If True, share weight between premise to hypothesis and hypothesis to premise,
useful for non-symmetric tasks
wo_full_match : ``bool``, optional (default = False)
@matt-gardner (Contributor)

What is wo? "without"? That's not obvious.

if num_matching <= 0:
raise ConfigurationError("At least one of the matching method should be enabled")

params = [nn.Parameter(torch.rand(num_perspective, hidden_dim)) for i in range(num_matching)]
@matt-gardner (Contributor)

torch.rand() isn't a great way to initialize parameters. Can you use something more reasonable as a default here, like torch.nn.init.xavier_uniform_?

Also, having a list of parameters here is pretty hard to interpret if you're looking at parameter values in tensorboard, or during debugging, or similar. I think it'd be better to give these parameters names, like full_match_weights, and assign them directly to self, assigning None if they aren't selected. That would also let you get rid of num_matching, mv_idx, and mv_idx_increment entirely, which are a little confusing.
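
A rough sketch of the named-parameter shape being suggested here (the class body and names are hypothetical, not the PR's code):

import torch
from torch import nn

class BiMpmMatching(nn.Module):
    def __init__(self, hidden_dim: int, num_perspectives: int,
                 with_full_match: bool = True) -> None:
        super().__init__()
        if with_full_match:
            # A named attribute shows up as "full_match_weights" in
            # tensorboard and the debugger, unlike an anonymous list entry.
            self.full_match_weights = nn.Parameter(
                torch.empty(num_perspectives, hidden_dim))
            nn.init.xavier_uniform_(self.full_match_weights)
        else:
            self.full_match_weights = None
        # ...and similarly for the other matching methods.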

@hzeng-otterai (Author)

I am using an initializer in the config file to initialize these weights, so the weights here will be overwritten. But I can use torch.FloatTensor() to create them here.

@matt-gardner (Contributor)

Yeah, I know you re-initialize them in your config, but the default should also do something reasonable, in case people forget or don't want to use a special initialization.

return (mul_result / norm_value.clamp(min=eps)).permute(0, 2, 3, 1)


class BiMPMMatching(nn.Module, FromParams):
@matt-gardner (Contributor)

We prefer to use Google's definition of camel case, where this would be BiMpmMatching (and BiMpm instead of BiMPM).

return value_sum / value_count.clamp(min=eps)


def mpm(vector1: torch.Tensor,
@matt-gardner (Contributor)

A more descriptive name would be better here (and below), like multi_perspective_match().
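
For context, the operation being renamed is the paper's multi-perspective cosine match; a minimal sketch (the tensor shapes are assumptions, not the exact code under review):

import torch
import torch.nn.functional as F

def multi_perspective_match(vector1: torch.Tensor,
                            vector2: torch.Tensor,
                            weight: torch.Tensor) -> torch.Tensor:
    # vector1, vector2: (batch, seq_len, hidden_dim)
    # weight: (num_perspectives, hidden_dim)
    # Each perspective reweights the hidden dimensions elementwise
    # before the cosine similarity, as in the BiMPM paper.
    weight = weight.unsqueeze(0).unsqueeze(2)                # (1, P, 1, H)
    weighted1 = weight * vector1.unsqueeze(1)                # (B, P, L, H)
    weighted2 = weight * vector2.unsqueeze(1)
    return F.cosine_similarity(weighted1, weighted2, dim=3)  # (B, P, L)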

@hzeng-otterai (Author)

Thanks a lot, Matt. I agree with most of the comments and I am making the changes now...

@hzeng-otterai (Author) commented Aug 17, 2018

PR updated with all of those issues fixed according to the comments. Please review. Thanks!

@matt-gardner (Contributor) left a review

This looks great! I think it's way easier to read now, thanks for making those changes.

Do you have any trained models that we should upload? Have you put any thought into building a demo for this that we can add to demo.allennlp.org?

logger = logging.getLogger(__name__) # pylint: disable=invalid-name


def parse_file_uri(uri: str):
@matt-gardner (Contributor)

I think you don't need this anymore. (Though, as I said before, I think it's totally reasonable to talk about adding this functionality more generally, in common.file_utils. It should just be in a separate PR.)

@hzeng-otterai (Author)

Oh, right. Let me remove this.

label: str = None) -> Instance:
# pylint: disable=arguments-differ
fields: Dict[str, Field] = {}
tokenized_premise = self._tokenizer.tokenize(premise)
@matt-gardner (Contributor)

Oh, these are pre-tokenized! That's an important point, thanks for adding it to the class docstring. I think that means you should not tokenize the text here. The reason is that in a demo, you'll want to do tokenization (because someone typing text into a demo won't necessarily tokenize it nicely for you). So if/when we build a demo for this, we'll want the Predictor to have a tokenizer and pass pre-tokenized text into here. So I'd recommend having premise: List[str] and hypothesis: List[str], removing the tokenizer, and just calling row[1].split() and row[2].split() above. Does that make sense?
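
Concretely, the pre-tokenized signature could look something like this (a sketch with assumed field names, not the PR's final code):

from typing import Dict, List

from allennlp.data import Instance
from allennlp.data.fields import Field, LabelField, TextField
from allennlp.data.tokenizers import Token

def text_to_instance(self,  # type: ignore
                     premise: List[str],
                     hypothesis: List[str],
                     label: str = None) -> Instance:
    # pylint: disable=arguments-differ
    # The reader calls row[1].split() / row[2].split() upstream, so a
    # future Predictor can run its own tokenizer before calling this.
    fields: Dict[str, Field] = {}
    fields["premise"] = TextField([Token(t) for t in premise], self._token_indexers)
    fields["hypothesis"] = TextField([Token(t) for t in hypothesis], self._token_indexers)
    if label is not None:
        fields["label"] = LabelField(label)
    return Instance(fields)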

@hzeng-otterai (Author)

Sure, that makes sense. But since text_to_instance is an overridden method, will the change of types cause any side effects? How about we do this in the next PR, where we add the predictor and demo?

@hzeng-otterai (Author)

Also, the raw Quora data is actually not pre-tokenized; only this particular train/validation/test split is. At some point I will want to use the same split with the non-tokenized version, to see whether better tokenization could further improve performance.

@matt-gardner (Contributor)

Oh, ok, that's a good point. And yeah, it's fine to worry about this in a subsequent PR. It's just something to be careful about when this actually goes into a demo.

@matt-gardner (Contributor)

Oh, and for the types, we're already ignoring types on these methods, because none of them agree with the super class. The point of that method existing on the super class is just to tell you that you really should implement a method with this name, though the particulars of how it looks are going to be specific to your data.

----------
hidden_dim : ``int``, optional (default = 100)
The hidden dimension of the representations
num_perspective : ``int``, optional (default = 20)
@matt-gardner (Contributor)

Nit: this should probably be num_perspectives, here, in the description on the next line, and as a variable name in the code.

@hzeng-otterai (Author)

Sure, I will rename it.

@hzeng-otterai (Author)

Currently I have an experiment running on this version of the code. It is not finished yet, but it will very probably reach 0.89+ accuracy on the Quora test data. After the PR is merged I can test and upload the model.

And yes, I am willing to build a demo from it. Let me work on a PR, maybe next week.

@matt-gardner (Contributor)

Awesome, thanks again for all of this work!

matt-gardner merged commit 76a65a8 into allenai:master on Aug 17, 2018
@hzeng-otterai (Author) commented Aug 17, 2018

I put the trained model file here: https://s3-us-west-1.amazonaws.com/handsomezebra/public/bimpm-quora-2018.08.17.tar.gz

Running the following command under the latest master:

./bin/allennlp evaluate https://s3-us-west-1.amazonaws.com/handsomezebra/public/bimpm-quora-2018.08.17.tar.gz https://s3-us-west-2.amazonaws.com/allennlp/datasets/quora-question-paraphrase/test.tsv

gives an accuracy of 89.3%.

@saurabhvyas

Hi, thanks for your hard work! Can you update the docs on how to train your model on custom data?

@hzeng-otterai (Author) commented Aug 20, 2018

@saurabhvyas
To train and evaluate the BiMPM model, use the following commands:

allennlp train training_config/bimpm.json -s <serialization_folder>
allennlp evaluate <serialization_folder>/model.tar.gz https://s3-us-west-2.amazonaws.com/allennlp/datasets/quora-question-paraphrase/test.tsv

If you want to use your own data and it has exactly the same format as the Quora data (a tsv file with four columns: label, sentence1, sentence2, and id, with the words pre-tokenized), just change train_data_path and validation_data_path in bimpm.json.
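
For illustration, a line in that format might look like this (a made-up example; the columns are tab-separated):

1	What is the best way to learn Python ?	How do I learn Python ?	q12345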

Otherwise, if your data has a different format, please write your own dataset reader. See allennlp's tutorials at https://github.com/allenai/allennlp/tree/master/tutorials for more information.

@schmmd (Member) commented Aug 20, 2018

@handsomezebra yes, thank you! And sorry for the slow responses; I was out last week. If you're willing to build a demo, it would be great to have one for the Quora dataset. If you make a rough demo, we're able to pretty it up.

@saurabhvyas

@handsomezebra thanks for your quick reply, I will follow the steps you mentioned.

@schmmd (Member) commented Aug 21, 2018

I moved the model into our S3.

$ allennlp evaluate https://s3-us-west-2.amazonaws.com/allennlp/models/bimpm-quora-2018.08.17.tar.gz https://s3-us-west-2.amazonaws.com/allennlp/datasets/quora-question-paraphrase/test.tsv
...
2018-08-21 09:34:18,226 - INFO - allennlp.commands.evaluate - Metrics:
2018-08-21 09:34:18,226 - INFO - allennlp.commands.evaluate - accuracy: 0.893

gabrielStanovsky pushed a commit to gabrielStanovsky/allennlp that referenced this pull request Sep 7, 2018
* Adding Quora data reader and BiMPM model.

* Refactoring and renaming to pass pylint.

* Reduce batch size.

* Adding docs.

* Make title underline longer.

* Adding doc toctree.

* Various improvements to speed and memory.

* Improve comments.

* 1. Remove zip file handling.
  2. Use allennlp s3 for quora data download.
  3. Move masked_max and masked_mean to nn.util.
  4. Various variable renaming, comment improvements, etc.

* Remove unused url pattern match and change num_perspective to num_perspectives.