Looks like documentation for the classes is missing. I will update the PR.
Pinging @schmmd, please help provide comments on this pack.
Sorry to be slow @handsomezebra, we had a hackathon last week at AI2 that put a bunch of other stuff on pause. I'll look at this, though today is pretty busy for me, and it'll probably be tomorrow before I get to it.
Overall, this looks great, thanks for doing this! I didn't double check the logic against the paper, as you've said you got close to the same results. What I saw in the logic looked reasonable to me.
The one major quibble I have with this is that the variable and method names are all very abbreviated. We strongly prefer longer, more descriptive variable names, as having code be easy to understand is far more important than saving a few characters of horizontal space. I can figure out that ml_fw probably means forward_matching_layer, for instance, but that's a guess, and it adds to my cognitive load as I read this code. Can you expand the variable names to avoid abbreviations throughout?
@overrides
def _read(self, file_path):
    logger.info("Reading instances from lines in file at: %s", file_path)
    file_name, member = parse_file_uri(file_path)
You've duplicated the logic below between the two cases (where member is None and where it isn't). It seems like it would be cleaner to push the file-decision logic into its own method. So you'd have something like:
def open_uri(file_path):
    file_name, member = parse_file_uri(file_path)
    file_path = cached_path(file_path)
    if member is None:
        return open(file_path, 'r')
    else:
        # zipfile opening logic here

def _read(self, file_path):
    with open_uri(file_path) as data_file:
        tsvin = ...
@joelgrus probably knows more than I do about how to handle the with stuff properly in this case.
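To make the suggestion above concrete, here is a self-contained sketch of what the zip branch could look like. Everything here is illustrative: the `archive.zip!member` URI convention and this stub `parse_file_uri` are assumptions about the PR's code, and the real version would also run the path through allennlp's `cached_path` first (omitted so the sketch stays runnable):

```python
import io
import zipfile


def parse_file_uri(uri):
    # Hypothetical URI convention: "archive.zip!member.tsv" names a file
    # inside a zip archive; a plain path has no member part.
    if "!" in uri:
        file_name, member = uri.split("!", 1)
        return file_name, member
    return uri, None


def open_uri(file_path):
    """Open either a plain text file or a member of a zip archive."""
    file_name, member = parse_file_uri(file_path)
    if member is None:
        return open(file_name, "r")
    # ZipFile.open returns a binary stream; wrap it so callers
    # always get text lines, as with a plain open().
    archive = zipfile.ZipFile(file_name)
    return io.TextIOWrapper(archive.open(member), encoding="utf-8")
```

Returning the wrapped stream lets `_read` use a single `with open_uri(file_path) as data_file:` for both cases.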
@matt-gardner @joelgrus Sure, I will try to refactor this. One related question: I am currently using my personal s3 to host the data file https://s3-us-west-1.amazonaws.com/handsomezebra/public/Quora_question_pair_partition.zip. Do you think you can help me upload the files to allennlp's s3?
Yes, I'd be happy to put that data file (and any model files you have) into our s3 bucket. Just tell me what all of them are.
Just this data file for now: https://s3-us-west-1.amazonaws.com/handsomezebra/public/Quora_question_pair_partition.zip
Thanks!
Ok, I unzipped it and put the train, dev and test files here: https://s3-us-west-2.amazonaws.com/allennlp/datasets/quora-question-paraphrase/{train,dev,test}.tsv, along with the README. I left out the word vector file.
(Unzipping it also makes it so you can just remove the zip file logic and simplify that piece of the code. I'd vote for handling zip files in a more general way in our common module, which we can certainly do, but that's probably more appropriate for a separate PR.)
Sure, I will remove the zip file reading.
I tried the s3 URL https://s3-us-west-1.amazonaws.com/allennlp/datasets/quora-question-paraphrase/train.tsv but got a redirection error. This URL is fine, though: https://allennlp.s3-us-west-2.amazonaws.com/datasets/quora-question-paraphrase/train.tsv. I can use the second URL, but other data on allennlp doesn't seem to have this problem.
Ah, sorry, that's my fault, I had a copy-paste error in the link I gave you. (I edited the previous post to fix it.)
@DatasetReader.register("quora_paraphrase")
class QuoraParaphraseDatasetReader(DatasetReader):
    """
    Reads a Quora paraphrase data
An explanation of the expected data format here would be nice (e.g., this is a TSV file, with the following columns).
allennlp/models/bimpm.py
Outdated
    Aggregator of all BiMPM matching vectors
classifier_feedforward : ``FeedForward``
    Fully connected layers for classification.
dropout : ``float``
This should be marked optional.
allennlp/models/bimpm.py
Outdated
    BiMPM matching on the output of word embeddings of premise and hypothesis.
encoder1 : ``Seq2SeqEncoder``
    First encoder layer for the premise and hypothesis
matcher_fw1 : ``BiMPMMatching``
Instead of fw and bw, can you expand those to forward and backward? It's much easier to read.
allennlp/models/bimpm.py
Outdated
self.matcher_fw1.get_output_dim() + self.matcher_bw1.get_output_dim() + \
self.matcher_fw2.get_output_dim() + self.matcher_bw2.get_output_dim()

if matching_dim != self.aggregator.get_input_dim():
We have a check_dimensions_match function that makes these checks a little easier:
allennlp/allennlp/models/esim.py, lines 82 to 83 in 580dc8b:
check_dimensions_match(text_field_embedder.get_output_dim(), encoder.get_input_dim(),
                       "text field embedding dim", "encoder input dim")
allennlp/modules/bimpm_matching.py
Outdated
from allennlp.nn.util import get_lengths_from_binary_sequence_mask


def masked_max(vector: torch.Tensor,
These two functions should go in nn.util.
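For readers following along, the semantics of the two functions being moved are roughly as follows. This is a numpy sketch of the idea, not the torch implementation that actually went into nn.util:

```python
import numpy as np


def masked_max(vector, mask, axis):
    # Entries where mask == 0 are replaced with -inf so they can
    # never win the max.
    return np.where(mask.astype(bool), vector, -np.inf).max(axis=axis)


def masked_mean(vector, mask, axis, eps=1e-8):
    # Sum only the kept entries and divide by how many were kept,
    # clamping the count to avoid division by zero on all-masked rows.
    value_sum = (vector * mask).sum(axis=axis)
    value_count = np.maximum(mask.sum(axis=axis), eps)
    return value_sum / value_count
```

For example, with vector [1, 5, 3] and mask [1, 0, 1], masked_max gives 3.0 and masked_mean gives 2.0, since the 5 is masked out.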
allennlp/modules/bimpm_matching.py
Outdated
share_weight_between_directions : ``bool``, optional (default = True)
    If True, share weight between premise to hypothesis and hypothesis to premise,
    useful for non-symmetric tasks
wo_full_match : ``bool``, optional (default = False)
What is wo? "without"? That's not obvious.
allennlp/modules/bimpm_matching.py
Outdated
if num_matching <= 0:
    raise ConfigurationError("At least one of the matching method should be enabled")

params = [nn.Parameter(torch.rand(num_perspective, hidden_dim)) for i in range(num_matching)]
torch.rand() isn't a great way to initialize parameters. Can you use something more reasonable as a default here, like torch.nn.init.xavier_uniform_?
Also, having a list of parameters here is pretty hard to interpret if you're looking at parameter values in tensorboard, or during debugging, or similar. I think it'd be better to give these parameters names, like full_match_weights, and assign them directly to self, assigning None if they aren't selected. That would also let you get rid of num_matching, mv_idx, and mv_idx_increment entirely, which are a little confusing.
I am using an initializer in the config file to initialize their weights, so the weights here will be overwritten. But I can use torch.FloatTensor() to create them here.
Yeah, I know you re-initialize them in your config, but the default should also do something reasonable, in case people forget or don't want to use a special initialization.
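For reference, xavier_uniform_ (Glorot initialization) samples from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)). Below is a numpy sketch of that default combined with the named-attribute suggestion from the review comment; the `MatchingWeights` class and its argument names are invented for illustration, not taken from the PR:

```python
import numpy as np


def xavier_uniform(shape, gain=1.0, rng=None):
    # Glorot/Xavier uniform: U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)).
    # For a 2-D weight the bound is symmetric in the two dims, so order is moot.
    rng = rng or np.random.default_rng(0)
    fan_in, fan_out = shape
    bound = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=shape)


class MatchingWeights:
    # Named attributes instead of an anonymous parameter list; a disabled
    # matching method simply holds None, which is easy to see in a debugger.
    def __init__(self, num_perspectives, hidden_dim, with_full_match=True):
        shape = (num_perspectives, hidden_dim)
        self.full_match_weights = xavier_uniform(shape) if with_full_match else None
```

In the real module these would be torch `nn.Parameter`s initialized with `torch.nn.init.xavier_uniform_`, but the bound computation is the same.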
allennlp/modules/bimpm_matching.py
Outdated
    return (mul_result / norm_value.clamp(min=eps)).permute(0, 2, 3, 1)


class BiMPMMatching(nn.Module, FromParams):
We prefer to use Google's definition of camel case, where this would be BiMpmMatching (and BiMpm instead of BiMPM).
allennlp/modules/bimpm_matching.py
Outdated
    return value_sum / value_count.clamp(min=eps)


def mpm(vector1: torch.Tensor,
A more descriptive name would be better here (and below), like multi_perspective_match().
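For context, the function being renamed implements the paper's multi-perspective cosine match: each perspective k has a learned weight row w_k, and the output is cos(w_k ⊙ v1, w_k ⊙ v2) for every k. A numpy sketch of that computation for a single pair of vectors (the actual torch version operates on batched sequences):

```python
import numpy as np


def multi_perspective_match(vector1, vector2, weights, eps=1e-8):
    # weights: (num_perspectives, hidden_dim). Each row reweights both
    # input vectors before a cosine similarity, yielding one matching
    # value per perspective.
    v1 = weights * vector1  # broadcast to (num_perspectives, hidden_dim)
    v2 = weights * vector2
    norms = np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1)
    return (v1 * v2).sum(axis=1) / np.maximum(norms, eps)
```

Identical inputs give 1.0 for every perspective and orthogonal inputs give 0.0, as expected from a cosine similarity.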
Thanks a lot, Matt. I agree with most of the comments and I am making changes now...
1. Remove zip file handling. 2. Use allennlp s3 for quora data download. 3. Move masked_max, masked_mean to nn.util 4. Various variable renaming, comments improvements, etc.
Pack updated with all those issues fixed according to the comments. Please help review. Thanks!
This looks great! I think it's way easier to read now, thanks for making those changes.
Do you have any trained models that we should upload? Have you put any thought into building a demo for this that we can add to demo.allennlp.org?
logger = logging.getLogger(__name__)  # pylint: disable=invalid-name


def parse_file_uri(uri: str):
I think you don't need this anymore. (Though, as I said before, I think it's totally reasonable to talk about adding this functionality more generally, in common.file_utils. It should just be in a separate PR.)
Oh, right. Let me remove this.
                 label: str = None) -> Instance:
    # pylint: disable=arguments-differ
    fields: Dict[str, Field] = {}
    tokenized_premise = self._tokenizer.tokenize(premise)
Oh, these are pre-tokenized! That's an important point, thanks for adding it to the class docstring. I think that means you should not tokenize the text here. The reason is that in a demo, you'll want to do tokenization (because someone typing text into a demo won't necessarily tokenize it nicely for you). So if/when we build a demo for this, we'll want the Predictor to have a tokenizer and pass pre-tokenized text into here. So I'd recommend having premise: List[str] and hypothesis: List[str], removing the tokenizer, and just calling row[1].split() and row[2].split() above. Does that make sense?
Sure, it makes sense. But since text_to_instance is an overridden method, will the change of types cause any side effects? How about we do this in the next PR, where we add the predictor and demo?
Also, the raw Quora data is actually not pre-tokenized; only this particular train/validate/test split is pre-tokenized. At some point I will want to use the same split with the non-tokenized version, to see if better tokenization could further improve the performance.
Oh, ok, that's a good point. And yeah, it's fine to worry about this in a subsequent PR. It's just something to be careful about when this actually goes into a demo.
Oh, and for the types, we're already ignoring types on these methods, because none of them agree with the super class. The point of that method existing on the super class is just to tell you that you really should implement a method with this name, though the particulars of how it looks are going to be specific to your data.
allennlp/modules/bimpm_matching.py
Outdated
----------
hidden_dim : ``int``, optional (default = 100)
    The hidden dimension of the representations
num_perspective : ``int``, optional (default = 20)
Nit: this should probably be num_perspectives, here, in the description on the next line, and as a variable name in the code.
Sure, I will rename it.
Currently I have a running experiment on this version of the code. It is not finished yet, but it's very likely I'll get 0.89+ accuracy on the Quora test data. After the PR is merged I can test and upload the model. And yes, I am willing to build a demo from it. Let me work on a PR maybe next week.
Awesome, thanks again for all of this work!
I put the trained model file here: https://s3-us-west-1.amazonaws.com/handsomezebra/public/bimpm-quora-2018.08.17.tar.gz. Running the following command under the latest master.
Got accuracy 89.3%.
Hi, thanks for your hard work. Can you update the docs on how to train your model on custom data?
@saurabhvyas
If you want to use your custom data, just change train_data_path and validation_data_path in bimpm.json, provided your data has the exact same format as the Quora data (a TSV file with four columns: label, sentence1, sentence2, and id, with words pre-tokenized). If your data has a different format, please write your own dataset reader. See allennlp's tutorials at https://github.com/allenai/allennlp/tree/master/tutorials for more information.
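A minimal sketch of reading one row in the format just described; the sample sentence pair is made up, and the columns follow the label, sentence1, sentence2, id layout from the comment above:

```python
import csv
import io

# One pre-tokenized row: label, sentence1, sentence2, id, tab-separated.
sample = "1\tWhat is machine learning ?\tWhat does machine learning mean ?\t1234\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
for row in reader:
    label = row[0]
    premise = row[1].split()      # already tokenized, so splitting suffices
    hypothesis = row[2].split()
```

If your data is not pre-tokenized, this is where a real tokenizer would replace the `.split()` calls.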
@handsomezebra yes, thank you! And sorry for the slow responses--I was out last week. If you're willing to build a demo, it would be great to have one for the Quora dataset. If you make a rough demo, we're able to pretty it up.
@handsomezebra thanks for your quick reply, I will follow the mentioned steps.
I moved the model into our S3.
* Adding Quora data reader and BiMPM model.
* Refactoring and renaming to pass pylint.
* Reduce batch size.
* Adding docs.
* Make title underline longer.
* Adding doc toctree.
* Various improvements to speed and memory.
* Improve comments.
* 1. Remove zip file handling. 2. Use allennlp s3 for quora data download. 3. Move masked_max, masked_mean to nn.util 4. Various variable renaming, comments improvements, etc.
* Remove unused url pattern match and change num_perspective to num_perspectives.
This PR includes the implementation of the BiMPM model (#1503) and a Quora paraphrase dataset reader. I am still working on some comment improvements and memory-usage optimization, but I wanted to send out the PR to collect feedback too.
Thanks!