
TrecQA for MP-CNN #77

Merged: 9 commits merged into castorini:master from tuzhucheng:mp-cnn-trecqa on Nov 4, 2017

@tuzhucheng (Member) commented Nov 4, 2017

Added the TrecQA dataset and support for training MP-CNN on TrecQA.

```
python main.py mpcnn.trecqa.model --dataset trecqa --epochs 5 --regularization 0.0005 --dropout 0.5 --eps 0.1
```

| Implementation and config  | map    | mrr    |
|----------------------------|--------|--------|
| Paper                      | 0.762  | 0.830  |
| PyTorch using above config | 0.7904 | 0.8223 |

@HTAustin @rosequ @Victor0118 please review.

@tuzhucheng (Author, Member) commented Nov 4, 2017

Also includes refactoring to modularize the code and remove redundant logging bootstrapping.

@Victor0118 (Member) left a comment

LGTM

@HTAustin (Member) left a comment

LGTM. Good job.

```python
test_cross_entropy_loss += F.cross_entropy(output, batch.label, size_average=False).data[0]

true_labels.extend(batch.label.data.cpu().numpy())
predictions.extend(output.data.exp()[:, 1].cpu().numpy())
```

@rosequ (Member) commented Nov 4, 2017

Shouldn't this be

```python
predictions.extend(output.data.exp()[:, 2].cpu().numpy())
```

since this value is used as the relevance score?

@tuzhucheng (Author, Member) commented Nov 4, 2017

output is a batch_size x 2 tensor. For each example, the first index (index 0) refers to the probability of being in class 0, and the second index (index 1) refers to the probability of being in class 1. So I believe this code is correct.
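For illustration, a minimal sketch of the tensor in question (the logits here are random placeholders; the model's final layer is assumed to apply log_softmax, which is what makes the exp() call above meaningful):

```python
import torch
import torch.nn.functional as F

# Hypothetical output for a batch of 3 examples and 2 classes.
logits = torch.randn(3, 2)
output = F.log_softmax(logits, dim=1)  # log class probabilities, shape (3, 2)

# exp() recovers probabilities; column 1 is P(class 1),
# used as the relevance score for ranking.
relevance_scores = output.exp()[:, 1]
```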

@rosequ (Member) commented Nov 4, 2017

@tuzhucheng isn't the index here `<unk>`?

@tuzhucheng (Author, Member) commented Nov 4, 2017

This is the model output, not the embedding. Output contains log class probabilities.

"""
Create a TRECQA dataset instance
"""
fields = [('id', self.ID_FIELD), ('a', self.TEXT_FIELD), ('b', self.TEXT_FIELD), ('ext_feats', self.EXT_FEATS_FIELD), ('label', self.LABEL_FIELD)]

@rosequ (Member) commented Nov 4, 2017

Although the files are called a.toks and b.toks, I think we should call these fields questions and answers.

@tuzhucheng (Author, Member) commented Nov 4, 2017

They are not necessarily a question and an answer. MP-CNN can be used for semantic text similarity between two sentences in general. I guess I can call them sentence_1 and sentence_2 to be clearer than just a and b.
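For illustration, the renamed field list might look like this (sentence_1 and sentence_2 are only the names proposed above, not what the code currently uses):

```python
fields = [('id', self.ID_FIELD),
          ('sentence_1', self.TEXT_FIELD),
          ('sentence_2', self.TEXT_FIELD),
          ('ext_feats', self.EXT_FEATS_FIELD),
          ('label', self.LABEL_FIELD)]
```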

"""
:param path: directory containing train, test, dev files
:param vectors_name: name of word vectors file
:param vectors_cache: path to word vectors file

@rosequ (Member) commented Nov 4, 2017

Do you mean to say path to word vector cache file?

```python
:return:
"""
if vectors is None:
    vectors = Vectors(name=vectors_name, cache=vectors_cache, unk_init=unk_init)
```

@rosequ (Member) commented Nov 4, 2017

The vector cache file is generated from the original SM code (https://github.com/aseveryn/deep-qa), and this dependency can be completely discarded. Check this

@tuzhucheng (Author, Member) commented Nov 4, 2017

MP-CNN is independent of SM-Model, so I think it can generate its own cache. Doing it this way will also allow me to experiment with custom word vectors, e.g. GloVe vectors of other dimensions, word2vec trained on Google News, fastText, etc.

I'm not quite sure which dependency you are referring to. I think the current approach is very flexible and doesn't tie us to any particular variant of word vectors.

@rosequ (Member) commented Nov 4, 2017

I thought the vector cache is a dependency here, no? Or is it generated by one of your scripts?

@tuzhucheng (Author, Member) commented Nov 4, 2017

Nope, torchtext is awesome! It generates this automatically on the first run if it is not there :) Nothing needs to be downloaded, all we need is torchtext.
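A minimal sketch of the caching behavior being described, using the torchtext Vectors class from the snippet above (the file name is a placeholder, and TEXT_FIELD / train_dataset stand in for the surrounding code):

```python
from torchtext.vocab import Vectors

# On the first run torchtext parses the raw word-vector file and writes
# a serialized .pt cache into `cache`; later runs load the cache directly.
vectors = Vectors(name='glove.840B.300d.txt', cache='.vector_cache')

# Hypothetical usage: attach the vectors when building the vocabulary.
TEXT_FIELD.build_vocab(train_dataset, vectors=vectors)
```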

@rosequ (Member) commented Nov 4, 2017

oh perfect!

```python
left_out_val_labels = []

for batch_idx, batch in enumerate(self.train_loader):
    if batch_idx >= start_val_batch:
```

@rosequ (Member) commented Nov 4, 2017

Why are you doing this? Add a comment.

@tuzhucheng (Author, Member) commented Nov 4, 2017

Yep, sounds good :)
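For context, a commented sketch of what the loop above appears to be doing; the split ratio is a placeholder, and the intent (hold out the tail of the training batches for validation) is a reading of the snippet, not something the thread confirms:

```python
num_batches = len(self.train_loader)
start_val_batch = int(num_batches * 0.9)  # hypothetical 90/10 train/val split
left_out_val_labels = []

for batch_idx, batch in enumerate(self.train_loader):
    if batch_idx >= start_val_batch:
        # Past the cutoff: collect labels for validation instead of training.
        left_out_val_labels.extend(batch.label.data.cpu().numpy())
        continue
    # ... normal training step on `batch` ...
```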

```python
NAME = 'trecqa'
NUM_CLASSES = 2
ID_FIELD = Field(sequential=False, tensor_type=torch.FloatTensor, use_vocab=False, batch_first=True)
TEXT_FIELD = Field(batch_first=True, tokenize=lambda x: x)  # tokenizer is identity since we already tokenized it to compute external features
```

@rosequ (Member) commented Nov 4, 2017

A better way of doing this is:

```python
TEXT_FIELD = Field(batch_first=True, sequential=False)
```

Refer to the torchtext docs.

@tuzhucheng (Author, Member) commented Nov 4, 2017

You are right in that this is already tokenized, so no tokenization is required.

> sequential: Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.

When I tried changing this to sequential=False, I got a TypeError: unhashable type: 'list' on the last line of this block from the torchtext source:

```python
for data in sources:
    for x in data:
        if not self.sequential:
            x = [x]
        counter.update(x)
```

As you can see, if the data is not declared sequential, torchtext wraps each example in another list. I already have a list, so this would create a list of lists, which is incorrect. This is sequential data (a list of tokens), so I think what I did originally makes sense; it's just that the sequence data is already tokenized.

@rosequ (Member) commented Nov 4, 2017

I'm sorry, you're right. This was not the case with the previous version of torchtext.

```python
example = Example.fromlist([pair_id, l1, l2, ext_feats, label], fields)
examples.append(example)

map(lambda f: f.close(), [f1, f2, label_file])
```

@rosequ (Member) commented Nov 4, 2017

Could you process the files within a `with` block?

@tuzhucheng (Author, Member) commented Nov 4, 2017

Sounds good :)
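A sketch of the suggested change (the paths and per-line parsing are placeholders; note that under Python 3 the lazy map(...) call above never actually runs f.close(), which the with block sidesteps):

```python
# Open all three files in one with-statement so they are closed even if
# parsing raises; replaces the map(lambda f: f.close(), ...) pattern.
with open(a_path) as f1, open(b_path) as f2, open(label_path) as label_file:
    for l1, l2, label in zip(f1, f2, label_file):
        # ... build ext_feats, then Example.fromlist(...) as in the snippet above ...
        pass
```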


## TrecQA Dataset

To run MP-CNN on (Raw) TrecQA, you first need to run the `get_trec_eval.sh` script in `utils` under the repo root.

@rosequ (Member) commented Nov 4, 2017

I'd suggest writing the actual command so that a user can copy-paste it directly into their terminal. Also add the commands to uncompress and make trec_eval.

@tuzhucheng (Author, Member) commented Nov 4, 2017

Yep, my sentence is also incoherent :)

@tuzhucheng (Author, Member) left a comment

Will work on changes suggested by @rosequ

"""
Create a TRECQA dataset instance
"""
fields = [('id', self.ID_FIELD), ('a', self.TEXT_FIELD), ('b', self.TEXT_FIELD), ('ext_feats', self.EXT_FEATS_FIELD), ('label', self.LABEL_FIELD)]

This comment has been minimized.

Copy link
@tuzhucheng

tuzhucheng Nov 4, 2017

Author Member

They are not necessarily question and answer. MP-CNN can be used for semantic text similarity between two sentences in general. I guess I can call them sentence_1 and sentence_2 to be more clearer than just a and b.

example = Example.fromlist([pair_id, l1, l2, ext_feats, label], fields)
examples.append(example)

map(lambda f: f.close(), [f1, f2, label_file])

This comment has been minimized.

Copy link
@tuzhucheng

tuzhucheng Nov 4, 2017

Author Member

Sounds good :)

:return:
"""
if vectors is None:
vectors = Vectors(name=vectors_name, cache=vectors_cache, unk_init=unk_init)

This comment has been minimized.

Copy link
@tuzhucheng

tuzhucheng Nov 4, 2017

Author Member

MP-CNN is independent of SM-Model so I think it can generate its own cache. Doing it this way will also allow me to experiment with custom word vectors, e.g. GloVe vectors of other dimensions, word2vec trained on Google News, fastText etc..

I'm not quite sure which dependency you are referring to. I think the current approach is very flexible and doesn't tie us to any particular variant of word vectors.


## TrecQA Dataset

To run MP-CNN on (Raw) TrecQA, you first need to the `get_trec_eval.sh` script in `utils` under the repo root.

This comment has been minimized.

Copy link
@tuzhucheng

tuzhucheng Nov 4, 2017

Author Member

Yep, my sentence is also incoherent :)

test_cross_entropy_loss += F.cross_entropy(output, batch.label, size_average=False).data[0]

true_labels.extend(batch.label.data.cpu().numpy())
predictions.extend(output.data.exp()[:, 1].cpu().numpy())

This comment has been minimized.

Copy link
@tuzhucheng

tuzhucheng Nov 4, 2017

Author Member

output is a batch_size x 2 tensor. For each example, the first index (index 0) refers to the probability of being in class 0, in second index (index 1) refers to the probability of being in class 1. So I believe this code is correct.

left_out_val_labels = []

for batch_idx, batch in enumerate(self.train_loader):
if batch_idx >= start_val_batch:

This comment has been minimized.

Copy link
@tuzhucheng

tuzhucheng Nov 4, 2017

Author Member

Yep, sounds good :)

```python
f1.write(qrel_template.format(qid=qid, docno=docno, rel=actual))
f2.write(results_template.format(qid=qid, docno=docno, sim=predicted))

trec_out = subprocess.check_output(['../utils/trec_eval-9.0.5/trec_eval', '-m', 'map', '-m', 'recip_rank', qrel_fname, results_fname])
```

@tuzhucheng (Author, Member) commented Nov 4, 2017

I will make this path absolute so it works regardless of the cwd; will address in the next commit to this PR.
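One way to do that, sketched under the assumption that this code lives in a module file inside the repo (qrel_fname and results_fname come from the snippet above):

```python
import os
import subprocess

# Resolve trec_eval relative to this source file rather than the cwd,
# so the subprocess call works no matter where the script is launched from.
base_dir = os.path.dirname(os.path.abspath(__file__))
trec_eval_bin = os.path.join(base_dir, '..', 'utils', 'trec_eval-9.0.5', 'trec_eval')
trec_out = subprocess.check_output(
    [trec_eval_bin, '-m', 'map', '-m', 'recip_rank', qrel_fname, results_fname])
```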

@rosequ (Member) approved these changes Nov 4, 2017

LGTM

@tuzhucheng merged commit 4470983 into castorini:master on Nov 4, 2017

@tuzhucheng deleted the tuzhucheng:mp-cnn-trecqa branch on Nov 4, 2017
