[WIP] DocBERT Colab notebook #58

Open
wants to merge 4 commits into master

Conversation

mrkarezina

@mrkarezina mrkarezina commented May 18, 2020

I was working on implementing a Colab notebook to replicate some of the results in the DocBERT paper. The notebook still needs to be polished; I'll add it in a future commit.

One of the issues when running LogisticRegression was the default vocab size of 36308, which causes a size mismatch in the LogisticRegression model when using the BagOfWordsTrainer. Updating the vocab size to 36311 solves the issue (a minimal sketch reproducing the mismatch follows the traceback).

Full traceback:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/hedwig/models/lr/__main__.py", line 85, in <module>
    trainer.train()
  File "/content/hedwig/common/trainers/bow_trainer.py", line 69, in train
    self.train_epoch(train_dataloader)
  File "/content/hedwig/common/trainers/bow_trainer.py", line 43, in train_epoch
    logits = self.model(features)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/hedwig/models/lr/model.py", line 15, in forward
    logit = self.fc1(x)  # (batch, target_size)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1610, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch, m1: [32 x 36311], m2: [36308 x 90] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:283
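For context, a minimal sketch (outside the project code) that reproduces the same failure mode, with the sizes taken from the traceback above:

# Hedged sketch only: the linear layer is sized from the hard-coded VOCAB_SIZE
# (36308), while the fitted bag-of-words vectorizer produces 36311-dimensional
# feature vectors, so the forward pass fails.
import torch
import torch.nn as nn

fc1 = nn.Linear(36308, 90)           # in_features = VOCAB_SIZE, out_features = NUM_CLASSES
features = torch.randn(32, 36311)    # a batch of 32 vectors sized by the fitted vectorizer
logits = fc1(features)               # RuntimeError: size mismatch, [32 x 36311] vs [36308 x 90]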

It would also be helpful to have a command-line argument for loading the pretrained BERT model provided by Hugging Face instead of requiring locally saved weights.

Notebook-specific:
It's difficult to evaluate the rest of the models, such as BERT-large and the simple LSTM with word2vec, because Colab runs out of RAM (16 GB).

For the pretrained BERT models I used the uncased weights provided by Hugging Face. However, in two trials of BERT-base on Reuters the test F1 scores were 87.9 and 87.7, slightly lower than the 90.7 reported in the paper.

Split  Dev/Acc.  Dev/Pr.  Dev/Re.   Dev/F1   Dev/Loss
 TEST    0.8323   0.9264   0.8370   0.8794    15.9778
Split  Dev/Acc.  Dev/Pr.  Dev/Re.   Dev/F1   Dev/Loss
 TEST    0.8297   0.9284   0.8317   0.8774    16.5026

In terms of content for replicating the results, would LR, BERT-base on Reuters (x2), and BERT-base on IMDB be a good start? What other results would be good to train and include?

@mrkarezina mrkarezina mentioned this pull request May 20, 2020
@@ -6,7 +6,7 @@
 class ReutersProcessor(BagOfWordsProcessor):
     NAME = 'Reuters'
     NUM_CLASSES = 90
-    VOCAB_SIZE = 36308
+    VOCAB_SIZE = 36311
Member

This should be due to a mismatch in the scikit library version in your environment. In any case, rather than hard-coding the vocabulary size, would it be possible to infer it at run time?

Author

The vocab size is hard-coded for all the datasets. Should I add a method to calculate the vocab size in the BagOfWordsProcessor class that's extended for each dataset?
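For concreteness, a hedged sketch of what such a method could look like (the method name and the choice of sklearn vectorizer here are illustrative, not existing hedwig API):

# Illustrative only: infer the vocabulary size from the training split instead
# of relying on a hard-coded VOCAB_SIZE per dataset.
from sklearn.feature_extraction.text import CountVectorizer

class BagOfWordsProcessor:
    @staticmethod
    def compute_vocab_size(train_examples, vectorizer=None):
        vectorizer = vectorizer or CountVectorizer()
        vectorizer.fit([x.text for x in train_examples])
        return len(vectorizer.vocabulary_)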

Member

Yup that would work

@@ -7,6 +7,7 @@ def get_args():
     parser = models.args.get_args()
 
     parser.add_argument('--model', default=None, type=str, required=True)
+    parser.add_argument('--hf-checkpoint', default=False, type=bool)
Member

Rather than adding a new flag, can't we just add the pre-trained model weights to https://git.uwaterloo.ca/jimmylin/hedwig-data? That way everyone would be able to get the models running out of the box

Author

Yes, that would definitely be better! What were the pre-trained models? Were they the checkpoints from Hugging Face or separately trained? I tried using bert-base-uncased and got results comparable to the ones in the paper.

Member

Yup, they are the checkpoints from Hugging Face. Do you have access to any of the DSG servers? We already have the exact pre-trained models we used for our experiments.

Author

@mrkarezina mrkarezina May 22, 2020

No, I don't have access unfortunately. Is there any way I could download them into the notebook, or should I just wait until they are added to the hedwig-data repo so they are included when cloning?

Member

I've added the pre-trained weights to our data repo: https://git.uwaterloo.ca/jimmylin/hedwig-data

@@ -71,8 +71,9 @@ def evaluate_split(model, processor, tokenizer, args, split='dev'):

args.is_hierarchical = False
processor = dataset_map[args.dataset]()
pretrained_vocab_path = PRETRAINED_VOCAB_ARCHIVE_MAP[args.model]
tokenizer = BertTokenizer.from_pretrained(pretrained_vocab_path)
if not args.hf_checkpoint:
Member

I am not sure I understand why we need this condition here

Author

The --hf-checkpoint flag just allowed specifying one of the Hugging Face model checkpoints, which is loaded automatically, making it easier to start training than having to download the weights and move them into the hedwig-data directory.
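Roughly, the flag enabled a branch like the following (the names mirror the diff above; the exact body is an illustration, not the final code in this PR):

if not args.hf_checkpoint:
    # default: resolve vocab/weights from the locally saved hedwig-data files
    pretrained_vocab_path = PRETRAINED_VOCAB_ARCHIVE_MAP[args.model]
    tokenizer = BertTokenizer.from_pretrained(pretrained_vocab_path)
else:
    # let Hugging Face resolve a hub checkpoint name such as 'bert-base-uncased'
    tokenizer = BertTokenizer.from_pretrained(args.model)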

However, if the weights are included in hedwig-data, then there's no need for the flag.

Member

We didn't want to have the option to download a checkpoint, in order to protect against any changes to these checkpoints by Hugging Face, which would invalidate our results.

Author

Oh makes sense.

@achyudh achyudh mentioned this pull request May 22, 2020
@mrkarezina
Author

Only the logistic regression model seemed to be using the VOCAB_SIZE property from the BOW processor. I rearranged the script to calculate the proper vocab size before initializing the LogisticRegression model.

I also removed the Hugging Face flag, although it would be nice if the weights were already in hedwig-data. Should I rename this PR and make a separate one for the notebook?

@@ -136,5 +136,4 @@ def evaluate_split(model, processor, tokenizer, args, split='dev'):
 model = model.to(device)
 
 evaluate_split(model, processor, tokenizer, args, split='dev')
-evaluate_split(model, processor, tokenizer, args, split='test')
-
+evaluate_split(model, processor, tokenizer, args, split='test')
Member

Can you please remove this change? It's affecting an unrelated file. Also, as a convention, we have newlines at the end of files.

@@ -71,6 +70,12 @@ def evaluate_split(model, vectorizer, processor, args, split='dev'):
 save_path = os.path.join(args.save_path, dataset_map[args.dataset].NAME)
 os.makedirs(save_path, exist_ok=True)
 
+if train_examples:
+    train_features = vectorizer.fit_transform([x.text for x in train_examples])
Member

Hmm, but this means that we would run fit_transform twice, right? I am not sure if you ran this on our larger datasets (IMDB or Yelp), but it's a pretty memory- and compute-intensive operation.

Author

Yes, the same fit_transform would also be applied in trainers/bow_trainer.py. It's just tricky because you need the vocab size to initialize the LR model, and the BagOfWordsTrainer requires that model.

Would it be OK to pass the train_features to the BagOfWordsTrainer? They're only used by the LR model anyway. It just feels clunky, since the vectorizer is only passed to BagOfWordsTrainer to compute the train/eval features.

Not sure what's a better way to do this.
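For concreteness, a hedged sketch of the idea (the constructor signatures here are assumptions, not hedwig's actual API): fit the vectorizer once in __main__, size the LR model from the fitted vocabulary, and hand the precomputed features to the trainer so fit_transform isn't run twice.

# Sketch only: fit once, reuse everywhere.
train_features = vectorizer.fit_transform([x.text for x in train_examples])
vocab_size = len(vectorizer.vocabulary_)

model = LogisticRegression(vocab_size, dataset_map[args.dataset].NUM_CLASSES)   # hypothetical signature
trainer = BagOfWordsTrainer(model, vectorizer, processor, args,
                            train_features=train_features)                      # reuse instead of refitting
trainer.train()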

@achyudh
Member

achyudh commented May 29, 2020

I also removed the Hugging Face flag, although it would be nice if the weights were already in hedwig-data. Should I rename this PR and make a separate one for the notebook?

Sorry, I don't have much context on the notebook, but it's always preferable to keep PRs atomic, i.e., one pull request with one commit addressing one concern.
