[WIP] DocBERT Colab notebook #58
base: master
Conversation
@@ -6,7 +6,7 @@
 class ReutersProcessor(BagOfWordsProcessor):
     NAME = 'Reuters'
     NUM_CLASSES = 90
-    VOCAB_SIZE = 36308
+    VOCAB_SIZE = 36311
This is likely due to a mismatch in the scikit-learn version in your environment. In any case, rather than hard-coding the vocabulary size, would it be possible to infer it at run time?
The vocab size is hard-coded for all the datasets. Should I add a method to calculate the vocab size in the BagOfWordsProcessor class that's extended for each dataset?
Yup that would work
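For concreteness, a minimal sketch of such a method, assuming a scikit-learn vectorizer and examples that expose a .text attribute as elsewhere in this PR; the method name compute_vocab_size and the use of TfidfVectorizer are assumptions, not existing code:

from sklearn.feature_extraction.text import TfidfVectorizer

class BagOfWordsProcessor:
    def compute_vocab_size(self, train_examples, vectorizer=None):
        # Fit once on the training texts and read the vocabulary size off the
        # fitted vectorizer instead of hard-coding it per dataset.
        vectorizer = vectorizer or TfidfVectorizer()
        vectorizer.fit([x.text for x in train_examples])
        return len(vectorizer.vocabulary_)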
models/bert/args.py
Outdated
@@ -7,6 +7,7 @@ def get_args():
     parser = models.args.get_args()

     parser.add_argument('--model', default=None, type=str, required=True)
+    parser.add_argument('--hf-checkpoint', default=False, type=bool)
Rather than adding a new flag, can't we just add the pre-trained model weights to https://git.uwaterloo.ca/jimmylin/hedwig-data? That way everyone would be able to get the models running out of the box
Yes, that would definitely be better! What were the pre-trained models? Were they the checkpoints from Hugging Face or separately trained? I tried using bert-base-uncased and got results comparable to the ones in the paper.
Yup, they are the checkpoints from hugging face. Do you have access to any of the DSG servers? We already have the exact pre-trained models we used for our experiments.
No, I don't have access unfortunately. Is there any way I could download them into the notebook, or should I just wait until they are added to the hedwig-data repo so they are included when cloning?
I've added the pre-trained weights to our data repo: https://git.uwaterloo.ca/jimmylin/hedwig-data
models/bert/__main__.py
Outdated
@@ -71,8 +71,9 @@ def evaluate_split(model, processor, tokenizer, args, split='dev'):

     args.is_hierarchical = False
     processor = dataset_map[args.dataset]()
     pretrained_vocab_path = PRETRAINED_VOCAB_ARCHIVE_MAP[args.model]
     tokenizer = BertTokenizer.from_pretrained(pretrained_vocab_path)
+    if not args.hf_checkpoint:
I am not sure I understand why we need this condition here
The --hf-checkpoint flag just allowed specifying one of the Hugging Face model checkpoints, which gets loaded automatically, making it easier to start training than having to download and move the weights into the hedwig-data directory. However, if the weights are included in hedwig-data, then there's no need for the flag.
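For reference, a rough sketch of the branching the flag enabled, reusing names already present in models/bert/__main__.py (BertTokenizer, PRETRAINED_VOCAB_ARCHIVE_MAP, args); the else branch and the example checkpoint name are illustrative, not the exact PR code:

if not args.hf_checkpoint:
    # Existing behaviour: resolve the tokenizer from the locally saved
    # vocab/weights under hedwig-data.
    pretrained_vocab_path = PRETRAINED_VOCAB_ARCHIVE_MAP[args.model]
    tokenizer = BertTokenizer.from_pretrained(pretrained_vocab_path)
else:
    # With --hf-checkpoint: treat args.model as a Hugging Face checkpoint
    # name (e.g. 'bert-base-uncased') and let the library fetch it.
    tokenizer = BertTokenizer.from_pretrained(args.model)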
We didn't want to have the option to download a checkpoint, in order to protect against any changes to these checkpoints by Hugging Face, which would invalidate our results.
Oh makes sense.
Only the logistic regression model seemed to be using the VOCAB_SIZE property from the BOW processor. I rearranged the script to calculate the proper vocab size before initializing the LR model. I also removed the hugging-face flag, although it would be nice if the weights were already in hedwig-data.
@@ -136,5 +136,4 @@ def evaluate_split(model, processor, tokenizer, args, split='dev'):
     model = model.to(device)

-    evaluate_split(model, processor, tokenizer, args, split='dev')
-    evaluate_split(model, processor, tokenizer, args, split='test')

+    evaluate_split(model, processor, tokenizer, args, split='test')
Can you please remove this change? It's affecting an unrelated file. Also, as a convention, we have newlines at the end of files.
@@ -71,6 +70,12 @@ def evaluate_split(model, vectorizer, processor, args, split='dev'):
     save_path = os.path.join(args.save_path, dataset_map[args.dataset].NAME)
     os.makedirs(save_path, exist_ok=True)

+    if train_examples:
+        train_features = vectorizer.fit_transform([x.text for x in train_examples])
Hmm, but this means that we would fit_transform twice, right? I'm not sure if you ran this on our larger datasets (IMDB or Yelp), but it's a pretty memory- and compute-intensive operation.
Yes, the same fit_transform would also be applied in trainers/bow_trainer.py. It's just tricky because you need the vocab size to initialize the LR model, and the BagOfWordsTrainer requires that model. Would it be OK to pass the train_features to the BagOfWordsTrainer? They're only used by the LR model anyway. It just feels clunky, since the vectorizer is only passed to BagOfWordsTrainer to compute the train/eval features. I'm not sure what a better way to do this would be.
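One possible arrangement, sketched with hypothetical names (build_train_features, the train_features keyword, and the constructor signatures are assumptions, not the repo's current API): fit the vectorizer once in the LR entry point, derive the vocab size from the fitted feature matrix, and hand the precomputed features to BagOfWordsTrainer so fit_transform isn't run twice on the large datasets:

from sklearn.feature_extraction.text import TfidfVectorizer

def build_train_features(train_examples, vectorizer=None):
    # Fit a single time on the training split; the vocab size needed to
    # initialize the LR model is just the feature matrix width.
    vectorizer = vectorizer or TfidfVectorizer()
    train_features = vectorizer.fit_transform([x.text for x in train_examples])
    return vectorizer, train_features, train_features.shape[1]

# Hypothetical wiring in the LR __main__:
# vectorizer, train_features, vocab_size = build_train_features(train_examples)
# model = LogisticRegression(vocab_size, dataset_map[args.dataset].NUM_CLASSES)  # assumed signature
# trainer = BagOfWordsTrainer(model, vectorizer, processor, args,
#                             train_features=train_features)                     # assumed kwarg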
Sorry, I don't have much context on the notebook, but it's always preferable to keep PRs atomic, i.e., one pull request with one commit addressing one concern.
I was working on implementing a Colab notebook to replicate some of the results in the DocBERT paper. The notebook still needs to be polished and I'll add it in a future commit.
One of the issues when running LogisticRegression was the default vocab size of 36308, which results in a mismatch in LogisticRegression when using BagOfWordsTrainer. Updating the vocab size to 36311 solves the issue.
It would also be helpful to have a command line argument for loading the pretrained BERT model provided by Hugging Face instead of having to have locally saved weights.
Notebook specific:
It's difficult to evaluate the rest of the models, like BERT-large and the simple LSTM with word2vec, because Colab runs out of RAM (16 GB).
For the pretrained BERT models I used the uncased weights provided by HF. However, in 2 trials of BERT-base on Reuters the test F1 scores were 87.9 and 87.7, slightly lower than the 90.7 reported in the paper.
In terms of content for replicating the results, would LR and BERT-base on Reuters (x2), plus BERT-base on IMDB, be a good start? What other results would be good to train and include?