
Miscellaneous updates and enhancements #38

Merged: 15 commits into master on Jul 31, 2019

Conversation

kernelmachine (Contributor)

This PR contains miscellaneous updates and enhancements to the library, including:

  1. Addition of "KLD clipping", which empirically prevents KLD explosion by clipping the KLD to a fixed range, similar to gradient clipping; see the sketch after this list. This allows the use of larger learning rates, and thus better topic quality, especially on larger datasets.
  2. Enhancements to preprocessing: tqdm progress bars when fitting/transforming the count vectorizer, and a faster bgfreq calculation.
  3. pylint checks and prettifying.
  4. Addition of BERT/ELMo fields to the classifier jsonnet for quick comparison.
  5. Fixing the application of batchnorm to the reconstruction (which has consistently helped topic quality), and removing batchnorm from the variational parameters (which consistently degrades topic quality).
  6. Addition of subsampling of the training data when pretraining VAMPIRE, useful for debugging.
  7. Addition of serialization-directory saving to the Vocabulary's "from_files" method, which is necessary when recovering a trained VAMPIRE model for continued training.
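
For concreteness, here is a minimal sketch of the KLD clipping in item 1 (an illustrative PyTorch snippet; the function name, defaults, and clip range are assumptions, not this repo's actual code):

    import torch

    def clipped_kld(q_mean: torch.Tensor, q_logvar: torch.Tensor,
                    clip_min: float = 0.0, clip_max: float = 10.0) -> torch.Tensor:
        # Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian,
        # summed over latent dimensions.
        kld = -0.5 * torch.sum(1 + q_logvar - q_mean.pow(2) - q_logvar.exp(), dim=-1)
        # Clamp the per-example KLD to a fixed range, analogous to gradient
        # clipping, so one bad batch cannot blow up the loss at high learning rates.
        return torch.clamp(kld, min=clip_min, max=clip_max).mean()

    # Example: a batch of 32 latent means/log-variances of dimension 64.
    loss_kld = clipped_kld(torch.zeros(32, 64), torch.zeros(32, 64))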

"EMBEDDING_DROPOUT": 0.5,
"LEARNING_RATE": 0.004,
"DROPOUT": 0.5,
"ENCODER": "CNN",

dangitstam (Contributor)

The classifier environment will now default to CNN; was this intentional?

kernelmachine (Author)

This was a typo! Good catch.

    dev_count_vectorizer = CountVectorizer(stop_words='english', max_features=args.vocab_size, token_pattern=r'\b[^\d\W]{3,30}\b')
    reference_matrix = dev_count_vectorizer.fit_transform(tokenized_dev_examples)
    reference_vocabulary = dev_count_vectorizer.get_feature_names()
    # dev_count_vectorizer = CountVectorizer(stop_words='english', max_features=args.vocab_size, token_pattern=r'\b[^\d\W]{3,30}\b')

dangitstam (Contributor)

If these lines are no longer used, we should delete them.

kernelmachine (Author)

I actually made an update that consolidates the preprocess_data and make_reference_corpus scripts. By default, preprocess_data builds the reference corpus from the validation data, but you can supply an additional arg if you'd like to build a custom reference corpus. See the newest commits!
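
A minimal sketch of the consolidated behavior described above (the flag name --reference-corpus-path is hypothetical, not the script's actual argument):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--train-path", required=True)
    parser.add_argument("--dev-path", required=True)
    # Hypothetical flag: when omitted, the reference corpus defaults to the dev data.
    parser.add_argument("--reference-corpus-path", default=None)
    args = parser.parse_args()

    reference_path = args.reference_corpus_path or args.dev_path
    print(f"building reference corpus from {reference_path}")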

    reference_matrix = dev_count_vectorizer.fit_transform(tokenized_dev_examples)
    reference_vocabulary = dev_count_vectorizer.get_feature_names()
    # dev_count_vectorizer = CountVectorizer(stop_words='english', max_features=args.vocab_size, token_pattern=r'\b[^\d\W]{3,30}\b')
    # reference_matrix = dev_count_vectorizer.fit_transform(tokenized_dev_examples)

dangitstam (Contributor)

Same as above.

    @@ -87,7 +87,7 @@ def main():
         # generate background frequency
         print("generating background frequency...")
    -    bgfreq = dict(zip(count_vectorizer.get_feature_names(), master.toarray().sum(1) / args.vocab_size))
    +    bgfreq = dict(zip(count_vectorizer.get_feature_names(), [x[0] for x in np.array(master.sum(1)) / args.vocab_size]))

dangitstam (Contributor)

What I'm just now realizing is that bgfreq here also takes frequencies in the dev data into account: master is the result of stacking train on dev, and my understanding is that the word counts come from summing along the first dimension. Shouldn't bgfreq be built only from the training data?
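
A sketch of the train-only variant this comment proposes (self-contained toy example; variable names and the docs-by-vocab orientation are assumptions, and the vocab_size normalization mirrors the diff above):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # Toy stand-ins; in the real script these come from preprocess_data's arguments.
    tokenized_train_examples = ["the cat sat", "the dog sat", "a cat and a dog"]
    vocab_size = 10

    # Fit on training documents only (not the train+dev stack `master`),
    # so dev frequencies never leak into the background frequency.
    count_vectorizer = CountVectorizer(max_features=vocab_size)
    vectorized_train_examples = count_vectorizer.fit_transform(tokenized_train_examples)

    # Sum over documents (axis 0) to get one count per vocabulary token.
    token_counts = np.array(vectorized_train_examples.sum(0)).squeeze()
    bgfreq = dict(zip(count_vectorizer.get_feature_names(), token_counts / vocab_size))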

    @@ -29,7 +29,10 @@ def compute_background_log_frequency(vocab: Vocabulary, vocab_namespace: str, pr
         if token in ("@@UNKNOWN@@", "@@PADDING@@", '@@START@@', '@@END@@') or token not in precomputed_bg:
             log_term_frequency[i] = 1e-12
         elif token in precomputed_bg:
             log_term_frequency[i] = precomputed_bg[token]
    +        if precomputed_bg[token] == 0:
    +            log_term_frequency[i] = 1e-12

dangitstam (Contributor)

This is a great idea, good catch.

kernelmachine (Author)

Many of my NaN issues during training were caused by this!
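
To see why the zero check matters: a zero background frequency becomes -inf under the log, and that -inf then propagates NaNs through the loss and gradients. A quick illustration:

    import torch

    freqs = torch.tensor([0.0, 1e-12])
    print(torch.log(freqs))  # tensor([-inf, -27.6310]); the -inf poisons downstream sums and gradients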

dangitstam (Contributor)

There are APPLY_BATCHNORM and APPLY_BATCHNORM_1 entries in various JSON files within the search_spaces directory. Given your changes, are these still necessary?

kernelmachine (Author)

Yes, good catch. We can kill these environment variables.
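
For context on why those variables can go away, a sketch of the batchnorm placement from item 5 of the PR description (module names and sizes are illustrative):

    import torch
    import torch.nn as nn

    vocab_size = 10_000

    # Per item 5: batchnorm on the reconstruction (vocab-sized logits)
    # consistently helps topic quality, so it is kept...
    bn_reconstruction = nn.BatchNorm1d(vocab_size)
    logits = torch.randn(32, vocab_size)      # a batch of reconstructed logits
    normalized = bn_reconstruction(logits)

    # ...while batchnorm on the variational parameters (mean / log-variance)
    # consistently hurts, so nothing analogous is applied to them. With the
    # choice hard-coded, the APPLY_BATCHNORM* search-space variables are unused.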

    tokenized_examples = []
    with tqdm(open(data_path, "r"), desc=f"loading {data_path}") as f:
        for line in f:
            example = json.loads(line)
            if data_path.endswith(".jsonl") or data_path.endswith(".json"):

dangitstam (Contributor)

This makes the assumption that if it's a .jsonl file, each JSON object contains a 'text' field. I'm okay with this if we mention it in the README; otherwise it may confuse people.

kernelmachine (Author)

Yeah, let's be explicit about that in the README.
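
A minimal illustration of the .jsonl contract being discussed (the file name is hypothetical; the "text" field is the assumption the README would document):

    import json

    # Each line of a .jsonl input is assumed to be a JSON object carrying
    # the document under a "text" key, e.g.
    # {"text": "the quick brown fox jumped over the lazy dog"}
    with open("train.jsonl") as f:
        for line in f:
            example = json.loads(line)
            text = example["text"]  # a KeyError here means the input violates the contract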

kernelmachine merged commit e3795dd into master on Jul 31, 2019.