Miscellaneous updates and enhancements #38
Conversation
environments/environments.py
Outdated
"EMBEDDING_DROPOUT": 0.5, | ||
"LEARNING_RATE": 0.004, | ||
"DROPOUT": 0.5, | ||
"ENCODER": "CNN", |
Classifier environment by default will now be CNN; was this intentional?
This was a typo! Good catch
scripts/preprocess_data.py
Outdated
dev_count_vectorizer = CountVectorizer(stop_words='english', max_features=args.vocab_size, token_pattern=r'\b[^\d\W]{3,30}\b')
reference_matrix = dev_count_vectorizer.fit_transform(tokenized_dev_examples)
reference_vocabulary = dev_count_vectorizer.get_feature_names()
# dev_count_vectorizer = CountVectorizer(stop_words='english', max_features=args.vocab_size, token_pattern=r'\b[^\d\W]{3,30}\b')
If these lines are no longer used, we should delete them.
I actually made an update to consolidate the preprocess_data and make_reference_corpus scripts. By default, preprocess_data will make the reference corpus from the validation data, but you can supply an additional arg if you'd like to make a custom reference corpus. See the newest commits!
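For reference, here is a minimal sketch of how such an optional argument could look in scripts/preprocess_data.py. The flag name --reference-corpus-path, its default, and the surrounding parser setup are assumptions for illustration, not necessarily what the commits use:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--vocab-size", type=int, default=10000)
# Hypothetical flag: when provided, build the reference corpus from this
# file instead of the validation data (which the updated script uses by
# default, per the comment above).
parser.add_argument("--reference-corpus-path", type=str, default=None,
                    help="Optional path to data for a custom reference corpus; "
                         "defaults to building it from the validation set.")
args = parser.parse_args()

if args.reference_corpus_path is None:
    print("building reference corpus from the validation data")
else:
    print(f"building reference corpus from {args.reference_corpus_path}")
```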
scripts/preprocess_data.py
Outdated
reference_matrix = dev_count_vectorizer.fit_transform(tokenized_dev_examples)
reference_vocabulary = dev_count_vectorizer.get_feature_names()
# dev_count_vectorizer = CountVectorizer(stop_words='english', max_features=args.vocab_size, token_pattern=r'\b[^\d\W]{3,30}\b')
# reference_matrix = dev_count_vectorizer.fit_transform(tokenized_dev_examples)
Same as above.
scripts/preprocess_data.py
Outdated
@@ -87,7 +87,7 @@ def main():

    # generate background frequency
    print("generating background frequency...")
    bgfreq = dict(zip(count_vectorizer.get_feature_names(), master.toarray().sum(1) / args.vocab_size))
    bgfreq = dict(zip(count_vectorizer.get_feature_names(), [x[0] for x in np.array(master.sum(1)) / args.vocab_size]))
What I'm just now realizing is that bgfreq here also takes the dev-data frequencies into account. My understanding is that master is the result of stacking train on dev, and the word counts come from summing along the first dimension. Shouldn't bgfreq be built only from the training data?
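A minimal sketch of the alternative being suggested, fitting the vectorizer on the training split alone; the function and variable names are illustrative, and the division by vocab_size simply mirrors the existing script's normalization rather than being a recommendation:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def build_bgfreq_from_train(tokenized_train_examples, vocab_size):
    # Fit on the training examples only, so dev counts never leak into
    # the background frequencies.
    count_vectorizer = CountVectorizer(stop_words='english',
                                       max_features=vocab_size,
                                       token_pattern=r'\b[^\d\W]{3,30}\b')
    train_matrix = count_vectorizer.fit_transform(tokenized_train_examples)
    # Sum over documents (rows) to get one total count per vocabulary word.
    word_counts = np.asarray(train_matrix.sum(axis=0)).squeeze()
    # get_feature_names() matches the script's sklearn usage; newer
    # sklearn versions rename it to get_feature_names_out().
    return dict(zip(count_vectorizer.get_feature_names(),
                    word_counts / vocab_size))
```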
@@ -29,7 +29,10 @@ def compute_background_log_frequency(vocab: Vocabulary, vocab_namespace: str, pr
    if token in ("@@UNKNOWN@@", "@@PADDING@@", '@@START@@', '@@END@@') or token not in precomputed_bg:
        log_term_frequency[i] = 1e-12
    elif token in precomputed_bg:
        log_term_frequency[i] = precomputed_bg[token]
        if precomputed_bg[token] == 0:
            log_term_frequency[i] = 1e-12
This is a great idea, good catch.
Many of my NaN issues during training were caused by this!
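For anyone hitting the same thing, a rough sketch of why the clamp matters. This simplifies the real compute_background_log_frequency (which works from an AllenNLP Vocabulary) down to a plain dict and token list:

```python
import torch

def background_log_frequency(precomputed_bg, tokens):
    # Sketch only: clamp missing or zero background frequencies to a tiny
    # epsilon so torch.log never produces -inf, which otherwise turns into
    # NaNs once it flows through the loss.
    freqs = torch.zeros(len(tokens))
    for i, token in enumerate(tokens):
        freq = precomputed_bg.get(token, 0.0)
        if token in ("@@UNKNOWN@@", "@@PADDING@@", "@@START@@", "@@END@@") or freq == 0:
            freq = 1e-12
        freqs[i] = freq
    return torch.log(freqs)
```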
There is
Yes, good catch. We can kill these environment variables.
tokenized_examples = []
with tqdm(open(data_path, "r"), desc=f"loading {data_path}") as f:
    for line in f:
        example = json.loads(line)
        if data_path.endswith(".jsonl") or data_path.endswith(".json"):
This assumes that if the input is a .jsonl file, each JSON object contains a 'text' field. I am okay with this if we mention it in the README; otherwise it may confuse people.
yeah, let's be explicit about that in the README.
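As a possible README addition, a small sketch of the assumption in code; the file name and the extra "label" key are purely illustrative:

```python
import json

# Each line of a .jsonl / .json input is expected to be a JSON object
# with a "text" field, e.g. {"text": "a document ...", "label": "1"}.
with open("train.jsonl", "r") as f:  # hypothetical file name
    for line in f:
        example = json.loads(line)
        text = example["text"]  # a KeyError here means the field is missing
```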
This PR contains miscellaneous updates and enhancements to the library, including: