Miscellaneous updates and enhancements #38
Conversation
environments/environments.py
Outdated
"EMBEDDING_DROPOUT": 0.5, | ||
"LEARNING_RATE": 0.004, | ||
"DROPOUT": 0.5, | ||
"ENCODER": "CNN", |
Classifier environment by default will now be CNN; was this intentional?
This was a typo! Good catch
scripts/preprocess_data.py
Outdated
dev_count_vectorizer = CountVectorizer(stop_words='english', max_features=args.vocab_size, token_pattern=r'\b[^\d\W]{3,30}\b')
reference_matrix = dev_count_vectorizer.fit_transform(tokenized_dev_examples)
reference_vocabulary = dev_count_vectorizer.get_feature_names()
# dev_count_vectorizer = CountVectorizer(stop_words='english', max_features=args.vocab_size, token_pattern=r'\b[^\d\W]{3,30}\b')
If these lines are no longer used, we should delete them.
I actually made an update to consolidate the preprocess_data and make_reference_corpus scripts. By default, preprocess_data will make the reference corpus from the validation data, but you can supply an additional arg if you'd like to make a custom reference corpus. See the newest commits!
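For reference, here is a minimal sketch of how such an optional argument could look in scripts/preprocess_data.py. The flag name --reference-corpus-path, its default, and the surrounding parser setup are assumptions for illustration, not necessarily what the commits use:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--vocab-size", type=int, default=10000)
# Hypothetical flag: when provided, build the reference corpus from this
# file instead of the validation data (which the updated script uses by
# default, per the comment above).
parser.add_argument("--reference-corpus-path", type=str, default=None,
                    help="Optional path to data for a custom reference corpus; "
                         "defaults to building it from the validation set.")
args = parser.parse_args()

if args.reference_corpus_path is None:
    print("building reference corpus from the validation data")
else:
    print(f"building reference corpus from {args.reference_corpus_path}")
```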
scripts/preprocess_data.py
Outdated
reference_matrix = dev_count_vectorizer.fit_transform(tokenized_dev_examples)
reference_vocabulary = dev_count_vectorizer.get_feature_names()
# dev_count_vectorizer = CountVectorizer(stop_words='english', max_features=args.vocab_size, token_pattern=r'\b[^\d\W]{3,30}\b')
# reference_matrix = dev_count_vectorizer.fit_transform(tokenized_dev_examples)
Same as above.
scripts/preprocess_data.py
Outdated
@@ -87,7 +87,7 @@ def main():

    # generate background frequency
    print("generating background frequency...")
    bgfreq = dict(zip(count_vectorizer.get_feature_names(), master.toarray().sum(1) / args.vocab_size))
    bgfreq = dict(zip(count_vectorizer.get_feature_names(), [x[0] for x in np.array(master.sum(1)) / args.vocab_size]))
What I'm just now realizing is that bgfreq here also takes the dev-data frequencies into account. My understanding is that master is the result of stacking train on dev, and the word counts come from summing along the first dimension. Shouldn't bgfreq be built only from the training data?
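A minimal sketch of the alternative being suggested, fitting the vectorizer on the training split alone; the function and variable names are illustrative, and the division by vocab_size simply mirrors the existing script's normalization rather than being a recommendation:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def build_bgfreq_from_train(tokenized_train_examples, vocab_size):
    # Fit on the training examples only, so dev counts never leak into
    # the background frequencies.
    count_vectorizer = CountVectorizer(stop_words='english',
                                       max_features=vocab_size,
                                       token_pattern=r'\b[^\d\W]{3,30}\b')
    train_matrix = count_vectorizer.fit_transform(tokenized_train_examples)
    # Sum over documents (rows) to get one total count per vocabulary word.
    word_counts = np.asarray(train_matrix.sum(axis=0)).squeeze()
    # get_feature_names() matches the script's sklearn usage; newer
    # sklearn versions rename it to get_feature_names_out().
    return dict(zip(count_vectorizer.get_feature_names(),
                    word_counts / vocab_size))
```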
@@ -29,7 +29,10 @@ def compute_background_log_frequency(vocab: Vocabulary, vocab_namespace: str, pr
    if token in ("@@UNKNOWN@@", "@@PADDING@@", '@@START@@', '@@END@@') or token not in precomputed_bg:
        log_term_frequency[i] = 1e-12
    elif token in precomputed_bg:
        log_term_frequency[i] = precomputed_bg[token]
        if precomputed_bg[token] == 0:
            log_term_frequency[i] = 1e-12
This is a great idea, good catch.
Many of my NaN issues during training were caused by this!
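For anyone hitting the same thing, a rough sketch of why the clamp matters. This simplifies the real compute_background_log_frequency (which works from an AllenNLP Vocabulary) down to a plain dict and token list:

```python
import torch

def background_log_frequency(precomputed_bg, tokens):
    # Sketch only: clamp missing or zero background frequencies to a tiny
    # epsilon so torch.log never produces -inf, which otherwise turns into
    # NaNs once it flows through the loss.
    freqs = torch.zeros(len(tokens))
    for i, token in enumerate(tokens):
        freq = precomputed_bg.get(token, 0.0)
        if token in ("@@UNKNOWN@@", "@@PADDING@@", "@@START@@", "@@END@@") or freq == 0:
            freq = 1e-12
        freqs[i] = freq
    return torch.log(freqs)
```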
There is
Yes, good catch. We can kill these environment variables.
tokenized_examples = []
with tqdm(open(data_path, "r"), desc=f"loading {data_path}") as f:
    for line in f:
        example = json.loads(line)
        if data_path.endswith(".jsonl") or data_path.endswith(".json"):
This assumes that if the input is a .jsonl file, each JSON object contains a 'text' field. I am okay with this if we mention it in the README; otherwise it may confuse people.
yeah, let's be explicit about that in the README.
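As a possible README addition, a small sketch of the assumption in code; the file name and the extra "label" key are purely illustrative:

```python
import json

# Each line of a .jsonl / .json input is expected to be a JSON object
# with a "text" field, e.g. {"text": "a document ...", "label": "1"}.
with open("train.jsonl", "r") as f:  # hypothetical file name
    for line in f:
        example = json.loads(line)
        text = example["text"]  # a KeyError here means the field is missing
```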
This PR contains miscellaneous updates and enhancements to the library, including: