Add support for vocab size target #60
Conversation
Thanks for implementing this!
I have some comments on the data structures and the core algorithm.
finalfrontier-utils/src/util.rs
Outdated
Arg::with_name(VOCAB_SIZE)
    .long("vocab_size")
    .value_name("VOCABULARY_SIZE")
    .help("Maximam value of vocabulary size")
Typo: maximum
Better:
Maximum vocabulary size
finalfrontier/src/config.rs
Outdated
#[derive(Clone, Copy, Debug, Serialize)]
#[serde(rename = "SubwordSizedVocab")]
#[serde(tag = "type")]
pub struct SubwordSizedVocabConfig {
I am wondering if we need new structs for this; they are largely overlapping. If we add some other field, we have to add it to 2 or 4 structs. Maybe replace the vocab_size/min_count fields with an enum along the lines of:
pub enum VocabCutoff {
MinCount(usize),
VocabSize(usize),
}
I'm in favour of this, although VocabSize could be misinterpreted if someone prints the metadata. Perhaps TargetSize?
Good point, I propose an extension to your suggestion: TargetVocabSize
I was thinking the same. I like the idea of TargetVocabSize.
finalfrontier/src/vocab.rs
Outdated
fn from(builder: VocabBuilder<SimpleSizedVocabConfig, T>) -> Self {
    let config = builder.config;

    let mut types: Vec<_> = builder
This now also has duplication that can be avoided using the enum suggested above.
finalfrontier/src/vocab.rs
Outdated
if vocab_size < types.len() && vocab_size >= 1 {
    let last_word = &types[last_index];
    if last_index >= 1 {
        if last_word.cmp(&types[last_index - 1]) == Equal {
last_word == &types[last_index - 1]
finalfrontier/src/vocab.rs
Outdated
@@ -456,6 +534,30 @@ fn bracket(word: &str) -> String {
    bracketed
}

fn search_first_occurrence<S>(vec: &[CountedType<S>], key: &CountedType<S>) -> Result<usize, Error>
Avoid implementing binary search by hand, it's error prone (even the Java standard library implementation had an error for a decade), plus it's in the standard library as methods of slice (binary_search_*). See elsewhere in the review for a suggested approach.
finalfrontier/src/vocab.rs
Outdated
    .collect();
types.sort_unstable_by(|w1, w2| w2.cmp(&w1));
let vocab_size = config.vocab_size;
let mut last_index = vocab_size - 1;
This can be replaced by something like (totally untested, probably doesn't even compile, but should give the idea):
if let Some(last_type) = types.get(last_index) {
    let cutoff_point = match types[..last_index].binary_search_by(|other_type| {
        use std::cmp::Ordering::*;
        // `types` is sorted by descending count, so tokens with a larger
        // count sort before the cut-off token; mapping Equal to Greater
        // makes the search converge on the first token of the tied run.
        match other_type.count().cmp(&last_type.count()) {
            Greater => Less,
            Equal | Less => Greater,
        }
    }) {
        Ok(idx) | Err(idx) => idx,
    };
}
Factored out as a function for reusability.
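To make the idea concrete, here is a self-contained sketch over plain counts (the name `run_start` and the `u32` counts are illustrative, not from the PR):

```rust
use std::cmp::Ordering;

/// Given token counts sorted in descending order, return the index where
/// the run of counts equal to `counts[last_index]` begins, using the
/// standard library's binary search instead of a hand-rolled one.
fn run_start(counts: &[u32], last_index: usize) -> usize {
    let last_count = counts[last_index];
    match counts[..last_index].binary_search_by(|&other| match other.cmp(&last_count) {
        // Descending order: larger counts sort before the cut-off token.
        Ordering::Greater => Ordering::Less,
        // Equal counts are treated as "after" the target, so the search
        // converges on the first element of the tied run.
        Ordering::Equal | Ordering::Less => Ordering::Greater,
    }) {
        // The comparator never returns Equal, so the result is always
        // Err(idx), where idx is the first index of the run.
        Ok(idx) | Err(idx) => idx,
    }
}

fn main() {
    // Counts sorted by descending frequency; with a target size of 4, the
    // cut-off token (count 3) belongs to a tied run starting at index 2,
    // so truncating there discards the whole run.
    let counts = [5, 4, 3, 3, 3, 1];
    println!("{}", run_start(&counts, 3)); // prints 2
}
```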
Looks almost ok now, some requests for factoring out some stuff.
finalfrontier-utils/src/util.rs
Outdated
    min_count,
    discard_threshold,
})
if matches.is_present(VOCAB_SIZE) {
This block occurs here and below (for SubwordVocab). You can have this block only once: assign the VocabCutoff result to a vocab_cutoff variable and use this variable in the simple/subword vocabs.
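The suggestion could look roughly like this (the helper name and the Option-based argument are hypothetical; the PR reads these values from the clap matches):

```rust
/// Ways of cutting off the vocabulary, as suggested in the review.
#[derive(Clone, Copy, Debug, PartialEq)]
enum VocabCutoff {
    MinCount(usize),
    TargetVocabSize(usize),
}

/// Build the cutoff once from the command-line values, then reuse the
/// result for both the simple and the subword vocabulary configs.
fn vocab_cutoff(vocab_size: Option<usize>, min_count: usize) -> VocabCutoff {
    match vocab_size {
        // An explicit target size takes precedence over the min count.
        Some(size) => VocabCutoff::TargetVocabSize(size),
        None => VocabCutoff::MinCount(min_count),
    }
}

fn main() {
    println!("{:?}", vocab_cutoff(Some(10_000), 30)); // TargetVocabSize(10000)
    println!("{:?}", vocab_cutoff(None, 30)); // MinCount(30)
}
```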
finalfrontier/src/config.rs
Outdated
@@ -97,6 +97,21 @@ pub struct DepembedsConfig {
    pub untyped: bool,
}

/// Options for sizing vocabulary.
vocabulary -> the vocabulary
finalfrontier/src/config.rs
Outdated
/// Options for sizing vocabulary.
#[derive(Copy, Clone, Debug, Serialize)]
pub enum VocabCutoff {
    /// Minimum toke count
Typo: token
finalfrontier/src/config.rs
Outdated

/// Maximum target vocabulary size
///
/// If TargetVocabSize is used, then the maximum target vocabulary size is set by this value.
Replace by:
Cut off the vocabulary at the given size, retaining the n most frequent tokens. Tokens with the same count as the token at the cut-off point will also be discarded.
finalfrontier/src/config.rs
Outdated
pub enum VocabCutoff {
    /// Minimum toke count
    ///
    /// If MinCount is used, then no word-specific embeddings will be trained for tokens occurring
Replace by (since it's not only about not training embeddings, the words will also not be used as context):
Discard tokens that occur less than the given count.
finalfrontier/src/config.rs
Outdated
@@ -114,11 +129,10 @@ pub struct SubwordVocabConfig {
    /// buckets.
    pub buckets_exp: u32,

    /// Minimum token count.
    /// Vocab cutoff options.
Documentation: Vocab -> Vocabulary
finalfrontier/src/config.rs
Outdated
/// No word-specific embeddings will be trained for tokens occurring less
/// than this count.
pub min_count: u32,
/// Ways of sizing vocabularies.
Change to:
Vocabulary size cut-off.
finalfrontier/src/config.rs
Outdated
@@ -133,11 +147,10 @@ pub struct SubwordVocabConfig {
#[serde(rename = "SimpleVocab")]
#[serde(tag = "type")]
pub struct SimpleVocabConfig {
    /// Minimum token count.
    /// Vocab cutoff options.
Change to:
Vocabulary size cut-off.
@@ -396,16 +420,49 @@ where
{
    fn from(builder: VocabBuilder<SubwordVocabConfig, T>) -> Self {
        let config = builder.config;
        let vocab_cutoff = config.vocab_cutoff;
Factor out overlapping code with From impl for SimpleVocabConfig.
Helpful in refactoring: I don't think we are actually using EOS. IIRC I evaluated use of EOS at some point and there was no difference. Not sure why it shows up here and not above.
@sebpuetz do you remember this?
No recollection of any of that, but it seems about right that EOS marking doesn't have an effect considering we're training with punctuation.
I found it odd too. Okay then, I guess I'll just ignore the whole EOS part. Thanks for the info.
types.sort_unstable_by(|w1, w2| w2.cmp(&w1));
SimpleVocab::new(builder.config, types, builder.n_items)
VocabCutoff::TargetVocabSize(vocab_size) => {
    assert!(vocab_size >= 1, "Target vocab size must be positive");
I'm not sure whether we need this assertion. The value is guaranteed to be positive, because it's an unsigned integer. And in the case of someone entering target_size=2, I believe the training loop would never terminate, because the negative sampling ensures not returning the same idx as the focus word as a negative sample in an unconditional loop.
Although having the assertion here doesn't hurt either.
Nevermind, the vocab_size - 1 below could underflow otherwise.
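A minimal sketch of the underflow concern (the function name is illustrative): guarding the subtraction with checked_sub would make the zero case explicit even without the assertion.

```rust
/// Compute the index of the last token to keep for a target vocabulary
/// size, returning None instead of underflowing when the size is zero.
fn last_index(vocab_size: usize) -> Option<usize> {
    // A bare `vocab_size - 1` would panic on underflow in debug builds
    // for vocab_size == 0, and wrap around in release builds.
    vocab_size.checked_sub(1)
}

fn main() {
    println!("{:?}", last_index(4)); // Some(3)
    println!("{:?}", last_index(0)); // None
}
```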
1. Add an option `vocab_size` allowing people to set a target vocabulary size; "mincount" is then sized so that the vocabulary size is less than the target vocab size.
2. Add enum `VocabCutoff` with two variants: `TargetVocabSize` and `MinCount`.
3. Refactor `From<VocabBuilder<SimpleVocabConfig, T>>` for `SimpleVocab` and `From<VocabBuilder<SubwordVocabConfig, T>>` for `SubwordVocab`, so that `Vocab`s can be built from the corresponding `VocabConfig` with `TargetVocabSize` or `MinCount`.
I looked into this again and here is a problem I don't quite understand: an UnsupportedType error occurred during …
Nevermind, I found the issue: the TOML format does not support enums with values. I guess we can match …
It's probably quite frustrating, but it's hard to review this without it being rebased on master. A lot of things moved in the …
I hadn't noticed the August 30 updates. Feel free to ping reviewers again in the future when you have addressed comments. For me the problem is that I review a lot of PRs and GitHub does not provide a good mechanism to keep track of the status of PRs. GitHub is also very spammy with e-mails, so it is hard to keep up with PRs in that way. We should probably add this to the A3 wiki, asking people to explicitly re-request a review when comments have been addressed. For now, I fear that the only solution is indeed to rebase against the current master. This needs to happen anyway to be mergeable, since there are conflicting files.
Based on: #58