Add support for vocab size target #60

Closed
wants to merge 1 commit into from

Conversation

@RealNicolasBourbaki commented Aug 25, 2019

1. Add an option `vocab_size` that lets users set a target vocabulary size; `min_count` is then chosen so that the vocabulary size stays below the target.
2. Add an enum `VocabCutoff` with two variants: `TargetVocabSize` and `MinCount`.
3. Refactor `From<VocabBuilder<SimpleVocabConfig, T>>` for `SimpleVocab` and `From<VocabBuilder<SubwordVocabConfig, T>>` for `SubwordVocab`, so that `Vocab`s can be built from the corresponding `VocabConfig` with either `TargetVocabSize` or `MinCount`.

Based on: #58

@danieldk (Member) left a comment

Thanks for implementing this!

I have some comments on the data structures and the core algorithm.

Arg::with_name(VOCAB_SIZE)
.long("vocab_size")
.value_name("VOCABULARY_SIZE")
.help("Maximam value of vocabulary size")
Member:

Typo: maximum

Better:

Maximum vocabulary size

#[derive(Clone, Copy, Debug, Serialize)]
#[serde(rename = "SubwordSizedVocab")]
#[serde(tag = "type")]
pub struct SubwordSizedVocabConfig {
Member:

I am wondering if we need new structs for this; they are largely overlapping. If we add some other field, we have to add it to 2 or 4 structs. Maybe replace vocab_size/min_count with an enum along the lines of:

pub enum VocabCutoff {
    MinCount(usize),
    VocabSize(usize),
}
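
A single config struct carrying this enum would avoid the parallel structs entirely. A minimal sketch (the field names and the `discard_threshold` field are hypothetical, not taken from the actual codebase):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum VocabCutoff {
    MinCount(usize),
    VocabSize(usize),
}

// One struct instead of a Sized and a MinCount variant: new fields
// only have to be added in one place.
#[derive(Clone, Copy, Debug)]
pub struct SimpleVocabConfig {
    pub discard_threshold: f32,
    pub vocab_cutoff: VocabCutoff,
}

fn main() {
    let config = SimpleVocabConfig {
        discard_threshold: 1e-4,
        vocab_cutoff: VocabCutoff::VocabSize(50_000),
    };
    // A single match replaces the duplicated per-struct logic.
    match config.vocab_cutoff {
        VocabCutoff::MinCount(n) => println!("discard tokens with count < {}", n),
        VocabCutoff::VocabSize(n) => println!("cut vocabulary at {} tokens", n),
    }
}
```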

Member:

I'm in favour of this, although VocabSize could be misinterpreted if someone prints the metadata. Perhaps TargetSize?

Member:

Good point, I propose an extension to your suggestion: TargetVocabSize


I was thinking the same. I like the idea of TargetVocabSize.

fn from(builder: VocabBuilder<SimpleSizedVocabConfig, T>) -> Self {
let config = builder.config;

let mut types: Vec<_> = builder
Member:

This now also has duplication that can be avoided using the enum suggested above.

if vocab_size < types.len() && vocab_size >= 1 {
let last_word = &types[last_index];
if last_index >= 1 {
if last_word.cmp(&types[last_index - 1]) == Equal {
Member:

last_word == &types[last_index - 1]

@@ -456,6 +534,30 @@ fn bracket(word: &str) -> String {
bracketed
}

fn search_first_occurrence<S>(vec: &[CountedType<S>], key: &CountedType<S>) -> Result<usize, Error>
@danieldk (Member), Aug 26, 2019:

Avoid implementing binary search by hand; it's error-prone (even the Java standard library implementation had a bug for a decade), and it is already available in the standard library as methods of slice (`binary_search_*`). See elsewhere in the review for a suggested approach.

.collect();
types.sort_unstable_by(|w1, w2| w2.cmp(&w1));
let vocab_size = config.vocab_size;
let mut last_index = vocab_size - 1;
Member:

This can be replaced by something like (totally untested, probably doesn't even compile, but should give the idea):

if let Some(last_type) = types.get(last_index) {
  let cutoff_point = match types[..last_index].binary_search_by(|other_type| {
    use Ordering::*;
    match other_type.count().cmp(&last_type.count()) {
      Less => Less,
      Equal => Greater,
      Greater => Greater,
    }
  }) {
    Ok(idx) => idx,
    Err(idx) => idx,
  }
}

Factored out as a function for reusability.
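
The sketch above is explicitly untested; for reference, a self-contained version of the same idea might look like the following (names are hypothetical; counts are assumed sorted in descending order, with the comparator inverted to match that order, and the boundary count taken from the first token beyond the target so that ties straddling the cut are discarded, per the documented semantics):

```rust
use std::cmp::Ordering;

/// Given token counts sorted in descending order and a target size,
/// return the index at which to truncate: the first index whose count
/// equals that of counts[target_size] (the first token beyond the
/// target), so tokens tied with it across the boundary are discarded.
fn cutoff_index(counts: &[u64], target_size: usize) -> usize {
    if counts.len() <= target_size {
        return counts.len();
    }
    // Count of the first token that does not fit into the target size.
    let boundary = counts[target_size];
    // The slice is descending, so the comparator is inverted. Mapping
    // Equal to Greater makes the search always "miss"; Err(idx) is then
    // the index of the first token whose count equals the boundary.
    match counts[..target_size].binary_search_by(|&c| match boundary.cmp(&c) {
        Ordering::Equal => Ordering::Greater,
        other => other,
    }) {
        Ok(idx) | Err(idx) => idx,
    }
}

fn main() {
    // Target 3, but counts[3] == 7 ties with kept tokens: cut at 1.
    assert_eq!(cutoff_index(&[10, 7, 7, 7, 3], 3), 1);
    // No tie at the boundary: keep the full target of 2 tokens.
    assert_eq!(cutoff_index(&[10, 7, 3], 2), 2);
    // Fewer tokens than the target: keep everything.
    assert_eq!(cutoff_index(&[5, 4, 3], 5), 3);
}
```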

@danieldk (Member) left a comment

Looks almost ok now, some requests for factoring out some stuff.

min_count,
discard_threshold,
})
if matches.is_present(VOCAB_SIZE) {
Member:

This block occurs here and below (for SubwordVocab). You can have this block only once, assign the VocabCutoff result to a vocab_cutoff variable and use this variable in the simple/subword vocabs.
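
A sketch of that factoring (the helper and its signature are hypothetical; it takes the already-extracted CLI values rather than the `ArgMatches` so the sketch stays dependency-free, with `vocab_size` taking precedence when present):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum VocabCutoff {
    MinCount(usize),
    TargetVocabSize(usize),
}

// Decide the cut-off once from the raw CLI values, then reuse the
// result for both the simple and the subword vocabulary configs.
fn vocab_cutoff(vocab_size: Option<&str>, min_count: &str) -> VocabCutoff {
    match vocab_size {
        Some(size) => VocabCutoff::TargetVocabSize(
            size.parse().expect("vocab size must be a positive integer"),
        ),
        None => VocabCutoff::MinCount(
            min_count.parse().expect("min count must be a positive integer"),
        ),
    }
}

fn main() {
    assert_eq!(vocab_cutoff(Some("50000"), "5"), VocabCutoff::TargetVocabSize(50_000));
    assert_eq!(vocab_cutoff(None, "10"), VocabCutoff::MinCount(10));
}
```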

@@ -97,6 +97,21 @@ pub struct DepembedsConfig {
pub untyped: bool,
}

/// Options for sizing vocabulary.
Member:

vocabulary -> the vocabulary

/// Options for sizing vocabulary.
#[derive(Copy, Clone, Debug, Serialize)]
pub enum VocabCutoff {
/// Minimum toke count
Member:

Typo: token


/// Maximum target vocabulary size
///
/// If TargetVocabSize is used, then the maximum target vocabulary size is set by this value.
Member:

Replace by:

Cut off the vocabulary at the given size, retaining the n most frequent tokens. Tokens with the same count as the token at the cut-off point will also be discarded.

pub enum VocabCutoff {
/// Minimum toke count
///
/// If MinCount is used, then no word-specific embeddings will be trained for tokens occurring
Member:

Replace by (since it's not only about not training embeddings, the words will also not be used as context):

Discard tokens that occur less than the given count.

@@ -114,11 +129,10 @@ pub struct SubwordVocabConfig {
/// buckets.
pub buckets_exp: u32,

/// Minimum token count.
/// Vocab cutoff options.
Member:

Documentation: Vocab -> Vocabulary

/// No word-specific embeddings will be trained for tokens occurring less
/// than this count.
pub min_count: u32,
/// Ways of sizing vocabularies.
Member:

Change to:

Vocabulary size cut-off.

@@ -133,11 +147,10 @@ pub struct SubwordVocabConfig {
#[serde(rename = "SimpleVocab")]
#[serde(tag = "type")]
pub struct SimpleVocabConfig {
/// Minimum token count.
/// Vocab cutoff options.
Member:

Change to:

Vocabulary size cut-off.

@@ -396,16 +420,49 @@ where
{
fn from(builder: VocabBuilder<SubwordVocabConfig, T>) -> Self {
let config = builder.config;
let vocab_cutoff = config.vocab_cutoff;
Member:

Factor out overlapping code with From impl for SimpleVocabConfig.

Helpful in refactoring: I don't think we are actually using EOS. IIRC I evaluated use of EOS at some point and there was no difference. Not sure why it shows up here and not above.

Member:

@sebpuetz do you remember this?

Member:

No recollection of any of that, but it seems about right that EOS marking doesn't have an effect considering we're training with punctuation.


I found it odd too. Okay then, I guess I'll just ignore the whole EOS part. Thanks for the info.

types.sort_unstable_by(|w1, w2| w2.cmp(&w1));
SimpleVocab::new(builder.config, types, builder.n_items)
VocabCutoff::TargetVocabSize(vocab_size) => {
assert!(vocab_size >= 1, "Target vocab size must be positive");
Member:

I'm not sure whether we need this assertion. The value is guaranteed to be positive, because it's an unsigned integer. And in the case of someone entering target_size=2, I believe the training loop would never terminate because the negative sampling ensures not returning the same idx as the focus word as a negative sample in an unconditional loop.

Although having the assertion here doesn't hurt either.

@sebpuetz (Member), Aug 30, 2019:

Nevermind, the vocab_size - 1 below could underflow otherwise.

@RealNicolasBourbaki (Author) commented Oct 5, 2019

I looked into this again and here is a problem I don't quite understand:

In train_model.rs line 188,

let metadata = Metadata(Value::try_from(trainer.to_metadata())?);

An UnsupportedType error occurs during try_from. Why is that?

@RealNicolasBourbaki (Author) commented Oct 5, 2019

Never mind, I found the issue: the TOML format does not support enums with values.

I guess we can match on VocabCutoff's type when returning the metadata, so that when SkipgramMetadata is converted into Value, the cut-off information is no longer an enum.
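
That conversion could be sketched as follows (the flattened key names are hypothetical; a plain key/value pair stands in for the actual metadata type to keep the sketch dependency-free):

```rust
#[derive(Clone, Copy, Debug)]
enum VocabCutoff {
    MinCount(usize),
    TargetVocabSize(usize),
}

// Instead of serializing the enum itself (which TOML rejects), emit a
// plain (key, value) pair that any TOML table can represent.
fn cutoff_metadata(cutoff: VocabCutoff) -> (&'static str, usize) {
    match cutoff {
        VocabCutoff::MinCount(n) => ("min_count", n),
        VocabCutoff::TargetVocabSize(n) => ("target_vocab_size", n),
    }
}

fn main() {
    assert_eq!(
        cutoff_metadata(VocabCutoff::TargetVocabSize(50_000)),
        ("target_vocab_size", 50_000)
    );
    assert_eq!(cutoff_metadata(VocabCutoff::MinCount(30)), ("min_count", 30));
}
```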

@sebpuetz (Member) commented Oct 5, 2019

It's probably quite frustrating, but it's hard to review this without it being rebased on master. A lot of things moved in the vocab and config module, so what's fine in this PR might be broken on master.

@danieldk (Member) commented Oct 6, 2019

I hadn't noticed the August 30 updates. Feel free to ping reviewers again in the future when you have addressed comments. For me the problem is that I review a lot of PRs and GitHub does not provide a good mechanism to keep track of the status of PRs. GitHub is also very spammy with e-mails, so it is hard to keep up with PRs in that way.

We should probably add this to the A3 wiki, asking people to explicitly re-request a review when comments have been addressed.

For now, I fear that the only solution is indeed to rebase against the current master. This needs to happen anyway to be mergeable, since there are conflicting files.
