Add support for vocab size target #60

Closed
wants to merge 1 commit into from

Conversation

@RealNicolasBourbaki commented Aug 25, 2019

1. Add an option `vocab_size` that lets users set a target vocabulary size; `min_count` is then chosen so that the vocabulary size stays below the target.
2. Add an enum `VocabCutoff` with two variants: `TargetVocabSize` and `MinCount`.
3. Refactor `From<VocabBuilder<SimpleVocabConfig, T>>` for `SimpleVocab` and `From<VocabBuilder<SubwordVocabConfig, T>>` for `SubwordVocab`, so that `Vocab`s can be built from the corresponding `VocabConfig` with either `TargetVocabSize` or `MinCount`.

Based on: #58

@danieldk (Member) left a comment

Thanks for implementing this!

I have some comments on the data structures and the core algorithm.

Arg::with_name(VOCAB_SIZE)
.long("vocab_size")
.value_name("VOCABULARY_SIZE")
.help("Maximam value of vocabulary size")
Member:

Typo: maximum

Better:

Maximum vocabulary size

#[derive(Clone, Copy, Debug, Serialize)]
#[serde(rename = "SubwordSizedVocab")]
#[serde(tag = "type")]
pub struct SubwordSizedVocabConfig {
Member:

I am wondering if we need new structs for this; they are largely overlapping. If we add some other field, we have to add it to 2 or 4 structs. Maybe replace vocab_size/min_count with an enum along the lines of:

pub enum VocabCutoff {
    MinCount(usize),
    VocabSize(usize),
}
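
A single config struct carrying this enum would avoid the parallel structs entirely. A minimal sketch (the field names and the `discard_threshold` field are hypothetical, not taken from the actual codebase):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum VocabCutoff {
    MinCount(usize),
    VocabSize(usize),
}

// One struct instead of a Sized and a MinCount variant: new fields
// only have to be added in one place.
#[derive(Clone, Copy, Debug)]
pub struct SimpleVocabConfig {
    pub discard_threshold: f32,
    pub vocab_cutoff: VocabCutoff,
}

fn main() {
    let config = SimpleVocabConfig {
        discard_threshold: 1e-4,
        vocab_cutoff: VocabCutoff::VocabSize(50_000),
    };
    // A single match replaces the duplicated per-struct logic.
    match config.vocab_cutoff {
        VocabCutoff::MinCount(n) => println!("discard tokens with count < {}", n),
        VocabCutoff::VocabSize(n) => println!("cut vocabulary at {} tokens", n),
    }
}
```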

Member:

I'm in favour of this, although VocabSize could be misinterpreted if someone prints the metadata. Perhaps TargetSize?

Member:

Good point, I propose an extension to your suggestion: TargetVocabSize


I was thinking the same. I like the idea of TargetVocabSize.

fn from(builder: VocabBuilder<SimpleSizedVocabConfig, T>) -> Self {
let config = builder.config;

let mut types: Vec<_> = builder
Member:

This now also has duplication that can be avoided using the enum suggested above.

if vocab_size < types.len() && vocab_size >= 1 {
let last_word = &types[last_index];
if last_index >= 1 {
if last_word.cmp(&types[last_index - 1]) == Equal {
Member:

last_word == &types[last_index - 1]

@@ -456,6 +534,30 @@ fn bracket(word: &str) -> String {
bracketed
}

fn search_first_occurrence<S>(vec: &[CountedType<S>], key: &CountedType<S>) -> Result<usize, Error>
@danieldk (Member), Aug 26, 2019:

Avoid implementing binary search by hand; it's error-prone (even the Java standard library implementation had a bug for a decade), and it is already available in the standard library as methods of slice (`binary_search_*`). See elsewhere in the review for a suggested approach.

.collect();
types.sort_unstable_by(|w1, w2| w2.cmp(&w1));
let vocab_size = config.vocab_size;
let mut last_index = vocab_size - 1;
Member:

This can be replaced by something like (totally untested, probably doesn't even compile, but should give the idea):

if let Some(last_type) = types.get(last_index) {
  let cutoff_point = match types[..last_index].binary_search_by(|other_type| {
    use Ordering::*;
    match other_type.count().cmp(&last_type.count()) {
      Less => Less,
      Equal => Greater,
      Greater => Greater,
    }
  }) {
    Ok(idx) => idx,
    Err(idx) => idx,
  }
}

Factored out as a function for reusability.
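
The sketch above is explicitly untested; for reference, a self-contained version of the same idea might look like the following (names are hypothetical; counts are assumed sorted in descending order, with the comparator inverted to match that order, and the boundary count taken from the first token beyond the target so that ties straddling the cut are discarded, per the documented semantics):

```rust
use std::cmp::Ordering;

/// Given token counts sorted in descending order and a target size,
/// return the index at which to truncate: the first index whose count
/// equals that of counts[target_size] (the first token beyond the
/// target), so tokens tied with it across the boundary are discarded.
fn cutoff_index(counts: &[u64], target_size: usize) -> usize {
    if counts.len() <= target_size {
        return counts.len();
    }
    // Count of the first token that does not fit into the target size.
    let boundary = counts[target_size];
    // The slice is descending, so the comparator is inverted. Mapping
    // Equal to Greater makes the search always "miss"; Err(idx) is then
    // the index of the first token whose count equals the boundary.
    match counts[..target_size].binary_search_by(|&c| match boundary.cmp(&c) {
        Ordering::Equal => Ordering::Greater,
        other => other,
    }) {
        Ok(idx) | Err(idx) => idx,
    }
}

fn main() {
    // Target 3, but counts[3] == 7 ties with kept tokens: cut at 1.
    assert_eq!(cutoff_index(&[10, 7, 7, 7, 3], 3), 1);
    // No tie at the boundary: keep the full target of 2 tokens.
    assert_eq!(cutoff_index(&[10, 7, 3], 2), 2);
    // Fewer tokens than the target: keep everything.
    assert_eq!(cutoff_index(&[5, 4, 3], 5), 3);
}
```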

@danieldk (Member) left a comment

Looks almost ok now, some requests for factoring out some stuff.

min_count,
discard_threshold,
})
if matches.is_present(VOCAB_SIZE) {
Member:

This block occurs here and below (for SubwordVocab). You can have this block only once, assign the VocabCutoff result to a vocab_cutoff variable and use this variable in the simple/subword vocabs.
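
A sketch of that factoring (the helper and its signature are hypothetical; it takes the already-extracted CLI values rather than the `ArgMatches` so the sketch stays dependency-free, with `vocab_size` taking precedence when present):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum VocabCutoff {
    MinCount(usize),
    TargetVocabSize(usize),
}

// Decide the cut-off once from the raw CLI values, then reuse the
// result for both the simple and the subword vocabulary configs.
fn vocab_cutoff(vocab_size: Option<&str>, min_count: &str) -> VocabCutoff {
    match vocab_size {
        Some(size) => VocabCutoff::TargetVocabSize(
            size.parse().expect("vocab size must be a positive integer"),
        ),
        None => VocabCutoff::MinCount(
            min_count.parse().expect("min count must be a positive integer"),
        ),
    }
}

fn main() {
    assert_eq!(vocab_cutoff(Some("50000"), "5"), VocabCutoff::TargetVocabSize(50_000));
    assert_eq!(vocab_cutoff(None, "10"), VocabCutoff::MinCount(10));
}
```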

@@ -97,6 +97,21 @@ pub struct DepembedsConfig {
pub untyped: bool,
}

/// Options for sizing vocabulary.
Member:

vocabulary -> the vocabulary

/// Options for sizing vocabulary.
#[derive(Copy, Clone, Debug, Serialize)]
pub enum VocabCutoff {
/// Minimum toke count
Member:

Typo: token


/// Maximum target vocabulary size
///
/// If TargetVocabSize is used, then the maximum target vocabulary size is set by this value.
Member:

Replace by:

Cut off the vocabulary at the given size, retaining the n most frequent tokens. Tokens with the same count as the token at the cut-off point will also be discarded.

pub enum VocabCutoff {
/// Minimum toke count
///
/// If MinCount is used, then no word-specific embeddings will be trained for tokens occurring
Member:

Replace by (since it's not only about not training embeddings, the words will also not be used as context):

Discard tokens that occur less than the given count.

@@ -114,11 +129,10 @@ pub struct SubwordVocabConfig {
/// buckets.
pub buckets_exp: u32,

/// Minimum token count.
/// Vocab cutoff options.
Member:

Documentation: Vocab -> Vocabulary

/// No word-specific embeddings will be trained for tokens occurring less
/// than this count.
pub min_count: u32,
/// Ways of sizing vocabularies.
Member:

Change to:

Vocabulary size cut-off.

@@ -133,11 +147,10 @@ pub struct SubwordVocabConfig {
#[serde(rename = "SimpleVocab")]
#[serde(tag = "type")]
pub struct SimpleVocabConfig {
/// Minimum token count.
/// Vocab cutoff options.
Member:

Change to:

Vocabulary size cut-off.

@@ -396,16 +420,49 @@ where
{
fn from(builder: VocabBuilder<SubwordVocabConfig, T>) -> Self {
let config = builder.config;
let vocab_cutoff = config.vocab_cutoff;
Member:

Factor out overlapping code with From impl for SimpleVocabConfig.

Helpful in refactoring: I don't think we are actually using EOS. IIRC I evaluated use of EOS at some point and there was no difference. Not sure why it shows up here and not above.

Member:

@sebpuetz do you remember this?

Member:

No recollection of any of that, but it seems about right that EOS marking doesn't have an effect considering we're training with punctuation.


I found it odd too. Okay then, I guess I'll just ignore the whole EOS part. Thanks for the info.

types.sort_unstable_by(|w1, w2| w2.cmp(&w1));
SimpleVocab::new(builder.config, types, builder.n_items)
VocabCutoff::TargetVocabSize(vocab_size) => {
assert!(vocab_size >= 1, "Target vocab size must be positive");
Member:

I'm not sure whether we need this assertion. The value is guaranteed to be positive, because it's an unsigned integer. And in the case of someone entering target_size=2, I believe the training loop would never terminate because the negative sampling ensures not returning the same idx as the focus word as a negative sample in an unconditional loop.

Although having the assertion here doesn't hurt either.

@sebpuetz (Member), Aug 30, 2019:

Nevermind, the vocab_size - 1 below could underflow otherwise.

@RealNicolasBourbaki (Author) commented Oct 5, 2019

I looked into this again and here is a problem I don't quite understand:

In train_model.rs line 188,

let metadata = Metadata(Value::try_from(trainer.to_metadata())?);

An UnsupportedType error occurs during try_from. Why is that?

@RealNicolasBourbaki (Author) commented Oct 5, 2019

Never mind, I found the issue: the TOML format does not support enums with values.

I guess we can match on VocabCutoff's type when returning the metadata, so that when SkipgramMetadata is converted into Value, the cut-off information is no longer an enum.
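
That conversion could be sketched as follows (the flattened key names are hypothetical; a plain key/value pair stands in for the actual metadata type to keep the sketch dependency-free):

```rust
#[derive(Clone, Copy, Debug)]
enum VocabCutoff {
    MinCount(usize),
    TargetVocabSize(usize),
}

// Instead of serializing the enum itself (which TOML rejects), emit a
// plain (key, value) pair that any TOML table can represent.
fn cutoff_metadata(cutoff: VocabCutoff) -> (&'static str, usize) {
    match cutoff {
        VocabCutoff::MinCount(n) => ("min_count", n),
        VocabCutoff::TargetVocabSize(n) => ("target_vocab_size", n),
    }
}

fn main() {
    assert_eq!(
        cutoff_metadata(VocabCutoff::TargetVocabSize(50_000)),
        ("target_vocab_size", 50_000)
    );
    assert_eq!(cutoff_metadata(VocabCutoff::MinCount(30)), ("min_count", 30));
}
```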

@sebpuetz (Member) commented Oct 5, 2019

It's probably quite frustrating, but it's hard to review this without it being rebased on master. A lot of things moved in the vocab and config module, so what's fine in this PR might be broken on master.

@danieldk (Member) commented Oct 6, 2019

I hadn't noticed the August 30 updates. Feel free to ping reviewers again in the future when you have addressed comments. For me the problem is that I review a lot of PRs and GitHub does not provide a good mechanism to keep track of the status of PRs. GitHub is also very spammy with e-mails, so it is hard to keep up with PRs in that way.

We should probably add this to the A3 wiki, asking people to explicitly re-request a review when comments have been addressed.

For now, I fear that the only solution is indeed to rebase against the current master. This needs to happen anyway to be mergeable, since there are conflicting files.
