
Conversation

@ftgreat (Contributor) commented Aug 2, 2023

We released the Aquila-7B model series (related issue), which is based on Chinese and English knowledge.
We have also open-sourced the models on HuggingFace and FlagAI.

Because these models use a BPE tokenizer, our pull request adding BPE tokenizer support has already been merged.

Could the Aquila-7B models be added to llama.cpp? Thanks for your review.

ldwang and others added 6 commits July 15, 2023 14:12
Signed-off-by: ldwang <ftgreat@gmail.com>
@monatis merged commit 220d931 into ggml-org:master on Aug 2, 2023
@goerch (Contributor) commented Aug 6, 2023

I'm trying to convert Aquila-7B with

python.exe convert.py models\Aquila-7B --vocabtype bpe

First problem I ran into was the missing encoding in

        if self.vocabtype == "bpe":
            self.sentencepiece_tokenizer = json.loads(open(str(fname_tokenizer), encoding='utf-8').read())

Now I'm stuck with

Exception: Vocab size mismatch (model has 100008, but models\Aquila-7B\vocab.json has 100000).  Most likely you are missing added_tokens.json (should be in models\Aquila-7B).

Edit: is UTF-8 the correct encoding?
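
For what it's worth, HuggingFace tokenizer and vocab files are UTF-8 JSON, so passing encoding='utf-8' explicitly is the right choice; without it, open() falls back to the locale encoding (often cp1252 on Windows), which either mis-decodes non-ASCII tokens or fails outright. A minimal sketch of the fixed load, reusing fname_tokenizer from the convert.py snippet above (load_bpe_vocab is just an illustrative helper name, not a convert.py function):

import json
from pathlib import Path

def load_bpe_vocab(fname_tokenizer: Path) -> dict:
    # Force UTF-8 instead of relying on the platform default encoding,
    # which is typically cp1252 on Windows and breaks on non-ASCII tokens.
    with open(fname_tokenizer, encoding="utf-8") as f:
        return json.load(f)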

@goerch (Contributor) commented Aug 6, 2023

OK, I found the missing added_tokens in tokenizer.json:

  "added_tokens": [
    {
      "id": 0,
      "content": "<|endoftext|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 100000,
      "content": "<|startofpiece|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100001,
      "content": "<|endofpiece|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100002,
      "content": "<|LDWANG|>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100003,
      "content": "[MASK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100004,
      "content": "[gMASK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100005,
      "content": "[sMASK]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100006,
      "content": "[CLS]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },
    {
      "id": 100007,
      "content": "</s>",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    }
  ],

Now I'm missing how to incorporate them?

@goerch (Contributor) commented Aug 6, 2023

Manually generating 'added_tokens.json' with content

{
  "<|endoftext|>": 0,
  "<|startofpiece|>": 100000,
  "<|endofpiece|>": 100001,
  "<|LDWANG|>": 100002,
  "[MASK]": 100003,
  "[gMASK]": 100004,
  "[sMASK]": 100005,
  "[CLS]": 100006,
  "</s>": 100007
}

results in

Exception: Expected added token IDs to be sequential and start at 9; got [0, 100000, 100001, 100002, 100003, 100004, 100005, 100006, 100007]

Edit: removing the entry for "<|endoftext|>" seems to fix the problem.
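
For context, a simplified sketch of the kind of check behind that exception (names and message approximate, not the exact convert.py code). Added-token IDs are expected to continue directly after the base vocab, and the "9" in the message appears to be the count of added tokens rather than the required starting ID:

def check_added_tokens(base_vocab_size: int, added_tokens: dict) -> None:
    # Added tokens must extend the base vocab contiguously:
    # base_vocab_size, base_vocab_size + 1, ...
    expected_ids = list(range(base_vocab_size, base_vocab_size + len(added_tokens)))
    actual_ids = sorted(added_tokens.values())
    if expected_ids != actual_ids:
        raise Exception(f"Expected added token IDs to be sequential and start at "
                        f"{len(added_tokens)}; got {actual_ids}")

# vocab.json supplies the 100000 base tokens, so <|endoftext|> (id 0) already
# lies inside the base vocab; dropping that entry leaves ids 100000..100007,
# which satisfy the check and give the expected total of 100008.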
