Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallowed Special Token #130

Open
MrAshRhodes opened this issue Jan 3, 2024 · 2 comments
Open

Disallowed Special Token #130

MrAshRhodes opened this issue Jan 3, 2024 · 2 comments

Comments

@MrAshRhodes
Copy link

Hi,

Ive been trying to scrape the openai api docs for testing and im constantly getting the following error. Would anyone know how to resolve?

Im using the docker image if it makes any difference.

Found 634 files to combine...
file:///home/gpt-crawler/node_modules/gpt-tokenizer/esm/GptEncoding.js:78
                throw new Error(`Disallowed special token found: ${match[0]}`);
                      ^

Error: Disallowed special token found: <|endoftext|>
    at GptEncoding.encodeGenerator (file:///home/gpt-crawler/node_modules/gpt-tokenizer/esm/GptEncoding.js:78:23)
    at GptEncoding.isWithinTokenLimit (file:///home/gpt-crawler/node_modules/gpt-tokenizer/esm/GptEncoding.js:147:20)
    at addContentOrSplit (file:///home/gpt-crawler/dist/src/core.js:131:28)
    at write (file:///home/gpt-crawler/dist/src/core.js:156:15)
    at async file:///home/gpt-crawler/dist/src/main.js:4:1

Node.js v20.10.0
Crawling complete..
@ryanspice
Copy link

+1 to this, after crawling 5000+ pages sad to have it fail!

@MrAshRhodes
Copy link
Author

@ryanspice
I think I've got a workaround.

I made changes to the following files GptEncoding.js and specialTokens.jsin node_modules\gpt-tokenizer\esm\

Change the following function in GptEncoding.js

    encodeGenerator(lineToEncode, { allowedSpecial = new Set(), disallowedSpecial = new Set(), } = {}) {
        // Assuming ALL_SPECIAL_TOKENS is a placeholder for all special tokens
        if (disallowedSpecial.has(ALL_SPECIAL_TOKENS)) {
            disallowedSpecial = new Set(this.specialTokenMapping.keys());
            allowedSpecial.forEach(token => disallowedSpecial.delete(token));
        }
        
        // Check for disallowed tokens in the input
        disallowedSpecial.forEach(token => {
            if (lineToEncode.includes(token)) {
                throw new Error(`Disallowed special token found: ${token}`);
            }
        });
    
        return this.bytePairEncodingCoreProcessor.encodeNative(lineToEncode, allowedSpecial);
    }

Then in specialTokens.js replace these bits.

export const EndOfText = "<EOT>";
export const FimPrefix = "<FimPrefix>";
export const FimMiddle = "<FimMiddle>";
export const FimSuffix = "<FimSuffix>";
export const ImStart = "<ImStart>";
export const ImEnd = "<ImEnd>";
export const ImSep = "<ImSep>";
export const EndOfPrompt = "<EndOfPrompt>";

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants