You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Ive been trying to scrape the openai api docs for testing and im constantly getting the following error. Would anyone know how to resolve?
Im using the docker image if it makes any difference.
Found 634 files to combine...
file:///home/gpt-crawler/node_modules/gpt-tokenizer/esm/GptEncoding.js:78
throw new Error(`Disallowed special token found: ${match[0]}`);
^
Error: Disallowed special token found: <|endoftext|>
at GptEncoding.encodeGenerator (file:///home/gpt-crawler/node_modules/gpt-tokenizer/esm/GptEncoding.js:78:23)
at GptEncoding.isWithinTokenLimit (file:///home/gpt-crawler/node_modules/gpt-tokenizer/esm/GptEncoding.js:147:20)
at addContentOrSplit (file:///home/gpt-crawler/dist/src/core.js:131:28)
at write (file:///home/gpt-crawler/dist/src/core.js:156:15)
at async file:///home/gpt-crawler/dist/src/main.js:4:1
Node.js v20.10.0
Crawling complete..
The text was updated successfully, but these errors were encountered:
I made changes to the following files GptEncoding.js and specialTokens.jsin node_modules\gpt-tokenizer\esm\
Change the following function in GptEncoding.js
encodeGenerator(lineToEncode, { allowedSpecial = new Set(), disallowedSpecial = new Set(), } = {}) {
// Assuming ALL_SPECIAL_TOKENS is a placeholder for all special tokens
if (disallowedSpecial.has(ALL_SPECIAL_TOKENS)) {
disallowedSpecial = new Set(this.specialTokenMapping.keys());
allowedSpecial.forEach(token => disallowedSpecial.delete(token));
}
// Check for disallowed tokens in the input
disallowedSpecial.forEach(token => {
if (lineToEncode.includes(token)) {
throw new Error(`Disallowed special token found: ${token}`);
}
});
return this.bytePairEncodingCoreProcessor.encodeNative(lineToEncode, allowedSpecial);
}
Hi,
Ive been trying to scrape the openai api docs for testing and im constantly getting the following error. Would anyone know how to resolve?
Im using the docker image if it makes any difference.
The text was updated successfully, but these errors were encountered: