Max Seq length for inference #24
Prompting with longer sequences requires sharding the model, which is currently not supported. However, you can generate much longer sequences, up to 500k tokens and beyond, on a single 80 GB GPU. If you'd like to test the model with longer prompts, I recommend Together's API.
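For the long-prompt route via Together's API, a minimal sketch might look like the following (the together Python SDK client calls are real, but that Evo is served under its HuggingFace name there is an assumption, not confirmed in this thread):

import os
from together import Together  # pip install together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])
response = client.completions.create(
    model="togethercomputer/evo-1-131k-base",  # assumed model identifier on Together
    prompt="ACGT" * 25_000,  # a long (~100k-character) nucleotide prompt
    max_tokens=256,
)
print(response.choices[0].text)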
Could you elaborate on how to generate 500k tokens on a single 80 GB GPU? I got OOM on an A100 with a 3 kb sequence. Thank you.
@pan-genome we were able to just use the standard HuggingFace sampling API (e.g., loading with AutoModelForCausalLM.from_pretrained and calling model.generate, as in the example below).
Could you provide a working code example? Thank you.
Something like:

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

# Raise the maximum sequence length beyond the 131k default.
model_config = AutoConfig.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    trust_remote_code=True,
    revision="1.1_fix",
)
model_config.max_seqlen = 500_000

model = AutoModelForCausalLM.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    config=model_config,
    trust_remote_code=True,
    revision="1.1_fix",
    torch_dtype=torch.bfloat16,  # half precision helps long generations fit on one GPU
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    trust_remote_code=True,
    revision="1.1_fix",
)
input_ids = tokenizer("ACGT", return_tensors="pt").input_ids.to(model.device)  # short placeholder prompt

outputs = model.generate(
    input_ids,
    max_new_tokens=500_000,
    do_sample=True,  # needed for temperature/top_k sampling to take effect
    temperature=1.,
    top_k=4,
)
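To recover the generated nucleotide string from the sampled token IDs, something like this should work (a sketch, reusing the tokenizer loaded above):

generated = tokenizer.decode(outputs[0][input_ids.shape[1]:])  # drop the prompt tokens
print(len(generated), generated[:100])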
This issue was moved to a discussion. You can continue the conversation there.
May I ask what the proper range of input sequence lengths is for inference with the evo-1-131k-base model?
I tried a single A100 and got CUDA Out of Memory when the input was a single sequence longer than 1,000 tokens.
Thank you!