
Max Seq length for inference #24

Closed
JunboShen opened this issue Mar 4, 2024 · 5 comments

@JunboShen

May I ask what the proper range of input sequence lengths is for inference with the evo-1-131k-base model?
I tried using a single A100 and got CUDA Out of Memory when inputting a single sequence longer than 1,000.
Thank you!

@Zymrael (Collaborator) commented Mar 6, 2024

Prompting with longer sequences requires sharding the model, which is currently not supported. However, you can generate much longer outputs, up to 500k tokens and beyond, on a single 80 GB GPU.

If you'd like to test the model with longer prompts, I recommend Together's API.
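A minimal sketch of what a call through Together's API might look like with the together Python client (the model name on the API, and its availability there, are assumptions on my part):

from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.completions.create(
    model='togethercomputer/evo-1-131k-base',  # hypothetical API model name
    prompt='ACGT' * 500,  # stand-in for a long nucleotide prompt
    max_tokens=1024,
    temperature=1.0,
)
print(response.choices[0].text)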

@pan-genome

> Prompting with longer sequences requires sharding the model, which is currently not supported. However, you can generate much longer outputs, up to 500k tokens and beyond, on a single 80 GB GPU.
>
> If you'd like to test the model with longer prompts, I recommend Together's API.

Could you elaborate on how to generate 500k tokens on a single 80 GB GPU? I got OOM on an A100 with a 3 kb sequence. Thank you!

@brianhie (Collaborator) commented Jun 7, 2024

@pan-genome we were able to just use the standard Hugging Face sampling API (e.g., loading with AutoModelForCausalLM.from_pretrained() and sampling with model.generate()) to generate 500k+ tokens on an 80 GB GPU.

@pan-genome

> @pan-genome we were able to just use the standard Hugging Face sampling API (e.g., loading with AutoModelForCausalLM.from_pretrained() and sampling with model.generate()) to generate 500k+ tokens on an 80 GB GPU.

Could you provide a working code example? Thank you!

@brianhie (Collaborator)

Something like the following. (I've added the imports, a tokenizer, and an example prompt so it runs end to end; the fp16 and device_map settings are just one way to fit a long generation on a single GPU, not a requirement.)

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Raise the maximum sequence length so a 500k-token generation fits.
model_config = AutoConfig.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    trust_remote_code=True,
    revision="1.1_fix",
)
model_config.max_seqlen = 500_000

model = AutoModelForCausalLM.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    config=model_config,
    trust_remote_code=True,
    revision="1.1_fix",
    torch_dtype=torch.float16,  # half precision helps long generations fit in 80 GB
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    'togethercomputer/evo-1-131k-base',
    trust_remote_code=True,
    revision="1.1_fix",
)
input_ids = tokenizer('ACGT', return_tensors='pt').input_ids.to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=500_000,
    do_sample=True,  # temperature and top_k only take effect when sampling
    temperature=1.,
    top_k=4,
)
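To turn the generated tokens back into a nucleotide string afterwards, something like this should work (a sketch, assuming the byte-level tokenizer loaded above):

# Decode only the newly generated tokens, skipping the prompt.
generated = tokenizer.decode(outputs[0, input_ids.shape[1]:])
print(len(generated), generated[:100])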

@evo-design evo-design locked and limited conversation to collaborators Jun 21, 2024
@brianhie brianhie converted this issue into discussion #73 Jun 21, 2024
