Better inference based on starcoder2-3b model #13

Open
HeroSong666 opened this issue Mar 12, 2024 · 1 comment

@HeroSong666

I am new to StarCoder2.

When I run the following demo:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# local checkpoint directory for bigcode/starcoder2-3b
checkpoint = "./starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

inputs = tokenizer.encode("def is_prime(n):", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

It returns:

def is_prime():
    """
    This function checks if a number is prime or not.
    """

It doesn't finish, so I set max_length=120, and then it returns:

def is_prime():
    """
    This function checks if a number is prime or not.
    """
    num = int(input("Enter a number: "))
    if num > 1:
        for i in range(2, num):
            if (num % i) == 0:
                print(num, "is not a prime number")
                break
        else:
            print(num, "is a prime number")
    else:
        print(num, "is not a prime number")


is_prime()
<file_sep>/README.md
# Python-

The part

is_prime()
<file_sep>/README.md
# Python-

is redundant. Now my solution is:

generated_code = tokenizer.decode(outputs[0])
# keep only the text before the first <file_sep> marker
if "<file_sep>" in generated_code:
    generated_code = generated_code.split("<file_sep>")[0]
print(generated_code)

But I don't think that is a good idea. I want the model to return the result in one go, without generating redundant parts. How can I do that? Could you give me some advice?
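
One way to get this in a single call, assuming <file_sep> is a special token in the StarCoder2 tokenizer (it appears in the generated output as a single marker), is to pass its id as an additional end-of-sequence id so that generate stops as soon as it is produced. A minimal sketch:

# sketch: treat <file_sep> as an extra end-of-sequence token,
# assuming it maps to a single id in the StarCoder2 vocabulary
file_sep_id = tokenizer.convert_tokens_to_ids("<file_sep>")

outputs = model.generate(
    inputs,
    max_new_tokens=120,
    eos_token_id=[tokenizer.eos_token_id, file_sep_id],
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With skip_special_tokens=True the marker itself is also dropped from the decoded text, so no post-hoc string splitting is needed.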

@HeroSong666 (Author)

Alternatively, I noticed that on https://huggingface.co/bigcode/starcoder2-3b the Inference API can generate code piece by piece, each time I press Compute. How can I implement such functionality?
(For example, in Python: every time I send a request, the model returns a portion of the result. The next time I send a request, it continues from the previous prompt plus the previously returned result. In this way, the code can be completed step by step without creating redundant parts.)
Many thanks for your advice!
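
A minimal sketch of that step-by-step pattern, assuming the same tokenizer and model as in the first comment (the chunk size of 32 tokens and the round limit of 5 are illustrative choices, not part of the thread):

# sketch: extend the completion a fixed number of tokens per round,
# feeding prompt + everything generated so far back in each round
prompt = "def is_prime(n):"
file_sep_id = tokenizer.convert_tokens_to_ids("<file_sep>")  # assumed special token

for _ in range(5):  # up to 5 rounds of 32 new tokens each
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        eos_token_id=[tokenizer.eos_token_id, file_sep_id],
    )
    # the decoded output becomes the prompt for the next round
    prompt = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if outputs[0][-1].item() in (tokenizer.eos_token_id, file_sep_id):
        break  # the model finished the snippet on its own

print(prompt)

Each round re-encodes the full text so far, so this trades extra compute for the ability to inspect or stop the completion between chunks, much like pressing Compute repeatedly in the Inference API widget.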
