Support for LLMLingua #1065

Open
TechnotechGit opened this issue Jan 5, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@TechnotechGit

microsoft/LLMLingua looks like an interesting project. It's essentially (lossy) prompt compression, and it currently works with any HF model, including GPTQ-quantized ones. I think it would be worth supporting llama.cpp via llama-cpp-python, since prompt compression would benefit both CPU and GPU users, especially when run alongside llama.cpp itself.

I was trying to wire up llama-cpp-python for inference, but got stuck on LLMLingua needing an attention mask (perhaps I missed something). Any ideas on how to go about this?
microsoft/LLMLingua#41

@abetlen abetlen added the enhancement New feature or request label Jan 5, 2024
@abetlen
Owner

abetlen commented Jan 5, 2024

Hey @TechnotechGit, yes, I'd be happy to help. Do you have an outline of some existing code and the requirements of the method? Currently you should be able to extract all of the token logits from the transformer for a given prompt, but I'm not sure if anything else is required for LLMLingua.
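
A minimal sketch of that logit extraction with llama-cpp-python (an illustrative example, not code from this thread; the GGUF path is a placeholder and it assumes the model is loaded with logits_all=True so per-token logits are kept):

import numpy as np
from llama_cpp import Llama

# Keep logits for every prompt position, not just the last one.
llm = Llama(model_path="./model.gguf", logits_all=True)

tokens = llm.tokenize(b"The quick brown fox")
llm.reset()
llm.eval(tokens)

# One row of vocabulary logits per evaluated token.
logits = np.array(llm.scores[: len(tokens)])
print(logits.shape)  # (len(tokens), n_vocab)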

@TechnotechGit
Author

On the model side, it seems that only the attention mask is needed:

# HF version: incremental forward pass over the new token slice,
# reusing cached keys/values from previously processed tokens
response = self.model(
    input_ids[:, past_length:end],
    attention_mask=attention_mask[:, :end],
    past_key_values=past_key_values,
    use_cache=True,
)
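
For comparison, a rough llama-cpp-python counterpart might look like the sketch below (llm, input_ids, past_length, and end are assumed to mirror the HF variables; there is no attention_mask parameter, since llama.cpp evaluates the unpadded prompt directly and keeps its own KV cache):

# Sketch of an assumed llama.cpp counterpart: the internal KV cache plays the
# role of past_key_values, so only the new token slice is fed in.
llm.eval(input_ids[0, past_length:end].tolist())
logits = llm.scores[past_length:end]  # needs logits_all=True at load time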

I think everything else in that call is fine.
On the tokeniser side, I can mimic the HF calls, so that's not a problem, but it again seems to require an attention mask:

attention_mask = tokenized_text["attention_mask"].to(self.device)

Unfortunately I don't know what attention masks look like at the low level, so I'm not sure whether this would be a big change. I might be able to look into it.
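
One way to mimic the HF tokenizer output could be a small wrapper like the sketch below (the class name and the all-ones mask are assumptions; the mask can be all ones only because the prompt here is a single unpadded sequence):

import torch
from llama_cpp import Llama

class LlamaCppTokenizerWrapper:
    # Sketch of an HF-style tokenizer facade over llama-cpp-python.
    def __init__(self, llm: Llama):
        self.llm = llm

    def __call__(self, text: str, return_tensors: str = "pt"):
        ids = self.llm.tokenize(text.encode("utf-8"))
        input_ids = torch.tensor([ids], dtype=torch.long)
        # Single unpadded sequence, so every position is attended to.
        attention_mask = torch.ones_like(input_ids)
        return {"input_ids": input_ids, "attention_mask": attention_mask}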

@TechnotechGit
Author

@abetlen I've been having some trouble retrieving logits; if you have any experience with transformers, do you know whether the logprobs returned by llama-cpp-python are the same as those from transformers? (I'm trying to figure out whether it's an issue on my end, since I am getting differently shaped tensors.)
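
For reference, one likely source of the shape mismatch: transformers returns logits shaped (batch, seq_len, vocab_size), while llama-cpp-python's scores array is two-dimensional, (n_tokens, n_vocab), and is only populated per token when the model is loaded with logits_all=True. A rough sketch of the comparison (model name and GGUF path are placeholders):

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_cpp import Llama

# HF side: logits carry a leading batch dimension.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
enc = tok("hello world", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)
print(out.logits.shape)  # (batch, seq_len, vocab_size)

# llama-cpp-python side: a flat (n_tokens, n_vocab) array, no batch dimension.
llm = Llama(model_path="./model.gguf", logits_all=True)
toks = llm.tokenize(b"hello world")
llm.eval(toks)
print(np.array(llm.scores[: len(toks)]).shape)  # (len(toks), n_vocab)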
