FIX: Many fixes for huggingface and llama2 inference #335
Conversation
@@ -5,7 +5,7 @@ module Completions
   module Dialects
     class Llama2Classic
       def self.can_translate?(model_name)
-        "Llama2-*-chat-hf" == model_name
+        %w[Llama2-*-chat-hf Llama2-chat-hf].include?(model_name)
This matches the style used by our summarization strategy setting.
lib/completions/endpoints/base.rb (outdated)
@@ -90,6 +90,7 @@ def perform_completion!(prompt, user, model_params = {})

             begin
               partial = extract_completion_from(raw_partial)
+              next if partial.blank?
You get a blank partial from a non-blank raw_partial when streaming and receiving a special response, like the last message in a stream from HuggingFace.
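For illustration, a minimal standalone Ruby sketch (not the plugin's actual code) of why the guard is needed: the last streamed message carries a special end-of-stream token whose text is empty, so extracting it yields a blank partial from a non-blank chunk.

```ruby
require "json"

# Hypothetical extractor mirroring the shape of a TGI-style stream message;
# the final message has an empty token text plus the full generated_text.
def extract_text(raw_partial)
  parsed = JSON.parse(raw_partial.delete_prefix("data:"), symbolize_names: true)
  parsed.dig(:token, :text).to_s
end

chunks = [
  'data:{"token":{"text":"Hello"}}',
  'data:{"token":{"text":" world"}}',
  'data:{"token":{"text":"","special":true},"generated_text":"Hello world"}',
]

chunks.each do |raw_partial|
  partial = extract_text(raw_partial)
  next if partial.empty? # the check this PR adds (blank? in Rails), skipping the empty final partial

  print partial
end
# prints "Hello world"
```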
@@ -5,7 +5,7 @@ module Completions
   module Endpoints
     class HuggingFace < Base
       def self.can_contact?(model_name)
-        %w[StableBeluga2 Upstage-Llama-2-*-instruct-v2 Llama2-*-chat-hf].include?(model_name)
+        %w[StableBeluga2 Upstage-Llama-2-*-instruct-v2 Llama2-*-chat-hf Llama2-chat-hf].include?(model_name)
This matches the style used by our summarization strategy setting.
-        URI(SiteSetting.ai_hugging_face_api_url).tap do |uri|
-          uri.path = @streaming_mode ? "/generate_stream" : "/generate"
-        end
+        URI(SiteSetting.ai_hugging_face_api_url)
Moving to headers instead of the path makes this compatible with both self-hosting and the hosted API.
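For a rough sense of the change, here is a sketch with assumed values (not the PR's exact code): instead of mutating the URI path per mode, the endpoint keeps one base URI and expresses the streaming choice in the request itself.

```ruby
require "uri"

# Hypothetical settings for illustration only.
api_url = "https://tgi.example.com"
streaming_mode = true

# Before: the mode was encoded in the path, which only works when you control
# the server routes (a self-hosted TGI exposes /generate and /generate_stream).
old_uri = URI(api_url).tap { |uri| uri.path = streaming_mode ? "/generate_stream" : "/generate" }

# After: a single URI for both modes; whether to stream is signalled in the
# request (payload flag or header), which a hosted endpoint can also accept.
new_uri = URI(api_url)
payload = { inputs: "Hello", parameters: { max_new_tokens: 32 } }
payload[:stream] = true if streaming_mode # assumed flag, shown only to convey the idea

puts old_uri # => https://tgi.example.com/generate_stream
puts new_uri # => https://tgi.example.com
```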
         payload[:parameters][:max_new_tokens] = token_limit - prompt_size(prompt)

+        if @streaming_mode
Moving to headers instead of the path makes this compatible with both self-hosting and the hosted API.
@@ -56,15 +58,15 @@ def extract_completion_from(response_raw)

             parsed.dig(:token, :text).to_s
           else
-            parsed[:generated_text].to_s
+            parsed[0][:generated_text].to_s
A consequence of the move to the header for streaming is this slight change in the response format.
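To make the format change concrete (illustrative payloads, not captured from the API): the non-streaming body is now an array with a single element, so the generated text lives at index 0.

```ruby
require "json"

# Illustrative payloads only.
old_body = '{"generated_text": "Hello world"}'   # shape the old code expected
new_body = '[{"generated_text": "Hello world"}]' # shape handled after this change

puts JSON.parse(old_body, symbolize_names: true)[:generated_text]
puts JSON.parse(new_body, symbolize_names: true)[0][:generated_text]
# Calling parsed[:generated_text] on the new array form would raise a TypeError.
```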
           end
         end

         def partials_from(decoded_chunk)
           decoded_chunk
             .split("\n")
             .map do |line|
-              data = line.split("data: ", 2)[1]
+              data = line.split("data:", 2)[1]
This bug was introduced when porting from the old inference class.
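A quick illustration of the bug, using a sample line only: when the server emits `data:` with no space after the colon, splitting on `"data: "` never matches and the payload comes back nil.

```ruby
line = 'data:{"token":{"text":"Hi"}}' # sample streamed line, no space after the colon

p line.split("data: ", 2)[1] # => nil, the old separator with a trailing space never matches
p line.split("data:", 2)[1]  # => "{\"token\":{\"text\":\"Hi\"}}", the fixed separator
```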
@@ -22,11 +22,6 @@ def self.perform!(
       raise CompletionFailed if model.blank?

-      url = URI(SiteSetting.ai_hugging_face_api_url)
-      if block_given?
Moving to headers instead of the path makes this compatible with both self-hosting and the hosted API.
@@ -45,6 +40,10 @@ def self.perform!(
       parameters[:max_new_tokens] = token_limit - prompt_size
       parameters[:temperature] = temperature if temperature
       parameters[:repetition_penalty] = repetition_penalty if repetition_penalty

+      if block_given?
Moving to headers instead of the path makes this compatible with both self-hosting and the hosted API.
       parsed_response = JSON.parse(response_body, symbolize_names: true)

       log.update!(
         raw_response_payload: response_body,
         request_tokens: tokenizer.size(prompt),
-        response_tokens: tokenizer.size(parsed_response[:generated_text]),
+        response_tokens: tokenizer.size(parsed_response.first[:generated_text]),
This is a consequence of moving to headers for streaming.