FIX: Many fixes for huggingface and llama2 inference #335

Merged: xfalcox merged 5 commits into main from fixes-for-huggingface-and-llama on Dec 6, 2023

Conversation

xfalcox (Member) commented on Dec 5, 2023

No description provided.

@@ -5,7 +5,7 @@ module Completions
   module Dialects
     class Llama2Classic
       def self.can_translate?(model_name)
-        "Llama2-*-chat-hf" == model_name
+        %w[Llama2-*-chat-hf Llama2-chat-hf].include?(model_name)
xfalcox (Member, Author):

Matches the style used by our summarization strategy setting.
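
Worth noting: the `*` in these names is matched literally by `include?`, not expanded as a glob. A minimal sketch, assuming the plugin's top-level `DiscourseAi` namespace:

```ruby
dialect = DiscourseAi::Completions::Dialects::Llama2Classic

dialect.can_translate?("Llama2-*-chat-hf")  # => true, the "*" is a literal character
dialect.can_translate?("Llama2-chat-hf")    # => true, newly accepted by this PR
dialect.can_translate?("Llama2-7b-chat-hf") # => false, no glob expansion happens
```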

@@ -90,6 +90,7 @@ def perform_completion!(prompt, user, model_params = {})

       begin
         partial = extract_completion_from(raw_partial)
+        next if partial.blank?
xfalcox (Member, Author):

You can get a blank partial from a non-blank raw_partial when streaming and receiving a special response, like the last message in a stream from Hugging Face.
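
A minimal sketch of that failure mode, assuming a final TGI-style chunk whose per-token text is empty (in the plugin `blank?` comes from ActiveSupport; plain `empty?` stands in here):

```ruby
require "json"

# Assumed shape of a final streaming chunk: the per-token text is empty
# because the chunk carries the aggregate generated_text instead.
raw_partial = { token: { text: "" }, generated_text: "full reply" }.to_json

parsed = JSON.parse(raw_partial, symbolize_names: true)
partial = parsed.dig(:token, :text).to_s

partial.empty? # => true, so the guard skips yielding an empty delta
```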

@@ -5,7 +5,7 @@ module Completions
   module Endpoints
     class HuggingFace < Base
       def self.can_contact?(model_name)
-        %w[StableBeluga2 Upstage-Llama-2-*-instruct-v2 Llama2-*-chat-hf].include?(model_name)
+        %w[StableBeluga2 Upstage-Llama-2-*-instruct-v2 Llama2-*-chat-hf Llama2-chat-hf].include?(model_name)
xfalcox (Member, Author):

Matches the style used by our summarization strategy setting.

-      URI(SiteSetting.ai_hugging_face_api_url).tap do |uri|
-        uri.path = @streaming_mode ? "/generate_stream" : "/generate"
-      end
+      URI(SiteSetting.ai_hugging_face_api_url)
xfalcox (Member, Author):

Moving the streaming signal to headers instead of the URL path makes this compatible with both self-hosting and the hosted API.
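
A rough sketch of what a header-based streaming request could look like; the exact header the PR uses isn't visible in this excerpt, so the `Accept: text/event-stream` signal and the URL are assumptions:

```ruby
require "net/http"
require "json"
require "uri"

streaming = true
# Illustrative hosted endpoint; a self-hosted TGI base URL works the same way.
uri = URI("https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-chat-hf")

request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
# Assumed streaming signal: the path no longer changes, which is what lets
# self-hosted TGI and the hosted API share one code path.
request["Accept"] = "text/event-stream" if streaming
request.body = { inputs: "Hello", parameters: { max_new_tokens: 256 } }.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
```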


       payload[:parameters][:max_new_tokens] = token_limit - prompt_size(prompt)

+      if @streaming_mode
xfalcox (Member, Author):

Moving the streaming signal to headers instead of the URL path makes this compatible with both self-hosting and the hosted API.
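
Unrelated to the header move, the max_new_tokens line in this hunk budgets the response as whatever remains of the context window after the prompt; a worked example with assumed numbers:

```ruby
token_limit = 4_096 # assumed context window for the model
prompt_tokens = 700 # assumed size of the tokenized prompt

max_new_tokens = token_limit - prompt_tokens # => 3396 tokens left for generation
```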

@@ -56,15 +58,15 @@ def extract_completion_from(response_raw)

           parsed.dig(:token, :text).to_s
         else
-          parsed[:generated_text].to_s
+          parsed[0][:generated_text].to_s
xfalcox (Member, Author):

A consequence of the move to headers for streaming is this slight response format change.
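
For reference, a small sketch of the two response shapes; the array wrapping matches how the hosted Inference API returns non-streamed results:

```ruby
require "json"

# Old /generate endpoint: a single JSON object.
old_response = JSON.parse('{"generated_text":"hello"}', symbolize_names: true)
old_response[:generated_text]    # => "hello"

# Base endpoint: an array of results, hence the new parsed[0] indexing.
new_response = JSON.parse('[{"generated_text":"hello"}]', symbolize_names: true)
new_response[0][:generated_text] # => "hello"
```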

         end
       end

       def partials_from(decoded_chunk)
         decoded_chunk
           .split("\n")
           .map do |line|
-            data = line.split("data: ", 2)[1]
+            data = line.split("data:", 2)[1]
xfalcox (Member, Author):

Bug introduced when porting from the old inference class.
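
A quick illustration of the bug: the SSE format makes the space after the colon optional, so a server emitting `data:` with no space never matched the old separator:

```ruby
line = 'data:{"token":{"text":"Hi"}}'

line.split("data: ", 2)[1] # => nil, the old separator assumed a trailing space
line.split("data:", 2)[1]  # => '{"token":{"text":"Hi"}}'

# When a space is present, JSON.parse tolerates the leading whitespace,
# so splitting on "data:" handles both variants.
```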

@@ -22,11 +22,6 @@ def self.perform!(
       raise CompletionFailed if model.blank?

       url = URI(SiteSetting.ai_hugging_face_api_url)
-      if block_given?
xfalcox (Member, Author):

Moving the streaming signal to headers instead of the URL path makes this compatible with both self-hosting and the hosted API.

@@ -45,6 +40,10 @@ def self.perform!(
       parameters[:max_new_tokens] = token_limit - prompt_size
       parameters[:temperature] = temperature if temperature
       parameters[:repetition_penalty] = repetition_penalty if repetition_penalty

+      if block_given?
xfalcox (Member, Author):

Moving the streaming signal to headers instead of the URL path makes this compatible with both self-hosting and the hosted API.

       parsed_response = JSON.parse(response_body, symbolize_names: true)

       log.update!(
         raw_response_payload: response_body,
         request_tokens: tokenizer.size(prompt),
-        response_tokens: tokenizer.size(parsed_response[:generated_text]),
+        response_tokens: tokenizer.size(parsed_response.first[:generated_text]),
xfalcox (Member, Author):

A consequence of moving to headers for streaming: the non-streaming response is now array-wrapped.

xfalcox merged commit d8267d8 into main on Dec 6, 2023.
xfalcox deleted the fixes-for-huggingface-and-llama branch on December 6, 2023 at 14:22.