Question about the Q8_0 quants #79

@city96 I noticed that the tensors in the Flux dev and schnell Q8_0 GGUFs are stored as F16/Q8_0, but shouldn't they be F32/Q8_0?

Flux in Q8_0:
![image](https://github.com/user-attachments/assets/4a0d6f16-882e-45cd-a3ad-c7e1e7624b1c)

Flux in Q6_K:
![image](https://github.com/user-attachments/assets/cdf281f4-6c9b-41ff-afa5-8032ad021167)

Here's an example of a Llama3.1 gguf in Q8_0:
![image](https://github.com/user-attachments/assets/23a1fe46-3998-4db0-b02f-dc0aaf40d028)

I also checked your T5 Q8_0 GGUF, and it uses F32 and Q8_0. Is there a reason the dev/schnell quants are in F16/Q8_0 instead of F32/Q8_0?
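
For anyone who wants to check this without the screenshots, here is a minimal sketch that tallies the storage type of every tensor in a GGUF file. It assumes the `gguf` Python package from llama.cpp is installed (`pip install gguf`); the file path is whatever you pass on the command line, and the helper names `count_tensor_types` and `tensor_type_names` are just illustrative.

```python
import sys
from collections import Counter


def count_tensor_types(type_names):
    """Count how many tensors use each storage type (e.g. F16, F32, Q8_0)."""
    return Counter(type_names)


def tensor_type_names(path):
    """Return the storage type name of every tensor in a GGUF file."""
    # Imported lazily so the counting helper works even without the package.
    # Assumption: the `gguf` package from llama.cpp is installed.
    from gguf import GGUFReader

    reader = GGUFReader(path)
    # tensor_type is a GGMLQuantizationType enum member; .name is e.g. "Q8_0"
    return [tensor.tensor_type.name for tensor in reader.tensors]


if __name__ == "__main__" and len(sys.argv) > 1:
    counts = count_tensor_types(tensor_type_names(sys.argv[1]))
    for type_name, n in counts.most_common():
        print(f"{type_name}: {n}")
```

Running it on the Flux Q8_0 file and on the T5 Q8_0 file should show whether the non-quantized tensors are stored as F16 or F32 in each.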
