
33B and 65B weights? #94

Closed
trevtravtrev opened this issue Mar 21, 2023 · 16 comments

Comments

@trevtravtrev

What would it take to use 33B and 65B weights?

Also, 7B seems to work better than 13B right now.

@tjthejuggler

https://huggingface.co/Pi3141/alpaca-30B-ggml

@EfogDev

EfogDev commented Mar 21, 2023

https://huggingface.co/Pi3141/alpaca-30B-ggml

Is it gonna work just with ./chat -m ggml-model-q4_0.bin or do I need anything else? Thanks!

@sowa705

sowa705 commented Mar 21, 2023

Doesn't seem to work.

llama_model_load: loading model from 'ggml-model-q4_0.bin' - please wait ...
llama_model_load: ggml ctx size = 25631.50 MB
llama_model_load: memory_size =  6240.00 MB, n_mem = 122880
llama_model_load: loading model part 1/4 from 'ggml-model-q4_0.bin'
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
main: failed to load model from 'ggml-model-q4_0.bin'

@1octopus1

1octopus1 commented Mar 21, 2023

(screenshot)

=( Help

@joaops

joaops commented Mar 21, 2023

To work with the 30B model, it is necessary to change lines 34 and 35 of the main.cpp file to the value 1. Originally, the 30B file was divided into 4 parts, just as the 13B file was divided into 2 parts.

// determine number of model parts based on the dimension
static const std::map<int, int> LLAMA_N_PARTS = {
    { 4096, 1 },
    { 5120, 1 },
    { 6656, 4 },
    { 8192, 8 },
};

Change it to:

// determine number of model parts based on the dimension
static const std::map<int, int> LLAMA_N_PARTS = {
    { 4096, 1 },
    { 5120, 1 },
    { 6656, 1 },
    { 8192, 1 },
};

After that, just recompile and run it again.

Credit to user ItsPi3141, who gave this answer in issue #83.
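
In case it helps, the rebuild-and-run sequence looks roughly like this (a sketch assuming the Makefile's chat target and the quantized filename used above):

# rebuild the chat binary after editing LLAMA_N_PARTS
make chat
# run against the single-file 30B quantized model
./chat -m ggml-model-q4_0.bin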

@Green-Sky

Green-Sky commented Mar 21, 2023

You don't need to touch any code for this.

./main -h gives you the following.

  --n_parts N           number of model parts (default: -1 = determine from dimensions)

edit: Actually I was assuming llama.cpp (not this fork)

@trevtravtrev
Author

You don't need to touch any code for this.

./main -h gives you the following.


  --n_parts N           number of model parts (default: -1 = determine from dimensions)

What value do you put?

@Green-Sky

Actually I am assuming llama.cpp (not this fork)

What value do you put?

If it is a single model file, use 1.
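
For example (a sketch for upstream llama.cpp, not this fork; the model path is illustrative):

# single-file model: tell the loader not to infer the part count from the dimension
./main -m ggml-model-q4_0.bin --n_parts 1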

@thatblend

Doesn't seem to work.

llama_model_load: loading model from 'ggml-model-q4_0.bin' - please wait ...
llama_model_load: ggml ctx size = 25631.50 MB
llama_model_load: memory_size =  6240.00 MB, n_mem = 122880
llama_model_load: loading model part 1/4 from 'ggml-model-q4_0.bin'
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
main: failed to load model from 'ggml-model-q4_0.bin'

Getting the same issue :(

@trevtravtrev
Author

Actually I am assuming llama.cpp (not this fork)

What value do you put?

If it is a single model file, 1

Are you saying this method is valid for llama.cpp but not alpaca.cpp?

@Green-Sky

Are you saying this method is valid for llama.cpp but not alpaca.cpp?

yea.

@antimatter15 are there any changes left in your fork that haven't been upstreamed yet?

@trevtravtrev
Author

To work with the 30B model, it is necessary to change lines 34 and 35 of the main.cpp file to the value 1. Originally, the 30B file was divided into 4 parts, just as the 13B file was divided into 2 parts.

// determine number of model parts based on the dimension
static const std::map<int, int> LLAMA_N_PARTS = {
    { 4096, 1 },
    { 5120, 1 },
    { 6656, 4 },
    { 8192, 8 },
};

Change it to:

// determine number of model parts based on the dimension
static const std::map<int, int> LLAMA_N_PARTS = {
    { 4096, 1 },
    { 5120, 1 },
    { 6656, 1 },
    { 8192, 1 },
};

After that, just recompile and run it again.

Credits to the user ItsPi3141, who gave the answer here: Issues 83

Has anyone gotten this 30B model working with the method above yet? If so, how does it compare to the current 7B and 13B weights?

I haven't had a chance to check the implications of the hotfix above on the rest of the source code. Is this a change we could push to main to add support for these larger models?

(I will be testing this method when I get home later)

@trevtravtrev
Author

static const std::map<int, int> LLAMA_N_PARTS = {
    { 4096, 1 },
    { 5120, 1 },
    { 6656, 1 },
    { 8192, 1 },
};

I do believe the author is referring to the chat.cpp file, not main.cpp.

@trevtravtrev
Author

trevtravtrev commented Mar 21, 2023

I've had a chance to implement this method with the 30B weight and test it. It works! Upon initial testing, this model seems very impressive. While I don't have a baseline to test against, I suspect it is performing better than the currently supported 7B and 13B models.
This model is very memory- and CPU-intensive and requires a beefy PC/server to run. It is using roughly 65% of my CPU and 77% of my memory, and it writes output roughly 2-3x slower than the 7B weight, if I had to guess.

My specs are:
CPU: 12th Gen Intel(R) Core(TM) i7-12700KF 3.61 GHz
RAM: 32.0 GB

I don't see any reason why the hotfix above to run the 30B weight, together with documentation in the README, should not be pushed to main.

@antimatter15 if I forked, implemented this feature (support for the 30B weight) including README documentation, and submitted a PR, would you accept it?

@trevtravtrev
Author

The code snippet to add support for the 30B weight has already been merged in #104.

I've just submitted a PR #108 to add support/instructions to the README on how to get the 30B weight running.

@antimatter15 it would be very helpful if you would accept PR #108. A lot of people would love this :)

@MasMedIm

Doesn't seem to work.

llama_model_load: loading model from 'ggml-model-q4_0.bin' - please wait ...
llama_model_load: ggml ctx size = 25631.50 MB
llama_model_load: memory_size =  6240.00 MB, n_mem = 122880
llama_model_load: loading model part 1/4 from 'ggml-model-q4_0.bin'
llama_model_load: llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
main: failed to load model from 'ggml-model-q4_0.bin'

Getting the same issue :(

For this issue, try recompiling the chat binary with: make chat. For me it's working.
