add falcon7b example #231

Closed · wants to merge 3 commits

Conversation

@apage43 (Contributor) commented Jun 6, 2023

#217
Adapted from the gpt-neox example and from work started in ggerganov/llama.cpp#1602.

Only supports 7B right now; 40B's multi-query attention gets hairier, as it uses 128 query heads with 8 K/V heads, as opposed to 7B's 71 query heads with a single K/V head.
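
To make the head bookkeeping concrete, here is a minimal numpy sketch of 7B-style multi-query attention (one shared K/V head broadcast across all query heads); the shapes are illustrative only and this is not the ggml implementation:

# Toy MQA sketch: 71 query heads share a single K/V head (Falcon-7B layout).
import numpy as np

n_head, head_dim, seq = 71, 64, 8
q = np.random.randn(n_head, seq, head_dim)
k = np.random.randn(1, seq, head_dim)   # single shared key head
v = np.random.randn(1, seq, head_dim)   # single shared value head

# Broadcast the lone K/V head across all query heads (what the Python
# reference does with torch.broadcast_to, and what ggml emulates with copies).
k_b = np.broadcast_to(k, q.shape)
v_b = np.broadcast_to(v, q.shape)

scores = q @ k_b.transpose(0, 2, 1) / np.sqrt(head_dim)   # (71, seq, seq)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                  # softmax
out = weights @ v_b                                         # (71, seq, head_dim)

For 40B the single K/V head becomes 8 K/V heads shared by groups of 16 query heads, which is where the ordering problems discussed further below come from.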

@AndriyMulyar

Nice!

@TheBloke (Contributor) commented Jun 6, 2023

Well done guys! Really excited for this

@ggerganov (Owner)

Cool - will take a look soon
Is this using actual MQA, or is it still doing the trick with the copies?

@apage43 (Contributor, Author) commented Jun 7, 2023

> Cool - will take a look soon. Is this using actual MQA, or is it still doing the trick with the copies?

It does copies with ggml_repeat presently. I also wound up fairly hackily creating a "dummy" tensor of the target shape (since there wasn't one already handy to use), and not transposing V before storing it in the KV cache, to deal with having to repeat the Vs.

@cmp-nct commented Jun 7, 2023

Are you working on a 40B branch already?

@apage43 (Contributor, Author) commented Jun 7, 2023

> Are you working on a 40B branch already?

I'm not presently - since it's big, it's a bit more inconvenient to hack on, as I'd need a bigger machine than the one I usually use for dev stuff.

@cmp-nct commented Jun 7, 2023

> Are you working on a 40B branch already?
>
> I'm not presently - since it's big, it's a bit more inconvenient to hack on, as I'd need a bigger machine than the one I usually use for dev stuff.

Did you see https://huggingface.co/jploski/falcon40b-mini-shakespeare ?

@apage43 (Contributor, Author) commented Jun 7, 2023

> Are you working on a 40B branch already?
>
> I'm not presently - since it's big, it's a bit more inconvenient to hack on, as I'd need a bigger machine than the one I usually use for dev stuff.
>
> Did you see https://huggingface.co/jploski/falcon40b-mini-shakespeare ?

I have now :) I still probably won't get to it soon, but if someone figures out how to support that before this lands, I'm happy to incorporate it.

@jploski (Contributor) commented Jun 10, 2023

I did some work regarding 40B support today: 27cf1ad

After making my head nearly explode several times, I reached a point where it generates okay-sounding prose from the falcon40b-mini-shakespeare model, but it does not match the Python version's output exactly, as it should (and as it does for the 7B version).

The main obstacle seems to be that I am unable to make ggml_repeat broadcast multiple keys the way "k = torch.broadcast_to(k, q.shape)" does in Python (I get "1,2,1,2" instead of "1,1,2,2", so to speak).
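
For illustration, the two orderings correspond to numpy's tile-style versus repeat-style replication (a toy sketch, not ggml code):

import numpy as np

k = np.array([1, 2])        # stand-ins for the two key heads
print(np.tile(k, 2))        # [1 2 1 2]  -- the interleaved ordering reported from ggml_repeat
print(np.repeat(k, 2))      # [1 1 2 2]  -- the ordering the broadcast needs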

Another big problem is that I only got the query matrix to look like the original Python one through some brute-force offset calculations and copying of subvectors. It probably won't scale at all. I'm under the impression that what needs to be done there can't be achieved using just reshape or view operations. The memory format (as stored in Python and written by the conversion script) seems to be very difficult to work with in GGML.

Or maybe I'm just too inexperienced in this tensor wrestling... Once again giving up in the hope that someone with more of a clue can pick it up.

@jploski (Contributor) commented Jun 11, 2023

> I did some work regarding 40B support today: 27cf1ad

As a further explanation of the code and of where the complexity comes from, here's a visualization of the fused_kqv weights format (from the falcon40b-mini-shakespeare config): https://docs.google.com/spreadsheets/d/1FoM6pIUj23GMW4zO_G1hjmEnUacBxBKN/edit?usp=sharing&ouid=111096390735143611797&rtpof=true&sd=true

@KerfuffleV2

Maybe just make your own repeat operation that works the way you need? It seems like the repeat op is only implemented for float32, so there's just one version of the function required.

You could create a new op and just cut-and-paste the existing _repeat functions:

ggml/src/ggml.c, line 8773 (at f52d2a0):

static void ggml_compute_forward_repeat_f32(

The function looks relatively simple also.

@jploski (Contributor) commented Jun 13, 2023

> Maybe just make your own repeat operation that works the way you need? It seems like the repeat op is only implemented for float32, so there's just one version of the function required.

I added a new ggml_repeat2 function as suggested (3352043), although the original ggml_repeat also has a backwards pass and I'm not sure whether what I added matches it.

With some more tweaks (committed in 3bc786b) I now have a version which works with all falcon-mini-shakespeare models I have unleashed upon this world (both 7B and 40B configs). At least in 32-bit; I haven't tested quantized yet. The (known) remaining problem is the for-loop-based splitting of query heads. I suspect it's going to blow up with a really big model, either being slow or exceeding the max number of tensors (4096) allowed by GGML (or both).

(Also, it's possible that the implementation does some unnecessary operations like permutes or 4D instead of 3D, but that's minor.)

bin/falcon -m /mnt/seagate/miniconda3/falcon40b/falcon40b-mini-shakespeare/ggml-model--f32.bin --top_p 1 --top_k 1 -s 42 -p "When we loop"

When we loop, and for his head,
And in his head's head's face,
And yet with his head's head is to him;
And now, in this land's face,
And with his head by his head he will die.

I tend to agree, that's almost what happened to me.

@maccam912 commented Jun 13, 2023

> With some more tweaks (committed in 3bc786b) I...

Ok I've been too afraid to ask, but how on earth are you doing these commits that aren't on any branch at all? I wanted to clone the repo and check out the commit but I have no idea how to.

@jploski (Contributor) commented Jun 13, 2023

> With some more tweaks (committed in 3bc786b) I...
>
> Ok I've been too afraid to ask, but how on earth are you doing these commits that aren't on any branch at all? I wanted to clone the repo and check out the commit but I have no idea how to.

Sorry for the confusion - these commits belong to the falcon40b branch of my fork: https://github.com/jploski/ggml/tree/falcon40b - apparently GitHub is not clever enough to indicate their source.

@KerfuffleV2

@jploski

I was able to convert the real 40B model with my change here to reduce memory during HF conversion (only loads a single part into RAM at a time): jploski#1

It required some work to get inference to actually run. I had to increase ctx_size:

ctx_size += ((size_t)3) * 1024 * 1024 * 1024;

Also, uhh... GGML_MAX_NODES at 4096 didn't quite cut it. Nor did 65535; I eventually just set it to 262144 and was able to run the model. Unfortunately, the output didn't make much sense:

main: seed = 1686733539
falcon_model_load: loading model from '/home/nope/personal/ai/models/falc40b.ggml' - please wait ...
falcon_model_load: n_vocab   = 65024
falcon_model_load: n_embd    = 8192
falcon_model_load: n_head    = 128
falcon_model_load: n_head_kv = 8
falcon_model_load: n_layer   = 60
falcon_model_load: ftype     = 2008
falcon_model_load: qntvr     = 2
falcon_model_load: ggml ctx size = 28175.96 MB
falcon_model_load: memory_size =   480.00 MB, n_mem = 122880
falcon_model_load: ............................................................ done
falcon_model_load: model size = 27436.06 MB / num tensors = 484
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 10
main: token[0] =   7107, Once
main: token[1] =   2918,  upon
main: token[2] =    241,  a
main: token[3] =    601,  time
main: token[4] =     23, ,
main: token[5] =    629,  there
main: token[6] =    398,  was
main: token[7] =    241,  a
main: token[8] =   1278,  little
main: token[9] =  27224,  fox

Once upon a time, there was a little fox and, I’re
'  to’ ' .
 it that,. is
,, of . for.' '- you,. we the- the
1 of a
. the

Although it didn't work, even with the crazy number of nodes it wasn't really that slow. It was about the same as a 65B Q4_K_M LLaMA model with llama.cpp.

The mini-Shakespeare model seems fine:

main: seed = 1686733831
falcon_model_load: loading model from '/home/nope/personal/ai/models/falcsp.ggml' - please wait ...
falcon_model_load: n_vocab   = 65024
falcon_model_load: n_embd    = 256
falcon_model_load: n_head    = 4
falcon_model_load: n_head_kv = 2
falcon_model_load: n_layer   = 4
falcon_model_load: ftype     = 2009
falcon_model_load: qntvr     = 2
falcon_model_load: ggml ctx size = 3105.91 MB
falcon_model_load: memory_size =     8.00 MB, n_mem = 8192
falcon_model_load: .... done
falcon_model_load: model size =    25.89 MB / num tensors = 36
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 1
main: token[0] =   4031, Now

Now, Clarence, my lord, I am a
the great men: I will to do this day you
In time they may live in men of tears are
Shall be not what we have fought in. What is this
and come to you? I have not made mine eyes,
Which now sent for, or I am so fast?
Your friends shall be revenged on thee, hoar!
And that you must, sirs, that you must do,
My friend to thee that news, with your love,
My father's wife and love for this day.
You are not hot, lords, and what I am not?
To take this, good sweet friend, I am not my life,
I warrant, as I, to have a little thing, my lord,
What you can stay with this good night do you all your tongue?
O, if not my fair soul to my brother, how well,
Where is

main: mem per token =   290292 bytes
main:     load time =   266.82 ms
main:   sample time =    64.16 ms
main:  predict time =   240.96 ms / 1.20 ms per token
main:    total time =   576.08 ms

Both models were quantized to Q5_0.

@jploski (Contributor) commented Jun 14, 2023

> @jploski
>
> I was able to convert the real 40B model with my change here to reduce memory during HF conversion (only loads a single part into RAM at a time): jploski#1
>
> It required some work to get inference to actually run. I had to increase ctx_size:
>
> ctx_size += ((size_t)3) * 1024 * 1024 * 1024;
>
> Also, uhh... GGML_MAX_NODES at 4096 didn't quite cut it. Nor did 65535; I eventually just set it to 262144 and was able to run the model. Unfortunately, the output didn't make much sense:

Thanks for checking! I was able to reproduce the wrong output using an unquantized mini version trained with n_embd = 1024, n_head = 128, n_head_kv = 8. So there must still be a bug somewhere, which the previous three configs I used for testing did not catch.

@KerfuffleV2

If the problem is the complicated logic for dealing with the query heads, maybe the easiest way to deal with that is in the conversion tool, from the Torch or numpy side. It should be relatively easy to shuffle things around at that point.

Reducing the complexity would make issues easier to debug too, I guess.
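
As a hedged sketch of that suggestion (toy sizes; it assumes the fused query_key_value rows are grouped per KV group as [q_0 .. q_{n-1}, k, v], mirroring the _split_heads view in the HF Falcon modelling code, so verify against a real checkpoint before relying on it), the conversion-time reshuffle could look roughly like this:

import numpy as np

# Toy dimensions standing in for 40B's n_head=128, n_head_kv=8, head_dim=64.
n_head, n_head_kv, head_dim, n_embd = 8, 2, 4, 32
n_q_per_group = n_head // n_head_kv          # query heads sharing one K/V pair

fused = np.random.randn((n_head + 2 * n_head_kv) * head_dim, n_embd)

# View the output rows as (kv_group, [q_0..q_{n-1}, k, v], head_dim, n_embd),
# then split into separate, contiguous Q/K/V weights.
grouped = fused.reshape(n_head_kv, n_q_per_group + 2, head_dim, n_embd)
wq = grouped[:, :-2].reshape(n_head * head_dim, n_embd)
wk = grouped[:, -2].reshape(n_head_kv * head_dim, n_embd)
wv = grouped[:, -1].reshape(n_head_kv * head_dim, n_embd)

Writing wq/wk/wv out as separate tensors (or in whatever contiguous order the eval code prefers) would avoid the brute-force offset copying mentioned earlier.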

@jploski (Contributor) commented Jun 14, 2023

> If the problem is the complicated logic for dealing with the query heads, maybe the easiest way to deal with that is in the conversion tool, from the Torch or numpy side. It should be relatively easy to shuffle things around at that point.
>
> Reducing the complexity would make issues easier to debug too, I guess.

Yes, I agree that reshuffling the weights during conversion will perhaps be the final and most elegant/efficient solution. I just haven't wrapped my head around how changing the layout of the query_key_value tensor maps into fused_qkv, from which the qkv vectors are extracted (fused_qkv = self.query_key_value(hidden_states)).

I'd also like to understand the current bug and have a working (if poorly implemented) version to improve on (even if the "improvement" will mean throwing away the overcomplicated code).

@jploski (Contributor) commented Jun 14, 2023

> I'd also like to understand the current bug and have a working (if poorly implemented) version to improve on (even if the "improvement" will mean throwing away the overcomplicated code).

Understood and fixed in my falcon40b branch. Please recompile and try again.

@KerfuffleV2

It's alliiiiive!

main: seed = 1686742967
falcon_model_load: loading model from '/home/nope/personal/ai/models/falc40b.ggml' - please wait ...
falcon_model_load: n_vocab   = 65024
falcon_model_load: n_embd    = 8192
falcon_model_load: n_head    = 128
falcon_model_load: n_head_kv = 8
falcon_model_load: n_layer   = 60
falcon_model_load: ftype     = 2008
falcon_model_load: qntvr     = 2
falcon_model_load: ggml ctx size = 28175.96 MB
falcon_model_load: memory_size =   480.00 MB, n_mem = 122880
falcon_model_load: ............................................................ done
falcon_model_load: model size = 27436.06 MB / num tensors = 484
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 10
main: token[0] =   7107, Once
main: token[1] =   2918,  upon
main: token[2] =    241,  a
main: token[3] =    601,  time
main: token[4] =     23, ,
main: token[5] =    629,  there
main: token[6] =    398,  was
main: token[7] =    241,  a
main: token[8] =   1278,  little
main: token[9] =  27224,  fox

Once upon a time, there was a little fox named ‘Pee-Poo’ who had an important mission to accomplish.
She had been assigned the task of finding the ‘Guru of all Gurus’ who was hiding deep in the jungle. And so one day, Pee-Poo set out for her journey. She walked and walked and walked and asked everybody in the jungle where the Guru lived, but nobody could tell her.
“But, how can that be?” she thought to herself, “There has

main: mem per token =  6467732 bytes
main:     load time = 10538.50 ms
main:   sample time =    34.28 ms
main:  predict time = 90867.40 ms / 833.65 ms per token
main:    total time = 104610.47 ms

Not a fan of the name it chose though.

For reference, these are the changes I need to actually run it:

diff --git a/examples/falcon/main.cpp b/examples/falcon/main.cpp
index beac293..c77c610 100644
--- a/examples/falcon/main.cpp
+++ b/examples/falcon/main.cpp
@@ -198,6 +198,7 @@ bool falcon_model_load(const std::string & fname, falcon_model & model, gpt_voca
                     ggml_type_sizef(GGML_TYPE_F32);  // memory_v
 
         ctx_size += (5 + 10 * n_layer) * 256;  // object overhead TODO:
+        ctx_size += ((size_t)3) * 1024 * 1024 * 1024;
         printf("%s: ggml ctx size = %6.2f MB\n", __func__, ctx_size/(1024.0*1024.0));
     }
 
diff --git a/include/ggml/ggml.h b/include/ggml/ggml.h
index e770603..83b0d84 100644
--- a/include/ggml/ggml.h
+++ b/include/ggml/ggml.h
@@ -194,7 +194,7 @@
 #define GGML_QNT_VERSION_FACTOR 1000 // do not change this
 
 #define GGML_MAX_DIMS          4
-#define GGML_MAX_NODES         4096
+#define GGML_MAX_NODES         262144
 #define GGML_MAX_PARAMS        256
 #define GGML_MAX_CONTEXTS      64
 #define GGML_MAX_OPT           4

@TheBloke (Contributor)

Amazing work guys!

So is https://github.com/jploski/ggml/tree/falcon40b the branch I should use to try converting and running GGMLs?

@cmp-nct commented Jun 14, 2023

262144 nodes, wtf :-)

Awesome to see it works so well!

@jploski (Contributor) commented Jun 14, 2023

> Amazing work guys!
>
> So is https://github.com/jploski/ggml/tree/falcon40b the branch I should use to try converting and running GGMLs?

I would suggest not converting them just yet - because if/when the qkv reshuffling during conversion is implemented, the binary format of the tensors would change again... which would make all the already published files incompatible.

@TheBloke (Contributor)

OK fair enough!

@ochafik (Collaborator) commented Jun 15, 2023

Your GGML_MEM_ALIGN theory sounds promising - I checked the ctx_size calculations (particularly bytes per element for f16 and the per-tensor overhead of sizeof(struct ggml_tensor) + GGML_OBJECT_SIZE = 256 bytes per tensor), but after my attempted fixes it was coming out even shorter than before...

As it turns out, GGML's API does have a ggml_tensor_overhead() that returns... 272 (although in practice I only measured it at 224), so that could also add up.

Now testing a fix (jploski#2)

I also see that there are suspicious hardcoded fixed-size memory "scratch" buffers in falcon_eval, and I find it probably wrong that the allocations there are independent of the initial prompt size (N = number-of-tokens-in-prompt during the first invocation of falcon_eval, 1 in subsequent invocations). But the crashes you reported happened even before that, while loading the model.

Happy to dive into this another day if there's other crashes!

@apage43 (Contributor, Author) commented Jun 15, 2023

> Although the README says the tokenizer will probably only work for English

I just copied that from the gpt-neox example. The common gpt-style tokenizer in /examples is broken, but in a way that is actually more problematic for text that the vocabulary can tokenize efficiently (i.e. English), because it will generate suboptimal tokenizations that don't match how the tokenizer the models were trained with would tokenize the same text.

An example of why that causes problems: if there is an "about" token in the vocabulary, the model will only have seen "about" represented with that token, and will never have seen "about" represented as ["ab", "out"] during training, even if those tokens exist in the vocabulary; every time it has seen those tokens, they were in other contexts. This means that models often understand suboptimally tokenized text poorly, if at all.

Fixing this unfortunately requires the file to also contain the tokenizer's "merges" list, which is not currently captured, so that we know in what order to combine tokens to match the original behavior; see #220 (comment).
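
As a toy illustration of why the ordered merges list matters (made-up merges, not Falcon's actual tokenizer data): BPE applies the highest-priority adjacent merge first, and without that priority list a re-implementation cannot reproduce the reference tokenization.

def bpe(word, merges):
    # merges: list of pairs in priority order, e.g. [("a", "b"), ("ab", "out")]
    ranks = {pair: i for i, pair in enumerate(merges)}
    toks = list(word)
    while True:
        candidates = [(ranks[(toks[i], toks[i + 1])], i)
                      for i in range(len(toks) - 1)
                      if (toks[i], toks[i + 1]) in ranks]
        if not candidates:
            return toks
        _, i = min(candidates)   # merge the highest-priority (lowest-rank) pair
        toks = toks[:i] + [toks[i] + toks[i + 1]] + toks[i + 2:]

print(bpe("about", [("a", "b"), ("o", "u"), ("ou", "t"), ("ab", "out")]))
# -> ['about']; drop the ("ab", "out") merge and the result is ['ab', 'out'],
# which is exactly the kind of mismatch the model never saw during training.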

(Well, some I've seen fail to decode non-English text, but that's because the converter script didn't store the tokens as raw bytes and instead did a utf-8 decoding/encoding roundtrip on them first. Token vocabulary items are bytes, not necessarily valid utf-8 strings, and a unicode character may be split across multiple tokens that do not individually contain complete valid utf-8 sequences. This is just a problem with those convert scripts, though, not with the decoding logic, which is pretty simple.)
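
A quick self-contained illustration of the bytes-vs-UTF-8 point (toy byte strings, not real vocabulary entries):

# 'é' is the two bytes C3 A9; if a tokenizer splits them across two token
# pieces, neither piece alone is valid UTF-8, but the concatenated bytes are.
piece_a, piece_b = b"caf\xc3", b"\xa9!"
# piece_a.decode("utf-8")                    # would raise UnicodeDecodeError
print((piece_a + piece_b).decode("utf-8"))   # café!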

@ochafik (Collaborator) commented Jun 15, 2023

No crash with any model / quantization after jploski#2, but it looks like falcon-7b-instruct q8_0 produces garbage (right after prompt): (edit: false alarm, looks all good)

Write a thank you poem.&&%2)>>ANSWER<<;122>>ABSTRACT<<,1+20>>INTRODUCTION<<5)2>>TITLE<<

@maddes8cht

Seems I have successfully compiled falcon.exe for Windows and tried to run the falcon 7B version https://huggingface.co/RachidAR/falcon-7B-ggml with it. I get the error message:

main: seed = 1686817064
falcon_model_load: loading model from '.\models\falcon-7b-q4_0-ggml.bin' - please wait ...
falcon_model_load: invalid model file '.\models\falcon-7b-q4_0-ggml.bin' (bad Falcon version: 9)
main: failed to load model from '.\models\falcon-7b-q4_0-ggml.bin'.

So there seems to be no 7B model version available, as @TheBloke so far only has the 40B versions (thanks @TheBloke for the great work, though).

So, while downloading (very slowly from my current PC) the original files from https://huggingface.co/tiiuae/falcon-7b, I'm asking how to do the quantization with falcon-quantize.exe (never done this before), as the command-line help says:

falcon-quantize.exe --help
usage: falcon-quantize.exe model-f32.bin model-quant.bin type

but there seem to be two files: pytorch_model-00001-of-00002.bin and pytorch_model-00002-of-00002.bin

@jploski (Contributor) commented Jun 15, 2023

> Seems I have successfully compiled falcon.exe for Windows and tried to run the falcon 7B version https://huggingface.co/RachidAR/falcon-7B-ggml with it. I get the error message:
> falcon_model_load: invalid model file '.\models\falcon-7b-q4_0-ggml.bin' (bad Falcon version: 9)

This error indicates that the GGML file you downloaded is outdated and not compatible with the current implementation.

As for DIY quantize, you need to perform two steps:

  1. Convert the multiple pytorch*.bin checkpoint files to one GGML file:
     python3 convert-hf-to-ggml.py 0 /path/to/directory/with/the/pytorch/bin/files /path/to/output/directory 1
  2. Quantize the GGML file to your desired format (e.g. 9 = q5_1):
     bin/falcon-quantize /tmp/ggml-model--f32.bin /tmp/ggml-model--f32-q5_1.bin 9

@KerfuffleV2 commented Jun 15, 2023

I think something might be a tiny bit off with the new context size calculations:

falcon_model_load: loading model from '/somepath/WizardLM-Uncensored-Falcon-40b.ggmlv3.q5_1.bin' - please wait ...
falcon_model_load: n_vocab   = 65025
falcon_model_load: n_embd    = 8192
falcon_model_load: n_head    = 128
falcon_model_load: n_head_kv = 8
falcon_model_load: n_layer   = 60
falcon_model_load: ftype     = 2009
falcon_model_load: qntvr     = 2
falcon_model_load: ggml ctx size = 7483257978003.64 MB
GGML_ASSERT: /somepath/ggml-falcon/src/ggml.c:3982: ctx->mem_buffer != NULL

I tried with other models, including the one I generated myself (which worked the other day). It would probably work on a machine with 6.8 exabytes of RAM, but sadly I don't have quite that much.


(ignore, it was due to ne[0] being passed incorrectly, not this code)

edit: This is way, way off:

size_needed += GGML_TYPE_SIZE[type]*(ne[0]/GGML_BLCK_SIZE[type]);

With some debug prints, I can see it produces an absolutely absurd initial value like 4631926893623383552.


edit: Fix here: jploski#3 (no longer needed)

@jploski (Contributor) commented Jun 15, 2023

> I think something might be a tiny bit off with the new context size calculations:
> edit: Fix here: jploski#3

It seems we both independently fixed it the same way (although I don't quite understand what was incorrect about the parameter passing).

@KerfuffleV2

> although I don't quite understand what was incorrect about the parameter passing

Sorry, that wasn't a good explanation. I think the issue was auto choosing the wrong type, not necessarily the parameter passing itself. I'm pretty sure that auto was choosing a 32-bit type, which got passed as a pointer to something expecting a 64-bit value. So it would just start reading from the pointer address, and of course half of the 64-bit value would be basically random, so you'd get a crazy result.

@jploski (Contributor) commented Jun 15, 2023

> although I don't quite understand what was incorrect about the parameter passing
>
> Sorry, that wasn't a good explanation. I think the issue was auto choosing the wrong type, not necessarily the parameter passing itself. I'm pretty sure that auto was choosing a 32-bit type, which got passed as a pointer to something expecting a 64-bit value. So it would just start reading from the pointer address, and of course half of the 64-bit value would be basically random, so you'd get a crazy result.

Ah yes, good old C where you can read half a variable and be happy. ;)

@khimaros

CPU inference is working well here with the latest commits to the falcon40b branch and TheBloke's falcon40b-instruct (q4_0) on Debian Bookworm. Thank you for all of the work on this!

@ggerganov (Owner)

Will be looking into merging this during the weekend.

Wondering if #224 would be enough to avoid the extra ggml_repeat()s so we can have true MQA and save some memory.

> One other thing worth mentioning (and I don't know if it's inherent to the model) is that generation seems to slow down as tokens are generated. It's much, much slower after generating 1000 tokens, whereas LLaMA seems to go at close to the same speed up to the context length.

Most likely this is due to extra transpose and copies in the attention.
We had a similar issue in llama.cpp and we fixed it by avoiding the copies:

ggerganov/llama.cpp#775

Tensor overhead should be computed with ggml_tensor_overhead() - it takes into account memory alignment effects.

We should probably also think about an elegant way to plug this inference into llama.cpp and allow usage with both Falcon and LLaMA models. I have some ideas, but I'm not sure what the best way is yet. In any case, the Falcon python converter would most likely need to add the necessary padding in the generated ggml files to be able to match the mmap format of llama.cpp. We might also need to do some refactoring first in llama.cpp to simplify both the python and C++ loading code, because it has lately become too complex, mostly because it is trying to support old formats that nobody uses anymore.

@maddes8cht

@jploski
> As for DIY quantize, you need to perform two steps:
>
> 1. Convert the multiple pytorch*.bin checkpoint files to one GGML file:
>    python3 convert-hf-to-ggml.py 0 /path/to/directory/with/the/pytorch/bin/files /path/to/output/directory 1

Okay, I created a conda environment, installed pytorch, installed transformers, and ran your line.

getting:

python convert-hf-to-ggml.py 0 X:\falcon.ggml\pytorch X:\falcon.ggml\models\ 1
Traceback (most recent call last):
  File "c:\Users\WaWiAdm\Documents\Github\falcon-ggml\examples\falcon\convert-hf-to-ggml.py", line 68, in <module>
    tokenizer = AutoTokenizer.from_pretrained(model_name)
  File "C:\Users\WaWiAdm\anaconda3\envs\falcon\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 658, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "C:\Users\WaWiAdm\anaconda3\envs\falcon\lib\site-packages\transformers\models\auto\configuration_auto.py", line 947, in from_pretrained
    trust_remote_code = resolve_trust_remote_code(
  File "C:\Users\WaWiAdm\anaconda3\envs\falcon\lib\site-packages\transformers\dynamic_module_utils.py", line 535, in resolve_trust_remote_code
    signal.signal(signal.SIGALRM, _raise_timeout_error)
AttributeError: module 'signal' has no attribute 'SIGALRM'. Did you mean: 'SIGABRT'?

what's missing?

@KerfuffleV2

@maddes8cht That's an issue with Transformers, not the conversion script specifically. You could possibly just comment out that line in dynamic_module_utils.py; most likely that would just prevent timeouts from working (which is probably benign in this case).

@TheBloke (Contributor) commented Jun 16, 2023

> We should probably also think about an elegant way to plug this inference into llama.cpp and allow usage with both Falcon and LLaMA models. I have some ideas, but I'm not sure what the best way is yet. In any case, the Falcon python converter would most likely need to add the necessary padding in the generated ggml files to be able to match the mmap format of llama.cpp. We might also need to do some refactoring first in llama.cpp to simplify both the python and C++ loading code, because it has lately become too complex, mostly because it is trying to support old formats that nobody uses anymore.

That would be amazing. I've wished for a long time that llama.cpp would be more like llm.cpp, and support other model types.

It feels to me like the gap between the capabilities of llama.cpp and non-Llama GGML is getting wider, just as more people are wishing to use non-Llama models, both because of licensing and because they want the special capabilities of other models, like StarCoder for coding.

Anything that could be done to support more model types would be really appreciated by the community I think. Llama.cpp has got amazingly powerful and I'd love if those features could become available for other models too.

@jploski (Contributor) commented Jun 16, 2023

> Will be looking into merging this during the weekend.
>
> Wondering if #224 would be enough to avoid the extra ggml_repeat()s so we can have true MQA and save some memory.

I just gave it a quick try, but no luck. I merged #224 locally and commented out the application of ggml_repeat2 for K... No more assertion failure, but unfortunately the result of this multiplication differs from the expected one, so the implicit broadcast must have produced something different from ggml_repeat2. But I do not understand #224 at this point, so maybe it can be achieved with some extra trickery.

@jploski (Contributor) commented Jun 16, 2023

> Tensor overhead should be computed with ggml_tensor_overhead() - it takes into account memory alignment effects.

@ggerganov Not sure if that is enough, though: the issue seems to be that ggml_tensor_overhead() + nelements * type_size is different from size_needed as calculated in ggml_new_tensor_impl.

cmp-nct pushed a commit to cmp-nct/ggllm.cpp that referenced this pull request Jun 16, 2023
Added falcon main and library based on llama.cpp
CPU inference works (getting ~260ms/token on 7B 16 bit falcon)
Tested with 7B 16 bit and the two shakespeare models (both in 16 bit precision only)

TODO/WIP:
1) quantization runs, creates a ggjt 3 file but something is wrong with the quantized model binary
- even quantization from 16 -> 16 also fails, something is wrong in the tensors produced
2) mmap should work with quantized binaries once 1) is solved
3) CUDA support is mostly there, it's currently disabled (all CPU backend)
4) memory/context calculations are off, GPU memory calculations are wrong too
5) the python conversion script is pre GGML 1 version (tokens without scores)
6) some stuff is still called "llama", some of it should be renamed to a generic name as it works for both
7) the GGML produced by the current python uses an old ftype method

Makefiles:
cmake on windows with build tools works
the makefile for linux/msys was blind adjusted but not tested yet - possibly missed something

Changes to the codebase:
* repeat2 has been added to ggml (jploski - ggerganov/ggml#231) including the backward variant (untested, probably fails)
* minor changes to work with falcon (name length)
* libfalcon is the previous "llama.cpp" and falcon_main is the previous main.cpp
JohannesGaessler pushed a commit to JohannesGaessler/llama.cpp that referenced this pull request Jun 17, 2023
Added falcon main and library based on llama.cpp
@ggerganov (Owner)

So I'm a bit confused - what is the difference between this branch and @jploski's version? Why did @jploski add repeat2 while here we haven't?

@jploski (Contributor) commented Jun 18, 2023

> So I'm a bit confused - what is the difference between this branch and @jploski's version? Why did @jploski add repeat2 while here we haven't?

Which branch are you referring to? I see ggml_repeat2 on all branches related to Falcon integration.

@jploski (Contributor) commented Jun 18, 2023

> So I'm a bit confused - what is the difference between this branch and @jploski's version? Why did @jploski add repeat2 while here we haven't?
>
> Which branch are you referring to? I see ggml_repeat2 on all branches related to Falcon integration.

Oh, if you meant apage43:falcon, then it is out of date, as it only supports 7B (for which ggml_repeat2 was indeed not needed, because there is only one KV to repeat, so the ordering does not matter). apage43:falcon is where I forked my jploski/ggml (falcon40b) from in order to implement 40B. And later the development moved over to cmp-nct/ggllm.cpp.

@ggerganov (Owner)

Ok, got it. I'll postpone merging this PR then. I want to focus on some ggml maintenance for a while, and it looks like there are already enough ongoing efforts for Falcon support.

Ideally, I think we want to avoid the ggml_repeat2() and figure out how to generalize the existing ops. I'm not sure it is possible, as I haven't yet looked into fully understanding MQA, but I hope we figure it out eventually.

@jploski (Contributor) commented Jun 18, 2023

> Ok, got it. I'll postpone merging this PR then. I want to focus on some ggml maintenance for a while, and it looks like there are already enough ongoing efforts for Falcon support.
>
> Ideally, I think we want to avoid the ggml_repeat2() and figure out how to generalize the existing ops. I'm not sure it is possible, as I haven't yet looked into fully understanding MQA, but I hope we figure it out eventually.

Yes, I agree that we should remove repeat2. If it is done on the cmp-nct/ggllm.cpp branch, I will update my https://github.com/jploski/ggml/tree/falcon40b accordingly. I think it would be helpful if you could check whether the mat_mul broadcast could somehow do the trick, as you are most familiar with the broadcast implementation (I suppose).

To understand how it needs to work, see:

https://docs.google.com/spreadsheets/d/1FoM6pIUj23GMW4zO_G1hjmEnUacBxBKN/edit#gid=2097305276

What we need to come out of repeat/broadcast (and what repeat2 produces) is: N[0].K[0], N[0].K[0], N[0].K[1], N[0].K[1]
But what we get from the standard repeat is: N[0].K[0], N[0].K[1], N[0].K[0], N[0].K[1]
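
For illustration, the desired ordering can also be produced from a plain broadcast by inserting an inner axis first and flattening afterwards (numpy sketch, not ggml code), which is one hint that a generalized broadcast inside the matmul might eventually replace the custom op:

import numpy as np

k = np.array([10, 11])                                     # stand-ins for K[0], K[1]
print(np.broadcast_to(k[:, None], (2, 2)).reshape(-1))     # [10 10 11 11] -- the repeat2 ordering
print(np.tile(k, 2))                                       # [10 11 10 11] -- the standard repeat ordering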

@jploski (Contributor) commented Jun 18, 2023

> Ok, got it. I'll postpone merging this PR then. I want to focus on some ggml maintenance for a while, and it looks like there are already enough ongoing efforts for Falcon support.
>
> Ideally, I think we want to avoid the ggml_repeat2() and figure out how to generalize the existing ops. I'm not sure it is possible, as I haven't yet looked into fully understanding MQA, but I hope we figure it out eventually.

If by MQA you mean multi-query attention (sorry, you mentioned it earlier, but I did not manage to decipher it), in the original multi-query paper (which Falcon-7B adheres to), there is only one key vector and one value vector (n_head_kv=1), and this k/v vector is reused/shared by all queries (in the sense that each query vector is multiplied by the same key). Contrast this with the traditional GPT approach where there is the same number of queries as keys/values. The motivation from the paper was to save (KV) memory while retaining approximately the same quality of inference.

In the generalized n_head_kv > 1 scheme, which Falcon-40B implements and for which I found no paper, there are multiple "KV groups", each consisting of one KV pair and n queries that reuse/share that group's KV pair. This is somewhat of a compromise between having just one KV pair overall and one KV pair for each query.
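
As a hedged numpy sketch (toy shapes, not the ggml implementation), the grouped case can be computed by viewing the query heads in groups of n_head / n_head_kv, so the shared K/V pair is broadcast per group and never physically repeated:

import numpy as np

n_head, n_head_kv, head_dim, seq = 8, 2, 4, 5
n_q_per_kv = n_head // n_head_kv

q = np.random.randn(n_head, seq, head_dim).reshape(n_head_kv, n_q_per_kv, seq, head_dim)
k = np.random.randn(n_head_kv, seq, head_dim)
v = np.random.randn(n_head_kv, seq, head_dim)

# (n_kv, n_q_per_kv, seq, head_dim) @ (n_kv, 1, head_dim, seq): the size-1 axis
# broadcasts over the per-group query heads -- "true MQA/GQA" without copies.
scores = q @ k[:, None].transpose(0, 1, 3, 2) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = (weights @ v[:, None]).reshape(n_head, seq, head_dim)   # back to per-head layout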

So the ordering issue in ggml_repeat(2) is about making sure that the right queries are matched with the right keys for the multiplication (and the resulting weights with the right values).

@FSSRepo (Collaborator) commented Aug 22, 2023

llama.cpp will have complete support for Falcon: #2717

@apage43 closed this Oct 16, 2023