WebAssembly and emscripten headers #97

Closed
loretoparisi opened this issue Mar 13, 2023 · 28 comments
Labels
enhancement New feature or request stale

Comments

@loretoparisi

Hello, I have tried adding minimal Emscripten support to the Makefile:

# WASM
EMCXX = em++
EMCC = emcc
EMCXXFLAGS = --bind --std=c++11 -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXPORTED_RUNTIME_METHODS=['addOnPostRun','FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=1" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=0" -s 'EXPORT_NAME="LLAMAModule"' -s "USE_ES6_IMPORT_META=0" -I./
EMCCFLAGS = --bind -s WASM=1 -s ALLOW_MEMORY_GROWTH=1 -s "EXPORTED_RUNTIME_METHODS=['addOnPostRun','FS']" -s "DISABLE_EXCEPTION_CATCHING=0" -s "EXCEPTION_DEBUG=1" -s "FORCE_FILESYSTEM=1" -s "MODULARIZE=1" -s "EXPORT_ES6=0" -s 'EXPORT_NAME="LLAMAModule"' -s "USE_ES6_IMPORT_META=0" -I./ 

EMOBJS = utils.bc ggml.bc

wasm: llama_wasm.js quantize_wasm.js
wasmdebug: export EMCC_DEBUG=1
wasmdebug: llama_wasm.js quantize_wasm.js

#
# WASM lib
#

ggml.bc: ggml.c ggml.h
	$(EMCC) -c $(EMCCFLAGS) ggml.c -o ggml.bc
utils.bc: utils.cpp utils.h
	$(EMCXX) -c $(EMCXXFLAGS) utils.cpp -o utils.bc

$(info I EMOBJS:      $(EMOBJS))

#
# WASM executable
#
llama_wasm.js: $(EMOBJS) main.cpp Makefile
	$(EMCXX) $(EMCXXFLAGS) $(EMOBJS) -o llama_wasm.js
quantize_wasm.js: $(EMOBJS) quantize.cpp Makefile
	$(EMCXX) $(EMCXXFLAGS) $(EMOBJS) quantize.cpp -o quantize_wasm.js

It compiles OK with both em++ and emcc. At this stage the problem is that main.cpp and quantize.cpp do not expose a proper header file, so I cannot call main as a module, or export a function (for example by applying Emscripten's EMSCRIPTEN_KEEPALIVE to main).

In fact, a simple C++ header could be compiled as a Node module and then called like

/** file:llama.js */
const llamaModularized = require('./llama_wasm.js');
var llamaModule = null
const _initLLAMAModule = async function () {
    llamaModule = await llamaModularized();
    return true
}
let postRunFunc = null;
const addOnPostRun = function (func) {
    postRunFunc = func;
};
_initLLAMAModule().then((res) => {
    if (postRunFunc) {
        postRunFunc();
    }
});

class LLaMa {
    constructor() {
        this.f = new llamaModule.LLaMa();
    }
    // here modules fun impl
}

module.exports = { LLaMa, addOnPostRun };

and then executed in a Node script like

/** file:run.js */
(async () => {
    const LLaMa = require('./llama.js');
    const loadWASM = function () {
        return new Promise(function (resolve, reject) {
            LLaMa.addOnPostRun(() => {
                let model = new LLaMa.LLaMa();
                /** use model functions */
                resolve(model); // resolve here, otherwise the await below never completes
            });
        });
    }//loadWASM
    await loadWASM();

}).call(this);
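
For completeness, the C++ side I have in mind would look roughly like the following embind sketch. This is a hypothetical illustration only: the LLaMa class and its generate method do not exist in main.cpp today, so the inference code would first have to be refactored behind such an interface.

/** file:llama_binding.cpp (hypothetical sketch) */
#include <string>
#include <emscripten/bind.h>

// Illustrative wrapper; the real inference logic from main.cpp would need to
// be factored out into a class like this.
class LLaMa {
public:
    LLaMa() {}
    std::string generate(const std::string &prompt) {
        // ... call into the refactored llama.cpp inference code here ...
        return "(generated text)";
    }
};

// embind registration: compiled with --bind, this is what makes
// `new llamaModule.LLaMa()` in llama.js above callable from JavaScript.
EMSCRIPTEN_BINDINGS(llama_module) {
    emscripten::class_<LLaMa>("LLaMa")
        .constructor<>()
        .function("generate", &LLaMa::generate);
}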
@MarkSchmidty

MarkSchmidty commented Mar 13, 2023

Without https://github.com/WebAssembly/memory64 implemented in WebAssembly, you are going to run into show-stopping memory issues with the current 4GB limit due to 32-bit addressing.

Do you have a plan for getting around that?

@Dicklesworthstone

If you quantized the 7B model to a mixture of 3-bit and 4-bit weights using https://github.com/qwopqwop200/GPTQ-for-LLaMa, then you could stay within that memory envelope.

@MarkSchmidty

MarkSchmidty commented Mar 13, 2023

I think that's a reasonable proposal @Dicklesworthstone.

A purely 3-bit implementation of llama.cpp using GPTQ could retain acceptable performance and solve the same memory issues. There's an open issue for implementing GPTQ quantization in 3-bit and 4-bit: GPTQ Quantization (3-bit and 4-bit) #9.

Other use cases could benefit from this same enhancement, such as getting 65B under 32GB and 30B under 16GB to further extend access to (perhaps slightly weaker versions of) the larger models.
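
As a rough sanity check of those numbers, here is a back-of-the-envelope calculation of weight-only sizes (ignoring the KV cache, activations and GPTQ scale/zero metadata, so real files are somewhat larger):

// Rough weight-only size estimates for common LLaMA sizes.
#include <cstdio>

int main() {
    const double params_b[] = {7, 13, 30, 65};      // billions of parameters
    for (double p : params_b) {
        double gb4 = p * 1e9 * 4 / 8 / 1e9;         // 4-bit weights, in GB
        double gb3 = p * 1e9 * 3 / 8 / 1e9;         // 3-bit weights, in GB
        std::printf("%4.0fB: ~%4.1f GB @ 4-bit, ~%4.1f GB @ 3-bit\n", p, gb4, gb3);
    }
    return 0;
}

So 7B needs roughly 3.5 GB at 4-bit and 2.6 GB at 3-bit (the latter leaves headroom under the 4GB WASM limit), 30B is about 15 GB at 4-bit, and 65B needs 3-bit (~24 GB) to fit under 32GB.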

This was referenced Mar 13, 2023
@gjmulder gjmulder added the enhancement New feature or request label Mar 15, 2023
@thypon

thypon commented Mar 20, 2023

https://twitter.com/nJoyneer/status/1637863946383155220

I was able to run llama.cpp in the browser with a minimal patchset and some *FLAGS

(screenshot: llama.cpp running in the browser)

The Emscripten version used:

$ emcc --version
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.33-git
Copyright (C) 2014 the Emscripten authors (see AUTHORS.txt)
This is free and open source software under the MIT license.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The compile flags:

make CC=emcc CXX=em++ LLAMA_NO_ACCELERATE=1 CFLAGS=" -DNDEBUG -s MEMORY64" CXXFLAGS=" -DNDEBUG -s MEMORY64" LDFLAGS="-s MEMORY64 -s TOTAL_MEMORY=8589934592 -s STACK_SIZE=2097152 --preload-file models " main.html

The minimal patch:

diff --git a/Makefile b/Makefile
index 1601079..12a1a80 100644
--- a/Makefile
+++ b/Makefile
@@ -189,12 +189,16 @@ utils.o: utils.cpp utils.h
 	$(CXX) $(CXXFLAGS) -c utils.cpp -o utils.o
 
 clean:
-	rm -f *.o main quantize
+	rm -f *.o main.{html,wasm,js,data,worker.js} main quantize
 
 main: main.cpp ggml.o utils.o
 	$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main $(LDFLAGS)
 	./main -h
 
+main.html: main.cpp ggml.o utils.o
+	$(CXX) $(CXXFLAGS) main.cpp ggml.o utils.o -o main.html $(LDFLAGS)
+	go run server.go
+
 quantize: quantize.cpp ggml.o utils.o
 	$(CXX) $(CXXFLAGS) quantize.cpp ggml.o utils.o -o quantize $(LDFLAGS)
 
diff --git a/ggml.c b/ggml.c
index 4813f74..3dc2cbc 100644
--- a/ggml.c
+++ b/ggml.c
@@ -6,6 +6,8 @@
 #include <alloca.h>
 #endif
 
+#define _POSIX_C_SOURCE 200809L
+
 #include <assert.h>
 #include <time.h>
 #include <math.h>
@@ -107,7 +109,7 @@ typedef void* thread_ret_t;
     do { \
         if (!(x)) { \
             fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n", __FILE__, __LINE__, #x); \
-            abort(); \
+            /*abort();*/ \
         } \
     } while (0)
 
diff --git a/main.cpp b/main.cpp
index e181056..afb0c53 100644
--- a/main.cpp
+++ b/main.cpp
@@ -785,7 +785,7 @@ int main(int argc, char ** argv) {
     const int64_t t_main_start_us = ggml_time_us();
 
     gpt_params params;
-    params.model = "models/llama-7B/ggml-model.bin";
+    params.model = "models/7B/ggml-model-q4_0.bin";
 
     if (gpt_params_parse(argc, argv, params) == false) {
         return 1;

@loretoparisi
Author

I was able to run llama.cpp in the browser with a minimal patchset and some *FLAGS […]

Wow, well done! Why did you have to remove abort(); from ggml?

@thypon

thypon commented Mar 20, 2023

The abort(); case was hit when running out of memory, before the partial LLM output was printed, so no string was shown.
Unfortunately, I did not find any way of using the Memory64 WASM extension without running into bugs on both Firefox and Chrome.

@loretoparisi
Author

The abort(); case was hit when running out of memory, before the partial LLM output was printed, so no string was shown. Unfortunately, I did not find any way of using the Memory64 WASM extension without running into bugs on both Firefox and Chrome.

So given the limits of WASM64, you have to go for 3- and 4-bit quantization using GPTQ, I think.

@thypon

thypon commented Mar 20, 2023

The abort(); case was hit when running out of memory, before the partial LLM output was printed, so no string was shown. Unfortunately, I did not find any way of using the Memory64 WASM extension without running into bugs on both Firefox and Chrome.

So given the limits of WASM64, you have to go for 3- and 4-bit quantization using GPTQ, I think.

It's already quantized to 4 bits when converting; 7B overflows 8GB of allocated WASM64 memory, though.
Besides that, it's quite slow, since you cannot have both (pthreads OR SIMD) AND memory64. Whenever I try to mix any two of them, the compiler or the linker fails to create or run the output.

@okpatil4u

@thypon apparently memory64 is available in Firefox Nightly, did you check it?

https://webassembly.org/roadmap/#feature-note-2

@lapo-luchini

The new RedPajama-3B seems like a nice tiny model that could probably fit without memory64.

@IsaacRe

IsaacRe commented May 17, 2023

@thypon @loretoparisi I'm curious, what sort of performance drop did you notice running in the browser versus running natively? How many toks/sec were you getting?

@thypon

thypon commented Jun 7, 2023

@IsaacRe I did not make a performance comparison, since it was not 100% stable and needed to be refined. As mentioned, it was single-core, since multithreading + memory64 on Firefox Nightly did not work properly together and crashed the experiment.

@okpatil4u it was already running with experimental memory64

@okpatil4u

Hey @thypon, did you make any progress on this experiment?

@thypon

thypon commented Jul 4, 2023

I'm not actively working on this at the current stage.

@lukestanley

lukestanley commented Jul 4, 2023

@okpatil4u
I broadly followed the very useful steps above by @loretoparisi and was able to run only really small models, with the latest Emscripten and a fairly recent master commit. I am way out of my comfort zone with C++ or WASM (I spend most of my time with TypeScript and Python). I didn't get around to installing Firefox Nightly and have stopped for now.
I last had the tiny Shakespeare models running in the browser.
The diff by loretoparisi was made a while ago, so I had to make some significant changes, and I am a complete C++ noob, so take this with a grain of salt, but if it helps someone, great:
lukestanley@41cbd2b

@rahuldshetty

I've tried the approach suggested by @lukestanley and @loretoparisi and got starcoder.cpp to run in the browser.
Published a demo project on this: https://github.com/rahuldshetty/starcoder.js

I tried the tiny_starcoder_py model, as its weights were small enough to fit without mem64, and checked the performance/accuracy. It seems like the output of the model without mem64 is gibberish, while the mem64 version produces meaningful output. Not sure if memory addressing in 32-bit vs 64-bit has something to do with it.

@mindplay-dk

Without https://github.com/WebAssembly/memory64 implemented in WebAssembly, you are going to run into show-stopping memory issues with the current 4GB limit due to 32-bit addressing.

Do you have a plan for getting around that?

How about WebGPU? Probably better to run it off-CPU where possible anyhow?

(full disclosure: I have no idea what I'm talking about.)

@mohamedmansour

Without https://github.com/WebAssembly/memory64 implemented in WebAssembly, you are going to run into show-stopping memory issues with the current 4GB limit due to 32-bit addressing.

Do you have a plan for getting around that?

The implementation status is complete for emscripten:
https://github.com/WebAssembly/memory64/blob/main/proposals/memory64/Overview.md

(screenshot: memory64 implementation status table)

github-actions bot
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

@loretoparisi
Author

Not sure what the progress is here; apparently there are overlapping or related open issues.

@ggerganov
Owner

There is this project that might be relevant: https://github.com/ngxson/wllama

@flatsiedatsie

@ggerganov Thanks for sharing that. I'm already using https://github.com/tangledgroup/llama-cpp-wasm as the basis of a big project.

So far llama-cpp-wasm has allowed me to run pretty much any .gguf that is less than 2GB in size in the browser (and that limitation seems to be related to the caching mechanism of that project, so I suspect the real limit would be 4GB).

People talk about bringing AI to the masses, but the best way to do that is with browser-based technology. My mom is never going to install Ollama and the like.

@ggerganov
Owner

My mom is never going to install Ollama and the like.

But would your mom be OK with her browser downloading gigabytes of weights upon page visit? To me this seems like the biggest obstacle for browser-based LLM inference.

@flatsiedatsie

flatsiedatsie commented Apr 12, 2024

She's already doing it :-)

Sneak preview:

(screenshot: sneak preview of the browser-based project)

(100% browser based)

@loretoparisi
Author

My mom is never going to install Ollama and the like.

But would your mom be OK with her browser downloading gigabytes of weights upon page visit? To me this seems like the biggest obstacle for browser-based LLM inference.

Agreed, the best example so far is MLC LLM, web version:
https://webllm.mlc.ai/

You can see that it downloads 4GB in shards, around 20 shards or so, for the Llama-2 7B weights, 4-bit quantized. Of course this means you can wait from tens of seconds to a few minutes before inference starts. And this is not going to change soon, unless quantization at 3 or 2 bits works better and the accuracy is as good as 4-bit...

For example, if we take Llama-3 8B, we have 108 shards:

(screenshot: list of 108 weight shards being downloaded)

and it took 114 seconds to complete on my fiber connection:

(screenshot: shard download completed in 114 seconds)

before being ready to infer:

(screenshot: model ready for inference)

on Mac M1 Pro I get

prefill: 13.5248 tokens/sec, decoding: 7.9857 tokens/sec
Models with “-1k” suffix signify 1024 context length, lowering ~2-3GB VRAM requirement compared to their counterparts. Feel free to start trying with those.

@flatsiedatsie

Huggingface has recently released a streaming option for GGUF, where you can already start inference even though the model is not fully loaded yet. At least, that's my understanding from a recent YouTube video by Yannic Kilcher.

For my project I'm trying to use a less-than-2GB quant of Phi-2 with 128K context. I think that model will become the best model for browser-based use for a while.

@slaren
Collaborator

slaren commented Apr 24, 2024

You may be thinking of a library that Huggingface released that can read GGUF metadata without downloading the whole file. You wouldn't gain much from streaming the model for inference; generally, the entire model is needed to generate every token.
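
For context on why metadata-only reads are cheap: GGUF keeps a small fixed header plus all metadata key/value pairs at the very start of the file, before any tensor data. A minimal sketch of reading just the fixed-size part of that header (assuming the GGUF v2+ layout and a little-endian host; this is not the Huggingface library's API):

// gguf_header.cpp -- read only the first 24 bytes of a GGUF file.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main(int argc, char **argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version   = 0;
    uint64_t n_tensors = 0, n_kv = 0;
    if (std::fread(magic, 1, 4, f) != 4 || std::memcmp(magic, "GGUF", 4) != 0) {
        std::fprintf(stderr, "not a GGUF file\n"); std::fclose(f); return 1;
    }
    std::fread(&version,   sizeof version,   1, f);    // format version
    std::fread(&n_tensors, sizeof n_tensors, 1, f);    // number of tensors
    std::fread(&n_kv,      sizeof n_kv,      1, f);    // number of metadata key/value pairs
    std::printf("GGUF v%u: %llu tensors, %llu metadata key/value pairs\n",
                version, (unsigned long long) n_tensors, (unsigned long long) n_kv);
    std::fclose(f);
    return 0;
}

The metadata entries follow immediately after these fields, still near the start of the file, so a ranged HTTP request for the first chunk is enough to recover them without touching the weights.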

@flatsiedatsie

@slaren Ah, thanks for clarifying that. It sounded a little too good to be true :-)
