Vimium is a Chrome extension that lets you navigate the web with only your keyboard.
Create environment with conda
conda create --name webagent
or
Create environment with venv
python3 -m venv venv
Install Python requirements
pip install -r requirements.txt
Download Vimium locally (the extension has to be loaded manually when running Playwright)
./setup.sh
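Loading an unpacked extension in Playwright is usually done by passing Chromium flags to a persistent context. The sketch below shows one way that could look. It is not this repo's code: the `./vimium-master` folder and `./.chrome-profile` directory are hypothetical names, so point them at wherever `setup.sh` actually puts Vimium.

```python
from pathlib import Path


def vimium_launch_args(extension_dir):
    """Build the Chromium flags that load a single unpacked extension."""
    path = str(Path(extension_dir).resolve())
    return [
        f"--disable-extensions-except={path}",
        f"--load-extension={path}",
    ]


def launch_with_vimium(extension_dir="./vimium-master",    # hypothetical path
                       profile_dir="./.chrome-profile"):  # hypothetical path
    """Start Chromium with Vimium loaded and return (playwright, context)."""
    # Imported lazily so vimium_launch_args works without Playwright installed.
    from playwright.sync_api import sync_playwright

    playwright = sync_playwright().start()
    # Extensions need a persistent context and a headed browser.
    context = playwright.chromium.launch_persistent_context(
        profile_dir,
        headless=False,
        args=vimium_launch_args(extension_dir),
    )
    return playwright, context
```

When finished, call `context.close()` and then `playwright.stop()` on the two returned objects.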
[Hugging Face guide](https://huggingface.co/docs/transformers.js/guides/private) https://huggingface.co/settings/tokens
Download PyTorch
This is for using the GPT Vision API and TTS API (so if we don't use those anymore, remove it)
OPENAI_API_KEY=
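A script that depends on this key can fail early with a clear message when it is missing. This is a minimal sketch that assumes the key is exported as an environment variable. The helper name `require_openai_key` is hypothetical and not part of this repo.

```python
import os


def require_openai_key(env=None):
    """Return OPENAI_API_KEY, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = (env.get("OPENAI_API_KEY") or "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; it is needed for the GPT Vision and TTS APIs."
        )
    return key
```

Calling `require_openai_key()` once at startup avoids a confusing authentication error later, in the middle of a browsing session.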
brew install portaudio
- Clone llama.cpp and run `make` in the root directory.
- Download the llava vision model. I'm using the quantized 4-bit version of the 7B param model. Download it here. I saved the `ggml-model-q4_k.gguf` file and the `mmproj-model-f16.gguf` file in a `models/llava` folder in the llama.cpp directory.
- Run the llama.cpp server:
  ./server -m models/llava/ggml-model-q4_k.gguf --mmproj models/llava/mmproj-model-f16.gguf
- Now you can save a `screenshot.png` file in the root directory of this repo and run `python3 llava.py` to check it's working.
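For reference, a check like this could send the screenshot to the running server roughly as follows. This is only a sketch and not the contents of this repo's `llava.py`. It assumes the server listens on `http://localhost:8080` and accepts the `/completion` request shape (`image_data` plus an `[img-N]` tag in the prompt) that the llama.cpp server documented for llava. Both assumptions may not hold for other llama.cpp versions.

```python
import base64
import json
import urllib.request

IMAGE_ID = 12  # arbitrary id; it only has to match the [img-N] tag in the prompt


def build_llava_payload(image_bytes, question="Describe the image.", n_predict=256):
    """Build the JSON body for a /completion request with one image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "prompt": f"USER:[img-{IMAGE_ID}]{question}\nASSISTANT:",
        "image_data": [{"data": encoded, "id": IMAGE_ID}],
        "n_predict": n_predict,
    }


def ask_llava(image_path="screenshot.png", server_url="http://localhost:8080"):
    """Send the screenshot to the server (URL is an assumption) and return its answer."""
    with open(image_path, "rb") as f:
        payload = build_llava_payload(f.read())
    request = urllib.request.Request(
        f"{server_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]
```

With the server running and `screenshot.png` in place, `print(ask_llava())` should print the model's description of the screenshot.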