One-command tool to turn any VPS into a production-ready, OpenAI-compatible embedding API server.
- Features
- Quick Start
- Available Models
- API Reference
- CLI Commands
- Configuration
- Deployment
- Troubleshooting
- Development
## Features

- OpenAI-compatible API: Drop-in replacement for `/v1/embeddings`
- CPU-only inference: Runs on a low-end VPS (4GB RAM, 2 vCPU)
- Hugging Face integration: Download and cache models from HF Hub
- API key authentication: Secure your API with local key management
- Concurrency control: Automatic concurrency limits based on model size and hardware
- Request queueing: Handle traffic spikes gracefully
- Memory safety: Pre-flight checks prevent OOM crashes
## Quick Start

Prerequisites:

- Node.js 18+ (npm or npx)
- Hugging Face Account with an API token
- 4GB+ RAM (for small models; 8GB+ recommended)
`npm install -g vps-vector-node`

Or run without installing:
`npx vps-vector-node --help`

VectorNode requires a Hugging Face token to download models.
Get a token:
- Go to https://huggingface.co/join (create account if needed)
- Navigate to https://huggingface.co/settings/tokens
- Click "New token" β Give it a name (e.g., "vectornode")
- Select "Read" permission
- Click "Generate token" and copy it immediately
Login to VectorNode:
`npx vps-vector-node login hf --token hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`

Or if installed globally:
`vectornode login hf --token hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`

The token is stored securely in `~/.vectornode/config.json`.
Token troubleshooting:
- Invalid token? Ensure you copied the entire token, including the `hf_` prefix
- Permission denied? Token must have "Read" permission
- Lost token? Create a new one at https://huggingface.co/settings/tokens
Download the model before starting the server (avoids delays on first request):
`npx vps-vector-node models:download bge-small-en`

Or if installed globally:
`vectornode models:download bge-small-en`

See Available Models for other options.
`npx vps-vector-node key create --name dev`

This outputs an API key like `sk-abc123...`. Store it securely; you'll need it for every API request.
`npx vps-vector-node serve --model bge-small-en --port 3000`

Or if installed globally:
`vectornode serve --model bge-small-en --port 3000`

You should see:
[INFO] Starting VectorNode server { model: 'bge-small-en', port: '3000', host: '0.0.0.0' }
[INFO] Model loaded successfully { model: 'bge-small-en', dimensions: 384 }
[INFO] Server listening on 0.0.0.0:3000
Using curl:
curl -H "Authorization: Bearer sk-abc123..." \
-H "Content-Type: application/json" \
-X POST http://localhost:3000/v1/embeddings \
-d '{
"model": "bge-small-en",
"input": "hello world"
}'

Using Postman:
- Method: POST
- URL: `http://localhost:3000/v1/embeddings`
- Headers:
  - `Authorization: Bearer sk-abc123...`
  - `Content-Type: application/json`
- Body (raw JSON):
{ "model": "bge-small-en", "input": "hello world" }
Important Notes:
- `/v1/embeddings` only accepts POST requests (GET will return 404)
- Send the payload as raw JSON in the body, not as query parameters
- VectorNode handles Unicode (spaces, tabs, emoji) automatically:
{ "model": "bge-small-en", "input": "hello world π\tTabbed text" }- If your JSON has extra surrounding quotes, remove them (e.g., remove
"before{and after})
## Available Models

| Model ID | Dimensions | Size (GB) | Parameters | Min RAM | Rec RAM | Best Use Case | Multilingual | Latency |
|---|---|---|---|---|---|---|---|---|
| `bge-micro` | 384 | 0.1 | ~22M | 2GB | 4GB | Edge/Frontend, ultra-low latency, real-time | No | ~5ms |
| `minilm-l6-v2` | 384 | 0.4 | ~22M | 2GB | 4GB | General purpose, semantic search, fast inference | No | ~8ms |
| `gte-small` | 384 | 0.5 | ~33M | 2GB | 4GB | General embedding, retrieval | Yes | ~10ms |
| `bge-small-en` | 384 | 0.5 | ~110M | 3GB | 6GB | RAG, production search, English-focused | No | ~12ms |
| `gemma-embedding-small` | 768 | 1.6 | ~1.6M | 3GB | 6GB | Lightweight multilingual, fast | Yes | ~15ms |
| `gte-base` | 768 | 1.5 | ~137M | 4GB | 8GB | Higher quality search, retrieval-augmented gen | Yes | ~25ms |
| `bge-base-en` | 768 | 1.5 | ~125M | 4GB | 8GB | Better accuracy than small, English RAG | No | ~30ms |
| `qwen2.5-embedding-0.5b` | 1024 | 2.0 | ~500M | 4GB | 8GB | Modern multilingual, strong semantic understanding | Yes | ~40ms |
| `gemma-embedding-base` | 768 | 3.0 | ~308M | 6GB | 12GB | High-quality multilingual embeddings | Yes | ~50ms |
| `bge-large-en` | 1024 | 3.0 | ~335M | 8GB | 16GB | Best accuracy for English, production RAG systems | No | ~80ms |
Choose your model based on your needs:
- Ultra-fast (<10ms), edge device? → `bge-micro` or `minilm-l6-v2`
- Production English RAG/search? → `bge-small-en` (balanced) or `bge-large-en` (best quality)
- Multilingual support needed? → `gemma-embedding-small` (fast) or `gte-base` (quality)
- General purpose, cost-conscious? → `gte-small` or `minilm-l6-v2`
- Maximum accuracy, sufficient hardware? → `bge-large-en` or `gemma-embedding-base`
| Hardware Profile | Recommended Model | Why |
|---|---|---|
| 2GB RAM, 1-2 vCPU | `bge-micro` or `minilm-l6-v2` | Minimal overhead, fast inference |
| 4GB RAM, 2-4 vCPU | `bge-small-en` or `gte-small` | Production-ready, good balance |
| 8GB RAM, 4-8 vCPU | `bge-base-en` or `gte-base` | Higher quality, still responsive |
| 16GB+ RAM, 8+ vCPU | `bge-large-en` or `gemma-embedding-base` | Best quality and multilingual support |
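If you provision hosts programmatically, the hardware table can be turned into a simple default picker. An illustrative sketch (the thresholds mirror the table; this helper is not part of VectorNode):

```ts
import { totalmem } from "node:os";

// Map total system RAM to a reasonable default model, following the
// hardware profile table above. Purely illustrative; adjust to your workload.
function defaultModelForHost(): string {
  const ramGb = totalmem() / 1024 ** 3;
  if (ramGb >= 16) return "bge-large-en";
  if (ramGb >= 8) return "bge-base-en";
  if (ramGb >= 4) return "bge-small-en";
  return "bge-micro";
}

console.log(`Suggested model: ${defaultModelForHost()}`);
```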
To list available model IDs from the CLI:

npx vps-vector-node models
# or if installed globally:
vectornode models

## API Reference

`POST /v1/embeddings`

Generate embeddings for input text(s).
Request:
{
"model": "bge-small-en",
"input": "hello world"
}

For multiple inputs:
{
"model": "bge-small-en",
"input": ["text1", "text2", "text3"]
}

Query Parameters:
- `normalize` (boolean, default: `true`) - Whether to L2-normalize embeddings
Response (200 OK):
{
"object": "list",
"data": [
{
"index": 0,
"embedding": [0.0123, -0.0034, ...],
"object": "embedding"
}
],
"model": "bge-small-en",
"usage": {
"text_length": [11],
"tokens_estimated": [3]
},
"inference_time_ms": 12
}

Status Codes:
- `200` - Success
- `202` - Request queued (includes `queue_position` in response)
- `400` - Bad request (missing/invalid fields)
- `401` - Unauthorized (invalid/missing API key)
- `429` - Queue full
- `500` - Server error
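Putting the request and response shapes together, here is a client sketch in Node.js/TypeScript that sends a batch, reacts to the documented status codes, restores input order via the `index` field, and compares two returned vectors with cosine similarity. The error handling and the similarity helper are illustrative, not part of VectorNode:

```ts
// Batch client sketch for POST /v1/embeddings. Error handling, retry advice,
// and the cosine helper are illustrative choices.
const ENDPOINT = "http://localhost:3000/v1/embeddings"; // append "?normalize=false" to skip L2 normalization

async function embedBatch(inputs: string[], apiKey: string): Promise<number[][]> {
  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model: "bge-small-en", input: inputs }),
  });
  if (res.status === 429) throw new Error("Queue full - back off and retry");
  if (!res.ok) throw new Error(`Embedding request failed: ${res.status}`);

  const body = await res.json();
  if (res.status === 202) {
    // Queued rather than processed immediately; the response carries queue_position.
    throw new Error(`Queued at position ${body.queue_position}; retry later`);
  }
  // Each item carries an index, so input order can be restored explicitly.
  return body.data
    .sort((a: { index: number }, b: { index: number }) => a.index - b.index)
    .map((d: { embedding: number[] }) => d.embedding);
}

// Cosine similarity between two embeddings (useful for semantic search).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

embedBatch(["a cat sat on the mat", "a kitten rested on the rug"], "sk-abc123...")
  .then(([a, b]) => console.log("similarity:", cosine(a, b).toFixed(3)))
  .catch((err) => console.error(err.message));
```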
List available models on the server.
Headers:
- `Authorization: Bearer sk-...` (required)
Response (200 OK):
{
"object": "list",
"data": [
{
"id": "bge-small-en",
"dimensions": 384,
"size_gb": 0.5,
"quantization": "Q4_K",
"source": "huggingface",
"loaded": true
}
]
}

Health check endpoint (no authentication required).
Response (200 OK):
{
"status": "ok",
"model_loaded": "bge-small-en",
"uptime_seconds": 1023
}

Server performance metrics.
Headers:
- `Authorization: Bearer sk-...` (required)
Response (200 OK):
{
"active_requests": 3,
"queue_depth": 2,
"memory_free_mb": 2432,
"cpu_load": 0.43,
"requests_per_sec": 12.5,
"avg_latency_ms": 45
}

## CLI Commands

Log in to Hugging Face:

npx vps-vector-node login hf --token <token>
# or if installed globally:
vectornode login hf --token <token>

The token is stored in `~/.vectornode/config.json`.

Download a model:
npx vps-vector-node models:download bge-small-en
# or if installed globally:
vectornode models:download bge-small-en

Start the embedding API server:
npx vps-vector-node serve --model <model-id> [options]

Common Options:
- `--port <port>` - Port to listen on (default: `3000`)
- `--host <host>` - Host to bind to (default: `0.0.0.0`)
- `--max-queue-size <size>` - Maximum queue size (default: `100`)
- `--batch-size <size>` - Batch size for processing (default: `10`)
- `--threads <threads>` - Number of threads (auto-detected)
- `--verbose` - Enable verbose logging
- `--trace` - Enable trace logging
Create a new API key:
npx vps-vector-node key create --name <name>

List all keys:
npx vps-vector-node key list

Revoke a key:
npx vps-vector-node key revoke --name <name>

## Configuration

Configuration is stored in `~/.vectornode/`:
~/.vectornode/
├── config.json        # HF token and settings
├── keys.json          # API keys
├── models/            # Downloaded models cache
│   ├── bge-small-en/
│   ├── bge-base-en/
│   └── ...
└── logs/              # Trace logs (if --trace enabled)
Minimum (for small models):
- RAM: 4GB
- CPU: 2 vCPU
- Storage: 5GB free
Recommended (for base models):
- RAM: 8GB
- CPU: 4 vCPU
- Storage: 10GB free
The server checks available memory before loading models and will refuse to start if insufficient.
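For illustration, the kind of pre-flight check this describes could look roughly like the sketch below. It is not VectorNode's actual implementation; the minimum-RAM figures are taken from the model table above:

```ts
import { freemem } from "node:os";

// Illustrative pre-flight memory check: refuse to load a model when free RAM
// is below the minimum listed in the model table. Not the real implementation.
const MIN_RAM_GB: Record<string, number> = {
  "bge-micro": 2,
  "bge-small-en": 3,
  "bge-base-en": 4,
  "bge-large-en": 8,
};

function assertEnoughMemory(modelId: string): void {
  const freeGb = freemem() / 1024 ** 3;
  const requiredGb = MIN_RAM_GB[modelId] ?? 4;
  if (freeGb < requiredGb) {
    throw new Error(
      `Refusing to load ${modelId}: ${freeGb.toFixed(1)}GB free, ${requiredGb}GB minimum required`
    );
  }
}

assertEnoughMemory("bge-small-en");
```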
## Deployment

Dockerfile:
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
EXPOSE 3000
CMD ["node", "cli/index.js", "serve", "--model", "bge-small-en"]Build and run:
docker build -t vectornode .
docker run -p 3000:3000 vectornode

Create `/etc/systemd/system/vectornode.service`:
[Unit]
Description=VectorNode Embedding API
After=network.target
[Service]
Type=simple
User=vectornode
WorkingDirectory=/opt/vectornode
ExecStart=/usr/bin/node /opt/vectornode/cli/index.js serve --model bge-small-en --port 3000
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target

Enable and start:
sudo systemctl enable vectornode
sudo systemctl start vectornode

## Troubleshooting

Invalid or missing token:
- Check if token is set: `cat ~/.vectornode/config.json`
- Re-login with correct token: `npx vps-vector-node login hf --token <your-token>`
- Get a new token: https://huggingface.co/settings/tokens
Token permission issues:
- Ensure your token has "Read" permission
Network/connectivity issues:
- Check connection to huggingface.co: `ping huggingface.co`
- Try again (may be temporary)
Not enough disk space:
- Check available space: `df -h`
- Clean up space or use a smaller model
[ERROR] Model not found: gemma-embedding-small.
Run: vectornode models:download gemma-embedding-small
Solution: Download the model first:
npx vps-vector-node models:download gemma-embedding-small

Out of memory or insufficient RAM:

- Use a smaller model (e.g., `bge-small-en` instead of `bge-large-en`)
- Reduce the `--max-queue-size` flag
- Add swap space to the system
- Increase available RAM
Slow inference:

- Check CPU usage: `top` or `htop`
- Try reducing `--batch-size`
- Use a quantized model
- Add more CPU resources
Missing or invalid API key:
- List keys: `npx vps-vector-node key list`
- Create new key: `npx vps-vector-node key create --name dev`
- Verify header format: `Authorization: Bearer sk-...`
[ERROR] Unexpected token '"', ""{\\n \\\"mo\"... is not valid JSON
Solution in Postman:
- Use the Body tab → select `raw` → pick `JSON` from the dropdown
- Place the payload in the body (not in query params)
- Remove any extra quotes around the JSON
## Development

git clone <repo>
cd vectornode
npm install

Run the test suite:

npm test

Run locally with verbose or trace logging:

node cli/index.js serve --model bge-small-en --verbose --port 3000
node cli/index.js serve --model bge-small-en --trace --port 3000

Logs are saved to `~/.vectornode/logs/`.
## License

MIT
## Contributing

Contributions welcome! Please:
- Open an issue to discuss your idea
- Fork the repository
- Submit a PR with a clear description