JavaScript Native PyTorch-aligned Machine Learning Framework
Bringing the true PyTorch experience to the JavaScript ecosystem
Quick Start • Core Features • Example Projects • Architecture • Roadmap
Kandle is a JavaScript Native machine learning framework that adopts an Eager Mode (dynamic graph) execution pattern, closely modeled on PyTorch's ATen/c10 architecture. I view PyTorch not just as a Python framework, but as the de facto API specification for modern AI frameworks. Kandle is dedicated to implementing an API system highly aligned with PyTorch within the JavaScript ecosystem.
- 🔄 Dynamic Graph Execution: True Eager Mode, supporting layer-by-layer debugging, intermediate state inspection, and dynamic control flow.
- 🎨 PyTorch API Alignment: Aligned at the architectural level rather than simple API wrapping, reducing migration costs and learning curves.
- ⚡ Hybrid Backend Architecture: Supports both WebGPU (GPU acceleration) and pure JS (CPU computation) backends under a unified interface.
- 🧩 Complete Tensor System: Implements a full Stride mechanism, broadcasting, view operations, and non-contiguous memory support.
- 🎵 Rich Operator Library: 200+ tensor operations covering arithmetic, linear algebra, convolution, FFT, audio processing, and more.
- 🚀 Out-of-the-Box Models: Native support for mainstream models like Qwen3 and Whisper, capable of loading Safetensor weights directly.
Current inference engines in the JavaScript ecosystem, such as ONNX Runtime and WebLLM, are excellent but are fundamentally Blackbox Systems focused on static graph inference. Kandle, as a Whitebox Framework, fills the following gaps:
| Requirement | Blackbox Inference Engines | Kandle (Whitebox Framework) |
|---|---|---|
| Intermediate Computation | ❌ Cannot intervene after static graph compilation | ✅ Pause/Inspect at any layer via dynamic graph |
| Model Interpretability | ❌ Blackbox, internal states inaccessible | ✅ Hooks, layer-by-layer state export |
| Custom Compute Flow | ❌ Limited to predefined Pipelines | ✅ Fully programmable control flow |
| Pre/Post-processing | ❌ Must be implemented separately | ✅ Unified tensor operation system |
| API Learning Curve | ❌ Engine-specific APIs to learn | ✅ Zero cost for PyTorch users |
| Debugging Experience | ❌ Hard to pinpoint issues in a blackbox | ✅ "Breakpoint-style" step-by-step debugging |
| Inference Performance | ✅ Static graph global optimization | ⚠️ Eager-mode overhead, slower than specialized engines |
What Whitebox can do that Blackbox cannot:
- 🔬 Layer-wise Feature Extraction: Export intermediate Tensors at any layer for visual analysis.
- 🎨 Runtime Layer Replacement: Dynamically replace/skip certain layers to implement model pruning or A/B testing.
- 🧪 Custom Loss Functions: Design special computation paths combined with business logic.
- 🎯 Precise Memory Control: Manually manage Tensor lifecycles to optimize VRAM usage.
- 🌐 Deep Integration with DOM API: Hooks directly bind to Canvas/WebGL for real-time rendering.
Suitable Scenarios: Research, prototype development, model debugging, applications requiring intermediate calculations, audio/visual pre-processing, interpretability analysis. Unsuitable Scenarios: High-performance production inference (please use ONNX Runtime or WebLLM).
⚠️ This is a technical verification prototype, not a production-ready preview.
- ✅ The current version focuses on Forward Propagation Architecture Verification, implementing 200+ operators and a complete nn.Module system.
- 🚧 Autograd (Backpropagation) is under development and will be fully implemented in the next version.
⚠️ Happy Path Disclaimer: The current implementation mainly verifies the main flow (the happy path); edge cases and error handling are not yet robust.
- 🔒 No PRs Accepted Yet: The current development branch has completely diverged from the public version with breaking changes. Contributions will be opened once the architecture stabilizes.
- 💬 Feedback Welcome: I have been working somewhat in isolation, so I am very eager to hear the community's thoughts and suggestions on "what a JavaScript version of PyTorch should look like."
- 🎯 Operator Demand Collection: Besides primitive operators, I want to know which specific operators the community needs supported early on.
No installation required: you can try Kandle immediately. We provide a visual, interactive demo based on Qwen3-0.6B that fully showcases the unique advantages of an Eager Mode framework for model interpretability:
- 🤗 HuggingFace Spaces: https://huggingface.co/spaces/finalkk/kandle-demo
- ⚡ Vercel: http://kandle-demo.vercel.app/
| Feature | Description |
|---|---|
| 🎯 Step-by-Step Execution | Execute forward propagation step by step |
| ⏮️ Time Travel | Step back and re-select the generation path |
| 🎲 Manual Intervention | Manually select candidate words at each token generation to explore different branches |
| 🔍 Logit Lens | Visualize the probability distribution of each layer's output in the vocabulary space |
| 🔗 Attention Links | Interactively view Self-Attention weight connection relationships |
| 🔥 Heatmap Visualization | Real-time display of Attention Maps and activation value distributions |
💡 This is what a Whitebox framework means: not just running inference, but "dissecting" every step of the computation.
- Explore the Model's Thought Process: Observe the top-k tokens of each layer's output during single-step execution to understand how the model gradually "focuses" on the final answer.
- Compare Different Paths: Backtrack and select different candidate words to observe the bifurcation points of the generation results.
- Discover Attention Patterns: Use Attention Links to discover key tokens the model focuses on (e.g., pronoun resolution, context dependencies).
- Debugging and Teaching: Suitable for researchers to understand the internal mechanisms of Transformers, or for teaching demonstrations.
- Original Pre-trained Version Only: Currently, techniques like quantization are not implemented; it only loads original bf16 weights.
- Relatively Large Model Size: The original model size is about 1.5GB. It is recommended to download the model manually and load it using WebFile or Upload. Qwen3-0.6B Link
# Browser environment only needs the core library
# Using pnpm (Recommended)
pnpm add @kandle/core @kandle/backend-webgpu
# Optional: type definitions, utilities, and model building tools
pnpm add @kandle/types @kandle/utils @kandle/model-utils
# Or using npm
npm install @kandle/core @kandle/backend-webgpu
# If running in a Node.js environment, install webgpu polyfill additionally
npm install webgpu
- Node.js: ≥ 18.0.0 (ES2020+ support required)
- Browser: Chrome/Edge ≥ 113 (WebGPU support)
- TypeScript: ≥ 5.0 (Optional)
import { env } from "@kandle/core";
import { WebGPUBackend } from "@kandle/backend-webgpu";
export async function initWebGPU() {
const backend = await WebGPUBackend.create();
env.setBackend(backend);
env.setDefaultDevice(backend.name);
}
import * as k from '@kandle/core';
import { Tensor } from '@kandle/core';
// Create Tensor
const a = new Tensor([[1, 2, 3], [4, 5, 6]], { dtype: 'float32' });
const b = k.randn([2, 3]);
// Arithmetic operations (supports broadcasting)
const result = a.add(b).mul(2).softmax(-1);
// Get data (WebGPU asynchronous read)
const data = await result.dataAsync();
console.log(data); // Float32Array [...]
// Shape operations (Zero-copy views)
const transposed = a.transpose(0, 1);
console.log(transposed.shape); // [3, 2]
console.log(a.storageId === transposed.storageId); // true
console.log(a.id === transposed.id); // false
const reshaped = a.reshape([3, 2]);
console.log(reshaped.shape); // [3, 2]
console.log(a.storageId === reshaped.storageId); // true
console.log(a.id === reshaped.id); // false
// Advanced Indexing (Python style)
const slicedContiguous = a.slice(":1, 1:"); // a[:1, 1:]
console.log(slicedContiguous.shape) // [1, 2];
console.log(a.storageId === slicedContiguous.storageId); // true
console.log(a.id === slicedContiguous.id); // false
console.log(slicedContiguous.isContiguous); // true (this slice remains contiguous)
// Non-contiguous slicing
const slicedNonContiguous = a.slice("::2, ::-1"); // a[::2, ::-1]
console.log(slicedNonContiguous.shape) // [1, 3];
console.log(a.storageId === slicedNonContiguous.storageId); // true
console.log(a.id === slicedNonContiguous.id); // false
console.log(slicedNonContiguous.isContiguous); // false (non-contiguous here)
import * as k from '@kandle/core';
// Matrix Multiplication
const x = k.randn([128, 512]);
const weight = k.randn([512, 256]);
const output = k.matmul(x, weight); // [128, 256]
console.log(output.shape);
// Batch Matrix Multiplication
const batch = k.randn([4, 64, 128]);
const weights = k.randn([4, 128, 64]);
const batchOut = k.bmm(batch, weights); // [4, 64, 64]
console.log(batchOut.shape);
// Linear Layer (with bias)
const weightLinear = k.randn([256, 512]);
const bias = k.randn([256]);
const result = k.linear(x, weightLinear, bias);
console.log(result.shape); // [128, 256]
import { nn, Tensor, randn } from '@kandle/core';
class MLP extends nn.Module {
fc1: nn.Linear;
fc2: nn.Linear;
constructor(inputDim: number, hiddenDim: number, outputDim: number) {
super();
this.fc1 = new nn.Linear(inputDim, hiddenDim);
this.fc2 = new nn.Linear(hiddenDim, outputDim);
}
async forward(x: Tensor): Promise<Tensor> {
// JS cannot overload operators, so modules expose call() in place of Python's model(x)
x = await this.fc1.call(x);
x = x.relu();
x = await this.fc2.call(x);
return x;
}
}
// Using the model
const model = new MLP(784, 256, 10);
const input = randn([32, 784]);
const output = await model.call(input);
console.log(output.shape); // [32, 10]
import * as k from '@kandle/core';
// Automatically release intermediate tensors
const result = k.tidy( () => {
const a = k.randn([1000, 1000]);
const temp1 = a.mul(2);
const temp2 = temp1.add(3);
return temp2.sum(); // Only the sum result is kept, temp1/temp2 are automatically released
});
console.log('Result:', await result.dataAsync());
Kandle uses a Monorepo architecture organized by pnpm workspace. The responsibilities of each package are as follows:
| Package Name | Function Description | Core File |
|---|---|---|
| @kandle/core | 🎨 User-side API, Tensor class, Operators, nn.Module | src/tensor.ts |
| @kandle/backend-webgpu | ⚡ WebGPU Backend Implementation (GPU Compute) | src/index.ts |
| @kandle/types | 📐 Type definitions, Interfaces, OpSchema | src/opschema/ |
| @kandle/utils | 🛠️ Utility functions, dtype handling, shape inference | src/index.ts |
| @kandle/model-utils | 🤖 Model building tools (Qwen3, Whisper) | src/index.ts |
- ✅ Stride Mechanism: Fully implements PyTorch-style memory layout management.
- ✅ Zero-Copy View Operations: Operations like `transpose`, `permute`, and `slice` do not copy data.
- ✅ Non-Contiguous Memory Computation: Supports direct computation after reshape or slice.
- ✅ Memory Format: Supports Contiguous and ChannelsLast layouts.
// Non-contiguous memory example
const x = randn([4, 3, 224, 224]);
const transposed = x.transpose(1, 2); // Zero-copy, strides changed
const sliced = x.slice("1:-1"); // View operation
// Automatically handles non-contiguous memory computation
const result = transposed.add(1).relu(); // Backend handles strides automatically
Fully compatible with NumPy/PyTorch broadcasting rules:
const a = randn([4, 1, 3]);
const b = randn([3]);
const result = a.add(b); // Automatically broadcasts b to [4, 1, 3]
💡 Design Philosophy: Logical dtype is separated from physical dtype; the backend automatically selects storage format based on device capabilities.
💡 Quantized types are planned, and storage optimization for bool / int8 / int16 / float16 will be added later.
| DType | TypedArray | WebGPU Storage | Status | Notes |
|---|---|---|---|---|
| `float32` | `Float32Array` | `f32` | ✅ Full | Direct hardware support |
| `float64` | `Float64Array` | `f32` | ⚠️ Downgraded | Downgrades to f32, precision loss exists |
| `float16` | `Uint16Array` | `f16` / `f32` | ⚠️ Conditional | Requires the shader-f16 extension |
| `int32` | `Int32Array` | `i32` | ✅ Full | Direct support |
| `uint32` | `Uint32Array` | `u32` | ✅ Full | Direct support |
| `int8` / `uint8` | `Int8Array` / `Uint8Array` | `i32` / `u32` | ⚠️ Extended | Extended storage to 32-bit |
| `int16` / `uint16` | `Int16Array` / `Uint16Array` | `i32` / `u32` | ⚠️ Extended | Downgraded storage |
| `complex64` / `complex128` | `Float32Array` / `Float64Array` | `vec2<f32>` | ⚠️ Basic | Interleaved storage `[r0,i0,r1,i1,...]` |
| `bool` | `Uint8Array` | `u32` | ⚠️ Extended | Extended storage |
💡 This list was generated with AI assistance and may contain omissions or not-yet-implemented items; treat it as a reference only.
💡 The following uses torch operator names. To match JavaScript conventions, snake_case names are exposed as camelCase in the actual API.
📐 Arithmetic & Math Operations
Basic Arithmetic: add, sub, mul, div, pow, sqrt, abs, neg, reciprocal, floor, ceil, round, trunc, frac, sign
Trigonometric: sin, cos, tan, asin, acos, atan, atan2
Hyperbolic: sinh, cosh, tanh, asinh, acosh, atanh
Exponential & Logarithmic: exp, exp2, expm1, log, log10, log2, log1p
Special Functions: erf, erfc, sigmoid, logit, i0
🔢 Linear Algebra
Matrix Operations: matmul, mm, bmm, dot, mv, outer, addmm, addmv, baddbmm
Matrix Manipulation: diag, diagonal, trace, tril, triu
Decomposition & Solving (Planned): svd, qr, cholesky, solve
🎲 Reduction Operations
sum, mean, std, var, min, max, argmin, argmax, logsumexp, prod, norm, median, mode, all, any
Supports reduction on specific dimensions and keepdim parameter:
const x = randn([4, 5, 6]);
const result = x.sum(1, true); // Reduce on dim 1, keep dim -> [4, 1, 6]
🔍 Comparison & Logic
Comparison: eq, ne, lt, le, gt, ge, maximum, minimum, clamp
Logic: logical_and, logical_or, logical_not, logical_xor
Conditional Selection: where, masked_fill, masked_select
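A minimal sketch combining a few of the operations above (method names assume the camelCase mapping noted earlier, e.g. masked_fill becomes maskedFill, and the exact signatures are assumptions):
import * as k from '@kandle/core';
const scores = k.randn([4, 8]);
const mask = scores.gt(0);                        // element-wise comparison -> boolean mask
const clipped = scores.clamp(-1, 1);              // limit values to [-1, 1]
const masked = scores.maskedFill(mask, -1e9);     // fill positions where mask is true
const selected = k.where(mask, scores, clipped);  // take from scores where mask is true, else from clipped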
🔀 Shape Operations
View Operations (Zero Copy): view, reshape, transpose, permute, squeeze, unsqueeze, flatten
Concatenation & Splitting: cat, stack, split, chunk, unbind
Indexing & Slicing: slice, select, index_select, gather, scatter, masked_select
Repetition & Expansion: repeat, repeat_interleave, expand, tile
Flipping & Rotating: flip, fliplr, flipud, rot90, roll
Advanced: as_strided (Direct stride manipulation)
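For illustration, a short concatenation/stacking sketch (cat and stack are assumed to take a tensor array plus a dimension, mirroring PyTorch):
import * as k from '@kandle/core';
const p = k.randn([2, 3]);
const q = k.randn([2, 3]);
const joined = k.cat([p, q], 0);   // concatenate along dim 0 -> [4, 3]
const piled = k.stack([p, q], 0);  // insert a new leading dim -> [2, 2, 3]
console.log(joined.shape, piled.shape);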
🧮 Convolution & Pooling
Convolution: conv1d, conv2d, conv3d, conv_transpose2d, conv_transpose3d
Pooling: max_pool1d, max_pool2d, max_pool3d, avg_pool1d, avg_pool2d, avg_pool3d
Adaptive Pooling: adaptive_avg_pool2d, adaptive_max_pool2d
Padding: pad (Supports constant, reflect, replicate, circular modes)
📊 Normalization
batch_norm, layer_norm, group_norm, instance_norm, rms_norm, normalize
⚡ Activation Functions
relu, gelu, silu (swish), elu, selu, leaky_relu, prelu, rrelu, hardtanh, relu6, softplus, softsign, softmax, log_softmax, softmin, sigmoid, tanh, log_sigmoid, hardsigmoid, hardswish, mish, dropout
🎵 FFT (Fast Fourier Transform)
Real FFT: rfft, irfft, rfft2, irfft2
Complex FFT: fft, ifft, fft2, ifft2
Application: Audio signal processing, spectrum analysis
📈 Cumulative Operations
cumsum, cumprod, cummax, cummin, diff
🔧 Other Utilities
Sorting: sort, argsort, topk, kthvalue
Unique Values: unique, unique_consecutive
Filling & Cloning: fill_, zero_, clone, detach
Type Conversion: to (dtype/device conversion), contiguous (force contiguous memory)
- `nn.Module`: Base class, supports `forward` and `parameters()`
- `nn.Parameter`: Learnable parameter wrapper
- Containers: `Sequential`, `ModuleList`, `ModuleDict` (a usage sketch follows below)

`state_dict()` and `load_state_dict()` are hard to align perfectly; refer to the `IO` class API below for model loading.
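A minimal container sketch, assuming nn.Sequential accepts child modules in its constructor and is invoked via .call() like any other module:
import { nn, randn } from '@kandle/core';
// Hypothetical sketch: the Sequential constructor form shown here is an assumption
const net = new nn.Sequential(
  new nn.Linear(784, 256),
  new nn.ReLU(),
  new nn.Linear(256, 10),
);
const logits = await net.call(randn([32, 784]));
console.log(logits.shape); // [32, 10]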
Linear & Embedding Layers
- `nn.Linear`: Fully connected layer
- `nn.Embedding`: Embedding layer
Convolution Layers
- `nn.Conv1d`, `nn.Conv2d`, `nn.Conv3d`
- `nn.ConvTranspose2d`, `nn.ConvTranspose3d`
Pooling Layers
- `nn.MaxPool1d`, `nn.MaxPool2d`, `nn.MaxPool3d`
- `nn.AvgPool1d`, `nn.AvgPool2d`, `nn.AvgPool3d`
Normalization Layers
- `nn.LayerNorm`
- `nn.RMSNorm`
Activation Layers
- `nn.ReLU`, `nn.GELU`, `nn.SiLU`
- `nn.LeakyReLU`, `nn.PReLU`, `nn.Softmax`, `nn.LogSoftmax`
- `nn.Sigmoid`, `nn.Tanh`, `nn.Softplus`, `nn.Mish`
Supports Forward and Backward Hooks (Backward requires Autograd support):
// Register forward Hook, register_forward_hook
model.registerForwardHook(async (module, input, output) => {
console.log('Layer output shape:', output.shape);
});
// Register forward pre-hook, register_forward_pre_hook
model.registerForwardPreHook(async (module, input) => {
console.log('Layer input shape:', input.shape);
});
Use Cases:
- Feature Visualization (e.g., CAM, Grad-CAM)
- Intermediate Layer Output Extraction
- Model Debugging and Profiling
- Dynamic Layer Replacement
Implements core functionality of PyTorch's audio processing library:
Transforms
Class API:
- `audio.Spectrogram`: Spectrogram
- `audio.MelScale`: Mel Filter Bank
- `audio.MelSpectrogram`: Mel Spectrogram
- `audio.MFCC`: Mel-frequency cepstral coefficients
- `audio.AmplitudeToDB`: Amplitude to Decibels
- `audio.InverseMelScale`: Inverse Mel Transform
- `audio.GriffinLim`: Phase Reconstruction
- `audio.FrequencyMasking`: Frequency Masking (Data Augmentation)
- `audio.TimeMasking`: Time Masking (Data Augmentation)
Functional API:
Corresponding audio.functional.* functions
Usage Example
import { audio, Tensor } from '@kandle/core';
// Assume 3 seconds of audio data
const audioData = new Float32Array(16000 * 3);
const waveform = new Tensor(audioData, { shape: [1, audioData.length] });
// Compute Mel Spectrogram
const melSpec = new audio.MelSpectrogram({
sample_rate: 16000,
n_fft: 400,
hop_length: 160,
n_mels: 80,
});
const melOutput = await melSpec.call(waveform);
console.log(melOutput.shape); // [1, 80, 301]
// Convert to log scale
const ampToDB = new audio.AmplitudeToDB();
const logMel = await ampToDB.call(melOutput);
console.log(logMel.shape); // [1, 80, 301]
import { audio, Tensor } from '@kandle/core';
// Assume 3 seconds of audio data
const audioData = new Float32Array(16000 * 3);
const waveform = new Tensor(audioData, { shape: [1, audioData.length] });
// Compute Spectrogram
const spectrogram = new audio.Spectrogram({
n_fft: 512,
hop_length: 256,
power: 2.0,
});
const spec = await spectrogram.call(waveform);
console.log(spec.shape); // [1, 257, 188]
// Apply Mel Filter
const melScale = new audio.MelScale({
n_mels: 80,
sample_rate: 16000,
n_stft: 257,
});
const melSpec = await melScale.call(spec);
console.log(melSpec.shape); // [1, 80, 188]
// Compute MFCC
const mfcc = new audio.MFCC({
sample_rate: 16000,
n_mfcc: 13,
n_mels: 40
});
const mfccFeatures = await mfcc.call(waveform);
console.log(mfccFeatures.shape); // [1, 13, 241]
// Data Augmentation: Time Masking
const timeMask = new audio.TimeMasking({ time_mask_param: 10 });
const augmented = await timeMask.call(melSpec);
console.log(augmented.shape); // [1, 80, 188]
- ✅ Safetensor: HuggingFace's mainstream format, supports shard indexes (`.safetensors.index.json`)
- ✅ NumPy (`.npy`): Used for test data loading

Unified data source interface across platforms:
- `FileByteSource` (Node.js)
- `BlobByteSource` (Web)
- `BufferByteSource` (Memory)
import { io } from '@kandle/core';
// Load safetensor (read header only, data not loaded)
const group = await io.loadSafetensor('./model.safetensors');
// View all weights
group.dumpWeightMap();
// Load specific tensor
const layer = group.getLayer('model.embed_tokens.weight');
const tensor = await io.tensorFromSafetensorLayer(layer!, { device: 'webgpu' });
console.log(tensor.shape, tensor.dtype);
// Release resources
group.close();
For full IO usage, see the IO Documentation
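For browser scenarios, a hedged sketch of loading weights picked via a file input through the ByteSource interface above; the BlobByteSource constructor form and a loadSafetensor overload accepting a ByteSource are assumptions, not confirmed API:
import { io } from '@kandle/core';
// Assumption: BlobByteSource wraps a File/Blob and can be passed to io.loadSafetensor
async function loadFromFile(file: File) {
  const source = new io.BlobByteSource(file);     // hypothetical constructor form
  const group = await io.loadSafetensor(source);  // assumed ByteSource overload
  group.dumpWeightMap();
  return group;
}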
💡 Design Goal: Constructing these models is not to replace dedicated inference engines, but to demonstrate how Kandle, as a Whitebox Framework, implements model architectures highly aligned with PyTorch.
Qwen3MLP (SwiGLU) Code Comparison: HuggingFace Transformers Official vs. Kandle Implementation
🐍 Python (HuggingFace Transformers)
# Source: huggingface/transformers
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3/modeling_qwen3.py
class Qwen3MLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
return down_proj
📘 TypeScript (Kandle)
// @kandle/model-utils
// src/mlp/swiglu.ts
export class SwiGLUMLP extends nn.Module {
gate_proj: nn.Linear;
up_proj: nn.Linear;
down_proj: nn.Linear;
hiddenSize: number;
intermediateSize: number;
constructor(options: SwiGLUMLPOptions) {
super();
const {
hiddenSize,
intermediateSize,
bias = false,
} = options;
this.hiddenSize = hiddenSize;
this.intermediateSize = intermediateSize;
this.gate_proj = new nn.Linear(hiddenSize, intermediateSize, bias);
this.up_proj = new nn.Linear(hiddenSize, intermediateSize, bias);
this.down_proj = new nn.Linear(intermediateSize, hiddenSize, bias);
this.addModule('gate_proj', this.gate_proj);
this.addModule('up_proj', this.up_proj);
this.addModule('down_proj', this.down_proj);
}
async forward(x: Tensor): Promise<Tensor> {
const gateProj = await this.gate_proj.call(x);
const gate = functional.silu(gateProj);
const up = await this.up_proj.call(x);
const hidden = gate.mul(up);
const output = await this.down_proj.call(hidden);
return output;
}
}
📌 Source Note: Python code referenced from huggingface/transformers - modeling_qwen3.py
Architecture Completeness:
- ✅ `Qwen3DecoderLayer`: Fully implements Attention + MLP + LayerNorm
- ✅ `GroupedQueryAttention`: GQA with RoPE + Q/K RMSNorm
- ✅ `SwiGLUMLP`: SwiGLU activation (silu(gate) * up)
- ✅ `nn.RMSNorm`: RMS Normalization
- ✅ Complete Forward Propagation flow, including KV Cache and Causal Mask
Full Example: playground-web/qwen3/, playground-node/src/qwen3/
import { Qwen3ForCausalLM } from '@kandle/model-utils';
const model = new Qwen3ForCausalLM(config, /* useCausalMask */ true);
await model.loadFromSafetensor(safetensorGroup);
const output = await model.forward(inputIds, {
positionIds,
pastKeyValues,
attentionMask,
});
- Architecture Components:
WhisperEncoder,WhisperDecoder,WhisperModel - Audio Processing: Integrated Mel Spectrogram preprocessing
- Decoding Strategy: Greedy Decoding
- Full Example: playground-node/src/whisper/
import { Whisper, prepareAudioInput } from '@kandle/model-utils';
const model = new Whisper(WHISPER_BASE_CONFIG);
await model.loadFromSafetensor(safetensorGroup);
const melInput = await prepareAudioInput(audioFloat32Array);
const result = await transcribe(model, tokenizer, melInput);
console.log(result.text);
- RoPE: `applyRotaryPosEmb`
- Sinusoidal Positional Encoding: `sinusoidalPositionEncoding`
- KV Cache: `KVCache` (Inference acceleration)
- Attention Variants: `multiHeadAttention`, `groupedQueryAttention`, `multiQueryAttention`
- MLP Variants: `SwiGLU`, `GeGLU`
┌─────────────────────────────────────────────────────────┐
│ User API Layer (@kandle/core) │
│ Tensor, zeros, randn, nn.Module, audio... │
└────────────────────┬────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Dispatch Layer │
│ Operation routing, dtype resolution, broadcasting │
└────────────────────┬────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────▼──────┐ ┌────▼──────┐ ┌────▼──────┐
│ Handler 1 │ │ Handler 2 │ │ Handler N │ (Mechanism-based)
│ Map/Reduce│ │ Composite │ │ FFT │
└────┬──────┘ └────┬──────┘ └────┬──────┘
│ │ │
└───────────────┼───────────────┘
│
┌────────────────────▼────────────────────────────────────┐
│ Kernel Layer │
│ Backend-specific implementations │
└────────────────────┬────────────────────────────────────┘
│
┌───────────┴───────────┐
┌────────▼─────────┐ ┌─────────▼──────────┐
│ @kandle/backend- │ │ @kandle/backend-js │
│ webgpu │ │ (CPU fallback) │
└──────────────────┘ └────────────────────┘
Referencing PyTorch's ATen/c10 design:
// 1. Storage: Physical memory
interface IStorage {
data: TypedArray;
byteOffset: number;
byteLength: number;
}
// 2. TensorHandle: Metadata
interface ITensorHandle {
storage: IStorage;
shape: number[];
strides: number[];
offset: number;
dtype: DType;
}
// 3. Tensor: User-side wrapper
class Tensor {
constructor(public handle: ITensorHandle) {}
// View operations modify handle only, no storage copy
transpose(dim0: number, dim1: number): Tensor {
const newStrides = swapStrides(this.handle.strides, dim0, dim1);
return new Tensor({ ...this.handle, strides: newStrides });
}
}
Advantages:
- ✅ Zero-copy view operations
- ✅ Supports non-contiguous memory layouts
- ✅ Flexible memory management strategies
⚠️ Difference from PyTorch: PyTorch uses a complex Dispatch Key system (e.g., `AutogradCPU`, `AutogradCUDA`) supporting multi-dimensional dispatch (backend, layout, autograd). Kandle currently implements a simplified version based on `opName + device` dispatch.
📝 Architecture Evolution: The current dispatch routing mechanism will be rewritten in future versions, but the core mechanized routing philosophy remains.
Routing by Computation Mechanism:
// packages/utils/src/dispatchUtils.ts
const handlers = {
'map_reduce': MapReduceHandler, // Element-wise + Reduction
'composite': CompositeHandler, // Pure JS composite operations
'fft': FFTHandler, // FFT specialized processing
'conv': ConvolutionHandler, // Convolution specialized
'matmul': MatmulHandler, // Matrix Multiplication specialized
  // ... (other mechanism handlers)
};
// Simplified dispatch logic (Non-Dispatch Key)
function dispatch(opSchema: OpSchema, ...args) {
const handler = handlers[opSchema.mechanism];
const backend = getBackendByDevice(args[0].device);
return handler.execute(backend, opSchema, ...args);
}
Current Implementation:
- ✅ Routes to different Handlers by the `mechanism` field
- ✅ Gets the corresponding Backend (webgpu / js) by `device`
- ❌ Does not support runtime dynamic registration of Dispatch rules (Under development)
Automatically handles dtype conversion and device compatibility:
// User code
const x = randn([100], { dtype: 'float64' });
// Backend actual storage (WebGPU does not support f64)
// Logical dtype: float64
// Physical dtype: float32 (downgrade)
// Upload: Float64Array -> Float32Array (precision loss warning)
// Download: Float32Array -> Float64ArrayFeatures:
- Auto-detects
shader-f16extension - Transparently handles dtype downgrading
- Supports
vec2<f32>mapping for complex types
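A small sketch of the same idea in user code (it assumes dataAsync() converts back to the TypedArray matching the logical dtype, as described above):
import * as k from '@kandle/core';
// float64 is physically stored as f32 on WebGPU, but the logical dtype is preserved
const hp = k.randn([4], { dtype: 'float64' });
console.log(hp.dtype);               // 'float64' (logical dtype)
const values = await hp.dataAsync(); // read back through the f32 physical storage
console.log(values);                 // Float64Array, but with f32-level precision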
💡 Design Inspiration: PyTorch uses `native_functions.yaml` to define operator signatures and generates C++ code via torchgen. Kandle implements a similar idea, using TypeScript interfaces as the OpSchema and generating user-side APIs via Codegen.
Generator: File Location
Generated Files: File Location
Reduces boilerplate and ensures API consistency:
pnpm codegen
OpSchema Definition Example:
// packages/types/src/opschema/ops/activation.ts
export const gelu: OpEntry = {
name: 'gelu',
mechanism: 'Iterator',
iteratorType: 'Map',
signature: {
params: [
{ name: 'self', type: SchemaT.Tensor() },
{ name: 'approximate', type: SchemaT.String(['none', 'tanh']), default: 'none' },
],
returns: { single: SchemaT.Tensor() },
},
iteratorConfig: {
factory: 'unary',
tensorInputs: ['self'],
scalarArgs: ['approximate'],
},
shape: SchemaShape.same('self'),
dtype: SchemaDtype.same('self'),
dispatchKey: 'gelu',
codegen: { tensorMethod: 'gelu', namespace: 'nn.functional' },
};
Generated Content:
- `methods-gen.ts`: Tensor prototype methods (e.g., `tensor.add()`)
- `ops-gen.ts`: Top-level operation functions (e.g., `add(tensor, other)`)
- `types-gen.ts`: OpSchema type definition summary
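Based on the gelu entry above, the generated surface should expose both call forms (a sketch; the exact export path of the nn.functional namespace is an assumption):
import * as k from '@kandle/core';
const x = k.randn([2, 8]);
const viaMethod = x.gelu();                  // Tensor prototype method from methods-gen.ts
const viaFunction = k.nn.functional.gelu(x); // namespaced function from ops-gen.ts (export path assumed)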
Comparison with PyTorch:
| Feature | PyTorch (YAML) | Kandle (TypeScript Interface) |
|---|---|---|
| Definition Format | `native_functions.yaml` | TypeScript Interface |
| Generation Target | C++ / Python Binding | TypeScript API |
| Type Check | Runtime | Compile-time (TypeScript) |
| Extensibility | ✅ Supports complex Dispatch | ⚠️ Simplified dispatch only |
import { randn, slice } from '@kandle/core';
const x = randn([3, 4, 5]);
// Python: x[:, 1:5, ::2]
// Kandle:
const result = x.slice(":,1:5,::2");
console.log(result.shape); // [3,3,3]
// Supports negative indexing
const tail = x.slice("-5:"); // x[-5:]
console.log(tail.shape); // [3,4,5]
Detailed documentation: see knownIssues/
Issue: WebGPU's buffer.mapAsync() forces all data reading to be asynchronous.
Impact:
- ✅ The `forward` method is unified as `async`.
- ❌ Cannot directly read values of other Tensors inside a kernel (e.g., for conditional checks).
- ❌ Implementing composite operators becomes more complex.
Mitigation:
- Provide synchronous JS backend (Under development).
- Design to avoid operations requiring synchronous reading.
Details: knownIssues/async.md
Issue: WebGPU does not support some dtypes, requiring downgrading or extended storage.
Impact:
- `float64` → `float32`: Precision loss.
- `int8` → `i32`: Memory waste (4x).
- `complex128` → `vec2<f32>`: Precision loss.
Recommendation:
- Prioritize `float32` and `int32`.
- Use the JS backend for high precision (under development).
Details: See Core Features - DType Support
Issue: The current complex type implementation is basic, supporting only basic arithmetic.
Plan: The complex number calculation system will be refactored in future versions.
Details: knownIssues/complex.md
Issue: Significant use of as any type assertions.
Plan: Gradually strengthen TypeScript type inference and generic constraints.
Details: knownIssues/type.md
Issue: The current dispatch layer mixes scheduling logic with some computation logic.
Plan: Refactor it into a pure routing layer.
Details: knownIssues/dispatch.md, knownIssues/opschema.md
Issue: WebGPU backend may produce numerical differences across different hardware/drivers, especially in certain activation functions (like GELU, softmax) and mathematical operations, leading to NaN or precision issues.
Impact:
- ⚠️ Identical models may produce slightly different outputs on different GPU devices.
- ❌ Extreme cases may produce NaN values (e.g., unclamped GELU, softmax exp overflow).
- 🔴 Numerical instability caused by hardware/driver implementation differences may be unavoidable.
Known Cases:
- GELU Activation NaN: Without limiting tanh input range, large activation values in certain layers can produce NaN (See knownIssues/shader.md).
- Softmax Overflow: If input is not subtracted by max value, exp may overflow to Infinity.
- Precision Loss Accumulation: Float32 precision loss may accumulate after multi-layer computation.
Mitigation:
- ✅ Numerical stability protection added to key operators (e.g., clamp for GELU, subtract max for softmax).
- ⚠️ Use identical hardware for testing and deployment to avoid cross-device result differences.
- 📊 Monitor the numerical ranges of key outputs to detect anomalies early (see the sketch below).
- 🔍 Refer to knownIssues/shader.md for detailed troubleshooting guides.
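For example, the range monitoring above can be wired through a forward hook on a model built from nn.Module (a sketch; it assumes min()/max() with no arguments reduce over all elements):
model.registerForwardHook(async (module, input, output) => {
  const [lo] = await output.min().dataAsync();
  const [hi] = await output.max().dataAsync();
  if (!Number.isFinite(lo) || !Number.isFinite(hi)) {
    console.warn('Non-finite activation detected in', module.constructor.name, lo, hi);
  }
});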
Current Limitations:
- Since the WebGPU specification does not mandate precise floating-point behavior, implementations across drivers/hardware may vary.
- There is currently no excellent solution to completely eliminate this difference; this is an inherent limitation of the WebGPU ecosystem.
Details: knownIssues/shader.md
Issue: The WebGPU backend suffers from VRAM leaks because the JavaScript side cannot perceive WebGPU side memory pressure.
Root Causes:
- ❌ JS & WebGPU Memory Isolation: JavaScript's Garbage Collection (GC) mechanism cannot perceive GPU VRAM pressure.
- ❌ FinalizationRegistry Timing Uncontrollable: Even when `FinalizationRegistry` is used to register destructors, the callback timing is entirely decided by GC and may fire only after VRAM is already exhausted.
- ⚠️ Complex View Tensor References: View Tensors created by `transpose`, `slice`, etc. share Storage with the original Tensor, creating complex reference relationships that make the precise release timing difficult to determine.
Impact:
- ❌ Long inference sessions (e.g., generating 1000+ tokens) may crash due to VRAM exhaustion.
- ⚠️ Even after a large model is loaded, intermediate Tensors that are no longer used may still occupy VRAM.
- ⚠️ View operations (like `view()`, `transpose()`) extend the lifecycle of the original Storage even though they don't copy data.
My Optimization Attempts:
- ⚠️ Implemented a complex Memory Pool mechanism to reuse GPU Buffers, but it didn't achieve practical results, so it is disabled in the current release. See File Location.
- ✅ Provided `tidy()` and manual `dispose()` APIs.
- ✅ Attempted to optimize reference counting for View Tensors.
- ⚠️ Problems persist: due to the inherent limitation of JS/WebGPU memory isolation, perfect automatic management is impossible.
Mitigation (User Cooperation Required):
- Highly Recommended: Use `tidy()` to wrap computation logic and automatically manage intermediate Tensor lifecycles.
const result = tidy(() => {
const temp1 = a.mul(2);
const temp2 = temp1.add(3);
return temp2.sum(); // Only the sum result is kept
});
- Explicitly call `dispose()` to release unused Tensors.
const temp = a.mul(2);
const result = temp.add(3);
temp.dispose(); // Manual release
- Periodically monitor VRAM usage (Chrome DevTools → Performance Monitor).
- Avoid creating massive temporary Tensors in loops without releasing them.
Long-term Plan:
- Optimize Memory Pool strategy for more aggressive memory reclamation.
- Improve reference tracking mechanism for View Tensors.
Looking for expert advice!
Details: knownIssues/cache.md
| Browser | Minimum Version | Notes |
|---|---|---|
| Chrome | 113+ | ✅ Full Support |
| Edge | 113+ | ✅ Full Support |
| Safari | Preview | |
| Firefox | Experimental | |
Location: playground-web/qwen3/
cd playground-web
pnpm install
pnpm dev
# Access http://localhost:5173/qwen3/
Features:
- WebGPU accelerated text generation
- Streaming output support
- Visualized Attention weights
Location: playground-node/src/whisper/
cd playground-node
pnpm install
pnpm start
Features:
- Loads local audio files
- Mel Spectrogram preprocessing
- End-to-end speech-to-text
- Architecture Refactoring: Further optimize the layered design, refine the Codegen system and type inference.
- Autograd (Automatic Differentiation): Backpropagation system supporting gradient calculation and parameter optimization.
  - Currently implementing an auto-differentiation system based on `derivatives.yaml`.
  - Designing a TypeScript version of the parser for PyTorch's DSL (complex; having AI implement all primitive operators directly might be faster).
  - Automatically generate backpropagation operators via derivatives.yaml, ensuring consistency with PyTorch behavior.
  - Goal: Cover gradient definitions for most common forward operators, support higher-order derivatives.
- nn.Module Enhancements:
  - ✅ Generator-implemented layer-by-layer debugging.
  - 🚧 Runtime Module Swapping.
  - 🚧 State Checkpoints.
- Custom Kernel Registration: Runtime custom kernel registration, supporting Fused Kernel optimization.
- Pure JS Backend Completion: Fully synchronous CPU computation backend (analogous to PyTorch CPU).
- Domain Module Completion: Continue perfecting the audio module (benchmarking torchaudio) and the vision module (benchmarking torchvision).
- Quantization Support:
  - `int4`, `int8` quantization dtypes.
  - Dynamic Quantization.
  - Static Quantization.
- Independent Scalar Math Library: Solve type conversion issues for mixed-dtype calculations in JS.
- Performance Optimization:
  - Kernel Fusion.
  - Memory Pool Optimization.
  - Shader Cache System.
- Remote Backend: Distributed computing backend based on WebSocket/gRPC.
- Training API: Complete training loop support (requires Autograd completion).
- NumPy API Compatibility Layer: Reuse the computation dispatch architecture, add `numpy` operators, exposed via the namespace `import { np } from '@kandle/core'`.
- Model Interpretability UI Component Library (React-based):
  - Heatmap Visualization.
  - Feature Maps display.
  - Attention Weight Visualization.
  - Inference Process Animation.
- Pre-trained Model Ecosystem:
  - Launch an independent `@kandle/models` package, implementing functionality similar to HuggingFace Transformers.
  - Provide out-of-the-box pre-trained models (LLaMA, BERT, ViT, Whisper, etc.).
  - Support loading models and configs directly from the HuggingFace Hub.
  - Unified model loading and inference interface.
- GitHub Agent Automated Workflow:
  - Implement an intelligent GitHub Agent listening for specific Issue/PR formats.
  - When an operator request matches, automatically trigger the Agent to:
    - Search relevant technical docs and PyTorch implementations.
    - Generate operator definitions (OpSchema).
    - Implement the Kernel (WebGPU/JS dual backend).
    - Automatically generate functional tests and numerical validation cases.
    - Submit a Pull Request for human review.
  - Lower the community contribution threshold and accelerate operator ecosystem construction.
⚠️ Naming Convention Transition: As a side effect of the Vibe Coding workflow, the current code contains a mix of `snake_case` and `camelCase`. I will gradually unify this to `camelCase` in future versions to align with JavaScript/TypeScript community conventions.
Due to language differences between JavaScript and Python, some APIs cannot be perfectly aligned:
Python (Keyword Arguments):
torch.zeros(size=(3, 4), dtype=torch.float32, device='cuda')
JavaScript (Object Arguments):
zeros([3, 4], { dtype: 'float32', device: 'webgpu' })
Since JavaScript does not support operator overloading, basic operations require explicit method calls:
| Python | TypeScript (Kandle) |
|---|---|
| `a + b` | `add(a, b)` or `a.add(b)` |
| `a - b` | `sub(a, b)` or `a.sub(b)` |
| `a * b` | `mul(a, b)` or `a.mul(b)` |
| `a / b` | `div(a, b)` or `a.div(b)` |
| `a @ b` | `matmul(a, b)` or `a.matmul(b)` |
| `model(x)` | `model.call(x)` |
💡 `nn.Module`'s `__call__` must be invoked explicitly via the `.call()` method.
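So a Python expression like (a + b) @ w becomes either chained methods or the equivalent top-level functions:
import * as k from '@kandle/core';
const a = k.randn([4, 8]);
const b = k.randn([4, 8]);
const w = k.randn([8, 2]);
const y1 = a.add(b).matmul(w);        // method chaining
const y2 = k.matmul(k.add(a, b), w);  // top-level function form
console.log(y1.shape, y2.shape);      // [4, 2] [4, 2]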
Python:
x[:, 1:5]
JavaScript (Function Simulation):
x.slice(":,1:5")
Regarding parameter positioning, two options are being considered:
- Full Alignment with Torch: Attempt complete alignment via complex overloading.
  Most APIs are feasible, but the implementation becomes overly complex, and a few APIs still cannot be aligned and must be memorized separately, leading to an inconsistent experience.
- Design a JS Specification: Define a JS-first specification and enforce "alignment after translation" via rules.
  Development is simpler, but the experience degrades and alignment with Torch is lower.
Kandle uses Eager Mode (dynamic graph) execution, which differs fundamentally from static graph inference engines:
| Feature | Eager Mode (Kandle) | Static Graph (ONNX) |
|---|---|---|
| Execution Style | Op-by-Op execution | One-time graph optimization |
| Intermediate State | ✅ Accessible anytime | ❌ Invisible after compilation |
| Dynamic Control Flow | ✅ Supports if/loop | ❌ Fixed at export time |
| Memory Overhead | ⚠️ Higher (no graph-level optimization) | ✅ Low after optimization |
| Inference Speed | ⚠️ Slower (op-by-op dispatch) | ✅ Extreme optimization |
| Debugging Experience | ✅ Excellent | ❌ Difficult |
✅ Recommend Kandle:
- Research and Prototype Development
- Model Debugging and Interpretability Analysis
- Applications requiring intermediate calculations (e.g., Audio Preprocessing + Model Inference)
- Teaching and Learning
❌ Do Not Recommend Kandle:
- High-performance production inference (Please use ONNX Runtime)
- Mobile/Edge devices (Please use WebLLM or TFLite)
- Real-time applications strictly sensitive to latency
- Avoid Unnecessary Data Reads: Reduce `dataAsync()` calls.
- Use `tidy()` for Memory: Automatically release intermediate tensors.
- Batch Inference: Increase the batch size to improve GPU utilization.
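A short sketch combining these practices (stack's signature is assumed to mirror PyTorch; everything else uses APIs shown earlier):
import * as k from '@kandle/core';
// Batch the inputs, keep intermediates inside tidy(), and read back only once
const samples = [k.randn([784]), k.randn([784]), k.randn([784])];
const weight = k.randn([784, 10]);
const probs = k.tidy(() => {
  const batch = k.stack(samples, 0);      // [3, 784] - one batched call instead of a per-sample loop
  const logits = k.matmul(batch, weight); // [3, 10]
  return logits.softmax(-1);              // only the returned tensor survives tidy()
});
const data = await probs.dataAsync();     // single GPU -> CPU read at the end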
💡 This is also an exploration of the limits of Vibe Coding.
This project adopts the Vibe Coding development mode, attempting to explore the boundaries of AI-assisted development:
- Architecture Design: Done by myself (reading the PyTorch ATen/c10 source code).
- Code Implementation: Mainly assisted by AI (Gemini, Claude).
- Testing & Verification: Human + AI collaboration (NumPy/PyTorch reference tests).
In this project, I tried to let AI complete:
- ✅ 200+ Operator Implementations: From mathematical formulas to WebGPU Shader code.
- ✅ Complex Architecture Implementation: Stride mechanism, Dispatch system, Autograd (in progress).
- ✅ Cross-platform Adaptation: WebGPU / Pure JS dual backend.
⚠️ Edge Case Handling: Currently a shortcoming, requires human intervention.
Due to model hallucinations and objective reasons of Vibe Coding:
- ⚠️ Code style is not fully unified (will be refactored later).
- ⚠️ Some comments may be inconsistent or outdated.
- ⚠️ Edge case handling is imperfect (happy path first).
- ⚠️ Core logic is verified, but dtype coverage is insufficient and some operators lack numerical stability tests (against PyTorch/NumPy references).
With AI assistance, achieved:
- 📈 10x+ Development Speed: 200+ operators completed in weeks.
- 🔄 Fast Iteration: Multiple architecture refactors (from v1 to v11).
- 📚 Automated Documentation: README, API Docs, Design Docs.
- 🧪 Test Case Generation: Automatically align PyTorch behavior.
- ❌ Architecture Decisions: Still requires deep human thinking.
- ❌ Performance Optimization: AI struggles to understand details like memory layout, Cache optimization.
- ❌ Debugging Complex Issues: Non-contiguous memory, type inference, etc., require human intervention.
- ❌ Long-term Consistency: Easy to introduce inconsistencies during cross-file refactoring.
Summary: Vibe Coding is suitable for "repetitive work with clear specifications" (like operator implementation), but core architecture design still needs to be human-led.
Initially, I just wanted to perform inference using onnxruntime in a JS environment. However, inference with onnxruntime requires handling a large number of intermediate tensors, which is torture in JavaScript. Native JS methods written on the fly can only handle specific "one-off" processes. For example, it is very difficult to generically handle slicing, view transformation, or complex broadcasting operations of a high-dimensional array, and then reuse this method in the inference process of other models.
Due to historical reasons, academia and cutting-edge models are mostly built using the PyTorch paradigm. In the JavaScript ecosystem, the lack of corresponding API support creates a huge cognitive switching cost when reproducing papers or porting models.
Because I don't love writing Python; I might even say I hate it. Although Python's AI ecosystem holds a monopoly for historical reasons, for someone accustomed to C-like languages and long used to strong type systems, development in Python is torture. I find it hard to get comfortable with def / None, or with wielding "vernier calipers" (strict tooling) in such a loose environment. Especially the infamous **kwargs: do you really know what you are writing?
Initially, I tried to export preprocessing actions (like audio Mel Spec calculation) directly as ONNX graphs. But I soon found this unfeasible; the fragmentation of the ecosystem makes model export extremely tedious. For instance, if you want to infer Whisper, there are tiny but fatal differences in preprocessing parameters across versions (e.g., turbo-v3's mel_spec n_mels is 128, while the base version is 80), meaning I would need to export specific model versions for every case. The more models I tried to infer in JS, the more obvious this "ecosystem gap" became.
I certainly tried transformers.js. It's great, works out of the box, and supports many mainstream models. But precisely because of this, it has a core problem: it is based on onnxruntime and is a black box. You can only adjust the Pipeline through limited parameters; you can hardly control the details of data flow. If you want to deeply customize or optimize the process, this is extremely frustrating.
After calm reflection, I discovered that the source of my pain was not the lack of a machine learning framework in JS; in fact, we have tfjs. tfjs is powerful, but its API design philosophy stems from the previous generation of deep learning frameworks. When I want to casually write x.view().transpose() in JS, I find myself having to look up documentation that feels slightly alien to me. It's good, but it's not the 'standard' I'm used to.
We also don't lack inference frameworks; onnxruntime / WebLLM have done deep optimizations.
We are missing PyTorch, or rather, a de facto API that conforms to the Torch standard.
Naturally, I searched for existing libraries like torch-js. Although they did a lot of Binding work for ATen and c10 (PyTorch's core C++ libraries), unfortunately, most did not complete the work, and many projects stopped maintenance years ago. This was undoubtedly even more frustrating.
Is there really no way? Can we really only go back to writing Python?
The good news is, I've been writing C++ for years. The better news is, in this era, we have a "cheat code"—AI.
I cloned the torch repository and read the ATen and c10 source code deeply. With Gemini's help, I roughly understood their design: the Dispatch system, the code generation system, the philosophy of separating storage and computation, etc. This architecture is powerful, but also very complex.
Then, I tried to replicate a simplified version in TypeScript.
This journey was a series of MVP versions being torn down and rebuilt:
- From a crude version that could only calculate
T + T, gradually implementing type promotion and scalar calculation. - From chaotic data type management to a clear distinction between Logical Dtype and Physical Dtype.
- From strictly operating on contiguous memory to learning how to calculate Stride, implementing memory views and dimension folding.
- Then to implementing broadcasting, advanced indexing.
- Finally completing the separation of storage and computation, backend isolation, and user-friendly API design...
Finally, I was convinced I held all the necessary pieces of tensor computation.
The rest became pure execution: write documentation, design prompts, and direct AI to produce the code.
Development was not smooth sailing. Every time a layer of abstraction was missed, or a key design was not considered, it often meant a subsequent large-scale refactor. Of course, I also had to fight the "model hallucinations" specific to the Vibe Coding mode. Even with vibe coding, it was painful.
But as long as I thought "If I don't do this, I have to go back to writing Python", I had infinite energy.
And then, there was Kandle.
About Autograd: Initially, to focus on inference scenarios and reduce engineering effort, I cut Autograd. In fact, it was hard for me to imagine large-scale model training scenarios in JS. But since the logic of backpropagation essentially shares the same dispatch system as forward propagation, and having come this far, it would be a shame not to do Autograd. With this mindset, I finally decided to patch Autograd in.
About Maturity: The current Kandle is still a "toy," at most a "delicate toy." Because the core is eager mode, its main application scenarios are intermediate calculation, pre/post-processing, and state inspection. For plain inference workloads its performance will be far lower than specialized inference engines, unless the target model cannot be exported to ONNX or transformers.js has not implemented the corresponding pipeline.
But eager mode also has advantages that black boxes cannot replace. Now you can completely "dissect" the model; you can fully control every layer, every forward propagation. I also designed hooks in nn.Module. For example, it can now be combined with DOM APIs, giving us cooler, more intuitive ways to perform model interpretability analysis. In the new version, I tried rewriting nn.Module using generator/yield; now you can "hand over" control during propagation, and you can "pause" the calculation of a certain layer at any time to debug like hitting a breakpoint in an IDE.
I still have many ideas that haven't been implemented yet. For example, due to the thorough decoupling, I can now implement a Remote Backend, interacting via gRPC/WebSocket schemes. Just like calling WebGPU on the Web, the user side issues calculations and only blocks to get data at "synchronization points." This is theoretically completely feasible.
I personally believe this Torch design could well become a "tensor computation protocol" standard: not just a machine learning framework tied to Python, but something capable of achieving much more.
Of course, with my current ability, I can only get this far. After I "evolve," I will try to go further.
MIT License
🌟 If this project helps you, please give it a Star!
💬 If you have any thoughts on "JavaScript Version of PyTorch", welcome to share in Issues/Discussions
Made with ❤️ by Vibe Coding