Can't start training: ValueError: XBLOCK is not defined [PyTorch 2.0 issue] #1160

Closed
rkfg opened this issue Apr 2, 2023 · 9 comments
Labels: new, Stale

Comments


rkfg commented Apr 2, 2023

1. Please find the following lines in the console and paste them below.

#######################################################################################################
Initializing Dreambooth
Dreambooth revision: 926ae204ef5de17efca2059c334b6098492a0641
Successfully installed accelerate-0.18.0 fastapi-0.94.1 gitpython-3.1.31 google-auth-oauthlib-0.4.6 requests-2.28.2 transformers-4.26.1

Does your project take forever to startup?
Repetitive dependency installation may be the reason.
Automatic1111's base project sets strict requirements on outdated dependencies.
If an extension is using a newer version, the dependency is uninstalled and reinstalled twice every startup.

[+] xformers version 0.0.18 installed.
[+] torch version 2.0.0+cu118 installed.
[+] torchvision version 0.15.1+cu118 installed.
[+] accelerate version 0.18.0 installed.
[+] diffusers version 0.14.0 installed.
[+] transformers version 4.26.1 installed.
[+] bitsandbytes version 0.35.4 installed.


#######################################################################################################

2. Describe the bug

When starting the training, the process stops immediately with a long traceback. I had this extension working a while ago, before A1111 and PyTorch updated to the current versions; I'm not sure if that's related. I tried running the webUI with both venv and conda, and the outcome is exactly the same. I also tried toggling various options such as memory attention (default/xformers), precision (fp16/bf16), extended LoRA on or off, and different base models (SD 1.5 and Liberty). No difference whatsoever. Sometimes it runs a few cycles of Triton autotune, but in the end it can't compile the code. The following log is from the conda run.
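
For reference, the constants in the failing kernel below (M=4096, N=32, K=320, plus a random-seed suffix) look like a Linear layer followed by Dropout, so the shape of the failing call can be sketched standalone. This is a hypothetical stand-in, not the extension's actual code, and mode="max-autotune" only approximates the Triton matmul templates seen in the log:

    import torch
    import torch.nn as nn

    # Hypothetical stand-in: a 4096x320 @ 320x32 matmul followed by dropout,
    # compiled through the inductor backend as in the traceback.
    model = nn.Sequential(nn.Linear(320, 32), nn.Dropout(0.1)).cuda()
    compiled = torch.compile(model, mode="max-autotune")

    x = torch.randn(4096, 320, device="cuda")
    out = compiled(x)  # Triton kernels are generated and compiled on this first call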

3. Provide logs

Initializing dreambooth training...
The version of diffusers is less than or equal to 0.14.0. Performing monkey-patch...
Pre-processing images: classifiers_0: : 0it [00:00, ?it/s]
Nothing to generate.
Found 175 reg images.
Preparing dataset...
Init dataset!
Preparing Dataset (With Caching)
Caching latents...:   0%|                                                                                                                                     | 0/210 [00:00<?, ?it/s]
Loading cached latents...
Bucket 0 (512, 512, 0) - Instance Images: 35 | Class Images: 175 | Max Examples/batch:  70
Total Buckets 1 - Instance Images: 35 | Class Images: 175 | Max Examples/batch:  70
Caching latents...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 210/210 [00:00<00:00, 36802.90it/s]
Total images / batch: 70, total examples: 70
Caching latents...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 210/210 [00:00<00:00, 36440.52it/s]
Total dataset length (steps): 70
Initializing bucket counter!
  ***** Running training *****
  Num batches each epoch = 70
  Num Epochs = 500
  Batch Size Per Device = 1
  Gradient Accumulation steps = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Text Encoder Epochs: 250
  Total optimization steps = 17500
  Total training steps = 35000
  Resuming from checkpoint: False
  First resume epoch: 0
  First resume step: 0
  Lora: True, Optimizer: 8bit AdamW, Prec: no
  Gradient Checkpointing: False
  EMA: False
  UNET: True
  Freeze CLIP Normalization Layers: False
  LR: 0.0001
  LoRA Extended: True
  LoRA Text Encoder LR: 5e-05
  V2: False
Steps:   0%|                                                                                                                                                | 0/35000 [00:00<?, ?it/s]
[2023-04-02 17:51:49,400] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-04-02 17:51:50,486] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 937, in build_triton_ir
    generator.visit(fn.parse())
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 183, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 252, in visit_FunctionDef
    has_ret = self.visit_compound_statement(node.body)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 177, in visit_compound_statement
    self.last_ret_type = self.visit(stmt)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 301, in visit_Assign
    values = self.visit(node.value)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 757, in visit_Call
    args = [self.visit(arg) for arg in node.args]
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 757, in <listcomp>
    args = [self.visit(arg) for arg in node.args]
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 188, in visit_List
    elts = [self.visit(elt) for elt in node.elts]
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 188, in <listcomp>
    elts = [self.visit(elt) for elt in node.elts]
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 325, in visit_Name
    return self.get_value(node.id)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 156, in get_value
    raise ValueError(f'{name} is not defined')
ValueError: XBLOCK is not defined

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rkfg/miniconda3/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 549, in _worker_compile
    kernel.precompile(warm_cache_only_with_cc=cc)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 69, in precompile
    self.launchers = [
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 70, in <listcomp>
    self._precompile_config(c, warm_cache_only_with_cc)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 83, in _precompile_config
    triton.compile(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 1621, in compile
    next_module = compile(module)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 1550, in <lambda>
    lambda src: ast_to_ttir(src, signature, configs[0], constants)),
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 962, in ast_to_ttir
    mod, _ = build_triton_ir(fn, signature, specialization, constants)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 942, in build_triton_ir
    raise CompilationError(fn.src, node) from e
triton.compiler.CompilationError: at 63:39:
def triton_(arg_A, arg_B, seed2, in_ptr3, in_ptr4, out_ptr1, out_ptr2, out_ptr3):
    GROUP_M : tl.constexpr = 8
    EVEN_K : tl.constexpr = True
    ALLOW_TF32 : tl.constexpr = True
    ACC_TYPE : tl.constexpr = tl.float32
    BLOCK_M : tl.constexpr = 64
    BLOCK_N : tl.constexpr = 32
    BLOCK_K : tl.constexpr = 32

    A = arg_A
    B = arg_B

    M = 4096
    N = 32
    K = 320
    stride_am = 320
    stride_ak = 1
    stride_bk = 1
    stride_bn = 320

    # based on triton.ops.matmul
    pid = tl.program_id(0)
    grid_m = (M + BLOCK_M - 1) // BLOCK_M
    grid_n = (N + BLOCK_N - 1) // BLOCK_N

    # re-order program ID for better L2 performance
    width = GROUP_M * grid_n
    group_id = pid // width
    group_size = min(grid_m - group_id * GROUP_M, GROUP_M)
    pid_m = group_id * GROUP_M + (pid % group_size)
    pid_n = (pid % width) // (group_size)

    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    ram = tl.max_contiguous(tl.multiple_of(rm % M, BLOCK_M), BLOCK_M)
    rbn = tl.max_contiguous(tl.multiple_of(rn % N, BLOCK_N), BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    A = A + (ram[:, None] * stride_am + rk[None, :] * stride_ak)
    B = B + (rk[:, None] * stride_bk + rbn[None, :] * stride_bn)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE)
    for k in range(K, 0, -BLOCK_K):
        if EVEN_K:
            a = tl.load(A)
            b = tl.load(B)
        else:
            a = tl.load(A, mask=rk[None, :] < k, other=0.)
            b = tl.load(B, mask=rk[:, None] < k, other=0.)
        acc += tl.dot(a, b, allow_tf32=ALLOW_TF32)
        A += BLOCK_K * stride_ak
        B += BLOCK_K * stride_bk

    # rematerialize rm and rn to save registers
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    idx_m = rm[:, None]
    idx_n = rn[None, :]
    mask = (idx_m < M) & (idx_n < N)

    # inductor generates a suffix
    xindex = idx_n + (32*idx_m)
    tmp0_load = tl.load(seed2 + (0))
    tmp0 = tl.broadcast_to(tmp0_load, [XBLOCK])
                                       ^
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 670, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.fake_example_inputs())
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py", line 1055, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/__init__.py", line 1390, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 401, in compile_fx
    return compile_fx(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 455, in compile_fx
    return aot_autograd(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 48, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2805, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2498, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1713, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2133, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 430, in fw_compiler
    return inner_compile(
  File "/home/rkfg/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py", line 595, in debug_wrapper
    compiled_fn = compiler_fn(gm, example_inputs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/debug.py", line 239, in inner
    return fn(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 177, in compile_fx_inner
    compiled_fn = graph.compile_to_fn()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/graph.py", line 586, in compile_to_fn
    return self.compile_to_module().call
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/graph.py", line 575, in compile_to_module
    mod = PyCodeCache.load(code)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 528, in load
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_rkfg/ew/cew2eoumgtgvrzh47c55fej6l6p7ggtu5ehdsiuv334opbob6r7y.py", line 505, in <module>
    async_compile.wait(globals())
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 715, in wait
    scope[key] = result.result()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 573, in result
    self.future.result()
  File "/home/rkfg/miniconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/rkfg/miniconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
triton.compiler.CompilationError: at 63:39:
[... same Triton kernel source as in the first CompilationError above; the caret again points at the undefined XBLOCK on the final line ...]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rkfg/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/ui_functions.py", line 727, in start_training
    result = main(class_gen_method=class_gen_method)
  File "/home/rkfg/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1371, in main
    return inner_loop()
  File "/home/rkfg/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 119, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "/home/rkfg/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1169, in inner_loop
    noise_pred = unet(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 82, in forward
    return self.dynamo_ctx(self._orig_mod.forward)(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
    return fn(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 582, in forward
    sample, res_samples = downsample_block(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 837, in forward
    hidden_states = attn(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/transformer_2d.py", line 265, in forward
    hidden_states = block(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/attention.py", line 291, in forward
    attn_output = self.attn1(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/cross_attention.py", line 205, in forward
    return self.processor(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 337, in catch_errors
    return callback(frame, cache_size, hooks)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 404, in _convert_frame
    result = inner_convert(frame, cache_size, hooks)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 104, in _fn
    return fn(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 262, in _convert_frame_assert
    return _compile(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 324, in _compile
    out_code = transform_code_object(code, transform)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 445, in transform_code_object
    transformations(instructions, code_options)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 311, in transform
    tracer.run()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1726, in run
    super().run()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 576, in run
    and self.step()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 540, in step
    getattr(self, inst.opname)(inst)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1792, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 517, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 588, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 675, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised CompilationError: at 63:39:
[... same Triton kernel source as in the first CompilationError above; the caret again points at the undefined XBLOCK on the final line ...]

Set torch._dynamo.config.verbose=True for more information


You can suppress this exception and fall back to eager by setting:
    torch._dynamo.config.suppress_errors = True

Steps:   0%|                                                                                                                                                | 0/35000 [00:05<?, ?it/s]
Restored system models.
Duration: 00:00:11
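
As the tail of the log notes, the exception can be suppressed so the affected graphs fall back to eager mode; a minimal sketch of that stopgap (it trades the torch.compile speedup for not crashing):

    import torch._dynamo

    # Per the hint printed in the traceback: swallow backend compilation errors
    # and run the failing graphs eagerly instead of raising BackendCompilerFailed.
    torch._dynamo.config.suppress_errors = True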

4. Environment

What OS? Debian testing

What GPU are you using? RTX 3090 Ti

rkfg added the new label Apr 2, 2023
rkfg (author) commented Apr 2, 2023

Note that on the nightly torch version 2.1.0.dev20230330 this doesn't happen, but the autotuning takes an enormous amount of time; I waited at least 30 minutes and then ran into issue #1146.
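
(To separate dynamo tracing from the inductor/Triton codegen that fails here, the compile backend can also be swapped; a minimal sketch with a hypothetical stand-in module:)

    import torch

    model = torch.nn.Linear(320, 32).cuda()  # hypothetical stand-in for the UNet

    # backend="eager" runs the captured graph without inductor/Triton codegen,
    # so a crash here would implicate tracing rather than the Triton compile step.
    no_inductor = torch.compile(model, backend="eager")
    out = no_inductor(torch.randn(4096, 320, device="cuda"))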

ArrowM (collaborator) commented Apr 2, 2023

Definitely a torch problem. You should check their GitHub issues.

rkfg (author) commented Apr 2, 2023

I couldn't find the specific issue, but overall PyTorch 2.0 seems very rough for now. I downgraded back to 1.13.1 and also installed xformers 0.0.18 with conda, because the version on pip doesn't support PyTorch 1. Now it seems to work fine, thanks. Let's keep this issue open at least until it's resolved in the stable version of PyTorch 2.

rkfg changed the title from "Can't start training: ValueError: XBLOCK is not defined" to "Can't start training: ValueError: XBLOCK is not defined [PyTorch 2.0 issue]" Apr 2, 2023
dsully commented Apr 2, 2023

I'm seeing the same thing with torch 2.0.0 on an RTX 4090.

Looks like this issue: pytorch/pytorch#97018

And this fix: pytorch/pytorch#95556 (merged 5 hours ago)

polym commented Apr 4, 2023

> I couldn't find the specific issue, but overall PyTorch 2.0 seems very rough for now. I downgraded back to 1.13.1 and also installed xformers 0.0.18 with conda, because the version on pip doesn't support PyTorch 1. Now it seems to work fine, thanks. Let's keep this issue open at least until it's resolved in the stable version of PyTorch 2.

@rkfg thanks. Based on your solution, I commented out the venv-related code in webui.sh and found the right xformers package at https://anaconda.org/xformers/xformers/files. I then installed it using conda install https://anaconda.org/xformers/xformers/0.0.18/download/linux-64/xformers-0.0.18-py310_cu11.7.1_pyt1.13.1.tar.bz2. The program is now functioning properly.
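
A quick way to confirm the environment landed on the pairing this workaround targets (the versions are the ones named in this thread):

    import torch
    import xformers

    # The workaround pairs PyTorch 1.13.1 with the conda build of xformers 0.0.18.
    print(torch.__version__)     # expect 1.13.1
    print(xformers.__version__)  # expect 0.0.18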

rkfg (author) commented Apr 4, 2023

Glad to hear that! You can also install the package even more easily with conda install xformers -c xformers; at least that worked for me. Not sure how it picks the correct version though, probably from the already installed PyTorch?

> I'm seeing the same thing with torch 2.0.0 on an RTX 4090.
>
> Looks like this issue: pytorch/pytorch#97018
>
> And this fix: pytorch/pytorch#95556 (merged 5 hours ago)

Good to know there's progress, thanks!

polym commented Apr 4, 2023

> Not sure how it picks the correct version though, probably from the already installed PyTorch?

I'm not sure, but it could be related to this issue: facebookresearch/xformers#708.

rkfg (author) commented Apr 4, 2023

Yes, that's where I found the way to install xformers for PT 1. I just checked the conda site, and it has versions for both PT 1 and PT 2; in my case the correct xformers 0.0.18 build was installed, probably because I had PT 1 installed beforehand.

github-actions (bot) commented

This issue is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions bot added the Stale label Apr 19, 2023
github-actions bot closed this as not planned Apr 25, 2023