Can't start training: ValueError: XBLOCK is not defined [PyTorch 2.0 issue] #1160

Closed
rkfg opened this issue Apr 2, 2023 · 9 comments
Labels: new, Stale

Comments


rkfg commented Apr 2, 2023

1. Please find the following lines in the console and paste them below.

#######################################################################################################
Initializing Dreambooth
Dreambooth revision: 926ae204ef5de17efca2059c334b6098492a0641
Successfully installed accelerate-0.18.0 fastapi-0.94.1 gitpython-3.1.31 google-auth-oauthlib-0.4.6 requests-2.28.2 transformers-4.26.1

Does your project take forever to startup?
Repetitive dependency installation may be the reason.
Automatic1111's base project sets strict requirements on outdated dependencies.
If an extension is using a newer version, the dependency is uninstalled and reinstalled twice every startup.

[+] xformers version 0.0.18 installed.
[+] torch version 2.0.0+cu118 installed.
[+] torchvision version 0.15.1+cu118 installed.
[+] accelerate version 0.18.0 installed.
[+] diffusers version 0.14.0 installed.
[+] transformers version 4.26.1 installed.
[+] bitsandbytes version 0.35.4 installed.


#######################################################################################################

2. Describe the bug

When starting the training, the process stops immediately with a long traceback. I had this extension working a while ago, before A1111 and PyTorch updated to the current versions; I'm not sure if that's related. I tried running the webUI with both venv and conda, and the outcome is exactly the same. I also tried toggling various options such as memory attention (default/xformers), precision (fp16/bf16), extended LoRA on or off, and different base models (SD 1.5 and Liberty). No difference whatsoever. Sometimes it runs a few cycles of Triton autotune, but in the end it can't compile the code. The following log is from the conda run.
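
For reference, the constants in the failing kernel below (M=4096, N=32, K=320, plus a random-seed suffix) look like a Linear layer followed by Dropout, so the shape of the failing call can be sketched standalone. This is a hypothetical stand-in, not the extension's actual code, and mode="max-autotune" only approximates the Triton matmul templates seen in the log:

    import torch
    import torch.nn as nn

    # Hypothetical stand-in: a 4096x320 @ 320x32 matmul followed by dropout,
    # compiled through the inductor backend as in the traceback.
    model = nn.Sequential(nn.Linear(320, 32), nn.Dropout(0.1)).cuda()
    compiled = torch.compile(model, mode="max-autotune")

    x = torch.randn(4096, 320, device="cuda")
    out = compiled(x)  # Triton kernels are generated and compiled on this first call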

3. Provide logs

Initializing dreambooth training...
The version of diffusers is less than or equal to 0.14.0. Performing monkey-patch...
Pre-processing images: classifiers_0: : 0it [00:00, ?it/s]
Nothing to generate.
Found 175 reg images.
Preparing dataset...
Init dataset!
Preparing Dataset (With Caching)
Caching latents...:   0%|                                                                                                                                     | 0/210 [00:00<?, ?it/s]
Loading cached latents...
Bucket 0 (512, 512, 0) - Instance Images: 35 | Class Images: 175 | Max Examples/batch:  70
Total Buckets 1 - Instance Images: 35 | Class Images: 175 | Max Examples/batch:  70
Caching latents...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 210/210 [00:00<00:00, 36802.90it/s]
Total images / batch: 70, total examples: 70
Caching latents...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 210/210 [00:00<00:00, 36440.52it/s]
Total dataset length (steps): 70
Initializing bucket counter!
  ***** Running training *****
  Num batches each epoch = 70
  Num Epochs = 500
  Batch Size Per Device = 1
  Gradient Accumulation steps = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Text Encoder Epochs: 250
  Total optimization steps = 17500
  Total training steps = 35000
  Resuming from checkpoint: False
  First resume epoch: 0
  First resume step: 0
  Lora: True, Optimizer: 8bit AdamW, Prec: no
  Gradient Checkpointing: False
  EMA: False
  UNET: True
  Freeze CLIP Normalization Layers: False
  LR: 0.0001
  LoRA Extended: True
  LoRA Text Encoder LR: 5e-05
  V2: False
Steps:   0%|                                                                                                                                                | 0/35000 [00:00<?, ?it/s]
[2023-04-02 17:51:49,400] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[2023-04-02 17:51:50,486] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 937, in build_triton_ir
    generator.visit(fn.parse())
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 183, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 426, in generic_visit
    self.visit(item)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 252, in visit_FunctionDef
    has_ret = self.visit_compound_statement(node.body)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 177, in visit_compound_statement
    self.last_ret_type = self.visit(stmt)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 301, in visit_Assign
    values = self.visit(node.value)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 757, in visit_Call
    args = [self.visit(arg) for arg in node.args]
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 757, in <listcomp>
    args = [self.visit(arg) for arg in node.args]
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 188, in visit_List
    elts = [self.visit(elt) for elt in node.elts]
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 188, in <listcomp>
    elts = [self.visit(elt) for elt in node.elts]
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 855, in visit
    return super().visit(node)
  File "/home/rkfg/miniconda3/lib/python3.10/ast.py", line 418, in visit
    return visitor(node)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 325, in visit_Name
    return self.get_value(node.id)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 156, in get_value
    raise ValueError(f'{name} is not defined')
ValueError: XBLOCK is not defined

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rkfg/miniconda3/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 549, in _worker_compile
    kernel.precompile(warm_cache_only_with_cc=cc)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 69, in precompile
    self.launchers = [
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 70, in <listcomp>
    self._precompile_config(c, warm_cache_only_with_cc)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 83, in _precompile_config
    triton.compile(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 1621, in compile
    next_module = compile(module)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 1550, in <lambda>
    lambda src: ast_to_ttir(src, signature, configs[0], constants)),
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 962, in ast_to_ttir
    mod, _ = build_triton_ir(fn, signature, specialization, constants)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/triton/compiler.py", line 942, in build_triton_ir
    raise CompilationError(fn.src, node) from e
triton.compiler.CompilationError: at 63:39:
def triton_(arg_A, arg_B, seed2, in_ptr3, in_ptr4, out_ptr1, out_ptr2, out_ptr3):
    GROUP_M : tl.constexpr = 8
    EVEN_K : tl.constexpr = True
    ALLOW_TF32 : tl.constexpr = True
    ACC_TYPE : tl.constexpr = tl.float32
    BLOCK_M : tl.constexpr = 64
    BLOCK_N : tl.constexpr = 32
    BLOCK_K : tl.constexpr = 32

    A = arg_A
    B = arg_B

    M = 4096
    N = 32
    K = 320
    stride_am = 320
    stride_ak = 1
    stride_bk = 1
    stride_bn = 320

    # based on triton.ops.matmul
    pid = tl.program_id(0)
    grid_m = (M + BLOCK_M - 1) // BLOCK_M
    grid_n = (N + BLOCK_N - 1) // BLOCK_N

    # re-order program ID for better L2 performance
    width = GROUP_M * grid_n
    group_id = pid // width
    group_size = min(grid_m - group_id * GROUP_M, GROUP_M)
    pid_m = group_id * GROUP_M + (pid % group_size)
    pid_n = (pid % width) // (group_size)

    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    ram = tl.max_contiguous(tl.multiple_of(rm % M, BLOCK_M), BLOCK_M)
    rbn = tl.max_contiguous(tl.multiple_of(rn % N, BLOCK_N), BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    A = A + (ram[:, None] * stride_am + rk[None, :] * stride_ak)
    B = B + (rk[:, None] * stride_bk + rbn[None, :] * stride_bn)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE)
    for k in range(K, 0, -BLOCK_K):
        if EVEN_K:
            a = tl.load(A)
            b = tl.load(B)
        else:
            a = tl.load(A, mask=rk[None, :] < k, other=0.)
            b = tl.load(B, mask=rk[:, None] < k, other=0.)
        acc += tl.dot(a, b, allow_tf32=ALLOW_TF32)
        A += BLOCK_K * stride_ak
        B += BLOCK_K * stride_bk

    # rematerialize rm and rn to save registers
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    idx_m = rm[:, None]
    idx_n = rn[None, :]
    mask = (idx_m < M) & (idx_n < N)

    # inductor generates a suffix
    xindex = idx_n + (32*idx_m)
    tmp0_load = tl.load(seed2 + (0))
    tmp0 = tl.broadcast_to(tmp0_load, [XBLOCK])
                                       ^
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 670, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.fake_example_inputs())
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py", line 1055, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/__init__.py", line 1390, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 401, in compile_fx
    return compile_fx(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 455, in compile_fx
    return aot_autograd(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 48, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2805, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2498, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1713, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 2133, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 430, in fw_compiler
    return inner_compile(
  File "/home/rkfg/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py", line 595, in debug_wrapper
    compiled_fn = compiler_fn(gm, example_inputs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/debug.py", line 239, in inner
    return fn(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 177, in compile_fx_inner
    compiled_fn = graph.compile_to_fn()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/graph.py", line 586, in compile_to_fn
    return self.compile_to_module().call
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/graph.py", line 575, in compile_to_module
    mod = PyCodeCache.load(code)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 528, in load
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_rkfg/ew/cew2eoumgtgvrzh47c55fej6l6p7ggtu5ehdsiuv334opbob6r7y.py", line 505, in <module>
    async_compile.wait(globals())
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 715, in wait
    scope[key] = result.result()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 573, in result
    self.future.result()
  File "/home/rkfg/miniconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/rkfg/miniconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
triton.compiler.CompilationError: at 63:39:
[... same Triton kernel source as in the first CompilationError above; the caret again points at the undefined XBLOCK on the final line ...]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/rkfg/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/ui_functions.py", line 727, in start_training
    result = main(class_gen_method=class_gen_method)
  File "/home/rkfg/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1371, in main
    return inner_loop()
  File "/home/rkfg/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/memory.py", line 119, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "/home/rkfg/stable-diffusion-webui/extensions/sd_dreambooth_extension/dreambooth/train_dreambooth.py", line 1169, in inner_loop
    noise_pred = unet(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 82, in forward
    return self.dynamo_ctx(self._orig_mod.forward)(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
    return fn(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 582, in forward
    sample, res_samples = downsample_block(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 837, in forward
    hidden_states = attn(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/transformer_2d.py", line 265, in forward
    hidden_states = block(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/attention.py", line 291, in forward
    attn_output = self.attn1(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/diffusers/models/cross_attention.py", line 205, in forward
    return self.processor(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 337, in catch_errors
    return callback(frame, cache_size, hooks)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 404, in _convert_frame
    result = inner_convert(frame, cache_size, hooks)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 104, in _fn
    return fn(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 262, in _convert_frame_assert
    return _compile(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 324, in _compile
    out_code = transform_code_object(code, transform)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 445, in transform_code_object
    transformations(instructions, code_options)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 311, in transform
    tracer.run()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1726, in run
    super().run()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 576, in run
    and self.step()
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 540, in step
    getattr(self, inst.opname)(inst)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 1792, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 517, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 588, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 163, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/rkfg/miniconda3/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 675, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised CompilationError: at 63:39:
[... same Triton kernel source as in the first CompilationError above; the caret again points at the undefined XBLOCK on the final line ...]

Set torch._dynamo.config.verbose=True for more information


You can suppress this exception and fall back to eager by setting:
    torch._dynamo.config.suppress_errors = True

Steps:   0%|                                                                                                                                                | 0/35000 [00:05<?, ?it/s]
Restored system models.
Duration: 00:00:11
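
As the tail of the log notes, the exception can be suppressed so the affected graphs fall back to eager mode; a minimal sketch of that stopgap (it trades the torch.compile speedup for not crashing):

    import torch._dynamo

    # Per the hint printed in the traceback: swallow backend compilation errors
    # and run the failing graphs eagerly instead of raising BackendCompilerFailed.
    torch._dynamo.config.suppress_errors = True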

4. Environment

What OS? Debian testing

What GPU are you using? RTX 3090 Ti

rkfg added the new label Apr 2, 2023
rkfg (author) commented Apr 2, 2023

Note that on the nightly torch version 2.1.0.dev20230330 this doesn't happen, but the autotuning takes an enormous amount of time; I waited at least 30 minutes and then ran into issue #1146.
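
(To separate dynamo tracing from the inductor/Triton codegen that fails here, the compile backend can also be swapped; a minimal sketch with a hypothetical stand-in module:)

    import torch

    model = torch.nn.Linear(320, 32).cuda()  # hypothetical stand-in for the UNet

    # backend="eager" runs the captured graph without inductor/Triton codegen,
    # so a crash here would implicate tracing rather than the Triton compile step.
    no_inductor = torch.compile(model, backend="eager")
    out = no_inductor(torch.randn(4096, 320, device="cuda"))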

ArrowM (collaborator) commented Apr 2, 2023

Definitely a torch problem. You should check their GitHub issues.

rkfg (author) commented Apr 2, 2023

I couldn't find the specific issue, but overall PyTorch 2.0 seems very rough for now. I downgraded back to 1.13.1 and also installed xformers 0.0.18 with conda, because the version on pip doesn't support PyTorch 1. Now it seems to work fine, thanks. Let's keep this issue open at least until it's resolved in the stable version of PyTorch 2.

rkfg changed the title from "Can't start training: ValueError: XBLOCK is not defined" to "Can't start training: ValueError: XBLOCK is not defined [PyTorch 2.0 issue]" Apr 2, 2023
dsully commented Apr 2, 2023

I'm seeing the same thing with torch 2.0.0 on an RTX 4090.

Looks like this issue: pytorch/pytorch#97018

And this fix: pytorch/pytorch#95556 (merged 5 hours ago)

polym commented Apr 4, 2023

> I couldn't find the specific issue, but overall PyTorch 2.0 seems very rough for now. I downgraded back to 1.13.1 and also installed xformers 0.0.18 with conda, because the version on pip doesn't support PyTorch 1. Now it seems to work fine, thanks. Let's keep this issue open at least until it's resolved in the stable version of PyTorch 2.

@rkfg thanks. Based on your solution, I commented out the venv-related code in webui.sh and found the right xformers package at https://anaconda.org/xformers/xformers/files. I then installed it using conda install https://anaconda.org/xformers/xformers/0.0.18/download/linux-64/xformers-0.0.18-py310_cu11.7.1_pyt1.13.1.tar.bz2. The program is now functioning properly.
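
A quick way to confirm the environment landed on the pairing this workaround targets (the versions are the ones named in this thread):

    import torch
    import xformers

    # The workaround pairs PyTorch 1.13.1 with the conda build of xformers 0.0.18.
    print(torch.__version__)     # expect 1.13.1
    print(xformers.__version__)  # expect 0.0.18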

rkfg (author) commented Apr 4, 2023

Glad to hear that! You can also install the package even more easily with conda install xformers -c xformers; at least that worked for me. Not sure how it picks the correct version though, probably from the already installed PyTorch?

> I'm seeing the same thing with torch 2.0.0 on an RTX 4090.
>
> Looks like this issue: pytorch/pytorch#97018
>
> And this fix: pytorch/pytorch#95556 (merged 5 hours ago)

Good to know there's progress, thanks!

polym commented Apr 4, 2023

> Not sure how it picks the correct version though, probably from the already installed PyTorch?

I'm not sure, but it could be related to this issue: facebookresearch/xformers#708.

rkfg (author) commented Apr 4, 2023

Yes, that's where I found the way to install xformers for PT 1. I just checked the conda site, and it has versions for both PT 1 and PT 2; in my case the correct xformers 0.0.18 build was installed, probably because I had PT 1 installed beforehand.

github-actions (bot) commented

This issue is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days

github-actions bot added the Stale label Apr 19, 2023
github-actions bot closed this as not planned Apr 25, 2023