Ok, with `control_scale *= 0`, I get the same results across local cuda and cloud cuda.

Still, applying the control blocks produces differences between the two runs. So let's zoom into that.

In [1]:
import torch
from torch.testing import assert_close
from torch import allclose, nn, tensor
torch.set_printoptions(linewidth=200, precision=3, sci_mode=False)

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'mps'
device_dtype = torch.float16 if device == 'cuda' else torch.float32

## Load logs

In [3]:
from diffusers.umer_debug_logger import UmerDebugLogger

In [4]:
cloud_cuda = UmerDebugLogger.load_log_objects_from_dir('logs/cloud')
local_cuda = UmerDebugLogger.load_log_objects_from_dir('logs/local_cuda')

In [5]:
len(cloud_cuda), len(local_cuda)

(970, 970)

In [6]:
for i, (c,l) in enumerate(zip(cloud_cuda, local_cuda)):
    if c.msg!=l.msg: print(f'{i:<3}{c.msg:>20}{l.msg:>20}')

## Compare intermediate results

In [7]:
def mae(t1,t2):
    assert t1.shape==t2.shape
    return (t1-t2).abs().mean()

In [41]:
from functools import partial
from util_inspect import fmt_bool

def compare_intermediate_results(n=None,n_start=0,prec=5, compare_prec=3):
    # Use the shorter log so indexing below cannot run past the end of either list
    if n is None: n=min(len(cloud_cuda), len(local_cuda))

    print(f'{"":<3} | {"block":<20} | {"name":<20} | {"shape":<20} | {"same names?":<12} | {"same shapes?":<12} | {"same values?":<12} | {"Δ cuda local -> cloud":<20}')
    print(f'{"":<3} | {"":<20} | {"":<20} | {"":<20} | {"":<12} | {"":<12} | {"prec="+str(compare_prec):^12} | {"prec="+str(prec):^20}')

    def calc_total_len(lens): return sum(lens)+3*len(lens)-1
    total_len = calc_total_len((3,20,20,20,12,12,12,20))

    line = partial(
        lambda txt, width: print(txt * (width//len(txt))),
        width=total_len
    )
    
    labels = []
    def add_label(lbs, ctrl=True):
        if not isinstance(lbs, (list, tuple)): lbs = [lbs]
        for l in lbs:
            labels.append(('Base',l))
        if ctrl:
            for l in lbs: 
                labels.append(('Ctrl',l))
    
    # # down
    # 1
    add_label('ResBlock 1.1')
    add_label('ResBlock 1.2')
    add_label('Conv 1')
    # 2
    add_label(('ResBlock 2.1', 'AttnBlock 2.1'))
    add_label(('ResBlock 2.2', 'AttnBlock 2.2'))
    add_label('Conv 2')
    # 3
    add_label(('ResBlock 3.1', 'AttnBlock 3.1'))
    add_label(('ResBlock 3.2', 'AttnBlock 3.2')) 
    # # mid
    add_label(('ResBlock', 'AttnBlock', 'ResBlock'))
    # # up
    for _ in range(1000): add_label('DONT CARE', ctrl=False)
    
    line('#')
    bc,block=labels.pop(0)
    for i in range(n_start,n):
        cc,lc = cloud_cuda[i], local_cuda[i]
                
        eq_name = cc.msg==lc.msg
        eq_shape = cc.shape==lc.shape
        eq_vals = torch.allclose(cc.t,lc.t,atol=10**-compare_prec)

        mae_2 = mae(lc.t,cc.t)        
        mae_2 = ("{:>20."+str(prec)+"f}").format(mae_2)
        
        print(f'{i+1:<3} | {bc:<4} | {block:<13} | {cc.msg:<20} | {cc.shape:>20} | {fmt_bool(eq_name, "^12")} | {fmt_bool(eq_shape, "^12")} | {fmt_bool(eq_vals, "^12")} | {mae_2}')
        
        if cc.msg in ('add conv_shortcut','conv','proj_out'):
            line('=')
            bc,block=labels.pop(0)
        elif cc.msg in ('add ff','proj_in'): line('- ')

In [42]:
compare_intermediate_results(compare_prec=3, prec=3)

    | block                | name                 | shape                | same names?  | same shapes? | same values? | Δ cuda local -> cloud
    |                      |                      |                      |              |              |    prec=3    |        prec=3       
##############################################################################################################################################
1   | Base | ResBlock 1.1  | conv1                |     [2, 320, 96, 96] |      y       |      y       |      n       |                0.000
2   | Base | ResBlock 1.1  | add time_emb_proj    |     [2, 320, 96, 96] |      y       |      y       |      n       |                0.000
3   | Base | ResBlock 1.1  | conv2                |     [2, 320, 96, 96] |      y       |      y       |      y       |                0.000
4   | Base | ResBlock 1.1  | add conv_shortcut    |     [2, 320, 96, 9
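
Rows 1 and 2 above report `n` for "same values?" while the Δ column prints `0.000`. These two readings are compatible: `torch.allclose` checks `|a - b| <= atol + rtol * |b|` for every element, whereas the Δ column is a mean over all elements, so a few outliers can fail the first check and still vanish in the second. The following is a minimal sketch of that effect. It uses plain Python lists in place of the logged tensors, and `allclose_py` and `mae_py` are hypothetical stand-ins written for this illustration, not the functions used in the notebook.

```python
def allclose_py(xs, ys, rtol=1e-5, atol=1e-3):
    # Same elementwise criterion as torch.allclose: |x - y| <= atol + rtol * |y|
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(xs, ys))

def mae_py(xs, ys):
    # Mean absolute error over all elements
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Made-up data: 1000 identical values, except one element that differs by 0.01
local = [0.0] * 1000
cloud = [0.0] * 1000
cloud[0] = 0.01

same = allclose_py(local, cloud)                    # one element exceeds atol=1e-3
delta = mae_py(local, cloud)                        # 0.01 / 1000
formatted = "{:.3f}".format(delta)                  # what a prec=3 column would show

print(same, formatted)
```

So a `0.000` in the Δ column at `prec=3` only says the mean difference is below 0.0005; it does not imply that every element is within tolerance.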

The errors arise in the transformer parts of each control block.<br/>
The first error arises in step 46.
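
The first divergent step can also be located programmatically instead of being read off the table. Below is a minimal sketch, assuming "error" means a mean absolute difference at or above a chosen threshold. The helper `first_divergent_step` and the toy data are hypothetical, and plain lists stand in for the logged tensors. Steps are 1-based, as in the table, so step `k` corresponds to index `k-1` in the logs.

```python
def first_divergent_step(cloud, local, threshold=5e-4):
    """Return (step, mae) for the first 1-based step whose MAE reaches threshold, else None."""
    for step, (c, l) in enumerate(zip(cloud, local), start=1):
        assert len(c) == len(l)
        mae = sum(abs(a - b) for a, b in zip(c, l)) / len(c)
        if mae >= threshold:
            return step, mae
    return None

# Made-up logs: step 2 differs only slightly, step 3 differs clearly
cloud = [[1.0, 2.0], [0.5, 0.5], [1.0, 1.0]]
local = [[1.0, 2.0], [0.5, 0.5001], [1.0, 1.5]]

result = first_divergent_step(cloud, local)
print(result)
```

With real log objects, the per-step difference would come from something like the notebook's `mae(lc.t, cc.t)` rather than the list arithmetic used here.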

So let's manually run through that transformer, for local and cloud.

In [10]:
inp_c = cloud_cuda[46-1].t
inp_l = local_cuda[46-1].t

In [11]:
del cloud_cuda, local_cuda

In [12]:
inp_c.shape, inp_l.shape

(torch.Size([2, 2304, 64]), torch.Size([2, 2304, 64]))

Load the diffusers version, but keep it on the CPU, so we can load the Heidelberg version onto the GPU

In [13]:
from diffusers import StableDiffusionXLPipeline
from diffusers import EulerDiscreteScheduler
from diffusers.models.controlnetxs import ControlNetXSModel
from diffusers.pipelines.controlnet_xs.pipeline_controlnet_xs_sd_xl import StableDiffusionXLControlNetXSPipeline

sdxl_pipe = StableDiffusionXLPipeline.from_single_file('weights/sd_xl_base_1.0_0.9vae.safetensors', device='cpu')
cnxs = ControlNetXSModel.from_pretrained('weights/cnxs', device='cpu')

cnxs.base_model = sdxl_pipe.unet

cnxs.scale_list = cnxs.scale_list * 0. + 0.95
assert cnxs.scale_list[0] == .95

scheduler_cgf = dict(sdxl_pipe.scheduler.config)
scheduler_cgf['timestep_spacing'] = 'linspace'
sdxl_pipe.scheduler = EulerDiscreteScheduler.from_config(scheduler_cgf)

# test it worked
sdxl_pipe.scheduler.set_timesteps(50)
assert sdxl_pipe.scheduler.timesteps[0]==999

# reset
sdxl_pipe.scheduler = EulerDiscreteScheduler.from_config(scheduler_cgf)

cnxs_pipe = StableDiffusionXLControlNetXSPipeline(
    vae=sdxl_pipe.vae,
    text_encoder=sdxl_pipe.text_encoder,
    text_encoder_2=sdxl_pipe.text_encoder_2,
    tokenizer=sdxl_pipe.tokenizer,
    tokenizer_2=sdxl_pipe.tokenizer_2,
    unet=sdxl_pipe.unet,
    controlnet=cnxs,
    scheduler=sdxl_pipe.scheduler,
)

sigmas after (linear) interpolation: [14.61464691 12.93677721 11.49164976 10.24291444  9.16035419] ...


Load Heidelberg version

In [14]:
import scripts.control_utils as cu



Downloading: "https://huggingface.co/lllyasviel/ControlNet/resolve/main/annotator/ckpts/dpt_hybrid-midas-501f0c75.pt" to /home/ControlNet-XS/annotator/ckpts/dpt_hybrid-midas-501f0c75.pt





In [15]:
path_to_config = 'cnxs_config/sdxl/sdxl_encD_canny_48m.yaml'

If this results in the kernel crashing, I'm using too much GPU memory elsewhere. Shut down every other kernel and try again.

In [None]:
model = cu.create_model(path_to_config).to('cuda')

Building a Downsample layer with 2 dims.
  --> settings are: 
 in-chn: 320, out-chn: 320, kernel-size: 3, stride: 2, padding: 1
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
Building a Downsample layer with 2 dims.
  --> settings are: 
 in-chn: 640, out-chn: 640, kernel-size: 3, stride: 2, padding: 1
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 10 w/ 1280 channels and 20 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing SpatialTransformer of depth 2 w/ 640 channels and 10 heads
constructing

**Okay, the kernel crashes, presumably because it runs out of CPU memory.** So let's do the comparison on 2 computers (instead of running both models on this one). See `Run Heidelnerg ctrl attention.ipynb` and `Run diffusers ctrl attetntion.ipynb`.