I have now turned off the control part by setting `control_scale *= 0`. Let's compare the outputs of (my implementation of) the base model. Visually, my implementation seems to be broken.

In [1]:
import torch
from torch.testing import assert_close
from torch import allclose, nn, tensor
torch.set_printoptions(linewidth=200, precision=3, sci_mode=False)

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'mps'
device_dtype = torch.float16 if device == 'cuda' else torch.float32

## Load logs

In [3]:
from diffusers.umer_debug_logger import UmerDebugLogger

In [4]:
cloud_cuda = UmerDebugLogger.load_log_objects_from_dir('logs/cloud')
local_cuda = UmerDebugLogger.load_log_objects_from_dir('logs/local_cuda')
local_mps = UmerDebugLogger.load_log_objects_from_dir('logs/local_mps')

In [5]:
len(cloud_cuda), len(local_cuda), len(local_mps)

(73, 73, 73)

In [6]:
for i, (c,l,l2) in enumerate(zip(cloud_cuda, local_cuda,local_mps)):
    if c.msg!=l.msg: print(i,c.msg,l.msg,l2.msg)

## Compare intermediate results

In [7]:
def mae(t1,t2):
    assert t1.shape==t2.shape
    return (t1-t2).abs().mean()

In [8]:
from functools import partial
from util_inspect import fmt_bool

def compare_intermediate_results(n=None,n_start=0,prec=5, compare_prec=3, skip_ctrl=False):
    if n is None: n=min(len(cloud_cuda), len(local_cuda), len(local_mps))  # only indices present in all three logs

    print(f'{"":<3} | {"name":<20} | {"shape":<20} | {"same names?":<12} | {"same shapes?":<12} | {"same values?":<12} | {"Δ local mps -> cuda":<20} | {"Δ cuda local -> cloud":<20}')
    print(f'{"":<3} | {"":<20} | {"":<20} | {"":<12} | {"":<12} | {"prec="+str(compare_prec):^12} | {"prec="+str(prec):^20} | {"prec="+str(prec):^20}')

    def calc_total_len(lens): return sum(lens)+3*len(lens)-1
    total_len = calc_total_len((3,20,20,12,12,12,20,20))

    line = partial(
        lambda txt, width: print(txt * (width//len(txt))),
        width=total_len
    )
    
    # # to separate logs into sections
    lines_at = [5]                                     # input 
    def add_line_after(x): lines_at.append(lines_at[-1]+x)
    add_line_after(4)                                  # conv in
    for _ in range(8): add_line_after(4)               # enc       (R,R,D / RA,RA,D / RA,RA)
    add_line_after(4)                                  # mid
    for _ in range(9): add_line_after(3)               # dec       (RA,RA,RAU / RA,RA,RAU / R,R,R)
    fat_lines_at = [5,9,41,45]
    
    # # to describe each log line
    descrs = ['x', 'time info', 'text info', 'guidance image', 'guidance image (projected)']                               # input 
    descrs += ['base conv in','ctrl conv in','add guided hint to ctrl','add ctrl->base']                                   # conv in
    for _ in range(8):descrs += ['concat base -> ctrl','apply base block','apply ctrl block','add ctrl -> base']           # enc
    descrs += ['concat base -> ctrl', 'apply base block','apply ctrl block','add ctrl->base']                              # mid
    for _ in range(9): descrs += ['add ctrl encoder->base decoder','concat base encoder->base decoder','apply base block'] # dec 
    descrs += ['base conv out']                                                                                            # output 
    
    line('#')
    lv,block=1,1
    for i in range(n_start,n):
        cc,lc,lm = cloud_cuda[i], local_cuda[i], local_mps[i]
                
        eq_name = (cc.msg==lc.msg) and (lc.msg==lm.msg)
        eq_shape = (cc.shape==lc.shape) and (lc.shape==lm.shape)
        eq_vals = torch.allclose(cc.t,lc.t,atol=10**-compare_prec) and torch.allclose(lc.t,lm.t,atol=10**-compare_prec)

        mae_1 = mae(lm.t,lc.t)
        mae_2 = mae(lc.t,cc.t)
        
        mae_1 = ("{:>20."+str(prec)+"f}").format(mae_1)
        mae_2 = ("{:>20."+str(prec)+"f}").format(mae_2)
        
        if not skip_ctrl or 'ctrl' not in cc.msg:
            print(f'{i+1:<3} | {cc.msg:<20} | {cc.shape:>20} | {fmt_bool(eq_name, "^12")} | {fmt_bool(eq_shape, "^12")} | {fmt_bool(eq_vals, "^12")} | {mae_1} | {mae_2}\t{descrs[i]}')
        
        if i+1 in fat_lines_at: line('=')
        elif i+1 in lines_at: line('-')

In [9]:
compare_intermediate_results(compare_prec=4, prec=3, skip_ctrl=True)

    | name                 | shape                | same names?  | same shapes? | same values? | Δ local mps -> cuda  | Δ cuda local -> cloud
    |                      |                      |              |              |    prec=4    |        prec=3        |        prec=3       
##############################################################################################################################################
1   | prep.x               |       [2, 4, 96, 96] |      y       |      y       |      y       |                0.000 |                0.000	x
2   | prep.temb            |            [2, 1280] |      y       |      y       |      n       |                0.000 |                0.000	time info
3   | prep.context         |        [2, 77, 2048] |      y       |      y       |      n       |                0.000 |                0.001	text info
4   | prep.raw_hint        |     [2, 3, 768, 768

Observation:
1. No difference between local mps and local cuda
2. No difference between local cuda and cloud cuda
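
Note that a row can show `n` under "same values?" while both Δ columns print 0.000. These do not contradict each other: `torch.allclose` fails as soon as a single element exceeds the tolerance, whereas the MAE averages over all elements and is then rounded to `prec` decimals. A minimal sketch with made-up tensors (not taken from the logs):

```python
import torch

# Made-up data: 1000 values, exactly one of which is off by 0.01.
a = torch.zeros(1000)
b = a.clone()
b[0] = 0.01

mean_abs_err = (a - b).abs().mean().item()     # averaged over all 1000 elements
max_abs_err = (a - b).abs().max().item()       # dominated by the single outlier
same_values = torch.allclose(a, b, atol=1e-4)  # elementwise check, one outlier is enough to fail

print(f"MAE = {mean_abs_err:.3f}, max = {max_abs_err:.3f}, allclose = {same_values}")
```

Here the MAE rounds to 0.000 at three decimals even though `allclose` returns `False`, so the Δ columns are the better indicator of how far the runs really are apart.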

Conclusion:
My implementation of the base part is correct. Therefore the first visual output should also be the same between local mps and cloud cuda, but it is not.<br/>
Therefore, the problem is in the pipeline, not in the denoiser model.

Next:
1. Inspect the pipeline code side by side
2. Save the denoiser outputs after each step and compare them.
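
The second step could be sketched roughly as follows. This is only an illustration: `first_divergent_step` is a hypothetical helper, and the two lists are random stand-ins for the per-step denoiser outputs that each pipeline would have to save.

```python
import torch

def first_divergent_step(outputs_a, outputs_b, tol=1e-3):
    """Return (step, mae) for the first step whose mean absolute error exceeds tol, else None."""
    for step, (a, b) in enumerate(zip(outputs_a, outputs_b)):
        # Compare on CPU in float32 so that device and dtype differences do not break the subtraction.
        err = (a.detach().float().cpu() - b.detach().float().cpu()).abs().mean().item()
        if err > tol:
            return step, err
    return None

# Made-up stand-ins for the per-step denoiser outputs of two runs.
torch.manual_seed(0)
run_a = [torch.randn(2, 4, 8, 8) for _ in range(5)]
run_b = [t.clone() for t in run_a]
run_b[3] = run_b[3] + 0.1   # inject a deviation at step 3

print(first_divergent_step(run_a, run_b))
```

Knowing the first step at which the two runs drift apart narrows the search to the pipeline code executed up to that step.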

**Edit (after adding conv out):**
There's a large difference in `conv out`!

**Edit 2:** I had not included the `norm` and `silu` in `conv_out`. After adding that, the difference between local cuda and cloud cuda vanishes.
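
To illustrate why the missing `norm` and `silu` matter, here is a minimal sketch of an output head of the form norm, then SiLU, then conv. The layer sizes (320 input channels, 4 output channels, 32 groups) and the random input are assumptions for illustration, not values read from the logs or from my implementation.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Assumed sizes, for illustration only.
norm = nn.GroupNorm(num_groups=32, num_channels=320)
act = nn.SiLU()
conv_out = nn.Conv2d(320, 4, kernel_size=3, padding=1)

h = torch.randn(2, 320, 16, 16) * 3 + 1   # random stand-in for the decoder output

with torch.no_grad():
    with_norm_act = conv_out(act(norm(h)))   # full output head
    conv_only = conv_out(h)                  # what is computed when norm and SiLU are skipped

print(with_norm_act.shape)
print((with_norm_act - conv_only).abs().mean())
```

Both variants produce the same shape, so a shape check alone would not catch the omission; only the value comparison does.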

In [10]:
print('h base after decoder:')
print(cloud_cuda[71].head)
print(local_cuda[71].head)
print()
print('h base after conv out:')
print(cloud_cuda[72].head)
print(local_cuda[72].head)

h base after decoder:
tensor([ 1.729,  1.362,  2.302,  0.758,  3.037,  0.908,  0.548, -0.552,  0.977,  1.700])
tensor([ 1.729,  1.362,  2.302,  0.758,  3.037,  0.908,  0.547, -0.552,  0.977,  1.700])

h base after conv out:
tensor([ 1.290,  0.491,  0.440, -0.570,  0.976,  1.417,  1.690,  0.644,  0.392, -0.807])
tensor([ 1.290,  0.491,  0.440, -0.570,  0.976,  1.417,  1.690,  0.644,  0.392, -0.807])


___

Okay, let's now include the control model. (Note: its output is not added back to base, as I've turned that off. Still, the computation of each ctrl block should give the same results across cloud and local.)

In [11]:
compare_intermediate_results(compare_prec=4, prec=3, skip_ctrl=False)

    | name                 | shape                | same names?  | same shapes? | same values? | Δ local mps -> cuda  | Δ cuda local -> cloud
    |                      |                      |              |              |    prec=4    |        prec=3        |        prec=3       
##############################################################################################################################################
1   | prep.x               |       [2, 4, 96, 96] |      y       |      y       |      y       |                0.000 |                0.000	x
2   | prep.temb            |            [2, 1280] |      y       |      y       |      n       |                0.000 |                0.000	time info
3   | prep.context         |        [2, 77, 2048] |      y       |      y       |      n       |                0.000 |                0.001	text info
4   | prep.raw_hint        |     [2, 3, 768, 768

Observation: The application of control blocks seems to produce errors