Hierarchical cross-Attention ResNet-based U-Net
for depth map inpainting with partial panorama and ambisonic room impulse response
We propose a novel model for a novel task, with a physical analysis aligned with acoustic theory
Key phrases: Inpainting, RIR, Depth Map, Cross-Attention, U-Net, Acoustic Interpretability
2025 URP Winter/Spring
Final Paper for URP: 🧾PDF
URP workshop presentation: 🔬 Google Slide
Data & Experiment Logs: 🌐 Notion
URP workshop assets: 📊 asset
Reconstructing indoor structure is crucial for augmented/extended reality (AR/XR) applications and interactions. The structural information can be obtained from indoor depth panoramas; however, acquiring a complete depth panorama under typical conditions remains challenging because of the limited field of view. We propose a novel model that reconstructs the full depth panorama from a partial panorama and a room impulse response (RIR). The model achieves about 20 dB in peak signal-to-noise ratio (PSNR) and 0.844 in structural similarity index measure (SSIM).
Our proposed transformer (TR) based model outperforms the CNN-based model (from the Acoustical Society, 음향학회) in
- Peak signal-to-noise ratio (PSNR): a higher value indicates better image quality
- Structural similarity index measure (SSIM): a higher value indicates better structural, luminance, and contrast quality
- Perceptual quality by naked-eye inspection: the TR-based model estimates plausible results
- Note that the ground truth (GT) is slightly different because of uncontrolled randomness
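For readers unfamiliar with the metric, the following is a minimal sketch of how PSNR can be computed between a predicted and a ground-truth depth map. It is not the evaluation code used in this project: the function name `psnr`, the nested-list input format, and the assumption that depth values are normalized to `[0, max_val]` are all illustrative.

```python
import math


def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two equally sized 2D maps given as nested lists.

    Assumption (not from the project): values are normalized to [0, max_val].
    """
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    if len(flat_pred) != len(flat_target) or not flat_pred:
        raise ValueError("maps must be non-empty and have the same size")

    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)
    if mse == 0:
        return math.inf  # identical maps
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy example: every pixel is off by 0.1 on a [0, 1] scale.
toy_pred = [[0.5, 0.5], [0.5, 0.5]]
toy_target = [[0.4, 0.4], [0.4, 0.4]]
print(f"PSNR: {psnr(toy_pred, toy_target):.2f} dB")
```

In this toy case the RMS error is 10% of the peak value, which corresponds to a PSNR of about 20 dB.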
Each ResNet block attends to distinct elements of the indoor room, as shown in the figure below.
- An RIR can be separated into direct sound, early reflection, and late reverberation parts
- The indoor scene can be described by its floorplan, edges, fine details, and the representative structure of the room
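The three-part view of an RIR can be sketched as a simple time-window split. This is only an illustration: the function name `split_rir`, the use of the strongest sample as the direct-sound arrival, and the 2.5 ms and 50 ms boundaries are assumptions chosen for the example, not values taken from this project.

```python
def split_rir(rir, sample_rate, direct_ms=2.5, early_ms=50.0):
    """Split a single-channel RIR (list of floats) into three segments.

    Assumptions (hypothetical, for illustration only):
      - the strongest sample marks the direct-sound arrival,
      - direct sound lasts `direct_ms` after the arrival,
      - early reflections end `early_ms` after the arrival.
    """
    if not rir:
        raise ValueError("rir must not be empty")

    # Index of the largest absolute amplitude, taken as the direct arrival.
    onset = max(range(len(rir)), key=lambda i: abs(rir[i]))

    direct_end = min(len(rir), onset + int(direct_ms * sample_rate / 1000))
    early_end = min(len(rir), onset + int(early_ms * sample_rate / 1000))

    return {
        "direct": rir[onset:direct_end],
        "early": rir[direct_end:early_end],
        "late": rir[early_end:],
    }


# Toy RIR at 1 kHz: a direct peak, one early reflection, one late arrival.
toy_rir = [0.0] * 200
toy_rir[10], toy_rir[30], toy_rir[100] = 1.0, 0.5, 0.1
parts = split_rir(toy_rir, sample_rate=1000)
print({name: len(segment) for name, segment in parts.items()})
```

Fixed boundaries keep the sketch short; in practice the transition between early reflections and late reverberation depends on the room.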
Claim
The red bar marks the time up to 125 ms, within which the low-to-high-order reflections exist (see EchoScan)
- Map 1 matches the floorplan and the late reverberation, consistent with Sabine's equation
- Map 2 matches the fine-scale structure and the many early reflections within a short time period
- Map 3 matches the edges and multiple reflections within a short interval
- Map 4 matches the height and representative structure, together with the direct sound, reflections, and energy decay curve (EDC) region
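One way to read the Map 1 claim: Sabine's equation ties reverberation time to room volume and total absorption, $RT_{60} \approx 0.161\,V / A$ with $V$ in m³ and $A = \sum_i S_i \alpha_i$ in m², so the late reverberation carries information about the overall room size that a floorplan (together with height) determines. The sketch below only evaluates that formula; the function name `sabine_rt60`, the room dimensions, and the absorption coefficients are hypothetical.

```python
def sabine_rt60(volume_m3, surfaces):
    """Sabine reverberation time in seconds (metric units).

    surfaces: iterable of (area_m2, absorption_coefficient) pairs.
    """
    # Equivalent absorption area A = sum of S_i * alpha_i.
    absorption_area = sum(area * alpha for area, alpha in surfaces)
    if absorption_area <= 0:
        raise ValueError("total absorption must be positive")
    return 0.161 * volume_m3 / absorption_area


# Hypothetical 5 m x 4 m x 3 m room; the coefficients are illustrative only.
toy_room = [
    (20.0, 0.30),  # floor
    (20.0, 0.10),  # ceiling
    (54.0, 0.05),  # four walls: 2 * (5 * 3 + 4 * 3)
]
toy_rt60 = sabine_rt60(5 * 4 * 3, toy_room)
print(f"RT60: {toy_rt60:.2f} s")
```

With absorption held fixed, doubling the volume doubles the predicted reverberation time, which is the dependence the claim relies on.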
This analysis points to a blockwise contrastive learning task as another branch of future work
- As each block can extract distinct features
- It might help with the global smoothing problem of the transformer encoder
- We are considering a diffusion- or flow-based model with HARU as the noise estimator or the vector field function, respectively
- Plug in room material segmentation with regard to sound absorption coefficients
- Generalize the model to reverberant speech and other types of multichannel receivers (not only FoA, first-order ambisonics)
- Consider a sound enhancement model such as DeFTAN II, or a dereverberation model
- Extend the acoustics-based analysis and compare with other networks built for similar tasks
- 🗒️ VisualEcho
- 🗒️ Beyond Image to Depth
- 🗒️ BatVision
Still, the other works use sound as a LiDAR-like system, while we concentrate on the acoustic properties of the RIR (representative properties: direct sound, early reflection, EDC slope, RT60, C50, etc.)
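Two of the properties listed above can be computed directly from an RIR. The sketch below uses Schroeder backward integration for the energy decay curve and the standard early-to-late energy ratio for C50. It is an illustration under stated assumptions, not this project's analysis code: the function names are hypothetical, and the RIR is assumed to be single-channel and to start at the direct-sound arrival.

```python
import math


def energy_decay_curve_db(rir):
    """Schroeder backward integration, normalized so the curve starts at 0 dB."""
    energy = [x * x for x in rir]
    total = sum(energy)
    if total == 0:
        raise ValueError("rir has no energy")

    # Accumulate energy from the tail toward the start.
    remaining = 0.0
    edc = []
    for e in reversed(energy):
        remaining += e
        edc.append(remaining)
    edc.reverse()

    return [10.0 * math.log10(v / total) if v > 0 else -math.inf for v in edc]


def clarity_c50_db(rir, sample_rate):
    """C50: energy in the first 50 ms over the remaining energy, in dB.

    Assumption: index 0 of `rir` is the direct-sound arrival.
    """
    split = int(50 * sample_rate / 1000)
    early = sum(x * x for x in rir[:split])
    late = sum(x * x for x in rir[split:])
    if late == 0:
        return math.inf
    if early == 0:
        return -math.inf
    return 10.0 * math.log10(early / late)


# Toy RIR at 1 kHz: exponentially decaying amplitude over 300 ms.
toy_rir = [math.exp(-n / 40.0) for n in range(300)]
toy_edc = energy_decay_curve_db(toy_rir)
print(f"EDC at 100 ms: {toy_edc[100]:.1f} dB")
print(f"C50: {clarity_c50_db(toy_rir, 1000):.1f} dB")
```

The slope of the EDC in dB per second is what an RT60 estimate is usually fitted from, typically over a limited decay range that is then extrapolated to 60 dB.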
RIR and corresponding Panorama provided by



