HARU

Hierarchical cross-Attention ResNet-based U-Net
for depth map inpainting from a partial panorama and an ambisonic room impulse response (RIR)

We propose a novel model for a novel task, together with a physical analysis that is consistent with acoustic theory.

Key phrases: Inpainting, RIR, Depth Map, Cross-Attention, U-Net, Acoustic Interpretability

2025 URP Winter/Spring
Final Paper for URP: 🧾PDF
URP workshop presentation: 🔬 Google Slide
Data & Experiment Logs: 🌐 Notion
URP workshop assets: 📊 asset

Abstract

Reconstructing indoor structure is crucial for augmented/extended reality (AR/XR) applications and interactions. Structural information can be obtained from indoor depth panoramas; however, acquiring a complete depth panorama under typical conditions remains challenging because of the limited field of view. We propose a novel model that reconstructs the full depth panorama from a partial panorama and a room impulse response (RIR). The model achieves a peak signal-to-noise ratio (PSNR) of about 20 dB and a structural similarity index measure (SSIM) of 0.844.

Result

Our proposed transformer-based (TR) model outperforms the CNN-based model (presented at the Acoustical Society conference) in:

Peak signal-to-noise ratio (PSNR): a higher value indicates better image quality
Structural similarity index measure (SSIM): a higher value indicates better structure, luminance, and contrast fidelity

Perceptual quality to the naked eye: the TR-based model produces plausible estimates
- Note that the ground truth (GT) differs slightly because of uncontrolled randomness
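
To make the PSNR figure concrete, here is a minimal sketch of how the metric can be computed on a depth map. It is not the evaluation code of this repository: the function name `psnr`, the nested-list input, and the normalization of depth values to `[0, max_value]` are assumptions made for illustration.

```python
import math


def psnr(pred, target, max_value=1.0):
    """PSNR in dB between two equally sized depth maps (nested lists).

    Assumes depth values are normalized to [0, max_value]; this
    normalization is an illustrative assumption, not the repo's setting.
    """
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("depth maps must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)
    if mse == 0:
        return math.inf  # identical maps
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under this normalization, a PSNR of 20 dB corresponds to a root-mean-square error of 10% of the peak depth value.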

Analysis with Acoustics

Each ResNet block attends to distinct elements of the indoor room, as shown in the figure below.

The RIR can be separated into direct sound, early reflections, and late reverberation
The indoor scene is described by its floorplan, edges, fine details, and the representative structure of the room
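
The three-part view of the RIR can be sketched as a simple time-window split. The function `split_rir` and its boundaries (`direct_ms`, `early_ms`) are hypothetical and chosen only for illustration; the analysis here does not state fixed boundaries, and in practice they depend on the room.

```python
def split_rir(rir, sample_rate, direct_ms=2.5, early_ms=50.0):
    """Split a single-channel RIR into (direct, early, late) sample lists.

    The window lengths are illustrative assumptions, measured from the
    arrival of the direct sound.
    """
    if not rir:
        raise ValueError("rir must not be empty")
    # Treat the strongest sample as the arrival of the direct sound.
    onset = max(range(len(rir)), key=lambda i: abs(rir[i]))
    direct_end = min(len(rir), onset + int(round(direct_ms * 1e-3 * sample_rate)))
    early_end = min(len(rir), onset + int(round(early_ms * 1e-3 * sample_rate)))
    # Samples before the onset (propagation delay) are discarded.
    direct = rir[onset:direct_end]
    early = rir[direct_end:early_end]
    late = rir[early_end:]
    return direct, early, late
```

For a first-order ambisonic RIR, the same split could be applied per channel using the onset found on the omnidirectional channel.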

Claim

The red bar marks the time up to 125 ms, within which the low-to-high-order reflections occur (see EchoScan)

Map 1 matches the floorplan and the late reverberation, consistent with Sabine's equation
Map 2 matches fine-scale structure and the many early reflections within a short time period
Map 3 matches edges and multiple reflections within a short interval
Map 4 matches height and representative structure, together with the direct sound, reflections, and the energy decay curve (EDC) region
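
For reference, Sabine's equation relates the reverberation time to the room volume and total absorption, which is one way to read the link between the floorplan and the late reverberation. The sketch below uses the standard metric form RT60 = 0.161 V / A; the function name `sabine_rt60` and the example room are hypothetical.

```python
def sabine_rt60(volume_m3, surfaces):
    """Reverberation time (s) from Sabine's equation, RT60 = 0.161 * V / A.

    `surfaces` is an iterable of (area_m2, absorption_coefficient) pairs;
    A is the total equivalent absorption area in square metres.
    """
    absorption_area = sum(area * alpha for area, alpha in surfaces)
    if absorption_area <= 0:
        raise ValueError("total absorption must be positive")
    return 0.161 * volume_m3 / absorption_area


# Hypothetical 5 m x 4 m x 3 m room with a uniform absorption coefficient of 0.2.
room_volume = 5 * 4 * 3
room_surfaces = [(2 * (5 * 4 + 5 * 3 + 4 * 3), 0.2)]
rt60 = sabine_rt60(room_volume, room_surfaces)
```

With absorption held fixed, a larger volume gives a longer decay, so the late part of the RIR carries information about the overall room size.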

This analysis motivates a blockwise contrastive learning task as another branch of future work:

Each block can extract distinct features
It might alleviate the global smoothing problem of the transformer encoder

Future Work

We are considering a diffusion- or flow-based model with HARU as the noise estimator or the vector-field function, respectively
Plug in room material segmentation with regard to sound absorption coefficients
Generalize the model to reverberant speech and other types of multichannel receivers (not only FOA, first-order ambisonics)
- Considering a sound enhancement model such as DeFTAN II or a dereverberation model
Extend the acoustics-based analysis and compare with other networks built for similar tasks
- 🗒️ VisualEcho
- 🗒️ Beyond Image to Depth
- 🗒️ BatVision

Still, the other works use sound as a LiDAR-like system, whereas we concentrate on the acoustic properties of the RIR (representative properties: direct sound, early reflections, EDC slope, RT60, C50, etc.).
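
Several of the properties listed above can be read directly from a measured RIR. The following is a minimal sketch, not the analysis code of this project: it computes the EDC by Schroeder backward integration, estimates RT60 from a line fit between -5 dB and -25 dB (a T20-style estimate), and computes C50 with a 50 ms split. All function names are hypothetical, and the RIR is assumed to start at the direct sound.

```python
import math


def energy_decay_curve_db(rir):
    """Schroeder backward integration, normalized to 0 dB at the first sample."""
    tail = 0.0
    reversed_sums = []
    for x in reversed(rir):
        tail += x * x
        reversed_sums.append(tail)
    remaining = reversed_sums[::-1]  # remaining[i] = energy from sample i onward
    total = remaining[0] if remaining else 0.0
    if total == 0:
        raise ValueError("rir has no energy")
    return [10.0 * math.log10(v / total) if v > 0 else -math.inf for v in remaining]


def rt60_from_edc(edc_db, sample_rate, start_db=-5.0, end_db=-25.0):
    """Least-squares line fit on the EDC, extrapolated to a 60 dB decay."""
    points = [(i / sample_rate, level) for i, level in enumerate(edc_db)
              if end_db <= level <= start_db]
    if len(points) < 2:
        raise ValueError("not enough decay range for the fit")
    mean_t = sum(t for t, _ in points) / len(points)
    mean_l = sum(l for _, l in points) / len(points)
    cov = sum((t - mean_t) * (l - mean_l) for t, l in points)
    var = sum((t - mean_t) ** 2 for t, _ in points)
    slope = cov / var  # dB per second, negative for a decaying RIR
    return -60.0 / slope


def clarity_c50_db(rir, sample_rate):
    """C50: ratio of energy in the first 50 ms to the remaining energy, in dB."""
    split = int(round(0.050 * sample_rate))
    early = sum(x * x for x in rir[:split])
    late = sum(x * x for x in rir[split:])
    if early == 0 or late == 0:
        raise ValueError("C50 is undefined when either window has no energy")
    return 10.0 * math.log10(early / late)
```

A quick sanity check is to feed in a synthetic exponential decay with a known RT60 and confirm that the fitted value recovers it.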

Dataset

RIRs and the corresponding panoramas are provided by

⚙️ SoundSpaces 2.0
📊 Matterport3D
