Depth Map Inpainting from Partial Observation and Room Impulse Responses

byulharang/HARU

HARU

Hierarchical cross-Attention Resnet based U-net
for depth map inpainting with partial panorama and ambisonic room impulse response

We propose a novel model for a novel task, with a physical analysis aligned with acoustic theory

Key phrases: Inpainting, RIR, Depth Map, Cross-Attention, U-Net, Acoustic Interpretability


2025 URP Winter/Spring
Final Paper for URP: 🧾PDF
URP workshop presentation: 🔬 Google Slide
Data & Experiment Logs: 🌐 Notion
URP workshop assets: 📊 asset


Abstract

Reconstructing indoor structure is crucial for augmented/extended reality (AR/XR) applications and interactions. Structural information can be obtained from indoor depth panoramas; however, acquiring a complete depth panorama under typical conditions remains challenging because of the limited field of view. We propose a novel model that reconstructs the full depth panorama from a partial panorama and a room impulse response (RIR). The model achieves about 20 dB in peak signal-to-noise ratio (PSNR) and 0.844 in structural similarity index measure (SSIM).

Proposed Model Flow
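
To make the cross-attention idea in the model name concrete, here is a minimal NumPy sketch of a single scaled dot-product cross-attention step. It is not the HARU implementation: the choice of image tokens as queries and RIR tokens as keys/values, the function name `cross_attention`, and all shapes and weight matrices are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_tokens, rir_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical layout).

    img_tokens: (N_img, C_img) features from the partial depth panorama
    rir_tokens: (N_rir, C_rir) features from the RIR, e.g. one per time frame
    """
    q = img_tokens @ w_q                      # queries from the image branch
    k = rir_tokens @ w_k                      # keys from the RIR branch
    v = rir_tokens @ w_v                      # values from the RIR branch
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot product
    attn = softmax(scores, axis=-1)           # each row sums to 1 over RIR tokens
    return attn @ v, attn

rng = np.random.default_rng(0)
img_tokens = rng.standard_normal((64, 32))    # assumed: 64 image tokens
rir_tokens = rng.standard_normal((100, 16))   # assumed: 100 RIR time frames
w_q = rng.standard_normal((32, 24))
w_k = rng.standard_normal((16, 24))
w_v = rng.standard_normal((16, 24))

out, attn = cross_attention(img_tokens, rir_tokens, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Under these assumptions, each row of `attn` is a distribution over RIR time frames for one image token, which is the kind of quantity an attention heatmap over the RIR would visualize.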

Result

Our proposed transformer (TR) based model outperforms the CNN-based model (from the Acoustical Society, 음향학회) in

  • Peak signal-to-noise ratio (PSNR): a higher value indicates better image quality
  • Structural similarity index measure (SSIM): a higher value indicates better structural, luminance, and contrast quality

PSNR and SSIM

  • Perceptual quality by naked eye: the TR-based model estimates plausible results
    • Note that the ground truth (GT) differs slightly because of uncontrolled randomness

Perceptual Result Comparison
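
For readers unfamiliar with the PSNR metric reported in this section, the following is a minimal sketch of how it can be computed for a depth map. It assumes depth values normalized to [0, 1] (so `max_val=1.0`); the evaluation script of this repository may use a different range or a library implementation.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two depth maps."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a constant error of 0.1 on a depth map normalized to [0, 1].
gt = np.full((4, 8), 0.5)
est = gt + 0.1
print(round(psnr(est, gt), 2))  # 20.0
```

With this normalization, 20 dB corresponds to a root-mean-square error of about 10% of the depth range.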

Analysis with Acoustics

Each ResNet block attends to distinct elements of the indoor room, as shown in the figure below.

  • An RIR can be separated into direct sound, early reflection, and late reverberation parts
  • The indoor scene can be described by its floorplan, edges, fine details, and the representative structure of the room

Attention Heatmap Analysis
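
The three-part view of an RIR described above can be sketched with simple time windows measured from the direct-sound peak. This is only an illustration: the function name `split_rir`, the 2.5 ms direct-sound window, and the use of 125 ms as the early/late boundary are assumptions, not values taken from the HARU code.

```python
import numpy as np

def split_rir(rir, fs, direct_ms=2.5, early_ms=125.0):
    """Split a single-channel RIR into direct, early, and late segments.

    Boundaries are measured from the strongest peak, taken here as the
    direct-sound arrival. Both window lengths are illustrative assumptions.
    """
    rir = np.asarray(rir, dtype=np.float64)
    onset = int(np.argmax(np.abs(rir)))               # direct-sound arrival
    d_end = onset + int(round(direct_ms * 1e-3 * fs))
    e_end = onset + int(round(early_ms * 1e-3 * fs))
    return rir[onset:d_end], rir[d_end:e_end], rir[e_end:]

# Synthetic RIR: exponentially decaying noise plus a unit direct-sound peak.
fs = 16000
t = np.arange(fs) / fs                                # 1 second
rng = np.random.default_rng(0)
rir = 0.1 * rng.standard_normal(fs) * np.exp(-t / 0.1)
rir[80] = 1.0                                         # direct sound at 5 ms

direct, early, late = split_rir(rir, fs)
print(len(direct), len(early), len(late))             # 40 1960 13920
```

Real RIRs usually need a more careful onset detector and a room-dependent mixing time, so fixed windows like these are best treated as a first approximation.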

Claim

The red bar marks the time up to 125 ms, where the low- to high-order reflections exist (see EchoScan)

  1. Map 1 matches the floorplan and the late reverberation, in line with Sabine's equation
  2. Map 2 matches fine-scale structure and the many early reflections within a short time period
  3. Map 3 matches edges and multiple reflections within a short interval
  4. Map 4 matches height and representative structure, together with the direct sound, reflections, and the energy decay curve (EDC) region
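
For reference, Sabine's equation mentioned in claim 1 relates the reverberation time to the room volume and the total absorption. It is stated here in its standard metric form; the symbols are the conventional ones, not notation taken from the paper.

```latex
T_{60} \approx \frac{0.161\, V}{A}, \qquad A = \sum_i S_i \,\alpha_i
```

Here $V$ is the room volume in m³, $S_i$ is the area of surface $i$ in m², and $\alpha_i$ is its absorption coefficient. Since $V$ grows with floor area for a fixed ceiling height, one possible reading of claim 1 is that the late reverberation carries information about the overall floorplan size.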

This analysis leads to a blockwise contrastive learning task as another branch of future work

  • Each block can extract distinct features
  • It might help with the global smoothing problem of the transformer encoder

Future Work

  • We are considering a diffusion- or flow-based model with HARU as the noise estimator or the vector field function, respectively
  • Plug in room material segmentation with regard to sound absorption coefficients
  • Generalize the model to reverberant speech and other types of multichannel receivers (not only FoA, first-order ambisonics)
    • Considering a sound enhancement model such as DeFTAN II or a dereverberation model
  • Extend the acoustics-based analysis and compare with other networks designed for similar tasks

Still, other works use sound as a LiDAR-like system, while we concentrate on the acoustic properties of the RIR (representative properties: direct sound, early reflection, EDC slope, RT60, C50, etc.)
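
Several of the properties listed above can be estimated directly from a measured RIR. The sketch below uses textbook definitions: Schroeder backward integration for the EDC, a line fit between -5 dB and -25 dB extrapolated to -60 dB for RT60, and a 50 ms early/late energy ratio for C50. The function names and the synthetic test signal are assumptions for illustration and are not part of this repository.

```python
import numpy as np

def edc_db(rir):
    """Energy decay curve in dB via Schroeder backward integration."""
    energy = np.asarray(rir, dtype=np.float64) ** 2
    curve = np.cumsum(energy[::-1])[::-1]     # energy remaining after each sample
    return 10.0 * np.log10(curve / curve[0])

def rt60_from_edc(curve_db, fs, hi_db=-5.0, lo_db=-25.0):
    """Fit the EDC slope between hi_db and lo_db and extrapolate to -60 dB."""
    idx = np.where((curve_db <= hi_db) & (curve_db >= lo_db))[0]
    slope, _ = np.polyfit(idx / fs, curve_db[idx], 1)   # dB per second
    return -60.0 / slope

def c50_db(rir, fs):
    """Clarity index: energy in the first 50 ms over the remaining energy, in dB.

    Assumes the RIR starts at the direct-sound arrival.
    """
    energy = np.asarray(rir, dtype=np.float64) ** 2
    n50 = int(round(0.05 * fs))
    return 10.0 * np.log10(energy[:n50].sum() / energy[n50:].sum())

# Synthetic RIR: an ideal exponential decay with a 0.5 s reverberation time.
fs = 16000
t = np.arange(fs) / fs
true_rt60 = 0.5
rir = np.exp(-np.log(1000.0) * t / true_rt60)   # amplitude falls 60 dB in 0.5 s

curve = edc_db(rir)
print(rt60_from_edc(curve, fs))   # close to 0.5
print(c50_db(rir, fs))
```

On measured RIRs the noise floor bends the tail of the EDC, so the fit range and any truncation of the response matter more than they do for this ideal decay.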

Dataset

RIR and corresponding Panorama provided by
