<a href="https://colab.research.google.com/github/fjadidi2001/Image_Inpaint/blob/main/BIDS_Net.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bidirectional Interaction Dual-Stream Network (BIDS-Net)

> The article proposes a novel approach for image inpainting called the Bidirectional Interaction Dual-Stream Network (BIDS-Net), integrating CNN and Transformer models to enhance inpainting quality by leveraging their complementary strengths.

## Methodology Overview
1. Dual-Stream Structure:

- CNN Stream: Captures rich local patterns and refines details.
- Transformer Stream: Models long-range contextual correlations for global information.
- Both streams are based on a U-shaped encoder-decoder structure to facilitate efficient multi-scale context reasoning.

2. Bidirectional Feature Interaction (BFI):

- Implements **bidirectional feature alignment and fusion** between the CNN and Transformer streams.
- Employs **Selective Feature Fusion (SFF)** for adaptive feature integration by learning channel weights.

3. Fast Global Self-Attention:

- Utilizes a **kernelizable fast-attention mechanism** for the Transformer, reducing computational complexity to linear.

4. Loss Functions:

- Combines pixel-wise reconstruction, adversarial, perceptual, and style losses to ensure inpainting quality and perceptual consistency.

5. Reported Findings:

- Channel allocation: Performance is best when the CNN and Transformer streams are given equal importance.
- Fusion methods: Bidirectional fusion outperforms unidirectional and unified-path approaches.
- Specific fusion techniques: SFF surpasses element-wise addition and concatenation.
- Number of random features: The best trade-off is achieved with 72 orthogonal random features.

### **Mask Creation Process**

#### 1. **Purpose of Masking in Image Inpainting**
   - Masks simulate corrupted regions by marking areas of an image for restoration.
   - Masks are binary: **value 1** marks corrupted pixels and **value 0** marks uncorrupted pixels, which allows selective processing during training.
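
As a minimal sketch of this convention (1 = corrupted, 0 = uncorrupted), the snippet below zeroes out the masked pixels of an image and reports the hole-to-image area ratio. The helper names `apply_mask` and `hole_ratio`, the channels-last layout, and the toy 8x8 image are assumptions made for illustration, not part of the paper.

```python
import numpy as np

def apply_mask(image, mask):
    """Zero out corrupted pixels.

    image: (H, W, C) floats in [0, 1]; mask: (H, W) with 1 = corrupted.
    """
    return image * (1.0 - mask[..., None])

def hole_ratio(mask):
    """Fraction of pixels marked as corrupted."""
    return float(mask.mean())

# Toy example: an 8x8 RGB image with a 4x4 square hole (illustrative values).
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
mask = np.zeros((8, 8), dtype=np.float32)
mask[2:6, 2:6] = 1.0

corrupted = apply_mask(image, mask)
ratio = hole_ratio(mask)
```

Here `ratio` is the quantity that the mask sets below describe as the hole-to-image area ratio.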

#### 2. **Mask Datasets**
   - **Mask Set I**: Contains irregular shapes with various hole-to-image area ratios (10%–60%) to simulate real-world image corruption scenarios.
   - **Mask Set II**: Focuses on **large-scale corruptions**, derived from a large mask sampling strategy, targeting challenges in **large-hole inpainting**.

#### 3. **Techniques for Mask Creation**
   - **Random Irregular Masks**:
     - Generated using freehand-like curves and random polygons.
     - Often involve **random rotations** and **flipping** for augmentation.
   - **Large-Hole Masks**:
     - Created by sampling large continuous regions, ensuring high diversity in shape and size.
   - **Tools and Libraries**:
     - Python libraries like **OpenCV** and **NumPy** for procedural generation of irregular shapes.
     - **External mask datasets** for additional diversity, e.g., mask datasets from previous works such as [29].
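
The procedural generation described above could be sketched as follows. This is not the paper's mask generator: it is a NumPy-only approximation that draws freehand-like strokes as a random walk stamped with a square brush, then applies a random rotation and flip. The function names and every parameter (`num_strokes`, `max_steps`, `brush`, the turn and step ranges) are hypothetical choices.

```python
import numpy as np

def random_stroke_mask(height, width, num_strokes=4, max_steps=40, brush=3, seed=None):
    """Draw random-walk strokes; returns a (height, width) uint8 mask, 1 = corrupted."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((height, width), dtype=np.uint8)
    for _ in range(num_strokes):
        y, x = int(rng.integers(0, height)), int(rng.integers(0, width))
        angle = rng.uniform(0, 2 * np.pi)
        for _ in range(int(rng.integers(10, max_steps + 1))):
            # Stamp a square brush centred on the current position.
            y0, y1 = max(y - brush, 0), min(y + brush + 1, height)
            x0, x1 = max(x - brush, 0), min(x + brush + 1, width)
            mask[y0:y1, x0:x1] = 1
            # Turn slightly and step forward to get a curved, freehand-like path.
            angle += rng.uniform(-0.6, 0.6)
            step = int(rng.integers(2, 6))
            y = int(np.clip(y + step * np.sin(angle), 0, height - 1))
            x = int(np.clip(x + step * np.cos(angle), 0, width - 1))
    return mask

def augment_mask(mask, rng):
    """Random 90-degree rotation plus optional horizontal flip (square masks assumed)."""
    mask = np.rot90(mask, k=int(rng.integers(0, 4)))
    if rng.random() < 0.5:
        mask = np.fliplr(mask)
    return np.ascontiguousarray(mask)

mask = random_stroke_mask(64, 64, seed=0)
augmented = augment_mask(mask, np.random.default_rng(1))
```

Rotation and flipping only rearrange pixels, so the hole ratio of `augmented` matches that of `mask`. With OpenCV available, drawing calls such as `cv2.line` could replace the square-brush stamping.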

---

### **Model Architecture: BIDS-Net**

#### 1. **Overall Structure**
   - A **dual-stream network** combining **CNN** and **Transformer** models in a parallel design.
   - Built on a **U-shaped encoder-decoder structure** for multi-scale feature extraction.

#### 2. **Key Components**
   - **CNN Stream**:
     - Focus: Capturing **local patterns** for texture refinement.
     - Built with **pre-activation residual blocks** for efficient and robust feature learning.
   - **Transformer Stream**:
     - Focus: Modeling **long-range contextual correlations**.
     - Uses **fast global self-attention** for scalability and reduced computational overhead.
   - **Bidirectional Feature Interaction (BFI)**:
     - Bridges the CNN and Transformer streams with **feature alignment** and **adaptive fusion**.

#### 3. **Detailed Implementation Steps**
   - **Input Projection**:
     - Corrupted images and masks are projected into separate feature spaces for the CNN and Transformer streams.
     - Transformer features are downsampled to balance computational cost and performance.
   - **Encoding Stage**:
     - Each stream extracts features using **convolutional blocks (CNN)** and **Transformer blocks**.
     - Features are fused bidirectionally via the **BFI module**.
   - **Bottleneck Stage**:
     - Features from both streams interact for enhanced context reasoning at the lowest spatial resolution.
   - **Decoding Stage**:
     - Outputs from both streams are upsampled and concatenated for final refinement.
   - **Output Projection**:
     - Combined features are transformed back to the image space for inpainting results.
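
To make the data flow of these steps concrete, here is a toy NumPy walk-through that tracks only tensor shapes. Everything in it is a stand-in: random linear maps replace learned projections, `tanh` replaces the convolutional and Transformer blocks, plain addition replaces the BFI module, and the channel count, downsampling ratio, number of stages, and additive skip connections are assumptions rather than the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(x, out_channels):
    """Stand-in for a learned projection: random linear map over channels."""
    w = rng.standard_normal((x.shape[-1], out_channels)) / np.sqrt(x.shape[-1])
    return x @ w

def downsample(x, factor=2):
    """Average pooling on a channels-last (H, W, C) array."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def upsample(x, factor=2):
    """Nearest-neighbour upsampling."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def interact(cnn_feat, trans_feat, ratio):
    """Placeholder for BFI: align resolutions, then exchange features by addition."""
    cnn_out = cnn_feat + upsample(trans_feat, ratio)
    trans_out = trans_feat + downsample(cnn_feat, ratio)
    return cnn_out, trans_out

def bids_flow(image, mask, channels=8, ratio=4, stages=2):
    # Input projection: masked image plus mask, projected separately per stream.
    x = np.concatenate([image * (1 - mask), mask], axis=-1)
    cnn = project(x, channels)
    trans = downsample(project(x, channels), ratio)  # Transformer stream runs at lower resolution
    # Encoding: per-stream blocks (tanh stand-ins) followed by bidirectional interaction.
    skips = []
    for _ in range(stages):
        cnn, trans = interact(np.tanh(cnn), np.tanh(trans), ratio)
        skips.append((cnn, trans))
        cnn, trans = downsample(cnn), downsample(trans)
    # Bottleneck: interaction at the lowest spatial resolution.
    cnn, trans = interact(cnn, trans, ratio)
    # Decoding: upsample each stream and merge with its encoder features.
    for skip_cnn, skip_trans in reversed(skips):
        cnn = upsample(cnn) + skip_cnn
        trans = upsample(trans) + skip_trans
    # Output projection: concatenate both streams and map back to RGB.
    merged = np.concatenate([cnn, upsample(trans, ratio)], axis=-1)
    return project(merged, 3)

image = rng.random((32, 32, 3))
mask = np.zeros((32, 32, 1))
mask[8:24, 8:24] = 1.0
out = bids_flow(image, mask)
```

The output has the same spatial size as the input, which is the only property this sketch is meant to demonstrate.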

---

### **Relevant Techniques and Algorithms**

#### 1. **Fast Global Self-Attention**
   - Reduces standard attention's quadratic complexity in sequence length to linear:
     - **Kernelizable Attention**: Positive orthogonal random features approximate the softmax kernel, replacing exact softmax attention.
     - This keeps the Transformer stream **scalable** and efficient for high-resolution images.
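
A minimal single-head NumPy sketch of kernelized attention with positive orthogonal random features is shown below. It follows the general random-feature construction rather than the paper's exact implementation; the function names, the toy sizes (16 tokens, head dimension 8), and the absence of numerical stabilisation are assumptions. The feature count of 72 is taken from the findings listed earlier.

```python
import numpy as np

def orthogonal_random_features(num_features, dim, rng):
    """Stack QR-orthogonalised Gaussian blocks, rescaled to Gaussian-like row norms."""
    blocks = []
    for _ in range(-(-num_features // dim)):  # ceil division
        q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
        blocks.append(q.T)
    w = np.concatenate(blocks, axis=0)[:num_features]
    norms = np.linalg.norm(rng.standard_normal((num_features, dim)), axis=1, keepdims=True)
    return w * norms

def positive_feature_map(x, w):
    """phi(x) = exp(w.x - |x|^2 / 2) / sqrt(m), strictly positive by construction."""
    x = x / x.shape[-1] ** 0.25  # so phi(q).phi(k) approximates exp(q.k / sqrt(d))
    proj = x @ w.T
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    # No max-subtraction here; large inputs would need extra stabilisation.
    return np.exp(proj - sq_norm) / np.sqrt(w.shape[0])

def linear_attention(q, k, v, w):
    """Attention in O(n) time: never forms the n x n attention matrix."""
    q_f, k_f = positive_feature_map(q, w), positive_feature_map(k, w)
    kv = k_f.T @ v                      # (m, d_v), summed over all keys once
    normalizer = q_f @ k_f.sum(axis=0)  # (n,), row sums of the implicit attention matrix
    return (q_f @ kv) / normalizer[:, None]

rng = np.random.default_rng(0)
n, d, m = 16, 8, 72
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
w = orthogonal_random_features(m, d, rng)
out = linear_attention(q, k, v, w)
```

Because the feature map is positive and the weights are normalised, each output row is a convex combination of the rows of `v`, just as with softmax attention. The cost grows with `n * m * d` instead of `n * n * d`.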

#### 2. **Selective Feature Fusion (SFF)**
   - Adapts weights for each channel during fusion, ensuring:
     - CNN benefits from Transformer’s global context.
     - Transformer incorporates CNN’s local details.
   - Based on the **Selective Kernel Convolution** technique.
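
The channel-weighting idea can be sketched in NumPy in the style of selective-kernel attention: pool the summed features globally, pass them through a small bottleneck, and apply a softmax across the two branches for each channel. The function name, the reduction ratio of 2, and the random weight matrices are illustrative assumptions; in a real model these weights would be learned.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def selective_fusion(feat_a, feat_b, w_reduce, w_a, w_b):
    """Fuse two (C, H, W) feature maps with per-channel branch weights."""
    summary = (feat_a + feat_b).mean(axis=(1, 2))    # global average pooling -> (C,)
    hidden = np.maximum(w_reduce @ summary, 0.0)     # bottleneck with ReLU -> (C // r,)
    logits = np.stack([w_a @ hidden, w_b @ hidden])  # one logit per branch and channel -> (2, C)
    weights = softmax(logits, axis=0)                # branch weights sum to 1 per channel
    fused = weights[0][:, None, None] * feat_a + weights[1][:, None, None] * feat_b
    return fused, weights

rng = np.random.default_rng(0)
c, h, w_dim, r = 8, 4, 4, 2  # toy sizes; r is a hypothetical reduction ratio
cnn_feat = rng.standard_normal((c, h, w_dim))
trans_feat = rng.standard_normal((c, h, w_dim))
w_reduce = rng.standard_normal((c // r, c))
w_a = rng.standard_normal((c, c // r))
w_b = rng.standard_normal((c, c // r))

fused, weights = selective_fusion(cnn_feat, trans_feat, w_reduce, w_a, w_b)
```

Unlike element-wise addition (fixed weight 1 for both inputs) or concatenation, the mixing weights here depend on the content of both feature maps.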

#### 3. **Loss Functions**
   - **Pixel-wise Reconstruction Loss**: Ensures pixel-level consistency.
   - **Adversarial Loss**: Improves texture realism by incorporating a discriminator network.
   - **Perceptual Loss**: Derived from a pre-trained VGG-19, enhancing perceptual similarity.
   - **Style Loss**: Preserves stylistic details using Gram matrices.
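
Two of these terms are easy to illustrate without a trained network. The sketch below computes an L1 reconstruction loss and a Gram-matrix style loss on arbitrary feature maps, then combines terms with a weighted sum. The feature maps here are random arrays standing in for VGG-19 activations, the Gram normalisation is one common choice, and the loss weights are placeholders, not values from the paper.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between prediction and ground truth."""
    return float(np.abs(pred - target).mean())

def gram_matrix(feat):
    """Channel-by-channel correlations of a (C, H, W) feature map."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def style_loss(feat_pred, feat_target):
    """L1 distance between Gram matrices of predicted and target features."""
    return float(np.abs(gram_matrix(feat_pred) - gram_matrix(feat_target)).mean())

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

rng = np.random.default_rng(0)
pred, target = rng.random((3, 16, 16)), rng.random((3, 16, 16))
feat_pred, feat_target = rng.random((8, 4, 4)), rng.random((8, 4, 4))  # stand-ins for VGG features

terms = {"rec": l1_loss(pred, target), "style": style_loss(feat_pred, feat_target)}
weights = {"rec": 1.0, "style": 1.0}  # placeholder weights, not from the paper
loss = total_loss(terms, weights)
```

The adversarial and perceptual terms would enter `terms` the same way, but they require a discriminator and a pre-trained VGG-19 respectively, so they are left out here.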

---

### **Tools and Libraries**
   - **Frameworks**: PyTorch (1.10.1); TensorFlow for alternate implementations.
   - **Visualization**: Matplotlib or OpenCV for displaying masks and inpainted results.
   - **GPU Hardware**: Tested on NVIDIA GeForce RTX 3090 for performance.

---

### **Considerations**
   - **Mask Diversity**: Critical for generalization across various corruption scenarios.
   - **Computational Efficiency**: Striking a balance between accuracy and runtime, particularly with Transformer integration.
   - **Evaluation Metrics**:
     - Quantitative: PSNR, SSIM, FID, LPIPS.
     - Qualitative: Visual coherence and texture consistency.
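
Of the quantitative metrics, PSNR is simple enough to compute directly; the sketch below uses the standard definition, 10 * log10(MAX^2 / MSE). The function name and the assumption that images are scaled to [0, 1] are illustrative. SSIM, FID, and LPIPS need dedicated implementations and are not shown.

```python
import numpy as np

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

target = np.zeros((16, 16, 3))
pred = np.full((16, 16, 3), 0.1)  # uniform error of 0.1 gives an MSE of 0.01
score = psnr(pred, target)
```

With an MSE of 0.01 and a peak value of 1, the score works out to 20 dB.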

