
# Denoising Model Pipeline with ResNet and SCSE Block

## 1. Architecture

The architecture combines a **pre-trained ResNet encoder** with an **attention-based U-Net decoder** for image denoising. The encoder supplies multi-scale features, the attention blocks reweight them, and the skip connections carry fine spatial detail through to the decoder.

### 1.1 Overview of U-Net with Attention
The model follows an **Attention U-Net** design built on a pre-trained encoder. It pairs the U-Net encoder-decoder layout, originally developed for segmentation, with attention that focuses the network on critical regions. This is particularly valuable for denoising tasks in which small but important features must be retained.

### 1.2 Key Components in Detail

#### Encoder (Feature Extraction - ResNet)

The encoder extracts features at multiple abstraction levels, beginning with the input image and passing through ResNet layers:

1. **Input Image (RGB)**: The input has 3 channels (red, green, blue). Let \( X \in \mathbb{R}^{H \times W \times 3} \) be the input image, where:
   - \( H \): Image height.
   - \( W \): Image width.

2. **Convolutional Layer**: 
   - The first layer applies a convolution to the image, producing feature maps. For an input \( X \), the convolutional operation is defined as:
     $$
     X'_{i,j} = \sum_{m=1}^{M} \sum_{n=1}^{N} W_{m,n} \cdot X_{i-m, j-n} + b
     $$
     where:
     - \( X' \): Output feature map after convolution.
     - \( W_{m,n} \): Convolution kernel of size \( M \times N \).
     - \( b \): Bias term.
     - \( i, j \): Spatial indices.

   - After convolution, batch normalization and ReLU activation are applied:
     $$
     X'' = \text{ReLU}(\text{BatchNorm}(X'))
     $$

3. **Residual Blocks**: ResNet introduces residual connections to facilitate learning in deep networks. 
   - Each **Residual Block** contains two convolutions with weights \( W_1 \) and \( W_2 \) and adds the input back through a skip connection before the final activation:
     $$
     Y = \text{ReLU}(X + F(X)), \quad F(X) = \text{BatchNorm}(W_2 * \text{ReLU}(\text{BatchNorm}(W_1 * X)))
     $$
     where:
     - \( Y \): Block output after the residual connection.
     - \( F(X) \): Residual transformation applied to \( X \) via convolutions, batch normalization, and activation.

4. **Downsampling**: Strided convolutions are used to reduce spatial dimensions while increasing the number of channels:
   $$
   X_{\text{downsampled}} = W_{\text{stride=2}} * X
   $$

The encoder output consists of multi-level feature maps, capturing complex image patterns necessary for effective denoising.
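
A minimal pure-Python sketch of the encoder operations above, assuming a single channel, nested-list images, and no batch normalization (all simplifications for illustration). It uses the sliding-window (cross-correlation) indexing that deep-learning frameworks implement, and the residual block follows the standard ResNet basic block, which applies the final ReLU after the skip addition. The names `conv2d`, `relu`, and `residual_block` are hypothetical.

```python
def conv2d(x, kernel, bias=0.0, stride=1, padding=0):
    """Single-channel 2D sliding-window convolution on nested lists."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    # Zero-pad the input on all four sides.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = x[i][j]
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = bias
            for m in range(kh):
                for n in range(kw):
                    acc += kernel[m][n] * padded[i * stride + m][j * stride + n]
            row.append(acc)
        out.append(row)
    return out


def relu(x):
    """Element-wise ReLU on a 2D nested list."""
    return [[max(0.0, v) for v in row] for row in x]


def residual_block(x, k1, k2):
    """Y = ReLU(X + F(X)), F = conv(k2) applied to ReLU(conv(k1)); batch norm omitted."""
    f = conv2d(relu(conv2d(x, k1, padding=1)), k2, padding=1)
    return relu([[a + b for a, b in zip(rx, rf)] for rx, rf in zip(x, f)])
```

With a 3x3 kernel and `padding=1`, `stride=1` preserves the spatial size, while `stride=2` halves it, which is the downsampling step described in item 4.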

#### Skip Connections

Skip connections between corresponding encoder and decoder layers retain spatial information:
   $$
   F_{\text{skip}} = \text{Concat}(F_{\text{encoder}}, F_{\text{decoder}})
   $$
   where:
   - \( F_{\text{encoder}} \): Feature map from the encoder.
   - \( F_{\text{decoder}} \): Feature map in the decoder at the same level.

#### SCSE Attention Block

The **SCSE (Spatial and Channel Squeeze & Excitation)** Block selectively emphasizes important channels and spatial regions.

1. **Channel Attention (CSE)**:
   - **Global Average Pooling**: For each channel \( c \), compute a scalar summary:
     $$
     z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{c, i, j}
     $$
     where:
     - \( z_c \): Aggregated channel information.
     - \( F_{c, i, j} \): Feature map value at channel \( c \) and location \( (i, j) \).

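   - **Fully Connected Layers**: Transform the vector of channel summaries \( z = (z_1, \dots, z_C) \) into attention weights using two fully connected layers:
     $$
     s = \sigma(W_2 \cdot \text{ReLU}(W_1 \cdot z))
     $$
     where:
     - \( s_c \): The \( c \)-th entry of \( s \), the attention weight for channel \( c \).
     - \( W_1, W_2 \): Weights of the two fully connected layers.
     - \( \sigma \): Sigmoid activation function.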

   - **Channel Scaling**: Scale each channel \( F_c \) by \( s_c \):
     $$
     F_c' = s_c \cdot F_c
     $$

2. **Spatial Attention (SSE)**:
   - **Spatial Convolution**: A convolution across channels, followed by a sigmoid, generates a single-channel spatial attention map \( M \):
     $$
     M = \sigma(\text{Conv2D}(F))
     $$
     where \( M_{i,j} \) is the spatial weight at location \( (i, j) \).

   - **Spatial Scaling**: Apply \( M_{i,j} \) to every channel of \( F \) at location \( (i, j) \):
     $$
     F_{c,i,j}' = M_{i,j} \cdot F_{c,i,j}
     $$

The SCSE block combines the channel-attended and spatially attended feature maps, typically by element-wise addition, into a refined feature map \( F'' \) that emphasizes informative features and suppresses irrelevant ones.
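
The two attention branches can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the pipeline's implementation: features are nested lists of shape `[C][H][W]`, biases are omitted, the spatial branch is a 1x1 convolution with one weight per channel, and the branches are merged by element-wise addition. The function name `scse` and its parameters are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def scse(f, w1, w2, w_spatial):
    """f: [C][H][W]; w1: [C_r][C]; w2: [C][C_r]; w_spatial: [C] (1x1 conv weights)."""
    c, h, w = len(f), len(f[0]), len(f[0][0])
    # Channel squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (h * w) for ch in f]
    # Channel excitation: two fully connected layers, ReLU then sigmoid.
    hidden = [max(0.0, v) for v in matvec(w1, z)]
    s = [sigmoid(v) for v in matvec(w2, hidden)]
    # Spatial squeeze: a 1x1 convolution across channels, then sigmoid.
    m = [[sigmoid(sum(w_spatial[k] * f[k][i][j] for k in range(c)))
          for j in range(w)] for i in range(h)]
    # Merge the two branches by element-wise addition (assumed).
    return [[[s[k] * f[k][i][j] + m[i][j] * f[k][i][j] for j in range(w)]
             for i in range(h)] for k in range(c)]
```

The output has the same shape as the input, so the block can be dropped between decoder stages without changing the surrounding layers.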

#### Decoder (Image Reconstruction)

The decoder reconstructs a denoised image from the encoder’s multi-scale features:

1. **Upsampling**:
   - Each decoder layer upsamples the feature maps by a factor of 2 using transpose convolutions:
     $$
     F_{\text{upsampled}} = \text{TransposeConv}(F)
     $$

2. **Skip Connection Concatenation**: Concatenate the upsampled feature maps with the encoder feature maps from the same resolution level along the channel axis:
   $$
   F_{\text{decoded}} = \text{Concat}(F_{\text{upsampled}}, F_{\text{encoder}})
   $$

3. **Final Convolution**: A final 1x1 convolution maps the decoded features to a 3-channel denoised image with the same height and width as the input:
   $$
   Y_{\text{output}} = W_{\text{final}} * F_{\text{decoded}}
   $$
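
The three decoder steps can be sketched in pure Python on nested lists. This is a toy illustration: the 2x2 kernel with stride 2, the single-channel inputs, and the helper names `transpose_conv2x`, `concat_channels`, and `conv1x1` are all assumptions, not details from the pipeline.

```python
def transpose_conv2x(x, kernel):
    """Stride-2 transpose convolution with a 2x2 kernel on one [H][W] channel."""
    h, w = len(x), len(x[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            # Each input pixel is spread over a 2x2 output patch.
            for m in range(2):
                for n in range(2):
                    out[2 * i + m][2 * j + n] += x[i][j] * kernel[m][n]
    return out


def concat_channels(a, b):
    """Concatenate two [C][H][W] feature maps along the channel axis."""
    return a + b


def conv1x1(f, weights, bias=0.0):
    """Per-pixel channel mixing: f is [C_in][H][W], weights is [C_out][C_in]."""
    h, w = len(f[0]), len(f[0][0])
    return [[[bias + sum(wk * ch[i][j] for wk, ch in zip(row_w, f))
              for j in range(w)] for i in range(h)] for row_w in weights]


low_res = [[[1.0, 2.0], [3.0, 4.0]]]             # decoder features: 1 channel, 2x2
encoder_skip = [[[0.5] * 4 for _ in range(4)]]   # encoder features: 1 channel, 4x4
up = [transpose_conv2x(ch, [[1.0, 1.0], [1.0, 1.0]]) for ch in low_res]
decoded = concat_channels(up, encoder_skip)      # 2 channels, 4x4
output = conv1x1(decoded, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 channels, 4x4
```

With an all-ones kernel the transpose convolution reduces to nearest-neighbour upsampling; a trained kernel would learn a different interpolation.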

## 2. Handling Variable Image Sizes

To handle images of different sizes without distortion, **adaptive padding** is used to match the largest image dimensions in each batch:
   $$
   X_{\text{padded}} = \text{ZeroPad}(X, (H_{\max}, W_{\max}))
   $$
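
A minimal sketch of per-batch zero padding, assuming single-channel images stored as nested lists and padding added at the bottom and right edges (the source does not say which edges are padded). The helper name `pad_batch` is hypothetical.

```python
def pad_batch(images, fill=0.0):
    """Zero-pad each [H][W] image at the bottom and right to the batch maximum."""
    h_max = max(len(img) for img in images)
    w_max = max(len(img[0]) for img in images)
    padded = []
    for img in images:
        # Extend every row to the maximum width, then append full-width rows.
        rows = [list(row) + [fill] * (w_max - len(row)) for row in img]
        rows += [[fill] * w_max for _ in range(h_max - len(img))]
        padded.append(rows)
    # The original sizes let the caller crop the outputs back afterwards.
    sizes = [(len(img), len(img[0])) for img in images]
    return padded, sizes
```

Returning the original sizes alongside the padded batch makes it possible to crop each denoised output back to its true dimensions before computing metrics.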

## 3. Dataset Management

### Dataset Components
- **Degraded Images**: Noisy images requiring denoising.
- **Defect Masks**: Binary masks indicating regions with important features.
- **Ground Truth**: Clean images used as targets for training.

### Data Splitting
The dataset is split as follows:
   - **Training Set (70%)**: Used to learn patterns in noisy data.
   - **Validation Set (20%)**: Monitors performance during training.
   - **Test Set (10%)**: Evaluates final model generalization.
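
One way to produce such a split reproducibly is a seeded shuffle followed by slicing. The sketch below assumes the dataset is a list of sample identifiers (for example file names); the function name `split_dataset` and the default seed are hypothetical.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.2, seed=0):
    """Shuffle with a fixed seed, then slice into train / validation / test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    # Whatever remains after the first two slices forms the test set.
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

Splitting by identifier keeps each degraded image, its defect mask, and its ground truth in the same subset.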

## 4. Training Setup

### Loss Function: Weighted MSE Loss

A weighted Mean Squared Error (MSE) loss gives more importance to defect regions, preserving critical details in the denoised image:

1. **Weighted Loss Definition**:
   $$
   \text{Loss} = \text{MSE}_{\text{non-masked}} + \lambda \cdot \text{MSE}_{\text{masked}}
   $$
   where \( \lambda \) is a weighting factor for the defect (masked) areas.
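
A minimal sketch of this loss on flattened pixel lists. It assumes each term is averaged over its own region and that an empty region contributes zero; the source does not fix either choice, and the default `lam=2.0` is an arbitrary illustrative value.

```python
def weighted_mse(pred, target, mask, lam=2.0):
    """pred, target, mask: flat lists of equal length; mask is 1 inside defects."""
    def region_mse(flag):
        errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m == flag]
        # An empty region contributes nothing to the loss (assumed).
        return sum(errs) / len(errs) if errs else 0.0

    return region_mse(0) + lam * region_mse(1)
```

For example, with `pred=[1, 1, 0, 0]`, `target=[0, 0, 0, 0]`, and `mask=[1, 0, 0, 0]`, the non-masked term is 1/3, the masked term is 1, and the loss with `lam=2.0` is 1/3 + 2.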

### Optimizer: AdamW
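   - The **AdamW** optimizer applies weight decay directly to the parameters, decoupled from the gradient-based update. This regularizes the model, while Adam's moment estimates smooth noisy gradients:
     $$
     \theta_{t+1} = \theta_t - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda_{\text{wd}} \, \theta_t \right)
     $$
     where:
     - \( \theta_t \): Parameter at step \( t \).
     - \( \hat{m}_t, \hat{v}_t \): Bias-corrected exponentially decayed averages of the gradients and of the squared gradients, respectively.
     - \( \alpha \): Learning rate.
     - \( \lambda_{\text{wd}} \): Weight decay coefficient.
     - \( \epsilon \): Small constant to avoid division by zero.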
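
A single AdamW update for one scalar parameter can be sketched as follows. It implements the standard formulation with bias-corrected moments and decoupled weight decay scaled by the learning rate; the function name `adamw_step` is hypothetical and the default hyperparameters are common choices, not values taken from this pipeline.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    # Exponentially decayed averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialisation of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive gradient step plus decoupled weight decay.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

Note that with a zero gradient the parameter still shrinks by the factor `1 - lr * weight_decay`, which is what distinguishes decoupled weight decay from an L2 penalty folded into the gradient.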

### Learning Rate Adjustment
   - **Scheduler (`ReduceLROnPlateau`)**: Lowers the learning rate by a constant factor when the validation loss has stopped improving for a set number of epochs, which stabilizes convergence late in training.
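
The plateau logic can be illustrated with a simplified stand-in. The class below is a hypothetical sketch, not the library scheduler: it omits the improvement threshold and cooldown options, and the names `PlateauScheduler`, `factor`, `patience`, and `min_lr` are chosen for illustration.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau rule for a single learning rate."""

    def __init__(self, lr, factor=0.1, patience=2, min_lr=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the validation loss keeps the learning rate unchanged while the loss improves and multiplies it by `factor` after a stall longer than `patience` epochs.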

## 5. Evaluation Metrics

### Peak Signal-to-Noise Ratio (PSNR)
   - **PSNR** measures the pixel-level similarity of the denoised output to the ground truth:
     $$
     \text{PSNR} = 10 \cdot \log_{10} \left( \frac{\text{MAX}^2}{\text{MSE}} \right)
     $$
     where \( \text{MAX} \) is the maximum possible pixel value (e.g., 255 for 8-bit images).
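
The formula translates directly into code. The sketch below assumes flattened pixel lists and returns infinity for identical images, where the MSE is zero; the function name `psnr` is illustrative.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in decibels for two flat pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)
```

As a worked case, a uniform error of 25.5 on an 8-bit scale gives MSE = 650.25, so MAX²/MSE = 100 and the PSNR is 20 dB.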

### Structural Similarity Index (SSIM)
   - SSIM evaluates perceptual quality, incorporating luminance, contrast, and structure:
     $$
     \text{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + C_1)(2 \sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
     $$
     where:
     - \( x \) and \( y \): The two images being compared (e.g., denoised output and ground truth).
     - \( \mu_x \) and \( \mu_y \): Mean pixel intensities of \( x \) and \( y \), respectively.
     - \( \sigma_x^2 \) and \( \sigma_y^2 \): Variances of \( x \) and \( y \), respectively.
     - \( \sigma_{xy} \): Covariance between \( x \) and \( y \).
     - \( C_1 \) and \( C_2 \): Stabilization constants to avoid division by zero.

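SSIM values lie in the interval \([-1, 1]\); for natural image pairs they typically fall between 0 and 1, with 1 reached only when the two images are identical. Values closer to 1 indicate higher structural similarity, which makes SSIM useful for assessing perceptual quality in denoising.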
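
A simplified sketch of the SSIM formula, computed once over the whole image. Practical implementations usually evaluate the same expression in local windows and average the results, so this global version is only an illustration. The constants \( C_1 = (0.01 \cdot \text{MAX})^2 \) and \( C_2 = (0.03 \cdot \text{MAX})^2 \) are the conventional choices, and the function name `global_ssim` is hypothetical.

```python
def global_ssim(x, y, max_value=255.0):
    """SSIM over two flat pixel lists, using global (not windowed) statistics."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Conventional stabilisation constants.
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Comparing an image with itself yields 1, while comparing `[0, 255]` with `[255, 0]` yields a negative value because the covariance is negative.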

## 6. Algorithmic Summary

1. **Input Preparation**:
   - Load degraded (noisy) images, ground truth images, and defect masks.

2. **Encoder (ResNet)**:
   - Pass the input through convolutional layers and residual blocks with downsampling.

3. **SCSE Attention Mechanism**:
   - Apply Channel Attention (CSE) to prioritize significant channels.
   - Apply Spatial Attention (SSE) to highlight critical spatial regions.

4. **Decoder**:
   - Upsample the refined features.
   - Combine with skip connections from the encoder to retain spatial details.

5. **Output Layer**:
   - Apply a final 1x1 convolution to produce the denoised RGB image.

6. **Loss Calculation**:
   - Compute the **Weighted MSE Loss** on masked (defect) and non-masked areas.

7. **Optimization (AdamW)**:
   - Update parameters using AdamW optimizer with weight decay.

8. **Evaluation**:
   - Calculate PSNR for pixel accuracy and SSIM for perceptual quality.

## Summary

This architecture and training setup targets the central tension in image denoising: removing noise while preserving critical features. It does so by combining a ResNet encoder, SCSE attention, and a defect-weighted MSE loss, and it measures the result with PSNR for pixel accuracy and SSIM for perceptual quality.



