### 1. Depthwise Convolution
---
**Equation:**

$$
y_i = \sum_{j \in \mathcal{L}(i)} W_{i-j} \odot X_j
$$

- **Explanation:**
    - In depthwise convolution, we apply a kernel $W$ to a local neighborhood $\mathcal{L}(i)$ of pixel $i$. The kernel weights $W_{i-j}$ are multiplied elementwise ($\odot$) with the corresponding pixel features $X_j$ in this neighborhood, so each channel is filtered independently and channels are never mixed.
    - This means each pixel $i$ gathers information only from its nearby pixels defined by the size of the kernel (e.g., 3x3, 5x5).

- **Receptive field:**
    - The receptive field is **local**, determined by the kernel size. This limits the information gathering to small neighborhoods of pixels.
    - For a K × K kernel, each output pixel in the feature map has a receptive field of K × K in the input image.
    - Increasing the filter size K directly increases the receptive field, allowing the model to capture larger local patterns. Conversely, reducing K decreases the receptive field, focusing on smaller details.

- **Information Propagation:**
  - Information from neighboring pixels is combined based on the weights defined by the filter, allowing the model to learn local features effectively. Each output pixel is influenced solely by the pixels within its receptive field.
---
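
The locality claim above can be checked numerically. The following is a minimal sketch, not a reference implementation: `depthwise_conv2d` is a hypothetical helper written for this note, it assumes zero padding and an odd kernel size, and like most deep-learning "convolutions" it does not flip the kernel. It perturbs a single input pixel and reports which outputs change.

```python
import numpy as np

def depthwise_conv2d(x, w):
    """Naive depthwise filtering with zero padding (illustrative helper).

    x: (H, W, C) feature map
    w: (K, K, C) kernels, one K x K kernel per channel, K odd
    """
    H, W, C = x.shape
    K = w.shape[0]
    p = K // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    y = np.zeros_like(x)
    for r in range(H):
        for c in range(W):
            # Elementwise product over the K x K window, summed over space only,
            # so channels never mix.
            y[r, c] = np.sum(xp[r:r + K, c:c + K] * w, axis=(0, 1))
    return y

rng = np.random.default_rng(0)
x = rng.random((7, 7, 2))
w = rng.random((3, 3, 2))
y = depthwise_conv2d(x, w)

# Perturb one input pixel in channel 0 and see which outputs move.
x2 = x.copy()
x2[3, 3, 0] += 1.0
changed = np.abs(depthwise_conv2d(x2, w) - y) > 1e-12
rows, cols = np.nonzero(changed[:, :, 0])
print(rows.min(), rows.max(), cols.min(), cols.max())  # 2 4 2 4
print(changed[:, :, 1].any())  # False: the other channel is untouched
```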
**Example in Python:**

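```python
import cv2
import numpy as np

# Sample single-channel image and kernel for depthwise convolution
image = np.random.rand(5, 5)
kernel = np.ones((3, 3))  # 3x3 kernel

# Apply the filter using OpenCV. filter2D processes each channel independently
# (the depthwise behavior) and computes a correlation, which equals a
# convolution for a symmetric kernel such as this one.
output = cv2.filter2D(image, -1, kernel)

output
```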

### 2. Self-Attention

---
**Equation:**

$$
y_i = \sum_{j \in \mathcal{G}} \frac{\exp(X_i^T X_j)}{\sum_{k \in \mathcal{G}} \exp(X_i^T X_k)} X_j
$$

- **Explanation:**
    - In self-attention, every pixel $i$ attends to all pixels $j \in \mathcal{G}$ in the image, itself included. The contribution of each pixel $j$ to the output at pixel $i$ is weighted by the similarity score $X_i^T X_j$, normalized with a softmax over all pixels.
    - The similarity score captures how much the value of pixel $j$ should contribute to the output at pixel $i$, based on the inner product of their feature representations.
    
- **Receptive field:**
    - The receptive field in self-attention is **global**, meaning pixel $i$ integrates information from **all pixels** in the image. This allows the model to capture long-range dependencies and interactions, unlike convolution, which is limited to a local neighborhood.
    - The size of the receptive field does not depend on kernel size, as all pixels can interact with each other directly. The receptive field is inherently the entire input image, allowing for complex relationships to be captured regardless of distance.

- **Information Propagation:**
    - The attention mechanism is content-aware: information is aggregated based on the relevance of the surrounding pixels, not just their spatial proximity. Information flows globally, allowing the model to capture relationships between distant pixels. This capability enhances the contextual understanding of image components.
    - The global nature of the receptive field makes self-attention suitable for tasks where context from the entire image is important.

---
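
As a contrast with the local case, the global receptive field can be checked the same way. This is a minimal sketch under the simplified form of the equation above (no learned query, key, or value projections); the `softmax` helper and the perturbation size are choices made for this illustration. Changing one pixel's features should change the output at every pixel.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # X: (N, d), one row per pixel. Returns the (N, N) weights and the (N, d) output.
    A = softmax(X @ X.T, axis=1)
    return A, A @ X

rng = np.random.default_rng(0)
X = rng.random((6, 4))
A, Y = self_attention(X)

# Every row of weights is a probability distribution over all pixels.
print(np.allclose(A.sum(axis=1), 1.0))  # True

# Perturb only the last pixel: every output row moves.
X2 = X.copy()
X2[5] += 0.5
_, Y2 = self_attention(X2)
changed = np.abs(Y2 - Y).max(axis=1) > 1e-9
print(changed.all())  # True
```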

**Example in Python:**

```python
import numpy as np

# Define a simple self-attention mechanism
def self_attention(X):
    """
    A simplified self-attention function.
    X: Input feature matrix (each row is a pixel's feature vector)
    Returns: Weighted sum of input features using self-attention
    """
    # Compute the similarity scores (dot-product attention), shape (N, N),
    # where N is the number of pixels. Subtracting each row's max before
    # exponentiating avoids overflow and does not change the result.
    logits = X @ X.T
    scores = np.exp(logits - logits.max(axis=1, keepdims=True))

    # Normalize the attention scores by rows
    attention_weights = scores / np.sum(scores, axis=1, keepdims=True)

    # Compute the attention-weighted output
    output = attention_weights @ X

    return output

# Example input (5 pixels with 5-dimensional feature vectors)
X = np.random.rand(5, 5)

# Apply self-attention
attention_output = self_attention(X)

# Display the output
attention_output
```

### 3. Post-normalization Combination
---
**Equation:**

$$
y_i^{post} = \sum_{j \in \mathcal{G}} \left( \frac{\exp(X_i^T X_j)}{\sum_{k \in \mathcal{G}} \exp(X_i^T X_k)} + W_{i-j} \right) X_j
$$

- **Explanation:**
    - In this method, we combine the results of **self-attention** and **depthwise convolution** after normalizing the attention scores.
    - The self-attention term allows pixel $i$ to aggregate global information from all pixels $j$, while the convolution term adds local information from nearby pixels.
    - The result is a blend of global (attention) and local (convolution) information at pixel $i$.

- **Receptive field:**
    - The receptive field here is both **local** (from the convolution) and **global** (from the self-attention), giving the model access to both nearby and distant pixels.

- **Information Propagation:**
  - The two terms are computed independently and then summed: the convolution term contributes fixed, position-dependent weights for nearby pixels, while the attention term contributes content-dependent weights for all pixels. Because $W_{i-j}$ is added after the softmax, the combined weights are not guaranteed to sum to 1.

---
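
To make the role of $W_{i-j}$ concrete, here is a minimal 1D sketch of the post-normalization form. It assumes pixels laid out along a line, a scalar kernel weight per relative offset (shared across channels), and zero weight outside the kernel window; `relative_weight_matrix` is a hypothetical helper introduced for this note. It confirms that adding the kernel to the attention weights is the same as adding a local filtering of $X$ to the attention output.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relative_weight_matrix(w, n):
    """Expand a 1D kernel w of odd length 2r+1 into an (n, n) matrix M
    with M[i, j] = w[i - j + r] when |i - j| <= r, and 0 otherwise."""
    r = len(w) // 2
    offsets = np.arange(n)[:, None] - np.arange(n)[None, :]  # i - j
    inside = np.abs(offsets) <= r
    return np.where(inside, w[np.clip(offsets + r, 0, 2 * r)], 0.0)

rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.random((n, d))
w = rng.random(3)  # kernel weights for offsets i - j = -1, 0, +1

A = softmax(X @ X.T, axis=1)      # global, content-dependent weights
W = relative_weight_matrix(w, n)  # local, position-dependent weights
Y_post = (A + W) @ X

# Reference: attention output plus an explicit local filter with zero padding.
conv = np.zeros_like(X)
for i in range(n):
    for j in range(max(0, i - 1), min(n, i + 2)):
        conv[i] += w[i - j + 1] * X[j]

print(np.allclose(Y_post, A @ X + conv))  # True
```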

**Example in Python:**

```python
import numpy as np
import cv2

# Define a post-normalization combination function
def post_normalization_combination(X, kernel):
    """
    Combines self-attention and depthwise convolution post-normalization.
    X: Input feature matrix (each row is a pixel's feature vector)
    kernel: Convolution kernel
    Returns: Combined feature map
    """
    # Compute the self-attention scores (row max subtracted for stability)
    logits = X @ X.T
    attention_scores = np.exp(logits - logits.max(axis=1, keepdims=True))

    # Normalize the attention scores
    attention_weights = attention_scores / np.sum(attention_scores, axis=1, keepdims=True)

    # Apply the convolution. Simplification: X is filtered as if it were a
    # 2D image, so the kernel also slides across the feature dimension.
    conv_output = cv2.filter2D(X, -1, kernel)

    # Combine self-attention and convolution post-normalization
    combined_output = attention_weights @ X + conv_output

    return combined_output

# Example input (5 pixels with 5-dimensional feature vectors)
X = np.random.rand(5, 5)
kernel = np.random.rand(3, 3)

# Apply post-normalization combination
post_norm_output = post_normalization_combination(X, kernel)

# Display the output
post_norm_output
```

### 4. Pre-normalization Combination
---
**Equation:**

$$
y_i^{pre} = \sum_{j \in \mathcal{G}} \frac{\exp(X_i^T X_j + W_{i-j})}{\sum_{k \in \mathcal{G}} \exp(X_i^T X_k + W_{i-k})} X_j
$$

- **Explanation:**
    - In this method, the convolution weights $W_{i-j}$ and self-attention scores $X_i^T X_j$ are combined **before normalization**.
    - The convolution kernel directly influences the attention mechanism by adjusting the attention scores prior to softmax normalization.
    - The self-attention term still enables global information propagation, while the convolution kernel ensures that local information is preserved.
  
- **Receptive field:**
    - The receptive field is a mix of **local** (convolution) and **global** (self-attention) interactions.
    - Because $W_{i-j}$ is added to the scores before the softmax, it acts as a bias that depends only on the relative position $i - j$. It raises or lowers the attention paid to specific offsets before the scores are normalized across the global field.

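- **Information Propagation:**
  - Information still flows globally, as in self-attention, but the weight on each pixel $j$ now reflects both content similarity and relative position. Normalizing the combined scores keeps the weights for each pixel $i$ on a common scale (they sum to 1), which is the stabilizing effect of this variant.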

---
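
The effect of the bias can be seen in a small 1D sketch. The setup is an assumption made for illustration: pixels along a line, and a hypothetical bias of +2 for offsets inside a window of radius 1 and 0 elsewhere. The weights still sum to 1 and every pixel remains reachable, but the share of attention that lands on nearby pixels grows.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, r = 8, 4, 1
X = rng.random((n, d))

# Hypothetical relative-position bias: +2 inside the window |i - j| <= r, 0 outside.
offsets = np.arange(n)[:, None] - np.arange(n)[None, :]
local = np.abs(offsets) <= r
W = np.where(local, 2.0, 0.0)

A_plain = softmax(X @ X.T, axis=1)    # plain self-attention weights
A_pre = softmax(X @ X.T + W, axis=1)  # bias added before normalization
Y_pre = A_pre @ X

# Fraction of each row's attention that falls inside the local window.
mass_plain = (A_plain * local).sum(axis=1)
mass_pre = (A_pre * local).sum(axis=1)

print(np.allclose(A_pre.sum(axis=1), 1.0))  # True: still a distribution
print(np.all(A_pre > 0))                    # True: still global
print(np.all(mass_pre > mass_plain))        # True: more weight on neighbors
```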

**Example in Python:**

```python
import numpy as np
import cv2

# Define a pre-normalization combination function
def pre_normalization_combination(X, kernel):
    """
    Combines self-attention and depthwise convolution pre-normalization.
    X: Input feature matrix (each row is a pixel's feature vector), shape (N, d)
    kernel: (N, N) matrix of biases; entry (i, j) stands in for W_{i-j}
    Returns: Combined feature map
    """
    # Compute the combined self-attention + convolution scores pre-normalization
    # (row max subtracted before exponentiating for numerical stability)
    logits = X @ X.T + kernel
    combined_scores = np.exp(logits - logits.max(axis=1, keepdims=True))

    # Normalize the combined scores
    attention_weights = combined_scores / np.sum(combined_scores, axis=1, keepdims=True)

    # Apply the attention weights to the input
    combined_output = attention_weights @ X

    return combined_output

# Example input (5 pixels with 5-dimensional feature vectors)
X = np.random.rand(5, 5)
kernel = np.random.rand(5, 5)  # Must match the (N, N) score matrix; here N = d = 5

# Apply pre-normalization combination
pre_norm_output = pre_normalization_combination(X, kernel)

# Display the output
pre_norm_output
```

### 5. Attention Modulated Convolution
---
**Equation:**

$$
Y(i) = \sum_{j \in \mathcal{L}(i)} A(i, j) \odot X(j) \odot W(i - j)
$$

where the attention weight $A(i, j)$ is given as:

$$
A(i, j) = \frac{\exp(\text{score}(Q(i), K(j)))}{\sum_{k \in \mathcal{L}(i)} \exp(\text{score}(Q(i), K(k)))}
$$

- **Explanation:**
    - In **Attention Modulated Convolution**, the convolution process is modulated by attention weights $A(i, j)$, which dynamically scale the contribution of each neighboring pixel $j$ in the convolution operation.
    - The attention weights $A(i, j)$ are computed using a similarity score (e.g., dot-product) between the query $Q(i)$ at pixel $i$ and the key $K(j)$ at neighboring pixel $j$.
    - The attention mechanism adjusts the contribution of each neighboring pixel based on how relevant it is to the current pixel $i$, making the convolution adaptive to the content of the image.

- **Receptive field:**
    - The receptive field is **local** (as determined by the convolution filter size), but the attention weights allow for more flexible and content-dependent information aggregation from neighboring pixels.

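- **Information Propagation:**
  - This operation emphasizes important local features by modulating the kernel weights with attention computed over the same neighborhood $\mathcal{L}(i)$. Aggregation stays local but becomes content-dependent, unlike a plain convolution whose weights are fixed.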

---
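
The equation above restricts both the sum and the softmax to the neighborhood $\mathcal{L}(i)$. A minimal 1D sketch of that local form follows. It assumes pixels along a line, dot-product scores, a depthwise kernel with one weight per offset and channel, and neighborhoods clipped at the borders; `attention_modulated_conv1d` is a hypothetical name used only here.

```python
import numpy as np

def attention_modulated_conv1d(X, Q, K, W):
    """Local attention-modulated depthwise filtering along one axis.

    X: (N, d) features, Q: (N, dk) queries, K: (N, dk) keys
    W: (2r+1, d) depthwise kernel, indexed by offset i - j + r
    """
    n, _ = X.shape
    r = W.shape[0] // 2
    Y = np.zeros_like(X)
    for i in range(n):
        js = np.arange(max(0, i - r), min(n, i + r + 1))  # L(i), clipped at borders
        scores = Q[i] @ K[js].T                           # one score per neighbor
        a = np.exp(scores - scores.max())
        a /= a.sum()                                      # softmax over L(i) only
        # A(i, j) * X(j) * W(i - j), summed over the neighborhood
        Y[i] = np.sum(a[:, None] * X[js] * W[i - js + r], axis=0)
    return Y

rng = np.random.default_rng(0)
n, d, dk = 8, 4, 3
X = rng.random((n, d))
Q = rng.random((n, dk))
K = rng.random((n, dk))
W = rng.random((3, d))

Y = attention_modulated_conv1d(X, Q, K, W)

# Perturbing the last pixel only affects outputs whose neighborhood contains it.
X2 = X.copy()
X2[7] += 1.0
Y2 = attention_modulated_conv1d(X2, Q, K, W)
print(np.array_equal(Y2[:6], Y[:6]))  # True: rows 0..5 never see pixel 7
print(np.allclose(Y2[6:], Y[6:]))     # False: rows 6 and 7 do
```

Here `Q` and `K` are passed in separately from `X`, so perturbing `X` leaves the attention weights unchanged; if they were projections of `X`, the locality argument would still hold because each score only involves pixels in $\mathcal{L}(i)$.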

**Example in Python:**

```python
import numpy as np
import cv2

# Define the attention modulated convolution function
def attention_modulated_convolution(X, kernel, Q, K):
    """
    Applies attention modulated convolution.
    X: Input feature matrix (each row is a pixel's feature vector)
    kernel: Convolution kernel
    Q: Query matrix (for attention)
    K: Key matrix (for attention)
    Returns: Convolution result modulated by attention.
    """
    # Compute attention scores (dot product of query and key),
    # subtracting each row's max for numerical stability
    logits = Q @ K.T
    attention_scores = np.exp(logits - logits.max(axis=1, keepdims=True))

    # Normalize the attention scores
    attention_weights = attention_scores / np.sum(attention_scores, axis=1, keepdims=True)

    # Apply depthwise convolution using OpenCV
    conv_output = cv2.filter2D(X, -1, kernel)

    # Modulate convolution output by attention weights.
    # Simplification: the weights here span all pixels and are applied after
    # the convolution, whereas the equation restricts A(i, j) to L(i) and
    # applies it inside the convolution sum.
    attention_modulated_output = attention_weights @ conv_output

    return attention_modulated_output

# Example input (5 pixels with 5-dimensional feature vectors)
X = np.random.rand(5, 5)
kernel = np.random.rand(3, 3)  # A random convolution kernel

# Example query and key matrices (for attention)
Q = np.random.rand(5, 5)
K = np.random.rand(5, 5)

# Apply attention modulated convolution
attention_conv_output = attention_modulated_convolution(X, kernel, Q, K)

# Display the output
attention_conv_output
```

### 6. Convolution Modulated Attention
---
**Equation:**

$$
\text{Attention}(X) = \text{Softmax}\left( \text{DWConv}_{F \times F}(X, W) \right) \odot V
$$

- **Explanation:**
    - In **Convolution Modulated Attention**, depthwise convolution (denoted as $\text{DWConv}_{F \times F}$) is first applied to the input feature map $X$ using an $F \times F$ convolution kernel $W$. This captures local spatial relationships.
    - The result of the depthwise convolution is passed through a softmax function, which transforms it into attention weights.
    - These attention weights are then applied to another input feature map $V$ to generate the final modulated output.
    - The convolution operation allows local features to be aggregated, while the attention mechanism ensures that the features are modulated based on their relevance.

- **Receptive field:**
    - The receptive field is **local**, determined by the convolution kernel size, but the attention mechanism allows for a more adaptive combination of local features.

- **Information Propagation:**
  - The process starts by capturing information locally with depthwise convolution. The softmax then turns these local responses into attention weights, which rescale $V$ elementwise.
  - Because the weights come from a convolution, the context they encode is limited to the kernel's neighborhood. Within that neighborhood, the model can adaptively weight features and focus on the most important parts of the input.

---

**Example in Python:**

```python
import numpy as np
import cv2

# Define the convolution modulated attention function
def convolution_modulated_attention(X, kernel, V):
    """
    Applies convolution modulated attention.
    X: Input feature matrix (each row is a pixel's feature vector)
    kernel: Convolution kernel
    V: Input feature map to be modulated by attention (same shape as X)
    Returns: Attention-modulated feature map.
    """
    # Apply depthwise convolution to input X
    conv_output = cv2.filter2D(X, -1, kernel)

    # Compute the attention weights by applying softmax to convolution output
    # (row-wise, with the row max subtracted for numerical stability)
    exp_conv = np.exp(conv_output - conv_output.max(axis=1, keepdims=True))
    attention_weights = exp_conv / np.sum(exp_conv, axis=1, keepdims=True)

    # Modulate the feature map V with attention weights.
    # The equation uses an elementwise product, so this is `*`, not `@`.
    attention_modulated_output = attention_weights * V

    return attention_modulated_output

# Example input (5 pixels with 5-dimensional feature vectors)
X = np.random.rand(5, 5)
kernel = np.random.rand(3, 3)  # A random convolution kernel
V = np.random.rand(5, 5)  # A separate feature map to be modulated

# Apply convolution modulated attention
conv_mod_attn_output = convolution_modulated_attention(X, kernel, V)

# Display the output
conv_mod_attn_output
```
