# Hierarchical Fusion for Multimodal Sarcasm Detection
**Author:** Anoop Singh


## Motivation
Sarcasm is a linguistic device that conveys meaning by saying the opposite of what is actually intended. It is common on social media, and recognizing it is critical to understanding sentiment, intent, and context in user-generated content.

While text alone has been the traditional focus of sarcasm detection, we observe a growing use of multimodal content, especially on platforms like Twitter where users frequently combine text with images. Detecting sarcasm in such content from the text alone misses important visual and contextual cues.

For example, a tweet like:

    "Yum, hospital cafeteria food!"


![Example tweet image](eg.png)


… is obviously sarcastic, but only when you see the image.

This motivated us to explore a more robust sarcasm detection approach using hierarchical fusion of three modalities:

* **Text**: the tweet content.
* **Image**: the attached visual.
* **Image attributes**: e.g., objects, colors, and scenes inferred from the image.


## Background: From Text to Multimodal Sarcasm Detection

### Historical Perspective on Sarcasm Detection

* **Early Work**: Focused on handcrafted features like n-grams, punctuation, or sentiment shift (Riloff et al., 2013).
* **Deep Learning Phase**: Models like CNNs, LSTMs, and attention mechanisms emerged (Ghosh & Veale, 2016; Baziotis et al., 2018), yielding better results on text alone.
* **Multimodal Beginnings**: Schifanella et al. (2016) were among the first to explore sarcasm in multimodal tweets using simple feature concatenation of image and text.

### What's New in This Work?

Cai et al. (2019) proposed a **Hierarchical Fusion Model**, which:

* Uses *three modalities*: text, image, and image attributes.
* Applies *fusion in stages*: early (for context initialization), representation (for attention-guided encoding), and modality fusion (for final integration).
* Outperforms unimodal and naïve fusion approaches significantly.
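
The final modality-fusion stage in the list above can be pictured as an attention-weighted sum of one vector per modality. The following is a minimal sketch under stated assumptions, not the implementation from Cai et al. (2019): the class name `ModalityFusion`, the layer sizes, and the random inputs are all hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityFusion(nn.Module):
    """Hypothetical sketch: attention-weighted sum of per-modality vectors."""

    def __init__(self, dims, fused_dim=64, hidden_dim=32):
        super().__init__()
        # One small scoring network per modality: vector -> scalar score
        self.scorers = nn.ModuleList([
            nn.Sequential(nn.Linear(d, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1))
            for d in dims
        ])
        # One projection per modality into a shared space of size fused_dim
        self.projections = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])

    def forward(self, vectors):
        # scores: (batch, n_modalities)
        scores = torch.cat([s(v) for s, v in zip(self.scorers, vectors)], dim=-1)
        weights = F.softmax(scores, dim=-1)
        # projected: (batch, n_modalities, fused_dim)
        projected = torch.stack([p(v) for p, v in zip(self.projections, vectors)], dim=1)
        # Weighted sum over the modality axis
        fused = (weights.unsqueeze(-1) * projected).sum(dim=1)
        return fused, weights


torch.manual_seed(0)
# Random stand-ins for the text, image, and attribute vectors (batch of 4)
text_vec = torch.randn(4, 512)
image_vec = torch.randn(4, 1024)
attr_vec = torch.randn(4, 200)

fusion = ModalityFusion(dims=[512, 1024, 200])
fused, weights = fusion([text_vec, image_vec, attr_vec])
print(fused.shape)    # fused representation per example
print(weights[0])     # how much each modality contributes for the first example
```

Because the weights are produced by a softmax, they sum to 1 for each example and can be read as the relative contribution of text, image, and attributes to the fused vector.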






## Learnings from This Work

Here's what we gained from implementing and exploring this model:

* **Attributes matter**: Object and color tags extracted from images help "translate" visual information into something the model can relate to the text.
* **Early fusion boosts performance**: Using attribute embeddings to initialize the text encoder (Bi-LSTM) improves semantic alignment.
* **Hierarchical fusion is effective**: Combining features in structured stages yields better results than simply concatenating them.
* **Attention mechanisms** enhance interpretability, showing where the model is focusing, be it on certain words, image regions, or specific attributes.
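
The early-fusion idea above, where attribute embeddings initialize the Bi-LSTM text encoder, might look roughly like this. It is a sketch under assumptions rather than the original code: the class name `AttributeInitializedBiLSTM`, the ReLU projection, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class AttributeInitializedBiLSTM(nn.Module):
    """Hypothetical sketch: a Bi-LSTM whose initial states come from an attribute vector."""

    def __init__(self, attr_dim=200, embed_dim=200, hidden_dim=256):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Projects the attribute vector to h0 and c0 for both directions (4 * hidden_dim values)
        self.init_proj = nn.Linear(attr_dim, 4 * hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_embeds, attr_vec):
        batch = word_embeds.size(0)
        # Non-linear projection of the attribute vector (ReLU is an assumption here)
        init = torch.relu(self.init_proj(attr_vec))
        h0, c0 = init.chunk(2, dim=-1)  # each: (batch, 2 * hidden_dim)
        # nn.LSTM expects initial states shaped (num_directions, batch, hidden_dim)
        h0 = h0.reshape(batch, 2, self.hidden_dim).transpose(0, 1).contiguous()
        c0 = c0.reshape(batch, 2, self.hidden_dim).transpose(0, 1).contiguous()
        outputs, _ = self.lstm(word_embeds, (h0, c0))
        return outputs  # (batch, seq_len, 2 * hidden_dim)


torch.manual_seed(0)
word_embeds = torch.randn(2, 7, 200)  # 2 tweets, 7 tokens each, 200-dim embeddings
attr_vec = torch.randn(2, 200)        # one pooled attribute vector per tweet

encoder = AttributeInitializedBiLSTM()
outputs = encoder(word_embeds, attr_vec)
print(outputs.shape)
```

The point of the design is that the text encoder starts reading the tweet already conditioned on what is in the image, instead of starting from zero states.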



## Code Snippet: Attention Over Attributes


```python
# Simplified example: attention weights for attribute guidance
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # make the random example reproducible

# Dummy attribute embeddings: 5 attributes, 200 dimensions each
attributes = torch.randn(5, 200)

# Randomly initialized parameters of a two-layer scoring network
W1 = torch.randn(128, 200)
W2 = torch.randn(1, 128)
b1 = torch.randn(128)
b2 = torch.randn(1)

# Attention scores, one per attribute: shape (1, 5)
alpha = W2 @ torch.tanh(W1 @ attributes.T + b1[:, None]) + b2

# Normalize across the 5 attributes so the weights sum to 1
attention_weights = F.softmax(alpha, dim=1)
print('Attention Weights:', attention_weights)
```

This approximates how the model assigns importance to different attributes before fusing them with the other modalities.


## 📊 Results

To evaluate the model's performance on multimodal sarcasm detection, we used metrics such as **accuracy**, **precision**, **recall**, and **F1-score**. We also visualize the **confusion matrix** to show how predictions are distributed across the two classes.





![image.png](image.png)
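
For reference, the four metrics can all be derived from the confusion-matrix counts. The sketch below uses a hypothetical helper, `binary_metrics`, and a handful of made-up toy labels; these are not the model's results and serve only to show the arithmetic.

```python
def binary_metrics(y_true, y_pred, positive="sarcasm"):
    """Confusion-matrix counts and derived metrics for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: truth, columns: prediction
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only (not taken from the experiments)
y_true = ["sarcasm", "sarcasm", "not sarcasm", "not sarcasm", "sarcasm", "not sarcasm"]
y_pred = ["sarcasm", "not sarcasm", "not sarcasm", "sarcasm", "sarcasm", "not sarcasm"]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, value)
```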

Sample predictions on individual tweets (`Labels` lists the attribute words for the attached image):

```text
>>> Example 1 <<<
Text: as we all know , calligraphy is one of the artsy-trendy ! grab a personalized notebook and get yours for only php 50
Labels: ['sign', 'painted', 'sand', 'says', 'bed']
Truth: not sarcasm
Prediction: not sarcasm
>>> Example 2 <<<
Text: i will not remain silent . israel and the temple mount belong to the jews . <user> <user> <user>
Labels: ['building', 'large', 'parked', 'people', 'landing']
Truth: not sarcasm
Prediction: not sarcasm
>>> Example 3 <<<
Text: today 's winner of the kia : bandwagon fan of the day award , <user> ! # warriorsbandwagon emoji_1031
Labels: ['happy', 'bun', 'phone', 'bathroom', 'cellphone']
Truth: not sarcasm
Prediction: not sarcasm
>>> Example 4 <<<
Text: larries out here living the treat people with kindness life . # eyeroll
Labels: ['picture', 'birds', 'perched', 'screen', 'showing']
Truth: sarcasm
Prediction: sarcasm
>>> Example 5 <<<
Text: the saying of " i 'm going to hell for this or laughing at this " is quite often . # myhumorisdark  # andbeing # savage is part of me now
Labels: ['painting', 'old', 'picture', 'photo', 'woman']
Truth: sarcasm
Prediction: sarcasm
```

## Reflections

### What surprised me?

* How much performance improved **just by using object-level image information**.
* Some sarcastic tweets are virtually impossible to interpret correctly without images, especially when the text is neutral or positive in tone.

### What can be improved?

* The model doesn't yet use **transformers** or **vision-language pretraining (e.g., CLIP, BLIP)**, which could substantially improve both alignment and generalization.
* **Commonsense reasoning** is still missing; sarcasm often relies on unstated knowledge.
* Training on larger or multilingual datasets would help validate the method across different cultures and languages.



## References

* Cai et al., 2019. [Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model](https://aclanthology.org/P19-1239.pdf)
* Ghosh & Veale, 2016. Fracking Sarcasm Using Neural Networks.
* Schifanella et al., 2016. Detecting Sarcasm in Multimodal Social Platforms.
* Pennington et al., 2014. GloVe: Global Vectors for Word Representation.
* Riloff et al., 2013. Sarcasm as Contrast between a Positive Sentiment and Negative Situation.
* Baziotis et al., 2018. NTUA-SLP at SemEval-2018 Task 3: Tracking Ironic Tweets using Ensembles of Word and Character Level Attentive RNNs.
