# 📊 Data Exploration - Hateful Memes Dataset

This notebook explores the Facebook AI Hateful Memes Challenge dataset.

**Contents:**
1. Load Dataset
2. Class Distribution
3. Text Analysis
4. Sample Visualization
5. Key Insights

In [None]:
# Setup
import json
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from collections import Counter
from wordcloud import WordCloud

# Configuration
DATA_PATH = '../data/hateful_memes'
plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

## 1. Load Dataset

In [None]:
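def load_jsonl(filepath):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    data = []
    # Explicit encoding so meme text decodes the same way on every platform
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate empty or trailing blank lines
                data.append(json.loads(line))
    return data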

# Load all splits
train_data = load_jsonl(f'{DATA_PATH}/train.jsonl')
dev_data = load_jsonl(f'{DATA_PATH}/dev.jsonl')
test_data = load_jsonl(f'{DATA_PATH}/test.jsonl')

print(f"Train: {len(train_data):,} samples")
print(f"Dev: {len(dev_data):,} samples")
print(f"Test: {len(test_data):,} samples")
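
Before converting the splits to DataFrames, it can help to confirm that every record carries the fields the later cells read (`img`, `text`, `label`). The sketch below is a minimal check under that assumption; `count_missing_fields` and the toy records are hypothetical names and data used only for illustration, not part of the dataset or of any library.

```python
from collections import Counter


def count_missing_fields(records, expected=("img", "text", "label")):
    """Count, per expected field, how many records do not contain it."""
    missing = Counter()
    for record in records:
        for field in expected:
            if field not in record:
                missing[field] += 1
    return dict(missing)


# Toy records for illustration only (not real dataset entries)
toy_records = [
    {"img": "img/a.png", "text": "example caption", "label": 0},
    {"img": "img/b.png", "text": "another caption"},  # no label, as a held-out split might be
]

print(count_missing_fields(toy_records))
```

An empty result means every record has all expected fields. A split that reports only `label` as missing is probably an unlabeled held-out split rather than a corrupted file.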

In [None]:
# Convert to DataFrames
train_df = pd.DataFrame(train_data)
dev_df = pd.DataFrame(dev_data)
test_df = pd.DataFrame(test_data)

print("Sample entry:")
train_df.head()

## 2. Class Distribution

In [None]:
# Class distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Training set
# sort_index() orders the counts as [0, 1] so they line up with the labels below;
# value_counts() alone sorts by frequency, which can swap the slices.
train_counts = train_df['label'].value_counts().sort_index()
axes[0].pie(train_counts, labels=['Not Hateful', 'Hateful'],
            autopct='%1.1f%%', colors=['#2ecc71', '#e74c3c'])
axes[0].set_title('Training Set Distribution')

# Dev set
dev_counts = dev_df['label'].value_counts().sort_index()
axes[1].pie(dev_counts, labels=['Not Hateful', 'Hateful'],
            autopct='%1.1f%%', colors=['#2ecc71', '#e74c3c'])
axes[1].set_title('Dev Set Distribution')

plt.tight_layout()
plt.show()

print(f"\nClass Imbalance Ratio: {train_counts[0]/train_counts[1]:.2f}:1")
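
The imbalance ratio can be turned into per-class weights for a loss function. The sketch below uses the inverse-frequency heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn documents for `class_weight='balanced'`. The function name `inverse_frequency_weights` and the toy label list are assumptions for illustration; the 64/36 split only mimics the proportions reported in this notebook.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels mimicking a 64% / 36% split (illustrative, not the real data)
toy_labels = [0] * 64 + [1] * 36
weights = inverse_frequency_weights(toy_labels)

for cls, weight in sorted(weights.items()):
    print(f"class {cls}: weight {weight:.3f}")
```

The minority class receives the larger weight, so its errors contribute more to a weighted loss. On the real data, the same function could be called on `train_df['label'].tolist()`.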

## 3. Text Analysis

In [None]:
# Text length statistics
train_df['text_length'] = train_df['text'].str.len()
train_df['word_count'] = train_df['text'].str.split().str.len()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Text length distribution
for label in [0, 1]:
    subset = train_df[train_df['label'] == label]
    label_name = 'Hateful' if label == 1 else 'Not Hateful'
    axes[0].hist(subset['text_length'], alpha=0.6, bins=30, label=label_name)

axes[0].set_xlabel('Text Length (characters)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Text Length Distribution')
axes[0].legend()

# Word count distribution
train_df.boxplot(column='word_count', by='label', ax=axes[1])
axes[1].set_xlabel('Label (0=Not Hateful, 1=Hateful)')
axes[1].set_ylabel('Word Count')
axes[1].set_title('Word Count by Class')

plt.suptitle('')
plt.tight_layout()
plt.show()

In [None]:
# Word clouds
hateful_text = ' '.join(train_df[train_df['label'] == 1]['text'].str.lower())
not_hateful_text = ' '.join(train_df[train_df['label'] == 0]['text'].str.lower())

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

wc_hateful = WordCloud(width=800, height=400, background_color='white',
                       colormap='Reds').generate(hateful_text)
axes[0].imshow(wc_hateful, interpolation='bilinear')
axes[0].axis('off')
axes[0].set_title('Hateful Memes - Word Cloud', fontsize=14, color='red')

wc_not_hateful = WordCloud(width=800, height=400, background_color='white',
                           colormap='Greens').generate(not_hateful_text)
axes[1].imshow(wc_not_hateful, interpolation='bilinear')
axes[1].axis('off')
axes[1].set_title('Not Hateful Memes - Word Cloud', fontsize=14, color='green')

plt.tight_layout()
plt.show()
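
Word clouds are hard to compare by eye, so a frequency count can make the overlap between classes explicit. The sketch below is one possible approach using `collections.Counter` with plain whitespace tokenization; `top_words`, `shared_top_words`, and the toy captions are hypothetical and only illustrate the idea.

```python
from collections import Counter


def top_words(texts, n=5, stopwords=frozenset()):
    """Return the n most frequent lowercase tokens across all texts."""
    counts = Counter(
        token
        for text in texts
        for token in text.lower().split()
        if token not in stopwords
    )
    return [word for word, _ in counts.most_common(n)]


def shared_top_words(texts_a, texts_b, n=5):
    """Return, sorted, the words that appear in the top-n list of both groups."""
    return sorted(set(top_words(texts_a, n)) & set(top_words(texts_b, n)))


# Toy captions for illustration only
group_a = ["the cat sat", "the cat ran", "the dog"]
group_b = ["the dog barked", "the dog slept", "a cat"]

print(top_words(group_a, n=2))
print(shared_top_words(group_a, group_b, n=2))
```

On the real data, the two groups would be the `text` column filtered by `label`. Passing a stopword set keeps function words from dominating the overlap.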

## 4. Sample Visualization

In [None]:
# Visualize sample memes
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

# Get samples
hateful = train_df[train_df['label'] == 1].sample(4, random_state=42)
not_hateful = train_df[train_df['label'] == 0].sample(4, random_state=42)

samples = pd.concat([not_hateful, hateful])
titles = ['NOT HATEFUL'] * 4 + ['HATEFUL'] * 4
colors = ['green'] * 4 + ['red'] * 4

for ax, (_, row), title, color in zip(axes, samples.iterrows(), titles, colors):
    img_path = f"{DATA_PATH}/{row['img']}"
    try:
        img = Image.open(img_path)
        ax.imshow(img)
        text = row['text'][:40] + '...' if len(row['text']) > 40 else row['text']
        ax.set_title(f"{title}\n\"{text}\"", fontsize=9, color=color)
    except OSError:  # missing or unreadable image file
        ax.text(0.5, 0.5, 'Image not found', ha='center', va='center')
    ax.axis('off')

plt.tight_layout()
plt.show()

## 5. Key Insights

**Findings:**
1. The training set is imbalanced (64% not hateful, 36% hateful)
2. Hateful memes tend to have slightly longer text
3. The challenge requires understanding image-text interaction
4. Similar words appear in both classes - context matters

**Implications for modeling:**
- Use Focal Loss or class weighting for imbalance
- Need multimodal fusion, not just separate encoders
- Cross-attention can capture image-text relationships
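
To make the Focal Loss suggestion concrete, here is a minimal pure-Python sketch of the binary focal loss for a single example, `-alpha_t * (1 - p_t)**gamma * log(p_t)`. The function name `binary_focal_loss` is hypothetical, and the `gamma=2.0` and `alpha=0.25` defaults are commonly used starting values, not values tuned for this dataset. A training pipeline would use a vectorized version in its deep learning framework.

```python
import math


def binary_focal_loss(prob_positive, label, gamma=2.0, alpha=0.25):
    """Focal loss for one example, given the predicted probability of class 1."""
    eps = 1e-12
    p = min(max(prob_positive, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if label == 1 else 1.0 - p           # probability assigned to the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)**gamma shrinks the loss of examples that are already well classified
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, label=1)  # confident and correct
hard = binary_focal_loss(0.1, label=1)  # confident and wrong
print(f"easy example: {easy:.5f}, hard example: {hard:.5f}")
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is one way to see how focal loss relates to plain class weighting.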