# Social Representations and Boundaries of Humor: A Focus on Gender Roles

## Research questions: 

1) How are men and women depicted in New Yorker cartoons and captions, and do these depictions reflect traditional gender roles or stereotypes?

2) How does audience response (e.g., votes or winning captions) relate to gendered content—do captions about one gender receive more positive attention, and does this reinforce or challenge stereotypes?

## Structure: How do we answer these questions? (To be completed)

**Step 1:** Detect gendered references in sentences and assign each sentence a gender label (male, female, both, or neutral).

*Method*: I started from two existing lists of gendered words. To get broader coverage, I manually augmented them with universal gendered words and contextual gender markers, then added further words based on the vocabulary that actually appears in the dataset.
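The lexicon lookup described in Step 1 could be sketched roughly as follows. The word sets `MALE_TERMS` and `FEMALE_TERMS` and the function `label_gender` are hypothetical placeholders, not the lists or code actually used in this project, and tokenization here is a plain regular expression rather than spaCy.

```python
import re

# Hypothetical, deliberately tiny lexicons; the project's real lists are longer.
MALE_TERMS = {"he", "him", "his", "man", "men", "husband", "father", "boy"}
FEMALE_TERMS = {"she", "her", "hers", "woman", "women", "wife", "mother", "girl"}

def label_gender(sentence):
    """Return 'male', 'female', 'both', or 'neutral' for one sentence."""
    # Lowercase and keep alphabetic tokens (an apostrophe is allowed inside a word).
    tokens = set(re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower()))
    has_male = bool(tokens & MALE_TERMS)
    has_female = bool(tokens & FEMALE_TERMS)
    if has_male and has_female:
        return "both"
    if has_male:
        return "male"
    if has_female:
        return "female"
    return "neutral"

examples = [
    "He forgot his keys again.",
    "In the dream, she asks me to come up.",
    "My wife told her husband to leave.",
    "This has 'Alice in Wonderland' beat by a mile.",
]
labels = [label_gender(s) for s in examples]
print(labels)
```

A pure word-list lookup like this misses gendered proper names (the last example mentions "Alice" but is labelled neutral), which is one reason to augment the lists with contextual markers.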

## Initialisation of the root path

In [70]:
from pathlib import Path
import sys

def warning1(text): print("WARNING!!! ", text)
ACTIVATE_PRINTS = False

# Get correct root path
try:
    root = Path(__file__).resolve().parent
except NameError:
    root = Path.cwd()  # fallback for Jupyter notebooks

while root.parent != root:
    if all((root / marker).exists() for marker in [".git", "README.md", "results.ipynb"]):
        break
    root = root.parent

# Fallback in case nothing found
if not any((root / marker).exists() for marker in [".git", "README.md", "results.ipynb"]):
    print("Could not locate project root — defaulting to current working directory")
    root = Path.cwd()

if ACTIVATE_PRINTS: print(f"Root folder detected at: {root}")

# Ensure importability of the project
if str(root) not in sys.path:
    sys.path.insert(0, str(root))

print(root)

d:\GitHub\ada-2025-project-adacore42


## Imports

In [None]:
# working libraries
import os
import pickle
import csv

# basics
import pandas as pd
import numpy as np

# plots
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

# text processing libraries
import nltk
import spacy

## Loading the data



In [55]:
# Build the path from the detected project root instead of hard-coding an absolute path
stored_dataprep_pkl_path = root / "data" / "data_prepared.pkl"

with open(stored_dataprep_pkl_path, 'rb') as f:
    data = pickle.load(f)

In [56]:
# Extract the objects in the pickle

# dataA is a list of pandas DataFrames (or a similar object, such as a dictionary of DataFrames). Each element of the list holds a DataFrame with 7 columns and a variable number of rows.
dataA = data['dataA']
# dataC is a DataFrame of metadata for all cartoon contests.
dataC = data['dataC']
dataA_startID = data['dataA_startID']
dataA_endID = data['dataA_endID']
dataC_lastGoodID = data['dataC_lastGoodID']

In [82]:
def drop_NaN(dataA, dataC):
    """
    Find the contests with no metadata and drop them from dataA and dataC.
    Input: dataA, dataC
    Return: dataA_removed, dataC_removed
    """
    dataC_copy = dataC.copy(deep=True)

    # Find the rows where the metadata ('image_descriptions') is NaN
    NaN_in_rows = dataC_copy[dataC_copy['image_descriptions'].isna()].index
    # Remove them from dataC
    dataC_copy.dropna(subset=['image_descriptions'], inplace=True)
    # Remove the corresponding contests from dataA
    # (list positions are matched against index labels, so this assumes dataC has a default 0..n-1 index)
    dataA_removed = [x for i, x in enumerate(dataA) if i not in NaN_in_rows]

    return dataA_removed, dataC_copy

def get_contest_id(idx, dataC):
    """
    Find the contest id based on the index of one element of dataA
    Return: contest_id
    """
    contest_id = dataC.iloc[idx]['contest_id']
    return contest_id

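def plot_cartoon(contest_id, root):
    """
    Open the cartoon image of a contest in the default image viewer.
    Input: contest_id, root (project root folder)
    """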
    cartoon_path = os.path.join("data", "newyorker_caption_contest_virgin", "images", f"{contest_id}.jpg")
    path = os.path.join(root, cartoon_path)
    img = Image.open(path)
    img.show()

In [59]:
dataA_removed, dataC_removed = drop_NaN(dataA, dataC)

In [60]:
print(f"Length dataA: {len(dataA_removed)}\nShape dataC: {dataC_removed.shape}")

Length dataA: 240
Shape dataC: (240, 9)


In [105]:
cartoon_id = get_contest_id(144, dataC_removed)

In [107]:
plot_cartoon(cartoon_id, root)

In [108]:
dataA_removed[144].caption

rank
0       You should see the couch they have at the urol...
1       The last place I saw a couch like this also ch...
2                      What do uou mean MY oral fixation?
3       Are you going to help me or is this just lip s...
4                     Shouldn’t you be sitting in an ear?
                              ...                        
6223    My lips are sealed. Your secrets are safe with...
6224    In the dream, she asks me to come UP and see h...
6225    These Red State Town Halls are very stressful ...
6226    I apologize in advance ... it won't be nearly ...
6227    Your couch's shade of lipstick doesn't match y...
Name: caption, Length: 6228, dtype: object

## Step 0: Augment the gendered lists

## Step 1: Detect gender

Use gender lexicons, but we need to define them first. 

For P2, I used two small lists of common gendered terms. For P3, I want to extend them based on the terms that are actually used in the contest.
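One possible way to extend the lists from the contest data is to count how often each token occurs in the captions, then review by hand the frequent tokens that are not yet in the lexicon. The sketch below uses a small hypothetical seed set `seed_terms` and a toy list of captions; in this notebook the captions would instead come from the `caption` column of each DataFrame in `dataA`.

```python
import re
from collections import Counter

# Hypothetical seed lexicon and toy captions, for illustration only.
seed_terms = {"he", "she", "wife", "husband"}
captions = [
    "My wife says the couch is too loud.",
    "The queen and her husband disagree.",
    "He thinks the queen is bluffing.",
]

def token_counts(texts):
    """Count lowercase alphabetic tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

counts = token_counts(captions)

# Seed terms that actually occur in the data, with their frequencies.
seed_hits = {w: counts[w] for w in sorted(seed_terms) if counts[w] > 0}

# Frequent tokens not yet in the lexicon: candidates for manual review.
candidates = [(w, c) for w, c in counts.most_common()
              if w not in seed_terms and c >= 2]

print(seed_hits)
print(candidates)
```

The candidate list still contains function words such as "the", so it is only a starting point for manual selection; filtering stop words first would shorten the review.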

In [27]:
# load nlp from spacy
nlp = spacy.load("en_core_web_sm")

## Other code

In [21]:
dataA[0]['caption'][3901]

"This has 'Alice in Wonderland' beat by a mile."

In [22]:
example = dataA[0]['caption'][3901]
doc = nlp(example)

In [23]:
doc

This has 'Alice in Wonderland' beat by a mile.

In [24]:
tokens = [token.text for token in doc]
print(tokens)

['This', 'has', "'", 'Alice', 'in', 'Wonderland', "'", 'beat', 'by', 'a', 'mile', '.']


In [25]:
pos_tagged = [(token.text, token.pos_) for token in doc]
print(pos_tagged)

[('This', 'PRON'), ('has', 'VERB'), ("'", 'PUNCT'), ('Alice', 'PROPN'), ('in', 'ADP'), ('Wonderland', 'PROPN'), ("'", 'PUNCT'), ('beat', 'NOUN'), ('by', 'ADP'), ('a', 'DET'), ('mile', 'NOUN'), ('.', 'PUNCT')]


In [26]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Alice PERSON
Wonderland GPE
a mile QUANTITY
