# Social Representations and Boundaries of Humor: A Focus on Gender Roles

## The plan for this notebook

Here is a first attempt at a structure, ideas, and methodology for analysing gender roles and representation in the New Yorker caption contest. Remember, the idea is to start simple and then add complexity little by little.

0) Start with some basic plots and analysis:

    - How often do men/women appear in the cartoons?

        a. To recognize whether a man/woman is in the picture, the idea is to use the *image_descriptions* and *image_uncanny_descriptions* fields contained in the metadata. Some images don't have this metadata; this is a problem for later, when I'll need to either add a description to the images that lack one or find a model that detects men/women in a cartoon.

        b. When this is done, I can do several plots. The first is a bar plot of the gender distribution over all the cartoons. Then I can plot the evolution of the gender distribution over time, to see whether it is constant.
    
    - How often are men/women mentioned in the captions?

        I think it is interesting to link this with part 0.a. Are men mentioned more often when there is a man in the picture, and likewise for women? Are men mentioned in the caption even when there are no men in the picture, and likewise for women?
        
        c. To do this, I can search for mentions of men/women in the captions. How? I haven't worked it out yet; we probably need a list of gender-related words to match against the captions.

        d. Same as 0.b: plot the overall contest first, then the evolution over the years.

    **Suggestions from chat** (I need to look into them):

        - Add a simple “co-occurrence” heatmap, e.g., men in image × women in caption, women in image × men in caption. That shows whether humor around one gender depends on referencing the other. Easy to compute as a 2×2 table.

        - Normalize by total captions per year: divide the gender mention frequency by the total number of captions, so you can compare across years even if some years have more contests.

        - Gender-neutral cases: track “no gender mention/no gender in image” as a category; this helps show whether humor is becoming more or less gendered over time.
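As a starting point for steps 0.a to 0.d and the co-occurrence table, here is a minimal sketch that assumes a small hand-written lexicon is enough for a first pass. The word lists, the function names `gender_label` and `cooccurrence`, and the toy sentences are illustrative assumptions; none of them come from the dataset or from an existing library.

```python
import re
from collections import Counter

# Hypothetical starter lexicons; extend them after inspecting the real data.
MALE_TERMS = {"man", "men", "boy", "boys", "he", "him", "his",
              "husband", "father", "dad", "king"}
FEMALE_TERMS = {"woman", "women", "girl", "girls", "she", "her", "hers",
                "wife", "mother", "mom", "queen"}


def gender_label(text):
    """Return 'men', 'women', 'both' or 'none' for a piece of text."""
    # Lowercase and keep alphabetic runs only, so "woman's" yields "woman".
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    has_male = bool(tokens & MALE_TERMS)
    has_female = bool(tokens & FEMALE_TERMS)
    if has_male and has_female:
        return "both"
    if has_male:
        return "men"
    if has_female:
        return "women"
    return "none"


def cooccurrence(pairs):
    """Count (image label, caption label) combinations.

    `pairs` is an iterable of (image_description, caption) strings.
    """
    return Counter((gender_label(img), gender_label(cap)) for img, cap in pairs)


# Toy data for illustration only (not taken from the contest dataset).
toy_pairs = [
    ("A man and a woman are standing in a studio.", "My wife chose the rug."),
    ("Three business men are walking down a hall.", "He insisted on carpooling."),
    ("A dog is sitting at a desk.", "It has been a long week."),
]

table = cooccurrence(toy_pairs)
for (img_label, cap_label), count in sorted(table.items()):
    print(f"image={img_label:<6} caption={cap_label:<6} count={count}")
```

With the real data, `gender_label` could be applied to every image description and every caption (for example through `Series.apply` in pandas). The `('none', 'none')` cell then doubles as the gender-neutral category mentioned above.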

1) Dig a little deeper: how are men vs women depicted?

    - Caption analysis with word clouds (Andras already did a bit of this; try to reuse it).
        Here the idea is to find gendered terms (e.g., “wife,” “husband,” “boss,” “nurse”) and their co-occurrences, to see if ... ?

    - Role distribution Sankey diagram: flow from gender → depicted roles (domestic, professional, heroic, villainous).

    - Do captions reinforce stereotypes, and does the audience reward or punish them?

    **Suggestions from chat**:

    - Build a small gendered lexicon manually first (e.g., “wife, husband, mom, dad, boss, nurse, secretary, CEO”). Then count frequencies of those words. Later, you could extend it using a prebuilt list (like LIWC or GenderedWords from textdescriptives).

    - Simple role classification: you don’t need machine learning yet; just group words into themes:

        1. Domestic (kitchen, home, dinner)

        2. Workplace (office, boss, meeting)

        3. Heroic/Action (police, firefighter, soldier)

        Then make a Sankey plot linking gender → role.

    - Caption polarity: use sentiment analysis (e.g., TextBlob or VADER) to see if captions mentioning men vs women differ in sentiment. This is very simple to compute and could hint at bias (“are jokes about men more negative?”).

    - Audience response: if you have access to which captions won or were finalists, compare the proportion of gender-related captions among winners vs non-winners. This ties into the “does the audience reward stereotypes?” question.
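The keyword-based role classification and the gender → role flow could be prototyped as below. This is only a sketch: the dictionaries `GENDER_TERMS` and `ROLE_THEMES`, the helper names, and the toy captions are assumptions made for illustration, and a caption that matches several themes is counted once per theme.

```python
import re
from collections import Counter

# Hypothetical word lists, for illustration only.
GENDER_TERMS = {
    "men": {"man", "men", "he", "husband", "dad", "father"},
    "women": {"woman", "women", "she", "wife", "mom", "mother"},
}
ROLE_THEMES = {
    "domestic": {"kitchen", "home", "dinner"},
    "workplace": {"office", "boss", "meeting"},
    "heroic": {"police", "firefighter", "soldier"},
}


def matching_groups(tokens, groups):
    """Return the names of all groups that share at least one word with tokens."""
    return [name for name, words in groups.items() if tokens & words]


def sankey_links(captions):
    """Count gender -> role links over an iterable of caption strings."""
    links = Counter()
    for caption in captions:
        tokens = set(re.findall(r"[a-z]+", caption.lower()))
        for gender in matching_groups(tokens, GENDER_TERMS):
            for role in matching_groups(tokens, ROLE_THEMES):
                links[(gender, role)] += 1
    return links


# Toy captions, invented for this example.
toy_captions = [
    "My wife wants dinner at home tonight.",
    "He said the meeting with the boss ran late.",
    "She is the best firefighter in town.",
    "Nice weather today.",
]

links = sankey_links(toy_captions)
for (gender, role), value in sorted(links.items()):
    print(gender, "->", role, value)
```

Each `(gender, role)` key with its count is one link of the diagram. Sankey plotting tools generally expect such source/target/value triples; plotly's `go.Sankey`, for instance, takes them as index-based lists in its `link` argument.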

2) Coming soon...

    Easy next steps for later:

    - Word embeddings to see which words cluster around “man” vs “woman.”

    - Topic modeling filtered by gender mentions.

    - Temporal word shift (how associations change over the years).

    Bonus (quick, simple additions):

    - Timeline of first appearances: when did women start appearing more often? Is there a visible change after 2010 or so?

    - Visualization tip: use small multiples, with one panel per decade, showing proportions of gendered captions or image content.
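For the timeline and small-multiples ideas, the quantity to plot is a share per period rather than a raw count. The sketch below shows one way to compute it, assuming each contest has already been reduced to a `(date, is_gendered)` pair and that missing dates are represented as `None`. The function name `share_by_period` and the toy records are hypothetical.

```python
from collections import defaultdict
from datetime import date


def share_by_period(records, years_per_bin=1):
    """Return {period start year: share of records flagged as gendered}.

    `records` is an iterable of (contest_date, is_gendered) pairs.
    Records without a date (None) are skipped.
    Use years_per_bin=10 to aggregate by decade.
    """
    totals = defaultdict(int)
    gendered = defaultdict(int)
    for contest_date, is_gendered in records:
        if contest_date is None:
            continue
        period = contest_date.year // years_per_bin * years_per_bin
        totals[period] += 1
        gendered[period] += int(is_gendered)
    return {period: gendered[period] / totals[period] for period in sorted(totals)}


# Toy records, invented for this example.
toy_records = [
    (date(2016, 1, 4), True),
    (date(2016, 2, 8), True),
    (date(2016, 3, 7), False),
    (date(2017, 1, 9), False),
    (None, True),  # undated contest, ignored
]

shares = share_by_period(toy_records)
for year, share in shares.items():
    print(year, round(share, 2))
```

Dividing by the number of contests (or captions) in each period keeps years with more contests from dominating the curve. With pandas, rows whose date is `NaT` would first need to be filtered out, for example with `pandas.isna`.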

## Initialisation of the absolute GitHub repository path

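In [2]:
from pathlib import Path
import sys

# Start from this file's folder when run as a script, else from the working directory.
root = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()

# Walk up until a folder holding all repository markers is found;
# if none is found, the loop stops at the filesystem root.
while root.parent != root:
    is_repo_root = (
        (root / ".git").exists()
        and (root / "README.txt").exists()
        and (root / "results.ipynb").exists()
    )
    if is_repo_root:
        break
    root = root.parent

# Make the repository importable (needed for the `src` imports below).
if str(root) not in sys.path:
    sys.path.insert(0, str(root))

print("Root folder at: ", root)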

Root folder at:  d:\GitHub\ada-2025-project-adacore42


## Imports

In [3]:
# utils
from src.utils.general_utils import *

# paths
from src.utils.paths import *

# working libraries
import os
import pickle

## Loading of the preprocessed data pickle files

Use this when the data is stored in the right place.

In [7]:
# stored_dataprep_pkl_path = root / STORED_DATAPREP_PKL_PATH

# # Load the pickle file
# with open(stored_dataprep_pkl_path, "rb") as f:
#     data = pickle.load(f)

# # Extract the objects stored in the pickle

# # dataA is a list of pandas DataFrames (or a similar object, such as a dictionary of DataFrames). Each element of the list is a DataFrame with 7 columns and a variable number of rows.
# dataA = data['dataA']
# # dataC is a DataFrame of metadata for all the cartoon contests.
# dataC = data['dataC']
# dataA_startID = data['dataA_startID']
# dataA_endID = data['dataA_endID']
# dataC_lastGoodID = data['dataC_lastGoodID']


Loading the data from Andras.

In [10]:
stored_dataprep_pkl_path = r'D:\EPFL\MA3\Applied Data Analysis\Project\cleaned_data_prepared 1.pkl'

with open(stored_dataprep_pkl_path, 'rb') as f:
    data = pickle.load(f)

In [12]:
# Extract the objects in the pickle

# dataA is a list of pandas DataFrames (or a similar object, such as a dictionary of DataFrames). Each element of the list is a DataFrame with 7 columns and a variable number of rows.
dataA = data['dataA']
# dataC is a DataFrame of metadata for all the cartoon contests.
dataC = data['dataC']
dataA_startID = data['dataA_startID']
dataA_endID = data['dataA_endID']
dataC_lastGoodID = data['dataC_lastGoodID']

In [14]:
dataA[42].head()

| rank | caption | mean | precision | votes | not_funny | somewhat_funny | funny | cleaned_caption |
|---|---|---|---|---|---|---|---|---|
| 0 | You were with Ringling Brothers? I was with Le... | 1.982092 | 0.010177 | 6645 | 2347 | 2070 | 2228 | ringing brother german brother |
| 1 | Well, it suits you better than the president c... | 1.931655 | 0.039002 | 556 | 255 | 84 | 217 | well suit better president costume |
| 2 | Sure, it's all fun and games. Until one of you... | 1.916803 | 0.014894 | 3053 | 1171 | 965 | 917 | sure fun game one get elected |
| 3 | Hey, all I know is that they left in a very sm... | 1.908179 | 0.017256 | 1993 | 691 | 794 | 508 | hey know left small car |
| 4 | Would you like that straight up, on the rocks,... | 1.90575 | 0.0149 | 2939 | 1110 | 996 | 833 | would like straight rock sprayed directly face |
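Since each `dataA` frame is indexed by `rank`, the "does the audience reward gendered captions?" question can be approximated by comparing the top-ranked captions with the rest. The sketch below assumes the captions are already ordered from best to worst rank. The lexicon, the function names, the cut-off `k`, and the toy captions are all illustrative assumptions.

```python
import re

# Hypothetical starter lexicon; extend it after inspecting the real captions.
GENDERED_TERMS = {"man", "men", "he", "him", "his", "husband",
                  "woman", "women", "she", "her", "wife"}


def is_gendered(caption):
    """True if the caption contains at least one word from the lexicon."""
    tokens = set(re.findall(r"[a-z]+", caption.lower()))
    return bool(tokens & GENDERED_TERMS)


def gendered_share(captions):
    """Fraction of captions that mention a gendered term (NaN if empty)."""
    captions = list(captions)
    if not captions:
        return float("nan")
    return sum(is_gendered(c) for c in captions) / len(captions)


def top_vs_rest(ranked_captions, k):
    """Compare the top-k captions with all remaining ones.

    `ranked_captions` must be ordered from best to worst rank.
    """
    return gendered_share(ranked_captions[:k]), gendered_share(ranked_captions[k:])


# Toy captions, invented for this example and listed best rank first.
toy_ranked = [
    "He parked the tiny car outside.",
    "Shaken or stirred?",
    "My wife picked the costume.",
    "Fun and games until Monday.",
    "Nice hat.",
]

top_share, rest_share = top_vs_rest(toy_ranked, k=2)
print(f"top-2 share: {top_share:.2f}, rest share: {rest_share:.2f}")
```

On the real data this could be called per contest, for example `top_vs_rest(dataA[42]["caption"].tolist(), k=10)`, assuming the frame is sorted by rank as the preview suggests. A difference between the two shares would only be a first hint and would still need a proper significance test.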


In [15]:
dataC.head()

|  | num_captions | num_votes | image_locations | image_descriptions | image_uncanny_descriptions | entities | questions | date | cleaned_image_locations | cleaned_questions | cleaned_image_uncanny_descriptions | cleaned_image_descriptions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3905.0 | 41185.0 | [the street] | [A man is relaxing on a city street. Others ar... | [A man is just laying in the middle of the sid... | [https://en.wikipedia.org/wiki/Bystander_effec... | [Why is he laying there?] | NaT | street | laying | man laying middle sidewalk | man relaxing city street others going business... |
| 1 | 3325.0 | 28205.0 | [the front hard, a residential walkway] | [A man in a winter coat and cap is looking at ... | [It's unusual to see someone holding a snow sh... | [https://en.wikipedia.org/wiki/Snowball_fight,... | [Is the man overly small or the shovel overly ... | NaT | front hard residential halfway | man overlay small shovel overlay big boy huge ... | unusual see someone holding snow shovel way ma... | man winter coat cap looking small bearded man ... |
| 2 | 4399.0 | 21574.0 | [yoga place, a yoga studio] | [A man and woman are standing facing one anoth... | [Nothing is really out of place in this image.... | [https://en.wikipedia.org/wiki/Rug, https://en... | [Why is the man carrying a huge rug?, Why is t... | 2016-03-21 | place studio | man carrying huge rug man trying use living ro... | nothing really place image man huge rug big st... | man woman standing facing one another mirror i... |
| 3 | 4141.0 | 16894.0 | [a workplace, an elevator] | [Three business men are walking down a hall. T... | [A suit case is usually carried by one person ... | [https://en.wikipedia.org/wiki/Worker_cooperat... | [Why is the briefcase big enough for three peo... | 2016-03-27 | workplace elevator | briefcase big enough three people carrying car... | suit case usually carried one person three sup... | three business men walking hall carrying brief... |
| 4 | 3951.0 | 95790.0 | [plains] | [Some cowboys are riding through the desert. T... | [There are rocking horses in place of real hor... | [https://en.wikipedia.org/wiki/Rocking_horse, ... | [Why is this chase taking place?] | 2016-04-03 | plain | chase taking place | rocking horse place real horse | cowboy riding desert rocking horse |


## 0. Basic plots and analysis