# <center> Punchlines as Mirrors: Social Attitudes, Politics, and Biases in the *The New Yorker* Caption Contest

Humor reflects society’s views, stereotypes, and political climate. The New Yorker Caption Contest offers a unique lens into this process, showing what people find acceptable, absurd, or taboo.

## <center> Narrative Flow
- **Introduction:** The Caption Contest as a cultural mirror — humor as social data.
- **Axis 1:** Professions & politics → humor about authority and power, *“What are people laughing about?”*
- **Axis 2:** Humor in time → historical & contextual dimensions, *“When and why do jokes resonate?”*
- **Axis 3:** Social norms → gender roles & taboos, testing the limits of humor, *“What’s acceptable or not?”*
- **Axis 4:** Biases → explain psychological and cultural mechanisms behind why we laugh, *“Why do we find it funny?”*
- **Conclusion:** Humor not only entertains — it reveals evolving attitudes, biases, and the cultural pulse of society.

> **Idea for website:** Each section should begin with a set of cartoons from the contest to immerse the viewer in humor before moving to analysis.

---

## <center> Axes of Research

### <center> 1. Professions, Politics, and Power

- **Professions in Humor:** Which jobs are depicted most often? Which are ridiculed vs. admired? What stereotypes recur (e.g., lawyers as tricksters, doctors as saviors)?
- **Politics in Humor:** Do captions reflect partisan leanings (Democrat vs. Republican) or mock political figures more broadly? Are political jokes rated differently?
- **Interplay:** Professions like politicians or lawyers sit at the crossroads of both — this axis highlights how authority and social roles are viewed through humor.

**Plots / Statistics:**
- Bar / Word Clouds: Frequency of professions mentioned in captions (“doctor,” “lawyer,” “politician”).
- Histograms / Line Plots: Frequency of professions across time.
- Grouped Bar Charts: Average funniness scores by profession category (healthcare, law, politics, education, etc.).
- Heatmaps: Cross-tab professions × sentiment (positive/negative/neutral).
- Cartoon + Caption Samples: A few annotated cartoons showing how professions are ridiculed.

**For Politics:**
- Timeline of mentions of political figures/parties.
- Sentiment distribution around Democrats vs. Republicans.
- Example “political joke clusters” side by side with major events (e.g., elections).

**Statistical Tests & Models:**
- t-tests / z-tests → Compare funniness scores of politicians vs. other professions.
- Multiple hypothesis testing (FDR/BH) → Control for comparisons across 30+ job categories.
- Network graphs → Co-occurrence of profession keywords with stereotypes (“lawyer–money,” “doctor–death”).
- Linear regression / lmplot → Test if political humor ratings rise around elections.
- Pearsonr / Spearmanr → Correlation between real-world political cycles and joke frequency.


In [1]:
import sys
print(sys.executable)

c:\Users\andra\OneDrive\Desktop\MA1_2025-2026\Applied_data_analysis\project\ada-2025-project-adacore42\_Other\andras_analysis\venv\Scripts\python.exe


In [2]:
#Loading packages (hopefully installed, all is correct version and whatnot)

# Data manipulation
import numpy as np
import pandas as pd
import pickle

# Statistical analysis
import scipy.stats as stats

# Language processing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import spacy
import textblob as TextBlob
import contractions
import string
from collections import Counter
from nltk.corpus import wordnet as wn

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

nltk.download('punkt')       # Tokeniser
nltk.download('stopwords')   # Stopwords list
nltk.download('wordnet')     # Lemmatiser
nlp = spacy.load('en_core_web_sm')

stop_words = set(stopwords.words('english')) # Initialise stopwords
lemmatizer = WordNetLemmatizer() # Initialise lemmatiser

[nltk_data] Downloading package punkt to C:\Users\andra/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andra/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\andra/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


---
# <center> Preparing the Data

In this section, the code will preprocess the text of the captions and create a tokenized column suitable for analysis. The preprocessing steps include:

- Converting all text to **lower-case**  
- Removing **stopwords**  
- Eliminating **punctuation** such as dots and commas  
- **Expanding contractions**, e.g., “don’t” → “do not”, “it’s” → “it is”  
- **Correcting typos** to standardize common misspellings (optional but recommended for cleaner analysis)  
- **Removing very short tokens** (e.g., single letters or extremely short words)  
- **Lemmatizing words** to reduce them to their base forms, e.g., “running” → “run”, “better” → “good”  

These steps will prepare the captions for downstream analyses, such as frequency counts, word clouds, sentiment analysis, and extraction of professions or topics from the text.

I will only run this cell once, and save the outcome data in a new file, still within my folder here for the time being. For future work, there will be no need to do this work again. Then, I think this data should be added to the datapreparation step, as I am not doing anything fundamentally bad. I am creating new columns in the dataframes, so only the data becomes larger.


The code is in a __text__ file, it is not necessary to see here. the function to tokenise is included below.

In [3]:
def preprocess_text_list(entry, min_len=2):
    """Preprocess a list of text entries or a single string."""
    if isinstance(entry, list):
        text = " ".join(entry)
    elif isinstance(entry, str):
        text = entry
    else:
        return ""

    # Lowercase
    text = text.lower()

    # Expand contractions
    text = contractions.fix(text)

    # Typo correction
    text = str(TextBlob(text).correct())

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords and short tokens
    tokens = [word for word in tokens if word not in stop_words and len(word) >= min_len]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)

In [4]:
# Load clean data
fullldata = '../../data/cleaned_data_prepared.pkl'
with open(fullldata, "rb") as f:
    cleaned_stored_data = pickle.load(f)
print("Cleaned data loaded successfully.")
dataA_cleaned = cleaned_stored_data["dataA"]
dataC_cleaned = cleaned_stored_data["dataC"]
dataA_startID = cleaned_stored_data["dataA_startID"]
dataA_endID = cleaned_stored_data["dataA_endID"]
dataC_lastGoodID = cleaned_stored_data["dataC_lastGoodID"]


Cleaned data loaded successfully.


---
# <center> Professions in Humor

In this section, we will focus on how different professions are depicted in *The New Yorker* Caption Contest captions. Humor often reflects societal attitudes toward authority, expertise, and social roles, and professions provide a lens into these perceptions.  

## <center> Key Points
- **Frequency of depiction:** Which jobs appear most often in captions?  
- **Stereotypes:** How are certain professions portrayed — are they admired, ridiculed, or caricatured?  
  - Example stereotypes: lawyers as tricksters, doctors as saviors.  
- **Interplay with politics:** Some professions, like politicians or lawyers, intersect with both professional and political commentary, highlighting how authority and social power are perceived.  

## <center> Analytical Approach
To study professions in humor, we will:
- Count the number of times each profession is mentioned across all captions.  
- Visualize the distribution with **bar charts** or **word clouds**.  
- Examine sentiment associated with professions using **heatmaps**.  
- Compare average “funniness” scores by profession category to see which roles tend to be funnier.  
- Annotate examples of cartoons and captions to illustrate recurring jokes and stereotypes.

> This analysis will help us answer the question: *“What are people laughing about when it comes to professions?”*


We are only dealing with nouns when depicting jobs, so, as a first step, we need to extract all nouns from our captions. This will essentially reduce the size of the dataset and save us some more time. To do this, I will use the nltk package.

In [5]:
dataA_cleaned0 = dataA_cleaned.copy()
dataC_cleaned0 = dataC_cleaned.copy()


In [6]:
dataA_cleaned0[0].loc[0, 'cleaned_caption'] = 'congressmen obstruction job'

To extract nouns, I used the following function. I removed the actual code which was used to run it and save it as it takes really long to run and I dont want to accidentally start it.

In [7]:

def extract_nouns(text):
    # Ensure the input is a string, not a list
    if not isinstance(text, str):
        text = " ".join(text)
    doc = nlp(text.lower())
    return [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]

In [8]:
# load the new pickle file to verify
noun_datafile = '../../data/cleaned_data_nouns.pkl'
with open(noun_datafile, "rb") as f:
    noun_stored_data = pickle.load(f)

# Verify the contents
print("Noun-extracted data loaded successfully.")
dataA1 = noun_stored_data["dataA_nouns"]
dataC1 = noun_stored_data["dataC_nouns"]
dataA_startID1 = noun_stored_data["dataA_startID"]
dataA_endID1 = noun_stored_data["dataA_endID"]
dataC_lastGoodID1 = noun_stored_data["dataC_lastGoodID"]


Noun-extracted data loaded successfully.


In [9]:
print(dataA1[0].head())

                                                caption      mean  precision  \
rank                                                                           
0             I'm a congressman--obstruction is my job.  1.913043   0.094022   
1     I'm what they mean when they say, 'The middle ...  1.842105   0.191381   
2                     Does this suit make me look flat?  1.711111   0.112915   
3       When the right woman comes along, I'll know it.  1.625000   0.116657   
4     I used to lie in the gutter, but then I quit d...  1.617647   0.133610   

      votes  not_funny  somewhat_funny  funny  \
rank                                            
0        69         24              27     18   
1        19          8               6      5   
2        45         21              16      8   
3        32         15              14      3   
4        34         19               9      6   

                            cleaned_caption                   captions_nouns  
rank            

Now that I have extracted all the nouns from the tokenised captions, I can think about how to count occupations. This should in theory bring me closer to solving the problem. At first, I will work with the first contest only, and see if it can be generalised further.

In [37]:
df = dataA1[0]

occupations = pd.read_excel('all_data_M_2024.xlsx', usecols=['OCC_CODE', 'OCC_TITLE'])
len(occupations)
#save into a csv file
occupations.to_csv('occupations.csv', index=False)


In [None]:
#load csv
occupations = pd.read_csv('occupations.csv')
occupations = occupations.drop_duplicates(subset=['OCC_TITLE']).reset_index(drop=True)
#if "and" in occupations['OCC_TITLE'], split into two separate rows, keeping the OCC_CODE the same
# eg General and Operations Managers -> General Managers, Operations Managers
new_rows = []
for _, row in occupations.iterrows():
    


Unnamed: 0,OCC_CODE,OCC_TITLE
0,00-0000,All Occupations
1,11-0000,Management Occupations
2,11-1000,Top Executives
3,11-1010,Chief Executives
4,11-1020,General
...,...,...
1827,53-7080,Recyclable Material Collectors
1828,53-7120,"Tank Car, Truck,"
1829,53-7120,Ship Loaders
1830,53-7190,Miscellaneous Material Moving Workers


In [53]:
#read new dataset
eudataset = pd.read_csv('occupations_en.csv')
euoccupations = eudataset['preferredLabel']

#locate ceo
ceo_indices = euoccupations[euoccupations.str.contains(r'\bchief executive officer\b', case=False, na=False)].index
print(ceo_indices)
print(euoccupations[ceo_indices])

Index([1335], dtype='int64')
1335    chief executive officer
Name: preferredLabel, dtype: object


To count the occurrences of professions in the captions, we will use the 2018 U.S. Census occupation data as a reference.  
This dataset provides a comprehensive list of job titles and their frequencies.  

- The Census occupation indexes can be found [here](https://www.census.gov/topics/employment/industry-occupation/guidance/indexes.html).  
- The explanation of the SOC (Standard Occupational Classification) codes is available [here](https://www.bls.gov/soc/2018/major_groups.htm).

The problem with this approach is that occupations occur in their _colloquial_ form and not in their full _official_ title. This will make using the occupation indexes way too difficult. We must find a way to take the census data, and group it into smaller, colloquial terms (for example the occupation of midwife nurse from the census data should simply be nurse or midwife). The following approach is taken:

- Clean the census data by lower casing, removing trailing spaces, taking away special characters like brackets and hyphens.
- Some jobs are "complicated title" See "simpler title" -> lets cut all such instances as they are essentially the same as the simpler titles
- There are some occupations of the form "Analyst\ specified type See type of analyst" and "Clerk\any other specified   Code by duties" etc. I want to remove these and make them simpler
- It can be seen that some titles have entries like "CFO (Chief Financial Officer)" -> create a new column with alternative name, and delete from first column

In [None]:

# Cleaning census data

# Data
census_loc = 'Alphabetical-Index-of-Occupations-December-2019_Final.xlsx'
occupations = pd.read_excel(census_loc, skiprows=6)
occupations.columns = ['occupation_name', 'industry_restriction', 'occupation_code', 'SOC_code']


# See point 2 above
def filter_complicated_titles(df):

    pattern = r'.+\sSee\s+"[^"]+"'  # any text followed by 'See "..."'
    mask = df['occupation_name'].str.contains(pattern, na=False, case=False, regex=True)
    filtered_df = df[~mask].reset_index(drop=True)
    return filtered_df

# See point 4 above
def extract_bracketed(text):

    # Extract bracketed text
    match = re.search(r"\[([^\]]+)\]", text) # Try to match square brackets first
    if not match:
        match = re.search(r"\(([^\)]+)\)", text) # If none, try round parentheses
    
    if match:
        alternative_name = match.group(1)
    else:
        alternative_name = None
    # Remove bracketed text from original
    cleaned_text = re.sub(r"\[.*?\]|\(.*?\)", "", text).strip()
    return cleaned_text, alternative_name

# See point 1 above
def clean_occupation(text):
    text = str(text).lower()
    text = re.sub(r"[\[\]\(\)\-/,]", " ", text)  # remove brackets, hyphens, slashes, commas
    text = re.sub(r"\s+", " ", text)  # collapse multiple spaces
    return text.strip()


# See point 3 above
def simplify_occupation(text):
    text = str(text).lower()  # lowercase
    # Patterns to cut off extra explanations
    cut_patterns = [
        r"\\.*",            # everything after backslash
        r"see.*",           # everything after 'see'
        r"code by.*",       # everything after 'code by'
        r"specified.*",     # everything after 'specified'
        r"as ns.*",         # everything after 'as ns'
        r"any other.*",     # everything after 'any other'
        r"\/.*"              # everything after forward slash
    ]
    
    for pattern in cut_patterns:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE)
    
    text = re.sub(r"\s+", " ", text)  # collapse multiple spaces
    return text.strip()

occupations = filter_complicated_titles(occupations)
occupations[['occupation_clean', 'alternative_name']] = occupations['occupation_name'].apply(lambda x: pd.Series(extract_bracketed(x)))

occupations['occupation_clean'] = occupations['occupation_clean'].apply(clean_occupation)
occupations['occupation_clean'] = occupations['occupation_clean'].apply(simplify_occupation)
occupations_unique = occupations.drop_duplicates(subset='occupation_clean', keep='first').reset_index(drop=True) # Keep only unique cleaned occupations

# Removing entries which are alternative names and main names too
alt_matches = set(occupations['alternative_name'].dropna())
occupations = occupations[~occupations['occupation_clean'].isin(alt_matches)].reset_index(drop=True)

KeyboardInterrupt: 

The next step is bulkier: We want to make the occupations into colloquial forms "midwife nurse" or "radiologist nurse" should be both "nurse". For this, we need to use the _Spacy_ dataset and maybe _nltk_.

The code below will break the entries of the census data into nouns, then counts words which occur often. This is done because, for example, there are lots of types of nurses, but in a joke, someone will never make a joke about a complicated title - only about a nurse. Or, even if it is a complicated title that poeple are joking about, it will be counted as an occurence of the more general field. This will allow us to not nitpick every job, only the somewhat wider categories

In [None]:
#This is chatted - see how it works
def extract_core_nouns(text):
    """
    Extract nouns (or proper nouns) from occupation title.
    """
    doc = nlp(text.lower())
    nouns = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
    return nouns

all_nouns = []
for occ in occupations['occupation_clean']:
    all_nouns.extend(extract_core_nouns(occ))

noun_freq = Counter(all_nouns)
print(noun_freq.most_common(50))


[('operator', 3738), ('machine', 1752), ('supervisor', 1111), ('teacher', 783), ('worker', 756), ('clerk', 683), ('maker', 608), ('helper', 603), ('manager', 584), ('inspector', 504), ('tender', 466), ('sales', 447), ('engineer', 443), ('technician', 391), ('cutter', 386), ('installer', 338), ('attendant', 321), ('director', 320), ('hand', 305), ('driver', 299), ('service', 279), ('repairer', 265), ('equipment', 257), ('exc', 244), ('tester', 241), ('apprentice', 234), ('specialist', 230), ('car', 230), ('setter', 225), ('assembler', 220), ('press', 218), ('man', 215), ('agent', 212), ('mechanic', 205), ('officer', 195), ('aide', 186), ('analyst', 181), ('metal', 181), ('assistant', 180), ('control', 168), ('plant', 167), ('room', 164), ('mixer', 157), ('grinder', 155), ('maintenance', 151), ('builder', 149), ('health', 148), ('checker', 148), ('cleaner', 143), ('counselor', 139)]


In [None]:
# print all occurences of worker
for occ in occupations['occupation_clean']:
    if 'health' in occ.lower():
        print(occ)

unique_occupations = occupations['occupation_clean']
# Make all operators into "operator"
occupations.loc[unique_occupations.str.contains('operator', case=False), 'occupation_clean'] = 'operator' # Check conceptually... maybe ask the assitant
# Make all cutters into "cutter"
occupations.loc[unique_occupations.str.contains('cutter', case=False), 'occupation_clean'] = 'cutter'
# Make all cleaners into "cleaner"
occupations.loc[unique_occupations.str.contains('cleaner', case=False), 'occupation_clean'] = 'cleaner'
# Make all drivers into "driver"
occupations.loc[unique_occupations.str.contains('driver', case=False), 'occupation_clean'] = 'driver'
# Make all inspectors into "inspector"
occupations.loc[unique_occupations.str.contains('inspector', case=False), 'occupation_clean'] = 'inspector'
# Make all technicians into "technician"
occupations.loc[unique_occupations.str.contains('technician', case=False), 'occupation_clean'] = 'technician'
# Make all sales into "sales"
occupations.loc[unique_occupations.str.contains('sales', case=False), 'occupation_clean'] = 'sales' # What about in between director of sales???
# Make all counselors into "counselor"
occupations.loc[unique_occupations.str.contains('counselor', case=False), 'occupation_clean'] = 'counselor'
#Make all analyst into "analyst"
occupations.loc[unique_occupations.str.contains('analyst', case=False), 'occupation_clean'] = 'analyst'
#make all teachers into "teacher"
occupations.loc[unique_occupations.str.contains('teacher', case=False), 'occupation_clean'] = 'teacher'
#make all clerks into "clerk"
occupations.loc[unique_occupations.str.contains('clerk', case=False), 'occupation_clean'] = 'clerk'
#make all nurses into "nurse"
occupations.loc[unique_occupations.str.contains('nurse', case=False), 'occupation_clean'] = 'nurse'

#remove duplicates from unique occupations
occupations_unique = occupations.drop_duplicates(subset='occupation_clean', keep='first').reset_index(drop=True) # Keep only unique cleaned occupations
print(occupations_unique)

health commissioner
director of health education
health education director
director of health services
health administrator
health care administrator
health director
health information services manager
manager medicine and health service
mental health program manager
public health administrator
health insurance adjuster
health program analyst
health program specialist
health systems analyst exc computer
health systems analyst computer
health actuary
engineer public health
microbiologist public health
public health microbiologist
health physicist radiation control
health physicist
health environmentalist
health psychologist
public health policy analyst
technician biological exc health
public health sanitarian technician
technician public health
construction health and safety technician
environmental health sanitarian
environmental health technologist
health and safety inspector
health officer field
health sanitarian
industrial safety and health specialist
inspector health
inspector occu