# <center> Punchlines as Mirrors: Social Attitudes, Politics, and Biases in *The New Yorker* Caption Contest

Humor reflects society’s views, stereotypes, and political climate. The New Yorker Caption Contest offers a unique lens into this process, showing what people find acceptable, absurd, or taboo.

## <center> Narrative Flow
- **Introduction:** The Caption Contest as a cultural mirror — humor as social data.
- **Axis 1:** Professions & politics → humor about authority and power, *“What are people laughing about?”*
- **Axis 2:** Humor in time → historical & contextual dimensions, *“When and why do jokes resonate?”*
- **Axis 3:** Social norms → gender roles & taboos, testing the limits of humor, *“What’s acceptable or not?”*
- **Axis 4:** Biases → explain psychological and cultural mechanisms behind why we laugh, *“Why do we find it funny?”*
- **Conclusion:** Humor not only entertains — it reveals evolving attitudes, biases, and the cultural pulse of society.

> **Idea for website:** Each section should begin with a set of cartoons from the contest to immerse the viewer in humor before moving to analysis.

---

## <center> Axes of Research

### <center> 1. Professions, Politics, and Power

- **Professions in Humor:** Which jobs are depicted most often? Which are ridiculed vs. admired? What stereotypes recur (e.g., lawyers as tricksters, doctors as saviors)?
- **Politics in Humor:** Do captions reflect partisan leanings (Democrat vs. Republican) or mock political figures more broadly? Are political jokes rated differently?
- **Interplay:** Professions like politicians or lawyers sit at the crossroads of both — this axis highlights how authority and social roles are viewed through humor.

**Plots / Statistics:**
- Bar / Word Clouds: Frequency of professions mentioned in captions (“doctor,” “lawyer,” “politician”).
- Histograms / Line Plots: Frequency of professions across time.
- Grouped Bar Charts: Average funniness scores by profession category (healthcare, law, politics, education, etc.).
- Heatmaps: Cross-tab professions × sentiment (positive/negative/neutral).
- Cartoon + Caption Samples: A few annotated cartoons showing how professions are ridiculed.

**For Politics:**
- Timeline of mentions of political figures/parties.
- Sentiment distribution around Democrats vs. Republicans.
- Example “political joke clusters” side by side with major events (e.g., elections).

**Statistical Tests & Models:**
- t-tests / z-tests → Compare funniness scores of politicians vs. other professions.
- Multiple hypothesis testing (FDR/BH) → Control for comparisons across 30+ job categories.
- Network graphs → Co-occurrence of profession keywords with stereotypes (“lawyer–money,” “doctor–death”).
- Linear regression / lmplot → Test if political humor ratings rise around elections.
- Pearsonr / Spearmanr → Correlation between real-world political cycles and joke frequency.
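
The multiple-testing step listed above can be sketched in a few lines. This is a minimal illustration of the Benjamini-Hochberg (BH) step-up procedure, not the project's final implementation; the p-values in `toy_p_values` are invented for the example, and in practice they would come from the per-profession t-tests.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected
    while controlling the false discovery rate at level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per profession (invented for illustration)
toy_p_values = {"lawyer": 0.001, "doctor": 0.012, "teacher": 0.04,
                "chef": 0.20, "pilot": 0.75}
decisions = benjamini_hochberg(list(toy_p_values.values()), alpha=0.05)
significant = [job for job, keep in zip(toy_p_values, decisions) if keep]
print(significant)
```

With 30+ job categories, applying the same function to the full vector of p-values keeps the expected share of false discoveries below `alpha`, which is less conservative than a Bonferroni correction.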




In [20]:
# Load packages (assumes they are installed in compatible versions)

# Data manipulation
import numpy as np
import pandas as pd
import pickle

# Statistical analysis
import scipy.stats as stats

# Language processing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import spacy
from textblob import TextBlob  # import the class itself; TextBlob(text).correct() is called later
import contractions
import string
from collections import Counter
from nltk.corpus import wordnet as wn

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

nltk.download('punkt')       # Tokeniser
nltk.download('stopwords')   # Stopwords list
nltk.download('wordnet')     # Lemmatiser
nlp = spacy.load('en_core_web_sm')

stop_words = set(stopwords.words('english')) # Initialise stopwords
lemmatizer = WordNetLemmatizer() # Initialise lemmatiser

[nltk_data] Downloading package punkt to C:\Users\andra/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andra/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\andra/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


---
# <center> Preparing the Data

In this section, the code will preprocess the text of the captions and create a tokenized column suitable for analysis. The preprocessing steps include:

- Converting all text to **lower-case**  
- Removing **stopwords**  
- Eliminating **punctuation** such as dots and commas  
- **Expanding contractions**, e.g., “don’t” → “do not”, “it’s” → “it is”  
- **Correcting typos** to standardize common misspellings (optional but recommended for cleaner analysis)  
- **Removing very short tokens** (e.g., single letters or extremely short words)  
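- **Lemmatizing words** to reduce them to their base forms, e.g., “cats” → “cat” (verb and adjective forms such as “running” → “run” or “better” → “good” are only reduced when the lemmatizer is given the part of speech; the default treats every word as a noun)  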

These steps will prepare the captions for downstream analyses, such as frequency counts, word clouds, sentiment analysis, and extraction of professions or topics from the text.

I will run this cell only once and save the result to a new file, kept in my own folder for the time being, so the work does not need to be repeated. Later, this step should be moved into the data-preparation stage: it only adds new columns to the dataframes, so the only cost is a larger dataset.


The full code lives in a separate __text__ file and is not reproduced here; the tokenisation function is included below.

In [21]:
def preprocess_text_list(entry, min_len=2):
    """Preprocess a list of text entries or a single string."""
    if isinstance(entry, list):
        text = " ".join(entry)
    elif isinstance(entry, str):
        text = entry
    else:
        return ""

    # Lowercase
    text = text.lower()

    # Expand contractions
    text = contractions.fix(text)

    # Typo correction
    text = str(TextBlob(text).correct())

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords and short tokens
    tokens = [word for word in tokens if word not in stop_words and len(word) >= min_len]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)

In [22]:
# Load clean data
fullldata = '../../data/cleaned_data_prepared.pkl'
with open(fullldata, "rb") as f:
    cleaned_stored_data = pickle.load(f)
print("Cleaned data loaded successfully.")
dataA_cleaned = cleaned_stored_data["dataA"]
dataC_cleaned = cleaned_stored_data["dataC"]
dataA_startID = cleaned_stored_data["dataA_startID"]
dataA_endID = cleaned_stored_data["dataA_endID"]
dataC_lastGoodID = cleaned_stored_data["dataC_lastGoodID"]


Cleaned data loaded successfully.


---
# <center> Professions in Humor

In this section, we will focus on how different professions are depicted in *The New Yorker* Caption Contest captions. Humor often reflects societal attitudes toward authority, expertise, and social roles, and professions provide a lens into these perceptions.  

## <center> Key Points
- **Frequency of depiction:** Which jobs appear most often in captions?  
- **Stereotypes:** How are certain professions portrayed — are they admired, ridiculed, or caricatured?  
  - Example stereotypes: lawyers as tricksters, doctors as saviors.  
- **Interplay with politics:** Some professions, like politicians or lawyers, intersect with both professional and political commentary, highlighting how authority and social power are perceived.  

## <center> Analytical Approach
To study professions in humor, we will:
- Count the number of times each profession is mentioned across all captions.  
- Visualize the distribution with **bar charts** or **word clouds**.  
- Examine sentiment associated with professions using **heatmaps**.  
- Compare average “funniness” scores by profession category to see which roles tend to be funnier.  
- Annotate examples of cartoons and captions to illustrate recurring jokes and stereotypes.

> This analysis will help us answer the question: *“What are people laughing about when it comes to professions?”*
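
A minimal sketch of the counting and scoring steps listed above, assuming each caption has already been reduced to a list of nouns and carries a mean funniness score. The rows in `toy_rows` and the set `toy_professions` are invented for illustration; the key names `captions_nouns` and `mean` are chosen to mirror the contest dataframes.

```python
from collections import defaultdict

# Toy rows standing in for one contest's dataframe (values invented)
toy_rows = [
    {"captions_nouns": ["congressman", "obstruction", "job"], "mean": 1.9},
    {"captions_nouns": ["lawyer", "suit"], "mean": 1.5},
    {"captions_nouns": ["doctor", "lawyer"], "mean": 1.7},
    {"captions_nouns": ["gutter"], "mean": 1.6},
]
# Hypothetical profession list; the real one comes from external data
toy_professions = {"congressman", "lawyer", "doctor"}


def profession_stats(rows, professions):
    """Count captions mentioning each profession and average their scores."""
    scores = defaultdict(list)
    for row in rows:
        # set() so a profession repeated within one caption counts once
        for job in set(row["captions_nouns"]) & professions:
            scores[job].append(row["mean"])
    return {
        job: {"count": len(vals), "mean_score": sum(vals) / len(vals)}
        for job, vals in scores.items()
    }


stats_by_job = profession_stats(toy_rows, toy_professions)
for job, s in sorted(stats_by_job.items(), key=lambda kv: -kv[1]["count"]):
    print(f"{job}: {s['count']} caption(s), mean score {s['mean_score']:.2f}")
```

The resulting counts feed the bar charts directly, and the per-profession score lists are what a t-test between categories would compare.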


Job titles are nouns, so as a first step we extract all nouns from the captions. This shrinks the dataset considerably and speeds up the later steps. The extraction relies on spaCy's part-of-speech tags, using the `nlp` model loaded earlier.

In [None]:
dataA_cleaned0 = dataA_cleaned.copy()
dataC_cleaned0 = dataC_cleaned.copy()


In [None]:
dataA_cleaned0[0].loc[0, 'cleaned_caption'] = 'congressman obstruction job'

To extract nouns, I used the following function. The code that ran it over the full data and saved the result has been removed from this notebook: it takes a very long time to run, and I do not want to start it by accident.

In [25]:

def extract_nouns(text):
    # Ensure the input is a string, not a list
    if not isinstance(text, str):
        text = " ".join(text)
    doc = nlp(text.lower())
    return [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]

In [26]:
# load the new pickle file to verify
noun_datafile = '../../data/cleaned_data_nouns.pkl'
with open(noun_datafile, "rb") as f:
    noun_stored_data = pickle.load(f)

# Verify the contents
print("Noun-extracted data loaded successfully.")
dataA1 = noun_stored_data["dataA_nouns"]
dataC1 = noun_stored_data["dataC_nouns"]
dataA_startID1 = noun_stored_data["dataA_startID"]
dataA_endID1 = noun_stored_data["dataA_endID"]
dataC_lastGoodID1 = noun_stored_data["dataC_lastGoodID"]


Noun-extracted data loaded successfully.


                                                caption      mean  precision  \
rank                                                                           
0             I'm a congressman--obstruction is my job.  1.913043   0.094022   
1     I'm what they mean when they say, 'The middle ...  1.842105   0.191381   
2                     Does this suit make me look flat?  1.711111   0.112915   
3       When the right woman comes along, I'll know it.  1.625000   0.116657   
4     I used to lie in the gutter, but then I quit d...  1.617647   0.133610   

      votes  not_funny  somewhat_funny  funny  \
rank                                            
0        69         24              27     18   
1        19          8               6      5   
2        45         21              16      8   
3        32         15              14      3   
4        34         19               9      6   

                            cleaned_caption                   captions_nouns  
rank            

--- 
### Preparing External data

Now that the nouns have been extracted from the tokenised captions, the next question is how to count occupations. I will start with the first contest only and then check whether the approach generalises.

I will merge several occupation datasets into one reasonably comprehensive list. The list is also extended with jobs that are mentioned frequently but are not "real" titles and therefore do not appear in every source, for example "Physicist", "Lawyer", and "President". Everything is tested on the first contest, because comparing the full data against a list of 50k occupations is slow.

In [50]:
df = dataA1[0]

In [53]:
# Load occupations
import ast

df_occupations = pd.read_csv("final_combined_occupations_with_synonyms.csv")
# The Synonyms column is stored as text; parse each string back into a Python list
df_occupations['Synonyms'] = df_occupations['Synonyms'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

In [55]:
occupations = df_occupations['Occupation']

# Initialize counter
counter = Counter()

# 'captions_nouns' holds the list of nouns extracted from each caption;
# the column name 'captions_cleaned' used previously does not exist and raised a KeyError
for tokens in df['captions_nouns']:
    token_counts = Counter(tokens)
    for occ in occupations:
        counter[occ] += token_counts.get(occ, 0)

# Convert to DataFrame
occupation_counts = pd.DataFrame.from_dict(counter, orient='index', columns=['count'])
occupation_counts = occupation_counts.sort_values('count', ascending=False)

print(occupation_counts)
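
A possible variant of the counting step that also uses the synonyms is sketched here. It is only an illustration under stated assumptions: `toy_occupations` is a hypothetical stand-in for the `Occupation` and `Synonyms` columns, and `toy_nouns` stands in for the per-caption noun lists. Building one lookup dictionary first means each noun needs a single dictionary access instead of a loop over every occupation.

```python
from collections import Counter

# Hypothetical stand-in for the occupations table: canonical title -> synonyms
toy_occupations = {
    "lawyer": ["attorney", "counsel"],
    "doctor": ["physician", "surgeon"],
    "politician": ["congressman", "senator"],
}


def build_lookup(occupations):
    """Map every title and synonym (lower-cased) to its canonical title."""
    lookup = {}
    for canonical, synonyms in occupations.items():
        for term in [canonical, *synonyms]:
            lookup[term.lower().strip()] = canonical
    return lookup


def count_occupations(noun_lists, lookup):
    """Count canonical occupations across a sequence of noun lists."""
    counts = Counter()
    for nouns in noun_lists:
        for noun in nouns:
            canonical = lookup.get(noun.lower())
            if canonical is not None:
                counts[canonical] += 1
    return counts


# Invented noun lists, one per caption
toy_nouns = [
    ["congressman", "obstruction", "job"],
    ["attorney", "suit"],
    ["Lawyer", "surgeon"],
]
counts = count_occupations(toy_nouns, build_lookup(toy_occupations))
print(counts.most_common())
```

Note that this sketch only matches single-word nouns; multi-word titles would need a separate phrase-matching step.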

To count the occurrences of professions in the captions, we will use the 2018 U.S. Census occupation data as a reference.  
This dataset provides a comprehensive list of job titles and their frequencies.  

- The Census occupation indexes can be found [here](https://www.census.gov/topics/employment/industry-occupation/guidance/indexes.html).  
- The explanation of the SOC (Standard Occupational Classification) codes is available [here](https://www.bls.gov/soc/2018/major_groups.htm).

The problem with this approach is that captions mention occupations in their _colloquial_ form, not by their full _official_ title, so the occupation indexes cannot be matched against the captions directly. We need a way to collapse the census titles into shorter, colloquial terms (for example, the census occupation "midwife nurse" should become simply "nurse" or "midwife"). The following approach is taken:

- Clean the census data by lower-casing, removing trailing spaces, and stripping special characters such as brackets and hyphens.
- Some entries have the form "complicated title" See "simpler title". These are dropped, since the simpler title already covers them.
- Some entries have the form "Analyst\ specified type See type of analyst" or "Clerk\any other specified   Code by duties". These are reduced to the simple title.
- Some titles include an expansion, such as "CFO (Chief Financial Officer)". The alternative name is moved to a new column and removed from the first column.
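
The cleaning rules above could be sketched as follows. This is an assumption-laden illustration rather than the code actually used: the exact layout of the census index is inferred only from the example strings quoted above, and the helper names `clean_title` and `_normalise` are hypothetical.

```python
import re


def _normalise(text):
    """Lower-case, replace special characters with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def clean_title(raw):
    """Return (title, alternative_name), or None for a pure cross-reference."""
    text = raw.strip()
    if "\\" in text:
        # "Analyst\ specified type ..." -> keep only the part before the backslash
        text = text.split("\\", 1)[0]
    elif re.search(r"\bSee\b", text):
        # "complicated title See simpler title" -> drop the entry
        return None
    # "CFO (Chief Financial Officer)" -> pull the bracketed name into its own field
    match = re.search(r"\(([^)]*)\)", text)
    alternative = _normalise(match.group(1)) if match else None
    text = re.sub(r"\([^)]*\)", " ", text)
    return _normalise(text), alternative


examples = [
    "  CFO (Chief Financial Officer) ",
    "Analyst\\ specified type See type of analyst",
    "Clerk\\any other specified   Code by duties",
    "Nurse-Midwife",                              # invented example
    "Certified nurse-midwife See Nurse midwife",  # invented example
]
for raw in examples:
    print(repr(raw), "->", clean_title(raw))
```

Entries that return `None` can simply be filtered out before the titles are merged with the other occupation lists.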