# <center> Punchlines as Mirrors: Social Attitudes, Politics, and Biases in *The New Yorker* Caption Contest

Humor reflects society’s views, stereotypes, and political climate. The New Yorker Caption Contest offers a unique lens into this process, showing what people find acceptable, absurd, or taboo.

## <center> Narrative Flow
- **Introduction:** The Caption Contest as a cultural mirror — humor as social data.
- **Axis 1:** Professions & politics → humor about authority and power, *“What are people laughing about?”*
- **Axis 2:** Humor in time → historical & contextual dimensions, *“When and why do jokes resonate?”*
- **Axis 3:** Social norms → gender roles & taboos, testing the limits of humor, *“What’s acceptable or not?”*
- **Axis 4:** Biases → explain psychological and cultural mechanisms behind why we laugh, *“Why do we find it funny?”*
- **Conclusion:** Humor not only entertains — it reveals evolving attitudes, biases, and the cultural pulse of society.

> **Idea for website:** Each section should begin with a set of cartoons from the contest to immerse the viewer in humor before moving to analysis.

---

## <center> Axes of Research

### <center> 1. Professions, Politics, and Power

- **Professions in Humor:** Which jobs are depicted most often? Which are ridiculed vs. admired? What stereotypes recur (e.g., lawyers as tricksters, doctors as saviors)?
- **Politics in Humor:** Do captions reflect partisan leanings (Democrat vs. Republican) or mock political figures more broadly? Are political jokes rated differently?
- **Interplay:** Professions like politicians or lawyers sit at the crossroads of both — this axis highlights how authority and social roles are viewed through humor.

**Plots / Statistics:**
- Bar Charts / Word Clouds: Frequency of professions mentioned in captions (“doctor,” “lawyer,” “politician”).
- Histograms / Line Plots: Frequency of professions across time.
- Grouped Bar Charts: Average funniness scores by profession category (healthcare, law, politics, education, etc.).
- Heatmaps: Cross-tab professions × sentiment (positive/negative/neutral).
- Cartoon + Caption Samples: A few annotated cartoons showing how professions are ridiculed.

**For Politics:**
- Timeline of mentions of political figures/parties.
- Sentiment distribution around Democrats vs. Republicans.
- Example “political joke clusters” side by side with major events (e.g., elections).

**Statistical Tests & Models:**
- t-tests / z-tests → Compare funniness scores of politicians vs. other professions.
- Multiple hypothesis testing (FDR/BH) → Control for comparisons across 30+ job categories.
- Network graphs → Co-occurrence of profession keywords with stereotypes (“lawyer–money,” “doctor–death”).
- Linear regression / lmplot → Test if political humor ratings rise around elections.
- Pearson / Spearman correlation (`pearsonr` / `spearmanr`) → Correlation between real-world political cycles and joke frequency.
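
The multiple-testing step in the list above can be sketched in a few lines. The snippet below is a minimal pure-Python version of the Benjamini-Hochberg procedure, not the analysis this notebook will actually run; the p-values and profession labels are made up for illustration. In practice, `statsmodels.stats.multitest.multipletests(pvals, method='fdr_bh')` should do the same job.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected
    while controlling the false discovery rate at level alpha."""
    m = len(p_values)
    # Sort indices by ascending p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k/m * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from comparing each profession against all others
example_p = {"lawyer": 0.001, "doctor": 0.012, "politician": 0.025,
             "teacher": 0.045, "chef": 0.600}
decisions = benjamini_hochberg(list(example_p.values()), alpha=0.05)
significant = [job for job, keep in zip(example_p, decisions) if keep]
print(significant)
```

With 30+ job categories, running uncorrected t-tests at the 5% level would be expected to flag one or two categories by chance alone, which is why the correction belongs in the pipeline from the start.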


In [22]:
# -----------------------------
# Install required packages
# -----------------------------
# %pip install nltk
# %pip install contractions
# %pip install textblob
# %pip install wordcloud
# %pip install matplotlib
# %pip install statsmodels
# %pip install pandas

# Text analysis
import re                                          # For regular expressions
import string                                       # For text cleaning
from collections import Counter                     # For text processing
import pandas as pd                                 # For data manipulation
import nltk                                         # For natural language processing
from nltk.corpus import stopwords                   # For stopwords
from nltk.tokenize import TreebankWordTokenizer     # For tokenization
from nltk.tokenize import word_tokenize             # For word tokenization
from nltk.stem import WordNetLemmatizer             # For lemmatization
import contractions                                 # For expanding contractions
from textblob import TextBlob                       # For typo correction

# Download NLTK resources if not already
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialise NLP tools
stop_words = set(stopwords.words('english'))        # Use NLTK English stopwords
lemmatizer = WordNetLemmatizer()                    # Use WordNetLemmatizer
tokenizer = TreebankWordTokenizer()                 # Use TreebankWordTokenizer

# Text visualisation
from wordcloud import WordCloud                     # For word cloud generation
import matplotlib.pyplot as plt                     # For plotting

# Statistical modeling
import statsmodels.api as sm                        # For regression analysis    



# Loading dataset
import pickle
from pathlib import Path

loc = Path('../../data/data_prepared.pkl')          # Location of the prepared dataset

with open(loc, "rb") as f:
    stored_data = pickle.load(f)                    # Load the pickle

# Access the elements
dataA_startID = stored_data["dataA_startID"]        # these are integers
dataA_endID = stored_data["dataA_endID"]            
dataC_lastGoodID = stored_data["dataC_lastGoodID"]  
dataA = stored_data["dataA"]                        # this is a list of DataFrames
dataC = stored_data["dataC"]                        # this is a single DataFrame

[nltk_data] Downloading package punkt to C:\Users\andra/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andra/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\andra/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


---
# <center> Preparing the Data

In this section, the code will preprocess the text of the captions and create a tokenized column suitable for analysis. The preprocessing steps include:

- Converting all text to **lower-case**  
- Removing **stopwords**  
- Eliminating **punctuation** such as dots and commas  
- **Expanding contractions**, e.g., “don’t” → “do not”, “it’s” → “it is”  
- **Correcting typos** to standardize common misspellings (optional but recommended for cleaner analysis)  
- **Removing very short tokens** (e.g., single letters or extremely short words)  
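- **Lemmatizing words** to reduce them to their base forms, e.g., “captions” → “caption”. Note that `WordNetLemmatizer.lemmatize` treats every token as a noun unless a part-of-speech tag is passed, so “running” → “run” and “better” → “good” only happen with `pos='v'` and `pos='a'` respectively.  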

These steps will prepare the captions for downstream analyses, such as frequency counts, word clouds, sentiment analysis, and extraction of professions or topics from the text.
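
The order of these steps matters: contractions have to be expanded before punctuation is stripped, otherwise the apostrophe is already gone and "don't" turns into the unrecognisable token "dont". The sketch below illustrates this with the standard library only. The tiny `CONTRACTIONS` map and `STOPWORDS` set are made-up stand-ins for the `contractions` package and the NLTK stopword list, and typo correction and lemmatization are left out.

```python
import string

# Hypothetical stand-ins for the `contractions` package and NLTK stopwords
CONTRACTIONS = {"don't": "do not", "it's": "it is"}
STOPWORDS = {"do", "not", "it", "is", "the", "a"}

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())

def strip_punctuation(text):
    """Delete every ASCII punctuation character, including apostrophes."""
    return text.translate(str.maketrans("", "", string.punctuation))

def clean(text, expand_first=True, min_len=2):
    """Lowercase, expand contractions, strip punctuation, drop stopwords."""
    text = text.lower()
    if expand_first:
        text = strip_punctuation(expand_contractions(text))
    else:
        text = expand_contractions(strip_punctuation(text))
    tokens = [t for t in text.split() if t not in STOPWORDS and len(t) >= min_len]
    return " ".join(tokens)

caption = "Don't worry, it's the normal size."
print(clean(caption, expand_first=True))   # worry normal size
print(clean(caption, expand_first=False))  # dont worry its normal size
```

In the second call the leftover tokens "dont" and "its" slip past the stopword filter and would pollute later frequency counts.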

I will run this preprocessing only once and save the result to a new file, which stays in my folder for the time being, so the work never has to be repeated. Later, I think this step should move into the data-preparation stage: it does not alter any existing data and only adds new columns to the dataframes, so the only cost is a larger file.
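
The run-once idea can be made explicit by checking for the saved file before recomputing anything. This is only a sketch of that pattern: `load_or_compute` and the demo file name are hypothetical, and the `compute` callback stands in for the slow preprocessing loop.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_compute(cache_path, compute):
    """Load a pickled result if it exists; otherwise compute and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result

# Demo in a temporary folder: the expensive step runs only on the first call
calls = []

def slow_step():
    calls.append(1)
    return {"cleaned": ["ask feel could danger others"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cleaned_demo.pkl"
    first = load_or_compute(path, slow_step)
    second = load_or_compute(path, slow_step)

print(len(calls), first == second)
```

The second call reads the pickle instead of calling `slow_step` again, which is the behaviour wanted for the cleaned captions.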


In [23]:
def preprocess_text(text, min_len=2):
    """
    Preprocess text by:
    - Lowercasing
    - Expanding contractions
    - Typo correction (TextBlob)
    - Removing punctuation
    - Tokenizing
    - Removing stopwords and tokens shorter than min_len
    - Lemmatization
    """
    # Lowercase
    text = text.lower()
    
    # Expand contractions (before punctuation removal, while apostrophes are intact)
    text = contractions.fix(text)
    
    # Typo correction (TextBlob's correct() can be slow on large corpora)
    text = str(TextBlob(text).correct())
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Tokenise
    tokens = word_tokenize(text)
    
    # Remove stopwords and very short tokens
    tokens = [word for word in tokens if word not in stop_words and len(word) >= min_len]
    
    # Lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return ' '.join(tokens)

In [None]:
# apply preprocessing to captions in dataA
size = len(dataA)
for i, df in enumerate(dataA):
    df['cleaned_caption'] = df['caption'].apply(preprocess_text)
    print(f"done with {(i+1)/size*100:.2f}%")



done with 0.00%
done with 0.26%
...
done with 15.62%

AttributeError: 'list' object has no attribute 'lower'

In [34]:
def preprocess_text_list(entry, min_len=2):
    """Preprocess a list of strings (joined with spaces) or a single string.

    Any other input type (e.g. a missing value) yields an empty string.
    """
    if isinstance(entry, list):
        text = " ".join(entry)
    elif isinstance(entry, str):
        text = entry
    else:
        return ""

    # Reuse the caption pipeline defined above so both code paths stay in sync
    return preprocess_text(text, min_len=min_len)


# Apply preprocessing to image_locations, questions, image_uncanny_descriptions, image_descriptions
dataC['cleaned_image_locations'] = dataC['image_locations'].apply(preprocess_text_list)
print("done with image_locations")
dataC['cleaned_questions'] = dataC['questions'].apply(preprocess_text_list)
print("done with questions")
dataC['cleaned_image_uncanny_descriptions'] = dataC['image_uncanny_descriptions'].apply(preprocess_text_list)
print("done with image_uncanny_descriptions")
dataC['cleaned_image_descriptions'] = dataC['image_descriptions'].apply(preprocess_text_list)
print("done with image_descriptions")

done with image_locations
done with questions
done with image_uncanny_descriptions
done with image_descriptions


In [35]:
# new name for the updated text, save as cleaned_...
# Save the dataframes with cleaned text back to pickle
saveloc = '../../data/cleaned_data_prepared.pkl'
with open(saveloc, "wb") as f:
    pickle.dump({
        "dataA_startID": dataA_startID,
        "dataA_endID": dataA_endID,
        "dataC_lastGoodID": dataC_lastGoodID,
        "dataA": dataA,
        "dataC": dataC
    }, f)
print(f"Cleaned data saved to {saveloc}")


Cleaned data saved to ../../data/cleaned_data_prepared.pkl


---
# <center> Professions in Humor

In this section, we will focus on how different professions are depicted in *The New Yorker* Caption Contest captions. Humor often reflects societal attitudes toward authority, expertise, and social roles, and professions provide a lens into these perceptions.  

## <center> Key Points
- **Frequency of depiction:** Which jobs appear most often in captions?  
- **Stereotypes:** How are certain professions portrayed — are they admired, ridiculed, or caricatured?  
  - Example stereotypes: lawyers as tricksters, doctors as saviors.  
- **Interplay with politics:** Some professions, like politicians or lawyers, intersect with both professional and political commentary, highlighting how authority and social power are perceived.  

## <center> Analytical Approach
To study professions in humor, we will:
- Count the number of times each profession is mentioned across all captions.  
- Visualize the distribution with **bar charts** or **word clouds**.  
- Examine sentiment associated with professions using **heatmaps**.  
- Compare average “funniness” scores by profession category to see which roles tend to be funnier.  
- Annotate examples of cartoons and captions to illustrate recurring jokes and stereotypes.

> This analysis will help us answer the question: *“What are people laughing about when it comes to professions?”*
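
As a first pass at the counting step, the sketch below tallies profession keywords in already-cleaned captions with `collections.Counter` and averages the funniness score per profession. The `PROFESSIONS` set and the three example rows are invented for illustration; the real input would be the `cleaned_caption` and `mean` columns of the dataframes in `dataA`. Plain keyword matching like this will miss synonyms and multi-word job titles.

```python
from collections import Counter, defaultdict

# Hypothetical keyword list; a real one would be much longer
PROFESSIONS = {"doctor", "lawyer", "politician", "teacher"}

# Invented (cleaned_caption, mean funniness score) pairs
rows = [
    ("doctor said lawyer would call", 1.8),
    ("lawyer bill time", 1.2),
    ("normal feel empty split", 1.9),
]

mention_counts = Counter()
scores_by_profession = defaultdict(list)

for cleaned_caption, mean_score in rows:
    # Use a set so a profession repeated within one caption counts once
    mentioned = set(cleaned_caption.split()) & PROFESSIONS
    mention_counts.update(mentioned)
    for job in mentioned:
        scores_by_profession[job].append(mean_score)

average_score = {job: sum(s) / len(s) for job, s in scores_by_profession.items()}

print(mention_counts.most_common())
print(average_score)
```

Because the cleaned text has already been lemmatized, plural mentions such as "lawyers" should appear as "lawyer", so singular keywords are enough here.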


In [None]:
# Load clean data to verify
with open(saveloc, "rb") as f:
    cleaned_stored_data = pickle.load(f)
print("Cleaned data loaded successfully for verification.")
dataA_cleaned = cleaned_stored_data["dataA"]
dataC_cleaned = cleaned_stored_data["dataC"]
dataA_startID = cleaned_stored_data["dataA_startID"]
dataA_endID = cleaned_stored_data["dataA_endID"]
dataC_lastGoodID = cleaned_stored_data["dataC_lastGoodID"]


Cleaned data loaded successfully for verification.


| rank | caption | mean | precision | votes | not_funny | somewhat_funny | funny | cleaned_caption |
|---|---|---|---|---|---|---|---|---|
| 0 | I have to ask, do you feel that you could be a... | 2.060484 | 0.027639 | 744 | 190 | 319 | 235 | ask feel could danger others |
| 1 | Now that you've opened up, let's talk about wh... | 2.048047 | 0.016587 | 2227 | 631 | 858 | 738 | opened let u talk eating |
| 2 | It's normal to feel empty after a split. | 1.943949 | 0.028433 | 785 | 272 | 285 | 228 | normal feel empty split |
| 3 | You're right; some of us bruise easier than ot... | 1.93617 | 0.04863 | 235 | 73 | 104 | 58 | right u bruise easier others |
| 4 | Would you feel more comfortable on the floor ? | 1.92668 | 0.035743 | 491 | 173 | 181 | 137 | would feel comfortable floor |