<a href="https://colab.research.google.com/github/dani-studiohawk/Grandparentsbabysitting/blob/main/Should_Grandparents_Babysit_%7C_Purebaby.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [9]:
import pandas as pd
import json
from openai import OpenAI
from google.colab import userdata

# Retrieve colab secret
api_key = userdata.get('DaniKey')

# create client
client = OpenAI(api_key=api_key)

# not printing the actual key
print("API key loaded:", api_key is not None)


API key loaded: True


# Relevance

My first step was to have GPT determine whether each comment was relevant to the question 'Should grandparents be paid to babysit'. This lets us filter out the noise (i.e. jokes, memes and deleted messages) from the outset.
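Deleted and removed comments can be caught without an API call, since Reddit replaces their bodies with the placeholders `[deleted]` or `[removed]`. A minimal sketch of such a pre-filter, assuming the comment text sits in a `body` field as in the cell below; `is_placeholder` is a hypothetical helper and was not part of the original run:

```python
# Hypothetical pre-filter: not part of the original notebook run.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def is_placeholder(body):
    """Return True for missing, blank, or deleted/removed comment bodies."""
    if not isinstance(body, str):
        return True  # e.g. NaN from an empty CSV cell
    stripped = body.strip()
    return stripped == "" or stripped.lower() in PLACEHOLDERS


# Made-up sample rows for illustration.
comments = [
    {"body": "[deleted]"},
    {"body": "We pay my mum petrol money for the school run."},
    {"body": "  "},
]

# Only comments with real text need to go to the model.
to_classify = [c for c in comments if not is_placeholder(c["body"])]
print(len(to_classify))
```

Jokes and memes still need the model's judgement; this only saves calls on rows that have no text to classify.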

In [30]:

df = pd.read_csv("reddit_thread_with_relevance.csv", dtype=str)

# Identify the rows that errored on the first pass so they can be re-run
error_mask = df["relevance"] == "error"
error_rows = df[error_mask]

print(f"Found {len(error_rows)} rows with errors. Reprocessing now.")


TASK_PROMPT = (
    "You are classifying whether a Reddit comment is relevant to the question: "
    "'Should grandparents be paid for babysitting grandchildren'.\n\n"
    "RELEVANT means:\n"
    "• The comment expresses an opinion about paying or not paying grandparents.\n"
    "• The comment discusses babysitting, childcare, compensation, fairness, or responsibility.\n"
    "• The comment provides anecdotes, arguments, or examples about the issue.\n"
    "• The comment responds directly to another point about paying grandparents.\n\n"
    "IRRELEVANT means:\n"
    "• The comment does not address payment, childcare, or babysitting.\n"
    "• The comment is off-topic, a joke, a meme, a tag, or noise.\n"
    "• The comment only discusses grandparents or children in general, not payment.\n"
    "• The comment is meta-discussion about Reddit or the thread.\n\n"
    "OUTPUT FORMAT:\n"
    "You must respond in exactly two lines:\n"
    "relevance: relevant or irrelevant\n"
    "reason: a short explanation\n\n"
    "Comment: <<<COMMENT>>>"
)


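def clean_comment(text):
    """Return the comment as clean UTF-8 text; non-string values (e.g. NaN) become ""."""
    if not isinstance(text, str):
        return ""
    text = text.replace("\x00", "")  # drop NUL characters
    return text.encode("utf-8", "ignore").decode("utf-8", "ignore").strip()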


def parse_two_line_output(raw):
    relevance = "error"
    reason = raw.strip()

    lines = raw.splitlines()
    for line in lines:
        lower = line.lower().strip()

        if lower.startswith("relevance:"):
            value = lower.replace("relevance:", "").strip()
            if value in ["relevant", "irrelevant"]:
                relevance = value

        elif lower.startswith("reason:"):
            reason = line.split(":", 1)[1].strip()

    return relevance, reason


def classify_comment(comment, retries=1):
    comment_clean = clean_comment(comment)
    prompt = TASK_PROMPT.replace("<<<COMMENT>>>", comment_clean)

    try:
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": "Follow the two-line output format strictly."},
                {"role": "user", "content": prompt}
            ],
            max_completion_tokens=300
        )

        raw = response.choices[0].message.content
        relevance, reason = parse_two_line_output(raw)

        # Retry if parsing failed
        if relevance == "error" and retries > 0:
            return classify_comment(comment, retries=retries - 1)

        return relevance, reason

    except Exception as e:
        if retries > 0:
            return classify_comment(comment, retries=retries - 1)
        return "error", str(e)



for idx, row in error_rows.iterrows():
    comment = row["body"]

    print(f"Processing row {idx}...", end=" ")

    relevance, reason = classify_comment(comment)

    df.at[idx, "relevance"] = relevance
    df.at[idx, "reason"] = reason

    print(f"→ {relevance}")



df.to_csv("reddit_thread_with_relevance_fixed.csv", index=False)
print("Done. Saved as reddit_thread_with_relevance_fixed.csv.")

Found 136 rows with errors. Reprocessing now.
Processing row 0... → relevant
Processing row 1... → relevant
Processing row 2... → relevant
Processing row 6... → irrelevant
Processing row 7... → irrelevant
Processing row 10... → relevant
Processing row 11... → relevant
Processing row 13... → relevant
Processing row 14... → relevant
Processing row 15... → relevant
Processing row 16... → relevant
Processing row 18... → irrelevant
Processing row 20... → irrelevant
Processing row 22... → relevant
Processing row 28... → relevant
Processing row 30... → relevant
Processing row 35... → relevant
Processing row 38... → relevant
Processing row 41... → relevant
Processing row 42... → irrelevant
Processing row 43... → irrelevant
Processing row 50... → relevant
Processing row 53... → relevant
Processing row 55... → irrelevant
Processing row 56... → relevant
Processing row 57... → relevant
Processing row 58... → irrelevant
Processing row 61... → irrelevant
Processing row 64... → irrelevant
Processing 

This leaves us with 266 relevant comments.

# For or Against

Because Clea is primarily interested in percentages and statistical analysis, my next step was to break the dataset down into clearly labelled For, Against or Neutral positions. I had GPT output the reason for each determination so I could spot-check that the output was correct.

This step mainly allows us to break down the data in a way that can be quoted cleanly:

What proportion supports paying?

What proportion opposes?

How many sit on the fence?

In [31]:

df = pd.read_csv("reddit_thread_with_relevance_fixed.csv", dtype=str)

relevant_df = df[df["relevance"] == "relevant"]
print(f"Found {len(relevant_df)} relevant comments. Classifying positions now.")

POSITION_PROMPT = (
    "Classify the stance of the Reddit comment on whether grandparents "
    "should be paid for babysitting grandchildren.\n\n"
    "FOR means:\n"
    "The comment supports paying grandparents.\n"
    "It argues in favour of compensation, fairness, or respecting their time.\n\n"
    "AGAINST means:\n"
    "The comment argues grandparents should not be paid.\n"
    "It frames babysitting as a family duty or something that should not involve money.\n\n"
    "NEUTRAL means:\n"
    "The comment discusses the topic but does not clearly choose a side.\n"
    "The stance is mixed, unclear, or only descriptive.\n\n"
    "OUTPUT FORMAT (exactly two lines):\n"
    "position: for or against or neutral\n"
    "reason: short explanation\n\n"
    "Comment: <<<COMMENT>>>"
)


def clean_comment(text):
    if not isinstance(text, str):
        return ""
    text = text.replace("\x00", "")
    return text.encode("utf-8", "ignore").decode("utf-8", "ignore").strip()

def parse_two_line_output(raw):
    position = "error"
    reason = raw.strip()

    lines = raw.splitlines()
    for line in lines:
        lower = line.lower().strip()
        if lower.startswith("position:"):
            val = lower.replace("position:", "").strip()
            if val in ["for", "against", "neutral"]:
                position = val
        elif lower.startswith("reason:"):
            reason = line.split(":", 1)[1].strip()

    return position, reason

def classify_position(comment, retries=1):
    comment_clean = clean_comment(comment)
    prompt = POSITION_PROMPT.replace("<<<COMMENT>>>", comment_clean)

    try:
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": "Respond strictly in two lines."},
                {"role": "user", "content": prompt}
            ],
            max_completion_tokens=200
        )

        raw = response.choices[0].message.content
        position, reason = parse_two_line_output(raw)

        if position == "error" and retries > 0:
            return classify_position(comment, retries=retries - 1)

        return position, reason

    except Exception as e:
        if retries > 0:
            return classify_position(comment, retries=retries - 1)
        return "error", str(e)


df["position"] = None
df["position_reason"] = None

for idx, row in relevant_df.iterrows():
    comment = row["body"]
    print(f"Classifying row {idx}...", end=" ")

    position, reason = classify_position(comment)

    df.at[idx, "position"] = position
    df.at[idx, "position_reason"] = reason

    print(f"→ {position}")


df.to_csv("reddit_thread_with_positions.csv", index=False)
print("Done. Saved as reddit_thread_with_positions.csv.")

Found 263 relevant comments. Classifying positions now.
Classifying row 0... → neutral
Classifying row 1... → against
Classifying row 2... → for
Classifying row 4... → against
Classifying row 9... → neutral
Classifying row 10... → neutral
Classifying row 11... → neutral
Classifying row 12... → against
Classifying row 13... → against
Classifying row 14... → neutral
Classifying row 15... → against
Classifying row 16... → against
Classifying row 17... → against
Classifying row 19... → neutral
Classifying row 21... → for
Classifying row 22... → neutral
Classifying row 24... → neutral
Classifying row 25... → neutral
Classifying row 26... → against
Classifying row 27... → neutral
Classifying row 28... → for
Classifying row 29... → against
Classifying row 30... → against
Classifying row 31... → against
Classifying row 32... → against
Classifying row 33... → neutral
Classifying row 34... → against
Classifying row 35... → against
Classifying row 36... → neutral
Classifying row 37... → neutral
C

KeyboardInterrupt: 

This gives us

For: 42

Against: 111

Neutral: 113
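
As a quick check on the arithmetic, the percentage shares follow directly from these counts. A small sketch using the three totals above (the variable names are illustrative):

```python
# Counts taken from the totals listed above.
counts = {"for": 42, "against": 111, "neutral": 113}
total = sum(counts.values())  # all relevant comments

# Share of each position, rounded to one decimal place.
shares = {label: round(n / total * 100, 1) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]} ({pct}%)")
```

These shares match the position breakdown printed in the Analysis section.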

# For Reasons

At this stage, to determine reasons, I considered what I imagined to be the most likely reasons, having looked at the dataset and read a lot of the comments. I used ChatGPT in the browser to help me come up with clear categories.

I then got ChatGPT to assign each FOR comment to one of the five categories I came up with. This allows us to look at this data statistically.

In [14]:
df = pd.read_csv("reddit_thread_with_positions.csv", dtype=str)

# Filter to only "for" rows
for_rows = df[df["position"] == "for"]
print(f"Found {len(for_rows)} rows needing reason categorisation.")


CATEGORIES_TEXT = """
Classify the reason why the commenter supports paying grandparents for babysitting.
Choose exactly one of the following categories:

1. Compensation for labour and effort
2. Preventing financial harm or lost income
3. Fairness and avoiding exploitation
4. Covering expenses and out of pocket costs
5. Systemic or policy reasons

Definitions:
1. Compensation for labour and effort = Emphasises childcare being tiring, hard work, or equivalent to a job.
2. Preventing financial harm or lost income = Mentions rent, lost work hours, fixed income, or financial strain.
3. Fairness and avoiding exploitation = Focuses on fairness, being used, guilt, respect, or boundary violations.
4. Covering expenses and out of pocket costs = Mentions petrol, food, outings, activities, supplies.
5. Systemic or policy reasons = Discusses government subsidies, pension supplements, or pressure on childcare systems.

OUTPUT FORMAT (exactly two lines):
category: one of the five categories above
reason: very short explanation

Comment: <<<COMMENT>>>
"""

def clean_comment(text):
    if not isinstance(text, str):
        return ""
    text = text.replace("\x00", "")
    return text.encode("utf-8", "ignore").decode("utf-8", "ignore").strip()


def parse_two_line_cat(raw):
    category = "error"
    reason = raw.strip()

    for line in raw.splitlines():
        lower = line.lower().strip()
        if lower.startswith("category:"):
            category = line.split(":", 1)[1].strip()
        elif lower.startswith("reason:"):
            reason = line.split(":", 1)[1].strip()

    return category, reason


def classify_for_reason(comment, retries=1):
    text = clean_comment(comment)
    prompt = CATEGORIES_TEXT.replace("<<<COMMENT>>>", text)

    try:
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": "Respond strictly in two lines."},
                {"role": "user", "content": prompt}
            ],
            max_completion_tokens=200
        )
        raw = response.choices[0].message.content
        category, reason = parse_two_line_cat(raw)

        if category == "error" and retries > 0:
            return classify_for_reason(comment, retries - 1)

        return category, reason

    except Exception as e:
        if retries > 0:
            return classify_for_reason(comment, retries - 1)
        return "error", str(e)


df["for_reason_category"] = None
df["for_reason_explanation"] = None

for idx, row in for_rows.iterrows():
    comment = row["body"]
    print(f"Categorising row {idx}...", end=" ")

    category, explanation = classify_for_reason(comment)

    df.at[idx, "for_reason_category"] = category
    df.at[idx, "for_reason_explanation"] = explanation

    print(f"→ {category}")

df.to_csv("reddit_thread_with_for_categories.csv", index=False)
print("Done. Saved as reddit_thread_with_for_categories.csv.")

Found 42 rows needing reason categorisation.
Categorising row 2... → Compensation for labour and effort
Categorising row 19... → Preventing financial harm or lost income
Categorising row 21... → 3. Fairness and avoiding exploitation
Categorising row 28... → 1. Compensation for labour and effort
Categorising row 45... → 3. Fairness and avoiding exploitation
Categorising row 49... → 5. Systemic or policy reasons
Categorising row 74... → 4. Covering expenses and out of pocket costs
Categorising row 76... → 2. Preventing financial harm or lost income
Categorising row 80... → 5. Systemic or policy reasons
Categorising row 92... → 5. Systemic or policy reasons
Categorising row 110... → 4. Covering expenses and out of pocket costs
Categorising row 126... → 2. Preventing financial harm or lost income
Categorising row 142... → 3. Fairness and avoiding exploitation
Categorising row 150... → Compensation for labour and effort
Categorising row 155... → 1. Compensation for labour and effort
Categor
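
The model returned the same category both with and without its number (for example `3. Fairness and avoiding exploitation` and `Fairness and avoiding exploitation`), and those are counted as different values. A minimal sketch of stripping the numeric prefix before counting; `normalise_category` is a hypothetical helper, and the counts are the ones printed in the FOR-REASON table in the Analysis section:

```python
import re
from collections import Counter


def normalise_category(label):
    """Strip a leading list number such as '3. ' from a category label."""
    return re.sub(r"^\s*\d+\.\s*", "", label).strip()


# Raw label counts as printed in the FOR-REASON table.
raw_counts = {
    "3. Fairness and avoiding exploitation": 8,
    "Compensation for labour and effort": 7,
    "1. Compensation for labour and effort": 7,
    "4. Covering expenses and out of pocket costs": 6,
    "2. Preventing financial harm or lost income": 5,
    "5. Systemic or policy reasons": 5,
    "Preventing financial harm or lost income": 3,
    "Fairness and avoiding exploitation": 1,
}

# Add up the counts of labels that normalise to the same category.
merged = Counter()
for label, n in raw_counts.items():
    merged[normalise_category(label)] += n

for label, n in merged.most_common():
    print(f"{label}: {n}")
```

In the notebook itself, the same function could be applied with `df["for_reason_category"].map(normalise_category)` after the `fillna("")` step and before calling `value_counts()`.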

# Against Section

I followed the same steps as outlined in the For Reasons section.
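
Since every step uses the same two-line format, the parsing could live in one shared helper instead of being redefined in each cell. A sketch under that assumption; `parse_labelled_output` and its parameters are hypothetical names, and unlike the category parsers in this notebook it rejects any label that is not in an allowed set:

```python
def parse_labelled_output(raw, key, allowed):
    """Parse '<key>: <label>' and 'reason: <text>' lines from a model reply.

    Returns (label, reason). The label is "error" when the key line is
    missing or its value is not in `allowed` (compared case-insensitively).
    """
    label = "error"
    reason = raw.strip()  # fall back to the whole reply as the reason
    allowed_lower = {a.lower() for a in allowed}

    for line in raw.splitlines():
        stripped = line.strip()
        lower = stripped.lower()
        if lower.startswith(f"{key}:"):
            value = lower.split(":", 1)[1].strip()
            if value in allowed_lower:
                label = value
        elif lower.startswith("reason:"):
            reason = stripped.split(":", 1)[1].strip()

    return label, reason


reply = "position: Against\nreason: Frames babysitting as a family duty."
print(parse_labelled_output(reply, "position", {"for", "against", "neutral"}))
```

Because an unexpected label comes back as `"error"`, the existing retry logic would re-ask the model rather than store a label that is not in the allowed set.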

In [19]:
df = pd.read_csv("reddit_thread_with_positions.csv", dtype=str)

# Filter only "against" rows
against_rows = df[df["position"] == "against"]
print(f"Found {len(against_rows)} rows needing AGAINST reason categorisation.")


CATEGORIES_TEXT_AGAINST = """
Classify the reason why the commenter opposes paying grandparents for babysitting.
Choose exactly one of the following categories:

1. Family duty and relationship building
2. Anti-transactional or moral objections
3. Prefer professional childcare instead
4. Cultural or generational norms against payment
5. Payment is impractical, unnecessary, or creates complications

Definitions:
1. Family duty and relationship building = Emphasises love, bonding, creating memories, or that family helps family.
2. Anti-transactional = Says paying is transactional, morally wrong, erodes family values, or harms relationships.
3. Prefer professional childcare = Argues that if money is involved it should go to trained childcare workers.
4. Cultural or generational norms = Mentions cultural attitudes, upbringing, reciprocity, grandparents refusing payment.
5. Impractical or unnecessary = Points out pension/tax issues, bureaucracy, tension, refusal to accept money, or that gifts/not money are appropriate.

OUTPUT FORMAT (exactly two lines):
category: one of the five categories above
reason: very short explanation

Comment: <<<COMMENT>>>
"""

def clean_comment(text):
    if not isinstance(text, str):
        return ""
    text = text.replace("\x00", "")
    return text.encode("utf-8", "ignore").decode("utf-8", "ignore").strip()

def parse_two_line_cat(raw):
    category = "error"
    reason = raw.strip()

    for line in raw.splitlines():
        lower = line.lower().strip()
        if lower.startswith("category:"):
            category = line.split(":", 1)[1].strip()
        elif lower.startswith("reason:"):
            reason = line.split(":", 1)[1].strip()

    return category, reason

def classify_against_reason(comment, retries=1):
    text = clean_comment(comment)
    prompt = CATEGORIES_TEXT_AGAINST.replace("<<<COMMENT>>>", text)

    try:
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": "Respond strictly in two lines."},
                {"role": "user", "content": prompt}
            ],
            max_completion_tokens=200
        )
        raw = response.choices[0].message.content
        category, reason = parse_two_line_cat(raw)

        if category == "error" and retries > 0:
            return classify_against_reason(comment, retries - 1)

        return category, reason

    except Exception as e:
        if retries > 0:
            return classify_against_reason(comment, retries - 1)
        return "error", str(e)


df["against_reason_category"] = None
df["against_reason_explanation"] = None

for idx, row in against_rows.iterrows():
    comment = row["body"]
    print(f"Categorising row {idx}...", end=" ")

    category, explanation = classify_against_reason(comment)

    df.at[idx, "against_reason_category"] = category
    df.at[idx, "against_reason_explanation"] = explanation

    print(f"→ {category}")


df.to_csv("reddit_thread_with_all_reason_categories.csv", index=False)
print("Done. Saved as reddit_thread_with_all_reason_categories.csv.")

Found 111 rows needing AGAINST reason categorisation.
Categorising row 1... → 1. Family duty and relationship building
Categorising row 4... → 1. Family duty and relationship building
Categorising row 12... → 4. Cultural or generational norms
Categorising row 13... → 4. Cultural or generational norms
Categorising row 15... → 3. Prefer professional childcare instead
Categorising row 16... → 3. Prefer professional childcare instead
Categorising row 17... → 2. Anti-transactional or moral objections
Categorising row 25... → 5. Payment is impractical, unnecessary, or creates complications
Categorising row 26... → 1. Family duty and relationship building
Categorising row 29... → 5. Payment is impractical, unnecessary, or creates complications
Categorising row 30... → 4. Cultural or generational norms
Categorising row 31... → 1. Family duty and relationship building
Categorising row 32... → 4. Cultural or generational norms
Categorising row 34... → 1. Family duty and relationship building
Cat

# Merge

I had to merge my datasets because I'd output the FOR and AGAINST reasons to two different files (both cells start from reddit_thread_with_positions.csv).
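
A left merge only keeps one row per comment if `comment_id` is unique in the right-hand table; repeated IDs would duplicate rows and inflate the counts. A small stand-alone sketch of checking for repeats first; `find_duplicate_ids` is a hypothetical helper and the IDs are made up:

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the IDs that appear more than once, in first-seen order."""
    counts = Counter(ids)
    return [i for i in counts if counts[i] > 1]


# Made-up comment IDs for illustration.
ids = ["abc1", "abc2", "abc3", "abc2"]
print(find_duplicate_ids(ids))
```

With the DataFrames from the cell below, `df_for["comment_id"].tolist()` could be passed in. pandas' `merge` also accepts `validate="one_to_one"`, which raises an error if the key is not unique on either side.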

In [20]:

df_all = pd.read_csv("reddit_thread_with_all_reason_categories.csv", dtype=str)
df_for = pd.read_csv("reddit_thread_with_for_categories.csv", dtype=str)


if "comment_id" not in df_all.columns or "comment_id" not in df_for.columns:
    raise ValueError("Both CSVs must contain comment_id for safe merging.")

df_for_subset = df_for[[
    "comment_id",
    "for_reason_category",
    "for_reason_explanation"
]]

df_merged = df_all.merge(
    df_for_subset,
    on="comment_id",
    how="left"
)

df_merged.to_csv("reddit_thread_with_all_reasons_complete.csv", index=False)

print("✓ Merged FOR reasons into the unified dataset.")
print("Saved as reddit_thread_with_all_reasons_complete.csv")

✓ Merged FOR reasons into the unified dataset.
Saved as reddit_thread_with_all_reasons_complete.csv


# Value

I ended up using this to filter the sheet in Excel and pull the most accurate values from the comments manually. Monetary values are really difficult to interpret automatically: some were what people pay for daycare, and some were sarcastic, e.g. 'oh you should pay for 18 years of raising you'. This prompt just gave me a general framework to filter by in Excel.
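
For the explicit-amount case, a regular expression can serve as a rough cross-check on the model's labels. A sketch assuming amounts are written like the examples in the prompt below (`$50`, `80/day`, `150 a week`); `AMOUNT_RE` and `has_explicit_amount` are hypothetical, and a pattern like this cannot tell a real rate from a sarcastic one, so the manual pass is still needed:

```python
import re

# Hypothetical pattern: a dollar figure, or a number followed by a rate
# such as "/day", "per week", or "a month".
AMOUNT_RE = re.compile(
    r"\$\s?\d[\d,]*(?:\.\d+)?"
    r"|\b\d[\d,]*(?:\.\d+)?\s?(?:/|per |a |an )\s?(?:hour|day|week|month|year)\b",
    re.IGNORECASE,
)


def has_explicit_amount(text):
    """Return True if the text appears to contain a specific money amount."""
    return bool(AMOUNT_RE.search(text))


for sample in ["$50", "80/day", "150 a week", "They should do it for free"]:
    print(sample, "->", has_explicit_amount(sample))
```

Rows where the regex and the model disagree are a reasonable place to start the manual review.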

In [22]:
df = pd.read_csv("reddit_thread_with_all_reasons_complete.csv", dtype=str)

target_rows = df[df["position"].isin(["for", "against"])].copy()

print(f"Processing {len(target_rows)} comments with clear positions (for/against).")

MONEY_PROMPT = """
Determine whether the comment references money.

Pick exactly ONE label:

1. yes_explicit_amount = mentions a specific monetary value, e.g. $50, 80/day, 150 a week, $1200 a month.
2. yes_money_words = mentions payment or cost without numbers. Words like pay, cost, charge, free, compensate.
3. no = no monetary reference at all.

OUTPUT FORMAT (exactly two lines):
category: <one of the three labels>
reason: <very short explanation>

Comment: <<<COMMENT>>>
"""

def clean_comment(text):
    if not isinstance(text, str):
        return ""
    text = text.replace("\x00", "")  # drop NUL characters, as in the earlier cells
    return text.encode("utf-8", "ignore").decode("utf-8", "ignore").strip()

def parse_two_line(raw):
    cat = "error"
    reason = raw.strip()

    for line in raw.splitlines():
        lower = line.lower().strip()
        if lower.startswith("category:"):
            cat = line.split(":", 1)[1].strip()
        elif lower.startswith("reason:"):
            reason = line.split(":", 1)[1].strip()

    return cat, reason

def classify_money(comment, retries=1):
    text = clean_comment(comment)
    prompt = MONEY_PROMPT.replace("<<<COMMENT>>>", text)

    try:
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": "Respond strictly in two lines."},
                {"role": "user", "content": prompt}
            ],
            max_completion_tokens=100,
            temperature=0
        )

        raw = response.choices[0].message.content
        cat, rea = parse_two_line(raw)

        if cat == "error" and retries > 0:
            return classify_money(comment, retries - 1)

        return cat, rea

    except Exception as e:
        if retries > 0:
            return classify_money(comment, retries - 1)
        return "error", str(e)


df["mentions_money_value"] = None
df["mentions_money_reason"] = None

for idx, row in target_rows.iterrows():
    comment = row["body"]
    print(f"Row {idx}...", end=" ")

    cat, rea = classify_money(comment)

    df.at[idx, "mentions_money_value"] = cat
    df.at[idx, "mentions_money_reason"] = rea

    print(f"→ {cat}")

df.to_csv("reddit_thread_with_money_detection.csv", index=False)
print("Done. Saved as reddit_thread_with_money_detection.csv.")

Processing 153 comments with clear positions (for/against).
Row 1... → no
Row 2... → no
Row 4... → yes_money_words
Row 12... → yes_money_words
Row 13... → yes_money_words
Row 15... → yes_money_words
Row 16... → no
Row 17... → yes_money_words
Row 19... → yes_money_words
Row 21... → yes_money_words
Row 25... → yes_money_words
Row 26... → yes_money_words
Row 28... → yes_money_words
Row 29... → yes_money_words
Row 30... → yes_money_words
Row 31... → yes_money_words
Row 32... → yes_money_words
Row 34... → no
Row 35... → no
Row 38... → yes_money_words
Row 39... → yes_money_words
Row 41... → yes_money_words
Row 45... → no
Row 47... → yes_money_words
Row 48... → yes_money_words
Row 49... → yes_money_words
Row 52... → yes_money_words
Row 53... → yes_money_words
Row 56... → yes_explicit_amount
Row 57... → yes_money_words
Row 58... → yes_money_words
Row 59... → yes_money_words
Row 62... → yes_money_words
Row 63... → yes_money_words
Row 65... → yes_money_words
Row 66... → no
Row 67... → yes_money_

# Analysis

This was the analysis I did of the final CSV with all of the For, Against and Reasons columns. It's very basic, and I'm not convinced there is enough there to provide a solid story; however, I think it aligns well with Clea's expectations of this dataset.

I then manually looked at the monetary values as discussed above.

In [32]:
import pandas as pd

df = pd.read_csv("reddit_thread_with_money_detection.csv", dtype=str)
df = df.fillna("")


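# Convert scores to numbers; blank or malformed values become NaN
# instead of silently leaving the whole column as strings.
df["score"] = pd.to_numeric(df["score"], errors="coerce")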


relevant_df = df[df["relevance"] == "relevant"]
total_relevant = len(relevant_df)


position_counts = relevant_df["position"].value_counts()
position_pct = (position_counts / total_relevant * 100).round(1)

position_table = pd.DataFrame({
    "count": position_counts,
    "percent": position_pct
})


# FOR
for_df = relevant_df[relevant_df["position"] == "for"]
for_total = len(for_df)

for_reason_counts = for_df["for_reason_category"].value_counts()
for_reason_pct = (for_reason_counts / for_total * 100).round(1)

for_reason_table = pd.DataFrame({
    "count": for_reason_counts,
    "percent": for_reason_pct
})

# AGAINST
against_df = relevant_df[relevant_df["position"] == "against"]
against_total = len(against_df)

against_reason_counts = against_df["against_reason_category"].value_counts()
against_reason_pct = (against_reason_counts / against_total * 100).round(1)

against_reason_table = pd.DataFrame({
    "count": against_reason_counts,
    "percent": against_reason_pct
})



top_comments = relevant_df.sort_values("score", ascending=False).head(10)[[
    "comment_id", "body", "position", "score", "mentions_money_value"
]]

avg_score_pos = relevant_df.groupby("position")["score"].mean().round(1)


print("\n=== POSITION BREAKDOWN (relevant comments only) ===")
print(position_table, "\n")

print("=== FOR-REASON CATEGORIES ===")
print(for_reason_table, "\n")

print("=== AGAINST-REASON CATEGORIES ===")
print(against_reason_table, "\n")




=== POSITION BREAKDOWN (relevant comments only) ===
          count  percent
position                
neutral     113     42.5
against     111     41.7
for          42     15.8 

=== FOR-REASON CATEGORIES ===
                                              count  percent
for_reason_category                                         
3. Fairness and avoiding exploitation             8     19.0
Compensation for labour and effort                7     16.7
1. Compensation for labour and effort             7     16.7
4. Covering expenses and out of pocket costs      6     14.3
2. Preventing financial harm or lost income       5     11.9
5. Systemic or policy reasons                     5     11.9
Preventing financial harm or lost income          3      7.1
Fairness and avoiding exploitation                1      2.4 

=== AGAINST-REASON CATEGORIES ===
                                                    count  percent
against_reason_category                                           
1. Family 