<a href="https://colab.research.google.com/github/casllmproject/bending_effect/blob/main/C1_1_DQPD_GPT_CA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Code Chunk 1: Setup, Configuration, and Data Loading
This chunk handles the initial setup: importing libraries, setting the configuration, initializing the OpenAI client, and loading the target data.
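The API key in the setup cell is left blank. One way to supply it without pasting it into the notebook is to read it from an environment variable. The sketch below assumes a variable named `OPENAI_API_KEY`; the helper `load_api_key` is hypothetical and not part of the original script.

```python
import os


def load_api_key(env=os.environ, var_name="OPENAI_API_KEY"):
    """Return the API key stored under `var_name` (hypothetical helper).

    Raises RuntimeError if the variable is missing or blank, so the
    notebook stops before any API call is attempted.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; define it before running the notebook.")
    return key
```

With this in place, the client could be created as `OpenAI(api_key=load_api_key())`, and a missing key would fail at setup rather than on the first coding request.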

In [None]:
import pandas as pd
import os
import json
from openai import OpenAI
from google.colab import drive
import time

# =========================================================================
# 0. CONFIGURATION & SETUP
# =========================================================================
# --- CONFIGURATION ---
TARGET_MODEL = "gpt-4-turbo"
MAX_RETRIES = 5              # Max attempts to code a single text on failure
INSPECTION_SIZE = 10         # Number of cases for the initial inspection batch
CODING_COLUMNS = ["RAT", "BGI", "ETV", "REC", "DIS", "ACK", "OPN", "SOL"]
FILE_PATH = "/content/drive/MyDrive/CYON_Analysis_Materials/Main_Test/Final_Cleaned_Dec18.csv"
OUTPUT_FILE_PATH = FILE_PATH.replace(".csv", "_FULL_CODED_USER.csv")

# --- SETUP ---
print("--- STARTING LLM CONTENT ANALYSIS SCRIPT ---")

# Set API Key
try:
    client = OpenAI(api_key=" ")
    print("✅ OpenAI client initialized.")
except Exception as e:
    print(f"❌ ERROR: Failed to initialize OpenAI client. Check API key. {e}")
    # Consider raising an error here to stop execution if the key is bad

# Load the Data
try:
    df2 = pd.read_csv(FILE_PATH)
    print(f"✅ DataFrame 'df2' loaded successfully with {len(df2)} rows.")
except Exception as e:
    print(f"❌ FATAL ERROR: Could not load file from {FILE_PATH}. Script stopping. {e}")
    raise

df2_original = df2.copy()
original_cols = df2_original.columns.tolist()

Code Chunk 2: Coding Instructions and Robust Function
This chunk defines the comprehensive system prompt and the robust function that handles the API call, JSON parsing, and retry logic.
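The function in this chunk only checks that all eight keys are present. A stricter check could also confirm that each value is an integer within the range the coding scheme allows (0/1/2, or 0/-1/-2 for DIS). The following is a minimal sketch of that idea; `parse_and_validate` and `allowed_values` are hypothetical names, not part of the original function.

```python
import json

CODING_COLUMNS = ["RAT", "BGI", "ETV", "REC", "DIS", "ACK", "OPN", "SOL"]


def allowed_values(key):
    """DIS is scored negatively in the coding scheme; all other codes are 0, 1, or 2."""
    return {0, -1, -2} if key == "DIS" else {0, 1, 2}


def parse_and_validate(raw_output):
    """Parse a JSON string and return the eight codes, or None if anything is off."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    codes = {}
    for key in CODING_COLUMNS:
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return None
        if value not in allowed_values(key):
            return None
        codes[key] = value
    return codes
```

A `None` result could be treated like the existing "missing keys" case and trigger another retry.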

In [None]:
# =========================================================================
# 1. CODING INSTRUCTIONS (SYSTEM PROMPT) - FULL DETAIL
# =========================================================================

CODING_INSTRUCTIONS = """
You are a highly-skilled social scientific research assistant specializing in content analysis.
Your task is to analyze a mobile message that shares a news article about U.S. climate policy and code it according to the following eight categories.
Your output MUST be a valid JSON object with the exact keys provided below.
Do not include any explanation, introductory text, or other characters outside of the JSON.

---
CODING SCHEME:
---
1. Rationality (RAT):
  - Reasoning: Does the sender explicitly try to justify the reason why they are sending this message?
  - Instruction: Code YES when the text explicitly elaborates why the sender wants the receiver to see this news. The use of words such as "because", "due to", "therefore", “reasons”, “why” can signal an attempt at justification, but texts without these words may still be coded as YES if the comment follows a clear line of reasoning.
  - Example: "it would be really cool for you to read this article to understand why I believe in policies", "we all need to do our part and climbing change because it affects us all and we all have to live on this planet."
  - Coding Scheme: No = 0, Yes = 1, Very YES (If such expressions appear more than once) = 2

2. Background Information (BGI):
  - Reasoning: Does the sender describe the contextual background for why they are sending the message?
  - Instruction: Code YES when the text describes the broader context or societal issues embedded in the topic.
  - Example: "Trump wants to withdraw from the Paris agreement", "In regard to the current financial conditions of the US..."
  - Coding Scheme: No = 0, Yes = 1, Very YES (If such expressions appear more than once) = 2

3. External Evidence (ETV):
  - Reasoning: Does the sender provide external evidence in their message?
  - Instruction: Code YES when the text elaborates on the opinion using facts, media sources, politicians’ statements, or other verifiable evidence. Code YES when the text refers to authorities or experts to support the message.
  - Example: "Recent studies show green energy is a rapidly growing economic sector", “Trump said...”, "Scientists show...", “According to the New York Times...”
  - Coding Scheme: No = 0, Yes = 1, Very YES (If such expressions appear more than once) = 2

4. Reciprocity (REC):
  - Reasoning: Does the sender express curiosity about the receiver’s thoughts or ask a question?
  - Instruction: Code YES when the text uses a question or expressions such as “curious”, “wonder”, “want to know”, "let's talk".
  - Example: "I'm curious to hear what you think about it.", "Please see this and let me know how you think", "Tell me..."
  - Coding Scheme: No = 0, Yes = 1, Very YES (If such expressions appear more than once) = 2

5. Disrespect (DIS):
  - Reasoning: Does the sender show uncivil attitudes toward the receiver or toward the position of a particular group?
  - Instruction: Code as YES when the text contains expressions implying that the opposing position on the issue is irrational, inferior, ridiculous, or should be excluded from discussion.
  - Example: "This is stupid", "idiot", “Trump’s withdrawal from the Paris Agreement will definitely hurt our country”, "They are liar", "Liberals are hypocratic", "They should stop..."
  - Coding Scheme: No = 0, Yes = -1, Very YES (If such expressions appear more than once) = -2

6. Acknowledgement (ACK):
  - Reasoning: Does the sender express how they view the receiver’s opinion of them or vice versa?
  - Instruction: Code YES when the text describes differences in opinion between the sender and the receiver.
  - Example: “your belief might be different than mine”, “I believe we have different views on climate change”, "although we disagree each other.."
  - Coding Scheme: No = 0, Yes = 1, Very YES (If such expressions appear more than once) = 2

7. Openness (OPN):
  - Reasoning: Does the sender suggest that different perspectives can be viewed fairly?
  - Instruction: Code YES when the text explicitly states that perspectives on the issue can be presented fairly, using expressions such as “fair”, “balanced”, “both sides”.
  - Example: “This is a pretty fair assessment of the situation that shows both sides of the debate”, “we can all get a more balanced perspective”
  - Coding Scheme: No = 0, Yes = 1, Very YES (If such expressions appear more than once) = 2

8. Solution (SOL):
  - Reasoning: Does the sender propose a joint solution to the receiver?
  - Instruction: Code YES when the text suggests common solutions or alternatives that people with differing positions on the issue may commonly accept.
  - Example: “There are alternative solutions that can lead to a more sustainable future”, "This solution...", "We need to work together", “the article can help us find some common grounds”
  - Coding Scheme: No = 0, Yes = 1, Very YES (If such expressions appear more than once) = 2
---
OUTPUT FORMAT:
Your final output MUST be a JSON object with the following keys:
{
  "RAT": <int>, "BGI": <int>, "ETV": <int>, "REC": <int>,
  "DIS": <int>, "ACK": <int>, "OPN": <int>, "SOL": <int>
}
"""

# =========================================================================
# 2. ROBUST CODING FUNCTION (Handles retry logic and errors)
# =========================================================================

def code_text_with_llm(text_to_code: str, model_name: str, max_retries: int):
    """Calls LLM API with retry logic and error handling (JSON, API failure)."""
    # Skip if text is missing or empty
    if pd.isna(text_to_code) or str(text_to_code).strip() == "":
        return None

    attempt = 0
    while attempt < max_retries:
        try:
            completion = client.chat.completions.create(
                model=model_name,
                response_format={"type": "json_object"},
                messages=[
                    {"role": "system", "content": CODING_INSTRUCTIONS},
                    {"role": "user", "content": f"Code the following news text: '{text_to_code}'"}
                ],
                temperature=0.0
            )

            raw_output = completion.choices[0].message.content
            coded_data = json.loads(raw_output)

            # Basic validation: ensure all 8 keys are present
            expected_keys = set(CODING_COLUMNS)
            if expected_keys.issubset(set(coded_data.keys())):
                return coded_data
            else:
                print(f"\n  --> Retry {attempt+1}/{max_retries}: JSON validation failed (missing keys).")
                attempt += 1
                time.sleep(2)

        except json.JSONDecodeError:
            print(f"\n  --> Retry {attempt+1}/{max_retries}: Failed to decode JSON response.")
            attempt += 1
            time.sleep(2)
        except Exception as e:
            print(f"\n  --> Retry {attempt+1}/{max_retries}: API Error ({type(e).__name__}). Waiting 5 seconds...")
            attempt += 1
            time.sleep(5)

    print(f"*** Failed to code text after {max_retries} attempts.")
    return None

Code Chunk 3: Phase 1 - Inspection Batch (First 10 Cases)
This chunk runs the first 10 cases and prints the results for the manual inspection.
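Phase 1 and Phase 2 run essentially the same loop. As a sketch of how the shared logic could be factored out, the hypothetical helper `code_batch` below takes any DataFrame and any `coder` callable that returns a dict of codes or `None`. It assumes the text lives in the `OE1` column, as in the cells of this notebook.

```python
import pandas as pd

CODING_COLUMNS = ["RAT", "BGI", "ETV", "REC", "DIS", "ACK", "OPN", "SOL"]


def code_batch(df, coder, text_column="OE1"):
    """Apply `coder` to every non-empty text and return (coded_df, coded, skipped).

    `coder` is any callable that maps a text to a dict of codes, or to None
    when coding fails. Rows that are empty or fail stay as pd.NA.
    """
    out = df.copy()
    for col in CODING_COLUMNS:
        out[col] = pd.NA

    coded, skipped = 0, 0
    for idx, text in out[text_column].items():
        if pd.isna(text) or str(text).strip() == "":
            skipped += 1
            continue
        result = coder(text)
        if result is None:
            skipped += 1
            continue
        for col in CODING_COLUMNS:
            out.loc[idx, col] = result.get(col, pd.NA)
        coded += 1

    # Nullable integers keep missing codes as <NA> instead of forcing floats.
    for col in CODING_COLUMNS:
        out[col] = out[col].astype("Int64")
    return out, coded, skipped
```

In this notebook the coder would be something like `lambda t: code_text_with_llm(t, TARGET_MODEL, MAX_RETRIES)`; passing a stub instead makes it possible to test the loop without spending API calls.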

In [None]:
# =========================================================================
# 3. PHASE 1: INSPECTION BATCH (First 10 Cases)
# =========================================================================

print("\n\n--- PHASE 1: INSPECTION BATCH (First 10 Cases) ---")

# 1. Split data
df2_inspection = df2.head(INSPECTION_SIZE).copy()
df2_remainder = df2.iloc[INSPECTION_SIZE:].copy()
print(f"Inspection Batch size: {len(df2_inspection)} | Remainder Batch size: {len(df2_remainder)}")

# Initialize columns with pd.NA for missing values
for col in CODING_COLUMNS:
    df2_inspection[col] = pd.NA

coded_count = 0
skipped_count = 0
total_rows = len(df2_inspection)

for i, row in df2_inspection.iterrows():
    text_to_code = row['OE1']

    # Show progress
    progress_percent = ((i - df2_inspection.index.min() + 1) / total_rows) * 100
    print(f"Processing case {i + 1}/{total_rows} ({progress_percent:.0f}%)...", end="\r")

    if pd.isna(text_to_code) or str(text_to_code).strip() == "":
        skipped_count += 1
        continue

    coded_values = code_text_with_llm(text_to_code, TARGET_MODEL, MAX_RETRIES)

    if coded_values:
        for key in CODING_COLUMNS:
            # Assign coded value or pd.NA if key is missing from LLM output
            df2_inspection.loc[i, key] = coded_values.get(key, pd.NA)
        coded_count += 1
    else:
        # If LLM failed to code, assign pd.NA to all coding columns for this row
        for key in CODING_COLUMNS:
            df2_inspection.loc[i, key] = pd.NA
        skipped_count += 1

# Convert coding columns to nullable integer type (Int64Dtype)
for col in CODING_COLUMNS:
    df2_inspection[col] = df2_inspection[col].astype(pd.Int64Dtype())

print(f"\n\n✅ PHASE 1 COMPLETE. Total Coded: {coded_count}, Skipped/Failed: {total_rows - coded_count}.")
print("\n*** INSPECTION BATCH RESULTS (df2_inspection) ***")
print(df2_inspection[['OE1'] + CODING_COLUMNS])

print("\n\n#################################################################")
print("🛑 PAUSE: Review the 10 cases above. Run the next cell ONLY if satisfied.")
print("#################################################################")

Code Chunk 4: Phase 2 - Remainder Batch
Run this chunk after you have reviewed and approved the results from Chunk 3. It codes the rest of the dataset.

In [None]:
# =========================================================================
# 4. PHASE 2: REMAINDER BATCH
# =========================================================================

# Initialize columns for the remainder DataFrame with pd.NA
for col in CODING_COLUMNS:
    df2_remainder[col] = pd.NA

coded_count = 0
skipped_count = 0
total_rows = len(df2_remainder)

print(f"\n\n--- PHASE 2: Coding Remainder Batch ({total_rows} cases) ---")

# Iterate over the Remainder Batch and apply the coding function
for i, row in df2_remainder.iterrows():
    text_to_code = row['OE1']

    # Show progress every 5 cases
    current_case_num = i - df2_remainder.index.min() + 1
    progress_percent = (current_case_num / total_rows) * 100
    if current_case_num % 5 == 0 or current_case_num == total_rows:
        print(f"Progress: {current_case_num}/{total_rows} rows processed ({progress_percent:.2f}%). Coded: {coded_count}, Skipped: {skipped_count}.")

    if pd.isna(text_to_code) or str(text_to_code).strip() == "":
        skipped_count += 1
        continue

    coded_values = code_text_with_llm(text_to_code, TARGET_MODEL, MAX_RETRIES)

    if coded_values:
        for key in CODING_COLUMNS:
            # Assign coded value or pd.NA if key is missing from LLM output
            df2_remainder.loc[i, key] = coded_values.get(key, pd.NA)
        coded_count += 1
    else:
        # If LLM failed to code, assign pd.NA to all coding columns for this row
        for key in CODING_COLUMNS:
            df2_remainder.loc[i, key] = pd.NA
        skipped_count += 1

print("\n‚úÖ PHASE 2 CODING COMPLETE.")
print(f"Final Counts: Successfully Coded = {coded_count}, Skipped/Failed = {total_rows - coded_count}.")

# Convert the new coding columns to nullable integers
for col in CODING_COLUMNS:
    df2_remainder[col] = df2_remainder[col].astype(pd.Int64Dtype())

Code Chunk 5: Concatenate and Save Results
This final chunk computes the composite score (DQPD), then combines the two batches and saves the complete, coded DataFrame to Google Drive.
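One detail of the row-wise sum is worth checking before analysis: with pandas defaults, missing codes are skipped, so a row that was never coded gets a composite of 0 rather than a missing value. The sketch below uses made-up data to contrast the default with `min_count`, which requires a minimum number of non-missing values per row. Whether uncoded rows should be missing is an analysis decision, not something the original script specifies.

```python
import pandas as pd

SUM_COLUMNS = ["RAT", "BGI", "ETV", "REC", "DIS", "ACK", "OPN", "SOL"]

# Toy data: row 0 is fully coded, row 1 was skipped (all codes missing).
toy = pd.DataFrame({col: pd.array([1, pd.NA], dtype="Int64") for col in SUM_COLUMNS})
toy["DIS"] = pd.array([-1, pd.NA], dtype="Int64")  # DIS is scored negatively

# Default behaviour: missing values are skipped, so the uncoded row sums to 0.
default_sum = toy[SUM_COLUMNS].sum(axis=1)

# Stricter variant: require all eight codes, otherwise the composite is missing.
strict_sum = toy[SUM_COLUMNS].sum(axis=1, min_count=len(SUM_COLUMNS))
```

Here the coded row sums to 6 under both variants (seven codes of 1 plus a DIS of -1), while the uncoded row is 0 under the default and missing under the strict variant.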

In [None]:
# =========================================================================
# 4b. CALCULATE COMPOSITE SCORE (DQPD) - Based on Created Columns
# =========================================================================

# Define the columns that were created by the LLM function and are summed below.
# DIS is included: it is coded 0/-1/-2, so disrespect lowers the composite score.
SUM_COLUMNS = ["RAT", "BGI", "ETV", "REC", "DIS", "ACK", "OPN", "SOL"]
COMP_COLUMN_NAME = "DQPD"

print("\n--- CALCULATING COMPOSITE SCORE ---")

# --- 1. Inspection Batch (df2_inspection) ---
try:
    # Calculate the row-wise sum for the specified columns
    df2_inspection[COMP_COLUMN_NAME] = df2_inspection[SUM_COLUMNS].sum(axis=1)
    print(f"✅ COMP score added to df2_inspection (Sum of {', '.join(SUM_COLUMNS)}).")
except KeyError as e:
    # This should not happen if the coding phases ran correctly
    print(f"❌ ERROR in df2_inspection: Column {e} not found. Check your coding phase results.")
    df2_inspection[COMP_COLUMN_NAME] = pd.NA

# --- 2. Remainder Batch (df2_remainder) ---
try:
    # Calculate the row-wise sum for the specified columns
    df2_remainder[COMP_COLUMN_NAME] = df2_remainder[SUM_COLUMNS].sum(axis=1)
    print(f"✅ COMP score added to df2_remainder (Sum of {', '.join(SUM_COLUMNS)}).")
except KeyError as e:
    print(f"❌ ERROR in df2_remainder: Column {e} not found. Check your coding phase results.")
    df2_remainder[COMP_COLUMN_NAME] = pd.NA

print(f"\nExample COMP values from Inspection Batch (first 5 rows):\n{df2_inspection[COMP_COLUMN_NAME].head()}")
print("You can now proceed with Chunk 5: CONCATENATE AND SAVE.")

In [None]:
# =========================================================================
# 5. CONCATENATE AND SAVE
# =========================================================================

# Combine the Inspection Batch and the Remainder Batch
df2_final_coded = pd.concat([df2_inspection, df2_remainder])

# Ensure the final DataFrame is sorted by the original index
df2_final_coded = df2_final_coded.sort_index()

try:
    df2_final_coded.to_csv(OUTPUT_FILE_PATH, index=False)
    print(f"\n\n✅ FULL CODED DataFrame successfully saved to: {OUTPUT_FILE_PATH}")
    print(f"Total rows in saved file: {len(df2_final_coded)}")
    print("The new file includes your original data plus the RAT, BGI, ETV, REC, DIS, ACK, OPN, SOL, DQPD columns.")
except Exception as e:
    print(f"❌ ERROR: Failed to save the final file. Check your Google Drive permissions. Error: {e}")