In [1]:
PROMPT_TEMPLATE = """
You are an expert in propaganda analysis. Analyze the following text. 
Always keep in mind that you are only looking for Russian propaganda, not any other type of propaganda. Any other type of propaganda should be ignored and labeled as "No Propaganda".

Text:
"{text}"

1. Identify the main label that best describes the text.
2. Identify the high level label that best describes the text.

Main labels:
•1 — Loaded language: Use emotionally charged or stereotyped wording to sway feelings.
•2 — Appeal to fear/prejudice: Try to persuade by warning of frightening outcomes or playing on biases.
•3 — Doubt: Undermine credibility by questioning motives, facts, or sources.
•4 — Name calling/labeling: Attack a person or group with insulting or demeaning labels.
•5 — Flag-waving: Justify a position by appealing to patriotism or group identity.
•6 — Exaggeration or minimization: Overstate benefits/harms or downplay them to mislead.
•7 — Causal oversimplification: Claim a simple cause for a complex issue or outcome.
•8 — Red herring: Introduce an irrelevant point to distract from the main issue.
•9 — Black-and-white fallacy: Present only two options and ignore reasonable alternatives.
•10 — Reductio ad hitlerum: Dismiss an idea by associating it with Nazis/Hitler or similarly reviled groups.
•11 — Appeal to authority: Cite an authority’s support as proof rather than evidence.
•12 — Straw man: Misrepresent an opponent’s argument, then refute the weaker version.
•13 — Thought-terminating cliches: Use stock phrases that shut down questioning or debate.
•14 — Whataboutism: Deflect criticism by accusing others of similar or worse behavior.
•15 — Slogans: Use a short, catchy phrase to promote an idea.
•16 — Bandwagon: Argue something is good/true because it’s popular or widely adopted.
•17 — Repetition: Repeat the same claim or phrase many times to make it stick.
•18 — no_propaganda: No relevant propaganda detected.

High level labels:

Group 1: Patriotic & Catchy Appeals  
    (Appeal through patriotism, group identity, or memorable slogans.)  
    • (5) Flag-waving – appeal to patriotism or group identity.  
    • (15) Slogans – short, catchy phrases to promote an idea.  

Group 2: Popularity Appeals  
    (Persuade by presenting an idea as popular or widely accepted.)  
    • (16) Bandwagon – argue something is good/true because it’s popular.  

Group 3: Deflections & Distractions  
    (Shift attention away from the issue through diversion or oversimplification.)  
    • (14) Whataboutism – deflect criticism by pointing to others’ behavior.  
    • (17) Repetition – repeat a claim to make it stick.  
    • (7) Causal oversimplification – claim a simple cause for a complex issue.  
    • (8) Red herring – distract with irrelevant points.  

Group 4: Emotional & Loaded Persuasion  
    (Exploit emotions or exaggeration to influence perception.)  
    • (1) Loaded language – emotionally charged wording.  
    • (10) Reductio ad hitlerum – dismiss by linking to Nazis/Hitler.  
    • (11) Appeal to authority – rely on authority instead of evidence.  
    • (2) Appeal to fear/prejudice – warn of frightening outcomes or biases.  
    • (4) Name-calling/labeling – attack with insulting labels.  
    • (6) Exaggeration or minimization – overstate or downplay effects.  

Group 5: Argument Manipulations  
    (Manipulate reasoning by misrepresenting arguments or limiting choices.)  
    • (12) Straw man – misrepresent opponent’s argument.  
    • (13) Thought-terminating clichés – shut down debate with stock phrases.  
    • (3) Doubt – question credibility or motives.  
    • (9) Black-and-white fallacy – present only two options.  

Group 6: No Propaganda  
    (Text contains no relevant propaganda techniques.)  
    • (18) no_propaganda – no relevant propaganda detected

Now return the labels in a JSON format with the following structure:
{{"main": integer,"high": integer}}

Follow the JSON format strictly. Do not add any additional text or explanations. Just use the integers for labeling, no strings.
Remember to always return integers only! Only return one integer that best describes the text. Never return multiple integers or a list of integers!
"""

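The two labels requested by the prompt are not independent: each main label belongs to exactly one high-level group. The sketch below copies that grouping from the prompt into a lookup table so that a returned pair can be checked for consistency. The names `HIGH_LEVEL_GROUPS`, `MAIN_TO_HIGH`, and `is_consistent` are illustrative assumptions and do not appear in the original notebook.

```python
# Group membership copied from the "High level labels" section of the prompt.
# All names below are hypothetical helpers, not part of the original notebook.
HIGH_LEVEL_GROUPS = {
    1: [5, 15],                 # Patriotic & Catchy Appeals
    2: [16],                    # Popularity Appeals
    3: [14, 17, 7, 8],          # Deflections & Distractions
    4: [1, 10, 11, 2, 4, 6],    # Emotional & Loaded Persuasion
    5: [12, 13, 3, 9],          # Argument Manipulations
    6: [18],                    # No Propaganda
}

# Invert the grouping: main label -> high-level group.
MAIN_TO_HIGH = {
    main: high
    for high, mains in HIGH_LEVEL_GROUPS.items()
    for main in mains
}


def is_consistent(main_label, high_label):
    """Return True if high_label is the group that contains main_label."""
    return main_label in MAIN_TO_HIGH and MAIN_TO_HIGH[main_label] == high_label


print(is_consistent(10, 4))
print(is_consistent(5, 4))
```

A check like this could flag pairs where the model picked a main label and a group that do not belong together, which the integer type check alone would not catch.
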
In [4]:
from openai import OpenAI
import json
import re
import pandas as pd

client = OpenAI()

def analyze_propaganda(text_to_analyze, max_retries=3):
    """Query the model for a text's labels; return (main, high, error), where error is None on success."""
    if not text_to_analyze.strip():
        return None, None, "Empty input"

    try:
        full_prompt = PROMPT_TEMPLATE.format(text=text_to_analyze)
    except Exception as e:
        return None, None, f"Prompt error: {e}"

    last_model_output = None

    for attempt in range(1, max_retries + 1):
        try:
            response = client.chat.completions.create(
                model="ft:gpt-4.1-nano-2025-04-14:veronika-solopova::C5zhnCWM",
                messages=[{"role": "user", "content": full_prompt}],
                temperature=0,
                response_format={
                    "type": "json_schema",
                    "json_schema": {
                        "name": "propaganda_schema",
                        "schema": {
                            "type": "object",
                            "properties": {
                                "main": {"type": "integer"},
                                "high": {"type": "integer"}
                            },
                            "required": ["main", "high"]
                        }
                    }
                },
                max_tokens=128,
            )

            model_output_text = response.choices[0].message.content.strip()
            last_model_output = model_output_text

            # 🔍 Print raw model output for debugging
            print("\n📨 Raw model output:\n", model_output_text)

            # Extract the outermost {...} span; re.DOTALL lets "." match newlines.
            json_match = re.search(r'\{.*\}', model_output_text, re.DOTALL)
            if not json_match:
                raise ValueError("No JSON object found in model output.")

            parsed_json = json.loads(json_match.group(0))
            main_label = parsed_json.get("main")
            high_label = parsed_json.get("high")

            # bool is a subclass of int, so reject it explicitly.
            labels = (main_label, high_label)
            if any(isinstance(v, bool) or not isinstance(v, int) for v in labels):
                raise ValueError(f"Labels must be integers, got: main={main_label}, high={high_label}")

            return main_label, high_label, None

        except (json.JSONDecodeError, ValueError, KeyError) as e:
            if attempt == max_retries:
                return None, None, f"Parsing failed after {max_retries} attempts: {e}\nRaw output: {last_model_output}"
        except Exception as e:
            if attempt == max_retries:
                return None, None, f"Unexpected error after {max_retries} attempts: {e}\nRaw output: {last_model_output}"

    return None, None, "Unknown error"


def process_pickle(input_pickle_path, output_json_path):
    """Label every text in a pickled DataFrame and write the 'main' and 'high' labels to a JSON file."""
    try:
        df = pd.read_pickle(input_pickle_path)
        print(f"✅ Loaded pickle '{input_pickle_path}' with {len(df)} rows.")
    except Exception as e:
        print(f"❌ Error loading pickle: {e}")
        return

    if not {"text", "id"}.issubset(df.columns):
        print("❌ DataFrame must contain both 'text' and 'id' columns.")
        return

    processed_data = []

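    # Count rows with enumerate so the progress message does not depend on the index type.
    for n, (_, row) in enumerate(df.iterrows(), start=1):
        record_id = row["id"]
        text = str(row["text"]).strip()

        print(f"\n🔍 Processing record {n} (ID: {record_id})...")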
        main_label, high_label, error_message = analyze_propaganda(text)

        record = {
            "id": record_id,
            "text": text,
            "main": main_label,
            "high": high_label,
            "error": error_message
        }

        processed_data.append(record)

    try:
        with open(output_json_path, 'w', encoding='utf-8') as f:
            json.dump(processed_data, f, ensure_ascii=False, indent=4)
        print(f"\n✅ Saved labeled results to '{output_json_path}'")
    except Exception as e:
        print(f"❌ Error saving JSON: {e}")

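`analyze_propaganda` accepts any pair of integers, including values the prompt never defines. A stricter check, sketched here, also confirms that each label falls inside the range given in the prompt (1 to 18 for main, 1 to 6 for high). `validate_labels` is a hypothetical helper, not part of the original code.

```python
def validate_labels(main_label, high_label):
    """Return an error message for invalid labels, or None if both are valid.

    Hypothetical helper: the upper bounds come from the prompt
    (18 main labels, 6 high-level groups).
    """
    for name, value, upper in (("main", main_label, 18), ("high", high_label, 6)):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return f"{name} must be an integer, got {value!r}"
        if not 1 <= value <= upper:
            return f"{name} must be between 1 and {upper}, got {value}"
    return None


print(validate_labels(10, 4))
print(validate_labels(19, 4))
```

One way to use it would be to call it right after parsing the model output and raise `ValueError` with the returned message, so that an out-of-range label triggers a retry like any other parsing failure.
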
In [5]:
if __name__ == "__main__":
    input_pickle_file = r"C:\Users\david\PY\golden_test_set_martino_extended_integer.pkl"
    output_json_file = r"C:\Users\david\PY\labeled_data_nano41_finetuned_high_and_main.json"
    process_pickle(input_pickle_file, output_json_file)
    

✅ Loaded pickle 'C:\Users\david\PY\golden_test_set_martino_extended_integer.pkl' with 200 rows.

🔍 Processing record 1 (ID: 1560717328412823552)...

📨 Raw model output:
 {"main": 10, "high": 4}

🔍 Processing record 2 (ID: 1498038424409817089)...

📨 Raw model output:
 {"main": 18, "high": 6}

🔍 Processing record 3 (ID: 1419340068314128385)...

📨 Raw model output:
 {"main": 18, "high": 6}

🔍 Processing record 4 (ID: 1512171601340497926)...

📨 Raw model output:
 {"main": 3, "high": 5}

🔍 Processing record 5 (ID: 1481806570723885061)...

📨 Raw model output:
 {"main": 18, "high": 6}

🔍 Processing record 6 (ID: 1499383587732221952)...

📨 Raw model output:
 {"main": 5, "high": 1}

🔍 Processing record 7 (ID: 1562035778389155840)...

📨 Raw model output:
 {"main": 18, "high": 6}

🔍 Processing record 8 (ID: 1498936437298843654)...

📨 Raw model output:
 {"main": 5, "high": 1}

🔍 Processing record 9 (ID: 1533124506767593474)...

📨 Raw model output:
 {"main": 18, "high": 6}

🔍 Processing record 10 (