# Concordance & reliability (temp = 0.1, n = 200) + code to run all abstracts

This notebook runs **three repeats** of the `infer` pipeline at temperature **0.1** on the first **200** abstracts,
then compares the runs pairwise to check self-consistency.

**Assumptions**
- Your working directory contains `5_concordance.py` and `human_review.xlsx`.
- Your Excel sheet with inputs is named `in` (change `SHEET` below if needed).
- Caching is disabled to force fresh API calls.

It also estimates concordance between raters (human vs. human, human vs. LLM).

In [57]:
# ---- Configuration ----
EXCEL = "human_review.xlsx"
SHEET = "in"             
TEMP = 0.1
LIMIT = 200
# Intended run names: temp0.1_run1, temp0.1_run2, temp0.1_run3.
# (The saved files in the outputs below appear as run_.1_runN_temp0.1_*.csv.)
RUN_PREFIX = "temp0.1_run"

print(f"EXCEL={EXCEL} SHEET={SHEET} TEMP={TEMP} LIMIT={LIMIT} RUN_PREFIX={RUN_PREFIX}")

import sys
PYTHON = sys.executable
print("Using:", PYTHON)

EXCEL=human_review.xlsx SHEET=in TEMP=0.1 LIMIT=200 RUN_PREFIX=temp0.1_run
Using: /Users/wangmengyao/Desktop/policyclaims/.venv/bin/python


## 1) Run inference (three repeats)

In [58]:
!$PYTHON 5_concordance.py infer \
  --excel "$EXCEL" \
  --sheet "$SHEET" \
  --prompt-name "${RUN_PREFIX}1" \
  --temperature "$TEMP" \
  --limit "$LIMIT" \
  --no-cache

Using 0 cached results, processing 200 new items
Processing: 100%|██████████| 200/200 [01:14<00:00,  2.68it/s, Saved 200 results]
[OK] Saved: /Users/wangmengyao/Desktop/policyclaims/concordance/concordance_outputs/run_.1_run1_temp0.1_2025-11-02_165332.csv
Total processed: 200 (0 from cache, 200 new)


In [59]:
# Repeat 2
!$PYTHON 5_concordance.py infer \
  --excel "$EXCEL" \
  --sheet "$SHEET" \
  --prompt-name "${RUN_PREFIX}2" \
  --temperature "$TEMP" \
  --limit "$LIMIT" \
  --no-cache

Using 0 cached results, processing 200 new items
Processing: 100%|██████████| 200/200 [01:09<00:00,  2.86it/s, Saved 200 results]
[OK] Saved: /Users/wangmengyao/Desktop/policyclaims/concordance/concordance_outputs/run_.1_run2_temp0.1_2025-11-02_165445.csv
Total processed: 200 (0 from cache, 200 new)


In [60]:
# Repeat 3
!$PYTHON 5_concordance.py infer \
  --excel "$EXCEL" \
  --sheet "$SHEET" \
  --prompt-name "${RUN_PREFIX}3" \
  --temperature "$TEMP" \
  --limit "$LIMIT" \
  --no-cache

Using 0 cached results, processing 200 new items
Processing: 100%|██████████| 200/200 [01:13<00:00,  2.74it/s, Saved 200 results]
[OK] Saved: /Users/wangmengyao/Desktop/policyclaims/concordance/concordance_outputs/run_.1_run3_temp0.1_2025-11-02_165602.csv
Total processed: 200 (0 from cache, 200 new)
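
The three inference cells above differ only in the run index. As a minimal sketch, the same calls could be built in a loop, assuming the `infer` flags shown above. `build_infer_command` and `run_repeats` are hypothetical helpers, not part of `5_concordance.py`. Passing the arguments as a list also bypasses shell variable expansion, so the prompt name reaches the script exactly as it was built in Python.

```python
import subprocess
import sys


def build_infer_command(python, excel, sheet, run_prefix, run_index, temperature, limit):
    """Return the argument list for one `infer` repeat (hypothetical helper)."""
    return [
        python, "5_concordance.py", "infer",
        "--excel", excel,
        "--sheet", sheet,
        "--prompt-name", f"{run_prefix}{run_index}",
        "--temperature", str(temperature),
        "--limit", str(limit),
        "--no-cache",
    ]


def run_repeats(n_repeats=3, dry_run=True):
    """Build the command for each repeat; execute them only if dry_run is False."""
    commands = [
        build_infer_command(sys.executable, "human_review.xlsx", "in",
                            "temp0.1_run", i, 0.1, 200)
        for i in range(1, n_repeats + 1)
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stop at the first failing run
    return commands
```

With `dry_run=True` (the default here) nothing is executed, so the commands can be inspected before any API calls are made.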


## 2) Pairwise comparisons (run-to-run reliability)

In [61]:
import glob
import itertools
import os

print("--- Step 1: Finding all run files ---")
run_files = {}
all_files_found = True

# Find the output CSV for each of the three runs
for i in [1, 2, 3]:
    pattern = f"../concordance/concordance_outputs/run_*{i}_temp{TEMP}_*.csv"
    found_files = glob.glob(pattern)
    
    if len(found_files) == 1:
        run_files[i] = found_files[0]
        print(f"Found Run {i}: {run_files[i]}")
    else:
        print(f"Error: Expected 1 file for pattern '{pattern}', but found {len(found_files)}.")
        all_files_found = False

# Proceed only if all three files were located successfully
if all_files_found:
    print("\n--- Step 2: Running all pairwise comparisons ---")
    
    # Automatically generate pairs: (1, 2), (1, 3), (2, 3)
    run_numbers = sorted(run_files.keys())
    for i, j in itertools.combinations(run_numbers, 2):
        file_a = run_files[i]
        file_b = run_files[j]
        
        print(f"\n----- Comparing Run {i} vs Run {j} -----")
        
        # Run the compare command with the specific files found
        !$PYTHON 5_concordance.py compare --run-a {file_a} --run-b {file_b}
else:
    print("\nSkipping comparisons because one or more run files could not be found.")

--- Step 1: Finding all run files ---
Found Run 1: ../concordance/concordance_outputs/run_.1_run1_temp0.1_2025-11-02_165332.csv
Found Run 2: ../concordance/concordance_outputs/run_.1_run2_temp0.1_2025-11-02_165445.csv
Found Run 3: ../concordance/concordance_outputs/run_.1_run3_temp0.1_2025-11-02_165602.csv

--- Step 2: Running all pairwise comparisons ---

----- Comparing Run 1 vs Run 2 -----
# A/B Concordance
- Run A: `../concordance/concordance_outputs/run_.1_run1_temp0.1_2025-11-02_165332.csv`
- Run B: `../concordance/concordance_outputs/run_.1_run2_temp0.1_2025-11-02_165445.csv`

**Percent agreement:** 0.990
**Cohen's kappa:** 0.973

## Confusion matrix
Labels: ['NO', 'YES']
```
[149, 0]
[2, 49]
```

[OK] Saved A/B report: /Users/wangmengyao/Desktop/policyclaims/concordance/concordance_reports/ab_compare_2025-11-02_165608.md

----- Comparing Run 1 vs Run 3 -----
# A/B Concordance
- Run A: `../concordance/concordance_outputs/run_.1_run1_temp0.1_2025-11-02_165332.csv`
- Run B: `../co
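
The reported statistics can be rechecked by hand from a confusion matrix. The sketch below recomputes percent agreement and Cohen's kappa from the Run 1 vs Run 2 matrix printed above; `agreement_and_kappa` is an illustrative helper and not the function used inside `5_concordance.py`.

```python
def agreement_and_kappa(matrix):
    """Percent agreement and Cohen's kappa from a square confusion matrix."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Chance agreement: product of the two raters' marginal shares, summed over labels
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


observed, kappa = agreement_and_kappa([[149, 0], [2, 49]])
print(f"{observed:.3f} {kappa:.3f}")  # prints "0.990 0.973"
```

Here 198 of 200 labels agree (0.990), chance agreement is 24998/40000 = 0.625, and kappa is therefore (0.990 - 0.625) / (1 - 0.625) ≈ 0.973, matching the report.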

## 3) Concordance (Human vs. Human)

In [62]:
# Paths to the run files generated from the human reviews (DB, EC, MW)
db_review = '../concordance/concordance_outputs/run_from_DBreview_2025-09-02_213212.csv'
ec_review = '../concordance/concordance_outputs/run_from_ECreview_2025-09-02_213357.csv'
mw_review = '../concordance/concordance_outputs/run_from_MWreview_2025-09-02_213416.csv'
print(f"\n--- Comparing Human Rater DB vs. EC ---")
!$PYTHON 5_concordance.py compare \
  --run-a {db_review} \
  --run-b {ec_review}


--- Comparing Human Rater DB vs. EC ---
# A/B Concordance
- Run A: `../concordance/concordance_outputs/run_from_DBreview_2025-09-02_213212.csv`
- Run B: `../concordance/concordance_outputs/run_from_ECreview_2025-09-02_213357.csv`

**Percent agreement:** 0.949
**Cohen's kappa:** 0.834

## Confusion matrix
Labels: ['NO', 'YES']
```
[77, 4]
[1, 16]
```

[OK] Saved A/B report: /Users/wangmengyao/Desktop/policyclaims/concordance/concordance_reports/ab_compare_2025-11-02_165612.md


## 4) Concordance (LLM vs. Human)

In [64]:


# Use the Run 1 output generated earlier in this notebook
llm_run = run_files[1] 

print(f"--- Comparing LLM Run ({os.path.basename(llm_run)}) vs. DB Review ---")
!$PYTHON 5_concordance.py compare \
  --run-a {llm_run} \
  --run-b {db_review}

print(f"\n--- Comparing LLM Run ({os.path.basename(llm_run)}) vs. EC Review ---")
!$PYTHON 5_concordance.py compare \
  --run-a {llm_run} \
  --run-b {ec_review}

print(f"\n--- Comparing LLM Run ({os.path.basename(llm_run)}) vs. MW Review ---")
!$PYTHON 5_concordance.py compare \
  --run-a {llm_run} \
  --run-b {mw_review}

--- Comparing LLM Run (run_.1_run1_temp0.1_2025-11-02_165332.csv) vs. DB Review ---
# A/B Concordance
- Run A: `../concordance/concordance_outputs/run_.1_run1_temp0.1_2025-11-02_165332.csv`
- Run B: `../concordance/concordance_outputs/run_from_DBreview_2025-09-02_213212.csv`

**Percent agreement:** 0.908
**Cohen's kappa:** 0.725

## Confusion matrix
Labels: ['NO', 'YES']
```
[73, 1]
[8, 16]
```

[OK] Saved A/B report: /Users/wangmengyao/Desktop/policyclaims/concordance/concordance_reports/ab_compare_2025-11-02_170552.md

--- Comparing LLM Run (run_.1_run1_temp0.1_2025-11-02_165332.csv) vs. EC Review ---
# A/B Concordance
- Run A: `../concordance/concordance_outputs/run_.1_run1_temp0.1_2025-11-02_165332.csv`
- Run B: `../concordance/concordance_outputs/run_from_ECreview_2025-09-02_213357.csv`

**Percent agreement:** 0.882
**Cohen's kappa:** 0.707

## Confusion matrix
Labels: ['NO', 'YES']
```
[73, 3]
[10, 24]
```

[OK] Saved A/B report: /Users/wangmengyao/Desktop/policyclaims/concordanc
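
To put the kappa values reported above side by side, one option is to label them with the descriptive bands of Landis and Koch (1977). These bands are a reporting convention rather than a property of the data, and `landis_koch_band` is a hypothetical helper added here for illustration.

```python
def landis_koch_band(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive band."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Kappa values copied from the comparison outputs in this notebook
reported = {
    "LLM run 1 vs run 2": 0.973,
    "Human DB vs EC": 0.834,
    "LLM run 1 vs DB": 0.725,
    "LLM run 1 vs EC": 0.707,
}
for pair, kappa in reported.items():
    print(f"{pair}: {kappa:.3f} ({landis_koch_band(kappa)})")
```

Under this convention the run-to-run and human-to-human values fall in the "almost perfect" band, and the two LLM-to-human values fall in the "substantial" band.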

## 5) Run on all abstracts

Recommendation: copy the command below into a terminal rather than running it in the notebook. A terminal is safer for a long-running process.

Estimated time (40k abstracts): 8-10 hours and $3-5, assuming discounted DeepSeek rates.

How to resume: if the process is interrupted, run the exact same command again and it picks up where it left off.
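
The script's internals are not shown in this notebook. As a sketch of how resume-by-rerun commonly works, the example below skips records that already carry a classification. It assumes results are keyed by `scopus_id` and stored with an `llm_policy_claim` field (both names appear in the output columns). The JSON layout (a list of records) and the helper names are assumptions.

```python
import json
from pathlib import Path


def load_done_ids(output_path):
    """Return ids already classified in a previous, possibly interrupted, run."""
    path = Path(output_path)
    if not path.exists():
        return set()  # first run: nothing to skip
    # Assumed layout: a JSON list of record dictionaries
    records = json.loads(path.read_text(encoding="utf-8"))
    return {r["scopus_id"] for r in records if r.get("llm_policy_claim") is not None}


def pending_records(all_records, done_ids):
    """Keep only the records that still need an API call."""
    return [r for r in all_records if r["scopus_id"] not in done_ids]
```

Because the skip list is rebuilt from the output file on every start, rerunning the same command is enough to continue an interrupted run.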

Note: the `caffeinate` command prevents macOS from sleeping during the run.
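
If the run is launched from Python rather than typed by hand, the `caffeinate` prefix can be added only where it exists. This is a minimal sketch; `keep_awake` is a hypothetical helper, and the `system` parameter exists only to make the behaviour easy to check on any platform.

```python
import platform
import shutil


def keep_awake(command, system=None):
    """Prefix `command` with caffeinate on macOS; return it unchanged elsewhere."""
    system = system or platform.system()
    if system == "Darwin" and shutil.which("caffeinate"):
        return ["caffeinate", *command]
    return list(command)
```

For example, `keep_awake(["python3", "script.py", "input.json"])` returns the command unchanged on Linux or Windows and prefixed with `caffeinate` on a Mac.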

### Outputs
- A JSON file (e.g., `all_abstracts_LLM.json`): the main output file, containing the original data plus the new `llm_policy_claim` classification for each abstract.
- A CSV file (e.g., `all_abstracts_LLM.csv`): the same results in CSV form for easier use in other programs.
- A "missed" CSV file (e.g., `data/all_abstracts_LLM_missed.csv`): only the abstracts that failed to be classified after all API retries, which is useful for debugging.
- A log file (e.g., `data/all_abstracts_LLM_missed.log`): tracks the progress of the run, noting how many abstracts are classified at different stages (checkpoints and final).
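
Once the run finishes, the CSV output can be tallied to cross-check the counts the script prints. The sketch below uses only the standard library. It assumes the `llm_policy_claim` column holds the text `True` or `False` and that an empty cell means the abstract was not classified; both are assumptions about the file, and the helper names are hypothetical.

```python
import csv


def summarize_claims(rows, column="llm_policy_claim"):
    """Count classified, positive, and unclassified rows."""
    total = positive = missing = 0
    for row in rows:
        total += 1
        value = (row.get(column) or "").strip().lower()
        if value == "":
            missing += 1      # assumed: an empty cell means "not classified"
        elif value == "true":
            positive += 1
    return {"total": total, "classified": total - missing,
            "positive": positive, "unclassified": missing}


def summarize_file(csv_path):
    """Read the results CSV and summarize its classification column."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return summarize_claims(csv.DictReader(handle))
```

Calling `summarize_file` on the results CSV should give totals that can be compared with the summary the script prints at the end of its run.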

In [37]:
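# Paths are relative to the project root, so run these from a terminal opened there.
# Run from this notebook (working directory: code/), they resolve to code/code/... and fail, as the errors below show.
!python "code/3_llm process_API.py" "data/json_files/filtered/all_abstracts.json"

# Same run wrapped in caffeinate, which keeps macOS awake so the API calls are not interrupted
!caffeinate python3 "code/3_llmprocess_API.py" "data/json_files/filtered/all_abstracts.json"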


/Users/wangmengyao/Desktop/policyclaims/.venv/bin/python: can't open file '/Users/wangmengyao/Desktop/policyclaims/code/code/3_llm process_API.py': [Errno 2] No such file or directory
/Users/wangmengyao/Desktop/policyclaims/.venv/bin/python3: can't open file '/Users/wangmengyao/Desktop/policyclaims/code/code/3_llmprocess_API.py': [Errno 2] No such file or directory


✅ Classification Complete!

Total abstracts in output file: 45808
Successfully classified: 45807 (11735 True)
Failed/unclassified: 1
Total script execution time: 35445.78 seconds


[About 9.8 hours in total, and approximately $3 in API fees at the old rates.]
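
The figures in the run summary above convert as follows. This is plain arithmetic on the printed numbers, with no assumptions beyond them.

```python
elapsed_seconds = 35445.78
total_abstracts = 45808
classified = 45807
positive = 11735

hours = elapsed_seconds / 3600
rate = total_abstracts / elapsed_seconds
share_positive = positive / classified

print(f"{hours:.1f} h, {rate:.2f} abstracts/s, {share_positive:.1%} classified True")
# prints "9.8 h, 1.29 abstracts/s, 25.6% classified True"
```

The average rate of about 1.29 abstracts per second is roughly half the 2.68 to 2.86 it/s reported for the 200-abstract runs earlier in the notebook.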

### Validate the main results against the n=200 validation set

E.g., compare `outputs/run_.1_run1_temp0.1_2025-09-03_095832.csv` with `json_files/filtered/all_abstracts_LLM.csv`.

Agreement should be similar to the run-to-run reliability found in the earlier checks.

In [71]:
import pandas as pd

val_file = "../concordance/concordance_outputs/run_.1_run1_temp0.1_2025-11-02_165332.csv"
val_df = pd.read_csv(val_file)
main_file = "../data/json_files/filtered/all_abstracts_LLM.csv"
main_df = pd.read_csv(main_file)

print("Validation set columns:", val_df.columns.tolist())
print("Main file columns:", main_df.columns.tolist())

Validation set columns: ['id', 'scopus_id', 'doi', 'title', 'prompt_name', 'prompt_hash', 'model', 'temperature', 'llm_output', 'llm_label']
Main file columns: ['scopus_id', 'doi', 'title', 'journal', 'publication_year', 'keywords', 'abstract', 'article_type', 'corresponding_author_country', 'cited_by_count', 'llm_policy_claim']
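
The two files share `scopus_id`, `doi`, and `title`. Before merging on `scopus_id`, it can help to list the validation ids that have no counterpart in the main file, since those rows drop out of an inner merge silently. This sketch uses plain Python and the same normalization the merge cell applies (string, trimmed, lower-case); the helper names are hypothetical.

```python
def normalize_id(value):
    """Normalize an id for matching: string, trimmed, lower-case."""
    return str(value).strip().lower()


def unmatched_ids(validation_ids, main_ids):
    """Return validation ids that have no counterpart in the main file."""
    main_set = {normalize_id(v) for v in main_ids}
    return sorted({normalize_id(v) for v in validation_ids} - main_set)
```

With the data frames loaded above, `unmatched_ids(val_df["scopus_id"], main_df["scopus_id"])` would list any validation abstracts missing from the main file.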


In [69]:
import pandas as pd

val_file = "../concordance/concordance_outputs/run_.1_run3_temp0.1_2025-11-02_165602.csv"
val_df = pd.read_csv(val_file)
main_file = "../data/json_files/filtered/all_abstracts_LLM.csv"
main_df = pd.read_csv(main_file)

# Standardize scopus_id for robust matching
val_df['scopus_id'] = val_df['scopus_id'].astype(str).str.strip().str.lower()
main_df['scopus_id'] = main_df['scopus_id'].astype(str).str.strip().str.lower()

# Use llm_label as the validation set's policy claim
val_df['policy_claim_val'] = val_df['llm_label'].astype(str).str.strip().str.upper() == "YES"
main_df['main_claim'] = main_df['llm_policy_claim'].astype(bool)

# Merge on scopus_id
merged = pd.merge(val_df, main_df, on="scopus_id", suffixes=('_val', '_main'))

# Only keep rows where both are not missing
mask = merged['llm_label'].notna() & merged['llm_policy_claim'].notna()
filtered = merged[mask].copy()

# Calculate agreement
agreement = (filtered["policy_claim_val"] == filtered["main_claim"]).mean()

# Also compute Cohen's kappa (chance-corrected agreement)
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(filtered["policy_claim_val"], filtered["main_claim"])

print("Comparing variables:")
print("  Validation set: llm_label (converted to boolean as 'policy_claim_val')")
print("  Main file: llm_policy_claim (converted to boolean as 'main_claim')")
print(f"Number of matched, non-missing rows: {len(filtered)}")
print(f"Concordance between validation set and main results: {agreement*100:.1f}% ({agreement*len(filtered):.0f}/{len(filtered)})")
print(f"Cohen's kappa: {kappa:.3f}")

# Show 5 sample rows
print("\nSample comparison rows:")
print(filtered[['scopus_id', 'llm_label', 'llm_policy_claim', 'policy_claim_val', 'main_claim']].head(5))

# Show up to 10 discordant rows
discordant = filtered[filtered["policy_claim_val"] != filtered["main_claim"]]
print(f"\nNumber of discordant entries: {len(discordant)}")
if len(discordant) > 0:
    print("Sample discordant entries:")
    print(discordant[['scopus_id', 'llm_label', 'llm_policy_claim', 'policy_claim_val', 'main_claim']].head(10).to_string(index=False))

Comparing variables:
  Validation set: llm_label (converted to boolean as 'policy_claim_val')
  Main file: llm_policy_claim (converted to boolean as 'main_claim')
Number of matched, non-missing rows: 195
Concordance between validation set and main results: 96.4% (188/195)
Cohen's kappa: 0.900

Sample comparison rows:
               scopus_id llm_label llm_policy_claim  policy_claim_val  \
0   scopus_id:0032978484        NO            False             False   
1   scopus_id:0032931075        NO            False             False   
2  scopus_id:84895890934        NO            False             False   
3  scopus_id:78650633722        NO            False             False   
4  scopus_id:85066260594       YES            False              True   

   main_claim  
0       False  
1       False  
2       False  
3       False  
4       False  

Number of discordant entries: 7
Sample discordant entries:
            scopus_id llm_label llm_policy_claim  policy_claim_val  main_claim
scopus_