# Inter-Annotator Agreement Analysis

This notebook analyzes the inter-annotator agreement for the IVC (Index of Content Validity) dataset.

## Dataset Overview
- **Source**: IVC.xlsx file from the 20250915 folder
- **Number of items**: 14 questions (Q01-Q14)
- **Number of annotators**: 4 evaluators (avaliador_1 through avaliador_4)
- **Response type**: Categorical (Sim/Não - Yes/No)

## Analysis Approach
1. **Fleiss' Kappa**: Multi-rater agreement statistic for categorical data
2. **Cohen's Kappa**: Pairwise agreement between individual annotators
3. **Accuracy Matrix**: Pairwise accuracy scores
4. **Visualizations**: Heatmaps, bar charts, and gauge charts

## Key Sections
- Data loading and preprocessing
- Fleiss' Kappa calculation and interpretation
- Pairwise agreement analysis
- Visual representations
- Detailed findings and recommendations

In [1]:
# Importando bibliotecas necessárias
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
from pathlib import Path
import warnings

warnings.filterwarnings("ignore")

# Configurações de visualização
pio.templates.default = "plotly_white"
pd.set_option("display.max_columns", None)
pd.set_option("display.width", None)
# set default to plotly in pandas
pd.options.plotting.backend = "plotly"

In [2]:
# Definindo caminhos dos arquivos
data_path = Path("20250915")

# Carregando os arquivos
print("Loading files...")

# Dados principais para análises dos pontos 1-3
db_val2 = pd.read_excel(data_path / "DB_val2.xlsx")
print(f"DB_val2.xlsx loaded: {db_val2.shape[0]} rows, {db_val2.shape[1]} columns")

# Dicionário de espécies
dicionario = pd.read_excel(data_path / "Dicionário.xlsx")
print(f"Dicionário.xlsx loaded: {dicionario.shape[0]} rows, {dicionario.shape[1]} columns")

# Dados para análise do ponto 4 (IVC)
ivc = pd.read_excel(data_path / "IVC.xlsx")
print(f"IVC.xlsx loaded: {ivc.shape[0]} rows, {ivc.shape[1]} columns")

Loading files...
DB_val2.xlsx loaded: 155 rows, 98 columns
Dicionário.xlsx loaded: 207 rows, 7 columns
IVC.xlsx loaded: 14 rows, 9 columns
Dicionário.xlsx loaded: 207 rows, 7 columns
IVC.xlsx loaded: 14 rows, 9 columns


In [3]:
ivc

Unnamed: 0,Item,Avaliador 1,Avaliador 2,Avaliador 3,Avaliador 4,"N° de ""sim""",IVC-i,Unnamed: 7,Unnamed: 8
0,Q01,Sim,Sim,Sim,Não,3,0.75,,"IVC-i = (Nº de Juízes que votaram ""SIM"") / (Nº..."
1,Q02,Sim,Sim,Sim,Não,3,0.75,,
2,Q03,Sim,Sim,Sim,Não,3,0.75,,
3,Q04,Sim,Sim,Sim,Não,3,0.75,,
4,Q05,Sim,Sim,Sim,Não,3,0.75,,
5,Q06,Sim,Sim,Sim,Não,3,0.75,,
6,Q07,Sim,Sim,Sim,Não,3,0.75,,
7,Q08,Sim,Sim,Sim,Não,3,0.75,,
8,Q09,Sim,Sim,Sim,Não,3,0.75,,
9,Q10,Sim,Sim,Sim,Não,3,0.75,,


In [4]:
ivc.columns = [i.lower().strip().replace(" ", "_") for i in ivc.columns]
ivc

Unnamed: 0,item,avaliador_1,avaliador_2,avaliador_3,avaliador_4,"n°_de_""sim""",ivc-i,unnamed:_7,unnamed:_8
0,Q01,Sim,Sim,Sim,Não,3,0.75,,"IVC-i = (Nº de Juízes que votaram ""SIM"") / (Nº..."
1,Q02,Sim,Sim,Sim,Não,3,0.75,,
2,Q03,Sim,Sim,Sim,Não,3,0.75,,
3,Q04,Sim,Sim,Sim,Não,3,0.75,,
4,Q05,Sim,Sim,Sim,Não,3,0.75,,
5,Q06,Sim,Sim,Sim,Não,3,0.75,,
6,Q07,Sim,Sim,Sim,Não,3,0.75,,
7,Q08,Sim,Sim,Sim,Não,3,0.75,,
8,Q09,Sim,Sim,Sim,Não,3,0.75,,
9,Q10,Sim,Sim,Sim,Não,3,0.75,,


In [5]:
ivc.set_index("item", inplace=True)
ivc

Unnamed: 0_level_0,avaliador_1,avaliador_2,avaliador_3,avaliador_4,"n°_de_""sim""",ivc-i,unnamed:_7,unnamed:_8
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Q01,Sim,Sim,Sim,Não,3,0.75,,"IVC-i = (Nº de Juízes que votaram ""SIM"") / (Nº..."
Q02,Sim,Sim,Sim,Não,3,0.75,,
Q03,Sim,Sim,Sim,Não,3,0.75,,
Q04,Sim,Sim,Sim,Não,3,0.75,,
Q05,Sim,Sim,Sim,Não,3,0.75,,
Q06,Sim,Sim,Sim,Não,3,0.75,,
Q07,Sim,Sim,Sim,Não,3,0.75,,
Q08,Sim,Sim,Sim,Não,3,0.75,,
Q09,Sim,Sim,Sim,Não,3,0.75,,
Q10,Sim,Sim,Sim,Não,3,0.75,,


In [6]:
ivc = ivc[[col for col in ivc.columns if "avaliador_" in col]]
ivc

Unnamed: 0_level_0,avaliador_1,avaliador_2,avaliador_3,avaliador_4
item,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Q01,Sim,Sim,Sim,Não
Q02,Sim,Sim,Sim,Não
Q03,Sim,Sim,Sim,Não
Q04,Sim,Sim,Sim,Não
Q05,Sim,Sim,Sim,Não
Q06,Sim,Sim,Sim,Não
Q07,Sim,Sim,Sim,Não
Q08,Sim,Sim,Sim,Não
Q09,Sim,Sim,Sim,Não
Q10,Sim,Sim,Sim,Não


In [7]:
ivc.isnull().sum()

avaliador_1    0
avaliador_2    0
avaliador_3    0
avaliador_4    0
dtype: int64

In [8]:
ivc.max(), ivc.min()

(avaliador_1    Sim
 avaliador_2    Sim
 avaliador_3    Sim
 avaliador_4    Não
 dtype: object,
 avaliador_1    Sim
 avaliador_2    Sim
 avaliador_3    Sim
 avaliador_4    Não
 dtype: object)

In [9]:
# Examining the data structure
print("Round 1 shape:", ivc.shape)
print("\nRound 1 sample:")
print(ivc.head())
print("\nUnique values in ivc:")
print(ivc.stack().unique())


Round 1 shape: (14, 4)

Round 1 sample:
     avaliador_1 avaliador_2 avaliador_3 avaliador_4
item                                                
Q01          Sim         Sim         Sim         Não
Q02          Sim         Sim         Sim         Não
Q03          Sim         Sim         Sim         Não
Q04          Sim         Sim         Sim         Não
Q05          Sim         Sim         Sim         Não

Unique values in ivc:
['Sim' 'Não']


In [10]:
# Import required libraries for inter-annotator agreement
from sklearn.metrics import cohen_kappa_score
from itertools import combinations
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score

# Create images directory if it doesn't exist
images_path = Path("src/survey_garden/images")
images_path.mkdir(parents=True, exist_ok=True)
print(f"Images will be saved to: {images_path}")

Images will be saved to: src/survey_garden/images


In [11]:
# Functions for calculating inter-annotator agreement metrics


def calculate_pairwise_agreement(data, metric="cohen_kappa"):
    """
    Calculate pairwise agreement between all annotators

    Parameters:
    data: DataFrame with annotators as columns
    metric: 'cohen_kappa', 'accuracy', 'pearson', 'spearman'

    Returns:
    DataFrame with pairwise agreement scores
    """
    annotators = data.columns
    n_annotators = len(annotators)

    # Initialize results matrix
    agreement_matrix = np.zeros((n_annotators, n_annotators))

    for i, ann1 in enumerate(annotators):
        for j, ann2 in enumerate(annotators):
            if i == j:
                agreement_matrix[i, j] = 1.0  # Perfect agreement with self
            elif i < j:  # Calculate only upper triangle
                if metric == "cohen_kappa":
                    score = cohen_kappa_score(data[ann1], data[ann2])
                elif metric == "accuracy":
                    score = accuracy_score(data[ann1], data[ann2])
                elif metric == "pearson":
                    score, _ = pearsonr(data[ann1], data[ann2])
                elif metric == "spearman":
                    score, _ = spearmanr(data[ann1], data[ann2])

                agreement_matrix[i, j] = score
                agreement_matrix[j, i] = score  # Symmetric matrix

    return pd.DataFrame(agreement_matrix, index=annotators, columns=annotators)


def calculate_fleiss_kappa(data):
    """
    Calculate Fleiss' kappa for multi-rater agreement

    Parameters:
    data: DataFrame with annotators as columns and items as rows

    Returns:
    float: Fleiss' kappa value
    dict: Additional statistics including P_bar, P_e, and interpretation
    """
    n_items, n_annotators = data.shape

    # Get all unique categories/labels
    all_categories = pd.unique(data.values.ravel())
    n_categories = len(all_categories)

    # Create contingency table: items x categories
    contingency_table = np.zeros((n_items, n_categories))

    for i in range(n_items):
        for j, category in enumerate(all_categories):
            contingency_table[i, j] = (data.iloc[i] == category).sum()

    # Calculate P_i (proportion of agreement for each item)
    P_i = np.zeros(n_items)
    for i in range(n_items):
        sum_squares = np.sum(contingency_table[i] ** 2)
        P_i[i] = (sum_squares - n_annotators) / (n_annotators * (n_annotators - 1))

    # Calculate P_bar (mean proportion of agreement)
    P_bar = np.mean(P_i)

    # Calculate marginal proportions (p_j for each category)
    marginal_props = np.zeros(n_categories)
    for j in range(n_categories):
        marginal_props[j] = np.sum(contingency_table[:, j]) / (n_items * n_annotators)

    # Calculate P_e (expected agreement by chance)
    P_e = np.sum(marginal_props**2)

    # Calculate Fleiss' kappa
    if P_e == 1.0:
        kappa = 1.0  # Perfect agreement case
    else:
        kappa = (P_bar - P_e) / (1 - P_e)

    # Interpretation
    if kappa < 0:
        interpretation = "Poor (worse than chance)"
    elif kappa < 0.20:
        interpretation = "Slight"
    elif kappa < 0.40:
        interpretation = "Fair"
    elif kappa < 0.60:
        interpretation = "Moderate"
    elif kappa < 0.80:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return kappa, {"P_bar": P_bar, "P_e": P_e, "interpretation": interpretation, "n_items": n_items, "n_annotators": n_annotators, "n_categories": n_categories, "categories": all_categories, "marginal_proportions": dict(zip(all_categories, marginal_props))}


def calculate_fleiss_kappa_simple(data):
    """
    Calculate a simplified version of overall agreement (for backwards compatibility)
    """
    kappa, stats = calculate_fleiss_kappa(data)
    return stats["P_bar"]  # Return observed agreement proportion

In [12]:
# =============================================================================
# FLEISS' KAPPA ANALYSIS - Inter-Annotator Agreement
# =============================================================================

print("=" * 60)
print("FLEISS' KAPPA ANALYSIS - INTER-ANNOTATOR AGREEMENT")
print("=" * 60)

# Calculate Fleiss' kappa
fleiss_kappa, stats = calculate_fleiss_kappa(ivc)
print(f"\nFleiss' Kappa: {fleiss_kappa:.4f}")
print(f"Interpretation: {stats['interpretation']}")
print(f"Observed Agreement (P̄): {stats['P_bar']:.4f}")
print(f"Expected Agreement (Pe): {stats['P_e']:.4f}")
print(f"Number of items: {stats['n_items']}")
print(f"Number of annotators: {stats['n_annotators']}")
print(f"Categories: {list(stats['categories'])}")
print("\nMarginal proportions:")
for cat, prop in stats["marginal_proportions"].items():
    print(f"  - '{cat}': {prop:.3f}")

# Statistical interpretation
print(f"\nSTATISTICAL INTERPRETATION")
print("-" * 30)
print("Fleiss' Kappa interpretation guidelines:")
print("  < 0.00: Poor (worse than chance)")
print("  0.00-0.20: Slight agreement")
print("  0.21-0.40: Fair agreement")
print("  0.41-0.60: Moderate agreement")
print("  0.61-0.80: Substantial agreement")
print("  0.81-1.00: Almost perfect agreement")

FLEISS' KAPPA ANALYSIS - INTER-ANNOTATOR AGREEMENT

Fleiss' Kappa: -0.3333
Interpretation: Poor (worse than chance)
Observed Agreement (P̄): 0.5000
Expected Agreement (Pe): 0.6250
Number of items: 14
Number of annotators: 4
Categories: ['Sim', 'Não']

Marginal proportions:
  - 'Sim': 0.750
  - 'Não': 0.250

STATISTICAL INTERPRETATION
------------------------------
Fleiss' Kappa interpretation guidelines:
  < 0.00: Poor (worse than chance)
  0.00-0.20: Slight agreement
  0.21-0.40: Fair agreement
  0.41-0.60: Moderate agreement
  0.61-0.80: Substantial agreement
  0.81-1.00: Almost perfect agreement


In [13]:
# =============================================================================
# FLEISS' KAPPA VISUALIZATION
# =============================================================================

print("\nCreating Fleiss' Kappa Visualizations...")

# Create agreement components visualization
fig_fleiss = make_subplots(rows=1, cols=2, subplot_titles=["Agreement Components", "Interpretation Scale"], specs=[[{"type": "bar"}, {"type": "bar"}]], horizontal_spacing=0.15)

# 1. Agreement Components
components = ["Observed Agreement<br>(P̄)", "Expected Agreement<br>(Pe)", "Fleiss' Kappa<br>(κ)"]
values = [stats["P_bar"], stats["P_e"], fleiss_kappa]
colors = ["#3498DB", "#95A5A6", "#FF6B6B" if fleiss_kappa < 0 else "#4ECDC4"]

fig_fleiss.add_trace(go.Bar(x=components, y=values, text=[f"{val:.3f}" for val in values], textposition="auto", marker_color=colors, name="Agreement Components"), row=1, col=1)

# 2. Interpretation scale reference
interpretation_ranges = ["Poor<br>(<0.00)", "Slight<br>(0.00-0.20)", "Fair<br>(0.21-0.40)", "Moderate<br>(0.41-0.60)", "Substantial<br>(0.61-0.80)", "Almost Perfect<br>(0.81-1.00)"]
range_values = [-0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
range_colors = ["#FF6B6B", "#FFA07A", "#FFD700", "#98FB98", "#32CD32", "#228B22"]

fig_fleiss.add_trace(go.Bar(x=interpretation_ranges, y=range_values, text=[f"{val:.1f}" for val in range_values], textposition="auto", marker_color=range_colors, name="Interpretation Scale"), row=1, col=2)

# Add horizontal line to show where our value falls
fig_fleiss.add_hline(y=fleiss_kappa, line_dash="dash", line_color="red", annotation_text=f"κ = {fleiss_kappa:.3f}", row=1, col=2)

# Update layout
fig_fleiss.update_layout(height=500, title={"text": "Fleiss' Kappa Inter-Annotator Agreement Analysis", "x": 0.5, "font": {"size": 18}}, showlegend=False)

fig_fleiss.update_yaxes(title_text="Agreement Value", row=1, col=1)
fig_fleiss.update_yaxes(title_text="Reference Value", row=1, col=2)

fig_fleiss.show()

# Save the figure
fig_fleiss.write_html(images_path / "03_iaa_fleiss_kappa_analysis.html")
fig_fleiss.write_image(images_path / "03_iaa_fleiss_kappa_analysis.png", width=1200, height=500, scale=3)
print(f"Saved: 03_iaa_fleiss_kappa_analysis.html and .png")


Creating Fleiss' Kappa Visualizations...


Saved: 03_iaa_fleiss_kappa_analysis.html and .png


In [14]:
# =============================================================================
# PAIRWISE COHEN'S KAPPA ANALYSIS
# =============================================================================

print("=" * 60)
print("PAIRWISE COHEN'S KAPPA & ACCURACY ANALYSIS")
print("=" * 60)

# Calculate Cohen's Kappa for all pairs
kappa_matrix = calculate_pairwise_agreement(ivc, metric="cohen_kappa")
print("\nCohen's Kappa Matrix:")
print(kappa_matrix.round(3))

# Calculate accuracy for all pairs
accuracy_matrix = calculate_pairwise_agreement(ivc, metric="accuracy")
print("\nAccuracy Matrix:")
print(accuracy_matrix.round(3))

# Calculate overall agreement
overall_agreement = calculate_fleiss_kappa_simple(ivc)
print(f"\nObserved Overall Agreement: {overall_agreement:.3f}")

# Summary statistics
kappa_values = []
accuracy_values = []

for i in range(len(ivc.columns)):
    for j in range(i + 1, len(ivc.columns)):
        kappa_values.append(kappa_matrix.iloc[i, j])
        accuracy_values.append(accuracy_matrix.iloc[i, j])

print(f"\nPairwise Summary Statistics:")
print(f"Mean Cohen's Kappa: {np.nanmean(kappa_values):.3f}")
print(f"Std Cohen's Kappa: {np.nanstd(kappa_values):.3f}")
print(f"Mean Accuracy: {np.mean(accuracy_values):.3f}")
print(f"Std Accuracy: {np.std(accuracy_values):.3f}")

# Examine data distribution per annotator
print("\nData distribution per annotator:")
for col in ivc.columns:
    print(f"{col}: {ivc[col].value_counts().to_dict()}")

PAIRWISE COHEN'S KAPPA & ACCURACY ANALYSIS

Cohen's Kappa Matrix:
             avaliador_1  avaliador_2  avaliador_3  avaliador_4
avaliador_1          1.0          NaN          NaN          0.0
avaliador_2          NaN          1.0          NaN          0.0
avaliador_3          NaN          NaN          1.0          0.0
avaliador_4          0.0          0.0          0.0          1.0

Accuracy Matrix:
             avaliador_1  avaliador_2  avaliador_3  avaliador_4
avaliador_1          1.0          1.0          1.0          0.0
avaliador_2          1.0          1.0          1.0          0.0
avaliador_3          1.0          1.0          1.0          0.0
avaliador_4          0.0          0.0          0.0          1.0

Observed Overall Agreement: 0.500

Pairwise Summary Statistics:
Mean Cohen's Kappa: 0.000
Std Cohen's Kappa: 0.000
Mean Accuracy: 0.500
Std Accuracy: 0.500

Data distribution per annotator:
avaliador_1: {'Sim': 14}
avaliador_2: {'Sim': 14}
avaliador_3: {'Sim': 14}
avaliador_

In [15]:
# =============================================================================
# DETAILED INTERPRETATION & FINDINGS
# =============================================================================

print("=" * 70)
print("DETAILED INTERPRETATION & KEY FINDINGS")
print("=" * 70)

print(f"\nDATASET OVERVIEW")
print("-" * 35)
print(f"Data Type: Categorical (Sim/Não)")
print(f"Number of items: {stats['n_items']}")
print(f"Number of annotators: {stats['n_annotators']}")
print(f"Categories: {', '.join([str(c) for c in stats['categories']])}")

print(f"\nAGREEMENT METRICS")
print("-" * 35)
print(f"Fleiss' Kappa: {fleiss_kappa:.4f} ({stats['interpretation']})")
print(f"Observed Agreement: {stats['P_bar']:.3f} ({stats['P_bar'] * 100:.1f}%)")
print(f"Expected Agreement by chance: {stats['P_e']:.3f} ({stats['P_e'] * 100:.1f}%)")

print(f"\nKEY FINDINGS")
print("-" * 35)
# Analyze the pattern
marginal_props = stats["marginal_proportions"]
for cat, prop in marginal_props.items():
    print(f"- '{cat}' appears in {prop * 100:.1f}% of all annotations")

print(f"\nPROBLEM ANALYSIS")
print("-" * 35)
print(f"- High expected agreement by chance (Pe = {stats['P_e']:.3f})")
print(f"- Observed agreement ({stats['P_bar']:.3f}) is LOWER than chance expectation")
print(f"- Result: Negative kappa indicates systematic disagreement between annotators")
print(f"- Pattern: Clear split in annotator behavior (3 vs 1)")

print(f"\nQUALITY ASSESSMENT")
print("-" * 35)
print("Target Fleiss' Kappa values:")
print("- Minimum acceptable: κ > 0.40 (Moderate)")
print("- Good quality: κ > 0.60 (Substantial)")
print("- Excellent quality: κ > 0.80 (Almost Perfect)")
print(f"\nCurrent status: κ = {fleiss_kappa:.3f} - BELOW minimum threshold")
print(f"Gap to minimum acceptable: +{0.40 - fleiss_kappa:.3f} needed")

DETAILED INTERPRETATION & KEY FINDINGS

DATASET OVERVIEW
-----------------------------------
Data Type: Categorical (Sim/Não)
Number of items: 14
Number of annotators: 4
Categories: Sim, Não

AGREEMENT METRICS
-----------------------------------
Fleiss' Kappa: -0.3333 (Poor (worse than chance))
Observed Agreement: 0.500 (50.0%)
Expected Agreement by chance: 0.625 (62.5%)

KEY FINDINGS
-----------------------------------
- 'Sim' appears in 75.0% of all annotations
- 'Não' appears in 25.0% of all annotations

PROBLEM ANALYSIS
-----------------------------------
- High expected agreement by chance (Pe = 0.625)
- Observed agreement (0.500) is LOWER than chance expectation
- Result: Negative kappa indicates systematic disagreement between annotators
- Pattern: Clear split in annotator behavior (3 vs 1)

QUALITY ASSESSMENT
-----------------------------------
Target Fleiss' Kappa values:
- Minimum acceptable: κ > 0.40 (Moderate)
- Good quality: κ > 0.60 (Substantial)
- Excellent quality: κ > 

In [16]:
# =============================================================================
# VISUALIZATION 1: Accuracy Heatmap
# =============================================================================

print("1. Inter-Annotator Accuracy Matrix")
fig1 = go.Figure(data=go.Heatmap(z=accuracy_matrix.values, x=accuracy_matrix.columns, y=accuracy_matrix.index, colorscale="RdYlBu_r", zmin=0, zmax=1, text=accuracy_matrix.round(3).values, texttemplate="%{text}", textfont={"size": 12}, showscale=True))

fig1.update_layout(title={"text": "Inter-Annotator Accuracy Matrix", "x": 0.5, "font": {"size": 16}}, xaxis_title="Annotators", yaxis_title="Annotators", height=500, width=600)

fig1.show()

# Save the figure
fig1.write_html(images_path / "03_iaa_accuracy_heatmap.html")
fig1.write_image(images_path / "03_iaa_accuracy_heatmap.png", width=600, height=500, scale=3)
print(f"Saved: 03_iaa_accuracy_heatmap.html and .png")

1. Inter-Annotator Accuracy Matrix


Saved: 03_iaa_accuracy_heatmap.html and .png


In [17]:
# =============================================================================
# VISUALIZATION 2: Cohen's Kappa Heatmap
# =============================================================================

print("2. Cohen's Kappa Matrix")
fig2 = go.Figure(data=go.Heatmap(z=kappa_matrix.values, x=kappa_matrix.columns, y=kappa_matrix.index, colorscale="RdYlBu_r", zmin=-1, zmax=1, text=kappa_matrix.round(3).values, texttemplate="%{text}", textfont={"size": 12}, showscale=True))

fig2.update_layout(title={"text": "Cohen's Kappa Agreement Matrix", "x": 0.5, "font": {"size": 16}}, xaxis_title="Annotators", yaxis_title="Annotators", height=500, width=600)

fig2.show()

# Save the figure
fig2.write_html(images_path / "03_iaa_cohens_kappa_heatmap.html")
fig2.write_image(images_path / "03_iaa_cohens_kappa_heatmap.png", width=600, height=500, scale=3)
print(f"Saved: 03_iaa_cohens_kappa_heatmap.html and .png")

2. Cohen's Kappa Matrix


Saved: 03_iaa_cohens_kappa_heatmap.html and .png


In [18]:
# =============================================================================
# VISUALIZATION 3: Response Distribution
# =============================================================================

print("3. Response Distribution Across All Annotators")

response_dist = ivc.stack().value_counts()
fig3 = go.Figure(data=go.Bar(x=response_dist.index, y=response_dist.values, text=response_dist.values, textposition="auto", marker_color=["#2E86AB", "#A23B72"]))

fig3.update_layout(title={"text": "Distribution of Responses Across All Annotators", "x": 0.5, "font": {"size": 16}}, xaxis_title="Response", yaxis_title="Count", height=500, width=600)

fig3.show()

# Save the figure
fig3.write_html(images_path / "03_iaa_response_distribution.html")
fig3.write_image(images_path / "03_iaa_response_distribution.png", width=600, height=500, scale=3)
print(f"Saved: 03_iaa_response_distribution.html and .png")

3. Response Distribution Across All Annotators


Saved: 03_iaa_response_distribution.html and .png


In [19]:
# =============================================================================
# VISUALIZATION 4: Pairwise Agreement Analysis
# =============================================================================

print("4. Pairwise Accuracy Between Annotators")

pairwise_results = []
for i, ann1 in enumerate(ivc.columns):
    for j, ann2 in enumerate(ivc.columns):
        if i < j:
            acc = accuracy_score(ivc[ann1], ivc[ann2])
            pairwise_results.append({"Pair": f"{ann1} vs {ann2}", "Accuracy": acc, "Annotator_1": ann1, "Annotator_2": ann2})

pairwise_df = pd.DataFrame(pairwise_results)
print(pairwise_df)

# Visualize pairwise agreements
fig4 = px.bar(pairwise_df, x="Pair", y="Accuracy", title="Pairwise Accuracy Between Annotators", labels={"Accuracy": "Accuracy Score", "Pair": "Annotator Pairs"}, text="Accuracy")
fig4.update_traces(texttemplate="%{text:.3f}", textposition="outside")
fig4.update_layout(height=400, xaxis_tickangle=-45)
fig4.show()

# Save the figure
fig4.write_html(images_path / "03_iaa_pairwise_accuracy.html")
fig4.write_image(images_path / "03_iaa_pairwise_accuracy.png", width=800, height=400, scale=3)
print(f"Saved: 03_iaa_pairwise_accuracy.html and .png")

4. Pairwise Accuracy Between Annotators
                         Pair  Accuracy  Annotator_1  Annotator_2
0  avaliador_1 vs avaliador_2       1.0  avaliador_1  avaliador_2
1  avaliador_1 vs avaliador_3       1.0  avaliador_1  avaliador_3
2  avaliador_1 vs avaliador_4       0.0  avaliador_1  avaliador_4
3  avaliador_2 vs avaliador_3       1.0  avaliador_2  avaliador_3
4  avaliador_2 vs avaliador_4       0.0  avaliador_2  avaliador_4
5  avaliador_3 vs avaliador_4       0.0  avaliador_3  avaliador_4


Saved: 03_iaa_pairwise_accuracy.html and .png


In [20]:
# =============================================================================
# RECOMMENDATIONS
# =============================================================================

print("=" * 70)
print("ACTIONABLE RECOMMENDATIONS")
print("=" * 70)

print(f"\nCRITICAL ISSUES IDENTIFIED")
print("-" * 35)
print("1. Perfect split: 3 annotators always say 'Sim', 1 always says 'Não'")
print("2. No variation within individual annotators")
print("3. Negative kappa (worse than random chance)")
print("4. High marginal imbalance creates inflated chance agreement")

print(f"\nSTATISTICAL CONTEXT")
print("-" * 35)
print("- Negative Fleiss' kappa indicates systematic bias or disagreement")
print("- Values worse than chance suggest fundamental issues with:")
print("  - Annotation guidelines clarity")
print("  - Annotator training/calibration")
print("  - Task definition ambiguity")
print("  - Potential systematic biases")

print(f"\nIMMEDIATE ACTIONS NEEDED")
print("-" * 35)
print("1. Review and clarify annotation guidelines")
print("2. Conduct annotator retraining sessions")
print("3. Implement calibration exercises with gold standard examples")
print("4. Investigate why one annotator consistently disagrees")
print("5. Consider consensus meeting to resolve disagreements")

print(f"\nLONG-TERM IMPROVEMENTS")
print("-" * 35)
print("6. Provide clearer category definitions with examples")
print("7. Implement pilot testing of annotation protocols")
print("8. Add inter-annotator discussion sessions")
print("9. Calculate confidence intervals for kappa values")
print("10. Consider using majority voting for final labels")

print(f"\nSUCCESS CRITERIA")
print("-" * 35)
print(f"Current Fleiss' κ: {fleiss_kappa:.3f}")
print(f"Target minimum: κ > 0.40 (Moderate agreement)")
print(f"Improvement needed: +{0.40 - fleiss_kappa:.3f}")
print("\nRecommended milestones:")
print("- Phase 1: Achieve κ > 0.20 (Slight agreement)")
print("- Phase 2: Achieve κ > 0.40 (Moderate agreement)")
print("- Final goal: Achieve κ > 0.60 (Substantial agreement)")

ACTIONABLE RECOMMENDATIONS

CRITICAL ISSUES IDENTIFIED
-----------------------------------
1. Perfect split: 3 annotators always say 'Sim', 1 always says 'Não'
2. No variation within individual annotators
3. Negative kappa (worse than random chance)
4. High marginal imbalance creates inflated chance agreement

STATISTICAL CONTEXT
-----------------------------------
- Negative Fleiss' kappa indicates systematic bias or disagreement
- Values worse than chance suggest fundamental issues with:
  - Annotation guidelines clarity
  - Annotator training/calibration
  - Task definition ambiguity
  - Potential systematic biases

IMMEDIATE ACTIONS NEEDED
-----------------------------------
1. Review and clarify annotation guidelines
2. Conduct annotator retraining sessions
3. Implement calibration exercises with gold standard examples
4. Investigate why one annotator consistently disagrees
5. Consider consensus meeting to resolve disagreements

LONG-TERM IMPROVEMENTS
-----------------------------

In [21]:
# =============================================================================
# FINAL SUMMARY
# =============================================================================

print("=" * 70)
print("INTER-ANNOTATOR AGREEMENT ANALYSIS - FINAL SUMMARY")
print("=" * 70)

print(f"\nDATASET CHARACTERISTICS")
print("-" * 35)
print(f"- Number of items: {stats['n_items']}")
print(f"- Number of annotators: {stats['n_annotators']}")
print(f"- Response categories: {', '.join([str(c) for c in stats['categories']])}")
print(f"- Observed agreement: {stats['P_bar']:.1%}")
print(f"- Expected agreement (by chance): {stats['P_e']:.1%}")

print(f"\nAGREEMENT METRICS")
print("-" * 35)
print(f"- Fleiss' Kappa: {fleiss_kappa:.4f}")
print(f"- Interpretation: {stats['interpretation']}")
print(f"- Mean pairwise accuracy: {np.mean(accuracy_values):.1%}")
print(f"- Mean Cohen's Kappa (pairwise): {np.nanmean(kappa_values):.3f}")

print(f"\nKEY FINDINGS")
print("-" * 35)
print(f"- Annotators show systematic disagreement pattern")
print(f"- Three annotators agree perfectly (all 'Sim')")
print(f"- One annotator disagrees completely (all 'Não')")
print(f"- Creates a clear 3 vs 1 split in annotations")

print(f"\nCONCLUSION")
print("-" * 35)
print(f"The inter-annotator agreement is POOR ({stats['interpretation']}).")
print(f"Immediate intervention required to improve annotation quality.")
print(f"Recommend reviewing guidelines and retraining annotators before proceeding.")

# Create final summary visualization
fig_summary = go.Figure()

fig_summary.add_trace(
    go.Indicator(
        mode="gauge+number+delta",
        value=fleiss_kappa,
        domain={"x": [0, 1], "y": [0, 1]},
        title={"text": "Fleiss' Kappa", "font": {"size": 24}},
        delta={"reference": 0.40, "suffix": " (vs. target)"},
        gauge={
            "axis": {"range": [-0.5, 1], "tickwidth": 1, "tickcolor": "darkblue"},
            "bar": {"color": "darkred" if fleiss_kappa < 0 else "darkblue"},
            "bgcolor": "white",
            "borderwidth": 2,
            "bordercolor": "gray",
            "steps": [{"range": [-0.5, 0], "color": "#FFE5E5"}, {"range": [0, 0.2], "color": "#FFE5CC"}, {"range": [0.2, 0.4], "color": "#FFFFCC"}, {"range": [0.4, 0.6], "color": "#E5FFCC"}, {"range": [0.6, 0.8], "color": "#CCFFE5"}, {"range": [0.8, 1], "color": "#CCF5FF"}],
            "threshold": {"line": {"color": "red", "width": 4}, "thickness": 0.75, "value": 0.40},
        },
    )
)

fig_summary.update_layout(height=400, title={"text": "Inter-Annotator Agreement Quality Gauge", "x": 0.5, "font": {"size": 18}})

fig_summary.show()

# Save the figure
fig_summary.write_html(images_path / "03_iaa_summary_gauge.html")
fig_summary.write_image(images_path / "03_iaa_summary_gauge.png", width=800, height=400, scale=3)
print(f"Saved: 03_iaa_summary_gauge.html and .png")

INTER-ANNOTATOR AGREEMENT ANALYSIS - FINAL SUMMARY

DATASET CHARACTERISTICS
-----------------------------------
- Number of items: 14
- Number of annotators: 4
- Response categories: Sim, Não
- Observed agreement: 50.0%
- Expected agreement (by chance): 62.5%

AGREEMENT METRICS
-----------------------------------
- Fleiss' Kappa: -0.3333
- Interpretation: Poor (worse than chance)
- Mean pairwise accuracy: 50.0%
- Mean Cohen's Kappa (pairwise): 0.000

KEY FINDINGS
-----------------------------------
- Annotators show systematic disagreement pattern
- Three annotators agree perfectly (all 'Sim')
- One annotator disagrees completely (all 'Não')
- Creates a clear 3 vs 1 split in annotations

CONCLUSION
-----------------------------------
The inter-annotator agreement is POOR (Poor (worse than chance)).
Immediate intervention required to improve annotation quality.
Recommend reviewing guidelines and retraining annotators before proceeding.


Saved: 03_iaa_summary_gauge.html and .png


In [22]:
# =============================================================================
# CRONBACH'S ALPHA - Internal Consistency Reliability
# =============================================================================

print("=" * 70)
print("CRONBACH'S ALPHA - INTERNAL CONSISTENCY ANALYSIS")
print("=" * 70)


def calculate_cronbachs_alpha(data):
    """
    Calculate Cronbach's alpha for internal consistency reliability

    Parameters:
    data: DataFrame with items as rows and raters as columns

    Returns:
    float: Cronbach's alpha value
    dict: Additional statistics
    """
    # Convert categorical to numeric if needed
    # Map 'Sim' to 1 and 'Não' to 0
    data_numeric = data.copy()
    if data_numeric.iloc[0, 0] in ["Sim", "Não"]:
        data_numeric = data_numeric.replace({"Sim": 1, "Não": 0})

    # Number of items (raters/annotators)
    n_items = data_numeric.shape[1]

    # Calculate variance for each item
    item_variances = data_numeric.var(axis=0, ddof=1)

    # Calculate total variance (variance of sum scores)
    total_scores = data_numeric.sum(axis=1)
    total_variance = total_scores.var(ddof=1)

    # Calculate Cronbach's alpha
    # α = (k / (k-1)) * (1 - (Σσ²ᵢ / σ²ₜ))
    if n_items > 1:
        alpha = (n_items / (n_items - 1)) * (1 - (item_variances.sum() / total_variance))
    else:
        alpha = np.nan

    # Interpretation
    if alpha < 0.5:
        interpretation = "Unacceptable"
    elif alpha < 0.6:
        interpretation = "Poor"
    elif alpha < 0.7:
        interpretation = "Questionable"
    elif alpha < 0.8:
        interpretation = "Acceptable"
    elif alpha < 0.9:
        interpretation = "Good"
    else:
        interpretation = "Excellent"

    # Calculate item-total correlations
    item_total_correlations = {}
    for col in data_numeric.columns:
        # Correlation between item and total (excluding that item)
        other_items_sum = total_scores - data_numeric[col]
        corr = data_numeric[col].corr(other_items_sum)
        item_total_correlations[col] = corr

    # Calculate alpha if item deleted
    alpha_if_deleted = {}
    for col in data_numeric.columns:
        remaining_items = data_numeric.drop(columns=[col])
        if remaining_items.shape[1] > 1:
            n_remaining = remaining_items.shape[1]
            remaining_item_vars = remaining_items.var(axis=0, ddof=1)
            remaining_total_var = remaining_items.sum(axis=1).var(ddof=1)
            alpha_deleted = (n_remaining / (n_remaining - 1)) * (1 - (remaining_item_vars.sum() / remaining_total_var))
            alpha_if_deleted[col] = alpha_deleted
        else:
            alpha_if_deleted[col] = np.nan

    return alpha, {
        "interpretation": interpretation,
        "n_items": n_items,
        "n_observations": data_numeric.shape[0],
        "item_variances": item_variances.to_dict(),
        "total_variance": total_variance,
        "item_total_correlations": item_total_correlations,
        "alpha_if_item_deleted": alpha_if_deleted,
        "mean_item_variance": item_variances.mean(),
        "data_numeric": data_numeric,
    }


# Calculate Cronbach's alpha
cronbach_alpha, cronbach_stats = calculate_cronbachs_alpha(ivc)

print(f"\nCronbach's Alpha: {cronbach_alpha:.4f}")
print(f"Interpretation: {cronbach_stats['interpretation']}")
print(f"Number of items (annotators): {cronbach_stats['n_items']}")
print(f"Number of observations (questions): {cronbach_stats['n_observations']}")

print(f"\nVARIANCE ANALYSIS")
print("-" * 35)
print(f"Total variance: {cronbach_stats['total_variance']:.4f}")
print(f"Mean item variance: {cronbach_stats['mean_item_variance']:.4f}")

print(f"\nITEM STATISTICS")
print("-" * 35)
print("\nItem-Total Correlations:")
for item, corr in cronbach_stats["item_total_correlations"].items():
    print(f"  {item}: {corr:.3f}")

print("\nAlpha if Item Deleted:")
for item, alpha_del in cronbach_stats["alpha_if_item_deleted"].items():
    change = alpha_del - cronbach_alpha
    print(f"  {item}: {alpha_del:.4f} (Δ = {change:+.4f})")

print(f"\nINTERPRETATION GUIDELINES")
print("-" * 35)
print("Cronbach's Alpha interpretation:")
print("  α < 0.50: Unacceptable")
print("  α 0.50-0.59: Poor")
print("  α 0.60-0.69: Questionable")
print("  α 0.70-0.79: Acceptable")
print("  α 0.80-0.89: Good")
print("  α ≥ 0.90: Excellent")

CRONBACH'S ALPHA - INTERNAL CONSISTENCY ANALYSIS

Cronbach's Alpha: nan
Interpretation: Excellent
Number of items (annotators): 4
Number of observations (questions): 14

VARIANCE ANALYSIS
-----------------------------------
Total variance: 0.0000
Mean item variance: 0.0000

ITEM STATISTICS
-----------------------------------

Item-Total Correlations:
  avaliador_1: nan
  avaliador_2: nan
  avaliador_3: nan
  avaliador_4: nan

Alpha if Item Deleted:
  avaliador_1: nan (Δ = +nan)
  avaliador_2: nan (Δ = +nan)
  avaliador_3: nan (Δ = +nan)
  avaliador_4: nan (Δ = +nan)

INTERPRETATION GUIDELINES
-----------------------------------
Cronbach's Alpha interpretation:
  α < 0.50: Unacceptable
  α 0.50-0.59: Poor
  α 0.60-0.69: Questionable
  α 0.70-0.79: Acceptable
  α 0.80-0.89: Good
  α ≥ 0.90: Excellent


In [23]:
for col in ivc.columns:
    print(f"{col}: {ivc[col].value_counts().to_dict()}")

avaliador_1: {'Sim': 14}
avaliador_2: {'Sim': 14}
avaliador_3: {'Sim': 14}
avaliador_4: {'Não': 14}


## Understanding the NaN Values in Cronbach's Alpha Calculation

### Why Are We Getting NaN (Not a Number) Values?

The Cronbach's alpha calculation is producing **NaN** values due to a critical mathematical issue: **division by zero**.

### Root Cause Analysis

Looking at the variance analysis output above, we observe:
- **Total variance: 0.0000**
- **Mean item variance: 0.0000**

This occurs because:

1. **No Within-Rater Variability**: Each annotator gave the **same response** to all 14 questions
   - `avaliador_1`: Always responded "Sim" (or "Não")
   - `avaliador_2`: Always responded "Sim" (or "Não")
   - `avaliador_3`: Always responded "Sim" (or "Não")
   - `avaliador_4`: Always responded "Não"

2. **Mathematical Consequence**: When all responses from an annotator are identical:
   - Individual item variance = 0
   - Sum of item variances = 0
   - Total score variance = 0 (because each person gets the same total score)

3. **Cronbach's Alpha Formula Breakdown**:
   ```
   α = (k / (k-1)) × (1 - (Σσ²ᵢ / σ²ₜ))
   ```
   When σ²ₜ (total variance) = 0, we get: **0 / 0 = NaN**

### What Does This Mean?

**Cronbach's alpha is NOT appropriate for this dataset** because:

- Cronbach's alpha measures **internal consistency** - how well items (annotators) correlate with each other
- It requires **variability within items** to be meaningful
- When there's no variance, the concept of "internal consistency" doesn't apply

### The Correct Interpretation

The issue is not a calculation error, but rather that:

1. **The data pattern is extreme**: Annotators show perfect consistency within themselves but complete disagreement between each other (3 vs 1 split)

2. **Better metrics for this scenario**:
   - ✅ **Fleiss' Kappa**: Appropriate for categorical data with perfect individual consistency
   - ✅ **Cohen's Kappa (pairwise)**: Measures agreement between specific annotator pairs
   - ✅ **Accuracy scores**: Direct measure of agreement
   - ❌ **Cronbach's Alpha**: Requires variance within raters

### Recommendation

**Focus on the Fleiss' Kappa results** (calculated earlier in this notebook) as the primary measure of inter-annotator agreement for this dataset. The Cronbach's alpha analysis confirms what we already know: annotators are internally consistent but disagree systematically with each other.