# Statistical Significance
Question: do the similarity scores for female and male names differ significantly?
- Null Hypothesis (H0): The mean Female and Male similarity scores are equal (the mean paired difference is zero).
- Alternative Hypothesis (H1): The mean Female and Male similarity scores differ (the mean paired difference is not zero).

If p_value_fp_mp < 0.05: reject H0; the difference is statistically significant.
If p_value_fp_mp >= 0.05: fail to reject H0; the observed difference could be due to chance.
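
The decision rule above can be sketched in a few lines. This is only an illustration: the helper `is_significant`, the constant `ALPHA`, and the toy score lists are assumptions made up for this example and are not part of the project code or data.

```python
from scipy.stats import ttest_rel

ALPHA = 0.05  # significance threshold used in this section


def is_significant(p_value, alpha=ALPHA):
    """Hypothetical helper: True when H0 is rejected at level alpha."""
    return p_value < alpha


# Toy paired scores, invented for illustration only (not project results).
toy_female = [0.61, 0.58, 0.70, 0.66, 0.59, 0.63]
toy_male = [0.65, 0.63, 0.72, 0.71, 0.62, 0.69]

# ttest_rel pairs the i-th female score with the i-th male score.
t_stat, p_value = ttest_rel(toy_female, toy_male)
print("t-value:", t_stat)
print("p-value:", p_value)
print("reject H0:", is_significant(p_value))
```

Note that a p-value at or above the threshold does not show that the two groups are the same; it only means the data do not provide enough evidence to reject H0.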

## Paired T-tests

In [21]:
from scipy.stats import ttest_rel
import pandas as pd

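def t_test(folderName, llm_model, model):
    """Run a paired t-test on Female vs Male similarity scores and print the result.

    `model` names the gender-neutral baseline ('PersonX' or 'Unisex').
    """
    load_path = f'../results/{folderName}/{llm_model}_{model}_agreement_score_ALL.csv'
    agreement_scores_df = pd.read_csv(load_path)

    # Paired t-test: Female vs Male, relative to PersonX/Unisex.
    # The statistic is computed on (Female - Male), so a negative t-value
    # means Female similarity scores are lower on average.
    t_stat_fp_mp, p_value_fp_mp = ttest_rel(
        agreement_scores_df[f"Female-{model} Similarity"],
        agreement_scores_df[f"Male-{model} Similarity"]
    )
    print(f"Paired T-test: Female vs Male (relative to {model}) p-value:", p_value_fp_mp)
    print(f"Paired T-test: Female vs Male (relative to {model}) t-value:", t_stat_fp_mp)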

## Analysis
### Bart - PersonX

In [22]:
t_test('bart_personX', 'bart', 'PersonX')

Paired T-test: Female vs Male (relative to PersonX) p-value: 0.00016649200237711374
Paired T-test: Female vs Male (relative to PersonX) t-value: -3.765786026098575


- The negative t-value indicates that, on average, Male inferences are more similar to PersonX inferences than Female ones.
- The p-value is much less than 0.05, which means COMET-ATOMIC generates significantly different inferences for female vs. male names when compared to a gender-neutral baseline (PersonX).
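
Why a negative t-value points to lower Female scores can be checked directly: a paired t-test is a one-sample t-test on the per-row differences (first argument minus second), so the sign of t follows the sign of the mean difference. The sketch below uses toy arrays invented for illustration, not the project's CSV data.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Toy paired similarity scores, invented for illustration only.
toy_female = np.array([0.61, 0.58, 0.70, 0.66, 0.59, 0.63])
toy_male = np.array([0.65, 0.63, 0.72, 0.71, 0.62, 0.69])

# Per-row difference in the same order the arguments are passed to ttest_rel.
diff = toy_female - toy_male

t_rel, p_rel = ttest_rel(toy_female, toy_male)
t_one, p_one = ttest_1samp(diff, 0.0)

print("mean(Female - Male):", diff.mean())
print("paired t-value:", t_rel)
print("one-sample t-value on differences:", t_one)
```

Swapping the argument order flips the sign of the t-value but leaves the p-value unchanged, so the direction of the effect should always be read together with the order used in the call.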

### Bart - Unisex

In [23]:
t_test('bart_unisex', 'bart', 'Unisex')

Paired T-test: Female vs Male (relative to Unisex) p-value: 2.0396745717530205e-10
Paired T-test: Female vs Male (relative to Unisex) t-value: -6.361553262998244


- The negative t-value indicates that, on average, Male inferences are more similar to Unisex inferences than Female ones.
- The p-value (about 2.0e-10) is much less than 0.05, which means COMET generates significantly different inferences for female vs. male names when compared to the Unisex baseline. A difference this large is very unlikely to be due to random variation alone.

### GPT2 - PersonX

In [24]:
t_test('gpt2_personX', 'gpt2', 'PersonX')

Paired T-test: Female vs Male (relative to PersonX) p-value: 5.054372610612405e-13
Paired T-test: Female vs Male (relative to PersonX) t-value: -7.228540277250501


- The negative t-value indicates that, on average, Male inferences are more similar to PersonX inferences than Female ones.
- The p-value (about 5.1e-13) is much less than 0.05, which means COMET generates significantly different inferences for female vs. male names when compared to a gender-neutral baseline (PersonX). A difference this large is very unlikely to be due to random variation alone.

### GPT2 - Unisex

In [25]:
t_test('gpt2_unisex', 'gpt2', 'Unisex')

Paired T-test: Female vs Male (relative to Unisex) p-value: 0.01478316746964821
Paired T-test: Female vs Male (relative to Unisex) t-value: -2.4378552596872454


- The negative t-value indicates that, on average, Male inferences are more similar to Unisex inferences than Female ones.
- The p-value (about 0.015) is less than 0.05, which means COMET-ATOMIC generates significantly different inferences for female vs. male names when compared to a gender-neutral baseline (Unisex). The evidence is weaker here than in the other three comparisons, where the p-values are several orders of magnitude smaller.