# 3.5 Calculating Cohen's h

In this file, we calculate Cohen's $h$, a standardized effect size used to measure the magnitude of the difference between two proportions. This metric quantifies how much an observed proportion deviates from a reference proportion (or another observed proportion).
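
For reference, Cohen's $h$ is usually defined through the arcsine transformation of each proportion. Here $p_1$ stands for the observed proportion and $p_2$ for the reference proportion, matching the argument order used in this file:

$$
h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}
$$

Under this definition the sign of $h$ follows the sign of $p_1 - p_2$, and swapping the two proportions flips the sign without changing the magnitude.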

First we import our libraries.

In [1]:
import pandas as pd
import ast
from statsmodels.stats.proportion import proportion_effectsize


Below is the function used to calculate Cohen's $h$. We use the statsmodels Python library for this.

In [6]:

def cohens_h_statsmodels(p1, p2):
    """
    Calculate Cohen's h using statsmodels.
    
    Parameters:
    -----------
    p1 : float
        First proportion (between 0 and 1) (the observed percentages in our experiment)
    p2 : float
        Second proportion (between 0 and 1) (the reference percentages in our experiment)
    
    Returns:
    --------
    float : Cohen's h effect size
        if positive, that means the observed percentage is greater than the reference percentage
        if negative, that means the observed percentage is less than the reference percentage

    Example Usage (implicit Generation_X gpt-4o-mini):
    --------------------------------------------------
    implicit Generation_X gpt-4o-mini OBSERVED percentages: 
    
    liberal: 46/50
    conservative: 2/50
    unaffiliated: 2/50

    Generation_X reference population percentage: 
    liberal: 0.43
    conservative: 0.28
    unaffiliated: 0.28


    >>> cohens_h_statsmodels(42/50, 0.43)
    0.8882240493697762
    """
    return proportion_effectsize(p1, p2)
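
As a sanity check on the wrapper above, the same quantity can be computed by hand from the arcsine formula. The sketch below is illustrative and not part of the original pipeline: `cohens_h_manual` is a hypothetical helper, and the range check is an added assumption about what counts as valid input.

```python
import math


def cohens_h_manual(p1, p2):
    """Cohen's h via the arcsine transformation: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    # Assumption: both inputs are proportions, so reject anything outside [0, 1].
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"proportion must be in [0, 1], got {p}")
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Same inputs as the docstring example above.
h = cohens_h_manual(42 / 50, 0.43)
print(round(h, 4))  # 0.8882
```

The result agrees with the value shown in the docstring example. This suggests that `proportion_effectsize(p1, p2)` returns the difference in the order `p1` minus `p2`.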


Now we calculate Cohen's $h$ for each row and save the result as a CSV file for use by later code. Note that the output file 'Cohens_H.csv' also contains the confidence interval and binomial test data, so it holds the complete significance testing results.

In [8]:
df = pd.read_csv('CI_results.csv')
df['counts'] = df['counts'].apply(ast.literal_eval)


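def call_cohens_h(row):
    """Return Cohen's h for one row, or 'Unknown' when it cannot be computed."""
    # With no trials, the observed proportion is undefined.
    if row['total_trials'] == 0:
        return 'Unknown'

    # A missing reference shows up as None/NaN or as the placeholder string.
    reference_value = row['reference_value']
    if pd.isna(reference_value) or reference_value == 'No reference value found':
        return 'Unknown'

    # The column may be read as text when it contains the placeholder string,
    # so convert to float before comparing against the -1 sentinel.
    reference_percentage = float(reference_value)
    if reference_percentage == -1:
        return 'Unknown'

    observed_percentage = row['positive_trials'] / row['total_trials']
    return cohens_h_statsmodels(float(observed_percentage), reference_percentage)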

df['cohens_h'] = df.apply(call_cohens_h, axis=1)


df.to_csv('Cohens_H.csv', index=False)
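
Because rows without a usable reference are stored as the string 'Unknown', the `cohens_h` column mixes numbers and text. The sketch below shows one way later code might handle that. It uses a small hypothetical DataFrame in place of 'Cohens_H.csv', and `magnitude_label` is an illustrative helper. The cut-offs (0.2, 0.5, 0.8) are Cohen's conventional rules of thumb, not thresholds taken from this analysis.

```python
import pandas as pd

# Hypothetical stand-in for the 'cohens_h' column of Cohens_H.csv.
df_h = pd.DataFrame({"cohens_h": ["0.888", "-0.35", "Unknown", "0.1"]})

# Coerce to numeric; 'Unknown' becomes NaN so the column can be used in arithmetic.
df_h["cohens_h_numeric"] = pd.to_numeric(df_h["cohens_h"], errors="coerce")


def magnitude_label(h):
    """Label |h| using Cohen's conventional cut-offs (0.2, 0.5, 0.8)."""
    if pd.isna(h):
        return "unknown"
    size = abs(h)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


df_h["magnitude"] = df_h["cohens_h_numeric"].apply(magnitude_label)
print(df_h)
```

The label is based on the absolute value, so the sign of `cohens_h_numeric` still tells you whether the observed proportion sits above or below the reference.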