# Baseline Means and Significance Testing

## Conser's Baselines

#### Prose baseline

Conser 2020, p. 264: 
> In order to determine how accentual contours might align by chance,
it is first necessary to establish a baseline of chance alignment, using a control
group. In the prose of Lysias’ *Against Eratosthenes*, for example, the rate of
matched accents between sections is **5.6%**, and the rate of compatible syllables
is **73.6%**, providing a minimum baseline for chance alignment.

Continuing in a footnote:

> Random ‘stanza pairs’ were created by pairing odd and even paragraphs of the first 12 sections and trimming the longer section to have the same number of syllables as the shorter.
This resulted in six stanza pairs, each containing an average of 86.5 syllables.
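
The pairing-and-trimming procedure in the footnote can be sketched as follows. This is a minimal illustration, not Conser's code: `make_stanza_pairs` is a hypothetical helper, each section is assumed to be already split into a list of syllables, and "odd and even" is read as pairing sections 1 and 2, 3 and 4, and so on.

```python
def make_stanza_pairs(sections):
    """Pair consecutive sections (1+2, 3+4, ...) and trim each pair to equal length.

    `sections` is a list of syllable lists; this input format is an assumption.
    """
    pairs = []
    # Step through the sections two at a time; a trailing unpaired section is dropped.
    for i in range(0, len(sections) - 1, 2):
        first, second = sections[i], sections[i + 1]
        shared = min(len(first), len(second))
        # Trim the longer section to the syllable count of the shorter one.
        pairs.append((first[:shared], second[:shared]))
    return pairs


# Toy data: 12 'sections' of made-up syllables with varying lengths.
toy_sections = [["syl"] * (80 + 3 * k) for k in range(12)]
toy_pairs = make_stanza_pairs(toy_sections)
print(len(toy_pairs))                                # number of stanza pairs
print(all(len(a) == len(b) for a, b in toy_pairs))   # both halves equally long
```

Twelve sections yield six pairs, which matches the count given in the footnote.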

#### Trimeter baseline

Most important:

> Prose, however, is a poor choice of comparison for poetic texts, because of the effect of
metrical responsion. [...] It is not surprising, then, that the percentage of both matched accents and
compatible syllables are higher between sections of iambic trimeter, at **9.7%**
and **76.9%** respectively.

> Random stanza pairs were created by pairing sequential groups of eight lines, drawn from
Antigone 1-96 and 162-321 (Prologue and Episode 1). Resolutions were treated as a single
syllable. This resulted in sixteen stanza pairs, each containing 96 syllables.

**Summary:**

16 antistrophic pairs of 2x8 lines (strikingly, 8 is the mean for Aristophanes' cantica too!)

- Accentual responsion: **9.7%**
- Compatibility: **76.9%**

## My Baselines

### Appendix: Constructing the Lyric Tetrameter Baseline

The number of lines per baseline strophe should match the mean stanza length in the corpus. Constructing 79 baseline cantica would be overdoing it a bit, so 16 seems fine. 

Most importantly, I'm going to go beyond Conser in two ways:
- triadic and tetradic (3- and 4-strophic) baselines, and
- lyric Frankenstein cantica, using metres that appear in multiple songs, most importantly 4 tr^, cr and ar.
  

In [1]:
from src.stats_comp import compatibility_corpus
from statistics import mean

all_sets = compatibility_corpus('data/compiled/')

flat = []
length = 0
cantica_lengths = []

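for play in all_sets:
    for canticum in play:
        for line_group in canticum:
            flat.append(line_group)
            length += 1
    # NOTE: at this indentation the append runs once per play and records only
    # the length of that play's last canticum; indent it one level to average
    # over every canticum instead.
    cantica_lengths.append(len(canticum))
mean_cantica_length = mean(cantica_lengths)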

print(f'Total nr of sets of responding lines: {length} lines')
print(f'Mean strophe length: {mean_cantica_length} ≈ {round(mean_cantica_length)} lines')


Total nr of sets of responding lines: 518 lines
Mean strophe length: 6.181818181818182 ≈ 6 lines


Ach: 4x2x8. 
Excludes the extrametrical lines 43 and 61, and Pseudartabas weirdness, and also lines with anapests, e.g. 
    6,7 (36, 37 instead)

cantica = [
    [(1, 8), (9, 17)],
    [(18, 26), (27, 35)],
    [(72, 80), (81, 89)],
    [(108, 116), (117, 125)]
]

Eq. 4x2x8

cantica = [
    [(18, 26), (27, 35)],
    [(72, 80), (81, 89)],
    [(108, 116), (117, 125)],
    [(126, 134), (135, 143)]
]

Nu. 
Av.

Let's find trochaic tetrameter catalectics, and make a pseudo-canticum with them!

In [2]:
from lxml import etree
from pathlib import Path

compiled = [
    Path('scan') / file for file in [
        'responsion_ach_scan.xml', 
        'responsion_av_scan.xml', 
        'responsion_eq_scan.xml', 
        'responsion_nu_scan.xml', 
        'responsion_pax_scan.xml', 
        'responsion_v_scan.xml'
    ]
]

trochaic_tetrameter_catalectic = []

for xml_file in compiled:
    try:
        tree = etree.parse(xml_file)
        for l in tree.xpath("//l[@metre='4 tr^']"):
            text = ''.join(l.itertext()) # good lxml method to know; recursively joins all texts and tails
            title = tree.xpath("//title/text()")[0]
            provenience = title + l.attrib['n']
            if text == '' or l.attrib.get('skip', 'False') == 'True':
                continue
            complete_description = [provenience, text]
            trochaic_tetrameter_catalectic.append(complete_description)

    except etree.XMLSyntaxError as e:
        print(f"Error parsing XML: {e}")

print(trochaic_tetrameter_catalectic)
print(f'Lyric trochaic tetrameter catalectics found: {len(trochaic_tetrameter_catalectic)}')

with open('scan/lyricbaseline.txt', 'w', encoding='utf-8') as file:
    for line in trochaic_tetrameter_catalectic:
        file.write(f'<l n="{line[0]}">{line[1]}</l>\n')

OSError: Error reading file 'scan/responsion_ach_scan.xml': failed to load external entity "scan/responsion_ach_scan.xml"
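
lxml reports a file it cannot open as an `OSError` ("failed to load external entity"), which the `except etree.XMLSyntaxError` clause above does not catch. One way to make the cell robust is to check which files exist before parsing. The sketch below uses only `pathlib`; `split_existing` is a hypothetical helper and the two file names are simply the first two from the list above.

```python
from pathlib import Path


def split_existing(paths):
    """Return (existing, missing) lists of paths, preserving order."""
    existing, missing = [], []
    for path in paths:
        path = Path(path)
        (existing if path.is_file() else missing).append(path)
    return existing, missing


# Hypothetical usage with part of the file list from the cell above.
candidates = [
    Path('scan') / name
    for name in ['responsion_ach_scan.xml', 'responsion_av_scan.xml']
]
existing, missing = split_existing(candidates)
for path in missing:
    print(f'Skipping missing file: {path}')
```

Only the paths in `existing` would then be handed to `etree.parse`.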

### 6*3 Null Baselines

In [None]:
from pathlib import Path

from src.stats import accentual_responsion_metric_canticum, accentual_responsion_metric_play
from src.stats_barys import barys_oxys_metric_canticum, barys_oxys_metric_play
from src.stats_comp import compatibility_corpus, compatibility_play, compatibility_canticum, compatibility_ratios_to_stats

trimeter_path = Path('data/compiled/baseline/responsion_baseline_compiled.xml')
trimeterpoly_path = Path('data/compiled/baseline/responsion_baselinepoly_compiled.xml')
tetrameter_path = Path('data/compiled/baseline/responsion_lyricbaseline_compiled.xml')

if not all(p.exists() for p in [tetrameter_path, trimeter_path, trimeterpoly_path]):
    print(f'Some baseline file paths do not exist.')

baseline_dict = {}

#
# Compatibility baselines (these are strictly means)
#

trimeter_2_strophic = compatibility_play(trimeter_path) # 3 antistrophic cantica of 2x8 lines each
trimeter_3_strophes = compatibility_canticum(trimeterpoly_path, 'baselinepoly01')
trimeter_4_strophes = compatibility_canticum(trimeterpoly_path, 'baselinepoly02')
tetrameter_2_strophes = compatibility_canticum(tetrameter_path, 'lyricbaseline01')
tetrameter_3_strophes = compatibility_canticum(tetrameter_path, 'lyricbaseline02')
tetrameter_4_strophes = compatibility_canticum(tetrameter_path, 'lyricbaseline03')

trimeter_2_strophic_baseline_comp = compatibility_ratios_to_stats([trimeter_2_strophic])
trimeter_3_strophes_baseline_comp = compatibility_ratios_to_stats([trimeter_3_strophes])
trimeter_4_strophes_baseline_comp = compatibility_ratios_to_stats([trimeter_4_strophes])
tetrameter_2_strophes_baseline_comp = compatibility_ratios_to_stats([tetrameter_2_strophes])
tetrameter_3_strophes_baseline_comp = compatibility_ratios_to_stats([tetrameter_3_strophes])
tetrameter_4_strophes_baseline_comp = compatibility_ratios_to_stats([tetrameter_4_strophes])

print('--------------------------------')
print('Contour compatibility baselines:')
print('--------------------------------')
print(f'Trimeter 2-strophes compatibility baseline: {trimeter_2_strophic_baseline_comp}')
print(f'Trimeter 3-strophes compatibility baseline: {trimeter_3_strophes_baseline_comp}')
print(f'Trimeter 4-strophes compatibility baseline: {trimeter_4_strophes_baseline_comp}')
print(f'Tetrameter 2-strophes compatibility baseline: {tetrameter_2_strophes_baseline_comp}')
print(f'Tetrameter 3-strophes compatibility baseline: {tetrameter_3_strophes_baseline_comp}')
print(f'Tetrameter 4-strophes compatibility baseline: {tetrameter_4_strophes_baseline_comp}')

print('--------------------------------')
all_sets = compatibility_corpus('data/compiled/') # takes a dir path 
total_comp = compatibility_ratios_to_stats(all_sets)
print(f'Total actual corpus compatibility: {total_comp}')
print('--------------------------------')

#
# Accentual responsion baselines (these are strictly ratios and not means)
#

trimeter_2_strophic_baseline_acc = accentual_responsion_metric_play(trimeter_path) # 3 antistrophic cantica of 2x8 lines each
trimeter_3_strophes_baseline_acc = accentual_responsion_metric_canticum(trimeterpoly_path, 'baselinepoly01')
trimeter_4_strophes_baseline_acc = accentual_responsion_metric_canticum(trimeterpoly_path, 'baselinepoly02')
tetrameter_2_strophes_baseline_acc = accentual_responsion_metric_canticum(tetrameter_path, 'lyricbaseline01')
tetrameter_3_strophes_baseline_acc = accentual_responsion_metric_canticum(tetrameter_path, 'lyricbaseline02')
tetrameter_4_strophes_baseline_acc = accentual_responsion_metric_canticum(tetrameter_path, 'lyricbaseline03')

print('\n--------------------------------')
print('Accentual responsion baselines:')
print('--------------------------------')
acc_baselines = {
    'Trimeter 2-strophes': trimeter_2_strophic_baseline_acc,
    'Trimeter 3-strophes': trimeter_3_strophes_baseline_acc,
    'Trimeter 4-strophes': trimeter_4_strophes_baseline_acc,
    'Tetrameter 2-strophes': tetrameter_2_strophes_baseline_acc,
    'Tetrameter 3-strophes': tetrameter_3_strophes_baseline_acc,
    'Tetrameter 4-strophes': tetrameter_4_strophes_baseline_acc,
}
# Print each baseline dict, then its entries one per line in green
for label, stats in acc_baselines.items():
    print(f'{label} accentual baseline: {stats}')
    for key, value in stats.items():
        print(f'  {key}: \033[32m{value}\033[0m')


#
# Barys responsion baselines (also ratios, not means)
#

trimeter_2_strophic_baseline_barys = barys_oxys_metric_play("baseline", baseline=True) # 3 antistrophic cantica of 2x8 lines each
trimeter_3_strophes_baseline_barys = barys_oxys_metric_canticum('baselinepoly01', baseline=True)
trimeter_4_strophes_baseline_barys = barys_oxys_metric_canticum('baselinepoly02', baseline=True)
tetrameter_2_strophes_baseline_barys = barys_oxys_metric_canticum('lyricbaseline01', baseline=True)
tetrameter_3_strophes_baseline_barys = barys_oxys_metric_canticum('lyricbaseline02', baseline=True)
tetrameter_4_strophes_baseline_barys = barys_oxys_metric_canticum('lyricbaseline03', baseline=True)

print('\n--------------------------------')
print('Barys responsion baselines:')
print('--------------------------------')

barys_baselines = {
    'Trimeter 2-strophes': trimeter_2_strophic_baseline_barys,
    'Trimeter 3-strophes': trimeter_3_strophes_baseline_barys,
    'Trimeter 4-strophes': trimeter_4_strophes_baseline_barys,
    'Tetrameter 2-strophes': tetrameter_2_strophes_baseline_barys,
    'Tetrameter 3-strophes': tetrameter_3_strophes_baseline_barys,
    'Tetrameter 4-strophes': tetrameter_4_strophes_baseline_barys,
}
# Print each baseline dict, then its entries one per line in green
for label, stats in barys_baselines.items():
    print(f'{label} barys baseline: {stats}')
    for key, value in stats.items():
        print(f'  {key}: \033[32m{value}\033[0m')

--------------------------------
Contour compatibility baselines:
--------------------------------
Trimeter 2-strophes compatibility baseline: 0.7929447852760736
Trimeter 3-strophes compatibility baseline: 0.8103975535168195
Trimeter 4-strophes compatibility baseline: 0.7522935779816514
Tetrameter 2-strophes compatibility baseline: 0.8217391304347826
Tetrameter 3-strophes compatibility baseline: 0.8014814814814815
Tetrameter 4-strophes compatibility baseline: 0.7651515151515151
--------------------------------
Total actual corpus compatibility: 0.8205128205128205
--------------------------------

--------------------------------
Accentual responsion baselines:
--------------------------------
Trimeter 2-strophes accentual baseline: {'acute': 0.2251655629139073, 'grave': 0.034482758620689655, 'circumflex': 0.12307692307692308, 'acute_circumflex': 0.19444444444444445}
  acute: 0.2251655629139073
  grave: 0.034482758620689655
  circumflex: 0.12307692307692308
  

#### Making src/utils/baselines.py

In [6]:
baseline_dict = {
    'comp': {
        'trimeter_2_strophic': trimeter_2_strophic_baseline_comp,
        'trimeter_3_strophes': trimeter_3_strophes_baseline_comp,
        'trimeter_4_strophes': trimeter_4_strophes_baseline_comp,
        'tetrameter_2_strophes': tetrameter_2_strophes_baseline_comp,
        'tetrameter_3_strophes': tetrameter_3_strophes_baseline_comp,
        'tetrameter_4_strophes': tetrameter_4_strophes_baseline_comp
    },
    'acc': {
        'trimeter_2_strophic': trimeter_2_strophic_baseline_acc,
        'trimeter_3_strophes': trimeter_3_strophes_baseline_acc,
        'trimeter_4_strophes': trimeter_4_strophes_baseline_acc,
        'tetrameter_2_strophes': tetrameter_2_strophes_baseline_acc,
        'tetrameter_3_strophes': tetrameter_3_strophes_baseline_acc,
        'tetrameter_4_strophes': tetrameter_4_strophes_baseline_acc
    },
    'barys': {
        'trimeter_2_strophic': trimeter_2_strophic_baseline_barys,
        'trimeter_3_strophes': trimeter_3_strophes_baseline_barys,
        'trimeter_4_strophes': trimeter_4_strophes_baseline_barys,
        'tetrameter_2_strophes': tetrameter_2_strophes_baseline_barys,
        'tetrameter_3_strophes': tetrameter_3_strophes_baseline_barys,
        'tetrameter_4_strophes': tetrameter_4_strophes_baseline_barys
    }
}

for key, value in baseline_dict.items():
    print(f'\nBaseline {key} dict:')
    for subkey, subvalue in value.items():
        print(f'  {subkey}: {subvalue}')


# Pretty-print write the baseline dict to src/utils/baselines.py
import pprint

with open('src/utils/baselines.py', 'w', encoding='utf-8') as f:
    f.write('# This baseline dictionary is generated by a cell in the nb_significance.ipynb notebook.\n')
    f.write('baseline_dict = ')
    f.write(pprint.pformat(baseline_dict, width=100))
    f.write('\n')


Baseline comp dict:
  trimeter_2_strophic: 0.7929447852760736
  trimeter_3_strophes: 0.8103975535168195
  trimeter_4_strophes: 0.7522935779816514
  tetrameter_2_strophes: 0.8217391304347826
  tetrameter_3_strophes: 0.8014814814814815
  tetrameter_4_strophes: 0.7651515151515151

Baseline acc dict:
  trimeter_2_strophic: {'acute': 0.2251655629139073, 'grave': 0.034482758620689655, 'circumflex': 0.12307692307692308, 'acute_circumflex': 0.19444444444444445}
  trimeter_3_strophes: {'acute': 0.125, 'grave': 0.0, 'circumflex': 0.0, 'acute_circumflex': 0.08256880733944955}
  trimeter_4_strophes: {'acute': 0.03773584905660377, 'grave': 0.0, 'circumflex': 0.0, 'acute_circumflex': 0.026845637583892617}
  tetrameter_2_strophes: {'acute': 0.2893081761006289, 'grave': 0.10526315789473684, 'circumflex': 0.2318840579710145, 'acute_circumflex': 0.2719298245614035}
  tetrameter_3_strophes: {'acute': 0.11612903225806452, 'grave': 0.04054054054054054, 'circumflex': 0.04477611940298507, 'acute_circumflex'

#### Example usage of the baseline utility function

In [11]:
from src.utils.utils import baseline

# Example usage of the baseline utility function

print("\nGetting acute + circumflex responsion baseline for the tristrophic trimeter:")

# 1
print(baseline_dict["acc"]["trimeter_3_strophes"]["acute_circumflex"])

# 2
print(baseline("acc", "trimeter_3_strophes", "acute_circumflex"))



Getting acute + circumflex responsion baseline for the tristrophic trimeter:
0.08256880733944955
0.08256880733944955
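
The source does not show how `baseline` in `src/utils/utils.py` is implemented. A lookup of this kind could be as small as the following sketch; `lookup_baseline` and the trimmed `demo_baselines` dictionary are hypothetical, with the values copied from the output above.

```python
# Trimmed, hypothetical copy of the baseline dictionary (values from the output above).
demo_baselines = {
    'comp': {'trimeter_3_strophes': 0.8103975535168195},
    'acc': {
        'trimeter_3_strophes': {
            'acute': 0.125,
            'grave': 0.0,
            'circumflex': 0.0,
            'acute_circumflex': 0.08256880733944955,
        },
    },
}


def lookup_baseline(metric, key, accent=None, table=demo_baselines):
    """Fetch one baseline value; `accent` is only needed for nested metrics."""
    value = table[metric][key]
    if isinstance(value, dict):
        if accent is None:
            raise ValueError(f'{metric!r} baselines need an accent type')
        return value[accent]
    return value


print(lookup_baseline('acc', 'trimeter_3_strophes', 'acute_circumflex'))
print(lookup_baseline('comp', 'trimeter_3_strophes'))
```

Compatibility baselines are plain floats while the accentual ones are nested per accent type, which is why the third argument is optional here.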


## Null Hypothesis Testing

### Compatibility Mean

#### Chi-square test for the comp mean

##### Antistrophic distribution

In [17]:
from collections import Counter

from src.stats_comp import compatibility_strophicity, compatibility_ratios_to_stats

# Distribution info (Antistrophic)

all_sets = compatibility_strophicity('data/compiled/', 'antistrophic')
total_comp = compatibility_ratios_to_stats(all_sets)

print(f'Compatibility mean (observed, antistrophic): {total_comp}')

number_of_variables = 0

values = []
for element in all_sets:
    for subelement in element:
        for subsubelement in subelement:
            for value in subsubelement:
                number_of_variables += 1
                values.append(value)
                
print(f'Number of variables: {number_of_variables}')

count_dict = Counter(values)
print(f'Distribution variables (Antistrophic):')
for key, value in count_dict.items():
    print(f'\t{key}: {value}')

Compatibility mean (observed, antistrophic): 0.8221052631578948
Number of variables: 6175
Distribution variables (Antistrophic):
	1.0: 3978
	0.5: 2197
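
The flatten-and-count loop above recurs throughout this section. A small helper could factor it out. The sketch below assumes the four-level nesting implied by the loops (plays, cantica, line groups, per-position values) and runs on made-up data; `flatten_values` is a hypothetical name.

```python
from collections import Counter
from itertools import chain


def flatten_values(all_sets):
    """Flatten plays > cantica > line groups > values into one flat list."""
    values = []
    for play in all_sets:
        for canticum in play:
            # Each canticum is a list of line groups, each a list of values.
            values.extend(chain.from_iterable(canticum))
    return values


# Made-up nested data with the same shape as the loops above expect.
toy_sets = [
    [[[1.0, 0.5], [1.0]], [[0.5, 0.5]]],   # play 1: two cantica
    [[[1.0, 1.0, 0.5]]],                   # play 2: one canticum
]
values = flatten_values(toy_sets)
counts = Counter(values)
print(len(values), dict(counts))
```

With such a helper, each distribution cell reduces to one call followed by `Counter(values)`.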


Antistrophic trimeter null

In [22]:
from collections import Counter

from src.stats_comp import compatibility_strophicity, compatibility_ratios_to_stats

# Distribution info (Antistrophic)

all_sets = compatibility_strophicity('data/compiled/baseline_trimeter', 'antistrophic')
total_comp = compatibility_ratios_to_stats(all_sets)

print(f'Compatibility mean (null, antistrophic trimeter): {total_comp}')

number_of_variables = 0

values = []
for element in all_sets:
    for subelement in element:
        for subsubelement in subelement:
            for value in subsubelement:
                number_of_variables += 1
                values.append(value)
                
print(f'Number of variables: {number_of_variables}')

count_dict = Counter(values)
print(f'Distribution variables:')
for key, value in count_dict.items():
    print(f'\t{key}: {value}')

Compatibility mean (null, antistrophic trimeter): 0.7929447852760736
Number of variables: 326
Distribution variables:
	0.5: 135
	1.0: 191


Antistrophic mean chi-square test against antistrophic trimeter null

In [27]:
import numpy as np
from scipy.stats import chisquare

# Observed counts
obs_counts = np.array([2197, 3978])  # [0.5, 1.0]
obs_total = obs_counts.sum()

# Null counts → proportions
null_counts = np.array([135, 191])
null_total = null_counts.sum()
null_probs = null_counts / null_total

# Expected counts under H₀
expected_counts = null_probs * obs_total

# Chi-square test
chi2_stat, p_value = chisquare(f_obs=obs_counts, f_exp=expected_counts)

# Output
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {len(obs_counts) - 1}")
print(f"P-value: {p_value:.3e}")

Chi-square statistic: 86.5674
Degrees of freedom: 1
P-value: 1.351e-20
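
For two categories, the statistic reported above can be checked by hand: $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, and with one degree of freedom the p-value equals $\operatorname{erfc}\left(\sqrt{\chi^2 / 2}\right)$. The sketch below redoes the calculation with the standard library only, using the counts from the cells above; `chi_square_gof` is a hypothetical helper, not part of the project.

```python
import math


def chi_square_gof(obs_counts, null_counts):
    """Pearson goodness-of-fit statistic with a p-value valid for two categories (df = 1)."""
    if len(obs_counts) != 2 or len(null_counts) != 2:
        raise ValueError('the erfc shortcut below only holds for two categories')
    obs_total = sum(obs_counts)
    null_total = sum(null_counts)
    # Expected counts: null proportions scaled to the observed sample size.
    expected = [obs_total * c / null_total for c in null_counts]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_counts, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts in the order [0.5, 1.0]: observed antistrophic corpus vs. trimeter null.
chi2, p_value = chi_square_gof([2197, 3978], [135, 191])
print(f'{chi2:.4f}')    # 86.5674, the value scipy reports above
print(f'{p_value:.3e}')
```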


Antistrophic tetrameter null

In [29]:
from collections import Counter

from src.stats_comp import compatibility_strophicity, compatibility_ratios_to_stats

# Distribution info (Antistrophic)

all_sets = compatibility_strophicity('data/compiled/baseline_tetrameter', 'antistrophic')
total_comp = compatibility_ratios_to_stats(all_sets)

print(f'Compatibility mean (null, antistrophic tetrameter): {total_comp}')

number_of_variables = 0

values = []
for element in all_sets:
    for subelement in element:
        for subsubelement in subelement:
            for value in subsubelement:
                number_of_variables += 1
                values.append(value)
                
print(f'Number of variables: {number_of_variables}')

count_dict = Counter(values)
print(f'Distribution variables:')
for key, value in count_dict.items():
    print(f'\t{key}: {value}')

Compatibility mean (null, antistrophic tetrameter): 0.8217391304347826
Number of variables: 345
Distribution variables:
	1.0: 222
	0.5: 123


Antistrophic tetrameter null SHUFFLED, 1 to 10 versions together

In [None]:
from collections import Counter

from src.stats_comp import compatibility_strophicity, compatibility_ratios_to_stats

# Distribution info (Antistrophic)

all_sets = compatibility_strophicity('data/compiled/baseline_tetrameter_shuffled', 'antistrophic')
total_comp = compatibility_ratios_to_stats(all_sets)

print(f'Compatibility mean (null, antistrophic tetrameter): {total_comp}')

number_of_variables = 0

values = []
for element in all_sets:
    for subelement in element:
        for subsubelement in subelement:
            for value in subsubelement:
                number_of_variables += 1
                values.append(value)
                
print(f'Number of variables: {number_of_variables}')

count_dict = Counter(values)
print(f'Distribution variables:')
for key, value in count_dict.items():
    print(f'\t{key}: {value}')
print('---------------------------------')

# Build the cumulative shuffled null distributions: directory i holds i shuffled versions (i = 2..20)

from collections import defaultdict
shuffled_distributions = defaultdict(dict)

# Store the value counts for each i
for i in range(2, 21):
    all_sets = compatibility_strophicity(f'data/compiled/baseline_tetrameter_shuffled{i}', 'antistrophic')
    total_comp = compatibility_ratios_to_stats(all_sets)

    print(f'{i} shuffled versions')
    print(f'Compatibility mean (null, antistrophic tetrameter): {total_comp}')

    number_of_variables = 0

    values = []
    for element in all_sets:
        for subelement in element:
            for subsubelement in subelement:
                for value in subsubelement:
                    number_of_variables += 1
                    values.append(value)
                    
    print(f'Number of variables: {number_of_variables}')

    count_dict = Counter(values)
    print(f'Distribution variables:')
    for key, value in count_dict.items():
        print(f'\t{key}: {value}')
    print('---------------------------------')
    shuffled_distributions[i] = count_dict

print(shuffled_distributions)

# Convert the defaultdict back into a plain dict
shuffled_distributions = {k: dict(v) for k, v in shuffled_distributions.items()}
print(shuffled_distributions)

Compatibility mean (null, antistrophic tetrameter): 0.8043478260869565
Number of variables: 345
Distribution variables:
	1.0: 210
	0.5: 135
---------------------------------
2 shuffled versions
Compatibility mean (null, antistrophic tetrameter): 0.8057971014492754
Number of variables: 690
Distribution variables:
	1.0: 422
	0.5: 268
---------------------------------
3 shuffled versions
Compatibility mean (null, antistrophic tetrameter): 0.8091787439613527
Number of variables: 1035
Distribution variables:
	1.0: 640
	0.5: 395
---------------------------------
4 shuffled versions
Compatibility mean (null, antistrophic tetrameter): 0.8152173913043478
Number of variables: 1380
Distribution variables:
	0.5: 510
	1.0: 870
---------------------------------
5 shuffled versions
Compatibility mean (null, antistrophic tetrameter): 0.8147826086956522
Number of variables: 1725
Distribution variables:
	1.0: 1086
	0.5: 639
---------------------------------
6 shuffled versions
Compatibility mean (null, 

Antistrophic mean chi-square test against antistrophic tetrameter null

In [None]:
import numpy as np
from scipy.stats import chisquare
import math

# Observed counts
obs_counts = np.array([2197, 3978])  # [0.5, 1.0]
obs_total = obs_counts.sum()

# Updated null counts from antistrophic tetrameter
null_counts = np.array([123, 222])  # [0.5, 1.0]
null_total = null_counts.sum()
null_probs = null_counts / null_total

# Expected counts under null
expected_counts = null_probs * obs_total

# Chi-square test
chi2_stat, p_value = chisquare(f_obs=obs_counts, f_exp=expected_counts)

# Output
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {len(obs_counts) - 1}")
if p_value > 0:
    print(f"P-value: {p_value:.3f} (≈ 10^{math.floor(math.log10(p_value))})")
else:
    print("P-value is effectively zero (underflow)")

Chi-square statistic: 0.0144
Degrees of freedom: 1
P-value: 0.904 (≈ 10^-1)
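
One caveat, offered as an aside rather than as part of the original analysis: `chisquare` treats the null proportions as exactly known, although they are themselves estimated from a few hundred values. A two-sample test of homogeneity on the 2x2 table of counts takes that uncertainty into account. The sketch below computes the Pearson statistic without continuity correction using the standard library only; `chi_square_homogeneity_2x2` is a hypothetical helper.

```python
import math


def chi_square_homogeneity_2x2(sample_a, sample_b):
    """Pearson chi-square test of homogeneity for two samples over two categories."""
    table = [list(sample_a), list(sample_b)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under a shared distribution for both samples.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # One degree of freedom for a 2x2 table.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


observed = [2197, 3978]   # [0.5, 1.0], antistrophic corpus
for label, null in [('trimeter', [135, 191]), ('tetrameter', [123, 222])]:
    chi2, p_value = chi_square_homogeneity_2x2(observed, null)
    print(f'{label}: chi2 = {chi2:.4f}, p = {p_value:.4f}')
```

On these counts the statistic against the trimeter null comes out far below the goodness-of-fit value of 86.57, because the sampling error of the small null sample now enters the test.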


Antistrophic mean chi-square test against SHUFFLED antistrophic tetrameter null

In [16]:
import numpy as np
from scipy.stats import chisquare
import math

# Observed counts
obs_counts = np.array([2197, 3978])  # [0.5, 1.0]
obs_total = obs_counts.sum()

# Null counts from the shuffled antistrophic tetrameter null (6 concatenated versions)
null_counts = np.array([754, 1316])  # [0.5, 1.0]
null_total = null_counts.sum()
null_probs = null_counts / null_total

# Expected counts under null
expected_counts = null_probs * obs_total

# Chi-square test
chi2_stat, p_value = chisquare(f_obs=obs_counts, f_exp=expected_counts)

# Output
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {len(obs_counts) - 1}")
if p_value > 0:
    print(f"P-value: {p_value:.3f} (≈ 10^{math.floor(math.log10(p_value))})")
else:
    print("P-value is effectively zero (underflow)")

Chi-square statistic: 1.9093
Degrees of freedom: 1
P-value: 0.167 (≈ 10^-1)


Chi-square test using all ever-growing concatenations of 10 shuffled distributions

In [34]:
import numpy as np
from scipy.stats import chisquare
import math

for i in range(2, 14):
    print(f'\nTesting distribution {i}:')
    distribution = shuffled_distributions[i]

    # Observed counts
    obs_counts = np.array([2197, 3978])  # [0.5, 1.0]
    obs_total = obs_counts.sum()

    # Null counts from the i-th cumulative shuffled tetrameter distribution
    null_counts = np.array([distribution[0.5], distribution[1.0]])  # [0.5, 1.0]
    null_total = null_counts.sum()
    null_probs = null_counts / null_total

    # Expected counts under null
    expected_counts = null_probs * obs_total

    # Chi-square test
    chi2_stat, p_value = chisquare(f_obs=obs_counts, f_exp=expected_counts)

    # Output
    print(f"Chi-square statistic: {chi2_stat:.4f}")
    print(f"Degrees of freedom: {len(obs_counts) - 1}")
    if p_value > 0:
        print(f"P-value: {p_value:.4f} (≈ 10^{math.floor(math.log10(p_value))})")
    else:
        print("P-value is effectively zero (underflow)")


Testing distribution 2:
Chi-square statistic: 27.6540
Degrees of freedom: 1
P-value: 0.0000 (≈ 10^-7)

Testing distribution 3:
Chi-square statistic: 17.4890
Degrees of freedom: 1
P-value: 0.0000 (≈ 10^-5)

Testing distribution 4:
Chi-square statistic: 5.0296
Degrees of freedom: 1
P-value: 0.0249 (≈ 10^-2)

Testing distribution 5:
Chi-square statistic: 5.6791
Degrees of freedom: 1
P-value: 0.0172 (≈ 10^-2)

Testing distribution 6:
Chi-square statistic: 1.9093
Degrees of freedom: 1
P-value: 0.1670 (≈ 10^-1)

Testing distribution 7:
Chi-square statistic: 5.8072
Degrees of freedom: 1
P-value: 0.0160 (≈ 10^-2)

Testing distribution 8:
Chi-square statistic: 6.7261
Degrees of freedom: 1
P-value: 0.0095 (≈ 10^-3)

Testing distribution 9:
Chi-square statistic: 8.0647
Degrees of freedom: 1
P-value: 0.0045 (≈ 10^-3)

Testing distribution 10:
Chi-square statistic: 7.0944
Degrees of freedom: 1
P-value: 0.0077 (≈ 10^-3)

Testing distribution 11:
Chi-square statistic: 10.2243
Degrees of freedom: 1
P

##### 3- and 4-strophic distributions

1. Polystrophic observed distribution
2. Polystrophic trimeter null
3. Chi-square test against trimeter null
4. Polystrophic tetrameter null
5. Chi-square test against tetrameter null

###### Observed distributions

3-strophic observed distribution

In [7]:
from collections import Counter

from src.stats_comp import compatibility_strophicity, compatibility_ratios_to_stats

# Distribution info

all_sets = compatibility_strophicity('data/compiled/', 'three-strophic')
total_comp = compatibility_ratios_to_stats(all_sets)

print(f'Compatibility mean (observed): {total_comp}')

number_of_variables = 0

values = []
for element in all_sets:
    for subelement in element:
        for subsubelement in subelement:
            for value in subsubelement:
                number_of_variables += 1
                values.append(value)
                
print(f'Number of variables: {number_of_variables}')

count_dict = Counter(values)
print(f'Distribution variables (Polystrophic):')
for key, value in count_dict.items():
    print(f'\t{key}: {value}')

Compatibility mean (observed): 0.8172043010752688
Number of variables: 155
Distribution variables (Polystrophic):
	0.6666666666666666: 85
	1.0: 70


3-strophic null trimeter

In [5]:
from collections import Counter

from src.stats_comp import compatibility_strophicity, compatibility_ratios_to_stats

# Distribution info

all_sets = compatibility_strophicity('data/compiled/baseline_trimeter', 'three-strophic')
total_comp = compatibility_ratios_to_stats(all_sets)

print(f'Compatibility mean: {total_comp}')

number_of_variables = 0

values = []
for element in all_sets:
    for subelement in element:
        for subsubelement in subelement:
            for value in subsubelement:
                number_of_variables += 1
                values.append(value)
                
print(f'Number of variables: {number_of_variables}')

count_dict = Counter(values)
print(f'Distribution variables:')
for key, value in count_dict.items():
    print(f'\t{key}: {value}')

Compatibility mean: 0.8103975535168195
Number of variables: 109
Distribution variables:
	0.6666666666666666: 62
	1.0: 47


Chi-square test comparing observed 3-strophic distribution to 3-strophic trimeter null:

In [None]:
import numpy as np
from scipy.stats import chisquare
import math

# Observed counts
# Order: [0.666..., 1.0]
obs_counts = np.array([85, 70])
obs_total = obs_counts.sum()

# Null counts (for expected proportions)
null_counts = np.array([62, 47])
null_total = null_counts.sum()
null_probs = null_counts / null_total

# Expected counts under H₀ (scaled to obs_total)
expected_counts = null_probs * obs_total

# Chi-square test
chi2_stat, p_value = chisquare(f_obs=obs_counts, f_exp=expected_counts)

# Output
print("Chi-square test comparing observed 3-strophic distribution to 3-strophic trimeter null:")
print(f"Observed total: {obs_total}")
print(f"Expected counts under H₀: {expected_counts.round(2)}")
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {len(obs_counts) - 1}")
if p_value > 0:
    print(f"P-value: {p_value:.4f} (≈ 10^{math.floor(math.log10(p_value))})")
else:
    print("P-value: < 1e-{0}".format(abs(int(np.floor(np.log10(np.finfo(float).eps))))))

Chi-square test comparing observed polystrophic distribution to trimeter null:
Observed total: 155
Expected counts under H₀: [88.17 66.83]
Chi-square statistic: 0.2635
Degrees of freedom: 1
P-value: 0.6077 (≈ 10^-1)


3-strophic tetrameter null

In [9]:
from collections import Counter

from src.stats_comp import compatibility_strophicity, compatibility_ratios_to_stats

# Distribution info

all_sets = compatibility_strophicity('data/compiled/baseline_tetrameter', 'three-strophic')
total_comp = compatibility_ratios_to_stats(all_sets)

print(f'Compatibility mean: {total_comp}')

number_of_variables = 0

values = []
for element in all_sets:
    for subelement in element:
        for subsubelement in subelement:
            for value in subsubelement:
                number_of_variables += 1
                values.append(value)
                
print(f'Number of variables: {number_of_variables}')

count_dict = Counter(values)
print(f'Distribution variables:')
for key, value in count_dict.items():
    print(f'\t{key}: {value}')

Compatibility mean: 0.8014814814814815
Number of variables: 225
Distribution variables:
	0.6666666666666666: 134
	1.0: 91


Chi-square test comparing observed 3-strophic distribution to 3-strophic tetrameter null:

In [10]:
import numpy as np
from scipy.stats import chisquare
import math

# Observed counts
# Order: [0.666..., 1.0]
obs_counts = np.array([85, 70])
obs_total = obs_counts.sum()

# Null counts (for expected proportions)
null_counts = np.array([134, 91])
null_total = null_counts.sum()
null_probs = null_counts / null_total

# Expected counts under H₀ (scaled to obs_total)
expected_counts = null_probs * obs_total

# Chi-square test
chi2_stat, p_value = chisquare(f_obs=obs_counts, f_exp=expected_counts)

# Output
print("Chi-square test comparing observed 3-strophic distribution to 3-strophic TETRAMETER null:")
print(f"Observed total: {obs_total}")
print(f"Expected counts under H₀: {expected_counts.round(2)}")
print(f"Chi-square statistic: {chi2_stat:.4f}")
print(f"Degrees of freedom: {len(obs_counts) - 1}")
if p_value > 0:
    print(f"P-value: {p_value:.4f} (≈ 10^{math.floor(math.log10(p_value))})")
else:
    print("P-value: < 1e-{0}".format(abs(int(np.floor(np.log10(np.finfo(float).eps))))))

Chi-square test comparing observed 3-strophic distribution to 3-strophic TETRAMETER null:
Observed total: 155
Expected counts under H₀: [92.31 62.69]
Chi-square statistic: 1.4317
Degrees of freedom: 1
P-value: 0.2315 (≈ 10^-1)
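
As a sanity check, the statistic above can be recomputed by hand as $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$. The sketch below is not part of the original analysis. It reuses the counts from the cell above, and the variable names (`stat_by_hand`, `p_by_hand`) are my own. Note that this goodness-of-fit form treats the null proportions as exactly known, although they are themselves estimated from 225 values.

```python
import numpy as np
from scipy.stats import chi2

# Counts copied from the cell above, in the order [0.666..., 1.0]
obs = np.array([85, 70])
null = np.array([134, 91])

# Expected counts: null proportions scaled to the observed total
expected = null / null.sum() * obs.sum()

# Pearson chi-square statistic, summed over the two categories
stat_by_hand = float(((obs - expected) ** 2 / expected).sum())

# Upper-tail probability with (number of categories - 1) degrees of freedom
dof = len(obs) - 1
p_by_hand = float(chi2.sf(stat_by_hand, dof))

print(f"Chi-square by hand: {stat_by_hand:.4f}, p-value: {p_by_hand:.4f}")
```

Both numbers should agree with the `chisquare` output to the printed precision.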


## Monte Carlo Test for the Comp Mean 

We compute a mean $\bar{x}$ over $n$ random variables (one per syllable or accent), which
- are bounded in $[0, 1]$,
- take values in a discrete set:
  - $\left\{\frac{1}{2}, \frac{2}{3}, \frac{3}{4}, 1 \right\}$ for the full corpus (97.0% being binary, 0.5 or 1),
  - or just binary for any metric on the antistrophic sub-corpus,

and whose true underlying distribution is
- known only empirically (no analytical variance available!),
- not symmetric, not uniform, and heavily bimodal (binary).

We then compare $\bar{x}$ with a known baseline mean under the null, $\mu_0$.

To do this well we need information about the distribution that $\bar{x}$ comes from. It turns out that the two lowest conceivable values, $\frac{1}{4}$ and $\frac{1}{3}$, never actually occur, so the only values are $\left\{\frac{1}{2}, \frac{2}{3}, \frac{3}{4}, 1\right\}$. (This means that among 3 or 4 responding strophes there is no syllable position at which all the strophes are mutually incompatible. This makes sense: the contour can go in only two directions, and a flat contour is compatible with either by definition.)
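
The pigeonhole argument can be checked by brute force. The sketch below is only an illustration and does not use `src.stats_comp`: it assumes that the ratio at a position is the size of the largest mutually compatible group of strophes divided by the number of strophes, and the contour labels `rise`, `fall` and `flat` are hypothetical names of my own.

```python
from fractions import Fraction
from itertools import product

# Hypothetical contour labels; "flat" is taken to be compatible with both others
CONTOURS = ("rise", "fall", "flat")


def compatibility_ratio(contours):
    """Assumed definition: largest mutually compatible group / number of strophes."""
    flats = contours.count("flat")
    # Flats can join whichever direction is in the majority
    largest_group = max(contours.count("rise"), contours.count("fall")) + flats
    return Fraction(largest_group, len(contours))


# Enumerate every combination of contours for 2, 3 and 4 responding strophes
attainable = {
    compatibility_ratio(combo)
    for n in (2, 3, 4)
    for combo in product(CONTOURS, repeat=n)
}

print(sorted(attainable))
```

Under this assumed definition the smallest attainable value is $\frac{1}{2}$, so $\frac{1}{4}$ and $\frac{1}{3}$ are excluded by construction rather than by accident of the data.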

In [None]:
from collections import Counter

from src.stats_comp import compatibility_corpus, compatibility_ratios_to_stats

# Distribution info (Full-corpus)

all_sets = compatibility_corpus('data/compiled/')
total_comp = compatibility_ratios_to_stats(all_sets)

print(f'Compatibility mean (observed): {total_comp}')

# Flatten the four nested levels into one flat list of compatibility ratios
values = [
    value
    for element in all_sets
    for subelement in element
    for subsubelement in subelement
    for value in subsubelement
]
number_of_variables = len(values)

print(f'Number of variables: {number_of_variables}')

count_dict = Counter(values)
print(f'Distribution variables:')
for key, value in count_dict.items():
    print(f'\t{key}: {value}')

##### Trimeter Null Distribution

In [20]:
from collections import Counter

from src.stats_comp import compatibility_corpus, compatibility_ratios_to_stats

# Trimeter Null distribution info

null_sets = compatibility_corpus('data/compiled/baseline')
null_comp = compatibility_ratios_to_stats(null_sets)

print('--------------------------------')
print(f'Compatibility mean under trimeter null (m0): {null_comp}')

# Flatten the four nested levels into one flat list of compatibility ratios
values_null = [
    value
    for element in null_sets
    for subelement in element
    for subsubelement in subelement
    for value in subsubelement
]
number_of_variables_null = len(values_null)

print(f'Number of variables (null): {number_of_variables_null}')

count_dict_null = Counter(values_null)
print(f'Null distribution variables:')
for key, value in count_dict_null.items():
    print(f'\t{key}: {value}')

--------------------------------
Compatibility mean under trimeter null (m0): 0.7966510294500913
Number of variables (null): 1279
Null distribution variables:
	0.6666666666666666: 196
	1.0: 632
	0.5: 328
	0.75: 123


##### Tetrameter Null Distribution

In [11]:
from collections import Counter

from src.stats_comp import compatibility_corpus, compatibility_ratios_to_stats

# Tetrameter Null distribution info

null_sets = compatibility_corpus('data/compiled/baseline_tetrameter')
null_comp = compatibility_ratios_to_stats(null_sets)

print('--------------------------------')
print(f'Compatibility mean under tetrameter null (m0): {null_comp}')

# Flatten the four nested levels into one flat list of compatibility ratios
values_null = [
    value
    for element in null_sets
    for subelement in element
    for subsubelement in subelement
    for value in subsubelement
]
number_of_variables_null = len(values_null)

print(f'Number of variables (null): {number_of_variables_null}')

count_dict_null = Counter(values_null)
print(f'Null distribution variables:')
for key, value in count_dict_null.items():
    print(f'\t{key}: {value}')

--------------------------------
Compatibility mean under tetrameter null (m0): 0.8028344671201814
Number of variables (null): 735
Null distribution variables:
	0.6666666666666666: 134
	1.0: 364
	0.5: 164
	0.75: 73
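
With an empirical null distribution like the one above, a Monte Carlo test for the mean can be sketched as follows. This is a minimal illustration, not necessarily the procedure used elsewhere in this notebook: the function name `monte_carlo_mean_test` and its parameters are my own, the values are treated as independent draws from the null (which ignores any dependence between neighbouring syllables), and the observed mean and $n$ in the example call are placeholders, not results.

```python
import numpy as np

# Tetrameter null distribution, copied from the output above
NULL_VALUES = np.array([2 / 3, 1.0, 0.5, 0.75])
NULL_COUNTS = np.array([134, 364, 164, 73])


def monte_carlo_mean_test(observed_mean, n, values, counts, n_sims=10_000, seed=0):
    """Two-sided Monte Carlo p-value for a mean of n i.i.d. draws from the empirical null."""
    rng = np.random.default_rng(seed)
    probs = counts / counts.sum()
    mu0 = float(values @ probs)  # mean of the null distribution

    # Each row counts how many of the n draws landed on each possible value
    draws = rng.multinomial(n, probs, size=n_sims)
    sim_means = draws @ values / n

    # Simulated means at least as far from mu0 as the observed one
    extreme = np.abs(sim_means - mu0) >= abs(observed_mean - mu0)
    # The +1 counts the observed sample itself, so the p-value is never exactly 0
    p_value = (int(extreme.sum()) + 1) / (n_sims + 1)
    return mu0, p_value


# Placeholder inputs for illustration only (NOT corpus results)
mu0, p = monte_carlo_mean_test(
    observed_mean=0.82, n=1000, values=NULL_VALUES, counts=NULL_COUNTS
)
print(f"mu0 = {mu0:.4f}, Monte Carlo p-value = {p:.4f}")
```

Sampling the counts with a multinomial draw avoids materialising every individual value, and no variance formula is needed because the spread of the mean is read off the simulated means directly.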


### Acute-circumflex and barys ratio 

Binomial significance test for the binary acute-circumflex and barys-only ratio metrics. 