# Relational Analysis

Uses encodings generated by the USE ML v3 encoder for:

- Constitution segments
- CCP reference topics (concatenated label and descriptions)
- Comparison topic sets (concatenated label and descriptions)

## Methodology

Discover semantic coverage gaps in a reference ontology by using a comparison ontology to audit the reference ontology.

### Process Outline
1. **Reference ontology**: Designed to code corpus segments.
2. **Comparison ontology**: Alternative topic framework.
3. **Find non-overlapping topics**: Comparison topics with zero co-occurrences with reference topics.
4. **Gap identification**: These comparison topics capture segments that the reference ontology misses.
5. **Validation**: Confirm reference ontology truly has no coverage of these segments.

### What This Discovers
- **Blind spots** in the reference ontology.
- **Missing semantic categories** that comparison topics capture.
- **Segments that fall through the cracks** of the reference system.

### Chi-square Analysis

The chi-square statistic is computed on the topic-topic co-occurrence matrix and read against its degrees of freedom:
- **High values**: Indicate overlap/redundancy between ontologies.
- **Low values**: Indicate that the ontologies are capturing different semantic spaces, which is evidence that comparison topics are finding genuinely different content.

A chi-square value that is low relative to its degrees of freedom is consistent with the comparison ontology identifying semantic content that the reference ontology systematically misses. It does not by itself establish statistical significance.
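As a sketch of the statistic computed later in the notebook, the formula can be written as follows. The symbols $O_{ij}$, $\hat{O}_{ij}$, $r_i$, $c_j$, $N$, $R$ and $C$ are notation assumed here for illustration and do not appear in the original.

$$
\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{\left(O_{ij} - \hat{O}_{ij}\right)^2}{\hat{O}_{ij}},
\qquad
\hat{O}_{ij} = \frac{r_i\, c_j}{N},
\qquad
\mathrm{df} = (R-1)(C-1)
$$

Here $O_{ij}$ is the co-occurrence count for comparison topic $i$ and reference topic $j$, $r_i$ and $c_j$ are the row and column totals, and $N$ is the grand total. The mean of a chi-square distribution equals its degrees of freedom, which is why the observed value is compared with $\mathrm{df}$.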


Analysis is based on topic-segment matrices generated during ontology processing.

Results can:

1. Indicate the need for a new topic.
2. Indicate segments that need to be tagged, if the comparison topic is a close match to a CCP reference topic.


In [None]:
__author__      = 'Roy Gardner'
__copyright__   = 'Copyright 2025, Roy and Sally Gardner'

%run ./_library/packages.py
%run ./_library/utilities.py

exclusion_list = ['segment_encodings.json','IDEA-GLO_topic_encodings.json','CCP-FACET_topic_encodings.json',\
                     'NC-DCC_topic_encodings.json']
_,model_dict = initialise(exclusion_list=exclusion_list)


## Map segments onto tagged topics

This gives us a human-coded segment-topic map. We use it to check for human coding of segments that have no semantic relationship to any CCP topic but are semantically similar to a comparison topic.


In [None]:
# Invert the sat_segments_dict so that each segment ID maps to the list of keys that contain it

segments_lookup = {}
for k,v in model_dict['sat_segments_dict'].items():
    for segment_id in v:
        # setdefault creates the list on first sight of a segment ID
        segments_lookup.setdefault(segment_id, []).append(k)


## Run the comparisons

1. The reference ontology matrix is the topic-segment matrix for CCP topics. This matrix contains semantic similarity scores for all topic-segment pairs.
2. For each comparison ontology we have a topic-segment matrix which is processed as follows:
    - Threshold and binarise
    - Generate the topic-topic co-occurrence matrix with comparison topics in rows and reference topics in columns.
 
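For each comparison ontology, the segments of comparison topics that have an empty row in the co-occurrence matrix are written to a CSV file.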
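The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's own code: the score values and the names `ref_scores`, `comp_scores`, `ref_bin`, `comp_bin`, `cooc`, `gap_topics` and `gap_segments` are hypothetical.

```python
import numpy as np

threshold = 0.70  # same cut-off value the notebook uses

# Hypothetical similarity scores: rows are topics, columns are 4 segments
ref_scores = np.array([[0.9, 0.2, 0.1, 0.3],
                       [0.1, 0.8, 0.2, 0.4]])
comp_scores = np.array([[0.8, 0.1, 0.3, 0.2],
                        [0.2, 0.3, 0.9, 0.75],
                        [0.1, 0.2, 0.3, 0.4]])

# Threshold and binarise both topic-segment matrices
ref_bin = (ref_scores >= threshold).astype(int)
comp_bin = (comp_scores >= threshold).astype(int)

# Co-occurrence counts: comparison topics in rows, reference topics in columns
cooc = comp_bin @ ref_bin.T

# A gap topic has an all-zero co-occurrence row but at least one segment of its own
empty_rows = ~cooc.any(axis=1)
has_segments = comp_bin.any(axis=1)
gap_topics = np.flatnonzero(empty_rows & has_segments)

# Segment indices captured by each gap topic
gap_segments = {int(t): np.flatnonzero(comp_bin[t]).tolist() for t in gap_topics}
print(gap_segments)
```

In this toy case the third comparison topic also has an empty row, but it is skipped because it matches no segment at all.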

### Validation

Every segment assigned to a non-overlapping comparison topic must correspond to an empty (all-zero) column in the binarised reference topic-segment matrix $\mathbf{B}^{r,s}$ (`B` in the code below).
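A minimal sketch of this check on a toy matrix follows. The helper `all_uncovered` and the toy `ref_bin` matrix are hypothetical names introduced for illustration only.

```python
import numpy as np

def all_uncovered(ref_bin, segment_indices):
    """Return True if no reference topic is assigned to any of the given segments."""
    # Select the candidate segments' columns; any 1 means a reference topic covers one
    cols = ref_bin[:, list(segment_indices)]
    return not cols.any()

# Toy binarised reference matrix: 2 reference topics x 4 segments
ref_bin = np.array([[1, 0, 0, 0],
                    [0, 1, 0, 0]])

ok = all_uncovered(ref_bin, [2, 3])   # segments 2 and 3 have empty columns
bad = all_uncovered(ref_bin, [1, 2])  # segment 1 is covered by the second topic
print(ok, bad)
```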



In [None]:
threshold = 0.70

# Get the reference topic labels
ref_labels = [v['encoded_text'] for k,v in model_dict['CCP-FACET_topics_dict'].items()]

# Threshold and binarise the topic-segment matrix of the reference ontology
A = np.array(model_dict['CCP-FACET_topic_segment_matrix'])
B = np.where(A>=threshold,1,0).astype(int)


# Now process the comparison matrices
# Define the data structures for the comparison ontologies
comparison_matrices = ['IDEA-GLO_topic_segment_matrix','NC-DCC_topic_segment_matrix']
comparison_dicts = ['IDEA-GLO_topics_dict','NC-DCC_topics_dict']

# Iterate the comparison ontologies
for i,matrix_label in enumerate(comparison_matrices):
    
    # Get the topic labels for the comparison ontology
    comp_labels = [v['encoded_text'] for k,v in model_dict[comparison_dicts[i]].items()]
    # For results file and validation confirmation
    ontology_label = matrix_label.split('_')[0]
    
    # Get the topic-segment matrix for our comparison ontology
    C = np.array(model_dict[matrix_label])
    # Threshold and binarise
    D = np.where(C>=threshold,1,0).astype(int)
    
    # Co-occurrence matrix with comparison topics in rows and reference topics in columns
    E = np.matmul(D,B.T)
    
    # Use to collect segments that are semantically similar to a comparison topic
    segments_set = []

    csv_row_list = []
    header = []
    header.append('comparison_topic')
    header.append('segment_id')
    header.append('segment_text')
    header.append('tagged_ccp_topics')
    csv_row_list.append(header)
    
    # Iterate comparison topics searching for empty rows in the co-occurrence matrix E.
    # An empty row means that the comparison topic shares no segments with any CCP topic.
    # topic_index is used so that the outer loop variable i is not shadowed.
    for topic_index,row in enumerate(E):
        if row.nonzero()[0].size == 0:
            # Get at- or above-threshold segments from the topic's row in topic-segment matrix D
            segment_indices = [j for j,v in enumerate(D[topic_index]) if v==1]
            if len(segment_indices) == 0:
                # Comparison topic is not semantically similar to any segment
                continue
            for j in segment_indices:
                # Iterating the comparison topic's semantically similar segments
                csv_row = []
                csv_row.append(comp_labels[topic_index])
                segment_id = model_dict['encoded_segments'][j]
                segments_set.append(segment_id)
                segment_text = model_dict['segments_dict'][segment_id]['text']
                csv_row.append(segment_id)
                csv_row.append(segment_text)
                # Check whether the segment has been manually tagged
                if segment_id in segments_lookup:
                    csv_row.append(str(segments_lookup[segment_id]))
                else:
                    csv_row.append('')
                csv_row_list.append(csv_row)

    # Validate the segment set by ensuring that each segment's column in the CCP topic-segment
    # matrix comprises zeros only. This test ensures that a segment is not semantically similar to 
    # a CCP topic at or above threshold.
    segments_set = list(set(segments_set))
    n = 0
    for segment_id in segments_set:
        segment_index = model_dict['encoded_segments'].index(segment_id)
        if B[:,segment_index].nonzero()[0].size == 0:
            n += 1
    assert(len(segments_set)==n)       
    print('Validated',len(segments_set),'segments for',ontology_label)
    
    # Got through validation so write results to CSV file.
    # newline='' is what the csv module expects; the with block closes the file.
    with open('./outputs/' + ontology_label + '_candidates.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(csv_row_list)
    print('Data exported to CSV file:','./outputs/' + ontology_label + '_candidates.csv')
    print()


## Chi-square analysis

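We want chi-square values that are low relative to the degrees of freedom. A ratio of observed chi-square to degrees of freedom near or below 1 is consistent with independence between the comparison and reference topic sets.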
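To show what low and high values look like, here is a small sketch on two toy tables. The function `chi_square_ratio` and both tables are hypothetical, and the sketch assumes every row and column total is non-zero, so it needs no epsilon adjustment.

```python
import numpy as np

def chi_square_ratio(table):
    """Return (chi-square, degrees of freedom, ratio) for a contingency table.

    Assumes all row and column totals are non-zero.
    """
    table = np.asarray(table, dtype=float)
    # Expected counts under independence: outer product of marginals over the total
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    chi2 = np.sum((table - expected) ** 2 / expected)
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return chi2, df, chi2 / df

# Rows are proportional to each other, so observed equals expected
independent = [[10, 20], [20, 40]]
# All counts sit on the diagonal, so rows and columns are strongly associated
associated = [[10, 0], [0, 10]]

print(chi_square_ratio(independent))
print(chi_square_ratio(associated))
```

The first table gives a chi-square of zero, and the second gives a value well above its single degree of freedom.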


In [None]:
def get_expected_matrix(matrix):
    # Expected matrix assuming independence of topic sets
    rows_marginal = matrix.sum(axis=1)
    cols_marginal = matrix.sum(axis=0)
    matrix_total = matrix.sum()
    
    # Calculate matrix of expected values under independence
    expected = np.outer(rows_marginal,cols_marginal)/matrix_total
    return expected

# Threshold and binarise the topic-segment matrix of the reference ontology
A = np.array(model_dict['CCP-FACET_topic_segment_matrix'])
B = np.where(A>=threshold,1,0).astype(int)

comparison_matrices = ['IDEA-GLO_topic_segment_matrix','NC-DCC_topic_segment_matrix']
comparison_dicts = ['IDEA-GLO_topics_dict','NC-DCC_topics_dict']

for i,matrix_label in enumerate(comparison_matrices):
    ontology_label = matrix_label.split('_')[0]
    print(ontology_label)
    
    C = np.array(model_dict[matrix_label])
    # Threshold and binarise
    D = np.where(C>=threshold,1,0).astype(int)
    
    E = np.matmul(D,B.T)
    
    # Expected co-occurrence counts under independence of the two topic sets
    expected = get_expected_matrix(E)
    
    # Add epsilon to avoid division by zero
    epsilon = 1e-10
    expected_adjusted = expected + epsilon
    chi2_stat = np.sum((E - expected_adjusted)**2 / expected_adjusted)
    print('Observed chi-square',chi2_stat)
    df = (E.shape[0] - 1) * (E.shape[1] - 1)
    print('Expected chi-square under independence',df)
    print('Ratio (observed/expected)', chi2_stat/df)
    print()
    