# Topic Discovery


## Data Sources

Data sources for this analysis are described below.

### Corpus

The [corpus](https://www.constitutueproject.org) comprises the set of in-force national constitutions compiled by the Comparative Constitutions Project (CCP).

### Ontologies

#### Reference

Our reference ontology is:

- CCP-FACET: A faceted version of the [CCP ontology](https://www.constitutueproject.org).

#### Comparison

Our comparison ontologies in this analysis are:

- IDEA-GLO: [International IDEA Database Glossary](https://www.idea.int/data-tools)
- NC-DCC: Núcleo Constituyente, Diccionario Constitucional Chileno, Second Edition

All ontologies were formatted to conform to the Sartori Network's ontology specification:

- One topic per row
- A minimum column set comprising the following fields:
    - key: a short topic identifier. If this is not provided by the ontology owner, then an integer is used.
    - label: a short human-readable text label.
    - description: a longer descriptive text.
- A first row containing the column names: `key`, `label`, `description`.
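
As an illustration of this specification, the following minimal sketch reads a one-topic-per-row CSV, checks for the minimum column set, and substitutes an integer key where none is supplied. The function name `load_ontology` and the sample rows are hypothetical and are not part of the codebase.

```python
import csv
import io

REQUIRED_COLUMNS = ["key", "label", "description"]

def load_ontology(handle):
    """Read a one-topic-per-row CSV and return a list of topic dicts."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    topics = []
    for n, row in enumerate(reader, start=1):
        # Fall back to an integer key when the ontology owner supplies none
        key = (row["key"] or "").strip() or str(n)
        topics.append({"key": key,
                       "label": row["label"],
                       "description": row["description"]})
    return topics

# Hypothetical sample data, for illustration only
sample = io.StringIO(
    "key,label,description\n"
    "amend,Amendment,Procedures for amending the constitution.\n"
    ",Citizenship,Rules on acquiring citizenship.\n"
)
topics = load_ontology(sample)
print([t["key"] for t in topics])
```

Extra columns beyond the minimum set are accepted, since the specification only fixes a minimum.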


## Rationale

The methodology examines the semantic similarity between the topics in a pair of ontologies and the sections (referred to as segments) of a corpus of national constitutions. One ontology is the reference ontology, which is aligned with the corpus; in this analysis it is the Comparative Constitutions Project (CCP) ontology. The other is the comparison ontology.

The objective is to find segments that are similar to comparison topics but not to reference topics. We use a comparison ontology to audit the reference ontology in order to identify gaps in the coverage of the reference ontology. 

Comparison topics that capture segments that the reference ontology misses may be candidates for inclusion in the reference ontology. If a candidate comparison topic is instead judged semantically similar to an existing reference topic, then its segments may have been missed by manual coding.

## Methodology

The methodology is based on manipulation of semantic similarity matrices constructed during ontology and corpus processing (see the codebase in the `processing` folder). The similarity matrices can be found in the `model` folder in the following files:

- `CCP-FACET_topic_segment_matrix.json`: CCP-FACET reference topics in rows, constitution segments in columns.
- `IDEA-GLO_topic_segment_matrix.json`: IDEA-GLO topics in rows, constitution segments in columns.
- `NC-DCC_topic_segment_matrix.json`: NC-DCC topics in rows, constitution segments in columns.

Each matrix has the topics of an ontology in rows and the segments of the corpus in columns. Each cell contains the semantic similarity score of a topic-segment pair. Similarity scores are calculated from the angular distance between the encoding vectors of the topic text and the segment text, where the topic text is the concatenated label and description. Encoding vectors were generated using Google's multilingual Universal Sentence Encoder version 3.
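
The exact conversion from angular distance to a similarity score is not stated here. A common choice, assumed in this sketch, is one minus the angle between two vectors divided by π, which gives 1 for identical directions and 0 for opposite ones. The function name and the toy two-dimensional vectors are hypothetical; real encoding vectors come from the encoder.

```python
import numpy as np

def topic_segment_matrix(topic_vecs, segment_vecs):
    """Angular similarity of every topic (rows) to every segment (columns)."""
    # Normalise rows to unit length so the dot product is the cosine
    t = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    s = segment_vecs / np.linalg.norm(segment_vecs, axis=1, keepdims=True)
    # Clip to guard against rounding slightly outside [-1, 1]
    cos = np.clip(t @ s.T, -1.0, 1.0)
    # Assumed conversion: 1 - angle / pi
    return 1.0 - np.arccos(cos) / np.pi

# Toy vectors, for illustration only
topic_vecs = np.array([[1.0, 0.0], [0.0, 2.0]])
segment_vecs = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
M = topic_segment_matrix(topic_vecs, segment_vecs)
print(M)
```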


### Process

1. Let the CCP-FACET matrix be $\mathbf{A}$.
2. Threshold and binarise $\mathbf{A}$ to produce $\mathbf{B}$.
3. Let the comparison matrix be $\mathbf{C}$.
4. Threshold and binarise $\mathbf{C}$ to produce $\mathbf{D}$.
5. Let $\mathbf{E}=\mathbf{DB}^T$.

$\mathbf{E}$ is a co-occurrence matrix with comparison topics in rows and reference topics in columns. Each cell contains the number of segments that are at or above threshold for both topics in a given topic pair, i.e., the number of segments that are semantically similar to both topics.
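
The five steps can be illustrated on a toy example. The scores below are invented for illustration (two reference topics, three comparison topics, four segments) and are not taken from the model files.

```python
import numpy as np

threshold = 0.7

# Hypothetical reference scores: 2 topics x 4 segments
A = np.array([[0.90, 0.20, 0.10, 0.30],
              [0.80, 0.75, 0.10, 0.20]])
# Hypothetical comparison scores: 3 topics x 4 segments
C = np.array([[0.85, 0.10, 0.20, 0.10],
              [0.10, 0.20, 0.90, 0.10],
              [0.10, 0.10, 0.10, 0.20]])

# Threshold and binarise
B = (A >= threshold).astype(int)
D = (C >= threshold).astype(int)

# Co-occurrence: comparison topics in rows, reference topics in columns
E = D @ B.T
print(E)
```

Here the first comparison topic shares one segment with each reference topic, while the second and third share none, so their rows in `E` are all zeros.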

Next, find rows in $\mathbf{E}$ that contain only zeros. These rows are comparison topics that have no segments in common with any of the reference topics. For each such comparison topic:

1. Recover any semantically similar segments from $\mathbf{D}$.
2. For every segment recovered from $\mathbf{D}$, ensure that the segment's column in $\mathbf{B}$ contains only zeros.

We now have a set of comparison topics each of which is semantically similar to a set of corpus segments that are not semantically similar to any reference topics. As a further step, any manually tagged segments in the segment sets are identified. 
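
A minimal sketch of the zero-row search and the column check, using small hypothetical binarised matrices rather than the model data:

```python
import numpy as np

# Hypothetical binarised matrices: topics in rows, segments in columns
B = np.array([[1, 0, 0, 0],
              [1, 1, 0, 0]])      # reference
D = np.array([[1, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0]])      # comparison
E = D @ B.T

candidates = {}
# Rows of E containing only zeros
for i in np.flatnonzero(~E.any(axis=1)):
    segments = np.flatnonzero(D[i])
    if segments.size == 0:
        # Topic is not similar to any segment, so it is not a candidate
        continue
    # Each recovered segment's column in B must contain only zeros
    assert not B[:, segments].any()
    candidates[int(i)] = segments.tolist()

print(candidates)
```

Because $\mathbf{B}$ and $\mathbf{D}$ contain only zeros and ones, $E_{ik}=\sum_j D_{ij}B_{kj}$ can be zero for every $k$ only if $B_{kj}=0$ wherever $D_{ij}=1$. The column check in step 2 therefore acts as a sanity check on the implementation rather than as an additional filter.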

### Outputs

For a selected comparison ontology the following files are generated:

1. `<ontology_identifier>_candidate_data.csv`
    - Each row contains a candidate topic and a semantically similar constitution segment.
    - Candidate topics repeat if the topic is semantically similar to more than one segment.
    - Columns are:
        - `comparison_topic_key`: the comparison topic key.
        - `comparison_topic_text`: the topic text (concatenated label and description) used to generate the encoding vector.
        - `segment_id`: the ID of a semantically similar segment. Contains the constitution identifier.
        - `segment_text`: the text of the segment.
        - `tagged_ccp_topics`: a list of manually tagged CCP topic codes for a segment.
2. `<ontology_identifier>_candidate_list.csv`
    - Each row contains a candidate topic.
    - Columns are:
        - `key`
        - `label`
        - `description`

Sample outputs at a threshold of 0.7 are present in the `outputs` folder.
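
One way to consume the candidate data file is to group segments by candidate topic and parse the tag list, which is written as the string form of a Python list. The sketch below uses an in-memory sample with the documented columns. The keys, segment IDs, texts and tags are hypothetical placeholders.

```python
import ast
import csv
import io
from collections import defaultdict

# Hypothetical rows in the documented column layout
sample_csv = """comparison_topic_key,comparison_topic_text,segment_id,segment_text,tagged_ccp_topics
12,Topic text A,seg-001,Segment text one,
12,Topic text A,seg-002,Segment text two,"['abc', 'def']"
40,Topic text B,seg-003,Segment text three,
"""

segments_by_topic = defaultdict(list)
tags_by_segment = {}
for row in csv.DictReader(io.StringIO(sample_csv)):
    segments_by_topic[row["comparison_topic_key"]].append(row["segment_id"])
    if row["tagged_ccp_topics"]:
        # Convert the string form of the list back into a list
        tags_by_segment[row["segment_id"]] = ast.literal_eval(row["tagged_ccp_topics"])

print(dict(segments_by_topic))
print(tags_by_segment)
```

To read a real output file, replace `io.StringIO(sample_csv)` with a file handle opened with `newline=''`.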

## Interpretation

Candidate topics may provide evidence for:

- New topics for the reference ontology where the semantic distance between a candidate topic and existing topics is high.
- Segments that have been missed by manual tagging. This may be the case if a candidate topic is judged similar to an existing reference topic and/or a segment is manually tagged.



## Initialisation

### Load code and model

In [None]:
__author__      = 'Roy Gardner'
__copyright__   = 'Copyright 2025, Roy and Sally Gardner'

%run ./_library/packages.py
%run ./_library/utilities.py



In [None]:
model_path = '../model/'

exclusion_list = []
_,_,files = next(os.walk(model_path))
for file in files:
    if '_encodings.json' in file:
        exclusion_list.append(file)

model_dict = initialise(model_path,exclusion_list=exclusion_list)


### Map segments onto tagged topics

This gives us a human-coded segment-topic map from which we can check for human coding of segments that have no semantic relationship to CCP reference topics but are semantically similar to a comparison topic.


In [None]:
# Invert the sat_segments_dict so that each segment ID maps to the list of
# keys under which it appears

segments_lookup = {}
for k,v in model_dict['sat_segments_dict'].items():
    for segment_id in v:
        segments_lookup.setdefault(segment_id,[]).append(k)



## Generate user interface

In [None]:
discovery_choice_dict = init_discovery_choice_dict()
discovery_interface(discovery_choice_dict,model_dict['ontologies_dict'],0.70)


## Run with selection from interface

In [None]:
threshold = discovery_choice_dict['threshold']
reference_label = discovery_choice_dict['reference']
comparison_label = discovery_choice_dict['comparison']


# Threshold and binarise the topic-segment matrix of the reference ontology
A = np.array(model_dict[f'{reference_label}_topic_segment_matrix'])
B = np.where(A>=threshold,1,0).astype(int)

# Define the data structures for the comparison ontologies
comparison_matrix = model_dict[f'{comparison_label}_topic_segment_matrix']
comparison_dict = model_dict[f'{comparison_label}_topics_dict']

# Get the topic keys for the comparison ontology
comp_keys = [k for k,v in comparison_dict.items()]
# Get the topic text for the comparison ontology
comp_text = [v['encoded_text'] for k,v in comparison_dict.items()]

# Get the topic-segment matrix for our comparison ontology
C = np.array(comparison_matrix)
# Threshold and binarise
D = np.where(C>=threshold,1,0).astype(int)

# Co-occurrence matrix with comparison topics in rows and reference topics in columns
E = np.matmul(D,B.T)

# Use to collect segments that are semantically similar to a comparison topic
segments_set = []

csv_row_list = []
header = []
header.append('comparison_topic_key')
header.append('comparison_topic_text')
header.append('segment_id')
header.append('segment_text')
header.append('tagged_ccp_topics')
csv_row_list.append(header)

# Iterate comparison topics searching for empty rows in the co-occurrence matrix E.
# An empty row means that the comparison topic shares no segments with any CCP topic
for i,row in enumerate(E):
    if row.nonzero()[0].size == 0:
        # Get at or above threshold segments from the topic's row in topic-segment matrix D
        segment_indices = [j for j,v in enumerate(D[i]) if v==1]
        if len(segment_indices) == 0:
            # Comparison topic is not semantically similar to any segment
            continue
        for j in segment_indices:
            # Iterating the comparison topic's semantically similar segments
            csv_row = []
            csv_row.append(comp_keys[i])
            csv_row.append(comp_text[i])
            segment_id = model_dict['encoded_segments'][j]
            segments_set.append(segment_id)
            segment_text = model_dict['segments_dict'][segment_id]['text']
            csv_row.append(segment_id)
            csv_row.append(segment_text)
            # Check whether the segment has been manually tagged
            if segment_id in segments_lookup:
                csv_row.append(str(segments_lookup[segment_id]))
            else:
                csv_row.append('')
            csv_row_list.append(csv_row)

# Validate the segment set by ensuring that each segment's column in the CCP topic-segment
# matrix comprises zeros only. This test ensures that a segment is not semantically similar to 
# a CCP topic at or above threshold.
segments_set = list(set(segments_set))
n = 0
for segment_id in segments_set:
    segment_index = model_dict['encoded_segments'].index(segment_id)
    if B[:,segment_index].nonzero()[0].size == 0:
        n += 1
assert(len(segments_set)==n)       
print('Validated',len(segments_set),'segments for',comparison_label)

# Got through validation so write results to CSV file
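# newline='' stops the csv module writing blank rows on Windows;
# the with statement closes the file, so no explicit close is needed
with open('./outputs/' + comparison_label + '_candidate_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(csv_row_list)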
print('Candidate topics and segments exported to CSV file:','./outputs/' +\
              comparison_label + '_candidate_data.csv')

# Write list of candidate comparison topics to file
comp_topics = sorted(list(set([row[0] for row in csv_row_list[1:]])))

row_list = []
header = []
header.append('key')
header.append('label')
header.append('description')
row_list.append(header)
for int_key in comp_topics:
    key = str(int_key)
    csv_row = []
    csv_row.append(key)
    csv_row.append(comparison_dict[key]['Label'])
    csv_row.append(comparison_dict[key]['Description'])
    row_list.append(csv_row)

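# newline='' stops the csv module writing blank rows on Windows;
# the with statement closes the file, so no explicit close is needed
with open('./outputs/' + comparison_label + '_candidate_list.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(row_list)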
print('Candidate topic list exported to CSV file:','./outputs/' + comparison_label + '_candidate_list.csv')
print()
    
print('Finished')
