# Topic Discovery


## Data Sources

### Corpus

The [corpus](https://www.constituteproject.org) comprises the set of in-force national constitutions compiled by the Comparative Constitutions Project (CCP). Each constitution is divided into a number of sections.

### Ontologies

#### Reference

Our reference ontology is:

- CCP-FACET: A faceted version of the [CCP ontology](https://www.constituteproject.org).

where `CCP-FACET` is the ontology's label used to identify and refer to the ontology.

#### Comparison

Our comparison ontologies in this analysis are:

- FJC-IDB: Concatenated Federal Judicial Center databases:
    - [Federal Judicial Center, Appeals Integrated Database](https://www.fjc.gov/sites/default/files/idb/codebooks/Appeals%20Codebook%201971-2007.pdf)
    - [Federal Judicial Center, Bankruptcy Petition Newstats Snapshots Database](https://www.fjc.gov/sites/default/files/idb/codebooks/Bankruptcy%20IDB%20Online%20Codebook%20rev%2002282023.pdf)
    - [Federal Judicial Center, Civil Integrated Database](https://www.fjc.gov/sites/default/files/idb/codebooks/Civil%20Codebook%201970-1987.pdf)
    - [Federal Judicial Center, Criminal Integrated Database](https://www.fjc.gov/sites/default/files/idb/codebooks/Criminal%20Code%20Book%201970-1995.pdf)
- GLOBALCIT-GLO: [Global Citizenship Observatory, Glossary on Citizenship and Electoral Rights](https://globalcit.eu/glossary)
- IDEA-DT: [International IDEA Democracy Tracker](https://www.idea.int/publications/catalogue/html/democracy-tracker-methodology-and-user-guide-version-2-february-2025)
- IDEA-GLO: [International IDEA Database Glossary](https://www.idea.int/data-tools)
- JUON-CPSD: [Andreas Juon, Constitutional Power-Sharing Dataset](https://doi.org/10.7910/DVN/9FYN8J)
- STROM-IDC: [Scott Gates, Benjamin A. T. Graham, and Håvard Strand, Inclusion, Dispersion, and Constraint Dataset](https://doi.org/10.7910/DVN/29421)

All ontologies were formatted to conform to the Sartori Network's ontology specification of:

- One topic per row
- A minimum column set comprising the following fields:
    - key: a short topic identifier. If this is not provided by the ontology owner, then an integer is used.
    - label: a short human-readable text label.
    - description: a longer descriptive text.
- A first row containing the column names: `Key`, `Label`, `Description`.
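
The specification above can be checked mechanically. The following is a minimal sketch, assuming the ontology is stored as CSV text (the storage format is not stated here). The helper name `check_ontology_rows` and the sample rows are hypothetical and are not part of the project's codebase.

```python
import csv
import io

REQUIRED_COLUMNS = ["Key", "Label", "Description"]

def check_ontology_rows(csv_text):
    """Return topic rows as dicts, raising ValueError if a required column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    rows = []
    for n, row in enumerate(reader, start=1):
        # Fall back to an integer key when the ontology owner supplies none
        if not (row["Key"] or "").strip():
            row["Key"] = str(n)
        rows.append(row)
    return rows

# Invented example data, for illustration only
sample = (
    "Key,Label,Description\n"
    "T1,Example label,Example description\n"
    ",Second label,Second description\n"
)
topics = check_ontology_rows(sample)
print([t["Key"] for t in topics])
```

The second sample row has no key, so the sketch assigns it its row number, mirroring the integer fallback described in the specification.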


## Rationale

The methodology looks at the semantic similarities between topics in a pair of ontologies and the sections (referred to as segments) of a corpus of national constitutions. One of the ontologies is a reference ontology that is aligned with the corpus — in the analysis here the Comparative Constitutions Project (CCP) ontology. The other ontology is referred to as the comparison ontology.

The objective is to find segments that are similar to comparison topics but not to reference topics. We use a comparison ontology to audit the reference ontology in order to identify gaps in the coverage of the reference ontology. 

Comparison topics that capture segments that the reference ontology misses may be candidates for inclusion in the reference ontology. If a candidate comparison topic is instead judged semantically similar to an existing reference topic, then the segments it captures may have been missed by manual coding.

## Methodology

The methodology is based on manipulation of semantic similarity matrices constructed during ontology and corpus processing (see the codebase in the `processing` folder). The similarity matrices can be found in the `model` folder in the following files:

- `CCP-FACET_topic_segment_matrix.json`: CCP-FACET reference topics in rows, constitution segments in columns.
- `IDEA-GLO_topic_segment_matrix.json`: IDEA-GLO topics in rows, constitution segments in columns.
- `NC-DCC_topic_segment_matrix.json`: NC-DCC topics in rows, constitution segments in columns.

Each matrix has the topics of an ontology in rows, and the segments of the corpus in columns. Each cell contains the semantic similarity score of a topic-segment pair. Similarity scores are based on the angular distance between the encoding vectors of topic and segment text, where the topic text is the concatenated label and description. Encoding vectors were generated using Google's multilingual Universal Sentence Encoder version 3.
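
To make the angular measure concrete, here is a minimal sketch of one common way to turn the angle between two encoding vectors into a similarity score in the range 0 to 1. The formula used (1 minus the angle divided by π) is an assumption; the exact calculation in the `processing` code is not shown here. The function name `angular_similarity` and the two-dimensional vectors are hypothetical and are not real encoder output.

```python
import numpy as np

def angular_similarity(u, v):
    """Angular similarity in [0, 1]: 1 minus the normalised angle between u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against floating-point values slightly outside [-1, 1]
    cos_sim = np.clip(cos_sim, -1.0, 1.0)
    return 1.0 - np.arccos(cos_sim) / np.pi

# Invented vectors separated by an angle of 45 degrees
topic_vec = [1.0, 0.0]
segment_vec = [1.0, 1.0]
score = angular_similarity(topic_vec, segment_vec)
print(round(score, 2))
```

Under this assumed formula, identical directions score 1, orthogonal vectors score 0.5, and the 45 degree pair above scores 0.75.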


### Process

1. Let the CCP-FACET matrix be $\mathbf{A}$.
2. Threshold and binarise $\mathbf{A}$ to produce $\mathbf{B}$.
3. Let the comparison matrix be $\mathbf{C}$.
4. Threshold and binarise $\mathbf{C}$ to produce $\mathbf{D}$.
5. Let $\mathbf{E}=\mathbf{DB}^T$.

$\mathbf{E}$ is a co-occurrence matrix with comparison topics in rows and reference topics in columns. Each cell contains the number of segments that are at or above threshold for both topics in a given topic pair, i.e., the number of segments that are semantically similar to both topics.
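
The five steps can be traced on a toy example. This is a sketch with invented similarity scores (two reference topics, three comparison topics, four segments), not data from the model files.

```python
import numpy as np

threshold = 0.7

# Invented scores: 2 reference topics x 4 segments
A = np.array([[0.9, 0.2, 0.1, 0.8],
              [0.3, 0.75, 0.2, 0.1]])
# Invented scores: 3 comparison topics x 4 segments
C = np.array([[0.8, 0.1, 0.2, 0.9],
              [0.1, 0.2, 0.85, 0.3],
              [0.2, 0.1, 0.3, 0.4]])

# Threshold and binarise: 1 where a topic-segment pair is at or above threshold
B = (A >= threshold).astype(int)
D = (C >= threshold).astype(int)

# Co-occurrence counts: comparison topics in rows, reference topics in columns
E = D @ B.T
print(E)
```

Here the first comparison topic shares two segments with the first reference topic, while the second and third comparison topics share no segments with any reference topic, so their rows in `E` are all zeros.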

Next:

Find rows in $\mathbf{E}$ that contain only zeros. These rows are comparison topics that have no segments in common with any of the reference topics. For each such comparison topic:
1. Recover any semantically similar segments from $\mathbf{D}$.
2. For every segment recovered from $\mathbf{D}$, ensure that the segment's column in $\mathbf{B}$ contains only zeros.

We now have a set of comparison topics each of which is semantically similar to a set of corpus segments that are not semantically similar to any reference topics. As a further step, any manually tagged segments in the segment sets are identified. 
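
The selection and validation steps above can be sketched on small binarised matrices. The matrices and the segment identifiers below are invented for illustration, and the variable name `candidates` is hypothetical.

```python
import numpy as np

# Invented binarised matrices: rows are topics, columns are 4 segments
B = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0]])   # reference topics
D = np.array([[1, 0, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 0]])   # comparison topics
segment_ids = ["seg_a", "seg_b", "seg_c", "seg_d"]  # hypothetical IDs

E = D @ B.T

candidates = {}
# Rows of E that contain only zeros share no segments with any reference topic
for i in np.flatnonzero(~E.any(axis=1)):
    segment_indices = np.flatnonzero(D[i])
    if segment_indices.size == 0:
        continue  # topic is similar to no segment at all, so it is not a candidate
    # Validation: each recovered segment's column in B must contain only zeros
    assert not B[:, segment_indices].any()
    candidates[int(i)] = [segment_ids[j] for j in segment_indices]

print(candidates)
```

In this toy case only the second comparison topic (row index 1) survives: it is similar to `seg_c`, which no reference topic reaches. The third comparison topic has an all-zero row in `E` but is similar to no segment, so it is skipped.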

### Outputs

For a selected comparison ontology the following files are generated:

1. `<export_prefix>_<ontology_label>_candidate_data.csv`
    - Each row contains a candidate topic and a semantically similar constitution segment.
    - Candidate topics repeat if the topic is semantically similar to more than one segment.
    - Columns are:
        - `comparison_topic_key`: the comparison topic key.
        - `comparison_topic_text`: the topic text (concatenated label and description) used to generate the encoding vector.
        - `segment_id`: the ID of a semantically similar segment. Contains the constitution identifier.
        - `segment_text`: the text of the segment.
        - `link`: a deep link to the segment in the [Constitute Project website](https://www.constituteproject.org).
        - `tagged_ccp_topics`: a list of manually tagged CCP topic codes for a segment.
2. `<export_prefix>_<ontology_label>_candidate_list.csv`
    - Each row contains a candidate topic.
    - Columns are:
        - `key`
        - `label`
        - `description`

Where:

- `<export_prefix>` is a user-defined string.
- `<ontology_label>` is the label of the comparison ontology.
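
As a sketch of how the candidate data file might be consumed downstream, the snippet below groups segments by comparison topic using the column names listed above. The rows, segment identifiers, links, and the `[]` representation of an empty tag list are all invented for illustration; a real file would be opened from the `outputs` folder instead of an in-memory string.

```python
import csv
import io
from collections import defaultdict

# Invented rows using the documented column names
candidate_data = io.StringIO(
    "comparison_topic_key,comparison_topic_text,segment_id,segment_text,link,tagged_ccp_topics\n"
    "T1,Example topic text,const_x/1,Example segment one,https://example.org/1,[]\n"
    "T1,Example topic text,const_y/7,Example segment two,https://example.org/2,[]\n"
    "T2,Another topic text,const_z/3,Example segment three,https://example.org/3,[]\n"
)

# Candidate topics repeat once per similar segment, so collect segments per topic
segments_by_topic = defaultdict(list)
for row in csv.DictReader(candidate_data):
    segments_by_topic[row["comparison_topic_key"]].append(row["segment_id"])

for key, ids in segments_by_topic.items():
    print(key, len(ids))
```

The keys of `segments_by_topic` correspond to the deduplicated topics that the candidate list file records.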


Sample outputs at a threshold of 0.7 are present in the `outputs` folder.

## Interpretation

Candidate topics may provide evidence for:

- New topics for the reference ontology where the semantic distance between a candidate topic and existing topics is high.
- Segments that have been missed by manual tagging. This may be the case if a candidate topic is judged similar to an existing reference topic and/or a segment is manually tagged.



## Initialisation

### Load code and model

In [None]:
__author__      = 'Roy Gardner'
__copyright__   = 'Copyright 2025, Roy and Sally Gardner'

%run ./_library/packages.py
%run ./_library/utilities.py
%run ./_library/comparison.py



In [None]:
model_path = '../model/'

exclusion_list = []
_,_,files = next(os.walk(model_path))
for file in files:
    if '_encodings.json' in file:
        exclusion_list.append(file)

model_dict = initialise(model_path,exclusion_list=exclusion_list)


## Step 1: Generate user interface

This step creates an interface within which you select a comparison ontology. 

Run the cell below to generate the interface for selecting the following values and parameters:

- Comparison ontology
  - Select a comparison ontology from the dropdown menu.
- Threshold
  - Sets the minimum semantic similarity between topics from the reference and comparison ontologies and constitution sections. Sections that meet or exceed this threshold are included in the search results in Step 2 below.
  - Too low and it will be hard to find sections that are similar only to comparison ontology topics.
  - Too high and you may miss some useful results.
  - 0.7 is a good starting point and is set as the default; move up or down as needed using the slider.
- Export
  - If checked, comparison results will be exported to files in the `outputs` folder, provided a file name prefix has been entered in the `Export prefix` field.
- Export prefix
  - The prefix will be added to the start of the export file names. The maximum length is 16 characters.

Once you are happy with your choices click on the `Apply Choices` button and move on to Step 2.
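
The effect of the threshold can be illustrated with a small sketch. The scores below are invented, and the helper `count_candidates` is hypothetical; it counts comparison topics that are similar to at least one segment but share no segment with any reference topic.

```python
import numpy as np

def count_candidates(A, C, threshold):
    """Count comparison topics with similar segments but no overlap with reference topics."""
    B = (A >= threshold).astype(int)
    D = (C >= threshold).astype(int)
    E = D @ B.T
    # Candidate: all-zero row in E, and at least one segment at or above threshold in D
    return int(np.sum(~E.any(axis=1) & D.any(axis=1)))

# Invented scores: 2 reference topics and 3 comparison topics over 4 segments
A = np.array([[0.9, 0.2, 0.1, 0.8],
              [0.3, 0.72, 0.2, 0.1]])
C = np.array([[0.8, 0.1, 0.2, 0.9],
              [0.1, 0.2, 0.85, 0.3],
              [0.2, 0.78, 0.3, 0.4]])

for threshold in (0.1, 0.7, 0.75, 0.95):
    print(threshold, count_candidates(A, C, threshold))
```

On this toy data the count is zero at both extremes: at 0.1 every segment matches every topic, and at 0.95 nothing matches at all. Candidates only appear at the intermediate thresholds.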


In [None]:
discovery_choice_dict = init_discovery_choice_dict()
discovery_interface(discovery_choice_dict,model_dict['ontologies_dict'],0.70)


## Step 2: Run the ontology comparison

The choices made in the interface above are now used to find constitution sections that are semantically similar to topics from the comparison ontology but which are not semantically similar to topics from the reference ontology. The search may take a few seconds depending upon your computer.

Comparison topics and sections found by the comparison method appear in an HTML table. Each row in the table has five cells:

- A comparison topic key.
- The comparison topic text: The concatenated topic label and description.
- A section ID, which is a link to the section in the [Constitute Project](https://www.constituteproject.org/) website. This link lets you view the section in the context of the constitution to which it belongs.
- The section's text.
- Any CCP topics that were manually applied to the section.

If you have selected the `export` option and supplied a valid export file prefix then the following files are created in the `outputs` folder:

1. `<export_prefix>_<ontology_label>_candidate_data.csv`

This is a version of the HTML table in CSV format. Fields are:
- comparison_topic_key
- comparison_topic_text
- segment_id
- segment_text
- link
- tagged_ccp_topics

2. `<export_prefix>_<ontology_label>_candidate_list.csv`

Contains a deduplicated list of comparison topics. Fields are:
- key
- label
- description


In [None]:
if len(discovery_choice_dict['comparison']) > 0:
    
    threshold = discovery_choice_dict['threshold']
    reference_label = discovery_choice_dict['reference']
    comparison_label = discovery_choice_dict['comparison']


    # Threshold and binarise the topic-segment matrix of the reference ontology
    A = np.array(model_dict[f'{reference_label}_topic_segment_matrix'])
    B = np.where(A>=threshold,1,0).astype(int)

    # Define the data structures for the comparison ontologies
    comparison_matrix = model_dict[f'{comparison_label}_topic_segment_matrix']

    # Get the topic-segment matrix for our comparison ontology
    C = np.array(comparison_matrix)
    # Threshold and binarise
    D = np.where(C>=threshold,1,0).astype(int)

    # Co-occurrence matrix with comparison topics in rows and reference topics in columns
    E = np.matmul(D,B.T)

    # Use to collect segments that are semantically similar to a comparison topic so we can validate
    segments_set = []
    for i,row in enumerate(E):
        if row.nonzero()[0].size == 0:
            # Get at or above threshold segments from the topic's row in topic-segment matrix D
            segment_indices = [j for j,v in enumerate(D[i]) if v==1]
            if len(segment_indices) == 0:
                # Comparison topic is not semantically similar to any segment
                continue
            for j in segment_indices:
                segment_id = model_dict['encoded_segments'][j]
                segments_set.append(segment_id)

    # Validate the segment set by ensuring that each segment's column in the CCP topic-segment
    # matrix comprises zeros only. This test ensures that a segment is not semantically similar to 
    # a CCP topic at or above threshold.
    segments_set = list(set(segments_set))
    n = 0
    for segment_id in segments_set:
        segment_index = model_dict['encoded_segments'].index(segment_id)
        if B[:,segment_index].nonzero()[0].size == 0:
            n += 1
    assert(len(segments_set)==n)       
    print('Validated',len(segments_set),'segments for',comparison_label)

    # We got past validation so we can now list and export if selected as an option

    list_topic_discovery(E,D,comparison_label,model_dict)

    if discovery_choice_dict['export']:
        export_topic_discovery(E,D,comparison_label,discovery_choice_dict,model_dict)

    print('Finished')

else:
    alert('Please select from the interface and click on Apply Choices.')
