# SAT Expansion Pipeline

This notebook implements a segments-as-topic (SAT) methodology for generating new topics in the Comparative Constitutions Project (CCP) ontology. In the implementation below, a SAT comprises sections of national constitutions that capture the meaning of a topic.

## Stages

### Preliminaries

This stage gets things started and need only be run once during a session. It comprises three steps:

- Step 1: Load packages and functions from external files.
- Step 2: Start a web server that supports Javascript to Python interactions with the notebook.
- Step 3: Load models. This includes Google's Universal Sentence Encoder (USE) version 4 and the data models used by the application which include encodings of constitution sections generated by the USE model. These encodings are used by semantic search.

### Initialisation

This stage initialises the data structures used to record your activities during a session. At the end of a session (SAT Review and Acceptance) the populated data structures are saved to a JSON file in the `outputs` folder. The file name is `<topic_key>_resource.json` and it provides a complete record of your activities.

### SAT Generation

In the SAT generation stage, a topic formulation comprising a short phrase is created. A sentence-level semantic similarity model is then used to encode the topic formulation and the encoding is used to find constitution sections that are semantically similar to the topic formulation. Formulations can be tested and refined until a suitable seed set of segments is obtained. This SAT seed set is then used to find additional sections in the SAT Expansion stage. 

There are two steps in this stage:

- Step 1: Load an interface in which a topic key (a short identifier) and formulation are defined along with semantic search criteria.
- Step 2: Use the choices made in the interface to search for constitution sections that are semantically similar to the topic formulation. Once the search is complete, select sections that you judge match the formulation. Alternatively, assess the results and return to Step 1 to refine your choices.

In summary, SAT Generation is an iterative process. Its final outcome is a set of accepted sections that constitute the SAT seed set, which is the input to the SAT Expansion stage.


### SAT Expansion

There are three steps in this stage:

- Step 1: Save the SAT seed set created by SAT Generation.
- Step 2: Load an interface in which to define the two semantic similarity thresholds needed by SAT Expansion.
- Step 3: Use the choices made in the interface to search for constitution sections that are semantically similar to SAT sections. Once the search is complete, select sections that you judge should be included in the SAT, i.e., expand the SAT.

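SAT Expansion is an iterative process: Step 3 can be rerun as many times as you like, and the process ends when no more candidate segments are found or no selection is made.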


### SAT Review and Acceptance

This stage provides the opportunity to review the segments of the expanded SAT. A segment can be removed by unchecking its checkbox.

There are three steps in this stage:

- Step 1: Load an interface to define a cluster threshold for the SAT review.
- Step 2: Review the clustered SAT segments and decide whether to remove segments. You can also decide to return to the expansion stage using a lower mapping threshold.
- Step 3: Accept the review and write the SAT segments and the session history to file (see Outputs below).


## Outputs

The outputs of the SAT expansion process are two files:

1. `<topic_key>_final_SAT.csv` contains the final SAT segments with one row per segment. The columns are:
    - `segment_id`
    - `segment_text`
    - `constitution` (rename this column if using other corpora)
2. `<topic_key>_resource.json` contains a Python dictionary that records session history:
    - topic data
    - start and end dates of a complete end-to-end session
    - search and cluster thresholds
    - SAT segment IDs and text from generation, expansion iterations, and final review.
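
The session history can be inspected once a run is complete. The sketch below summarises a dictionary shaped like the history described above. The field names follow the `resource_dict` initialised in this notebook; the function `summarise_history`, the sample values, and the assumption that both datetimes are stored as epoch seconds are illustrative only.

```python
def summarise_history(resource):
    """Return headline counts and the session duration from a history dict."""
    start = resource.get('start_datetime')
    end = resource.get('end_datetime')
    # Assumption: both values are epoch seconds (or None if not yet set)
    duration = end - start if start is not None and end is not None else None
    return {
        'topic_key': resource['topic_key'],
        'seed_count': len(resource['generation']['seed_segments']),
        'iteration_count': len(resource['expansion']['iterations']),
        'final_count': len(resource['review']['sat_segments_final']),
        'duration_seconds': duration,
    }

# Hypothetical miniature history, for illustration only
example = {
    'topic_key': 'parents',
    'start_datetime': 1700000000,
    'end_datetime': 1700000600,
    'generation': {'seed_segments': [{'seg_1': 'text one'}, {'seg_2': 'text two'}]},
    'expansion': {'iterations': [{'accepted_set': [{'seg_3': 'text three'}]}]},
    'review': {'sat_segments_final': [{'seg_1': 'text one'}, {'seg_3': 'text three'}]},
}

summary = summarise_history(example)
print(summary)
```

With a real output file, the dictionary would first be read with `json.load` from `outputs/<topic_key>_resource.json`.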
    

# Preliminaries

## Step 1: Load packages and functions

In [None]:
__author__ = 'Roy Gardner, Matt Martin'

%run ./_library/packages.py
%run ./_library/utilities.py
%run ./_library/sat.py
%run ./_library/server.py


## Step 2: Start Python web server

The server handles Javascript to Python interactions. Specifically, SAT segments are selected using checkboxes in the HTML of an output cell. Changes in checkbox state are handled by Javascript and posted to the server, which manages the set of checked elements.

Checkbox state is used to define:

- Segments constituting the SAT seed set in SAT generation.
- Selected segments during SAT expansion.
- Removal of segments during the SAT review process.
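
The real `CheckboxState` and `CheckboxHandler` classes live in `./_library/server.py` and are not shown in this notebook. As a rough idea of what such a pair might look like, here is a minimal sketch built on the standard library's `http.server`. The class names `SketchCheckboxState` and `SketchCheckboxHandler`, the `apply` method, and the CORS headers are assumptions made for illustration, not the library's actual code.

```python
import json
from http.server import BaseHTTPRequestHandler

class SketchCheckboxState:
    """Holds the IDs of the checkboxes that are currently checked."""
    def __init__(self):
        self.selected_ids = set()

    def apply(self, element_id, checked):
        # Add the ID when checked, drop it when unchecked
        if checked:
            self.selected_ids.add(element_id)
        else:
            self.selected_ids.discard(element_id)

class SketchCheckboxHandler(BaseHTTPRequestHandler):
    """Receives {"id": ..., "checked": ...} JSON posts and updates the state."""
    def __init__(self, state, *args, **kwargs):
        # Set the state first: the base constructor handles the request immediately
        self.state = state
        super().__init__(*args, **kwargs)

    def _send(self, code):
        self.send_response(code)
        # The notebook page has a different origin from this server,
        # so the browser needs CORS headers to accept the response
        self.send_header('Access-Control-Allow-Origin', '*')
        self.send_header('Access-Control-Allow-Headers', 'Content-Type')
        self.send_header('Access-Control-Allow-Methods', 'POST, OPTIONS')
        self.end_headers()

    def do_OPTIONS(self):
        # Preflight request sent by the browser before a JSON POST
        self._send(204)

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        try:
            payload = json.loads(self.rfile.read(length))
            self.state.apply(payload['id'], bool(payload['checked']))
        except (ValueError, KeyError, TypeError):
            self._send(400)
            return
        self._send(200)

    def log_message(self, format, *args):
        # Keep the notebook output free of per-request log lines
        pass

# The state logic can be exercised without starting a server
state_demo = SketchCheckboxState()
state_demo.apply('seg_1', True)
state_demo.apply('seg_2', True)
state_demo.apply('seg_1', False)
print(state_demo.selected_ids)
```

A handler written this way can be bound to a state object with `lambda *args: SketchCheckboxHandler(state_demo, *args)`, which is the same pattern the cell below uses for the real classes.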


In [None]:
port = 8002

def get_selected_ids():
    # Get IDs of selected checkboxes
    return state.selected_ids 

def clear_selected_ids():
    state.selected_ids = set()
    
def set_selected_ids(selected_ids):
    state.selected_ids = set(selected_ids)

if not server_is_running(port):
    state = CheckboxState()
    handler = lambda *args: CheckboxHandler(state, *args)
    server = HTTPServer(('localhost', port), handler)

    thread = Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    print('Server running on port:', port)
else:
    print('Already running on port:', port)
    
if server_is_running(port):
    html = '''
    <script>
    function hit(id) {
        const checkbox = document.getElementById(id);
        fetch('http://localhost:''' + str(port) + '''', {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
            },
            body: JSON.stringify({
                id: id,  // Send the actual ID
                checked: checkbox.checked
            })
        });
    }</script>
    '''
    display(HTML(html))
else:
    print('Server is not running:', port)


## Step 3: Select and load models

In [None]:
model_dict = {}
encoder = None

models_path = '../model/'

# Locate available models
_, dirs, _ = next(os.walk(models_path))
dirs = sorted([d for d in dirs if not d[0] == '.'])

model_options = {}
default = ''
for d in dirs:
    model_path = models_path + d + os.sep
    with open(model_path + 'config.json', 'r', encoding='utf-8') as f:
        # The with block closes the file, so no explicit close() is needed
        config = json.load(f)
    model_options[config['label']] = (model_path, config['encoder_path'])
    if model_path == '../model/ccp/':
        default = config['label']

def get_selected_value(widget):
    clear_output() 
    selected_model = widget.value
    print(f'Loading {selected_model}')
    global model_dict
    model_dict = do_load(model_options[selected_model][0],exclusion_list=['config.json'],verbose=True)
    print(f'Loading {model_options[selected_model][1]}')
    global encoder
    encoder = hub.load(model_options[selected_model][1])
    print('Finished')

model_select = widgets.Select(
    options=[k for k in model_options.keys()],
    value=default,
    layout=widgets.Layout(width='600px'),
    description='Model:',
    disabled=False
)
apply_button = widgets.Button(
    description='Apply Choice',
    disabled=False,
    button_style='',
    tooltip='Click to apply choice'
)

apply_button.on_click(lambda change: get_selected_value(model_select))

display(model_select)
display(apply_button)
out = widgets.Output()
display(out)




# Initialisation

Run this cell to reset an existing session or to start a new session.


In [None]:
# Make sure that SAT segments are empty
clear_selected_ids()
review = False

# Dictionary containing resources for current run
resource_dict = {
    'topic_key': '',
    'topic_label': '',
    'topic_description': '',
    'start_datetime':None,
    'end_datetime':None,
    'generation': {
        'formulation': '',
        'search_threshold': 0.0,
        'cluster_threshold': 0.0,
        'seed_segments': []
    },
    'expansion': {
        'iterations': []
    },
    'review':{
        'sat_segments_final':[],
        'removed_segments':[],       
        'csv_file':'',    
    },
    'xml':{
        'constitution_count':0,
        'constitutions_updated':[]        
    }
}

def get_iteration_dict():
    iteration_dict = {
        'post_review':False,       
        'accepted_set':[],    
        'rejected_set':[],    
        'sat_set':[],
        'mapping_threshold':0.0,    
        'cluster_threshold':0.0    
    }
    return iteration_dict
    


# SAT Generation

SAT Generation is a two-step stage:

- Step 1: Define your topic formulation and semantic search parameters in a simple interface.
- Step 2: Run your semantic search to see the results. Then select suitable sections in the results to create the seed set.

The key to success at this stage is experimentation. You can work on the formulation and search parameters to refine your search results in order to generate a seed set that matches the topic you are creating. The seed set need not be exhaustive: a small set of constitution sections that are a good match to your topic formulation will provide the basis for successful SAT expansion. The sections in the seed set are better at finding additional sections than any formulation.


## Step 1: Create the SAT generation interface


This step creates an interface within which you define your topic formulation and the parameters of your semantic search of constitution sections. 

Run the cell below to generate the interface for selecting the following values and parameters:

- Topic key
  - An alphanumeric key for your topic between 4 and 10 characters in length, e.g. parents.
- Search threshold
  - Sets the minimum semantic similarity a constitution section must meet to qualify as a match to your topic formulation. Sections that meet or exceed this threshold are included in the search results in Step 2 below.
  - Too low and you'll get too many results — many of which will be off-topic.
  - Too high and you may miss on-topic results.
  - 0.63 is a good starting point and is set as the default; move up or down as needed using the slider.
- Cluster threshold
  - Groups search results together to try and separate on-topic from off-topic results in Step 2.
  - Too low and you'll get one big cluster containing all search results.
  - Too high and most results will be considered unrelated to one another and will appear in the `singletons` set.
  - 0.72 is a good starting point and is set as the default, but you'll need to experiment for each topic you create.
- Formulation
  - Enter the text of your topic formulation here; this will be used to search for semantically similar constitution sections in Step 2. The maximum number of characters is 400. The text you enter is sanitised to escape HTML and to remove characters that would be considered a security threat in a web setting. If text is sanitised, an alert displays the sanitised text.

Once you are happy with your choices click on the `Apply Choices` button and move on to Step 2.
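
The interface code itself is loaded from the external library files, so the checks it applies are not visible here. As an illustration of the constraints listed above (an alphanumeric key of 4 to 10 characters, a formulation of at most 400 characters with HTML escaped), here is a minimal sketch using only the standard library. The function names and the exact rules are assumptions, and the real sanitiser may remove additional characters.

```python
import html
import re

MAX_FORMULATION_CHARS = 400

def is_valid_topic_key(key):
    """True if the key is alphanumeric and 4 to 10 characters long."""
    return re.fullmatch(r'[A-Za-z0-9]{4,10}', key) is not None

def sanitise_formulation(text):
    """Trim whitespace, cap the length, then escape HTML special characters."""
    trimmed = text.strip()[:MAX_FORMULATION_CHARS]
    return html.escape(trimmed)

print(is_valid_topic_key('parents'))
print(sanitise_formulation('<b>rights of parents</b>'))
```

Note that escaping happens after truncation in this sketch, so the escaped string can be longer than 400 characters.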


In [None]:

choice_dict = init_choice_dict()
generation_interface(choice_dict,0.63,0.72)


## Step 2: Run the semantic search and create the seed set

The choices made in the interface above are now used in a semantic search of constitution sections. The search may take a few seconds depending upon your computer.
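
To make the role of the search threshold concrete, here is a minimal sketch of a threshold-based semantic search. It assumes cosine similarity between the encoded formulation and each encoded section; the notebook's `run_sat_generation` is defined elsewhere and may use a different measure or a faster matrix implementation. The toy vectors, IDs, and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search_sections(query_vector, section_vectors, threshold):
    """Return IDs of sections whose similarity to the query meets the threshold."""
    return {
        section_id
        for section_id, vector in section_vectors.items()
        if cosine_similarity(query_vector, vector) >= threshold
    }

# Toy three-dimensional encodings; real encodings have many more dimensions
section_vectors = {
    'sec_a': [1.0, 0.0, 0.0],
    'sec_b': [0.8, 0.6, 0.0],
    'sec_c': [0.0, 0.0, 1.0],
}
query_vector = [1.0, 0.0, 0.0]

default_matches = search_sections(query_vector, section_vectors, threshold=0.63)
strict_matches = search_sections(query_vector, section_vectors, threshold=0.9)
print(sorted(default_matches), sorted(strict_matches))
```

Raising the threshold from 0.63 to 0.9 drops `sec_b`, whose similarity to the query is about 0.8, which mirrors the trade-off between too many and too few results.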

Sections found by the semantic search appear in HTML tables and are organised into clusters based on their semantic similarity to one another. Note that clusters may suggest sub-topics or further refinements for your topic.

Each section's row in a cluster has three elements:

- The section's ID, which is a link to the section in the [Constitute Project](https://www.constituteproject.org/) website. By using this link you are able to view the section in the context of the constitution to which it belongs.
- The section's text.
- A checkbox.

Use the checkbox to add (or remove) a section from the seed set. Once you are happy with your seed set, you can proceed to SAT Expansion. In the first step of SAT Expansion your seed set will be saved.

Above the HTML tables is a field (Search for terms…) which can be used to search for one or more words in the search results. As you type, the search results are filtered to show only those results containing the text in the field. Clear the field to see the full set of results.
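
The clustering routine `cluster_sat_candidates` is defined in the external library, so its algorithm is not shown here. One simple scheme that behaves as the cluster threshold is described (one big cluster when too low, mostly singletons when too high) links any two sections whose similarity meets the threshold and treats each connected group as a cluster. The sketch below implements that scheme as an assumption; it is not necessarily the method the notebook uses, and all names and vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def threshold_clusters(vectors, threshold):
    """Group IDs into connected components of the similarity graph.

    Returns (clusters, singletons): clusters have two or more members.
    """
    ids = list(vectors)
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the root, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose similarity meets the threshold
    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            if cosine_similarity(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singletons

toy_vectors = {
    'a': [1.0, 0.0],
    'b': [0.9, 0.1],
    'c': [0.0, 1.0],
    'd': [0.1, 0.9],
}

print(threshold_clusters(toy_vectors, 0.72))   # two clusters
print(threshold_clusters(toy_vectors, 0.999))  # everything is a singleton
print(threshold_clusters(toy_vectors, 0.1))    # one big cluster
```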

## Use cases

1. I want to start over. 
    - Rerun Initialisation to reset the entire process and start a new session.
2. I'm not happy with my formulation.
    - Change the formulation in Step 1, click on Apply Choices, and rerun Step 2 to run a new search.
3. There are too many results.
    - Return to Step 1, increase the search threshold, click on Apply Choices, and rerun Step 2.
4. There are too few results.
    - Return to Step 1, reduce the search threshold, click on Apply Choices, and rerun Step 2.
5. I'm not happy with the results.
    - Return to Step 1, edit the formulation, click on Apply Choices, and rerun Step 2.
6. I've made my selection, what next?
    - Move to SAT Expansion: Step 1. This will save and record your selection and start the SAT Expansion process.


In [None]:

if len(choice_dict['formulation']) > 0:
    print('Topic key:', choice_dict['topic_key'])
    print('Formulation:', choice_dict['formulation'])
    print('Search threshold:', choice_dict['search_threshold'])
    print('Cluster threshold:', choice_dict['cluster_threshold'])
    print()
    
    resource_dict['topic_key'] = choice_dict['topic_key']
    resource_dict['start_datetime'] = int(time.time())
    resource_dict['generation']['formulation'] = choice_dict['formulation']
    resource_dict['generation']['search_threshold'] = choice_dict['search_threshold']
    resource_dict['generation']['cluster_threshold'] = choice_dict['cluster_threshold']

    # Get a set of segment IDs found by the semantic search
    segment_ids = run_sat_generation(choice_dict, model_dict, encoder)
    print('Number of search results:',len(segment_ids))
    if len(segment_ids) > 0:
        # Use same clustering and listing as expansion
        cluster_dict = cluster_sat_candidates(segment_ids,model_dict,\
                                              threshold=choice_dict['cluster_threshold'])
        print('Number of clusters:',len(cluster_dict))
        print()
        list_clusters(cluster_dict,model_dict)
        
else:
    alert('No formulation entered.')
    

# SAT Expansion

Now that you have selected a seed SAT you are ready to use the constitution sections in the seed set to search for semantically similar sections and therefore expand the SAT.



## Step 1: Load seed SAT from generation process

This process creates two sets of segments:

- The SAT segments, i.e., those segments accepted in the SAT generation process.
- Rejected segments — a set which is initially empty.

Both sets grow during the SAT expansion process below.
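
Python sets cannot be written to JSON directly, which is why the cell below converts the SAT segment set to a list before storing it in the session history. The sketch below shows one way to make that conversion reversible. It assumes each stored record is a one-entry dictionary mapping a segment ID to its text, which is consistent with how the cell below reads the IDs back; the helper names and sample data are hypothetical, and the real `get_segments` may store more fields.

```python
import json

def segments_to_records(segment_ids, segment_texts):
    """Turn a set of IDs into a sorted list of {id: text} records."""
    return [{sid: segment_texts[sid]} for sid in sorted(segment_ids)]

def records_to_ids(records):
    """Recover the set of segment IDs from a list of {id: text} records."""
    return {sid for record in records for sid in record}

# Hypothetical segment texts keyed by segment ID
segment_texts = {'seg_1': 'First section text', 'seg_2': 'Second section text'}
seed_ids = {'seg_2', 'seg_1'}

records = segments_to_records(seed_ids, segment_texts)
serialised = json.dumps(records)                   # works: lists and dicts are JSON types
restored_ids = records_to_ids(json.loads(serialised))
print(restored_ids == seed_ids)
```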


In [None]:
topic_key = choice_dict['topic_key']

# We might be returning here to start again, i.e., we need to check SAT Generation state

if len(resource_dict['generation']['seed_segments']) == 0:
    # First time into expansion
    # Set of selected segments from generation
    sat_segment_ids = get_selected_ids()
    # Convert SAT segments to list for serialisation to JSON resource
    resource_dict['generation']['seed_segments'] = get_segments(sat_segment_ids,model_dict)    
else:
    # We want to restart the process with the original generation seed set
    set_selected_ids([key for d in resource_dict['generation']['seed_segments'] for key in d.keys()])
    sat_segment_ids = get_selected_ids()
    
# Set of rejected segments
rejected_segment_ids = set()

# Convert SAT segments to list for serialisation to JSON resource
resource_dict['generation']['seed_segments'] = get_segments(sat_segment_ids,model_dict)

print('Expanding SAT for:',topic_key)
print()
print('Number of segments in SAT seed set:',len(sat_segment_ids))

# Initial state for expansion process
clear_selected_ids()
first_time = True


## Step 2: Create the SAT expansion interface


This step creates an interface within which you define the parameters that SAT expansion uses in its semantic search of constitution sections.

Run the cell below to generate the interface for selecting the following parameters:

- Mapping threshold
  - Sets the minimum similarity between constitution sections and sections in the SAT.
  - Too low and you'll get too many results—many of which will be off-topic.
  - Too high and you may miss on-topic results.
  - 0.70 is a good starting point and is set as the default; move up or down as needed using the slider.
- Cluster threshold
  - Groups search results together to try and separate on-topic from off-topic results in Step 3.
  - Too low and you'll get one big cluster containing all search results.
  - Too high and most results will be considered unrelated to one another and will appear in the `singletons` set.
  - 0.74 is a good starting point and is set as the default, but you'll need to experiment for each topic you create.

Once you are happy with your choices click on the `Apply Choices` button and move on to Step 3.


In [None]:

expansion_choice_dict = init_expansion_choice_dict()
expansion_interface(expansion_choice_dict,0.70,0.74)


## Step 3: Run SAT expansion (iterative process)


Iteratively run the code cell below. Each iteration will:
1. Find SAT expansion candidate segments that are semantically similar to SAT segments at or above a `mapping_threshold`. A segment is a candidate if:
    - It is not a member of the current SAT segments set.
    - It is not a member of the rejected segments set.
2. Provide a clustered list of candidate segments.
3. Provide support for selecting candidate segments for inclusion in the SAT.

Each subsequent iteration will:

- Add selected candidate segments from the previous iteration to the SAT segments set.
- Add unselected candidate segments to the rejected segments set.
- Repeat steps 1-3 above.

The process terminates when no more candidate segments are found or no selection is made.

The results layout and interface are identical to those of SAT Generation: Step 2.
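
The bookkeeping of the iteration described above can be sketched with plain Python sets. The example assumes cosine similarity and compares every unseen segment against the whole SAT on each pass, whereas the notebook's `run_sat_expansion` builds its matrix from the most recently accepted segments for speed. The function names, vectors, and threshold are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_candidates(sat_ids, rejected_ids, vectors, threshold):
    """Segments outside the SAT and rejected sets that are similar to any SAT segment."""
    candidates = set()
    for segment_id, vector in vectors.items():
        if segment_id in sat_ids or segment_id in rejected_ids:
            continue
        if any(cosine_similarity(vector, vectors[s]) >= threshold for s in sat_ids):
            candidates.add(segment_id)
    return candidates

def apply_selection(sat_ids, rejected_ids, candidates, accepted):
    """Accepted candidates join the SAT; the remaining candidates are rejected."""
    return sat_ids | accepted, rejected_ids | (candidates - accepted)

vectors = {
    'seed': [1.0, 0.0],
    'near': [0.9, 0.1],
    'mid': [0.7, 0.7],
    'far': [0.0, 1.0],
}
sat_ids, rejected_ids = {'seed'}, set()

# Iteration 1: only 'near' is close enough to the seed; the user accepts it
first = find_candidates(sat_ids, rejected_ids, vectors, threshold=0.75)
sat_ids, rejected_ids = apply_selection(sat_ids, rejected_ids, first, accepted={'near'})

# Iteration 2: 'mid' is now reachable through 'near'; the user rejects it
second = find_candidates(sat_ids, rejected_ids, vectors, threshold=0.75)
sat_ids, rejected_ids = apply_selection(sat_ids, rejected_ids, second, accepted=set())

# Iteration 3: no candidates remain, so the process terminates
third = find_candidates(sat_ids, rejected_ids, vectors, threshold=0.75)
print(first, second, third)
```

The second iteration shows why the process is iterative: `mid` is not similar enough to the seed alone, but becomes a candidate once `near` has joined the SAT.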

## Use cases

1. I want to start over from the very beginning.
    - Rerun Initialisation to reset the entire process and start a new session. You will have to start with SAT generation.
2. I want to start over with the original seed set.
    - Rerun Step 1 to start again with the seed set from SAT generation.
3. There are too many results.
    - Return to Step 2, increase the mapping threshold, click on Apply Choices, and rerun Step 3.
4. There are too few results.
    - Return to Step 2, reduce the mapping threshold, click on Apply Choices, and rerun Step 3.
5. I've made my selection, what next?
    - Simply rerun Step 3. SAT expansion is an iterative process which can be repeated as many times as you like. Unless you reduce the mapping threshold, the number of search results should get smaller with every iteration. The expansion terminates when there are no search results or when you rerun Step 3 without selecting any additional sections.


In [None]:

mapping_threshold = expansion_choice_dict['mapping_threshold']
cluster_threshold = expansion_choice_dict['cluster_threshold']

# First time in this state
if len(get_selected_ids()) == 0 and first_time:
    first_time = False
    # Get the set of candidate segments.
    sat_candidate_ids = run_sat_expansion(sat_segment_ids,sat_segment_ids,rejected_segment_ids,model_dict,\
                                          threshold=mapping_threshold)
    print('Number of candidate segments:',len(sat_candidate_ids))
else:    
    # Get accepted segments - could be from expansion iteration or review
    sat_accepted_ids = get_selected_ids()

    if len(sat_accepted_ids) == 0:
        # Termination condition
        rejected_segment_ids.update(sat_candidate_ids)
        sat_candidate_ids = set()
        # Populate an iteration dictionary
        # Updated rg 07/05/2025 to save segment text as well as segment IDs
        iteration_dict = {
            'accepted_set':get_segments(sat_accepted_ids,model_dict),    
            'rejected_set':get_segments(rejected_segment_ids,model_dict),    
            'sat_set':get_segments(sat_segment_ids,model_dict),
            'mapping_threshold':mapping_threshold,    
            'cluster_threshold':cluster_threshold    
        }
        resource_dict['expansion']['iterations'].append(iteration_dict)

    else:    
        print('Number of accepted segments:',len(sat_accepted_ids))
        # Add accepted segments to the SAT set. 
        if review:
            # Re-entrant from review so SAT is the current selected set from the review cell
            sat_segment_ids = sat_accepted_ids
            review = False
        else:
            # Expansion iteration so extend the SAT set
            sat_segment_ids.update(sat_accepted_ids)

        print('Updated SAT size:',len(sat_segment_ids))

        # Add all remaining segments from the last iteration's candidate set to the rejected set
        rejected_segment_ids.update(sat_candidate_ids.difference(sat_accepted_ids))        
        # Updated rg 07/05/2025 to save segment text as well as segment IDs
        iteration_dict = {
            'accepted_set':get_segments(sat_accepted_ids,model_dict),    
            'rejected_set':get_segments(rejected_segment_ids,model_dict),    
            'sat_set':get_segments(sat_segment_ids,model_dict),
            'mapping_threshold':mapping_threshold,    
            'cluster_threshold':cluster_threshold    
        }
        resource_dict['expansion']['iterations'].append(iteration_dict)

        # Build the matrix with the accepted set for speed 
        sat_candidate_ids = run_sat_expansion(sat_accepted_ids,sat_segment_ids,rejected_segment_ids,\
                                              model_dict,threshold=mapping_threshold)    
        print('Number of candidate segments:',len(sat_candidate_ids))

if len(sat_candidate_ids) > 0:     
    # Cluster the candidates and display
    clear_selected_ids() # Clear state for the next run
    cluster_dict = cluster_sat_candidates(sat_candidate_ids,model_dict,threshold=cluster_threshold)
    print('Number of clusters:',len(cluster_dict))
    print()
    list_clusters(cluster_dict,model_dict)
else:
    # Initialise so user can do another run with the currently selected topic
    clear_selected_ids()
    first_time = True
    print('The process has terminated. Please review the final SAT set in the cell below.')



# SAT Review

This stage provides the opportunity to review the segments of the expanded SAT. A segment can be removed by unchecking its checkbox.


## Step 1: Create the SAT Review interface

This step creates an interface within which you define the cluster threshold for the final SAT sections. Clustering is a useful tool for identifying sections that may be edge cases, as well as sections that might indicate the presence of sub-topics.

Run the cell below to generate the interface for selecting the following parameter:

- Cluster threshold
  - Groups SAT sections together.
  - Too low and you'll get one big cluster containing all SAT sections.
  - Too high and most sections will appear in the `singletons` set.
  - 0.74 is a good starting point and is set as the default, but you may need to experiment.

Once you are happy with your choice click on the `Apply Choices` button and move on to Step 2.


In [None]:
%run ./_library/utilities.py

review_choice_dict = init_review_choice_dict()
review_interface(review_choice_dict,0.74)


## Step 2: Review the final SAT

This section provides an opportunity to review the segments of the expanded SAT. A segment can be removed by unchecking its checkbox.

The results layout and interface are identical to those of SAT Generation: Step 2 and SAT Expansion: Step 3. However, all checkboxes are checked by default and can be unchecked to remove a section from the SAT.

## Use cases

1. I unchecked a section but now I want to add it back in.
    - Click on the section's checkbox and the section will be added back into the SAT.
2. I need to return to the expansion stage to make sure I didn't miss anything.
    - Return to SAT Expansion: Step 2, set a lower mapping threshold, click on Apply Choices, and rerun Step 3.


In [None]:

review = True

cluster_threshold = review_choice_dict['cluster_threshold']

# Make sure checkbox state contains all SAT segments
set_selected_ids(sat_segment_ids)

# Store the SAT set that is being reviewed
# Copy the set so that later changes to sat_segment_ids cannot alter this snapshot
review_sat_ids = set(sat_segment_ids)

cluster_dict = cluster_sat_candidates(sat_segment_ids,model_dict,threshold=cluster_threshold)
print('Number of SAT segments:',len(sat_segment_ids))
print('Number of clusters:',len(cluster_dict))
print()
list_clusters(cluster_dict,model_dict,check_all=True)



## Step 3: Accept review and write the final SAT to CSV

Run the cell below to generate the final interface for selecting the following values:

- Topic label
  - A short human-readable label for your new topic.
- Topic description
  - A description of the new topic that is more expansive than the topic formulation.

Once you are happy with your choices click on the `Accept Review` button which saves two files into the `outputs/` folder:

1. `<topic_key>_final_SAT.csv`: contains a list of all SAT sections.
2. `<topic_key>_resource.json`: A full history of your choices and results.
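
The writing itself is done by the accept-review code in the external library. As a rough illustration of the two outputs, the sketch below derives the removed segments as a set difference and writes a CSV and a JSON file using the column names listed under Outputs. The function `write_outputs`, the sample rows, and the use of a temporary directory instead of `outputs/` are assumptions made for the example.

```python
import csv
import json
import os
import tempfile

def write_outputs(topic_key, rows, resource, out_dir):
    """Write the final SAT rows to CSV and the session history to JSON."""
    csv_path = os.path.join(out_dir, f'{topic_key}_final_SAT.csv')
    json_path = os.path.join(out_dir, f'{topic_key}_resource.json')
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['segment_id', 'segment_text', 'constitution'])
        writer.writeheader()
        writer.writerows(rows)
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(resource, f, ensure_ascii=False, indent=2)
    return csv_path, json_path

# Segments shown in the review versus those still checked on acceptance
reviewed_ids = {'seg_1', 'seg_2', 'seg_3'}
checked_ids = {'seg_1', 'seg_3'}
removed_ids = reviewed_ids - checked_ids

# Hypothetical rows and history, for illustration only
rows = [
    {'segment_id': 'seg_1', 'segment_text': 'First section text', 'constitution': 'Example_2001'},
    {'segment_id': 'seg_3', 'segment_text': 'Third section text', 'constitution': 'Example_1995'},
]
resource = {'topic_key': 'parents', 'review': {'removed_segments': sorted(removed_ids)}}

with tempfile.TemporaryDirectory() as out_dir:
    csv_path, json_path = write_outputs('parents', rows, resource, out_dir)
    with open(csv_path, newline='', encoding='utf-8') as f:
        saved_rows = list(csv.DictReader(f))
    with open(json_path, encoding='utf-8') as f:
        saved_resource = json.load(f)

print(removed_ids, len(saved_rows))
```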


In [None]:
%run ./_library/utilities.py

# Set the SAT segments to the checked segments in review
sat_segment_ids = get_selected_ids()

print('Number of segments in final SAT:',len(sat_segment_ids))
print()

if len(sat_segment_ids) > 0:  
    accept_review_interface(sat_segment_ids,review_sat_ids,resource_dict,model_dict,accept_review)
else:
    print('The SAT is empty.')
