# Comparing Constitutions

## Introduction

This notebook presents a semantic similarity analysis of Chilean constitutional texts and related Latin American constitutions. We employ two complementary approaches to understand the relationships between constitutional documents: 
1. Textual alignment of constitutions using Kernel Density Estimation (KDE).
2. Thematic alignment of constitutions by comparing the distributions of topic mentions.

Both methods are based on semantic embeddings and textual analysis, but they measure different aspects of constitutional relationships. 

Textual alignment measures direct semantic relationships between individual constitution text segments in order to compare the similarity of language and phrasing between two constitutions. This creates a composite signal containing both linguistic and thematic dimensions—high semantic similarity without thematic overlap is impossible because topics are expressed by segments.

In contrast, the topic alignment method isolates the thematic dimension to reveal which topics are prioritised by creating a topic profile for each constitution. This profile is created by measuring the semantic similarity between pre-defined topics belonging to the CCP ontology and constitution text segments.

The relationship between segment-segment (textual alignment) and topic-segment (thematic alignment) similarity reveals important patterns:

1. High thematic alignment + moderate textual alignment: Two constitutions prioritise the same topics but express these topics in different language. The topic method reveals underlying structural similarity that segment-level comparison underestimates due to linguistic variation. This pattern indicates conceptual convergence—drafters adopted similar thematic frameworks but crafted original constitutional language.
2. Moderate textual similarity + lower thematic similarity: The constitutions share some constitutional vocabulary and formulaic language but differ in their substantive topic priorities. Constitutional boilerplate creates segment-level similarity even when thematic emphasis diverges. The topic alignment method filters out this noise to reveal the true differences between constitution priorities.
3. Both high: Strong alignment on both dimensions—constitutions share thematic priorities and express them in similar language (genuine textual borrowing with topic overlap).
4. Both low: Fundamentally different constitutions with neither thematic nor linguistic commonality.


## Data Sources

The analysis examines four Chilean constitutions:

- The 1980 constitution (as amended through 2021).
- The rejected 2018 reform proposal.
- The rejected 2022 progressive draft.
- The rejected 2023 conservative draft.

The Chilean constitutions offer a unique opportunity to study constitutional evolution, reform failure, and the dynamics of constitutional change.

In addition there are two comparative cases representing the new Latin American constitutionalism movement:

- Ecuador's 2008 constitution.
- Bolivia's 2009 constitution.

The constitution labels used in the notebook, ordered by enactment or draft date, are:

- Chile_2021—1980
- Ecuador_2021—2008
- Bolivia_2009—2009
- Chile_2018D—2018
- Chile_2022D—2022
- Chile_2023DD—2023


## Method 1: Textual Alignment of Constitutions

### Overview

For each pair of constitutions, we construct a similarity matrix where rows represent segments from one constitution (A) and columns represent segments from the other constitution (B). From this matrix, we extract bidirectional maximum similarities: for each segment in Constitution A, we identify its most similar segment in Constitution B, and vice versa. This provides a semantic similarity distribution for a pair of constitutions.
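
The bidirectional extraction can be sketched on a toy matrix. This is a minimal illustration, not the notebook's implementation: the function name `bidirectional_max` and the toy values are assumptions, and the notebook's own helper `get_max_scores` (loaded from `_library`) may differ in detail.

```python
import numpy as np

def bidirectional_max(sim):
    """Return row-wise and column-wise maxima of a similarity matrix as one sample."""
    sim = np.asarray(sim, dtype=float)
    row_max = sim.max(axis=1)  # best match in B for each segment of A
    col_max = sim.max(axis=0)  # best match in A for each segment of B
    return np.concatenate([row_max, col_max])

# Toy matrix: 2 segments in constitution A (rows), 3 in constitution B (columns)
toy = np.array([[0.9, 0.2, 0.4],
                [0.3, 0.8, 0.5]])
scores = bidirectional_max(toy)
print(scores)
```

The third column shows why both directions matter: its best match (0.5) is never the best match of any row, so a row-only pass would miss it.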

We apply kernel density estimation (KDE) to our observed text segment semantic similarity distributions. KDE is a non-parametric method used to estimate the probability density function (PDF) of observed data (Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis, Chapman and Hall, London).

KDE offers several advantages:

- Distribution-agnostic: We estimate the shape of underlying similarity distributions without assuming they follow known parametric forms (e.g., normal, uniform).
- Sample size control: KDE enables fair comparison between pairs of constitutions of different length. A short constitution with consistently high matches can be properly compared to a much longer one.
- Mathematical operations: We can integrate under the right tail to quantify the proportion of high-quality matches.
- Visual interpretability: KDE curves reveal structural features like bimodality, which might indicate mixed drafting strategies (e.g., extensive borrowing for some provisions, genuine innovation for others).

### The Semantic Alignment Metric

Using integration, we determine the area under a PDF $p$ within a similarity score interval $I$ with limits $[a,1.0]$. This area is the probability that a similarity score $x$ lies in the interval, and it is our measure of textual semantic alignment $A$.

$$A = \Pr(x \in I) = \int_{a}^{1.0} p(x) \, dx$$

Our approach captures the probability mass in the high-similarity region, representing the extent to which two constitutions share closely aligned language. Higher integral values indicate greater constitutional similarity.
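
As a rough sketch of the metric, the integral can be computed with SciPy's `gaussian_kde`. The synthetic scores, the lower limit `a = 0.7`, and the default (Scott) bandwidth are assumptions made for illustration; the notebook's own `get_pdf_integrals` function may be implemented differently.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic similarity scores clipped to [0, 1]; a stand-in for real maximum similarities
samples = np.clip(rng.normal(loc=0.65, scale=0.1, size=500), 0.0, 1.0)

kde = gaussian_kde(samples)                # Gaussian kernel, default (Scott) bandwidth
a = 0.7                                    # assumed lower limit of the interval
alignment = kde.integrate_box_1d(a, 1.0)   # A = integral of p(x) from a to 1.0
print(round(alignment, 3))
```

A Gaussian kernel places a little probability mass beyond 1.0, so fixing the upper limit at 1.0 keeps the metric restricted to valid similarity scores.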

### Interpreting Bimodality

When KDE reveals a bimodal distribution, a peak at high similarity may indicate extensive textual borrowing or minimal modification of existing provisions. Bimodality may therefore indicate a conservative drafting strategy (preserving some provisions while reforming others), revealing political compromises and/or evolutionary reform approaches.
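
One possible way to check a KDE curve for bimodality is to count its prominent peaks. The sketch below is illustrative only: the two-cluster synthetic sample, the evaluation grid, and the `prominence=0.5` threshold are all assumptions, and the notebook itself relies on visual inspection of the plotted PDFs.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

rng = np.random.default_rng(1)
# Synthetic sample: one cluster of moderate matches, one of near-identical matches
samples = np.concatenate([rng.normal(0.55, 0.03, 500),
                          rng.normal(0.90, 0.03, 500)])

kde = gaussian_kde(samples)
grid = np.linspace(0.3, 1.2, 451)      # evaluation grid wider than the data range
density = kde(grid)

# Keep only peaks that stand out clearly from their surroundings
peaks, _ = find_peaks(density, prominence=0.5)
peak_locations = grid[peaks]
print(peak_locations)
```

The prominence threshold would need tuning on real data, since a small bump in the right tail may or may not be meaningful.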


## Method 2: Thematic Alignment of Constitutions

### Overview

The thematic alignment method measures constitutional similarity based on topics in constitutions rather than textual content. The topic profile of a constitution reveals which topics a constitution prioritises and is used to compare constitutions.

For each constitution $c$ we create a topic-segment matrix $T_c$ containing the semantic similarities of all topic-segment pairs. For each segment $j$ in $T_c$, we identify the $k$ topics with the highest similarity scores and set all other topic scores to zero:

$$T^{(k)}_c[i,j] = \begin{cases} 
T_c[i,j] & \text{if topic } i \text{ is in top-}k \text{ for segment } j \\
0 & \text{otherwise}
\end{cases}$$

Where:
- $T_c$ is the topic-segment similarity matrix for constitution $c$ (topics × segments)
- $i$ indexes topics, $j$ indexes segments
- $k$ is the number of top topics to retain (e.g., $k=3$)

By retaining only the strongest topic associations for a segment we prevent weak associations from homogenising a constitution's topic profile.
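
The top-$k$ filter can be sketched in NumPy as follows. The function name `keep_top_k` and the toy matrix are hypothetical; the notebook applies the filter inside `analyse_topic_alignment()`, whose internals are not shown here. The sketch assumes $k$ does not exceed the number of topics and resolves ties arbitrarily.

```python
import numpy as np

def keep_top_k(T, k):
    """Zero all but the k largest entries in each column (segment) of T."""
    T = np.asarray(T, dtype=float)
    # Row indices of the k largest values in each column
    top_rows = np.argpartition(T, -k, axis=0)[-k:, :]
    cols = np.arange(T.shape[1])
    filtered = np.zeros_like(T)
    filtered[top_rows, cols] = T[top_rows, cols]
    return filtered

# Toy topic-segment matrix: 4 topics (rows) x 2 segments (columns)
T_c = np.array([[0.9, 0.1],
                [0.5, 0.7],
                [0.2, 0.6],
                [0.4, 0.3]])
T_k = keep_top_k(T_c, k=2)
print(T_k)
```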

Next, a constitution's topic profile is computed as the topic marginal of the filtered topic-segment matrix. For each topic $i$ in constitution $c$, we compute the marginal by summing across all segments and normalising by the number of segments:

$$\hat{M}_c[i] = \frac{1}{n_c} \sum_{j=1}^{n_c} T^{(k)}_c[i,j]$$

Where:
- $n_c$ is the number of segments in constitution $c$
- $\hat{M}_c[i]$ represents the average emphasis on topic $i$ per segment
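
A short worked example of the marginal, using an illustrative filtered matrix with four topics and two segments (the values are assumptions chosen for easy arithmetic):

```python
import numpy as np

# Filtered topic-segment matrix T^(k)_c: 4 topics (rows) x 2 segments (columns)
T_k = np.array([[0.9, 0.0],
                [0.5, 0.7],
                [0.0, 0.6],
                [0.0, 0.0]])

n_c = T_k.shape[1]                 # number of segments in the constitution
profile = T_k.sum(axis=1) / n_c    # topic marginal, equivalent to T_k.mean(axis=1)
print(profile)
```

Dividing by $n_c$ keeps profiles comparable across constitutions with different numbers of segments.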


### The Thematic Alignment Metric

For any pair of constitutions $(c_1, c_2)$, we compute their topic alignment using vector similarity metrics.

Topics with non-zero coverage in at least one constitution are identified to create a mask:

$$\text{mask}[i] = (\hat{M}_{c_1}[i] > 0) \lor (\hat{M}_{c_2}[i] > 0)$$

This removes topics that neither constitution emphasises, focusing the comparison on relevant topics only.

For constitution pair $(c_1, c_2)$, we compute similarity between masked topic profiles:

**Pearson Correlation:**

$$r(c_1, c_2) = \frac{\sum_{i \in \text{mask}} (\hat{M}_{c_1}[i] - \bar{M}_{c_1})(\hat{M}_{c_2}[i] - \bar{M}_{c_2})}{\sqrt{\sum_{i \in \text{mask}} (\hat{M}_{c_1}[i] - \bar{M}_{c_1})^2} \sqrt{\sum_{i \in \text{mask}} (\hat{M}_{c_2}[i] - \bar{M}_{c_2})^2}}$$

Where $\bar{M}_c$ is the mean of the masked marginals for constitution $c$.

**Cosine Similarity:**

$$\text{sim}(c_1, c_2) = \frac{\sum_{i \in \text{mask}} \hat{M}_{c_1}[i] \times \hat{M}_{c_2}[i]}{\sqrt{\sum_{i \in \text{mask}} \hat{M}_{c_1}[i]^2} \times \sqrt{\sum_{i \in \text{mask}} \hat{M}_{c_2}[i]^2}}$$

**Output**: A single similarity value in $[0, 1]$ (for cosine, assuming non-negative topic profiles) or $[-1, 1]$ (for correlation) indicating topic alignment between the two constitutions.
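
The mask and the two metrics can be sketched as below. The function name `topic_alignment` and the toy profiles are hypothetical; the dictionary keys mirror those returned by the notebook's `analyse_topic_alignment()`, but this is not that function's implementation.

```python
import numpy as np

def topic_alignment(m1, m2):
    """Compare two topic profiles on topics covered by at least one of them."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    mask = (m1 > 0) | (m2 > 0)          # drop topics neither constitution covers
    a, b = m1[mask], m2[mask]
    correlation = np.corrcoef(a, b)[0, 1]
    cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return {'correlation': correlation, 'cosine': cosine}

# Toy profiles over 4 topics; the last topic is absent from both and is masked out
profile_1 = [0.45, 0.6, 0.3, 0.0]
profile_2 = [0.40, 0.5, 0.0, 0.0]
result = topic_alignment(profile_1, profile_2)
print(result)
```

Masking matters more for the correlation than for the cosine: shared zeros contribute nothing to the cosine sums, but they shift the means used by Pearson's formula.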

## Summary

KDE analysis provides robust, distribution-aware similarity metrics for ranking constitution pairs. Topic profile analysis explains the substantive basis of those similarities, revealing whether constitutions align on core principles, procedural structures, or specific policy domains.

Together, they offer both quantitative rigour (through mathematical integration and statistical density estimation) and qualitative insight (through topic-level interpretation), making constitutional comparison both systematic and meaningful.



## Initialise

This cell loads Python packages and functions. It also loads the data model created by the code in the `processing` directory.


In [None]:
__author__      = 'Roy Gardner'
__copyright__   = 'Copyright 2023-2025, Roy and Sally Gardner'

%run ./_library/packages.py
%run ./_library/utilities.py
%run ./_library/mapping.py
%run ./_library/kde_pdf.py

model_path = '../model/'

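try:
    n = len(model_dict)
except NameError:
    # model_dict is not defined yet, so load the data model from disk,
    # skipping the files named in exclusion_list
    exclusion_list=['segment_encoding.json','topic_encodings.json']
    model_dict = do_load(model_path,exclusion_list=exclusion_list,verbose=True)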


# Get the objects we need for the analysis from the data model
documents_dict = model_dict['documents_dict']
segments_dict = model_dict['segments_dict']
encoded = model_dict['encoded_segments']
segments_matrix = np.array(model_dict['segments_matrix'])
topic_segment_matrix = np.array(model_dict['topic_segment_matrix'])

# Create some constitution labels for visualisations
doc_data = [(k,v['date']) for k,v in documents_dict.items()]
doc_data = sorted(doc_data,key=lambda t:t[1])
const_list = []
for doc in doc_data:
    const_list.append(f'{doc[0]}—{doc[1]}')


## Textual Alignment

### Method

The method processes the matrix `segments_matrix` contained in `model_dict`. This matrix contains the semantic similarities of all pairs of segments in our set of constitutions. 

For each constitution pairing:

- Collect the **row** indices from the `segments_matrix` corresponding to the first constitution's segments.
- Collect the **column** indices from the `segments_matrix` corresponding to the second constitution's segments.
- Use the row and column indices to extract a sub-matrix from the `segments_matrix`.
- Extract the bidirectional maximum similarity values from the sub-matrix. The rationale is that a row segment $s_i$ may be maximally similar to a column segment $s_j$, but $s_j$ might not be maximally similar to $s_i$.
- Use the maximum similarity values to generate a probability density function (PDF) using Kernel Density Estimation (KDE). Rationale: The more similar a pair of constitutions, the longer and denser the right-hand tail of the distribution.
- Integrate under the PDF in a given similarity score interval (default 0.7 - 1.0) to measure the likelihood that a pair of constitutions contains semantically similar segments.
- Visualise the PDFs and the integral values.
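
The sub-matrix extraction step uses `np.ix_`, which builds an open mesh from the row and column index lists. A small illustration on a toy matrix (the index lists are assumptions standing in for real segment indices):

```python
import numpy as np

# Toy 5 x 5 matrix standing in for segments_matrix
full = np.arange(25).reshape(5, 5)

rows = [0, 1]       # indices of the first constitution's segments
cols = [2, 3, 4]    # indices of the second constitution's segments

# Select the rows x cols block; plain full[rows, cols] would fail to broadcast here
sub = full[np.ix_(rows, cols)]
print(sub)
```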

### Visualisations

Three visualisations are presented:

- PDFs for each pair of constitutions.
- A bar chart comparing the relative magnitudes of the integral values across pairs of constitutions.
- A heatmap showing the integral value for every pair of constitutions.

### Interpretation

#### Conservative Chilean constitutions

At the integral limits used in the analysis, the similarity of three Chilean constitutions is striking. These constitutions are:
- The in-force constitution `Chile_2021` enacted in 1980.
- The draft constitution `Chile_2018D`.
- The draft constitution `Chile_2023DD`.

All three pairwise comparisons of these constitutions:

- `Chile_2021—Chile_2018D`
- `Chile_2021—Chile_2023DD`
- `Chile_2018D—Chile_2023DD`

show:

- Long right-hand tails in the PDF with varying degrees of bimodality. The pair `Chile_2021—Chile_2018D` has two clear peaks in the distribution.
- Integral values that are the three largest, as shown in the bar chart and the heatmap.

#### Ecuador and Bolivia




In [None]:

samples_list = [] # samples for KDE
labels_list = [] # Labels for constitution pairs
matrix_dict = {} # Store the matrices for the constitution pairs

# Get pairwise combinations of constitutions
combs = list(combinations(doc_data,r=2))
for c in combs:
    row_doc = c[0][0]
    col_doc = c[1][0]
    # Get row segment IDs
    row_segment_ids = [k for k,_ in segments_dict.items() if k.split('/')[0] == row_doc and\
                             k in encoded]
    # Translate row segment IDs into row segment indices
    row_segment_indices = [encoded.index(segment_id) for segment_id in row_segment_ids]
    
    # Get column segment IDs
    col_segment_ids = [k for k,_ in segments_dict.items() if k.split('/')[0] == col_doc and\
                             k in encoded]
    # Translate column segment IDs into column segment indices
    col_segment_indices = [encoded.index(segment_id) for segment_id in col_segment_ids]
        
    # Get the segments sub-matrix for the pair of constitutions
    matrix = segments_matrix[np.ix_(row_segment_indices,col_segment_indices)]
    # Get the maximum similarity scores for the sub-matrix
    samples_list.append(get_max_scores(matrix))
    label = f'{c[0][0]}—{c[0][1]} versus {c[1][0]}—{c[1][1]}'
    labels_list.append(label)
    matrix_dict[(c[0],c[1])] = matrix
    
print()

# Compute and visualise the PDFs using KDE
print('PDFs for pairs of constitutions from the set - see legend.')
plot_pdfs(samples_list,labels_list,xlim=[0.5,1.0])

# Compute the PDF integrals within the defined limits
limits = [0.7,1.0]
integrals = get_pdf_integrals(samples_list,limits,sample_size=2)
print()

# Plot the relative magnitude of the integrals
print('Relative magnitude of integrals in interval for pairs of constitutions.')
plot_pdf_integrals(integrals,labels_list,limits,'Similarity of constitution pair','Constitution pair',\
                   title_suffix='',figsize=(8,6))


# Plot a heatmap of integral values
integral_matrix = np.zeros((len(doc_data),len(doc_data)))
for i,c in enumerate(combs):
    integral_matrix[doc_data.index(c[0]),doc_data.index(c[1])] = integrals[i]
    
plt.imshow(integral_matrix,aspect='auto',cmap=mpl.cm.Blues)
plt.yticks(range(0,len(const_list)),const_list)
plt.xticks(range(0,len(const_list)),const_list,rotation=90)
for (j,i),value in np.ndenumerate(integral_matrix):
    if round(value,2) > 0.3:
        color = 'white'
    else:
        color = 'black'
    if value > 0:
        plt.text(i,j,round(value,2),color=color,ha='center',va='center',alpha=0.9)
plt.grid(alpha=0.6)
plt.title('Heatmap of integral values')
plt.show()


### Segment similarity analysis

We compare the counts of identical and high-similarity segments in our constitution pairs and present the results in heatmaps. This confirms the finding of the KDE analysis with respect to the three conservative Chilean constitutions.

A list of identical segments is also provided for the three conservative Chilean constitutions.


In [None]:
# Plot a heatmap of identical segments
identical_dict = {}
identical_matrix = np.zeros((len(doc_data),len(doc_data))).astype(int)
for i,c in enumerate(combs):
    row_doc = c[0][0]
    col_doc = c[1][0]
    # Get row segment IDs
    row_segment_ids = [k for k,_ in segments_dict.items() if k.split('/')[0] == row_doc and\
                             k in encoded]
    # Get column segment IDs
    col_segment_ids = [k for k,_ in segments_dict.items() if k.split('/')[0] == col_doc and\
                             k in encoded]
    
    matrix = matrix_dict[(c[0],c[1])]
    indices = np.where(matrix == 1.0)
    # Collect the actual segments
    row_indices = indices[0]
    if len(row_indices) > 0:
        col_indices = indices[1]
        identical_dict[(c[0],c[1])] =\
         [(row_segment_ids[index],col_segment_ids[col_indices[j]]) for j,index in enumerate(row_indices)]
        
    identical_matrix[doc_data.index(c[0]),doc_data.index(c[1])] = len(indices[0])
    
plt.imshow(identical_matrix,aspect='auto',cmap=mpl.cm.Blues)
plt.yticks(range(0,len(const_list)),const_list)
plt.xticks(range(0,len(const_list)),const_list,rotation=90)
for (j,i),value in np.ndenumerate(identical_matrix):
    if value > 5:
        color = 'white'
    else:
        color = 'black'
    if value > 0:
        plt.text(i,j,value,color=color,ha='center',va='center',alpha=0.9)
plt.grid(alpha=0.6)
plt.title('Heatmap of identical segments')
plt.show()



# Plot a heatmap of high similarity segments
high_sim_matrix = np.zeros((len(doc_data),len(doc_data))).astype(int)
for i,c in enumerate(combs):
    matrix = matrix_dict[(c[0],c[1])]
    indices = np.where(matrix >= 0.8)
    high_sim_matrix[doc_data.index(c[0]),doc_data.index(c[1])] = len(indices[0])
    
plt.imshow(high_sim_matrix,aspect='auto',cmap=mpl.cm.Blues)
plt.yticks(range(0,len(const_list)),const_list)
plt.xticks(range(0,len(const_list)),const_list,rotation=90)
for (j,i),value in np.ndenumerate(high_sim_matrix):
    if value > 200:
        color = 'white'
    else:
        color = 'black'
    if value > 0:
        plt.text(i,j,value,color=color,ha='center',va='center',alpha=0.9)
plt.grid(alpha=0.6)
plt.title('Heatmap of segments with high semantic similarity')
plt.show()

print()
print()

# List of identical segments
for pair,segments in identical_dict.items():
    print(pair)
    for t in segments:
        print(f"{t[0]}: {segments_dict[t[0]]['text']}")
        print(f"{t[1]}: {segments_dict[t[1]]['text']}")
        print()
    print()
    print()

    
    

## Thematic Alignment

### Method

The method processes the matrix `topic_segment_matrix` contained in `model_dict`. This matrix contains the semantic similarities of all topic-segment pairs in our set of constitutions. 

For each constitution:

- Get the topic-segment matrix containing only the constitution's segments.

For each pair of constitutions, the function `analyse_topic_alignment()`:

- Obtains each constitution's topic profile from its topic-segment matrix using only the top scoring topics for each segment. The analysis here uses the top three topics.
- Uses vector similarity measures to compare the topic profiles of the constitutions. We use:
    - Pearson's R correlation coefficient.
    - Cosine similarity.

### Visualisations

Two visualisations are presented for two measures of topic profile comparison:

- A heatmap showing topic profile correlation coefficients (Pearson's R) for every pair of constitutions.
- A heatmap showing topic profile cosine similarities for every pair of constitutions.

### Interpretation




In [None]:

# Get topic-segment matrix for each constitution
topics_count = topic_segment_matrix.shape[0]
topic_matrix_dict = {}
for doc in doc_data:
    segment_ids = [k for k,_ in segments_dict.items() if k.split('/')[0] == doc[0] and\
                             k in encoded]
    segment_indices = [encoded.index(segment_id) for segment_id in segment_ids]
    # Get the topic-segment sub-matrix containing only this constitution's segments
    topic_matrix_dict[doc] = topic_segment_matrix[np.ix_(range(0,topics_count),segment_indices)]

# Heatmap visualisations
hm_corr_matrix = np.zeros((len(doc_data),len(doc_data)))
hm_cos_matrix = np.zeros((len(doc_data),len(doc_data)))

combs = combinations(doc_data,r=2)
for c in combs:
    m1 = topic_matrix_dict[c[0]]
    m2 = topic_matrix_dict[c[1]]
    # Get the alignment using the top k scoring topics for a segment k=3
    results = analyse_topic_alignment(m1,m2,k=3)
    hm_corr_matrix[doc_data.index(c[0]),doc_data.index(c[1])] = results['correlation']
    hm_cos_matrix[doc_data.index(c[0]),doc_data.index(c[1])] = results['cosine']
    
plt.imshow(hm_corr_matrix,aspect='auto',cmap=mpl.cm.Blues)
plt.yticks(range(0,len(const_list)),const_list)
plt.xticks(range(0,len(const_list)),const_list,rotation=90)

for (j,i),value in np.ndenumerate(hm_corr_matrix):
    if round(value,2) > 0.3:
        color = 'white'
    else:
        color = 'black'
    # Label every populated cell, including negative correlations
    if value != 0:
        plt.text(i,j,round(value,2),color=color,ha='center',va='center',alpha=0.9)
plt.grid(alpha=0.6)
plt.title('Heatmap of topic profile correlations')
plt.show()
 
print()

plt.imshow(hm_cos_matrix,aspect='auto',cmap=mpl.cm.Blues)
plt.yticks(range(0,len(const_list)),const_list)
plt.xticks(range(0,len(const_list)),const_list,rotation=90)

for (j,i),value in np.ndenumerate(hm_cos_matrix):
    if round(value,2) > 0.3:
        color = 'white'
    else:
        color = 'black'
    if value > 0:
        plt.text(i,j,round(value,2),color=color,ha='center',va='center',alpha=0.9)
plt.grid(alpha=0.6)
plt.title('Heatmap of topic profile cosine similarities')
plt.show()

