# Measure of similarity of policy and abstract

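This notebook measures how similar each extracted policy is to the abstract of the paper it comes from.

It takes the outputs of the previous phases and uses them to compute a matrix of correlations weighted by a similarity score. The similarity score is computed using Proportional Sentence Match: each policy text is split into sentences, and each sentence is compared with the abstract using sentence embeddings.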

In this Jupyter Notebook we will:
1. Import the data retrieved from the policy and outcome clustering process;
2. Import the relevant packages;
3. Prepare the data for computing;
4. Measure the prevalence of policies in the abstract using Proportional Sentence Match;
5. Export the data with the prevalence measure.

To complete these tasks you will need:
- The dataset of papers with the extracted policies, produced by the 1_policy_extraction code.
- The dataset of papers with the clustered policies, produced by the 2_policy_clustering code.
- The dataset of papers with the clustered outcomes, produced by the 3_outcomes_clustering code.

At the end of this script you will export:
- The named_cluster_df dataset of policies with prevalence metrics. 

## 1. Import the data retrieved from the policy and outcome clustering process

Change the input and output access paths:

In [None]:
## 3 inputs
## Output dataset of the 1_policy_extraction (df)
input_path_article = ""
## Output dataset of the 2_policy_clustering (named_cluster_df)
input_path_policy = ""
## Output dataset of the 3_outcomes_clustering (named_cluster_df)
input_path_outcome = ""

# 1 output
## Final dataset with clusters
policy_and_factors_clustered_similarity_normalized = ""
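
Before running the rest of the notebook, it can help to check that the three input paths have been filled in. The sketch below is only an illustration: the `missing_inputs` helper and the `example_paths` dictionary are hypothetical names that are not part of the original notebook.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the names of inputs whose path is empty or does not point to a file."""
    return [name for name, p in paths.items() if not p or not Path(p).is_file()]

# Hypothetical usage: the empty strings mirror the unset variables above
example_paths = {
    "input_path_article": "",
    "input_path_policy": "",
    "input_path_outcome": "",
}
print(missing_inputs(example_paths))
```

With every path still empty, all three names are reported as missing.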

## 2. Import the relevant packages

In [2]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer, util

## 3. Prepare data for computing

In [None]:
# Hard-coded local paths: replace them with the location of your own files
df_policy = pd.read_csv("C:/Users/easycash/Mon Drive/Thèse/1_Systematic mapping/6_structural_topic_model/5_final_db/3_policy_and_factors_clustered.csv")
df_init = pd.read_csv("C:/Users/easycash/Mon Drive/Thèse/1_Systematic mapping/6_structural_topic_model/3_exit/extract_policies_ML_concat.csv")

In [None]:
# Convert 'row_index' to numeric, coercing errors to NaN
df_policy['row_index'] = pd.to_numeric(df_policy['row_index'], errors='coerce')

# Drop rows without a valid index or without a matched factor cluster
# (.copy() avoids a SettingWithCopyWarning on the assignment below)
df_policy = df_policy.dropna(subset=['row_index', 'matched_cluster_factor']).copy()

# Convert to integer
df_policy['row_index'] = df_policy['row_index'].astype(int)

# Attach the abstract of each paper to its policy rows (left join on the paper index)
data = df_policy.set_index('row_index').join(df_init.set_index('Index')['abstract'], how='left').reset_index()
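
To make the preparation step easier to follow, here is a minimal sketch of the same cleaning and join on two tiny tables. The `policy` and `articles` frames and all their values are invented for illustration; only the column names `row_index`, `Index`, `POLICY` and `abstract` come from the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of df_policy and df_init
policy = pd.DataFrame({
    "row_index": ["0", "2", "not a number"],
    "POLICY": ["carbon tax", "fuel subsidy", "feed-in tariff"],
})
articles = pd.DataFrame({
    "Index": [0, 1, 2],
    "abstract": ["Abstract A", "Abstract B", "Abstract C"],
})

# Same cleaning steps as in the notebook: coerce, drop invalid keys, cast to int
policy["row_index"] = pd.to_numeric(policy["row_index"], errors="coerce")
policy = policy.dropna(subset=["row_index"]).copy()
policy["row_index"] = policy["row_index"].astype(int)

# The left join keeps every remaining policy row and attaches its abstract
toy = (
    policy.set_index("row_index")
    .join(articles.set_index("Index")["abstract"], how="left")
    .reset_index()
)
print(toy)
```

The row whose index cannot be parsed is dropped, and the two remaining policies receive the abstracts of papers 0 and 2.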

## 4. Prevalence of policies in abstract using Proportional Sentence Match

In [None]:
df = data

In [None]:
# Load pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
"""
Louis Notes:
- Not clear to me what we are doing here: is it mapping for each cluster of policies the share of outcomes clusters that are similar (ex: 8/10 were positive, 1 neutral, 1 negative)?
"""

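# Function to compute the similarity between each POLICY text and its abstract
def compute_proportional_sentence_match(df, threshold=0.8):
    results = []

    for index, row in df.iterrows():
        abstract, policy = row['abstract'], row['POLICY']

        # Split POLICY into individual sentences, skipping empty fragments
        # (e.g. the empty string left after a trailing period)
        if isinstance(policy, str):
            policy_sentences = [s.strip() for s in policy.split(".") if s.strip()]
        else:
            policy_sentences = []

        if isinstance(abstract, str) and policy_sentences:
            # Encode the abstract and POLICY sentences
            abstract_embedding = model.encode(abstract, convert_to_tensor=True)
            policy_embeddings = model.encode(policy_sentences, convert_to_tensor=True)

            # Compute cosine similarity (moved to CPU before converting to NumPy)
            similarities = util.cos_sim(policy_embeddings, abstract_embedding).cpu().numpy()

            # Metrics (the thresholded proportion is currently disabled)
            #relevant_sentences = (similarities > threshold).sum()
            #proportion_relevant = relevant_sentences / len(policy_sentences)
            average_similarity = similarities.mean()
        else:
            # Missing abstract or POLICY text: no similarity can be computed
            average_similarity = np.nan

        # Store results
        results.append({
            'abstract': abstract,
            'POLICY': policy,
            #'proportion_relevant': proportion_relevant,
            'average_similarity': average_similarity
        })

    # Keep the index of df so the results align with the original rows
    return pd.DataFrame(results, index=df.index)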
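
The two metrics involved here can be illustrated without loading a model. The sketch below assumes that Proportional Sentence Match means the share of policy sentences whose cosine similarity with the abstract exceeds a threshold. The `cosine_similarity` and `proportional_sentence_match` helpers and the 2-D vectors are hypothetical stand-ins for real sentence embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def proportional_sentence_match(sentence_embeddings, abstract_embedding, threshold=0.8):
    """Return (share of sentences above the threshold, mean similarity)."""
    sims = np.array([cosine_similarity(s, abstract_embedding) for s in sentence_embeddings])
    proportion = float((sims > threshold).mean())
    return proportion, float(sims.mean())

# Toy 2-D "embeddings" (invented values, not produced by a real model)
abstract_vec = np.array([1.0, 0.0])
sentence_vecs = [
    np.array([1.0, 0.0]),  # same direction as the abstract: similarity 1
    np.array([0.0, 1.0]),  # orthogonal to the abstract: similarity 0
]

proportion, mean_sim = proportional_sentence_match(sentence_vecs, abstract_vec)
print(proportion, mean_sim)
```

One of the two sentences passes the 0.8 threshold, so both the proportion and the mean similarity equal 0.5 in this toy case. On real embeddings the two metrics generally differ, which is why the choice between them matters.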

In [None]:
# Apply the function
results_df = compute_proportional_sentence_match(df)

In [None]:
df['average_policy_similarity'] = results_df['average_similarity']

In [None]:
# Encode the direction of the correlation: +1 increasing, -1 decreasing, 0 neutral
df.loc[df['CORRELATION']=='increasing','CORRELATION_num'] = 1
df.loc[df['CORRELATION']=='decreasing','CORRELATION_num'] = -1
df.loc[df['CORRELATION']=='neutral','CORRELATION_num'] = 0

# Normalize policy_similarity per matched_cluster
df['policy_similarity_normalized_by_cluster'] = df.groupby('matched_cluster')['average_policy_similarity'] \
    .transform(lambda x: (x - x.min()) / (x.max() - x.min()))

# Normalize policy_similarity per matched_cluster_factor
df['policy_similarity_normalized_by_factor'] = df.groupby('matched_cluster_factor')['average_policy_similarity'] \
    .transform(lambda x: (x - x.min()) / (x.max() - x.min()))

# Global normalization (min-max over all rows)
df['policy_similarity_normalized_global'] = (df['average_policy_similarity'] - df['average_policy_similarity'].min()) / \
                                            (df['average_policy_similarity'].max() - df['average_policy_similarity'].min())

# Weight the correlation sign by each normalized similarity score
df['correlation_prod_normalized_by_cluster'] = df['policy_similarity_normalized_by_cluster']*df['CORRELATION_num']
df['correlation_prod_normalized_by_factor'] = df['policy_similarity_normalized_by_factor']*df['CORRELATION_num']
df['correlation_prod_normalized_global'] = df['policy_similarity_normalized_global']*df['CORRELATION_num']
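
The per-group min-max normalization and the signed product can be checked on a small example. The sketch below uses invented values and a hypothetical `min_max` helper. It also shows that a cluster with a single row (or identical similarities) yields NaN, because the range in the denominator is zero.

```python
import pandas as pd

# Invented toy data: cluster "A" has three rows, cluster "B" only one
toy = pd.DataFrame({
    "matched_cluster": ["A", "A", "A", "B"],
    "average_policy_similarity": [0.25, 0.5, 0.75, 0.6],
    "CORRELATION": ["increasing", "decreasing", "neutral", "increasing"],
})

def min_max(x):
    """Rescale a Series to [0, 1]; a constant or single-row group gives NaN (0/0)."""
    return (x - x.min()) / (x.max() - x.min())

# Normalize within each cluster, then weight the correlation sign by the result
toy["normalized"] = toy.groupby("matched_cluster")["average_policy_similarity"].transform(min_max)
toy["sign"] = toy["CORRELATION"].map({"increasing": 1, "decreasing": -1, "neutral": 0})
toy["signed_score"] = toy["normalized"] * toy["sign"]
print(toy)
```

In cluster "A" the normalized values are 0, 0.5 and 1, so the "decreasing" row gets a signed score of -0.5. The single row of cluster "B" ends up as NaN, which may need to be handled before aggregating.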

## 5. Export data with prevalence measure

In [None]:
cluster_df = pd.read_csv(input_path_policy)
cluster_df_factor = pd.read_csv(input_path_outcome)

In [None]:
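# Keep only rows with a matched policy cluster and attach the policy cluster names
updated_df = df.dropna(subset=['matched_cluster'])
updated_df = pd.merge(updated_df, cluster_df[['Cluster Name', 'Agg Cluster']], how='left', left_on='matched_cluster', right_on='Cluster Name')
# Number of rows per aggregated cluster and cluster name (for inspection only)
grouped = updated_df.groupby(["Agg Cluster", "Cluster Name"])["matched_cluster"].count()

# Attach the outcome cluster names; overlapping columns from the outcome side get the '_factor' suffix
# (None, rather than False, leaves the left-hand column names unchanged)
updated_df = pd.merge(updated_df, cluster_df_factor[['Cluster Name', 'Agg Cluster', 'Corr Sign']], how='left', left_on='matched_cluster_factor', right_on='Cluster Name', suffixes=(None, '_factor'))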
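
Because both cluster tables share the column name `Cluster Name`, the second merge has to disambiguate overlapping columns. The sketch below shows how `suffixes=(None, "_factor")` behaves in `pd.merge`, using two invented tables. Only the column names are taken from the notebook.

```python
import pandas as pd

# Invented left table: already carries the policy cluster name
left = pd.DataFrame({
    "matched_cluster_factor": ["emissions", "employment"],
    "Cluster Name": ["carbon tax", "fuel subsidy"],
})
# Invented right table: outcome clusters with their correlation sign
right = pd.DataFrame({
    "Cluster Name": ["emissions", "employment"],
    "Corr Sign": ["negative", "positive"],
})

merged = pd.merge(
    left, right, how="left",
    left_on="matched_cluster_factor", right_on="Cluster Name",
    suffixes=(None, "_factor"),  # left names kept as-is, right names suffixed
)
print(merged.columns.tolist())
```

The policy-side `Cluster Name` keeps its name, while the outcome-side column becomes `Cluster Name_factor`.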

In [None]:
# Update with your desired output path
updated_df.to_csv(policy_and_factors_clustered_similarity_normalized, index=False)