In [19]:
from sklearn.metrics import jaccard_score
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np
import pandas as pd

This notebook explores how similar the dependencies (that is, columns) are across our ADKP metadata manifest templates, and whether any of the assay templates in particular are similar enough to be combined easily. The motivation is to reduce the number of templates in order to 1) make the data model easier to maintain, especially the cross-manifest validation rules, and 2) give data contributors fewer options in DCA, making it more likely that they select the correct template.

I initially asked GPT-4o: 'Using the DependsOn column in this csv, which templates (listed under Attribute) are the most similar to each other?' It returned the code used in the cells that follow, which I adjusted to work with our data.

I also asked it why Jaccard Similarity, which it chose, was the most appropriate metric. Response: 

> Jaccard similarity is often chosen for comparing sets of categorical data, and it was appropriate in this context for several reasons:

1. **Set-Based Comparison**:
   - The "DependsOn" column data is naturally represented as sets of dependencies. Jaccard similarity is specifically designed to compare the similarity and diversity of sample sets.

2. **Presence/Absence Data**:
   - Jaccard similarity focuses on the presence or absence of elements in the sets, rather than their frequency or magnitude. This makes it well-suited for cases where the key concern is whether dependencies exist or not, rather than how often they occur.

3. **Intuitive Interpretation**:
   - The Jaccard index provides an intuitive interpretation of similarity as the ratio of the intersection of sets to their union. This makes it easy to understand and explain the degree of similarity between different templates.

4. **Non-Numeric Data**:
   - Unlike measures like Euclidean or Manhattan distances, Jaccard similarity does not require numeric data and can directly work with categorical data, such as lists of dependencies.

5. **Handling of Binary Attributes**:
   - When comparing binary attributes (present or absent), Jaccard similarity is particularly effective as it accounts for the proportion of shared attributes relative to the total unique attributes across both sets.

These properties make Jaccard similarity a natural and effective choice for comparing the dependency sets in your templates, providing clear and meaningful results based on the presence or absence of shared dependencies. If you have any specific concerns or need further details on the suitability of Jaccard similarity or other methods, feel free to ask!
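To make the "intersection over union" definition concrete, here is a minimal sketch in plain Python. The column names are hypothetical and chosen only for illustration; they are not taken from the actual templates, and returning 0.0 for two empty sets is a convention assumed here.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)


# Hypothetical dependency sets, for illustration only
template_a = {"specimenID", "assay", "platform", "libraryPrep"}
template_b = {"specimenID", "assay", "platform", "readLength"}

similarity = jaccard(template_a, template_b)
print(similarity)  # 3 shared columns / 5 distinct columns = 0.6
```

Two templates that share three of five distinct columns score 0.6, so the 0.5 cutoff used later in this notebook roughly means "at least half of the combined columns are shared".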

In [20]:
# read in the data model template csv
file_path = './modules/template/templates.csv'

data = pd.read_csv(file_path)

In [21]:

# Tokenize the DependsOn strings into sets, trimming whitespace and skipping missing values
data['DependsOn Sets'] = data['DependsOn'].fillna('').apply(
    lambda x: {dep.strip() for dep in str(x).split(',') if dep.strip()}
)


In [22]:

# Binarize the data for Jaccard similarity calculation
mlb = MultiLabelBinarizer()
depends_on_binarized = mlb.fit_transform(data['DependsOn Sets'])
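As a quick illustration of what the binarization step produces, the sketch below runs `MultiLabelBinarizer` on two small sets. The set contents and the `toy_` names are hypothetical and are not part of the real data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical dependency sets, for illustration only
toy_sets = [{"specimenID", "assay"}, {"specimenID", "platform"}]

toy_mlb = MultiLabelBinarizer()
toy_matrix = toy_mlb.fit_transform(toy_sets)

# One column per distinct dependency, in sorted order
print(list(toy_mlb.classes_))  # ['assay', 'platform', 'specimenID']
# One row per template: 1 if the dependency is present, 0 otherwise
print(toy_matrix.tolist())     # [[1, 0, 1], [0, 1, 1]]
```

Each row is a presence/absence vector over the full vocabulary of dependencies, which is the form `jaccard_score` expects for a binary comparison.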


In [23]:

# Calculate Jaccard similarity for each pair of templates
# (only the upper triangle is filled, so each pair appears once)
n = len(depends_on_binarized)
jaccard_similarities = np.zeros((n, n))

for i in range(n):
    for j in range(i+1, n):
        jaccard_similarities[i, j] = jaccard_score(depends_on_binarized[i], depends_on_binarized[j])
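The pairwise loop is easy to read but calls `jaccard_score` once per pair. As an alternative sketch, not the method used in this notebook, the same quantity can be computed for all pairs at once with NumPy matrix operations. The matrix `X` below is a hypothetical stand-in for `depends_on_binarized`.

```python
import numpy as np

# Hypothetical binarized matrix: rows are templates, columns are dependencies
X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
])

intersection = X @ X.T  # number of shared dependencies for every pair
row_sums = X.sum(axis=1)
union = row_sums[:, None] + row_sums[None, :] - intersection

# Divide only where the union is non-empty, leaving 0.0 elsewhere
sim = np.divide(
    intersection,
    union,
    out=np.zeros(union.shape, dtype=float),
    where=union > 0,
)

# Keep only pairs with i < j, matching the loop's upper-triangle layout
upper = np.triu(sim, k=1)
print(upper)
```

Here the first two rows share two of four distinct dependencies, so `upper[0, 1]` is 0.5, and the lower triangle and diagonal stay at zero.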


In [24]:

# Get the pairs with the highest similarity scores
similarity_threshold = 0.5  # Arbitrary cutoff for display purposes; adjust as needed
similar_pairs = np.argwhere(jaccard_similarities >= similarity_threshold)


In [25]:

# Prepare the results
results = []
for i, j in similar_pairs:
    results.append({
        'Template 1': data.iloc[i]['Attribute'],
        'Template 2': data.iloc[j]['Attribute'],
        'Jaccard Similarity': jaccard_similarities[i, j]
    })


In [26]:
# make a data frame
similar_templates_df = pd.DataFrame(results)


In [27]:

# Display the results
similar_templates_df.sort_values(by='Jaccard Similarity', ascending=False)


| | Template 1 | Template 2 | Jaccard Similarity |
|---|---|---|---|
| 24 | assay_wholeExomeSeq_metadata_template | assay_wholeGenomeSeq_metadata_template | 0.941176 |
| 5 | assay_autorad_metadata_template | assay_PET_metadata_template | 0.857143 |
| 16 | assay_methylationArray_metadata_template | assay_snpArray_metadata_template | 0.818182 |
| 21 | assay_scrnaSeq_metadata_template | assay_STARRSeq_metadata_template | 0.785714 |
| 18 | assay_rnaSeq_metadata_template | assay_scrnaSeq_metadata_template | 0.76 |
| 19 | assay_rnaSeq_metadata_template | assay_STARRSeq_metadata_template | 0.703704 |
| 4 | assay_autorad_metadata_template | assay_MRI_metadata_template | 0.666667 |
| 23 | assay_STARRSeq_metadata_template | assay_wholeGenomeSeq_metadata_template | 0.653846 |
| 1 | assay_ATACSeq_metadata_template | assay_bisulfiteSeq_metadata_template | 0.652174 |
| 6 | assay_bisulfiteSeq_metadata_template | assay_HIC_metadata_template | 0.636364 |


Remaining questions:

1. What's a good way to extend this from pairwise comparisons to groups of more than two manifests?
    - a. Potentially a clustering model?
    - b. Combining the most similar templates above and computing Jaccard similarities again to get a second round of recommended reductions?
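One possible way to approach option (a) is agglomerative (hierarchical) clustering on Jaccard distance, which is 1 minus the Jaccard similarity. The sketch below uses SciPy on a small hypothetical matrix. The template names, the average-linkage choice, and the 0.5 distance cutoff are all assumptions made for illustration, not recommendations derived from the real data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical templates and binarized dependencies, for illustration only
names = ["template_A", "template_B", "template_C", "template_D"]
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=bool)

# Pairwise Jaccard distances (1 - similarity) in condensed form
distances = pdist(X, metric="jaccard")

# Average linkage merges the groups whose members are closest on average
tree = linkage(distances, method="average")

# Cut the tree so that groups merged above a distance of 0.5 stay separate
labels = fcluster(tree, t=0.5, criterion="distance")

groups = {}
for name, label in zip(names, labels):
    groups.setdefault(int(label), []).append(name)

for members in groups.values():
    print(members)
```

With this toy input the cut yields two groups, A with B and C with D. On the real matrix, the cutoff would need tuning, and plotting the tree with `scipy.cluster.hierarchy.dendrogram` may help in choosing it.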