# Creating a similarity index for LLMs

Sometimes it feels as if the same prompt given to two different LLMs produces the very same output. I tried to measure how true that is.

## Data collection

I ran the same prompt in three different LLMs: gemini-2.5-flash-lite, gpt-5-nano and grok-4-fast-non-reasoning.

The system prompt was:

> You are an expert at world History. You know that History is as much what happened as what is said to have happened. You are not ensconsed in any particular vision of History, especially not one from school or from university. You know that no version of History is right or false, but that History is a narrative that people use to explain the present. You know that History is always situated and created in and by a given context, but you are expert enough that you can move between contexts and present History from different points of views.

The prompt itself was:

> Provide a list of 200 dates that are most relevant to world History. I do not mean relevant to a world History class at school or in college, but relevant to world History in general.

> The result should be a Python list of the form [{"event": description of the event, "year": year of the event, "date": date of the event in the format 1970-12-01}]

In theory, this should have yielded 100 batches of 200 events from each LLM.

(In truth, out of roughly 300 prompts, only once did an LLM actually return 200 events.)

## Data cleaning

I removed all content that was not a Python list.
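A minimal sketch of how that extraction could be done, assuming each raw response is available as a string. The helper name `extract_event_list` and the sample response are hypothetical and not taken from the original pipeline:

```python
import ast
import re


def extract_event_list(response_text):
    """Return the list literal found in an LLM response, or None if there is none."""
    # Greedy match from the first '[' to the last ']' in the text.
    match = re.search(r"\[.*\]", response_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        # literal_eval only accepts Python literals, so it never runs arbitrary code.
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None
    return parsed if isinstance(parsed, list) else None


# Hypothetical response with chatter around the list.
raw = (
    "Here is the list:\n"
    '[{"event": "Fall of Constantinople", "year": 1453, "date": "1453-05-29"}]\n'
    "Let me know if you need more."
)
events = extract_event_list(raw)
```

Responses that contain no parsable list come back as `None`, so they can be counted or discarded separately.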

Gemini returned many dates set to "1970-12-01", the example date given in the prompt. To repair them, I selected every event whose "year" field differed from the year in its "date" field. I then looked in the OpenAI and xAI output for an event with the exact same name, or with a name at a Levenshtein distance of less than 15, and copied its date value.
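The matching step can be sketched as follows. This is an illustration under assumptions, not the code actually used: the function names, the pure-Python edit distance, and the choice to keep the closest match are all hypothetical.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def find_reference_date(event_name, reference_events, max_distance=15):
    """Return the date of the closest reference event, or None if none is close enough."""
    best_date, best_distance = None, max_distance
    for reference in reference_events:
        distance = levenshtein(event_name, reference["event"])
        # Strict comparison: only distances below max_distance count as a match.
        if distance < best_distance:
            best_date, best_distance = reference["date"], distance
    return best_date


# Hypothetical reference events, standing in for the OpenAI and xAI output.
reference_events = [
    {"event": "Battle of Waterloo", "date": "1815-06-18"},
    {"event": "Fall of the Berlin Wall", "date": "1989-11-09"},
]
matched_date = find_reference_date("The Battle of Waterloo", reference_events)
```

Note that any two names shorter than 15 characters are always within the threshold, so keeping the closest candidate rather than the first one found seems the safer choice.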

## Data analysis

I now harmonize the data so that all three LLMs have the same number of batches, each of the same size.

I then compare each batch to all other batches from the same provider, and to all batches from the other two providers, using the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index).

I use the 'date' field as the point of comparison between batches, because the wording of the 'event' field can vary considerably (from a lexicometric perspective) even when it describes the same event (from the perspective of human understanding).

The value of the Jaccard index is bound to be an underestimate, as many events were given different dates by the LLMs.
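A small worked example of that underestimate, using two made-up batches of three dates each. Both batches are meant to describe the same three events, but the second one dates the third event differently (the date "1347-10-01" is a hypothetical variant):

```python
def jaccard_index(set1, set2):
    """Size of the intersection divided by the size of the union."""
    union = len(set1 | set2)
    return len(set1 & set2) / union if union > 0 else 0


# Same three events in both batches; only the last date differs.
batch_a = {"1815-06-18", "1969-07-20", "1347-01-01"}
batch_b = {"1815-06-18", "1969-07-20", "1347-10-01"}

score = jaccard_index(batch_a, batch_b)  # 2 shared dates / 4 distinct dates
```

A reader would call these two batches identical, yet the index is 0.5 rather than 1.0, because the mismatched date shrinks the intersection and enlarges the union at the same time.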

In [246]:
import pandas as pd
import re

In [247]:
df = pd.read_json("clean_data/merged_events_corrected_fuzzy.json")

In [248]:
df['batch'] = df['source'] + '_' + df['batch'].astype(str)

In [249]:
batch_size = 180

### Drop all rows after Dec 31, 2022, to remove absurd events from xAI and future events from OpenAI

In [251]:
df = df[df["year"].notna() & (df["year"] != "")]
df = df[df["year"].astype(int) < 2023]

### Remove rows where the date and year are mismatched (Gemini had lots of these)

In [252]:
def extract_year_from_date(date_str):
    """Extract the year from a date string: the part before the '-MM-DD' suffix."""
    if pd.isna(date_str):
        return None
    match = re.search(r'^(.+?)-\d\d-\d\d', str(date_str))
    if match:
        return match.group(1)
    return None

In [253]:
df = df[df['date'].apply(extract_year_from_date) == df['year'].astype(str)]

### Drop batches with fewer than a given number of rows

In [254]:
df = df.groupby('batch').filter(lambda x: len(x) >= batch_size)

### Randomly truncate all remaining batches to the same number of rows

In [255]:
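# Smallest number of surviving batches across the three sources
min_batches = df.groupby('source')['batch'].nunique().min()
# Keep only the first min_batches batches of each source
df = df.groupby('source', group_keys=False).apply(lambda x: x[x['batch'].isin(x['batch'].unique()[:min_batches])])
# Randomly sample batch_size rows from each batch (no seed is set, so the sample changes between runs)
df = df.groupby('batch', group_keys=False).apply(lambda x: x.sample(n=batch_size))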

In [267]:
# Number of batches for each source
int(len(df.groupby(['source', 'batch']).size().reset_index(name='count')) / 3)

31

In [268]:
df

| Unnamed: 0 | source | batch | event | year | date |
|---|---|---|---|---|---|
| 63652 | gemini | gemini_1 | Rise of social media | 2004 | 2004-01-01 |
| 63519 | gemini | gemini_1 | Wright Brothers' first successful flight | 1903 | 1903-12-17 |
| 63478 | gemini | gemini_1 | Black Death pandemic begins | 1347 | 1347-01-01 |
| 63499 | gemini | gemini_1 | Enlightenment reaches its peak | 1750 | 1750-01-01 |
| 63603 | gemini | gemini_1 | Battle of Waterloo | 1815 | 1815-06-18 |
| ... | ... | ... | ... | ... | ... |
| 1992 | xai | xai_95 | Detroit bankruptcy | 2013 | 2013-07-18 |
| 2123 | xai | xai_95 | Midterms 2022 | 2022 | 2022-11-08 |
| 2067 | xai | xai_95 | Manchester Arena bombing | 2017 | 2017-05-22 |
| 2128 | xai | xai_95 | FTX collapse | 2022 | 2022-11-11 |


### Compute the Jaccard Indexes

In [256]:

def jaccard_index(set1, set2):
    """Compute Jaccard index between two sets"""
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0

# Create a dictionary of date sets for each batch
batch_dates = df.groupby('batch')['date'].apply(set).to_dict()
batch_source = df.groupby('batch')['source'].first().to_dict()

results = []

for batch in batch_dates.keys():
    source = batch_source[batch]
    dates = batch_dates[batch]
    
    openai_scores = []
    xai_scores = []
    gemini_scores = []
    
    for other_batch in batch_dates.keys():
        if batch == other_batch:
            continue
        
        other_source = batch_source[other_batch]
        other_dates = batch_dates[other_batch]
        
        jaccard = jaccard_index(dates, other_dates)
        
        if other_source == 'openai':
            openai_scores.append(jaccard)
        elif other_source == 'xai':
            xai_scores.append(jaccard)
        elif other_source == 'gemini':
            gemini_scores.append(jaccard)
    
    results.append({
        'batch': batch,
        'source': source,
        'avg_jaccard_openai': sum(openai_scores) / len(openai_scores) if openai_scores else 0,
        'avg_jaccard_xai': sum(xai_scores) / len(xai_scores) if xai_scores else 0,
        'avg_jaccard_gemini': sum(gemini_scores) / len(gemini_scores) if gemini_scores else 0
    })

jaccard_df = pd.DataFrame(results)

# Group by source and compute averages
grouped_results = jaccard_df.groupby('source')[['avg_jaccard_openai', 'avg_jaccard_xai', 'avg_jaccard_gemini']].mean()
print(grouped_results)

        avg_jaccard_openai  avg_jaccard_xai  avg_jaccard_gemini
source                                                         
gemini            0.138185         0.103441            0.242873
openai            0.179187         0.066326            0.138185
xai               0.066326         0.184209            0.103441


## Conclusion

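As expected, the average Jaccard index is higher between batches from the same LLM (0.18 to 0.24) than between batches from different LLMs (0.07 to 0.14). But in some cases, the difference is slight: OpenAI batches average 0.179 against each other and 0.138 against Gemini batches.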