### Use a Llama3:8B instruct LLM to score partial-match person data

D. Sisk, 2026-02-12

Generate Python code that uses the dolphin3:8b model, served by Ollama on ollamaserver_nuc1, to score how closely these two pieces of text match: string1="David C. Sisk" and string2="Dave Gruuve Sisk"

In [1]:
import requests

# Define the strings to compare
string1 = "David C. Sisk"
string2 = "Dave Gruuve Sisk"
#string2 = "david christopher sisk"

# Set up the Ollama API endpoint
ollama_url = "http://ollamaserver_nuc1:11434/api/generate"

# Create a prompt for the LLM to score the matching level
prompt = f"""Rate the similarity between these two names on a scale of 0-100, where 0 is completely different and 100 is identical.

Name 1: {string1}
Name 2: {string2}

Provide only a numeric score and a brief explanation."""

# Send request to Ollama
payload = {
    "model": "dolphin3:8b",
    "prompt": prompt,
    "stream": False
}

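# The timeout (120 s is an arbitrary choice) keeps a stalled server from hanging the notebook
response = requests.post(ollama_url, json=payload, timeout=120)
response.raise_for_status()  # Fail loudly on HTTP errors instead of parsing an error body
result = response.json()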

print(f"Comparing: '{string1}' vs '{string2}'")
print(f"\nLLM Response:\n{result['response']}")

Comparing: 'David C. Sisk' vs 'Dave Gruuve Sisk'

LLM Response:
Similarity Score: 40

Explanation: The names share the last name "Sisk" and have some similarities in the first name, but they differ significantly due to variations in spelling, addition of a middle initial in one name, and inclusion of an additional first name in the other.
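
Because the reply above mixes a score with free text, any later step that needs a number has to parse it out. The following is a minimal sketch of one way to do that with a regular expression. `extract_score` is a hypothetical helper that is not part of the original notebook, and it assumes the first integer in the 0-100 range that appears in the reply is the score (a reply that echoes "Name 1" before the score would defeat it).

```python
import re
from typing import Optional


def extract_score(reply: str) -> Optional[int]:
    """Return the first integer in 0..100 found in the reply, or None."""
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if 0 <= value <= 100:
            return value
    return None


# Sample reply shaped like the LLM output shown above
sample_reply = 'Similarity Score: 40\n\nExplanation: The names share the last name "Sisk".'
print(extract_score(sample_reply))
print(extract_score("no number here"))
```

Returning `None` rather than 0 for an unparseable reply keeps "could not score" distinct from "no similarity".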


In [2]:
import pandas as pd

df_inputdata = pd.read_csv('sample-data-messy_200.csv')

df_inputdata.shape

(200, 6)

In [3]:
df_inputdata.sample(3)

Unnamed: 0,row_id,name,email,address,phone,true_id
26,row-0041,Hailey Webb,,"85 Northview St, Springfield, IL 62785",(555) 010-0085,id-0085
137,row-0217,Boroke Kim,brooke.kim3@example.net,,555-010-0081,id-0081
9,row-0015,Parker Fisher,,"69 Cedarview Ave., Springfield, IL 62769",,id-0069


In [7]:

# Create lists of row_id and true_id
row_ids = df_inputdata['row_id'].tolist()
true_ids = df_inputdata['true_id'].tolist()

# Helper function to get field value, replacing NaN with empty string
def get_field(row, col):
    val = df_inputdata.iloc[row][col]
    return '' if pd.isna(val) else str(val)

# Create all ordered pairs of distinct rows (each pair appears twice: as i,j and as j,i)
pairwise_data = []
for i in range(len(df_inputdata)):
    for j in range(len(df_inputdata)):
        if i != j:  # Don't pair a row with itself
            # Concatenate name, email, address, and phone for both rows
            data1 = ' | '.join([
                get_field(i, 'name'),
                get_field(i, 'email'),
                get_field(i, 'address'),
                get_field(i, 'phone')
            ])
            data2 = ' | '.join([
                get_field(j, 'name'),
                get_field(j, 'email'),
                get_field(j, 'address'),
                get_field(j, 'phone')
            ])
            pairwise_data.append({
                'row_id1': row_ids[i],
                'true_id1': true_ids[i],
                'data1': data1,
                'row_id2': row_ids[j],
                'true_id2': true_ids[j],
                'data2': data2
            })

df_pairwise = pd.DataFrame(pairwise_data)

df_pairwise.shape

(39800, 6)
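
The loop above builds ordered pairs, so every pair of rows appears twice, which is why 200 rows give 200 × 199 = 39,800 rows. If the order of the two records is assumed not to matter to the score (an assumption, since an LLM is not guaranteed to score A vs B the same as B vs A), `itertools.combinations` yields each unordered pair once and halves the number of LLM calls. This is a small sketch on a hypothetical three-row frame. The column names mirror the ones above, but the data and the `join_fields` helper are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data standing in for df_inputdata
df_toy = pd.DataFrame({
    "row_id": ["row-a", "row-b", "row-c"],
    "name": ["Ann Lee", "Anne Lee", None],
    "email": [None, "ann.lee@example.net", "bob@example.net"],
})


def join_fields(row, cols):
    """Concatenate the chosen fields, treating missing values as empty strings."""
    return " | ".join("" if pd.isna(row[c]) else str(row[c]) for c in cols)


cols = ["name", "email"]
records = df_toy.to_dict("records")

# combinations() yields each unordered pair exactly once and never pairs a row with itself
pairs = [
    {
        "row_id1": r1["row_id"], "data1": join_fields(r1, cols),
        "row_id2": r2["row_id"], "data2": join_fields(r2, cols),
    }
    for r1, r2 in combinations(records, 2)
]
df_pairs_toy = pd.DataFrame(pairs)
print(df_pairs_toy.shape)  # n * (n - 1) / 2 rows for n input rows
```

For the 200-row input this would give 19,900 pairs instead of 39,800.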

In [8]:
df_pairwise.head()

Unnamed: 0,row_id1,true_id1,data1,row_id2,true_id2,data2
0,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0003,id-0013,"James Harris | | 13 Sycamore St, Springfield,..."
1,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0004,id-0030,"Lopez, Levi | llopez@mail.example.org | 3030 3..."
2,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0005,id-0011,"Jackson, Henry | hjackson@mail.example.org | 1..."
3,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0008,id-0041,Rayn Roberts | ryan.roberts5@example.net | | ...
4,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0009,id-0007,"Isabella Moore | | 7 Walnut Ave., Springfield..."


In [9]:
# Function to get matching score from Ollama
def get_matching_score(text1, text2):
    prompt = f"""Rate the similarity between these two person records on a scale of 0-100, where 0 is completely different and 100 is identical.

Record 1: {text1}
Record 2: {text2}

Provide only the numeric score, nothing else."""
    
    payload = {
        "model": "dolphin3:8b",
        "prompt": prompt,
        "stream": False
    }
    
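    try:
        # The timeout (120 s is an arbitrary choice) prevents one stalled call from blocking the run
        response = requests.post(ollama_url, json=payload, timeout=120)
        response.raise_for_status()
        result = response.json()
        # Extract just the number from the first line of the response
        score_str = result['response'].strip().split('\n')[0]
        return int(score_str)
    except (requests.RequestException, KeyError, ValueError):
        # Request failures and non-numeric replies both fall back to 0,
        # so a score of 0 can mean "could not score" rather than "no similarity"
        return 0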

# Apply the function to each row and create new 'score' column
df_pairwise['score'] = df_pairwise.apply(
    lambda row: get_matching_score(row['data1'], row['data2']), 
    axis=1
)
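
LLM scores can vary from run to run, which makes a long scoring pass hard to reproduce. Ollama's generate endpoint accepts an `options` object for sampling parameters such as `temperature` and `seed`. The sketch below shows a payload builder that pins both. The helper name `build_payload` and the chosen values are assumptions, and fixed options do not guarantee identical scores across model versions or hardware.

```python
import json


def build_payload(text1: str, text2: str, model: str = "dolphin3:8b") -> dict:
    """Build a non-streaming /api/generate request body with fixed sampling options."""
    prompt = (
        "Rate the similarity between these two person records on a scale of 0-100, "
        "where 0 is completely different and 100 is identical.\n\n"
        f"Record 1: {text1}\n"
        f"Record 2: {text2}\n\n"
        "Provide only the numeric score, nothing else."
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Assumed values: temperature 0 for near-greedy decoding, fixed seed for repeatability
        "options": {"temperature": 0, "seed": 42},
    }


payload = build_payload(
    "Chavez, Leah | lchavez@mail.example.org",
    "Laeh Chavez | leah.chavez2@example.net",
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, in the same way as the payload in the cell above.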

In [10]:
df_pairwise.head()

Unnamed: 0,row_id1,true_id1,data1,row_id2,true_id2,data2,score
0,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0003,id-0013,"James Harris | | 13 Sycamore St, Springfield,...",28
1,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0004,id-0030,"Lopez, Levi | llopez@mail.example.org | 3030 3...",26
2,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0005,id-0011,"Jackson, Henry | hjackson@mail.example.org | 1...",31
3,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0008,id-0041,Rayn Roberts | ryan.roberts5@example.net | | ...,11
4,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0009,id-0007,"Isabella Moore | | 7 Walnut Ave., Springfield...",18


In [None]:
# Save this csv data since creating it is a long-running process
df_pairwise.to_csv('output-data_pairwise-scores_llm.csv', index=False)

In [4]:
# Reload the pairwise score data if necessary
df_pairwise = pd.read_csv('output-data_pairwise-scores_llm.csv')

In [5]:
# Display rows where true_id1 equals true_id2
df_pairwise[df_pairwise['true_id1'] == df_pairwise['true_id2']]


Unnamed: 0,row_id1,true_id1,data1,row_id2,true_id2,data2,score
142,row-0002,id-0080,"Chavez, Leah | lchavez@mail.example.org | 8080...",row-0224,id-0080,Laeh Chavez | leah.chavez2@example.net | 8081 ...,62
313,row-0003,id-0013,"James Harris | | 13 Sycamore St, Springfield,...",row-0190,id-0013,Jmaes Harris | james.harris5@example.net | 131...,78
486,row-0004,id-0030,"Lopez, Levi | llopez@mail.example.org | 3030 3...",row-0154,id-0030,Lvei Lopez | levi.lopez1@example.net | 3031 30...,63
647,row-0005,id-0011,"Jackson, Henry | hjackson@mail.example.org | 1...",row-0087,id-0011,"Henry Jackson | | 11 Willow Dr, Springfield, ...",68
868,row-0008,id-0041,Rayn Roberts | ryan.roberts5@example.net | | ...,row-0121,id-0041,"Roberts, Ryan | rroberts@mail.example.org | 41...",78
...,...,...,...,...,...,...,...
38954,row-0294,id-0018,"Alexander Robinson | | 18 Cypress Ave., Sprin...",row-0232,id-0018,"Robinson, Alexander | arobinson@mail.example.o...",71
39159,row-0296,id-0039,Claeb Mitchell | caleb.mitchell3@example.net |...,row-0239,id-0039,"Mitchell, Caleb | cmitchell@mail.example.org |...",45
39261,row-0298,id-0047,"Edwards, Thomas | tedwards@mail.example.org | ...",row-0096,id-0047,Tohmas Edwards | thomas.edwards4@example.net |...,78
39531,row-0299,id-0005,Aav Miller | ava.miller4@example.net | 566 5 B...,row-0207,id-0005,"Miller, Ava | amiller@mail.example.org | 567 5...",53


In [8]:
# Examine the score spread where they were true matches

# Filter rows where true_id1 equals true_id2
matches = df_pairwise[df_pairwise['true_id1'] == df_pairwise['true_id2']]

# Calculate statistics
count = len(matches)
min_score = matches['score'].min()
avg_score = matches['score'].mean()
max_score = matches['score'].max()

print(f"Count: {count}")
print(f"Min Score: {min_score}")
print(f"Avg Score: {avg_score:.2f}")
print(f"Max Score: {max_score}")

Count: 200
Min Score: 29
Avg Score: 64.83
Max Score: 89


In [9]:
# Examine the score spread where they are NOT true matches

# Filter rows where true_id1 does NOT equal true_id2
non_matches = df_pairwise[df_pairwise['true_id1'] != df_pairwise['true_id2']]

# Calculate statistics
count = len(non_matches)
min_score = non_matches['score'].min()
avg_score = non_matches['score'].mean()
max_score = non_matches['score'].max()

print(f"Count: {count}")
print(f"Min Score: {min_score}")
print(f"Avg Score: {avg_score:.2f}")
print(f"Max Score: {max_score}")

Count: 39600
Min Score: 0
Avg Score: 30.69
Max Score: 87
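
The two summary cells above repeat the same four statistics on two filters. As a sketch of an alternative, a single `groupby` can produce both rows at once. The toy frame below is hypothetical and only mirrors the column names of `df_pairwise`.

```python
import pandas as pd

# Hypothetical scored pairs; the real notebook would use df_pairwise instead
df_toy = pd.DataFrame({
    "true_id1": ["id-1", "id-1", "id-2", "id-2"],
    "true_id2": ["id-1", "id-2", "id-2", "id-3"],
    "score": [80, 30, 60, 10],
})

# Label each pair as a true match or not, then aggregate the score column once per label
label = (df_toy["true_id1"] == df_toy["true_id2"]).map({True: "match", False: "non-match"})
summary = (
    df_toy.assign(label=label)
    .groupby("label")["score"]
    .agg(["count", "min", "mean", "max"])
)
print(summary)
```

Seeing both groups side by side makes it easier to judge how much the score ranges overlap.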


The midpoint between the average score for true matches (64.83) and the average score for non-matches (30.69) is 47.76.

Let's use that as a starting cutoff to flag rows as match or non-match, then measure how well those flags agree with the true_id values.
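
The midpoint arithmetic can be reproduced directly from the two averages printed above. This is only a sketch of that one calculation. The variable names are illustrative, and a simple midpoint of the means ignores that non-matches outnumber matches by a wide margin.

```python
avg_match = 64.83      # average score for true matches, from the output above
avg_non_match = 30.69  # average score for non-matches, from the output above

# Midpoint of the two averages, rounded to two decimals like the printed statistics
cutoff = round((avg_match + avg_non_match) / 2, 2)
print(cutoff)
```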


In [26]:
cutoff = 47.76
#cutoff = 50.00
#cutoff = 45.00
df_pairwise["match"] = (df_pairwise["score"] >= cutoff).astype(int)
# show counts of match values (0 = non-match, 1 = match)
df_pairwise["match"].value_counts()

match
0    33183
1     6617
Name: count, dtype: int64

In [27]:
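# Rows where true_id1 == true_id2 (actual matches)
same_id_rows = df_pairwise[df_pairwise["true_id1"] == df_pairwise["true_id2"]]
# Recall (true positive rate): share of actual matches that were flagged as matches
recall = same_id_rows["match"].sum() / len(same_id_rows) if len(same_id_rows) > 0 else 0.0

# Rows where true_id1 != true_id2 (actual non-matches)
different_id_rows = df_pairwise[df_pairwise["true_id1"] != df_pairwise["true_id2"]]
# False positive rate: share of actual non-matches that were wrongly flagged as matches
false_positive_rate = different_id_rows["match"].sum() / len(different_id_rows) if len(different_id_rows) > 0 else 0.0

print(f"Recall: {recall:.4f}")
print(f"False positive rate: {false_positive_rate:.4f}")

Recall: 0.8800
False positive rate: 0.1627


So, for this experiment, recall (the share of true matches that were flagged as matches) is good at 88%: 176 of the 200 true-match rows score at or above the cutoff. The false positive rate (the share of non-matches wrongly flagged as matches) is about 16%, which sounds modest but applies to 39,600 non-match rows. Of the 6,617 rows flagged as matches, only 176 are true matches and the other 6,441 are false positives, so precision (the share of flagged rows that are true matches) is only about 176 / 6617 ≈ 2.7%. We can adjust the cutoff to favor precision or recall, but these typically pull against each other. Raising the cutoff removes false positives and so raises precision, but it also misses more true matches and so lowers recall, and vice versa.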
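
The trade-off between precision and recall as the cutoff moves can be illustrated with the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The following is a minimal sketch. `confusion_metrics` is a hypothetical helper, and the scores and labels are made up rather than taken from the notebook's data.

```python
def confusion_metrics(scores, is_match, cutoff):
    """Return (precision, recall) when score >= cutoff is flagged as a match."""
    tp = sum(1 for s, m in zip(scores, is_match) if s >= cutoff and m)
    fp = sum(1 for s, m in zip(scores, is_match) if s >= cutoff and not m)
    fn = sum(1 for s, m in zip(scores, is_match) if s < cutoff and m)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical scores and ground-truth labels, not taken from the notebook's data
scores = [85, 70, 55, 40, 60, 50, 30, 20, 10, 5]
is_match = [True, True, True, True, False, False, False, False, False, False]

# Sweep a few cutoffs to see precision and recall move in opposite directions
for cutoff in (30, 50, 70):
    precision, recall = confusion_metrics(scores, is_match, cutoff)
    print(f"cutoff={cutoff}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the cutoff from 30 to 70 lifts precision while recall falls. The same sweep could be run on the real `score`, `true_id1` and `true_id2` columns to choose a cutoff.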