# Title: DSC680 Project 1: Entity Matching POC 
## Author: David Hatchett  
## Created: 2025-01-10  

## Description:  
The code below is a project to test if we can take a list of messy data with human errors and some clean data, and use clustering to identify the correct records.
This code is not fully realized and is put together so I can rerun the process as needed. There are many hard-coded values in the functions.
However, the purpose is to create a POC and show that it could work with unlabeled data.

The scrambled data is a list of 1,520 records generated from a list of 20 company names and addresses. You can see the code for the scrambling process
in this repo. Only about 60% of the data was scrambled, and we then dropped another 20% to create something more like real-life data.


The code works in the following manner:
1. Ingest the file of scrambled data
2. Do cleaning on the test fields
3. Create a clean name and full address (street, city, zip, state in one field.
4. Sort the data set by clean name (this was added for chunking records for clustering, which wasn't fully implemented)
5. Split the validation data from the test data
6. Cluster the data using HDBSCAN
7. Evaluate the clustering
8. Compare the clustering to the validation set

Overall, this looks like a viable process, but it definitely needs some work to improve the underlying code. I consulted ChatGPT on the overall idea of this process; however, aside from some cleaning functions, the code was written by me.

## Set up Packages

In [1]:
import pandas as pd
import numpy as np
import hdbscan
import re
import unicodedata
import jellyfish
import matplotlib 
import Levenshtein as lev

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

from scipy.sparse import hstack

# Set up varables needed for the process

In [2]:
STREET_MAP = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "ROAD": "RD",
    "DRIVE": "DR",
    "LANE": "LN",
    "COURT": "CT",
    "PLACE": "PL"
}

DIRECTION_MAP = {
    "NORTH": "N",
    "SOUTH": "S",
    "EAST": "E",
    "WEST": "W",
    "NORTHWEST":"NW",
    "NORTHEAST":"NE",
    "SOUTHEAST":"SE",
    "SOUTHWEST":"SW"
}

UNIT_MAP = {
    "APARTMENT": "APT",
    "SUITE": "STE",
    "UNIT": "UNIT"
}

LEGAL_SUFFIXES = {
    " INCORPORATED": "",
    " INC": "",
    " LLC": "",
    " L L C": "",
    " LTD": "",
    " LIMITED": "",
    " CORP": "",
    " CORPORATION": "",
    " CO": "",
    " COMPANY": ""
}

STOPWORDS = {
    "THE": "",
    "AND": "",
    "OF": "",
    "AT": "",
    "FOR": ""}


US_STATES = {
    'AA',
    'AE',
    'AK',
    'AL',
    'AP',
    'AR',
    'AS',
    'AZ',
    'CA',
    'CO',
    'CT',
    'DC',
    'DE',
    'FL',
    'FM',
    'GA',
    'GU',
    'HI',
    'IA',
    'ID',
    'IL',
    'IN',
    'KS',
    'KY',
    'LA',
    'MA',
    'MD',
    'ME',
    'MH',
    'MI',
    'MN',
    'MO',
    'MP',
    'MS',
    'MT',
    'NC',
    'ND',
    'NE',
    'NH',
    'NJ',
    'NM',
    'NV',
    'NY',
    'OH',
    'OK',
    'OR',
    'PA',
    'PR',
    'PW',
    'RI',
    'SC',
    'SD',
    'TN',
    'TX',
    'UT',
    'VA',
    'VI',
    'VT',
    'WA',
    'WI',
    'WV',
    'WY',
}

## Set up Needed Functions for the process

In [3]:
def normalize_text(text:str) -> str:
    '''
    This function will clean up the strings 
    passed into it and return a string. Cleaned up
    ChatGPT Function.

    It uppercases the text
    removes any characters not alphabetical
    removes any extra text.
    '''
    text = str(text).upper()
    text = re.sub(r"[^A-Z0-9\s]", " ", text)
    text = remove_extra_spaces(text)
    return text

def normalize_name(name:str) -> str:
    '''
    used to normalize a name. 
    Parts of this were pulled from ChatGPT.

    uses unicodedata and Normalization Form Canonical Compatibility Decomposition(NFKD)
    and then remove anything that isn't ASCII.
    uppercase the data
 
    '''
    if not name:
        return ""

    name = str(name)

    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii")
    name = name.upper()
    name = re.sub(r"[^A-Z0-9\s]", " ", name)
    name = remove_extra_spaces(name)
    return name

def replace_values(text, mapping) -> str:
    '''
    General purpose function to replace
    whatever is passed in the mapping 
    variable in the string using regex.
    '''
    for k, v in mapping.items():
        text = re.sub(rf'\b{k}\b', v, text)
    return text

def remove_extra_spaces(text:str) -> str:
    '''
    remove any instated of double spaces 
    from a string. came from Chatgtp.
    '''
    return re.sub(r'\s+', ' ', text).strip()    

def normalize_phone(p:str) -> str:
    '''
    subtutes any none numeric char
    with '' 
    '''
    p = str(p)
    if pd.isna(p):
        return ""
    digits = re.sub(r"\D", "", p)
    return digits

def get_area_code(phn_nbr:str)->str:
    '''
    strips the area code from the phone
    '''
    return phn_nbr[0:3] if len(phn_nbr) ==10 else ""

def truncate_phone_number(phn_nbr:str)->str:
    '''
    truncate the phone number after we split the 
    area code
    '''
    return phn_nbr[-7:] if len(phn_nbr) >=7 else ""


def check_if_value_exists(text:str,chck_list:list) -> bool:
    '''
    used to see if a value exist in a provided list
    '''
    for i in chck_list:
        if text == i:
            return True

    return False

## Set up Clustering Functions

In [4]:
def cluster_data(df:pd.DataFrame):
    '''
    This is an exsample of how to do the clustering by chucking the data. 
    Not fully implement but does work. Next steps would be to make this more generic.
    outline for this came from ChatGtp however it was quite off and need a lot of 
    work to fix.
    '''
    
    df['cluster_persistence'] = np.nan
    df['cluster_id'] = -1
    cluster_offset = 0

    ## blocking the data by state - change this to block differntly.
    ## starts the loop to process each block
    for block, data in df.groupby('state'):
        if len(data) < 2:
            continue  
    
        idx = data.index
    
        #create vector embeddings
        name_vect = TfidfVectorizer(lowercase=False, analyzer='char_wb', ngram_range=(3,5)).fit_transform(data['name_clean'])
        addrss_vect = TfidfVectorizer(lowercase=False, analyzer='char_wb', ngram_range=(3,5)).fit_transform(data['address_full'])

        ## use hstack to bring the data togher.
        ## weight the name higher then address
        ## this was suggested by Chatgtp
        features = hstack([
            name_vect * .8,
            addrss_vect * .2,
        ])
    

        ## clsuter the data I should set this up to 
        ## a pramter in the futher.
        clusterer = hdbscan.HDBSCAN(
            min_cluster_size = 5,
            min_samples = 1,
            metric="cosine",
            # cluster_selection_epsilon = .5
        )

        ## get the lables out of the data
        labels = clusterer.fit_predict(features)
        labels.astype("int64")

        ## create list used to update the dataframe
        ## CLUSTER ID AND cluster_persistence_
        test_list = []
        for i in labels:
            if i == -1:
                test_list.append(np.nan)
            else:
                test_list.append(clusterer.cluster_persistence_[i])
                
        
        labels = np.where(
            labels != -1,
            labels + cluster_offset,
            -1
        )

        
        ## udpate the dataframe
        df.loc[idx, "cluster_id"] = labels
        df.loc[idx, "cluster_persistence"] = test_list

        ## update the offset so no clustes overide. in the next block.
        if max(labels) > 0:
            cluster_offset = max(labels) + 1


def small_cluster(df:pd.DataFrame, mcs=2, ms=1, mtr='cosine'):
    '''
    this is the cluster used in this POC. set up so we can test differnt 
    parmeters for hdbscan. this is set up to handle about 20k records
    however this depends on your memory.
    '''

    ## add columns to the dataframe
    df['cluster_persistence'] = np.nan
    df['cluster_id'] = -1

    ## pull out the indexs to help with updates latter
    idx = df.index

    ## create the embedings
    name_vect = TfidfVectorizer(lowercase=False, analyzer='char_wb', ngram_range=(3,5)).fit_transform(df['name_clean'])
    addrss_vect = TfidfVectorizer(lowercase=False, analyzer='char_wb', ngram_range=(3,5)).fit_transform(df['address_full']) 

    ## weight name higher then address and add 
    ## them to one matrix
    features = hstack([
            name_vect * .8,
            addrss_vect * .2,
        ])

    ## clsuter the records
    clusterer = hdbscan.HDBSCAN(
            min_cluster_size = mcs,
            min_samples = ms,
            metric="cosine",
        )

    ## pull out the labels
    labels = clusterer.fit_predict(features)

    ## test list is used to update the dataframe
    test_list = []
    for i in labels:
        if i == -1:
             test_list.append(np.nan)
        else:
            test_list.append(clusterer.cluster_persistence_[i])

    
    #update the data frame
    df.loc[idx, "cluster_id"] = labels
    df.loc[idx, "cluster_persistence"] = test_list

## Set Up TF-IDF functions

In [5]:
def close_clusters(thrs_hold):
    '''
    Used to id a list of clusters 
    that may need to be compress into one
    This is more important during the blocked 
    clustering. the idea is that some clsuters 
    that should have been put together did not end that
    way. This takes in the summary records form the process
    and compares the top names.
    '''

    input_data = (
        thrs_hold['top_name'] +' '+
        thrs_hold['top_address']
    )



    match_list=list()
    
    for i in range(len(input_data)):
        clstr_tfidf = TfidfVectorizer()
        clstr_matrix = clstr_tfidf.fit_transform(input_data.drop(i))
        test_vect = clstr_tfidf.transform([str(input_data[i])])
    
        cosine_sim = linear_kernel(test_vect,clstr_matrix).flatten()
        top_result_indx=cosine_sim.argsort()[:-10:-1][0]
    
    
        if cosine_sim[top_result_indx] >= .9:
            match_text = input_data[top_result_indx]
            cosine_sim_value=cosine_sim[top_result_indx]
            match_list.append((input_data[i],match_text,cosine_sim_value,top_result_indx))

    return match_list

def apply_clstrs(acptd_clrs:pd.DataFrame, fll_df:pd.DataFrame) -> pd.DataFrame:
    '''
    This takes a list of accepted clusters and uses it to 
    Apply labels to the dataset using TF-IDF.
    This is used to test the process.
    '''


    ### make matching matrix
    input_data = (
        acptd_clrs['name_clean'] +' '+
        acptd_clrs['address_full']
    )

    clstr_tfidf = TfidfVectorizer()
    clstr_matrix = clstr_tfidf.fit_transform(input_data)


    ### create test input - reuse the varable as input_data isn't needed anymore the index pull
    ### will match back to acptd_clrs
    input_data = (
        fll_df['name_clean'] +' '+
        fll_df['address_full']
    )

    
    ### match to the data from the good clusters and apply it to the data set
    for i in range(len(input_data)):
        test_vect = clstr_tfidf.transform([str(input_data[i])])

        cosine_sim = linear_kernel(test_vect,clstr_matrix).flatten()

        top_result_indx=cosine_sim.argsort()[:-10:-1][0]

        if top_result_indx >= .9:
            fll_df.loc[i, "match_name"] = acptd_clrs['cluster_name'].at[top_result_indx]


    return fll_df

## Set up Evaluations Function

In [6]:
def run_meterics(sum_df:pd.DataFrame,val_df:pd.DataFrame, output_name:str):
    '''
    this was created to evualte the functions. this needs clean up in the future.
    spits out all the stuff needed to evualte how well the clsutering did.
    creates files in teh outputs folder.
    
    '''


    copy_of_full_fr_tst = sum_df[['name_clean','address_full']].copy()
    
    ## Create summary df - got parts of this from ChatGtp
    summary = (
    sum_df[sum_df["cluster_id"] != -1]
    .groupby("cluster_id")
    .apply(lambda g: pd.Series({
        "cluster_size": len(g),
        "top_name": g["name_clean"].value_counts().idxmax(),
        "top_address": g["address_full"].value_counts().idxmax(),
        "top_name_pct": (
            g["name_clean"]
            .value_counts(normalize=True)
            .iloc[0]
        ),
        "top_address_pct": (
            g["address_full"]
            .value_counts(normalize=True)
            .iloc[0]
        ),
        "cluster_persistence":g['cluster_persistence'].max()
    }))
    .reset_index()
    )

    ## create meteric
    total_recs = len(sum_df)
    nmbr_of_clstrs = len(sum_df.groupby('cluster_id').size().rename('cluster_size'))
    nmbr_of_clstrs_prs = len(summary[summary['cluster_persistence'] >= .9])
    nmbr_tp_nm = len(summary[summary['top_name_pct'] >= .7])
    nmbr_sz_20 = len(summary[summary['cluster_size'] >= 20])

    print(f'Total number of input records: {total_recs}')
    print(f'total number of clusters: {nmbr_of_clstrs}')
    print(f'Total number of clusters with cluster_persistence at >= .9: {nmbr_of_clstrs_prs}')
    print(f'Total number of Clusters with top_name_pct >- .7: {nmbr_tp_nm}')
    print(f'Total number of Clsuers with size over 20: {nmbr_sz_20}')    

    #output summary values
    summary.sort_values(by=['cluster_size','cluster_persistence']).to_excel(f'outputs/{output_name}_summary.xlsx')

    ## Thresholds index rest here is important
    thrs_hold = summary[(summary['cluster_size'] >= 20) & (summary['cluster_persistence'] >= 1) &  (summary['top_name_pct'] >= .7)].reset_index()
    cls_clstrs = close_clusters(thrs_hold)

    ##create renaming dic to apply clusters to inputed data
    top_nms_dict = dict(zip(thrs_hold['cluster_id'].to_list(),thrs_hold['top_name'].to_list()))
    grps = thrs_hold['cluster_id'].to_list()
    

    ## Split cluster data for accepted clusters
    cltrd_df = sum_df[sum_df['cluster_id'].isin(grps)].copy()
    cltrd_df['cluster_name'] = cltrd_df['cluster_id'].map(top_nms_dict)

    rmning_rcs = len(sum_df[~sum_df['cluster_id'].isin(grps)].copy())
    clstrd_rcs = len(cltrd_df)

    print(f'Remaining Records after accepted clusters: {rmning_rcs}')
    print(f'Accepted Cluster Records:{clstrd_rcs}')

    

    ##see how good the clusters data is 
    clstrd_rcs = pd.merge(cltrd_df,val_df,  left_index=True, right_index=True)
    ## making label name look like cleaned names.
    clstrd_rcs['label_name'] = clstrd_rcs['label_name'].apply(normalize_name)
    
    clstrd_rcs['match_ind'] = clstrd_rcs.apply(lambda row: lev.ratio(row['label_name'],row['cluster_name']), axis=1)
    full_mtch_val = len(clstrd_rcs[clstrd_rcs['match_ind'] == 1])

    print(f'Number of 100% matchs in clstrs: {full_mtch_val}')

    clstrd_rcs.to_excel(f'outputs/{output_name}_accepted_clusters.xlsx')

    

    ## use accepted clusters on test data and join to vaildate data to see how well it did
    ## this dosn't have the looping in it but we are just proving the concept right now
    fd = apply_clstrs(clstrd_rcs.reset_index(), copy_of_full_fr_tst)
    fd = pd.merge(fd,val_df,  left_index=True, right_index=True)
    fd['label_name'] = fd['label_name'].apply(normalize_name)
    fd['match_pct'] = fd.apply(lambda row: lev.ratio(row['label_name'],row['match_name']), axis=1)
    fd_full_mtch_val = len(fd[fd['match_pct'] == 1])

    print(f'Number of 100% matchs in full data: {fd_full_mtch_val}')
    fd.to_excel(f'outputs/{output_name}_full_data_match.xlsx')

    print('\n \n \n')
    print(cls_clstrs)
    

# Process Data

In [7]:
df = pd.read_csv('C:\\Users\\dkh18\\python_projects\\DSC680\\Project 1\\Company_Name_Clusterer\\data\\scarmble_data_small.csv')

df = df.rename(columns = {'test_name':'name',
                     'test_address':'address',
                     'test_city':'city',
                     'test_postalzip':'zip',
                     'test_state':'state',
                     'test_phone':'phone'})

df.head(5)

Unnamed: 0.1,Unnamed: 0,label_name,label_address,label_city,label_postalzip,label_state,label_phone,name,address,city,zip,state,phone
0,12001,Vestibulum Accumsan Neque Consulting,6037 Nulla St.,Olympia,56179,KS,3040776287,Vestibulum Accumsan Neque Consulting,6037 Nulla St.,Olympia,56179,KS,3040776287
1,10297,Pede Limited,740-5881 Facilisis St.,Lewiston,12737,KS,3987433483,Psejmitc,07-58 8aFcilisist.,eaisfon,12773,SK,938
2,7751,Pede Limited,740-5881 Facilisis St.,Lewiston,12737,KS,3987433483,Pred Lmietd,74-0518 FcilsixSt.,Lswisofj,173,KS,3874348
3,2593,Euismod Mauris Company,"P.O. Box 564, 3029 Cum Avenue",New Haven,11088,NE,5088427862,Euuaomd aMr iComoayn,"P.OoBx 564, 3029 CuQ venue",NewU agdn,1108,E,5847862
4,5747,Euismod Mauris Company,"P.O. Box 564, 3029 Cum Avenue",New Haven,11088,NE,5088427862,Euwomd,PO. xB 56 0 Cmj Avneye,Hwe Hvnw,1188,NE,50


## Clean and normalize the name filed

In [8]:
df['name_clean'] = df['name'].apply(normalize_name)
# df['name_clean'] = df['name_clean'].apply(lambda x: replace_values(x,LEGAL_SUFFIXES))
df['name_clean'] = df['name_clean'].apply(remove_extra_spaces)
df['name_ind'] = df['name_clean'].str.len() >=1
df.head(5)

Unnamed: 0.1,Unnamed: 0,label_name,label_address,label_city,label_postalzip,label_state,label_phone,name,address,city,zip,state,phone,name_clean,name_ind
0,12001,Vestibulum Accumsan Neque Consulting,6037 Nulla St.,Olympia,56179,KS,3040776287,Vestibulum Accumsan Neque Consulting,6037 Nulla St.,Olympia,56179,KS,3040776287,VESTIBULUM ACCUMSAN NEQUE CONSULTING,True
1,10297,Pede Limited,740-5881 Facilisis St.,Lewiston,12737,KS,3987433483,Psejmitc,07-58 8aFcilisist.,eaisfon,12773,SK,938,PSEJMITC,True
2,7751,Pede Limited,740-5881 Facilisis St.,Lewiston,12737,KS,3987433483,Pred Lmietd,74-0518 FcilsixSt.,Lswisofj,173,KS,3874348,PRED LMIETD,True
3,2593,Euismod Mauris Company,"P.O. Box 564, 3029 Cum Avenue",New Haven,11088,NE,5088427862,Euuaomd aMr iComoayn,"P.OoBx 564, 3029 CuQ venue",NewU agdn,1108,E,5847862,EUUAOMD AMR ICOMOAYN,True
4,5747,Euismod Mauris Company,"P.O. Box 564, 3029 Cum Avenue",New Haven,11088,NE,5088427862,Euwomd,PO. xB 56 0 Cmj Avneye,Hwe Hvnw,1188,NE,50,EUWOMD,True


## Clean and normalize the address fields

In [9]:
df['address'] = df['address'].apply(normalize_text)
df['address'] = df['address'].apply(lambda x: replace_values(x,STREET_MAP))
df['address'] = df['address'].apply(lambda x: replace_values(x,DIRECTION_MAP))
df['address'] = df['address'].apply(lambda x: replace_values(x,UNIT_MAP))
df['address'] = df['address'].apply(remove_extra_spaces)


df['city'] = df['city'].apply(normalize_text)
df['city'] = df['city'].apply(remove_extra_spaces)

df['state'] = df['state'].apply(normalize_text)
df['state'] = df['state'].apply(remove_extra_spaces)
df['vaild_state_ind'] = df['state'].apply(lambda x : check_if_value_exists(x,US_STATES))


df['zip'] = df['zip'].apply(lambda x: str(x)[0:5])
df['zip_ind'] = df['zip'].astype(str).str.len() == 5


df['address_full'] = (
    df['address'] + ' ' +
    df['city'] + ' ' +
    df['state'] + ' ' +
    df['zip']
)

df['address_ind'] = df['address_full'].str.len() > 0

df.head(5)

Unnamed: 0.1,Unnamed: 0,label_name,label_address,label_city,label_postalzip,label_state,label_phone,name,address,city,zip,state,phone,name_clean,name_ind,vaild_state_ind,zip_ind,address_full,address_ind
0,12001,Vestibulum Accumsan Neque Consulting,6037 Nulla St.,Olympia,56179,KS,3040776287,Vestibulum Accumsan Neque Consulting,6037 NULLA ST,OLYMPIA,56179,KS,3040776287,VESTIBULUM ACCUMSAN NEQUE CONSULTING,True,True,True,6037 NULLA ST OLYMPIA KS 56179,True
1,10297,Pede Limited,740-5881 Facilisis St.,Lewiston,12737,KS,3987433483,Psejmitc,07 58 8AFCILISIST,EAISFON,12773,SK,938,PSEJMITC,True,False,True,07 58 8AFCILISIST EAISFON SK 12773,True
2,7751,Pede Limited,740-5881 Facilisis St.,Lewiston,12737,KS,3987433483,Pred Lmietd,74 0518 FCILSIXST,LSWISOFJ,173,KS,3874348,PRED LMIETD,True,True,False,74 0518 FCILSIXST LSWISOFJ KS 173,True
3,2593,Euismod Mauris Company,"P.O. Box 564, 3029 Cum Avenue",New Haven,11088,NE,5088427862,Euuaomd aMr iComoayn,P OOBX 564 3029 CUQ VENUE,NEWU AGDN,1108,E,5847862,EUUAOMD AMR ICOMOAYN,True,False,False,P OOBX 564 3029 CUQ VENUE NEWU AGDN E 1108,True
4,5747,Euismod Mauris Company,"P.O. Box 564, 3029 Cum Avenue",New Haven,11088,NE,5088427862,Euwomd,PO XB 56 0 CMJ AVNEYE,HWE HVNW,1188,NE,50,EUWOMD,True,True,False,PO XB 56 0 CMJ AVNEYE HWE HVNW NE 1188,True


## Review the state of the data

In [10]:
print(f'bad names: {sum(df['name_ind']==False)}')
print(f'bad states: {sum(df['vaild_state_ind']==False)}')
print(f'bad zips: {sum(df['zip_ind']==False)}')
print(f'bad addrss: {sum(df['address_ind']==False)}')
print(f'record count: {len(df)}')

bad names: 3
bad states: 5595
bad zips: 6691
bad addrss: 0
record count: 15200


# Prep Data for Clustering

split the test from teh vaildate

In [11]:
df = df.sort_values(by='name_clean')

vaildate_df = df['label_name'].copy()
test_df = df[['name_clean','address_full','phone']].copy()

# Run Cluster Tests 

Here we are just trying out different cluster parameters to see how well the process works. This sets the min cluster size and sample size.
we can also prove out how well the process is working here too.

In [12]:
small_cluster(test_df, 2, 1)
run_meterics(test_df, vaildate_df, 'test_2_1')

  .apply(lambda g: pd.Series({


Total number of input records: 15200
total number of clusters: 1548
Total number of clusters with cluster_persistence at >= .9: 1547
Total number of Clusters with top_name_pct >- .7: 56
Total number of Clsuers with size over 20: 19
Remaining Records after accepted clusters: 9094
Accepted Cluster Records:6106
Number of 100% matchs in clstrs: 6105
Number of 100% matchs in full data: 10704

 
 

[]


In [13]:
small_cluster(test_df, 5, 1)
run_meterics(test_df, vaildate_df, 'test_5_2')

  .apply(lambda g: pd.Series({


Total number of input records: 15200
total number of clusters: 253
Total number of clusters with cluster_persistence at >= .9: 252
Total number of Clusters with top_name_pct >- .7: 32
Total number of Clsuers with size over 20: 22
Remaining Records after accepted clusters: 9064
Accepted Cluster Records:6136
Number of 100% matchs in clstrs: 6135
Number of 100% matchs in full data: 10462

 
 

[]


In [14]:
small_cluster(test_df, 2, 3)
run_meterics(test_df, vaildate_df, 'test_2_3')

  .apply(lambda g: pd.Series({


Total number of input records: 15200
total number of clusters: 468
Total number of clusters with cluster_persistence at >= .9: 467
Total number of Clusters with top_name_pct >- .7: 37
Total number of Clsuers with size over 20: 20
Remaining Records after accepted clusters: 9072
Accepted Cluster Records:6128
Number of 100% matchs in clstrs: 6128
Number of 100% matchs in full data: 10585

 
 

[]
