## Name Match Test Result

#### To match company name bwteen Dataset SR and BG, I use two method to measure the text similarity. One is Jaro-winkler distance, another is distance provided by python lib `difflib`.

In [1]:
import pandas as pd
import numpy as np
import jieba
import difflib
import random
from cleanco import cleanco
import jaro
import string

#### Download Data

In [2]:
comp=pd.read_stata(r"..\Compustat\names.dta")
shark=pd.read_excel(r"..\Factset Shark Repellent\FactSet SharkRepllent Data (Pulled 2019-11-19).xlsx").drop(index=[0,1,2,4])
shark.columns=shark.loc[3]
shark=shark.drop(index=3).reset_index().drop('index',axis=1)
bg=pd.read_stata(r"..\00_BGT_Firm_Names.dta")
#create name
c_name=pd.DataFrame(comp['conm']).astype(str)
s_name=pd.DataFrame(shark.iloc[:,7]).rename({'Company Name':'conm'}, axis=1)
bg_name=pd.DataFrame(bg.name_bgt[bg['total_postings_bgt']>50])
bg_name_full=pd.DataFrame(bg.name_bgt)

#### Provide cleaned versions of names
1. 
 `cleanco` processes company names, providing cleaned versions of the names by stripping away terms indicating organization type (such as "Ltd." or "Corp").  
- Using a database of organization type terms, It also provides an utility to deduce the type of organization, in terms of US/UK business entity types (ie. "limited liability company" or "non-profit"). 

- Details about this package can be found at https://pypi.org/project/cleanco/

- I also change uppercase letter to lowercase.


In [None]:
#clean name
remove organization type and thansfer to lower case
s_name1={}.fromkeys(list(map(lambda x: cleanco(x.lower()).clean_name(), s_name.conm))).keys()
c_name1={}.fromkeys(list(map(lambda x: cleanco(x.lower()).clean_name(), c_name.conm))).keys()
bg_name1={}.fromkeys(list(map(lambda x: cleanco(x.lower()).clean_name(), bg_name.name_bgt))).keys()

#### Jaro-winker distance

1. Jaro-winker distance is a letter-based distance that we can use to measure the similarity between two strings
2. Here I use python package `jaro` to calculate the distance between strings, details about this package can be found at https://pypi.org/project/jaro-winkler/.
3. Time cost: 1000 times query with a dictionary contained 6000 strings will cost about 175s

In [139]:
def jaro_distance(list1,list2):
    """
    Measure strings similarity by Jaro-winker distance.
    
    Parameters
    ----------
        list1: list of query strings
        list2: list of names dictionary       
    Returns
    -------
        df: Dataframe with three columns: "query_name", "match" and "score"
        "query_name" is the query string(target company name)
        "match" is the most similary string found at dictionary for the query string
        "score" is a float number used to measure the similarity between the query string and "match" string, range(0,1)
              
    """
    df=pd.DataFrame(list1)
    label=[]
    score_get=np.empty(len(list1))
    score=np.empty(len(list2))
    for n1,str1 in enumerate(list1):
        for n2,str2 in enumerate(list2):
            score[n2]=jaro.jaro_winkler_metric(str1,str2)
        imax=np.argmax(score)
        label.append(list2[imax])
        score_get[n1]=max(score)
    df['match']=label
    df['score']=score_get
    df.rename(columns={0:'query_name'})
    return df

#### StrSimilarity 

1. StrSimilarity is a function to measure string similarity on word-based and letter-based
- The alagorithm of StrSimilarity is:
 - First to count the number of common words of query string and potential matched string
 - Then keep potential strings with the n-highest common word number
 - Calculate adjusted scores for each potential matched strings(optional)
 - Used `difflib` to calculate these n strings'similarity with query string
 - The final similarity score is score given by `difflib` minutes adjustment scores
- The advantage of this method is avoiding matching query company name with those companies whose name is very similar
- Time cost: 1000 times query with a dictionary contained 6000 strings will cost about 35.3s

In [4]:
class StrSimilarity1():
    def __init__(self,word):
        self.word=word
#Compared函数，参数str_list是对比字符串列表
#返回原始字符串分词后和对比字符串的匹配次数，返回一个字典
    def Compared(self,str_list):
        """
        Count common words.
        
        Parameters
        ----------
        str_list: a list contains potential matched strings
        
        
        Returns
        ----------
        dict_data: a dictionary
        keys are the potential matched strings
        value are the number of common words
        
        """
        dict_data={}
        sarticiple=list(self.word.strip().translate(str.maketrans('', '', string.punctuation)).split())
        for strs in str_list:
            #s_name list
            strs_word=list(strs.strip().translate(str.maketrans('', '', string.punctuation)).split())
            num=0
            for strs1 in strs_word:
                if strs1 in sarticiple:
                    num = num+1
                else: 
                    num = num
            dict_data[strs]=num
        return dict_data
    #NumChecks函数，参数dict_data是原始字符串分词后和对比字符串的匹配次数的字典，也就是Compared函数的返回值
    #返回出现次数最高的两个，返回一个字典
    def NumChecks(self,dict_data):
        """
        Return two potential strings with the hightest common word number.
        
        
        Parameters
        ----------
        dict_data: a dictionary
        keys are the potential matched strings
        value are the number of common words(return of Compared)
        
        
        Returns
        ----------
        dict_data: a dictionary
        keys are the two potential matched strings with the highest number of common words
        value are the number of common words
        
        """       
        list_data = sorted(dict_data.items(), key=lambda asd:asd[1], reverse=True)
        length = len(list_data)
        json_data = {}
        if length>=2:
            datas = list_data[:2]
        else:
            datas = list_data[:length]
        for data in datas:
            json_data[data[0]]=data[1]
        return json_data
#MMedian函数，参数dict_data是出现次数最高的两个对比字符串的字典，也就是NumChecks函数的返回值
#返回对比字符串和调节值的字典
    def MMedian(self,dict_data):
        """
        Calculate adjusted similarity scores for most potential strings(optional step).
        
        
        Parameters
        ----------
        dict_data: a dictionary
        keys are the two potential matched strings with the highest number of common words
        value are the number of common words(return of NumChecks)
        
        
        Returns
        ----------
        dict_data: a dictionary
        keys are the two potential matched strings with the highest number of common words
        value are the adjusted similarity scores
               
        """   
        
        median_list={}
        l=len(list(self.word.strip().translate(str.maketrans('', '', string.punctuation)).split()))#query string word numbers
        for k,v in dict_data.items():#k is potential string, v is the common word number
            length=len(list(k.strip().translate(str.maketrans('', '', string.punctuation)).split()))#potential string word numbers
            if l>v: 
                if v==length:
                    xx=-1
                else: 
                    xx = ((abs(l-v))/l)
            else: 
                 xx=-2    
            median_list[k] = xx
        return median_list
    
    
    
#Appear函数，参数dict_data是对比字符串和调节值的字典，也就是MMedian函数的返回值
#返回最相似的字符串
    def Appear(self,dict_data):
        """
        Return the most similar potential string.
        
        
        Parameters
        ----------
        dict_data: a dictionary
        keys are the two potential matched strings with the highest number of common words
        value are the adjusted similarity scores(return of  MMedian)
        
        
        Returns
        ----------
        dict_data: a dictionary
        key is the query string
        value is most similar potential string
               
        """   
        json_data={}
        for k,v in dict_data.items():
            fraction = difflib.SequenceMatcher(None, self.word, k).quick_ratio()-v
            json_data[k]=fraction
        tulp_data = sorted(json_data.items(), key=lambda asd:asd[1], reverse=True)
        return tulp_data[0][0],tulp_data[0][1]   
    
def name_match1(query_list1,str_list1):
    """
    Measure strings similarity by StrSimilary.
    
    Parameters
    ----------
        query_list1: list of query strings
        str_list1: list of names dictionary       
    Returns
    -------
        df: Dataframe with three columns: "query_name", "match" and "score"
        "query_name" is the query string(target company name)
        "match" is the most similary string found at dictionary for the query string
        "score" is a float number used to measure the similarity between the query string and "match" string, range(0,1)             
    """ 
    name_match=[]
    score=[]
    #str_list1=list(' '.join(str1.strip().translate(str.maketrans('', '', string.punctuation)).split()) for str1 in str_list1)
    for i,str_query in enumerate(query_list1):
        def main():
            query_str =str_query
            str_list=str_list1
    
            ss = StrSimilarity1(query_str)
            list_data = ss.Compared(str_list)
            num = ss.NumChecks(list_data)
            mmedian = ss.MMedian(num)
            #print(query_str,ss.Appear(mmedian))
            return ss.Appear(mmedian)
        if __name__=="__main__":
            name_match.append(main()[0])
            score.append(main()[1])
    df=pd.DataFrame(query_list1)
    df['match']=name_match
    df['score']=score
    df.rename(columns={0:'query_name'})
    return df

This is a StrSimilarity function but with different definition of "common word" and adjusted similarity scores

1. I extend the definition of "common word" (ie. 'hotels' and 'hotel' will be regarded as common word, but 'hodel' and 'hotel' will not )
- I set different penalty weights to dismatch in words and in letters

In [10]:
# extend common word definition
def max_num(str1,str2):
    i=0
    while True:
        if str1[:len(str1)-i] in str2:
            return len(str1)-i,i
            break
        else:
            i+=1

#停用词，这里只是针对例子增加的停用词，如果数量很大可以保存在一个文件中
#stopwords=['financial','service','services','group','company','companies','the','managerment']
stopwords=[]
class StrSimilarity3():
    def __init__(self,word):
        self.word=word

#Compared函数，参数str_list是对比字符串列表
#返回原始字符串分词后和对比字符串的匹配次数，返回一个字典
    def Compared(self,str_list):
        """
        Count common words.
        
        Parameters
        ----------
        str_list: a list contains potential matched strings
        
        
        Returns
        ----------
        dict_data: a dictionary
        keys are the potential matched strings
        value are the number of common words
        
        """
        dict_data={}
        sarticiple=self.word.replace(' and ', " & ").translate(str.maketrans('', '', string.punctuation))
        for strs,strs_word in str_list.items():
            num=0
            l=0
            for strs1 in strs_word:
                lens,i=max_num(strs1,sarticiple) #uset to solve match problem 'hotel' vs. 'hotels'
                if i<=2 and lens>=3:
                    num = num+1
                else:
                    num=num
            dict_data[strs]=num
        return dict_data

    
    #NumChecks函数，参数dict_data是原始字符串分词后和对比字符串的匹配次数的字典，也就是Compared函数的返回值
    #返回出现次数最高的两个，返回一个字典
    def NumChecks(self,dict_data):
        """
        Return three potential strings with the hightest common word number.
        
        
        Parameters
        ----------
        dict_data: a dictionary
        keys are the potential matched strings
        value are the number of common words(return of Compared)
        
        
        Returns
        ----------
        dict_data: a dictionary
        keys are the three potential matched strings with the highest number of common words
        value are the number of common words
        
        """  
        list_data = sorted(dict_data.items(), key=lambda asd:asd[1], reverse=True)
        length = len(list_data)
        json_data = {}
        json_data1 = {}
        if length>=3:
            datas = list_data[:3]
        else:
            datas = list_data[:length]
        for data in datas:
            json_data[data[0]]=data[1]# match number of word
            #json_data1[data[0]]=dict_data1[data[0]]#match number of letter
        return json_data#,json_data1
    
#MMedian函数，参数dict_data是出现次数最高的两个对比字符串的字典，也就是NumChecks函数的返回值
#返回对比字符串和调节值xx的字典
       
    def MMedian(self,dict_data):
         """
        Calculate adjusted similarity scores for most potential strings(optional step).
        
        
        Parameters
        ----------
        dict_data: a dictionary
        keys are the three potential matched strings with the highest number of common words
        value are the number of common words(return of NumChecks)
        
        
        Returns
        ----------
        dict_data: a dictionary
        keys are the three potential matched strings with the highest number of common words
        value are the adjusted similarity scores
               
        """   
        median_list={}
        length = len(self.word)
        for k,v in dict_data.items():
            num = np.median([len(k),length])
            if abs(length-num) !=0 :
                xx = (abs(length - num)) * 0.017
            else:
                xx = 0
            median_list[k] = xx
        return median_list
 
    
    
#Appear函数，参数dict_data是对比字符串和调节值的字典，也就是MMedian函数的返回值
#返回最相似的字符串
    def Appear(self,dict_data):
        """
        Return the most similar potential string.
        
        
        Parameters
        ----------
        dict_data: a dictionary
        keys are the three potential matched strings with the highest number of common words
        value are the adjusted similarity scores(return of  MMedian)
        
        
        Returns
        ----------
        dict_data: a dictionary
        key is the query string
        value is most similar potential string
               
        """   
        json_data={}
        for k,v in dict_data.items():
            fraction = difflib.SequenceMatcher(None, self.word, k).quick_ratio()-v#v 调节值
            #fraction=-v
            json_data[k]=fraction
        tulp_data = sorted(json_data.items(), key=lambda asd:asd[1], reverse=True)
        return tulp_data[0][0],tulp_data[0][1]
    
def name_match3(query_list1,str_list1):
    """
    Measure strings similarity by StrSimilary.
    
    Parameters
    ----------
        query_list1: list of query strings
        str_list1: list of names dictionary       
    Returns
    -------
        df: Dataframe with three columns: "query_name", "match" and "score"
        "query_name" is the query string(target company name)
        "match" is the most similary string found at dictionary for the query string
        "score" is a float number used to measure the similarity between the query string and "match" string, range(0,1)             
    """ 
    name_match=[]
    score=[]
    #str_list1=list(' '.join(str1.strip().translate(str.maketrans('', '', string.punctuation)).split()) for str1 in str_list1)
    for i,str_query in enumerate(query_list1):
        def main():
            query_str =str_query
            str_list=str_list1
    
            ss = StrSimilarity3(query_str)
            list_data= ss.Compared(str_list)
            num= ss.NumChecks(list_data)
            mmedian = ss.MMedian(num)
            #print(query_str,ss.Appear(mmedian))
            return ss.Appear(mmedian)
        if __name__=="__main__":
            name_match.append(main()[0])
            score.append(main()[1])
    df=pd.DataFrame(query_list1)
    df['match']=name_match
    df['score']=score
    df.rename(columns={0:'query_name'})
    return df

### This is a match test using simulated data

1. I pick 1000 random names from `Shark Repellent` as my data label `train_Y` (because I need the true match label)
- I pick words(letters) from burning glass and add them randomly to `train_Y` to create train data `train_X`
- `train_X` now is the query list, `train_Y` is the true label od query word,  `Shark Repellent` is my potential matched string list

In [161]:
#train
noise=random.sample(s_name1,(1000))
train_Y=sorted(random.sample(s_name1,(1000)))#true lable
train_X=list(map(lambda x,y:x+' '+y[:4]+' '+y[-3:], train_Y,noise))

- Below is the test data, where `real nam` is the true label, `noise name` is string waiting to match with `Shark Repellent`
- Order of words doen't matter in each algorithm

In [162]:
data=pd.DataFrame(train_Y,columns=['real name'])
data['noise name']=train_X
data.head(-10)

Unnamed: 0,real name,noise name
0,180 connect,180 connect post ngs
1,22nd century group,22nd century group petr ent
2,3par,3par tuto cal
3,a. m. castle,a. m. castle rovi ovi
4,a10 networks,a10 networks redk ons
...,...,...
985,xenia hotels & resorts,xenia hotels & resorts prog ial
986,xenoport,xenoport surm ics
987,xo holdings,xo holdings usel com
988,xplore technologies,xplore technologies worl ngs


#### StrSimilarity1 Test Result

In [164]:
%%time
df1_train=name_match1(train_X,str_list11)
df1_train['lable']=train_Y
df1_train.sort_values(by='score',ascending=False).head(50)
#threashold=1
df1_result=df1_train[df1_train.score>=1]
accuracy_ratio=sum(np.where(df1_result.match==df1_result.lable,1,0))/np.shape(df1_result)[0]
print(accuracy_ratio)

0.9050505050505051
Wall time: 39.3 s


In [178]:
df1_train.sort_values(by='score',ascending=False).head(-10)

Unnamed: 0,query_name,match,score,lable
18,advent claymore convertible securities and inc...,advent claymore convertible securities and inc...,1.923077,advent claymore convertible securities and inc...
355,federated premier intermediate municipal incom...,federated premier intermediate municipal incom...,1.920354,federated premier intermediate municipal incom...
126,blackrock investment quality municipal income ...,blackrock investment quality municipal income ...,1.918919,blackrock investment quality municipal income ...
645,nuveen insured florida tax-free advantage muni...,nuveen insured florida taxfree advantage munic...,1.916667,nuveen insured florida tax-free advantage muni...
552,managed duration investment grade municipal fu...,managed duration investment grade municipal fund,1.914286,managed duration investment grade municipal fund
...,...,...,...,...
649,nwh smur ner,nwh,1.400000,nwh
105,bab imme ion,bab,1.400000,bab
483,iqe fanu nuc,iqe,1.400000,iqe
83,at&t sale ons,att,1.375000,at&t


#### Jaro-winkler Distance Test Result

In [179]:
%%time
df_jaro_train=jaro_distance(train_X,str_list11)
df_jaro_train['lable']=train_Y
df_jaro_train.sort_values(by='score',ascending=False).head(50)
#threashold
df_jaro_result=df_jaro_train[df_jaro_train.score>=0.8]
accuracy_ratio2=sum(np.where(df_jaro_result.match==df_jaro_result.lable,1,0))/np.shape(df_jaro_result)[0]
print(accuracy_ratio2)

0.8986960882647944
Wall time: 3min 29s


In [None]:
df_jaro_train.sort_values(by='score',ascending=False).head(-10)

#### Choose of Threashold

In [148]:
#choose the best threashold
def test_func1(train_X,train_Y,str_list11,threashold1):
    df1_train=name_match1(train_X,str_list11)
    df1_train['lable']=train_Y
    df1_result=df1_train[df1_train.score>=threashold1]
    accuracy_ratio=sum(np.where(df1_result.match==df1_result.lable,1,0))/np.shape(df1_result)[0]
    return accuracy_ratio

def test_func2(train_X,train_Y,str_list11,threashold2):
    df_jaro_train=jaro_distance(train_X,str_list11)
    df_jaro_train['lable']=train_Y
    df_jaro_result=df_jaro_train[df_jaro_train.score>=threashold2]
    accuracy_ratio2=sum(np.where(df_jaro_result.match==df_jaro_result.lable,1,0))/np.shape(df_jaro_result)[0]

    return accuracy_ratio2

In [None]:
acc1={}
acc2={}
n=5
for threashold1,threashold2 in zip(np.linspace(0.9,1.5,10),np.linspace(0.75,0.95,10)):
    ratio1=0
    ratio2=0
    i=-n
    while i:
        noise=random.sample(s_name1,(100))
        train_Y=sorted(random.sample(s_name1,(100)))#true lable
        train_X=list(map(lambda x,y:x+' '+y[:4]+' '+y[-3:], train_Y,noise))
        ratio1=ratio1+test_func1(train_X,train_Y,str_list11,threashold1)   
        ratio2=ratio2+test_func2(train_X,train_Y,str_list11,threashold2) 
        i+=1
    acc1[threashold1]=ratio1/n
    acc2[threashold2]=ratio2/n
    

In [213]:
k=acc1.keys()
v=acc1.values()
k2=acc2.keys()
v2=acc2.values()
table1=pd.DataFrame([k,v,k2,v2]).T.rename(columns={0:'threshold_strSimi',1:'accuracy_strSimi',2:'threshold_jaro',3:'accuracy_jaro'})
table1

Unnamed: 0,threshold_strSimi,accuracy_strSimi,threshold_jaro,accuracy_jaro
0,0.9,0.896,0.75,0.895818
1,0.966667,0.896,0.772222,0.896
2,1.033333,0.906,0.794444,0.905677
3,1.1,0.908,0.816667,0.909448
4,1.166667,0.904,0.838889,0.903196
5,1.233333,0.912,0.861111,0.927856
6,1.3,0.91,0.883333,0.91596
7,1.366667,0.902,0.905556,0.932159
8,1.433333,0.908775,0.927778,0.933825
9,1.5,0.905708,0.95,0.933782


#### Test Result for Real Data (use 1000 sample)

In [215]:
#real data test
random.seed(123)
query_list1=random.sample(bg_name1,1000)#1000 names 
s_name1=sorted(s_name1, key=len)    
str_list11=list(' '.join(str1.strip().translate(str.maketrans('', '', string.punctuation)).split()) for str1 in s_name1)
query_list1=list(' '.join(str1.strip().translate(str.maketrans('', '', string.punctuation)).split()) for str1 in query_list1)
query_list1.sort(reverse=False)

#### StrSimilarity1 Test Result

In [248]:
%%time
df1=name_match1(query_list1,str_list11).sort_values(by='score',ascending=False)

Wall time: 37 s


In [235]:
#threshold=1
df1[df1.score>1]

Unnamed: 0,0,match,score
941,universal technical institute,universal technical institute,3.000000
381,gateway,gateway,3.000000
512,kensey nash,kensey nash,3.000000
532,lake shore bancorp,lake shore bancorp,3.000000
630,national interstate,national interstate,3.000000
...,...,...,...
813,smc corp america,smc,1.315789
143,boca west country club,west,1.307692
111,ball factory indoor play cafe,ball,1.242424
634,netvision resources nvr,nvr,1.230769


In [236]:
#threshold=1.5
df1[df1.score>1.5]

Unnamed: 0,0,match,score
941,universal technical institute,universal technical institute,3.0
381,gateway,gateway,3.0
512,kensey nash,kensey nash,3.0
532,lake shore bancorp,lake shore bancorp,3.0
630,national interstate,national interstate,3.0
856,superdry,superdry,3.0
139,bmc software,bmc software,3.0
204,christopher banks,christopher banks,3.0
90,ashland,ashland,3.0
196,charter communications,charter communications,3.0


#### StrSimilarity2 Test Result

In [225]:
%%time
df2=name_match3(query_list1,str_list1).sort_values(by='score',ascending=False)

Wall time: 1min


In [237]:
#threshold=0.8
df2[df2.score>0.8]

Unnamed: 0,0,match,score
196,charter communications,charter communications,1.0
630,national interstate,national interstate,1.0
381,gateway,gateway,1.0
856,superdry,superdry,1.0
512,kensey nash,kensey nash,1.0
532,lake shore bancorp,lake shore bancorp,1.0
139,bmc software,bmc software,1.0
941,universal technical institute,universal technical institute,1.0
90,ashland,ashland,1.0
689,packeteer,packeteer,1.0


In [247]:
#threshold=0.9
df2[df2.score>0.85]

Unnamed: 0,0,match,score
196,charter communications,charter communications,1.0
630,national interstate,national interstate,1.0
381,gateway,gateway,1.0
856,superdry,superdry,1.0
512,kensey nash,kensey nash,1.0
532,lake shore bancorp,lake shore bancorp,1.0
139,bmc software,bmc software,1.0
941,universal technical institute,universal technical institute,1.0
90,ashland,ashland,1.0
689,packeteer,packeteer,1.0


#### Jaro-winkler Test Result

In [218]:
%%time
df_jaro=jaro_distance(query_list1,str_list11).sort_values(by='score',ascending=False)

Wall time: 2min 55s


Unnamed: 0,0,match,score
381,gateway,gateway,1.000000
532,lake shore bancorp,lake shore bancorp,1.000000
856,superdry,superdry,1.000000
90,ashland,ashland,1.000000
689,packeteer,packeteer,1.000000
...,...,...,...
584,marquardt transportation,apartment trust of america,0.721510
967,waldorf astoria park city,selas corporation of america,0.721429
501,jenison public schools,pulse biosciences,0.721390
72,apackansas,japan asset marketing co,0.721296


In [243]:
#threshold=0.8
df_jaro[df_jaro.score>0.9]

Unnamed: 0,0,match,score
381,gateway,gateway,1.000000
532,lake shore bancorp,lake shore bancorp,1.000000
856,superdry,superdry,1.000000
90,ashland,ashland,1.000000
689,packeteer,packeteer,1.000000
...,...,...,...
861,sybron dental,sybron dental specialties,0.904000
675,orlando group,ocado group,0.903497
544,legacy ventures,legacy reserves,0.903333
402,great american restaurants,great american group,0.900769


In [245]:
#threshold=0.8
df_jaro[df_jaro.score>0.93]

Unnamed: 0,0,match,score
381,gateway,gateway,1.0
532,lake shore bancorp,lake shore bancorp,1.0
856,superdry,superdry,1.0
90,ashland,ashland,1.0
689,packeteer,packeteer,1.0
139,bmc software,bmc software,1.0
204,christopher banks,christopher banks,1.0
196,charter communications,charter communications,1.0
512,kensey nash,kensey nash,1.0
941,universal technical institute,universal technical institute,1.0
