## Matching on DBLP ACM Dataset

### About the Dataset:
* This data set was taken from the Benchmark datasets for entity resolution web page. 
* It contains bibliographic data, with 4 attributes: title, authors, venue, year. 
* There are 3 CSV files in this zip archive.

### Task: Measure how accurate the bipartite matching algorithm is using the datasets and the ground truth presented in the data

Some assumptions and notes:
* We are checking string similarity using the `titles` column of the 2 datasets

* Added `encoding='latin-1'` to the `pd.read_csv` call in the function `convert_df` to avoid pandas decoding errors when reading the files
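
The encoding note can be illustrated with the standard library alone. This is a minimal sketch, assuming the files contain bytes such as `0xE9` (é in Latin-1) that are not valid UTF-8; the sample string is made up for illustration and is not taken from the dataset.

```python
# Hypothetical sample: "José" encoded as Latin-1 (not taken from the dataset)
raw = "José".encode("latin-1")

# Decoding as UTF-8 fails because 0xE9 starts a multi-byte sequence that never completes
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte to a character, so decoding never raises
decoded = raw.decode("latin-1")
print(utf8_ok, decoded)
```

Since Latin-1 accepts any byte sequence, it avoids the read error, but it may silently mis-decode text that was actually written in another encoding.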

## Steps

#### 1. We'll include the bipartite matching algorithm components from our previous findings. We can try out different string matching techniques. 

#### A good starting point would be to see how `editDistance` is working in this context, so we'll include the function below.

In [2]:
import textdistance
import editdistance

def editDistance(str1, str2, m, n): 
  
    # If first string is empty, the only option is to 
    # insert all characters of second string into first 
    if m == 0: 
         return n 
  
    # If second string is empty, the only option is to 
    # remove all characters of first string 
    if n == 0: 
        return m 
  
    # If last characters of two strings are same, nothing 
    # much to do. Ignore last characters and get count for 
    # remaining strings. 
    if str1[m-1]== str2[n-1]: 
        return editDistance(str1, str2, m-1, n-1) 
  
    # If last characters are not same, consider all three 
    # operations on last character of first string, recursively 
    # compute minimum cost for all three operations and take 
    # minimum of three values. 
    return 1 + min(editDistance(str1, str2, m, n-1),    # Insert 
                   editDistance(str1, str2, m-1, n),    # Remove 
                   editDistance(str1, str2, m-1, n-1)    # Replace 
                   ) 


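def jaccard_similarity(list1, list2):
    # Compare the inputs as sets so that duplicate items do not inflate the union
    set1, set2 = set(list1), set(list2)
    union = len(set1 | set2)
    if union == 0:
        return 0.0  # both inputs are empty; avoid dividing by zero
    return len(set1 & set2) / union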
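
The recursive `editDistance` above recomputes the same subproblems many times, so its running time grows exponentially with the string lengths. A minimal sketch of the standard bottom-up dynamic-programming variant follows; the name `edit_distance_dp` is introduced here for illustration and is not part of the notebook.

```python
def edit_distance_dp(str1, str2):
    """Levenshtein distance in O(len(str1) * len(str2)) time, keeping two rows."""
    previous = list(range(len(str2) + 1))  # distance from "" to each prefix of str2
    for i, ch1 in enumerate(str1, start=1):
        current = [i]  # distance from str1[:i] to ""
        for j, ch2 in enumerate(str2, start=1):
            cost = 0 if ch1 == ch2 else 1
            current.append(min(previous[j] + 1,          # remove ch1
                               current[j - 1] + 1,       # insert ch2
                               previous[j - 1] + cost))  # replace, or keep if equal
        previous = current
    return previous[-1]

print(edit_distance_dp("kitten", "sitting"))
```

It returns the same values as the recursive version, so it could be swapped into `similarity_edit` for longer strings.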

In [3]:
import pandas as pd
import networkx as nx

"""

Computes the edit (Levenshtein) distance between two strings with the recursive editDistance function above
Note: the plain recursion is exponential in the string lengths, so it is only practical for short inputs

Input: 2 variables that will be measured for similarity
Output: The desired similarity metric result
"""
def similarity_edit(x,y):
    return editDistance(x,y,len(x),len(y))



"""

Transforms the given file to a pandas dataframe object if it was not one already
Assumption: Assumes that the data starts from the 1st row of given file and is comma-separated (the pd.read_csv default)

Input: A file path, or a pandas dataframe (returned unchanged)
Output: A pandas dataframe object
"""
def convert_df(file):
    if isinstance(file, pd.DataFrame):
        return file
    else:
        df = pd.read_csv(file, encoding='latin-1')
        return df
"""

Calculates maximum weight for the matching

Input: keys from 2 tables
Output: weight for each matching to be used in the weight part of constructing the graph
"""
def calc_max_weight(key1, key2):
    weight = textdistance.jaccard(key1,key2) #this library's implementation is slower than jaccard_similarity()
    return weight

"""

Calculates minimum weight for the matching

Input: keys from 2 tables
Output: weight for each matching to be used in the weight part of constructing the graph
"""
def calc_min_weight(key1, key2):
    weight = (-1)/(1+textdistance.jaccard(key1,key2)) #this library's implementation is slower than jaccard_similarity()
    return weight

"""

Calculates maximum weight for the matching

Input: keys from 2 tables
Output: weight for each matching to be used in the weight part of constructing the graph
"""
def calc_max_weight_edit(key1, key2):
    weight = (1)/(1+editdistance.eval(key1,key2))
    return weight

"""

Calculates minimum weight for the matching

Input: keys from 2 tables
Output: weight for each matching to be used in the weight part of constructing the graph
"""
def calc_min_weight_edit(key1, key2):
    weight = (-1)/(1+editdistance.eval(key1,key2))
    return weight



"""

Converts the dataframe into a dictionary that maps the first column (the key) to a tuple of the remaining columns.
Assumption: The data has headers in the first row (description of what that column describes)

Input: A pandas dataframe
Output: A dictionary in the form col1:(col2, col3, ...) matching
"""
def make_dict(file):
    V = list(file.to_dict('list').values())
    keys = V[0]
    values = zip(*V[1:])
    table = dict(zip(keys,values))
    return table

"""

Constructs a maximal bipartite graph of the given two tables

Input: Any 2 files in any format
Output: A Bipartite Graph with Maximal Weights
"""
def updated_maximal_construct_graph(file_a, file_b):
    table_a_unprocessed = convert_df(file_a)
    table_b_unprocessed = convert_df(file_b)
    bipartite_graph = nx.Graph()
    
    table_a = make_dict(table_a_unprocessed)
    table_b = make_dict(table_b_unprocessed)
    
    i=0
    
    for key1, val1 in table_a.items():
       # print(val1)
        id1 = str(key1) + '_' + str(val1) + '_1'
        for key2, val2 in table_b.items():
            i+=1
            if i%100000 == 0:
                print(str(round(100*i/len(table_a)/len(table_b),2))+'% complete')
            #add value to identifier to distinguish two entries with different values
            id2 = str(key2) + '_' + str(val2) + '_2' 
            bipartite_graph.add_edge(id1, id2, weight=calc_max_weight_edit(val1, val2))
            #edit distance and weight should be inv. prop.
            #also adding 1 to denom. to prevent divide by 0
            # add 1,2 to distinguish two key-value tuples belonging to different tables
    return bipartite_graph



"""

Constructs a minimal bipartite graph of the given two tables

Input: Any 2 files in any format
Output: A Bipartite Graph with Minimal Weights
"""
def updated_minimal_construct_graph(file_a, file_b):
    table_a_unprocessed = convert_df(file_a)
    table_b_unprocessed = convert_df(file_b)  # was file_a, which compared table A with itself
    bipartite_graph = nx.Graph()
    
    table_a = make_dict(table_a_unprocessed)
    table_b = make_dict(table_b_unprocessed)
    
    i=0
    
    for key1, val1 in table_a.items():
        id1 = str(key1) + '_' + str(val1) + '_1'
        for key2, val2 in table_b.items():
            i+=1
            if i%100000 == 0:
                print(str(round(100*i/len(table_a)/len(table_b),2))+'% complete')
            #add value to identifier to distinguish two entries with different values
            id2 = str(key2) + '_' + str(val2) + '_2' 
            bipartite_graph.add_edge(id1, id2, weight=calc_min_weight(key1, key2)) 
            #edit distance and weight should be inv. prop.
            #also adding 1 to denom. to prevent divide by 0
            # add 1,2 to distinguish two key-value tuples belonging to different tables
    return bipartite_graph

bipartite_graph_maximal = updated_maximal_construct_graph("table_a.csv","table_b.csv")
#print(bipartite_graph_maximal.edges.data())
bipartite_graph_minimal = updated_minimal_construct_graph("table_a.csv", "table_b.csv")
bipartite_graph_minimal.edges.data()

EdgeDataView([("US_('300 M', 1)_1", "US_('300 M', 1)_2", {'weight': -0.5}), ("US_('300 M', 1)_1", "CN_('12 B', 2)_2", {'weight': -1.0}), ("US_('300 M', 1)_1", "CA_('50 M', 3)_2", {'weight': -1.0}), ("US_('300 M', 1)_1", "AU_('25 M', 4)_2", {'weight': -0.75}), ("US_('300 M', 1)_2", "CN_('12 B', 2)_1", {'weight': -1.0}), ("US_('300 M', 1)_2", "CA_('50 M', 3)_1", {'weight': -1.0}), ("US_('300 M', 1)_2", "AU_('25 M', 4)_1", {'weight': -0.75}), ("CN_('12 B', 2)_2", "CN_('12 B', 2)_1", {'weight': -0.5}), ("CN_('12 B', 2)_2", "CA_('50 M', 3)_1", {'weight': -0.75}), ("CN_('12 B', 2)_2", "AU_('25 M', 4)_1", {'weight': -1.0}), ("CA_('50 M', 3)_2", "CN_('12 B', 2)_1", {'weight': -0.75}), ("CA_('50 M', 3)_2", "CA_('50 M', 3)_1", {'weight': -0.5}), ("CA_('50 M', 3)_2", "AU_('25 M', 4)_1", {'weight': -0.75}), ("AU_('25 M', 4)_2", "CN_('12 B', 2)_1", {'weight': -1.0}), ("AU_('25 M', 4)_2", "CA_('50 M', 3)_1", {'weight': -0.75}), ("AU_('25 M', 4)_2", "AU_('25 M', 4)_1", {'weight': -0.5})])
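
The vertex ids in the output above embed a tuple because `make_dict` maps the first column to a tuple of the remaining columns. A minimal sketch of that transformation, using a plain dict of lists as a stand-in for `DataFrame.to_dict('list')`; the column names are hypothetical, and the values mirror the ids printed above.

```python
# Stand-in for DataFrame.to_dict('list'); the column names are hypothetical
columns = {
    "code": ["US", "CN", "CA", "AU"],
    "population": ["300 M", "12 B", "50 M", "25 M"],
    "rank": [1, 2, 3, 4],
}

V = list(columns.values())
keys = V[0]            # the first column becomes the dictionary key
values = zip(*V[1:])   # the remaining columns are zipped row by row into tuples
table = dict(zip(keys, values))

print(table["US"])
# Vertex id as built in the graph construction functions
vertex_id = str("US") + '_' + str(table["US"]) + '_1'
print(vertex_id)
```

Because the values are tuples, the weight functions receive a tuple of fields rather than a single title string when they are called with `val1` and `val2`. Rows that share the same key collapse into a single dictionary entry.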

In [4]:
#nx.algorithms.matching.max_weight_matching(bipartite_graph_maximal)
print(nx.algorithms.bipartite.matching.maximum_matching(bipartite_graph_minimal))

{"CA_('50 M', 3)_1": "US_('300 M', 1)_2", "CN_('12 B', 2)_1": "CN_('12 B', 2)_2", "US_('300 M', 1)_1": "CA_('50 M', 3)_2", "AU_('25 M', 4)_1": "AU_('25 M', 4)_2", "CN_('12 B', 2)_2": "CN_('12 B', 2)_1", "US_('300 M', 1)_2": "CA_('50 M', 3)_1", "CA_('50 M', 3)_2": "US_('300 M', 1)_1", "AU_('25 M', 4)_2": "AU_('25 M', 4)_1"}
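
In the output above, the `CA` record of table 1 is paired with the `US` record of table 2 even though the identical pair has a better weight. This is expected from `bipartite.maximum_matching`, which maximizes the number of matched pairs and does not look at edge weights. Below is a small brute-force sketch of a weight-aware assignment on the same four keys. `char_jaccard` is a stand-in written for this example (not the `textdistance` implementation), and enumerating permutations is only feasible for tiny inputs.

```python
from itertools import permutations

def char_jaccard(a, b):
    """Jaccard similarity of the character sets of two strings (stand-in metric)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def min_weight(a, b):
    # Same shape as calc_min_weight: -0.5 for identical keys, -1.0 for disjoint ones
    return -1 / (1 + char_jaccard(a, b))

keys = ["US", "CN", "CA", "AU"]

# Try every one-to-one assignment and keep the one with the largest total weight
# (the weights are negative, so the largest total is the least negative one)
best = max(permutations(keys),
           key=lambda perm: sum(min_weight(a, b) for a, b in zip(keys, perm)))
assignment = dict(zip(keys, best))
print(assignment)
```

With these weights, the best total is reached only when every key is paired with itself, which is the result a weight-aware matching should return on this toy input.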


### 2. Load the data for processing

In [5]:
# Sticking to the convention of table_a and table_b naming that we previously used for generalization purposes

table_a = convert_df("ACM.csv")

table_b = convert_df("DBLP2.csv")

### 3. Create a bipartite graph

In [6]:
graph_maximal = updated_maximal_construct_graph(table_a, table_b)
graph_maximal.number_of_edges()

1.67% complete
3.33% complete
5.0% complete
6.67% complete
8.33% complete
10.0% complete
11.66% complete
13.33% complete
15.0% complete
16.66% complete
18.33% complete
20.0% complete
21.66% complete
23.33% complete
25.0% complete
26.66% complete
28.33% complete
29.99% complete
31.66% complete
33.33% complete
34.99% complete
36.66% complete
38.33% complete
39.99% complete
41.66% complete
43.33% complete
44.99% complete
46.66% complete
48.32% complete
49.99% complete
51.66% complete
53.32% complete
54.99% complete
56.66% complete
58.32% complete
59.99% complete
61.66% complete
63.32% complete
64.99% complete
66.65% complete
68.32% complete
69.99% complete
71.65% complete
73.32% complete
74.99% complete
76.65% complete
78.32% complete
79.99% complete
81.65% complete
83.32% complete
84.98% complete
86.65% complete
88.32% complete
89.98% complete
91.65% complete
93.32% complete
94.98% complete
96.65% complete
98.32% complete
99.98% complete


6001104

As the output above shows, the graph has about 6 million edges (6,001,104), one for every pair of records across the two tables. This makes the matching step very expensive, and it is unlikely that the maximum weight matching will finish in a reasonable amount of time on a graph this large.

In [7]:
#print(nx.algorithms.matching.max_weight_matching(graph_maximal))

The problem is that `max_weight_matching` runs in O(n^3) time in the number of nodes. Given the large size of the dataset, the call above did not finish, so it is left commented out.
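
A rough back-of-the-envelope check of these sizes, as a sketch. The table sizes 2294 and 2616 are an assumption (they are not printed in the notebook), chosen because their product equals the 6001104 edges reported above. The cubic figure is only an order-of-magnitude operation count, not a runtime prediction.

```python
# Assumed table sizes (not printed in the notebook); their product matches
# the 6001104 edges reported by number_of_edges()
n_a, n_b = 2294, 2616

edges = n_a * n_b          # complete bipartite graph: one edge per cross-table pair
nodes = n_a + n_b
cubic_steps = nodes ** 3   # order of magnitude for an O(n^3) algorithm

print(edges)
print(nodes)
print(f"{cubic_steps:.2e}")
```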

## Update - 04.08.2020 Wednesday
## Solution: Set a threshold to only connect the vertices that have the potential to be viable matches

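The solution below sets a threshold of 0.3 on the similarity score, so that an edge is only added for pairs that reach it. As written, the code applies the threshold to `calc_max_weight_edit` and weights the surviving edges with `calc_max_weight` (Jaccard); both can easily be adjusted according to the similarity metric that we are using.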
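
To make the threshold concrete: with a weight of the form `1/(1+d)`, as in `calc_max_weight_edit`, requiring a weight of at least 0.3 is the same as requiring a distance `d` of at most 2. Here is a minimal sketch of threshold-based edge filtering with a pluggable weight function. `filtered_edges`, `toy_weight`, and the sample records are hypothetical and only illustrate the idea.

```python
def filtered_edges(table_a, table_b, weight_fn, threshold):
    """Yield (key_a, key_b, weight) only for pairs whose weight reaches the threshold."""
    for key_a, val_a in table_a.items():
        for key_b, val_b in table_b.items():
            weight = weight_fn(val_a, val_b)
            if weight >= threshold:
                yield key_a, key_b, weight

def toy_weight(a, b):
    # Stand-in metric: 1 / (1 + number of positions where the tuples differ)
    d = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return 1 / (1 + d)

# Hypothetical records: key -> (title, venue, year)
table_a = {"a1": ("query optimization", "VLDB", 1998),
           "a2": ("data cubes", "SIGMOD", 1999)}
table_b = {"b1": ("query optimization", "VLDB", 1998),
           "b2": ("stream joins", "ICDE", 2003)}

edges = list(filtered_edges(table_a, table_b, toy_weight, threshold=0.3))
print(edges)

# With weight = 1/(1+d), the 0.3 cut-off keeps distances 0, 1 and 2 only
kept_distances = [d for d in range(6) if 1 / (1 + d) >= 0.3]
print(kept_distances)
```

Passing the weight function as an argument keeps the filtering loop unchanged when the similarity metric is swapped.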

In [22]:
import datetime

now = datetime.datetime.now()  # start time for the graph construction + matching timing
"""

Constructs a maximal bipartite graph of the given two tables according to the threshold similarity.
The bipartite matching graph only includes those pairs that have passed a certain similarity threshold.

Input: Any 2 files in any format
Output: A Bipartite Graph with Maximal Weights
"""
def treshold_updated_maximal_construct_graph(file_a, file_b):
    table_a_unprocessed = convert_df(file_a)
    table_b_unprocessed = convert_df(file_b)
    bipartite_graph = nx.Graph()
    
    table_a = make_dict(table_a_unprocessed)
    table_b = make_dict(table_b_unprocessed)
    
    i=0
    
    for key1, val1 in table_a.items():
       # print(val1)
        id1 = str(key1) + '_'+ '_1'
        for key2, val2 in table_b.items():
            i+=1
            if i%100000 == 0:
                print(str(round(100*i/len(table_a)/len(table_b),2))+'% complete')
            if calc_max_weight_edit(val1,val2) >= 0.3:
                #add table suffix to identifier to distinguish entries from the two tables
                id2 = str(key2) + '_' + '_2' 
                bipartite_graph.add_edge(id1, id2, weight=calc_max_weight(val1, val2))
                #edit distance and weight should be inv. prop.
                #also adding 1 to denom. to prevent divide by 0
                # add 1,2 to distinguish two key-value tuples belonging to different tables
            else:
                continue
            
    return bipartite_graph

In [23]:
treshold_graph_maximal = treshold_updated_maximal_construct_graph(table_a, table_b)
#treshold_graph_maximal.number_of_edges()
"""

Outputs the matching that has the maximal weight for each edge in the bipartite graph

Input: A Bipartite Graph
Output: A set of matchings. Ex: {('journals/vldb/MedjahedBBNE03__2', '775457__1')}
"""
matching_set = nx.algorithms.matching.max_weight_matching(treshold_graph_maximal)
timing = (datetime.datetime.now()-now).total_seconds()
print("---- Timing for Graph Construction + Matching ----")
print(timing,"seconds")

1.67% complete
3.33% complete
5.0% complete
6.67% complete
8.33% complete
10.0% complete
11.66% complete
13.33% complete
15.0% complete
16.66% complete
18.33% complete
20.0% complete
21.66% complete
23.33% complete
25.0% complete
26.66% complete
28.33% complete
29.99% complete
31.66% complete
33.33% complete
34.99% complete
36.66% complete
38.33% complete
39.99% complete
41.66% complete
43.33% complete
44.99% complete
46.66% complete
48.32% complete
49.99% complete
51.66% complete
53.32% complete
54.99% complete
56.66% complete
58.32% complete
59.99% complete
61.66% complete
63.32% complete
64.99% complete
66.65% complete
68.32% complete
69.99% complete
71.65% complete
73.32% complete
74.99% complete
76.65% complete
78.32% complete
79.99% complete
81.65% complete
83.32% complete
84.98% complete
86.65% complete
88.32% complete
89.98% complete
91.65% complete
93.32% complete
94.98% complete
96.65% complete
98.32% complete
99.98% complete
---- Timing for Graph Construction + Matching ----

In [24]:
"""

Converts the matching into a list of (idDBLP, idACM) tuples.
The table suffix of each vertex id (1 for ACM, 2 for DBLP) decides which id goes in which position.

Input: A set of vertices that have been matched according to the weight of the edges. Matching should be of type tuples inside a set.
Output: A list of (idDBLP, idACM) tuples
"""
def retrieve(matching):
    res2 = list(matching)
    res_tuple = []
    for i in res2:

        if int(i[0].split("_")[-1]) == 1:
            idACM = i[0].split("_")[0]
            idDBLP = i[1].split("_")[0]
            res_tuple.append((idDBLP, idACM))
            
        if int(i[0].split("_")[-1]) == 2:
            idACM = i[1].split("_")[0]
            idDBLP = i[0].split("_")[0]
            res_tuple.append((idDBLP, idACM))
            
    return res_tuple

retrieve(matching_set)

[('journals/vldb/MedjahedBBNE03', '775457'),
 ('conf/vldb/VermeerA96', '673635'),
 ('conf/vldb/GehrkeRG98', '671330'),
 ('journals/sigmod/RosenthalS99', '333611'),
 ('conf/vldb/CherniackZ98', '671177'),
 ('conf/vldb/LometT95', '673170'),
 ('conf/sigmod/Gibson95', '223884'),
 ('conf/vldb/GravanoG97', '670994'),
 ('conf/vldb/CeriFP99', '671524'),
 ('journals/tods/WinslettSQ94', '195675'),
 ('journals/tods/FegarasM00', '377676'),
 ('journals/vldb/ChangG01', '767145'),
 ('conf/vldb/CareyD96', '673465'),
 ('conf/vldb/EicklerKK97', '673666'),
 ('conf/vldb/Sarawagi99', '671500'),
 ('journals/sigmod/BawaCCDGGKMSSVY03', '945728'),
 ('conf/sigmod/GoldmanW00', '335422'),
 ('conf/vldb/RusinkiewiczKTWM95', '673142'),
 ('conf/vldb/HelmerM97', '673667'),
 ('conf/vldb/CherniackFZ01', '672180'),
 ('conf/vldb/GehaniJR94', '672971'),
 ('conf/sigmod/BrunoC02', '564722'),
 ('journals/sigmod/dOnofrioP03', '776990'),
 ('journals/vldb/MecellaP01', '767134'),
 ('conf/vldb/ChenRS95', '673007'),
 ('conf/vldb/Aks
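
The id handling in `retrieve` relies on the vertex naming used during graph construction: the record id, two underscores, then the table number. A minimal sketch of that parsing on one matched pair follows. `split_vertex` and `to_pair` are hypothetical helpers; they use `rsplit` so that an id containing an underscore of its own would still parse, whereas `split("_")[0]` in the notebook assumes ids contain none.

```python
def split_vertex(vertex):
    """Split 'recordid__N' into (record id, table number N)."""
    record_id, table = vertex.rsplit("__", 1)
    return record_id, int(table)

def to_pair(edge):
    """Order a matched edge as (idDBLP, idACM); table 1 is ACM, table 2 is DBLP."""
    ids = {table: record_id for record_id, table in map(split_vertex, edge)}
    return ids[2], ids[1]

# The same edge in both orientations gives the same (idDBLP, idACM) pair
pair_a = to_pair(('journals/vldb/MedjahedBBNE03__2', '775457__1'))
pair_b = to_pair(('775457__1', 'journals/vldb/MedjahedBBNE03__2'))
print(pair_a)
print(pair_b)
```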

## Evaluating Accuracy

In [18]:
import datetime
import csv

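def eval_matching(matching):
    matches = set()           # ground-truth (idDBLP, idACM) pairs
    proposed_matches = set()  # pairs proposed by the matching

    tp = set()  # proposed pairs that are in the ground truth
    fp = set()  # proposed pairs that are not in the ground truth
    fn = set()  # ground-truth pairs that were not proposed

    # Every row is read as a pair, so a header row (if the file has one)
    # is counted as one extra ground-truth pair.
    with open('DBLP-ACM_perfectMapping.csv', 'r', encoding="ISO-8859-1") as f:
        reader = csv.reader(f, delimiter=',', quotechar='"')
        for row in reader:
            matches.add((row[0], row[1]))

    for m in matching:
        proposed_matches.add(m)
        if m in matches:
            tp.add(m)
        else:
            fp.add(m)

    for m in matches:
        if m not in proposed_matches:
            fn.add(m)

    prec = len(tp)/(len(tp) + len(fp))  # precision
    rec = len(tp)/(len(tp) + len(fn))   # recall

    # 'false positive' is 1 - precision, 'false negative' is 1 - recall,
    # and 'accuracy' is the F1 score (harmonic mean of precision and recall)
    return {'false positive': 1-prec,
            'false negative': 1-rec,
            'accuracy': 2*(prec*rec)/(prec+rec) }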


#prints out the accuracy
now = datetime.datetime.now()
out = eval_matching(retrieve(matching_set)) # retrieve() returns a list of tuples of DBLP2 ids and ACM ids.
timing = (datetime.datetime.now()-now).total_seconds()
print("----Accuracy----")
print(out)
print("---- Timing ----")
print(timing,"seconds")

----Accuracy----
{'false positive': 0.028458498023715362, 'false negative': 0.44764044943820225, 'accuracy': 0.704297994269341}
---- Timing ----
0.006984 seconds
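
The reported numbers can be reproduced from counts of true positives, false positives, and false negatives. The counts below (1229, 36, 996) are not printed in the notebook; they are an assumption, inferred because they yield the ratios shown above. The sketch also makes explicit that the value labeled `accuracy` is the F1 score.

```python
def scores(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1

# Assumed counts (inferred from the reported ratios, not printed in the notebook)
prec, rec, f1 = scores(tp=1229, fp=36, fn=996)

print(round(1 - prec, 4))  # share of proposed pairs that are wrong
print(round(1 - rec, 4))   # share of true pairs that were missed
print(round(f1, 4))        # value reported as 'accuracy'
```

Read this way, the matching proposes few wrong pairs but misses a large share of the true ones, which is consistent with a strict threshold.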


After trying out a few different string similarity libraries (the most promising being https://pypi.org/project/editdistance/0.3.1/ and https://pypi.org/project/textdistance/#description), I observed that the `jaccard_similarity` function that I wrote was surprisingly faster than the textdistance library implementation, so I've reverted the code to use `jaccard_similarity` until we can find a better library for it. The `editdistance` library, on the other hand, is a very fast implementation of the Levenshtein distance written in C++. I've therefore written two more functions, `calc_max_weight_edit` and `calc_min_weight_edit`, so that it is easier to switch between the similarity metrics in a more generalized code structure.

Another observation worth noting is that the Jaccard and the Levenshtein similarity metrics gave the same score (about 0.70). The value reported as `accuracy` is computed as `2*(prec*rec)/(prec+rec)`, which is the F1 score.