## Matching on DBLP ACM Dataset

### About the Dataset:
* This dataset was taken from the "Benchmark datasets for entity resolution" web page. 
* It contains bibliographic data, with 4 attributes: title, authors, venue, year. 
* There are 3 CSV files in this zip archive.

### Task: Measure how accurate the bipartite matching algorithm is using the datasets and the ground truth presented in the data
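
One common way to quantify accuracy once a matching is available is precision and recall against the ground-truth pairs. The sketch below assumes both the predicted matching and the ground truth can be expressed as `(id_a, id_b)` pairs; the function name `evaluate_matching` and the toy ids are hypothetical and not taken from the dataset.

```python
def evaluate_matching(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) for predicted (id_a, id_b) pairs."""
    predicted = set(predicted_pairs)
    truth = set(true_pairs)
    true_positives = len(predicted & truth)
    # Guard against empty inputs so we never divide by zero
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example with made-up ids (not real records)
predicted = [("a1", "b1"), ("a2", "b3"), ("a3", "b2")]
truth = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
precision, recall, f1 = evaluate_matching(predicted, truth)
print(precision, recall, f1)
```

Here one of the three predicted pairs is correct, and one of the four true pairs is recovered.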

Some assumptions and notes:
* We are checking string similarity using the `titles` column of the 2 datasets

* `encoding='latin-1'` is passed to `pd.read_csv` in the function `convert_df` to avoid pandas decoding errors when reading the files
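
To illustrate why the encoding argument matters, here is a minimal sketch using an in-memory CSV. The row content is made up for the example; it only assumes that the real files contain non-ASCII characters stored as Latin-1 bytes, which would explain the read errors.

```python
import io
import pandas as pd

# Hypothetical one-row CSV with a non-ASCII author name, encoded as Latin-1
raw_bytes = "id,title,authors\n1,Query Optimization,José Müller\n".encode("latin-1")

# Reading with the default UTF-8 decoding fails on these bytes
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing encoding='latin-1' decodes the same bytes correctly
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(utf8_ok, df.loc[0, "authors"])
```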

#### First, we'll include the bipartite matching algorithm components from our previous findings. We can try out different string matching techniques. 

#### A good starting point would be to see how `editDistance` is working in this context, so we'll include the function below.

In [1]:
def editDistance(str1, str2, m, n): 
  
    # If first string is empty, the only option is to 
    # insert all characters of second string into first 
    if m == 0: 
        return n 
  
    # If second string is empty, the only option is to 
    # remove all characters of first string 
    if n == 0: 
        return m 
  
    # If last characters of two strings are same, nothing 
    # much to do. Ignore last characters and get count for 
    # remaining strings. 
    if str1[m-1]== str2[n-1]: 
        return editDistance(str1, str2, m-1, n-1) 
  
    # If last characters are not same, consider all three 
    # operations on last character of first string, recursively 
    # compute minimum cost for all three operations and take 
    # minimum of three values. 
    return 1 + min(editDistance(str1, str2, m, n-1),    # Insert 
                   editDistance(str1, str2, m-1, n),    # Remove 
                   editDistance(str1, str2, m-1, n-1)    # Replace 
                   ) 


def jaccard_similarity(list1, list2):
    # Compare as sets so duplicate items do not inflate the union
    set1, set2 = set(list1), set(list2)
    union = len(set1 | set2)
    if union == 0:
        return 0.0  # both inputs empty: define similarity as 0
    return len(set1 & set2) / union
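
The recursive `editDistance` above recomputes the same subproblems many times, so its running time grows exponentially with string length, which is a concern for strings as long as paper titles. As a hedged alternative, the sketch below uses the standard dynamic-programming formulation of the same distance; the name `edit_distance_dp` is hypothetical and not part of the original notebook.

```python
def edit_distance_dp(str1, str2):
    """Edit distance in O(len(str1) * len(str2)) time, keeping two table rows."""
    # previous[j] holds the distance between the empty prefix of str1 and str2[:j]
    previous = list(range(len(str2) + 1))
    for i, ch1 in enumerate(str1, start=1):
        current = [i]  # distance between str1[:i] and the empty string
        for j, ch2 in enumerate(str2, start=1):
            cost = 0 if ch1 == ch2 else 1
            current.append(min(previous[j] + 1,         # remove from str1
                               current[j - 1] + 1,      # insert into str1
                               previous[j - 1] + cost)) # replace (or keep)
        previous = current
    return previous[-1]


print(edit_distance_dp("kitten", "sitting"))
```

It takes the two strings directly, so it could stand in for `editDistance(x, y, len(x), len(y))` wherever that call is used.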

In [2]:
import pandas as pd
import networkx as nx

"""

Computes the edit distance between two strings

Input: 2 strings that will be measured for similarity
Output: The edit distance between them (a lower value means more similar)
"""
def similarity_edit(x,y):
    return editDistance(x,y,len(x),len(y))



"""

Transforms the given file to a pandas dataframe object if it was not one already
Assumption: Assumes that the data starts from the 1st row of given file, does not use separators such as "," or ";"

Input: Any file
Output: A pandas dataframe object
"""
def convert_df(file):
    if isinstance(file, pd.DataFrame):
        return file
    else:
        df = pd.read_csv(file, encoding='latin-1')
        return df
"""

Calculates maximum weight for the matching

Input: keys from 2 tables
Output: weight for each matching to be used in the weight part of constructing the graph
"""
def calc_max_weight(key1, key2):
    weight = jaccard_similarity(key1,key2)
    return weight

"""

Calculates minimum weight for the matching

Input: keys from 2 tables
Output: weight for each matching to be used in the weight part of constructing the graph
"""
def calc_min_weight(key1, key2):
    weight = (-1)/(1+jaccard_similarity(key1,key2))
    return weight

"""

Converts the dataframe into a dictionary keyed by its first column, for easier matching of pairs. 
Assumption: The data has headers in the first row (description of what that column describes)

Input: A pandas dataframe object
Output: A dictionary mapping each first-column value to a tuple of the remaining column values
"""
def make_dict(file):
    V = list(file.to_dict('list').values())
    keys = V[0]
    values = zip(*V[1:])
    table = dict(zip(keys,values))
    return table
            
"""

Constructs a maximal bipartite graph of the given two tables

Input: Any 2 files in any format
Output: A Bipartite Graph with Maximal Weights
"""
def updated_maximal_construct_graph(file_a, file_b):
    table_a_unprocessed = convert_df(file_a)
    table_b_unprocessed = convert_df(file_b)
    bipartite_graph = nx.Graph()
    
    table_a = make_dict(table_a_unprocessed)
    table_b = make_dict(table_b_unprocessed)
    
    i=0
    
    for key1, val1 in table_a.items():
       # print(val1)
        id1 = str(key1) + '_' + str(val1) + '_1'
        for key2, val2 in table_b.items():
            i+=1
            if i%100000 == 0:
                print(str(round(100*i/len(table_a)/len(table_b),2))+'% complete')
            #add value to identifier to distinguish two entries with different values
            id2 = str(key2) + '_' + str(val2) + '_2' 
            bipartite_graph.add_edge(id1, id2, weight=calc_max_weight(val1, val2))
            # the weight is the Jaccard similarity, so more similar pairs get larger weights
            # the _1 and _2 suffixes distinguish key-value tuples belonging to different tables
    return bipartite_graph

"""

Constructs a minimal bipartite graph of the given two tables

Input: Any 2 files in any format
Output: A Bipartite Graph with Minimal Weights
"""
def updated_minimal_construct_graph(file_a, file_b):
    table_a_unprocessed = convert_df(file_a)
    table_b_unprocessed = convert_df(file_b)
    bipartite_graph = nx.Graph()
    
    table_a = make_dict(table_a_unprocessed)
    table_b = make_dict(table_b_unprocessed)
    
    i=0
    
    for key1, val1 in table_a.items():
        id1 = str(key1) + '_' + str(val1) + '_1'
        for key2, val2 in table_b.items():
            i+=1
            if i%100000 == 0:
                print(str(round(100*i/len(table_a)/len(table_b),2))+'% complete')
            #add value to identifier to distinguish two entries with different values
            id2 = str(key2) + '_' + str(val2) + '_2' 
            bipartite_graph.add_edge(id1, id2, weight=calc_min_weight(key1, key2)) 
            #edit distance and weight should be inv. prop.
            #also adding 1 to denom. to prevent divide by 0
            # add 1,2 to distinguish two key-value tuples belonging to different tables
    return bipartite_graph

bipartite_graph_maximal = updated_maximal_construct_graph("table_a.csv","table_b.csv")
#print(bipartite_graph_maximal.edges.data())
bipartite_graph_minimal = updated_minimal_construct_graph("table_a.csv", "table_b.csv")
bipartite_graph_minimal.edges.data()

EdgeDataView([("US_('300 M', 1)_1", "US_('300 M', 1)_2", {'weight': -0.5}), ("US_('300 M', 1)_1", "CN_('12 B', 2)_2", {'weight': -1.0}), ("US_('300 M', 1)_1", "CA_('50 M', 3)_2", {'weight': -1.0}), ("US_('300 M', 1)_1", "AU_('25 M', 4)_2", {'weight': -0.75}), ("US_('300 M', 1)_2", "CN_('12 B', 2)_1", {'weight': -1.0}), ("US_('300 M', 1)_2", "CA_('50 M', 3)_1", {'weight': -1.0}), ("US_('300 M', 1)_2", "AU_('25 M', 4)_1", {'weight': -0.75}), ("CN_('12 B', 2)_2", "CN_('12 B', 2)_1", {'weight': -0.5}), ("CN_('12 B', 2)_2", "CA_('50 M', 3)_1", {'weight': -0.75}), ("CN_('12 B', 2)_2", "AU_('25 M', 4)_1", {'weight': -1.0}), ("CA_('50 M', 3)_2", "CN_('12 B', 2)_1", {'weight': -0.75}), ("CA_('50 M', 3)_2", "CA_('50 M', 3)_1", {'weight': -0.5}), ("CA_('50 M', 3)_2", "AU_('25 M', 4)_1", {'weight': -0.75}), ("AU_('25 M', 4)_2", "CN_('12 B', 2)_1", {'weight': -1.0}), ("AU_('25 M', 4)_2", "CA_('50 M', 3)_1", {'weight': -0.75}), ("AU_('25 M', 4)_2", "AU_('25 M', 4)_1", {'weight': -0.5})])

In [3]:
#nx.algorithms.matching.max_weight_matching(bipartite_graph_maximal)
print(nx.algorithms.bipartite.matching.maximum_matching(bipartite_graph_minimal))

{"US_('300 M', 1)_1": "US_('300 M', 1)_2", "CN_('12 B', 2)_1": "CN_('12 B', 2)_2", "AU_('25 M', 4)_1": "CA_('50 M', 3)_2", "CA_('50 M', 3)_1": "AU_('25 M', 4)_2", "CA_('50 M', 3)_2": "AU_('25 M', 4)_1", "US_('300 M', 1)_2": "US_('300 M', 1)_1", "CN_('12 B', 2)_2": "CN_('12 B', 2)_1", "AU_('25 M', 4)_2": "CA_('50 M', 3)_1"}
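
Note that `maximum_matching` returns a maximum-cardinality matching and does not look at the `weight` attribute, which is consistent with `AU` and `CA` being paired with each other in the output above. As a small sketch of a weight-aware alternative, the toy graph below (made-up node names and weights, where higher means more similar) is matched with `nx.max_weight_matching`.

```python
import networkx as nx

# Toy bipartite graph with made-up similarity weights (higher = more similar)
toy = nx.Graph()
toy.add_edge("AU_1", "AU_2", weight=1.0)
toy.add_edge("AU_1", "CA_2", weight=0.3)
toy.add_edge("CA_1", "AU_2", weight=0.3)
toy.add_edge("CA_1", "CA_2", weight=1.0)

# Weight-aware matching: picks the pairing with the largest total weight
matching = nx.max_weight_matching(toy, maxcardinality=True)

# Sort each pair so that the "_1" node always comes first
pairs = {tuple(sorted(pair)) for pair in matching}
print(pairs)
```

Here the identical records are paired because that pairing has a total weight of 2.0, compared with 0.6 for the crossed pairing.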


### Secondly, load the data for processing

In [4]:
# Sticking to the convention of table_a and table_b naming that we previously used for generalization purposes

table_a = convert_df("ACM.csv")

table_b = convert_df("DBLP2.csv")

In [5]:
graph_maximal = updated_maximal_construct_graph(table_a, table_b)
graph_maximal.number_of_edges()

1.67% complete
3.33% complete
5.0% complete
6.67% complete
8.33% complete
10.0% complete
11.66% complete
13.33% complete
15.0% complete
16.66% complete
18.33% complete
20.0% complete
21.66% complete
23.33% complete
25.0% complete
26.66% complete
28.33% complete
29.99% complete
31.66% complete
33.33% complete
34.99% complete
36.66% complete
38.33% complete
39.99% complete
41.66% complete
43.33% complete
44.99% complete
46.66% complete
48.32% complete
49.99% complete
51.66% complete
53.32% complete
54.99% complete
56.66% complete
58.32% complete
59.99% complete
61.66% complete
63.32% complete
64.99% complete
66.65% complete
68.32% complete
69.99% complete
71.65% complete
73.32% complete
74.99% complete
76.65% complete
78.32% complete
79.99% complete
81.65% complete
83.32% complete
84.98% complete
86.65% complete
88.32% complete
89.98% complete
91.65% complete
93.32% complete
94.98% complete
96.65% complete
98.32% complete
99.98% complete


6001104

In [None]:
#print(nx.algorithms.matching.max_weight_matching(graph_maximal))

The problem is that `max_weight_matching` runs in O(n^3) time, where n is the number of nodes. With roughly 6 million edges in the graph, the call above did not finish in a practical amount of time. This suggests looking beyond the networkx package, because its weighted matching algorithms do not do any better than O(n^3). I've found the following alternatives:

* https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html (this seems to be the best solution)

* https://pypi.org/project/hungarian/
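
As a minimal sketch of how the first alternative could be used, the example below runs `scipy.optimize.linear_sum_assignment` on a small made-up similarity matrix. It assumes the pairwise similarities are stored in a dense array with one row per record of table A and one column per record of table B, which is not how the notebook currently stores them.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up similarity scores: rows = records of table A, columns = records of table B
similarity = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.3],
    [0.1, 0.4, 0.7],
])

# maximize=True asks for the assignment with the largest total similarity
row_ind, col_ind = linear_sum_assignment(similarity, maximize=True)

matches = list(zip(row_ind.tolist(), col_ind.tolist()))
total = similarity[row_ind, col_ind].sum()
print(matches, total)
```

For the full dataset, a dense float64 matrix with 6,001,104 entries would take roughly 48 MB, so memory should not be the limiting factor; whether the run time is acceptable would need to be measured.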