## Maximal Bipartite Matching for Inexact Matches Between 2 Tables

#### Goal: Explore some interesting problems around using PCs to model fuzzy joins

#### Some examples: inexact matches between tables, and finding the highest and lowest possible aggregate after a match

### Task: Build a maximal bipartite matching algorithm

### Step 1: Define a bipartite graph. (Assume we are given any similarity metric; the edit distance used here is taken from GeeksforGeeks.)

In [1]:
def editDistance(str1, str2, m, n): 
  
    # If first string is empty, the only option is to 
    # insert all characters of second string into first 
    if m == 0: 
        return n 
  
    # If second string is empty, the only option is to 
    # remove all characters of first string 
    if n == 0: 
        return m 
  
    # If last characters of two strings are same, nothing 
    # much to do. Ignore last characters and get count for 
    # remaining strings. 
    if str1[m-1]== str2[n-1]: 
        return editDistance(str1, str2, m-1, n-1) 
  
    # If last characters are not same, consider all three 
    # operations on last character of first string, recursively 
    # compute minimum cost for all three operations and take 
    # minimum of three values. 
    return 1 + min(editDistance(str1, str2, m, n-1),    # Insert 
                   editDistance(str1, str2, m-1, n),    # Remove 
                   editDistance(str1, str2, m-1, n-1)    # Replace 
                   ) 
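
The recursion above recomputes the same `(m, n)` subproblems many times, so its running time grows exponentially with string length. As a minimal sketch, assuming the same recurrence is wanted, the results can be cached with `functools.lru_cache`. The name `edit_distance_memo` is hypothetical and not part of the original notebook.

```python
from functools import lru_cache


def edit_distance_memo(str1, str2):
    """Edit distance using the same recurrence as editDistance, with caching."""

    @lru_cache(maxsize=None)
    def solve(m, n):
        # One prefix is empty: insert or remove the rest of the other.
        if m == 0:
            return n
        if n == 0:
            return m
        # Matching last characters cost nothing.
        if str1[m - 1] == str2[n - 1]:
            return solve(m - 1, n - 1)
        return 1 + min(solve(m, n - 1),      # insert
                       solve(m - 1, n),      # remove
                       solve(m - 1, n - 1))  # replace

    return solve(len(str1), len(str2))


print(edit_distance_memo("US", "USA"))          # 1
print(edit_distance_memo("kitten", "sitting"))  # 3
```

Each `(m, n)` pair is now solved once, giving roughly `len(str1) * len(str2)` work. The function is still recursive, so very long strings could hit Python's recursion limit.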

In [2]:
import networkx as nx

def similarity(x,y):
    # Despite the name, this returns the edit distance: lower means more similar.
    return editDistance(x,y,len(x),len(y))

def construct_graph(table_a, table_b):
    bipartite_graph = nx.Graph()
    for first in table_a:
        key1, val1 = first
        id1 = key1 + '_' + str(val1) + '_1'  # the '_1' suffix marks nodes from table_a
        for second in table_b:
            key2, val2 = second
            # Include the value in the identifier to distinguish entries that share a key,
            # and the '_2' suffix to mark nodes from table_b.
            id2 = key2 + '_' + str(val2) + '_2'
            # Edit distance and weight should be inversely proportional;
            # adding 1 to the denominator prevents division by zero.
            bipartite_graph.add_edge(id1, id2, weight=1/(1+similarity(key1,key2)))
    return bipartite_graph

# defaultdict creates a default entry for a missing key instead of raising a KeyError.
# (It is imported here but not used in the code shown.)
from collections import defaultdict 

# Each table is a list of (key, value) tuples.
table_a = [('US', '300 M'), ('CN', '300 M'), ('CA', '300 M'), ('AU', '300 M'),('USA', '35 T')] 
table_b = [('USA', '35 T'), ('USA', '32 T'), ('UK', '3 T'), ('AUS', '20 T'), ('CAL', '22 T')]
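
Because node identifiers are built from the key and value, two identical rows in the same table would collapse into a single node. One possible workaround, sketched here under the assumption that row position is an acceptable identifier, is to use `(table_tag, row_index)` tuples as nodes. The helper names `levenshtein` and `weighted_edges` are hypothetical, and the iterative distance stands in for `editDistance`.

```python
def levenshtein(a, b):
    """Iterative edit distance, keeping one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # remove from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def weighted_edges(table_a, table_b, distance=levenshtein):
    """Return (node_a, node_b, weight) triples for every pair of rows.

    Nodes are (table_tag, row_index) tuples, so duplicate rows stay distinct.
    """
    edges = []
    for i, (key_a, _) in enumerate(table_a):
        for j, (key_b, _) in enumerate(table_b):
            weight = 1 / (1 + distance(key_a, key_b))
            edges.append((("a", i), ("b", j), weight))
    return edges


rows_a = [("US", "300 M"), ("US", "300 M")]  # two identical rows
rows_b = [("USA", "35 T")]
edges = weighted_edges(rows_a, rows_b)
print(edges)
```

The triples are in the shape accepted by networkx's `Graph.add_weighted_edges_from`, so they could be loaded into a graph in one call.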

### Tests:

#### Table A:

US -> 300 M

CN -> 12 B

CA -> 50 M

AU -> 25 M

#### Table B:

USA -> 35 T

UK -> 3 T

USA -> 32 T

AUS -> 20 T

CAL -> 22 T

table_a = { "US" : ["300 M"],
          "CN" : ["12 B"],
          "CA" : ["50 M"],
          "AU" : ["25 M"]
        } 
        
table_b = { "USA" : ["35 T", "32 T"],
          "UK" : ["3 T"],
          "AUS" : ["20 T"],
          "CAL" : ["22 T"]
        } 
        
matching = { "US" : "USA",
          "CN" : "",
          "CA" : "CAL",
          "AU" : "AUS",
        }
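
A Python dict keeps only one entry per key, so a key with several values needs all of them in one list. `construct_graph` iterates over `(key, value)` tuples, so a dict of this shape has to be flattened first. A minimal sketch, assuming every dict value is a list (the helper name `to_rows` and the sample dict are hypothetical):

```python
def to_rows(table):
    """Flatten {key: [value, ...]} into a list of (key, value) tuples."""
    return [(key, value) for key, values in table.items() for value in values]


table_b_dict = {"USA": ["35 T", "32 T"], "UK": ["3 T"]}
print(to_rows(table_b_dict))
# [('USA', '35 T'), ('USA', '32 T'), ('UK', '3 T')]
```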
       

In [3]:
bipartite_graph = construct_graph(table_a,table_b)
bipartite_graph.edges.data()

### Step 2: Detail possible edge cases for maximal bipartite matching

In [4]:
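# 1. Both tables contain the same key-value pair, e.g. ('USA', '35 T').
# 2. table_a = (US, UK), table_b = (USA, US). The natural matching would be US-USA, UK-US,
#    but the metric does not favor it: with weight 1/(1 + edit distance) it totals 1/2 + 1/2 = 1,
#    while the other matching, US-US and UK-USA, totals 1 + 1/3.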
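
The second edge case is small enough to check by brute force. The sketch below enumerates both perfect matchings and totals their weights under `1 / (1 + edit distance)`. It assumes an iterative edit distance (the hypothetical helper `levenshtein`) in place of `editDistance`, and it considers only perfect matchings, which is enough here because every weight is positive.

```python
from itertools import permutations


def levenshtein(a, b):
    """Iterative edit distance, keeping one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def weight(a, b):
    # Same transformation as in construct_graph.
    return 1 / (1 + levenshtein(a, b))


keys_a = ["US", "UK"]
keys_b = ["USA", "US"]

# Enumerate every perfect matching and total its edge weights.
totals = {}
for perm in permutations(keys_b):
    pairs = tuple(zip(keys_a, perm))
    totals[pairs] = sum(weight(a, b) for a, b in pairs)

for pairs, total in totals.items():
    print(pairs, round(total, 3))
# (('US', 'USA'), ('UK', 'US')) 1.0
# (('US', 'US'), ('UK', 'USA')) 1.333

best = max(totals, key=totals.get)
print("best:", best)
```

Under this weighting the exact match US-US dominates, so a maximum-weight matching pairs UK with USA.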

### Step 3: Maximal Bipartite Matching Algorithm

In [5]:
nx.algorithms.matching.max_weight_matching(bipartite_graph)

{('AUS_20 T_2', 'AU_300 M_1'),
 ('CAL_22 T_2', 'CA_300 M_1'),
 ('UK_3 T_2', 'CN_300 M_1'),
 ('USA_32 T_2', 'USA_35 T_1'),
 ('USA_35 T_2', 'US_300 M_1')}
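
The returned set holds unordered pairs of node identifiers, so the table A node can appear on either side. As a rough sketch, the pairs can be oriented and split back into `(key, value)` rows by reading the `_1` / `_2` suffix. This assumes keys contain no underscore, and the helper names `parse_node` and `orient` are hypothetical.

```python
def parse_node(node_id):
    """Split 'KEY_VALUE_TAG' into (key, value, tag); assumes the key has no '_'."""
    rest, tag = node_id.rsplit("_", 1)
    key, value = rest.split("_", 1)
    return key, value, tag


def orient(matching):
    """Return sorted (table_a_row, table_b_row) pairs from an unordered matching."""
    pairs = []
    for u, v in matching:
        pu, pv = parse_node(u), parse_node(v)
        if pu[2] == "2":  # put the table A node (tag '1') first
            pu, pv = pv, pu
        pairs.append((pu[:2], pv[:2]))
    return sorted(pairs)


# The matching shown in the output above.
matching = {('AUS_20 T_2', 'AU_300 M_1'),
            ('CAL_22 T_2', 'CA_300 M_1'),
            ('UK_3 T_2', 'CN_300 M_1'),
            ('USA_32 T_2', 'USA_35 T_1'),
            ('USA_35 T_2', 'US_300 M_1')}

for row_a, row_b in orient(matching):
    print(row_a, "->", row_b)
# ('AU', '300 M') -> ('AUS', '20 T')
# ('CA', '300 M') -> ('CAL', '22 T')
# ('CN', '300 M') -> ('UK', '3 T')
# ('US', '300 M') -> ('USA', '35 T')
# ('USA', '35 T') -> ('USA', '32 T')
```

Read this way, the result shows the first edge case from Step 2: the table A row `('USA', '35 T')` is paired with `('USA', '32 T')` rather than with its exact duplicate, because both edges have weight 1 and the matching maximizes the total.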