## Experiments on Matching - Bipartite Matching, Naive Matching, Simulation Matching

### 1. Bipartite Graph Maximal-Minimal Matching Approach

In [1]:
import sys
from os import path
sys.path.insert(0, '../src')
import one_to_n

import datetime
import textdistance
import editdistance
import pandas as pd
import networkx as nx

In [2]:
table_a = one_to_n.lat_convert_df("../Amazon-GoogleProducts/Amazon.csv")

table_b = one_to_n.lat_convert_df("../Amazon-GoogleProducts/GoogleProducts.csv")

now = datetime.datetime.now()
bipartite_graph_result = one_to_n.valcomp_treshold_updated_maximal_construct_graph(table_a, table_b, "title", 0.5)
timing_tresh = (datetime.datetime.now()-now).total_seconds()
print("---- Timing for Graph Construction with Treshold Constraint ----")
print(timing_tresh,"seconds")

2.27% complete
4.55% complete
6.82% complete
9.1% complete
11.37% complete
13.65% complete
15.92% complete
18.19% complete
20.47% complete
22.74% complete
25.02% complete
27.29% complete
29.57% complete
31.84% complete
34.11% complete
36.39% complete
38.66% complete
40.94% complete
43.21% complete
45.49% complete
47.76% complete
50.03% complete
52.31% complete
54.58% complete
56.86% complete
59.13% complete
61.4% complete
63.68% complete
65.95% complete
68.23% complete
70.5% complete
72.78% complete
75.05% complete
77.32% complete
79.6% complete
81.87% complete
84.15% complete
86.42% complete
88.7% complete
90.97% complete
93.24% complete
95.52% complete
97.79% complete
---- Timing for Graph Construction with Treshold Constraint ----
18.329041 seconds


**Graph construction is the most time-expensive step of this approach (about 18.3 seconds here), but the resulting graph can be reused, so ideally a user does not need to run graph construction more than once.**
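
One way to avoid rebuilding the graph is to cache it on disk. The sketch below is a minimal example using only the standard library; `load_or_build` and the file name in the usage comment are hypothetical names introduced here, not part of `one_to_n`. It assumes the object returned by graph construction can be pickled, which is normally the case for a `networkx` graph.

```python
import pickle
from pathlib import Path

def load_or_build(cache_path, build_fn):
    """Return the cached object if the file exists; otherwise build, cache, and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Only unpickle files you created yourself: pickle can execute arbitrary code.
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    result = build_fn()
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result

# Possible usage in this notebook (hypothetical file name):
# bipartite_graph_result = load_or_build(
#     "bipartite_graph.pkl",
#     lambda: one_to_n.valcomp_treshold_updated_maximal_construct_graph(
#         table_a, table_b, "title", 0.5),
# )
```

The cache file has to be deleted by hand whenever the input tables or the threshold change, since this sketch does not track them.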

In [23]:
# print(bipartite_graph_result.edges())

In [7]:
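import re

def SUM_edit_edge_weight(bip_graph):
    """Set each edge weight to the sum of the numeric values encoded in its two node names.

    The graph is modified in place and also returned.
    """
    for u, v, d in bip_graph.edges(data=True):
        # Node names are "_"-separated; the third field is read as the value.
        val_tuple_1 = u.split("_")
        val_tuple_2 = v.split("_")

        # Keep digits only. Note that this also drops any decimal point or sign.
        val1 = re.sub("[^0-9]", "", val_tuple_1[2])
        val2 = re.sub("[^0-9]", "", val_tuple_2[2])

        d['weight'] = float(val1) + float(val2)

    return bip_graph

# sum_weighted_graph and bipartite_graph_result refer to the same graph object.
sum_weighted_graph = SUM_edit_edge_weight(bipartite_graph_result)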

### SUM Maximal Matching Outcome

In [20]:
print("\n\n 'SUM' MAXIMAL MATCHING RESULTS:")
now = datetime.datetime.now()
matching_set_maximal = nx.algorithms.matching.max_weight_matching(sum_weighted_graph)
timing_match = (datetime.datetime.now()-now).total_seconds()
print("---- Timing for Matching (Done on the graph constructed with the treshold constraint) ----")
print(timing_match,"seconds")
print("The number of edges in the graph is:", sum_weighted_graph.number_of_edges(), "\n")


# print("The Maximal Matching Set is:", matching_set_maximal, "\n")




 'SUM' MAXIMAL MATCHING RESULTS:
---- Timing for Matching (Done on the graph constructed with the treshold constraint) ----
3.787009 seconds
The number of edges in the graph is: 1215 
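
As a small illustration of what `max_weight_matching` returns, the sketch below runs it on a toy bipartite graph. The node names and weights are made up for this example and are unrelated to the product data.

```python
import networkx as nx

# Toy bipartite graph: "a*" nodes on one side, "g*" nodes on the other (hypothetical names).
toy = nx.Graph()
toy.add_edge("a1", "g1", weight=5)
toy.add_edge("a1", "g2", weight=3)
toy.add_edge("a2", "g1", weight=4)

# Each node appears in at most one matched pair; the total weight is maximized.
matching = nx.max_weight_matching(toy)

# The order of the two nodes inside each pair is not guaranteed, so normalize it.
pairs = {tuple(sorted(edge)) for edge in matching}
total_weight = sum(toy[u][v]["weight"] for u, v in pairs)
print(pairs, total_weight)
```

Here the heaviest single edge (`a1`, `g1`, weight 5) is not selected, because pairing `a1` with `g2` and `a2` with `g1` gives a larger total weight of 7.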



In [22]:
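# editdistance, sys, and datetime are already imported in the first cell.
import os

from Matching import core
from Matching import analyze
from Matching import matcher

# print(os.getcwd())

def eval_graph_matching(match_results):
    """Convert matched node-name pairs into (non-URL id, URL id) tuples for evaluation."""
    results_tuple = []
    for (val1, val2) in match_results:
        # The record id is the first "_"-separated field of each node name.
        id1 = val1.split("_")[0]
        id2 = val2.split("_")[0]
        # The matching returns each pair in arbitrary order; put the id starting with "http" second.
        if id1.startswith("http"):
            results_tuple.append((id2, id1))
        else:
            results_tuple.append((id1, id2))
    return results_tuple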

graph_matching_outcome = eval_graph_matching(matching_set_maximal)
# print(graph_matching_outcome)
print('Bipartite Matching Performance (Maximal Matching Case): ' + str(core.eval_matching(graph_matching_outcome)))


Bipartite Matching Performance (Maximal Matching Case): {'false positive': 0.39, 'false negative': 0.68, 'accuracy': 0.42}


### 2. Naive Matching Performance Evaluation

In [18]:
sample_size = 1000

amzn = core.amazon_catalog()
goog = core.google_catalog()
print('Loaded catalogs.')

/Users/denizturkcapar/Desktop/Data Research/Research-Bipartite-Matching-Problem/experiments
Loaded catalogs.


In [19]:
print('Performing compare all match (edit distance)...')
now = datetime.datetime.now()
compare_all_edit_match = matcher.matcher(amzn,goog,editdistance.eval, matcher.all)
naive_time_edit = (datetime.datetime.now()-now).total_seconds()
print("Naive Edit Distance Matching computation time taken: ", naive_time_edit, " seconds")
print('Compare All Matcher (Edit Distance) Performance: ' + str(core.eval_matching(compare_all_edit_match)))



print('Performing compare all match (jaccard distance)...')
now = datetime.datetime.now()
compare_all_jaccard_match = matcher.matcher(amzn,goog,analyze.jaccard_calc, matcher.all)
naive_time_jaccard = (datetime.datetime.now()-now).total_seconds()
print("Naive Jaccard Matching computation time taken: ", naive_time_jaccard, " seconds")
print('Compare All Matcher (Jaccard Distance) Performance: ' + str(core.eval_matching(compare_all_jaccard_match)))

Performing compare all match (edit distance)...
Naive Edit Distance Matching computation time taken:  53.383402  seconds
Compare All Matcher (Edit Distance) Performance: {'false positive': 0.72, 'false negative': 0.7, 'accuracy': 0.29}
Performing compare all match (jaccard distance)...
Naive Jaccard Matching computation time taken:  33.897103  seconds
Compare All Matcher (Jaccard Distance) Performance: {'false positive': 0.45, 'false negative': 0.42, 'accuracy': 0.57}


### 3. Random Sampling Matching Performance Evaluation

In [21]:
print('Performing random sample match (edit distance)...')
now = datetime.datetime.now()
random_sample_edit_match = matcher.matcher(amzn, goog, editdistance.eval, matcher.random_sample, sample_size)
sim_time_edit = (datetime.datetime.now()-now).total_seconds()
print("Simulation-Based Edit Distance Matching computation time taken: ", sim_time_edit, " seconds")
print('Random Sample Matcher (Edit Distance) Performance: ' + str(core.eval_matching(random_sample_edit_match)))

print('Performing random sample match (jaccard distance)...')
now = datetime.datetime.now()
random_sample_jaccard_match = matcher.matcher(amzn, goog, analyze.jaccard_calc, matcher.random_sample, sample_size)
sim_time_jaccard = (datetime.datetime.now()-now).total_seconds()
print("Simulation-Based Jaccard Matching computation time taken: ", sim_time_jaccard, " seconds")
print('Random Sample Matcher (Jaccard Distance) Performance: ' + str(core.eval_matching(random_sample_jaccard_match)))

Performing random sample match (edit distance)...
Simulation-Based Edit Distance Matching computation time taken:  32.660594  seconds
Random Sample Matcher (Edit Distance) Performance: {'false positive': 0.88, 'false negative': 0.88, 'accuracy': 0.12}
Performing random sample match (jaccard distance)...
Simulation-Based Jaccard Matching computation time taken:  24.757538  seconds
Random Sample Matcher (Jaccard Distance) Performance: {'false positive': 0.79, 'false negative': 0.78, 'accuracy': 0.21}


## Evaluation

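Key Findings:

- For the Jaccard similarity threshold of 0.5, matching on a bipartite graph took less time than both the naive and the random sampling approaches. The matching step alone took about 3.8 seconds, and graph construction plus matching took about 22.1 seconds, compared with 33.9 to 53.4 seconds for naive matching and 24.8 to 32.7 seconds for random sampling.

- Bipartite matching was less accurate than naive Jaccard matching, but more accurate than naive edit distance matching (0.29) and both random sampling runs (0.12 and 0.21).
    - Bipartite Matching Accuracy: 0.42
    - Naive Jaccard Matching Accuracy: 0.57

- Bipartite matching seemed to perform better once the tradeoff between accuracy and running time is taken into account.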
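
The comparison above can be tallied directly from the numbers printed in the cell outputs. The snippet below is only a convenience sketch; the dictionary keys are labels chosen here, and the values are copied from the outputs of this notebook.

```python
# Timings (seconds) and accuracies copied from the cell outputs in this notebook.
graph_construction = 18.329041
graph_matching = 3.787009

results = {
    "bipartite (construction + matching)": (graph_construction + graph_matching, 0.42),
    "naive edit distance": (53.383402, 0.29),
    "naive jaccard": (33.897103, 0.57),
    "random sample edit distance": (32.660594, 0.12),
    "random sample jaccard": (24.757538, 0.21),
}

# Print the approaches from fastest to slowest.
for name, (seconds, accuracy) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(f"{name:38s} {seconds:8.2f} s   accuracy {accuracy:.2f}")
```

If the graph is built once and reused, only the matching time applies to later runs.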

Note: Jaccard similarity fit this task better than edit distance, so Jaccard was used for the graph matching.
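
For reference, a word-level Jaccard similarity and a threshold check can be sketched as follows. This is an illustration only: it is not necessarily how `analyze.jaccard_calc` or the graph construction in `one_to_n` tokenizes titles or applies the threshold, and the two product titles are invented.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of the lowercase word sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # convention chosen here for two empty strings
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

THRESHOLD = 0.5  # same value as the threshold passed to graph construction

# Hypothetical titles: 3 shared words out of 5 distinct words.
score = jaccard_similarity("apple ipod nano 8gb", "Apple iPod nano 4GB")
keep_edge = score >= THRESHOLD  # assumption: an edge is kept at or above the threshold
print(score, keep_edge)
```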

### Possible Next Step

- Compare different [min_sum, max_sum] thresholds for the SUM operation across Naive, Simulation-Based, and Graph Matching.
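
One possible reading of this step is to keep only the edges whose SUM weight falls inside a given range and to repeat the matching for several ranges. The sketch below shows the filtering part only; `filter_edges_by_sum`, the edge list, and the ranges are hypothetical and chosen purely for illustration.

```python
def filter_edges_by_sum(edges, min_sum, max_sum):
    """Keep (u, v, weight) edges whose weight lies in [min_sum, max_sum]."""
    return [(u, v, w) for u, v, w in edges if min_sum <= w <= max_sum]

# Hypothetical weighted edges, in the same shape as graph.edges(data="weight").
edges = [("a1", "g1", 12.0), ("a1", "g2", 40.0), ("a2", "g1", 75.0)]

for min_sum, max_sum in [(0, 50), (10, 100), (50, 100)]:
    kept = filter_edges_by_sum(edges, min_sum, max_sum)
    print((min_sum, max_sum), len(kept), "edges kept")
```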