# Entity Resolution With PyTorch Geometric: usage guide

## Clean to clean entity resolution

This section explains how to use the library to perform entity resolution in a "clean to clean" setting, i.e., finding tuples that refer to the same entity across two tables, assuming that neither table contains matches within itself.

The assumptions on the data are the following:
* There are exactly two datasets
* No two tuples contained in the same dataset match each other
* The datasets and the graph generated from them fit in memory

Necessary imports:

In [1]:
from entity_resolution import *


The function: one_to_one_clean_ER

Parameters:
- dfpathA (str): path to the first dataset 
- dfpathB (str): path to the second dataset
- n_epochs (int): the number of training epochs--optional (the example below passes n_epochs=100)
- p (int): higher values improve exploration during the generation of random walks--optional, default is 20
- q (int): lower values improve exploration during the generation of random walks--optional, default is 1
- n_similar (int): maximum number of closest tuples to find--optional, default is 10
- walks_per_node (int): number of random walks to generate for each node in the graph--optional, default is 20
- n_top (int): number of closest tuples to find (<= n_similar)--optional, default is 10
- embedding_size (int): the size of the node embeddings--optional, default is 128
- walk_length (int): length of the generated random walks--optional, default is 10
- use_faiss (bool): if True, use faiss to find the n_top closest nodes--optional, default is True
- file_directory (str): directory in which to save the intermediate data--optional, default is None
- load_embedding_file (bool): if True, load the embeddings from file_directory and skip training--optional, default is False--set to True only if file_directory is provided and contains the required data
- load_graph (bool): if True, load the graph from file_directory and skip its generation--optional, default is False--set to True only if file_directory is provided and contains the required data
- load_n_best (bool): if True, load the previously computed closest tuples from file_directory and skip their computation--optional, default is False--set to True only if file_directory is provided and contains the required data
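
The effect of p and q is easiest to see in the transition weights of a biased random walk. The sketch below assumes the two parameters follow the usual node2vec convention (weight 1/p for stepping back to the previous node, 1/q for moving farther away from it); this guide does not confirm that the library uses exactly this scheme, and the function name `node2vec_weights` and the toy graph are hypothetical.

```python
def node2vec_weights(graph, prev, curr, p, q):
    """Unnormalized weights for the next step of a node2vec-style walk.

    graph maps each node to the set of its neighbors; the walk has just
    moved from prev to curr.
    """
    weights = {}
    for nxt in graph[curr]:
        if nxt == prev:
            weights[nxt] = 1.0 / p  # stepping back to the previous node
        elif nxt in graph[prev]:
            weights[nxt] = 1.0      # staying close to the previous node
        else:
            weights[nxt] = 1.0 / q  # moving farther away from it
    return weights


# Toy undirected graph (hypothetical data, not produced by the library).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

# Defaults listed above: p=20, q=1.
weights = node2vec_weights(graph, prev="a", curr="b", p=20, q=1)
print(sorted(weights.items()))
```

Under this assumption, a large p makes the step back to "a" much less likely than the other two moves, which is consistent with the description that higher values of p improve exploration.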

Output: a set of pairs in the format (tp\_"table\_from\_i"\_"index\_j", tp\_"table\_from\_m"\_"index\_n"); each pair represents a match found between the two datasets
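
The identifiers in each pair can be split back into a table label and an index. The following is a minimal sketch, assuming that the identifier always has the three underscore-separated parts described above and that the last part is an integer; `parse_tuple_id` is a hypothetical helper, not part of the library.

```python
def parse_tuple_id(tuple_id):
    """Split an identifier such as 'tp_A_81' into ('A', 81)."""
    prefix, table, index = tuple_id.split("_")
    if prefix != "tp":
        raise ValueError(f"unexpected identifier: {tuple_id!r}")
    return table, int(index)


pair = ("tp_A_81", "tp_B_263")
parsed = tuple(parse_tuple_id(t) for t in pair)
print(parsed)
```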

Example: the datasets used in the example are "fodors_zagats-tableA" and "fodors_zagats-tableB"  

In [6]:
matches = one_to_one_clean_ER(
    r"/home/francesco.pugnaloni/EntityResolutionWithPyG/Tests/FZ/Datasets/fodors_zagats-tableA.csv",
    r"/home/francesco.pugnaloni/EntityResolutionWithPyG/Tests/FZ/Datasets/fodors_zagats-tableB.csv",
    file_directory=r"/home/francesco.pugnaloni/EntityResolutionWithPyG/Tests/FZ/Files",
    n_epochs=100,
)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/francesco.pugnaloni/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of walks: 2889
Training is starting
T_exec embedding generation: 0.007547855377197266s
Using faiss


In [7]:
# Convert the set of matches to a list so that it can be sliced
l = list(matches)

Output visualization:

In [8]:
l[0:10]

[('tp_A_81', 'tp_B_263'),
 ('tp_A_268', 'tp_B_250'),
 ('tp_A_223', 'tp_B_214'),
 ('tp_A_88', 'tp_B_12'),
 ('tp_A_13', 'tp_B_89'),
 ('tp_A_65', 'tp_B_218'),
 ('tp_A_45', 'tp_B_228'),
 ('tp_A_429', 'tp_B_322'),
 ('tp_A_270', 'tp_B_249'),
 ('tp_A_129', 'tp_B_223')]
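
A quick sanity check on such output is to verify that every pair links table A to table B and that no tuple appears in more than one pair. The sketch below runs that check on the ten pairs shown above only; reading "one to one" as "each tuple appears at most once" is an assumption based on the function name, and `table_of` is a hypothetical helper.

```python
from collections import Counter

# The ten pairs shown in the output above.
sample = [
    ("tp_A_81", "tp_B_263"),
    ("tp_A_268", "tp_B_250"),
    ("tp_A_223", "tp_B_214"),
    ("tp_A_88", "tp_B_12"),
    ("tp_A_13", "tp_B_89"),
    ("tp_A_65", "tp_B_218"),
    ("tp_A_45", "tp_B_228"),
    ("tp_A_429", "tp_B_322"),
    ("tp_A_270", "tp_B_249"),
    ("tp_A_129", "tp_B_223"),
]


def table_of(tuple_id):
    """Return the table label, e.g. 'A' for 'tp_A_81'."""
    return tuple_id.split("_")[1]


# Every pair should contain exactly one tuple from A and one from B.
crosses_tables = all({table_of(a), table_of(b)} == {"A", "B"} for a, b in sample)

# No identifier should occur in more than one pair.
counts = Counter(t for pair in sample for t in pair)
repeated = [t for t, n in counts.items() if n > 1]

print(crosses_tables, repeated)
```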

## Free Entity Resolution

This section explains how to use the library to perform entity resolution in a "free" setting, i.e., finding tuples that refer to the same entity given an arbitrary number of input tables; in this case, matches between tuples in the same table are possible.

The only assumption is that both the tables and the graph generated from them fit in memory.

Necessary imports:

In [1]:
from entity_resolution import *


The function: free_entity_resolution

Parameters:
- df_list (list): list of the paths to the dataframes to process 
- file_directory (str): directory where to save the intermediate data--optional, default is None
- n_epochs (int): the number of training epochs--optional, default is 100
- p (int): higher values improve exploration during the generation of random walks--optional, default is 20
- q (int): lower values improve exploration during the generation of random walks--optional, default is 1
- n_top (int): number of closest tuples to find (<= n_similar)--optional, default is 10
- embedding_size (int): the size of the node embeddings--optional, default is 128
- walk_length (int): length of the generated random walks--optional, default is 10
- load_embedding_file (bool): if True, load the embeddings from file_directory and skip training--optional, default is False--set to True only if file_directory is provided and contains the required data
- load_graph (bool): if True, load the graph from file_directory and skip its generation--optional, default is False--set to True only if file_directory is provided and contains the required data
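
The n_top parameter describes a nearest-neighbor search over the node embeddings. The sketch below illustrates the idea with a brute-force search in plain Python; cosine similarity, the two-dimensional vectors, and the helper names `cosine` and `top_n` are all assumptions made for illustration, since this guide does not state which similarity measure the library uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_n(query_id, embeddings, n_top):
    """Return the n_top identifiers most similar to query_id."""
    scores = [
        (cosine(embeddings[query_id], vec), other)
        for other, vec in embeddings.items()
        if other != query_id
    ]
    scores.sort(key=lambda s: s[0], reverse=True)
    return [other for _, other in scores[:n_top]]


# Hypothetical 2-dimensional embeddings (real ones have embedding_size entries).
embeddings = {
    "tp_0_0": [1.0, 0.0],
    "tp_0_1": [0.9, 0.1],
    "tp_1_0": [0.0, 1.0],
    "tp_1_1": [0.8, 0.3],
}

closest = top_n("tp_0_0", embeddings, n_top=2)
print(closest)
```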

Output: a set of pairs in the format (tp\_"index\_table\_from\_i"\_"index\_j", tp\_"index\_table\_from\_m"\_"index\_n"); each pair represents a match found between the datasets. Note that the dataset a tuple belongs to is identified by an index that corresponds to the position of its dataframe path in df_list
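
Because the table is identified by its position in df_list, an identifier can be mapped back to the file it came from. This is a minimal sketch under the assumption that the last part of the identifier is an integer index; `resolve_source` and the short file names are hypothetical.

```python
def resolve_source(tuple_id, df_list):
    """Map an identifier such as 'tp_1_245' to (path, index)."""
    _, position, index = tuple_id.split("_")
    return df_list[int(position)], int(index)


# Hypothetical paths, in the same order as they would be passed to the function.
df_list = ["tableA.csv", "tableB.csv"]

source = resolve_source("tp_1_245", df_list)
print(source)
```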

Example: the datasets used in the example are "fodors_zagats-tableA" and "fodors_zagats-tableB"

In [2]:
matches_free = free_entity_resolution(
    [
        r"/home/francesco.pugnaloni/EntityResolutionWithPyG/Tests/FZ/Datasets/fodors_zagats-tableA.csv",
        r"/home/francesco.pugnaloni/EntityResolutionWithPyG/Tests/FZ/Datasets/fodors_zagats-tableB.csv",
    ],
    file_directory=r"/home/francesco.pugnaloni/EntityResolutionWithPyG/Tests/FZ/Files",
    n_epochs=100,
)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/francesco.pugnaloni/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of walks: 2889
Training is starting
Epoch: 01, Loss: 9.7361, time passed since start: 0.6577680110931396s, t_exec last epoch: 0.6255474090576172s
Epoch: 02, Loss: 8.8841, time passed since start: 1.1624491214752197s, t_exec last epoch: 0.5046124458312988s
Epoch: 03, Loss: 8.1850, time passed since start: 1.6680190563201904s, t_exec last epoch: 0.5055027008056641s
Epoch: 04, Loss: 7.6015, time passed since start: 2.1728436946868896s, t_exec last epoch: 0.5047557353973389s
Epoch: 05, Loss: 7.0035, time passed since start: 2.6744203567504883s, t_exec last epoch: 0.5015084743499756s
Epoch: 06, Loss: 6.5263, time passed since start: 3.117095947265625s, t_exec last epoch: 0.44260740280151367s
Epoch: 07, Loss: 6.0512, time passed since start: 3.619380235671997s, t_exec last epoch: 0.5022053718566895s
Epoch: 08, Loss: 5.6157, time passed since start: 4.070436000823975s, t_exec last epoch: 0.45098233222961426s
Epoch: 09, Loss: 5.2163, time passed since start: 4.575002670288086s, t_exec l

In [3]:
# Convert the set of matches to a list so that it can be sliced
l_free = list(matches_free)

Output visualization:

In [6]:
l_free[0:10]

[('tp_0_98', 'tp_0_110'),
 ('tp_0_140', 'tp_0_263'),
 ('tp_0_287', 'tp_0_329'),
 ('tp_0_66', 'tp_1_245'),
 ('tp_0_7', 'tp_1_7'),
 ('tp_0_274', 'tp_0_358'),
 ('tp_0_285', 'tp_0_361'),
 ('tp_0_101', 'tp_1_319'),
 ('tp_0_485', 'tp_0_503'),
 ('tp_1_189', 'tp_1_193')]
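
Unlike the clean to clean case, this output mixes matches inside one table with matches across tables. The sketch below separates the two kinds for the ten pairs shown above; `table_position` is a hypothetical helper, and the counts describe only this excerpt, not the full result.

```python
# The ten pairs shown in the output above.
sample_free = [
    ("tp_0_98", "tp_0_110"),
    ("tp_0_140", "tp_0_263"),
    ("tp_0_287", "tp_0_329"),
    ("tp_0_66", "tp_1_245"),
    ("tp_0_7", "tp_1_7"),
    ("tp_0_274", "tp_0_358"),
    ("tp_0_285", "tp_0_361"),
    ("tp_0_101", "tp_1_319"),
    ("tp_0_485", "tp_0_503"),
    ("tp_1_189", "tp_1_193"),
]


def table_position(tuple_id):
    """Return the table position, e.g. 0 for 'tp_0_98'."""
    return int(tuple_id.split("_")[1])


# Pairs whose two tuples come from the same input table.
same_table = [
    pair for pair in sample_free
    if table_position(pair[0]) == table_position(pair[1])
]
# Pairs that link two different input tables.
cross_table = [
    pair for pair in sample_free
    if table_position(pair[0]) != table_position(pair[1])
]

print(len(same_table), len(cross_table))
```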