In [45]:
import pandas as pd
import numpy as np

# Exploration of results for PARIS, using code provided by the authors (Zequn Sun) and by Manuel and Stefano.

### 12th October 2020
After downloading the code provided by the authors and replacing the dataset directories with more appropriate command-line arguments, we ran experiments on PARIS. As test data we used the `test_links` file contained in folder `1` (fold 1) of the dataset D_Y_100K_V1, following the pipeline described by the authors:

- 1) You should first convert our dataset into ".nt" files using the python code "nt_file.py". The folder "nt_datasets/" contains the ".nt" files of our D_Y datasets. You can directly use them.

- 2) Run PARIS on the generated ".nt" files using "paris_interface.py". The code will output a folder like "D_Y_100K_V1_1010_224118_wjrevhtsrq" which contains the raw results of PARIS.

- 3) Use "paris_results.py" to get the evaluation results based on the output of PARIS and the gold standard (e.g., all alignment or test alignment).

This led us to the following results, which are fully in line with what the authors claim:

```
P: 60545/70193=0.8626
R: 60545/70000=0.8649
F1: 0.8637
```

However, we had previously run the experiments with our own wrapper, and those results did not match at all. To be sure of this, we ran our own pipeline again (executable with `python /src/main.py --command-line-arguments...`) and observed very different results, namely:

```
P: 59983/61095=0.9818
R: 59983/70000=0.8569
F1: 0.9151
```

Although the recall is quite similar, the precision is surprisingly much higher (and consequently so is the F1 score).
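As a sanity check, the two result blocks above can be recomputed from their raw counts. The helper `precision_recall_f1` is a name introduced here for illustration; it is not part of either pipeline.

```python
def precision_recall_f1(correct, predicted, gold):
    """Return (precision, recall, F1) from raw counts."""
    precision = correct / predicted
    recall = correct / gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the two result blocks above.
authors = precision_recall_f1(correct=60545, predicted=70193, gold=70000)
ours = precision_recall_f1(correct=59983, predicted=61095, gold=70000)

print("authors: P={:.4f} R={:.4f} F1={:.4f}".format(*authors))
print("ours:    P={:.4f} R={:.4f} F1={:.4f}".format(*ours))
```

The number of correct alignments differs by only 562 (60545 vs 59983), while the number of predicted alignments differs by 9098 (70193 vs 61095), so the gap in precision comes almost entirely from the denominator.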

## Analysis 

The difference is huge, considering that, although the two pipelines differ, they run on the same data and execute the same algorithm (`paris.jar` as provided by the authors of PARIS).

Analysing the source code of the pipeline kindly provided to us by Zequn Sun, we spotted the following differences from ours:

1) A `y2` prefix for the YAGO dataset. This should not be a problem in practice, since in our own runs we also added the `dbp:` prefix for DBpedia. It may nevertheless improve the results of PARIS, which performs some checks on prefixes (see `standardPrefixes` in the PARIS source code), although it is not easy to tell what kind of improvement this could cause.

2) Precision computed over `eqv_full_i.tsv`, whereas ours was computed over `eqv_i.tsv`.

The second point deserves a more in-depth discussion. When PARIS executes, it writes its alignments to two files, `eqv_i.tsv` and `eqv_full_i.tsv`, which have different meanings and purposes. The second contains all the alignments that PARIS considers interesting, even if this means that the same entity appears more than once. A good example is the entity `Baltimora`, which in `D_Y_15K_V1` is correctly aligned with Baltimora, but also with Washington, a city that lies in approximately the same geographical area. `eqv_i.tsv`, instead, contains only the most likely alignment for each entity (duplicates are dropped in favour of the alignment with the highest probability).

For this reason, we believed it more convenient (and correct) to compute the evaluation on the `eqv_i.tsv` file, and not on the `full` version.
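The relation between the two files, as we understand it, can be sketched on toy data. The rows and probabilities below are made up, and `keep_best_per_source` is a hypothetical helper; whether PARIS derives `eqv_i.tsv` in exactly this way is an assumption.

```python
# Toy rows in the (DB1, DB2, probability) layout of the PARIS output files.
# The probabilities are made up for illustration.
eqv_full_rows = [
    ("dbp:resource/Baltimora", "y2:Baltimora", 0.9),
    ("dbp:resource/Baltimora", "y2:Washington", 0.4),
    ("dbp:resource/Barnaul", "y2:Barnaul", 1.0),
]

def keep_best_per_source(rows):
    """Keep, for each DB1 entity, the candidate with the highest probability."""
    best = {}
    for e1, e2, p in rows:
        if e1 not in best or best[e1][1] < p:
            best[e1] = (e2, p)
    return [(e1, e2, p) for e1, (e2, p) in best.items()]

eqv_like_rows = keep_best_per_source(eqv_full_rows)
for row in eqv_like_rows:
    print(row)
```

Here the lower-probability candidate for `Baltimora` is dropped, so the three rows of the "full" list shrink to two.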

Still, curious to understand why the authors picked the larger version, we checked their source code for `paris_results.py` and found that they do in fact remove duplicates, keeping the alignments with the highest probability. This led us to believe that the difference in performance needed an even more in-depth analysis. First, we ran our pipeline with the dataset created by the authors' pipeline, and managed to replicate precisely the previous results, close to 98% in precision (which were the results computed by our pipeline in the first place). Second, we ran the authors' pipeline on the dataset created by our pipeline (hence without the `y2` prefix for YAGO entities), and managed to replicate precisely their results (86% in precision). This, in turn, leads us to believe that the `y2` prefix is not an important addition and can be left out of consideration for the time being.

At the same time, taking the results computed entirely with the authors' pipeline and switching to the `eqv_i.tsv` file during the evaluation phase raised the precision to 98% again.

We now try to understand why this happens.

In [46]:
# Load the PARIS outputs that score 98% (`eqv`) and 86% (`eqv_full`) in precision with the authors' pipeline.
eqv_i_full_df = pd.read_csv("../paris_OpenEA/D_Y_100K_V1_1013_005128_rrijkskrfg/output/9_eqv_full.tsv", 
                            sep='\t', 
                            names=['DB1', 'DB2', 'Probability'])
eqv_i_df = pd.read_csv("../paris_OpenEA/D_Y_100K_V1_1013_005128_rrijkskrfg/output/9_eqv.tsv", 
                       sep='\t',  
                       names=['DB1', 'DB2', 'Probability'])

In [47]:
print("The size of eqv_full is {}".format(len(eqv_i_full_df)))

The size of eqv_full is 97056


In [48]:
print("The size of eqv is {}".format(len(eqv_i_df)))

The size of eqv is 86962


We can see that the difference is quite large, about 10k alignments.
Among these, we want to retain only the entities that are contained in the `test_links` file for fold 1, `D_Y_100K_V1/721_5fold/1/test_links`. The other tuples are not part of the gold standard at all, so they would be unjustly counted as wrong.

In [49]:
# Count how many nans there are:
print("nan in DB1: {}\nnan in DB2: {}".format(np.sum(eqv_i_df['DB1'].isna()), np.sum(eqv_i_df['DB2'].isna())))

nan in DB1: 1
nan in DB2: 0


In [50]:
print("nan in DB1: {}\nnan in DB2: {}".format(np.sum(eqv_i_full_df['DB1'].isna()), np.sum(eqv_i_full_df['DB2'].isna())))

nan in DB1: 1
nan in DB2: 0


Each file contains a single NaN, so we can safely drop that row without materially affecting the results.

In [51]:
eqv_i_df = eqv_i_df.dropna()
eqv_i_full_df = eqv_i_full_df.dropna()

In [52]:
eqv_i_df.head()

Unnamed: 0,DB1,DB2,Probability
0,dbp:resource/Menahem_Golan,y2:Menahem_Golan,1.0
1,dbp:resource/Snow_White_(1987_film),y2:Snow_White_(1987_film),0.509527
2,dbp:resource/Germi_County,y2:Germi_County,1.0
3,dbp:resource/Chat_Qeshlaq-e_Bala,y2:Chat_Qeshlaq-e_Bala,0.807499
4,dbp:resource/All_Hell's_Breakin'_Loose,y2:All_Hell's_Breakin'_Loose,0.677578


Moreover, we need to replace `dbp:resource` with `http://dbpedia.org/resource` and `y2:` with the empty string, so that the identifiers match those in `test_links`.

In [64]:
# Literal (non-regex) replacements, so that the identifiers match the format used in test_links.
eqv_i_df['DB1'] = eqv_i_df['DB1'].str.replace("dbp:resource", "http://dbpedia.org/resource", regex=False)
eqv_i_df['DB2'] = eqv_i_df['DB2'].str.replace("y2:", "", regex=False)
eqv_i_full_df['DB1'] = eqv_i_full_df['DB1'].str.replace("dbp:resource", "http://dbpedia.org/resource", regex=False)
eqv_i_full_df['DB2'] = eqv_i_full_df['DB2'].str.replace("y2:", "", regex=False)
eqv_i_df.head()

Unnamed: 0,DB1,DB2,Probability
0,http://dbpedia.org/resource/Menahem_Golan,Menahem_Golan,1.0
1,http://dbpedia.org/resource/Snow_White_(1987_f...,Snow_White_(1987_film),0.509527
2,http://dbpedia.org/resource/Germi_County,Germi_County,1.0
3,http://dbpedia.org/resource/Chat_Qeshlaq-e_Bala,Chat_Qeshlaq-e_Bala,0.807499
4,http://dbpedia.org/resource/All_Hell's_Breakin...,All_Hell's_Breakin'_Loose,0.677578


In [65]:
test_links_df = pd.read_csv("../datasets/OpenEA_dataset/D_Y_100K_V1/721_5fold/1/test_links",
                            sep='\t',
                            names=['DB1', 'DB2'])
assert len(test_links_df) == 70000      # Check that the length is indeed 70k

In [66]:
test_links_df.head()

Unnamed: 0,DB1,DB2
0,http://dbpedia.org/resource/Suffragette_(film),Suffragette_(film)
1,http://dbpedia.org/resource/Telegram_Sam,Telegram_Sam
2,http://dbpedia.org/resource/Barnaul,Barnaul
3,http://dbpedia.org/resource/Pontifical_Atheneu...,Pontifical_Atheneum_of_St._Anselm
4,http://dbpedia.org/resource/Pablo_Bezombe,Pablo_Bezombe


We print the number of alignments we would end up with if we dropped the rows that are not in the ground truth, in three cases:

1) The first column is in the ground truth (the second column is ignored)

2) Either the first or the second column is in the ground truth

3) Both columns are in the ground truth

The number of alignments considered by the authors' approach is 60734.
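The three cases can be sketched on a tiny example. The entity names below are hypothetical placeholders and stand in for the `DB1` and `DB2` columns.

```python
# Hypothetical gold standard and predictions, for illustration only.
gold = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
predicted = [
    ("a1", "b1"),   # both sides in the gold standard
    ("a2", "x9"),   # only the first column in the gold standard
    ("z7", "b3"),   # only the second column in the gold standard
    ("z8", "x8"),   # neither side in the gold standard
]

gold_left = {e1 for e1, _ in gold}
gold_right = {e2 for _, e2 in gold}

# Case 1: the first column decides, the second is ignored.
first_only = [(e1, e2) for e1, e2 in predicted if e1 in gold_left]
# Case 2: a row is kept if either side is known.
either = [(e1, e2) for e1, e2 in predicted if e1 in gold_left or e2 in gold_right]
# Case 3: a row is kept only if both sides are known.
both = [(e1, e2) for e1, e2 in predicted if e1 in gold_left and e2 in gold_right]

print(len(first_only), len(either), len(both))
```

By construction the "both" count can never exceed the "first column" count, which in turn can never exceed the "either" count; the real counts in the cells that follow respect the same ordering.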

In [67]:
print("# alignments where only DB1 appears in gold_standard: {}".format(
    len(eqv_i_df.loc[eqv_i_df['DB1'].isin(test_links_df['DB1'])])))
print("# alignments where either DB1 either DB2 appear in gold_standard: {}".format(
    len(eqv_i_df.loc[eqv_i_df['DB1'].isin(test_links_df['DB1']) | eqv_i_df['DB2'].isin(test_links_df['DB2'])])))
print("# alignments where both DB1 and DB2 appear in gold_standard: {}".format(
    len(eqv_i_df.loc[eqv_i_df['DB1'].isin(test_links_df['DB1']) & eqv_i_df['DB2'].isin(test_links_df['DB2'])])))

# alignments where only DB1 appears in gold_standard: 60734
# alignments where either DB1 either DB2 appear in gold_standard: 60980
# alignments where both DB1 and DB2 appear in gold_standard: 60503


In [68]:
print("# alignments where only DB1 appears in gold_standard: {}".format(
    len(eqv_i_full_df.loc[eqv_i_full_df['DB1'].isin(test_links_df['DB1'])])))
print("# alignments where either DB1 either DB2 appear in gold_standard: {}".format(
    len(eqv_i_full_df.loc[eqv_i_full_df['DB1'].isin(test_links_df['DB1']) | eqv_i_full_df['DB2'].isin(test_links_df['DB2'])])))
print("# alignments where both DB1 and DB2 appear in gold_standard: {}".format(
    len(eqv_i_full_df.loc[eqv_i_full_df['DB1'].isin(test_links_df['DB1']) & eqv_i_full_df['DB2'].isin(test_links_df['DB2'])])))

# alignments where only DB1 appears in gold_standard: 67834
# alignments where either DB1 either DB2 appear in gold_standard: 70002
# alignments where both DB1 and DB2 appear in gold_standard: 65602


Hence we may conclude that they keep only the alignments for which the first column is contained in the ground truth. This does not seem correct: the two columns are pretty much symmetric, so there is no reason to pivot the results on the first column only.

An additional note, which may be a useful observation: the number of alignments output by PARIS varies from execution to execution. This should not be surprising, since the execution is randomized and hence may vary a little between runs. It could presumably be fixed by setting a constant seed for the PARIS algorithm, but for now it should not make any difference.

## Counting the duplicates in `_full` for the first column and the second.

Maybe the asymmetry in how the authors' approach computes the set of alignments is the reason why it performs so poorly on `_full`.
To analyse this in more depth, we perform the same steps that they apply to the `_full` file and check whether the result contains duplicates in the second column. This cannot happen with the non-full file, which has already gone through some kind of selection (which one is still unclear to us).

In [69]:
# This code is almost copy pasted (added prints) from the source code provided by the authors.
def read_paris_mappings(file_path, standard_links):
    print("len of standard_links {}".format(len(standard_links)))
    pair_sim_set = set()
    ent_set = set(e1 for (e1, _) in standard_links) | set(e2 for (_, e2) in standard_links)
    print("len of ent_set: {}".format(len(ent_set)))
    # Used to store the best alignment (left is key, right is value) and best probability for such an alignment.
    res_dict = {}
    p_dict = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        i = 0
        for line in file:
            line = line.strip('\n').split('\t')
            if len(line) != 3:
                continue
            e1 = line[0].replace('dbp:', 'http://dbpedia.org/')
            # TODO: This line should rather be: e2 = line[1].replace('y2:', '')
            e2 = line[1]
            p = float(line[2])
            if e1 not in ent_set and e2 not in ent_set:
                continue
            if e1 not in p_dict or e1 in p_dict and p_dict[e1] < p:
                p_dict[e1] = p
                res_dict[e1] = e2
            i += 1
        print("Number of alignments outputted by paris: {}".format(i))
        for k, v in res_dict.items():
            pair_sim_set.add((k, v))
    return pair_sim_set

So we do exactly the same steps to replicate the results.

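The number of alignments that they obtain is 70002 when replacing `y2:`, and 67834 when they do not. Note that these are exactly the numbers we obtain when we apply the or operation across the two columns, and when we ignore the second column, respectively. This is consistent with the filter in the function above: if the `y2:` prefix is kept, `e2` presumably never matches an entity of the gold standard, so only the first column decides whether a row is kept.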
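The effect of the prefix on the filter can be reproduced on toy data. The links and rows below are hypothetical, and `kept` is a helper written for this sketch that mirrors only the filtering condition of `read_paris_mappings`, not the rest of the function.

```python
# Hypothetical gold links in the test_links layout: no `y2:` prefix on the right.
standard_links = [("http://dbpedia.org/resource/A", "A"),
                  ("http://dbpedia.org/resource/B", "B")]
ent_set = {e1 for e1, _ in standard_links} | {e2 for _, e2 in standard_links}

# Hypothetical PARIS rows after the `dbp:` replacement on the left side.
rows = [
    ("http://dbpedia.org/resource/A", "y2:A"),  # left side in the gold standard
    ("http://dbpedia.org/resource/Z", "y2:B"),  # only the right side in the gold standard
]

def kept(rows, strip_prefix):
    """Apply the same skip condition, optionally stripping `y2:` from the right side."""
    result = []
    for e1, e2 in rows:
        if strip_prefix:
            e2 = e2.replace("y2:", "")
        if e1 not in ent_set and e2 not in ent_set:
            continue
        result.append((e1, e2))
    return result

print(len(kept(rows, strip_prefix=False)))
print(len(kept(rows, strip_prefix=True)))
```

With the prefix left in place, the second row is skipped because `y2:B` is not in `ent_set`; once the prefix is stripped, the row is kept through its right-hand side.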

In [76]:
print("Alignments with | among first and second column: {}".format(
    len(eqv_i_full_df.loc[eqv_i_full_df['DB1'].isin(test_links_df['DB1']) | eqv_i_full_df['DB2'].isin(test_links_df['DB2'])])))
print("Alignments forgetting second column: {}".format(
    len(eqv_i_full_df.loc[eqv_i_full_df['DB1'].isin(test_links_df['DB1'])])))

Alignments with | among first and second column: 70002
Alignments forgetting second column: 67834


We believe the best choice is to keep the replacement of `y2:` with the empty string.

Now, for each entity in the first column, we keep only the alignment with the highest probability:

In [77]:
eqv_i_full_df_in_test = eqv_i_full_df.loc[eqv_i_full_df['DB1'].isin(test_links_df['DB1']) | eqv_i_full_df['DB2'].isin(test_links_df['DB2'])]
eqv_i_full_df_in_test

Unnamed: 0,DB1,DB2,Probability
0,http://dbpedia.org/resource/Menahem_Golan,Menahem_Golan,1.000000
1,http://dbpedia.org/resource/Snow_White_(1987_f...,Snow_White_(1987_film),0.509527
2,http://dbpedia.org/resource/Germi_County,Germi_County,1.000000
4,http://dbpedia.org/resource/All_Hell's_Breakin...,All_Hell's_Breakin'_Loose,0.677578
5,http://dbpedia.org/resource/Lick_It_Up_(song),Lick_It_Up_(song),0.857566
...,...,...,...
96886,http://dbpedia.org/resource/Joyce_Dickerson,Joyce_Dickerson,0.740688
96887,http://dbpedia.org/resource/Divinas_palabras_(...,Airbag_(film),0.300062
96888,http://dbpedia.org/resource/Jay_Leach_(ice_hoc...,Robertas_Ringys,0.671825
96889,http://dbpedia.org/resource/Five_Ashore_in_Sin...,Five_Ashore_in_Singapore,0.715384


In [88]:
p_dict = {}
res_dict = {}
for i, line in eqv_i_full_df_in_test.iterrows():
    e1 = line['DB1']
    # The `y2:` prefix was already stripped from DB2 in an earlier cell, so no replace is needed here.
    e2 = line['DB2']
    p = float(line['Probability'])
    # Keep, for each DB1 entity, the candidate with the highest probability.
    if e1 not in p_dict or p_dict[e1] < p:
        p_dict[e1] = p
        res_dict[e1] = e2

In [98]:
# Turn the best-alignment dictionary into a two-column DataFrame.
res_df = pd.DataFrame(list(res_dict.items()), columns=['DB1', 'DB2'])
res_df

Unnamed: 0,DB1,DB2
0,http://dbpedia.org/resource/Menahem_Golan,Menahem_Golan
1,http://dbpedia.org/resource/Snow_White_(1987_f...,Snow_White_(1987_film)
2,http://dbpedia.org/resource/Germi_County,Germi_County
3,http://dbpedia.org/resource/All_Hell's_Breakin...,All_Hell's_Breakin'_Loose
4,http://dbpedia.org/resource/Lick_It_Up_(song),Lick_It_Up_(song)
...,...,...
69997,http://dbpedia.org/resource/Joyce_Dickerson,Joyce_Dickerson
69998,http://dbpedia.org/resource/Divinas_palabras_(...,Airbag_(film)
69999,http://dbpedia.org/resource/Jay_Leach_(ice_hoc...,Robertas_Ringys
70000,http://dbpedia.org/resource/Five_Ashore_in_Sin...,Five_Ashore_in_Singapore


In [102]:
# Now we try to understand if there are duplicates that may affect precision:
# in the first column there should be no duplicates (it is the key of a dict!)
np.sum(res_df['DB1'].duplicated())

0

In [103]:
# The second column instead may have some:
np.sum(res_df['DB2'].duplicated())

7776

From this it looks like the difference between `eqv_full` preprocessed with Zequn's pipeline and `eqv` goes deeper than the duplicates in the second column alone. (When running on `eqv`, the number of tuples is 60734, which is much lower than 70002 - 7776 = 62226.)
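The reasoning behind this bound can be spelled out in a few lines. The dictionary below is a hypothetical stand-in for `res_dict`; only the three figures in the last part are taken from the cells above.

```python
from collections import Counter

# Hypothetical best-per-DB1 alignments, for illustration only.
res_dict = {"a1": "b1", "a2": "b1", "a3": "b2", "a4": "b1"}

# Rows whose right-hand entity already appeared in an earlier row; this is
# what summing `res_df['DB2'].duplicated()` counts in the cell above.
counts = Counter(res_dict.values())
duplicated_rows = sum(n - 1 for n in counts.values())
print(duplicated_rows)

# The bound used in the text: even after dropping every duplicated row,
# more tuples remain than in the `eqv` case.
remaining = 70002 - 7776
print(remaining, remaining - 60734)
```

Even in the most favourable case, where every duplicated row is removed, 1492 tuples remain unexplained, which is why duplicates in the second column alone cannot account for the difference.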