[Entity Disambiguation] some training samples in blink-train-kilt have similar samples in wned-wiki #13

Closed
horseee opened this issue Mar 5, 2021 · 1 comment

Comments

horseee commented Mar 5, 2021

Hi,

We found that some samples in blink-train-kilt.jsonl are very similar to samples in the wned-wiki test set. Did you remove these samples before training? In the experimental results, GENRE does not improve much on the other datasets (ACE2004, CWEB, AIDA-b, AQUAINT) but improves significantly on wned-wiki. Is this related to these possibly leaked samples?

Here are some examples (since Wikipedia has been updated since the wned-wiki dataset was created, I cannot filter out samples of this type by exact matching):

  1. Wikipedia page: https://en.wikipedia.org/wiki/Big_Blue_River_(Indiana)
    blink-train-kilt: {"id": "blink-train-731858", "input": "The Big Blue River is an [START_ENT] tributary [END_ENT] of the Driftwood River in east-central Indiana in the United States. Via the Driftwood, White, Wabash and Ohio rivers, it is part of the watershed of the Mississippi River.", "output": [{"answer": "Tributary", "provenance": [{"wikipedia_id": "72465", "title": "Tributary"}]}], "meta": {"mention": "tributary", ...}}
    wiki-test-kilt: {"id": 22, "input": "The Big Blue River is an [START_ENT] tributary [END_ENT] of the Driftwood River in east central Indiana in the United States Via the Driftwood White Wabash and Ohio rivers it is part of the watershed of the Mississippi River The Big Blue rises in northeastern Henry County and flows generally southwestwardly through Rush Hancock Shelby and Johnson counties past the towns of New Castle Knightstown Carthage Morristown Shelbyville and Edinburgh It joins Sugar Creek to form the Driftwood River west of Edinburgh At Shelbyville it collects the", "output": [{"answer": "Tributary", "provenance": [{"title": "Tributary"}]}], "meta": {..., "mention": "tributary"}, "candidates": [...]}

The same page also yields these pairs: (blink-train-6446404, wiki-test-kilt-id-30) and (blink-train-6621890, wiki-test-kilt-id-27).

  2. Wikipedia page: https://en.wikipedia.org/wiki/Energy_in_Sudan
    blink-train-kilt: {"id": "blink-train-2613656", "input": "Energy in Sudan describes energy and [START_ENT] electricity [END_ENT] production, consumption and imports in Sudan. Sudan is a net energy exporter. Primary energy use in Sudan was 179 kWh and 4 kWh per million persons in 2008.", "output": [{"answer": "Electricity generation", "provenance": [{"wikipedia_id": "9540", "title": "Electricity generation"}]}], "meta": {"mention": "electricity", ...}
    wiki-test-kilt: {"id": 357, "input": "Energy in Sudan describes and [START_ENT] electricity [END_ENT] production consumption and imports in Sudan Sudan is a net energy exporter Primary energy use in Sudan was 179 kWh and 4 kWh per million persons in 2008 The world share of energy production in Africa was 12 percent of oil and 7 percent of gas in 2009 In 2010 major energy producers in Africa were Algeria Angola Cameroon Democratic Republic of the Congo Equatorial Guinea Gabon Libya Nigeria and Sudan According to the OECD and the World Bank the population growth of from 2004 to 2008 was 16 4 percent in comparison to the world average of 5 3", "output": [{"answer": "Electricity generation", "provenance": [{"title": "Electricity generation"}]}], "meta": {..., "mention": "electricity"}, "candidates": ...}

  3. Wikipedia page: https://en.wikipedia.org/wiki/2009_European_Pairs_Speedway_Championship
    blink-train-kilt: {"id": "blink-train-1894399", "input": "The 2009 European Pairs Speedway Championship will be the 6th UEM European Pairs Speedway Championship season. The Final was held on 26 September 2009 in Miskolc, Hungary; it was second Final in Hungary, but first in Miskolc. The championship was won by [START_ENT] Czech Republic pair [END_ENT] and they beat Russia and the defending Champions Poland.", "output": [{"answer": "Czech Republic national speedway team", "provenance": [{"wikipedia_id": "13444681", "title": "Czech Republic national speedway team"}]}], "meta": {"mention": "Czech Republic pair", ...}}
    wiki-test-kilt: {"id": 174, "input": "The 2009 European Pairs Speedway Championship will be the 6th UEM European Pairs Speedway Championship season The Final was held on 26 September 2009 in Miskolc Hungary it was second Final in Hungary but first in Miskolc The championship was won by [START_ENT] Czech Republic pair [END_ENT] and they beat Russia and the defending Champions Poland In the Final will be the defending Champion Poland Czech Republic 2nd place in 2008 Final Russia 3rd place host team Hungary 4th place and Latvia 5th place A last finalist will be determined in one Semi Final In Ljubljana Slovenia on May 13 will be Austria 6th place Germany 7th place Ukraine Finland host team Slovenia Italy and Croatia", "output": [{"answer": "Czech Republic national speedway team", "provenance": [{"title": "Czech Republic national speedway team"}]}], "meta": {..., "mention": "Czech Republic pair"}, "candidates": [...]}
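
Because exact matching misses pairs like the ones above (the test inputs are longer and have punctuation stripped), one possible way to flag them is token n-gram containment after normalization. The sketch below is only an illustration, not the procedure used to find these examples or anything from the GENRE code base. The function names, the trigram size, and the 0.8 threshold are all assumptions; the only fields it relies on are `input` and `output[0].answer`, as they appear in the samples above.

```python
import re

# Entity markers as they appear in the KILT-style "input" field.
ENTITY_MARKERS = ("[START_ENT]", "[END_ENT]")


def normalize(text):
    """Drop entity markers, lowercase, and keep only alphanumeric tokens."""
    for marker in ENTITY_MARKERS:
        text = text.replace(marker, " ")
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n=3):
    """Return the set of token n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(train_input, test_input, n=3):
    """Fraction of the training sample's n-grams that also occur in the test sample."""
    train_grams = ngrams(normalize(train_input), n)
    test_grams = ngrams(normalize(test_input), n)
    if not train_grams:
        return 0.0
    return len(train_grams & test_grams) / len(train_grams)


def is_near_duplicate(train_sample, test_sample, threshold=0.8):
    """Flag a pair when the gold answers match and the contexts overlap heavily.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    same_answer = (
        train_sample["output"][0]["answer"] == test_sample["output"][0]["answer"]
    )
    overlap = containment(train_sample["input"], test_sample["input"])
    return same_answer and overlap >= threshold
```

Containment is measured relative to the training sample because the test contexts are longer; a symmetric measure such as Jaccard similarity would penalize that length difference. Comparing every training sample against every test sample this way is quadratic, so for the full blink-train-kilt file some form of indexing (for example, grouping by gold answer first) would likely be needed.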

horseee changed the title from "[Entity Disambiguation] similar samples between WNED-WIKI and blink-train-kilt" to "[Entity Disambiguation] some training samples in blink-train-kilt appear very similar to samples in wned-wiki" on Mar 5, 2021
horseee changed the title from "[Entity Disambiguation] some training samples in blink-train-kilt appear very similar to samples in wned-wiki" to "[Entity Disambiguation] some training samples in blink-train-kilt have similar samples in wned-wiki" on Mar 5, 2021

nicola-decao (Contributor) commented

Hi, this is not surprising to us, and we explicitly noted it in the paper (https://arxiv.org/pdf/2010.00904.pdf, Table 1):

*WIKI is usually considered out-of-domain but note that all methods use a part of Wikipedia to train.

So even though you are right that there is some overlap, the same can happen with any model that uses Wikipedia as additional supervision, and all of the reported models do (e.g., the BLINK training/dev sets are extracted from Wikipedia).
