Re-create the trie tree #56
Comments
I did not apply any filtering when creating the KILT trie, but I built it from the KILT knowledge source (http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json), not from the Wikipedia dump directly. Your code is probably extracting many page titles that should not be there (perhaps from special or deprecated pages). 14M titles is too many: as you can see from my screenshot of https://www.wikipedia.org, there are 6.2M pages as of today, which is less than half of what you are extracting.
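To check which extracted titles are missing from the KILT knowledge source, the two title sets can be compared directly. The following is a minimal sketch, not the maintainers' code. It assumes the knowledge source file holds one JSON object per line with a `wikipedia_title` field (worth verifying against the downloaded file); `iter_kilt_titles`, `titles_not_in_kilt`, and the sample records are hypothetical names and data used only for illustration.

```python
import json


def iter_kilt_titles(lines):
    """Yield page titles from KILT-style JSON lines.

    Assumption: each non-empty line is a JSON object with a
    "wikipedia_title" field.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        title = record.get("wikipedia_title")
        if title:
            yield title


def titles_not_in_kilt(own_titles, kilt_titles):
    """Return the sorted titles that appear in own_titles but not in KILT."""
    return sorted(set(own_titles) - set(kilt_titles))


# Made-up sample data standing in for the real files.
sample_kilt_lines = [
    '{"wikipedia_id": "12", "wikipedia_title": "Anarchism"}',
    '{"wikipedia_id": "25", "wikipedia_title": "Autism"}',
]
own_titles = ["AccessibleComputing", "Anarchism", "Autism"]

kilt_titles = set(iter_kilt_titles(sample_kilt_lines))
extra = titles_not_in_kilt(own_titles, kilt_titles)
print(extra)
```

For the real file, passing an open file handle (`open(path, encoding="utf-8")`) as `lines` avoids loading everything into memory at once. Inspecting a sample of `extra` should show what kind of pages account for the surplus titles.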
Leaving a comment since it might be helpful for someone in the future who wants to create their own trie from the current Wikipedia version.
The code by HuiBinR works fine for creating the trie.
Hi @bablf! It's very kind of you to share the SQL for filtering KILT titles. I am wondering if you are doing the same for entity linking usage.
Goal: I am trying to create the `kilt_titles_trie_dict.pkl` file myself. If I can exactly re-create the KILT trie, I can then create tries for other data.

Data Preparation: I downloaded the raw Wikipedia dump from the link you provided in `GENRE/examples_genre/README.md`, unzipped it, and got `enwiki-pages-articles.xml`. I used the wikiextractor tool to parse the .xml file and got data in the following format. The first line, `<doc id="10" url="https://en.wikipedia.org/wiki?curid=10" title="AccessibleComputing">`, contains the title, which is the target entity. The fourth line, "Anarchism is an ...", is a description of the entity.

I extract these two with the code below as a dict {title: description}:
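A minimal sketch of what such an extraction could look like is shown here. It is not the poster's original code. It assumes wikiextractor's plain-text output, where each article starts with a `<doc id=... url=... title=...>` line and ends with `</doc>`, and it takes the first non-empty body line that is not a repeat of the title as the description. The names `extract_titles` and `DOC_OPEN` and the sample text are hypothetical.

```python
import html
import re

# Matches the opening line of a wikiextractor document block.
DOC_OPEN = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">'
)


def extract_titles(lines):
    """Return {title: description} from wikiextractor-style lines."""
    result = {}
    title = None
    for raw in lines:
        line = raw.strip()
        match = DOC_OPEN.match(line)
        if match:
            # Titles are HTML-escaped inside the attribute.
            title = html.unescape(match.group("title"))
            result[title] = ""
            continue
        if line == "</doc>":
            title = None
            continue
        if title is None or not line or line == title:
            # Skip text outside a doc block, blank lines, and a repeated title.
            continue
        if not result[title]:
            result[title] = line  # keep only the first body line
    return result


# Made-up sample mirroring the format described above.
sample = '''<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is an ...
</doc>
'''.splitlines()

title_to_description = extract_titles(sample)
list_title = list(title_to_description)
print(list_title)
```

Documents with an empty body keep an empty description in this sketch, so they still contribute a title to `list_title`. Counting how many entries have an empty description is one quick way to see whether such pages inflate the title list.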
I use `list_title` (the keys of the dict) to generate the trie. However, `list_title` has 14,608,727 entries, which does not match the number mentioned in `GENRE/examples_genre/README.md` (~5M titles).

Trie tree creation: with the title list, I use the code below to create my wiki trie and get the file `our_kilt_titles_trie_dict.pkl`.
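For reference, the general idea of building a prefix trie from a title list and pickling it can be sketched as follows. This is a self-contained illustration under stated assumptions, not GENRE's own implementation: `toy_encode` is a placeholder that maps characters to code points (a real setup would use the model's tokenizer), `END` is a hypothetical end-of-title marker, and the nested-dict layout is only assumed to resemble what the `.pkl` file stores.

```python
import pickle

END = -1  # hypothetical marker meaning "a title ends here"


def toy_encode(title):
    """Placeholder tokenizer: one token per character (its code point)."""
    return [ord(ch) for ch in title]


def build_trie_dict(sequences):
    """Build a nested-dict trie from an iterable of token sequences."""
    trie = {}
    for seq in sequences:
        node = trie
        for token in seq:
            node = node.setdefault(token, {})
    return trie


def next_tokens(trie, prefix):
    """Return the sorted tokens allowed after prefix, or [] if prefix is unknown."""
    node = trie
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return sorted(node)


titles = ["Anarchism", "Anarchy", "Autism"]  # made-up sample titles
trie = build_trie_dict(toy_encode(t) + [END] for t in titles)

# Round-trip through pickle; writing to a file works the same way with
# pickle.dump(trie, open_file_handle).
restored = pickle.loads(pickle.dumps(trie))

print(next_tokens(restored, toy_encode("Anarch")))
```

During constrained decoding, a lookup like `next_tokens` restricts each generation step to tokens that continue some title in the trie, so any extra titles in the list directly enlarge the set of candidates the model can produce.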
![Capture](https://user-images.githubusercontent.com/29992403/131280126-57c603f1-8ae1-4579-8e8e-14cd6d3e0151.PNG)
![Capture 1](https://user-images.githubusercontent.com/29992403/131280987-732bb081-1a4d-4df6-af84-bd8053304a69.PNG)

I use `kilt_titles_trie_dict.pkl` and `our_kilt_titles_trie_dict.pkl` to test the same model, and they give different results. `fairseq_blink_200k_default_no_reset` is a pretrained model based on BLINK data with `kilt_titles_trie_dict.pkl`. Testing with the two different tries gives a 2-point gap.

Question: I want to know whether you also did some filtering when creating `kilt_titles_trie_dict.pkl`, since neither the number of titles nor the results match. (I know you did some filtering when creating the special trie for AIDA, as I read in Issue #37.)