Re-create the trie tree #56

HuiBinR · 2021-08-30T03:40:55Z

Goal: I am trying to create the kilt_titles_trie_dict.pkl file myself. If I can exactly re-create the KILT trie, I can then build tries for other data.

The README says: "We also release the BPE prefix tree (trie) from KILT Wikipedia titles (kilt_titles_trie_dict.pkl) that is based on the 2019/08/01 Wikipedia dump, downloadable in its raw format here. The trie contains ~5M titles and it is used to generate entities for all the KILT experiments."

Data Preparation: I downloaded the raw Wikipedia dump from the link provided in GENRE/examples_genre/README.md, unzipped it, and got enwiki-pages-articles.xml. I then used wikiextractor to parse the .xml file, which produces data in this format:

<doc id="10" url="https://en.wikipedia.org/wiki?curid=10" title="AccessibleComputing">
AccessibleComputing



</doc>
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is an anti-authoritarian political philosophy that rejects hierarchies deemed unjust and advocates their replacement with self-managed, self-governed societies based on voluntary, cooperative institutions. These institutions are often described as stateless societies, although several authors have defined them more specifically as distinct institutions based on non-hierarchical or free associations. Anarchism's central disagreement with other ideologies is that it holds the state to be undesirable, unnecessary, and harmful.
...

</doc>

In the first line, <doc id="10" url="https://en.wikipedia.org/wiki?curid=10" title="AccessibleComputing">, the title attribute is the target entity.
The fourth line ("Anarchism is an ...") is a description of the entity.

I extract these two fields into a dict {title: description} with the code below:

import os
import pickle
import re
from multiprocessing import Pool, Manager

manager = Manager()
dict_title_description = manager.dict()

def process_a_file(in_path):
    print(in_path)
    with open(in_path, 'r', encoding='utf8') as r_f:
        lines = r_f.readlines()
    i = 0
    while i < len(lines):
        tmp_line = lines[i]
        if tmp_line.startswith('<doc id='):
            title = re.findall(r'<doc id=.*title="(.*)"', tmp_line)[0]
            try:
                description = lines[i + 3].rstrip('\n')
            except IndexError:
                break
            dict_title_description.update({title: description})
            i += 6
        else:
            i += 1
    return dict_title_description

if __name__ == '__main__':
    in_path = "wikiextractor-master/wikiextractor/enwiki_data/AA"
    out_path_dict = "wikiextractor-master/wikiextractor/enwiki_data/dict.pkl"
    out_path_list = "wikiextractor-master/wikiextractor/enwiki_data/title_list.pkl"
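    final_dict = {}
    path_list = []  # paths of all extracted files to process

    pool = Pool(processes=14)

    for root, dirnames, filenames in os.walk(in_path):
        for fname in filenames:
            path_list.append(os.path.join(root, fname))
    # Note: map_async only surfaces worker exceptions if .get() is called on its result.
    pool.map_async(process_a_file, path_list)
    pool.close()
    pool.join()
    # Copy the manager proxy into a plain dict so the pickle can be reloaded later.
    final_dict = dict_title_description.copy()
    list_title = list(final_dict.keys())
    list_title.sort()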

    with open(out_path_dict, 'wb') as w_f:
        pickle.dump(final_dict, w_f)
    with open(out_path_list, 'wb') as w_f:
        pickle.dump(list_title, w_f)

I use list_title (the keys of the dict) to generate the trie. However, list_title contains 14608727 titles, which does not match the ~5M titles mentioned in GENRE/examples_genre/README.md.
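
The extraction above relies on fixed line offsets (`lines[i + 3]`, `i += 6`). As a rough sketch of an alternative, the snippet below tracks the `<doc ...>` and `</doc>` markers instead and records which titles have an empty body. The function name `parse_wikiextractor_text`, the regular expression, and the shortened sample text are assumptions made for illustration; they are not part of wikiextractor or GENRE.

```python
import re

# Matches the opening tag written for each article in the format shown above.
DOC_OPEN = re.compile(r'<doc id="[^"]*" url="[^"]*" title="(?P<title>[^"]*)">')


def parse_wikiextractor_text(text):
    """Return {title: first non-empty body line} for every <doc> block in text."""
    result = {}
    title = None
    body = []
    for line in text.splitlines():
        match = DOC_OPEN.match(line)
        if match:
            title = match.group("title")
            body = []
        elif line.startswith("</doc>"):
            if title is not None:
                # The first body line repeats the title, so skip it.
                paragraphs = [ln for ln in body[1:] if ln.strip()]
                result[title] = paragraphs[0] if paragraphs else ""
            title = None
        elif title is not None:
            body.append(line)
    return result


# Shortened, made-up sample in the same layout as the data above.
sample = """<doc id="10" url="https://en.wikipedia.org/wiki?curid=10" title="AccessibleComputing">
AccessibleComputing



</doc>
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is an anti-authoritarian political philosophy.
</doc>
"""

parsed = parse_wikiextractor_text(sample)
# Titles whose body is empty; these may be worth inspecting separately.
empty_titles = [t for t, d in parsed.items() if not d]
```

Entries with an empty body, such as AccessibleComputing in the sample, might be pages that should not count as entities; counting them is one possible way to investigate the gap between 14608727 and ~5M titles.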

Trie creation: with the title list, I use the code below to build my Wikipedia trie and save it as our_kilt_titles_trie_dict.pkl:

from genre.fairseq_model import GENRE
from genre.trie import Trie
import pickle

model = (
        GENRE.from_pretrained("../models/fairseq_entity_disambiguation_aidayago", checkpoint_file='model.pt')
        .eval()
        .to('cpu')
    )

# entities = ['AccessibleComputing', 'Anarchism']
with open('wikiextractor-master/wikiextractor/enwiki_data/title_list.pkl', 'rb') as r_f:
    entities = pickle.load(r_f)
print(len(entities))
trie = Trie([2]+model.encode(entity)[1:].tolist() for entity in entities).trie_dict
with open('our_kilt_titles_trie_dict.pkl', 'wb') as w_f:
    pickle.dump(trie, w_f)
print("finish running!")

I tested the same model with kilt_titles_trie_dict.pkl and with our_kilt_titles_trie_dict.pkl, and the two give different results. The model, fairseq_blink_200k_default_no_reset, is a pretrained model based on Blink data with kilt_titles_trie_dict.pkl. Testing with the two different tries shows a gap of 2 points.
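
To illustrate why the set of titles matters, here is a minimal sketch of a prefix tree over token-id sequences stored as nested dicts. It is an independent illustration, not the implementation in genre.trie; the helpers `build_trie` and `next_tokens` and the token ids are hypothetical.

```python
def build_trie(sequences):
    """Build a nested-dict prefix tree from an iterable of token-id lists."""
    root = {}
    for seq in sequences:
        node = root
        for token in seq:
            # Create the child node on first visit, then descend into it.
            node = node.setdefault(token, {})
    return root


def next_tokens(trie, prefix):
    """Return the sorted token ids that may follow prefix, or [] if prefix is absent."""
    node = trie
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return sorted(node.keys())


# Hypothetical token ids; real ids would come from model.encode(...).
sequences = [
    [2, 10, 11, 3],
    [2, 10, 12, 3],
    [2, 20, 3],
]
trie = build_trie(sequences)
allowed_after_start = next_tokens(trie, [2])
allowed_after_prefix = next_tokens(trie, [2, 10])
```

Every extra title adds branches to such a structure, so a trie built from a larger title list allows more continuations at each step. If the trie is used to restrict what the model may generate, that could plausibly change the predictions.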

Question: Did you apply any filtering when creating kilt_titles_trie_dict.pkl? (I know from Issue #37 that you did some filtering when creating the special trie for Aida.) I ask because neither the number of titles nor the results match.

nicola-decao · 2021-08-30T08:55:44Z

I did not use any filtering when creating the KILT trie, but I used the KILT knowledge source (http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json), not the Wikipedia dump directly. Your code is probably extracting many other page titles (maybe from special or deprecated pages) that should not be there. Indeed, 14M titles is too many: as you can see from my screenshot of https://www.wikipedia.org, as of today there are 6.2M pages, so less than half of what you are extracting.
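
A rough sketch of collecting titles from the knowledge source follows. It assumes the file holds one JSON object per line and that each object has a `wikipedia_title` field; both are assumptions to verify against the actual file. The helper `iter_titles` and the two sample records are made up for illustration.

```python
import json


def iter_titles(lines, field="wikipedia_title"):
    """Yield the title field from each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        title = record.get(field)
        if title:
            yield title


# Two made-up records standing in for lines of the knowledge source file.
sample_lines = [
    '{"wikipedia_title": "Anarchism", "text": ["Anarchism\\n"]}',
    '',
    '{"wikipedia_title": "Albedo", "text": ["Albedo\\n"]}',
]
kilt_titles = sorted(set(iter_titles(sample_lines)))
```

With the real file, the same generator could be fed an open file handle instead of `sample_lines`. Taking the set difference between the resulting titles and the titles extracted from the dump would then show which extra titles the dump-based extraction picks up.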

bablf · 2023-02-08T15:50:57Z

Leaving a comment since it might be helpful for someone in the future who wants to create their own trie from the current Wikipedia version.
I used wikimapper to get all Wikipedia titles, since it is easier than parsing the XML dump. Then I used an SQL query to get all titles and redirects. As far as I can tell, you only need to filter out disambiguation pages (containing "(disambiguation)") and subsections (containing "/").

select wikipedia_title from 
where wikidata_id is not NULL AND
wikipedia_title is not NULL AND
wikipedia_title != "" AND
wikipedia_title NOT LIKE '%(disambiguation)%' AND
wikipedia_title NOT LIKE '%/%'

The code by HuiBinR works fine for creating the Trie.
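
The filter described above can be tried out on a small in-memory SQLite database. In this sketch the table name `demo_mapping`, the rows, and the ids are all hypothetical and only mirror the two column names used in the query; they are not taken from wikimapper's actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "demo_mapping" is a hypothetical table name used only for this sketch.
conn.execute("CREATE TABLE demo_mapping (wikipedia_title TEXT, wikidata_id TEXT)")
conn.executemany(
    "INSERT INTO demo_mapping VALUES (?, ?)",
    [
        ("Anarchism", "Q_demo_1"),
        ("Mercury (disambiguation)", "Q_demo_2"),  # dropped: disambiguation page
        ("Some page/Subsection", "Q_demo_3"),      # dropped: contains "/"
        ("No wikidata entry", None),               # dropped: wikidata_id is NULL
        ("", "Q_demo_4"),                          # dropped: empty title
        (None, "Q_demo_5"),                        # dropped: NULL title
        ("Albedo", "Q_demo_6"),
    ],
)

query = """
    SELECT wikipedia_title FROM demo_mapping
    WHERE wikidata_id IS NOT NULL
      AND wikipedia_title IS NOT NULL
      AND wikipedia_title != ''
      AND wikipedia_title NOT LIKE '%(disambiguation)%'
      AND wikipedia_title NOT LIKE '%/%'
    ORDER BY wikipedia_title
"""
filtered_titles = [row[0] for row in conn.execute(query)]
conn.close()
```

Note that the "/" condition removes every title containing a slash, so an article such as AC/DC would also be excluded. Whether that trade-off is acceptable depends on the target data.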

MrZilinXiao · 2023-03-28T11:33:18Z

Hi @bablf! It's very kind of you to share the SQL for filtering KILT titles. I am wondering if you are doing the same for entity linking usage.

nicola-decao closed this as completed Aug 30, 2021

HuiBinR mentioned this issue Sep 3, 2021

4 point interval between my finetune model and shared model #54

Closed

schwabmi mentioned this issue Sep 29, 2022

Invalid prediction - no wikipedia entity #89

Closed
