Re-create the trie tree #56

Closed
HuiBinR opened this issue Aug 30, 2021 · 3 comments

HuiBinR commented Aug 30, 2021

Goal: I am trying to create the kilt_titles_trie_dict.pkl file myself, so that, once I can exactly re-create the KILT trie, I can build tries for other data.

We also release the BPE prefix tree (trie) from KILT Wikipedia titles (kilt_titles_trie_dict.pkl) that is based on the 2019/08/01 Wikipedia dump, downloadable in its raw format here. The trie contains ~5M titles and it is used to generate entities for all the KILT experiments.

Data preparation: I downloaded the raw Wikipedia dump from the link provided in GENRE/examples_genre/README.md, unzipped it, and got enwiki-pages-articles.xml. I used the wikiextractor tool to parse the .xml file, which produces data in this format:

<doc id="10" url="https://en.wikipedia.org/wiki?curid=10" title="AccessibleComputing">
AccessibleComputing



</doc>
<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is an anti-authoritarian political philosophy that rejects hierarchies deemed unjust and advocates their replacement with self-managed, self-governed societies based on voluntary, cooperative institutions. These institutions are often described as stateless societies, although several authors have defined them more specifically as distinct institutions based on non-hierarchical or free associations. Anarchism's central disagreement with other ideologies is that it holds the state to be undesirable, unnecessary, and harmful.
...

</doc>
  • The first line, <doc id="10" url="https://en.wikipedia.org/wiki?curid=10" title="AccessibleComputing">, carries the title, which is the target entity.

  • The fourth line ("Anarchism is an ...") is the description of the entity.

I extract these two fields into a dict {title: description} with the code below:

import os
import pickle
import re
from multiprocessing import Pool


def process_a_file(in_path):
    """Return {title: description} for one wikiextractor output file."""
    print(in_path)
    title_description = {}
    with open(in_path, 'r', encoding='utf8') as r_f:
        lines = r_f.readlines()
    i = 0
    while i < len(lines):
        tmp_line = lines[i]
        if tmp_line.startswith('<doc id='):
            title = re.findall(r'<doc id=.*title="(.*)"', tmp_line)[0]
            try:
                # The description is expected three lines below the <doc> header.
                description = lines[i + 3].rstrip('\n')
            except IndexError:
                break
            title_description[title] = description
            # Skip the shortest possible entry (header, title, 3 lines, </doc>).
            i += 6
        else:
            i += 1
    return title_description


if __name__ == '__main__':
    in_path = "wikiextractor-master/wikiextractor/enwiki_data/AA"
    out_path_dict = "wikiextractor-master/wikiextractor/enwiki_data/dict.pkl"
    out_path_list = "wikiextractor-master/wikiextractor/enwiki_data/title_list.pkl"

    # Collect every file below in_path (path_list was never defined before).
    path_list = []
    for root, dirnames, filenames in os.walk(in_path):
        for fname in filenames:
            path_list.append(os.path.join(root, fname))

    # Each worker returns a plain dict; merge them in the parent process.
    final_dict = {}
    with Pool(processes=14) as pool:
        for partial_dict in pool.map(process_a_file, path_list):
            final_dict.update(partial_dict)
    list_title = sorted(final_dict)

    with open(out_path_dict, 'wb') as w_f:
        pickle.dump(final_dict, w_f)
    with open(out_path_list, 'wb') as w_f:
        pickle.dump(list_title, w_f)

I use list_title (the keys of the dict) to generate the trie. However, list_title has 14,608,727 entries, not the ~5M titles mentioned in GENRE/examples_genre/README.md.

Trie tree creation: with the title list, I use the code below to create my wiki trie tree and get the file our_kilt_titles_trie_dict.pkl:

from genre.fairseq_model import GENRE
from genre.trie import Trie
import pickle

model = (
    GENRE.from_pretrained("../models/fairseq_entity_disambiguation_aidayago", checkpoint_file='model.pt')
    .eval()
    .to('cpu')
)

# entities = ['AccessibleComputing', 'Anarchism']
with open('wikiextractor-master/wikiextractor/enwiki_data/title_list.pkl', 'rb') as r_f:
    entities = pickle.load(r_f)
print(len(entities))
trie = Trie([2]+model.encode(entity)[1:].tolist() for entity in entities).trie_dict
with open('our_kilt_titles_trie_dict.pkl', 'wb') as w_f:
    pickle.dump(trie, w_f)
print("finish running!")
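
For readers unfamiliar with the data structure, a prefix tree over token-ID sequences can be sketched with nested dicts. This is only a minimal illustration, not GENRE's Trie class: build_trie and next_tokens are hypothetical helper names, and the token IDs are made up.

```python
def build_trie(sequences):
    """Build a nested-dict prefix tree from iterables of token IDs."""
    root = {}
    for sequence in sequences:
        node = root
        for token in sequence:
            # Descend into the child for this token, creating it if needed.
            node = node.setdefault(token, {})
    return root


def next_tokens(trie, prefix):
    """Return the token IDs that may follow `prefix`, or [] if it is absent."""
    node = trie
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return sorted(node)


# Made-up token IDs; the leading 2 mirrors the [2] prefix in the code above.
toy_trie = build_trie([[2, 10, 11, 2], [2, 10, 12, 2], [2, 20, 2]])
print(next_tokens(toy_trie, [2]))
print(next_tokens(toy_trie, [2, 10]))
```

During constrained generation, a lookup like next_tokens is what restricts the decoder to continuations that spell out a known title, which is why the set of titles in the trie directly affects the results.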

I tested the same model with kilt_titles_trie_dict.pkl and our_kilt_titles_trie_dict.pkl, and the results differ. fairseq_blink_200k_default_no_reset is a pretrained model based on BLINK data with kilt_titles_trie_dict.pkl. Testing it with the two different tries gives a 2-point gap.
[Image: Capture]
[Image: Capture 1]

Question: Did you also do some filtering when creating kilt_titles_trie_dict.pkl? (I know from Issue #37 that you did some filtering when creating the special trie for AIDA.) I ask because neither the number of titles nor the results match.

@nicola-decao
Contributor

I used no filtering when creating the KILT trie, but I used the KILT knowledge source (http://dl.fbaipublicfiles.com/KILT/kilt_knowledgesource.json), not the Wikipedia dump directly. Your code is probably extracting many other page titles (maybe from special or deprecated pages) that should not be there. Indeed, 14M titles is too many: as you can see from my screenshot of https://www.wikipedia.org, as of today there are 6.2M pages, so fewer than half of what you are extracting.
[Image: screenshot of https://www.wikipedia.org]
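
A rough sketch of collecting titles from the KILT knowledge source instead of the raw dump. It assumes the file holds one JSON object per line with a wikipedia_title field; iter_kilt_titles is a hypothetical helper, and the two sample records are invented for illustration.

```python
import io
import json


def iter_kilt_titles(lines):
    """Yield the `wikipedia_title` of each JSON-lines record (assumed field name)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        title = record.get("wikipedia_title")
        if title:
            yield title


# Stand-in for open("kilt_knowledgesource.json", encoding="utf8").
sample = io.StringIO(
    '{"wikipedia_id": "12", "wikipedia_title": "Anarchism"}\n'
    '{"wikipedia_id": "39", "wikipedia_title": "Albedo"}\n'
)
titles = sorted(set(iter_kilt_titles(sample)))
print(titles)
```

Reading line by line keeps memory use low for a large file, and the resulting list could be passed to the trie-building script in place of title_list.pkl.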

bablf commented Feb 8, 2023

Leaving a comment since it might be helpful for someone in the future who wants to create their own trie from the current Wikipedia version.
I used wikimapper to get all Wikipedia titles, since it is easier than parsing the XML dump. Then I used an SQL query to get all titles and redirects. As far as I can tell, you only need to filter out disambiguation pages (containing "(disambiguation)") and subsections (containing "/").

select wikipedia_title from 
where wikidata_id is not NULL AND
wikipedia_title is not NULL AND
wikipedia_title != "" AND
wikipedia_title NOT LIKE '%(disambiguation)%' AND
wikipedia_title NOT LIKE '%/%'
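
The filter above can be tried against a toy database. This sketch assumes the wikimapper index is an SQLite file with a table named mapping that has wikipedia_title and wikidata_id columns; the table name, the sample rows, and the Wikidata IDs are assumptions made up for illustration, not taken from the comment.

```python
import sqlite3

# In-memory stand-in for the wikimapper index (table name is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mapping (wikipedia_title TEXT, wikidata_id TEXT)")
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?)",
    [
        ("Anarchism", "Q100"),
        ("Mercury_(disambiguation)", "Q101"),
        ("AC/DC", "Q102"),
        ("", "Q103"),
        ("Orphan_page", None),
    ],
)

query = """
    SELECT wikipedia_title FROM mapping
    WHERE wikidata_id IS NOT NULL
      AND wikipedia_title IS NOT NULL
      AND wikipedia_title != ''
      AND wikipedia_title NOT LIKE '%(disambiguation)%'
      AND wikipedia_title NOT LIKE '%/%'
"""
titles = [row[0] for row in conn.execute(query)]
print(titles)
conn.close()
```

In this toy data the "/" rule also drops the AC/DC row, so it may be worth checking whether titles that legitimately contain a slash matter for your use case.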

The code by HuiBinR works fine for creating the Trie.

@MrZilinXiao

Hi @bablf! It's very kind of you to share the SQL for filtering KILT titles. I am wondering if you are doing the same for entity linking usage.
