# Create New Wikipedia Index

This notebook includes tools to:
- extract the metadata (titles with text) from an extracted Wikipedia dump
- convert that dump into the format pyserini needs to build full page indices

See `readmes/preprocessing_new_wikipedia.md` for more info on the other stages in the process.

In [33]:
import sh
import json
import jsonlines
import os

from multiqa_utils import wikipedia_utils as wu

%load_ext autoreload
%autoreload 2


## Identify All Titles in Dump that Have Text

This collects all the titles that actually contain text and writes them into a metadata file in each segment directory. It then aggregates all the metadata into a single dict.

In [60]:
# The path of the new dump
wikipath = "/scratch/ddr8143/wikipedia/wiki_20220701/raw_extracted"

In [61]:
# Inspect the data setup
wiki_seg = wu.get_wikiseg_path(wikipath, "AA")
print(wiki_seg)
sh.ls(wiki_seg)

/scratch/ddr8143/wikipedia/wiki_20220701/raw_extracted/AA


metadata.json  wiki_12	wiki_25  wiki_38  wiki_51  wiki_64  wiki_77  wiki_90
wiki_00        wiki_13	wiki_26  wiki_39  wiki_52  wiki_65  wiki_78  wiki_91
wiki_01        wiki_14	wiki_27  wiki_40  wiki_53  wiki_66  wiki_79  wiki_92
wiki_02        wiki_15	wiki_28  wiki_41  wiki_54  wiki_67  wiki_80  wiki_93
wiki_03        wiki_16	wiki_29  wiki_42  wiki_55  wiki_68  wiki_81  wiki_94
wiki_04        wiki_17	wiki_30  wiki_43  wiki_56  wiki_69  wiki_82  wiki_95
wiki_05        wiki_18	wiki_31  wiki_44  wiki_57  wiki_70  wiki_83  wiki_96
wiki_06        wiki_19	wiki_32  wiki_45  wiki_58  wiki_71  wiki_84  wiki_97
wiki_07        wiki_20	wiki_33  wiki_46  wiki_59  wiki_72  wiki_85  wiki_98
wiki_08        wiki_21	wiki_34  wiki_47  wiki_60  wiki_73  wiki_86  wiki_99
wiki_09        wiki_22	wiki_35  wiki_48  wiki_61  wiki_74  wiki_87
wiki_10        wiki_23	wiki_36  wiki_49  wiki_62  wiki_75  wiki_88
wiki_11        wiki_24	wiki_37  wiki_50  wiki_63  wiki_76  wiki_89

In [62]:
# First, parse all metadata and write metadata files to segment directories
for segment in sorted(os.listdir(wikipath)):
    wu.get_segment_metadata(wikipath, segment, force=False, verbose=False)
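
As a rough illustration of what collecting "titles with text" could involve, the sketch below scans the lines of one extracted file and keeps only pages whose text is non-empty. It assumes each line is a JSON record with `id`, `url`, `title` and `text` keys (the layout WikiExtractor produces with `--json`); that layout and the helper name `titles_with_text` are assumptions for illustration, not the actual `wu.get_segment_metadata` implementation.

```python
import json
from collections import defaultdict


def titles_with_text(lines):
    """Map each title to its page entries, keeping only pages with non-empty text.

    Hypothetical helper: assumes one JSON record per line with
    `id`, `url`, `title` and `text` keys.
    """
    titles = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        page = json.loads(line)
        if page.get("text", "").strip():
            titles[page["title"]].append({"id": page["id"], "url": page["url"]})
    return dict(titles)


# Toy records (made up for the example, not taken from the dump).
sample = [
    json.dumps({"id": "1", "url": "https://example.org/1", "title": "Page A", "text": "some text"}),
    json.dumps({"id": "2", "url": "https://example.org/2", "title": "Page B", "text": ""}),
]
result = titles_with_text(sample)
print(result)
```

Storing a list per title, rather than a single entry, mirrors the shape of the aggregated dict printed later in this notebook.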

In [5]:
# Then load the titles with text
all_titles_wtext = wu.aggregate_wikipedia_metadata_key(wikipath, "titles_with_text", use_tqdm=True)
len(all_titles_wtext)

100%|██████████| 170/170 [01:14<00:00,  2.29it/s]


6485795
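
The aggregation step can be pictured as merging one key from every segment's `metadata.json` into a single dict. The following is a minimal sketch under that assumption; the function `aggregate_metadata_key` and the exact file layout are hypothetical, and the real `wu.aggregate_wikipedia_metadata_key` may work differently.

```python
import json
import os
import tempfile


def aggregate_metadata_key(wikipath, key, metadata_name="metadata.json"):
    """Merge `key` from every segment's metadata file into one dict (hypothetical helper)."""
    merged = {}
    for segment in sorted(os.listdir(wikipath)):
        meta_path = os.path.join(wikipath, segment, metadata_name)
        if not os.path.isfile(meta_path):
            continue  # skip segments whose metadata has not been written yet
        with open(meta_path) as f:
            segment_meta = json.load(f)
        for title, entries in segment_meta.get(key, {}).items():
            # A title can appear in several segments, so extend rather than overwrite.
            merged.setdefault(title, []).extend(entries)
    return merged


# Build a tiny fake dump with two segments to exercise the function.
with tempfile.TemporaryDirectory() as root:
    toy_segments = {
        "AA": {"Page A": [{"id": "1"}]},
        "AB": {"Page A": [{"id": "2"}], "Page B": [{"id": "3"}]},
    }
    for segment, titles in toy_segments.items():
        os.makedirs(os.path.join(root, segment))
        with open(os.path.join(root, segment, "metadata.json"), "w") as f:
            json.dump({"titles_with_text": titles}, f)
    merged = aggregate_metadata_key(root, "titles_with_text")

print(len(merged))
```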

In [27]:
# Show the first 10 titles with text
title_iter = iter(all_titles_wtext.items())
for i in range(10):
    print(next(title_iter))

('Anarchism', [{'id': '12', 'url': 'https://en.wikipedia.org/wiki?curid=12'}])
('Albedo', [{'id': '39', 'url': 'https://en.wikipedia.org/wiki?curid=39'}])
('A', [{'id': '290', 'url': 'https://en.wikipedia.org/wiki?curid=290'}])
('Alabama', [{'id': '303', 'url': 'https://en.wikipedia.org/wiki?curid=303'}])
('Achilles', [{'id': '305', 'url': 'https://en.wikipedia.org/wiki?curid=305'}])
('Abraham Lincoln', [{'id': '307', 'url': 'https://en.wikipedia.org/wiki?curid=307'}])
('Aristotle', [{'id': '308', 'url': 'https://en.wikipedia.org/wiki?curid=308'}])
('An American in Paris', [{'id': '309', 'url': 'https://en.wikipedia.org/wiki?curid=309'}])
('Academy Award for Best Production Design', [{'id': '316', 'url': 'https://en.wikipedia.org/wiki?curid=316'}])
('Academy Awards', [{'id': '324', 'url': 'https://en.wikipedia.org/wiki?curid=324'}])
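
Since every title maps to a list of entries, a title may in principle be shared by more than one page id. The sketch below uses a toy dict with the same shape (the data is made up) to show a compact way to peek at the first few items with `itertools.islice` and to pick out titles with several entries.

```python
from itertools import islice

# Toy stand-in for `all_titles_wtext`; the values are invented for illustration.
toy_titles = {
    "Title A": [{"id": "1"}],
    "Title B": [{"id": "2"}, {"id": "3"}],
    "Title C": [{"id": "4"}],
}

# Take the first two items without materializing the whole dict as a list.
first_two = list(islice(toy_titles.items(), 2))

# Keep only titles that map to more than one page entry.
shared = {title: entries for title, entries in toy_titles.items() if len(entries) > 1}

print(first_two)
print(shared)
```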


## Create a Full Page Wikipedia Index

Create the base file that pyserini will use to build a full page index.

In [87]:
# Call utils to parse into correct format (don't force)
wu.postprocess_wikipedia_to_page_index(
    input_wikipath="/scratch/ddr8143/wikipedia/wiki_20220701/raw_extracted",
    output_dir="/scratch/ddr8143/wikipedia/wiki_20220701/fullpage_preindex",
    verbose=True,
    force=False,
)

>> Finished Postprocessing Wikipedia to Page Index
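
For orientation, pyserini's `JsonCollection` reads documents that carry an `id` field and a `contents` field. The sketch below shows one way an extracted page could be mapped to that shape. The input layout (JSON lines with `id`, `title`, `text`), the helper names, and the choice to prepend the title to the text are all assumptions for illustration; `wu.postprocess_wikipedia_to_page_index` may do something different.

```python
import json


def to_pyserini_doc(page):
    """Convert one extracted page into an `id`/`contents` document, or None if it has no text."""
    text = page.get("text", "").strip()
    if not text:
        return None
    # Assumption: include the title as the first line of the indexed contents.
    return {"id": page["id"], "contents": f"{page['title']}\n{text}"}


def convert_lines(lines):
    """Yield one JSON string per page with text, ready to write to a .jsonl file."""
    for line in lines:
        if line.strip():
            doc = to_pyserini_doc(json.loads(line))
            if doc is not None:
                yield json.dumps(doc)


# Toy input records (made up for the example).
sample = [
    json.dumps({"id": "1", "title": "Page A", "text": "Body of page A."}),
    json.dumps({"id": "2", "title": "Page B", "text": "   "}),
]
docs = [json.loads(d) for d in convert_lines(sample)]
print(docs)
```

Pages without text are dropped here so that the index only contains the same pages that appear in the titles-with-text metadata.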
