# Entities as Experts

This notebook is a code implementation of the paper "Entities as Experts: Sparse Memory Access with Entity Supervision" by Févry, Baldini Soares, FitzGerald, Choi, and Kwiatkowski.

## Problem definition and high-level model description

We want to answer one-shot questions that require external knowledge or context. For example, answering "Which country was Charles Darwin born in?" requires access to some text or structured source that contains the answer.

In this case, however, we want to rely on information extracted from a knowledge graph. For the question above, this lets us prune out entities unrelated to the naturalist and evolutionary theorist Charles Darwin, e.g. Charles River, Darwin City, etc.

In the paper, the authors propose to augment BERT for cloze-type question answering with an Entity Memory extracted from, e.g., a Knowledge Graph.

![Entity as Experts description](images/eae_highlevel.png)

The Entity Memory is simply a collection of embeddings of entities extracted from a Knowledge Graph. Relationships are ignored (see the Facts as Experts paper and notebook for how they could be used).
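
As a rough illustration of what sparse access to such a memory could look like, the sketch below scores a query vector against every entity embedding, keeps only the top-k entities, and returns their softmax-weighted sum. The function name `entity_memory_lookup`, the toy 2-dimensional embeddings, and the value of `k` are assumptions made for illustration; the paper's layer also projects mention span representations and its exact normalisation may differ.

```python
import math

def entity_memory_lookup(query, entity_embeddings, k=2):
    """Return the softmax-weighted sum of the k best-scoring entity embeddings.

    query: list of floats (a hypothetical mention representation)
    entity_embeddings: dict mapping entity name -> list of floats
    """
    # Dot-product score between the query and every entity embedding.
    scores = {
        name: sum(q * e for q, e in zip(query, emb))
        for name, emb in entity_embeddings.items()
    }
    # Sparse access: keep only the k highest-scoring entities.
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    # Softmax over the retained scores (shifted by the max for stability).
    max_score = max(scores[name] for name in top)
    exps = {name: math.exp(scores[name] - max_score) for name in top}
    total = sum(exps.values())
    weights = {name: exps[name] / total for name in top}
    # Weighted sum of the retained embeddings.
    dim = len(query)
    mixed = [sum(weights[n] * entity_embeddings[n][d] for n in top) for d in range(dim)]
    return mixed, weights

# Toy memory with made-up 2-dimensional embeddings (illustrative values only).
memory = {
    "Charles Darwin": [1.0, 0.0],
    "Charles River": [0.0, 1.0],
    "Darwin (city)": [-1.0, 0.0],
}
mixed, weights = entity_memory_lookup([2.0, 0.5], memory, k=2)
```

Only the two best-matching entities contribute to `mixed`; the third is never touched, which is the point of a sparse lookup.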

## Datasets

> We assume access to a corpus $D=\{(x_i, m_i)\}$, where all entity mentions are detected but not necessarily all linked to entities. We use English Wikipedia as our corpus, with a vocabulary of 1m entities. Entity links come from hyperlinks, leading to 32m 128 byte contexts containing 17m entity links.

Appendix B explains that:

> We build our training corpus of contexts paired with entity mention labels from the 2019-04-14 dump of English Wikipedia. We first divide each article into chunks of 500 bytes, resulting in a corpus of 32 million contexts with over 17 million entity mentions. We restrict ourselves to the one million most frequent entities (86% of the linked mentions).

Since the 2019-04-14 dump is not available at the time of writing, we will use the 2020-11-01 revision instead.

Entities are thus partially extracted from link annotations (i.e. each token is associated with a mention if it belongs to the anchor text of a Wikipedia link).
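
To make the idea of link annotations concrete, here is a minimal sketch that strips `[[Target|anchor]]` links from a piece of wiki markup and records where each anchor lands in the plain text. The helper `extract_link_mentions` and the regular expression are assumptions for illustration; real Wikipedia markup also contains templates, files, and nested constructs that this sketch does not handle.

```python
import re

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_link_mentions(wikitext):
    """Strip wiki links from `wikitext` and return (plain_text, mentions).

    Each mention is (start, end, target): character offsets into plain_text
    plus the linked page title.
    """
    plain_parts = []
    mentions = []
    cursor = 0       # position in the raw wikitext
    plain_len = 0    # length of the plain text produced so far
    for match in WIKILINK.finditer(wikitext):
        # Copy the text between the previous link and this one.
        before = wikitext[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        target = match.group(1).strip()
        anchor = match.group(2) or match.group(1)
        mentions.append((plain_len, plain_len + len(anchor), target))
        plain_parts.append(anchor)
        plain_len += len(anchor)
        cursor = match.end()
    plain_parts.append(wikitext[cursor:])
    return "".join(plain_parts), mentions

text, mentions = extract_link_mentions(
    "[[Charles Darwin|Darwin]] was born in [[Shrewsbury]]."
)
```

The character offsets can later be mapped onto word-piece tokens to obtain per-token entity labels.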

## Mention Detection

> In addition to the Wikipedia links, we annotate each sentence with unlinked mention spans using the mention detector from Section 2.2

The mention detection head discussed in Section 2.2 is a simple BIO sequence tagger: each token is annotated with B (beginning), I (inside) or O (outside), depending on whether it begins a mention, lies inside one, or lies outside any mention. The reason why we use both BIO tagging and entity linking (EL) is to avoid inconsistencies.
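
The BIO scheme can be illustrated with a small sketch that turns token-level mention spans into tags. The function `spans_to_bio` and the example tokens are hypothetical and only meant to show the labelling convention.

```python
def spans_to_bio(num_tokens, mention_spans):
    """Convert token-level mention spans into BIO tags.

    mention_spans: list of (start, end) token indices, end exclusive.
    """
    tags = ["O"] * num_tokens
    for start, end in mention_spans:
        tags[start] = "B"                 # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining tokens of the mention
    return tags

tokens = ["which", "country", "was", "charles", "darwin", "born", "in", "?"]
tags = spans_to_bio(len(tokens), [(3, 5)])
# tags == ["O", "O", "O", "B", "I", "O", "O", "O"]
```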

There is a catch. In the paper, the authors explain that they used Google NLP APIs to perform entity detection and linking on Wikipedia at large scale, that is, to obtain a properly annotated Wikipedia dataset.

Since we cannot afford this, we will use spaCy's entity detection and linking capabilities as a baseline. Data quality may therefore be lower than in the paper.

## Chunking

- Split articles into chunks of 500 bytes (assuming a Unicode encoding such as UTF-8).
- Trim each chunk back to its last period, so that chunks end on a sentence boundary and stay within the limit.
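
The two steps above can be sketched as follows. This is a minimal illustration, assuming UTF-8 encoding and a hypothetical helper `chunk_article`; the text after the last period of a window is carried over into the next chunk rather than dropped, which is one possible reading of the rule, not necessarily what the paper does.

```python
def chunk_article(text, max_bytes=500):
    """Split `text` into chunks of at most `max_bytes` UTF-8 bytes.

    Each chunk is trimmed back to its last period when one exists, so that
    chunks end on a sentence boundary where possible.
    Assumes max_bytes >= 4 (the longest UTF-8 character is 4 bytes).
    """
    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        if end < len(data):
            # Avoid cutting a multi-byte character in half:
            # UTF-8 continuation bytes look like 0b10xxxxxx.
            while end > start and (data[end] & 0xC0) == 0x80:
                end -= 1
            # Trim back to the last period inside the window, if any.
            last_period = data.rfind(b".", start, end)
            if last_period != -1:
                end = last_period + 1
        chunk = data[start:end].decode("utf-8").strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks

chunks = chunk_article("Alpha beta. Gamma delta. Epsilon zeta eta.", max_bytes=20)
```

With the 20-byte limit used in the example, each sentence ends up in its own chunk.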

## Tokenization

- BERT Tokenizer (i.e. WordPiece) using the lowercase vocabulary, limited to 128 word-piece tokens per context.
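
To show what WordPiece tokenization with a length limit amounts to, here is a toy greedy longest-match-first sketch. The vocabulary `toy_vocab` and the helpers `wordpiece_tokenize` and `encode` are made up for illustration; the real BERT tokenizer also splits punctuation, adds special tokens such as `[CLS]` and `[SEP]`, and uses a vocabulary of roughly 30k entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    word = word.lower()
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Find the longest vocabulary entry that matches at `start`.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_tokens=128):
    """Tokenize whitespace-separated words and truncate to `max_tokens`."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    return tokens[:max_tokens]

# Hypothetical toy vocabulary, not BERT's.
toy_vocab = {"dar", "##win", "was", "born"}
tokens = encode("Darwin was born", toy_vocab)
# tokens == ["dar", "##win", "was", "born"]
```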

In [1]:
from tools.providers import WikipediaProvider

WikipediaProvider.dump_full_dataset(revision="20201101")

Downloading partition wikipedia/enwiki-20201101-pages-articles-multistream23.xml-p49288942p50564553.bz2


100%|██████████| 255M/255M [00:50<00:00, 5.05MiB/s] 


## Model

In the paper, the authors explain they used a modified BERT.
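
For reference, the embedding and attention code in this section relies on two standard Transformer formulas, restated here in the notation of "Attention Is All You Need". The symbols $pos$, $i$, $d_{model}$ and $d_k$ come from that paper and are not variable names used in this notebook. Note that the original BERT learns its position embeddings; the sinusoidal form is what the `PositionalEmbedding` class in this notebook computes.

$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$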

In [3]:
from torch.nn import Module, Embedding, Dropout, ModuleList, Linear
import torch.nn as nn
import torch
import math

GELU = torch.nn.GELU
LayerNorm = torch.nn.LayerNorm

# Number of transformer layers before (l0) and after (l1) the entity memory layer
l0 = 4
l1 = 8

class SublayerConnection(Module):
    """A residual connection around a sublayer whose output is layer-normalized
    and then passed through dropout.
    """
    
    def __init__(self, size, dropout):
        """
        :param size the size of the layer norm
        :param dropout the dropout rate
        """
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = Dropout(dropout)
    
    def forward(self, x, sublayer):
        return x + self.dropout(self.norm(sublayer(x)))


class TokenEmbedding(Embedding):
    """A token embedding is a mere synonym of the default torch embedder.
    
    It simply keeps a fixed vocabulary size and reserves index 0 for padding.
    """
    """
    def __init__(self, vocab_size, embed_size=512):
        super().__init__(vocab_size, embed_size, padding_idx=0)
    

class PositionalEmbedding(Module):
    def __init__(self, d_model, max_len=512):
        """
        Setup positional embeddings
        
        :param d_model: the size of a single embedding
        :param max_len: the maximum number of tokens to embed
        """
        super().__init__()
        
        # Compute the positional encodings once, with the frequencies built in log space
        pe = torch.zeros(max_len, d_model).float()
        pe.requires_grad = False
        
        position = torch.arange(0, max_len).float().unsqueeze(1)
        # 1 / 10000^(2i / d_model), computed as exp(-2i * ln(10000) / d_model)
        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
         
        # Even dimensions hold the sine, odd dimensions the cosine of position * frequency
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        return self.pe[:, :x.size(1)]

    
class SegmentEmbedding(Embedding):
    """
    Segment Embedding. It works in a very similar way to the token embedding, but the
    vocabulary has only 3 entries: 0 for padding, 1 and 2 for the first and second segment.
    """
    def __init__(self, embed_size=512):
        super().__init__(3, embed_size, padding_idx=0)
        

class BERTEmbedding(Module):
    """
    BERT Embeddings which is made of the following features:
    
    1. Token Embedding: normal embedding matrix
    2. Positional Embedding: additional position information
    3. SegmentEmbedding: Additional sentence segment info (eg. tok_1:1, tok_2:2)
    """
    def __init__(self, vocab_size, embed_size, dropout=0.1):
        """
        :param vocab_size: the size of the vocabulary
        :param embed_size: the vector dimensionality of the embeddings
        :param dropout: dropout rate
        """
        
        super().__init__()
        self.token = TokenEmbedding(vocab_size=vocab_size, embed_size=embed_size)
        self.position = PositionalEmbedding(d_model=self.token.embedding_dim)
        self.segment = SegmentEmbedding(embed_size=self.token.embedding_dim)
        self.dropout = nn.Dropout(p=dropout)
        self.embed_size = embed_size
        
    def forward(self, sequence, segment_label):
        x = self.token(sequence) + self.position(sequence) + self.segment(segment_label)
        return self.dropout(x)


class Attention(nn.Module):
    """Attention mechanism in BERT.
    
    Compute Scaled Dot Product Attention.
    
    :return: a tuple (p_attn @ value, p_attn), where p_attn has dropout applied if given
    """
    def forward(self, query, key, value, mask=None, dropout=None):
        
        # compare the query with every key to get attention scores,
        # scaled by sqrt(d_k) as in "Attention Is All You Need"
        scores = torch.matmul(query, key.transpose(-2, -1)) \
            / math.sqrt(query.size(-1))
        
        # Fill masked positions with a large negative value so softmax ignores them
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # get a probability distribution
        p_attn = torch.softmax(scores, dim=-1)
        
        if dropout is not None:
            p_attn = dropout(p_attn)
        
        return torch.matmul(p_attn, value), p_attn

class MultiHeadedAttention(Module):
    """
    Wrap and combine multiple attention heads
    """
    
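    def __init__(self, heads, embedding_size, dropout=0.1):
        """
        :param heads: the number of attention heads
        :param embedding_size: the size of a token embedding. Must be equal to
               embedding_per_head * heads, where embedding_per_head is the
               dimensionality handled by each head
        :param dropout: dropout rate
        """
        super().__init__()
        assert embedding_size % heads == 0
        
        self.embedding_per_head = embedding_size // heads
        self.heads = heads
        
        # One projection each for query, key and value.
        # ModuleList registers the layers as submodules, so their parameters are tracked.
        self.linear_layers = ModuleList([Linear(embedding_size, embedding_size) for _ in range(3)])
        self.output_linear = Linear(embedding_size, embedding_size)
        self.attention = Attention()
        
        self.dropout = Dropout(p=dropout)
        
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        
        # Step 1: project query, key, value and split them into heads:
        # (batch, seq_len, embedding_size) -> (batch, heads, seq_len, embedding_per_head)
        query, key, value = [
            layer(x).view(batch_size, -1, self.heads, self.embedding_per_head).transpose(1, 2)
            for layer, x in zip(self.linear_layers, (query, key, value))
        ]
        
        # Step 2: apply scaled dot-product attention to every head in parallel
        x, _ = self.attention(query, key, value, mask=mask, dropout=self.dropout)
        
        # Step 3: concatenate the heads and apply the output projection
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.heads * self.embedding_per_head)
        return self.output_linear(x)

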
class TransformerBlock(Module):
    pass