In [1]:
from IPython import display
import binascii
import os

def hide_code_in_slideshow():
    uid = binascii.hexlify(os.urandom(8)).decode()    
    html = """<div id="%s"></div>
    <script type="text/javascript">
        $(function(){
            var p = $("#%s");
            if (p.length==0) return;
            while (!p.hasClass("cell")) {
                p=p.parent();
                if (p.length==0 || p.prop("tagName")=="BODY") return;
            }
            var cell = p;
            cell.find(".input").addClass("hide-in-slideshow")
        });
    </script>""" % (uid, uid)
    display.display_html(html, raw=True)
    
%matplotlib inline

In [2]:
%%html
<style>
 .container.slides .celltoolbar, .container.slides .hide-in-slideshow {
    display: none !important;
}
</style>

<center><h1>Navigating a Knowledge Base</h1></center>


<br>
<center>Using Learned Representations for Encyclopedia Navigation</center>
<br>
<center><small><i>Christopher Akiki</i></small></center>

<center><small><i>Wissens- und Content Management WiSe 20/21</i></small></center>

<h1>Overview</h1>
<br>
<ul>
    <li><b>Context:</b> Navigating an encyclopedic body of work </li>
    <br>
    <li><b>Data:</b> Stanford Encyclopedia of Philosophy and Wikipedia Philosophers</li>
    <br>
    <li><b>Features:</b> Node2vec, Specter and USE</li>
    <br>
    <li><b>Retrieval:</b> Elasticsearch and k-nearest neighbor</li>
</ul>

# Context: How *not* to consume an encyclopedia


> *In 1797, Fath Ali was given a complete set of the Britannica's 3rd edition, which he read completely; after this feat, he extended his royal title to include "Most Formidable Lord and Master of the Encyclopædia Britannica." ([source](https://en.wikipedia.org/wiki/Fath-Ali_Shah_Qajar))*

> *The Know-It-All: One Man's Humble Quest to Become the Smartest Person in the World is a book by Esquire editor A. J. Jacobs, published in 2004. It recounts his experience of reading the entire Encyclopædia Britannica; all 32 volumes of the 2002 edition, extending to over 33,000 pages with some 44 million words. He set out on this endeavour to become the "smartest person in the world". The book is organized alphabetically in encyclopedia format and recounts both interesting facts from the encyclopedia and the author's experiences.* ([source](https://en.wikipedia.org/wiki/The_Know-It-All))

> *In 2008, Ammon Shea published his account of reading the complete Oxford English Dictionary.* ([source](https://en.wikipedia.org/wiki/The_Know-It-All))

<h1>Context: How to consume an encyclopedia</h1>
<br>
<ul>
    <li><b>Encyclopedias:</b> Distillation of human knowledge </li>
    <br>
    <li><b>Goal of this project:</b> Make it easier to explore a medium that is designed for direct, targeted lookup rather than cover-to-cover reading</li>
    <br>
    <li><b>Problem:</b> The huge scale of encyclopedic knowledge bases</li>
    <br>
    <li><b>Solution:</b> Reduce the problem to one topical corner: <u>philosophy</u></li>
</ul>

# Context: High-level goal


* Writing assistant


* Traverse the knowledge base in a locally relevant way

<h1>Data: <i>In vivo</i></h1>
<br>
<br>
<ul>
    <li><b>The Stanford Encyclopedia of Philosophy:</b> <a href="https://plato.stanford.edu/contents.html">All Articles</a></li>
    <br>
    <br>
    <li><b>All Wikipedia Philosophers:</b> Every Wikipedia article about a philosopher</li>
    <br>
</ul>

<h1>Data: <i>High-level view</i></h1>
<br>
<br>
<ul>
    <li><b>Graph structure:</b> Use the links between entries to discover related entries</li>
    <br>
    <br>
    <li><b>Text:</b> Use the text to find an entry point into the graph, and also to discover related entries</li>
    <br>
</ul>

<h1>Data: Scraping</h1>
<br>
<br>
<ul>
    <li><b>The Stanford Encyclopedia of Philosophy:</b> Scrapy</li>
    <br>
    <br>
    <li><b>All Wikipedia Philosophers:</b> Local DBpedia instance</li>
</ul>
<br>

[Scrapy notebook](scraping_plato.ipynb) / [DBpedia notebook](scraping_wiki_dbpedia.ipynb)
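To give a rough idea of what querying a DBpedia instance for philosophers can look like, here is a minimal sketch that only builds a paged SPARQL query string. It is not the query used in the linked notebook. The function name `build_philosopher_query` is hypothetical, and the choice of the `dbo:Philosopher` class together with `rdfs:label` and `dbo:abstract` is an assumption about which fields are wanted.

```python
def build_philosopher_query(limit=100, offset=0, lang="en"):
    """Build a paged SPARQL query for philosophers and their abstracts.

    Assumption: philosophers are selected via the dbo:Philosopher class;
    the actual notebook may select them differently.
    """
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?philosopher ?name ?abstract WHERE {{
    ?philosopher a dbo:Philosopher ;
                 rdfs:label ?name ;
                 dbo:abstract ?abstract .
    FILTER (lang(?name) = "{lang}" && lang(?abstract) = "{lang}")
}}
ORDER BY ?philosopher
LIMIT {int(limit)} OFFSET {int(offset)}
""".strip()


# Example: the third page of 10 results.
query = build_philosopher_query(limit=10, offset=20)
```

The resulting string could then be sent to whatever SPARQL endpoint the local instance exposes; paging with `LIMIT` and `OFFSET` keeps each response small.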

# Data: *In vitro*

In [7]:
!ls ../data/

dbpedia.pkl			plato_undirected.graphml
dbpedia_with_articles.pkl	spectre_embeddings.npy
homophily_plato_embeddings.npy	structural_plato_embeddings.npy
homophily_wiki_embeddings.npy	structural_wiki_embeddings.npy
node2vec_v1			to_index.csv
old_node2vec_embeddings		to_index_p4.pkl
plato_directed.gephi		to_index.pkl
plato_directed.graphml		use_abstract_embeddings.npy
plato.pkl			use_article_embeddings.npy


# Feature Creation

**What we have** for every entry:

* Links to other entries


* Abstract


* Full text
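The three pieces of information above map naturally onto a small record type plus a link graph. The sketch below is one possible in-memory representation, not the format of the pickled files in `../data/`; the names `Entry`, `entry_id` and `build_undirected_graph` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Entry:
    """One encyclopedia entry; all field names are illustrative."""
    entry_id: str
    abstract: str = ""
    full_text: str = ""
    links: List[str] = field(default_factory=list)  # ids of linked entries


def build_undirected_graph(entries: List[Entry]) -> Dict[str, Set[str]]:
    """Map each entry id to the entries it links to or is linked from."""
    known = {e.entry_id for e in entries}
    graph: Dict[str, Set[str]] = {entry_id: set() for entry_id in known}
    for e in entries:
        for target in e.links:
            # Skip self-links and links that leave the collection.
            if target in known and target != e.entry_id:
                graph[e.entry_id].add(target)
                graph[target].add(e.entry_id)
    return graph


entries = [
    Entry("kant", links=["hume", "kant", "missing"]),
    Entry("hume"),
    Entry("plato", links=["hume"]),
]
graph = build_undirected_graph(entries)
```

Dropping the direction of each link is a design choice made here for simplicity; a directed variant would only add the edge from source to target.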

# Feature Creation

* [Node2vec](https://arxiv.org/abs/1607.00653): Graph embeddings learned by generating "sentences" from biased random walks over the graph and feeding them to word2vec


* [USE](https://arxiv.org/abs/1803.11175): Universal Sentence Encoder from Google


* [SPECTER](https://arxiv.org/abs/2004.07180): Document-level Representation Learning using Citation-informed Transformers
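To make the Node2vec idea concrete, the following sketch generates the walk "sentences" in plain Python. It is an illustration under simplifying assumptions (unweighted, undirected adjacency dict, no precomputed alias tables) rather than the implementation used for the embeddings in this project. The return parameter `p` and in-out parameter `q` follow the Node2vec paper; the function names are hypothetical.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """One second-order biased random walk over an adjacency dict."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # return to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)      # stay near the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, num_walks, length, p=1.0, q=1.0, seed=0):
    """Start num_walks walks from every node, in shuffled order."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(graph, node, length, p, q, rng))
    return walks


toy_graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
walks = generate_walks(toy_graph, num_walks=2, length=5)
```

Each walk is a list of node ids that can be treated like a tokenized sentence and passed to any word2vec trainer to obtain one vector per node.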

# Retrieval



* BM25 retrieval on text fields [notebook](indexing.ipynb)


* k-nearest neighbor retrieval on dense embeddings [notebook](nearest_neighbor.ipynb)
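Both retrieval modes can be illustrated without a search engine. The sketch below scores tokenized documents with an Okapi BM25 variant and ranks dense vectors by cosine similarity. It is a toy stand-in for what Elasticsearch does at scale, not the code from the linked notebooks; the defaults `k1=1.2` and `b=0.75` match Elasticsearch's documented BM25 defaults, while the use of cosine similarity for the nearest-neighbor step is an assumption.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_nearest(query_vec, embeddings, k=3):
    """Return the k (id, similarity) pairs closest to query_vec."""
    scored = [(key, cosine_similarity(query_vec, vec))
              for key, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


docs = [["kant", "ethics", "duty"], ["hume", "causation"], ["kant", "kant", "reason"]]
text_scores = bm25_scores(["kant"], docs)

vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = k_nearest([1.0, 0.1], vectors, k=2)
```

The brute-force scan in `k_nearest` is linear in the number of entries, which is fine for a few thousand articles; larger collections would typically call for an approximate nearest-neighbor index.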

# Outlook


* How to evaluate results?


* Collapse nodes by finding (near) duplicates