<a href="https://colab.research.google.com/github/aolieman/semantic-corpus-exploration/blob/master/notebooks/widenet_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Corpus Exploration




## Part II: Introducing WideNet

### Semantic Search Tools for the (Digital) Humanities


## Introduction

The previous part of the workshop disentangled the algorithms that drive entity linking. This part emphasises particular applications of entity linking: exploratory search and source selection. 

In general, the goal of this section is to introduce [**WideNet**](http://widenet.politicalmashup.nl/dh2017/), an exploratory search system for corpus selection. We first explain how and why we built WideNet, and then demonstrate the interface. 

We show how entity linking allows us to explore digital corpora in new ways and argue that, even if the technology itself is far from perfect, it can still serve useful purposes in some applications. This, however, requires a particular tool design that corrects, or at least accommodates, the possible negative effects of the “crummy” technologies involved. 

In short, the topics we cover are:

- Finding Complex Concepts (Search)
- Using “Crummy” EL Technologies (Tool Design)
- Search as Source Selection for the Humanities (Tool Criticism)


## Context of WideNet


### Searching for complex concepts with WideNet

The creation of WideNet started with specific historical research questions: How do politicians today refer to the past? How is history reimagined and used in contemporary political discourses?

In particular, we were interested in when and how Dutch parliamentarians mention the so-called Dutch "[Golden Age](https://en.wikipedia.org/wiki/Dutch_Golden_Age)" in their speeches. 

Historical eras are complex concepts in many ways: 
they comprise a wide range of individual events, people, locations etc. Moreover, they are unstable constructs, since historians are likely to disagree on their scope and content.

This raises various methodological issues: How can we search for these concepts? How can we search for mentions of the past?

WideNet was built as a tool to tackle these problems.

### Complex concepts: an example

To better understand why we model "events" or "periods" as complex concepts, have a look at the painting below.

<img src="https://upload.wikimedia.org/wikipedia/commons/1/1c/Helst%2C_Peace_of_M%C3%BCnster.jpg"> 

Imagine you want to find speeches related to this painting. The painting is titled, in Dutch, “Schuttersmaaltijd ter viering van de Vrede van Munster” and was made by Bartholomeus van der Helst in 1648.
It depicts the civic guard (the _schutters_) celebrating the Peace of Munster.

The task (searching for talk about the past) would require a reconstruction of the narrative (or scene): we would need to meaningfully combine all the depicted entities (persons, objects, events, etc.).


Even as a single micro-event, it contains references to many different entities:
- It is located in **Amsterdam**, as suggested by the flag carried by the ensign Jacob Baning, which shows the maiden of Amsterdam (also note the buildings that can be seen through the window).
- Other **characters** play a role in this scene, such as Witsen and Van Waveren, shaking hands on the right. 
- The drum, in turn, has a few lines of poetry by Jan Vos written on it, which praise an everlasting covenant.

Ideally, to find relevant information, our search engine should generate a partial reconstruction of this scene.


### Complex concepts: running example (Dutch Golden Age)

<img src="https://raw.githubusercontent.com/kasparvonbeelen/dh2019-SCE-workshop/master/Screenshot%202019-07-06%20at%2013.21.51.png"/>

Inspired by this example, we applied entity linking to retrieve documents related to larger historical periods, such as the Dutch Golden Age. This concept can be treated as a **container** which holds many entities:

- The **Synod of Dordrecht**, which attempted to end the religious controversy in the Republic between Remonstrants (Arminians) and Contraremonstrants (Gomarists)
- The emergence of **capitalism** in the form of a Stock Exchange in Amsterdam in 1609
- The expanding **international trade** exemplified by companies such as the Vereenigde Oostindische Compagnie (the Dutch East India Company)
- And of course, individuals related to this period, such as the poet, writer and playwright Joost van den Vondel, Huygens, etc.

The goal of WideNet is to connect a wide variety of references which are **implicitly part of a complex container concept** (here a time period), but scattered all over the corpus.

### Colligatory concepts

Historical periods, such as the "Golden Age" or the "French Revolution", are often referred to as **colligatory concepts**. 
- These concepts are coined by historians in an attempt to make the past understandable; they bind persons, locations, and events into a coherent narrative.

- As a result, these representations or narratives are unstable: they change over time and space, and they ultimately depend on the sense-making activity of the researcher.



### Challenges

Colligatory concepts are difficult to grasp as objects of information retrieval. WideNet's goal is to enable historians to explore such concepts, using exploratory search. 
More precisely, WideNet attempts to find these complex concepts by:
- going **beyond keyword search**
- providing a convenient interface that presents relevant fragments related to the concept under investigation
- helping researchers to **clarify** how they understand the scope of their research (and sculpt their query as a result).


### Challenges: Beyond Keyword Search

To probe these challenges a bit further: Why are more traditional methods of access (such as keyword search) insufficient? And what does it actually mean to go beyond keywords?

Let us return to the leading example of the Dutch golden age (as evoked in parliamentary discourse). The easiest way to find relevant documents would be to simply search for the string “Golden Age”. The result of [this query](http://search.politicalmashup.nl/?query=%20%7B%22page%22:1,%22debug%22:false,%22useRegexQuery%22:false,%22regexQuery%22:%22%22,%22query%22:%22%5C%22Gouden%20Eeuw%5C%22%22,%22downloadAmount%22:1000,%22selectedCollection%22:%22Netherlands%22,%22selectedDocType%22:%22Speech%22,%22selectedOrder%22:%22Relevance%22,%22excludedPartiesTags%22:%5B%5D,%22excludedSpeakersTags%22:%5B%5D,%22selectedSpeakersTags%22:%5B%5D,%22selectedPartiesTags%22:%5B%5D,%22partyFacets%22:%7B%7D,%22speakerFacets%22:%7B%7D,%22houseFacets%22:%7B%7D,%22categoryFacets%22:%7B%7D,%22dossierFacets%22:%7B%7D,%22roleFacets%22:%7B%7D,%22excludedParties%22:%7B%7D,%22excludedSpeakers%22:%7B%7D,%22sliderYearMin%22:1800,%22sliderYearMax%22:2018,%22dateStart%22:%221995-12-31T23:58:45.000Z%22,%22dateEnd%22:%222018-01-01T00:00:00.000Z%22,%22docType%22:%22speech%22,%22searchTopicTitleOnly%22:false,%22searchClicked%22:true,%22advancedSearchOpened%22:true,%22graphsOpened%22:false,%22yearFacets%22:%7B%7D%7D) is shown below:

<img src="https://raw.githubusercontent.com/kasparvonbeelen/dh2019-SCE-workshop/master/Screenshot%202019-07-06%20at%2013.47.10.png">


As you can see, this query doesn’t return many hits.
More importantly, as mentioned earlier, there is a difference between the **sequence of characters** “Golden Age” and the **concept** (the container of entities related to this period). A simple keyword search therefore misses many potentially relevant documents.


Of course, composing very intricate boolean queries would be one workaround. The query itself would then contain a wide variety of elements, such as persons, events, organisations or architecture.
One could certainly spend hours trying to formulate a query that includes all these elements. But as the figure below suggests, this is neither elegant nor effective. 




<img src="https://raw.githubusercontent.com/kasparvonbeelen/dh2019-SCE-workshop/master/Screenshot%202019-07-06%20at%2013.49.18.png">

Even an elaborate boolean keyword search will hardly track the relevant entities, primarily because this approach doesn’t handle the problems of **ambiguity** and **name variation** very well. 
- A name can refer to different things (for example Erasmus is a philosopher but his name is also attached to a university)
- Similarly, one person can be referred to in many different ways, and the spelling of the name may vary. 

In sum, formulating a very complex boolean query is not impossible, but it is laborious and still leaves ambiguity and name variation unresolved.
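
The difference between matching a string and matching an entity can be made concrete with a small sketch. Everything in it is a toy assumption: the three sentences are invented, and the `linked` dictionary merely stands in for the output of an entity linker (the `dbr:` identifiers are illustrative DBpedia-style labels, not results we obtained).

```python
# Toy corpus (invented sentences, for illustration only).
speeches = {
    "s1": "Erasmus wrote In Praise of Folly.",
    "s2": "Students at Erasmus University protested.",
    "s3": "Desiderius of Rotterdam was a great humanist.",
}

# Hypothetical entity-linker output: each speech mapped to knowledge-base IDs.
linked = {
    "s1": {"dbr:Erasmus"},
    "s2": {"dbr:Erasmus_University_Rotterdam"},
    "s3": {"dbr:Erasmus"},
}


def keyword_search(term):
    """Return IDs of speeches whose text contains the term (case-insensitive)."""
    return sorted(
        doc_id for doc_id, text in speeches.items() if term.lower() in text.lower()
    )


def entity_search(entity_id):
    """Return IDs of speeches that were linked to the given entity."""
    return sorted(
        doc_id for doc_id, entities in linked.items() if entity_id in entities
    )


keyword_hits = keyword_search("Erasmus")     # matches the university (ambiguity)
entity_hits = entity_search("dbr:Erasmus")   # also finds the name variant in s3
```

In this toy setting the keyword query returns the speech about the university and misses the name variant, whereas the entity query does the opposite. A real entity linker will of course make mistakes of its own, which is exactly why the interface lets you inspect and deselect entities.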




### Summary

We constructed WideNet to partially solve the issues that arise when searching for such complex/colligatory concepts.

As we argue in the next section--where we explain the interface--WideNet allows you to deal with complexity because
- instead of searching for a word, it searches for documents that contain relevant entities
- it takes into account that concepts arise as a result of the sense-making activity of the researcher--and are therefore unstable by definition. Colligatory concepts, specifically, are constructed by historians to interpret the past. 


## WideNet Demo

WideNet is currently available as a [demo](http://widenet.politicalmashup.nl/dh2017/). Below we explain some of the engineering that drives the backend, and our design choices regarding the interface.

### Defining a query (not available in the demo yet)

In the first step, the user formulates a query. The thematic query is demarcated by several categories or containers. Which categories are available depends on the knowledge base used. For our pilot studies we used DBpedia as the knowledge base, so the available categories correspond to Wikipedia categories. The query screen works as follows:

- The user starts by selecting one or several categories using a typeahead search box (top-left).
- The top-right shows the selected categories, which can also be deselected.
- Finally, the user can further demarcate the query by selecting a time period, which will be used to prune the underlying entities of the selected categories.


<img src="https://raw.githubusercontent.com/kasparvonbeelen/dh2019-SCE-workshop/master/Screenshot%202019-07-06%20at%2014.03.31.png" width="170" height="200">

### Behind the Scenes (Backend)

But what happens in the backend after the user has defined a concept and a period?

The figure below illustrates the algorithm at work.

<img src="https://raw.githubusercontent.com/kasparvonbeelen/dh2019-SCE-workshop/master/Screenshot%202019-07-06%20at%2014.01.56.png">

When the searcher continues from the previous screen (where the root category is defined), our system retrieves the network of narrower categories for each selected category. 
In DBpedia, this category network appears somewhat messy, because it encodes various viewpoints on how the world is organized, as a result of the diversity of Wikipedia contributors.

This example shows that starting from “17th-century Dutch people by occupation”, we find several layers of subcategories that correspond to a taxonomy of professions. But some people are important enough to have their own categories, for instance, the painter Rembrandt, which deviates from the general taxonomy.
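
Retrieving the network of narrower categories amounts to a graph traversal. The sketch below is a minimal illustration under stated assumptions, not WideNet's actual code: the `NARROWER` dictionary is a hand-made excerpt (in DBpedia the links between categories are expressed with `skos:broader`), and the depth limit and cycle guard are our own additions to cope with the messiness described above.

```python
from collections import deque

# Hypothetical excerpt of a category network: category -> narrower categories.
NARROWER = {
    "17th-century Dutch people by occupation": [
        "17th-century Dutch painters",
        "17th-century Dutch writers",
    ],
    "17th-century Dutch painters": ["Rembrandt"],
    "Rembrandt": ["Paintings by Rembrandt", "Works about Rembrandt"],
    # Category graphs can contain cycles, so the traversal must guard against them.
    "Works about Rembrandt": ["Rembrandt"],
}


def narrower_categories(root, max_depth=3):
    """Breadth-first walk from root; returns {category: depth below root}."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        if depths[category] >= max_depth:
            continue  # do not expand beyond the depth limit
        for child in NARROWER.get(category, []):
            if child not in depths:  # skip categories we have already seen
                depths[child] = depths[category] + 1
                queue.append(child)
    return depths


network = narrower_categories("17th-century Dutch people by occupation")
```

Each category in `network` would then be inspected for the entities it contains, which is the step described next.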




Oftentimes, even with carefully chosen root categories, there may be underlying categories that do not contain entities that are relevant to the thematic query.

Our system addresses this issue by iterating through all subcategories and inspecting all contained entities for temporal clues.
Each entity is compared with the target period and assigned one of four labels: outright relevant to the period, not relevant, a borderline case, or lacking temporal clues altogether.
This information is then used to decide whether the category as a whole is relevant to the thematic query.
In this example, the system figures out that most works about Rembrandt are not relevant to the query, because they were created long after the target period.
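
The temporal filter can be sketched as a pair of small functions. Please read this as an illustration only: the four labels come from the description above, but the 25-year `margin`, the majority `threshold`, the target period and the way years are attached to entities are all assumptions made for this example.

```python
from collections import Counter


def classify_entity(years, start, end, margin=25):
    """Label an entity by comparing its associated years with the target period."""
    if not years:
        return "no_clues"
    if any(start <= year <= end for year in years):
        return "relevant"
    if any(start - margin <= year <= end + margin for year in years):
        return "borderline"
    return "irrelevant"


def category_is_relevant(entities, start, end, threshold=0.5):
    """Keep a category if enough of its dated entities fall inside the period.

    `entities` maps an entity name to a list of years (possibly empty).
    """
    labels = Counter(classify_entity(years, start, end) for years in entities.values())
    dated = labels["relevant"] + labels["borderline"] + labels["irrelevant"]
    if dated == 0:
        return False  # assumption: without any temporal evidence, do not auto-select
    return labels["relevant"] / dated >= threshold


# Toy data: years attached by hand for the sake of the example.
paintings = {"The Night Watch": [1642], "Undated sketch": []}
works_about = {"Rembrandt (1936 film)": [1936], "Nightwatching": [2007]}

PERIOD = (1588, 1702)  # hypothetical target period
keep_paintings = category_is_relevant(paintings, *PERIOD)
keep_works_about = category_is_relevant(works_about, *PERIOD)
```

With these toy values the category of paintings is kept and the category of works about Rembrandt is dropped, mirroring the example in the figure.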

<img src="https://raw.githubusercontent.com/kasparvonbeelen/dh2019-SCE-workshop/master/Screenshot%202019-07-06%20at%2014.02.06.png">


#### Recap

- WideNet implements automatic query expansion by traversing the category graph of a knowledge base
- It exploits the structure of a knowledge base to select relevant entities--on top of which it uses a simple rule-based heuristic to filter entities by timestamp.


### Interface (Frontend)

The image below shows the first screen which appears after selecting the root category, the concept of interest (in this case the "Golden Age"). The leading example is in Dutch, but don’t worry, we have plenty of English examples, which you can access [here](http://widenet.politicalmashup.nl/dh2017/).




This screen is called the **preview page**.

#### Preview Page

<img src="https://raw.githubusercontent.com/kasparvonbeelen/dh2019-SCE-workshop/master/Screenshot%202019-07-05%20at%2013.52.35.png">

- **At the bottom** you find the main category related to the root concept (in this case it comprises Dutch-Classicist Architecture). 
In case the category is irrelevant, you can simply discard it with one click.

- You can also make finer distinctions. **The left-hand side** of the screen lists entities related to the category. The panel shows which entities have been found, and how frequently they occur. To refine your query, you can individually deselect each of the entities. It also shows a list of the entities that have been searched for but were **not found** in the corpus.

- **The centre** of the screen contains the **Preview results**, showing limited context to offer quick clues about the relevance of the category. The preview is useful to identify individual entities that are not relevant after all.
For instance, the Dutch name for the “Spanish treasure fleet”, _zilvervloot_, is frequently used in a metaphorical sense. Such entities can be deselected, as marked by the struck-out name and its faded occurrence counter.
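
To give an idea of what such a preview involves, here is a minimal keyword-in-context sketch. It is a simplification under stated assumptions: WideNet's preview is driven by linked entity mentions, whereas this example simply looks for a surface form, and the Dutch sentence is invented.

```python
import re


def concordance(text, term, width=30):
    """Return (left context, match, right context) for each hit of `term`."""
    rows = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows


# Invented example text containing a literal and a metaphorical use.
text = "De zilvervloot is binnen. Wij wachten niet op een zilvervloot."
rows = concordance(text, "zilvervloot", width=10)
frequency = len(rows)  # the occurrence counter shown next to the entity
```

Even a few words of context on either side are often enough to spot a metaphorical use and decide to deselect the entity.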


#### Close Reading Page

<img src="https://raw.githubusercontent.com/kasparvonbeelen/dh2019-SCE-workshop/master/Screenshot%202019-07-01%20at%2012.44.17.png">



When the final selection of relevant categories has been made, the corpus can be searched for the sum of all selected entities.
In this screen, the user assesses each result for its relevance, now showing much more context for close reading. On the left we now show the whole list of found entities, still allowing for selection and deselection.
For the corpus of parliamentary debates, we show metadata for each result, such as:
- debate title
- the role and party affiliation of the MP 
- the date
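
Searching for "the sum of all selected entities" can be thought of as a union query with a deselection list. The sketch below is hypothetical throughout: the `Speech` fields follow the metadata listed above, but the class, the function and the example records (titles, parties, dates, entity IDs) are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Speech:
    debate_title: str
    role: str
    party: str
    date: str
    entities: frozenset = field(default_factory=frozenset)


def search(speeches, selected, deselected=()):
    """Return (speech, matched entities) for speeches mentioning any active entity."""
    active = set(selected) - set(deselected)
    results = []
    for speech in speeches:
        matched = speech.entities & active
        if matched:
            results.append((speech, sorted(matched)))
    return results


# Invented records standing in for linked parliamentary speeches.
speeches = [
    Speech("Cultuurbegroting", "mp", "Party A", "2008-11-12",
           frozenset({"dbr:Rembrandt"})),
    Speech("Handelsbeleid", "minister", "Party B", "2011-03-02",
           frozenset({"dbr:Dutch_East_India_Company", "dbr:Spanish_treasure_fleet"})),
    Speech("Onderwijs", "mp", "Party C", "2015-06-30",
           frozenset({"dbr:Spanish_treasure_fleet"})),
]
selected = {"dbr:Rembrandt", "dbr:Dutch_East_India_Company", "dbr:Spanish_treasure_fleet"}
hits = search(speeches, selected, deselected={"dbr:Spanish_treasure_fleet"})
```

Deselecting the treasure fleet removes the third speech entirely, while the second speech remains because it still mentions another active entity.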



## Assignment
### Exploring Complex Concepts with WideNet


### Aim of the exercise

The exercise session consists of a guided WideNet tutorial (ca. 25 minutes) followed by a short discussion (ca. 15 minutes). The aim of the tutorial is to get acquainted with the WideNet interface and learn to sculpt a corpus using a semantically enhanced exploratory search system. The emphasis here is on **tool criticism** centring on a concrete case study (the last part of the tutorial is more hands-on). More specifically, please reflect on and discuss:

- How do the interface and workflow **support historical research**, i.e. to what extent do the affordances of the interface enable (or complicate) the process of iterative concept definition and corpus building?
- How does the focus on entities contribute to delineating **colligatory concepts**?
- How can the **presence (or absence) of specific entities** be explained in the light of everything you know about entity linking?
- How does the process of source selection resemble that of **gathering expert annotations**? What can be done with the output of these annotations in terms of concept definition and constructing knowledge bases?



### Step-by-step

Imagine you are a historian (if you are already a historian, just stay who you are) who wants to investigate how parliamentarians have made use of history in their speeches. Also, assume you are a specialist in Canadian politics--this will help you considerably with understanding the data. If neither applies to you--i.e. you are, like most present here, neither a historian nor knowledgeable about the peculiar constellation of Canadian party politics--have a quick look at the Wikipedia pages below.

- [History of Canada](https://en.wikipedia.org/wiki/History_of_Canada)
- [Politics of Canada](https://en.wikipedia.org/wiki/Politics_of_Canada)

Please follow the steps outlined below:

- Select a topic of interest from the [demo](http://widenet.politicalmashup.nl/dh2017/) site, or pick one of the Dutch examples (["Golden Age"](http://widenet.politicalmashup.nl/nl/preview/ge/), ["Second World War"](http://widenet.politicalmashup.nl/nl/preview/wo2/)). We prepared multiple WideNet queries based on the parliamentary data from Canada and the Netherlands. Of course, you can also compare specific queries.
- Behind each of these queries hides a substantial number of speeches. To avoid clicking through loads of speeches, it may help to formulate a question or hypothesis that will guide you when working your way through the data. Put differently, before you start, try to come up with a set of guidelines that explicate your “aboutness” criteria for individual documents.
- After selecting the target query, the preview stage opens. As explained during the presentation, here you can assess the relevance of each category as a whole, and/or refine the selection by including (or discarding) specific entities. 
- Experiment with selecting and deselecting categories and entities. The concordances located at the centre of the screen provide minimal context to help you make an informed decision. 
- When clicking through the categories, keep track of how your selection criteria (and thus your understanding of the concept under investigation) evolve. Is there any pattern in the entities that are informative?
- Don’t forget to expand the “Not found” and “Other entities” tabs at the bottom left-hand side of the screen and assess the value of the missing entities: is their absence historically interesting, or just an artefact of the entity linker we used? You can independently interrogate the data using the political mashup search interface. 
- After selecting entities and categories, carefully inspect the individual documents that resulted from your previous choices. Please note that you can further refine the set of entities, by (de)selecting them. 
- At the end of the road, store the data as JSON (click “Relevant Speeches”)! You will need this for later.
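
Once you have downloaded the JSON file, a good first step is to inspect its structure before analysing it. The sketch below makes no assumption about the fields in the export; the inline sample and the file name `relevant_speeches.json` are hypothetical placeholders.

```python
import json


def describe(value, max_depth=2, depth=0):
    """Return a compact description of the shape of a parsed JSON value."""
    if isinstance(value, dict):
        if depth >= max_depth:
            return "dict"
        return {key: describe(item, max_depth, depth + 1) for key, item in value.items()}
    if isinstance(value, list):
        if not value or depth >= max_depth:
            return f"list[{len(value)}]"
        # Describe the first element as representative of the whole list.
        return [describe(value[0], max_depth, depth + 1), f"... {len(value)} items"]
    return type(value).__name__


# Stand-in for the downloaded file. With a real export you would use instead:
# with open("relevant_speeches.json", encoding="utf-8") as fh:
#     data = json.load(fh)
data = json.loads('[{"a": 1, "b": ["x", "y"]}]')
shape = describe(data)
```

Printing `shape` shows which keys are available, so that later steps can refer to the real field names rather than guessed ones.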



### Discussion

Please reflect on the questions below (you can type your notes in the text cell; simply double-click on the question).




1. How does the WideNet idea fit with Digital Humanities? As opposed to prevailing trends, WideNet is not focussed on distant reading but offers a digital tool for exploration and corpus selection. 

 
  **Write your answer here**

2. How helpful is the entity-based exploratory search for exploring complex concepts? What type of information becomes visible, which parts of the corpus remain in the dark? 


 
  **Write your answer here**

3. Which functionalities were useful? What could be improved?

 
  **Write your answer here**