<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">


<meta charset="utf-8">
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1">

<meta name="author" content="Francesco Bailo" />

<meta name="date" content="2016-04-11" />

<title>Explicit semantic analysis with R</title>

<script src="readme_files/jquery-1.11.3/jquery.min.js"></script>
<link href="readme_files/bootstrap-3.3.5/css/united.min.css" rel="stylesheet" />
<script src="readme_files/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="readme_files/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="readme_files/bootstrap-3.3.5/shim/respond.min.js"></script>

<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet"
      type="text/css" />
<script src="readme_files/highlight/highlight.js"></script>
<style type="text/css">
  pre:not([class]) {
    background-color: white;
  }
</style>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
   window.setTimeout(function() {
   }, 0);
}
</script>



<style type = "text/css">
.main-container {
  max-width: 940px;
  margin-left: auto;
  margin-right: auto;
}
code {
  color: inherit;
  background-color: rgba(0, 0, 0, 0.04);
}
img {
  height: auto;
}
h1 {
  font-size: 34px;
}
h1.title {
  font-size: 38px;
}
h2 {
  font-size: 30px;
}
h3 {
  font-size: 24px;
}
h4 {
  font-size: 18px;
}
h5 {
  font-size: 16px;
}
h6 {
  font-size: 12px;
}
.tabbed-pane {
  padding-top: 12px;
}
button.code-folding-btn:focus {
  outline: none;
}
</style>

<div class="container-fluid main-container">

<!-- tabsets -->
<script src="readme_files/navigation-1.0/tabsets.js"></script>
<script>
$(document).ready(function () {
});
</script>

<!-- code folding -->

<div class="fluid-row" id="header">

<h1 class="title">Explicit semantic analysis with R</h1>
<h4 class="author"><em>Francesco Bailo</em></h4>
<h4 class="date"><em>11 April 2016</em></h4>
</div>


<div id="TOC">
<ul>
<li><a href="#step-1-initialise-a-mysql-database-to-store-data-from-wikipedia">Step 1: Initialise a MySQL database to store data from Wikipedia</a></li>
<li><a href="#step-2-dowload-wikipedias-data-dumps">Step 2: Download Wikipedia’s data dumps</a></li>
<li><a href="#step-3-import-data-dumps-into-the-mysql-database">Step 3: Import data dumps into the MySQL database</a></li>
<li><a href="#step-4-mapping-categories-optional">Step 4: Mapping categories (optional)</a></li>
<li><a href="#step-5-concept-map">Step 5: Concept map</a></li>
<li><a href="#step-6-visualisating-a-discussion-in-2d">Step 6: Visualising a discussion in 2D</a></li>
<li><a href="#references-and-r-packages">References and R packages</a></li>
</ul>
</div>

<p>Explicit semantic analysis (ESA) was proposed by <span class="citation">Gabrilovich and Markovitch (2007)</span> to compute a document’s position in a high-dimensional concept space. At its core, the technique compares the terms of the input document with the terms of documents describing the concepts, estimating the relatedness of the document to each concept. In spatial terms, if I know the relative distance of the input document from meaningful concepts (e.g. ‘car’, ‘Leonardo da Vinci’, ‘poverty’, ‘electricity’), I can infer the meaning of the document relative to explicitly defined concepts from the document’s position in the concept space.</p>
<p>Wikipedia provides the concept space. Each article is a <em>concept</em>: thus, <a href="en.wikipedia.org/wiki/Car" class="uri">en.wikipedia.org/wiki/Car</a> for ‘car’, <a href="en.wikipedia.org/wiki/Leonardo_da_Vinci" class="uri">en.wikipedia.org/wiki/Leonardo_da_Vinci</a> for ‘Leonardo da Vinci’, <a href="en.wikipedia.org/wiki/Poverty" class="uri">en.wikipedia.org/wiki/Poverty</a> for ‘poverty’ and <a href="en.wikipedia.org/wiki/Electricity" class="uri">en.wikipedia.org/wiki/Electricity</a> for ‘electricity’.</p>
<p>For each input document <span class="math inline">\(D\)</span>, the analysis results in a vector of weights of length <span class="math inline">\(N\)</span> — where <span class="math inline">\(N\)</span> is the number of concepts <span class="math inline">\(c\)</span> from the concept space — so as to describe with a scalar value the strength of the association between the document <span class="math inline">\(D\)</span> and concept <span class="math inline">\(c_j\)</span> for <span class="math inline">\(c_j \in c_1, . . ., c_N\)</span>.</p>
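<p>As a toy illustration of this weighting scheme (using a hypothetical three-concept corpus, not actual Wikipedia data), the ESA vector can be sketched in base R by weighting each concept article with tf-idf and projecting the input document onto the concepts:</p>
<pre class="r"><code>concepts &lt;- c(car = &quot;car engine wheel road driver&quot;,
              poverty = &quot;poverty income inequality welfare&quot;,
              electricity = &quot;electricity current voltage power grid&quot;)
doc &lt;- &quot;the driver started the engine of the car&quot;

tokenize &lt;- function(x) unlist(strsplit(tolower(x), &quot;\\s+&quot;))
vocab &lt;- unique(unlist(lapply(c(concepts, doc), tokenize)))

# Term-frequency matrix: one row per concept, one column per term
tf &lt;- t(sapply(concepts, function(x)
  as.numeric(table(factor(tokenize(x), levels = vocab)))))
colnames(tf) &lt;- vocab

# Inverse document frequency over the concept corpus
idf &lt;- log(length(concepts) / pmax(colSums(tf &gt; 0), 1))
tfidf &lt;- sweep(tf, 2, idf, `*`)

# ESA vector: for each concept c_j, the total tf-idf weight of the
# document&#39;s terms in the article describing c_j
doc_tf &lt;- as.numeric(table(factor(tokenize(doc), levels = vocab)))
esa_vector &lt;- as.numeric(tfidf %*% doc_tf)
names(esa_vector) &lt;- names(concepts)</code></pre>
<p>On this toy corpus the input document is, as expected, most strongly associated with the concept ‘car’.</p>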
<div id="step-1-initialise-a-mysql-database-to-store-data-from-wikipedia" class="section level2">
<h2>Step 1: Initialise a MySQL database to store data from Wikipedia</h2>
<p>The Wikipedia content that we will download needs to be stored in a relational database. MySQL is the main back-end database management system for Wikipedia and is therefore a natural choice to store the content from the site. The database will occupy, depending on the version, a few gigabytes of disk space. It is then necessary to make sure that there is enough storage available in the MySQL data directory (you can refer to <a href="http://askubuntu.com/a/137946">this answer</a> to change your MySQL data directory). Assuming now that MySQL is already installed and running on our machine, we can create a database called <code>wikipedia</code> from the shell with</p>
<pre class="bash"><code>echo &quot;CREATE DATABASE wikipedia&quot; | mysql -u username -p</code></pre>
<p>and then run <a href="sql/initialise_wikipedia_database.sql">this script</a> <span class="citation">(Jarosciak 2013)</span> to initialise the different tables with</p>
<pre class="bash"><code>mysql -u username -p wikipedia &lt; initialise_wikipedia_database.sql</code></pre>
<p>This will create an empty database populated by all the tables required to store a Wikipedia site (or, more precisely, all the tables required by <em>MediaWiki</em>, the open-source software powering all the sites of the <a href="https://www.wikimedia.org/">Wikimedia galaxy</a>). For a description of the tables see <a href="https://www.mediawiki.org/wiki/Category:MediaWiki_database_tables">here</a> and for the diagram of the database schema <a href="https://upload.wikimedia.org/wikipedia/commons/f/f7/MediaWiki_1.24.1_database_schema.svg">here</a>. Once the database is set up we can proceed to download and import the data.</p>
</div>
<div id="step-2-dowload-wikipedias-data-dumps" class="section level2">
<h2>Step 2: Download Wikipedia’s data dumps</h2>
<p>Wikimedia, the not-for-profit foundation running Wikipedia, conveniently provides the data dumps of the different language versions of Wikipedia at <a href="http://dumps.wikimedia.org/backup-index.html">dumps.wikimedia.org/backup-index.html</a>. Since the data dumps can reach considerable sizes, the repository provides different data packages.</p>
<p>Here I will use data dumps of the Italian version of Wikipedia (to download the English version simply replace <code>it</code> with <code>en</code>) as of February 3, 2016 (<code>20160203</code>).</p>
<p>In order to build our concept map we will start by downloading the corpus of all the pages of the Italian version of Wikipedia, which are contained in the file <code>itwiki-20160203-pages-articles.xml</code>. Once imported into the <code>wikipedia</code> database, the XML file will fill three tables: the <a href="https://www.mediawiki.org/wiki/Manual:Text_table"><code>text</code> table</a>, where the body of the articles is actually contained, the <a href="https://www.mediawiki.org/wiki/Manual:Page_table"><code>page</code> table</a> with the metadata for all pages, and the <a href="https://www.mediawiki.org/wiki/Manual:Revision_table"><code>revision</code> table</a> that allows us to associate each text to a page through the SQL relations <code>revision.rev_text_id=text.old_id</code> and <code>revision.rev_page=page.page_id</code>. Additionally we will need to download two other data dumps: <code>itwiki-20160203-pagelinks.sql</code> and <code>itwiki-20160203-categorylinks.sql</code>, which will fill respectively the <a href="https://www.mediawiki.org/wiki/Manual:Pagelinks_table"><code>pagelinks</code> table</a> and the <a href="https://www.mediawiki.org/wiki/Manual:Categorylinks_table"><code>categorylinks</code> table</a>. The <code>pagelinks</code> table details all internal links connecting different pages, which we will use to estimate the network relevance of each page, and the <code>categorylinks</code> table describes the categories each page belongs to.</p>
</div>
<div id="step-3-import-data-dumps-into-the-mysql-database" class="section level2">
<h2>Step 3: Import data dumps into the MySQL database</h2>
<p>The three data dumps that we downloaded are in two different formats: SQL and XML. The SQL files are ready to be imported into the database with the following commands</p>
<pre class="bash"><code>mysql -u username -p wikipedia &lt; itwiki-20160203-pagelinks.sql
mysql -u username -p wikipedia &lt; itwiki-20160203-categorylinks.sql</code></pre>
<p>The XML file instead requires some preprocessing. There are different tools that can assist in importing a Wikipedia XML data dump into a relational database. For convenience we use <a href="https://www.mediawiki.org/wiki/Manual:MWDumper">MWDumper</a>. The following command will pipe the content of the XML file directly into the <code>wikipedia</code> database:</p>
<pre class="bash"><code>java -jar mwdumper.jar --format=sql:1.5 itwiki-20160203-pages-articles.xml | mysql -u username -p wikipedia</code></pre>
<p>MWDumper can also automatically take care of compressed dump files (such as <code>itwiki-20160203-pages-articles.xml.bz2</code>). The process of importing the XML data dump can take a few hours. If launched on a remote machine, it is advisable to use a virtual console such as <a href="https://tmux.github.io/">tmux</a> to avoid problems if the connection with the remote machine is interrupted.</p>
<p>We then have a MySQL database <code>wikipedia</code> with information on all current pages, their internal links and their categories.</p>
</div>
<div id="step-4-mapping-categories-optional" class="section level2">
<h2>Step 4: Mapping categories (optional)</h2>
<p>In my case I was interested in limiting my concept map to a few thousand concepts instead of hundreds of thousands (or more than 5 million in the case of the English version of Wikipedia), which would have required processing all the articles contained in the data dump. I therefore created a hierarchical map of categories, constructed as a directed network with categories as nodes and links defined by the relation <em>subcategory of</em>. In the map I targeted specific <em>neighbourhoods</em>, defined by a node of interest and its immediate <em>neighbours</em>, and filtered out all articles not belonging to these categories. Reducing the number of articles used to construct the concept space has both a practical purpose – to bring the computation required for the analysis to a more accessible level – and a theoretical purpose, which must find a justification in your analysis (in my case I was interested in reading an online conversation in terms of only a <em>few</em> categories of interest).</p>
<p>In R we first need to establish a connection with the MySQL database. For convenience we create a function to pull the entire content of a table from a MySQL database into a <code>data.frame</code>:</p>
<pre class="r"><code>require(RMySQL)

getTable &lt;- function(con, table) {
  query &lt;- dbSendQuery(con, paste(&quot;SELECT * FROM &quot;, table, &quot;;&quot;, sep=&quot;&quot;)) 
  result &lt;- fetch(query, n = -1)
  dbClearResult(query)
  return(result)
}</code></pre>
<p>Then we open the connection with</p>
<pre class="r"><code>pw &lt;- &quot;yourpassword&quot; 
con &lt;- dbConnect(RMySQL::MySQL(), dbname = &quot;wikipedia&quot;, username = &quot;root&quot;, password = pw)</code></pre>
<p>and load the <code>page</code> and <code>categorylinks</code> tables as <code>data.table</code>s</p>
<pre class="r"><code>require(data.table)
categorylinks &lt;- data.table(getTable(con, &quot;categorylinks&quot;))
page &lt;- data.table(getTable(con, &quot;page&quot;))</code></pre>
<p>MediaWiki defines a <a href="https://www.mediawiki.org/wiki/Manual:Namespace"><em>namespace</em></a> for each wiki page to indicate the purpose of the page. Pages describing categories, which are of interest now, are indicated by the namespace <code>14</code>. We then create a <code>data.table</code> containing only category pages with</p>
<pre class="r"><code>page_cat &lt;- page[page_namespace == 14,]</code></pre>
<p>Now we need to construct a network describing the hierarchy of categories and the relations between articles and categories (which categories an article belongs to). Each page has category links – described in the <a href="https://www.mediawiki.org/wiki/Manual:Categorylinks_table"><code>categorylinks</code> table</a> – pointing to its <em>parent categories</em> (importantly, each category page can have one or more parent categories, so the topology of the network mapping the hierarchy of categories will not be a tree). In the <code>categorylinks</code> table the field <code>cl_from</code> stores the <em>page id</em> (the same we find in the field <code>page_id</code> of the <code>page</code> table) while <code>cl_to</code> stores the <em>name</em> of the parent category, which does not necessarily correspond to an actual page listed in the table <code>page</code>. In order to unambiguously map the relation between categories we need to join (or <code>merge()</code>) the <code>data.table</code>s so as to build a directed edgelist with a page id for each endpoint.</p>
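<p>The logic of the join can be illustrated on a miniature example (hypothetical ids and titles), here with base R’s <code>merge()</code>: the category <em>name</em> stored in <code>cl_to</code> is resolved into a page id, producing an edgelist with an id at both endpoints.</p>
<pre class="r"><code># Hypothetical miniature &#39;page&#39; and &#39;categorylinks&#39; tables
page_cat &lt;- data.frame(cl_to_id = c(10, 11),
                       cl_to = c(&quot;art&quot;, &quot;science&quot;),
                       stringsAsFactors = FALSE)
categorylinks &lt;- data.frame(cl_from_id = c(1, 2, 3),
                            cl_to = c(&quot;art&quot;, &quot;art&quot;, &quot;science&quot;),
                            stringsAsFactors = FALSE)

# Resolve the parent-category name into its page id
edges &lt;- merge(categorylinks, page_cat, by = &quot;cl_to&quot;)
edges[, c(&quot;cl_from_id&quot;, &quot;cl_to_id&quot;)]</code></pre>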
<p>First we need to rename a few of the columns</p>
<pre class="r"><code>require(plyr)
categorylinks &lt;-  plyr::rename(categorylinks, c(&quot;cl_from&quot; = &quot;cl_from_id&quot;))
page_cat &lt;-  plyr::rename(page_cat, 
                          c(&quot;page_title&quot; = &quot;cl_to&quot;, 
                          &quot;page_id&quot; = &quot;cl_to_id&quot;, 
                          &quot;page_namespace&quot; = &quot;cl_to_namespace&quot;))

categorylinks &lt;- categorylinks[,cl_to:=tolower(cl_to)]
Encoding(page_cat$cl_to) &lt;- &#39;latin1&#39; # This might fix encoding issues
page_cat &lt;- page_cat[,cl_to:=tolower(cl_to)]

setkey(page_cat, &#39;cl_to&#39;)
setkey(categorylinks, &#39;cl_to&#39;)</code></pre>
<p>and then we merge the <code>data.table</code>s</p>
<pre class="r"><code># Merge on cl_to (to are categories)
categorylinks &lt;- merge(categorylinks, page_cat[,.(cl_to, cl_to_id, cl_to_namespace)])

# Merge on cl_from
page_cat &lt;-  page
page_cat &lt;-  plyr::rename(page_cat, c(&quot;page_id&quot; = &quot;cl_from_id&quot;, &quot;page_namespace&quot; = &quot;cl_from_namespace&quot;))
setkey(page_cat, &#39;cl_from_id&#39;)
setkey(categorylinks, &#39;cl_from_id&#39;)

# This will remove all links from non-content pages (users, talks, etc)
categorylinks &lt;- merge(categorylinks, page_cat[,.(cl_from_id, page_title, cl_from_namespace)])</code></pre>
<p>Once we have a <code>data.table</code> describing each edge of the category network we can create two <code>data.table</code>s: a two-column <code>data.table</code> as edgelist and a <code>data.table</code> describing each vertex (category or article) of the network with, as attributes, the name of the page (<code>page_title</code>) and its <code>namespace</code>.</p>
<pre class="r"><code>edgelist_categorylinks &lt;- categorylinks[,.(cl_from_id, cl_to_id)]
vertices_categorylinks &lt;- data.table(page_id = c(categorylinks$cl_from_id, 
                                                 categorylinks$cl_to_id),
                                     namespace = c(categorylinks$cl_from_namespace,
                                                   categorylinks$cl_to_namespace))
setkey(vertices_categorylinks, &#39;page_id&#39;)
setkey(page, &#39;page_id&#39;)
Encoding(page$page_title) &lt;- &#39;latin1&#39;
vertices_categorylinks &lt;- merge(vertices_categorylinks, page[,.(page_id, page_title)])
vertices_categorylinks &lt;- unique(vertices_categorylinks)</code></pre>
<p>We are now ready to create an <code>igraph</code> object and then to drop all nodes (pages) that are not content articles (<code>namespace == 0</code>) or category description pages (<code>namespace == 14</code>):</p>
<pre class="r"><code>require(igraph)
g &lt;- graph.data.frame(categorylinks[,.(cl_from_id, cl_to_id)],
                      vertices = vertices_categorylinks)
g_ns0_and_ns14 &lt;- g - V(g)[!(V(g)$namespace %in% c(0, 14))]  </code></pre>
<p>This will result in a directed network like this</p>
<p><img src="Figs/graph-wikipedia-category-and-article-1.png" title="" alt="" width="672" /></p>
<p>By construction – since every link describes the relation between a page and a parent category – article pages have no incoming links, while category pages might have both incoming links from <em>subcategories</em> and outgoing links to parent categories. To simplify the plot, only five nodes have two outgoing links, but in practice most article and category pages have more than one <em>parent</em> category.</p>
<p>If we subset this network removing every article page we obtain a network describing the hierarchy among Wikipedia categories.</p>
<pre class="r"><code>g_ns14 &lt;- g_ns0_and_ns14 - V(g_ns0_and_ns14)[!(V(g_ns0_and_ns14)$namespace %in% c(14))]</code></pre>
<p>Still, many categories will not be of any interest in defining the content of the articles, since they are used by Wikipedia contributors to maintain the site, and they should be removed. Each Wikipedia version has a different set of ‘service’ categories (e.g. <a href="https://en.wikipedia.org/wiki/Category:Articles_needing_expert_attention">Articles_needing_expert_attention</a>) so it is impossible to define general rules for removing them.</p>
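<p>As an illustration, service categories can often be filtered by title patterns; the patterns below are hypothetical examples and must be adapted to the conventions of each Wikipedia version:</p>
<pre class="r"><code>categories &lt;- c(&quot;British_boy_bands&quot;,
                &quot;Articles_needing_expert_attention&quot;,
                &quot;Wikipedia_articles_with_style_issues&quot;,
                &quot;Poverty&quot;)

# Drop titles matching (hypothetical) maintenance-category patterns
service_pattern &lt;- &quot;^(Articles_|Wikipedia_)&quot;
content_categories &lt;- categories[!grepl(service_pattern, categories)]</code></pre>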
<p>After removing all categories not related to the content of the articles we can target specific <em>macro</em>-categories and fetch all of their neighbours to create a list of categories of interest. All article pages linked to these categories of interest will be selected for creating the concept map. The advantage of this approach is that even if we know the general categories we are interested in (e.g. <a href="https://en.wikipedia.org/wiki/Category:British_pop_music_groups">British_pop_music_groups</a>), we do not necessarily know all the categories in the neighbourhood of this category (in the example, <a href="https://en.wikipedia.org/wiki/Category:British_boy_bands">British_boy_bands</a> and <a href="https://en.wikipedia.org/wiki/Category:British_pop_musicians">British_pop_musicians</a>).</p>
<p>To add categories to a list of categories of interest we proceed as follows: we subset <code>g_ns14</code> by creating an ego-network around a general category, we fetch all categories contained in the resulting ego-network, and we store them in a <code>data.frame</code> defined as</p>
<pre class="r"><code>cat_to_include &lt;- data.frame(page_id = character(),
                             page_title = character(),
                             target_cat = character())</code></pre>
<p>First we create a function that, given a graph <code>g</code> and our <code>data.frame</code> <code>cat_to_include</code>, will add to the <code>data.frame</code> all the category names contained in <code>g</code></p>
<pre class="r"><code>addToCat &lt;- function(g, cat_to_include, target_cat) {
  cat_to_include &lt;- rbind(cat_to_include, data.frame(page_id = V(g)$name,
                                                     page_title = V(g)$page_title,
                                                     target_cat = target_cat,
                                                     stringsAsFactors = FALSE))
  return(cat_to_include)
}</code></pre>
<p>Then for each general category we know, let’s say <code>Poverty</code>, we construct an ego-graph with <code>make_ego_graph()</code> and fetch the categories it contains.</p>
<pre class="r"><code>g_of_interest &lt;-
  make_ego_graph(g_ns14, nodes = V(g_ns14)[V(g_ns14)$page_title == &#39;Poverty&#39;], order = 2, mode = &#39;all&#39;)[[1]]
cat_to_include &lt;- addToCat(g_of_interest, cat_to_include, &#39;Poverty&#39;)</code></pre>
<p>The argument <code>order</code> controls the radius of the neighbourhood; <code>order = 2</code> indicates that we want to include in our ego-graph every node within a distance of two steps from the ego-vertex.</p>
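<p>The effect of <code>order</code> can be checked on a toy chain graph with a minimal base R sketch (no <code>igraph</code> required, edge direction ignored as with <code>mode = &#39;all&#39;</code>): nodes within <code>order</code> steps of the ego are kept, farther ones are not.</p>
<pre class="r"><code># Toy directed edgelist: a -&gt; b -&gt; c -&gt; d (chain)
edges &lt;- data.frame(from = c(&quot;a&quot;, &quot;b&quot;, &quot;c&quot;),
                    to = c(&quot;b&quot;, &quot;c&quot;, &quot;d&quot;),
                    stringsAsFactors = FALSE)

# Nodes reachable within &#39;order&#39; steps of &#39;ego&#39;
ego_nodes &lt;- function(edges, ego, order) {
  seen &lt;- frontier &lt;- ego
  for (i in seq_len(order)) {
    nxt &lt;- c(edges$to[edges$from %in% frontier],
             edges$from[edges$to %in% frontier])
    frontier &lt;- setdiff(nxt, seen)
    seen &lt;- union(seen, frontier)
  }
  seen
}
ego_nodes(edges, &quot;a&quot;, 2)  # &quot;a&quot; &quot;b&quot; &quot;c&quot;: &quot;d&quot; is three steps away</code></pre>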
<p>Finally, once we have a comprehensive list of categories of interest we want to fetch all articles that belong to these categories. We go back to our graph describing relations among categories and between categories and articles – <code>g_ns0_and_ns14</code> – and we create from it a new graph called <code>selected_articles</code> which includes only article pages of interest.</p>
<p>First, we drop from <code>g_ns0_and_ns14</code> all category pages (<code>V(g_ns0_and_ns14)$namespace == 14</code>) and all article pages that are not listed in <code>cat_to_include</code> (note that <code>g_ns0_and_ns14</code> uses as attribute <code>name</code> the page id) and then calculate for each vertex the number of outgoing links.</p>
<pre class="r"><code>selected_articles &lt;- g_ns0_and_ns14 - 
  V(g_ns0_and_ns14)[V(g_ns0_and_ns14)$namespace == 14 &amp; !(V(g_ns0_and_ns14)$name %in% cat_to_include$page_id)]
V(selected_articles)$outdegree &lt;- degree(selected_articles, V(selected_articles), mode = &#39;out&#39;)</code></pre>
<p>Second, we want to track, for each selected article, which categories determined its inclusion in our list. Since article pages usually belong to multiple categories, it is possible that an article was selected because it is linked to more than one category of interest. For each target category we then create a logical vector storing information on whether an article belongs to it. We do it with</p>
<pre class="r"><code>for (supracat in unique(cat_to_include$target_cat)) {
  cats &lt;- subset(cat_to_include, target_cat == supracat)$page_id
  neighs &lt;- 
    graph.neighborhood(selected_articles, V(selected_articles)[V(selected_articles)$name %in% cats], order = 1)
  vertices &lt;- unlist(sapply(neighs, getPageIds))
  selected_articles &lt;-
    set.vertex.attribute(selected_articles, supracat, V(selected_articles), 
                         V(selected_articles)$name %in% vertices)
}</code></pre>
<p>We then convert <code>selected_articles</code> into a <code>data.frame</code> where each row contains the attributes of a node</p>
<pre class="r"><code>selected_articles &lt;-  vertexAttributesAsDataFrame(addDegreeToVertices(selected_articles))
selected_articles &lt;- subset(selected_articles, namespace == &#39;0&#39;)</code></pre>
</div>
<div id="step-5-concept-map" class="section level2">
<h2>Step 5: Concept map</h2>
<p>Once we have stored all the necessary Wikipedia tables in the MySQL database we can proceed to build our concept map.</p>
<p>We first create a connection with the MySQL database</p>
<pre class="r"><code>pw &lt;- &quot;yourpassword&quot; 
con &lt;- dbConnect(RMySQL::MySQL(), dbname = &quot;wikipedia&quot;, username = &quot;root&quot;, password = pw)</code></pre>
<p>then we load into the R environment the three tables we require as <code>data.table</code>s, using the helper function <code>getTable</code></p>
<pre class="r"><code>require(RMySQL)
require(data.table)

getTable &lt;- function(con, table) {
  query &lt;- dbSendQuery(con, paste(&quot;SELECT * FROM &quot;, table, &quot;;&quot;, sep=&quot;&quot;)) 
  result &lt;- fetch(query, n = -1)
  dbClearResult(query)
  return(result)
}

text &lt;- data.table(getTable(con, &quot;text&quot;))
revision &lt;- data.table(getTable(con, &quot;revision&quot;))
page &lt;- data.table(getTable(con, &quot;page&quot;))</code></pre>
<p>The table <code>revision</code> is a join table that we need in order to relate the table <code>text</code>, which contains the actual text of the Wikipedia pages, to the table <code>page</code>, which contains the name of the page (that is, the title). The relations connecting the three tables are defined as</p>
<pre class="sql"><code>revision.rev_text_id = text.old_id
revision.rev_page = page.page_id</code></pre>
<p>The goal now is to create a table <code>wiki_pages</code> containing all the information of interest for each Wikipedia page. We do first</p>
<pre class="r"><code>setnames(text,&quot;old_id&quot;,&quot;rev_text_id&quot;)
setkey(text, rev_text_id)
setkey(revision, rev_text_id)
wiki_pages &lt;- merge(revision[,.(rev_text_id, rev_page)], text[,.(rev_text_id, old_text)])</code></pre>
<p>and then</p>
<pre class="r"><code>setnames(wiki_pages, &quot;rev_page&quot;, &quot;page_id&quot;)
setkey(wiki_pages, &quot;page_id&quot;)
setkey(page, &quot;page_id&quot;)
wiki_pages &lt;- merge(page[,.(page_id, page_namespace, page_title, page_is_redirect)], 
                    wiki_pages[,.(page_id, old_text)])
Encoding(wiki_pages$page_title) &lt;- &#39;latin1&#39;</code></pre>
<p><code>wiki_pages</code> now contains <code>page_id</code>, <code>page_title</code>, <code>page_namespace</code> (for details on the namespaces used by Wikipedia see <a href="https://www.mediawiki.org/wiki/Manual:Namespace">this</a>), <code>page_is_redirect</code> (a Boolean field indicating whether the page is actually a simple redirect to another page), and <code>old_text</code>, where the text of the article is actually stored.</p>
<p>We then create a <code>data.table</code> named <code>redirects</code> (see <a href="https://en.wikipedia.org/wiki/Wikipedia:Redirect">here</a> for an explanation of redirect pages) with two fields indicating the title of the page that is redirected (<code>from</code>) and the destination of the redirect link (<code>to</code>). In a redirect page the destination of the redirect link is indicated in the text of the page in square brackets. For example the page <a href="https://en.wikipedia.org/wiki/UK">UK</a>, redirecting to the page <a href="https://en.wikipedia.org/wiki/United_Kingdom">United_Kingdom</a>, contains as its body the text <code>#REDIRECT [[United Kingdom]]</code>. We can then use the regular expression <code>&quot;\\[\\[(.*?)\\]\\]&quot;</code> as argument of the function <code>str_extract()</code> to parse the target title from the article text before storing it in the field <code>to</code> (note that since the regular expression is passed as an R string we need to double-escape special characters).</p>
<pre class="r"><code># Extract all redirects
redirects &lt;- wiki_pages[page_is_redirect == 1,]
redirects$from &lt;- redirects$page_title

require(stringr)

getPageRedirect &lt;- function(x) {
  x &lt;- unlist(str_extract(x, &quot;\\[\\[(.*?)\\]\\]&quot;))
  x &lt;- gsub(&quot;\\[|\\]&quot;,&quot;&quot;,x)
  x &lt;- gsub(&quot;(\\||#)(.*?)$&quot;,&quot;&quot;, x)
  if (grepl(&quot;[A-Za-z]:[A-Za-z]&quot;, x)) return(NA)
  return(gsub(&quot; &quot;, &quot;_&quot;, x))
}

redirects$to &lt;- sapply(redirects$old_text, getPageRedirect, USE.NAMES = FALSE)
redirects &lt;- redirects[,.(from, to)]
redirects &lt;- redirects[!is.na(to),]</code></pre>
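<p>The regular expression can be checked on the example above, here with base R’s <code>regmatches()</code> instead of <code>str_extract()</code>:</p>
<pre class="r"><code>redirect_text &lt;- &quot;#REDIRECT [[United Kingdom]]&quot;
m &lt;- regmatches(redirect_text,
                regexpr(&quot;\\[\\[(.*?)\\]\\]&quot;, redirect_text, perl = TRUE))
# Strip the brackets and replace spaces with underscores
target &lt;- gsub(&quot; &quot;, &quot;_&quot;, gsub(&quot;\\[|\\]&quot;, &quot;&quot;, m))
target  # &quot;United_Kingdom&quot;</code></pre>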
<p>At this point we can proceed to clean our <code>wiki_pages</code> table by removing everything we do not need in the analysis. These lines</p>
<pre class="r"><code>wiki_pages &lt;- wiki_pages[page_namespace == 0,] 
wiki_pages &lt;- wiki_pages[page_is_redirect == 0,] 
wiki_pages &lt;- wiki_pages[!grepl(&quot;^\\{\\{disambigua\\}\\}&quot;,old_text),] </code></pre>
<p>will remove all pages that are not content articles (<code>page_namespace != 0</code>) or are redirects (<code>page_is_redirect != 0</code>). We also get rid of disambiguation pages (see <a href="https://en.wikipedia.org/wiki/Wikipedia:Disambiguation">here</a>), which do not contain any article but simply list other pages.</p>
<p>Preparing the actual text of the articles requires a few steps and follows standard guidelines of natural language processing. With the following function we remove all links contained in the text (usually identified by angle, curly or square brackets), all special characters indicating a new line (<code>\\n</code>) and all digits (<code>\\d+</code>), replace multiple spaces (<code>\\s+</code>) with a single space and finally lower-case all characters. We then store the processed version of the text in a new variable <code>clean_text</code>.</p>
<pre class="r"><code># Text cleaning
preprocessText &lt;- function (string) {
  string &lt;- gsub(&quot;&lt;.*?&gt;|\\{.*?\\}|\\[\\[File.*?\\]\\]&quot;,&quot; &quot;, string)
  string &lt;- gsub(&quot;[[:punct:]]+&quot;,&quot; &quot;, string)
  string &lt;- gsub(&quot;\\n&quot;,&quot; &quot;, string)
  string &lt;- gsub(&quot;\\d+&quot;,&quot; &quot;, string)
  string &lt;- gsub(&quot;\\s+&quot;,&quot; &quot;,string)
  string &lt;- tolower(string)
  return(string)
}
wiki_pages$clean_text &lt;- preprocessText(wiki_pages$old_text)</code></pre>
<p>A crucial set of decisions is defining which articles to exclude from the analysis. There are two reasons to consider reducing the number of articles: filtering articles out will reduce computation and, more importantly, remove articles that we would not consider descriptions of a <em>concept</em>. <span class="citation">Gabrilovich and Markovitch (2007)</span> propose two filtering rules:</p>
<ol style="list-style-type: decimal">
<li><p>Articles with fewer than 100 words, which might be only a draft of an article or in any case do not provide enough information to inform the description of a concept, are excluded.</p></li>
<li><p>Articles with fewer than 5 incoming or outgoing links to other Wikipedia pages are excluded: few outgoing links might indicate an article in draft form (Wikipedia is always a work-in-progress), and the fact that a page is linked to by fewer than 5 <em>other</em> articles might indicate that the article is not relevant in network terms.</p></li>
</ol>
<p>To these two rules I suggest an additional third rule:</p>
<ol start="3" style="list-style-type: decimal">
<li>Articles with a word-to-link ratio of less than 15. Many Wikipedia pages are in fact <em>lists</em> of Wikipedia pages (e.g. <a href="https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States">List_of_Presidents_of_the_United_States</a>) and clearly these pages should not be treated as descriptions of any concept. Although it is not always possible to identify whether a page is a list by the word-to-link ratio alone, I found that list pages usually have a relatively high number of links.</li>
</ol>
<p>Let’s now apply these three rules. The 100-word threshold is pretty easy to implement</p>
<pre class="r"><code>wiki_pages$word_count &lt;- sapply(gregexpr(&quot;\\W+&quot;, wiki_pages$clean_text), length) + 1
wiki_pages &lt;- wiki_pages[word_count &gt;= 100,]</code></pre>
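<p>Once word and link counts are available for each page, the word-to-link ratio rule reduces to a comparison against the threshold (the counts below are toy values for a regular article and a hypothetical list page):</p>
<pre class="r"><code># Hypothetical per-page counts
word_count &lt;- c(article = 1200, list_page = 400)
link_count &lt;- c(article = 30, list_page = 120)

# Rule 3: keep pages with at least 15 words per internal link
keep &lt;- (word_count / link_count) &gt;= 15
keep  # article TRUE, list_page FALSE</code></pre>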
<p>To calculate the number of outgoing links we can simply count the occurrences of links in each page. Calculating the number of incoming links instead requires checking all other pages. We first create a helper function that gets all internal links (that is, links to other Wikipedia pages) embedded in the text of a page:</p>
<pre class="r"><code>require(stringr)

getPageLinks &lt;- function(x) {
  x &lt;- unlist(str_extract_all(x, &quot;\\[\\[(.*?)\\]\\]&quot;))
  x &lt;- x[!grepl(&quot;:\\s&quot;, x)]
  x &lt;- gsub(&quot;\\[|\\]&quot;,&quot;&quot;,x)
  x &lt;- gsub(&quot;(\\||#)(.*?)$&quot;,&quot;&quot;, x)
  x &lt;- gsub(&quot; &quot;,&quot;_&quot;,x)
  return(x)
}</code></pre>
<p>We then create a <code>data.table</code> named <code>edgelist</code> with a row for each internal link found in the Wikipedia articles. The line <code>sapply(wiki_pages$old_text, getPageLinks)</code> will return a list of length equal to the number of rows of <code>wiki_pages</code>. After naming the list, we can take advantage of the function <code>stack()</code> to convert the list into a <code>data.frame</code>, which we then convert into a <code>data.table</code>.</p>
<pre class="r"><code>edgelist &lt;- sapply(wiki_pages$old_text, getPageLinks)
names(edgelist) &lt;- wiki_pages$page_id
edgelist &lt;- data.table(stack(edgelist))
edgelist &lt;- plyr::rename(edgelist, c(&quot;values&quot; = &quot;from&quot;, &quot;ind&quot; = &quot;from_page_id&quot;))
edgelist &lt;- edgelist[,from:=tolower(from)]</code></pre>
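<p>To see what <code>stack()</code> does here, consider a minimal example with two hypothetical page ids:</p>

```r
# A named list of per-page link targets, as returned by the extraction step
links <- list(`101` = c("Rome", "Milan"), `102` = "Rome")
# stack() flattens it into a two-column data.frame:
# values = link target, ind = source page id (a factor)
edges <- stack(links)
```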
<p>We first merge <code>edgelist</code>, representing all internal links, with <code>redirects</code>, since it is possible that a link points to a redirect page (e.g. to <code>UK</code> instead of <code>United_Kingdom</code>).</p>
<pre class="r"><code># Merge 1
wiki_pages &lt;- wiki_pages[, page_title:=tolower(page_title)]
setkey(wiki_pages, &#39;page_title&#39;)

redirects &lt;- redirects[,from:=tolower(from)]
redirects &lt;- redirects[,to:=tolower(to)]

setkey(edgelist, &#39;from&#39;)
setkey(redirects, &#39;from&#39;)

edgelist_redirect &lt;- merge(edgelist, redirects)
edgelist_redirect &lt;- edgelist_redirect[,from:=NULL]</code></pre>
<p>and then the resulting <code>edgelist_redirect</code> with <code>wiki_pages</code></p>
<pre class="r"><code>edgelist_redirect &lt;- plyr::rename(edgelist_redirect, c(&quot;to&quot; = &quot;page_title&quot;))
setkey(edgelist_redirect, &quot;page_title&quot;)
edgelist_redirect &lt;- merge(edgelist_redirect, wiki_pages[,.(page_id, page_title)])
edgelist_redirect &lt;- plyr::rename(edgelist_redirect, c(&quot;page_id&quot; = &quot;to_page_id&quot;))
edgelist_redirect &lt;- edgelist_redirect[,page_title:=NULL]</code></pre>
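<p>The redirect merge is an inner join: only links whose target appears in <code>redirects</code> survive, and the target is rewritten. A minimal base-R sketch of the same logic, with hypothetical page titles:</p>

```r
# Two links: one to a redirect page ("uk"), one to a regular page ("rome")
edgelist  <- data.frame(from = c("uk", "rome"), from_page_id = c(1L, 2L))
redirects <- data.frame(from = "uk", to = "united_kingdom")
# Inner join on "from": the link to "uk" is resolved to "united_kingdom";
# the link to "rome" falls through to the non-redirect merge instead
resolved <- merge(edgelist, redirects, by = "from")
```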
<p>We merge <code>edgelist</code> directly with <code>wiki_pages</code> to obtain <code>edgelist_noredirect</code>, which contains the internal links to pages that are <em>not</em> redirect pages.</p>
<pre class="r"><code># Merge 2
edgelist &lt;- plyr::rename(edgelist, c(&quot;from&quot; = &quot;page_title&quot;))
setkey(edgelist, &#39;page_title&#39;)
edgelist_noredirect &lt;- merge(edgelist, wiki_pages[,.(page_id,page_title)])
edgelist_noredirect &lt;- plyr::rename(edgelist_noredirect, c(&quot;page_id&quot; = &quot;to_page_id&quot;))
edgelist_noredirect &lt;- edgelist_noredirect[,page_title:=NULL]</code></pre>
<p>We now obtain a complete list of all internal links, whether to <em>redirect</em> or <em>non-redirect</em> pages, with <code>rbind(edgelist_noredirect, edgelist_redirect)</code>, and we use it to create a directed graph where nodes are pages and edges describe the internal links connecting them</p>
<pre class="r"><code>require(igraph)
g &lt;- graph.data.frame(rbind(edgelist_noredirect, edgelist_redirect))</code></pre>
<p>and for each page we calculate incoming (indegree) and outgoing (outdegree) links.</p>
<pre class="r"><code>degree_df &lt;- 
  data.frame(page_id = V(g)$name, 
             indegree = degree(g, V(g), &#39;in&#39;),
             outdegree = degree(g, V(g), &#39;out&#39;),
             stringsAsFactors = FALSE)</code></pre>
<p>Based on the second rule listed above, we drop all pages with fewer than 5 incoming or outgoing links. We store the surviving <code>page_id</code>s in the character vector <code>corpus_ids</code>.</p>
<pre class="r"><code>corpus_ids &lt;- 
  subset(degree_df, indegree &gt;= 5 &amp; outdegree &gt;= 5)$page_id</code></pre>
<p>Optionally, if we have selected a subset of articles of interest by targeting specific categories, we can further reduce the number of pages considered in the concept analysis with</p>
<pre class="r"><code>corpus_ids &lt;-
  corpus_ids[corpus_ids %in% as.character(selected_articles$name)]</code></pre>
<p>The articles will now be treated as <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">bags of words</a> and jointly analysed to construct a <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">term frequency–inverse document frequency</a> (tf-idf) matrix. That is, the position of individual terms within each document is disregarded and each document is represented by a vector of weights (or scores) of length equal to the number of terms found in the entire corpus of documents. The weight assigned to each term grows with the number of times the term is found <em>in the document</em> (tf) and decreases with the number of documents <em>in the corpus</em> where the term is found (idf), so that terms that appear in only a few documents are assigned a relatively higher score than terms that appear in most documents of the corpus <span class="citation">(for more details see Manning, Raghavan, and Schütze 2008, ch. 6)</span>.</p>
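<p>As a concrete illustration of the weighting, the (unnormalised) score that <code>tm</code>’s <code>weightTfIdf()</code> assigns is the term count in the document multiplied by a base-2 logarithm of the inverse document frequency:</p>

```r
# Toy corpus of three documents
docs <- list(c("rome", "italy"), "rome", "paris")
N  <- length(docs)                                  # 3 documents in the corpus
df <- sum(sapply(docs, function(d) "rome" %in% d))  # "rome" occurs in 2 of them
tf <- sum(docs[[1]] == "rome")                      # once in the first document
tfidf <- tf * log2(N / df)                          # tf-idf weight of "rome" in doc 1
```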
<p>We create a <code>data.table</code> with only the documents we want to include in the analysis</p>
<pre class="r"><code>concept_corpus &lt;- wiki_pages[page_id %in% corpus_ids, .(page_id, page_title, clean_text)]</code></pre>
<p>and we process the corpus of documents (already cleaned) before computing the tf-idf matrix, by removing stop words and stemming the remaining terms.</p>
<pre class="r"><code>require(tm)
require(SnowballC)
tm_corpus &lt;- Corpus(VectorSource(concept_corpus$clean_text))
tm_corpus &lt;- tm_map(tm_corpus, removeWords, stopwords(&quot;italian&quot;), lazy = TRUE)
tm_corpus &lt;- tm_map(tm_corpus, stemDocument, language = &quot;italian&quot;, lazy = TRUE)
wikipedia_tdm &lt;- TermDocumentMatrix(tm_corpus,
                                    control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))</code></pre>
<p><code>wikipedia_tdm</code> is our end product. It is a term–document matrix where the rows represent the terms present in the corpus and the columns the documents. The cells of the matrix contain the weight of each term–document pair. With <code>wikipedia_tdm</code> we can represent any other term–document matrix computed from another corpus in terms of the Wikipedia articles, that is, we can map the position of a corpus of documents in the concept space defined by the Wikipedia articles. We do this with a simple matrix operation.</p>
<p>Let <code>forum_tdm</code> be a tf-idf term–document matrix created from a corpus of online forum comments. We extract all the terms from <code>forum_tdm</code> and intersect them with the terms of <code>wikipedia_tdm</code> (to simplify the computation we drop all Wikipedia terms that do not appear in <code>forum_tdm</code>),</p>
<pre class="r"><code>w &lt;- rownames(forum_tdm)
w &lt;- w[w %in% rownames(wikipedia_tdm)]
wikipedia_tdm_subset &lt;- wikipedia_tdm[w,]</code></pre>
<p>and finally we obtain a <code>concept_matrix</code> with</p>
<pre class="r"><code># project each forum document into the Wikipedia concept space
# (rows: comments, columns: concepts)
concept_matrix &lt;- crossprod(as.matrix(forum_tdm[w,]), as.matrix(wikipedia_tdm_subset))

colnames(concept_matrix) &lt;- concept_corpus$page_id
rownames(concept_matrix) &lt;- colnames(forum_tdm)</code></pre>
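<p>The projection amounts to a dot product between each comment’s term vector and each concept’s term vector. A toy version with two terms, two concepts and one comment (all names and weights hypothetical):</p>

```r
# Columns are concepts (Wikipedia articles), rows are terms
wiki <- matrix(c(0.5, 0.0,
                 0.1, 0.4), nrow = 2, byrow = TRUE,
               dimnames = list(c("income", "parliament"), c("c1", "c2")))
# Term weights of a single forum comment
doc <- c(income = 2, parliament = 1)
# Concept scores: c1 = 2*0.5 + 1*0.1 = 1.1, c2 = 2*0.0 + 1*0.4 = 0.4
scores <- drop(doc %*% wiki)
```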
<p>The <code>concept_matrix</code> assigns a score to each comment–concept pair. It is then possible to interpret each comment from <code>forum_tdm</code> in terms of the scores of its concepts. Specifically, an insight into the meaning of a comment might be derived from the 10–20 concepts that received the highest scores.</p>
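<p>Extracting the top concepts of a comment is then a matter of sorting its row of the matrix, for instance:</p>

```r
# Hypothetical concept scores for one comment
scores <- c(concept_a = 0.1, concept_b = 0.9, concept_c = 0.4, concept_d = 0.7)
# The 3 best-scoring concepts, most relevant first
top3 <- names(sort(scores, decreasing = TRUE))[1:3]
```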
<div id="step-6-visualisating-a-discussion-in-2d" class="section level2">
<h2>Step 6: Visualising a discussion in 2D</h2>
<p>By locating each document of a corpus of interest within a concept space we can quantify the ‘distance’ between each pair of documents. Of course the concept space is a multidimensional space where, instead of the three axes of the space we experience around us (width, height and depth), we have an axis for each concept of the concept map (that is, potentially hundreds of thousands of axes). Nevertheless there exist many mathematical techniques for <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction">dimensionality reduction</a> that in practical terms can bring the number of dimensions down to two or three, thus opening the way to visualisation. I detail here how to use a technique called <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> <span class="citation">(Van der Maaten and Hinton 2008)</span> to visualise about 4,000 documents (blog posts, comments and parliamentary bills) discussing the introduction of a <a href="https://en.wikipedia.org/wiki/Basic_income">citizen’s income</a> in Italy, based on the concept space we computed before.</p>
<p>First we need to calculate the cosine distance matrix from our <code>concept_matrix</code>. Recall that <code>concept_matrix</code> stores the weight assigned to each document–concept pair. If we think about it in spatial terms, for each <em>document</em> the <code>concept_matrix</code> tells us its relative distance from each <em>concept</em>. But what we want to visualise is the distance separating each pair of <em>documents</em>, that is, we need a document–document matrix. The transformation is performed by calculating the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> of the <code>concept_matrix</code>. We first transpose <code>concept_matrix</code> with <code>t()</code> and then calculate a cosine <em>distance</em> matrix with the package <code>slam</code>.</p>
<pre class="r"><code># Cosine
concept_matrix &lt;- as.simple_triplet_matrix(t(concept_matrix))
cosine_dist_mat &lt;- 
  1 - crossprod_simple_triplet_matrix(concept_matrix)/
  (sqrt(col_sums(concept_matrix^2) %*% t(col_sums(concept_matrix^2))))</code></pre>
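<p>The formula can be checked on a pair of orthogonal document vectors, where the cosine distance should be exactly 1 off the diagonal and 0 on it:</p>

```r
# Two documents living on perpendicular concept axes
m <- cbind(a = c(1, 0), b = c(0, 1))
norms <- sqrt(colSums(m^2))
# Same formula as above, in dense base R
d <- 1 - crossprod(m) / (norms %*% t(norms))
```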
<p>Finally with the package <code>tsne</code> we fit our data to produce a matrix of two columns with <span class="math inline">\(xy\)</span> coordinates to plot each document as a dot on a 2D plane.</p>
<pre class="r"><code>require(tsne)
fit &lt;- tsne(cosine_dist_mat, max_iter = 1000)</code></pre>
<p>This is the result rendered with <code>ggplot2</code>:</p>
<p><img src="Figs/tsne-cosine-distance-m5s-gmi-discussion-1.png" title="" alt="" width="576" /></p>
<p>The figure is part of my research on online deliberation and Italy’s <a href="https://en.wikipedia.org/wiki/Five_Star_Movement">Five Star Movement (M5S)</a>. In the top panel I colour-coded the documents based on five macro concepts (which were used to identify each document) and marked with a triangle the bill that was presented in Parliament (also a document in my corpus). In the second row of plots from the top I map the 2D kernel density of the documents belonging to each macro concept, and in the third and fourth rows the temporal evolution of the discussion on two platforms (a forum and a blog).</p>
<div id="references-and-r-packages" class="section level1 unnumbered">
<h1>References and R packages</h1>
<div id="refs" class="references">
<div id="ref-gridExtra">
<p>Auguie, Baptiste. 2015. <em>GridExtra: Miscellaneous Functions for “Grid” Graphics</em>. <a href="https://CRAN.R-project.org/package=gridExtra" class="uri">https://CRAN.R-project.org/package=gridExtra</a>.</p>
<div id="ref-SnowballC">
<p>Bouchet-Valat, Milan. 2014. <em>SnowballC: Snowball Stemmers Based on the c Libstemmer UTF-8 Library</em>. <a href="https://CRAN.R-project.org/package=SnowballC" class="uri">https://CRAN.R-project.org/package=SnowballC</a>.</p>
<div id="ref-igraph">
<p>Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” <em>InterJournal</em> Complex Systems: 1695. <a href="http://igraph.org" class="uri">http://igraph.org</a>.</p>
<div id="ref-DBI">
<p>Databases, R Special Interest Group on. 2014. <em>DBI: R Database Interface</em>. <a href="https://CRAN.R-project.org/package=DBI" class="uri">https://CRAN.R-project.org/package=DBI</a>.</p>
<div id="ref-tsne">
<p>Donaldson, Justin. 2012. <em>Tsne: T-Distributed Stochastic Neighbor Embedding for R (T-SNE)</em>. <a href="https://CRAN.R-project.org/package=tsne" class="uri">https://CRAN.R-project.org/package=tsne</a>.</p>
<div id="ref-datatable">
<p>Dowle, M, A Srinivasan, T Short, S Lianoglou with contributions from R Saporta, and E Antonyan. 2015. <em>Data.table: Extension of Data.frame</em>. <a href="https://CRAN.R-project.org/package=data.table" class="uri">https://CRAN.R-project.org/package=data.table</a>.</p>
<div id="ref-tm">
<p>Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. “Text Mining Infrastructure in R.” <em>Journal of Statistical Software</em> 25 (5): 1–54. <a href="http://www.jstatsoft.org/v25/i05/" class="uri">http://www.jstatsoft.org/v25/i05/</a>.</p>
<div id="ref-gabrilovich_computing_2007">
<p>Gabrilovich, Evgeniy, and Shaul Markovitch. 2007. “Computing Semantic Relatedness Using Wikipedia-Based Explicit Semantic Analysis.” In <em>IJCAI</em>, 7:1606–11. <a href="http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf" class="uri">http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf</a>.</p>
<div id="ref-slam">
<p>Hornik, Kurt, David Meyer, and Christian Buchta. 2014. <em>Slam: Sparse Lightweight Arrays and Matrices</em>. <a href="https://CRAN.R-project.org/package=slam" class="uri">https://CRAN.R-project.org/package=slam</a>.</p>
<div id="ref-jarosciak_how_2013">
<p>Jarosciak, Jozef. 2013. “How to Import Entire Wikipedia into Your Own MySQL Database.” <em>Joe0.com</em>. <a href="http://www.joe0.com/2013/09/30/how-to-create-mysql-database-out-of-wikipedia-xml-dump-enwiki-latest-pages-articles-multistream-xml/" class="uri">http://www.joe0.com/2013/09/30/how-to-create-mysql-database-out-of-wikipedia-xml-dump-enwiki-latest-pages-articles-multistream-xml/</a>.</p>
<div id="ref-manning_introduction_2008">
<p>Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. <em>Introduction to Information Retrieval</em>. New York, NY: Cambridge University Press.</p>
<div id="ref-RColorBrewer">
<p>Neuwirth, Erich. 2014. <em>RColorBrewer: ColorBrewer Palettes</em>. <a href="https://CRAN.R-project.org/package=RColorBrewer" class="uri">https://CRAN.R-project.org/package=RColorBrewer</a>.</p>
<div id="ref-RMySQL">
<p>Ooms, Jeroen, David James, Saikat DebRoy, Hadley Wickham, and Jeffrey Horner. 2016. <em>RMySQL: Database Interface and ’MySQL’ Driver for R</em>. <a href="https://CRAN.R-project.org/package=RMySQL" class="uri">https://CRAN.R-project.org/package=RMySQL</a>.</p>
<div id="ref-R">
<p>R Core Team. 2016. <em>R: A Language and Environment for Statistical Computing</em>. Vienna, Austria: R Foundation for Statistical Computing. <a href="https://www.R-project.org/" class="uri">https://www.R-project.org/</a>.</p>
<div id="ref-van_der_maaten_visualizing_2008">
<p>Van der Maaten, Laurens, and Geoffrey Hinton. 2008. “Visualizing Data Using T-SNE.” <em>Journal of Machine Learning Research</em> 9 (2579-2605): 85. <a href="http://siplab.tudelft.nl/sites/default/files/vandermaaten08a.pdf" class="uri">http://siplab.tudelft.nl/sites/default/files/vandermaaten08a.pdf</a>.</p>
<div id="ref-ggplot2">
<p>Wickham, Hadley. 2009. <em>Ggplot2: Elegant Graphics for Data Analysis</em>. Springer-Verlag New York. <a href="http://had.co.nz/ggplot2/book" class="uri">http://had.co.nz/ggplot2/book</a>.</p>
<div id="ref-plyr">
<p>———. 2011. “The Split-Apply-Combine Strategy for Data Analysis.” <em>Journal of Statistical Software</em> 40 (1): 1–29. <a href="http://www.jstatsoft.org/v40/i01/" class="uri">http://www.jstatsoft.org/v40/i01/</a>.</p>
<div id="ref-stringr">
<p>———. 2015. <em>Stringr: Simple, Consistent Wrappers for Common String Operations</em>. <a href="https://CRAN.R-project.org/package=stringr" class="uri">https://CRAN.R-project.org/package=stringr</a>.</p>
<div id="ref-dplyr">
<p>Wickham, Hadley, and Romain Francois. 2015. <em>Dplyr: A Grammar of Data Manipulation</em>. <a href="https://CRAN.R-project.org/package=dplyr" class="uri">https://CRAN.R-project.org/package=dplyr</a>.</p>


