# Example Task: count number of references on a page in Wikipedia

* Use the wikidumps, as this information is not otherwise available
* We count using a regular expression.
* The tricky thing is that the references are in the running text at the point where the footnote appears.
    * They are not "where you see them on the page at the end"
* **Technique:** We stream in the bz2 file and read it page by page using lxml iterparse.    
    
    
 

In [2]:
# get the en wikipedia dump
!wget https://dumps.wikimedia.org/nlwiki/20170201/nlwiki-20170201-pages-articles1.xml.bz2

--2017-03-13 14:28:38--  https://dumps.wikimedia.org/nlwiki/20170201/nlwiki-20170201-pages-articles1.xml.bz2
Resolving dumps.wikimedia.org... 208.80.154.11, 2620::861:1:208:80:154:11
Connecting to dumps.wikimedia.org|208.80.154.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 154796969 (148M) [application/octet-stream]
Saving to: ‘nlwiki-20170201-pages-articles1.xml.bz2’


2017-03-13 14:32:04 (737 KB/s) - ‘nlwiki-20170201-pages-articles1.xml.bz2’ saved [154796969/154796969]



In [2]:
# inspecting the file on the command line  (only works on Mac and Linux)
! bzip2 -d --stdout nlwiki-20170201-pages-articles1.xml.bz2 |head -20

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="nl">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>nlwiki</dbname>
    <base>https://nl.wikipedia.org/wiki/Hoofdpagina</base>
    <generator>MediaWiki 1.29.0-wmf.9</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Speciaal</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Overleg</namespace>
      <namespace key="2" case="first-letter">Gebruiker</namespace>
      <namespace key="3" case="first-letter">Overleg gebruiker</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Overleg Wikipedia

In [3]:
%matplotlib inline
import pandas as pd
import re
from lxml import etree 
from bz2file import BZ2File
import codecs

In [4]:

# The regex which extracts the references (depending on the string encoding)
reference_extractor=r'<ref>(.*?)</ref>' # r'&lt;ref&gt;(.*?)&lt;/ref&gt;'


sample='''
Anarchism is a [[political philosophy]] that advocates [[self-governance|self-governed]] societies with voluntary institutions. These are often described as [[stateless society|stateless societies]],&lt;ref&gt;&quot;ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man's natural social tendencies.&quot; George Woodcock. &quot;Anarchism&quot; at The Encyclopedia of Philosophy&lt;/ref&gt;&lt;ref&gt;&quot;In a society developed on these lines, the voluntary associations which already now begin to cover all the fields of human activity would take a still greater extension so as to substitute themselves for the state in all its functions.&quot; [http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin___Anarchism__from_the_Encyclopaedia_Britannica.html Peter Kropotkin. &quot;Anarchism&quot; from the Encyclopædia Britannica]&lt;/ref&gt;&lt;ref&gt;&quot;Anarchism.&quot; The Shorter Routledge Encyclopedia of Philosophy. 2005. p. 14 &quot;Anarchism is the view that a society without the state, or government, is both possible and desirable.&quot;&lt;/ref&gt;&lt;ref&gt;Sheehan, Sean. Anarchism, London: Reaktion Books Ltd., 2004. p. 85&lt;/ref&gt; but several authors have defined them more specifically as institutions based on non-[[Hierarchy|hierarchical]] [[Free association (communism and anarchism)|free associations]].&lt;ref&gt;&quot;as many anarchists have stressed, it is not government as such that they find objectionable, but the hierarchical forms of government associated with the nation state.&quot; Judith Suissa. ''Anarchism and Education: a Philosophical Perspective''. Routledge. New York. 2006. p. 7&lt;/ref&gt;&lt;ref name=&quot;iaf-ifa.org&quot;/&gt;&lt;ref&gt;&quot;That is why Anarchy, when it works to destroy authority in all its aspects, when it demands the abrogation of laws and the abolition of the mechanism that serves to impose them, when it refuses all hierarchical organisation and preaches free agreement — at the same time strives to maintain and enlarge the precious kernel of social customs without which no human or animal society can exist.&quot; [[Peter Kropotkin]]. [http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin__Anarchism__its_philosophy_and_ideal.html Anarchism: its philosophy and ideal]&lt;/ref&gt;&lt;ref&gt;&quot;anarchists are opposed to irrational (e.g., illegitimate) authority, in other words, hierarchy — hierarchy being the institutionalisation of authority within a society.&quot; [http://www.theanarchistlibrary.org/HTML/The_Anarchist_FAQ_Editorial_Collective__An_Anarchist_FAQ__03_17_.html#toc2 &quot;B.1 Why are anarchists against authority and hierarchy?&quot;] in [[An Anarchist FAQ]]&lt;/ref&gt; Anarchism holds the [[state (polity)|state]] to be undesirable, unnecessary, or harmful.&lt;ref name=&quot;definition&quot;&gt;
{{cite journal |last=Malatesta|first=Errico|title=Towards Anarchism|journal=MAN!|publisher=International Group of San Francisco|location=Los Angeles|oclc=3930443|url=http://www.marxists.org/archive/malatesta/1930s/xx/toanarchy.htm|archiveurl=https://web.archive.org/web/20121107221404/http://marxists.org/archive/malatesta/1930s/xx/toanarchy.htm|archivedate=7 November 2012 |deadurl=no|authorlink=Errico Malatesta |ref=harv}}
{{cite journal |url=http://www.theglobeandmail.com/servlet/story/RTGAM.20070514.wxlanarchist14/BNStory/lifeWork/home/
|archiveurl=https://web.archive.org/web/20070516094548/http://www.theglobeandmail.com/servlet/story/RTGAM.20070514.wxlanarchist14/BNStory/lifeWork/home |archivedate=16 May 2007 |deadurl=yes |title=Working for The Man |journal=[[The Globe and Mail]] |accessdate=14 April 2008 |last=Agrell |first=Siri |date=14 May 2007 |ref=harv }}
'''


sum ( 1 for x in re.finditer(r'&lt;ref&gt;(.*?)&lt;/ref&gt;', sample)), re.findall(r'&lt;ref&gt;(.*?)&lt;/ref&gt;',sample)

(7,
 ["&quot;ANARCHISM, a social philosophy that rejects authoritarian government and maintains that voluntary institutions are best suited to express man's natural social tendencies.&quot; George Woodcock. &quot;Anarchism&quot; at The Encyclopedia of Philosophy",
  '&quot;In a society developed on these lines, the voluntary associations which already now begin to cover all the fields of human activity would take a still greater extension so as to substitute themselves for the state in all its functions.&quot; [http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin___Anarchism__from_the_Encyclopaedia_Britannica.html Peter Kropotkin. &quot;Anarchism&quot; from the Encyclop\xc3\xa6dia Britannica]',
  '&quot;Anarchism.&quot; The Shorter Routledge Encyclopedia of Philosophy. 2005. p. 14 &quot;Anarchism is the view that a society without the state, or government, is both possible and desirable.&quot;',
  'Sheehan, Sean. Anarchism, London: Reaktion Books Ltd., 2004. p. 85',
  "&quot;as many 

In [5]:
# from http://stackoverflow.com/questions/12160418/why-is-lxml-etree-iterparse-eating-up-all-my-memory
# We use the ideas in this function in our own iter function:
    # clear the context element
    # remove all now-empty references from the root node to the contxt element

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

In [6]:
# We use the pageid http://dbpedia.org/ontology/wikiPageID as key value for the page
# Note the use of the (default) namespace all the time.

def count_references(f,inset): # ,outfile):
    '''With f a wikipedia dump and inset a list of pageids, count for each page in the dump \cap inset the number of references
    on the page and return as a dict with pageid:reference_count key-value pairs.'''
    with BZ2File(f) as xml_file:
        context = etree.iterparse(xml_file,  tag= '{http://www.mediawiki.org/xml/export-0.10/}page')
        pages_references_dict={} 
        # use this when you want to write away the results to file
        #with codecs.open(outfile,'w', encoding='utf-8') as f: 
            #f.write('WikiPageID;Number_of_references\n')
            
        c=0
        for _, elem in context:
                #title= elem.findtext('{http://www.mediawiki.org/xml/export-0.10/}title') 
                page_id = elem.findtext('{http://www.mediawiki.org/xml/export-0.10/}id')
                try:  
                    page_id= int(page_id)
                    if  page_id in inset: # we only do this for pages in  inset
                        pagetext=elem.findtext('{http://www.mediawiki.org/xml/export-0.10/}revision/{http://www.mediawiki.org/xml/export-0.10/}text')
                        ref_count= sum ( 1 for x in re.finditer(reference_extractor, pagetext))
                        #if page_id  and ref_count: # just store those with a count > 0
                        pages_references_dict[page_id]= ref_count 
                        # use this when you want to write away the results to file
                        #f.write(str(page_id)+';'+str(ref_count)+'\n')
                except:
                    True
                
                # now get rid of the element and also delete all its ancestors   
                # from http://stackoverflow.com/questions/12160418/why-is-lxml-etree-iterparse-eating-up-all-my-memory
                elem.clear()
                # Also eliminate now-empty references from the root node to elem
                for ancestor in elem.xpath('ancestor-or-self::*'):
                    while ancestor.getprevious() is not None:
                        del ancestor.getparent()[0]    
                # for debugging and seeing how far the code is already
                c+=1
                if c% 10**5==0:
                     print c  #    break #
        del context   
    return pages_references_dict


In [7]:
# do the counting and save dataframe as pickle

def Count_References():
    # wikidump
    inputfile= 'nlwiki-20170201-pages-articles1.xml.bz2'
    #output='en_ref_count.csv'
    # our list of wikipageid of our persons
    #wikiPageID= pd.read_pickle('../wikiPageID.pkl')
    wikiPageID_set= set(range(10**8)) # set(wikiPageID.values) # we just take all
    # do the counting
    en_ref_count= count_references(inputfile,wikiPageID_set)# ,output)
    # turn into a dataframe and pickle
    ef= pd.DataFrame.from_dict(en_ref_count, orient='index')
    ef.columns=['Number_of_references']
    ef.to_pickle('en_ref_count.pkl')
    return True


In [8]:
%time Count_References()

100000
CPU times: user 1min 19s, sys: 21.8 s, total: 1min 41s
Wall time: 1min 50s


True

In [9]:
#test
df= pd.read_pickle('en_ref_count.pkl')
df.describe()

Unnamed: 0,Number_of_references
count,121591.0
mean,0.847053
std,4.532418
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,327.0


In [None]:
!pwd;ls -l

In [None]:
!bzip2 --help


In [13]:
!ls ../Data/GenderWikipedia/

CollectMoreFirstNames.ipynb                  GenderIdentificationUsingAmericanNames.ipynb NonAmbiguousNames.csv                        Person_10K.csv.gz
CollectReferences.ipynb                      GenderWikipedia.ipynb                        NrLanguagesPerWikipediaPage.pkl              Untitled.ipynb
FeaturesPerPage.ipynb                        [34mNames[m[m                                        PageViews.ipynb                              [34mgender-on-wikipedia[m[m
