In this notebook we take the previously downloaded Wikipedia XML data and show how it can be searched and categorized.

The XML of these articles is extremely rich with data and could be applied to all sorts of things. The biggest bottleneck is that it takes a long time and a lot of memory to iterate through the downloaded partitions to extract information. Once that is achieved, there are two main ways to categorize articles: through Infoboxes, the templates placed on most articles, and through Categories, which are added to articles by their authors.

The blog post linked below goes into more detail on the processes mentioned here.
https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c
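
As a rough illustration of the Categories route mentioned above: category links in wikitext look like `[[Category:Name]]`, optionally followed by a sort key after a `|`. The sketch below pulls them out with a standard-library regular expression. `CATEGORY_RE`, `extract_categories` and the sample text are made up for this example and are not part of the notebook's pipeline.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]], capturing only the name.
CATEGORY_RE = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(wikitext):
    """Return the category names linked from a chunk of wikitext."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]


# Hand-written sample text, not taken from the dump.
sample = (
    "'''The Book of Daniel''' is a 1971 novel.\n"
    "[[Category:1971 American novels]]\n"
    "[[Category:Novels by E. L. Doctorow|Book of Daniel]]\n"
)
categories = extract_categories(sample)
print(categories)
```

A regular expression is only an approximation of wikitext parsing. Once the text has been parsed with a real wikitext parser, filtering the wikilinks whose title starts with `Category:` is likely the more robust option.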

In [11]:
import bz2
import subprocess
import numpy as np
import os
from timeit import default_timer as timer

data_path = r'C:\Users\Austin\.keras\datasets\enwiki-20191220-pages-articles9.xml-p1791081p2336422.bz2'
keras_home = r'C:\Users\Austin\.keras\datasets'
data_path

'C:\\Users\\Austin\\.keras\\datasets\\enwiki-20191220-pages-articles9.xml-p1791081p2336422.bz2'

Test bz2 versus bzcat. 
The cells below test the run time of using bz2 versus bzcat to process 1 million lines of the compressed file. <br>
The author of this notebook never got bzcat to work properly, although it is reportedly much faster.
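
A minimal sketch of how the bz2 side of this comparison could be timed, and how to check whether `bzcat` is available at all. It assumes nothing about the real dump: it writes a small throwaway `.bz2` file first. `time_bz2_read` and `sample_path` are names invented for this example.

```python
import bz2
import os
import shutil
import tempfile
from timeit import default_timer as timer


def time_bz2_read(path, max_lines=1_000_000):
    """Read up to max_lines lines with the bz2 module; return (lines read, seconds)."""
    start = timer()
    count = 0
    with bz2.open(path, "rb") as handle:
        for count, _line in enumerate(handle, start=1):
            if count >= max_lines:
                break
    return count, timer() - start


# Build a small throwaway .bz2 file so the sketch runs without the Wikipedia dump.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.xml.bz2")
with bz2.open(sample_path, "wb") as out:
    for n in range(5000):
        out.write(f"<line>{n}</line>\n".encode("utf-8"))

n_lines, seconds = time_bz2_read(sample_path, max_lines=1000)
print(f"bz2 module: {n_lines} lines in {seconds:.4f} s")

# bzcat is an external program. shutil.which returns None when it is not on the
# PATH, in which case a subprocess call to it cannot produce any lines.
bzcat_path = shutil.which("bzcat")
print("bzcat found at:", bzcat_path)
```

If `bzcat_path` is `None`, as is likely on a stock Windows install, an empty result from the subprocess approach is expected rather than a bug in the loop.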


In [12]:
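start = timer()
lines = []
# Read roughly the first million lines with the bz2 module
for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    lines.append(line)
    if i > 1e6:
        break

# Report the elapsed time so the two approaches can be compared
print(f'Read {len(lines)} lines in {round(timer() - start)} seconds.')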


In [13]:

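start = timer()
lines = []

# bzcat must be installed and on the PATH. The archive is opened in binary
# mode because it contains compressed bytes, not text.
for i, line in enumerate(subprocess.Popen('bzcat',
                         shell = True,
                         stdin = open(data_path, 'rb'),
                         stdout = subprocess.PIPE).stdout):
    lines.append(line)
    if i > 1e6:
        break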
        

In [14]:
print(lines[-250:-50])

[]



**Parsing Approach**
In order to get useful information from this data, we have to parse it on two levels:

1. Extract the titles and article text from the XML.
2. Extract relevant information from the article text.

In [15]:
import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    """Content handler for Wiki XML data using SAX"""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None
        self._pages = []

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text', 'timestamp'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            self._pages.append((self._values['title'], self._values['text']))

In [16]:
# Content handler for Wiki XML
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

handler._pages

[]

In [17]:

# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(subprocess.Popen('bzcat',
                         stdin = open(data_path, 'rb'),
                         shell = True,
                         stdout = subprocess.PIPE).stdout):
    parser.feed(line)
    # Stop once 3 articles have been found
    if len(handler._pages) > 2:
        break
        
print([x[0] for x in handler._pages])

[]


In [18]:
pip install mwparserfromhell

Note: you may need to restart the kernel to use updated packages.


In [19]:
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    
    parser.feed(line)
    
    # Stop once more than 1000 articles have been found
    if len(handler._pages) > 1000:
        break
        
print([x[0] for x in handler._pages])

['MHS', 'Picture Music International', "St Mark's Eve", 'Mount Sopris', 'Wikipedia:Articles for deletion/Andy Carrico', "Saint Mark's Eve", 'Goldschmidt Sex Scandal', 'Regional Technical College, Waterford', 'Mt. Sopris', 'SAMe', 'Herpes simplex virus 1', 'Herpes simplex virus 2', 'Damian Cray', 'Henry Preserved Smith', 'Arthur Seldon', 'Elk Range (California)', 'Sunburst Award', 'Gunn High', 'Seán Moore (Irish politician)', 'Damian cray', 'Taner Sağır', 'Line 6 (company)', 'Bertie Mee', 'Category:Value (ethics)', 'Paul Cullen (cardinal)', 'Wikipedia:WikiProject Swiss Municipalities', 'Paul Egede', 'Gregory Pakourianos', 'Private mortgage insurance', 'Category:Mountain ranges of British Columbia', 'Wikipedia:Wikiproject Swiss municipalities', 'Global positioning systems', 'Mexican labor law', 'Paul Emile Botta', 'Atlas Network', 'Yesler Terrace, Seattle', 'Regional technical college', 'Yakko Warner', 'Wikipedia:Articles for deletion/Invigilator', 'Paul Francois Barras', 'Paul Fleming (
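
The list above mixes real articles with `Wikipedia:` project pages, `Category:` pages and titles that look like redirects (for example 'Mt. Sopris' next to 'Mount Sopris'). In the MediaWiki export format each page carries an `<ns>` element (0 for the main article namespace, 14 for categories) and redirects carry a `<redirect>` element. The sketch below is one possible way to skip the non-articles. `ArticleOnlyHandler` and the hand-written `SAMPLE_XML` are hypothetical and exist only for this example.

```python
import xml.sax

# Hand-written miniature of the dump format, not real dump content.
SAMPLE_XML = """<mediawiki>
  <page>
    <title>Mount Sopris</title>
    <ns>0</ns>
    <revision><text>Mount Sopris is a mountain.</text></revision>
  </page>
  <page>
    <title>Mt. Sopris</title>
    <ns>0</ns>
    <redirect title="Mount Sopris" />
    <revision><text>#REDIRECT [[Mount Sopris]]</text></revision>
  </page>
  <page>
    <title>Category:Value (ethics)</title>
    <ns>14</ns>
    <revision><text>Category page.</text></revision>
  </page>
</mediawiki>"""


class ArticleOnlyHandler(xml.sax.ContentHandler):
    """Collect titles of main-namespace pages that are not redirects."""

    def __init__(self):
        super().__init__()
        self._tag = None          # tag whose text is currently being buffered
        self._buffer = []
        self._values = {}
        self._is_redirect = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "page":
            # Reset per-page state so values never leak between pages
            self._values = {}
            self._is_redirect = False
        elif name == "redirect":
            self._is_redirect = True
        elif name in ("title", "ns"):
            self._tag = name
            self._buffer = []

    def characters(self, content):
        if self._tag:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._tag:
            self._values[name] = "".join(self._buffer)
            self._tag = None
        elif name == "page":
            if self._values.get("ns") == "0" and not self._is_redirect:
                self.titles.append(self._values["title"])


article_handler = ArticleOnlyHandler()
xml.sax.parseString(SAMPLE_XML.encode("utf-8"), article_handler)
print(article_handler.titles)
```

The same two checks could be folded into `WikiXmlHandler` by adding `'ns'` to the buffered tags, at the cost of changing the counts reported later in this notebook.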

In [20]:

# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    parser.feed(line)
    
    # Stop once more than 50 articles have been found
    if len(handler._pages) > 50:
        break

In [21]:
import mwparserfromhell 

print(handler._pages[20][0])

# Create the wiki article
wiki = mwparserfromhell.parse(handler._pages[20][1])

Taner Sağır


In [22]:
print(type(wiki))
wiki[:100]

<class 'mwparserfromhell.wikicode.Wikicode'>


'{{BLP sources|date=November 2009}} \n {{Infobox sportsperson \n | name           = Taner Sağır \n | ima'

In [23]:
wikilinks = [x.title for x in wiki.filter_wikilinks()]
print(f'There are {len(wikilinks)} wikilinks.')
wikilinks

There are 72 wikilinks.


['Turkey',
 'Kardzhali',
 'Bulgaria',
 'Turkey',
 'Olympic weightlifting',
 'Ankara',
 'Weightlifting at the Summer Olympics',
 '2004 Summer Olympics',
 "Weightlifting at the 2004 Summer Olympics – Men's 77 kg",
 'World Weightlifting Championships',
 '2006 World Weightlifting Championships',
 'European Weightlifting Championships',
 '2004 European Weightlifting Championships',
 '2005 European Weightlifting Championships',
 '2007 European Weightlifting Championships',
 'Kardzhali',
 'Bulgaria',
 'Turkey',
 'Olympic weightlifting',
 '2004 Summer Olympics',
 'snatch (weightlifting)',
 'clean and jerk',
 'Turks in Bulgaria',
 'Yenimahalle, Ankara',
 'Pursaklar, Ankara',
 'Nezir Sağır',
 'Hürriyet',
 'Ankara',
 '2008 Summer Olympics',
 'Sibel Güler',
 'Hürriyet',
 'Athens',
 'Greece',
 'Santo Domingo',
 'Dominican Republic',
 'Hermosillo',
 'Mexico',
 'Sofia',
 'Bulgaria',
 'Kiev',
 'Ukraine',
 'Havířov',
 'Czech Republic',
 'Stavanger',
 'Norway',
 'Košice',
 'Slovakia',
 'Turkey',
 'Athen

In [24]:

wiki.filter_arguments()

[]

In [25]:
wiki.filter_comments()


[]

To figure out everything you can do with mwparserfromhell, read the docs.
https://mwparserfromhell.readthedocs.io/en/latest/

In [26]:
external_links = [(x.title, x.url) for x in wiki.filter_external_links()]
print(f'There are {len(external_links)} external links.')
external_links[:5]

There are 4 external links.


[(None, 'http://arama.hurriyet.com.tr/arsivnews.aspx?id=10535466'),
 (None,
  'http://www.sabah.com.tr/SabahSpor/TumSporlar/2011/10/10/taner-sagirdan-haltere-veda'),
 (None, 'http://www.hurriyet.de/haberler/son-dakika/163264/haber'),
 ('Taner Sağır at Lift Up',
  'http://www.chidlovski.net/liftup/l_athleteStatsResult.asp?a_id=599')]

In [27]:
# We can search for specific words.
contemporary = wiki.filter(matches = 'Turkey')
contemporary[1], type(contemporary[1])

('[[Turkey|Turkish]]', mwparserfromhell.nodes.wikilink.Wikilink)

In [28]:
wiki.strip_code().strip()[:100]


'Taner Sağır (born March 13, 1985 in Kardzhali, Bulgaria) is a Turkish world and Olympic weightliftin'

## Infoboxes are the easiest way to segment wiki articles.

In [29]:
templates = wiki.filter_templates()
print(f'There are {len(templates)} templates.')
for template in templates:
    print(template.name)

There are 25 templates.
BLP sources
Infobox sportsperson 
 
birth date and age
height
MedalCompetition
MedalGold
MedalCompetition
MedalGold
MedalCompetition
MedalGold
MedalGold
MedalBronze
cite news 
cite news 
Gold medal
Gold medal
Gold medal
Gold medal
Gold medal
Gold medal
Gold medal
Silver medal
reflist
Footer Olympic Champions Weightlifting Middleweight
DEFAULTSORT:Sagir, Taner


In [30]:
templates

['{{BLP sources|date=November 2009}}',
 "{{Infobox sportsperson \n | name           = Taner Sağır \n | image          =  \n | imagesize      = \n | caption        =  \n | birth_name     = \n | fullname       =  \n | nickname       =  \n | nationality    = [[Turkey|Turkish]] \n | residence      =  \n | birth_date     = {{birth date and age|1985|3|13}}  \n | birth_place    = [[Kardzhali]], [[Bulgaria]] \n | height         = {{height|m=1.70}} \n | weight         =  \n | website        = \n | country        = [[Turkey]] \n | sport          = [[Olympic weightlifting|Weightlifting]] \n | event          =  & ndash;77 kg \n | collegeteam    =  \n | club           = Demirspor Club, [[Ankara]] \n | team           =  \n | turnedpro      =  \n | coach          = Muharrem Süleymanoğlu and Osman Nuri Vural  \n | retired        =  \n | coaching       =  \n | worlds         =  \n | regionals      =  \n | nationals      =  \n | olympics       =  \n | paralympics    =  \n | highestranking =  \n | pb    

In [31]:
infobox = wiki.filter_templates(matches = 'Infobox sportsperson')[0]
infobox

"{{Infobox sportsperson \n | name           = Taner Sağır \n | image          =  \n | imagesize      = \n | caption        =  \n | birth_name     = \n | fullname       =  \n | nickname       =  \n | nationality    = [[Turkey|Turkish]] \n | residence      =  \n | birth_date     = {{birth date and age|1985|3|13}}  \n | birth_place    = [[Kardzhali]], [[Bulgaria]] \n | height         = {{height|m=1.70}} \n | weight         =  \n | website        = \n | country        = [[Turkey]] \n | sport          = [[Olympic weightlifting|Weightlifting]] \n | event          =  & ndash;77 kg \n | collegeteam    =  \n | club           = Demirspor Club, [[Ankara]] \n | team           =  \n | turnedpro      =  \n | coach          = Muharrem Süleymanoğlu and Osman Nuri Vural  \n | retired        =  \n | coaching       =  \n | worlds         =  \n | regionals      =  \n | nationals      =  \n | olympics       =  \n | paralympics    =  \n | highestranking =  \n | pb             =  \n | medaltemplates =  \n {{

In [32]:
information = {param.name.strip_code().strip(): param.value.strip_code().strip() for param in infobox.params}
information

{'name': 'Taner Sağır',
 'image': '',
 'imagesize': '',
 'caption': '',
 'birth_name': '',
 'fullname': '',
 'nickname': '',
 'nationality': 'Turkish',
 'residence': '',
 'birth_date': '',
 'birth_place': 'Kardzhali, Bulgaria',
 'height': '',
 'weight': '',
 'website': '',
 'country': 'Turkey',
 'sport': 'Weightlifting',
 'event': '& ndash;77 kg',
 'collegeteam': '',
 'club': 'Demirspor Club, Ankara',
 'team': '',
 'turnedpro': '',
 'coach': 'Muharrem Süleymanoğlu and Osman Nuri Vural',
 'retired': '',
 'coaching': '',
 'worlds': '',
 'regionals': '',
 'nationals': '',
 'olympics': '',
 'paralympics': '',
 'highestranking': '',
 'pb': '',
 'medaltemplates': '',
 'show-medals': ''}
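
The stray space in `'& ndash;77 kg'` above is probably a side effect of the handler rather than of the article. A SAX parser is allowed to deliver the text of one element in several chunks, and `WikiXmlHandler` joins those chunks with `' '`. The sketch below records the chunks for a one-line sample and compares the two ways of joining them. `ChunkRecorder` is a name invented for this example, and exactly where the parser splits the text is an implementation detail.

```python
import xml.sax


class ChunkRecorder(xml.sax.ContentHandler):
    """Record every chunk of character data delivered inside <text>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._inside = False

    def startElement(self, name, attrs):
        if name == "text":
            self._inside = True

    def endElement(self, name):
        if name == "text":
            self._inside = False

    def characters(self, content):
        if self._inside:
            self.chunks.append(content)


recorder = ChunkRecorder()
xml.sax.parseString(
    b"<page><text>| event = &amp;ndash;77 kg</text></page>", recorder
)

print(recorder.chunks)                   # where the parser chose to split the text
print(repr(" ".join(recorder.chunks)))   # can introduce spaces at chunk boundaries
print(repr("".join(recorder.chunks)))    # reproduces the original text

joined = "".join(recorder.chunks)
```

Joining with an empty string keeps the text intact, but switching the handler over would change the outputs shown in this notebook, so it is left as a note here.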

In [33]:
import re

def process_article(title, text, timestamp, template = 'Infobox book'):
    """Process a wikipedia article looking for template"""
    
    # Create a parsing object
    wikicode = mwparserfromhell.parse(text)
    
    # Search through templates for the template
    matches = wikicode.filter_templates(matches = template)
    
    # Filter out errant matches
    matches = [x for x in matches if x.name.strip_code().strip().lower() == template.lower()]
    
    if len(matches) >= 1:
        # template_name = matches[0].name.strip_code().strip()

        # Extract information from infobox
        properties = {param.name.strip_code().strip(): param.value.strip_code().strip() 
                      for param in matches[0].params
                      if param.value.strip_code().strip()}

        # Extract internal wikilinks
        wikilinks = [x.title.strip_code().strip() for x in wikicode.filter_wikilinks()]

        # Extract external links
        exlinks = [x.url.strip_code().strip() for x in wikicode.filter_external_links()]

        # Find approximate length of article
        text_length = len(wikicode.strip_code().strip())

        return (title, properties, wikilinks, exlinks, timestamp, text_length)

In [34]:
r = process_article('Taner Sağır', wiki, None)
r

In [35]:
r = process_article('Taner Sağır', wiki, None, template = 'Infobox sportsperson')
r[0], r[1]

('Taner Sağır',
 {'name': 'Taner Sağır',
  'nationality': 'Turkish',
  'birth_place': 'Kardzhali, Bulgaria',
  'country': 'Turkey',
  'sport': 'Weightlifting',
  'event': '& ndash;77 kg',
  'club': 'Demirspor Club, Ankara',
  'coach': 'Muharrem Süleymanoğlu and Osman Nuri Vural'})

### Content handler modified for finding books.

In [36]:
class WikiXmlHandler(xml.sax.handler.ContentHandler):
    """Parse through XML data using SAX"""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None
        self._books = []
        self._article_count = 0
        self._non_matches = []

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text', 'timestamp'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            self._article_count += 1
            # Search through the page to see if the page is a book
            book = process_article(**self._values, template = 'Infobox book')
            # Append to the list of books
            if book:
                self._books.append(book)

In [37]:
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    parser.feed(line)
    
    # Stop when 3 books have been found
    if len(handler._books) > 2:
        break
        
print(f'Searched through {handler._article_count} articles to find 3 books.')

Searched through 510 articles to find 3 books.


In [38]:
handler._books[0]


('The Book of Daniel (novel)',
 {'1': '< !-- See Wikipedia:WikiProject Novels or Wikipedia:WikiProject Books -- >',
  'name': 'The Book of Daniel',
  'title_orig': 'The Book of Daniel',
  'image': 'TheBookOfDaniel.jpg',
  'border': 'yes',
  'caption': 'First edition',
  'author': 'E. L. Doctorow',
  'country': 'US',
  'language': 'English',
  'genre': 'Fiction',
  'publisher': 'Random House',
  'pub_date': '1971',
  'media_type': 'Hardcover',
  'pages': '303',
  'isbn': '978-0-8129-7817-9',
  'congress': 'PS3554.O3 B6 2007',
  'oclc': '141385012'},
 ['Wikipedia:WikiProject Novels',
  'Wikipedia:WikiProject Books',
  'E. L. Doctorow',
  'Random House',
  'E. L. Doctorow',
  'Julius and Ethel Rosenberg',
  'The Guardian',
  'The New York Times',
  'Book of Daniel',
  'Project MUSE',
  'Disneyland',
  'Daniel (1983 movie)',
  'Sidney Lumet',
  'Morton Sobell',
  'David Greenglass',
  'Paul Robeson',
  'Peekskill Riots',
  'Category:1971 American novels',
  'Category:American historical no

In [39]:
from timeit import default_timer as timer

start = timer()
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Parse the entire file
for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    if (i + 1) % 10000 == 0:
        print(f'Processed {i + 1} lines so far.', end = '\r')
    try:
        parser.feed(line)
    except StopIteration:
        break
    
end = timer()
books = handler._books

print(f'\nSearched through {handler._article_count} articles.')
print(f'\nFound {len(books)} books in {round(end - start)} seconds.')

Processed 19410000 lines so far.
Searched through 257498 articles.

Found 982 books in 2049 seconds.


A single partition took about 34 minutes (2049 seconds).
We definitely need to consider parallelization!

## Parallel Computing
We have a large amount of data, and processing it sequentially takes a very long time.

Unfortunately, at this point my project changed slightly to use the SQL data, so I never made any real headway into doing this properly. All of my attempts had issues with Windows and Jupyter Notebook.
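
For reference, here is a sketch of what per-partition parallelism could look like with the standard library. It is not something that was run for this project. It assumes each partition is an independent `.bz2` file and uses a simple page counter as a stand-in for the book-finding handler. `PageCounter`, `count_pages`, `count_pages_parallel` and `write_sample_partition` are all hypothetical names.

```python
import bz2
import os
import tempfile
import xml.sax
from concurrent.futures import ProcessPoolExecutor


class PageCounter(xml.sax.ContentHandler):
    """Count <page> elements; a stand-in for the book-finding handler."""

    def __init__(self):
        super().__init__()
        self.page_count = 0

    def endElement(self, name):
        if name == "page":
            self.page_count += 1


def count_pages(path):
    """Worker: parse one compressed partition and return (path, page count)."""
    handler = PageCounter()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    with bz2.open(path, "rb") as handle:
        for line in handle:
            parser.feed(line)
    parser.close()
    return path, handler.page_count


def count_pages_parallel(paths, max_workers=None):
    """Run count_pages over several partitions in separate processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(count_pages, paths))


def write_sample_partition(path, n_pages):
    """Write a tiny made-up partition so the sketch can be tried without a dump."""
    with bz2.open(path, "wt", encoding="utf-8") as out:
        out.write("<mediawiki>\n")
        for n in range(n_pages):
            out.write(f"<page><title>Page {n}</title></page>\n")
        out.write("</mediawiki>\n")


if __name__ == "__main__":
    # The guard matters on Windows, where worker processes re-import this file.
    tmp_dir = tempfile.mkdtemp()
    paths = []
    for index, n_pages in enumerate((3, 5)):
        path = os.path.join(tmp_dir, f"part{index}.xml.bz2")
        write_sample_partition(path, n_pages)
        paths.append(path)
    print(count_pages_parallel(paths, max_workers=2))
```

On Windows, worker processes are started fresh and have to import the worker function, so functions defined directly in a notebook cell generally cannot be sent to them. Saving the code above as a `.py` file and either running it as a script or importing `count_pages_parallel` from the notebook is one way around that, and may be what tripped up the earlier attempts.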