# Introduction: Exploring Wikipedia Book Data

After gathering all of the data about books on Wikipedia, it's time to see what we can find in it! In this notebook we'll explore the data in preparation for building a recommendation engine based on embeddings.

In [1]:
import pandas as pd
import numpy as np


In [2]:
data_path = '/data/books.ndjson'

import json

books = []

with open(data_path, 'r') as fin:
    # Each line of the ndjson file holds one JSON-encoded book record
    for line in fin:
        books.append(json.loads(line))
        
print(f'Found {len(books)} books.')

Found 36552 books.


In [10]:
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = 'all'
import pprint

books[2][0]
pprint.pprint(books[2][1])
books[2][2][-10:]

'A Modest Proposal'

{'author': 'Jonathan Swift',
 'genre': 'Satirical essay',
 'image': 'File:A Modest Proposal 1729 Cover.jpg',
 'name': 'A Modest Proposal',
 'pub_date': '1729'}


['Melvyn Bragg',
 'Category:Essays by Jonathan Swift',
 'Category:Satirical works',
 'Category:Pamphlets',
 'Category:18th-century essays',
 'Category:Works published anonymously',
 'Category:British satire',
 'Category:1729 in Great Britain',
 'Category:Cannibalism in fiction',
 'Category:1729 books']

# Exploring Categories

Many of the links stored with each book (`book[2]`) refer to categories. We can count the number of distinct categories and take a look at the most common ones.

In [11]:
categories = []

for book in books:
    for link in book[2]:
        if 'Category:' in link:
            categories.append(link[9:])
    
print(f"There are {len(set(categories))} unique categories.")

There are 24333 unique categories.
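The loop above keeps any link that contains `Category:` anywhere and always drops the first nine characters. A slightly stricter variant, sketched here on a few made-up links, only accepts links that start with the prefix and trims stray whitespace. The helper name `extract_categories` and the sample list are illustrative assumptions, and running this on the full data could change the unique count, because names such as ' Music in Manchester' would merge with their unpadded form.

```python
def extract_categories(links, prefix='Category:'):
    """Return category names from links that begin with the prefix."""
    return [link[len(prefix):].strip()
            for link in links
            if link.startswith(prefix)]

# Hypothetical sample, shaped like the links of a single book
sample_links = ['Melvyn Bragg',
                'Category:Pamphlets',
                'Category: Music in Manchester']

sample_categories = extract_categories(sample_links)
print(sample_categories)
```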


In [13]:
from collections import Counter

cate_counts = dict(Counter(categories))
cate_count_list = sorted(cate_counts.items(), key = lambda x: x[1], reverse = True)
cate_count_list[:10]

[('American science fiction novels', 1280),
 ('American novels adapted into films', 1139),
 ('Debut novels', 1066),
 ('American fantasy novels', 1030),
 ('American young adult novels', 949),
 ('HarperCollins books', 887),
 ('British novels adapted into films', 845),
 ('English-language books', 742),
 ('Novels first published in serial form', 629),
 ('American non-fiction books', 625)]
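As a side note, `Counter` can produce the same ranking without the manual sort. The sketch below uses a small made-up category list (not the real data) to show `most_common`, which returns `(item, count)` pairs ordered from most to least frequent.

```python
from collections import Counter

# Hypothetical sample standing in for the full `categories` list
sample = ['Debut novels', 'Pamphlets', 'Debut novels',
          'British satire', 'Debut novels', 'Pamphlets']

sample_counts = Counter(sample)

# The n most frequent categories
top_two = sample_counts.most_common(2)

# With no argument, every category is returned, so the tail holds the rarest
rarest = sample_counts.most_common()[-1:]

print(top_two)
print(rarest)
```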

What are the rarest categories?

In [14]:
cate_count_list[-10:]

[('Book stubs', 1),
 ('Go (game)', 1),
 ('Iranian nationalism', 1),
 ('French novels', 1),
 ('1952 in Wales', 1),
 (' Music in Manchester', 1),
 (' Culture in Manchester', 1),
 ('Pablo Picasso', 1),
 ('Maltese books', 1),
 ('Maltese literature', 1)]

# Exploring Book Attributes

We can also look through all the information that was in the infobox templates. We'll collect this into a single dictionary.

In [17]:
attributes = {}

for book in books:
    for key, value in book[1].items():
        if key in attributes:
            attributes[key].append(value)
        else:
            attributes[key] = [value]

print(f'There are a total of {len(attributes)} attributes.')

There are a total of 110 attributes.


In [19]:
attributes.keys()

dict_keys(['1', 'name', 'image', 'alt', 'author', 'illustrator', 'country', 'language', 'genre', 'publisher', 'release_date', 'pages', 'isbn', 'title_orig', 'caption', 'published', 'media_type', 'dewey', 'congress', 'oclc', 'pub_date', 'cover_artist', 'italic title', 'subject', 'preceded_by', 'followed_by', 'wikisource', 'image_size', 'series', 'translator', 'english_pub_date', 'preceded_by_quotation_marks', 'audio_read_by', 'border', 'orig_lang_code', 'native_wikisource', 'subjects', 'isbn_note', 'awards', 'editor', 'title_working', 'set_in', 'width', 'award', 'authors', 'notes', 'ISBN_note', 'ISBN', 'english_release_date', 'release_number', 'ol', 'location', 'website', 'followed_by_quotation_marks', 'native_external_host', 'external_host', 'translators', 'publisher2', 'nocat_wdimage', 'native_external_url', 'infoboxwidth', 'first', 'exclude_cover', 'dedicated_to', 'external_url', 'note', 'illustrators', 'image_caption', '2', 'publication_type', 'genres', 'editors', 'printer', 'asin',

In [21]:
num_attr = {key: len(values) for key, values in attributes.items()}
num_attr = sorted(num_attr.items(), key = lambda x: x[1], reverse = True)
num_attr[:10]

[('name', 35942),
 ('author', 35106),
 ('language', 32503),
 ('country', 30341),
 ('publisher', 30249),
 ('image', 28517),
 ('media_type', 25816),
 ('pages', 25791),
 ('genre', 25661),
 ('isbn', 24617)]
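Raw counts are easier to compare when expressed as the share of books that carry each attribute. The following is a minimal sketch on three made-up records; the names `sample_books`, `sample_attributes`, and `coverage` are assumptions for illustration, and the records only mimic the `[title, infobox, links]` layout seen earlier.

```python
from collections import defaultdict

# Hypothetical records in the same layout as `books`
sample_books = [
    ['Animal Farm',
     {'name': 'Animal Farm', 'author': 'George Orwell', 'language': 'English'}, []],
    ['A Modest Proposal',
     {'name': 'A Modest Proposal', 'author': 'Jonathan Swift'}, []],
    ['An Enquiry Concerning Human Understanding',
     {'name': 'An Enquiry Concerning Human Understanding'}, []],
]

# Collect every value seen for each infobox attribute
sample_attributes = defaultdict(list)
for book in sample_books:
    for key, value in book[1].items():
        sample_attributes[key].append(value)

# Fraction of books that have each attribute, most common first
coverage = {key: len(values) / len(sample_books)
            for key, values in sample_attributes.items()}
coverage_ranked = sorted(coverage.items(), key=lambda item: item[1], reverse=True)

for key, share in coverage_ranked:
    print(f'{key}: {share:.0%}')
```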

Let's make a dataframe with one column per attribute and one row per book.

In [26]:
data = pd.DataFrame([x[0] for x in books], columns = ['title'])
data.head()

Unnamed: 0,title
0,Animalia (book)
1,Animal Farm
2,A Modest Proposal
3,A Clockwork Orange (novel)
4,An Enquiry Concerning Human Understanding


In [27]:
for key in attributes:
    data.loc[:, key] = np.nan
    
data.head()

Unnamed: 0,title,1,name,image,alt,author,illustrator,country,language,genre,...,Spanish_pub_date,lcccn,cover_design,Contributing illustrators,Product Dimensions,Age Range,Book length,official website,The Natural Daughter with Portraits of the Leadenhead Family,Securing Sex: Morality and Repression in the Making of Cold War Brazil
0,Animalia (book),,,,,,,,,,...,,,,,,,,,,
1,Animal Farm,,,,,,,,,,...,,,,,,,,,,
2,A Modest Proposal,,,,,,,,,,...,,,,,,,,,,
3,A Clockwork Orange (novel),,,,,,,,,,...,,,,,,,,,,
4,An Enquiry Concerning Human Understanding,,,,,,,,,,...,,,,,,,,,,


In [33]:
book_dict = {book[0]: book[1] for book in books}
book_dict['Animal Farm'].values()

dict_values(['Animal Farm', 'Animal Farm: A Fairy Story', 'Animal Farm - 1st edition.jpg', 'First edition cover', 'George Orwell', 'United Kingdom', 'English', 'Political satire', '17 August 1945 (Secker and Warburg, London, England)', 'Print (hardback  &  paperback)', '112 (UK paperback edition)  < !-- First edition page count preferred -- >', '< !-- First released before ISBN system implemented -- >', '823/.912 20', 'PR6029.R8 A63 2003b', '53163540'])

In [36]:
for book in book_dict:
    break

In [37]:
book

'Animalia (book)'

In [40]:
for book in book_dict:
    # Fill this book's row with its infobox values, one column per attribute
    book_attr = list(book_dict[book].keys())
    data.loc[data['title'] == book, book_attr] = list(book_dict[book].values())

In [41]:
data.head()

Unnamed: 0,title,1,name,image,alt,author,illustrator,country,language,genre,...,Spanish_pub_date,lcccn,cover_design,Contributing illustrators,Product Dimensions,Age Range,Book length,official website,The Natural Daughter with Portraits of the Leadenhead Family,Securing Sex: Morality and Repression in the Making of Cold War Brazil
0,Animalia (book),< !-- See Wikipedia:WikiProject_Novels or Wiki...,Animalia,Animalia (book cover).jpg,Book cover: a larger picture framed by smaller...,Graeme Base,Graeme Base,Australia,English,Picture books,...,,,,,,,,,,
1,Animal Farm,,Animal Farm,Animal Farm - 1st edition.jpg,,George Orwell,,United Kingdom,English,Political satire,...,,,,,,,,,,
2,A Modest Proposal,,A Modest Proposal,File:A Modest Proposal 1729 Cover.jpg,,Jonathan Swift,,,,Satirical essay,...,,,,,,,,,,
3,A Clockwork Orange (novel),,A Clockwork Orange,Clockwork orange.jpg,,Anthony Burgess,,United Kingdom,English < br > Nadsat,"Science fiction, Dystopian fiction, Satire, Bl...",...,,,,,,,,,,
4,An Enquiry Concerning Human Understanding,,An Enquiry Concerning Human Understanding,,,David Hume,,,English,,...,,,,,,,,,,
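Filling the frame one book at a time rescans the `title` column for every book. A possible alternative, sketched below on made-up records, is to hand pandas a list of dictionaries so that it aligns the columns and fills missing attributes with `NaN` in a single step. The sample records and the `sample_` names are assumptions for illustration; this is not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical records in the same layout as `books`
sample_books = [
    ['Animal Farm', {'name': 'Animal Farm', 'author': 'George Orwell'}, []],
    ['A Modest Proposal',
     {'name': 'A Modest Proposal', 'genre': 'Satirical essay'}, []],
]

# One dict per book; 'title' is written last so that an infobox key
# with the same name cannot overwrite the page title
rows = [{**book[1], 'title': book[0]} for book in sample_books]
sample_data = pd.DataFrame(rows)

# Move 'title' to the front to match the layout used in this notebook
ordered = ['title'] + [col for col in sample_data.columns if col != 'title']
sample_data = sample_data[ordered]

print(sample_data)
```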


In [42]:
data.to_csv('books_info.csv', index = False)

In [39]:
for attr in attributes:
    # items() yields (title, infobox) pairs; zip() over values() alone cannot be unpacked into two names
    for title, book_attr in book_dict.items():
        if attr in book_attr:
            data.loc[data['title'] == title, attr] = book_attr[attr]

In [4]:
!sudo mount /dev/xvdf /data

In [5]:
import bz2
lines = []
for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    lines.append(line)
    if i > 5e5:
        break

In [6]:
for l in lines[:50]:
    if 'title' in l.decode('utf-8'):
        print(l)

b'    <title>AccessibleComputing</title>\n'
b'    <redirect title="Computer accessibility" />\n'


In [7]:
import subprocess
lines = []

for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    lines.append(line)
    if i > 1e6:
        break

In [8]:
lines[:-100]

[b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">\n',
 b'  <siteinfo>\n',
 b'    <sitename>Wikipedia</sitename>\n',
 b'    <dbname>enwiki</dbname>\n',
 b'    <base>https://en.wikipedia.org/wiki/Main_Page</base>\n',
 b'    <generator>MediaWiki 1.32.0-wmf.19</generator>\n',
 b'    <case>first-letter</case>\n',
 b'    <namespaces>\n',
 b'      <namespace key="-2" case="first-letter">Media</namespace>\n',
 b'      <namespace key="-1" case="first-letter">Special</namespace>\n',
 b'      <namespace key="0" case="first-letter" />\n',
 b'      <namespace key="1" case="first-letter">Talk</namespace>\n',
 b'      <namespace key="2" case="first-letter">User</namespace>\n',
 b'      <namespace key="3" case="first-letter">User talk</namespace>\n',
 b'      <namespace key="4" case="first-letter">W

In [88]:
import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    """Content handler for the XML wiki dump. Adapted from the book's
    handler; pages are collected in self._pages (self._movies in the book)."""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None
        self._pages = []

    def characters(self, content):
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            self._pages.append((self._values['title'], self._values['text']))
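To see how such a handler behaves without touching the full dump, it can be run on a tiny in-memory document. The sketch below is a simplified variant, not the class above: the name `PageHandler`, the sample XML, and two small changes (resetting the current tag once an element closes, and joining chunks without a separator) are assumptions made for illustration.

```python
import xml.sax

class PageHandler(xml.sax.ContentHandler):
    """Collect (title, text) pairs from <page> elements."""

    def __init__(self):
        super().__init__()
        self._buffer = []
        self._values = {}
        self._current_tag = None
        self.pages = []

    def startElement(self, name, attrs):
        # Start buffering only for the two tags of interest
        if name in ('title', 'text'):
            self._current_tag = name
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one element's text in several chunks
        if self._current_tag:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current_tag:
            self._values[name] = ''.join(self._buffer)
            self._current_tag = None  # stop buffering unrelated tags
        if name == 'page':
            self.pages.append((self._values['title'], self._values['text']))

# Hypothetical two-page document with the same nesting as the dump
sample_xml = b"""<mediawiki>
  <page><title>Animal Farm</title><id>1</id>
    <revision><text>A [[novella]] by [[George Orwell]].</text></revision></page>
  <page><title>A Modest Proposal</title><id>2</id>
    <revision><text>An essay by [[Jonathan Swift]].</text></revision></page>
</mediawiki>"""

demo_handler = PageHandler()
xml.sax.parseString(sample_xml, demo_handler)
print(demo_handler.pages)
```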

In [89]:
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

In [90]:
handler._buffer
handler._values

{}

In [91]:
handler._pages

[]

In [92]:
for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    parser.feed(line)
    if i > 1e5:
        break

In [93]:
len(handler._pages)

420
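The same incremental feeding also works with Python's `bz2` module instead of a `bzcat` subprocess. The sketch below compresses a tiny made-up document in memory and then streams it back line by line into a SAX parser; the `TitleCollector` class and the sample bytes are assumptions for illustration, and no claim is made about which approach is faster on the real dump.

```python
import bz2
import io
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Minimal handler that records the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == 'title':
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == 'title':
            self.titles.append(''.join(self._chunks))
            self._in_title = False

# Hypothetical stand-in for the compressed dump
raw = (b'<mediawiki>\n'
       b'<page><title>Anarchism</title></page>\n'
       b'<page><title>Cola</title></page>\n'
       b'</mediawiki>\n')
compressed = io.BytesIO(bz2.compress(raw))

title_handler = TitleCollector()
stream_parser = xml.sax.make_parser()
stream_parser.setContentHandler(title_handler)

# Decompress lazily and feed the parser one line at a time
with bz2.open(compressed, 'rb') as stream:
    for line in stream:
        stream_parser.feed(line)
stream_parser.close()

print(title_handler.titles)
```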

In [94]:
handler._pages[1]

('Anarchism',
 '{{Use dmy dates|date=July 2018}} \n {{redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}} \n {{pp-move-indef}} \n {{use British English|date=January 2014}} \n {{Anarchism sidebar}} \n {{Basic forms of government}} \n \'\'\'Anarchism\'\'\' is a [[political philosophy]] < ref > {{cite book |last=McLaughlin |date=2007-11-28 |first=Paul |title=Anarchism and Authority: A Philosophical Introduction to Classical Anarchism |url=https://we.riseup.net/assets/394498/paul-mclaughlin-anarchism-and-authority-a-philosophical-introduction-to-classical-anarchism-1.pdf |archiveurl=https://web.archive.org/web/20180804180522/https://we.riseup.net/assets/394498/paul-mclaughlin-anarchism-and-authority-a-philosophical-introduction-to-classical-anarchism-1.pdf |archivedate=4 August 2018 |publisher=[[Ashgate Publishing|Ashgate]] |place=[[Aldershot]] |isbn=978-0-7546-6196-2 |page=59 }} < /ref > < ref > {{cite book |last=Flint |date=2

In [25]:
handler._values

{'title': 'Capability Maturity Model',
 'text': '{{About|the beverage}} \n {{Infobox beverage \n | name         = Cola \n | image        = [[File:Tumbler of cola with ice.jpg|220px]] \n | caption      = A glass of cola served with [[ice cube]]s \n | type         = [[soft drink]] \n | abv          =  \n | proof        =  \n | manufacturer =Various  \n | distributor  =  \n | origin       =United States \n | introduced   = 1886 \n | discontinued =  \n | color        = [[Caramel color|Caramel]] \n | flavor       = Cola (kola nut, citrus, cinnamon and vanilla) \n | variants     =  \n | related      =  \n | website      = \n }} \n \n \'\'\'Cola\'\'\' is a sweetened, [[Carbonation|carbonated]] [[soft drink]] flavored with [[vanilla]], [[cinnamon]], [[citrus]] [[essential oil|oils]] and other flavorings. Most contain [[caffeine]], which was originally sourced from the [[kola nut]], leading to the drink\'s name, though other sources are now also used. Cola became popular worldwide after [[pharm

In [95]:
!pip install mwparserfromhell

Collecting mwparserfromhell
  Cache entry deserialization failed, entry ignored
Installing collected packages: mwparserfromhell
Successfully installed mwparserfromhell-0.5.1
You are using pip version 9.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [96]:
import mwparserfromhell

In [97]:
wiki = mwparserfromhell.parse(handler._pages[1][1])

In [98]:
wiki

'{{Use dmy dates|date=July 2018}} \n {{redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}} \n {{pp-move-indef}} \n {{use British English|date=January 2014}} \n {{Anarchism sidebar}} \n {{Basic forms of government}} \n \'\'\'Anarchism\'\'\' is a [[political philosophy]] < ref > {{cite book |last=McLaughlin |date=2007-11-28 |first=Paul |title=Anarchism and Authority: A Philosophical Introduction to Classical Anarchism |url=https://we.riseup.net/assets/394498/paul-mclaughlin-anarchism-and-authority-a-philosophical-introduction-to-classical-anarchism-1.pdf |archiveurl=https://web.archive.org/web/20180804180522/https://we.riseup.net/assets/394498/paul-mclaughlin-anarchism-and-authority-a-philosophical-introduction-to-classical-anarchism-1.pdf |archivedate=4 August 2018 |publisher=[[Ashgate Publishing|Ashgate]] |place=[[Aldershot]] |isbn=978-0-7546-6196-2 |page=59 }} < /ref > < ref > {{cite book |last=Flint |date=23 April 2009 |f

In [100]:
wiki.filter_external_links()

['https://we.riseup.net/assets/394498/paul-mclaughlin-anarchism-and-authority-a-philosophical-introduction-to-classical-anarchism-1.pdf',
 'https://web.archive.org/web/20180804180522/https://we.riseup.net/assets/394498/paul-mclaughlin-anarchism-and-authority-a-philosophical-introduction-to-classical-anarchism-1.pdf',
 'http://www.univpgri-palembang.ac.id/perpus-fkip/Perpustakaan/Geography/Kamus%20Geografi/Kamus%20Geografi%20Manusia.pdf',
 'https://web.archive.org/web/20150523054520/http://www.univpgri-palembang.ac.id/perpus-fkip/Perpustakaan/Geography/Kamus%20Geografi/Kamus%20Geografi%20Manusia.pdf',
 '[http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin___Anarchism__from_the_Encyclopaedia_Britannica.html Peter Kropotkin.  " Anarchism "  from the Encyclopædia Britannica]',
 'http://rebels-library.org/files/anarchismandeducation.pdf',
 'https://web.archive.org/web/20150402095224/http://rebels-library.org/files/anarchismandeducation.pdf',
 'http://www.iaf-ifa.org/principles/english.ht

In [104]:
[x.title.strip_code() for x in wiki.filter_wikilinks()]

['political philosophy',
 'Ashgate Publishing',
 'Aldershot',
 'Self-governance',
 'Stateless society',
 'George Woodcock',
 'Encyclopedia of Philosophy',
 'Routledge Encyclopedia of Philosophy',
 'Hierarchy',
 'Free association (communism and anarchism)',
 'International of Anarchist Federations',
 'Peter Kropotkin',
 'An Anarchist FAQ',
 'State (polity)',
 'The Globe and Mail',
 'Routledge Encyclopedia of Philosophy',
 'Anti-statism',
 'Far-left politics',
 'The New York Times',
 'anarchist economics',
 'Anarchist law',
 'Libertarian socialism',
 'Anarcho-communism',
 'Collectivist anarchism',
 'Anarcho-syndicalism',
 'Mutualism (economic theory)',
 'participatory economics',
 'Demanding the Impossible: A History of Anarchism',
 'Anarchist schools of thought',
 'individualism',
 'collectivism',
 'Social anarchism',
 'individualist anarchism',
 'Geoffrey Ostergaard',
 'wikt:anarchism',
 'anarchy',
 '-ism',
 'Online etymology dictionary',
 'Peter Kropotkin',
 'Merriam-Webster',
 'Onlin

In [107]:
# A wikilink without display text has text set to None, so test the attribute rather than the link
[x.text.strip_code() for x in wiki.filter_wikilinks() if x.text is not None]
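A wikilink has a target and an optional display text, written as `[[target|text]]`, so links such as `[[political philosophy]]` carry no text at all. The sketch below illustrates that distinction with a plain regular expression from the standard library. It is a deliberate simplification swapped in for illustration (no nested links, templates, or files) and not a replacement for `mwparserfromhell`; the pattern, the helper `wikilinks`, and the sample sentence are assumptions.

```python
import re

# [[target]] or [[target|text]]; the display text group is optional
WIKILINK = re.compile(r'\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]')

def wikilinks(wikitext):
    """Return (target, text) pairs; text is None when the link has none."""
    return [(match.group(1).strip(), match.group(2))
            for match in WIKILINK.finditer(wikitext)]

# Hypothetical snippet of wikitext
sample_text = ("'''Anarchism''' is a [[political philosophy]] "
               "published by [[Ashgate Publishing|Ashgate]].")

links = wikilinks(sample_text)

# Fall back to the target when no display text is present
labels = [text if text is not None else target for target, text in links]

print(links)
print(labels)
```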