# Introduction: Downloading and Parsing Wikipedia Articles

In this notebook, we will download all of the latest Wikipedia articles. After getting the data, we'll work on making sense of it using data science!

In [None]:
import requests
from bs4 import BeautifulSoup
import os

## Searching for Wikipedia Dump

To start, we make a request to the [Wikimedia dump](https://dumps.wikimedia.org/) of Wikipedia. We'll search through `enwiki`, which has the [English language dumps](https://dumps.wikimedia.org/enwiki/) of Wikipedia. This first request finds the available recent dumps and lists them.

In [None]:
base_url = 'https://dumps.wikimedia.org/enwiki/'
index = requests.get(base_url).text
soup_index = BeautifulSoup(index, 'html.parser')

# Find the links that are dates of dumps
dumps = [a['href'] for a in soup_index.find_all('a') if 
         a.has_attr('href')]
dumps

The next line of code retrieves the HTML of the dump from the first of September 2018. If there is a more recent version available, feel free to use that instead!

In [None]:
dump_url = base_url + '20180901/'

# Retrieve the html
dump_html = requests.get(dump_url).text
dump_html[:10]

Now we can parse through the html text to find the available files for download. We will focus only on the most recent revision of the articles themselves. It is possible to get the past history of articles, the edits, the discussion, and metadata, but the articles themselves provide us with more than enough data! 

For more information on the available downloads, take a look at the [Wikimedia dump](https://dumps.wikimedia.org/) or on [Wikipedia itself](https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia).

### Deciding What to Download

The most recent revision of every single article (what we are after) is available in a single compressed file as `pages-articles.xml.bz2`. However, we'll download the articles in smaller chunks so that we can process them in parallel rather than sequentially. The single file, which is bz2-compressed XML, is over 15 GB, while the smaller chunks, of which there are more than 50, are several hundred megabytes each. Since we want to find every single book on Wikipedia, we need every article, so we'll go with the partitioned version of this dump.

If you only want to find a specific article, then `pages-articles-multistream.xml.bz2` may be a better choice. The multistream version comes with an `index` file, which means you can locate a specific article within the compressed file without having to search through the entire file. For more information, refer to the "Should I get Multistream?" section on [this Wikipedia page](https://en.wikipedia.org/wiki/Wikipedia:Database_download#).
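
As a rough illustration of why the index helps, the sketch below builds a tiny stand-in for a multistream file in memory and uses an index entry to decompress only the stream it needs. It assumes the index lines follow an `offset:page_id:title` layout. The helper name `read_stream_for` and the toy data are hypothetical, and a real dump groups many pages into each stream.

```python
import bz2
import io

# Two independently compressed bz2 streams, concatenated like a multistream file
stream_a = bz2.compress(b'<page><title>Alpha</title></page>\n')
stream_b = bz2.compress(b'<page><title>Beta</title></page>\n')
multistream = io.BytesIO(stream_a + stream_b)

# Toy index, assuming an offset:page_id:title layout
index_lines = ['0:1:Alpha', f'{len(stream_a)}:2:Beta']


def read_stream_for(title, index_lines, fileobj):
    """Decompress only the stream that the index says holds `title`."""
    for line in index_lines:
        # Titles may contain colons, so split at most twice
        offset, page_id, page_title = line.split(':', 2)
        if page_title == title:
            fileobj.seek(int(offset))
            # A BZ2Decompressor stops at the end of the first stream it sees
            return bz2.BZ2Decompressor().decompress(fileobj.read())
    return None


beta_xml = read_stream_for('Beta', index_lines, multistream)
print(beta_xml)
```

The point of the design is that the first stream is never decompressed when we ask for `Beta`: the offset lets us jump straight past it.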

To find all the individual files, we'll search through the HTML from the dump identified earlier. We'll look for any files that have `pages-articles` in the text. To parse HTML (or other markup languages), we can use Beautiful Soup. This makes searching for particular tags or classes (such as `file` below) straightforward. If you are doing any web scraping, this will be a very useful library to learn!

In [None]:
# Convert to a soup
soup_dump = BeautifulSoup(dump_html, 'html.parser')

# Find li elements with the class file
soup_dump.find_all('li', {'class': 'file'}, limit = 10)[:4]

Now we want to search for only the files containing `pages-articles` to get only the recent versions of the articles.

In [None]:
files = []

# Search through all files
for file in soup_dump.find_all('li', {'class': 'file'}):
    text = file.text
    # Select the relevant files
    if 'pages-articles' in text:
        files.append((text.split()[0], text.split()[1:]))
        
files[:5]

We only want the partitioned files so we can further refine our selection to those with `xml-p`. 

In [None]:
files_to_download = [file[0] for file in files if '.xml-p' in file[0]]
files_to_download[-5:]

## Download Wikipedia Data

Now we need to actually download the data. This can be done using the keras `get_file` utility which downloads the specified file at the specified url. If we already have the entire dataset downloaded, then we don't want to download it again! For that reason we first use a check to see if the data exists.

The default download directory for keras is `~/.keras/datasets`.

In [None]:
import sys
from keras.utils import get_file

keras_home = '/home/ubuntu/.keras/datasets/'

In [None]:
data_paths = []
file_info = []

# Iterate through each file
for file in files_to_download:
    path = keras_home + file

    # Download the file only if it is not already on disk
    if not os.path.exists(path):
        print('Downloading')
        # The origin must be the full URL of the file, not the dump index page
        path = get_file(file, dump_url + file)

    data_paths.append(path)
    # Find the file size in MB
    file_size = os.stat(path).st_size / 1e6

    # Difference between the two `p` numbers in the file name
    file_articles = int(file.split('p')[-1].split('.')[-2]) - int(file.split('p')[-2])
    file_info.append((file.split('-')[-1], file_size, file_articles))

What is the largest file? We can use `sorted` to sort by the file size.

In [None]:
sorted(file_info, key = lambda x: x[1], reverse = True)[:5]

The `p` numbers at the end of each file name relate to the partitioning of the data. At first I thought they gave the number of articles in the partition, but the English language Wikipedia only has 5.8 million articles and these numbers go up to 58 million. The numbers are the first and last page ids contained in the partition. Page ids are assigned to every kind of page (redirects included), not just articles, which is why they run well past the article count. If we subtract the first number from the second, the most common result is 1.5 million: the width of the page id range, not a count of articles.
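
To make the naming convention concrete, here is a minimal sketch that pulls the two numbers out of a partition file name with a regular expression. It assumes the two numbers are the first and last page id in the partition; the helper name `page_id_range` is hypothetical.

```python
import re


def page_id_range(file_name):
    """Return (first_id, last_id) parsed from a partition file name."""
    match = re.search(r'p(\d+)p(\d+)', file_name)
    if match is None:
        raise ValueError(f'No page id range found in {file_name!r}')
    return int(match.group(1)), int(match.group(2))


first_id, last_id = page_id_range(
    'enwiki-20180901-pages-articles15.xml-p7744803p9244803.bz2')
print(first_id, last_id, last_id - first_id)
```

A regular expression is a little easier to read than chained `split` calls, and it fails loudly when a file name does not follow the expected pattern.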

In [None]:
sorted(file_info, key = lambda x: x[2], reverse = True)[:5]

It looks like the largest files are about 400 MB compressed. 

In [None]:
print(f'There are {len(file_info)} partitions.')

How about putting this info into a dataframe?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
file_df = pd.DataFrame(file_info, columns = ['file', 'size (MB)', 'articles']).set_index('file')
file_df['size (MB)'].plot.bar(color = 'red', figsize = (12, 6));

In total, there are 5.7 million articles contained in these files. In a way, it's surprising that the extent of all human knowledge (okay, I'm exaggerating somewhat) is now contained on our computer!

#### Size of Wikipedia 

For some interesting reading, check out the following Wikipedia Articles:

* [Size of Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia)
* [Modelling Growth of Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia%27s_growth#Data_set_for_number_of_articles)
* [Size Comparisons of Wikipedia to other Encyclopedias](https://en.wikipedia.org/wiki/Wikipedia:Size_comparisons)

In [None]:
print(f"The total size of files on disk is {file_df['size (MB)'].sum() / 1e3} GB")

Again, we could download this all at once with the `pages-articles.xml.bz2` file, but then we would have to parse it sequentially using only one process. Instead, because we have partitioned data, we can iterate over the files in parallel (as we'll see later).

## Going Through the Data

Together, the files take up 15.4 GB. Decompressed as XML, they come close to 50 GB. We could decompress each file into XML and then parse through it, but we can also parse through a compressed file one line at a time, decompressing as we go. If we are concerned about disk space, this is the better option. Working through the compressed file one line at a time might also be the only option in cases where the individual files are too large to fit in memory.

We'll start by working through one of the files and then develop functions that we can run on all of the files. Because we downloaded the data in chunks, we'll be able to parallelize the parsing operations.

To get started, make sure you have `bzcat` installed on your system. The [`bzcat` utility](http://www.qnx.com/developers/docs/6.5.0SP1.update/com.qnx.doc.neutrino_utilities/b/bzcat.html) is a command line program that decompresses a bz2 compressed file and sends the contents to standard output. To go through the files one line at a time, we simply iterate over the output of the command that decompresses the file. We call the `bzcat` command using `subprocess`, which is often used to execute system commands in Python.

Another option for decompressing the file is Python's built-in `bz2` module. However, in tests (see below), I found this to be much slower than `bzcat`.
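
For readers without `bzcat`, the pure-Python route can be sketched as follows. The example writes a small compressed file to a temporary directory as a stand-in for a dump partition (the file name and contents are made up) and then reads only the first few lines, so nothing is decompressed to disk.

```python
import bz2
import os
import tempfile
from itertools import islice

# Write a small compressed file to stand in for a dump partition
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.xml.bz2')
with bz2.open(sample_path, 'wb') as fout:
    for i in range(1000):
        fout.write(f'<line>{i}</line>\n'.encode('utf-8'))

# Iterate lazily over the decompressed lines and stop after the first five
with bz2.open(sample_path, 'rb') as fin:
    first_lines = list(islice(fin, 5))

print(first_lines)
```

Using `with` closes the file handle even if we stop reading early.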

In [None]:
import bz2
import subprocess

data_path = data_paths[15]
data_path

In [None]:
%%timeit -n 3 -r 3

lines = []
for i, line in enumerate(bz2.BZ2File(data_path, 'r')):
    lines.append(line)
    if i > 1e6:
        break

In [None]:
%%timeit -n 3 -r 3

lines = []
for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    lines.append(line)
    if i > 1e6:
        break

The `subprocess` + `bzcat` approach is nearly twice as fast. Let's run this again and see what kind of data we have.

In [None]:
lines = []

for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    lines.append(line)
    if i > 5e5:
        break
        
lines[-165:-109]

Here we have a complete article. We could go through each article and extract the information using regular expressions, but that would be extremely inefficient. Instead, we can use an XML parser to extract precisely the information we want.

## Parsing XML

In order to get useful information from this data, we have to parse it on two levels.

1. Extract the titles and article text from the XML
2. Extract relevant information from the article text

To solve the first problem, we'll use a tool purpose-built for the task of parsing XML: SAX, the Simple API for XML. [The documentation](http://pyxml.sourceforge.net/topics/howto/section-SAX.html) is a little difficult to follow, but the basic idea is that we can use SAX to search through the XML and select elements based on the tag. (If you need an introduction to XML, I'd highly recommend starting [here](https://www.w3schools.com/xml/default.asp)).

For example, if we have the following XML element, we want to extract the text that occurs between the `<title>` tags:

`<title>Carroll Knicely</title>`

Likewise, if we have the content of an article like below, we want to extract the text that occurs between the `<text>` tags. 

```XML
<text xml:space="preserve">'''Carroll F. Knicely''' (born c. 1929 in [[Staunton, Virginia]] - died November 2, 2006 in [[Glasgow, Kentucky]]) was [[Editing|editor]] and [[Publishing|publisher]] of the ''[[Glasgow Daily Times]]'' for nearly 20 years (and later, its owner) and served under three [[Governor of Kentucky|Kentucky Governors]] as commissioner and later Commerce Secretary.
</text>
```

We'll use the SAX parser to do exactly that: find the titles and text content of the articles. Then, we can pass the text to another parser to extract information from the article. 

Explaining how SAX works is a little more difficult than just showing, so I'll present the code and show some examples. A SAX parser requires a content handler, which is the only code we need to write. 

In [None]:
import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    """Parse through XML data using SAX"""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None
        self._pages = []

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text', 'timestamp'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            self._pages.append((self._values['title'], self._values['text']))

We need to create a `handler` object of the `WikiXmlHandler` class. Then we pass the handler in as the content handler to a SAX `parser`. Basically, we are overriding a few of the default SAX `ContentHandler` methods in order to do what we want: find the titles and texts in the XML.
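
To see the callbacks in isolation, here is a minimal sketch with a hypothetical `TitleHandler` that does nothing but record which methods SAX calls and in what order. It parses a one-line XML string rather than the dump.

```python
import xml.sax


class TitleHandler(xml.sax.handler.ContentHandler):
    """Hypothetical handler that records which callbacks fire, in order."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('start', name))

    def characters(self, content):
        # Skip whitespace-only chunks between elements
        if content.strip():
            self.events.append(('chars', content))

    def endElement(self, name):
        self.events.append(('end', name))


title_handler = TitleHandler()
xml.sax.parseString(b'<page><title>Carroll Knicely</title></page>', title_handler)
print(title_handler.events)
```

The parser drives the handler: each opening tag triggers `startElement`, the text in between arrives through `characters` (possibly in several chunks), and each closing tag triggers `endElement`.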

In [None]:
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

handler._pages

Let's work through a single article. We `feed` in one line of XML at a time to the `parser`. It searches the XML for the tags using the methods in the handler. The correct data is then stored in the handler. 

In [None]:
for l in lines[-165:-109]:
    parser.feed(l)

In [None]:
handler._pages

We have successfully extracted one page! Once we have the page text, we need to process it as well to find the information we want. We'll write that function next. First, let's see this process again, this time finding three different articles.

In [None]:
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    parser.feed(line)
    
    # Stop when 3 articles have been found
    if len(handler._pages) > 2:
        break
        
print([x[0] for x in handler._pages])

The next step is to process the actual text of the article. For this, we will turn to the `mwparserfromhell` library. [This library](https://github.com/earwig/mwparserfromhell) is custom made for parsing `MediaWiki` wikicode, the markup in which Wikipedia articles are written. [MediaWiki](https://www.mediawiki.org/wiki/MediaWiki) is used by Wikipedia and numerous other projects and provides a relatively standardized template for creating wiki pages. Thanks to this standardization, we can use a custom-built parser to go through the articles.

## Parsing Articles

The best way to figure out how to parse an article is simply to do it! Let's work through one article.

First we'll find a number of articles using the `WikiXmlHandler` and a SAX parser.

In [None]:
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    parser.feed(line)
    
    # Stop when 50 articles have been found
    if len(handler._pages) > 50:
        break

All we need to do is pass the Wikipedia article text to the `mwparserfromhell`. 

In [None]:
import mwparserfromhell 

print(handler._pages[6][0])

# Create the wiki article
wiki = mwparserfromhell.parse(handler._pages[6][1])

We'll work through the article for KENZ (FM). It might be helpful to pull up the [Wikipedia page alongside](https://en.wikipedia.org/wiki/KENZ_(FM)) so you can see what is happening.

In [None]:
print(type(wiki))
wiki[:100]

This looks like a normal string, but in fact, it's a `mwparserfromhell.wikicode.Wikicode` object with many different methods for sorting through the content. For example, we can find all the internal links (those that go to other Wikipedia pages) using `wiki.filter_wikilinks()`. This will give us the `title` of the Wikipedia article linked to as well as the `text` of the link. We'll extract just the titles of the articles.

In [None]:
wikilinks = [x.title for x in wiki.filter_wikilinks()]
print(f'There are {len(wikilinks)} wikilinks.')
wikilinks[:5]

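We didn't get the discussion or the edits around the articles, so we won't be able to find this information. However, if you do decide to grab revisions or discussion pages, their wikicode can be parsed with `mwparserfromhell` in the same way. Note that the two methods below look for other kinds of wikicode nodes: `filter_arguments` returns template arguments and `filter_comments` returns HTML comments embedded in the article source, not discussion comments.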

In [None]:
wiki.filter_arguments()

In [None]:
wiki.filter_comments()

To figure out everything you can do with `mwparserfromhell`, [read the docs](https://mwparserfromhell.readthedocs.io/en/latest/).

We can find the external links (those that go outside of Wikipedia).

In [None]:
external_links = [(x.title, x.url) for x in wiki.filter_external_links()]
print(f'There are {len(external_links)} external links.')
external_links[:5]

We can also search through the text for specific words.

In [None]:
contemporary = wiki.filter(matches = 'contemporary')
contemporary[1], type(contemporary[1])

If you want to get a cleaner version of just the text, you can call `wiki.strip_code()` and then add another `strip`. 

In [None]:
wiki.strip_code().strip()[:100]

There's quite a bit more you can do with this class, so explore the options if you have a project in mind.

## Article Templates

The easiest way to filter articles to a category (at least that I've found) is through the use of templates. These are standardized formats for information. For example, one template on the `KENZ (FM)` radio station page looks like this:

![Radio Station Infobox](https://github.com/WillKoehrsen/wikipedia-data-science/blob/master/images/radio_template.PNG?raw=true)

### Infobox Templates

This particular template is called an `Infobox`. There are many different infoboxes, each one for a different category such as films or books. __The easiest way to filter articles to one category is by using the `Infobox` template for that category__.
You can read about infoboxes [here](https://en.wikipedia.org/wiki/Help:Infobox) or look at the list [here](https://en.wikipedia.org/wiki/Wikipedia:List_of_infoboxes).

Let's take a look at the templates for this article.

In [None]:
templates = wiki.filter_templates()
print(f'There are {len(templates)} templates.')
for template in templates:
    print(template.name)

There are a number of different templates. In this case, the Infobox is named `Infobox radio station`. If we wanted to find all of the radio station articles, then the easiest way would be to search every article for this template. We can search for a specific template within an article as below.

In [None]:
infobox = wiki.filter_templates(matches = 'Infobox radio station')[0]
infobox

The attributes of the infobox can be accessed and put into a dictionary using the `name` and the `value`. To clean things up, we first strip the code and then strip whitespace and escape characters.

In [None]:
information = {param.name.strip_code().strip(): param.value.strip_code().strip() for param in infobox.params}
information

The `Infobox` templates provide us with a consistent method for finding a category of article. We simply need to search the article text for the `Infobox` of the category of article we want. 

Moreover, the information in the template could be useful for, say, building a set of features in a predictive model. In this project, we aren't going to be working with the text of the articles (that's an entirely separate undertaking), but we can use information such as that found in the `Infobox` or the `wikilinks` to build a recommendation engine.

## Searching for Books

Now that we have an understanding of how to parse an article and find articles belonging to a certain category, we can start searching for what we want: all the books on Wikipedia! The books can be identified because they use an Infobox book template.

![Infobox book](https://github.com/WillKoehrsen/wikipedia-data-science/blob/master/images/book_template.PNG?raw=true)

We simply have to filter the article text for the `Infobox book` template, and if it's present, store the information. If not, then we move to the next article. The function below is designed to find and return book articles. 

In [None]:
import re

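# `matches` is treated as a regular expression and is case-insensitive by default,
# so a single pattern without literal spaces around `|` is enough
def process_article(title, text, timestamp, template = 'Infobox book'):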
    """Process a wikipedia article looking for template"""
    
    # Create a parsing object
    wikicode = mwparserfromhell.parse(text)
    
    # Search through templates for the book template
    matches = wikicode.filter_templates(matches = template)
    
    if matches:
        # Extract information from infobox
        properties = {param.name.strip_code().strip(): param.value.strip_code().strip() 
                      for param in matches[0].params
                      if param.value.strip_code().strip()}
        
        # Extract internal wikilinks
        wikilinks = [x.title.strip_code().strip() for x in wikicode.filter_wikilinks()]
        # Extract external links
        exlinks = [x.url.strip_code().strip() for x in wikicode.filter_external_links()]
        return (title, properties, wikilinks, exlinks, timestamp)

In [None]:
r = process_article('KENZ (FM)', wiki, None)

No such luck with this article searching for books. We can modify the search to fit our template.

In [None]:
r = process_article('KENZ (FM)', wiki, None, template = 'Infobox radio station')

In [None]:
r[0], r[1]

To parse the text from each article, we slightly modify the `WikiXmlHandler` class. This time, in the `endElement` method, if the article ends (the tag is `page`), we send the title, the contents (`text`), and the timestamp to the `process_article` function. This function returns nothing if it doesn't find a book; if it does, it returns the book properties, wikilinks, and external links. Each result is appended to a list on the `handler`.
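
The handler passes its stored values to the function by unpacking a dictionary, which only works when the dictionary keys match the parameter names. A small sketch, using a hypothetical `describe_article` function with the same parameter names and made-up values:

```python
def describe_article(title, text, timestamp):
    """Hypothetical stand-in for a function with the same parameter names."""
    return f'{title} ({timestamp}): {len(text)} characters'


values = {'title': 'KENZ (FM)', 'text': 'Some wikicode', 'timestamp': '2018-09-01T00:00:00Z'}

# ** unpacks the dictionary so each key is passed as a keyword argument
summary = describe_article(**values)
print(summary)

# A dictionary missing one of the expected keys raises a TypeError
incomplete = {'title': 'No timestamp', 'text': ''}
try:
    describe_article(**incomplete)
    missing_key_error = None
except TypeError as error:
    missing_key_error = error

print(missing_key_error)
```

This is why the handler only stores the `title`, `text`, and `timestamp` tags: any extra or missing key would no longer line up with the function signature.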

In [None]:
class WikiXmlHandler(xml.sax.handler.ContentHandler):
    """Parse through XML data using SAX"""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None
        self._books = []
        self._article_count = 0

    def characters(self, content):
        """Characters between opening and closing tags"""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of element"""
        if name in ('title', 'text', 'timestamp'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of element"""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            self._article_count += 1
            # Search through the page to see if the page is a book
            book = process_article(**self._values)
            # Append to the list of books
            if book:
                self._books.append(book)

The code below stops when we've found 3 books.

In [None]:
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    parser.feed(line)
    
    # Stop when 3 books have been found
    if len(handler._books) > 2:
        break
        
print(f'Searched through {handler._article_count} articles to find 3 books.')

Let's see what we've got! 

In [None]:
handler._books[0]

For each book, we have the title, the information contained in the `Infobox`, the internal Wikilinks, the external links, and the timestamp of the last edit. Using just this information, we'll be able to build a fairly robust book recommendation system! 

Let's see how long it would take to search through just one partition. Uncompressed, the size of this partition is 1.5 GB with over 24 million lines.

In [None]:
# Uncompress the file if not already uncompressed
if not os.path.exists('/home/ubuntu/.keras/datasets/p15.xml'):
    # Write to the same path that is checked above; `>` overwrites rather than appends
    subprocess.call('bzcat /home/ubuntu/.keras/datasets/enwiki-20180901-pages-articles15.xml-p7744803p9244803.bz2 > /home/ubuntu/.keras/datasets/p15.xml',
                    shell = True)
else:
    print('Already uncompressed')

In [None]:
!wc /home/ubuntu/.keras/datasets/p15.xml

In [None]:
from timeit import default_timer as timer

start = timer()
# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Parse the entire file
for i, line in enumerate(subprocess.Popen(['bzcat'], 
                         stdin = open(data_path), 
                         stdout = subprocess.PIPE).stdout):
    if (i + 1) % 10000 == 0:
        print(f'Processed {i + 1} lines so far.', end = '\r')
    try:
        parser.feed(line)
    except StopIteration:
        break
    
end = timer()
books = handler._books

print(f'\nSearched through {handler._article_count} articles.')
print(f'\nFound {len(books)} books in {round(end - start)} seconds.')

If we multiply the time to process one partition, about 1000 seconds, by the number of partitions, 55, we get 55,000 seconds to process all the files one at a time. This amounts to about 15 hours! We'll see if we can do better using multiprocessing.
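
The estimate can be written out as a few lines of arithmetic. The parallel figure is an idealized guess that assumes 16 workers, equal run time for every partition, and no overhead, so treat it as a lower bound rather than a measurement.

```python
import math

seconds_per_partition = 1000   # approximate time measured for one partition
n_partitions = 55

# Sequential processing: every partition waits for the previous one
sequential_seconds = seconds_per_partition * n_partitions
sequential_hours = sequential_seconds / 3600

# Idealized parallel estimate, assuming 16 workers and no overhead
n_workers = 16
rounds = math.ceil(n_partitions / n_workers)
parallel_seconds = rounds * seconds_per_partition

print(f'Sequential: {sequential_seconds} s ({sequential_hours:.1f} hours)')
print(f'Parallel (ideal): {parallel_seconds} s ({parallel_seconds / 3600:.1f} hours)')
```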

### Writing to File

A convenient way to save the information is as `ndjson` (newline-delimited JSON), with one book per line. We can save and then load back in the books using `json`.
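
One detail worth knowing about the round trip: JSON has no tuple type, so a tuple written out comes back as a list. The sketch below shows this with an in-memory buffer and a made-up record shaped like the book tuples.

```python
import io
import json

# A record shaped like the tuples returned for each book (values are made up)
record = ('Example Book', {'author': 'A. Writer'}, ['Novel'], ['http://example.com'], None)

# Write one JSON document per line
buffer = io.StringIO()
buffer.write(json.dumps(record) + '\n')

# Read the lines back in
buffer.seek(0)
loaded = [json.loads(line) for line in buffer]

print(type(record), type(loaded[0]))
```

The contents are unchanged, so indexing such as `loaded[0][1]` still works; only code that relies on the record being a tuple (for example, using it as a dictionary key) would notice the difference.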

In [None]:
books[1]

In [None]:
import json

# Save list of books
with open('p15_books.ndjson', 'wt') as fout:
    for l in books:
        fout.write(json.dumps(l) + '\n')

In [None]:
books_in = []

# Read in list of books
with open('p15_books.ndjson', 'rt') as fin:
    for l in fin.readlines():
        books_in.append(json.loads(l))

In [None]:
books_in[10]

# Running Operations in Parallel

Instead of parsing through the files one at a time, we want to process several of them at once. We can do this using parallelization, either through multithreading or multiprocessing.


## Multithreading and Multiprocessing

Multithreading and multiprocessing are ways to carry out many tasks simultaneously. We have a number of files on disk, each of which needs to be parsed in the same way. A naive approach would be to parse one file at a time, but that does not take full advantage of our resources. Instead, we can use either multithreading or multiprocessing to parse many files at the same time, speeding up the overall process.

Learning multithreading / multiprocessing is essential for making your data science workflows more efficient. I'd recommend [this article](https://medium.com/@bfortuner/python-multithreading-vs-multiprocessing-73072ce5600b) to get started with the concepts.

Generally, multithreading works better (is faster) for input / output bound tasks, such as reading in files or making requests. Multiprocessing works better (is faster) for CPU bound tasks. Due to the [global interpreter lock](https://realpython.com/python-gil/), only one thread can execute Python code at a time, so CPU intensive tasks (such as doing mathematical operations or machine learning) gain little from threads even on a multicore system. Running multiple processes does not have this issue because each process has its own interpreter and memory. In some cases, you cannot use multiple processes because the tasks need to share memory. However, in our case, each file can be processed independently of the others. For that reason, we can use either multithreading or multiprocessing.

I've found the best way to determine the optimal method and parameters is to try out a number of different options. The code for testing multithreading and multiprocessing appears after this next section. When I ran the testing code first and then tried to run the actual code, I ran into out of memory errors, so I put it at the end.

__The final process of searching for all books on Wikipedia is run in parallel using 16 processes__. This was the fastest method in my benchmarking. Each process is run on a different core because the machine I used had 16 cores. The best options will depend on your set-up.

### Setting Up Parallel Code

To run an operation in parallel, we need a `service` and `tasks`. A `service` can just be a function that we need to run many times and the `tasks` are the different arguments to that function. For our purposes, the service is the `find_books` function that parses a complete compressed xml file for the books and the tasks are all of the compressed xml files. 
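
The service-and-tasks pattern can be sketched with a toy function. The example uses the thread-backed `Pool` from `multiprocessing.dummy` so that it runs anywhere without a `__main__` guard; the process-backed `multiprocessing.Pool` exposes the same `map` method. `count_words` and the task list are hypothetical stand-ins for `find_books` and the partition files, and `partial` fixes a keyword argument because `map` passes only one argument per task.

```python
from functools import partial
from multiprocessing.dummy import Pool


def count_words(text, min_length = 1):
    """Toy service: count the words in `text` with at least `min_length` characters."""
    return sum(1 for word in text.split() if len(word) >= min_length)


# Tasks: one argument per call to the service
tasks = ['the quick brown fox', 'jumps over', 'the lazy dog']

# Fix the keyword argument so that map only needs to supply `text`
service = partial(count_words, min_length = 4)

with Pool(processes = 3) as pool:
    results = pool.map(service, tasks)

print(results)
```

The same trick would let us call something like `partial(find_books, save = False)` if we wanted to change a keyword argument for every partition.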


We first need to write a function that takes in a file and returns a list of the book articles. We already have all the parts defined so we can just put them together.

In [None]:
import gc
import json

def find_books(data_path, limit = None, save = True):
    """Find all the book articles from a compressed wikipedia XML dump.
       `limit` is an optional argument to only return a set number of books.
        If save, books are saved to partition directory based on file name"""

    # Object for handling xml
    handler = WikiXmlHandler()

    # Parsing object
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    # Iterate through compressed file
    for i, line in enumerate(subprocess.Popen(['bzcat'], 
                             stdin = open(data_path), 
                             stdout = subprocess.PIPE).stdout):
        try:
            parser.feed(line)
        except StopIteration:
            break
            
        # Optional limit
        if limit is not None and len(handler._books) >= limit:
            return handler._books
    
    if save:
        partition_dir = '/data/wiki/partitions/'
        # Create file name based on partition name
        p_str = data_path.split('-')[-1].split('.')[-2]
        out_dir = partition_dir + f'{p_str}.ndjson'

        # Open the file
        with open(out_dir, 'w') as fout:
            # Write as json
            for book in handler._books:
                fout.write(json.dumps(book) + '\n')
        
        print(f'{len(os.listdir(partition_dir))} files processed.', end = '\r')

    # Memory management
    del handler
    del parser
    gc.collect()
    return None

Next we need a list of tasks. This is simply all the partitioned data files we want to send to the function.

In [None]:
partitions = [keras_home + file for file in os.listdir(keras_home) if 'xml-p' in file]
len(partitions), partitions[-1]

In [None]:
from multiprocessing import Pool 
import tqdm 

# List of lists to single list
from itertools import chain

# Sending keyword arguments in map
from functools import partial

In [None]:
os.cpu_count()

## Searching for Every Book on Wikipedia

The code below simply searches all of Wikipedia for every book article! The `tqdm` library is used to track progress (the resulting widgets may not show up correctly in a static Jupyter Notebook).

We create a `Pool` with the number of processes and then call `map` with the `(service, tasks)`. The books for each partition are saved to a json file on disk. 

In [None]:
# Create a pool of workers to execute processes
pool = Pool(processes = 16)

start = timer()

# Map (service, tasks), applies function to each partition
results = pool.map(find_books, partitions)

pool.close()
pool.join()

end = timer()

#### Alternative Code with Progress Bar

If you would like a progress bar to appear in your Jupyter Notebook as you run the operation, you can use the code below. This wraps `tqdm` around a call to `imap_unordered`, which does the same basic job as `map` except that the results are returned as they finish instead of all at once, and not necessarily in the order of the tasks. The end result contains the same items; you just get a progress bar as the computation runs. There are some situations where you'd want to use `imap` instead of `map`, and [this Stack Overflow answer](https://stackoverflow.com/a/26521507/5755357) does a great job of explaining the differences.
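
The difference between the two calls can be seen in a toy sketch. It uses the thread-backed `Pool` from `multiprocessing.dummy` for portability (the process-backed `Pool` has the same methods), and `slow_square` is a hypothetical service whose run time varies by task.

```python
import time
from multiprocessing.dummy import Pool


def slow_square(x):
    """Toy service whose run time varies by task."""
    time.sleep(0.05 * (5 - x))
    return x * x


tasks = [1, 2, 3, 4]

with Pool(processes = 4) as pool:
    # map blocks until every task is done and preserves the task order
    ordered = pool.map(slow_square, tasks)
    # imap_unordered yields each result as soon as its task finishes
    unordered = list(pool.imap_unordered(slow_square, tasks))

print(ordered)
print(sorted(unordered))
```

Because `imap_unordered` hands back results one at a time, a progress bar wrapped around it can advance after each finished partition.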

In [None]:
# start = timer()
# pool = Pool(processes = 8)
# results = []

# # Run partitions in parallel
# for x in tqdm.tqdm_notebook(pool.imap_unordered(find_books, partitions), total = len(partitions)):
#     results.append(x)
    
# pool.close()
# pool.join()

# end = timer()
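
The difference between `map` and `imap_unordered` can be seen on a toy problem. The sketch below is only an illustration: `slow_square` is a hypothetical stand-in for `find_books`, and it uses a thread pool from `multiprocessing.dummy` (which exposes the same methods as the process `Pool`) so that it runs the same way on any platform.

```python
from multiprocessing.dummy import Pool as ThreadPool  # same interface as multiprocessing.Pool
import time

def slow_square(x):
    """Hypothetical stand-in for find_books: larger inputs finish sooner."""
    time.sleep(0.05 * (4 - x))
    return x * x

tasks = [1, 2, 3]

with ThreadPool(processes=3) as pool:
    # map blocks until every task is done and keeps the order of `tasks`
    ordered = pool.map(slow_square, tasks)
    # imap_unordered yields each result as soon as its task finishes
    unordered = list(pool.imap_unordered(slow_square, tasks))

print(ordered)            # results in task order
print(sorted(unordered))  # same values; arrival order is not guaranteed
```

Because the order of `unordered` is not guaranteed, anything that depends on matching results back to tasks should return an identifier along with the result.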

### Join Data Together

After creating the separate files, each one containing the books from one partition, we can join them together into a single list, this time using `multithreading`. Again, we'll need a service, this time the function `read_data`, and tasks, all the `saved_files`, each of which contains the books for one partition.

In [None]:
def read_data(file_path):
    """Read in json data from `file_path`"""
    
    data = []
    
    # Open the file and load in json
    with open(file_path, 'r') as fin:
        for l in fin.readlines():
            data.append(json.loads(l))
            
    return data

To use `multithreading` instead of `multiprocessing`, we import `Pool` from the [`multiprocessing.dummy` module](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.dummy). Then we create a `Threadpool` of workers and `map` the `tasks` to the `service`. This will return a list of lists that we can flatten to a single list using `chain` from itertools.  

In [None]:
from multiprocessing.dummy import Pool as Threadpool
import itertools

start = timer()

# List of files to read in
saved_files = ['/data/wiki/partitions/' + x for x in os.listdir('/data/wiki/partitions/')]

# Create a threadpool for reading in files
threadpool = Threadpool(processes = 10)

# Read in the files as a list of lists
results = threadpool.map(read_data, saved_files)

# Flatten the list of lists to a single list
book_list = list(chain(*results))

end = timer()

print(f'Found {len(book_list)} books in {round(end - start)} seconds.')
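
As a small illustration of the flattening step, the sketch below uses made-up per-file results (the strings are placeholders, not real book records). It also shows `chain.from_iterable`, which gives the same result as `chain(*results)` without unpacking the outer list into arguments.

```python
from itertools import chain

# Hypothetical per-file results, shaped like the output of threadpool.map(read_data, ...)
results = [['book 1', 'book 2'], [], ['book 3']]

# Unpack the outer list into separate arguments, as in the cell above
flat_unpacked = list(chain(*results))

# Equivalent, but takes the outer list directly
flat_iterable = list(chain.from_iterable(results))

print(flat_unpacked)
print(flat_unpacked == flat_iterable)
```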

To see what kind of speed ups this offered us, we can compare the results to reading in the files sequentially. 

In [None]:
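results = []
start = timer()

# `saved_files` already holds full paths, so open each one directly
for file in saved_files:
    with open(file, 'r') as fin:
        for l in fin.readlines():
            # Parse each line so the result matches the threaded version
            results.append(json.loads(l))

end = timer()

# `results` is already a flat list of books, so no flattening is needed
book_list = results
print(f'Found {len(book_list)} books in {round(end - start)} seconds.')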

In [None]:
# Assumed intent: write the combined file only if it does not already exist
if not os.path.exists('found_books.ndjson'):
    with open('found_books.ndjson', 'wt') as fout:
        for book in book_list:
            fout.write(json.dumps(book) + '\n')
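
The newline-delimited json format stores one json document per line, which makes the file easy to read back line by line. The following is a minimal round-trip sketch under stated assumptions: the `books` records and the temporary file name are invented for the example and do not reflect the real structure of the saved books.

```python
import json
import os
import tempfile

# Hypothetical records; the real book entries may be structured differently
books = [['Hypothetical Title', {'author': 'Someone'}], ['Another Title', {}]]

# Write to a throwaway location so nothing real is overwritten
path = os.path.join(tempfile.mkdtemp(), 'example_books.ndjson')

# One json document per line
with open(path, 'wt') as fout:
    for book in books:
        fout.write(json.dumps(book) + '\n')

# Read it back by parsing each line separately
with open(path, 'rt') as fin:
    loaded = [json.loads(line) for line in fin]

print(loaded == books)
```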

## Test Multithreading vs Multiprocessing

I've found the only way to know which to use is to do a test of both and see which is faster. Likewise, when it comes to choosing the number of threads, doing an empirical test seems to be the most effective method for maximizing efficiency. 
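
One way to structure such an empirical test is sketched below. Everything in it is an assumption made for illustration: `fake_task` just sleeps briefly instead of parsing a partition, `time_pool` is a hypothetical helper, and the worker counts are arbitrary. The timings it prints depend on the machine, so they are not meaningful for the real workload.

```python
from multiprocessing.dummy import Pool as ThreadPool
from timeit import default_timer as timer
import time

def fake_task(_):
    """Hypothetical task that waits briefly instead of doing real work."""
    time.sleep(0.01)
    return 1

def time_pool(n_workers, n_tasks=20):
    """Run n_tasks on a pool of n_workers; return (tasks completed, seconds)."""
    start = timer()
    with ThreadPool(processes=n_workers) as pool:
        completed = sum(pool.map(fake_task, range(n_tasks)))
    return completed, timer() - start

# Try a few worker counts and compare the elapsed times
timings = {n: time_pool(n) for n in (1, 2, 4)}

for n, (completed, seconds) in timings.items():
    print(f'{n} workers: {completed} tasks in {seconds:.2f} seconds')
```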

### Test Multiprocess

First we'll test multiprocessing. To make the comparisons feasible, we'll limit the search to finding 10 books in each file. We'll compare multiprocessing with 8 cores to 16 cores to see if we can achieve a double speed up!

The code is wrapped in a `tqdm` call which just displays a progress bar. This might not render correctly in the static version of the notebook.

In [None]:
def test_speed(pool):
    """Time how long """
    start = timer()
        
    # Need to pass in keyword limit argument
    map_find_books = partial(find_books, limit = 10, save = False)
    
    results = []
    for x in tqdm.tqdm_notebook(pool.imap_unordered(map_find_books, partitions), total = len(partitions)):
        results.append(x)
        
    pool.close()
    pool.join()
    end = timer()

    book_list = list(chain(*results))
    print(f'Found {len(book_list)} books in {round(end - start)} seconds.')
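
The role of `partial` here is to fix the keyword arguments ahead of time, because the pool's mapping methods pass only a single positional argument to the function. A minimal sketch, where `find_items` is a hypothetical stand-in that merely echoes its arguments and the file names are invented:

```python
from functools import partial

def find_items(path, limit=None, save=True):
    """Hypothetical stand-in with the same keyword arguments as find_books."""
    return (path, limit, save)

# Fix the keyword arguments; the result takes only the positional `path`
limited = partial(find_items, limit=10, save=False)

# The built-in map, like pool.map, calls the function with one argument per task
outputs = list(map(limited, ['a.bz2', 'b.bz2']))
print(outputs)
```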

In [None]:
pool = Pool(processes = 8)
test_speed(pool)

We'll try again with 16 processes. Theoretically this should halve the time. 

In [None]:
pool = Pool(processes = 16)
test_speed(pool)

So using double the number of processors is faster, although not quite twice as fast. If this is the only task you are going to be running, it makes sense to use all of your cores.

### Test Multithreading

The syntax for using multithreading is exactly the same. I have no idea how many threads to use so I just tried 2 arbitrary numbers! 

In [None]:
from multiprocessing.dummy import Pool as ThreadPool
thread_pool = ThreadPool(processes = 10)
test_speed(thread_pool)

In [None]:
thread_pool = ThreadPool(processes = 20)
test_speed(thread_pool)

In [None]:
import gc
gc.collect()

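Threading appears to be much slower than multiprocessing. That is plausible for CPU-heavy work such as parsing XML, because CPython's global interpreter lock allows only one thread to run Python bytecode at a time. For a real run on all the data, we should use all available cores on our machine.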

# Conclusions

In this notebook, we saw how to download the most recent version of every article on English language Wikipedia. Not only are we able to download all of the information, but we also saw some tools for processing this information to extract the data that we need. Wikipedia is an incredible resource, not only for doing your school projects, but also for exploring techniques in data science. In future notebooks, we'll look at how to build a book recommendation engine based on the data we've collected here. Gathering the data is one thing, but eventually, we want to be able to accomplish useful tasks with this data! 