## Understanding the Data 
For this tutorial notebook, we will be working with the [Wikimedia Enterprise HTML Dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download#:~:text=Wikimedia%20Enterprise%20HTML%20Dumps,-As%20part%20of&text=Dumps%20are%20produced%20for%20a,per%20article%2C%20in%20json%20format.), developed for high-volume reusers of wiki content. Dump files are created for a specific set of namespaces and wikis and then made publicly available. Each dump output file is a tar.gz archive containing one file. When uncompressed and untarred, the file has a single line per article in JSON format. Some attributes of these dump files are listed [here](https://dumps.wikimedia.org/other/enterprise_html/). Further information about data dumps can be found [here](https://meta.wikimedia.org/wiki/Data_dumps).

We will be using a dump of the Simple English Wikipedia (`simplewiki`), because its dumps are much smaller and easier to process on a local machine. You can download the `simplewiki` dump from the same location as the other Enterprise dumps (just search for the `simplewiki` prefix).

In the following cell, we load one of our downloaded dump files using Python's `tarfile` library. For the sake of demonstration, we will only use the first article from the first chunk of the tarball.

In [10]:
import json
import tarfile

FILEPATH = "PATH-TO-DUMP-TAR.GZ"

with tarfile.open(FILEPATH, mode="r:gz") as tar:
    # The first member of the archive is the first chunk of the dump.
    html_fn = tar.next()
    print(
        f"We will be working with {html_fn.name} ({html_fn.size / 1000000000:0.3f} GB)."
    )
    # Stream the chunk and parse only its first line (one article per line).
    with tar.extractfile(html_fn) as fin:
        for line in fin:
            article = json.loads(line)
            break

We will be working with simplewiki_0.ndjson (8.648 GB).
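
If more than one article is needed, the same one-JSON-object-per-line layout can be streamed lazily. The following is only a minimal sketch and not part of the original notebook: `iter_articles` is a hypothetical helper, and the in-memory bytes are made-up stand-in data used in place of the file object returned by `tar.extractfile`.

```python
import io
import json
from itertools import islice


def iter_articles(fin, limit=None):
    """Return an iterator over parsed articles, one per non-empty line.

    `fin` is any iterable of bytes or str lines (for example, the object
    returned by `tar.extractfile`); `limit` caps the number of articles.
    This helper is hypothetical and only illustrates the NDJSON layout.
    """
    parsed = (json.loads(line) for line in fin if line.strip())
    return islice(parsed, limit)


# Stand-in data: three tiny made-up records instead of a real dump chunk.
fake_chunk = io.BytesIO(b'{"name": "A"}\n{"name": "B"}\n{"name": "C"}\n')
first_two = list(iter_articles(fake_chunk, limit=2))
print([a["name"] for a in first_two])
```

Because the generator parses lines on demand, stopping after a few articles avoids reading the rest of a multi-gigabyte chunk.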


#### Viewing the Data
In this cell, we print the JSON of a single article with an indentation of 2 and try to make sense of it. I found it hard to understand the contents just by reading through the raw JSON, so I saved it as a .json file and opened it in a more interactive JSON viewer (I used the Mozilla Firefox browser, but any browser or an online JSON formatter such as https://jsonlint.com/ should also work). I find this interface much easier to navigate and understand.
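
Printing with an indentation of 2 and saving the result as a .json file could look roughly like the sketch below. The tiny `article` dict is a made-up stand-in for the record loaded earlier, and the output file name and temporary directory are arbitrary choices for illustration.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the `article` dict loaded earlier; the values here are made up.
article = {"name": "Example", "article_body": {"html": "<p>Hi</p>", "wikitext": "Hi"}}

# Pretty-print with an indentation of 2, keeping non-ASCII characters readable.
pretty = json.dumps(article, indent=2, ensure_ascii=False)
print(pretty[:200])  # preview only; a real article is far too long to read here

# Save to a file that a browser or JSON viewer can open
# (a temporary directory is used here purely for illustration).
out_path = Path(tempfile.mkdtemp()) / "sample_article.json"
out_path.write_text(pretty, encoding="utf-8")
```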

The JSON for our sample article contains information about its name, id, url, language, namespace and license. It also includes important fields such as the article body, [categories](https://en.wikipedia.org/wiki/Help:Category), revision history, [redirect_links](https://en.wikipedia.org/wiki/Help:Redirect), [templates](https://en.wikipedia.org/wiki/Help:Template), etc., stored inside nested dictionaries. In the browser, it looks like the image below:
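
A quick programmatic overview of the top-level fields can complement the browser view. This is a sketch under assumptions: `summarize` is a hypothetical helper, and the `sample` record below is hand-written with illustrative keys, so it does not reproduce the exact schema of the dump.

```python
def summarize(record, max_len=40):
    """Return {key: short description} for the top level of a dict."""
    summary = {}
    for key, value in record.items():
        if isinstance(value, dict):
            summary[key] = f"dict with keys {sorted(value)}"
        elif isinstance(value, list):
            summary[key] = f"list of {len(value)} item(s)"
        else:
            text = repr(value)
            summary[key] = text if len(text) <= max_len else text[:max_len] + "..."
    return summary


# Hand-written stand-in record; key names are illustrative, not the real schema.
sample = {
    "name": "Example",
    "url": "https://simple.wikipedia.org/wiki/Example",
    "article_body": {"html": "<p>Hi</p>", "wikitext": "Hi"},
    "categories": [{"name": "Category:Examples"}],
}

for key, description in summarize(sample).items():
    print(f"{key}: {description}")
```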

<img src="assets/sample_page.png" width="900" height="500" />

However, for the tasks at hand, we are particularly interested in the `article_body` attribute, which contains the actual content of the article. Taking a closer look, it contains both the HTML of the article and the wikitext code that was parsed to generate that HTML. We analyze these two types of data separately in the next few cells.

<img src="assets/example_article.png" width="800" height="600" />

#### HTML Code
As mentioned earlier, the `article_body` key of our sample article JSON also contains HTML code generated by parsing its wikitext code through internal Wikipedia APIs. Before working on the tasks with our sample HTML data, I used an online [HTML Beautifier](https://codebeautify.org/htmlviewer) to make the HTML more readable. One thing to notice here is that the formatted HTML code is much larger than the original wikitext code, as it has been programmatically expanded by the Wikipedia API.
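
The size difference between the two representations can be checked directly. The sketch below assumes the wikitext is stored under a `wikitext` key next to `html` inside `article_body` (the key name is an assumption based on the description above); `size_ratio` is a hypothetical helper, and the sample strings are hand-written rather than taken from the dump.

```python
def size_ratio(article_body):
    """Return (html_bytes, wikitext_bytes, ratio) for an article body dict."""
    html_bytes = len(article_body["html"].encode("utf-8"))
    wikitext_bytes = len(article_body["wikitext"].encode("utf-8"))
    # Guard against empty wikitext to avoid dividing by zero.
    ratio = html_bytes / wikitext_bytes if wikitext_bytes else float("inf")
    return html_bytes, wikitext_bytes, ratio


# Hand-written stand-in for article["article_body"].
sample_body = {
    "wikitext": "'''Amsterdam''' is a [[city]].",
    "html": '<p><b>Amsterdam</b> is a <a rel="mw:WikiLink" href="./City" '
            'title="City">city</a>.</p>',
}

html_size, wikitext_size, ratio = size_ratio(sample_body)
print(f"HTML: {html_size} bytes, wikitext: {wikitext_size} bytes, ratio: {ratio:.1f}x")
```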

If we go through the HTML, we can identify the patterns or HTML tag structures associated with different article components such as headers, categories and templates. For example, sections are found inside HTML `<section>` tags, while categories carry a specific `rel` (relation) attribute value, `mw:PageProp/Category`. In building the library, we combine our observations of such patterns with the official [MediaWiki HTML 2.5.0 Specsheet](https://www.mediawiki.org/wiki/Specs/HTML/2.5.0) to extract the different features.
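
The two patterns mentioned above can be illustrated with a small sketch. It is not the library's implementation: it uses Python's standard `html.parser` module to stay self-contained, `PatternCollector` is a hypothetical class, and `sample_html` is a hand-written fragment modeled on the patterns described here rather than real dump output.

```python
from html.parser import HTMLParser


class PatternCollector(HTMLParser):
    """Count <section> tags and collect category link targets."""

    def __init__(self):
        super().__init__()
        self.section_count = 0
        self.categories = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "section":
            self.section_count += 1
        elif tag == "link":
            # `rel` may hold several space-separated values.
            rel_values = (attrs.get("rel") or "").split()
            if "mw:PageProp/Category" in rel_values:
                self.categories.append(attrs.get("href"))


# Hand-written fragment for illustration only.
sample_html = """
<section data-mw-section-id="0"><p>Lead paragraph.</p></section>
<section data-mw-section-id="1"><h2>History</h2><p>Text.</p></section>
<link rel="mw:PageProp/Category" href="./Category:Cities_in_New_York"/>
"""

collector = PatternCollector()
collector.feed(sample_html)
print(collector.section_count, collector.categories)
```

On the full article, the same collector could be fed `article['article_body']['html']` instead of the sample fragment.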

In [11]:
print(article['article_body']['html'])

<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://simple.wikipedia.org/wiki/Special:Redirect/revision/7554816"><head prefix="mwr: https://simple.wikipedia.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="bb7c69e0-1d68-11ec-85b3-01e58f568f31"/><meta charset="utf-8"/><meta property="mw:pageId" content="595306"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/7520037"/><meta property="mw:revisionSHA1" content="e624e3d21201440b88f3bdf20885ad4b167f8e6d"/><meta property="dc:modified" content="2021-05-28T13:33:15.000Z"/><meta property="mw:htmlVersion" content="2.3.0"/><meta property="mw:html:version" content="2.3.0"/><link rel="dc:isVersionOf" href="//simple.wikipedia.org/wiki/Amsterdam_(city)%2C_New_York"/><title>Amsterdam (city), New York</title><base href="//simple.wikipedia.org/wiki/"/><meta property="mw:styleModules" content="ext.cite.style|ext.cite.styles"/><link rel="s