This notebook walks through the features currently available in the library.
Before we begin:
* To learn more about the premise of the project, [see this blog post](https://appledora.hashnode.dev/outreach-bw3).
* For a more concrete idea of the internal structure of the dump files, [see this notebook](example_data.ipynb).

### Installing the Package
Install the package with pip:
```bash
   $ pip install mwparserfromhtml
```

In [6]:
## Import the package and load a dump file
from mwparserfromhtml import HTMLDump
html_file_path = "/home/appledora/Documents/wikimedia/data/simplewiki-NS0-20220601-ENTERPRISE-HTML.json.tar.gz"
html_dump = HTMLDump(html_file_path, max_article=3)
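
To give a rough idea of what loading a dump involves, here is a minimal standard-library sketch. It assumes the dump is a gzipped tar archive of newline-delimited JSON files, with one article per line and the fields `name` and `article_body.html`. Those field names, and the helpers `iter_articles` and `build_demo_dump`, are assumptions made for illustration and are not part of `mwparserfromhtml`. The sketch builds a tiny archive in memory so it runs without a real dump.

```python
import io
import json
import tarfile


def iter_articles(fileobj, max_article=None):
    """Yield one dict per article from a tar.gz archive of NDJSON files (hypothetical helper)."""
    count = 0
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            handle = tar.extractfile(member)
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                yield json.loads(line)
                count += 1
                if max_article is not None and count >= max_article:
                    return


def build_demo_dump(names):
    """Build a small in-memory archive; the field names are assumed, not taken from a real dump."""
    payload = "\n".join(
        json.dumps({"name": name, "article_body": {"html": f"<p>{name}</p>"}})
        for name in names
    ).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="demo_0.ndjson")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


demo = build_demo_dump(["Amsterdam (city), New York", "Bangor", "Chang Gum-chol"])
titles = [article["name"] for article in iter_articles(demo, max_article=2)]
print(titles)
```

Streaming line by line keeps memory use low, which matters because real dumps are far larger than this demo archive.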

In [7]:
## Iterate over the articles in the dump and print their titles 
for article in html_dump:
    print(article.title)

Amsterdam (city), New York
You Kent Always Say What You Want
Bangor


In [8]:
# Extract the plain text of each article in the dump, i.e. remove anything that is not text (e.g. a link is replaced by its anchor text)
for article in html_dump:
    print(article.get_plaintext(skip_categories=True, skip_transclusion=False, skip_headers=False))
    print("="*80)

Amsterdam is a city in Montgomery County, New York, United States. As of the 2010 census, the city had a population of 18,620.[1] The name is influenced from the city of Amsterdam in the Netherlands.
The city of Amsterdam is surrounded on the north, east, and west sides by the town of Amsterdam. The city developed on both sides of the Mohawk River, with the majority located on the north bank. The Port Jackson area on the south side is also part of the city.
References

↑  "2016 U.S. Gazetteer Files". United States Census Bureau. Retrieved Jul 5, 2017.
Other websites
Official website

 This short article about a place or feature in the United States can be made longer. You can help Wikipedia by adding to it.


"You Kent Always Say What You Want"The Simpsons episodeEpisode no.Season 18Episode 22Directed byMatthew NastukWritten byTim LongProduction codeJABF15Original air dateMay 20, 2007 (2007-05-20)Guest appearancesLudacris as himselfMaurice LaMarche as Birch BarlowEpisode chronology

← 

In [23]:
# You can extract templates, categories, wikilinks, external links, media, references, etc. from each article in the dump
# Each method returns a list of class instances, one per component found
for article in html_dump:
    templates = article.get_templates()
    categories = article.get_categories()
    wlinks = article.get_wikilinks()
    exlinks = article.get_externallinks()
    medias = article.get_media(skip_images=True, skip_video=False, skip_audio=False)
    references = article.get_references()


In [16]:
# Alternatively, you can read a stand-alone article file obtained from the Wikipedia dump and convert it to an `Article` object to extract its features
from mwparserfromhtml import Article
import json
# Open the file in a context manager so it is closed after loading
with open("/home/appledora/Documents/wikimedia/html-dumps/data/article.json") as f:
    article_object = json.load(f)
article = Article(article_object)
print("Article Name: ", article.title)
templates = article.get_templates()
categories = article.get_categories()
wikilinks = article.get_wikilinks()

Article Name:  Chang Gum-chol


In [17]:
# Print the title of each template used in the article
for t in templates:
    print(t.title)

short description
Orphan
family name hatnote
Infobox football biography

Infobox Korean name

reflist
NFT player
NorthKorea-footy-bio-stub


In [18]:
# Print the name of each category the article belongs to
for category in categories:
    print(category.title)

Articles with short description
Short description matches Wikidata
Orphaned articles from January 2022
All orphaned articles
Articles needing Korean script or text#Chang%20Gum-chol
Wikipedia articles needing romanized Korean#Chang%20Gum-chol
NFT template with ID not in Wikidata
Date of birth unknown
Living people
North Korean footballers
North Korea international footballers
Association football midfielders
Year of birth missing (living people)
All stub articles
North Korean football biography stubs
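
Some of the category titles above carry a `#Chang%20Gum-chol` suffix. This looks like a percent-encoded fragment attached to the category link (possibly a sort key), but that reading is an interpretation of the output, not documented behavior of the library. If you only need the category name, a small standard-library sketch like the following could split it off. The helper name `split_category` is hypothetical.

```python
from urllib.parse import unquote


def split_category(raw_title):
    """Split 'Name#fragment' into (name, decoded fragment); the fragment is None if absent."""
    name, separator, fragment = raw_title.partition("#")
    return name, (unquote(fragment) if separator else None)


raw_titles = [
    "Living people",
    "Articles needing Korean script or text#Chang%20Gum-chol",
]
for raw in raw_titles:
    print(split_category(raw))
```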


In [22]:
# Print the properties of a sample wikilink object. To learn more about the properties, see the relevant class files
sample_wikilink = wikilinks[-1]
print(sample_wikilink.__dict__)

{'name': 'Wikilink', 'html_string': <a href="./Template_talk:NorthKorea-footy-bio-stub" rel="mw:WikiLink" title="Template talk:NorthKorea-footy-bio-stub"><abbr title="Discuss this template">t</abbr></a>, 'title': 'Template talk:NorthKorea-footy-bio-stub', 'plaintext': 't', 'tid': None, 'transclusion': False, 'link': './Template_talk:NorthKorea-footy-bio-stub', 'namespace_id': 11, 'disambiguation': False, 'redirect': False, 'redlink': False, 'interwiki': False}
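
The `html_string` above shows that a wikilink is an `<a>` element marked with `rel="mw:WikiLink"`. As an illustration of how such links could be picked out of raw article HTML, here is a minimal sketch using only the standard library's `html.parser`. It is not the library's implementation. The class `WikilinkCollector` and the second anchor in the sample snippet (an external link with `rel="mw:ExtLink"`) are assumptions added for the example.

```python
from html.parser import HTMLParser


class WikilinkCollector(HTMLParser):
    """Collect href and title of every <a> whose rel contains mw:WikiLink (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely
        rel_values = (attributes.get("rel") or "").split()
        if "mw:WikiLink" in rel_values:
            self.links.append(
                {"link": attributes.get("href"), "title": attributes.get("title")}
            )


snippet = (
    '<a href="./Template_talk:NorthKorea-footy-bio-stub" rel="mw:WikiLink" '
    'title="Template talk:NorthKorea-footy-bio-stub">'
    '<abbr title="Discuss this template">t</abbr></a>'
    '<a href="https://example.org" rel="mw:ExtLink">site</a>'  # assumed extra link
)

collector = WikilinkCollector()
collector.feed(snippet)
print(collector.links)
```

Only the first anchor is collected, because the second one is not marked as a wikilink.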
