Author: Jun Sun (jun.sun@gesis.org)

In this notebook, we will learn how to

1. install Python packages in the Colab environment
2. use ```pywikibot``` to query articles in Wikipedia as wikitext
3. use ```wikitextparser``` to parse wikitext

In [None]:
import pandas as pd
from IPython.display import HTML

# 1. Install Python packages

In [None]:
# install pywikibot and wikitextparser
!pip install pywikibot
!pip install wikitextparser



## Import the installed packages

In [None]:
# pywikibot needs a config file
pywikibot_config = r"""

# -*- coding: utf-8  -*-

mylang = 'en'
family = 'wikipedia'
usernames['wikipedia']['en'] = 'test'
"""

with open('user-config.py', 'w', encoding="utf-8") as f:
    f.write(pywikibot_config)

import pywikibot
import wikitextparser as wtp
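
The cell above writes `user-config.py` into the current working directory. As a minimal sketch, assuming pywikibot honours the `PYWIKIBOT_DIR` environment variable as described in its documentation, the config file can instead live in a dedicated directory. The helper name `write_pywikibot_config` and the temporary directory are illustrative and not part of the original notebook; the variable would have to be set before `import pywikibot` runs.

```python
import os
import tempfile
from pathlib import Path

PYWIKIBOT_CONFIG = """\
# -*- coding: utf-8 -*-
mylang = 'en'
family = 'wikipedia'
usernames['wikipedia']['en'] = 'test'
"""

def write_pywikibot_config(directory):
    """Write user-config.py into `directory` and point PYWIKIBOT_DIR at it."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    config_path = directory / "user-config.py"
    config_path.write_text(PYWIKIBOT_CONFIG, encoding="utf-8")
    # Assumption: pywikibot reads this variable at import time to find its config.
    os.environ["PYWIKIBOT_DIR"] = str(directory)
    return config_path

config_path = write_pywikibot_config(tempfile.mkdtemp(prefix="pywikibot_"))
print(config_path)
```

Keeping the config out of the working directory avoids overwriting an existing `user-config.py` when the notebook is run locally rather than in Colab.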

# 2. Use ```pywikibot``` to query Wikipedia

## Create a ```Site``` instance for English Wikipedia

In [None]:
site_en = pywikibot.Site('en', 'wikipedia')

## Create a ```Page``` instance for Mannheim

In [None]:
page_mannheim = pywikibot.Page(site_en, 'Mannheim')

## Examine the wikitext of the page

In [None]:
wikitext_mannheim = page_mannheim.text
print("Page 'Mannheim' - Wikitext: \n\n%s" % wikitext_mannheim)

Page 'Mannheim' - Wikitext: 

{{Short description|Second-largest city in Baden-Württemberg, Germany}}
{{About|the city in Germany|other uses}}
{{Use dmy dates|date=July 2021}}
{{Infobox German place
|type = City
|image_photo = {{Photomontage|position=center
|photo1a = Der Friedrichsplatz und der Wasserturm.jpg
|photo2a = Die Jesuitenkirche.jpg
|photo2b = Luisenpark Mannheim Gondolettas.JPG
|photo3a = Mannheim wasserspiele.jpg
|photo3b = MA-Friedrichsplatz-0329.jpg
|photo4a = SchlossMannheim-Pano-130616.jpg
|size = 280
|spacing = 2
|color = #FFFFFF
|border = 0}}
|image_caption='''Clockwise from top''': [[Friedrichsplatz]]; [[Luisenpark]]; [[Augustaanlage]]; [[Mannheim Palace]]; [[Mannheim Water Tower]]; and [[Jesuit Church, Mannheim|Jesuit Church]]
|image_coa=Wappen Mannheim.svg
|image_flag=Mannheim-Flagge.svg
|coordinates = {{coord|49|29|16|N|08|27|58|E|display=inline,title}}
|image_plan=Baden-Württemberg MA.svg
|plantext=Location of Mannheim in Baden-Württemberg
|state=Baden-Württembe

## Print the categories this page belongs to

In [None]:
for c in page_mannheim.categories():
    print(c)

[[en:Category:All articles with dead external links]]
[[en:Category:Articles containing German-language text]]
[[en:Category:Articles containing Palatine German-language text]]
[[en:Category:Articles with BNF identifiers]]
[[en:Category:Articles with BNFdata identifiers]]
[[en:Category:Articles with GND identifiers]]
[[en:Category:Articles with J9U identifiers]]
[[en:Category:Articles with LCCN identifiers]]
[[en:Category:Articles with MusicBrainz area identifiers]]
[[en:Category:Articles with NARA identifiers]]
[[en:Category:Articles with NKC identifiers]]
[[en:Category:Articles with SUDOC identifiers]]
[[en:Category:Articles with VIAF identifiers]]
[[en:Category:Articles with dead external links from April 2011]]
[[en:Category:Articles with short description]]
[[en:Category:Baden]]
[[en:Category:CS1 German-language sources (de)]]
[[en:Category:CS1 Korean-language sources (ko)]]
[[en:Category:CS1 maint: multiple names: authors list]]
[[en:Category:CS1 maint: url-status]]
[[en:Category
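
Many of the categories printed above are maintenance categories rather than topical ones. The following is a rough sketch of how they could be filtered by title prefix. The prefix list is an assumption read off the sample output, not an official rule, and the function `is_topical` is hypothetical. It works on plain strings so that it runs offline; with pywikibot, the titles could be obtained with `c.title(with_ns=False)`.

```python
# Prefixes that mark maintenance categories in the sample output above.
# This list is an assumption for illustration, not an exhaustive rule.
MAINTENANCE_PREFIXES = ("All articles", "Articles with", "Articles containing", "CS1")

def is_topical(category_title):
    """Return True if the title does not start with a maintenance prefix."""
    return not category_title.startswith(MAINTENANCE_PREFIXES)

sample_titles = [
    "All articles with dead external links",
    "Articles containing German-language text",
    "Articles with short description",
    "Baden",
    "CS1 German-language sources (de)",
]

# Keep only the categories that describe the subject of the article.
topical = [t for t in sample_titles if is_topical(t)]
print(topical)
```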

## Get all pages in a category

In [None]:
cat = pywikibot.Category(site_en, "Category:Cities in Baden-Württemberg")
print("======== Members in the Category : Cities in Baden-Württemberg ========")

for city in cat.articles():
    print(f'{city.title().ljust(20)} {city.full_url()}')

Freiburg im Breisgau https://en.wikipedia.org/wiki/Freiburg_im_Breisgau
Heidelberg           https://en.wikipedia.org/wiki/Heidelberg
Heilbronn            https://en.wikipedia.org/wiki/Heilbronn
Karlsruhe            https://en.wikipedia.org/wiki/Karlsruhe
Löwenstein           https://en.wikipedia.org/wiki/L%C3%B6wenstein
Mannheim             https://en.wikipedia.org/wiki/Mannheim
Pforzheim            https://en.wikipedia.org/wiki/Pforzheim
Reutlingen           https://en.wikipedia.org/wiki/Reutlingen
Stuttgart            https://en.wikipedia.org/wiki/Stuttgart
Ulm                  https://en.wikipedia.org/wiki/Ulm
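
The URLs in the output show how a page title maps to its address: spaces become underscores and non-ASCII characters such as "ö" are percent-encoded. The sketch below approximates this with the standard library. The function `article_url` is hypothetical, and pywikibot's `full_url()` may treat some punctuation differently.

```python
from urllib.parse import quote

def article_url(title, lang="en"):
    """Approximate the URL of a Wikipedia article from its title."""
    # MediaWiki replaces spaces with underscores in page URLs.
    path = title.replace(" ", "_")
    # Non-ASCII characters are UTF-8 encoded and percent-escaped.
    return f"https://{lang}.wikipedia.org/wiki/{quote(path)}"

for title in ["Freiburg im Breisgau", "Löwenstein", "Ulm"]:
    # ljust(20) pads the title to 20 characters so the URLs line up.
    print(f"{title.ljust(20)} {article_url(title)}")
```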


# 3. Use ```wikitextparser``` to parse wikitext

```wikitextparser``` makes it easier to extract sections, links, lists and tables from wikitext than working on the raw string.

## Extract the section structure

In [None]:
sections = wtp.parse(wikitext_mannheim).sections

## Print the table of contents

In [None]:
for section in sections:
    if section.title:
        print('*' * (section.level - 1), section.title.strip())

* History
** Early history
** Early Modern Age
** 18th and 19th centuries
** Early 20th century and World War I
** Inter-war period
** World War II
** 1950s to 1980s
** Post-reunification
* Geography
** Climate
* Demographics
** Population
*** Nationalities
*** Religion
** Culture
*** Theatre
*** Sport
*** Education
*** Inventions
* Government and politics
** Mayor
** City council
* United States military installations
* Main sights
* Economy
** Media
* Infrastructure
** Road transport
** Railway transport
** River transport
** Air transport
** Local public transport
* Block numbering and computer mapping
* Twin towns – sister cities
* Notable people
* Notes and references
** Notes
** References
* Further reading
* External links
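
In wikitext, the level of a heading is the number of `=` signs on each side of its title, so the top-level sections of an article are at level 2. This is why the loop above prints `section.level - 1` asterisks. The following simplified sketch recovers the same outline with a regular expression. The names `HEADING_RE` and `outline` are illustrative, and the pattern ignores edge cases (comments, `<nowiki>` blocks) that `wikitextparser` takes care of.

```python
import re

# A heading line is 1-6 '=' signs, the title, and the same run of '=' signs.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)

def outline(wikitext):
    """Return (level, title) pairs for every heading in the wikitext."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(wikitext)]

sample = """Lead text.
==History==
===Early history===
== Geography ==
===Climate===
"""

for level, title in outline(sample):
    # Level-2 headings get one asterisk, level-3 headings two, and so on.
    print("*" * (level - 1), title)
```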


## Print the "summary" of the page, i.e., the first section

In [None]:
print(sections[0].plain_text().strip())

thumb|293px|Aerial view of the city centre showing the grid layout
Mannheim (; Palatine German:  or ), officially the University City of Mannheim (), is the second-largest city in the German state of Baden-Württemberg, after the state capital of Stuttgart, and Germany's 21st-largest city, with a 2021 population of 311,831 inhabitants. The city is the cultural and economic centre of the Rhine-Neckar Metropolitan Region, Germany's seventh-largest metropolitan region with nearly 2.4 million inhabitants and over 900,000 employees.

Mannheim is located at the confluence of the Rhine and the Neckar in the Kurpfalz (Electoral Palatinate) region of northwestern Baden-Württemberg. The city lies in the Upper Rhine Plain, Germany's warmest region. Together with Hamburg, Mannheim is the only German city bordering two other federal states. It forms a continuous conurbation of around 480,000 inhabitants with Ludwigshafen am Rhein in the neighbouring state of Rhineland-Palatinate, on the other side o

## Search sections containing a certain title

In [None]:
# return the first section of a page whose title contains the given string, or None if there is none
def search_section(page, title):
    wikitext = page.text
    sections = wtp.parse(wikitext).sections

    for section in sections:
        if section.title:
            if title in section.title:
                return section

    return None
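
`search_section` compares titles case-sensitively and returns `None` when nothing matches, in which case calling `plain_text()` on the result raises an `AttributeError`. Below is a sketch of a case-insensitive variant that takes an already parsed list of sections. The name `find_section` is hypothetical, and the `SimpleNamespace` objects are stand-ins that only mimic the `title` attribute so that the example runs without network access.

```python
from types import SimpleNamespace

def find_section(sections, title):
    """Return the first section whose title contains `title`, ignoring case."""
    needle = title.casefold()
    for section in sections:
        # The lead section has no heading, so its title may be None.
        if section.title and needle in section.title.casefold():
            return section
    return None

# Stand-ins for parsed sections; real ones would come from wtp.parse(...).sections.
fake_sections = [
    SimpleNamespace(title=None),
    SimpleNamespace(title=" History "),
    SimpleNamespace(title="Twin towns – sister cities"),
]

match = find_section(fake_sections, "twin TOWNS")
# Check for None before using the result.
print(match.title if match else "no matching section")
```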

### Search section "Twin towns – sister cities"

In [None]:
section_twintowns = search_section(page_mannheim, 'Twin towns')
print(section_twintowns.plain_text().strip())

==Twin towns – sister cities==


Mannheim is twinned with:

*Swansea, Wales, United Kingdom (1957)
*Toulon, France (1959)
*Charlottenburg-Wilmersdorf (Berlin), Germany (1961)
*Windsor, Canada (1980)
*Riesa, Germany (1988)
*Chișinău, Moldova (1989)
*Bydgoszcz, Poland (1991)
*Klaipėda, Lithuania (2002)
*Zhenjiang, China (2004)
*Haifa, Israel (2009)
*Qingdao, China (2016)
*Chernivtsi, Ukraine (2022)


## Get all twin towns as an iterable Python list

In [None]:
# use the get_lists() method to retrieve all lists in the section, then take the items of the first list
lst_twintowns = section_twintowns.get_lists()[0].items

# output each entry in the list
for twintown in lst_twintowns:
    print(twintown)

[[Swansea]], Wales, United Kingdom (1957)
[[Toulon]], France (1959)
[[Charlottenburg-Wilmersdorf|Charlottenburg-Wilmersdorf (Berlin)]], Germany (1961)
[[Windsor, Ontario|Windsor]], Canada (1980)
[[Riesa]], Germany (1988)
[[Chișinău]], Moldova (1989)
[[Bydgoszcz]], Poland (1991)
[[Klaipėda]], Lithuania (2002)
[[Zhenjiang]], China (2004)
[[Haifa]], Israel (2009)
[[Qingdao]], China (2016)
[[Chernivtsi]], Ukraine (2022)
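
Each entry is still a wikitext string with a `[[target]]` or `[[target|label]]` link, a location and a year. As a minimal sketch, assuming every entry follows the pattern seen in the output above, the parts can be separated with a regular expression. The names `ENTRY_RE` and `parse_twintown` are illustrative; parsing the links with `wikitextparser` itself would be an alternative.

```python
import re

# [[target]] or [[target|label]], a comma, the location, then (year).
ENTRY_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\],\s*(.+?)\s*\((\d{4})\)")

def parse_twintown(entry):
    """Split one list entry into article title, label, location and year."""
    m = ENTRY_RE.search(entry)
    if m is None:
        return None  # the entry does not follow the assumed pattern
    target, label, location, year = m.groups()
    return {
        "article": target,
        "label": label or target,  # without a pipe, the label is the target
        "location": location,
        "year": int(year),
    }

sample_entries = [
    "[[Swansea]], Wales, United Kingdom (1957)",
    "[[Windsor, Ontario|Windsor]], Canada (1980)",
]

for entry in sample_entries:
    print(parse_twintown(entry))
```

A list of such dictionaries can be passed to `pd.DataFrame` to obtain a table of twin towns.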
