In [1]:
from pprint import pprint

## Task 1. Install and import packages

Note: If you use Google Colab, you have to install ```pywikibot``` and ```wikitextparser``` again in every new notebook that uses them.
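
Because the install has to be repeated per notebook, it can help to check first which packages are actually missing in the current runtime. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this illustration.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this runtime."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two packages this notebook relies on.
to_install = missing_packages(["pywikibot", "wikitextparser"])
print("Packages still to install:", to_install)
```

If the printed list is empty, the `pip install` cell can be skipped for this session.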

In [2]:
# install pywikibot and wikitextparser
!pip install pywikibot
!pip install wikitextparser

Collecting pywikibot
  Downloading pywikibot-8.3.2-py3-none-any.whl (704 kB)
Collecting mwparserfromhell>=0.5.2 (from pywikibot)
  Downloading mwparserfromhell-0.6.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (191 kB)
Installing collected packages: mwparserfromhell, pywikibot
Successfully installed mwparserfromhell-0.6.5 pywikibot-8.3.2
Collecting wikitextparser
  Downloading wikitextparser-0.54.0-py3-none-any.whl (66 kB)
Installing collected packages: wikitextparser
Successfully installed wikitextparser-0.54.0


In [3]:
## Import the installed packages

# pywikibot needs a config file
pywikibot_config = r"""

# -*- coding: utf-8  -*-

mylang = 'en'
family = 'wikipedia'
usernames['wikipedia']['en'] = 'test'
"""

with open('user-config.py', 'w', encoding="utf-8") as f:
    f.write(pywikibot_config)

import pywikibot
import wikitextparser as wtp
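
The cell above overwrites `user-config.py` on every run. When a config file may already exist (for example on a local machine rather than a fresh Colab runtime), one option is to write it only if it is missing. This is a sketch under that assumption; `ensure_config` is a hypothetical helper, not part of pywikibot.

```python
from pathlib import Path


def ensure_config(path, content):
    """Write `content` to `path` unless the file already exists.

    Returns True if the file was written, False if it was left untouched.
    """
    target = Path(path)
    if target.exists():
        return False
    target.write_text(content, encoding="utf-8")
    return True
```

In this notebook it would be called as `ensure_config('user-config.py', pywikibot_config)` before `import pywikibot`.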

## Task 2. Pages and categories

Your task is:

    1. retrieve the page "Cologne" in the English Wikipedia and print its wikitext
    2. print the categories this page belongs to
    3. print all articles that belong to "Category:Cities in North Rhine-Westphalia"

### 2.1 Retrieve the page "Cologne" in the English Wikipedia and print its wikitext

In [4]:
site_en = pywikibot.Site('en', 'wikipedia')
page_cologne = pywikibot.Page(site_en, 'Cologne')

In [5]:
wikitext_cologne = page_cologne.text
print("Page 'Cologne' - Wikitext: \n\n%s" % wikitext_cologne)

Page 'Cologne' - Wikitext: 

{{Short description|Largest city in North Rhine-Westphalia, Germany}}
{{About||the perfume|Eau de Cologne|other uses}}
{{Redirect|Köln}}
{{Use dmy dates|date=February 2021}}
{{Infobox German location
|name               = Cologne
|German_name        = {{native name|de|Köln}}
|type               = City
|image_photo        = {{Photomontage|position=center
|photo1a = Kranhäuser Cologne, April 2018 -01.jpg
|photo2a = Kölner Dom und Hohenzollernbrücke Abenddämmerung (9706 7 8).jpg
|photo2b = 12-09 WLM Cologne 40.JPG
|photo3a = St. Gereon Köln - Dekagon-9702.jpg
|photo3b = River Concerto (ship, 2000) 003.jpg
|photo4a = Flora - Köln.jpg
|photo4b = St Kunibert Koeln.jpg
|photo5a = Rheinpanorama mit Hohenzollernbrücke, Kölner Dom, Groß St. Martin und Deutzer Brücke.jpg
|size    = 280
|spacing = 2
|color   = white
|border  = 0}}
|image_caption      = Clockwise from top: view of Cologne (with the [[Kranhaus|Kranhäuser]], [[Cologne Cathedral]] and [[Great St. Martin Ch

### 2.2 Print the categories this page belongs to

In [6]:
for c in page_cologne.categories():
    print(c)

[[en:Category:30s BC establishments]]
[[en:Category:38 BC]]
[[en:Category:All articles with unsourced statements]]
[[en:Category:Articles containing French-language text]]
[[en:Category:Articles containing German-language text]]
[[en:Category:Articles containing Kölsch-language text]]
[[en:Category:Articles containing Latin-language text]]
[[en:Category:Articles with BNF identifiers]]
[[en:Category:Articles with BNFdata identifiers]]
[[en:Category:Articles with GND identifiers]]
[[en:Category:Articles with German-language sources (de)]]
[[en:Category:Articles with HDS identifiers]]
[[en:Category:Articles with ISNI identifiers]]
[[en:Category:Articles with J9U identifiers]]
[[en:Category:Articles with LCCN identifiers]]
[[en:Category:Articles with MusicBrainz area identifiers]]
[[en:Category:Articles with NARA identifiers]]
[[en:Category:Articles with NDL identifiers]]
[[en:Category:Articles with NKC identifiers]]
[[en:Category:Articles with Pleiades identifiers]]
[[en:Category:Articles
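
Most of the categories listed above are maintenance categories ("Articles with ... identifiers" and similar) rather than topical ones. A rough way to separate them is to filter on the title prefix. The sketch below works on plain strings; the prefix list is a heuristic assumption drawn from the output above, not an official classification.

```python
# Hypothetical prefixes that mark maintenance categories; adjust as needed.
MAINTENANCE_PREFIXES = (
    "All articles",
    "Articles containing",
    "Articles with",
)


def is_topical(category_title):
    """Return True if the bare category title has no maintenance prefix."""
    return not category_title.startswith(MAINTENANCE_PREFIXES)


# Sample titles copied from the output above (without the 'Category:' namespace).
sample = [
    "30s BC establishments",
    "38 BC",
    "All articles with unsourced statements",
    "Articles containing German-language text",
    "Articles with GND identifiers",
]
topical = [title for title in sample if is_topical(title)]
print(topical)
```

With pywikibot objects, the bare title should be available via `c.title(with_ns=False)`, which could be passed to `is_topical` inside the loop.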

### 2.3 Print all articles that belong to "Category:Cities in North Rhine-Westphalia"

In [7]:
cat = pywikibot.Category(site_en, "Category:Cities in North Rhine-Westphalia")
print("======== Members in 'Category:Cities in North Rhine-Westphalia' ========")

for city in cat.articles():
    print(f'{city.title().ljust(20)} {city.full_url()}')

Aachen               https://en.wikipedia.org/wiki/Aachen
Bergisch Gladbach    https://en.wikipedia.org/wiki/Bergisch_Gladbach
Bielefeld            https://en.wikipedia.org/wiki/Bielefeld
Bochum               https://en.wikipedia.org/wiki/Bochum
Bottrop              https://en.wikipedia.org/wiki/Bottrop
Cologne              https://en.wikipedia.org/wiki/Cologne
Dortmund             https://en.wikipedia.org/wiki/Dortmund
Duisburg             https://en.wikipedia.org/wiki/Duisburg
Düsseldorf           https://en.wikipedia.org/wiki/D%C3%BCsseldorf
Essen                https://en.wikipedia.org/wiki/Essen
Gelsenkirchen        https://en.wikipedia.org/wiki/Gelsenkirchen
Gütersloh            https://en.wikipedia.org/wiki/G%C3%BCtersloh
Hagen                https://en.wikipedia.org/wiki/Hagen
Hamm                 https://en.wikipedia.org/wiki/Hamm
Herne, North Rhine-Westphalia https://en.wikipedia.org/wiki/Herne%2C_North_Rhine-Westphalia
Korschenbroich       https://en.wikipedia.org/wiki/Korsc
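
In the output above, the fixed width of 20 breaks the alignment for "Herne, North Rhine-Westphalia", and non-ASCII or reserved characters show up percent-encoded in the URLs. The sketch below illustrates both points with the standard library only: it derives the column width from the longest title and builds the URL with `urllib.parse.quote`. The URL construction is a stand-in for `city.full_url()`, and `format_rows` is a hypothetical helper.

```python
from urllib.parse import quote


def format_rows(titles, base="https://en.wikipedia.org/wiki/"):
    """Return 'title url' lines with the URL column aligned to the longest title."""
    width = max((len(title) for title in titles), default=0)
    lines = []
    for title in titles:
        # Spaces become underscores; other reserved or non-ASCII
        # characters are percent-encoded as UTF-8 bytes.
        url = base + quote(title.replace(" ", "_"))
        lines.append(f"{title.ljust(width)} {url}")
    return lines


for line in format_rows(["Aachen", "Düsseldorf", "Herne, North Rhine-Westphalia"]):
    print(line)
```

In the notebook, the same width calculation could be applied to `[city.title() for city in cat.articles()]` before printing.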

## Task 3. Working with nested lists

A nested list is a list whose items are themselves lists. An example can be found in the "Education" section of the "Cologne" page. Your task is:

    1. find the "Education" section
    2. examine the structure of the nested list
    3. get all universities and colleges

### 3.1 Find the "Education" section

In [8]:
# return the first section of a page whose title contains the given string
def search_section(page, title):
    sections = wtp.parse(page.text).sections

    for section in sections:
        # the lead section has no title, so check for that first
        if section.title and title in section.title:
            return section

    return None
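
Note that `search_section` matches by substring, so a search for "Education" would also return a section titled "Higher education" if that one came first. Where an exact match is wanted, a stricter variant could look like the sketch below. It uses stand-in objects with a `title` attribute so that it runs without Wikipedia access; `find_section_exact` is a hypothetical name, and stripping whitespace is a defensive assumption about how titles are stored.

```python
from types import SimpleNamespace


def find_section_exact(sections, title):
    """Return the first section whose stripped title equals `title`, ignoring case."""
    wanted = title.strip().casefold()
    for section in sections:
        # Lead sections have no title, so skip falsy titles first.
        if section.title and section.title.strip().casefold() == wanted:
            return section
    return None


# Stand-in objects; real code would pass wtp.parse(page.text).sections.
sections = [
    SimpleNamespace(title=None),
    SimpleNamespace(title=" Higher education "),
    SimpleNamespace(title=" Education "),
]
print(find_section_exact(sections, "education").title)
```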

In [9]:
section_education = search_section(page_cologne, 'Education')
print(section_education)

==Education==
Cologne is home to numerous universities and colleges,<ref>{{cite web |url=http://wissensdurst-koeln.de/category/wissenschaft-forschung/hochschulen/ |title=Hochschulen – Wissensdurst KĂśln – Das KĂślner Wissenschaftsportal |publisher=Wissensdurst-koeln.de |access-date=26 July 2010 |archive-date=19 July 2010 |archive-url=https://web.archive.org/web/20100719011744/http://www.wissensdurst-koeln.de/category/wissenschaft-forschung/hochschulen/ |url-status=live }}</ref><ref>{{cite web|url=http://wissensdurst-koeln.de/wp-content/uploads/2010/04/flyer-spitzenforschung.pdf |archive-url=https://web.archive.org/web/20110719114109/http://wissensdurst-koeln.de/wp-content/uploads/2010/04/flyer-spitzenforschung.pdf |archive-date=2011-07-19 |url-status=live |title=Forschungsschwerpunkte |publisher=Wissensdurst-koeln.de}}</ref> and host to some 72,000 students.<ref name="Cologneeconomy"/> Its oldest university, the [[University of Cologne]] (founded in 1388)<ref name="Cologne History"/> i

### 3.2 Examine the structure of the nested list

In [10]:
for lst in section_education.get_lists():
    print(lst)
    print('\n')

* Public and state universities:
** [[University of Cologne]] (''Universität zu Köln'');
** [[German Sport University Cologne]] (''Deutsche Sporthochschule Köln'').
* Public and state colleges:
** [[Cologne University of Applied Sciences]] (''"Technology, Arts, Sciences TH KöLN" Technische Hochschule Köln'');
** [[Köln International School of Design]];
** [[Hochschule für Musik Köln|Cologne University of Music and Dance]] ({{Lang|de|Hochschule für Musik und Tanz Köln}});
** [[Academy of Media Arts Cologne]] (''Kunsthochschule für Medien Köln'');
* Private colleges:
** Catholic University of Applied Sciences (''Katholische Hochschule Nordrhein-Westfalen'');
** [[Cologne Business School]];
** [[international filmschool cologne]] (''internationale filmschule köln'');
** Rhenish University of Applied Sciences (''Rheinische Fachhochschule Köln'')
** University of Applied Sciences Fresenius (''Hochschule Fresenius'')
 

* Research institutes:
** [[German Aerospace Centre]] (''Deutsches Zentr
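
The output shows how nesting is encoded in wikitext: the number of leading asterisks gives the depth, so each `*` line is a heading and the `**` lines after it are its children. To make that structure explicit, here is a small standard-library sketch that groups the second-level entries under their first-level heading. It is a simplified stand-in for what a wikitext parser does, and `group_nested_bullets` is a made-up helper.

```python
def group_nested_bullets(lines):
    """Group '**' entries under the preceding '*' heading.

    Returns a dict mapping each top-level entry to a list of its sub-entries.
    """
    groups = {}
    current = None
    for line in lines:
        stripped = line.strip()
        if not stripped.startswith("*"):
            continue  # not a list line
        # The number of leading asterisks is the nesting depth.
        text = stripped.lstrip("*")
        depth = len(stripped) - len(text)
        text = text.strip()
        if depth == 1:
            current = text
            groups[current] = []
        elif current is not None:
            groups[current].append(text)
    return groups


# A few lines copied from the output above.
sample = """\
* Public and state universities:
** [[University of Cologne]] (''Universität zu Köln'');
** [[German Sport University Cologne]] (''Deutsche Sporthochschule Köln'').
* Private colleges:
** [[Cologne Business School]];
""".splitlines()

for heading, entries in group_nested_bullets(sample).items():
    print(heading, len(entries))
```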

### 3.3 Get all universities and colleges

In [11]:
for lst in section_education.get_lists():
    if len(lst.get_lists()) == 0:
        # a "standalone" entry: just print it
        print(lst)
    else:
        # a list of universities / colleges: we need to iterate the list
        for entry in lst.get_lists():
            for item in entry.items:
                print(item)

 [[University of Cologne]] (''Universität zu Köln'');
 [[German Sport University Cologne]] (''Deutsche Sporthochschule Köln'').
 [[Cologne University of Applied Sciences]] (''"Technology, Arts, Sciences TH KöLN" Technische Hochschule Köln'');
 [[Köln International School of Design]];
 [[Hochschule für Musik Köln|Cologne University of Music and Dance]] ({{Lang|de|Hochschule für Musik und Tanz Köln}});
 [[Academy of Media Arts Cologne]] (''Kunsthochschule für Medien Köln'');
 Catholic University of Applied Sciences (''Katholische Hochschule Nordrhein-Westfalen'');
 [[Cologne Business School]];
 [[international filmschool cologne]] (''internationale filmschule köln'');
 Rhenish University of Applied Sciences (''Rheinische Fachhochschule Köln'')
 University of Applied Sciences Fresenius (''Hochschule Fresenius'')
 [[German Aerospace Centre]] (''Deutsches Zentrum für Luft- und Raumfahrt'');
 [[European Astronaut Centre]] (''EAC'') of the [[European Space Agency]];
 [[European College of Spo
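
The printed items still contain wiki markup (links, italics, trailing punctuation). If only the institution names are needed, a post-processing step along the following lines could be applied to each item. This is a regex-based sketch for illustration, not the parser's own link handling; `display_name` and the `WIKILINK` pattern are assumptions made here.

```python
import re

# [[target]] or [[target|label]]; the label, when present, is the displayed text.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def display_name(item):
    """Return a plain name: the first link's label or target, else the leading text."""
    match = WIKILINK.search(item)
    if match:
        return (match.group(2) or match.group(1)).strip()
    # Unlinked entry: keep the text before the first parenthesis
    # and drop surrounding spaces and trailing punctuation.
    return item.split("(", 1)[0].strip(" ;.")


# Items copied from the output above.
samples = [
    " [[University of Cologne]] (''Universität zu Köln'');",
    " [[Hochschule für Musik Köln|Cologne University of Music and Dance]] ({{Lang|de|Hochschule für Musik und Tanz Köln}});",
    " Catholic University of Applied Sciences (''Katholische Hochschule Nordrhein-Westfalen'');",
]
for item in samples:
    print(display_name(item))
```

For entries with several links, such as the European Astronaut Centre line, this sketch returns only the first link.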