This notebook requires the following dependencies: `httpx`, `beautifulsoup4`, `followthemoney`.

If you are using `pip`, run:
`pip install httpx beautifulsoup4; pip install git+https://github.com/dataresearchcenter/followthemoney.git`

If you are using `poetry`, run:
`poetry init; poetry add httpx beautifulsoup4; poetry add git+https://github.com/dataresearchcenter/followthemoney.git`

Run `$(poetry env activate)` to activate the virtual environment that `poetry` creates.

This notebook walks you through scraping Wikipedia for objects and their Wikidata IDs, then uploading them to OpenAleph. The result relies on an OpenAleph feature that displays preview images, which it fetches from Wikidata using the Wikidata ID.

First, fetch the entire contents of the [Wikipedia list of icebreaker ships](https://en.wikipedia.org/wiki/List_of_icebreakers), which groups the vessels by country.

Parse the contents using the [BeautifulSoup library](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Focus on the section of the HTML that contains the list of vessels per country.

In [2]:
import httpx
from bs4 import BeautifulSoup

wiki_dom_request = httpx.get('https://en.wikipedia.org/wiki/List_of_icebreakers')
# Fail loudly on an error response instead of leaving wiki_dom undefined
wiki_dom_request.raise_for_status()
wiki_dom = wiki_dom_request.text

wiki_dom_bs = BeautifulSoup(wiki_dom, 'html.parser')
content_dom = wiki_dom_bs.find("div", class_="mw-content-ltr")

Search for all the `<h2>` and `<ul>` elements. The `<h2>` elements contain the names of countries, and the unordered lists (`<ul>`) contain the ships. We skip intermediary classifications of ships. For this demo, we only want the icebreaker names and the URLs of their Wikipedia articles, so we skip ships that do not have a Wikipedia article yet.

In [37]:
data = {}

country = None  # no country heading seen yet

for tag in content_dom.css.select("h2, ul"):
    if tag.name == "h2":
        # Each <h2> starts a new country section
        country = tag.text
        if country not in data:
            data[country] = []
    elif country is not None:
        # Only look at the direct <li> children of this list
        for vessel_obj in tag.find_all('li', recursive=False):
            link = vessel_obj.find('a', href=True)
            if link:
                vessel_name = link.get("title")
                # Skip untitled links and links to articles that don't exist yet
                if not vessel_name or "page does not exist" in vessel_name:
                    continue
                vessel_url = f"https://en.wikipedia.org{link['href']}"
                data[country].append({"name": vessel_name, "url": vessel_url})
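
The grouping logic can be tried offline on a small HTML snippet. The following is a minimal sketch, not the notebook's own cell: `SAMPLE_HTML`, the `Example A/B/C` vessels and the `group_vessels` helper are all hypothetical and only mimic the layout described above. It uses `select`, which accepts the same CSS selector as `css.select`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page layout, not real Wikipedia markup
SAMPLE_HTML = """
<div class="mw-content-ltr">
  <h2>Finland</h2>
  <ul>
    <li><a href="/wiki/Example_A" title="Example A">Example A</a> (1954)</li>
    <li><a href="/w/index.php?title=Example_B&amp;redlink=1" title="Example B (page does not exist)">Example B</a></li>
    <li>No link here</li>
  </ul>
  <h2>Sweden</h2>
  <ul>
    <li><a href="/wiki/Example_C" title="Example C">Example C</a></li>
  </ul>
</div>
"""


def group_vessels(html):
    """Group linked vessels under the most recent <h2> country heading."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="mw-content-ltr")
    grouped = {}
    country = None
    for tag in content.select("h2, ul"):
        if tag.name == "h2":
            country = tag.get_text(strip=True)
            grouped.setdefault(country, [])
        elif country is not None:
            for item in tag.find_all("li", recursive=False):
                link = item.find("a", href=True)
                title = link.get("title") if link else None
                # Skip items without a link and links to missing articles
                if not title or "page does not exist" in title:
                    continue
                grouped[country].append(
                    {"name": title, "url": f"https://en.wikipedia.org{link['href']}"}
                )
    return grouped


grouped = group_vessels(SAMPLE_HTML)
print(grouped)
```

Only `Example A` and `Example C` survive: the item without a link and the link to a missing article are both dropped.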


Get the [Wikidata ID](https://en.wikipedia.org/wiki/Wikipedia:Finding_a_Wikidata_ID) of each vessel.

Some finesse is skipped here for the sake of a simple tutorial. The lookup does not return the ID of every icebreaker, due to differences between the name Wikidata uses and the name displayed in the list above.

In [43]:
import json

with open("debug.json", "w") as f:
    for country in data:
        for vessel_obj in data[country]:
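            name = vessel_obj["name"].replace(" ", "_")
            # Pass the query as params so httpx URL-encodes special characters in the title
            qid_request = httpx.get(
                "https://en.wikipedia.org/w/api.php",
                params={"action": "query", "prop": "pageprops", "titles": name, "format": "json"},
            )
            if qid_request.status_code == 200:
                response_json = qid_request.json()

                # Keep the raw responses, one JSON document per line, for debugging
                f.write(json.dumps(response_json))
                f.write("\n")

                pages = response_json["query"]["pages"]
                for page_id in pages:
                    if "pageprops" in pages[page_id]:
                        # The Wikidata ID is stored under the "wikibase_item" page property
                        qid = pages[page_id]["pageprops"].get("wikibase_item")
                        if qid:
                            vessel_obj["qid"] = qid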
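
To see what the lookup does without calling the API, the two steps (building the query URL and reading the ID out of the response) can be sketched separately. This is an illustration under assumptions: `build_query_url`, `extract_qids`, the `sample` response and the placeholder ID `Q0000001` are made up here, and the response only imitates the shape that the loop above reads.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build the pageprops query URL with the title safely URL-encoded."""
    params = {"action": "query", "prop": "pageprops", "titles": title, "format": "json"}
    return f"{API_URL}?{urlencode(params)}"


def extract_qids(response_json):
    """Return every wikibase_item value found in a pageprops response."""
    pages = response_json.get("query", {}).get("pages", {})
    qids = []
    for page in pages.values():
        qid = page.get("pageprops", {}).get("wikibase_item")
        if qid:
            qids.append(qid)
    return qids


# Made-up response: one page with a Wikidata ID, one page that has none
sample = {
    "query": {
        "pages": {
            "1": {"title": "Example A", "pageprops": {"wikibase_item": "Q0000001"}},
            "-1": {"title": "Missing page", "missing": ""},
        }
    }
}

print(build_query_url("Example A & B"))
print(extract_qids(sample))
```

Encoding the title matters for names that contain characters such as `&`, which would otherwise be read as the start of a new query parameter.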

Create the CSV file that we will use to create [Follow the Money](https://followthemoney.tech/explorer/) entities with the `Vessel` schema.

In [47]:
import csv

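# newline="" lets the csv module control line endings, as its documentation recommends
with open("icebreakers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "country", "qid", "url"])
    for country in data:
        for vessel in data[country]:
            # Vessels without a Wikidata ID get an empty qid cell
            writer.writerow([vessel['name'], country, vessel.get('qid'), vessel['url']])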
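
Because some lookups fail, it can help to check which rows ended up without a Wikidata ID before building entities. The sketch below assumes the four-column layout written above. The `vessels_missing_qid` helper and the sample rows are hypothetical, and an in-memory file stands in for `icebreakers.csv`.

```python
import csv
import io


def vessels_missing_qid(csv_file):
    """Return the names of all rows whose qid column is empty."""
    reader = csv.DictReader(csv_file)
    return [row["name"] for row in reader if not row["qid"]]


# Hypothetical rows in the same column order as icebreakers.csv
sample_csv = io.StringIO(
    "name,country,qid,url\r\n"
    "Example A,Finland,Q0000001,https://en.wikipedia.org/wiki/Example_A\r\n"
    "Example B,Sweden,,https://en.wikipedia.org/wiki/Example_B\r\n"
)

missing = vessels_missing_qid(sample_csv)
print(missing)
```

To run it against the real file, pass an open handle instead, for example `open("icebreakers.csv", newline="", encoding="utf-8")`.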
