##### Init our env.

We're going to need pywikibot to call the MediaWiki APIs (the ones behind wikis in the MediaWiki family). It's going to be the quickest and easiest way to interface with them. Import `DictWriter` (from the built-in `csv` module) and the built-in `json` module to interact with data files, and we're set.

In [4]:
import pywikibot
from csv import DictWriter
import json

##### Quickly set some vars...

All of our data will be stored in the [../data/categories](../data/categories) directory, a folder that is included in our .gitignore file.

"fieldnames" lists the columns of our .csv file, and each row in this file *must* have these three fields. This will make our life easier later on, when we need this information to fetch page content.

In [5]:
path = "../data/categories"
fieldnames = [
    "FAMILY",
    "LOCALITY",
    "CATEGORY"
]
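
To see what these field names buy us, here is a minimal sketch that writes to an in-memory buffer instead of the real file. The sample row values are illustrative assumptions, not data from this project. It relies only on standard `csv.DictWriter` behaviour: a row with a key outside `fieldnames` raises `ValueError`, while a missing key is silently written as an empty cell, so the "must have all three fields" rule is something we have to respect ourselves.

```python
import io
from csv import DictWriter

fieldnames = ["FAMILY", "LOCALITY", "CATEGORY"]

# In-memory stand-in for the .csv file, so nothing touches the disk.
buffer = io.StringIO()
writer = DictWriter(buffer, fieldnames, lineterminator="\n")
writer.writeheader()

# Illustrative row (hypothetical values).
writer.writerow({"FAMILY": "wikibooks", "LOCALITY": "en", "CATEGORY": "Category:Recipes"})

# A row with a key that is not in fieldnames is rejected before anything is written.
try:
    writer.writerow({"FAMILY": "wikibooks", "LOCALITY": "en", "TITLE": "Recipes"})
    rejected = False
except ValueError:
    rejected = True

csv_text = buffer.getvalue()
print(csv_text)
print("row with unknown key rejected:", rejected)
```

Writing the header row once also matters later: tools that read the file back generally treat the first line as column names.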

##### Query our "Categories"

On MediaWiki wikis, a category is a page that groups other pages; it is denoted by "Category:" before the title of said category. Because categories can contain subcategories, which can contain further subcategories (and may even loop back on each other), <u>there may be an infinite depth to any given category query</u>. We need to query each category and its subcategories to find most (if not all) related articles (wiki pages). We should store these categories and subcategories in our .csv file for later use.

The next python block opens our "categories.json" and "cat-list.csv" files. The "categories.json" file stores all of the categories in queue to be queried, and it has the following structure:

```yaml
[
    {
        "family": "wikibooks",
        "locality": "en",
        "categories": [
            "Recipes"
        ],
        "added": [
            "Recipes"
        ]
    },
    ...{} # more elements of similar structure.
]
```

<mark>"family"</mark>: Denotes the wiki family (wikipedia, mediawiki, wikibooks, etc...). Basically the website you visit when in search of information.<br>
<mark>"locality"</mark>: Denotes the locality. A short language code that determines the sub-wiki.<br><br>
<mark>"categories"</mark>: Contains the categories we'd like to query.<br>
<mark>"added"</mark>: All of the categories we've already added.

In [6]:
# newline="" lets the csv module control line endings itself.
with open(f"{path}/categories.json", "r+", encoding="utf-8") as categories_raw, \
        open(f"{path}/cat-list.csv", "a", encoding="utf-8", newline="") as pages_list:
    dictwriter = DictWriter(pages_list, fieldnames, lineterminator="\n")
    categories = json.load(categories_raw)

    for wiki in categories:
        # Only query categories that have not been processed yet.
        cat_pending = set(wiki["categories"]) - set(wiki["added"])

        site = pywikibot.Site(wiki["locality"], wiki["family"])

        cat_obj_list = [pywikibot.Category(site, cat) for cat in cat_pending]

        for category in cat_obj_list:
            print(f"Querying {category}")

            # Walk the subcategory tree with a bounded recursion depth.
            sub_cats = category.subcategories(recurse=7)

            # One row per category: the queried category first, then its subcategories.
            rows = [
                {
                    "FAMILY": wiki["family"],
                    "LOCALITY": wiki["locality"],
                    "CATEGORY": cat.title()
                }
                for cat in [category, *sub_cats]
            ]

            dictwriter.writerows(rows)

            # Record the category as processed.
            wiki["added"].append(category.title().replace("Category:", ""))

            # Rewrite the queue file in place so progress survives an interruption.
            categories_raw.seek(0)
            json.dump(categories, categories_raw, indent=4)
            categories_raw.truncate()
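
The warning about unbounded category depth, and the reason for passing a recursion limit, can be pictured with a small offline sketch. This is not pywikibot's implementation: `walk_subcategories`, `toy_graph`, and the way `max_depth` counts levels are assumptions made for illustration, and pywikibot's own `recurse` counting may differ. The sketch walks a hypothetical category graph breadth-first, stops expanding at the depth limit, and visits each category only once so that a loop between categories cannot keep it running forever.

```python
from collections import deque


def walk_subcategories(graph, root, max_depth):
    """Return root plus every category reachable within max_depth levels, each once."""
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand categories that sit at the depth limit
        for child in graph.get(name, []):
            if child not in seen:  # skip anything already visited (guards against cycles)
                seen.add(child)
                order.append(child)
                queue.append((child, depth + 1))
    return order


# Hypothetical category graph; "Soups" links back to "Recipes", forming a cycle.
toy_graph = {
    "Recipes": ["Soups", "Desserts"],
    "Soups": ["Broths", "Recipes"],
    "Broths": ["Stocks"],
}

print(walk_subcategories(toy_graph, "Recipes", max_depth=2))
```

With `max_depth=2` the walk stops before reaching "Stocks"; raising the limit picks it up, and the `seen` set keeps the cycle from repeating entries.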

# Filter data

In [1]:
import pandas

In [None]:
raw = pandas.read_csv("../data/categories/cat-list.csv")
raw = raw.drop_duplicates()

# index=False keeps the DataFrame's row index out of the output file.
raw.to_csv("../data/categories/cat-list-pandas.csv", index=False)