# Exploring Wikipedia using its API

Wikipedia needs no introduction: the online encyclopedia is now the default online reference for a lot of subjects.
Each page contains one or more links to other pages, so it is natural to apply the graph exploration methods we have seen previously. The whole graph of Wikipedia pages is quite big (the English version has ca. 7M pages and 500M links) and highly connected.

A great thing about Wikipedia is that almost all of its data is open. It is available either as [full dumps](https://dumps.wikimedia.org/) or [through the API](https://www.mediawiki.org/wiki/API:Main_page). Dumps are better for offline processing. There are a number of tools dedicated to processing data dumps from Wikipedia, e.g. [Sparkwiki](https://github.com/epfl-lts2/sparkwiki). In this tutorial we will be using the API to access the [English edition of Wikipedia](https://en.wikipedia.org), although adapting the code for another language is fairly trivial.

## Experimenting with the API using the sandbox
The API can be used in many ways; here we only use the `query` action to retrieve data about pages. The [documentation](https://www.mediawiki.org/w/api.php?action=help&modules=query) provides a list of properties that can be retrieved for each page.

The [API Sandbox](https://en.wikipedia.org/wiki/Special:ApiSandbox) helps to build and test queries quickly. For instance, to retrieve
the categories of Albert Einstein's Wikipedia page, you can try the query [here](https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=categories&titles=Albert%20Einstein) or use the following sandbox URL:
```
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=categories&titles=Albert%20Einstein
```

Once we are satisfied with the data we retrieve, the query can be sent directly to the API. Let us load the categories of Einstein's Wikipedia page using Python `requests`. We add the `cllimit` parameter to retrieve the first 20 categories.

In [None]:
import requests

In [None]:
r = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&prop=categories&titles=Albert%20Einstein&cllimit=20')
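
Rather than writing the query string by hand, the parameters can be kept in a dict and encoded automatically. The sketch below uses only the standard library; `API_URL` and `build_query_url` are names introduced here for illustration and are not part of any library.

```python
from urllib.parse import urlencode, quote

API_URL = 'https://en.wikipedia.org/w/api.php'  # illustrative constant

def build_query_url(params, api_url=API_URL):
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    return api_url + '?' + urlencode(params, quote_via=quote)

params = {
    'action': 'query',
    'format': 'json',
    'prop': 'categories',
    'titles': 'Albert Einstein',
    'cllimit': 20,
}
url = build_query_url(params)
print(url)
```

The resulting URL is the same as the one used in the cell above. Note that `requests.get` also accepts a `params` argument, so `requests.get(API_URL, params=params)` is an equivalent way to send the query.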

In [None]:
einstein_cats = r.json()
einstein_cats

The result of the query is JSON-formatted and is converted directly to a Python dict:

In [None]:
einstein_cats['query']['pages']['736']['title']
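
The page id `'736'` is hard-coded above, which only works for this particular page. A minimal sketch of a helper that walks over whatever pages are present in the response could look as follows. The function name `category_titles` and the sample data are made up for illustration; the sample only mimics the shape of a real response.

```python
def category_titles(response):
    """Return {page title: [category titles]} from a decoded prop=categories response."""
    result = {}
    # iterate over all pages instead of indexing by a known page id
    for page in response.get('query', {}).get('pages', {}).values():
        cats = [c['title'] for c in page.get('categories', [])]
        result[page.get('title')] = cats
    return result

# hand-written sample with the same structure as the API response (values are illustrative)
sample = {
    'query': {
        'pages': {
            '736': {
                'pageid': 736,
                'ns': 0,
                'title': 'Albert Einstein',
                'categories': [
                    {'ns': 14, 'title': 'Category:Example category A'},
                    {'ns': 14, 'title': 'Category:Example category B'},
                ],
            }
        }
    }
}
print(category_titles(sample))
```

With a live response, `category_titles(einstein_cats)` should return the same kind of mapping.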

The query only returns the first 20 categories of the page. While the `cllimit` parameter could be increased, there are cases where multiple queries are needed to retrieve all the data of a page. The response contains a `continue` key that tells us how to retrieve the following categories. In this case, appending `clcontinue=736|American_science_writers` to our query will retrieve the next categories.

In [None]:
rc = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&prop=categories&titles=Albert%20Einstein&cllimit=20&clcontinue=736|American_science_writers')

In [None]:
rc.json()
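
The continuation mechanism can be sketched as a loop that merges the content of the `continue` key into the parameters of the next request until the key disappears. In the sketch below the HTTP call is deliberately left as a parameter (`fetch`), and `fake_fetch` returns canned data so the loop can be run offline. Both names, as well as `query_all`, are hypothetical helpers and not part of `requests` or of the API.

```python
def query_all(fetch, params):
    """Yield successive responses, following the 'continue' key until it is absent."""
    params = dict(params)  # do not modify the caller's dict
    while True:
        response = fetch(params)
        yield response
        if 'continue' not in response:
            break
        # merge the continuation values (e.g. clcontinue) into the next request
        params.update(response['continue'])

def fake_fetch(params):
    # stand-in for an HTTP call returning the decoded JSON; the data is canned
    if 'clcontinue' not in params:
        return {
            'continue': {'clcontinue': '736|B', 'continue': '||'},
            'query': {'pages': {'736': {'title': 'Albert Einstein',
                                        'categories': [{'title': 'Category:A'}]}}},
        }
    return {
        'query': {'pages': {'736': {'title': 'Albert Einstein',
                                    'categories': [{'title': 'Category:B'}]}}},
    }

all_cats = []
query = {'action': 'query', 'prop': 'categories', 'titles': 'Albert Einstein'}
for response in query_all(fake_fetch, query):
    for page in response['query']['pages'].values():
        all_cats.extend(c['title'] for c in page.get('categories', []))
print(all_cats)
```

Replacing `fake_fetch` with a function that performs the real request is left to the reader.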

*Exercise*: experiment with queries yourself and retrieve data for different pages. The `categories`, `links` and `pageviews` properties are of particular interest to us.

## Using the Wikipedia-API package
While it is fairly simple to make complete retrieval of the data using multiple requests fully automatic, we will leave this as an exercise for the readers and use a helper library that will handle this for us. 

There are multiple options available; we will use the [Wikipedia-API](https://github.com/martin-majlis/Wikipedia-API) library. You may want to check its [documentation](https://wikipedia-api.readthedocs.io/en/latest/API.html).

In [None]:
import wikipediaapi

In [None]:
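# create the api object
# note: recent releases of Wikipedia-API expect a descriptive user agent as first argument,
# e.g. wikipediaapi.Wikipedia('MyProject (me@example.com)', 'en'); adapt the call if needed
api = wikipediaapi.Wikipedia('en')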

We can simply create a `Page` object and get its properties, which are lazily evaluated to limit the number of requests actually sent to the Wikipedia API.

In [None]:
albert = api.page('Albert Einstein')

In [None]:
len(albert.categories)

In [None]:
len(albert.links)

It is interesting to see which requests are sent. This can be done by increasing the logging level:

In [None]:
import sys
# helper function
def set_wikipediaapi_logging(level):
    wikipediaapi.logging.getLogger('wikipediaapi').handlers.clear() # ugly - remove handlers to avoid duplicates
    wikipediaapi.log.setLevel(level=level)
    # Set handler if you use Python in interactive mode
    out_hdlr = wikipediaapi.logging.StreamHandler(sys.stderr)
    out_hdlr.setFormatter(wikipediaapi.logging.Formatter('%(asctime)s %(message)s'))
    out_hdlr.setLevel(level)
    wikipediaapi.log.addHandler(out_hdlr)

set_wikipediaapi_logging(wikipediaapi.logging.INFO)

In [None]:
quantum = api.page('Quantum mechanics')

In [None]:
quantum.categories

In [None]:
quantum.summary

In [None]:
set_wikipediaapi_logging(wikipediaapi.logging.WARN) # reset logging

## Graph exploration
The methods presented in this tutorial can be used to explore the Wikipedia page graph. However, the package [littleballoffur](https://github.com/benedekrozemberczki/littleballoffur) used to demonstrate the concepts cannot be used directly with the API. Therefore those methods have been implemented into a different package: [spikexplore](https://github.com/epfl-lts2/spikexplore).

The package has no release (yet), so install it from GitHub using pip:
```
pip install git+https://github.com/epfl-lts2/spikexplore.git
```

In [None]:
import networkx as nx
from spikexplore import graph_explore
from spikexplore.backends.wikipedia import WikipediaNetwork
from spikexplore.config import SamplingConfig, GraphConfig, DataCollectionConfig, WikipediaConfig

Spikexplore supports different backends: NetworkX (mostly for testing, in this case you can use littleballoffur), Twitter (requires the creation of a developer account to obtain API keys), and Wikipedia.

You must first create the sampling backend you will use to acquire data:

In [None]:
wiki_config = WikipediaConfig(lang='en') # adapt to your favorite language
wiki_config.pages_ignored = [] # you can supply a list of page titles you want to ignore
sampling_backend = WikipediaNetwork(wiki_config)

A second configuration object contains the parameters used for the graph creation. In our case the page graph is not weighted, so keep the minimum edge weight at 1.

In [None]:
graph_config = GraphConfig(min_degree=1, min_weight=1)

Exploration parameters are stored in another object. In the example below, we randomly sample 10% of the edges encountered at each hop. The `max_nodes_per_hop` parameter provides additional fine-tuning to limit the growth of the graph by capping the number of new neighbors for each node. The `fireball` expansion makes the exploration close to the forest fire sampling seen previously.

In [None]:
data_collection_config = DataCollectionConfig(exploration_depth=2, random_subset_mode="percent",
                                              random_subset_size=10, expansion_type="fireball",
                                              max_nodes_per_hop=100)
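
To get an intuition for the effect of these two parameters, here is a toy sketch of percent-based neighbor sampling with a cap. It is only an illustration of the idea and is not taken from spikexplore, whose actual implementation may differ; `sample_neighbors` and its arguments are hypothetical.

```python
import random

def sample_neighbors(neighbors, percent=10, max_nodes=100, rng=None):
    """Pick a random subset of the neighbors: `percent` % of them, at most `max_nodes`."""
    rng = rng or random.Random()
    neighbors = list(neighbors)
    if not neighbors:
        return []
    # keep at least one neighbor so that the exploration can continue
    k = max(1, round(len(neighbors) * percent / 100))
    k = min(k, max_nodes, len(neighbors))
    return rng.sample(neighbors, k)

# a page with 1000 outgoing links (illustrative titles)
links = ['Page {}'.format(i) for i in range(1000)]
picked = sample_neighbors(links, percent=10, max_nodes=100, rng=random.Random(0))
capped = sample_neighbors(links, percent=10, max_nodes=30, rng=random.Random(0))
print(len(picked), len(capped))
```

With 1000 links, 10% sampling keeps 100 neighbors, and a lower cap reduces this number further.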

Finally, the graph and data collection configuration objects are combined into a single one that will be passed to spikexplore:

In [None]:
sampling_config = SamplingConfig(graph_config, data_collection_config)

Let us define the starting points used to explore the graph (you might want to adapt them to your needs):

In [None]:
initial_nodes = ['Albert Einstein', 'Quantum mechanics', 'Theory of relativity']

In [None]:
#set_wikipediaapi_logging(wikipediaapi.logging.INFO) # optional, peek under the hood to see what is happening !! PRINTS A LOT OF OUTPUT !!
graph_result, _ =  graph_explore.explore(sampling_backend, initial_nodes, sampling_config)
print('Collected {} nodes and {} edges'.format(graph_result.number_of_nodes(), graph_result.number_of_edges()))

In [None]:
nx.write_gexf(graph_result, 'wiki_einstein.gexf') # save result to view in Gephi

You can now open the result graph in Gephi and see if the communities make sense. Use different sets of parameters (sampling ratio, different initial nodes, etc.)

With only 3 starting nodes and 2 hops, the graph grows quickly and takes around 1 minute to collect. Adding an extra hop would take longer (but should remain reasonable); try it yourself if you have some time. Given the large number of connections in Wikipedia (the Albert Einstein page has more than 1000 links), exploring using Snowball sampling would lead to a much larger number of connections. You can try it by setting `random_subset_size` to 100 and `max_nodes_per_hop` to a very large value to avoid being capped in any way.

In [None]:
# WARNING - this is going to take some time to collect (ca. 10 minutes) !!
data_collection_config_full = DataCollectionConfig(exploration_depth=2, random_subset_mode="percent",
                                              random_subset_size=100, expansion_type="coreball",
                                              degree=2, max_nodes_per_hop=1000000)
sampling_config_full = SamplingConfig(graph_config, data_collection_config_full)
# Uncomment this if you want to collect the full neighborhood
# graph_full, _ = graph_explore.explore(sampling_backend, initial_nodes, sampling_config_full)
# print('Collected {} nodes and {} edges'.format(graph_full.number_of_nodes(), graph_full.number_of_edges()))

Collecting the full neighborhood of the 3 initial nodes using Snowball sampling yields a graph with ca. 1800 nodes and 126000 edges! You can download the resulting graph using [this link](https://drive.switch.ch/index.php/s/qJr6qqBLnOEOR2f).

## Adding features to the graph
Thanks to the Wikipedia API, it is possible to retrieve the number of times a page has been viewed using the `pageviews` property. Unfortunately, this property is not supported by the Wikipedia-API package, so we need a small helper function to get those values.

In [None]:
import numpy as np
from urllib.parse import quote

# Get the pageview data for a list of pages. This is not efficient: requests for several chunks
# could be sent simultaneously. There is no error checking, so bad input will raise exceptions.
def get_pageviews(page_titles, number_of_days=7, lang='en'):
    pageviews = {}
    titles = list(page_titles) # accept any iterable, e.g. the node view of a NetworkX graph
    # the pageview query supports at most 50 pages per request (in practice MUCH less), so split the titles into smaller chunks
    chunk_size = 5 # you can try bigger values, but you may get empty results

    # get pageview data for each chunk (slicing also works when there are fewer titles than chunk_size)
    for i in range(0, len(titles), chunk_size):
        ns = titles[i:i + chunk_size]
        pages = quote('|'.join(ns))
        url = 'https://{}.wikipedia.org/w/api.php?action=query&format=json&prop=pageviews&titles={}&pvipdays={}'.format(lang, pages, number_of_days)
        res = requests.get(url).json()
        for v in res['query']['pages'].values():
            # some pageview values might be None, or the key might be missing: do not take them into account
            # (filter(None, ...) also drops days with zero views)
            pv = list(filter(None, (v.get('pageviews') or {}).values()))
            pageviews[v['title']] = pv

    return pageviews

Let us retrieve the page views for all the nodes in the graph over the last 10 days:

In [None]:
graph_stats = get_pageviews(graph_result.nodes(), number_of_days=10)
graph_stats

We will now compute the mean of the pageviews for each page and store it as an attribute in the graph

In [None]:
pageviews_mean_graph = {k: np.mean(v) for k,v in graph_stats.items()}
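
Note that a page for which no pageview value was returned ends up with an empty list, and `np.mean` of an empty list is `nan` (with a warning). One possible way to handle this, sketched below in plain Python, is to fall back to a default value. The helper `mean_or_default` and the choice of `0.0` as default are assumptions made for this illustration, and the sample data is made up.

```python
def mean_or_default(values, default=0.0):
    """Mean of the values, ignoring None entries; `default` if nothing is left."""
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else default

# illustrative data with the same shape as the output of get_pageviews
stats = {'Page A': [10, 20, 30], 'Page B': []}
means = {title: mean_or_default(views) for title, views in stats.items()}
print(means)
```

Whether a missing value should count as zero views or be excluded altogether depends on the analysis you have in mind.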

Store those values as a node attribute, and save the resulting graph:

In [None]:
nx.set_node_attributes(graph_result, pageviews_mean_graph, name='mean_pageviews')
nx.write_gexf(graph_result, 'wiki_einstein_pageviews.gexf')

You can now use Gephi to open the graph and set the size of each node according to the number of visits it received. The `mean_pageviews` attribute can also be used to remove nodes that have little importance visit-wise.

In [None]:
print('Average pageviews range from min={} to max={}'.format(np.min(list(pageviews_mean_graph.values())), np.max(list(pageviews_mean_graph.values()))))

*Exercise*: Acquire a bigger graph with pageviews data and use it to remove nodes that receive few visits. 