In [None]:
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## In this short notebook, we view the Wikispeedia graph through the lens of an ego subgraph within it.

### Try to find a subgraph that's small enough to plot but large enough to be meaningful

Build the graph from 2 saved DataFrames

In [7]:
linksDF = pd.read_csv('linksDF.csv', index_col=0)
linksDF.head(2)

|   | linkSource | linkTarget | cosDist | doc2vecDist |
|---|---|---|---|---|
| 0 | Áedán_mac_Gabráin | Bede | 0.906869 | 0.626857 |
| 1 | Áedán_mac_Gabráin | Columba | 0.771878 | 0.376003 |


In [8]:
degreeDF = pd.read_csv('degreeDF.csv', index_col=0)
degreeDF.head(2)

|   | topic | in_dgr | out_dgr | generality |
|---|---|---|---|---|
| 0 | Second_Crusade | 9 | 26 | 0.346154 |
| 1 | Navassa_Island | 4 | 56 | 0.071429 |
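
The two rows above are consistent with `generality` being the ratio of in-degree to out-degree (9/26 ≈ 0.346154 and 4/56 ≈ 0.071429). Here is a minimal sketch of that ratio, assuming this is how the column was computed. The helper name `generality` and the choice to return 0 for a zero out-degree are assumptions made for illustration.

```python
def generality(in_dgr, out_dgr):
    """In-links per out-link (assumed definition, inferred from the rows above)."""
    # Assumption: a topic with no out-links gets 0 instead of raising ZeroDivisionError.
    if out_dgr == 0:
        return 0.0
    return in_dgr / out_dgr

# Degrees copied from the two rows shown above.
rows = {'Second_Crusade': (9, 26), 'Navassa_Island': (4, 56)}
scores = {topic: round(generality(i, o), 6) for topic, (i, o) in rows.items()}
print(scores)
```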


In [14]:
# quick lookup table for node attr (generality of node topic)
gens = {top:gen for top,gen in zip(degreeDF.topic, degreeDF.generality)}
# all link ends
linknames = set(linksDF.linkSource).union(set(linksDF.linkTarget))

DG = nx.DiGraph()
for topic in linknames:
    # topics with no out degrees don't have generality scores yet.  Set to 0.
    if topic not in gens: gens[topic]=0
    DG.add_node(topic, gen=gens[topic])
for s, t, d in zip(linksDF.linkSource, linksDF.linkTarget, linksDF.doc2vecDist):
    # d_gen > 0 means the link leads to a more general topic than its source
    delta_gen = gens[t] - gens[s]
    DG.add_edge(s, t, d2v=d, d_gen=delta_gen)

In [18]:
print(nx.info(DG))

Name: 
Type: DiGraph
Number of nodes: 4592
Number of edges: 119882
Average in degree:  26.1067
Average out degree:  26.1067


In [20]:
DG.node['France']

{'gen': 11.282352941176471}

In [27]:
DG.edge['France']['Paris']

{'d2v': 0.3606382791996672, 'd_gen': -8.532352941176471}
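
The accessors used in these cells (`DG.node[...]`, `DG.edge[...][...]`, and `nx.info`) come from older NetworkX releases and are not available in current ones. Below is a hedged sketch of equivalents that should work on NetworkX 2.x and 3.x, shown on a two-node toy graph `G` (a stand-in for `DG`) that reuses the France and Paris values printed above.

```python
import networkx as nx

# Toy stand-in for DG, using the attribute values shown in the outputs above.
G = nx.DiGraph()
G.add_node('France', gen=11.282352941176471)
G.add_edge('France', 'Paris', d2v=0.3606382791996672, d_gen=-8.532352941176471)

node_attrs = G.nodes['France']            # replaces DG.node['France']
edge_attrs = G.edges['France', 'Paris']   # replaces DG.edge['France']['Paris']
same_attrs = G['France']['Paris']         # adjacency lookup, same attribute dict

# Summary numbers that nx.info used to print.
n_nodes, n_edges = G.number_of_nodes(), G.number_of_edges()
avg_degree = n_edges / n_nodes            # average in-degree equals average out-degree
print(node_attrs, edge_attrs, n_nodes, n_edges, avg_degree)
```

For the full graph, the same ratio gives 119882 / 4592 ≈ 26.1067, matching the averages reported above.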

Note that a negative `d_gen` encodes moving from a more general topic to a less general one.
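
As a quick worked check of that sign convention, the France → Paris edge can be used to recover Paris's generality from the two values printed above. The variable names are illustrative only.

```python
# Values copied from the DG.node['France'] and DG.edge['France']['Paris'] outputs.
gen_france = 11.282352941176471
d_gen_france_to_paris = -8.532352941176471

# d_gen = gen(target) - gen(source), so gen(target) = gen(source) + d_gen.
gen_paris = gen_france + d_gen_france_to_paris
moves_to_less_general = d_gen_france_to_paris < 0

print(round(gen_paris, 2), moves_to_less_general)  # 2.75 True
```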

#### Let's try an ego graph for Pompeii.  Since shortest paths are commonly 4 links long, let's use a radius of 2 in each direction, into and out of Pompeii.  That could give us a picture of what sorts of optimal paths Pompeii lies on.

In [72]:
DG['Pompeii']  # links from Pompeii to these

{'1st_century': {'d2v': 0.7858200710957813, 'd_gen': -0.525987525987526},
 'Ancient_Rome': {'d2v': 0.5553679442127262, 'd_gen': 0.2875874125874125},
 'Archaeology': {'d2v': 0.6455686268960943, 'd_gen': 0.02146690518783556},
 'BBC': {'d2v': 0.8448764072825363, 'd_gen': 2.555769230769231},
 'Drought': {'d2v': 0.7905698660449316, 'd_gen': 1.764102564102564},
 'Earthquake': {'d2v': 0.8676231723336092, 'd_gen': 2.5807692307692305},
 'Great_Britain': {'d2v': 0.9272456481169232, 'd_gen': 3.373626373626374},
 'Greece': {'d2v': 0.7116976838841338, 'd_gen': 0.38866396761133615},
 'Italy': {'d2v': 0.6197929368858993, 'd_gen': 4.701357466063349},
 'Latin': {'d2v': 0.6359986031936685, 'd_gen': 13.50663129973475},
 'Mount_Vesuvius': {'d2v': 0.3994719493645049, 'd_gen': -1.5292307692307692},
 'Volcano': {'d2v': 0.7033479267248763, 'd_gen': 0.15614236509758928},
 'World_Heritage_Site': {'d2v': 0.8588057016025433,
  'd_gen': -0.16208791208791196}}

In [78]:
pompeii = nx.ego_graph(DG, 'Pompeii', radius=2, undirected=True, distance=None) # using path steps as distance
pompeii.number_of_nodes()

1962

Pompeii has 23 in-links and 13 out-links, but each of those 36 connections brings in so many further nodes that the radius-2 ego graph balloons to 1962 nodes.  For a more readable plot, let's shrink the radius to 1, keeping only Pompeii's direct neighbors.  Afterwards we'll use doc2vec distance as the metric instead of step count and see how many more nodes that adds, since those distances are usually less than 1.

In [82]:
pompeii = nx.ego_graph(DG, 'Pompeii', radius=1, undirected=True, distance=None) # using path steps as distance
pompeii.number_of_nodes()

34

That gives 34 nodes rather than 37 (36 connections plus Pompeii itself), so 3 topics are linked to Pompeii in both directions.
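
The node count can be checked directly from predecessor and successor sets. This is a small sketch on a toy graph (the node names `A` to `E`, with `C` as the center, are made up). Applied to `DG` and `'Pompeii'`, the same arithmetic would read 23 + 13 - 3 + 1 = 34.

```python
import networkx as nx

# Toy graph: three in-links to C, two out-links from C, one neighbor (D) in both roles.
toy = nx.DiGraph()
toy.add_edges_from([('A', 'C'), ('B', 'C'), ('D', 'C'), ('C', 'D'), ('C', 'E')])

preds = set(toy.predecessors('C'))   # topics linking to C
succs = set(toy.successors('C'))     # topics C links to
reciprocal = preds & succs           # linked in both directions

# Distinct neighbors plus the center itself.
expected_nodes = len(preds) + len(succs) - len(reciprocal) + 1
ego = nx.ego_graph(toy, 'C', radius=1, undirected=True)
print(expected_nodes, ego.number_of_nodes(), sorted(reciprocal))
```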

In [83]:
nx.write_graphml(pompeii, 'pompeii.graphml')

In [84]:
pompeiier = nx.ego_graph(DG, 'Pompeii', radius=1, undirected=True, distance='d2v')
pompeiier.number_of_nodes()

73
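
The jump from 34 to 73 nodes comes from how `radius` is interpreted: with `distance=None` it counts steps, while with `distance='d2v'` it bounds the summed edge weight along a path, so a node two links away is kept whenever the two distances add up to no more than 1. Here is a minimal sketch on a toy graph; the node names and `d2v` values are invented for illustration and are not taken from the Wikispeedia data.

```python
import networkx as nx

# Invented weights: hub -> near -> near_nbr sums to 0.9, hub -> far -> far_nbr to 1.35.
toy = nx.DiGraph()
toy.add_edge('hub', 'near', d2v=0.6)
toy.add_edge('near', 'near_nbr', d2v=0.3)
toy.add_edge('hub', 'far', d2v=0.85)
toy.add_edge('far', 'far_nbr', d2v=0.5)

# radius=1 counted in steps: direct neighbors only.
by_hops = nx.ego_graph(toy, 'hub', radius=1, undirected=True)
# radius=1 counted in summed d2v: anything within total distance 1.
by_d2v = nx.ego_graph(toy, 'hub', radius=1, undirected=True, distance='d2v')

print(sorted(by_hops.nodes))
print(sorted(by_d2v.nodes))
```

In this toy case `near_nbr` enters the weighted ego graph even though it is two links from the center, while `far_nbr` stays out because its path length exceeds the radius.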

In [85]:
nx.write_graphml(pompeiier, 'pompeiier.graphml')

**Here's the smaller ego graph, showing all topics one step away from Pompeii, in either direction.  The larger the node, the higher its generality score in the Wikispeedia graph, i.e. the more in-links it has relative to out-links.  Nodes and edges are a darker red when the topic is semantically closer to Pompeii, according to the cosine distances computed from our Doc2Vec embeddings.**

![Pompeii](https://raw.githubusercontent.com/ebhtra/gory-graph/main/Wikispeedia/images/pompeii.svg)

**The smaller graph makes it clear that the lighter nodes, i.e. the topics that are less semantically close to Pompeii, tend to point to Pompeii without Pompeii linking back.  The topics that Pompeii points to, on the other hand, tend to be more closely related to it (darker hued).  Nothing forces that asymmetry to occur, unless Pompeii is a more general topic than the nodes that point to it.**
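
The visual impression could also be checked numerically by comparing the mean `d2v` of in-links with that of out-links. This is a rough sketch under stated assumptions: the helper `mean_d2v_by_direction` is hypothetical, and the toy graph's node names and distances are invented, so the printed numbers say nothing about the real data. Calling the helper on `DG` and `'Pompeii'` would give the actual comparison.

```python
from statistics import mean

import networkx as nx

def mean_d2v_by_direction(G, node):
    """Return (mean d2v over in-links, mean d2v over out-links) for `node`."""
    in_d = [d for _, _, d in G.in_edges(node, data='d2v')]
    out_d = [d for _, _, d in G.out_edges(node, data='d2v')]
    return mean(in_d), mean(out_d)

# Toy graph with invented distances: sources are semantically far, targets are near.
toy = nx.DiGraph()
toy.add_edge('src_1', 'center', d2v=0.9)
toy.add_edge('src_2', 'center', d2v=0.8)
toy.add_edge('center', 'dst_1', d2v=0.4)
toy.add_edge('center', 'dst_2', d2v=0.6)

in_mean, out_mean = mean_d2v_by_direction(toy, 'center')
print(in_mean > out_mean)  # True for this toy graph
```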

In [98]:
# Pompeii generality
DG.node['Pompeii']

{'gen': 1.7692307692307692}

In [99]:
# Median generality of whole Wikispeedia graph
degreeDF.generality.median()

0.4705882352941176

**So Pompeii (1.77) is much more general than the median topic (0.47), at roughly 3.8 times the median.  The topics near it also tend to be fairly general, though: nodes that link to it without being particularly close to it semantically, such as Literacy, Advertising, Lemon, and Book, are similar to it in size.**

**The larger graph gets cluttered fast, being so connected. It has more than twice as many topics (73 versus 34), since it includes any node reachable along links whose doc2vec distances sum to no more than the radius of 1.**

**This brings a couple of broader topics into the mix, with Spain and Portugal now on the radar by virtue of being very closely related to topics that Pompeii is itself very closely related to (Italy and Latin, e.g.).  Many more specific nodes are now within reach for the same reason (individual Roman emperors, e.g.).  Most of those smaller nodes are still closely related to the text of Pompeii, as their medium hues show, yet they don't happen to be linked directly to Pompeii, since larger topics like "Italy" are more convenient links for Wiki-historians.**

![pompeiier](https://raw.githubusercontent.com/ebhtra/gory-graph/main/Wikispeedia/images/pompeiier.svg)