### Some code and concepts belong to the main project thread, yet would make that thread harder to read if kept in the same notebook. This short notebook isolates one of them: it simply explores the in-degrees and out-degrees of the Wikispeedia articles.

**Rather than estimating how general a topic is from its WordNet synset depth, we could try an approach better suited to our current application. It's conceivable that a general topic has many other topics pointing to it (a high in-degree, in graph terms) compared with how many other topics it points to (its out-degree). So let's take a look at the numbers for each article.**
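
To make the in-degree versus out-degree idea concrete, here is a minimal sketch on a toy graph. The article names and links are made up for illustration and are not taken from the Wikispeedia data.

```python
import networkx as nx

# Toy directed graph with invented articles (not Wikispeedia data)
toy = nx.DiGraph()
toy.add_edges_from([
    ("Cat", "Animal"),
    ("Dog", "Animal"),
    ("Fish", "Animal"),
    ("Animal", "Life"),
])

# Count incoming and outgoing links per node
toy_in = dict(toy.in_degree())
toy_out = dict(toy.out_degree())

# "Animal" is pointed to by three articles and points to one
print(toy_in["Animal"], toy_out["Animal"])  # 3 1
```

In this toy case the broader topic ("Animal") collects more incoming links than it sends out, which is the pattern we hope to see in the real data.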

In [6]:
import pandas as pd
import networkx as nx

In [5]:
docDF = pd.read_csv('doctextDF.csv', index_col=0)
docDF.head(2)

Unnamed: 0,article,texts
0,Second_Crusade,second crusade military history war religio...
1,Navassa_Island,navassa island north american geography nav...


In [7]:
links = pd.read_csv('linksDF.csv', index_col=0)
links.head(2)

Unnamed: 0,linkSource,linkTarget,cosDist,doc2vecDist
0,Áedán_mac_Gabráin,Bede,0.906869,0.626857
1,Áedán_mac_Gabráin,Columba,0.771878,0.376003


In [8]:
DG = nx.DiGraph()
# create map from article names to tfidf distance matrix indexes
doc_ids = {article:idx for idx, article in enumerate(docDF.article)}

for source, target, d2v in zip(links.linkSource, links.linkTarget, links.doc2vecDist):
    # cast to a plain Python float: can't use numpy floats with graphml
    DG.add_edge(source, target, weight=1000 * float(d2v))

In [10]:
degreeDF = pd.DataFrame({'topic': docDF.article,
                         'in_dgr': [DG.in_degree(art) for art in docDF.article],
                         'out_dgr': [DG.out_degree(art) for art in docDF.article]})
degreeDF.head()

Unnamed: 0,topic,in_dgr,out_dgr
0,Second_Crusade,9,26
1,Navassa_Island,4,56
2,Evan_Rachel_Wood,0,7
3,Tropical_Storm_Henri_(2003),1,12
4,Final_Fantasy_Adventure,0,5


Although a node's total degree relative to the other nodes ("degree centrality") does convey some information about it, it's the ratio of in-degree to out-degree that most interests us here.
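
As a small sketch of why the ratio can say more than the total, the hypothetical graph below has two nodes, "Hub" and "Stub" (invented names), with the same total degree but opposite link directions. It uses networkx's `degree_centrality`, which divides each node's total degree by the number of other nodes.

```python
import networkx as nx

# Hypothetical graph: "Hub" and "Stub" are invented node names
g = nx.DiGraph()
g.add_edges_from([
    ("A", "Hub"), ("B", "Hub"), ("C", "Hub"), ("Hub", "D"),
    ("D", "Stub"), ("Stub", "A"), ("Stub", "B"), ("Stub", "C"),
])

# Degree centrality: (in-degree + out-degree) / (number of nodes - 1)
centrality = nx.degree_centrality(g)

# In/out ratio for the two nodes of interest
ratio = {n: g.in_degree(n) / g.out_degree(n) for n in ("Hub", "Stub")}

print(centrality["Hub"], centrality["Stub"])  # identical totals
print(ratio)                                  # very different ratios
```

Both nodes get the same degree centrality, yet "Hub" receives three links for each one it sends, while "Stub" does the reverse.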

In [12]:
degreeDF['generality'] = degreeDF.in_dgr / degreeDF.out_dgr
degreeDF.head()

Unnamed: 0,topic,in_dgr,out_dgr,generality
0,Second_Crusade,9,26,0.346154
1,Navassa_Island,4,56,0.071429
2,Evan_Rachel_Wood,0,7,0.0
3,Tropical_Storm_Henri_(2003),1,12,0.083333
4,Final_Fantasy_Adventure,0,5,0.0


In [13]:
degreeDF.generality.mean()

inf

Well, we can't have infinity now, can we? The infinite mean comes from articles with an out-degree of 0, where the ratio divides by zero. It seems safe to claim that any article that doesn't point to another serves no purpose here, since we want to find out how people get from one idea to another. What's very much worth dwelling on, though, is that these 4600 articles were chosen to serve the purposes of the Wikispeedia game, in a way that makes it convenient to recognize any patterns that arise from the game-playing. So articles that have no out-degree in our provided `links` file aren't necessarily dead ends in the actual, entire Wikipedia graph; they are dead ends only in our provided microcosm (Wikispeedia). Nevertheless, we can't let them ruin our analysis, so we'll simply drop them. Note that nodes with an in-degree of 0 are still very useful to us, as we would love to know how players get from them to other nodes.
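
A minimal sketch of how the division behaves in pandas, using made-up degree values rather than the real articles. It also covers a case the paragraph above doesn't mention: a node with no links in either direction gives 0/0, which pandas reports as NaN and skips when averaging.

```python
import pandas as pd

# Made-up degrees covering three cases: normal, x/0, and 0/0
toy_df = pd.DataFrame({"in_dgr": [9, 3, 0], "out_dgr": [26, 0, 0]})
toy_ratio = toy_df.in_dgr / toy_df.out_dgr  # 9/26, inf, NaN

# The NaN is skipped by mean(), but the inf is not
print(toy_ratio.mean())

# Keeping only rows with at least one outgoing link removes both cases
kept = toy_ratio[toy_df.out_dgr > 0]
print(kept.mean())
```

Filtering on the out-degree, rather than replacing the infinite values afterwards, keeps the rule tied to what it means: the article has no outgoing links in this dataset.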

In [14]:
degreeDF = degreeDF[degreeDF.out_dgr > 0] # No more infinity from dividing by zero
degreeDF.generality.mean()

0.8594144959776306

That's surprisingly high if our first 5 rows are any indicator. They suggest that "Second Crusade" is much more general than "Final Fantasy Adventure", but 5 rows are a tiny sample, so let's see why the overall mean is so much higher.
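
One possible explanation, offered as a guess rather than a finding, is that a few articles with very large ratios pull the mean far above the typical value. The sketch below uses invented numbers (small ratios similar to the first rows, plus one made-up outlier) to show how the mean and the median can diverge in that situation.

```python
import pandas as pd

# Five small ratios like those in the first rows, plus one invented outlier
sample = pd.Series([0.35, 0.07, 0.0, 0.08, 0.0, 40.0])

print(sample.mean())    # pulled up by the single large value
print(sample.median())  # stays close to the typical small ratios
```

If the real data behave like this, comparing `degreeDF.generality.median()` with the mean would show the same gap.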

In [15]:
degreeDF.sort_values('generality', ascending=False).head(20)

Unnamed: 0,topic,in_dgr,out_dgr,generality
3395,Chordate,338,8,42.25
1751,Binomial_nomenclature,295,12,24.583333
496,Scientific_classification,519,25,20.76
1960,Climate,138,8,17.25
3021,Animal,492,29,16.965517
1234,DVD,63,4,15.75
2912,Latin,443,29,15.275862
3680,Gas,72,5,14.4
880,Cultivar,41,3,13.666667
4483,Currency,291,22,13.227273


That is absolutely fascinating, in many ways. For one thing, we see topics that have nothing to do with geographic or temporal dimensions, unlike the topics highlighted by the key centrality measures in our main analysis. For another, when we do see a specific geographical location, it's France rather than the USA or Britain. As for time, we get "Time zone", not "20th century". Another surprise is that "DVD" ranks as so general. Why do so many topics point to DVD without DVD reciprocating? It's almost as if DVD had been a future star that led to nothing!

Having seen the head of this DataFrame, hopefully you're curious about its tail.

In [16]:
degreeDF.sort_values(['generality', 'out_dgr']).head(20)

Unnamed: 0,topic,in_dgr,out_dgr,generality
38,Terik,0,1,0.0
1136,Pro_Milone,0,1,0.0
2290,Emma_Roberts,0,1,0.0
2358,Pere_Marquette_1225,0,1,0.0
2931,All_your_base_are_belong_to_us,0,1,0.0
3333,Human_abdomen,0,1,0.0
4027,White_Mountain_art,0,1,0.0
4268,T._D._Judah,0,1,0.0
4363,Dewey_Square,0,1,0.0
4457,Sunol_Water_Temple,0,1,0.0


Good luck if you happen to get one of those as the source of your randomized Wikispeedia game, since each has only a single outgoing link. And if one of them is the target, you have no chance of reaching it, since its in-degree is 0.

#### An important part of modularizing this notebook is saving the DataFrame, so that its processed data can be used in the main thread.

In [18]:
degreeDF.to_csv('degreeDF.csv')