In [None]:
'''
The libraries we used:
'''
import numpy as np
import pandas as pd
import math
import networkx as nx
import csv
from statistics import median, mean
import queue as Q
import threading
from numba import autojit
import functs as fnc


## Research question 1

Opening the file that contains the information about the edges of our graph.

In [58]:
nodesfile = np.loadtxt('wiki-topcats-reduced.txt', delimiter="\t", dtype =int)

To find out what kind of graph the file contains, we first check for self-loops.

In [59]:
self_count = 0
for row in nodesfile:
    if row[0] == row[1]:
        self_count += 1
self_count

777

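Next we check whether the graph is directed by counting how often the first node of the file occurs in each of the two columns. If the number of occurrences were the same, this would be consistent with an undirected graph; if not, we treat the graph as directed. This is a quick heuristic on a single node rather than a proof.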

In [61]:
dir_count1 = 0
dir_count2 = 0
for node1 in nodesfile:
    if nodesfile[0][0] == node1[0]:
        dir_count1 +=1
    if nodesfile[0][0] == node1[1]:
        dir_count2 +=1
if dir_count1 != dir_count2:
    print("directed graph")
else:
    print("no directed graph")

directed graph
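
The single-node comparison is only a heuristic. A more direct check, sketched here on a toy edge list, tests whether every edge (u, v) also appears as (v, u); the same pass can count the self-loops. The names `edge_stats` and `toy_edges` are illustrative and not part of the notebook.

```python
def edge_stats(edges):
    """Count self-loops and check whether every edge has a reverse edge."""
    edge_set = {(int(u), int(v)) for u, v in edges}
    self_loops = sum(1 for u, v in edges if u == v)
    # an undirected edge list stored in both directions is symmetric
    symmetric = all((v, u) in edge_set for u, v in edge_set)
    return self_loops, symmetric

# toy data: (1, 3) has no reverse edge, (3, 3) is a self-loop
toy_edges = [(1, 2), (2, 1), (3, 3), (1, 3)]
self_loops, symmetric = edge_stats(toy_edges)
print(self_loops, "self-loops;", "undirected" if symmetric else "directed")
```

The same function should also accept the rows of a two-column integer array such as `nodesfile`, since it only iterates over pairs.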


Now we create the graph using the networkx library (the TAs said on Slack that we can use networkx for this purpose). We set the graph type to MultiDiGraph, since we have already seen that the graph is directed and contains self-loops; this type also keeps parallel edges if the file contains any. Because we create the edges from our file, networkx creates the nodes automatically.

In [62]:
graph = nx.MultiDiGraph()
for nodes in nodesfile:
    graph.add_edge(nodes[0], nodes[1])

In [63]:
nx.info(graph)

'Name: \nType: MultiDiGraph\nNumber of nodes: 461193\nNumber of edges: 2645247\nAverage in degree:   5.7357\nAverage out degree:   5.7357'

We get the number of nodes by collecting the unique values in the array and taking the length of the result.

In [64]:
no_nodes = len(np.unique(nodesfile))
no_nodes

461193

The number of edges can also be read off as the number of rows in our array, since every row represents one edge.


In [65]:
no_edges = len(nodesfile)
no_edges

2645247

We first tried to evaluate the number of edges and nodes directly from the given file. As the outputs above show, there are no differences in the number of edges we assumed and the number of edges counted by networkx. 

We calculate the average degree by dividing the total number of edges (number of rows in our array) by the number of nodes (unique values in our array). As you can see here, our result agrees with the average in- and out-degree of 5.7357 that nx.info() gave us.

In [66]:
avg_degree = no_edges/no_nodes
avg_degree

5.735661642739591

Calculating the density of our graph

In [67]:
dens = (no_edges)/(no_nodes*(no_nodes-1))
dens

1.2436602635647606e-05

We calculate the density as the number of edges we read from the file earlier divided by the number of possible directed edges, n(n-1). A value of 1 would indicate a complete graph; as our result is very low, we conclude that the graph is very sparse.
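
The two formulas used in this section can be written as small helper functions. This is a minimal sketch; the function names are made up for illustration, and the numbers are the node and edge counts printed earlier.

```python
def directed_density(n_nodes, n_edges):
    """Share of the n*(n-1) possible directed edges (self-loops excluded) that exist."""
    return n_edges / (n_nodes * (n_nodes - 1))

def average_degree(n_nodes, n_edges):
    """Each edge adds one out-degree and one in-degree, so both averages equal m/n."""
    return n_edges / n_nodes

print(directed_density(461193, 2645247))
print(average_degree(461193, 2645247))
```

Note that the denominator does not count self-loops while the edge count includes the 777 found earlier, so the density is a very close approximation rather than an exact share.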

## Research question 2

First we have to do some preparation for our calculations of the block-ranking and the score of every article in every category.

Creating a list of unique nodes to be used later. 

In [None]:
nodes = np.unique(nodesfile)

Creating a dictionary out of the given list by removing "Category:" and "\n" and then splitting the lines. The first value of each line is the name of the category (we use it as the key of our dictionary). The following values are the nodes contained in this category.

In [None]:
categories = open('wiki-topcats-categories.txt', 'r', encoding = "utf8").readlines()

In [None]:
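cats = {}
for line in categories:
    # str.strip("Category:") strips a set of characters, not a prefix,
    # so the prefix is removed explicitly
    if line.startswith("Category:"):
        line = line[len("Category:"):]
    temp = line.split()  # split() also drops the trailing "\n"
    temp[0] = temp[0].rstrip(";")
    cats[temp[0]] = temp[1:]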

In [None]:
len(cats)
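
One detail of the parsing step is easy to miss: `str.strip` treats its argument as a set of characters, not as a prefix, so it can also eat the first letters of a category name. The sketch below shows the effect on a made-up line and a prefix-based alternative. It assumes the line layout `Category:Name; id id ...`; `parse_category_line` is a hypothetical helper, not part of `functs.py`.

```python
def parse_category_line(line, prefix="Category:"):
    """Split 'Category:Name; id id ...' into (name, [ids])."""
    if line.startswith(prefix):
        line = line[len(prefix):]
    name, _, members = line.partition(";")
    return name.strip(), members.split()

sample = "Category:Cats; 1 2\n"  # made-up example line
stripped = sample.strip("Category:")  # character-set stripping
parsed = parse_category_line(sample)  # prefix removal
print(repr(stripped))
print(parsed)
```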

Before we clean up our dictionary we remove all categories that do not contain more than 3500 entries. This saves time later on, as there are fewer categories we have to check.

In [None]:
cats2 = {}
for key in cats.keys():
    if len(cats[key]) > 3500:
        cats2[key] = cats[key]

In [None]:
len(cats2)

In the following step we filter every category in place, keeping only the nodes that are contained in the list of unique nodes we created earlier.

In [None]:
for key in cats2.keys():
    temp = []
    for node in cats2[key]:
        if int(node) in nodes:
            temp.append(node)
    cats2[key] = temp
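
A membership test against an array has to scan the array for every lookup, while a Python `set` answers in roughly constant time. The following sketch shows the same filtering step with a set on toy data; `filter_members` and the sample dictionary are assumptions for illustration only.

```python
def filter_members(categories, valid_nodes):
    """Keep only the category members that occur in valid_nodes."""
    valid = {int(n) for n in valid_nodes}  # built once, reused for every lookup
    return {
        name: [m for m in members if int(m) in valid]
        for name, members in categories.items()
    }

toy_categories = {"A": ["1", "2", "5"], "B": ["3", "9"]}
filtered = filter_members(toy_categories, [1, 2, 3, 4])
print(filtered)
```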


As you can see, after the clean-up six of our categories again contain fewer than 3500 entries.

In [None]:
i = 0
for keys in cats2.keys():
    if len(cats2[keys]) < 3500:
        i+=1
print(i)

In [None]:
fnc.save_dict(cats2,fileName="wiki-topcats-categories_modified.txt")

In [None]:
cats2 = fnc.open_dict("wiki-topcats-categories_modified.txt")

In the cells below you can see the code we used to remove the pagenames of nodes that are not in our graph. To access the pagenames we created a dictionary where the key is the number of the node and the item is the name of the page.

In [None]:
pagenames = open('wiki-topcats-page-names.txt', 'r', encoding = "utf8").readlines()

In [None]:
pagenames[0].strip("\n").split()[0]

In [None]:
pages_split = {}
for page in pagenames:
    pages_split[page.strip("\n").split()[0]] = " ".join(page.strip("\n").split()[1:])
pages = {}
for node in nodes:
    pages[node] = pages_split[str(node)]
del pages_split

In [None]:
fnc.save_dict(pages,fileName='wiki-topcats-page-names_modified.txt')

For use in the further steps (searching the distance between categories) we append the categories as attributes to the nodes. If the node belongs to a category, the value is set to "True". Otherwise it is "False".

In [None]:
for key in cats2.keys():
    for node in graph.nodes:
        graph.node[node][key] = False
    temp = cats2[key]
    for node in temp:
        graph.node[int(node)][key] = True

### Block-ranking. Calculating the shortest distance between the categories
To speed up our calculations we tried _"@autojit"_, which is part of the _"numba"_ library. It compiles the code placed after _"@autojit"_ to machine code that runs directly on the CPU instead of in the Python interpreter, which can speed up the process a lot. Some functions of course can't be compiled; these fall back to the Python interpreter and the loaded libraries, which can make the process slower, as happened in our case. Using "Year_of_birth_unknown" as starting category and 'English_television_actors' as target category, our _BFS_ ran 814.86 seconds with _"@autojit"_ and 710.23 seconds without it. Using _threading_ we could reduce the runtime for these two categories to 642.2 seconds. We have to state that this is not a fully reliable result, as we ran every approach only once. Due to the long running time, we couldn't test in more detail.
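
Since single measurements can be noisy, a comparison like this is usually repeated a few times. The sketch below uses `timeit.repeat` from the standard library on a small stand-in function; `time_call` and `toy_work` are hypothetical names, and the real BFS would take its place.

```python
import timeit

def time_call(func, *args, repeats=3):
    """Run func(*args) several times and return the individual wall-clock times."""
    return timeit.repeat(lambda: func(*args), number=1, repeat=repeats)

def toy_work(n):
    # stand-in for an expensive computation
    return sum(i * i for i in range(n))

times = time_call(toy_work, 100_000)
print("best: %.4f s, worst: %.4f s" % (min(times), max(times)))
```

Reporting the best and the worst run gives a rough idea of how much the measurements fluctuate.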

As our C0 we used __"Year_of_birth_unknown"__. It is the smallest of our categories.

In [None]:
smallest = fnc.run_bfs('Year_of_birth_unknown', graph, cats2.keys())
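
The implementation of `fnc.run_bfs` lives in functs.py and is not shown here. As a rough idea of the technique, the sketch below runs a breadth-first search from every node of one category and summarises the distances to the nodes of another category as median, mean and number of unreachable pairs. That these three values correspond to the stored results is an assumption; the adjacency dictionary and all names are illustrative.

```python
from collections import deque
from statistics import mean, median

def bfs_distances(adj, source):
    """Return the hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def category_distance(adj, cat_a, cat_b):
    """Median, mean and unreachable-pair count over all (a, b) pairs."""
    finite, unreachable = [], 0
    for a in cat_a:
        dist = bfs_distances(adj, a)
        for b in cat_b:
            if b in dist:
                finite.append(dist[b])
            else:
                unreachable += 1
    return median(finite), mean(finite), unreachable

# toy directed graph: 4 -> 1 -> 2 -> 3
adj = {1: [2], 2: [3], 3: [], 4: [1]}
result = category_distance(adj, [1, 4], [3, 4])
print(result)
```

The sketch expects at least one reachable pair; with none, `median` and `mean` would raise an error.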

Save the result to file.

In [None]:
fnc.save_dict(smallest, 'results.csv')

Opening the file that contains our results.

In [68]:
smallest = fnc.open_dict('results.csv')

In [70]:
for key in smallest.keys():
    if type(smallest[key]) != list:
        smallest[key] = smallest[key].split()

Printing the results of our shortest paths from C0 to the other categories.

In [71]:
smallest

{'English_footballers': ['4.0', '3.9053475935828876', '634'],
 'The_Football_League_players': ['4.0', '3.9925412892914225', '637'],
 'Association_football_forwards': ['4.0', '3.6680896478121663', '634'],
 'Association_football_goalkeepers': ['4', '4.0021310602024505', '644'],
 'Association_football_midfielders': ['4', '3.5982951518380393', '636'],
 'Association_football_defenders': ['5.0', '4.298499464094319', '636'],
 'Living_people': ['2.0', '2.0278494069107786', '587'],
 'Harvard_University_alumni': ['3.0', '2.781068217874141', '637'],
 'Major_League_Baseball_pitchers': ['5', '4.375734901122394', '644'],
 'Members_of_the_United_Kingdom_Parliament_for_English_constituencies': ['3',
  '3.1770334928229667',
  '634'],
 'Indian_films': ['5.0', '4.313947226709747', '637'],
 'Year_of_death_missing': ['4.0', '3.3721294363256784', '606'],
 'English_cricketers': ['4.0', '3.742138364779874', '620'],
 'Year_of_birth_missing_(living_people)': ['3.0', '2.971518987341772', '636'],
 'Rivers_of_Roma

Because the median doesn't seem an adequate measure for the distance between the categories (it can accidentally turn out to be a large distance between two categories), we decided to rank by the mean distance. The mean was in most cases also close to the median, so it seemed to be the better measure. As the number of infinite distances doesn't vary much, we didn't take them into account for our block ranking. In the next step we sort the dictionary by the mean distance.

In [72]:
# the means are stored as strings, so convert them to float for a numeric sort
smallest = sorted(smallest.items(), key=lambda x: float(x[1][1]))
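
The means in `smallest` are stored as strings, so it matters whether the sort key compares text or numbers. As long as all means are below 10, as in the output shown here, both orders agree; the made-up values below show where they would differ.

```python
# made-up statistics in the same [median, mean, count] layout
stats = {
    "a": ["3.0", "10.5", "1"],
    "b": ["2.0", "2.5", "0"],
    "c": ["4.0", "9.1", "2"],
}

# comparing the strings puts "10.5" before "2.5"
by_text = [name for name, _ in sorted(stats.items(), key=lambda x: x[1][1])]
# converting to float compares the actual values
by_value = [name for name, _ in sorted(stats.items(), key=lambda x: float(x[1][1]))]

print(by_text)
print(by_value)
```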

Here you can see our block-ranking: the distances (median and mean) from every category to our C0. The third number is the number of infinite distances.

In [73]:
smallest

[('Article_Feedback_Pilot', ['2.0', '1.5792811839323466', '632']),
 ('Living_people', ['2.0', '2.0278494069107786', '587']),
 ('Harvard_University_alumni', ['3.0', '2.781068217874141', '637']),
 ('Year_of_birth_missing', ['3.0', '2.804591836734694', '558']),
 ('Fellows_of_the_Royal_Society', ['3', '2.8631801373481247', '642']),
 ('People_from_New_York_City', ['3', '2.9465608465608466', '635']),
 ('Year_of_birth_missing_(living_people)',
  ['3.0', '2.971518987341772', '636']),
 ('English-language_films', ['3', '2.9888594164456235', '636']),
 ('American_film_actors', ['3', '3.0563230605738574', '635']),
 ('American_Jews', ['3', '3.09447983014862', '633']),
 ('American_films', ['3.0', '3.1020084566596196', '644']),
 ('Black-and-white_films', ['4', '3.1592967501331914', '634']),
 ('Members_of_the_United_Kingdom_Parliament_for_English_constituencies',
  ['3', '3.1770334928229667', '634']),
 ('American_military_personnel_of_World_War_II',
  ['3', '3.2150079829696647', '634']),
 ('British_fil

### Preparation for the calculation of the article score.
In the following step we assigned every node either to C0 (if it was included) or to the category that is the closest to C0.

In [None]:
cat_mod = fnc.key_substraction(cats2, 'Year_of_birth_unknown', smallest)
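
`fnc.key_substraction` is defined in functs.py and not shown here. One way to obtain such an assignment, sketched under the assumption that categories are visited in ranking order and every node stays with the first category that contains it, is the following; `assign_once` and the toy data are hypothetical.

```python
def assign_once(categories, ranked_names):
    """Give every node to the first category (in ranking order) that lists it."""
    seen = set()
    result = {}
    for name in ranked_names:
        kept = [n for n in categories[name] if n not in seen]
        seen.update(kept)
        result[name] = kept
    return result

toy_cats = {"c0": [1, 2], "near": [2, 3], "far": [1, 3, 4]}
assigned = assign_once(toy_cats, ["c0", "near", "far"])
print(assigned)
```

After this step the category lists are disjoint, which matches the statement that every node belongs to only one category.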


In [None]:
fnc.save_dict(cat_mod, 'cats_modified.csv')

In [35]:
cats3 = fnc.open_dict('cats_modified.csv')

We use the new dictionary, which records which node belongs to which category, to update the attributes we assigned to our graph. Now every node in our graph belongs to only one category.

In [42]:
for key in cats3.keys():
    for node in graph.nodes:
        graph.node[node][key] = False
        graph.node[node]['score'] = 0
    temp = cats3[key]
    for node in temp:
        graph.node[int(node)][key] = True

Creating a list of categories to be used for scoring the articles. Element 0 is our starting category. The others are appended according to their distance to C0.

In [39]:
cat_list = ['Year_of_birth_unknown']
# smallest is a list of (category, statistics) tuples sorted by distance to C0
cat_list.extend(name for name, _ in smallest)

Getting the scored subgraph from our scoring function. See functs.py for how we calculated the score.

In [47]:
sub_graph_scored = fnc.article_score(graph, cat_list)
            

In [74]:
sub_graph_scored.node[279122]['score']

944

Appending the names of the pages to our nodes for the display of the ranking of the articles in each category.


In [50]:
for node in sub_graph_scored.nodes():
    sub_graph_scored.node[node]['pagename'] = pages[node]

Setting up a dataframe that holds the 100 highest-scoring pages of every category, sorted by score.

In [55]:
# creating a dataframe to display the scoring of the articles for every category
outframe = pd.DataFrame(columns=cat_list)
for cat in cat_list:
    if cat != 'Living_people':  # skipped because of its running time
        print(cat, len(cats3[cat]))
        # collect the rows in a list first; adding rows to a dataframe one by one is slow
        rows = []
        for node in cats3[cat]:
            attrs = sub_graph_scored.node[int(node)]
            rows.append((attrs['pagename'] + ' (' + str(attrs['score']) + ')', attrs['score']))
        tempframe = pd.DataFrame(rows, columns=['Pagename (Score)', 'Score'])
        tempframe = tempframe.sort_values('Score', ascending=False)
        tempframe = tempframe.reset_index(drop=True)
        # .loc slices by label and includes the end label, so :99 keeps exactly 100 rows
        outframe[cat] = tempframe['Pagename (Score)'].loc[:99]
        del tempframe

Year_of_birth_unknown 2536
Article_Feedback_Pilot 3484
Harvard_University_alumni 2507
Year_of_birth_missing 4339
Fellows_of_the_Royal_Society 3439
People_from_New_York_City 4614
Year_of_birth_missing_(living_people) 28309
English-language_films 22463
American_film_actors 13865
American_Jews 2907
American_films 15159
Black-and-white_films 4929
Members_of_the_United_Kingdom_Parliament_for_English_constituencies 6491
American_military_personnel_of_World_War_II 3720
British_films 4422
American_television_actors 11531
Year_of_death_missing 4120
English_television_actors 3361
Place_of_birth_missing_(living_people) 5491
Association_football_midfielders 5801
Association_football_forwards 4927
English_cricketers 3253
Rivers_of_Romania 7729
Windows_games 4025
English_footballers 7538
The_Football_League_players 2794
English-language_albums 4760
Association_football_goalkeepers 3737
Debut_albums 7561
Main_Belt_asteroids 11660
Asteroids_named_for_people 360
Association_football_defenders 4588
Indi
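
When only the best entries of a large category are needed, a full sort can be avoided. The sketch below uses `heapq.nlargest` from the standard library on made-up scores; `top_pages` is a hypothetical helper and not what the notebook ran.

```python
import heapq

def top_pages(scores, k=100):
    """Return the k (page, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

toy_scores = {"A": 5, "B": 42, "C": 17, "D": 1}
best_two = top_pages(toy_scores, k=2)
print(best_two)
```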

Due to the running time and the deadline closing in we had to exclude the biggest category ('Living_people') from the ranking. In the output below you can see the ranking of the articles in the other categories.

In [56]:
outframe

Unnamed: 0,Year_of_birth_unknown,Article_Feedback_Pilot,Living_people,Harvard_University_alumni,Year_of_birth_missing,Fellows_of_the_Royal_Society,People_from_New_York_City,Year_of_birth_missing_(living_people),English-language_films,American_film_actors,...,English_footballers,The_Football_League_players,English-language_albums,Association_football_goalkeepers,Debut_albums,Main_Belt_asteroids,Asteroids_named_for_people,Association_football_defenders,Indian_films,Major_League_Baseball_pitchers
0,Diogenes Lartius (21),To Kill a Mockingbird (film) (11742),,John F. Kennedy (62240),Edward V. Hartford (1022),Winston Churchill (37547),Sydney Pollack (30121),Aljean Harmetz (34717),Rebecca (1940 film) (178798),R. Lee Ermey (59381),...,Wayne Rooney (1458),Kevin Keegan (10757),Definitely Maybe (1881),Goalkeeper (association football) (3125),Definitely Maybe (1881),Asteroid belt (10555),5535 Annefrank (10555),Alan Hansen (1882),Neeraj Vora (31),Roger Clemens (310)
1,Stephen Dingate (17),Allen Ginsberg (4907),,William Rehnquist (8694),Joe Gould (manager) (663),Royal Society (1557),Leelee Sobieski (29668),Kevyn Major Howard (29623),The Great Ziegfeld (141533),James Woods (54368),...,Alan Shearer (1293),Kenny Dalglish (6371),A Grand Don't Come for Free (1873),Shay Given (1757),Illmatic (15),Cybele asteroid (7),2685 Masursky (10555),Geremi Njitap (1457),Bend It Like Beckham (23),Nolan Ryan (245)
2,Andrew Gibson (footballer) (16),Laserblast (2282),,John Quincy Adams (6944),John Johnstone (mayor) (458),Margaret Thatcher (1013),Stanley Kubrick (29579),John McMartin (9332),The Power and the Glory (film) (140577),James Marsden (54080),...,Michael Owen (1233),Eric Cantona (6024),Let It Be (416),Antti Niemi (footballer) (1278),I Dreamed a Dream (album) (13),3355 Onizuka (6),4055 Magellan (2),Gareth Bale (741),Amar Akbar Anthony (18),Sandy Koufax (214)
3,Tom Faulkner (13),Isaac Asimov (2072),,Leonard Bernstein (6203),Martin Snyder (326),"Charles, Prince of Wales (326)",Stephen Sondheim (26473),Jim Powell (historian) (5487),The Battle Over Citizen Kane (140575),Kate Bosworth (54045),...,Peter Crouch (1181),Graeme Souness (5016),Sgt. Pepper's Lonely Hearts Club Band (35),Mark Schwarzer (1213),Please Please Me (13),3352 McAuliffe (6),1580 Betulia (1),Colin Calderwood (735),Hisss (17),Randy Johnson (211)
4,L Bu (12),Scream 4 (2009),,Norman Mailer (5969),Barry Garner (281),William Ewart Gladstone (261),Emmy Rossum (11286),Michael Horowitz (2541),Citizen Kane (140574),Beau Bridges (54041),...,Teddy Sheringham (1048),Mark Hughes (4966),The Joshua Tree (32),Bruce Grobbelaar (824),Appetite for Destruction (11),3353 Jarvis (6),1685 Toro (1),Colin Hendry (622),Sholay (17),Tom Seaver (191)
5,Pausanias (geographer) (12),Charlize Theron (1585),,T. S. Eliot (5846),Ben Pon (senior) (244),Richard Dawkins (194),Joseph Papp (9020),Nathan George (2347),Sunset Boulevard (film) (80770),Stacy Keach (54019),...,Theo Walcott (994),Ryan Giggs (4561),The Beatles (album) (18),Neville Southall (704),Enter the Wu-Tang (36 Chambers) (11),3356 Resnik (6),878 Mildred (1),Jim Gannon (458),Chhalia (16),Mariano Rivera (179)
6,Dong Zhuo (10),Whitney Houston (1497),,"Barack Obama, Sr. (5415)",Jean Tatlock (60),"Arthur Wellesley, 1st Duke of Wellington (167)",Jerome Robbins (8699),Martin Gottfried (1322),Lolita (1962 film) (72614),Walton Goggins (53915),...,Jermain Defoe (511),Ian Rush (4438),Pet Sounds (17),Neil Sullivan (677),The Piper at the Gates of Dawn (10),3350 Scobee (6),254 Augusta (1),Anthony Gerrard (454),Vaali (16),Bob Gibson (161)
7,Theocritus (10),Dreamgirls (film) (1368),,Theodore Roosevelt (5353),Glen Frey (44),Benjamin Disraeli (161),Herman J. Mankiewicz (8532),Kate Baldwin (1312),Fear and Desire (59169),Laz Alonso (53895),...,Robbie Fowler (463),Roy Keane (3758),John Wesley Harding (album) (17),Bryan Gunn (511),Some Gave All (10),3354 McNair (6),25098 Gridnev (0),Steve Staunton (415),Mard (15),Roy Halladay (158)
8,Yuan Shu (9),Wanted (2008 film) (986),,John Adams (5282),Stanley Kerr (44),Harold Wilson (149),Theodore Roosevelt (5353),John Doyle (director) (1312),Dirty Harry (56869),Rhys Coiro (53894),...,David Beckham (438),Iain Dowie (2897),Thriller (album) (16),Brad Friedel (485),The Chronic (10),3351 Smith (6),25113 Benwasserman (0),Frank Sinclair (410),Chandramukhi (15),Curt Schilling (153)
9,Tim Stevenson (9),Matt Damon (948),,Cole Porter (5194),Lee Roberts (39),Charles Darwin (140),Allen Ginsberg (4907),Maria Riccetto (1307),Straw Dogs (53853),Willa Holland (53868),...,John Aldridge (414),Joe Kinnear (2559),Revolver (album) (16),Pat Jennings (468),Ten (Pearl Jam album) (9),253 Mathilde (5),25103 Kimdongyoung (0),Kyle Walker (380),Suhaag (1979 film) (15),Pedro Martnez (144)


In [57]:
outframe.to_csv('outframe.csv', sep='\t')