# Information Pursuit: A Wikispeedia Analysis

This notebook develops an in-depth analysis of the `Wikispeedia` dataset. The goal is to identify the characteristics of human navigation paths, and to use this knowledge to assess the difficulty of arbitrary pairs of articles and to expose the pitfalls of common human strategies.

This study requires a close inspection of the Wikispeedia network of pages and of the results collected from tens of thousands of games.

This notebook is divided into four main parts:
1. Hub Navigation Analysis
2. Path Efficiency Analysis
3. Link Position Impact
4. Navigation Strategies

In [None]:
%load_ext autoreload
%autoreload 2

### Understanding the data

Before stepping into the analysis, we briefly describe the data at hand. This also lets us show a few transformations that were applied to ease data manipulation.

We first load the data and transform it into an easy-to-use format.

In [None]:
from src.utils.data_utils import load_graph_data

graph_data = load_graph_data()

loading raw data from tsv files...
formatting articles...
formatting categories...
formatting links...
formatting paths...
formatting distance matrix...
building graph...


In [None]:
from src.utils.general_utils import describe_dict

describe_dict(graph_data)

Keyword                       | Type (shape)          
------------------------------------------------------
shortest-path-distance-matrix   Array (4604, 4604)    
paths_finished                  DataFrame (51318, 9)  
articles                        DataFrame (4604, 1)   
paths_unfinished                DataFrame (24875, 9)  
links                           DataFrame (119882, 2) 
categories                      DataFrame (5204, 2)   
graph                           DiGraph (4604, 124486)


### Why analyse human behavior?

In [None]:
# TODO: Antoine. Show differences between paths and shortest paths

## Wikispeedia network analysis

It seems reasonable to hypothesize that humans navigate the pages of the Wikispeedia website based on the relationships between the concept of the target article and the content of the articles found along the way. To reason about these semantic relations, humans abstract ideas into an inner world model, which makes internalizing concepts efficient and smooth.

In this analysis, we verify whether an intuitive top-down approach is indeed the most prevalent strategy among players.

*Do players have a tendency to over-utilize the hubs of the Wikispeedia network?*\
*Does this strategy usually pay off?*

In [None]:
import networkx as nx
import numpy as np


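def compute_hubs(graph):
    """Return the (article, hub score) pairs whose HITS hub score is a strong positive outlier."""
    # nx.hits returns a (hubs, authorities) pair; keep the hub scores only
    hubs = nx.hits(graph, normalized=True)[0]

    distribution = np.array([*hubs.values()])
    mean = np.mean(distribution)
    std = np.std(distribution)

    # keep positive outliers: hub scores more than 8 standard deviations above the mean
    significant_hubs = [(node, score) for node, score in hubs.items() if score - mean > 8 * std]
    # sort by decreasing hub score
    significant_hubs = sorted(significant_hubs, key=lambda t: t[1], reverse=True)

    return significant_hubs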

In [None]:
compute_hubs(graph_data["graph"])

[('Driving_on_the_left_or_right', 0.0013171632208290843)]
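To make the notion of a hub score more concrete, the sketch below runs a plain power-iteration version of HITS on a toy graph. It is not the implementation behind `nx.hits`. The function name `hits_scores`, the fixed number of iterations, and the toy edges are assumptions made for illustration only. The point it shows is that a page obtains a high hub score when it links to pages with high authority scores, and a page obtains a high authority score when it is linked from pages with high hub scores.

```python
def hits_scores(edges, iterations=50):
    """Return (hubs, authorities) dicts for a list of directed (source, target) edges.

    Hypothetical helper: a simplified HITS power iteration with sum-to-one normalization.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hubs = {node: 1.0 for node in nodes}
    auths = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # authority score: sum of the hub scores of the pages linking to the page
        auths = {node: 0.0 for node in nodes}
        for source, target in edges:
            auths[target] += hubs[source]

        # hub score: sum of the authority scores of the pages the page links to
        new_hubs = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_hubs[source] += auths[target]

        # normalize each score vector so that it sums to 1
        hub_total = sum(new_hubs.values())
        auth_total = sum(auths.values())
        hubs = {node: value / hub_total for node, value in new_hubs.items()}
        auths = {node: value / auth_total for node, value in auths.items()}

    return hubs, auths


# toy graph (assumption): "Index" links to every other page, "B" is linked from everywhere
toy_edges = [
    ("Index", "A"), ("Index", "B"), ("Index", "C"),
    ("A", "B"),
    ("C", "B"),
]

toy_hubs, toy_auths = hits_scores(toy_edges)
top_hub = max(toy_hubs, key=toy_hubs.get)
top_authority = max(toy_auths, key=toy_auths.get)
print(top_hub, top_authority)  # Index B
```

On this toy graph, the page that links to all the others ends up as the top hub, while the page that receives the most links ends up as the top authority.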

In [None]:
# TODO: Fred code.

## Path Efficiency Analysis


- Compare actual paths with shortest paths found computationally
- Develop metrics for path "efficiency" considering both length and completion time
- Create visualization tools for path comparison and analysis
- Analyze distribution of successful vs. abandoned paths
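As a possible starting point for the first two items, the sketch below compares the number of clicks in a played path with the length of a shortest path found by breadth-first search. The toy link graph, the function names, and the ratio "shortest clicks / played clicks" used as an efficiency score are assumptions for illustration, not the metric retained in this notebook. Completion time is not covered, and the played path is assumed to be a plain list of visited articles.

```python
from collections import deque


def shortest_path_length(links, source, target):
    """Number of clicks on a shortest path, or None if the target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for neighbor in links.get(page, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def path_efficiency(links, played_path):
    """Ratio of shortest clicks to played clicks (1.0 means the played path is optimal)."""
    played_clicks = len(played_path) - 1
    optimal_clicks = shortest_path_length(links, played_path[0], played_path[-1])
    if optimal_clicks is None or played_clicks == 0:
        return None
    return optimal_clicks / played_clicks


# toy link graph (assumption): article -> list of articles it links to
toy_links = {
    "Cat": ["Mammal", "Pet"],
    "Mammal": ["Animal"],
    "Pet": ["Dog"],
    "Animal": ["Dog", "Earth"],
    "Dog": [],
    "Earth": [],
}

played = ["Cat", "Mammal", "Animal", "Dog"]
efficiency = path_efficiency(toy_links, played)
print(efficiency)
```

Here the player needed three clicks where two were enough (`Cat` to `Pet` to `Dog`), so the score is 2/3.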

In [None]:
# TODO: Peter code

# p-values and stuff. logits

In [None]:
# TODO: Gabriel code

# Graphs and stuff
# people tired => people bad

## Navigation Strategies

### Top-down approach

We first represent the players' paths as graphs: one built from the finished paths and one built from the unfinished paths.

In [None]:
from src.data.graph import extract_players_graph

finished_paths_graph = extract_players_graph(graph_data, paths_finished=True)
unfinished_paths_graph = extract_players_graph(graph_data, paths_finished=False)

Note that 4 edges are present in 'paths_finished.tsv' but not in 'links.tsv':
{('Finland', 'Åland'), ('Bird', 'Wikipedia_Text_of_the_GNU_Free_Documentation_License'), ('Claude_Monet', 'Édouard_Manet'), ('Republic_of_Ireland', 'Éire')}
Note that 62 edges are present in 'paths_unfinished.tsv' but not in 'links.tsv':
{('Liverpool_F.C.', 'Wikipedia_Text_of_the_GNU_Free_Documentation_License'), ('Juice', 'Wikipedia_Text_of_the_GNU_Free_Documentation_License'), ('Accountancy', 'Wikipedia_Text_of_the_GNU_Free_Documentation_License'), ('Scientific_classification', 'Wikipedia_Text_of_the_GNU_Free_Documentation_License'), ('World_Health_Organization', 'Wikipedia_Text_of_the_GNU_Free_Documentation_License'), ('Swastika', 'Wikipedia_Text_of_the_GNU_Free_Documentation_License'), ('Islam', 'Wikipedia_Text_of_the_GNU_Free_Documentation_License'), ("Maxwell's_equations", 'Wikipedia_Text_of_the_GNU_Free_Documentation_License'), ('Ireland', 'Éire'), ('Cosmic_microwave_background_radiation', 'Wikipedia_T
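The idea behind this step can be sketched as follows: every pair of consecutive articles in a path becomes a directed edge, weighted by how often players followed it, and edges that do not appear in the link table are reported, as in the notes above. This is not the implementation of `extract_players_graph`. The function name `paths_to_edge_counts`, the list-of-lists path format, and the toy data are assumptions, and back-clicks are assumed to have been resolved beforehand.

```python
from collections import Counter


def paths_to_edge_counts(paths):
    """Count how many times each (source, target) transition was played."""
    counts = Counter()
    for path in paths:
        # consecutive articles in a path form the played transitions
        counts.update(zip(path, path[1:]))
    return counts


# toy data (assumption): three played paths and a small set of known links
toy_paths = [
    ["Finland", "Europe", "France"],
    ["Finland", "Åland"],
    ["Finland", "Europe", "Germany"],
]
toy_links = {("Finland", "Europe"), ("Europe", "France"), ("Europe", "Germany")}

edge_counts = paths_to_edge_counts(toy_paths)

# transitions that players followed but that are absent from the link table
missing_edges = set(edge_counts) - toy_links

print(edge_counts[("Finland", "Europe")])  # 2
print(missing_edges)  # {('Finland', 'Åland')}
```

The counts can then serve as edge weights when the transitions are loaded into a directed graph.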

In [None]:
compute_hubs(finished_paths_graph)

[('United_States', 0.032523284079864996),
 ('Europe', 0.01928897240455547),
 ('United_Kingdom', 0.015584267109294229),
 ('England', 0.014512061177973051),
 ('North_America', 0.013763742733676464),
 ('Earth', 0.012658182958466232),
 ('World_War_II', 0.009468624468380463),
 ('English_language', 0.008884412977769703),
 ('Great_Britain', 0.007144918668458417),
 ('France', 0.0071205427603357)]

In [None]:
compute_hubs(unfinished_paths_graph)

[('United_States', 0.02245843005865249),
 ('United_Kingdom', 0.011189493299372716),
 ('England', 0.010328107713851743),
 ('Europe', 0.009365067575018657),
 ('Animal', 0.008076413290830507),
 ('North_America', 0.00804392676203397),
 ('Mammal', 0.007004155538156376),
 ('World_War_II', 0.00667763623055239),
 ('English_language', 0.006385391170843089),
 ('Earth', 0.00585486571179679)]
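One simple way to read the two rankings side by side is to compare which articles appear in each list. The sketch below hard-codes the article names from the two outputs above (scores omitted); the variable names are illustrative.

```python
# significant hubs of the finished-paths graph, in decreasing order of hub score
finished_hubs = [
    "United_States", "Europe", "United_Kingdom", "England", "North_America",
    "Earth", "World_War_II", "English_language", "Great_Britain", "France",
]
# significant hubs of the unfinished-paths graph, in decreasing order of hub score
unfinished_hubs = [
    "United_States", "United_Kingdom", "England", "Europe", "Animal",
    "North_America", "Mammal", "World_War_II", "English_language", "Earth",
]

shared = [name for name in finished_hubs if name in unfinished_hubs]
only_finished = [name for name in finished_hubs if name not in unfinished_hubs]
only_unfinished = [name for name in unfinished_hubs if name not in finished_hubs]

print(len(shared))      # 8
print(only_finished)    # ['Great_Britain', 'France']
print(only_unfinished)  # ['Animal', 'Mammal']
```

Eight of the ten hubs appear in both lists. `Great_Britain` and `France` appear only among the finished paths, while `Animal` and `Mammal` appear only among the unfinished ones.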

In [None]:
# TODO: Timothée code.

## Link Position Impact

In [None]:
# TODO: Timothée code

## Conclusion
