# Europe in Wikispeedia: Unmasking Geographic Bias

*TheDataDreamTeam*

This project investigates geographical biases in the Wikispeedia game and in player behavior, using the 2007 Wikipedia Selection for schools dataset as the data source. Our goal is to explore whether a bias towards Europe exists in article selection and in gameplay.

# Imports

In [1]:
import re
import os
import copy
import pickle

import seaborn as sns
from plotly.subplots import make_subplots
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.figure_factory as ff
from dash import Dash, dcc, html

import scipy
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

import networkx as nx

Note: plotly requires the `kaleido` package to export static images with `fig.write_image`; the interactive HTML plots work without it.

# Settings

In [2]:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

REMOVE_INTERNATIONAL = True
INTERNATIONAL_LABEL = "International"
EUROPE_LABEL = "Europe"


PLOTS_PATH = "plots"
PLOTS_PATH_PLT = os.path.join(PLOTS_PATH, "plt")
PLOTS_PATH_PX = os.path.join(PLOTS_PATH, "px")
PLOTS_PATH_HTML = os.path.join(PLOTS_PATH, "html")

FIGURE_WIDTH = 800
FIGURE_HEIGHT = 600

for path in [PLOTS_PATH_PLT, PLOTS_PATH_PX, PLOTS_PATH_HTML]: 
    os.makedirs(path, exist_ok=True)

# Data loading

This section contains all of the data loading.

## All articles

Articles are the core of Wikipedia. The names of all available articles are stored in the file `articles.tsv`.

In [3]:
df_articles_all = pd.read_csv(
    os.path.join("Data", "wikispeedia_paths-and-graph", "articles.tsv"),
    delimiter="\t",
    header=None,
    names=["name"],
    skip_blank_lines=True,
    comment="#"
)

display(df_articles_all.head())
print("Size:", df_articles_all.shape)

Unnamed: 0,name
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in
1,%C3%85land
2,%C3%89douard_Manet
3,%C3%89ire
4,%C3%93engus_I_of_the_Picts


Size: (4604, 1)


## Article categories

Each article can have multiple categories and subcategories, specified as long dot-separated strings. Many of them are too specific, and we would have very little data for them. Therefore, we decided to work with the main category, which is the first component after the `subject` prefix.
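As a small illustration of this rule, the sketch below extracts the main category from a dotted category string. The helper name `main_category` is ours and does not appear in the notebook; it only assumes that every string starts with the `subject` prefix.

```python
def main_category(category: str) -> str:
    """Return the component that follows the leading 'subject' prefix."""
    parts = category.split(".")
    if len(parts) < 2 or parts[0] != "subject":
        raise ValueError(f"unexpected category format: {category!r}")
    return parts[1]


print(main_category("subject.People.Historical_figures"))  # People
print(main_category("subject.Countries"))                  # Countries
```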

In [4]:
df_categories = pd.read_csv(
    os.path.join("Data", "wikispeedia_paths-and-graph", "categories.tsv"),
    delimiter="\t",
    header=None,
    names=["article", "category"],
    skip_blank_lines=True,
    comment="#"
)

main_categories = []
for category in df_categories["category"].values:
    main_categories.append(category.split(".")[1])

df_categories["categoryMain"] = main_categories

display(df_categories.head())
print("Size:", df_categories.shape)

Unnamed: 0,article,category,categoryMain
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,subject.History.British_History.British_Histor...,History
1,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,subject.People.Historical_figures,People
2,%C3%85land,subject.Countries,Countries
3,%C3%85land,subject.Geography.European_Geography.European_...,Geography
4,%C3%89douard_Manet,subject.People.Artists,People


Size: (5204, 3)


We ended up with 15 main categories:

In [5]:
df_categories.categoryMain.unique().tolist()

['History',
 'People',
 'Countries',
 'Geography',
 'Business_Studies',
 'Science',
 'Everyday_life',
 'Design_and_Technology',
 'Music',
 'IT',
 'Language_and_literature',
 'Mathematics',
 'Religion',
 'Art',
 'Citizenship']

## Article continent labels

We used GPT-3.5 to assign each article, based on its name, to one of the geographical locations `Africa`, `Antarctica`, `Asia`, `Europe`, `Middle East`, `North America`, `Australia`, and `South America`, or to the `International` label for articles about everyday things or without a strong link to a geographical location. To evaluate GPT's performance, we built our own heuristic based on the article category. We started by dividing the geographical categories into groups, then searched for keywords in the article names, and iteratively grew the list of subcategories and articles with a known assignment.

We cross-checked the GPT results against our heuristic (around 1,000 articles labeled) and, surprisingly, found more than 90% agreement. Analyzing the disagreements revealed two patterns: GPT labeled an article as North America while its category was related to South America, or the two methods confused Asia and the Middle East. Even though the Middle East is one of the geographical categories in the original Wikispeedia dataset, we decided to merge it with Asia. A minority of the errors came from our heuristic. For example, French Polynesia contains the word "French", so the heuristic assigns Europe, although the article refers to a group of islands in the South Pacific Ocean. For a few articles the correct label may be ambiguous, which introduces a small error into our analysis.
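The cross-check can be sketched as a simple agreement rate over the articles labeled by both methods. The function names and the toy labels below are hypothetical and only illustrate the idea; the notebook does not show the code that was actually used for this comparison.

```python
def normalise(label: str) -> str:
    """Merge the Middle East into Asia, as described above."""
    return "Asia" if label == "Middle East" else label


def agreement_rate(gpt_labels: dict, heuristic_labels: dict) -> float:
    """Share of articles labelled by both methods on which the labels agree."""
    shared = gpt_labels.keys() & heuristic_labels.keys()
    if not shared:
        raise ValueError("no article was labelled by both methods")
    matches = sum(
        normalise(gpt_labels[article]) == normalise(heuristic_labels[article])
        for article in shared
    )
    return matches / len(shared)


# Hypothetical toy labels, for illustration only.
gpt = {"Paris": "Europe", "Iran": "Middle East", "Peru": "North America", "Kenya": "Africa"}
heuristic = {"Paris": "Europe", "Iran": "Asia", "Peru": "South America"}

print(agreement_rate(gpt, heuristic))  # 2 of the 3 shared articles agree
```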

In [6]:
df_continents = pd.read_csv(os.path.join("Data", "continents.csv"))

if REMOVE_INTERNATIONAL:
    labeled_articles_all_count = len(df_continents)
    df_continents = df_continents[df_continents.continent != INTERNATIONAL_LABEL]
    labeled_articles_count = len(df_continents)
    print(f"Removing articles labeled as {INTERNATIONAL_LABEL}, Removed articles: {labeled_articles_all_count - labeled_articles_count}")

display(df_continents.head())
print("Size:", df_continents.shape)

Removing articles labeled as International, Removed articles: 1870


Unnamed: 0,article,continent
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Europe
1,%C3%85land,Europe
2,%C3%89douard_Manet,Europe
3,%C3%89ire,Europe
4,%C3%93engus_I_of_the_Picts,Europe


Size: (2734, 2)


We decided to remove the `International` label because we want to focus on geographical bias.

We combine the information from the `df_continents` and `df_categories` DataFrames for further analysis.

In [7]:
df_continents_categories = pd.merge(df_continents, df_categories, on="article", how="left")

display(df_continents_categories.head())
print("Size:", df_continents_categories.shape)

Unnamed: 0,article,continent,category,categoryMain
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Europe,subject.History.British_History.British_Histor...,History
1,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Europe,subject.People.Historical_figures,People
2,%C3%85land,Europe,subject.Countries,Countries
3,%C3%85land,Europe,subject.Geography.European_Geography.European_...,Geography
4,%C3%89douard_Manet,Europe,subject.People.Artists,People


Size: (3177, 4)


In [8]:
df_articles = df_continents_categories[["article", "continent"]].drop_duplicates()
df_articles = pd.merge(df_articles, df_continents_categories.groupby("article")["categoryMain"].apply(list).reset_index(), on="article")
df_articles = pd.merge(df_articles, df_continents_categories.groupby("article")["category"].apply(list).reset_index(), on="article")

display(df_articles.head())
print("Size:", df_articles.shape)

Unnamed: 0,article,continent,categoryMain,category
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Europe,"[History, People]",[subject.History.British_History.British_Histo...
1,%C3%85land,Europe,"[Countries, Geography]","[subject.Countries, subject.Geography.European..."
2,%C3%89douard_Manet,Europe,[People],[subject.People.Artists]
3,%C3%89ire,Europe,"[Countries, Geography]","[subject.Countries, subject.Geography.European..."
4,%C3%93engus_I_of_the_Picts,Europe,"[History, People]",[subject.History.British_History.British_Histo...


Size: (2734, 4)


## Article word count


The article length might be an important factor when navigating the Wikispeedia game. Less attentive players may miss an important link in longer articles or get bored. To analyze this, we compute the article `length` as its number of words. Each article ends with a disclaimer, which we need to remove before counting.
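A minimal sketch of the counting step on a single string is given here. It assumes, as the cell below does, that the disclaimer starts at the phrase "Retrieved from"; unlike that cell, it falls back to the full text when the marker is missing, and it does not skip the first `#copyright` line. The helper name and the sample text are made up for illustration.

```python
def count_words(text: str, marker: str = "Retrieved from") -> int:
    """Count whitespace-separated words that appear before the disclaimer."""
    cut = text.find(marker)
    body = text if cut == -1 else text[:cut]
    return len(body.split())


# Hypothetical article text, for illustration only.
sample = 'Manet was a French painter.\nRetrieved from "http://example.org"'
print(count_words(sample))  # 5
```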

In [9]:
plaintext_path = os.path.join("Data", "plaintext_articles")

word_counts = []
for article_name in df_articles.article:
    file_path = os.path.join(plaintext_path, article_name + ".txt")

    with open(file_path, "r", encoding="utf-8") as file:

        _ = file.readline() # Skip the first line because it contains the word #copyright
        content = file.read()

    content = content[:re.search("Retrieved from", content).start(0)]
    word_counts.append(len(content.split()))

df_articles["length"] = word_counts

display(df_articles.head())
print("Size:", df_articles.shape)

Unnamed: 0,article,continent,categoryMain,category,length
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Europe,"[History, People]",[subject.History.British_History.British_Histo...,1836
1,%C3%85land,Europe,"[Countries, Geography]","[subject.Countries, subject.Geography.European...",2412
2,%C3%89douard_Manet,Europe,[People],[subject.People.Artists],2887
3,%C3%89ire,Europe,"[Countries, Geography]","[subject.Countries, subject.Geography.European...",2026
4,%C3%93engus_I_of_the_Picts,Europe,"[History, People]",[subject.History.British_History.British_Histo...,2029


Size: (2734, 5)


## PageRank

PageRank is the algorithm that originally powered Google's search ranking. We use it to measure the importance of each article in the Wikispeedia link network.
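The notebook loads precomputed values from `page_rank.csv`, and how they were produced is not shown. As a rough sketch of what the score represents, the following pure-Python power iteration computes PageRank on a toy link graph. The damping factor of 0.85, the function name, and the toy graph are assumptions made for this example.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 100) -> dict:
    """Approximate PageRank by power iteration on a dict of outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Toy graph: "C" receives links from every other page.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
print(max(ranks, key=ranks.get))  # C
```

The scores sum to one, so each value can be read as the long-run share of visits of a random surfer who follows links and occasionally jumps to a random page.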

In [10]:
df_pagerank = pd.read_csv(os.path.join("Data", "page_rank.csv"))
df_articles = pd.merge(df_articles, df_pagerank, on="article", how="left").fillna(df_pagerank.min())

display(df_articles.head())
print("Size:", df_articles.shape)

Unnamed: 0,article,continent,categoryMain,category,length,pageRank
0,%C3%81ed%C3%A1n_mac_Gabr%C3%A1in,Europe,"[History, People]",[subject.History.British_History.British_Histo...,1836,3.3e-05
1,%C3%85land,Europe,"[Countries, Geography]","[subject.Countries, subject.Geography.European...",2412,3.3e-05
2,%C3%89douard_Manet,Europe,[People],[subject.People.Artists],2887,3.3e-05
3,%C3%89ire,Europe,"[Countries, Geography]","[subject.Countries, subject.Geography.European...",2026,3.3e-05
4,%C3%93engus_I_of_the_Picts,Europe,"[History, People]",[subject.History.British_History.British_Histo...,2029,3.3e-05


Size: (2734, 6)


## Paths

Players' paths are the key to analyzing the players' behavior. In addition to the provided information, we add columns consisting of the number of back clicks, unique articles, and whether the player succeeded in finding the target article.
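To make the derived columns concrete, the sketch below computes them for a single raw path string. The helper name and the sample path are hypothetical. Note that `uniqueArticles` follows the notebook's definition (steps minus back clicks), which can differ from the number of distinct article names.

```python
def path_features(raw_path: str) -> dict:
    """Derive step counts from a ';'-separated path in which '<' marks a back click."""
    steps = raw_path.split(";")
    backclicks = steps.count("<")
    return {
        "pathSteps": len(steps),
        "backclicks": backclicks,
        "uniqueArticles": len(steps) - backclicks,
        "start": steps[0],
        "last": steps[-1],
    }


# Hypothetical path with one back click, for illustration only.
features = path_features("14th_century;Europe;<;Africa;Atlantic_slave_trade")
print(features["pathSteps"], features["backclicks"], features["uniqueArticles"])  # 5 1 4
```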


In [11]:
df_paths_finished = pd.read_csv(
    os.path.join("Data", "wikispeedia_paths-and-graph", "paths_finished.tsv"),
    sep="\t",
    header=None,
    names=["hashedIpAddress", "timestamp", "durationInSec", "path", "rating"],
    skip_blank_lines=True,
    comment="#"
)
df_paths_unfinished = pd.read_csv(
    os.path.join("Data", "wikispeedia_paths-and-graph", "paths_unfinished.tsv"),
    sep="\t",
    header=None,
    names=["hashedIpAddress", "timestamp", "durationInSec", "path", "target", "motif"],
    skip_blank_lines=True,
    comment="#"
)

df_paths_finished["backclicks"] = df_paths_finished["path"].apply(lambda x: x.count("<"))
df_paths_finished["pathSteps"] = df_paths_finished["path"].apply(lambda x: x.count(";") + 1)
df_paths_finished["uniqueArticles"] = df_paths_finished["pathSteps"] - df_paths_finished["backclicks"]
df_paths_finished["path"] = df_paths_finished["path"].apply(lambda x: x.split(";"))
df_paths_finished["start"] = df_paths_finished["path"].str[0]
df_paths_finished["target"] = df_paths_finished["path"].str[-1]
df_paths_finished["isFinished"] = True

df_paths_unfinished["backclicks"] = df_paths_unfinished["path"].apply(lambda x: x.count("<"))
df_paths_unfinished["pathSteps"] = df_paths_unfinished["path"].apply(lambda x: x.count(";") + 1)
df_paths_unfinished["uniqueArticles"] = df_paths_unfinished["pathSteps"] - df_paths_unfinished["backclicks"]
df_paths_unfinished["path"] = df_paths_unfinished["path"].apply(lambda x: x.split(";"))
df_paths_unfinished["start"] = df_paths_unfinished["path"].str[0]
df_paths_unfinished["isFinished"] = False

df_paths = pd.concat([df_paths_finished, df_paths_unfinished])
df_paths = df_paths[df_paths["start"].isin(df_articles_all.name) & df_paths["target"].isin(df_articles_all.name)]
df_paths["durationInMin"] = df_paths["durationInSec"] / 60

display(df_paths.head())
print("Size:", df_paths.shape)


Unnamed: 0,hashedIpAddress,timestamp,durationInSec,path,rating,backclicks,pathSteps,uniqueArticles,start,target,isFinished,motif,durationInMin
0,6a3701d319fc3754,1297740409,166,"[14th_century, 15th_century, 16th_century, Pac...",,0,9,9,14th_century,African_slave_trade,True,,2.766667
1,3824310e536af032,1344753412,88,"[14th_century, Europe, Africa, Atlantic_slave_...",3.0,0,5,5,14th_century,African_slave_trade,True,,1.466667
2,415612e93584d30e,1349298640,138,"[14th_century, Niger, Nigeria, British_Empire,...",,0,8,8,14th_century,African_slave_trade,True,,2.3
3,64dd5cd342e3780c,1265613925,37,"[14th_century, Renaissance, Ancient_Greece, Gr...",,0,4,4,14th_century,Greece,True,,0.616667
4,015245d773376aab,1366730828,175,"[14th_century, Italy, Roman_Catholic_Church, H...",3.0,0,7,7,14th_century,John_F._Kennedy,True,,2.916667


Size: (76164, 13)


## Shortest Paths

The shortest path is another important metric. Some of the start and target article pairs might be far away from each other, and thus, the game round is much harder for the player.
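Each row of the distance matrix file is a string with one character per target article, where a digit is the shortest-path length and `_` marks an unreachable target. A small sketch of the per-row parsing, with a made-up row and a helper name of our own, looks like this:

```python
def parse_distance_row(line: str) -> list:
    """Convert one matrix row into integers, using -1 for unreachable targets."""
    return [-1 if char == "_" else int(char) for char in line.strip()]


# Hypothetical row, for illustration only.
print(parse_distance_row("0_23_1"))  # [0, -1, 2, 3, -1, 1]
```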

In [12]:
shortest_paths = []
with open(os.path.join("Data", "wikispeedia_paths-and-graph", "shortest-path-distance-matrix.txt")) as file:
    for line in file:
        line = line.strip()
        if line == "" or line.startswith("#"):
            continue
        shortest_paths.append(list(map(lambda x: -1 if x == "_" else int(x), list(line))))
        
shortest_paths = np.array(shortest_paths)

df_shortest_paths = pd.DataFrame(shortest_paths, index=df_articles_all.name, columns=df_articles_all.name)
df_paths["shortestPath"] = df_paths.apply(lambda row: df_shortest_paths.loc[row["start"], row["target"]], axis="columns")
df_paths = df_paths[df_paths["shortestPath"] >= 0]

display(df_paths.head())
print("Size:", df_paths.shape)

Unnamed: 0,hashedIpAddress,timestamp,durationInSec,path,rating,backclicks,pathSteps,uniqueArticles,start,target,isFinished,motif,durationInMin,shortestPath
0,6a3701d319fc3754,1297740409,166,"[14th_century, 15th_century, 16th_century, Pac...",,0,9,9,14th_century,African_slave_trade,True,,2.766667,3
1,3824310e536af032,1344753412,88,"[14th_century, Europe, Africa, Atlantic_slave_...",3.0,0,5,5,14th_century,African_slave_trade,True,,1.466667,3
2,415612e93584d30e,1349298640,138,"[14th_century, Niger, Nigeria, British_Empire,...",,0,8,8,14th_century,African_slave_trade,True,,2.3,3
3,64dd5cd342e3780c,1265613925,37,"[14th_century, Renaissance, Ancient_Greece, Gr...",,0,4,4,14th_century,Greece,True,,0.616667,2
4,015245d773376aab,1366730828,175,"[14th_century, Italy, Roman_Catholic_Church, H...",3.0,0,7,7,14th_century,John_F._Kennedy,True,,2.916667,3


Size: (76155, 14)


# Data Exploration

In this section, we explore the relationships between articles, categories, and our continent labels, and present some of the results. For the interactive plots, please refer to the end of this notebook.

In [13]:
article_count_per_continent = df_continents.groupby("continent").size().sort_index()

display(article_count_per_continent)

continent
Africa            265
Antarctica          9
Asia              377
Australia         122
Europe           1245
North America     593
South America     123
dtype: int64

From the `article_count_per_continent` Series and the pie chart below, one can see that the largest share of articles belongs to Europe. This is expected, as the Wikispeedia selection of articles comes from a set compiled for the English education system.

![](plots/px/articles_count_per_continent_pie.png)
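The shares shown in the pie chart can be recomputed directly from the counts printed above. The snippet below is a plain-Python restatement of that arithmetic; the variable names are ours.

```python
# Article counts per continent, copied from the output above.
article_counts = {
    "Africa": 265,
    "Antarctica": 9,
    "Asia": 377,
    "Australia": 122,
    "Europe": 1245,
    "North America": 593,
    "South America": 123,
}

total = sum(article_counts.values())
shares = {continent: count / total for continent, count in article_counts.items()}

print(total)                       # 2734
print(round(shares["Europe"], 3))  # 0.455
```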

Next, we examine the distribution of articles across the main categories within each continent.

In [14]:
df_continents_categories_counts = pd.crosstab(df_continents_categories["continent"], df_continents_categories["categoryMain"]).sort_index()

display(df_continents_categories_counts)
print("Size:", df_continents_categories_counts.shape)

continent,Art,Business_Studies,Citizenship,Countries,Design_and_Technology,Everyday_life,Geography,History,IT,Language_and_literature,Mathematics,Music,People,Religion,Science
Africa,0,1,4,52,4,2,143,30,0,9,0,2,17,1,61
Antarctica,0,0,0,0,0,0,7,0,0,0,0,0,0,0,3
Asia,1,6,15,55,11,19,169,53,0,14,1,3,51,29,17
Australia,0,0,1,15,3,5,64,10,0,1,0,2,12,1,25
Europe,13,17,33,55,71,51,287,313,2,101,0,33,415,41,33
North America,0,7,26,4,59,51,166,53,4,15,0,18,183,5,28
South America,0,0,1,48,1,0,85,9,0,0,0,10,9,0,11


Size: (7, 15)


The figures below show that the most represented category is `Geography`, followed by `People` and `History`, which is well suited to our analysis of geographical bias.

![](plots/px/articles_count_per_category_bar.png)

![](plots/px/articles_count_per_category_pie.png)


Now, let's analyze the collected paths in our dataset.

In [15]:
# Prefix the article attributes with "target" (e.g. "continent" -> "targetContinent").
df_articles_target = df_articles.copy()
df_articles_target.columns = [column[0].upper() + column[1:] for column in df_articles_target.columns]
df_articles_target = df_articles_target.add_prefix("target")

# Inner merge: only paths whose target article has a continent label are kept.
df_paths_articles = pd.merge(df_paths, df_articles_target, left_on="target", right_on="targetArticle").drop(columns="targetArticle")

# Same for the start article, prefixed with "start".
df_start_articles = df_articles.copy()
df_start_articles.columns = [column[0].upper() + column[1:] for column in df_start_articles.columns]
df_start_articles = df_start_articles.add_prefix("start")
df_paths_articles = pd.merge(df_paths_articles, df_start_articles, left_on="start", right_on="startArticle").drop(columns="startArticle")

df_paths_articles["isFinishedInt"] = df_paths_articles["isFinished"].astype(int)

display(df_paths_articles.head())
print("Size:", df_paths_articles.shape)

Unnamed: 0,hashedIpAddress,timestamp,durationInSec,path,rating,backclicks,pathSteps,uniqueArticles,start,target,...,targetCategoryMain,targetCategory,targetLength,targetPageRank,startContinent,startCategoryMain,startCategory,startLength,startPageRank,isFinishedInt
0,1a218aa161301e6e,1355086784,40,"[James_Bond, United_Kingdom, Europe, Africa, A...",,0,6,6,James_Bond,African_slave_trade,...,[History],[subject.History.General_history],2654,5.5e-05,Europe,[Everyday_life],[subject.Everyday_life.Films],7496,0.000186,1
1,1ad6fbd964102221,1332642329,144,"[James_Bond, Star_Wars, Mythology, The_Lord_of...",,5,11,6,James_Bond,Iron_Maiden,...,[Music],[subject.Music.Performers_and_composers],4047,0.000159,Europe,[Everyday_life],[subject.Everyday_life.Films],7496,0.000186,0
2,3e6b12634169fb72,1357250279,28,"[James_Bond, Sean_Connery, Scotland, Scottish_...",,0,4,4,James_Bond,Scottish_Gaelic_language,...,[Language_and_literature],[subject.Language_and_literature.Languages],4780,0.000243,Europe,[Everyday_life],[subject.Everyday_life.Films],7496,0.000186,1
3,2141997163054c23,1272956123,18,"[James_Bond, United_States, Canada, Stephen_Ha...",,0,4,4,James_Bond,Stephen_Harper,...,[People],[subject.People.Political_People],4801,0.000107,Europe,[Everyday_life],[subject.Everyday_life.Films],7496,0.000186,1
4,15945db656214ee5,1253827056,64,"[James_Bond, Germany, Adolf_Hitler, Nazi_Germa...",,0,5,5,James_Bond,Nazism,...,[History],[subject.History.World_War_II],7377,0.000706,Europe,[Everyday_life],[subject.Everyday_life.Films],7496,0.000186,1


Size: (19163, 25)


In [16]:
df_article_path_stats = pd.DataFrame()

df_article_path_stats["article"] = df_articles["article"]
df_article_path_stats["continent"] = df_articles["continent"]
df_article_path_stats["targetFinished"] = df_articles["article"].map(df_paths_finished["target"].value_counts()).fillna(0)
df_article_path_stats["targetUnfinished"] = df_articles["article"].map(df_paths_unfinished["target"].value_counts()).fillna(0)

df_article_path_stats["startFinished"] = df_articles["article"].map(df_paths_finished["start"].value_counts()).fillna(0)
df_article_path_stats["startUnfinished"] = df_articles["article"].map(df_paths_unfinished["start"].value_counts()).fillna(0)

paths_finished = pd.Series(np.concatenate(df_paths_finished.path.values))
paths_unfinished = pd.Series(np.concatenate(df_paths_unfinished.path.values))

df_article_path_stats["anyFinished"] = df_articles["article"].map(paths_finished.value_counts()).fillna(0)
df_article_path_stats["anyUnfinished"] = df_articles["article"].map(paths_unfinished.value_counts()).fillna(0)
df_article_path_stats["anyPercentage"] = (df_article_path_stats["anyFinished"] + df_article_path_stats["anyUnfinished"]) / (len(paths_finished) + len(paths_unfinished))

display(df_article_path_stats.sort_values("anyPercentage", ascending=False).head())
print("Size:", df_article_path_stats.shape)

Unnamed: 0,article,continent,targetFinished,targetUnfinished,startFinished,startUnfinished,anyFinished,anyUnfinished,anyPercentage
2563,United_States,North America,28.0,3.0,44.0,7.0,8896.0,3553.0,0.026149
827,Europe,Europe,17.0,2.0,26.0,15.0,4362.0,1249.0,0.011786
2560,United_Kingdom,Europe,28.0,0.0,16.0,6.0,3904.0,1424.0,0.011192
803,England,Europe,111.0,14.0,98.0,45.0,3332.0,1226.0,0.009574
62,Africa,Africa,28.0,5.0,75.0,23.0,2796.0,794.0,0.007541


Size: (2734, 9)


From the DataFrames above, we observed that the PageRank of start and target articles and the number of back clicks follow heavy-tailed distributions. The distribution of start and target articles again shows that most articles are labeled as Europe. The figure below shows the percentage of finished paths per continent. For example, roughly 30% of all played games are finished and start from a Europe-related article, while the remaining finished games make up roughly another 40%.

![](plots/px/finished_path_percentage_per_article_continent_bar.png)

# Naive Analysis

We have observed strong evidence that Europe-related articles are nearly as numerous as those of all other continents combined. But does that mean that players perform better in games that lead to Europe-related articles? That is what we aim to answer. We define the treatment group as the set of all played paths whose target article is labeled Europe, and the control group as the set of all paths whose target article is labeled with one of the other continents. We do not consider paths leading to articles with the `International` label.

We performed a Welch's t-test (unequal variances) to compare players' performance in the treatment and control groups. For the game success rate (`isFinishedInt`) and the path length (`pathSteps`), the test rejects the null hypothesis that the two group means are equal at a significance level of 5%. For the duration (`durationInMin`), it does not. But can we really trust this result? Common confounders might have influenced it, so drawing a conclusion at this point would be premature.
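For reference, the statistic behind a t-test with unequal variances can be written in a few lines of standard-library Python. This is only a sketch on toy numbers with a helper name of our own; it returns the t statistic and the Welch-Satterthwaite degrees of freedom but not the p-value, which `scipy.stats.ttest_ind(..., equal_var=False)` computes in the cell below.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    # Squared standard errors of the two sample means.
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof


# Toy samples, for illustration only.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 4), round(dof, 4))  # -1.8974 5.8824
```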

In [17]:
df_analysis = df_paths_articles.copy()
df_analysis["treatment"] = df_analysis.targetContinent == "Europe"

for col in ["isFinishedInt", "durationInMin", "pathSteps"]:
    print(col, *scipy.stats.ttest_ind(df_analysis[df_analysis.treatment][col], df_analysis[~df_analysis.treatment][col], equal_var=False))


isFinishedInt 2.153208373839039 0.03131578545381225
durationInMin -1.2669158094901722 0.205202333537421
pathSteps -3.069419741398319 0.0021482721338009847


# Matching

We perform matching to control for confounding factors and ensure a fair comparison between the treatment and control groups. The following factors might influence the difficulty of finding the target article and, therefore, affect the success rate. Some article categories might be generally better-known topics. The player path cannot be shorter than the shortest possible path. Article page rank shows how often the player would encounter an article in a random walk. 

We computed correlation coefficients with `isFinished` to validate our idea of possible relationships between the given variables. The largest absolute values were found for the target article PageRank and the shortest path (Spearman's rank correlation of about 0.33 and -0.32, respectively).

We decided to match on the following conditions. An edge between a treatment path and a control path is added to the graph only if all of the following conditions are met (logical AND):

- At least one of the start articles' main categories matches
- At least one of the target articles' main categories matches
- The shortest-path lengths match

The weight of an edge is the similarity of the two paths' propensity scores, `1 - |score_1 - score_2|`, where the score is predicted by a logistic regression on the start and target articles' PageRank and length.

Matching allows us to create more comparable groups, reducing bias and increasing the reliability of our analysis. To compare the distributions before and after matching, please refer to Plot 5 at the end of the notebook.
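The edge conditions and the edge weight can be illustrated on two toy paths. The dictionaries below are invented for this sketch and reuse the notebook's column names; in the notebook, the resulting weighted graph is then passed to `nx.max_weight_matching`.

```python
def can_match(treated: dict, control: dict) -> bool:
    """Check the three matching conditions (logical AND)."""
    return (
        bool(set(treated["startCategoryMain"]) & set(control["startCategoryMain"]))
        and bool(set(treated["targetCategoryMain"]) & set(control["targetCategoryMain"]))
        and treated["shortestPath"] == control["shortestPath"]
    )


def similarity(score_a: float, score_b: float) -> float:
    """Edge weight: 1 minus the absolute difference of the two scores."""
    return 1 - abs(score_a - score_b)


# Hypothetical paths, for illustration only.
treated = {"startCategoryMain": ["People"], "targetCategoryMain": ["History", "People"],
           "shortestPath": 3, "propensityScore": 0.70}
control = {"startCategoryMain": ["People", "Science"], "targetCategoryMain": ["History"],
           "shortestPath": 3, "propensityScore": 0.64}
too_far = dict(control, shortestPath=2)

print(can_match(treated, control))  # True
print(can_match(treated, too_far))  # False
weight = similarity(treated["propensityScore"], control["propensityScore"])
```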

In [18]:
corr_cols = ["targetLength", "targetPageRank", "startLength", "startPageRank", "isFinished", "shortestPath"]

print("Pearson correlation")
display(df_analysis[corr_cols].corr()["isFinished"])
print("Spearman's rank correlation")
display(df_analysis[corr_cols].corr("spearman")["isFinished"])

Pearson correlation


targetLength      0.115651
targetPageRank    0.153305
startLength       0.000729
startPageRank     0.015119
isFinished        1.000000
shortestPath     -0.330766
Name: isFinished, dtype: float64

Spearman's rank correlation


targetLength      0.114733
targetPageRank    0.334440
startLength      -0.002270
startPageRank     0.015304
isFinished        1.000000
shortestPath     -0.321564
Name: isFinished, dtype: float64

In [19]:
eq = "isFinishedInt ~ startLength + startPageRank + targetLength + targetPageRank"

model = smf.logit(eq, df_analysis).fit()

df_analysis["propensityScore"] = model.predict()

model.summary()

Optimization terminated successfully.
         Current function value: 0.577595
         Iterations 10


Dep. Variable:,isFinishedInt,No. Observations:,19163.0
Model:,Logit,Df Residuals:,19158.0
Method:,MLE,Df Model:,4.0
Date:,"Sat, 23 Dec 2023",Pseudo R-squ.:,0.04305
Time:,00:05:16,Log-Likelihood:,-11068.0
converged:,True,LL-Null:,-11566.0
Covariance Type:,nonrobust,LLR p-value:,2.711e-214

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.4315,0.038,11.258,0.000,0.356,0.507
startLength,-9.693e-06,6.49e-06,-1.493,0.135,-2.24e-05,3.03e-06
startPageRank,-37.8948,20.384,-1.859,0.063,-77.847,2.058
targetLength,3.563e-05,6.26e-06,5.693,0.000,2.34e-05,4.79e-05
targetPageRank,1200.8379,63.857,18.805,0.000,1075.681,1325.995


In [20]:
# treatment_df = df_analysis[df_analysis["treatment"]]
# control_df = df_analysis[~df_analysis["treatment"]]

# def get_similarity(propensity_score1, propensity_score2):
#     '''Calculate similarity for instances with given propensity scores'''
#     return 1 - np.abs(propensity_score1 - propensity_score2)

# G = nx.Graph()
# for control_id, control_row in control_df.iterrows():
#     for treatment_id, treatment_row in treatment_df.iterrows():

#         if len(set(treatment_row['startCategoryMain']) & set(control_row['startCategoryMain'])) \
#         and len(set(treatment_row['targetCategoryMain']) & set(control_row['targetCategoryMain'])) \
#         and treatment_row["shortestPath"] == control_row["shortestPath"]:
#             weight = get_similarity(treatment_row["propensityScore"], control_row["propensityScore"])
#             G.add_edge(treatment_id, control_id, weight=weight)

# matching = nx.max_weight_matching(G)

# with open("matching.pkl", "wb") as file:
#     pickle.dump(matching, file)


In [21]:
with open(os.path.join("Data", "matching.pkl"), "rb") as file:
    matched_indices = pickle.load(file)

matched_indices = [i[0] for i in list(matched_indices)] + [i[1] for i in list(matched_indices)]
df_analysis_balanced = df_analysis.iloc[matched_indices]

display(df_analysis_balanced.head())
print("Size:", df_analysis_balanced.shape)


Unnamed: 0,hashedIpAddress,timestamp,durationInSec,path,rating,backclicks,pathSteps,uniqueArticles,start,target,...,targetLength,targetPageRank,startContinent,startCategoryMain,startCategory,startLength,startPageRank,isFinishedInt,treatment,propensityScore
8201,2eb1aba256417389,1352179248,263,"[Leaning_Tower_of_Pisa, Government, Democracy,...",3.0,0,7,7,Leaning_Tower_of_Pisa,Richard_Nixon,...,8056,0.000403,Europe,[Design_and_Technology],[subject.Design_and_Technology.Architecture],1275,4e-05,1,False,0.766518
18479,291cbcb573e79d58,1344681398,19,"[Harlem_Globetrotters, Austria, Switzerland, Z...",,0,4,4,Harlem_Globetrotters,Z%C3%BCrich,...,3081,0.000177,North America,[Everyday_life],[subject.Everyday_life.Sports_teams],2053,6.1e-05,1,True,0.675117
6101,28ed64876d7d2fcb,1330484711,65,"[Pel%C3%A9, United_Nations, United_Kingdom, Sc...",1.0,0,4,4,Pel%C3%A9,Scotland,...,8044,0.002839,South America,[People],[subject.People.Sports_and_games_people],3465,0.000119,1,True,0.983525
2719,6887a3157f124771,1350940289,175,"[Niagara_Falls, English_Channel, United_Kingdo...",,0,4,4,Niagara_Falls,Llywelyn_the_Great,...,4967,5.1e-05,North America,[Geography],[subject.Geography.North_American_Geography],4556,0.000101,0,True,0.650704
17321,0d57c8c57d75e2f5,1285771412,79,"[Vladimir_Lenin, London, M25_motorway, M1_moto...",1.0,0,5,5,Vladimir_Lenin,M6_motorway,...,2248,5.3e-05,Europe,[People],[subject.People.Political_People],4655,0.000299,1,True,0.626804


Size: (12342, 27)


# Observational Study

After matching, we perform the same test on the balanced treatment and control groups. The results are now far less promising: we cannot reject the null hypothesis at a significance level of 5% for any of the tested variables. It seems that the results we obtained in the naive analysis were only effects of common confounders, and we cannot say that there is a geographical bias in players' performance in games leading to Europe-related articles.

In [22]:
for col in ["isFinishedInt", "durationInMin", "pathSteps"]:
    print(col, *scipy.stats.ttest_ind(df_analysis_balanced[df_analysis_balanced.treatment][col], df_analysis_balanced[~df_analysis_balanced.treatment][col], equal_var=False))


isFinishedInt 0.18058607331224732 0.8566954535620267
durationInMin -0.0915027657930923 0.9270945941633754
pathSteps 1.3221557391528922 0.18614270027329563


# Data Story Plots

In preparation for the visual exploration of our data story, we generate a set of distinctive colors for each continent.

In [23]:
continents = df_continents["continent"].unique()
random_colors = sns.color_palette("husl", n_colors=len(continents))
continents_colors = {}
continents_colors_int = {}
for i in range(len(continents)):
    continents_colors[continents[i]] = random_colors[i]
    continents_colors_int[continents[i]] = tuple(map(lambda x: int(255 * x), random_colors[i]))
    continents_colors_int[continents[i]] = "#{0:02x}{1:02x}{2:02x}".format(*continents_colors_int[continents[i]])
print(continents_colors)
print(continents_colors_int)

CONTINENTS_NUM = len(continents_colors)

TREATMENT_LABEL = "Europe"
CONTROL_LABEL = "Other"

{'Europe': (0.9677975592919913, 0.44127456009157356, 0.5358103155058701), 'North America': (0.7757319041862729, 0.5784925270759935, 0.19475566538551875), 'Australia': (0.5105309046900421, 0.6614299289084904, 0.1930849118538962), 'Asia': (0.20433460114757862, 0.6863857739476534, 0.5407103379425205), 'Africa': (0.21662978923073606, 0.6676586160122123, 0.7318695594345369), 'South America': (0.5049017849530067, 0.5909119231215284, 0.9584657252128558), 'Antarctica': (0.9587050080494409, 0.3662259565791742, 0.9231469575614251)}
{'Europe': '#f67088', 'North America': '#c59331', 'Australia': '#82a831', 'Asia': '#34af89', 'Africa': '#37aaba', 'South America': '#8096f4', 'Antarctica': '#f45deb'}


## Plot 1: Number of Articles per Continent

In [24]:
fig_name = "articles_count_per_continent"
fig_title = "Fraction of articles per Continent"
fig_ylabel = "Count"
fig_xlabel = "Continent"


# fig = px.bar(
#     x=article_count_per_continent.index,
#     y=article_count_per_continent.values,
#     color=[continents_colors_int[continent] for continent in article_count_per_continent.index],
#     color_discrete_map="identity",
#     labels={"index": fig_ylabel, "value": fig_xlabel},

# )
# fig.update_layout(
#     title_text=fig_title,
#     title_x=0.5,
#     #xaxis=dict(tickangle=-45),
#     width=FIGURE_WIDTH,
#     height=FIGURE_HEIGHT,
# )

# fig.update_xaxes(title_text="Continent")
# fig.update_yaxes(title_text="Count")

# fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_bar.pdf"))
# fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_bar.html"))
# fig.show()

# Pull the Europe slice out of the pie by 0.1; all other slices stay in place.
pull = np.where(article_count_per_continent.index == EUROPE_LABEL, 0.1, 0.0)
fig = go.Figure(data=[go.Pie(
    values=article_count_per_continent.values,
    labels=article_count_per_continent.index.tolist(),
    pull=pull.tolist(),
    marker_colors=[continents_colors_int[continent] for continent in article_count_per_continent.index],
    sort=False
)])

fig.update_layout(
    title_text=fig_title,
    title_x=0.5,
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_pie.png"))
fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_pie.html"))
fig.show()


## Plot 2: Continent Distribution per Category

This visualization shows how articles are distributed across continents within each category. The stacked bar chart gives an overview of the article count per category, and the interactive pie chart shows the continent shares for a single category selected from a dropdown menu.

In [25]:
fig_name = "articles_count_per_category"
fig_title = "Continent distribution per Category"
fig_xlabel = "Article Count"
fig_ylabel = "Category"


categories_sorted = df_continents_categories_counts.sum(axis="index").sort_values().index

fig = px.bar(
    df_continents_categories_counts.T.loc[categories_sorted],
    orientation="h",
    title=fig_title,
    color_discrete_sequence=[continents_colors_int[continent] for continent in df_continents_categories_counts.index],
)

fig.update_xaxes(title_text=fig_xlabel)
fig.update_yaxes(title_text=fig_ylabel)

fig.update_layout(
    legend_title_text="",
    title_x=0.5,
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT    
)
fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_bar.html"))
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_bar.png"))
fig.show()


fig = go.Figure()

annotations = {}
buttons = []
visible = True
mask = [False] * len(categories_sorted)
max_name_len = max(len(name) for name in continents)
for category_idx, category in enumerate(reversed(categories_sorted)):
    category_data = df_continents_categories_counts[category]
    category_data = category_data[category_data > 0]

    category_name = category.replace("_", " ")
    labels = [f"{name : <{max_name_len}}" for name in category_data.index]

    # Pull the Europe slice out of the pie by 0.1; all other slices stay in place.
    pull = np.where(category_data.index == EUROPE_LABEL, 0.1, 0.0)
    fig.add_trace(go.Pie(
        labels=labels,
        values=category_data.values,
        pull=pull.tolist(),
        marker_colors=[continents_colors_int[continent] for continent in category_data.index],
        visible=visible,
        name=category_name,
        sort=False
    ))

    annotation = dict(
        text=f"Category: {category_name}",
        x=-0.3,
        y=0.05,
        xanchor="left",
        showarrow=False
    )
    if visible:
        fig.add_annotation(annotation)

    mask[category_idx] = True
    buttons.append(dict(
        label=category_name,
        method="update",
        args=[
            {"visible": list(mask)},
            {"title": fig_title, "annotations": [annotation]}
        ]
    ))
    mask[category_idx] = False
    visible = False

fig.update_layout(
    title_text=fig_title,
    title_x=0.7,
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
    legend=dict(
        x=-0.3,
        y=0.1
    )
)


fig.update_layout(
    updatemenus=[
        dict(
            active=0,
            buttons=buttons
        )
    ]
)

fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_pie.html"))
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_pie.png"))
fig.show()
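
The dropdown works because every pie chart is added to the same figure and each button carries a one-hot `visible` list that shows exactly one trace. The sketch below reproduces only that bookkeeping in plain Python. The category names are placeholders and `one_hot_masks` is a hypothetical helper.

```python
def one_hot_masks(names):
    """Build one visibility list per name; only the matching trace is True."""
    masks = {}
    for idx, name in enumerate(names):
        mask = [False] * len(names)
        mask[idx] = True
        masks[name] = mask
    return masks


# Placeholder category names (illustration only).
masks = one_hot_masks(["Geography", "History", "Science"])
for name, mask in masks.items():
    print(name, mask)
```

The notebook reuses a single `mask` list and copies it with `list(mask)` before resetting the entry, which has the same effect.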

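## Plot 3: Finished Paths Starting from or Leading to Articles of a Specific Continent

The plot shows, for each continent, the number of finished paths that start at (or, via the toggle, lead to) an article of that continent, divided by the total number of paths in `df_analysis`. Although the axis is labelled "Percentage", the plotted values are fractions between 0 and 1.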

In [26]:
fig_name = "finished_path_percentage_per_article_continent"
fig_title = "Percentage of Finished Paths by {} Article Continent"
fig_ylabel = "Percentage"
fig_xlabel = "Continent"


finished_paths_per_start_article_continent = df_analysis.groupby("startContinent")["isFinished"].sum() / len(df_analysis)
finished_paths_per_start_article_continent = finished_paths_per_start_article_continent.reset_index()
finished_paths_per_start_article_continent["treatment"] = finished_paths_per_start_article_continent["startContinent"] == EUROPE_LABEL
finished_paths_per_start_article_continent["labels"] = finished_paths_per_start_article_continent["treatment"].map({True: TREATMENT_LABEL, False: CONTROL_LABEL})
finished_paths_per_start_article_continent = finished_paths_per_start_article_continent.sort_values("startContinent", ascending=False)

finished_paths_per_target_article_continent = df_analysis.groupby("targetContinent")["isFinished"].sum() / len(df_analysis)
finished_paths_per_target_article_continent = finished_paths_per_target_article_continent.reset_index()
finished_paths_per_target_article_continent["treatment"] = finished_paths_per_target_article_continent["targetContinent"] == EUROPE_LABEL
finished_paths_per_target_article_continent["labels"] = finished_paths_per_target_article_continent["treatment"].map({True: TREATMENT_LABEL, False: CONTROL_LABEL})
finished_paths_per_target_article_continent = finished_paths_per_target_article_continent.sort_values("targetContinent", ascending=False)

fig = go.Figure()

for _, row in finished_paths_per_start_article_continent.iterrows():
    fig.add_trace(go.Bar(
        x=(row["labels"],),
        y=(row["isFinished"],),
        name=row["startContinent"],
        hovertemplate=f"{row['isFinished'] :.3f}",
        marker_color=continents_colors_int[row["startContinent"]],
        visible=True
    ))

for _, row in finished_paths_per_target_article_continent.iterrows():
    fig.add_trace(go.Bar(
        x=(row["labels"],),
        y=(row["isFinished"],),
        name=row["targetContinent"],
        hovertemplate=f"{row['isFinished'] :.3f}",
        marker_color=continents_colors_int[row["targetContinent"]],
        visible=False
    ))

continents_num = len(finished_paths_per_start_article_continent)
buttons = [
    dict(
        label="Start Articles",
        method="update",
        args=[
            {"visible": [True] * continents_num + [False] * continents_num},
            {"title": fig_title.format("Start"), "annotations": []}
        ]
    ),
    dict(
        label="Target Articles",
        method="update",
        args=[
            {"visible": [False] * continents_num + [True] * continents_num},
            {"title": fig_title.format("Target"), "annotations": []}
        ]
    )
]

fig.update_layout(
    updatemenus=[
        dict(
            active=0,
            buttons=buttons,
            x=0.,
            xanchor="left",
            y=1.1,
            yanchor="top"
        ),
    ]
)

fig.update_layout(
    title=fig_title.format("Start"),
    title_x=0.5,
    yaxis_title=fig_ylabel,
    xaxis_title=fig_xlabel,
    barmode="stack",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)

fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_bar.html"))
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_bar.png"))
fig.show()
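
Because the counts are divided by the total number of paths rather than by the number of paths per continent, the bars add up to the overall completion rate. The toy table below (made-up rows, same column names as `df_analysis`) contrasts that normalization with a per-continent completion rate, which is an alternative view the notebook does not plot.

```python
import pandas as pd

# Made-up paths (illustration only); column names follow df_analysis.
toy = pd.DataFrame({
    "startContinent": ["Europe", "Europe", "Europe", "Asia", "Asia", "Africa"],
    "isFinished": [True, True, False, True, False, False],
})

# Normalization used for Plot 3: finished paths per continent / all paths.
share_of_all_paths = toy.groupby("startContinent")["isFinished"].sum() / len(toy)

# Alternative view: completion rate within each continent.
completion_rate = toy.groupby("startContinent")["isFinished"].mean()

print(share_of_all_paths)
print(completion_rate)
```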

## Plot 4: T-Test Results for the Naive Study

In [27]:
fig_name = "paths_count_naive"
fig_title = "{} of played games"
fig_ylabel = "Continent"
fig_xlabel = "Games"


count = pd.crosstab(df_analysis["treatment"], df_analysis["isFinished"]).sort_index(ascending=False)
percentage = pd.crosstab(df_analysis["treatment"], df_analysis["isFinished"], normalize="columns").sort_index(ascending=False)
percentage = np.char.mod("%0.2f", percentage.values * 100)
# np.char.add is the public alias; np.core is a private namespace in recent NumPy releases.
percentage = np.char.add(percentage, " %")

test_results = scipy.stats.ttest_ind(df_analysis[df_analysis["treatment"]]["isFinished"], df_analysis[~df_analysis["treatment"]]["isFinished"])


fig = ff.create_annotated_heatmap(
    count.values,
    annotation_text=count.values,
    colorscale="Blues",
    x=["Unfinished", "Finished"],
    y=[TREATMENT_LABEL, CONTROL_LABEL]
)

count_annotations = fig.to_dict()["layout"]["annotations"]
percentage_annotations = []
for count_annot, percentage_annot in zip(count_annotations, percentage.flatten()):
    tmp_annot = copy.deepcopy(count_annot)
    tmp_annot["text"] = percentage_annot
    percentage_annotations.append(copy.deepcopy(tmp_annot))

second_title = dict(
    font={"color": fig.layout.title.font.color, "size": 16},
    text=f"T-Test statistic: {test_results.statistic:.3f} and p-value {test_results.pvalue:.3f}",
    xref="paper",
    yref="paper",
    x=0.85,
    y=1.08,
    showarrow=False
)

count_annotations.append(second_title)
percentage_annotations.append(second_title)

buttons = [
    dict(label="Count", method="update", args=[{}, {"annotations": count_annotations, "title": fig_title.format("Number")}]),
    dict(label="Percentage", method="update", args=[{}, {"annotations": percentage_annotations, "title": fig_title.format("Percentage")}]),
]

fig.update_layout(
    title=fig_title.format("Number"),
    title_x=0.5,
    title_y=.95,
    yaxis_title=fig_ylabel,
    xaxis_title=fig_xlabel,
    xaxis_side="bottom",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
    updatemenus=[dict(type="buttons", showactive=True, buttons=buttons, x=0, xanchor="left", y=1.1, yanchor="top", direction="right")],
    margin=dict(t=100),
    annotations=count_annotations
)

fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_map.html"))
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_map.pdf"))
fig.show()
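
The percentage annotations come from `pd.crosstab(..., normalize="columns")`, so each column (unfinished, finished) sums to 100 % across the two groups. A small sketch on made-up data shows the resulting table and the string formatting. `np.char.add` is used here for the concatenation.

```python
import numpy as np
import pandas as pd

# Made-up games (illustration only); column names follow df_analysis.
toy = pd.DataFrame({
    "treatment": [True, True, True, False, False, False, False, False],
    "isFinished": [True, True, False, True, False, False, False, True],
})

# Rows: treatment (True first after sorting), columns: isFinished (False, True).
share = pd.crosstab(toy["treatment"], toy["isFinished"], normalize="columns")
share = share.sort_index(ascending=False)

# Format every cell as a percentage string with two decimals.
labels = np.char.add(np.char.mod("%0.2f", share.values * 100), " %")
print(labels)
```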


## Plot 5: Difference Between Matched and Original Data Distributions

In [28]:
default_color_scale = px.colors.qualitative.Plotly

In [29]:
%%capture

fig_shortes_path = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.1,
    subplot_titles=("Original Data", "Matched Data")
)

fig_shortes_path.add_trace(go.Histogram(
    x=df_analysis[df_analysis["treatment"]]["shortestPath"],
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0]
), row=1, col=1)
fig_shortes_path.add_trace(go.Histogram(
    x=df_analysis[~df_analysis["treatment"]]["shortestPath"],
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1]
), row=1, col=1)

fig_shortes_path.add_trace(go.Histogram(
    x=df_analysis_balanced[df_analysis_balanced["treatment"]]["shortestPath"],
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0],
    showlegend=False
), row=2, col=1)
fig_shortes_path.add_trace(go.Histogram(
    x=df_analysis_balanced[~df_analysis_balanced["treatment"]]["shortestPath"],
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1],
    showlegend=False
), row=2, col=1)

fig_shortes_path.update_xaxes(title_text="Shortest Path", row=2, col=1)
fig_shortes_path.update_yaxes(title_text="Fraction", row=2, col=1)
fig_shortes_path.update_yaxes(title_text="Fraction", row=1, col=1)

fig_shortes_path.update_layout(
    legend_title="Continent",
    barmode="overlay",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)


In [30]:
%%capture

fig_start_category = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.1,
    subplot_titles=("Original Data", "Matched Data")
)

start_categories = df_analysis[df_analysis["treatment"]]["startCategoryMain"].explode().value_counts()
fig_start_category.add_trace(go.Bar(
    x=start_categories.index.str.replace("_", " "),
    y=start_categories.values,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0],
), row=1, col=1)

start_categories = df_analysis[~df_analysis["treatment"]]["startCategoryMain"].explode().value_counts()
fig_start_category.add_trace(go.Bar(
    x=start_categories.index.str.replace("_", " "),
    y=start_categories.values,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1],
), row=1, col=1)
start_categories = df_analysis_balanced[df_analysis_balanced["treatment"]]["startCategoryMain"].explode().value_counts()
fig_start_category.add_trace(go.Bar(
    x=start_categories.index.str.replace("_", " "),
    y=start_categories.values,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0],
    showlegend=False
), row=2, col=1)

start_categories = df_analysis_balanced[~df_analysis_balanced["treatment"]]["startCategoryMain"].explode().value_counts()
fig_start_category.add_trace(go.Bar(
    x=start_categories.index.str.replace("_", " "),
    y=start_categories.values,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1],
    showlegend=False
), row=2, col=1)

fig_start_category.update_layout(
    legend_title="Continent",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)

fig_start_category.update_xaxes(title_text="Category", row=2, col=1)
fig_start_category.update_yaxes(title_text="Count", row=2, col=1)
fig_start_category.update_yaxes(title_text="Count", row=1, col=1)
fig_start_category.update_yaxes(type='log', row=1, col=1)
fig_start_category.update_yaxes(type='log', row=2, col=1)


In [31]:
%%capture

fig_target_category = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.1,
    subplot_titles=("Original Data", "Matched Data")
)

target_categories = df_analysis[df_analysis["treatment"]]["targetCategoryMain"].explode().value_counts()
fig_target_category.add_trace(go.Bar(
    x=target_categories.index.str.replace("_", " "),
    y=target_categories.values,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0],
), row=1, col=1)

target_categories = df_analysis[~df_analysis["treatment"]]["targetCategoryMain"].explode().value_counts()
fig_target_category.add_trace(go.Bar(
    x=target_categories.index.str.replace("_", " "),
    y=target_categories.values,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1],
), row=1, col=1)
target_categories = df_analysis_balanced[df_analysis_balanced["treatment"]]["targetCategoryMain"].explode().value_counts()
fig_target_category.add_trace(go.Bar(
    x=target_categories.index.str.replace("_", " "),
    y=target_categories.values,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0],
    showlegend=False
), row=2, col=1)

target_categories = df_analysis_balanced[~df_analysis_balanced["treatment"]]["targetCategoryMain"].explode().value_counts()
fig_target_category.add_trace(go.Bar(
    x=target_categories.index.str.replace("_", " "),
    y=target_categories.values,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1],
    showlegend=False
), row=2, col=1)

fig_target_category.update_layout(
    legend_title="Continent",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)

fig_target_category.update_xaxes(title_text="Category", row=2, col=1)
fig_target_category.update_yaxes(title_text="Count", row=2, col=1)
fig_target_category.update_yaxes(title_text="Count", row=1, col=1)
fig_target_category.update_yaxes(type='log', row=1, col=1)
fig_target_category.update_yaxes(type='log', row=2, col=1)


In [32]:
%%capture

fig_target_pg = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.1,
    subplot_titles=("Original Data", "Matched Data")
)

fig_target_pg.add_trace(go.Histogram(
    x=np.log(df_analysis[df_analysis["treatment"]]["targetPageRank"]),
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0]
), row=1, col=1)
fig_target_pg.add_trace(go.Histogram(
    x=np.log(df_analysis[~df_analysis["treatment"]]["targetPageRank"]),
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1]
), row=1, col=1)

fig_target_pg.add_trace(go.Histogram(
    x=np.log(df_analysis_balanced[df_analysis_balanced["treatment"]]["targetPageRank"]),
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0],
    showlegend=False
), row=2, col=1)
fig_target_pg.add_trace(go.Histogram(
    x=np.log(df_analysis_balanced[~df_analysis_balanced["treatment"]]["targetPageRank"]),
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1],
    showlegend=False
), row=2, col=1)

fig_target_pg.update_xaxes(title_text="Page Rank (log)", row=2, col=1)
fig_target_pg.update_yaxes(title_text="Fraction", row=2, col=1)
fig_target_pg.update_yaxes(title_text="Fraction", row=1, col=1)

fig_target_pg.update_layout(
    legend_title="Continent",
    barmode="overlay",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)


In [33]:
%%capture

fig_target_length = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.1,
    subplot_titles=("Original Data", "Matched Data")
)

fig_target_length.add_trace(go.Histogram(
    x=df_analysis[df_analysis["treatment"]]["targetLength"],
    histnorm="probability",
    opacity=0.5,
    nbinsx=20,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0]
), row=1, col=1)
fig_target_length.add_trace(go.Histogram(
    x=df_analysis[~df_analysis["treatment"]]["targetLength"],
    histnorm="probability",
    opacity=0.5,
    nbinsx=20,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1]
), row=1, col=1)

fig_target_length.add_trace(go.Histogram(
    x=df_analysis_balanced[df_analysis_balanced["treatment"]]["targetLength"],
    histnorm="probability",
    opacity=0.5,
    nbinsx=20,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0],
    showlegend=False
), row=2, col=1)
fig_target_length.add_trace(go.Histogram(
    x=df_analysis_balanced[~df_analysis_balanced["treatment"]]["targetLength"],
    histnorm="probability",
    opacity=0.5,
    nbinsx=20,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1],
    showlegend=False
), row=2, col=1)

fig_target_length.update_xaxes(title_text="Word count", row=2, col=1)
fig_target_length.update_yaxes(title_text="Fraction", row=2, col=1)
fig_target_length.update_yaxes(title_text="Fraction", row=1, col=1)

fig_target_length.update_layout(
    legend_title="Continent",
    barmode="overlay",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)


In [34]:
%%capture

fig_start_pg = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.1,
    subplot_titles=("Original Data", "Matched Data")
)

fig_start_pg.add_trace(go.Histogram(
    x=np.log(df_analysis[df_analysis["treatment"]]["startPageRank"]),
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0]
), row=1, col=1)
fig_start_pg.add_trace(go.Histogram(
    x=np.log(df_analysis[~df_analysis["treatment"]]["startPageRank"]),
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1]
), row=1, col=1)

fig_start_pg.add_trace(go.Histogram(
    x=np.log(df_analysis_balanced[df_analysis_balanced["treatment"]]["startPageRank"]),
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0],
    showlegend=False
), row=2, col=1)
fig_start_pg.add_trace(go.Histogram(
    x=np.log(df_analysis_balanced[~df_analysis_balanced["treatment"]]["startPageRank"]),
    nbinsx=8,
    histnorm="probability",
    opacity=0.5,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1],
    showlegend=False
), row=2, col=1)

fig_start_pg.update_xaxes(title_text="Page Rank (log)", row=2, col=1)
fig_start_pg.update_yaxes(title_text="Fraction", row=2, col=1)
fig_start_pg.update_yaxes(title_text="Fraction", row=1, col=1)

fig_start_pg.update_layout(
    legend_title="Continent",
    barmode="overlay",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)


In [35]:
%%capture

fig_start_length = make_subplots(
    rows=2,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.1,
    subplot_titles=("Original Data", "Matched Data")
)

fig_start_length.add_trace(go.Histogram(
    x=df_analysis[df_analysis["treatment"]]["startLength"],
    histnorm="probability",
    opacity=0.5,
    nbinsx=20,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0]
), row=1, col=1)
fig_start_length.add_trace(go.Histogram(
    x=df_analysis[~df_analysis["treatment"]]["startLength"],
    histnorm="probability",
    opacity=0.5,
    nbinsx=20,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1]
), row=1, col=1)

fig_start_length.add_trace(go.Histogram(
    x=df_analysis_balanced[df_analysis_balanced["treatment"]]["startLength"],
    histnorm="probability",
    opacity=0.5,
    nbinsx=20,
    name=TREATMENT_LABEL,
    marker_color=default_color_scale[0],
    showlegend=False
), row=2, col=1)
fig_start_length.add_trace(go.Histogram(
    x=df_analysis_balanced[~df_analysis_balanced["treatment"]]["startLength"],
    histnorm="probability",
    opacity=0.5,
    nbinsx=20,
    name=CONTROL_LABEL,
    marker_color=default_color_scale[1],
    showlegend=False
), row=2, col=1)

fig_start_length.update_xaxes(title_text="Word count", row=2, col=1)
fig_start_length.update_yaxes(title_text="Fraction", row=2, col=1)
fig_start_length.update_yaxes(title_text="Fraction", row=1, col=1)

fig_start_length.update_layout(
    legend_title="Continent",
    barmode="overlay",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)


In [36]:
fig_shortes_path.write_html(os.path.join(PLOTS_PATH_HTML, "matching_diff_shortes_path.html"))
fig_start_category.write_html(os.path.join(PLOTS_PATH_HTML, "matching_diff_start_category.html"))
fig_target_category.write_html(os.path.join(PLOTS_PATH_HTML, "matching_diff_target_category.html"))
fig_target_pg.write_html(os.path.join(PLOTS_PATH_HTML, "matching_diff_target_pg.html"))
fig_target_length.write_html(os.path.join(PLOTS_PATH_HTML, "matching_diff_target_length.html"))
fig_start_pg.write_html(os.path.join(PLOTS_PATH_HTML, "matching_diff_start_pg.html"))
fig_start_length.write_html(os.path.join(PLOTS_PATH_HTML, "matching_diff_start_length.html"))

In [37]:


app = Dash(__name__)

app.layout = html.Div([
    dcc.Tabs([
        dcc.Tab(label="Shortest Path", children=[
            dcc.Graph(
                figure=fig_shortes_path
            )
        ]),
        dcc.Tab(label="Category of start Article", children=[
            dcc.Graph(
                figure=fig_start_category
            )
        ]),
        dcc.Tab(label="Category of target Article", children=[
            dcc.Graph(
                figure=fig_target_category
            )
        ]),
        dcc.Tab(label="Page Rank of target Article", children=[
            dcc.Graph(
                figure=fig_target_pg
            )
        ]),
        dcc.Tab(label="Length of target Article", children=[
            dcc.Graph(
                figure=fig_target_length
            )
        ]),
        dcc.Tab(label="Page Rank of start Article", children=[
            dcc.Graph(
                figure=fig_start_pg
            )
        ]),
        dcc.Tab(label="Length of start Article", children=[
            dcc.Graph(
                figure=fig_start_length
            )
        ]),
    ],
    style={"width": FIGURE_WIDTH, "font-family": "Arial"})
])

if __name__ == "__main__":
    # `run_server` is deprecated in recent Dash releases in favour of `run`.
    app.run()
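
The histograms and bar charts above compare treated and control paths visually. A common numeric complement, which is not part of the original notebook, is the standardized mean difference (SMD) of each covariate before and after matching. The sketch below assumes a boolean `treatment` column as in `df_analysis`. The helper name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def standardized_mean_difference(df, column, treatment_col="treatment"):
    """Difference in group means divided by the pooled standard deviation."""
    treated = df.loc[df[treatment_col], column]
    control = df.loc[~df[treatment_col], column]
    pooled_std = np.sqrt((treated.var() + control.var()) / 2)
    return (treated.mean() - control.mean()) / pooled_std


# Made-up covariate values (illustration only).
toy = pd.DataFrame({
    "treatment": [True, True, True, False, False, False],
    "shortestPath": [2, 3, 4, 3, 4, 5],
})
smd = standardized_mean_difference(toy, "shortestPath")
print(smd)
```

Values near zero for the matched data would suggest that the covariate is balanced between the two groups.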


## Plot 6: Observational Study Results

In [38]:
fig_name = "paths_count_observe"
fig_title = "{} of played games"
fig_ylabel = "Continent"
fig_xlabel = "Games"


count = pd.crosstab(df_analysis_balanced["treatment"], df_analysis_balanced["isFinished"]).sort_index(ascending=False)
percentage = pd.crosstab(df_analysis_balanced["treatment"], df_analysis_balanced["isFinished"], normalize="columns").sort_index(ascending=False)
percentage = np.char.mod("%0.2f", percentage.values * 100)
# np.char.add is the public alias; np.core is a private namespace in recent NumPy releases.
percentage = np.char.add(percentage, " %")

test_results = scipy.stats.ttest_ind(df_analysis_balanced[df_analysis_balanced["treatment"]]["isFinished"], df_analysis_balanced[~df_analysis_balanced["treatment"]]["isFinished"])

fig = ff.create_annotated_heatmap(
    count.values,
    annotation_text=count.values,
    colorscale="Blues",
    x=["Unfinished", "Finished"],
    y=[TREATMENT_LABEL, CONTROL_LABEL]
)

count_annotations = fig.to_dict()["layout"]["annotations"]
percentage_annotations = []
for count_annot, percentage_annot in zip(count_annotations, percentage.flatten()):
    tmp_annot = copy.deepcopy(count_annot)
    tmp_annot["text"] = percentage_annot
    percentage_annotations.append(copy.deepcopy(tmp_annot))

second_title = dict(
    font={"color": fig.layout.title.font.color, "size": 16},
    text=f"T-Test statistic: {test_results.statistic:.3f} and p-value {test_results.pvalue:.3f}",
    xref="paper",
    yref="paper",
    x=0.85,
    y=1.08,
    showarrow=False
)

count_annotations.append(second_title)
percentage_annotations.append(second_title)

buttons = [
    dict(label="Count", method="update", args=[{}, {"annotations": count_annotations, "title": fig_title.format("Number")}]),
    dict(label="Percentage", method="update", args=[{}, {"annotations": percentage_annotations, "title": fig_title.format("Percentage")}]),
]

fig.update_layout(
    title=fig_title.format("Number"),
    title_x=0.5,
    title_y=.95,
    yaxis_title=fig_ylabel,
    xaxis_title=fig_xlabel,
    xaxis_side="bottom",
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
    updatemenus=[dict(type="buttons", showactive=True, buttons=buttons, x=0, xanchor="left", y=1.1, yanchor="top", direction="right")],
    margin=dict(t=100),
    annotations=count_annotations
)

fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_map.html"))
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_map.pdf"))
fig.show()
