# GeoWiki: Investigating Geographical Biases in Wikispeedia

*TheDataDreamTeam*

This project investigates geographical biases in the Wikispeedia game and in player behavior, using the 2007 Wikipedia Selection for schools dataset as the data source. Our goal is to determine whether article selection and gameplay are biased towards North America and Europe.

# Imports

In [None]:
import re
import os

import seaborn as sns
import plotly.subplots
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go

import scipy.stats
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

import networkx as nx

# Settings

In [None]:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

REMOVE_INTERNATIONAL = True
INTERNATIONAL_LABEL = "International"
EUROPE_LABEL = "Europe"


PLOTS_PATH = "plots"
PLOTS_PATH_PLT = os.path.join(PLOTS_PATH, "plt")
PLOTS_PATH_PX = os.path.join(PLOTS_PATH, "px")
PLOTS_PATH_HTML = os.path.join(PLOTS_PATH, "html")

FIGURE_WIDTH = 800
FIGURE_HEIGHT = 600

for path in [PLOTS_PATH_PLT, PLOTS_PATH_PX, PLOTS_PATH_HTML]: 
    os.makedirs(path, exist_ok=True)

# Data loading

## All articles

To begin our analysis, we load the list of all articles in the Wikispeedia game from the file `articles.tsv`. This list is the foundation for the rest of the analysis.

In [None]:
df_articles_all = pd.read_csv(
    os.path.join("Data", "wikispeedia_paths-and-graph", "articles.tsv"),
    delimiter="\t",
    header=None,
    names=["name"],
    skip_blank_lines=True,
    comment="#"
)

display(df_articles_all.head())
print("Size:", df_articles_all.shape)

## Article continent labels

Next, we label the articles with their respective continents using information stored in the `continents.csv` file.

In [None]:
df_continents = pd.read_csv(os.path.join("Data", "continents.csv"))

if REMOVE_INTERNATIONAL:
    labeled_articles_all_count = len(df_continents)
    df_continents = df_continents[df_continents.continent != INTERNATIONAL_LABEL]
    labeled_articles_count = len(df_continents)
    print(f"Removing articles labeled as {INTERNATIONAL_LABEL}, Removed articles: {labeled_articles_all_count - labeled_articles_count}")

display(df_continents.head())
print("Size:", df_continents.shape)

The continent labels are the basis of our geographical analysis. As a preprocessing step, we remove articles labeled "International" so that the analysis covers only articles that are distinctly associated with a specific continent.

## Article categories

Next, we load the categories of each article and extract the main category from each category string.

In [None]:
df_categories = pd.read_csv(
    os.path.join("Data", "wikispeedia_paths-and-graph", "categories.tsv"),
    delimiter="\t",
    header=None,
    names=["article", "category"],
    skip_blank_lines=True,
    comment="#"
)

# The main category is the component at position 1 of the dot-separated category string
df_categories["categoryMain"] = df_categories["category"].str.split(".").str[1]

display(df_categories.head())
print("Size:", df_categories.shape)
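To illustrate the extraction above: category strings in `categories.tsv` are dot-separated hierarchies that begin with `subject`, so the component at position 1 is the main category. A minimal sketch on illustrative strings (the helper name `main_category` is hypothetical):

```python
def main_category(category: str) -> str:
    """Return the component right after the leading 'subject' prefix."""
    return category.split(".")[1]


examples = [
    "subject.Countries",
    "subject.Geography.European_Geography.European_Countries",
]
main_categories_example = [main_category(c) for c in examples]
print(main_categories_example)  # ['Countries', 'Geography']
```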

Now, let's merge the continent labels with the article categories.

In [None]:
df_continents_categories = pd.merge(df_continents, df_categories, on="article", how="left")

display(df_continents_categories.head())
print("Size:", df_continents_categories.shape)

Finally, let's create a dataset with unique articles, their associated continents, and lists of main and subcategories.

In [None]:
df_articles = df_continents_categories[["article", "continent"]].drop_duplicates()
df_articles = pd.merge(df_articles, df_continents_categories.groupby("article")["categoryMain"].apply(list).reset_index(), on="article")
df_articles = pd.merge(df_articles, df_continents_categories.groupby("article")["category"].apply(list).reset_index(), on="article")

display(df_articles.head())
print("Size:", df_articles.shape)

## Article word count

Now, we explore the word count of each article in our dataset. We retrieve this information from the plaintext versions of the articles.

The resulting dataset includes a new column, `length`, representing the word count of each article.

In [None]:
plaintext_path = os.path.join("Data", "plaintext_articles")

word_counts = []
for article_name in df_articles.article:
    file_path = os.path.join(plaintext_path, article_name + ".txt")

    with open(file_path, "r", encoding="utf-8") as file:
        _ = file.readline()  # Skip the first line because it contains the word #copyright
        content = file.read()

    # Drop the trailing "Retrieved from ..." footer; keep the full text if it is missing
    footer = re.search("Retrieved from", content)
    if footer is not None:
        content = content[:footer.start()]
    word_counts.append(len(content.split()))

df_articles["length"] = word_counts

display(df_articles.head())
print("Size:", df_articles.shape)
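The truncation and counting step can be sketched on a toy string. This is a minimal illustration, assuming the footer starts at the first occurrence of `Retrieved from`; it falls back to the full text when the marker is absent. The helper name `count_words` and the sample text are hypothetical.

```python
import re


def count_words(content: str, marker: str = "Retrieved from") -> int:
    """Count whitespace-separated words that appear before the footer marker."""
    match = re.search(re.escape(marker), content)
    body = content[:match.start()] if match else content
    return len(body.split())


sample = 'Paris is the capital of France.\nRetrieved from "http://example.org"'
words_with_footer = count_words(sample)
words_without_footer = count_words("No footer here")
print(words_with_footer, words_without_footer)  # 6 3
```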

## PageRank

In this step, we load the PageRank scores from the `page_rank.csv` file and merge them into our existing dataset; articles without a score are assigned 0. PageRank indicates how central an article is within the Wikispeedia link network.

This lets us account for the importance of articles when exploring geographical biases in the Wikispeedia game.

In [None]:
df_pagerank = pd.read_csv(os.path.join("Data", "page_rank.csv"))
df_articles = pd.merge(df_articles, df_pagerank, on="article", how="left").fillna(0)

display(df_articles.head())
print("Size:", df_articles.shape)

## Paths

In this section, we load information about both finished and unfinished paths from the Wikispeedia game. Additional columns are added to facilitate analysis, including the number of backclicks, total path steps, unique articles visited, and whether the path is completed or not.

In [None]:
df_paths_finished = pd.read_csv(
    os.path.join("Data", "wikispeedia_paths-and-graph", "paths_finished.tsv"),
    sep="\t",
    header=None,
    names=["hashedIpAddress", "timestamp", "durationInSec", "path", "rating"],
    skip_blank_lines=True,
    comment="#"
)
df_paths_unfinished = pd.read_csv(
    os.path.join("Data", "wikispeedia_paths-and-graph", "paths_unfinished.tsv"),
    sep="\t",
    header=None,
    names=["hashedIpAddress", "timestamp", "durationInSec", "path", "target", "motif"],
    skip_blank_lines=True,
    comment="#"
)

df_paths_finished["backclicks"] = df_paths_finished["path"].apply(lambda x: x.count("<"))
df_paths_finished["pathSteps"] = df_paths_finished["path"].apply(lambda x: x.count(";") + 1)
df_paths_finished["uniqueArticles"] = df_paths_finished["pathSteps"] - df_paths_finished["backclicks"]
df_paths_finished["path"] = df_paths_finished["path"].apply(lambda x: x.split(";"))
df_paths_finished["start"] = df_paths_finished["path"].str[0]
df_paths_finished["target"] = df_paths_finished["path"].str[-1]
df_paths_finished["isFinished"] = True

df_paths_unfinished["backclicks"] = df_paths_unfinished["path"].apply(lambda x: x.count("<"))
df_paths_unfinished["pathSteps"] = df_paths_unfinished["path"].apply(lambda x: x.count(";") + 1)
df_paths_unfinished["uniqueArticles"] = df_paths_unfinished["pathSteps"] - df_paths_unfinished["backclicks"]
df_paths_unfinished["path"] = df_paths_unfinished["path"].apply(lambda x: x.split(";"))
df_paths_unfinished["start"] = df_paths_unfinished["path"].str[0]
df_paths_unfinished["isFinished"] = False

df_paths = pd.concat([df_paths_finished, df_paths_unfinished])
df_paths = df_paths[df_paths["start"].isin(df_articles_all.name) & df_paths["target"].isin(df_articles_all.name)]

display(df_paths.head())
print("Size:", df_paths.shape)
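As a worked illustration of the derived columns: a path is a `;`-separated list of article names in which `<` marks a back-click. The sketch below mirrors the computation on a single made-up path (the article names and the helper `path_features` are hypothetical):

```python
def path_features(path: str) -> dict:
    """Compute per-path features for one raw path string."""
    steps = path.split(";")
    backclicks = steps.count("<")
    return {
        "backclicks": backclicks,
        "pathSteps": len(steps),
        "uniqueArticles": len(steps) - backclicks,
        "start": steps[0],
        "last": steps[-1],
    }


features = path_features("Europe;France;<;Germany;Berlin")
print(features)
```

Note that `uniqueArticles` is the number of steps minus the number of back-clicks, so an article that is revisited through a regular link is still counted twice.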


## Shortest Paths

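Here, we read the shortest-path distances between all article pairs from `shortest-path-distance-matrix.txt`. For each path in `df_paths`, we look up the distance between its start and target articles in the resulting `df_shortest_paths` matrix, and we drop paths whose target cannot be reached from the start.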

In [None]:
shortest_paths = []
with open(os.path.join("Data", "wikispeedia_paths-and-graph", "shortest-path-distance-matrix.txt")) as file:
    for line in file:
        line = line.strip()
        if line == "" or line.startswith("#"):
            continue
        # "_" marks an unreachable target and is stored as -1
        shortest_paths.append([-1 if x == "_" else int(x) for x in line])
        
shortest_paths = np.array(shortest_paths)

df_shortest_paths = pd.DataFrame(shortest_paths, index=df_articles_all.name, columns=df_articles_all.name)
df_paths["shortestPath"] = df_paths.apply(lambda row: df_shortest_paths.loc[row["start"], row["target"]], axis="columns")
df_paths = df_paths[df_paths["shortestPath"] >= 0]

display(df_paths.head())
print("Size:", df_paths.shape)
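Each data line of the distance matrix file encodes one row as a string of single characters: a digit is the shortest-path length and `_` means that no path exists. A sketch of the decoding on a made-up row (the helper name `parse_distance_row` is hypothetical):

```python
def parse_distance_row(line: str) -> list:
    """Decode one matrix row; '_' (no path) becomes -1, digits become ints."""
    return [-1 if char == "_" else int(char) for char in line.strip()]


row = parse_distance_row("0_32\n")
print(row)  # [0, -1, 3, 2]
```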

# Data Exploration

In this section, we explore the distribution of articles across different continents. The `article_count_per_continent` Series holds the number of articles in each continent.

In [None]:
article_count_per_continent = df_continents.groupby("continent").size().sort_index()

display(article_count_per_continent)

Next, we examine the distribution of articles in various main categories within each continent.

In [None]:
df_continents_categories_counts = pd.crosstab(df_continents_categories["continent"], df_continents_categories["categoryMain"]).sort_index()

display(df_continents_categories_counts)
print("Size:", df_continents_categories_counts.shape)

We merge information about target and start articles with the paths data.

In [None]:
df_articles_target = df_articles.copy()
df_articles_target.columns = [column[0].upper() + column[1:] for column in df_articles_target.columns]
df_articles_target = df_articles_target.add_prefix("target")

df_paths_articles = pd.merge(df_paths, df_articles_target, left_on="target", right_on="targetArticle").drop(columns="targetArticle")

df_start_articles = df_articles.copy()
df_start_articles.columns = [column[0].upper() + column[1:] for column in df_start_articles.columns]
df_start_articles = df_start_articles.add_prefix("start")
df_paths_articles = pd.merge(df_paths_articles, df_start_articles, left_on="start", right_on="startArticle").drop(columns="startArticle")

df_paths_articles["isFinishedInt"] = df_paths_articles["isFinished"].astype(int)

display(df_paths_articles.head())
print("Size:", df_paths_articles.shape)

Finally, we compute path statistics per article. The resulting DataFrame, `df_article_path_stats`, counts how often each article appears as a target, as a start, or anywhere in finished and unfinished paths, and gives its share of all path steps.

In [None]:
df_article_path_stats = pd.DataFrame()

df_article_path_stats["article"] = df_articles["article"]
df_article_path_stats["continent"] = df_articles["continent"]
df_article_path_stats["targetFinished"] = df_articles["article"].map(df_paths_finished["target"].value_counts()).fillna(0)
df_article_path_stats["targetUnfinished"] = df_articles["article"].map(df_paths_unfinished["target"].value_counts()).fillna(0)

df_article_path_stats["startFinished"] = df_articles["article"].map(df_paths_finished["start"].value_counts()).fillna(0)
df_article_path_stats["startUnfinished"] = df_articles["article"].map(df_paths_unfinished["start"].value_counts()).fillna(0)

paths_finished = pd.Series(np.concatenate(df_paths_finished.path.values))
paths_unfinished = pd.Series(np.concatenate(df_paths_unfinished.path.values))

df_article_path_stats["anyFinished"] = df_articles["article"].map(paths_finished.value_counts()).fillna(0)
df_article_path_stats["anyUnfinished"] = df_articles["article"].map(paths_unfinished.value_counts()).fillna(0)
df_article_path_stats["anyPercentage"] = (df_article_path_stats["anyFinished"] + df_article_path_stats["anyUnfinished"]) / (len(paths_finished) + len(paths_unfinished))

display(df_article_path_stats.sort_values("anyPercentage", ascending=False).head())
print("Size:", df_article_path_stats.shape)

# Naive Analysis

We conduct a naive statistical analysis to identify potential differences between paths leading to articles related to Europe (treatment group) and paths leading to articles related to other continents (control group). The t-tests for metrics such as completion status, duration, path steps, and rating provide initial insights into potential disparities between the two groups.

In [None]:
df_analysis = df_paths_articles.copy()
df_analysis = df_analysis.fillna(0)
df_analysis["treatment"] = df_analysis.targetContinent == EUROPE_LABEL

for col in ["isFinishedInt", "durationInSec", "pathSteps", "rating"]:
    print(col, *scipy.stats.ttest_ind(df_analysis[df_analysis.treatment][col], df_analysis[~df_analysis.treatment][col], equal_var=False))
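
Passing `equal_var=False` makes `scipy.stats.ttest_ind` run Welch's t-test, which does not assume equal variances in the two groups. The statistic itself can be reproduced by hand; the sketch below uses only the standard library and made-up numbers, and the helper name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic: difference in means over the unpooled standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


treated = [4, 5, 6, 7]
control = [6, 8, 10, 12]
t_stat = welch_t(treated, control)
print(round(t_stat, 3))  # -2.425
```

A negative statistic means that the first group has the lower mean.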


# Matching

Matching is a crucial step in observational studies to control for confounding factors and ensure a fair comparison between the treatment and control groups. In our study, we propose matching based on the following factors:

- Starting article
- PageRank of the goal (indicating the same probability of reaching the goal)
- Same category of the target article

Matching allows us to create more comparable groups, reducing bias and increasing the reliability of our analysis. By considering these factors, we aim to create balanced groups that are comparable in terms of key characteristics. 
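
One possible way to implement the criteria above is a greedy one-to-one matching: each treated path is paired with an unused control path that has the same start article and target category and the closest target PageRank. This is only a sketch under stated assumptions, not necessarily the procedure used later in the study. The column names `targetCategory` and `targetPageRank`, the tolerance `max_pagerank_diff`, and the toy values are all hypothetical.

```python
import pandas as pd


def greedy_match(df, max_pagerank_diff):
    """Pair each treated row with one unused control row that shares the start
    article and target category and has the closest target PageRank."""
    control = df[~df["treatment"]]
    used = []   # control rows already assigned to a pair
    pairs = []  # (treated index, control index)
    for t_idx, t_row in df[df["treatment"]].iterrows():
        candidates = control[
            (control["start"] == t_row["start"])
            & (control["targetCategory"] == t_row["targetCategory"])
            & ~control.index.isin(used)
        ]
        if candidates.empty:
            continue
        diffs = (candidates["targetPageRank"] - t_row["targetPageRank"]).abs()
        best = diffs.idxmin()
        if diffs[best] <= max_pagerank_diff:
            used.append(best)
            pairs.append((t_idx, best))
    return pairs


# Toy data; all values are made up for illustration.
toy = pd.DataFrame({
    "treatment": [True, True, False, False, False],
    "start": ["A", "B", "A", "A", "B"],
    "targetCategory": ["Geography", "History", "Geography", "Geography", "Science"],
    "targetPageRank": [0.010, 0.020, 0.011, 0.030, 0.020],
})
pairs = greedy_match(toy, max_pagerank_diff=0.005)
print(pairs)
```

With this toy data, the first treated row is paired with row 2, and the second treated row stays unmatched because no control row shares both its start article and its category. A greedy pass is simple but order-dependent; an optimal matching over all candidate pairs would be a natural refinement.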

# Observational Study

This study aims to uncover trends and patterns in the behavior of players when interacting with articles associated with Europe compared to other continents.

The matched groups, created through the matching process, provide a controlled environment for analysis.

By controlling for confounding factors through matching, we aim to derive meaningful and reliable conclusions about the influence of geographical factors on user experiences in the Wikispeedia game.

# Data Story Plots

In preparation for the visual exploration of our data story, we generate a set of distinctive colors for each continent.

In [None]:
continents = df_continents["continent"].unique()
random_colors = sns.color_palette("husl", n_colors=len(continents))
continents_colors = {}
continents_colors_int = {}
for i in range(len(continents)):
    continents_colors[continents[i]] = random_colors[i]
    continents_colors_int[continents[i]] = tuple(map(lambda x: int(255 * x), random_colors[i]))
    continents_colors_int[continents[i]] = "#{0:02x}{1:02x}{2:02x}".format(*continents_colors_int[continents[i]])
print(continents_colors)
print(continents_colors_int)

CONTINENTS_NUM = len(continents_colors)
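
The conversion above is needed because seaborn palettes are RGB tuples of floats in the range 0 to 1, while the Plotly calls in this notebook are given hex color strings. A minimal sketch of the same conversion for a single color (the helper name `rgb_to_hex` is hypothetical):

```python
def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of floats in [0, 1] to a '#rrggbb' string."""
    return "#{0:02x}{1:02x}{2:02x}".format(*(int(255 * channel) for channel in rgb))


hex_color = rgb_to_hex((1.0, 0.5, 0.0))
print(hex_color)  # #ff7f00
```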

## Plot 1: Number of Articles per Continent

In [None]:
article_count_per_continent

In [None]:
article_count_per_continent.to_frame()

This visualization presents the distribution of articles across different continents. The bar chart and pie chart provide a visual representation of the number of articles per continent, offering insights into the dataset's geographical coverage.

In [None]:
fig_name = "articles_count_per_continent"
fig_title = "Number of articles per Continent"
fig_ylabel = "Count"
fig_xlabel = "Continent"


fig = px.bar(
    x=article_count_per_continent.index,
    y=article_count_per_continent.values,
    color=[continents_colors_int[continent] for continent in article_count_per_continent.index],
    color_discrete_map="identity",
    labels={"x": fig_xlabel, "y": fig_ylabel},
)
fig.update_layout(
    title_text=fig_title,
    title_x=0.5,
    #xaxis=dict(tickangle=-45),
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_bar.pdf"))
fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_bar.html"))
fig.show()

# Pull the Europe slice slightly out of the pie to highlight it
pull = np.where(article_count_per_continent.index == EUROPE_LABEL, 0.1, 0.0)
fig = go.Figure(data=[go.Pie(
    values=article_count_per_continent.values,
    labels=article_count_per_continent.index.tolist(),
    pull=pull.tolist(),
    marker_colors=[continents_colors_int[continent] for continent in article_count_per_continent.index],
    sort=False
)])

fig.update_layout(
    title_text=fig_title,
    title_x=0.5,
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
)
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_pie.pdf"))
fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_pie.html"))
fig.show()


## Plot 2: Continent Distribution per Category

This visualization explores the distribution of articles across different continents within various categories. The bar chart provides an overview of the article count per category, while the interactive pie chart allows users to select specific categories.

In [None]:
fig_name = "articles_count_per_category"
fig_title = "Continent distribution per Category"
fig_xlabel = "Article Count"
fig_ylabel = "Category"


categories_sorted = df_continents_categories_counts.sum(axis="index").sort_values().index

fig = px.bar(
    df_continents_categories_counts.T.loc[categories_sorted],
    orientation ="h",
    title=fig_title,
    labels={"index": fig_ylabel, "value": fig_xlabel},
    color_discrete_sequence=[continents_colors_int[continent] for continent in df_continents_categories_counts.index],
)
fig.update_layout(
    legend_title_text="",
    title_x=0.5,
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT    
)
fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_bar.html"))
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_bar.pdf"))
fig.show()


fig = go.Figure()

annotations = {}
buttons = []
visible = True
mask = [False] * len(categories_sorted)
max_name_len = max(len(name) for name in continents)
for category_idx, category in enumerate(reversed(categories_sorted)):
    category_data = df_continents_categories_counts[category]
    category_data = category_data[category_data > 0]

    category_name = category.replace("_", " ")
    labels = [f"{name : <{max_name_len}}" for name in category_data.index]

    # Pull the Europe slice slightly out of the pie to highlight it
    pull = np.where(category_data.index == EUROPE_LABEL, 0.1, 0.0)
    fig.add_trace(go.Pie(
        labels=labels,
        values=category_data.values,
        pull=pull.tolist(),
        marker_colors=[continents_colors_int[continent] for continent in category_data.index],
        visible=visible,
        name=category_name,
        sort=False
    ))

    annotation = dict(
        text=f"Category: {category_name}",
        x=-0.3,
        y=0.05,
        xanchor="left",
        showarrow=False
    )
    if visible:
        fig.add_annotation(annotation)

    mask[category_idx] = True
    buttons.append(dict(
        label=category_name,
        method="update",
        args=[
            {"visible": list(mask)},
            {"title": fig_title, "annotations": [annotation]}
        ]
    ))
    mask[category_idx] = False
    visible=False


fig.update_layout(
    title_text=fig_title,
    title_x=0.7,
    width=FIGURE_WIDTH,
    height=FIGURE_HEIGHT,
    legend=dict(
        x=-0.3,
        y=0.1
    )
)


fig.update_layout(
    updatemenus=[
        dict(
            active=0,
            buttons=buttons
        )
    ]
)

fig.write_html(os.path.join(PLOTS_PATH_HTML, f"{fig_name}_pie.html"))
fig.write_image(os.path.join(PLOTS_PATH_PX, f"{fig_name}_pie.pdf"))
fig.show()
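
The dropdown relies on trace visibility: the figure holds one pie trace per category, and each button applies a `visible` list that is `True` only at that category's position. A minimal sketch of what these lists look like for three traces (the helper name `one_hot_masks` is hypothetical):

```python
def one_hot_masks(n_traces: int) -> list:
    """Return one visibility list per trace, each showing exactly that trace."""
    return [[i == j for j in range(n_traces)] for i in range(n_traces)]


masks = one_hot_masks(3)
for mask in masks:
    print(mask)
```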

## Plot 3