# Life 101

In this notebook, we draw a simple analogy between the Wikispeedia game and "Life". We found that some pieces of advice extracted from the game also apply in real life, and we can learn from them!

# Data Preparation

Please add the folders `wikispeedia_articles_html`, `wikispeedia_articles_plaintext` and `wikispeedia_paths-and-graph` from the original dataset.

In [1]:
import pandas as pd
import numpy as np
import src.utils as utils
import src.plots as plots
import plotly.express as px
import plotly.graph_objects as go

from ipysigma import Sigma
# Automatically reload the module in case of changes
%load_ext autoreload
%autoreload 2

In [2]:
data_path = "./data/wikispeedia_paths-and-graph/"

# Load finished paths
finished_paths = utils.load_dataframe(data_path + "paths_finished.tsv", columns=["hashedIpAddress", "timestamp", "durationInSec", "path", "rating"])

# Load unfinished paths
unfinished_paths = utils.load_dataframe(data_path + "paths_unfinished.tsv", columns=["hashedIpAddress", "timestamp", "durationInSec", "path", "target", "type"])

# Load categories
categories = utils.load_dataframe(data_path + "categories.tsv", columns=["page", "category"])

# Load links
links_df = utils.load_dataframe(data_path + "links.tsv", columns=["source", "target"])

In [3]:
# Data cleaning and preprocessing
categories_dict = utils.manage_categories(categories.copy())
finished_paths_df = utils.manage_paths(finished_paths.copy(), categories_dict.copy())
unfinished_paths_df = utils.manage_paths(unfinished_paths.copy(), categories_dict.copy())
links_dict = utils.manage_links(links_df.copy())

#create graph and page ranks
graph = utils.create_graph(links_df.copy())
ranks = utils.page_rank(graph)

# Save the cleaned dataframes
finished_paths_df.to_csv("./data/clean_data/clean_finished_paths.csv")
unfinished_paths_df.to_csv("./data/clean_data/clean_unfinished_paths.csv")


# 0. How much data is too much?

Because our analysis requires heavy computations (LLMs, web browser simulations...), we cannot run it on the whole dataset. Instead, we run it on a subset of source-target game pairs. In this section, we argue that we can reduce the dataset size without losing too much information.

In [4]:
plots.plot_top_game_pairs(finished_paths_df, number_of_pairs = 50)

Top 50 pairs represent 10.85% of the total games
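As a rough illustration of how a coverage figure like the one above can be obtained, here is a minimal sketch. The helper name `top_pairs_coverage` and the toy data are hypothetical; the actual computation lives in `plots.plot_top_game_pairs` and may differ.

```python
import pandas as pd

def top_pairs_coverage(paths: pd.DataFrame, number_of_pairs: int = 50) -> float:
    """Share of games (in %) that belong to the most played source-target pairs."""
    # Count games per (source, target) pair, most frequent first
    pair_counts = paths.groupby(["source", "target"]).size().sort_values(ascending=False)
    # Sum the games of the top pairs and relate them to the total number of games
    return 100 * pair_counts.head(number_of_pairs).sum() / len(paths)

# Toy data (hypothetical, not taken from the dataset)
toy = pd.DataFrame({
    "source": ["A", "A", "A", "B", "C"],
    "target": ["X", "X", "X", "Y", "Z"],
})
coverage = top_pairs_coverage(toy, number_of_pairs=1)
print(f"Top 1 pair represents {coverage:.2f}% of the total games")
```

On the real data, the same idea would be applied to `finished_paths_df`, which already has `source` and `target` columns.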


The top 50 pairs cover only about 11% of the games. However, these are the most frequently played pairs, so each of them comes with many more games than a typical pair, which makes per-pair statistics more reliable. In addition, the subset keeps a distribution of sources and targets similar to that of the full dataset, as shown in the following plots.

In [5]:
plots.source_target_distribution(finished_paths_df, number_of_pairs=50, categories=categories_dict)

We can see that the subset keeps roughly the same source and target distributions as the original dataset, where Science, Everyday life, People and Geography are the most popular categories. In addition, some of the less popular categories, such as Art, IT or Music, are present among both sources and targets, which makes our subset more diverse and may give us ideas about outliers. Finally, we also preserve the main flows: Science goes to Science, Countries go to Geography, etc. This makes our subset representative, so we can expect the conclusions we draw from it to hold for the whole dataset.

# 1. Links coordinates analysis

We obtained the x and y coordinates of the links by using Selenium. It enables us to open the html file of the articles in a simulated browser and then select the links we are interested in. We set the window size of the browser to 1920x1080 because it is the most common. 
Also, when the next article in path is accessible via several links, we stored the coordinates of all these links.

We ran this algorithm on each path of the 50 most popular source-target pairs.

The code doing that can be found in the `src/links_coordinates.py` file.
 
The data collected this way is stored in the files `data/links_coordinates_optimal.csv` for the links coordinates of the optimal paths, `data/links_coordinates_finished.csv` for the links coordinates of the finished paths and `data/links_coordinates_unfinished.csv` for the links coordinates of the unfinished paths.

Let's visualize where people click!

In [6]:
links_coords_optimal = pd.read_csv("data/links_coordinates_optimal.csv")
links_coords_finished = pd.read_csv("data/links_coordinates_finished.csv")
links_coords_unfinished = pd.read_csv("data/links_coordinates_unfinished.csv")

links_coords_optimal["links_coords"] = links_coords_optimal["links_coords"].apply(eval)
links_coords_finished["links_coords"] = links_coords_finished["links_coords"].apply(eval)
links_coords_unfinished["links_coords"] = links_coords_unfinished["links_coords"].apply(eval)

links_coords_optimal["normalized_links_coords"] = links_coords_optimal["normalized_links_coords"].apply(eval)
links_coords_finished["normalized_links_coords"] = links_coords_finished["normalized_links_coords"].apply(eval)
links_coords_unfinished["normalized_links_coords"] = links_coords_unfinished["normalized_links_coords"].apply(eval)

In [7]:
all_links_coords = []
for links_coords in links_coords_optimal["links_coords"]:
    all_links_coords.extend(links_coords)

all_links_coords = [coord for sublist in all_links_coords for coord in sublist]

plots.plot_coordinates_distribution(all_links_coords, "Optimal Paths Coordinates Distribution")

In [8]:
unfinished_all_links_coords = []
for links_coords in links_coords_unfinished["links_coords"]:
    unfinished_all_links_coords.extend(links_coords)

unfinished_all_links_coords = [coord for sublist in unfinished_all_links_coords for coord in sublist]

finished_all_links_coords = []
for links_coords in links_coords_finished["links_coords"]:
    finished_all_links_coords.extend(links_coords)

finished_all_links_coords = [coord for sublist in finished_all_links_coords for coord in sublist]

plots.compare_coordinates_distribution(unfinished_all_links_coords, finished_all_links_coords, "Unfinished Paths Coordinates Distribution", "Finished Paths Coordinates Distribution")

As we can see, we, as humans, are lazy. Most of the clicked links are at the very top of the articles, probably in the first paragraph, whereas the first plot shows that the best links can be found at almost any position in the article.

We also observe that almost no links are clicked beyond a Y coordinate of 2000. For most screen sizes, reaching these coordinates requires scrolling down. This means that, most of the time, people do not make the effort to explore the whole article before making a choice.

Comparing the two plots, we can see that players who found a path clicked more often on links with a Y coordinate between 750 and 2000 than those who did not. This suggests that they are more likely to explore the entire article presented to them.

To take the analysis a step further, we can look at how the level of exploration changes over time. The following plot shows the variance in the horizontal and vertical position, relative to the article size, of the clicked links on a page as a function of the player's position in the path. A higher variance suggests that players explore the page more thoroughly, with selected links being less concentrated in specific areas.

Here we used the position of the links divided by the page size so that the article size does not influence the data. By normalizing the link positions in this way, we ensure that the variance we observe is due to the players' exploration behavior rather than differences in article dimensions. This allows us to make more accurate comparisons across different pages and better understand how players interact with the content.
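To make the normalization and per-step variance more concrete, here is a small sketch. It assumes a nested layout (one list of steps per path, one list of normalized `(x, y)` positions per step), which is inferred from the flattening code above; the function name `variance_per_step` is hypothetical, and `plots.plot_link_coordinates_variance_per_step` may compute things differently.

```python
from collections import defaultdict

import numpy as np

def variance_per_step(paths_coords):
    """Variance of normalized x and y click positions, grouped by step index.

    paths_coords: one entry per path; each entry is a list of steps;
    each step is a list of (x, y) positions already divided by the page size.
    """
    xs, ys = defaultdict(list), defaultdict(list)
    for path in paths_coords:
        for step, links in enumerate(path):
            for x, y in links:
                xs[step].append(x)
                ys[step].append(y)
    # One (var_x, var_y) tuple per step, in path order
    return {step: (float(np.var(xs[step])), float(np.var(ys[step]))) for step in sorted(xs)}

# Toy data (hypothetical): two paths of two steps each
toy_paths = [
    [[(0.1, 0.1)], [(0.5, 0.2)]],
    [[(0.9, 0.7)], [(0.5, 0.2)]],
]
variances = variance_per_step(toy_paths)
print(variances)
```

In this toy example, the clicks at the first step are spread out while those at the second step coincide, so the variance drops to zero at the second step.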

In [9]:
plots.plot_link_coordinates_variance_per_step(links_coords_finished, links_coords_unfinished)

As we can see, players put much more effort into selecting their first link, but this investment drops sharply just after that and stabilizes for the remaining steps. This pattern aligns with expectations: people tend to invest more effort at the start of a task, but motivation quickly wanes, leading them to prioritize time and convenience over further exploration.

## Takeaways 

We, humans, are lazy and could benefit greatly from considering a wider range of options before making a decision. The previous analysis observed a key behavior of human decision-making: the tendency to favor the easy, less demanding choice over a more thoughtful, well-considered one.

# 2. Don't trust LLMs blindly!

In this section, we consider the top 10 source-target pairs successfully played in Wikispeedia, that is, the 10 most frequent source-target pairs found in the finished paths dataset.

First, we compare 2 different models using 2 different prompting strategies, which gives us 4 runs to compare. The models are 4-bit quantized versions of Qwen-3B and Llama-3B. We chose these 2 models because they are fairly small LLMs that can run on our machines/Colab in a reasonable time. We run the same source-to-target task 100 times with a short prompt, referred to as the "simple prompt" in this notebook, and 30 times with a long prompt, referred to as the "detailed prompt" (fewer runs because of time limitations, as the plot below shows). This gives us enough paths for statistically meaningful comparisons, especially since the LLMs sometimes get lost and fail to finish a path. The file that generates the LLM results is `src/LLM.py`. Please note that CUDA is required to run it. (Rerunning it may not generate exactly the same paths, but they should show the same statistical behaviour.)

In [10]:
plots.plot_llm_times()

## Prompting strategy

To get the models to work, we experimented with several prompt formulations. The overall idea is that, at each step, we give the model only the target and the list of possible links, and we ask it for the best choice to get closer to the target. If the answer is not in the list, we keep the conversation context, tell the model that the article it gave is not in the list, and ask it to correct itself. If it does, we continue; otherwise, the path is considered aborted. Once we have a valid answer, we restart with a completely new context using a similar prompt.

The "simple prompt" contains one sentence explaining what the model has to do, whereas the "detailed prompt" contains several steps guiding the model to think, understand the target word and explain why it made a certain choice at each step.

| Simple Prompt | Detailed Prompt |
| -------- | ------- |
| I will give you a list of elements. Choose one element in that list that you think is the most related to {target}. <br> You should never, under no circumstances, choose an element that is not in the list otherwise you will fail forever. <br> The first word of your response should be the answer which is an element of the list. Here is the list of elements to choose from: {list} | I will give you a list of elements. Choose one element in that list that you think is the most related to {target}.You should never, under no circumstances, choose an element that is not in the list otherwise you will fail forever. Follow the instructions carefully: <br> 1- Understand the target word: {target} <br> 2- Look carefully at the list of elements provided and understand them. <br> 3- Choose the element that you think is the most related to the target word. <br> 4- Return the element you chose as your answer, only one element. <br> 5- Explain why you chose this element.The first word of your response should be the answer which is an element of the list. Afterwards explain your choice. Here is the list of elements to choose from: {list} |
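
The step-by-step procedure described in this section (fresh context at each step, one chance to correct an invalid answer, abort otherwise) can be sketched as follows. This is only an illustration: `llm_walk`, `scripted_model`, the abbreviated `PROMPT`, the `RETRY` message and the `max_steps` cap are all hypothetical and are not taken from the project's LLM code. The model is abstracted as a callable that receives the conversation so far and returns the chosen element.

```python
from typing import Callable, Dict, List, Optional

# Abbreviated, hypothetical stand-in for the prompts shown in the table
PROMPT = (
    "Choose one element in that list that you think is the most related to {target}. "
    "Here is the list of elements to choose from: {links}"
)
RETRY = "Your answer is not in the list. Choose an element from the list."

def llm_walk(
    source: str,
    target: str,
    links: Dict[str, List[str]],
    ask_model: Callable[[List[str]], str],
    max_steps: int = 20,  # assumption: safety cap, not mentioned in the notebook
) -> Optional[List[str]]:
    """Return the path found by the model, or None if the path is aborted."""
    path = [source]
    while path[-1] != target and len(path) <= max_steps:
        options = links.get(path[-1], [])
        # Fresh context at every step: only the target and the current options
        messages = [PROMPT.format(target=target, links=options)]
        answer = ask_model(messages)
        if answer not in options:
            # Keep the context and give the model one chance to correct itself
            messages += [answer, RETRY]
            answer = ask_model(messages)
            if answer not in options:
                return None  # second invalid answer: the path is aborted
        path.append(answer)
    return path if path[-1] == target else None

def scripted_model(answers: List[str]) -> Callable[[List[str]], str]:
    """Toy stand-in for an LLM that replays a fixed list of answers."""
    replies = iter(answers)
    return lambda messages: next(replies)

toy_links = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
recovered = llm_walk("A", "D", toy_links, scripted_model(["Z", "B", "D"]))
aborted = llm_walk("A", "D", toy_links, scripted_model(["Z", "Y"]))
print(recovered, aborted)
```

In the first toy run the model gives an invalid answer once and then corrects itself, so the walk continues; in the second it is wrong twice in a row, so the path is aborted.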


## Comparison metric

To choose the best model/prompt combination, we compute the following metric for each one:
$$ \text{Performance\_Score} = \frac{\text{success\_frequency}}{\text{paths\_lengths}} $$
Here, $\text{success\_frequency}$ is the fraction of successful paths among the generated paths for each source-target pair (the higher the frequency, the better the model), and $\text{paths\_lengths}$ is the average length of all paths (the shorter the paths, the better the model).
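
As a worked example of this metric, here is a minimal sketch. It assumes that a path is a list of article names, that a path is successful when it ends on the target, and that length is counted in articles; these conventions and the name `performance_score` are assumptions, and the project code may define them differently.

```python
from typing import List

def performance_score(paths: List[List[str]], target: str) -> float:
    """Success frequency divided by the average length of all generated paths."""
    # Assumption: a path is successful when its last article is the target
    successes = sum(1 for path in paths if path and path[-1] == target)
    success_frequency = successes / len(paths)
    # Assumption: the length of a path is its number of articles
    average_length = sum(len(path) for path in paths) / len(paths)
    return success_frequency / average_length

# Toy example (hypothetical): 2 successful paths out of 3
toy_paths = [["A", "B", "T"], ["A", "T"], ["A", "C"]]
score = performance_score(toy_paths, target="T")
print(round(score, 4))
```

Here the success frequency is 2/3 and the average length is 7/3, so the score is 2/7.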


In [11]:
# Group the finished paths DataFrame by "source" and "target" columns, then count occurrences in each group
top_paths = finished_paths_df.groupby(["source", "target"])["hashedIpAddress"].count().rename("count")

# Sort the grouped data by the count in descending order to get the most frequent paths
top_paths = top_paths.sort_values(ascending=False).reset_index().head(10)
print("Top 10 paths: ")
print(top_paths)

# Extract the "source" column values of the top paths as a list
sources = top_paths['source'].tolist()

# Extract the "target" column values of the top paths as a list
targets = top_paths['target'].tolist()

# Get the LLM paths
# uncomment the following lines to run the LLM generation (CUDA is required) and comment the next one
# import src.LLM as LLM
# llm_paths = LLM.llm_paths(sources, targets, links_dict)
#llm_paths = utils.read_llm_paths("./data/llm_paths.json")

Top 10 paths: 
     source             target  count
0  Asteroid             Viking   1043
1     Brain          Telephone   1040
2   Theatre              Zebra    905
3   Pyramid               Bean    642
4    Batman               Wood    148
5      Bird  Great_white_shark    138
6    Batman      The_Holocaust    119
7      Bird       Adolf_Hitler    107
8      Beer                Sun     99
9    Batman             Banana     69


In [12]:
plots.compare_llms_and_prompts(download=True)

From the plot, we can see that Qwen with Simple Prompt (in green) has a higher performance score in most cases (7 out of 10). We choose this combination of model/prompt to compare with the performance of human players on the same source-target pairs.

In [13]:
# Plot LLM and players mean lengths and errors
llm_paths = utils.read_llm_paths("./data/llm_responses_qwen_simple_prompt.json")
plots.plot_llms_vs_players(sources, targets, finished_paths_df, llm_paths)
p_values = utils.tstats_pvalues(sources, targets, finished_paths_df, llm_paths)
plots.plot_tsatistics(sources, targets, p_values)

## Statistical analysis

The first plot shows that the LLM's mean path length is larger than the players' for every source-target pair. However, the means alone do not tell us how confident we can be in this difference; for example, the players' mean falls within the standard error of the LLM mean for the Brain-Telephone paths. This is why we also compute p-values. We define our null hypothesis as "LLMs have a larger average path length than human players". For all the source-target pairs, the p-values are much larger than 0.05, so we cannot reject the null hypothesis: the data are consistent with LLM paths being longer than the players' paths.
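
To illustrate what such a one-sided test looks like, here is a small sketch on toy data. It uses Welch's t-test from SciPy as an assumption; the notebook does not show which test `utils.tstats_pvalues` applies, so this is not necessarily the same computation.

```python
from scipy import stats

# Toy path lengths for one source-target pair (hypothetical values)
llm_lengths = [6, 7, 8, 9, 7]
player_lengths = [4, 5, 5, 6, 4]

# H0: mean(LLM) >= mean(players)
# H1 (alternative="less"): mean(LLM) < mean(players)
result = stats.ttest_ind(llm_lengths, player_lengths, equal_var=False, alternative="less")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

Because the toy LLM paths are longer on average, the statistic is positive and the p-value of this one-sided test is large, so the null hypothesis is not rejected.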

In [14]:
plots.plot_llm_frequency(sources, targets, llm_paths)

Not only are LLMs not as good as humans, but they also tried to cheat in the game. When we look at the number of games finished by the LLM, we can see that it finished fewer games on the paths it struggled with than on the easy ones. For example, the 3 source-target pairs with the highest mean lengths (Bird -> Great white shark, Theatre -> Zebra and Pyramid -> Bean) are also the 3 pairs with the lowest number of finished paths.
In these unfinished games, the LLM twice gave an answer that was not in its list of possibilities (once initially and once after being asked to correct itself), so it did not follow the task.

## What is the LLM's strategy?
We studied the LLM's strategy to understand why it does not perform as well as real players. We ran the PageRank algorithm on the graph of links and plotted the average ranks of the nodes along paths of the most common lengths, for players and LLMs on each source-target pair. We can see that players tend to go up quickly to a "hub" and then go down to the target. The LLMs, on the other hand, show no clear pattern in their paths. This confirms that they are not good at all tasks and that they are not very consistent. However, we observed that when they use the "hub" strategy, they can be as good as humans (example: the Batman -> The Holocaust path).
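
To show what the "up to a hub, then down to the target" pattern looks like in terms of ranks, here is a toy sketch. It uses a plain power-iteration PageRank written from scratch, which is an assumption and not necessarily what `utils.page_rank` does; the graph and the node names are hypothetical.

```python
from typing import Dict, List

def pagerank(links: Dict[str, List[str]], damping: float = 0.85, iterations: int = 100) -> Dict[str, float]:
    """Plain power-iteration PageRank on a dict mapping each page to its outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split the damped rank evenly among outgoing links
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes
                for t in nodes:
                    new_rank[t] += damping * rank[node] / n
        rank = new_rank
    return rank

# Toy graph (hypothetical): "Hub" receives links from every other page
toy_links = {
    "Start": ["Hub"],
    "Other": ["Hub"],
    "Hub": ["Target", "Start"],
    "Target": ["Hub"],
}
ranks = pagerank(toy_links)
profile = [ranks[page] for page in ["Start", "Hub", "Target"]]
print([round(r, 3) for r in profile])
```

The rank profile along the path rises at the hub and falls again at the target, which is the shape described above for human players.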

In [15]:
plots.plot_llm_vs_players_strategies(sources, targets, finished_paths_df, llm_paths, ranks)

Following this observation, and the hypothesis that humans do better because they first move away to a hub and then home in on the target, we tried a different prompting strategy with Qwen: we first ask it to choose the most general concept, and then give it the same prompt as before (choose the word most related to the target word). We then analyze whether the hub strategy improves performance.

In [16]:
plots.average_ranks_first_step(ranks)

We can see that the mean rank of the first step taken by Qwen is higher when it is prompted to go to a hub than with the simple prompt.

But will this make it perform better?

In [17]:
plots.hub_impact()

No, prompting the LLM to follow the same strategy as humans didn't make it perform better. Turns out that LLMs cannot outperform humans in all tasks... at least, not yet!

## Takeaways about LLMs
LLMs are a very good tool that could make our lives easier in many ways. However, one should not trust them blindly. They can give us inaccurate or even wrong answers. We can ask them for help and advice but they are still far from human strategic thinking and way of reasoning.

# 3. Analyse crowd performance vs average performance

Our idea is to play a game with a given source `src` and target `dst` that might not have been played before, by exploiting all the data from previous games. To choose the second page to click on, we aggregate all the paths that either start at `src` and end at `dst`, or go through `src` and end at `dst`. This gives us the next page each player chose after `src`, and we select our next page by majority vote. We repeat this operation until we reach `dst`; we call this procedure the crowd algorithm. For Condorcet's jury theorem to apply, we need to maximise the number of voters at each step. To do this, we choose paths that maximize the "voter score", defined as the minimum number of voters encountered by the crowd algorithm along the path (i.e., we guarantee at least that many voters at each step of the path).

We run this algorithm on each (`src`, `dst`) tuple with a voter score > 50 and compare the results with what the real players obtained on average for the same (`src`, `dst`) tuple.
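
The majority-vote procedure and the voter score described above can be sketched as follows. This is a simplified illustration under stated assumptions: paths are plain lists of article names, only the first occurrence of a page in a path is considered, and a `max_steps` cap guards against loops. The function name `crowd_path` is hypothetical and the project's crowd module may differ.

```python
from collections import Counter
from typing import List, Optional, Tuple

def crowd_path(paths: List[List[str]], src: str, dst: str, max_steps: int = 50) -> Optional[Tuple[List[str], int]]:
    """Follow the majority vote from src until dst; return (path, voter_score) or None."""
    path, voter_score = [src], None
    while path[-1] != dst and len(path) <= max_steps:
        current = path[-1]
        votes = Counter()
        for p in paths:
            # Voters: finished paths that end on dst and pass through the current page
            if p[-1] == dst and current in p[:-1]:
                votes[p[p.index(current) + 1]] += 1
        if not votes:
            return None  # nobody to vote at this step
        n_voters = sum(votes.values())
        # Voter score: minimum number of voters seen along the crowd path
        voter_score = n_voters if voter_score is None else min(voter_score, n_voters)
        path.append(votes.most_common(1)[0][0])
    return (path, voter_score) if path[-1] == dst else None

# Toy games (hypothetical)
toy_games = [
    ["A", "B", "Z"],
    ["A", "B", "Z"],
    ["A", "C", "D", "Z"],
    ["Q", "A", "B", "Z"],  # goes through A and ends on Z, so it votes too
    ["A", "B", "X"],       # different target, ignored
]
result = crowd_path(toy_games, "A", "Z")
print(result)
```

In the toy data, four players vote at `A` and three at `B`, so the crowd path `A -> B -> Z` has a voter score of 3.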

The code doing this can be found in `src/crowd.py`.

The paths collected this way are stored in the file `data/promising_crowd_paths.csv`. For these paths, the performance of the crowd and of the individual players is stored in the file `data/crowd_vs_players.csv`.

Let's see if the crowd outperformed the individual players!

In [None]:
import src.crowd as crowd

# uncomment the following lines to rerun the crowd algorithm
#crowd.compute_scores(finished_paths_df)
#crowd.stats_players_crowd(finished_paths_df)

# Read CSV file containing the final scores
crowd_res = pd.read_csv("./data/crowd_vs_players.csv")

We now show what the crowd algorithm looks like visually. The crowd path is shown in red.

In [None]:
plots.plot_graph_between(finished_paths_df, src='Herbivore', dst='Zebra')

Sigma(nx.DiGraph with 9 nodes and 19 edges)

In [None]:
plots.plot_crowd_players_comparison(crowd_res)

In [None]:
plots.plot_crowd_players_density(crowd_res)

Let's see in how many games the crowd did better than, or as well as, the players.

In [None]:
print("Fraction of games where the crowd performed as well or better")
np.sum((crowd_res['players_score'] - crowd_res['crowd_score']) >= 0) / len(crowd_res['players_score'])

Fraction of games where the crowd performed as well or better


0.9782608695652174

So the crowd performed as well as or better than the individual players in about 98% of the games.

In [None]:
import scipy.stats

# Run a t-test to see whether the averages are significantly different; we choose a paired t-test.
# This test compares the means of two related groups
res = scipy.stats.ttest_rel(crowd_res['players_score'], crowd_res['crowd_score'], alternative='less')
print("T-Test result: ", res)
res.pvalue

T-Test result:  TtestResult(statistic=9.87829228094965, pvalue=0.9999999999999998, df=91)


0.9999999999999998

The null hypothesis is that the mean of `players_score` is greater than or equal to the mean of `crowd_score`; the alternative (`alternative='less'`) is that it is smaller.

The p-value is far above 0.05, and the test statistic is large and positive (9.88), so the data give no evidence that the players' mean score is lower than the crowd's.

Therefore, we do not reject the null hypothesis.

## How can the crowd be wrong?

In [None]:
plots.plot_path_length_distribution(crowd_res, finished_paths_df)

Let us analyze the two games where the crowd algorithm failed. By examining the distribution of path lengths from individual players, we observe that the algorithm fails because it follows the overwhelming majority, which selects a path that is one edge longer than the optimal path. Interestingly, the average path length among individual players is slightly shorter than the most common path length: a small number of players discover the optimal path, which lowers the overall average.
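
A small numeric example makes this effect concrete. The numbers below are hypothetical and only illustrate how a few optimal paths can pull the players' average below the length followed by the majority.

```python
import statistics

# Hypothetical game: 8 players take a path of length 4, 2 players find the optimal length 3
player_lengths = [4] * 8 + [3] * 2

majority_length = statistics.mode(player_lengths)  # length followed by the crowd
average_length = statistics.mean(player_lengths)   # players' average performance

print(majority_length, average_length)
```

The crowd follows the majority length of 4, while the players' average is 3.8, so the crowd does slightly worse than the average player in this toy game.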

# Conclusion

From our analysis of the Wikispeedia dataset, we’ve uncovered valuable insights for everyday decision-making. First, our brains remain more logical and reliable than LLMs, reminding us to trust and engage our critical thinking. Second, shortcuts driven by laziness often hinder progress, emphasizing the importance of effort and thoroughness. Finally, seeking advice from others enriches our understanding and helps us make more informed decisions, though it’s crucial to evaluate input carefully to avoid groupthink.