## Topic 2 - Players' Behavior Before Balancing

#### Question or interest: 
Are some countries clicked more often than others by Wikispeedia players?

#### Plot
An interactive world map is a fun way to represent players' behavior.
Each country is drawn as a dot. The size of the dot corresponds to the click count for that country, while its color corresponds to the number of articles associated with the country. Edges drawn between dots are scaled by the number of times players used a link between two articles associated with the corresponding countries.
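
The edge scaling described above can be illustrated with a minimal sketch. The toy articles, paths, and the helper `country_edge_counts` are hypothetical and are not the project's `find_pairs` implementation; the sketch also assumes paths contain only article names (no back-click markers).

```python
from collections import Counter

# Hypothetical toy data: article -> associated country, and two player paths
article_country = {
    "London": "United Kingdom",
    "Thames": "United Kingdom",
    "New_York": "United States",
    "Sydney": "Australia",
}
paths = [
    ["London", "New_York", "Sydney"],
    ["Thames", "New_York", "London"],
]

def country_edge_counts(paths, article_country):
    """Count how often players moved between each ordered pair of countries."""
    counts = Counter()
    for path in paths:
        # Consecutive articles in a path are one click apart
        for src, dst in zip(path, path[1:]):
            c_src = article_country.get(src)
            c_dst = article_country.get(dst)
            # Skip clicks where either article has no associated country
            if c_src is not None and c_dst is not None:
                counts[(c_src, c_dst)] += 1
    return counts

edge_counts = country_edge_counts(paths, article_country)
total = sum(edge_counts.values())
# Share of all country-to-country clicks, usable as a relative edge width
edge_weights = {pair: n / total for pair, n in edge_counts.items()}
```

Normalizing by the total keeps the edge widths comparable regardless of how many paths are loaded.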

#### What do we learn from this
This first naive approach suggests that players are strongly biased in the way they play Wikispeedia. Some countries, such as the United States, the United Kingdom, and Australia, are overrepresented, while others are underrepresented and barely visible on the map.

#### Transition to next topic
Can we rationally explain this overrepresentation? Are there confounding factors that influence the previous finding? Is balancing enough to remove the apparent bias? Several balancing methods are investigated: accounting for the number of articles per country, for the number of incoming links to articles, and for article categories. A linear regression on different factors is also fitted to determine whether the country itself influences the number of clicks. Finally, a PageRank algorithm is run to understand whether players navigate Wikispeedia differently from a random walk.
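
To make the random-walk comparison concrete, here is a minimal sketch of PageRank by power iteration on a toy graph. The graph, the click counts, and the damping factor of 0.85 are illustrative assumptions, not the project's data or necessarily the implementation used in the next topic.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict {node: [outgoing neighbours]}."""
    nodes = sorted(set(links) | {n for targets in links.values() for n in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes
                share = damping * rank[node] / len(nodes)
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical toy link graph and click counts
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
toy_clicks = {"A": 30, "B": 10, "C": 55, "D": 5}

ranks = pagerank(toy_links)
total_clicks = sum(toy_clicks.values())
# Ratio > 1: the article is clicked more often than a random surfer would visit it
click_to_rank_ratio = {n: (toy_clicks[n] / total_clicks) / ranks[n] for n in ranks}
```

Articles whose ratio is far from 1 are the ones where player behavior departs from what the link structure alone would predict.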


In [8]:
%load_ext autoreload
%autoreload 2

In [14]:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler
import plotly.graph_objects as go
from geopy.geocoders import Nominatim

from src.utils.functions import find_pairs
from src.data.dataloader import *


In [15]:
# Load the per-article dataset (indexed by article name) with click counts and associated country
data = pd.read_csv("data/country_clicks_links.csv", index_col=0)
articles = data.index.tolist()

# Aggregate per country: total clicks and number of articles associated with the country
country_clicks = data.groupby('Top_1_name')['click_count'].sum().reset_index()
top_1_counts = data['Top_1_name'].value_counts()
country_clicks['occurrences'] = country_clicks['Top_1_name'].map(top_1_counts)
# Clicks per article, used later to account for the number of articles per country
country_clicks["scaled_click_count"] = country_clicks["click_count"] / country_clicks["occurrences"]

countries = country_clicks.Top_1_name.tolist()
clicks = country_clicks.click_count.tolist()
num_articles = country_clicks.occurrences.tolist()

finished_paths = load_path_finished_dataframe()
finished_paths_divided = finished_paths["path"].apply(lambda row: row.split(';'))

unfinished_paths = load_path_unfinished_distance_dataframe()
unfinished_paths_divided = unfinished_paths["path"].apply(lambda row: row.split(';'))

In [16]:
# Get a set of coordinates (latitude, longitude) for each country for visualisation purposes
geolocator = Nominatim(user_agent="my_app")

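def get_country_coordinates(country_name):
    """Return (latitude, longitude) for a country name using Nominatim."""
    location = geolocator.geocode(country_name)
    # geocode returns None when the name cannot be resolved
    if location is None:
        raise ValueError(f"Could not geocode country: {country_name}")
    return (location.latitude, location.longitude)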

coords = []
for country in countries: 
    coords.append(get_country_coordinates(country))

latitudes = [coord[0] for coord in coords]
longitudes = [coord[1] for coord in coords]

# color of nodes is proportional to number of articles associated with the node (country)
scaler = MinMaxScaler()
normalized_counts = scaler.fit_transform([[count] for count in num_articles]).flatten()

# plt.cm.get_cmap is deprecated in recent matplotlib; use the colormap registry instead
color_map = matplotlib.colormaps['Reds']
colors_hex = [matplotlib.colors.to_hex(color_map(norm)) for norm in normalized_counts]

# size of nodes is proportional to the click count 
size_scaler = MinMaxScaler(feature_range=(9, 337))
node_sizes = size_scaler.fit_transform([[count] for count in clicks]).flatten()



In [26]:
# Get edges 
# Focus on finished paths first
all_pairs_finished = find_pairs(finished_paths_divided)
all_pairs_finished_df = pd.DataFrame({'1-unit long path': all_pairs_finished})

# Then, focus on unfinished paths
all_pairs_unfinished = find_pairs(unfinished_paths_divided)
all_pairs_unfinished_df = pd.DataFrame({'1-unit long path': all_pairs_unfinished})

# Combine finished and unfinished paths
all_pairs_merged = pd.concat([all_pairs_finished_df, all_pairs_unfinished_df])

# Divide pairs into 2 columns 'From' and 'To' > Split
all_pairs_countries = all_pairs_merged.copy()
all_pairs_countries['Article_from'] = all_pairs_merged['1-unit long path'].apply(lambda row: row[0].split(',')[0] if isinstance(row[0], str) else row)
all_pairs_countries['Article_to'] = all_pairs_merged['1-unit long path'].apply(lambda row: row[1].split(',')[0] if isinstance(row[1], str) else row)

# Associate each article from columns 'From' and 'To' to their Top_1_name country
all_pairs_countries['Top1_country_From'] = all_pairs_countries['Article_from'].map(data['Top_1_name'])
all_pairs_countries['Top1_country_To'] = all_pairs_countries['Article_to'].map(data['Top_1_name'])

# Create a column with a pair of countries
all_pairs_countries['1-unit long path - COUNTRIES'] = all_pairs_countries['Top1_country_From'] + "-> " + all_pairs_countries['Top1_country_To']

# Normalize: share of all country-to-country clicks for each pair of countries
all_pairs_countries_normalized = all_pairs_countries["1-unit long path - COUNTRIES"].value_counts(normalize=True)


In [52]:
# Main plot - World map

# Create the nodes
node_trace = go.Scattergeo(
    lon=longitudes,
    lat=latitudes,
    text=countries,
    mode='markers',
    marker=dict(
        size=node_sizes/4,
        color=colors_hex,
        line=dict(width=0.5, color='rgb(40,40,40)')
    ),
    hovertemplate='<b>Country:</b> %{text}<br>' +
                  '<b>Articles:</b> %{customdata[0]}<br>' +
                  '<b>Clicks:</b> %{customdata[1]}<extra></extra>',
    customdata=np.column_stack((num_articles, clicks))
)

# Create edges (path between articles based on their associated country)
# Look up each country's (longitude, latitude) once instead of searching the list for every edge
country_coords = dict(zip(countries, zip(longitudes, latitudes)))
edge_traces = []
for edge, weight in all_pairs_countries_normalized.items():
    country_from, country_to = edge.split('-> ')
    
    # Get coordinates for both countries
    lon1, lat1 = country_coords[country_from]
    lon2, lat2 = country_coords[country_to]
    
    # Create edge trace
    edge_trace = go.Scattergeo(
        lon = [lon1, lon2, None],
        lat = [lat1, lat2, None],
        mode = 'lines',
        showlegend=False,
        line = dict(width = weight * 1000, color = 'rgba(0, 128, 0, 0.1)'),
        hoverinfo = 'none'
    )
    
    edge_traces.append(edge_trace)

# Create the world map
layout = go.Layout(
    title='World map of the number of articles and the click count per country before scaling',
    showlegend=False,
    geo=dict(
        scope='world',
        projection_type='equirectangular',
        showland=True,
        landcolor='rgb(243, 243, 243)',
    ),
)

# Create the figure
fig = go.Figure(data=[node_trace] + edge_traces, layout=layout)

# Implement toggling visibility
fig.update_layout(
    updatemenus=[
        {
            'buttons': [
                {
                    'label': 'Show Both Nodes and Edges',
                    'method': 'update',
                    'args': [{'visible': [True] + [True] * len(edge_traces)}]
                },
                {
                    'label': 'Show Nodes Only',
                    'method': 'update',
                    'args': [{'visible': [True] + [False] * len(edge_traces)}]
                },
                {
                    'label': 'Show Edges Only',
                    'method': 'update',
                    'args': [{'visible': [False] + [True] * len(edge_traces)}]
                },
                {
                    'label': 'Hide All',
                    'method': 'update',
                    'args': [{'visible': [False] * (1 + len(edge_traces))}]
                }
            ],
            'direction': 'down',
        }
    ]
)

# Show the plot
fig.show()
fig.write_html("world_counts_and_articles_before_scaling.html")

In [53]:
# Account for the number of articles per country
clicks_scaled = country_clicks.scaled_click_count.tolist()

# size of nodes is proportional to the scaled click count (clicks per article)
size_scaler = MinMaxScaler(feature_range=(9, 337))
node_sizes_scaled = size_scaler.fit_transform([[count] for count in clicks_scaled]).flatten()

# Main plot - World map but with scaled click count
# Create the nodes
node_trace = go.Scattergeo(
    lon=longitudes,
    lat=latitudes,
    text=countries,
    mode='markers',
    marker=dict(
        size=node_sizes_scaled/4,
        color=colors_hex,
        line=dict(width=0.5, color='rgb(40,40,40)')
    ),
    hovertemplate='<b>Country:</b> %{text}<br>' +
                  '<b>Articles:</b> %{customdata[0]}<br>' +
                  '<b>Clicks per article:</b> %{customdata[1]}<extra></extra>',
    customdata=np.column_stack((num_articles, clicks_scaled))
)

# Create the world map
layout = go.Layout(
    title='World map of the number of articles and the click count per country after scaling',
    showlegend=False,
    geo=dict(
        scope='world',
        projection_type='equirectangular',
        showland=True,
        landcolor='rgb(243, 243, 243)',
    ),
)

# Create the figure
fig = go.Figure(data=[node_trace] + edge_traces, layout=layout)

# Implement toggling visibility
fig.update_layout(
    updatemenus=[
        {
            'buttons': [
                {
                    'label': 'Show Both Nodes and Edges',
                    'method': 'update',
                    'args': [{'visible': [True] + [True] * len(edge_traces)}]
                },
                {
                    'label': 'Show Nodes Only',
                    'method': 'update',
                    'args': [{'visible': [True] + [False] * len(edge_traces)}]
                },
                {
                    'label': 'Show Edges Only',
                    'method': 'update',
                    'args': [{'visible': [False] + [True] * len(edge_traces)}]
                },
                {
                    'label': 'Hide All',
                    'method': 'update',
                    'args': [{'visible': [False] * (1 + len(edge_traces))}]
                }
            ],
            'direction': 'down',
        }
    ]
)

# Show the plot
fig.show()
fig.write_html("world_counts_and_articles_after_scaling.html")