## Topic 2 - Players' Behavior Before Balancing

#### Question or interest: 
Are there countries that are clicked by players more often in Wikispeedia? 

#### Plot
An interactive and fun way to represent players' behavior: a world map with one dot per country.
The size of each dot corresponds to the click count for that country, while its color corresponds to the number of articles associated with the country. Edges drawn between dots are scaled by the number of times players use a link between two articles associated with the corresponding countries.
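
One possible way to obtain such edge weights is to map both ends of each clicked link to a country and sum the clicks per country pair. The sketch below runs on a made-up table: the column names `source`, `target`, `n_clicks` and the `article_country` mapping are assumptions for illustration, not the notebook's actual data, and edges are treated as undirected.

```python
import pandas as pd

# Hypothetical link-usage table: one row per (source article, target article)
link_clicks = pd.DataFrame({
    "source": ["Paris", "Paris", "London", "Sydney", "Paris"],
    "target": ["London", "Sydney", "Paris", "London", "Lyon"],
    "n_clicks": [5, 2, 3, 4, 7],
})

# Hypothetical article -> country mapping (plays the role of `Top_1_name`)
article_country = {
    "Paris": "France",
    "Lyon": "France",
    "London": "United Kingdom",
    "Sydney": "Australia",
}

edges = link_clicks.assign(
    src_country=link_clicks["source"].map(article_country),
    dst_country=link_clicks["target"].map(article_country),
)

# Links inside a single country would not produce an edge on the map
edges = edges[edges["src_country"] != edges["dst_country"]]

# Undirected edges (an assumption): A -> B and B -> A share one key
edges = edges.assign(
    pair=[" - ".join(sorted(p)) for p in zip(edges["src_country"], edges["dst_country"])]
)
edge_weights = edges.groupby("pair")["n_clicks"].sum()
print(edge_weights)
```

The resulting weights could then be rescaled to line widths in the same way the click counts are rescaled to dot sizes.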

#### What do we learn from this
From this first naive approach, players appear highly biased in the way they play Wikispeedia. Some countries, such as the United States, the United Kingdom, and Australia, are overrepresented, while others are underrepresented and barely visible on the map above.

#### Transition to next topic
Can we rationally explain this overrepresentation? Are there confounding factors that influence the previous finding? Is balancing enough to get rid of the apparent bias? Several balancing methods are investigated: accounting for the number of articles per country, for the number of incoming links to articles, and for article categories. A linear regression on different factors is also fitted to determine whether the country itself affects the number of clicks. Finally, a PageRank algorithm is run to understand whether players navigate Wikispeedia differently from a random walk.
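
To make the random-walk baseline concrete, here is a minimal power-iteration PageRank on a toy three-article graph. The graph, the damping factor of 0.85, and the function name `pagerank` are illustrative assumptions; the actual analysis may rely on a library implementation instead.

```python
import numpy as np

def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank; adjacency[i, j] = 1 if article i links to article j."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=1)
    # Row-normalise; articles without outgoing links jump uniformly at random
    transition = np.where(
        out_degree[:, None] > 0,
        adjacency / np.maximum(out_degree[:, None], 1),
        1.0 / n,
    )
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1 - damping) / n + damping * rank @ transition
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return new_rank

# Toy graph: article 0 links to 1 and 2, and both link back to 0
toy_links = [[0, 1, 1],
             [1, 0, 0],
             [1, 0, 0]]
ranks = pagerank(toy_links)
print(ranks)
```

Comparing such ranks with the observed share of clicks per article is one way to see where players deviate from a random surfer.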


In [44]:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.preprocessing import MinMaxScaler
import plotly.graph_objects as go
from geopy.geocoders import Nominatim

In [23]:
# Import dataset 
data = pd.read_csv("data/country_clicks_links.csv", index_col=0)
articles = data.index.tolist()
# Per-article hover labels (kept separate from the per-country `clicks` list defined below)
article_click_labels = [f"{count} clicks" for count in data.click_count]

country_clicks = data.groupby('Top_1_name')['click_count'].sum().reset_index()
top_1_counts = data['Top_1_name'].value_counts()
country_clicks['occurrences'] = country_clicks['Top_1_name'].map(top_1_counts)
country_clicks["scaled_click_count"] = country_clicks["click_count"] / country_clicks["occurrences"]

countries = country_clicks.Top_1_name.tolist()
clicks = country_clicks.click_count.tolist()

#clicks_label = [f"{clicks[i]} clicks" for i in range(len(clicks))]
num_articles = country_clicks.occurrences.tolist()
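
The same per-country summary can also be written as a single named aggregation. The sketch below runs it on a small made-up table (article names and counts are invented for illustration) to show what `scaled_click_count` represents: the mean number of clicks per article for each country.

```python
import pandas as pd

# Toy stand-in for the dataset: one row per article
toy = pd.DataFrame(
    {"Top_1_name": ["France", "France", "Australia"], "click_count": [10, 30, 8]},
    index=["Paris", "Lyon", "Sydney"],
)

summary = (
    toy.groupby("Top_1_name")
    .agg(click_count=("click_count", "sum"), occurrences=("click_count", "size"))
    .reset_index()
)
# Clicks per article: total clicks divided by the number of articles of that country
summary["scaled_click_count"] = summary["click_count"] / summary["occurrences"]
print(summary)
```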

In [None]:
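# Get a set of coordinates (latitude, longitude) for each country for visualisation purposes
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="my_app")
# Nominatim's public server allows at most one request per second
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

def get_country_coordinates(country_name):
    """Return (latitude, longitude) for a country name, or raise if it cannot be geocoded."""
    location = geocode(country_name)
    if location is None:
        raise ValueError(f"Could not geocode country: {country_name!r}")
    return (location.latitude, location.longitude)

coords = [get_country_coordinates(country) for country in countries]

latitudes = [coord[0] for coord in coords]
longitudes = [coord[1] for coord in coords]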

In [None]:
# Main plot - World map

# color of nodes is proportional to number of articles associated with the node (country)
scaler = MinMaxScaler()
normalized_counts = scaler.fit_transform([[count] for count in num_articles]).flatten()

color_map = matplotlib.colormaps['Reds']
colors_hex = [matplotlib.colors.to_hex(color_map(norm)) for norm in normalized_counts]

# size of nodes is proportional to the click count 
size_scaler = MinMaxScaler(feature_range=(9, 337))
node_sizes = size_scaler.fit_transform([[count] for count in clicks]).flatten()

# Create the nodes
node_trace = go.Scattergeo(
    lon=longitudes,
    lat=latitudes,
    text=countries,
    mode='markers',
    marker=dict(
        size=node_sizes/4,
        color=colors_hex,
        line=dict(width=0.5, color='rgb(40,40,40)')
    ),
    hovertemplate='<b>Country:</b> %{text}<br>' +
                  '<b>Articles:</b> %{customdata[0]}<br>' +
                  '<b>Clicks:</b> %{customdata[1]}<extra></extra>',
    customdata=np.column_stack((num_articles, clicks))
)

# Create the world map
layout = go.Layout(
    title='World map of the number of articles and the click count per country before scaling',
    showlegend=False,
    geo=dict(
        scope='world',
        projection_type='equirectangular',
        showland=True,
        landcolor='rgb(243, 243, 243)',
    ),
)

# Create the figure
fig = go.Figure(data=[node_trace], layout=layout)

# Show the plot
fig.show()
fig.write_html("world_counts_and_articles_before_scaling.html")
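
For reference, `MinMaxScaler(feature_range=(a, b))` maps each value x to a + (x - min) / (max - min) * (b - a). The NumPy sketch below reproduces that mapping on made-up click counts so the marker sizes can be checked by hand; the helper name `min_max_scale` and the toy values are assumptions for illustration.

```python
import numpy as np

def min_max_scale(values, low, high):
    """Linearly rescale values to [low, high]; assumes the values are not all equal."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return low + (values - values.min()) / span * (high - low)

toy_clicks = [100, 550, 1000]
# Same range and division by 4 as in the plotting cell
marker_sizes = min_max_scale(toy_clicks, 9, 337) / 4
print(marker_sizes)
```

With the range (9, 337) and the division by 4, the least-clicked country gets a marker of size 2.25 and the most-clicked one a marker of size 84.25.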


In [76]:
# Account for the number of articles per country
clicks_scaled = country_clicks.scaled_click_count.tolist()

# size of nodes is proportional to the scaled click count (clicks per article)
size_scaler = MinMaxScaler(feature_range=(9, 337))
node_sizes_scaled = size_scaler.fit_transform([[count] for count in clicks_scaled]).flatten()

# Main plot - World map but with scaled click count
# Create the nodes
node_trace = go.Scattergeo(
    lon=longitudes,
    lat=latitudes,
    text=countries,
    mode='markers',
    marker=dict(
        size=node_sizes_scaled/4,
        color=colors_hex,
        line=dict(width=0.5, color='rgb(40,40,40)')
    ),
    hovertemplate='<b>Country:</b> %{text}<br>' +
                  '<b>Articles:</b> %{customdata[0]}<br>' +
                  '<b>Clicks per article:</b> %{customdata[1]:.1f}<extra></extra>',
    customdata=np.column_stack((num_articles, clicks_scaled))
)

# Create the world map
layout = go.Layout(
    title='World map of the number of articles and the click count per country after scaling',
    showlegend=False,
    geo=dict(
        scope='world',
        projection_type='equirectangular',
        showland=True,
        landcolor='rgb(243, 243, 243)',
    ),
)

# Create the figure
fig = go.Figure(data=[node_trace], layout=layout)

# Show the plot
fig.show()
fig.write_html("world_counts_and_articles_after_scaling.html")