## Accounting for the influence of the graph's structure

As we saw in our first naive analysis, the click count metric is not a good proxy for analyzing the players' intentions. Indeed, it seems to be heavily influenced by the graph's structure. We will now explore how we can account for this influence and remove it as much as possible.

### What variables influence the click count?

As seen before, countries with more articles have more clicks. This is very much expected: the more articles a country has, the more likely it is that a player will encounter an article of that country. This is the most obvious example of a confounding variable that influences the click count, but it is not necessarily the only one. Other variables, like the in-degree, the out-degree, or even the categories of the articles, could also have an influence on the click count. For example, if the articles of a given country have higher in-degrees than average, players will see more links to that country, and thus might click on it more often.

#### Ideal rebalancing

To make sure we account for as many confounders as possible, the ideal thing to do would be to set up some kind of propensity score matching. But how? Usually, propensity score matching is done between two groups that are compared in the experiment: a treatment group and a control group. However, in our case, we are comparing click counts across countries, meaning we do not have two, but rather 249 groups that are compared with one another (one group per country).

The natural thing to do would then be to simply extend the propensity score matching to the 249 groups! Instead of looking for pairs of articles that have a similar propensity score, we would look for k-tuples, with k being the number of countries. But there are two issues with this approach:
1. Propensity score matching requires an algorithm that finds maximum cardinality matchings. Although there exist such algorithms that run in polynomial time when k=2, when k>2 the problem is NP-hard, meaning all currently known algorithms for this problem run in exponential time in the worst case (pretty bad).
2. Alright, but our dataset is not that big! Couldn't we just use an exponential time algorithm and be done with it? Well, another problem would then arise: for a lot of countries, the number of articles assigned to them is 1. This means that we would only be able to create a single k-tuple before exhausting all the articles of those countries. Our rebalanced dataset would then contain only one article per country, which of course is not enough to support any kind of analysis.
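
The second issue can be illustrated with a small sketch. It assumes each article is assigned to exactly one country; the function name `max_complete_tuples` and the toy data are hypothetical and are not taken from our dataset. Since each k-tuple needs one article from every country, the number of complete tuples can never exceed the size of the smallest group.

```python
from collections import Counter


def max_complete_tuples(article_countries):
    """Upper bound on the number of k-tuples containing one article per country.

    `article_countries` lists the country assigned to each article.
    Every tuple uses up one article of every country, so the country
    with the fewest articles caps the number of tuples.
    """
    counts = Counter(article_countries)
    return min(counts.values())


# Hypothetical toy assignment (not the real dataset): one country has a single article.
toy = ["France", "France", "France", "Japan", "Japan", "Tuvalu"]
bound = max_complete_tuples(toy)
print(bound)
```

Here the single article of the smallest group is enough to cap the whole matching at one tuple, no matter how many articles the other countries have.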

[TODO: put some graph from this page for example ?](https://en.wikipedia.org/wiki/3-dimensional_matching)

#### A simpler approach

This means we need to consider something simpler. We will first analyze how much each variable influences the click count, and then manually normalize the click count by the variables that seem to have the most influence. This is a very naive approach, but it is the best we can do given the constraints of the problem.
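
As a rough sketch of what "analyzing how much each variable influences the click count" could look like, the snippet below fits an ordinary least squares regression on standardized variables with NumPy, so that the coefficients are comparable across variables. This is only an illustration under assumptions: the function name `standardized_coefficients` is hypothetical, the data is synthetic (not our dataset), and a linear model is just one possible choice.

```python
import numpy as np


def standardized_coefficients(X, y):
    """Fit y ~ X by least squares after z-scoring every column.

    Returns one coefficient per column of X. Because all variables
    are on the same scale, a larger absolute value suggests a
    stronger linear association with y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    # Prepend an intercept column, then solve the least squares problem.
    A = np.column_stack([np.ones(len(yz)), Xz])
    coef, *_ = np.linalg.lstsq(A, yz, rcond=None)
    return coef[1:]  # drop the intercept


# Synthetic data (an assumption for illustration): clicks depend on
# in-degree but not on out-degree.
rng = np.random.default_rng(0)
in_degree = rng.poisson(20, size=500)
out_degree = rng.poisson(15, size=500)
clicks = 3.0 * in_degree + rng.normal(0, 5, size=500)

betas = standardized_coefficients(np.column_stack([in_degree, out_degree]), clicks)
print(dict(zip(["in_degree", "out_degree"], betas.round(2))))
```

On this synthetic data, the in-degree coefficient should dominate, which is the kind of signal we would use to decide which variables to normalize by.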

#### Regression analysis

[TODO: explain what regression analysis is]

### Normalizing the click count

Given the results of the regression analysis, we will only consider two confounders, which seem to have the most influence on the click count: the number of articles per country and the in-degree of each article. We will define a new metric, the normalized click count, computed as follows:

* For each article, its click count is divided by its in-degree.
* When computing the click count of a country, sum the normalized click counts of all articles in that country, and then divide by the number of articles in that country.
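
The two steps above can be sketched as follows. The record layout (`country`, `clicks`, `in_degree` keys) and the function name are hypothetical, and the toy numbers are made up. The sketch also assumes that an article with an in-degree of 0 contributes nothing to the sum while still counting as an article of its country; this edge case is not specified above.

```python
from collections import defaultdict


def normalized_click_counts(articles):
    """Average per-link click count for each country.

    Step 1: divide each article's click count by its in-degree.
    Step 2: sum these values per country and divide by the number
    of articles in that country.
    """
    totals = defaultdict(float)
    n_articles = defaultdict(int)
    for article in articles:
        country = article["country"]
        n_articles[country] += 1
        # Assumption: articles with no incoming links contribute 0.
        if article["in_degree"] > 0:
            totals[country] += article["clicks"] / article["in_degree"]
    return {country: totals[country] / n_articles[country] for country in n_articles}


# Hypothetical toy data, not taken from the real dataset.
toy = [
    {"country": "France", "clicks": 40, "in_degree": 20},  # 2.0 clicks per link
    {"country": "France", "clicks": 10, "in_degree": 10},  # 1.0 clicks per link
    {"country": "Tuvalu", "clicks": 6, "in_degree": 2},    # 3.0 clicks per link
]
scores = normalized_click_counts(toy)
print(scores)
```

In this toy example, France has far more raw clicks than Tuvalu (50 against 6), yet its normalized click count is lower (1.5 against 3.0), because its articles are linked to much more often.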

Of course, the way we computed this new metric is quite arbitrary, and there might be many other ways to do it. For example, if the relationship between the in-degree and the click count were quadratic, it would make more sense to divide the click count by the square of the in-degree. But given that it is very hard to know for sure what the relationship between our confounders and the click count is, we will stick to this simple approach. It still makes quite a bit of sense: we are essentially computing, for a given country, the average number of clicks that a single link to that country receives.