In [3]:
import pyarrow.feather as feather
import pandas as pd
from src.utils import *

# Scoring Methods for Game Path Analysis  

***Prioritizing Scores Based on Minimal Clicks*** 

### **Difference Between Played Path Length and Optimal Distance**  
- **Optimal Distance**: The shortest possible distance from the start to the target article.  
- **Played Path Length**: The actual number of clicks (or visited articles - 1). This is represented in the dataset as `simplified_path_length`.  
- Why use `simplified_path_length` instead of `full_path_length`?  
  Simplified paths eliminate detours, so only the articles that contribute to reaching the target are counted.
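
As a rough illustration of what "eliminating detours" can mean, the sketch below assumes that a back click is recorded as a `<` marker in the path. That marker, the helper name `simplify_path`, and the toy path are assumptions made for this example; the actual simplification used to build `simplified_path_length` may differ.

```python
def simplify_path(path: list[str], back: str = "<") -> list[str]:
    """Drop backtracked articles; `back` is the assumed back-click marker."""
    stack: list[str] = []
    for step in path:
        if step == back:
            if stack:
                stack.pop()  # undo the most recent forward click
        else:
            stack.append(step)
    return stack


full = ["A", "B", "<", "C", "D"]  # toy path: A -> B, back to A, then C -> D
simplified = simplify_path(full)
print(simplified)           # ['A', 'C', 'D']
print(len(simplified) - 1)  # 2 clicks on the simplified path
```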

We define the **Path Score** for a completed path as:  
$$
\mathbf{Path\ Score} = \frac{\text{Optimal\ Distance}}{\text{Simplified\ Path\ Length}}
$$  

This score ranges from 0 (exclusive) to 1, where 1 means the player followed a shortest path. We refer to this score as the **path weight**: the ratio of the optimal distance to the actual (simplified) path length.
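
A minimal sketch of this ratio on toy data. The column name `optimal_distance` is a placeholder chosen for this example, and the values are made up:

```python
import pandas as pd

# Toy data; `optimal_distance` is a hypothetical column name.
paths = pd.DataFrame({
    "optimal_distance": [3, 3, 2],
    "simplified_path_length": [3, 6, 4],
})

# Path score (path weight) = optimal distance / simplified path length.
paths["path_weight"] = paths["optimal_distance"] / paths["simplified_path_length"]
print(paths["path_weight"].tolist())  # [1.0, 0.5, 0.5]
```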

#### **Article Scoring Based on Path Weights**  
After computing path weights for all completed paths, we use them to derive article scores using two approaches:

1. **Weighted Average**  
   Compute the average path weight for each article across all paths it appears in:  
   $$
   \mathbf{Article\ Score} = \frac{\sum_{i=1}^n w_i}{n}
   $$  
   where  $w_1, w_2, \dots, w_n$ are the path weights, and  $n$ is the number of paths the article appears in.  
   - This score is about article quality over quantity.
   - Only articles with a minimum appearance threshold are included to ensure meaningful scores.  
   - **Function**: `calculate_avg_article_weights(df, count_cutoff=30, scaling=None)`  

2. **Sum of Centered Weights**  
   - **Centering**: First, compute the mean article weight across all paths:  
     $$
     \text{Mean\ Article\ Weight} = \frac{\sum_{i=1}^N (\text{path}_i\ \text{weight} \times \text{num\_articles\_in\_path}_i)}{\sum_{i=1}^N \text{num\_articles\_in\_path}_i}
     $$  
     where $N$ is the total number of paths (or a downsampled subset), and $\text{num\_articles\_in\_path}_i$ is the number of articles in simplified path $i$ (excluding the start and target articles).

   - **Centered Weights**:  
     $$
     \mathbf{Centered\ Weight} = \mathbf{Path\ Score} - \text{Mean\ Article\ Weight}
     $$  
     Why center using article weight and not path weight? Because in the end, we are interested in computing article weights, and since paths don't have the same number of articles, the average path weight is not the same as the average article weight.

   - Compute the article score by summing all centered weights for the paths the article appears in:  
     $$
     \mathbf{Article\ Score} = \sum_{i=1}^n cw_i
     $$  
     where $cw_1, cw_2, \dots, cw_n$ are the centered weights.  
   - This score balances quality and usefulness within the game.  
   - Only articles with a minimum appearance threshold are included to ensure meaningful scores.  
   - **Function**: `calculate_sum_article_cweights(df, count_cutoff=30, scaling=None)`
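
The two article scores above can be sketched on toy data as follows. This is only an illustration under assumptions: the column names `articles` and `path_weight` are placeholders, the appearance cutoff is omitted for brevity, and the functions in `src.utils` may be implemented differently.

```python
import pandas as pd

# Toy data: one row per simplified path. `articles` holds the intermediate
# articles (start and target excluded); both column names are hypothetical.
paths = pd.DataFrame({
    "articles": [["A", "B"], ["A"], ["B", "C", "D"]],
    "path_weight": [1.0, 0.5, 0.6],
})

# One row per (path, article) pair.
exploded = (
    paths.explode("articles")
    .rename(columns={"articles": "article"})
    .reset_index(drop=True)
)

# 1. Weighted average: mean path weight over the paths an article appears in.
avg_score = exploded.groupby("article")["path_weight"].mean()

# 2. Sum of centered weights. Averaging over the exploded rows weights each
#    path by its number of articles, which matches the mean article weight.
mean_article_weight = exploded["path_weight"].mean()
exploded["centered_weight"] = exploded["path_weight"] - mean_article_weight
sum_score = exploded.groupby("article")["centered_weight"].sum()

print(avg_score.round(3).to_dict())
print(sum_score.round(3).to_dict())
```

Because the weights are centered on the mean article weight, the centered weights sum to zero over all article appearances, so an article only accumulates a high score by appearing often in paths that are better than average.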

---

## Scores Based on Article Appearance in Detours  

### **Detour Ratio**  
- Detours occur when articles are backtracked (i.e., they do not appear in the simplified path).  
- For each article  $i$, the **Detour Ratio** is:  
  $$
  \mathbf{DetourRatio_i} = \frac{\text{detour\_count}_i}{\text{total\_appearances}_i},
  $$  

  where $\text{detour\_count}_i$ and $\text{total\_appearances}_i$ are the number of appearances in detours and the total number of appearances of article $i$, respectively.
- Only articles with a minimum total appearance threshold are considered.  
- **Function**: `calculate_detour_ratios(df, count_cutoff=30, scaling=None)`  
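
One simple way to count detour appearances is sketched below. It assumes hypothetical `full_path` and `simplified_path` columns that hold lists of article names (with any back-click markers already removed), and it treats every visited article that is missing from the simplified path as a detour. The counting rules in `calculate_detour_ratios` may differ.

```python
from collections import Counter

import pandas as pd

# Toy data with hypothetical column names.
paths = pd.DataFrame({
    "full_path": [["A", "B", "C", "D"], ["A", "C", "D"]],
    "simplified_path": [["A", "C", "D"], ["A", "C", "D"]],
})

total, detour = Counter(), Counter()
for full, simplified in zip(paths["full_path"], paths["simplified_path"]):
    kept = set(simplified)
    total.update(full)
    # Articles visited but absent from the simplified path count as detours.
    detour.update(article for article in full if article not in kept)

detour_ratio = pd.Series({a: detour[a] / total[a] for a in total})
print(detour_ratio.to_dict())  # {'A': 0.0, 'B': 1.0, 'C': 0.0, 'D': 0.0}
```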

---

## Scores Based on Article Presence in Unfinished Paths  

### **Unfinished Ratio**  
- Measures how frequently an article appears in incomplete paths.  
- For each article $i$, the **Unfinished Ratio** is:  
  $$
  \mathbf{UnfinishedRatio_i} = \frac{\text{unfinished\_count}_i}{\text{total\_appearances}_i}
  $$  
- Again, articles must meet a minimum appearance threshold for meaningful scores.  
- **Function**: `calculate_unfinished_ratios(df, count_cutoff=30, scaling=None)`  
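
The sketch below illustrates the ratio together with the minimum appearance threshold. The one-row-per-appearance layout and the tiny cutoff of 3 are assumptions for this toy example only.

```python
import pandas as pd

# Toy data: one row per article appearance (hypothetical layout).
appearances = pd.DataFrame({
    "article": ["A", "A", "A", "B", "B"],
    "finished": [True, False, False, True, True],
})

grouped = appearances.groupby("article")["finished"]
stats = pd.DataFrame({
    "n_appearances": grouped.size(),
    "unfinished_ratio": 1 - grouped.mean(),  # share of unfinished appearances
})

# Keep only articles that appear often enough for the ratio to be meaningful.
count_cutoff = 3
stats = stats[stats["n_appearances"] >= count_cutoff]
print(stats)
```

Article `B` is dropped here because it appears only twice, which mirrors the role of `count_cutoff` in the functions above.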

---



***Now Consider Scores That Reward Finishing the Game as Fast as Possible***

### **Weighted Average of Article Speed**  
We first compute **path speed**, defined as the time taken to complete the path (from `durationInSec`) divided by `full_path_length`, i.e. the average number of seconds per click, so lower raw values mean faster play. As with the weighted average of path weights, we then compute the average speed for each article: extract all $n$ paths the article appears in and average the associated path speeds $s_1, s_2, \dots, s_n$:  
$$
\mathbf{Article\ Speed} = \frac{\sum_{i=1}^n s_i}{n}
$$  

Where $s_1, s_2, \dots, s_n$ are the path speeds, and $n$ is the number of paths containing the article.  

- Only articles with a minimum total appearance threshold are included for meaningful scores.  
- **Function**: `calc_avg_article_speed(df, count_cutoff=30, scaling=None)`  

---

### **Sum of Centered Article Speed**  
This approach mirrors the **sum of centered weights** but uses **path speed** instead of path weight.  

1. **Centering**: Compute the mean path speed across all paths:  
   $$
   \text{Mean\ Path\ Speed} = \frac{\sum_{i=1}^N (\text{path}_i\ \text{speed} \times \text{full\_path\_length}_i)}{\sum_{i=1}^N \text{full\_path\_length}_i}
   $$  
   where $N$ is the total number of paths (or a downsampled subset).  

2. **Centered Speeds**:  
   $$
   \mathbf{Centered\ Speed} = \mathbf{Path\ Speed} - \text{Mean\ Path\ Speed}
   $$  

3. Compute the article score by summing all centered speeds for the paths the article appears in:  
   $$
   \mathbf{Article\ Score} = \sum_{i=1}^n cs_i
   $$  
   where $cs_1, cs_2, \dots, cs_n$ are the centered speeds, and $n$ is the number of paths the article appears in.  

- This score provides a balance between speed quality and frequency of appearance.  
- **Function**: `calc_sum_article_cspeed(df, count_cutoff=30, scaling=None)`  
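
A minimal sketch of the path speed, the length-weighted mean, and the centering step on made-up values. It uses the dataset's `durationInSec` and `full_path_length` columns; the other names are illustrative, and `calc_sum_article_cspeed` may differ in detail.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up.
paths = pd.DataFrame({
    "durationInSec": [60.0, 30.0, 120.0],
    "full_path_length": [4, 3, 6],
})

# Path speed: seconds spent per click.
paths["path_speed"] = paths["durationInSec"] / paths["full_path_length"]

# Mean path speed, weighting each path by its full path length.
mean_speed = np.average(paths["path_speed"], weights=paths["full_path_length"])

# Centered speed: negative values mean less time per click than average.
paths["centered_speed"] = paths["path_speed"] - mean_speed
print(paths)
```

With these weights, the mean path speed equals the total duration divided by the total number of clicks (here 210 / 13 seconds per click).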

---  
***Important note about the scaling***  
All functions are written so that, once scaling is applied, larger values are always better. For example, the ratio of unfinished paths should be as small as possible, so its scaled score column is flipped so that larger means better. This way, when different scores are combined into a composite score, bigger is still always better.
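
The flipping idea can be sketched as follows. `standardize` is a hypothetical helper written for this example, not the scaling code from `src.utils`:

```python
import pandas as pd


def standardize(raw: pd.Series, higher_is_better: bool = True) -> pd.Series:
    """Z-score a raw score; flip the sign when smaller raw values are better.

    Hypothetical helper for illustration only.
    """
    z = (raw - raw.mean()) / raw.std(ddof=0)
    return z if higher_is_better else -z


# Toy unfinished ratios: article A is the "best" (lowest ratio).
unfinished_ratio = pd.Series({"A": 0.1, "B": 0.3, "C": 0.5})
scaled = standardize(unfinished_ratio, higher_is_better=False)
print(scaled)  # A gets the largest scaled score, C the smallest
```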

---  


## Examples of how to get the scores

First, load `filtered_paths` and select the finished paths:

In [5]:
filtered_paths = feather.read_feather('Data/dataframes/filtered_paths.feather')
finished_paths = filtered_paths[filtered_paths['finished']]

# Downsample to one path per (hashedIpAddress, identifier) pair,
# so that players can't just learn paths and then replay them as fast as possible.
# This is part of the data filtering process.
finished_paths = finished_paths.groupby(['hashedIpAddress', 'identifier']).sample(n=1, random_state=42)

Then compute the scores prioritizing minimal number of clicks

In [6]:
avg_weight_df = calculate_avg_article_weights(finished_paths, count_cutoff=30, scaling='standard')
unfinished_ratio_df = calculate_unfinished_ratios(filtered_paths, count_cutoff=30, scaling='standard')
detour_ratio_df = calculate_detour_ratios(finished_paths, count_cutoff=30, scaling='standard')

# For this one it might make sense to downsample the data set so that only one sample per start-target pair is present (the same way I do it in Notebook_P2)
sum_cweight_df = calculate_sum_article_cweights(finished_paths, count_cutoff=30, scaling='standard')

# there are a bunch of prints which can be removed in the final version


... and the speed related ones

In [2]:
# additional speed filtering is needed
speed_filt_finished = filter_duration(finished_paths)

# get the score dfs
avg_speed_df = calc_avg_article_speed(speed_filt_finished, count_cutoff=30, scaling='standard')
sum_cspeed_df = calc_sum_article_cspeed(speed_filt_finished, count_cutoff=30, scaling='standard')


All individual score DataFrames have the same (or a very similar) format:

| article (index) | n_appearances | raw_score | scaled_score |
|---|---|---|---|

Example:

In [14]:
avg_weight_df.sort_values(by='standard', ascending=False)

| article | n_appearances | weighted_avg | standard |
|---|---|---|---|
| Achilles | 47 | 0.834448 | 4.811790 |
| J._K._Rowling | 55 | 0.796696 | 4.044191 |
| Mario | 38 | 0.764568 | 3.390951 |
| Harry_Potter | 62 | 0.754634 | 3.188972 |
| Lead | 30 | 0.753016 | 3.156074 |
| ... | ... | ... | ... |
| Anatomy | 40 | 0.485813 | -2.276821 |
| Irrigation | 67 | 0.480171 | -2.391556 |
| Gas | 45 | 0.472884 | -2.539718 |
| Atheism | 33 | 0.472619 | -2.545097 |


For statistical analysis, and machine learning in particular, the scaled score should be used. The available scaling options are listed in the function docstrings. However, when we do machine learning, we should first perform a train-test split and only then scale, to prevent data leakage. In that case, use the raw scores as labels, split them, and then scale.
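
A minimal sketch of that order of operations, using toy values and a manual standardization. The variable names are illustrative, and in practice the labels would be a raw score column such as `avg_weight_df['weighted_avg']`:

```python
import numpy as np
import pandas as pd

# Toy raw scores standing in for a raw score column.
raw = pd.Series(np.linspace(0.4, 0.9, 10), name="raw_score")

# Split first (a simple shuffled 80/20 split with a fixed seed) ...
rng = np.random.default_rng(42)
shuffled = rng.permutation(raw.index.to_numpy())
train_idx, test_idx = shuffled[:8], shuffled[8:]

# ... then fit the scaling parameters on the training labels only.
mu = raw.loc[train_idx].mean()
sigma = raw.loc[train_idx].std(ddof=0)

train_scaled = (raw.loc[train_idx] - mu) / sigma
test_scaled = (raw.loc[test_idx] - mu) / sigma  # reuses the training parameters
```

The test labels never influence `mu` or `sigma`, which is the point of splitting before scaling.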

You can also build a DataFrame with several scores for comparison, and add composite scores to it.

In [18]:
from sklearn.decomposition import PCA

# Combine the metrics into one DataFrame
composite_df = pd.DataFrame(index=avg_weight_df.index)
composite_df['weight_avg'] = avg_weight_df['weighted_avg']
composite_df['weight_avg_scaled'] = avg_weight_df['standard']

composite_df['unfinished_ratio'] = unfinished_ratio_df['unfinished_ratio']
composite_df['unf_ratio_scaled'] = unfinished_ratio_df['standard']

composite_df['detour_ratio'] = detour_ratio_df['detour_ratio']
composite_df['detour_ratio_scaled'] = detour_ratio_df['standard']

# Compute composite score using PCA for all three scaled metrics
pca = PCA(n_components=1)
composite_df['comp_score_3'] = pca.fit_transform(
    composite_df[['weight_avg_scaled', 'unf_ratio_scaled', 'detour_ratio_scaled']]
)

# Compute composite score using PCA for only weight_avg_scaled and detour_ratio_scaled
pca = PCA(n_components=1)
composite_df['comp_score_2'] = pca.fit_transform(
    composite_df[['weight_avg_scaled', 'detour_ratio_scaled']]
)

# Sort by the three-metric composite score
composite_df = composite_df.sort_values(by='comp_score_3', ascending=False)
composite_df.to_feather('Data/dataframes/composite_scores_df.feather')


| article | weight_avg | weight_avg_scaled | unfinished_ratio | unf_ratio_scaled | detour_ratio | detour_ratio_scaled | comp_score_3 | comp_score_2 |
|---|---|---|---|---|---|---|---|---|
| Achilles | 0.834448 | 4.811790 | 0.131148 | 0.850784 | 0.000000 | 1.312944 | 4.158275 | 4.630274 |
| J._K._Rowling | 0.796696 | 4.044191 | 0.117647 | 1.006611 | 0.017544 | 0.949604 | 3.559448 | 3.794620 |
| Algebra | 0.716872 | 2.421177 | 0.060606 | 1.664997 | 0.000000 | 1.312944 | 3.068307 | 2.686373 |
| Harry_Potter | 0.754634 | 3.188972 | 0.155844 | 0.565727 | 0.000000 | 1.312944 | 2.952162 | 3.310698 |
| Parrot | 0.711134 | 2.304506 | 0.102564 | 1.180703 | 0.000000 | 1.312944 | 2.723494 | 2.591504 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Sport | 0.550422 | -0.963159 | 0.512563 | -3.551640 | 0.173913 | -2.288856 | -3.903475 | -2.162060 |
| DVD | 0.544194 | -1.089802 | 0.393939 | -2.182449 | 0.254902 | -3.966165 | -4.100757 | -3.241349 |
| Eukaryote | 0.521525 | -1.550714 | 0.432836 | -2.631404 | 0.229167 | -3.433178 | -4.369271 | -3.305899 |
| Optical_fiber | 0.512302 | -1.738252 | 0.280000 | -0.867322 | 0.339286 | -5.713781 | -4.701717 | -4.785863 |


## Which (composite) scores do I suggest?

Of course, many different combinations can be tried, and the functions accept many different parameters. Before doing statistical analysis, we should agree on a reasonable appearance threshold (30 is somewhat arbitrary). As for scaling, I think it is worth trying both `standard` and `minmax`.

**I would start with the following scores**
- Only article average weights => `avg_weight_df['standard']`
- Article average weights + detour ratios => `composite_df['comp_score_2']`
- Sum of centered weights => `sum_cweight_df['standard']`
- Only average article speed => `avg_speed_df['standard']`

(`'standard'` would become `'minmax'` if that is used for scaling)


Since many other combinations might also make sense, we can discuss them together to make sure we end up with the most meaningful scores.

