Here we try to predict whether a player will finish the game, using only the information available at each step.

- At the beginning (step 0), we only look at the PageRank of the target.
- After the first click, we add the cosine similarity between the first click and the target, the shortest path length and shortest path count from the first click to the target, and the PageRank of the first click.
- Once we have at least 2 clicks, we look at the duration, `pagerank_target`, the cosine similarity between the last click and the target, the shortest path length and shortest path count from the last click to the target, the maximum PageRank over all clicks so far, the ratio of back clicks to the number of clicks, and the difference between the cosine similarities of (last click, target) and (source, target).

We need 3 models:
- One predicting the outcome when we only have the source article. We already have this one: it is the inherent difficulty model.
- One predicting the outcome when we only have the first click.
- One predicting the outcome when we have more than one click.

We will also test whether training one model per number of clicks gives better results. This does not scale in general, since the number of clicks is theoretically unbounded, but most games have few clicks, so such per-click models would still cover most cases.
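The choice between these models could be sketched as a small dispatch on the number of clicks seen so far. The helper name `pick_model_key`, the returned labels and the cutoff `max_per_click` are all hypothetical and not part of the notebook's code; the default of 10 only mirrors the range of n tried further down.

```python
def pick_model_key(num_clicks: int, max_per_click: int = 10) -> str:
    """Return a label for the model to use after `num_clicks` clicks.

    Hypothetical helper: 0 clicks -> inherent difficulty model,
    1 click -> first-click model, 2..max_per_click -> a dedicated
    per-click model, anything larger -> one generic multi-click model.
    """
    if num_clicks < 0:
        raise ValueError("num_clicks must be non-negative")
    if num_clicks == 0:
        return "difficulty"
    if num_clicks == 1:
        return "first_click"
    if num_clicks <= max_per_click:
        return f"clicks_{num_clicks}"
    return "multi_click"
```

Falling back to a single generic model beyond the cutoff keeps the number of trained models bounded even though paths can be arbitrarily long.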

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm

from matplotlib import pyplot as plt

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, mean_squared_error, r2_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from utils.data_processing import *
from utils.graph_processing import *
from models.logistic_regression import LogisticRegression

In [2]:
games = load_preprocessed_games()
games.head()

Loaded 51318 finished paths in df of shape (51318, 7)
Loaded 24875 unfinished paths in df of shape (24875, 8)
After filtering all paths after 2011-02-07 05:02:15
we kept 23245 paths out of 51318 finished paths
There are 24875 unfinished paths
Loaded 4604 articles in df of shape (4604, 1)
Pruning invalid games. Initially we have 48120 games
Pruned invalid games. Now we have 48092 valid games
After removing timeouted games, there are 38775 games left


Unnamed: 0,difficulty_rating,duration,finished,hashIP,num_backward,path,path_length,source,target,timestamp,type_end
0,,166,True,6a3701d319fc3754,0,"[14th_century, 15th_century, 16th_century, Pac...",9,14th_century,African_slave_trade,2011-02-15 03:26:49,
1,3.0,88,True,3824310e536af032,0,"[14th_century, Europe, Africa, Atlantic_slave_...",5,14th_century,African_slave_trade,2012-08-12 06:36:52,
2,,138,True,415612e93584d30e,0,"[14th_century, Niger, Nigeria, British_Empire,...",8,14th_century,African_slave_trade,2012-10-03 21:10:40,
3,3.0,175,True,015245d773376aab,0,"[14th_century, Italy, Roman_Catholic_Church, H...",7,14th_century,John_F._Kennedy,2013-04-23 15:27:08,
4,,110,True,5295bca242be81fe,0,"[14th_century, Europe, North_America, United_S...",6,14th_century,John_F._Kennedy,2013-07-03 22:26:54,


In [67]:
games = games[games['path_length'] > 1] # remove games of length 1
condition = (games['path_length'] == 2) & (games['finished'] == True)
games = games[~condition] # removes games won in one click
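Since `path_length` counts articles including the source, a finished path of length 2 was won in a single click. The effect of the filter above can be checked on a few made-up rows (the toy data below is invented for illustration and is not taken from the dataset):

```python
import pandas as pd

# Invented toy games: only 'path' and 'finished' matter for the filter.
toy = pd.DataFrame({
    "path": [["A"], ["A", "B"], ["A", "B"], ["A", "B", "C"]],
    "finished": [False, True, False, True],
})
toy["path_length"] = toy["path"].apply(len)

# Same two steps as in the notebook cell.
toy = toy[toy["path_length"] > 1]                     # drop paths with no click
won_in_one_click = (toy["path_length"] == 2) & toy["finished"]
toy = toy[~won_in_one_click]                          # drop games won in one click

kept = toy.index.tolist()
print(kept)
```

Row 0 has no click at all and row 1 is won in one click, so neither leaves anything to predict after the first click; the unfinished one-click game and the longer game are kept.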

# Predicting using only first click

We have to add information about the PageRank of the target, the cosine similarity between the first click and the target, the shortest path length and shortest path count from the first click to the target, and the PageRank of the first click.
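For reference, cosine similarity between two embedding vectors can be sketched as below. The notebook relies on `compute_cosine_similarity` from `utils`, whose implementation is not shown here, so this is only the standard definition and not necessarily what the project does; the function name and the zero-vector convention are assumptions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0  # assumed convention for zero vectors
    return float(np.dot(u, v) / denom)
```

Values close to 1 mean the two articles point in the same direction in embedding space; negative values, such as the one for Europe and John_F._Kennedy in the table further down, mean the embeddings point in opposing directions.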

In [68]:
games_one = games.copy()
games_one['first_click'] = games_one.apply(lambda row: row['path'][1], axis = 1)
games_one.head()

Unnamed: 0,difficulty_rating,duration,finished,hashIP,num_backward,path,path_length,source,target,timestamp,type_end,first_click
0,,166,True,6a3701d319fc3754,0,"[14th_century, 15th_century, 16th_century, Pac...",9,14th_century,African_slave_trade,2011-02-15 03:26:49,,15th_century
1,3.0,88,True,3824310e536af032,0,"[14th_century, Europe, Africa, Atlantic_slave_...",5,14th_century,African_slave_trade,2012-08-12 06:36:52,,Europe
2,,138,True,415612e93584d30e,0,"[14th_century, Niger, Nigeria, British_Empire,...",8,14th_century,African_slave_trade,2012-10-03 21:10:40,,Niger
3,3.0,175,True,015245d773376aab,0,"[14th_century, Italy, Roman_Catholic_Church, H...",7,14th_century,John_F._Kennedy,2013-04-23 15:27:08,,Italy
4,,110,True,5295bca242be81fe,0,"[14th_century, Europe, North_America, United_S...",6,14th_century,John_F._Kennedy,2013-07-03 22:26:54,,Europe


In [69]:
games_one.drop(columns = ["difficulty_rating", 'duration', 'hashIP', 'num_backward', 'path_length', 'path','type_end', 'timestamp'], inplace = True)
games_one.head()

Unnamed: 0,finished,source,target,first_click
0,True,14th_century,African_slave_trade,15th_century
1,True,14th_century,African_slave_trade,Europe
2,True,14th_century,African_slave_trade,Niger
3,True,14th_century,John_F._Kennedy,Italy
4,True,14th_century,John_F._Kennedy,Europe


In [70]:
node_stats_df = load_or_compute_node_stats()
games_one = merge_with_node_data(games_one, node_stats_df, columns = ['target', 'first_click'], data = ['pagerank'])
games_one.head()

Loaded 4604 node stats


Unnamed: 0,finished,source,target,first_click,pagerank_target,pagerank_first_click
0,True,14th_century,African_slave_trade,15th_century,3e-05,0.001024
1,True,14th_century,African_slave_trade,Europe,3e-05,0.006698
2,True,14th_century,African_slave_trade,Niger,3e-05,0.000408
3,True,14th_century,John_F._Kennedy,Italy,0.000315,0.003975
4,True,14th_century,John_F._Kennedy,Europe,0.000315,0.006698


In [71]:
embeddings_df = load_embeddings()
games_one = compute_cosine_similarity(games_one, embeddings_df, pairs = [['first_click', 'target']])
games_one.head()

Loaded 4604 embeddings in df of shape (4604, 1)


Unnamed: 0,finished,source,target,first_click,pagerank_target,pagerank_first_click,cosine_sim_first_click_target
0,True,14th_century,African_slave_trade,15th_century,3e-05,0.001024,0.187263
1,True,14th_century,African_slave_trade,Europe,3e-05,0.006698,0.146602
2,True,14th_century,African_slave_trade,Niger,3e-05,0.000408,0.309651
3,True,14th_century,John_F._Kennedy,Italy,0.000315,0.003975,0.03784
4,True,14th_century,John_F._Kennedy,Europe,0.000315,0.006698,-0.128314


In [8]:
pair_data = load_pair_data()
pair_data.head()

Loaded 4604 articles in df of shape (4604, 1)
Loaded 119882 links in df of shape (119882, 2)


source,target,shortest_path_length,shortest_path_count,max_sp_node_degree,max_sp_avg_node_degree,avg_sp_avg_node_degree,one_longer_path_count,max_ol_node_degree,max_ol_avg_node_degree,avg_ol_avg_node_degree,two_longer_path_count,max_tl_node_degree,max_tl_avg_node_degree
10th_century,10th_century,0,1,0,0,0,0,0,0,0,0,0,0
10th_century,11th_century,1,1,26,13,13,2,62,29,26,7,112,46
10th_century,12th_century,2,5,48,24,23,121,180,93,37,30,169,70
10th_century,13th_century,2,4,79,35,28,131,169,84,39,30,169,71
10th_century,14th_century,2,4,53,26,22,113,169,84,37,30,169,70


In [72]:
games_one = add_pair_data(games_one, pair_data, pairs =[['first_click', 'target']], names = ["first"], data = ['shortest_path_length', 'shortest_path_count'])
games_one.head()

Dropped 38 games without link statistics


Unnamed: 0,finished,source,target,first_click,pagerank_target,pagerank_first_click,cosine_sim_first_click_target,shortest_path_length_first,shortest_path_count_first
0,True,14th_century,African_slave_trade,15th_century,3e-05,0.001024,0.187263,3.0,3.0
1,True,14th_century,African_slave_trade,Europe,3e-05,0.006698,0.146602,3.0,7.0
2,True,14th_century,African_slave_trade,Niger,3e-05,0.000408,0.309651,3.0,4.0
3,True,14th_century,John_F._Kennedy,Italy,0.000315,0.003975,0.03784,2.0,1.0
4,True,14th_century,John_F._Kennedy,Europe,0.000315,0.006698,-0.128314,2.0,2.0


In [73]:
games_one.columns

Index(['finished', 'source', 'target', 'first_click', 'pagerank_target',
       'pagerank_first_click', 'cosine_sim_first_click_target',
       'shortest_path_length_first', 'shortest_path_count_first'],
      dtype='object')

In [74]:
games_one.drop(columns = ['source', 'target', 'first_click'],inplace = True)

In [75]:
features_1 = ['pagerank_target',
       'pagerank_first_click', 'cosine_sim_first_click_target',
       'shortest_path_length_first', 'shortest_path_count_first']

In [76]:
model_1 = LogisticRegression(games_one, features_1)
model_1.fit()

Class distribution: finished
True     0.5
False    0.5
Name: proportion, dtype: float64
Total number of samples: 24602
Optimization terminated successfully.
         Current function value: 0.624088
         Iterations 6
Training Set Metrics:
Threshold:   0.4380
F1 Score:    0.6497
Precision:   0.6499
Accuracy:    0.6498
              precision    recall  f1-score   support

       False     0.6580    0.6465    0.6522      2461
        True     0.6524    0.6638    0.6581      2460

    accuracy                         0.6552      4921
   macro avg     0.6552    0.6552    0.6551      4921
weighted avg     0.6552    0.6552    0.6551      4921



In [77]:
model_1.summary()

0,1,2,3
Dep. Variable:,finished,No. Observations:,19681.0
Model:,Logit,Df Residuals:,19676.0
Method:,MLE,Df Model:,4.0
Date:,"Fri, 13 Dec 2024",Pseudo R-squ.:,0.09963
Time:,22:52:15,Log-Likelihood:,-12283.0
converged:,True,LL-Null:,-13642.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
x1,0.3598,0.027,13.423,0.000,0.307,0.412
x2,-0.0413,0.016,-2.566,0.010,-0.073,-0.010
x3,0.0738,0.017,4.387,0.000,0.041,0.107
x4,-0.6915,0.023,-30.116,0.000,-0.737,-0.647
x5,0.2108,0.018,11.837,0.000,0.176,0.246
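To read the coefficients more easily, they can be turned into odds ratios. The sketch below assumes that `x1` to `x5` follow the order of `features_1`, and that the custom `LogisticRegression` class standardizes the features before fitting (a `StandardScaler` is imported, but the class itself is not shown). Under those assumptions each ratio is the multiplicative change in the odds of finishing per one standard deviation of the feature.

```python
import math

features_1 = [
    "pagerank_target",
    "pagerank_first_click",
    "cosine_sim_first_click_target",
    "shortest_path_length_first",
    "shortest_path_count_first",
]
# Coefficients copied from the summary table above (x1..x5, assumed order).
coefs = [0.3598, -0.0413, 0.0738, -0.6915, 0.2108]

odds_ratios = {name: math.exp(c) for name, c in zip(features_1, coefs)}
for name, ratio in odds_ratios.items():
    print(f"{name:32s} {ratio:.3f}")
```

With this reading, the shortest path length from the first click to the target has by far the strongest effect: an odds ratio of roughly 0.5, meaning the odds of finishing are about halved for each unit increase of the (assumed standardized) feature.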


# Predicting using more than one click

Now we create a function that automatically builds a dataset from the first n clicks of each game, where n is at least 2.

In [86]:
def build_dataset(starting_games, n):
    cols = []
    for i in range(n): # extracting article 
        cols.append(f"{i+1}_click")
        starting_games[f"{i+1}_click"] = starting_games.apply(lambda row: row['path'][i+1], axis = 1)
    # estimated time spent on the first n clicks, assuming equal time per click
    starting_games['duration'] = (n/(starting_games['path_length']-1))* starting_games['duration']

    starting_games['num_back'] = starting_games.apply(lambda a: (a[cols] == '<').sum()/ (n-1), axis = 1) 
    print(starting_games['num_back'].describe())
    print(starting_games['num_back'].unique())

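    # Replace each '<' (back click) with the article visited two steps earlier.
    # This is exact for isolated back clicks; several back clicks in a row
    # would need a stack of visited articles to be resolved correctly.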
    starting_games['2_click'] = starting_games.apply(lambda row: row['2_click'] if (row['2_click'] != '<') else row['source'], axis = 1)
        
    if n > 2:
        for i in range(3, n+1):
            starting_games[f"{i}_click"] = starting_games.apply(
                    lambda row: row[f"{i}_click"] if (row[f"{i}_click"] != '<') else row[f"{i-2}_click"], axis=1)
    
    node_stats_df = load_or_compute_node_stats()
    cols.append('source')
    starting_games = merge_with_node_data(starting_games, node_stats_df, columns = cols, data = ['pagerank'])
    temp = []
    for i in cols:
        temp.append(f"pagerank_{i}")
    starting_games['max_pagerank'] = starting_games.apply(lambda row: row[temp].max(), axis = 1)
    print(temp)
    
    embeddings_df = load_embeddings()
    starting_games = compute_cosine_similarity(starting_games, embeddings_df, pairs = [['source', 'target'], [f"{n}_click", 'target']])
    starting_games['cos_diff'] = starting_games[f'cosine_sim_{n}_click_target'] - starting_games['cosine_sim_source_target']

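    # Load the pair statistics on every call: assigning pair_data inside the
    # function makes it a local name, so it must always be set before use.
    pair_data = load_pair_data()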
    starting_games = add_pair_data(starting_games, pair_data, pairs =[[f'{n}_click', 'target']], names = [f"{n}"], data = ['shortest_path_length', 'shortest_path_count'])

    for i in range(1, n):
        cols.append(f"pagerank_{i}_click")

    cols.append("cosine_sim_source_target")
    cols.append("pagerank_source")
    cols.extend(["difficulty_rating", 'hashIP', 'num_backward', 'path_length', 'path','type_end', 'timestamp'])
    cols.append('target')
    starting_games.drop(columns = cols, inplace = True)
    features = starting_games.columns.values.tolist()
    features.remove("finished")
    
    
    return starting_games, features
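Back clicks (`<`) can also be resolved with a stack of visited articles, which handles several back clicks in a row. The sketch below is an alternative to the replacement logic inside `build_dataset`, not the method used in this notebook; the helper name `resolve_back_clicks` is hypothetical, and it assumes that `<` always returns to the previously displayed article and that a path starts with its source.

```python
def resolve_back_clicks(path):
    """Return the article displayed after each step of `path`.

    '<' is interpreted as 'go back one article' (assumed semantics).
    """
    if not path or path[0] == "<":
        raise ValueError("a path must start with its source article")
    stack = []    # current navigation history
    shown = []    # article displayed after each step
    for step in path:
        if step == "<":
            if len(stack) > 1:   # never pop the source itself
                stack.pop()
        else:
            stack.append(step)
        shown.append(stack[-1])
    return shown
```

For example, `resolve_back_clicks(["S", "A", "B", "<", "<"])` returns `["S", "A", "B", "A", "S"]`: the second back click lands on the source, whereas looking two positions back in the raw path would give `B`.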

In [87]:
new_games = games.copy()
new_games = new_games[new_games['path_length'] > 2] 
condition = (new_games['path_length'] == 3) & (new_games['finished'] == True)
new_games = new_games[~condition] # removes games won in two clicks
new_games.head()

Unnamed: 0,difficulty_rating,duration,finished,hashIP,num_backward,path,path_length,source,target,timestamp,type_end
0,,166,True,6a3701d319fc3754,0,"[14th_century, 15th_century, 16th_century, Pac...",9,14th_century,African_slave_trade,2011-02-15 03:26:49,
1,3.0,88,True,3824310e536af032,0,"[14th_century, Europe, Africa, Atlantic_slave_...",5,14th_century,African_slave_trade,2012-08-12 06:36:52,
2,,138,True,415612e93584d30e,0,"[14th_century, Niger, Nigeria, British_Empire,...",8,14th_century,African_slave_trade,2012-10-03 21:10:40,
3,3.0,175,True,015245d773376aab,0,"[14th_century, Italy, Roman_Catholic_Church, H...",7,14th_century,John_F._Kennedy,2013-04-23 15:27:08,
4,,110,True,5295bca242be81fe,0,"[14th_century, Europe, North_America, United_S...",6,14th_century,John_F._Kennedy,2013-07-03 22:26:54,


In [88]:
dataset, features = build_dataset(new_games, 2)
print(features)
dataset.head()

count    31900.000000
mean         0.049185
std          0.216257
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: num_back, dtype: float64
[0. 1.]
Loaded 4604 node stats
['pagerank_1_click', 'pagerank_2_click', 'pagerank_source']
Loaded 4604 embeddings in df of shape (4604, 1)
Loaded 4604 articles in df of shape (4604, 1)
Loaded 119882 links in df of shape (119882, 2)
Dropped 11 games without link statistics
['duration', 'num_back', 'pagerank_2_click', 'max_pagerank', 'cosine_sim_2_click_target', 'cos_diff', 'shortest_path_length_2', 'shortest_path_count_2']


Unnamed: 0,duration,finished,num_back,pagerank_2_click,max_pagerank,cosine_sim_2_click_target,cos_diff,shortest_path_length_2,shortest_path_count_2
0,41.5,True,0.0,0.001223,0.001223,0.261171,0.058727,3.0,6.0
1,44.0,True,0.0,0.003321,0.006698,0.387016,0.184572,2.0,1.0
2,39.428571,True,0.0,0.00063,0.000642,0.293009,0.090565,2.0,1.0
3,58.333333,True,0.0,0.002173,0.003975,0.108953,0.029451,2.0,1.0
4,44.0,True,0.0,0.002751,0.006698,-0.044935,-0.124437,2.0,2.0
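The `duration` column above is no longer the total game time: `build_dataset` scales it by the share of clicks considered, which amounts to assuming that every click took the same time. A standalone sketch of that computation (the function name `time_on_first_clicks` is hypothetical):

```python
def time_on_first_clicks(total_duration, path_length, n):
    """Estimate the time spent on the first n clicks of a game.

    Assumes time is spread evenly over the path_length - 1 clicks.
    """
    clicks = path_length - 1
    if clicks <= 0 or n > clicks:
        raise ValueError("n must not exceed the number of clicks in the path")
    return total_duration * n / clicks

# First rows of the table: (total duration, path_length) with n = 2
print(time_on_first_clicks(166, 9, 2))  # 41.5
print(time_on_first_clicks(88, 5, 2))   # 44.0
```

Because per-click timestamps are not used here, this feature is only a proxy, and it still depends on the final `path_length` of the game.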


In [89]:
model_2 = LogisticRegression(dataset, features)
model_2.fit()

Class distribution: finished
True     0.5
False    0.5
Name: proportion, dtype: float64
Total number of samples: 20954
Optimization terminated successfully.
         Current function value: 0.567847
         Iterations 7
Training Set Metrics:
Threshold:   0.4810
F1 Score:    0.7111
Precision:   0.7116
Accuracy:    0.7112
              precision    recall  f1-score   support

       False     0.7337    0.7099    0.7216      2096
        True     0.7189    0.7422    0.7304      2095

    accuracy                         0.7261      4191
   macro avg     0.7263    0.7261    0.7260      4191
weighted avg     0.7263    0.7261    0.7260      4191



In [91]:
model_2.summary()

0,1,2,3
Dep. Variable:,finished,No. Observations:,16763.0
Model:,Logit,Df Residuals:,16755.0
Method:,MLE,Df Model:,7.0
Date:,"Fri, 13 Dec 2024",Pseudo R-squ.:,0.1808
Time:,23:44:25,Log-Likelihood:,-9518.8
converged:,True,LL-Null:,-11619.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
x1,-1.0988,0.042,-26.428,0.000,-1.180,-1.017
x2,0.0054,0.018,0.307,0.758,-0.029,0.040
x3,-0.0474,0.024,-2.012,0.044,-0.094,-0.001
x4,-0.0328,0.023,-1.413,0.158,-0.078,0.013
x5,0.0415,0.036,1.152,0.249,-0.029,0.112
x6,0.0796,0.035,2.290,0.022,0.011,0.148
x7,-1.0334,0.026,-40.380,0.000,-1.084,-0.983
x8,0.2284,0.020,11.636,0.000,0.190,0.267
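The `Pseudo R-squ.` reported in these summaries is McFadden's pseudo R-squared, which compares the log-likelihood of the fitted model with that of an intercept-only model. A quick check using the rounded log-likelihoods printed in the two summaries (the function name is hypothetical):

```python
def mcfadden_pseudo_r2(log_likelihood, log_likelihood_null):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(null)."""
    return 1.0 - log_likelihood / log_likelihood_null

# Values copied from the summaries of the one-click and two-click models.
r2_one_click = mcfadden_pseudo_r2(-12283.0, -13642.0)
r2_two_clicks = mcfadden_pseudo_r2(-9518.8, -11619.0)
print(round(r2_one_click, 4), round(r2_two_clicks, 4))
```

Up to rounding of the printed log-likelihoods, this reproduces the reported 0.09963 and 0.1808.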


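With two clicks the prediction is better than with the first click alone: accuracy on the held-out samples rises from 0.6552 to 0.7261, and the pseudo R-squared from 0.0996 to 0.1808. Let's check with different values of n.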

In [92]:
for n in range (2, 11):
    print(f"Prediction based on first {n} clicks: ")
    n_games = games.copy()
    n_games = n_games[n_games['path_length'] > n]
    condition = (n_games['path_length'] == (n+1)) & (n_games['finished'] == True)
    n_games = n_games[~condition] # removes games won in n clicks exactly

    print("Number of games before building dataset: ", n_games.shape)

    dataset, features = build_dataset(n_games, n)

    model_n = LogisticRegression(dataset, features)
    model_n.fit()
    print("----------------------------------------------------")
    print()
    

Prediction based on first 2 clicks: 
Number of games before building dataset:  (31900, 11)
count    31900.000000
mean         0.049185
std          0.216257
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: num_back, dtype: float64
[0. 1.]
Loaded 4604 node stats
['pagerank_1_click', 'pagerank_2_click', 'pagerank_source']
Loaded 4604 embeddings in df of shape (4604, 1)
Loaded 4604 articles in df of shape (4604, 1)
Loaded 119882 links in df of shape (119882, 2)
Dropped 11 games without link statistics
Class distribution: finished
True     0.5
False    0.5
Name: proportion, dtype: float64
Total number of samples: 20954
Optimization terminated successfully.
         Current function value: 0.567847
         Iterations 7
Training Set Metrics:
Threshold:   0.4810
F1 Score:    0.7111
Precision:   0.7116
Accuracy:    0.7112
              precision    recall  f1-score   support

       False     0.7337    0.7099    0.7216      20

In [85]:
model_n.summary()

0,1,2,3
Dep. Variable:,finished,No. Observations:,2616.0
Model:,Logit,Df Residuals:,2609.0
Method:,MLE,Df Model:,6.0
Date:,"Fri, 13 Dec 2024",Pseudo R-squ.:,0.1636
Time:,23:38:52,Log-Likelihood:,-1516.6
converged:,True,LL-Null:,-1813.3
Covariance Type:,nonrobust,LLR p-value:,6.306e-125

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
x1,0.1070,0.044,2.447,0.014,0.021,0.193
x2,-0.0641,0.046,-1.396,0.163,-0.154,0.026
x3,-0.0731,0.046,-1.586,0.113,-0.163,0.017
x4,-0.0801,0.113,-0.707,0.479,-0.302,0.142
x5,0.0578,0.112,0.518,0.605,-0.161,0.276
x6,-1.2206,0.063,-19.425,0.000,-1.344,-1.097
x7,0.3044,0.052,5.845,0.000,0.202,0.407
