# ADA Project

Now that we understand the dataset, we are going to try to answer our research questions in this notebook.

In [2]:
import pandas as pd
import time
from collections import deque
import numpy as np
pd.options.mode.chained_assignment = None

In [3]:
# read the dataframes
all_games = pd.read_pickle("data/games.pkl")
all_orders = pd.read_pickle("data/orders.pkl")
all_players = pd.read_pickle("data/players.pkl")
all_turns = pd.read_pickle("data/turns.pkl")
all_units = pd.read_pickle("data/units.pkl")

# remove duplicates
all_units = all_units.drop_duplicates()

# Descriptive statistics

We look at the **outcome** (binary) considering the **treatment**, which can be:
- engaged: player who was engaged in a friendship
- single: player who was not engaged in a friendship

And for engaged player, there is an aditional treatment: 
- betrayer: player who betrayed another one
- betrayed: player who ended up betrayed by another
- neutral: player who was not engaged in a *broken* friendship

In order to answer this question, let's define a few functions that we will use in the later analysis.

In [4]:
countries = ['A', 'E', 'F', 'G', 'I', 'R', 'T']
pairs = [x+y for x in countries for y in countries if y > x]

In [5]:
def get_engageds(friendships):
    """Returns the players engaged in a friendship"""
    cols = [col for col in friendships.columns if np.count_nonzero(friendships[col] != 0)]
    return list(set(''.join(cols)))

def get_singles(engageds):
    """Returns players not engaged in a friendship"""
    singles = countries.copy()
    for engaged in engageds:
        singles.remove(engaged)
    return singles

In [6]:
def get_betrayers_and_betrayed(friendships):
    """Given the Friendships dataframe as defined in our analysis, returns all the players who commited 
    betrayals and all players who ended up betrayed"""
    cols = [col for col in friendships.columns if np.count_nonzero(friendships[col] != 0)]
    betrayers = []
    betrayeds = []
    for c in cols: 
        tmp = friendships[c]
        values = tmp[tmp != 0].values
        if type(values[-1]) == str: 
            betrayer = values[-1]
            betrayers.append(betrayer)
            tmp = list(c)
            tmp.remove(betrayer)
            betrayeds.append(tmp[0])
            
    return betrayers, betrayeds

def get_neutrals(betrayers, betrayeds):
    """Given betrayers and betrayeds players of a game, returns the list of players
    who were not involved in a broken friendship
    
    Required: player must be 'engaged'
    """
    neutrals = countries.copy()
    for b in betrayers: 
        if b in neutrals: neutrals.remove(b)
    for b in betrayeds: 
        if b in neutrals: neutrals.remove(b)
    return neutrals

In [7]:
def get_winners(game_id):
    winner = all_players.query("game_id == @game_id & won == 1")
    return winner.country.values

def get_losers(winners):
    loosers = countries.copy()
    for w in winners: loosers.remove(w)
    return loosers

In [8]:
# load the data to analyse
games_id = np.load("data/subset500_N1/games_id.npy")
all_friendships = np.load("data/subset500_N1/friendships.npy", allow_pickle=True)
verbose = False

data_overall = np.zeros(shape = (2,2))
data_engaged = np.zeros(shape = (3,2))

treatments_overall = ["single", "engaged"]
treatments_engaged = ["betrayer", "betrayed", "neutral"]
outcomes = ["winner", "loser"]

stats_overall = pd.DataFrame(data_overall, index = treatments_overall, columns = outcomes )
stats_engageds = pd.DataFrame(data_engaged, index = treatments_engaged, columns = outcomes )

N = len(games_id)
for i, game_id in enumerate(games_id):
    # reconstruct the obtained data
    data = all_friendships[i]
    years = np.arange(1901, 1901 + data.shape[0] * 0.5, 0.5)
    friendships = pd.DataFrame(data = all_friendships[i], columns = pairs, index = years)
    
    # get outcomes
    winners = get_winners(game_id)
    losers = get_losers(winners)
    
    # get treatment 1 
    engageds = get_engageds(friendships)
    singles = get_singles(engageds)
    
    # get treatment 2
    betrayers, betrayeds = get_betrayers_and_betrayed(friendships)
    neutrals = get_neutrals(betrayers, betrayeds)
    
    # overal statistics
    for w in winners:
        if w in singles: stats_overall.loc["single", "winner"] += 1
        if w in engageds: stats_overall.loc["engaged", "winner"] += 1
    for l in losers:
        if l in singles: stats_overall.loc["single", "loser"] += 1
        if l in engageds: stats_overall.loc["engaged", "loser"] += 1
    
    # statistics about betrayers: 
    for winner in winners: 
        if winner in engageds: 
            if winner in betrayers: stats_engageds.loc["betrayer", "winner"] += 1
            if winner in betrayeds: stats_engageds.loc["betrayed", "winner"] += 1
            if winner in neutrals: stats_engageds.loc["neutral", "winner"] += 1
    for loser in losers: 
        if winner in engageds:
            if loser in betrayers: stats_engageds.loc["betrayer", "loser"] += 1
            if loser in betrayeds: stats_engageds.loc["betrayed", "loser"] += 1
            if loser in neutrals: stats_engageds.loc["neutral", "loser"] += 1
            
    if verbose:
        print("\nGame",i)
        print("Winners: ", winners, " and Losers", losers)
        print("Betrayers: ", betrayers, " and Betrayed", betrayeds)
        print("Neutrals: ", neutrals)



In [9]:
win_ratio = stats_overall.winner / (stats_overall.loser + stats_overall.winner)
stats_overall["win_ratio"] = win_ratio
print("Statistics over all dataset")
stats_overall

Statistics over all dataset


Unnamed: 0,winner,loser,win_ratio
single,186.0,1947.0,0.087201
engaged,371.0,996.0,0.271397


In [10]:
win_ratio = stats_engageds.winner / (stats_engageds.loser + stats_engageds.winner)
stats_engageds["win_ratio"] = win_ratio
print("Statistics over players engaged in a friendship")
stats_engageds

Statistics over players engaged in a friendship


Unnamed: 0,winner,loser,win_ratio
betrayer,133.0,126.0,0.513514
betrayed,70.0,189.0,0.27027
neutral,171.0,1610.0,0.096013


What can we see here ? 
- among all players that were involved in a broken friendship (either *betrayer* or *betrayed*) the chances of winning go towards the betrayer. Betrayed players have much higher chances of loosing and about 5 times less chances of winning. . .. 
- the neutral players represents the majority of players, however their chances of winning are not much bigger than the chances of win than betrayed people. 

These results make us strongly believe that **betrayals strongly influence the outcome of the game**. 

What can we do next ? 

- select 250 games with  and without betrayals (and all with friendships), then do a matching based on the games properties, and look at what differs once a betrayal happened for the players who were engaged in a friendship !

- quantify agressivity of players towards others, and using the same dataset as before, try to see what happens to a player that was betrayed

# Machine Learning Analysis: can we predict betrayals using game features ? 

In [11]:
import statsmodels.formula.api as smf

features = pd.read_pickle("data/features_N1/features.pkl")
features

Unnamed: 0,game_id,has_betrayal,length,outcome,n_aoh_x,n_aoh_y,n_aos_x,n_aos_y,supports_xy,supports_yx
0,112524.0,True,1.0,0.0,1.0,2.0,0.0,0.0,1.0,1.0
1,112524.0,True,2.0,0.0,2.0,4.0,0.0,1.0,1.0,1.0
2,112524.0,True,3.0,0.0,1.0,2.0,0.0,1.0,1.0,1.0
3,112524.0,True,4.0,1.0,2.0,3.0,0.0,0.0,0.0,1.0
0,76114.0,True,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
1,102119.0,False,2.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,102119.0,False,3.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
3,102119.0,False,4.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0
4,102119.0,False,5.0,0.0,5.0,1.0,0.0,0.0,1.0,1.0


In [12]:
# train logistic regression

# 1. normalisation 
predictors = ["length", "n_aoh_x", "n_aoh_y","n_aos_x", "n_aos_y", "supports_xy", "supports_yx"]
# predictors = ["length", "n_aoh_x", "n_aoh_y", "n_aos_y",  "supports_yx"]

features = features[predictors + ["outcome"]]

for p in predictors: 
    features[p] = (features[p] - features[p].mean()) / features[p].std()

features

Unnamed: 0,length,n_aoh_x,n_aoh_y,n_aos_x,n_aos_y,supports_xy,supports_yx,outcome
0,-1.061841,-0.372672,0.471909,-0.344557,-0.336199,0.335109,0.347504,0.0
1,-0.797561,0.488811,2.199593,-0.344557,2.448372,0.335109,0.347504,0.0
2,-0.533281,-0.372672,0.471909,-0.344557,2.448372,0.335109,0.347504,0.0
3,-0.269002,0.488811,1.335751,-0.344557,-0.336199,-1.249225,0.347504,1.0
0,-1.061841,-1.234155,-1.255774,-0.344557,-0.336199,0.335109,-1.195380,0.0
...,...,...,...,...,...,...,...,...
1,-0.797561,-1.234155,-0.391933,-0.344557,-0.336199,0.335109,-1.195380,0.0
2,-0.533281,-1.234155,-0.391933,-0.344557,-0.336199,0.335109,0.347504,0.0
3,-0.269002,-0.372672,-0.391933,-0.344557,-0.336199,0.335109,0.347504,0.0
4,-0.004722,3.073260,-0.391933,-0.344557,-0.336199,0.335109,0.347504,0.0


In [13]:
formula = "outcome ~ " + ' + '.join(predictors)
formula

'outcome ~ length + n_aoh_x + n_aoh_y + n_aos_x + n_aos_y + supports_xy + supports_yx'

In [14]:
mod = smf.logit(formula = formula, data = features)
res = mod.fit()
res.summary()

Optimization terminated successfully.
         Current function value: 0.250526
         Iterations 7


0,1,2,3
Dep. Variable:,outcome,No. Observations:,1847.0
Model:,Logit,Df Residuals:,1839.0
Method:,MLE,Df Model:,7.0
Date:,"Tue, 08 Dec 2020",Pseudo R-squ.:,0.0618
Time:,17:27:16,Log-Likelihood:,-462.72
converged:,True,LL-Null:,-493.2
Covariance Type:,nonrobust,LLR p-value:,9.684e-11

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-2.6892,0.101,-26.608,0.000,-2.887,-2.491
length,0.4844,0.075,6.430,0.000,0.337,0.632
n_aoh_x,-0.0290,0.090,-0.322,0.747,-0.206,0.148
n_aoh_y,-0.0125,0.090,-0.139,0.889,-0.188,0.163
n_aos_x,-0.0262,0.092,-0.284,0.776,-0.207,0.155
n_aos_y,-0.0841,0.098,-0.855,0.393,-0.277,0.109
supports_xy,0.4169,0.085,4.928,0.000,0.251,0.583
supports_yx,0.1932,0.083,2.321,0.020,0.030,0.356


In [15]:
tmp = res.predict()
for i, prediction in enumerate(tmp):
    print(prediction, " versus : ", features.iloc[i, -1])

0.049527751086902674  versus :  0.0
0.04280590169692642  versus :  0.0
0.0505602699701083  versus :  0.0
0.03672944267566279  versus :  1.0
0.03894292103287273  versus :  0.0
0.044026834349282416  versus :  0.0
0.04742974936631122  versus :  0.0
0.054283400705557074  versus :  0.0
0.07715967853669004  versus :  1.0
0.03894292103287273  versus :  0.0
0.04298674221755202  versus :  0.0
0.04758232829697003  versus :  0.0
0.05373104090245747  versus :  0.0
0.06206288723796575  versus :  0.0
0.09260351823803939  versus :  0.0
0.09411998625798525  versus :  0.0
0.13726416846351053  versus :  0.0
0.12950069575695689  versus :  0.0
0.13984801386485102  versus :  0.0
0.06528977179495121  versus :  0.0
0.0752714074609407  versus :  0.0
0.14775909542315743  versus :  0.0
0.1646154848404898  versus :  0.0
0.18839590705255183  versus :  0.0
0.12108542841178965  versus :  0.0
0.21762679295754583  versus :  0.0
0.2376281374659847  versus :  0.0
0.20850578728212557  versus :  0.0
0.16091520432418677  



THIS IS OLD ....

# Paired experiment

Paired study over 500 friendships: 250 that didn't ended up in betrayal, and 250 that did. For each friendship, we have
- the game_id
- the players involved : betrayer / betrayeds
- the winners / the losers 

and things we can do the matching upon (*using which outcome ? Maybe 'is one of the player winner'*)
- length of the friendship
- length of the game
- ? average score of the 2 players when they become friends. 
- ? average score .. when they quit being friends.

Then we can attribute to each player a status 'betrayer' or 'betrayed'. 

And finally we could look at the following statistics 
- probability of winning of one of the player.  
- average agressivity of one of the player towards the other. 

## Study design

In order to do this study, we must obtain a new dataset. In the notations, 's' means 'start of friendship', 'e' means 'end of friendship' and 'f' means 'final' (end of the game).

- desired columns: "game_ids", "has_betrayed", "betrayer", "betrayed", "winners", "friendship_length", "game_length", "score_s", "score_e", "score_f", "agressivity_s", "agressivity_e"
- each row is one game ...

Method
1. Construct the dataset
2. Propensity score matching using 'has_betrayed' as outcome
3. Observing effect over the 2 groups: "control" and "treated"

# Potential plots
