# "A Wolf in Sheep's Clothing"
## Linguistic Cues in Social Deduction Games

## Abstract

We studied [[Niculae et al., 2015]](#niculae), which identifies linguistic cues that signal an imminent betrayal. We would like to apply similar techniques to detect betrayal in social deduction games such as [Town of Salem](#salem), [Secret Hitler](#hitler), [Among Us](#amongus) or [Werewolf/Mafia](#mafia). Is it possible to spot the "traitor" by studying the public exchanges between players during a text-based game? The major difference from the original paper is that we are not looking for a betrayal to come - the breaking of a friendship - but for a betrayal that has already taken place: for instance, the "wolf" seeks to win by posing as a "villager". To that end, we analyse the textual exchanges of different games and try to apply the same methods across multiple sessions.

## Introduction

### Linguistic Harbingers of Betrayal

The goal of this notebook is to extend the paper [Linguistic Harbingers of Betrayal](http://vene.ro/betrayal/#paper), written by Vlad Niculae, Srijan Kumar, Jordan Boyd-Graber and Cristian Danescu-Niculescu-Mizil.

The paper studies interpersonal relations, and more precisely the betrayal that can occur between two people. Its goal is to determine whether a forthcoming betrayal can be predicted from linguistic signs in a dyadic interaction. To do so, the authors collected the conversations of people playing a strategy game called [*Diplomacy*](https://boardgamegeek.com/boardgame/483/diplomacy) and studied their interactions.

Their study concludes that slight changes in certain attributes of the conversation may indicate an impending betrayal. These indicators are positive sentiment, politeness, and a focus on future planning.

### How to spot a traitor

### The Mafiascum Dataset

We would like to apply similar techniques to detect betrayal in a social deduction game called [*The Mafia*](https://en.wikipedia.org/wiki/Mafia_(party_game)). We found a [dataset](https://bitbucket.org/bopjesvla/thesis/src/master/) containing almost 1100 different games. They were collected from a website called [Mafiascum](https://www.mafiascum.net/), which allows players to discuss and play this game with the help of a moderator.

The main difference between *Diplomacy* and *The Mafia* is that in the latter the traitors are present from the beginning of the game. Their goal is to avoid being detected and to eliminate all the other players. In *Diplomacy*, every player can become a traitor at some point in the game. We won't look for a betrayal to come - the breaking of a friendship - but for a betrayal that has already taken place. As such, we are going to analyse textual exchanges in the game [*The Mafia*](https://www.youtube.com/watch?v=QK736KcqdK4).

To work with the data we will use the pandas and NumPy libraries.

In [1]:
import pandas as pd
import numpy as np

The first step is to import the exchanged messages from the dataset. They are split across three `json` files, which we import and concatenate. We also import the `json` files containing information about the players of each game (likewise split across three files).

In [2]:
# Path to the files containing mafiascum data
PATH_MAFIASCUM = 'Data/mafiascum/src/' 

# Name of the files containing the data
post_files = ['mini-normal1.json', 'mini-normal2.json', 'large-normal.json']
info_files = ['mini-normal-slots.json', 'large-normal-slots.json', 'old-normal-slots.json']

# Read the files
posts_mafia = pd.concat(pd.read_json(PATH_MAFIASCUM + fn, orient='records') for fn in post_files)
info_mafia = pd.concat(pd.read_json(PATH_MAFIASCUM + fn, orient='records') for fn in info_files)

# Define the columns "game_id" and "author" as index for the first dataset
posts_mafia.set_index(['game_id', 'author'], inplace=True)

As we will see, the dataset `posts_mafia` contains 5 columns:
- **game_id:** is the number of the game in which the message was sent
- **author:** is the name of the author of the message
- **content:** the text of the post
- **inserted_at:** the time and date at which the message was posted
- **post_no:** the numbering of the posts for this game

We have set the first two columns as the index.

In [3]:
posts_mafia.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,content,inserted_at,post_no
game_id,author,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
29549,Tierce,Click here for games 44 through 1400.\n\nMini ...,2013-06-25 05:19:00,0
29549,Tierce,Mini Normals: 1501 - 1505\n\n\nGame 1501: We'r...,2013-12-06 13:39:00,1
29549,N,Mini Normals 1509 - 1544\n\n\nMini 1509 - Marr...,2015-02-21 08:18:00,2
29549,N,Mini Normals 1553 - 1600\n\n\nMini: 1553 Gone ...,2015-02-21 13:09:00,3
29549,N,Mini Normals 1601 - 1648\n\n\nMini 1601: B_E's...,2015-04-21 03:24:00,4
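
As a small illustration of the loading step, the sketch below concatenates two record-oriented JSON snippets and sets the same two-level index. The inline JSON strings and the author names are made up for the example; the real files are read from `PATH_MAFIASCUM`.

```python
import io
import pandas as pd

# Two tiny, made-up JSON "files" in the same record-oriented layout
part_1 = '[{"game_id": 1, "author": "mod", "content": "Welcome", "post_no": 0}]'
part_2 = '[{"game_id": 2, "author": "ann", "content": "VOTE: bob", "post_no": 5}]'

# Read each part and stack them into a single DataFrame
toy_posts = pd.concat(
    [pd.read_json(io.StringIO(part), orient='records') for part in (part_1, part_2)]
)

# Same two-level index as posts_mafia
toy_posts = toy_posts.set_index(['game_id', 'author'])
print(toy_posts)
```

Only *content* and *post_no* remain as regular columns once the index is set.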


The dataset `info_mafia` also contains 5 columns:
- **game_id:** is the number of the game in which the player is playing
- **text:** is the concatenation of what we can find in the next three columns
- **users:** the nickname of the player
- **role:** the role of the player in the game
- **event:** What happened to the player during the game (did he [die](https://www.youtube.com/watch?v=AZfZnbTgY4E&t)? How? Did he [survive](https://www.youtube.com/watch?v=btPJPFnesV4)?)

Let's see it:

In [4]:
info_mafia.head()

Unnamed: 0,game_id,text,users,role,event
0,24200,"Kthxbye, Town Odd-Night Vigilante, survives",[Kthxbye],Town Odd-Night Vigilante,survives
1,24200,"Slandaar, Mafia Traitor, died Night 3",[Slandaar],Mafia Traitor,died Night 3
2,24200,"Gorgon, Vanilla Townie, survives",[Gorgon],Vanilla Townie,survives
3,24200,"Malakittens replaces theaceofspades, Town Maso...","[Malakittens, theaceofspades]",Town Mason,died Night 2
4,24200,"TheConman17, Mafia One-Shot Bulletproof, lynch...",[TheConman17],Mafia One-Shot Bulletproof,lynched Day 1


---
## Data concierge

We will start with the second dataset, `info_mafia`, and determine for each player whether he or she is playing a betrayer role. A betrayer is defined as a player being on [the bad (and ugly?)](https://www.youtube.com/watch?v=AFa1-kciCb4) side, more precisely a player whose role contains one of the following terms:
- mafia
- goon
- wolf
- serial killer
- SK

We will create a new column called *betrayer* in the dataframe, set to *True* or *False* depending on whether the player's role contains one of these words.

In [5]:
# Flag the roles that contain one of the betrayer terms
# (case-insensitive for the words, case-sensitive for the abbreviation "SK";
#  na=False makes unknown roles evaluate to False instead of NaN)
info_mafia['betrayer'] = (info_mafia['role'].str.contains('mafia|goon|wolf|serial.?killer', case=False, na=False)
                          | info_mafia['role'].str.contains('SK', na=False))
# Select only the games in which every role is known
mask_role_notnan = [not np.any(pd.isnull(info_mafia[info_mafia.game_id == g_id].role)) for g_id in info_mafia.game_id]
info_mafia = info_mafia[mask_role_notnan]
# display DF
info_mafia.head()

Unnamed: 0,game_id,text,users,role,event,betrayer
0,24200,"Kthxbye, Town Odd-Night Vigilante, survives",[Kthxbye],Town Odd-Night Vigilante,survives,False
1,24200,"Slandaar, Mafia Traitor, died Night 3",[Slandaar],Mafia Traitor,died Night 3,True
2,24200,"Gorgon, Vanilla Townie, survives",[Gorgon],Vanilla Townie,survives,False
3,24200,"Malakittens replaces theaceofspades, Town Maso...","[Malakittens, theaceofspades]",Town Mason,died Night 2,False
4,24200,"TheConman17, Mafia One-Shot Bulletproof, lynch...",[TheConman17],Mafia One-Shot Bulletproof,lynched Day 1,True
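
To make the matching rule concrete, here is a minimal sketch on a handful of made-up roles (the role strings and the `None` entry are illustrative, not taken from the dataset). It combines a case-insensitive match on the faction words with a case-sensitive match on the abbreviation "SK".

```python
import pandas as pd

# Made-up roles; None stands for a player whose role is unknown
roles = pd.Series(['Vanilla Townie', 'Mafia Goon', 'Werewolf', 'Town Cop', 'SK', None])

# Case-insensitive match on the faction words, case-sensitive match on "SK";
# na=False turns the unknown role into False rather than NaN
is_betrayer = (
    roles.str.contains(r'mafia|goon|wolf|serial.?killer', case=False, na=False)
    | roles.str.contains('SK', na=False)
)
print(is_betrayer.tolist())
```

Note that the match is a substring match, so "Werewolf" is flagged through "wolf".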


With this information, it is possible to find out, for each game, whether the Town won or the Mafia accomplished its mission. This can be done by comparing the *event* column of the previous DataFrame with the *betrayer* column. If an event stating *survives* is associated with *True* in *betrayer*, then the Mafia won; otherwise the Town succeeded in taking the Mafia down. We will work that out a bit later.

Before finishing with this dataframe, we have to deal with another particularity. Sometimes a player is replaced by another one during a game. In this case, all the players concerned are listed in the *users* column. We therefore have to split this column and create one row per user. In addition, we will only keep the columns we need later (*game_id*, *users*, *betrayer*) and put them in a new dataframe called `betrayer_df`. Finally, we will rename *users* to *author* and set the columns *game_id* and *author* as the index. This gives the same index as the other dataframe and makes them easier to merge later.

In [6]:
# Separate the users and create the new dataframe
betrayer_df = info_mafia.explode('users')[['game_id', 'users', 'betrayer']]

# Change name of the users column and set index to correspond with the other dataframe
betrayer_df.rename(columns ={'users':'author'}, inplace = True)
betrayer_df.set_index(['game_id', 'author'], inplace = True)
betrayer_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,betrayer
game_id,author,Unnamed: 2_level_1
24200,Kthxbye,False
24200,Slandaar,True
24200,Gorgon,False
24200,Malakittens,False
24200,theaceofspades,False
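
The effect of `explode` on a replaced slot can be seen on a two-row toy table. The names below are made up; the second slot lists two users, as happens when one player replaces another.

```python
import pandas as pd

slots = pd.DataFrame({
    'game_id': [1, 1],
    'users': [['alice'], ['bob', 'carol']],   # bob replaced carol (made-up names)
    'betrayer': [False, True],
})

# One row per (game, user); both users of a slot inherit the slot's flag
per_user = slots.explode('users').rename(columns={'users': 'author'})
per_user = per_user.set_index(['game_id', 'author'])
print(per_user)
```

Both `bob` and `carol` end up flagged as betrayers, because the role belongs to the slot rather than to the individual user.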


Now we have the information we want from this dataframe. We will now turn to the other one.

The `posts_mafia` dataframe contains all the messages posted during each game. However, some of these messages were sent by the moderator of the game, whose role is to guide the players through the game. The moderator makes announcements to start and end the game, and also collects the votes that decide which players are eliminated.

As we only want to study the behaviour of the players, we need to identify the messages sent by the moderator. To do that, we look at the very first message sent in every game. This message is always sent by the moderator, who introduces the new game. This gives us the moderator's nickname and thus all of his or her messages.

In [7]:
# We want to find every message sent by the moderator: the first post of each game is always sent by the moderator
first_posts = posts_mafia[posts_mafia["post_no"] == 0]
first_posts.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,content,inserted_at,post_no
game_id,author,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
29549,Tierce,Click here for games 44 through 1400.\n\nMini ...,2013-06-25 05:19:00,0
71796,PenguinPower,Welcome to Mini Normal 1911 \nPenguin Mafia R...,2017-05-12 11:45:00,0
71697,nancy,Welcome to Girls ♥ Girls Mini 1909\nPart 1: On...,2017-05-02 22:15:00,0
71675,XnadrojX,Welcome to Mini Normal 1908 - In The Web\nI am...,2017-05-02 06:30:00,0
71640,Dierfire,Mini Normal 1905\nModerator: Dierfire\nReviewe...,2017-04-29 18:37:00,0


We can see for example that the moderator of the first game in the database is called *Tierce*.

Now we can find all the messages sent by the moderator in each game.

In [8]:
# look for all the posts sent by the moderator in each game
moderator_posts = posts_mafia.loc[first_posts.index]
moderator_posts.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,content,inserted_at,post_no
game_id,author,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
29549,Tierce,Click here for games 44 through 1400.\n\nMini ...,2013-06-25 05:19:00,0
29549,Tierce,Mini Normals: 1501 - 1505\n\n\nGame 1501: We'r...,2013-12-06 13:39:00,1
71796,PenguinPower,Welcome to Mini Normal 1911 \nPenguin Mafia R...,2017-05-12 11:45:00,0
71796,PenguinPower,Rules\nRules shamelessly stolen from various m...,2017-05-12 11:52:00,1
71796,PenguinPower,Role PMs have been sent out. Game will start ...,2017-05-12 12:47:00,2
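
The sketch below replays the same idea on made-up data. It uses `Index.isin` rather than the `.loc` lookup above, which selects the same rows here while keeping the original row order. Note that `mod_a` moderates game 1 but is an ordinary player in game 2, so matching on the `(game_id, author)` pair matters.

```python
import pandas as pd

toy = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2],
    'author':  ['mod_a', 'ann', 'mod_a', 'mod_b', 'mod_a'],
    'post_no': [0, 1, 2, 0, 1],
}).set_index(['game_id', 'author'])

# (game_id, author) pairs of the opening posts identify each game's moderator
openers = toy[toy['post_no'] == 0].index

# Keep every post whose (game_id, author) pair belongs to a moderator
toy_mod_posts = toy[toy.index.isin(openers)]
print(toy_mod_posts)
```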


After the game is over, players often start talking about what happened during the game, bragging about their merits or lamenting that they were killed unjustly. We are not interested in this part of the discussion because by then the roles of the players have been revealed, so there is no betrayer left to detect. We therefore want to remove all these messages.

First of all, we will search the moderator's messages for all those that give the result of a vote. Indeed, a game ends with the elimination of a player leading to the victory of one of the two sides (["the good guys" or "the bad guys"](https://www.youtube.com/watch?v=gDIlTlkOBYc) depending on how you see things).

In [9]:
# Search the moderator's messages for the posts reporting the results of the votes
# (raw string so that "\(" reaches the regex engine without an invalid-escape warning)
vote_counts = moderator_posts[moderator_posts["content"].str.contains(r"vote ?count|vc|not voting \("
                                                                      , case=False)].copy()
vote_counts.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,content,inserted_at,post_no
game_id,author,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
71796,PenguinPower,Welcome to Mini Normal 1911 \nPenguin Mafia R...,2017-05-12 11:45:00,0
71796,PenguinPower,"Vote Count 1.01\nNot Voting (13): Transcend, L...",2017-05-12 19:49:00,4
71796,PenguinPower,"Vote Count 1.02\nGamma Emerald (2): Transcend,...",2017-05-12 20:40:00,25
71796,PenguinPower,"Vote Count 1.03\nMulch (3): Agent Sparkles, Ry...",2017-05-13 00:39:00,112
71796,PenguinPower,Vote Count 1.04\nAgent Sparkles (2): Transcend...,2017-05-13 01:43:00,153
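
A quick check of the pattern on a few made-up posts shows which alternatives fire. The short alternative `vc` matches anywhere in a post, including inside longer words, so some false positives are possible.

```python
import pandas as pd

# Made-up moderator posts
posts = pd.Series([
    'Vote Count 1.01\nNot Voting (13): ...',
    'Votecount 2.3',
    'Day 1 Starts Now!',
    'Final VC of the day',
])

pattern = r'vote ?count|vc|not voting \('   # same pattern, written as a raw string
is_vote_count = posts.str.contains(pattern, case=False)
print(is_vote_count.tolist())
```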


Now we can find the post of the moderator ending a game. This is simply the last post talking about a vote. Let's find that!

In [10]:
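# Find for each game the last post talking about a vote => the end of the game.
# (Series.max(level=...) was removed in pandas 2.0; groupby on the index level gives the same result)
last_posts = vote_counts['post_no'].groupby(level=0, sort=False).max()
last_posts.head()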

game_id
71796    3130
71697    4741
71675    3256
71640    3354
71483    2596
Name: post_no, dtype: int64

We now have, for each game, the number of the message that ends the game. We will add to the `posts_mafia` dataframe a new column called `post_no_last` holding this number.

In [11]:
# Add a new column with the number of the last message
posts_mafia = posts_mafia.join(last_posts, rsuffix='_last', how='inner')
posts_mafia

Unnamed: 0_level_0,Unnamed: 1_level_0,content,inserted_at,post_no,post_no_last
game_id,author,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
71796,PenguinPower,Welcome to Mini Normal 1911 \nPenguin Mafia R...,2017-05-12 11:45:00,0,3130
71796,PenguinPower,Rules\nRules shamelessly stolen from various m...,2017-05-12 11:52:00,1,3130
71796,PenguinPower,Role PMs have been sent out. Game will start ...,2017-05-12 12:47:00,2,3130
71796,PenguinPower,Day 1 Starts Now!,2017-05-12 19:47:00,3,3130
71796,PenguinPower,"Vote Count 1.01\nNot Voting (13): Transcend, L...",2017-05-12 19:49:00,4,3130
...,...,...,...,...,...
20035,The Master,Why is this game taking so long? I posted alm...,2002-07-17 10:16:00,391,219
20035,mole,Just building suspense.\n\nEverybody wins! I w...,2002-07-17 12:01:00,392,219
20035,Soothsayer,Well done to the town. Bah at the mafia - sorr...,2002-07-17 12:52:00,393,219
20035,Antrax,"Okay, how was my play any less than perfect? I...",2002-07-17 13:03:00,394,219


Now we can remove all the rows whose *post_no* is higher than *post_no_last*, which means that the message was sent after the moderator ended the game.

In [12]:
# Keep only the rows whose message number is lower than or equal to the last message number
posts_mafia = posts_mafia[posts_mafia['post_no'] <= posts_mafia['post_no_last']]
posts_mafia

Unnamed: 0_level_0,Unnamed: 1_level_0,content,inserted_at,post_no,post_no_last
game_id,author,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
71796,PenguinPower,Welcome to Mini Normal 1911 \nPenguin Mafia R...,2017-05-12 11:45:00,0,3130
71796,PenguinPower,Rules\nRules shamelessly stolen from various m...,2017-05-12 11:52:00,1,3130
71796,PenguinPower,Role PMs have been sent out. Game will start ...,2017-05-12 12:47:00,2,3130
71796,PenguinPower,Day 1 Starts Now!,2017-05-12 19:47:00,3,3130
71796,PenguinPower,"Vote Count 1.01\nNot Voting (13): Transcend, L...",2017-05-12 19:49:00,4,3130
...,...,...,...,...,...
20035,Internet Stranger,So the best liars are the ones that do get cau...,2002-05-07 17:12:00,215,219
20035,quercitron,"Antrax, that is the funniest Mafia post I have...",2002-05-07 18:25:00,216,219
20035,Internet Stranger,"Quote (mole @ May 06 2002 , 05:46)SaberKitty 1...",2002-05-07 18:53:00,217,219
20035,Victim,I see 3 possible outcomes from this crazed mes...,2002-05-07 18:57:00,218,219
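
The join-then-filter step can be replayed on a four-post toy game. All values below are made up; `toy_last` plays the role of `last_posts` and says that post 2 is the final vote count.

```python
import pandas as pd

toy_posts = pd.DataFrame({
    'game_id': [1, 1, 1, 1],
    'author':  ['mod', 'ann', 'mod', 'ann'],
    'post_no': [0, 1, 2, 3],
    'content': ['Welcome', 'VOTE: bob', 'Final vote count', 'gg everyone'],
}).set_index(['game_id', 'author'])

# Post number of the last moderator vote count in each game (made-up value)
toy_last = pd.Series({1: 2}, name='post_no')
toy_last.index.name = 'game_id'

# Attach it as post_no_last, then drop everything posted after it
toy_posts = toy_posts.join(toy_last, rsuffix='_last', how='inner')
toy_in_game = toy_posts[toy_posts['post_no'] <= toy_posts['post_no_last']]
print(toy_in_game['content'].tolist())
```

The post-game chatter ("gg everyone") is removed, while the final vote count itself is kept.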


Finally, we can delete the messages sent by the moderators during the games. We will create a new dataframe called `messages_players` containing only the messages sent by the players during the games.

In [13]:
# redefine the indexes of the dataframe to make them unique
no_posts_mafia = posts_mafia.reset_index().set_index(['game_id', 'author', 'post_no'])
no_moderator_posts = moderator_posts.reset_index().set_index(['game_id', 'author', 'post_no'])

# Keep only the messages that were not sent by the moderator
messages_players = no_posts_mafia[~no_posts_mafia.index.isin(no_moderator_posts.index)].reset_index()
messages_players.set_index(['game_id', 'author'], inplace = True )
messages_players.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,post_no,content,inserted_at,post_no_last
game_id,author,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
71796,Gamma Emerald,5,Opening with RQS\n1) How excited are you for t...,2017-05-12 19:49:00,3130
71796,Transcend,6,VOTE: gamma,2017-05-12 19:51:00,3130
71796,Gamma Emerald,7,"My answers\n1) I'm quite excited, I played PP'...",2017-05-12 19:55:00,3130
71796,Mulch,8,WTF are RQS,2017-05-12 20:02:00,3130
71796,Gamma Emerald,9,"RQS is random questioning stage, it's a type o...",2017-05-12 20:04:00,3130


Now we can merge the two dataframes `messages_players` and `betrayer_df`.

In [14]:
messages_players = messages_players.merge(betrayer_df, left_index = True, right_index = True).reset_index()
messages_players

Unnamed: 0,game_id,author,post_no,content,inserted_at,post_no_last,betrayer
0,26,CoolBot,27,\n\nThis sounds like you're trying to limit an...,2003-07-24 21:28:00,273,False
1,26,CoolBot,40,It was Someone's suggesting he didn't want ano...,2003-07-25 17:48:00,273,False
2,26,CoolBot,65,I'm not really sure how a neutral role would e...,2003-07-28 15:44:00,273,False
3,26,CoolBot,85,So the robot detector... is a robot? I don't ...,2003-07-29 01:09:00,273,False
4,26,CoolBot,99,Has anyone considered that maybe Someone is a ...,2003-07-29 18:14:00,273,False
...,...,...,...,...,...,...,...
691984,71796,marshy,2215,not if im around. i prolly take you to 3 man a...,2017-05-28 16:01:00,3130,False
691985,71796,marshy,2237,rip,2017-05-28 21:42:00,3130,False
691986,71796,marshy,2240,what do you want my thoughts on boi?,2017-05-28 21:59:00,3130,False
691987,71796,marshy,2242,when frozens scum he sheeps the power players ...,2017-05-28 22:13:00,3130,False


In [15]:
# How many games are left
len(np.unique(messages_players.reset_index().game_id))

637

We now have a merged dataframe that keeps only the information we need. Some of the games we had at the beginning are no longer present: the dataset is now smaller. Indeed, when we merged the two dataframes, some rows could not be matched because their index values were not present in both tables (for example, when the role of a player is unknown). We now have 637 games left.
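
One way to inspect which rows an inner merge drops is a left merge with `indicator=True`. The sketch below uses made-up tables and is only a diagnostic aid; it is not part of the pipeline above.

```python
import pandas as pd

toy_messages = pd.DataFrame({
    'game_id': [1, 1, 2],
    'author':  ['ann', 'bob', 'cid'],
    'content': ['hi', 'VOTE: ann', 'hello'],
})
toy_roles = pd.DataFrame({
    'game_id': [1, 1],
    'author':  ['ann', 'bob'],
    'betrayer': [False, True],
})

# A left merge with indicator=True labels the rows an inner merge would drop
check = toy_messages.merge(toy_roles, on=['game_id', 'author'], how='left', indicator=True)
unmatched = check[check['_merge'] == 'left_only']
print(unmatched[['game_id', 'author']])
```

Here the message from `cid` has no matching role entry, so an inner merge would silently discard it.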

As promised, we will now add a new column called *town_won* that states the outcome of the game: *True* if the Town won, *False* if the Mafia won.

In [16]:
# Create a new dataframe that maps each game_id to its outcome
df_outcome = pd.DataFrame(np.unique(messages_players.game_id), columns=['game_id']).set_index('game_id')
town_won = []

# Loop over the games
for game_id in df_outcome.index:
    # Rows of info_mafia describing the players of this game
    game_info = info_mafia[info_mafia.game_id == game_id]
    # For every player, check whether the event is "survives" (an empty event also counts)
    outcome_clue = [(str(event) == '' or 'survive' in str(event)) for event in game_info.event]
    if np.any(outcome_clue):  # someone survived
        # True if the first listed survivor belongs to the Town
        # (`not` is used instead of `~`, which would give -2 on a plain Python bool)
        town_won.append(not game_info.betrayer.iloc[outcome_clue.index(True)])
    else:
        town_won.append(np.nan)  # nobody survived (np.NaN was removed in NumPy 2.0)

df_outcome['town_won'] = town_won

# Merge with the messages dataframe
messages_players = messages_players.merge(df_outcome, on='game_id')
messages_players.head()

Unnamed: 0,game_id,author,post_no,content,inserted_at,post_no_last,betrayer,town_won
0,26,CoolBot,27,\n\nThis sounds like you're trying to limit an...,2003-07-24 21:28:00,273,False,True
1,26,CoolBot,40,It was Someone's suggesting he didn't want ano...,2003-07-25 17:48:00,273,False,True
2,26,CoolBot,65,I'm not really sure how a neutral role would e...,2003-07-28 15:44:00,273,False,True
3,26,CoolBot,85,So the robot detector... is a robot? I don't ...,2003-07-29 01:09:00,273,False,True
4,26,CoolBot,99,Has anyone considered that maybe Someone is a ...,2003-07-29 18:14:00,273,False,True
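
The outcome rule can be checked by hand on three made-up games. This is a simplified sketch: the helper `town_won_for` is hypothetical, and unlike the cell above it does not treat an empty *event* as a survival.

```python
import pandas as pd

toy_info = pd.DataFrame({
    'game_id':  [1, 1, 1, 2, 2, 3],
    'event':    ['survives', 'lynched Day 1', 'died Night 2',
                 'lynched Day 2', 'survives', 'died Night 1'],
    'betrayer': [False, True, False, False, True, True],
})

def town_won_for(game):
    """True if the first listed survivor is Town, False if Mafia, None if nobody survived."""
    survivors = game[game['event'].str.contains('survive')]
    if survivors.empty:
        return None
    return not bool(survivors['betrayer'].iloc[0])

outcomes = {game_id: town_won_for(game) for game_id, game in toy_info.groupby('game_id')}
print(outcomes)
```

Game 1 has a surviving Town player, game 2 a surviving betrayer, and game 3 no survivor at all, which corresponds to the `NaN` case above.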


The next step is to place the messages of each game in a time frame. We will add temporal markers to this dataset in order to be able to study the evolution of some features. We will also try to find which round of the game each message belongs to.

We will start by adding temporal markers to the messages. The simplest way to do so is to use the *inserted_at* timestamps and normalize them per game in order to get comparable values across games.

First, each message will be associated with a relative in-game time (0: beginning, 1: end). Let's begin by defining a DataFrame with the main timing information for each game (start, end, length).

In [17]:
# Create a new dataframe that will contain the starting time, the ending time and the length of each game
df_game_times = pd.DataFrame(np.unique(messages_players.game_id), columns=['game_id']) 

# The starting time is simply the time when the moderator sent the first message
df_game_times['start_time'] = [moderator_posts.loc[game_id,:].inserted_at.iloc[0] for game_id in df_game_times.game_id]

# The ending time is when the moderator announces the results of the final vote
# (.iloc[-1] selects by position; a bare [-1] on a label-indexed Series is deprecated in recent pandas)
df_game_times['end_time'] = [moderator_posts.loc[game_id,:][moderator_posts.loc[game_id,:].post_no ==last_posts.loc[game_id]].inserted_at.iloc[-1]
           for game_id in df_game_times.game_id]

# Calculate the length of the game (length = end - start)
df_game_times['time_length'] = df_game_times['end_time']-df_game_times['start_time']

df_game_times = df_game_times.set_index('game_id')
df_game_times.head()

Unnamed: 0_level_0,start_time,end_time,time_length
game_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
26,2003-07-23 19:05:00,2003-09-02 19:14:00,41 days 00:09:00
28,2003-07-23 20:47:00,2003-08-06 20:18:00,13 days 23:31:00
38,2003-07-25 00:06:00,2003-08-22 04:15:00,28 days 04:09:00
94,2003-08-08 01:32:00,2003-08-22 21:12:00,14 days 19:40:00
108,2003-08-16 21:22:00,2003-10-04 08:11:00,48 days 10:49:00


This information can then be used to normalize the timestamp of each message relative to its game.

In [18]:
rel_time = []

# for each message we will add a new column containing the relative time (in the game) when the message was sent
for index, row_posts in messages_players.iterrows():
    game_id = row_posts.game_id
    rel_time.append((row_posts.inserted_at - df_game_times.loc[game_id].start_time)/df_game_times.loc[game_id].time_length)

#add the new column
messages_players['rel_time'] = rel_time
messages_players.head()

Unnamed: 0,game_id,author,post_no,content,inserted_at,post_no_last,betrayer,town_won,rel_time
0,26,CoolBot,27,\n\nThis sounds like you're trying to limit an...,2003-07-24 21:28:00,273,False,True,0.026808
1,26,CoolBot,40,It was Someone's suggesting he didn't want ano...,2003-07-25 17:48:00,273,False,True,0.047469
2,26,CoolBot,65,I'm not really sure how a neutral role would e...,2003-07-28 15:44:00,273,False,True,0.118529
3,26,CoolBot,85,So the robot detector... is a robot? I don't ...,2003-07-29 01:09:00,273,False,True,0.128097
4,26,CoolBot,99,Has anyone considered that maybe Someone is a ...,2003-07-29 18:14:00,273,False,True,0.145455
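
The normalisation is `rel_time = (inserted_at - start_time) / time_length`. On a table of this size, the row-by-row loop can be slow; a possible vectorised variant, sketched here on made-up timestamps, looks up each game's start and length with `Series.map` and divides in one step.

```python
import pandas as pd

# Made-up game timings, indexed by game_id like df_game_times
toy_times = pd.DataFrame({
    'start_time': pd.to_datetime(['2003-07-01 00:00', '2003-08-01 00:00']),
    'time_length': pd.to_timedelta(['10 days', '4 days']),
}, index=pd.Index([1, 2], name='game_id'))

toy_msgs = pd.DataFrame({
    'game_id': [1, 1, 2],
    'inserted_at': pd.to_datetime(['2003-07-06 00:00', '2003-07-11 00:00', '2003-08-02 00:00']),
})

# Look up each message's game start and length, then normalise all rows at once
start = toy_msgs['game_id'].map(toy_times['start_time'])
length = toy_msgs['game_id'].map(toy_times['time_length'])
toy_msgs['rel_time'] = (toy_msgs['inserted_at'] - start) / length
print(toy_msgs['rel_time'].tolist())
```

A message posted halfway through a 10-day game gets 0.5, and one posted at the very end gets 1.0.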


Now that we have the expected result, let's try to find the rounds! At first glance it doesn't seem very complicated, but unfortunately several factors complicate things. First of all, the moderators differ from game to game, so we cannot rely on specific words they might use. There is also no *bot* that punctuates the game in a regular way. And, as the icing on the cake, the moderators don't just talk between rounds but also comment during the game.

However, in some games the rounds can be found by retrieving a common pattern in the moderator's vote-count messages. Inspecting the messages shows that some moderators (sadly not all of them) specify a [deadline](https://www.youtube.com/watch?v=jR9Rl-gFi6o) for the current voting session with a common structure: `Deadline: (expired on TIMESTAMP)` (see example below). Finding the post number of the last message in which these meticulous moderators used a given deadline timestamp is then a way to cut these games into rounds.

In [19]:
# printing an example of a moderator's post mentioning "expired on "
example_index = 2
print('Post number {post_number}, posted at {timestamp}, by {mod} in game {game_id} : \n"\n{content}"\
'.format(post_number = vote_counts.post_no.iloc[example_index]
         ,timestamp = vote_counts.inserted_at.iloc[example_index]
         ,mod = vote_counts.index[example_index][1]
         ,game_id = vote_counts.index[example_index][0]
         , content = vote_counts.content.iloc[example_index]
        )
     )

Post number 25, posted at 2017-05-12 20:40:00, by PenguinPower in game 71796 : 
"
Vote Count 1.02
Gamma Emerald (2): Transcend, FireScreamer
Tammy (2): marshy, Zulfy 
Agent Sparkles (1): Ranmaru
Mulch (1): Agent Sparkles

Not Voting (7): LaLight, Mulch, Ryker, Tammy, AdumbroDeus, FrozenFlame, Gamma Emerald

With 13 alive, it takes 7 to lynch.

Deadline: (expired on 2017-05-26 16:00:00)

Mod Notes:  Not sure I like this automatic vote counter thing.  Edit:  Think I found a work around.
"


First, let's modify the `vote_counts` DataFrame by dropping the *author* index level, since the moderator's pseudonym carries no useful information here, and by adding a column holding the timestamp that follows the sequence "expired on " whenever it is present in a message.

In [20]:
vote_counts_per_game = vote_counts.reset_index(level=1, drop=True).copy() # author brings no information
# add column to the DF w/ timestamp if it follows the targeted pattern
vote_counts_per_game['round_expired'] = [pd.to_datetime(vc_message.split("expired on ",1)[1][:19])
                      if ("expired on " in vc_message) else pd.NaT
                      for vc_message in vote_counts.content]

vote_counts_per_game.head()      

Unnamed: 0_level_0,content,inserted_at,post_no,round_expired
game_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
71796,Welcome to Mini Normal 1911 \nPenguin Mafia R...,2017-05-12 11:45:00,0,NaT
71796,"Vote Count 1.01\nNot Voting (13): Transcend, L...",2017-05-12 19:49:00,4,2017-05-26 16:00:00
71796,"Vote Count 1.02\nGamma Emerald (2): Transcend,...",2017-05-12 20:40:00,25,2017-05-26 16:00:00
71796,"Vote Count 1.03\nMulch (3): Agent Sparkles, Ry...",2017-05-13 00:39:00,112,2017-05-26 16:00:00
71796,Vote Count 1.04\nAgent Sparkles (2): Transcend...,2017-05-13 01:43:00,153,2017-05-26 16:00:00
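
As an alternative to slicing the 19 characters after "expired on ", the timestamp can be captured with a regular expression. The sketch below uses `Series.str.extract` on two made-up posts; it is slightly stricter than the slicing approach because it only accepts text shaped like `YYYY-MM-DD HH:MM:SS`.

```python
import pandas as pd

# Made-up vote-count posts: one with a deadline, one without
toy_vc = pd.Series([
    'Vote Count 1.02\nDeadline: (expired on 2017-05-26 16:00:00)\nMod Notes: ...',
    'Vote Count 1.03\nWith 13 alive, it takes 7 to lynch.',
])

# Capture the timestamp that follows "expired on "; posts without a match give NaN
stamp = toy_vc.str.extract(r'expired on (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', expand=False)
deadline = pd.to_datetime(stamp)   # NaN becomes NaT
print(deadline.tolist())
```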


Based on the added column *round_expired*, a DataFrame containing information about each of the games with conscientious moderators can be built. It gathers the indexes of such games and the post numbers that mark the end of rounds (i.e. the last time a specific timestamp is used as a deadline).

In [21]:
# find all the games in which "expired on " is mentioned at least once
game_expired_mention = np.unique(vote_counts_per_game[~pd.isnull(vote_counts_per_game.round_expired)].index)
# build a DF to gather information about these games
df_expired = pd.DataFrame(game_expired_mention, columns=['game_id']).set_index('game_id')

# add a column with all the different deadlines used by the mod
unique_round_expired = []

for game_id in df_expired.index:
    unique_round_expired.append(np.unique(vote_counts_per_game.loc[game_id].round_expired))
    unique_round_expired[-1] = unique_round_expired[-1][~pd.isnull(unique_round_expired[-1])]
    
df_expired['unique_round_expired'] = unique_round_expired

# select only the games within which moderator stated at least 2 different deadlines
df_expired['nb_rounds'] = [len(game_rounds_ts) for game_rounds_ts in df_expired.unique_round_expired]
df_expired = df_expired[df_expired.nb_rounds>1]

# retrieving the post number associated with the end of a round
df_expired['end_round_post_no'] = [np.array([vote_counts_per_game.loc[game_id][vote_counts_per_game.loc[game_id].round_expired == pd.to_datetime(date)].post_no.iloc[-1]
                                             for date in df_expired.loc[game_id].unique_round_expired])
                                   for game_id in df_expired.index]

df_expired.head()

Unnamed: 0_level_0,unique_round_expired,nb_rounds,end_round_post_no
game_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
17173,"[2011-05-14T12:00:00.000000000, 2011-05-17T19:...",8,"[736, 740, 850, 868, 931, 939, 892, 932]"
17407,"[2011-06-09T23:00:00.000000000, 2011-06-26T02:...",3,"[1012, 1263, 1356]"
17666,"[2011-06-15T16:00:00.000000000, 2011-06-24T10:...",6,"[142, 349, 483, 627, 676, 688]"
17864,"[2011-07-19T17:00:00.000000000, 2011-07-22T17:...",3,"[880, 1018, 1489]"
18287,"[2011-08-01T18:00:00.000000000, 2011-08-03T13:...",6,"[470, 540, 830, 1024, 1060, 1072]"


Within this subset of games, the messages sent by the players can be categorized according to the round in which they were sent. This is done by selecting a reduced DataFrame from `messages_players` and counting, for each player message, the number of round-ending posts that precede it.

In [22]:
# reducing to the games of interest
messages_players_expired = messages_players[messages_players.game_id.isin(df_expired.index)].copy()

# adding the `rounds` column
mess_expired_round = []

for index, row_posts in messages_players_expired.iterrows():
    game_id = row_posts.game_id
    mess_expired_round.append(np.sum(row_posts.post_no > df_expired.loc[game_id].end_round_post_no))

messages_players_expired['rounds'] = mess_expired_round

# display DF
messages_players_expired.head()

Unnamed: 0,game_id,author,post_no,content,inserted_at,post_no_last,betrayer,town_won,rel_time,rounds
197222,17173,Elsa von Spielburg,11,Vote: McGriddle\n\nWhy would you tout the prod...,2011-04-07 20:55:00,939,False,True,0.046744,0
197223,17173,Elsa von Spielburg,23,"Really, Occult (and others), how often does th...",2011-04-08 17:28:00,939,False,True,0.059936,0
197224,17173,Elsa von Spielburg,30,"\n\nAgreed, the RVS is largely useless until s...",2011-04-08 21:05:00,939,False,True,0.062257,0
197225,17173,Elsa von Spielburg,31,"\n\nWho is the ""he"" in this sentence. I don't...",2011-04-08 21:08:00,939,False,True,0.062289,0
197226,17173,Elsa von Spielburg,42,"\n\nYeah, I'm wondering how being the 2nd vote...",2011-04-09 02:33:00,939,False,True,0.065767,0
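
The round of a message is simply the number of round-ending posts with a smaller post number. A minimal sketch with made-up post numbers (the helper `round_of` is hypothetical):

```python
import numpy as np

# Post numbers closing each round of a hypothetical game
end_round_post_no = np.array([142, 349, 483])

def round_of(post_no, round_ends):
    """Number of round-closing posts that come strictly before this post."""
    return int(np.sum(post_no > round_ends))

rounds = [round_of(p, end_round_post_no) for p in (11, 142, 143, 400, 900)]
print(rounds)
```

Because this is a count rather than a position lookup, it does not require the round-ending post numbers to be sorted.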


We now have two dataframes: one containing 231 games whose messages are separated by round, and another containing 637 games whose messages carry only a relative time telling when they were sent during the game.

In [23]:
print(len(np.unique(messages_players_expired.game_id)))

231


In [24]:
# Number of games without any survivor (np.NaN was removed in NumPy 2.0, np.nan works everywhere)
town_won.count(np.nan)

13

In [25]:
np.save('messages_players', messages_players)
np.save('messages_players_expired', messages_players_expired)
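
Note that `np.save` converts a DataFrame to a plain array, so the column names are not stored with it. If the labels are needed when reloading, one alternative (an assumption, not what this notebook does) is pandas' own pickle round-trip, sketched here with a made-up row and a hypothetical file name.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'game_id': [26], 'author': ['CoolBot'], 'betrayer': [False]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_messages.pkl')   # hypothetical file name
    toy.to_pickle(path)               # stores values, dtypes and column labels
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```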

---

## Frameworks

### Politeness  

In [26]:
# Politeness framework

### Talkativeness

In [27]:
# Talkativeness framework

### Wrapping up

In [28]:
# Using the API

---
## Analysis

In [29]:
# Something something what are the conclusions?