<center><h1 style="font-size:3em"> Graph2Speak </h1></center>
<center><h3> Improving Speaker Identification using Network Knowledge in Criminal Conversational Data </h3></center>

Paper: https://arxiv.org/abs/2006.02093

*Maël Fabien, Seyyed Saeed Sarfjoo, Petr Motlicek, Srikanth Madikeri*

In [3]:
# General
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set of functions
from src.utils import *

In [4]:
episode = "s01e23"

In [5]:
dict_spk, spk_dict, spk_coord = ep_dicts(episode)
dict_spk, spk_dict, spk_coord

({'maleshopper': '1001_csi',
  'bradwalden': '1002_csi',
  'hunterbaumgartner': '1003_csi',
  'gregsanders': '1004_csi',
  'brass': '1005_csi',
  'sydgoggle': '1006_csi',
  'dralbertrobbins': '1007_csi',
  'sheriffbrianmobley': '1008_csi',
  'warrick': '1009_csi',
  'nick': '1010_csi',
  'sara': '1011_csi',
  'catherine': '1012_csi',
  'agentrickculpepper': '1013_csi',
  'grissom': '1014_csi'},
 {'1001_csi': 'maleshopper',
  '1002_csi': 'bradwalden',
  '1003_csi': 'hunterbaumgartner',
  '1004_csi': 'gregsanders',
  '1005_csi': 'brass',
  '1006_csi': 'sydgoggle',
  '1007_csi': 'dralbertrobbins',
  '1008_csi': 'sheriffbrianmobley',
  '1009_csi': 'warrick',
  '1010_csi': 'nick',
  '1011_csi': 'sara',
  '1012_csi': 'catherine',
  '1013_csi': 'agentrickculpepper',
  '1014_csi': 'grissom'},
 {'maleshopper': [50, 50],
  'bradwalden': [50, 100],
  'hunterbaumgartner': [50, 150],
  'gregsanders': [100, 50],
  'brass': [100, 100],
  'sydgoggle': [100, 150],
  'dralbertrobbins': [150, 50],
  'she

# I. Ground truth

In [6]:
truth_events = pd.read_csv("src/graph_input/all_events_%s.csv"%episode)
truth_events = truth_events[['speaker', 'conv']].drop_duplicates().dropna()
truth_events['speaker'] = truth_events['speaker'].apply(lambda x: x.replace("/", "").replace(".", "").replace("'", ""))
truth_events.head()

Unnamed: 0,speaker,conv
0,brass,1.0
17,grissom,1.0
120,sara,2.0
122,brass,2.0
126,grissom,2.0


In [7]:
# Read the speakers to keep, stripping newlines, dots and apostrophes from each name
with open("src/speaker_id_input/%s.txt" % episode, "r") as f:
    list_spk_keep = [
        line.replace("\n", "").replace(".", "").replace("'", "")
        for line in f
    ]

In [8]:
truth_events = truth_events[truth_events['speaker'].isin(list_spk_keep)]

In [10]:
G, plot = build_graph(truth_events, "conv", "speaker", "truth", episode, spk_coord)
plot
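The body of `build_graph` lives in `src.utils` and is not shown in this notebook. As a rough illustration of the idea behind the conversation graph, the sketch below counts how many conversations each pair of speakers shares. The helper name `conversation_edges` and the toy data are assumptions made for this example, not the notebook's actual implementation.

```python
from collections import Counter
from itertools import combinations

import pandas as pd


def conversation_edges(events, conv_col, spk_col):
    """Count, for each pair of speakers, the conversations they share.

    Hypothetical helper: the real build_graph may weight or filter edges differently.
    """
    edges = Counter()
    for _, group in events.groupby(conv_col):
        # Sort so that each pair is always stored in the same order
        speakers = sorted(group[spk_col].unique())
        for pair in combinations(speakers, 2):
            edges[pair] += 1
    return edges


# Toy data mirroring the first rows of truth_events
toy_events = pd.DataFrame({
    "speaker": ["brass", "grissom", "sara", "brass", "grissom"],
    "conv": [1.0, 1.0, 2.0, 2.0, 2.0],
})
toy_edges = conversation_edges(toy_events, "conv", "speaker")
```

Each key of `toy_edges` is an edge of the graph and each value can serve as an edge weight.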

# II. Speaker ID Prediction

Benchmark speaker identification accuracy from Kaldi, per episode:

In [11]:
perf_s01e07 = 0.916
perf_s01e08 = 0.919
perf_s01e19 = 0.579
perf_s01e20 = 0.746
perf_s01e23 = 0.686
perf_s02e01 = 0.880
perf_s02e04 = 0.894

We need two dataframes here: `pred`, which holds the score of every speaker model against every file, and `winners`, which keeps the highest-scoring model for each file, i.e. the Speaker ID prediction:

In [12]:
pred = get_all_pred_scores("src/speaker_id_output/scores_%s/csi_test_unique_scores"%episode, spk_dict)
pred.head()

Unnamed: 0,Model,File,Truth,Conv,Score
16,maleshopper,brass_Conv1,brass,1,-15.6086
137,bradwalden,brass_Conv1,brass,1,20.92868
258,hunterbaumgartner,brass_Conv1,brass,1,-16.20695
379,gregsanders,brass_Conv1,brass,1,-44.48259
500,brass,brass_Conv1,brass,1,62.38687


In [13]:
winners = get_pred_speakers(pred)
winners.head()

Unnamed: 0,Pred,Truth,Conv
0,brass,brass,1
1,grissom,grissom,1
2,dralbertrobbins,dralbertrobbins,10
3,dralbertrobbins,grissom,10
4,nick,nick,10
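Since `get_pred_speakers` is defined in `src.utils` and not shown here, the following is only a sketch of how the highest-scoring model per file could be selected with pandas. The function name `pick_winners` is hypothetical, and the column names are taken from the `pred` output above.

```python
import pandas as pd


def pick_winners(pred):
    """Keep, for each file, the model with the highest score (illustrative only)."""
    # idxmax returns the row label of the best score within each file
    best = pred.loc[pred.groupby("File")["Score"].idxmax()]
    best = best.rename(columns={"Model": "Pred"})
    return best[["Pred", "Truth", "Conv"]].reset_index(drop=True)


# Three of the scores shown above for the file brass_Conv1
toy_pred = pd.DataFrame({
    "Model": ["maleshopper", "bradwalden", "brass"],
    "File": ["brass_Conv1"] * 3,
    "Truth": ["brass"] * 3,
    "Conv": [1, 1, 1],
    "Score": [-15.6086, 20.92868, 62.38687],
})
toy_winners = pick_winners(toy_pred)
```

On this toy input the selected model is `brass`, in line with the first row of `winners`.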


Re-compute the speaker accuracy, which should match the Kaldi benchmark for this episode (0.686):

In [14]:
speaker_accuracy(winners)

0.6859504132231405
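One plausible reading of this number is the share of files whose predicted speaker equals the true speaker. The sketch below assumes that definition; `accuracy_from_winners` is a hypothetical name, and the actual `speaker_accuracy` in `src.utils` may differ.

```python
import pandas as pd


def accuracy_from_winners(winners):
    """Fraction of rows where the predicted speaker matches the truth."""
    return float((winners["Pred"] == winners["Truth"]).mean())


# The five rows shown in winners.head()
sample_winners = pd.DataFrame({
    "Pred": ["brass", "grissom", "dralbertrobbins", "dralbertrobbins", "nick"],
    "Truth": ["brass", "grissom", "dralbertrobbins", "grissom", "nick"],
    "Conv": [1, 1, 10, 10, 10],
})
sample_accuracy = accuracy_from_winners(sample_winners)  # 4 of the 5 rows match
```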

And plot the predicted network:

In [15]:
G_pred, plot_pred = build_graph(winners, "Conv", "Pred", "pred", episode, spk_coord)
plot_pred

# III. Improving Speaker Identification using Network Knowledge

We again need two datasets: `cand`, which lists all candidates for each conversation, and `score_sup`, which keeps only the candidates from `pred` whose score is above a given threshold:

In [16]:
cand = build_candidates(pred)
cand.head()

Unnamed: 0,Conv,NumChar,Conversation,Truth,Candidate,Score
0,1,2,1_brass,"[brass, grissom]","[maleshopper, bradwalden, hunterbaumgartner, b...","[-15.6086, 20.92868, -16.20695, 62.38687, 21.7..."
11,2,3,2_brass,"[brass, grissom, sara]","[maleshopper, bradwalden, hunterbaumgartner, g...","[-34.3393, -28.01566, -10.51414, -28.32325, -3..."
30,4,2,4_grissom,"[grissom, sara]","[maleshopper, bradwalden, hunterbaumgartner, g...","[-23.39263, -28.59703, -16.32076, -7.717767, -..."
41,5,4,5_catherine,"[catherine, grissom, sara, warrick]","[maleshopper, bradwalden, hunterbaumgartner, g...","[-0.05778657, -32.82574, 7.806642, 6.962511, -..."
46,6,2,6_grissom,"[grissom, sheriffbrianmobley]","[maleshopper, bradwalden, hunterbaumgartner, g...","[-0.8141875, 6.923984, -10.56885, -8.876238, -..."


In [17]:
score_sup = keep_higher_scores(pred, threshold=-15)
score_sup.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  score_sup["Conv"] = score_sup["Conv"].astype(int)


Unnamed: 0,Model,File,Truth,Conv,Score
0,bradwalden,brass_Conv1,brass,1,20.92868
1,brass,brass_Conv1,brass,1,62.38687
2,sydgoggle,brass_Conv1,brass,1,21.74345
3,dralbertrobbins,brass_Conv1,brass,1,23.83234
4,sheriffbrianmobley,brass_Conv1,brass,1,6.081506
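The thresholding step can be sketched as a simple boolean filter. The version below is an assumption about what `keep_higher_scores` does (its source is not shown), and the strict `>` comparison is a guess. Taking an explicit `.copy()` of the filtered frame before assigning to a column is the usual way to avoid the pandas warning printed above.

```python
import pandas as pd


def filter_by_threshold(pred, threshold=-15):
    """Keep the rows whose score exceeds the threshold (illustrative only)."""
    # .copy() makes the result an independent frame, so the assignment
    # below does not trigger SettingWithCopyWarning
    kept = pred[pred["Score"] > threshold].copy()
    kept["Conv"] = kept["Conv"].astype(int)
    return kept.reset_index(drop=True)


# The five rows shown in pred.head(), with Conv stored as strings
toy_scores = pd.DataFrame({
    "Model": ["maleshopper", "bradwalden", "hunterbaumgartner", "gregsanders", "brass"],
    "File": ["brass_Conv1"] * 5,
    "Truth": ["brass"] * 5,
    "Conv": ["1"] * 5,
    "Score": [-15.6086, 20.92868, -16.20695, -44.48259, 62.38687],
})
toy_kept = filter_by_threshold(toy_scores, threshold=-15)
```

With a threshold of -15, only `bradwalden` and `brass` remain from these five rows, which agrees with the first two rows of `score_sup`.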


In [18]:
df_res, G_rank, trace_conv = rerank_graph(score_sup, winners, cand, threshold=-15)

Conversation 1 out of 54
Conversation 2 out of 54
Conversation 4 out of 54
Conversation 5 out of 54
Conversation 6 out of 54
Conversation 7 out of 54
Conversation 8 out of 54
Conversation 9 out of 54
Conversation 10 out of 54
Conversation 11 out of 54
Conversation 12 out of 54
Conversation 13 out of 54
Conversation 14 out of 54
Conversation 15 out of 54
Conversation 16 out of 54
Conversation 17 out of 54
Conversation 18 out of 54
Conversation 19 out of 54
Conversation 20 out of 54
Conversation 21 out of 54
Conversation 22 out of 54
Conversation 23 out of 54
Conversation 24 out of 54
Conversation 25 out of 54
Conversation 27 out of 54
Conversation 28 out of 54
Conversation 29 out of 54
Conversation 30 out of 54
Conversation 31 out of 54
Conversation 32 out of 54
Conversation 33 out of 54
Conversation 35 out of 54
Conversation 37 out of 54
Conversation 38 out of 54
Conversation 39 out of 54
Conversation 40 out of 54
Conversation 41 out of 54
Conversation 42 out of 54
Conversation 43 out 

Where do the graph-enhanced predictions differ from the original predictions?

In [19]:
df_res[df_res['GaphEnhance'] != df_res['Prediction']]

Unnamed: 0,Conv,GaphEnhance,Truth,Prediction
1,2,"[brass, grissom, hunterbaumgartner]","[brass, grissom, sara]","[grissom, hunterbaumgartner, sara]"
3,5,"[brass, grissom, grissom, hunterbaumgartner]","[catherine, grissom, sara, warrick]","[catherine, grissom, grissom, warrick]"
5,7,"[agentrickculpepper, brass, grissom]","[agentrickculpepper, grissom, sheriffbrianmobley]","[agentrickculpepper, agentrickculpepper, grissom]"
8,10,"[brass, grissom, grissom]","[dralbertrobbins, grissom, nick]","[dralbertrobbins, dralbertrobbins, nick]"
9,11,"[brass, grissom]","[dralbertrobbins, grissom]","[dralbertrobbins, grissom]"
14,16,"[brass, grissom, grissom]","[agentrickculpepper, grissom, sara]","[agentrickculpepper, grissom, grissom]"
19,21,"[brass, grissom]","[brass, grissom]","[brass, brass]"
20,22,"[brass, grissom, sara]","[agentrickculpepper, grissom, sara]","[agentrickculpepper, agentrickculpepper, sara]"
24,27,"[maleshopper, sara]","[agentrickculpepper, maleshopper, sara]","[maleshopper, nick, sara]"
26,29,"[agentrickculpepper, brass, grissom, nick]","[agentrickculpepper, grissom, maleshopper, sara]","[agentrickculpepper, grissom, hunterbaumgartne..."


### Conversation accuracy

In [20]:
conversation_accuracy(df_res, "Prediction")

0.40816326530612246

In [21]:
conversation_accuracy(df_res, "GaphEnhance")

0.3673469387755102
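The definition of `conversation_accuracy` is not shown in this notebook. A reasonable guess is that a conversation counts as correct only when its predicted speaker list matches the true list exactly. The sketch below implements that assumed definition under the hypothetical name `exact_match_rate`.

```python
import pandas as pd


def exact_match_rate(df, col, truth_col="Truth"):
    """Share of conversations whose predicted speakers equal the true ones.

    Assumed definition: lists are compared after sorting, duplicates included.
    """
    matches = [sorted(p) == sorted(t) for p, t in zip(df[col], df[truth_col])]
    return sum(matches) / len(matches)


# Conversations 1, 2 and 21 as they appear in the outputs above
toy_res = pd.DataFrame({
    "Conv": [1, 2, 21],
    "Truth": [
        ["brass", "grissom"],
        ["brass", "grissom", "sara"],
        ["brass", "grissom"],
    ],
    "Prediction": [
        ["brass", "grissom"],
        ["grissom", "hunterbaumgartner", "sara"],
        ["brass", "brass"],
    ],
})
toy_rate = exact_match_rate(toy_res, "Prediction")  # only conversation 1 matches
```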

### Speaker accuracy

In [22]:
final_speaker_accuracy(df_res, "Prediction")

0.7142857142857143

In [23]:
final_speaker_accuracy(df_res, "GaphEnhance")

0.6386554621848739

### Final Network

In [24]:
plot_rank = final_graph(G_rank, trace_conv, episode, spk_coord)
plot_rank