<center><h1 style="font-size:3em"> Graph2Speak </h1></center>
<center><h3> Improving Speaker Identification using Network Knowledge in Criminal Conversational Data </h3></center>

Paper: https://arxiv.org/abs/2006.02093

*Maël Fabien, Seyyed Saeed Sarfjoo, Petr Motlicek, Srikanth Madikeri*

In [1]:
# General
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set of functions
from src.utils import *

In [2]:
episode = "s02e09"

In [3]:
dict_spk, spk_dict, spk_coord = ep_dicts(episode)
dict_spk, spk_dict, spk_coord

({'robbins': '1001_csi',
  'officerspencer': '1002_csi',
  'greg': '1003_csi',
  'bobbydawson': '1004_csi',
  'maxduncan': '1005_csi',
  'warrick': '1006_csi',
  'nick': '1007_csi',
  'brass': '1008_csi',
  'sara': '1009_csi',
  'grissom': '1010_csi',
  'catherine': '1011_csi'},
 {'1001_csi': 'robbins',
  '1002_csi': 'officerspencer',
  '1003_csi': 'greg',
  '1004_csi': 'bobbydawson',
  '1005_csi': 'maxduncan',
  '1006_csi': 'warrick',
  '1007_csi': 'nick',
  '1008_csi': 'brass',
  '1009_csi': 'sara',
  '1010_csi': 'grissom',
  '1011_csi': 'catherine'},
 {'robbins': [50, 50],
  'officerspencer': [50, 100],
  'greg': [50, 150],
  'bobbydawson': [100, 50],
  'maxduncan': [100, 100],
  'warrick': [100, 150],
  'nick': [150, 50],
  'brass': [150, 100],
  'sara': [150, 150],
  'grissom': [200, 50],
  'catherine': [200, 100]})

# I. Ground truth

In [4]:
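# Load the annotated events, dropping duplicate and incomplete rows
truth_events = pd.read_csv("src/graph_input/all_events_%s.csv"%episode).drop_duplicates().dropna()

# Number of events in each conversation, keyed by integer conversation id
dict_len = {int(c): int(n) for c, n in truth_events.groupby('conv').size().items()}

# Keep only the speaker and conversation columns (copy so the assignment below is safe)
truth_events = truth_events[['speaker', 'conv']].copy()
# Clean speaker names by removing "/", "." and "'"
truth_events['speaker'] = truth_events['speaker'].apply(lambda x: x.replace("/", "").replace(".", "").replace("'", ""))
truth_events.head()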

Unnamed: 0,speaker,conv
0,craps_player,0.0
1,craps_player,0.0
2,craps_player,0.0
3,craps_player,0.0
4,craps_player,0.0


In [5]:
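# Read the list of speakers to keep, one name per line
list_spk_keep = []

with open("src/speaker_id_input/%s.txt"%episode, "r") as f:
    for line in f:
        # Strip the newline, "." and "'" so names match the cleaned ground-truth speakers
        list_spk_keep.append(line.replace("\n", "").replace(".", "").replace("'", ""))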

In [6]:
truth_events = truth_events[truth_events['speaker'].isin(list_spk_keep)]

In [7]:
G, plot = build_graph(truth_events, "conv", "speaker", "truth", episode, spk_coord)
plot
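
`build_graph` is defined in `src.utils` and its internals are not shown in this notebook. As a rough illustration only, assuming that two speakers are linked whenever they appear in the same conversation and that the edge weight is the number of conversations they share, the construction could be sketched as below. `cooccurrence_edges` is a hypothetical helper, not a function from the repository, and the events are toy data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(events):
    """Count, for each pair of speakers, the conversations they share.

    `events` is an iterable of (speaker, conversation_id) pairs.
    Hypothetical helper: an assumption about how a conversation graph
    could be built, not the actual `build_graph` implementation.
    """
    speakers_by_conv = {}
    for speaker, conv in events:
        speakers_by_conv.setdefault(conv, set()).add(speaker)

    edges = Counter()
    for speakers in speakers_by_conv.values():
        # Sort so that (a, b) and (b, a) map to the same undirected edge.
        for pair in combinations(sorted(speakers), 2):
            edges[pair] += 1
    return edges


# Toy events, not taken from the episode.
toy_events = [
    ("brass", 1), ("grissom", 1), ("brass", 1),
    ("brass", 2), ("grissom", 2), ("nick", 2),
]
toy_edges = cooccurrence_edges(toy_events)
print(toy_edges)
```

Repeated turns by the same speaker inside one conversation are collapsed by the `set`, so they do not inflate the edge weight in this sketch.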

# II. Speaker ID Prediction

Benchmark performance from Kaldi:

In [8]:
perf_s01e07 = 0.916
perf_s01e08 = 0.919
perf_s01e19 = 0.579
perf_s01e20 = 0.746
perf_s01e23 = 0.686
perf_s02e01 = 0.880
perf_s02e04 = 0.894
perf_s02e06 = 0.855

We need two dataframes here: one with the score of every speaker model against each test file, and one with the highest-scoring model for each file, which is the speaker ID prediction:

In [9]:
pred = get_all_pred_scores("src/speaker_id_output/scores_%s/csi_test_unique_scores"%episode, spk_dict)
pred.head()

Unnamed: 0,Model,File,Truth,Conv,Score
2,robbins,brass_Conv12,brass,12,-25.86888
107,officerspencer,brass_Conv12,brass,12,-3.495802
212,greg,brass_Conv12,brass,12,-28.78191
317,bobbydawson,brass_Conv12,brass,12,-34.91832
422,maxduncan,brass_Conv12,brass,12,-24.53049


In [10]:
winners = get_pred_speakers(pred)
winners.head()

Unnamed: 0,Pred,Truth,Conv
0,brass,brass,12
1,brass,brass,13
2,grissom,grissom,13
3,nick,nick,14
4,grissom,grissom,15
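
The step from the full score table to one prediction per file is an arg-max over the speaker models. The sketch below shows one way to do this with pandas on a toy table laid out like `pred`. It assumes the prediction is simply the model with the highest score for each file; the scores are made up, and `toy_pred` and `toy_winners` are illustrative names rather than outputs of `get_pred_speakers`.

```python
import pandas as pd

# Toy scores in the same layout as `pred` (values are made up).
toy_pred = pd.DataFrame({
    "Model": ["brass", "grissom", "brass", "grissom"],
    "File": ["brass_Conv1", "brass_Conv1", "grissom_Conv1", "grissom_Conv1"],
    "Truth": ["brass", "brass", "grissom", "grissom"],
    "Conv": [1, 1, 1, 1],
    "Score": [12.5, -3.0, -8.0, 20.1],
})

# For each test file, keep the row whose model has the highest score.
best_rows = toy_pred.loc[toy_pred.groupby("File")["Score"].idxmax()]

# Rename the winning model to "Pred" and keep the columns used downstream.
toy_winners = (
    best_rows.rename(columns={"Model": "Pred"})[["Pred", "Truth", "Conv"]]
    .reset_index(drop=True)
)
print(toy_winners)
```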


Re-compute the speaker accuracy:

In [11]:
speaker_accuracy(winners)

0.9142857142857143
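
For reference, a speaker accuracy of this kind can be computed as the fraction of rows where the predicted speaker equals the true speaker. The sketch below assumes that definition; `toy_speaker_accuracy` is a hypothetical stand-in and may differ from `speaker_accuracy` in `src.utils`. The data is a toy example.

```python
import pandas as pd


def toy_speaker_accuracy(df):
    """Fraction of rows whose predicted speaker equals the true speaker."""
    return float((df["Pred"] == df["Truth"]).mean())


# Toy table in the same layout as `winners` (the last row is deliberately wrong).
toy = pd.DataFrame({
    "Pred": ["brass", "brass", "grissom", "nick", "sara"],
    "Truth": ["brass", "brass", "grissom", "nick", "grissom"],
    "Conv": [12, 13, 13, 14, 15],
})
toy_acc = toy_speaker_accuracy(toy)
print(toy_acc)  # 4 correct rows out of 5
```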

And plot the predicted network:

In [12]:
G_pred, plot_pred = build_graph(winners, "Conv", "Pred", "pred", episode, spk_coord)
plot_pred

# III. Improving Speaker Identification using Network Knowledge

We again need two datasets: one listing all candidate speakers for each conversation, and one keeping only the rows of `pred` whose score is above a given threshold:

In [13]:
cand = build_candidates(pred)
cand.head()

Unnamed: 0,Conv,NumChar,Conversation,Truth,Candidate,Score
8,2,2,2_brass,"[brass, grissom]","[robbins, officerspencer, greg, maxduncan, war...","[-12.80708, -8.590842, -22.11562, -2.5491, -3...."
29,4,2,4_nick,"[nick, warrick]","[robbins, officerspencer, greg, bobbydawson, m...","[-31.71048, -24.56566, -6.259256, -31.78416, 1..."
39,5,2,5_brass,"[brass, grissom]","[robbins, officerspencer, greg, bobbydawson, m...","[-15.6182, -12.4353, -18.4734, -24.5174, 10.13..."
41,6,3,6_brass,"[brass, grissom, maxduncan]","[robbins, officerspencer, greg, bobbydawson, m...","[-12.02583, -16.58757, -25.1794, -39.18103, -7..."
42,8,3,8_catherine,"[catherine, officerspencer, sara]","[sara, catherine, robbins, officerspencer, gre...","[-6.81527, 54.47673, -12.16429, 72.06229, -36...."


In [22]:
score_sup = keep_higher_scores(pred, threshold=-25)
score_sup.head()



Unnamed: 0,Model,File,Truth,Conv,Score
0,robbins,brass_Conv2,brass,2,-12.80708
1,officerspencer,brass_Conv2,brass,2,-8.590842
2,greg,brass_Conv2,brass,2,-22.11562
3,maxduncan,brass_Conv2,brass,2,-2.5491
4,warrick,brass_Conv2,brass,2,-3.484541
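
Filtering a DataFrame and then assigning to a column of the result is a common source of pandas' `SettingWithCopyWarning`; taking an explicit `.copy()` of the filtered rows avoids it. The sketch below shows a threshold filter written that way. `toy_keep_higher_scores` is a hypothetical helper, and the strict `>` comparison is an assumption; the actual `keep_higher_scores` may treat the boundary differently.

```python
import pandas as pd


def toy_keep_higher_scores(scores, threshold):
    """Keep rows with Score strictly above `threshold` (hypothetical helper)."""
    # .copy() makes the result an independent DataFrame, so the column
    # assignment below does not write into a slice of `scores`.
    kept = scores[scores["Score"] > threshold].copy()
    kept["Conv"] = kept["Conv"].astype(int)
    return kept.reset_index(drop=True)


# Toy scores (made up); "Conv" is stored as text to show the cast.
toy_scores = pd.DataFrame({
    "Model": ["robbins", "bobbydawson", "maxduncan"],
    "File": ["brass_Conv2"] * 3,
    "Truth": ["brass"] * 3,
    "Conv": ["2", "2", "2"],
    "Score": [-12.8, -30.0, -2.5],
})
toy_kept = toy_keep_higher_scores(toy_scores, threshold=-25)
print(toy_kept)
```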


In [23]:
df_res, G_rank, trace_conv = rerank_graph(score_sup, winners, cand, dict_len, threshold=-15)

Conversation 2 out of 50
Conversation 4 out of 50
Conversation 5 out of 50
Conversation 6 out of 50
Conversation 8 out of 50
Conversation 9 out of 50
Conversation 12 out of 50
Conversation 13 out of 50
Conversation 14 out of 50
Conversation 15 out of 50
Conversation 16 out of 50
Conversation 17 out of 50
Conversation 18 out of 50
Conversation 19 out of 50
Conversation 20 out of 50
Conversation 21 out of 50
Conversation 22 out of 50
Conversation 23 out of 50
Conversation 24 out of 50
Conversation 25 out of 50
Conversation 26 out of 50
Conversation 27 out of 50
Conversation 28 out of 50
Conversation 29 out of 50
Conversation 30 out of 50
Conversation 31 out of 50
Conversation 32 out of 50
Conversation 33 out of 50
Conversation 34 out of 50
Conversation 35 out of 50
Conversation 36 out of 50
Conversation 37 out of 50
Conversation 38 out of 50
Conversation 39 out of 50
Conversation 40 out of 50
Conversation 41 out of 50
Conversation 42 out of 50
Conversation 43 out of 50
Conversation 45 ou

Where do the graph-enhanced predictions differ from the baseline predictions?

In [24]:
df_res[df_res['GaphEnhance'] != df_res['Prediction']]

Unnamed: 0,Conv,GaphEnhance,Truth,Prediction
3,6,"[brass, brass, grissom]","[brass, grissom, maxduncan]","[brass, grissom, maxduncan]"
22,28,"[brass, grissom, nick]","[brass, catherine, grissom, nick]","[brass, catherine, grissom, nick]"
28,34,"[greg, grissom, nick, warrick]","[greg, grissom, nick, warrick]","[greg, grissom, maxduncan, warrick]"
30,36,"[catherine, catherine, grissom, sara]","[catherine, grissom, nick, sara]","[bobbydawson, catherine, grissom, sara]"
38,45,[],[catherine],[sara]


### Conversation accuracy

In [25]:
conversation_accuracy(df_res, "Prediction")

0.7954545454545454

In [26]:
conversation_accuracy(df_res, "GaphEnhance")

0.7727272727272727
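
One plausible reading of conversation accuracy is the share of conversations whose predicted speaker list matches the ground truth exactly. The sketch below assumes that definition; `toy_conversation_accuracy` is a hypothetical helper and not necessarily how `conversation_accuracy` is implemented. The three rows are conversations 6, 28 and 34 from the comparison table above.

```python
def toy_conversation_accuracy(rows, column):
    """Share of conversations whose speaker list in `column` equals the truth."""
    hits = sum(sorted(row[column]) == sorted(row["Truth"]) for row in rows)
    return hits / len(rows)


# Conversations 6, 28 and 34, copied from the comparison table.
toy_rows = [
    {"Conv": 6,
     "GaphEnhance": ["brass", "brass", "grissom"],
     "Truth": ["brass", "grissom", "maxduncan"],
     "Prediction": ["brass", "grissom", "maxduncan"]},
    {"Conv": 28,
     "GaphEnhance": ["brass", "grissom", "nick"],
     "Truth": ["brass", "catherine", "grissom", "nick"],
     "Prediction": ["brass", "catherine", "grissom", "nick"]},
    {"Conv": 34,
     "GaphEnhance": ["greg", "grissom", "nick", "warrick"],
     "Truth": ["greg", "grissom", "nick", "warrick"],
     "Prediction": ["greg", "grissom", "maxduncan", "warrick"]},
]

acc_pred = toy_conversation_accuracy(toy_rows, "Prediction")
acc_graph = toy_conversation_accuracy(toy_rows, "GaphEnhance")
print(acc_pred, acc_graph)
```

Under this exact-match assumption, the baseline is right on two of the five differing conversations and the graph-enhanced output on one, which is consistent with the one-conversation gap between the two values reported above.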

### Speaker accuracy

In [27]:
final_speaker_accuracy(df_res, "Prediction")

0.9142857142857143

In [28]:
final_speaker_accuracy(df_res, "GaphEnhance")

0.9047619047619048

### Final Network

In [29]:
plot_rank = final_graph(G_rank, trace_conv, episode, spk_coord)
plot_rank