<center><h1 style="font-size:3em"> Graph2Speak </h1></center>
<center><h3> Improving Speaker Identification using Network Knowledge in Criminal Conversational Data </h3></center>

Paper: https://arxiv.org/abs/2006.02093

*Maël Fabien, Seyyed Saeed Sarfjoo, Petr Motlicek, Srikanth Madikeri*

In [1]:
# General
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

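# Helper functions used throughout this notebook
from src.utils import (
    ep_dicts, build_graph, get_all_pred_scores, get_pred_speakers,
    speaker_accuracy, build_candidates, keep_higher_scores, rerank_graph,
    conversation_accuracy, final_speaker_accuracy, final_graph,
)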

In [2]:
episode = "s02e06"

In [3]:
dict_spk, spk_dict, spk_coord = ep_dicts(episode)
dict_spk, spk_dict, spk_coord

({'roger': '1001_csi',
  'robinchilds': '1002_csi',
  'rogerjennings': '1003_csi',
  'sgtoriley': '1004_csi',
  'kimmarita': '1005_csi',
  'tinakolas': '1006_csi',
  'robbins': '1007_csi',
  'benjaminjennings': '1008_csi',
  'warrick': '1009_csi',
  'brass': '1010_csi',
  'fatherpowell': '1011_csi',
  'nick': '1012_csi',
  'sara': '1013_csi',
  'catherine': '1014_csi',
  'grissom': '1015_csi'},
 {'1001_csi': 'roger',
  '1002_csi': 'robinchilds',
  '1003_csi': 'rogerjennings',
  '1004_csi': 'sgtoriley',
  '1005_csi': 'kimmarita',
  '1006_csi': 'tinakolas',
  '1007_csi': 'robbins',
  '1008_csi': 'benjaminjennings',
  '1009_csi': 'warrick',
  '1010_csi': 'brass',
  '1011_csi': 'fatherpowell',
  '1012_csi': 'nick',
  '1013_csi': 'sara',
  '1014_csi': 'catherine',
  '1015_csi': 'grissom'},
 {'roger': [50, 50],
  'robinchilds': [50, 100],
  'rogerjennings': [50, 150],
  'sgtoriley': [100, 50],
  'kimmarita': [100, 100],
  'tinakolas': [100, 150],
  'robbins': [150, 50],
  'benjaminjennings':

# I. Ground truth

In [4]:
truth_events = pd.read_csv("src/graph_input/all_events_%s.csv"%episode)
truth_events = truth_events[['speaker', 'conv']].drop_duplicates().dropna()
truth_events['speaker'] = truth_events['speaker'].apply(lambda x: x.replace("/", "").replace(".", "").replace("'", ""))
truth_events.head()

Unnamed: 0,speaker,conv
0,benjaminjennings,0.0
2,grissom,1.0
7,sgt_oriley,1.0
12,nick,2.0
29,grissom,2.0


In [5]:
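list_spk_keep = []

# Read the speakers to keep, stripping newlines, periods and apostrophes;
# the context manager closes the file once the loop is done
with open("src/speaker_id_input/%s.txt" % episode, "r") as f:
    for line in f:
        list_spk_keep.append(line.replace("\n", "").replace(".", "").replace("'", ""))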

In [6]:
truth_events = truth_events[truth_events['speaker'].isin(list_spk_keep)]

In [7]:
G, plot = build_graph(truth_events, "conv", "speaker", "truth", episode, spk_coord)
plot
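
The internals of `build_graph` live in `src.utils` and are not shown here. As a rough sketch, assuming the graph links two speakers whenever they take part in the same conversation, the edge weights could be counted as below. The helper name `cooccurrence_edges` is hypothetical, and the sample conversations are copied from the `Truth` column of the candidate table shown later in this notebook.

```python
from collections import Counter
from itertools import combinations

# Speakers per conversation, copied from the Truth column of the candidate
# table shown later in this notebook (conversations 1, 2, 4 and 5).
conversations = {
    1: ["grissom", "sgtoriley"],
    2: ["grissom", "nick"],
    4: ["grissom", "nick", "sara"],
    5: ["grissom", "nick", "sara"],
}

def cooccurrence_edges(conversations):
    """Count, for each unordered speaker pair, the conversations they share.

    Hypothetical helper: an assumption about what a conversation graph
    encodes, not the code behind build_graph.
    """
    edges = Counter()
    for speakers in conversations.values():
        # sorted(set(...)) gives each pair a canonical order and ignores repeats
        for a, b in combinations(sorted(set(speakers)), 2):
            edges[(a, b)] += 1
    return edges

edges = cooccurrence_edges(conversations)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight}")
```

Conversations with a single speaker, such as conversation 0, contribute no edge under this assumption.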

# II. Speaker ID Prediction

Benchmark speaker identification accuracy from Kaldi, per episode:

In [8]:
perf_s01e07 = 0.916
perf_s01e08 = 0.919
perf_s01e19 = 0.579
perf_s01e20 = 0.746
perf_s01e23 = 0.686
perf_s02e01 = 0.880
perf_s02e04 = 0.894
perf_s02e06 = 0.855

We need two dataframes here: `pred`, which holds the score of every speaker model against every file, and `winners`, which keeps the highest-scoring model for each file, i.e. the speaker ID prediction:

In [9]:
pred = get_all_pred_scores("src/speaker_id_output/scores_%s/csi_test_unique_scores"%episode, spk_dict)
pred.head()

Unnamed: 0,Model,File,Truth,Conv,Score
0,roger,benjaminjennings_Conv0,benjaminjennings,0,1.871236
124,robinchilds,benjaminjennings_Conv0,benjaminjennings,0,-18.37531
248,rogerjennings,benjaminjennings_Conv0,benjaminjennings,0,-0.110526
372,sgtoriley,benjaminjennings_Conv0,benjaminjennings,0,-27.05018
496,kimmarita,benjaminjennings_Conv0,benjaminjennings,0,-35.71399


In [10]:
winners = get_pred_speakers(pred)
winners.head()

Unnamed: 0,Pred,Truth,Conv
0,roger,benjaminjennings,0
1,grissom,grissom,1
2,sgtoriley,sgtoriley,1
3,catherine,catherine,10
4,kimmarita,kimmarita,10
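
To make the step from `pred` to `winners` concrete, here is a minimal sketch that keeps the highest-scoring model for each file. It uses plain Python rather than the `get_pred_speakers` helper, whose implementation is not shown; the function name `pick_winners` is hypothetical, and the rows are the five scores from the `pred` preview above.

```python
# Rows copied from the `pred` preview above: (Model, File, Truth, Conv, Score).
rows = [
    ("roger", "benjaminjennings_Conv0", "benjaminjennings", 0, 1.871236),
    ("robinchilds", "benjaminjennings_Conv0", "benjaminjennings", 0, -18.37531),
    ("rogerjennings", "benjaminjennings_Conv0", "benjaminjennings", 0, -0.110526),
    ("sgtoriley", "benjaminjennings_Conv0", "benjaminjennings", 0, -27.05018),
    ("kimmarita", "benjaminjennings_Conv0", "benjaminjennings", 0, -35.71399),
]

def pick_winners(rows):
    """Return one (Pred, Truth, Conv) tuple per file: the top-scoring model.

    Hypothetical helper illustrating an argmax per file; it is an assumption
    about get_pred_speakers, not its actual code.
    """
    best = {}
    for model, file, truth, conv, score in rows:
        # Keep the row only if it beats the best score seen so far for this file
        if file not in best or score > best[file][3]:
            best[file] = (model, truth, conv, score)
    return [(model, truth, conv) for model, truth, conv, _ in best.values()]

winners_sketch = pick_winners(rows)
print(winners_sketch)
```

Among these five models, `roger` scores highest for `benjaminjennings_Conv0`, which agrees with the first row of `winners`: the file is attributed to the wrong speaker.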


Re-compute the speaker accuracy, which should match the Kaldi benchmark for this episode (0.855):

In [11]:
speaker_accuracy(winners)

0.8548387096774194
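
As an illustration of what this number measures, the sketch below computes the share of files whose predicted speaker equals the true speaker. This is an assumption about `speaker_accuracy`, whose code is not shown; the function name `accuracy_sketch` is hypothetical, and the pairs are the five rows of the `winners` preview above.

```python
# (Pred, Truth) pairs copied from the `winners` preview above.
preview = [
    ("roger", "benjaminjennings"),
    ("grissom", "grissom"),
    ("sgtoriley", "sgtoriley"),
    ("catherine", "catherine"),
    ("kimmarita", "kimmarita"),
]

def accuracy_sketch(pairs):
    """Fraction of files whose predicted speaker matches the true speaker.

    Hypothetical helper; assumed to mirror what speaker_accuracy reports.
    """
    correct = sum(pred == truth for pred, truth in pairs)
    return correct / len(pairs)

preview_accuracy = accuracy_sketch(preview)
print(preview_accuracy)  # 4 of the 5 preview rows are correct
```

The value on these five rows differs from the full-episode score because the preview covers only a handful of files.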

And plot the predicted network:

In [12]:
G_pred, plot_pred = build_graph(winners, "Conv", "Pred", "pred", episode, spk_coord)
plot_pred

# III. Improving Speaker Identification using Network Knowledge

We again need two datasets: `cand`, which lists all candidate speakers and their scores for each conversation, and `score_sup`, which keeps only the rows of `pred` whose score is above a given threshold:

In [13]:
cand = build_candidates(pred)
cand.head()

Unnamed: 0,Conv,NumChar,Conversation,Truth,Candidate,Score
0,0,1,0_benjaminjennings,[benjaminjennings],"[roger, robinchilds, rogerjennings, sgtoriley,...","[1.871236, -18.37531, -0.1105263, -27.05018, -..."
1,1,2,1_grissom,"[grissom, sgtoriley]","[roger, robinchilds, rogerjennings, sgtoriley,...","[-12.563, -30.62826, -16.33519, -18.48514, -18..."
12,2,2,2_grissom,"[grissom, nick]","[roger, robinchilds, rogerjennings, sgtoriley,...","[-16.22852, -36.40934, -8.166042, -14.3365, -2..."
31,4,3,4_grissom,"[grissom, nick, sara]","[roger, robinchilds, rogerjennings, sgtoriley,...","[1.686985, -34.28004, -13.15392, 0.5959857, -2..."
41,5,3,5_grissom,"[grissom, nick, sara]","[roger, rogerjennings, sgtoriley, tinakolas, r...","[13.90878, -8.934477, 7.126632, -15.86177, 4.5..."


In [14]:
score_sup = keep_higher_scores(pred, threshold=-15)
score_sup.head()



Unnamed: 0,Model,File,Truth,Conv,Score
0,roger,benjaminjennings_Conv0,benjaminjennings,0,1.871236
1,rogerjennings,benjaminjennings_Conv0,benjaminjennings,0,-0.110526
2,tinakolas,benjaminjennings_Conv0,benjaminjennings,0,-14.15123
3,brass,benjaminjennings_Conv0,benjaminjennings,0,-8.804805
4,sara,benjaminjennings_Conv0,benjaminjennings,0,-10.15619
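
The thresholding step can be sketched as a simple filter on the score column. The snippet below is an illustration in plain Python, not the code of `keep_higher_scores`: the function name `filter_by_threshold` is hypothetical, and the strict comparison (`>`) is an assumption, since the notebook does not say whether a score exactly equal to the threshold is kept.

```python
# (Model, Score) pairs for benjaminjennings_Conv0, copied from the `pred` preview.
scores = [
    ("roger", 1.871236),
    ("robinchilds", -18.37531),
    ("rogerjennings", -0.110526),
    ("sgtoriley", -27.05018),
    ("kimmarita", -35.71399),
]

def filter_by_threshold(scores, threshold):
    """Keep only the candidates scoring above the threshold.

    Hypothetical helper; the strict '>' comparison is an assumption.
    """
    return [(model, score) for model, score in scores if score > threshold]

kept = filter_by_threshold(scores, threshold=-15)
print([model for model, _ in kept])
```

With `threshold=-15`, `roger` and `rogerjennings` survive while `robinchilds`, `sgtoriley` and `kimmarita` are dropped, in line with the first rows of `score_sup`.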


In [15]:
df_res, G_rank, trace_conv = rerank_graph(score_sup, winners, cand, threshold=-15)

Conversation 0 out of 58
Conversation 1 out of 58
Conversation 2 out of 58
Conversation 4 out of 58
Conversation 5 out of 58
Conversation 6 out of 58
Conversation 7 out of 58
Conversation 8 out of 58
Conversation 9 out of 58
Conversation 10 out of 58
Conversation 11 out of 58
Conversation 12 out of 58
Conversation 13 out of 58
Conversation 14 out of 58
Conversation 15 out of 58
Conversation 16 out of 58
Conversation 17 out of 58
Conversation 18 out of 58
Conversation 19 out of 58
Conversation 20 out of 58
Conversation 21 out of 58
Conversation 22 out of 58
Conversation 23 out of 58
Conversation 24 out of 58
Conversation 25 out of 58
Conversation 26 out of 58
Conversation 27 out of 58
Conversation 28 out of 58
Conversation 29 out of 58
Conversation 30 out of 58
Conversation 31 out of 58
Conversation 32 out of 58
Conversation 33 out of 58
Conversation 34 out of 58
Conversation 35 out of 58
Conversation 36 out of 58
Conversation 39 out of 58
Conversation 40 out of 58
Conversation 41 out o

Where do the graph-enhanced predictions differ from the original predictions?

In [16]:
df_res[df_res['GaphEnhance'] != df_res['Prediction']]

Unnamed: 0,Conv,GaphEnhance,Truth,Prediction
23,24,"[nick, robinchilds, sara]","[nick, sara, warrick]","[nick, robinchilds, robinchilds]"
28,29,"[brass, catherine, warrick]","[brass, catherine, warrick]","[brass, catherine, sgtoriley]"
34,35,"[roger, sgtoriley]","[roger, sgtoriley]","[brass, roger]"


### Conversation accuracy

In [17]:
conversation_accuracy(df_res, "Prediction")

0.7090909090909091

In [18]:
conversation_accuracy(df_res, "GaphEnhance")

0.7454545454545455
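
One way to read these two scores is as the share of conversations whose full speaker list is recovered. The sketch below applies that assumed definition to the three conversations where the two methods disagree; the function name `exact_match_rate` is hypothetical and the implementation of `conversation_accuracy` may differ.

```python
# The three rows of df_res where GaphEnhance and Prediction differ.
rows = [
    {"Conv": 24,
     "GaphEnhance": ["nick", "robinchilds", "sara"],
     "Truth": ["nick", "sara", "warrick"],
     "Prediction": ["nick", "robinchilds", "robinchilds"]},
    {"Conv": 29,
     "GaphEnhance": ["brass", "catherine", "warrick"],
     "Truth": ["brass", "catherine", "warrick"],
     "Prediction": ["brass", "catherine", "sgtoriley"]},
    {"Conv": 35,
     "GaphEnhance": ["roger", "sgtoriley"],
     "Truth": ["roger", "sgtoriley"],
     "Prediction": ["brass", "roger"]},
]

def exact_match_rate(rows, column):
    """Share of conversations whose speaker list equals the truth, ignoring order.

    Hypothetical helper; an assumed reading of conversation_accuracy.
    """
    hits = sum(sorted(row[column]) == sorted(row["Truth"]) for row in rows)
    return hits / len(rows)

conv_before = exact_match_rate(rows, "Prediction")
conv_after = exact_match_rate(rows, "GaphEnhance")
print(conv_before, conv_after)
```

Under this reading, re-ranking fixes conversations 29 and 35 but not 24. The gap between the two reported scores (0.7455 - 0.7091) equals 2/55, which is consistent with two additional conversations being fully correct.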

### Speaker accuracy

In [23]:
final_speaker_accuracy(df_res, "Prediction")

0.8617886178861789

In [24]:
final_speaker_accuracy(df_res, "GaphEnhance")

0.8861788617886179
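
The speaker-level score can be illustrated in the same spirit: count how many true speakers of each conversation appear in the predicted list, then divide by the total number of true speakers. This is an assumed definition, applied to the three conversations where the two methods disagree; the function name `speaker_hit_rate` is hypothetical and `final_speaker_accuracy` may be implemented differently.

```python
from collections import Counter

# The three rows of df_res where GaphEnhance and Prediction differ.
rows = [
    {"Conv": 24,
     "GaphEnhance": ["nick", "robinchilds", "sara"],
     "Truth": ["nick", "sara", "warrick"],
     "Prediction": ["nick", "robinchilds", "robinchilds"]},
    {"Conv": 29,
     "GaphEnhance": ["brass", "catherine", "warrick"],
     "Truth": ["brass", "catherine", "warrick"],
     "Prediction": ["brass", "catherine", "sgtoriley"]},
    {"Conv": 35,
     "GaphEnhance": ["roger", "sgtoriley"],
     "Truth": ["roger", "sgtoriley"],
     "Prediction": ["brass", "roger"]},
]

def speaker_hit_rate(rows, column):
    """Share of true speakers recovered, summed over all conversations.

    Hypothetical helper; an assumed reading of final_speaker_accuracy.
    """
    hits = total = 0
    for row in rows:
        # Multiset intersection: a repeated prediction cannot count twice
        overlap = Counter(row[column]) & Counter(row["Truth"])
        hits += sum(overlap.values())
        total += len(row["Truth"])
    return hits / total

spk_before = speaker_hit_rate(rows, "Prediction")
spk_after = speaker_hit_rate(rows, "GaphEnhance")
print(spk_before, spk_after)
```

On these three conversations, re-ranking recovers three more speakers (7 of 8 instead of 4 of 8). The gap between the two reported scores (0.8862 - 0.8618) equals 3/123, which is consistent with that count.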

### Final Network

In [25]:
plot_rank = final_graph(G_rank, trace_conv, episode, spk_coord)
plot_rank