<center><h1 style="font-size:3em"> Graph2Speak </h1></center>
<center><h3> Improving Speaker Identification using Network Knowledge in Criminal Conversational Data </h3></center>

Working paper: https://www.overleaf.com/read/ymhjvfwmfwzd

*Maël Fabien, Seyyed Saeed Sarfjoo, Petr Motlicek, Srikanth Madikeri*

In [1]:
# General
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Network
import itertools
import networkx as nx
import nx_altair as nxa
from networkx.algorithms.centrality import betweenness_centrality
from networkx.algorithms.approximation import min_weighted_dominating_set, average_clustering, max_clique

# Set of functions
from utils import *

In [2]:
episode = "s01e07"

In [3]:
dict_spk, spk_dict = ep_dicts(episode)
dict_spk, spk_dict

({'eddiewillows': '1001_csi',
  'jesseoverton': '1002_csi',
  'conradecklie': '1003_csi',
  'sheriff_brianmobley': '1004_csi',
  'tedgoggle': '1005_csi',
  'lie_detector_operator': '1006_csi',
  'nick': '1007_csi',
  'warrick': '1008_csi',
  'det_oriley': '1009_csi',
  'brass': '1010_csi',
  'tinacollins': '1011_csi',
  'sara': '1012_csi',
  'catherine': '1013_csi',
  'grissom': '1014_csi'},
 {'1001_csi': 'eddiewillows',
  '1002_csi': 'jesseoverton',
  '1003_csi': 'conradecklie',
  '1004_csi': 'sheriff_brianmobley',
  '1005_csi': 'tedgoggle',
  '1006_csi': 'lie_detector_operator',
  '1007_csi': 'nick',
  '1008_csi': 'warrick',
  '1009_csi': 'det_oriley',
  '1010_csi': 'brass',
  '1011_csi': 'tinacollins',
  '1012_csi': 'sara',
  '1013_csi': 'catherine',
  '1014_csi': 'grissom'})
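
The two dictionaries above are inverses of each other: the first maps speaker names to IDs, the second maps IDs back to names. How `ep_dicts` builds them is not shown here, so the following is only a minimal sketch of deriving one mapping from the other; `invert_mapping` and the three-entry excerpt are illustrative, not part of `utils`.

```python
# Excerpt of the name -> ID mapping printed above (three of the fourteen entries).
name_to_id = {
    "eddiewillows": "1001_csi",
    "jesseoverton": "1002_csi",
    "grissom": "1014_csi",
}


def invert_mapping(mapping):
    """Return the value -> key mapping, failing loudly if values are not unique."""
    inverse = {value: key for key, value in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapping is not one-to-one, so it cannot be inverted")
    return inverse


# Hypothetical helper: the ID -> name direction derived from the first dictionary.
id_to_name = invert_mapping(name_to_id)
print(id_to_name["1014_csi"])
```

The uniqueness check matters because two speakers sharing one ID would silently overwrite each other in the inverse.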

# I. Ground truth

In [20]:
truth_events = pd.read_csv("graph_input/all_events_%s.csv"%episode)
truth_events = truth_events[['speaker', 'conv']].drop_duplicates().dropna()
truth_events['speaker'] = truth_events['speaker'].apply(lambda x: x.replace("/", "").replace(".", "").replace("'", ""))
truth_events.head()

Unnamed: 0,speaker,conv
0,tinacollins,0.0
24,grissom,1.0
25,det_oriley,1.0
101,grissom,2.0
148,shibley,2.0


In [21]:
G, plot = build_graph(truth_events, "conv", "speaker", "truth")
plot
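
The internals of `build_graph` live in `utils` and are not shown. One plausible reading, given that it receives a conversation column and a speaker column, is that speakers become nodes and two speakers are linked when they appear in the same conversation. The sketch below illustrates that assumption with `networkx`; `cooccurrence_graph` and the edge `weight` attribute are hypothetical, and plotting is left out.

```python
import itertools

import networkx as nx
import pandas as pd


def cooccurrence_graph(events, conv_col, speaker_col):
    """Link every pair of speakers that share a conversation (assumed behaviour)."""
    graph = nx.Graph()
    for _, speakers in events.groupby(conv_col)[speaker_col]:
        unique = sorted(set(speakers))
        graph.add_nodes_from(unique)  # keeps speakers who only ever talk alone
        for a, b in itertools.combinations(unique, 2):
            # Count how many conversations the pair shares.
            previous = graph.get_edge_data(a, b, default={"weight": 0})["weight"]
            graph.add_edge(a, b, weight=previous + 1)
    return graph


# The five rows shown by truth_events.head() above.
toy_events = pd.DataFrame(
    {
        "speaker": ["tinacollins", "grissom", "det_oriley", "grissom", "shibley"],
        "conv": [0.0, 1.0, 1.0, 2.0, 2.0],
    }
)
G_toy = cooccurrence_graph(toy_events, "conv", "speaker")
print(sorted(G_toy.edges(data="weight")))
```

On these five rows, `grissom` ends up connected to both `det_oriley` and `shibley`, while `tinacollins` stays isolated.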

# II. Speaker ID Prediction

Benchmark speaker identification performance from Kaldi, per episode:

In [22]:
perf_s01e07 = 0.916
perf_s01e08 = 0.919
perf_s01e19 = 0.579
perf_s01e20 = 0.746
perf_s02e01 = 0.880
perf_s02e04 = 0.894

We need two dataframes here: one holding the score of every speaker model against each file, and one holding, for each file, the model with the maximum score, which is the speaker ID prediction:

In [23]:
pred = get_all_pred_scores("sid_output/scores_%s/csi_test_unique_scores"%episode, spk_dict)
pred.head()

Unnamed: 0,Model,File,Truth,Conv,Score
27,eddiewillows,tinacollins_Conv0,tinacollins,0,-17.55911
123,jesseoverton,tinacollins_Conv0,tinacollins,0,-4.16105
219,conradecklie,tinacollins_Conv0,tinacollins,0,-17.07024
315,sheriff_brianmobley,tinacollins_Conv0,tinacollins,0,-32.4096
411,tedgoggle,tinacollins_Conv0,tinacollins,0,-28.67418


In [24]:
winners = get_pred_speakers(pred)
winners.head()

Unnamed: 0,Pred,Truth,Conv
0,tinacollins,tinacollins,0
1,det_oriley,det_oriley,1
2,grissom,grissom,1
3,grissom,grissom,10
4,det_oriley,det_oriley,11
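
Going from the full score table to one prediction per file amounts to an argmax over models. The body of `get_pred_speakers` is not shown, so the following is only a sketch of that step under the column names visible in `pred`; `top_scoring_models` is a hypothetical name, and the `grissom_Conv1` scores are made up for illustration (the `tinacollins_Conv0` scores are taken from the outputs in this notebook).

```python
import pandas as pd


def top_scoring_models(scores):
    """Keep, for each file, the row of the model with the highest score."""
    best_rows = scores.loc[scores.groupby("File")["Score"].idxmax()]
    return (
        best_rows.rename(columns={"Model": "Pred"})[["Pred", "Truth", "Conv"]]
        .reset_index(drop=True)
    )


toy_scores = pd.DataFrame(
    {
        "Model": ["jesseoverton", "tinacollins", "sara", "grissom", "brass"],
        "File": ["tinacollins_Conv0"] * 3 + ["grissom_Conv1"] * 2,
        "Truth": ["tinacollins"] * 3 + ["grissom"] * 2,
        "Conv": [0, 0, 0, 1, 1],
        # First three scores appear in this notebook; the last two are invented.
        "Score": [-4.16105, 18.20214, 1.25702, 3.0, 7.5],
    }
)
toy_winners = top_scoring_models(toy_scores)
print(toy_winners)
```

With the invented scores, the second file is deliberately misclassified as `brass`, which shows how a wrong argmax propagates into the predicted network.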


Re-compute the speaker accuracy:

In [25]:
speaker_accuracy(winners)

0.9166666666666666
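
The value above agrees with the Kaldi benchmark for this episode (0.916). Assuming `speaker_accuracy` is the share of rows in `winners` where `Pred` equals `Truth`, it can be sketched as follows; `fraction_correct` and the toy rows are illustrative only.

```python
import pandas as pd


def fraction_correct(winners_df):
    """Share of rows whose predicted speaker equals the true speaker."""
    return float((winners_df["Pred"] == winners_df["Truth"]).mean())


# Toy data: three correct predictions and one confusion (brass for grissom).
toy_predictions = pd.DataFrame(
    {
        "Pred": ["tinacollins", "det_oriley", "grissom", "brass"],
        "Truth": ["tinacollins", "det_oriley", "grissom", "grissom"],
        "Conv": [0, 1, 1, 15],
    }
)
toy_accuracy = fraction_correct(toy_predictions)
print(toy_accuracy)
```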

And plot the predicted network:

In [26]:
G_pred, plot_pred = build_graph(winners, "Conv", "Pred", "pred")
plot_pred

# III. Improving Speaker Identification using Network Knowledge

We again need two datasets: one listing all candidates for each conversation, and one keeping only the candidates from `pred` whose score is above a given threshold:

In [27]:
cand = build_candidates(pred)
cand.head()

Unnamed: 0,Conv,NumChar,Conversation,Truth,Candidate,Score
0,0,1,0_tinacollins,[tinacollins],"[eddiewillows, jesseoverton, conradecklie, she...","[-17.55911, -4.16105, -17.07024, -32.4096, -28..."
1,1,2,1_det_oriley,"[det_oriley, grissom]","[eddiewillows, jesseoverton, conradecklie, she...","[-13.57533, -16.27033, -10.17966, 7.101351, -3..."
12,2,2,2_grissom,"[grissom, sara]","[eddiewillows, jesseoverton, conradecklie, she...","[6.577901, 4.356473, -4.583953, 20.2203, 0.100..."
23,3,2,3_det_oriley,"[det_oriley, grissom]","[eddiewillows, jesseoverton, conradecklie, she...","[-25.47741, -7.629332, -15.41951, -16.17051, -..."
41,5,3,5_det_oriley,"[det_oriley, grissom, tinacollins]","[eddiewillows, jesseoverton, conradecklie, she...","[-3.301001, -8.831428, -19.57747, 5.522082, -4..."


In [28]:
score_sup = keep_higher_scores(pred, threshold=-15)
score_sup.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  score_sup["Conv"] = score_sup["Conv"].astype(int)


Unnamed: 0,Model,File,Truth,Conv,Score
0,jesseoverton,tinacollins_Conv0,tinacollins,0,-4.16105
1,lie_detector_operator,tinacollins_Conv0,tinacollins,0,-14.53733
2,nick,tinacollins_Conv0,tinacollins,0,-12.12411
3,tinacollins,tinacollins_Conv0,tinacollins,0,18.20214
4,sara,tinacollins_Conv0,tinacollins,0,1.25702
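
The rows kept above all score higher than -15, while the rows that disappeared (for example `eddiewillows` at -17.55911) score lower. A filter of this kind can be sketched as below. This is not the code of `keep_higher_scores`; `filter_by_score` is a hypothetical name, and the strict `>` comparison is an assumption. Taking an explicit `.copy()` of the filtered frame before casting `Conv` is the usual way to avoid pandas' `SettingWithCopyWarning`.

```python
import pandas as pd


def filter_by_score(scores, threshold):
    """Keep rows scoring above `threshold` and cast Conv to int on a real copy."""
    kept = scores.loc[scores["Score"] > threshold].copy()  # copy, not a view
    kept["Conv"] = kept["Conv"].astype(int)
    return kept.reset_index(drop=True)


# Scores taken from the pred output above; Conv stored as text to motivate the cast.
toy_pred = pd.DataFrame(
    {
        "Model": ["eddiewillows", "jesseoverton", "conradecklie", "tinacollins"],
        "File": ["tinacollins_Conv0"] * 4,
        "Truth": ["tinacollins"] * 4,
        "Conv": ["0", "0", "0", "0"],
        "Score": [-17.55911, -4.16105, -17.07024, 18.20214],
    }
)
toy_kept = filter_by_score(toy_pred, threshold=-15)
print(toy_kept)
```

The input frame is left untouched, so the same `pred` table can be filtered again with a different threshold.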


In [29]:
df_res, G_rank, trace_conv = rerank_graph(score_sup, winners, cand, threshold=-15)

Conversation 0 out of 48
Conversation 1 out of 48
Conversation 2 out of 48
Conversation 3 out of 48
Conversation 5 out of 48
Conversation 6 out of 48
Conversation 7 out of 48
Conversation 9 out of 48
Conversation 10 out of 48
Conversation 11 out of 48
Conversation 12 out of 48
Conversation 13 out of 48
Conversation 14 out of 48
Conversation 15 out of 48
Conversation 16 out of 48
Conversation 17 out of 48
Conversation 18 out of 48
Conversation 19 out of 48
Conversation 20 out of 48
Conversation 21 out of 48
Conversation 22 out of 48
Conversation 23 out of 48
Conversation 24 out of 48
Conversation 25 out of 48
Conversation 26 out of 48
Conversation 27 out of 48
Conversation 28 out of 48
Conversation 29 out of 48
Conversation 30 out of 48
Conversation 33 out of 48
Conversation 34 out of 48
Conversation 35 out of 48
Conversation 36 out of 48
Conversation 37 out of 48
Conversation 38 out of 48
Conversation 39 out of 48
Conversation 40 out of 48
Conversation 41 out of 48
Conversation 42 out 

Where do the graph-enhanced predictions differ from the baseline predictions?

In [30]:
df_res[df_res['GaphEnhance'] != df_res['Prediction']]

Unnamed: 0,Conv,GaphEnhance,Truth,Prediction
6,7,"[eddiewillows, grissom, grissom]","[catherine, grissom, nick, warrick]","[eddiewillows, grissom, grissom, warrick]"
13,15,"[grissom, sara]","[grissom, sara]","[brass, sara]"
35,39,"[brass, catherine, grissom, nick, warrick]","[brass, catherine, grissom, nick, warrick]","[brass, catherine, jesseoverton, nick, warrick]"
44,48,"[grissom, nick]","[catherine, grissom, nick]","[grissom, nick, tinacollins]"


### Conversation accuracy

In [31]:
conversation_accuracy(df_res, "Prediction")

0.8444444444444444

In [32]:
conversation_accuracy(df_res, "GaphEnhance")

0.8888888888888888
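
The two values are consistent with 38 and 40 fully correct conversations out of the 45 rows of `df_res`, which matches the table above: the graph-enhanced column repairs conversations 15 and 39. Assuming a conversation counts as correct only when its predicted speaker list equals the true list, the metric can be sketched as follows; `exact_match_rate` is a hypothetical name, not the `utils` implementation.

```python
import pandas as pd


def exact_match_rate(results, column):
    """Share of conversations whose predicted speaker list equals the truth."""
    matches = [
        sorted(predicted) == sorted(truth)
        for predicted, truth in zip(results[column], results["Truth"])
    ]
    return sum(matches) / len(matches)


# Three of the four differing rows shown above (conversations 15, 39 and 48).
toy_res = pd.DataFrame(
    {
        "Conv": [15, 39, 48],
        "GaphEnhance": [
            ["grissom", "sara"],
            ["brass", "catherine", "grissom", "nick", "warrick"],
            ["grissom", "nick"],
        ],
        "Truth": [
            ["grissom", "sara"],
            ["brass", "catherine", "grissom", "nick", "warrick"],
            ["catherine", "grissom", "nick"],
        ],
        "Prediction": [
            ["brass", "sara"],
            ["brass", "catherine", "jesseoverton", "nick", "warrick"],
            ["grissom", "nick", "tinacollins"],
        ],
    }
)
baseline_rate = exact_match_rate(toy_res, "Prediction")
enhanced_rate = exact_match_rate(toy_res, "GaphEnhance")
print(baseline_rate, enhanced_rate)
```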

### Speaker accuracy

In [33]:
final_speaker_accuracy(df_res, "Prediction")

0.9166666666666666

In [34]:
final_speaker_accuracy(df_res, "GaphEnhance")

0.9270833333333334
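
The step from 0.9167 to 0.9271 is consistent with one additional correct speaker out of 96 (88 versus 89). How `final_speaker_accuracy` counts correct speakers is not shown; the sketch below assumes a multiset overlap between predicted and true speaker lists, divided by the number of true speakers. `overlap_accuracy` is a hypothetical name, and the input is the four differing rows from the table above.

```python
from collections import Counter


def overlap_accuracy(predicted_lists, truth_lists):
    """Correctly recovered speakers (multiset overlap) over all true speakers."""
    correct = 0
    total = 0
    for predicted, truth in zip(predicted_lists, truth_lists):
        # Counter & Counter keeps the minimum count of each shared speaker.
        correct += sum((Counter(predicted) & Counter(truth)).values())
        total += len(truth)
    return correct / total


truth = [
    ["catherine", "grissom", "nick", "warrick"],            # conversation 7
    ["grissom", "sara"],                                    # conversation 15
    ["brass", "catherine", "grissom", "nick", "warrick"],   # conversation 39
    ["catherine", "grissom", "nick"],                       # conversation 48
]
baseline = [
    ["eddiewillows", "grissom", "grissom", "warrick"],
    ["brass", "sara"],
    ["brass", "catherine", "jesseoverton", "nick", "warrick"],
    ["grissom", "nick", "tinacollins"],
]
enhanced = [
    ["eddiewillows", "grissom", "grissom"],
    ["grissom", "sara"],
    ["brass", "catherine", "grissom", "nick", "warrick"],
    ["grissom", "nick"],
]
baseline_overlap = overlap_accuracy(baseline, truth)
enhanced_overlap = overlap_accuracy(enhanced, truth)
print(baseline_overlap, enhanced_overlap)
```

Under this assumption the four rows go from 9 to 10 correct speakers out of 14: conversations 15 and 39 each gain one, while conversation 7 loses `warrick`, giving the same net gain of one speaker.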

### Final Network

In [35]:
plot_rank = final_graph(G_rank, trace_conv)
plot_rank