<center><h1 style="font-size:3em"> Graph2Speak </h1></center>
<center><h3> Improving Speaker Identification using Network Knowledge in Criminal Conversational Data </h3></center>

Paper: https://arxiv.org/abs/2006.02093

*Maël Fabien, Seyyed Saeed Sarfjoo, Petr Motlicek, Srikanth Madikeri*

In [1]:
# General
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set of functions
from src.utils import *

In [2]:
episode = "s02e09"

In [3]:
dict_spk, spk_dict, spk_coord = ep_dicts(episode)
dict_spk, spk_dict, spk_coord

({'mrfram': '1001_csi',
  'detoriley': '1002_csi',
  'davidphillips': '1003_csi',
  'robbins': '1004_csi',
  'kelseyfram': '1005_csi',
  'dennisfram': '1006_csi',
  'managerofromaninis': '1007_csi',
  'juliabarett': '1008_csi',
  'nick': '1009_csi',
  'brass': '1010_csi',
  'sara': '1011_csi',
  'warrick': '1012_csi',
  'catherine': '1013_csi',
  'grissom': '1014_csi'},
 {'1001_csi': 'mrfram',
  '1002_csi': 'detoriley',
  '1003_csi': 'davidphillips',
  '1004_csi': 'robbins',
  '1005_csi': 'kelseyfram',
  '1006_csi': 'dennisfram',
  '1007_csi': 'managerofromaninis',
  '1008_csi': 'juliabarett',
  '1009_csi': 'nick',
  '1010_csi': 'brass',
  '1011_csi': 'sara',
  '1012_csi': 'warrick',
  '1013_csi': 'catherine',
  '1014_csi': 'grissom'},
 {'mrfram': [50, 50],
  'detoriley': [50, 100],
  'davidphillips': [50, 150],
  'robbins': [100, 50],
  'kelseyfram': [100, 100],
  'dennisfram': [100, 150],
  'managerofromaninis': [150, 50],
  'juliabarett': [150, 100],
  'nick': [150, 150],
  'brass':
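The three mappings above are a name-to-ID dictionary, its inverse, and a plotting coordinate for each speaker. The sketch below shows one way such mappings could be derived from an ordered list of names. `build_speaker_dicts` and its parameters are hypothetical and are not part of `src.utils`; the 3-column grid with a step of 50 is inferred from the visible output only, and `ep_dicts` may compute its values differently.

```python
def build_speaker_dicts(names, n_cols=3, step=50, suffix="_csi", first_id=1001):
    """Hypothetical stand-in for ep_dicts: derive three mappings from a name list."""
    # name -> speaker ID, e.g. 'mrfram' -> '1001_csi'
    dict_spk = {name: f"{first_id + i}{suffix}" for i, name in enumerate(names)}
    # speaker ID -> name (inverse mapping)
    spk_dict = {spk_id: name for name, spk_id in dict_spk.items()}
    # name -> [x, y] position on a grid filled n_cols entries at a time
    spk_coord = {
        name: [step * (i // n_cols + 1), step * (i % n_cols + 1)]
        for i, name in enumerate(names)
    }
    return dict_spk, spk_dict, spk_coord


names = ["mrfram", "detoriley", "davidphillips", "robbins"]
dict_spk, spk_dict, spk_coord = build_speaker_dicts(names)
print(dict_spk["robbins"])   # 1004_csi
print(spk_coord["robbins"])  # [100, 50]
```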

# I. Ground truth

In [4]:
truth_events = pd.read_csv("src/graph_input/all_events_%s.csv"%episode).drop_duplicates().dropna()

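# Number of events (rows) per conversation, keyed by integer conversation id
dict_len = {int(c): int(n) for c, n in truth_events.groupby('conv').size().items()}

# Keep only the columns needed for the graph; copy so the assignment below does not act on a slice
truth_events = truth_events[['speaker', 'conv']].copy()
# Normalise speaker names by stripping "/", "." and "'"
truth_events['speaker'] = truth_events['speaker'].apply(lambda x: x.replace("/", "").replace(".", "").replace("'", ""))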
truth_events.head()

Unnamed: 0,speaker,conv
0,barryschickle,0.0
1,barryschickle,0.0
2,barryschickle,0.0
3,barryschickle,0.0
4,barryschickle,0.0


In [5]:
# Read the speakers to keep, stripping newlines, "." and "'" from each name
list_spk_keep = []

with open("src/speaker_id_input/%s.txt" % episode, "r") as f:
    for line in f:
        list_spk_keep.append(line.replace("\n", "").replace(".", "").replace("'", ""))

In [6]:
truth_events = truth_events[truth_events['speaker'].isin(list_spk_keep)]

In [7]:
G, plot = build_graph(truth_events, "conv", "speaker", "truth", episode, spk_coord)
plot
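The internals of `build_graph` are not shown in this notebook. As a rough sketch of the underlying idea, a conversation graph can link two speakers whenever they appear in the same conversation, with the edge weight counting how many conversations they share. The function name `cooccurrence_edges` and the toy events are assumptions made for illustration, and the actual edge definition used by `build_graph` may differ.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(events):
    """events: iterable of (speaker, conversation_id) pairs.

    Returns a Counter mapping each undirected speaker pair to the number
    of conversations in which both speakers appear.
    """
    speakers_by_conv = {}
    for speaker, conv in events:
        speakers_by_conv.setdefault(conv, set()).add(speaker)

    edges = Counter()
    for speakers in speakers_by_conv.values():
        # sorted() gives every pair a canonical (a, b) order
        for a, b in combinations(sorted(speakers), 2):
            edges[(a, b)] += 1
    return edges


# Toy events, not the real episode data
events = [
    ("brass", 1), ("grissom", 1), ("brass", 1),
    ("catherine", 4), ("warrick", 4),
    ("brass", 5), ("grissom", 5),
]
edges = cooccurrence_edges(events)
print(edges[("brass", "grissom")])      # 2
print(edges[("catherine", "warrick")])  # 1
```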

# II. Speaker ID Prediction

Benchmark performance from Kaldi:

In [8]:
perf_s01e07 = 0.916
perf_s01e08 = 0.919
perf_s01e19 = 0.579
perf_s01e20 = 0.746
perf_s01e23 = 0.686
perf_s02e01 = 0.880
perf_s02e04 = 0.894
perf_s02e06 = 0.855

We need two DataFrames here: one holding the score of every speaker model against each file, and one holding, for each file, the model with the highest score, which is the speaker ID prediction:

In [9]:
pred = get_all_pred_scores("src/speaker_id_output/scores_%s/csi_test_unique_scores"%episode, spk_dict)
pred.head()

Unnamed: 0,Model,File,Truth,Conv,Score
120,mrfram,brass_Conv1,brass,1,-38.86657
243,detoriley,brass_Conv1,brass,1,-1.792155
366,davidphillips,brass_Conv1,brass,1,-39.43423
489,robbins,brass_Conv1,brass,1,-10.19601
612,kelseyfram,brass_Conv1,brass,1,-20.95274


In [10]:
winners = get_pred_speakers(pred)
winners.head()

Unnamed: 0,Pred,Truth,Conv
0,brass,brass,1
1,grissom,grissom,1
2,davidphillips,davidphillips,10
3,nick,nick,10
4,sara,sara,10
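Picking the predicted speaker amounts to taking, for each file, the model with the highest score. The sketch below illustrates that selection in plain Python using scores printed for `brass_Conv1` in this notebook. `top_scoring_models` is a hypothetical helper, and `get_pred_speakers` may be implemented differently.

```python
def top_scoring_models(rows):
    """rows: iterable of (model, file, score). Returns {file: best-scoring model}."""
    best = {}
    for model, file, score in rows:
        # Keep the highest-scoring model seen so far for this file
        if file not in best or score > best[file][1]:
            best[file] = (model, score)
    return {file: model for file, (model, _) in best.items()}


rows = [
    ("mrfram", "brass_Conv1", -38.86657),
    ("detoriley", "brass_Conv1", -1.792155),
    ("robbins", "brass_Conv1", -10.19601),
    ("brass", "brass_Conv1", 66.67432),
    ("grissom", "brass_Conv1", 11.07321),
]
print(top_scoring_models(rows))  # {'brass_Conv1': 'brass'}
```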


Re-compute the speaker accuracy:

In [11]:
speaker_accuracy(winners)

0.8943089430894309
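Speaker accuracy here can be read as the fraction of files whose predicted speaker matches the true speaker. A minimal sketch under that assumption follows; `accuracy` is a hypothetical helper, not the `speaker_accuracy` function from `src.utils`, and the last pair is an invented error added only to make the result non-trivial.

```python
def accuracy(pairs):
    """pairs: iterable of (predicted, truth). Returns the share of exact matches."""
    pairs = list(pairs)
    return sum(pred == truth for pred, truth in pairs) / len(pairs)


pairs = [
    ("brass", "brass"),
    ("grissom", "grissom"),
    ("davidphillips", "davidphillips"),
    ("nick", "nick"),
    ("sara", "sara"),
    ("robbins", "sara"),  # hypothetical misclassification, not from the data
]
print(accuracy(pairs))  # 5 correct out of 6
```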

And plot the predicted network:

In [12]:
G_pred, plot_pred = build_graph(winners, "Conv", "Pred", "pred", episode, spk_coord)
plot_pred

# III. Improving Speaker Identification using Network Knowledge

We again need two datasets: one listing all candidates for each conversation, and one keeping only the candidates from `pred` whose score is above a given threshold:

In [13]:
cand = build_candidates(pred)
cand.head()

Unnamed: 0,Conv,NumChar,Conversation,Truth,Candidate,Score
0,1,2,1_brass,"[brass, grissom]","[mrfram, detoriley, davidphillips, robbins, ke...","[-38.86657, -1.792155, -39.43423, -10.19601, -..."
19,3,1,3_catherine,[catherine],"[mrfram, detoriley, davidphillips, kelseyfram,...","[-31.42015, -39.27356, -12.85153, 0.5060293, -..."
29,4,2,4_catherine,"[catherine, warrick]","[dennisfram, juliabarett, sara, catherine, mrf...","[-17.69169, -35.99288, -6.885671, -3.342968, -..."
39,5,2,5_brass,"[brass, grissom]","[mrfram, detoriley, davidphillips, robbins, ke...","[-29.53336, -2.869998, -14.53197, 10.6197, -9...."
49,6,2,6_brass,"[brass, grissom]","[detoriley, davidphillips, robbins, kelseyfram...","[6.379692, -32.67316, 0.4403149, -19.31599, -3..."


In [14]:
score_sup = keep_higher_scores(pred, threshold=-15)
score_sup.head()


Unnamed: 0,Model,File,Truth,Conv,Score
0,detoriley,brass_Conv1,brass,1,-1.792155
1,robbins,brass_Conv1,brass,1,-10.19601
2,nick,brass_Conv1,brass,1,-14.21531
3,brass,brass_Conv1,brass,1,66.67432
4,grissom,brass_Conv1,brass,1,11.07321
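The thresholding step can be pictured as a plain filter on the score column. The sketch below applies a threshold of -15 to scores printed for `brass_Conv1` in this notebook. `keep_above` is a hypothetical helper, and the strict `>` comparison is an assumption; `keep_higher_scores` may use `>=`.

```python
def keep_above(rows, threshold=-15):
    """rows: iterable of (model, file, score). Keep rows scoring above the threshold."""
    return [row for row in rows if row[2] > threshold]


rows = [
    ("mrfram", "brass_Conv1", -38.86657),
    ("detoriley", "brass_Conv1", -1.792155),
    ("davidphillips", "brass_Conv1", -39.43423),
    ("robbins", "brass_Conv1", -10.19601),
    ("kelseyfram", "brass_Conv1", -20.95274),
    ("nick", "brass_Conv1", -14.21531),
    ("brass", "brass_Conv1", 66.67432),
    ("grissom", "brass_Conv1", 11.07321),
]
kept = [model for model, _, _ in keep_above(rows)]
print(kept)  # ['detoriley', 'robbins', 'nick', 'brass', 'grissom']
```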


In [15]:
df_res, G_rank, trace_conv = rerank_graph(score_sup, winners, cand, dict_len, threshold=-15)

Conversation 1 out of 58
Conversation 3 out of 58
Conversation 4 out of 58
Conversation 5 out of 58
Conversation 6 out of 58
Conversation 7 out of 58
Conversation 8 out of 58
Conversation 9 out of 58
Conversation 10 out of 58
Conversation 11 out of 58
Conversation 12 out of 58
Conversation 13 out of 58
Conversation 14 out of 58
Conversation 15 out of 58
Conversation 16 out of 58
Conversation 17 out of 58
Conversation 18 out of 58
Conversation 19 out of 58
Conversation 21 out of 58
Conversation 22 out of 58
Conversation 23 out of 58
Conversation 24 out of 58
Conversation 25 out of 58
Conversation 26 out of 58
Conversation 27 out of 58
Conversation 28 out of 58
Conversation 31 out of 58
Conversation 32 out of 58
Conversation 33 out of 58
Conversation 34 out of 58
Conversation 35 out of 58
Conversation 36 out of 58
Conversation 37 out of 58
Conversation 38 out of 58
Conversation 39 out of 58
Conversation 40 out of 58
Conversation 41 out of 58
Conversation 42 out of 58
Conversation 43 out 
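The internals of `rerank_graph` are not shown in this notebook, so the following is only an illustration of how network knowledge could influence a ranking, not the method used here or in the paper. It assumes that a candidate's acoustic score receives a fixed bonus for every already-accepted speaker in the same conversation with whom the candidate has been seen before. The function `rerank`, the `bonus` value, the scores, and the edges are all hypothetical.

```python
def rerank(candidates, partners, known_edges, bonus=10.0):
    """Pick a speaker for one segment using scores plus graph links.

    candidates:  {speaker: score} for the segment (already thresholded)
    partners:    speakers already accepted for the same conversation
    known_edges: set of frozenset({a, b}) pairs seen together earlier
    bonus:       hypothetical score increment per known link
    """
    def adjusted(item):
        speaker, score = item
        links = sum(frozenset((speaker, p)) in known_edges for p in partners)
        return score + bonus * links

    return max(candidates.items(), key=adjusted)[0]


# Toy values for illustration only
candidates = {"robbins": 3.0, "sara": 1.0}
partners = ["grissom", "nick"]
known_edges = {frozenset(("grissom", "sara")), frozenset(("nick", "sara"))}

print(rerank(candidates, partners, known_edges))  # sara
print(rerank(candidates, partners, set()))        # robbins
```

With no known edges the choice falls back to the highest acoustic score, so the graph only changes a decision when it carries information about the conversation partners.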

Where do the graph-enhanced predictions differ from the baseline predictions?

In [22]:
df_res[df_res['GaphEnhance'] != df_res['Prediction']]

Unnamed: 0,Conv,GaphEnhance,Truth,Prediction
15,17,"[brass, juliabarett, kelseyfram]","[brass, dennisfram, grissom, juliabarett, kels...","[brass, juliabarett, kelseyfram, kelseyfram, w..."
17,19,"[grissom, nick, nick]","[grissom, nick, sara]","[nick, nick, robbins]"


### Conversation accuracy

In [23]:
conversation_accuracy(df_res, "Prediction")

0.7884615384615384

In [24]:
conversation_accuracy(df_res, "GaphEnhance")

0.7884615384615384

### Speaker accuracy

In [25]:
final_speaker_accuracy(df_res, "Prediction")

0.8934426229508197

In [26]:
final_speaker_accuracy(df_res, "GaphEnhance")

0.9016393442622951
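The two metrics can move independently: a conversation counts as correct only if its whole speaker list is right, whereas speaker accuracy gives credit for each speaker recovered. The sketch below shows this on conversation 19 from the table above plus one fully correct conversation. The definitions are assumptions (exact list match for conversations, multiset overlap for speakers), and `conversation_acc` and `speaker_acc` are hypothetical helpers, not the functions from `src.utils`.

```python
from collections import Counter


def conversation_acc(rows):
    """rows: list of (predicted list, truth list). Share of exact list matches."""
    return sum(sorted(p) == sorted(t) for p, t in rows) / len(rows)


def speaker_acc(rows):
    """Share of true speaker slots recovered, counted as multiset overlap."""
    hits = sum(sum((Counter(p) & Counter(t)).values()) for p, t in rows)
    total = sum(len(t) for _, t in rows)
    return hits / total


truth_19 = ["grissom", "nick", "sara"]
baseline = [(["brass", "grissom"], ["brass", "grissom"]),
            (["nick", "nick", "robbins"], truth_19)]
enhanced = [(["brass", "grissom"], ["brass", "grissom"]),
            (["grissom", "nick", "nick"], truth_19)]

print(conversation_acc(baseline), conversation_acc(enhanced))  # 0.5 0.5
print(speaker_acc(baseline), speaker_acc(enhanced))            # 0.6 0.8
```

In this toy case the graph-enhanced row recovers one more speaker without making the whole conversation correct, which is the same pattern as the unchanged conversation accuracy and higher speaker accuracy reported above.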

### Final Network

In [21]:
plot_rank = final_graph(G_rank, trace_conv, episode, spk_coord)
plot_rank