# Analyze speakers

In [1]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('../data/euroleaks/parsed.csv')
df.head()

Unnamed: 0,speaker,speech,timestamp,date
0,Jeroen Dijsselbloem,… of your responses or questions. And can I fi...,1900-01-01 00:00:00,2015-02-24 00:00:00
1,Speaker 2,"Uh, yes, uh, thank you, Jeroen. Well, uh, comm...",1900-01-01 00:00:10,2015-02-24 00:00:00
2,Michael Noonan,Michael Noonan.,1900-01-01 00:01:27,2015-02-24 00:00:00
3,Speaker 2,"Uh, it is therefore regrettable that, uh-",1900-01-01 00:01:29,2015-02-24 00:00:00
4,Speaker 3,Has entered the conference.,1900-01-01 00:01:33,2015-02-24 00:00:00


Use these patterns to find examples of transcription artifacts:

In [3]:
# raw strings keep the regex backslashes from being read as string escapes
brackets_ = re.compile(r'\[.*\]')
paranthesis_ = re.compile(r'\(.*\)')
pattern = re.compile(r'20(th)?,? ?(uh)?,? ?(of)?,? ?(uh)?,? ?february|february ?(the)? ?20t?h?[^\d,. ]')

#for s in df.speech:
    #if brackets_.match(s) or paranthesis_.match(s):
    #if '20' in s:
        #print(s, '\n')
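The commented-out loop above relies on `match`, which only tests the start of each string, so annotations in the middle of an utterance would be missed. A minimal sketch of the same idea with `findall`, assuming non-greedy patterns are acceptable; the sample utterances and the helper `find_artifacts` are illustrative and do not come from the data:

```python
import re

# Non-greedy variants of the patterns above, so each annotation is matched separately.
brackets = re.compile(r'\[.*?\]')
parentheses = re.compile(r'\(.*?\)')

# Hypothetical sample utterances standing in for df.speech.
samples = [
    'The Commission response [inaudible] to member states.',
    'Firstly I must underline (crosstalk) that the document was published.',
    'Thank you, Jeroen.',
]

def find_artifacts(text):
    """Return all bracketed or parenthesized annotations found anywhere in text."""
    return brackets.findall(text) + parentheses.findall(text)

artifacts = [find_artifacts(s) for s in samples]
for s, found in zip(samples, artifacts):
    if found:
        print(found, '<-', s)
```

On the real data, `samples` would be replaced by `df.speech`.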

Handle missing speaker

In [4]:
df[df.speaker.isnull()]

Unnamed: 0,speaker,speech,timestamp,date
1394,,"Jeroen Dijsselbloem\nNow, let’s see who is on ...",,2015-07-01 00:00:00


In [5]:
# assign through df.loc so the change is written to df itself, not to a temporary copy
df.loc[df.speaker.isnull(), 'speaker'] = 'jeroen dijsselbloem'

## inspect unique speakers

In [6]:
# strip and make lowercase
df.speaker = df.speaker.apply(lambda s: s.strip().lower() if not pd.isnull(s) else s)

In [7]:
# display all the names

for s in df.speaker.unique():
    if 'speaker' not in s:
        print(s)

jeroen dijsselbloem
michael noonan
pierre moscovici
mario draghi
wolfgang schäuble
christine lagarde
yanis varoufakis
yanis [not varoufakis]
luis de guindos
maria luís
marco buti
thomas wieser
declan costello
computer
benoit couré
paul thomsen
greek representative
thomas
benoit cœuré
nikos theocarakis
irina
irana
nabil
tooma
tropa
ricci
hans
paul
klaus regling
peter kažimír
martin
hans jörg schelling
dušan mramor
michel sapin
pier carlo padoan
edward scicluna
rimantas šadžius
poul thomsen
alexander stubb
inaudible
yanis varoufakis [privately]
johan van overtveldt
maria luís albuquerque
benoît cœuré
kian
male
group
johan
maria luis albuquerque
harris georgiades
translator
michel
luis pierre
luis
peter kazimir
wolfgang schauble
wolfgang


## drop some rows
For instance, those that transcribe words spoken by the computer (automated conference announcements).

In [8]:
#df[df.speech == 'Has entered the conference.']

In [9]:
df = df[df.speech != 'Has entered the conference.']

In [10]:
#df[df.speaker == 'group']

In [11]:
df = df[df.speaker != 'group']

In [12]:
#df[df.speaker == 'inaudible']

In [13]:
df = df[df.speaker != 'inaudible']

In [14]:
#for row in df[df.speaker == 'inaudible'].iterrows():
#    print(row[1].speech)
#    print()

In [15]:
df[df.speaker == 'yanis varoufakis [privately]']

Unnamed: 0,speaker,speech,timestamp,date
730,yanis varoufakis [privately],Good point.,1900-01-01 01:54:51,2015-06-18 00:00:00
738,yanis varoufakis [privately],Mm. Did you hear that?,1900-01-01 01:56:08,2015-06-18 00:00:00
746,yanis varoufakis [privately],"Πολύ καλό, καλύτερο από ποτέ.",1900-01-01 02:01:15,2015-06-18 00:00:00


In [16]:
df = df[df.speaker != 'yanis varoufakis [privately]']

## unidentified speakers

In [17]:
search_term = 'speaker'

# speaker names were already stripped and lowercased above
for speaker in df.speaker.unique():
    if not pd.isnull(speaker) and search_term in speaker:
        print(speaker)

speaker 2
speaker 5
speaker 9
speaker 10
speaker 6
speaker 7
speaker 8
speaker 11
speaker 12
speaker 13
speaker 14
speaker 19
speaker 1
speaker 3
unidentified speaker
speaker 16
speaker 20
speaker 4
speaker 17
speaker 18
speaker 21
speaker


**Remark**: speakers 3, 2 and 1 might be relevant, as their word counts are 1500, 800 and 750, respectively. See the preliminary_analysis notebook.

But this is problematic, because I would have to listen and check whether speaker 3 is the same person across different meetings.
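One way to see how much text each anonymous label carries per meeting is a grouped word count. This is a minimal sketch on a made-up frame (`toy` and its rows are assumptions that only mimic the columns of `df`), not the computation from the preliminary_analysis notebook:

```python
import pandas as pd

# Hypothetical toy frame with the same columns as df (speaker, speech, date).
toy = pd.DataFrame({
    'speaker': ['speaker 3', 'speaker 3', 'speaker 1', 'michael noonan'],
    'speech': ['one two three', 'four five', 'six', 'seven eight'],
    'date': ['2015-02-24', '2015-04-01', '2015-02-24', '2015-02-24'],
})

# Keep only the anonymous "speaker N" labels.
anonymous = toy[toy.speaker.str.contains('speaker')]

# Word count per (speaker, meeting date); a label spread over several dates
# may refer to different people and would need to be checked by listening.
wordcounts = (
    anonymous.assign(n_words=anonymous.speech.str.split().str.len())
    .groupby(['speaker', 'date'])['n_words']
    .sum()
)
print(wordcounts)
```

A label whose words are concentrated in a single meeting is cheaper to identify than one spread thinly over many.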

## manually construct a mapping from different versions of the same name to one canonical name
Note that Thomas Wieser and just Thomas are two distinct persons.

In [18]:
amend_names = {
    'wolfgang schäuble': [
        'wolfgang schäuble',
        'wolfgang schauble',
        'wolfgang'
    ],
    'peter kažimír': [
        'peter kažimír',
        'peter kazimir'
    ],
    'michel sapin': [
        'michel sapin',
        'michel',
        'translator'
    ],
    'maria luís albuquerque': [
        'maria luís albuquerque',
        'maria luís',
        'maria luis albuquerque'
    ],
    'johan van overtveldt': [
        'johan van overtveldt',
        'johan'
    ],
    'benoît cœuré': [
        'benoît cœuré',
        'benoit couré',
        'benoit cœuré'
    ],
    'hans jörg schelling': [
        'hans jörg schelling',
        'hans'
    ],
    'poul mathias thomsen': [
        'paul thomsen',
        'paul',
        'poul thomsen'
    ],
    'luis de guindos': [
        'luis de guindos',
        'luis'
    ],
    'irina': [
        'irina',
        'irana'
    ],
    'jānis reirs': [
        'yanis [not varoufakis]'
    ],
    'luca antonio ricci': [
        'ricci'
    ],
    'thomas steffen': [
        'thomas'
    ],
    'nikos theocarakis': [
        'nikos theocarakis',
        'greek representative'
    ]
}

In [19]:
# dump to json (write directly to the file; do not rebind the name `json`)
import json

with open('../data/euroleaks/amend_names.json', 'w') as f:
    json.dump(amend_names, f)

In [20]:
# invert dict
amend_names_inv = {value: key for key,values in amend_names.items() for value in values}
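The dict comprehension above silently keeps only the last canonical name if an alias appears in two lists. A minimal sketch of a guard against that, assuming such duplicates should be treated as errors; the helpers `duplicated_aliases` and `invert` and the small `names` mapping are hypothetical:

```python
from collections import Counter

# Small illustrative mapping in the same shape as amend_names.
names = {
    'wolfgang schäuble': ['wolfgang schäuble', 'wolfgang schauble', 'wolfgang'],
    'michel sapin': ['michel sapin', 'michel', 'translator'],
}

def duplicated_aliases(mapping):
    """Return aliases that occur more than once across all alias lists."""
    counts = Counter(alias for aliases in mapping.values() for alias in aliases)
    return sorted(alias for alias, n in counts.items() if n > 1)

def invert(mapping):
    """Invert canonical -> aliases into alias -> canonical, refusing duplicates."""
    dupes = duplicated_aliases(mapping)
    if dupes:
        raise ValueError(f'ambiguous aliases: {dupes}')
    return {alias: name for name, aliases in mapping.items() for alias in aliases}

inverted = invert(names)
print(inverted['wolfgang'])
```

This matters here because bare first names such as 'thomas' or 'luis' could plausibly belong to more than one participant.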

In [21]:
# amend speaker names
df.speaker = df.speaker.apply(lambda s: amend_names_inv.get(s, s))

## speaker identification based on context and listening

In [22]:
# assign through df.loc; chained indexing (df.speaker[mask] = ...) may write to a copy
may_11 = df.date == '2015-05-11 00:00:00'
df.loc[(df.speaker == 'speaker 1') & may_11, 'speaker'] = 'pierre moscovici'
df.loc[(df.speaker == 'speaker 2') & may_11, 'speaker'] = 'benoît cœuré'

In [23]:
for speaker in df.speaker.unique():
    if 'speaker' not in speaker:
        print(speaker)

jeroen dijsselbloem
michael noonan
pierre moscovici
mario draghi
wolfgang schäuble
christine lagarde
yanis varoufakis
jānis reirs
luis de guindos
maria luís albuquerque
marco buti
thomas wieser
declan costello
benoît cœuré
poul mathias thomsen
nikos theocarakis
thomas steffen
irina
nabil
tooma
tropa
luca antonio ricci
hans jörg schelling
klaus regling
peter kažimír
martin
dušan mramor
michel sapin
pier carlo padoan
edward scicluna
rimantas šadžius
alexander stubb
johan van overtveldt
kian
male
harris georgiades
luis pierre


Manually map speaker to entity.

In [24]:
# DONE maybe update after you get an answer to your email; got the response, nothing to update

# missing:
# estonia (Maris Lauri, Sven Sester)
# luxembourg (Pierre Gramegna)


speaker_to_entity = {
    'jeroen dijsselbloem': 'EG President', # the Netherlands
    'michael noonan': 'Ireland',
    'pierre moscovici': 'European Commission',
    'mario draghi': 'ECB',
    'wolfgang schäuble': 'Germany',
    'thomas steffen': 'Germany', # State Secretary at the Federal Ministry of Finance under Schäuble
    'christine lagarde': 'IMF',
    'yanis varoufakis': 'Greece',
    'luis de guindos': 'Spain',
    'maria luís albuquerque': 'Portugal',
    'marco buti': 'European Commission', # https://ec.europa.eu/info/persons/director-general-marco-buti_en
    'thomas wieser': 'EWG President', # economic and financial committee, president of EWG
    'declan costello': 'European Commission', #dg ecfin
    'benoît cœuré': 'ECB',
    'poul mathias thomsen': 'IMF',
    'nikos theocarakis': 'Greece',
    'hans jörg schelling': 'Austria',
    'klaus regling': 'ESM', # head of european stability mechanism
    'peter kažimír': 'Slovakia',
    'dušan mramor': 'Slovenia',
    'michel sapin': 'France',
    'pier carlo padoan': 'Italy',
    'edward scicluna': 'Malta',
    'rimantas šadžius': 'Lithuania',
    'alexander stubb': 'Finland', # from May 29
    'tooma': 'Finland', # based on saying they have two and a half weeks until elections on April 1
    'johan van overtveldt': 'Belgium',
    'harris georgiades': 'Cyprus',
    'luis pierre': 'European Commission',
    'jānis reirs': 'Latvia',
    'luca antonio ricci': 'IMF' # https://www.imf.org/external/np/cv/AuthorCV.aspx?AuthID=108
}

In [25]:
# dump to json (write directly to the file; do not rebind the name `json`)
import json

with open('../data/euroleaks/name_to_entity.json', 'w') as f:
    json.dump(speaker_to_entity, f)

## print the mapping in alphabetical order, in format for table in latex

In [26]:
# sort by speaker name and print one LaTeX table row per speaker
for s, e in sorted(speaker_to_entity.items()):
    print(f'\\hline\n{s} & {e} \\\\')

\hline
alexander stubb & Finland \\
\hline
benoît cœuré & ECB \\
\hline
christine lagarde & IMF \\
\hline
declan costello & European Commission \\
\hline
dušan mramor & Slovenia \\
\hline
edward scicluna & Malta \\
\hline
hans jörg schelling & Austria \\
\hline
harris georgiades & Cyprus \\
\hline
jeroen dijsselbloem & EG President \\
\hline
johan van overtveldt & Belgium \\
\hline
jānis reirs & Latvia \\
\hline
klaus regling & ESM \\
\hline
luca antonio ricci & IMF \\
\hline
luis de guindos & Spain \\
\hline
luis pierre & European Commission \\
\hline
marco buti & European Commission \\
\hline
maria luís albuquerque & Portugal \\
\hline
mario draghi & ECB \\
\hline
michael noonan & Ireland \\
\hline
michel sapin & France \\
\hline
nikos theocarakis & Greece \\
\hline
peter kažimír & Slovakia \\
\hline
pier carlo padoan & Italy \\
\hline
pierre moscovici & European Commission \\
\hline
poul mathias thomsen & IMF \\
\hline
rimantas šadžius & Lithuania \\
\hline
thomas steffen & Germany 

### still don't know who these people represent...

In [27]:
for speaker in df.speaker.unique():
    if not ('speaker' in speaker or speaker in speaker_to_entity.keys()):
        print(speaker)

irina
nabil
tropa
martin
kian
male


When do these unidentified speakers mostly speak? Are they members of the EWG?
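The question above can also be approached for all unidentified names at once instead of one `speaker_of_interest` at a time. A minimal sketch on a made-up frame (the rows in `toy` and the `unidentified` list are illustrative assumptions), counting distinct meeting dates and total words per name:

```python
import pandas as pd

# Hypothetical toy frame with the same columns as df.
toy = pd.DataFrame({
    'speaker': ['irina', 'irina', 'nabil', 'yanis varoufakis'],
    'speech': ['a b c', 'd e', 'f', 'g h'],
    'date': ['2015-04-01', '2015-04-24', '2015-04-01', '2015-04-01'],
})
unidentified = ['irina', 'nabil']

# Restrict to the unidentified names, then summarize per speaker.
subset = toy[toy.speaker.isin(unidentified)]
summary = (
    subset.assign(n_words=subset.speech.str.split().str.len())
    .groupby('speaker')
    .agg(meetings=('date', 'nunique'), n_words=('n_words', 'sum'))
)
print(summary)
```

Comparing the meeting dates against the known type of each meeting would then indicate whether a name appears only in EWG sessions.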

In [28]:
speaker_of_interest = 'luis pierre'

In [29]:
df[df.speaker == speaker_of_interest].date.unique()

array(['2015-07-01 00:00:00'], dtype=object)

In [30]:
for row in df[df.speaker == speaker_of_interest].iterrows():
    print(row[1].speech)
    print()

– Ok. [I hear today ] that the circumstances have obviously changed, Greece is now in arrears to the IMF following a non payment and the EFSF program that expired. The Commission response to non-payment [inaudible] to propose reaction to member states [inaudible] and we discuss that morning in the College with the agreement of the Commission today we recommend that member states fully reserve the right to make a decision once there is more clarity on the situation. Yesterday Jeroen you asked the Institutions to prepare an assessment of Greece’s proposed amendments to the list of prior actions published on Sunday. This has been done on the basis of yesterday’s letter and the views outlined, are shared, by the three institutions, but of course everybody can express opinion. Firstly I must underline (crosstalk) that the document published on Sunday relates to discussions on the extension of the EFSF program.

Moreover (inaudible) circumstances plus the last request for new two year ESM pr