AMI and ICSI JSON Loading

In [22]:
import json
import os
import pandas as pd
import statistics
import unsupervised_topic_segmentation.dataset as ds
import create_test_data
from unsupervised_topic_segmentation import core, eval, types, dataset

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Find json paths and load into list

In [2]:
%ls

README.md                        environment_op.yml
Untitled.ipynb                   icsi_dataset/
__pycache__/                     parliamentary_dataset/
ami-and-icsi-corpora-master/     solbiati_experiments.ipynb
ami_isci_json_loading.ipynb      test_data_create.ipynb
create_test_data.py              transcripts.pickle
download_data.ipynb              unsupervised_topic_segmentation/
environment.yml


# ICSI
Not using ICSI at the moment.

In [3]:
# Not using ICSI at the moment
icsi_path='ami-and-icsi-corpora-master/icsi-corpus/output/dialogueActs/'

text_jsons_isci=[]

# Keep only the JSON files in the dialogue-act directory
file_paths = os.listdir(icsi_path)
file_paths = [x for x in file_paths if x.endswith(".json")]

for file_path in file_paths:
    # Each file holds the dialogue acts of one transcript
    with open(os.path.join(icsi_path, file_path)) as file:
        json_load = json.load(file)
    text_jsons_isci.append(json_load)
    
print(len(text_jsons_isci))


75


In [4]:
text_lengths_isci=[len(x) for x in text_jsons_isci]
statistics.median(text_lengths_isci)

1454

ICSI has fewer transcripts than AMI (75 versus 136), but it might suit our purposes better because its transcripts are longer: a median of 1454 utterances before cleaning, compared with 251 in AMI after cleaning.

# AMI
Using AMI instead

In [5]:
#ami loading
text_path='ami-and-icsi-corpora-master/ami-corpus/output/dialogueActs/'
topic_path='ami-and-icsi-corpora-master/ami-corpus/output/topics/'

text_jsons=[]
topic_jsons=[]

# The topic directory determines which files are loaded
file_paths = os.listdir(topic_path)
file_paths = [x for x in file_paths if x.endswith(".json")]

for file_path in file_paths:
    # The text section can probably be commented out, because the text is also contained in the topic files
    with open(os.path.join(text_path, file_path)) as file:
        json_load = json.load(file)
    text_jsons.append(json_load)
    
    with open(os.path.join(topic_path, file_path)) as file:
        json_load = json.load(file)
    topic_jsons.append(json_load)

print(len(text_jsons)==len(topic_jsons))
len(topic_jsons)


True


136

# Viewing AMI

In [6]:
text_jsons[0]  # list of dialogue acts; the utterance text is held under the 'text' key

print({x['label'] for x in text_jsons[0]})  # set of dialogue-act labels
print({x['speaker'] for x in text_jsons[0]})  # set of speakers

{'el.inf', 'be.neg', 'off', 'fra', 'el.und', 'bck', 'und', 'oth', 'stl', 'el.ass', 'ass', 'sug', 'el.sug', 'inf', 'be.pos'}
{'C', 'B', 'D', 'A'}


Notice that the text has a lot of filler, for example: <vocalsound>, um, uh, mm-hmm, uh-uh, oh.

The sentence structure also differs from our council text: there are many shorter sentences, some utterances are interrupted, and others contain multiple sentences.
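
The filler described above can be stripped with a couple of regular expressions. The sketch below is only an illustration: `strip_fillers` is a hypothetical helper, not the implementation of `ds.preprocessing`, and its filler list is taken from the examples in the text.

```python
import re

# Filler words taken from the examples above (assumption: not the full list).
FILLERS = ["um", "uh", "oh", "mm-hmm", "uh-uh"]

def strip_fillers(utterance, fillers=FILLERS):
    """Hypothetical helper: remove <tags> and standalone filler words."""
    # Drop markup-style tags such as <vocalsound>.
    cleaned = re.sub(r"<[^>]+>", " ", utterance)
    # Try longer fillers first so "mm-hmm" and "uh-uh" are matched whole.
    ordered = sorted(fillers, key=len, reverse=True)
    alternatives = "|".join(re.escape(f) for f in ordered)
    # Lookarounds keep "um" from matching inside words such as "umbrella".
    pattern = rf"(?<![\w-])(?:{alternatives})(?![\w-])"
    cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", cleaned).strip()

print(strip_fillers("Um , <vocalsound> I think uh we should start"))
```

Punctuation is left untouched here, so a leading comma can survive after a filler is removed; a real cleaning step would likely handle that as well.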
    

In [7]:
text = [x['text'] for x in text_jsons[0]]
text_clean = ds.preprocessing(pd.DataFrame({'sentences':text}),'sentences')
text_clean.shape #previously 410,2

(423, 2)

Topics 

In [8]:
print(topic_jsons[0][0].keys())
print(set([x['topic'] for x in topic_jsons[0]])) 


dict_keys(['id', 'topic', 'description', 'dialogueacts', 'subtopics'])
{'opening', 'industrial designer presentation', 'closing', 'marketing expert presentation', 'discussion', 'project specs and roles of participants', 'interface specialist presentation'}


# Clean AMI

This cell takes about 3.7 seconds to run. Do not delete the `.copy()` on `FILLERS` in the call to `ds.preprocessing`.
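
One plausible reason the `.copy()` matters is that the cleaning function modifies the filler list it receives; that is an assumption, since the body of `ds.preprocessing` is not shown here. The sketch below uses a hypothetical stand-in, `preprocess_in_place`, to show how a shared list would change across loop iterations while a copy would not.

```python
FILLERS = ["um", "uh", "oh"]

def preprocess_in_place(sentences, fillers):
    """Hypothetical stand-in for a cleaning function that mutates `fillers`."""
    fillers.append("<vocalsound>")  # side effect on the caller's list
    return [s for s in sentences if s.lower() not in fillers]

# Passing a copy: the module-level list is untouched after three calls.
for _ in range(3):
    preprocess_in_place(["um", "hello"], FILLERS.copy())
safe_length = len(FILLERS)

# Passing the list itself: it grows by one entry on every call.
shared = list(FILLERS)
for _ in range(3):
    preprocess_in_place(["um", "hello"], shared)
grown_length = len(shared)

print(safe_length, grown_length)
```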

In [9]:
transcripts=[]
text_lengths=[]
topic_lengths=[] #can remove later
FILLERS=["um", "uh", "oh", "hmm", "mm-hmm", "uh-uh", "you know"]

for topic_json in topic_jsons:

    df_temp=pd.DataFrame()
    has_topic_desc=len(set([x['topic'] for x in topic_json]))>1

    for index, topic in enumerate(topic_json): 
        text = [x['text'] for x in topic['dialogueacts']]
        if df_temp.empty:
            df_temp=pd.DataFrame({'sentences':text,'topic_count':index,'topic_desc':topic['topic'],'has_topic_desc':has_topic_desc})
        else:
            df_temp=pd.concat([df_temp,pd.DataFrame({'sentences':text,'topic_count':index,'topic_desc':topic['topic'],'has_topic_desc':has_topic_desc})])    

    df_clean=ds.preprocessing(df_temp,'sentences',FILLERS.copy(),min_caption_len=20)
    
    text_lengths.append(len(df_clean))
    topic_lengths.append(df_clean.groupby(['topic_count']).size().mean())
    transcripts.append(df_clean)
    
df_clean

Unnamed: 0,index,sentences,topic_count,topic_desc,has_topic_desc
0,1,Right it was function F_ eight or something,0,agenda/equipment issues,True
1,3,This one right there,0,agenda/equipment issues,True
2,6,Who is gonna do a PowerPoint presentation ?,0,agenda/equipment issues,True
3,11,I thought we all were,0,agenda/equipment issues,True
4,12,"Yeah , I have one too , okay",0,agenda/equipment issues,True
...,...,...,...,...,...
183,7,We might possibly have done,11,closing,True
184,9,"Alright , see you all soon",11,closing,True
185,0,If we've if we've finished at five minutes be...,12,agenda/equipment issues,True
186,4,I just have to there's a few little bits and...,12,agenda/equipment issues,True


Calculate the average transcript length and the median topic length

In [10]:
sum(text_lengths)/len(text_lengths) # this will be an issue because our city council transcripts are longer

216.28676470588235

In [11]:
statistics.median(topic_lengths) # median of the per-transcript average topic lengths, not the overall average

24.944444444444443
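
The median of per-transcript averages and the average over all topics pooled together can differ noticeably. A small sketch with made-up topic sizes (the numbers are illustrative only, not taken from AMI):

```python
import statistics

# Made-up topic sizes (utterances per topic) for three toy transcripts.
topic_sizes = [[10, 30], [5, 5, 5, 5], [100]]

# What the cell above computes: one mean per transcript, then the median.
per_transcript_means = [statistics.mean(sizes) for sizes in topic_sizes]
median_of_means = statistics.median(per_transcript_means)

# The alternative: pool every topic and take a single mean.
all_sizes = [size for sizes in topic_sizes for size in sizes]
pooled_mean = statistics.mean(all_sizes)

print(median_of_means, pooled_mean)
```

The first figure weights every transcript equally, while the pooled mean weights every topic equally, so transcripts with many short topics pull the pooled mean down.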

## Make AMI Transcripts match Council Data format

In [12]:
ami_ids=["AMI_"+str(x) for x in range(len(transcripts))]
sentences=[list(x['sentences']) for x in transcripts]
topic_counts=[list(x['topic_count']) for x in transcripts]

In [13]:
t_cd = create_test_data.transcript_pickle_to_pd()
t_ami = pd.DataFrame({'transcript_id':ami_ids,
    'sentences':sentences,
    'topic_counts':topic_counts,
    'length':text_lengths,
    'avg_topic_length':topic_lengths
    })

t_cd

Unnamed: 0,transcript_id,sentences
0,d0a7e5864959,"[And older woman Jocasta Zamarripa., Shortly, ..."
1,e9a7a8ac9081,"[Meeting., My name is Cavalier Johnson., I'm c..."
2,694b0e5b01a7,[Joining you this morning is Vice Chair Alderm...
3,ad734a167e5a,"[Our first meeting of 2020, the Judiciary and ..."
4,fe845b99f32e,"[Alderman Hamilton., Here., Kovach., Here., Ba..."
...,...,...
407,468bb3242311,[Commission for Tuesday September 1st 2020 at ...
408,16327f867c7e,[Our 2021 meeting of the Fire and Police Commi...
409,ea6416ba848c,"[Thank you, Mr. President., You know, my colle..."
410,d32da631854f,"[This meeting will come to order, this council..."


In [14]:
t_ami

Unnamed: 0,transcript_id,sentences,topic_counts,length,avg_topic_length
0,AMI_0,[welcome to the second meeting of this design...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",234,39.000000
1,AMI_1,"[ we don't have any changes ,, Forgot to inser...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",618,61.800000
2,AMI_2,"[ We're the first ones , Marketing Expert , ye...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, ...",97,19.400000
3,AMI_3,"[Good morning everybody , So , we are asked to...","[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",72,14.400000
4,AMI_4,"[This is our conceptual design meeting , I'll...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",103,11.444444
...,...,...,...,...,...
131,AMI_131,"[You all saw the newsflash ?, Or you got the s...","[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",429,35.750000
132,AMI_132,[we'll start off with a quick overview of the ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",105,8.750000
133,AMI_133,"[So we come again for the the second meeting ,...","[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...",31,4.428571
134,AMI_134,"[ Wouldn't wanna be Project Manager , , what...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",558,55.800000


In [27]:
results,labels,topics,doc_count = create_test_data.generate_segment(t_ami, doc_count_limit = 5, sentence_min = 25, supervised = True)  

In [28]:
print(doc_count)
print(set(topics))

3
{0, 1, 2, 3, 4, 5, 6, 1002, 1003, 1004, 1005, 2003, 2005, 2006}


In [29]:
test_data = pd.DataFrame(data={'caption':results,'label':labels,'meeting_id':1})
test_data['duration'] = test_data.caption.apply(lambda x: len(x.split(' ')))  # 1 word/s
test_data['end_time'] = test_data.duration.cumsum()
test_data['start_time'] = test_data.duration.cumsum() - test_data.duration
test_data = test_data[['meeting_id','start_time','end_time','caption','label']]
test_data = ds.preprocessing(test_data, 'caption')  # note that this adds (old) `index` column, but topic_segmentation uses actual index
test_data

Unnamed: 0,index,meeting_id,start_time,end_time,caption,label
0,0,1,0,8,I see you all find your places,0
1,1,1,8,18,Is everybody sitting on the right place ? Yeah ?,0
2,2,1,18,24,First I will introduce myself,0
3,3,1,24,35,"I don't know if if everybody knows me ,",0
4,4,1,35,44,let's start off with a little presentation,0
...,...,...,...,...,...,...
296,296,1,3487,3498,That can be like the turbo banana plus plus co...,2
297,297,1,3498,3503,Maybe objective banana ?,2
298,298,1,3503,3509,We'll see n next meeting,2
299,299,1,3509,3517,We have to go design the prototype,2
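
The timestamps in this table are synthetic: each caption is given a duration of one second per space-separated token, and start and end times come from a running sum. The sketch below reproduces that arithmetic in plain Python for two captions from the transcript. Because `split(' ')` also counts the empty strings produced by trailing or doubled spaces, the first caption gets a duration of 8 rather than 7; `split()` with no argument would count words only.

```python
from itertools import accumulate

# Two captions as they appear in the data, including the trailing space.
captions = ["I see you all find your places ", "First I will introduce myself "]

# One "second" per token, mirroring len(x.split(' ')) in the cell above.
durations = [len(c.split(" ")) for c in captions]
end_times = list(accumulate(durations))
start_times = [end - dur for end, dur in zip(end_times, durations)]

# For comparison: split() ignores empty strings from extra spaces.
word_counts = [len(c.split()) for c in captions]

print(durations, start_times, end_times, word_counts)
```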


In [40]:
test_data.caption.to_list()

['I see you all find your places ',
 'Is everybody sitting on the right place ? Yeah ?',
 'First I will introduce myself ',
 "I don't know if   if everybody knows me ,",
 "let's start off  with a little presentation ",
 " Now first I'll tell you a little bit about the setting ",
 'You can see there are a few cameras here ',
 "They'll record  our actions",
 "and you'll have wires and microphones that will record your voice ",
 ' there are also some microphones there',
 "but th  you don't have to pay a lot of attention on those ,",
 "because it will  disappear when you don't attend to it ",
 'is there a project docents folder ?',
 'There are some notes in it already I see , some docents ',
 " I'll start with the presentation kick off ",
 'Is being modified by the administrator ',
 "Let's do it read only ",
 "  , that's interesting ",
 "I don't know if you've noticed , but  we're working for Real Reaction ",
 " it's a company in  electronics ",
 'We put fashion in electronics ,',
 ' we pu

In [35]:
old_algorithm = types.BERTSegmentation(
    sentence_comparison_window=50,
    text_tiling=types.OriginalSegmentation(
        smoothing_passes=2,
        smoothing_window=1,
        topic_change_threshold=0.6,
        max_segments_cap=True,
        max_segments_cap__average_segment_length=120))

new_algorithm = types.BERTSegmentation(
    sentence_comparison_window=25,
    text_tiling=types.NewSegmentation(
        stdevs=1))

even_algorithm = types.EvenSegmentation(k=25)

random_algorithm = types.RandomSegmentation(random_threshold=0.99)
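
The two baselines can be pictured roughly as follows. This is a sketch based only on the parameter names: the internals of `types.EvenSegmentation` and `types.RandomSegmentation` are not shown in this notebook, and `even_boundaries`, `random_boundaries`, and the `seed` argument are hypothetical.

```python
import random

def even_boundaries(n_sentences, k):
    """Hypothetical even baseline: a boundary every k sentences."""
    return list(range(0, n_sentences + 1, k))

def random_boundaries(n_sentences, threshold, seed=0):
    """Hypothetical random baseline: a boundary wherever a uniform draw
    exceeds `threshold`, so 0.99 gives roughly one boundary per 100 sentences."""
    rng = random.Random(seed)
    return [i for i in range(n_sentences) if rng.random() > threshold]

print(even_boundaries(300, 25))
print(random_boundaries(300, 0.99))
```

For a 300-sentence input with `k=25`, the even baseline yields 0, 25, 50 and so on up to 300.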

In [36]:
eval.eval_topic_segmentation(
    topic_segmentation_algorithm=new_algorithm,
    input_df = test_data)

[108, 117, 125, 178, 196, 203, 221]
Pk on 1 meetings: 0.3788546255506608
WinDiff on 1 meetings: 0.8942731277533039


{'average_Pk_': 0.3788546255506608, 'average_windiff_': 0.8942731277533039}

In [37]:
eval.eval_topic_segmentation(
    topic_segmentation_algorithm=even_algorithm,
    input_df = test_data)

Even segmentation: [0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300]
Pk on 1 meetings: 0.3392070484581498
WinDiff on 1 meetings: 1.0


{'average_Pk_': 0.3392070484581498, 'average_windiff_': 1.0}

In [38]:
eval.eval_topic_segmentation(
    topic_segmentation_algorithm=random_algorithm,
    input_df = test_data)

Random segmentation: [19, 159]
Pk on 1 meetings: 0.4889867841409692
WinDiff on 1 meetings: 0.4889867841409692


{'average_Pk_': 0.4889867841409692, 'average_windiff_': 0.4889867841409692}
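
For reference, Pk is commonly defined as the fraction of sliding windows of width k in which the reference and the hypothesis disagree about whether the two ends of the window lie in the same segment, with k set by convention to half the average reference segment length. The sketch below implements that textbook definition on per-sentence labels; how `eval.eval_topic_segmentation` computes its scores is not shown here, so `pk` and `to_segment_ids` are illustrative helpers rather than the module's code.

```python
def to_segment_ids(labels):
    """Number segments 0, 1, 2, ... by starting a new one at every label change."""
    ids, current = [], 0
    for i, label in enumerate(labels):
        if i > 0 and label != labels[i - 1]:
            current += 1
        ids.append(current)
    return ids

def pk(reference_labels, hypothesis_labels, k=None):
    """Textbook Pk on two equal-length sequences of per-sentence labels."""
    ref = to_segment_ids(reference_labels)
    hyp = to_segment_ids(hypothesis_labels)
    if len(ref) != len(hyp) or len(ref) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(ref)
    if k is None:
        # Half the average reference segment length, at least 1.
        k = max(1, round(n / (ref[-1] + 1) / 2))
    windows = n - k
    if windows <= 0:
        raise ValueError("window size k must be smaller than the sequence")
    # A window is an error when only one side puts its two ends in the same segment.
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(windows)
    )
    return errors / windows

reference = [0, 0, 0, 0, 1, 1, 1, 1]
print(pk(reference, reference))   # identical segmentation
print(pk(reference, [0] * 8))     # hypothesis with no boundary at all
```

Lower is better: a perfect segmentation scores 0, and a hypothesis that misses the only boundary is penalised in every window that spans it.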