AMI and ICSI JSON Loading

In [1]:
import json
import os
import pandas as pd
import statistics
import unsupervised_topic_segmentation.dataset as ds
import create_test_data

Find the JSON paths and load the files into lists

In [2]:
%ls

README.md                        environment_op.yml
Untitled.ipynb                   icsi_dataset/
__pycache__/                     parliamentary_dataset/
ami-and-icsi-corpora-master/     solbiati_experiments.ipynb
ami_isci_json_loading.ipynb      test_data_create.ipynb
create_test_data.py              transcripts.pickle
download_data.ipynb              unsupervised_topic_segmentation/
environment.yml


# ICSI
Not using ICSI at the moment

In [3]:
# Not using ICSI at the moment; loaded only to compare transcript lengths with AMI.
icsi_path = 'ami-and-icsi-corpora-master/icsi-corpus/output/dialogueActs/'

text_jsons_isci = []

# Keep only the JSON files in the dialogue-act directory.
file_paths = [f for f in os.listdir(icsi_path) if f.endswith('.json')]

for file_path in file_paths:
    # Load the dialogue acts stored in this file.
    with open(os.path.join(icsi_path, file_path)) as file:
        text_jsons_isci.append(json.load(file))

print(len(text_jsons_isci))


75


In [4]:
text_lengths_isci=[len(x) for x in text_jsons_isci]
statistics.median(text_lengths_isci)

1454

ICSI has fewer transcripts than AMI (75 versus 136), but it might suit our purposes better because its transcripts are longer: a median of 1454 utterances before cleaning, compared with 251 in AMI after cleaning.

# AMI
Using AMI instead

In [5]:
# AMI loading
text_path = 'ami-and-icsi-corpora-master/ami-corpus/output/dialogueActs/'
topic_path = 'ami-and-icsi-corpora-master/ami-corpus/output/topics/'

text_jsons = []
topic_jsons = []

# Use the topic directory as the file index so text and topic files stay paired.
file_paths = [f for f in os.listdir(topic_path) if f.endswith('.json')]

for file_path in file_paths:
    # The text load could probably be dropped, because the topic JSON already contains the text.
    with open(os.path.join(text_path, file_path)) as file:
        text_jsons.append(json.load(file))

    with open(os.path.join(topic_path, file_path)) as file:
        topic_jsons.append(json.load(file))

print(len(text_jsons) == len(topic_jsons))
len(topic_jsons)


FileNotFoundError: [Errno 2] No such file or directory: 'ami-and-icsi-corpora-master/ami-corpus/output/topics/'
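
The `FileNotFoundError` above names the topics directory itself, so it appears to come from `os.listdir` rather than from opening an individual file. A minimal sketch of a loader that checks for the directory first and reports the resolved path; `load_json_dir` is a hypothetical helper, not part of this notebook or of `unsupervised_topic_segmentation`.

```python
import json
from pathlib import Path


def load_json_dir(directory):
    """Load every .json file in `directory`, keyed by file name.

    Hypothetical helper, shown for illustration only.
    """
    directory = Path(directory)
    if not directory.is_dir():
        # Fail early with the resolved path so a wrong working directory is obvious.
        raise FileNotFoundError(
            f"Expected a directory of JSON files at {directory.resolve()}"
        )
    loaded = {}
    for path in sorted(directory.glob("*.json")):
        with path.open(encoding="utf-8") as file:
            loaded[path.name] = json.load(file)
    return loaded
```

Keying the result by file name would also make it easy to confirm that each dialogue-act file has a matching topic file before pairing them.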

# Viewing AMI

In [6]:
# text_jsons[0] is a list of dialogue acts; the utterance text is held under 'text'
print({x['label'] for x in text_jsons[0]})    # set of labels
print({x['speaker'] for x in text_jsons[0]})  # set of speakers

{'bck', 'und', 'be.pos', 'el.sug', 'off', 'el.inf', 'sug', 'oth', 'el.ass', 'el.und', 'fra', 'inf', 'stl', 'ass'}
{'C', 'A', 'B', 'D'}


Notice that the text contains a lot of filler, for example: <vocalsound>, um, uh, mm-hmm, uh-uh, oh.

The sentence structure also differs from our council text. There are many shorter sentences, some utterances are interrupted, and others contain multiple sentences.
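
To illustrate the kind of clean-up this filler calls for, here is a minimal sketch that strips angle-bracket tags such as `<vocalsound>` and a short list of filler words. It is not the implementation of `ds.preprocessing`; `strip_fillers` and `EXAMPLE_FILLERS` are hypothetical names, and the filler list is an assumption.

```python
import re

# Assumed filler list, for illustration only.
EXAMPLE_FILLERS = ["mm-hmm", "uh-uh", "um", "uh", "oh"]

# Angle-bracket tags such as <vocalsound>.
TAG_PATTERN = re.compile(r"<[^>]+>")

# Whole-token fillers; the lookarounds keep "um" from matching inside "umbrella".
FILLER_PATTERN = re.compile(
    r"(?<![\w-])(?:" + "|".join(re.escape(f) for f in EXAMPLE_FILLERS) + r")(?![\w-])",
    flags=re.IGNORECASE,
)


def strip_fillers(utterance):
    """Remove tags and filler words, then collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(" ", utterance)
    without_fillers = FILLER_PATTERN.sub(" ", without_tags)
    return " ".join(without_fillers.split())
```

Stray punctuation left behind by a removed filler (for example a leading comma) is not handled here and would need its own rule.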
    

In [7]:
text = [x['text'] for x in text_jsons[0]]
text_clean = ds.preprocessing(pd.DataFrame({'sentences':text}),'sentences')
text_clean.shape #previously 410,2

(334, 2)

Topics 

In [8]:
print(topic_jsons[0][0].keys())
print(set([x['topic'] for x in topic_jsons[0]])) 


dict_keys(['id', 'topic', 'description', 'dialogueacts', 'subtopics'])
{'closing', 'opening', 'interface specialist presentation', 'industrial designer presentation', 'discussion', 'marketing expert presentation'}


# Clean AMI

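This cell takes about 3.7 seconds to run. Keep the `.copy()` on `FILLERS`: it hands `ds.preprocessing` a fresh list for every transcript, so any change the function makes to that list cannot carry over to the next iteration.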
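
The sketch below illustrates what can go wrong without a `.copy()` when a callee modifies the list it receives. `consume_fillers` is a made-up stand-in; whether `ds.preprocessing` actually mutates its argument is an assumption here, not something this notebook confirms.

```python
def consume_fillers(fillers):
    """Made-up stand-in that mutates its argument by popping every item."""
    seen = []
    while fillers:
        seen.append(fillers.pop())
    return seen


EXAMPLE_FILLERS = ["um", "uh", "oh"]

# Passing a copy leaves the original list intact for the next call.
first = consume_fillers(EXAMPLE_FILLERS.copy())
second = consume_fillers(EXAMPLE_FILLERS.copy())

# Passing the list itself empties it, so a later call sees no fillers at all.
shared = ["um", "uh", "oh"]
consume_fillers(shared)
after_shared_call = consume_fillers(shared)
```

With copies, `first` and `second` hold the same three fillers; with the shared list, `after_shared_call` is empty because the first call consumed everything.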

In [9]:
transcripts = []
text_lengths = []
topic_lengths = []  # can remove later
FILLERS = ["um", "uh", "oh", "hmm", "mm-hmm", "uh-uh", "you know"]

for topic_json in topic_jsons:

    # True when the transcript has more than one distinct topic label.
    has_topic_desc = len({x['topic'] for x in topic_json}) > 1

    # One frame per topic segment; topic_count is the segment's position in the transcript.
    topic_frames = []
    for index, topic in enumerate(topic_json):
        text = [x['text'] for x in topic['dialogueacts']]
        topic_frames.append(pd.DataFrame({'sentences': text, 'topic_count': index,
                                          'topic_desc': topic['topic'], 'has_topic_desc': has_topic_desc}))
    df_temp = pd.concat(topic_frames)

    # Pass a copy of FILLERS so each transcript starts from the full list.
    df_clean = ds.preprocessing(df_temp, 'sentences', FILLERS.copy(), min_caption_len=20)

    text_lengths.append(len(df_clean))  # utterances left after cleaning
    topic_lengths.append(df_clean.groupby(['topic_count']).size().mean())  # mean utterances per topic
    transcripts.append(df_clean)

df_clean  # cleaned frame of the last transcript

Unnamed: 0,index,sentences,topic_count,topic_desc,has_topic_desc
0,2,This is our first team meeting,0,,False
1,3,"I'll be your Project Manager for today , for t...",0,,False
2,5,will be giving this presentation for you to k...,0,,False
3,7,that's the agenda for today,0,,False
4,9,"of course we're new to each other ,",0,,False
...,...,...,...,...,...
187,1,the the personal coach will give you the your ...,3,,False
188,2,So we'll just meet back in here thirty minutes,3,,False
189,3,I'm sure we have that,3,,False
190,9,thanks for attending,3,,False


Calculate the average transcript length and the average topic length

In [10]:
sum(text_lengths)/len(text_lengths)  # mean cleaned transcript length; this will be an issue because our city council transcripts are longer

216.28676470588235

In [11]:
statistics.median(topic_lengths)  # median of the per-transcript mean topic lengths, not the overall mean

24.944444444444443
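
The distinction in the comment above matters when transcripts differ in how many topics they contain. A small sketch with made-up numbers (not AMI data) contrasts the median of per-transcript mean topic lengths with the mean over all topics pooled together.

```python
import statistics

# Made-up topic lengths (in utterances) for three toy transcripts.
toy_topic_lengths = [
    [10, 10],
    [20, 40],
    [5, 5, 5, 5, 100],
]

# Mean topic length within each transcript, then the median across transcripts.
per_transcript_means = [statistics.mean(lengths) for lengths in toy_topic_lengths]
median_of_means = statistics.median(per_transcript_means)

# Mean over every topic, ignoring which transcript it came from.
all_lengths = [length for lengths in toy_topic_lengths for length in lengths]
pooled_mean = statistics.mean(all_lengths)
```

Here the per-transcript means are 10, 30 and 24, so `median_of_means` is 24, while `pooled_mean` is 200/9 (about 22.2) because the third transcript contributes five topics to the pool.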

## Make AMI Transcripts match Council Data format

In [12]:
ami_ids=["AMI_"+str(x) for x in range(len(transcripts))]
sentences=[list(x['sentences']) for x in transcripts]
topic_counts=[list(x['topic_count']) for x in transcripts]

In [13]:
t_cd = create_test_data.transcript_pickle_to_pd()
t_ami = pd.DataFrame({'transcript_id':ami_ids,
    'sentences':sentences,
    'topic_counts':topic_counts,
    'length':text_lengths,
    'avg_topic_length':topic_lengths
    })

t_cd

Unnamed: 0,transcript_id,sentences
0,d0a7e5864959,"[And older woman Jocasta Zamarripa., Shortly, ..."
1,e9a7a8ac9081,"[Meeting., My name is Cavalier Johnson., I'm c..."
2,694b0e5b01a7,[Joining you this morning is Vice Chair Alderm...
3,ad734a167e5a,"[Our first meeting of 2020, the Judiciary and ..."
4,fe845b99f32e,"[Alderman Hamilton., Here., Kovach., Here., Ba..."
...,...,...
407,468bb3242311,[Commission for Tuesday September 1st 2020 at ...
408,16327f867c7e,[Our 2021 meeting of the Fire and Police Commi...
409,ea6416ba848c,"[Thank you, Mr. President., You know, my colle..."
410,d32da631854f,"[This meeting will come to order, this council..."


In [14]:
t_ami

Unnamed: 0,transcript_id,sentences,topic_counts,length,avg_topic_length
0,AMI_0,"[Now I have my screen back too , we have pre...","[0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",230,38.333333
1,AMI_1,"[This is our third meeting already , I hope yo...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",78,11.142857
2,AMI_2,"[Everybody found his place again ? Yeah ?, thi...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",90,12.857143
3,AMI_3,"[Good morning everybody , I'm glad you could ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, ...",135,16.875000
4,AMI_4,"[English from now on , Where are are all the ...","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",251,22.818182
...,...,...,...,...,...
131,AMI_131,"[start of the first meeting , Right , so agend...","[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, ...",180,20.000000
132,AMI_132,"[I wanna find our if our remote works , here...","[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, ...",232,38.666667
133,AMI_133,[So do we need to re-train Mike on how to put ...,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, ...",84,12.000000
134,AMI_134,"[That's as far as it goes , good morning eve...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",152,38.000000
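
Segmentation output is often compared against boundary positions rather than per-sentence topic ids. As a sketch, the hypothetical helper `topic_counts_to_boundaries` (not part of `create_test_data`) converts a `topic_counts` list like those in `t_ami` into a 0/1 list that marks the first sentence of each new topic.

```python
def topic_counts_to_boundaries(topic_counts):
    """Return 1 where a sentence starts a new topic, else 0.

    The first sentence is marked 0 because it starts the transcript,
    not a boundary between two topics.
    """
    boundaries = [0] * len(topic_counts)
    for position in range(1, len(topic_counts)):
        if topic_counts[position] != topic_counts[position - 1]:
            boundaries[position] = 1
    return boundaries


# Toy input in the same shape as one row of the topic_counts column.
example_counts = [0, 0, 0, 1, 1, 2]
example_boundaries = topic_counts_to_boundaries(example_counts)
```

Because the helper only compares neighbouring values, it also works when a transcript's first remaining topic id is not 0, as in `AMI_3` above.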


In [15]:
hold=create_test_data.generate_segment(t_ami, doc_count_limit = 5, sentence_min = 20, supervised = True)  

In [19]:
print(hold[3])
print(set(hold[2]))

4
{0, 1, 1003, 1004, 1005, 2001, 2002, 3002, 3003, 3004, 3005, 3006}
