In [1]:
from unsupervised_topic_segmentation import core, eval, types, dataset
import create_test_data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [2]:
# read transcripts.pickle
with open('transcripts.pickle', 'rb') as handle:
    transcripts = pickle.load(handle)

Options:
- datasets: `types.TopicSegmentationDatasets.AMI`, `types.TopicSegmentationDatasets.ICSI` _(both need to be downloaded and implemented in `unsupervised_topic_segmentation/dataset.py`)_
- algorithms: `types.BERTSegmentation`, `types.RandomSegmentation`, `types.EvenSegmentation` _(SBERT is not implemented)_

Hyperparams to consider for `BERTSegmentation`:
- `PARALLEL_INFERENCE_INSTANCES = 20` in `core.py`: runs inference on multiple meetings at once.
- `MAX_SEGMENTS_CAP`: bool, default True. "Add a max segment limit so there are not too many segments"; in this implementation, "local maxima are sorted by depth_score value and we take only the first K where the K+1th local maxima is lower than the threshold". False "is the vanilla TextTiling used for Pk optimization".
- `MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH`: int, default 60. Used as the cap when `MAX_SEGMENTS_CAP` is True. This was originally meant to be in seconds (see the note below), but it can easily be made to refer to words or sentences.
- `TEXT_TILING`: additional tiling hyperparams. Set this to a `types.TextTilingHyperparameters` with the hyperparams listed below; the default is None, which raises an error.

Additional hyperparams in `TextTilingHyperparameters`:
- `SENTENCE_COMPARISON_WINDOW`: int default 15. Number of sentences in each non-overlapping window to consider as chunk.
- `SMOOTHING_PASSES`: int default 2. Number of smoothing iterations on similarity scores.
- `SMOOTHING_WINDOW`: int default 1. Neighborhood considered in smoothing similarity scores (unit is chunks).
- `TOPIC_CHANGE_THRESHOLD`: float default 0.6.
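
The tiling hyperparams above interact roughly as follows. This is a minimal sketch, not the code in `core.py`: the function names `smooth`, `depth_scores`, and `topic_changes` are made up for illustration, the input is assumed to be a series of similarity scores between adjacent chunks, and reading `TOPIC_CHANGE_THRESHOLD` as a fraction of the largest depth score is an assumption.

```python
import numpy as np


def smooth(scores, passes=2, window=1):
    """Moving average over `window` neighbours on each side, applied `passes` times."""
    out = np.asarray(scores, dtype=float)
    kernel = np.ones(2 * window + 1) / (2 * window + 1)
    for _ in range(passes):
        # Repeat the edge values so the output keeps the input length.
        padded = np.pad(out, window, mode="edge")
        out = np.convolve(padded, kernel, mode="valid")
    return out


def depth_scores(sims):
    """For each position, climb to the nearest peak on each side and sum both drops."""
    n = len(sims)
    depths = np.zeros(n)
    for i in range(n):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < n - 1 and sims[right + 1] >= sims[right]:
            right += 1
        depths[i] = (sims[left] - sims[i]) + (sims[right] - sims[i])
    return depths


def topic_changes(sims, passes=2, window=1, threshold=0.6):
    """Return positions whose depth score is a local maximum above the cutoff."""
    depths = depth_scores(smooth(sims, passes, window))
    if depths.max() <= 0:
        return []  # flat similarity curve: no valleys, so no topic changes
    cutoff = threshold * depths.max()  # assumption: threshold is relative to the max depth
    return [
        i
        for i in range(len(depths))
        if depths[i] >= cutoff and depths[i] == depths[max(0, i - 1): i + 2].max()
    ]


# Two clear dips in similarity, evaluated without smoothing.
similarities = [0.9, 0.9, 0.2, 0.9, 0.9, 0.9, 0.3, 0.9, 0.9]
print(topic_changes(similarities, passes=0, threshold=0.6))
```

More smoothing passes or a wider smoothing window flatten shallow dips, so fewer positions survive the threshold.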

Hyperparams for `RandomSegmentation`:
- `random_threshold`

Hyperparams for `EvenSegmentation`:
- `k`: Number of sentences per segment
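
The two baselines can be pictured with the sketch below. It is an illustration under assumptions, not the package's implementation: the function names are hypothetical, `random_threshold` is assumed to be compared against one uniform draw per sentence, and the `seed` argument is added here only to make the example reproducible.

```python
import random


def even_segmentation(n_sentences, k):
    """Start a new segment every k sentences."""
    return list(range(0, n_sentences, k))


def random_segmentation(n_sentences, random_threshold, seed=None):
    """Start a new segment wherever a uniform draw exceeds the threshold (assumed rule)."""
    rng = random.Random(seed)
    return [i for i in range(n_sentences) if rng.random() > random_threshold]


print(even_segmentation(716, 100))  # [0, 100, 200, 300, 400, 500, 600, 700]
print(random_segmentation(716, 0.99, seed=0))
```

Under this reading, a threshold of 0.99 places a boundary at about 1% of sentences on average.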

Note: when `MAX_SEGMENTS_CAP` is True, the original code uses the meeting duration as follows: each meeting's total duration is divided by `MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH` to determine the maximum number of segments.
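
As a rough illustration of the cap, the sketch below computes the segment limit and then keeps only the deepest candidates. The helper names `max_segments` and `cap_boundaries` are hypothetical, and the floor division and the minimum of one segment are assumptions; the source only says the duration is "divided by" the average segment length.

```python
def max_segments(meeting_duration, average_segment_length=60):
    """Upper bound on the number of segments for one meeting (assumed rounding: floor)."""
    return max(1, int(meeting_duration // average_segment_length))


def cap_boundaries(candidates, cutoff, limit):
    """Keep at most `limit` candidates, deepest first, that reach the cutoff.

    `candidates` is a list of (sentence_index, depth_score) pairs for local maxima.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    kept = [index for index, depth in ranked[:limit] if depth >= cutoff]
    return sorted(kept)  # report boundaries in document order


# A meeting 4360 units long with an average segment length of 120.
print(max_segments(4360, 120))  # 36

candidates = [(10, 0.9), (40, 0.5), (70, 0.8), (90, 0.2)]
print(cap_boundaries(candidates, cutoff=0.4, limit=2))  # [10, 70]
```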

In [3]:
old_algorithm = types.BERTSegmentation(
    sentence_comparison_window=50,
    text_tiling=types.OriginalSegmentation(
        smoothing_passes=2,
        smoothing_window=1,
        topic_change_threshold=0.6,
        max_segments_cap=True,
        max_segments_cap__average_segment_length=120))
new_algorithm = types.BERTSegmentation(
    sentence_comparison_window=50,
    text_tiling=types.NewSegmentation(
        stdevs=1))

new_algorithm

BERTSegmentation(SENTENCE_COMPARISON_WINDOW=50, TEXT_TILING=NewSegmentation(STDEVS=1))

To run inference only (no evaluation), call `core.topic_segmentation` with these arguments:
- `topic_segmentation_algorithm`: one of the algorithms above
- `df`: transcript pandas dataframe containing the columns named below
- `meeting_id_col_name`: str, name of the column identifying the meeting; splits are produced within each meeting
- `start_col_name`: str, name of the column with the caption start time (barely used, could be cut)
- `end_col_name`: str, name of the column with the caption end time (barely used, could be cut)
- `caption_col_name`: str, name of the column with the sentence strings

In [4]:
test_data = pd.DataFrame(data={'meeting_id': 'd0a7e5864959', 'caption': transcripts['d0a7e5864959']})
test_data = dataset.add_durations(test_data)  # "duration" column is just number of words
test_data = dataset.preprocessing(test_data, min_caption_len=5)  # note that this adds (old) `index` column, but topic_segmentation uses actual index
test_data

Unnamed: 0,index,meeting_id,caption,end_time,start_time
0,0,d0a7e5864959,And older woman Jocasta Zamarripa.,5,0
1,1,d0a7e5864959,"Shortly, Alderman Rainey and Alderman Hamilton...",15,5
2,2,d0a7e5864959,So let's get started.,19,15
3,3,d0a7e5864959,"Item nber 211814, reappointment of Brian Rott ...",36,19
4,4,d0a7e5864959,"Mr. Brian Rott, I see you.",42,36
...,...,...,...,...,...
447,454,d0a7e5864959,And that concludes our committee meeting today.,4349,4342
448,455,d0a7e5864959,Thank you.,4351,4349
449,456,d0a7e5864959,See you all in the community.,4357,4351
450,457,d0a7e5864959,Thanks to additionally.,4360,4357


In [5]:
core.topic_segmentation(
    topic_segmentation_algorithm=new_algorithm,
    df=test_data,
    meeting_id_col_name='meeting_id',
    start_col_name='start_time',
    end_col_name='end_time',
    caption_col_name='caption')

{'d0a7e5864959': [382, 427, 448]}

In [6]:
i = 112
test_data.iloc[(i-10):(i+10)]

Unnamed: 0,index,meeting_id,start_time,end_time,caption
102,103,d0a7e5864959,988,997,So we will do our grand opening in September
103,104,d0a7e5864959,997,1016,And I can let when that actual date will be o...
104,105,d0a7e5864959,1016,1018,"Excellent, excellent"
105,106,d0a7e5864959,1018,1034,Do me a favor and send an invitation to the co...
106,107,d0a7e5864959,1034,1038,We'll definitely do that
107,108,d0a7e5864959,1038,1044,"All right, anybody have any questions?"
108,109,d0a7e5864959,1044,1051,Any of my colleagues have any questions?
109,110,d0a7e5864959,1051,1058,We have been joined by Alderman Hamilton
110,111,d0a7e5864959,1058,1064,"Good to see you, Alderman Hamilton"
111,112,d0a7e5864959,1064,1070,"And I'll move approval, Mr"


In [6]:
results, labels, topics, doc_count = create_test_data.generate_segment()
test_data = pd.DataFrame(data={'caption': results, 'label': labels, 'meeting_id': 1})
test_data = dataset.add_durations(test_data)
test_data = test_data[['meeting_id','start_time','end_time','caption','label']]
test_data = dataset.preprocessing(test_data, 'caption')  # note that this adds (old) `index` column, but topic_segmentation uses actual index
test_data

Unnamed: 0,index,meeting_id,start_time,end_time,caption,label
0,1,1,2,17,"Presence are or could be Commissioner Crowder,...",0
1,2,1,17,50,Also present is Pardeep Kalika from the Sittle...,0
2,3,1,50,75,This is the second special meeting to involve ...,0
3,4,1,75,99,The Board of Commissioners of the Fire and Pol...,0
4,5,1,99,121,Those six candidates were interviewed by the m...,0
...,...,...,...,...,...,...
712,946,1,14236,14242,Any other questions by committee members?,5
713,947,1,14242,14260,"Hearing none, Alderman Coggs moves to hold to ...",5
714,948,1,14260,14267,Are there any objections to that motion?,5
715,949,1,14267,14271,"Hearing none, so ordered.",5


In [7]:
eval.eval_topic_segmentation(
    topic_segmentation_algorithm=new_algorithm,
    input_df=test_data,
    verbose=True)

[106, 114, 123, 161, 165, 350, 420, 566, 625]
Pk on 1 meetings: 0.6547987616099071
WinDiff on 1 meetings: 0.6934984520123839


{'average_Pk_': 0.6547987616099071, 'average_windiff_': 0.6934984520123839}
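
For orientation, Pk and WinDiff are both window-based error rates, so lower is better. The sketch below shows one common formulation; it is not the code in `eval.py`, the helper names are hypothetical, and the package's exact windowing and choice of `k` may differ. A boundary `b` is assumed to mean that a new segment starts at sentence `b`.

```python
def boundaries_to_segment_ids(boundaries, n):
    """Label each of n sentences with the id of the segment it falls in."""
    starts = {b for b in boundaries if 0 < b < n}
    ids, current = [], 0
    for i in range(n):
        if i in starts:
            current += 1
        ids.append(current)
    return ids


def pk(ref_ids, hyp_ids, k):
    """Share of windows where reference and hypothesis disagree on
    whether sentences i and i + k belong to the same segment."""
    n = len(ref_ids)
    errors = 0
    for i in range(n - k):
        same_ref = ref_ids[i] == ref_ids[i + k]
        same_hyp = hyp_ids[i] == hyp_ids[i + k]
        errors += same_ref != same_hyp
    return errors / (n - k)


def windowdiff(ref_ids, hyp_ids, k):
    """Share of windows where the number of boundaries differs."""
    n = len(ref_ids)
    errors = 0
    for i in range(n - k):
        # Segment ids grow by one per boundary, so the difference counts boundaries.
        ref_count = ref_ids[i + k] - ref_ids[i]
        hyp_count = hyp_ids[i + k] - hyp_ids[i]
        errors += ref_count != hyp_count
    return errors / (n - k)


reference = boundaries_to_segment_ids([5], 10)
hypothesis = boundaries_to_segment_ids([], 10)
print(pk(reference, hypothesis, k=2))          # 0.25
print(windowdiff(reference, hypothesis, k=2))  # 0.25
```

In this formulation every Pk error is also a WinDiff error, so WinDiff is never lower than Pk, which matches the pairs of scores reported in this notebook.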

In [8]:
even_algorithm = types.EvenSegmentation(k=100)
eval.eval_topic_segmentation(
    topic_segmentation_algorithm=even_algorithm,
    input_df=test_data,
    verbose=True)

Even segmentation: [0, 100, 200, 300, 400, 500, 600, 700]
Pk on 1 meetings: 0.5061919504643962
WinDiff on 1 meetings: 0.5727554179566563


{'average_Pk_': 0.5061919504643962, 'average_windiff_': 0.5727554179566563}

In [9]:
random_algorithm = types.RandomSegmentation(random_threshold=0.99)
eval.eval_topic_segmentation(
    topic_segmentation_algorithm=random_algorithm,
    input_df=test_data,
    verbose=True)

Random segmentation: [47, 56, 166, 185, 260, 513, 522, 580, 596, 684, 703, 715]
Pk on 1 meetings: 0.456656346749226
WinDiff on 1 meetings: 0.7461300309597523


{'average_Pk_': 0.456656346749226, 'average_windiff_': 0.7461300309597523}

In [10]:
iterations = 2
test_algorithm = types.BERTSegmentation(
    sentence_comparison_window=50,
    text_tiling=types.NewSegmentation(stdevs=1))
even_algorithm = types.EvenSegmentation(k=100)
random_algorithm = types.RandomSegmentation(random_threshold=0.99)

outputs = eval.multiple_eval(
    create_test_data.generate_segment,
    iterations, test_algorithm, even_algorithm, random_algorithm)

In [11]:
outputs

Unnamed: 0,test_pk,test_windiff,even_pk,even_windiff,random_pk,random_windiff,n_captions,n_segments
0,0.398795,0.468675,0.531325,0.692771,0.360241,0.66506,1119,5
1,0.060932,0.172043,0.734767,0.90681,0.218638,0.322581,491,3
