In [1]:
from unsupervised_topic_segmentation import core, eval, types, dataset
import create_test_data
import pandas as pd
import numpy as np
import pickle

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [2]:
# read transcripts.pickle
with open('transcripts.pickle', 'rb') as handle:
    transcripts = pickle.load(handle)

Options:
- datasets: `types.TopicSegmentationDatasets.AMI`, `types.TopicSegmentationDatasets.ICSI` _(these must be downloaded and their loaders implemented in unsupervised_topic_segmentation/dataset.py)_
- algorithms: `types.BERTSegmentation`, `types.RandomSegmentation`, `types.EvenSegmentation` _(SBERT not implemented)_

Hyperparams to consider for `BERTSegmentation`:
- `PARALLEL_INFERENCE_INSTANCES = 20` in `core.py`: infer multiple meetings at once
- `MAX_SEGMENTS_CAP`: bool, default True. "Add a max segment limit so there are not too many segments"; in this implementation, "local maxima are sorted by depth_score value and we take only the first K where the K+1th local maxima is lower then the threshold". False "is the vanilla TextTiling used for Pk optimization".
- `MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH`: int, default 60. Average segment length used to derive the cap when `MAX_SEGMENTS_CAP` is True. It was originally meant to be in seconds (see the note below), but it can easily be reinterpreted as words or sentences.
- `TEXT_TILING`: additional tiling hyperparams. Set this to a `types.TextTilingHyperparameters` instance with the hyperparams below; otherwise it defaults to None, which raises an error.

Additional hyperparams in `TextTilingHyperparameters`:
- `SENTENCE_COMPARISON_WINDOW`: int default 15. Number of sentences in each non-overlapping window that is treated as one chunk.
- `SMOOTHING_PASSES`: int default 2. Number of smoothing iterations applied to the similarity scores.
- `SMOOTHING_WINDOW`: int default 1. Neighborhood considered in smoothing similarity scores (unit is chunks).
- `TOPIC_CHANGE_THRESHOLD`: float default 0.6.
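
How these hyperparameters interact can be sketched in a few lines. This is a minimal pure-Python illustration of a TextTiling-style pipeline, not the code in `core.py`: the function names (`smooth`, `depth_scores`, `pick_boundaries`), the moving-average smoothing, and the scaling of the threshold by the maximum depth are all assumptions made for illustration. The input is a list of similarity scores between adjacent chunks.

```python
from typing import List


def smooth(scores: List[float], passes: int, window: int) -> List[float]:
    """Apply a moving average `passes` times; `window` is the number of
    neighbours taken on each side (assumed meaning of SMOOTHING_WINDOW)."""
    for _ in range(passes):
        smoothed = []
        for i in range(len(scores)):
            lo = max(0, i - window)
            hi = min(len(scores), i + window + 1)
            smoothed.append(sum(scores[lo:hi]) / (hi - lo))
        scores = smoothed
    return scores


def depth_scores(scores: List[float]) -> List[float]:
    """Depth at i: drop from the nearest peak on the left plus the drop
    from the nearest peak on the right."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for j in range(i - 1, -1, -1):  # climb left while non-decreasing
            if scores[j] < left:
                break
            left = scores[j]
        right = s
        for j in range(i + 1, len(scores)):  # climb right while non-decreasing
            if scores[j] < right:
                break
            right = scores[j]
        depths.append((left - s) + (right - s))
    return depths


def pick_boundaries(depths: List[float], threshold: float) -> List[int]:
    """Interior local maxima of the depth curve whose depth is at least
    threshold * max(depths) (assumed use of TOPIC_CHANGE_THRESHOLD)."""
    if not depths or max(depths) <= 0:
        return []
    cutoff = threshold * max(depths)
    return [
        i
        for i in range(1, len(depths) - 1)
        if depths[i] > depths[i - 1]
        and depths[i] > depths[i + 1]
        and depths[i] >= cutoff
    ]


# Toy similarity curve with one clear dip at position 3.
similarities = [0.9, 0.85, 0.9, 0.3, 0.88, 0.9, 0.87]
boundaries = pick_boundaries(depth_scores(similarities), threshold=0.6)
```

On the toy curve, only the dip at position 3 survives the threshold. The shallow dip at position 1 is a local maximum of the depth curve but falls below the cutoff. Calling `smooth(similarities, passes=2, window=1)` first would flatten small dips further before the depth scores are computed.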

Hyperparams for `RandomSegmentation`:
- `random_threshold`

Hyperparams for `EvenSegmentation`:
- `k`: Number of sentences per segment
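
The two baselines can be approximated as follows. This is a sketch under assumptions, not the library's implementation: `even_segmentation` and `random_segmentation` are hypothetical helper names, and the rule "start a segment when a uniform draw exceeds `random_threshold`" is a guess at how the threshold is used. The even variant does reproduce boundaries of the form `[0, 100, 200, 300]` for `k=100` on a transcript of a little over 300 sentences.

```python
import random
from typing import List, Optional


def even_segmentation(num_sentences: int, k: int) -> List[int]:
    """Start a new segment every k sentences."""
    return list(range(0, num_sentences, k))


def random_segmentation(
    num_sentences: int, random_threshold: float, seed: Optional[int] = None
) -> List[int]:
    """Mark sentence i as a segment start when a uniform draw in [0, 1)
    exceeds random_threshold (assumed rule)."""
    rng = random.Random(seed)
    return [i for i in range(num_sentences) if rng.random() > random_threshold]


even_boundaries = even_segmentation(307, k=100)
random_boundaries = random_segmentation(307, random_threshold=0.99, seed=0)
```

Under this rule, a threshold of 0.99 yields about one boundary per 100 sentences on average, so both baselines produce a comparable number of segments when `k=100`.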

Note: when `MAX_SEGMENTS_CAP` is True, the original code uses the meeting duration as follows: each meeting's total duration is divided by `MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH` to determine the maximum number of segments.
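
As a worked example of that cap, here is a minimal sketch. The helper names `max_segments` and `cap_boundaries` are hypothetical, and rounding down with integer division is an assumption rather than something confirmed from `core.py`.

```python
from typing import List


def max_segments(meeting_duration: float, average_segment_length: int) -> int:
    """Maximum number of segments allowed for one meeting.
    Integer division keeps the result usable as a slice bound."""
    return int(meeting_duration // average_segment_length)


def cap_boundaries(depths: List[float], candidates: List[int], cap: int) -> List[int]:
    """Keep the `cap` deepest candidate boundaries, returned in document order."""
    ranked = sorted(candidates, key=lambda i: depths[i], reverse=True)
    return sorted(ranked[:cap])


# A 4360-unit meeting with MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH = 120.
cap = max_segments(4360, 120)

# Toy depth curve: four candidate boundaries, only the two deepest are kept.
toy_depths = [0.0, 0.4, 0.0, 0.9, 0.0, 0.2, 0.0, 0.7, 0.0]
kept = cap_boundaries(toy_depths, candidates=[1, 3, 5, 7], cap=2)
```

With a duration of 4360 and an average segment length of 120, the cap is 36 segments. Because the duration column in this notebook counts words rather than seconds, the cap here effectively means "one segment per 120 words on average".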

In [3]:
algorithm = types.BERTSegmentation(
    text_tiling=types.TextTilingHyperparameters(
        sentence_comparison_window=50,
        smoothing_passes=2,
        smoothing_window=1,
        topic_change_threshold=0.6),
    max_segments_cap=True,
    max_segments_cap__average_segment_length=120)
algorithm

BERTSegmentation(TEXT_TILING=TextTilingHyperparameters(SENTENCE_COMPARISON_WINDOW=50, SMOOTHING_PASSES=2, SMOOTHING_WINDOW=1, TOPIC_CHANGE_THRESHOLD=0.6), MAX_SEGMENTS_CAP=True, MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH=120)

To run inference only (no evaluation), call `core.topic_segmentation` with these arguments:
- `topic_segmentation_algorithm`: choose from above
- `df`: transcript pandas dataframe with columns below
- `meeting_id_col_name`: str for col of meetings to produce splits within
- `start_col_name`: str for col with start time of caption (barely used, could cut)
- `end_col_name`: str for col with end time of caption (barely used, could cut)
- `caption_col_name`: str for col of sentence strings
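
The expected input layout can be illustrated without the library. The sketch below builds rows with the four columns listed above from a plain list of captions; `build_rows` is a hypothetical helper, and the one-word-per-time-unit timing is an assumption that mirrors how durations are approximated by word counts in this notebook.

```python
from typing import Dict, List


def build_rows(meeting_id: str, captions: List[str]) -> List[Dict[str, object]]:
    """Assign each caption a start/end time, treating one word as one time unit."""
    rows = []
    clock = 0
    for caption in captions:
        duration = len(caption.split())  # word count stands in for duration
        rows.append(
            {
                "meeting_id": meeting_id,
                "start_time": clock,
                "end_time": clock + duration,
                "caption": caption,
            }
        )
        clock += duration
    return rows


rows = build_rows("d0a7e5864959", ["So let's get started", "Thank you"])
```

Passing `rows` to `pd.DataFrame(rows)` would give a frame with the `meeting_id`, `start_time`, `end_time` and `caption` columns that the arguments above refer to.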

In [13]:
test_data = pd.DataFrame(data={'meeting_id':'d0a7e5864959','caption':transcripts['d0a7e5864959']})
test_data = dataset.add_durations(test_data)  # "duration" column is just number of words
test_data = dataset.preprocessing(test_data,min_caption_len=5)  # note that this adds (old) `index` column, but topic_segmentation uses actual index
test_data

Unnamed: 0,index,meeting_id,start_time,end_time,caption
0,0,d0a7e5864959,0,5,And older woman Jocasta Zamarripa
1,1,d0a7e5864959,5,15,"Shortly, Alderman Rainey and Alderman Hamilton..."
2,2,d0a7e5864959,15,19,So let's get started
3,3,d0a7e5864959,19,36,"Item nber 211814, reappointment of Brian Rott ..."
4,4,d0a7e5864959,36,42,"Brian Rott, I see you"
...,...,...,...,...,...
483,454,d0a7e5864959,4342,4349,And that concludes our committee meeting today
484,455,d0a7e5864959,4349,4351,Thank you
485,456,d0a7e5864959,4351,4357,See you all in the community
486,457,d0a7e5864959,4357,4360,Thanks to additionally


In [5]:
core.topic_segmentation(
    topic_segmentation_algorithm=algorithm,
    df=test_data,
    meeting_id_col_name='meeting_id',
    start_col_name='start_time',
    end_col_name='end_time',
    caption_col_name='caption')

[306 130  79  60]


In [6]:
i = 249
test_data.iloc[(i-10):(i+10)]

Unnamed: 0,index,meeting_id,start_time,end_time,caption
239,231,d0a7e5864959,2190,2231,", the Sculpture Milwaukee, Downtown Dining, Em..."
240,232,d0a7e5864959,2231,2244,It's been really a pleasure to be a part of al...
241,233,d0a7e5864959,2244,2265,"And for the past several years, I've worked on..."
242,234,d0a7e5864959,2265,2279,And along with the graffiti removal and landsc...
243,235,d0a7e5864959,2279,2298,"We removed 14,000 gallons of litter alone last..."
244,236,d0a7e5864959,2298,2311,It just makes Milwaukee so clean and nice when...
245,237,d0a7e5864959,2311,2314,It's very attractive
246,238,d0a7e5864959,2314,2329,"So it's been really, I've been really happy to..."
247,239,d0a7e5864959,2329,2334,How did y'all do that?
248,240,d0a7e5864959,2334,2344,"Mussock, how did y'all do all that cleaning up?"


In [7]:
# Open question: do we need to add sentence_comparison_window + 2 to the returned indices to get the true sentence indices? I don't see where the original code applies that offset.
k = algorithm.TEXT_TILING.SENTENCE_COMPARISON_WINDOW
test_data.iloc[(i+k+2-10):(i+k+2+10)]

Unnamed: 0,index,meeting_id,start_time,end_time,caption
291,282,d0a7e5864959,2753,2761,Do you have any questions by committee members?
292,283,d0a7e5864959,2761,2763,We don't
293,284,d0a7e5864959,2763,2772,"With that, owner woman Zamparipa moves for you..."
294,285,d0a7e5864959,2772,2777,"Hearing no objections, so ordered"
295,286,d0a7e5864959,2777,2781,"Thank you, Mr"
296,286,d0a7e5864959,2777,2781,Kennedy
297,287,d0a7e5864959,2781,2788,I will see you in the community
298,288,d0a7e5864959,2788,2798,"All right, nber eight, I mean, nber nine, excu..."
299,289,d0a7e5864959,2798,2821,"File nber 220049, reappointment of Jeff Welk t..."
300,289,d0a7e5864959,2798,2821,"Jeff Welk, are you available?"


In [8]:
results, labels, doc_count = create_test_data.generate_segment()
test_data = pd.DataFrame(data={'caption':results,'label':labels,'meeting_id':1})
test_data['duration'] = test_data.caption.apply(lambda x: len(x.split(' ')))  # 1 word/s
test_data['end_time'] = test_data.duration.cumsum()
test_data['start_time'] = test_data.duration.cumsum() - test_data.duration
test_data = test_data[['meeting_id','start_time','end_time','caption','label']]
test_data = dataset.preprocessing(test_data, 'caption')  # note that this adds (old) `index` column, but topic_segmentation uses actual index
test_data

Unnamed: 0,index,meeting_id,start_time,end_time,caption,label
0,0,1,0,6,"I am Chairwoman, Alderman Milena Akins",0
1,1,1,6,15,We are joined by Vice Chair Alderman Nick Kovach,0
2,2,1,15,23,We are also joined by Alderman Mark Borkowski,0
3,3,1,23,31,We are also joined by Alderman Nakia Dotton,0
4,4,1,31,39,We are also joined by Alderman Jocasta Zamaripa,0
...,...,...,...,...,...,...
303,397,1,4770,4775,"Obviously, Alderwoman Dodd is here",2
304,398,1,4775,4791,"She was excused earlier, so we want to recogni...",2
305,399,1,4791,4816,I think we've gotten all our discussion out of...,2
306,400,1,4816,4823,Will the clerk please call the roll?,2


In [9]:
eval.eval_topic_segmentation(
    topic_segmentation_algorithm=algorithm,
    input_df = test_data)

[176  14]


TypeError: list indices must be integers or slices, not float

In [None]:
even_algorithm = types.EvenSegmentation(k=100)
eval.eval_topic_segmentation(
    topic_segmentation_algorithm=even_algorithm,
    input_df = test_data)

Even segmentation: [0, 100, 200, 300]
Pk on 1 meetings: 0.4750830564784053
WinDiff on 1 meetings: 0.5083056478405316


{'average_Pk_': 0.4750830564784053, 'average_windiff_': 0.5083056478405316}

In [None]:
random_algorithm = types.RandomSegmentation(random_threshold=0.99)
eval.eval_topic_segmentation(
    topic_segmentation_algorithm=random_algorithm,
    input_df = test_data)

Random segmentation: [102, 146, 149, 201, 203, 308]
Pk on 1 meetings: 0.46179401993355484
WinDiff on 1 meetings: 0.5116279069767442


{'average_Pk_': 0.46179401993355484, 'average_windiff_': 0.5116279069767442}