In [1]:
from unsupervised_topic_segmentation import core, eval, types
import pandas as pd
import numpy as np
import pickle

In [3]:
# read transcripts.pickle
with open('transcripts.pickle', 'rb') as handle:
    transcripts = pickle.load(handle)

Options:
- datasets: `types.TopicSegmentationDatasets.AMI`, `types.TopicSegmentationDatasets.ICSI` _(each must be downloaded, and its loader implemented in unsupervised_topic_segmentation/dataset.py)_
- algorithms: `types.TopicSegmentationAlgorithm.BERT`, `types.TopicSegmentationAlgorithm.RANDOM`, `types.TopicSegmentationAlgorithm.EVEN`
- config: `types.TopicSegmentationConfig` _(defined in unsupervised_topic_segmentation/types.py; edit the defaults there)_

Hyperparameters to consider:
- `PARALLEL_INFERENCE_INSTANCES = 20` in `core.py`: number of meetings to run inference on at once.
- `MAX_SEGMENTS_CAP` in `TopicSegmentationConfig`: bool, default True. When True, a maximum segment count is applied "so there are not too many segments": local maxima are sorted by depth score and only the first K are kept, where the (K+1)th local maximum is lower than the threshold. False "is the vanilla TextTiling used for Pk optimization".
- `MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH` in `TopicSegmentationConfig`: int, default 60. Used to compute the cap when `MAX_SEGMENTS_CAP` is True. It was originally meant to be in seconds (see the note below).
- `TEXT_TILING` in `TopicSegmentationConfig`: additional tiling hyperparameters. Set it to a `types.TextTilingHyperparameters` instance with the hyperparameters listed below; it defaults to None, which raises an error.

Additional hyperparameters in `TextTilingHyperparameters`:
- `SENTENCE_COMPARISON_WINDOW`: int, default 15. Number of sentences in each non-overlapping window that is treated as one chunk.
- `SMOOTHING_PASSES`: int, default 2. Number of smoothing iterations applied to the similarity scores.
- `SMOOTHING_WINDOW`: int, default 1. Size of the neighborhood used when smoothing the similarity scores (unit is chunks).
- `TOPIC_CHANGE_THRESHOLD`: float, default 0.6.
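
To make `SMOOTHING_PASSES` and `SMOOTHING_WINDOW` concrete, here is a minimal sketch of repeated moving-average smoothing over a list of chunk similarity scores. It is an illustration only, not the code in `core.py`: the function name `smooth_scores` and the exact treatment of the edges are assumptions.

```python
from typing import List


def smooth_scores(scores: List[float], passes: int = 2, window: int = 1) -> List[float]:
    """Average each score with up to `window` neighbours on each side, `passes` times.

    Hypothetical helper for illustration; the library's own smoothing may differ.
    """
    smoothed = list(scores)
    for _ in range(passes):
        previous = smoothed
        smoothed = []
        for i in range(len(previous)):
            lo = max(0, i - window)                   # clip the window at the left edge
            hi = min(len(previous), i + window + 1)   # clip the window at the right edge
            neighbourhood = previous[lo:hi]
            smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed


similarities = [0.9, 0.1, 0.9, 0.9]
print(smooth_scores(similarities, passes=1, window=1))
```

More passes or a wider window flatten isolated dips in similarity, so fewer of them survive as candidate topic boundaries.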

Note: when `MAX_SEGMENTS_CAP` is True, the original code divides each meeting's total duration by `MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH` to determine the maximum number of segments for that meeting.
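
The cap described in this note can be sketched as follows. This is a hedged illustration rather than the library's implementation: `capped_boundaries` is a hypothetical name, `threshold` is taken here as an absolute depth-score value (how the library derives it from `TOPIC_CHANGE_THRESHOLD` is not covered in this notebook), and the duration and average segment length are assumed to share the same unit.

```python
from typing import List


def capped_boundaries(
    depth_scores: List[float],
    meeting_duration: float,
    average_segment_length: float = 60,
    threshold: float = 0.0,
) -> List[int]:
    """Return the indices of the local maxima kept as topic boundaries under the cap."""
    # Maximum number of segments: total duration divided by the average segment length.
    max_segments = int(meeting_duration / average_segment_length)
    # Local maxima: strictly higher than both neighbours.
    maxima = [
        i
        for i in range(1, len(depth_scores) - 1)
        if depth_scores[i - 1] < depth_scores[i] > depth_scores[i + 1]
    ]
    # Deepest first, drop maxima below the threshold, then keep at most max_segments.
    maxima.sort(key=lambda i: depth_scores[i], reverse=True)
    kept = [i for i in maxima if depth_scores[i] >= threshold][:max_segments]
    return sorted(kept)


scores = [0.0, 0.8, 0.1, 0.5, 0.2, 0.9, 0.0]
# A duration of 120 with an average segment length of 60 allows at most 2 boundaries.
print(capped_boundaries(scores, meeting_duration=120, threshold=0.3))  # -> [1, 5]
```

With a longer meeting the cap loosens, and the threshold alone decides how many local maxima are kept.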

In [6]:
# Field names are upper-case, matching the repr printed below.
config = types.TopicSegmentationConfig(
    TEXT_TILING=types.TextTilingHyperparameters(
        SENTENCE_COMPARISON_WINDOW=15,
        SMOOTHING_PASSES=2,
        SMOOTHING_WINDOW=1,
        TOPIC_CHANGE_THRESHOLD=0.6),
    MAX_SEGMENTS_CAP=True,
    MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH=60)
config

TopicSegmentationConfig(TEXT_TILING=TextTilingHyperparameters(SENTENCE_COMPARISON_WINDOW=15, SMOOTHING_PASSES=2, SMOOTHING_WINDOW=1, TOPIC_CHANGE_THRESHOLD=0.6), MAX_SEGMENTS_CAP=True, MAX_SEGMENTS_CAP__AVERAGE_SEGMENT_LENGTH=60)

To run inference only (no evaluation), call `core.topic_segmentation` with these arguments:
- `topic_segmentation_algorithm`: one of the algorithms listed above
- `df`: transcript pandas DataFrame containing the columns named below
- `meeting_id_col_name`: str, name of the column identifying the meeting; splits are produced within each meeting
- `start_col_name`: str, name of the column with the caption start time (barely used, could be cut)
- `end_col_name`: str, name of the column with the caption end time (barely used, could be cut)
- `caption_col_name`: str, name of the column holding the sentence strings
- `topic_segmentation_config`: a `TopicSegmentationConfig` object (see above)
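
As a rough sketch of how these arguments fit together, the helper below collects them into one dict that can be unpacked into the call. The helper name `build_segmentation_kwargs` and the four column names are hypothetical placeholders; substitute the columns your transcript DataFrame actually has. The commented-out call assumes `transcripts` is that DataFrame and `config` is the object built earlier.

```python
from typing import Any, Dict


def build_segmentation_kwargs(algorithm: Any, df: Any, config: Any) -> Dict[str, Any]:
    """Collect the keyword arguments listed above into one dict (illustrative helper)."""
    return {
        "topic_segmentation_algorithm": algorithm,
        "df": df,
        "meeting_id_col_name": "meeting_id",  # hypothetical column name
        "start_col_name": "start_time",       # hypothetical column name
        "end_col_name": "end_time",           # hypothetical column name
        "caption_col_name": "caption",        # hypothetical column name
        "topic_segmentation_config": config,
    }


# In the notebook, with `types`, `core`, `transcripts`, and `config` defined as above:
# kwargs = build_segmentation_kwargs(
#     types.TopicSegmentationAlgorithm.BERT, transcripts, config)
# result = core.topic_segmentation(**kwargs)
```

Keeping the column names in one place makes it easier to reuse the same call for a different transcript table.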