# Linear text segmentation

<!-- {{ add_binder_block(page) }} -->

## Introduction

Linear text segmentation consists of dividing a text into several meaningful segments.
It can be seen as a change point detection task and can therefore be carried out with `ruptures`.
This example does exactly that on a well-known data set introduced in [[Choi2000](#Choi2000)].

### Setup

First we import packages and define a few utility functions.
This section can be skipped at first reading.

**Library imports.**

In [None]:
import re  # For regular expression
from textwrap import wrap  # Format text output nicely
from pathlib import Path

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import nltk
import numpy as np
import ruptures as rpt  # our package
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import regexp_tokenize
from ruptures.base import BaseCost
from ruptures.exceptions import NotEnoughPoints  # raised by the custom cost below
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
nltk.download("stopwords")
STOPWORD_SET = set(
    stopwords.words("english")
)  # set of stopwords of the English language
PUNCTUATION_SET = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

**Utility functions.**

In [None]:
def fig_ax(figsize=(15, 5), dpi=150):
    """Return a (matplotlib) figure and ax objects with given size."""
    return plt.subplots(figsize=figsize, dpi=dpi)

In [None]:
def load_original_text(filepath: Path) -> (list, list):
    """Read a file and return the text and the paragraphs' boundaries.

    The text is returned as a list of sentences.
    The paragraphs' boundaries are returned as a list of indexes.
    """
    list_of_excerpts = filepath.read_text().strip("==========\n").split("==========")
    true_bkps = list()
    original_text = list()
    for excerpt in list_of_excerpts:
        list_of_sentences = excerpt.strip("\n").split("\n")
        true_bkps.append(len(list_of_sentences))
        original_text.extend(list_of_sentences)
    true_bkps = np.cumsum(true_bkps).tolist()

    return original_text, true_bkps

In [None]:
def preprocess(list_of_sentences: list) -> list:
    """Preprocess each sentence (remove punctuation, stopwords, then stemming.)"""
    transformed = list()
    for sentence in list_of_sentences:
        ps = PorterStemmer()
        # raw string avoids an invalid escape sequence warning for "\w"
        list_of_words = regexp_tokenize(text=sentence.lower(), pattern=r"\w+")
        list_of_words = [
            ps.stem(word) for word in list_of_words if word not in STOPWORD_SET
        ]
        transformed.append(" ".join(list_of_words))
    return transformed

## Data

**Description**

The text to segment is a concatenation of excerpts from ten different documents randomly selected from the so-called Brown corpus (described [here](http://icame.uib.no/brown/bcm.html)).
Each excerpt has nine to eleven sentences, amounting to 99 sentences in total.
The complete text is shown in [Appendix A](#Appendix-A).

These data stem from a larger data set which is thoroughly described in [[Choi2000](#Choi2000)] and can be downloaded [here](https://web.archive.org/web/20030206011734/http://www.cs.man.ac.uk/~mary/choif/software/C99-1.2-release.tgz).
This is a common benchmark to evaluate text segmentation methods.

In [None]:
# Loading the text
filepath = Path("../data/0.ref")
original_text, true_bkps = load_original_text(filepath=filepath)

print(f"There are {len(original_text)} sentences, from {len(true_bkps)} documents.")

The objective is to automatically recover the boundaries of the 10 excerpts, using the fact that they come from quite different documents and therefore have distinct topics.
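
A side note on conventions: `load_original_text` stores the boundaries as cumulative sentence counts, so the last entry equals the total number of sentences. The following minimal sketch illustrates this with made-up excerpt lengths (the values and variable names are hypothetical, not taken from the data set).

```python
from itertools import accumulate

# Hypothetical excerpt lengths (number of sentences per excerpt).
excerpt_lengths = [3, 2, 4]

# Boundaries are cumulative counts; the last one is the total number of sentences.
toy_bkps = list(accumulate(excerpt_lengths))
print(toy_bkps)  # [3, 5, 9]

# Each excerpt covers the half-open range [start, end) of sentence indexes.
toy_segments = list(zip([0] + toy_bkps[:-1], toy_bkps))
print(toy_segments)  # [(0, 3), (3, 5), (5, 9)]
```

With this convention, ten excerpts correspond to nine change points plus one final index marking the end of the text.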

For instance, in the short extract printed in the following cell, an accurate text segmentation procedure should detect that the first two sentences (10 and 11) and the last three sentences (12 to 14) belong to two different documents and have very different semantic fields.

<!--
<p align="center">
  <img width="50%" src="/images/choi_example.png">
</p>
-->

In [None]:
# print 5 sentences from the original text
start, end = 9, 14
for (line_number, sentence) in enumerate(original_text[start:end], start=start + 1):
    sentence = sentence.strip("\n")
    print(f"{line_number:>2}: {sentence}")

**Preprocessing**

Before performing text segmentation, the original text is preprocessed.
In a nutshell (see [[Choi2000](#Choi2000)] for more detail),

- the punctuation and stopwords are removed;
- words are reduced to their stems (e.g., "waited" and "waiting" become "wait");
- a vector of word counts is computed.
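
To make the last step concrete, here is a minimal pure-Python sketch of a word-count representation. It assumes toy sentences that are already lower-cased, stripped of stopwords and stemmed; the sentences and the helper `count_vector` are illustrative only and are not part of the notebook's pipeline.

```python
from collections import Counter

# Toy sentences, assumed to be already preprocessed (hypothetical data).
sentences = ["wait train station", "train late", "cat sleep sofa cat"]

# Vocabulary: sorted list of the distinct words across all sentences.
vocabulary = sorted({word for sentence in sentences for word in sentence.split()})


def count_vector(sentence, vocabulary):
    """Return the word counts of `sentence`, ordered like `vocabulary`."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]


vectors = [count_vector(sentence, vocabulary) for sentence in sentences]
print(vocabulary)  # ['cat', 'late', 'sleep', 'sofa', 'station', 'train', 'wait']
for vector in vectors:
    print(vector)
# [0, 0, 0, 0, 1, 1, 1]
# [0, 1, 0, 0, 0, 1, 0]
# [2, 0, 1, 1, 0, 0, 0]
```

Most entries are zero because each sentence uses only a small part of the vocabulary.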

In [None]:
# transform text
transformed_text = preprocess(original_text)
# print original and transformed
ind = 97
print("Original sentence:")
print(f"\t{original_text[ind]}")
print()
print("Transformed:")
print(f"\t{transformed_text[ind]}")

In [None]:
# Once the text is preprocessed, each sentence is transformed into a vector of word counts.
vectorizer = CountVectorizer(analyzer="word")
vectorized_text = vectorizer.fit_transform(transformed_text)

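# `get_feature_names` was removed in recent scikit-learn versions; use `get_feature_names_out`
feature_names = vectorizer.get_feature_names_out()
msg = f"There are {len(feature_names)} different words in the corpus, e.g. {feature_names[20:30].tolist()}."
print(msg)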

Note that the vectorized text representation is a (very) sparse matrix.

## Text segmentation

To compare (the vectorized representation of) two sentences, [[Choi2000](#Choi2000)] uses the cosine similarity $k_{\text{cosine}}: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$:

$$ k_{\text{cosine}}(x, y) := \frac{\langle x \mid y \rangle}{\|x\|\|y\|} $$

where $x$ and $y$ are two $d$-dimensional vectors.
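
As a small worked example of this definition, the sketch below computes the cosine similarity of toy count vectors in pure Python. The function name `cosine_similarity_dense` and the vectors are assumptions made for illustration; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import math


def cosine_similarity_dense(x, y):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


u = [1, 1, 0, 0]
v = [2, 2, 0, 0]  # same words as u, twice as frequent
w = [0, 0, 3, 1]  # no word in common with u

print(round(cosine_similarity_dense(u, v), 3))  # 1.0
print(round(cosine_similarity_dense(u, w), 3))  # 0.0
```

The similarity depends only on the direction of the count vectors, not on their length, so a long and a short sentence using the same words in the same proportions are considered identical.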

Text segmentation now amounts to a kernel change point detection (see LINK for more details).
The cosine kernel is available in `ruptures`, but the current implementation does not exploit the sparse structure of the vectorized text representation and can therefore be slow.
For this reason, we create a [custom cost function](../../user-guide/costs/costcustom).

Let $y=\{y_0, y_1,\dots,y_{T-1}\}$ be a $d$-dimensional signal with $T$ samples.
Recall that a cost function $c(\cdot)$ that derives from a kernel $k(\cdot, \cdot)$ is such that

$$
c(y_{a..b}) = \sum_{t=a}^{b-1} G_{t, t} - \frac{1}{b-a} \sum_{a \leq s < b } \sum_{a \leq t < b} G_{s,t}
$$

where $y_{a..b}$ is the subsignal $\{y_a, y_{a+1},\dots,y_{b-1}\}$ and $G_{s,t}:=k(y_s, y_t)$.
In other words, $(G_{s,t})_{s,t}$ is the $T\times T$ Gram matrix of $y$.
Thanks to this formula, we can now implement our custom cost function (named `CosineCost` in the following cell).

In [None]:
class CosineCost(BaseCost):
    """Cost derived from the cosine similarity."""

    # The 2 following attributes must be specified for compatibility.
    model = "custom_cosine"
    min_size = 2

    def fit(self, signal):
        """Set the internal parameter."""
        self.signal = signal
        self.gram = cosine_similarity(signal, dense_output=False)
        return self

    def error(self, start, end) -> float:
        """Return the approximation cost on the segment [start:end].

        Args:
            start (int): start of the segment
            end (int): end of the segment
        Returns:
            segment cost
        Raises:
            NotEnoughPoints: when the segment is too short (less than `min_size` samples).
        """
        if end - start < self.min_size:
            raise NotEnoughPoints
        sub_gram = self.gram[start:end, start:end]
        val = sub_gram.diagonal().sum()
        val -= sub_gram.sum() / (end - start)
        return val
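
As a sanity check of the cost formula above, the following minimal sketch evaluates it by hand on a made-up $4\times 4$ Gram matrix. It is pure Python and independent of `ruptures`; the name `kernel_cost` and the toy matrix are assumptions made for illustration.

```python
def kernel_cost(gram, a, b):
    """Cost of the segment y[a:b], computed from the Gram matrix `gram`."""
    diag = sum(gram[t][t] for t in range(a, b))
    total = sum(gram[s][t] for s in range(a, b) for t in range(a, b))
    return diag - total / (b - a)


# Toy Gram matrix: samples 0-1 are identical, samples 2-3 are identical,
# and the two groups are orthogonal to each other.
toy_gram = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]

print(kernel_cost(toy_gram, 0, 2))  # 0.0 (homogeneous segment)
print(kernel_cost(toy_gram, 2, 4))  # 0.0 (homogeneous segment)
print(kernel_cost(toy_gram, 0, 4))  # 2.0 (segment mixing both groups)
```

Splitting the toy signal at index 2 gives a total cost of 0, whereas keeping a single segment costs 2: the cost rewards segments whose samples are similar to each other.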

### Compute similarity matrix

In [None]:
# Cosine similarity between every pair of sentences,
# returned as a dense array of shape (n_sentences, n_sentences)
similarities = cosine_similarity(vectorized_text)

### Display similarity matrix

In [None]:
fig, ax = fig_ax((4, 4))
plt.imshow(-np.log(similarities), cmap=cm.plasma)
ax.set_title("Cosine similarities matrix", fontsize=10)
ax.set_xlabel("Sentence index", fontsize=8)
ax.set_ylabel("Sentence index", fontsize=8)
plt.show()

Some artifacts appear around the diagonal of the similarity matrix, where the cosine similarities are slightly higher.

To the naked eye, the similarity matrix looks noisy. We will see that this is not an issue for `ruptures`: the cost function defined above aggregates the similarities over whole segments rather than relying on individual entries.

We use the [Dynamic Programming](../../user-guide/detection/dynp) search method since it has the following two properties:

* It finds **optimal** boundaries.
* It is fully modular and can run with a [custom cost](../../user-guide/costs/costcustom) (the [KernelCPD](../../user-guide/detection/kernelcpd) search method, implemented in C, only supports a predefined list of kernels, which do not natively accept a similarity matrix as input).
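
To give an intuition of why dynamic programming yields optimal boundaries for any segment cost, here is a minimal pure-Python sketch of the recursion. It is not the `ruptures` implementation: the function `dynp_segmentation`, the toy signal and the quadratic cost `l2_cost` are assumptions made for illustration.

```python
from functools import lru_cache


def dynp_segmentation(cost, n_samples, n_bkps, min_size=1):
    """Return (breakpoints, total cost) minimizing the sum of segment costs.

    `cost(a, b)` is the cost of the segment [a, b).
    The returned list of breakpoints ends with `n_samples`.
    """

    @lru_cache(maxsize=None)
    def best(end, k):
        """Best cost and breakpoints when [0, end) is split into k + 1 segments."""
        if k == 0:
            return cost(0, end), (end,)
        candidates = []
        # `mid` is the last change point before `end`.
        for mid in range(min_size * k, end - min_size + 1):
            left_cost, left_bkps = best(mid, k - 1)
            candidates.append((left_cost + cost(mid, end), left_bkps + (end,)))
        return min(candidates)

    total, bkps = best(n_samples, n_bkps)
    return list(bkps), total


# Toy piecewise constant signal with changes at indexes 3 and 6.
toy_signal = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 1.0, 1.0]


def l2_cost(a, b):
    """Sum of squared deviations from the segment mean."""
    segment = toy_signal[a:b]
    mean = sum(segment) / len(segment)
    return sum((value - mean) ** 2 for value in segment)


toy_bkps, toy_total = dynp_segmentation(l2_cost, len(toy_signal), n_bkps=2)
print(toy_bkps, toy_total)  # [3, 6, 8] 0.0
```

Because the cost is only accessed through `cost(a, b)`, any cost function can be plugged in, which is the modularity referred to above.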

In [None]:
# Create the object and run the algorithm
n_bkps = len(true_bkps) - 1  # the last index of `true_bkps` is the end of the text
algo = rpt.Dynp(custom_cost=CosineCost(), min_size=2, jump=1).fit(vectorized_text)
result = algo.predict(n_bkps=n_bkps)

In [None]:
print(result)
print(true_bkps)

## Display results

### Display boundaries within the text

In [None]:
nb_char = 60  # configure line character width

# Initialize some counters
c_real_bkps_idx = 0
c_computed_bkps_idx = 0
line_counter = 1

for i, sentence in enumerate(original_text):
    if i == true_bkps[c_real_bkps_idx]:
        # Display real boundaries
        print(f"\n\t{'=' * nb_char}\n")
        c_real_bkps_idx += 1
    if i == result[c_computed_bkps_idx]:
        # Display computed boundaries
        print(f"\t{'*' * nb_char}")
        c_computed_bkps_idx += 1
    # strip any stray newline, then wrap long sentences for readability
    sentence_wrap = wrap(sentence.strip("\n"), width=nb_char)
    for j, c_sentence_wrap in enumerate(sentence_wrap):
        print(f"{str(line_counter) + '.' if j == 0 else ''}\t{c_sentence_wrap}")
    line_counter += 1

### Display boundaries on the similarity matrix

In [None]:
fig, ax = fig_ax((4, 4))  # creates figure
previous_bkp = 0

for c_bpks in result:
    c_bpks -= 1
    ax.vlines(
        [c_bpks, previous_bkp],
        previous_bkp,
        c_bpks,
        color="black",
        linestyles="dashed",
        linewidth=0.8,
    )
    ax.hlines(
        [c_bpks, previous_bkp],
        previous_bkp,
        c_bpks,
        color="black",
        linestyles="dashed",
        linewidth=0.8,
    )
    previous_bkp = c_bpks
plt.imshow(-np.log(similarities), cmap=cm.plasma)
ax.set_title("Cosine similarities matrix\nWith computed boundaries", fontsize=10)
ax.set_xlabel("Sentence index", fontsize=8)
ax.set_ylabel("Sentence index", fontsize=8)
plt.show()

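## Conclusion

This example showed how to apply `ruptures` to a linear text segmentation task: the sentences are preprocessed and turned into word-count vectors, a custom cost derived from the cosine similarity is defined, and the dynamic programming search method is used to estimate the boundaries between excerpts.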

## Appendix A

The complete text used in this notebook is as follows.
Note that the line numbers and the blank lines (added to visually mark the boundaries between excerpts) are not part of the text fed to the segmentation method.

In [None]:
for (start, end) in rpt.utils.pairwise([0] + true_bkps):
    excerpt = original_text[start:end]
    for (n_line, sentence) in enumerate(excerpt, start=start + 1):
        sentence = sentence.strip("\n")
        print(f"{n_line:>2}: {sentence}")
    print()

## References

<a id="Choi2000">[Choi2000]</a>
Choi, F. Y. Y. (2000). Advances in domain independent linear text segmentation. Proceedings of the North American Chapter of the Association for Computational Linguistics Conference (NAACL), 26–33.

<a id="ChoiDataset">[ChoiDataset]</a>
The dataset can be obtained from an archived version of the C99 segmentation [code release](http://web.archive.org/web/20010422042459/http://www.cs.man.ac.uk/~choif/software/C99-1.2-release.tgz). We thank [[Alemi & Ginsparg](#Alemi_Ginsparg)] for pointing to the dataset link. 

<a id="BrownCorpus">[BrownCorpus]</a>
Kučera, H. and Francis, W. N. (1964, revised 1971, revised and amplified 1979). Brown Corpus manual. Brown University, Department of Linguistics. Accessible [here](http://icame.uib.no/brown/bcm.html).

<a id="Alemi_Ginsparg">[Alemi & Ginsparg]</a>
Alemi, A. A. and Ginsparg, P. (2015). Text segmentation based on semantic word embeddings. Cornell University, March 15th 2015. Accessible [here](https://arxiv.org/pdf/1503.05543.pdf).

<a id="Porter1980">[Porter1980]</a>
Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137, July.

## To be deleted

In [None]:
# rank, to be deleted
def get_rank(matrix: np.array, vicinity: int):
    res = np.zeros(matrix.shape)
    n_lines, n_columns = matrix.shape
    for i in np.arange(n_lines):
        for j in np.arange(i, n_columns):
            sub_matrix = matrix[
                max(0, i - vicinity) : min(i + vicinity, n_lines),
                max(0, j - vicinity) : min(j + vicinity, n_columns),
            ]
            # rank = number of neighbours with a strictly lower similarity
            res[i, j] = np.sum(sub_matrix < matrix[i, j])
            res[j, i] = res[i, j]
    return res


similarities_rank = get_rank(similarities, 3)
print(similarities_rank.shape)
fig, ax = fig_ax((5, 3))
plt.imshow(-similarities_rank, cmap=cm.plasma)
plt.show()

In [None]:
algo = rpt.Pelt(custom_cost=CosineCost(), min_size=2, jump=1).fit(vectorized_text)
res_n_bkps = []
res_sum_of_cost = []
x = np.logspace(-0.5, 0.5, num=100)
for pen in x:
    result = algo.predict(pen=pen)
    # print(result)
    res_n_bkps.append(len(result) - 1)
    res_sum_of_cost.append(algo.cost.sum_of_costs(result))

fig, ax = fig_ax((5, 3))
ax.plot(x, res_n_bkps, "b-")
ax.set_xlabel("penalty")
ax.set_ylabel("Number of computed break points", color="b")
ax.vlines(x[58], 0, 50, colors="red")

ax2 = ax.twinx()
ax2.plot(x, res_sum_of_cost, "r.")
ax2.set_ylabel("Sum of costs", color="r")

In [None]:
algo = rpt.Dynp(custom_cost=CosineCost(), min_size=2, jump=1).fit(vectorized_text)


def get_sum_of_cost(algo, n_bkps) -> float:
    """Return the sum of costs for the change points `bkps`"""
    bkps = algo.predict(n_bkps=n_bkps)
    return algo.cost.sum_of_costs(bkps)


n_bkps_max = 20  # K_max
array_of_n_bkps = np.arange(1, n_bkps_max + 1)

fig, ax = fig_ax((5, 3))
ax.plot(
    array_of_n_bkps,
    [get_sum_of_cost(algo=algo, n_bkps=i) for i in array_of_n_bkps],
    "-*",
    alpha=0.5,
)
ax.set_xticks(array_of_n_bkps)
ax.set_xlabel("Number of change points")
ax.set_title("Sum of costs")
ax.grid(axis="x")
ax.set_xlim(0, n_bkps_max + 1)