# Word Segmentation and Association

![](../figs/intro_nlp/words/entelecheia_associaltion_vs_segmentation.png)

## Word Segmentation

- **Word segmentation** is the task of splitting a string of characters into words.
- Word segmentation is important for a machine to understand the meaning of a sentence.
- In English, we can split a string of characters into words by spaces.
- However, languages such as Chinese and Japanese do not put spaces between words.
- Even in English, some text is written without spaces between words.
- Humans can easily segment such a string into words, even when there are no spaces.
- For example, we can easily segment the string `Ilikechocolate` into the words `I like chocolate`.

## Why should we segment words?

There are many applications that require word segmentation, even in English.

- Normalizing English compound nouns that are written in varying ways, for search and indexing.
  - For example, `ice cream` and `ice-cream` should be matched to the same term as `icecream`.
- Word segmentation for compounds: Both the original words and the split words should be in the dictionary.
- Typing errors may be corrected by word segmentation.
- Conversion errors: During conversion, some spaces may be lost.
- OCR errors: OCRed text may be missing spaces between words.
- Keyword extraction from URL addresses, domain names, table column descriptions, or programming variables that are written without spaces.
- Password analysis, which may require extracting terms from passwords.
- Automatic CamelCasing of programming variables.
- Speech recognition: Recognition systems may not properly detect the boundaries between spoken words.


## Generating segment variants

We can generate all possible segment variants of a string of characters. Each distinct segment variant is called a **composition**.

- In a string of length $n$, there are $n-1$ possible positions to split the string.
- Each of the $n-1$ positions can either be used as a word boundary or not.
- Therefore, there are $2^{n-1}$ possible compositions.
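
The counting argument above can be checked with a small sketch that makes one yes/no decision per gap between characters. The function name `compositions` is hypothetical and chosen only for this illustration.

```python
from itertools import product


def compositions(text):
    """Yield every composition of `text` as a list of parts."""
    if not text:
        return
    gaps = len(text) - 1
    # Each tuple in the product holds one cut / no-cut decision per gap.
    for cuts in product([False, True], repeat=gaps):
        parts, start = [], 0
        for position, cut in enumerate(cuts, start=1):
            if cut:
                parts.append(text[start:position])
                start = position
        parts.append(text[start:])
        yield parts


all_parts = list(compositions("isit"))
print(len(all_parts))  # 2 ** (4 - 1) = 8
```

Because every gap is decided independently, the number of results doubles with each extra character.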

The compositions have to be evaluated to find the best segmentation.

- The best segmentation is the one that has the highest probability.


### Naive Recursive Algorithm

- The naive recursive algorithm generates all possible compositions and then evaluates each of them.
- Because a string of length $n$ has $2^{n-1}$ compositions, the time complexity of the naive recursive algorithm is $O(2^n)$.
- The naive recursive algorithm is not efficient for long strings.

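```python
from pprint import pprint


def segment_naive(string):
    """Return every composition of `string` as a list of word lists."""
    if not string:
        return []
    # Either keep the whole string as one word, or cut after position i
    # and combine the first part with every composition of the remainder.
    return [[string]] + [
        [string[:i]] + rest
        for i in range(1, len(string))
        for rest in segment_naive(string[i:])
    ]
```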

```python
pprint(segment_naive("isit"))
```

```
[['isit'],
 ['i', 'sit'],
 ['i', 's', 'it'],
 ['i', 's', 'i', 't'],
 ['i', 'si', 't'],
 ['is', 'it'],
 ['is', 'i', 't'],
 ['isi', 't']]
```


```python
pprint(segment_naive("가방에"))
```

```
[['가방에'], ['가', '방에'], ['가', '방', '에'], ['가방', '에']]
```


```python
text = "thisislongtext"
print(len(text), len(segment_naive(text)))
```

```
14 8192
```


```python
text = "아버지가방에들어가신다"  # "Father goes into the bag" or "Father enters the room"
print(len(text), len(segment_naive(text)))
```

```
11 1024
```


### Dynamic Programming

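- Dynamic programming solves a problem by breaking it into subproblems and storing their results, so that no subproblem is computed twice.
- For segmentation, the best split of each prefix is computed once and reused. If the maximum word length is bounded by a constant, the running time grows linearly with the string length, $O(n)$; without such a bound, $O(n^2)$ candidate words have to be examined.
- For long strings, dynamic programming is much more efficient than the naive recursive algorithm.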

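```python
def segment(string, dictionary):
    """Greedy baseline: split off the shortest dictionary prefix, then recurse.

    Unlike a full dynamic program, this commits to the first match and never
    revisits it, so it can miss a segmentation that needs a longer prefix.
    """
    if not string:
        return []
    for end in range(1, len(string) + 1):
        first, rest = string[:end], string[end:]
        if first in dictionary:
            return [first] + segment(rest, dictionary)
    # No dictionary word starts here: keep the remainder as one unknown chunk.
    return [string]
```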
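
The `segment` function above takes the first dictionary prefix it finds and stores no subproblem results. A minimal sketch of a tabulated version follows, under stated assumptions: compositions are scored with unigram log probabilities, unknown words get a length-based penalty, and the names `segment_dp`, `toy_counts`, and `max_word_len` are hypothetical.

```python
import math


def segment_dp(text, word_counts, max_word_len=20):
    """Return the highest-scoring segmentation of `text` (unigram model)."""
    total = sum(word_counts.values())

    def log_prob(word):
        if word in word_counts:
            return math.log(word_counts[word] / total)
        # Assumed penalty for unknown words: 10x less likely per character.
        return -math.log(total) - len(word) * math.log(10)

    # best[i] = (score, words) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            score = best[start][0] + log_prob(word)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[len(text)][1]


toy_counts = {"i": 50, "like": 30, "chocolate": 10, "is": 40, "it": 40, "sit": 5}
print(segment_dp("ilikechocolate", toy_counts))  # ['i', 'like', 'chocolate']
print(segment_dp("isit", toy_counts))            # ['is', 'it']
```

Each prefix is solved once and looked up afterwards, so the inner loop runs at most `max_word_len` times per character. Storing the word lists directly keeps the sketch short at the cost of some extra copying.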

### Triangular Matrix

- The dynamic programming algorithm can be implemented using a triangular matrix.
- The triangular matrix algorithm uses nested loops and a circular buffer to store the results of subproblems.
- A triangular matrix of parts with increasing length is generated and organized in a circular buffer.
- Because only the most recent partial results are kept, the number of stored entries stays constant regardless of the length of the input.
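
One way to read the circular-buffer idea is sketched below. It is an illustration under assumptions, not a reference implementation: the scoring (unigram log probabilities with a length-based penalty for unknown words) and the names `segment_buffered`, `toy_counts`, and `max_word_len` are hypothetical.

```python
import math


def segment_buffered(text, word_counts, max_word_len=10):
    """Segment `text` while keeping only max_word_len + 1 partial results."""
    total = sum(word_counts.values())

    def log_prob(word):
        if word in word_counts:
            return math.log(word_counts[word] / total)
        # Assumed penalty for unknown words: 10x less likely per character.
        return -math.log(total) - len(word) * math.log(10)

    size = max_word_len + 1
    # buffer[i % size] holds (score, words) for the best split of text[:i].
    buffer = [(0.0, [])] * size
    for end in range(1, len(text) + 1):          # outer loop: end position
        best = (-math.inf, [])
        for start in range(max(0, end - max_word_len), end):  # inner: part length
            prev_score, prev_words = buffer[start % size]
            word = text[start:end]
            score = prev_score + log_prob(word)
            if score > best[0]:
                best = (score, prev_words + [word])
        buffer[end % size] = best                # overwrites the oldest slot
    return buffer[len(text) % size][1]


toy_counts = {"i": 50, "like": 30, "chocolate": 10, "is": 40, "it": 40, "sit": 5}
print(segment_buffered("ilikechocolate", toy_counts))  # ['i', 'like', 'chocolate']
```

A part can never be longer than `max_word_len`, so results older than that are no longer needed and their slots can be reused.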


### Unknown Words

- We cannot rely on the dictionary to segment all words.
- There are uncommon words, new words, misspelled words, foreign words, proper nouns, slang words, etc.
- Even in these cases, we want to segment the words into meaningful parts.
- Therefore, we have to estimate the probability of any possible segmentation.

## Evaluation of Compositions

- Generally, we can evaluate a composition by calculating the probability of the composition.
- Word probabilities can be estimated from a corpus:

    $$
    P(w_i) = \frac{c(w_i)}{N}
    $$

    where $c(w_i)$ is the count of word $w_i$ and $N$ is the total number of words in the corpus.

- However, for unknown words, we have to use other criteria to evaluate the composition.
- At a word boundary, the uncertainty about the next token increases.
- By measuring the uncertainty, we can evaluate the composition.
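
As a small worked example of the corpus estimate $P(w_i) = c(w_i)/N$, the sketch below counts words in a toy corpus and scores a few candidate compositions. It assumes that words are independent (a unigram model), so the probability of a composition is the product of its word probabilities; the corpus and the helper names are made up for illustration.

```python
import math
from collections import Counter

corpus = "is it a hit it is a hit is it".split()  # toy corpus, N = 10
counts = Counter(corpus)
total = sum(counts.values())


def word_prob(word):
    """P(w) = c(w) / N, estimated from the toy corpus."""
    return counts[word] / total


def composition_log_prob(words):
    """Sum of log P(w) over the words; -inf if any word was never seen."""
    if any(counts[w] == 0 for w in words):
        return -math.inf
    return sum(math.log(word_prob(w)) for w in words)


candidates = [["is", "it"], ["isit"], ["i", "sit"]]
best = max(candidates, key=composition_log_prob)
print(best)  # ['is', 'it']
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied. The `-inf` score for unseen words shows why a purely count-based estimate is not enough for unknown words.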

### Uncertainty of word boundaries

- The uncertainty of word boundaries can be measured by the entropy of the word boundary.
- Harris (1970) observed that if the uncertainty of successive tokens increases, the location is a word boundary.
- Feng et al. (2004) proposed a statistical criterion called accessor variety (AV) to measure how likely a sub-sequence is to be a word. The best segmentation is then the one that maximizes a target function of the accessor variety and the length of each sub-sequence.
- Jin and Tanaka-Ishii (2006) proposed branch entropy as another criterion for unsupervised segmentation.
- Both criteria share the assumption of the fundamental work by Harris (1970): the uncertainty of successive tokens increases at word boundaries.
- Branch entropy is the continuous version of accessor variety.


![](../figs/intro_nlp/words/branching_entropy_uncertainty.png)


### Accessor Variety

- Accessor variety (AV) measures the uncertainty around a sub-sequence as the number of different words that can be formed by attaching another sub-sequence to it.
- The right-side accessor variety is the number of different words that can be formed by attaching a sub-sequence to the right side of the given sub-sequence.
- In the following word counts, the right-side accessor variety of `hope` is 2, because `hope` can be followed by `ful` or `less`.

    ```
    "hopeful": 100
    "hopeless": 80
    ```
- The left-side accessor variety is the number of different words that can be formed by adding a sub-sequence to the left side of the sub-sequence.
- For example, the left-side accessor variety of `less` is 3, because `hopeless`, `unless`, and `pointless` can be formed by adding `hope`, `un`, or `point` to the left side of `less`.

    ```
    "hopeless": 80
    "unless": 160
    "pointless": 70
    ```
- Depending on the language, the left-side accessor variety or the right-side accessor variety may be more suitable for segmentation.
- Threshold values can be used to determine the word boundaries.
- The threshold values can be determined by the corpus.
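
The two counts above can be reproduced with a short sketch. It merges the two toy count tables into one dictionary and counts distinct remainders, which is one reading of the definition given here; counting only the adjacent character gives the same values for this data. The function names are hypothetical.

```python
word_counts = {"hopeful": 100, "hopeless": 80, "unless": 160, "pointless": 70}


def right_accessor_variety(sub, word_counts):
    """Number of distinct non-empty continuations to the right of `sub`."""
    return len({
        word[len(sub):]
        for word in word_counts
        if word.startswith(sub) and len(word) > len(sub)
    })


def left_accessor_variety(sub, word_counts):
    """Number of distinct non-empty continuations to the left of `sub`."""
    return len({
        word[:len(word) - len(sub)]
        for word in word_counts
        if word.endswith(sub) and len(word) > len(sub)
    })


print(right_accessor_variety("hope", word_counts))  # 2 ('ful', 'less')
print(left_accessor_variety("less", word_counts))   # 3 ('hope', 'un', 'point')
```

Note that accessor variety ignores the frequencies: `unless` counts once even though it occurs 160 times.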

### Branch Entropy

- The branch entropy of a sub-sequence $w$ is the entropy of the distribution of the single character that follows $w$.

    $$
    \text{BE}(w) = -\sum_{i=1}^n p_i \log p_i
    $$

    where $p_i = P(c_i \mid w)$ is the probability that character $c_i$ follows the sub-sequence $w$, and $n$ is the number of distinct characters that can follow $w$.

- As in the case of accessor variety, the branch entropy can be calculated for the left-side and the right-side.
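
A minimal sketch of both directions follows, reusing the toy word counts from the accessor variety examples. It assumes that the character probabilities are estimated from the word frequencies and that the natural logarithm is used; neither choice is fixed by the definition above, and the function names are hypothetical.

```python
import math
from collections import Counter

word_counts = {"hopeful": 100, "hopeless": 80, "unless": 160, "pointless": 70}


def entropy(char_counts):
    """Shannon entropy (in nats) of a frequency table."""
    total = sum(char_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in char_counts.values())


def right_branch_entropy(sub, word_counts):
    """Entropy of the character that directly follows `sub`."""
    next_chars = Counter()
    for word, count in word_counts.items():
        if word.startswith(sub) and len(word) > len(sub):
            next_chars[word[len(sub)]] += count
    return entropy(next_chars)


def left_branch_entropy(sub, word_counts):
    """Entropy of the character that directly precedes `sub`."""
    prev_chars = Counter()
    for word, count in word_counts.items():
        if word.endswith(sub) and len(word) > len(sub):
            prev_chars[word[len(word) - len(sub) - 1]] += count
    return entropy(prev_chars)


print(round(right_branch_entropy("hope", word_counts), 4))
print(round(left_branch_entropy("less", word_counts), 4))
```

Unlike accessor variety, the entropy takes the frequencies into account: a continuation that occurs rarely adds less uncertainty than one that occurs often.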


## References

- [Uncertainty to word boundary; Accessor Variety & Branching Entropy](https://lovit.github.io/nlp/2018/04/09/branching_entropy_accessor_variety/)
- [Fast Word Segmentation of Noisy Text](https://medium.com/towards-data-science/fast-word-segmentation-for-noisy-text-2c2c41f9e8da)