# Lab 1: Segmentation Exercise


##  <font color='green'>Setup 1</font>: Load Libraries

In [1]:
import re

## <font color='blue'>Task</font> 1: Improving a sentence segmenter

Sentence segmentation is not a trivial task. A simple segmenter that splits at every sentence-final punctuation mark fails in some cases, for example on abbreviations such as U.K.

First, make sure you understand the following sentence segmentation code used in the lecture:

In [2]:

def sentence_segment(match_regex, tokens):
    r"""
    Splits a sequence of tokens into sentences, starting a new sentence after every token
    that the given regular expression matches.

    Parameters
    ----------
    match_regex : re.Pattern
        The compiled regular expression that defines at which token to split.
    tokens : list of str
        The input sequence of string tokens.

    Returns
    -------
    list of list of str
        A list of token lists, where each inner list represents a sentence.

    >>> tokens = ['the','man','eats','.','She', 'sleeps', '.']
    >>> sentence_segment(re.compile(r'\.'), tokens)
    [['the', 'man', 'eats', '.'], ['She', 'sleeps', '.']]
    """
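    current = []
    sentences = [current]
    for tok in tokens:
        current.append(tok)
        # match() anchors at the start of the token, so the token must begin with the boundary pattern.
        if match_regex.match(tok):
            # Start a new, initially empty sentence after a boundary token.
            current = []
            sentences.append(current)
    # Drop the empty sentence left over when the input ends with a boundary token.
    if not sentences[-1]:
        sentences.pop(-1)
    return sentences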
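
To check your understanding, here is a small sketch that runs the segmenter on two toy inputs. The function is repeated so that the snippet runs on its own; the token lists and the `boundary` pattern are made up for illustration and are not the ones used in the task.

```python
import re


def sentence_segment(match_regex, tokens):
    # Same logic as the lecture function, repeated so this snippet is self-contained.
    current = []
    sentences = [current]
    for tok in tokens:
        current.append(tok)
        if match_regex.match(tok):
            current = []
            sentences.append(current)
    if not sentences[-1]:
        sentences.pop(-1)
    return sentences


boundary = re.compile(r'[.?!]')  # illustrative boundary pattern, not the one used in the task

# 'Ph.D.' contains periods but does not start with one, so match() leaves it alone.
with_abbreviation = sentence_segment(boundary, ['Wait', '!', 'Is', 'it', 'a', 'Ph.D.', '?'])
print(with_abbreviation)
# [['Wait', '!'], ['Is', 'it', 'a', 'Ph.D.', '?']]

# Tokens after the last boundary are kept as a final, unterminated sentence.
unterminated = sentence_segment(boundary, ['Hi', '.', 'no', 'end'])
print(unterminated)
# [['Hi', '.'], ['no', 'end']]
```

Because the function only ever looks at one token at a time, everything it can get right or wrong is decided by how the text was tokenized beforehand.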


### Improve the sentence segmenter

Next, modify the code below so that it segments the following text into sentences correctly:

In [14]:
text = """
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch is the longest official one-word placename in U.K. Isn't that weird? I mean, someone took the effort to really make this name as complicated as possible, huh?! Of course, U.S.A. also has its own record in the longest name, albeit a bit shorter... This record belongs to the place called Chargoggagoggmanchauggagoggchaubunagungamaugg. There's so many wonderful little details one can find out while browsing http://www.wikipedia.org during their Ph.D. or an M.Sc.
"""

token = re.compile(r"https?://[a-z0-9.?/]+|[()]|\w+['-]\w+|Mr\.|Ph\.D\.|[\w']+|[?!]+|\.+\s")

tokens = token.findall(text)
sentences = sentence_segment(re.compile(r'\.+\s|[?!]'), tokens)
for sentence in sentences:
    print(sentence)

['Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch', 'is', 'the', 'longest', 'official', 'one-word', 'placename', 'in', 'U', 'K', '. ']
["Isn't", 'that', 'weird', '?']
['I', 'mean', 'someone', 'took', 'the', 'effort', 'to', 'really', 'make', 'this', 'name', 'as', 'complicated', 'as', 'possible', 'huh', '?!']
['Of', 'course', 'U', 'S', 'A', '. ']
['also', 'has', 'its', 'own', 'record', 'in', 'the', 'longest', 'name', 'albeit', 'a', 'bit', 'shorter', '... ']
['This', 'record', 'belongs', 'to', 'the', 'place', 'called', 'Chargoggagoggmanchauggagoggchaubunagungamaugg', '. ']
["There's", 'so', 'many', 'wonderful', 'little', 'details', 'one', 'can', 'find', 'out', 'while', 'browsing', 'http://www.wikipedia.org', 'during', 'their', 'Ph.D.', 'or', 'an', 'M', 'Sc', '.\n']
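
The output shows that `Ph.D.` survives as one token while `U.K.`, `U.S.A.` and `M.Sc.` fall apart. A minimal sketch of why, applying the same token pattern to two short fragments (the fragments are chosen here for illustration):

```python
import re

# The token pattern from the cell above, written as a raw string.
token = re.compile(r"https?://[a-z0-9.?/]+|[()]|\w+['-]\w+|Mr\.|Ph\.D\.|[\w']+|[?!]+|\.+\s")

# 'Ph.D.' is listed explicitly in the pattern, so it is kept whole.
listed = token.findall("their Ph.D. or")
print(listed)    # ['their', 'Ph.D.', 'or']

# 'U.K.' is not listed: each letter is matched by [\w']+, the inner period matches
# nothing and is skipped, and the final '. ' becomes a sentence-boundary token.
unlisted = token.findall("in U.K. Isn't")
print(unlisted)  # ['in', 'U', 'K', '. ', "Isn't"]
```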


Questions:
- Which elements of a sentence did you have to take care of here?
- Is it useful, or even possible, to enumerate all such cases?
- How would you deal with arbitrary URLs effectively?
- Can you think of any punctuation that is not covered by this example?
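
As one possible starting point for the second and third questions, the sketch below replaces the enumerated abbreviations with a general pattern and uses a broader URL pattern. The names `URL`, `ABBREV` and `general_token`, and both patterns, are assumptions made for illustration and not the intended solution; they are heuristics that will miss or over-match some inputs.

```python
import re

# Hypothetical patterns, for illustration only.
# URL: scheme, then a run of non-space characters that does not end in punctuation.
URL = r"https?://[^\s<>]+[^\s<>.,;:!?)]"
# Abbreviation: two or more groups of 1-3 letters, each followed by a period.
ABBREV = r"(?:[A-Za-z]{1,3}\.){2,}"

general_token = re.compile(
    "|".join([URL, ABBREV, r"\w+['-]\w+", r"[\w']+", r"[?!]+", r"\.+"])
)

abbreviation_tokens = general_token.findall("in U.K. Isn't that weird?")
print(abbreviation_tokens)
# ['in', 'U.K.', "Isn't", 'that', 'weird', '?']

url_tokens = general_token.findall("See http://www.wikipedia.org. Or an M.Sc.")
print(url_tokens)
# ['See', 'http://www.wikipedia.org', '.', 'Or', 'an', 'M.Sc.']
```

Note that `U.K.` now absorbs the period that also ends the sentence, so a splitter that inspects one token at a time can no longer see that boundary. Deciding it requires context, for example whether the next token starts with a capital letter, and even that heuristic fails when an abbreviation is followed by a proper noun.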

## Background Reading

* Jurafsky & Martin, [Speech and Language Processing (Third Edition)](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf): Chapter 2, Regular Expressions, Text Normalization, Edit Distance.



## Reference
* This notebook is adapted from course material by Sebastian Riedel: https://github.com/uclnlp/stat-nlp-book