# Sentence tokenization

The NLTK framework provides several interfaces for sentence tokenization. More information [here](https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize).

Note: To install NLTK with conda, run the following command:

```bash
conda install nltk
```
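
After installing, it can be useful to confirm that NLTK is visible from the Python environment the notebook uses. The following is a minimal sketch using only the standard library; the helper name `nltk_version` is hypothetical and not part of NLTK.

```python
from importlib import metadata
from importlib.util import find_spec


def nltk_version():
    """Return the installed NLTK version string, or None if NLTK is missing."""
    # find_spec only looks the package up; it does not import it.
    if find_spec("nltk") is None:
        return None
    try:
        return metadata.version("nltk")
    except metadata.PackageNotFoundError:
        return None


print("NLTK version:", nltk_version() or "not installed")
```

If this prints "not installed", the notebook kernel is probably running in a different environment from the one where the install command was executed.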

In [None]:
import nltk
from nltk.tokenize import sent_tokenize
from pprint import pprint

In [None]:
sample_text = 'We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!'
default_st = nltk.sent_tokenize
sample_sentences = default_st(text=sample_text)

print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences : ')
pprint(sample_sentences)

In [None]:
corpus = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested."""

documents = sent_tokenize(corpus)

In [10]:
documents

['Beautiful is better than ugly.',
 'Explicit is better than implicit.',
 'Simple is better than complex.',
 'Complex is better than complicated.',
 'Flat is better than nested.']

In [11]:
for sentence in documents:
    print(sentence)

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.


## Corpus from Project Gutenberg

In [12]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/cmillan/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [None]:
from nltk.corpus import gutenberg

In [None]:
alice = gutenberg.raw(fileids='carroll-alice.txt')



In [15]:
# Total characters in 'Alice in Wonderland'
print(len(alice))

144395


In [16]:
# Print the first 100 characters of the corpus
print(alice[0:100])


[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


`nltk.sent_tokenize` is the default function that NLTK recommends for sentence tokenization.

Note: It uses the `PunktSentenceTokenizer` class, for which NLTK provides pre-trained models for many languages.
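
To get a feel for what a trained model contributes, the sketch below compares an untrained `PunktSentenceTokenizer` with one that has been given a single abbreviation by hand through `PunktParameters`. The sample text and the abbreviation set are assumptions made for illustration; the pre-trained models learn this kind of information from large corpora rather than from a hand-written list.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Illustrative text, not taken from any corpus used in this notebook.
text = "Dr. Smith arrived. He was late."

# Untrained tokenizer: it has no abbreviation knowledge at all.
untrained = PunktSentenceTokenizer()
untrained_sentences = untrained.tokenize(text)

# Assumption for illustration: a hand-written abbreviation list.
# Entries are lowercase and omit the trailing period.
params = PunktParameters()
params.abbrev_types = {"dr"}
with_abbreviations = PunktSentenceTokenizer(params)
abbreviation_sentences = with_abbreviations.tokenize(text)

print(untrained_sentences)
print(abbreviation_sentences)
```

The second tokenizer should keep "Dr. Smith arrived." together as one sentence, while the untrained one is expected to break after "Dr.". Neither call requires downloading NLTK data.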

In [17]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/cmillan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [18]:
alice_sentences = nltk.sent_tokenize(text=alice)

In [None]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)

In [21]:
print('Total sentences in alice:', len(alice_sentences))
print('First 5 sentences in alice: ')
print(alice_sentences[0:5])

Total sentences in alice: 1625
First 5 sentences in alice: 
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.", "Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'", 'So she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.', "There was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!", 'Oh dear!']
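
The sentences above still contain the line breaks of the raw file (`\n`). If single-line sentences are preferred for later steps, one possible cleanup is to collapse every run of whitespace into a single space. This is only a sketch; the helper `normalize_whitespace` is hypothetical, and the example string is the first sentence from the output above.

```python
def normalize_whitespace(sentence):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace.
    return " ".join(sentence.split())


raw_sentence = "[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I."
clean_sentence = normalize_whitespace(raw_sentence)
print(clean_sentence)
```

Applied to the whole corpus, this would look like `[normalize_whitespace(s) for s in alice_sentences]`.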


# Tokenizing other languages

`europarl_raw` is a text corpus containing transcripts of European Parliament debates (European Parliament Proceedings Parallel Corpus).

In [23]:
nltk.download('europarl_raw')

[nltk_data] Downloading package europarl_raw to
[nltk_data]     /Users/cmillan/nltk_data...
[nltk_data]   Package europarl_raw is already up-to-date!


True

In [24]:
from nltk.corpus import europarl_raw

german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')

# Total characters in the corpus 
print(len(german_text))
# First 100 characters in the corpus
print(german_text[0:100])

157171
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit


In [25]:
german_sentences_def = default_st(text=german_text, language='german')

# loading german text tokenizer into a PunktSentenceTokenizer instance
german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
german_sentences = german_tokenizer.tokenize(german_text)

# check the type of german_tokenizer (a Punkt tokenizer class; the exact name depends on the NLTK version)
print(type(german_tokenizer))

<class 'nltk.tokenize.punkt.PunktTokenizer'>


In [26]:
print(german_sentences_def == german_sentences)

# print first 5 sentences of the corpus
for sent in german_sentences[0:5]:
    print(sent)

True
 
Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .
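
The equality check above prints `True`, so both approaches produced the same sentences here. When two tokenizers disagree, a plain `==` does not say where. The sketch below, using only the standard library, returns the first position at which two sentence lists differ; the helper name `first_difference` is hypothetical.

```python
from itertools import zip_longest


def first_difference(sentences_a, sentences_b):
    """Return (index, a, b) for the first mismatch, or None if the lists are equal."""
    # zip_longest pads the shorter list with None, so a length
    # mismatch is reported as a difference as well.
    pairs = zip_longest(sentences_a, sentences_b)
    for index, (a, b) in enumerate(pairs):
        if a != b:
            return index, a, b
    return None


print(first_difference(["One.", "Two."], ["One.", "Two."]))
print(first_difference(["One.", "Two."], ["One.", "Two. Three."]))
```

With the variables from this notebook, the call would be `first_difference(german_sentences_def, german_sentences)`.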


Sentences in different languages can be tokenized in two different ways.

1. Using the `PunktSentenceTokenizer` class:

In [27]:
punkt_st = nltk.tokenize.PunktSentenceTokenizer()

sample_sentences = punkt_st.tokenize(sample_text)
pprint(sample_sentences)

['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']


2. Using an instance of the `RegexpTokenizer` class, with a pattern-based regular expression that segments the sentences.

In [17]:
SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

regex_st = nltk.tokenize.RegexpTokenizer(pattern=SENTENCE_TOKENS_PATTERN, gaps=True)
sample_sentences = regex_st.tokenize(sample_text)
pprint(sample_sentences)

['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']
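
To see what the regular expression actually does, the same pattern can be tried with the standard library `re` module. The sketch below roughly approximates the gap-based splitting with `re.split`; it is not NLTK's implementation, and the sample text and the helper `split_sentences` are assumptions made for illustration. The negative lookbehinds block a split after forms such as "p.m." or "Mr.", and the positive lookbehind requires the whitespace to follow `.`, `?` or `!`.

```python
import re

SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'


def split_sentences(text, pattern=SENTENCE_TOKENS_PATTERN):
    """Split text at whitespace matched by the pattern, dropping empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


# Illustrative text with abbreviations the pattern does and does not cover.
text = "Mr. Smith arrived at 5 p.m. today! Was he late? Prof. Jones thinks so."
for sentence in split_sentences(text):
    print(sentence)
```

"Mr." and "p.m." are protected by the lookbehinds, but "Prof." is not (it has more than one lowercase letter before the period), so the text is split after it. This is the main limitation of a fixed pattern compared with a trained Punkt model.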
