**TokenizeProcessor**: splits the input text into sentences and tokens. After it runs, the document holds a list of sentences, and each sentence holds a list of tokens.

In [1]:
import stanza
nlp = stanza.Pipeline(lang='en', processors='tokenize')
doc = nlp('This is a test sentence for stanza. This is another sentence.')
for i, sentence in enumerate(doc.sentences):
    print(f' ===== Sentence {i+1} tokens =====')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 156kB [00:00, 383kB/s]                     
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.4.0/models/tokenize/combined.pt: 100%|██████████| 647k/647k [00:34<00:00, 18.8kB/s]
2022-05-07 15:14:04 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-05-07 15:14:04 INFO: Use device: cpu
2022-05-07 15:14:04 INFO: Loading: tokenize
2022-05-07 15:14:04 INFO: Done loading processors!


 ===== Sentence 1 tokens =====
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: test
id: (5,)	text: sentence
id: (6,)	text: for
id: (7,)	text: stanza
id: (8,)	text: .
 ===== Sentence 2 tokens =====
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: another
id: (4,)	text: sentence
id: (5,)	text: .


In [2]:
# access the sentence segmentation through the text of each sentence:
print([sentence.text for sentence in doc.sentences])

['This is a test sentence for stanza.', 'This is another sentence.']
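
The token ids in the output are tuples such as `(1,)` rather than plain integers. In Stanza a token id is a tuple because a multi-word token can span several word ids, for example `(1, 2)`. The following is a minimal sketch of a display helper, assuming CoNLL-U-style labels are wanted; `format_token_id` is a hypothetical name and not part of Stanza.

```python
def format_token_id(token_id):
    """Render a token id tuple as a CoNLL-U-style label.

    Hypothetical helper (not part of Stanza): (1,) -> "1", (1, 2) -> "1-2".
    """
    return "-".join(str(part) for part in token_id)


# Ids as they appear in the tokenizer output, plus a multi-word token span.
print(format_token_id((1,)))    # 1
print(format_token_id((1, 2)))  # 1-2
```

In the printing loops of this notebook, `token.id` could be wrapped as `format_token_id(token.id)` to get the shorter labels.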


## Tokenization without sentence segmentation

In [3]:
import stanza
nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_no_ssplit=True)
doc = nlp('This is a sentence.\n\nThis is a seconda. This is a third.')
for i, sentence in enumerate(doc.sentences):
    print(f'===== Sentence {i+1} tokens =====')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 156kB [00:00, 1.12MB/s]                    
2022-05-07 15:14:20 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-05-07 15:14:20 INFO: Use device: cpu
2022-05-07 15:14:20 INFO: Loading: tokenize
2022-05-07 15:14:20 INFO: Done loading processors!


===== Sentence 1 tokens =====
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: sentence
id: (5,)	text: .
===== Sentence 2 tokens =====
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: a
id: (4,)	text: seconda
id: (5,)	text: .
id: (6,)	text: This
id: (7,)	text: is
id: (8,)	text: a
id: (9,)	text: third
id: (10,)	text: .
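
In the output above, the only sentence break falls at the blank line in the input: `tokenize_no_ssplit=True` disables the sentence splitter and treats two consecutive newlines as the sentence delimiter. If the sentences were already split elsewhere, the input string could be built as in this sketch; `join_presplit_sentences` is a hypothetical helper, not a Stanza function.

```python
def join_presplit_sentences(sentences):
    """Join already-split sentences with a blank line between them.

    Hypothetical helper: builds the string form read by a pipeline
    created with tokenize_no_ssplit=True.
    """
    # Strip each sentence so leading or trailing newlines cannot add breaks,
    # and skip entries that are empty after stripping.
    return "\n\n".join(s.strip() for s in sentences if s.strip())


text = join_presplit_sentences(
    ["This is a sentence.", "This is a second. This is a third."]
)
print(repr(text))
# The result can then be passed to the pipeline: doc = nlp(text)
```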


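## Processing pre-tokenized text

When the text is already tokenized, set `tokenize_pretokenized=True`. The pipeline then skips its own tokenization and sentence segmentation: whitespace is taken as the token boundary and a newline as the sentence boundary.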

In [4]:
import stanza
nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_pretokenized=True)
doc = nlp('This is token.ization done my way!\nSentence split, too!')
for i, sentence in enumerate(doc.sentences):
    print(f'===== Sentence {i+1} tokens =====')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 156kB [00:00, 1.30MB/s]                    
2022-05-07 15:14:27 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-05-07 15:14:27 INFO: Use device: cpu
2022-05-07 15:14:27 INFO: Loading: tokenize
2022-05-07 15:14:27 INFO: Done loading processors!


===== Sentence 1 tokens =====
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: token.ization
id: (4,)	text: done
id: (5,)	text: my
id: (6,)	text: way!
===== Sentence 2 tokens =====
id: (1,)	text: Sentence
id: (2,)	text: split,
id: (3,)	text: too!
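
Because whitespace separates tokens and a newline separates sentences in this mode, `way!` and `split,` stay single tokens above. The same reading can be reproduced in plain Python. This sketch only mirrors the input convention and is not Stanza's implementation; `read_pretokenized` is a hypothetical name.

```python
def read_pretokenized(text):
    """Split text into sentences on newlines and into tokens on whitespace."""
    # Blank lines are skipped so they do not produce empty sentences.
    return [line.split() for line in text.split("\n") if line.strip()]


sentences = read_pretokenized(
    "This is token.ization done my way!\nSentence split, too!"
)
for i, tokens in enumerate(sentences, start=1):
    print(f"Sentence {i}: {tokens}")
```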


In [6]:
# alternative: pass a list of token lists, one inner list per sentence
nlp = stanza.Pipeline(lang='en', processors='tokenize', tokenize_pretokenized=True)
doc = nlp([['This', 'is', 'token.ization', 'done', 'my', 'way!'], ['sentence', 'split', 'too!']])
for i, sentence in enumerate(doc.sentences):
    print(f'===== Sentence {i+1} tokens =====')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json: 156kB [00:00, 316kB/s]                     
2022-05-07 15:14:51 INFO: Loading these models for language: en (English):
| Processor | Package  |
------------------------
| tokenize  | combined |

2022-05-07 15:14:51 INFO: Use device: cpu
2022-05-07 15:14:51 INFO: Loading: tokenize
2022-05-07 15:14:51 INFO: Done loading processors!


===== Sentence 1 tokens =====
id: (1,)	text: This
id: (2,)	text: is
id: (3,)	text: token.ization
id: (4,)	text: done
id: (5,)	text: my
id: (6,)	text: way!
===== Sentence 2 tokens =====
id: (1,)	text: sentence
id: (2,)	text: split
id: (3,)	text: too!
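
The list form and the string form carry the same information as long as no token is empty or contains whitespace. The sketch below converts a list of token lists to the string form and rejects tokens that would not survive the conversion; `to_pretokenized_string` is a hypothetical helper, not part of Stanza.

```python
def to_pretokenized_string(sentences):
    """Convert a list of token lists to the whitespace/newline string form.

    Hypothetical helper: raises ValueError for tokens that are empty or
    contain whitespace, since they would not survive the round trip.
    """
    for sentence in sentences:
        for token in sentence:
            if not token or any(ch.isspace() for ch in token):
                raise ValueError(f"invalid token: {token!r}")
    # One sentence per line, tokens separated by single spaces.
    return "\n".join(" ".join(sentence) for sentence in sentences)


text = to_pretokenized_string(
    [["This", "is", "token.ization", "done", "my", "way!"],
     ["sentence", "split", "too!"]]
)
print(text)
```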
