# **Tokenization**

### *Topics*
1.   Corpus - a collection of authentic text or audio organized into datasets. In this notebook, the corpus is a single paragraph.
2.   Documents - discrete units of text that each represent an object of analysis, such as a letter, email, novel, or even an individual sentence or paragraph. In this notebook, each sentence is treated as a document.
3.   Vocabulary - the set of unique words in the corpus.
4.   Words - the individual tokens that make up a document; a repeated word counts each time it appears.
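
The four terms above can be sketched in plain Python before bringing in NLTK. This is only an illustration: the sample text is made up, and `str.split` is a crude stand-in for a real tokenizer, not what NLTK does internally.

```python
# Naive illustration of corpus -> documents -> words -> vocabulary.
# The sample text and the split-based "tokenizer" are assumptions for this sketch.
corpus = "The train was late. The bus was late too."

# Documents: one per sentence, found by splitting on the period.
documents = [s.strip() for s in corpus.split(".") if s.strip()]

# Words: every token in the corpus, repeats included (lowercased for comparison).
words = [w.lower() for doc in documents for w in doc.split()]

# Vocabulary: the unique words only.
vocabulary = sorted(set(words))

print(documents)
print(len(words), "words,", len(vocabulary), "unique")
print(vocabulary)
```

The word count (9) is larger than the vocabulary size (6) because "the", "was", and "late" each appear twice.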



## **NLTK Library**

In [13]:
import nltk
# punkt_tab holds the Punkt sentence-tokenizer data that sent_tokenize and word_tokenize rely on
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [14]:
corpus = """It is crucial to emphasize that all generated responses must adhere strictly to the designated language, without incorporating any other languages.
Additionally, it is essential to consider any specified modifiers when crafting a response to a query. Please ensure that the format is structured in paragraphs, avoiding any list format.
Responses should be concise and presented in a formal style, with paragraph numbering only applied if multiple paragraphs are produced."""

In [15]:
corpus

'It is crucial to emphasize that all generated responses must adhere strictly to the designated language, without incorporating any other languages. \nAdditionally, it is essential to consider any specified modifiers when crafting a response to a query. Please ensure that the format is structured in paragraphs, avoiding any list format. \nResponses should be concise and presented in a formal style, with paragraph numbering only applied if multiple paragraphs are produced.'

In [16]:
print(corpus)

It is crucial to emphasize that all generated responses must adhere strictly to the designated language, without incorporating any other languages. 
Additionally, it is essential to consider any specified modifiers when crafting a response to a query. Please ensure that the format is structured in paragraphs, avoiding any list format. 
Responses should be concise and presented in a formal style, with paragraph numbering only applied if multiple paragraphs are produced.


In [17]:
# Let's split the paragraph into sentences
from nltk.tokenize import sent_tokenize

In [18]:
sent_tokenize(corpus)

['It is crucial to emphasize that all generated responses must adhere strictly to the designated language, without incorporating any other languages.',
 'Additionally, it is essential to consider any specified modifiers when crafting a response to a query.',
 'Please ensure that the format is structured in paragraphs, avoiding any list format.',
 'Responses should be concise and presented in a formal style, with paragraph numbering only applied if multiple paragraphs are produced.']

In [19]:
documents = sent_tokenize(corpus)

In [21]:
# Let's split the paragraph (or a sentence) into words
from nltk.tokenize import word_tokenize

In [24]:
word_tokenize(corpus)

['It',
 'is',
 'crucial',
 'to',
 'emphasize',
 'that',
 'all',
 'generated',
 'responses',
 'must',
 'adhere',
 'strictly',
 'to',
 'the',
 'designated',
 'language',
 ',',
 'without',
 'incorporating',
 'any',
 'other',
 'languages',
 '.',
 'Additionally',
 ',',
 'it',
 'is',
 'essential',
 'to',
 'consider',
 'any',
 'specified',
 'modifiers',
 'when',
 'crafting',
 'a',
 'response',
 'to',
 'a',
 'query',
 '.',
 'Please',
 'ensure',
 'that',
 'the',
 'format',
 'is',
 'structured',
 'in',
 'paragraphs',
 ',',
 'avoiding',
 'any',
 'list',
 'format',
 '.',
 'Responses',
 'should',
 'be',
 'concise',
 'and',
 'presented',
 'in',
 'a',
 'formal',
 'style',
 ',',
 'with',
 'paragraph',
 'numbering',
 'only',
 'applied',
 'if',
 'multiple',
 'paragraphs',
 'are',
 'produced',
 '.']

In [25]:
all_words = word_tokenize(corpus)

In [26]:
for word in all_words:
  print(word)

It
is
crucial
to
emphasize
that
all
generated
responses
must
adhere
strictly
to
the
designated
language
,
without
incorporating
any
other
languages
.
Additionally
,
it
is
essential
to
consider
any
specified
modifiers
when
crafting
a
response
to
a
query
.
Please
ensure
that
the
format
is
structured
in
paragraphs
,
avoiding
any
list
format
.
Responses
should
be
concise
and
presented
in
a
formal
style
,
with
paragraph
numbering
only
applied
if
multiple
paragraphs
are
produced
.


In [27]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

In [28]:
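# Unlike word_tokenize, this tokenizer does not split the text into sentences first,
# so only the final period comes out as its own token (note 'languages.' in the output)
tokenizer.tokenize(corpus)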

['It',
 'is',
 'crucial',
 'to',
 'emphasize',
 'that',
 'all',
 'generated',
 'responses',
 'must',
 'adhere',
 'strictly',
 'to',
 'the',
 'designated',
 'language',
 ',',
 'without',
 'incorporating',
 'any',
 'other',
 'languages.',
 'Additionally',
 ',',
 'it',
 'is',
 'essential',
 'to',
 'consider',
 'any',
 'specified',
 'modifiers',
 'when',
 'crafting',
 'a',
 'response',
 'to',
 'a',
 'query.',
 'Please',
 'ensure',
 'that',
 'the',
 'format',
 'is',
 'structured',
 'in',
 'paragraphs',
 ',',
 'avoiding',
 'any',
 'list',
 'format.',
 'Responses',
 'should',
 'be',
 'concise',
 'and',
 'presented',
 'in',
 'a',
 'formal',
 'style',
 ',',
 'with',
 'paragraph',
 'numbering',
 'only',
 'applied',
 'if',
 'multiple',
 'paragraphs',
 'are',
 'produced',
 '.']

In [29]:
# Let's tokenize sentences in another language (German)
para = '''NLTK ist Open Source Software. Der Quellcode wird unter den Bedingungen der Apache License Version 2.0 vertrieben.
Die Dokumentation wird unter den Bedingungen der Creative Commons-Lizenz Namensnennung - Nicht kommerziell - Keine
abgeleiteten Werke 3.0 in den Vereinigten Staaten verteilt.'''


In [30]:
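# sent_tokenize defaults to language='english'; passing language='german' selects the German Punkt model
sent_tokenize(para)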

['NLTK ist Open Source Software.',
 'Der Quellcode wird unter den Bedingungen der Apache License Version 2.0 vertrieben.',
 'Die Dokumentation wird unter den Bedingungen der Creative Commons-Lizenz Namensnennung - Nicht kommerziell - Keine \nabgeleiteten Werke 3.0 in den Vereinigten Staaten verteilt.']
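
Note that the output above keeps "Version 2.0" inside a single sentence. To see why that matters, here is a deliberately naive splitter that breaks on every period. It is not how `sent_tokenize` works; it is a plain-Python sketch, using the first two sentences of the German text, of what goes wrong without a trained sentence tokenizer.

```python
# Naive sentence splitting: break on every period.
# This is NOT what sent_tokenize does; it only shows why a plain split falls short.
para = ("NLTK ist Open Source Software. Der Quellcode wird unter den "
        "Bedingungen der Apache License Version 2.0 vertrieben.")

naive = [piece.strip() for piece in para.split(".") if piece.strip()]

for piece in naive:
    print(repr(piece))
```

The text has two sentences, but the naive split returns three pieces because the period in "2.0" is treated as a sentence boundary.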

In [32]:
word_tokenize(para)

['NLTK',
 'ist',
 'Open',
 'Source',
 'Software',
 '.',
 'Der',
 'Quellcode',
 'wird',
 'unter',
 'den',
 'Bedingungen',
 'der',
 'Apache',
 'License',
 'Version',
 '2.0',
 'vertrieben',
 '.',
 'Die',
 'Dokumentation',
 'wird',
 'unter',
 'den',
 'Bedingungen',
 'der',
 'Creative',
 'Commons-Lizenz',
 'Namensnennung',
 '-',
 'Nicht',
 'kommerziell',
 '-',
 'Keine',
 'abgeleiteten',
 'Werke',
 '3.0',
 'in',
 'den',
 'Vereinigten',
 'Staaten',
 'verteilt',
 '.']

In [33]:
# Now tokenize words sentence by sentence
text = "Joe waited for the train. The train was late. Mary and Samantha took the bus. I looked for Mary and Samantha at the bus station."
result = [word_tokenize(t) for t in sent_tokenize(text)]
for s in result:
    print(s)

['Joe', 'waited', 'for', 'the', 'train', '.']
['The', 'train', 'was', 'late', '.']
['Mary', 'and', 'Samantha', 'took', 'the', 'bus', '.']
['I', 'looked', 'for', 'Mary', 'and', 'Samantha', 'at', 'the', 'bus', 'station', '.']
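
The nested list printed above (one token list per sentence) maps back to the topics at the start: each inner list is a document, and flattening them gives the words from which a vocabulary can be built. The sketch below hard-codes that output so it runs without NLTK; dropping punctuation with `str.isalpha` and lowercasing are choices made for this illustration, not something the notebook prescribes.

```python
from collections import Counter

# Token lists copied from the output above, one per sentence (document).
result = [
    ['Joe', 'waited', 'for', 'the', 'train', '.'],
    ['The', 'train', 'was', 'late', '.'],
    ['Mary', 'and', 'Samantha', 'took', 'the', 'bus', '.'],
    ['I', 'looked', 'for', 'Mary', 'and', 'Samantha', 'at', 'the', 'bus', 'station', '.'],
]

# Flatten the documents, keep alphabetic tokens only, and lowercase them
# so that 'The' and 'the' count as the same word.
words = [tok.lower() for sentence in result for tok in sentence if tok.isalpha()]

counts = Counter(words)        # word -> number of occurrences
vocabulary = sorted(counts)    # unique words

print(len(words), "words,", len(vocabulary), "unique")
print(counts.most_common(3))
```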
