# Tokenization
Tokenization is a key process in Natural Language Processing (NLP) that involves breaking down text into smaller, manageable pieces called tokens. These tokens can be words, subwords, or characters, and they help transform raw text into a structured format that models can easily understand and process.
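
The three granularities mentioned above can be illustrated with a minimal sketch in plain Python. The helper `greedy_subwords` and the tiny `toy_vocab` are hypothetical names invented for this illustration; real subword tokenizers learn their vocabulary from data and use more refined rules, so treat this only as a rough picture of the idea.

```python
def greedy_subwords(word, vocab):
    """Split `word` by repeatedly taking the longest prefix found in `vocab`.

    Hypothetical helper for illustration only; not part of any library.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


text = "unhappiness matters"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: every character is its own token.
char_tokens = list("happy")

# Subword-level tokens: pieces drawn from a small, made-up vocabulary.
toy_vocab = {"un", "happi", "ness"}
subword_tokens = greedy_subwords("unhappiness", toy_vocab)

print(word_tokens)     # ['unhappiness', 'matters']
print(char_tokens)     # ['h', 'a', 'p', 'p', 'y']
print(subword_tokens)  # ['un', 'happi', 'ness']
```

Word-level tokens are the easiest to read, character-level tokens never run into unknown words, and subword tokens sit between the two.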

## Corpus
A corpus is a large collection of text used to train language models. It can include text from sources such as books, articles, and social media. The short passage below serves as a small sample corpus for the examples that follow.

In [8]:
corpus="""True Happiness is something comes from deep inside. It comes from our hearts. Although we feel that material things or pleasures can make us happy, it is not true. A person who is wealthy may or may not be happy. However, a person who is happy is always wealthy.

Happiness reflects one’s positive attitude towards life. Such a person strongly believes that whatever happens is for good. Even if he fails in life, he or she doesn’t blame destiny for it. Also,  such a person doesn’t lose hope. He is always hopeful.

An optimistic person will always be happy. Moreover, such a person will be able to find pleasure even in the pain. Also, a happy person doesn’t depend on others or external factors to make him happy. He will manage to be happy even in the worst situations.

Thus, we can conclude that happiness is an internal factor. It can be achieved by self-realization. Only by seeking unity with the Almighty, can one find true happiness.
"""

In [9]:
print(corpus)

True Happiness is something comes from deep inside. It comes from our hearts. Although we feel that material things or pleasures can make us happy, it is not true. A person who is wealthy may or may not be happy. However, a person who is happy is always wealthy.

Happiness reflects one’s positive attitude towards life. Such a person strongly believes that whatever happens is for good. Even if he fails in life, he or she doesn’t blame destiny for it. Also,  such a person doesn’t lose hope. He is always hopeful.

An optimistic person will always be happy. Moreover, such a person will be able to find pleasure even in the pain. Also, a happy person doesn’t depend on others or external factors to make him happy. He will manage to be happy even in the worst situations.

Thus, we can conclude that happiness is an internal factor. It can be achieved by self-realization. Only by seeking unity with the Almighty, can one find true happiness.



## About NLTK Library
The Natural Language Toolkit (NLTK) is a powerful Python library for working with human language data (text). It provides tools for text processing, analysis, and computational linguistics.

### Tokenization
Tokenization in NLTK means splitting text into words or sentences. The two main functions are:

1. nltk.word_tokenize(text): Tokenizes a string into words.
2. nltk.sent_tokenize(text): Tokenizes a string into sentences.

#### Sentence Tokenization
Sentence tokenization is the process of dividing a piece of text into individual sentences. It is an important step in text processing and natural language understanding, as it helps to break down continuous text into manageable chunks (i.e., sentences) for further analysis or processing.
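
To see why a dedicated sentence tokenizer is useful, here is a minimal sketch of a naive splitter built on Python's standard `re` module. The function name `naive_sent_split` is hypothetical, and this is not how NLTK works internally; it only shows where a simple "split after punctuation" rule breaks down.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    Hypothetical helper for illustration; it knows nothing about abbreviations.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())


simple = "It comes from our hearts. He is always hopeful."
tricky = "Dr. Smith arrived. He sat down."

simple_sentences = naive_sent_split(simple)
tricky_sentences = naive_sent_split(tricky)

print(simple_sentences)  # ['It comes from our hearts.', 'He is always hopeful.']
print(tricky_sentences)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```

The second result wrongly treats "Dr." as a sentence of its own. The Punkt models that `nltk.sent_tokenize` relies on are trained to recognize abbreviations like this, which is why the notebook downloads them before tokenizing.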

In [10]:
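# Sentence tokenization in practice
import nltk

# Download the Punkt sentence tokenizer data used by sent_tokenize and word_tokenize.
# Newer NLTK releases look for 'punkt_tab'; older ones use 'punkt'.
nltk.download('punkt_tab')
nltk.download('punkt')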

[nltk_data] Downloading package punkt_tab to /home/ash/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /home/ash/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [11]:
sentences = nltk.sent_tokenize(corpus)

In [None]:
print(type(sentences))

<class 'list'>


In [9]:
for sentence in sentences:
    print(sentence)

True Happiness is something comes from deep inside.
It comes from our hearts.
Although we feel that material things or pleasures can make us happy, it is not true.
A person who is wealthy may or may not be happy.
However, a person who is happy is always wealthy.
Happiness reflects one’s positive attitude towards life.
Such a person strongly believes that whatever happens is for good.
Even if he fails in life, he or she doesn’t blame destiny for it.
Also,  such a person doesn’t lose hope.
He is always hopeful.
An optimistic person will always be happy.
Moreover, such a person will be able to find pleasure even in the pain.
Also, a happy person doesn’t depend on others or external factors to make him happy.
He will manage to be happy even in the worst situations.
Thus, we can conclude that happiness is an internal factor.
It can be achieved by self-realization.
Only by seeking unity with the Almighty, can one find true happiness.


#### Word Tokenization
Word tokenization splits text into individual words, making it easier to analyze language at the word level. For example, the sentence "The quick brown fox" is tokenized into ["The", "quick", "brown", "fox"]. Punctuation marks such as commas and periods become separate tokens.
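
The idea can be sketched with a single regular expression from Python's standard `re` module. The function name `simple_word_tokenize` is hypothetical, and the pattern is only a rough approximation: `nltk.word_tokenize` applies additional rules that this sketch does not attempt to reproduce.

```python
import re


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens.

    Hypothetical helper for illustration; whitespace is discarded.
    """
    return re.findall(r"\w+|[^\w\s]", text)


fox_tokens = simple_word_tokenize("The quick brown fox")
sentence_tokens = simple_word_tokenize(
    "However, a person who is happy is always wealthy."
)

print(fox_tokens)
# ['The', 'quick', 'brown', 'fox']
print(sentence_tokens)
# ['However', ',', 'a', 'person', 'who', 'is', 'happy', 'is', 'always', 'wealthy', '.']
```

Keeping punctuation as its own token, instead of leaving it attached to the neighboring word, means "wealthy" and "wealthy." are counted as the same word in later analysis.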

In [None]:
# Split the whole corpus into word and punctuation tokens
words = nltk.word_tokenize(corpus)

In [15]:
for word in words:
    print(word)

True
Happiness
is
something
comes
from
deep
inside
.
It
comes
from
our
hearts
.
Although
we
feel
that
material
things
or
pleasures
can
make
us
happy
,
it
is
not
true
.
A
person
who
is
wealthy
may
or
may
not
be
happy
.
However
,
a
person
who
is
happy
is
always
wealthy
.
Happiness
reflects
one
’
s
positive
attitude
towards
life
.
Such
a
person
strongly
believes
that
whatever
happens
is
for
good
.
Even
if
he
fails
in
life
,
he
or
she
doesn
’
t
blame
destiny
for
it
.
Also
,
such
a
person
doesn
’
t
lose
hope
.
He
is
always
hopeful
.
An
optimistic
person
will
always
be
happy
.
Moreover
,
such
a
person
will
be
able
to
find
pleasure
even
in
the
pain
.
Also
,
a
happy
person
doesn
’
t
depend
on
others
or
external
factors
to
make
him
happy
.
He
will
manage
to
be
happy
even
in
the
worst
situations
.
Thus
,
we
can
conclude
that
happiness
is
an
internal
factor
.
It
can
be
achieved
by
self-realization
.
Only
by
seeking
unity
with
the
Almighty
,
can
one
find
true
happiness
.


In [16]:
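# wordpunct_tokenize separates every run of punctuation from word characters, so it also splits hyphenated words such as "self-realization"
words = nltk.wordpunct_tokenize(corpus)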

In [17]:
print(words)

['True', 'Happiness', 'is', 'something', 'comes', 'from', 'deep', 'inside', '.', 'It', 'comes', 'from', 'our', 'hearts', '.', 'Although', 'we', 'feel', 'that', 'material', 'things', 'or', 'pleasures', 'can', 'make', 'us', 'happy', ',', 'it', 'is', 'not', 'true', '.', 'A', 'person', 'who', 'is', 'wealthy', 'may', 'or', 'may', 'not', 'be', 'happy', '.', 'However', ',', 'a', 'person', 'who', 'is', 'happy', 'is', 'always', 'wealthy', '.', 'Happiness', 'reflects', 'one', '’', 's', 'positive', 'attitude', 'towards', 'life', '.', 'Such', 'a', 'person', 'strongly', 'believes', 'that', 'whatever', 'happens', 'is', 'for', 'good', '.', 'Even', 'if', 'he', 'fails', 'in', 'life', ',', 'he', 'or', 'she', 'doesn', '’', 't', 'blame', 'destiny', 'for', 'it', '.', 'Also', ',', 'such', 'a', 'person', 'doesn', '’', 't', 'lose', 'hope', '.', 'He', 'is', 'always', 'hopeful', '.', 'An', 'optimistic', 'person', 'will', 'always', 'be', 'happy', '.', 'Moreover', ',', 'such', 'a', 'person', 'will', 'be', 'able',