# Tokenization with NLTK
This notebook demonstrates sentence and word tokenization with NLTK. Tokenization splits raw text into smaller units, here sentences and words, and is typically one of the first steps in an NLP pipeline.

## 1. Define Example Corpus
Define a short sample corpus to tokenize.

In [1]:
corpus = """hello i am learning nlp! I am learning nlp. nlp is very interesting. I like nlp."""

## 2. View Corpus
Display the original text corpus.

In [2]:
corpus

'hello i am learning nlp! I am learning nlp. nlp is very interesting. I like nlp.'

## 3. Import NLTK and Download Resources
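Import NLTK and download the `punkt` resource, which contains the pretrained Punkt sentence tokenizer models. `sent_tokenize` uses these models directly, and `word_tokenize` relies on them to split text into sentences before splitting each sentence into words.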

In [3]:
# Tokenization: paragraph -> sentences
import nltk
nltk.download('punkt')  # fetch the Punkt sentence tokenizer models


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True
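
Depending on the installed NLTK version, the tokenizers may look for a resource named `punkt_tab` instead of `punkt`. The following is a minimal sketch that checks for both and downloads only what is missing. It assumes both names are valid download identifiers in your NLTK version; `ensure_tokenizer_data` is a hypothetical helper, not part of NLTK.

```python
import nltk

def ensure_tokenizer_data(names=("punkt", "punkt_tab")):
    """Download each named tokenizer resource unless it is already installed.

    Hypothetical helper: returns a dict mapping each name to True if the
    resource is available after the call, False otherwise.
    """
    status = {}
    for name in names:
        try:
            # nltk.data.find raises LookupError when the resource is absent.
            nltk.data.find(f"tokenizers/{name}")
            status[name] = True
        except LookupError:
            # nltk.download returns False if the download fails.
            status[name] = bool(nltk.download(name, quiet=True))
    return status
```

Calling `ensure_tokenizer_data()` once near the top of the notebook avoids repeating the download on every run and reports which resources could not be fetched.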

## 4. Sentence Tokenization
Split the corpus into sentences using NLTK's `sent_tokenize`.

In [4]:
from nltk.tokenize import sent_tokenize
documents = sent_tokenize(corpus)

## 5. View Tokenized Sentences
Display the list of sentences obtained from tokenization.

In [5]:
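# sent_tokenize returns a plain Python list of strings.
# A notebook cell displays only its last expression, so a bare
# type(documents) placed before this line would show nothing.
documents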

['hello i am learning nlp!',
 'I am learning nlp.',
 'nlp is very interesting.',
 'I like nlp.']
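
To build intuition for what sentence tokenization involves, here is a rough standard-library sketch that splits after `.`, `!` or `?` when whitespace follows. It is a simplified stand-in, not the Punkt algorithm behind `sent_tokenize`, and `naive_sent_split` is a name made up for this illustration.

```python
import re

def naive_sent_split(text):
    """Split text after '.', '!' or '?' when whitespace follows (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

corpus = "hello i am learning nlp! I am learning nlp. nlp is very interesting. I like nlp."
naive_sentences = naive_sent_split(corpus)
print(naive_sentences)

# An abbreviation exposes the weakness of a punctuation-only rule:
# this prints ['Dr.', 'Smith arrived.', 'He sat down.']
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

On this corpus the sketch produces the same four sentences, but it breaks on abbreviations such as "Dr.". Handling cases like that is the main reason to prefer a trained sentence tokenizer over a hand-written rule.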

## 6. Print Each Sentence
Iterate through the tokenized sentences and print each one.

In [6]:
# Iterate over the tokenized sentences
for sentence in documents:
    print(sentence)

hello i am learning nlp!
I am learning nlp.
nlp is very interesting.
I like nlp.
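
When the position of each sentence matters, for example to refer back to it later, `enumerate` can number the sentences during iteration. This small sketch hard-codes the sentence list shown above so that it runs on its own; the `numbered` variable is introduced only for illustration.

```python
# Sentence list copied from the sent_tokenize output above.
documents = [
    "hello i am learning nlp!",
    "I am learning nlp.",
    "nlp is very interesting.",
    "I like nlp.",
]

# Pair each sentence with a 1-based index.
numbered = [f"{index}: {sentence}" for index, sentence in enumerate(documents, start=1)]
for line in numbered:
    print(line)
```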


## 7. Word Tokenization
Split the full corpus, and then each sentence, into word tokens using NLTK's `word_tokenize`. Punctuation marks such as `!` and `.` are returned as separate tokens.

In [7]:
# Word tokenization
# corpus (paragraph) -> words
from nltk.tokenize import word_tokenize
words = word_tokenize(corpus)

# each sentence -> words
for sentence in documents:
    sentence_words = word_tokenize(sentence)
    print(sentence_words)

# Display the tokens for the whole corpus
words

['hello', 'i', 'am', 'learning', 'nlp', '!']
['I', 'am', 'learning', 'nlp', '.']
['nlp', 'is', 'very', 'interesting', '.']
['I', 'like', 'nlp', '.']


['hello',
 'i',
 'am',
 'learning',
 'nlp',
 '!',
 'I',
 'am',
 'learning',
 'nlp',
 '.',
 'nlp',
 'is',
 'very',
 'interesting',
 '.',
 'I',
 'like',
 'nlp',
 '.']
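
The way punctuation becomes its own token can be approximated with a single regular expression. The sketch below uses only the standard library and is a simplified stand-in for `word_tokenize`, not its actual implementation; `simple_word_tokenize` is a hypothetical name.

```python
import re

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

corpus = "hello i am learning nlp! I am learning nlp. nlp is very interesting. I like nlp."
tokens = simple_word_tokenize(corpus)
print(len(tokens))  # 20 tokens, matching the list above
print(tokens[:6])   # ['hello', 'i', 'am', 'learning', 'nlp', '!']

# Contractions are where the two approaches differ:
# this prints ['I', 'don', "'", 't', 'know', '.']
print(simple_word_tokenize("I don't know."))
```

For this corpus the regex yields the same 20 tokens. On contractions it does not: `word_tokenize` splits "don't" into `do` and `n't`, while the regex breaks it at the apostrophe.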