
## Tokenization Example with NLTK

This notebook demonstrates basic word tokenization with the Natural Language Toolkit (NLTK) library in Python, then counts the resulting tokens and their frequencies.

**Functionality:**

1. **Installation:** Installs NLTK 3.9.1 with pip.
2. **Import:** Imports the necessary modules from NLTK: `word_tokenize` for word tokenization and `FreqDist` for counting token frequencies.
3. **Sample Text:** Defines a sample text string for tokenization.
4. **Data Download:** Downloads the `punkt_tab` resource using `nltk.download('punkt_tab')`. `word_tokenize` splits the text into sentences before splitting it into words, and the sentence splitter needs this resource.
5. **Tokenization:** Tokenizes the sample text using `word_tokenize` and stores the resulting tokens in a list called `tokens`.
6. **Output:** Prints the tokens, then the number of tokens and the frequency of each token.

**Usage:**

1. Ensure you have Python and pip installed.
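2. Run the cells in order in Jupyter or Colab (the `!pip` line is notebook syntax and will not run in a plain Python script). The notebook installs NLTK, downloads the required data, tokenizes the sample text, and prints the tokens, their count, and their frequencies.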

**Dependencies:**

- NLTK (version 3.9.1)

**Note:**

The `punkt_tab` download is required. `word_tokenize` performs sentence tokenization first, and without this resource it raises a `LookupError`.

In [1]:
# 1. Install NLTK (if not already installed)
!pip install nltk==3.9.1

# 2. Import necessary modules
import nltk
from nltk.tokenize import word_tokenize
from nltk import FreqDist

# 3. Create a sample text
text = "This is a paragraph about tokenization. Tokenization is a common task in natural language processing. It involves splitting text into individual words or tokens. These tokens are then used for further analysis."





In [3]:
# Download the 'punkt_tab' resource that word_tokenize needs
nltk.download('punkt_tab')

# 4. Tokenize the text
tokens = word_tokenize(text)

# 5. Print the tokens
print("Tokens:", tokens)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Tokens: ['This', 'is', 'a', 'paragraph', 'about', 'tokenization', '.', 'Tokenization', 'is', 'a', 'common', 'task', 'in', 'natural', 'language', 'processing', '.', 'It', 'involves', 'splitting', 'text', 'into', 'individual', 'words', 'or', 'tokens', '.', 'These', 'tokens', 'are', 'then', 'used', 'for', 'further', 'analysis', '.']
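
To make the output above more concrete, here is a minimal sketch of a regex-based tokenizer that happens to reproduce the same 36 tokens for this sample text. It is not how `word_tokenize` works internally; `simple_tokenize` is a hypothetical helper introduced only for illustration.

```python
import re

# Same sample text as in the notebook.
text = "This is a paragraph about tokenization. Tokenization is a common task in natural language processing. It involves splitting text into individual words or tokens. These tokens are then used for further analysis."


def simple_tokenize(s):
    """Hypothetical helper: return runs of word characters, or single
    punctuation characters, in order of appearance."""
    return re.findall(r"\w+|[^\w\s]", s)


approx_tokens = simple_tokenize(text)
print(approx_tokens[:7])
# ['This', 'is', 'a', 'paragraph', 'about', 'tokenization', '.']
print(len(approx_tokens))
# 36
```

The two approaches diverge on less regular input. For example, `word_tokenize` splits a contraction such as "don't" into `do` and `n't`, while the regex above yields `don`, `'`, and `t`.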


In [4]:
# 6. Count the number of tokens
num_tokens = len(tokens)
print("\nNumber of tokens:", num_tokens)



Number of tokens: 36
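
The count of 36 includes the four periods, because punctuation marks are tokens too. The sketch below separates word tokens from punctuation. To stay runnable without NLTK, it rebuilds the token list with a string trick that only works for this particular sample text; in the notebook you would apply the same filter directly to `tokens`.

```python
# Same sample text as in the notebook.
text = "This is a paragraph about tokenization. Tokenization is a common task in natural language processing. It involves splitting text into individual words or tokens. These tokens are then used for further analysis."

# Stand-in for the NLTK output: splitting off the periods reproduces the
# same token list for this text only. This is not a general tokenizer.
tokens = text.replace(".", " .").split()

# str.isalpha() is True only for tokens made entirely of letters.
word_tokens = [t for t in tokens if t.isalpha()]
punct_tokens = [t for t in tokens if not t.isalpha()]

print("All tokens:", len(tokens))
print("Word tokens:", len(word_tokens))
print("Punctuation tokens:", len(punct_tokens))
```

Note that `isalpha()` also drops tokens containing digits or hyphens, so a real pipeline may need a different filter.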


In [5]:
# 7. Compute the frequency of each token
token_freq = FreqDist(tokens)
print("\nToken Frequency:", token_freq.most_common())


Token Frequency: [('.', 4), ('is', 2), ('a', 2), ('tokens', 2), ('This', 1), ('paragraph', 1), ('about', 1), ('tokenization', 1), ('Tokenization', 1), ('common', 1), ('task', 1), ('in', 1), ('natural', 1), ('language', 1), ('processing', 1), ('It', 1), ('involves', 1), ('splitting', 1), ('text', 1), ('into', 1), ('individual', 1), ('words', 1), ('or', 1), ('These', 1), ('are', 1), ('then', 1), ('used', 1), ('for', 1), ('further', 1), ('analysis', 1)]
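
In the output above, `tokenization` and `Tokenization` are counted separately, and the period is the most frequent token. A common follow-up, sketched here as one possible approach, is to lowercase the tokens and drop punctuation before counting. The sketch uses `collections.Counter` from the standard library (the class `FreqDist` is built on) and rebuilds the token list with a string trick that only works for this sample text.

```python
from collections import Counter

# Same sample text as in the notebook.
text = "This is a paragraph about tokenization. Tokenization is a common task in natural language processing. It involves splitting text into individual words or tokens. These tokens are then used for further analysis."

# Stand-in for the NLTK output: valid for this sample text only.
tokens = text.replace(".", " .").split()

# Lowercase every token and keep only purely alphabetic ones.
case_folded = Counter(t.lower() for t in tokens if t.isalpha())

print("tokenization:", case_folded["tokenization"])
print("Distinct words:", len(case_folded))
print("Most common:", case_folded.most_common(4))
```

In the notebook itself, the same generator expression can be passed to `FreqDist` in place of `tokens`.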
