In [1]:
# NLTK (Natural Language Toolkit) is used to perform NLP tasks
import nltk

NLTK (Natural Language Toolkit) is a Python library used for working with human language data (text) and performing tasks like tokenization, stemming, tagging, and more in NLP.
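
As a quick illustration of one of these tasks, the sketch below stems a few words with NLTK's `PorterStemmer`, which is rule-based and needs no extra data download. The word list is an arbitrary example chosen here, not taken from the notebook.

```python
from nltk.stem import PorterStemmer

# PorterStemmer is rule-based, so it works without nltk.download()
stemmer = PorterStemmer()

words = ["running", "cats", "tagging"]  # illustrative words (assumption)
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```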

```python
import nltk  # Import the Natural Language Toolkit library
```


In [2]:
# If the NLTK library is not installed, you need to install the NLTK library using pip
!pip install nltk



If the **NLTK** library is not installed and you try to import it, you'll get the following error:

### ❌ Error:
```bash
ModuleNotFoundError: No module named 'nltk'
```

### ✅ Solution:
You need to install the NLTK library using pip:

In Terminal/CMD:
```bash
pip install nltk
```

In Jupyter Notebook:
```python
!pip install nltk
```

After successful installation, you can import it without error:

```python
import nltk
``` 
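
To avoid hitting the error in the first place, a script can check whether the package is importable before importing it. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("nltk"):
    print("nltk is available")
else:
    print("nltk is missing; install it with: pip install nltk")
```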

In [3]:
nltk.download('all')    # download every NLTK corpus and model (large; only needed once)
nltk.download('punkt')  # download only the Punkt tokenizer data (already included in 'all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\Lenovo\AppData\Roaming\nltk_data

True

In [14]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


| Function | What it does |
|----------|--------------|
| `texts()` | Prints the names of the loaded sample texts (`text1` to `text9`) |
| `sents()` | Prints the sample sentences `sent1` to `sent9`, one from each text; each sentence is stored as a list of word tokens |


```python
texts()  # prints the names of the loaded NLTK sample texts (text1 to text9)

sents()  # prints the sample sentences sent1 to sent9, one per text
```

In [None]:
texts() # prints the names of the loaded NLTK sample texts (text1 to text9)

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [16]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [17]:
text2

<Text: Sense and Sensibility by Jane Austen 1811>

In [None]:
sents() # prints the sample sentences sent1 to sent9, one per text

sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .


In [19]:
sent1

['Call', 'me', 'Ishmael', '.']

In [20]:
sent2

['The',
 'family',
 'of',
 'Dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'Sussex',
 '.']
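
Since `sent1` to `sent9` are ordinary Python lists of tokens, the usual list operations work on them. The sketch below copies `sent1` and `sent2` from the outputs above as literals so that it runs without the corpus download. `lexical_diversity` is a small illustrative helper (distinct tokens divided by total tokens), not something this notebook defines.

```python
# Literal copies of the outputs above, so no NLTK data is required
sent1 = ['Call', 'me', 'Ishmael', '.']
sent2 = ['The', 'family', 'of', 'Dashwood', 'had', 'long',
         'been', 'settled', 'in', 'Sussex', '.']


def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (1.0 means no repeats)."""
    return len(set(tokens)) / len(tokens)


print(len(sent1))                # number of tokens, punctuation included
print(sent2[3])                  # index like any list
print(sent1 + sent2[:2])         # concatenate and slice
print(lexical_diversity(sent2))  # every token in sent2 is distinct
```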

**Alternatives to the NLTK library** for Natural Language Processing (NLP):


### 🔹 1. **spaCy**
- **Generally faster** than NLTK for full processing pipelines, since its core is implemented in Cython.
- Great for industrial applications.
- Built-in models for Named Entity Recognition (NER), POS tagging, parsing, etc.

```bash
pip install spacy
python -m spacy download en_core_web_sm  # small English model loaded below
```
```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world!")
print(doc)
```

In [4]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hello, world!")
print(doc)

Hello, world!


### 🔹 2. **TextBlob**
- Simpler and easier for beginners.
- Built on top of NLTK and Pattern.
- Supports sentiment analysis, part-of-speech tagging, noun phrase extraction, and more.

```bash
pip install textblob
```

```python
from textblob import TextBlob
blob = TextBlob("TextBlob is simple to use.")
print(blob)
```

In [5]:
import textblob

from textblob import TextBlob
blob = TextBlob("TextBlob is simple to use.")
print(blob)

TextBlob is simple to use.


### 🔹 3. **Transformers (by Hugging Face)**
- Best for deep learning-based NLP.
- Pre-trained models like BERT, GPT, T5, etc.
- Handles tasks like text classification, summarization, Q&A, etc.

```bash
pip install transformers
```

```python
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = """
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction 
between computers and humans through natural language. The ultimate goal of NLP is to enable computers to understand, 
interpret, and generate human language in a way that is both meaningful and useful.
"""

summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print("\nSummary:", summary[0]['summary_text'])
```

In [6]:
import transformers

from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = """
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction 
between computers and humans through natural language. The ultimate goal of NLP is to enable computers to understand, 
interpret, and generate human language in a way that is both meaningful and useful.
"""
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print("\nSummary:", summary[0]['summary_text'])

  from .autonotebook import tqdm as notebook_tqdm
Device set to use cpu



Summary: Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable computers to understand, interpret,


### 🔹 4. **Gensim**
- Specialized in topic modeling and document similarity.
- Commonly used for Word2Vec and LDA.

```bash
pip install gensim
```

```python
from gensim.corpora import Dictionary

texts = [["hello", "world"], ["hello", "gensim"]]

# Create dictionary
dictionary = Dictionary(texts)

# Convert to BoW
bow = dictionary.doc2bow(["hello", "gensim"])
print(bow)
```

In [7]:
import gensim 
from gensim.corpora import Dictionary

texts = [["hello", "world"], ["hello", "gensim"]]

# Create dictionary
dictionary = Dictionary(texts)

# Convert to BoW
bow = dictionary.doc2bow(["hello", "gensim"])
print(bow)

[(0, 1), (2, 1)]
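
Each pair in the output is `(token_id, count)`: `hello` has id 0, `gensim` has id 2, and each occurs once in the document. The pure-Python sketch below mimics that mapping to make it concrete. It is a simplified illustration, not Gensim's implementation; the helper names and the id-assignment order (new tokens numbered in sorted order, document by document) are assumptions chosen to reproduce the output above.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token (hypothetical helper)."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


texts = [["hello", "world"], ["hello", "gensim"]]
token2id = build_vocab(texts)
bow = to_bow(["hello", "gensim"], token2id)

print(token2id)  # token -> id mapping
print(bow)       # (token_id, count) pairs for the document
```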


Each has its strengths: **spaCy** for speed and production use, **TextBlob** for simplicity, **Transformers** for cutting-edge performance, and **Gensim** for topic modeling and document similarity.

Let's perform a simple task, tokenization, with each of these libraries.

## Tokenization

In [8]:
text = "Natural Language Processing is fun!"

### Using NLTK 

In [9]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens)

['Natural', 'Language', 'Processing', 'is', 'fun', '!']


### Using spaCy


In [10]:
nlp = spacy.load("en_core_web_sm") #  Loads a pre-trained spaCy model for tokenization

doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)

['Natural', 'Language', 'Processing', 'is', 'fun', '!']
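
Tokenization alone does not require a trained model. As a sketch, `spacy.blank("en")` creates an English pipeline that contains only the rule-based tokenizer, so it works even when `en_core_web_sm` has not been downloaded.

```python
import spacy

# Blank English pipeline: tokenizer only, no tagger, parser, or NER
nlp_blank = spacy.blank("en")

text = "Natural Language Processing is fun!"
tokens = [token.text for token in nlp_blank(text)]
print(tokens)
```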


### Using TextBlob

In [11]:
from textblob import TextBlob

blob = TextBlob(text)
tokens = blob.words
print(tokens)

['Natural', 'Language', 'Processing', 'is', 'fun']


### Using Hugging Face Transformers

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize(text)
print(tokens)

['natural', 'language', 'processing', 'is', 'fun', '!']


### Using Gensim

In [13]:
from gensim.utils import simple_preprocess

tokens = simple_preprocess(text)
print(tokens)

['natural', 'language', 'processing', 'is', 'fun']
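
The five outputs differ in two respects: whether punctuation is kept and whether the text is lowercased. NLTK and spaCy keep the original case and the `!`, TextBlob drops the `!`, the BERT tokenizer lowercases but keeps the `!`, and Gensim's `simple_preprocess` lowercases and drops it. The rough sketch below reproduces these four behaviors with the standard library's `re` module. It only illustrates the differences on this sentence; it is not how any of these libraries tokenizes internally.

```python
import re

text = "Natural Language Processing is fun!"

# Words and punctuation marks as separate tokens, original case
with_punct = re.findall(r"\w+|[^\w\s]", text)

# Words only, original case
words_only = re.findall(r"\w+", text)

# Lowercased variants of the two lists above
lower_with_punct = [t.lower() for t in with_punct]
lower_words_only = [t.lower() for t in words_only]

print(with_punct)        # same shape as the NLTK and spaCy outputs
print(words_only)        # same shape as the TextBlob output
print(lower_with_punct)  # same shape as the BERT tokenizer output
print(lower_words_only)  # same shape as the Gensim output
```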
