## NLP Key Terms: Corpus, Documents, Words, Vocabulary

---

### 1. Corpus
- A **corpus** is a **collection of documents or texts**.
- It is the **complete dataset** that contains all the text data.
- Used as the input for NLP tasks.

> Example: All reviews on an e-commerce site.

---

### 2. Document
- A **document** is a **single piece of text** from the corpus.
- It can be a paragraph, a sentence, or even a short message.

> Example: One review, one tweet, or one paragraph.

---

### 3. Word
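- A **word** (more precisely, a **token**) is the **basic unit** of text that NLP methods work with; tokenizers usually return punctuation marks as tokens too.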
- It is obtained by **tokenizing** a document.

> Example: In “I love data science”, the words are: I, love, data, science.

---

### 4. Vocabulary
- The **vocabulary** is the **set of all unique words** in the entire corpus.
- No repetitions — each word appears only once in the vocabulary.

> Example: From ["I love NLP", "NLP is fun"],  
> Vocabulary = {I, love, NLP, is, fun}
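
As a minimal sketch of the example above, the vocabulary can be built with a Python set. Whitespace splitting stands in for a real tokenizer here, and the variable names are illustrative assumptions rather than part of any library.

```python
# Toy corpus from the example above: a list of documents.
corpus = ["I love NLP", "NLP is fun"]

# Whitespace splitting is a stand-in for a proper tokenizer.
tokenized_documents = [document.split() for document in corpus]

# A set keeps each distinct word exactly once, so "NLP" is stored only once.
vocabulary = {word for tokens in tokenized_documents for word in tokens}

print(sorted(vocabulary))
print(len(vocabulary))  # 5 unique words
```

Note that a set is case-sensitive, so "NLP" and "nlp" would count as two different vocabulary entries unless the text is lowercased first.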

---

### Quick Analogy (for understanding)

| Concept     | Analogy Example                |
|-------------|--------------------------------|
| Corpus      | Book / Full Collection         |
| Document    | Paragraph / Sentence           |
| Word        | Individual word                |
| Vocabulary  | Unique words used in the book  |


In [3]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [6]:
corpus = '''Hey there! Welcome to Anshum Banga’s NLP tutorials.
Dive into the full course on GitHub, and soon you'll be talking to machines like they’re your best friend.
Don’t rush, though — even computers need some time to understand us.
'''

In [7]:
print(corpus)

Hey there! Welcome to Anshum Banga’s NLP tutorials.
Dive into the full course on GitHub, and soon you'll be talking to machines like they’re your best friend.
Don’t rush, though — even computers need some time to understand us.



In [8]:
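# Tokenization

# Sentence tokenization: splitting a paragraph into sentences

from nltk.tokenize import sent_tokenize

# sent_tokenize converts a paragraph into a list of sentences.
# It needs the Punkt models; if a LookupError appears, run nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).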

In [12]:
sent_tokenize(corpus)
# The tokenizer splits the text into sentences at sentence-ending punctuation such as '.', '!' and '?'.
# For this corpus the result is correct: 'Hey there!' ends with an exclamation mark, so it becomes its own sentence.
# Punctuation does not always mark the end of a sentence, though (abbreviations, decimals, URLs),
# so segmentation can go wrong on text where these marks are used in other contexts.

documents = sent_tokenize(corpus)
documents

['Hey there!',
 'Welcome to Anshum Banga’s NLP tutorials.',
 "Dive into the full course on GitHub, and soon you'll be talking to machines like they’re your best friend.",
 'Don’t rush, though — even computers need some time to understand us.']
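
The caveat about punctuation used in other contexts is easiest to see with a deliberately naive splitter. The sketch below is an illustration only: `naive_sentence_split` is a hypothetical helper, not an NLTK function, and `sent_tokenize` itself relies on a trained Punkt model rather than a single rule like this.

```python
import re

def naive_sentence_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works when each mark really ends a sentence.
print(naive_sentence_split("Hey there! Welcome to the tutorials."))

# Breaks on the abbreviation "Dr.", which does not end a sentence.
print(naive_sentence_split("Dr. Smith arrived. He sat down."))
```

The second call returns three pieces, with "Dr." wrongly treated as a sentence of its own, which is the kind of error a trained sentence tokenizer is designed to reduce.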

In [13]:
# Print each sentence (document) on its own line

for sentence in documents:
    print(sentence)

Hey there!
Welcome to Anshum Banga’s NLP tutorials.
Dive into the full course on GitHub, and soon you'll be talking to machines like they’re your best friend.
Don’t rush, though — even computers need some time to understand us.


In [14]:
## Second type of tokenization: word tokenization
# Split a paragraph or a sentence into individual words

In [15]:
from nltk.tokenize import word_tokenize

In [17]:
print(word_tokenize(corpus))

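# Each word and each punctuation mark becomes a separate token.
# Contractions written with a typographic apostrophe (’) are split into three tokens, e.g. 'Banga', '’', 's'.
# Why do we do this? Most NLP methods operate on individual words, and each word contributes to the meaning of the sentence.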

['Hey', 'there', '!', 'Welcome', 'to', 'Anshum', 'Banga', '’', 's', 'NLP', 'tutorials', '.', 'Dive', 'into', 'the', 'full', 'course', 'on', 'GitHub', ',', 'and', 'soon', 'you', "'ll", 'be', 'talking', 'to', 'machines', 'like', 'they', '’', 're', 'your', 'best', 'friend', '.', 'Don', '’', 't', 'rush', ',', 'though', '—', 'even', 'computers', 'need', 'some', 'time', 'to', 'understand', 'us', '.']


In [19]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hey', 'there', '!']
['Welcome', 'to', 'Anshum', 'Banga', '’', 's', 'NLP', 'tutorials', '.']
['Dive', 'into', 'the', 'full', 'course', 'on', 'GitHub', ',', 'and', 'soon', 'you', "'ll", 'be', 'talking', 'to', 'machines', 'like', 'they', '’', 're', 'your', 'best', 'friend', '.']
['Don', '’', 't', 'rush', ',', 'though', '—', 'even', 'computers', 'need', 'some', 'time', 'to', 'understand', 'us', '.']


In [20]:
# wordpunct_tokenize

from nltk.tokenize import wordpunct_tokenize

In [22]:
print(wordpunct_tokenize(corpus))

# wordpunct_tokenize(corpus) splits the text into runs of word characters and runs of punctuation.
# Unlike word_tokenize, it breaks at every punctuation mark: "you'll" becomes 'you', "'", 'll' instead of 'you', "'ll".
# This is useful when punctuation should be handled separately from the words around it.

['Hey', 'there', '!', 'Welcome', 'to', 'Anshum', 'Banga', '’', 's', 'NLP', 'tutorials', '.', 'Dive', 'into', 'the', 'full', 'course', 'on', 'GitHub', ',', 'and', 'soon', 'you', "'", 'll', 'be', 'talking', 'to', 'machines', 'like', 'they', '’', 're', 'your', 'best', 'friend', '.', 'Don', '’', 't', 'rush', ',', 'though', '—', 'even', 'computers', 'need', 'some', 'time', 'to', 'understand', 'us', '.']
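
The splitting behaviour seen above can be approximated with a single regular expression. This is a sketch using only the standard library; the pattern is assumed to mirror the one behind NLTK's `WordPunctTokenizer`, and `regex_wordpunct` is a hypothetical helper name.

```python
import re

# Assumed equivalent of wordpunct_tokenize: a run of word characters,
# or a run of characters that are neither word characters nor whitespace.
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_wordpunct(text):
    """Return word and punctuation tokens in the order they appear."""
    return WORD_PUNCT_PATTERN.findall(text)

print(regex_wordpunct("soon you'll be talking"))
print(regex_wordpunct("Banga’s NLP tutorials."))
```

Both the ASCII apostrophe and the typographic one count as punctuation under this pattern, so every contraction is broken into three tokens.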


In [28]:
# Treebank word tokenizer

from nltk.tokenize import TreebankWordTokenizer

In [29]:
tokenizer = TreebankWordTokenizer()
tokenizer

<nltk.tokenize.treebank.TreebankWordTokenizer at 0x213fd94b1d0>

In [32]:
print(tokenizer.tokenize(corpus))

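# TreebankWordTokenizer follows Penn Treebank conventions and expects one sentence at a time.
# Only the full stop at the very end of the string becomes a separate token; earlier ones stay attached ('tutorials.', 'friend.').
# Typographic apostrophes are not split ('Banga’s', 'Don’t'), while the ASCII contraction "you'll" is split into 'you' and "'ll".
# For cleaner tokens, split the text with sent_tokenize first and tokenize each sentence separately.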

['Hey', 'there', '!', 'Welcome', 'to', 'Anshum', 'Banga’s', 'NLP', 'tutorials.', 'Dive', 'into', 'the', 'full', 'course', 'on', 'GitHub', ',', 'and', 'soon', 'you', "'ll", 'be', 'talking', 'to', 'machines', 'like', 'they’re', 'your', 'best', 'friend.', 'Don’t', 'rush', ',', 'though', '—', 'even', 'computers', 'need', 'some', 'time', 'to', 'understand', 'us', '.']
