# <center>NLP💬🔉 By 🎯Udaya ( Data Engineer 📚) </center>

## Corpus
Definition: A large collection of documents used for training NLP models. The corpus provides the data from which the vocabulary is built and on which the models are trained.
#### Example: ["I love NLP.", "NLP is fun."]

In [1]:
corpus = ["I love NLP.", "NLP is fun."]
print("Corpus:👉 ", corpus)

Corpus:👉  ['I love NLP.', 'NLP is fun.']


## Document
Definition: A single piece of text, such as an article, a paragraph, or a sentence. It is usually the unit of analysis in text processing.
#### Example: "I love NLP." is one document.

In [2]:
document = "I love NLP."
print("Document:👉 ", document)

Document:👉  I love NLP.


## Words
Definition: The basic units of language that carry meaning. A word can appear in different forms (singular or plural, different tenses) and can belong to different parts of speech (noun, verb, adjective, and so on).
#### Example: "cat", "running", "beautiful"

In [3]:
words = ["cat", "running", "beautiful"]
print("Words:👉 ", words)

Words:👉  ['cat', 'running', 'beautiful']


## Tokens
Definition: Tokens are individual pieces of a sentence or text that have been segmented, often corresponding to words, punctuation, or other meaningful elements.
#### Example: In the sentence "I love NLP.", the tokens are ["I", "love", "NLP", "."]

In [4]:
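import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt data; download it once if it is missing:
# nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead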

sentence = "I love NLP."
tokens = word_tokenize(sentence)
print("Tokens:👉 ", tokens)

Tokens:👉  ['I', 'love', 'NLP', '.']


In [5]:
type(tokens)

list

## Vocabulary
Definition: The set of unique tokens found in a corpus. These are the words and symbols that a model or algorithm trained on that corpus can recognize.
#### Example: If our corpus is ["I love NLP", "NLP is fun"], the vocabulary is {"I", "love", "NLP", "is", "fun"}

In [6]:
from nltk.tokenize import word_tokenize

In [7]:
documents = ["I love NLP", "NLP is fun"]

tokenized_documents = [word_tokenize(doc) for doc in documents]
vocabulary = set(token for doc in tokenized_documents for token in doc)
print("Vocabulary:👉 ", vocabulary)

Vocabulary:👉  {'is', 'fun', 'love', 'NLP', 'I'}
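
In practice a vocabulary is often paired with an integer id per token so that text can be turned into numbers. The sketch below is a minimal illustration in plain Python. It uses `str.split()` instead of `word_tokenize` to stay dependency-free, and the names `build_vocab` and `token_to_id` are hypothetical, not part of NLTK.

```python
def build_vocab(documents):
    """Map each unique token to an integer id (sorted so the ids are reproducible)."""
    tokens = {token for doc in documents for token in doc.split()}
    return {token: idx for idx, token in enumerate(sorted(tokens))}


documents = ["I love NLP", "NLP is fun"]
token_to_id = build_vocab(documents)
print(token_to_id)  # {'I': 0, 'NLP': 1, 'fun': 2, 'is': 3, 'love': 4}

# Encode each document as a list of token ids
encoded = [[token_to_id[token] for token in doc.split()] for doc in documents]
print(encoded)  # [[0, 4, 1], [1, 3, 2]]
```

Sorting puts uppercase tokens first because Python compares strings by code point; any fixed ordering would work equally well.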


# Tokenization
Definition: The process of breaking down a text into smaller units, typically tokens. Tokenization is a crucial preprocessing step in NLP as it transforms raw text into a format that can be analyzed and processed by algorithms.
#### Example: Tokenizing the sentence "Tokenization is fun!" results in ["Tokenization", "is", "fun", "!"]

[NLTK - Natural Language Toolkit](https://www.nltk.org/)

[spaCy](https://spacy.io/)

In [8]:
import nltk

## Types of Tokenization
### Word Tokenization:

Breaks text into individual words or tokens based on whitespace or punctuation.
#### Example: "I love NLP." -> ["I", "love", "NLP", "."]


`word_tokenize`

In [9]:
text = "I love NLP. NLP is fun!"

from nltk.tokenize import word_tokenize
word_tokens = word_tokenize(text)

print("Word Tokenization:👉 ", word_tokens)

Word Tokenization:👉  ['I', 'love', 'NLP', '.', 'NLP', 'is', 'fun', '!']


`wordpunct_tokenize`

In [10]:
from nltk.tokenize import wordpunct_tokenize

corpus = """ Welcome to, my mastery map of #NLP. 
Do follow me ! """

wordpunct_tokenize(corpus)

['Welcome',
 'to',
 ',',
 'my',
 'mastery',
 'map',
 'of',
 '#',
 'NLP',
 '.',
 'Do',
 'follow',
 'me',
 '!']

`word_tokenize` vs. `wordpunct_tokenize`

### wordpunct_tokenize
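Function: Splits text into runs of word characters and runs of punctuation, so punctuation is always separated from words. Adjacent punctuation marks stay together as one token.
#### Example:
* Input: "Hello, world!"
* Output: ['Hello', ',', 'world', '!']
* Use Case: When you need punctuation to always be split off from words, even inside contractions.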
### word_tokenize
Function: Splits text into words while handling punctuation in context: a contraction is split into meaningful parts ("Don't" becomes "Do" and "n't") rather than broken at the apostrophe.
#### Example:
* Input: "Don't go."
* Output: ['Do', "n't", 'go', '.']
* Use Case: When you need a more nuanced handling of punctuation and contractions.

In [11]:
from nltk.tokenize import word_tokenize, wordpunct_tokenize

text = "Don't go. Hello, world!"

print('👇👇👇👇👇')
# Using wordpunct_tokenize: the contraction is broken at the apostrophe
tokens_wp = wordpunct_tokenize(text)
print("wordpunct_tokenize:", tokens_wp)

print('👇👇👇👇👇')
# Using word_tokenize: the contraction is split into "Do" + "n't"
tokens_wt = word_tokenize(text)
print("word_tokenize:", tokens_wt)

👇👇👇👇👇
wordpunct_tokenize: ['Don', "'", 't', 'go', '.', 'Hello', ',', 'world', '!']
👇👇👇👇👇
word_tokenize: ['Do', "n't", 'go', '.', 'Hello', ',', 'world', '!']
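
The `wordpunct_tokenize` behaviour comes from a single regular expression. In NLTK's source, `WordPunctTokenizer` is built on the pattern `\w+|[^\w\s]+` (worth re-checking against the version you have installed). The sketch below applies that pattern with the standard `re` module; the helper name `wordpunct_like` is hypothetical. It also shows that adjacent punctuation marks come out as one token.

```python
import re

# Runs of word characters, or runs of characters that are neither word characters nor whitespace
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"


def wordpunct_like(text):
    """Approximate wordpunct_tokenize with a plain regular expression."""
    return re.findall(WORDPUNCT_PATTERN, text)


print(wordpunct_like("Don't go. Hello, world!"))
# ['Don', "'", 't', 'go', '.', 'Hello', ',', 'world', '!']

print(wordpunct_like("Wait... what?!"))
# ['Wait', '...', 'what', '?!']
```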


### Sentence Tokenization:

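Splits text into individual sentences. Boundaries usually fall at periods, exclamation points, or question marks; NLTK's `sent_tokenize` uses the pre-trained Punkt model to decide which of these actually end a sentence.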
#### Example: "I love NLP. NLP is fun!" -> ["I love NLP.", "NLP is fun!"]


In [12]:
from nltk.tokenize import sent_tokenize

text = "I love NLP. NLP is fun!"
sentence_tokens = sent_tokenize(text)

print("Sentence Tokenization:👉 ", sentence_tokens)

Sentence Tokenization:👉  ['I love NLP.', 'NLP is fun!']
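
Splitting on end punctuation alone is fragile, which is why a trained model is used instead of a fixed rule. The naive splitter below (a hypothetical helper that uses only `re`, not anything from NLTK) handles the simple example but breaks on an abbreviation.

```python
import re


def naive_sent_split(text):
    """Split at whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(naive_sent_split("I love NLP. NLP is fun!"))
# ['I love NLP.', 'NLP is fun!']

print(naive_sent_split("Dr. Smith loves NLP. It is fun!"))
# ['Dr.', 'Smith loves NLP.', 'It is fun!']
```

The second result wrongly treats "Dr." as a complete sentence, the kind of case a sentence tokenizer has to learn to avoid.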


### Whitespace Tokenization:

Splits text based on whitespace characters like spaces, tabs, or newlines.
#### Example: "I love NLP." -> ["I", "love", "NLP."]


In [13]:
from nltk.tokenize import WhitespaceTokenizer
text = "I love NLP. NLP is fun!"

whitespace_tokenizer = WhitespaceTokenizer()
whitespace_tokens = whitespace_tokenizer.tokenize(text)

print("Whitespace Tokenization:👉 ", whitespace_tokens)

Whitespace Tokenization:👉  ['I', 'love', 'NLP.', 'NLP', 'is', 'fun!']
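
For ordinary text, Python's built-in `str.split()` called without arguments should give the same tokens, since it also splits on any run of spaces, tabs, or newlines. A short sketch with mixed whitespace (the sample string is made up for illustration):

```python
text = "I love NLP.\tNLP is\nfun!"

# No argument: split on any run of whitespace and drop empty strings
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['I', 'love', 'NLP.', 'NLP', 'is', 'fun!']
```

Punctuation stays attached to the neighbouring word ("NLP.", "fun!"), exactly as in the output above.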


### Regular Expression Tokenization:

Tokenizes text based on specified patterns using regular expressions.
#### Example: Keeping only runs of word characters with the pattern `\w+` (letters, digits, and underscores) -> "I love NLP." -> ["I", "love", "NLP"]


In [14]:
from nltk.tokenize import RegexpTokenizer
text = "I love NLP. NLP is fun!"

pattern = r'\w+'
regexp_tokenizer = RegexpTokenizer(pattern)
regexp_tokens = regexp_tokenizer.tokenize(text)

print("Regular Expression Tokenization:👉 ", regexp_tokens)

Regular Expression Tokenization:👉  ['I', 'love', 'NLP', 'NLP', 'is', 'fun']
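
The choice of pattern decides what counts as a token. The comparison below uses `re.findall` from the standard library as a stand-in for `RegexpTokenizer` (an assumption made to keep the example dependency-free; for patterns without capturing groups the two should agree). The sample sentence is hypothetical.

```python
import re

sample = "Room_2 holds 15 people."

# \w+ matches letters, digits, and underscores
word_chars = re.findall(r"\w+", sample)
print(word_chars)  # ['Room_2', 'holds', '15', 'people']

# [A-Za-z]+ matches ASCII letters only, so digits and underscores are dropped
letters_only = re.findall(r"[A-Za-z]+", sample)
print(letters_only)  # ['Room', 'holds', 'people']
```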


### NGram Tokenization:

Creates tokens by combining N consecutive words from the text.
#### Example: Bigram tokenization -> "I love NLP." -> ["I love", "love NLP", "NLP ."]


In [15]:
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

text = "I love NLP. NLP is fun!"

word_tokens = word_tokenize(text)
print("👉 ",word_tokens)

n = 2
ngram_tokens = list(ngrams(word_tokens, n))
print("N-Gram Tokenization (Bigrams):👉 ", ngram_tokens)

👉  ['I', 'love', 'NLP', '.', 'NLP', 'is', 'fun', '!']
N-Gram Tokenization (Bigrams):👉  [('I', 'love'), ('love', 'NLP'), ('NLP', '.'), ('.', 'NLP'), ('NLP', 'is'), ('is', 'fun'), ('fun', '!')]
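
`ngrams` yields tuples rather than the space-joined strings used in the example text. The pure-Python sketch below shows the sliding window behind n-grams and how to join the tuples into strings; the helper `make_ngrams` is hypothetical and only mirrors the basic behaviour.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n copies of the list, each shifted one position further
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["I", "love", "NLP", "."]

bigrams = make_ngrams(tokens, 2)
print(bigrams)  # [('I', 'love'), ('love', 'NLP'), ('NLP', '.')]

bigram_strings = [" ".join(gram) for gram in bigrams]
print(bigram_strings)  # ['I love', 'love NLP', 'NLP .']

trigrams = make_ngrams(tokens, 3)
print(trigrams)  # [('I', 'love', 'NLP'), ('love', 'NLP', '.')]
```

A list of `k` tokens yields `k - n + 1` n-grams, so four tokens give three bigrams and two trigrams.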


### Custom Tokenization:

Tailors tokenization rules based on specific requirements or domain-specific knowledge.
#### Example: Tokenizing text in a medical domain may involve recognizing medical terms or abbreviations.

In [16]:
import re

def custom_tokenize(text):
    """Return only the listed medical terms and abbreviations, in order of appearance."""
    # Whole-word, case-insensitive match; all other text is discarded
    medical_terms_pattern = r'\b(?:heart|lung|brain|COPD|MRI|ECG)\b'
    return re.findall(medical_terms_pattern, text, flags=re.IGNORECASE)


medical_text = "The patient underwent an MRI scan to examine the brain. COPD is a chronic lung disease."
custom_tokens = custom_tokenize(medical_text)
print("Custom Tokenization (Medical Terms and Abbreviations):👉 ", custom_tokens)

Custom Tokenization (Medical Terms and Abbreviations):👉  ['MRI', 'brain', 'COPD', 'lung']
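
The function above extracts a fixed list of terms and drops everything else. A complementary approach, sketched here as an assumption about what a domain-aware tokenizer might need, tokenizes the whole text while keeping hyphenated terms and decimal numbers intact. The pattern, the helper name `domain_tokenize`, and the sample sentence are illustrative only.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first
DOMAIN_PATTERN = (
    r"[A-Za-z]+(?:-\w+)+"   # hyphenated terms such as "X-ray" or "beta-blocker"
    r"|\d+(?:\.\d+)?"       # integers and decimals such as "2.5"
    r"|\w+"                 # any other word
    r"|[^\w\s]"             # any single punctuation mark
)


def domain_tokenize(text):
    """Tokenize all of the text, keeping hyphenated terms and decimals whole."""
    return re.findall(DOMAIN_PATTERN, text)


print(domain_tokenize("Give 2.5 mg of beta-blocker after the X-ray."))
# ['Give', '2.5', 'mg', 'of', 'beta-blocker', 'after', 'the', 'X-ray', '.']
```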


#### More on Tokenization 👇

### Treebank Word Tokenizer

In [17]:
from nltk.tokenize import TreebankWordTokenizer

TBWT = TreebankWordTokenizer()

corpus = """ Welcome to, my mastery map of #NLP. 
Do follow me ! """

TBWT.tokenize(corpus)

['Welcome',
 'to',
 ',',
 'my',
 'mastery',
 'map',
 'of',
 '#',
 'NLP.',
 'Do',
 'follow',
 'me',
 '!']

`TreebankWordTokenizer` vs. `wordpunct_tokenize`

### TreebankWordTokenizer
Function: Tokenizes text using the conventions of the Penn Treebank.
#### Example:
* Input: "Dr. Smith's cat doesn't like fish."
* Output: ['Dr.', 'Smith', "'s", 'cat', 'does', "n't", 'like', 'fish', '.']
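* Special Handling: Splits contractions and possessives ("does" + "n't", "Smith" + "'s") and separates a period only at the very end of the input, so "Dr." stays intact. It expects text that has already been split into sentences, which is why "NLP." stayed attached in the two-sentence example above.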
### wordpunct_tokenize
Function: Splits text into runs of word characters and runs of punctuation.
#### Example:
* Input: "Dr. Smith's cat doesn't like fish."
* Output: ['Dr', '.', 'Smith', "'", 's', 'cat', 'doesn', "'", 't', 'like', 'fish', '.']
* Special Handling: None. Punctuation is always split from words, and adjacent punctuation marks form a single token (for example `://` in a URL).
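
The Treebank-style contraction splitting shown in the first example can be approximated with two substitutions. This is a simplified sketch of the convention, not NLTK's actual rule set (which covers many more cases); the helper `split_contractions` is hypothetical.

```python
import re


def split_contractions(text):
    """Split "n't" and common clitics off the preceding word, then split on whitespace."""
    # "doesn't" -> "does n't"
    text = re.sub(r"\b(\w+)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # "Smith's" -> "Smith 's", "they're" -> "they 're", and so on
    text = re.sub(r"(\w)('s|'re|'ve|'ll|'d|'m)\b", r"\1 \2", text, flags=re.IGNORECASE)
    return text.split()


print(split_contractions("Smith's cat doesn't like fish"))
# ['Smith', "'s", 'cat', 'does', "n't", 'like', 'fish']
```

Punctuation handling is left out on purpose so the example stays focused on contractions.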

In [18]:
from nltk.tokenize import TreebankWordTokenizer, wordpunct_tokenize

text = "Follow! Udaya on 👉 https://github.com/codeWudaya "

# Using wordpunct_tokenize
tokens_wp = wordpunct_tokenize(text)
print("wordpunct_tokenize: 👉", tokens_wp)

print("👇👇👇👇👇👇👇")

# Using TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
tokens_tb = treebank_tokenizer.tokenize(text)
print("TreebankWordTokenizer:👉 ", tokens_tb)

wordpunct_tokenize: 👉 ['Follow', '!', 'Udaya', 'on', '👉', 'https', '://', 'github', '.', 'com', '/', 'codeWudaya']
👇👇👇👇👇👇👇
TreebankWordTokenizer:👉  ['Follow', '!', 'Udaya', 'on', '👉', 'https', ':', '//github.com/codeWudaya']
