### Definitions
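- Corpus = The entire body of text (in this notebook, a single paragraph)
- Documents = The individual sentences of the Corpus
- Vocabulary = The set of unique words in the Corpus
- Words = All words in the Corpus, repeats included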
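
As a minimal illustration of the difference between Words and Vocabulary, the sketch below uses a made-up corpus without punctuation and naive whitespace splitting. Both are assumptions chosen for brevity; they are not how NLTK tokenizes text.

```python
# Hypothetical toy corpus; it has no punctuation, so str.split() is enough here.
corpus = "the cat sat on the mat and the dog sat too"

# Words: every token in the corpus, duplicates included.
words = corpus.split()

# Vocabulary: the unique words only (sorted for a stable display order).
vocabulary = sorted(set(words))

print("number of words =", len(words))
print("vocabulary size =", len(vocabulary))
print("vocabulary      =", vocabulary)
```

The corpus has 11 words but only 8 vocabulary entries, because "the" and "sat" occur more than once.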

### Tokenization 
- Tokenization is the process of splitting a Corpus or a Document into smaller units called tokens (sentences or words)

### Types of Tokenization
- 1) Corpus to Words
- 2) Documents to Words
- 3) Corpus to Documents

### Popular libraries for Tokenization
- NLTK
    - Open Source Library
- spaCy
    - spacy.io
    - Open Source Library


### Comparison: Corpus to Words Tokenization Techniques
<table>
<tr>
    <td></td>
    <td>word_tokenize</td>
    <td>wordpunct_tokenize</td>
    <td>TreebankWordTokenizer</td>
</tr>
<tr>
    <td>Punctuation behaviour</td>
    <td>Splits punctuation into separate tokens, e.g. commas and periods become their own tokens. Contractions are split into two parts: (e.g. 1) I'm → I, 'm (2 tokens); (e.g. 2) don't → do, n't (2 tokens)</td>
    <td>Splits every run of punctuation into a standalone token and breaks words around it, e.g. I'm → I, ', m (3 tokens)</td>
    <td>Very close to word_tokenize, e.g. don't → do, n't (2 tokens), but it expects one sentence at a time: only the period at the end of the input is split off</td>
</tr>
<tr>
    <td>Treebank punctuation rules compliance</td>
    <td>Follows Treebank punctuation rules</td>
    <td>Does not follow Treebank punctuation rules</td>
    <td>Follows Treebank punctuation rules</td>
</tr>
</table>
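
To see why wordpunct_tokenize produces three tokens for I'm, it helps to know that NLTK documents its WordPunctTokenizer as a regular-expression tokenizer using the pattern `\w+|[^\w\s]+`: a run of word characters, or a run of punctuation. The sketch below applies that pattern with Python's standard `re` module. The helper name `wordpunct_like` is hypothetical; it is meant as an illustration, not a replacement for the NLTK function.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer:
# a run of word characters, or a run of characters that are neither word nor space.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    """Hypothetical helper that mimics wordpunct_tokenize using only `re`."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_like("I'm"))       # ['I', "'", 'm']
print(wordpunct_like("don't"))     # ['don', "'", 't']
print(wordpunct_like("Wait...!"))  # ['Wait', '...!']
```

The apostrophe is neither a word character nor whitespace, so it always becomes its own token. That is why contractions fall apart here, while word_tokenize and TreebankWordTokenizer keep 'm and n't together.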

### Treebank punctuation rules
- Treebank punctuation refers to the standardized set of punctuation tokenization rules defined by the Penn Treebank, a large, manually annotated corpus created for research in computational linguistics.

- In practice, “Treebank punctuation” means how punctuation marks are separated, normalized, and represented as tokens so that text can be consistently parsed, tagged, and analyzed.

- Rules
    - Words and punctuation marks are separate tokens
    - Contractions are split systematically (e.g. don't becomes do, n't)
    - Quotation marks and parentheses are normalized to standard forms
- Because every text is tokenized the same way, downstream parsing and POS tagging receive consistent input.
- These rules are implemented in NLTK’s TreebankWordTokenizer (and indirectly in word_tokenize).
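
The rules can be observed directly with NLTK's TreebankWordTokenizer. The sketch below is only an illustration: the sentence is made up, and the exact token list may differ slightly between NLTK versions. It assumes `nltk` is installed; this tokenizer is rule-based and needs no extra data download.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# A made-up single sentence with a contraction, double quotes and parentheses.
sentence = 'He said, "I don\'t know (yet)."'

# Default behaviour: the contraction is split, and the double quotes
# are normalized to `` (opening) and '' (closing).
tokens = tokenizer.tokenize(sentence)
print(tokens)

# With convert_parentheses=True, ( and ) are written as -LRB- and -RRB-.
tokens_lrb = tokenizer.tokenize(sentence, convert_parentheses=True)
print(tokens_lrb)
```

In the printed lists, don't appears as the two tokens do and n't, the quotation marks appear as `` and '', and the second list contains -LRB- and -RRB- in place of the parentheses.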

In [34]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize, TreebankWordTokenizer

# sent_tokenize and word_tokenize need the Punkt sentence-tokenizer data
nltk.download('punkt_tab', quiet=True)

In [23]:
corpus = '''
Hello Welcome, to Ankur Israni's NLP tutorials. Please do watch the entire course! to become an expert in NLP.
'''

In [None]:
#A
# Tokenization: 'Corpus to Documents' tokenization using 'sent_tokenize'
# documents are sentences
documents = sent_tokenize(corpus)
print('documents = ', documents)
print('type(documents) = ',type(documents))

print('Individual documents:')
for document in documents:
    print('document = ',document)

documents =  ["\nHello Welcome, to Ankur Israni's NLP tutorials.", 'Please do watch the entire course!', 'to become an expert in NLP.']
type(documents) =  <class 'list'>
Individual documents:
document =  
Hello Welcome, to Ankur Israni's NLP tutorials.
document =  Please do watch the entire course!
document =  to become an expert in NLP.


In [None]:
#B1
# Tokenization: 'Corpus to Words' using 'word_tokenize'
words = word_tokenize(corpus)
print('words: ',words)

words:  ['Hello', 'Welcome', ',', 'to', 'Ankur', 'Israni', "'s", 'NLP', 'tutorials', '.', 'Please', 'do', 'watch', 'the', 'entire', 'course', '!', 'to', 'become', 'an', 'expert', 'in', 'NLP', '.']


In [33]:
#B2
# Tokenization: 'Corpus to Words' using 'wordpunct_tokenize'
words = wordpunct_tokenize(corpus)
print('words = ',words)

words =  ['Hello', 'Welcome', ',', 'to', 'Ankur', 'Israni', "'", 's', 'NLP', 'tutorials', '.', 'Please', 'do', 'watch', 'the', 'entire', 'course', '!', 'to', 'become', 'an', 'expert', 'in', 'NLP', '.']


In [None]:
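#B3
# Tokenization: 'Corpus to Words' using TreebankWordTokenizer
# Note: this tokenizer expects one sentence at a time, so only the final period
# is split off; 'tutorials.' in the middle of the corpus keeps its period.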
treeBankWordTokenizer = TreebankWordTokenizer()
words = treeBankWordTokenizer.tokenize(corpus)
print('words = ',words)

words =  ['Hello', 'Welcome', ',', 'to', 'Ankur', 'Israni', "'s", 'NLP', 'tutorials.', 'Please', 'do', 'watch', 'the', 'entire', 'course', '!', 'to', 'become', 'an', 'expert', 'in', 'NLP', '.']
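
In the B3 output above, 'tutorials.' keeps its period because TreebankWordTokenizer treats its whole input as one sentence. One possible way around this, sketched below, is to tokenize each document separately. The three sentences are copied from the sent_tokenize output of cell #A (without the leading newline) so that the example runs on its own; the variable name `tokenizer` is just a local choice.

```python
from nltk.tokenize import TreebankWordTokenizer

# The documents produced by sent_tokenize in cell #A, hard-coded here
# so that this example does not depend on earlier cells.
documents = [
    "Hello Welcome, to Ankur Israni's NLP tutorials.",
    "Please do watch the entire course!",
    "to become an expert in NLP.",
]

tokenizer = TreebankWordTokenizer()
for document in documents:
    # Each input is now a single sentence, so its final period is split off.
    print(tokenizer.tokenize(document))
```

With sentence-level input, the first document ends in the two tokens 'tutorials' and '.', which matches what word_tokenize returns for the same text.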


In [31]:
#C
# Tokenization: 'Documents to Words' using 'word_tokenize'
for document in documents:
    words = word_tokenize(document)
    print("words = ",words)

words =  ['Hello', 'Welcome', ',', 'to', 'Ankur', 'Israni', "'s", 'NLP', 'tutorials', '.']
words =  ['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
words =  ['to', 'become', 'an', 'expert', 'in', 'NLP', '.']
