# Tokenizing Words and Sentences with NLTK

### Corpus - 
#### A body of text (singular); the plural is "corpora". Example: a collection of medical journals.

### Lexicon - 
#### Words and their meanings. Example: an English dictionary. Keep in mind, however, that different fields have different lexicons. For example, to a financial investor the first meaning of the word "bull" is someone who is confident about the market, whereas in the common English lexicon the first meaning of "bull" is an animal. As such, there are specialized lexicons for financial investors, doctors, children, mechanics, and so on.
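As a rough illustration of domain-specific lexicons, the sketch below stores a first meaning for "bull" per domain in plain dictionaries. The `LEXICONS` structure, the `first_meaning` helper, and the definition strings are hypothetical and exist only for this example; they are not part of NLTK.

```python
# Hypothetical, hand-written lexicons: one small dictionary per domain.
LEXICONS = {
    "general": {"bull": "an adult male bovine animal"},
    "finance": {"bull": "an investor who is confident about the market"},
}


def first_meaning(word, domain):
    """Return the first meaning of `word` in the given domain, or None if unknown."""
    return LEXICONS.get(domain, {}).get(word.lower())


print(first_meaning("Bull", "general"))
print(first_meaning("Bull", "finance"))
```

The same word resolves to a different meaning depending on which lexicon is consulted.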

### Token - 
#### Each "entity" that results from splitting text according to some set of rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize a paragraph into sentences.
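To make "split up based on rules" concrete, here is a minimal sketch using only Python's standard `re` module. The two regular expressions are deliberately simplistic assumptions chosen for illustration, not the rules NLTK applies.

```python
import re

paragraph = "Tokens depend on the rule. Change the rule and the tokens change."

# Rule A (word level): every run of letters, digits, or underscores is one token.
word_tokens = re.findall(r"\w+", paragraph)

# Rule B (sentence level): everything up to and including ., !, or ? is one token.
sentence_tokens = [s.strip() for s in re.findall(r"[^.!?]+[.!?]", paragraph)]

print(word_tokens)
print(sentence_tokens)
```

The same text yields 12 word tokens under Rule A and 2 sentence tokens under Rule B, so what counts as a token depends entirely on the rule.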

In [1]:
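import nltk
# If the tokenizer models are missing, download them once, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).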

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [3]:
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."


In [4]:
# Split the text into sentences with sent_tokenize
print(sent_tokenize(EXAMPLE_TEXT))

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]
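Note that `sent_tokenize` kept "Mr. Smith" inside the first sentence in the output above. For contrast, the sketch below applies a naive rule (break after `.`, `!`, or `?` followed by whitespace) to the same text. The `naive_sentences` name and the regular expression are assumptions made for this comparison only.

```python
import re

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

# Naive rule: split wherever sentence-ending punctuation is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", EXAMPLE_TEXT)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule returns five pieces instead of four, because it also breaks after the abbreviation "Mr.". Abbreviations like this are one reason to prefer a dedicated sentence tokenizer over a hand-written split.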


In [5]:
# Split the text into words and punctuation with word_tokenize
print(word_tokenize(EXAMPLE_TEXT))

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome.', 'The', 'sky', 'is', 'pinkish-blue.', 'You', 'should', "n't", 'eat', 'cardboard', '.']


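### There are a few things to note here. 
#### 1. Notice that punctuation is treated as a separate token: the commas, the question mark, and the final period each appear on their own. In this particular output the periods inside the text stay attached ('awesome.', 'pinkish-blue.'); this can differ between NLTK versions. Also notice that the word "shouldn't" is split into "should" and "n't". 
#### 2. Notice that "pinkish-blue" is indeed treated as one word: the hyphen does not split it. 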
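The behaviors noted above can be imitated with a small regular-expression tokenizer. This is only a sketch under simplifying assumptions, not NLTK's implementation: the `simple_word_tokenize` function and both patterns are hypothetical, and unlike the output shown earlier it splits off every period.

```python
import re

# Step 1: put a space in front of the contraction "n't" ("shouldn't" -> "should n't").
CONTRACTION = re.compile(r"(\w)(n't)\b")

# Step 2: a token is "n't", a word (optionally hyphenated), or one punctuation mark.
TOKEN = re.compile(r"n't|\w+(?:-\w+)*|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into word, contraction, and punctuation tokens."""
    separated = CONTRACTION.sub(r"\1 \2", text)
    return TOKEN.findall(separated)


tokens = simple_word_tokenize("The sky is pinkish-blue. You shouldn't eat cardboard.")
print(tokens)
# ['The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']
```

Hyphenated words stay whole because the word pattern allows `-` between word characters, while any other punctuation character becomes its own token.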