**Corpus** - A body of text (singular). The plural is **corpora**. Example: a collection of medical journals.

**Lexicon** - Words and their meanings. Example: an English dictionary. Different fields, however, have different lexicons. For example, to a financial investor the first meaning of the word **Bull** is someone who is confident about the market, whereas in the common English lexicon the first meaning of **Bull** is an animal. As such, there are special lexicons for financial investors, doctors, children, mechanics, and so on.

**Token** - Each **entity** produced when text is split up according to some rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize a paragraph into sentences.

**Tokenization** - The process of splitting text into tokens. Two common types are: 1- Word tokenization 2- Sentence tokenization

In [25]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

In [8]:
example_paragraph = ("Hello, Mr. NLTK, I'm expecting a lot from you. This is my first task in nltk. Hoping to get a lot of "
                     "good stuff from nltk.")

In [10]:
# tokenize paragraph into sentence tokens
sent_tokenize(example_paragraph)

["Hello, Mr. NLTK, I'm expecting a lot from you.",
 'This is my first task in nltk.',
 'Hoping to get a lot of good stuff from nltk.']

In [12]:
# tokenize paragraph into word tokens (nltk treats punctuation marks as independent tokens)
word_tokenize(example_paragraph)

['Hello',
 ',',
 'Mr.',
 'NLTK',
 ',',
 'I',
 "'m",
 'expecting',
 'a',
 'lot',
 'from',
 'you',
 '.',
 'This',
 'is',
 'my',
 'first',
 'task',
 'in',
 'nltk',
 '.',
 'Hoping',
 'to',
 'get',
 'a',
 'lot',
 'of',
 'good',
 'stuff',
 'from',
 'nltk',
 '.']

**Note**: Tokenization can also be done with regular expressions, but as the structure of the text becomes more complex,
writing a regex that handles every case becomes too difficult. NLTK's tokenizers take care of this for you and are efficient as well.
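
To make the note above concrete, here is a minimal sketch of regex-based tokenization using only Python's standard `re` module. The two patterns are illustrative assumptions, not what NLTK uses internally. They show where a simple regex goes wrong: the sentence splitter breaks after the abbreviation "Mr.", and the word pattern splits the contraction "I'm" into three pieces.

```python
import re

text = ("Hello, Mr. NLTK, I'm expecting a lot from you. "
        "This is my first task in nltk.")

# Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

# Naive word tokenizer: runs of word characters, or single punctuation marks.
naive_words = re.findall(r"\w+|[^\w\s]", "I'm expecting a lot.")

print(naive_sentences)  # three pieces, although the text has two sentences
print(naive_words)      # "I'm" comes out as 'I', "'", 'm'
```

Compare this with the `sent_tokenize` output above, which keeps "Hello, Mr. NLTK, ..." together as one sentence.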

**Stop Words** - The most common words in a language (such as "is" or "an"), which are usually filtered out of a sentence because they carry little meaning. You can also add your own words to the stop word list. Let's see an example:

In [17]:
example_sentence = "This is an example showing off stop word filteration."

stop_words = set(stopwords.words("english"))
words = word_tokenize(example_sentence)

filtered = [w for w in words if w not in stop_words]
filtered

['This', 'example', 'showing', 'stop', 'word', 'filteration', '.']
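
Notice that 'This' survived the filter above: the English stop word list is lowercase, and the comparison is case-sensitive. The sketch below shows one way to handle that, and how to add your own stop words. To keep it self-contained, `stop_words` here is a tiny hand-picked stand-in (an assumption for illustration) rather than the full `stopwords.words("english")` list, and the tokens are copied from the cell above.

```python
# Tokens as produced by word_tokenize in the cell above.
words = ["This", "is", "an", "example", "showing", "off",
         "stop", "word", "filteration", "."]

# Assumption: a small stand-in for set(stopwords.words("english")).
stop_words = {"this", "is", "an", "off"}

# Compare in lowercase so that "This" is filtered out too.
filtered = [w for w in words if w.lower() not in stop_words]

# Add custom stop words: here punctuation and one domain word (illustrative).
custom_stop_words = stop_words | {".", "example"}
filtered_custom = [w for w in words if w.lower() not in custom_stop_words]

print(filtered)         # ['example', 'showing', 'stop', 'word', 'filteration', '.']
print(filtered_custom)  # ['showing', 'stop', 'word', 'filteration']
```

With the real NLTK set, the same idea applies: call `stop_words.update([...])` on the set built from `stopwords.words("english")`, and lowercase each token before the membership test.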

**Stemming** - Stemming is a kind of normalization. Many variations of a word carry the same meaning, apart from differences such as tense. We stem to shorten the lookup and to normalize sentences.

Consider an example,

1- I was taking a ride in the car.
2- I was riding in the car.

In the two sentences above, both "ride" and "riding" have the same meaning.

In [23]:
ps = PorterStemmer()

example_sentence = "I was taking a ride in the car. I was riding in the car."
words = word_tokenize(example_sentence)

stemmed = [ps.stem(w) for w in words]
stemmed

['I',
 'wa',
 'take',
 'a',
 'ride',
 'in',
 'the',
 'car',
 '.',
 'I',
 'wa',
 'ride',
 'in',
 'the',
 'car',
 '.']
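
Note that a stem does not have to be a real word ('wa' for "was"); it only needs to be the same for all variants. To illustrate what "shorten the lookup" can mean, here is a rough sketch of a lookup table keyed by stem, so that a query for any variant finds every sentence. `crude_stem` is a hypothetical toy suffix stripper written for this example, not the Porter algorithm; in practice you would likely pass each token through `ps.stem` instead.

```python
import re
from collections import defaultdict

def crude_stem(word):
    """Toy normalizer (NOT the Porter algorithm): lowercase, strip one suffix."""
    w = word.lower()
    for suffix in ("ing", "es", "s", "e"):
        # Keep at least three characters so short words like "the" survive.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

sentences = [
    "I was taking a ride in the car.",
    "I was riding in the car.",
]

# Map each stem to the ids of the sentences that contain it.
index = defaultdict(set)
for sentence_id, sentence in enumerate(sentences):
    for token in re.findall(r"\w+", sentence):
        index[crude_stem(token)].add(sentence_id)

def lookup(word):
    """Return sorted ids of sentences containing any variant of `word`."""
    return sorted(index.get(crude_stem(word), set()))

print(lookup("ride"))    # [0, 1]
print(lookup("riding"))  # [0, 1]
print(lookup("taking"))  # [0]
```

Because "ride" and "riding" reduce to the same key, the table needs one entry instead of two, and both sentences are found by either query.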