In [13]:
import nltk

A lexicon is the vocabulary of a person, a language, or a branch of knowledge. In simple terms, a lexicon can be thought of as a dictionary whose entries are called lexemes.  
Before building a vocabulary, we need to know these terms:    
Phonemes: Speech sounds that differentiate words.  
Graphemes: Groups of letters representing phonemes.  
Morphemes: Smallest meaningful units in language.  

# Tokenization

Tokenization breaks text into meaningful chunks called tokens. Tokens include words, numbers, punctuation, symbols, and sometimes emoticons. It is a fundamental step in text processing, because later stages work with the meaning of each chunk. Let's see this in code.

In [1]:
sentence = "The capital of China is Beijing"
sentence.split()

['The', 'capital', 'of', 'China', 'is', 'Beijing']

## Issues with tokenization

In [2]:
sentence = "China's capital is Beijing"
sentence.split()

["China's", 'capital', 'is', 'Beijing']

Should the token be China, Chinas, or China's? A plain whitespace split has no way of dealing with apostrophes.  
See other similar examples:


In [6]:
sentence1 = "Beijing is where we'll go"
sentence1.split()

['Beijing', 'is', 'where', "we'll", 'go']

In [7]:
sentence2 = "I'm going to travel to Beijing"
sentence2.split()

["I'm", 'going', 'to', 'travel', 'to', 'Beijing']

In [8]:
sentence3 = "Let's travel to Hong Kong from Beijing"
sentence3.split()

["Let's", 'travel', 'to', 'Hong', 'Kong', 'from', 'Beijing']

Here, ideally, Hong Kong should be one token. But consider another sentence: "The name of the King is Kong." In that case, Kong should be an individual token. Context therefore plays a major role in deciding how to treat the same surface form when it appears in different settings.
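One simple way to approximate this is to merge token pairs that appear in a list of known multiword expressions. The sketch below is only an illustration: `merge_multiword` and `known_phrases` are hypothetical names, and a real system would need a much larger lexicon or a statistical model to decide from context.

```python
def merge_multiword(tokens, phrases):
    """Join adjacent token pairs that appear in a (hypothetical) phrase lexicon."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            # Both tokens form a known phrase, so emit them as one token
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Assumed lexicon for illustration only
known_phrases = {("Hong", "Kong")}

travel_tokens = merge_multiword("Let's travel to Hong Kong from Beijing".split(), known_phrases)
king_tokens = merge_multiword("The name of the King is Kong".split(), known_phrases)
print(travel_tokens)
print(king_tokens)
```

In the first sentence "Hong Kong" becomes a single token, while in the second sentence "Kong" stays on its own because it is not preceded by "Hong".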

In [9]:
sentence = "A friend is pursuing his M.S from Beijing"
sentence.split()

['A', 'friend', 'is', 'pursuing', 'his', 'M.S', 'from', 'Beijing']

Here, the period between M and S indicates an abbreviation, not the end of a sentence.

In [10]:
sentence = "Most of the times umm I travel"
sentence.split()

['Most', 'of', 'the', 'times', 'umm', 'I', 'travel']

Even though a token such as umm is not part of the English vocabulary, it becomes important in use cases where speech synthesis is involved, because it indicates that the person is pausing to think of something.

In [12]:
sentence = "Beijing is a cool place!!! :-P <3 #Awesome"
sentence.split()

['Beijing', 'is', 'a', 'cool', 'place!!!', ':-P', '<3', '#Awesome']

The social media data boom brings rich information, but also "millennial language" such as emoticons and abbreviations. Understanding this evolving text is crucial; for example, ":-P" represents a face with its tongue out. Hashtags often summarize the emotion of a post. These needs have led to specialized tokenizers such as NLTK's TweetTokenizer.

## Regular Expressions

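Regular expressions (regex) are sequences of characters that define search patterns in text, which makes them a key tool for pattern identification. They are highly effective when the target follows a consistent pattern, such as email IDs, and in such cases a regex is usually preferred over a machine learning model. They are also widely used in practice; for example, Stanford NLP's SUTime relies on regex-based rules to spot dates, times, and other temporal expressions in text.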
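As a rough illustration of the email example, here is a minimal sketch using Python's built-in `re` module. The pattern `EMAIL_PATTERN` and the sample text are assumptions made for this example; the pattern is deliberately simplified and does not cover every valid email address.

```python
import re

# Simplified pattern: local part, "@", then a domain with at least one dot
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

text = "Write to alice@example.com or bob.smith@mail.example.org for details"
emails = EMAIL_PATTERN.findall(text)
print(emails)
```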

## Regular expressions-based tokenizers

NLTK's RegexpTokenizer uses regular expressions to tokenize text. For instance, to capture money amounts, words, and any remaining symbols in a sentence like "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.", you can define a regex pattern and pass it to the tokenizer object, like this:

In [14]:
from nltk.tokenize import RegexpTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."
# Raw string, so that \w, \$ and \S reach the regex engine unchanged
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$3000.0',
 '-',
 '$8000.0',
 'in',
 'USA',
 '.']

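\w+: Matches any word character (for ASCII text, [a-zA-Z0-9_]) one or more times.  
\$[\d\.]+: Matches $ followed by one or more digits or periods.    
\S+: Matches any non-whitespace character one or more times.  
The alternatives are tried from left to right, so the order in which they are written matters.  
NLTK also ships other tokenizers built on RegexpTokenizer, such as BlanklineTokenizer and WordPunctTokenizer.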
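To see the pattern at work without NLTK, the same regular expression can be applied with `re.findall` from the standard library. This is only a sketch for experimenting with the pattern; it is not meant as a description of how RegexpTokenizer is implemented. The reordered pattern is an addition for illustration.

```python
import re

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."

# Same alternatives as in the tokenizer cell: word, money amount, anything else
pattern = re.compile(r"\w+|\$[\d.]+|\S+")
tokens = pattern.findall(s)
print(tokens)

# Putting \S+ first lets it swallow every non-whitespace run,
# so the result is the same as a plain whitespace split
greedy_first = re.compile(r"\S+|\w+|\$[\d.]+")
greedy_tokens = greedy_first.findall(s)
print(greedy_tokens)
```

With `\S+` first, "USA." stays glued together, which shows why the more specific alternatives have to come before the catch-all.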

## Treebank Tokenizer

The Treebank tokenizer follows the conventions of the Penn Treebank and splits text mostly on punctuation. It splits contractions such as "doesn't" into "does" and "n't", separates periods that appear at the end of a line, and splits off commas that are followed by whitespace.

In [15]:
from nltk.tokenize import TreebankWordTokenizer
s = "I'm going to buy a Rolex watch that doesn't cost more than $3000.0"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)

['I',
 "'m",
 'going',
 'to',
 'buy',
 'a',
 'Rolex',
 'watch',
 'that',
 'does',
 "n't",
 'cost',
 'more',
 'than',
 '$',
 '3000.0']
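To get a feel for the kind of rules involved, here is a heavily simplified sketch that imitates three of these splits with the standard `re` module. The function `simple_treebank_split` is hypothetical and covers only a handful of cases; the real Treebank tokenizer applies many more rules.

```python
import re

def simple_treebank_split(text):
    """Imitate a few Treebank-style splits (illustrative only)."""
    # Split the negation off: doesn't -> does n't
    text = re.sub(r"n't\b", " n't", text)
    # Split common clitics off: I'm -> I 'm, we'll -> we 'll
    text = re.sub(r"'(m|s|re|ve|ll|d)\b", r" '\1", text)
    # Separate the currency symbol from the amount
    text = text.replace("$", " $ ")
    return text.split()

s = "I'm going to buy a Rolex watch that doesn't cost more than $3000.0"
tokens = simple_treebank_split(s)
print(tokens)
```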

## TweetTokenizer

The TweetTokenizer in NLTK is designed for social media text, handling usernames, hashtags, and emoticons. For instance, tokenizing the tweet "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3" would result in:

In [17]:
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer()
tokenizer.tokenize(s)

['@amankedia',
 "I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxxxxxxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

TweetTokenizer in NLTK offers a reduce_len parameter to handle the repeated characters commonly seen in social media. For instance, "Rolexxxxxxxx" is tokenized as "Rolexxx" with reduce_len enabled. The strip_handles parameter removes usernames such as @amankedia from the output.

In [19]:
from nltk.tokenize import TweetTokenizer
s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokenizer.tokenize(s)

["I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']
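The effect of reduce_len can be imitated with a single regular expression that collapses any character repeated three or more times down to exactly three. This is a sketch of the idea under that assumption, not a copy of NLTK's code; `shorten_repeats` is a hypothetical name.

```python
import re

def shorten_repeats(text):
    """Collapse runs of 3 or more identical characters to exactly 3."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

shortened = shorten_repeats("Rolexxxxxxxx watch!!!")
unchanged = shorten_repeats("good book")
print(shortened)
print(unchanged)
```

Runs of two characters, as in "good" or "book", are left alone, so ordinary English spelling is not affected.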

# Stemming

Stemming is the process of reducing words to a base form, called the stem, by chopping off affixes such as inflectional endings. For example, "computer," "computerization," and "computerize" might all be stemmed to "compute." The resulting stem may not always be a valid word. Common stemming algorithms are the Porter stemmer (for English) and the Snowball stemmer (for multiple languages).
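To make the idea of chopping off affixes concrete, here is a toy suffix stripper. It is not the Porter or Snowball algorithm; the suffix list and the `naive_stem` function are made up for illustration, and real stemmers use far more careful rules.

```python
# Hypothetical suffix list, checked in this order (longer suffixes first)
SUFFIXES = ("ization", "ize", "ing", "ed", "er", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["meeting", "owned", "flies", "computerization", "computer"]
stems = [naive_stem(word) for word in words]
print(stems)
```

Even this crude version shows both properties mentioned above: related words move toward a shared stem, and some stems, such as "flie" or "comput", are not valid words.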

In [20]:
from nltk.stem.snowball import SnowballStemmer
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


Let's first apply the Porter stemmer to a list of words and see its effect in the following code block:

In [24]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'died', 'agreed', 'owned', 'humbled', 'sized', 'meeting', 'stating', 'siezing', 'itemization', 'traditional', 'reference', 'colonizer', 'plotted', 'having', 'generously']
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have gener


Next, let's see how the Snowball stemmer does on the same words:

In [25]:
stemmer2 = SnowballStemmer(language='english')
singles = [stemmer2.stem(plural) for plural in plurals]
print(' '.join(singles))

caress fli die mule die agre own humbl size meet state siez item tradit refer colon plot have generous


The Snowball stemmer, an improvement on the Porter stemmer, requires a language parameter. Generally, its output is similar to the Porter stemmer, but with minor differences. For instance, "generously" is stemmed to "gener" by the Porter stemmer and "generous" by the Snowball stemmer. 

# Lemmatization


Lemmatization is the process of converting words to their base or dictionary form using contextual information. Unlike stemming, lemmatization produces base forms that are valid words. Lemmatizers take context, part-of-speech (POS) tags, and word meaning into account. The resulting base form is called the lemma, and the same word can have different lemmas depending on context. Popular lemmatizers include those from WordNet, spaCy, TextBlob, and Gensim, among others. Let's focus on the WordNet and spaCy lemmatizers.
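The dependence on part of speech can be pictured as a dictionary lookup keyed on both the word and its POS. The table below is a tiny hand-written stand-in, not real lexicon data, and `toy_lemmatize` is a hypothetical helper.

```python
# Hand-written entries for illustration; a real lemmatizer uses a full lexicon
LEMMA_TABLE = {
    ("are", "VERB"): "be",
    ("putting", "VERB"): "put",
    ("efforts", "NOUN"): "effort",
    ("understanding", "VERB"): "understand",
    ("understanding", "NOUN"): "understanding",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); return the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

as_verb = toy_lemmatize("understanding", "VERB")
as_noun = toy_lemmatize("understanding", "NOUN")
print(as_verb, as_noun)
```

The same surface form, "understanding", gets a different lemma depending on whether it is used as a verb or as a noun.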

## WordNet lemmatizer

WordNet is a free lexical database of English. It groups nouns, verbs, adjectives, and adverbs into sets of cognitive synonyms (synsets), which are linked by semantic relationships. NLTK provides an interface to WordNet, which makes lemmatization easy.

In [27]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
s = "We are putting in efforts to enhance our understanding of Lemmatization"
token_list = s.split()
print("The tokens are: ", token_list)
# lemmatize() treats every token as a noun unless a POS tag is passed
lemmatized_output = ' '.join([lemmatizer.lemmatize(token) for token in token_list])
print("The lemmatized output is: ", lemmatized_output)

The tokens are:  ['We', 'are', 'putting', 'in', 'efforts', 'to', 'enhance', 'our', 'understanding', 'of', 'Lemmatization']
The lemmatized output is:  We are putting in effort to enhance our understanding of Lemmatization


Without POS information, only "efforts" was changed. NLTK's averaged perceptron tagger provides POS tags for words, which lets the WordNet lemmatizer work more effectively.

The POS tags that the tagger assigns to the sentence "We are putting in efforts to enhance our understanding of Lemmatization" can be found in the following code snippet:

In [28]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(token_list)
pos_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('We', 'PRP'),
 ('are', 'VBP'),
 ('putting', 'VBG'),
 ('in', 'IN'),
 ('efforts', 'NNS'),
 ('to', 'TO'),
 ('enhance', 'VB'),
 ('our', 'PRP$'),
 ('understanding', 'NN'),
 ('of', 'IN'),
 ('Lemmatization', 'NN')]

To use the POS tags with the WordNet lemmatizer, we need to map the POS tags from the NLTK tagger to the format WordNet expects. Here's how to do that:

In [35]:
from nltk.corpus import wordnet
# This is a common helper, widely used by NLP practitioners
def get_part_of_speech_tags(token):
    """Map a token's POS tag to the first-character code lemmatize() accepts.

    Only verbs, nouns, adjectives and adverbs are covered; anything else
    falls back to noun.
    """
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    # Tag the token on its own and keep the first letter of its tag
    tag = nltk.pos_tag([token])[0][1][0].upper()
    return tag_dict.get(tag, wordnet.NOUN)

lemmatized_output_with_POS_information = [
    lemmatizer.lemmatize(token, get_part_of_speech_tags(token)) for token in token_list
]
print(' '.join(lemmatized_output_with_POS_information))

We be put in effort to enhance our understand of Lemmatization


The following conversions happened:  
are to be  
putting to put  
efforts to effort  
understanding to understand  
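Note that the helper tags each token on its own, which is why "understanding" ended up as a verb here even though the sentence-level tagger output shown earlier marked it as NN. A variation is to reuse the sentence-level tags and only do the mapping step. The sketch below shows just that mapping, using the tag list from the earlier output and the single-letter codes 'a', 'n', 'v', 'r' in place of the `wordnet` constants so that it runs without NLTK; `penn_to_wordnet` is a hypothetical name.

```python
# Single-letter WordNet POS codes (assumed equal to wordnet.ADJ, NOUN, VERB, ADV)
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return PENN_TO_WORDNET.get(tag[0], "n")

# Sentence-level tags copied from the tagger output shown earlier
pos_tags = [("We", "PRP"), ("are", "VBP"), ("putting", "VBG"), ("in", "IN"),
            ("efforts", "NNS"), ("to", "TO"), ("enhance", "VB"), ("our", "PRP$"),
            ("understanding", "NN"), ("of", "IN"), ("Lemmatization", "NN")]

wordnet_pos = {token: penn_to_wordnet(tag) for token, tag in pos_tags}
print(wordnet_pos)
```

With these tags, "putting" maps to a verb while "understanding" maps to a noun, so a lemmatizer given this information would have no reason to treat "understanding" as a verb form.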

Let’s compare this with the Snowball stemmer:

In [36]:
stemmer2 = SnowballStemmer(language='english')
stemmed_sentence = [stemmer2.stem(token) for token in token_list]
print(' '.join(stemmed_sentence))

we are put in effort to enhanc our understand of lemmat


## spaCy lemmatizer

spaCy uses pretrained models to parse text, identifying properties such as POS tags and named-entity tags with a single function call. The built-in models assign POS tags and lemmatize words automatically.

In [42]:
pip install spacy


In [59]:
!python -m spacy download en_core_web_sm



In [60]:
import spacy
# nlp = spacy.load('en')
nlp = spacy.load('en_core_web_sm')

doc = nlp("We are putting in efforts to enhance our understanding of Lemmatization")
" ".join([token.lemma_ for token in doc])


'we be put in effort to enhance our understanding of lemmatization'

The spaCy lemmatizer did a decent job without being given POS tags explicitly; note that it left "understanding", which is a noun in this sentence, unchanged. The advantage here is that there is no need for an external dependency to fetch POS tags, because that information is built into the pretrained model.


## Stopword removal

Stopwords like "a," "an," "the," "in," "at," etc., are common in text but carry little information. They are needed to form complete, grammatical sentences, but filtering them out in NLP tasks reduces the vocabulary and the search space. Stopword lists vary by language and use case, so they often need to be adapted to the problem being addressed.

In [44]:
# nltk.download('stopwords')
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))
", ".join(stop)

"myself, you've, only, am, having, not, wouldn, too, you, doesn't, be, by, on, your, so, needn't, further, very, aren't, then, it's, against, who, mightn, whom, re, in, him, now, where, over, into, herself, such, both, there, won't, own, than, s, and, to, yourself, did, hasn, more, has, our, here, i, are, no, under, hadn, his, their, below, down, hasn't, mustn, won, because, that, during, while, how, same, it, as, shouldn, shan, after, didn't, you'd, out, he, what, m, the, up, ll, haven't, through, yours, should, off, once, wasn't, couldn, was, other, himself, mightn't, most, don't, y, needn, can, ours, mustn't, we, me, shan't, itself, but, this, which, at, were, before, shouldn't, an, why, about, my, wouldn't, don, above, o, its, been, if, all, themselves, just, weren't, some, aren, from, you're, nor, them, you'll, she's, ve, of, for, they, ma, or, theirs, these, doesn, ain, will, does, do, each, that'll, didn, she, being, had, haven, d, any, when, between, those, weren, a, isn't, her

Wh-words like "who," "what," "when," "why," "how," "which," "where," and "whom" are stopwords, but they are crucial in tasks like question answering and question classification. To prevent them from being filtered out during stopword removal, you can build a custom stopword list that excludes these words.

In [48]:
wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
stop = set(stopwords.words('english'))
sentence = "how are we putting in efforts to enhance our understanding of Lemmatization"
for word in wh_words:
    stop.remove(word)
sentence_after_stopword_removal = [token for token in sentence.split() if token not in stop]
" ".join(sentence_after_stopword_removal)

'how putting efforts enhance understanding Lemmatization'

After stopword removal, the sentence "how are we putting in efforts to enhance our understanding of Lemmatization" becomes "how putting efforts enhance understanding Lemmatization." The stopwords "are," "we," "in," "to," "our," and "of" are removed, while "how" is kept. Stopword removal is usually the first step after tokenization when building a vocabulary or preprocessing text data.
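One detail worth keeping in mind is that the membership test is case-sensitive, so a capitalized "We" would not match a lowercase stopword list. The sketch below lowercases each token before the lookup. The stopword set here is a small made-up subset for illustration, not NLTK's list, and `remove_stopwords` is a hypothetical helper.

```python
# Tiny illustrative stopword set (assumption, not the NLTK list)
STOPWORDS = {"are", "we", "in", "to", "our", "of", "how"}

def remove_stopwords(sentence, stopwords, keep=frozenset()):
    """Drop stopwords case-insensitively, except those listed in keep."""
    active = stopwords - set(keep)
    return [token for token in sentence.split() if token.lower() not in active]

sentence = "How are We putting in efforts to enhance our understanding of Lemmatization"
filtered = remove_stopwords(sentence, STOPWORDS, keep={"how"})
print(" ".join(filtered))
```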

## Case folding



Case folding converts all text to lowercase, so that words like "the" and "The" are treated as identical. This normalization helps systems like search engines. Proper nouns like "Lamborghini" become "lamborghini," which keeps matching consistent. However, case folding can be limiting for proper nouns derived from common nouns, and acronyms like "CAT" collide with common nouns once they are lowercased. While ML models could address this, lowercasing is often preferred, especially when user input is predominantly lowercase. Language also plays a key role; English capitalization carries more information than capitalization does in some other languages.

In [50]:
s = "We are putting in efforts to enhance our understanding of Lemmatization"
s = s.lower()
s


'we are putting in efforts to enhance our understanding of lemmatization'
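The collision between acronyms and common nouns is easy to demonstrate. The sketch below also contrasts `str.lower()` with Python's `str.casefold()`, which is a more aggressive form of folding intended for caseless matching; the German word is an added example, not something from the text above.

```python
tokens = ["The", "the", "CAT", "cat", "Straße", "STRASSE"]

# lower() merges "The"/"the" and "CAT"/"cat", but keeps "straße" and "strasse" apart
lowered = {token.lower() for token in tokens}

# casefold() additionally maps "ß" to "ss", so both spellings coincide
folded = {token.casefold() for token in tokens}

print(sorted(lowered))
print(sorted(folded))
```

After folding, the acronym "CAT" can no longer be told apart from the animal, which is exactly the loss of information described above.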

## N-grams


N-grams are sequences of n adjacent words, used to capture compound meanings. Unigrams are single words, bigrams are pairs (like "dinner table"), and trigrams are triples (like "United Arab Emirates"). They retain meaning that is lost when words are processed individually. NLP tasks often use unigrams, bigrams, and trigrams together to gather comprehensive information from text.

In [51]:
# The following code illustrates an example of capturing bigrams:
from nltk.util import ngrams
s = "Natural Language Processing is the way to go"
tokens = s.split()
bigrams = list(ngrams(tokens, 2))
[" ".join(token) for token in bigrams]

['Natural Language',
 'Language Processing',
 'Processing is',
 'is the',
 'the way',
 'way to',
 'to go']

In [53]:
# Let's try and capture trigrams from the same sentence using the following code:
s = "Natural Language Processing is the way to go"
tokens = s.split()
trigrams = list(ngrams(tokens, 3))
[" ".join(token) for token in trigrams]


['Natural Language Processing',
 'Language Processing is',
 'Processing is the',
 'is the way',
 'the way to',
 'way to go']
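The same sliding-window idea can be written in plain Python with `zip`, which is handy for seeing what happens under the hood. This is a minimal sketch; `make_ngrams` is a hypothetical helper, not part of NLTK.

```python
def make_ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by spaces."""
    # zip over n copies of the list, each shifted one position further
    shifted = (tokens[i:] for i in range(n))
    return [" ".join(gram) for gram in zip(*shifted)]

tokens = "Natural Language Processing is the way to go".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence with 8 tokens yields 7 bigrams and 6 trigrams, since each window must fit entirely inside the sentence.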

### Building the vocabulary

Text preprocessing steps should be performed before applying algorithms to the data, but the choice of steps depends on the specific use case. After preprocessing, tokens can be combined to form the vocabulary.

In [54]:
s = "Natural Language Processing is the way to go"
tokens = set(s.split())
vocabulary = sorted(tokens)
vocabulary

['Language', 'Natural', 'Processing', 'go', 'is', 'the', 'to', 'way']
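In practice a vocabulary is usually built from many sentences, and it is common to keep token counts and an integer id per token. The sketch below shows one way to do this with `collections.Counter`, applying case folding first. The second sentence in `corpus` is made up for illustration.

```python
from collections import Counter

# Small illustrative corpus; the second sentence is an assumption
corpus = [
    "Natural Language Processing is the way to go",
    "Language is the way we communicate",
]

# Count lowercased tokens across all sentences
counts = Counter(token.lower() for sentence in corpus for token in sentence.split())

# Sorted vocabulary and a token-to-id lookup
vocabulary = sorted(counts)
token_to_id = {token: idx for idx, token in enumerate(vocabulary)}

print(vocabulary)
print(counts.most_common(3))
```

The counts make it easy to drop very rare or very frequent tokens later, and the id mapping is what most models expect as input.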

## Summary

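Building a natural language vocabulary involves several preprocessing steps: tokenization, stemming or lemmatization, stopword removal, case folding, and n-gram extraction. Which steps to apply depends on the use case, but proper preprocessing generally improves the performance of downstream machine learning models compared with working on raw text. These steps are crucial in NLP and in broader machine learning applications.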