## Lexical Processing

Lexical processing refers to the fundamental Natural Language Processing (NLP) techniques that operate on text at the word or token level.

##### Two key techniques

1. **Tokenization**
2. **Normalization**

### 1. Tokenization

![](https://miro.medium.com/v2/resize:fit:1400/1*CdjbU3J5BYuIi-4WbWnKng.png)

* **Tokenization** is the process of breaking down a text into smaller units, which are typically words or subwords.
* The purpose of tokenization is to turn raw text into discrete units that NLP algorithms can analyze or process further.
* Tokens are the basic units of text used for various NLP tasks such as text classification, named entity recognition, and sentiment analysis.

In [2]:
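import nltk
# Tokenizer models used by word_tokenize. Newer NLTK releases (3.9 and later)
# may ask for nltk.download('punkt_tab') instead.
nltk.download('punkt')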

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
from nltk.tokenize import word_tokenize

text = "Tokenization is the process of breaking down text into smaller units, typically words or subwords, called tokens."
tokens = word_tokenize(text)

print(tokens)

['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'text', 'into', 'smaller', 'units', ',', 'typically', 'words', 'or', 'subwords', ',', 'called', 'tokens', '.']
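
To make the idea of splitting text into units concrete, here is a minimal sketch of a tokenizer built only on Python's standard `re` module. The function name `simple_tokenize` and the regular expression are illustrative assumptions, not what NLTK does internally; NLTK's tokenizer applies more elaborate rules.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      matches a run of letters, digits, or underscores (a "word")
    # [^\w\s]  matches any single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Tokens are words, subwords, or punctuation."
print(simple_tokenize(sample))
# ['Tokens', 'are', 'words', ',', 'subwords', ',', 'or', 'punctuation', '.']
```

A pattern this simple has clear limits: it splits a contraction such as "don't" into three pieces (`don`, `'`, `t`), which is one reason a dedicated tokenizer is usually preferred.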


### 2. Normalization

**Normalization** in NLP involves transforming text into a standard, consistent format. Two common normalization techniques are:

1. **Stemming**
2. **Lemmatization**

Both techniques aim to reduce words to their base or root form, but they do so in different ways.

![](https://miro.medium.com/v2/resize:fit:1400/0*6k_6zouDWBMWkehE)

#### 2.1 Stemming

* **Stemming** is the process of removing suffixes from words to reduce them to their root form. It operates by chopping off the end of words based on common patterns.

![](https://qph.cf2.quoracdn.net/main-qimg-187b045c480fa7c0b16869daa0661b5a)

Example using Porter Stemmer:

In [7]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "easily", "cats", "attractive", "programming"]
stemmed_words = [stemmer.stem(word) for word in words]

print("Word ==> Stemmed Word")
print("=======================")
for word, stem_word in zip(words, stemmed_words):
    print(f"{word} ==> {stem_word}")

Word ==> Stemmed Word
=======================
running ==> run
easily ==> easili
cats ==> cat
attractive ==> attract
programming ==> program


Stemming reduces words to a base form by applying suffix-stripping rules without consulting a dictionary, so the result is not always a real word. For example, "easily" is stemmed to "easili", which is not a valid English word.
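
To see why rule-based suffix stripping can yield non-words, here is a deliberately naive sketch. It is not the Porter algorithm: the suffix list, the `toy_stem` name, and the `min_stem_len` parameter are all hypothetical choices made for illustration.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "ly", "ive", "s")  # hypothetical, hand-picked rule list

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied

for word in ["running", "easily", "cats", "attractive", "programming"]:
    print(f"{word} ==> {toy_stem(word)}")
# running ==> runn
# easily ==> easi
# cats ==> cat
# attractive ==> attract
# programming ==> programm
```

The rules know nothing about English vocabulary, so "runn" and "programm" come out as readily as "cat". Real stemmers such as Porter add further rules (for example, for doubled consonants) that handle many of these cases, but they remain rule-based rather than dictionary-based.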

#### 2.2 Lemmatization

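Lemmatization, on the other hand, reduces words to their base or dictionary form (the lemma), so the result is always a valid word. Because the correct lemma can depend on a word's part of speech, the example below tags each word before lemmatizing it.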

Example using WordNet Lemmatizer:

In [11]:
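import nltk
# POS tagger model and the WordNet corpus used by the lemmatizer. Newer NLTK
# releases may ask for 'averaged_perceptron_tagger_eng' instead of the tagger name below.
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')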

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [13]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

words = ["running", "easily", "cats", "attractive", "programming"]

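# Tag a single word with NLTK's POS tagger, then map the first letter of its
# Penn Treebank tag (J, N, V, R) to the matching WordNet constant.
# Unrecognized tags fall back to noun, which is also lemmatize()'s default.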
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in words]

print("Word ==> Lemmatized Word")
print("=======================")
for word, lemma in zip(words, lemmatized_words):
    print(f"{word} ==> {lemma}")

Word ==> Lemmatized Word
=======================
running ==> run
easily ==> easily
cats ==> cat
attractive ==> attractive
programming ==> program
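
The role of the part-of-speech tag can be illustrated with a toy dictionary lookup. This is only a sketch of the idea: the `LEMMA_TABLE` entries and the `toy_lemmatize` function are hypothetical and hand-written, not WordNet data or NLTK's implementation.

```python
# Toy dictionary-based lemmatizer. The lookup table is hypothetical and
# hand-written for illustration; it is not WordNet data.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("cats", "n"): "cat",
    ("programming", "v"): "program",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if not listed."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("running", "v"))  # run
print(toy_lemmatize("running", "n"))  # running (no noun entry, so unchanged)
print(toy_lemmatize("better", "a"))   # good
```

Two things stand out in this sketch. The same surface form maps to different lemmas depending on its part of speech, and a lookup can relate irregular forms ("better" to "good") that no suffix rule would connect. Words missing from the table are returned unchanged, which is consistent with "easily" and "attractive" passing through untouched in the output above.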
