
# **Terminologies in NLP**

In [102]:
import re
import nltk

test_text: str = "I'm Walking through a lake, but i did'nt like it!"
test_text = test_text.lower()

## **Tokenization**: splitting text into tokens such as words and punctuation marks (unlike a plain whitespace split, a tokenizer also separates special characters from words)

In [103]:
# naive baseline: split on whitespace only (raw string avoids an invalid-escape warning)
re.split(r"\s", test_text)

["i'm",
 'walking',
 'through',
 'a',
 'lake,',
 'but',
 'i',
 "did'nt",
 'like',
 'it!']

In [104]:
# with nltk
words = nltk.word_tokenize(test_text)
words

['i',
 "'m",
 'walking',
 'through',
 'a',
 'lake',
 ',',
 'but',
 'i',
 "did'nt",
 'like',
 'it',
 '!']
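
To make the difference from whitespace splitting concrete, here is a minimal regex-only sketch that separates punctuation from words. The pattern and the helper name `simple_tokenize` are assumptions made for illustration. It does not reproduce `nltk.word_tokenize`: for instance, it splits `i'm` into `i`, `'`, `m` instead of `i`, `'m`.

```python
import re


def simple_tokenize(text: str) -> list:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      -> one or more letters, digits, or underscores
    # [^\w\s]  -> any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sample = "i'm walking through a lake, but i did'nt like it!"
tokens = simple_tokenize(sample)
print(tokens)
```

Unlike the whitespace split, `lake` and `,` come out as separate tokens, which is the behaviour the NLTK tokenizer also shows above.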

## **Stemming**: reducing words to their root form (mostly it just cuts off suffixes such as 'ing', 'ed', and 's')

In [105]:
# e.g. (with NLTK's PorterStemmer; the stem need not be a real word)
# running -> run
# natural -> natur

In [106]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
" ".join([ps.stem(each) for each in words])  # stemmed text

"i 'm walk through a lake , but i did'nt like it !"

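The suffix-cutting idea can be sketched without any library. The toy function below is not the Porter algorithm: the name `naive_stem`, the suffix list, and the minimum stem length of three are all assumptions chosen for illustration.

```python
def naive_stem(word: str, suffixes=("ing", "ed", "s")) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        # Only strip when enough of the word remains to be a plausible stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["walking", "running", "lakes", "it"]:
    print(w, "->", naive_stem(w))
```

Plain stripping turns `running` into `runn`. `PorterStemmer` applies further rules, such as undoubling the final consonant, and returns `run`, which is why a real stemmer is more than a suffix list.
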
## **Part of Speech (POS) tagging**: identifying and classifying each word as a noun, verb, and so on

In [107]:
from nltk.tag import pos_tag

word_pos = pos_tag(words)
word_pos

[('i', 'NN'),
 ("'m", 'VBP'),
 ('walking', 'VBG'),
 ('through', 'IN'),
 ('a', 'DT'),
 ('lake', 'NN'),
 (',', ','),
 ('but', 'CC'),
 ('i', 'JJ'),
 ("did'nt", 'VBP'),
 ('like', 'IN'),
 ('it', 'PRP'),
 ('!', '.')]
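
The tags in this output are Penn Treebank tags. As a reading aid, the sketch below maps the tags seen above to short descriptions. The dictionary covers only those tags, and the names `PENN_TAGS` and `describe` are assumptions for illustration, not part of NLTK.

```python
# Short descriptions of the Penn Treebank tags that appear in the output above.
PENN_TAGS = {
    "NN": "noun, singular or mass",
    "VBP": "verb, non-3rd person singular present",
    "VBG": "verb, gerund or present participle",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "CC": "coordinating conjunction",
    "JJ": "adjective",
    "PRP": "personal pronoun",
}


def describe(tagged):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "punctuation or other")) for word, tag in tagged]


# A few pairs copied from the tagger output above.
tagged_sample = [("walking", "VBG"), ("lake", "NN"), ("it", "PRP"), ("!", ".")]
for word, tag, meaning in describe(tagged_sample):
    print(f"{word!r:10} {tag:4} {meaning}")
```

Note that both `i` tokens above are tagged `NN` and `JJ` rather than `PRP`. A possible cause is that the text was lowercased before tagging, so it may be worth tagging the original casing and comparing.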

## **Lemmatization**: converting words into their meaningful base form (it produces a valid, dictionary-based word, the lemma)

- **Before applying lemmatization, it is better to run `pos_tag` first, because the lemmatizer needs each word's POS tag to perform well.**
- **Only nouns, verbs, adverbs, and adjectives need to be lemmatized.**

In [108]:
# e.g.
# better -> good (when tagged as an adjective)

In [109]:
from nltk.corpus import wordnet

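# Map each Penn Treebank tag to the WordNet POS constant the lemmatizer expects.
pos_result = []
for word, tag in word_pos:
    # The first letter of the tag identifies the broad word class (match needs Python 3.10+).
    match tag[0]:
        case "V":
            pos_result.append((word, wordnet.VERB))
        case "N":
            pos_result.append((word, wordnet.NOUN))
        case "J":
            pos_result.append((word, wordnet.ADJ))
        case "R":
            pos_result.append((word, wordnet.ADV))
        case _:
            pos_result.append((word,))  # no WordNet POS for other word classes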


pos_result

[('i', 'n'),
 ("'m", 'v'),
 ('walking', 'v'),
 ('through',),
 ('a',),
 ('lake', 'n'),
 (',',),
 ('but',),
 ('i', 'a'),
 ("did'nt", 'v'),
 ('like',),
 ('it',),
 ('!',)]
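
The same mapping can also be written as a dictionary lookup, which works on Python versions that lack `match`. This is a sketch under assumptions: the names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical, and plain strings stand in for the `wordnet` constants so that it runs without NLTK.

```python
# WordNet POS constants are single letters: wordnet.NOUN == "n", wordnet.VERB == "v",
# wordnet.ADJ == "a", wordnet.ADV == "r". Plain strings are used here on purpose.
PENN_PREFIX_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}


def penn_to_wordnet(tag: str, default: str = "n") -> str:
    """Return the WordNet POS for a Penn Treebank tag, or `default` if none applies."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)


# A few pairs copied from the tagger output above.
tagged_sample = [("walking", "VBG"), ("lake", "NN"), ("through", "IN"), ("!", ".")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged_sample]
print(mapped)
```

Falling back to `"n"` mirrors the lemmatizer's own default POS, so every pair can be passed straight to `lemmatize` without a length check.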

In [111]:
from nltk.stem import WordNetLemmatizer

wl = WordNetLemmatizer()
# Fall back to the noun POS ("n", the lemmatizer's default) when no WordNet tag was assigned.
" ".join([wl.lemmatize(each[0], pos=each[1] if len(each) >= 2 else "n") for each in pos_result])

"i 'm walk through a lake , but i did'nt like it !"