<a href="https://colab.research.google.com/github/adnaen/machine-learning-notes/blob/main/deep_learning/4_nlp/terminologies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Terminologies in NLP**

**We use spaCy rather than NLTK for NLP tasks because spaCy is more Pythonic and has an object-oriented API.**

In [1]:
import re
import spacy

text: str = "I'm Walking through Dr. Strange a lake, but i did'nt like it!.But i know one thing i have eaten thing"

## **Tokenization**: splitting text into words, a.k.a. tokens (unlike splitting on spaces alone, it also separates punctuation and special characters)

In [2]:
nlp = spacy.blank("en")
doc = nlp(text)

for token in doc:
    print(token)

I
'm
Walking
through
Dr.
Strange
a
lake
,
but
i
did'nt
like
it!.But
i
know
one
thing
i
have
eaten
thing
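
To make the contrast with plain whitespace splitting concrete, here is a minimal sketch that uses only the standard library. The regular expression is an assumption made for illustration; it is not spaCy's rule set (for instance, it keeps `I'm` as one token, whereas spaCy splits it into `I` and `'m`).

```python
import re

sample = "I'm walking, but i didn't like it!"

# Naive approach: split on whitespace only, so punctuation stays glued to words.
whitespace_tokens = sample.split()

# Toy tokenizer (an assumption, not spaCy's rules): a token is either a word
# that may contain internal apostrophes, or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sample)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace version yields `walking,` and `it!` as single items, while the toy tokenizer emits the comma and the exclamation mark as separate tokens.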


## **Stemming**: reducing words to their root form (mostly it just cuts off suffixes such as 'ing', 'ed', and 's')

In [3]:
# e.g.
# running -> run
# natural -> natur
# eaten -> eaten
# ate -> ate

In [4]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
" ".join([ps.stem(token) for token in ["running", "natural", "eaten", "eat", "ate"]])  # stemmed text

'run natur eaten eat ate'

**spaCy does not provide a stemmer, because its lemmatization is enough to get an accurate base token.**
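
The idea of "cutting off suffixes" can be sketched in a few lines. This is a deliberately naive illustration, not the Porter algorithm: the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions chosen for the example.

```python
def naive_stem(word: str, suffixes: tuple[str, ...] = ("ing", "ed", "s")) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in suffixes:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


words = ["walking", "jumped", "cats", "running", "ate"]
stems = [naive_stem(w) for w in words]
print(stems)
```

Note that `running` becomes `runn` here, whereas `PorterStemmer` returns `run`: a real stemmer applies extra rules (such as handling doubled consonants) on top of plain suffix removal. Neither approach maps `ate` to `eat`, which is the gap lemmatization fills.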

## **Lemmatization**: converting tokens to their base form; it works like stemming, but lemmatization gives a more meaningful base token

In [5]:
# e.g.
# better -> well
# eaten -> eat
# ate -> eat

In [6]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 73.1 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [7]:
# The cell above downloads the pretrained English pipeline (en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Adnan is running after ate launch, and he was good at it.")

for token in doc:
    print(f"{token.text}\t | {token.lemma_}")

Adnan	 | Adnan
is	 | be
running	 | run
after	 | after
ate	 | ate
launch	 | launch
,	 | ,
and	 | and
he	 | he
was	 | be
good	 | good
at	 | at
it	 | it
.	 | .
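
Unlike a stemmer, a lemmatizer needs knowledge of the language, for example a table of irregular forms. The sketch below shows only that lookup idea. The table `IRREGULAR_LEMMAS` and the function `toy_lemmatize` are hypothetical and hand-written for illustration; they are not how spaCy is implemented.

```python
# Hypothetical, hand-written table of irregular forms (illustration only).
IRREGULAR_LEMMAS = {
    "ate": "eat",
    "eaten": "eat",
    "was": "be",
    "is": "be",
    "ran": "run",
}


def toy_lemmatize(token: str) -> str:
    """Return the dictionary form if known, otherwise the lowercased token."""
    lowered = token.lower()
    return IRREGULAR_LEMMAS.get(lowered, lowered)


lemmas = [toy_lemmatize(t) for t in ["Ate", "eaten", "was", "lake"]]
print(lemmas)
```

A real lemmatizer also takes the part of speech into account. That is a likely reason `ate` stays `ate` in the output above: in that sentence the model does not treat it as a verb, so no verb rule applies.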


## **Part Of Speech (POS)**: identifying and classifying each word as a noun, verb, and so on

In [8]:

nlp = spacy.load("en_core_web_sm")
doc = nlp("Adnan is running after ate launch, and he was good at it.")

for token in doc:
    print(f"{token}\t | {token.pos_} ({spacy.explain(token.pos_)}) | {token.tag_} ({spacy.explain(token.tag_)})")

Adnan	 | PROPN (proper noun) | NNP (noun, proper singular)
is	 | AUX (auxiliary) | VBZ (verb, 3rd person singular present)
running	 | VERB (verb) | VBG (verb, gerund or present participle)
after	 | ADP (adposition) | IN (conjunction, subordinating or preposition)
ate	 | ADJ (adjective) | JJ (adjective (English), other noun-modifier (Chinese))
launch	 | NOUN (noun) | NN (noun, singular or mass)
,	 | PUNCT (punctuation) | , (punctuation mark, comma)
and	 | CCONJ (coordinating conjunction) | CC (conjunction, coordinating)
he	 | PRON (pronoun) | PRP (pronoun, personal)
was	 | AUX (auxiliary) | VBD (verb, past tense)
good	 | ADJ (adjective) | JJ (adjective (English), other noun-modifier (Chinese))
at	 | ADP (adposition) | IN (conjunction, subordinating or preposition)
it	 | PRON (pronoun) | PRP (pronoun, personal)
.	 | PUNCT (punctuation) | . (punctuation mark, sentence closer)


## **Named Entity Recognition (NER)**: categorizing spans of text as PERSON, MONEY, DATE, and so on

In [9]:
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("As CEO of both Alphabet and Google in 2023, Sundar Pichai earned a salary of $2 million annually")

for ent in doc.ents:
    print(f"{ent.text} | {ent.label_}")

displacy.render(doc, style="ent", jupyter=True)

Alphabet and Google | ORG
2023 | DATE
Sundar Pichai | PERSON
$2 million | MONEY
annually | DATE


## **Bag Of Words (BOW)**:
- First, it identifies all the unique words in the dataset.
- Then, it counts how many times each word occurs in each sentence.
- Then, it generates a matrix from those counts.
- The set of unique words is called the vocabulary.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "Hey John, could you please clarify me to our new Data Science Project!",
    "Sure MCcathy, here the main thing that we need to implement."
]

cv = CountVectorizer(stop_words="english")
text_vector = cv.fit_transform(texts)

print(f"Vocabulary (unique values in the 'texts' dataset): \n{cv.get_feature_names_out()}")

print(f"array : \n{text_vector.toarray()}")

Vocabulary (unique values in the 'texts' dataset): 
['clarify' 'data' 'hey' 'implement' 'john' 'main' 'mccathy' 'need' 'new'
 'project' 'science' 'sure' 'thing']
array : 
[[1 1 1 0 1 0 0 0 1 1 1 0 0]
 [0 0 0 1 0 1 1 1 0 0 0 1 1]]


- **Vocabulary size: `13`**
- **Each sentence is converted into a vector whose length equals the vocabulary size.**
- **Each entry is the number of times that vocabulary word occurs in the sentence, or 0 if it does not occur.**

## **TF-IDF (Term Frequency-Inverse Document Frequency)**

- **It reduces the importance of words that occur in most documents and gives more weight to words that occur in only a few.**
- **`Term Frequency`**: measures how often a word occurs in a single document.
- **`Inverse Document Frequency`**: measures how rare a word is across all documents; the rarer the word, the higher its weight, and vice versa.
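
As a rough sketch of how the two parts combine, the code below computes TF-IDF by hand. It assumes scikit-learn's default settings (raw counts for TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The helper name `tfidf` is hypothetical, and the token lists are taken from the vocabulary printed in the Bag of Words output, with stop words already removed.

```python
import math
from collections import Counter

# Tokens per document, stop words already removed (taken from the BOW output).
docs = [
    ["hey", "john", "clarify", "new", "data", "science", "project"],
    ["sure", "mccathy", "main", "thing", "need", "implement"],
]


def tfidf(docs):
    n_docs = len(docs)
    vocabulary = sorted({t for d in docs for t in d})
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF (assumed to match scikit-learn's default settings).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for d in docs:
        tf = Counter(d)
        raw = [tf[t] * idf[t] for t in vocabulary]
        # L2-normalize the row so that its squared entries sum to 1.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocabulary, rows


vocabulary, weights = tfidf(docs)
for row in weights:
    print([round(w, 8) for w in row])
```

Because the two sentences share no words, every term has the same IDF, and normalization leaves `1/sqrt(7)` for each of the seven terms in the first document and `1/sqrt(6)` for each of the six terms in the second.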

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(stop_words="english")
tf_idf_model = tf_idf.fit_transform(texts)

In [12]:
tf_idf.get_feature_names_out()
tf_idf_model.toarray()

array([[0.37796447, 0.37796447, 0.37796447, 0.        , 0.37796447,
        0.        , 0.        , 0.        , 0.37796447, 0.37796447,
        0.37796447, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.40824829, 0.        ,
        0.40824829, 0.40824829, 0.40824829, 0.        , 0.        ,
        0.        , 0.40824829, 0.40824829]])