## Comparing **spaCy** (<b><i>POS tagging, lemmatization, and NER</i></b>) with **NLTK** stemming

In [None]:
# !pip install spacy --upgrade

In [None]:
# !python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)

## spaCy: Industrial-strength NLP (*https://spacy.io/*)

In [None]:
# import spacy

In [None]:
# Load the English language model
# Ensure the model is installed
# try:
#     spacy.cli.download("en_core_web_sm")
# except SystemExit:
#     pass
#
# nlp = spacy.load('en_core_web_sm')

In [None]:
import spacy  # needed later for spacy.explain()
import en_core_web_sm

# Load the small English pipeline from its installed package
nlp = en_core_web_sm.load()

In [None]:
# Sample text for processing

text = """Building LLM Powered Applications delves into the fundamental concepts, cutting-edge technologies, and practical applications that LLMs offer, ultimately paving the way for the emergence of large foundation models (LFMs) that extend the boundaries of AI capabilities.

The book begins with an in-depth introduction to LLMs. We then explore various mainstream architectural frameworks, including both proprietary models (GPT 3.5/4) and open-source models (Falcon LLM), and analyze their unique strengths and differences. Moving ahead, with a focus on the Python-based, lightweight framework called LangChain, we guide you through the process of creating intelligent agents capable of retrieving information from unstructured data and engaging with structured data using LLMs and powerful toolkits. Furthermore, the book ventures into the realm of LFMs, which transcend language modeling to encompass various AI tasks and modalities, such as vision and audio.

Whether you are a seasoned AI expert or a newcomer to the field, this book is your roadmap to unlock the full potential of LLMs and forge a new era of intelligent machines."""
doc = nlp(text)

In [None]:
# Part-of-Speech Tagging and Lemmatization

print("POS Tagging and Lemmatization:")  # lemmas are compared with NLTK stems further down
for token in doc:
    print(f"{token.text:<12} | {token.pos_:<10} | {token.lemma_:<10}")

POS Tagging and Lemmatization:
Building     | VERB       | build     
LLM          | PROPN      | LLM       
Powered      | PROPN      | Powered   
Applications | PROPN      | Applications
delves       | NOUN       | delf      
into         | ADP        | into      
the          | DET        | the       
fundamental  | ADJ        | fundamental
concepts     | NOUN       | concept   
,            | PUNCT      | ,         
cutting      | VERB       | cut       
-            | PUNCT      | -         
edge         | NOUN       | edge      
technologies | NOUN       | technology
,            | PUNCT      | ,         
and          | CCONJ      | and       
practical    | ADJ        | practical 
applications | NOUN       | application
that         | PRON       | that      
LLMs         | PROPN      | LLMs      
offer        | VERB       | offer     
,            | PUNCT      | ,         
ultimately   | ADV        | ultimately
paving       | VERB       | pave      
the          | DET        | t
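
The table above is easier to read when only the rows that lemmatization actually changed are kept. The sketch below does that on a handful of rows copied by hand from the printed output (the `rows` list is illustrative, not produced by the notebook). Note the `delves` row: it is tagged `NOUN` and lemmatized to `delf`, which suggests that a tagging error can carry over into the lemma, since spaCy's English lemmatizer takes the POS tag into account.

```python
# (text, POS, lemma) rows copied by hand from the output printed above
rows = [
    ("Building", "VERB", "build"),
    ("Applications", "PROPN", "Applications"),
    ("delves", "NOUN", "delf"),
    ("concepts", "NOUN", "concept"),
    ("technologies", "NOUN", "technology"),
    ("paving", "VERB", "pave"),
]

# Keep only the rows whose lemma differs from the surface form, ignoring case
changed = [
    (text, pos, lemma)
    for text, pos, lemma in rows
    if lemma.lower() != text.lower()
]

for text, pos, lemma in changed:
    print(f"{text:<12} | {pos:<5} | {lemma}")
```

In the live notebook the same filter could be applied to `doc` directly, using `token.text`, `token.pos_`, and `token.lemma_`.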

## Named Entity Recognition


In [None]:
print("\nNamed Entity Recognition:")

for entity in doc.ents:
    print(f"{entity.text:<35} | {entity.label_:<15} | {spacy.explain(entity.label_)}")


Named Entity Recognition:
Building LLM Powered Applications   | ORG             | Companies, agencies, institutions, etc.
AI                                  | GPE             | Countries, cities, states
GPT                                 | ORG             | Companies, agencies, institutions, etc.
3.5/4                               | CARDINAL        | Numerals that do not fall under another type
Falcon LLM                          | ORG             | Companies, agencies, institutions, etc.
Python                              | ORG             | Companies, agencies, institutions, etc.
LangChain                           | ORG             | Companies, agencies, institutions, etc.
AI                                  | GPE             | Countries, cities, states
AI                                  | ORG             | Companies, agencies, institutions, etc.
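
The output above shows that the small model labels most of these technical terms as `ORG` and tags `AI` inconsistently (twice as `GPE`, once as `ORG`). A quick tally makes that skew visible. This is a minimal sketch on pairs copied by hand from the printed output; the `entities` list is an assumption standing in for `doc.ents`.

```python
from collections import Counter

# (entity text, label) pairs copied by hand from the output printed above
entities = [
    ("Building LLM Powered Applications", "ORG"),
    ("AI", "GPE"),
    ("GPT", "ORG"),
    ("3.5/4", "CARDINAL"),
    ("Falcon LLM", "ORG"),
    ("Python", "ORG"),
    ("LangChain", "ORG"),
    ("AI", "GPE"),
    ("AI", "ORG"),
]

# Count how often each label was assigned
label_counts = Counter(label for _, label in entities)
for label, count in label_counts.most_common():
    print(f"{label:<10} {count}")

# Collect every label given to the same surface text to spot inconsistent tagging
labels_for_ai = sorted({label for text, label in entities if text == "AI"})
print(labels_for_ai)
```

In the notebook itself, the list could be built with `[(ent.text, ent.label_) for ent in doc.ents]` instead of being typed out.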


### spaCy has no built-in stemmer, so we stem with the **PorterStemmer** from NLTK instead

In [None]:
import nltk

# word_tokenize needs the Punkt tokenizer data
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Onepoint\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

## Tokenize using NLTK for stemming

In [None]:
stemmer = PorterStemmer()
nltk_tokens = word_tokenize(text)

# Stemming using NLTK
stemmed_words = [stemmer.stem(token) for token in nltk_tokens]  # applying the PorterStemmer to each token

In [None]:
print(stemmed_words)

['build', 'llm', 'power', 'applic', 'delv', 'into', 'the', 'fundament', 'concept', ',', 'cutting-edg', 'technolog', ',', 'and', 'practic', 'applic', 'that', 'llm', 'offer', ',', 'ultim', 'pave', 'the', 'way', 'for', 'the', 'emerg', 'of', 'larg', 'foundat', 'model', '(', 'lfm', ')', 'that', 'extend', 'the', 'boundari', 'of', 'ai', 'capabl', '.', 'the', 'book', 'begin', 'with', 'an', 'in-depth', 'introduct', 'to', 'llm', '.', 'we', 'then', 'explor', 'variou', 'mainstream', 'architectur', 'framework', ',', 'includ', 'both', 'proprietari', 'model', '(', 'gpt', '3.5/4', ')', 'and', 'open-sourc', 'model', '(', 'falcon', 'llm', ')', ',', 'and', 'analyz', 'their', 'uniqu', 'strength', 'and', 'differ', '.', 'move', 'ahead', ',', 'with', 'a', 'focu', 'on', 'the', 'python-bas', ',', 'lightweight', 'framework', 'call', 'langchain', ',', 'we', 'guid', 'you', 'through', 'the', 'process', 'of', 'creat', 'intellig', 'agent', 'capabl', 'of', 'retriev', 'inform', 'from', 'unstructur', 'data', 'and', 'en

## Each token in **nltk_tokens** is paired with its <u>**stemmed**</u> form in **stemmed_words** and printed as an aligned table

In [None]:
print('NLTK Tokens:    | Stemmed Words:')

for original, stemmed in zip(nltk_tokens, stemmed_words):
    print(f"{original:15} | {stemmed:<20}")  # original token next to its Porter stem

NLTK Tokens:    | Stemmed Words:
Building        | build               
LLM             | llm                 
Powered         | power               
Applications    | applic              
delves          | delv                
into            | into                
the             | the                 
fundamental     | fundament           
concepts        | concept             
,               | ,                   
cutting-edge    | cutting-edg         
technologies    | technolog           
,               | ,                   
and             | and                 
practical       | practic             
applications    | applic              
that            | that                
LLMs            | llm                 
offer           | offer               
,               | ,                   
ultimately      | ultim               
paving          | pave                
the             | the                 
way             | way                 
for             | for          
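
The table shows that stemming lowercases tokens and collapses different surface forms onto one stem (`Applications` and `applications` both become `applic`). A small sketch, assuming only that NLTK is installed, groups a few words from the sample text by their stem; the `words` list is a hand-picked illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few surface forms taken from the sample text
words = ["Applications", "applications", "LLM", "LLMs", "technologies", "practical"]

# Map each stem to the set of surface forms that reduce to it
forms_by_stem = defaultdict(set)
for word in words:
    forms_by_stem[stemmer.stem(word)].add(word)

for stem, forms in forms_by_stem.items():
    print(f"{stem:<10} <- {sorted(forms)}")
```

Stems such as `applic` are not dictionary words, which is the main practical difference from the lemmas spaCy returns.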

In [None]:
print('Tokens >  Stemmed  > Lemma  >  POS')

# spaCy splits "cutting-edge" into three tokens while NLTK keeps it whole, so
# zip(doc, stemmed_words) misaligns the rows. Stem each spaCy token directly instead.
for token in doc:
    stemmed = stemmer.stem(token.text)
    print(f"t: {token.text:<13} |s: {stemmed:<12} |l: {token.lemma_:<15} | {token.pos_:<10}")

Tokens >  Stemmed  > Lemma  >  POS
t: Building      |s: build        |l: build           | VERB      
t: LLM           |s: llm          |l: LLM             | PROPN     
t: Powered       |s: power        |l: Powered         | PROPN     
t: Applications  |s: applic       |l: Applications    | PROPN     
t: delves        |s: delv         |l: delf            | NOUN      
t: into          |s: into         |l: into            | ADP       
t: the           |s: the          |l: the             | DET       
t: fundamental   |s: fundament    |l: fundamental     | ADJ       
t: concepts      |s: concept      |l: concept         | NOUN      
t: ,             |s: ,            |l: ,               | PUNCT     
t: cutting       |s: cut          |l: cut             | VERB      
t: -             |s: -            |l: -               | PUNCT     
t: edge          |s: edg          |l: edge            | NOUN      
t: technologies  |s: technolog    |l: technology      | NOUN      
t: ,             |s: ,
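
Pairing spaCy tokens with NLTK stems by position is only safe while both tokenizers produce the same tokens. The outputs printed earlier show that spaCy splits `cutting-edge` into `cutting`, `-`, `edge`, whereas `word_tokenize` keeps it as one token, so any `zip` of the two sequences drifts from that point on. The sketch below locates the first divergence using the leading tokens copied by hand from those outputs (both lists are illustrative stand-ins for `doc` and `nltk_tokens`).

```python
# Leading tokens from each tokenizer, copied by hand from the outputs printed earlier
spacy_tokens = [
    "Building", "LLM", "Powered", "Applications", "delves", "into", "the",
    "fundamental", "concepts", ",", "cutting", "-", "edge", "technologies",
]
nltk_tokens_head = [
    "Building", "LLM", "Powered", "Applications", "delves", "into", "the",
    "fundamental", "concepts", ",", "cutting-edge", "technologies", ",", "and",
]

# Index of the first position where the two token streams disagree (None if they never do)
first_mismatch = next(
    (i for i, (s, n) in enumerate(zip(spacy_tokens, nltk_tokens_head)) if s != n),
    None,
)

print(first_mismatch)
print(spacy_tokens[first_mismatch], "vs", nltk_tokens_head[first_mismatch])
```

A check like this before zipping two token lists is a cheap way to catch silent misalignment when mixing tokenizers from different libraries.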