<a href="https://colab.research.google.com/github/deilquesce/API_practice/blob/main/ExtractingTestLab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we will demonstrate how to perform tokenization, stemming, lemmatization, and POS tagging using libraries like [spaCy](https://spacy.io/) and [NLTK](https://www.nltk.org/).

In [None]:
#This will be our corpus which we will work on
corpus_original = "Need to finalize the demo corpus which will be used for this notebook and it should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!"
corpus = "I love to work on cars in my possession. It makes me very passionate and fufilled! 2 "

In [None]:
#lower case the corpus
corpus = corpus.lower()
print(corpus)

i love to work on cars in my possession. it makes me very passionate and fufilled! 2 


In [None]:
#removing digits in the corpus
import re
corpus = re.sub(r'\d+','', corpus)
print(corpus)

i love to work on cars in my possession. it makes me very passionate and fufilled!  


In [None]:
#removing punctuations
import string
corpus = corpus.translate(str.maketrans('', '', string.punctuation))
print(corpus)

i love to work on cars in my possession it makes me very passionate and fufilled  


In [None]:
#collapsing repeated whitespace and stripping leading/trailing whitespace
corpus = ' '.join(corpus.split())
corpus

'i love to work on cars in my possession it makes me very passionate and fufilled'
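
The four cleaning steps above can be gathered into a single helper. This is a minimal sketch that applies them in the same order as the cells above; the function name `preprocess` is introduced here for illustration and does not come from NLTK or spaCy.

```python
import re
import string


def preprocess(text: str) -> str:
    """Lowercase, drop digits and ASCII punctuation, and normalize whitespace."""
    text = text.lower()
    # Remove every run of digits
    text = re.sub(r"\d+", "", text)
    # Remove the characters listed in string.punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse repeated whitespace and strip both ends
    return " ".join(text.split())


raw = "I love to work on cars in my possession. It makes me very passionate and fufilled! 2 "
cleaned = preprocess(raw)
print(cleaned)
```

Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes would pass through unchanged.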

In [None]:
!pip install spacy==3.0.5
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Tokenizing the text
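
To get a feel for what a word tokenizer does before calling the libraries, here is a rough standard-library approximation. It is only a sketch: the helper `simple_tokenize` is hypothetical, and the regular expression is far simpler than the rules NLTK or spaCy actually apply (it does not handle contractions, URLs, or abbreviations, for example).

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("It makes me very passionate and fufilled! 2")
print(tokens)
```

The point is that punctuation becomes its own token instead of staying attached to the preceding word, which is why the cleaned corpus (with punctuation already removed) tokenizes into plain words only.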

In [None]:
from pprint import pprint
##NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
stop_words_nltk = set(stopwords.words('english'))

tokenized_corpus_nltk = word_tokenize(corpus)
print("\nNLTK\nTokenized corpus:",tokenized_corpus_nltk)
tokenized_corpus_without_stopwords = [token for token in tokenized_corpus_nltk if token not in stop_words_nltk]
print("Tokenized corpus without stopwords:",tokenized_corpus_without_stopwords)


##SPACY
import spacy
spacy_model = spacy.load('en_core_web_sm')

# spaCy's built-in English stopword list
stopwords_spacy = spacy_model.Defaults.stop_words
print("\nSpacy:")
# Tokenize with spaCy's own tokenizer rather than NLTK's word_tokenize
tokenized_corpus_spacy = [token.text for token in spacy_model(corpus)]
print("Tokenized Corpus:",tokenized_corpus_spacy)
tokens_without_sw = [word for word in tokenized_corpus_spacy if word not in stopwords_spacy]

print("Tokenized corpus without stopwords",tokens_without_sw)


# Symmetric difference: tokens kept by exactly one of the two libraries
print("Difference between NLTK and spaCy output:\n",
      set(tokenized_corpus_without_stopwords) ^ set(tokens_without_sw))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



NLTK
Tokenized corpus: ['i', 'love', 'to', 'work', 'on', 'cars', 'in', 'my', 'possession', 'it', 'makes', 'me', 'very', 'passionate', 'and', 'fufilled']
Tokenized corpus without stopwords: ['love', 'work', 'cars', 'possession', 'makes', 'passionate', 'fufilled']

Spacy:
Tokenized Corpus: ['i', 'love', 'to', 'work', 'on', 'cars', 'in', 'my', 'possession', 'it', 'makes', 'me', 'very', 'passionate', 'and', 'fufilled']
Tokenized corpus without stopwords ['love', 'work', 'cars', 'possession', 'makes', 'passionate', 'fufilled']
Difference between NLTK and spaCy output:
 set()


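For this corpus, stopword removal with NLTK and spaCy gives the same result, so the difference printed above is an empty set. The two libraries ship different stopword lists, however, so their outputs can diverge on other text.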
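
When comparing two stopword-filtered token lists, a one-sided set difference such as `A - B` only reports tokens that are in `A` but missing from `B`. The sketch below uses the symmetric difference (`^`) to catch disagreements in either direction. The two stopword sets are hypothetical toy lists made up for illustration; they are not the real NLTK or spaCy lists.

```python
def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the given stopword collection."""
    return [token for token in tokens if token not in stopwords]


tokens = ["it", "makes", "me", "very", "passionate"]

# Hypothetical toy lists for illustration; NOT the real NLTK or spaCy lists.
toy_list_a = {"it", "me", "very"}
toy_list_b = {"it", "me", "very", "makes"}

kept_a = remove_stopwords(tokens, toy_list_a)
kept_b = remove_stopwords(tokens, toy_list_b)

# Symmetric difference: tokens kept under exactly one of the two lists.
only_in_one = set(kept_a) ^ set(kept_b)
print(only_in_one)
```

Here `set(kept_b) - set(kept_a)` would be empty and hide the disagreement, while the symmetric difference reports it.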

### Stemming

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

print("Before Stemming:")
print(corpus)

print("After Stemming:")
for word in tokenized_corpus_nltk:
    print(stemmer.stem(word),end=" ")

Before Stemming:
i love to work on cars in my possession it makes me very passionate and fufilled
After Stemming:
i love to work on car in my possess it make me veri passion and fufil 

### Lemmatization

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data used by the lemmatizer
nltk.download('omw-1.4')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

# lemmatize() treats every word as a noun unless a pos argument is given
for word in tokenized_corpus_nltk:
    print(lemmatizer.lemmatize(word),end=" ")

i love to work on car in my possession it make me very passionate and fufilled 

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
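
To see where stemming and lemmatization part ways, the recorded outputs of the two cells above can be compared token by token. The snippet is a small sketch that works on the printed strings copied from this notebook, so it needs neither NLTK nor its data files.

```python
# Outputs copied from the stemming and lemmatization cells above.
original = "i love to work on cars in my possession it makes me very passionate and fufilled".split()
stemmed = "i love to work on car in my possess it make me veri passion and fufil".split()
lemmatized = "i love to work on car in my possession it make me very passionate and fufilled".split()

# Collect the tokens on which the stemmer and the lemmatizer disagree.
disagreements = [
    (word, stem, lemma)
    for word, stem, lemma in zip(original, stemmed, lemmatized)
    if stem != lemma
]

for word, stem, lemma in disagreements:
    print(f"{word:<12} stem={stem:<10} lemma={lemma}")
```

The stemmer strips suffixes by rule and can produce strings that are not dictionary words (`veri`, `fufil`), whereas the lemmatizer leaves a word unchanged when it finds no shorter dictionary form for it.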


### POS Tagging

In [None]:
#POS tagging using spacy
corpus_original = "I love to work on cars in my possession. It makes me very passionate and fufilled! 2"
print("POS Tagging using spacy:")
doc = spacy_model(corpus_original)
# Print each token with its coarse-grained (Universal POS) tag
for token in doc:
    print(token,":", token.pos_)

# POS tagging using NLTK, which returns Penn Treebank tags
nltk.download('averaged_perceptron_tagger')
print("POS Tagging using NLTK:")
pprint(nltk.pos_tag(word_tokenize(corpus_original)))

POS Tagging using spacy:
I : PRON
love : VERB
to : PART
work : VERB
on : ADP
cars : NOUN
in : ADP
my : PRON
possession : NOUN
. : PUNCT
It : PRON
makes : VERB
me : PRON
very : ADV
passionate : ADJ
and : CCONJ
fufilled : VERB
! : PUNCT
2 : NUM
POS Tagging using NLTK:
[('I', 'PRP'),
 ('love', 'VBP'),
 ('to', 'TO'),
 ('work', 'VB'),
 ('on', 'IN'),
 ('cars', 'NNS'),
 ('in', 'IN'),
 ('my', 'PRP$'),
 ('possession', 'NN'),
 ('.', '.'),
 ('It', 'PRP'),
 ('makes', 'VBZ'),
 ('me', 'PRP'),
 ('very', 'RB'),
 ('passionate', 'JJ'),
 ('and', 'CC'),
 ('fufilled', 'VBD'),
 ('!', '.'),
 ('2', 'CD')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
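
The two taggers report different tag sets: spaCy's `token.pos_` gives coarse Universal POS tags, while `nltk.pos_tag` gives Penn Treebank tags (spaCy also exposes fine-grained tags through `token.tag_`). The sketch below converts the NLTK output recorded above with a small hand-written mapping and checks it against the spaCy output. The dictionary `PTB_TO_UPOS` is an assumption made for this example: it covers only the tags that appear in this output and simplifies some cases.

```python
# Partial, hand-written mapping covering only the Penn Treebank tags seen above.
# It is a simplification: "IN", for example, can correspond to SCONJ elsewhere.
PTB_TO_UPOS = {
    "PRP": "PRON", "PRP$": "PRON",
    "VB": "VERB", "VBP": "VERB", "VBZ": "VERB", "VBD": "VERB",
    "NN": "NOUN", "NNS": "NOUN",
    "TO": "PART", "IN": "ADP", "RB": "ADV", "JJ": "ADJ",
    "CC": "CCONJ", "CD": "NUM", ".": "PUNCT",
}

# NLTK output copied from the cell above.
nltk_tags = [
    ("I", "PRP"), ("love", "VBP"), ("to", "TO"), ("work", "VB"), ("on", "IN"),
    ("cars", "NNS"), ("in", "IN"), ("my", "PRP$"), ("possession", "NN"), (".", "."),
    ("It", "PRP"), ("makes", "VBZ"), ("me", "PRP"), ("very", "RB"),
    ("passionate", "JJ"), ("and", "CC"), ("fufilled", "VBD"), ("!", "."), ("2", "CD"),
]

# spaCy output copied from the cell above, in the same token order.
spacy_tags = [
    "PRON", "VERB", "PART", "VERB", "ADP", "NOUN", "ADP", "PRON", "NOUN", "PUNCT",
    "PRON", "VERB", "PRON", "ADV", "ADJ", "CCONJ", "VERB", "PUNCT", "NUM",
]

converted = [(word, PTB_TO_UPOS[tag]) for word, tag in nltk_tags]

# Tokens where the converted NLTK tag differs from the spaCy tag.
mismatches = [
    (word, upos, spacy_tag)
    for (word, upos), spacy_tag in zip(converted, spacy_tags)
    if upos != spacy_tag
]
print(mismatches)
```

For this sentence the list of mismatches is empty, so the two taggers agree once the tag sets are aligned.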


Various other libraries can also be used to perform these common pre-processing steps.