#### Sentence Segmentation

In [1]:
from nltk.tokenize import sent_tokenize, word_tokenize
mytext = "In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life. If we were asked to build such an application, think about how we would approach doing so at our organization. We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them. Since language processing is involved, we would also list all the forms of text processing needed at each step. This step-by-step processing of text is known as pipeline. It is the series of steps involved in building any NLP model. These steps are common in every NLP project, so it makes sense to study them in this chapter. Understanding some common procedures in any NLP pipeline will enable us to get started on any NLP problem encountered in the workplace. Laying out and developing a text-processing pipeline is seen as a starting point for any NLP application development process. In this chapter, we will learn about the various steps involved and how they play important roles in solving the NLP problem and we’ll see a few guidelines about when and how to use which step. In later chapters, we’ll discuss specific pipelines for various NLP tasks (e.g., Chapters 4–7)."
my_sentences = sent_tokenize(mytext)

In [2]:
my_sentences

['In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life.',
 'If we were asked to build such an application, think about how we would approach doing so at our organization.',
 'We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them.',
 'Since language processing is involved, we would also list all the forms of text processing needed at each step.',
 'This step-by-step processing of text is known as pipeline.',
 'It is the series of steps involved in building any NLP model.',
 'These steps are common in every NLP project, so it makes sense to study them in this chapter.',
 'Understanding some common procedures in any NLP pipeline will enable us to get started on any NLP problem encountered in the workplace.',
 'Laying out and developing a text-processing pipeline is seen as a starting point for any NLP application developmen
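
A plain split on sentence-ending punctuation is not enough for this step. The sketch below is an illustration only: `naive_sentence_split` and the sample string are assumptions made for this example and are not part of the notebook. It splits after `.`, `!` or `?` when whitespace follows, and it breaks on the abbreviation "Dr.", which is the kind of case the Punkt model behind `sent_tokenize` is designed to handle.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when whitespace follows (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Hypothetical sample text containing an abbreviation
sample = "Dr. Smith reviewed the pipeline. It has several steps."

# The abbreviation "Dr." is wrongly treated as the end of a sentence
print(naive_sentence_split(sample))
# ['Dr.', 'Smith reviewed the pipeline.', 'It has several steps.']
```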

#### Word Tokenization

In [3]:
for sentence in my_sentences:
    print(sentence)
    print(word_tokenize(sentence))

In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life.
['In', 'the', 'previous', 'chapter', ',', 'we', 'saw', 'examples', 'of', 'some', 'common', 'NLP', 'applications', 'that', 'we', 'might', 'encounter', 'in', 'everyday', 'life', '.']
If we were asked to build such an application, think about how we would approach doing so at our organization.
['If', 'we', 'were', 'asked', 'to', 'build', 'such', 'an', 'application', ',', 'think', 'about', 'how', 'we', 'would', 'approach', 'doing', 'so', 'at', 'our', 'organization', '.']
We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them.
['We', 'would', 'normally', 'walk', 'through', 'the', 'requirements', 'and', 'break', 'the', 'problem', 'down', 'into', 'several', 'sub-problems', ',', 'then', 'try', 'to', 'develop', 'a', 'step-by-step', 'procedure', 'to', 'solve', 'them', '.']
Since
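
To make the effect of `word_tokenize` more concrete, here is a rough sketch that compares whitespace splitting with a small regular expression on the third sentence printed above. `TOKEN_PATTERN` is an assumption made for this example. It is far simpler than NLTK's tokenizer and only happens to give the same tokens for this sentence.

```python
import re

sentence = ("We would normally walk through the requirements and break the "
            "problem down into several sub-problems, then try to develop a "
            "step-by-step procedure to solve them.")

# Whitespace splitting leaves punctuation glued to the neighbouring word
whitespace_tokens = sentence.split()

# Words (optionally hyphenated) or single punctuation characters
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sentence)

print(whitespace_tokens[-1])  # them.
print(regex_tokens[-2:])      # ['them', '.']
```

Hyphenated words such as `sub-problems` and `step-by-step` stay in one piece, as they do in the NLTK output, while the comma and the final period become separate tokens.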

#### Stop Words

In [4]:
corpus = "Need to finalize the demo corpus which will be used for this notebook & should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!"

In [5]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Fetch the stop word list and the Punkt tokenizer data (skipped if already present)
nltk.download("stopwords")
nltk.download("punkt")

stop_words_nltk = set(stopwords.words("english"))

# Tokenize the corpus, then drop every token found in the NLTK stop word list
tokenized_corpus_nltk = word_tokenize(corpus)
print("\nNLTK\nTokenized corpus:", tokenized_corpus_nltk)
tokenized_corpus_without_stopwords = [i for i in tokenized_corpus_nltk if i not in stop_words_nltk]
print("\nTokenized corpus without stopwords: ", tokenized_corpus_without_stopwords)


NLTK
Tokenized corpus: ['Need', 'to', 'finalize', 'the', 'demo', 'corpus', 'which', 'will', 'be', 'used', 'for', 'this', 'notebook', '&', 'should', 'be', 'done', 'soon', '!', '!', '.', 'It', 'should', 'be', 'done', 'by', 'the', 'ending', 'of', 'this', 'month', '.', 'But', 'will', 'it', '?', 'This', 'notebook', 'has', 'been', 'run', '4', 'times', '!', '!']

Tokenized corpus without stopwords:  ['Need', 'finalize', 'demo', 'corpus', 'used', 'notebook', '&', 'done', 'soon', '!', '!', '.', 'It', 'done', 'ending', 'month', '.', 'But', '?', 'This', 'notebook', 'run', '4', 'times', '!', '!']


In [None]:
%pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

In [6]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
spacy_model = spacy.load("en_core_web_sm")

In [7]:
stopwords_spacy = spacy_model.Defaults.stop_words
print("\nSpacy:")
# Reuse NLTK's tokenizer so that only the stop word list differs from the NLTK run
tokenized_corpus_spacy = word_tokenize(corpus)
print("\nTokenized Corpus:", tokenized_corpus_spacy)
tokens_without_sw = [word for word in tokenized_corpus_spacy if word not in stopwords_spacy]
print("\nTokenized corpus without stopwords", tokens_without_sw)


Spacy:

Tokenized Corpus: ['Need', 'to', 'finalize', 'the', 'demo', 'corpus', 'which', 'will', 'be', 'used', 'for', 'this', 'notebook', '&', 'should', 'be', 'done', 'soon', '!', '!', '.', 'It', 'should', 'be', 'done', 'by', 'the', 'ending', 'of', 'this', 'month', '.', 'But', 'will', 'it', '?', 'This', 'notebook', 'has', 'been', 'run', '4', 'times', '!', '!']

Tokenized corpus without stopwords ['Need', 'finalize', 'demo', 'corpus', 'notebook', '&', 'soon', '!', '!', '.', 'It', 'ending', 'month', '.', 'But', '?', 'This', 'notebook', 'run', '4', 'times', '!', '!']


In [8]:
# Tokens kept by the NLTK filter but removed by the spaCy filter
print("difference between NLTK and spaCy output:\n",
      set(tokenized_corpus_without_stopwords) - set(tokens_without_sw))

difference between NLTK and spaCy output:
 {'used', 'done'}
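
In both filtered lists above, 'It', 'But' and 'This' survive. The stop word lists hold lowercase entries and the membership test is case-sensitive. The sketch below shows one way to deal with this: lowercase each token before the lookup. `STOP_WORDS_DEMO` is a tiny stand-in set invented for this example, not the NLTK or spaCy list.

```python
# Tiny illustrative stop list (an assumption; the real lists are much longer)
STOP_WORDS_DEMO = {"it", "but", "this", "will", "be", "the", "of", "by", "should"}

tokens = ["It", "should", "be", "done", "by", "the", "ending", "of",
          "this", "month", ".", "But", "will", "it", "?"]

# Case-sensitive lookup: capitalized stop words slip through
case_sensitive = [t for t in tokens if t not in STOP_WORDS_DEMO]

# Lowercasing only for the lookup keeps the original casing of surviving tokens
case_insensitive = [t for t in tokens if t.lower() not in STOP_WORDS_DEMO]

print(case_sensitive)    # ['It', 'done', 'ending', 'month', '.', 'But', '?']
print(case_insensitive)  # ['done', 'ending', 'month', '.', '?']
```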


#### Stemming

In [9]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()

print("Before Stemming:")
print(corpus)
print("After Stemming:")
# Stem each token; the Porter stemmer also lowercases its input
for word in tokenized_corpus_nltk:
    print(stemmer.stem(word), end=" ")

Before Stemming:
Need to finalize the demo corpus which will be used for this notebook & should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!
After Stemming:
need to final the demo corpu which will be use for thi notebook & should be done soon ! ! . it should be done by the end of thi month . but will it ? thi notebook ha been run 4 time ! ! 
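
The forms 'corpu', 'thi' and 'ha' in the output above come from suffix rules that work without a dictionary. The following is a simplified sketch modeled on the plural-stripping rule at the start of the Porter algorithm. `strip_plural_suffix` is a name made up for this example. It is not NLTK's `PorterStemmer`, which applies many further steps.

```python
def strip_plural_suffix(word):
    """Plural-stripping rules modeled on Porter's first step (simplified)."""
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

# The rule cannot tell a plural "s" from an "s" that belongs to the word
for w in ["corpus", "This", "has", "times", "notebook"]:
    print(w, "->", strip_plural_suffix(w))
```

Because the rule only looks at the ending, "corpus", "This" and "has" lose their final "s" just as "times" does.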

#### Lemmatization

In [10]:
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

True

In [11]:
lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless a pos argument is given
for word in tokenized_corpus_nltk:
    print(lemmatizer.lemmatize(word), end=" ")

Need to finalize the demo corpus which will be used for this notebook & should be done soon ! ! . It should be done by the ending of this month . But will it ? This notebook ha been run 4 time ! ! 
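
In the output above, "has" becomes "ha" and verbs such as "used" and "done" are left unchanged, because `lemmatize` assumes a noun by default. A common workaround is to pass a part-of-speech hint. The helper below is a minimal sketch of that mapping step: `penn_to_wordnet` and the hand-written tag pairs are assumptions made for this example, not notebook output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun and everything else

# Hand-written (word, tag) pairs for illustration
tagged = [("has", "VBZ"), ("been", "VBN"), ("soon", "RB"), ("times", "NNS")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

The returned letter can be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`. With the WordNet data installed, verbs should then be reduced to their base form instead of being treated as nouns.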

In [12]:
sp = spacy.load("en_core_web_sm")
doc = sp("better")
# Print each token next to the lemma spaCy assigns to it
for token in doc:
    print(token.text, token.lemma_)

better well


#### POS Tagging

In [14]:
corpus_original = "Need to finalize the demo corpus which will be used for this notebook and it should be done soon !!. It should be done by the ending of this month. But will it? This notebook has been run 4 times !!"

In [16]:
print("POS Tagging using spacy:")
doc = spacy_model(corpus_original)
# Print each token with its coarse-grained (Universal POS) tag
for token in doc:
    print(token, ":", token.pos_)

# POS tagging using NLTK (fine-grained Penn Treebank tags)
# Newer NLTK releases may expect the resource name "averaged_perceptron_tagger_eng" instead
nltk.download("averaged_perceptron_tagger")
print("POS Tagging using NLTK")
print(nltk.pos_tag(word_tokenize(corpus_original)))

POS Tagging using spacy:
Need : VERB
to : PART
finalize : VERB
the : DET
demo : NOUN
corpus : NOUN
which : PRON
will : AUX
be : AUX
used : VERB
for : ADP
this : DET
notebook : NOUN
and : CCONJ
it : PRON
should : AUX
be : AUX
done : VERB
soon : ADV
! : PUNCT
! : PUNCT
. : PUNCT
It : PRON
should : AUX
be : AUX
done : VERB
by : ADP
the : DET
ending : NOUN
of : ADP
this : DET
month : NOUN
. : PUNCT
But : CCONJ
will : AUX
it : PRON
? : PUNCT
This : DET
notebook : NOUN
has : AUX
been : AUX
run : VERB
4 : NUM
times : NOUN
! : PUNCT
! : PUNCT
POS Tagging using NLTK
[('Need', 'NN'), ('to', 'TO'), ('finalize', 'VB'), ('the', 'DT'), ('demo', 'NN'), ('corpus', 'NN'), ('which', 'WDT'), ('will', 'MD'), ('be', 'VB'), ('used', 'VBN'), ('for', 'IN'), ('this', 'DT'), ('notebook', 'NN'), ('and', 'CC'), ('it', 'PRP'), ('should', 'MD'), ('be', 'VB'), ('done', 'VBN'), ('soon', 'RB'), ('!', '.'), ('!', '.'), ('.', '.'), ('It', 'PRP'), ('should', 'MD'), ('be', 'VB'), ('done', 'VBN'), ('by', 'IN'), ('the', 'DT
