# Pipeline Components

This Jupyter notebook shows some example usages of
NLP components in different frameworks, in particular
- [NLTK](https://www.nltk.org/)
- [spaCy](https://spacy.io/)
- [gensim](https://radimrehurek.com/gensim/)
- ([scikit-learn](https://scikit-learn.org/stable/))

adapted from: 
- https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk
- https://www.geeksforgeeks.org/python-nlp-analysis-of-restaurant-reviews/?ref=lbp
- https://www.nltk.org/book/ch07.html
- https://github.com/DistrictDataLabs/intro-to-nltk/blob/master/NLTK.ipynb
- https://buildmedia.readthedocs.org/media/pdf/nltk/latest/nltk.pdf
- https://spacy.io/usage/spacy-101 
- https://www.analyticsvidhya.com/blog/2020/03/spacy-tutorial-learn-natural-language-processing/
- https://www.machinelearningplus.com/spacy-tutorial-nlp/ 
- https://realpython.com/natural-language-processing-spacy-python/ 
- https://stackabuse.com/python-for-nlp-working-with-the-gensim-library-part-1/ 


For package installation guidelines please visit the respective websites.

## NLTK

NLTK stands for the Natural Language Toolkit and is written by two eminent computational linguists, Steven Bird (Senior Research Associate of the LDC and professor at the University of Melbourne) and Ewan Klein (Professor of Linguistics at Edinburgh University). It contains text processing libraries for tokenization, parsing, classification, stemming, tagging and semantic reasoning. 

This section covers the following topics:
- sentence tokenizer
- word tokenizer 
- part-of-speech tags 
- named entity recognition 
- stopwords 
- stemming and lemmatization 

### Sentence tokenizer
`sent_tokenize`: a Punkt sentence tokenizer (Return a sentence-tokenized copy of text)

This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used. The NLTK data package includes a pre-trained Punkt tokenizer for English.

However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable for your domain; in that case, use `PunktSentenceTokenizer(text)` to learn parameters from the given text.
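
As a rough sketch of that advice, the snippet below trains a Punkt model directly on a small piece of domain text and then uses it for splitting. The sample text and the variable names are made up for illustration, and a real training corpus should be far larger than this.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical domain text; in practice use a large plaintext corpus.
train_text = (
    "The patient was seen by Dr. Smith at 9 a.m. on Monday. "
    "Dr. Smith ordered an MRI. The MRI was reviewed by Dr. Jones. "
    "Dr. Jones found no abnormalities."
)

# Passing text to the constructor trains the model on it (unsupervised).
custom_tokenizer = PunktSentenceTokenizer(train_text)

custom_sentences = custom_tokenizer.tokenize(
    "Dr. Smith called at noon. The results were normal."
)
for sent in custom_sentences:
    print(sent)
```

How well abbreviations such as "Dr." are handled depends on how often they occur in the training text, so inspect the result on your own data before relying on it.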



In [1]:
import nltk
from nltk.tokenize import sent_tokenize

print('Your NLTK version is {}.\n'.format(nltk.__version__))

text_0 = "Musk was born to a Canadian mother and South African " \
           "father and raised in Pretoria, South Africa. He briefly " \
           "attended the University of Pretoria before moving to Canada " \
           "when he was 17 to attend Queen's University."

print('Sentence splitter:')
for sent in sent_tokenize(text_0):
    print(sent)



Your NLTK version is 3.5.

Sentence splitter:
Musk was born to a Canadian mother and South African father and raised in Pretoria, South Africa.
He briefly attended the University of Pretoria before moving to Canada when he was 17 to attend Queen's University.


### Word tokenizer

`word_tokenize`: a Treebank tokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the method that is invoked by `word_tokenize()`. It assumes that the text has already been segmented into sentences, e.g., using `sent_tokenize()`. Here we pass in the entire paragraph, which works because `word_tokenize()` by default runs `sent_tokenize()` on its input first.
This tokenizer performs the following steps:

    - split standard contractions, e.g. ``don't`` -> ``do n't`` and ``they'll`` -> ``they 'll``
    - treat most punctuation characters as separate tokens
    - split off commas and single quotes, when followed by whitespace
    - separate periods that appear at the end of line
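
The steps above can be tried out on the underlying `TreebankWordTokenizer` class directly, which needs no downloaded data. The two inputs are the examples from the class's own documentation. The second one illustrates why the tokenizer expects pre-split sentences.

```python
from nltk.tokenize import TreebankWordTokenizer

treebank = TreebankWordTokenizer()

# A contraction is split and the sentence-final period is separated.
single_sentence = "They'll save and invest more."
single_tokens = treebank.tokenize(single_sentence)
print(single_tokens)
# ['They', "'ll", 'save', 'and', 'invest', 'more', '.']

# Without prior sentence splitting, only the very last period is separated;
# 'York.' and 'them.' keep their periods attached.
paragraph = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks."
paragraph_tokens = treebank.tokenize(paragraph)
print(paragraph_tokens)
```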

In [2]:
print('\nTokens:')
tokens = nltk.word_tokenize(text_0)
for t in tokens:
    print(t)



Tokens:
Musk
was
born
to
a
Canadian
mother
and
South
African
father
and
raised
in
Pretoria
,
South
Africa
.
He
briefly
attended
the
University
of
Pretoria
before
moving
to
Canada
when
he
was
17
to
attend
Queen
's
University
.


### Part-of-Speech tags 
`pos_tag`: NLTK's default tagger, an averaged perceptron tagger that outputs Penn Treebank tags (older NLTK releases used a maximum entropy tagger)

A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples `(token, tag)`. For example, the tagged token `('born', 'VBN')` combines the word 'born' with a verb part-of-speech tag ('VBN').


There are several other taggers, notably the `BrillTagger`, along with the `BrillTaggerTrainer` for training your own tagger or tagset.
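
To make the tuple representation concrete, here is a small sketch using NLTK's `str2tuple` and `tuple2str` helpers, which convert between the conventional `word/TAG` notation and tagged-token tuples. The example words are hand-tagged for illustration and are not tagger output.

```python
from nltk.tag import str2tuple, tuple2str

# Parse the "word/TAG" notation into a (token, tag) tuple.
tagged_token = str2tuple("born/VBN")
print(tagged_token)             # ('born', 'VBN')

# Convert a tagged token back to its string form.
print(tuple2str(tagged_token))  # born/VBN

# A tagged sentence is simply a list of such tuples.
tagged_sentence = [str2tuple(t) for t in "Musk/NNP was/VBD born/VBN".split()]
words_only = [word for word, tag in tagged_sentence]
verbs_only = [word for word, tag in tagged_sentence if tag.startswith("VB")]
print(words_only, verbs_only)
```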

In [3]:
# in case of an error, uncomment and run the following line to install the tagger
#nltk.download('averaged_perceptron_tagger')

print('\n\nPart-of-Speech (for first five tokens):')
tagged = nltk.pos_tag(tokens)
for t in tagged[0:5]:
    print(t)





Part-of-Speech (for first five tokens):
('Musk', 'NNP')
('was', 'VBD')
('born', 'VBN')
('to', 'TO')
('a', 'DT')


### Named entity recognition
Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on.
NLTK provides a classifier that has already been trained to recognize named entities, accessed with the function `nltk.ne_chunk()`. If we set the parameter `binary=True`, then named entities are just tagged as NE; otherwise, the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.
You can also retrain the chunker if you'd like; the code is readable enough to extend, e.g., with a gazetteer.
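
`nltk.ne_chunk()` returns an `nltk.Tree` in which each named entity is a labelled subtree and every other token stays a plain `(token, tag)` tuple. The sketch below uses a hand-built tree as a stand-in for real classifier output (so it runs without downloaded models) to show how entities can be pulled out of such a structure. The helper name `extract_entities` is an assumption for illustration.

```python
from nltk.tree import Tree

# Hand-built stand-in for the output of nltk.ne_chunk(); not real classifier output.
example_tree = Tree("S", [
    Tree("PERSON", [("Musk", "NNP")]),
    ("was", "VBD"),
    ("raised", "VBN"),
    ("in", "IN"),
    Tree("GPE", [("South", "NNP"), ("Africa", "NNP")]),
    (".", "."),
])

def extract_entities(tree):
    """Return (label, text) pairs for every labelled subtree directly under the root."""
    entities = []
    for chunk in tree:
        if isinstance(chunk, Tree):
            entities.append((chunk.label(), " ".join(tok for tok, tag in chunk)))
    return entities

found = extract_entities(example_tree)
print(found)  # [('PERSON', 'Musk'), ('GPE', 'South Africa')]
```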

In [4]:
# in case of an error, uncomment and run the following lines to install the needed components
#nltk.download('words')
#nltk.download('maxent_ne_chunker')

print('\n\nNamed Entities (based on noun phrase chunking):')
for sent in nltk.sent_tokenize(text_0):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))





Named Entities (based on noun phrase chunking):
PERSON Musk
GPE Canadian
GPE South African
GPE Pretoria
GPE South Africa
ORGANIZATION University
GPE Pretoria
GPE Canada
PERSON Queen
ORGANIZATION University


### Stopwords 
A common preprocessing step is the removal of stopwords from the corpus. NLTK offers lists of stopwords in different languages. These lists vary across libraries, and you might want to add or remove entries based on your application domain.
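
A minimal sketch of such a domain adjustment is shown below. The short `base_stopwords` set is a stand-in for the list returned by `nltk.corpus.stopwords.words('english')`, and the added and removed words are hypothetical choices, not a recommendation.

```python
# Small stand-in list; in the notebook this would be
# nltk.corpus.stopwords.words('english').
base_stopwords = {"a", "and", "in", "of", "the", "to", "was", "he", "not"}

# Hypothetical domain adjustments: keep negations, drop a domain-specific filler word.
custom_stopwords = (base_stopwords - {"not"}) | {"university"}

def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

sample_tokens = "He was not raised in the University of Pretoria".split()
filtered = remove_stopwords(sample_tokens, custom_stopwords)
print(filtered)  # ['not', 'raised', 'Pretoria']
```

Using a set rather than a list keeps each membership test constant-time, which matters once the corpus grows.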

In [5]:
print('\nStopwords:')
# depending on how you installed and initialized NLTK
# you might want to include the command
# nltk.download('stopwords')
nltk_stopwords = nltk.corpus.stopwords.words('english')

print('NLTK includes {} stopwords.'.format(len(nltk_stopwords)))
print(nltk_stopwords[0:10], '...')


Stopwords:
NLTK includes 179 stopwords.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"] ...


### Stemming and lemmatization 
Words occur in many forms, e.g., plurals and verb tenses, and many applications benefit from normalizing these forms to a canonical one for further exploration. In English (and many other languages), morphology indicates gender, tense, quantity, etc., but these subtleties are not always necessary.

Stemming reduces morphological variants of a word to a common root/base form (the stem), which need not be a valid word itself. Stemming programs are commonly referred to as stemming algorithms or stemmers. Ideally, a stemmer maps “chocolates” and “chocolatey” to a stem such as “chocolate”, and “retrieval”, “retrieved”, “retrieves” to “retrieve”; in practice the stems are often truncated strings, as the output table below shows.

NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. Here we look at two stemming techniques, Porter and Lancaster; the major difference is that the Lancaster stemmer is significantly more aggressive than the Porter stemmer.
See, e.g., https://www.nltk.org/api/nltk.stem.html

Lemmatization reduces words to their base form, the lemma, which is a linguistically valid word. It does so with the help of a vocabulary and morphological analysis, and is therefore usually more sophisticated than stemming, which works on an individual word without knowledge of the context. For example, the word "better" has "good" as its lemma.
The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. By default it treats every word as a noun, which is why "was" becomes "wa" in the output below; pass a part of speech (e.g., `pos='v'`) to obtain the verb lemma.
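
The following sketch compares the two stemmers on the word family mentioned above and then shows the effect of the lemmatizer's optional `pos` argument. The `*_demo` variable names are chosen here only to avoid clashing with the notebook's own objects. The lemmatizer part needs the WordNet data and is skipped with a message if that data is missing.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, WordNetLemmatizer

porter_demo = PorterStemmer()
lancaster_demo = LancasterStemmer()

# Compare both stemmers on related word forms.
stem_rows = [
    (word, porter_demo.stem(word), lancaster_demo.stem(word))
    for word in ["retrieval", "retrieved", "retrieves", "mother"]
]
for row in stem_rows:
    print("{0:12}{1:12}{2:12}".format(*row))

# The WordNet lemmatizer treats every word as a noun unless told otherwise.
try:
    wordnet_demo = WordNetLemmatizer()
    print(wordnet_demo.lemmatize("was"))              # noun reading (default)
    print(wordnet_demo.lemmatize("was", pos="v"))     # verb reading
    print(wordnet_demo.lemmatize("better", pos="a"))  # adjective reading
except LookupError:
    print("WordNet data missing; run nltk.download('wordnet') first.")
```

In a real pipeline the `pos` value would typically be derived from the output of `nltk.pos_tag` rather than hard-coded.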


In [6]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer

porter = PorterStemmer()
lancaster = LancasterStemmer()
# in case of an error, uncomment the following line
#nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

print("{0:20}{1:20}{2:20}{3:20}".format("Word","Porter Stemmer","lancaster Stemmer", "WordNet lemmatizer"))
for word in nltk.word_tokenize(text_0):
    print("{0:20}{1:20}{2:20}{3:20}".format(word,porter.stem(word),lancaster.stem(word),lemmatizer.lemmatize(word)))
    




Word                Porter Stemmer      lancaster Stemmer   WordNet lemmatizer  
Musk                musk                musk                Musk                
was                 wa                  was                 wa                  
born                born                born                born                
to                  to                  to                  to                  
a                   a                   a                   a                   
Canadian            canadian            canad               Canadian            
mother              mother              moth                mother              
and                 and                 and                 and                 
South               south               sou                 South               
African             african             afr                 African             
father              father              fath                father              
and                 and     

## spaCy
spaCy is a free, open-source library for Natural Language Processing developed by Matthew Honnibal and Ines Montani. Points in its favor are its feature set, its ease of use, and the fact that it is actively maintained.

The first step when working with spaCy is to pass a text string to an `nlp` object. This object is essentially a pipeline of several text pre-processing operations that the input text string passes through. When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several different steps; this is also referred to as the processing pipeline.

[pipline]: https://d33wubrfki0l68.cloudfront.net/16b2ccafeefd6d547171afa23f9ac62f159e353d/48b91/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg ""
![pipline][pipline]

As you can see in the figure above, the NLP pipeline has multiple components, such as tokenizer, tagger, parser, ner, etc. So, the input text string has to go through all these components before we can work on it.
The code below lists the active pipeline components:


In [7]:
import spacy

print('Your spaCy version is {}.\n'.format(spacy.__version__))

# load the English model and assign it to the object nlp
# to download the model : python -m spacy download en_core_web_sm 
nlp = spacy.load("en_core_web_sm")

nlp.pipe_names


Your spaCy version is 2.3.2.



['tagger', 'parser', 'ner']

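If you do not need the entire pipeline, you can disable parts of it with `nlp.disable_pipes('tagger', 'parser')`, which disables the `tagger` and `parser` components (this is the spaCy 2.x API used in this notebook; spaCy 3.x offers `nlp.select_pipes(disable=[...])` for the same purpose).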

### Sentence tokenizer
spaCy's pretrained neural models provide sentence segmentation via the syntactic dependency parser. spaCy also provides a rule-based `Sentencizer`, which splits on punctuation and is therefore likely to fail on more complex sentences. The component is also available via the string name "sentencizer".
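
As a small sketch of the rule-based alternative, the snippet below adds the sentencizer to a blank English pipeline, so no pretrained model is required. The version check is an assumption made so that the same code runs on both spaCy 2.x (as used in this notebook) and 3.x, whose `add_pipe` calls differ.

```python
import spacy

# A blank English pipeline has a tokenizer but no statistical components.
blank_nlp = spacy.blank("en")

# The add_pipe call changed between spaCy 2.x and 3.x.
if spacy.__version__.startswith("2."):
    blank_nlp.add_pipe(blank_nlp.create_pipe("sentencizer"))
else:
    blank_nlp.add_pipe("sentencizer")

rule_doc = blank_nlp("This is a sentence. This is another one.")
rule_sentences = [sent.text for sent in rule_doc.sents]
print(rule_sentences)  # ['This is a sentence.', 'This is another one.']
```

Because no parser runs, this is much faster than the model-based segmentation, at the cost of the robustness issues mentioned above.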

In [8]:
text_1 = """The attorney general has previously been supportive of Trump's unfounded claims about voter fraud, and this latest move comes during an incredibly tense time and could inflame an already fraught transition. President-elect Joe Biden is beginning his transition into office while Trump and his administration refuse to recognize the former vice president's victory, making baseless claims about voter fraud and illegal votes that threaten to undermine the bedrock of American government."""
doc_1 = nlp(text_1)

print('\nThe sentences are:')
for sent in doc_1.sents:
    print(sent)


The sentences are:
The attorney general has previously been supportive of Trump's unfounded claims about voter fraud, and this latest move comes during an incredibly tense time and could inflame an already fraught transition.
President-elect Joe Biden is beginning his transition into office while Trump and his administration refuse to recognize the former vice president's victory, making baseless claims about voter fraud and illegal votes that threaten to undermine the bedrock of American government.


### Word tokenizer
In a first step, spaCy splits the text on whitespace and then applies rules such as tokenizer exceptions, prefixes, and suffixes to each substring. An example is shown in the figure below:

[word]: https://spacy.io/tokenization-57e618bd79d933c4ccd308b5739062d6.svg ""
![word][word]
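
The sentence from the figure can be run through the tokenizer on its own. This is a minimal sketch that uses a blank English pipeline (tokenizer only), which is an assumption made here so that no model download is needed. The contraction should be split by an exception rule and the final exclamation mark by a suffix rule.

```python
import spacy

# A blank pipeline contains only the rule-based tokenizer.
tokenizer_nlp = spacy.blank("en")

example = "Let's go to N.Y.!"
rule_tokens = [token.text for token in tokenizer_nlp(example)]
print(rule_tokens)

# Tokenization is non-destructive: the original string can be rebuilt
# from the tokens and their trailing whitespace.
rebuilt = "".join(token.text_with_ws for token in tokenizer_nlp(example))
print(rebuilt == example)
```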

In [9]:
text_2 = 'The Republican president is being challenged by Democratic Party nominee Joe Biden'
print('\nTokens in the sentence:')
for token in nlp(text_2):
    print(token.text)


Tokens in the sentence:
The
Republican
president
is
being
challenged
by
Democratic
Party
nominee
Joe
Biden


### Part-of-Speech tags 
After tokenization, spaCy can parse and tag a given `Doc`. In spaCy, POS tags are available as attributes on the `Token` object.

The code below accesses two attributes of the `Token` class:

- `tag_` lists the fine-grained part of speech.
- `pos_` lists the coarse-grained part of speech.

`spacy.explain` gives descriptive details about a particular POS tag. spaCy provides a complete tag list along with an explanation for each tag.



In [10]:
print('\nPart-of-Speech:')
for token in nlp(text_2):
    print(token.text,  token.pos_, spacy.explain(token.tag_))


Part-of-Speech:
The DET determiner
Republican ADJ adjective
president NOUN noun, singular or mass
is AUX verb, 3rd person singular present
being AUX verb, gerund or present participle
challenged VERB verb, past participle
by ADP conjunction, subordinating or preposition
Democratic PROPN noun, proper singular
Party PROPN noun, proper singular
nominee NOUN noun, singular or mass
Joe PROPN noun, proper singular
Biden PROPN noun, proper singular


### Named entity recognition
spaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens. The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products. You can add arbitrary classes to the entity recognition system, and update the model with new examples.

`Doc` objects have the property `ents`, which you can use to extract named entities.
In the code below, `entity` is a `Span` object with various attributes:
- `text` gives the Unicode text representation of the entity.
- `start_char` denotes the character offset for the start of the entity.
- `end_char` denotes the character offset for the end of the entity.
- `label_` gives the label of the entity.

`spacy.explain` gives descriptive details about an entity label.
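
To show all four attributes together, here is a sketch that sets a single entity by hand on a blank pipeline. The entity is therefore an assumption for illustration, not a model prediction, which keeps the example independent of any downloaded model.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, so the entity below is set by hand
# rather than predicted by a model.
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("Joe Biden has been in US politics since the 1970s")

# Tokens 0-1 cover "Joe Biden"; the end index is exclusive.
demo_doc.ents = [Span(demo_doc, 0, 2, label="PERSON")]

entity_rows = [
    (ent.text, ent.start_char, ent.end_char, ent.label_)
    for ent in demo_doc.ents
]
print(entity_rows)  # [('Joe Biden', 0, 9, 'PERSON')]
print(spacy.explain("PERSON"))
```

The character offsets index into the original string, so `demo_doc.text[0:9]` recovers the entity text.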

In [11]:
print('\nNamed Entities:')

text_3 = "The Republican president is being challenged by Democratic Party nominee Joe Biden, who is best known as Barack Obama’s vice-president but has been in US politics since the 1970s"
doc_3 = nlp(text_3)

for entity in doc_3.ents:
    print(entity.text, '-->', entity.label_)


Named Entities:
Republican --> NORP
Democratic Party --> ORG
Joe Biden --> PERSON
Barack --> GPE
US --> GPE
the 1970s --> DATE


### Stopwords 
spaCy has its own list of stopwords, which differs from NLTK's.

In [12]:
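# import the list explicitly instead of relying on spacy.lang.en being loaded already
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = STOP_WORDS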

print('SpaCy contains {} stopwords'.format(len(stopwords)))

SpaCy contains 326 stopwords


### Stemming and lemmatization 
Perhaps surprisingly, spaCy does not include a stemmer; it relies on lemmatization only. For stemming, use NLTK instead.
The lemmatized form of a token is available through the `lemma_` attribute of the `Token` class.

In [13]:
for token in doc_1:

    if not token.is_punct:
        print(token, '-->', token.lemma_)

The --> the
attorney --> attorney
general --> general
has --> have
previously --> previously
been --> be
supportive --> supportive
of --> of
Trump --> Trump
's --> 's
unfounded --> unfounded
claims --> claim
about --> about
voter --> voter
fraud --> fraud
and --> and
this --> this
latest --> late
move --> move
comes --> come
during --> during
an --> an
incredibly --> incredibly
tense --> tense
time --> time
and --> and
could --> could
inflame --> inflame
an --> an
already --> already
fraught --> fraught
transition --> transition
President --> President
elect --> elect
Joe --> Joe
Biden --> Biden
is --> be
beginning --> begin
his --> -PRON-
transition --> transition
into --> into
office --> office
while --> while
Trump --> Trump
and --> and
his --> -PRON-
administration --> administration
refuse --> refuse
to --> to
recognize --> recognize
the --> the
former --> former
vice --> vice
president --> president
's --> 's
victory --> victory
making --> make
baseless --> baseless
claims --> cl

## Gensim


Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. It is a great package for processing texts, working with word vector models (such as Word2Vec, FastText etc) and for building topic models.
Another significant advantage of gensim is that it lets you process large text files without loading the entire file into memory.
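
The usual way to achieve this is to hand gensim an iterable that reads the corpus lazily instead of a list held in memory. The sketch below is one possible streaming class. The class name, the whitespace tokenization, and the temporary file are assumptions for illustration, not part of gensim itself.

```python
import os
import tempfile

class StreamedCorpus:
    """Yield one tokenized document per line without loading the whole file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                # Whitespace splitting is a placeholder for a real tokenizer.
                yield line.lower().split()

# Tiny temporary file standing in for a large corpus on disk.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Musk was born in Pretoria\nHe moved to Canada\n")
    corpus_path = tmp.name

streamed_docs = list(StreamedCorpus(corpus_path))
print(streamed_docs)
os.remove(corpus_path)
```

Because `__iter__` reopens the file on every pass, the object can be iterated repeatedly, which training code that makes several passes over the data relies on.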

In [14]:
import gensim
# gensim.summarization was removed in gensim 4.0; this import requires gensim < 4
from gensim.summarization.textcleaner import split_sentences
from gensim.utils import tokenize

print('Your Gensim version is {}.\n'.format(gensim.__version__))


Your Gensim version is 3.8.3.



### Sentence tokenizer
`split_sentences` is part of the summarization code and splits text into sentences.

In [15]:
print('\nSentences:')
for sent in split_sentences(text_0):
    print(sent)



Sentences:
Musk was born to a Canadian mother and South African father and raised in Pretoria, South Africa.
He briefly attended the University of Pretoria before moving to Canada when he was 17 to attend Queen's University.


### Word tokenizer
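`tokenize` yields tokens as unicode strings, where a token is a maximal run of alphabetic characters; digits and punctuation are dropped, which is why `17` is missing from the output below. It can optionally remove accent marks (`deacc=True`) and lowercase the text by setting any one of the parameters `lowercase`, `to_lower`, or `lower` to True.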

In [16]:
print('\nTokens:')
for token in tokenize(text_0):
    print(token)


Tokens:
Musk
was
born
to
a
Canadian
mother
and
South
African
father
and
raised
in
Pretoria
South
Africa
He
briefly
attended
the
University
of
Pretoria
before
moving
to
Canada
when
he
was
to
attend
Queen
s
University
