## **Basic Implementation of spaCy & NLTK**

spaCy -> Newer library; ships with pretrained pipelines, loaded here with spacy.load("en_core_web_sm")

NLTK  -> Older library; highly customisable but harder to work with

In [2]:
import nltk 
import spacy

nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp("I like black cats. Dr. Ramesh likes brown dogs. There are a lot of people that I know of that have multiple pets. For example Koompilala, Dhunnibaba etc. because Having pets is cute.")
for sentence in doc.sents:
    print(sentence)
    
# for sentence in doc.sents:          
#     for word in sentence:                   #This prints all words one by one.
#         print(word)  

I like black cats.
Dr. Ramesh likes brown dogs.
There are a lot of people that I know of that have multiple pets.
For example Koompilala, Dhunnibaba etc. because Having pets is cute.


In [4]:
from nltk.tokenize import sent_tokenize

sent_tokenize("I like black cats. Dr. Ramesh likes brown dogs. There are a lot of people that I know of that have multiple pets. For example Koompilala, Dhunnibaba etc. because Having pets is cute.")

['I like black cats.',
 'Dr. Ramesh likes brown dogs.',
 'There are a lot of people that I know of that have multiple pets.',
 'For example Koompilala, Dhunnibaba etc.',
 'because Having pets is cute.']
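
The `Dr.` and `etc.` cases above are what make sentence splitting hard. As a rough illustration, here is a naive regex splitter (a hypothetical `naive_split` helper, not what spaCy or NLTK do internally) that splits after every sentence-ending punctuation mark and therefore breaks on the abbreviation:

```python
import re

text = "I like black cats. Dr. Ramesh likes brown dogs."

def naive_split(text):
    # Split wherever '.', '!' or '?' is followed by whitespace
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_split(text))
# ['I like black cats.', 'Dr.', 'Ramesh likes brown dogs.']
```

Both libraries avoid this particular mistake in the outputs above, although they disagree on whether `etc.` ends a sentence.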

## **Tokenization in spaCy**

In [5]:
# Word tokenization with a blank English pipeline (tokenizer only) ->
import spacy
nlp = spacy.blank('en')

doc = nlp("Let's go to N.Y!")

for token in doc:
    print(token) 

Let
's
go
to
N.Y
!


In [6]:
span = doc[2:4]
type(nlp) , type(doc) , type(token) , type(span)

(spacy.lang.en.English,
 spacy.tokens.doc.Doc,
 spacy.tokens.token.Token,
 spacy.tokens.span.Span)
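
A `Span` is a slice of the `Doc`, and each `Token` knows its own position in the `Doc`. A small sketch on the same sentence, assuming the tokenization shown in the In [5] output:

```python
import spacy

nlp = spacy.blank("en")            # tokenizer only, no trained components
doc = nlp("Let's go to N.Y!")

span = doc[2:4]                    # tokens at index 2 and 3
span_indices = [token.i for token in span]

print(span.text)                   # go to
print(span.start, span.end)        # 2 4
print(span_indices)                # [2, 3]
```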

In [7]:
doc = nlp("I have 2 two ₹ coins in my pocket.")

doc[2].like_num , doc[3].like_num , doc[4].is_currency, doc[5].is_digit

(True, True, True, False)

In [8]:
student_details = """
John Smith, 19, john.smith@email.com, 2005-03-15
Emily Johnson, 21, emily.j@university.edu, 2003-07-22
Michael Chen, 20, mchen98@student.org, 2004-11-30
Sophia Rodriguez, 18, srodriguez@school.net, 2006-01-08
Alexander Kim, 22, akim2002@college.com, 2002-09-04
"""
doc = nlp(student_details)
email_array =[]
for token in doc:
    if token.like_email:
        email_array.append(token.text)
        
        
print(email_array)

['john.smith@email.com', 'emily.j@university.edu', 'mchen98@student.org', 'srodriguez@school.net', 'akim2002@college.com']


In [9]:
# Adding a component to the pipeline (a blank pipeline starts with none) ->
print(nlp.pipe_names)

nlp.add_pipe('sentencizer')

print(nlp.pipe_names)


[]
['sentencizer']
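
Why this matters: a blank pipeline sets no sentence boundaries, so in spaCy v3 iterating `doc.sents` raises an error until a component such as `sentencizer` is added. A minimal sketch (the example text is made up for illustration):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")        # rule-based sentence boundary detection

doc = nlp("I like cats. I like dogs.")
sentences = [sent.text for sent in doc.sents]
print(sentences)                   # ['I like cats.', 'I like dogs.']
```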


In [10]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

for url in nlp(text): 
    if url.like_url:
        print(url)

http://www.data.gov/
http://www.science
http://data.gov.uk/.
http://www3.norc.org/gss+website/
http://www.europeansocialsurvey.org/.
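
Two of the matches above keep the sentence-final period, and `http://www.science` is cut short because that URL is broken across two lines in the text. One possible cleanup step, sketched in plain Python on the strings copied from the output above (the `clean_url` name is hypothetical):

```python
urls = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

def clean_url(url):
    # Drop trailing punctuation that belongs to the sentence, not the URL
    return url.rstrip(".,;")

cleaned = [clean_url(u) for u in urls]
for u in cleaned:
    print(u)
```

This does not repair the truncated `http://www.science` entry; that would need the line break in the input text to be fixed first.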


In [11]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)

for token in doc:
    # Guard against an IndexError when a number is the last token
    if token.like_num and token.i + 1 < len(doc) and doc[token.i + 1].is_currency:
        print(token.text, doc[token.i + 1].text)


two $
500 €
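
An alternative sketch of the same lookup pairs every token with its successor, so no index arithmetic is needed. It assumes the same blank English pipeline as above:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Tony gave two $ to Peter, Bruce gave 500 € to Steve")

# zip stops at the shorter sequence, so the last token is never looked past
pairs = [
    (token.text, nxt.text)
    for token, nxt in zip(doc, doc[1:])
    if token.like_num and nxt.is_currency
]
print(pairs)   # [('two', '$'), ('500', '€')]
```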


## **Language Processing Pipeline in spaCy**

In [12]:
import spacy

nlp = spacy.blank('en')

print(nlp.pipe_names)

#Already trained NLP Pipeline -> 
nlp = spacy.load('en_core_web_sm')

print(nlp.pipe_names)


[]
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']


In [13]:
complex_sentence = "Yesterday, 42 enthusiastic students from our local high school eagerly participated in the 15th annual science fair, where they showcased 7 innovative projects."
doc = nlp(complex_sentence)

for token in doc:
    
    print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<5}")


Yesterday       yesterday       NOUN 
,               ,               PUNCT
42              42              NUM  
enthusiastic    enthusiastic    ADJ  
students        student         NOUN 
from            from            ADP  
our             our             PRON 
local           local           ADJ  
high            high            ADJ  
school          school          NOUN 
eagerly         eagerly         ADV  
participated    participate     VERB 
in              in              ADP  
the             the             DET  
15th            15th            ADJ  
annual          annual          ADJ  
science         science         NOUN 
fair            fair            NOUN 
,               ,               PUNCT
where           where           SCONJ
they            they            PRON 
showcased       showcase        VERB 
7               7               NUM  
innovative      innovative      ADJ  
projects        project         NOUN 
.               .               PUNCT


In [14]:
doc = nlp("Real Madrid FC has scored many goals in last 5 games, Increasing their valuation by 3 million $.")

for ent in doc.ents:
    print(f"{ent.text:<25} {ent.label_:<15} {spacy.explain(ent.label_)}")
    
    
from spacy import displacy

displacy.render(doc, style ="ent")

Real Madrid FC            ORG             Companies, agencies, institutions, etc.
last 5                    DATE            Absolute or relative dates or periods
3 million $               MONEY           Monetary values, including unit


In [15]:
# Build a pipeline that contains only the ner component instead of every component ->

source_nlp = spacy.load('en_core_web_sm')

nlp = spacy.blank('en')

nlp.add_pipe('ner', source=source_nlp)

nlp.pipe_names

['ner']

## **Stemming and Lemmatization ->** 

**Stemming** -> Getting the base word out of a complex word. It just uses simple rules like removing "able", "ing" etc. Examples -> 

- Talking -> Talk
- Adjustable -> Adjust


**Lemmatization** -> Getting the base word by using linguistic knowledge and not just fixed rules. The base word is also called the lemma. Examples -> 
- Ate -> Eat
- Ability -> Able
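
The difference between the two can be sketched with a toy example. This is an illustration only, under the assumption of a two-suffix rule list and a two-entry lookup table; `toy_stem`, `toy_lemmatize`, `SUFFIXES` and `LEMMAS` are hypothetical names, and real stemmers and lemmatizers are far more elaborate:

```python
# Toy illustration only: real stemmers (e.g. Porter) use many more rules,
# and real lemmatizers use vocabulary and part-of-speech information.
SUFFIXES = ("able", "ing")

def toy_stem(word):
    """Strip the first matching suffix; no linguistic knowledge involved."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for linguistic knowledge
LEMMAS = {"ate": "eat", "ability": "able"}

def toy_lemmatize(word):
    """Look the word up; fall back to the lowercased word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["Talking", "Adjustable", "Ate", "Ability"]:
    print(f"{w:<12} stem={toy_stem(w):<10} lemma={toy_lemmatize(w)}")
```

The suffix rules handle "Talking" and "Adjustable" but leave "Ate" untouched, while the lookup handles "Ate" only because someone put it in the table.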

In [16]:
import nltk 
import spacy

In [17]:
from nltk import PorterStemmer
stemmer = PorterStemmer()

words = [
    "running", "jumps", "easily", "dogs", "cats", "better", "friendliness", "jumping", "leaves", "babies", "ability", "organization", "programmer"
]

for word in words : 
    print(f"{word:<15} {stemmer.stem(word)}")

running         run
jumps           jump
easily          easili
dogs            dog
cats            cat
better          better
friendliness    friendli
jumping         jump
leaves          leav
babies          babi
ability         abil
organization    organ
programmer      programm


Stemming applies fixed suffix rules without any understanding of the language, so results such as "easili" and "organ" are not real base words.

In [18]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("running jumps easily dogs cats better friendliness jumping leaves babies ability organization programmer")

for token in doc:
    print(f"{token.text:<15} {token.lemma_}")


running         running
jumps           jump
easily          easily
dogs            dog
cats            cat
better          well
friendliness    friendliness
jumping         jump
leaves          leave
babies          baby
ability         ability
organization    organization
programmer      programmer


## **Part-of-Speech (POS) Tagging ->**

In [19]:
import spacy
nlp = spacy.load('en_core_web_sm')

short_sentence = "The curious cat quickly chased five red butterflies across our sunny garden yesterday."

doc = nlp(short_sentence)

for token in doc:
    print(f"{token.text:<12} {token.pos_:<10} {spacy.explain(token.pos_):<15} {token.tag_:<10} {spacy.explain(token.tag_)}")

The          DET        determiner      DT         determiner
curious      ADJ        adjective       JJ         adjective (English), other noun-modifier (Chinese)
cat          NOUN       noun            NN         noun, singular or mass
quickly      ADV        adverb          RB         adverb
chased       VERB       verb            VBD        verb, past tense
five         NUM        numeral         CD         cardinal number
red          ADJ        adjective       JJ         adjective (English), other noun-modifier (Chinese)
butterflies  NOUN       noun            NNS        noun, plural
across       ADP        adposition      IN         conjunction, subordinating or preposition
our          PRON       pronoun         PRP$       pronoun, possessive
sunny        ADJ        adjective       JJ         adjective (English), other noun-modifier (Chinese)
garden       NOUN       noun            NN         noun, singular or mass
yesterday    NOUN       noun            NN         noun, singular or mass

In [20]:
earnings_text="""Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

doc = nlp(earnings_text)

filtered_tokens = []

for token in doc:
    if token.pos_ not in ["SPACE", "PUNCT", "X"]:
        filtered_tokens.append(token)
        
filtered_tokens[:10]

[Microsoft,
 Corp.,
 today,
 announced,
 the,
 following,
 results,
 for,
 the,
 quarter]

In [21]:
count = doc.count_by(spacy.attrs.POS)
for k,v in count.items():
    print(doc.vocab[k].text, "|",v)

PROPN | 15
NOUN | 45
VERB | 23
DET | 9
ADP | 16
NUM | 16
PUNCT | 27
SCONJ | 1
ADJ | 20
SPACE | 10
AUX | 6
SYM | 5
CCONJ | 12
ADV | 3
PART | 3
PRON | 2
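
`count_by` returns integer attribute IDs as keys, which is why the loop above maps each key back to text through `doc.vocab`. An equivalent tally can be built from the `pos_` strings with `collections.Counter`. The sketch below uses the tags copied from the In [19] output so that it runs without a model; on a real `Doc` the list would be `[token.pos_ for token in doc]`.

```python
from collections import Counter

# POS tags copied from the In [19] output (one per token shown there)
pos_tags = ["DET", "ADJ", "NOUN", "ADV", "VERB", "NUM", "ADJ",
            "NOUN", "ADP", "PRON", "ADJ", "NOUN", "NOUN"]

counts = Counter(pos_tags)
for tag, n in counts.most_common():   # most frequent tag first
    print(tag, "|", n)
```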


## **Named Entity Recognition (NER) ->**

In [22]:
import spacy 

nlp = spacy.load("en_core_web_sm")

nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [45]:
text = "Apple Inc. is planning to open a new store in New York City next month. CEO Tim Cook announced the plan during a conference in San Francisco last week. Twitter is the most popular social media platform in the world."

doc = nlp(text)

for ent in doc.ents:
    print(f"{ent.text:<20} {ent.label_:<15} {spacy.explain(ent.label_):<20}")
    
    


Apple Inc.           ORG             Companies, agencies, institutions, etc.
New York City        GPE             Countries, cities, states
next month           DATE            Absolute or relative dates or periods
Tim Cook             PERSON          People, including fictional
San Francisco        GPE             Countries, cities, states
last week            DATE            Absolute or relative dates or periods


In [46]:
# The model did not recognise Twitter as an entity above, so we add it manually.

from spacy.tokens import Span

s1 = Span(doc, 31, 32, label="ORG")  # token 31 is "Twitter" in this doc
new_ents = list(doc.ents)
new_ents.append(s1)
doc.ents = new_ents

for ent in doc.ents:
    print(f"{ent.text:<20} {ent.label_:<15} {spacy.explain(ent.label_):<20}")

Apple Inc.           ORG             Companies, agencies, institutions, etc.
New York City        GPE             Countries, cities, states
next month           DATE            Absolute or relative dates or periods
Tim Cook             PERSON          People, including fictional
San Francisco        GPE             Countries, cities, states
last week            DATE            Absolute or relative dates or periods
Twitter              ORG             Companies, agencies, institutions, etc.
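
The hard-coded indices `31, 32` only hold for this exact text. A sketch that looks the token up by its text instead, using a blank pipeline so that it runs without a trained model (which means `doc.ents` starts out empty here):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Twitter is the most popular social media platform in the world.")

# Find the position of the token instead of counting by hand
index = next(token.i for token in doc if token.text == "Twitter")

span = Span(doc, index, index + 1, label="ORG")
doc.ents = list(doc.ents) + [span]

found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)   # [('Twitter', 'ORG')]
```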
