
In [49]:
import spacy

In [50]:
nlp  = spacy.blank('en')
doc = nlp('Dr. STRANGE is such a brillant fighter. He needs to be in Avenger. People will seriously love it! He buys an orange of 4$ from the market')
for word in doc:
  print(word)

Dr.
STRANGE
is
such
a
brillant
fighter
.
He
needs
to
be
in
Avenger
.
People
will
seriously
love
it
!
He
buys
an
orange
of
4
$
from
the
market


In [51]:
doc[0]

Dr.

In [52]:
type(doc)

spacy.tokens.doc.Doc

In [53]:
money = doc[-4]
money

$

In [54]:
money.is_currency

True

In [55]:
excl = doc[-11]
excl

!

In [56]:
excl.is_punct

True

In [57]:
excl.is_left_punct

False

In [58]:
name = doc[1]
name

STRANGE

In [59]:
name.is_upper

True
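The token flags checked one at a time above can also be collected in a single pass. The following is a minimal sketch, assuming a blank English pipeline and a shortened sentence; the `flags` dictionary is only an illustrative name and is not part of the original notebook.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("He buys an orange of 4$ from the market!")

# Map each token's text to a few lexical flags
flags = {
    token.text: {
        "is_alpha": token.is_alpha,
        "is_punct": token.is_punct,
        "like_num": token.like_num,
        "is_currency": token.is_currency,
    }
    for token in doc
}

for text, attrs in flags.items():
    print(text, attrs)
```

These flags are computed from the token text alone, so they work without loading a trained model.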

In [60]:
nlp.pipe_names

[]

In [63]:
nlp.add_pipe('sentencizer')
nlp.pipe_names

['sentencizer']

In [64]:
doc2 = nlp("Dr. STRANGE is such a brillant fighter. He needs to be in Avenger. People will seriously love it! He buys an orange of 4$ from the market")
for sentence in doc2.sents:
  print(sentence)

Dr. STRANGE is such a brillant fighter.
He needs to be in Avenger.
People will seriously love it!
He buys an orange of 4$ from the market
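A blank pipeline has no component that sets sentence boundaries, which is why the `sentencizer` is added before iterating over `doc.sents`. The sketch below illustrates the difference using `Doc.has_annotation`; the shortened text and the variable names are assumptions made for this example.

```python
import spacy

text = "Dr. STRANGE is such a brilliant fighter. He needs to be in Avenger."

# Blank pipeline: tokenizer only, so no sentence boundaries are set
blank_nlp = spacy.blank("en")
has_sents_before = blank_nlp(text).has_annotation("SENT_START")

# The rule-based sentencizer marks boundaries after sentence-final punctuation tokens
blank_nlp.add_pipe("sentencizer")
doc = blank_nlp(text)
has_sents_after = doc.has_annotation("SENT_START")
sentences = [sent.text for sent in doc.sents]

print(has_sents_before, has_sents_after)
print(sentences)
```

Because `Dr.` is kept as a single token, its period does not trigger a sentence break.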


## **Extracting emails from a text file**

In [65]:
with open('/content/Studentportal.txt') as f:
  text = f.readlines()
text

['Student Portal\n',
 '\n',
 'Name       Birthday       Email \n',
 'Atharva    2 June, 2003   desaiatharva50@gmail.com\n',
 'Darsh      25 June, 2003  darshjain45@gmail.com\n',
 'Raj        13 Feb, 2003   rajghag897@gmail.com\n',
 'Gaurang    15 Sept, 2003  bhoglegaurang101@gmail.com']

In [66]:
text = " ".join(text)
text



In [67]:
email = []
nlp = spacy.load('en_core_web_sm')  # trained English pipeline
doc3 = nlp(text)
for word in doc3:
  # like_email is a lexical flag computed from the token text
  if word.like_email:
    email.append(word)

email

[desaiatharva50@gmail.com,
 darshjain45@gmail.com,
 rajghag897@gmail.com,
 bhoglegaurang101@gmail.com]
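The cell above depends on `/content/Studentportal.txt` and on the `en_core_web_sm` model. As a self-contained sketch, the same check can be run on an inline copy of a few rows with a blank pipeline, since `like_email` only needs the tokenizer. The names `lines`, `nlp_blank`, and `emails` are illustrative.

```python
import spacy

# Inline copy of some rows shown above, so no file is needed
lines = [
    "Name       Birthday       Email \n",
    "Atharva    2 June, 2003   desaiatharva50@gmail.com\n",
    "Darsh      25 June, 2003  darshjain45@gmail.com\n",
]

nlp_blank = spacy.blank("en")  # tokenizer only
doc = nlp_blank(" ".join(lines))

# Keep the text of every token that looks like an email address
emails = [token.text for token in doc if token.like_email]
print(emails)
```

Storing `token.text` gives plain strings rather than `Token` objects, which is usually more convenient for further processing.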

## **Support for other languages**

In [68]:
# nlp is still the English pipeline loaded above; a blank Hindi pipeline
# could be created with spacy.blank('hi') instead
hin = nlp('यह बहुत अच्छा लग रहा था. मैं इससे अपनी आँखें नहीं हटा पा रहा हूँ')
for word in hin:
  print(word)

यह
बहुत
अच्छा
लग
रहा
था
.
मैं
इससे
अपनी
आँखें
नहीं
हटा
पा
रहा
हूँ


In [69]:
hin[0].is_title

False

## **Tasks**

In [None]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/,
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''
link = []
exlink = nlp(text)
for word in exlink:
  if(word.like_url):
    link.append(word)

link
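The task cell above has no recorded output. The sketch below shows the same `like_url` check on a shorter, single-line sample, assuming a blank English pipeline; `sample` and `links` are illustrative names. Note that a URL broken across two lines in the input (such as the `http://www.science.` / `gov/` pair in the task text) reaches the tokenizer as two whitespace-separated pieces, so it cannot come back as one token.

```python
import spacy

nlp_blank = spacy.blank("en")  # like_url needs only the tokenizer

sample = (
    "Good places to start include http://www.data.gov/ "
    "and http://data.gov.uk/ for public data"
)

# Keep the text of every token that looks like a URL
links = [token.text for token in nlp_blank(sample) if token.like_url]
print(links)
```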

In [70]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
amount = []
trans = nlp(transactions)
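for word in trans:
  # Pair each currency symbol with the token before it (the amount);
  # skip a symbol at position 0, which has no preceding token
  if word.is_currency and word.i > 0:
    money = trans[word.i - 1 : word.i + 1]
    amount.append(money)
amount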

[two $, 500 €]
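The loop above assumes the amount is the token immediately before the currency symbol. A hedged sketch of a reusable helper follows; the name `amounts_before_currency` is hypothetical, and the extra `like_num` check is an assumption about what should count as an amount.

```python
import spacy

nlp_blank = spacy.blank("en")  # is_currency and like_num need only the tokenizer


def amounts_before_currency(doc):
    """Return 'amount symbol' strings where a number-like token precedes a currency symbol."""
    results = []
    for token in doc:
        # Require a preceding token, and require it to look like a number
        if token.is_currency and token.i > 0 and doc[token.i - 1].like_num:
            results.append(doc[token.i - 1 : token.i + 1].text)
    return results


found = amounts_before_currency(
    nlp_blank("Tony gave two $ to Peter, Bruce gave 500 € to Steve")
)
print(found)
```

This variant does not handle amounts written after the symbol, such as `$45`; that case would need a check on the following token instead.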

In [71]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [72]:
text = 'Dr. STRANGE is such a brillant fighter. He needs to be in Avenger. People will seriously love it! He buys an orange of 4$ from the market'
doc = nlp(text)
for word in doc:
  print(word ,' | ',spacy.explain(word.pos_),' | ',word.lemma_)

Dr.  |  proper noun  |  Dr.
STRANGE  |  proper noun  |  STRANGE
is  |  auxiliary  |  be
such  |  determiner  |  such
a  |  determiner  |  a
brillant  |  adjective  |  brillant
fighter  |  noun  |  fighter
.  |  punctuation  |  .
He  |  pronoun  |  he
needs  |  verb  |  need
to  |  particle  |  to
be  |  auxiliary  |  be
in  |  adposition  |  in
Avenger  |  proper noun  |  Avenger
.  |  punctuation  |  .
People  |  noun  |  People
will  |  auxiliary  |  will
seriously  |  adverb  |  seriously
love  |  verb  |  love
it  |  pronoun  |  it
!  |  punctuation  |  !
He  |  pronoun  |  he
buys  |  verb  |  buy
an  |  determiner  |  an
orange  |  noun  |  orange
of  |  adposition  |  of
4  |  numeral  |  4
$  |  numeral  |  $
from  |  adposition  |  from
the  |  determiner  |  the
market  |  noun  |  market


## **Named Entity Recognition**

In [73]:
text = 'Elon Musk, has completed his $44bn (£38.1bn) takeover of Twitter'
doc = nlp(text)
for ent in doc.ents:
  print(ent.text ,' | ',ent.label_)

Elon Musk  |  PERSON
44bn  |  MONEY
Twitter  |  PRODUCT


In [74]:
from spacy import displacy

displacy.render(doc, style="ent")

[Rendered entity visualization: "Elon Musk" tagged PERSON, "44bn" tagged MONEY, "Twitter" tagged PRODUCT]

## **Trained processing pipeline in French**

In [85]:
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.6.0/fr_core_news_sm-3.6.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


In [86]:
nlp = spacy.load("fr_core_news_sm")

In [89]:
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [91]:
nlp.pipe_names

['tok2vec', 'morphologizer', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [92]:
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  PROPN  |  Tesla
Inc  |  PROPN  |  Inc
va  |  VERB  |  aller
racheter  |  VERB  |  racheter
Twitter  |  VERB  |  twitter
pour  |  ADP  |  pour
$  |  NOUN  |  dollar
45  |  NUM  |  45
milliards  |  NOUN  |  milliard
de  |  ADP  |  de
dollars  |  NOUN  |  dollar


## **Adding a component to a blank pipeline**

In [93]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']

In [94]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY
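The cells above copy the trained `ner` component from `en_core_web_sm`. As a contrast, the sketch below adds a rule-based `entity_ruler` to a blank pipeline, which needs no trained model. This is a different technique from the statistical `ner` component, and the two patterns are assumptions written only for this sentence.

```python
import spacy

nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Hand-written patterns: one exact phrase and one token-level pattern
ruler.add_patterns([
    {"label": "ORG", "pattern": "Tesla Inc"},
    {
        "label": "MONEY",
        "pattern": [{"ORTH": "$"}, {"LIKE_NUM": True}, {"LOWER": "billion"}],
    },
])

doc = nlp_rules("Tesla Inc is going to acquire twitter for $45 billion")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(nlp_rules.pipe_names)
print(entities)
```

Rules like these only match what they spell out, whereas the sourced `ner` component generalizes to entities it has not seen.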
