# spacy_loaded_pipeline



- For a blank pipeline, `nlp.pipe_names` is an empty list, which means the pipeline has no components. The tokenizer always runs first and is not listed as a pipeline component.
<br>
- A more general diagram of an NLP pipeline may look like the one below.

![image.png](attachment:89ace87f-8747-493d-973d-1a63e1e73bb7.png)![image.png](attachment:3200c832-f5dc-4fe3-97a4-e493a2f076f5.png)



Download trained pipeline

To download a trained pipeline, use a command such as:

python -m spacy download en_core_web_sm

This downloads the small (sm) pipeline for the English language.

Further instructions: https://spacy.io/usage/models#quickstart
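
Before loading a pipeline, it can help to check whether it is already installed. The following is a minimal sketch, assuming spaCy v3 and that `spacy.util.is_package` is an acceptable way to perform this check. It does not download anything and only prints the command to run.

```python
from spacy.util import is_package

model_name = "en_core_web_sm"

# is_package returns True if the name maps to an installed Python package
installed = is_package(model_name)

if installed:
    print(f"{model_name} is installed and can be loaded with spacy.load")
else:
    # Nothing is downloaded here; this only prints the command to run
    print(f"Not installed. Run: python -m spacy download {model_name}")
```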


# spacy_blank_pipeline

![image.png](attachment:f331d65e-8216-4e3f-8b8e-03746438941c.png)![image.png](attachment:12b82df4-d536-4123-a4f1-3fe5910572c5.png)


- When we add components to the pipeline and run it on text, we get the final `Doc` object.
- The pipeline is everything that comes after the tokenizer.
- It consists of a number of processing steps.
- We get the above error because we have a blank pipeline, as shown below. The tokenizer comes first, and the pipeline components would appear inside the dotted rectangle in the diagram. There is nothing there, hence the blank pipeline.
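
The points above can be sketched in code. This example assumes spaCy v3. It shows that a blank pipeline has no components, that iterating over sentences is one operation that fails when no component has set sentence boundaries, and that adding the rule-based `sentencizer` component makes it work.

```python
import spacy

nlp = spacy.blank("en")
text = "Captain america ate 100$ of samosa. Then he said I can do this all day."

# A blank pipeline has a tokenizer but no components
print(nlp.pipe_names)

# Without a component that sets sentence boundaries, doc.sents raises an error
try:
    list(nlp(text).sents)
    sents_failed = False
except ValueError as err:
    sents_failed = True
    print("Error type:", type(err).__name__)

# Add the rule-based sentencizer, which splits on sentence-final punctuation
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

sentences = [sent.text for sent in nlp(text).sents]
for sentence in sentences:
    print(sentence)
```

The sentencizer needs no trained model, so it is a convenient first component to try on a blank pipeline.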

# sentencizer

![image.png](attachment:5a98fe41-dbe5-4f3a-b172-ce75a4717b30.png)![image.png](attachment:ce037977-e289-422f-8904-aad09ec171e0.png)

In [5]:
import spacy

In [6]:
# Create a blank English language processing pipeline

nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

# The text is tokenized
for token in doc:
    print(token)

# Tokenization works because, although this pipeline is blank, the tokenizer is included by default.

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day
.


In [7]:
# Component names in the pipeline (empty for a blank pipeline)

nlp.pipe_names

[]

In [9]:
# Load a pre-trained pipeline that has several components

nlp = spacy.load("en_core_web_sm")
nlp.pipe_names


['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [10]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f430f8f7ca0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f430f8f7d60>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f430faedee0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f430f7b8a80>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f430f7428c0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f430faee030>)]

The sm in en_core_web_sm means small. Other sizes are also available, such as medium (md) and large (lg). See: https://spacy.io/usage/models#quickstart

In [11]:
# Importing the spacy library
import spacy

# Loading the English language model from spaCy
nlp = spacy.load("en_core_web_sm")

# Creating a document by passing a text string to the NLP model
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

# Iterating over each token in the processed document
for token in doc:
    # Printing the token (word or punctuation), its part-of-speech explanation, and its lemma
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)

# pos_ -> part of speech: the grammatical role of each token (noun, verb, etc.)

Captain  |  proper noun  |  Captain
america  |  proper noun  |  america
ate  |  verb  |  eat
100  |  numeral  |  100
$  |  numeral  |  $
of  |  adposition  |  of
samosa  |  proper noun  |  samosa
.  |  punctuation  |  .
Then  |  adverb  |  then
he  |  pronoun  |  he
said  |  verb  |  say
I  |  pronoun  |  I
can  |  auxiliary  |  can
do  |  verb  |  do
this  |  pronoun  |  this
all  |  determiner  |  all
day  |  noun  |  day
.  |  punctuation  |  .
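
`spacy.explain` also works directly on tag strings, without a loaded pipeline, which is handy for looking up what an abbreviation such as `PROPN` or `ADP` means. A small sketch, using coarse-grained part-of-speech tags whose descriptions appear in the output above:

```python
import spacy

# Coarse-grained part-of-speech tags to look up
tags = ["PROPN", "VERB", "NUM", "ADP", "AUX", "DET"]

# Map each tag to its human-readable description
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(f"{tag:6} -> {description}")
```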


# Named Entity Recognition

In [12]:
# Importing the spaCy library
import spacy

# Importing the visualizer tool for Named Entity Recognition (NER) from spaCy
from spacy import displacy

# Loading the English language model from spaCy
nlp = spacy.load("en_core_web_sm")

# Creating a document by passing a text string to the NLP model
doc = nlp("Tesla Inc is going to acquire Twitter for $45 billion")

# Iterating over each recognized named entity in the document
for ent in doc.ents:
    # Printing the entity text and its label (e.g., PERSON, ORG, MONEY, etc.)
    print(ent.text, ent.label_)




Tesla Inc ORG
Twitter PERSON
$45 billion MONEY


In [13]:
# Importing the visualizer tool for Named Entity Recognition (NER) from spaCy
from spacy import displacy


# Using spaCy's displacy module to visualize named entities in the document
displacy.render(doc, style="ent")  # The "ent" style highlights named entities
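
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The following is a minimal sketch, assuming spaCy v3. To keep it independent of any downloaded model, the entity spans are set by hand on a blank pipeline's `Doc` (the token offsets are chosen for this sentence only), whereas in the notebook they come from the trained `ner` component.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Tesla Inc is going to acquire Twitter for $45 billion")

# Hand-labeled spans for illustration only: Span(doc, start_token, end_token, label)
doc.ents = [Span(doc, 0, 2, label="ORG"), Span(doc, 8, 11, label="MONEY")]

# jupyter=False makes render return the HTML string instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False)

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
print(len(html), "characters of HTML")
```

The returned string can then be written to an `.html` file and opened in a browser.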

# Trained processing pipeline in French

In [16]:
# nlp = spacy.load("fr_core_news_sm")  # Raises OSError because the pipeline is not installed yet

# We need to install the processing pipeline for the French language with this command:

# python -m spacy download fr_core_news_sm


In [17]:
nlp = spacy.load("fr_core_news_sm")

In [18]:
# Creating a document by passing a French text string to the NLP model
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")

# Iterating over each recognized named entity in the document
for ent in doc.ents:
    # Printing the entity text, its label (e.g., ORG, MONEY, etc.), and an explanation of the label
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  PER  |  Named person or family.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [19]:
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  PROPN  |  Tesla
Inc  |  PROPN  |  Inc
va  |  VERB  |  aller
racheter  |  VERB  |  racheter
Twitter  |  VERB  |  twitter
pour  |  ADP  |  pour
$  |  NOUN  |  dollar
45  |  NUM  |  45
milliards  |  NOUN  |  milliard
de  |  ADP  |  de
dollars  |  NOUN  |  dollar


# Adding a component to a blank pipeline

https://spacy.io/usage/processing-pipelines#pipelines

In [21]:
# Importing the spaCy library
import spacy

# Loading the pre-trained English language model from spaCy
source_nlp = spacy.load("en_core_web_sm")

# Creating a blank English NLP pipeline
nlp = spacy.blank("en")

# Adding the Named Entity Recognition (NER) component from the pre-trained model to the blank pipeline
nlp.add_pipe("ner", source=source_nlp)

# Printing the names of all the components (pipes) in the new NLP pipeline
print(nlp.pipe_names)


['ner']


In [22]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY


# Assignment

In [23]:
# Importing the necessary libraries
import spacy

nlp = spacy.load("en_core_web_sm")  # Loading the pre-trained English pipeline



Exercise: 1

    Get all the proper nouns from a given text in a list and also count them.
    A proper noun is a noun that names a particular person, place, or thing.



In [24]:
text = '''Ravi and Raju are the best friends from school days.They wanted to go for a world tour and 
visit famous cities like Paris, London, Dubai, Rome etc and also they called their another friend Mohan to take part of this world tour.
They started their journey from Hyderabad and spent next 3 months travelling all the wonderful cities in the world and cherish a happy moments!
'''

# https://spacy.io/usage/linguistic-features

# Creating the Doc object by running the pipeline on the text
doc = nlp(text)


# List for storing the proper nouns
all_proper_nouns = []


for token in doc:
    if token.pos_ == "PROPN":        # Checking whether the token's part of speech is "PROPN" (proper noun)
        all_proper_nouns.append(token)


# Finally, printing the results
print("Proper Nouns: ", all_proper_nouns)
print("Count: ", len(all_proper_nouns))

Proper Nouns:  [Raju, Paris, London, Dubai, Rome, Mohan, Hyderabad]
Count:  7
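
Note that `all_proper_nouns` holds `Token` objects rather than strings. Below is a hedged sketch of a reusable helper (the name `proper_nouns` is hypothetical) that returns plain strings instead. To keep it runnable without a downloaded model, the example builds a small `Doc` with hand-assigned part-of-speech tags; with a trained pipeline you would pass `nlp(text)` instead.

```python
import spacy
from spacy.tokens import Doc


def proper_nouns(doc):
    """Return the text of every token tagged as a proper noun."""
    return [token.text for token in doc if token.pos_ == "PROPN"]


# Hand-annotated Doc for illustration only; a trained pipeline assigns these tags itself
vocab = spacy.blank("en").vocab
words = ["Ravi", "and", "Raju", "visited", "Paris", "."]
pos = ["PROPN", "CCONJ", "PROPN", "VERB", "PROPN", "PUNCT"]
doc = Doc(vocab, words=words, pos=pos)

names = proper_nouns(doc)
print("Proper Nouns: ", names)
print("Count: ", len(names))
```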



Exercise: 2

    Get all company names from a given text and also count them.
    Hint: Use spaCy's NER functionality.



In [25]:
text = '''The Top 5 companies in USA are Tesla, Walmart, Amazon, Microsoft, Google and the top 5 companies in 
India are Infosys, Reliance, HDFC Bank, Hindustan Unilever and Bharti Airtel'''


doc = nlp(text)

# List for storing the company names
all_company_names = []

for ent in doc.ents:
    if ent.label_ == 'ORG':     # Checking whether the entity is labeled "ORG" (organization)
        all_company_names.append(ent)


# Finally, printing the results
print("Company Names: ", all_company_names)
print("Count: ", len(all_company_names))

Company Names:  [Tesla, Walmart, Amazon, Microsoft, Google, Infosys, Reliance, HDFC Bank, Hindustan Unilever, Bharti Airtel]
Count:  10
