Importing libraries: spaCy is a library for processing and analyzing natural-language text.


In [None]:
import spacy

Creating a spaCy Object: This initializes a blank spaCy pipeline for the English language ("en").
A "blank" pipeline has no trained components for tasks like Named Entity Recognition (NER) or part-of-speech tagging. It contains only the tokenizer, so it can split text into tokens but cannot provide these advanced annotations.

The doc object contains the processed text and can be analyzed further, for example by iterating over its tokens. Named entities, parsing, and tagging become available once the pipeline includes the corresponding components.

In [2]:
nlp = spacy.blank("en")

doc = nlp("During my studies, I was able to acquire a set of faculties that enabled me to reinforce my skills in multiple areas: a significant knowledge of the fundamentals of Python in order to implement data science techniques and to build machine learning as well as deep learning models. ")


This is how tokenization is done: iterating over the doc yields one token at a time.

In [3]:
for token in doc:
    print(token)

During
my
studies
,
I
was
able
to
acquire
a
set
of
faculties
that
enabled
me
to
reinforce
my
skills
in
multiple
areas
:
a
significant
knowledge
of
the
fundamentals
of
Python
in
order
to
implement
data
science
techniques
and
to
build
machine
learning
as
well
as
deep
learning
models
.
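
The tokenizer does not throw any characters away: each token remembers its character offset and trailing whitespace. The following is a minimal sketch of that idea, assuming a blank English pipeline and a shorter example sentence than the one above.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("I was able to build machine learning models.")

# Each token keeps its position (i), character offset (idx) and trailing whitespace.
for token in doc:
    print(token.i, token.idx, repr(token.text_with_ws))

# Joining the pieces gives back the original string.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == doc.text)
```

This is why a token can always be mapped back to its exact position in the source text.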


We can access tokens one by one and check some of their attributes.

In [4]:
doc[0]

During

In [5]:
type(doc)

spacy.tokens.doc.Doc

In [6]:
firsttoken = doc[0]

In [None]:
# is alphabetic or not
firsttoken.is_alpha

True

In [None]:
# looks like a number or not (e.g. "10", "1.5", "two")
firsttoken.like_num

False

In [None]:
# get the text of the token
firsttoken.text

'During'
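
The same kind of check works for every token in a doc. The sketch below filters tokens by a few lexical attributes; the sentence is made up for illustration, and the pipeline is assumed to be a blank English one.

```python
import spacy

nlp = spacy.blank("en")
# Hypothetical example sentence, not taken from the notebook data.
doc = nlp("She paid two $ and 1.5 € today!")

# Lexical attributes are available without any trained component.
numbers = [t.text for t in doc if t.like_num]
currencies = [t.text for t in doc if t.is_currency]
punctuation = [t.text for t in doc if t.is_punct]

print(numbers, currencies, punctuation)
```

Note that like_num also accepts number words such as "two", not only digits.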

Starting from a text file, we can use tokenization and the token attributes to print only the tokens that look like an email address.

In [10]:
with open("C:/Users/Wajih/Desktop/Projects Wajih/Exploring SpaCy/students.txt") as f:
    text = f.readlines()
text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n']

In [11]:
type(text)

list

In [12]:
text = ' '.join(text)
text



In [13]:
type(text)

str

In [14]:
doc = nlp(text)
emails=[]
print(doc[-2])

joe@root.com


In [15]:
doc = nlp(text)
emails = []
# Keep only the tokens that look like an email address
for token in doc:
    if token.like_email:
        emails.append(token)
print(emails)

[virat@kohli.com, maria@sharapova.com, serena@williams.com, joe@root.com]
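
As an alternative to readlines() followed by join(), the file can be read into a single string with read(). The sketch below assumes a small sample written to a temporary file in place of students.txt; the sample content and the temporary path are stand-ins, not the original data file.

```python
import os
import tempfile

import spacy

nlp = spacy.blank("en")

# Hypothetical sample standing in for students.txt.
sample = (
    "Name\tbirth day\temail\n"
    "Maria\t12 April, 2001  maria@sharapova.com\n"
    "Joe      1 May, 1997    joe@root.com\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "students.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    # read() returns the whole file as one string, so no join() is needed.
    with open(path, encoding="utf-8") as f:
        text = f.read()

emails = [token.text for token in nlp(text) if token.like_email]
print(emails)
```

Collecting token.text instead of the token itself gives a plain list of strings, which is easier to store or compare.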


We can add special-case rules to the tokenizer.

In [16]:
doc = nlp("I wannabe buying Two apples and 1 banana for Just 1.5$!")

tokens = [token.text for token in doc]
print(tokens)

['I', 'wannabe', 'buying', 'Two', 'apples', 'and', '1', 'banana', 'for', 'Just', '1.5', '$', '!']


In [17]:
from spacy.symbols import ORTH

# Split "wannabe" into two tokens: "wanna" + "be"
nlp.tokenizer.add_special_case("wannabe", [{ORTH: "wanna"}, {ORTH: "be"}])

doc = nlp("I wannabe buying Two apples and 1 banana for Just 1.5$!")

tokens = [token.text for token in doc]
print(tokens)

['I', 'wanna', 'be', 'buying', 'Two', 'apples', 'and', '1', 'banana', 'for', 'Just', '1.5', '$', '!']
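
A special case can only re-split a string, it cannot change its characters: the ORTH pieces have to concatenate back to the original text. A minimal sketch with a different word ("gimme" is chosen here for illustration), assuming a fresh blank English pipeline:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# The pieces must add up to the original string: "gim" + "me" == "gimme".
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

tokens = [token.text for token in nlp("gimme that")]
print(tokens)
```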


In spaCy, the nlp.add_pipe("sentencizer") command adds a rule-based sentencizer component to the processing pipeline. It marks sentence boundaries based on punctuation.
You can then access the sentences using doc.sents.

In [18]:
nlp.add_pipe("sentencizer")


<spacy.pipeline.sentencizer.Sentencizer at 0x2a864b0cd50>

In [19]:
doc = nlp("Dr. Joujou wannabe buying Two apples and 1 banana for Just 1.5$! i do not have money")

for sentence in doc.sents:
    print(sentence)

Dr. Joujou wannabe buying Two apples and 1 banana for Just 1.5$!
i do not have money
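
The sentencizer works on punctuation tokens, which is likely why "Dr." above does not start a new sentence: the tokenizer keeps the abbreviation as a single token. The sketch below shows the basic behaviour on a made-up text, assuming a fresh blank English pipeline.

```python
import spacy

nlp = spacy.blank("en")
# Rule-based sentence splitting; no trained model is required.
nlp.add_pipe("sentencizer")

doc = nlp("I like apples. Do you? Yes!")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```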


Extracting URLs from a text

In [20]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/, 
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

doc = nlp(text)
urls = [token.text for token in doc if token.like_url]
print(urls)

['http://www.data.gov/', 'http://www.science.gov/', 'http://data.gov.uk/.', 'http://www3.norc.org/gss+website/', 'http://www.europeansocialsurvey.org/.']
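
Two of the extracted URLs still carry the period that ended their sentence. One possible cleanup, sketched here in plain Python on the list printed above, is to strip trailing sentence punctuation. This is an assumption about what counts as noise: a URL that legitimately ends in one of these characters would be altered too.

```python
# The list printed by the cell above.
urls = [
    "http://www.data.gov/",
    "http://www.science.gov/",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

# Remove trailing periods, commas and semicolons left over from the sentence.
cleaned = [url.rstrip(".,;") for url in urls]
print(cleaned)
```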


Extracting transaction amounts (a number followed by a currency symbol) from a text

In [21]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"

doc = nlp(transactions)
a = []
# Stop at len(doc) - 1 so that the last (number, currency) pair is also checked
for i in range(len(doc) - 1):
    if doc[i].like_num and doc[i+1].is_currency:
        a.append(' '.join([str(doc[i]), str(doc[i+1])]))

print(a)



['two $', '500 €']
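
The index arithmetic can be avoided by pairing each token with its right-hand neighbour. The sketch below wraps this in a helper, extract_amounts, which is a hypothetical name introduced for illustration. The example text ends with a currency symbol so that the final pair is exercised.

```python
import spacy

nlp = spacy.blank("en")


def extract_amounts(doc):
    """Return 'number currency' strings for adjacent token pairs."""
    return [
        f"{left.text} {right.text}"
        # zip pairs token i with token i + 1 and stops at the last pair.
        for left, right in zip(doc, doc[1:])
        if left.like_num and right.is_currency
    ]


amounts = extract_amounts(nlp("Tony gave two $ to Peter, Bruce gave 500 €"))
print(amounts)
```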


In [22]:
nlp = spacy.blank("en")
text = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"

doc = nlp(text)
for token in doc:
    print(token)

Tony
gave
two
$
to
Peter
,
Bruce
gave
500
€
to
Steve


In [23]:
nlp.pipe_names

[]

* spacy.blank():
-Creates a blank spaCy pipeline for a specific language, such as English
-Contains only a tokenizer by default
-Includes no trained components (such as POS tagging or named entity recognition)
-Can be customized by adding our own components, such as the sentencizer
* spacy.load("en_core_web_sm"):
-Loads a pre-trained spaCy pipeline (English, small)
-Includes trained components; the small model does not ship static word vectors (the md and lg models do)
-Can do tokenization, POS tagging, parsing, lemmatization, NER, and sentence segmentation
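
The customization point for spacy.blank() can be checked directly by looking at nlp.pipe_names before and after adding a component. A minimal sketch, using the sentencizer as the added component:

```python
import spacy

nlp = spacy.blank("en")
before = list(nlp.pipe_names)  # the tokenizer is not listed as a component

nlp.add_pipe("sentencizer")
after = list(nlp.pipe_names)

print(before, after)
```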

In [None]:
# Load the small English pipeline ("en_core_web_sm") for natural language processing tasks.
nlp = spacy.load("en_core_web_sm")

nlp.pipe_names returns the names of the components in the loaded pipeline, in the order spaCy applies them.

When you load a trained pipeline (like nlp = spacy.load("en_core_web_sm")), it comes with a series of processing steps, or "components". The tokenizer always runs first to split the text into tokens, but it is not listed in nlp.pipe_names. The listed components include things like:

-POS Tagger (labels words with their grammatical role, like noun or verb)
-Named Entity Recognizer (identifies names of people, places, etc.)

For en_core_web_sm, the output is ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']:

tok2vec: Converts tokens into vector representations that the downstream components use.
tagger: Assigns part-of-speech tags to words (e.g., noun, verb).
parser: Analyzes sentence structure (syntax), building dependency trees.
attribute_ruler: Modifies token attributes (like part-of-speech or lemma) based on rules.
lemmatizer: Reduces words to their base or dictionary form (e.g., "running" to "run").
ner: Identifies named entities, such as names of people, places, or organizations.

In [25]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Processes the text and prints each token along with its part-of-speech tag and lemma (base form).

In [26]:
text = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"

doc = nlp(text)
for token in doc:
    print(token," | ", token.pos_," | ",token.lemma_)

Tony  |  PROPN  |  Tony
gave  |  VERB  |  give
two  |  NUM  |  two
$  |  NUM  |  $
to  |  ADP  |  to
Peter  |  PROPN  |  Peter
,  |  PUNCT  |  ,
Bruce  |  PROPN  |  Bruce
gave  |  VERB  |  give
500  |  NUM  |  500
€  |  NOUN  |  €
to  |  PART  |  to
Steve  |  PROPN  |  Steve


Identifies and prints named entities in the text, along with their label (type) and a brief explanation of each label.

In [27]:
text = "Tony gave 2 $ to Peter, Bruce gave 500 € to Steve"

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, " - " , ent.label_, ":", spacy.explain(ent.label_))
    

Tony  -  PERSON : People, including fictional
2 $  -  MONEY : Monetary values, including unit
Peter, Bruce gave  -  ORG : Companies, agencies, institutions, etc.
500 €  -  MONEY : Monetary values, including unit
Steve  -  PERSON : People, including fictional


Showing the named entities in a fancier way with displaCy

In [28]:
from spacy import displacy

displacy.render(doc, style="ent")

WORD EMBEDDING IN SPACY

In [None]:
nlp = spacy.load("en_core_web_lg")
doc = nlp("I love watching football and acmilan")

# token.has_vector checks whether the token has a word vector associated with it.
# token.is_oov is True when the token is out of vocabulary (no vector available).
for token in doc:
    print(token.text, " Has Vector:", token.has_vector, " OOV: ", token.is_oov)

I  Has Vector: True  OOV:  False
love  Has Vector: True  OOV:  False
watching  Has Vector: True  OOV:  False
football  Has Vector: True  OOV:  False
and  Has Vector: True  OOV:  False
acmilan  Has Vector: False  OOV:  True


In [None]:
# Retrieves the word vector (numerical representation) of the first token in the document
doc[0].vector

array([ 1.8733e-01,  4.0595e-01, -5.1174e-01, -5.5482e-01,  3.9716e-02,
        1.2887e-01,  4.5137e-01, -5.9149e-01,  1.5591e-01,  1.5137e+00,
       -8.7020e-01,  5.0672e-02,  1.5211e-01, -1.9183e-01,  1.1181e-01,
        1.2131e-01, -2.7212e-01,  1.6203e+00, -2.4884e-01,  1.4060e-01,
        3.3099e-01, -1.8061e-02,  1.5244e-01, -2.6943e-01, -2.7833e-01,
       -5.2123e-02, -4.8149e-01, -5.1839e-01,  8.6262e-02,  3.0818e-02,
       -2.1253e-01, -1.1378e-01, -2.2384e-01,  1.8262e-01, -3.4541e-01,
        8.2611e-02,  1.0024e-01, -7.9550e-02, -8.1721e-01,  6.5621e-03,
        8.0134e-02, -3.9976e-01, -6.3131e-02,  3.2260e-01, -3.1625e-02,
        4.3056e-01, -2.7270e-01, -7.6020e-02,  1.0293e-01, -8.8653e-02,
       -2.9087e-01, -4.7214e-02,  4.6036e-02, -1.7788e-02,  6.4990e-02,
        8.8451e-02, -3.1574e-01, -5.8522e-01,  2.2295e-01, -5.2785e-02,
       -5.5981e-01, -3.9580e-01, -7.9849e-02, -1.0933e-02, -4.1722e-02,
       -5.5576e-01,  8.8707e-02,  1.3710e-01, -2.9873e-03, -2.62

In [37]:

doc[0].vector.shape

(300,)

In [39]:
# nlp() returns a Doc; token1[0] is its single token "bread"
token1 = nlp("bread")
token1[0].vector.shape

(300,)

We can check the similarity score between each token and a reference word

In [40]:
doc = nlp("I ate a sandwich with french fries")

for token in doc:
    print(token.text, " - ", token1.text,token.similarity(token1))

I  -  bread 0.26605986855569747
ate  -  bread 0.5356433485040003
a  -  bread 0.24572681149853734
sandwich  -  bread 0.6874560014053445
with  -  bread 0.30557175171805506
french  -  bread 0.42025075623170777
fries  -  bread 0.6145180843733152


In [41]:
def print_similarity(base_word, words_to_compare):
    base_token = nlp(base_word)
    doc = nlp(words_to_compare)
    for token in doc:
        print(token.text, " - ", base_token.text, token.similarity(base_token))

In [42]:
print_similarity("iphone","apple samsung iphone dog kitten technology")

apple  -  iphone 0.6339781147910419
samsung  -  iphone 0.6678678014329177
iphone  -  iphone 1.0000000285783557
dog  -  iphone 0.17431037640553934
kitten  -  iphone 0.14685812907484028
technology  -  iphone 0.23019987023206995


In [43]:
from sklearn.metrics.pairwise import cosine_similarity

king = nlp.vocab["king"].vector
man = nlp.vocab["man"].vector
woman = nlp.vocab["woman"].vector
queen = nlp.vocab["queen"].vector

# Classic analogy: king - man + woman should land close to queen
result = king - man + woman

cosine_similarity([result], [queen])

array([[0.78808445]], dtype=float32)
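
Cosine similarity is the dot product of two vectors divided by the product of their lengths, and spaCy's similarity() uses it by default. The sketch below spells the formula out with NumPy and replays the analogy on toy 2-D vectors. These vectors are invented for illustration (axes: royalty, maleness) and are not real embeddings, so the arithmetic works out exactly here, unlike with the 300-dimensional vectors above.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors, invented for illustration: (royalty, maleness).
king = np.array([1.0, 1.0])
man = np.array([0.0, 1.0])
woman = np.array([0.0, -1.0])
queen = np.array([1.0, -1.0])

result = king - man + woman
score = cosine(result, queen)
print(result, score)
```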

We can do lemmatization using spaCy

In [44]:
doc = nlp("Loving better did ability encouter wiser adjustable based")

# token.lemma_ is the lemma as text; token.lemma is its integer hash in the vocabulary
for token in doc:
    print(token, " | ", token.lemma_, " | ", token.lemma)

Loving  |  love  |  3702023516439754181
better  |  well  |  4525988469032889948
did  |  do  |  2158845516055552166
ability  |  ability  |  11565809527369121409
encouter  |  encouter  |  14036664800987158468
wiser  |  wise  |  1716070729102715479
adjustable  |  adjustable  |  6033511944150694480
based  |  base  |  4715552063986449646


Printing the coarse part of speech (POS) of each token and its fine-grained tag (a further categorization of the part of speech, such as the tense of a verb)

In [45]:
doc = nlp("Elon flies to mars yesterday. He carried pizza and pasta with him")

for token in doc:
    if token.pos_ not in ["SPACE", "X", "PUNCT"]:
        print(token, " | ", token.pos_, " : ", spacy.explain(token.pos_), " | ", token.tag_, " : ", spacy.explain(token.tag_))

Elon  |  PROPN  :  proper noun  |  NNP  :  noun, proper singular
flies  |  VERB  :  verb  |  VBZ  :  verb, 3rd person singular present
to  |  ADP  :  adposition  |  IN  :  conjunction, subordinating or preposition
mars  |  PROPN  :  proper noun  |  NNP  :  noun, proper singular
yesterday  |  NOUN  :  noun  |  NN  :  noun, singular or mass
He  |  PRON  :  pronoun  |  PRP  :  pronoun, personal
carried  |  VERB  :  verb  |  VBD  :  verb, past tense
pizza  |  NOUN  :  noun  |  NN  :  noun, singular or mass
and  |  CCONJ  :  coordinating conjunction  |  CC  :  conjunction, coordinating
pasta  |  NOUN  :  noun  |  NN  :  noun, singular or mass
with  |  ADP  :  adposition  |  IN  :  conjunction, subordinating or preposition
him  |  PRON  :  pronoun  |  PRP  :  pronoun, personal
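
spacy.explain() is a plain glossary lookup, so it can be used on its own to decode any label without processing a text. A short sketch using labels that appear in the outputs of this notebook:

```python
import spacy

# Labels taken from the outputs above; no pipeline needs to be loaded.
labels = ["PROPN", "VBZ", "VBD", "MONEY"]
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(label, ":", meaning)
```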


We will use spaCy for NER (named entity recognition)

In [46]:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [47]:
doc = nlp("Tesla Inc is going to acquire Twitter company for 45$ Billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, ":",spacy.explain(ent.label_))

Tesla Inc  |  ORG : Companies, agencies, institutions, etc.
Twitter  |  PERSON : People, including fictional
45$ Billion  |  MONEY : Monetary values, including unit
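
The statistical model labels "Twitter" as PERSON here. One way to handle known names is rule-based matching with spaCy's entity_ruler component. This is a different technique from the trained ner component used above and is not part of the original notebook; the sketch below runs it alone on a blank pipeline, with patterns written by hand for this sentence.

```python
import spacy

nlp = spacy.blank("en")

# Rule-based entity matching: exact phrases mapped to labels.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Tesla Inc"},
    {"label": "ORG", "pattern": "Twitter"},
])

doc = nlp("Tesla Inc is going to acquire Twitter company for 45$ Billion")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Only the phrases listed in the patterns are found, so this complements a trained model rather than replacing it.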


In [48]:
from spacy import displacy
displacy.render(doc,style="ent")