<div style="text-align:center">
  <img src="https://github.com/floresernesto95/Images/blob/main/Notebook%200.jpg?raw=true" alt="Centered and Resized Image" width="600" height="200">
</div>

> # **Setup**

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

- **Notes**

    en_core_web_sm is a small English pipeline trained on written web text (blogs, news, comments). It includes vocabulary, syntax, and named entities.

> # **Show entities**

In [2]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - ' +str(ent.start_char) +' - '+ str(ent.end_char) +
                  ' - '+ent.label_+ ' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

- **Notes**

    This function takes a spaCy Doc object and prints, for each named entity it contains, the text, start character index, end character index, label, and an explanation of the label. If the Doc has no entities, it prints a message saying so.

In [3]:
doc1 = nlp("Apple is looking at buying U.K. startup for $1 billion.")
show_ents(doc1)

Apple - 0 - 5 - ORG - Companies, agencies, institutions, etc.
U.K. - 27 - 31 - GPE - Countries, cities, states
$1 billion - 44 - 54 - MONEY - Monetary values, including unit


In [4]:
doc2 = nlp('Can I please borrow 500 dollars from you to buy some Microsoft stock?')
for ent in doc2.ents:
    print(f'{ent.text}, {ent.label_}')

500 dollars, MONEY
Microsoft, ORG


In [5]:
ents = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc2.ents]
print(ents)

[('500 dollars', 20, 31, 'MONEY'), ('Microsoft', 53, 62, 'ORG')]


- **Notes**
    
    - The list comprehension collects the same entity information as a list of tuples instead of printing it.
    
    - The entity type is accessible either as a hash value using ent.label or as a string using ent.label_
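
The two label forms can be compared directly. The following is a minimal sketch that assumes a blank English pipeline (`spacy.blank`) and a hand-assigned ORG span instead of a model prediction; the names `blank_nlp` and `label_doc` are made up for this example.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline (tokenizer only), so no trained model is needed.
blank_nlp = spacy.blank("en")
label_doc = blank_nlp("Microsoft stock rose.")

# Label the first token as ORG by hand instead of relying on a prediction.
label_doc.ents = [Span(label_doc, 0, 1, label="ORG")]
org_ent = label_doc.ents[0]

label_hash = org_ent.label    # integer hash of the label
label_text = org_ent.label_   # the label as a string
# The vocabulary's string store maps the hash back to the string.
round_trip = label_doc.vocab.strings[label_hash]

print(type(label_hash).__name__, label_text, round_trip)
```

The hash form is mostly useful for fast comparisons; the string form is the one to print or log.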

- **Entity annotations**

    Doc.ents is a sequence of token spans, each with its own set of annotations.
    
    <table align="left">
    <tr><td>ent.text</td><td>The original entity text</td></tr>
    <tr><td>ent.label</td><td>The entity type's hash value</td></tr>
    <tr><td>ent.label_</td><td>The entity type's string label</td></tr>
    <tr><td>ent.start</td><td>The span's start token index in the Doc</td></tr>
    <tr><td>ent.end</td><td>The span's end token index in the Doc (exclusive)</td></tr>
    <tr><td>ent.start_char</td><td>The entity text's start character offset in the Doc text</td></tr>
    <tr><td>ent.end_char</td><td>The entity text's end character offset in the Doc text (exclusive)</td></tr>
    </table>
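
The difference between the token-based and character-based positions in the table can be checked with a short sketch. It assumes a blank English pipeline and a hand-labeled MONEY span; `blank_nlp`, `offset_doc`, and `money` are illustrative names, not part of the notebook.

```python
import spacy
from spacy.tokens import Span

# Assumption: tokenizer-only pipeline; the entity is assigned by hand.
blank_nlp = spacy.blank("en")
offset_doc = blank_nlp("I owe you 500 dollars today.")

# Entity covering tokens 3 and 4 ("500", "dollars").
offset_doc.ents = [Span(offset_doc, 3, 5, label="MONEY")]
money = offset_doc.ents[0]

# start/end are token indices, so they slice the Doc.
token_slice = offset_doc[money.start:money.end].text
# start_char/end_char are character offsets, so they slice the raw text.
char_slice = offset_doc.text[money.start_char:money.end_char]

print((money.start, money.end), (money.start_char, money.end_char))
print(token_slice == char_slice == money.text)
```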

In [6]:
doc3 = nlp("San Francisco considers banning sidewalk delivery robots.")
ent_san = [doc3[0].text, doc3[0].ent_iob_, doc3[0].ent_type_]
ent_francisco = [doc3[1].text, doc3[1].ent_iob_, doc3[1].ent_type_]
print(ent_san) 
print(ent_francisco)

['San', 'B', 'GPE']
['Francisco', 'I', 'GPE']


- **Notes**

    The standard way to access entity annotations is the doc.ents property, which produces a sequence of span objects. The span object acts as a sequence of tokens, so you can iterate over the entity or index into it. You can also get the text form of the whole entity, as though it were a single token.

    You can also access entity annotations on individual tokens using the token.ent_iob_ and token.ent_type_ attributes. token.ent_iob_ indicates whether the token begins an entity, is inside one, or is outside any entity. If no entity type is set on a token, token.ent_type_ returns an empty string.

- **IOB scheme**

    <table align="left">
    <tr><td>I</td><td>Token is inside an entity</td></tr>
    <tr><td>O</td><td>Token is outside an entity</td></tr>
    <tr><td>B</td><td>Token is the beginning of an entity</td></tr>
    </table>
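
To see the scheme on every token of a sentence, not just the first two, a sketch like the following can help. It assumes a blank English pipeline with a hand-assigned GPE span, so the tags do not depend on a trained model; `iob_doc` and `iob_rows` are hypothetical names.

```python
import spacy
from spacy.tokens import Span

# Assumption: tokenizer-only pipeline; the GPE entity is assigned by hand.
blank_nlp = spacy.blank("en")
iob_doc = blank_nlp("San Francisco considers banning robots.")
iob_doc.ents = [Span(iob_doc, 0, 2, label="GPE")]

# One (text, IOB tag, entity type) triple per token.
iob_rows = [(token.text, token.ent_iob_, token.ent_type_) for token in iob_doc]
for row in iob_rows:
    print(row)
```

The first token of the entity is tagged B and the second I. The remaining tokens carry no entity type, so their ent_type_ is an empty string.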
    
- **NER tags**

    <table align="left">
    <tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
    <tr><td>PERSON</td><td>People, including fictional</td><td>Fred Flintstone</td></tr>
    <tr><td>NORP</td><td>Nationalities or religious or political groups</td><td>The Republican Party</td></tr>
    <tr><td>FAC</td><td>Buildings, airports, highways, bridges, etc</td><td>Logan International Airport, The Golden Gate</td></tr>
    <tr><td>ORG</td><td>Companies, agencies, institutions, etc</td><td>Microsoft, FBI, MIT</td></tr>
    <tr><td>GPE</td><td>Countries, cities, states</td><td>France, UAR, Chicago, Idaho</td></tr>
    <tr><td>LOC</td><td>Non-GPE locations, mountain ranges, bodies of water</td><td>Europe, Nile River, Midwest</td></tr>
    <tr><td>PRODUCT</td><td>Objects, vehicles, foods, etc. (Not services)</td><td>Formula 1</td></tr>
    <tr><td>EVENT</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>Olympic Games</td></tr>
    <tr><td>WORK_OF_ART</td><td>Titles of books, songs, etc.</td><td>The Mona Lisa</td></tr>
    <tr><td>LAW</td><td>Named documents made into laws</td><td>Roe v. Wade</td></tr>
    <tr><td>LANGUAGE</td><td>Any named language</td><td>English</td></tr>
    <tr><td>DATE</td><td>Absolute or relative dates or periods</td><td>20 July 1969</td></tr>
    <tr><td>TIME</td><td>Times smaller than a day</td><td>Four hours</td></tr>
    <tr><td>PERCENT</td><td>Percentage, including %</td><td>Eighty percent</td></tr>
    <tr><td>MONEY</td><td>Monetary values, including unit</td><td>Twenty Cents</td></tr>
    <tr><td>QUANTITY</td><td>Measurements, as of weight or distance</td><td>Several kilometers, 55kg</td></tr>
    <tr><td>ORDINAL</td><td>first, second, etc.</td><td>9th, Ninth</td></tr>
    <tr><td>CARDINAL</td><td>Numerals that do not fall under another type</td><td>2, Two, Fifty-two</td></tr>
    </table>
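
The descriptions in this table can also be looked up programmatically with spacy.explain, the same function used in show_ents. A small sketch, with an arbitrary selection of labels:

```python
import spacy

# Look up the glossary description of a few entity labels.
labels = ["ORG", "GPE", "MONEY"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```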

> # **Define entities**

In [7]:
from spacy.tokens import Span

In [8]:
doc4 = nlp('FB is hiring a new vice president of global policy.')
show_ents(doc4)

No named entities found.


- **Notes**

    spaCy does not recognize FB as a company.

In [9]:
ORG = doc4.vocab.strings['ORG']
new_ent = Span(doc4, 0, 1, label=ORG)
doc4.ents = list(doc4.ents) + [new_ent]
show_ents(doc4)

FB - 0 - 2 - ORG - Companies, agencies, institutions, etc.


- **Notes**

    This code first retrieves the hash value of the entity label ORG from the vocabulary of doc4 and assigns it to the variable ORG. Next, it creates a Span object called new_ent, labeled ORG, that covers doc4 from token index 0 (inclusive) to token index 1 (exclusive), which is the single token FB. Finally, it appends new_ent to the existing entities in doc4.ents, so FB is now listed among the document's named entities.
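
When the position of an entity is known as character offsets rather than token indices, Doc.char_span is a possible alternative to building the Span by hand. The sketch below assumes a blank English pipeline; `fb_doc`, `fb_span`, and `misaligned` are illustrative names.

```python
import spacy

# Assumption: tokenizer-only pipeline, so the Doc starts with no entities.
blank_nlp = spacy.blank("en")
fb_doc = blank_nlp("FB is hiring a new vice president of global policy.")

# Characters 0-2 cover the token "FB", so this returns a labeled Span.
fb_span = fb_doc.char_span(0, 2, label="ORG")
# Characters 0-1 cut the token in half, so no Span can be built.
misaligned = fb_doc.char_span(0, 1, label="ORG")

fb_doc.ents = list(fb_doc.ents) + [fb_span]
print([(ent.text, ent.label_) for ent in fb_doc.ents], misaligned)
```

Because char_span returns None for offsets that do not line up with token boundaries, it is worth checking the result before adding it to doc.ents.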

> # **NER visualization**

In [10]:
from spacy import displacy

In [11]:
doc5 = nlp("When S. Thrun started working on self driving cars at Google in 2007 few people outside of the company took him serious.")
displacy.render(doc5, style="ent", jupyter=True)

In [12]:
text = """Clearview AI, a New York-headquartered facial recognition company, has been fined £7.5 million ($9.4 million) by a U.K. 
privacy regulator. 
Over the last few years, the firm has collected images from the web and social media of people in Britain and elsewhere to create a 
global online database that can be used by law enforcement for facial recognition.
The Information Commission’s Office said Monday that the company has breached U.K. data protection laws.
The ICO has ordered Clearview to delete data it has on U.K. residents and banned it from collecting any more.
Clearview writes on its website that it has collected more than 20 billion facial images of people around the world. It collects 
publicly posted images from social media platforms like Facebook and Instagram, as well as news media, mugshot websites and other open 
sources. It does so without informing the individuals or asking for their consent.
Clearview’s platform allows law enforcement agencies to upload a photo of an individual and try to match it to photos that are stored 
in Clearview’s database.
John Edwards, the U.K.’s information commissioner, said in a statement: “The company not only enables identification of those people, 
but effectively monitors their behavior and offers it as a commercial service. That is unacceptable.”
He added that people expect their personal information to be respected, regardless of where in the world their data is being used."""

doc6 = nlp(text)

for sent in doc6.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)



- **Notes**

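    Renders the entities one sentence at a time. Each sentence's text is passed through nlp again, so the entities shown are predicted from that sentence alone and may differ from those in doc6.ents.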
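
The loop above passes each sentence's text through nlp again. A possible alternative is Span.as_doc(), which copies a sentence into its own Doc together with the annotations it already has. The sketch below assumes spaCy v3 (where the sentencizer is added by name), a blank English pipeline, and hand-labeled entities; all variable names are made up for the example.

```python
import spacy
from spacy.tokens import Span

# Assumption: spaCy v3, where built-in components are added by name.
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

two_sents = blank_nlp("Clearview was fined today. Facebook hosts many public images.")
# Hand-labeled entities standing in for model predictions.
two_sents.ents = [
    Span(two_sents, 0, 1, label="ORG"),  # "Clearview"
    Span(two_sents, 5, 6, label="ORG"),  # "Facebook"
]

# as_doc() copies each sentence, annotations included, into its own Doc.
sentence_docs = [sent.as_doc() for sent in two_sents.sents]
ents_per_sentence = [[ent.text for ent in sent_doc.ents] for sent_doc in sentence_docs]
print(ents_per_sentence)
```

Each element of sentence_docs could then be passed to displacy.render in place of nlp(sent.text).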

In [13]:
options = {'ents': ['ORG', 'PRODUCT']}
displacy.render(doc6, style='ent', jupyter=True, options=options)

- **Notes**

    This code restricts the visualization to entities labeled ORG or PRODUCT through the 'ents' option. displacy.render then draws doc6 in the 'ent' style, which highlights each entity span and shows its label. The jupyter=True parameter tells displaCy to display the output inline in the notebook.

In [14]:
colors = {'ORG': 'linear-gradient(90deg, #f2c707, #dc9ce7)', 'PRODUCT': 'radial-gradient(white, green)'}
options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}
displacy.render(doc6, style='ent', jupyter=True, options=options)

- **Notes**

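    Customizes the highlight of each label through the 'colors' option, which maps a label to a CSS color value. Gradients also work, as shown here.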
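
The effect of the 'ents' and 'colors' options can also be inspected outside a notebook, because displacy.render returns the markup as a string when jupyter=False. A minimal sketch, assuming a blank English pipeline and hand-labeled entities (`viz_doc` and `viz_options` are illustrative names):

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: tokenizer-only pipeline; entities are assigned by hand.
blank_nlp = spacy.blank("en")
viz_doc = blank_nlp("Thrun joined Google in 2007.")
viz_doc.ents = [
    Span(viz_doc, 2, 3, label="ORG"),   # "Google"
    Span(viz_doc, 4, 5, label="DATE"),  # "2007"
]

viz_options = {
    "ents": ["ORG"],  # only ORG entities are highlighted
    "colors": {"ORG": "linear-gradient(90deg, #f2c707, #dc9ce7)"},
}
# With jupyter=False, render returns the markup instead of displaying it.
html = displacy.render(viz_doc, style="ent", jupyter=False, options=viz_options)

print("ORG" in html, "DATE" in html, "linear-gradient" in html)
```

The returned string can be written to an .html file if the visualization is needed outside Jupyter.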

In [15]:
colors = {'ORG':'linear-gradient(90deg, #aa9cde, #dc9ce7)', 'PRODUCT':'radial-gradient(white, red)'}
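# The option key is 'ents'; an unrecognized key such as 'ent' is ignored, so no filtering would happen
options = {'ents':['ORG', 'PRODUCT'],'colors':colors}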
displacy.render(doc6, style='ent', jupyter=True, options=options)

> # **Matcher**

In [16]:
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

In [17]:
doc7 = nlp('Our company plans to introduce a new vacuum cleaner. If successful, the vacuum cleaner will be our first product.')
show_ents(doc7)

first - 99 - 104 - ORDINAL - "first", "second", etc.


In [18]:
matcher = PhraseMatcher(nlp.vocab)

phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]

# Patterns are passed as a single list, the form spaCy v3 expects
matcher.add('newproduct', phrase_patterns)
matches = matcher(doc7)
matches

[(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]

- **Notes**

    This code defines a list of phrases to look for, 'vacuum cleaner' and 'vacuum-cleaner', and converts each one into a spaCy Doc object. These Docs are then added to the matcher under the name 'newproduct' using the add method. Finally, the matcher is applied to doc7. Each match is a tuple containing the hash of the rule name, the start token index, and the end token index.
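
By default the PhraseMatcher compares the exact token text, so 'Vacuum Cleaner' would not match the pattern 'vacuum cleaner'. One way to relax this, sketched below, is the attr="LOWER" argument. The example assumes a blank English pipeline; the rule name NEW_PRODUCT and the variable names are made up.

```python
import spacy
from spacy.matcher import PhraseMatcher

blank_nlp = spacy.blank("en")
# attr="LOWER" compares lowercased token text, making the match case-insensitive.
lower_matcher = PhraseMatcher(blank_nlp.vocab, attr="LOWER")
# make_doc only tokenizes, which is all a phrase pattern needs here.
lower_matcher.add("NEW_PRODUCT", [blank_nlp.make_doc("vacuum cleaner")])

ad_doc = blank_nlp("Our Vacuum Cleaner beats every other vacuum cleaner.")
found = sorted(lower_matcher(ad_doc), key=lambda match: match[1])

# Each match is (match_id, start_token, end_token); the ID decodes to the rule name.
rule_names = {blank_nlp.vocab.strings[match_id] for match_id, _, _ in found}
matched_texts = [ad_doc[start:end].text for _, start, end in found]
print(rule_names, matched_texts)
```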

In [19]:
PROD = doc7.vocab.strings['PRODUCT']
new_ents = [Span(doc7, match[1], match[2], label=PROD) for match in matches]
doc7.ents = list(doc7.ents) + new_ents

show_ents(doc7)

vacuum cleaner - 37 - 51 - PRODUCT - Objects, vehicles, foods, etc. (not services)
vacuum cleaner - 72 - 86 - PRODUCT - Objects, vehicles, foods, etc. (not services)
first - 99 - 104 - ORDINAL - "first", "second", etc.


- **Notes** 

    We then create a Span from each match found by the PhraseMatcher, using the match's start and end token indices (match[1] and match[2]), and assign each Span the label PRODUCT from the spaCy vocabulary. Finally, we add the new Spans to the document's existing entities and call show_ents to display all of them, including the newly added ones.
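
Adding matched Spans to doc.ents fails if two of them share a token, since a token can belong to only one entity. A possible safeguard, sketched here, is spacy.util.filter_spans, which keeps the longest span among overlapping candidates. The example assumes a blank English pipeline and two hand-built candidate spans; the names are illustrative.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

blank_nlp = spacy.blank("en")
robot_doc = blank_nlp("The robot vacuum cleaner is new.")

# Two candidate entities that share the tokens "vacuum cleaner".
candidates = [
    Span(robot_doc, 1, 4, label="PRODUCT"),  # "robot vacuum cleaner"
    Span(robot_doc, 2, 4, label="PRODUCT"),  # "vacuum cleaner"
]

try:
    robot_doc.ents = candidates
    overlap_rejected = False
except ValueError:
    # A token cannot belong to two entities at once.
    overlap_rejected = True

# filter_spans keeps the longest span among overlapping ones.
robot_doc.ents = filter_spans(candidates)
kept = [ent.text for ent in robot_doc.ents]
print(overlap_rejected, kept)
```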

> # **Extras**

In [20]:
doc8 = nlp('Originally priced at $29.50, the sweater was marked down to five dollars.')
len([ent for ent in doc8.ents if ent.label_=='MONEY'])

2

- **Notes**

    A list comprehension keeps only the entities whose label is 'MONEY', and len counts them, giving the number of monetary entities in the document.
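
To count every label at once rather than a single one, collections.Counter is one option. The sketch below assumes a blank English pipeline with hand-assigned entities standing in for model predictions; the helper `labelled` and the extended sentence are hypothetical.

```python
from collections import Counter

import spacy

blank_nlp = spacy.blank("en")
price_text = "Originally priced at $29.50, the sweater was marked down to five dollars on Monday."
price_doc = blank_nlp(price_text)

def labelled(phrase, label):
    """Hypothetical helper: label the first occurrence of `phrase` in the text."""
    start = price_text.index(phrase)
    return price_doc.char_span(start, start + len(phrase), label=label)

# Hand-assigned entities standing in for model predictions.
price_doc.ents = [
    labelled("29.50", "MONEY"),
    labelled("five dollars", "MONEY"),
    labelled("Monday", "DATE"),
]

# Tally how many entities carry each label.
label_counts = Counter(ent.label_ for ent in price_doc.ents)
print(label_counts)
```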