In [6]:
import spacy

nlp = spacy.load('en_core_web_sm')
nlp

<spacy.lang.en.English at 0x7f2a3a65fa90>

In [34]:
# https://www.bbc.co.uk/news/uk-politics-57456641
txt = """World leaders meeting in Cornwall are to adopt strict measures on coal-fired power stations as part of the battle against climate change.

The G7 group will promise to move away from coal plants, unless they have technology to capture carbon emissions.

It comes as Sir David Attenborough warned that humans could be "on the verge of destabilising the entire planet".

He said G7 leaders face the most important decisions in human history.

The coal announcement came from the White House, which says it is the first time the leaders of wealthy nations have committed to keeping the projected global temperature rise to 1.5C.

That requires a range of urgent policies, chief among them being phasing out coal burning unless it includes carbon capture technology.

Coal is the world's dirtiest major fuel and ending its use is seen as a major step by environmentalists, but they also want guarantees rich countries will deliver on previous promises to help poorer countries cope with climate change.

The G7 will end the funding of new coal generation in developing countries and offer up to £2billion for poorer nations to stop using the fuel.

Climate change has been one of the key themes at the three-day summit in Carbis Bay, Cornwall.

Leaders of the seven major industrialised nations - the UK, US, Canada, Japan, France, Germany and Italy - are expected to set out global plans to reduce emissions from farming, transport, and the making of steel and cement.

And they will commit to protecting 30 percent of global land and marine areas for nature by 2030.

They are also expected to pledge to almost halve their emissions by 2030, relative to 2010 levels.
"""

When we call `nlp(text)` on a piece of text, spaCy first tokenizes it to produce a `Doc` object. This `Doc` is then passed through several components in turn, which together form the **processing pipeline**.

https://spacy.io/usage/processing-pipelines
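
To make the idea of "tokenize, then run each component in order" concrete, here is a minimal pure-Python sketch. It is a toy illustration only: `tokenize`, `make_component`, and `run` are hypothetical names, and the components do nothing but record that they ran. It is not how spaCy implements its tagger, parser, or entity recognizer.

```python
# Toy illustration only: these components are hypothetical stand-ins,
# not spaCy's real tagger, parser, or entity recognizer.

def tokenize(text):
    """Turn raw text into a minimal 'doc': a dict holding the tokens."""
    return {"tokens": text.split(), "annotations": []}

def make_component(name):
    """Build a component that records its name on the doc and returns the doc."""
    def component(doc):
        doc["annotations"].append(name)
        return doc
    return component

# Same shape as nlp.pipeline: a list of (name, component) pairs.
pipeline = [(name, make_component(name)) for name in ("tagger", "parser", "ner")]

def run(text):
    doc = tokenize(text)               # step 1: tokenization creates the doc
    for _name, component in pipeline:  # step 2: each component runs in order
        doc = component(doc)
    return doc

result = run("World leaders meeting in Cornwall")
print(result["tokens"])       # ['World', 'leaders', 'meeting', 'in', 'Cornwall']
print(result["annotations"])  # ['tagger', 'parser', 'ner']
```

The point to take away is the ordering: every component receives the same document object, adds its own annotations, and hands it on to the next one.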

In [35]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f2a35214390>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f2a35263980>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f2a351921a0>)]

In [36]:
doc = nlp(txt)
doc
type(doc)

spacy.tokens.doc.Doc

In [44]:
# POS tags: https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html
# for i in range(10):
#   print(doc[i].text, doc[i].tag_)

# Dependency parse: token.head is the syntactic parent of the given token.
# The root of the sentence ("are" here) is its own head.
for token in doc[:10]:
  print(token.text, token.head.text)

# Visualizing a dependency parse or named entities in a text
# from spacy import displacy
# from IPython.core.display import display, HTML
# html = displacy.render(doc[:11], style='dep')
# display(HTML(html))

World leaders
leaders are
meeting leaders
in meeting
Cornwall in
are are
to adopt
adopt are
strict measures
measures adopt
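
The head relation in the output above defines a tree, so following heads repeatedly leads from any token up to the root. The sketch below hard-codes the ten (token, head) pairs printed above into a dictionary; `heads` and `path_to_root` are hypothetical names for illustration. Keying by word text only works here because the ten words are all different; on a real `Doc` you would follow `token.head` instead.

```python
# Head of each token, copied from the printed output above.
# The root of the sentence ("are") is its own head.
heads = {
    "World": "leaders", "leaders": "are", "meeting": "leaders", "in": "meeting",
    "Cornwall": "in", "are": "are", "to": "adopt", "adopt": "are",
    "strict": "measures", "measures": "adopt",
}

def path_to_root(word):
    """Follow head links from `word` until reaching the token that is its own head."""
    path = [word]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path

print(path_to_root("Cornwall"))  # ['Cornwall', 'in', 'meeting', 'leaders', 'are']
print(path_to_root("strict"))    # ['strict', 'measures', 'adopt', 'are']
```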


In [61]:
# Doc-level entities: doc.ents holds the named-entity spans found in the text.
# GPE: geopolitical entity, i.e. countries, cities, states.
for ent in doc.ents:
  print(ent.text, ent.label_)  # ent.start_char and ent.end_char give the character offsets

Cornwall GPE
G7 GPE
David Attenborough PERSON
G7 CARDINAL
the White House ORG
first ORDINAL
up to £2billion MONEY
Climate ORG
three-day DATE
Carbis Bay GPE
Cornwall GPE
seven CARDINAL
UK GPE
US GPE
Canada GPE
Japan GPE
France GPE
Germany GPE
Italy GPE
30 percent PERCENT
2030 CARDINAL
2030 CARDINAL
2010 DATE


In [59]:
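from spacy import displacy
from IPython.display import display, HTML
html = displacy.render(doc, style='ent', jupyter=False)  # jupyter=False: return the markup as a string
display(HTML(html))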

At the token level, we can use the IOB scheme to determine whether a token is part of a named entity.

The IOB scheme marks whether a token is at the *beginning* of a named entity, *inside* one, or *outside* any entity. For example:

*  "Jeff Bezos" is a PERSON
*  "Jeff" is the beginning of the entity (IOB tag `B`)
*  "Bezos" is inside the entity (IOB tag `I`)
*  Words that aren't part of any entity are given the IOB tag `O`.
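
The rules above can be sketched in plain Python by turning entity spans into per-token tags. The function `iob_tags` and the example sentence are hypothetical and hand-labeled; nothing here is produced by a model. Each token gets a pair of IOB tag and entity label, with an empty label for tokens outside any entity.

```python
def iob_tags(tokens, entities):
    """Assign an (IOB tag, label) pair to every token.

    entities: list of (start, end, label) token spans, with `end` exclusive.
    """
    tags = [("O", "")] * len(tokens)   # default: outside any entity
    for start, end, label in entities:
        tags[start] = ("B", label)      # first token begins the entity
        for i in range(start + 1, end):
            tags[i] = ("I", label)      # remaining tokens are inside it
    return tags

# Hand-labeled example sentence (an assumption for illustration).
tokens = ["Jeff", "Bezos", "founded", "Amazon"]
entities = [(0, 2, "PERSON"), (3, 4, "ORG")]

for token, (iob, label) in zip(tokens, iob_tags(tokens, entities)):
    print(token, iob, label)
```

Note that a single-token entity such as "Amazon" gets `B` and never `I`.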

In [70]:
# IOB scheme
# I – Token is inside an entity.
# O – Token is outside an entity.
# B – Token is the beginning of an entity.
# ents = [(t.text, t.ent_iob_) for t in doc[:100]]
# ents

ents = [t.text for t in doc if t.ent_iob_ in ['B', 'I']]
ents

['Cornwall',
 'G7',
 'David',
 'Attenborough',
 'G7',
 'the',
 'White',
 'House',
 'first',
 'up',
 'to',
 '£',
 '2billion',
 'Climate',
 'three',
 '-',
 'day',
 'Carbis',
 'Bay',
 'Cornwall',
 'seven',
 'UK',
 'US',
 'Canada',
 'Japan',
 'France',
 'Germany',
 'Italy',
 '30',
 'percent',
 '2030',
 '2030',
 '2010']
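
The flat list above loses the entity boundaries: nothing in it says that 'David' and 'Attenborough' belong together. Since every entity starts with a `B` tag, the boundaries can be recovered from the tags. Below is a minimal sketch with a hypothetical helper `group_entities`; the tagged tokens are a hand-picked excerpt whose tags are consistent with the output above, not a contiguous sentence from the text. On a real `Doc`, the input could be built as `[(t.text, t.ent_iob_) for t in doc]`.

```python
def group_entities(tagged):
    """Group (text, iob) pairs into lists of tokens, one list per entity."""
    entities, current = [], None
    for text, iob in tagged:
        if iob == "B":                       # a new entity starts here
            current = [text]
            entities.append(current)
        elif iob == "I" and current is not None:
            current.append(text)             # continue the open entity
        else:
            current = None                   # 'O' closes any open entity
    return entities

# Hand-picked excerpt; tags are consistent with the notebook output above.
tagged = [
    ("Sir", "O"), ("David", "B"), ("Attenborough", "I"), ("warned", "O"),
    ("the", "B"), ("White", "I"), ("House", "I"),
]

print(group_entities(tagged))  # [['David', 'Attenborough'], ['the', 'White', 'House']]
```

Joining each group with spaces would not always reproduce the original text (for example 'three', '-', 'day'), which is one reason to prefer `doc.ents` when whole entity spans are needed.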

In [89]:
# How many of each entity type are mentioned in the text?
from collections import Counter

print(Counter([ent.label_ for ent in doc.ents]))

[ent for ent in doc.ents if ent.label_ == 'GPE']

Counter({'GPE': 11, 'CARDINAL': 4, 'ORG': 2, 'DATE': 2, 'PERSON': 1, 'ORDINAL': 1, 'MONEY': 1, 'PERCENT': 1})


[Cornwall,
 G7,
 Carbis Bay,
 Cornwall,
 UK,
 US,
 Canada,
 Japan,
 France,
 Germany,
 Italy]
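
The GPE list contains 'Cornwall' twice, so counting mentions and counting distinct entities give different answers. As a small follow-up sketch, the (text, label) pairs below are copied by hand from the `doc.ents` output earlier in this notebook, and `by_label` is a hypothetical name; with a live `Doc` you would loop over `doc.ents` and use `ent.text` and `ent.label_` instead.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the doc.ents output shown earlier.
ents = [
    ("Cornwall", "GPE"), ("G7", "GPE"), ("David Attenborough", "PERSON"),
    ("G7", "CARDINAL"), ("the White House", "ORG"), ("first", "ORDINAL"),
    ("up to £2billion", "MONEY"), ("Climate", "ORG"), ("three-day", "DATE"),
    ("Carbis Bay", "GPE"), ("Cornwall", "GPE"), ("seven", "CARDINAL"),
    ("UK", "GPE"), ("US", "GPE"), ("Canada", "GPE"), ("Japan", "GPE"),
    ("France", "GPE"), ("Germany", "GPE"), ("Italy", "GPE"),
    ("30 percent", "PERCENT"), ("2030", "CARDINAL"), ("2030", "CARDINAL"),
    ("2010", "DATE"),
]

# One Counter of entity texts per label.
by_label = defaultdict(Counter)
for text, label in ents:
    by_label[label][text] += 1

print(sum(by_label["GPE"].values()))   # 11 mentions
print(len(by_label["GPE"]))            # 10 distinct texts
print(by_label["GPE"].most_common(1))  # [('Cornwall', 2)]
```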