<a href="https://colab.research.google.com/github/archaeogeek/colab/blob/master/NLP_with_Spacy_WIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Setting up the working environment.
The book says spaCy 2.0 or later; this notebook runs on spaCy 3.4.4, so some of the book's 2.x examples need adapting.

In [None]:
!pip install spacy
!pip install pytextrank --upgrade
!python -m spacy info

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
2023-01-24 18:05:52.654728: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

spaCy version    3.4.4                         
Location         /usr/local/lib/python3.8/dist-packages/spacy
Platform         Linux-5.10.147+-x86_64-with-glibc2.29
Python version   3.8.10                        
Pipelines        en_core_web_sm (3.4.1)        



Install statistical models for spaCy. Models have the naming syntax lang_type_genre_size. Downloading 'en' downloads the default en_core_web_sm, which is the smallest general-purpose model for the English language. As of spaCy v3.0 the 'en' shortcut is deprecated in favour of the full package name.

In [None]:
!python -m spacy download en

2023-01-19 11:42:25.620464: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 69.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
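
The `lang_type_genre_size` naming scheme mentioned above can be made concrete with a small sketch that splits a package name into its four parts. The helper name `parse_model_name` is made up for illustration and is not part of spaCy.

```python
def parse_model_name(name):
    """Split a spaCy package name such as 'en_core_web_sm' into its parts."""
    # assumes exactly four underscore-separated parts, as in lang_type_genre_size
    lang, model_type, genre, size = name.split("_")
    return {"lang": lang, "type": model_type, "genre": genre, "size": size}


print(parse_model_name("en_core_web_sm"))
```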


# Basic NLP operations

**Tokenisation**
Parsing text into tokens, which can be words, numbers or punctuation


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I am driving to Kendal')
print([w.text for w in doc])

['I', 'am', 'driving', 'to', 'Kendal']


**Lemmatization**
The base form of any given token, e.g. the lemma for 'driving' is 'drive'. It also works across tenses, so 'slept' becomes 'sleep'.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'this dog enjoyed both running and sleeping, in fact it slept very well and runs most days')
for token in doc:
  print(token.text, token.lemma_)

this this
dog dog
enjoyed enjoy
both both
running run
and and
sleeping sleep
, ,
in in
fact fact
it it
slept sleep
very very
well well
and and
runs run
most most
days day


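We need to deal with nicknames somehow, and we can define them for spaCy to use. A `Doc` is processed once, when `nlp(...)` is called, so a rule added afterwards has no effect on a `doc` that already exists. The text has to be passed through `nlp` again after the rule is added, which is why the lemmas are printed from a fresh `nlp(...)` call. Note that in spaCy v3 tokenizer special cases can only set `ORTH` and `NORM`, so a lemma has to be assigned by a later pipeline component such as the `attribute_ruler`.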

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I am driving to Lancaster, in Lancs')
print([w.text for w in doc])
# spaCy v3 raises a ValueError if a tokenizer special case sets LEMMA
# (only ORTH and NORM are allowed), so map the nickname with the
# attribute_ruler that ships with en_core_web_sm instead
ruler = nlp.get_pipe('attribute_ruler')
ruler.add(patterns=[[{'ORTH': 'Lancs'}]], attrs={'LEMMA': 'Lancashire'})
# process the text again so the new rule is applied
print([w.lemma_ for w in nlp(u'I am driving to Lancaster in Lancs')])

['I', 'am', 'driving', 'to', 'Lancaster', ',', 'in', 'Lancs']

**Part-of-speech Tagging**
Tags each word in a sentence as a noun, verb, etc. Standard grammatical categories, along with punctuation, are called coarse-grained parts of speech (`token.pos_`). Tenses, types of pronoun, etc. are fine-grained parts of speech (`token.tag_`). NLP needs this for multi-sentence phrases because lemmatization reduces everything to its base form, so tense and anything else signifying intent disappears. The following snippet picks out tokens tagged as a gerund or present participle (`VBG`, used here for the *present progressive*) or as a *base* form (`VB`), to find the part of the phrase signifying intent rather than something that happened in the past.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am driving to Kendal')
print([w.text for w in doc if w.tag_ == 'VBG' or w.tag_ == 'VB'])
print([w.text for w in doc if w.pos_ == 'PROPN'])

['driving']
['Lancaster', 'Kendal']


**Syntactic Relations** The next step is to link the proper noun with the verb that best signifies intent. spaCy uses syntactic dependency labels to describe the relationship between pairs of words in a sentence. In each pair one word is the parent, or `head`, and the other is the child; here the verbs are heads and the nouns are children (which can have children of their own). A head can have many children but a child has only one head, so the dependency label (e.g. `dobj` or `pobj`) is always assigned to the child.



In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am walking to Liverpool')
for token in doc:
  print(token.head.text, token.text, token.pos_, token.dep_)

driven I PRON nsubj
driven have AUX aux
driven driven VERB ROOT
driven to ADP prep
to Lancaster PROPN pobj
driven . PUNCT punct
walking Now ADV advmod
walking I PRON nsubj
walking am AUX aux
walking walking VERB ROOT
walking to ADP prep
to Liverpool PROPN pobj


In the above, `token.head.text` and `token.dep_` show that the `ROOT` verb and the `pobj` together carry the intent part of the phrase, so we can extract those two. We can also split the phrase up into sentences with `doc.sents`.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am cycling to Liverpool')
for sent in doc.sents:
  print([w.text for w in sent if w.dep_ == 'ROOT' or w.dep_ == 'pobj'])

['driven', 'Lancaster']
['cycling', 'Liverpool']
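
Filtering on `ROOT` and `pobj` relies on each sentence having exactly one of each. As a rough sketch, the head links themselves could be followed instead: from each `pobj`, step up to its preposition and then to the verb that governs it. To keep the sketch runnable without a downloaded model, the `Doc` below is annotated by hand with the heads and labels from the parser output printed earlier; with a real pipeline you would simply use `doc = nlp(text)`. The function name `intent_pairs` is made up for illustration.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # only the vocab is needed for a hand-built Doc

# Hand-annotated copy of the parse shown earlier (heads are absolute token indices)
words = ["I", "have", "driven", "to", "Lancaster", ".",
         "Now", "I", "am", "walking", "to", "Liverpool"]
heads = [2, 2, 2, 2, 3, 2, 9, 9, 9, 9, 9, 10]
deps = ["nsubj", "aux", "ROOT", "prep", "pobj", "punct",
        "advmod", "nsubj", "aux", "ROOT", "prep", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)


def intent_pairs(doc):
    """Pair each prepositional object with the verb governing its preposition."""
    pairs = []
    for token in doc:
        if token.dep_ == "pobj":
            # token.head is the preposition ('to'); its head is the verb
            verb = token.head.head
            pairs.append((verb.text, token.text))
    return pairs


print(intent_pairs(doc))
```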


**Named Entity Recognition** Named entities are real-world objects such as people, organisations and locations.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am walking to Galgate. I have been to Burton-in-Kendal too.')
for token in doc:
  if token.ent_type !=0:
    print(token.text, token.ent_type_)

Lancaster GPE
Galgate ORG
Burton ORG


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am walking to Galgate. I have been to Horndean and Burton-in-Kendal too.')
for token in doc:
  if token.ent_type !=0:
    print(token.text, token.ent_type_)

Lancaster GPE
Galgate ORG
Horndean LOC
Burton PERSON


# Gotchas

**UK Placenames** If we try a small village with a multi-part name such as *Burton-in-Kendal*, spaCy doesn't recognise it as a single proper noun. So we're going to need to try different entity taggers or dictionaries to find one that understands more UK places (which could be hills, mountains, etc. too). A single-word small place such as *Galgate* that's not in the list is recognised as a proper noun (`PROPN`) but isn't tagged as the object (`pobj`). Then... if we use named entity recognition, it mis-tags *Lancaster* and *Burton* as `PERSON` in some sentences, which is funny but problematic.

This might be the point at which we need to use something like [mordecai](https://github.com/openeventdata/mordecai) to do our geoparsing, which will require an Elasticsearch (ES) install...

Alternative approach: based on Chapter 10 (Training Models), there is an example that corrects the tag for a location from LOC to GPE. The section "Automating the Example Creation Process" takes a text file and a list of entities, and says "for each entity in this list that appears in the text file, classify as GPE". We should be able to build both the text file and the list from the OS Open Names data and then use them to train the model. Let's see if we can use this approach to add Lancaster as a GPE in the model though...

In [None]:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# Modify tokenizer infix patterns to stop the tokeniser breaking on hyphens
# from https://spacy.io/usage/linguistic-features#native-tokenizers
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

doc = nlp(u'Could you send a taxi to Horndean? I need to get to Lancaster, via Burton-in-Kendal, to meet Richard Burton')
train_exams = []
places = ['Horndean', 'Burton-in-Kendal', 'Lancaster']
for sent in doc.sents:
  entities = []
  for token in sent:
    # sanity check the existing token type and type number before we process
    print(token.text, token.ent_type_, token.ent_type)
    if token.text in places:
      start = token.idx - sent.start_char
      # force entries from the list to have an entity type of GPE
      # note this doesn't affect the type number which could still be zero
      entity = (start, start + len(token), 'GPE')
    else:
      start = token.idx - sent.start_char
      # other entries not in the places list get their existing type and type number
      entity = (start, start + len(token), token.ent_type_)
    # prevent adding all the entities that still have no entity type
    if entity[2] !='':
      entities.append(entity)
  tpl = (sent.text, {'entities': entities})
  train_exams.append(tpl)
# print resulting correctly tagged training examples
print(train_exams)

Could  0
you  0
send  0
a  0
taxi  0
to  0
Horndean LOC 385
?  0
I  0
need  0
to  0
get  0
to  0
Lancaster PERSON 380
,  0
via  0
Burton-in-Kendal  0
,  0
to  0
meet  0
Richard PERSON 380
Burton PERSON 380
[('Could you send a taxi to Horndean?', {'entities': [(25, 33, 'GPE')]}), ('I need to get to Lancaster, via Burton-in-Kendal, to meet Richard Burton', {'entities': [(17, 26, 'GPE'), (32, 48, 'GPE'), (58, 65, 'PERSON'), (66, 72, 'PERSON')]})]


The above approach seems to tag the places correctly and doesn't add all the other untagged tokens to the list of training examples.
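
The `(text, {'entities': [...]})` tuples follow the older training format. In spaCy v3, training works on `Example` objects, so a possible next step is to convert the tuples. The sketch below assumes the two tuples printed above and uses a blank English pipeline, because only the tokenizer is needed for the conversion.

```python
import spacy
from spacy.training import Example

# Training tuples in the (text, {'entities': [...]}) shape printed above
train_exams = [
    ('Could you send a taxi to Horndean?', {'entities': [(25, 33, 'GPE')]}),
    ('I need to get to Lancaster, via Burton-in-Kendal, to meet Richard Burton',
     {'entities': [(17, 26, 'GPE'), (32, 48, 'GPE'),
                   (58, 65, 'PERSON'), (66, 72, 'PERSON')]}),
]

# A blank pipeline is enough here: nothing is predicted, text is only tokenized
nlp = spacy.blank('en')

examples = []
for text, annotations in train_exams:
    # make_doc() only tokenizes; from_dict() attaches the gold entity spans
    examples.append(Example.from_dict(nlp.make_doc(text), annotations))

for example in examples:
    print([(ent.text, ent.label_) for ent in example.reference.ents])
```

The character offsets have to line up with token boundaries for the spans to survive the conversion, which they do here because the tokenizer splits off the trailing punctuation.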

# spaCy Tokenisation


Aka how we deal with things like "London Road" without splitting it into "London" and "Road"

The example below shows using `token.rights` and `token.lefts` to pick up the left and right children of a token

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live on Greaves Road in Lancaster but I used to live on Greenfield Street")
for sent in doc.sents:
  for token in sent:
    print(token.text, token.ent_type_, [t.text for t in token.lefts], token.n_lefts, [t.text for t in token.rights], token.n_rights)
# doc[2] is 'on': no left children and one right child ('Road')
print(doc[2].n_lefts)  # 0
print(doc[2].n_rights)  # 1

I  [] 0 [] 0
live  ['I'] 1 ['on', 'but', 'used'] 3
on  [] 0 ['Road'] 1
Greaves FAC [] 0 [] 0
Road FAC ['Greaves'] 1 ['in'] 1
in  [] 0 ['Lancaster'] 1
Lancaster GPE [] 0 [] 0
but  [] 0 [] 0
I  [] 0 [] 0
used  ['I'] 1 ['live'] 1
to  [] 0 [] 0
live  ['to'] 1 ['on'] 1
on  [] 0 ['Street'] 1
Greenfield FAC [] 0 [] 0
Street FAC ['Greenfield'] 1 [] 0
0
1


https://stackoverflow.com/questions/57206701/spacy-tokenizer-rule-for-exceptions-that-contain-whitespace suggests a couple of approaches. The simplest might be to merge noun chunks?

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
doc = nlp(u'I live on Greaves Road')
for token in doc:
  print(token.text, token.pos_, token.dep_)


I PRON nsubj
live VERB ROOT
on ADP prep
Greaves Road PROPN pobj


# Adding a terminology list to our model

Thanks to https://notebook.community/psychemedia/parlihacks/notebooks/Text%20Scraping%20-%20Notes, this might finally be the approach for pre-loading our additional location entries! The next step is (I think) to turn this into a custom pipeline that we can load alongside the standard one, rather than munging the list each time.

In [None]:
import spacy
import pandas as pd
from spacy import displacy
from spacy.pipeline import EntityRuler


nlp = spacy.load("en_core_web_sm")
# make sure we load and run our custom pipeline before the standard one
# else our entities get overwritten
ruler = nlp.add_pipe("entity_ruler", before="ner")

# make a set of patterns from the csv
# and then add them to our custom pipeline
# note this relies on top50_list.csv being loaded into the notebook session
#opennames = pd.read_csv('top50_list.csv')
opennames = pd.read_csv('osopennames_singlecolumn.csv')
#opennames_list= opennames["THING"].tolist()
opennames_list= opennames["Entity_Name"].tolist()
patterns=[]
for t in opennames_list:
  patterns.append({"label":"GPE", "pattern":t})
ruler.add_patterns(patterns)

# test with a doc that has some entries from the list
# and some that are not
# and use displacy to render the text showing the entities
doc = nlp("BN21 1AA is in my csv along with Saxon Ground, St Aubyn's Road and A2040. London is not, and Richard Burton is certainly not!")
displacy.render(doc, jupyter=True, style='ent')
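
One way the "custom pipeline we can load" idea might look: add the patterns once, save the whole pipeline to disk, and reload it later without rebuilding the list. This is only a sketch. It uses a blank English pipeline and three hand-picked names instead of the csv, and a temporary directory stands in for wherever the pipeline would really be stored.

```python
import tempfile

import spacy

# Hypothetical sample of names; in the notebook these would come from the csv
place_names = ["Saxon Ground", "Wick", "Burton-in-Kendal"]

# Blank pipeline keeps the sketch small; there is no "ner" component to order against
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": name} for name in place_names])

with tempfile.TemporaryDirectory() as tmp_dir:
    # to_disk() writes the config and the ruler's patterns
    nlp.to_disk(tmp_dir)
    # spacy.load() accepts a directory path as well as a package name
    reloaded = spacy.load(tmp_dir)

doc = reloaded("I walked from Saxon Ground to Wick via Burton-in-Kendal")
print([(ent.text, ent.label_) for ent in doc.ents])
```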


Next we need to test this out with the whole csv, which has now been committed to the GitHub repository at https://github.com/archaeogeek/colab, so let's download it, check which folder we're in, and check that it has the right sort of data in it.

In [1]:
!wget https://raw.githubusercontent.com/archaeogeek/colab/master/osopennames_singlecolumn_2.csv
!pwd
!head -n 10 osopennames_singlecolumn_2.csv

--2023-02-08 20:36:58--  https://raw.githubusercontent.com/archaeogeek/colab/master/osopennames_singlecolumn_2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44453616 (42M) [text/plain]
Saving to: ‘osopennames_singlecolumn_2.csv’


2023-02-08 20:36:58 (166 MB/s) - ‘osopennames_singlecolumn_2.csv’ saved [44453616/44453616]

/content
"label","pattern"
GPE,Westing
GPE,Underhoull
GPE,Gunnister
GPE,Gloup
GPE,Midbrake
GPE,Uyeasound
GPE,Burragarth
GPE,Cullivoe
GPE,Wick


In [14]:
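import pandas as pd
# read everything as strings and skip rows with no name
df = pd.read_csv('osopennames_singlecolumn_2.csv', dtype=str).dropna(subset=["pattern"])
# keep each pattern as a plain string (a phrase pattern): the entity ruler
# accepts a string or a list of per-token dicts, but not a single dict
df.to_json('osopennames.jsonl', orient='records', lines=True)



Hmm, my first attempt at the cell below didn't work :-( The most likely cause is that each pattern had been written as a single dict (`{"lower": x}`), which the entity ruler rejects. Plain string patterns matched on `LOWER` give the case-insensitive matching that was intended, and the ruler is now added before `ner`, as the comment says.

In [23]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
# make sure we load and run our custom pipeline before the standard one
# else our entities get overwritten
# phrase_matcher_attr="LOWER" makes the string patterns case-insensitive
ruler = nlp.add_pipe("entity_ruler", before="ner", config={"phrase_matcher_attr": "LOWER"})
ruler.from_disk("osopennames.jsonl")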

# test with a doc that has some entries from the list
# and some that are not
# and use displacy to render the text showing the entities
doc = nlp("BN21 1AA is in my csv along with Saxon Ground, St Aubyn's Road and A2040. London is not, and Richard Burton is certainly not!")
displacy.render(doc, jupyter=True, style='ent')




# A side look at Geograpy #DEPRECATED


Initial tests show that the location information is pretty basic, but it does get it from an SQLite db that perhaps we can tweak

In [None]:
pip install geograpy3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import geograpy
test = "I am from Lancaster"
places = geograpy.get_geoPlace_context(text=test)
print(places)

countries=[]
regions=[]
cities=[]
other=[]


# A side look at Geopy #DEPRECATED

Technically a geocoder rather than a geoparser- it copes very well with our standard examples, and even postcodes, but fails when the string contains more than just the location (eg "I live at Greaves Road").

In [None]:
pip install geopy --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting geopy
  Downloading geopy-2.3.0-py3-none-any.whl (119 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 119.8/119.8 KB 5.0 MB/s eta 0:00:00
Installing collected packages: geopy
  Attempting uninstall: geopy
    Found existing installation: geopy 1.17.0
    Uninstalling geopy-1.17.0:
      Successfully uninstalled geopy-1.17.0
Successfully installed geopy-2.3.0


In [None]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="geocode_test")
location = geolocator.geocode("LA1 4UR", country_codes="gb")
print(location.address)


Lancaster, Lancashire, England, LA1 4UR, United Kingdom


So... maybe the approach is to use spaCy for the tokenisation and then pass the tokens to geopy?
We'd need to include:
* regex for postcodes
* regex for common street suffixes
* hyphenation
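
As a starting point for the first two items in that list, here is a rough standard-library sketch. Both patterns are simplifications I'm assuming for illustration: the postcode pattern only checks the general shape of a UK postcode (it does not validate real postcode areas), and the street suffix list is a small sample rather than a complete one. The names `POSTCODE_RE`, `STREET_RE` and `find_candidates` are made up.

```python
import re

# Simplified UK postcode shape: outward code, optional space, inward code.
# Assumes uppercase input; does not check that the postcode area exists.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}[0-9][A-Z0-9]?\s?[0-9][A-Z]{2}\b")

# One or more capitalised words followed by a street suffix.
# The suffix list is an illustrative subset, not an exhaustive one.
STREET_RE = re.compile(r"\b(?:[A-Z][a-z']+\s)+(?:Road|Street|Close|Lane|Avenue)\b")


def find_candidates(text):
    """Return postcode-like and street-like substrings found in text."""
    return {
        "postcodes": POSTCODE_RE.findall(text),
        "streets": STREET_RE.findall(text),
    }


print(find_candidates("I live on Greaves Road, LA1 4UR, but used to live on Greenfield Street"))
```

The matched strings could then be passed to the geocoder on their own, which avoids the problem of geocoding a whole sentence.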

# Putting it all together #DEPRECATED

The code below combines the hyphen handling and merging noun chunks. It gives pretty good results *when* it properly classifies something as a proper noun, but note that it misclassifies Horndean as an adjective, so Horndean never gets geocoded. We still need some custom regex for dealing with postcodes.

In [None]:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="geocode_test")
nlp = spacy.load('en_core_web_sm')

# Modify tokenizer infix patterns to stop the tokeniser breaking on hyphens
# from https://spacy.io/usage/linguistic-features#native-tokenizers
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

# merge noun chunks into single tokens
nlp.add_pipe("merge_noun_chunks")
doc = nlp(u'I am a dog called Richard Burton. I live on Jensen Close but used to live in Greenfield Street, before that I lived in Burton-in-Kendal and I want to live in Horndean')
for token in doc:
  print(token.text, token.pos_, token.dep_)
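  if token.pos_ == "PROPN":
    # geocode() returns None when nothing matches, so guard before printing
    location = geolocator.geocode(token.text, country_codes="gb")
    if location is not None:
      print(location.address)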


I PRON nsubj
am AUX ROOT
a dog NOUN attr
called VERB acl
Richard Burton PROPN oprd
Richard Burton, 6, Lyndhurst Road, Belsize Park, London Borough of Camden, London, Greater London, England, NW3 5PE, United Kingdom
. PUNCT punct
I PRON nsubj
live VERB ROOT
on ADP prep
Jensen Close PROPN pobj
Jensen Close, Haverbreaks, Aldcliffe, Lancaster, Lancashire, England, LA1 5BP, United Kingdom
but CCONJ cc
used VERB conj
to PART aux
live VERB xcomp
in ADP prep
Greenfield Street PROPN pobj
Greenfield Street, Holywell, Flintshire, Cymru / Wales, CH8 7PN, United Kingdom
, PUNCT punct
before ADP prep
that PRON pobj
I PRON nsubj
lived VERB advcl
in ADP prep
Burton-in-Kendal PROPN pobj
Burton-in-Kendal, South Lakeland, Cumbria, England, United Kingdom
and CCONJ cc
I PRON nsubj
want VERB conj
to PART aux
live VERB xcomp
in ADP prep
Horndean ADJ pobj


# Keyword extraction
Examples include https://betterprogramming.pub/extract-keywords-using-spacy-in-python-4a8415478fbf, which works but is a bit unsophisticated. PyTextRank ought to be great, but I can't get the simple code from https://github.com/DerwenAI/pytextrank to run: it fails with the `AttributeError` shown below.


In [None]:
import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(text)

# examine the top-ranked phrases in the document
for phrase in doc._.phrases:
    print(phrase.text)
    print(phrase.rank, phrase.count)
    print(phrase.chunks)

AttributeError: ignored