<a href="https://colab.research.google.com/github/archaeogeek/colab/blob/main/NLP_with_Spacy_WIP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

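Setting up the working environment.
The book says spaCy 2.0 or later. Note that several cells below use the spaCy 3 API, for example `nlp.add_pipe("entity_ruler")`.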

In [None]:
!pip install spacy
!pip install pytextrank --upgrade
!python -m spacy info

Install statistical models for spaCy. Models follow the naming syntax lang_type_genre_size. `en_core_web_sm` is the smallest general-purpose model for the English language, and it is the one loaded in the cells below. The `en` shortcut from spaCy 2 is no longer accepted by spaCy 3, so the full package name is used.

In [None]:
!python -m spacy download en_core_web_sm
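
As a quick illustration of the naming syntax, here is a minimal sketch that splits a package name into its four parts. `parse_model_name` is a hypothetical helper written for this note, not part of spaCy.

```python
def parse_model_name(name):
    """Split a spaCy package name such as 'en_core_web_sm' into its four parts."""
    lang, model_type, genre, size = name.split("_")
    return {"lang": lang, "type": model_type, "genre": genre, "size": size}


parts = parse_model_name("en_core_web_sm")
print(parts)
```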

# Basic NLP operations

**Tokenisation**
Parsing text into tokens, which can be words, numbers or punctuation


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I am driving to Kendal')
print([w.text for w in doc])
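
To see numbers and punctuation come out as separate tokens, the sketch below uses a blank English pipeline, which only contains the rule-based tokenizer and so needs no model download. The example sentence is made up for this note.

```python
import spacy

# A blank pipeline has a tokenizer but no tagger, parser or NER
nlp = spacy.blank("en")
doc = nlp("I am driving 30 miles to Kendal!")

# Print each token with a few lexical flags
for token in doc:
    print(token.text, token.is_alpha, token.like_num, token.is_punct)

tokens = [t.text for t in doc]
```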

**Lemmatization**
The base form of any given token, e.g. the lemma for 'driving' is 'drive'. It should also work across tenses.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'this dog enjoyed both running and sleeping, in fact it slept very well and runs most days')
for token in doc:
  print(token.text, token.lemma_)

We need to deal with nicknames somehow: we can define them for spaCy to use. It doesn't work if you just reuse `doc` in the final line, because that `doc` was created before the rule was added. The text has to be processed again with `nlp(...)` after the rule is registered, which is what the final print statement does. In spaCy 3 a tokenizer special case can only set `ORTH` and `NORM`, so the lemma is set with the attribute ruler instead.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I am driving to Lancaster, in Lancs')
print([w.text for w in doc])
# map the nickname to its lemma with the pipeline's attribute ruler
ruler = nlp.get_pipe('attribute_ruler')
ruler.add(patterns=[[{'ORTH': 'Lancs'}]], attrs={'LEMMA': 'Lancashire'})
# re-process the text: the doc created above predates the new rule
print([w.lemma_ for w in nlp('I am driving to Lancaster in Lancs')])

**Part-of-speech Tagging**
Tags each word in a sentence as a noun, verb and so on. Standard grammatical categories, along with punctuation, are called coarse-grained parts of speech (`token.pos_`). Tenses, types of pronoun and the like are fine-grained parts of speech (`token.tag_`). We need this for multi-sentence text because lemmatization reduces every word to its base form, so tense and anything else signifying intent disappears. The following snippet picks out the tokens tagged `VBG` (gerund or present participle, as in the *present progressive* 'am driving') or `VB` (*base* form) to find the part of the text that signifies intent rather than something that happened in the past.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am driving to Kendal')
print([w.text for w in doc if w.tag_ == 'VBG' or w.tag_ == 'VB'])
print([w.text for w in doc if w.pos_ == 'PROPN'])
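
If the tag names are hard to remember, `spacy.explain` returns a short description for a label and needs no model. A small sketch for the tags used above, plus the two past-tense tags for comparison:

```python
import spacy

labels = ["VB", "VBG", "VBD", "VBN", "PROPN"]
# Collect spaCy's own glossary description for each label
explanations = {label: spacy.explain(label) for label in labels}
for label, text in explanations.items():
    print(label, "->", text)
```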

**Syntactic Relations** The next step is to link the proper noun with the verb that best signifies intent. spaCy uses syntactic dependency labels to show the relationship between pairs of words in a sentence. Verb phrases are parents, noun phrases are children (but can have children of their own). The relationship between parents, or `heads`, and children is one-to-many, so the dependency label (for example `dobj`) is always assigned to the child.



In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am walking to Liverpool')
for token in doc:
  print(token.head.text, token.text, token.pos_, token.dep_)

In the output above, `token.head.text` and `token.dep_` show that the `ROOT` and the `pobj` together give the intent part of the phrase. We can also split the text into sentences with `doc.sents`.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am cycling to Liverpool')
for sent in doc.sents:
  print([w.text for w in sent if w.dep_ == 'ROOT' or w.dep_ == 'pobj'])
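
To make the head and child relationship concrete, here is a sketch that climbs from each `pobj` up to its `ROOT`. It uses a hand-annotated `Doc` so it runs without a model; the heads and labels are written by hand as an assumption about what the parser would produce, and `governing_root` is a hypothetical helper.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["Now", "I", "am", "cycling", "to", "Liverpool"]
# heads are absolute token indices; the ROOT points at itself
heads = [3, 3, 3, 3, 3, 4]
deps = ["advmod", "nsubj", "aux", "ROOT", "prep", "pobj"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)


def governing_root(token):
    """Follow head links upwards until the token that is its own head."""
    while token.head.i != token.i:
        token = token.head
    return token


# Pair each object of a preposition with the verb at the top of its tree
intents = [(governing_root(t).text, t.text) for t in doc if t.dep_ == "pobj"]
print(intents)
```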

**Named Entity Recognition** Named entities are real-world objects such as people, organisations and locations.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am walking to Galgate. I have been to Burton-in-Kendal too.')
for token in doc:
  if token.ent_type !=0:
    print(token.text, token.ent_type_)

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have driven to Lancaster. Now I am walking to Galgate. I have been to Horndean and Burton-in-Kendal too.')
for token in doc:
  if token.ent_type !=0:
    print(token.text, token.ent_type_)
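
The loops above print one line per token, so a multi-word entity shows up in pieces. `doc.ents` groups the tokens into spans instead. The sketch below uses a hand-annotated `Doc` (the IOB tags are typed in by hand, not predicted by a model) to show the difference.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["I", "walked", "to", "Burton", "in", "Kendal"]
# hand-written IOB tags: one three-token GPE entity
ents = ["O", "O", "O", "B-GPE", "I-GPE", "I-GPE"]
doc = Doc(vocab, words=words, ents=ents)

# token-level view, as in the cells above
per_token = [(t.text, t.ent_type_) for t in doc if t.ent_type != 0]
# span-level view
per_span = [(e.text, e.label_) for e in doc.ents]
print(per_token)
print(per_span)
```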

# Gotchas

**UK Placenames** If we try a small village with a multi-part name such as *Burton-in-Kendal*, spaCy doesn't recognise it as a single proper noun. So we're going to need to try different entity taggers or dictionaries to find one that understands more UK places (which could be hills, mountains etc. too). If it's a single small place such as *Galgate* that's not in the list, spaCy recognises it as a proper noun (`PROPN`) but doesn't tag it as the object (`pobj`). Then, if we use named entity recognition, it mis-tags *Lancaster* and *Burton* as `PERSON`, which is funny but problematic.

This might be the point at which we need to use something like [mordecai](https://github.com/openeventdata/mordecai) to do our geoparsing, which will require an ES install...

Alternative approach: based on Chapter 10 (Training Models), there is an example that corrects the tag for a location from LOC to GPE. The section "Automating the Example Creation Process" takes a text file and a list of entities, and says "for each entity in this list that appears in the text file, classify as GPE". We should be able to build both the text file and the list from the OS Open Names data and then use them to train the model. First, let's see if we can use this approach to add Lancaster as a GPE...

In [None]:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# Modify tokenizer infix patterns to stop the tokeniser breaking on hyphens
# from https://spacy.io/usage/linguistic-features#native-tokenizers
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)

infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

doc = nlp(u'Could you send a taxi to Horndean? I need to get to Lancaster, via Burton-in-Kendal, to meet Richard Burton')
train_exams = []
places = ['Horndean', 'Burton-in-Kendal', 'Lancaster']
for sent in doc.sents:
  entities = []
  for token in sent:
    # sanity check the existing token type and type number before we process
    print(token.text, token.ent_type_, token.ent_type)
    if token.text in places:
      start = token.idx - sent.start_char
      # force entries from the list to have an entity type of GPE
      # note this doesn't affect the type number which could still be zero
      entity = (start, start + len(token), 'GPE')
    else:
      start = token.idx - sent.start_char
      # other entries not in the places list get their existing type and type number
      entity = (start, start + len(token), token.ent_type_)
    # prevent adding all the entities that still have no entity type
    if entity[2] !='':
      entities.append(entity)
  tpl = (sent.text, {'entities': entities})
  train_exams.append(tpl)
# print resulting correctly tagged training examples
print(train_exams)
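
Before using tuples like these for training, it may be worth checking that every `(start, end, label)` offset lines up with token boundaries. A minimal sketch, assuming the same `(text, {'entities': [...]})` shape as above; `misaligned_entities` is a hypothetical helper and the example sentence is made up.

```python
import spacy

nlp = spacy.blank("en")


def misaligned_entities(example):
    """Return the entity tuples whose offsets do not match token boundaries."""
    text, annotations = example
    doc = nlp.make_doc(text)
    return [
        (start, end, label)
        for start, end, label in annotations["entities"]
        # char_span returns None when the offsets cut through a token
        if doc.char_span(start, end, label=label) is None
    ]


text = "I need to get to Lancaster"
start = text.index("Lancaster")
good = (text, {"entities": [(start, start + len("Lancaster"), "GPE")]})
bad = (text, {"entities": [(start, start + 4, "GPE")]})  # stops mid-token
print(misaligned_entities(good), misaligned_entities(bad))
```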

The above approach seems to tag the places correctly, and it keeps tokens with no entity type out of the list of training examples.

# spaCy Tokenisation


Aka how we deal with things like "London Road" without splitting it into "London" and "Road"

The example below shows using `token.rights` and `token.lefts` to pick up the left and right children of a token

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I live on Greaves Road in Lancaster but I used to live on Greenfield Street")
for sent in doc.sents:
  for token in sent:
    # each token with its entity type, its left children and its right children
    print(token.text, token.ent_type_, [t.text for t in token.lefts], token.n_lefts, [t.text for t in token.rights], token.n_rights)
# doc[2] is 'on': count its left and right children
print(doc[2].n_lefts)
print(doc[2].n_rights)

https://stackoverflow.com/questions/57206701/spacy-tokenizer-rule-for-exceptions-that-contain-whitespace suggests a couple of approaches. The simplest might be to merge noun chunks?

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
doc = nlp(u'I live on Greaves Road')
for token in doc:
  print(token.text, token.pos_, token.dep_)
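
Another option, if we already know which tokens belong together, is to merge them by hand with the retokenizer. This sketch runs on a blank pipeline; the slice `doc[3:5]` is hard-coded for this one sentence, which is an assumption that would need replacing with a proper lookup.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I live on Greaves Road")

# Merge 'Greaves' and 'Road' into a single token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5])

tokens = [t.text for t in doc]
print(tokens)
```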


# Adding a terminology list to our model

Thanks to https://notebook.community/psychemedia/parlihacks/notebooks/Text%20Scraping%20-%20Notes, this might finally be the approach for pre-loading our additional location entries! The next step is (I think) to turn this into a custom pipeline that we can load alongside the standard one, rather than munging the list each time.

In [None]:
import spacy
import pandas as pd
from spacy import displacy
from spacy.pipeline import EntityRuler


nlp = spacy.load("en_core_web_sm")
# see what we get before we do our stuff
doc = nlp("Saxon Ground, St Aubyn's Road and A2040 are places, and Richard Burton is not")
displacy.render(doc, jupyter=True, style='ent')

# make sure we load and run our custom pipeline before the standard one
# else our entities get overwritten
ruler = nlp.add_pipe("entity_ruler", before="ner")

# make a set of patterns from the csv
# and then add them to our custom pipeline
# note this relies on output_singlecolumn.csv being loaded into the notebook session
# where output_singlecolumn.csv is extracted from 
# https://bitbucket.org/elenarobu/osopennames-geonames-columnsmap/src/main/
opennames = pd.read_csv('output_singlecolumn.csv', dtype='string')
opennames_list= opennames.iloc[:,0].to_list()
patterns=[]
for t in opennames_list:
  patterns.append({"label":"GPE", "pattern":t})
print(len(patterns))
ruler.add_patterns(patterns)

# test with a doc that has some entries from the list
# and some that are not
# and use displacy to render the text showing the entities
doc = nlp("Saxon Ground, St Aubyn's Road and A2040 are places, and Richard Burton is not")
displacy.render(doc, jupyter=True, style='ent')
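
For a version that runs without the CSV or a downloaded model, the sketch below adds an entity ruler to a blank English pipeline with a short in-memory list. The three names are taken from the test sentence and stand in for the OS Open Names column.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# stand-in for the first column of the CSV
place_names = ["Saxon Ground", "St Aubyn's Road", "A2040"]
ruler.add_patterns([{"label": "GPE", "pattern": name} for name in place_names])

doc = nlp("Saxon Ground, St Aubyn's Road and A2040 are places, and Richard Burton is not")
ents = [(e.text, e.label_) for e in doc.ents]
print(ents)
```

With no statistical `ner` component in the pipeline, only the listed names are tagged, which makes it easy to see what the ruler contributes on its own.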


Compare with another approach: phrase matching

In [None]:
import spacy
import pandas as pd
from spacy import displacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

# make a set of patterns from the csv
# and then create a phrase matcher as per 
# https://notebook.community/psychemedia/parlihacks/notebooks/Text%20Scraping%20-%20Notes
# note this relies on output_singlecolumn.csv being loaded into the notebook session
# where output_singlecolumn.csv is extracted from 
# https://bitbucket.org/elenarobu/osopennames-geonames-columnsmap/src/main/
opennames = pd.read_csv('output_singlecolumn.csv', dtype='string')
opennames_list = opennames.iloc[:,0].to_list()

# PhraseMatcher patterns are Doc objects; make_doc only tokenises,
# which is much faster than running the full pipeline on every name
patterns = [nlp.make_doc(t) for t in opennames_list]
print(len(patterns))
matcher.add('GPE', patterns)

# test with a doc that has some entries from the list
# and some that are not
doc = nlp("Saxon Ground, St Aubyn's Road and A2040 are places, and Richard Burton is not")
# the matcher does not change doc.ents by itself, so fetch the matches as spans
# (labelled with the match key 'GPE') and merge them with the model's entities;
# filter_spans drops overlaps, preferring the longest span
matches = matcher(doc, as_spans=True)
doc.ents = filter_spans(matches + list(doc.ents))
displacy.render(doc, jupyter=True, style='ent')

Next we need to test this out with the whole CSV, which has now been committed to the GitHub repository at https://github.com/archaeogeek/colab. Let's download it, check which folder we're in, and check that it has the right sort of data in it.

In [None]:
!wget https://raw.githubusercontent.com/archaeogeek/colab/master/osopennames_singlecolumn_2.csv
!pwd
!head -n 10 osopennames_singlecolumn_2.csv

Saving the new model to disk

In [None]:
import spacy
import pandas as pd
from spacy import displacy
from spacy.pipeline import EntityRuler


nlp = spacy.load("en_core_web_sm", enable=["ner"])
# make sure we load and run our custom pipeline before the standard one
# else our entities get overwritten
ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns=[]
osopennames = pd.read_csv('output_singlecolumn.csv', dtype='string')
o_list = osopennames.iloc[:,0]                        
for t in o_list:
  patterns.append({"label":"GPE", "pattern":t})                     
print(len(patterns))
ruler.add_patterns(patterns)
nlp.to_disk("osopennames_ner")

# test with a doc that has some entries from the list
# and some that are not
# and use displacy to render the text showing the entities
doc = nlp("Gloup is in my csv along with Westing. ZE2 9EH is not along with Saxon Ground, Doos' Geo and A2040. Los Angeles is not, and Richard Burton is certainly not!")
displacy.render(doc, jupyter=True, style='ent')



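Loading our custom model and actually using it. We're almost there but have to figure out how to re-add the rest of the `en_core_web_sm` model *and* our custom pipeline. Because the pipeline was loaded with `enable=["ner"]` before saving, the other `en_core_web_sm` components are likely stored in a disabled state, so `nlp.pipe_names` should list only the entity ruler and `ner`.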

In [None]:
import spacy
from spacy import displacy
nlp = spacy.load("osopennames_ner")
print(nlp.pipe_names)
doc = nlp("I live at Greaves Road in Lancaster, Elena lives in Horndean down south somewhere. Mum lives in Endmoor, near Burton-in-Kendal")
displacy.render(doc, jupyter=True, style='ent')
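
One possible explanation for the missing components, offered as an assumption rather than something checked against `osopennames_ner`: components that are disabled when a pipeline is saved come back disabled when it is loaded. The sketch below reproduces that on a blank English pipeline, with a `sentencizer` standing in for the disabled `en_core_web_sm` components, and switches it back on with `nlp.enable_pipe`. The pattern and the test sentence are illustrative.

```python
import tempfile

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Endmoor"}])
# stands in for the components switched off by enable=["ner"]
nlp.disable_pipe("sentencizer")

# save to a temporary folder and load it back
with tempfile.TemporaryDirectory() as tmp:
    nlp.to_disk(tmp)
    reloaded = spacy.load(tmp)

# components listed here were saved but are not run
disabled_after_load = list(reloaded.disabled)
print(reloaded.pipe_names, disabled_after_load)

# switch everything back on
for name in disabled_after_load:
    reloaded.enable_pipe(name)

doc = reloaded("Mum lives in Endmoor. Elena lives down south.")
ents = [(e.text, e.label_) for e in doc.ents]
n_sents = len(list(doc.sents))
```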

Mounting Google Drive to save the model to. Note the need to force the encoding to UTF-8, and that the destination folder must already exist for the copy command to work.

In [None]:
import locale
from google.colab import drive
drive.mount('/content/drive')
locale.getpreferredencoding = lambda: "UTF-8"


In [None]:
!cp -r "/content/osopennames_ner" "/content/drive/MyDrive/osopennames_ner"
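
To avoid depending on the destination folder already existing, the copy could be done from Python instead. A minimal sketch, assuming Python 3.8 or later for `dirs_exist_ok`; `copy_model` is a hypothetical helper and the demo uses temporary folders rather than the real Drive paths.

```python
import shutil
import tempfile
from pathlib import Path


def copy_model(src, dest):
    """Copy a saved model folder, creating missing parent folders first."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


# demo with temporary folders standing in for /content and the mounted drive
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "osopennames_ner"
    src.mkdir()
    (src / "meta.json").write_text("{}", encoding="utf-8")
    dest = copy_model(src, Path(tmp) / "drive" / "MyDrive" / "osopennames_ner")
    copied = sorted(p.name for p in dest.iterdir())

print(copied)
```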