# Natural Language Processing

## Technology NER

Here we teach a model to label technology-related terms.

We go through the whole process, from labeling to training, so you understand how it is done.  The idea is:

1. Grab some raw text containing technology-related content
2. Grab a second text containing technology terms
3. Use 2 to annotate 1
4. Train the NER model on the annotated 1

First, let us grab the raw text containing technology-related content.  We take this raw text from patents.

This notebook is adapted from https://github.com/kinivi/patent_ner_linking

## 1. Loading data

In [1]:
# assumes the data file has already been unzipped
# the text was taken from
#https://www.google.com/patents/sitemap/en/Sitemap/G06/G06K.html
patent_data = open('data/G06K.txt').read().strip()
patent_data[:500]

'COMMUNICATION DEVICE, COMMUNICATION METHOD AND PROGRAM\n_____2019_____3500050_____490084061_____EP3500000.txt_____G06K_____G06K7/10722:G06K7/1417:H04L67/104:H04M1/00:H04M11/00:H04W12/001:H04W12/04:H04W12/04033:H04W12/04071:H04W12/06:H04W76/14:H04W84/12:H04W84/20\nA communication device obtains identification information and a public key of a first other communication device by a particular obtaining method that does not use a wireless LAN and notifies the first other communication device of a role'

To train NER we need to provide many samples, each one a `Doc`, so we split `patent_data` into many samples, one doc per patent.  Looking closely, the patents are separated by blank lines (`\n\n`).

In [2]:
# split into patents texts | 1 entry = 1 patent
patent_texts = patent_data.split('\n\n')
print("Length: ", len(patent_texts))
print("First patent: ",  patent_texts[0][:50])
print("Second patent: ", patent_texts[1][:50])

Length:  2003
First patent:  COMMUNICATION DEVICE, COMMUNICATION METHOD AND PRO
Second patent:  
OPERATIONAL STATUS CLASSIFICATION DEVICE
_____201
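
The output above shows the second patent starting with a newline, which suggests some records are separated by more than one blank line. A minimal sketch of a more forgiving split, assuming records are separated by one or more blank lines; `split_patents` and the sample string are hypothetical and not part of the original notebook, and on the real file this may give a different count than `split('\n\n')`:

```python
import re

def split_patents(raw):
    """Split on runs of blank lines and strip whitespace from each record."""
    parts = re.split(r"\n\s*\n", raw)
    # Drop empty records that appear when separators repeat or trail the text
    return [p.strip() for p in parts if p.strip()]

# Hypothetical sample: two records separated by two blank lines
sample = "TITLE A\nmeta\nbody a\n\n\nTITLE B\nmeta\nbody b\n"
records = split_patents(sample)
print(len(records))
print(records[1][:7])
```

With this variant, each record starts directly at its title rather than at a stray newline.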


Next, let's grab some technology terms from another text file.  To find which of these terms are relevant to our text, we can use `CountVectorizer` from scikit-learn to count how often each term occurs in the patents.  That way, we can drop terms that occur less often than some threshold.

In [3]:
# here are the potential terms
terms = open('data/manyterms.lower.txt').read().lower().strip().split('\n')
print(terms[44444:44456])
print(len(terms), 'terms')

['antonio superchi', 'antonio tarver', 'antonio torres jurado', 'antonio valdes', 'antonio valdes y fernandez bazan', 'antonio valdez', 'antonio valdés y bazán', 'antonio valdés y fernández bazán', 'antonio valente', 'antonio vitali', 'antonio vivaldi', 'antonio xavier machado e cerveira']
250984 terms


As you can see, we got a lot of irrelevant terms.  Let's keep only the 25 most frequent terms for now.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# lowercase=True lowercases the patent text so that it matches the lowercased term list.
# To keep abbreviations such as API or CAT distinct, you would need lowercase=False
# together with a term list that preserves the original case.
cvectorizer = CountVectorizer(
    ngram_range=(1, 4), stop_words="english", vocabulary=terms, lowercase=True)
X = cvectorizer.fit_transform(patent_texts)
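
To make the idea of counting against a fixed vocabulary concrete, here is a stdlib-only sketch. It is not how scikit-learn implements `CountVectorizer`: it assumes plain whitespace tokenization and skips stop-word removal, and `count_terms` and the toy inputs are hypothetical.

```python
from collections import Counter

def count_terms(docs, vocabulary, max_n=4):
    """Count how often each vocabulary term occurs as a 1..max_n-gram."""
    vocab = set(vocabulary)
    totals = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # simplistic tokenizer (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                if gram in vocab:
                    totals[gram] += 1
    return totals

docs = ["The electronic device has a control unit",
        "An electronic device with a fingerprint sensor"]
vocab = ["electronic device", "control unit", "antonio vivaldi"]
totals = count_terms(docs, vocab)
print(totals.most_common(2))
```

Terms that never occur in the documents, such as the person name in the toy vocabulary, simply end up with no count, which is why frequency is a useful relevance filter here.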

Let's take a look at the results of the counting

In [5]:
#rows    = patents
#columns = terms
#values  = counts
#X is a sparse matrix; X.shape avoids building a huge dense array with .toarray()
X.shape

(2003, 250984)

Let's sum each column over all rows (to get each term's total frequency), sort the totals, and map them back to the actual vocabulary.

In [6]:
import numpy as np

#sum them across all documents
counts = np.sum(X, axis=0)
counts.shape

(1, 250984)

In [7]:
#we can get the actual vocab name
vocabs = cvectorizer.get_feature_names_out()
cvectorizer.get_feature_names_out()[:10]

array(['0 0254 meters', '0 0254 metres', '0 1 integer program',
       '0 17 fireball', '0 17 remington', '0 17 remington fireball',
       '0 2 tactical', '0 20 tactical', '0 22 accelerator',
       '0 22 br remington'], dtype=object)

In [8]:
import pandas as pd

#put in the dataframe nicely for viewing
#.T to transpose columns to rows
df = pd.DataFrame(counts, columns = vocabs).T.sort_values(by=0, ascending=False)
df.head()

| term | count (column `0`) |
|---|---|
| electronic device | 16280 |
| control unit | 9263 |
| computer readable | 6103 |
| fingerprint sensor | 5980 |
| display device | 5666 |
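
The sum, sort, and map-to-vocabulary steps can also be done directly in NumPy, which makes it easy to apply a minimum-count threshold. This is a sketch on hypothetical toy data; `min_count` and the small dense matrix are assumptions. For the real sparse matrix, the totals could be obtained with `np.asarray(X.sum(axis=0)).ravel()`.

```python
import numpy as np

# Hypothetical toy data: 3 documents x 4 vocabulary terms
X_toy = np.array([[2, 0, 1, 0],
                  [1, 0, 0, 3],
                  [4, 0, 0, 1]])
vocabs_toy = np.array(["electronic device", "antonio vivaldi",
                       "blind spot", "control unit"])

counts_toy = X_toy.sum(axis=0)      # total frequency of each term
order = np.argsort(-counts_toy)     # term indices, most frequent first
min_count = 2                       # assumed threshold

top_terms = [(str(vocabs_toy[i]), int(counts_toy[i]))
             for i in order if counts_toy[i] >= min_count]
print(top_terms)
```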


## 2. SpaCy NER

Let's start from the original model and see how it performs.

In [9]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(patent_texts[0][18000:20000])
displacy.render(doc, style="ent", jupyter=True)

Looks great!  But what we want is to extend the model so it can also tag technology terms.

The first step is to create a proper dataset that is compatible with spaCy 3.0 for training an NER model.

### 2.1 Create Dataset

Here we use spaCy's `PhraseMatcher` class to find entities from the predefined Wiki term list, using the 25 most frequent terms as patterns.

In [10]:
df.index[:25]

Index(['electronic device', 'control unit', 'computer readable',
       'fingerprint sensor', 'display device', 'block diagram',
       'computing device', 'fingerprint recognition', 'control device',
       'data processing', 'computer program', 'display screen',
       'biometric authentication', 'circuit board', 'face recognition',
       'facial expression', 'data storage', 'fingerprint identification',
       'feature data', 'external device', 'digital image', 'biometric data',
       'deep learning', 'blind spot', 'autonomous vehicle'],
      dtype='object')

In [21]:
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Create a matcher to label entities in text
matcher = PhraseMatcher(nlp.vocab)

# nlp.pipe processes the terms as a stream and yields Doc objects
patterns = list(nlp.pipe(df.index[:25]))
print("patterns:", patterns[0])
print("type:    ", type(patterns[0]))
matcher.add("TECH", patterns) #expects a list of Docs

patterns: electronic device
type:     <class 'spacy.tokens.doc.Doc'>


In [25]:
#let's test our matcher
text = ["electronic device is very expensive", 
        "facial expression is the future"]
for doc in nlp.pipe(text):
    matches = matcher(doc)
    print(matches)
    for match_id, start, end in matches:
        print(match_id, doc[start:end])

[(726378195912679695, 0, 2)]
726378195912679695 electronic device
[(726378195912679695, 0, 2)]
726378195912679695 facial expression
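
Two details are worth illustrating here. The integer `match_id` is a hash of the label and can be decoded through `nlp.vocab.strings`, and by default `PhraseMatcher` compares the exact token text, so lowercase patterns would miss uppercase patent titles. The sketch below assumes case-insensitive matching is wanted and uses `attr="LOWER"`; the example sentence is made up.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# attr="LOWER" compares lowercased token text instead of the exact text
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["electronic device", "control unit"]
matcher.add("TECH", [nlp.make_doc(term) for term in terms])

doc = nlp("The ELECTRONIC DEVICE includes a Control Unit.")
found = [(nlp.vocab.strings[match_id], doc[start:end].text)
         for match_id, start, end in matcher(doc)]
print(found)
```

Whether case-insensitive matching is desirable depends on the term list; it would also match abbreviations regardless of case.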


Next, we create the training and dev datasets, where each sample is one chunk of text (here, one line of a patent).

In [30]:
from spacy.tokens import DocBin, Span
from spacy.util import filter_spans #resolves overlapping spans

def create_dataset(text):
    #text is a list of strings, one per sample
    docs = []
    for doc in nlp.pipe(text):
        matches = matcher(doc)
        spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matches]
        filtered_ents = filter_spans(spans)
        doc.ents = filtered_ents
        
        docs.append(doc)
        
    train_size = int(len(docs) * 0.8)
        
    train_docs = docs[:train_size]
    dev_docs   = docs[train_size:]

    train_doc_bin = DocBin(docs=train_docs)
    train_doc_bin.to_disk("docs/train.spacy")

    dev_doc_bin = DocBin(docs=dev_docs)
    dev_doc_bin.to_disk("docs/dev.spacy")
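
`doc.ents` cannot hold overlapping spans, which is why `filter_spans` is applied before assigning them. The following stdlib-only sketch illustrates the general longest-span-first idea on plain `(start, end)` token offsets. It is an illustration, not spaCy's code, and `filter_overlaps` is a hypothetical helper.

```python
def filter_overlaps(spans):
    """Keep non-overlapping (start, end) spans, preferring longer ones.

    Offsets are token indices with an exclusive end; ties in length are
    broken in favor of the span that starts first.
    """
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen = [], set()
    for start, end in ordered:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)

# (0, 2) overlaps the longer (1, 4), so only (1, 4) and (5, 6) survive
print(filter_overlaps([(0, 2), (1, 4), (5, 6)]))
```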

Split `patent_data` into lines and create the dataset.

In [31]:
# split the patents into chunks, one per line
patent_lines = patent_data.split('\n')
print(len(patent_lines))
patent_lines[2] #example

288792


('A communication device obtains identification information and a public key of a first other communication device by a particular obtaining method that does not use a wireless LAN and notifies the first other communication device of a role of the first other communication device in a communication based on Wi-Fi Direct. In addition, the communication device obtains identification information and a public key of a second other communication device by the particular obtaining method and notifies the second other communication device of a role of the second other communication device in the communication based on Wi-Fi Direct. One of the notified roles is a P2P Group Owner and the other one is a P2P Client, and the communication based on Wi-Fi Direct can be performed between the first other communication device and the second other communication device based on the notifications.',
 'The present invention relates to a communication device, a communication method, and a program.')

Since we have 280k+ chunks, processing all of them would take too long.  Let's just take the first 10,000 chunks for training and dev for now.

In [32]:
create_dataset(patent_lines[:10000])
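
Taking the first 10,000 lines and splitting them in order means that training and dev data come from consecutive patents at the start of the file. One possible variation, sketched here with a hypothetical `sample_and_split` helper and an assumed seed, is to draw a random sample of non-empty lines before splitting:

```python
import random

def sample_and_split(lines, n_samples, train_fraction=0.8, seed=42):
    """Randomly sample non-empty lines, then split into train and dev."""
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    non_empty = [line for line in lines if line.strip()]
    picked = rng.sample(non_empty, min(n_samples, len(non_empty)))
    cut = int(len(picked) * train_fraction)
    return picked[:cut], picked[cut:]

# Toy input: 100 text lines plus two blank ones
lines = ["line %d" % i for i in range(100)] + ["", "  "]
train, dev = sample_and_split(lines, n_samples=50)
print(len(train), len(dev))
```

The two lists could then be turned into `Doc` objects and saved separately, instead of relying on the positional 80/20 cut inside `create_dataset`.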

### 2.2 Generate config

In [33]:
!python3 -m spacy init config --force configs/tech-config.cfg --lang en --pipeline ner

ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
configs/tech-config.cfg
You can now add your data and train your pipeline:
python -m spacy train tech-config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
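
Before training, it can help to confirm that the entity annotations survive serialization. The sketch below does a round trip through `DocBin` in memory; for the real files, `DocBin().from_disk("docs/train.spacy")` would be the equivalent starting point. The two example sentences are made up.

```python
import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# One doc with a TECH entity, one without any entity
doc1 = nlp("An electronic device is shown.")
doc1.ents = [Span(doc1, 1, 3, label="TECH")]
doc2 = nlp("The present invention relates to a method.")

# Serialize and restore, as happens when train.spacy is written and read
data = DocBin(docs=[doc1, doc2]).to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))

n_with_ents = sum(1 for d in restored if d.ents)
print(len(restored), n_with_ents)
print([(ent.text, ent.label_) for ent in restored[0].ents])
```

A low share of docs with entities is expected here, since only 25 terms are used as patterns.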


### 2.3 Training

In [34]:
gpu = spacy.require_gpu()
gpu

True

In [35]:
!python3 -m spacy train configs/tech-config.cfg --output ./output --paths.train docs/train.spacy --paths.dev docs/dev.spacy --gpu-id 0

ℹ Saving to output directory: output
ℹ Using GPU: 0
[2022-12-15 10:32:15,454] [INFO] Set up nlp object from config
[2022-12-15 10:32:15,464] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-12-15 10:32:15,467] [INFO] Created vocabulary
[2022-12-15 10:32:15,467] [INFO] Finished initializing nlp object
[2022-12-15 10:32:22,387] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     97.50    0.67    0.37    3.67    0.01
  0     200         12.33    838.85   32.63   90.06   19.93    0.33
  0     400         13.85     32.02   31.39   88.64   19.07    0.31
  0     600         16.55     42.56   37.90   92.42   23.84    0.38
  0     800         17.63     38.74   70.00   92.03   5
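
In the log, `ENTS_P`, `ENTS_R`, and `ENTS_F` are entity-level precision, recall, and F-score, in percent. Assuming the F-score is the usual harmonic mean of precision and recall, which the logged values are consistent with, the step-200 row can be checked by hand. `f_score` is a hypothetical helper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values from the step-200 row of the log: ENTS_P=90.06, ENTS_R=19.93
f = f_score(90.06, 19.93)
print(round(f, 1))  # close to the logged ENTS_F of 32.63
```

High precision with low recall early in training means the model tags few entities, but most of the ones it tags are correct.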

### 2.4 Loading and Testing

In [36]:
import spacy

nlp = spacy.load("output/model-best")
doc = nlp("iPhone is an electronic device.  The control unit is made in China.")

colors = {"TECH": "#F67DE3"}
options = {"colors": colors}

print(doc.ents)

spacy.displacy.render(doc, style="ent", options=options, jupyter=True)

(electronic device, control unit)
