Source: 
- Tutorial: https://yudanta.github.io/posts/train-an-indonesian-ner-from-a-blank-spacy-model/
- NER dataset: https://raw.githubusercontent.com/yusufsyaifudin/indonesia-ner/master/resources/ner/data_train.txt
- NER dataset: https://raw.githubusercontent.com/yohanesgultom/nlp-experiments/master/data/ner/training_data.txt

# 1. Import Library & Data Preparation

In [2]:
# Import library
import pickle
import spacy
import random
from spacy.util import minibatch, compounding
from spacy import load, displacy

In [5]:
# load datasets 
datasetpath = '/content/drive/MyDrive/Colab_Notebooks/Summary_NLP/NER/ner_training_blank_spacy/ner_spacy_fmt_datasets.pickle'
with open(datasetpath, 'rb') as f:
    ner_spacy_fmt_datasets = pickle.load(f)

In [6]:
ner_spacy_fmt_datasets[:2]

[('Pengamat politik dari Universitas Gadjah Mada Arie Sudjito menilai, keinginan Ketua Umum Partai Golkar Aburizal Bakrie untuk maju kembali sebagai ketua umum merupakan pemaksaan kehendak.',
  {'entities': [(22, 45, 'ORGANIZATION'),
    (46, 58, 'PERSON'),
    (89, 102, 'ORGANIZATION'),
    (103, 118, 'PERSON')]}),
 ('Menurut dia, ada kesan bahwa Aburizal menggunakan segala cara untuk memuluskan jalannya kembali menduduki Golkar 1.',
  {'entities': [(29, 37, 'PERSON'), (106, 112, 'ORGANIZATION')]})]
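
Each training example pairs a text with a list of `(start, end, label)` character offsets, where `end` is exclusive, just like a Python slice. A minimal sketch for sanity-checking such annotations is shown here; `extract_spans` is a hypothetical helper name, not part of spaCy, and the sample is the second example printed above.

```python
from typing import List, Tuple

Annotation = Tuple[int, int, str]


def extract_spans(text: str, entities: List[Annotation]) -> List[Tuple[str, str]]:
    """Return (surface text, label) for every (start, end, label) annotation."""
    spans = []
    for start, end, label in entities:
        # Offsets are character positions; `end` is exclusive, as in slicing.
        if not (0 <= start < end <= len(text)):
            raise ValueError(
                "Bad offsets ({}, {}) for a text of length {}".format(start, end, len(text))
            )
        spans.append((text[start:end], label))
    return spans


text = (
    "Menurut dia, ada kesan bahwa Aburizal menggunakan segala cara "
    "untuk memuluskan jalannya kembali menduduki Golkar 1."
)
entities = [(29, 37, "PERSON"), (106, 112, "ORGANIZATION")]

spans = extract_spans(text, entities)
print(spans)
# [('Aburizal', 'PERSON'), ('Golkar', 'ORGANIZATION')]
```

Running a check like this over the whole dataset before training is a cheap way to catch offsets that were shifted during conversion.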

# 2. Creating a Blank Model

In [9]:
# Create a blank spaCy model for Indonesian
nlp = spacy.blank("id")

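# Create the NER component and add it to the pipeline (spaCy v2 API)
nlp.add_pipe(nlp.create_pipe('ner'))

# Initialise the model weights; this returns the optimizer
nlp.begin_training()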

<thinc.neural.optimizers.Optimizer at 0x7fd518542cf8>

In [11]:
# Get the NER component from the pipeline.

ner = nlp.get_pipe("ner")

# Pipes to keep enabled during training; every other pipe will be disabled
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
unaffected_pipes

[]
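
The empty list is expected here: the blank model contains only the `ner` pipe, so there is nothing left to disable. The filter can be sketched in plain Python as follows; `pipes_to_disable` is a hypothetical helper, and the `tagger`/`parser` pipeline is an assumed example of a model that has more components.

```python
def pipes_to_disable(pipe_names, keep):
    """Return the pipes that are NOT in the keep-list, preserving order."""
    return [name for name in pipe_names if name not in keep]


keep = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# A blank model with only NER: nothing to disable.
blank_result = pipes_to_disable(["ner"], keep)

# An assumed fuller pipeline: tagger and parser would be disabled.
full_result = pipes_to_disable(["tagger", "parser", "ner"], keep)

print(blank_result)  # []
print(full_result)   # ['tagger', 'parser']
```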

In [12]:
# Register every entity label found in the loaded training data

for _, annotations in ner_spacy_fmt_datasets:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
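
To see which labels a dataset actually contains before registering them, the distinct labels can be gathered into a set. This is a minimal sketch in plain Python; `collect_labels` and `toy_dataset` are hypothetical names, and the text fields are placeholders because only the labels matter here.

```python
def collect_labels(dataset):
    """Return the set of distinct entity labels in spaCy-format training data."""
    labels = set()
    for _text, annotations in dataset:
        # Visit every entity of every example, not just the first one.
        for _start, _end, label in annotations.get("entities", []):
            labels.add(label)
    return labels


# Placeholder texts; offsets and labels are illustrative only.
toy_dataset = [
    ("placeholder text A", {"entities": [(0, 5, "ORGANIZATION"), (6, 11, "PERSON")]}),
    ("placeholder text B", {"entities": [(0, 5, "PERSON"), (6, 11, "LOCATION")]}),
    ("placeholder text C", {"entities": []}),
]

labels = collect_labels(toy_dataset)
print(sorted(labels))
# ['LOCATION', 'ORGANIZATION', 'PERSON']
```

Note that `LOCATION` only ever appears as the second entity of an example in this toy data, so a loop that stopped after the first entity of each example would miss it.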

In [13]:
ner

<spacy.pipeline.pipes.EntityRecognizer at 0x7fd517a652e8>

# 4. Training the Model

In [14]:
# Before each training pass the training data is shuffled and then fed to the update loop in mini-batches.
# In this example the model is trained for 30 iterations.

# Train the model with mini-batch iteration
with nlp.disable_pipes(*unaffected_pipes):

    # Training for 30 iterations
    for iteration in range(30):

        # Shuffle the examples before every iteration
        random.shuffle(ner_spacy_fmt_datasets)
        losses = {}
        # Batch up the examples using spaCy's minibatch
        batches = minibatch(ner_spacy_fmt_datasets, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
            )

        print("Losses at iteration {}".format(iteration), losses)

Losses at iteration 0 {'ner': 46201.449713197086}
Losses at iteration 1 {'ner': 44667.2777842439}
Losses at iteration 2 {'ner': 44348.14768103606}
Losses at iteration 3 {'ner': 44156.336387966294}
Losses at iteration 4 {'ner': 44167.76952163929}
Losses at iteration 5 {'ner': 44142.77658385835}
Losses at iteration 6 {'ner': 43725.03912848228}
Losses at iteration 7 {'ner': 43706.02412880413}
Losses at iteration 8 {'ner': 43656.20464827832}
Losses at iteration 9 {'ner': 43464.97962105447}
Losses at iteration 10 {'ner': 43320.47161302458}
Losses at iteration 11 {'ner': 43701.99122961775}
Losses at iteration 12 {'ner': 43444.7396546831}
Losses at iteration 13 {'ner': 43208.75134458104}
Losses at iteration 14 {'ner': 43326.2534882014}
Losses at iteration 15 {'ner': 43192.303193363856}
Losses at iteration 16 {'ner': 42987.918383733064}
Losses at iteration 17 {'ner': 43019.22543028767}
Losses at iteration 18 {'ner': 42719.06958604527}
Losses at iteration 19 {'ner': 43092.339563832196}
Losses a
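
The call `compounding(4.0, 32.0, 1.001)` in the training loop controls how the batch size grows. The sketch below re-implements the idea in plain Python for illustration only: `compounding_sketch` and `minibatch_sketch` are hypothetical stand-ins that mirror the behaviour of spaCy's helpers as I understand it, not the library code itself.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Split items into batches whose lengths come from the sizes generator."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch


first_sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 3))
print(first_sizes)  # starts at 4.0 and grows by 0.1% per batch

batch_lengths = [len(b) for b in minibatch_sketch(range(10), compounding_sketch(4.0, 32.0, 1.001))]
print(batch_lengths)
# [4, 4, 2]
```

With a growth factor of 1.001 the size rises slowly: it takes roughly 2,000 batches to go from 4 to the cap of 32, so early updates use small batches and later ones use larger, more stable batches.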

# 5. Evaluation

In [15]:
# Once training is finished, try the trained model on a test sentence:
doc = nlp("SELUBUNG yang menyelimuti kasus penembakan yang menewaskan Pendeta Yeremia Zanambani di Kabupaten Intan Jaya, Papua kian terkuak. Hasil investigasi Tim Gabungan Pencari Fakta (TGPF) kasus tersebut menyatakan bahwa penembakan di Intan Jaya diduga dilakukan oleh aparat keamanan.")

In [16]:
# printing entities
doc.ents

(SELUBUNG, Pendeta Yeremia Zanambani, Kabupaten Intan, Papua, Intan Jaya)

In [18]:
print("Entities")
for ent in doc.ents:
    print(ent.text, '-->', ent.label_)

Entities
SELUBUNG --> ORGANIZATION
Pendeta Yeremia Zanambani --> PERSON
Kabupaten Intan --> LOCATION
Papua --> LOCATION
Intan Jaya --> LOCATION
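
Inspecting a single sentence gives only a rough impression of quality. When gold annotations are available, exact-match precision, recall and F1 over entity spans give a more systematic picture. The following is a minimal sketch in plain Python; `span_prf` is a hypothetical helper, and the gold and predicted spans are made-up illustrative values, not results from this model.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: one exact match, one boundary error, one spurious entity.
gold = [(0, 5, "PERSON"), (10, 15, "LOCATION")]
predicted = [(0, 5, "PERSON"), (10, 14, "LOCATION"), (20, 25, "ORGANIZATION")]

precision, recall, f1 = span_prf(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.333 0.5 0.4
```

Under exact matching, a span with the right label but a wrong boundary counts as both a false positive and a false negative, which is how a truncated prediction such as `Kabupaten Intan` would be scored against a longer gold span.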


So that the model does not have to be retrained every time, it can be saved to disk:

In [19]:
from pathlib import Path

output_dir = Path('/content/drive/MyDrive/Colab_Notebooks/Summary_NLP/NER/ner_training_blank_spacy/nlp_id_checkpoint_2021_01_21')
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

Saved model to /content/drive/MyDrive/Colab_Notebooks/Summary_NLP/NER/ner_training_blank_spacy/nlp_id_checkpoint_2021_01_21


In [20]:
# load existing model 
from pathlib import Path
model_dir = Path('/content/drive/MyDrive/Colab_Notebooks/Summary_NLP/NER/ner_training_blank_spacy/nlp_id_checkpoint_2021_01_21')
print("Loading from", model_dir)
nlp_updated = spacy.load(model_dir)

Loading from /content/drive/MyDrive/Colab_Notebooks/Summary_NLP/NER/ner_training_blank_spacy/nlp_id_checkpoint_2021_01_21


In [21]:
# Re-evaluation with the reloaded model
# Example 1
doc1 = nlp_updated("Kementerian Perhubungan tidak mewajibkan rapid test COVID-19 untuk perjalanan darat lintas daerah, kecuali untuk tujuan Bali. Termasuk, dalam periode cuti bersama." )
print("Entities", [(ent.text, ent.label_) for ent in doc1.ents])

Entities [('Bali', 'LOCATION')]


In [22]:
displacy.render(doc1, style="ent", jupyter=True)

In [23]:
# Example 2
doc2 = nlp_updated("Empat saksi terkait korupsi proyek infrastruktur fiktif yang dikerjakan PT Waskita Karya (Persero) Tbk absen dari panggilan KPK hari ini. Seorang di antaranya mantan Bupati Wakatobi, Hugua." )
print("Entities", [(ent.text, ent.label_) for ent in doc2.ents])

Entities [('PT Waskita Karya', 'ORGANIZATION'), ('KPK', 'ORGANIZATION'), ('Hugua', 'PERSON')]


In [24]:
displacy.render(doc2, style="ent", jupyter=True)