# Universal Dependencies (UD) Japanese

## Package Installation

Install the required packages by running the cell below.

In [9]:
%pip install -q spacy-partial-tagger conllu

## Data Preparation

### Download dataset

In [12]:
!curl -qLO https://raw.githubusercontent.com/megagonlabs/UD_Japanese-GSD/master/spacy/ja_gsd-ud-train.ne.conllu
!curl -qLO https://raw.githubusercontent.com/megagonlabs/UD_Japanese-GSD/master/spacy/ja_gsd-ud-dev.ne.conllu
!curl -qLO https://raw.githubusercontent.com/megagonlabs/UD_Japanese-GSD/master/spacy/ja_gsd-ud-test.ne.conllu



### Convert dataset

Convert the CoNLL-U files into spaCy's binary format (DocBin) so they can be used to train a model.

In [21]:
import spacy
from conllu import parse_incr
from spacy.tokens import Doc, DocBin

from spacy_partial_tagger.util import make_char_based_doc


def make_doc_bin(vocab: spacy.Vocab, filename: str, max_size: int = 100) -> DocBin:
    # Read (tokens, NE tags) pairs; tokens without an NE annotation are tagged "O".
    with open(filename, encoding="utf-8") as f:
        dataset = []
        for data in parse_incr(f):
            tokens = []
            tags = []
            for x in data:
                tokens.append(x["form"])
                # MISC may be empty ("_"), which conllu parses as None.
                tags.append((x["misc"] or {}).get("NE", "O"))
            dataset.append((tokens, tags))

    db = DocBin()
    for tokens, tags in dataset:
        # Tokens are joined without whitespace.
        doc = Doc(vocab, tokens, spaces=[False] * len(tokens))
        char_doc = make_char_based_doc(doc, tags)
        if len(char_doc) <= max_size:
            db.add(char_doc)
        else:
            # Too long: keep one window of roughly max_size around each entity.
            # Long documents that contain no entity are dropped.
            for ent in char_doc.ents:
                rest = max_size - len(ent)
                start = max(0, ent.start - rest // 2)
                end = min(len(char_doc), ent.end + rest // 2)
                db.add(char_doc[start:end].as_doc())
    return db


nlp = spacy.blank("ja")

make_doc_bin(nlp.vocab, "ja_gsd-ud-train.ne.conllu", 30).to_disk("train.spacy")
make_doc_bin(nlp.vocab, "ja_gsd-ud-dev.ne.conllu", 1 << 60).to_disk("dev.spacy")
make_doc_bin(nlp.vocab, "ja_gsd-ud-test.ne.conllu", 1 << 60).to_disk("test.spacy")
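The windowing step in make_doc_bin can be checked in isolation. The sketch below reproduces only the index arithmetic from the else branch in plain Python. `entity_window` is a hypothetical helper name introduced for illustration; it is not part of spaCy or spacy-partial-tagger.

```python
from typing import Tuple


def entity_window(doc_len: int, ent_start: int, ent_end: int, max_size: int) -> Tuple[int, int]:
    """Return the [start, end) slice kept around one entity (hypothetical helper)."""
    # Budget left for context once the entity itself is accounted for.
    rest = max_size - (ent_end - ent_start)
    # Split the budget evenly on both sides and clamp to the document bounds.
    start = max(0, ent_start - rest // 2)
    end = min(doc_len, ent_end + rest // 2)
    return start, end


# A 4-unit entity in the middle of a 100-unit document, max_size = 30.
print(entity_window(100, 50, 54, 30))  # (37, 67)
# Near the start of the document the window is clamped and becomes shorter.
print(entity_window(100, 2, 5, 30))  # (0, 18)
```

Since the training set is built with max_size = 30, a long sentence with several entities contributes one window per entity, and those windows may overlap.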

## Training

Train the tagger on the dataset prepared above: first write a config file, then run training with that config.

### Setup config

In [14]:
%%writefile config.cfg
[paths]
train = "./train.spacy"
dev = "./dev.spacy"
init_tok2vec = null
vectors = null

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = ${corpora.train.max_length}

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "ja"
pipeline = ["partial_ner"]
tokenizer = {"@tokenizers": "character_tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 15

[nlp.tokenizer]

[components]

[components.partial_ner]
factory = "partial_ner"

[components.partial_ner.loss]
@losses = "spacy-partial-tagger.ExpectedEntityRatioLoss.v1"
padding_index = -1
unknown_index = -100
outside_index = 0

[components.partial_ner.label_indexer]
@label_indexers = "spacy-partial-tagger.TransformerLabelIndexer.v1"
padding_index = ${components.partial_ner.loss.padding_index}
unknown_index = ${components.partial_ner.loss.unknown_index}

[components.partial_ner.model]
@architectures = "spacy-partial-tagger.PartialTagger.v1"

[components.partial_ner.model.misaligned_tok2vec]
@architectures = "spacy-partial-tagger.MisalignedTok2VecTransformer.v1"
model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"

[components.partial_ner.model.encoder]
@architectures = "spacy-partial-tagger.LinearCRFEncoder.v1"
nI = 768
nO = null
dropout = 0.2

[components.partial_ner.model.decoder]
@architectures = "spacy-partial-tagger.ConstrainedViterbiDecoder.v1"
padding_index = ${components.partial_ner.loss.padding_index}

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
accumulate_gradient = 1
max_steps = 12000
patience = 3000
eval_frequency = 600
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_sequence.v1"
size = 15
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = false
use_averages = false
grad_clip = 5.0

[training.optimizer.learn_rate]
@schedules = "slanted_triangular.v1"
max_rate = 0.00002
num_steps = ${training.max_steps}
cut_frac = 0.1
ratio = 16
t = -1

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]

[initialize.components]

[initialize.tokenizer]

Overwriting config.cfg
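
The learning-rate settings in the config can be sanity-checked by hand. The sketch below assumes that `slanted_triangular.v1` follows the slanted triangular schedule of Howard and Ruder (linear warm-up over the first `cut_frac` of the steps, then linear decay), with the lowest rate equal to `max_rate / ratio`. `lr_at` is a hypothetical helper written for this check, not a spaCy or Thinc function.

```python
def lr_at(step: int, max_rate: float = 2e-5, num_steps: int = 12000,
          cut_frac: float = 0.1, ratio: float = 16) -> float:
    """Learning rate at a given step under the assumed slanted triangular schedule."""
    cut = int(num_steps * cut_frac)  # step at which the peak is reached
    if step < cut:
        p = step / cut  # warm-up: 0 -> 1
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))  # decay: 1 -> 0
    return max_rate * (1 + p * (ratio - 1)) / ratio


# Start, peak, and end of the schedule for the values in config.cfg.
for step in (0, 1200, 12000):
    print(step, f"{lr_at(step):.3g}")
```

Under this assumption the schedule starts at 2e-5 / 16 = 1.25e-6, peaks at 2e-5 after 1200 steps, and decays back to 1.25e-6 at step 12000.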


### Training

To train the model, please execute the command below.

In [23]:
%run -m spacy train config.cfg \
        --output=./ja-gsd \
        --paths.train train.spacy --paths.dev dev.spacy \
        --gpu-id 0

[2023-01-18 04:49:08,723] [INFO] Set up nlp object from config
INFO:spacy:Set up nlp object from config
[2023-01-18 04:49:08,748] [INFO] Pipeline: ['partial_ner']
INFO:spacy:Pipeline: ['partial_ner']
[2023-01-18 04:49:08,758] [INFO] Created vocabulary
INFO:spacy:Created vocabulary
[2023-01-18 04:49:08,763] [INFO] Finished initializing nlp object
INFO:spacy:Finished initializing nlp object


✔ Created output directory: ja-gsd
ℹ Saving to output directory: ja-gsd
ℹ Using GPU: 0


Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[2023-01-18 04:49:12,472] [INFO] Initialized pipeline components: ['partial_ner']
INFO:spacy:Initialized pipeline components:

✔ Initialized pipeline
ℹ Pipeline: ['partial_ner']
ℹ Initial learn rate: 1.25e-06
E    #       LOSS PARTI...  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  ------  ------  ------  ------
  0       0          34.53    0.12    0.06    0.75    0.00
  0     600       38287.34   62.38   59.72   65.28    0.62
  1    1200       49377.57   73.45   67.62   80.38    0.73
  2    1800       21882.08   74.97   68.43   82.89    0.75
  3    2400       38962.85   75.44   72.51   78.62    0.75
  3    3000       38765.40   73.16   68.23   78.87    0.73
  4    3600       25548.66   75.54   76.52   74.59    0.76
  5    4200       47331.82   75.13   75.80   74.47    0.75
  6    4800       34199.46   75.10   79.30   71.32    0.75
  6    5400       25445.73   74.21   76.33   72.20    0.74
  7    6000       25420.56   72.68   73.96   71.45    0.73
  8    6600       38484.00   70.85   73.03   68.81    0.71
✔ Saved pipeline to output directory
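
The run ends at step 6600 even though max_steps is 12000. This is consistent with patience = 3000 in the config: the best ENTS_F (75.54) was reached at step 3600, and no evaluation in the following 3000 steps improved on it. The sketch below replays the logged scores through a simple patience rule. `stop_step` is a hypothetical helper and only approximates what the training loop does internally.

```python
from typing import List, Optional, Tuple

# (step, ENTS_F) pairs copied from the training log.
LOG: List[Tuple[int, float]] = [
    (0, 0.12), (600, 62.38), (1200, 73.45), (1800, 74.97),
    (2400, 75.44), (3000, 73.16), (3600, 75.54), (4200, 75.13),
    (4800, 75.10), (5400, 74.21), (6000, 72.68), (6600, 70.85),
]


def stop_step(evals: List[Tuple[int, float]], patience: int) -> Optional[int]:
    """Return the first evaluation step that is `patience` steps past the best one."""
    best_score, best_step = float("-inf"), 0
    for step, score in evals:
        if score > best_score:  # ties keep the earliest best step
            best_score, best_step = score, step
        if step - best_step >= patience:
            return step
    return None  # patience never exhausted


print(stop_step(LOG, patience=3000))  # 6600
```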

## Evaluation

Evaluate the trained model on the test set by running the command below.

In [24]:
%run -m spacy evaluate ja-gsd/model-best test.spacy --gpu-id 0

ℹ Using GPU: 0





NER P   75.41
NER R   75.53
NER F   75.47
SPEED   3676 


                   P        R        F
NORP           70.37    67.86    69.09
PERSON         78.64    91.01    84.38
ORG            72.58    60.00    65.69
DATE           88.16    79.76    83.75
GPE            73.47    87.80    80.00
TITLE_AFFIX    71.43    75.00    73.17
WORK_OF_ART    82.35    77.78    80.00
QUANTITY       78.05    82.05    80.00
EVENT          57.14    28.57    38.10
ORDINAL        63.16    92.31    75.00
PRODUCT        37.50    52.17    43.64
FAC            66.67    40.00    50.00
MONEY          87.50   100.00    93.33
TIME           76.92    76.92    76.92
LOC            90.48    76.00    82.61
PERCENT        75.00    42.86    54.55
LANGUAGE      100.00   100.00   100.00
MOVEMENT      100.00    50.00    66.67
LAW             0.00     0.00     0.00
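
Each F value in these results is the harmonic mean of the corresponding precision and recall. A minimal check in Python, using the overall scores reported above (`f_score` is a hypothetical helper; because the printed P and R are already rounded, recomputed per-type values can differ from the table in the last digit):

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall NER scores from the evaluation output.
print(round(f_score(75.41, 75.53), 2))  # 75.47
# A type with no correct predictions, such as LAW, scores 0.
print(f_score(0.0, 0.0))  # 0.0
```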

