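# Summary

Dataset : ChineseDailyNerCorpus

Model : BERT (chinese_L-12_H-768_A-12) serves as the feature extractor that produces contextualized embeddings; the downstream tagger, a BiLSTM_CRF, is trained on top of them.

Goals : 

1. Understand the data diversity of the NER corpus
2. The distribution of the ground truth
3. Training time
4. Inference time
5. Accuracy metrics
6. How to train the model in practice

[ChineseDailyNerCorpus downstream model benchmark](https://colab.research.google.com/drive/1yKo5h1Eszou5_W18-BQvgqGuzK6uyEnd#scrollTo=4LqUOxB0LbmE)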

# EDA

In [1]:
import kashgari
import os
from os.path import join as PJ
import time
from kashgari.corpus import ChineseDailyNerCorpus
import pandas as pd
import random
import numpy as np
from collections import defaultdict

# disable the logger
kashgari.logger.logger.propagate = False

SEED = 42
random.seed(SEED)



In [2]:
train_x, train_y = ChineseDailyNerCorpus.load_data("train")
valid_x, valid_y = ChineseDailyNerCorpus.load_data("validate")
test_x, test_y = ChineseDailyNerCorpus.load_data("test")

2021-04-20 00:21:31,335 [DEBUG] kashgari - loaded 20864 samples from /home/joetsai/.kashgari/datasets/china-people-daily-ner-corpus/example.train. Sample:
x[0]: ['据', '俄', '通', '社', '—', '塔', '斯', '社', '报', '道', '，', '叶', '利', '钦', '特', '别', '指', '出', '，', '米', '洛', '舍', '维', '奇', '是', '在', '同', '他', '会', '谈', '时', '作', '出', '这', '一', '决', '定', '的', '。']
y[0]: ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
2021-04-20 00:21:31,397 [DEBUG] kashgari - loaded 2318 samples from /home/joetsai/.kashgari/datasets/china-people-daily-ner-corpus/example.dev. Sample:
x[0]: ['脚', '下', '有', '路', '，', '一', '定', '要', '闯', '出', '一', '条', '适', '合', '自', '己', '的', '路', '，', '下', '岗', '不', '可', '怕', '，', '精', '神', '不', '能', '垮', '。']
y[0]: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',

In [3]:
print(type(train_x), type(train_x[0]), type(train_x[0][0]))

<class 'list'> <class 'list'> <class 'str'>


## Sample counts

|Split|Samples|
|-----|---|
|Train|20864|
|Val|2318|
|Test|4500+|


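## Input data format

1. train_x : the collection of sentences, list (train_y holds the matching tag sequences)
2. train_x[0] : the first sentence, list
3. train_x[0][0] : a single Simplified Chinese character, str
4. Encoding : BIO (Begin, Inside, Outside)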
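
To make the BIO encoding concrete, here is a minimal sketch that turns entity spans into a tag sequence. The helper `spans_to_bio` and its `(start, end, label)` span format are illustrative assumptions, not part of kashgari; the example characters and tags come from the first training sample shown above.

```python
def spans_to_bio(length, spans):
    """Turn (start, end, label) spans (end exclusive) into a list of BIO tags.

    Hypothetical helper for illustration; assumes the spans do not overlap.
    """
    tags = ["O"] * length                # every character starts as Outside
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining characters of the entity
    return tags


chars = list("叶利钦特别指出")
tags = spans_to_bio(len(chars), [(0, 3, "PER")])
for ch, tag in zip(chars, tags):
    print(ch, tag)
```

Each character gets exactly one tag, so `train_x[i]` and `train_y[i]` always have the same length.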


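## Data diversity

Which dimensions of data diversity should an NLP NER task consider?

Computer vision, for example, measures data diversity along the dimensions listed below.

The diversity of the application scenario (the testing set) is then used to plan the diversity that the training set should cover.

```
1. Camera angle
2. Lighting
3. Background complexity (plain --> cluttered)
4. Distance
5. Occlusion
6. Possible variations of the target object
7. Possible misclassifications
```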

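For NER, the following are considered : 

1. Sentence length distribution

2. Ground-truth distribution

    2-1 : length distribution of entity spans

    2-2 : distribution of entity types

    2-3 : ratio of samples that contain entities to samples that do not (positive vs. negative?)
    
    2-4 : open question : how else can data diversity be observed for an NER task?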

In [4]:
sentence_length = {}
for i, sentence in enumerate(train_x):
    sentence_length[i] = len(sentence)

sentence_length_s = pd.Series(sentence_length)

# display(pd.Series(sentence_length_s))
display(sentence_length_s.describe())


# On average, a sentence is about 47 characters long (mean 46.9)
# The shortest has 6 characters; the longest has 574

print('Shortest sentence')
min_length_idx = sentence_length_s.idxmin()
print(train_x[min_length_idx], train_y[min_length_idx], sep='\n\n')

print('-'*100)

print('Longest sentence')
max_length_idx = sentence_length_s.idxmax()
print(train_x[max_length_idx], train_y[max_length_idx], sep='\n\n')

print('-'*100)

print('Sentences of average length')
# 46 is the mean length (46.93) rounded down
avg_length_idx_list = sentence_length_s[sentence_length_s == 46].index.tolist()

avg_length_idx_sample = random.choices(avg_length_idx_list,k=2)

for avg_length_idx in avg_length_idx_sample:
    print(train_x[avg_length_idx], train_y[avg_length_idx], sep='\n\n')

count    20864.000000
mean        46.931557
std         30.077038
min          6.000000
25%         28.000000
50%         40.000000
75%         58.000000
max        574.000000
dtype: float64

Shortest sentence
['（', '子', '夜', '走', '笔', '）']

['O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
Longest sentence
['北', '京', '小', '雨', '1', '8', '℃', '／', '2', '8', '℃', '天', '津', '雷', '阵', '雨', '1', '8', '℃', '／', '2', '8', '℃', '石', '家', '庄', '小', '雨', '转', '阴', '2', '3', '℃', '／', '2', '9', '℃', '太', '原', '小', '雨', '转', '多', '云', '1', '8', '℃', '／', '2', '8', '℃', '呼', '和', '浩', '特', '多', '云', '转', '晴', '1', '4', '℃', '／', '2', '7', '℃', '沈', '阳', '多', '云', '1', '8', '℃', '／', '2', '6', '℃', '大', '连', '多', '云', '1', '8', '℃', '／', '2', '3', '℃', '长', '春', '雷', '阵', '雨', '1', '6', '℃', '／', '2', '6', '℃', '哈', '尔', '滨', '雷', '阵', '雨', '1', '8', '℃', '／', '2', '8', '℃', '上', '海', '中', '雨', '2', '2', '℃', '／', '2', '6', '℃', '南', '京', '阴', '转', '雷', '阵', '雨', '2', '2', '℃', '／', '2', '9', '℃', '杭', '州', '小', '雨', '转', '中', '雨', '2', '2', '℃', '／', '2', '8', '℃', '合', '肥', '中', '雨', '2', '3', '℃', '／', '2', '8', '℃', '福', '州'
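
The length statistics matter because the prediction logs later in this notebook report `predict seq_length: 100` with `truncating=True`, so longer sentences are cut. The sketch below estimates what share of sentences would be truncated. The function `truncation_rate` is a hypothetical helper, and reserving 2 positions for BERT's special tokens is an assumption; the toy lengths are made up for illustration and are not corpus values.

```python
def truncation_rate(lengths, seq_length=100, reserved=2):
    """Share of sentences longer than the usable token budget.

    `reserved` is an assumed number of positions taken by special tokens
    (for example [CLS] and [SEP]); adjust it to the embedding in use.
    """
    budget = seq_length - reserved
    too_long = sum(1 for n in lengths if n > budget)
    return too_long / len(lengths)


# Toy lengths for illustration only; pass the real sentence lengths instead.
toy_lengths = [6, 28, 40, 58, 99, 574]
rate = truncation_rate(toy_lengths)
print(f"{rate:.1%} of the toy sentences exceed the budget")
```

In the notebook, the same function could be called with `sentence_length_s.tolist()` to check the training set.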

In [5]:
def show_x_y(idx):
    print(train_x[idx],train_y[idx])

In [6]:
def reset_ner_detector():
    detect_ner = False
    detect_ner_start = 0
    detect_ner_end = 0
    return detect_ner, detect_ner_start, detect_ner_end

In [7]:
def log_ner_stats(ner_d,sentence_i, x, y):
    ner_d['sentence_i'].append(sentence_i)
    ner_d['x'].append(x)
    ner_d['y'].append(y)
    return ner_d

In [8]:
## Ground-truth category distribution
## Y-sequence-length
## Y-category-type
## Contain Y, not Contain Y

padding = 0  # number of context characters to keep on each side of an entity
ner_d = {
    'sentence_i':[],
    'x':[],
    'y':[]
}

# NOTE: an entity is closed only when an 'O' tag follows it, so an entity that
# ends the sentence is not logged, and adjacent entities with no 'O' between
# them are logged as a single chunk.
for sentence_i,(sentence,sentence_y) in enumerate(zip(train_x,train_y)):
    sentence_np = np.array(sentence)
    sentence_y_np = np.array(sentence_y)
    detect_ner, detect_ner_start, detect_ner_end = reset_ner_detector()
    for char_i, char_y in enumerate(sentence_y_np):
        # no ner detected
        if char_y == 'O' and not detect_ner:
            pass
        
        # get ner end!
        elif char_y == 'O' and detect_ner:
            detect_ner_end = char_i
            
            # get padding ner chunk, clamped to the sentence boundaries
            ner_display_start = max(detect_ner_start - padding, 0)
            ner_display_end = min(detect_ner_end + padding, len(sentence_np))
            ner_chunk_x = sentence_np[ner_display_start : ner_display_end]
            ner_chunk_y = sentence_y_np[ner_display_start : ner_display_end]
            # log ners
            ner_d = log_ner_stats(ner_d,
                                  sentence_i,
                                  ''.join(ner_chunk_x.tolist()),
                                  ner_chunk_y.tolist()
                                  )
            detect_ner, detect_ner_start, detect_ner_end = reset_ner_detector()
            
            # show ner chunk
#             print(sentence_i,ner_chunk_x, ner_chunk_y)

        # get ner start
        elif char_y != 'O' and not detect_ner:
            detect_ner_start = char_i
            detect_ner = True
        # inside the ner
        elif char_y != 'O' and detect_ner:
            pass
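
The loop above closes an entity only when it meets an 'O' tag, which is how a chunk such as 中国京 tagged [B-LOC, I-LOC, B-LOC] ends up logged as one entity. As a hedged alternative, the sketch below also splits on every `B-` tag and flushes an entity that runs to the end of the sentence. The function `extract_entities` is a hypothetical helper, not the notebook's method, and using it would change the counts reported in the next cell.

```python
def extract_entities(chars, tags):
    """Return (text, label, start, end) tuples from one BIO-tagged sentence.

    `end` is exclusive. An I- tag that does not continue the open entity
    is treated leniently as the start of a new entity.
    """
    entities = []
    start = None    # index where the open entity began, or None
    label = None    # type of the open entity, e.g. "LOC"
    # A trailing sentinel 'O' closes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = (
            tag.startswith("I-") and start is not None and tag[2:] == label
        )
        if continues:
            continue
        if start is not None:
            entities.append(("".join(chars[start:i]), label, start, i))
            start, label = None, None
        if tag != "O":
            start, label = i, tag[2:]
    return entities


adjacent = extract_entities(list("中国京"), ["B-LOC", "I-LOC", "B-LOC"])
print(adjacent)
```

The function takes one sentence at a time, so it can be applied to each `(train_x[i], train_y[i])` pair in a loop.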

In [9]:
ner_df = pd.DataFrame(ner_d)

# show ners
display(
    'n rows : ',
    ner_df.shape[0],
    ner_df.sample(10,random_state=SEED)
)

## Y-sequence-length
ner_df['y_sequence_length'] = ner_df['x'].apply(len)
display(ner_df['y_sequence_length'].describe())

## Y-category-type
ner_df['y_cat'] = ner_df['y'].apply(lambda x : x[0]).str.replace('B-','')
display(ner_df['y_cat'].value_counts())

## Contain Y, not Contain Y
contains_ner_set = set(ner_df['sentence_i'].tolist())


display('total data rows : ',
        len(train_x),
        'contain ner : ',
        len(contains_ner_set),
        'not contain ner : ',
        len(train_x) - len(contains_ner_set)
       )

'n rows : '

31857

| |sentence_i|x|y|
|---|---|---|---|
|26207|17155|印尼|[B-LOC, I-LOC]|
|19524|12781|欧洲|[B-LOC, I-LOC]|
|5688|3740|中心|[B-ORG, I-ORG]|
|5980|3940|货币委员会|[B-ORG, I-ORG, I-ORG, I-ORG, I-ORG]|
|18189|11888|中国京|[B-LOC, I-LOC, B-LOC]|
|896|593|崔玉山|[B-PER, I-PER, I-PER]|
|7672|4964|新疆|[B-LOC, I-LOC]|
|15757|10243|民盟中央|[B-ORG, I-ORG, I-ORG, I-ORG]|
|14462|9330|南亚|[B-LOC, I-LOC]|
|19954|13076|古巴|[B-LOC, I-LOC]|


count    31857.000000
mean         3.437015
std          2.277463
min          1.000000
25%          2.000000
50%          3.000000
75%          4.000000
max         30.000000
Name: y_sequence_length, dtype: float64

LOC    14927
ORG     9057
PER     7873
Name: y_cat, dtype: int64

'total data rows : '

20864

'contain ner : '

12704

'not contain ner : '

8160
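
The raw counts above are easier to compare as proportions. The following sketch only redoes the arithmetic on the numbers printed by this cell; the variable names are illustrative.

```python
# Counts copied from the cell output above.
total_sentences = 20864
with_entities = 12704
entity_counts = {"LOC": 14927, "ORG": 9057, "PER": 7873}

share_with = with_entities / total_sentences
total_entities = sum(entity_counts.values())
type_share = {label: n / total_entities for label, n in entity_counts.items()}

print(f"sentences with entities:    {share_with:.1%}")
print(f"sentences without entities: {1 - share_with:.1%}")
for label, share in type_share.items():
    print(f"{label}: {share:.1%} of logged entities")
print(f"logged entities per entity-bearing sentence: "
      f"{total_entities / with_entities:.2f}")
```

Roughly 61% of training sentences contain at least one logged entity, and LOC is the most frequent type, so the negative samples and the class imbalance are both worth keeping in mind when reading the F1-score.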

# Model profiling


## Training & Inference Profiling

GPU : GeForce GTX 980 Ti 6G

|Task|Time|Note|
|----|-----|----|
|Training|6 mins/epoch| |
|Inference(GPU)|~100ms|mean of 50 iterations|
|Inference(CPU)|~150ms|mean of 50 iterations|
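
The notebook does not show how the inference timings were taken. One plausible way to get a "mean of 50 iterations" figure is sketched below, under stated assumptions: `mean_latency_ms` is a hypothetical helper, the warm-up count is arbitrary, and `dummy_predict` is a stand-in so the block runs without a trained model.

```python
import statistics
import time


def mean_latency_ms(predict_fn, batch, n_iter=50, n_warmup=3):
    """Return (mean, stdev) of the wall-clock latency of predict_fn in ms."""
    for _ in range(n_warmup):
        predict_fn(batch)            # warm-up calls are not timed
    timings = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        predict_fn(batch)
        timings.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(timings), statistics.stdev(timings)


def dummy_predict(batch):
    """Stand-in for a real model: tags every character as 'O'."""
    return [["O"] * len(sentence) for sentence in batch]


mean_ms, std_ms = mean_latency_ms(dummy_predict, [list("王小明去台北")])
print(f"{mean_ms:.4f} ms +/- {std_ms:.4f} ms over 50 iterations")
```

With the trained model loaded later in this notebook, `predict_fn` could be something like `lambda b: trained_model.predict(b, truncating=True)`, which mirrors the call used in the sample cells.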


F1-score : 0.93 at epoch 15 ([run logs on wandb](https://wandb.ai/yltsai0609/bert-ner/runs/2hylsyy3/logs?workspace=user-yltsai0609))


## Samples

In [10]:
if os.getcwd() == PJ("/home","joetsai","work","yulong","bert_ner","docs"):
    os.chdir('..')
print('current path : ',os.getcwd())

current path :  /home/joetsai/work/yulong/bert_ner


In [11]:
SAVE_MODEL_PREFIX = PJ("trained_model", "ner_daily_news")
trained_model = kashgari.utils.load_model(SAVE_MODEL_PREFIX)

  
2021-04-20 00:21:33,502 [DEBUG] kashgari - ------------------------------------------------
2021-04-20 00:21:33,503 [DEBUG] kashgari - Loaded transformer model's vocab
2021-04-20 00:21:33,504 [DEBUG] kashgari - config_path       : language_model/bert/chinese_L-12_H-768_A-12/bert_config.json
2021-04-20 00:21:33,505 [DEBUG] kashgari - vocab_path      : language_model/bert/chinese_L-12_H-768_A-12/vocab.txt
2021-04-20 00:21:33,506 [DEBUG] kashgari - checkpoint_path : language_model/bert/chinese_L-12_H-768_A-12/bert_model.ckpt
2021-04-20 00:21:33,507 [DEBUG] kashgari - Top 50 words    : ['[PAD]', '[unused1]', '[unused2]', '[unused3]', '[unused4]', '[unused5]', '[unused6]', '[unused7]', '[unused8]', '[unused9]', '[unused10]', '[unused11]', '[unused12]', '[unused13]', '[unused14]', '[unused15]', '[unused16]', '[unused17]', '[unused18]', '[unused19]', '[unused20]', '[unused21]', '[unused22]', '[unused23]', '[unused24]', '[unused25]', '[unused26]', '[unused27]', '[unused28]', '[unused29]', '

In [28]:
samples = random.choices(
    test_x,
    k=10,
)
for sentence in samples:
    print(trained_model.predict([sentence], truncating=True))
    print(sentence)


2021-04-20 00:24:32,409 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)




2021-04-20 00:24:32,545 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:32,546 [DEBUG] kashgari - predict output argmax: [[0 6 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:32,547 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['保', '定', '一', '中', '分', '校', '领', '导', '向', '品', '学', '兼', '优', '的', '下', '岗', '职', '工', '子', '女', '颁', '发', '奖', '学', '金', '，', '团', '员', '同', '学', '向', '下', '岗', '职', '工', '子', '女', '赠', '送', '学', '习', '用', '具', '。']


2021-04-20 00:24:32,663 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:32,664 [DEBUG] kashgari - predict output argmax: [[0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:32,665 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['能', '不', '能', '实', '现', '跨', '世', '纪', '的', '宏', '伟', '目', '标', '，', '关', '系', '着', '党', '和', '国', '家', '的', '前', '途', '命', '运', '。']


2021-04-20 00:24:32,783 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:32,784 [DEBUG] kashgari - predict output argmax: [[0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:32,785 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['没', '水', '吃', '，', '到', '几', '里', '路', '外', '的', '山', '沟', '沟', '去', '挑', '；', '下', '雨', '天', '山', '陡', '路', '滑', '，', '就', '吃', '房', '檐', '水', '；', '饭', '是', '挂', '面', '，', '菜', '是', '盐', '水', '煮', '黄', '豆', '。']


2021-04-20 00:24:32,903 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:32,904 [DEBUG] kashgari - predict output argmax: [[0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 4 3 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:32,905 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O']]
['库', '内', '保', '安', '、', '防', '灾', '系', '统', '精', '良', '，', '有', '单', '独', '的', '空', '调', '机', '组', '维', '持', '2', '0', '摄', '氏', '度', '恒', '温', '、', '5', '0', '％', '—', '6', '0', '％', '湿', '度', '，', '这', '些', '设', '备', '都', '是', '从', '日', '本', '进', '口', '的', '。']


2021-04-20 00:24:33,013 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:33,014 [DEBUG] kashgari - predict output argmax: [[0 4 1 1 1 1 6 2 2 1 1 1 4 4 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:33,015 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['B-LOC', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'B-LOC', 'B-LOC', 'O', 'O']]
['伊', '重', '申', '尊', '重', '联', '合', '国', '划', '定', '的', '伊', '科', '边', '界']


2021-04-20 00:24:33,128 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:33,129 [DEBUG] kashgari - predict output argmax: [[0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 5 5 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:33,130 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['今', '本', '《', '列', '子', '》', '是', '魏', '晋', '时', '期', '出', '现', '的', '著', '作', '，', '托', '名', '战', '国', '列', '御', '寇', '所', '著', '，', '至', '今', '搞', '不', '清', '其', '作', '者', '是', '谁', '。']


2021-04-20 00:24:33,242 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:33,243 [DEBUG] kashgari - predict output argmax: [[0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:33,245 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['他', '们', '决', '心', '学', '一', '技', '之', '长', '，', '铸', '健', '康', '灵', '魂', '，', '重', '新', '做', '人', '，', '以', '实', '际', '行', '动', '报', '答', '母', '亲', '的', '养', '育', '之', '恩', '。']


2021-04-20 00:24:33,357 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:33,358 [DEBUG] kashgari - predict output argmax: [[0 4 3 1 1 1 1 1 1 1 4 3 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:33,359 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O']]
['桂', '林', '多', '山', '，', '山', '珍', '也', '是', '桂', '林', '菜', '肴', '的', '特', '色', '。']


2021-04-20 00:24:33,468 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:33,469 [DEBUG] kashgari - predict output argmax: [[0 1 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:33,470 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['漫', '游', '佛', '湾', '还', '看', '到', '市', '俗', '化', '很', '浓', '的', '、', '反', '映', '贫', '民', '生', '活', '的', '场', '景', '《', '六', '道', '轮', '回', '图', '》', '，', '有', '融', '儒', '家', '思', '想', '于', '佛', '教', '教', '义', '的', '《', '父', '母', '恩', '重', '经', '变', '图', '》', '，', '还', '有', '说', '明', '宗', '教', '、', '哲', '理', '的', '《', '锁', '六', '耗', '图', '》', '等', '。']


2021-04-20 00:24:33,578 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:33,579 [DEBUG] kashgari - predict output argmax: [[0 1 1 1 1 1 1 4 3 1 4 3 1 1 1 1 1 1 1 1 1 4 3 1 1 1 1 1 0 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]


[['O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O']]
['与', '上', '年', '相', '比', '，', '巢', '湖', '和', '滇', '池', '污', '染', '程', '度', '有', '所', '加', '重', '，', '太', '湖', '有', '所', '减', '轻', '。']


In [29]:
ood_sentence = [
    '王小明去台北市立動物園玩',
    '高雄的西子灣是一個散心絕佳的好去處',
    "YC來到了桃園國際機場搭飛機",
    "Joe和YC正在Zoom進行兩週一次的機器學習討論會",
    "瑜隆和宜昌正在Zoom進行兩週一次的機器學習討論會",
    "痞客邦在捷運行天宮站附近，走路大概10分鐘，還挺遠的",
    "玉山銀行全台灣都有，總部在台北的信義區嗎？"
]


for sentence in ood_sentence:
    print(trained_model.predict([list(sentence)], truncating=True))
    print(list(sentence))
    

2021-04-20 00:24:40,818 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)




2021-04-20 00:24:40,944 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:40,945 [DEBUG] kashgari - predict output argmax: [[0 7 5 5 1 4 3 3 3 3 3 3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:40,946 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['B-PER', 'I-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O']]
['王', '小', '明', '去', '台', '北', '市', '立', '動', '物', '園', '玩']


2021-04-20 00:24:41,066 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:41,067 [DEBUG] kashgari - predict output argmax: [[0 4 3 1 4 3 3 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:41,069 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['高', '雄', '的', '西', '子', '灣', '是', '一', '個', '散', '心', '絕', '佳', '的', '好', '去', '處']


2021-04-20 00:24:41,174 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:41,176 [DEBUG] kashgari - predict output argmax: [[0 1 1 1 1 1 4 3 3 3 3 3 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:41,176 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O', 'O', 'O']]
['Y', 'C', '來', '到', '了', '桃', '園', '國', '際', '機', '場', '搭', '飛', '機']


2021-04-20 00:24:41,293 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:41,294 [DEBUG] kashgari - predict output argmax: [[0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:41,295 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['J', 'o', 'e', '和', 'Y', 'C', '正', '在', 'Z', 'o', 'o', 'm', '進', '行', '兩', '週', '一', '次', '的', '機', '器', '學', '習', '討', '論', '會']


2021-04-20 00:24:41,406 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:41,408 [DEBUG] kashgari - predict output argmax: [[0 4 3 1 4 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:41,409 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['瑜', '隆', '和', '宜', '昌', '正', '在', 'Z', 'o', 'o', 'm', '進', '行', '兩', '週', '一', '次', '的', '機', '器', '學', '習', '討', '論', '會']


2021-04-20 00:24:41,524 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:41,525 [DEBUG] kashgari - predict output argmax: [[0 6 2 2 1 4 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
2021-04-20 00:24:41,525 [DEBUG] kashgari - predict seq_length: 100, input: (2, 1, 100)


[['B-ORG', 'I-ORG', 'I-ORG', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
['痞', '客', '邦', '在', '捷', '運', '行', '天', '宮', '站', '附', '近', '，', '走', '路', '大', '概', '1', '0', '分', '鐘', '，', '還', '挺', '遠', '的']


2021-04-20 00:24:41,627 [DEBUG] kashgari - predict output: (1, 100)
2021-04-20 00:24:41,629 [DEBUG] kashgari - predict output argmax: [[0 6 2 2 2 1 4 3 1 1 1 1 1 1 4 3 1 4 3 3 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]


[['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O', 'O']]
['玉', '山', '銀', '行', '全', '台', '灣', '都', '有', '，', '總', '部', '在', '台', '北', '的', '信', '義', '區', '嗎', '？']
