Data Structure

* level 1: record ID (a number stored as a string)
* level 2: context, NER_target
* level 3: 
  * H_type: Histologic Type
  * H_grade: Histologic Grade
  * TF: Tumor Focality
  * LV: Lymph-Vascular Invasion
  * CM: closest margin
  * size: size
* level 4: 
  * content: the extracted paragraph; `without content` if nothing was extracted
  * annotate: the target value; `-` if there is none
  * content_tag and NER_taging: currently unused
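
To make the four levels concrete, here is a minimal sketch of one record and a helper that walks it. The record values are invented for illustration, and `annotated_targets` is a hypothetical helper, not part of the original notebook.

```python
# Hypothetical record illustrating the four levels described above.
data = {
    "1": {                                      # level 1: record ID as a string
        "context": "Lung, right lower lobe",    # level 2
        "NER_target": {                         # level 2
            "H_type": {                         # level 3: one entry per target
                "content": "without content",   # level 4: nothing extracted
                "annotate": "-",                # level 4: no target value
            },
            "size": {
                "content": "Tumor size: 2.5 cm",
                "annotate": "2.5 cm",
            },
        },
    },
}


def annotated_targets(record):
    """Return {target: annotation} for targets that carry a real annotation."""
    return {
        target: items["annotate"]
        for target, items in record["NER_target"].items()
        if items["annotate"] != "-"
    }


found = annotated_targets(data["1"])
print(found)
```

Filtering on `-` mirrors the placeholder convention listed under level 4.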

Number of records

In [2]:
len(data)

860

Number of records with an empty context

In [14]:
sum(1 for idx, example in data.items() if example["context"] == "")

99

Token length distribution

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base", n_positions=2048)
context = [example["context"] for idx, example in data.items()]
context_len = [len(tokenizer(ctx)["input_ids"]) for ctx in context]

max(context_len)

1783
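
The cell above reports only the maximum. A minimal sketch of a fuller summary follows, assuming the lengths are already available as a list of integers (in the notebook that would be `context_len`). The helper name `summarize_lengths`, the default limit of 512, and the sample lengths are illustrative assumptions.

```python
from statistics import median


def summarize_lengths(lengths, limit=512):
    """Summarise token lengths and count how many exceed `limit`."""
    ordered = sorted(lengths)
    over = sum(1 for n in ordered if n > limit)
    return {
        "min": ordered[0],
        "median": median(ordered),
        "max": ordered[-1],
        "over_limit": over,
        "over_limit_ratio": over / len(ordered),
    }


# Made-up lengths for illustration only.
example_lengths = [120, 300, 480, 530, 1783]
summary = summarize_lengths(example_lengths)
print(summary)
```

Knowing how many records exceed the limit, rather than only the maximum, may help when choosing a `max_seq_length` for training.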

Data overview
* [NER_H&F_keycontent.json](./data/raw/NER_H%26F_keycontent.json): manually verified; 207 patients, 860 pathology reports
* [NER_keycontent_708patients.json](./data/raw/NER_keycontent_708patients.json): extracted by rules; 501 patients, 2187 pathology reports, plus NER_H&F_keycontent.json

NER target
* **soap report** / **cetology report**: organ, Bx-site, sampling method, diagnosis
* **path report**: H_type, H_grade, TF, LV, CM, size

Inspecting [NER_keycontent_708patients.json](./data/raw/NER_keycontent_708patients.json)

In [1]:
import json
from pathlib import Path
from transformers import AutoTokenizer

data = json.loads(Path("./data/raw/NER_keycontent_708patients.json").read_text())
Path("./data/raw/NER_H&F_keycontent.json").write_text(json.dumps(data, indent=2))

tokenizer = AutoTokenizer.from_pretrained("t5-base", n_positions=2048)
soap_report = [example["soap report"] for idx, example in data.items()]
path_report = [example["path report"] for idx, example in data.items()]
soap_len = [len(tokenizer(report)["input_ids"]) for report in soap_report]
path_len = [len(tokenizer(report)["input_ids"]) for report in path_report]

print(f"soap report max length: {max(soap_len)}\npath report max length: {max(path_len)}")

Token indices sequence length is longer than the specified maximum sequence length for this model (962 > 512). Running this sequence through the model will result in indexing errors


soap report max length: 962
path report max length: 2160


In [3]:
import json
from pathlib import Path
from transformers import AutoTokenizer

data = json.loads(Path("./data/raw/NER_keycontent_708patients.json").read_text())
Path("./data/raw/NER_H&F_keycontent.json").write_text(json.dumps(data, indent=2))

tokenizer = AutoTokenizer.from_pretrained("t5-base", n_positions=2048)

content = [items["content"] for idx, example in data.items() for target, items in example["NER_target"].items()]
content = list(set(content))
content_len = [len(tokenizer(cont)["input_ids"]) for cont in content]

print(f"max length: {max(content_len)}")

Token indices sequence length is longer than the specified maximum sequence length for this model (523 > 512). Running this sequence through the model will result in indexing errors


max length: 874


Which Chinese strings appear in the reports

In [7]:
import re
import json
from pathlib import Path

data = json.loads(Path("./data/raw/NER_keycontent_708patients.json").read_text())
Path("./data/raw/NER_H&F_keycontent.json").write_text(json.dumps(data, indent=2))

words = []
zhPattern = re.compile(u'[\u4e00-\u9fa5]+')
for idx, example in data.items():
    for report in ["soap report", "path report", "cetology report"]:
        match = zhPattern.findall(example[report])
        words.extend(match)

print(sorted(list(set(words))))

['主治醫師', '余代和', '余鐵國', '健保署', '傅安生', '切片日期', '劉伯義', '劉平', '劉慶綬', '劉文賢', '劉方來', '劉榮木', '劉瑞雄', '劉瑞雲', '劉興基', '加註類別', '博仁醫院', '原位癌', '原本', '參考文獻', '口病專醫字第', '台北病理中心', '台安醫院', '史莊廷妹', '名字', '吳世雄', '吳文賢', '吳朝褔', '吳錦章', '吳錫全', '周中明', '周坤輝', '周彩霞', '周文章', '周朝榮', '周泰源', '周義雄', '周邱登美', '喬美華', '嚴敏禎', '報告日期', '壢新醫院', '夏德興', '姚菊瑛', '孟慶德', '孫德寶', '孫殿文', '孫靖嫺', '孫鳳舉', '宋騰琰', '年新出版', '廖', '廖林月花', '廖笑', '廖純霞', '張世欣', '張嬌蓮', '張宜崴', '張景晃', '張林秀賢', '張水源', '張水返', '張清根', '張秋煌', '張維富', '張鳴', '張齡材', '徐文芳', '徐辰芳', '忠孝醫院', '戴光明', '承王延秀', '振興醫療財團法人振興醫院', '振興醫院', '收件日期', '方台生', '方嘉郎', '方洪昌', '施隆彬', '景薇立', '更改加註類別', '更正為', '曾修良', '曾塗盛', '曾正均', '曾玉龍', '曾良玉', '曾蔡郁姬', '朱娟秀', '朱家瑩', '朱明細', '朱錫卿', '李', '李寶純', '李德發', '李振揚', '李方玉翠', '李明義', '李素真', '李詩鐘', '杜士英', '杜美瑤', '林余香', '林周清香', '林幸蓁', '林廖炎珠', '林志昌', '林忠華', '林文雄', '林智化', '林月波', '林月秀', '林柑', '林武龍', '林永和', '林芳美', '林蘇坤', '林賜恩', '林進祥', '林達雄', '林金在', '林金城', '林錦旺', '林麗平', '林麗淑', '林黃淑櫻', '柳天送', '梁丁海光', '楊勝雄', '楊志能', '楊朝全', '楊正治', '楊秋素', '楊鴻育', '檢字第', '檢體', '檢體編號', '正確的編碼為',

Removing Chinese characters

In [8]:
import re
import json
from pathlib import Path

data = json.loads(Path("./data/raw/NER_keycontent_708patients.json").read_text())
Path("./data/raw/NER_H&F_keycontent.json").write_text(json.dumps(data, indent=2))

zhPattern = re.compile(u'[\u4e00-\u9fa5]+')
for idx, example in data.items():
    for report in ["soap report", "path report", "cetology report"]:
        match = zhPattern.findall(example[report])
        for word in match:
            example[report] = example[report].replace(word, "")
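
The same removal can be sketched as a small standalone function using `re.sub`, which avoids the find-then-replace loop. The function name `strip_chinese` and the sample string are assumptions for illustration; the character range is the one used in the cells above.

```python
import re

# Same range as in the notebook: common CJK unified ideographs.
ZH_PATTERN = re.compile("[\u4e00-\u9fa5]+")


def strip_chinese(text):
    """Remove every run of Chinese characters from `text`."""
    return ZH_PATTERN.sub("", text)


sample = "Lung, RLL 主治醫師 biopsy"
matches = ZH_PATTERN.findall(sample)
cleaned = strip_chinese(sample)
print(matches, repr(cleaned))
```

Note that the surrounding spaces are left in place, so a later whitespace-collapsing step is still useful.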

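Splitting the dataset
* The split assumes that the pathology reports are independent of one another
* While inspecting the data, however, the order of the reports appeared to carry some relationship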
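
If the reports turn out not to be independent, one option is to split by group so that related reports never straddle the train/test boundary. The sketch below is only an illustration of that idea, not what the notebook does: it assumes a mapping from record ID to a patient ID is available (the data shown here does not confirm such a field), and `split_by_group` is a hypothetical helper.

```python
import random


def split_by_group(record_to_group, train_ratio=0.8, seed=1209):
    """Split record IDs so that all records of one group land on the same side."""
    groups = sorted(set(record_to_group.values()))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_train = int(len(groups) * train_ratio)
    train_groups = set(groups[:n_train])
    train_ids = [r for r, g in record_to_group.items() if g in train_groups]
    test_ids = [r for r, g in record_to_group.items() if g not in train_groups]
    return train_ids, test_ids


# Hypothetical mapping from record ID to patient ID.
record_to_patient = {"1": "A", "2": "A", "3": "B", "4": "C", "5": "C", "6": "D", "7": "E"}
train_ids, test_ids = split_by_group(record_to_patient)
print(train_ids, test_ids)
```

The ratio is applied to groups rather than records, so the record-level proportions are only approximate.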

In [13]:
import json
import re
from pathlib import Path
import numpy as np
from sklearn.model_selection import train_test_split

data = json.loads(Path("./data/raw/NER_keycontent_708patients.json").read_text())
Path("./data/raw/NER_H&F_keycontent.json").write_text(json.dumps(data, indent=2))

# Remove Chinese characters from every extracted paragraph
zhPattern = re.compile(u'[\u4e00-\u9fa5]+')
for idx, example in data.items():
    for target, items in example["NER_target"].items():
        match = zhPattern.findall(items["content"])
        for word in match:
            items["content"] = items["content"].replace(word, "")

seed = 1209
train_ratio = 0.8
data_ids = np.array([int(idx) for idx in data.keys()])
train_ids, test_ids = train_test_split(data_ids, shuffle=True, random_state=seed, train_size=train_ratio)
train = [data[str(idx)] for idx in train_ids]
test = [data[str(idx)] for idx in test_ids]
Path("./data/processed/train.json").write_text(json.dumps(train, indent=2, ensure_ascii=False))
Path("./data/processed/test.json").write_text(json.dumps(test, indent=2, ensure_ascii=False))

1809967

In [4]:
import pandas as pd
from collections import defaultdict
import json
from pathlib import Path

soap_target = ["organ", "Bx-site", "sampling method", "diagnosis"]
path_target = ["H_type", "H_grade", "TF", "LV", "CM", "size"]

target_to_question = {
    "organ": "What is the organ?", 
    "Bx-site": "What is the Bx-site?", 
    "sampling method": "What is the sampling method?", 
    "diagnosis": "What is the diagnosis?",
    "H_type": "What is the value of Histologic Type?", 
    "H_grade": "What is the value of Histologic Grade?", 
    "TF": "What is the value of Tumor Focality?", 
    "LV": "What is the value of Lymph-Vascular Invasion?", 
    "CM": "What is the value of closest margin?", 
    "size": "What is the value of tumor size?"
}

data_path = {
    "train": "./data/processed/train.json",
    "test": "./data/processed/test.json"
}

input_data = defaultdict(list)
for split, path in data_path.items():
    split_data = json.loads(Path(path).read_text())
    for example in split_data:
        for target in soap_target:
            if example["soap report"] == "":
                assert example['cetology report'] != ""
                input_data[split].append([
                    "question",
                    f"{target_to_question[target]} context: {example['cetology report']} NO",
                    "NO" if example["NER_target"][target]["annotate"] == "-" else example["NER_target"][target]["annotate"]
                ])
            else:
                if example['soap report'] == "":
                    continue
                input_data[split].append([
                    "question",
                    f"{target_to_question[target]} context: {example['soap report']} NO",
                    "NO" if example["NER_target"][target]["annotate"] == "-" else example["NER_target"][target]["annotate"]
                ])
        for target in path_target:
            if example['path report'] == "":
                continue
            input_data[split].append([
                "question",
                f"{target_to_question[target]} context: {example['path report']} NO",
                "NO" if example["NER_target"][target]["annotate"] == "-" else example["NER_target"][target]["annotate"]
            ])

train_df = pd.DataFrame(input_data["train"])
train_df.columns = ["prefix", "input_text", "target_text"]
train_df.to_excel("./data/processed/train.xlsx", index=False)

test_df = pd.DataFrame(input_data["test"])
test_df.columns = ["prefix", "input_text", "target_text"]
test_df.to_excel("./data/processed/test.xlsx", index=False)

In [2]:
import pandas as pd
from collections import defaultdict
import json
from pathlib import Path
import re


SKIP_SIGN = "without content"
NO_TARGET = "-"
NO_TEXT = "NO"

target_to_question = {
    "organ": "What is the organ?", 
    "Bx-site": "What is the Bx-site?", 
    "sampling method": "What is the sampling method?", 
    "diagnosis": "What is the diagnosis?",
    "H_type": "What is the value of Histologic Type?", 
    "H_grade": "What is the value of Histologic Grade?", 
    "TF": "What is the value of Tumor Focality?", 
    "LV": "What is the value of Lymph-Vascular Invasion?", 
    "CM": "What is the value of closest margin?", 
    "size": "What is the value of tumor size?"
}

target_to_prefix = {
    "organ": "question", 
    "Bx-site": "biopsy", 
    "sampling method": "question", 
    "diagnosis": "diagnosis",
    "H_type": "question", 
    "H_grade": "question", 
    "TF": "question", 
    "LV": "question", 
    "CM": "question", 
    "size": "question"
}

data_path = {
    "train": "./data/processed/train.json",
    "test": "./data/processed/test.json"
}

input_data = defaultdict(list)
for split, path in data_path.items():
    split_data = json.loads(Path(path).read_text())
    for example in split_data:
        for target, items in example["NER_target"].items():
            if items["content"] == SKIP_SIGN:
                continue
            
            prefix = target_to_prefix[target]
            input_text = f"{target_to_question[target]} context: {items['content']} {NO_TEXT}"
            # Collapse runs of whitespace into single spaces
            input_text = re.sub(r"\s+", " ", input_text)
            target_text = NO_TEXT if items["annotate"] == NO_TARGET else items["annotate"]
            input_data[split].append([prefix, input_text, target_text])

train_df = pd.DataFrame(input_data["train"])
train_df.columns = ["prefix", "input_text", "target_text"]
train_df.to_excel("./data/processed/train.xlsx", index=False)

test_df = pd.DataFrame(input_data["test"])
test_df.columns = ["prefix", "input_text", "target_text"]
test_df.to_excel("./data/processed/test.xlsx", index=False)
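
As a worked example of the row format built above, the sketch below assembles one `(prefix, input_text, target_text)` triple in isolation. `build_row` is a hypothetical helper and the report text is invented; the `NO` marker and the `-` placeholder follow the constants in the cell above.

```python
import re

NO_TARGET = "-"
NO_TEXT = "NO"


def build_row(prefix, question, context, annotate):
    """Build one (prefix, input_text, target_text) row for the T5 data frame."""
    input_text = f"{question} context: {context} {NO_TEXT}"
    # Collapse newlines and repeated spaces into single spaces.
    input_text = re.sub(r"\s+", " ", input_text)
    target_text = NO_TEXT if annotate == NO_TARGET else annotate
    return prefix, input_text, target_text


row = build_row("question", "What is the organ?", "Lung,\n right lower lobe", "Lung")
missing = build_row("biopsy", "What is the Bx-site?", "Lung", "-")
print(row)
print(missing)
```

Appending `NO` to every input means the answer for an unannotated target is always present in the input text.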

In [16]:
import pandas as pd

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)
to_predict = [f"{row.prefix}: {row.input_text}" for row in test_df.itertuples()]
to_predict[:3]

['question: What is the organ? context: <MALIGNANCY> Lung, right lower lobe, CT-guided needle biopsy, small cell carcinoma NULL',
 'biopsy: What is the Bx-site? context: <MALIGNANCY> Lung, right lower lobe, CT-guided needle biopsy, small cell carcinoma NULL',
 'question: What is the sampling method? context: <MALIGNANCY> Lung, right lower lobe, CT-guided needle biopsy, small cell carcinoma NULL']

In [17]:
train_df.head()

Unnamed: 0,prefix,input_text,target_text
0,question,"What is the organ? context: Intestine, large, ...",
1,biopsy,"What is the Bx-site? context: Intestine, large...",
2,question,What is the sampling method? context: Intestin...,
3,diagnosis,"What is the diagnosis? context: Intestine, lar...",
4,question,"What is the organ? context: Lung, main bronchu...",Lung


In [18]:
test_df.head()

Unnamed: 0,prefix,input_text,target_text
0,question,"What is the organ? context: <MALIGNANCY> Lung,...",Lung
1,biopsy,What is the Bx-site? context: <MALIGNANCY> Lun...,RLL
2,question,What is the sampling method? context: <MALIGNA...,CT-guided needle biopsy
3,diagnosis,What is the diagnosis? context: <MALIGNANCY> L...,carcinoma
4,question,What is the organ? context: Pathologic Report ...,


Training

In [1]:
# Reference: https://towardsdatascience.com/the-guide-to-multi-tasking-with-the-t5-transformer-90c70a08837b
import logging
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args


train_df = pd.read_excel("./data/processed/train.xlsx", dtype=str)
test_df = pd.read_excel("./data/processed/test.xlsx", dtype=str)
to_predict = [f"{row.prefix}: {row.input_text}" for row in test_df.itertuples()]

# Configure the model
# General args: https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
# T5 args: https://simpletransformers.ai/docs/t5-model/
# If memory problem occurs, set lower max_seq_length or train_batch_size
model_args = T5Args()
model_args.manual_seed = 1209
model_args.max_seq_length = 900
model_args.train_batch_size = 8
model_args.num_train_epochs = 5
model_args.use_multiprocessing = False
model_args.fp16 = False
model_args.save_steps = -1
model_args.save_model_every_epoch = False
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.n_gpu = 2

model = T5Model("t5", "t5-base", args=model_args)

# Train the model
model.train_model(train_df, use_cuda=True)

# Make predictions with the model
preds = model.predict(to_predict)

  0%|          | 0/11151 [00:00<?, ?it/s]

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and the tokenizer under the `as_target_tokenizer` context manager to prepare
your targets.

Here is a short example:

model_inputs = tokenizer(src_texts, ...)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.



Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 0 of 5:   0%|          | 0/1394 [00:00<?, ?it/s]



Running Epoch 1 of 5:   0%|          | 0/1394 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/1394 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/1394 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/1394 [00:00<?, ?it/s]

Generating outputs:   0%|          | 0/358 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/2862 [00:00<?, ?it/s]

In [3]:
import pandas as pd

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)
test_df["prediction"] = preds
test_df.to_excel("./data/processed/test.xlsx", index=False)

Evaluation

In [11]:
import pandas as pd

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)

labels = test_df.target_text.values.tolist()
preds = test_df.prediction.values.tolist()

acc = sum(1 for l, p in zip(labels, preds) if l == p) / len(labels)
print(f"Accuracy is {round(acc, 4) * 100} %; number is {len(labels)}")

Accuracy is 88.02 %; number is 2862
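
The exact-match accuracy above can also be broken down per target. The sketch below assumes rows are available as `(target, label, prediction)` tuples; the helper names and sample rows are illustrative, not taken from the notebook. It strips whitespace before comparing and formats the percentage with a format specifier, which avoids floating-point artifacts that `round(acc, 4) * 100` can produce.

```python
from collections import defaultdict


def accuracy_by_target(rows):
    """rows: iterable of (target, label, prediction); returns {target: accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for target, label, pred in rows:
        totals[target] += 1
        if str(label).strip() == str(pred).strip():
            hits[target] += 1
    return {t: hits[t] / totals[t] for t in totals}


def as_percent(value):
    """Format a ratio such as 0.8557 as '85.57%'."""
    return f"{value:.2%}"


# Invented rows for illustration only.
rows = [("organ", "Lung", "Lung"), ("organ", "NO", "Lung "), ("Bx-site", "RLL", "RLL ")]
scores = accuracy_by_target(rows)
print({t: as_percent(s) for t, s in scores.items()})
```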


In [10]:
import pandas as pd

train_excel_path = "./data/processed/train.xlsx"
train_df = pd.read_excel(train_excel_path)

target_to_question = {
    "organ": "What is the organ?", 
    "Bx-site": "What is the Bx-site?", 
    "sampling method": "What is the sampling method?", 
    "diagnosis": "What is the diagnosis?",
    "H_type": "What is the value of Histologic Type?", 
    "H_grade": "What is the value of Histologic Grade?", 
    "TF": "What is the value of Tumor Focality?", 
    "LV": "What is the value of Lymph-Vascular Invasion?", 
    "CM": "What is the value of closest margin?", 
    "size": "What is the value of tumor size?"
}

for target, question in target_to_question.items():
    t_labels = [row.target_text for row in train_df.itertuples() if question in row.input_text]
    print(f"{target} number is {len(t_labels)}")

print(f"Total number is {len(train_df)}")

organ number is 2437
Bx-site number is 2437
sampling method number is 2437
diagnosis number is 2437
H_type number is 210
H_grade number is 257
TF number is 204
LV number is 210
CM number is 180
size number is 342
size total number is 11151


In [6]:
import pandas as pd

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)

target_to_question = {
    "organ": "What is the organ?", 
    "Bx-site": "What is the Bx-site?", 
    "sampling method": "What is the sampling method?", 
    "diagnosis": "What is the diagnosis?",
    "H_type": "What is the value of Histologic Type?", 
    "H_grade": "What is the value of Histologic Grade?", 
    "TF": "What is the value of Tumor Focality?", 
    "LV": "What is the value of Lymph-Vascular Invasion?", 
    "CM": "What is the value of closest margin?", 
    "size": "What is the value of tumor size?"
}

for target, question in target_to_question.items():
    t_labels = [row.target_text for row in test_df.itertuples() if question in row.input_text]
    t_preds = [row.prediction for row in test_df.itertuples() if question in row.input_text]
    acc = sum(1 for l, p in zip(t_labels, t_preds) if l == p) / len(t_labels)
    print(f"{target} accuracy is {round(acc, 4) * 100} %; number is {len(t_labels)}")

organ accuracy is 96.89 %; number is 610
Bx-site accuracy is 83.28 %; number is 610
sampling method accuracy is 95.41 %; number is 610
diagnosis accuracy is 85.57000000000001 %; number is 610
H_type accuracy is 40.32 %; number is 62
H_grade accuracy is 93.33 %; number is 75
TF accuracy is 87.1 %; number is 62
LV accuracy is 87.88 %; number is 66
CM accuracy is 85.96000000000001 %; number is 57
size accuracy is 60.0 %; number is 100


In [12]:
import pandas as pd

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)

target_to_question = {
    "organ": "What is the organ?", 
    "Bx-site": "What is the Bx-site?", 
    "sampling method": "What is the sampling method?", 
    "diagnosis": "What is the diagnosis?",
    "H_type": "What is the value of Histologic Type?", 
    "H_grade": "What is the value of Histologic Grade?", 
    "TF": "What is the value of Tumor Focality?", 
    "LV": "What is the value of Lymph-Vascular Invasion?", 
    "CM": "What is the value of closest margin?", 
    "size": "What is the value of tumor size?"
}

output = {"target": [], "text": [], "label": [], "prediction": []}
for target, question in target_to_question.items():
    for row in test_df.itertuples():
        if question in row.input_text:
            text = row.input_text
            text = text.replace(f"{question} context: ", "")
            text = text[:-3]  # drop the last 3 characters, i.e. the appended " NO"
            if row.target_text != row.prediction:
                output["target"].append(target)
                output["text"].append(text)
                output["label"].append("NULL" if row.target_text == "NO" else row.target_text)
                output["prediction"].append("NULL" if row.prediction == "NO" else row.prediction)
output = pd.DataFrame(output)
output.to_excel("./outputs/errors.xlsx", index=False)
    

### Training jointly with the classification-task data
* Only `organ`, `Bx-site`, `sampling method`, and `diagnosis` are extracted from the **cetology report** and **soap report**
* All other targets come from the **pathology report**
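
The routing rule in the two bullets above can be sketched as a small lookup. `select_context` is a hypothetical helper and the sample record is invented. It prefers the soap report and falls back to the cetology report, and unlike the notebook it returns `None` when no usable report exists.

```python
SOAP_TARGETS = {"organ", "Bx-site", "sampling method", "diagnosis"}


def select_context(example, target):
    """Pick the report a target should be extracted from, or None if empty."""
    if target in SOAP_TARGETS:
        # Prefer the soap report, fall back to the cetology report.
        return example["soap report"] or example["cetology report"] or None
    return example["path report"] or None


# Invented record for illustration only.
example = {
    "soap report": "",
    "cetology report": "Pleural effusion, cytology",
    "path report": "Lung, lobectomy",
}
print(select_context(example, "organ"))
print(select_context(example, "H_type"))
```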

Counting the categories of each target

In [16]:
import json
from pathlib import Path

data_path = "./data/raw/merge_keycontent.json"
data = json.loads(Path(data_path).read_text())

targets = list(data["1"]["NER_target"].keys())
target_cat = {}
for target in targets:
    cat = []
    for idx, example in data.items():
        annotate = example["NER_target"][target]["annotate"]
        if annotate != "-":
            cat.append(str(annotate).strip())
    target_cat[target] = {"count": len(cat), "category": {c: cat.count(c) for c in set(cat)}}
Path("./save/target_cat.json").write_text(json.dumps(target_cat, indent=2))

59761
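
The per-target tally can also be written with `collections.Counter`, which counts in a single pass instead of calling `list.count` once per category. This is a sketch with invented annotations; `count_categories` is a hypothetical helper that returns the same `count` / `category` shape as the cell above.

```python
from collections import Counter


def count_categories(annotations, no_target="-"):
    """Tally annotation values, skipping the `-` placeholder."""
    values = [str(a).strip() for a in annotations if a != no_target]
    return {"count": len(values), "category": dict(Counter(values))}


# Invented annotations for illustration only.
result = count_categories(["P", "N", " P", "-", "U", "N ", "P"])
print(result)
```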

Coverage: how often each target's annotation appears verbatim in the source text

In [7]:
import json
from pathlib import Path

data_path = "./data/raw/merge_keycontent.json"
data = json.loads(Path(data_path).read_text())

targets = list(data["1"]["NER_target"].keys())
target_coverage = {}
for target in targets:
    coverage = 0
    total = 0
    for idx, example in data.items():
        annotate = str(example["NER_target"][target]["annotate"])
        content = str(example["NER_target"][target]["content"])
        if annotate != "-":
            total += 1
            if annotate in example["soap report"]:
                coverage += 1
            elif annotate in example["path report"]:
                coverage += 1
            elif annotate in example["cetology report"]:
                coverage += 1
    target_coverage[target] = f"{round(coverage / total, 4) * 100} %"
Path("./save/target_coverage.json").write_text(json.dumps(target_coverage, indent=2))

1139
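
A compact sketch of the same coverage measure follows, written as a function over a list of records. `annotation_coverage` is a hypothetical helper and the records are invented. It returns `None` when a target has no annotations, which guards against the division by zero that `coverage / total` would otherwise raise.

```python
REPORT_KEYS = ("soap report", "path report", "cetology report")


def annotation_coverage(examples, target, report_keys=REPORT_KEYS):
    """Fraction of annotated records whose annotation occurs in any report."""
    total = covered = 0
    for example in examples:
        annotate = str(example["NER_target"][target]["annotate"])
        if annotate == "-":
            continue
        total += 1
        if any(annotate in example[key] for key in report_keys):
            covered += 1
    return covered / total if total else None


# Invented records for illustration only.
examples = [
    {"soap report": "Lung, RLL, biopsy", "path report": "", "cetology report": "",
     "NER_target": {"organ": {"annotate": "Lung"}}},
    {"soap report": "Colon, biopsy", "path report": "", "cetology report": "",
     "NER_target": {"organ": {"annotate": "Intestine"}}},
    {"soap report": "", "path report": "", "cetology report": "",
     "NER_target": {"organ": {"annotate": "-"}}},
]
coverage = annotation_coverage(examples, "organ")
print(coverage)
```

A low value suggests that the annotation is a normalised label rather than a span copied from the report.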

Splitting the dataset

In [3]:
import json
from pathlib import Path
import numpy as np
from sklearn.model_selection import train_test_split

data_path = "./data/raw/merge_keycontent.json"
data = json.loads(Path(data_path).read_text())

seed = 1209
train_ratio = 0.8
data_ids = list(data.keys())
train_ids, test_ids = train_test_split(data_ids, shuffle=True, random_state=seed, train_size=train_ratio)
train = [data[idx] for idx in train_ids]
test = [data[idx] for idx in test_ids]
Path("./data/processed/train.json").write_text(json.dumps(train, indent=2, ensure_ascii=False))
Path("./data/processed/test.json").write_text(json.dumps(test, indent=2, ensure_ascii=False))

5560052

In [1]:
import json
from pathlib import Path

target_to_question = {
    'organ': "What is the organ?",
    'Bx-site': "What is the Bx-site?",
    'sampling method': "What is the sampling method?",
    'diagnosis': "What is the diagnosis?",
    'size': "What is the value of tumor size?",
    'Greatest_dimension': "What is the value of greatest dimension?",
    'H_type': "What is the value of Histologic Type?",
    'H_grade': "What is the value of Histologic Grade?",
    'TF': "What is the value of Tumor Focality?",
    'LV': "What is the value of Lymph-Vascular Invasion?",
    'CM': "What is the value of closest margin?",
    'VPI': "What is the value of VPI?",
    'EGFR': "What is the EGFR?",
    'ALK': "Is ALK Positive, Negative, or Unkown?",
    'ROS1': "Is ROS1 Positive, Negative, or Unkown?",
    'BRAF': "Is BRAF Positive, Negative, or Unkown?",
    'MET': "Is MET Positive, Negative, or Unkown?",
    'KRAS': "Is KRAS Positive, Negative, or Unkown?",
    'ERBB2': "Is ERBB2 Positive, Negative, or Unkown?",
    'PIK3CA': "Is PIK3CA Positive, Negative, or Unkown?",
    'NRAS': "Is NRAS Positive, Negative, or Unkown?",
    'MEK1': "Is MEK1 Positive, Negative, or Unkown?",
    'NTRK': "Is NTRK Positive, Negative, or Unkown?",
    'RET': "Is RET Positive, Negative, or Unkown?",
    'PDL1': "What is the PDL1?",
    'ver': "What is the version?",
    'pT': "What is the pT?",
    'pN': "What is the pN?",
    'pM': "What is the pM?",
    'pStage': "What is the pStage?",
    'CK7': "Is CK7 Positive, Negative, or Unkown?",
    'TTF': "Is TTF Positive, Negative, or Unkown?",
    'Napsin': "Is Napsin Positive, Negative, or Unkown?",
    'CK20': "Is CK20 Positive, Negative, or Unkown?",
    'P40': "Is P40 Positive, Negative, or Unkown?",
    'CDX2': "Is CDX2 Positive, Negative, or Unkown?",
    'P63': "Is P63 Positive, Negative, or Unkown?",
    'P16': "Is P16 Positive, Negative, or Unkown?",
    'cytokeratin': "Is cytokeratin Positive, Negative, or Unkown?",
    'Vimentin': "Is Vimentin Positive, Negative, or Unkown?",
    'PAX': "Is PAX Positive, Negative, or Unkown?",
    'CD56': "Is CD56 Positive, Negative, or Unkown?",
    'chromogranin': "Is chromogranin Positive, Negative, or Unkown?",
    'synaptophysin': "Is synaptophysin Positive, Negative, or Unkown?",
    'GATA3': "Is GATA3 Positive, Negative, or Unkown?"
}
Path("./save/target_to_question.json").write_text(json.dumps(target_to_question, indent=2))

2192

In [2]:
import json
from pathlib import Path

target_to_prefix = {
    'organ': "question",
    'Bx-site': "biopsy",
    'sampling method': "question",
    'diagnosis': "diagnosis",
    'size': "question",
    'Greatest_dimension': "question",
    'H_type': "question",
    'H_grade': "question",
    'TF': "question",
    'LV': "question",
    'CM': "question",
    'VPI': "question",
    'EGFR': "EGFR",
    'ALK': "ALK",
    'ROS1': "ROS1",
    'BRAF': "BRAF",
    'MET': "MET",
    'KRAS': "KRAS",
    'ERBB2': "ERBB2",
    'PIK3CA': "PIK3CA",
    'NRAS': "NRAS",
    'MEK1': "MEK1",
    'NTRK': "NTRK",
    'RET': "RET",
    'PDL1': "PDL1",
    'ver': "question",
    'pT': "question",
    'pN': "question",
    'pM': "question",
    'pStage': "question",
    'CK7': "CK7",
    'TTF': "TTF",
    'Napsin': "Napsin",
    'CK20': "CK20",
    'P40': "P40",
    'CDX2': "CDX2",
    'P63': "P63",
    'P16': "P16",
    'cytokeratin': "cytokeratin",
    'Vimentin': "Vimentin",
    'PAX': "PAX",
    'CD56': "CD56",
    'chromogranin': "chromogranin",
    'synaptophysin': "synaptophysin",
    'GATA3': "GATA3"
}
Path("./save/target_to_prefix.json").write_text(json.dumps(target_to_prefix, indent=2))

956

In [3]:
import json
from pathlib import Path

soap_targets = ['organ', 'Bx-site', 'sampling method', 'diagnosis']
Path("./save/soap_targets.json").write_text(json.dumps(soap_targets, indent=2))

62

In [20]:
import pandas as pd
from collections import defaultdict
import json
from pathlib import Path
import re


NO_TARGET = "-"
NO_TEXT = "NO"
USE_CONTENT = False

data_path = {
    "train": "./data/processed/train.json",
    "test": "./data/processed/test.json"
}

input_data = defaultdict(list)
for split, path in data_path.items():
    split_data = json.loads(Path(path).read_text())
    for example in split_data:
        for target, items in example["NER_target"].items():
            prefix = target_to_prefix[target]
            question = target_to_question[target]
            if USE_CONTENT:
                if target not in soap_targets and example["path report"] == "":
                    continue
                input_text = f"{question} context: {items['content']} {NO_TEXT}"
                input_text = re.sub(r"\s+", " ", input_text)
                target_text = NO_TEXT if items["annotate"] == NO_TARGET else items["annotate"]
                input_data[split].append([prefix, input_text, target_text])
            else:
                if target in soap_targets:
                    if example["soap report"] != "":
                        input_text = f"{question} context: {example['soap report']} {NO_TEXT}"
                    else:
                        input_text = f"{question} context: {example['cetology report']} {NO_TEXT}"
                else:
                    if example["path report"] != "":
                        input_text = f"{question} context: {example['path report']} {NO_TEXT}"
                    else:
                        continue
                input_text = re.sub(r"\s+", " ", input_text)
                target_text = NO_TEXT if items["annotate"] == NO_TARGET else items["annotate"]
                input_data[split].append([prefix, input_text, target_text])
                    

train_df = pd.DataFrame(input_data["train"])
train_df.columns = ["prefix", "input_text", "target_text"]
train_df.to_excel("./data/processed/train.xlsx", index=False)

test_df = pd.DataFrame(input_data["test"])
test_df.columns = ["prefix", "input_text", "target_text"]
test_df.to_excel("./data/processed/test.xlsx", index=False)

In [1]:
# Reference: https://towardsdatascience.com/the-guide-to-multi-tasking-with-the-t5-transformer-90c70a08837b
import logging
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args


train_df = pd.read_excel("./data/processed/train.xlsx", dtype=str)
test_df = pd.read_excel("./data/processed/test.xlsx", dtype=str)
to_predict = [f"{row.prefix}: {row.input_text}" for row in test_df.itertuples()]

# Configure the model
# General args: https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
# T5 args: https://simpletransformers.ai/docs/t5-model/
# If memory problem occurs, set lower max_seq_length or train_batch_size
model_args = T5Args()
model_args.manual_seed = 1209
model_args.max_seq_length = 900
model_args.train_batch_size = 4
model_args.num_train_epochs = 3
model_args.use_multiprocessing = False
model_args.fp16 = False
model_args.save_steps = -1
model_args.save_model_every_epoch = False
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.n_gpu = 2

model = T5Model("t5", "t5-base", args=model_args)

# Train the model
model.train_model(train_df, use_cuda=True)

# Make predictions with the model
preds = model.predict(to_predict)
test_df["prediction"] = preds
test_df.to_excel("./data/processed/test.xlsx", index=False)

  0%|          | 0/109665 [00:00<?, ?it/s]



Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/27417 [00:00<?, ?it/s]



Running Epoch 1 of 3:   0%|          | 0/27417 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/27417 [00:00<?, ?it/s]

Generating outputs:   0%|          | 0/3432 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/27450 [00:00<?, ?it/s]

In [23]:
# Make predictions with the model
to_predict = [f"{row.prefix}: {row.input_text}" for row in test_df.itertuples()]
preds = model.predict(to_predict)
test_df["prediction"] = preds
test_df.to_excel("./data/processed/test.xlsx", index=False)

Generating outputs:   0%|          | 0/3073 [00:00<?, ?it/s]



Decoding outputs:   0%|          | 0/24580 [00:00<?, ?it/s]

In [25]:
import pandas as pd

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path, dtype=str)

labels = test_df.target_text.values.tolist()
preds = test_df.prediction.values.tolist()

labels = [l.strip() for l in labels]
preds = [p.strip() for p in preds]

acc = sum(1 for l, p in zip(labels, preds) if l == p) / len(labels)
print(f"Accuracy is {round(acc, 4) * 100} %; number is {len(labels)}")

for target, question in target_to_question.items():
    t_labels, t_preds = [], []
    t_labels = [row.target_text for row in test_df.itertuples() if question in row.input_text]
    t_preds = [row.prediction for row in test_df.itertuples() if question in row.input_text]
    acc = sum(1 for l, p in zip(t_labels, t_preds) if l == p) / len(t_labels)
    print(f"{target} accuracy is {round(acc, 4) * 100} %; number is {len(t_labels)}")

Accuracy is 96.41999999999999 %; number is 24580
organ accuracy is 95.57 %; number is 610
Bx-site accuracy is 78.85 %; number is 610
sampling method accuracy is 90.82000000000001 %; number is 610
diagnosis accuracy is 80.49 %; number is 610
size accuracy is 88.33 %; number is 540
Greatest_dimension accuracy is 93.7 %; number is 540
H_type accuracy is 89.44 %; number is 540
H_grade accuracy is 97.59 %; number is 540
TF accuracy is 97.22 %; number is 540
LV accuracy is 96.11 %; number is 540
CM accuracy is 95.19 %; number is 540
VPI accuracy is 97.22 %; number is 540
EGFR accuracy is 98.89 %; number is 540
ALK accuracy is 100.0 %; number is 540
ROS1 accuracy is 99.81 %; number is 540
BRAF accuracy is 100.0 %; number is 540
MET accuracy is 100.0 %; number is 540
KRAS accuracy is 100.0 %; number is 540
ERBB2 accuracy is 100.0 %; number is 540
PIK3CA accuracy is 100.0 %; number is 540
NRAS accuracy is 100.0 %; number is 540
MEK1 accuracy is 100.0 %; number is 540
NTRK accuracy is 100.0 %; n
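
Some of the figures above (for example `96.41999999999999` and `90.82000000000001`) carry floating-point noise because the value is rounded before it is multiplied by 100. A minimal sketch of one way to avoid this is to scale first and then format to a fixed number of decimals. The helper name `format_pct` and the counts in the usage line are illustrative assumptions, not taken from the notebook:

```python
def format_pct(acc: float, digits: int = 2) -> str:
    """Format a fraction in [0, 1] as a percentage string with fixed decimals."""
    # Scale first, then let string formatting do the rounding, so no
    # binary floating-point residue ends up in the printed value.
    return f"{acc * 100:.{digits}f}"


# Hypothetical counts: 554 correct out of 610 examples.
print(f"Accuracy is {format_pct(554 / 610)} %")
```

The same formatting could be applied to the per-target lines without changing how the accuracy itself is computed.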

In [18]:
targets_cat = {
    'EGFR': ["18", "19", "19,20", "20", "20,21", "21", "N", "U"],
    'ALK': ["P", "N", "U"],
    'ROS1': ["P", "N", "U"],
    'BRAF': ["P", "N", "U"],
    'MET': ["P", "N", "U"],
    'KRAS': ["P", "N", "U"],
    'ERBB2': ["P", "N", "U"],
    'PIK3CA': ["P", "N", "U"],
    'NRAS': ["P", "N", "U"],
    'MEK1': ["P", "N", "U"],
    'NTRK': ["P", "N", "U"],
    'RET': ["P", "N", "U"],
    'CK7': ["P", "N", "U"],
    'TTF': ["P", "N", "U"],
    'Napsin': ["P", "N", "U"],
    'CK20': ["P", "N", "U"],
    'P40': ["P", "N", "U"],
    'CDX2': ["P", "N", "U"],
    'P63': ["P", "N", "U"],
    'P16': ["P", "N", "U"],
    'cytokeratin': ["P", "N", "U"],
    'Vimentin': ["P", "N", "U"],
    'PAX': ["P", "N", "U"],
    'CD56': ["P", "N", "U"],
    'chromogranin': ["P", "N", "U"],
    'synaptophysin': ["P", "N", "U"],
    'GATA3': ["P", "N", "U"]
}

In [30]:
from pathlib import Path
import pandas as pd
from sklearn.metrics import classification_report, multilabel_confusion_matrix

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)

classification_reports = []
confusion_matrics = []

df = test_df[test_df["prefix"] == 'EGFR']
labels = df.target_text.values.tolist()
preds = df.prediction.values.tolist()

EGFR = ["18", "19", "20", "21", "N", "U"]

# Multi-hot encode each value: one 0/1 flag per EGFR category
labels = [[1 if e in str(l) else 0 for e in EGFR] for l in labels]
preds = [[1 if e in str(p) else 0 for e in EGFR] for p in preds]

cr = classification_report(labels, preds)
cm = multilabel_confusion_matrix(labels, preds)
print(cr)
print(cm)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.88      1.00      0.93        14
           2       1.00      0.33      0.50         6
           3       1.00      1.00      1.00        33
           4       0.94      1.00      0.97        44
           5       1.00      1.00      1.00       445

   micro avg       0.99      0.99      0.99       543
   macro avg       0.80      0.72      0.73       543
weighted avg       0.99      0.99      0.99       543
 samples avg       0.99      0.99      0.99       543

[[[539   0]
  [  1   0]]

 [[524   2]
  [  0  14]]

 [[534   0]
  [  4   2]]

 [[507   0]
  [  0  33]]

 [[493   3]
  [  0  44]]

 [[ 95   0]
  [  1 444]]]
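
The support total of 543 in the report above exceeds the 540 EGFR rows because each label is multi-hot encoded: a label that names two exons, such as `19,20`, sets two flags and is counted once under each. A minimal sketch of that encoding, using the same substring test as the cell above; the function name `encode_egfr` is an illustrative assumption:

```python
EGFR = ["18", "19", "20", "21", "N", "U"]


def encode_egfr(label) -> list:
    """Return one 0/1 flag per EGFR category, set when the category
    string occurs anywhere in the label (simple substring test)."""
    text = str(label)
    return [1 if e in text else 0 for e in EGFR]


print(encode_egfr("19"))     # single exon sets one flag
print(encode_egfr("19,20"))  # two exons set two flags
print(encode_egfr("U"))      # unknown
```

Because the test is a substring match, it is only safe while no category string is contained in another, which holds for the six categories listed here.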




In [32]:
from pathlib import Path
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)

classification_reports = []
confusion_matrics = []

df = test_df[test_df["prefix"] == 'TTF']
labels = df.target_text.values.tolist()
preds = df.prediction.values.tolist()

# Map any prediction outside the expected categories to "U"
preds = ["U" if p not in ["P", "N", "U"] else p for p in preds]

cr = classification_report(labels, preds)
cm = confusion_matrix(labels, preds)
print(cr)
print(cm)

              precision    recall  f1-score   support

           N       0.82      0.80      0.81        40
           P       0.95      0.88      0.91       103
           U       0.97      0.99      0.98       397

    accuracy                           0.96       540
   macro avg       0.91      0.89      0.90       540
weighted avg       0.95      0.96      0.95       540

[[ 32   3   5]
 [  5  91   7]
 [  2   2 393]]
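
The per-class figures in the report can be recomputed from the confusion matrix, whose rows are the true labels and whose columns are the predictions, both in the order N, P, U. A minimal sketch in plain Python; the helper name `per_class_metrics` is an illustrative assumption:

```python
def per_class_metrics(matrix, names):
    """Compute (precision, recall) per class from a square confusion
    matrix with true labels on rows and predictions on columns."""
    metrics = {}
    for i, name in enumerate(names):
        tp = matrix[i][i]
        row_total = sum(matrix[i])                 # all true examples of the class
        col_total = sum(row[i] for row in matrix)  # all predictions of the class
        precision = tp / col_total if col_total else 0.0
        recall = tp / row_total if row_total else 0.0
        metrics[name] = (round(precision, 2), round(recall, 2))
    return metrics


ttf_matrix = [[32, 3, 5], [5, 91, 7], [2, 2, 393]]  # copied from the output above
print(per_class_metrics(ttf_matrix, ["N", "P", "U"]))
```

For N, for instance, 32 of the 40 true examples are recovered (recall 0.80) and 32 of the 39 N predictions are correct (precision 0.82), matching the report.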


In [22]:
from pathlib import Path
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)

classification_reports = []
confusion_matrics = []

for target, cat in targets_cat.items():
    print(target)
    df = test_df[test_df["prefix"] == target]
    labels = df.target_text.values.tolist()
    preds = df.prediction.values.tolist()
    # assert all(p in cat for p in preds), target

    cr = classification_report(labels, preds)
    cm = confusion_matrix(labels, preds)
    print(cr)
    print(cm)

EGFR
              precision    recall  f1-score   support

          18       0.00      0.00      0.00         1
          19       0.87      1.00      0.93        13
       19,20       1.00      1.00      1.00         1
          20       0.00      0.00      0.00         3
       20,21       1.00      0.50      0.67         2
          21       0.97      1.00      0.98        31
           N       0.94      1.00      0.97        44
           U       1.00      1.00      1.00       445

    accuracy                           0.99       540
   macro avg       0.72      0.69      0.69       540
weighted avg       0.98      0.99      0.99       540

[[  0   0   0   0   0   0   1   0]
 [  0  13   0   0   0   0   0   0]
 [  0   0   1   0   0   0   0   0]
 [  0   2   0   0   0   0   1   0]
 [  0   0   0   0   1   1   0   0]
 [  0   0   0   0   0  31   0   0]
 [  0   0   0   0   0   0  44   0]
 [  0   0   0   0   0   0   1 444]]
ALK
              precision    recall  f1-score   support

    



[[  1   2   1]
 [  1   7   1]
 [  1   1 525]]
GATA3
              precision    recall  f1-score   support

           U       1.00      1.00      1.00       540

    accuracy                           1.00       540
   macro avg       1.00      1.00      1.00       540
weighted avg       1.00      1.00      1.00       540

[[540]]


Train with extracted content

In [4]:
import pandas as pd
from collections import defaultdict
import json
from pathlib import Path
import re


NO_TARGET = "-"
NO_TEXT = "NO"
USE_CONTENT = True

data_path = {
    "train": "./data/processed/train.json",
    "test": "./data/processed/test.json"
}

target_to_prefix = json.loads(Path("./save/target_to_prefix.json").read_text())
target_to_question = json.loads(Path("./save/target_to_question.json").read_text())
soap_targets = json.loads(Path("./save/soap_targets.json").read_text())

input_data = defaultdict(list)
for split, path in data_path.items():
    split_data = json.loads(Path(path).read_text())
    for example in split_data:
        for target, items in example["NER_target"].items():
            prefix = target_to_prefix[target]
            question = target_to_question[target]
            if USE_CONTENT:
                # Path-report targets need a path report; skip them otherwise
                if target not in soap_targets and example["path report"] == "":
                    continue
                input_text = f"{question} context: {items['content']} {NO_TEXT}"
            else:
                if target in soap_targets:
                    if example["soap report"] != "":
                        input_text = f"{question} context: {example['soap report']} {NO_TEXT}"
                    else:
                        input_text = f"{question} context: {example['cetology report']} {NO_TEXT}"
                else:
                    if example["path report"] != "":
                        input_text = f"{question} context: {example['path report']} {NO_TEXT}"
                    else:
                        continue
            # Collapse newlines and repeated whitespace into single spaces
            input_text = re.sub(r"\s+", " ", input_text)
            if items["annotate"] == NO_TARGET:
                target_text = NO_TEXT
            elif items["annotate"] == "P":
                target_text = "Positive"
            elif items["annotate"] == "N":
                target_text = "Negative"
            elif items["annotate"] == "U":
                target_text = "Unknown"
            else:
                target_text = items["annotate"]
            input_data[split].append([prefix, input_text, target_text])
                    

train_df = pd.DataFrame(input_data["train"])
train_df.columns = ["prefix", "input_text", "target_text"]
train_df.to_excel("./data/processed/train_content.xlsx", index=False)

test_df = pd.DataFrame(input_data["test"])
test_df.columns = ["prefix", "input_text", "target_text"]
test_df.to_excel("./data/processed/test_content.xlsx", index=False)
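
To make the transformation above concrete, here is a minimal sketch that builds the `input_text` and `target_text` for a single annotation. The question string and the sample record are made-up placeholders (the real questions are loaded from `./save/target_to_question.json`), and `build_pair` and `LABEL_TEXT` are hypothetical names, not part of the notebook:

```python
import re

NO_TARGET = "-"
NO_TEXT = "NO"
LABEL_TEXT = {"P": "Positive", "N": "Negative", "U": "Unknown"}


def build_pair(question: str, items: dict) -> tuple:
    """Build (input_text, target_text) from one NER_target entry."""
    input_text = f"{question} context: {items['content']} {NO_TEXT}"
    # Collapse newlines and repeated spaces, as the cell above does
    input_text = re.sub(r"\s+", " ", input_text)
    annotate = items["annotate"]
    if annotate == NO_TARGET:
        target_text = NO_TEXT
    else:
        # Expand P/N/U to full words; keep any other annotation verbatim
        target_text = LABEL_TEXT.get(annotate, annotate)
    return input_text, target_text


# Placeholder question and record, for illustration only
sample = {"content": "Lung, upper lobe, left,\n\nCT-guided biopsy", "annotate": "Lung"}
print(build_pair("What is the organ?", sample))
```

The dictionary lookup is just a compact stand-in for the `if`/`elif` chain; the resulting strings are the same.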

In [None]:
# Reference: https://towardsdatascience.com/the-guide-to-multi-tasking-with-the-t5-transformer-90c70a08837b
import logging
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args


train_df = pd.read_excel("./data/processed/train_content.xlsx", dtype=str)
test_df = pd.read_excel("./data/processed/test_content.xlsx", dtype=str)
to_predict = [f"{row.prefix}: {row.input_text}" for row in test_df.itertuples()]

# Configure the model
# General args: https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
# T5 args: https://simpletransformers.ai/docs/t5-model/
# If memory problem occurs, set lower max_seq_length or train_batch_size
model_args = T5Args()
model_args.manual_seed = 1209
model_args.max_seq_length = 900
model_args.train_batch_size = 4
model_args.num_train_epochs = 3
model_args.use_multiprocessing = False
model_args.fp16 = False
model_args.save_steps = -1
model_args.save_model_every_epoch = False
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.n_gpu = 2
model_args.output_dir = "./outputs/t5-with-content/"
model_args.wandb_project = "lung-cancer"
model_args.wandb_kwargs = {"name": "t5-with-content"}

model = T5Model("t5", "t5-base", args=model_args)

# Train the model
model.train_model(train_df, use_cuda=True)

# Make predictions with the model
preds = model.predict(to_predict)
test_df["prediction"] = preds
test_df.to_excel(model_args.output_dir+"test.xlsx", index=False)

In [1]:
import pandas as pd
import json
from pathlib import Path

test_excel_path = "./outputs/t5-with-content/test.xlsx"
test_df = pd.read_excel(test_excel_path, dtype=str)

labels = test_df.target_text.values.tolist()
preds = test_df.prediction.values.tolist()

labels = [l.strip() for l in labels]
preds = [p.strip() for p in preds]

target_to_question = json.loads(Path("./save/target_to_question.json").read_text())

acc = sum(1 for l, p in zip(labels, preds) if l == p) / len(labels)
print(f"Accuracy is {round(acc, 4) * 100} %; number is {len(labels)}")

for target, question in target_to_question.items():
    t_labels, t_preds = [], []
    t_labels = [row.target_text for row in test_df.itertuples() if question in row.input_text]
    t_preds = [row.prediction for row in test_df.itertuples() if question in row.input_text]
    acc = sum(1 for l, p in zip(t_labels, t_preds) if l == p) / len(t_labels)
    print(f"{target} accuracy is {round(acc, 4) * 100} %; number is {len(t_labels)}")

Accuracy is 95.5 %; number is 24580
organ accuracy is 95.89999999999999 %; number is 610
Bx-site accuracy is 75.9 %; number is 610
sampling method accuracy is 86.72 %; number is 610
diagnosis accuracy is 68.52000000000001 %; number is 610
size accuracy is 88.33 %; number is 540
Greatest_dimension accuracy is 89.44 %; number is 540
H_type accuracy is 89.25999999999999 %; number is 540
H_grade accuracy is 97.41 %; number is 540
TF accuracy is 93.15 %; number is 540
LV accuracy is 97.04 %; number is 540
CM accuracy is 97.59 %; number is 540
VPI accuracy is 99.07000000000001 %; number is 540
EGFR accuracy is 98.7 %; number is 540
ALK accuracy is 100.0 %; number is 540
ROS1 accuracy is 98.7 %; number is 540
BRAF accuracy is 100.0 %; number is 540
MET accuracy is 100.0 %; number is 540
KRAS accuracy is 100.0 %; number is 540
ERBB2 accuracy is 100.0 %; number is 540
PIK3CA accuracy is 100.0 %; number is 540
NRAS accuracy is 100.0 %; number is 540
MEK1 accuracy is 100.0 %; number is 540
NTRK a

In [10]:

test_excel_path = "./outputs/t5-with-content/test.xlsx"
test_df = pd.read_excel(test_excel_path)
df = test_df[test_df["prefix"] == 'ALK']
print(len(df))
print(len(test_df))

540
24580


In [11]:
test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)
df = test_df[test_df["prefix"] == 'ALK']
print(len(df))
print(len(test_df))

540
24580


In [7]:
from pathlib import Path
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

targets_cat = {
    'EGFR': ["18", "19", "19", "20", "20", "21", "N", "U"],
    'ALK': ["Positive", "Negative", "Unknown"],
    'ROS1': ["Positive", "Negative", "Unknown"],
    'BRAF': ["Positive", "Negative", "Unknown"],
    'MET': ["Positive", "Negative", "Unknown"],
    'KRAS': ["Positive", "Negative", "Unknown"],
    'ERBB2': ["Positive", "Negative", "Unknown"],
    'PIK3CA': ["Positive", "Negative", "Unknown"],
    'NRAS': ["Positive", "Negative", "Unknown"],
    'MEK1': ["Positive", "Negative", "Unknown"],
    'NTRK': ["Positive", "Negative", "Unknown"],
    'RET': ["Positive", "Negative", "Unknown"],
    'CK7': ["Positive", "Negative", "Unknown"],
    'TTF': ["Positive", "Negative", "Unknown"],
    'Napsin': ["Positive", "Negative", "Unknown"],
    'CK20': ["Positive", "Negative", "Unknown"],
    'P40': ["Positive", "Negative", "Unknown"],
    'CDX2': ["Positive", "Negative", "Unknown"],
    'P63': ["Positive", "Negative", "Unknown"],
    'P16': ["Positive", "Negative", "Unknown"],
    'cytokeratin': ["Positive", "Negative", "Unknown"],
    'Vimentin': ["Positive", "Negative", "Unknown"],
    'PAX': ["Positive", "Negative", "Unknown"],
    'CD56': ["Positive", "Negative", "Unknown"],
    'chromogranin': ["Positive", "Negative", "Unknown"],
    'synaptophysin': ["Positive", "Negative", "Unknown"],
    'GATA3': ["Positive", "Negative", "Unknown"]
}

test_excel_path = "./outputs/t5-with-content/test.xlsx"
test_df = pd.read_excel(test_excel_path)

classification_reports = []
confusion_matrics = []

for target, cat in targets_cat.items():
    print(target)
    df = test_df[test_df["prefix"] == target]
    labels = df.target_text.values.tolist()
    preds = df.prediction.values.tolist()
    #assert all(p in cat for p in preds), target

    cr = classification_report(labels, preds)
    cm = confusion_matrix(labels, preds)
    print(cr)
    print(cm)

EGFR
              precision    recall  f1-score   support

          18       0.00      0.00      0.00         1
          19       0.92      0.85      0.88        13
       19,20       0.00      0.00      0.00         1
          20       0.75      1.00      0.86         3
       20,21       0.00      0.00      0.00         2
          21       0.97      0.97      0.97        31
    Negative       0.96      1.00      0.98        44
    Positive       0.00      0.00      0.00         0
     Unknown       1.00      1.00      1.00       445
        none       0.00      0.00      0.00         0

    accuracy                           0.99       540
   macro avg       0.46      0.48      0.47       540
weighted avg       0.98      0.99      0.99       540

[[  0   0   0   0   0   0   1   0   0   0]
 [  0  11   0   0   0   0   1   0   0   1]
 [  0   1   0   0   0   0   0   0   0   0]
 [  0   0   0   3   0   0   0   0   0   0]
 [  0   0   0   1   0   1   0   0   0   0]
 [  0   0   0   0   0


              precision    recall  f1-score   support

    Positive       0.00      0.00      0.00         0
     Unknown       1.00      1.00      1.00       540

    accuracy                           1.00       540
   macro avg       0.50      0.50      0.50       540
weighted avg       1.00      1.00      1.00       540

[[  0   0]
 [  2 538]]
PAX
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00         3
    Positive       0.50      0.50      0.50         2
     Unknown       1.00      1.00      1.00       535

    accuracy                           1.00       540
   macro avg       0.83      0.83      0.83       540
weighted avg       1.00      1.00      1.00       540

[[  3   0   0]
 [  0   1   1]
 [  0   1 534]]
CD56
              precision    recall  f1-score   support

    Negative       0.80      0.67      0.73         6
    Positive       0.16      0.89      0.27         9
     Unknown       1.00      0.92      0.96       5



In [None]:
from pathlib import Path
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

test_excel_path = "./data/processed/test.xlsx"
test_df = pd.read_excel(test_excel_path)

classification_reports = []
confusion_matrics = []

for target, cat in targets_cat.items():
    df = test_df[test_df["prefix"] == target]
    labels = df.target_text.values.tolist()
    preds = df.prediction.values.tolist()

    cr = classification_report(labels, preds, labels=cat)
    cm = confusion_matrix(labels, preds, labels=cat)
    print(cr)
    print(cm)
    # classification_reports.append(f"{target}\n{cr}\n")
    # confusion_matrics.append(f"{target}\n{cm}\n")

# Path("./save/all_classification_report.txt").write_text("\n".join(classification_reports))
# Path("./save/all_confusion_matrix.txt").write_text("\n".join(confusion_matrics))

In [34]:
from pathlib import Path
import pandas as pd
from sklearn.metrics import classification_report, multilabel_confusion_matrix

test_excel_path = "./outputs/t5-with-content/test.xlsx"
test_df = pd.read_excel(test_excel_path)

classification_reports = []
confusion_matrics = []

df = test_df[test_df["prefix"] == 'EGFR']
labels = df.target_text.values.tolist()
preds = df.prediction.values.tolist()

# Treat predictions outside the EGFR categories as unknown
preds = ["U" if p in ["Positive", "none"] else p for p in preds]

EGFR = ["18", "19", "20", "21", "N", "U"]

# Multi-hot encode each value: one 0/1 flag per EGFR category
labels = [[1 if e in str(l) else 0 for e in EGFR] for l in labels]
preds = [[1 if e in str(p) else 0 for e in EGFR] for p in preds]

cr = classification_report(labels, preds)
cm = multilabel_confusion_matrix(labels, preds)
print(cr)
print(cm)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       1.00      0.86      0.92        14
           2       1.00      0.67      0.80         6
           3       1.00      0.94      0.97        33
           4       0.96      1.00      0.98        44
           5       1.00      1.00      1.00       445

   micro avg       0.99      0.99      0.99       543
   macro avg       0.83      0.74      0.78       543
weighted avg       0.99      0.99      0.99       543
 samples avg       0.99      0.99      0.99       543

[[[539   0]
  [  1   0]]

 [[526   0]
  [  2  12]]

 [[534   0]
  [  2   4]]

 [[507   0]
  [  2  31]]

 [[494   2]
  [  0  44]]

 [[ 93   2]
  [  0 445]]]




In [41]:
from pathlib import Path
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

test_excel_path = "./outputs/t5-with-content/test.xlsx"
test_df = pd.read_excel(test_excel_path)

df = test_df[test_df["prefix"] == 'synaptophysin']
labels = df.target_text.values.tolist()
preds = df.prediction.values.tolist()

# Count the "none" placeholder as "Unknown"
preds = ["Unknown" if p == "none" else p for p in preds]


cr = classification_report(labels, preds)
cm = confusion_matrix(labels, preds)
print(cr)
print(cm)

              precision    recall  f1-score   support

    Negative       0.75      0.75      0.75         4
    Positive       0.47      0.89      0.62         9
     Unknown       1.00      0.98      0.99       527

    accuracy                           0.98       540
   macro avg       0.74      0.87      0.79       540
weighted avg       0.99      0.98      0.98       540

[[  3   1   0]
 [  1   8   0]
 [  0   8 519]]


Train with the 860 reports

In [8]:
import json
from pathlib import Path

data_path = "./data/raw/merge_keycontent.json"
data = json.loads(Path(data_path).read_text())

eval_path = "./data/raw/path_eval_102.json"
eval_label = json.loads(Path(eval_path).read_text())

train = [data[str(i)] for i in range(1, 861)]
eval = [data[idx] for idx in eval_label.keys()]

In [9]:
train[0]

{'soap report': 'Lung, upper lobe, left, CT-guided biopsy, adenocarcinoma/n/n',
 'path report': 'The specimen submitted consists of three tissue fragments measuring up to 1.0 x 0.1 x 0.1 cm in size, fixed in formalin./n/nGrossly, they are gray, soft, and cord-like./n/nAll for section./n/nMicroscopically, it shows a picture of adenocarcinoma arranged in acinar pattern and infiltrating pattern. The carcinoma cells display mild to moderate nuclear pleomorphism. By immunostains, the carcinoma is positive for CK7, TTF-1, and napsin A, indicating lung origin.',
 'cetology report': '',
 'NER_target': {'organ': {'content': 'Lung, upper lobe, left, CT-guided biopsy, adenocarcinoma\n\n',
   'annotate': 'Lung'},
  'Bx-site': {'content': 'Lung, upper lobe, left, CT-guided biopsy, adenocarcinoma\n\n',
   'annotate': 'LUL'},
  'sampling method': {'content': 'Lung, upper lobe, left, CT-guided biopsy, adenocarcinoma\n\n',
   'annotate': 'CT-guided biopsy'},
  'diagnosis': {'content': 'Lung, upper lobe

In [10]:
import pandas as pd
from collections import defaultdict
import json
from pathlib import Path
import re


NO_TARGET = "-"
NO_TEXT = "none"
USE_CONTENT = True

target_to_prefix = json.loads(Path("./save/target_to_prefix.json").read_text())
target_to_question = json.loads(Path("./save/target_to_question.json").read_text())
soap_targets = json.loads(Path("./save/soap_targets.json").read_text())

# Load data and get train and eval data
data_path = "./data/raw/merge_keycontent.json"
data = json.loads(Path(data_path).read_text())

eval_path = "./data/raw/path_eval_102.json"
eval_label = json.loads(Path(eval_path).read_text())

train = [data[str(i)] for i in range(1, 861)]
eval = [data[idx] for idx in eval_label.keys()]

dataset = {"train": train, "eval": eval}

# Transform T5 format
input_data = defaultdict(list)
for split, split_data in dataset.items():
    for example in split_data:
        for target, items in example["NER_target"].items():
            prefix = target_to_prefix[target]
            question = target_to_question[target]
            if USE_CONTENT:
                # Path-report targets need a path report; skip them otherwise
                if target not in soap_targets and example["path report"] == "":
                    continue
                input_text = f"{question} context: {items['content']} {NO_TEXT}"
            else:
                if target in soap_targets:
                    if example["soap report"] != "":
                        input_text = f"{question} context: {example['soap report']} {NO_TEXT}"
                    else:
                        input_text = f"{question} context: {example['cetology report']} {NO_TEXT}"
                else:
                    if example["path report"] != "":
                        input_text = f"{question} context: {example['path report']} {NO_TEXT}"
                    else:
                        continue
            # Collapse newlines and repeated whitespace into single spaces
            input_text = re.sub(r"\s+", " ", input_text)
            if items["annotate"] == NO_TARGET:
                target_text = NO_TEXT
            elif items["annotate"] == "P":
                target_text = "Positive"
            elif items["annotate"] == "N":
                target_text = "Negative"
            elif items["annotate"] == "U":
                target_text = "Unknown"
            else:
                target_text = items["annotate"]
            input_data[split].append([prefix, input_text, target_text])
                    

train_df = pd.DataFrame(input_data["train"])
train_df.columns = ["prefix", "input_text", "target_text"]
train_df.to_excel("./data/processed/train_860.xlsx", index=False)

test_df = pd.DataFrame(input_data["eval"])
test_df.columns = ["prefix", "input_text", "target_text"]
test_df.to_excel("./data/processed/eval_860.xlsx", index=False)

In [None]:
# Reference: https://towardsdatascience.com/the-guide-to-multi-tasking-with-the-t5-transformer-90c70a08837b
import logging
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args


train_df = pd.read_excel("./data/processed/train_860.xlsx", dtype=str)
test_df = pd.read_excel("./data/processed/eval_860.xlsx", dtype=str)
to_predict = [f"{row.prefix}: {row.input_text}" for row in test_df.itertuples()]

# Configure the model
# General args: https://simpletransformers.ai/docs/usage/#configuring-a-simple-transformers-model
# T5 args: https://simpletransformers.ai/docs/t5-model/
# If memory problem occurs, set lower max_seq_length or train_batch_size
model_args = T5Args()
model_args.manual_seed = 1209
model_args.max_seq_length = 900
model_args.train_batch_size = 4
model_args.num_train_epochs = 5
model_args.use_multiprocessing = False
model_args.fp16 = False
model_args.save_steps = -1
model_args.save_model_every_epoch = False
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.n_gpu = 2
model_args.output_dir = "./outputs/t5-with-content-860/"
model_args.wandb_project = "lung-cancer"
model_args.wandb_kwargs = {"name": "t5-with-content-860"}

model = T5Model("t5", "t5-base", args=model_args)

# Train the model
model.train_model(train_df, use_cuda=True)

# Make predictions with the model
preds = model.predict(to_predict)
test_df["prediction"] = preds
test_df.to_excel(model_args.output_dir+"test.xlsx", index=False)

In [34]:
import json
from pathlib import Path
import pandas as pd

target_to_question = json.loads(Path("./save/target_to_question.json").read_text())
test = pd.read_excel("./outputs/t5-with-content-860/test.xlsx")

def get_target(text):
    for target, question in target_to_question.items():
        if question in text:
            return target

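test["target"] = test["input_text"].apply(get_target)

# Assign a case ID to every row, assuming each case's rows begin with
# its "organ" row; the first evaluation case gets ID 861
case_id = []
i = 860
for row in test.itertuples():
    if row.target == "organ":
        i += 1
    case_id.append(str(i))
test["ID"] = case_id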

def transform_prediction(text):
    if text == "Positive":
        return "positive"
    elif text == "Negative":
        return "negative"
    elif text == "Unknown":
        return "U"
    elif text == "none":
        return "-"
    else:
        return text

test["prediction"] = test["prediction"].apply(transform_prediction)

eval_path = "./data/raw/path_eval_102.json"
eval_label = json.loads(Path(eval_path).read_text())

# Default every T5 prediction to "-" before filling in the real ones
for case_id, items in eval_label.items():
    for target, labels in items.items():
        labels["t5"] = "-"

# Write each prediction under the key name used in the evaluation file
for row in test.itertuples():
    if row.prediction != "-":
        if row.target == "TTF":
            eval_label[row.ID]["TTF-1"]["t5"] = row.prediction
        elif row.target == "Napsin":
            eval_label[row.ID]["NapsinUA"]["t5"] = row.prediction
        elif row.target == "cytokeratin":
            eval_label[row.ID]["cytokeratin (AE1/AE3)"]["t5"] = row.prediction
        elif row.target == "PAX":
            eval_label[row.ID]["PAXU8"]["t5"] = row.prediction
        elif row.target == "chromogranin":
            eval_label[row.ID]["chromogranin-A"]["t5"] = row.prediction
        elif row.target == "size":
            eval_label[row.ID]["tumor_size"]["t5"] = row.prediction
        elif row.target == "Greatest_dimension":
            eval_label[row.ID]["Greatest dimension"]["t5"] = row.prediction
        elif row.target == "LV":
            eval_label[row.ID]["LV_invasion"]["t5"] = row.prediction
        elif row.target == "ver":
            eval_label[row.ID]["version"]["t5"] = row.prediction
        elif row.target == "H_grade":
            eval_label[row.ID]["Hgrade"]["t5"] = row.prediction
        elif row.target == "TF":
            eval_label[row.ID]["Tumor_Focality"]["t5"] = row.prediction
        elif row.target == "H_type":
            eval_label[row.ID]["Htype"]["t5"] = row.prediction
        elif row.target == "CM":
            eval_label[row.ID]["closest_margin"]["t5"] = row.prediction
        else:
            eval_label[row.ID][row.target]["t5"] = row.prediction

Path(eval_path).write_text(json.dumps(eval_label, indent=2))


372916
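
The `elif` chain above only renames a handful of targets to the keys used in `path_eval_102.json`; every other target keeps its own name. A minimal sketch of the same lookup as a dictionary, assuming the name pairs shown in the chain. `TARGET_TO_EVAL_KEY` and `eval_key` are illustrative names, and the tiny `eval_label` dict and prediction tuples are made-up data:

```python
# Target name used by the T5 data -> key used in the evaluation file
TARGET_TO_EVAL_KEY = {
    "TTF": "TTF-1",
    "Napsin": "NapsinUA",
    "cytokeratin": "cytokeratin (AE1/AE3)",
    "PAX": "PAXU8",
    "chromogranin": "chromogranin-A",
    "size": "tumor_size",
    "Greatest_dimension": "Greatest dimension",
    "LV": "LV_invasion",
    "ver": "version",
    "H_grade": "Hgrade",
    "TF": "Tumor_Focality",
    "H_type": "Htype",
    "CM": "closest_margin",
}


def eval_key(target: str) -> str:
    """Return the evaluation-file key for a target; unmapped targets are unchanged."""
    return TARGET_TO_EVAL_KEY.get(target, target)


# Made-up evaluation record and predictions, for illustration only
eval_label = {"861": {"TTF-1": {"t5": "-"}, "EGFR": {"t5": "-"}}}
for case_id, target, prediction in [("861", "TTF", "positive"), ("861", "EGFR", "19")]:
    eval_label[case_id][eval_key(target)]["t5"] = prediction
print(eval_label)
```

Keeping the renames in one dictionary also makes it easy to check them against the reverse mapping used during evaluation.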

In [42]:
import json
from pathlib import Path
import warnings

from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix

warnings.filterwarnings("ignore")

correction = {
    "TTF-1": "TTF",
    "NapsinUA": "Napsin",
    "cytokeratin (AE1/AE3)": "cytokeratin",
    "PAXU8": "PAX",
    "chromogranin-A": "chromogranin",
    "tumor_size": "size",
    "Greatest dimension": "Greatest_dimension",
    "LV_invasion": "LV",
    "version": "ver",
    "Hgrade": "H_grade",
    "Tumor_Focality": "TF",
    "Htype": "H_type",
    "closest_margin": "CM"
}

ner_task = [
    "organ", "Bx-site", "sampling method", "diagnosis", "tumor_size",
    "Greatest dimension", "Htype", "Hgrade", "Tumor_Focality", "LV_invasion", "closest_margin",
    "VPI", "PDL1", "version", "pT", "pN", "pM", "pStage"
]

mt_task = "EGFR"
EGFR = ["18", "19", "20", "21", "N", "U"]

def check_ner(text):
    if text == "U":
        return "-"
    return text

def check_cls(text):
    if text == "-":
        return "U"
    elif text == "positive":
        return "P"
    elif text == "negative":
        return "N"
    return text

def check_mt(text):
    if text == "-":
        text = "U"
    elif text == "negative":
        text = "N"
    elif text == "positive":
        text = "U"
    text = str(text)
    vector = [1 if e in text else 0 for e in EGFR]
    return vector

def ner_evaluation(y_true, y_pred):
    y_true = [check_ner(y) for y in y_true]
    y_pred = [check_ner(y) for y in y_pred]
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    print(round(acc*100, 2))

def cls_evaluation(y_true, y_pred):
    y_true = [check_cls(y) for y in y_true]
    y_pred = [check_cls(y) for y in y_pred]
    print(classification_report(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))

def mt_evaluation(y_true, y_pred):
    y_true = [check_mt(y) for y in y_true]
    y_pred = [check_mt(y) for y in y_pred]
    print(classification_report(y_true, y_pred))
    print(multilabel_confusion_matrix(y_true, y_pred))

eval_path = "./data/raw/path_eval_102.json"
eval_label = json.loads(Path(eval_path).read_text())

targets = list(set(tar for case_id, items in eval_label.items() for tar in items.keys()))

for target in targets:
    print(target)
    answer = [items[target]["Ans"] for case_id, items in eval_label.items()]
    rule = [items[target]["Rule"] for case_id, items in eval_label.items()]
    t5 = [items[target]["t5"] for case_id, items in eval_label.items()]
    if target in ner_task:
        print("rule")
        ner_evaluation(answer, rule)
        print("t5")
        ner_evaluation(answer, t5)
    elif target == mt_task:
        print("rule")
        mt_evaluation(answer, rule)
        print("t5")
        mt_evaluation(answer, t5)
    else:
        print("rule")
        cls_evaluation(answer, rule)
        print("t5")
        cls_evaluation(answer, t5)

NapsinUA
rule
              precision    recall  f1-score   support

           N       1.00      1.00      1.00         4
           P       1.00      1.00      1.00         1
           U       1.00      1.00      1.00        97

    accuracy                           1.00       102
   macro avg       1.00      1.00      1.00       102
weighted avg       1.00      1.00      1.00       102

[[ 4  0  0]
 [ 0  1  0]
 [ 0  0 97]]
t5
              precision    recall  f1-score   support

           N       0.75      0.75      0.75         4
           P       0.50      1.00      0.67         1
           U       0.99      0.98      0.98        97

    accuracy                           0.97       102
   macro avg       0.75      0.91      0.80       102
weighted avg       0.98      0.97      0.97       102

[[ 3  0  1]
 [ 0  1  0]
 [ 1  1 95]]
version
rule
100.0
t5
92.16
ALK
rule
              precision    recall  f1-score   support

           N       1.00      1.00      1.00         5
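
The EGFR target is scored differently from the other targets: `check_mt` turns each answer into a multi-hot vector over the six labels in `EGFR`, so one report can carry several exon labels at once. A minimal standalone sketch of that encoding follows. The function name `encode_egfr` and the sample strings are illustrative assumptions, not values taken from the evaluation file.

```python
EGFR = ["18", "19", "20", "21", "N", "U"]

def encode_egfr(text):
    """Map a raw EGFR answer to a multi-hot vector over EGFR.

    Mirrors check_mt: "-" and "positive" become "U", "negative" becomes "N",
    and every label that occurs as a substring sets its slot to 1.
    """
    text = str(text)
    if text in ("-", "positive"):
        text = "U"
    elif text == "negative":
        text = "N"
    return [1 if label in text else 0 for label in EGFR]

# Hypothetical answers, for illustration only
for raw in ["19", "19, 21", "negative", "-"]:
    print(repr(raw), encode_egfr(raw))
```

Because the match is a plain substring test, any answer that happens to contain a capital `N` or `U` also sets the corresponding slot, which is worth keeping in mind when reading the multi-label confusion matrices.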
 

### Check the target position 

Take a look at the data

In [1]:
import pandas as pd

data_path = "./data/raw/TMUH_pathReport_T1-T4_NER.xlsx"
data = pd.read_excel(data_path)
data.head()

Unnamed: 0,id,part_num,Chat No.,path_soap,path_report,Cytology_report,organ,Bx-site,operation,Htype,tumor_size,Greatest dimension,Tumor_Focality,LV_invasion,closest_margin,version,pT,pN,pM,pStage
0,1,T1_1,17956402.0,"Lung, lower lobe, left, CT-guided needle biops...",The specimen submitted consists of five tissue...,,Lung,"lower lobe, left",CT-guided needle biopsy,,,,,,,,,,,
1,2,T1_2,,EGFR exon 18 mutation: Not detected\nEGFR exon...,Analysis of EGFR gene mutation\n(1)Tissue orig...,,Lung,,,,,,,,,,,,,
2,3,T1_3,,Result:\n1. PD-L1 expression in tumor cells (T...,Pathologic Report for PD-L1 (SP263) Assay (Ven...,,lung,lung,,,,,,,,,,,,
3,4,T1_4,,ROS1 expression is positive (intensity: modera...,ROS1 expression is positive (intensity: modera...,,,,,,,,,,,,,,,
4,5,T1_5,,The ALK expression is negative.\n\n,The ALK expression is negative.\n\n(1) Descrip...,,,,,,,,,,,,,,,


Build a dictionary that maps each target to the report column it is expected to appear in

In [2]:
target_to_report = {
    "organ": "path_soap",
    "Bx-site": "Cytology_report",
    "operation": "Cytology_report",
    "Htype": "path_report",
    "tumor_size": "path_report",
    "Greatest dimension": "path_report",
    "Tumor_Focality": "path_report",
    "LV_invasion": "path_report",
    "closest_margin": "path_report",
    "version": "path_report",
    "pT": "path_report",
    "pN": "path_report",
    "pM": "path_report",
    "pStage": "path_report"
}

In [3]:
from tqdm.auto import tqdm

record = []
target_list = list(target_to_report.keys())
for i, row in tqdm(data.iterrows(), total=data.shape[0]):
    for target in target_list:
        if not pd.isna(row[target]):
            report = row[target_to_report[target]]
            report = row["path_soap"] if pd.isna(report) else report
            # A missing report is NaN, not None, so test with pd.isna
            assert not pd.isna(report), f"ID {row['id']}: {target} cannot map the report."
            
            # Find target index from report
            target_index = []
            cur_index = 0
            while report.find(row[target], cur_index) != -1:
                index = report.find(row[target], cur_index)
                target_index.append(str(index))
                cur_index = index + 1
            
            # Not found => find target in path report
            if len(target_index) == 0 and not pd.isna(row["path_report"]):
                report = row["path_report"]
                cur_index = 0
                while report.find(row[target], cur_index) != -1:
                    index = report.find(row[target], cur_index)
                    target_index.append(str(index))
                    cur_index = index + 1
            
            record.append([row["id"], report, target, row[target], len(target_index), ", ".join(target_index)])
record = pd.DataFrame(record, columns=["id", "report", "target", "target_text", "counts", "index"])
record.to_excel("./outputs/observation/target_counts.xlsx", index=False)

  0%|          | 0/3042 [00:00<?, ?it/s]
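
The cell above locates each target by calling `str.find` repeatedly and advancing the start position by one, so overlapping matches are counted too. The same idea can be sketched as a small helper. `find_all` is a hypothetical name, and the sample report string is made up for illustration.

```python
def find_all(text, pattern):
    """Return the start index of every match of pattern in text, overlaps included."""
    if not pattern:
        # An empty pattern matches at every position; treat it as "no match".
        return []
    indices = []
    start = text.find(pattern)
    while start != -1:
        indices.append(start)
        # Advance by one character so that overlapping matches are found as well.
        start = text.find(pattern, start + 1)
    return indices

report = "Lung, lower lobe, left; lung, upper lobe, left"
hits = find_all(report, "lobe")
print(len(hits), hits)
print(find_all(report, "right"))  # no occurrence gives an empty list
```

A count above one in the resulting `counts` column means the target text alone does not identify a unique position in the report.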

In [1]:
t = "01234"
x = "0"
t.find(x,0)

0

In [6]:
a = [[3,5,6], [1,2,3]]
a.sort()
a

[[1, 2, 3], [3, 5, 6]]

In [3]:
import pandas as pd
from tqdm import tqdm
import jsonlines
import json

data_path = "./data/raw/TMUH_pathReport_T1-T4_NER.xlsx"
data = pd.read_excel(data_path)
# data = data.iloc[:1379,:]

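def find_entity(text, target, target_text):
    """Return [begin, end, target] for every non-overlapping match of target_text in text."""
    entities = []
    cur_index = 0
    while text.find(target_text, cur_index) != -1:
        begin = text.find(target_text, cur_index)
        end = begin + len(target_text)
        entities.append([begin, end, target])
        # Continue after this match so that spans never overlap
        cur_index = end
    return entities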

report_to_targets = {
    "Cytology_report": ["Bx-site", "operation"],
    "path_soap": ["organ"],
    "path_report": [
        "Htype",
        "tumor_size",
        "Greatest dimension",
        "Tumor_Focality",
        "LV_invasion",
        "closest_margin",
        "version",
        "pT",
        "pN",
        "pM",
        "pStage"
    ]
}

outputs = []

for i, row in tqdm(data.iterrows(), total=data.shape[0]):

    no_match_targets = []

    for report in ["Cytology_report", "path_soap", "path_report"]:

        if not pd.isna(row[report]):

            text = row[report]
            label = []

            if len(no_match_targets) != 0:
                match_targets = []
                for target in no_match_targets:
                    entities = find_entity(text, target, row[target])
                    if len(entities) != 0:
                        label.extend(entities)
                        match_targets.append(target)
                no_match_targets = [target for target in no_match_targets if target not in match_targets]

            for target in report_to_targets[report]:
                if not pd.isna(row[target]):
                    entities = find_entity(text, target, row[target])
                    if len(entities) == 0:
                        no_match_targets.append(target)
                    else:
                        label.extend(entities)
            
            label.sort()
            outputs.append({"text": text, "label": label})

        else:

            for target in report_to_targets[report]:
                if not pd.isna(row[target]):
                    no_match_targets.append(target)

    if len(no_match_targets) != 0:
        print(f"ID {row['id']} has no match targets {no_match_targets}")

with jsonlines.open("./data/processed/to-be-annotated-1.jsonl", "w") as writer:
    for line in outputs[:(len(outputs)//2)]:
        writer.write(line)

with jsonlines.open("./data/processed/to-be-annotated-2.jsonl", "w") as writer:
    for line in outputs[(len(outputs)//2):]:
        writer.write(line)

100%|██████████| 3042/3042 [00:00<00:00, 7457.85it/s]
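
Each line written above is a JSON object with a `text` field and a `label` list of `[begin, end, target]` spans. Before handing the files to annotators, it may help to confirm that every span falls inside its text. The sketch below uses only the standard `json` module. `check_spans` is a hypothetical helper, and the sample record is invented rather than read from the files.

```python
import json

def check_spans(records):
    """Return (record_index, begin, end, label) for every empty or out-of-range span."""
    problems = []
    for i, rec in enumerate(records):
        for begin, end, label in rec["label"]:
            if not (0 <= begin < end <= len(rec["text"])):
                problems.append((i, begin, end, label))
    return problems

# One invented record in the same shape as the lines of the .jsonl files
sample = {
    "text": "Lung, lower lobe, left",
    "label": [[0, 4, "organ"], [6, 22, "Bx-site"]],
}
records = [json.loads(json.dumps(sample))]  # round-trip, as when reading one line

for begin, end, label in records[0]["label"]:
    print(label, "->", records[0]["text"][begin:end])
print(check_spans(records))
```

To check the real files, the same function could be applied to `[json.loads(line) for line in open(path)]` for each output path.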
