## Create vocabulary

Adapted from the vocabulary of [BERT-Large, Uncased](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip), with two changes:

- Remove non-English tokens
- Add MeSH IDs

In [7]:
import re
import pickle
import pandas as pd

In [1]:
from collections import defaultdict

In [112]:
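# Accept only tokens made of lowercase ASCII letters, digits, and common punctuation.
# Raw string so the backslashes reach the regex engine without invalid-escape warnings.
bert_token_pattern = re.compile(r'^[\~\!\@\#\$\%\^\&\*\(\)\_\+\-\=\*\/\<\>\,\.\[\]\{\}\\\/\'\:a-z0-9]+$')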

In [11]:
with open('./vocab_bert_large_uncased.txt', 'r') as f:
    vocab = [line.rstrip('\n') for line in f]

In [103]:
# Skip the leading bracketed tokens ([PAD], [unused0], ...); they are added back later
to_be_cleaned = vocab[999:]
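
The offset `999` depends on the layout of the vocabulary file, where the bracketed reserved tokens come first. A minimal sketch of deriving that offset instead of hard-coding it, shown on a toy vocabulary; `first_regular_index` and `RESERVED` are hypothetical names, not part of the notebook, and the sketch assumes the reserved tokens form one contiguous block at the top of the file:

```python
import re

# Reserved tokens look like "[PAD]" or "[unused12]".
RESERVED = re.compile(r'^\[[^\[\]]+\]$')

def first_regular_index(vocab):
    """Return the index of the first token that is not a bracketed reserved token."""
    for i, token in enumerate(vocab):
        if not RESERVED.match(token):
            return i
    return len(vocab)

# Toy stand-in for the real vocabulary file.
toy_vocab = ['[PAD]', '[unused0]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '!', '"', 'the']
start = first_regular_index(toy_vocab)
to_be_cleaned_toy = toy_vocab[start:]
print(start, to_be_cleaned_toy)
```

On the real file, comparing the returned index with `999` is a cheap sanity check before slicing.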

In [113]:
new_vocab = [w for w in to_be_cleaned if re.search(bert_token_pattern, w)]

In [107]:
new_vocab.sort()

In [123]:
len(new_vocab)

27613
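
To make the effect of the filter concrete, here is a small sketch that applies the same character class to a handful of sample tokens. The samples are illustrative and not taken from the vocabulary file:

```python
import re

# Same character class as the notebook's pattern, written as a raw string.
bert_token_pattern = re.compile(r'^[\~\!\@\#\$\%\^\&\*\(\)\_\+\-\=\*\/\<\>\,\.\[\]\{\}\\\/\'\:a-z0-9]+$')

samples = ['protein', '##ing', '2019', "'", 'café', '東京']
kept = [w for w in samples if bert_token_pattern.search(w)]
dropped = [w for w in samples if not bert_token_pattern.search(w)]

print(kept)     # tokens made only of lowercase ASCII letters, digits, listed punctuation
print(dropped)  # tokens containing any other character
```

WordPiece continuation tokens such as `##ing` survive because `#` is in the character class, while any token with an accented or non-Latin character is dropped.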

In [101]:
new_vocab = ['[PAD]', '[CLS]', '[MASK]', '[SEP]', '[UNK]'] + \
            [f'unused{i}' for i in range(1000)]

#### Add the MeSH IDs

In [128]:
with open('/scratch/cheng.jial/knowledge_graph_pubmed/data/mesh_data/mesh_d.pkl', 'rb') as f:
    mesh_d = pickle.load(f)

with open('/scratch/cheng.jial/knowledge_graph_pubmed/data/mesh_data/mesh_c.pkl', 'rb') as f:
    mesh_c = pickle.load(f)

In [139]:
# D005260: female
# D008297: male
del mesh_d['D005260']
del mesh_d['D008297']

In [147]:
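def get_mesh_type(tree_code):
    """Return the token prefix for a MeSH tree number (None if not C or D)."""
    tc = tree_code[0]
    if tc == 'C':   # tree category C = Diseases -> disease token 'D'
        return 'D'
    if tc == 'D':   # tree category D = Chemicals and Drugs -> chemical token 'C'
        return 'C'
    return None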
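
The letters are easy to confuse: in the MeSH tree, category C is Diseases and category D is Chemicals and Drugs, while the token prefixes use `D` for disease and `C` for chemical. A minimal sketch that makes the swap explicit with a lookup table; `TREE_TO_PREFIX` and `mesh_prefix` are hypothetical names and the tree numbers are illustrative:

```python
TREE_TO_PREFIX = {
    'C': 'D',  # MeSH tree category C (Diseases) -> disease token prefix
    'D': 'C',  # MeSH tree category D (Chemicals and Drugs) -> chemical token prefix
}

def mesh_prefix(tree_code):
    """Return 'D' or 'C' for a tree number, or None for any other category."""
    # Slicing with [:1] also tolerates an empty string.
    return TREE_TO_PREFIX.get(tree_code[:1])

for code in ['C04.588', 'D02.033', 'A01.236']:
    print(code, mesh_prefix(code))
```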

In [148]:
mesh_vocab = [f"{get_mesh_type(v['tree'][0])}MESH{k}" for k, v in mesh_d.items()
              if v['tree'][0][0] in ('C', 'D')]

In [152]:
mesh_vocab += [f"{get_mesh_type(mesh_d[v['mapto'][0]]['tree'][0])}MESH{k}" for k, v in mesh_c.items()
               if mesh_d[v['mapto'][0]]['tree'][0][0] in ('C', 'D')]
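
The two comprehensions can be traced on toy data. The sketch below assumes, as the indexing in the cells suggests, that `mesh_d` maps a descriptor ID to a record with a `'tree'` list and that `mesh_c` maps a supplementary ID to a record with a `'mapto'` list of descriptor IDs. The helper `token_for`, the toy dictionaries, and the tree numbers are made up for illustration; unlike the cells above, the sketch also skips records whose descriptor or tree list is missing:

```python
def token_for(mesh_id, tree_codes):
    """Build a token such as 'DMESHD003924'; return None if the type is not C or D."""
    if not tree_codes:
        return None
    prefix = {'C': 'D', 'D': 'C'}.get(tree_codes[0][:1])
    return f'{prefix}MESH{mesh_id}' if prefix else None

toy_mesh_d = {
    'D003924': {'tree': ['C18.452']},   # disease branch (made-up tree number)
    'D005947': {'tree': ['D09.947']},   # chemical branch (made-up tree number)
    'D000001': {'tree': ['A01.111']},   # neither branch: skipped
}
toy_mesh_c = {
    'C010238': {'mapto': ['D005947']},  # inherits the type of its descriptor
    'C999999': {'mapto': ['D404404']},  # descriptor not in toy_mesh_d: skipped
}

toy_mesh_vocab = []
for mesh_id, record in toy_mesh_d.items():
    token = token_for(mesh_id, record.get('tree'))
    if token:
        toy_mesh_vocab.append(token)
for mesh_id, record in toy_mesh_c.items():
    parent = toy_mesh_d.get(record['mapto'][0], {})
    token = token_for(mesh_id, parent.get('tree'))
    if token:
        toy_mesh_vocab.append(token)

print(toy_mesh_vocab)
```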

In [156]:
cd = pd.read_csv('/scratch/cheng.jial/knowledge_graph_pubmed/data/cd_relation_pubmed.csv')

In [157]:
cd.head(1)

| Unnamed: 0 | source | pmid | ctoken | dtoken | sentence |
|---|---|---|---|---|---|
| 0 | pubmed | 10480505 | CHEMICALMESHD005947 | DISEASEMESHD003924 | DISEASEMESHD003924 was defined as a fasting pl... |


In [158]:
cd['ctoken'] = [i.replace('CHEMICAL', 'C') for i in cd['ctoken']]
cd['dtoken'] = [i.replace('DISEASE', 'D') for i in cd['dtoken']]
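
`str.replace` rewrites every occurrence of the substring, not only a leading one. For the tokens shown above this makes no difference, but a prefix-anchored variant is a little safer. A minimal sketch, where `shorten_prefix` is a hypothetical helper rather than part of the notebook:

```python
import re

def shorten_prefix(token, long_prefix, short_prefix):
    """Replace long_prefix with short_prefix only when it starts the token."""
    return re.sub('^' + re.escape(long_prefix), short_prefix, token)

print(shorten_prefix('CHEMICALMESHD005947', 'CHEMICAL', 'C'))
print(shorten_prefix('DISEASEMESHD003924', 'DISEASE', 'D'))
```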

Some tokens in the relation table are not in the MeSH dictionaries, so add them manually.

In [164]:
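# Caution: bert_token_pattern only accepts lowercase letters, so uppercase
# tokens such as 'CMESHC010238' do not pass this filter.
mesh_vocab.extend(w for w in list(cd['ctoken']) + list(cd['dtoken'])
                  if re.search(bert_token_pattern, w))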

In [None]:
# Unexecuted cell: `to_be_added` is not defined in the cells shown here
mesh_vocab += to_be_added

In [165]:
mesh_vocab = list(set(mesh_vocab))

In [166]:
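mesh_vocab_set = set(mesh_vocab)  # set lookups are O(1); scanning the list for every row is slow
for i in list(cd['ctoken']) + list(cd['dtoken']):
    if i not in mesh_vocab_set:
        print(i)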

CMESHC061951205
CMESHC051890
CMESHC122114
CMESHC009687
CMESHC121677
CMESHC030110
C,
CMESHC453980
CMESHC490728
CMESHC010238
CMESHC040029
CMESHC467567
CMESHC079150
C,
CMESHC048738
CMESHC502411
CMESHC551177
CMESHC405346
CMESHC515567
CMESHC507898
CMESHC079890
CMESHC048107
CMESHC015329
C)
CMESHC080245
CMESHC513092
CMESHC058218
CMESHC473478
CMESHC417052
CMESHC029100
CMESHC043435
CMESHC030110
CMESHC471405
CMESHC047246
CMESHC530429334
CMESHC096918
CMESHC088658
CMESHC030110
CMESHC058218
CMESHC063008
CMESHC065179
CMESHC496398
C-
C,
C,
CMESHC105934
CMESHC467567
CMESHC040029
CMESHC429886
CMESHC080245
CMESHC065179
CMESHC051890
CMESHC471405
CMESHC507898
CMESHC429886
CMESHC088482
CMESHC065179
C.
CMESHC507898
CMESHC081489
C,
C)
CMESHC097613
C-
CMESHC554682
CMESHC051890
C-
CMESHD013749
CMESHC076029
CMESHC076029
C)
CMESHC429886
C)
CMESHC467567
CMESHC105934
CMESHC065179
CMESHC502012
CMESHC467567
CMESHC513092
CMESHC088482
CMESHC081489
C)
CMESHC108606
CMESHC413408
CMESHC440975
CMESHC453962
CMESHC080245
CME

KeyboardInterrupt: 
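
The output above repeats many tokens, and testing membership against a list for every row is likely what made the loop slow enough to interrupt. A minimal sketch of the same check using a set for lookups and a `Counter` to collapse duplicates; `missing_tokens` and the toy data are hypothetical:

```python
from collections import Counter

def missing_tokens(tokens, vocab):
    """Count the tokens that are absent from vocab, using a set for fast lookups."""
    known = set(vocab)
    return Counter(t for t in tokens if t not in known)

toy_vocab = ['CMESHD005947', 'DMESHD003924']
toy_tokens = ['CMESHD005947', 'CMESHC010238', 'C,', 'CMESHC010238', 'DMESHD003924']

missing = missing_tokens(toy_tokens, toy_vocab)
print(missing.most_common())
```

Each distinct missing token is reported once with its frequency, which also makes malformed entries such as `C,` easy to spot.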

In [167]:
# Returns None (no output): the character class contains no uppercase letters
re.search(bert_token_pattern, 'CMESHC010238')
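
Because the BERT pattern rejects uppercase tokens, a separate pattern for the MeSH tokens may be a better fit. The sketch below assumes the tokens have the form prefix `C` or `D`, the literal `MESH`, then a MeSH ID that starts with `C` or `D` followed by digits; `mesh_token_pattern` is a hypothetical name and the exact ID format is an assumption:

```python
import re

# Assumed token shape: [CD] + 'MESH' + [CD] + digits, e.g. 'CMESHC010238'.
mesh_token_pattern = re.compile(r'^[CD]MESH[CD]\d+$')

candidates = ['CMESHC010238', 'DMESHD003924', 'C,', 'C)', 'C-']
valid = [t for t in candidates if mesh_token_pattern.match(t)]
invalid = [t for t in candidates if not mesh_token_pattern.match(t)]

print(valid)
print(invalid)
```

This keeps well-formed tokens and filters out fragments such as `C,` and `C)` that appear in the output above.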

In [3]:
special_token = ['[PAD]', '[UNK]', '[CLS]', '[MASK]', '[SEP]']

In [2]:
w2i = defaultdict(lambda: len(w2i))

In [4]:
[w2i[t] for t in special_token]

[0, 1, 2, 3, 4]

In [5]:
w2i

defaultdict(<function __main__.<lambda>()>,
            {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[MASK]': 3, '[SEP]': 4})
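
`defaultdict(lambda: len(w2i))` assigns the next free index the first time a token is looked up. The flip side is that every lookup of an unseen token silently adds it. A minimal sketch of building the index and then freezing it into a plain `dict`, so that unknown tokens fall back to `[UNK]`; the extra tokens and the names `frozen` and `i2w` are illustrative:

```python
from collections import defaultdict

w2i = defaultdict(lambda: len(w2i))
tokens = ['[PAD]', '[UNK]', '[CLS]', '[MASK]', '[SEP]', 'protein', 'DMESHD003924', 'protein']
for token in tokens:
    w2i[token]  # first lookup assigns the next free index; repeats reuse it

frozen = dict(w2i)                       # plain dict: lookups no longer add entries
i2w = {i: w for w, i in frozen.items()}  # inverse mapping, index -> token

unk = frozen['[UNK]']
ids = [frozen.get(t, unk) for t in ['protein', 'never-seen']]
print(ids, len(frozen))
```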