This notebook takes an arbitrary topic or term and returns related topics. Here we show the top 20.

In [1]:
pip install transformers


In [7]:
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances, linear_kernel
import pandas as pd
import numpy as np
from numpy import genfromtxt
import torch
import math
import pickle

## Define model and load saved embeddings

Define which model we want to use: BERT or ClinicalBERT

In [3]:
model = 'ClinicalBERT' # if want to use BERT, change 'ClinicalBERT' to 'BERT'

In [8]:
if model == 'ClinicalBERT':
  model_name = 'emilyalsentzer/Bio_ClinicalBERT'
  !wget https://github.com/casszhao/FAIR/raw/main/sources/ClinicalBERT_embeddings.pkl
  embeddings_file_name = 'ClinicalBERT_embeddings.pkl'
elif model == 'BERT':
  model_name = 'sentence-transformers/bert-base-nli-mean-tokens'
  !wget https://github.com/casszhao/FAIR/raw/main/sources/BERT_embeddings.pkl
  embeddings_file_name = 'BERT_embeddings.pkl'


tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


with open(embeddings_file_name,'rb') as f:
  Embeddings = pickle.load(f)

Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Load candidate categories list

In [9]:
url = 'https://raw.githubusercontent.com/casszhao/FAIR/main/sources/0901_full_list.csv'
sorted_cat = pd.read_csv(url, header=None)

Build a dictionary for looking up terms later. The key is the numerical index, because the embeddings list shares the same order as the candidate list.

In [10]:
sorted_cat = sorted_cat[0].to_list()
sorted_cat = list(dict.fromkeys(sorted_cat))  # drop duplicates, keep the original order

lower_cat = [term.lower() for term in sorted_cat]
dic = dict(enumerate(lower_cat))  # row index -> lower-cased term
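
The index lookup above can be illustrated on a toy list. This is a minimal sketch: the `candidates` values and the names `unique_candidates`, `lower_terms` and `index_to_term` are made up for illustration and do not come from the real CSV. It shows that `dict.fromkeys` keeps the first occurrence of each entry in its original order, and that de-duplication runs before lower-casing, so two entries that differ only in case both survive.

```python
# Toy stand-in for the candidate list (hypothetical values, not the real CSV).
candidates = ["Dementia", "Poverty", "dementia", "Poverty", "Ataxia"]

# dict.fromkeys keeps the first occurrence of each item and preserves order.
unique_candidates = list(dict.fromkeys(candidates))

# Lower-case the terms, then map row index -> term.
lower_terms = [term.lower() for term in unique_candidates]
index_to_term = dict(enumerate(lower_terms))

print(unique_candidates)
print(index_to_term)
```

The lookup is only meaningful if the saved embeddings were built from the list in this same de-duplicated order, which this sketch assumes rather than checks.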

## Given any topic, get the top 20

Here we show 2 examples, one searching for "Alzheimer" and one searching for "lesbian". The search can be any term, although it makes more sense if it is related to public health and social inequality.

In [11]:
def get_request_array(request, MAX_TOKEN):
  # Tokenize the query, padding or truncating it to exactly MAX_TOKEN tokens
  request_token = tokenizer.encode_plus(request, max_length=MAX_TOKEN, # length from 128 to 20
                                      truncation=True, padding='max_length',
                                      return_tensors='pt')

  request_attention_mask = request_token['attention_mask'][0]

  # Inference only, so skip gradient tracking
  with torch.no_grad():
    request_outputs = model(**request_token)
  request_embeddings = request_outputs.last_hidden_state

  # Mean pooling: zero out padding positions, then average over the real tokens
  request_mask = request_attention_mask.unsqueeze(-1).expand(request_embeddings.size()).float()
  request_masked_embeddings = request_embeddings * request_mask
  request_summed = torch.sum(request_masked_embeddings, 1)
  request_summed_mask = torch.clamp(request_mask.sum(1), min=1e-9) # avoid division by zero
  request_mean_pooled = request_summed / request_summed_mask
  return request_mean_pooled.detach().numpy()
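
The masked mean pooling in `get_request_array` can be checked by hand on a tiny example. The sketch below uses NumPy instead of PyTorch and invented numbers in place of real model output, so the array names and values are assumptions for illustration only. Two real tokens are followed by two padding positions, and the padding rows should not affect the result.

```python
import numpy as np

# Toy "last_hidden_state": 1 sequence, 4 token positions, 3 hidden dimensions.
hidden = np.array([[[1.0, 2.0, 3.0],
                    [3.0, 4.0, 5.0],
                    [9.0, 9.0, 9.0],    # padding position
                    [9.0, 9.0, 9.0]]])  # padding position

# Attention mask: 1 for real tokens, 0 for padding.
attention_mask = np.array([[1, 1, 0, 0]])

mask = attention_mask[..., np.newaxis].astype(float)  # shape (1, 4, 1)
summed = (hidden * mask).sum(axis=1)                   # shape (1, 3)
counts = np.clip(mask.sum(axis=1), 1e-9, None)         # shape (1, 1), never zero
mean_pooled = summed / counts

print(mean_pooled)
```

Only the first two rows contribute, so the pooled vector is their average, `[2, 3, 4]`.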

In [12]:
search_1 = "Alzheimer"
search_2 = "lesbian"

In [13]:
topic_array_1 = get_request_array(request=search_1, MAX_TOKEN=20)
topic_array_2 = get_request_array(request=search_2, MAX_TOKEN=20)

## cosine similarity

Rank related topics by calculating the cosine similarity between the given topics and the categories in the candidate list.  

Here we show the top 20 most similar topics (the smaller the cosine, the less similar the category is to the given topic).

In [16]:
def cosine_simi_list_for_one(topic_array):
  simi_array = cosine_similarity(topic_array, Embeddings)
  simi_list = simi_array.tolist()[0]
  sorted_index = sorted(range(len(simi_list)), key=lambda k: simi_list[k])
  sorted_index.reverse() # a smaller cosine means a larger angle, i.e. less similar, so reverse to put the most similar first
  subs = [dic.get(i, i) for i in sorted_index[:21]] # note: [:21] keeps 21 terms; use [:20] for exactly 20
  return subs
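
To make the ranking step concrete, here is a minimal pure-Python sketch of the same idea, written without scikit-learn so the formula is visible. The 2-D vectors, the labels and the names `cosine`, `toy_embeddings`, `toy_labels` and `query` are hypothetical and stand in for the real embeddings and candidate list.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D "embeddings" and labels, for illustration only.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_labels = {0: "east", 1: "north", 2: "north-east", 3: "west"}
query = [2.0, 0.1]

scores = [cosine(query, e) for e in toy_embeddings]

# Sort row indices by descending similarity and keep the top 2.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
top_terms = [toy_labels[i] for i in ranked[:2]]
print(top_terms)
```

The query points almost due east, so "east" ranks first and "west", which points the opposite way, ranks last.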

In [17]:
cosine20_for_search_1 = cosine_simi_list_for_one(topic_array_1)
print('the top 20 (cosine) related terms for ', search_1, 'is')
print(cosine20_for_search_1)
print('')
cosine20_for_search_2 = cosine_simi_list_for_one(topic_array_2)
print('the top 20 (cosine) related terms for ', search_2, 'is')
print(cosine20_for_search_2)

the top 20 (cosine) related terms for  Alzheimer is
["alzheimer's disease", 'parkinsons', 'schizophrenia', 'marburg virus', 'dementia with lewy bodies', 'ataxia', 'bipolar disorder', 'poverty in algeria', 'psychosis', 'infertility', 'otitis media', 'autistic spectrum disorder', 'vip syndrome', 'dementia', 'ebola vaccine', 'west nile virus', 'middle ear infection', 'stereotype threat', 'scurvy', 'psoriasis', 'poverty in nigeria']

the top 20 (cosine) related terms for  lesbian is
['transgender', 'recreation', 'genocide', 'farming', 'oppression', 'seasons', 'wealth', 'humanities', 'poverty', 'psychology', 'capitalist', 'anthropologist', 'exile', 'socialism', 'tourism', 'sport', 'piles', 'meditation', 'flood', 'researching', 'doi']


## euclidean_distances

Rank related topics by calculating the Euclidean distance between the given topic and each category in the candidate list.

Here we show the top 20 most similar topics (the smaller the distance, the more similar the category is to the given topic).

In [18]:
# a smaller distance means a more similar term, so ascending order puts the best matches first
def eucli_distance_list_for_one(topic_array):
  distance_array = euclidean_distances(topic_array, Embeddings)
  distance_list = distance_array.tolist()[0]
  sorted_index = sorted(range(len(distance_list)), key=lambda k: distance_list[k])
  subs = [dic.get(i, i) for i in sorted_index[:21]] # note: [:21] keeps 21 terms; use [:20] for exactly 20
  return subs
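
The distance-based ranking can be sketched in the same way. The vectors, labels and names below (`euclidean`, `toy_embeddings`, `toy_labels`, `query`) are again hypothetical. The entry labelled "far-east" is chosen to point in exactly the same direction as the query but with five times the length, which shows one way Euclidean distance can disagree with cosine similarity: direction alone would rank it first, while distance ranks it last.

```python
import math

def euclidean(u, v):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical 2-D "embeddings" and labels, for illustration only.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 0.5]]
toy_labels = {0: "east", 1: "north", 2: "north-east", 3: "far-east"}
query = [2.0, 0.1]  # [10.0, 0.5] is exactly 5 * query

distances = [euclidean(query, e) for e in toy_embeddings]

# Ascending order: the smallest distance is the closest match.
ranked = sorted(range(len(distances)), key=lambda i: distances[i])
nearest = [toy_labels[i] for i in ranked[:2]]
print(nearest)
```

When all vectors have similar lengths the two measures tend to produce similar rankings, which may be why the cosine and distance results in this notebook largely overlap; that explanation is an assumption, since the norms of the saved embeddings are not shown here.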

In [19]:
eud20terms_for_search_1 = eucli_distance_list_for_one(topic_array_1)
print('the top 20 (distance) related terms for ', search_1, 'is')
print(eud20terms_for_search_1)
print('')

eud20terms_for_search_2 = eucli_distance_list_for_one(topic_array_2)
print('the top 20 (distance) related terms for ', search_2, 'is')
print(eud20terms_for_search_2)


the top 20 (distance) related terms for  Alzheimer is
["alzheimer's disease", 'parkinsons', 'schizophrenia', 'marburg virus', 'dementia with lewy bodies', 'ataxia', 'poverty in algeria', 'bipolar disorder', 'infertility', 'vip syndrome', 'ebola vaccine', 'psychosis', 'autistic spectrum disorder', 'dementia', 'otitis media', 'middle ear infection', 'stereotype threat', 'psoriasis', 'west nile virus', 'egalitarianism', 'poverty in nigeria']

the top 20 (distance) related terms for  lesbian is
['transgender', 'recreation', 'genocide', 'farming', 'oppression', 'seasons', 'wealth', 'humanities', 'poverty', 'psychology', 'capitalist', 'anthropologist', 'exile', 'socialism', 'tourism', 'sport', 'piles', 'meditation', 'flood', 'researching', 'doi']
