<a href="https://colab.research.google.com/github/adambouras/OHDSI-Symposium-Submission/blob/main/OHDSI_Submission_BERT_Similarity_Ranking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The aim of this project is to assess the similarities and differences that exist between the food insecurity assessment tools and questions identified by the Gravity Project (GP). These questions fall into two groups. One group has been used as <a href="https://www.hl7.org/fhir/" target="_blank">FHIR resources</a> and mapped to the OMOP CDM **(Group A)**. The other group was identified by the GP but is neither mapped to the OMOP CDM nor identified as FHIR resources **(Group B)**. This project helps identify the level of difference and substitutability that may exist between Group A and Group B.

In [1]:
%%capture
%pip install Cmake
%pip install bertopic
%pip install flair
%pip install -U sentence-transformers
%pip install dash_bio


In [None]:
%%capture
import subprocess
import sys
try:
  import os
  import re
  import types
  import numpy as np
  import pandas as pd
  import seaborn as sns
  import plotly.express as px
  import plotly.graph_objects as go
  from bertopic import BERTopic
  from sentence_transformers import SentenceTransformer, util
  from pyxlsb import open_workbook
except ModuleNotFoundError:
  # pip.main() is no longer part of pip's public API, so call pip through the current interpreter
  subprocess.check_call([sys.executable, "-m", "pip", "install", "transformers", "bertopic"])
  from transformers import pipeline

For this project, I used the open-source BERT model [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), hosted on Hugging Face, to build the algorithm. The model maps sentences and paragraphs to a 384-dimensional dense vector space, which supports sentence similarity as well as tasks such as clustering and semantic search. It was trained on a dataset of more than 1 billion sentence pairs.
![Algorithm used to identify the questions similarities](./img/OHDSI-BERT%20Ranking.jpg)
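
As a reference for the scores reported later, the similarity measure computed in this notebook with `util.cos_sim` is the cosine similarity between two sentence embeddings. The symbols $\mathbf{u}$ and $\mathbf{v}$ are notation introduced here for the embeddings of one Group A question and one Group B question:

$$
\operatorname{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}, \qquad \mathbf{u}, \mathbf{v} \in \mathbb{R}^{384}
$$

The score lies in $[-1, 1]$; values close to 1 indicate that the two questions point in nearly the same direction in the embedding space.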

Read the Excel file from the GitHub repo. The original [Excel file](https://confluence.hl7.org/download/attachments/91994432/05142021%20Food%20Insecurity%20MASTER.xlsx?api=v2) was downloaded from the [GP website](https://confluence.hl7.org/display/GRAV/Food+Insecurity). We manually mapped the assessment tool questions to the [OMOP CDM using Athena](https://athena.ohdsi.org/search-terms/start) and added columns reporting whether each question's [LOINC code](https://loinc.org/get-started/what-loinc-is/) was mapped. The final [Excel file](08.31.2023%20Food%20Insecurity%20Screening%20SDOH%20Gravity%20Project.xlsx) is included in the GitHub repository of this project.

In [None]:
xls = pd.ExcelFile("./08.31.2023%20Food%20Insecurity%20Screening%20SDOH%20Gravity%20Project.xlsx")

In [4]:
xls.sheet_names

['Program Definitions',
 'Programs',
 'Screening Questions-Answers',
 'Diagnoses-Assessed Needs',
 'Interventions Planned-Completed',
 'Interventions CPT HCPCS',
 'Message and Task Related Codes',
 'Answer Codes',
 'Goals']

In [5]:
#create dictionary for later retrieval
d_s = {'sheet':[], 'df': []}
for i , c in enumerate(xls.sheet_names, 1):
    globals()['df'+ str(i)] = pd.read_excel(xls, c)
    d_s['sheet'].append(c+str(i))
    d_s['df'].append('df'+str(i))

In [6]:
d_s

{'sheet': ['Program Definitions1',
  'Programs2',
  'Screening Questions-Answers3',
  'Diagnoses-Assessed Needs4',
  'Interventions Planned-Completed5',
  'Interventions CPT HCPCS6',
  'Message and Task Related Codes7',
  'Answer Codes8',
  'Goals9'],
 'df': ['df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9']}
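
The cell above creates `df1` to `df9` through `globals()`. One possible alternative, sketched here with toy data, is to keep the frames in a dictionary keyed by sheet name; `pd.read_excel(xls, sheet_name=None)` returns such a dictionary for a real workbook. The frames and the name `frames` below are illustrative stand-ins, not the contents of the project's workbook.

```python
import pandas as pd

# Toy stand-ins for two sheets (illustrative values only).
# With the real workbook, pd.read_excel(xls, sheet_name=None) yields a dict like this.
frames = {
    "Programs": pd.DataFrame({"id": ["P-1"]}),
    "Screening Questions-Answers": pd.DataFrame(
        {"id": ["S-5", "S-6"], "Question_Concept": ["toy question A", "toy question B"]}
    ),
}

# Look a sheet up by its name instead of a generated variable such as df3
questions = frames["Screening Questions-Answers"]

# Summarize what was loaded
sheet_shapes = {name: frame.shape for name, frame in frames.items()}
print(sheet_shapes)
```

Looking sheets up by name avoids having to remember that the screening questions sit at position 3.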

In [7]:
df3.head()

Unnamed: 0,id,Standardized,Used_in_IG,Domain,Using_Organizations_Count,Relevant_Screening_Tool,LOINC_Panel_Code,LOINC_Panel_Name,Question_Concept,LOINC_Question_Code,...,Date,Answer_Concept,LOINC_Answer,LOINC_Answer_Code,SNOMED_CT_Code,SNOMED_CT_Fully_Specified_Name,Semantic_Tab,Screening_Reference_Link_Information,Notes,Element_Rating
0,S-5,Yes,No,Food Insecurity,1,Medicare THA,sdohcc-lncct-20191120151200ET,,Do you eat fewer than 2 meals a day,sdohcc-lncct-20191120151200ET,...,,Yes,LA33-6,,,,,https://mydoctor.kaiserpermanente.org/ncal/Ima...,,
1,S-5,Yes,No,Food Insecurity,1,Medicare THA,sdohcc-lncct-20191120151200ET,,Do you eat fewer than 2 meals a day,sdohcc-lncct-20191120151200ET,...,,No,LA32-8,,,,,,,
2,S-5,Yes,No,Food Insecurity,1,Medicare THA,sdohcc-lncct-20191120151300ET,,Do you always have enough money to buy the foo...,sdohcc-lncct-20191120151300ET,...,,Yes,LA33-6,,,,,,,
3,S-5,Yes,No,Food Insecurity,1,Medicare THA,sdohcc-lncct-20191120151300ET,,Do you always have enough money to buy the foo...,sdohcc-lncct-20191120151300ET,...,,No,LA32-8,,,,,,,
4,S-6,No,No,Food Insecurity,1,IHELP,sdohcc-lncct-20191127082300ET,,Do you have any concerns about having enough f...,sdohcc-lncct-20191127082300ET,...,,(open),sdohcc-lncct-20200122222000ET,,,,,https://sirenetwork.ucsf.edu/tools-resources/m...,,


#Create analysis of topics embedding

In [45]:
import re
# Group A: unique questions that were mapped to a vocabulary using Athena
sentences_yes = df3['Question_Concept'].loc[df3['Mapped_to_Athena'] == 'Yes'].unique().tolist()
sentences_yes = [x for x in sentences_yes if str(x) != 'nan']
# Remove the redundant "[U.S. FSS]" tag
pattern_yes = r"\s+\[U\.S\. FSS\]"
sentences_yes = [re.sub(pattern_yes, '', str(x)) for x in sentences_yes]
# Keep unique assessment questions: drop the duplicate wording of this item
duplicate_question = 'Within the past 12 months we worried whether our food would run out before we got money to buy more'
sentences_yes.remove(duplicate_question)
# Group B: questions that were not mapped to any vocabulary using Athena
sentences_no = df3['Question_Concept'].loc[df3['Mapped_to_Athena'] != 'Yes'].unique().tolist()
# Strip question-number prefixes (Q1., Q14b., ...), question marks, and periods
pattern = r"Q\d{1,2}\.\s+|Q\d{1,2}b\.\s+|\?|\."
sentences_no = [re.sub(pattern, '', str(x)) for x in sentences_no]
# Delete the question that is mapped to Athena but not reported by the Gravity team
duplicated_question = 'Within the past 12 months, you worried that your food would run out before you got money to buy more'
sentences_no.remove(duplicated_question)
sentences = [sentences_yes, sentences_no]
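
To make the effect of the two cleaning patterns concrete, here is a small sketch that applies them to a few made-up strings. The patterns are copied from the cell above; the input strings and the names `raw_yes`, `raw_no`, `clean_yes`, and `clean_no` are illustrative and do not come from the workbook.

```python
import re

# Patterns copied from the cleaning cell
PATTERN_YES = r"\s+\[U\.S\. FSS\]"
PATTERN_NO = r"Q\d{1,2}\.\s+|Q\d{1,2}b\.\s+|\?|\."

# Toy inputs (illustrative, not rows from the workbook)
raw_yes = "The food we bought just didn't last [U.S. FSS]"
raw_no = [
    "Q1. Do you skip meals?",
    "Q14b. How often does this happen?",
    "Is U.S. aid available?",
]

# Apply the same substitutions as the notebook
clean_yes = re.sub(PATTERN_YES, "", raw_yes)
clean_no = [re.sub(PATTERN_NO, "", text) for text in raw_no]

print(clean_yes)
print(clean_no)
```

Note that the last alternative in the Group B pattern removes every period, so an abbreviation such as "U.S." becomes "US".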

In [47]:
for s in sentences_yes:
  print(s)

Within the past 12 months the food we bought just didn't last and we didn't have money to get more
Within the past 12 months, you worried that your food would run out before you got money to buy more
Within the past 12 months, the food you bought just didn't last and you didn't have money to get more
In the past year, have you or any family members you live with been unable to get any of the following when it was really needed?
Check all that apply


In [38]:
%%capture
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')

In [48]:
%%capture
model = SentenceTransformer('all-MiniLM-L6-v2')

#Compute bi-ranked embedding for food insecurity

# Encode each group once (Group A = mapped to Athena, Group B = not mapped)
embeddings_yes = model.encode(sentences_yes, convert_to_tensor=True)
embeddings_no = model.encode(sentences_no, convert_to_tensor=True)
# Cosine similarities: rows follow sentences_yes, columns follow sentences_no.
# .cpu() keeps the conversion to NumPy working when the model runs on a GPU.
cosine_scores = util.cos_sim(embeddings_yes, embeddings_no).cpu().numpy()

ranked_domain = {'standard': [], 'index_no': [], 'label_no': [], 'index_yes': [], 'label_yes': [], 'ranked_value': []}
# Record one row per (Group B question, Group A question) pair with its score
for i, sentence_no in enumerate(sentences_no):
  for j, sentence_yes in enumerate(sentences_yes):
    ranked_domain['standard'].append(0)
    ranked_domain['index_no'].append(i)
    ranked_domain['label_no'].append(sentence_no)
    ranked_domain['index_yes'].append(j)
    ranked_domain['label_yes'].append(sentence_yes)
    ranked_domain['ranked_value'].append(round(float(cosine_scores[j][i]), 2))
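
The pairwise scoring step can be illustrated without downloading the model. The sketch below uses toy 3-dimensional vectors in place of the 384-dimensional sentence embeddings and plain NumPy in place of `util.cos_sim`; the function `cosine_similarity_matrix` and the arrays `group_a` and `group_b` are hypothetical names introduced for this example.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Return cosine similarities between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy vectors standing in for sentence embeddings (illustrative values only)
group_a = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
group_b = np.array([[2.0, 0.0, 0.0],
                    [1.0, 3.0, 0.0],
                    [1.0, 0.0, 1.0]])

# Rows follow Group A, columns follow Group B, as in the notebook's cos_sim call
scores = cosine_similarity_matrix(group_a, group_b)

# For each Group B vector, the index and score of its closest Group A vector
best_a_for_each_b = scores.argmax(axis=0)
best_score_for_each_b = scores.max(axis=0)

print(np.round(scores, 2))
print(best_a_for_each_b)
```

Computing the full matrix in one call avoids re-encoding or re-scoring inside nested loops.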

In [49]:
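df = pd.DataFrame.from_dict(ranked_domain)
# Convert the long table into a pivot table (rows: index_no, columns: index_yes).
# Keyword arguments are required by DataFrame.pivot in pandas 2.0 and later.
df_map = df.pivot(index='index_no', columns='index_yes', values='ranked_value')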

In [54]:
df

Unnamed: 0,standard,index_no,label_no,index_yes,label_yes,ranked_value
0,0,0,Do you eat fewer than 2 meals a day,0,Within the past 12 months the food we bought j...,0.38
1,0,0,Do you eat fewer than 2 meals a day,1,"Within the past 12 months, you worried that yo...",0.43
2,0,0,Do you eat fewer than 2 meals a day,2,"Within the past 12 months, the food you bought...",0.35
3,0,0,Do you eat fewer than 2 meals a day,3,"In the past year, have you or any family membe...",0.04
4,0,1,Do you always have enough money to buy the foo...,0,Within the past 12 months the food we bought j...,0.64
...,...,...,...,...,...,...
363,0,90,Are you worried that you or others in your hou...,3,"In the past year, have you or any family membe...",0.25
364,0,91,"Within the past 12 months, the food you bought...",0,Within the past 12 months the food we bought j...,0.91
365,0,91,"Within the past 12 months, the food you bought...",1,"Within the past 12 months, you worried that yo...",0.86
366,0,91,"Within the past 12 months, the food you bought...",2,"Within the past 12 months, the food you bought...",0.99
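
One way to read a ranking off a long table like the one above is to pivot it and take the highest-scoring Group A question for each Group B question. The sketch below uses a toy table with the same column names; its values and the names `toy`, `wide`, `best_match`, and `best_score` are illustrative and are not results from this project.

```python
import pandas as pd

# Toy long-form scores with the same column names as `df` (illustrative values)
toy = pd.DataFrame({
    "index_no":     [0, 0, 1, 1],
    "index_yes":    [0, 1, 0, 1],
    "ranked_value": [0.38, 0.43, 0.64, 0.20],
})

# Wide view: one row per Group B question, one column per Group A question
wide = toy.pivot(index="index_no", columns="index_yes", values="ranked_value")

# Closest Group A question (column label) and its score for each Group B question
best_match = wide.idxmax(axis=1)
best_score = wide.max(axis=1)

print(pd.DataFrame({"best_index_yes": best_match, "best_score": best_score}))
```

A Group B question whose best score is high is a candidate substitute for the corresponding Group A question; where to set that threshold is a judgment left to the analyst.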


In [56]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Create a single-panel figure for the similarity heatmap
fig = make_subplots(rows=1, cols=1, shared_yaxes=False)

# Rows are Group B questions (index_no), columns are Group A questions (index_yes).
# No shared coloraxis is set, so the trace-level colorscale and the colorscale dropdown take effect.
fig.add_trace(
    go.Heatmap(z=df_map.values.tolist(), colorscale="Viridis"),
    row=1, col=1
)


# Update plot sizing
fig.update_layout(
    width=800,
    height=900,
    autosize=False,
    margin=dict(t=100, b=0, l=0, r=0),
)

# Scene options apply only to 3D traces, so this call has no effect on the 2D heatmap
fig.update_scenes(
    aspectratio=dict(x=1, y=1, z=0.7),
    aspectmode="manual"
)

# Add dropdowns
button_layer_1_height = 1.08
steps = []
# add a slider
sliders = [dict(
    active=10,
    currentvalue={"prefix": "Frequency: "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    args=["colorscale", "Viridis"],
                    label="Viridis",
                    method="restyle"
                ),
                dict(
                    args=["colorscale", "Cividis"],
                    label="Cividis",
                    method="restyle"
                ),
                dict(
                    args=["colorscale", "Blues"],
                    label="Blues",
                    method="restyle"
                ),
                dict(
                    args=["colorscale", "Greens"],
                    label="Greens",
                    method="restyle"
                ),
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.1,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
        dict(
            buttons=list([
                dict(
                    args=["reversescale", False],
                    label="False",
                    method="restyle"
                ),
                dict(
                    args=["reversescale", True],
                    label="True",
                    method="restyle"
                )
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.37,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
        dict(
            buttons=list([
                dict(
                    args=[{"contours.showlines": False, "type": "contour"}],
                    label="Hide lines",
                    method="restyle"
                ),
                dict(
                    args=[{"contours.showlines": True, "type": "contour"}],
                    label="Show lines",
                    method="restyle"
                ),
            ]),
            direction="down",
            pad={"r": 10, "t": 10},
            showactive=True,
            x=0.58,
            xanchor="left",
            y=button_layer_1_height,
            yanchor="top"
        ),
    ]
)

fig.update_layout(
    annotations=[
        dict(text="colorscale", x=0, xref="paper", y=1.06, yref="paper",
                             align="left", showarrow=False),
        dict(text="Reverse<br>Colorscale", x=0.25, xref="paper", y=1.07,
                             yref="paper", showarrow=False),
        dict(text="Lines", x=0.54, xref="paper", y=1.06, yref="paper",
                             showarrow=False)
    ], showlegend=False)

fig.update_layout(
    sliders=sliders
)

fig.show()