# Term Harmonize - STEP 6: Impute Hierarchy (TL4)
#### Author: Ryan Urbanowicz (ryanurb@upenn.edu) 
#### Institution: University of Pennsylvania - Perelman School of Medicine
#### Project: CMREF Data Harmonization 
#### Date: 9/1/21

#### Project Overview:
See the first notebook in this series ('Step_1_Term_Harmonize_Data_Preparation.ipynb') for an overview of this project, these notebooks, the target application, data availability, code dependencies, and our strategy for generalizing the code in these notebooks. 

#### Notebook Summary:
This notebook loads the working mapping file following completion of PT and HLT mapping (which we assume has been completed in full). It then performs the next level of term hierarchy imputation. In the current target application this involves imputing HLGTs from HLTs. However, HLGT imputation can involve branching (one HLT may link to more than one HLGT), so further manual, subjective annotation will be required after running this notebook. Whenever the data includes relevant term data, we apply exact and then fuzzy matching to assist in selecting the most appropriate of the available HLGT branches. When this relevant term data is not available, we pick the first of the possible branches by default. Branch 'quality' is tracked much like term mapping quality. 

***
## Load Python packages required in this notebook

In [1]:
#Load necessary packages.
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Import Progress bar
from tqdm import tnrange, tqdm_notebook

***
## Load Working Map File and Relevant Ontology Files


#### Create general variable names for any target application specific values.

In [2]:
# Input filename for 'target dataset' (excel file loaded in this application)
target_study_data = 'Combined_MEDHX_TERMS_20studies.xlsx' 

ont_DL1_data = 'LLT.xlsx' # Input filename for ontology file defining all DL1 terms and their codes. 
ont_DL1_name_col = 'llt_name' # column label for DL1 term name
ont_DL1_code_col ='llt_code' # column label for DL1 term code
ont_DL1_cur_col = 'llt_currency' # column label for term currency
ont_DL2_data = 'PT.xlsx' # Input filename for ontology file defining all DL2 terms and their codes. 
ont_DL2_name_col = 'pt_name' # column label for DL2 term name
ont_DL2_code_col = 'pt_code' # column label for DL2 term code
ont_DL3_DL2_data = 'HLT_PT.xlsx' # Input filename for ontology file defining connections between DL2 and DL3 term codes. 
ont_DL3_data = 'HLT.xlsx' # Input filename for ontology file defining all DL3 terms and their codes.
ont_DL3_name_col = 'hlt_name' # column label for DL3 term name
ont_DL3_code_col = 'hlt_code' # column label for DL3 term code
ont_DL4_name_col = 'hlgt_name' # column label for DL4 term name
ont_DL4_code_col = 'hlgt_code' # column label for DL4 term code

ont_DL4_DL3_data = 'HLGT_HLT.xlsx' # Input filename for ontology file defining connections between DL3 and DL4 term codes. 
ont_DL4_data = 'HLGT.xlsx' # Input filename for ontology file defining all DL4 terms and their codes.
ont_DL5_DL4_data = 'SOC_HLGT.xlsx' # Input filename for ontology file defining connections between DL4 and DL5 term codes. 
ont_DL5_data = 'SOC.xlsx' # Input filename for ontology file defining all DL5 terms and their codes.

DL1_FT1 = 'MHTERM' # focus term 1: This term is available over all studies. 
DL1_FT2 = 'LLT_NAME' # focus term 2: an alternative term available for a subset of studies. This one supposedly conforms to the MedDRA standard, so we expect it to yield more exact matches. May offer a better match for the lowest level of the standardized terminology.
DL1_FT3 = 'MHMODIFY' # focus term 3: an alternative term available for a subset of studies. May offer a better match for the lowest level of the standardized terminology.

DL2 = 'PT_NAME' # Secondary level terms (i.e. more general than DL1 terms)
DL3 = 'HLT_NAME' # Tertiary level terms (i.e. more general than DL2 terms)
DL4 = 'HLGT_NAME' # Quaternary level terms (i.e. more general than DL3 terms)
DL5 = 'SOC_NAME' # Quinary Level terms (i.e. more general than DL4 terms)

TL1_qual_code_header = 'LLT_map_code' # column name for lowest term level mapping quality code (added to mapping file)
TL1_name_header = 'T_LLT' # column name for the 'mapped' TL1 - term name (added to mapping file)
TL1_code_header = 'T_LLT_CODE' # column name for the 'mapped' TL1 - term code (added to mapping file)
TL2_name_header = 'T_PT'
TL2_code_header = 'T_PT_CODE'
TL3_name_header = 'T_HLT'
TL3_code_header = 'T_HLT_CODE'
TL4_name_header = 'T_HLGT'
TL4_code_header = 'T_HLGT_CODE'
TL5_name_header = 'T_SOC'
TL5_code_header = 'T_SOC_CODE'

FZ1_FT1 = 'FZMatch_1_'+DL1_FT1 # column name for best FT1 fuzzy match (temporarily added to mapping file)
FZ2_FT1 = 'FZMatch_2_'+DL1_FT1 # column name for second best FT1 fuzzy match (temporarily added to mapping file)
FZ3_FT1 = 'FZMatch_3_'+DL1_FT1 # column name for third best FT1 fuzzy match (temporarily added to mapping file)
FZ4_FT1 = 'FZMatch_4_'+DL1_FT1 # column name for fourth best FT1 fuzzy match (temporarily added to mapping file)
FZ5_FT1 = 'FZMatch_5_'+DL1_FT1 # column name for fifth best FT1 fuzzy match (temporarily added to mapping file)

FZMC = 'FZMatch_Choice_ID_'+DL1_FT1 #column name for the column where manual annotator will enter the number (1-5) indicating the FT1 fuzzy matched term that offers the best match (if a good one is identified)
FZCT = 'FZMatch_Copied_Term' #column name for the column where manual annotator can alternatively manually copy in the MedDRA LLT term that best matches the term information in this row (can come from FT2 or FT3 if term was not identified in FT1)

FZ1_FT2 = 'FZMatch_1_'+DL1_FT2 # column name for best FT2 fuzzy match (temporarily added to mapping file)
FZ2_FT2 = 'FZMatch_2_'+DL1_FT2 # column name for second best FT2 fuzzy match (temporarily added to mapping file)
FZ3_FT2 = 'FZMatch_3_'+DL1_FT2 # column name for third best FT2 fuzzy match (temporarily added to mapping file)
FZ4_FT2 = 'FZMatch_4_'+DL1_FT2 # column name for fourth best FT2 fuzzy match (temporarily added to mapping file)
FZ5_FT2 = 'FZMatch_5_'+DL1_FT2 # column name for fifth best FT2 fuzzy match (temporarily added to mapping file)

FZ1_FT3 = 'FZMatch_1_'+DL1_FT3 # column name for best FT3 fuzzy match (temporarily added to mapping file)
FZ2_FT3 = 'FZMatch_2_'+DL1_FT3 # column name for second best FT3 fuzzy match (temporarily added to mapping file)
FZ3_FT3 = 'FZMatch_3_'+DL1_FT3 # column name for third best FT3 fuzzy match (temporarily added to mapping file)
FZ4_FT3 = 'FZMatch_4_'+DL1_FT3 # column name for fourth best FT3 fuzzy match (temporarily added to mapping file)
FZ5_FT3 = 'FZMatch_5_'+DL1_FT3 # column name for fifth best FT3 fuzzy match (temporarily added to mapping file)

TL3_B = 'HLT_branches'
TL3_BQ = 'HLT_branch_quality'
TL3_FZT = 'HLT_Fuzzy_Terms'
TL3_FZS = 'HLT_Fuzzy_Scores'

TL4_B = 'HLGT_branches'
TL4_BQ = 'HLGT_branch_quality'
TL4_FZT = 'HLGT_Fuzzy_Terms'
TL4_FZS = 'HLGT_Fuzzy_Scores'

TL5_B = 'SOC_branches'
TL5_BQ = 'SOC_branch_quality'
TL5_FZT = 'SOC_Fuzzy_Terms'
TL5_FZS = 'SOC_Fuzzy_Scores'

### Load map File

In [3]:
#Load the working map (comma-separated) file into a pandas data frame
target_map_file = 'MH_harmonization_map_15_TL3.csv' #Input filename (CSV file loaded in this application)
td = pd.read_csv(target_map_file, na_values=' ') #Data loaded so that blank cells are read as 'NA'
td.shape

(28720, 21)

### Load 4th Level Terminology Standard File

In [4]:
tl4 = pd.read_excel(ont_DL4_data, na_values=' ')
tl4.shape

(337, 9)

### Load 3rd to 4th Level Terminology Connection File

In [5]:
tl4_tl3 = pd.read_excel(ont_DL4_DL3_data, na_values=' ')
tl4_tl3.shape

(1755, 2)

***
## Insert New Columns for Hierarchy Imputation
Insert the columns needed for the fourth level of the hierarchy imputation (i.e. HLGT) into the mapping file. For level 4 (i.e. HLGT) we will again add columns to handle the branch possibilities, the quality of our branch selection, and results from any fuzzy branch matching. 


*To adapt this code to other tasks, users may need to specify different column indexes below. We place these new columns after the original data columns.*

In [6]:
td.insert(loc=21,column=TL4_name_header,value='NA') 
td.insert(loc=22,column=TL4_code_header,value='NA') 

td.insert(loc=23,column=TL4_B,value='NA') 
td.insert(loc=24,column=TL4_BQ,value='NA') 
td.insert(loc=25,column=TL4_FZT,value='NA') 
td.insert(loc=26,column=TL4_FZS,value='NA') 
td.shape


(28720, 27)

## Define a method that splits a list of (term, score) pairs into a list of terms and a list of scores

In [7]:
def listify(inList):
    nameList = []
    scoreList = []
    for each in inList:
        nameList.append(each[0])
        scoreList.append(each[1])
    return nameList, scoreList
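
As a small illustration of what this helper does, the sketch below re-implements it with list comprehensions and applies it to a made-up fuzzy-matching result. The sample terms and scores are hypothetical, and the sketch assumes each element is a two-item (term, score) tuple.

```python
def listify(pairs):
    """Split a list of (name, score) tuples into two parallel lists."""
    names = [name for name, _ in pairs]
    scores = [score for _, score in pairs]
    return names, scores


# Hypothetical fuzzy-matching result: (candidate term, similarity score) pairs
sample = [("Cardiac arrhythmias", 95), ("Heart failures", 40)]

names, scores = listify(sample)
print(names)   # ['Cardiac arrhythmias', 'Heart failures']
print(scores)  # [95, 40]
```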

***
## Impute TL4 from TL3 (Branches Possible)
In this application we impute HLGTs (i.e. TL4) from previously imputed HLTs (TL3). 
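
To make the idea of branching concrete, here is a minimal sketch that builds a lookup from each HLT code to all of its linked HLGT codes in a single pass. The code pairs are invented placeholders rather than real MedDRA codes, and `branches_for` is a hypothetical helper; the loop in this notebook instead rescans the connection file for every row.

```python
from collections import defaultdict

# Hypothetical (hlgt_code, hlt_code) pairs standing in for rows of the connection file
pairs = [(100, 1), (200, 2), (300, 2)]

# Build the lookup once: HLT code -> list of linked HLGT codes
hlt_to_hlgts = defaultdict(list)
for hlgt_code, hlt_code in pairs:
    hlt_to_hlgts[hlt_code].append(hlgt_code)


def branches_for(hlt_code):
    """Return every HLGT code linked to the given HLT code (empty list if none)."""
    return hlt_to_hlgts.get(hlt_code, [])


print(branches_for(1))  # [100]: a single parent, so the HLGT can be imputed directly
print(branches_for(2))  # [200, 300]: branching, so one branch must be selected
```

A dictionary lookup of this kind avoids converting the connection column to a list on every iteration, which may matter for larger mapping files.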

Before completing this task we lay out a coding scheme to describe the quality/confidence of any branch selection for this entire imputation procedure. These codes for TL4 will be entered into the column with the 'TL4_BQ' label. We have developed a custom coding scheme to suit the needs of our target application:

* 0 = No branching: Only one possible term available for imputation - best quality implied. 
* 1 = Branching: Branch selected based on an exact match with the available DL4 term. 
* 2 = Branching: DL4 term available, fuzzy matching applied, top scoring fuzzy match chosen/confirmed. 
* 3 = Branching: DL4 term available, fuzzy matching applied, non-top scoring fuzzy match chosen/confirmed.
* 4 = Branching: no DL4 term available - No fuzzy matching available. Picked first branch by default. 
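
One optional way to keep these codes readable in code is a small enumeration. This is only a sketch: the class and member names below are hypothetical and do not appear elsewhere in these notebooks, which use the bare integers.

```python
from enum import IntEnum


class BranchQuality(IntEnum):
    """Hypothetical names for the branch quality codes listed above."""
    NO_BRANCHING = 0    # only one possible term, best quality implied
    EXACT_MATCH = 1     # branch chosen by exact match with the source term
    FUZZY_TOP = 2       # top scoring fuzzy match chosen/confirmed
    FUZZY_NON_TOP = 3   # non-top scoring fuzzy match chosen/confirmed
    DEFAULT_FIRST = 4   # no source term available, first branch picked by default


# IntEnum members compare equal to the plain integers stored in the map file
print(BranchQuality.EXACT_MATCH == 1)  # True
print(BranchQuality(4).name)           # DEFAULT_FIRST
```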

In [8]:
data_count = 0
total_branched = 0
total_direct = 0

for each in tqdm_notebook(td[TL3_code_header], desc='1st loop'): #for each row
    if not pd.isna(each): #Check for missing value

        #Identify the TL3 code in the MedDRA HLGT_HLT file
        tempList = tl4_tl3[ont_DL3_code_col].tolist() # the column name reference is specific to the loaded MedDRA file
        indexList = [i for i,val in enumerate(tempList) if val==int(float(each))]
 
        #Put all connected TL4 codes in TL4_B separated by an underscore
        TL4_list = []
        TL4_str = ''
        for i in indexList:
            TL4_list.append(tl4_tl3[ont_DL4_code_col][i]) # the column name reference is specific to the loaded MedDRA file
            TL4_str += str(tl4_tl3[ont_DL4_code_col][i])+'_' # the column name reference is specific to the loaded MedDRA file
        
        #Branch Reporting
        if len(TL4_list) > 1: # is there more than one branch 
            total_branched += 1
            td[TL4_B][data_count] = TL4_str

        else: #only one branch found
            total_direct += 1
            #Identify term for single code
            tempList = tl4[ont_DL4_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            tempIndex = tempList.index(TL4_list[0]) #find index/location of code in tl4 set of terms
            term = tl4[ont_DL4_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file

            td[TL4_name_header][data_count] = term
            td[TL4_code_header][data_count] = TL4_list[0]
            td[TL4_BQ][data_count] = 0 # Top Branch Quality

    data_count +=1 

print("Directly Imputed: " +str(total_direct))
print("With Branches: " +str(total_branched))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Directly Imputed: 28324
With Branches: 366
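
The SettingWithCopyWarning shown above is what pandas emits for chained assignments of the form `td[column][row] = value`. The pandas documentation recommends a single `.loc[row, column]` assignment instead, and under pandas' copy-on-write behavior a chained assignment does not update the original frame at all. The sketch below shows the `.loc` form on a tiny stand-in frame; `td_demo` and its values are hypothetical, and it assumes the default integer row index used in this notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for the working map file
td_demo = pd.DataFrame({
    "T_HLGT": ["NA", "NA", "NA"],
    "HLGT_branch_quality": [float("nan")] * 3,
})

row = 1  # row label; equals the row position under the default RangeIndex

# One indexing operation per assignment, instead of td_demo[col][row] = value
td_demo.loc[row, "T_HLGT"] = "Cardiac arrhythmias"  # illustrative term only
td_demo.loc[row, "HLGT_branch_quality"] = 0

print(td_demo)
```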


## Save file in progress

In [9]:
td.to_csv("MH_harmonization_map_16_TL4.csv", header=True, index=False)  

## Exact Matching of TL4 Branches
This process focuses on rows whose TL4 has multiple branches and that have an original DL4 term available, which allows for exact matching. For any row with branching but no DL4 data, we pick the first branch by default and enter a TL4_BQ of 4 (lowest quality branch resolution).
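
The decision rule just described can be sketched for a single row as a plain function. `resolve_exact` is a hypothetical helper, the terms and codes are placeholders, and unlike the loop below it stops at the first matching branch.

```python
def resolve_exact(source_term, branch_terms, branch_codes):
    """Resolve one branching row.

    Returns (term, code, quality), or None when a source term exists but
    matches none of the branches (such rows are left for fuzzy matching).
    """
    if source_term is None:
        # No source term: fall back to the first branch, lowest quality code
        return branch_terms[0], branch_codes[0], 4
    for term, code in zip(branch_terms, branch_codes):
        # Case-insensitive exact comparison
        if term.casefold() == source_term.casefold():
            return term, code, 1
    return None


terms = ["Heart failures", "Cardiac arrhythmias"]
codes = ["111", "222"]  # placeholder codes, not real MedDRA codes

print(resolve_exact("CARDIAC ARRHYTHMIAS", terms, codes))  # ('Cardiac arrhythmias', '222', 1)
print(resolve_exact(None, terms, codes))                   # ('Heart failures', '111', 4)
print(resolve_exact("Arrhythmia", terms, codes))           # None
```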

In [10]:
td = pd.read_csv('MH_harmonization_map_16_TL4.csv', na_values=' ')
td.shape

data_count = 0
total_exact_matched = 0
total_data_no_match = 0
total_no_data = 0

for each in tqdm_notebook(td[TL4_B], desc='1st loop'): #for each row
    if not pd.isna(td[TL1_name_header][data_count]): #Check for missing value
        if td[TL4_BQ][data_count] != 0: # Focus on terms with multiple branches

            #Create list of branch codes
            codeList = each.strip('_').split('_')
            #Get MedDRA terms for each code in branch list - put in a list of terms
            textList = [] #text for corresponding branch terms
            tempList = tl4[ont_DL4_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            for code in codeList:
                #Identify the TL4 code in the MedDRA HLGT file
                tempIndex = tempList.index(int(code))

                #Put the corresponding TL4 term into the list
                term = tl4[ont_DL4_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
                textList.append(str(term))

            #Exact Matching
            if not pd.isna(td[DL4][data_count]): #See if row has original DL4 data available.
                #Check for branch exact match
                term_count = 0
                matched = False
                for term in textList: #each branch term
                    if str(term).casefold() == str(td[DL4][data_count]).casefold(): 
                        td[TL4_name_header][data_count] = term # Map matching term
                        td[TL4_code_header][data_count] = codeList[term_count] # Map matching term code (taken from the branch code list)
                        td[TL4_BQ][data_count] = 1 # Exact Match Branch Quality
                        total_exact_matched += 1
                        matched = True
                    term_count +=1
                if not matched:
                    total_data_no_match += 1

            else: # no DL4 data
                #Pick first branch by default.
                td[TL4_name_header][data_count] = textList[0] # Map matching term
                td[TL4_code_header][data_count] = codeList[0] # Map first branch's term code (taken from the branch code list)
                td[TL4_BQ][data_count] = 4 # First branch picked by default - lowest quality branch resolution code. 
                total_no_data += 1
        
    data_count +=1 

print("Branches Resolved with Exact Match: " +str(total_exact_matched))
print("Branches with Data but No Exact Match: " +str(total_data_no_match))  
print("Branches with no Data: " +str(total_no_data))  

(28720, 27)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Branches Resolved with Exact Match: 237
Branches with Data but No Exact Match: 3
Branches with no Data: 126


## Save file in progress

In [11]:
td.to_csv("MH_harmonization_map_17_TL4.csv", header=True, index=False)  

## Fuzzy Matching of TL4 Branches
This process focuses on rows whose TL4 branches were not resolved by exact matching but that have a DL4 term available for fuzzy matching.
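
To illustrate the ranking idea without extra dependencies, the sketch below scores each branch term against the row's source term and sorts the results from best to worst. It uses `difflib.SequenceMatcher` from the standard library as a stand-in; this is not the scorer used by fuzzywuzzy's `process.extract`, so the scores will differ from those produced in this notebook. `rank_branches` and the sample terms are hypothetical.

```python
from difflib import SequenceMatcher


def rank_branches(source_term, branch_terms):
    """Return (term, score) pairs sorted from most to least similar (scores 0-100)."""
    scored = []
    for term in branch_terms:
        ratio = SequenceMatcher(None, source_term.casefold(), term.casefold()).ratio()
        scored.append((term, round(100 * ratio)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_branches("cardiac arrhythmia", ["Heart failures", "Cardiac arrhythmias"])
print(ranked[0][0])  # Cardiac arrhythmias
```

The ranked pairs have the same (term, score) shape that `listify` expects, so an annotator can review the candidates in order.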

In [13]:
td = pd.read_csv('MH_harmonization_map_17_TL4.csv', na_values=' ')
td.shape

data_count = 0
total_fuzzy_matched = 0
total_no_data = 0

for each in tqdm_notebook(td[TL4_B], desc='1st loop'): #for each row
    if not pd.isna(td[TL1_name_header][data_count]): #Check for missing value
        if pd.isna(td[TL4_BQ][data_count]): #haven't previously assigned branch quality a code (i.e. of 0, 1, or 4) (i.e. multiple branches, no exact match, and DL4 available)

            #Create list of branch codes
            codeList = each.strip('_').split('_')
            #Get MedDRA terms for each code in branch list - put in a list of terms
            textList = [] #text for corresponding branch terms
            tempList = tl4[ont_DL4_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            for code in codeList:
                #Identify the TL4 code in the MedDRA HLGT file
                tempIndex = tempList.index(int(code))

                #Put the corresponding TL4 term into the list
                term = tl4[ont_DL4_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
                textList.append(str(term))

            #Perform fuzzy matching between row's DL4 (HLGT) term and the branch term list
            bestFound = process.extract(td[DL4][data_count],textList)
            nameList,scoreList = listify(bestFound)

            # Add results of fuzzy matching
            td[TL4_FZT][data_count] = str(nameList)
            td[TL4_FZS][data_count] = str(scoreList)  

            total_fuzzy_matched += 1
        else:
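            total_no_data += 1 # note: counts every row already resolved earlier (branch quality 0, 1, or 4), not only rows lacking DL4 data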
            
    data_count += 1
print("Branches with Applied Fuzzy Matching: " +str(total_fuzzy_matched))
print("Branches with no Data: " +str(total_no_data))  

(28720, 27)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Branches with Applied Fuzzy Matching: 3
Branches with no Data: 28687


***
## Save working map file
Before moving on to the next step of the harmonization pipeline we will save our mapping file 'in progress'. Since a row index was previously added, we will set index to 'False' below.  

In [14]:
td.to_csv("MH_harmonization_map_18_TL4.csv", header=True, index=False)  

## Add codes for branch terms resolved with fuzzy matching
We examined the data and found that only 3 rows reached this step, so we simply confirmed the best branch term for each, looked up the corresponding codes in the MedDRA TL4 (HLGT) file, and entered them manually, saving this file as "MH_harmonization_map_18_TL4_Fuzzy.csv". This process could be automated, as it was in the previous notebook, if needed for a different application. 
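
As a rough idea of what such automation might look like (not necessarily how the previous notebook does it), the sketch below parses a stored fuzzy term cell, takes the annotator's chosen candidate, and looks up its code. `pick_fuzzy_branch`, the `choice` argument, and the name-to-code dictionary are all hypothetical, and the codes are placeholders rather than real MedDRA codes.

```python
import ast


def pick_fuzzy_branch(fuzzy_terms_cell, name_to_code, choice=1):
    """Return (term, code, quality) for the chosen fuzzy candidate.

    fuzzy_terms_cell: string form of a Python list, as written by str(nameList)
    choice: 1-based rank of the candidate the annotator confirmed
    """
    terms = ast.literal_eval(fuzzy_terms_cell)  # safely parse "['A', 'B']" into a list
    term = terms[choice - 1]
    quality = 2 if choice == 1 else 3  # 2 = top scoring match, 3 = non-top match
    return term, name_to_code[term], quality


cell = "['Cardiac arrhythmias', 'Heart failures']"
name_to_code = {"Cardiac arrhythmias": 222, "Heart failures": 111}  # placeholder codes

print(pick_fuzzy_branch(cell, name_to_code))            # ('Cardiac arrhythmias', 222, 2)
print(pick_fuzzy_branch(cell, name_to_code, choice=2))  # ('Heart failures', 111, 3)
```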

## Notebook conclusion
In this notebook we have mapped all TL4 names and codes, resolving any term branches using the term information available in the dataset. This was completed with exact and fuzzy matching, similar to the first phase of TL1 mapping. 

In the next notebook we replicate the same process we just completed for TL4, but for TL5. 