# Term Harmonize - STEP 7: Impute Hierarchy (TL5)
#### Author: Ryan Urbanowicz (ryanurb@upenn.edu) 
#### Institution: University of Pennsylvania - Perelman School of Medicine
#### Project: CMREF Data Harmonization 
#### Date: 7/18/19

#### Project Overview:
See the first notebook in this series ('Step_1_Term_Harmonize_Data_Preparation.ipynb') for an overview of this project, these notebooks, the target application, data availability, code dependencies, and our strategy for generalizing the code in these notebooks. 

#### Notebook Summary:
This notebook loads the working mapping file following completion of PT, HLT, and HLGT mapping (which we assume has been completed fully). It then performs the next level of term hierarchy imputation. In the current target application this involves imputing the SOCs from HLGTs. However, SOC imputation includes possible branching, so further manual, subjective annotation will be required after running this notebook. Whenever the data includes relevant term data, we apply exact matching and then fuzzy matching to assist in selecting the most appropriate of the available SOC branches. When this relevant term data is not available, we pick the first of the possible branches by default. Branch 'quality' is tracked much like term mapping quality. 

***
## Load Python packages required in this notebook

In [7]:
#Load necessary packages.
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Import Progress bar
from tqdm import tnrange, tqdm_notebook

***
## Load Working Map File and Relevant Terminology Standard Files


#### Create general variable names for any target application specific values.

In [8]:
# Input filename for 'target dataset' (excel file loaded in this application)
target_study_data = 'Combined_MEDHX_TERMS_20studies.xlsx' 

ont_DL1_data = 'LLT.xlsx' # Input filename for ontology file defining all DL1 terms and their codes. 
ont_DL1_name_col = 'llt_name' # column label for DL1 term name
ont_DL1_code_col ='llt_code' # column label for DL1 term code
ont_DL1_cur_col = 'llt_currency' # column label for term currency

ont_DL2_data = 'PT.xlsx' # Input filename for ontology file defining all DL2 terms and their codes. 
ont_DL2_name_col = 'pt_name' # column label for DL2 term name
ont_DL2_code_col = 'pt_code' # column label for DL2 term code

ont_DL3_DL2_data = 'HLT_PT.xlsx' # Input filename for ontology file defining connections between DL2 and DL3 term codes. 
ont_DL3_data = 'HLT.xlsx' # Input filename for ontology file defining all DL3 terms and their codes.
ont_DL3_name_col = 'hlt_name' # column label for DL3 term name
ont_DL3_code_col = 'hlt_code' # column label for DL3 term code

ont_DL4_DL3_data = 'HLGT_HLT.xlsx' # Input filename for ontology file defining connections between DL3 and DL4 term codes. 
ont_DL4_data = 'HLGT.xlsx' # Input filename for ontology file defining all DL4 terms and their codes.
ont_DL4_name_col = 'hlgt_name' # column label for DL4 term name
ont_DL4_code_col = 'hlgt_code' # column label for DL4 term code

ont_DL5_DL4_data = 'SOC_HLGT.xlsx' # Input filename for ontology file defining connections between DL4 and DL5 term codes. 
ont_DL5_data = 'SOC.xlsx' # Input filename for ontology file defining all DL5 terms and their codes.
ont_DL5_name_col = 'soc_name' # column label for DL5 term name
ont_DL5_code_col = 'soc_code' # column label for DL5 term code

DL1_FT1 = 'MHTERM' # focus term 1: This term is available over all studies. 
DL1_FT2 = 'LLT_NAME' # focus term 2: an alternative term available for a subset of studies. This one supposedly conforms to the MedDRA standard so we expect it to yield more exact matches. May offer a better match for the lowest level of the standardized terminology.
DL1_FT3 = 'MHMODIFY' # focus term 3: an alternative term available for a subset of studies. May offer a better match for the lowest level of the standardized terminology.

DL2 = 'PT_NAME' # Secondary level terms (i.e. more general than DL1 terms)
DL3 = 'HLT_NAME' # Tertiary level terms (i.e. more general than DL2 terms)
DL4 = 'HLGT_NAME' # Quaternary level terms (i.e. more general than DL3 terms)
DL5 = 'SOC_NAME' # Quinary Level terms (i.e. more general than DL4 terms)

TL1_qual_code_header = 'LLT_map_code' # column name for lowest term level mapping quality code (added to mapping file)
TL1_name_header = 'T_LLT' # column name for the 'mapped' TL1 - term name (added to mapping file)
TL1_code_header = 'T_LLT_CODE' # column name for the 'mapped' TL1 - term code (added to mapping file)
TL2_name_header = 'T_PT'
TL2_code_header = 'T_PT_CODE'
TL3_name_header = 'T_HLT'
TL3_code_header = 'T_HLT_CODE'
TL4_name_header = 'T_HLGT'
TL4_code_header = 'T_HLGT_CODE'
TL5_name_header = 'T_SOC'
TL5_code_header = 'T_SOC_CODE'

FZ1_FT1 = 'FZMatch_1_'+DL1_FT1 # column name for best FT1 fuzzy match (temporarily added to mapping file)
FZ2_FT1 = 'FZMatch_2_'+DL1_FT1 # column name for second best FT1 fuzzy match (temporarily added to mapping file)
FZ3_FT1 = 'FZMatch_3_'+DL1_FT1 # column name for third best FT1 fuzzy match (temporarily added to mapping file)
FZ4_FT1 = 'FZMatch_4_'+DL1_FT1 # column name for fourth best FT1 fuzzy match (temporarily added to mapping file)
FZ5_FT1 = 'FZMatch_5_'+DL1_FT1 # column name for fifth best FT1 fuzzy match (temporarily added to mapping file)

FZMC = 'FZMatch_Choice_ID_'+DL1_FT1 #column name for the column where manual annotator will enter the number (1-5) indicating the FT1 fuzzy matched term that offers the best match (if a good one is identified)
FZCT = 'FZMatch_Copied_Term' #column name for the column where manual annotator can alternatively manually copy in the MedDRA LLT term that best matches the term information in this row (can come from FT2 or FT3 if term was not identified in FT1)

FZ1_FT2 = 'FZMatch_1_'+DL1_FT2 # column name for best FT2 fuzzy match (temporarily added to mapping file)
FZ2_FT2 = 'FZMatch_2_'+DL1_FT2 # column name for second best FT2 fuzzy match (temporarily added to mapping file)
FZ3_FT2 = 'FZMatch_3_'+DL1_FT2 # column name for third best FT2 fuzzy match (temporarily added to mapping file)
FZ4_FT2 = 'FZMatch_4_'+DL1_FT2 # column name for fourth best FT2 fuzzy match (temporarily added to mapping file)
FZ5_FT2 = 'FZMatch_5_'+DL1_FT2 # column name for fifth best FT2 fuzzy match (temporarily added to mapping file)

FZ1_FT3 = 'FZMatch_1_'+DL1_FT3 # column name for best FT3 fuzzy match (temporarily added to mapping file)
FZ2_FT3 = 'FZMatch_2_'+DL1_FT3 # column name for second best FT3 fuzzy match (temporarily added to mapping file)
FZ3_FT3 = 'FZMatch_3_'+DL1_FT3 # column name for third best FT3 fuzzy match (temporarily added to mapping file)
FZ4_FT3 = 'FZMatch_4_'+DL1_FT3 # column name for fourth best FT3 fuzzy match (temporarily added to mapping file)
FZ5_FT3 = 'FZMatch_5_'+DL1_FT3 # column name for fifth best FT3 fuzzy match (temporarily added to mapping file)

TL3_B = 'HLT_branches'
TL3_BQ = 'HLT_branch_quality'
TL3_FZT = 'HLT_Fuzzy_Terms'
TL3_FZS = 'HLT_Fuzzy_Scores'

TL4_B = 'HLGT_branches'
TL4_BQ = 'HLGT_branch_quality'
TL4_FZT = 'HLGT_Fuzzy_Terms'
TL4_FZS = 'HLGT_Fuzzy_Scores'

TL5_B = 'SOC_branches'
TL5_BQ = 'SOC_branch_quality'
TL5_FZT = 'SOC_Fuzzy_Terms'
TL5_FZS = 'SOC_Fuzzy_Scores'

### Load map File

In [9]:
#Load the working map (comma-separated) file into a pandas data frame
target_map_file = 'MH_harmonization_map_18_TL4_Fuzzy.csv' #Input filename (CSV working map file)
td = pd.read_csv(target_map_file, na_values=' ') #Data loaded so that blank cells are 'NA'
td.shape

(28720, 27)

### Load 5th Level Terminology Standard File

In [10]:
tl5 = pd.read_excel(ont_DL5_data, na_values=' ') # 'sep' is not a read_excel option, so it is omitted here
tl5.shape

(27, 10)

### Load 4th to 5th Level Terminology Connection File

In [11]:
tl5_tl4 = pd.read_excel(ont_DL5_DL4_data, na_values=' ') # 'sep' is not a read_excel option, so it is omitted here
tl5_tl4.shape

(354, 2)

***
## Insert New Columns for Hierarchy Imputation
Insert the columns needed for the fifth level of hierarchy imputation (i.e. SOC) into the mapping file. For level 5 we again add columns to hold the branch possibilities, the quality of our branch selection, and the results of any fuzzy branch matching. 


*To adapt this code to other tasks, users may need to specify different column indexes below. We place these new columns after the original data columns.*

In [12]:
td.insert(loc=27,column=TL5_name_header,value='NA') 
td.insert(loc=28,column=TL5_code_header,value='NA') 

td.insert(loc=29,column=TL5_B,value='NA') 
td.insert(loc=30,column=TL5_BQ,value='NA') 
td.insert(loc=31,column=TL5_FZT,value='NA') 
td.insert(loc=32,column=TL5_FZS,value='NA') 
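
The hard-coded positions above (27 through 32) assume the mapping file has exactly 27 columns when it is loaded. As a minimal sketch of one way to relax that assumption, the insert position can be computed from the current column count. The toy data frame and the helper name `append_columns` are hypothetical and used only for illustration.

```python
import pandas as pd

def append_columns(df, column_names, fill_value='NA'):
    """Insert each new column just past the current last column, in order."""
    for name in column_names:
        # len(df.columns) is always the position after the last column
        df.insert(loc=len(df.columns), column=name, value=fill_value)
    return df

# Toy stand-in for the mapping file (made-up values)
toy = pd.DataFrame({'T_HLGT': ['Example HLGT'], 'T_HLGT_CODE': [1]})
append_columns(toy, ['T_SOC', 'T_SOC_CODE', 'SOC_branches'])
print(list(toy.columns))
```

With this approach the same cell works unchanged if an earlier step adds or removes a column.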

## Define a method that splits a list of (term, score) pairs into a list of terms and a list of scores

In [13]:
def listify(inList):
    nameList = []
    scoreList = []
    for each in inList:
        nameList.append(each[0])
        scoreList.append(each[1])
    return nameList, scoreList
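
As a short usage sketch: fuzzywuzzy's `process.extract` returns a list of `(term, score)` tuples when given a list of choices, and `listify` separates those into two parallel lists. The pairs below are written by hand with made-up scores so the example runs without fuzzywuzzy installed.

```python
def listify(inList):
    nameList = []
    scoreList = []
    for each in inList:
        nameList.append(each[0])
        scoreList.append(each[1])
    return nameList, scoreList

# Hand-written pairs in the (term, score) shape; the scores are invented
pairs = [('Cardiac disorders', 90), ('Vascular disorders', 62)]
names, scores = listify(pairs)
print(names)   # ['Cardiac disorders', 'Vascular disorders']
print(scores)  # [90, 62]
```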

***
## Impute TL5 from TL4 (Branches Possible)
In this application we impute SOCs (i.e. TL5) from previously imputed HLGTs (TL4). 

Before completing this task we lay out a coding scheme to describe the quality/confidence of any branch selection for this entire imputation procedure. These codes for TL5 will be entered into the column with the 'TL5_BQ' label. We have developed a custom coding scheme to suit the needs of our target application:

* 0 = No branching: Only one possible term available for imputation - best quality implied. 
* 1 = Branching: Branch selected based on an exact match with the available DL5 term. 
* 2 = Branching: DL5 term available, fuzzy matching applied, top scoring fuzzy match chosen/confirmed. 
* 3 = Branching: DL5 term available, fuzzy matching applied, non-top scoring fuzzy match chosen/confirmed.
* 4 = Branching: no DL5 term available - No fuzzy matching available. Picked first branch by default. 
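
One way to keep these codes readable in code is a small enumeration. This is a sketch only: the class and member names below are hypothetical and do not appear in these notebooks, which store the bare integers in the 'TL5_BQ' column.

```python
from enum import IntEnum

class BranchQuality(IntEnum):
    NO_BRANCHING = 0    # only one possible term available
    EXACT_MATCH = 1     # branch chosen by exact match
    FUZZY_TOP = 2       # top scoring fuzzy match chosen/confirmed
    FUZZY_NOT_TOP = 3   # non-top scoring fuzzy match chosen/confirmed
    DEFAULT_FIRST = 4   # no term data available; first branch picked

print(int(BranchQuality.EXACT_MATCH))  # 1
print(BranchQuality(4).name)           # DEFAULT_FIRST
```

Because `IntEnum` members compare equal to plain integers, values read back from the CSV can still be compared against them directly.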

In [14]:
data_count = 0
total_branched = 0
total_direct = 0

for each in tqdm_notebook(td[TL4_code_header], desc='1st loop'): #for each row
    if not pd.isna(each): #Check for missing value

        #Identify the TL4 code in the MedDRA SOC_HLGT file
        tempList = tl5_tl4[ont_DL4_code_col].tolist() # the column name reference is specific to the loaded MedDRA file
        indexList = [i for i,val in enumerate(tempList) if val==each]

        #Put all connected TL5 codes in TL5_B separated by an underscore
        TL5_list = []
        TL5_str = ''
        for i in indexList:
            TL5_list.append(tl5_tl4[ont_DL5_code_col][i]) # the column name reference is specific to the loaded MedDRA file
            TL5_str += str(tl5_tl4[ont_DL5_code_col][i])+'_' # the column name reference is specific to the loaded MedDRA file

        #Branch Reporting
        if len(TL5_list) > 1: # is there more than one branch 
            total_branched += 1
            td[TL5_B][data_count] = TL5_str

        else: #only one branch found
            total_direct += 1
            #Identify term for single code
            tempList = tl5[ont_DL5_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            tempIndex = tempList.index(TL5_list[0]) #find index/location of code in tl5 set of terms
            term = tl5[ont_DL5_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file

            td[TL5_name_header][data_count] = term
            td[TL5_code_header][data_count] = TL5_list[0]
            td[TL5_BQ][data_count] = 0 # Top Branch Quality

    data_count +=1 

print("Directly Imputed: " +str(total_direct))
print("With Branches: " +str(total_branched))


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Directly Imputed: 28175
With Branches: 515
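
The loop above rescans the connection table for every row and writes results through chained indexing (`td[col][row] = ...`), which is what triggers the pandas warning in the output. As a hedged sketch, the same branch lookup can be built once with `groupby` and the results written with `.loc`. The toy tables and code values are invented for illustration, and the column names are assumed to match those defined earlier in this notebook.

```python
import pandas as pd

# Toy connection table: HLGT code 10 has one SOC parent, code 20 has two
tl5_tl4 = pd.DataFrame({'soc_code': [1, 1, 2], 'hlgt_code': [10, 20, 20]})
td = pd.DataFrame({'T_HLGT_CODE': [10, 20]})
# Object dtype so that either strings or integer codes can be stored
td['T_SOC_CODE'] = pd.Series('NA', index=td.index, dtype=object)
td['SOC_branches'] = pd.Series('NA', index=td.index, dtype=object)

# Build the HLGT -> [SOC codes] lookup once
branches = tl5_tl4.groupby('hlgt_code')['soc_code'].apply(list).to_dict()

for row, hlgt in td['T_HLGT_CODE'].items():
    socs = branches.get(hlgt, [])
    if len(socs) == 1:
        td.loc[row, 'T_SOC_CODE'] = socs[0]  # direct imputation
    elif len(socs) > 1:
        # same underscore-terminated format as the loop above
        td.loc[row, 'SOC_branches'] = ''.join(str(s) + '_' for s in socs)

print(td)
```

Unlike the loop above, this sketch leaves a row untouched when its HLGT code has no entry in the connection table, rather than raising an error.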


## Save file in progress

In [15]:
td.to_csv("MH_harmonization_map_19_TL5.csv", header=True, index=False)  

### Exact Matching of TL5 Branches
This process focuses on rows whose TL5 has multiple branches and that also have DL5 term data available, which allows exact matching. For any row with branching but no DL5 data, we pick the first branch by default and enter a TL5_BQ of 4 (lowest quality branch resolution).

In [16]:
data = 'MH_harmonization_map_19_TL5.csv'
td = pd.read_csv(data, na_values=' ')
td.shape

data_count = 0
total_exact_matched = 0
total_data_no_match = 0
total_no_data = 0

for each in tqdm_notebook(td[TL5_B], desc='1st loop'): #for each row
    if not pd.isna(td[TL1_name_header][data_count]): #Check for missing value
        if td[TL5_BQ][data_count] != 0: # Focus on terms with multiple branches

            #Create list of branch codes
            codeList = each.strip('_').split('_')
            #Get MedDRA terms for each code in branch list - put in a list of terms
            textList = [] #text for corresponding branch terms
            tempList = tl5[ont_DL5_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            for code in codeList:
                #Idenfity the TL5 code in the MedDRA SOC file
                tempIndex = tempList.index(int(code))

                #Put the corresponding TL5 term into the list
                term = tl5[ont_DL5_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
                textList.append(str(term))

            #Exact Matching
            if not pd.isna(td[DL5][data_count]): #See if row has original DL5 data available.
                #Check for branch exact match
                term_count = 0
                matched = False
                for term in textList: #each branch term
                    if str(term).casefold() == str(td[DL5][data_count]).casefold(): 
                        td[TL5_name_header][data_count] = term # Map matching term
                        td[TL5_code_header][data_count] = codeList[term_count] # Map matching term code (taken from this row's branch code list)
                        td[TL5_BQ][data_count] = 1 # Exact Match Branch Quality
                        total_exact_matched += 1
                        matched = True
                    term_count +=1
                if not matched:
                    total_data_no_match += 1

            else: # no DL5 data
                #Pick first branch by default.
                td[TL5_name_header][data_count] = textList[0] # Map first branch term
                td[TL5_code_header][data_count] = codeList[0] # Map first branch term code (taken from this row's branch code list)
                td[TL5_BQ][data_count] = 4 # First branch picked by default - lowest quality branch resolution code. 
                total_no_data += 1
        
    data_count +=1 

print("Branches Resolved with Exact Match: " +str(total_exact_matched))
print("Branches with Data but No Exact Match: " +str(total_data_no_match))  
print("Branches with no Data: " +str(total_no_data))  

(28720, 33)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Branches Resolved with Exact Match: 315
Branches with Data but No Exact Match: 100
Branches with no Data: 100
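
The exact-matching step compares the row's DL5 term with each branch term after case folding. A minimal standalone sketch of that comparison follows. The helper name `exact_branch_match` is hypothetical, and unlike the loop above (which keeps the last matching branch), this version returns the first match.

```python
def exact_branch_match(study_term, branch_terms):
    """Return the index of the first branch term equal to study_term,
    ignoring case, or None if no branch term matches."""
    wanted = str(study_term).casefold()
    for i, term in enumerate(branch_terms):
        if str(term).casefold() == wanted:
            return i
    return None

branches = ['Cardiac disorders', 'Vascular disorders']
print(exact_branch_match('VASCULAR DISORDERS', branches))  # 1
print(exact_branch_match('Vascular disorder', branches))   # None
```

A `None` result corresponds to the "Branches with Data but No Exact Match" count, which is then handled by fuzzy matching.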


## Save file in progress

In [17]:
td.to_csv("MH_harmonization_map_20_TL5.csv", header=True, index=False) 

### Fuzzy Matching of TL5 Branches
This process focuses on rows whose TL5 branches were not resolved by exact matching but that have DL5 terms available, which allows fuzzy matching.

In [20]:
data = 'MH_harmonization_map_20_TL5.csv'
td = pd.read_csv(data, na_values=' ')
td.shape

data_count = 0
total_fuzzy_matched = 0
total_no_data = 0

for each in tqdm_notebook(td[TL5_B], desc='1st loop'): #for each row
    if not pd.isna(td[TL1_name_header][data_count]): #Check for missing value
        if pd.isna(td[TL5_BQ][data_count]): #haven't previously assigned branch quality a code (i.e. of 0, 1, or 4) (i.e. multiple branches, no exact match, and DL5 data available)

            #Create list of branch codes
            codeList = each.strip('_').split('_')
            #Get MedDRA terms for each code in branch list - put in a list of terms
            textList = [] #text for corresponding branch terms
            tempList = tl5[ont_DL5_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            for code in codeList:
                #Idenfity the TL5 code in the MedDRA SOC file
                tempIndex = tempList.index(int(code))

                #Put the corresponding TL5 term into the list
                term = tl5[ont_DL5_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
                textList.append(str(term))

            #Perform fuzzy matching between row's DL5 (SOC) term and branch term list
            bestFound = process.extract(td[DL5][data_count],textList)
            nameList,scoreList = listify(bestFound)

            # Add results of fuzzy matching
            td[TL5_FZT][data_count] = str(nameList)
            td[TL5_FZS][data_count] = str(scoreList)  

            total_fuzzy_matched += 1
        else:
            total_no_data += 1
    data_count += 1
print("Branches with Applied Fuzzy Matching: " +str(total_fuzzy_matched))
print("Branches with no Data: " +str(total_no_data))  

(28720, 33)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



Branches with Applied Fuzzy Matching: 100
Branches with no Data: 28590
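
To illustrate the ranking idea without requiring fuzzywuzzy, the sketch below scores each branch term against the row's term and sorts best first, producing `(term, score)` pairs in the same shape that `listify` expects. It uses the standard library's `difflib.SequenceMatcher` as a stand-in, so the scores are not the ones fuzzywuzzy would report. The helper name `rank_branches` is hypothetical.

```python
from difflib import SequenceMatcher

def rank_branches(study_term, branch_terms):
    """Rank branch terms by similarity to study_term, best first,
    as (term, score) pairs with scores on a 0-100 scale."""
    scored = []
    for term in branch_terms:
        ratio = SequenceMatcher(None, study_term.casefold(), term.casefold()).ratio()
        scored.append((term, round(ratio * 100)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_branches('vascular disorder',
                       ['Cardiac disorders', 'Vascular disorders'])
print(ranked[0][0])  # Vascular disorders
```

The ranked list is only a suggestion for the annotator. The final branch choice is still confirmed manually, as described in the next section.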


***
## Save working map file
Before moving on to the next step of the harmonization pipeline we will save our mapping file 'in progress'. Since a row index was previously added, we will set index to 'False' below.  

In [21]:
td.to_csv("MH_harmonization_map_21_TL5.csv", header=True, index=False)  

## Manual annotation of fuzzy matched branch ranking
At this point we manually created a modified version of 'MH_harmonization_map_21_TL5.csv' called 'MH_harmonization_map_21_TL5_Fuzzy.csv'. In this file the manual annotation is completed: an expert double-checks the fuzzy matches and confirms the best branch by adding the name of the branch to the term column (i.e. TL5_name_header). We did not create a subset of the original dataset here (as we did when resolving TL3) because there were relatively few manual annotations to complete.

## Identify and save term codes for all remaining TL5 terms

In [24]:
data = 'MH_harmonization_map_21_TL5_Fuzzy.csv'
td = pd.read_csv(data, na_values=' ')
td.shape

count = 0
for each in tqdm_notebook(td[TL5_code_header], desc='1st loop'): #for each row
    if not pd.isna(td[TL1_name_header][count]): #Check for missing value
        
        if pd.isna(each): 
            #Get MedDRA terms for each code in branch list - put in a list of terms
            textList = [] #text for corresponding branch terms
            tempList = tl5[ont_DL5_name_col].tolist() # the column name reference is specific to the loaded MedDRA file

            #Identify the TL5 term in the MedDRA SOC file
            tempIndex = tempList.index(td[TL5_name_header][count])

            #Put the corresponding TL5 code into the list
            code = tl5[ont_DL5_code_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
            td[TL5_code_header][count] = int(code)
            
    count += 1

(28720, 33)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
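
The lookup above uses `list.index`, which raises a `ValueError` if a manually entered term does not match a MedDRA name exactly. As a hedged alternative sketch, a name-to-code dictionary can be built once and used to fill only the missing codes, leaving unmatched names as missing so they can be reviewed. The toy values (including the codes) are invented, and the column names are assumed to match those defined earlier.

```python
import pandas as pd

tl5 = pd.DataFrame({'soc_name': ['Cardiac disorders', 'Vascular disorders'],
                    'soc_code': [1, 2]})  # toy codes, not real MedDRA codes
td = pd.DataFrame({'T_SOC': ['Cardiac disorders', 'Vascular disorders',
                             'Vascular disordrs'],  # last name is a deliberate typo
                   'T_SOC_CODE': [1.0, None, None]})

name_to_code = dict(zip(tl5['soc_name'], tl5['soc_code']))

# Fill only the missing codes; names absent from the dictionary stay NaN
td['T_SOC_CODE'] = td['T_SOC_CODE'].fillna(td['T_SOC'].map(name_to_code))

unmatched = td[td['T_SOC_CODE'].isna()]
print(td)
print(len(unmatched))  # 1
```

Rows left in `unmatched` would point to annotation typos that need correcting before the file is saved.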





***
## Save working map file
Before moving on to the next step of the harmonization pipeline we will save our mapping file 'in progress'. Since a row index was previously added, we will set index to 'False' below.  

In [25]:
td.to_csv("MH_harmonization_map_22_TL5.csv", header=True, index=False)  

## Notebook conclusions
In this notebook we have mapped all TL5 names and codes, this time resolving any possible term branches using available term information in the dataset.  This was completed with exact and fuzzy matching similar to the first phase of TL1 mapping. 

In the next notebook we map all unique rows back to the original target dataset from the first notebook, and then generate a complete summary of the mapping results at all ontological levels. 