# Term Harmonize - STEP 5 : Impute Hierarchy (TL2 and TL3)
#### Author: Ryan Urbanowicz (ryanurb@upenn.edu) 
#### Institution: University of Pennsylvania - Perelman School of Medicine
#### Project: CMREF Data Harmonization 
#### Date: 7/18/19

#### Project Overview:
See the first notebook in this series ('Step_1_Term_Harmonize_Data_Preparation.ipynb') for an overview of this project, these notebooks, the target application, data availability, code dependencies, and our strategy for generalizing the code in these notebooks. 

#### Notebook Summary:
This notebook loads the working mapping file produced after LLT mapping (which we assume has been fully completed). It then performs the first two levels of term hierarchy imputation. In the current target application this involves imputing the PTs from the LLTs, and then imputing the HLTs from the PTs. PT imputation is direct: no alternative branches are possible, so this task can be fully automated. HLT imputation, however, can involve branching, so further manual, subjective annotation will be required after running this notebook. Whenever a row includes relevant term data, we apply exact and then fuzzy matching to help select the most appropriate of the available HLT branches. When this term data is not available, we pick the first of the possible branches by default. Branch 'quality' is tracked much like term mapping quality. 

***
## Load Python packages required in this notebook

In [1]:
#Load necessary packages.
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Import Progress bar
from tqdm import tnrange, tqdm_notebook

***
## Load Working Map File and Relevant Ontology Files


#### Create general variable names for any target application specific values.

In [2]:
# Input filename for 'target dataset' (excel file loaded in this application)
target_study_data = 'Combined_MEDHX_TERMS_20studies.xlsx' 

ont_DL1_data = 'LLT.xlsx' # Input filename for ontology file defining all DL1 terms and their codes. 
ont_DL1_name_col = 'llt_name' # column label for DL1 term name
ont_DL1_code_col ='llt_code' # column label for DL1 term code
ont_DL1_cur_col = 'llt_currency' # column label for term currency
ont_DL2_data = 'PT.xlsx' # Input filename for ontology file defining all DL2 terms and their codes. 
ont_DL2_name_col = 'pt_name' # column label for DL2 term name
ont_DL2_code_col = 'pt_code' # column label for DL2 term code
ont_DL3_DL2_data = 'HLT_PT.xlsx' # Input filename for ontology file defining connections between DL2 and DL3 term codes. 
ont_DL3_data = 'HLT.xlsx' # Input filename for ontology file defining all DL3 terms and their codes.
ont_DL3_name_col = 'hlt_name' # column label for DL3 term name
ont_DL3_code_col = 'hlt_code' # column label for DL3 term code

ont_DL4_DL3_data = 'HLGT_HLT.xlsx' # Input filename for ontology file defining connections between DL3 and DL4 term codes. 
ont_DL4_data = 'HLGT.xlsx' # Input filename for ontology file defining all DL4 terms and their codes.
ont_DL5_DL4_data = 'SOC_HLGT.xlsx' # Input filename for ontology file defining connections between DL4 and DL5 term codes. 
ont_DL5_data = 'SOC.xlsx' # Input filename for ontology file defining all DL5 terms and their codes.

DL1_FT1 = 'MHTERM' # focus term 1: This term is available over all studies. 
DL1_FT2 = 'LLT_NAME' # focus term 2: an alternative term available for a subset of studies. This one supposedly conforms to the MedDRA standard so we expect it to yield more exact matches. May offer a better match for the lowest level of the standardized terminology.
DL1_FT3 = 'MHMODIFY' # focus term 3: an alternative term available for a subset of studies. May offer a better match for the lowest level of the standardized terminology.

DL2 = 'PT_NAME' # Secondary level terms (i.e. more general than DL1 terms)
DL3 = 'HLT_NAME' # Tertiary level terms (i.e. more general than DL2 terms)
DL4 = 'HLGT_NAME' # Quaternary level terms (i.e. more general than DL3 terms)
DL5 = 'SOC_NAME' # Quinary Level terms (i.e. more general than DL4 terms)

TL1_qual_code_header = 'LLT_map_code' # column name for lowest term level mapping quality code (added to mapping file)
TL1_name_header = 'T_LLT' # column name for the 'mapped' TL1 - term name (added to mapping file)
TL1_code_header = 'T_LLT_CODE' # column name for the 'mapped' TL1 - term code (added to mapping file)
TL2_name_header = 'T_PT'
TL2_code_header = 'T_PT_CODE'
TL3_name_header = 'T_HLT'
TL3_code_header = 'T_HLT_CODE'
TL4_name_header = 'T_HLGT'
TL4_code_header = 'T_HLGT_CODE'
TL5_name_header = 'T_SOC'
TL5_code_header = 'T_SOC_CODE'

FZ1_FT1 = 'FZMatch_1_'+DL1_FT1 # column name for best FT1 fuzzy match (temporarily added to mapping file)
FZ2_FT1 = 'FZMatch_2_'+DL1_FT1 # column name for second best FT1 fuzzy match (temporarily added to mapping file)
FZ3_FT1 = 'FZMatch_3_'+DL1_FT1 # column name for third best FT1 fuzzy match (temporarily added to mapping file)
FZ4_FT1 = 'FZMatch_4_'+DL1_FT1 # column name for fourth best FT1 fuzzy match (temporarily added to mapping file)
FZ5_FT1 = 'FZMatch_5_'+DL1_FT1 # column name for fifth best FT1 fuzzy match (temporarily added to mapping file)

FZMC = 'FZMatch_Choice_ID_'+DL1_FT1 #column name for the column where manual annotator will enter the number (1-5) indicating the FT1 fuzzy matched term that offers the best match (if a good one is identified)
FZCT = 'FZMatch_Copied_Term' #column name for the column where manual annotator can alternatively manually copy in the MedDRA LLT term that best matches the term information in this row (can come from FT2 or FT3 if term was not identified in FT1)

FZ1_FT2 = 'FZMatch_1_'+DL1_FT2 # column name for best FT2 fuzzy match (temporarily added to mapping file)
FZ2_FT2 = 'FZMatch_2_'+DL1_FT2 # column name for second best FT2 fuzzy match (temporarily added to mapping file)
FZ3_FT2 = 'FZMatch_3_'+DL1_FT2 # column name for third best FT2 fuzzy match (temporarily added to mapping file)
FZ4_FT2 = 'FZMatch_4_'+DL1_FT2 # column name for fourth best FT2 fuzzy match (temporarily added to mapping file)
FZ5_FT2 = 'FZMatch_5_'+DL1_FT2 # column name for fifth best FT2 fuzzy match (temporarily added to mapping file)

FZ1_FT3 = 'FZMatch_1_'+DL1_FT3 # column name for best FT3 fuzzy match (temporarily added to mapping file)
FZ2_FT3 = 'FZMatch_2_'+DL1_FT3 # column name for second best FT3 fuzzy match (temporarily added to mapping file)
FZ3_FT3 = 'FZMatch_3_'+DL1_FT3 # column name for third best FT3 fuzzy match (temporarily added to mapping file)
FZ4_FT3 = 'FZMatch_4_'+DL1_FT3 # column name for fourth best FT3 fuzzy match (temporarily added to mapping file)
FZ5_FT3 = 'FZMatch_5_'+DL1_FT3 # column name for fifth best FT3 fuzzy match (temporarily added to mapping file)

TL3_B = 'HLT_branches'
TL3_BQ = 'HLT_branch_quality'
TL3_FZT = 'HLT_Fuzzy_Terms'
TL3_FZS = 'HLT_Fuzzy_Scores'

TL4_B = 'HLGT_branches'
TL4_BQ = 'HLGT_branch_quality'
TL4_FZT = 'HLGT_Fuzzy_Terms'
TL4_FZS = 'HLGT_Fuzzy_Scores'

TL5_B = 'SOC_branches'
TL5_BQ = 'SOC_branch_quality'
TL5_FZT = 'SOC_Fuzzy_Terms'
TL5_FZS = 'SOC_Fuzzy_Scores'

### Load mapping file

In [3]:
#Load target (comma-delimited) file into a pandas data frame
target_map_file = 'MH_harmonization_map_9_Final.csv' #Input filename (csv file loaded in this application)
td = pd.read_csv(target_map_file, na_values=' ') #Data loaded so that blank cells are 'NA'
td.shape

(28720, 28)

### Load Lowest Level Terminology Standard File

In [14]:
tl1 = pd.read_excel(ont_DL1_data, na_values=' ') # read_excel takes no 'sep' argument; delimiters do not apply to Excel files
tl1.shape

#Filter out any non-current low level terms (LLTs) 
tl1 = tl1.loc[tl1[ont_DL1_cur_col] == 'Y'] #column name is application specific.
tl1.shape
#Readjusts the row index values so there are no gaps in the sequence from the row removal (important for indexing later) 
tl1 = tl1.reset_index(drop=True) 

(78808, 11)

(69531, 11)

### Load 2nd Level Terminology Standard File

In [5]:
tl2 = pd.read_excel(ont_DL2_data, na_values=' ') # read_excel takes no 'sep' argument; delimiters do not apply to Excel files
tl2.shape

(23088, 11)

### Load 3rd Level Terminology Standard File

In [6]:
tl3 = pd.read_excel(ont_DL3_data, na_values=' ') # read_excel takes no 'sep' argument; delimiters do not apply to Excel files
tl3.shape

(1737, 9)

### Load 2nd to 3rd Level Terminology Connection File

In [7]:
tl3_tl2 = pd.read_excel(ont_DL3_DL2_data, na_values=' ') # read_excel takes no 'sep' argument; delimiters do not apply to Excel files
tl3_tl2.shape

(33402, 2)

***
## Insert New Columns for Hierarchy Imputation
Insert the columns needed for the first two levels of the hierarchy imputation (i.e. PT and HLT) into the mapping file. For level 3 (i.e. HLT) we also add columns to handle the branch possibilities, the quality of our branch selection, and the results from any fuzzy branch matching. 


*To adapt this code to other tasks, users may need to specify different column indexes below. We place these new columns after the original data columns.*

In [8]:
td.insert(loc=28,column=TL2_name_header,value='NA') 
td.insert(loc=29,column=TL2_code_header,value='NA') 
td.insert(loc=30,column=TL3_name_header,value='NA') 
td.insert(loc=31,column=TL3_code_header,value='NA') 

td.insert(loc=32,column=TL3_B,value='NA') 
td.insert(loc=33,column=TL3_BQ,value='NA') 
td.insert(loc=34,column=TL3_FZT,value='NA') 
td.insert(loc=35,column=TL3_FZS,value='NA') 

***
## Delete unnecessary columns
Remove the fuzzy match result columns from the last notebook run.  These are not needed moving forward in the map file. 


In [10]:
td.shape
#Drop FT1 fuzzy matching terms
td = td.drop([FZ1_FT1,FZ2_FT1,FZ3_FT1,FZ4_FT1,FZ5_FT1], axis=1)

#Drop FT2 fuzzy matching terms
td = td.drop([FZ1_FT2,FZ2_FT2,FZ3_FT2,FZ4_FT2,FZ5_FT2], axis=1)

#Drop FT3 fuzzy matching terms
td = td.drop([FZ1_FT3,FZ2_FT3,FZ3_FT3,FZ4_FT3,FZ5_FT3], axis=1)
td.shape

(28720, 36)

(28720, 21)

## Define a method that splits a list of (name, score) fuzzy-match pairs into separate name and score lists

In [8]:
def listify(inList):
    nameList = []
    scoreList = []
    for each in inList:
        nameList.append(each[0])
        scoreList.append(each[1])
    return nameList, scoreList
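
As a quick illustration of the input `listify` expects: fuzzywuzzy's `process.extract` returns a list of `(choice, score)` tuples when it is given a list of choices, and `listify` splits that list into two parallel lists. The tuples below are made-up placeholder values, not real output from this dataset.

```python
def listify(inList):
    """Split a list of (name, score) pairs into two parallel lists."""
    nameList = []
    scoreList = []
    for each in inList:
        nameList.append(each[0])
        scoreList.append(each[1])
    return nameList, scoreList

# Hypothetical (term, score) pairs in the shape returned by process.extract
example_matches = [("Term A", 90), ("Term B", 72), ("Term C", 55)]
names, scores = listify(example_matches)
print(names)
print(scores)
```

The two lists stay aligned by position, which is what lets the notebook store them as matching strings in the fuzzy term and fuzzy score columns.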

***
## Impute TL2 from TL1 (Direct mapping)
In this application we impute PTs (i.e. TL2) from previously mapped LLTs (TL1). Terms are directly linked, so the mapping is deterministic (i.e. no alternative branches exist).  This step is fully automated. 
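
The direct lookup can be sketched with plain dictionaries. This is a minimal illustration only: the codes and names are invented placeholders rather than MedDRA content, and the helper `impute_tl2` is a hypothetical name. Building each dictionary once avoids searching a list on every row.

```python
# Placeholder ontology rows (invented codes and names, not real MedDRA entries)
llt_rows = [
    {"llt_code": 101, "pt_code": 11},
    {"llt_code": 102, "pt_code": 11},
    {"llt_code": 103, "pt_code": 12},
]
pt_rows = [
    {"pt_code": 11, "pt_name": "Example PT one"},
    {"pt_code": 12, "pt_name": "Example PT two"},
]

# Build each lookup once instead of searching a list on every row
llt_to_pt = {row["llt_code"]: row["pt_code"] for row in llt_rows}
pt_code_to_name = {row["pt_code"]: row["pt_name"] for row in pt_rows}

def impute_tl2(tl1_code):
    """Return (TL2 code, TL2 name) for a mapped TL1 code, or (None, None)."""
    if tl1_code is None:
        return None, None  # row was never mapped at TL1
    tl2_code = llt_to_pt[tl1_code]
    return tl2_code, pt_code_to_name[tl2_code]

mapped_tl1_codes = [101, None, 103]
imputed = [impute_tl2(code) for code in mapped_tl1_codes]
print(imputed)
```

Because every LLT links to exactly one PT, each lookup returns a single value and no branch handling is needed at this level.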

In [None]:
data_count = 0
for each in tqdm_notebook(td[TL1_code_header], desc='1st loop'): #for each row 
    if not pd.isna(each): #Check for missing value

        #Identify the TL1 code in the MedDRA LLT file
        tempList = tl1[ont_DL1_code_col].tolist() # the column name reference is specific to the loaded MedDRA file
        tempIndex = tempList.index(each) 

        #Put the corresponding TL2 code into the key
        TL2_code = tl1[ont_DL2_code_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
        td.loc[data_count, TL2_code_header] = TL2_code # .loc writes into td itself and avoids chained assignment

        #Identify the TL2 name from the MedDRA PT file
        tempList = tl2[ont_DL2_code_col].tolist() # the column name reference is specific to the loaded MedDRA file
        tempIndex = tempList.index(TL2_code) 

        #Put the corresponding TL2 name into the key
        TL2_name = tl2[ont_DL2_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
        td.loc[data_count, TL2_name_header] = TL2_name
    
    data_count +=1 


## Save file in progress

In [None]:
td.to_csv("MH_harmonization_map_10_TL2.csv", header=True, index=False)  

***
## Impute TL3 from TL2 (Branches Possible)
In this application we impute HLTs (i.e. TL3) from previously imputed PTs (TL2). 
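
Collecting the candidate branches amounts to grouping every HLT code under the PT code it is linked to. The sketch below assumes the connection file reduces to `(HLT code, PT code)` pairs; the values are invented placeholders and `branch_string` is a hypothetical helper. Unlike the notebook loop, it joins codes without a trailing underscore.

```python
from collections import defaultdict

# Placeholder (HLT code, PT code) links; invented values, not real MedDRA content
hlt_pt_links = [(900, 11), (901, 11), (902, 12)]

# Group every linked HLT code under its PT code, preserving file order
pt_to_hlts = defaultdict(list)
for hlt_code, pt_code in hlt_pt_links:
    pt_to_hlts[pt_code].append(hlt_code)

def branch_string(pt_code):
    """Join all candidate HLT codes with underscores."""
    return "_".join(str(code) for code in pt_to_hlts.get(pt_code, []))

for pt_code in (11, 12):
    candidates = pt_to_hlts[pt_code]
    kind = "branched" if len(candidates) > 1 else "direct"
    print(pt_code, kind, branch_string(pt_code))
```

A PT with a single linked HLT can be imputed immediately, while a PT with several candidates has to be resolved by the matching steps that follow.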

Before completing this task we lay out a coding scheme to describe the quality/confidence of any branch selection for this entire imputation procedure. These codes for TL3 will be entered into the column with the 'TL3_BQ' label. We have developed a custom coding scheme to suit the needs of our target application:

* 0 = No branching: Only one possible term available for imputation - best quality implied. 
* 1 = Branching: Branch selected based on an exact match with DL3 available term. 
* 2 = Branching: DL3 term available, fuzzy matching applied, top scoring fuzzy match chosen/confirmed. 
* 3 = Branching: DL3 term available, fuzzy matching applied, non-top scoring fuzzy match chosen/confirmed.
* 4 = Branching: no DL3 term available - No fuzzy matching available. Picked first branch by default. 
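
The coding scheme above can be written as a small decision function. This is only a sketch: `branch_quality` and its arguments are hypothetical names introduced here, and `chosen_rank` stands for the 1-based position of the confirmed branch in the fuzzy-match ranking.

```python
def branch_quality(n_branches, has_dl3_term, exact_match=False, chosen_rank=None):
    """Return the TL3 branch quality code (0-4) for one row."""
    if n_branches <= 1:
        return 0  # no branching
    if not has_dl3_term:
        return 4  # first branch picked by default
    if exact_match:
        return 1  # exact match against the row's DL3 term
    if chosen_rank is None:
        raise ValueError("a fuzzy-matched row needs a confirmed rank")
    return 2 if chosen_rank == 1 else 3  # top vs non-top fuzzy match

codes = [
    branch_quality(1, has_dl3_term=True),
    branch_quality(3, has_dl3_term=True, exact_match=True),
    branch_quality(3, has_dl3_term=True, chosen_rank=1),
    branch_quality(3, has_dl3_term=True, chosen_rank=2),
    branch_quality(3, has_dl3_term=False),
]
print(codes)
```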

In [10]:
data = 'MH_harmonization_map_10_TL2.csv'
td = pd.read_csv(data, na_values=' ')
td.shape

data_count = 0
total_branched = 0
total_direct = 0

for each in tqdm_notebook(td[TL2_code_header], desc='1st loop'): #for each row
    if not pd.isna(each): #Check for missing value
        
        #Identify the TL2 code in the MedDRA HLT_PT file
        tempList = tl3_tl2[ont_DL2_code_col].tolist() # the column name reference is specific to the loaded MedDRA file
        indexList = [i for i,val in enumerate(tempList) if val==each]

        #Put all connected TL3 codes in TL3_B separated by an underscore
        TL3_list = []
        TL3_str = ''
        for i in indexList:
            TL3_list.append(tl3_tl2[ont_DL3_code_col][i]) # the column name reference is specific to the loaded MedDRA file
            TL3_str += str(tl3_tl2[ont_DL3_code_col][i])+'_' # the column name reference is specific to the loaded MedDRA file

        #Branch Reporting
        if len(TL3_list) > 1: # is there more than one branch 
            total_branched += 1
            td.loc[data_count, TL3_B] = TL3_str # .loc avoids chained assignment

        else: #only one branch found
            total_direct += 1
            #Identify term for single code
            tempList = tl3[ont_DL3_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            tempIndex = tempList.index(TL3_list[0]) #find index/location of code in tl3 set of terms
            term = tl3[ont_DL3_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file

            td.loc[data_count, TL3_name_header] = term # .loc avoids chained assignment
            td.loc[data_count, TL3_code_header] = TL3_list[0]
            td.loc[data_count, TL3_BQ] = 0 # Top Branch Quality

    data_count +=1 

print("Directly Imputed: " +str(total_direct))
print("With Branches: " +str(total_branched))

(28720, 21)


Directly Imputed: 18597
With Branches: 10093


## Save file in progress

In [11]:
td.to_csv("MH_harmonization_map_11_TL3.csv", header=True, index=False)  

## Exact Matching of TL3 Branches
This process focuses on rows whose TL3 has multiple branches and that have a DL3 term available for exact matching. For any row with multiple branches but no DL3 data, we pick the first branch by default and enter a TL3_BQ of 4 (lowest quality branch resolution).
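
The per-row decision can be sketched as a standalone function. The names and data here are hypothetical placeholders, and the sketch returns the first case-insensitive match, which is equivalent to the notebook loop as long as the branch terms are distinct.

```python
def resolve_branch_exact(branch_codes, branch_terms, dl3_term):
    """Return (name, code, quality) for one branched row.

    quality 1 = exact match, 4 = default first branch, None = still unresolved.
    """
    if dl3_term is None:
        # No DL3 data: fall back to the first branch
        return branch_terms[0], branch_codes[0], 4
    for code, term in zip(branch_codes, branch_terms):
        if term.casefold() == dl3_term.casefold():
            return term, code, 1
    return None, None, None  # left for fuzzy matching

# Placeholder branch data (invented, not real MedDRA content)
codes = ["900", "901"]
terms = ["Example HLT alpha", "Example HLT beta"]

print(resolve_branch_exact(codes, terms, "EXAMPLE HLT BETA"))
print(resolve_branch_exact(codes, terms, None))
print(resolve_branch_exact(codes, terms, "something else"))
```

Rows that come back unresolved are the ones passed on to fuzzy matching and manual review.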

In [13]:
data_count = 0
total_exact_matched = 0
total_data_no_match = 0
total_no_data = 0

for each in tqdm_notebook(td[TL3_B], desc='1st loop'): #for each row
    if not pd.isna(td[TL1_name_header][data_count]): #Check for missing value
        
        if td[TL3_BQ][data_count] != 0: # Focus on terms with multiple branches

            #Create list of branch codes
            codeList = each.strip('_').split('_')
            #Get MedDRA terms for each code in branch list - put in a list of terms
            textList = [] #text for corresponding branch terms
            tempList = tl3[ont_DL3_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            for code in codeList:
                #Identify the TL3 code in the MedDRA HLT file
                tempIndex = tempList.index(int(code))

                #Put the corresponding TL3 term into the list
                term = tl3[ont_DL3_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
                textList.append(str(term))

            #Exact Matching
            if not pd.isna(td[DL3][data_count]): #See if row has original DL3 data available.
                #Check for branch exact match
                term_count = 0
                matched = False
                for term in textList: #each branch term
                    if str(term).casefold() == str(td[DL3][data_count]).casefold(): 
                        td.loc[data_count, TL3_name_header] = term # Map matching term
                        td.loc[data_count, TL3_code_header] = codeList[term_count] # Map matching term code
                        td.loc[data_count, TL3_BQ] = 1 # Exact Match Branch Quality
                        total_exact_matched += 1
                        matched = True
                    term_count +=1
                if not matched:
                    total_data_no_match += 1

            else: # no DL3 data
                #Pick first branch by default.
                td.loc[data_count, TL3_name_header] = textList[0] # Map first branch term
                td.loc[data_count, TL3_code_header] = codeList[0] # Map first branch term code
                td.loc[data_count, TL3_BQ] = 4 # First branch picked by default - lowest quality branch resolution code. 
                total_no_data += 1
        
    data_count +=1 

print("Branches Resolved with Exact Match: " +str(total_exact_matched))
print("Branches with Data but No Exact Match: " +str(total_data_no_match))  
print("Branches with no Data: " +str(total_no_data))  


Branches Resolved with Exact Match: 5013
Branches with Data but No Exact Match: 658
Branches with no Data: 4422


## Save file in progress

In [14]:
td.to_csv("MH_harmonization_map_12_TL3.csv", header=True, index=False)  

## Fuzzy Matching of TL3 Branches
This process focuses on rows whose TL3 has multiple branches that were not resolved by exact matching, but that have a DL3 term available for fuzzy matching.
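
To show the idea of ranking branches by similarity without depending on fuzzywuzzy, the sketch below swaps in `difflib.SequenceMatcher` from the standard library. This is a stand-in scorer, not the one fuzzywuzzy's `process.extract` applies, so its scores will differ from those stored by the notebook. The function name and the terms are hypothetical.

```python
from difflib import SequenceMatcher

def rank_branches(query, branch_terms, limit=5):
    """Rank branch terms by similarity to query, best first.

    Scores are difflib ratios scaled to 0-100 (a stand-in for fuzzywuzzy).
    """
    scored = []
    for term in branch_terms:
        ratio = SequenceMatcher(None, query.casefold(), term.casefold()).ratio()
        scored.append((term, round(100 * ratio)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

# Placeholder terms (invented, not real MedDRA content)
branches = ["Example renal term", "Example cardiac term"]
ranking = rank_branches("example cardiac terms", branches)
print(ranking)
```

The ranked `(term, score)` pairs have the same shape that `listify` expects, so an annotator sees the candidates in order of similarity together with their scores.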

In [18]:
data = 'MH_harmonization_map_12_TL3.csv'
td = pd.read_csv(data, na_values=' ')
td.shape

data_count = 0
total_fuzzy_matched = 0
total_no_data = 0

for each in tqdm_notebook(td[TL3_B], desc='1st loop'): #for each row
    if not pd.isna(td[TL1_name_header][data_count]): #Check for missing value
        
        if pd.isna(td[TL3_BQ][data_count]): #haven't previously assigned branch quality a code (i.e. of 0, 1, or 4) (i.e. multiple branches, no exact match, and DL3 available)

            #Create list of branch codes
            codeList = each.strip('_').split('_')
            #Get MedDRA terms for each code in branch list - put in a list of terms
            textList = [] #text for corresponding branch terms
            tempList = tl3[ont_DL3_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            for code in codeList:
                #Identify the TL3 code in the MedDRA HLT file
                tempIndex = tempList.index(int(code))

                #Put the corresponding TL3 term into the list
                term = tl3[ont_DL3_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
                textList.append(str(term))

            #Perform fuzzy matching between row's HLT and term list
            bestFound = process.extract(td[DL3][data_count],textList)
            nameList,scoreList = listify(bestFound)

            # Add results of fuzzy matching
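            td.loc[data_count, TL3_FZT] = str(nameList) # .loc avoids chained assignment
            td.loc[data_count, TL3_FZS] = str(scoreList)

            total_fuzzy_matched += 1
        else:
            # Row already has a branch quality code (0, 1, or 4), so it needs no fuzzy matching; this counter covers all such rows.
            total_no_data += 1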
            
    data_count += 1
    
print("Branches with Applied Fuzzy Matching: " +str(total_fuzzy_matched))
print("Branches with no Data: " +str(total_no_data))  

(28720, 21)


Branches with Applied Fuzzy Matching: 658
Branches with no Data: 28032


***
## Save working map file
Before moving on to the next step of the harmonization pipeline we will save our mapping file 'in progress'. Since a row index was previously added, we will set index to 'False' below.  

In [39]:
td.to_csv("MH_harmonization_map_13_TL3.csv", header=True, index=False)  

## Manual annotation of fuzzy matched branch ranking
At this point we manually created a modified version of 'MH_harmonization_map_13_TL3.csv' called 'MH_harmonization_map_13_TL3_Fuzzy.csv', in which we removed any row with a TL3 branch quality code of 0 or 1 (i.e. direct or exact branch matches) to make it easier for the manual annotator to focus on the rows that need to be checked. Next, manual annotation is completed: an expert double-checks the fuzzy matches and confirms the best branch by adding the name of that branch to the term column (i.e. TL3_name_header). The annotated file is named 'MH_harmonization_map_13_TL3_Fuzzy_NAN.csv'. 

## Integrate the manually annotated fuzzy branch matching file with the original direct/exact branch matching file
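- Load the original branch matching file and keep only the rows that did not go through fuzzy matching (i.e. rows with no entry in TL3_FZT). This preserves the direct and exact branch matches, along with the rows where the first branch was picked by default. 
- Load the manually annotated file and integrate it with the file above.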
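
Before saving the combined file it is worth checking that the two pieces do not overlap and that no rows were lost. The sketch below uses plain dictionaries in place of DataFrame rows so that it runs without the data files; `row_id`, `fuzzy_terms`, and `t_hlt` are hypothetical field names and all values are invented.

```python
# Placeholder rows; "fuzzy_terms" stands in for the HLT_Fuzzy_Terms column
all_rows = [
    {"row_id": 0, "fuzzy_terms": None, "t_hlt": "Example HLT alpha"},
    {"row_id": 1, "fuzzy_terms": "['Example HLT beta']", "t_hlt": None},
    {"row_id": 2, "fuzzy_terms": None, "t_hlt": "Example HLT gamma"},
]
# Hypothetical annotated version of the fuzzy-matched rows
annotated_rows = [
    {"row_id": 1, "fuzzy_terms": "['Example HLT beta']", "t_hlt": "Example HLT beta"},
]

# Keep rows that never went through fuzzy matching
kept = [row for row in all_rows if row["fuzzy_terms"] is None]
combined = kept + annotated_rows

# Sanity checks: nothing lost, nothing duplicated
assert len(combined) == len(all_rows)
assert len({row["row_id"] for row in combined}) == len(combined)
print(len(kept), len(annotated_rows), len(combined))
```

The same two checks apply to the DataFrames in the cells below: the shapes of the kept rows and the annotated rows should add up to the shape of the original file.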

In [40]:
data = 'MH_harmonization_map_13_TL3.csv'
td = pd.read_csv(data, na_values=' ')
td.shape

#Remove all rows that went through fuzzy matching, i.e. rows with an entry in TL3_FZT (these will be replaced by the rows from the manually annotated dataset)
td2 = td[td[TL3_FZT].isnull()]
td2.shape

(28720, 21)

(28062, 21)

In [41]:
data_fuz = 'MH_harmonization_map_13_TL3_Fuzzy_NAN.csv'
tf = pd.read_csv(data_fuz, na_values=' ')
tf.shape

#combine the two non-overlapping sets of rows (automatically resolved rows and manually annotated rows)
frames = [td2, tf]

td3 = pd.concat(frames)
td3.shape

td3.to_csv("MH_harmonization_map_14_TL3.csv", header=True, index=False)  

(658, 21)

(28720, 21)

## Identify and save term codes for all TL3 terms
At this point we should have identified a term (in the column TL3_name_header) for every row in the dataset where we previously succeeded in mapping TL1 and TL2. Next we add the corresponding code for each of these terms. This is automated. 
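
The code lookup is a name-to-code mapping applied only where a code is still missing. In the sketch below the ontology rows are invented placeholders and `fill_missing_code` is a hypothetical helper. It raises a descriptive error when a term name is not found, which can happen if a typo slips in during manual annotation.

```python
# Placeholder HLT ontology rows (invented, not real MedDRA content)
hlt_rows = [
    {"hlt_code": 900, "hlt_name": "Example HLT alpha"},
    {"hlt_code": 901, "hlt_name": "Example HLT beta"},
]
name_to_code = {row["hlt_name"]: row["hlt_code"] for row in hlt_rows}

def fill_missing_code(term_name, term_code):
    """Return the existing code, or look it up from the term name."""
    if term_code is not None:
        return term_code
    if term_name not in name_to_code:
        # e.g. a typo introduced during manual annotation
        raise KeyError(f"Term not found in ontology: {term_name!r}")
    return name_to_code[term_name]

print(fill_missing_code("Example HLT beta", None))
print(fill_missing_code("Example HLT alpha", 900))
```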

In [42]:
data = 'MH_harmonization_map_14_TL3.csv'
td = pd.read_csv(data, na_values=' ')
td.shape

count = 0
for each in tqdm_notebook(td[TL3_code_header], desc='1st loop'): #for each row
    if not pd.isna(td[TL1_name_header][count]): #Check for missing value
        
        if pd.isna(each): 
            #Look up the code for the row's TL3 term name in the MedDRA HLT file
            tempList = tl3[ont_DL3_name_col].tolist() # the column name reference is specific to the loaded MedDRA file

            #Identify the TL3 term in the MedDRA HLT file
            tempIndex = tempList.index(td[TL3_name_header][count])

            #Put the corresponding TL3 code into the mapping file
            code = tl3[ont_DL3_code_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
            td.loc[count, TL3_code_header] = int(code) # .loc avoids chained assignment
            
    count += 1

(28720, 21)



***
## Save working map file
Before moving on to the next step of the harmonization pipeline we will save our mapping file 'in progress'. Since a row index was previously added, we will set index to 'False' below.  

In [43]:
td.to_csv("MH_harmonization_map_15_TL3.csv", header=True, index=False)  

## Notebook conclusions
In this notebook we have directly mapped all TL2 names and codes from all TL1 name/codes that had been previously mapped.  Next we mapped all TL3 names and codes, this time resolving any possible term branches using available term information in the dataset.  This was completed with exact and fuzzy matching similar to the first phase of TL1 mapping. 

In the next notebook we replicate the same process we just completed for TL3, but for TL4. 