# Term Harmonize - STEP 8: Finalize and Summarize Mapping File
#### Author: Ryan Urbanowicz (ryanurb@upenn.edu) 
#### Institution: University of Pennsylvania - Perelman School of Medicine
#### Project: CMREF Data Harmonization 
#### Date: 7/18/19

#### Project Overview:
See the first notebook in this series ('Step_1_Term_Harmonize_Data_Preparation.ipynb') for an overview of this project, these notebooks, the target application, data availability, code dependencies, and our strategy for generalizing the code in these notebooks. 

#### Notebook Summary:
This notebook takes the working file, in which imputed SOCs have been added to the rest of the mapping, and maps all of its unique rows back to the original target dataset from the first notebook. It also performs a quality control check, a summary, and the final formatting of the mapping file for the target application.

***
## Load Python packages required in this notebook

In [2]:
#Load necessary packages.
import pandas as pd
import numpy as np

# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Import Progress bar
from tqdm import tnrange, tqdm_notebook



#### Create general variable names for any target application specific values.

In [8]:
# Input filename for 'target dataset' (excel file loaded in this application)
target_study_data = 'Combined_MEDHX_TERMS_20studies.xlsx' 

ont_DL1_data = 'LLT.xlsx' # Input filename for ontology file defining all DL1 terms and their codes. 
ont_DL1_name_col = 'llt_name' # column label for DL1 term name
ont_DL1_code_col ='llt_code' # column label for DL1 term code
ont_DL1_cur_col = 'llt_currency' # column label for term currency

ont_DL2_data = 'PT.xlsx' # Input filename for ontology file defining all DL2 terms and their codes. 
ont_DL2_name_col = 'pt_name' # column label for DL2 term name
ont_DL2_code_col = 'pt_code' # column label for DL2 term code

ont_DL3_DL2_data = 'HLT_PT.xlsx' # Input filename for ontology file defining connections between DL2 and DL3 term codes. 
ont_DL3_data = 'HLT.xlsx' # Input filename for ontology file defining all DL3 terms and their codes.
ont_DL3_name_col = 'hlt_name' # column label for DL3 term name
ont_DL3_code_col = 'hlt_code' # column label for DL3 term code

ont_DL4_DL3_data = 'HLGT_HLT.xlsx' # Input filename for ontology file defining connections between DL3 and DL4 term codes. 
ont_DL4_data = 'HLGT.xlsx' # Input filename for ontology file defining all DL4 terms and their codes.
ont_DL4_name_col = 'hlgt_name' # column label for DL4 term name
ont_DL4_code_col = 'hlgt_code' # column label for DL4 term code

ont_DL5_DL4_data = 'SOC_HLGT.xlsx' # Input filename for ontology file defining connections between DL4 and DL5 term codes. 
ont_DL5_data = 'SOC.xlsx' # Input filename for ontology file defining all DL5 terms and their codes.
ont_DL5_name_col = 'soc_name' # column label for DL5 term name
ont_DL5_code_col = 'soc_code' # column label for DL5 term code

DL1_FT1 = 'MHTERM' # focus term 1: This term is available over all studies. 
DL1_FT2 = 'LLT_NAME' # focus term 2: an alternative term available for a subset of studies. This one supposedly conforms to the MedDRA standard, so we expect it to yield more exact matches. May offer a better match for the lowest level of the standardized terminology.
DL1_FT3 = 'MHMODIFY' # focus term 3: an alternative term available for a subset of studies. May offer a better match for the lowest level of the standardized terminology.

DL2 = 'PT_NAME' # Secondary level terms (i.e. more general than DL1 terms)
DL3 = 'HLT_NAME' # Tertiary level terms (i.e. more general than DL2 terms)
DL4 = 'HLGT_NAME' # Quaternary level terms (i.e. more general than DL3 terms)
DL5 = 'SOC_NAME' # Quinary Level terms (i.e. more general than DL4 terms)

TL1_qual_code_header = 'LLT_map_code' # column name for lowest term level mapping quality code (added to mapping file)
TL1_name_header = 'T_LLT' # column name for the 'mapped' TL1 - term name (added to mapping file)
TL1_code_header = 'T_LLT_CODE' # column name for the 'mapped' TL1 - term code (added to mapping file)
TL2_name_header = 'T_PT'
TL2_code_header = 'T_PT_CODE'
TL3_name_header = 'T_HLT'
TL3_code_header = 'T_HLT_CODE'
TL4_name_header = 'T_HLGT'
TL4_code_header = 'T_HLGT_CODE'
TL5_name_header = 'T_SOC'
TL5_code_header = 'T_SOC_CODE'

FZ1_FT1 = 'FZMatch_1_'+DL1_FT1 # column name for best FT1 fuzzy match (temporarily added to mapping file)
FZ2_FT1 = 'FZMatch_2_'+DL1_FT1 # column name for second best FT1 fuzzy match (temporarily added to mapping file)
FZ3_FT1 = 'FZMatch_3_'+DL1_FT1 # column name for third best FT1 fuzzy match (temporarily added to mapping file)
FZ4_FT1 = 'FZMatch_4_'+DL1_FT1 # column name for fourth best FT1 fuzzy match (temporarily added to mapping file)
FZ5_FT1 = 'FZMatch_5_'+DL1_FT1 # column name for fifth best FT1 fuzzy match (temporarily added to mapping file)

FZMC = 'FZMatch_Choice_ID_'+DL1_FT1 #column name for the column where manual annotator will enter the number (1-5) indicating the FT1 fuzzy matched term that offers the best match (if a good one is identified)
FZCT = 'FZMatch_Copied_Term' #column name for the column where manual annotator can alternatively manually copy in the MedDRA LLT term that best matches the term information in this row (can come from FT2 or FT3 if term was not identified in FT1)

FZ1_FT2 = 'FZMatch_1_'+DL1_FT2 # column name for best FT2 fuzzy match (temporarily added to mapping file)
FZ2_FT2 = 'FZMatch_2_'+DL1_FT2 # column name for second best FT2 fuzzy match (temporarily added to mapping file)
FZ3_FT2 = 'FZMatch_3_'+DL1_FT2 # column name for third best FT2 fuzzy match (temporarily added to mapping file)
FZ4_FT2 = 'FZMatch_4_'+DL1_FT2 # column name for fourth best FT2 fuzzy match (temporarily added to mapping file)
FZ5_FT2 = 'FZMatch_5_'+DL1_FT2 # column name for fifth best FT2 fuzzy match (temporarily added to mapping file)

FZ1_FT3 = 'FZMatch_1_'+DL1_FT3 # column name for best FT3 fuzzy match (temporarily added to mapping file)
FZ2_FT3 = 'FZMatch_2_'+DL1_FT3 # column name for second best FT3 fuzzy match (temporarily added to mapping file)
FZ3_FT3 = 'FZMatch_3_'+DL1_FT3 # column name for third best FT3 fuzzy match (temporarily added to mapping file)
FZ4_FT3 = 'FZMatch_4_'+DL1_FT3 # column name for fourth best FT3 fuzzy match (temporarily added to mapping file)
FZ5_FT3 = 'FZMatch_5_'+DL1_FT3 # column name for fifth best FT3 fuzzy match (temporarily added to mapping file)

TL3_B = 'HLT_branches'
TL3_BQ = 'HLT_branch_quality'
TL3_FZT = 'HLT_Fuzzy_Terms'
TL3_FZS = 'HLT_Fuzzy_Scores'

TL4_B = 'HLGT_branches'
TL4_BQ = 'HLGT_branch_quality'
TL4_FZT = 'HLGT_Fuzzy_Terms'
TL4_FZS = 'HLGT_Fuzzy_Scores'

TL5_B = 'SOC_branches'
TL5_BQ = 'SOC_branch_quality'
TL5_FZT = 'SOC_Fuzzy_Terms'
TL5_FZS = 'SOC_Fuzzy_Scores'

***
## Quality Control and Summary for all 'unique' rows in the map file

In [4]:
fd = pd.read_csv("MH_harmonization_map_22_TL5.csv", na_values=' ') 
fd.shape
fd.info()
fd.nunique()

(28720, 33)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28720 entries, 0 to 28719
Data columns (total 33 columns):
ROW_INDEX                   28720 non-null int64
MHTERM                      28720 non-null object
FZMatch_Choice_ID_MHTERM    4602 non-null float64
FZMatch_Copied_Term         2483 non-null object
LLT_map_code                28720 non-null int64
LLT_NAME                    17839 non-null object
MHMODIFY                    16937 non-null object
PT_NAME                     14235 non-null object
HLT_NAME                    15542 non-null object
HLGT_NAME                   15542 non-null object
SOC_NAME                    25361 non-null object
T_LLT                       28690 non-null object
T_LLT_CODE                  28690 non-null float64
T_PT                        28690 non-null object
T_PT_CODE                   28690 non-null float64
T_HLT                       28690 non-null object
T_HLT_CODE                  28690 non-null float64
HLT_branches                10090 non-nul

ROW_INDEX                   28720
MHTERM                      21451
FZMatch_Choice_ID_MHTERM        5
FZMatch_Copied_Term          1130
LLT_map_code                    7
LLT_NAME                     6386
MHMODIFY                    10482
PT_NAME                      2662
HLT_NAME                     1520
HLGT_NAME                     529
SOC_NAME                      139
T_LLT                        6250
T_LLT_CODE                   6229
T_PT                         3264
T_PT_CODE                    3264
T_HLT                        1092
T_HLT_CODE                   1092
HLT_branches                  929
HLT_branch_quality              5
HLT_Fuzzy_Terms               123
HLT_Fuzzy_Scores              108
T_HLGT                        313
T_HLGT_CODE                   313
HLGT_branches                   8
HLGT_branch_quality             5
HLGT_Fuzzy_Terms                2
HLGT_Fuzzy_Scores               2
T_SOC                          27
T_SOC_CODE                     27
SOC_branches  

***
### Check that terms and codes have been filled in for all rows that could be mapped (i.e. LLT mapping quality < 6)

In [5]:
quality = 0
c_llt = 0
c_llt_code = 0
c_pt = 0
c_pt_code = 0
c_hlt = 0
c_hlt_code = 0
c_hlgt = 0
c_hlgt_code = 0
c_soc = 0
c_soc_code = 0

count = 0
for each in tqdm_notebook(fd[TL1_name_header], desc='1st loop'): #for each row (i.e. AE term to be mapped)
    if pd.isna(each): # Mapped term name missing
        c_llt += 1
    if pd.isna(fd[TL1_code_header][count]):
        c_llt_code += 1
    if pd.isna(fd[TL1_qual_code_header][count]):
        quality += 1
        
    if pd.isna(fd[TL2_name_header][count]):
        c_pt += 1
    if pd.isna(fd[TL2_code_header][count]):
        c_pt_code += 1 
        
    if pd.isna(fd[TL3_name_header][count]):
        c_hlt += 1
    if pd.isna(fd[TL3_code_header][count]):
        c_hlt_code += 1  
        
    if pd.isna(fd[TL4_name_header][count]):
        c_hlgt += 1
    if pd.isna(fd[TL4_code_header][count]):
        c_hlgt_code += 1  
        
    if pd.isna(fd[TL5_name_header][count]):
        c_soc += 1
    if pd.isna(fd[TL5_code_header][count]):
        c_soc_code += 1  
    
    count += 1
        
print('Missing Mapped Terms - Quality Report:')
print('LLT Term: '+str(c_llt)+', LLT Code: '+str(c_llt_code))
print('PT Term: '+str(c_pt)+', PT Code: '+str(c_pt_code))
print('HLT Term: '+str(c_hlt)+', HLT Code: '+str(c_hlt_code))
print('HLGT Term: '+str(c_hlgt)+', HLGT Code: '+str(c_hlgt_code))
print('SOC Term: '+str(c_soc)+', SOC Code: '+str(c_soc_code))



Missing Mapped Terms - Quality Report:
LLT Term: 30, LLT Code: 30
PT Term: 30, PT Code: 30
HLT Term: 30, HLT Code: 30
HLGT Term: 30, HLGT Code: 30
SOC Term: 30, SOC Code: 30


*Application Notes: 30 terms could not be mapped over all unique rows, but we confirm that terms and codes have been filled in for all other rows.*
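The per-column counters in the loop above can also be written as a single pass over a list of column names. The sketch below uses plain Python dictionaries in place of the DataFrame so that it runs on its own; the toy rows, the placeholder codes, and the helper names `is_missing` and `count_missing` are illustrative assumptions, not part of the original notebook. On the real DataFrame, `fd[columns].isna().sum()` should give the same per-column totals.

```python
import math

def is_missing(value):
    """Return True for None or a float NaN (how pandas marks blank cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def count_missing(rows, columns):
    """Count missing entries per column over a list of row dictionaries."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if is_missing(row.get(col)):
                counts[col] += 1
    return counts

# Hypothetical toy rows standing in for the mapping file
# (the codes are placeholders, not real MedDRA entries).
rows = [
    {"T_LLT": "Term A", "T_LLT_CODE": 10000001.0},
    {"T_LLT": float("nan"), "T_LLT_CODE": float("nan")},
    {"T_LLT": "Term B", "T_LLT_CODE": 10000002.0},
]
missing = count_missing(rows, ["T_LLT", "T_LLT_CODE"])
print(missing)
```

If the name and code counts for a level ever disagree, that points to a row where only one of the two was filled in.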

***
## Survey frequency of branches at each level of the ontology where term branches were possible

In [4]:
def survey_branches(column):
    #Tally rows by number of branch codes, estimated from the length of the stored branch string.
    #An empty cell is read as NaN, i.e. the 3-character string 'nan', and so counts as 0 branches.
    branch_dict = {}
    for each in column:
        digits = len(str(each))
        code_digits = str(digits // 8)
        if code_digits in branch_dict:
            branch_dict[code_digits] += 1
        else:
            branch_dict[code_digits] = 1
    return branch_dict

#Survey HLT
print("HLT Branch Counts:")
print(survey_branches(fd[TL3_B]))

#Survey HLGT
print("HLGT Branch Counts:")
print(survey_branches(fd[TL4_B]))

#Survey SOC
print("SOC Branch Counts:")
print(survey_branches(fd[TL5_B]))

HLT Branch Counts:
{'4': 236, '0': 18630, '3': 2205, '5': 176, '2': 7473}
HLGT Branch Counts:
{'0': 28354, '2': 366}
SOC Branch Counts:
{'0': 28205, '2': 515}


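*Application notes: the keys in the counts above are the number of candidate branches found for each row, estimated from the length of the stored branch string, so '0' means there was no branch (i.e. direct term imputation). These are distinct from the branch quality codes surveyed in the sections below. Recall that for those, code 0 means there was no branch (i.e. direct term imputation), code 1 means that an exact match resolved the best branch, code 2 means that the branch with the highest fuzzy match score was ultimately selected, code 3 means that a different branch was selected manually, and code 4 means that no term information was available to perform exact or fuzzy branch term matching, thus the first branch was selected by default.*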
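The survey cell above estimates the number of branches from the length of the branch string divided by 8. Because the branch strings written later in this notebook append each code followed by an underscore, an alternative is to count the non-empty pieces after splitting on `_`. This is a minimal sketch under that formatting assumption; the helper name `count_branches` and the placeholder codes are hypothetical.

```python
from collections import Counter

def count_branches(branch_value):
    """Number of codes in an underscore-delimited branch string (0 if missing)."""
    text = str(branch_value)
    if text == "nan" or text == "":
        return 0
    return len([piece for piece in text.split("_") if piece])

# Hypothetical branch strings with placeholder codes; float('nan') mimics an empty cell.
examples = ["10000001_10000002_", float("nan"), "10000001_10000002_10000003_"]
branch_counts = Counter(count_branches(value) for value in examples)
print(dict(branch_counts))
```

Counting delimited pieces does not depend on every code having the same number of digits, which the length-based estimate implicitly assumes.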

***
## Survey LLT annotation quality code frequencies (as these are most important)

In [5]:
annote_dict = {}
for each in fd[TL1_qual_code_header]:
    code = str(each)
    if code in annote_dict :               
        annote_dict[code] += 1
    else:
        annote_dict[code] = 1


print("AE final annotation code counts:")
print(annote_dict)

AE final annotation code counts:
{'0': 3922, '3': 1136, '5': 14, '6': 29, '2': 13357, '1': 3190, '4': 7072}


*Application notes: Recall that codes 0-3 represent rows resolved with exact matching, codes 4 and 5 represent rows resolved by fuzzy matching, and code 6 indicates rows that could not be mapped.* 
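To read the counts above in terms of the three groups just described, the individual codes can be collapsed into exact, fuzzy, and unmapped totals. The sketch below copies the counts from the output above; the function name `summarize_quality` is an illustrative assumption.

```python
# Counts copied from the survey output above.
annote_dict = {'0': 3922, '3': 1136, '5': 14, '6': 29, '2': 13357, '1': 3190, '4': 7072}

def summarize_quality(code_counts):
    """Collapse LLT quality codes into exact / fuzzy / unmapped totals."""
    groups = {"exact": 0, "fuzzy": 0, "unmapped": 0}
    for code, n in code_counts.items():
        code = int(float(code))  # tolerate keys such as '4' or '4.0'
        if code <= 3:
            groups["exact"] += n      # codes 0-3: exact matching
        elif code <= 5:
            groups["fuzzy"] += n      # codes 4-5: fuzzy matching
        else:
            groups["unmapped"] += n   # code 6: could not be mapped
    return groups

summary = summarize_quality(annote_dict)
total = sum(summary.values())
for name, n in summary.items():
    print(f"{name}: {n} ({100 * n / total:.1f}%)")
```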

***
## Survey HLT Branch Codes

In [6]:
annote_dict = {}
for each in fd[TL3_BQ]:
    code = str(each)
    if code in annote_dict :               
        annote_dict[code] += 1
    else:
        annote_dict[code] = 1


print("HLT Branch code counts:")
print(annote_dict)

HLT Branch code counts:
{'1.0': 5013, '2.0': 347, '3.0': 311, '4.0': 4422, 'nan': 30, '0.0': 18597}


*Application notes: Recall that code 0 indicates a direct term imputation, code 1 represents term branching resolved by exact matching, codes 2 and 3 represent term branching resolved by fuzzy matching, and code 4 represents branching where the first branch was picked by default (because no term information was available in the original dataset). No code was assigned here for any row where the original LLT term could not be mapped.* 

***
## Survey HLGT Branch Codes

In [7]:
annote_dict = {}
for each in fd[TL4_BQ]:
    code = str(each)
    if code in annote_dict :               
        annote_dict[code] += 1
    else:
        annote_dict[code] = 1


print("HLGT Branch code counts:")
print(annote_dict)

HLGT Branch code counts:
{'1.0': 237, '3.0': 2, '2.0': 1, '4.0': 126, 'nan': 30, '0.0': 28324}


*Application notes: Recall that code 0 indicates a direct term imputation, code 1 represents term branching resolved by exact matching, codes 2 and 3 represent term branching resolved by fuzzy matching, and code 4 represents branching where the first branch was picked by default (because no term information was available in the original dataset). No code was assigned here for any row where the original LLT term could not be mapped.* 

***
## Survey SOC Branch Codes

In [8]:
annote_dict = {}
for each in fd[TL5_BQ]:
    code = str(each)
    if code in annote_dict :               
        annote_dict[code] += 1
    else:
        annote_dict[code] = 1


print("SOC Branch code counts:")
print(annote_dict)

SOC Branch code counts:
{'1.0': 315, '2.0': 96, '3.0': 4, '4.0': 100, 'nan': 30, '0.0': 28175}


*Application notes: Recall that code 0 indicates a direct term imputation, code 1 represents term branching resolved by exact matching, codes 2 and 3 represent term branching resolved by fuzzy matching, and code 4 represents branching where the first branch was picked by default (because no term information was available in the original dataset). No code was assigned here for any row where the original LLT term could not be mapped.* 
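Because the branch quality columns are stored as floats, the keys in these counts appear as strings such as `'1.0'` and `'nan'`. The sketch below attaches the meanings listed in the note above to the SOC counts copied from the output; the label wording and the function name `label_branch_counts` are assumptions made for illustration.

```python
# Meanings follow the application note above; the label text itself is illustrative.
BRANCH_LABELS = {
    0: "direct imputation",
    1: "exact match",
    2: "fuzzy match",
    3: "fuzzy match",
    4: "first branch by default",
}

def label_branch_counts(code_counts):
    """Convert keys like '1.0' or 'nan' into readable labels, merging codes 2 and 3."""
    labeled = {}
    for key, n in code_counts.items():
        if key == "nan":
            label = "no code (LLT unmapped)"
        else:
            label = BRANCH_LABELS[int(float(key))]
        labeled[label] = labeled.get(label, 0) + n
    return labeled

# Counts copied from the SOC survey output above.
soc_counts = {'1.0': 315, '2.0': 96, '3.0': 4, '4.0': 100, 'nan': 30, '0.0': 28175}
soc_labeled = label_branch_counts(soc_counts)
for label, n in soc_labeled.items():
    print(f"{label}: {n}")
```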

***
## Map 'unique' rows back to original target dataset from first notebook
- Load the original data and the new mapped data.
- Drop the row index.
- Add the new relevant mapping columns to the original dataset, initialized blank.
- The original data becomes the new working data: for each row, search the mapped data and check all relevant columns to confirm a row-wide match.
- When a match is found, copy the mapping data to the original data file.
- Deal with any original data rows that fail to match/map.
- Rerun the QC analysis for this new dataset. Include a summary of numbers highlighting counts for unique rows vs. all rows.


In [13]:
td = pd.read_excel(target_study_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA' (read_excel takes no 'sep' argument)
td.shape
td.info()
td.nunique()

(37105, 15)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37105 entries, 0 to 37104
Data columns (total 15 columns):
STUDY        37105 non-null object
MHTERM       37083 non-null object
MHCODE       5811 non-null float64
MHMODIFY     19140 non-null object
LLT_CODE     16348 non-null float64
LLT_NAME     21523 non-null object
PT_CODE      14114 non-null float64
PT_NAME      19447 non-null object
HLT_CODE     9597 non-null float64
HLT_NAME     20105 non-null object
HLGTCODE     9597 non-null float64
HLGT_NAME    20105 non-null object
SOC_CODE     14114 non-null float64
SOC_NAME     33735 non-null object
PTSOC_CD     3099 non-null float64
dtypes: float64(7), object(8)
memory usage: 4.2+ MB


STUDY           20
MHTERM       21452
MHCODE        2571
MHMODIFY     10482
LLT_CODE      4004
LLT_NAME      6386
PT_CODE       2414
PT_NAME       2662
HLT_CODE       756
HLT_NAME      1520
HLGTCODE       266
HLGT_NAME      529
SOC_CODE        26
SOC_NAME       139
PTSOC_CD        26
dtype: int64

***
### Insert relevant mapping columns into original target dataset

In [14]:
#Add columns from new mapped file to the original dataset
td.insert(loc=15,column=TL1_qual_code_header,value='') 
td.insert(loc=16,column=TL1_name_header,value='') 
td.insert(loc=17,column=TL1_code_header,value='') 

td.insert(loc=18,column=TL2_name_header,value='') 
td.insert(loc=19,column=TL2_code_header,value='') 

td.insert(loc=20,column=TL3_name_header,value='') 
td.insert(loc=21,column=TL3_code_header,value='') 
td.insert(loc=22,column=TL3_B,value='') 
td.insert(loc=23,column=TL3_BQ,value='') 

td.insert(loc=24,column=TL4_name_header,value='') 
td.insert(loc=25,column=TL4_code_header,value='') 
td.insert(loc=26,column=TL4_B,value='') 
td.insert(loc=27,column=TL4_BQ,value='') 

td.insert(loc=28,column=TL5_name_header,value='') 
td.insert(loc=29,column=TL5_code_header,value='') 
td.insert(loc=30,column=TL5_B,value='') 
td.insert(loc=31,column=TL5_BQ,value='') 

td.to_csv("MH_harmonization_map_23_testmap.csv", header=True, index=False) 

***
### Fill in the values (for these inserted columns) from the working map file back to the original target dataset
To find a matching row from the original target dataset we focus on the term information in DL1_FT1, DL1_FT2, and DL1_FT3.
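Matching on the three focus terms can also be done with a dictionary keyed on the tuple of terms, which avoids rescanning the whole map file for every target row. The sketch below works on plain lists of dictionaries so that it runs on its own; the helper names, the reduced column set, and the toy rows are all illustrative assumptions. Missing values are normalized first because a NaN never compares equal to another NaN inside a tuple key. On the real DataFrames, a left `DataFrame.merge` on the same three columns is another option.

```python
import math

def normalize(value):
    """Map None/NaN to a single sentinel so blank cells compare equal."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return value

def build_lookup(map_rows, key_cols):
    """Index map-file rows by their tuple of focus terms (first occurrence wins)."""
    lookup = {}
    for row in map_rows:
        key = tuple(normalize(row[col]) for col in key_cols)
        lookup.setdefault(key, row)
    return lookup

def map_back(target_rows, lookup, key_cols, copy_cols):
    """Copy mapping columns onto each target row; return indices with no match."""
    unmatched = []
    for i, row in enumerate(target_rows):
        key = tuple(normalize(row[col]) for col in key_cols)
        match = lookup.get(key)
        if match is None:
            unmatched.append(i)
            continue
        for col in copy_cols:
            row[col] = match[col]
    return unmatched

key_cols = ["MHTERM", "LLT_NAME", "MHMODIFY"]
copy_cols = ["T_LLT", "LLT_map_code"]  # reduced set, for illustration only
map_rows = [
    {"MHTERM": "headache", "LLT_NAME": float("nan"), "MHMODIFY": "HEADACHE",
     "T_LLT": "Headache", "LLT_map_code": 2},
]
target_rows = [
    {"MHTERM": "headache", "LLT_NAME": float("nan"), "MHMODIFY": "HEADACHE"},
    {"MHTERM": "headache", "LLT_NAME": float("nan"), "MHMODIFY": "HEADACHE"},
    {"MHTERM": "unknown term", "LLT_NAME": float("nan"), "MHMODIFY": float("nan")},
]
lookup = build_lookup(map_rows, key_cols)
unmatched = map_back(target_rows, lookup, key_cols, copy_cols)
print(unmatched)
```

As with the nested loop, the first matching map row is the one that gets copied, and the indices of unmatched target rows are collected for follow-up.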

In [15]:
#Mapping columns to copy from the working map file (fd) to the target dataset (td)
map_cols = [TL1_qual_code_header, TL1_name_header, TL1_code_header,
            TL2_name_header, TL2_code_header,
            TL3_name_header, TL3_code_header, TL3_B, TL3_BQ,
            TL4_name_header, TL4_code_header, TL4_B, TL4_BQ,
            TL5_name_header, TL5_code_header, TL5_B, TL5_BQ]

target_row = 0
for each in tqdm_notebook(td[DL1_FT1], desc='1st loop'): #for each row in main data
    #make list of target column terms
    targetList = [each,td[DL1_FT2][target_row],td[DL1_FT3][target_row]]
    other_row = 0
    matched = False
    for other in fd[DL1_FT1]:
        #make list of other column terms
        otherList = [other,fd[DL1_FT2][other_row],fd[DL1_FT3][other_row]]
        
        if targetList == otherList: #check for row match
            #Copy all relevant mapping column info to target dataset.
            #(.loc writes to td directly, avoiding chained assignment and its SettingWithCopyWarning)
            for col in map_cols:
                td.loc[target_row, col] = fd.loc[other_row, col]
            
            matched = True
            break
            
        other_row += 1
        
    if not matched: 
        print(target_row) #report target rows with no match in the map file
        
    target_row += 1

td.to_csv("MH_harmonization_map_23_fullmap.csv", header=True, index=True) 


213
732
1018
1429
1430
1574
1575
2431
2447
2449
2949
2970
3106
3211
3343
4409
5884
6445
6452
11817
11818
11819
11820
11821
11822
11823
11824
11825
11826
11827
11828
14531
14539
15214
20001
21231
22399
22546
23257
23454
23488
23954
24012
24194
24427
25736
27243
29168
29200
29335
29410
29853
30122
30353
30380
30393
30460
30580
30681
30704
31356
31410
31508
31594
31610
31669
31694
31768
34291
37104


## Fix up final mapped file
- There are a number of reasons why mapping back to the original dataset will not be complete. Here we go back and try to automate fixing some of these issues.
- Any row that has not been mapped and has not already been given a quality code of 6 (i.e. unmappable) is assigned a quality code of 6.
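A minimal sketch of the second rule on plain dictionaries, assuming a blank term (None, empty string, or NaN) is what marks a row as unmappable. The helper name `flag_unmapped`, the choice of column to test, and the toy rows are assumptions for illustration.

```python
import math

UNMAPPABLE = 6  # quality code used in this series for rows that cannot be mapped

def flag_unmapped(rows, term_col, code_col):
    """Assign the unmappable code to rows with a blank term; return how many changed."""
    changed = 0
    for row in rows:
        value = row.get(term_col)
        blank = (value is None or value == ""
                 or (isinstance(value, float) and math.isnan(value)))
        if blank and row.get(code_col) != UNMAPPABLE:
            row[code_col] = UNMAPPABLE
            changed += 1
    return changed

# Hypothetical toy rows.
rows = [
    {"MHTERM": "asthma", "LLT_map_code": 2},
    {"MHTERM": float("nan"), "LLT_map_code": ""},
    {"MHTERM": None, "LLT_map_code": 6},
]
changed = flag_unmapped(rows, "MHTERM", "LLT_map_code")
print(changed)
```

Rows already carrying code 6 are left alone, so running the step twice does not change the result.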

In [5]:
td = pd.read_csv("MH_harmonization_map_23_fullmap.csv", na_values=' ') 

#Any row with a missing DL1_FT1 term gets a quality code of 6 (unmappable)
target_row = 0
for each in tqdm_notebook(td[DL1_FT1], desc='1st loop'): #for each row in main data
    if pd.isna(each): #Check for missing value
        td.loc[target_row, TL1_qual_code_header] = 6 #.loc avoids chained assignment
    target_row += 1
    
td.to_csv("MH_harmonization_map_24_fullmap.csv", header=True, index=False) 
    

## Manual fixing of unmapped rows
- First, we sorted the file by code and highlighted any rows with no mapped code.
- Then we sorted the file by MHTERM and used mapped rows to facilitate filling in the entire set of mapping info (LLT, PT, etc.) for as many unmapped rows as possible.
- Some LLTs were added manually based on a separate search of MedDRA and will need to have their hierarchy imputed in a follow-up cleanup.

Following these manual fixes we saved the file as ("MH_harmonization_map_24_fullmap_manual.csv").
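The fixes described above were made by hand. Purely as an illustration, the sketch below shows how the "use mapped rows with the same term" step could be partly automated; it is not the procedure that was actually used. The case-insensitive, whitespace-trimmed comparison, the helper name `fill_from_mapped`, and the toy rows are all assumptions, and any automated suggestion would still need manual review.

```python
def fill_from_mapped(rows, term_col, mapping_cols):
    """Copy mapping columns from a mapped row to unmapped rows sharing the same term.

    A row counts as mapped when its first mapping column is filled in.
    Returns the number of rows that were filled.
    """
    donors = {}
    for row in rows:
        if row.get(mapping_cols[0]) not in (None, ""):
            donors.setdefault(row[term_col].strip().lower(), row)

    filled = 0
    for row in rows:
        if row.get(mapping_cols[0]) in (None, ""):
            donor = donors.get(row[term_col].strip().lower())
            if donor is not None:
                for col in mapping_cols:
                    row[col] = donor[col]
                filled += 1
    return filled

# Hypothetical toy rows: one mapped, one unmapped duplicate, one with no donor.
rows = [
    {"MHTERM": "Asthma", "T_LLT": "Asthma", "T_PT": "Asthma"},
    {"MHTERM": "asthma ", "T_LLT": None, "T_PT": None},
    {"MHTERM": "unlisted complaint", "T_LLT": None, "T_PT": None},
]
filled = fill_from_mapped(rows, "MHTERM", ["T_LLT", "T_PT"])
print(filled)
```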

## Impute hierarchy for any rows where a mapped LLT has since been added

### Load necessary ontology files for imputing DL2 and DL3

In [17]:
#### Load Lowest Level Terminology Standard File
#(read_excel takes no 'sep' argument, so it is omitted here)

tl1 = pd.read_excel(ont_DL1_data, na_values=' ')
tl1.shape
#Filter out any non-current low level terms (LLTs) 
tl1 = tl1.loc[tl1[ont_DL1_cur_col] == 'Y'] #column name is application specific.
#Again determine number of remaining unique LLTs
tl1.shape
#Readjusts the row index values so there are no gaps in the sequence from the row removal (important for indexing later) 
tl1 = tl1.reset_index(drop=True) 

#### Load 2nd Level Terminology Standard File
tl2 = pd.read_excel(ont_DL2_data, na_values=' ')
tl2.shape

#### Load 3rd Level Terminology Standard File
tl3 = pd.read_excel(ont_DL3_data, na_values=' ')
tl3.shape

#### Load 2nd to 3rd Level Terminology Connection File
tl3_tl2 = pd.read_excel(ont_DL3_DL2_data, na_values=' ')
tl3_tl2.shape

(78808, 11)

(69531, 11)

(23088, 11)

(1737, 9)

(33402, 2)

### Fill in remaining TL1 codes, and directly impute the TL2 names and codes

In [28]:
#Impute term hierarchy for any remaining manually added LLTs
td = pd.read_csv("MH_harmonization_map_24_fullmap_manual.csv", na_values=' ') 

#The code lists only need to be built once, so they are created outside the loop
tl1_codes = tl1[ont_DL1_code_col].tolist() # the column name reference is specific to the loaded MedDRA file
tl2_codes = tl2[ont_DL2_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

data_count = 0
for each in tqdm_notebook(td[TL1_code_header], desc='1st loop'): #for each row 
    #if TL2 term missing and quality code is not 6 (unmappable)
    if pd.isna(td[TL2_name_header][data_count]) and str(td[TL1_qual_code_header][data_count]) != str(6):
        #Identify the TL1 code in the MedDRA LLT file
        tempIndex = tl1_codes.index(each) 

        #Put the corresponding TL2 code into the key (.loc avoids chained assignment)
        TL2_code = tl1[ont_DL2_code_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
        td.loc[data_count, TL2_code_header] = TL2_code

        #Identify the TL2 name from the MedDRA PT file
        tempIndex = tl2_codes.index(TL2_code) 

        #Put the corresponding TL2 name into the key
        TL2_name = tl2[ont_DL2_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file
        td.loc[data_count, TL2_name_header] = TL2_name
        print(td[TL1_name_header][data_count])
    data_count +=1 
    
td.to_csv("MH_harmonization_map_25_fullmap.csv", header=True, index=False) 


Plastic surgery NOS
Hyperpotassemia


### Impute remaining TL3 terms and codes
In this case all the terms that needed to be imputed either had no branches, or no terms to allow for exact/fuzzy matching, thus those steps were not needed here (but could be added as needed).
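The imputation step for one level of the hierarchy can be sketched as a lookup from each child code to its parent codes: a single parent is imputed directly with branch quality 0, while several parents are recorded as a branch string for later resolution. The example below runs on toy data; the function names, the toy term names, and the placeholder codes are assumptions for illustration and do not come from the MedDRA files.

```python
from collections import defaultdict

def build_parent_index(links):
    """Map each child code to the list of parent codes it connects to."""
    parents = defaultdict(list)
    for parent_code, child_code in links:
        parents[child_code].append(parent_code)
    return parents

def impute_parent(child_code, parents, parent_names):
    """Return (name, code, branches, quality) for one child code."""
    candidates = parents.get(child_code, [])
    if not candidates:
        return None, None, None, None              # child code not found
    if len(candidates) == 1:
        code = candidates[0]
        return parent_names[code], code, None, 0   # quality 0: direct imputation
    branches = "".join(f"{code}_" for code in candidates)  # 'code_' layout of the branch columns
    return None, None, branches, None              # left for branch resolution

# Hypothetical connection table of (parent code, child code) pairs.
links = [(20000001, 10000001), (20000002, 10000002), (20000003, 10000002)]
parent_names = {20000001: "Toy HLT A", 20000002: "Toy HLT B", 20000003: "Toy HLT C"}

parents = build_parent_index(links)
direct = impute_parent(10000001, parents, parent_names)
branched = impute_parent(10000002, parents, parent_names)
print(direct)
print(branched)
```

Building the index once replaces the repeated list scans inside the loop, and the explicit "not found" case avoids indexing into an empty candidate list.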

In [30]:
td = pd.read_csv("MH_harmonization_map_25_fullmap.csv", na_values=' ') 

data_count = 0
total_branched = 0
total_direct = 0

for each in tqdm_notebook(td[TL2_code_header], desc='1st loop'): #for each row
    #if TL3 term missing and quality code is not 6(unmappable)
    if pd.isna(td[TL3_name_header][data_count]) and str(td[TL1_qual_code_header][data_count]) != str(6):
        
        #Identify the TL2 code in the MedDRA HLT_PT file
        tempList = tl3_tl2[ont_DL2_code_col].tolist() # the column name reference is specific to the loaded MedDRA file
        indexList = [i for i,val in enumerate(tempList) if val==each]

        #Put all connected TL3 codes in TL3_B separated by an underscore
        TL3_list = []
        TL3_str = ''
        for i in indexList:
            TL3_list.append(tl3_tl2[ont_DL3_code_col][i]) # the column name reference is specific to the loaded MedDRA file
            TL3_str += str(tl3_tl2[ont_DL3_code_col][i])+'_' # the column name reference is specific to the loaded MedDRA file

        #Branch Reporting (.loc is used for all writes to avoid chained assignment)
        if len(TL3_list) > 1: # is there more than one branch 
            total_branched += 1
            td.loc[data_count, TL3_B] = TL3_str

        else: #only one branch found
            total_direct += 1
            #Identify term for single code
            tempList = tl3[ont_DL3_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            tempIndex = tempList.index(TL3_list[0]) #find index/location of code in tl3 set of terms
            term = tl3[ont_DL3_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file

            td.loc[data_count, TL3_name_header] = term
            td.loc[data_count, TL3_code_header] = TL3_list[0]
            td.loc[data_count, TL3_BQ] = 0 # Branch quality 0: direct imputation (no branch)

    data_count +=1 

print("Directly Imputed: " +str(total_direct))
print("With Branches: " +str(total_branched))

td.to_csv("MH_harmonization_map_26_fullmap.csv", header=True, index=False) 


Directly Imputed: 2
With Branches: 0


*Application notes: In this case no branches were found on the terms we sought to impute, thus we don't need to attempt exact or fuzzy matching to pick the best branch.  However code may be added here to do so if needed.* 
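If branches did need resolving here, one possible shape for that code is sketched below, following the branch quality codes used in this series: 1 for an exact match, 2 for the highest fuzzy score, and 4 when no term information is available and the first branch is taken by default. The standard library's `difflib.SequenceMatcher` stands in for whichever fuzzy matcher the earlier notebooks use, and the function name `resolve_branch` and the placeholder codes are assumptions.

```python
from difflib import SequenceMatcher

def resolve_branch(candidates, source_term):
    """Pick one (code, name) candidate; return (code, name, quality_code)."""
    if not source_term:
        code, name = candidates[0]
        return code, name, 4            # no term information: first branch by default
    wanted = source_term.strip().lower()
    for code, name in candidates:
        if name.strip().lower() == wanted:
            return code, name, 1        # exact match
    scored = [(SequenceMatcher(None, wanted, name.lower()).ratio(), code, name)
              for code, name in candidates]
    _, code, name = max(scored, key=lambda item: item[0])
    return code, name, 2                # branch with the highest fuzzy score

# Placeholder codes paired with two candidate term names.
candidates = [(20000001, "Cardiac disorders"), (20000002, "Vascular disorders")]

exact = resolve_branch(candidates, "cardiac disorders")
fuzzy = resolve_branch(candidates, "vascular disorder")
default = resolve_branch(candidates, None)
print(exact, fuzzy, default)
```

A score threshold below which a fuzzy match is rejected and sent for manual review (code 3 in this series) could be added in the same place.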

### Load necessary ontology files for imputing DL4

In [32]:
#### Load 4th Level Terminology Standard File
tl4 = pd.read_excel(ont_DL4_data, na_values=' ')
tl4.shape

#### Load 3rd to 4th Level Terminology Connection File
tl4_tl3 = pd.read_excel(ont_DL4_DL3_data, na_values=' ')
tl4_tl3.shape

(337, 9)

(1755, 2)

### Impute remaining TL4 terms and codes
In this case all the terms that needed to be imputed either had no branches, or no terms to allow for exact/fuzzy matching, thus those steps were not needed here (but could be added as needed).

In [33]:
td = pd.read_csv("MH_harmonization_map_26_fullmap.csv", na_values=' ') 
data_count = 0
total_branched = 0
total_direct = 0

for each in tqdm_notebook(td[TL3_code_header], desc='1st loop'): #for each row
    #if TL4 term missing and quality code is not 6 (unmappable)
    if pd.isna(td[TL4_name_header][data_count]) and str(td[TL1_qual_code_header][data_count]) != str(6):

        #Identify the TL3 code in the MedDRA HLGT_HLT file
        tempList = tl4_tl3[ont_DL3_code_col].tolist() # the column name reference is specific to the loaded MedDRA file
        indexList = [i for i,val in enumerate(tempList) if val==int(float(each))]
 
        #Put all connected TL4 codes in TL4_B separated by an underscore
        TL4_list = []
        TL4_str = ''
        for i in indexList:
            TL4_list.append(tl4_tl3[ont_DL4_code_col][i]) # the column name reference is specific to the loaded MedDRA file
            TL4_str += str(tl4_tl3[ont_DL4_code_col][i])+'_' # the column name reference is specific to the loaded MedDRA file
        
        #Branch Reporting
        if len(TL4_list) > 1: # is there more than one branch 
            total_branched += 1
            td.loc[data_count, TL4_B] = TL4_str # .loc writes to td itself rather than to a possible copy

        else: #only one branch found (assumes the TL3 code was found in the connection file)
            total_direct += 1
            #Identify term for single code
            tempList = tl4[ont_DL4_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            tempIndex = tempList.index(TL4_list[0]) #find index/location of code in tl4 set of terms
            term = tl4[ont_DL4_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file

            td.loc[data_count, TL4_name_header] = term
            td.loc[data_count, TL4_code_header] = TL4_list[0]
            td.loc[data_count, TL4_BQ] = 0 # Top Branch Quality

    data_count +=1 

print("Directly Imputed: " +str(total_direct))
print("With Branches: " +str(total_branched))

td.to_csv("MH_harmonization_map_27_fullmap.csv", header=True, index=False) 



Directly Imputed: 2
With Branches: 0


*Application notes: In this case no branches were found on the terms we sought to impute, thus we don't need to attempt exact or fuzzy matching to pick the best branch.  However code may be added here to do so if needed.* 
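
The imputation loop above can also be expressed by first grouping the connection table by child code, so that each lookup is a dictionary access instead of a scan over the whole column. The sketch below uses small made-up tables; the column names (`hlt_code`, `hlgt_code`, `hlgt_name`) and the helper `impute_parent` are assumptions for illustration and do not come from the loaded MedDRA files.

```python
import pandas as pd

# Made-up connection table (child code -> parent code); column names are assumptions
links = pd.DataFrame({
    "hlt_code":  [101, 102, 102],
    "hlgt_code": [201, 202, 203],
})
# Made-up parent-level term table
parents = pd.DataFrame({"hlgt_code": [201, 202, 203],
                        "hlgt_name": ["Term A", "Term B", "Term C"]})

# Collect all parent codes for each child code
parent_lists = links.groupby("hlt_code")["hlgt_code"].apply(list).to_dict()
name_by_code = dict(zip(parents["hlgt_code"], parents["hlgt_name"]))

def impute_parent(child_code):
    """Return (code, name, branch_string) for one child code."""
    found = parent_lists.get(child_code, [])
    if len(found) == 1:                      # single parent: impute directly
        return found[0], name_by_code[found[0]], None
    if len(found) > 1:                       # several parents: record the branches
        return None, None, "".join(str(c) + "_" for c in found)
    return None, None, None                  # child code absent from the connection table

print(impute_parent(101))
print(impute_parent(102))
```

Unlike the loop above, this sketch treats a child code that is absent from the connection table as a separate case rather than as a single branch.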

### Load necessary ontology files for imputing DL5

In [34]:
#### Load 5th Level Terminology Standard File
tl5 = pd.read_excel(ont_DL5_data, na_values=' ')
tl5.shape

#### Load 4th to 5th Level Terminology Connection File
tl5_tl4 = pd.read_excel(ont_DL5_DL4_data, na_values=' ')
tl5_tl4.shape

(27, 10)

(354, 2)

### Impute remaining TL5 terms and codes
In this case all the terms that needed to be imputed either had no branches, or no terms to allow for exact/fuzzy matching, thus those steps were not needed here (but could be added as needed).

In [56]:
td = pd.read_csv("MH_harmonization_map_27_fullmap.csv", na_values=' ') 
data_count = 0
total_branched = 0
total_direct = 0

for each in tqdm_notebook(td[TL4_code_header], desc='1st loop'): #for each row
    #if TL5 term missing and quality code is not 6 (unmappable)
    if pd.isna(td[TL5_name_header][data_count]) and str(td[TL1_qual_code_header][data_count]) != str(6):

        #Identify the TL4 code in the MedDRA TL5-to-TL4 connection file (tl5_tl4)
        tempList = tl5_tl4[ont_DL4_code_col].tolist() # the column name reference is specific to the loaded MedDRA file
        indexList = [i for i,val in enumerate(tempList) if val==each]

        #Put all connected TL5 codes in TL5_B separated by an underscore
        TL5_list = []
        TL5_str = ''
        for i in indexList:
            TL5_list.append(tl5_tl4[ont_DL5_code_col][i]) # the column name reference is specific to the loaded MedDRA file
            TL5_str += str(tl5_tl4[ont_DL5_code_col][i])+'_' # the column name reference is specific to the loaded MedDRA file

        #Branch Reporting
        if len(TL5_list) > 1: # is there more than one branch 
            total_branched += 1
            td.loc[data_count, TL5_B] = TL5_str # .loc writes to td itself rather than to a possible copy

        else: #only one branch found (assumes the TL4 code was found in the connection file)
            total_direct += 1
            #Identify term for single code
            tempList = tl5[ont_DL5_code_col].tolist() # the column name reference is specific to the loaded MedDRA file

            tempIndex = tempList.index(TL5_list[0]) #find index/location of code in tl5 set of terms
            term = tl5[ont_DL5_name_col][tempIndex] # the column name reference is specific to the loaded MedDRA file

            td.loc[data_count, TL5_name_header] = term
            td.loc[data_count, TL5_code_header] = TL5_list[0]
            td.loc[data_count, TL5_BQ] = 0 # Top Branch Quality

    data_count +=1 

print("Directly Imputed: " +str(total_direct))
print("With Branches: " +str(total_branched))
td.to_csv("MH_harmonization_map_28_fullmap.csv", header=True, index=False) 



Directly Imputed: 2
With Branches: 0


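*Application notes: We performed some minor remaining manual cleanup of the file to fix simple copy errors from previous manual fixes. The cleaned file, `MH_harmonization_map_28_fullmap_man.csv`, is the one loaded in the quality control step below.*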

***
## Quality Control and Summary for all original rows in the target dataset

***
### Check that terms and codes have been filled in for all rows that could be mapped (i.e. LLT mapping quality < 6)

In [65]:
fd = pd.read_csv("MH_harmonization_map_28_fullmap_man.csv", na_values=' ')

quality = 0
c_llt = 0
c_llt_code = 0
c_pt = 0
c_pt_code = 0
c_hlt = 0
c_hlt_code = 0
c_hlgt = 0
c_hlgt_code = 0
c_soc = 0
c_soc_code = 0

count = 0
for each in tqdm_notebook(fd[TL1_name_header], desc='1st loop'): #for each row (i.e. term to be mapped)
    if pd.isna(each): # LLT term name missing
        c_llt += 1
    if pd.isna(fd[TL1_code_header][count]):
        c_llt_code += 1
    if pd.isna(fd[TL1_qual_code_header][count]):
        quality += 1
        
    if pd.isna(fd[TL2_name_header][count]):
        c_pt += 1
    if pd.isna(fd[TL2_code_header][count]):
        c_pt_code += 1 
        
    if pd.isna(fd[TL3_name_header][count]):
        c_hlt += 1
    if pd.isna(fd[TL3_code_header][count]):
        c_hlt_code += 1  
        
    if pd.isna(fd[TL4_name_header][count]):
        c_hlgt += 1
    if pd.isna(fd[TL4_code_header][count]):
        c_hlgt_code += 1  
        
    if pd.isna(fd[TL5_name_header][count]):
        c_soc += 1
    if pd.isna(fd[TL5_code_header][count]):
        c_soc_code += 1  
    
    count += 1
        
print('Missing Mapped Terms - Quality Report:')
print('LLT Term: '+str(c_llt)+', LLT Code: '+str(c_llt_code))
print('PT Term: '+str(c_pt)+', PT Code: '+str(c_pt_code))
print('HLT Term: '+str(c_hlt)+', HLT Code: '+str(c_hlt_code))
print('HLGT Term: '+str(c_hlgt)+', HLGT Code: '+str(c_hlgt_code))
print('SOC Term: '+str(c_soc)+', SOC Code: '+str(c_soc_code))


Missing Mapped Terms - Quality Report:
LLT Term: 84, LLT Code: 84
PT Term: 84, PT Code: 84
HLT Term: 84, HLT Code: 84
HLGT Term: 84, HLGT Code: 84
SOC Term: 84, SOC Code: 84
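
The loop above counts missing values over every row. To restrict the check to rows that could be mapped (quality code below 6), the same counts can be computed with a boolean mask. The sketch below is a minimal illustration on a made-up frame; the column names `LLT_qual`, `LLT_name`, and `LLT_code` are hypothetical stand-ins for the header variables used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the mapping file; column names are assumptions
fd_demo = pd.DataFrame({
    "LLT_qual": [0, 2, 6, 4],
    "LLT_name": ["Asthma", "Migraine", np.nan, "Gout"],
    "LLT_code": [10001, 10002, np.nan, np.nan],
})

mappable = fd_demo["LLT_qual"] < 6            # rows that should be fully mapped
missing = fd_demo.loc[mappable, ["LLT_name", "LLT_code"]].isna().sum()
print(missing)
```

Any non-zero count under this mask would point to a mappable row that still lacks a term or code.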


***
## Survey LLT annotation quality code frequencies (as these are most important)

In [66]:
annote_dict = {}
for each in fd[TL1_qual_code_header]:
    code = str(each)
    if code in annote_dict :               
        annote_dict[code] += 1
    else:
        annote_dict[code] = 1

print("LLT quality code counts:")
print(annote_dict)
len(fd)

LLT quality code counts:
{'2': 15286, '1': 4934, '3': 2182, '6': 84, '0': 5375, '5': 34, '4': 9210}


37105
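
The same tally can be produced with pandas' `value_counts`, which also makes it easy to sort the codes. This is a small sketch on made-up quality codes, not the values from the mapping file.

```python
import pandas as pd

# Made-up quality codes, for illustration only
qual_codes = pd.Series([2, 1, 2, 6, 0, 2, 4])

# Convert to strings (as in the loop above), count, and sort by code
counts = qual_codes.astype(str).value_counts().sort_index()
print("LLT quality code counts:")
print(counts.to_dict())
```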

***
## Survey frequency of branches at each level of the ontology where term branches were possible

In [67]:
# Survey HLT, HLGT, and SOC branches.
# The number of branches is estimated from the string length: each code is 8 digits
# followed by an underscore, and a missing value reads as 'nan' (length 3), giving 0.
for label, column in [("HLT", TL3_B), ("HLGT", TL4_B), ("SOC", TL5_B)]:
    branch_dict = {}
    for each in fd[column]:
        digits = len(str(each))
        code_count = str(int(digits / 8))

        if code_count in branch_dict:
            branch_dict[code_count] += 1
        else:
            branch_dict[code_count] = 1
    print(label + " Branch Counts:")
    print(branch_dict)

HLT Branch Counts:
{'0': 24248, '2': 9494, '3': 2849, '5': 222, '4': 292}
HLGT Branch Counts:
{'0': 36662, '2': 443}
SOC Branch Counts:
{'0': 36445, '2': 660}
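
The counts above infer the number of branches from string length. A possible alternative, assuming each branch string is a run of codes each followed by an underscore (as written by the imputation loops), is to split on the underscore and count the non-empty pieces. The helper name `count_branches` and the example strings below are made up for illustration.

```python
import numpy as np
import pandas as pd

def count_branches(value):
    """Count codes in an underscore-terminated branch string; missing values count as 0."""
    if pd.isna(value):
        return 0
    return len([code for code in str(value).split("_") if code])

# Made-up branch strings, for illustration only
branches = pd.Series(["10000001_10000002_", np.nan, "10000001_10000002_10000003_"])
branch_counts = branches.map(count_branches).value_counts().to_dict()
print(branch_counts)
```

This version does not depend on the code length, so it would keep working if codes of a different width were used.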


## Survey HLT Branch Quality Codes

In [69]:
annote_dict = {}
for each in fd[TL3_BQ]:
    code = str(each)
    if code in annote_dict :               
        annote_dict[code] += 1
    else:
        annote_dict[code] = 1

print("HLT Branch code counts:")
print(annote_dict)

HLT Branch code counts:
{'0.0': 24161, '1.0': 6571, '4.0': 5400, 'nan': 84, '3.0': 410, '2.0': 479}


## Survey HLGT Branch Quality Codes

In [70]:
annote_dict = {}
for each in fd[TL4_BQ]:
    code = str(each)
    if code in annote_dict :               
        annote_dict[code] += 1
    else:
        annote_dict[code] = 1

print("HLGT Branch code counts:")
print(annote_dict)

HLGT Branch code counts:
{'0.0': 36578, '1.0': 284, '2.0': 1, 'nan': 84, '3.0': 4, '4.0': 154}


## Survey SOC Branch Quality Codes

In [71]:
annote_dict = {}
for each in fd[TL5_BQ]:
    code = str(each)
    if code in annote_dict :               
        annote_dict[code] += 1
    else:
        annote_dict[code] = 1

print("SOC Branch code counts:")
print(annote_dict)

SOC Branch code counts:
{'0.0': 36361, '1.0': 417, '4.0': 100, 'nan': 84, '3.0': 5, '2.0': 138}


## Notebook conclusions
In this notebook we performed an initial summary of the working map file with only unique rows.  Then we mapped these unique rows back to the original target dataset that we started with in the first notebook, so that all columns in the original dataset are restored, but our new ontology standardized mapping columns are added. This notebook concludes by generating summary information over the original set of term rows regarding mapping quality scores, the possible branches during imputation, and the quality of branch resolution during imputation. 

This concludes this term harmonization analysis pipeline. We hope others find it to be of use. 