# Term Harmonization - STEP 1 : Data Preparation
#### Author: Ryan Urbanowicz (ryanurb@upenn.edu) 
#### Institution: University of Pennsylvania - Perelman School of Medicine
#### Project: CMREF Data Harmonization 
#### Date: 9/1/21
#### Project Overview: 
This set of Jupyter notebooks has been set up to facilitate the process of term harmonization and make it as replicable as possible. The ultimate goal is to take an existing set of term data from one or more sources that have been concatenated into a single dataset, and map the available terms to an ontological standard. This procedure is aimed at resolving one term data type at a time, e.g. medical history terms or adverse event terms. The output of this procedure is a copy of the original concatenated dataset that adds columns for the harmonized standard terms (at their most specific level), along with any levels of a generalized hierarchy of terms that may be useful for downstream data analysis. Further, we add columns that track the quality of the mapping for each row and at each level of term specificity. 

In these notebooks we lay out and document the aspects of this project that are automated (and thus fully replicable), and we describe and provide instructions for those aspects that required manual, subjective decision making (and thus are not completely replicable). These notebooks have been set up to be as generalizable to other term harmonization tasks as possible; however, throughout these notebooks we focus on the target application outlined below. As a result, any outside party wishing to use these notebooks for their own harmonization tasks should expect that some modification/customization will be required. It is therefore best to view these notebooks as a guide/template for future term harmonization tasks, as well as a record for reproducing the harmonization tasks that we completed for our own target application. Additionally, some of the computational tasks can take a number of hours to complete, so 'in-progress' mapping files in .csv and Excel format are saved and loaded by this code after most steps. 

In general, the materials required to use this notebook as a guide are (1) a set of source terms as rows in a dataset (including supplemental term columns and/or any available hierarchy of more general terms from a relevant ontology) and (2) the reference files of a target ontology standard, e.g. MedDRA. While these notebooks are designed specifically to use the MedDRA v21 ontology of terms, they could be adapted to any ontology that provides a standard terminology as well as links between specific and more general terms as part of the ontology hierarchy (e.g. GO terms/hierarchy). 

#### Target Application:
The specific 'term harmonization task' that we tackled in building these notebooks is to harmonize terms used to describe 'medical history' events (MHTERM) for subjects over a set of 28 separate drug trials (CMREF Project). We use the MedDRA v21 hierarchy of terms as our ontological/terminology standard. We work under the assumption that not all drug trials used the MedDRA standard, and that those that did likely did not all use the same version (v21). The primary goal is to harmonize the available specific 'medical history' term information, such that all possible terms are mapped to MedDRA low level terms (LLT). The secondary goal is to impute values for the more general levels of the MedDRA term hierarchy, i.e. preferred term (PT), higher level term (HLT), higher level group term (HLGT), and system organ class (SOC). In our application, all trials have terms available at a specific, 'patient reported' level, which we focus on in the primary LLT mapping. However, in some trials supplemental term data is available beyond the patient reported terms; we use it when available to provide a greater opportunity to map these term rows to the MedDRA LLT standard.

#### Data Availability:
While the target term data used in this study has not been made available here (for privacy and proprietary reasons) we have included the MedDRA ontology files that we formatted as Excel files. 

#### Code Generalization: 
To keep the code as generalizable (to other projects/applications) as possible, we describe below the labels used in the code and how they correspond to the specific target application described above:

*Format: (General Code Label) = (Application-Specific Label, i.e. column header name from our target dataset)*

Ontology Standard:
* (TL1) - Term Level 1 = (LLT) - lower level term
* (TL2) - Term Level 2 = (PT) - preferred term
* (TL3) - Term Level 3 = (HLT) - higher level term
* (TL4) - Term Level 4 = (HLGT) - higher level group term
* (TL5) - Term Level 5 = (SOC) - system organ class term

Target Dataset:
* (DL1_FT1) - Data Level 1 Focus Term 1  = (MHTERM) - Medical History Term (assumed available for every row)
* (DL1_FT2) - Data Level 1 Focus Term 2  = (LLT_NAME) - The primary alternative term. In this case, the originally attempted mapping to LLT (available for some rows)
* (DL1_FT3) - Data Level 1 Focus Term 3  = (MHMODIFY) - An alternative, 'modified' term available in the data for mapping to the lowest term level (available for some rows)
* (DL2) - Data Level 2 = (PT_NAME) - MedDRA (unknown version) preferred term (available for some rows)
* (DL3) - Data Level 3 = (HLT_NAME) - MedDRA (unknown version) higher level term (available for some rows)
* (DL4) - Data Level 4 = (HLGT_NAME) - MedDRA (unknown version) higher level group term (available for some rows)
* (DL5) - Data Level 5 = (SOC_NAME) - MedDRA (unknown version) system organ class term (available for some rows)
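
The correspondence listed above can be captured in a small lookup table. The following is a minimal sketch and not part of the original pipeline: the dictionary name `GENERAL_TO_APP` and the helper `to_general_labels` are hypothetical names introduced only for illustration, and the one-row data frame is a toy example.

```python
import pandas as pd

# Hypothetical lookup from general code labels to the application-specific
# column headers listed above.
GENERAL_TO_APP = {
    "DL1_FT1": "MHTERM",
    "DL1_FT2": "LLT_NAME",
    "DL1_FT3": "MHMODIFY",
    "DL2": "PT_NAME",
    "DL3": "HLT_NAME",
    "DL4": "HLGT_NAME",
    "DL5": "SOC_NAME",
}


def to_general_labels(df):
    """Return a copy of df with application headers renamed to general labels."""
    app_to_general = {app: gen for gen, app in GENERAL_TO_APP.items()}
    return df.rename(columns=app_to_general)


# Toy one-row frame with illustrative values only.
toy = pd.DataFrame({"MHTERM": ["dyspnea"], "PT_NAME": [None]})
renamed = to_general_labels(toy)
print(list(renamed.columns))
```

Adapting the notebooks to another dataset would then mostly be a matter of editing the right-hand side of such a table.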

#### Notebook Summary:
This notebook covers initial data preparation. This part of the harmonization process may include (1) loading the data, (2) summarizing the data for review, (3) cleaning the data (as needed), (4) formatting the data (as needed for running the remainder of the pipeline), and (5) saving the cleaned, formatted file for the remainder of the harmonization pipeline. The target data is the concatenation of all available term instance rows over all studies to be harmonized. This notebook also loads and provides summary information on the relevant MedDRA ontology files that will serve as our standard. 

#### Dependencies:
We recommend that users have Python 3 installed via the Anaconda distribution (e.g. https://www.datacamp.com/community/tutorials/installing-anaconda-windows). It may be necessary to install some additional packages that are used by one or more of these notebooks. We recommend using 'pip install' to install these packages (e.g. https://packaging.python.org/tutorials/installing-packages/).
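
Before running the notebook, it can help to confirm which packages are actually installed. The sketch below is an optional check, not part of the original notebook. It assumes Python 3.8 or later (for `importlib.metadata`), and it lists `openpyxl` on the assumption that pandas uses it to read `.xlsx` files in your environment; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata

# Packages imported by this notebook, plus openpyxl, which pandas typically
# relies on to read .xlsx files (assumption: your setup may use another engine).
PACKAGES = ["pandas", "numpy", "openpyxl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions(PACKAGES)
for name, version in versions.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```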

## Load Python packages required in this notebook

In [1]:
#Load necessary packages.
import pandas as pd
import numpy as np

# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"



***
## Load Target Data
Our target dataset includes all individual medical history records for all subjects over all studies, organized as rows. Columns are the variables outlined above under "Target Dataset". Some columns in our initial dataset are not useful for the harmonization process, so they will be excluded upfront. This includes "STUDY", the study identifier for our target application, which can be dropped here as it plays no role in the term mapping. There are also columns for old MedDRA codes for each of the 5 term levels in the MedDRA term hierarchy; because we cannot confirm which MedDRA version these codes came from, they will be ignored and replaced with MedDRA v21 term codes during the mapping/imputation process. 

Below we begin by assigning the target-data-specific file and column names to general variables to facilitate adaptation of this notebook to a different target problem. 

### Create general variable names for the target-application-specific names needed across all project notebooks
Upfront we create generally named variables to store all application-specific names needed across all notebooks for this project. Not all of these variables are needed within each notebook, but for ease of notebook adaptation we will specify them all upfront here, and load this cell once towards the beginning of each notebook. 

In [2]:
# Input filename for 'target dataset' (excel file loaded in this application)
target_study_data = 'Combined_MEDHX_TERMS_20studies.xlsx' 

ont_DL1_data = 'LLT.xlsx' # Input filename for ontology file defining all DL1 terms and their codes. 
ont_DL1_name_col = 'llt_name' # column label for DL1 term name
ont_DL1_code_col = 'llt_code' # column label for DL1 term code
ont_DL1_cur_col = 'llt_currency' # column label for term currency
ont_DL2_data = 'PT.xlsx' # Input filename for ontology file defining all DL2 terms and their codes. 
ont_DL3_DL2_data = 'HLT_PT.xlsx' # Input filename for ontology file defining connections between DL2 and DL3 term codes. 
ont_DL3_data = 'HLT.xlsx' # Input filename for ontology file defining all DL3 terms and their codes.
ont_DL4_DL3_data = 'HLGT_HLT.xlsx' # Input filename for ontology file defining connections between DL3 and DL4 term codes. 
ont_DL4_data = 'HLGT.xlsx' # Input filename for ontology file defining all DL4 terms and their codes.
ont_DL5_DL4_data = 'SOC_HLGT.xlsx' # Input filename for ontology file defining connections between DL4 and DL5 term codes. 
ont_DL5_data = 'SOC.xlsx' # Input filename for ontology file defining all DL5 terms and their codes.

DL1_FT1 = 'MHTERM' # focus term 1: This term is available over all studies. 
DL1_FT2 = 'LLT_NAME' # focus term 2: the primary alternative term, available for a subset of studies. This one supposedly conforms to the MedDRA standard so we expect it to yield more exact matches. May offer a better match for the lowest level of the standardized terminology.
DL1_FT3 = 'MHMODIFY' # focus term 3: an alternative, 'modified' term available for a subset of studies. May offer a better match for the lowest level of the standardized terminology.

DL2 = 'PT_NAME' # Secondary level terms (i.e. more general than DL1 terms)
DL3 = 'HLT_NAME' # Tertiary level terms (i.e. more general than DL2 terms)
DL4 = 'HLGT_NAME' # Quaternary level terms (i.e. more general than DL3 terms)
DL5 = 'SOC_NAME' # Quinary Level terms (i.e. more general than DL4 terms)

TL1_qual_code_header = 'LLT_map_code' # column name for lowest term level mapping quality code (added to mapping file)
TL1_name_header = 'T_LLT' # column name for the 'mapped' TL1 - term name (added to mapping file)
TL1_code_header = 'T_LLT_CODE' # column name for the 'mapped' TL1 - term code (added to mapping file)

In [3]:
#Load target Excel file into a pandas data frame
td = pd.read_excel(target_study_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA'
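
Once the data is loaded, it can be useful to confirm that every configured column name is actually present before going further. The following is a minimal sketch using a toy, empty data frame in place of the real target data (which is not distributed); the helper name `missing_columns` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the configured column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Toy stand-in for the loaded target data (the real file is not distributed).
toy_td = pd.DataFrame(columns=["STUDY", "MHTERM", "MHMODIFY", "LLT_NAME", "PT_NAME"])
required = ["MHTERM", "LLT_NAME", "MHMODIFY", "PT_NAME", "HLT_NAME"]

absent = missing_columns(toy_td, required)
print(absent)
```

In the notebook itself, `required` would be the list of general variables defined above (e.g. `[DL1_FT1, DL1_FT2, DL1_FT3, DL2, DL3, DL4, DL5]`), and an empty result would indicate that the configuration matches the file.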

***
## Summarize Target Data
Use pandas functions to orient ourselves to fundamental characteristics of the target dataset.

In [4]:
#View the first few rows of the dataset
td.head()

Unnamed: 0,STUDY,MHTERM,MHCODE,MHMODIFY,LLT_CODE,LLT_NAME,PT_CODE,PT_NAME,HLT_CODE,HLT_NAME,HLGTCODE,HLGT_NAME,SOC_CODE,SOC_NAME,PTSOC_CD
0,amb112565,Right ventricular hypertrophy with strain,,Right ventricular hypertrophy with strain,,,,,,,,,,,
1,amb112565,Systemic Scleroderma Stage 11,,Systemic Scleroderma Stage 11,,,,,,,,,,,
2,amb112565,"T-wave negatve in II, III, aVF, V1-6",,"T-wave negatve in II, III, aVF, V1-6",,,,,,,,,,,
3,amb112565,chronic pericardial effusion,,chronic pericardial effusion,,,,,,,,,,,
4,amb112565,dyspnea,,dyspnea,,,,,,,,,,,


In [5]:
#Report the number of (rows,columns)
td.shape

(37105, 15)

In [6]:
# Reports the column labels, number of rows with values, and the variable type that pandas thinks is in the given column.
td.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37105 entries, 0 to 37104
Data columns (total 15 columns):
STUDY        37105 non-null object
MHTERM       37083 non-null object
MHCODE       5811 non-null float64
MHMODIFY     19140 non-null object
LLT_CODE     16348 non-null float64
LLT_NAME     21523 non-null object
PT_CODE      14114 non-null float64
PT_NAME      19447 non-null object
HLT_CODE     9597 non-null float64
HLT_NAME     20105 non-null object
HLGTCODE     9597 non-null float64
HLGT_NAME    20105 non-null object
SOC_CODE     14114 non-null float64
SOC_NAME     33735 non-null object
PTSOC_CD     3099 non-null float64
dtypes: float64(7), object(8)
memory usage: 4.2+ MB


*APPLICATION NOTE: As we can see, in this target application the dataset includes 37105 rows, but only 37083 have a value for 'MHTERM' (i.e. our DL1_FT1). The 22 rows without a value will be dropped from consideration, given that we assume all rows we want to map will have an entry for this column. These empty rows are artifacts of how the target dataset was assembled.*

In [7]:
#Report the number of unique values in each column.
td.nunique()

STUDY           20
MHTERM       21452
MHCODE        2571
MHMODIFY     10482
LLT_CODE      4004
LLT_NAME      6386
PT_CODE       2414
PT_NAME       2662
HLT_CODE       756
HLT_NAME      1520
HLGTCODE       266
HLGT_NAME      529
SOC_CODE        26
SOC_NAME       139
PTSOC_CD        26
dtype: int64

*APPLICATION NOTE: As we can see, in this target application, the dataset includes 37105 rows, but only 21452 unique 'level 1' terms. There are only 20 studies with Medical History data.*
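
The separate `info()` and `nunique()` reports above can also be combined into a single per-column table. This is an optional sketch on toy data, not part of the original notebook; the helper name `summarize_columns` and the values in `toy` are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """Per-column counts of non-missing, missing, and unique values."""
    return pd.DataFrame({
        "non_null": df.count(),
        "missing": df.isnull().sum(),
        "unique": df.nunique(),   # NaN is not counted as a unique value
    })


# Toy data with illustrative values only.
toy = pd.DataFrame({
    "MHTERM": ["dyspnea", "dyspnea", "toy term b", np.nan],
    "PT_NAME": [np.nan, "toy PT a", np.nan, np.nan],
})
summary = summarize_columns(toy)
print(summary)
```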

***
## Clean Target Data
(1) Select only the columns needed for the term mapping, (2) assess missing values, (3) reduce the dataset to the set of non-redundant rows, i.e. rows that contain a unique set of term entries, and (4) perform any other application-specific data cleaning. We reduce the dataset to just the unique rows to reduce the computational time required for this mapping procedure, as well as to reduce the burden of manual mapping that will be required downstream. 

### Select subset of necessary columns
This step is very specific to the target application. We have manually entered the column headers from the original dataset that we want to keep for the mapping process and the resulting mapping file. In particular, we keep only the three 'level one' focus terms and the columns containing any available more general MedDRA term levels (i.e. DL2, DL3, DL4, DL5).

In [8]:
#Select subset of dataset columns from the entire target dataset.
print("Data dimensions before..")
td.shape
td = td[[DL1_FT1,DL1_FT2,DL1_FT3,DL2,DL3,DL4,DL5]] 
print("Data dimensions after..")
td.shape


Data dimensions before..


(37105, 15)

Data dimensions after..


(37105, 7)

### Missing Value Assessment
(1) Determine if any rows are missing a 'DL1_FT1' value (the assumption is that every row to be mapped will have a value for DL1_FT1; otherwise that row can be thrown out), (2) remove any rows with a missing DL1_FT1 value, (3) adjust row index values (for proper row referencing).

In [9]:
#Confirm number of Level 1 terms - i.e. Level 1 rows without missing values
print("Number of Level 1 terms available (Not missing)")
td[DL1_FT1].count()

#Evaluate missingness and data availability
print("Missing Value Counts")
td.isnull().sum()

Number of Level 1 terms available (Not missing)


37083

Missing Value Counts


MHTERM          22
LLT_NAME     15582
MHMODIFY     17965
PT_NAME      17658
HLT_NAME     17000
HLGT_NAME    17000
SOC_NAME      3370
dtype: int64

*APPLICATION NOTE: Out of 37105 rows, 22 have a DL1_FT1 term that is 'missing', meaning that these rows need to be dropped.*

In [10]:
#Drop rows with missing values in DL1_FT1
td = td.dropna(subset=[DL1_FT1])
print("Number of remaining rows/columns")
td.shape
print("Missing Value Counts")
td.isnull().sum()
print("Number of unique terms/values within each term level")
td.nunique()

Number of remaining rows/columns


(37083, 7)

Missing Value Counts


MHTERM           0
LLT_NAME     15560
MHMODIFY     17943
PT_NAME      17636
HLT_NAME     16978
HLGT_NAME    16978
SOC_NAME      3360
dtype: int64

Number of unique terms/values within each term level


MHTERM       21452
LLT_NAME      6386
MHMODIFY     10482
PT_NAME       2662
HLT_NAME      1520
HLGT_NAME      529
SOC_NAME       139
dtype: int64

*APPLICATION NOTE: Note that many DL1_FT1 terms are repeated across the instances in the studies (21452 unique terms among 37083 rows).*


### Identify set of non-redundant rows. 
To reduce the computational and manual burden of many of the term matching tasks ahead it makes the most sense to begin by reducing the dataset down to a non-redundant set of rows.  Redundancy is defined as a row having the exact same set of present or absent term values as those found in another row.  In this application that includes the values found in the columns: (DL1_FT1, DL1_FT2, DL1_FT3, DL2, DL3, DL4, DL5).

In [11]:
td = td.drop_duplicates(subset=None, keep='first', inplace=False)

In [12]:
#Readjusts the row index values so there are no gaps in the sequence from the row removal (important for indexing later) 
td = td.reset_index(drop=True) 

In [13]:
td.shape
#Report the number of unique values for each variable.
td.nunique()

(28720, 7)

MHTERM       21452
LLT_NAME      6386
MHMODIFY     10482
PT_NAME       2662
HLT_NAME      1520
HLGT_NAME      529
SOC_NAME       139
dtype: int64

*APPLICATION NOTE: After removing redundant rows we have confirmed that the number of unique terms for all columns is maintained, and we are left with a total of 28720 unique rows to be mapped.* 
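
To illustrate the redundancy definition used here, the toy sketch below shows that `drop_duplicates` treats two rows as identical when they have the same present values and the same missing values, and that the per-column unique counts are unchanged by the reduction. The data is invented for illustration only.

```python
import numpy as np
import pandas as pd

# Rows 0 and 1 share the same present value and the same missing value;
# row 2 differs because LLT_NAME is present.
toy = pd.DataFrame({
    "MHTERM":   ["dyspnea", "dyspnea", "dyspnea"],
    "LLT_NAME": [np.nan,    np.nan,    "Dyspnea"],
})

before = toy.nunique()
deduped = toy.drop_duplicates().reset_index(drop=True)
after = deduped.nunique()

print(len(toy), "->", len(deduped))
# Unique counts per column should be identical before and after.
print(before.equals(after))
```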

***
### Save formatted data
Before moving on to the next step of the harmonization pipeline we will save our mapping file 'in progress'. We will save the file such that it includes a new column identifying the unique row index value for each row.  We have found this useful to include early on for quality control purposes, particularly since not all steps are automated, and later, participants in this harmonization process may find it useful to sort the file by different columns to facilitate the mapping process. 

In [14]:
#In this script MH = Medical History for our target application.
td.to_csv("MH_harmonization_map_1.csv", header=True, index=True, index_label='ROW_INDEX') 
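
As a sketch of why the saved `ROW_INDEX` column is convenient, the toy example below writes a small frame the same way, reads it back with `ROW_INDEX` as the index, sorts it by term (as someone doing manual mapping might), and then restores the original order. It uses an in-memory buffer rather than a file, and the data is illustrative only.

```python
import io
import pandas as pd

toy = pd.DataFrame({"MHTERM": ["dyspnea", "chronic pericardial effusion"]})

# Write with the row index stored in a named column, as in the cell above.
buffer = io.StringIO()
toy.to_csv(buffer, header=True, index=True, index_label="ROW_INDEX")

# Read it back, restoring ROW_INDEX as the index.
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col="ROW_INDEX")

# Sorting by a term column reorders rows; sorting by index undoes it.
by_term = reloaded.sort_values("MHTERM")
restored = by_term.sort_index()

print(by_term.index.tolist())
print(restored.index.tolist())
```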

***
## Examine Terminology Standard Files
In this first notebook we also load and summarize basic information on the different ontology files. In this particular application we are using MedDRA v21 as our terminology standard. We have previously prepared Excel-formatted files for the MedDRA term mapping that link terms within each level (i.e. LLT, PT, HLT, HLGT, SOC) to their respective codes, and separate files that link (1) PTs to more general HLTs, (2) HLTs to more general HLGTs, and (3) HLGTs to more general SOCs. The LLT file includes direct mappings to PTs. Note that in this hierarchy of MedDRA terms, every LLT maps directly to a single PT, but PTs and terms at the more general levels can each map to multiple more general terms. In this pipeline we refer to this as *branching*. Furthermore, note that the LLT file includes terms that are active in MedDRA v21 as well as those that are out of date. These are differentiated by the variable 'llt_currency'. In this pipeline we only use MedDRA terms that are included in the current version (i.e. llt_currency == Y). 

Recall the general labels we had previously assigned for the levels of this ontology above. 

* (TL1) - Term Level 1 = (LLT) - lower level term
* (TL2) - Term Level 2 = (PT) - preferred term
* (TL3) - Term Level 3 = (HLT) - higher level term
* (TL4) - Term Level 4 = (HLGT) - higher level group term
* (TL5) - Term Level 5 = (SOC) - system organ class term
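
The effect of branching when moving up the hierarchy can be sketched with toy tables. The codes and names below are made up; only the column names (`llt_code`, `llt_name`, `pt_code`, `hlt_code`) mirror the MedDRA files used in this notebook. This is an illustration of the idea, not the imputation method used later in the pipeline.

```python
import pandas as pd

# Toy tables with made-up codes; column names mirror the MedDRA files.
llt = pd.DataFrame({
    "llt_code": [1, 2],
    "llt_name": ["toy LLT a", "toy LLT b"],
    "pt_code": [10, 10],          # every LLT points to exactly one PT
})
hlt_pt = pd.DataFrame({
    "hlt_code": [100, 101],
    "pt_code": [10, 10],          # PT 10 branches to two HLTs
})

# One row per (LLT, HLT) path; branching multiplies the number of rows.
paths = llt.merge(hlt_pt, on="pt_code", how="left")
print(paths[["llt_code", "pt_code", "hlt_code"]])
```

Because PT 10 links to two HLTs, each of the two toy LLTs ends up with two candidate HLTs, which is why higher levels cannot always be filled in unambiguously from the LLT alone.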

### Load and Summarize Term Level 1 File (MedDRA LLT file)

In [15]:
tl1 = pd.read_excel(ont_DL1_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA'
tl1.shape
tl1.info()
#Count LLTs in MedDRA file - check for missing
tl1[ont_DL1_name_col].count() #column name is application specific.
#Count unique LLTs in MedDRA file
tl1.nunique()

(78808, 11)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78808 entries, 0 to 78807
Data columns (total 11 columns):
llt_code            78808 non-null int64
llt_name            78808 non-null object
pt_code             78808 non-null int64
llt_whoart_code+    0 non-null float64
llt_harts_code+     0 non-null float64
llt_costart_sym+    0 non-null float64
llt_icd9_code+      0 non-null float64
llt_icd9cm_code+    0 non-null float64
llt_icd10_code+     0 non-null float64
llt_currency        78808 non-null object
llt_jart_code+      0 non-null float64
dtypes: float64(7), int64(2), object(2)
memory usage: 6.6+ MB


78808

llt_code            78808
llt_name            78807
pt_code             23088
llt_whoart_code+        0
llt_harts_code+         0
llt_costart_sym+        0
llt_icd9_code+          0
llt_icd9cm_code+        0
llt_icd10_code+         0
llt_currency            2
llt_jart_code+          0
dtype: int64

*APPLICATION NOTE: Confirmed that all 78808 MedDRA LLT codes are unique; the 78807 unique names indicate that one LLT name is shared by two codes. We note that there is an additional Y/N variable, 'llt_currency', that indicates whether the term is currently used in v21 of MedDRA. To ensure our terminology is v21 compliant, we will perform our term matching using only LLTs that are current (i.e. llt_currency = Y).*
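
The summary above reports 78808 unique codes but 78807 unique names, so one name appears under two codes. A sketch for locating such shared names is given below on toy data. The toy rows, including the assumption that one of the two codes is non-current, are invented for illustration and do not reflect the actual MedDRA entries.

```python
import pandas as pd

# Toy LLT table with made-up codes and names.
toy_llt = pd.DataFrame({
    "llt_code": [1, 2, 3],
    "llt_name": ["toy term a", "toy term a", "toy term b"],
    "llt_currency": ["Y", "N", "Y"],
})

# keep=False flags every member of a duplicated group, not just the repeats.
shared = toy_llt[toy_llt.duplicated(subset="llt_name", keep=False)]
print(shared)

# After restricting to current terms, names in this toy table are unique.
current = toy_llt[toy_llt["llt_currency"] == "Y"]
print(current["llt_name"].is_unique)
```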

### Filter for 'current' LLTs

In [16]:
#Filter out any non-current low level terms (LLTs) 
tl1 = tl1.loc[tl1[ont_DL1_cur_col] == 'Y'] #column name is application specific.
#Again determine number of remaining unique LLTs
tl1.nunique()
tl1.shape
#Readjusts the row index values so there are no gaps in the sequence from the row removal (important for indexing later) 
tl1 = tl1.reset_index(drop=True) 

llt_code            69531
llt_name            69531
pt_code             23088
llt_whoart_code+        0
llt_harts_code+         0
llt_costart_sym+        0
llt_icd9_code+          0
llt_icd9cm_code+        0
llt_icd10_code+         0
llt_currency            1
llt_jart_code+          0
dtype: int64

(69531, 11)

*APPLICATION NOTE: A total of 69531 unique and current LLT MedDRA terms/codes observed.*

### Load and Summarize Term Level 2 Files (MedDRA PT files)

In [17]:
tl2 = pd.read_excel(ont_DL2_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA'
tl2.shape
tl2.info()
tl2.nunique()

(23088, 11)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23088 entries, 0 to 23087
Data columns (total 11 columns):
pt_code            23088 non-null int64
pt_name            23088 non-null object
null_field         0 non-null float64
pt_soc_code        23088 non-null int64
pt_whoart_code     0 non-null float64
pt_harts_code      0 non-null float64
pt_costart_sym     0 non-null float64
pt_icd9_code+      0 non-null float64
pt_icd9cm_code+    0 non-null float64
pt_icd10_code+     0 non-null float64
pt_jart_code+      0 non-null float64
dtypes: float64(8), int64(2), object(1)
memory usage: 1.9+ MB


pt_code            23088
pt_name            23088
null_field             0
pt_soc_code           27
pt_whoart_code         0
pt_harts_code          0
pt_costart_sym         0
pt_icd9_code+          0
pt_icd9cm_code+        0
pt_icd10_code+         0
pt_jart_code+          0
dtype: int64

*APPLICATION NOTE: A total of 23088 unique PT MedDRA terms/codes observed.*

### Load and Summarize Term Level 3 Files (MedDRA HLT files)

In [18]:
tl3_tl2 = pd.read_excel(ont_DL3_DL2_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA'
tl3_tl2.shape
tl3_tl2.info()
tl3_tl2.nunique()

(33402, 2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33402 entries, 0 to 33401
Data columns (total 2 columns):
hlt_code    33402 non-null int64
pt_code     33402 non-null int64
dtypes: int64(2)
memory usage: 522.0 KB


hlt_code     1737
pt_code     23088
dtype: int64

In [19]:
tl3 = pd.read_excel(ont_DL3_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA'
tl3.shape
tl3.info()
tl3.nunique()

(1737, 9)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1737 entries, 0 to 1736
Data columns (total 9 columns):
hlt_code            1737 non-null int64
hlt_name            1737 non-null object
hlt_whoart_code+    0 non-null float64
hlt_harts_code      0 non-null float64
hlt_costart_sym+    0 non-null float64
hlt_icd9_code+      0 non-null float64
hlt_icd9cm_code+    0 non-null float64
hlt_icd10_code+     0 non-null float64
hlt_jart_code+      0 non-null float64
dtypes: float64(7), int64(1), object(1)
memory usage: 122.2+ KB


hlt_code            1737
hlt_name            1737
hlt_whoart_code+       0
hlt_harts_code         0
hlt_costart_sym+       0
hlt_icd9_code+         0
hlt_icd9cm_code+       0
hlt_icd10_code+        0
hlt_jart_code+         0
dtype: int64

*APPLICATION NOTE: A total of 1737 unique HLT MedDRA terms/codes observed.*
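
The PT-to-HLT link file summarized above has 33402 rows but only 23088 unique PT codes, which suggests that some PTs link to more than one HLT. A sketch of how this branching could be counted is shown below on a toy link table with made-up codes; only the column names match the actual file.

```python
import pandas as pd

# Toy link table with made-up codes, same columns as the HLT_PT file.
toy_link = pd.DataFrame({
    "hlt_code": [100, 101, 100, 102],
    "pt_code":  [10,  10,  11,  12],
})

# Number of distinct HLT parents per PT; values above 1 indicate branching.
parents_per_pt = toy_link.groupby("pt_code")["hlt_code"].nunique()
branching_pts = parents_per_pt[parents_per_pt > 1]

print(parents_per_pt.to_dict())
print(branching_pts.index.tolist())
```

The same pattern would apply to the HLT-to-HLGT and HLGT-to-SOC link files by swapping the column names.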

### Load and Summarize Term Level 4 Files (MedDRA HLGT files)

In [20]:
tl4_tl3 = pd.read_excel(ont_DL4_DL3_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA'
tl4_tl3.shape
tl4_tl3.info()
tl4_tl3.nunique()

(1755, 2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1755 entries, 0 to 1754
Data columns (total 2 columns):
hlgt_code    1755 non-null int64
hlt_code     1755 non-null int64
dtypes: int64(2)
memory usage: 27.5 KB


hlgt_code     337
hlt_code     1737
dtype: int64

In [21]:
tl4 = pd.read_excel(ont_DL4_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA'
tl4.shape
tl4.info()
tl4.nunique()

(337, 9)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 337 entries, 0 to 336
Data columns (total 9 columns):
hlgt_code            337 non-null int64
hlgt_name            337 non-null object
hlgt_whoart_code+    0 non-null float64
hlgt_harts_code      0 non-null float64
hlgt_costart_sym+    0 non-null float64
hlgt_icd9_code+      0 non-null float64
hlgt_icd9cm_code+    0 non-null float64
hlgt_icd10_code+     0 non-null float64
hlgt_jart_code+      0 non-null float64
dtypes: float64(7), int64(1), object(1)
memory usage: 23.8+ KB


hlgt_code            337
hlgt_name            337
hlgt_whoart_code+      0
hlgt_harts_code        0
hlgt_costart_sym+      0
hlgt_icd9_code+        0
hlgt_icd9cm_code+      0
hlgt_icd10_code+       0
hlgt_jart_code+        0
dtype: int64

*APPLICATION NOTE: A total of 337 unique HLGT MedDRA terms/codes observed.*

### Load and Summarize Term Level 5 Files (MedDRA SOC files)

In [22]:
tl5_tl4 = pd.read_excel(ont_DL5_DL4_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA'
tl5_tl4.shape
tl5_tl4.info()
tl5_tl4.nunique()

(354, 2)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354 entries, 0 to 353
Data columns (total 2 columns):
soc_code     354 non-null int64
hlgt_code    354 non-null int64
dtypes: int64(2)
memory usage: 5.6 KB


soc_code      27
hlgt_code    337
dtype: int64

In [23]:
tl5 = pd.read_excel(ont_DL5_data, na_values=' ') #Data loaded so that blank Excel cells are 'NA'
tl5.shape
tl5.info()
tl5.nunique()

(27, 10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 10 columns):
soc_code            27 non-null int64
soc_name            27 non-null object
soc_abbrev          27 non-null object
soc_whoart_code+    0 non-null float64
soc_harts_code      0 non-null float64
soc_costart_sym+    0 non-null float64
soc_icd9_code+      0 non-null float64
soc_icd9cm_code+    0 non-null float64
soc_icd10_code+     0 non-null float64
soc_jart_code+      0 non-null float64
dtypes: float64(7), int64(1), object(2)
memory usage: 2.2+ KB


soc_code            27
soc_name            27
soc_abbrev          27
soc_whoart_code+     0
soc_harts_code       0
soc_costart_sym+     0
soc_icd9_code+       0
soc_icd9cm_code+     0
soc_icd10_code+      0
soc_jart_code+       0
dtype: int64

*APPLICATION NOTE: A total of 27 unique SOC MedDRA terms/codes observed.*

At this point we have confirmed that we can open these MedDRA standard terminology files and have looked over the summaries to make sure that everything looks in order. 
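
Part of this visual review could be automated. As one possible check (a sketch, not something performed in the original notebook), the snippet below looks for codes that appear in a link table but are absent from the corresponding level table. The helper name `orphan_codes` is hypothetical and the toy codes are made up.

```python
import pandas as pd


def orphan_codes(link, level, code_col):
    """Codes used in a link table that are absent from the level table."""
    return sorted(set(link[code_col]) - set(level[code_col]))


# Toy level and link tables with made-up codes.
toy_hlt = pd.DataFrame({"hlt_code": [100, 101], "hlt_name": ["toy HLT a", "toy HLT b"]})
toy_hlt_pt = pd.DataFrame({"hlt_code": [100, 101, 999], "pt_code": [10, 11, 12]})

orphans = orphan_codes(toy_hlt_pt, toy_hlt, "hlt_code")
print(orphans)
```

With the real files, a call such as `orphan_codes(tl3_tl2, tl3, 'hlt_code')` would be expected to return an empty list if the link and level files are consistent.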